Running inference with llama.cpp
Author: Illia Vasylevskyi
Since we built llama.cpp from source, we can now run our models.
Different models need different settings, so always check the model card before running one.
Also, I found it handy to run
llama-server --help
and just explore the options one by one.
This blog post is extremely helpful: https://blog.steelph0enix.dev/posts/llama-cpp-guide/ and I recommend reading it before moving forward.
Downloading models
Let's install huggingface-cli. I will use uv to get Python 3.11:
uv venv --no-project --python 3.11 ~/python3.11
source ~/python3.11/bin/activate
uv pip install -U "huggingface_hub[cli]"
hf --help
Now you will need to log in to Hugging Face, create an access token, and add it to the hf CLI.
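With a recent hf CLI you can do this interactively; it will prompt for the token (treat the exact subcommand as an assumption if your version differs):
hf auth login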
Here are some examples of my model downloads:
hf download lmstudio-community/gpt-oss-120b-GGUF --local-dir="models/gpt-oss-120b" --include='*gpt-oss*gguf'
hf download lmstudio-community/gpt-oss-20b-GGUF --local-dir="models/gpt-oss-20b" --include='*gpt-oss*gguf'
hf download unsloth/GLM-4.5-Air-GGUF --local-dir="models/unsloth/GLM-4.5-Air" --include='*IQ4_XS*gguf'
hf download unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF --local-dir="unsloth/Qwen3-30B-A3B-Thinking-2507" --include='*IQ4_XS*gguf'
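Note that the --local-dir paths above are relative, so run the downloads from your home directory if you want them to line up with the ~/models/... and ~/unsloth/... paths used in the server commands below. A quick check that the GGUF files landed where expected:
ls -lh ~/models/gpt-oss-20b/ ~/models/gpt-oss-120b/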
Running inference
Here are some example setups for different models, tuned for the best performance I could get given the limitation of a single 5060 Ti 16 GB GPU.
GPT-OSS 20B (around 100 tk/s for token generation)
llama-server --device CUDA0 \
--model ~/models/gpt-oss-20b/gpt-oss-20b-MXFP4.gguf \
--host 0.0.0.0 \
--port 8052 \
--jinja \
--ctx-size 65536 \
--threads 10 \
--threads-batch 10 \
--batch-size 16384 \
--ubatch-size 2048 \
--flash-attn \
--temp 1.0 \
--top-p 1.0 \
--top-k 0 \
--n-gpu-layers 999 \
--chat-template-kwargs '{"builtin_tools":["python"], "reasoning_effort":"high"}'
GPT-OSS 120B (~23-25 tk/s for token generation)
llama-server --device CUDA0 \
--model ~/models/gpt-oss-120b/gpt-oss-120b-MXFP4-00001-of-00002.gguf \
--host 0.0.0.0 \
--port 8052 \
--jinja \
--ctx-size 65536 \
--threads 10 \
--batch-size 2048 \
--ubatch-size 2048 \
--n-cpu-moe 30 \
--flash-attn \
--temp 1.0 \
--top-p 1.0 \
--top-k 0 \
--n-gpu-layers 999 \
--chat-template-kwargs '{"builtin_tools":["python"], "reasoning_effort":"high"}'
Qwen3-30B-A3B-Thinking-2507 (~40-45 tk/s for token generation)
llama-server --device CUDA0 \
--model ~/unsloth/Qwen3-30B-A3B-Thinking-2507/Qwen3-30B-A3B-Thinking-2507-IQ4_XS.gguf \
--host 0.0.0.0 \
--port 8052 \
--jinja \
--ctx-size 65536 \
--threads 10 \
--n-cpu-moe 22 \
--flash-attn \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0 \
--presence-penalty 2 \
--n-gpu-layers 999
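Once a server is up, you can sanity-check the OpenAI-compatible endpoint from another terminal (the port matches the examples above):
curl http://localhost:8052/health
curl http://localhost:8052/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello in one sentence."}]}'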
Adding llama.cpp to systemd
Running the server from the command line is cool, but I want it to start automatically on boot. This part explains how to add it to systemd as a user service.
mkdir -p ~/bin
cat > ~/bin/start-llama.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# --- configuration ---------------------------------------------------------
MODEL="$HOME/models/gpt-oss-20b/gpt-oss-20b-MXFP4.gguf"
# --- launch ---------------------------------------------------------------
exec llama-server \
--device CUDA0 \
--model "$MODEL" \
--host 0.0.0.0 \
--port 8052 \
--jinja \
--ctx-size 65536 \
--threads 10 \
--threads-batch 10 \
--batch-size 16384 \
--ubatch-size 2048 \
--flash-attn \
--temp 1.0 \
--top-p 1.0 \
--top-k 0 \
--n-gpu-layers 999 \
--chat-template-kwargs '{"builtin_tools":["python"], "reasoning_effort":"high"}'
EOF
chmod +x ~/bin/start-llama.sh
Why a script?
systemd doesn't use a shell when parsing ExecStart=, so quoting and line breaks are problematic.
Running a script lets you keep the command readable, and you can easily edit the flags later.
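Before wiring it into systemd, you can run the script directly to make sure it starts cleanly (Ctrl+C to stop):
~/bin/start-llama.sh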
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/llama-server.service <<'EOF'
[Unit]
Description=Llama Server
After=network-online.target
StartLimitIntervalSec=150
StartLimitBurst=3
[Service]
Type=simple
Environment=PATH=%h/llama.cpp/build/bin:/usr/local/cuda/bin:%h/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
ExecStart=%h/bin/start-llama.sh
Restart=on-failure
RestartSec=30
[Install]
WantedBy=default.target
EOF
You need to set your PATH in Environment= here so the service knows where to find llama-server.
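The PATH above assumes the binaries ended up in ~/llama.cpp/build/bin, which matches the build-from-source layout used earlier; if your build directory differs, check where the binary actually lives and adjust the unit:
command -v llama-server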
Tell systemd to re‑read its config
systemctl --user daemon-reload
Enable it and start it immediately
systemctl --user enable --now llama-server
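User services only run while you have an active session by default; if you want llama-server to come up at boot without logging in, enable lingering for your user:
loginctl enable-linger $USER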
Follow the logs
journalctl --user -u llama-server -f
Check the status with the last 50 log lines
systemctl --user status llama-server -n 50
Restart it
systemctl --user restart llama-server
Running llama-server with Open WebUI
llama-server has a built-in UI, but for more features you can run Open WebUI.
uv venv --no-project --python 3.12 ~/open-webui
source ~/open-webui/bin/activate
uv pip install open-webui
open-webui serve
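By default Open WebUI serves its UI on port 8080. To point it at the llama-server OpenAI-compatible endpoint, you can add a connection in its admin settings, or set environment variables before starting it (the variable names below follow Open WebUI's OpenAI connection settings; treat the exact values as a sketch for this setup):
OPENAI_API_BASE_URL=http://localhost:8052/v1 OPENAI_API_KEY=none open-webui serve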
mkdir -p ~/.config/systemd/user
nano ~/.config/systemd/user/openwebui.service
Paste the following unit:
[Unit]
Description=OpenWebUI minimal systemd wrapper
After=network-online.target
StartLimitIntervalSec=120
StartLimitBurst=3
[Service]
Type=simple
WorkingDirectory=/home/%u/open-webui
ExecStart=/usr/bin/bash -lc 'source $HOME/open-webui/bin/activate && exec open-webui serve'
Restart=on-failure
RestartSec=30
[Install]
WantedBy=default.target
systemctl --user daemon-reload
systemctl --user enable --now openwebui
systemctl --user restart openwebui
journalctl --user -u openwebui -f
systemctl --user status openwebui -n 50