Running inference with llama.cpp

Since we built llama.cpp from source, we can now run our models. Different models need different settings, so always check the model card before running one. I also found it handy to run

llama-server --help

and just explore the options one by one.

This blog post is extremely helpful: https://blog.steelph0enix.dev/posts/llama-cpp-guide/, and I recommend reading it before moving forward.

Downloading models

Let's install the Hugging Face CLI. I will use uv to set up a Python 3.11 environment:

uv venv --no-project --python 3.11  ~/python3.11
source ~/python3.11/bin/activate
uv pip install -U "huggingface_hub[cli]"
hf --help

Now you will need to log in to Hugging Face: create an access token on the website and pass it to the hf CLI.
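
Assuming a recent huggingface_hub release (the one that ships the hf command used above), the login flow should look roughly like this; it prompts for the token interactively, and the second command just confirms which account is logged in:

hf auth login
hf auth whoami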

Here are some examples of my model downloads:

hf download lmstudio-community/gpt-oss-120b-GGUF --local-dir="models/gpt-oss-120b" --include='*gpt-oss*gguf'
hf download lmstudio-community/gpt-oss-20b-GGUF --local-dir="models/gpt-oss-20b" --include='*gpt-oss*gguf'
hf download unsloth/GLM-4.5-Air-GGUF --local-dir="models/unsloth/GLM-4.5-Air" --include='*IQ4_XS*gguf'
hf download unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF --local-dir="unsloth/Qwen3-30B-A3B-Thinking-2507" --include='*IQ4_XS*gguf'

Running inference

Here are some example setups for different models, tuned for the best performance I could get given my limitation of a single 5060 Ti 16GB GPU.

GPT-OSS 20B (around 100 tk/s for token generation)

llama-server --device CUDA0 \
  --model ~/models/gpt-oss-20b/gpt-oss-20b-MXFP4.gguf \
  --host 0.0.0.0 \
  --port 8052 \
  --jinja \
  --ctx-size 65536 \
  --threads 10  \
  --threads-batch 10 \
  --batch-size 16384 \
  --ubatch-size 2048 \
  --flash-attn \
  --temp 1.0 \
  --top-p 1.0 \
  --top-k 0 \
  --n-gpu-layers 999 \
  --chat-template-kwargs '{"builtin_tools":["python"], "reasoning_effort":"high"}'
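
Once the server is up, you can sanity-check it through the OpenAI-compatible API. A minimal example, assuming the host and port from the command above (the model field can be omitted because the server hosts a single model):

curl http://localhost:8052/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'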

GPT-OSS 120B (~23-25 tk/s for token generation)

llama-server --device CUDA0 \
  --model ~/models/gpt-oss-120b/gpt-oss-120b-MXFP4-00001-of-00002.gguf \
  --host 0.0.0.0 \
  --port 8052 \
  --jinja \
  --ctx-size 65536 \
  --threads 10 \
  --batch-size 2048 \
  --ubatch-size 2048 \
  --n-cpu-moe 30 \
  --flash-attn \
  --temp 1.0 \
  --top-p 1.0 \
  --top-k 0 \
  --n-gpu-layers 999 \
  --chat-template-kwargs '{"builtin_tools":["python"], "reasoning_effort":"high"}'
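
The --n-cpu-moe value is the one I had to tune by hand: it keeps the MoE expert weights of that many layers on the CPU, and the right number depends on how much VRAM is left after the dense layers and the KV cache. Watching GPU memory while the model loads makes it easy to find the point just short of filling the 16GB:

watch -n 1 nvidia-smi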

Qwen3-30B-A3B-Thinking-2507 (~40-45 tk/s for token generation)

llama-server --device CUDA0 \
  --model ~/unsloth/Qwen3-30B-A3B-Thinking-2507/Qwen3-30B-A3B-Thinking-2507-IQ4_XS.gguf \
  --host 0.0.0.0 \
  --port 8052 \
  --jinja \
  --ctx-size 65536 \
  --threads 10 \
  --n-cpu-moe 22 \
  --flash-attn \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0 \
  --presence-penalty 2 \
  --n-gpu-layers 999
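
Whichever model you run, a quick way to check that the server has finished loading before sending real requests is the /health endpoint; it only returns OK once the model is ready:

curl http://localhost:8052/health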

Adding llama.cpp to systemd

Running the server from the command line is cool, but I want it to start automatically on boot. This part explains how to add it to systemd as a user service.

mkdir -p ~/bin

cat > ~/bin/start-llama.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# --- configuration ---------------------------------------------------------
MODEL="$HOME/models/gpt-oss-20b/gpt-oss-20b-MXFP4.gguf"

# --- launch ---------------------------------------------------------------
exec llama-server \
  --device CUDA0 \
  --model "$MODEL" \
  --host 0.0.0.0 \
  --port 8052 \
  --jinja \
  --ctx-size 65536 \
  --threads 10 \
  --threads-batch 10 \
  --batch-size 16384 \
  --ubatch-size 2048 \
  --flash-attn \
  --temp 1.0 \
  --top-p 1.0 \
  --top-k 0 \
  --n-gpu-layers 999 \
  --chat-template-kwargs '{"builtin_tools":["python"], "reasoning_effort":"high"}'
EOF

chmod +x ~/bin/start-llama.sh

Why a script? systemd doesn't invoke a shell when parsing ExecStart=, so quoting and line breaks are problematic. Running a script lets you keep the command readable, and you can easily edit the flags later.
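
It's worth running the wrapper once by hand before handing it to systemd, to make sure the paths and flags are right; stop it with Ctrl+C once it starts serving:

~/bin/start-llama.sh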

mkdir -p ~/.config/systemd/user

cat > ~/.config/systemd/user/llama-server.service <<'EOF'
[Unit]
Description=Llama Server
After=network-online.target

StartLimitIntervalSec=150
StartLimitBurst=3

[Service]
Type=simple
Environment=PATH=%h/llama.cpp/build/bin:/usr/local/cuda/bin:%h/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
ExecStart=%h/bin/start-llama.sh

Restart=on-failure
RestartSec=30

[Install]
WantedBy=default.target
EOF

You need to include the directory containing llama-server in the PATH set via Environment= here, so the service knows where to find the binary.
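
The PATH above assumes llama.cpp was built under ~/llama.cpp; if your build lives somewhere else, adjust it and double-check that the binary is really there:

ls ~/llama.cpp/build/bin/llama-server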

Tell systemd to re-read its config

systemctl --user daemon-reload

Enable it and start it immediately

systemctl --user enable --now llama-server

Follow the logs

journalctl --user -u llama-server -f
systemctl --user status llama-server -n 50

Restart it

systemctl --user restart llama-server

Running llama-server with Open WebUI

llama-server has a built-in UI, but for more features you can run Open WebUI.

uv venv --no-project --python 3.12 ~/open-webui
source ~/open-webui/bin/activate
uv pip install open-webui
open-webui serve

Once that works, stop it and wrap it in a systemd user service as well:

mkdir -p ~/.config/systemd/user
nano ~/.config/systemd/user/openwebui.service

[Unit]
Description=OpenWebUI minimal systemd wrapper
After=network-online.target
StartLimitIntervalSec=120
StartLimitBurst=3

[Service]
Type=simple
WorkingDirectory=%h/open-webui
ExecStart=/usr/bin/bash -lc 'source $HOME/open-webui/bin/activate && exec open-webui serve'
Restart=on-failure
RestartSec=30

[Install]
WantedBy=default.target

Then reload systemd, enable the service, and manage it the same way as before:

systemctl --user daemon-reload
systemctl --user enable --now openwebui
systemctl --user restart openwebui
journalctl --user -u openwebui -f
systemctl --user status openwebui -n 50
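
Out of the box Open WebUI doesn't know about llama-server. You can add the connection later in its admin settings, or set it through environment variables in the [Service] section of the unit above; the variable names come from Open WebUI's documentation, and the URL assumes the llama-server instance from earlier on port 8052 (the API key is a dummy value, since we didn't start llama-server with --api-key):

Environment=OPENAI_API_BASE_URL=http://127.0.0.1:8052/v1
Environment=OPENAI_API_KEY=none

After editing the unit, run systemctl --user daemon-reload and restart openwebui as above.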