Building llama.cpp for local inference

AI models are getting smaller and are now broadly available to run on consumer-grade hardware. You can run Ollama or LMStudio for easy model testing and integrations; for example, you can connect them to the JetBrains AI Assistant.

Tools like LMStudio and Ollama are great, but the former is not open source and the latter makes some odd architectural decisions. Besides, both of them run llama.cpp under the hood, so why not run it directly?

To squeeze out some performance and get a closer look at how LLM inference actually runs, I decided to try out llama.cpp.

Building llama.cpp

First, we need to clone the repository:

cd ~
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

I'm running Linux with an Nvidia GPU, so I will be building with CUDA.

More detailed build instructions can be found in the llama.cpp documentation.

First, you need to install the CUDA Toolkit; get it from the Nvidia website.

WARNING: please choose the right CUDA version for your driver version. Check your driver version first by running nvidia-smi, then check the compatibility table on Nvidia's site.
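
If you only want the driver version, nvidia-smi can print it directly (the header of the plain nvidia-smi output also shows the highest CUDA version the driver supports):

nvidia-smi --query-gpu=driver_version --format=csv,noheader
# prints something like 575.64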

CRUCIAL WARNING: please do back up your system before messing with your drivers; don't be like me =)

If you need an older version, the CUDA Toolkit archive on Nvidia's site has previous releases.

For example, I have the 575 driver series, so I will install CUDA 12.9.1.
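
After installation, the toolkit typically lands under /usr/local/, with a /usr/local/cuda symlink pointing at the versioned directory; that symlink is the path referenced below:

ls -d /usr/local/cuda*
# e.g. /usr/local/cuda  /usr/local/cuda-12.9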

For some reason nvcc picks the wrong GPU architecture, so you need to set it manually; choose the one that matches your GPU.
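
For example, assuming an RTX 30-series card (compute capability 8.6), you can pin the architecture through the standard CMake variable when configuring the build below; substitute the value for your own GPU:

# look up your GPU's compute capability on Nvidia's site, then add e.g.:
# -DCMAKE_CUDA_ARCHITECTURES=86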

Also remember to add nvcc to your PATH; I like to keep this in ~/.bashrc:

export PATH="/usr/local/cuda/bin:$PATH"
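
After reloading the shell (source ~/.bashrc or open a new terminal), confirm that the right nvcc is picked up:

which nvcc      # should point to /usr/local/cuda/bin/nvcc
nvcc --version  # should report the CUDA release you just installed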

If you have issues with CURL during the build, try installing it:

sudo apt install libcurl4-openssl-dev pkg-config libssl-dev
# add -DCURL_LIBRARY=/usr/lib/x86_64-linux-gnu/libcurl.so

Remember to add the -j flag to build in parallel.

cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_F16=ON -DCURL_LIBRARY=/usr/lib/x86_64-linux-gnu/libcurl.so
cmake --build build --config Release -j 20

-DGGML_CUDA_FA_ALL_QUANTS=ON makes compilation slower, but it lets us use the quantized KV-cache types with FlashAttention later.
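
Once the build finishes, the binaries end up in build/bin; a quick sanity check:

ls build/bin | grep llama
./build/bin/llama-cli --version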

Now you can add build/bin to your PATH and run llama-* commands.

export PATH="$HOME/llama.cpp/build/bin:$PATH"
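
As a quick smoke test, here is a minimal sketch of running a prompt against a local GGUF model; the model path is a placeholder, point it at whatever model you have downloaded:

# -m: path to a GGUF model, -ngl 99: offload all layers to the GPU,
# -p: prompt, -n: number of tokens to generate
llama-cli -m ~/models/your-model.gguf -ngl 99 -p "Hello, my name is" -n 64

# or serve an OpenAI-compatible API instead:
llama-server -m ~/models/your-model.gguf -ngl 99 --port 8080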

I found this blog post extremely helpful: https://blog.steelph0enix.dev/posts/llama-cpp-guide/ and recommend reading it before moving forward.