Exploring Llama2 on a CPU-Only VM

Zhimin Wen

Since Meta released the open-source large language model Llama2, the efforts of the community have largely removed the barrier for developers and ordinary users to access an LLM. This is the so-called democratisation of LLMs.

Let's explore running an LLM on a KVM-based VM.

The VM

Create a VM on a host machine with a 2.3 GHz AMD processor (no GPU available), allocating 24 vCPUs and 64 GB of memory, and install Ubuntu 22.04 (Jammy Jellyfish).
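
If you are provisioning the KVM guest from the command line, virt-install can create it along these lines. This is only a sketch; the VM name, disk size, and ISO file name below are illustrative, not from the original setup.

virt-install \
  --name llama2-vm \
  --vcpus 24 \
  --memory 65536 \
  --disk size=100 \
  --graphics none \
  --os-variant ubuntu22.04 \
  --cdrom ubuntu-22.04-live-server-amd64.iso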

Install the base build tools:

sudo apt update -y
sudo apt install -y build-essential
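
llama.cpp leans heavily on SIMD instructions (AVX2, FMA) for CPU inference, so it is worth confirming the vCPUs expose them. This check is an extra step, not part of the original walkthrough:

grep -m1 -o -E 'avx2|fma|f16c' /proc/cpuinfo | sort -u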

The Engine to Run LLM

Thanks to the open-source community, especially the llama.cpp project, it's now possible to run a quantised LLM model (in the latest GGUF format) even without a GPU.

Let’s build it.

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
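
With this many cores available, the build can optionally be parallelised (a small tweak, not in the original post):

make -j$(nproc)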

A sample program, main, is also built so you can test the LLM as a command-line tool. Let's install it at the system level:

sudo cp main /usr/local/bin/llm
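
As a quick sanity check that the binary is on the path and runs, print its usage text:

llm --help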

The LLM Model

We have the command-line engine; where do we get the quantised model files? TheBloke, "purveyor of fine local LLMs for your fun and profit", has provided many GGUF models on huggingface.co.

Let's download a top performer, the WizardCoder model based on Code Llama, and go a bit greedily for the 34B variant.

mkdir models
cd models
curl -LO https://huggingface.co/TheBloke/WizardCoder-Python-34B-V1.0-GGUF/resolve/main/wizardcoder-python-34b-v1.0.Q5_K_M.gguf

It will take a while, as the file is huge: about 23 GB. The model also has to fit in memory at inference time, which our 64 GB of RAM comfortably covers.
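
Once the transfer finishes, confirm the file arrived at the expected size (a simple check added here for safety):

ls -lh ~/models/wizardcoder-python-34b-v1.0.Q5_K_M.gguf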

Test it with Command Line

Run the model with the following command:

llm -m ~/models/wizardcoder-python-34b-v1.0.Q5_K_M.gguf \
  --color \
  -c 2048 \
  -i -ins \
  -t 24

We run the model interactively in instruct mode (-i -ins) with a 2048-token context (-c 2048), using all 24 CPU cores (-t 24). Ask it to write some Python code:

> Write a python code to calculate Fibonacci numbers
To calculate the fibonacci sequence, we can use recursion. The nth Fibonacci…
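
Beyond interactive mode, the same binary can answer a single prompt and exit, which is handy for scripting. A minimal sketch, where the prompt text and token limit (-n) are illustrative choices:

llm -m ~/models/wizardcoder-python-34b-v1.0.Q5_K_M.gguf \
  -t 24 -c 2048 -n 256 \
  -p "Write a Python function that checks whether a number is prime."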