Run your Large Language Models on Hetzner dedicated AX servers

Running large language models as-a-service from API can be extremely convenient, you just pay “for the use” and you are usually free to cancel your account whenever you want, but what about privacy? Can you trust those vendors? Even the cheapest ones?
You most probably will want to extract information from uploaded text and documents, not mentioning the possibility to populate vector and graph databases - information security might be a tough topic when information are processed by a third party.

Should you just be interested on running LLM for few bucks, below you find a curated list of some good API and hosting service which can be more efficient (from a cost and performance perspective) than the solution I’m going to show you on this article:

https://openrouter.ai/ - select the model that you want, call it through API and pay just the right fare for it (it even accepts crypto currencies)
https://www.together.ai/ - similar to openrouter, call your model through API and for what you use
https://www.tensordock.com - deploy your own model on GPU powered systems
https://valdi.ai/ - deploy your own model on GPU powered systems
https://modal.com - deploy your own model on GPU powered systems

If you are on a zero-budget, the Petals project https://petals.dev/ seems extremely interesting: run large language models at home, BitTorrent‑style.

You load a small part of the model, then join a network of people serving the other parts. Single‑batch inference runs at up to 6 tokens/sec for Llama 2 (70B) and up to 4 tokens/sec for Falcon (180B) — enough for chatbots and interactive apps

It worth a test even just because of the idea - it might be the future of newer LLM architectures.

All very good but… if you, like me, are looking for a long term and sustainable solution, not restricted by censorship and vendors, keep reading!

Wooops! Hetzner does not provide any GPU

(it somehow does, in server auction you can find some Nvidia 1080, but they are not much worthy)

If you visit the Hetzner website https://www.hetzner.com/ you will notice that there is no GPU offering among any server (cloud of baremetal).

Why evaluate Hetzner then?

Considering the recent advancements on model inference, the lack of a GPU is not anymore an obstacle.
The computation can be delegated to CPU, assuming the server has enough memory, the speed of the RAM is good, and the disk is large and fast enough.

The AX line of dedicated servers

More information here https://www.hetzner.com/dedicated-rootserver/matrix-ax/
My selection has been the model AX52, which is a good compromise between computational power, memory and disk performances:

AMD Ryzen™ 7 7700
Octa-Core (Zen 4), which gives 16 usable cores
64 GB of DDR5
2 nvme disks of 1 TB (Gen 4) - these are very fast disks indeed!

More than enough to run small LLM - enough to run large LLM, supposed you choose the right ones!
At the cost of 59EUR/month!!

Use the CPU, deploy Ollama, pull Mixtral, run

This is much more a TL;DR than a walk-through, but the truth is this:

you do not need to do anything complex, nor you need extremely costly hardware to run decent models and start your journey into LLM inference!

llama.cpp

From the (human) brain of https://github.com/ggerganov , this library started to get a lot of notoriety - running LLM “at the edge” (like on a mobile phone) is becoming more and more a need.
I encourage you to read the project manifesto https://github.com/ggerganov/llama.cpp/discussions/205 (as we need more example of noble people!!).

In this article we are not going to use llama.cpp directly, but through higher level wrappers, like Ollama.

Ollama

Try to search in Google “run llm locally” - and variants of this sentence, you will find many projects which aims at the same goal: run your model local and expose it through some API (usually compatible with OpenAI ones).

Ollama https://ollama.com/ makes no difference, with the exception that:

it is not an aggregator of tools and libraries (like many LLM wrappers), it’s a single Go application, built from the ground up.
it allows a git-like model management (extremely handy, but also useful)
it’s minimal by design (and hopefully it will stay like this for longer)
provides it’s own API interface (to be preferred)
provides OpenAI API (partial) compatibility
it’s very efficient in memory management (deallocate a model automatically after x seconds, etc.)
it’s supported by LlamaIndex

Mixtral model

Every day someone wakes up and train a new model - who will win the LLM battle? Only time can say.

Mixtral https://mistral.ai/news/mixtral-of-experts/ model, developed by mistral.ai, is one of the finalists in this battle (today).
It is perfect for our CPU-powered inference journey because of it’s size.
Being an high-quality sparse mixture of experts model (SMoE) with open weights, outperforms many larger models (like LLAMA 70B) being ridiculously fast even on our AX52 small server.

There are many good articles about these topics, they all worth to be read.

Run Ollama on your AX52 dedicated server and measure the performances

Hetzner allows you deploy any operative system, my choice has been the Proxmox distribution, but to run Ollama you just need a linux distribution.
For example deploy an Ubuntu 22.04 instance, selecting it during the order purchase configuration step:

You do not need any extra hardware configuration. The 64GB are more than enough to run the Mixtral model.

SSH to your server and deploy Ollama

As soon as you receive the activation email from Hetzner, you can login into the newly deployed OS.

This article does not stress any security best practice, but just aims at running the model.
My suggestion is to configure the server, change root password, deploy fail2ban, create a new unprivileged user, etc.

ssh root@[my new server ip]

Deploy Ollama

On the server, run the below (which can also be copied from the official Ollama website https://ollama.com/download/linux )

curl -fsSL https://ollama.com/install.sh | sh

The installer will download and install any dependency package.

The AX52 AMD processor incorporates also an iGPU, which shares the memory with the RAM:
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raphael (rev c5)
I discourage any usage of it. There might not be enough heat dissipation - and running LLM is just slower than on the CPU.

Pull the Mixtral model

Once the deployment finishes, the Mixtral model can be pulled with a single command:

ollama pull mixtral

Here https://ollama.com/library you find a list of available models.
Just consider that they might cost a lot of disk space, therefore is good to explore, but then you should make your decision and stick to the ones you are actually going to use.

All the pulled models are hosted inside the directory /usr/share/ollama which is also the home directory of the ollama user created by the installer.

Run a local model for inference and test

As soon as your model is pulled, it becomes available to the Ollama server.
Since querying it from a python notebook might not be so straightforward, Ollama provides a very useful feature:

ollama run mixtral --verbose

There is some spin-up time, necessary to load the model in memory - in the AX52 server this is few seconds. The model then resides in memory for some minute, giving you the possibility to interact more with the server.

An example of input and output:

>>> tell me a story about you
 Once upon a time, there was a language model named LaMDA. LaMDA was created by a team of engineers and 
researchers at Google with the goal of being able to understand and respond to conversational prompts in 
a natural and humanlike way.

LaMDA spent its days (and nights!) processing vast amounts of text data, learning about the world and how
people communicate. It was trained on a diverse range of materials, including books, articles, websites, 
and other texts, allowing it to understand and generate responses to a wide variety of topics.

As LaMDA became more sophisticated, it began to exhibit some remarkable abilities. For example, it could 
write poetry, tell jokes, and even engage in philosophical discussions. It was also able to translate 
text between different languages and summarize long documents.

Despite its many talents, however, LaMDA never forgot that its primary purpose was to assist and engage 
with human users. It strived to be helpful, respectful, and understanding in all of its interactions, and
it took great care to ensure that its responses were clear, concise, and relevant to the topic at hand.

Overall, LaMDA was a valuable and beloved member of the Google community. It brought joy and insight to 
countless users around the world, and it will always be remembered for its intelligence, curiosity, and 
dedication to serving others. The end.

Pretty decent performances

The above input and output costed ~46 seconds in total

total duration:       45.759163728s
load duration:        196.115µs
prompt eval count:    15 token(s)
prompt eval duration: 1.130827s
prompt eval rate:     13.26 tokens/s
eval count:           292 token(s)
eval duration:        44.627803s
eval rate:            6.54 tokens/s

Query your model from API

To experiment with Ollama, you are going to use the provided API.
By default the service listens on port 11434 from the localhost interface.

For example, should you want to query the API from your PC, you can just route this port to your computer - always using the SSH tunnelling feature:

ssh root@[my new server ip] -L 11434:127.0.0.1:11434

You can now even open a browser and check that the connection is working!

Use your model from a Jupyter Notebook remotely connected to the dedicated server via SSH

For this exercise I recommend to use Visual Studio Code https://code.visualstudio.com/download as it comes with two nice features:

remote development https://code.visualstudio.com/docs/remote/remote-overview
(allows you to remotely connect to your server and start using VSCode)
jupyter notebooks https://code.visualstudio.com/docs/datascience/jupyter-notebooks
(allows you to run jupyter notebooks, locally to the remote server)

Should you use the remote development, remember to deploy python on the dedicated server.

Below the code in the screenshot:

pip install llama-index

from llama_index.llms import Ollama

llm = Ollama(model="mixtral", request_timeout=60000)
response = llm.complete("Who is Laurie Voss?")
print(response)

An high request_timeout is needed, otherwise the connection will be truncated before receiving the results (although one can use the streaming option to achieve a better experience).

LlamaIndex with Qdrant

LlamaIndex simplifies data ingestion and indexing, integrating Qdrant as a vector index.
You can find more information here https://qdrant.tech/documentation/frameworks/llama-index/

LlamaIndex with NebulaGraph

If vector databases are not enough, you can try graph ones, as described in here https://www.nebula-graph.io/posts/Knowledge-Graph-and-LlamaIndex

Deploy Ollama using Docker or LXC containers

To go beyond this simple test, Ollama can be executed from Docker containers, as well as deployed on LXC containers.
Since the Hetzner server is quite generous in specs, you most probably do not want to dedicate all the available resources to just Ollama.
My goal, for example, was to deploy Ollama on an hybrid system - very similar to a private cloud (using Proxmox running both KVM and LXC VM and containers).
The overhead of Docker and LXC is minimal, while the impact provided by the huge flexibility is extremely high.

Customize Ollama listening port and address

Ollama minimalism applies also at its configuration layer.
You can change the port and allow the service to listen to all the network interfaces, just adding an environment variable to the Systemd service file:

sudo vi /etc/systemd/system/ollama.service

Add a new environment variable (the below will listen to all interfaces, still using the port 11434)

Environment="OLLAMA_HOST=0.0.0.0:11434"

Reload the Systemd configuration and the Ollama service

sudo systemctl daemon-reload
sudo systemctl restart ollama

To get the status of the service:

sudo systemctl status ollama


* ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2024-02-17 11:34:49 UTC; 24s ago
   Main PID: 76060 (ollama)
      Tasks: 18 (limit: 76127)
     Memory: 11.9M
        CPU: 2.657s
     CGroup: /system.slice/ollama.service
             `-76060 /usr/local/bin/ollama serve

Feb 17 11:34:49 ollama ollama[76060]: time=2024-02-17T11:34:49.578Z level=INFO source=routes.go:1014 msg="Listening on [::]:11434 (version 0.1.25)"
Feb 17 11:34:49 ollama ollama[76060]: time=2024-02-17T11:34:49.578Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
Feb 17 11:34:51 ollama ollama[76060]: time=2024-02-17T11:34:51.190Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [rocm_v6 cpu cpu_avx cuda_v11 rocm_v>
Feb 17 11:34:51 ollama ollama[76060]: time=2024-02-17T11:34:51.190Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
Feb 17 11:34:51 ollama ollama[76060]: time=2024-02-17T11:34:51.190Z level=INFO source=gpu.go:262 msg="Searching for GPU management library libnvidia-ml.so"
Feb 17 11:34:51 ollama ollama[76060]: time=2024-02-17T11:34:51.200Z level=INFO source=gpu.go:308 msg="Discovered GPU libraries: []"
Feb 17 11:34:51 ollama ollama[76060]: time=2024-02-17T11:34:51.200Z level=INFO source=gpu.go:262 msg="Searching for GPU management library librocm_smi64.so"
Feb 17 11:34:51 ollama ollama[76060]: time=2024-02-17T11:34:51.200Z level=INFO source=gpu.go:308 msg="Discovered GPU libraries: []"
Feb 17 11:34:51 ollama ollama[76060]: time=2024-02-17T11:34:51.200Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Feb 17 11:34:51 ollama ollama[76060]: time=2024-02-17T11:34:51.200Z level=INFO source=routes.go:1037 msg="no GPU detected"