LLMs
This guide walks you through starting an LLM server that you can use to host and run large language models on the server, while accessing them from your local machine.
ollama
Installation
Ollama is already installed in the shared directory
/projects/main_compute-AUDIT/apps/
on the server.
Should you for some reason need to install it yourself, follow the manual installation guide on the Ollama GitHub page.
Inference server
To use it, simply load the module file with module load /projects/main_compute-AUDIT/apps/modules/ollama, and then run ollama serve in an interactive Slurm session.
This will start a server on 10.84.10.216:8899 (accessible from any machine connected to the KU-VPN) and store models in a shared cache at /projects/main_compute-AUDIT/data/.ollama/models.
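For example, inside an interactive Slurm session on the GPU node (a minimal sketch, assuming the module file configures the listen address and the shared model cache):
module load /projects/main_compute-AUDIT/apps/modules/ollama
ollama serve  # serves on 10.84.10.216:8899 once the module is loaded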
Call the API
See the Ollama API documentation, use the ollama-sdk, or the openai client.
List local models
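For example, with curl against the shared server (this assumes the server above is running):
curl http://10.84.10.216:8899/api/tags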
Pull a model
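For example, pulling the model used in the chat example below (the request body follows the current Ollama pull API; older versions expect "name" instead of "model"):
curl http://10.84.10.216:8899/api/pull -d '{
  "model": "gemma3:27b"
}'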
Generate a chat completion
curl http://10.84.10.216:8899/api/chat -d '{
  "model": "gemma3:27b",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ]
}'
from openai import OpenAI

client = OpenAI(
    base_url="http://10.84.10.216:8899/v1",
    api_key="ollama",  # required, but unused
)

response = client.chat.completions.create(
    model="gemma3:27b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The LA Dodgers won in 2020."},
        {"role": "user", "content": "Where was it played?"},
    ],
)

print(response.choices[0].message.content)
vLLM
Installation
Install vllm:
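For example, in a Python environment of your own (a plain virtual environment is assumed here; adjust to however you manage environments on the server):
python -m venv .venv && source .venv/bin/activate  # optional: any environment works
pip install vllm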
Inference server
On the server, in an active Slurm session, run the following command to start the inference server with a model from Hugging Face:
vllm serve "allenai/OLMo-7B-0724-Instruct-hf" \ #(1)!
--host=10.84.10.216 \ #(2)!
--port=8899 \ #(3)!
--download-dir=/projects/<project-dir>/data/.cache/huggingface \ #(4)!
--dtype=half #(5)!
- The model name from Hugging Face
- The IP address of the Slurm GPU server
- The port of the Slurm GPU server
- Local cache dir for models; remember to substitute <project-dir> with a specific project, e.g. ainterviewer-AUDIT
- For some models this is needed, since the GPUs on the server are a bit old
Tip
By default, the server is not protected by any authentication.
To add simple authentication to the server, you can generate an api-key and use it when starting the server:
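A minimal sketch, assuming uuidgen is available on the server (any sufficiently random string works just as well):
uuid=$(uuidgen)
echo "$uuid"  # keep this key; clients must send it as a bearer token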
Then, when starting the server, add --api-key=$uuid to the vllm serve command above.
Call the API
Then, you can consume the API through the following endpoint from anywhere, as long as you are connected to the VPN.
From the command line:
# Call the server using curl:
curl -X POST "http://10.84.10.216:8899/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $uuid" \
  --data '{
    "model": "allenai/OLMo-7B-0724-Instruct-hf",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Or in Python, using the openai client:
from openai import OpenAI

client = OpenAI(
    base_url="http://10.84.10.216:8899/v1",
    api_key="token-abc123",  # (1)!
)

completion = client.chat.completions.create(
    model="allenai/OLMo-7B-0724-Instruct-hf",
    messages=[
        {"role": "user", "content": "Why don't scientists trust atoms?"}
    ],
)

print(completion.choices[0].message)
- This value doesn't matter unless you started the server with --api-key; in that case, use the key you specified when starting the server.