This guide explains how to install and run the llama.cpp server using the ghcr.io/ggml-org/llama.cpp:server Docker image on a CPU-only system. This image is pre-built for running the llama-server executable, optimized for inference via an HTTP API.
Prerequisites
Before starting, ensure your system meets these requirements:
Operating System: Ubuntu 20.04/22.04 (or any Linux with Docker support).
Hardware: Any modern CPU (multi-core recommended).
Memory: At least 16GB RAM (larger models or longer contexts need more).
Storage: 10GB+ free space (for Docker image and models).
Internet: Required to pull the Docker image and download models.
Docker: Installed and running.
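To check your machine against these requirements, you can use standard Linux tools:
nproc        # number of CPU threads
free -h      # installed RAM
df -h ~      # free disk space in your home directory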
Step 1: Install Docker
If Docker is not installed yet, install it with the following steps:
Update package list:
sudo apt update
Install Docker:
sudo apt install -y docker.io
Start and enable Docker:
sudo systemctl start docker
sudo systemctl enable docker
Verify installation:
docker --version
Expected output: Docker version 20.10.12 (or similar).
Add your user to the Docker group (to avoid typing sudo every time):
sudo usermod -aG docker $USER
newgrp docker
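To confirm Docker works without sudo, you can run the standard hello-world test image (it prints a short confirmation message and exits):
docker run --rm hello-world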
Step 2: Prepare Model Directory
The Docker image needs a GGUF model file mounted from your local system.
Create a directory for models:
mkdir -p ~/llama-models
Download a model:
Example: DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf.
Download it manually from Hugging Face and place it in ~/llama-models (one scripted option is shown below).
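As one way to script the download, the huggingface-cli tool can fetch a single GGUF file; the repository name bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF below is an example and may differ from the source you actually use:
pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF \
  DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
  --local-dir ~/llama-models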
Verify:
ls -lh ~/llama-models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf
Expected size: ~4-5 GB.
Step 3: Pull the Docker Image
Get the ghcr.io/ggml-org/llama.cpp:server image from GitHub Container Registry:
Pull the image:
docker pull ghcr.io/ggml-org/llama.cpp:server
Verify:
docker images
Expected output:
REPOSITORY                   TAG      IMAGE ID   CREATED      SIZE
ghcr.io/ggml-org/llama.cpp   server   xxxxxxxx   yyyy-mm-dd   zzzMB
Note: this image is CPU-only by default and does not include CUDA. For GPU support you would use the server-cuda tag, but since this guide targets a CPU-only system we keep the server image. Documentation: llama.cpp/docs/docker.md in the ggml-org/llama.cpp repository on GitHub.
Step 4: Run the Server (CPU-Only)
Run the llama-server in a Docker container:
Basic command:
docker run \
  -v ~/llama-models:/models \
  -p 9000:9000 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
  -t 8 \
  --host 0.0.0.0 \
  --port 9000 \
  -n 64 \
  -c 4096
-v: Mounts your local model directory to container’s /models.
-p: Maps container port 9000 to host port 9000.
-t 8: Uses 8 CPU threads (adjust to your core count; run nproc to check).
--host 0.0.0.0: Allows external access.
-n 64: Limits output to 64 tokens.
-c 4096: Sets context size to 4096 tokens.
Run in background (optional):
docker run -d \
  -v ~/llama-models:/models \
  -p 9000:9000 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
  -t 8 \
  --host 0.0.0.0 \
  --port 9000 \
  -n 64 \
  -c 4096
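If you want the container to restart automatically and be easier to manage, you can also give it a name and a restart policy (standard docker run options; the name llama-server is just an example):
docker run -d --name llama-server --restart unless-stopped \
  -v ~/llama-models:/models \
  -p 9000:9000 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
  -t 8 --host 0.0.0.0 --port 9000 -c 4096
You can then stop and start it with docker stop llama-server and docker start llama-server.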
Check logs:
docker ps                    # find the container ID
docker logs <container_id>
Look for log lines such as "HTTP server is listening" and messages showing the model being loaded.
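As an additional quick check, llama-server exposes a /health endpoint you can poll; it returns a small JSON status once the model has finished loading:
curl http://localhost:9000/health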
Step 5: Test the Server
Send a request to verify it works:
curl -X POST http://localhost:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "DeepSeek-R1-Distill-Llama-8B-Q4_K_M",
"messages": [{"role": "user", "content": "Hi, how’s it going?"}],
"max_tokens": 500,
"temperature": 0.1
}'
Expected: JSON response with generated text.
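Since the endpoint follows the OpenAI chat-completions format, you can extract just the generated text with jq (install it with sudo apt install -y jq if needed); this is a minimal sketch of the same request piped through a filter:
curl -s http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}' \
  | jq -r '.choices[0].message.content'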
Step 6: Optimize for CPU
Adjust Threads (-t):
Run nproc to see how many CPU threads you have (e.g., 16), then set -t 8 or -t 16:
docker run ... -t 16 ...
Context Size (-c):
-c 4096 is enough for most use; for long conversations you can increase it to 8192 (watch RAM usage).
Parallel Slots (-np):
Add -np 4 to handle 4 concurrent requests:
docker run ... -np 4 ...
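Putting these options together, a tuned CPU-only run on a 16-thread machine might look like the sketch below (the thread, context, and slot counts are examples to adjust for your hardware; note that the total context set by -c is shared across the -np slots):
docker run -d \
  -v ~/llama-models:/models \
  -p 9000:9000 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
  -t 16 -c 8192 -np 4 \
  --host 0.0.0.0 --port 9000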
Troubleshooting
Error: "Cannot find model":
Check that the -v mount path and the model filename are correct.
Slow response:
Increase -t up to your CPU's core count.
Use a smaller model (e.g., 7B).
Container exits:
View the container logs to see why it exited (optionally add --verbose to the server arguments for more detail):
docker logs <container_id>
Port conflict:
Change the port: use -p 8080:8080 and --port 8080.
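To see which process is already using the port, one option is the standard ss tool:
sudo ss -ltnp | grep :9000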
Example Workflow
Install Docker:
sudo apt update && sudo apt install -y docker.io
sudo usermod -aG docker $USER && newgrp docker
Pull image:
docker pull ghcr.io/ggml-org/llama.cpp:server
Prepare model:
mkdir -p ~/llama-models
wget <model_url> -O ~/llama-models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf
Run server:
docker run -d -v ~/llama-models:/models -p 9000:9000 ghcr.io/ggml-org/llama.cpp:server -m /models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -t 8 --host 0.0.0.0 --port 9000
Test:
curl -X POST http://localhost:9000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "DeepSeek-R1-Distill-Llama-8B-Q4_K_M", "messages": [{"role": "user", "content": "Hello!"}]}'