This guide explains how to install and run the llama.cpp server using the ghcr.io/ggml-org/llama.cpp:server Docker image on a CPU-only system. This image is pre-built for running the llama-server executable, optimized for inference via an HTTP API.
Prerequisites
Before starting, ensure your system meets these requirements:
Operating System: Ubuntu 20.04/22.04 (or any Linux with Docker support).
Hardware: Any modern CPU (multi-core recommended).
Memory: At least 16GB RAM (larger models or longer contexts need more).
Storage: 10GB+ free space (for Docker image and models).
Internet: Required to pull the Docker image and download models.
Docker: Installed and running.
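To check your machine against these requirements, you can use standard Linux tools:
nproc        # number of CPU threads
free -h      # installed RAM
df -h ~      # free disk space in your home directory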
Step 1: Install Docker
If Docker is not installed yet, install it with the following steps:
Update package list:
sudo apt update
Install Docker:
sudo apt install -y docker.io
Start and enable Docker:
sudo systemctl start docker
sudo systemctl enable docker
Verify installation:
docker --version
Expected output: Docker version 20.10.12 (or similar).
Add your user to the Docker group (to avoid typing sudo every time):
sudo usermod -aG docker $USER
newgrp docker
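To confirm Docker works without sudo, you can run the standard hello-world test image (it prints a short confirmation message and exits):
docker run --rm hello-world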
Step 2: Prepare Model Directory
The Docker image needs a GGUF model file mounted from your local system.
Create a directory for models:
mkdir -p ~/llama-models
Download a model:
Example: DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf.
Download it manually from Hugging Face and place it in ~/llama-models (one scripted option is shown below).
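As one way to script the download, the huggingface-cli tool can fetch a single GGUF file; the repository name bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF below is an example and may differ from the source you actually use:
pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF \
  DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
  --local-dir ~/llama-models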
Verify:
ls -lh ~/llama-models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf
Expected size: ~4-5 GB.
Step 3: Pull the Docker Image
Get the ghcr.io/ggml-org/llama.cpp:server image from GitHub Container Registry:
Pull the image:
docker pull ghcr.io/ggml-org/llama.cpp:server
Verify:
docker images
Expected output:
REPOSITORY                   TAG      IMAGE ID   CREATED      SIZE
ghcr.io/ggml-org/llama.cpp   server   xxxxxxxx   yyyy-mm-dd   zzzMB
Note: this image is CPU-only by default and does not include CUDA. For GPU support you would use the server-cuda tag, but since this guide targets a CPU-only system we keep the server image. Documentation: llama.cpp/docs/docker.md in the ggml-org/llama.cpp repository on GitHub.
Step 4: Run the Server (CPU-Only)
Run the llama-server in a Docker container:
Basic command:
docker run \
  -v ~/llama-models:/models \
  -p 9000:9000 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
  -t 8 \
  --host 0.0.0.0 \
  --port 9000 \
  -n 64 \
  -c 4096
-v: Mounts your local model directory to container’s /models.
-p: Maps container port 9000 to host port 9000.
-t 8: Uses 8 CPU threads (adjust to your core count; run nproc to check).
--host 0.0.0.0: Allows external access.
-n 64: Limits output to 64 tokens.
-c 4096: Sets context size to 4096 tokens.
Run in background (optional):
docker run -d \
  -v ~/llama-models:/models \
  -p 9000:9000 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
  -t 8 \
  --host 0.0.0.0 \
  --port 9000 \
  -n 64 \
  -c 4096
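If you want the container to restart automatically and be easier to manage, you can also give it a name and a restart policy (standard docker run options; the name llama-server is just an example):
docker run -d --name llama-server --restart unless-stopped \
  -v ~/llama-models:/models \
  -p 9000:9000 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
  -t 8 --host 0.0.0.0 --port 9000 -c 4096
You can then stop and start it with docker stop llama-server and docker start llama-server.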
Check logs:
docker ps                    # find the container ID
docker logs <container_id>
Look for log lines such as "HTTP server is listening" and messages showing the model being loaded.
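As an additional quick check, llama-server exposes a /health endpoint you can poll; it returns a small JSON status once the model has finished loading:
curl http://localhost:9000/health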
Step 5: Test the Server
Send a request to verify it works:
curl -X POST http://localhost:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "DeepSeek-R1-Distill-Llama-8B-Q4_K_M",
"messages": [{"role": "user", "content": "Hi, how’s it going?"}],
"max_tokens": 500,
"temperature": 0.1
}'
Expected: JSON response with generated text.
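Since the endpoint follows the OpenAI chat-completions format, you can extract just the generated text with jq (install it with sudo apt install -y jq if needed); this is a minimal sketch of the same request piped through a filter:
curl -s http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}' \
  | jq -r '.choices[0].message.content'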
Step 6: Optimize for CPU
Adjust Threads (-t):
Run nproc to see how many CPU threads you have (e.g., 16), then set -t 8 or -t 16:
docker run ... -t 16 ...
Context Size (-c):
-c 4096 is enough for most use; for long conversations you can increase it to 8192 (watch RAM usage).
Parallel Slots (-np):
Add -np 4 to handle 4 concurrent requests:
docker run ... -np 4 ...
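Putting these options together, a tuned CPU-only run on a 16-thread machine might look like the sketch below (the thread, context, and slot counts are examples to adjust for your hardware; note that the total context set by -c is shared across the -np slots):
docker run -d \
  -v ~/llama-models:/models \
  -p 9000:9000 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
  -t 16 -c 8192 -np 4 \
  --host 0.0.0.0 --port 9000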
Troubleshooting
Error: "Cannot find model":
Check that the -v mount path and the model filename are correct.
Slow response:
Increase -t up to your CPU's core count.
Use a smaller model (e.g., 7B).
Container exits:
View the container logs to see why it exited (optionally add --verbose to the server arguments for more detail):
docker logs <container_id>
Port conflict:
Change the port: use -p 8080:8080 and --port 8080.
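To see which process is already using the port, one option is the standard ss tool:
sudo ss -ltnp | grep :9000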
Example Workflow
Install Docker:
sudo apt update && sudo apt install -y docker.io
sudo usermod -aG docker $USER && newgrp docker
Pull image:
docker pull ghcr.io/ggml-org/llama.cpp:server
Prepare model:
mkdir -p ~/llama-models
wget <model_url> -O ~/llama-models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf
Run server:
docker run -d -v ~/llama-models:/models -p 9000:9000 ghcr.io/ggml-org/llama.cpp:server -m /models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -t 8 --host 0.0.0.0 --port 9000
Test:
curl -X POST http://localhost:9000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "DeepSeek-R1-Distill-Llama-8B-Q4_K_M", "messages": [{"role": "user", "content": "Hello!"}]}'