Llama.cpp Server Installation Guide with Docker (CPU-Only)


This guide explains how to install and run the llama.cpp server using the ghcr.io/ggml-org/llama.cpp:server Docker image on a CPU-only system. The image is pre-built to run the llama-server executable, which serves inference over an HTTP API.


Prerequisites

Before starting, ensure your system meets these requirements:

  • Operating System: Ubuntu 20.04/22.04 (or any Linux with Docker support).

  • Hardware: Any modern CPU (multi-core recommended).

  • Memory: At least 16GB RAM (more for models larger than 8B).

  • Storage: 10GB+ free space (for Docker image and models).

  • Internet: Required to pull the Docker image and download models.

  • Docker: Installed and running.
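
A quick way to check the hardware requirements from a terminal (standard Linux tools):

    nproc      # CPU core count (threads)
    free -h    # total and available RAM
    df -h ~    # free disk space in your home directory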


Step 1: Install Docker

If Docker is not installed yet, follow these steps:

  1. Update package list:

    sudo apt update
  2. Install Docker:

    sudo apt install -y docker.io
  3. Start and enable Docker:

    sudo systemctl start docker
    sudo systemctl enable docker
  4. Verify installation:

    docker --version

    Expected output: Docker version 20.10.12 (or similar).

  5. Add your user to the docker group (to avoid typing sudo every time):

    sudo usermod -aG docker $USER
    newgrp docker
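
To confirm that Docker works without sudo, you can run the standard hello-world test image:

    docker run --rm hello-world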

Step 2: Prepare Model Directory

The Docker image needs a GGUF model file mounted from your local system.

  1. Create a directory for models:

    mkdir -p ~/llama-models
  2. Download a GGUF model into the directory (replace <model_url> with the download link for your chosen quantization, e.g., DeepSeek-R1-Distill-Llama-8B-Q4_K_M):

    wget <model_url> -O ~/llama-models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf

  3. Verify:

    ls -lh ~/llama-models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf

    Expected size: ~4-5 GB.
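
As an alternative to wget, the huggingface-cli tool can fetch GGUF files directly from Hugging Face. A minimal sketch, assuming you have identified the repository hosting your quantization (replace <repo_id> with the actual repository name):

    pip install -U huggingface_hub
    huggingface-cli download <repo_id> \
        DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
        --local-dir ~/llama-models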


Step 3: Pull the Docker Image

Get the ghcr.io/ggml-org/llama.cpp:server image from GitHub Container Registry:

  1. Pull the image:

    docker pull ghcr.io/ggml-org/llama.cpp:server
  2. Verify:

    docker images

    Expected output:

    REPOSITORY                     TAG       IMAGE ID       CREATED       SIZE
    ghcr.io/ggml-org/llama.cpp    server    xxxxxxxx       yyyy-mm-dd    zzzMB
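
If you are curious what the image runs by default, you can inspect its entrypoint (the exact value depends on the image version):

    docker inspect --format '{{json .Config.Entrypoint}}' ghcr.io/ggml-org/llama.cpp:server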

Step 4: Run the Server (CPU-Only)

Run the llama-server in a Docker container:

  1. Basic command:

    docker run \
        -v ~/llama-models:/models \
        -p 9000:9000 \
        ghcr.io/ggml-org/llama.cpp:server \
        -m /models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
        -t 8 \
        --host 0.0.0.0 \
        --port 9000 \
        -n 64 \
        -c 4096
    • -v: Mounts your local model directory to /models inside the container.

    • -p: Maps container port 9000 to host port 9000.

    • -t 8: Uses 8 CPU threads (adjust to your core count; run nproc to check).

    • --host 0.0.0.0: Allows external access.

    • -n 64: Limits output to 64 tokens.

    • -c 4096: Sets context size to 4096 tokens.

  2. Run in the background (optional):

    docker run -d \
        -v ~/llama-models:/models \
        -p 9000:9000 \
        ghcr.io/ggml-org/llama.cpp:server \
        -m /models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
        -t 8 \
        --host 0.0.0.0 \
        --port 9000 \
        -n 64 \
        -c 4096
  3. Check logs:

    docker ps  # find the container ID
    docker logs <container_id>

     Look for: HTTP server is listening and the model loading messages.
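
Recent llama.cpp builds also expose a /health endpoint, which is a convenient readiness check before sending real requests (it returns an error status while the model is still loading):

    curl http://localhost:9000/health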


Step 5: Test the Server

Send a request to verify it works:

curl -X POST http://localhost:9000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "DeepSeek-R1-Distill-Llama-8B-Q4_K_M",
        "messages": [{"role": "user", "content": "Hi, how’s it going?"}],
        "max_tokens": 500,
        "temperature": 0.1
    }'
  • Expected: JSON response with generated text.
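
The endpoint is OpenAI-compatible, so streaming also works. A minimal sketch with "stream": true, which returns the reply incrementally as server-sent events (-N disables curl's output buffering):

    curl -N -X POST http://localhost:9000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "DeepSeek-R1-Distill-Llama-8B-Q4_K_M",
            "messages": [{"role": "user", "content": "Tell me a short joke."}],
            "max_tokens": 100,
            "stream": true
        }'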


Step 6: Optimize for CPU

  1. Adjust Threads (-t):

    • Run nproc to see your CPU thread count (e.g., 16), then set -t 8 or -t 16:

      docker run ... -t 16 ...
  2. Context Size (-c):

    • -c 4096 is enough for most uses; increase to 8192 for long conversations (watch RAM usage).

  3. Parallel Slots (-np):

    • Add -np 4 to support 4 concurrent requests:

      docker run ... -np 4 ...
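
Putting the tuning flags together, a CPU-optimized run for a 16-thread machine might look like the sketch below (adjust -t, -c, and -np to your hardware and workload):

    docker run -d \
        -v ~/llama-models:/models \
        -p 9000:9000 \
        ghcr.io/ggml-org/llama.cpp:server \
        -m /models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
        -t 16 \
        -c 8192 \
        -np 4 \
        --host 0.0.0.0 \
        --port 9000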

Troubleshooting

  • Error: "Cannot find model":

    • Check that the -v path and the model filename are correct.

  • Slow response:

    • Increase -t up to your CPU's core count.

    • Use a smaller model (e.g., 7B).

  • Container exits:

    • Run the container in the foreground (without -d) to see the error directly, or check the logs:

      docker logs <container_id>
  • Port conflict:

    • Change the port: use -p 8080:8080 together with --port 8080.
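
To find out what is already listening on a port before changing it, standard Linux tools can help:

    sudo ss -ltnp | grep 9000   # show the process listening on port 9000
    docker ps                   # show port mappings of running containers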


Example Workflow

  1. Install Docker:

    sudo apt update && sudo apt install -y docker.io
    sudo usermod -aG docker $USER && newgrp docker
  2. Pull image:

    docker pull ghcr.io/ggml-org/llama.cpp:server
  3. Prepare model:

    mkdir -p ~/llama-models
    wget <model_url> -O ~/llama-models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf
  4. Run server:

    docker run -d -v ~/llama-models:/models -p 9000:9000 ghcr.io/ggml-org/llama.cpp:server -m /models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -t 8 --host 0.0.0.0 --port 9000
  5. Test:

    curl -X POST http://localhost:9000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "DeepSeek-R1-Distill-Llama-8B-Q4_K_M", "messages": [{"role": "user", "content": "Hello!"}]}'
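
To extract just the generated text from the response, you can pipe the JSON through jq (assuming jq is installed; the response follows the OpenAI chat-completion schema):

    curl -s -X POST http://localhost:9000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "DeepSeek-R1-Distill-Llama-8B-Q4_K_M", "messages": [{"role": "user", "content": "Hello!"}]}' \
        | jq -r '.choices[0].message.content'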
