0 Introduction
I have recently been looking at Triton, so I spent some time studying its source code, the official README documentation, and related community material. By downloading, deploying, and building Triton Server, and by working through modules such as the backend plugin mechanism, the model loading flow, and lifecycle management while reading the source, I have formed an initial picture of Triton Server's overall architecture and workflow.
This article is based on reading the official Triton repositories on GitHub and the official documentation, together with ChatGPT-assisted analysis of the source code and technical details. It covers installing and using Triton Server, the implementation details of the backend plugin mechanism, the model loading process, and hands-on development of a custom C++ backend, and is intended as a reference for engineers and developers who want to understand Triton's internals.
Since my understanding of Triton is still deepening, there are bound to be gaps or mistakes; corrections and discussion are very welcome.
1 What is Triton Inference Server
Triton Inference Server is an open source inference serving software that streamlines AI inferencing. Triton enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton Inference Server supports inference across cloud, data center, edge and embedded devices on NVIDIA GPUs, x86 and ARM CPU, or AWS Inferentia. Triton Inference Server delivers optimized performance for many query types, including real time, batched, ensembles and audio/video streaming. Triton inference Server is part of NVIDIA AI Enterprise, a software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.
The GitHub README introduces Triton Inference Server with the paragraph above; here is my plain-language summary:
My plain-language understanding: Triton Inference Server is a service whose actual model inference is carried out by different backends such as TensorRT, PyTorch (LibTorch), or ONNX Runtime. A client sends an inference request to the Triton server over HTTP (or gRPC); the server dispatches it through the backend API to the appropriate inference backend, the backend runs the model, and the result is returned to the client.
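To make that flow concrete, here is a hedged sketch of what such an HTTP inference request looks like against Triton's KServe-v2 REST API. The model name my_model and the tensor names INPUT0/OUTPUT0 are invented for illustration; a real model (for example the densenet_onnx model used later) defines its own names, shapes, and datatypes.
# Hypothetical model and tensor names, for illustration only.
curl -X POST localhost:8000/v2/models/my_model/infer \
  -H 'Content-Type: application/json' \
  -d '{
        "inputs":  [{"name": "INPUT0", "shape": [1, 4], "datatype": "FP32",
                     "data": [1.0, 2.0, 3.0, 4.0]}],
        "outputs": [{"name": "OUTPUT0"}]
      }'
# The JSON response contains an "outputs" array produced by whichever backend served the model.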
2 Deploying Triton Inference Server
The README at https://github.com/triton-inference-server/server describes a Docker-image-based deployment, so I tried that first.
2.1 Download the server
First, clone the server repository. This step is not strictly required for the deployment itself, since deployment uses the Docker image, but the repository provides the docs/examples model repository used below.
git clone https://github.com/triton-inference-server/server
2.2 Download the models
cd docs/examples
./fetch_models.sh
The fetch_models.sh script is shown below (I commented out the first wget). It simply downloads the models and creates a Python virtual environment for the TensorFlow-to-ONNX conversion. If the wget download fails when you run the script, comment out the wget line, paste the URL into a browser, download the model manually, and upload it to the server you are experimenting on. The resulting model-repository layout is sketched after the script.
#!/bin/bash
# Copyright (c) 2018-2025, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
set -ex
# Convert Tensorflow inception V3 module to ONNX
# Pre-requisite: Python3, venv, and Pip3 are installed on the system
mkdir -p model_repository/inception_onnx/1
#wget -O /tmp/inception_v3_2016_08_28_frozen.pb.tar.gz \
# https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz
(cd /tmp && tar xzf inception_v3_2016_08_28_frozen.pb.tar.gz)
python3 -m venv tf2onnx
source ./tf2onnx/bin/activate
pip3 install "numpy<2" tensorflow tf2onnx
python3 -m tf2onnx.convert --graphdef /tmp/inception_v3_2016_08_28_frozen.pb --output inception_v3_onnx.model.onnx --inputs input:0 --outputs InceptionV3/Predictions/Softmax:0
deactivate
mv inception_v3_onnx.model.onnx model_repository/inception_onnx/1/model.onnx
# ONNX densenet
mkdir -p model_repository/densenet_onnx/1
wget -O model_repository/densenet_onnx/1/model.onnx \
https://github.com/onnx/models/raw/main/validated/vision/classification/densenet-121/model/densenet-7.onnx
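After the script finishes, the example model repository should look roughly like the tree below. This is a sketch of the expected layout, not verified output: the config.pbtxt and label files already ship in docs/examples/model_repository, and the script only adds the model files.
model_repository/
├── densenet_onnx/
│   ├── config.pbtxt            # already in the repo (shown later in section 4.1)
│   ├── densenet_labels.txt     # already in the repo
│   └── 1/
│       └── model.onnx          # downloaded by the script
└── inception_onnx/
    └── 1/
        └── model.onnx          # produced by the tf2onnx conversion above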
2.3 Experiment
I am running on CPU only here, so I follow the steps in https://github.com/triton-inference-server/server/blob/main/docs/getting_started/quickstart.md#run-on-cpu-only-system
The quickstart gives commands like these:
docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models
docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
Do not copy these verbatim, though; they will fail as written. Replace /full/path/to/docs/examples/ and <xx.yy> with real values. Also note that the first line starts the server while the remaining lines run the client, so you need two terminals connected to the machine. In the first terminal, from the examples directory, run:
docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:25.05-py3 tritonserver --model-repository=/models --model-control-mode explicit --load-model densenet_onnx
The server then starts and prints its startup log; once the model is loaded it reports the HTTP, gRPC, and metrics endpoints listening on ports 8000, 8001, and 8002.
In the second terminal, run:
docker pull nvcr.io/nvidia/tritonserver:25.05-py3-sdk
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:25.05-py3-sdk
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
You should then see output like the following.
Request 0, batch size 1
Image '/workspace/images/mug.jpg':
15.349563 (504) = COFFEE MUG
13.227461 (968) = CUP
10.424893 (505) = COFFEEPOT
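Besides the image_client demo, you can sanity-check the server directly over HTTP. These endpoints are part of Triton's KServe-v2 REST API; the commands below assume the server started in the first terminal is still running on localhost with the default ports.
# Liveness and readiness of the server itself
curl -v localhost:8000/v2/health/live
curl -v localhost:8000/v2/health/ready
# Is the densenet_onnx model loaded and ready?
curl -v localhost:8000/v2/models/densenet_onnx/ready
# Model metadata: input/output names, datatypes, and shapes as Triton sees them
curl localhost:8000/v2/models/densenet_onnx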
3 Building Triton Inference Server
The previous section used a prebuilt Docker image to try Triton out. Next, let's look at how Triton is built. I will not actually run the full build here; the goal is to learn as much about Triton as possible from its build process.
3.1 Which repositories the build needs
At first I assumed the server repository alone was enough, but it is not: the build actually needs the whole set of repositories below. You can clone them manually, or let build.py fetch them automatically when it runs.
| Module category | Repository | Purpose | Fetched automatically at build time? |
|---|---|---|---|
| Server main program | server | Triton server main program and framework | No (you clone it yourself) |
| Core components | core | Triton core logic and scheduling | Yes |
| Common module | common | Shared utilities and infrastructure | Yes |
| Backend interface layer | backend | Unified backend API definition | Yes |
| Inference backends | onnxruntime_backend, pytorch_backend, tensorrt_backend, python_backend, etc. | Concrete inference-engine backend implementations | Yes |
| Third-party dependencies | thirdparty | Source of the (patched) third-party libraries | Yes |
The inference backends themselves include the following repositories, which are also needed.
| Backend | Description |
|---|---|
| onnxruntime_backend | Runs ONNX models with ONNX Runtime |
| pytorch_backend | Runs PyTorch models with LibTorch |
| tensorflow_backend | Serves TensorFlow models (TF1/TF2) |
| tensorrt_backend | Runs models with NVIDIA TensorRT; suited to high-performance scenarios |
| openvino_backend | Runs inference with Intel OpenVINO on x86/edge devices |
| python_backend | Lets you implement custom backend logic in Python |
| dali_backend | Image pre-processing backend built on NVIDIA DALI |
| fil_backend | Tree-model inference (e.g. XGBoost) with RAPIDS FIL |
| identity_backend / repeat_backend / square_backend | Example/test backends |
| ensemble_backend | Built into the core; chains multiple models into a single pipeline |
3.2 What the build process does
At first I thought build.py only compiled an executable or some .so shared libraries, but it actually does three main things:
1. Automatically downloads the required repositories: based on the configuration and version tags, it pulls core, common, backend, each concrete backend (onnxruntime_backend, pytorch_backend, and so on), and the third-party dependencies from GitHub.
2. Compiles the executables and shared libraries: the Triton Server main program, each backend plugin (.so file), and the common and core libraries plus the other binaries the runtime depends on.
3. Packages everything into a Docker image: the compiled programs and shared libraries, together with the runtime environment (dependency libraries, configuration files, and so on), are bundled into a complete Docker image for easy deployment and distribution.
A sketch of a more selective build invocation than --enable-all is shown below.
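The sketch restricts the build to the HTTP/gRPC endpoints and a single backend. The flag spellings follow the server repository's build documentation, but they change between releases, so treat this as an illustration and check python3 build.py --help in your own checkout.
# A hedged sketch: build only what is needed instead of every backend and feature.
python3 build.py \
    --enable-logging --enable-stats --enable-metrics \
    --endpoint=http --endpoint=grpc \
    --backend=onnxruntime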
3.3 Trying the build
Here I use the --dryrun option to get a feel for the build process:
python3 build.py --dryrun --enable-all
Building Triton Inference Server
platform rhel
machine x86_64
version 2.58.0dev
build dir ./triton_20250611/server/build
install dir None
cmake dir None
default repo-tag: r25.05
container version 25.05dev
upstream container version 25.05
endpoint "http"
endpoint "grpc"
endpoint "sagemaker"
endpoint "vertex-ai"
filesystem "gcs"
filesystem "s3"
filesystem "azure_storage"
backend "ensemble" at tag/branch "r25.05"
backend "identity" at tag/branch "r25.05"
backend "square" at tag/branch "r25.05"
backend "repeat" at tag/branch "r25.05"
backend "onnxruntime" at tag/branch "r25.05"
backend "python" at tag/branch "r25.05"
backend "dali" at tag/branch "r25.05"
backend "pytorch" at tag/branch "r25.05"
backend "openvino" at tag/branch "r25.05"
backend "fil" at tag/branch "r25.05"
backend "tensorrt" at tag/branch "r25.05"
repoagent "checksum" at tag/branch "r25.05"
cache "local" at tag/branch "r25.05"
cache "redis" at tag/branch "r25.05"
component "common" at tag/branch "r25.05"
component "core" at tag/branch "r25.05"
component "backend" at tag/branch "r25.05"
component "thirdparty" at tag/branch "r25.05"
Traceback (most recent call last):
File "./triton_20250611/server/build.py", line 3162, in <module>
create_build_dockerfiles(
File "./triton_20250611/server/build.py", line 1696, in create_build_dockerfiles
raise KeyError("A base image must be specified when targeting RHEL")
KeyError: 'A base image must be specified when targeting RHEL'
The output above confirms that build.py does pull the corresponding repositories automatically. It also raises an error on this machine (a RHEL target with no base image specified), so I modified build.py as follows:
def create_build_dockerfiles(
    container_build_dir, images, backends, repoagents, caches, endpoints
):
    if "base" in images:
        base_image = images["base"]
        if target_platform() == "rhel":
            print(
                "warning: RHEL is not an officially supported target and you will probably experience errors attempting to build this container."
            )
    elif target_platform() == "windows":
        base_image = "mcr.microsoft.com/dotnet/framework/sdk:4.8"
    elif target_platform() == "rhel":
        # Modified branch logic: set a base image directly instead of raising.
        base_image = "registry.access.redhat.com/ubi8/ubi:latest"
        print("Using manually set RHEL base image:", base_image)
        # raise KeyError("A base image must be specified when targeting RHEL")
    elif FLAGS.enable_gpu:
        base_image = "nvcr.io/nvidia/tritonserver:{}-py3-min".format(
            FLAGS.upstream_container_version
        )
    else:
        base_image = "ubuntu:24.04"
Running build.py creates a build folder under the current directory, containing five new files:
- cmake_build: the script that runs the cmake configuration and compilation, building the Triton Inference Server executables and shared libraries.
- docker_build: the script that calls cmake_build to compile, then uses the generated artifacts together with the Dockerfiles to build the Docker image; it is the entry point of the whole container build flow.
- Dockerfile: builds the final Triton Inference Server runtime image, containing the compiled programs and their dependencies; this is the actual runtime environment.
- Dockerfile.buildbase: defines the build base image with the compiler toolchain and required dependencies, used to support the subsequent compile and build steps.
- Dockerfile.cibase: defines the continuous-integration (CI) image, typically used for automated build and test pipelines.
The flow actually starts from the docker_build script: it invokes cmake_build to compile the executables and shared libraries, and then uses the three Dockerfiles above to produce the final image.
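As a quick orientation (a sketch based purely on the file list above, not captured output), you can list and read the generated files; with --dryrun nothing has been executed yet, so the scripts show exactly what a real build would run.
ls build/
# cmake_build  docker_build  Dockerfile  Dockerfile.buildbase  Dockerfile.cibase
less build/docker_build
less build/cmake_build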
That is as far as I will take the build for now. The goal was an overall understanding; I will go deeper when I actually need to build Triton myself.
4 An overall look at the backend mechanism from the README
4.1 What is a backend
A Triton backend is the implementation that executes a model. A backend can be a wrapper around a deep-learning framework, like PyTorch, TensorFlow, TensorRT, ONNX Runtime or OpenVINO. A backend can also implement any functionality you want as long as it adheres to the backend API. Triton uses this API to send requests to the backend for execution and the backend uses the API to communicate with Triton.
Every model must be associated with a backend. A model's backend is specified in the model's configuration using the backend setting. For using TensorRT backend, the value of this setting should be tensorrt. Similarly, for using PyTorch, ONNX and TensorFlow backends, the backend field should be set to pytorch, onnxruntime or tensorflow respectively. For all other backends, backend must be set to the name of the backend. Some backends may also check the platform setting for categorizing the model, for example, in TensorFlow backend, platform should be set to tensorflow_savedmodel or tensorflow_graphdef according to the model format. Please refer to the specific backend repository on whether platform is used.
The README at https://github.com/triton-inference-server/backend contains the two passages above. My plain-language summary:
A backend is the module inside Triton that performs the actual model inference. A backend can be a wrapper around an inference framework such as ONNX Runtime or TensorRT, or you can write your own, as long as it follows the backend API.
Every model must be associated with a backend. The association is made in the model configuration, either through the backend setting or through the platform setting.
From the deployment experiment earlier, you can see this association in the example model's directory:
cat ./docs/examples/model_repository/densenet_onnx/config.pbtxt
name: "densenet_onnx"
platform: "onnxruntime_onnx"
max_batch_size : 0
input [
{
name: "data_0"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [ 3, 224, 224 ]
reshape { shape: [ 1, 3, 224, 224 ] }
}
]
output [
{
name: "fc6_1"
data_type: TYPE_FP32
dims: [ 1000 ]
reshape { shape: [ 1, 1000, 1, 1 ] }
label_filename: "densenet_labels.txt"
}
The config contains platform: "onnxruntime_onnx", which I suspect is how Triton knows this model should be served by the ONNX Runtime backend. I am not certain yet; I will verify this later when reading the code.
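For comparison, the README text quoted in section 4.1 says the backend can also be selected explicitly with the backend field. A hedged sketch of what that would look like for the same model (I have not verified this exact variant against the example):
name: "densenet_onnx"
backend: "onnxruntime"    # select the ONNX Runtime backend directly instead of via platform
max_batch_size : 0
# input/output sections unchanged from the config above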
4.2 Backend Shared Library
At https://github.com/triton-inference-server/backend/tree/main#backends there is the following passage:
Can I add (or remove) a backend to an existing Triton installation?
Yes. See Backend Shared Library for general information about how the shared library implementing a backend is managed by Triton, and Triton with Unsupported and Custom Backends for documentation on how to add your backend to the released Triton Docker image. For a standard install the globally available backends are in /opt/tritonserver/backends.
So let's look at what this Backend Shared Library is about.
Backend Shared Library
Each backend must be implemented as a shared library and the name of the shared library must be libtriton_<backend-name>.so. For example, if the name of the backend is "mybackend", a model indicates that it uses the backend by setting the model configuration 'backend' setting to "mybackend", and Triton looks for libtriton_mybackend.so as the shared library that implements the backend. The tutorial shows examples of how to build your backend logic into the appropriate shared library.
For a model, M that specifies backend B, Triton searches for the backend shared library in the following places, in this order:
<model_repository>/M/<version_directory>/libtriton_B.so
<model_repository>/M/libtriton_B.so
<global_backend_directory>/B/libtriton_B.so
Where <global_backend_directory> is by default /opt/tritonserver/backends. The --backend-directory flag can be used to override the default.
Typically you will install your backend into the global backend directory. For example, if using Triton Docker images you can follow the instructions in Triton with Unsupported and Custom Backends. Continuing the example of a backend named "mybackend", you would install into the Triton image as:
/opt/tritonserver/backends/mybackend/
    libtriton_mybackend.so
    ...  # other files needed by mybackend
Starting from 24.01, the default backend shared library name can be changed by providing the runtime setting in the model configuration. For example, runtime: "my_backend_shared_library_name.so". A model may choose a specific runtime implementation provided by the backend.
In plain language: each backend is a .so shared library whose file name must follow the libtriton_<backend-name>.so convention. If the model configuration says backend: "mybackend", Triton will look for libtriton_mybackend.so. The library can live in any of the three locations listed above, but it normally goes in the global backend directory, i.e. <global_backend_directory>/B/libtriton_B.so, as in the official layout above.
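As a concrete illustration of the search order above, here is a hedged sketch of installing a custom backend and of pointing Triton at a non-default backend directory. The library and directory names are invented; --backend-directory is the flag mentioned in the quoted text.
# Install into the default global backend directory (inside the Triton container/image):
mkdir -p /opt/tritonserver/backends/mybackend
cp ./build/libtriton_mybackend.so /opt/tritonserver/backends/mybackend/
# Or keep backends somewhere else and tell tritonserver where to look:
tritonserver --model-repository=/models \
             --backend-directory=/opt/custom_backends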
4.3 How to add your backend to the released Triton Docker image
Build it yourself
If you would like to do what compose.py is doing under the hood yourself, you can run compose.py with the --dry-run option and then modify the Dockerfile.compose file to satisfy your needs.
Triton with Unsupported and Custom Backends
You can create and build your own Triton backend. The result of that build should be a directory containing your backend shared library and any additional files required by the backend. Assuming your backend is called "mybackend" and that the directory is "./mybackend", adding the following to the Dockerfile compose.py created will create a Triton image that contains all the supported Triton backends plus your custom backend.
COPY ./mybackend /opt/tritonserver/backends/mybackend
You also need to install any additional dependencies required by your backend as part of the Dockerfile. Then use Docker to create the image.
$ docker build -t tritonserver_custom -f Dockerfile.compose .
This part explains how to add a backend you have defined yourself to an existing Triton Docker image. There are two approaches: the first is to run compose.py once with the required arguments and you are done; the second is to run compose.py with --dry-run to generate Dockerfile.compose, edit that file by hand, and then run the docker build yourself.
4.3.1 Manual mode
python3 compose.py --dry-run --container-version=25.05        # generates server/Dockerfile.compose
COPY ./mybackend /opt/tritonserver/backends/mybackend         # add this line to ./server/Dockerfile.compose by hand
apt update && apt install -y libmydep-dev                     # any extra dependencies your custom backend needs
docker build -t tritonserver_custom -f Dockerfile.compose .   # build the image
I tried it and got roughly the following output.
python3 compose.py --dry-run --container-version=25.05
using container version 25.05
pulling container:nvcr.io/nvidia/tritonserver:25.05-py3
25.05-py3: Pulling from nvidia/tritonserver
Digest: sha256:3189f95bb663618601e46628af7afb154ba2997e152a29113c02f97b618d119f
Status: Image is up to date for nvcr.io/nvidia/tritonserver:25.05-py3
nvcr.io/nvidia/tritonserver:25.05-py3
25.05-py3-min: Pulling from nvidia/tritonserver
f03f49e66a78: Already exists
4f4fb700ef54: Already exists
bd0ed3dadbe9: Already exists
7b57f70af223: Already exists
3f0b11d337e6: Already exists
2104594958ce: Already exists
ba15f2616882: Already exists
4e46d4ab7302: Already exists
50f087002df9: Already exists
f94296dbf484: Already exists
03a8530f6876: Already exists
cce238fffcb6: Already exists
64a55035aee6: Already exists
1394c771d714: Already exists
2312e005f291: Already exists
95f88b748512: Already exists
b34261a35067: Already exists
b29ea1b3ef7d: Already exists
8accdaa104b8: Pull complete
046521000f43: Pull complete
Digest: sha256:3a1c84e22d2df22d00862eb651c000445d9314a12fd7dd005f4906f5615c7f6a
Status: Downloaded newer image for nvcr.io/nvidia/tritonserver:25.05-py3-min
nvcr.io/nvidia/tritonserver:25.05-py3-min
4.3.2 Automatic mode
python3 compose.py \
--container-version=24.01 \
--backend-dir=./mybackend \
--output-name=tritonserver_custom
References:
https://github.com/triton-inference-server/server
https://github.com/triton-inference-server/backend
https://github.com/triton-inference-server/backend/tree/main#backends
tritonserver学习之五:backend实现机制_triton backend-CSDN博客
tritonserver学习之三:tritonserver运行流程_trition-server 使用教程-CSDN博客
Triton Server 快速入门_tritonserver-CSDN博客
深度学习部署神器-triton inference server第一篇 - Oldpan的个人博客
tritonserver学习之六:自定义c++、python custom backend实践_triton c++-CSDN博客