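My colleagues and I have recently been looking into what Vulkan can do on OpenHarmony. We used benchncnn, the benchmark tool that ships with ncnn, to compare CPU and GPU inference. The results and our analysis follow: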
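Test environment
Radxa Orion O6 development board + AMD RX 580 graphics card
- https://docs.radxa.com/orion/o6/getting-started/introduction
- The Radxa Orion O6 is a professional-grade Mini-ITX motherboard aimed at AI computing and multimedia applications. It is built around the Cix P1 SoC (model CD8180) from Cix Technology, supports up to 64GB of LPDDR5 memory, and delivers server-class performance in a compact form factor. The Orion O6 offers rich I/O, including quad display outputs, dual 5GbE networking, and PCIe Gen4 expansion, which makes it well suited to AI development workstations, edge computing nodes, and high-performance personal computing.
- Most importantly, it has a full-size PCIe x16 slot wired for 8 lanes of PCIe Gen4, so a discrete graphics card can be installed. I used an RX 580 here.
OpenHarmony version: 5.0.0
- Vulkan version: 1.4.313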
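What is Vulkan
For an introduction, see https://blog.csdn.net/Interview_TC/article/details/149866464
The graphics API mainly used on OpenHarmony today is OpenGL ES. Take the RK3568 as an example: it uses a Mali-series GPU, and if you need a Vulkan driver for it, the open-source implementation provided by Mesa 3D is the recommended route.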
benchncnn: CPU-only inference
# ./benchncnn 10 1 0 -1 0
loop_count = 10
num_threads = 1
powersave = 0
gpu_device = -1
cooling_down = 0
squeezenet min = 14.58 max = 14.76 avg = 14.68
squeezenet_int8 min = 9.62 max = 10.14 avg = 9.94
mobilenet min = 26.55 max = 26.87 avg = 26.75
mobilenet_int8 min = 12.81 max = 13.14 avg = 13.01
mobilenet_v2 min = 16.31 max = 16.45 avg = 16.38
mobilenet_v3 min = 12.52 max = 12.80 avg = 12.67
shufflenet min = 8.39 max = 8.64 avg = 8.53
shufflenet_v2 min = 9.21 max = 9.43 avg = 9.33
mnasnet min = 16.63 max = 16.81 avg = 16.75
proxylessnasnet min = 18.66 max = 19.09 avg = 18.75
efficientnet_b0 min = 26.62 max = 26.92 avg = 26.79
efficientnetv2_b0 min = 31.94 max = 32.17 avg = 32.03
regnety_400m min = 20.36 max = 21.84 avg = 20.61
blazeface min = 2.42 max = 2.65 avg = 2.50
googlenet min = 51.36 max = 51.76 avg = 51.52
googlenet_int8 min = 37.29 max = 37.65 avg = 37.42
resnet18 min = 38.96 max = 40.10 avg = 39.12
resnet18_int8 min = 31.86 max = 32.07 avg = 31.97
alexnet min = 33.00 max = 34.18 avg = 33.21
vgg16 min = 193.38 max = 195.13 avg = 193.95
vgg16_int8 min = 238.43 max = 241.54 avg = 239.21
resnet50 min = 117.60 max = 119.09 avg = 117.81
resnet50_int8 min = 66.19 max = 66.59 avg = 66.31
squeezenet_ssd min = 30.69 max = 31.07 avg = 30.82
squeezenet_ssd_int8 min = 29.76 max = 30.41 avg = 29.99
mobilenet_ssd min = 53.97 max = 55.14 avg = 54.19
mobilenet_ssd_int8 min = 25.82 max = 26.22 avg = 26.07
mobilenet_yolo min = 119.46 max = 120.76 avg = 119.71
mobilenetv2_yolov3 min = 59.16 max = 59.45 avg = 59.31
yolov4-tiny min = 69.37 max = 69.81 avg = 69.54
nanodet_m min = 20.09 max = 20.32 avg = 20.14
yolo-fastest-1.1 min = 7.96 max = 8.18 avg = 8.06
yolo-fastestv2 min = 6.75 max = 6.95 avg = 6.85
vision_transformer min = 1068.13 max = 1069.96 avg = 1069.13
FastestDet min = 8.41 max = 8.66 avg = 8.57
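The five positional arguments in the command above line up with the five header fields that benchncnn echoes back (loop_count, num_threads, powersave, gpu_device, cooling_down). As a small sketch of that mapping, the hypothetical helper below (`benchncnn_argv` is an illustrative name, not part of ncnn) builds the argument list from named parameters, assuming the binary sits in the current directory as in the run above:

```python
import shlex


def benchncnn_argv(loop_count=10, num_threads=1, powersave=0,
                   gpu_device=-1, cooling_down=0, binary="./benchncnn"):
    """Build the positional argument list in the order the header echoes it.

    gpu_device=-1 selects CPU-only inference, as in the run above;
    gpu_device=0 selects the first Vulkan device.
    """
    return [binary, str(loop_count), str(num_threads), str(powersave),
            str(gpu_device), str(cooling_down)]


cpu_cmd = benchncnn_argv(gpu_device=-1)
gpu_cmd = benchncnn_argv(gpu_device=0)
print(shlex.join(cpu_cmd))
print(shlex.join(gpu_cmd))
```

The resulting list can be handed to `subprocess.run` on the target device if a Python interpreter is available there; otherwise the printed command line can simply be pasted into the shell.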
benchncnn: Vulkan inference
# ./benchncnn 10 1 0 0 0
[0 AMD Radeon RX 580 2048SP (RADV POLARIS10)] queueC=1[4] queueG=0[1] queueT=0[1]
[0 AMD Radeon RX 580 2048SP (RADV POLARIS10)] fp16-p/s/u/a=1/1/1/0 int8-p/s/u/a=1/1/1/1
[0 AMD Radeon RX 580 2048SP (RADV POLARIS10)] subgroup=64(64~64) ops=1/1/1/1/1/1/1/1/0/0
[0 AMD Radeon RX 580 2048SP (RADV POLARIS10)] fp16-cm=0 int8-cm=0 bf16-cm=0 fp8-cm=0
loop_count = 10
num_threads = 1
powersave = 0
gpu_device = 0
cooling_down = 0
squeezenet min = 2.26 max = 2.38 avg = 2.33
squeezenet_int8 min = 9.65 max = 9.86 avg = 9.77
mobilenet min = 2.62 max = 2.77 avg = 2.68
mobilenet_int8 min = 12.39 max = 12.47 avg = 12.42
mobilenet_v2 min = 3.61 max = 3.84 avg = 3.72
mobilenet_v3 min = 3.62 max = 3.79 avg = 3.70
shufflenet min = 1.94 max = 2.17 avg = 2.02
shufflenet_v2 min = 2.74 max = 2.93 avg = 2.84
mnasnet min = 3.86 max = 4.22 avg = 3.99
proxylessnasnet min = 3.73 max = 3.96 avg = 3.86
efficientnet_b0 min = 5.84 max = 6.68 avg = 6.15
efficientnetv2_b0 min = 20.50 max = 22.40 avg = 21.91
regnety_400m min = 4.44 max = 5.01 avg = 4.68
blazeface min = 1.30 max = 1.67 avg = 1.48
googlenet min = 9.27 max = 9.81 avg = 9.51
googlenet_int8 min = 38.14 max = 38.52 avg = 38.26
resnet18 min = 4.48 max = 4.87 avg = 4.65
resnet18_int8 min = 31.54 max = 32.87 avg = 31.81
alexnet min = 3.15 max = 3.53 avg = 3.26
vgg16 min = 11.14 max = 11.54 avg = 11.46
vgg16_int8 min = 238.41 max = 239.90 avg = 238.76
resnet50 min = 10.11 max = 11.20 avg = 10.64
resnet50_int8 min = 66.91 max = 67.03 avg = 66.95
squeezenet_ssd min = 7.90 max = 8.85 avg = 8.42
squeezenet_ssd_int8 min = 28.88 max = 29.10 avg = 28.98
mobilenet_ssd min = 6.98 max = 8.32 avg = 7.37
mobilenet_ssd_int8 min = 24.77 max = 24.98 avg = 24.88
mobilenet_yolo min = 6.73 max = 7.86 avg = 7.46
mobilenetv2_yolov3 min = 9.55 max = 10.76 avg = 10.20
yolov4-tiny min = 11.57 max = 12.67 avg = 12.28
nanodet_m min = 5.73 max = 7.84 avg = 6.24
yolo-fastest-1.1 min = 3.44 max = 3.85 avg = 3.58
yolo-fastestv2 min = 3.90 max = 5.13 avg = 4.14
vision_transformer min = 77.86 max = 78.55 avg = 78.29
FastestDet min = 2.85 max = 3.24 avg = 2.97
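To compare the two runs model by model, the logs can be parsed and the average latencies divided. The following is a minimal sketch, assuming each result line has the `name min = … max = … avg = …` shape shown above. The function names are illustrative, and only a few lines from each log are embedded as sample input:

```python
import re

# Matches result lines such as "squeezenet  min = 14.58  max = 14.76  avg = 14.68".
# Header lines ("loop_count = 10") and device lines ("[0 AMD ...]") do not match.
RESULT_LINE = re.compile(
    r"^\s*(\S+)\s+min\s*=\s*([\d.]+)\s+max\s*=\s*([\d.]+)\s+avg\s*=\s*([\d.]+)"
)


def parse_benchncnn(text):
    """Return {model_name: avg_ms} for every result line in a benchncnn log."""
    averages = {}
    for line in text.splitlines():
        match = RESULT_LINE.match(line)
        if match:
            averages[match.group(1)] = float(match.group(4))
    return averages


def speedups(cpu_text, gpu_text):
    """Return {model_name: cpu_avg / gpu_avg} for models present in both logs."""
    cpu = parse_benchncnn(cpu_text)
    gpu = parse_benchncnn(gpu_text)
    return {name: cpu[name] / gpu[name] for name in cpu if name in gpu}


cpu_sample = """\
loop_count = 10
squeezenet min = 14.58 max = 14.76 avg = 14.68
blazeface min = 2.42 max = 2.65 avg = 2.50
vgg16 min = 193.38 max = 195.13 avg = 193.95
vgg16_int8 min = 238.43 max = 241.54 avg = 239.21
"""

gpu_sample = """\
[0 AMD Radeon RX 580 2048SP (RADV POLARIS10)] fp16-cm=0 int8-cm=0 bf16-cm=0 fp8-cm=0
loop_count = 10
squeezenet min = 2.26 max = 2.38 avg = 2.33
blazeface min = 1.30 max = 1.67 avg = 1.48
vgg16 min = 11.14 max = 11.54 avg = 11.46
vgg16_int8 min = 238.41 max = 239.90 avg = 238.76
"""

for name, ratio in speedups(cpu_sample, gpu_sample).items():
    print(f"{name:12s} {ratio:6.2f}x")
```

Feeding the two complete logs to `speedups` gives the per-model ratio for every entry. A ratio close to 1 means the model gained essentially nothing from selecting the GPU.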
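Conclusions
- Prefer the GPU (Vulkan): most FP32 models run about 3x to 10x faster on the GPU, and the larger ones (resnet50, vgg16, mobilenet_yolo, vision_transformer) gain roughly 11x to 17x, for example vgg16 at 193.95 ms on the CPU vs 11.46 ms on the GPU, and vision_transformer at 1069.13 ms vs 78.29 ms. In these runs it is the most effective way to speed up inference. Note that the CPU baseline is single-threaded (num_threads = 1), so the gap against a multi-threaded CPU run would likely be smaller.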
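When the CPU makes sense:
- Very small models (such as blazeface, 2.50 ms on the CPU vs 1.48 ms on the GPU), where fixed GPU launch overhead accounts for a large share of the total time and the advantage is small.
- Systems with no compatible GPU, or where no Vulkan driver is available.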
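The first time a model runs on the GPU, shaders may have to be compiled, which can make the first inference noticeably slower. The min time in the benchmark usually reflects the best performance after warm-up.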
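The effect of a one-time start-up cost on timing can be illustrated with a generic harness that discards warm-up iterations before measuring. This is only a sketch of the idea, not benchncnn's actual code. `bench` and `FakeModel` are hypothetical names, and the sleep calls merely stand in for shader compilation and inference:

```python
import time


def bench(fn, loops=10, warmup=1):
    """Call fn `warmup` times untimed, then time `loops` calls.

    Returns (min_ms, max_ms, avg_ms) over the timed calls only.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(loops):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return min(samples), max(samples), sum(samples) / len(samples)


class FakeModel:
    """Stand-in workload: the first call pays a one-time setup cost."""

    def __init__(self):
        self.first_call = True

    def __call__(self):
        if self.first_call:
            time.sleep(0.05)  # placeholder for one-time shader compilation
            self.first_call = False
        time.sleep(0.001)  # placeholder for steady-state inference


cold = bench(FakeModel(), loops=5, warmup=0)  # first call is timed
warm = bench(FakeModel(), loops=5, warmup=1)  # first call is discarded
print("cold min/max/avg ms:", cold)
print("warm min/max/avg ms:", warm)
```

Without warm-up, the slow first call inflates max and avg while min stays close to the steady-state figure, which is why min is a reasonable number to quote when the first iteration may include setup work.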
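INT8 quantization: on the CPU, INT8 models are often much faster than their FP32 counterparts (mobilenet_int8 13.01 ms vs mobilenet 26.75 ms), though not always (vgg16_int8 239.21 ms vs vgg16 193.95 ms). With the RX 580 (Vulkan) selected, the INT8 timings are almost unchanged from the CPU run (mobilenet_int8 12.42 ms vs 13.01 ms) and are far slower than FP32 on the GPU (mobilenet 2.68 ms). In other words, INT8 models gain almost nothing from the GPU here. One plausible explanation is that the INT8 layers still execute on the CPU in this configuration.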
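The INT8 observation can be checked directly from the averages reported in the two runs. The sketch below hard-codes the avg values for three model pairs from the logs above and computes the FP32/INT8 time ratio on each device. `int8_gain` is an illustrative helper name:

```python
# Average latencies in ms, copied from the CPU run (gpu_device = -1).
cpu_avg = {
    "mobilenet": 26.75, "mobilenet_int8": 13.01,
    "resnet50": 117.81, "resnet50_int8": 66.31,
    "vgg16": 193.95, "vgg16_int8": 239.21,
}

# Average latencies in ms, copied from the Vulkan run (gpu_device = 0).
gpu_avg = {
    "mobilenet": 2.68, "mobilenet_int8": 12.42,
    "resnet50": 10.64, "resnet50_int8": 66.95,
    "vgg16": 11.46, "vgg16_int8": 238.76,
}


def int8_gain(avg_ms):
    """Return {model: fp32_ms / int8_ms}; a value above 1 means INT8 is faster."""
    return {
        name: avg_ms[name] / avg_ms[name + "_int8"]
        for name in avg_ms
        if name + "_int8" in avg_ms
    }


for device, table in (("cpu", cpu_avg), ("gpu", gpu_avg)):
    for name, gain in int8_gain(table).items():
        print(f"{device} {name:10s} fp32/int8 = {gain:.2f}")
```

On the CPU the ratio is above 1 for mobilenet and resnet50 but below 1 for vgg16. With the GPU selected it is well below 1 for all three, because the FP32 time drops sharply while the INT8 time barely moves.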