cutlass v2.0.0 简介 01 -- 编译,测试,简单分析源码

发布于:2025-07-04 ⋅ 阅读:(17) ⋅ 点赞:(0)

因为 cutlass 2.0.0 版本相对简单一些,这里先分析 这个版本。

0. 系统环境

x86_64

rtx-2080ti

ubuntu 22.04

cutlass 2.0.0

cuda 12.9

1. 下载cutlass

https://github.com/NVIDIA/cutlass/tree/v2.0.0

​git clone https://github.com/NVIDIA/cutlass.git

cd cutlass

git checkout v2.0.0
​

 2.编译运行 cutlass_profiler

cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=all

会有一个较长的时间自动下载 googletest

编译 cutlass_profiler:

make cutlass_profiler -j

运行:

./tools/profiler/cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096

3. cutlass 文件夹布局

CUTLASS Templates are implemented by header files in the following directory structure:

include/                     # Top-level include directory. Client applications should target this path.

  cutlass/                   # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only

    arch/                    # direct exposure of architecture features (including instruction-level GEMMs)
      *
    gemm/                    # code specialized for general matrix product computations
      thread/                #   thread-level operators
      warp/                  #   warp-level operators
      threadblock/           #   CTA-level operators
      kernel/                #   CUDA kernel entry points
      device/                #   launches kernel(s) over a full device
      *                      # scope-agnostic components and basic vocabular type definitions for GEMM

    layout/                  # layout definitions for matrices, tensors, and other mathematical objects in memory
      *

    reduction/               # bandwidth-limited reduction kernels that do not fit the "gemm" models
      thread/                #   thread-level operators
      warp/                  #   warp-level operators
      threadblock/           #   CTA-level operators
      kernel/                #   CUDA kernel entry points
      device/                #   launches kernel(s) over a full device
      *                      # scope-agnostic components and basic vocabular type definitions

    transform/               # code specialized for layout, type, and domain transformations
      thread/                #   thread-level operators
      warp/                  #   warp-level operators
      threadblock/           #   CTA-level operators
      kernel/                #   CUDA kernel entry points
      device/                #   launches kernel(s) over a full device
      *                      # scope-agnostic components and basic vocabulary type definitions

    util/                    # miscellaneous CUTLASS components
      *
    *                        # core vocabulary types and fundamental arithmetic operators

编程参考文档:

https://github.com/NVIDIA/cutlass/blob/v2.0.0/media/docs/programming_guidelines.md

4. 分析 cutlass_profiler

4.1. 编译 debug 版本 cutlass_profiler

首先,编译一个debug 版本的cutlass 的测试app程序。

cmake 配置时加一个 -DCMAKE_BUILD_TYPE=Debug,具体命令:

$ cd cutlass/
$ make build_gdb/
$ cd build_gdb/

$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=all -DCMAKE_BUILD_TYPE=Debug

$ make cutlass_profiler -j VERBOSE=1

如果遇到了这样的报错,可以给这个 dummy 变量赋个初始值:

cutlass/build_gdb/_deps/googletest-src/googletest/src/gtest-death-test.cc:1008:24: error: ‘dummy’ may be used uninitialized [-Werror=maybe-uninitialized]
 1008 |   StackLowerThanAddress(&dummy, &result);

4.2. debug cutlass_profiler

gdb 载入主程序,并设置命令行参数

gdb ./tools/profiler/cutlass_profiler

(gdb) set args --kernels=sgemm --m=4352 --n=4096 --k=4096
(gdb) start
(gdb) layout src

除了 return profiler();里边会运行到 cuda kernel, 前边的几行代码是在设置参数等。

会执行到这段代码:

/// Execute the program
int CutlassProfiler::operator()() {

... ....

  if (options_.execution_mode == ExecutionMode::kProfile ||
    options_.execution_mode == ExecutionMode::kDryRun ||
    options_.execution_mode == ExecutionMode::kTrace) {

    // Profiles all operations
    profile_();
  }
... ...

  return 0;
}

主要内容在函数     profile_(); 中调用。


/// Profiles all operations
int CutlassProfiler::profile_() {

  library::Manifest manifest;
  Status status = manifest.initialize();

  if (status != Status::kSuccess) {
    return -1;
  }

  int result = 0;
  DeviceContext device_context;

  // For all profilers
  for (auto & profiler : operation_profilers_) {

    if (options_.operation_kind == library::OperationKind::kInvalid ||
      options_.operation_kind == profiler->kind()) {

      result = profiler->profile_all(options_, manifest, device_context);

      if (result) {
        return result;
      } 
    }
  }

  return result;
}

主要是这个函数的调用:

result = profiler->profile_all(options_, manifest, device_context);

 通过

nvprof ./tools/profiler/cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096

可以发现 这个 cutlass_profiler 调用 了cublas Gemm,

通过 grep -rn cublas | grep emm 发现了 cublasGemmEx 存在于 cutlass 源代码中;

再结合 gdb 断点到 cublasGemmEx行,continue过去之后,bt,发现 调用栈 :


  /// Executes GEMM using these arguments
  cublasStatus_t operator()(cublasHandle_t handle) {

    return cublasGemmEx(
      handle,
      trans_A,
      trans_B,
      configuration.problem_size.m(),
      configuration.problem_size.n(),
      configuration.problem_size.k(),
      arguments.alpha,
      arguments.A,
      data_type_A,
      int(configuration.lda),
      arguments.B,
      data_type_B,
      int(configuration.ldb),
      arguments.beta,
      arguments.D,
      data_type_C,
      int(configuration.ldc),
      compute_type,
      algo
    );
  }
};

back trace 的结果:

588	    return cublasGemmEx(
(gdb) bt
#0  cutlass::profiler::detail::cublasGemmExDispatcher::operator() (this=0x7fffffffce50, handle=0x555556d7b330) at /home/hipper/ex_cutlass/tmp2_20250623/cutlass/tools/profiler/src/gemm_operation_profiler.cu:588
#1  0x00005555555e4b29 in cutlass::profiler::GemmOperationProfiler::verify_with_cublas_ (this=0x5555567acf80, options=..., report=..., device_context=..., operation=0x5555564d1cd0, problem_space=..., 
    problem=std::vector of length 25, capacity 32 = {...}) at /home/hipper/ex_cutlass/tmp2_20250623/cutlass/tools/profiler/src/gemm_operation_profiler.cu:691
#2  0x00005555555e455b in cutlass::profiler::GemmOperationProfiler::verify_cutlass (this=0x5555567acf80, options=..., report=..., device_context=..., operation=0x5555564d1cd0, problem_space=..., 
    problem=std::vector of length 25, capacity 32 = {...}) at /home/hipper/ex_cutlass/tmp2_20250623/cutlass/tools/profiler/src/gemm_operation_profiler.cu:440
#3  0x00005555555db955 in cutlass::profiler::OperationProfiler::profile_all (this=0x5555567acf80, options=..., manifest=..., device_context=...) at /home/hipper/ex_cutlass/tmp2_20250623/cutlass/tools/profiler/src/operation_profiler.cu:320
#4  0x000055555555f6f6 in cutlass::profiler::CutlassProfiler::profile_ (this=0x7fffffffda90) at /home/hipper/ex_cutlass/tmp2_20250623/cutlass/tools/profiler/src/cutlass_profiler.cu:131
#5  0x000055555555f577 in cutlass::profiler::CutlassProfiler::operator() (this=0x7fffffffda90) at /home/hipper/ex_cutlass/tmp2_20250623/cutlass/tools/profiler/src/cutlass_profiler.cu:95
#6  0x000055555555d3b4 in main (argc=5, arg=0x7fffffffe158) at /home/hipper/ex_cutlass/tmp2_20250623/cutlass/tools/profiler/src/main.cpp:44