Paper Reading- Center-based 3D Object Detection and Tracking (Based: KITTI)


Background

To represent an object in 3D, the bounding box (BBox) is generally a cuboid, which simply mimics the practice of 2D detection. Traditional methods usually rely on anchors. This post covers how CenterNet, a very classic 2D detection paper, is adapted to 3D detection in CenterPoint, where it achieves strong results.

Progress

  • CenterPoint represents an object's position as a point, which simplifies the 3D detection task, sidesteps the difficulty of regressing an orientation for 3D objects in point clouds, and preserves the rotational invariance of the point cloud;
  • Representing objects as points also simplifies downstream tasks. For tracking, if the tracked object is a point, it is enough to predict the object's relative offset between consecutive frames (a minimal sketch of this association idea follows this list);
  • The paper also notes that the point-based representation allows an effective two-stage refinement module that is much faster than previous approaches (the second stage is not used on KITTI, only on Waymo);
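
Since detections are represented as points, the paper's tracker reduces to a greedy closest-point association: the predicted velocity projects each current-frame center back to the previous frame, where it is matched to the nearest unmatched track. Below is a minimal sketch of that idea; the function name and threshold are illustrative, not the authors' code.

import numpy as np

def greedy_track_association(prev_centers, cur_centers, cur_velocities, dt, dist_thresh=2.0):
    """Greedy closest-distance matching in the BEV, as described in the tracking section of the paper.

    prev_centers:   (M, 2) BEV centers of existing tracks (previous frame)
    cur_centers:    (N, 2) BEV centers of current detections
    cur_velocities: (N, 2) predicted BEV velocities of the current detections
    dt:             time gap between the two frames (seconds)
    Returns a list of (cur_idx, prev_idx) matches; unmatched detections start new tracks.
    """
    # Project the current detections back to the previous frame using the predicted velocity.
    projected = cur_centers - cur_velocities * dt            # (N, 2)

    matches, used_prev = [], set()
    for i, c in enumerate(projected):
        if len(prev_centers) == 0:
            break
        dists = np.linalg.norm(prev_centers - c, axis=1)     # distance to every existing track
        j = int(np.argmin(dists))
        if dists[j] < dist_thresh and j not in used_prev:    # greedily take the closest free track
            matches.append((i, j))
            used_prev.add(j)
    return matches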

CenterPoint Pipeline

CenterPoint Architecture

CenterPoint consists of three parts: the feature-extraction backbone, the heatmap head (first stage), and the extraction of additional point features for refinement (second stage). These three parts are introduced below. Before going into CenterPoint, here is a quick refresher on CenterNet.

CenterNet

In CenterNet, the network takes an image of width W and height H and produces a heatmap (which also serves as the training label) of size \frac{W}{R}\times\frac{H}{R}\times C, where R is the output stride and C the number of classes. At inference time, a location is kept as a positive detection (roughly speaking) if its response is not smaller than that of its neighbors, i.e. it is a local peak. Because the heatmap is downsampled by a factor of R, the discretization introduces a localization error, so an additional 2-channel offset map is regressed to compensate.
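
As a concrete illustration of the peak-selection and offset steps, common CenterNet implementations replace NMS with a 3x3 max-pooling: a location is kept only if it equals its local maximum, and the regressed 2-channel offset corrects the quantization error caused by the stride R. A minimal sketch, with tensor shapes and names of my own choosing rather than the paper's code:

import torch
import torch.nn.functional as F

def decode_centers(heatmap, offset, k=100):
    """heatmap: (B, C, H, W) class scores after sigmoid; offset: (B, 2, H, W) sub-pixel offsets."""
    # Keep only local peaks: a cell survives if it equals the maximum of its 3x3 neighborhood.
    local_max = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    heatmap = heatmap * (heatmap == local_max).float()

    B, C, H, W = heatmap.shape
    scores, inds = heatmap.view(B, -1).topk(k)                # top-k peaks over all classes and locations
    classes = inds // (H * W)
    ys = (inds % (H * W)) // W
    xs = (inds % (H * W)) % W

    # Add the regressed offset to compensate for the output-stride quantization.
    off = offset.view(B, 2, -1).gather(2, (inds % (H * W)).unsqueeze(1).expand(-1, 2, -1))
    xs = xs.float() + off[:, 0]
    ys = ys.float() + off[:, 1]
    return classes, scores, xs, ys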

But why use a center-based (point) representation in 3D detection instead of anchors?

There are two main reasons. First, anchor-based methods incur a huge computational cost. Second, anchor boxes are axis-aligned; in real scenes, when a vehicle is turning, axis-aligned anchors cannot fit the object well. A figure in the paper illustrates this problem nicely.

Backbone

CenterPoint offers two backbones (3D encoders) to choose from: PointPillars and VoxelNet. The PointPillars backbone is computationally lighter, while the VoxelNet backbone costs more but is typically more accurate. Here I only discuss the VoxelNet variant; the test results for both backbones are shown below.

In the VoxelNet backbone, each voxel's feature is the mean of the features of the points inside it (up to 10 points per voxel). In the code this corresponds to the MeanVFE (mean voxel feature encoding) module.

import torch

from .vfe_template import VFETemplate   # base class for voxel feature encoders in OpenPCDet


class MeanVFE(VFETemplate):
    def __init__(self, model_cfg, num_point_features, **kwargs):
        super().__init__(model_cfg=model_cfg)
        self.num_point_features = num_point_features

    def get_output_feature_dim(self):
        return self.num_point_features

    def forward(self, batch_dict, **kwargs):
        """
        Args:
            batch_dict:
                voxels: (num_voxels, max_points_per_voxel, C)            voxel features -> (12000, 10, 5)
                voxel_num_points: optional (num_voxels)                  number of points per voxel -> (12000,)
            **kwargs:

        Returns:
            vfe_features: (num_voxels, C)
        """
        
        voxel_features, voxel_num_points = batch_dict['voxels'], batch_dict['voxel_num_points']
        points_mean = voxel_features[:, :, :].sum(dim=1, keepdim=False)                                 # sum the features of the (up to 10) points in each voxel
        normalizer = torch.clamp_min(voxel_num_points.view(-1, 1), min=1.0).type_as(voxel_features)     # clamp the point count to >= 1 to avoid division by zero
        points_mean = points_mean / normalizer                                                          # per-voxel mean
        batch_dict['voxel_features'] = points_mean.contiguous()                                         # store the mean features -> (12000, 5)

        return batch_dict

After the mean operation, the voxel features are fed into the sparse-convolution feature extractor. Point clouds are sparsely distributed in space: even after voxelization, many voxels are empty, so sparse convolution effectively reduces the amount of computation and speeds the model up (see a dedicated write-up on sparse convolution for details).

from functools import partial

import torch.nn as nn
import spconv.pytorch as spconv   # spconv 2.x; OpenPCDet wraps this import in pcdet.utils.spconv_utils

# post_act_block is a helper defined alongside this class in OpenPCDet's spconv_backbone.py; it bundles a
# sparse conv layer with its normalization layer and ReLU.

'''
Extracts 3D features. The input is the output of MeanVFE; the output is a feature volume produced by sparse
convolutions (loosely similar to PointNet++, but with voxels instead of raw points as input).
voxel_features are scattered according to their coordinates (coors) and convolved sparsely, yielding a voxel
feature volume of shape (batch_size, channels, grid_nums_z, grid_nums_y, grid_nums_x).
'''
class VoxelBackBone8x(nn.Module):
    def __init__(self, model_cfg, input_channels, grid_size, **kwargs):
        super().__init__()
        self.model_cfg = model_cfg
        norm_fn = partial(nn.BatchNorm1d, eps=1e-3, momentum=0.01)

        self.sparse_shape = grid_size[::-1] + [1, 0, 0]
        # Sparse convolution: the point cloud is sparse, so it cannot be processed efficiently with standard
        # dense convolutions; sparse convolution helps speed the model up.
        # Another reason: ordinary convolution would dilate the sparsity pattern (i.e. make it dense), which
        # blurs the geometric contours and leads to inaccurate regression.
        self.conv_input = spconv.SparseSequential(
            spconv.SubMConv3d(input_channels, 16, 3, padding=1, bias=False, indice_key='subm1'),
            norm_fn(16),
            nn.ReLU(),
        )
        
        # `block` selects the convolution type:
        #   subm:          SubMConv3d (submanifold sparse convolution)
        #   spconv:        SparseConv3d
        #   inverseconv:   SparseInverseConv3d
        block = post_act_block
        
        self.conv1 = spconv.SparseSequential(
            block(16, 16, 3, norm_fn=norm_fn, padding=1, indice_key='subm1'),
        )

        self.conv2 = spconv.SparseSequential(
            # [1600, 1408, 41] <- [800, 704, 21]
            block(16, 32, 3, norm_fn=norm_fn, stride=2, padding=1, indice_key='spconv2', conv_type='spconv'),
            block(32, 32, 3, norm_fn=norm_fn, padding=1, indice_key='subm2'),
            block(32, 32, 3, norm_fn=norm_fn, padding=1, indice_key='subm2'),
        )

        self.conv3 = spconv.SparseSequential(
            # [800, 704, 21] <- [400, 352, 11]
            block(32, 64, 3, norm_fn=norm_fn, stride=2, padding=1, indice_key='spconv3', conv_type='spconv'),
            block(64, 64, 3, norm_fn=norm_fn, padding=1, indice_key='subm3'),
            block(64, 64, 3, norm_fn=norm_fn, padding=1, indice_key='subm3'),
        )

        self.conv4 = spconv.SparseSequential(
            # [400, 352, 11] <- [200, 176, 5]
            block(64, 64, 3, norm_fn=norm_fn, stride=2, padding=(0, 1, 1), indice_key='spconv4', conv_type='spconv'),
            block(64, 64, 3, norm_fn=norm_fn, padding=1, indice_key='subm4'),
            block(64, 64, 3, norm_fn=norm_fn, padding=1, indice_key='subm4'),
        )

        last_pad = 0
        last_pad = self.model_cfg.get('last_pad', last_pad)
        self.conv_out = spconv.SparseSequential(
            # [200, 150, 5] -> [200, 150, 2]
            spconv.SparseConv3d(64, 128, (3, 1, 1), stride=(2, 1, 1), padding=last_pad,
                                bias=False, indice_key='spconv_down2'),
            norm_fn(128),
            nn.ReLU(),
        )
        self.num_point_features = 128

    def forward(self, batch_dict):
        """
        Args:
            batch_dict:
                batch_size: int
                vfe_features: (num_voxels, C)
                voxel_coords: (num_voxels, 4), [batch_idx, z_idx, y_idx, x_idx]
        Returns:
            batch_dict:
                encoded_spconv_tensor: sparse tensor
        """
        
        '''
        voxel_features (12000, 5): per-voxel mean features. voxel_coords (12000, 4): voxel coordinates.
        '''
        voxel_features, voxel_coords = batch_dict['voxel_features'], batch_dict['voxel_coords']
        batch_size = batch_dict['batch_size']
        
        # Build the sparse tensor (the voxelized representation):
        # voxel_features are indexed by voxel_coords; a batch index has been prepended, so each coordinate is
        # 4-dimensional: [batch_idx, z_idx, y_idx, x_idx].
        # The equivalent dense shape is [batch_size, channels, *sparse_shape], i.e.
        # [batch_size, channels, z_idx, y_idx, x_idx].
        input_sp_tensor = spconv.SparseConvTensor(
            features=voxel_features,
            indices=voxel_coords.int(),
            spatial_shape=self.sparse_shape,
            batch_size=batch_size
        )
        # One submanifold sparse convolution (SubMConv3d): features go from [12000, 5] to [12000, 16]
        x = self.conv_input(input_sp_tensor)
        
        x_conv1 = self.conv1(x)             #[12000,16]->[12000,16]
        x_conv2 = self.conv2(x_conv1)       #[12000,16]->[113293,32]
        x_conv3 = self.conv3(x_conv2)       #[113293,32]->[55250,64]
        x_conv4 = self.conv4(x_conv3)       #[55250,64]->[20650,64]

        # for detection head
        # [200, 176, 5] -> [200, 176, 2]
        out = self.conv_out(x_conv4)
        
        
        batch_dict.update({
            'encoded_spconv_tensor': out,
            'encoded_spconv_tensor_stride': 8
        })
        batch_dict.update({
            'multi_scale_3d_features': {
                'x_conv1': x_conv1,
                'x_conv2': x_conv2,
                'x_conv3': x_conv3,
                'x_conv4': x_conv4,
            }
        })

        return batch_dict

The above completes the 3D backbone; next comes the 2D backbone.

After the sparse convolutions, the 3D features of the point cloud have been extracted. Since the center detection is done in the bird's-eye view (BEV), the sparse voxel features must next be projected into BEV. Here the voxel features are stacked along the Z axis (i.e. flattened), which enlarges the receptive field along Z and simplifies the network's computation.

class HeightCompression(nn.Module):
    def __init__(self, model_cfg, **kwargs):
        super().__init__()
        self.model_cfg = model_cfg
        self.num_bev_features = self.model_cfg.NUM_BEV_FEATURES      # number of BEV feature channels after height compression

    def forward(self, batch_dict):
        """
        Args:
            batch_dict:
                encoded_spconv_tensor: sparse tensor
        Returns:
            batch_dict:
                spatial_features:

        """
        encoded_spconv_tensor = batch_dict['encoded_spconv_tensor']  # output of VoxelBackBone8x
        spatial_features = encoded_spconv_tensor.dense()             # convert the spconv sparse tensor into a dense voxel feature tensor (not a fully connected layer)
        N, C, D, H, W = spatial_features.shape                       # (batch_size, channels, depth, height, width)
        spatial_features = spatial_features.view(N, C * D, H, W)     # fold the depth axis into the channels -> 2D BEV feature map
        batch_dict['spatial_features'] = spatial_features
        batch_dict['spatial_features_stride'] = batch_dict['encoded_spconv_tensor_stride']
        return batch_dict

After the 3D features are flattened, feature extraction continues on the 2D BEV map in order to obtain richer features. In this stage the features are downsampled and then upsampled again, so that features at multiple scales are obtained.

'''
Performs further 2D feature extraction on spatial_features,
producing the final feature map in the BEV.
'''
class BaseBEVBackbone(nn.Module):
    def __init__(self, model_cfg, input_channels):
        super().__init__()
        self.model_cfg = model_cfg

        if self.model_cfg.get('LAYER_NUMS', None) is not None:
            assert len(self.model_cfg.LAYER_NUMS) == len(self.model_cfg.LAYER_STRIDES) == len(self.model_cfg.NUM_FILTERS)
            layer_nums = self.model_cfg.LAYER_NUMS
            layer_strides = self.model_cfg.LAYER_STRIDES
            num_filters = self.model_cfg.NUM_FILTERS
        else:
            layer_nums = layer_strides = num_filters = []

        if self.model_cfg.get('UPSAMPLE_STRIDES', None) is not None:
            assert len(self.model_cfg.UPSAMPLE_STRIDES) == len(self.model_cfg.NUM_UPSAMPLE_FILTERS)
            num_upsample_filters = self.model_cfg.NUM_UPSAMPLE_FILTERS
            upsample_strides = self.model_cfg.UPSAMPLE_STRIDES
        else:
            upsample_strides = num_upsample_filters = []

        num_levels = len(layer_nums)
        c_in_list = [input_channels, *num_filters[:-1]]
        self.blocks = nn.ModuleList()
        self.deblocks = nn.ModuleList()
        for idx in range(num_levels):
            cur_layers = [
                nn.ZeroPad2d(1),
                nn.Conv2d(
                    c_in_list[idx], num_filters[idx], kernel_size=3,
                    stride=layer_strides[idx], padding=0, bias=False
                ),
                nn.BatchNorm2d(num_filters[idx], eps=1e-3, momentum=0.01),
                nn.ReLU()
            ]
            for k in range(layer_nums[idx]):
                cur_layers.extend([
                    nn.Conv2d(num_filters[idx], num_filters[idx], kernel_size=3, padding=1, bias=False),
                    nn.BatchNorm2d(num_filters[idx], eps=1e-3, momentum=0.01),
                    nn.ReLU()
                ])
            self.blocks.append(nn.Sequential(*cur_layers))
            if len(upsample_strides) > 0:
                stride = upsample_strides[idx]
                if stride >= 1:
                    self.deblocks.append(nn.Sequential(
                        nn.ConvTranspose2d(
                            num_filters[idx], num_upsample_filters[idx],
                            upsample_strides[idx],
                            stride=upsample_strides[idx], bias=False
                        ),
                        nn.BatchNorm2d(num_upsample_filters[idx], eps=1e-3, momentum=0.01),
                        nn.ReLU()
                    ))
                else:
                    stride = np.round(1 / stride).astype(int)   # note: np.int is removed in recent NumPy
                    self.deblocks.append(nn.Sequential(
                        nn.Conv2d(
                            num_filters[idx], num_upsample_filters[idx],
                            stride,
                            stride=stride, bias=False
                        ),
                        nn.BatchNorm2d(num_upsample_filters[idx], eps=1e-3, momentum=0.01),
                        nn.ReLU()
                    ))

        c_in = sum(num_upsample_filters)
        if len(upsample_strides) > num_levels:
            self.deblocks.append(nn.Sequential(
                nn.ConvTranspose2d(c_in, c_in, upsample_strides[-1], stride=upsample_strides[-1], bias=False),
                nn.BatchNorm2d(c_in, eps=1e-3, momentum=0.01),
                nn.ReLU(),
            ))

        self.num_bev_features = c_in

    def forward(self, data_dict):
        """
        Args:
            data_dict:
                spatial_features
        Returns:
        """
        spatial_features = data_dict['spatial_features']
        ups = []
        ret_dict = {}
        x = spatial_features
        
        # Run each branch's convolution block.
        # CenterPoint uses two downsampling branches here; each branch's output is later upsampled back to a
        # common resolution and the results are concatenated.
        for i in range(len(self.blocks)):
            x = self.blocks[i](x)

            stride = int(spatial_features.shape[2] / x.shape[2])
            ret_dict['spatial_features_%dx' % stride] = x
            if len(self.deblocks) > 0:
                ups.append(self.deblocks[i](x))     # transposed convolution: upsample this branch's output
            else:
                ups.append(x)

        if len(ups) > 1:
            x = torch.cat(ups, dim=1)   # (batch, c, 128, 128) -> (batch, c*2, 128, 128)
        elif len(ups) == 1:
            x = ups[0]

        if len(self.deblocks) > len(self.blocks):
            x = self.deblocks[-1](x)

        data_dict['spatial_features_2d'] = x
        # return the BEV feature map
        return data_dict

First stage: Centers & 3D boxes

The heatmap here is generated in much the same way as in CenterNet; it is essentially a map-view target-encoding step. In CenterNet, an input image of size W*H*3 produces a heatmap of size \frac{W}{R}\times\frac{H}{R}\times K (where K is the number of detected classes). In CenterPoint, however, objects in a point cloud are sparse rather than densely packed; applying the CenterNet recipe directly would leave almost every location as background and make training inefficient. CenterPoint therefore enlarges the Gaussian peak rendered at each object center, setting the Gaussian radius to \sigma = \max(f(w, l), \tau), where \tau = 2 is the minimum allowed radius and f is the radius function defined in CornerNet. After this encoding, the head outputs the class heatmap, a sub-voxel center offset, the object size, the object orientation, and the object velocity.
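
To make the target construction concrete, below is a minimal sketch of how a Gaussian peak could be rendered onto the heatmap at an object's BEV center, with the radius clamped to the minimum \tau = 2. This follows the CornerNet/CenterNet recipe; the helper names and the placeholder radius function are mine, not the authors' code.

import numpy as np

def draw_gaussian(heatmap, center, radius):
    """Render a 2D Gaussian peak of the given radius onto heatmap (H, W) at center (x, y)."""
    diameter = 2 * radius + 1
    sigma = diameter / 6.0
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    gaussian = np.exp(-(x * x + y * y) / (2 * sigma * sigma))

    cx, cy = int(center[0]), int(center[1])
    h, w = heatmap.shape
    # Clip the Gaussian patch at the heatmap borders.
    left, right = min(cx, radius), min(w - cx, radius + 1)
    top, bottom = min(cy, radius), min(h - cy, radius + 1)
    patch = heatmap[cy - top:cy + bottom, cx - left:cx + right]
    g = gaussian[radius - top:radius + bottom, radius - left:radius + right]
    # Take the element-wise maximum so that overlapping objects do not erase each other's peaks.
    np.maximum(patch, g, out=patch)
    return heatmap

def gaussian_radius_for_box(w_grid, l_grid, tau=2, f=lambda w, l: 0.5 * min(w, l)):
    # f stands in for the CornerNet radius function; a simple placeholder is used here.
    return max(int(f(w_grid, l_grid)), tau)

The detection head's forward pass, as excerpted from the code, follows.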

def forward(self, data_dict):
        spatial_features_2d = data_dict['spatial_features_2d']          # BEV features -> (B, C, H, W)

        cls_preds = self.conv_cls(spatial_features_2d)                  # classification predictions -> (B, C, H, W)
        box_preds = self.conv_box(spatial_features_2d)                  # box-regression predictions -> (B, C, H, W)

        cls_preds = cls_preds.permute(0, 2, 3, 1).contiguous()          # [N, H, W, C]
        box_preds = box_preds.permute(0, 2, 3, 1).contiguous()          # [N, H, W, C]

        self.forward_ret_dict['cls_preds'] = cls_preds                  
        self.forward_ret_dict['box_preds'] = box_preds

        if self.training:
            targets_dict = self.assign_targets(
                gt_boxes=data_dict['gt_boxes']
            )
            self.forward_ret_dict.update(targets_dict)

        if not self.training or self.predict_boxes_when_training:
            batch_cls_preds, batch_box_preds = self.generate_predicted_boxes(
                batch_size=data_dict['batch_size'],
                cls_preds=cls_preds, box_preds=box_preds, dir_cls_preds=None
            )
            data_dict['batch_cls_preds'] = batch_cls_preds
            data_dict['batch_box_preds'] = batch_box_preds
            data_dict['cls_preds_normalized'] = False

        return data_dict

Second stage: Score & 3D boxes

The second stage is only used on the Waymo and nuScenes datasets, not on KITTI. In short, it performs one more round of feature extraction: additional point features of the first-stage detections are used to refine the estimates. The second stage takes the centers of the four side faces of each predicted box (only the four side faces are considered, because the centers of the top and bottom faces project onto the box center in the BEV). For each of these points, the paper extracts a feature from the backbone's output map by bilinear interpolation, and the concatenated point features are passed to an MLP that refines the box and predicts a confidence score.
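
Below is a minimal sketch of this feature-gathering step, assuming the BEV feature map and the face-center coordinates (already converted to BEV grid units) are available; torch.nn.functional.grid_sample performs the bilinear interpolation, and the function name is illustrative.

import torch
import torch.nn.functional as F

def sample_point_features(bev_features, points_xy):
    """Bilinearly interpolate BEV features at the given points.

    bev_features: (B, C, H, W) backbone output in the BEV
    points_xy:    (B, N, 2) point coordinates in BEV grid units (x in [0, W), y in [0, H))
    Returns:      (B, N, C) one feature vector per sampled point
    """
    B, C, H, W = bev_features.shape
    # grid_sample expects coordinates normalized to [-1, 1].
    norm = points_xy.clone()
    norm[..., 0] = norm[..., 0] / (W - 1) * 2 - 1
    norm[..., 1] = norm[..., 1] / (H - 1) * 2 - 1
    grid = norm.unsqueeze(2)                                           # (B, N, 1, 2)
    sampled = F.grid_sample(bev_features, grid, align_corners=True)    # (B, C, N, 1)
    return sampled.squeeze(-1).permute(0, 2, 1)                        # (B, N, C)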

Loss

For the class-agnostic confidence prediction of the second stage, the loss is computed from the predicted score and the box's IoU with its ground-truth box. In the paper the score target is I = \min(1, \max(0, 2 \cdot IoU_t - 0.5)), where IoU_t is the IoU between the t-th proposal and the ground truth.

During training this score is supervised with a binary cross-entropy loss, L_{score} = -I_t \log \hat{I}_t - (1 - I_t) \log(1 - \hat{I}_t).
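
A minimal sketch of this confidence target and loss, written directly from the formulas above (the variable names are mine):

import torch.nn.functional as F

def second_stage_score_loss(pred_score, iou_with_gt):
    """pred_score: (N,) predicted confidence in (0, 1); iou_with_gt: (N,) 3D IoU with the matched GT box."""
    # Target from the paper: I = min(1, max(0, 2 * IoU - 0.5)), i.e. the IoU rescaled into [0, 1].
    target = (2.0 * iou_with_gt - 0.5).clamp(min=0.0, max=1.0)
    # Supervised with a binary cross-entropy loss.
    return F.binary_cross_entropy(pred_score, target)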

Experiments

The author mainly evaluates on the Waymo and nuScenes datasets. The Waymo results are as follows:

nuScenes results:

For KITTI, I trained with centerpoint.yaml, i.e. the VoxelNet backbone and no second-stage refinement (the author notes that the second stage does not bring clear gains on KITTI, so it is not used). The results are as follows:

2022-08-19 00:19:20,746   INFO  *************** EPOCH 80 EVALUATION *****************
eval: 100% 1257/1257 [02:45<00:00,  7.61it/s, recall_0.3=(0, 16574) / 17558]
2022-08-19 00:22:05,838   INFO  *************** Performance of EPOCH 80 *****************
2022-08-19 00:22:05,838   INFO  Generate label finished(sec_per_example: 0.0438 second).
2022-08-19 00:22:05,838   INFO  recall_roi_0.3: 0.000000
2022-08-19 00:22:05,839   INFO  recall_rcnn_0.3: 0.943957
2022-08-19 00:22:05,839   INFO  recall_roi_0.5: 0.000000
2022-08-19 00:22:05,839   INFO  recall_rcnn_0.5: 0.884326
2022-08-19 00:22:05,839   INFO  recall_roi_0.7: 0.000000
2022-08-19 00:22:05,839   INFO  recall_rcnn_0.7: 0.650871
2022-08-19 00:22:05,842   INFO  Average predicted number of objects(3769 samples): 14.189
2022-08-19 00:22:25,293   INFO  Car AP@0.70, 0.70, 0.70:
bbox AP:95.1977, 89.6228, 88.9443
bev  AP:89.5762, 87.3362, 84.9389
3d   AP:87.8885, 78.1958, 76.9271
aos  AP:95.15, 89.50, 88.76
Car AP_R40@0.70, 0.70, 0.70:
bbox AP:97.6645, 93.8115, 91.6160
bev  AP:92.1521, 88.1394, 87.0467
3d   AP:89.8526, 80.9555, 76.7696
aos  AP:97.62, 93.66, 91.41
Car AP@0.70, 0.50, 0.50:
bbox AP:95.1977, 89.6228, 88.9443
bev  AP:95.2045, 89.6821, 89.1962
3d   AP:95.1455, 89.6365, 89.1102
aos  AP:95.15, 89.50, 88.76
Car AP_R40@0.70, 0.50, 0.50:
bbox AP:97.6645, 93.8115, 91.6160
bev  AP:97.6538, 94.5334, 93.8886
3d   AP:97.5949, 94.4241, 93.6071
aos  AP:97.62, 93.66, 91.41
Pedestrian AP@0.50, 0.50, 0.50:
bbox AP:72.8755, 69.8601, 67.5803
bev  AP:59.0602, 56.1858, 53.7814
3d   AP:55.9983, 52.7308, 48.4896
aos  AP:70.86, 67.22, 64.51
Pedestrian AP_R40@0.50, 0.50, 0.50:
bbox AP:72.8309, 70.8996, 67.8249
bev  AP:58.7370, 56.0706, 52.6835
3d   AP:54.1025, 51.4533, 47.5276
aos  AP:70.72, 67.81, 64.29
Pedestrian AP@0.50, 0.25, 0.25:
bbox AP:72.8755, 69.8601, 67.5803
bev  AP:76.5677, 76.0058, 73.3515
3d   AP:76.2825, 75.5901, 72.9315
aos  AP:70.86, 67.22, 64.51
Pedestrian AP_R40@0.50, 0.25, 0.25:
bbox AP:72.8309, 70.8996, 67.8249
bev  AP:78.5230, 77.4655, 74.4151
3d   AP:78.1941, 76.9804, 73.7875
aos  AP:70.72, 67.81, 64.29
Cyclist AP@0.50, 0.50, 0.50:
bbox AP:88.2125, 79.4905, 76.0368
bev  AP:86.1723, 70.7543, 67.7885
3d   AP:77.0715, 64.8973, 60.6476
aos  AP:88.06, 79.12, 75.66
Cyclist AP_R40@0.50, 0.50, 0.50:
bbox AP:92.1393, 80.4166, 77.3878
bev  AP:86.9190, 71.4113, 67.6574
3d   AP:80.4734, 64.4435, 60.6898
aos  AP:91.93, 80.02, 76.99
Cyclist AP@0.50, 0.25, 0.25:
bbox AP:88.2125, 79.4905, 76.0368
bev  AP:88.7953, 75.6457, 72.2460
3d   AP:88.7953, 75.6457, 72.2460
aos  AP:88.06, 79.12, 75.66
Cyclist AP_R40@0.50, 0.25, 0.25:
bbox AP:92.1393, 80.4166, 77.3878
bev  AP:90.0662, 76.6310, 73.2075
3d   AP:90.0662, 76.6310, 73.2075
aos  AP:91.93, 80.02, 76.99

 
