[CVPR 2025] ESC-Net: An Open-Vocabulary Semantic Segmentation Model

Published: 2025-07-06

 

Table of Contents

I. Challenges of Open-Vocabulary Semantic Segmentation

II. Core Principles of ESC-Net

1. Overall Architecture

2. Key Formulas

(1) Image-Text Correlation Computation

(2) SAM Block Attention Mechanism

III. Key Innovations in Detail

1. Pseudo Prompt Generator (PPG)

2. Vision-Language Fusion (VLF) Module

IV. Experimental Results

1. Quantitative Analysis

2. Qualitative Analysis

V. Key Implementation Points

VI. Summary and Outlook


I. Challenges of Open-Vocabulary Semantic Segmentation

Traditional open-vocabulary semantic segmentation methods typically follow a two-stage pipeline: a powerful class-agnostic segmentation model (e.g., SAM) first generates region proposals, which are then classified by a vision-language model such as CLIP. This approach, however, suffers from two major drawbacks:

  1. Low computational efficiency: the two-stage pipeline requires multiple forward passes and has a high GPU memory footprint
  2. Domain shift: cropped regions often contain parts of other objects, which degrades classification accuracy

As the conventional two-stage architecture in Figure 1(a) shows, such pipelines can produce high-quality segmentation proposals but face a clear efficiency bottleneck. Recent single-stage methods (Figure 1(b)) improve efficiency by directly modeling image-text correlations, yet they still fall short in reconstructing local details.

Figure 1. (a) A model structure that generates proposal masks using a mask generation model. (b) A model structure that refines the correlation between image and text. (c) The structure of the proposed ESC-Net. Our ESC-Net efficiently models the relationship between images and text by combining a pre-trained SAM block with pseudo prompts instead of an inefficient mask generation model. This approach enables much denser mask prediction compared to conventional correlation-based methods.

II. Core Principles of ESC-Net

ESC-Net achieves efficient and accurate open-vocabulary segmentation by combining CLIP's global features with SAM's local modeling capability. Its core architecture is as follows:

1. Overall Architecture

ESC-Net consists of four main modules:

  • CLIP encoders: extract global image and text features
  • Pseudo Prompt Generator (PPG): generates class-specific prompts from the image-text correlation map
  • SAM block sequence: performs spatial aggregation through pre-trained SAM decoder blocks
  • VLF fusion module: fuses multimodal features to produce the final segmentation map

Figure 2. The proposed ESC-Net consists of the CLIP vision and language encoders, N consecutive ESC-Blocks, and a decoder. Each ESC-Block generates a pseudo prompt from the image-text correlation map and uses it as input to the SAM block. The SAM block aggregates the CLIP image features. The VLF block models the image-text correlation using image features and text features, refining the correlation map through this process.

2. Key Formulas

(1) Image-Text Correlation Computation

$$C^{n}_{v\&l}(i) = \frac{F_v(i)\cdot F^{n}_{l}}{\lVert F_v(i)\rVert\,\lVert F^{n}_{l}\rVert}$$

The cosine similarity between each image feature $F_v(i)$ and each class text embedding $F_l^n$ yields a per-class image correlation map, which serves as the initial object-probability mask.
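The cosine-similarity computation above can be sketched in PyTorch. The tensor shapes (196 image tokens, 512-dim embeddings, 5 classes) are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

def correlation_map(img_feat, txt_feat):
    """Cosine similarity between every image token and every class embedding.

    img_feat: (HW, D) CLIP image tokens; txt_feat: (Nc, D) class text embeddings.
    Returns a (Nc, HW) correlation map, one spatial map per class.
    """
    img_n = F.normalize(img_feat, dim=-1)   # divide by ||F_v(i)||
    txt_n = F.normalize(txt_feat, dim=-1)   # divide by ||F_l^n||
    return txt_n @ img_n.T                  # dot product of unit vectors = cosine

img_feat = torch.randn(196, 512)  # e.g. a 14x14 token grid, 512-dim
txt_feat = torch.randn(5, 512)    # 5 candidate class prompts
corr = correlation_map(img_feat, txt_feat)
print(corr.shape)  # torch.Size([5, 196])
```

Because both factors are unit-normalized, every entry of `corr` lies in [-1, 1], so the map can be thresholded directly into a coarse object-probability mask.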

(2) SAM Block Attention Mechanism

$$(F_v^n)' = \mathrm{BCA}\big(\mathrm{SA}(F_l^n),\, F_v\big), \qquad F_v' = \mathrm{Conv}\big([F_v^0;\, F_v^1;\, \ldots;\, F_v^{N_c}]\big)$$

Self-attention (SA) over the class-specific prompt features and bidirectional cross-attention (BCA) between prompts and image features aggregate global context effectively; the per-class outputs are then concatenated and projected by a convolution.
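The SA + BCA pattern can be illustrated with standard attention layers. This is a deliberately simplified sketch of the idea, not the real SAM decoder block: the MLPs, layer norms, and positional encodings are omitted, and the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class SAMStyleAggregation(nn.Module):
    """Sketch of the SA + BCA pattern: self-attention over class-specific
    prompt tokens, then bidirectional cross-attention so prompts and image
    tokens exchange information (hypothetical simplification)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.sa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.p2i = nn.MultiheadAttention(dim, heads, batch_first=True)  # prompts attend to image
        self.i2p = nn.MultiheadAttention(dim, heads, batch_first=True)  # image attends to prompts

    def forward(self, prompts, img_tokens):
        p, _ = self.sa(prompts, prompts, prompts)   # SA(F_l^n)
        p, _ = self.p2i(p, img_tokens, img_tokens)  # first BCA direction
        v, _ = self.i2p(img_tokens, p, p)           # second BCA direction
        return v                                    # aggregated image features (F_v^n)'

blk = SAMStyleAggregation(64)
out = blk(torch.randn(1, 4, 64), torch.randn(1, 196, 64))
print(out.shape)  # torch.Size([1, 196, 64])
```

In ESC-Net this runs once per class, and the per-class outputs are concatenated along the channel axis before the convolutional projection, matching the second formula above.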

III. Key Innovations in Detail

1. Pseudo Prompt Generator (PPG)

As shown in Figure 3, PPG generates class-specific prompts through the following steps:

  1. Normalize the correlation map with softmax
  2. Binarize it to obtain a preliminary object mask
  3. Separate overlapping objects with k-means clustering
  4. Extract pseudo points and pseudo masks from high-probability regions
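The steps above can be sketched for a single class map. This is a hypothetical simplification: the paper computes all classes batch-parallel, and a sigmoid stands in here for the softmax normalization; the function name and thresholds are illustrative:

```python
import torch

def pseudo_prompts(corr, k=2, thresh=0.5, iters=5):
    """Minimal sketch of the PPG steps for one class map.

    corr: (H, W) correlation map. Returns a binary pseudo mask and up to k
    pseudo point prompts found by k-means over foreground coordinates."""
    prob = torch.sigmoid(corr)            # 1. squash to (0, 1) (stand-in for softmax)
    mask = prob > thresh                  # 2. binarize into a coarse object mask
    coords = mask.nonzero().float()       # (M, 2) foreground pixel coordinates
    if coords.shape[0] < k:
        return mask, coords
    centers = coords[torch.randperm(coords.shape[0])[:k]]  # 3. k-means init
    for _ in range(iters):
        assign = torch.cdist(coords, centers).argmin(dim=1)
        for j in range(k):                # move each center to its cluster mean
            sel = coords[assign == j]
            if sel.shape[0] > 0:
                centers[j] = sel.mean(dim=0)
    return mask, centers                  # 4. centers act as pseudo point prompts

corr = torch.zeros(16, 16)
corr[2:5, 2:5] = 5.0                      # two synthetic "objects"
corr[10:13, 10:13] = 5.0
mask, pts = pseudo_prompts(corr)
print(mask.sum().item(), pts.shape)
```

The k-means step is what lets PPG hand SAM one point per object instance when several objects of the same class overlap in the correlation map.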

Figure 3. The process of the proposed Pseudo Prompt Generator (PPG). PPG aims to generate class-specific pseudo prompts from image-text correlation maps. For efficiency, all processes are computed in batch-wise parallelization over all classes.

2. Vision-Language Fusion (VLF) Module

The VLF module strengthens multimodal feature fusion through the following steps:

  1. Concatenate the correlation map with the image features and feed them into a Transformer block
  2. Capture inter-class relationships with a linear Transformer
  3. Produce a multimodally refined correlation map
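The fusion steps can be sketched as follows. All layer sizes are assumptions, and standard multi-head attention is substituted for the linear Transformer the paper uses for inter-class modeling:

```python
import torch
import torch.nn as nn

class VLFSketch(nn.Module):
    """Hypothetical sketch of the VLF idea: fuse class correlation maps with
    image features, then model inter-class relations with attention."""
    def __init__(self, nc, dim=64):
        super().__init__()
        self.proj = nn.Conv2d(nc + dim, nc, 1)  # 1. fuse corr maps + image feats
        self.embed = nn.Linear(1, dim)          # lift per-class pooled scalars
        self.cls_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.out = nn.Linear(dim, 1)

    def forward(self, corr, img_feat):
        # corr: (B, Nc, H, W); img_feat: (B, D, H, W)
        x = self.proj(torch.cat([corr, img_feat], dim=1))         # refined per-class maps
        b, nc, h, w = x.shape
        tokens = self.embed(x.flatten(2).mean(-1, keepdim=True))  # (B, Nc, dim)
        tokens, _ = self.cls_attn(tokens, tokens, tokens)         # 2. inter-class relations
        gate = self.out(tokens).sigmoid().view(b, nc, 1, 1)
        return x * gate                                           # 3. refined correlation map

vlf = VLFSketch(nc=5, dim=64)
out = vlf(torch.randn(2, 5, 14, 14), torch.randn(2, 64, 14, 14))
print(out.shape)  # torch.Size([2, 5, 14, 14])
```

The output keeps the correlation map's shape (one channel per class), so it can be decoded directly into per-class segmentation logits.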

Figure 4. The structure of the proposed Vision-Language Fusion (VLF) block. VLF sequentially applies image and text guidance to the correlation map to refine it.

IV. Experimental Results

1. Quantitative Analysis

Results on benchmark datasets such as ADE20K and PASCAL-VOC:

| Dataset | mIoU (ESC-Net) | Improvement |
|---------|----------------|-------------|
| ADE20K  | 59.0           | +3.8        |
| PC-459  | 27.0           | +2.1        |
| PAS-20b | 86.3           | +2.8        |

Figure 5. Qualitative comparison of CAT-Seg and our ESC-Net across various datasets. Our model is capable of generating more accurate and robust masks compared to the existing correlation-based state-of-the-art method.

2. Qualitative Analysis

Segmentation comparison in complex scenes:

Figure 6. Visualization of image-text correlation maps with and without the SAM block. We visualize the model activation maps for the "Person" class for each ESC-Block. The proposed SAM-based method enables more accurate and dense object localization compared to the baseline.

V. Key Implementation Points

```python
import torch
import torch.nn as nn


class ESCNet(nn.Module):
    def __init__(self, clip_encoder, ppg_generator, sam_blocks, vlf_module):
        super().__init__()
        self.clip_encoder = clip_encoder             # CLIP image/text encoder
        self.ppg_generator = ppg_generator           # pseudo prompt generator (PPG)
        self.sam_blocks = nn.ModuleList(sam_blocks)  # pre-trained SAM blocks
        self.vlf = vlf_module                        # vision-language fusion module
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(512, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU()
        )

    def forward(self, image, text):
        # CLIP feature extraction
        img_feat = self.clip_encoder.encode_image(image)
        txt_feat = self.clip_encoder.encode_text(text)

        # Generate pseudo prompts from the image-text correlation
        pseudo_prompts = self.ppg_generator(img_feat, txt_feat)

        # Spatial aggregation through the pre-trained SAM blocks
        refined_feat = img_feat
        for block in self.sam_blocks:
            refined_feat = block(refined_feat, pseudo_prompts)

        # Multimodal fusion into a refined correlation map
        corr_map = self.vlf(refined_feat, txt_feat)

        # Decode into the final segmentation map
        mask = self.decoder(corr_map)
        return mask
```

VI. Summary and Outlook

By combining SAM's spatial aggregation capability with CLIP's global features, ESC-Net achieves efficient and accurate open-vocabulary segmentation. Future work could explore:

  1. More efficient pseudo-prompt generation strategies
  2. Dynamically adjusting the number of SAM blocks to match scene complexity
  3. Using higher-resolution feature maps to improve boundary localization

The model offers a new technical direction for open-vocabulary segmentation and holds promise for applications such as medical image analysis and autonomous driving.

Paper: https://openaccess.thecvf.com/content/CVPR2025/papers/Lee_Effective_SAM_Combination_for_Open-Vocabulary_Semantic_Segmentation_CVPR_2025_paper.pdf