I. The Challenge of Open-Vocabulary Semantic Segmentation
Conventional open-vocabulary semantic segmentation methods typically follow a two-stage pipeline: a powerful class-agnostic segmentation model (such as SAM) first generates region proposals, which are then classified by a vision-language model such as CLIP. This approach suffers from two major drawbacks:
- Low computational efficiency: the two-stage pipeline requires multiple forward passes and consumes substantial GPU memory
- Domain shift: cropped regions contain parts of other objects, which degrades classification accuracy
The conventional two-stage architecture shown in Figure 1(a) produces high-quality segmentation proposals but has a clear efficiency bottleneck. Recent single-stage methods (Figure 1(b)) improve efficiency by directly modeling image-text correlation, yet still fall short when reconstructing local details.
Figure 1. (a) A model structure that generates proposal masks using a mask generation model. (b) A model structure that refines the correlation between image and text. (c) The structure of the proposed ESC-Net. Our ESC-Net efficiently models the relationship between images and text by combining a pre-trained SAM block with pseudo prompts instead of an inefficient mask generation model. This approach enables much denser mask prediction compared to conventional correlation-based methods.
II. Core Principles of ESC-Net
ESC-Net achieves efficient and accurate open-vocabulary segmentation by combining CLIP's global features with SAM's local modeling capability. Its core architecture is as follows:
1. Overall Architecture
ESC-Net consists of four main modules:
- CLIP encoders: extract global image and text features
- Pseudo Prompt Generator (PPG): generates class-specific prompts from the image-text correlation map
- Sequence of SAM blocks: performs spatial aggregation via pre-trained SAM decoder blocks
- VLF fusion module: fuses multimodal features into the final segmentation map
Figure 2. The proposed ESC-Net consists of the CLIP vision and language encoders, N consecutive ESC-Blocks, and a decoder. Each ESC-Block generates a pseudo prompt from the image-text correlation map and uses it as input to the SAM block. The SAM block aggregates the CLIP image features. The VLF block models the image-text correlation using image features and text features, refining the correlation map through this process.
2. Key Equations
(1) Image-Text Correlation Computation
$$C_{v\&l}^{n}(i) = \frac{F_v(i) \cdot F_l^{n}}{\lVert F_v(i) \rVert \, \lVert F_l^{n} \rVert}$$
The correlation map for each text class is computed as the cosine similarity between the image feature at each position $i$ and the class text embedding, yielding an initial object-probability mask per class.
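As a concrete illustration, the per-class correlation maps can be computed in a few lines of PyTorch. The dimensions below (196 patches, 512-dim embeddings, 10 classes) are assumptions for this sketch, not values from the paper:

```python
import torch
import torch.nn.functional as F

def correlation_maps(img_feat, txt_feat):
    """Cosine similarity between every patch feature F_v(i) and every class
    text embedding F_l^n, i.e. the correlation map C^n_{v&l}(i).

    img_feat: (HW, D) patch-level CLIP image features
    txt_feat: (Nc, D) per-class CLIP text embeddings
    returns:  (Nc, HW) one correlation map per class
    """
    img = F.normalize(img_feat, dim=-1)  # divide by ||F_v(i)||
    txt = F.normalize(txt_feat, dim=-1)  # divide by ||F_l^n||
    return txt @ img.T                   # dot products of unit vectors

corr = correlation_maps(torch.randn(196, 512), torch.randn(10, 512))
```

Each row of `corr` can then be reshaped onto the patch grid to serve as that class's initial object-probability mask.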
(2) SAM Block Attention Mechanism
$$(F_v^{n})' = \mathrm{BCA}(\mathrm{SA}(F_l^{n}),\, F_v), \qquad F_v' = \mathrm{Conv}([F_v^{0};\, F_v^{1};\, \dots;\, F_v^{N_c}])$$
Self-attention (SA) and bidirectional cross-attention (BCA) aggregate global context effectively: a class-specific feature map $(F_v^{n})'$ is produced for each class, and all $N_c$ maps are concatenated and merged by a convolution.
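The data flow of these equations can be sketched as below. This is a simplified illustration, not the actual pre-trained SAM decoder block: the toy dimensions, head counts, and the use of one direction of the cross-attention are all assumptions.

```python
import torch
import torch.nn as nn

class SAMBlockSketch(nn.Module):
    """Illustrative sketch: self-attention on per-class text tokens, cross-
    attention from image features to those tokens, then a conv merging the
    Nc class-specific maps, mirroring (F_v^n)' = BCA(SA(F_l^n), F_v) and
    F_v' = Conv([F_v^0; ...; F_v^{Nc}]). Layer sizes are assumptions."""
    def __init__(self, dim=64, num_classes=3):
        super().__init__()
        self.sa = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.ca = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.merge = nn.Conv2d(dim * num_classes, dim, 1)

    def forward(self, txt, img, hw):
        # txt: (Nc, T, dim) per-class prompt tokens; img: (HW, dim)
        outs = []
        for n in range(txt.shape[0]):
            t = txt[n:n+1]                  # tokens of class n: (1, T, dim)
            t, _ = self.sa(t, t, t)         # SA(F_l^n)
            q = img.unsqueeze(0)            # (1, HW, dim)
            v, _ = self.ca(q, t, t)         # image attends to class tokens
            outs.append(v.squeeze(0))       # (F_v^n)': (HW, dim)
        h, w = hw
        x = torch.cat([o.T.reshape(-1, h, w) for o in outs], dim=0)
        return self.merge(x.unsqueeze(0))   # Conv over concatenated maps

blk = SAMBlockSketch()
feat = blk(torch.randn(3, 5, 64), torch.randn(16, 64), (4, 4))
```

The real model reuses SAM's pre-trained two-way attention rather than training these layers from scratch; the sketch only mirrors the SA, cross-attention, Conv-merge data flow.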
III. Key Innovations in Detail
1. Pseudo Prompt Generator (PPG)
As shown in Figure 3, PPG generates class-specific prompts through the following steps:
- Softmax-normalize the correlation map
- Binarize it into a preliminary object mask
- Separate overlapping objects with k-means clustering
- Extract pseudo points and pseudo masks from high-probability regions
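The steps above can be sketched for a single class as follows. The threshold, cluster count, and the naive k-means loop are illustrative assumptions; the paper's PPG runs these operations batched over all classes.

```python
import torch

def pseudo_prompts(corr, h, w, cls=0, thresh=0.5, k=2, iters=10):
    """Sketch of the PPG steps for one class: softmax-normalize the
    correlation map over classes, binarize it, split foreground pixels with
    a tiny k-means, and return the pseudo mask plus one pseudo point per
    cluster. thresh/k/iters are assumed values."""
    prob = corr.softmax(dim=0)[cls].reshape(h, w)   # softmax over classes
    mask = prob > thresh                            # preliminary object mask
    ys, xs = torch.nonzero(mask, as_tuple=True)
    pts = torch.stack([xs, ys], dim=1).float()      # (M, 2) foreground coords
    if pts.shape[0] < k:
        return mask, pts
    centers = pts[torch.randperm(pts.shape[0])[:k]]  # k-means init
    for _ in range(iters):
        assign = torch.cdist(pts, centers).argmin(dim=1)
        for j in range(k):
            sel = pts[assign == j]
            if len(sel):
                centers[j] = sel.mean(dim=0)
    return mask, centers    # pseudo mask + pseudo points

# Two-class toy map on a 4x4 grid: class 0 fires on two separate 2x2 blobs.
corr = torch.zeros(2, 16)
corr[0, torch.tensor([0, 1, 4, 5, 10, 11, 14, 15])] = 5.0
mask, points = pseudo_prompts(corr, 4, 4)
```

With two disjoint blobs, the k-means step places one pseudo point per blob, which is how overlapping objects of the same class get separate prompts.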
Figure 3. The process of the proposed Pseudo Prompt Generator (PPG). PPG aims to generate class-specific pseudo prompts from image-text correlation maps. For efficiency, all processes are computed in batch-wise parallelization over all classes.
2. Vision-Language Fusion (VLF) Module
The VLF module strengthens multimodal feature fusion through the following steps:
- Concatenate the correlation map with image features and feed the result into a Transformer block
- Use a linear Transformer to capture inter-class relationships
- Produce a multimodally refined correlation map
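A minimal sketch of this image-then-class guidance flow is given below. All layer widths are assumptions, and a standard `nn.TransformerEncoderLayer` stands in for the paper's linear Transformer:

```python
import torch
import torch.nn as nn

class VLFSketch(nn.Module):
    """Sketch of VLF refinement: the per-class correlation map is fused with
    image features and passed through a spatial Transformer (image guidance),
    then attention across the class axis models inter-class relations (text
    guidance). Layer sizes are assumed, not taken from the paper."""
    def __init__(self, dim=32):
        super().__init__()
        self.proj = nn.Linear(dim + 1, dim)   # correlation value + image feature
        self.img_guid = nn.TransformerEncoderLayer(dim, 4, dim * 2, batch_first=True)
        self.cls_guid = nn.TransformerEncoderLayer(dim, 4, dim * 2, batch_first=True)
        self.out = nn.Linear(dim, 1)

    def forward(self, corr, img_feat):
        # corr: (Nc, HW) correlation maps; img_feat: (HW, dim)
        nc = corr.shape[0]
        x = torch.cat([corr.unsqueeze(-1),
                       img_feat.unsqueeze(0).expand(nc, -1, -1)], dim=-1)
        x = self.proj(x)                       # (Nc, HW, dim)
        x = self.img_guid(x)                   # attend over spatial tokens per class
        x = self.cls_guid(x.transpose(0, 1))   # attend over classes per pixel
        return self.out(x).squeeze(-1).T       # refined (Nc, HW) correlation map

vlf = VLFSketch()
refined = vlf(torch.randn(3, 16), torch.randn(16, 32))
```

The refined map has the same shape as the input correlation map, so the block can be applied repeatedly inside each ESC-Block.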
Figure 4. The structure of the proposed Vision-Language Fusion (VLF) block. VLF sequentially applies image and text guidance to the correlation map to refine it.
IV. Experimental Results
1. Quantitative Analysis
Performance on benchmark datasets including ADE20K and PASCAL-VOC:

| Dataset | mIoU (ESC-Net) | Improvement |
|---|---|---|
| ADE20K | 59.0 | +3.8 |
| PC-459 | 27.0 | +2.1 |
| PAS-20b | 86.3 | +2.8 |
Figure 5. Qualitative comparison of CAT-Seg and our ESC-Net across various datasets. Our model is capable of generating more accurate and robust masks compared to the existing correlation-based state-of-the-art method.
2. Qualitative Analysis
Segmentation comparison in complex scenes:
Figure 6. Visualization of image-text correlation maps with and without the SAM block. We visualize the model activation maps for the "Person" class for each ESC-Block. The proposed SAM-based method enables more accurate and dense object localization compared to the baseline.
V. Implementation Notes

```python
import torch.nn as nn


class ESCNet(nn.Module):
    def __init__(self, clip_encoder, ppg_generator, sam_blocks, vlf_module):
        super().__init__()
        self.clip_encoder = clip_encoder             # CLIP image/text encoder
        self.ppg_generator = ppg_generator           # pseudo prompt generator (PPG)
        self.sam_blocks = nn.ModuleList(sam_blocks)  # pre-trained SAM blocks
        self.vlf = vlf_module                        # vision-language fusion module
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(512, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
        )

    def forward(self, image, text):
        # CLIP feature extraction
        img_feat = self.clip_encoder.encode_image(image)
        txt_feat = self.clip_encoder.encode_text(text)
        # Generate pseudo prompts from the image-text correlation
        pseudo_prompts = self.ppg_generator(img_feat, txt_feat)
        # Spatial aggregation through the pre-trained SAM blocks
        refined_feat = img_feat
        for block in self.sam_blocks:
            refined_feat = block(refined_feat, pseudo_prompts)
        # Multimodal fusion refines the correlation map
        corr_map = self.vlf(refined_feat, txt_feat)
        # Decode into the final segmentation map
        mask = self.decoder(corr_map)
        return mask
```
VI. Conclusion and Outlook
By combining SAM's spatial aggregation capability with CLIP's global features, ESC-Net achieves efficient and accurate open-vocabulary segmentation. Future work could explore:
- More efficient pseudo prompt generation strategies
- Dynamically adjusting the number of SAM blocks to match scene complexity
- Incorporating higher-resolution feature maps to improve boundary localization
The model offers a new technical direction for open-vocabulary segmentation and holds promise for applications such as medical image analysis and autonomous driving.