I. The Challenge of Open-Vocabulary Semantic Segmentation
Conventional open-vocabulary semantic segmentation methods typically follow a two-stage pipeline: a powerful class-agnostic segmentation model (such as SAM) first generates region proposals, which are then classified by a vision-language model such as CLIP. This approach suffers from two major drawbacks:
- Low computational efficiency: the two-stage pipeline requires multiple forward passes and consumes substantial GPU memory
- Domain shift: cropped regions contain parts of other objects, which degrades classification accuracy
The conventional two-stage architecture shown in Figure 1(a) produces high-quality segmentation proposals but has a clear efficiency bottleneck. Recent single-stage methods (Figure 1(b)) improve efficiency by directly modeling image-text correlation, yet still fall short when reconstructing local details.
Figure 1. (a) A model structure that generates proposal masks using a mask generation model. (b) A model structure that refines the correlation between image and text. (c) The structure of the proposed ESC-Net. Our ESC-Net efficiently models the relationship between images and text by combining a pre-trained SAM block with pseudo prompts instead of an inefficient mask generation model. This approach enables much denser mask prediction compared to conventional correlation-based methods.
II. Core Principles of ESC-Net
ESC-Net achieves efficient and accurate open-vocabulary segmentation by combining CLIP's global features with SAM's local modeling capability. Its core architecture is as follows:
1. Overall Architecture
ESC-Net consists of four main modules:
- CLIP encoders: extract global image and text features
- Pseudo Prompt Generator (PPG): generates class-specific prompts from the image-text correlation map
- Sequence of SAM blocks: performs spatial aggregation via pre-trained SAM decoder blocks
- VLF fusion module: fuses multimodal features into the final segmentation map
Figure 2. The proposed ESC-Net consists of the CLIP vision and language encoders, N consecutive ESC-Blocks, and a decoder. Each ESC-Block generates a pseudo prompt from the image-text correlation map and uses it as input to the SAM block. The SAM block aggregates the CLIP image features. The VLF block models the image-text correlation using image features and text features, refining the correlation map through this process.
2. Key Equations
(1) Image-Text Correlation Computation
$$C_{v\&l}^{n}(i) = \frac{F_v(i) \cdot F_l^{n}}{\lVert F_v(i) \rVert \, \lVert F_l^{n} \rVert}$$
The correlation map for each text class is computed as the cosine similarity between the image feature at each position $i$ and the class text embedding, yielding an initial object-probability mask per class.
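As a concrete illustration, the per-class correlation maps can be computed in a few lines of PyTorch. The dimensions below (196 patches, 512-dim embeddings, 10 classes) are assumptions for this sketch, not values from the paper:

```python
import torch
import torch.nn.functional as F

def correlation_maps(img_feat, txt_feat):
    """Cosine similarity between every patch feature F_v(i) and every class
    text embedding F_l^n, i.e. the correlation map C^n_{v&l}(i).

    img_feat: (HW, D) patch-level CLIP image features
    txt_feat: (Nc, D) per-class CLIP text embeddings
    returns:  (Nc, HW) one correlation map per class
    """
    img = F.normalize(img_feat, dim=-1)  # divide by ||F_v(i)||
    txt = F.normalize(txt_feat, dim=-1)  # divide by ||F_l^n||
    return txt @ img.T                   # dot products of unit vectors

corr = correlation_maps(torch.randn(196, 512), torch.randn(10, 512))
```

Each row of `corr` can then be reshaped onto the patch grid to serve as that class's initial object-probability mask.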
(2) SAM Block Attention Mechanism
$$(F_v^{n})' = \mathrm{BCA}(\mathrm{SA}(F_l^{n}),\, F_v), \qquad F_v' = \mathrm{Conv}([F_v^{0};\, F_v^{1};\, \dots;\, F_v^{N_c}])$$
Self-attention (SA) and bidirectional cross-attention (BCA) aggregate global context effectively: a class-specific feature map $(F_v^{n})'$ is produced for each class, and all $N_c$ maps are concatenated and merged by a convolution.
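The data flow of these equations can be sketched as below. This is a simplified illustration, not the actual pre-trained SAM decoder block: the toy dimensions, head counts, and the use of one direction of the cross-attention are all assumptions.

```python
import torch
import torch.nn as nn

class SAMBlockSketch(nn.Module):
    """Illustrative sketch: self-attention on per-class text tokens, cross-
    attention from image features to those tokens, then a conv merging the
    Nc class-specific maps, mirroring (F_v^n)' = BCA(SA(F_l^n), F_v) and
    F_v' = Conv([F_v^0; ...; F_v^{Nc}]). Layer sizes are assumptions."""
    def __init__(self, dim=64, num_classes=3):
        super().__init__()
        self.sa = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.ca = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.merge = nn.Conv2d(dim * num_classes, dim, 1)

    def forward(self, txt, img, hw):
        # txt: (Nc, T, dim) per-class prompt tokens; img: (HW, dim)
        outs = []
        for n in range(txt.shape[0]):
            t = txt[n:n+1]                  # tokens of class n: (1, T, dim)
            t, _ = self.sa(t, t, t)         # SA(F_l^n)
            q = img.unsqueeze(0)            # (1, HW, dim)
            v, _ = self.ca(q, t, t)         # image attends to class tokens
            outs.append(v.squeeze(0))       # (F_v^n)': (HW, dim)
        h, w = hw
        x = torch.cat([o.T.reshape(-1, h, w) for o in outs], dim=0)
        return self.merge(x.unsqueeze(0))   # Conv over concatenated maps

blk = SAMBlockSketch()
feat = blk(torch.randn(3, 5, 64), torch.randn(16, 64), (4, 4))
```

The real model reuses SAM's pre-trained two-way attention rather than training these layers from scratch; the sketch only mirrors the SA, cross-attention, Conv-merge data flow.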
III. Key Innovations in Detail
1. Pseudo Prompt Generator (PPG)
As shown in Figure 3, PPG generates class-specific prompts through the following steps:
- Softmax-normalize the correlation map
- Binarize it into a preliminary object mask
- Separate overlapping objects with k-means clustering
- Extract pseudo points and pseudo masks from high-probability regions
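The steps above can be sketched for a single class as follows. The threshold, cluster count, and the naive k-means loop are illustrative assumptions; the paper's PPG runs these operations batched over all classes.

```python
import torch

def pseudo_prompts(corr, h, w, cls=0, thresh=0.5, k=2, iters=10):
    """Sketch of the PPG steps for one class: softmax-normalize the
    correlation map over classes, binarize it, split foreground pixels with
    a tiny k-means, and return the pseudo mask plus one pseudo point per
    cluster. thresh/k/iters are assumed values."""
    prob = corr.softmax(dim=0)[cls].reshape(h, w)   # softmax over classes
    mask = prob > thresh                            # preliminary object mask
    ys, xs = torch.nonzero(mask, as_tuple=True)
    pts = torch.stack([xs, ys], dim=1).float()      # (M, 2) foreground coords
    if pts.shape[0] < k:
        return mask, pts
    centers = pts[torch.randperm(pts.shape[0])[:k]]  # k-means init
    for _ in range(iters):
        assign = torch.cdist(pts, centers).argmin(dim=1)
        for j in range(k):
            sel = pts[assign == j]
            if len(sel):
                centers[j] = sel.mean(dim=0)
    return mask, centers    # pseudo mask + pseudo points

# Two-class toy map on a 4x4 grid: class 0 fires on two separate 2x2 blobs.
corr = torch.zeros(2, 16)
corr[0, torch.tensor([0, 1, 4, 5, 10, 11, 14, 15])] = 5.0
mask, points = pseudo_prompts(corr, 4, 4)
```

With two disjoint blobs, the k-means step places one pseudo point per blob, which is how overlapping objects of the same class get separate prompts.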
Figure 3. The process of the proposed Pseudo Prompt Generator (PPG). PPG aims to generate class-specific pseudo prompts from image-text correlation maps. For efficiency, all processes are computed in batch-wise parallelization over all classes.
2. Vision-Language Fusion (VLF) Module
The VLF module strengthens multimodal feature fusion through the following steps:
- Concatenate the correlation map with image features and feed the result into a Transformer block
- Use a linear Transformer to capture inter-class relationships
- Produce a multimodally refined correlation map
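A minimal sketch of this image-then-class guidance flow is given below. All layer widths are assumptions, and a standard `nn.TransformerEncoderLayer` stands in for the paper's linear Transformer:

```python
import torch
import torch.nn as nn

class VLFSketch(nn.Module):
    """Sketch of VLF refinement: the per-class correlation map is fused with
    image features and passed through a spatial Transformer (image guidance),
    then attention across the class axis models inter-class relations (text
    guidance). Layer sizes are assumed, not taken from the paper."""
    def __init__(self, dim=32):
        super().__init__()
        self.proj = nn.Linear(dim + 1, dim)   # correlation value + image feature
        self.img_guid = nn.TransformerEncoderLayer(dim, 4, dim * 2, batch_first=True)
        self.cls_guid = nn.TransformerEncoderLayer(dim, 4, dim * 2, batch_first=True)
        self.out = nn.Linear(dim, 1)

    def forward(self, corr, img_feat):
        # corr: (Nc, HW) correlation maps; img_feat: (HW, dim)
        nc = corr.shape[0]
        x = torch.cat([corr.unsqueeze(-1),
                       img_feat.unsqueeze(0).expand(nc, -1, -1)], dim=-1)
        x = self.proj(x)                       # (Nc, HW, dim)
        x = self.img_guid(x)                   # attend over spatial tokens per class
        x = self.cls_guid(x.transpose(0, 1))   # attend over classes per pixel
        return self.out(x).squeeze(-1).T       # refined (Nc, HW) correlation map

vlf = VLFSketch()
refined = vlf(torch.randn(3, 16), torch.randn(16, 32))
```

The refined map has the same shape as the input correlation map, so the block can be applied repeatedly inside each ESC-Block.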
Figure 4. The structure of the proposed Vision-Language Fusion (VLF) block. VLF sequentially applies image and text guidance to the correlation map to refine it.
IV. Experimental Results
1. Quantitative Analysis
Performance on benchmark datasets including ADE20K and PASCAL-VOC:

| Dataset | mIoU (ESC-Net) | Improvement |
|---|---|---|
| ADE20K | 59.0 | +3.8 |
| PC-459 | 27.0 | +2.1 |
| PAS-20b | 86.3 | +2.8 |
Figure 5. Qualitative comparison of CAT-Seg and our ESC-Net across various datasets. Our model is capable of generating more accurate and robust masks compared to the existing correlation-based state-of-the-art method.
2. Qualitative Analysis
Segmentation comparison in complex scenes:
Figure 6. Visualization of image-text correlation maps with and without the SAM block. We visualize the model activation maps for the "Person" class for each ESC-Block. The proposed SAM-based method enables more accurate and dense object localization compared to the baseline.
V. Implementation Notes

```python
import torch.nn as nn


class ESCNet(nn.Module):
    def __init__(self, clip_encoder, ppg_generator, sam_blocks, vlf_module):
        super().__init__()
        self.clip_encoder = clip_encoder             # CLIP image/text encoder
        self.ppg_generator = ppg_generator           # pseudo prompt generator (PPG)
        self.sam_blocks = nn.ModuleList(sam_blocks)  # pre-trained SAM blocks
        self.vlf = vlf_module                        # vision-language fusion module
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(512, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
        )

    def forward(self, image, text):
        # CLIP feature extraction
        img_feat = self.clip_encoder.encode_image(image)
        txt_feat = self.clip_encoder.encode_text(text)
        # Generate pseudo prompts from the image-text correlation
        pseudo_prompts = self.ppg_generator(img_feat, txt_feat)
        # Spatial aggregation through the pre-trained SAM blocks
        refined_feat = img_feat
        for block in self.sam_blocks:
            refined_feat = block(refined_feat, pseudo_prompts)
        # Multimodal fusion refines the correlation map
        corr_map = self.vlf(refined_feat, txt_feat)
        # Decode into the final segmentation map
        mask = self.decoder(corr_map)
        return mask
```
VI. Conclusion and Outlook
By combining SAM's spatial aggregation capability with CLIP's global features, ESC-Net achieves efficient and accurate open-vocabulary segmentation. Future work could explore:
- More efficient pseudo prompt generation strategies
- Dynamically adjusting the number of SAM blocks to match scene complexity
- Incorporating higher-resolution feature maps to improve boundary localization
The model offers a new technical direction for open-vocabulary segmentation and holds promise for applications such as medical image analysis and autonomous driving.