CLIP-Adapter: Better Vision-Language Models with Feature Adapters

The current problem

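Because CLIP is over-parameterized and training examples are scarce, naive fine-tuning overfits to the specific dataset. Training is also very slow, because every step requires forward and backward passes through all CLIP layers.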

Method

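The design comprises a visual adapter $A_v(\cdot)$ (with weights $\textbf{W}^v_1,\textbf{W}^v_2$) and a text adapter $A_t(\cdot)$ (with weights $\textbf{W}^t_1,\textbf{W}^t_2$).
The feature adapters use residual connections to avoid forgetting the original knowledge encoded by the pre-trained CLIP. Two constants, $\alpha$ and $\beta$, serve as "residual ratios" that control how much of the original knowledge is retained, which helps performance.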
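The equation image from the original post did not survive, so the following is only a sketch of what the adapters and the residual blending could look like. It assumes $f$ denotes the image feature and $\textbf{W}$ the text-derived classifier weights (the symbols that appear later in $Q(f,\textbf{W})$), and it follows the `ratio * x + (1 - ratio) * image_features` pattern in the code further down. The exact notation in the paper may differ.

$$
A_v(f) = \mathrm{ReLU}(f^{T}\textbf{W}^v_1)\,\textbf{W}^v_2, \qquad
A_t(\textbf{W}) = \mathrm{ReLU}(\textbf{W}^{T}\textbf{W}^t_1)\,\textbf{W}^t_2
$$

$$
f^{\star} = \alpha\, A_v(f)^{T} + (1-\alpha)\, f, \qquad
\textbf{W}^{\star} = \beta\, A_t(\textbf{W})^{T} + (1-\beta)\, \textbf{W}
$$

With $\alpha = 0$ the original CLIP feature passes through unchanged, and with $\alpha = 1$ only the adapted feature is used. The same holds for $\beta$ on the text side.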

In few-shot training, the weights of $A_v(\cdot)$ and $A_t(\cdot)$ are optimized with a cross-entropy loss over the learnable parameters $\theta=\{\textbf{W}^v_1,\textbf{W}^v_2,\textbf{W}^t_1,\textbf{W}^t_2\}$.
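To make the objective concrete, here is a minimal sketch in plain Python of how a cross-entropy loss over scaled cosine-similarity logits could be computed for a single image. The function names (`l2_normalize`, `cosine_logits`, `cross_entropy`) are hypothetical helpers invented for this illustration, and plain lists stand in for the tensors that a real implementation would use.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(image_feature, class_features, logit_scale):
    """Scaled cosine similarity between one image feature and each class feature."""
    img = l2_normalize(image_feature)
    logits = []
    for class_feature in class_features:
        cls = l2_normalize(class_feature)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, cls)))
    return logits


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class (numerically stable)."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]


# Illustrative values only: a 2-D image feature and two class features.
example_logits = cosine_logits([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]], logit_scale=10.0)
example_loss = cross_entropy(example_logits, target=0)
```

In actual training only the adapter weights in $\theta$ would receive gradients from this loss, while the CLIP encoders stay frozen.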

Results

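Thoughts on the experiments: the adapter was only evaluated in-domain (no cross-domain experiments), possibly because the adapter is less robust than prompt-based methods.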

Shortcomings of the adapter's hyperparameters

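The best values of the hyperparameters $\alpha,\beta$ vary considerably from one dataset to another.
The authors suggest designing a hyper-network $Q$ that generates them dynamically, i.e. $\alpha,\beta=Q(f,\textbf{W})$, but they leave this problem unsolved.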
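Since the authors do not specify $Q$, the following is a purely hypothetical sketch of one form it could take: a single linear gate followed by a sigmoid, so that the generated ratio always lies in $(0, 1)$. Everything here (`dynamic_ratio`, `gate_weights`, `gate_bias`, `blend`) is an assumption made for illustration, and for simplicity the gate depends only on the feature $f$, not on $\textbf{W}$.

```python
import math


def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dynamic_ratio(feature, gate_weights, gate_bias):
    """Hypothetical Q: map a feature vector to a residual ratio in (0, 1)."""
    z = sum(w * x for w, x in zip(gate_weights, feature)) + gate_bias
    return sigmoid(z)


def blend(adapted, original, ratio):
    """Residual blend: ratio * adapted + (1 - ratio) * original."""
    return [ratio * a + (1.0 - ratio) * o for a, o in zip(adapted, original)]


# Illustrative values only: the gate weights would be learned in practice.
feature = [0.5, -0.5]
alpha = dynamic_ratio(feature, gate_weights=[1.0, 1.0], gate_bias=0.0)
blended = blend(adapted=[2.0, 2.0], original=feature, ratio=alpha)
```

A gate like this would replace the per-dataset manual search over $\alpha$ and $\beta$ with a value predicted per input, at the cost of a few extra learnable parameters.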
The adapter and core code

import torch.nn as nn

# NOTE: TextEncoder is not defined in this excerpt; it is assumed to come
# from the linked CLIP-Adapter repository.


class Adapter(nn.Module):
    """Bottleneck MLP: c_in -> c_in // reduction -> c_in, with a ReLU after each layer."""

    def __init__(self, c_in, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(c_in, c_in // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(c_in // reduction, c_in, bias=False),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.fc(x)


class CustomCLIP(nn.Module):
    """CLIP with a feature adapter on the image branch (no text adapter in this excerpt)."""

    def __init__(self, cfg, classnames, clip_model):
        super().__init__()
        self.image_encoder = clip_model.visual
        self.text_encoder = TextEncoder(cfg, classnames, clip_model)
        self.logit_scale = clip_model.logit_scale
        self.dtype = clip_model.dtype
        # 1024 is hard-coded here; it must match the image encoder's output dimension.
        self.adapter = Adapter(1024, 4).to(clip_model.dtype)

    def forward(self, image):
        image_features = self.image_encoder(image.type(self.dtype))
        x = self.adapter(image_features)

        # Residual ratio (alpha in the text), fixed to a constant here.
        ratio = 0.2
        image_features = ratio * x + (1 - ratio) * image_features

        text_features = self.text_encoder()

        # L2-normalize both sides so the dot product is a cosine similarity.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        logit_scale = self.logit_scale.exp()
        logits = logit_scale * image_features @ text_features.t()

        return logits

References

Paper (IJCV 2023; posted to arXiv in 2021)

https://arxiv.org/pdf/2110.04544

Code (470 stars)

https://github.com/gaopengcuhk/CLIP-Adapter
