Keye-VL-8B-Preview: A Cutting-Edge Multimodal Large Language Model from Kuaishou's Kwai Keye Team


🔥 News

  • 2025.06.26 🌟 We are proud to introduce Kwai Keye-VL, a cutting-edge multimodal large language model built by Kuaishou's Kwai Keye team. As a core AI product within Kuaishou's advanced technology ecosystem, Keye excels at video understanding, visual perception, and reasoning tasks, setting new performance benchmarks. Our team is working tirelessly to push the boundaries of what is possible, so stay tuned for more exciting developments!


Quick Start

Below, we show a simple example of how to use Kwai Keye-VL with 🤗 Transformers.

The Kwai Keye-VL code has been merged into the latest Hugging Face transformers library; we recommend building from source with the following command:

pip install git+https://github.com/huggingface/transformers accelerate

We provide a toolkit to help you handle various types of visual input more conveniently, as if you were calling an API. It supports base64-encoded data, URLs, and interleaved images and videos. You can install it with the following command:

# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install "keye-vl-utils[decord]==1.0.0"

If you are not on Linux, you may not be able to install decord from PyPI. In that case, you can run pip install keye-vl-utils, which falls back to torchvision for video processing. You can still install decord from source to enable decord-based video loading.

Chat with 🤗 Transformers

The following snippet shows how to use the chat model with transformers and keye_vl_utils:

Following Qwen3, we also provide a soft-switch mechanism that lets users dynamically control the model's behavior. Appending /think, /no_think, or no suffix at all to the user prompt switches the model's thinking mode; the three messages examples below are alternatives, one per mode.

from transformers import AutoModel, AutoTokenizer, AutoProcessor
from keye_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model_path = "Kwai-Keye/Keye-VL-8B-Preview"

model = AutoModel.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios (requires `import torch` for the dtype below).
# model = KeyeForConditionalGeneration.from_pretrained(
#     "Kwai-Keye/Keye-VL-8B-Preview",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_path, min_pixels=min_pixels, max_pixels=max_pixels, trust_remote_code=True)

# Non-Thinking Mode
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://s1-11508.kwimgs.com/kos/nlav11508/mllm_all/ziran_jiafeimao_11.jpg",
            },
            {"type": "text", "text": "Describe this image./no_think"},
        ],
    }
]

# Auto-Thinking Mode
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://s1-11508.kwimgs.com/kos/nlav11508/mllm_all/ziran_jiafeimao_11.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Thinking mode
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://s1-11508.kwimgs.com/kos/nlav11508/mllm_all/ziran_jiafeimao_11.jpg",
            },
            {"type": "text", "text": "Describe this image./think"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video Inference

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video url and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "http://s2-11508.kwimgs.com/kos/nlav11508/MLLM/videos_caption/98312843263.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# In Keye-VL, frame rate information is also fed into the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Batch Inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)

More Usage Tips

For input images, we support local files, base64-encoded data, and URLs. For videos, we currently only support local files.

# You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
## Local file path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Base64 encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
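
For reference, here is a minimal sketch of producing the base64 payload shown above (the helper name and path are hypothetical; it assumes a local JPEG file):

import base64

# Hypothetical helper: read a local image and build the data URI used above.
def image_to_data_uri(path: str) -> str:
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image;base64,{encoded}"

# Usage: plug the result into the "image" field of a message.
# {"type": "image", "image": image_to_data_uri("/path/to/your/image.jpg")}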

Setting Image Resolution for Better Performance

The model supports a wide range of input resolutions. By default it processes inputs at their native resolution, but higher resolutions can improve performance at the cost of more computation. You can set minimum and maximum pixel counts (e.g., a token-count range of 256-1280) to balance speed and memory usage.

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Kwai-Keye/Keye-VL-8B-Preview", min_pixels=min_pixels, max_pixels=max_pixels
)
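
As a rough sanity check of the numbers above, here is a sketch of the token-budget arithmetic, assuming one visual token covers a 28x28 pixel area (a 2x2 merge of 14x14 patches; this mapping is inferred from the comments earlier, not stated explicitly):

# One visual token per 28x28 pixel area (assumed 2x2 merge of 14x14 patches).
PIXELS_PER_TOKEN = 28 * 28

min_pixels = 256 * PIXELS_PER_TOKEN   # floor of ~256 visual tokens
max_pixels = 1280 * PIXELS_PER_TOKEN  # ceiling of ~1280 visual tokens

def approx_visual_tokens(height: int, width: int) -> int:
    # Approximate token count for an image kept within the pixel budget.
    pixels = min(max(height * width, min_pixels), max_pixels)
    return pixels // PIXELS_PER_TOKEN

print(approx_visual_tokens(1080, 1920))  # 1280, clipped to the upper budget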

In addition, we provide two methods for fine-grained control over the image size fed to the model:

  1. Define a min_pixels and max_pixels range: the image is resized to preserve its aspect ratio while keeping its pixel count within that range.

  2. Specify exact dimensions: set resized_height and resized_width directly. These values are rounded to multiples of 28.

# resized_height and resized_width
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

👀 Architecture and Training Strategy

The Kwai Keye-VL architecture is built on the Qwen3-8B language model and integrates a vision encoder initialized from the open-source SigLIP. It supports native dynamic resolution, preserving each image's original aspect ratio by dividing it into a sequence of 14x14 patches, which are then mapped and merged into visual tokens by a simple MLP layer. The model uses 3D rotary position embedding (RoPE) to process text, image, and video information uniformly, establishing a one-to-one correspondence between position encoding and absolute time to ensure precise perception of temporal changes in video.
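
To make the 3D RoPE idea concrete, here is a minimal sketch of building (temporal, height, width) position indices for a T x H x W grid of visual tokens; the function and layout are illustrative assumptions, not the model's actual implementation:

import torch

def build_3d_position_ids(t: int, h: int, w: int) -> torch.Tensor:
    # One (temporal, height, width) index triple per visual token, raster order.
    temporal = torch.arange(t).view(t, 1, 1).expand(t, h, w)
    height = torch.arange(h).view(1, h, 1).expand(t, h, w)
    width = torch.arange(w).view(1, 1, w).expand(t, h, w)
    return torch.stack([temporal.flatten(), height.flatten(), width.flatten()])

# For video, the temporal index can be scaled by the sampling interval (1/fps)
# so that positions track absolute time rather than frame count.
pos_ids = build_3d_position_ids(t=4, h=2, w=3)
print(pos_ids.shape)  # torch.Size([3, 24])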

🌟 Pre-Train

The Kwai Keye-VL pre-training pipeline follows a four-stage progressive strategy: image-text matching, ViT-LLM alignment, multi-task pre-training, and model merging with annealing.

Pre-training data: massive, high-quality, and diverse

  • Diversity: spans image-text pairs, videos, and pure text, with tasks including fine-grained captioning, OCR, question answering, and object grounding.
  • High quality: data is filtered using CLIP scores and a vision-language model (VLM) discriminator, and deduplicated with MinHash to prevent data leakage (see the sketch after this list).
  • Self-built datasets: high-quality in-house datasets were purpose-built, especially for detailed captioning and Chinese OCR, to make up for the shortcomings of open-source data.
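
As an illustration of the MinHash deduplication mentioned above, here is a minimal sketch over character shingles (the shingle size and number of hash functions are assumptions, not values from the source):

import hashlib

NUM_HASHES = 64
SHINGLE_SIZE = 5

def minhash_signature(text: str) -> list[int]:
    # For each seed, keep the minimum hash over all shingles of the text.
    shingles = {text[i:i + SHINGLE_SIZE] for i in range(len(text) - SHINGLE_SIZE + 1)}
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        )
        for seed in range(NUM_HASHES)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of matching slots approximates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

# Near-duplicate captions score close to 1.0 and can be dropped as leaks.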

Training pipeline: four-stage progressive optimization
Kwai Keye-VL adopts a four-stage progressive training strategy:

  • Stage 0 (vision pre-training): continually pre-trains the vision encoder to adapt to the internal data distribution and support dynamic resolution.
  • Stage 1 (cross-modal alignment): freezes the backbone and trains only the MLP to achieve robust image-text alignment at low cost.
  • Stage 2 (multi-task pre-training): unfreezes all parameters to comprehensively improve the model's visual understanding.
  • Stage 3 (annealing): fine-tunes on high-quality data to further improve fine-grained understanding.

Finally, Kwai Keye-VL explores a homogeneous-heterogeneous merging technique: the parameters of annealed models trained on different data mixtures are averaged, preserving multi-dimensional capabilities while reducing model bias and improving robustness.
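
A minimal sketch of this kind of parameter averaging (assuming the annealed checkpoints share an identical architecture; uniform weighting is an assumption):

import torch

def average_state_dicts(state_dicts: list[dict]) -> dict:
    # Uniformly average parameters across same-architecture checkpoints.
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Usage: load the annealed checkpoints, average, then load into one model.
# merged = average_state_dicts([torch.load(p, map_location="cpu") for p in paths])
# model.load_state_dict(merged)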

📈 Experimental Results


  1. Keye-VL-8B stands out for its strong, state-of-the-art perception capabilities, competitive with leading models.
  2. Keye-VL-8B demonstrates exceptional proficiency in video understanding. Across a suite of authoritative public video benchmarks, including Video-MME, Video-MMMU, TempCompass, LongVideoBench, and MMVU, it clearly outperforms other top models of comparable size.
  3. On evaluation sets requiring complex logical reasoning and mathematical problem solving, such as WeMath, MathVerse, and LogicVista, Kwai Keye-VL-8B shows a strong performance curve, highlighting its advanced ability in logical deduction and solving complex quantitative problems.
