OpenSearch Vector Search with Qwen3-Embedding: Integration Example

Published: 2025-07-07

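This project demonstrates how to combine OpenSearch's k-NN (k-Nearest Neighbors) vector search with a text embedding model such as Qwen3-Embedding, called through an OpenAI-compatible API, to implement semantic search.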

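Core Concepts

  • Text Embedding: Converts text (words, sentences, paragraphs) into a high-dimensional numeric vector. Texts that are semantically similar end up closer together in the vector space.
  • Qwen3-Embedding: The embedding model we call to generate these vectors for our text.
  • k-NN Vector Search: OpenSearch takes a query vector and uses a dedicated k-NN algorithm to quickly find the k document vectors in the index that are nearest to it, which is what makes semantic search possible.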
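
To make the idea of "nearest" concrete, here is a minimal sketch of exact (brute-force) k-NN over a handful of toy vectors. The 2-dimensional values and the names `l2_distance`, `brute_force_knn`, and `toy_vectors` are made up for illustration. OpenSearch does not scan every vector like this; the index defined later in this post uses an approximate HNSW graph instead.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(query, vectors, k):
    """Return the ids of the k vectors closest to the query, nearest first."""
    ranked = sorted(vectors, key=lambda doc_id: l2_distance(query, vectors[doc_id]))
    return ranked[:k]


# Toy 2-dimensional "embeddings" (hypothetical values, not real model output).
toy_vectors = {
    "weather": [0.9, 0.1],
    "ai": [0.1, 0.9],
    "food": [0.5, 0.5],
}

nearest = brute_force_knn([0.2, 0.8], toy_vectors, k=2)
print(nearest)  # ['ai', 'food']
```

Real embeddings work the same way, only with around a thousand dimensions instead of two.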

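Step 1: Environment Setup

Before running the script, complete the following setup.

1.1. Start OpenSearch

Make sure you have started the OpenSearch and OpenSearch Dashboards services using the docker-compose.yml file below.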

version: '3.8'
services:
  opensearch-node1:
    image: opensearchproject/opensearch:2.19.1
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.type=single-node
      - bootstrap.memory_lock=true # along with the memlock settings below.
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # minimum and maximum Java heap size, recommend setting both to 50% of system RAM
      - "DISABLE_SECURITY_PLUGIN=true" # Disables security plugin for easier local development
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the OpenSearch user, set to at least 65536 on modern systems.
        hard: 65536
    volumes:
      - opensearch-data:/usr/share/opensearch/data
    ports:
      - 9200:9200
      - 9600:9600 # required for Performance Analyzer
    networks:
      - opensearch-net

  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.19.1
    container_name: opensearch-dashboards
    ports:
      - 5601:5601
    expose:
      - "5601"
    environment:
      OPENSEARCH_HOSTS: '["http://opensearch-node1:9200"]'
      DISABLE_SECURITY_DASHBOARDS_PLUGIN: "true" # Disables security plugin for easier local development
    networks:
      - opensearch-net
    depends_on:
      - opensearch-node1

volumes:
  opensearch-data:

networks:
  opensearch-net:

docker-compose up -d
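
Before moving on, it can help to confirm that the node is actually reachable. The following is a minimal sketch that queries the `_cluster/health` endpoint on the port published by the compose file above, using only the standard library. The helper names `fetch_cluster_health` and `cluster_is_ready` are hypothetical, and it assumes the security plugin is disabled as configured above, so plain HTTP without credentials works.

```python
import json
import urllib.request


def fetch_cluster_health(base_url="http://localhost:9200", timeout=5):
    """Call the _cluster/health endpoint and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def cluster_is_ready(health):
    """Treat 'green' and 'yellow' as usable; a single-node cluster is often 'yellow'."""
    return health.get("status") in ("green", "yellow")


# Example usage (requires the containers to be running):
# health = fetch_cluster_health()
# print(health.get("status"), cluster_is_ready(health))
```

A "yellow" status is normal for a single node, because replica shards have no second node to be assigned to.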


1.2. Install Python Dependencies

This script requires the opensearch-py, openai, and python-dotenv libraries. Install them with pip (invoked here through uv):

uv pip install opensearch-py openai python-dotenv

1.3. Configure the API Key

This is a critical step!

  1. Create a new file named .env in the project root directory.

  2. Open the .env file and add your API key and model settings in the following format:

    QWEN_API_KEY=sk-YourActualOpenAIKeyHere
    QWEN_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
    QWEN_MODEL_NAME=qwen-turbo
    QWEN_EMBEDDING_MODEL_NAME=text-embedding-v4
    
    

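    Important: Replace sk-YourActualOpenAIKeyHere with your own real API key. The script loads the key from this file automatically, which avoids the security risk of hard-coding it in the source.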
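
Because a missing or blank entry in .env only surfaces later as an opaque API error, a small check at startup can save some debugging. The sketch below is one possible way to do it. The helper names `missing_settings` and `require_settings` are hypothetical and not part of the original script; the three variable names are the ones the script reads.

```python
import os

# The settings the script reads from the environment.
REQUIRED_SETTINGS = ("QWEN_API_KEY", "QWEN_BASE_URL", "QWEN_EMBEDDING_MODEL_NAME")


def missing_settings(env=None):
    """Return the names of required settings that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not (env.get(name) or "").strip()]


def require_settings(env=None):
    """Raise a descriptive error if any required setting is missing."""
    missing = missing_settings(env)
    if missing:
        raise RuntimeError("Missing settings in .env: " + ", ".join(missing))
```

In the script, `require_settings()` would be called right after `load_dotenv()`, before any client is created.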


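Step 2: The Python Script (opensearch_openai_vector_search.py)

The complete Python script follows. It:

  1. Loads the API key from the .env file.
  2. Creates an OpenSearch index whose vector dimension (1024) matches the embedding model's output.
  3. Defines a function that calls the OpenAI-compatible API to convert text into a vector.
  4. Converts the sample texts into vectors and stores them in OpenSearch.
  5. Runs a vector search to find the documents most relevant to a query.
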
# -*- coding: utf-8 -*-

"""
This script demonstrates how to use Qwen embeddings, served through an
OpenAI-compatible API, for vector search in OpenSearch.

It requires the following libraries:
- opensearch-py
- openai
- python-dotenv

You can install them using: pip install opensearch-py openai python-dotenv

Setup:
1. Make sure your OpenSearch instance is running (e.g., via docker-compose).
2. Create a file named .env in the same directory as this script.
3. Add your API settings to the .env file like this:
   QWEN_API_KEY=sk-YourActualOpenAIKeyHere
   QWEN_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
   QWEN_EMBEDDING_MODEL_NAME=text-embedding-v4
"""

import os
import time
from dotenv import load_dotenv
from openai import OpenAI
from opensearchpy import OpenSearch

# --- 1. Configuration ---

# Load environment variables from .env file
load_dotenv()

QWEN_API_KEY = os.getenv("QWEN_API_KEY")
QWEN_BASE_URL = os.getenv("QWEN_BASE_URL")
QWEN_EMBEDDING_MODEL_NAME = os.getenv("QWEN_EMBEDDING_MODEL_NAME")

client_openai = OpenAI(base_url=QWEN_BASE_URL, api_key=QWEN_API_KEY)
# Connect to OpenSearch
client_opensearch = OpenSearch(
    hosts=[{'host': 'localhost', 'port': 9200}],
    http_auth=None,  # No authentication
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
)


# Dimension of the vectors returned by the embedding model (1024 is assumed here for text-embedding-v4)
VECTOR_DIMENSION = 1024

INDEX_NAME = "my-openai-vector-index"


# --- 2. OpenAI Embedding Function ---

def get_openai_embedding(text):
    """Generates a vector embedding for the given text via the OpenAI-compatible API."""
    # Replace newlines with spaces so the model receives a single line of text
    text = text.replace("\n", " ")
    response = client_openai.embeddings.create(input=[text], model=QWEN_EMBEDDING_MODEL_NAME)
    return response.data[0].embedding


# --- 3. Index Setup ---

def create_index_with_vector_mapping():
    """Creates an OpenSearch index with a k-NN vector mapping sized to the embedding model's output."""
    if client_opensearch.indices.exists(index=INDEX_NAME):
        print(f"Index '{INDEX_NAME}' already exists. Deleting it.")
        client_opensearch.indices.delete(index=INDEX_NAME)

    settings = {
        "settings": {
            "index": {
                "knn": True,
                "knn.algo_param.ef_search": 100
            }
        },
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "text_vector": {
                    "type": "knn_vector",
                    "dimension": VECTOR_DIMENSION, # Crucial: Must match the model's output dimension
                    "method": {
                        "name": "hnsw",
                        "space_type": "l2",
                        "engine": "nmslib",
                        "parameters": {
                            "ef_construction": 128,
                            "m": 24
                        }
                    }
                }
            }
        }
    }
    client_opensearch.indices.create(index=INDEX_NAME, body=settings)
    print(f"Index '{INDEX_NAME}' created successfully with dimension {VECTOR_DIMENSION}.")


# --- 4. Indexing Documents ---

def index_documents():
    """Generates vector embeddings for the sample documents and indexes them."""
    documents = [
        {"text": "The sky is blue and the sun is bright."},
        {"text": "I enjoy walking in the park on a sunny day."},
        {"text": "Artificial intelligence is transforming many industries."},
        {"text": "The new AI model shows impressive capabilities in natural language understanding."},
        {"text": "My favorite food is pizza, especially with pepperoni."},
        {"text": "I'm planning a trip to Italy to enjoy the local cuisine."}
    ]

    for i, doc in enumerate(documents):
        print(f"Generating embedding for document {i+1}...")
        vector = get_openai_embedding(doc["text"])
        
        doc_body = {
            "text": doc["text"],
            "text_vector": vector # The embedding is already a list
        }
        
        client_opensearch.index(index=INDEX_NAME, body=doc_body, id=i+1, refresh=True)
        print(f"Indexed document {i+1}")

    time.sleep(2)


# --- 5. Vector Search ---

def search_with_vector(query_text, k=3):
    """Performs a k-NN search for the documents most similar to the query text."""
    print(f"\n--- Performing k-NN search for: '{query_text}' ---")
    
    query_vector = get_openai_embedding(query_text)
    
    search_query = {
        "size": k,
        "query": {
            "knn": {
                "text_vector": {
                    "vector": query_vector,
                    "k": k
                }
            }
        }
    }
    
    response = client_opensearch.search(index=INDEX_NAME, body=search_query)
    
    print("Search Results:")
    for hit in response["hits"]["hits"]:
        print(f"  - Score: {hit['_score']:.4f}, Text: {hit['_source']['text']}")


# --- 6. Main Execution ---
if __name__ == "__main__":
    create_index_with_vector_mapping()
    index_documents()
    
    # Perform a simple vector search
    search_with_vector("intelligent machines")
    
    # Perform another vector search
    search_with_vector("sunny weather activities")

    # Clean up the index (optional)
    # client_opensearch.indices.delete(index=INDEX_NAME)
    # print(f"\nIndex '{INDEX_NAME}' deleted.")
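
The `_score` values printed by the search are not raw distances. For the `l2` space type used in the mapping above, the OpenSearch k-NN documentation describes the score as 1 / (1 + d), where d is the squared Euclidean distance, so a higher score means a closer vector. The sketch below reproduces that conversion under this assumption. The function names `l2_score` and `score_to_distance` are hypothetical helpers, not part of the original script.

```python
def l2_score(query, doc):
    """Score = 1 / (1 + squared L2 distance); higher means closer."""
    squared = sum((q - d) ** 2 for q, d in zip(query, doc))
    return 1.0 / (1.0 + squared)


def score_to_distance(score):
    """Invert the formula to recover the Euclidean distance from a score."""
    # max() guards against tiny negative values caused by float rounding.
    return max(0.0, 1.0 / score - 1.0) ** 0.5


print(l2_score([0.0, 0.0], [0.0, 0.0]))  # 1.0 for identical vectors
```

A score of 1.0 therefore means the two vectors are identical, and scores fall toward 0 as the vectors move apart.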


Step 3: Run the Script

Once all of the preparation above is complete, run the following command in your terminal:

uv run opensearch_openai_vector_search.py

Output

[Figure: script output]

OpenSearch Dashboards Visualization

  1. Log in to OpenSearch Dashboards.
  2. Create an index pattern.
  3. Inspect the indexed documents in Discover.

Code link: https://github.com/zhouruiliangxian/Awesome-demo/blob/main/Database/opensearch_test/opensearch_openai_vector_search.py

