Using the ChromaDB Vector Database - EW帮帮网

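ChromaDB is an open-source embedded vector database designed for AI applications. I won't go over its advantages here; this post is simply a record of what I learned. The environment is Windows with Python 3.10.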

Installation

Run the command below to install ChromaDB. Set up a Python environment beforehand.

pip install chromadb

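Pay attention to the Python version when installing. With Python 3.9, my code kept failing in add with the error "The onnxruntime python package is not installed. Please install it with `pip install onnxruntime`". I was stuck on this for three or four days, assuming it was an environment problem, until I finally found that switching to a different Python version fixed it.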
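Because the error above depended on the interpreter in use, it can help to check the active environment before debugging further. The sketch below uses only the standard library to report the Python version and whether onnxruntime can be found. The helper name `check_environment` is made up for illustration, and the check says nothing about which versions actually work with ChromaDB.

```python
import importlib.util
import sys


def check_environment() -> dict:
    """Report the interpreter version and whether onnxruntime can be found."""
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        # find_spec returns None when the package is not installed
        "onnxruntime_installed": importlib.util.find_spec("onnxruntime") is not None,
    }


if __name__ == "__main__":
    print(check_environment())
```

Running this with the same interpreter that runs the ChromaDB script shows quickly whether the two are using the same environment.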

Usage

Creating a client

For a database running on a server, connect over HTTP:

import chromadb

chroma_client = chromadb.HttpClient(host="localhost", port=8000)

The server has to be started from the command line:

chroma run --path "your_data_directory" --host 0.0.0.0 --port 8000

For a local setup, use PersistentClient. Avoid Chinese characters in the path.

import chromadb

client = chromadb.PersistentClient(path="your_data_directory")

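Collections

A collection is where data is stored, similar to a table.

# Get an existing Collection object
collection = client.get_collection("testname")

# Create the collection if it does not exist; this is usually the more convenient call
collection = client.get_or_create_collection("testname")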

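Data operations

Adding

# documents: the document texts
# metadatas: metadata, i.e. notes attached to each document
# ids: the ID of each document; IDs must be unique
# embeddings: the embedding vectors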
collection.add(
    documents=["This is a document about cat", "This is a document about car", "This is a document about bike"],
    metadatas=[{"category": "animal"}, {"category": "vehicle"}, {"category": "vehicle"}],
    ids=["id1", "id2", "id3"]
)

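Once the data is added, each item is stored as a record with the following attributes:

Attribute	Type	Purpose	Example
id	String	Unique identifier of the document	`"doc_001"`
embedding	List[float]	Vector representation of the document	`[0.12, -0.34, ..., 0.78]`
document	String	Original text content	`"ChromaDB is an open-source vector database..."`
metadata	Dict	Additional descriptive information about the document	`{"author": "John", "category": "AI"}`
uris	List[String]	Links to external resources	`["https://example.com/doc.pdf"]`
data	Any	Custom binary data	Non-text data such as images or audio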
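The table describes one record, while results come back as parallel lists (all ids, all documents, all metadatas). The sketch below regroups such lists into one dict per record. The `to_records` helper and the `sample` dict are invented for illustration; the sample only mimics the shape of a `collection.get()` result and is not real output.

```python
def to_records(result: dict) -> list:
    """Turn the parallel lists of a get()-style result into one dict per record."""
    ids = result["ids"]
    # Fields that were not requested may be None, so fall back to placeholders
    documents = result.get("documents") or [None] * len(ids)
    metadatas = result.get("metadatas") or [None] * len(ids)
    return [
        {"id": i, "document": d, "metadata": m}
        for i, d, m in zip(ids, documents, metadatas)
    ]


# Hypothetical sample shaped like the output of collection.get()
sample = {
    "ids": ["id1", "id2"],
    "documents": ["This is a document about cat", "This is a document about car"],
    "metadatas": [{"category": "animal"}, {"category": "vehicle"}],
    "embeddings": None,
}

for record in to_records(sample):
    print(record)
```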

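Updating

The upsert function adds a record if its ID does not exist yet and updates it if it does.

    # documents: the document texts
    # metadatas: metadata, i.e. notes attached to each document
    # ids: the ID of each document; IDs must be unique
    # embeddings: the embedding vectors

    collection.upsert(
        documents=["A document about a dog", "A document about a cat", "A document about a bicycle"],
        # id2 is now a cat document, so its category is animal rather than vehicle
        metadatas=[{"category": "animal"}, {"category": "animal"}, {"category": "vehicle"}],
        ids=["id1", "id2", "id3"]
    )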
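The add-or-update behaviour can be pictured with a plain dictionary keyed by ID. This is only a conceptual sketch and not how ChromaDB stores data; the `upsert` helper, the `store` dict, and the extra ID `id4` are made up for illustration.

```python
def upsert(store: dict, ids: list, documents: list, metadatas: list) -> None:
    """Insert records whose ID is new and overwrite records whose ID exists."""
    for record_id, document, metadata in zip(ids, documents, metadatas):
        store[record_id] = {"document": document, "metadata": metadata}


store = {}
upsert(store, ["id1"], ["This is a document about cat"], [{"category": "animal"}])

# id1 appears again, so its record is replaced rather than duplicated;
# id4 is new, so it is inserted
upsert(
    store,
    ["id1", "id4"],
    ["A document about a dog", "A document about a train"],
    [{"category": "animal"}, {"category": "vehicle"}],
)

print(len(store))
print(store["id1"]["document"])
```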

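Querying

Queries mainly use the query and get functions. Besides the document data, query results also include a distances field, which indicates how close the query vector is to each result vector. By default ChromaDB uses squared L2 (Euclidean) distance; cosine distance or inner product can be selected when the collection is created.

The smaller the distance, the more similar the result; the larger the distance, the less similar.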
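To make "smaller distance means more similar" concrete, the sketch below computes two common distance measures in plain Python: squared Euclidean (L2) distance and cosine distance. The vectors are invented toy values rather than real embeddings, and the function names are hypothetical.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance: sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
near = [0.9, 0.1]  # points in almost the same direction as the query
far = [0.0, 1.0]   # orthogonal to the query

print(squared_l2(query, near), squared_l2(query, far))
print(cosine_distance(query, near), cosine_distance(query, far))
```

Under both measures the `near` vector gets the smaller value, which is why it would rank ahead of `far` in a similarity query.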

    print("List records >>>")
    print(collection.peek())
    # print(collection.get())

    print("Similarity query >>>")
    result = collection.query(
        query_texts=["bicycle"],
        n_results=1
    )
    print(result)

    print("Query with a condition >>>")
    result = collection.query(
        query_texts=["bicycle"],
        n_results=1,
        where={"category": "animal"}
    )
    print(result)

    print("Filter with an operator >>>")
    result = collection.query(
        query_texts=["about"],
        n_results=2,
        where={"$or": [{"category": "animal"}, {"category": "vehicle"}]},
    )
    print(result)

    # print("Query by vector >>>")
    # query_embeddings = ...
    # result = collection.query(
    #     query_embeddings=query_embeddings,
    #     n_results=2
    # )

    print("Get by ID >>>")
    result = collection.get(
        ids=["id2"]
    )
    print(result)

    print("Paged query >>>")
    result = collection.get(
        where={"category": "animal"},
        limit=2,
        offset=1
    )
    print(result)
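How the `where` filter and the `limit`/`offset` arguments combine can be sketched in plain Python. This is a simplified model written for illustration, not ChromaDB's implementation: it handles only equality clauses plus `$or` and `$and`, and the `matches` and `get` helpers and the `records` list are hypothetical.

```python
def matches(metadata: dict, where: dict) -> bool:
    """Check one metadata dict against a simplified where filter."""
    if "$or" in where:
        return any(matches(metadata, clause) for clause in where["$or"])
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    # Plain {"key": value} clauses mean equality on every listed key
    return all(metadata.get(key) == value for key, value in where.items())


def get(records: list, where: dict, limit=None, offset: int = 0) -> list:
    """Filter records, then skip `offset` matches and return up to `limit`."""
    hits = [r for r in records if matches(r["metadata"], where)]
    end = None if limit is None else offset + limit
    return hits[offset:end]


records = [
    {"id": "id1", "metadata": {"category": "animal"}},
    {"id": "id2", "metadata": {"category": "vehicle"}},
    {"id": "id3", "metadata": {"category": "vehicle"}},
]

either = {"$or": [{"category": "animal"}, {"category": "vehicle"}]}
print([r["id"] for r in get(records, either)])
print([r["id"] for r in get(records, {"category": "vehicle"}, limit=2, offset=1)])
```

In this model the offset is applied after filtering, so an offset at least as large as the number of matching records gives an empty result.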

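Complete code

import chromadb
from chromadb.api import ClientAPI
# Import the Collection class itself, not the module of the same name
from chromadb.api.models.Collection import Collection


# List the collections
def list_collection(client: ClientAPI):
    print(client.list_collections())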


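# Add data
def add(collection: Collection):
    # documents: the document texts
    # metadatas: metadata, i.e. notes attached to each document
    # ids: the ID of each document; IDs must be unique
    # embeddings: the embedding vectors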
    collection.add(
        documents=["This is a document about cat", "This is a document about car", "This is a document about bike"],
        metadatas=[{"category": "animal"}, {"category": "vehicle"}, {"category": "vehicle"}],
        ids=["id1", "id2", "id3"]
    )


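# Update data
def edit(collection: Collection):
    # documents: the document texts
    # metadatas: metadata, i.e. notes attached to each document
    # ids: the ID of each document; IDs must be unique
    # embeddings: the embedding vectors

    collection.upsert(
        documents=["A document about a dog", "A document about a cat", "A document about a bicycle"],
        # id2 is now a cat document, so its category is animal rather than vehicle
        metadatas=[{"category": "animal"}, {"category": "animal"}, {"category": "vehicle"}],
        ids=["id1", "id2", "id3"]
    )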


def search(collection: Collection):

    print("List records >>>")
    print(collection.peek())
    # print(collection.get())

    print("Similarity query >>>")
    result = collection.query(
        query_texts=["bicycle"],
        n_results=1
    )
    print(result)

    print("Query with a condition >>>")
    result = collection.query(
        query_texts=["bicycle"],
        n_results=1,
        where={"category": "animal"}
    )
    print(result)

    print("Filter with an operator >>>")
    result = collection.query(
        query_texts=["about"],
        n_results=2,
        where={"$or": [{"category": "animal"}, {"category": "vehicle"}]},
    )
    print(result)

    # print("Query by vector >>>")
    # query_embeddings = ...
    # result = collection.query(
    #     query_embeddings=query_embeddings,
    #     n_results=2
    # )

    print("Get by ID >>>")
    result = collection.get(
        ids=["id2"]
    )
    print(result)

    print("Paged query >>>")
    result = collection.get(
        where={"category": "animal"},
        limit=2,
        offset=1
    )
    print(result)


if __name__ == '__main__':
    client = chromadb.PersistentClient(path="D:\\uploadTemplate\\chromadb")
    collection = client.get_or_create_collection(name="my-collection")

    # Call add(collection) or edit(collection) first if the collection is still empty
    search(collection)
