Vector Database Technology Series, Part 3: Introduction to Chroma

dierren · Posted on 2025-9-18 21:49:21

Author: CSDN Blog

I. Preface

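Chroma is an open-source, AI-native vector database. It is designed to help developers build large-model applications more easily by making the knowledge, facts, and skills contained in documents available to large language models (LLMs). It offers a simple API for storing embeddings together with their metadata, embedding documents and queries, and searching embeddings. Its main characteristics are: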

Lightweight

Easy to use

Feature-rich

Multi-language support

Open source

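II. Core concepts

1. Data organization components

Tenant

A tenant is a logical grouping of databases that models an organization or a user. One tenant can own multiple databases.

Collection

A collection is the grouping mechanism for embeddings, documents, and metadata: it is where embeddings, documents, and any additional metadata are stored. It is analogous to a table in a traditional database.

Document

A document is the original chunk of text before it is vectorized. Note that it is not a file.

Metadata

Metadata describes attributes of a document. It is a set of key-value pairs whose values can be of type string, boolean, int, or float.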
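
The hierarchy described above (tenant, database, collection, record) can be pictured with a small data model. This is only an illustrative sketch: the class names `Tenant`, `Database`, `Collection`, `Record` and the helper `validate_metadata` are hypothetical and are not Chroma's internal types. The metadata check simply encodes the four value types listed above.

```python
from dataclasses import dataclass, field

# Value types the article lists as valid for metadata values.
ALLOWED_METADATA_TYPES = (str, bool, int, float)


def validate_metadata(metadata: dict) -> None:
    """Raise TypeError if a key is not a string or a value has an unsupported type."""
    for key, value in metadata.items():
        if not isinstance(key, str):
            raise TypeError(f"metadata key {key!r} must be a string")
        if not isinstance(value, ALLOWED_METADATA_TYPES):
            raise TypeError(
                f"metadata value for {key!r} has unsupported type {type(value).__name__}"
            )


@dataclass
class Record:
    id: str
    document: str  # the raw text chunk, not a file
    metadata: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        validate_metadata(self.metadata)


@dataclass
class Collection:
    name: str
    records: dict = field(default_factory=dict)  # id -> Record


@dataclass
class Database:
    name: str
    collections: dict = field(default_factory=dict)  # name -> Collection


@dataclass
class Tenant:
    name: str
    databases: dict = field(default_factory=dict)  # name -> Database


# One tenant owning one database, which holds one collection with one record.
tenant = Tenant("demo_tenant")
tenant.databases["demo_db"] = Database("demo_db")
collection = Collection("my_collection2")
tenant.databases["demo_db"].collections[collection.name] = collection
collection.records["id1"] = Record(
    "id1", "This is a document about dog", {"chapter": "3", "verse": "16"}
)
```

A list or nested dict as a metadata value would be rejected by this sketch, which mirrors the restriction to scalar values described above.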
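2. Storage components

SQLite database

In single-node mode, Chroma stores all data about tenants, databases, collections, and documents in a single SQLite database. As a lightweight relational database, SQLite gives Chroma a stable storage foundation.

3. Query processing components

Embedding function

Also called the embedding model. It exposes a uniform API that converts raw text into vectors. Chroma supports embedding models from several providers, such as OpenAI and Gemini; the default is all-MiniLM-L6-v2.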
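
To make the "texts in, vectors out" contract concrete, here is a toy sketch. `ToyHashEmbedding` is a hypothetical class invented for illustration: it hashes character bigrams into a small fixed-length vector and carries no semantic meaning, whereas a real embedding function calls a trained model such as all-MiniLM-L6-v2. Only the shape of the interface is the point.

```python
import hashlib
import math


class ToyHashEmbedding:
    """Hypothetical stand-in for an embedding function: list of texts in, list of vectors out."""

    def __init__(self, dim: int = 16) -> None:
        self.dim = dim

    def _embed_one(self, text: str) -> list:
        vec = [0.0] * self.dim
        # Count character bigrams into hashed buckets.
        for i in range(len(text) - 1):
            bigram = text[i:i + 2].lower()
            digest = hashlib.md5(bigram.encode("utf-8")).digest()
            vec[digest[0] % self.dim] += 1.0
        norm = math.sqrt(sum(x * x for x in vec))
        # Normalize to unit length; an all-zero vector is returned unchanged.
        return [x / norm for x in vec] if norm else vec

    def __call__(self, texts: list) -> list:
        return [self._embed_one(text) for text in texts]


embed = ToyHashEmbedding(dim=16)
vectors = embed(["This is a document about dog", "This is a document about oranges"])
```

Every text maps to a vector of the same length, which is what allows the database to compare any two entries with a distance function.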

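Distance function

Computes the difference (distance) between two embedding vectors. Chroma supports cosine similarity, Euclidean distance (L2), and inner product (IP) as distance functions. During a query, Chroma compares the input vector with the stored vectors under the selected metric and returns the most similar ones.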
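
The three metrics can be written out in a few lines of plain Python. The sketch below follows the definitions as I understand them from Chroma's documentation (L2 reported as the squared distance, inner product and cosine reported as one minus the similarity, so that smaller always means more similar); treat those conventions as an assumption and check them against the version you use. The function names are my own.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance: sum((a_i - b_i)^2)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a, b):
    """1 - dot(a, b); smaller means more similar."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
stored = {"same_direction": [2.0, 0.0], "orthogonal": [0.0, 1.0]}

# Rank stored vectors by cosine distance, closest first.
ranked = sorted(stored, key=lambda name: cosine_distance(query, stored[name]))
```

Cosine distance ignores vector length, so `same_direction` ranks first even though it is twice as long as the query; under squared L2 the length difference would count.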
III. Basic operations

1. Install the chromadb library

All of the following examples use Python.

pip install chromadb

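2. Create a client

Chroma supports three client types.
(1) Non-persistent (in-memory) client
This client runs in memory and suits scenarios where data does not need to be persisted, such as debugging and experiments.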

import chromadb
client = chromadb.Client()

(2) Persistent client
A local storage path can be configured when the client is created.

import chromadb
chroma_client = chromadb.PersistentClient(path="/chroma/myCollection")

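(3) HTTP (client/server) mode
The first two are local modes, in which Chroma runs on the same machine as the client. In client/server mode the Chroma server is deployed separately and accessed through HttpClient.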

import chromadb
chroma_client = chromadb.HttpClient(host='localhost', port=8000)

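This article uses the persistent client for its demonstrations.

3. Create a collection

When creating a collection, you can configure parameters such as the collection name, optional metadata, and the embedding function.

Here get_or_create_collection is used so that a new collection is not created on every run, and the default embedding model is used.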

import chromadb
chroma_client = chromadb.PersistentClient(path="/chroma/myCollection2")
# switch `create_collection` to `get_or_create_collection` to avoid creating a new collection every time
collection = chroma_client.get_or_create_collection(name="my_collection2")

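After the collection is created, a chroma.sqlite3 file appears in the local myCollection2 directory.

4. Write data

When writing data, you supply parameters such as ids, documents, metadatas and, optionally, precomputed embeddings.

The example below writes two records. upsert updates a record if its id already exists and inserts it otherwise.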

import chromadb
chroma_client = chromadb.PersistentClient(path="/chroma/myCollection2")
# switch `create_collection` to `get_or_create_collection` to avoid creating a new collection every time
collection = chroma_client.get_or_create_collection(name="my_collection2")
# switch `add` to `upsert` to avoid adding the same documents every time
collection.upsert(
    documents=[
        "This is a document about dog",
        "This is a document about oranges",
    ],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}],
    ids=["id1", "id2"],
)
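
The difference between inserting and upserting can be illustrated with a tiny in-memory store. `TinyStore` is a hypothetical class, not Chroma's storage engine. In particular, how a real `add` call treats an id that already exists may differ between Chroma versions; this sketch simply assumes the existing record is left unchanged.

```python
class TinyStore:
    """Hypothetical in-memory id -> document map used only to illustrate upsert."""

    def __init__(self) -> None:
        self.docs = {}

    def add(self, ids, documents) -> None:
        # Insert only: in this sketch, ids that already exist are left unchanged.
        for doc_id, doc in zip(ids, documents):
            self.docs.setdefault(doc_id, doc)

    def upsert(self, ids, documents) -> None:
        # Insert new ids and overwrite existing ones.
        for doc_id, doc in zip(ids, documents):
            self.docs[doc_id] = doc


store = TinyStore()
store.upsert(["id1"], ["This is a document about dog"])
store.upsert(["id1"], ["This is a document about dogs"])  # overwrites id1
store.add(["id1", "id2"], ["ignored", "This is a document about oranges"])
```

Running the same upsert repeatedly leaves the store with one record per id, which is why the tutorial scripts can be re-run without accumulating duplicates.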

After writing, you can inspect the stored data in the SQLite database.
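
One way to look inside the file is Python's built-in sqlite3 module. The helper below, `list_tables`, is my own name and only lists whatever tables the file contains; it makes no assumption about Chroma's schema, which can change between versions. The file is opened read-only so that inspection cannot modify the data.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: str) -> list:
    """Return the table names in a SQLite file, opened read-only."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Example call, using the storage path from the PersistentClient setup above:
# print(list_tables("/chroma/myCollection2/chroma.sqlite3"))
```

It is safest to run this while no Chroma client has the database open.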

5. Query data

(1) Vector query

Use query to run a vector similarity search.

import chromadb
chroma_client = chromadb.PersistentClient(path="/chroma/myCollection2")
# switch `create_collection` to `get_or_create_collection` to avoid creating a new collection every time
collection = chroma_client.get_or_create_collection(name="my_collection2")
# switch `add` to `upsert` to avoid adding the same documents every time
collection.upsert(
    documents=[
        "This is a document about dog",
        "This is a document about oranges",
    ],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}],
    ids=["id1", "id2"],
)
results = collection.query(
    query_texts=["This is a query document about cat"],  # Chroma will embed this for you
    n_results=2,  # how many results to return
)
print(results)

The printed result is as follows:

{'ids': [['id1', 'id2']], 'embeddings': None, 'documents': [['This is a document about dog', 'This is a document about oranges']], 'uris': None, 'data': None, 'metadatas': [[{'chapter': '3', 'verse': '16'}, {'chapter': '3', 'verse': '5'}]], 'distances': [[0.9647536639619662, 1.4269337554105601]], 'included': [<IncludeEnum.distances: 'distances'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}

The first result has the smallest distance, so it is the most similar to the query.
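
The result is a dict of nested lists: the outer list has one entry per query text, and the inner list has one entry per hit. A small helper (hypothetical, named `flatten_results` here) can turn that into rows that are easier to read. The sample dict is an abbreviated copy of the printed output above.

```python
def flatten_results(results: dict, query_index: int = 0) -> list:
    """Return (id, document, distance) tuples for one query, closest first."""
    ids = results["ids"][query_index]
    documents = results["documents"][query_index]
    distances = results["distances"][query_index]
    return sorted(zip(ids, documents, distances), key=lambda row: row[2])


# Abbreviated copy of the printed query output.
sample = {
    "ids": [["id1", "id2"]],
    "documents": [["This is a document about dog", "This is a document about oranges"]],
    "distances": [[0.9647536639619662, 1.4269337554105601]],
}

for doc_id, document, distance in flatten_results(sample):
    print(f"{doc_id}  {distance:.4f}  {document}")
```

Because a smaller distance means a closer match, the dog document is listed before the oranges document for the cat query.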
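(2) Metadata filtering

Chroma also supports filtering on metadata. For example, to find the most similar result among records whose verse metadata is "5":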

import chromadb
chroma_client = chromadb.PersistentClient(path="/chroma/myCollection2")
# switch `create_collection` to `get_or_create_collection` to avoid creating a new collection every time
collection = chroma_client.get_or_create_collection(name="my_collection2")
# switch `add` to `upsert` to avoid adding the same documents every time
collection.upsert(
    documents=[
        "This is a document about dog",
        "This is a document about oranges",
    ],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}],
    ids=["id1", "id2"],
)
results = collection.query(
    query_texts=["This is a query document about cat"],  # Chroma will embed this for you
    n_results=2,  # how many results to return
    where={
        "verse": {
            "$eq": "5"
        }
    },
)
print(results)

The query result is as follows:

{'ids': [['id2']], 'embeddings': None, 'documents': [['This is a document about oranges']], 'uris': None, 'data': None, 'metadatas': [[{'chapter': '3', 'verse': '5'}]], 'distances': [[1.4269337554105601]], 'included': [<IncludeEnum.distances: 'distances'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}

Only the record with id id2 matches the filter, so only that record is returned.
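
Conceptually, the filter narrows the candidate set before the similarity ranking is applied. The sketch below shows how a `where` clause with comparison operators can be evaluated against a metadata dict. `matches_where` is a hypothetical helper and a deliberate simplification: it covers only single-field comparisons, and it assumes that values of different types never match, which is why the filter above uses the string "5" rather than the integer 5.

```python
import operator

# Comparison operators used in `where` filters, mapped to Python functions.
_OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches_where(metadata: dict, where: dict) -> bool:
    """Return True if every field condition in `where` holds for `metadata`."""
    for field_name, condition in where.items():
        if field_name not in metadata:
            return False
        value = metadata[field_name]
        # A bare value is treated as shorthand for {"$eq": value}.
        if not isinstance(condition, dict):
            condition = {"$eq": condition}
        for op_name, expected in condition.items():
            # In this sketch, a value whose type differs from the expected one is a non-match.
            if type(value) is not type(expected):
                return False
            if not _OPERATORS[op_name](value, expected):
                return False
    return True


records = {
    "id1": {"chapter": "3", "verse": "16"},
    "id2": {"chapter": "3", "verse": "5"},
}
where = {"verse": {"$eq": "5"}}
matching_ids = [rid for rid, meta in records.items() if matches_where(meta, where)]
```

Storing numeric fields such as verse as numbers rather than strings would make range operators like `$gt` behave numerically instead of lexicographically.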
(3) Full-text search

Chroma supports full-text search over documents. For example, to find the most similar records among documents that contain apple:

import chromadb
chroma_client = chromadb.PersistentClient(path="/chroma/myCollection2")
# switch `create_collection` to `get_or_create_collection` to avoid creating a new collection every time
collection = chroma_client.get_or_create_collection(name="my_collection2")
# switch `add` to `upsert` to avoid adding the same documents every time
collection.upsert(
    documents=[
        "This is a document about dog",
        "This is a document about oranges",
    ],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}],
    ids=["id1", "id2"],
)
results = collection.query(
    query_texts=["This is a query document about cat"],  # Chroma will embed this for you
    n_results=2,  # how many results to return
    where_document={"$contains": "apple"},
)
print(results)

The query result is as follows:

{'ids': [[]], 'embeddings': None, 'documents': [[]], 'uris': None, 'data': None, 'metadatas': [[]], 'distances': [[]], 'included': [<IncludeEnum.distances: 'distances'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}

No document contains the string "apple", so the result is empty.
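
The document filter can be thought of as a substring test that runs alongside the vector search. The sketch below assumes a plain, case-sensitive substring match and handles only `$contains` and `$not_contains`; `matches_document` is a hypothetical helper, not part of chromadb.

```python
def matches_document(document: str, where_document: dict) -> bool:
    """Evaluate a simplified `where_document` filter against one document."""
    for op_name, needle in where_document.items():
        if op_name == "$contains":
            if needle not in document:
                return False
        elif op_name == "$not_contains":
            if needle in document:
                return False
        else:
            raise ValueError(f"unsupported operator in this sketch: {op_name}")
    return True


documents = {
    "id1": "This is a document about dog",
    "id2": "This is a document about oranges",
}

contains_apple = [
    doc_id for doc_id, text in documents.items()
    if matches_document(text, {"$contains": "apple"})
]
contains_orange = [
    doc_id for doc_id, text in documents.items()
    if matches_document(text, {"$contains": "orange"})
]
```

With the two sample documents, filtering on "apple" leaves no candidates, while filtering on "orange" would leave only id2 to be ranked.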
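IV. Case study

Next we take 20 short lines from classical Chinese poetry as input, pick one more line as the query, and check how well embedding and retrieval work.
The input lines (selected with Kimi) are: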

海内存知己，天涯若比邻。

大漠孤烟直，长河落日圆。

春眠不觉晓，处处闻啼鸟。

会当凌绝顶，一览众山小。

海上生明月，天涯共此时。

举头望明月，低头思故乡。

山重水复疑无路，柳暗花明又一村。

不识庐山真面目，只缘身在此山中。

采菊东篱下，悠然见南山。

谁言寸草心，报得三春晖。

忽如一夜春风来，千树万树梨花开。

落霞与孤鹜齐飞，秋水共长天一色。

青山遮不住，毕竟东流去。

春江潮水连海平，海上明月共潮生。

两岸猿声啼不住，轻舟已过万重山。

问渠那得清如许？为有源头活水来。

竹外桃花三两枝，春江水暖鸭先知。

身无彩凤双飞翼，心有灵犀一点通。

众里寻他千百度，蓦然回首，那人却在，灯火阑珊处。

莫愁前路无知己，天下谁人不识君。

Query line:
明月几时有，把酒问青天
The code is as follows:

import chromadb
chroma_client = chromadb.PersistentClient(path="/chroma/shici")
# switch `create_collection` to `get_or_create_collection` to avoid creating a new collection every time
collection = chroma_client.get_or_create_collection(name="shici")
# switch `add` to `upsert` to avoid adding the same documents every time
collection.upsert(
    documents=[
        "海内存知己，天涯若比邻",
        "大漠孤烟直，长河落日圆",
        "春眠不觉晓，处处闻啼鸟",
        "会当凌绝顶，一览众山小",
        "海上生明月，天涯共此时",
        "举头望明月，低头思故乡",
        "山重水复疑无路，柳暗花明又一村",
        "不识庐山真面目，只缘身在此山中",
        "采菊东篱下，悠然见南山",
        "谁言寸草心，报得三春晖",
        "忽如一夜春风来，千树万树梨花开",
        "落霞与孤鹜齐飞，秋水共长天一色",
        "青山遮不住，毕竟东流去",
        "春江潮水连海平，海上明月共潮生",
        "两岸猿声啼不住，轻舟已过万重山",
        "问渠那得清如许？为有源头活水来",
        "竹外桃花三两枝，春江水暖鸭先知",
        "身无彩凤双飞翼，心有灵犀一点通",
        "众里寻他千百度，蓦然回首，那人却在，灯火阑珊处",
        "莫愁前路无知己，天下谁人不识君",
    ],
    ids=["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8", "id9", "id10", "id11", "id12", "id13", "id14", "id15", "id16", "id17", "id18", "id19", "id20"],
)
results = collection.query(
    query_texts=["明月几时有，把酒问青天"],  # Chroma will embed this for you
    n_results=5,  # how many results to return
    # where_document={"$contains":"apple"}
)
print(results)

The printed result:

{'ids': [['id6', 'id5', 'id13', 'id1', 'id20']], 'embeddings': None, 'documents': [['举头望明月，低头思故乡', '海上生明月，天涯共此时', '青山遮不住，毕竟东流去', '海内存知己，天涯若比邻', '莫愁前路无知己，天下谁人不识君']], 'uris': None, 'data': None, 'metadatas': [[None, None, None, None, None]], 'distances': [[0.4827560689814884, 0.5092440264675281, 0.5768293567797822, 0.5936621113091055, 0.5975973195463598]], 'included': [<IncludeEnum.distances: 'distances'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}

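As the result shows, the top two hits (id6 and id5) both mention the moon and both express longing, which matches the mood of the query line. Overall the retrieval quality is fairly good.

V. Summary

This article introduced the core concepts of chromadb and its basic operations, namely creating a collection, adding vectors, and querying, and then demonstrated retrieval quality with a case study.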
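Appendix: articles in this series

Vector Database Technology Series, Part 1: Basic Principles
Vector Database Technology Series, Part 2: Introduction to Milvus
Vector Database Technology Series, Part 3: Introduction to Chroma
Vector Database Technology Series, Part 4: Introduction to FAISS
Vector Database Technology Series, Part 5: Introduction to Weaviate

Original article: https://blog.csdn.net/tcy83/article/details/144943921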
