Adding Data to a Collection

Original: Adding Data to a Collection

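In one sentence

Use the .add method to insert new records into a Chroma collection. Every record needs a unique ID; you can supply documents, embeddings, or both, and metadatas are optional.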

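When to consult this page

Consult this page when you need to add data to a Chroma vector store: text content, precomputed embeddings, or only embeddings and metadata, with the IDs linking back to documents stored elsewhere.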

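Core concepts

Collection: the container that stores data in Chroma
ID: the unique identifier of each record; it must be a string
Documents: text content; if you supply documents without embeddings, Chroma generates the embeddings automatically
Embeddings: vector representations, either supplied by you or generated by Chroma
Metadatas: optional metadata; values can be strings, integers, floats, booleans, or arrays of these types
Chunk: in a RAG system, one of the small pieces a document is split into; each chunk can have its own ID, embedding, and metadata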
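
To make the Chunk concept concrete, here is a minimal sketch of turning one document into per-chunk records. The helper name build_chunk_records, the fixed-size character split, the ID scheme, and the metadata keys source and chunk_index are all assumptions made for illustration; they are not part of Chroma.

```python
def build_chunk_records(doc_id: str, text: str, chunk_size: int = 200) -> dict:
    """Split `text` into fixed-size character chunks and build add() arguments.

    Hypothetical helper: the ID scheme and metadata keys are illustrative only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return {
        # One unique string ID per chunk
        "ids": [f"{doc_id}-chunk-{n}" for n in range(len(chunks))],
        # The chunk text itself
        "documents": chunks,
        # Metadata that points each chunk back to its source document
        "metadatas": [{"source": doc_id, "chunk_index": n} for n in range(len(chunks))],
    }


records = build_chunk_records("handbook", "a" * 450, chunk_size=200)
# records["ids"] == ["handbook-chunk-0", "handbook-chunk-1", "handbook-chunk-2"]
```

With a real collection, the result could then be passed along as collection.add(**records). Real pipelines usually split on sentence or token boundaries rather than raw character counts.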

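How to do it

Use the .add method to add records to a collection
The ids argument is required; each record needs a unique string ID
You must supply documents, embeddings, or both
The metadatas argument is optional
If you supply only documents, Chroma generates the embeddings with the collection's embedding function
If you have already computed the embeddings, pass them together with the documents
If the documents are stored externally, add only embeddings and metadatas, and use the ids to link back to the external documents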
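
The rules above can be sketched as a pre-flight check that runs before calling .add. The function check_add_args is a hypothetical helper, not a Chroma API, and the requirement that every list has one entry per ID is an assumption about how the arguments line up; Chroma's own validation may be stricter or differ in detail.

```python
from typing import Optional, Sequence


def check_add_args(
    ids: Sequence[str],
    documents: Optional[Sequence[str]] = None,
    embeddings: Optional[Sequence[Sequence[float]]] = None,
    metadatas: Optional[Sequence[dict]] = None,
) -> None:
    """Raise ValueError if the arguments break the rules listed above."""
    # Every ID must be a string, and IDs must be unique
    if not all(isinstance(i, str) for i in ids):
        raise ValueError("every id must be a string")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    # At least one of documents / embeddings is required; metadatas stay optional
    if documents is None and embeddings is None:
        raise ValueError("provide documents, embeddings, or both")
    # Assumption: each supplied list carries exactly one entry per id
    for name, values in (
        ("documents", documents),
        ("embeddings", embeddings),
        ("metadatas", metadatas),
    ):
        if values is not None and len(values) != len(ids):
            raise ValueError(f"{name} must have one entry per id")


check_add_args(ids=["id1", "id2"], documents=["doc1", "doc2"])  # passes silently

try:
    check_add_args(ids=["id1", "id2"])  # neither documents nor embeddings
except ValueError as err:
    print(err)  # prints "provide documents, embeddings, or both"
```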

Command / API quick reference

Python

# Basic add
collection.add(
    ids=["id1", "id2", "id3"],
    documents=["lorem ipsum...", "doc2", "doc3"],
    metadatas=[{"chapter": 3, "verse": 16}, {"chapter": 3, "verse": 5}, {"chapter": 29, "verse": 11}],
)

# Supply precomputed embeddings
collection.add(
    ids=["id1", "id2", "id3"],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]],
    documents=["doc1", "doc2", "doc3"],
    metadatas=[{"chapter": 3, "verse": 16}, {"chapter": 3, "verse": 5}, {"chapter": 29, "verse": 11}],
)

# Add only embeddings and metadatas (documents are stored externally)
collection.add(
    ids=["id1", "id2", "id3"],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]],
    metadatas=[{"chapter": 3, "verse": 16}, {"chapter": 3, "verse": 5}, {"chapter": 29, "verse": 11}],
)

# Use array-valued metadata
collection.add(
    ids=["id1"],
    documents=["lorem ipsum..."],
    metadatas=[{
        "chapter": 3,
        "tags": ["fiction", "adventure"],
        "scores": [1, 2, 3],
    }],
)

TypeScript

// Basic add
await collection.add({
    ids: ["id1", "id2", "id3"],
    documents: ["lorem ipsum...", "doc2", "doc3"],
    metadatas: [{"chapter": 3, "verse": 16}, {"chapter": 3, "verse": 5}, {"chapter": 29, "verse": 11}],
});

// Supply precomputed embeddings
await collection.add({
    ids: ["id1", "id2", "id3"],
    embeddings: [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]],
    documents: ["doc1", "doc2", "doc3"],
    metadatas: [{"chapter": 3, "verse": 16}, {"chapter": 3, "verse": 5}, {"chapter": 29, "verse": 11}],
});

// Add only embeddings and metadatas (documents are stored externally)
await collection.add({
    ids: ["id1", "id2", "id3"],
    embeddings: [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]],
    metadatas: [{"chapter": 3, "verse": 16}, {"chapter": 3, "verse": 5}, {"chapter": 29, "verse": 11}],
});

// Use array-valued metadata
await collection.add({
    ids: ["id1"],
    documents: ["lorem ipsum..."],
    metadatas: [{
        chapter: 3,
        tags: ["fiction", "adventure"],
        scores: [1, 2, 3],
    }],
});

Rust

// Basic add
collection.add(
    vec!["id1".to_string(), "id2".to_string(), "id3".to_string()],
    vec![
        vec![1.1, 2.3, 3.2],
        vec![4.5, 6.9, 4.4],
        vec![1.1, 2.3, 3.2],
    ],
    Some(vec![
        Some("lorem ipsum...".to_string()),
        Some("doc2".to_string()),
        Some("doc3".to_string()),
    ]),
    None,
    None,
).await?;

// Add only embeddings (documents are stored externally)
collection.add(
    vec!["id1".to_string(), "id2".to_string(), "id3".to_string()],
    vec![
        vec![1.1, 2.3, 3.2],
        vec![4.5, 6.9, 4.4],
        vec![1.1, 2.3, 3.2],
    ],
    None,
    None,
    None,
).await?;

// Use array-valued metadata
use chroma::types::{Metadata, MetadataValue};

let mut metadata = Metadata::new();
metadata.insert("chapter".into(), MetadataValue::Int(3));
metadata.insert(
    "tags".into(),
    MetadataValue::StringArray(vec!["fiction".to_string(), "adventure".to_string()]),
);
metadata.insert("scores".into(), MetadataValue::IntArray(vec![1, 2, 3]));

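Connection to Hello-Agents / LangGraph / this blog's handbook index

In the memory and retrieval chapter of Hello-Agents, we learned how to use Chroma as a vector store for storing and retrieving document embeddings. This page details how to add data to a Chroma collection, which is a basic step in building a RAG system. In LangGraph, when a node needs stronger memory, Chroma can store and retrieve the relevant document chunks. This blog's handbook index is already built on Chroma, and the add methods described here are the core API used to build it.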

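Common beginner mistakes

Not giving every record a unique ID: a record whose ID already exists in the collection is ignored
Supplying only embeddings without documents, then forgetting to use the ids to link back to the external documents
Mixing element types within a metadata array, for example strings and numbers
Trying to add an empty array as a metadata value, which is not allowed
Supplying embeddings whose dimension does not match the embeddings already in the collection, which raises an exception
Trying to update data by adding a record with an existing ID; use the update method instead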
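
Several of the mistakes above can be caught before the data reaches Chroma. The sketch below is one way to do that; check_metadata and check_dimensions are hypothetical helpers, and the strict same-type rule (an int and a float in one array count as mixed) is an assumption that may be stricter than Chroma's own validation.

```python
SCALAR_TYPES = (str, int, float, bool)


def check_metadata(metadata: dict) -> None:
    """Reject empty arrays, mixed-type arrays, and unsupported value types."""
    for key, value in metadata.items():
        if isinstance(value, list):
            if not value:
                raise ValueError(f"{key!r}: empty arrays are not allowed")
            first_type = type(value[0])
            if first_type not in SCALAR_TYPES:
                raise ValueError(f"{key!r}: unsupported element type {first_type.__name__}")
            # Assumption: every element must have exactly the same type as the first
            if any(type(item) is not first_type for item in value):
                raise ValueError(f"{key!r}: array elements must share one type")
        elif not isinstance(value, SCALAR_TYPES):
            raise ValueError(f"{key!r}: unsupported type {type(value).__name__}")


def check_dimensions(embeddings: list, expected_dim: int) -> None:
    """Reject any embedding whose length differs from the collection's dimension."""
    for index, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"embedding {index} has {len(vector)} dimensions, expected {expected_dim}"
            )


# Both calls pass silently for well-formed input
check_metadata({"chapter": 3, "tags": ["fiction", "adventure"], "scores": [1, 2, 3]})
check_dimensions([[1.1, 2.3, 3.2], [4.5, 6.9, 4.4]], expected_dim=3)
```

Here expected_dim stands for the dimension of the embeddings already stored in the collection, which the caller has to know or look up separately.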

Semantic search