LangChain from Beginner to Mastery (19): The Document Component and Using Document Loaders

义者艺也2 · Posted on 2025-9-7 23:39:35

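1. Document and Document Loaders

The Document class is a core component of LangChain. It defines the structure of a document object, covering the text content and its associated metadata. Document is also the state data passed between document loaders, text splitters, vector databases, and retrievers.
In older LangChain versions, Document also supported a lookup retrieval feature. In current versions the Document component only does the most basic job of recording information: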

Document = page_content (the text content) + metadata (the metadata)
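As a minimal sketch of that structure, the snippet below builds one Document by hand. The field values are made up for illustration, and the fallback dataclass is only a stand-in with the same two fields so that the snippet still runs where langchain-core is not installed.

```python
from dataclasses import dataclass, field

try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two fields, used only when langchain-core is missing.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

doc = Document(
    page_content="Wireless earbuds with a 30-hour battery life",  # hypothetical text
    metadata={"source": "./products.txt", "category": "audio"},   # hypothetical metadata
)

print(doc.page_content)
print(doc.metadata["source"])
```

Only page_content and metadata are set here; everything else a pipeline needs (chunk position, page number, and so on) is normally carried as extra keys inside metadata.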


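In RAG development, data is usually read from specific sources rather than entered by hand, for example local Markdown files, HTML pages, PDF documents, DOC documents, or URLs. The raw documents are then split into chunks of a given size according to a given rule, and finally the data is stored in a vector database. For this reason there is usually an additional extension outside the RAG application that is dedicated to the read, split, store flow. This flow is time-consuming: uploading a 30 MB document, for example, requires loading, splitting, and text embedding, so it is generally handled with a queue or asynchronously. The architecture flow diagram is updated as follows: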

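In the new architecture flow, the document loader's job is to extract the relevant information from all kinds of data and convert it into standard Document components, hiding the differences in how each file type is read.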
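The read, split, store flow and the idea of handling it through a queue can be pictured with a small standard-library sketch. Everything in it is an assumption made for illustration: the 20-character chunk size, the fake_embed placeholder, and the in-memory list standing in for a vector database. A real setup would use a LangChain loader, a text splitter, an embedding model, and a proper task queue.

```python
import os
import queue
import tempfile
import threading

CHUNK_SIZE = 20  # hypothetical chunk size, in characters

def load_text(path):
    """Load step: read the whole file as text."""
    with open(path, encoding="utf-8") as f:
        return f.read()

def split_text(text, size=CHUNK_SIZE):
    """Split step: cut the text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def fake_embed(chunk):
    """Placeholder for a real embedding model; returns a one-number 'vector'."""
    return [float(len(chunk))]

vector_store = []      # in-memory stand-in for a vector database
tasks = queue.Queue()  # uploaded file paths waiting to be processed

def worker():
    while True:
        path = tasks.get()
        try:
            if path is None:  # sentinel value: stop the worker
                break
            for chunk in split_text(load_text(path)):
                vector_store.append((chunk, fake_embed(chunk)))  # embed + store
        finally:
            tasks.task_done()

thread = threading.Thread(target=worker)
thread.start()

# Simulate an upload: the caller only enqueues the path and returns immediately.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("x" * 50)
tasks.put(tmp.name)
tasks.put(None)

tasks.join()   # wait until every queued task has been processed
thread.join()
os.remove(tmp.name)
print(len(vector_store))
```

The point of the queue is that the slow work (loading, splitting, embedding) happens in the worker, so the code that accepts the upload never blocks on it.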
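In LangChain, the base class of all document loaders is BaseLoader, which wraps five unified methods: load(), aload(), lazy_load(), alazy_load(), and load_and_split().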

See also: the LangChain document loaders documentation.
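To make the relationship between eager and lazy loading concrete, here is a simplified stand-in, not LangChain's source. It assumes the common pattern in which a subclass implements only lazy_load() and the eager load() simply drains that generator into a list. Doc, MiniLoader, and LineLoader are hypothetical names used only in this sketch.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

@dataclass
class Doc:
    """Simplified stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class MiniLoader:
    """Simplified stand-in for a loader base class (not LangChain's BaseLoader)."""

    def lazy_load(self) -> Iterator[Doc]:
        raise NotImplementedError

    def load(self) -> List[Doc]:
        # Eager loading is the lazy generator drained into a list.
        return list(self.lazy_load())

class LineLoader(MiniLoader):
    """Hypothetical loader that yields one document per non-empty line."""

    def __init__(self, text: str, source: str):
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[Doc]:
        for number, line in enumerate(self.text.splitlines(), start=1):
            if line.strip():
                yield Doc(line, {"source": self.source, "line": number})

loader = LineLoader("first\n\nthird", source="memory")
docs = loader.load()
print([d.metadata["line"] for d in docs])
```

With this split, a caller that only needs the first few documents can iterate over lazy_load() and stop early, while load() stays available for the simple case.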
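2. TextLoader Usage Tips and Source Code Analysis

The simplest loader component in LangChain is TextLoader. It loads a text file (source code, Markdown, plain text, or any other file stored as text; DOC is not a text file), reads the whole file into a single Document object, and adds a source field to the document's metadata to record where the data came from.
TextLoader is very simple to use; just pass the path of the text file:
Example code: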

from langchain_community.document_loaders import TextLoader

loader = TextLoader("./电商产品数据.txt", encoding="utf-8")
documents = loader.load()
print(documents)
print(len(documents))
print(documents[0].metadata)


Output:

[Document(page_content='xxx', metadata={'source': './电商产品数据.txt'})]
1
{'source': './电商产品数据.txt'}


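Under the hood, TextLoader uses the open function with the given encoding to open the file and read its content, then copies the path that was passed in into the metadata field of the generated Document instance. That is all it takes to load the data.
Core source code: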

# langchain_community/document_loaders/text.py -> TextLoader::lazy_load
def lazy_load(self) -> Iterator[Document]:
    """Load from file path."""
    text = ""
    try:
        with open(self.file_path, encoding=self.encoding) as f:
            text = f.read()
    except UnicodeDecodeError as e:
        if self.autodetect_encoding:
            detected_encodings = detect_file_encodings(self.file_path)
            for encoding in detected_encodings:
                logger.debug(f"Trying encoding: {encoding.encoding}")
                try:
                    with open(self.file_path, encoding=encoding.encoding) as f:
                        text = f.read()
                    break
                except UnicodeDecodeError:
                    continue
        else:
            raise RuntimeError(f"Error loading {self.file_path}") from e
    except Exception as e:
        raise RuntimeError(f"Error loading {self.file_path}") from e

    metadata = {"source": str(self.file_path)}
    yield Document(page_content=text, metadata=metadata)
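The encoding fallback in that method can be tried out without LangChain. The sketch below is a simplification, not the library's code: instead of detecting encodings it walks a fixed candidate list (an assumption of this example), and read_with_fallback is a hypothetical helper name.

```python
import os
import tempfile

def read_with_fallback(path, encoding="utf-8", candidates=("gbk", "latin-1")):
    """Try the requested encoding first, then each candidate in order."""
    try:
        with open(path, encoding=encoding) as f:
            return f.read(), encoding
    except UnicodeDecodeError as e:
        for candidate in candidates:
            try:
                with open(path, encoding=candidate) as f:
                    return f.read(), candidate
            except UnicodeDecodeError:
                continue
        raise RuntimeError(f"Error loading {path}") from e

# Write a GBK-encoded file, then read it while asking for UTF-8.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="gbk") as tmp:
    tmp.write("电商产品数据")

text, used = read_with_fallback(tmp.name)
os.remove(tmp.name)
print(text, used)
```

With TextLoader itself, the comparable behavior is switched on by passing autodetect_encoding=True when constructing the loader.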


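Taking TextLoader as the example, the same usage pattern carries over to the other document loaders that LangChain provides: pass the relevant information (file path, URL, directory, and so on) when instantiating the loader, then call its load() method to load the documents in one step.
3. Markdown Document Loader

The Markdown loader parses a local .md file into text and wraps it in the Document objects that LangChain uses, ready for subsequent embedding, matching, question answering, and so on.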

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("./项目API资料.md")
documents = loader.load()
print(documents)
print(len(documents))
print(documents[0].metadata)


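4. Office Document Loaders

Office document loaders (Document Loaders for Office Formats) are modules dedicated to Microsoft Office file types such as Word (.doc, .docx), Excel (.xls, .xlsx), and PowerPoint (.ppt, .pptx). They are suited to building document question answering (RAG), summary generation, knowledge-base indexing, and similar applications.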

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from langchain_community.document_loaders import (
    UnstructuredExcelLoader,
    UnstructuredPowerPointLoader,
    UnstructuredWordDocumentLoader,
)

# Excel: mode="elements" keeps each detected element as its own Document
# excel_loader = UnstructuredExcelLoader("./员工考勤表.xlsx", mode="elements")
# excel_documents = excel_loader.load()

# Word
# word_loader = UnstructuredWordDocumentLoader("./喵喵.docx")
# documents = word_loader.load()

# PowerPoint
ppt_loader = UnstructuredPowerPointLoader("./章节介绍.pptx")
documents = ppt_loader.load()
print(documents)
print(len(documents))
print(documents[0].metadata)


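5. URL Web Page Loaders

A URL web page loader is one kind of document loader provided by LangChain. Given a web page address (URL), it fetches the page content and wraps it in Document objects that LangChain supports, for subsequent RAG, querying, summarization, and other processing.
Common loaders: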

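Loader	Description	Dependencies
WebBaseLoader	The most commonly used; fetches the plain text of a page with requests + BeautifulSoup	requests, beautifulsoup4
SeleniumURLLoader	Supports pages whose content is loaded dynamically (e.g. JavaScript-rendered pages)	selenium, unstructured
UnstructuredURLLoader	Extracts structured elements such as paragraphs, headings, and images from a page	unstructured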

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from langchain_community.document_loaders import WebBaseLoader

# The URL needs an explicit scheme; a bare "www.baidu.com" is rejected by requests
loader = WebBaseLoader("https://www.baidu.com")
documents = loader.load()
print(documents)
print(len(documents))
print(documents[0].metadata)
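The general idea behind a web page loader, turning HTML into plain text plus a little metadata, can be illustrated offline. The sketch below is not how WebBaseLoader is implemented: it uses the standard library's html.parser on an inline HTML string instead of requests and BeautifulSoup, and the URL, the TextExtractor class, and the resulting dictionary layout are assumptions made for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text and the <title>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.title = ""
        self._skip = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif not self._skip and data.strip():
            self.parts.append(data.strip())

html = (
    "<html><head><title>Demo</title><style>p{}</style></head>"
    "<body><h1>Hello</h1><p>World</p></body></html>"
)
url = "https://example.com/demo"  # hypothetical address, never actually fetched

parser = TextExtractor()
parser.feed(html)
page = {
    "page_content": "\n".join(parser.parts),
    "metadata": {"source": url, "title": parser.title},
}
print(page)
```

The dictionary mirrors the page_content plus metadata shape of a Document, with the URL recorded as the source.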


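6. General-Purpose File Loader

The general-purpose file loader is LangChain's tool for reading the content of local or remote files (.pdf, .docx, .pptx, .html, .md, .txt, and so on).
Its core job is to convert this data into standard Document objects that LangChain can understand and process, for later use in question answering systems, information retrieval, summary generation, RAG, and other tasks.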

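File type	Content	Examples
.txt, .md	Plain-text files	Text logs, notes, Markdown documents
.pdf	Documents with layout	Reports, résumés, research papers
.docx	Word documents	Contracts, proposals, specifications
.pptx	Presentation slides	Corpus material for extending knowledge
.csv, .xlsx	Tables and data	Structured data files
.html	Web page archives	Local web page files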

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("./项目API资料.md")
documents = loader.load()
print(documents)
print(len(documents))
print(documents[0].metadata)
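If you would rather pick a dedicated loader per file type and keep the general loader only as a fallback, one possible approach is a lookup by file extension. The mapping below is a hypothetical choice for illustration; it only returns the names of loaders mentioned in this article and does not import or instantiate them. pick_loader_name and ./report.pdf are made-up names.

```python
from pathlib import Path

# Loader names come from the sections above; the mapping itself is an assumption.
LOADER_BY_SUFFIX = {
    ".txt": "TextLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".pptx": "UnstructuredPowerPointLoader",
    ".xlsx": "UnstructuredExcelLoader",
}

def pick_loader_name(path: str) -> str:
    """Return a loader class name for the path, defaulting to the general loader."""
    suffix = Path(path).suffix.lower()
    return LOADER_BY_SUFFIX.get(suffix, "UnstructuredFileLoader")

for name in ["./项目API资料.md", "./章节介绍.PPTX", "./report.pdf"]:
    print(name, "->", pick_loader_name(name))
```

Lowercasing the suffix keeps files such as .PPTX from silently falling through to the default.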

