1. Two core capabilities of LangChain
LangChain can connect an LLM to external data sources, and it lets applications interact with the LLM.
2. Basic features
LLM calls
- Support for many model interfaces, e.g. OpenAI, Hugging Face, AzureOpenAI, ..., plus a Fake LLM for testing
- Caching support, e.g. in-memory, SQLite, Redis, SQL
- Usage tracking
- Streaming mode (the response comes back one token at a time, like a typing effect)
Prompt management, with support for custom templates
A large set of document loaders, e.g. Email, Markdown, PDF, YouTube, ...
Index support
- Text splitters
- Vectorization
- Integration with vector stores and search, e.g. Chroma, Pinecone, Qdrant
Chains
- LLMChain
- Various tool chains
- LangChainHub
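The streaming mode mentioned above can be pictured as a generator that yields the response piece by piece instead of returning it all at once. This is only a toy stand-in for illustration, not the real OpenAI or LangChain streaming API:

```python
import time

def fake_stream(text, delay=0.0):
    """Yield a response one character at a time, like an LLM streaming API."""
    for ch in text:
        time.sleep(delay)  # simulate network latency between tokens
        yield ch

# A client would print each piece as it arrives for the typing effect
out = "".join(fake_stream("Hello"))
print(out)
```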
3. Key concepts
3.1 Loaders
Loaders exist for folders, CSV files, Evernote, Google Drive, arbitrary web pages, PDF, S3, YouTube, and more.
3.2 Document
After a loader reads a data source, the data must be converted into Document objects before it can be used.
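Conceptually, a Document just pairs the text content with some metadata. Below is a simplified stand-in for illustration; the real class, `langchain.schema.Document`, likewise exposes `page_content` and `metadata` fields:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal sketch of LangChain's Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

doc = Document(page_content="LangChain connects LLMs to data.",
               metadata={"source": "notes.txt"})
print(doc.page_content)
```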
3.3 Text splitters
Prompts and the OpenAI embedding API both have length limits, so long text needs to be split into chunks.
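The splitting idea can be sketched without LangChain: walk over the text and emit fixed-size chunks, letting each chunk overlap the end of the previous one so that context is not cut mid-thought. The helper below is a hypothetical, simplified stand-in for LangChain's CharacterTextSplitter:

```python
def split_text(text, chunk_size=10, chunk_overlap=3):
    """Split text into chunks of at most chunk_size characters,
    where consecutive chunks share chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("abcdefghijklmnopqrst", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrst']
```

The overlap is why neighbouring chunks share their boundary characters, which helps retrieval later.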
3.4 Vector stores
Relevance search over the data is a vector operation. Documents must first be converted into vectors before they can be searched; they cannot be queried directly.
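What a vector store does can be sketched in a few lines, using made-up 3-dimensional vectors in place of real embeddings: store (vector, document) pairs, then return the document whose vector has the highest cosine similarity to the query vector. Real stores like Chroma or Pinecone do the same thing at scale with approximate nearest-neighbour indexes:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "index": embedding vector -> document text
index = [
    ([1.0, 0.0, 0.1], "doc about cats"),
    ([0.0, 1.0, 0.1], "doc about finance"),
    ([0.9, 0.1, 0.0], "doc about kittens"),
]

def similarity_search(query_vec, index):
    # Brute-force nearest neighbour by cosine similarity
    return max(index, key=lambda item: cosine(query_vec, item[0]))[1]

print(similarity_search([1.0, 0.0, 0.1], index))  # doc about cats
```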
3.5 Chains
A chain can be understood as a task: one chain is one task.
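A chain of tasks can be pictured as a callable pipeline, where the output of one step feeds the input of the next. This is a plain-Python analogy, not the LangChain API:

```python
def make_chain(*steps):
    """Compose single-argument steps into one task: each step's output feeds the next."""
    def chain(x):
        for step in steps:
            x = step(x)
        return x
    return chain

# Two toy "tasks" standing in for LLM calls
normalize = lambda s: s.strip().lower()
exclaim = lambda s: s + "!"

pipeline = make_chain(normalize, exclaim)
print(pipeline("  Hello LangChain  "))  # hello langchain!
```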
3.6 Agents
An agent dynamically selects and calls the available tools on our behalf.
3.7 Embeddings
Embeddings measure the relatedness of text; they are the key to building a knowledge base on top of OpenAI.
4. Hands-on examples
4.1 A single question and answer

```python
import os
os.environ["OPENAI_API_KEY"] = 'your api key'

from langchain.llms import OpenAI

llm = OpenAI(model_name="text-davinci-003", max_tokens=1024)
llm("What do you think of artificial intelligence?")
```
4.2 Q&A with a Google search tool
SerpApi provides a Google Search API; register for a SerpApi account first.

```python
import os
os.environ["OPENAI_API_KEY"] = 'your api key'
os.environ["SERPAPI_API_KEY"] = 'your api key'

from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.llms import OpenAI
from langchain.agents import AgentType

# Load the OpenAI model
llm = OpenAI(temperature=0, max_tokens=2048)
# Load the serpapi tool
tools = load_tools(["serpapi"])
# To do a calculation after searching, write it like this:
# tools = load_tools(['serpapi', 'llm-math'], llm=llm)
# To run a simple Python (print-based) calculation after searching, write it like this:
# tools = load_tools(["serpapi", "python_repl"])

# Tools must be initialized after loading; verbose=True prints the full execution trace
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
# Run the agent
agent.run("What's the date today? What great events have taken place today in history?")
```
4.3 Summarizing very long text
First split the text into chunks, summarize each chunk, then merge the partial summaries.

```python
from langchain.document_loaders import UnstructuredFileLoader
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain import OpenAI

# Load the text
loader = UnstructuredFileLoader("/content/sample_data/data/lg_test.txt")
# Convert the text into Document objects
document = loader.load()
print(f'documents:{len(document)}')

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0
)
# Split the text
split_documents = text_splitter.split_documents(document)
print(f'documents:{len(split_documents)}')

# Load the llm model
llm = OpenAI(model_name="text-davinci-003", max_tokens=1500)
# Create the summarization chain
chain = load_summarize_chain(llm, chain_type="refine", verbose=True)
# Run the chain (only the first 5 chunks, for a quick demo)
chain.run(split_documents[:5])
```
4.4 Building a local Q&A bot
Vectorize the documents, vectorize the query, then retrieve.

```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain import OpenAI
from langchain.document_loaders import DirectoryLoader
from langchain.chains import RetrievalQA

# Load all txt files in the folder
loader = DirectoryLoader('/content/sample_data/data/', glob='**/*.txt')
# Convert the data into Document objects; each file becomes one document
documents = loader.load()

# Initialize the text splitter
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)
# Split the loaded documents
split_docs = text_splitter.split_documents(documents)

# Initialize the openai embeddings object
embeddings = OpenAIEmbeddings()
# Compute embedding vectors for the documents via openai embeddings and store them
# temporarily in the Chroma vector store, for matching queries later
docsearch = Chroma.from_documents(split_docs, embeddings)

# Create the Q&A object
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=docsearch.as_retriever(), return_source_documents=True)
# Ask a question
result = qa({"query": "What was iFLYTEK's revenue in the first quarter of this year?"})
print(result)
```
4.5 Building a persistent vector index
The vectorization in 4.4 is recomputed every run; Chroma and Pinecone can be used to persist the vector data instead.

Persisting with Chroma:

```python
from langchain.vectorstores import Chroma

# Persist the data
docsearch = Chroma.from_documents(documents, embeddings, persist_directory="D:/vector_store")
docsearch.persist()
# Load the data back
docsearch = Chroma(persist_directory="D:/vector_store", embedding_function=embeddings)
```

Persisting with Pinecone:

```python
# Persist the data
docsearch = Pinecone.from_texts([t.page_content for t in split_docs], embeddings, index_name=index_name)
# Load the data back
docsearch = Pinecone.from_existing_index(index_name, embeddings)
```
Q&A search over the persisted index:

```python
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import DirectoryLoader
from langchain.vectorstores import Chroma, Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
import pinecone

# Initialize pinecone
pinecone.init(
    api_key="your api key",
    environment="your environment"
)

loader = DirectoryLoader('/content/sample_data/data/', glob='**/*.txt')
# Convert the data into Document objects; each file becomes one document
documents = loader.load()

# Initialize the text splitter
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
# Split the loaded documents
split_docs = text_splitter.split_documents(documents)

index_name = "liaokong-test"
# Initialize the openai embeddings object
embeddings = OpenAIEmbeddings()

# Persist the data
# docsearch = Pinecone.from_texts([t.page_content for t in split_docs], embeddings, index_name=index_name)
# Load the data
docsearch = Pinecone.from_existing_index(index_name, embeddings)

query = "What was iFLYTEK's revenue in the first quarter of this year?"
docs = docsearch.similarity_search(query, include_metadata=True)

llm = OpenAI(temperature=0)
chain = load_qa_chain(llm, chain_type="stuff", verbose=True)
chain.run(input_documents=docs, question=query)
```
4.6 Building a YouTube channel Q&A bot with the GPT-3.5 model
Load the video and turn it into text, build a vector index, build a template and initialize the prompt, initialize the Q&A chain, then answer questions.

```python
import os
from langchain.document_loaders import YoutubeLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate
)

# Load the youtube video
loader = YoutubeLoader.from_youtube_url('https://www.youtube.com/watch?v=Dj60HHy-Kqk')
# Convert the data into Document objects
documents = loader.load()

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=20
)
# Split the youtube documents
documents = text_splitter.split_documents(documents)

# Initialize openai embeddings
embeddings = OpenAIEmbeddings()
# Store the data in the vector store
vector_store = Chroma.from_documents(documents, embeddings)
# Initialize a retriever from the vector store
retriever = vector_store.as_retriever()

system_template = """
Use the following context to answer the user's question.
If you don't know the answer, say you don't, don't try to make it up. And answer in Chinese.
-----------
{question}
-----------
{chat_history}
"""

# Build the initial messages list; think of it as the messages parameter passed to openai
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template('{question}')
]
# Initialize the prompt object
prompt = ChatPromptTemplate.from_messages(messages)

# Initialize the Q&A chain
qa = ConversationalRetrievalChain.from_llm(ChatOpenAI(temperature=0.1, max_tokens=2048), retriever, condense_question_prompt=prompt)

chat_history = []
while True:
    question = input('Question: ')
    # Send the question; chat_history is a required parameter that stores the conversation history
    result = qa({'question': question, 'chat_history': chat_history})
    chat_history.append((question, result['answer']))
    print(result['answer'])
```
4.7 Connecting OpenAI to thousands of tools
Use Zapier to connect thousands of tools; apply for a Zapier API key first (Get Started - Zapier AI Actions).

```python
import os
os.environ["ZAPIER_NLA_API_KEY"] = ''

from langchain.llms import OpenAI
from langchain.agents import initialize_agent
from langchain.agents.agent_toolkits import ZapierToolkit
from langchain.utilities.zapier import ZapierNLAWrapper

llm = OpenAI(temperature=.3)
zapier = ZapierNLAWrapper()
toolkit = ZapierToolkit.from_zapier_nla_wrapper(zapier)
agent = initialize_agent(toolkit.get_tools(), llm, agent="zero-shot-react-description", verbose=True)

# Print the tools configured in Zapier that the agent can use
for tool in toolkit.get_tools():
    print(tool.name)
    print(tool.description)
    print("\n\n")

agent.run('Summarize, in Chinese, the last email that "******@qq.com" sent me, and send the summary to "******@qq.com"')
```
4.8 Running multiple chains
Run multiple chains in sequence: define several chains and connect them with SimpleSequentialChain.

```python
from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chains import SimpleSequentialChain

# location chain
llm = OpenAI(temperature=1)
template = """Your job is to come up with a classic dish from the area that the users suggests.
% USER LOCATION
{user_location}

YOUR RESPONSE:
"""
prompt_template = PromptTemplate(input_variables=["user_location"], template=template)
location_chain = LLMChain(llm=llm, prompt=prompt_template)

# meal chain
template = """Given a meal, give a short and simple recipe on how to make that dish at home.
% MEAL
{user_meal}

YOUR RESPONSE:
"""
prompt_template = PromptTemplate(input_variables=["user_meal"], template=template)
meal_chain = LLMChain(llm=llm, prompt=prompt_template)

# Connect the chains with SimpleSequentialChain; the first chain's answer is
# substituted into user_meal of the second chain, which is then queried
overall_chain = SimpleSequentialChain(chains=[location_chain, meal_chain], verbose=True)
review = overall_chain.run("Rome")
```
4.9 Structured output
Define the output structure with StructuredOutputParser and have the model produce output in that structure.

```python
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI

llm = OpenAI(model_name="text-davinci-003")

# Tell the model which fields the output needs and what type each field is
response_schemas = [
    ResponseSchema(name="bad_string", description="This a poorly formatted user input string"),
    ResponseSchema(name="good_string", description="This is your response, a reformatted response")
]
# Initialize the parser
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

# The generated format instructions look like:
# {
#     "bad_string": string // This a poorly formatted user input string
#     "good_string": string // This is your response, a reformatted response
# }
format_instructions = output_parser.get_format_instructions()

template = """
You will be given a poorly formatted string from a user.
Reformat it and make sure all the words are spelled correctly

{format_instructions}

% USER INPUT:
{user_input}

YOUR RESPONSE:
"""
# Embed the format instructions into the prompt to tell the llm what output format we need
prompt = PromptTemplate(
    input_variables=["user_input"],
    partial_variables={"format_instructions": format_instructions},
    template=template
)

promptValue = prompt.format(user_input="welcom to califonya!")
llm_output = llm(promptValue)
# Parse the generated content with the parser
output_parser.parse(llm_output)
```
4.10 Formatting output via the prompt

```python
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import LLMRequestsChain, LLMChain

llm = OpenAI(model_name="gpt-3.5-turbo", temperature=0)

template = """Between >>> and <<< is the HTML content returned by the web page.
The page is a Sina Finance company profile for an A-share listed company.
Please extract the requested information.

>>> {requests_result} <<<

Return the data in the following JSON format:
{{
  "company_name": "a",
  "company_english_name": "b",
  "issue_price": "c",
  "date_of_establishment": "d",
  "registered_capital": "e",
  "office_address": "f",
  "Company_profile": "g"
}}
Extracted:"""

prompt = PromptTemplate(
    input_variables=["requests_result"],
    template=template
)

chain = LLMRequestsChain(llm_chain=LLMChain(llm=llm, prompt=prompt))
inputs = {
    "url": "https://vip.stock.finance.sina.com.cn/corp/go.php/vCI_CorpInfo/stockid/600519.phtml"
}
response = chain(inputs)
print(response['output'])
```
4.11 Using tools with a custom agent
Describe the situation each tool is for, so the agent picks the right custom tool for each task.

```python
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
from langchain.tools import BaseTool
from langchain.llms import OpenAI
from langchain import LLMMathChain, SerpAPIWrapper

llm = OpenAI(temperature=0)
# Initialize the search chain and the math chain
search = SerpAPIWrapper()
llm_math_chain = LLMMathChain(llm=llm, verbose=True)

# Build the tool list that tells the agent which tools it can use
tools = [
    Tool(
        name="Search",
        func=search.run,
        description="useful for when you need to answer questions about current events"
    ),
    Tool(
        name="Calculator",
        func=llm_math_chain.run,
        description="useful for when you need to answer questions about math"
    )
]

# Initialize the agent
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
# Run the agent
agent.run("Who is Leo DiCaprio's girlfriend? What is her current age raised to the 0.43 power?")
```
4.12 Building a bot with memory
Use the built-in memory support to build a bot that remembers the conversation.

```python
from langchain.memory import ChatMessageHistory
from langchain.chat_models import ChatOpenAI

chat = ChatOpenAI(temperature=0)

# Initialize the MessageHistory object
history = ChatMessageHistory()
# Add messages to the MessageHistory object
history.add_ai_message("Hi!")
history.add_user_message("What is the capital of China?")

# Run the conversation
ai_response = chat(history.messages)
print(ai_response)
```
4.13 Using Hugging Face models
There are two ways to use Hugging Face models: online and offline.

Using Hugging Face online:

```python
# Configure the hugging face environment variable
import os
os.environ['HUGGINGFACEHUB_API_TOKEN'] = ''

from langchain import PromptTemplate, HuggingFaceHub, LLMChain

template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm = HuggingFaceHub(repo_id="google/flan-t5-xl", model_kwargs={"temperature": 0, "max_length": 64})
llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What NFL team won the Super Bowl in the year Justin Beiber was born?"
print(llm_chain.run(question))
```

Loading the model locally:

```python
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM

model_id = 'google/flan-t5-large'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=100
)
local_llm = HuggingFacePipeline(pipeline=pipe)
print(local_llm('What is the capital of France? '))

template = """Question: {question} Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=local_llm)
question = "What is the capital of England?"
print(llm_chain.run(question))
```
4.14 Running SQL from natural language

```python
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain.llms.openai import OpenAI

db = SQLDatabase.from_uri("sqlite:///../notebooks/Chinook.db")
toolkit = SQLDatabaseToolkit(db=db)
agent_executor = create_sql_agent(
    llm=OpenAI(temperature=0),
    toolkit=toolkit,
    verbose=True
)
agent_executor.run("Describe the playlisttrack table")
```
5. References
- GitHub - liaokongVFX/LangChain-Chinese-Getting-Started-Guide: a Chinese getting-started tutorial for LangChain
- Deploying private HuggingFace models locally with LangChain - Zhihu
- Andrew Ng / OpenAI "ChatGPT Prompt Engineering" study notes 7: auto-replying to review comments and the temperature parameter - Zhihu