LangChain ChatGPT RAG with scraped website content

Nov 21, 2023

Retrieval-augmented generation (RAG) is a technique for augmenting LLM knowledge with additional data, often described as 'chat with your data'. In this tutorial we'll show you how to scrape all the content from any website using UseScraper, then load it into a vector store ready for RAG in LangChain using ChatGPT.

Crawling a website's content

First, sign up for a free UseScraper account (you get $25 of free credit on signup).

You can start a crawling job straight from the UseScraper dashboard, or via the /crawler/jobs API endpoint. Enter the URL of the website you want to scrape; feel free to use https://usescraper.com as a test. The other options can be left as is: Markdown is the default output format, which is ideal for our use case. Click "Start Job".
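
If you'd rather start the job from code instead of the dashboard, a rough sketch using Python's requests library might look like the following. Note that the base URL, auth header, and request body fields shown here are assumptions for illustration; check the UseScraper API docs for the exact schema.

import requests

# Hypothetical sketch of calling the /crawler/jobs endpoint.
# The base URL, auth scheme and body fields below are assumptions, not the documented schema.
API_KEY = "YOUR_USESCRAPER_API_KEY"
response = requests.post(
    "https://api.usescraper.com/crawler/jobs",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"urls": ["https://usescraper.com"], "output_format": "markdown"},
)
print(response.json())  # job details, which you can poll until the crawl completes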

You can then watch the progress of the crawling job live. The results will be shown when it's complete.

Now click the "Download markdown file of results" button to get all the content as a single markdown file. This is the file we're going to load into our vector store via LangChain.

LangChain RAG with the website content

Now we'll create a Python script that loads the markdown file containing all the website content and uses it for RAG Q&A.

First install the required libraries:

pip install -U langchain openai chromadb langchainhub unstructured markdown

Now here's our full script. It loads the markdown file and asks the question "What is UseScraper?" using the knowledge from the UseScraper.com website content we downloaded earlier. You'll need to set the markdown_path variable to the filename of the markdown file you downloaded from the UseScraper dashboard (and move that file into the same directory as this script). If you scraped another website, you'll also want to edit the line print(rag_chain.invoke("What is UseScraper?")) near the end to ask a more relevant question.

from langchain import hub
from langchain.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.vectorstores import Chroma
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key: ")

markdown_path = "scrape_https___usescraper.com_7YEG777KVS5R6X51SJ8S2EPKGA.md"

loader = UnstructuredMarkdownLoader(markdown_path)
docs = loader.load()

# Split Markdown file

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
# MD splits
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(docs[0].page_content)
# Char-level splits
chunk_size = 1000
chunk_overlap = 200
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
splits = text_splitter.split_documents(md_header_splits)

# Get embeddings and save to vectorstore
# print(splits)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is UseScraper?"))

# cleanup
vectorstore.delete_collection()

Save the code to rag.py and run it with python rag.py

You'll be prompted to enter your OpenAI API key; do so and hit enter. It'll take a few seconds to load the data and get a response from ChatGPT.

$ python rag.py
OpenAI API Key: 
UseScraper is a web crawling and scraping API that allows you to easily scrape the content of any website and save it in various formats such as text, markdown, or HTML. It offers advanced features and a scalable architecture to handle large crawling jobs. Additionally, it has fair and affordable pricing with no monthly fee and the ability to launch multiple crawling jobs simultaneously.

Great, it works!

Code walkthrough

Let's walk through how the code works step-by-step.

First we load the markdown_path file and use LangChain's splitter classes to break it into chunks: the MarkdownHeaderTextSplitter splits the content on its headers, then the RecursiveCharacterTextSplitter cuts those sections into smaller, overlapping character-level chunks.

markdown_path = "scrape_https___usescraper.com_7YEG777KVS5R6X51SJ8S2EPKGA.md"

loader = UnstructuredMarkdownLoader(markdown_path)
docs = loader.load()

# Split Markdown file

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
# MD splits
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(docs[0].page_content)
# Char-level splits
chunk_size = 1000
chunk_overlap = 200
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
splits = text_splitter.split_documents(md_header_splits)
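
If you want to sanity-check the splitting before building the vector store, you can print the number of chunks and peek at one of them (a quick inspection, not part of the original script):

print(len(splits))                   # number of chunks produced
print(splits[0].metadata)            # header metadata, e.g. {'Header 1': '...'}
print(splits[0].page_content[:200])  # first 200 characters of the chunk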

We then load this list of splits into the vector store (in this case a local Chroma instance) and create a retriever from it. If you'd like to see the split markdown content, you can uncomment the print line here.

# Get embeddings and save to vectorstore
# print(splits)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
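
Before wiring up the full chain, you can optionally query the retriever directly to confirm that relevant chunks come back (an extra check, not part of the original script):

# Fetch the chunks most similar to a test query and preview them
relevant = retriever.get_relevant_documents("What does UseScraper do?")
for doc in relevant:
    print(doc.metadata, doc.page_content[:100])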

Now that we have all the website content in the vector store, we set up the RAG chain and prompt.

prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
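
The rlm/rag-prompt pulled from LangChain Hub is a standard RAG prompt that tells the model to answer the question using only the retrieved context. If you'd rather not depend on langchainhub, you can define a roughly equivalent prompt locally (the wording below is an approximation of the hub prompt, not its exact text):

from langchain.prompts import ChatPromptTemplate

# Local stand-in for hub.pull("rlm/rag-prompt"); uses the same {context} and {question} variables
prompt = ChatPromptTemplate.from_template(
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know.\n\n"
    "Question: {question}\nContext: {context}\nAnswer:"
)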

Finally we send a question to the RAG chain.

print(rag_chain.invoke("What is UseScraper?"))

Run the script to see the response.

python rag.py

UseScraper is a web crawling and scraping API that allows you to easily scrape the content of any website and save it in various formats such as text, markdown, or HTML. It offers advanced features and a scalable architecture to handle large crawling jobs. Additionally, it has fair and affordable pricing with no monthly fee and the ability to launch multiple crawling jobs simultaneously.

Terrific! You can extend this script to take in more prompts, or access a hosted vector store. Combine it with the UseScraper API to programmatically crawl website content when needed.
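
For example, a minimal interactive loop that keeps answering questions against the same vector store might look like this (a sketch that replaces the single invoke call and the cleanup at the end of the script):

# Ask questions in a loop until the user types 'quit'
while True:
    question = input("Ask a question (or 'quit' to exit): ")
    if question.strip().lower() == "quit":
        break
    print(rag_chain.invoke(question))

# cleanup
vectorstore.delete_collection()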

Let's get scraping 🚀

Ready to start?

Get scraping now with a free account and $25 in free credits when you sign up.