How to fetch SERPs and save them to ChromaDB using Langchain
Published on: May 12, 2023 by Joona Tuunanen
Recently I’ve been very interested in Langchain and its capabilities. I think chaining different tools together with LLMs is a great idea, and Langchain seems to provide good wrappers for many useful tasks.
I have a bunch of different ideas for what I want to build on top of Langchain. Most of them are somehow related to organic search, because SEO is still very interesting to me.
While Langchain offers many wrappers out of the box or with easy integration, including SERP scraping, it didn’t seem to be able to do what I wanted it to do. Or perhaps the docs were just insufficient.
In any case, I wanted to:
- Give my script a keyword (or keywords)
- Go and fetch the top X ranking URLs (in this example 20). I’m using dataforseo.com for this, but there are other options as well.
- Extract the body text of those URLs and save it all locally to one file
- …so that I can use all that for my purposes
Here’s a breakdown of my code that accomplishes all that and saves the SERPs first to a .txt file and then to ChromaDB for further processing. All this is written in (bad) Python. Mandatory 🐍🔥 -emoji.
I don’t pretend to be an expert in any of this, but this process works for me as of today (May 9th 2023).
Install Langchain and Trafilatura
It’s not a surprise you’re going to need to install Langchain to get this process to work. It probably works something like this on your computer terminal:
$ pip install langchain
You’ll also need to install Trafilatura, as that’s the package I decided to use for scraping the URLs and extracting their main body content.
$ pip install trafilatura
If neither of the spells above works, please refer to their docs, ChatGPT or whatever you’re using to help with programming things.
Import packages
Here are all the things I’ve needed to import to get this to work:
from langchain.agents import load_tools
from langchain.utilities import TextRequestsWrapper
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
import trafilatura
import json
Basically you’re importing a bunch of different tools and utilities from Langchain that make this process easier. Here we’re also importing Trafilatura so that we can crawl and scrape.
Set variables
You’ll need to set a few variables in the beginning to get this process humming:
DataForSEO_endpoint = "https://api.dataforseo.com/v3/serp/google/organic/live/regular"
DataForSEO_username = "your_username"
DataForSEO_password = "your_password"
keyword = "your keyword"
You’ll notice that here we set the endpoint, username and password for fetching data from DataForSEO, plus the keyword we want the SERPs for.
With this, we’re more or less done with the setup and can start fetching and dealing with data.
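If you’d rather not keep the password in the script itself, one option is to read the credentials from environment variables. Here’s a minimal sketch of that. The function name and the variable names DATAFORSEO_USERNAME and DATAFORSEO_PASSWORD are just something I made up for this example; DataForSEO doesn’t require them.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) for DataForSEO from environment variables.

    DATAFORSEO_USERNAME and DATAFORSEO_PASSWORD are hypothetical names
    chosen for this sketch, not something DataForSEO mandates.
    """
    try:
        return env["DATAFORSEO_USERNAME"], env["DATAFORSEO_PASSWORD"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError
        raise RuntimeError(f"Missing environment variable: {missing}") from None
```

In the script you would then write DataForSEO_username, DataForSEO_password = load_credentials() instead of hard-coding the two values.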
Fetch SERPs from DataForSEO
As the goal is to get the top X ranking URLs from Google Search Results (SERPs), we first need to define what data we want and where to get it from.
# these make it possible to use requests library using Langchain
requests_tools = load_tools(["requests_all"])
requests = TextRequestsWrapper()
# setting up post data to be sent to the DataForSEO API
post_data = dict()
post_data[len(post_data)] = dict(
    keyword=keyword,
    language_name="English",
    location_code=2840,
    depth=20,
    device="desktop"
)
# create empty list for urls
urls = list()
# Get SERPs from DataForSEO
def get_serps(post_data):
    raw_serps = requests.post(DataForSEO_endpoint, data=post_data,
                              auth=(DataForSEO_username, DataForSEO_password))
    # read raw_serps as json
    data = json.loads(raw_serps)
    # loop through the results and append urls to urls list
    for k in data["tasks"][0]["result"][0]["items"]:
        # skip items that have no url (missing key or None)
        if k.get("url") is not None:
            urls.append(k["url"])
    # len() returns an int, so format it instead of concatenating it to a string
    print(f"Successfully fetched {len(urls)} SERPs")
    # uncomment below if you want to see all the fetched urls
    # print(urls)
    return data
get_serps(post_data)
Please tailor the code above to fit your needs. The naming conventions should be straightforward to understand except for location_code, for which you’ll need to refer to the DataForSEO documentation.
I’ve tried to leave comments in the code to make it easier to read (and for GitHub Copilot to help). But basically I define a function get_serps() which takes the above defined post_data as an argument. The function simply makes a request to DataForSEO, reads the results and then parses the URLs from them before appending them to the urls list.
The final row of the above code block calls the function.
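The parsing part of get_serps() is easy to try out without calling the API at all. Below is a small sketch of the same idea as a standalone function. Unlike my function above, it returns a new list instead of appending to a global one, and it goes through every task rather than only the first. The sample payload is made up and only mimics the tasks → result → items → url nesting used above; the real response has many more fields.

```python
def extract_urls(data):
    """Collect non-empty URLs from a parsed DataForSEO-style response dict."""
    urls = []
    for task in data.get("tasks") or []:
        for result in task.get("result") or []:
            for item in result.get("items") or []:
                url = item.get("url")
                if url:
                    urls.append(url)
    return urls


# Made-up payload that only mimics the nesting used in get_serps()
sample = {
    "tasks": [
        {"result": [{"items": [
            {"type": "organic", "url": "https://example.com/a"},
            {"type": "organic", "url": None},
            {"type": "organic", "url": "https://example.com/b"},
        ]}]},
        {"result": None},  # a task that came back without results
    ]
}

print(extract_urls(sample))
# → ['https://example.com/a', 'https://example.com/b']
```

The "or []" parts are there so that a missing or empty level in the response gives an empty list instead of a crash.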
Getting the main content from URLs
Now that I have all the URLs, I can go and fetch the main body content for them.
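def get_html(urls):
    htmls = list()
    extracted = list()
    for url in urls:
        # fetch_url returns None when the download fails
        html = trafilatura.fetch_url(url)
        # extracts text; keeps None as a placeholder if there was nothing to extract from
        extracted.append(trafilatura.extract(html, output_format="txt") if html else None)
        htmls.append(html)
    return htmls, extracted
htmls, extracted = get_html(urls)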
This code block defines the function get_html, which takes the URLs fetched above as an argument. Please note that I’ve decided to get both the raw HTML and the extracted body content, as it’s so easy to collect them side by side. Those are saved to “htmls” and “extracted” respectively.
The final row again calls this function and saves the returned HTMLs and extracted body texts to their own variables. I’m not using the HTML for anything at the moment, but it might come in handy for some other project down the line, so there’s no harm in having it in this script already.
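If you want to keep track of which text came from which URL, a variation like the one below might be useful. This is only a sketch: fetch_and_extract is a name I made up, and the download and extraction functions are passed in as arguments so that the example runs without network access. With Trafilatura you would pass trafilatura.fetch_url and trafilatura.extract, both of which return None when they fail.

```python
def fetch_and_extract(urls, fetch, extract):
    """Return {url: text} for every URL whose download and extraction worked.

    fetch(url) should return HTML or None; extract(html) should return
    text or None.
    """
    texts = {}
    for url in urls:
        html = fetch(url)
        if html is None:
            continue  # download failed, nothing to extract
        text = extract(html)
        if text:
            texts[url] = text
    return texts


# Stand-in functions so the sketch runs without network access
fake_pages = {"https://example.com/a": "<p>Hello</p>"}
result = fetch_and_extract(
    ["https://example.com/a", "https://example.com/missing"],
    fetch=fake_pages.get,
    extract=lambda html: html.replace("<p>", "").replace("</p>", ""),
)
print(result)
# → {'https://example.com/a': 'Hello'}
```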
Write the extracted body content locally into a .txt file
The next step is to write all the extracted content locally into a .txt file so that I don’t need to go and fetch the SERPs whenever I want to use the content found in there. Yes, SERPs change, but for my purposes, I just need to have a good understanding of what those pages ranking on top of the SERPs actually say about the topic.
In other words, I care a lot more about the combined info found in the SERPs than any individual page that exists there.
# write the extracted text to a file keyword.txt
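with open(f'{keyword}.txt', 'w') as f:
    for doc in extracted:
        # skip pages where extraction returned None
        if doc is not None:
            # a blank line keeps the pages apart in the combined file
            f.write(doc + "\n\n")
# the with block closes the file automatically, so no f.close() is needed
print(f"Successfully wrote SERPS to {keyword}.txt")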
Here I’m opening a file named after the keyword, with the purpose of writing the scraped content there. Keyword just refers to the keyword I defined at the top of the script. There’s also very basic error handling there, as sometimes I received None instead of extracted text and that crashed the script.
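One thing to watch out for: a keyword that contains a slash or other special characters doesn’t make a good file name. Here’s a rough sketch of how the file name could be cleaned up first. Both function names are made up for this example, and the cleaning rule (keep only ASCII letters and digits) is a deliberately conservative assumption that you may want to loosen for non-English keywords.

```python
import re
from pathlib import Path


def keyword_to_filename(keyword):
    """Turn a keyword into a conservative file name such as 'best_crm.txt'."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", keyword).strip("_").lower()
    # Fall back to a fixed name if nothing usable is left
    return f"{slug or 'serps'}.txt"


def write_docs(docs, path):
    """Write the non-empty docs to path, separated by blank lines."""
    kept = [doc for doc in docs if doc]
    Path(path).write_text("\n\n".join(kept), encoding="utf-8")
    return len(kept)


print(keyword_to_filename("Best CRM / 2023?"))
# → best_crm_2023.txt
```

Calling write_docs(extracted, keyword_to_filename(keyword)) would then replace the with block above. Note that this sketch writes UTF-8 explicitly, so the file should be read back with the same encoding.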
And finally, save the data to ChromaDB with OpenAI embeddings
It makes a lot of sense to process the content at this point before saving it for further use. I’ve found the Langchain docs to be not that great, which is understandable as the whole framework changes and develops so quickly.
That also means that I can’t completely explain what the stuff below does, but hey, it works:
# load the documents from {keyword}.txt
persist_directory = 'db'
loader = TextLoader(f'{keyword}.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=20)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=docs, embedding=embeddings,
                                 persist_directory=persist_directory)
vectordb.persist()
vectordb = None
First off, the recently saved .txt file is loaded and then split into smaller chunks (rows 1-4). I’ve been playing around with different chunk sizes and overlaps and decided to use 400 and 20 for this. Feel free to change them around. The default values as of now are 1000 and 0. For me a slight overlap makes sense, but what do I know about anything.
The second half of the code creates embeddings for the vector database using OpenAI’s embeddings. I’m also trying to persist the data locally, hence all that persist stuff you see there. I’m still learning about dealing with vector databases, but based on my understanding, at this point all the data is there, ready to be used for the next steps.
And if not, at least I have the raw text data that I can feed to the vector database later on when I know better what I’m doing.
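Going back to chunk size and overlap for a moment: here’s a tiny sketch that shows what the two numbers mean. To be clear, this is not how Langchain’s CharacterTextSplitter works (that one splits on a separator and then merges the pieces up to the chunk size). This is just a plain sliding window over characters that I wrote to illustrate the idea, and sliding_chunks is a made-up name.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the overlap. With 400 and 20, a 1000-character text would end up as three chunks in this simplified version.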
So what’s all this good for?
I have several ideas for what I want to experiment with based on this stuff. Here are a few of them:
- I can build a chatbot to ask about all the content found on top of the SERPs. This allows me to dig deeper into the topics and saves me from having to check all the pages one by one.
- I can summarize all the content.
- I can do some prompt engineering and use the content as a basis for something new. I’ll leave the rest for your imagination.. :)
- I can use some NLP stuff like entity extraction etc. to form a better understanding of what kind of content ranks for the kw.
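To make the chatbot idea a bit more concrete, here’s a toy sketch of the retrieval step: given a question, find the chunks that are most similar to it. In the real thing the vector database and OpenAI embeddings would do this. Here plain word counts stand in for embeddings so that the example runs without an API key. All the function names and sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words counts; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Trafilatura extracts the main text from web pages.",
    "ChromaDB stores embeddings for similarity search.",
    "SERPs change often, so cached copies go stale.",
]
print(top_chunks("Which tool stores embeddings?", chunks, k=1))
# → ['ChromaDB stores embeddings for similarity search.']
```

The chunks that come back would then be handed to an LLM together with the question.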
So, there are a lot of things I can do with this type of data. And if it turns out that all this was useless in the end, hey at least I learned a bit more about Langchain :).