Hand-written bots - what languages and libraries are often used?
Posted: Wed Dec 11, 2024 10:08 am
Web scraping is within reach for people who have basic knowledge of JavaScript or Python.
For JavaScript, commonly used libraries and frameworks are:
Node Fetch - a library for downloading resources from the network.
cheerio - an HTML parsing library that, when combined with Node Fetch, is ideal for extracting data from static pages.
Puppeteer - a tool for getting data from dynamically generated pages. It lets you control a browser from JavaScript code.
Playwright - a tool with similar functionality to Puppeteer.
Database client, e.g. SQLite - a database is useful for organizing and storing collected data. In many respects it is a better solution than storing data in text files.
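To illustrate why a database tends to work better than text files for scraped data, here is a minimal sketch using Python's built-in sqlite3 module (Python rather than JavaScript, only to keep the examples in this post in one language). The table name products, its columns, and the helper save_items are assumptions made up for this example, not part of any particular scraper.

```python
import sqlite3


def save_items(conn, items):
    """Store scraped (url, title, price) rows, overwriting rows with the same url."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "url TEXT PRIMARY KEY, title TEXT NOT NULL, price REAL)"
    )
    # INSERT OR REPLACE keeps one row per url, so re-scraping a page
    # updates the stored price instead of appending a duplicate.
    conn.executemany(
        "INSERT OR REPLACE INTO products (url, title, price) VALUES (?, ?, ?)",
        items,
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path to persist the data
    save_items(conn, [
        ("https://example.com/a", "Item A", 9.99),
        ("https://example.com/b", "Item B", 19.5),
    ])
    # A second run with a changed price replaces the old row.
    save_items(conn, [("https://example.com/a", "Item A", 8.49)])
    for row in conn.execute("SELECT url, price FROM products ORDER BY url"):
        print(row)
```

With a text file, the second run would normally append a duplicate line. Here the primary key on url keeps one row per page, and the data can later be queried with ordinary SQL.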
In the case of Python, we most often deal with:
Playwright,
Scrapy framework,
BeautifulSoup,
requests,
requests_html.
Low Code or No Code Solutions
Over the last decade, platforms that let people with no programming experience build their own internet bots have developed rapidly. These bots are already advanced enough to collect the data of interest in almost every case. Such platforms include, among others:
Apify
Make.com - a platform for automating processes with a visual editor
Zapier - an alternative to Make
n8n - an open source alternative to Zapier and Make
A Word About CSS Selectors and XPath
What if the web page changes its structure frequently? To make the web scraper as resistant as possible to changes in the page structure, CSS or XPath selectors should be as simple as possible, but not simpler. Avoid selectors that are too specific and contain unnecessary elements. After the page structure changes, such a selector may become useless because it no longer points to the resource of interest.
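The difference can be sketched with two versions of the same made-up page. This example uses the limited XPath support in Python's standard xml.etree.ElementTree, which only accepts well-formed markup; real pages usually need a tolerant HTML parser. The page snippets, both paths, and the extract helper are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# The same price before and after a hypothetical redesign of the page.
OLD_PAGE = """<html><body>
<div><div><span class="price">10.00</span></div></div>
</body></html>"""

NEW_PAGE = """<html><body>
<div>Banner</div>
<div><section><span class="price">10.00</span></section></div>
</body></html>"""

# Too specific: encodes every ancestor and the position of the first div.
BRITTLE = "./body/div[1]/div/span"
# Simple, but not simpler: any span with the class that identifies the price.
ROBUST = ".//span[@class='price']"


def extract(page, path):
    """Return the text of the first element matching path, or None."""
    node = ET.fromstring(page).find(path)
    return node.text if node is not None else None


if __name__ == "__main__":
    for name, page in (("old", OLD_PAGE), ("new", NEW_PAGE)):
        print(name, "brittle:", extract(page, BRITTLE), "robust:", extract(page, ROBUST))
```

The brittle path finds the price only on the old page, while the shorter path keeps working after the banner and the section wrapper are added.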
You need to find a happy medium between precision and robustness. By specifying selectors sensibly, you can become somewhat resistant to changes that the developer or administrator makes to the page structure. There are situations when the structure, attributes, class names, and identifiers change so often that you have to deal with them in a different way. In such situations, you need to be creative or familiar with other techniques for extracting data from the HTML structure, such as searching the structure for elements containing a specific piece of text.
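Searching by text can be sketched with Python's standard html.parser module alone. The idea is to ignore tags, classes, and identifiers entirely and return the text that follows a known label. The class TextCollector, the function value_after_label, and the sample markup are hypothetical names and data invented for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text fragments of a page in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def value_after_label(html, label):
    """Return the text fragment that follows the first one containing label."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    # Skip the last fragment as a candidate label: nothing follows it.
    for i, chunk in enumerate(parser.chunks[:-1]):
        if label in chunk:
            return parser.chunks[i + 1]
    return None


if __name__ == "__main__":
    # Class names and ids look auto-generated, so selectors on them would not last.
    page = '<div class="x9f2"><b id="a1">Price:</b> <i class="zz81">19.99 USD</i></div>'
    print(value_after_label(page, "Price:"))
```

This approach breaks only when the visible label itself changes, which tends to happen less often than changes to generated class names. The trade-off is that it depends on the page language and wording.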
Summary
Web scraping is a technique of manually or automatically collecting data from the Internet. This can be done using special tools or programs that crawl websites and collect information such as text, images, or product pricing data.
The main advantage of web scraping is that it allows for the quick and automatic collection of large amounts of data, which can then be used for various purposes, e.g. to create product price lists, monitor price changes on the market or research online trends.