January 3, 2024

Web Scraping

First, what is Web Scraping?

Web scraping (or data scraping) describes how automated bots, known as web scrapers or web crawlers, extract data from websites. Search engines use the same technique to index pages and determine search rankings, so this may not be news. The process involves combing through the HTML code of a website to collect specific data points, creating an archive of data that AI developers can use to train generative AI tools like chatbots.

As of last year, GPT-4 can use web scraping to train on current internet data and improve AI models. With this development, AI chatbots could replace search engines as the way users access information.

However, given generative AI's propensity for hallucinations, the provenance of information is vital. The developers of the new Bing make it clear to users which websites its answers came from. This is good news for charities, as users will know where the information originated.

Should charities be concerned about content scraped from their websites being used by generative AI tools to create new content? Ian McLintock of the Charity Excellence Framework approaches web scraping as both an AI tool developer and a charity advisor. By scraping websites run by public sector sources (the Charity Commission and the Charities Regulator), he trained AI tools to answer questions about charity governance and regulation.

Because this information is published under the Open Government Licence, doing so is perfectly fine: you can copy, publish, distribute, and adapt it, but you must acknowledge the source.

This licence does not apply to non-profit or commercial websites. Even so, these sites can be web-scraped, so long as no personal information or material subject to intellectual property rights is accessed.

Robots Exclusion Protocol (REP)

The REP, implemented as a robots.txt file, tells web crawlers what they can access on a website, so administrators can prevent certain pages from being indexed by search engines. The robots.txt file represented a deal that used to be mutually beneficial: you let search engine crawlers scrape your site, and in return they sent people to it. Now generative AI companies can use your data to build models and give nothing in return.
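
For illustration, a minimal robots.txt (the directory name here is a placeholder, not a recommendation) that lets all crawlers access everything except one private directory looks like this:

User-agent: *
Disallow: /private/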

While most web crawlers respect the instructions in robots.txt, some malicious bots ignore them. IP blocking and CAPTCHA challenges can help here, as described below.

IP blocking and CAPTCHA

IP blocking prevents access to a website based on the IP address of the requesting user or bot. Administrators maintain a list of specific IP addresses or ranges that determines who may access the website. Blocking addresses associated with known offenders or suspicious activity mitigates unauthorized data extraction from your site.
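
As a minimal sketch (not a production setup), here is how a Python web application built with Flask might reject requests from a blocklist; the addresses below are placeholders drawn from reserved documentation ranges:

from flask import Flask, abort, request

app = Flask(__name__)

# Placeholder blocklist; replace with addresses you have identified as abusive
BLOCKED_IPS = {"203.0.113.7", "198.51.100.23"}

@app.before_request
def block_known_offenders():
    # request.remote_addr is the client IP as the server sees it;
    # behind a reverse proxy you would inspect X-Forwarded-For instead
    if request.remote_addr in BLOCKED_IPS:
        abort(403)  # refuse the request outright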

CAPTCHA challenges distinguish humans from bots. They can be simple picture puzzles that ask you to select the squares containing motorbikes, or strings of random letters and numbers that users must decipher to access the website.
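
In practice, most sites use a third-party CAPTCHA service rather than building their own. As a hedged sketch, server-side verification of a Google reCAPTCHA token might look like this in Python (secret_key and token are placeholders you would supply):

import requests

def captcha_passed(secret_key: str, token: str) -> bool:
    # POST the user's solved-challenge token to Google's reCAPTCHA
    # verification endpoint and read the JSON verdict
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": secret_key, "response": token},
        timeout=10,
    )
    return resp.json().get("success", False)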

Good web crawlers, like GPTBot (OpenAI's web crawler for ChatGPT), filter out personally identifiable information. But you can disallow GPTBot from accessing your site by adding it to your robots.txt like this:

User-agent: GPTBot
Disallow: /

You can instruct GPTBot to access specific parts of your website like this:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/

OpenAI has published robots.txt guidance for GPTBot covering these directives.

Charity Excellence controls web scraping using meta tags and amended terms and conditions.
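
For illustration (this is a generic robots meta tag, not Charity Excellence's actual markup), a tag like this in a page's HTML head asks compliant crawlers not to index the page or follow its links:

<meta name="robots" content="noindex, nofollow">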

In conclusion...

Worry about bots having access to websites is not new. Concerns about search engine bots accessing website data arose 30 years ago, and the solution was the Robots Exclusion Protocol (REP). Today, web scraping, data scraping, and data mining (it goes by many names) are covered by the existing GDPR. However, there have been instances of AI crawlers gathering gen AI training data while ignoring the robots.txt file. So it remains a hot topic, with new developments made constantly and new AI laws expected from the EU and the US.
