Scraping, short for “screen scraping” or “web scraping”, refers to the process of extracting information from websites or online services. This can be done manually with copy and paste, or with the help of a web scraping tool or a crawler. In this case, the information is scraped from the screen, so to speak.
In its automated form, only a few steps are needed to access and collect data via web scraping:
- Sending an HTTP GET request to a specific URL.
- As soon as the website responds, the scraper searches the HTML document for the data fields it was configured to look for.
- The data is extracted, and a report is created.
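The steps above can be sketched with Python's standard library alone. The HTML snippet, the `price` class, and the `PriceScraper` name below are hypothetical stand-ins; in practice, the document would first be fetched with an HTTP GET request (e.g. via `urllib.request.urlopen`), and a real project would more likely use a parsing library such as Beautiful Soup.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the body of an HTTP GET response;
# in a real scraper this would come from urllib.request.urlopen(url).read().
HTML_RESPONSE = """
<html><body>
  <span class="price">19.99</span>
  <span class="price">24.50</span>
</body></html>
"""

class PriceScraper(HTMLParser):
    """Collects the text of every element whose class attribute is 'price'."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

scraper = PriceScraper()
scraper.feed(HTML_RESPONSE)
print(scraper.prices)  # the extracted data, ready to be written to a report
```

Running this prints the two extracted values; writing them to a CSV or similar report file would complete the third step.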
Well-known examples are search engine crawlers that continuously scour the internet to index websites. Another example is comparison portals that use scraping to collect, evaluate and provide data. However, scraping tools can also be used to collect e-mail addresses or social media profiles, which are then sold in bulk to third parties.
Scraping is not legal in all situations, and the copyright and database rights of the website must be considered beforehand. Site operators also have the right to implement technical measures to prevent web scraping, and these must not be circumvented. However, as long as the data to be extracted is freely accessible to third parties on the internet, scraping is not illegal per se.
As a website operator, you have various options for protecting yourself from scrapers and crawlers.
- You may use a robots.txt file to ask search engine bots and other crawlers to stay away from certain areas of the site (though compliance with robots.txt is voluntary).
- Contact details and personal information can be embedded in images instead of in text.
- Rate limiting can be used to restrict the number of requests that a single IP address can make within a certain period of time.
- If many requests arrive from the same source in quick succession, the user can be asked to prove they are human using a CAPTCHA.
- To prevent the scraping of e-mail addresses, an [at] can be used instead of an @.
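Of the measures above, rate limiting is the most programmatic. A minimal sketch, assuming a sliding-window policy with made-up limits (`MAX_REQUESTS` per `WINDOW` seconds, names chosen for illustration):

```python
import time
from collections import defaultdict, deque

# Hypothetical policy: at most MAX_REQUESTS per IP within a WINDOW-second
# sliding window. Real servers usually enforce this in a reverse proxy.
MAX_REQUESTS = 5
WINDOW = 60.0

_history = defaultdict(deque)  # IP address -> timestamps of recent requests

def allow_request(ip, now=None):
    """Return True if the request may proceed, False if it is rate-limited."""
    now = time.monotonic() if now is None else now
    timestamps = _history[ip]
    # Discard timestamps that have fallen outside the sliding window.
    while timestamps and now - timestamps[0] > WINDOW:
        timestamps.popleft()
    if len(timestamps) >= MAX_REQUESTS:
        return False  # quota exceeded; a server would answer with HTTP 429
    timestamps.append(now)
    return True
```

With these limits, the first five requests from an address within a minute succeed and the sixth is rejected, typically answered with an HTTP 429 "Too Many Requests" status.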