Real-time web scraping, a new world to explore
Web scraping is a fairly old technique for extracting information from websites, and there is plenty of material about it on the Internet.
But aren't APIs meant for exactly this? Yes, of course, but sadly APIs are not everywhere. The most typical situation is a website with interesting information that is meant to be read by humans and offers no API for accessing that information directly from your code.
By scraping the website you can extract the information you need: you parse its pages and turn them into machine-readable data. That lets you process and analyse the site's information with your favourite framework.
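To make "parsing the website and making it machine-readable" concrete, here is a minimal sketch using only Python's standard library. The HTML snippet, the symbols in it and the names `PriceTableParser` and `parse_prices` are invented for illustration; a real page will need its own parsing rules.

```python
from html.parser import HTMLParser


class PriceTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while we are inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:       # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None


def parse_prices(html):
    """Turn the human-readable table into a {symbol: price} dictionary."""
    parser = PriceTableParser()
    parser.feed(html)
    parser.close()
    return {name.strip(): float(price) for name, price in parser.rows}


# Hypothetical page fragment, standing in for a downloaded website.
SAMPLE_HTML = """
<table>
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td>ACME</td><td>12.50</td></tr>
  <tr><td>GLOBEX</td><td>7.25</td></tr>
</table>
"""

prices = parse_prices(SAMPLE_HTML)
print(prices)
```

Once the data is in a plain dictionary like this, it can be handed to whatever analysis code you prefer.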
There are a lot of tools for scraping websites, including many powerful Python libraries, but most of them are not built with speed in mind. You typically don’t need real-time information, so that is usually fine.
But what happens if you do need real-time information? For example, you might need live stock prices, betting odds, flight prices, free parking spaces, etc. They all have one characteristic in common: they change, and they change quickly!
In all these cases you have to apply more sophisticated techniques to keep the information up to date, and performance becomes a key requirement that you cannot ignore.
In the next articles I will investigate this problem, trying to find out:
The time needed for each step of scraping a website, identifying where the bottlenecks are.
The best framework for scraping real-time information. I will consider Python, Golang, Rust and plain curl (Bash) and compare them.
The list of problems and optimizations that need to be addressed in order to avoid real-time issues.
(Bonus) A good architecture for collecting information from multiple sources. This may not matter for stock prices (the stock price is universal), but it is relevant, for example, for betting odds, where each bookmaker offers different odds on the same event.
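As a taste of the first question in the list above, here is a rough sketch of how the time spent in each step could be measured in Python. The split into fetch, decode and parse steps, the function names and the `example.invalid` URL are assumptions made for this example, not the final methodology. The demo uses a fake fetch function with an artificial delay so that it runs without network access.

```python
import time
import urllib.request


def fetch(url):
    """Download the raw bytes of a page (the network step)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read()


def timed_scrape(url, fetch_fn, parse_fn):
    """Run fetch -> decode -> parse and return (result, seconds per step)."""
    timings = {}

    start = time.perf_counter()
    raw = fetch_fn(url)
    timings["fetch"] = time.perf_counter() - start

    start = time.perf_counter()
    text = raw.decode("utf-8", errors="replace")
    timings["decode"] = time.perf_counter() - start

    start = time.perf_counter()
    result = parse_fn(text)
    timings["parse"] = time.perf_counter() - start

    return result, timings


def slowest_step(timings):
    """Name of the step that took the longest, i.e. the bottleneck candidate."""
    return max(timings, key=timings.get)


def fake_fetch(url):
    """Stand-in for fetch(): simulates 50 ms of network latency."""
    time.sleep(0.05)
    return b"<p>42</p>"


# Hypothetical URL; `len` stands in for a real parsing function.
result, timings = timed_scrape("https://example.invalid/quotes", fake_fetch, len)
for step, seconds in timings.items():
    print(f"{step}: {seconds * 1000:.2f} ms")
print("slowest step:", slowest_step(timings))
```

Passing `fetch` instead of `fake_fetch` would time a real download. Keeping the fetch and parse functions as parameters makes it easy to swap in different libraries and compare them under the same measurement.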