Web scraping is essential for collecting the large datasets that underpin big data analysis, Machine Learning (ML), and Artificial Intelligence (AI) algorithms.
The difficulty is that information is the most valuable asset in the world (after time, because time cannot be bought back), as Michael Douglas said in the famous movie “Wall Street”, long before the Internet era.
What is web scraping?
This means that those who possess information take every possible precaution to protect it from copying. In the pre-Internet era, this was easy, because copyright laws are quite strong in developed countries. The World Wide Web has changed everything, because anyone can copy text from one page and paste it onto another, and scrapers are simply algorithms that can do this much faster than humans.
There are three levels of complexity in web scraping, depending on how much JavaScript (JS) you have to deal with:
You are lucky:
The web pages you need to scrape have clean, simple markup without JS. In this case, just create the “locators” for the data in question. XPath expressions are excellent examples of such locators. All URLs to other websites and pages are direct, so the main difficulty is finding only the relevant ones. For example, you can select links by their `class` attribute. In such a case, the XPath will look like this: `//a[@class='Your_target_class']`
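The locator idea can be sketched with Python's standard library alone. This is a minimal illustration, not a production scraper: `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup, so the sample page, its URLs, and the helper name `extract_links` are assumptions made for the example. Real-world HTML usually calls for a more tolerant parser such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; real HTML is rarely this clean.
PAGE = """
<html>
  <body>
    <a class="Your_target_class" href="/item/1">First item</a>
    <a class="other" href="/about">About</a>
    <a class="Your_target_class" href="/item/2">Second item</a>
  </body>
</html>
""".strip()


def extract_links(markup, target_class):
    """Return (href, text) pairs for every <a> whose class matches exactly."""
    root = ET.fromstring(markup)
    # ElementTree expects a leading "." so the search starts at the root element.
    xpath = f".//a[@class='{target_class}']"
    return [(a.get("href"), a.text) for a in root.findall(xpath)]


links = extract_links(PAGE, "Your_target_class")
print(links)
```

Note that `@class='...'` compares the whole attribute value, so an element carrying several classes would need a different predicate.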
A competent professional:
- Partial JS rendering. For example, the search results page contains all the information, but it is generated by JS. Typically, if you open a specific result, the complete data without JS is there.
- Simple pagination. Instead of constantly clicking on the “next page” button, you can get pages simply by constructing the necessary URL, like this: http://votresite.com/data?page=2&limit=10.
- In the same way, you can, for example, increase the number of results in the same query.
- Simple URL creation rules. Links can be formed by JS, but you can decipher the rule and recreate them yourself.
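The pagination trick above can be sketched in a few lines. The base URL is the example from the text, and the parameter names `page` and `limit` are read off that example; whether a given site accepts them, or caps the limit, is an assumption to check case by case.

```python
from urllib.parse import urlencode

BASE_URL = "http://votresite.com/data"  # example URL from the text


def page_url(page, limit=10):
    """Build the URL for one results page."""
    return f"{BASE_URL}?{urlencode({'page': page, 'limit': limit})}"


def page_urls(first, last, limit=10):
    """Build the URLs for pages first..last (inclusive)."""
    return [page_url(page, limit) for page in range(first, last + 1)]


urls = page_urls(1, 3)
for url in urls:
    print(url)
```

Raising `limit` (for example `page_url(1, limit=100)`) is the same idea applied to the number of results per query.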
Jedi Knight, may the Force be with you.
The page is entirely generated with JS. There is no way to get the data without running JS. In this case, you should use more sophisticated tools. Selenium or other tools that drive a real or headless browser will do the job.
URLs are formed using JS.
The tools in the previous paragraph should also solve this problem, but processing may slow down because JS rendering takes longer. Consider splitting the scraper and the crawler and running them as separate processes.
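One way to picture the crawler/scraper split is a producer and a consumer joined by a queue. The sketch below uses threads and `queue.Queue` for brevity rather than true separate processes; `multiprocessing.Process` and `multiprocessing.Queue` follow the same shape. The function `render_page` is a hypothetical stand-in for a slow JS-rendering fetch, and the URLs are invented for the example.

```python
import queue
import threading

SENTINEL = None  # tells the scraper that the crawler has finished


def render_page(url):
    """Hypothetical stand-in for a slow, JS-rendering page fetch."""
    return f"<html><title>{url}</title></html>"


def crawl(seed_urls, url_queue):
    """Crawler: only discovers URLs and hands them over."""
    for url in seed_urls:
        url_queue.put(url)
    url_queue.put(SENTINEL)


def scrape(url_queue, results):
    """Scraper: renders each queued URL and extracts data from it."""
    while True:
        url = url_queue.get()
        if url is SENTINEL:
            break
        html = render_page(url)
        title = html.split("<title>")[1].split("</title>")[0]
        results.append((url, title))


def run_pipeline(seed_urls):
    """Run the crawler and the scraper concurrently and collect the results."""
    url_queue = queue.Queue()
    results = []
    workers = [
        threading.Thread(target=crawl, args=(seed_urls, url_queue)),
        threading.Thread(target=scrape, args=(url_queue, results)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


results = run_pipeline(["http://votresite.com/a", "http://votresite.com/b"])
print(results)
```

Because the crawler never waits for rendering, slow pages only delay the scraping side, and more scraper workers can be added without touching the crawler.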
A CAPTCHA is present. Usually the CAPTCHA does not appear immediately but only after several requests. In this case, you can use different proxy services and simply change the IP address when the crawler is blocked by a CAPTCHA. These services can also be useful for emulating access from different locations. Capmonster2 is one example of a tool for “breaking” CAPTCHAs.
The website has an underlying API with complex data transfer rules. JS scripts render the pages after querying the back end. It may be easier to get the data by sending queries directly to the back end.
To analyze how scripts work, use the Developer Console in your browser. Press F12 and go to the Network tab.
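Once the Network tab reveals which request the page's scripts send, that request can often be reproduced directly. The sketch below is only an illustration: the endpoint `http://votresite.com/api/search`, the query parameters `q` and `page`, the bot name, and the JSON layout with an `items` list are all hypothetical and would have to be replaced by whatever the Developer Console actually shows.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint, as it might appear in the browser's Network tab.
API_URL = "http://votresite.com/api/search"


def build_request(query, page=1):
    """Build the same kind of request the page's JS would send."""
    params = urlencode({"q": query, "page": page})
    return Request(
        f"{API_URL}?{params}",
        headers={
            "Accept": "application/json",
            "User-Agent": "my-research-bot/0.1",  # hypothetical bot name
        },
    )


def parse_items(payload):
    """Extract (name, price) pairs from the assumed JSON layout."""
    data = json.loads(payload)
    return [(item["name"], item["price"]) for item in data.get("items", [])]


def fetch_items(query, page=1):
    """Perform the network call; defined here but not executed in this sketch."""
    with urlopen(build_request(query, page), timeout=10) as response:
        return parse_items(response.read().decode("utf-8"))


# Offline demonstration with a made-up response body.
sample = '{"items": [{"name": "Widget", "price": 9.99}]}'
print(build_request("widget").full_url)
print(parse_items(sample))
```

Keeping request building and response parsing in separate functions makes it easy to test the parsing against saved responses before sending any traffic to the site.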
The following is intended for Big Data researchers who comply with the permissions in robots.txt files, set the correct User-Agent, and do not violate the terms of use of the sites they are scraping.
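Checking robots.txt permissions can be automated with Python's standard `urllib.robotparser` module. In this minimal sketch the robots.txt content, the bot name `my-research-bot`, and the URLs are made up for illustration; in practice the file would be fetched from the target site before crawling.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-research-bot"  # hypothetical bot name

# Hypothetical robots.txt content; normally fetched from <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)
allowed = parser.can_fetch(USER_AGENT, "http://votresite.com/data?page=2")
blocked = parser.can_fetch(USER_AGENT, "http://votresite.com/private/report")
delay = parser.crawl_delay(USER_AGENT)
print(allowed, blocked, delay)
```

A crawler would call `can_fetch` before every request and sleep for the returned crawl delay, when one is set, between requests.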
It is also important to understand the difference between web scraping and data scraping. In short, data scraping can be applied to any data table and can even be done manually, whereas web scraping (or crawling) takes place only on web pages and is performed by special robots called scrapers. Several factors determine success in Big Data mining. Knowing where to find correct and relevant data sources is the most important foundation for successful analysis.
For example, a manufacturer may want to monitor market trends and discover actual customer behavior, without relying on monthly reports from the distributor or salesperson. Using web scraping, the company can collect a huge set of data on product descriptions on retailer sites, customer comments, and feedback on the websites of distributors such as Amazon. Analyzing this data can help the manufacturer provide retailers with better descriptions of their products, as well as list the issues end users face with their product and apply their feedback to further improve their product and secure their bottom line through increased sales.
Most scrapers are written in Python, which makes further processing of the collected data easier. Many are built on frameworks and libraries for web crawling, such as Scrapy, Ghost, lxml, aiohttp or Selenium.
When you want to develop a scraper, you have to be prepared to deal with any level of complexity. That’s why you should validate the data sets you want to retrieve before you develop your scraper, in order to allocate sufficient resources. In addition, conditions can (and often do) change during the development of scrapers, so a skilled data specialist must be ready to step in and help you recover reliable data.