Machine learning is only possible when there’s enough data available. Data collection is one of the most vital components of the machine learning lifecycle, yet it’s often overlooked.
Web scraping—programmatically collecting data from web pages—is often the primary technique used to collect information from the Internet. While web scraping can be very straightforward in some cases, it can be tricky in others, especially when the data we’re interested in is hidden behind a button that requires a mouse click.
That’s where an advanced scraper, like the one we’re going to build in this article, is required.
What’s an Advanced Scraper?
A (normal) web scraper collects information from a given web page. This is a straightforward exercise: the component we’re interested in is usually specified as a CSS selector or an XPath expression, and the scraper extracts the information.
But as web page design evolved and became more sophisticated, web designers started hiding information behind AJAX calls and buttons, showing only the most important content up front and tucking away lengthier or less relevant pieces.
That’s where any normal web scraper fails, and the need for an advanced scraper comes in. Advanced scrapers emulate clicks and other browser activities to access the entire web page, which can then be scraped to collect the relevant information.
Advanced Scraper Architecture
Kaggle Public LB Scraper
In this project, we’re going to build an advanced scraper using R, leveraging RSelenium and rvest to extract the complete Leaderboard of a Kaggle Competition.
The reason we need an advanced scraper for this project is that, by default, the Kaggle Public Leaderboard (LB) page displays only the top leaderboard entries; to scroll through the full list, a user needs multiple browser clicks to expand the table.
Our advanced scraper will be designed to do all of this programmatically and finally output two visualization plots from the extracted LB data. This advanced Kaggle LB scraper could be used for a variety of use cases, such as automated Kaggle LB score alerts or tracking fellow teams’ scores.
As a prerequisite for this project, let’s begin by installing the required R packages:
- RSelenium — R Bindings for Selenium 2.0 Remote WebDriver
- rvest — R package for web scraping
- tidyverse — Collection of R packages designed for data science
All these packages can be installed from CRAN using the following code:
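A single `install.packages()` call covers all three:

```r
# Install the packages required for this project from CRAN
install.packages(c("RSelenium", "rvest", "tidyverse"))
```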
After installation, we have to load these R packages into our current R session.
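Loading them looks like this:

```r
library(RSelenium)  # browser automation via Selenium WebDriver
library(rvest)      # HTML parsing and extraction
library(tidyverse)  # data wrangling and plotting
```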
Starting a Selenium Server and Browser
The next step is to start a Selenium server and browser. Because our scraping requires advanced browser emulation, we have to perform this step. In our case, we’ll use Chrome as our browser of choice (Firefox will produce similar results). The port value is optional, and explicitly specifying one avoids conflicts with anything already running on the default port.
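A minimal sketch of this step with `rsDriver()` might look like the following (the port number 4444 is an arbitrary choice; pick any free port on your machine):

```r
# Start a Selenium server and open a Chrome browser session
rD <- rsDriver(browser = "chrome", port = 4444L)

# Keep a handle to the browser client for navigation and clicks
remDr <- rD$client
```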
In this section, we’ll specify the URL from which we’re scraping our required data. For our project, a Kaggle public LB is associated with a Kaggle Competition, so for ease of use, we’ll specify the Kaggle Competition URL and then build the public LB URL from that.
Please note, this code is written for the current active competition and can be modified with minimal changes to make it work for past competitions.
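A sketch of building the LB URL from the competition URL follows; the Titanic competition URL is only a placeholder, so substitute the competition you’re actually tracking:

```r
# Hypothetical competition URL — replace with your competition of interest
competition_url <- "https://www.kaggle.com/c/titanic"

# The public leaderboard page lives under /leaderboard
lb_url <- paste0(competition_url, "/leaderboard")
```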
URL in the Browser
Now that we’re ready with our URL and browser, we just have to request that our browser navigate to the specified URL.
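Assuming `remDr` is the client handle from `rsDriver()` and `lb_url` holds the leaderboard URL, navigation is a single call:

```r
# Point the automated browser at the leaderboard page
remDr$navigate(lb_url)
```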
More Browser Emulations
As specified above, we’re building an advanced scraper because the Kaggle public LB page doesn’t display the entire table when it first loads. So we have to scroll down to the bottom of the page and then click the expand button at the end (as in the below screenshot).
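The scroll-and-click sequence can be sketched as below. The CSS selector for the expand button is an assumption for illustration; inspect the live page with your browser’s developer tools to find the actual selector, and tune the `Sys.sleep()` waits to your connection speed:

```r
# Scroll to the bottom of the page so the expand control is rendered
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);", args = list())
Sys.sleep(2)  # give the page a moment to finish loading

# Hypothetical selector — replace with the real one from the page's markup
expand_btn <- remDr$findElement(using = "css selector",
                                value = ".competition-leaderboard__load-all")
expand_btn$clickElement()
Sys.sleep(5)  # wait for the full leaderboard table to render
```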
After the above code execution and along with some page load time, the browser should show the expanded Leaderboard table (as in the below screenshot).
Now that we have the web page displaying the complete LB table, we can simply extract the page source and use traditional web scraping data extraction methods to extract the cleaned up table data. The below code does the same.
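Under the assumption that the leaderboard is the first HTML table on the page, the extraction can be sketched as:

```r
# Grab the fully expanded page source from the live browser session
source <- remDr$getPageSource()[[1]]

# Parse the HTML and pull out the leaderboard table as a dataframe
# (assumes the leaderboard is the first <table> on the page)
lb <- read_html(source) %>%
  html_table() %>%
  .[[1]]
```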
As you can see in the above code, we first save the complete web page source code as an R object, source. This source contains a huge corpus of HTML data, which is then parsed using read_html(). Since the Kaggle public LB data is in a standard HTML table, we can extract the table using html_table() and save it as a dataframe in the R object lb.
Finally, we’ll save the dataframe lb in a file lb.csv for archiving purposes and other possible uses.
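With the tidyverse loaded, this is a one-liner via readr:

```r
# Archive the leaderboard dataframe to disk
write_csv(lb, "lb.csv")
```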
Data Visualization and Insights
With the extracted LB data, we can build a few charts to derive insights about the competition and team performances.
Public LB Score Density Plot
This plot helps us understand which score range is most crowded—that is, where the majority of scores fall.
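A ggplot2 sketch of the density plot follows; the column name `Score` is an assumption, so check `names(lb)` for the actual header of your scraped table:

```r
# Density plot of public LB scores ("Score" column name is assumed)
lb %>%
  ggplot(aes(x = as.numeric(Score))) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  labs(title = "Kaggle Public LB Score Density",
       x = "Public LB Score",
       y = "Density")
```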
Number of Entries Density Plot
Another data point in Kaggle competitions is the number of times each Kaggler has submitted their competition solution. The below code generates a density plot of the number of entries.
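This plot follows the same pattern as the score density plot; again, the column name `Entries` is an assumption to be checked against the scraped table:

```r
# Density plot of submission counts ("Entries" column name is assumed)
lb %>%
  ggplot(aes(x = as.numeric(Entries))) +
  geom_density(fill = "darkorange", alpha = 0.5) +
  labs(title = "Number of Entries Density",
       x = "Number of Entries",
       y = "Density")
```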
The above plot shows a first peak at 1, which is expected because many participants become inactive after their first submission. A second peak appears just before 5 entries, with a steep decline from there.
The primary objective of this tutorial was to introduce the concept of advanced scraping and build an advanced web scraper using RSelenium and rvest. We then used this scraper to extract Kaggle public leaderboard data, which could help Kagglers who are active in competitions. The entire code and plots used in this tutorial are available on my GitHub.
Do you have any interesting use cases for an advanced web scraper? Would love to hear about them in the comments!