
Introduction
The internet has become widely popular and continues to grow as a source of information. In this era of ever-growing online information, there is a real need to organize the data that is available publicly and to analyze it. Scraping, by definition, means collecting or accumulating something with great difficulty, and web scraping means collecting or accumulating web data. Web scraping is nothing but extracting data from the web so that it can be easily read and organized. The data to be scraped is available publicly and, in most cases, does not contain private data such as users' personal information.
There are various tools available for web scraping; the ones discussed in this paper are "Nokogiri" and "Kimono". Nokogiri is a programming-based web scraping tool, while Kimono is a GUI-based web scraping tool.
How does web scraping work?
Web scraping is about getting particular data that is scattered among many web pages, tables, or tags all around the internet. It means accessing all the HTML data from a given webpage, and sometimes more than just HTML. Web scraping is easy to learn and implement using programming languages, and there is also software available in the market that does the programming work for the user, who just has to specify which data to scrape.
In general, to access a webpage we need to go through HTTP (HyperText Transfer Protocol), either directly or by creating an agent that behaves like a web browser.
There are certain ways we can process data after scraping:
- Parsing : Parsing means collecting the data from the host directly.
- Searching : Searching means looking for something particular within the parsed data.
- Copying to spreadsheet : This means copying the data to a database in the form of a .xls or .csv file, which makes data processing even easier (a sketch of this step follows this list).
- Reformatting : Converting the scraped data into whatever structure the analysis requires.
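As a concrete illustration of the spreadsheet step, here is a minimal Ruby sketch using Nokogiri (the library discussed later in this paper) together with Ruby's standard csv library. The URL and the li.item selector are hypothetical placeholders, not a real site.

#! /usr/bin/env ruby
# A minimal sketch, assuming a hypothetical listings page whose items
# sit in <li class="item"> tags; the URL and selector are placeholders.
require 'nokogiri'
require 'open-uri'
require 'csv'

doc = Nokogiri::HTML(URI.open('https://example.com/listings'))

# Write each scraped item as one row of a .csv file
CSV.open('listings.csv', 'w') do |csv|
  csv << ['item']                # header row
  doc.css('li.item').each { |node| csv << [node.text.strip] }
end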
The general method of scraping consists of accessing the website through HTTP using the methods available in the language's library. The user first visits the website once to look at its structure, using developer tools such as Inspect Element in Google Chrome. He/she observes the part that is to be scraped, finds something unique about that tag, such as its class or id, and can then get the text from that tag by searching the parsed data for that unique class or id. This is the general procedure for scraping data from HTML. Another method of web scraping is the use of XPath selectors, which are generally used for XML.
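To make this procedure concrete, here is a minimal Ruby sketch using the standard net/http library for the HTTP access and Nokogiri (introduced below) for parsing. The URL, the price class, and the span tag are hypothetical placeholders of the kind one would discover via Inspect Element.

#! /usr/bin/env ruby
# A minimal sketch of the general procedure; the URL and the 'price'
# class are hypothetical placeholders found with Inspect Element.
require 'net/http'
require 'nokogiri'

# Step 1: access the website through HTTP using the language library
html = Net::HTTP.get(URI('https://example.com/item'))

# Step 2: parse the HTML, then search for the tag carrying the unique
# class observed in the developer tools, keeping only its text
doc = Nokogiri::HTML(html)
puts doc.at_css('.price').text

# The same node found via an XPath selector, with "//" walking the tags
puts doc.at_xpath('//span[@class="price"]').text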
Why should I scrape data?
Web scraping in general can be used to analyze a particular type of data from different websites. Some of the applications of web scraping are:
- Price Comparators : Nowadays we have so many options for purchasing products online, and competition has increased so much that it becomes tedious to track all the prices and benefits offered by different companies. Web scraping is a solution to this problem: the data from different websites can be scraped and exported to a database for easy one-to-one comparison. With a programming interface, the programmer can use a single program to scrape the data for different products by taking the product name as a string input from the user; thus one program can be used everywhere. However, the data should be scraped with caution, and the terms of service of the websites should be read first. There are many comparators available for comparing prices of products from different shopping sites, real estate prices, insurance policies, flight tickets, etc.
- Weather Data Analysis : Weather data analysis uses data sets as input to a weather prediction system, and the data set for training such a system can be obtained through web scraping. There is no commercial loss in scraping weather data because it is available to everyone. Scrapers use web scraping techniques to get the data into a database as a .csv or .xls file, as required by the analyst.
- Website Change Detection : Website change detection means detecting a particular change on a website. Suppose a user wants a particular job at a company but there is currently no vacancy; he/she can use web scraping to check whether a vacancy has opened up. On the implementation side, the programmer has to look for the particular part of the HTML code that contains this information, scrape that part of the HTML from the website, and run the program whenever a check is needed (a sketch of this appears after this list). This saves the user the time of going to the website through a browser and checking for changes, and it also saves a negligible amount of data usage.
- Checking Online Reputation : Google and other search engines use the GET method for their search pages, which makes it easy for scrapers to run a particular search and extract the descriptions shown on the results page. The results there are already summarized, so a user can take a quick glance at the online reputation of a person, a product, or a place. The scraped data can be extended further with sentiment analysis, which takes it as raw input to determine whether a particular thing is viewed positively or negatively.
- Personal Purposes : One of the benefits of web scraping is that it is easy to implement even for a person with only a little programming knowledge. A person or a student can use web scraping for personal needs like collecting data or previous papers for research work, collecting test papers of a competitive exam, or obtaining online study material. Instead of visiting tens or hundreds of websites, the user can copy the data to his/her machine with a quick scraping program. There is no loss for the website host either, because there are not too many requests and all the data is available publicly. However, advertisers do suffer a loss from this strategy, because when we do not go to the actual page of the website we do not see the ads, and the money invested in advertising is wasted.
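Here is the promised sketch of the change detection idea, again in Ruby with Nokogiri. The careers URL and the #openings id are hypothetical placeholders; the program fingerprints the relevant part of the HTML and compares it with the fingerprint saved on the previous run.

#! /usr/bin/env ruby
# A hedged sketch of website change detection; the URL and the
# '#openings' id are hypothetical placeholders.
require 'nokogiri'
require 'open-uri'
require 'digest'

page = Nokogiri::HTML(URI.open('https://example.com/careers'))
openings = page.at_css('#openings')        # the part of the HTML we care about
current = Digest::SHA256.hexdigest(openings.to_s)

# Compare against the fingerprint saved on the previous run
previous = File.exist?('openings.sha') ? File.read('openings.sha') : ''
if current == previous
  puts 'No change since the last check.'
else
  puts 'The vacancies section has changed -- check the site!'
  File.write('openings.sha', current)
end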
Nokogiri

Nokogiri is a gem (library) of the Ruby programming language. A gem, or library, provides additional features to a programming language, and that is what Nokogiri does. It is one of the most downloaded Ruby gems. Together, Ruby and Nokogiri make scraping through programming very easy for the user.
Nokogiri takes a URL as input, passed as an argument to the open() method that comes with Ruby's open-uri gem (URI.open in modern Ruby). The call returns an object of the Nokogiri document class that contains the parsed data. Suppose the object containing the parsed data is named obj; obj provides methods such as at_css() and css() that take a class or id as an argument and return the matching tag, which can be printed in string form. Appending .text to at_css() gives only the text inside that tag.
There is another method used to scrape data: the xpath() method. Instead of taking a class or id name, it takes the name of an HTML or XML tag, and if we want to go inside a particular tag, the levels of the path are separated using "//". It returns the same string as the at_css() method; the only difference between the two is how they search the parsed data[3].
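A short sketch of these methods under assumed names (the URL and the title id are hypothetical placeholders):

#! /usr/bin/env ruby
# A short sketch of at_css(), .text and xpath(); the URL and the
# 'title' id are hypothetical placeholders.
require 'nokogiri'
require 'open-uri'

obj = Nokogiri::HTML(URI.open('https://example.com'))

tag = obj.at_css('#title')      # first node whose id is "title"
puts tag.to_s                   # the whole tag as a string
puts tag.text                   # only the text inside the tag

# The equivalent lookup with an XPath selector, levels separated by "//"
puts obj.at_xpath('//h1[@id="title"]').text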
Nokogiri is different. Here is why
Nokogiri is fast and efficient. It combines the raw power of the native C parser libxml2 with the friendly interface pioneered by Hpricot.
The main competitors of Nokogiri are Beautiful Soup, a library for Python, and rvest, a library for R.
Nokogiri is well documented, so it is easy for beginners to learn. Compared to other web scrapers, Nokogiri can deal with broken HTML, a common hazard of web scraping that many other scrapers cannot handle easily.
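As a quick sketch of this tolerance, the fragment below is missing its closing tags, yet Nokogiri still parses it into a searchable document:

#! /usr/bin/env ruby
# Broken HTML: the <p> and <b> tags below are never closed, yet
# Nokogiri repairs the markup while parsing.
require 'nokogiri'

broken = '<html><body><p class="msg">Hello <b>world</body>'
doc = Nokogiri::HTML(broken)
puts doc.at_css('p.msg').text   # => "Hello world"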
The main disadvantage of Nokogiri is that Ruby lacks the machine learning and NLP libraries found in some other languages, so the scraped data cannot be processed further as conveniently.

#! /usr/bin/env ruby
require 'nokogiri'
require 'open-uri'

# Fetch and parse the HTML document (URI.open comes from open-uri)
doc = Nokogiri::HTML(URI.open('https://nokogiri.org/tutorials/installing_nokogiri.html'))

puts "### Search for nodes by css"
doc.css('nav ul.menu li a', 'article h2').each do |link|
  puts link.content
end

puts "### Search for nodes by xpath"
doc.xpath('//nav//ul//li/a', '//article//h2').each do |link|
  puts link.content
end

puts "### Or mix and match."
doc.search('nav ul.menu li a', '//article//h2').each do |link|
  puts link.content
end
To know more about Nokogiri, visit https://nokogiri.org/
Kimono

Kimono is a GUI-based Chrome extension for web scraping. It is a robust, automated, and easy-to-use web scraper. It also allows the user to create an API from the scraped data and use it further. The fields are calibrated automatically: Kimono detects what the user might want to scrape, and the user can confirm or reject the suggested listing.
One of the unique features of Kimono is that it shows the breakdown of the underlying HTML, which could also be passed as arguments to functions in programming-based web scrapers. It also gives a regular expression for the item the user is trying to find. Pagination is supported as well, which means the user can scrape page after page. Kimono also offers automated crawling, meaning it refreshes the data at an interval selected by the user; however, the crawl limit is only 25 pages.
Drawbacks of Kimono
- Kimono's automatic detection is sometimes inaccurate.
- It can crawl only up to 25 pages.
To get started with Kimono, check out https://www.youtube.com/watch?v=8D9gGxS1_qg
Wrapping up
Web scraping is extracting data and converting it from a human-readable form into a form that machines and programming languages can understand; in other words, converting it into structured data that can be processed easily. There are two types of web scraping tools: programming-based and GUI-based. Nokogiri is a very popular programming-based web scraping library, while Kimono is a popular GUI-based web scraping Chrome extension.
Web scraping is a very important skill to learn in a world where data science is so popular, and collecting data is one of the first steps of data mining and data processing. Various tools for web scraping have been discussed here according to the needs of the user.