23 June 2017

Web Scraping. Part 2: Scrape Me One More Time


In case you haven’t read our previous articles: we are building a bot that will automatically check you in. In the previous part of this series, you learned how to write and use the basic methods required for web scraping.

I described how to navigate through a website and how to execute simple JavaScript code on the target web page. We will reuse some of the code presented in that article, so if you haven’t read it yet, now is a good time to do so.

In this part, you will learn how to:

  • Fill and submit forms
  • Intercept asynchronous requests and read their response bodies
  • Take a screenshot of a specific element on the web page

Fill and submit forms

Wherever you see a set of fields where you can enter data such as your name, address or phone number, in most cases you are dealing with a web form. During online check-in, you may encounter several forms that require you to enter your personal information.

Thankfully, filling and submitting forms programmatically isn’t hard. You can either use DOM methods in combination with the Runtime.callFunctionOn method, or execute JavaScript code directly on the web page using the evaluate method.

The relevant part of the code finds the newsletter form on SoftwareHut’s blog, enters your e-mail address and submits the form by simulating a mouse click on the submit button (you can use the submit() method as well).
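A minimal sketch of that idea could look like the code below. It assumes that `client` is an already connected Chrome DevTools Protocol client (for example one created with the chrome-remote-interface package) and that the form contains an e-mail input and a submit button. The function names and the selectors are hypothetical and are not taken from the real page.

```javascript
// Builds the JavaScript expression that will run inside the page.
// JSON.stringify embeds the arguments as safely quoted string literals.
function buildFillFormExpression(formSelector, email) {
  return `(() => {
    const form = document.querySelector(${JSON.stringify(formSelector)});
    if (!form) return false;
    // Assumed markup: one e-mail input and one submit button inside the form.
    const input = form.querySelector('input[type="email"]');
    const button = form.querySelector('[type="submit"]');
    if (!input || !button) return false;
    input.value = ${JSON.stringify(email)};
    button.click(); // simulates a mouse click; form.submit() would also work
    return true;
  })()`;
}

// Runs the expression through Runtime.evaluate and reports whether the
// form was found and the submit button was clicked.
async function fillAndSubmitForm(client, formSelector, email) {
  const { result, exceptionDetails } = await client.Runtime.evaluate({
    expression: buildFillFormExpression(formSelector, email),
    returnByValue: true, // return the boolean itself, not a remote object handle
  });
  if (exceptionDetails) {
    throw new Error('Script failed in the page: ' + exceptionDetails.text);
  }
  return result.value === true;
}
```

A call such as `await fillAndSubmitForm(client, '#newsletter-form', 'me@example.com')` would then resolve to `true` when the form was found and clicked; the selector here is only a placeholder.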

As you can see, filling and submitting forms is easy. Let’s try something harder.

Intercept asynchronous requests and read their response bodies

There are situations when simple things require complicated methods. For example, if you would like to programmatically reserve a seat using the user interface of the web page, you would need to:

  • Open a seat map
  • Wait until it loads
  • Select a seat from a drop-down list or find another related HTML element
  • Check all required checkboxes
  • Submit the form
  • Confirm any message that might have popped up.

There are plenty of possibilities and many places where something can go wrong. Instead of doing the above, check what requests are being sent by the browser during the whole procedure and you may notice that all these complicated steps can be omitted if you understand how the system behaves.

What’s more, it’s worth checking for any globally available methods and data in the window object. You would be surprised what can be available there, very often not even obfuscated. To give a real example, you may find a method called ‘reserveSeat’ which takes two parameters: a seat number and a passenger identifier. Calling this method sends an asynchronous request to the server, and the only thing you need to do is listen for the response to know whether the call was successful.

To do that, listen for all requests and check whether one of them is a seat reservation by testing whether the request’s URL ends with the ‘reserveSeat’ keyword. If it does, wait for the response to arrive and read the response body, which you can then process to find out whether the seat was reserved.
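One way to sketch this with the Network domain of the Chrome DevTools Protocol is shown here. It assumes that `client` is a connected chrome-remote-interface client (which emits protocol events such as `Network.requestWillBeSent`) and that `Network.enable()` has already been called. The name `waitForResponseBody` and the timeout parameter are made up for this illustration.

```javascript
// Resolves with the response body of the first request whose URL ends with
// `urlSuffix`. Rejects if the request fails or nothing matches in time.
function waitForResponseBody(client, urlSuffix, timeoutMs = 10000) {
  return new Promise((resolve, reject) => {
    let requestId = null; // set once the matching request has been seen

    const onRequest = ({ requestId: id, request }) => {
      if (requestId === null && request.url.endsWith(urlSuffix)) requestId = id;
    };
    const onFinished = ({ requestId: id }) => {
      if (id !== requestId) return;
      cleanup();
      // The body can only be fetched once loading has finished.
      client.Network.getResponseBody({ requestId })
        .then(({ body, base64Encoded }) =>
          resolve(base64Encoded ? Buffer.from(body, 'base64').toString('utf8') : body))
        .catch(reject);
    };
    const onFailed = ({ requestId: id, errorText }) => {
      if (id !== requestId) return;
      cleanup();
      reject(new Error('Request failed: ' + errorText));
    };
    const timer = setTimeout(() => {
      cleanup();
      reject(new Error('Timed out waiting for ' + urlSuffix));
    }, timeoutMs);

    // Stops the timer and detaches all listeners registered below.
    function cleanup() {
      clearTimeout(timer);
      client.removeListener('Network.requestWillBeSent', onRequest);
      client.removeListener('Network.loadingFinished', onFinished);
      client.removeListener('Network.loadingFailed', onFailed);
    }

    client.on('Network.requestWillBeSent', onRequest);
    client.on('Network.loadingFinished', onFinished);
    client.on('Network.loadingFailed', onFailed);
  });
}
```

The promise should be created before the page method is called, so that the request is not missed: start `waitForResponseBody(client, 'reserveSeat')` first, trigger the reservation in the page, and only then await the result. Note that a URL carrying a query string would not end with the keyword, so the match may need to be loosened for a real site.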

Take a screenshot of a specific element on the web page

There can be situations where you would like to take a partial screenshot of the web page, for example to get an image of a nice-looking plane seat map with all the available seats marked on it. Sometimes such data is easy to grab, especially if it’s presented as an image. In such a case, the only thing you need to do is grab the link to the picture itself. Easy.

But what if a seat map is built from HTML elements? There are two solutions to this problem.

The first one assumes the element is fully visible at the time of taking the screenshot. You can get the coordinates of the element’s location on the web page by using the getBoundingClientRect method, take a screenshot of the page and crop the resulting image using an image-processing tool. For Node.js you can use gm, which requires either GraphicsMagick or ImageMagick to be installed on the system.

Please look at the below example which presents the general idea of the above approach:
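The sketch here is a reconstruction of that idea under a few assumptions: `client` is a connected chrome-remote-interface client, `gm` is the function exported by the gm package (passed in as an argument so the code has no hard dependency), and the element is visible in the viewport. The function names are hypothetical. The coordinates are multiplied by `devicePixelRatio` because getBoundingClientRect reports CSS pixels, while the screenshot may be rendered at a higher density.

```javascript
// Reads the element's position in the viewport, in screenshot pixels.
// Resolves to null when no element matches the selector.
async function getElementRect(client, selector) {
  const expression = `(() => {
    const el = document.querySelector(${JSON.stringify(selector)});
    if (!el) return null;
    const r = el.getBoundingClientRect();
    const scale = window.devicePixelRatio || 1;
    return {
      x: Math.round(r.left * scale),
      y: Math.round(r.top * scale),
      width: Math.round(r.width * scale),
      height: Math.round(r.height * scale),
    };
  })()`;
  const { result } = await client.Runtime.evaluate({ expression, returnByValue: true });
  return result.value;
}

// Captures the visible page and crops it to the element's rectangle.
async function screenshotElement(client, gm, selector, outPath) {
  const rect = await getElementRect(client, selector);
  if (!rect) throw new Error('Element not found: ' + selector);

  // Page.captureScreenshot returns the image as a base64-encoded string.
  const { data } = await client.Page.captureScreenshot({ format: 'png' });
  const image = Buffer.from(data, 'base64');

  await new Promise((resolve, reject) => {
    gm(image, 'page.png')
      .crop(rect.width, rect.height, rect.x, rect.y) // gm order: width, height, x, y
      .write(outPath, (err) => (err ? reject(err) : resolve()));
  });
  return rect;
}
```

With the gm package installed, a call could look like `await screenshotElement(client, require('gm'), '#seat-map', 'seat-map.png')`, where `#seat-map` is a placeholder selector.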

The second one uses jQuery to traverse the whole DOM and hide all elements except your element and its parents. The script also takes the element out of any scrollable content and positions it relative to the browser window.

After running such a script, you may still need to polish the result, for example by getting rid of unwanted padding, margins or borders, but it’s a good start to make your element ready for a screenshot.
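As an illustration of this technique, here is a rough sketch that swaps jQuery for plain DOM calls, so it does not depend on jQuery being loaded on the page. It is not the original script: the function names are hypothetical, and a real page will likely need extra tweaks. `isolateElement` is meant to run inside the page, and `buildIsolateExpression` wraps it into an expression that can be passed to Runtime.evaluate.

```javascript
// Runs inside the page. Hides everything except `el`, its descendants and
// its ancestors, then pins `el` to the top-left corner of the window.
function isolateElement(el) {
  if (!el) return false;
  for (let node = el; node.parentElement; node = node.parentElement) {
    const parent = node.parentElement;
    for (const sibling of Array.from(parent.children)) {
      if (sibling !== node) sibling.style.display = 'none';
    }
    // Stop ancestors from clipping or scrolling the element.
    parent.style.overflow = 'visible';
  }
  // Take the element out of the normal flow, relative to the browser window.
  Object.assign(el.style, { position: 'fixed', top: '0', left: '0', margin: '0' });
  return true;
}

// Builds an expression that defines the function in the page and calls it
// on the first element matching `selector`.
function buildIsolateExpression(selector) {
  return `(${isolateElement.toString()})(document.querySelector(${JSON.stringify(selector)}))`;
}
```

The expression evaluates to `true` when the element was found and isolated, and to `false` otherwise, so the caller can decide whether taking the screenshot makes sense.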

That’s it. With the set of functions and examples presented in this two-part article (read the first part), you should be able to extract any information you need from almost any web page you can think of. You should also check what other options the Chrome DevTools Protocol gives you, as you may find some of them useful for your needs. For example, you can use the Page.printToPDF method to save the web page as a PDF file.

Happy web scraping!

Konrad Kierus
Software Developer

Software developer focused mainly on mobile application development, both native and hybrid solutions, but familiar with web technologies as well. He believes that programming languages are just tools to solve problems and he is not afraid of learning new ones. In his free time, he makes sure his son won't hurt himself or anyone else.