09 June 2017

Web Scraping. Part 1: Basics

In case you didn’t read our previous articles: we are creating a bot that will check you in automatically. In this part, you will learn which tools are best suited to this goal and which obstacles you may encounter while web scraping.

But what is web scraping? The term describes extracting data from websites for various purposes, typically by automated processes such as bots. You can read more theory about it on Wikipedia, but let’s focus on practice here. You will learn how to write a program that can grab the essential information automatically, process it, submit required information, navigate between pages and more.

Choosing a tool

There are various tools on the internet that automate web scraping and can be used even by non-programmers. But to have full control over the web page, you need a tool that matches your needs: a tool you write yourself.

Do you know what the most popular browser on the Internet is? Chrome, of course.

What’s more, many web developers use Chrome as their primary browser to check if their page looks and works as expected.

What if you could write a bot that runs in a server environment, works inside a modern browser, and takes advantage of Chrome’s developer tools? Well, now you can.

Meet Headless Chrome, a Chrome that doesn’t need any graphical interface to run. With an extra tool, chrome-remote-interface, you will be able to create a JavaScript bot hosted on a Node.js server.

Basics

Let’s write a program that opens the SoftwareHut blog and takes a screenshot of the whole page.

The ability to take screenshots is important for several reasons:

  • You will know whether all elements on the page were loaded correctly at the moment the screenshot command was executed
  • You may want to present some parts of the web page to a third party, like a nice-looking plane seat map with all the available seats, or a form with filled-in data to confirm that your bot is working as expected
  • You can also see the result of your code in a non-headless Chrome by entering ‘localhost:9222’ in its address bar, but not while your program is still running. That is because Chrome doesn’t support multiple clients (you, using non-headless Chrome, and your program) connected to a single tab (a target, to be more precise)

As a first step, you need to open Chrome in headless mode. You can do it manually or from Node.js in a few ways described here.

Let’s focus on more interesting things and just start the Chrome browser manually in headless mode using the following flags:

google-chrome --headless --remote-debugging-port=9222 --disable-gpu
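If you prefer the Node.js route mentioned earlier, one possible approach is to start Chrome as a child process. This is only a sketch: the function names are made up for illustration, and the binary name ‘google-chrome’ is an assumption that differs between systems.

```javascript
const { spawn } = require('child_process');

// Builds the same flags as the manual command.
function headlessChromeArgs(port = 9222) {
  return ['--headless', `--remote-debugging-port=${port}`, '--disable-gpu'];
}

// Starts Chrome as a child process. `binary` is an assumption: the executable
// may be called differently on your system (or need a full path).
function launchHeadlessChrome(binary = 'google-chrome', port = 9222) {
  const chrome = spawn(binary, headlessChromeArgs(port), { stdio: 'ignore' });
  chrome.on('error', (err) => {
    console.error(`Could not start ${binary}: ${err.message}`);
  });
  return chrome;
}

// Usage:
// const chrome = launchHeadlessChrome();
// ... run the bot ...
// chrome.kill();
```

Remember to kill the child process when the bot is done, otherwise Chrome keeps running in the background.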

The code below connects to a headless Chrome instance on the default port (9222), navigates to the SoftwareHut blog’s web page and takes a screenshot of it.

Note: if you are unfamiliar with async/await, please read the docs. They build on promises and make the code much easier to read, because communication with Chrome is asynchronous.
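A minimal sketch of such a program could look like the one below. It assumes that chrome-remote-interface is installed from npm; the blog URL, the output file name and the function names are placeholders chosen for illustration. Note that Page.captureScreenshot, called without options, captures the currently visible viewport.

```javascript
const fs = require('fs');

// Navigates to `url` and stores a PNG screenshot of the current viewport.
async function saveScreenshot(client, url, file) {
  const { Page } = client;
  await Page.enable();
  await Page.navigate({ url });
  const { data } = await Page.captureScreenshot(); // base64-encoded PNG
  fs.writeFileSync(file, Buffer.from(data, 'base64'));
}

async function main() {
  // Required lazily so saveScreenshot can be reused without the package.
  const CDP = require('chrome-remote-interface');
  const client = await CDP(); // connects to localhost:9222 by default
  try {
    // The URL is an assumed placeholder for the blog's address.
    await saveScreenshot(client, 'https://softwarehut.com/blog', 'screenshot.png');
  } finally {
    await client.close();
  }
}

// Run with headless Chrome already listening on port 9222:
// main().catch(console.error);
```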

Is it that easy? Well, check your saved screenshot and don’t be surprised if the image is empty. The navigate method doesn’t wait until the page is fully loaded, so let’s wait until all resources have finished loading by adding the following line before taking the screenshot.

await Page.loadEventFired();
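To show where that line fits, here is a small sketch of a navigation helper (the function name is hypothetical). It is a slight variant: it subscribes to the load event before navigating, so the event cannot be missed if the page loads very quickly. In chrome-remote-interface, calling an event method without a callback returns a promise for the next occurrence of that event.

```javascript
// Navigates to `url` and resolves once the page's load event has fired.
async function navigateAndWaitForLoad(client, url) {
  const { Page } = client;
  // Page events are only reported after the Page domain is enabled.
  await Page.enable();
  // Subscribe first, navigate second, then wait for the event.
  const loaded = Page.loadEventFired();
  await Page.navigate({ url });
  await loaded;
}

// Usage (inside an async function, with a connected client):
// await navigateAndWaitForLoad(client, url);
// const { data } = await client.Page.captureScreenshot();
```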

Check your screenshot again. Is it better? Probably yes. Is it ideal? Definitely not.

Many pages load content dynamically, and even if the browser considers the page completely loaded, that doesn’t mean it is from the user’s perspective. Pages also use various animations and transitions; for example, they may become fully visible only after a second or two to get a nice fade-in effect.

There are two ways to wait for the page to be ready from the user’s perspective:

  1. Set an interval that periodically checks whether a particular condition is met, for example whether some element on the page is visible or reacts to click events.
  2. Listen for requests (or WebSocket frames) sent and received after the load event has fired. Once all required resources have finished loading (check for both successes and failures), wait another second or two to make sure no new requests are sent before you consider the page loaded. It is also worth checking whether the page keeps sending similar requests every second endlessly, or setting a timeout, because otherwise your method may never finish.
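The second approach can be sketched roughly as follows. This is an illustration under assumptions, not a complete solution: the function name and its options are made up, WebSocket frames are ignored, and the caller is expected to have enabled the Network domain (Network.enable) beforehand. It relies on the chrome-remote-interface client being an event emitter that reports protocol events under names like ‘Network.requestWillBeSent’.

```javascript
// Resolves when no request has been in flight for `idleMs`;
// rejects if that does not happen within `timeoutMs`.
function waitForNetworkIdle(client, { idleMs = 1000, timeoutMs = 30000 } = {}) {
  return new Promise((resolve, reject) => {
    const pending = new Set(); // ids of requests still in flight
    let idleTimer = null;

    const cleanup = () => {
      clearTimeout(idleTimer);
      clearTimeout(timeoutTimer);
      client.removeListener('Network.requestWillBeSent', onStart);
      client.removeListener('Network.loadingFinished', onEnd);
      client.removeListener('Network.loadingFailed', onEnd);
    };

    // (Re)start the quiet-period countdown, but only when nothing is pending.
    const armIdleTimer = () => {
      clearTimeout(idleTimer);
      if (pending.size === 0) {
        idleTimer = setTimeout(() => { cleanup(); resolve(); }, idleMs);
      }
    };

    const onStart = ({ requestId }) => {
      pending.add(requestId);
      clearTimeout(idleTimer);
    };
    // Successes and failures both end a request.
    const onEnd = ({ requestId }) => {
      pending.delete(requestId);
      armIdleTimer();
    };

    // Safety net for pages that never stop sending requests.
    const timeoutTimer = setTimeout(() => {
      cleanup();
      reject(new Error(`Network not idle after ${timeoutMs} ms`));
    }, timeoutMs);

    client.on('Network.requestWillBeSent', onStart);
    client.on('Network.loadingFinished', onEnd);
    client.on('Network.loadingFailed', onEnd);
    armIdleTimer();
  });
}
```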

The most reliable method is the first one, especially if the page has some animations as well, but sometimes you may need to dig a little to find the right candidate for the waiting condition.

The SoftwareHut blog is a perfect candidate for demonstrating the first method, as many resources are loaded dynamically and there is a fade-in animation when the page is fully loaded.

If you inspect the page, you will notice a “#preloader” div, which hides with a fade-out effect when the page is ready to be displayed.

But first things first.

To check if the page is ready, you need to check if some particular condition is met.

Below you can find a useful example that periodically checks whether some test function returns a valid result. If it does, the promise is resolved; if it doesn’t before the given timeout, the promise is rejected.
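One way to write such a helper is sketched below. The name waitForFunction follows the later references in this article; the parameter order and the default values are assumptions. A “valid result” is treated here as any truthy value, and an error thrown by the test function rejects the promise immediately.

```javascript
// Calls `test` every `intervalMs` until it returns a truthy value (resolve)
// or until `timeoutMs` has passed (reject). `test` may be async.
function waitForFunction(test, intervalMs = 100, timeoutMs = 10000) {
  return new Promise((resolve, reject) => {
    const startedAt = Date.now();
    let checking = false; // prevents overlapping checks when `test` is slow

    const timer = setInterval(async () => {
      if (checking) return;
      checking = true;
      try {
        const result = await test();
        if (result) {
          clearInterval(timer);
          resolve(result);
        } else if (Date.now() - startedAt >= timeoutMs) {
          clearInterval(timer);
          reject(new Error(`Timed out after ${timeoutMs} ms`));
        }
      } catch (err) {
        clearInterval(timer);
        reject(err);
      } finally {
        checking = false;
      }
    }, intervalMs);
  });
}
```

Keep in mind that the timeout is only checked between calls, so a test function that never settles would still block the helper.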

Great, now you probably want to know how to check whether the ‘#preloader’ div is visible and how to combine that check with the ‘waitForFunction’ method.

There are two ways to do that:

  1. Directly use the methods available in the Chrome DevTools Protocol to query or manipulate the DOM.
  2. Evaluate code directly in the web page, with the option of injecting your favourite JS library, such as jQuery, to query or manipulate the DOM.

Both methods have pros and cons; which one to use depends on what you would like to achieve.

Fortunately, both of them can be used in our scenario.

Below you can find the function which uses the DevTools Protocol methods to check whether the blog is ready to be displayed:
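One possible shape for such a function is sketched here. It assumes that the DOM and CSS domains were enabled earlier (DOM.enable, then CSS.enable), uses the computed style to read the ‘display’ property, and takes the selector as a parameter whose default is an assumption about the page.

```javascript
// Returns true when the preloader element exists and is not displayed.
// Assumes `await client.DOM.enable()` and `await client.CSS.enable()`
// were called before.
async function isBlogReady(client, selector = '#preloader') {
  const { DOM, CSS } = client;
  // 1. Get the root node of the document.
  const { root } = await DOM.getDocument();
  // 2. Look for the preloader below the root.
  const { nodeId } = await DOM.querySelector({ nodeId: root.nodeId, selector });
  if (!nodeId) return false; // nodeId 0 means "not found", so keep waiting
  // 3. Read the computed `display` property of the element.
  const { computedStyle } = await CSS.getComputedStyleForNode({ nodeId });
  const display = computedStyle.find((property) => property.name === 'display');
  return Boolean(display) && display.value === 'none';
}
```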

First, the above method retrieves the documentElement; then it searches for the #preloader element; finally, it checks whether the element’s ‘display’ property is currently set to ‘none’.

You can use this function with the waitForFunction helper method like this:
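A sketch of how the pieces might be combined is shown below. The helper and the readiness check are passed in as arguments so that the snippet stands on its own; their signatures, the wrapper’s name and the timing values are assumptions.

```javascript
const fs = require('fs');

// Waits until the page is ready, then saves a screenshot to `file`.
// `waitForFunction(test, intervalMs, timeoutMs)` and `isBlogReady(client)`
// are supplied by the caller; both signatures are assumed here.
async function screenshotWhenReady(client, { waitForFunction, isBlogReady }, file) {
  // Check every 100 ms and give up after 10 s (arbitrary values).
  await waitForFunction(() => isBlogReady(client), 100, 10000);
  const { data } = await client.Page.captureScreenshot();
  fs.writeFileSync(file, Buffer.from(data, 'base64'));
}

// Usage (inside an async function, after navigating to the blog):
// await screenshotWhenReady(client, { waitForFunction, isBlogReady }, 'screenshot.png');
```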

And that’s it. Check your screenshot; it’s ready.

The second method for checking whether the page is fully visible evaluates your code directly in the web page. Below you can find an example that uses the waitForFunction helper method.
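As a rough sketch, the check can be passed to Runtime.evaluate as a string. The function name and the default selector are assumptions, and plain DOM calls are used here instead of an injected library.

```javascript
// Evaluates a string expression inside the page and returns true when the
// preloader's computed `display` is 'none'.
async function isPreloaderHidden(client, selector = '#preloader') {
  const expression =
    `window.getComputedStyle(document.querySelector(${JSON.stringify(selector)}))` +
    `.display === 'none'`;
  const { result, exceptionDetails } = await client.Runtime.evaluate({ expression });
  // If the element does not exist yet, the expression throws inside the page;
  // treat that as "not ready" rather than as a failure.
  if (exceptionDetails) return false;
  return result.value === true;
}

// Usage with the article's helper (signature assumed):
// await waitForFunction(() => isPreloaderHidden(client), 100, 10000);
```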

As you might have noticed, passing the ‘expression’ as a string is not very readable, and it looks even worse if you want to execute slightly more complicated code.

The solution is to convert your function to a string and wrap it in an anonymous function that executes itself immediately.

Below you can find an example that uses the waitForFunction helper method.
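A minimal sketch of this idea follows; all function names and the selector are illustrative assumptions. The check is written as a regular function, turned into source text with toString(), and wrapped so that it is invoked as soon as the page evaluates it.

```javascript
// This function runs inside the page, not in Node.js, so it may use
// `document` and `window`.
function preloaderIsHidden() {
  var el = document.querySelector('#preloader');
  return Boolean(el) && window.getComputedStyle(el).display === 'none';
}

// Turns a function into an expression that calls it immediately.
function toExpression(fn) {
  return `(${fn.toString()})()`;
}

// Evaluates `fn` in the page and returns its (primitive) result.
async function evaluateFunction(client, fn) {
  const { result, exceptionDetails } = await client.Runtime.evaluate({
    expression: toExpression(fn),
  });
  if (exceptionDetails) throw new Error(exceptionDetails.text);
  return result.value;
}

// Usage with the article's helper (signature assumed):
// await waitForFunction(() => evaluateFunction(client, preloaderIsHidden), 100, 10000);
```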

The above example is only a very basic one. If you want, you can extend the function to accept parameters. One thing is worth noting: if you would like to return a JSON object from your evaluated function, you need to add the ‘returnByValue: true’ parameter to the Runtime.evaluate call.
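Such an extension might look like the sketch below. The names are hypothetical, and the arguments are assumed to be JSON-serializable because they are embedded into the expression with JSON.stringify.

```javascript
// Builds an immediately invoked expression with JSON-encoded arguments.
function toExpressionWithArgs(fn, ...args) {
  const serialized = args.map((arg) => JSON.stringify(arg)).join(', ');
  return `(${fn.toString()})(${serialized})`;
}

// Evaluates `fn(...args)` in the page and returns the result by value.
async function evaluateWithArgs(client, fn, ...args) {
  const { result, exceptionDetails } = await client.Runtime.evaluate({
    expression: toExpressionWithArgs(fn, ...args),
    // Without this flag, objects come back as remote object handles
    // instead of plain values.
    returnByValue: true,
  });
  if (exceptionDetails) throw new Error(exceptionDetails.text);
  return result.value;
}

// Runs inside the page and returns a plain object.
function describeElement(selector) {
  var el = document.querySelector(selector);
  return {
    found: Boolean(el),
    display: el ? window.getComputedStyle(el).display : null,
  };
}

// Usage (inside an async function, with a connected client):
// const info = await evaluateWithArgs(client, describeElement, '#preloader');
```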

With the set of functions and examples presented in this article, you will be able to do the most basic things, like navigation or executing code in a web page, which gives you endless possibilities.

In the next part, you will learn how to use these methods, as well as other techniques, to achieve the more complicated things necessary to complete the check-in procedure automatically.



Author
Konrad Kierus
Software Developer

Software developer focused mainly on mobile application development, both native and hybrid, but familiar with web technologies as well. He believes that programming languages are just tools to solve problems, and he is not afraid of learning new ones. In his free time, he makes sure his son won’t hurt himself or anyone else.