Thursday, November 3, 2022

Web Scraping With Playwright: Tutorial for 2022


You probably won't be surprised to hear that in recent years, the internet and its impact have grown tremendously. This can be attributed to the growth of the technologies that help create more user-friendly applications. Moreover, there is more and more automation at every step, from the development to the testing of web applications.

Having good tools to test web applications is crucial. Libraries such as Playwright help speed up processes by opening the web application in a browser and performing other user interactions such as clicking elements, typing text, and, of course, extracting public data from the web.

In this post, we'll explain everything you need to know about Playwright and how it can be used for automation and even web scraping.

What is Playwright?

Playwright is a testing and automation framework that can automate web browser interactions. Simply put, you can write code that opens a browser. This means that all the web browser capabilities are available for use. The automation scripts can navigate to URLs, enter text, click buttons, extract text, etc. The most exciting feature of Playwright is that it can work with multiple pages at the same time, without getting blocked or having to wait for operations to complete in any of them.

It supports most browsers: Google Chrome and Microsoft Edge through Chromium, as well as Firefox. Safari is supported when using WebKit. In fact, cross-browser web automation is Playwright's strength. The same code can be efficiently executed for all the browsers. Moreover, Playwright supports various programming languages such as Node.js, Python, Java, and .NET. You can write the code that opens websites and interacts with them using any of these languages.

Playwright's documentation is extensive. It covers everything from getting started to a detailed explanation of all the classes and methods.

Support for proxies in Playwright

Playwright supports the use of proxies. Before we explore this subject further, here is a quick code snippet showing how to launch Chromium, which is the starting point for using a proxy:

Node.js:

const { chromium } = require('playwright');
const browser = await chromium.launch();

Python:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        await browser.close()

asyncio.run(main())

This code needs only slight modifications to fully utilize proxies.

In the case of Node.js, the launch function can accept an optional parameter of the LaunchOptions type. This LaunchOptions object can, in turn, carry several other parameters, e.g., headless. The other parameter needed is proxy. This proxy is another object with properties such as server, username, password, etc. The first step is to create an object where these parameters can be specified.

// Node.js
const launchOptions = {
    proxy: {
        server: '123.123.123.123:80'
    },
    headless: false
};

The next step is to pass this object to the launch function:

const browser = await chromium.launch(launchOptions);

In the case of Python, it's slightly different. There's no need to create a LaunchOptions object. Instead, all the values can be sent as separate parameters. Here's how the proxy dictionary will be sent:

# Python
proxy_to_use = {
    'server': '123.123.123.123:80'
}
browser = await pw.chromium.launch(proxy=proxy_to_use, headless=False)
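For proxies that require authentication, the same dictionary can also carry the username and password properties mentioned earlier. The following is a minimal Python sketch rather than a definitive implementation: the helper names build_proxy and launch_with_proxy are our own, and the address and credentials are placeholders.

```python
import asyncio
from typing import Optional


def build_proxy(server: str, username: Optional[str] = None,
                password: Optional[str] = None) -> dict:
    """Build the proxy dictionary passed to launch(proxy=...)."""
    proxy = {"server": server}
    # Only include credentials when both values are supplied.
    if username and password:
        proxy["username"] = username
        proxy["password"] = password
    return proxy


async def launch_with_proxy(proxy: dict) -> str:
    """Open Books to Scrape through the proxy and return the page title."""
    # Imported here so build_proxy() can be used without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch(proxy=proxy, headless=True)
        page = await browser.new_page()
        await page.goto("https://books.toscrape.com/")
        title = await page.title()
        await browser.close()
        return title


def run() -> None:
    """Entry point: needs Playwright, its browsers, and a working proxy."""
    # Placeholder address; replace it with a real proxy before running.
    print(asyncio.run(launch_with_proxy(build_proxy("123.123.123.123:80"))))
```

Calling run() launches the browser through the proxy. Keeping the dictionary construction in its own function makes it easy to switch between proxies with and without credentials.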

When deciding which proxy to use, it's best to use residential proxies, as they are less likely to leave a footprint or trigger any security alarms. For example, our own Oxylabs' Residential Proxies can help you with an extensive and stable proxy network. You can access proxies in a specific country, state, or even a city. Importantly, you can integrate them easily with Playwright as well.

Basic scraping with Playwright

Let's move to another topic where we'll cover how to get started with Playwright using Node.js and Python.

If you're using Node.js, create a new project and install the Playwright library. This can be done using these two simple commands:

npm init -y
npm install playwright

A basic script that opens a dynamic page is as follows:

const playwright = require('playwright');

(async () => {
    const browser = await playwright.chromium.launch({
        headless: false // Show the browser.
    });

    const page = await browser.newPage();
    await page.goto('https://books.toscrape.com/');
    await page.waitForTimeout(1000); // Wait for 1 second.
    await browser.close();
})();

Let's take a look at the provided code. The first line imports Playwright. Then, an instance of Chromium is launched, which allows the script to automate Chromium. Also, note that this script is running with a visible UI. We did it by passing headless: false. Then, a new browser page is opened. After that, the page.goto function navigates to the Books to Scrape web page. Next, there's a wait of 1 second to show the page to the end-user. Finally, the browser is closed.

The same code can be written in Python easily. First, install Playwright using the pip command, and then download the browser binaries that Playwright drives:

pip install playwright
playwright install

Note that Playwright supports two variations: synchronous and asynchronous. The following example uses the asynchronous API:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            headless=False  # Show the browser
        )
        page = await browser.new_page()
        await page.goto('https://books.toscrape.com/')
        # Data extraction code here
        await page.wait_for_timeout(1000)  # Wait for 1 second
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())

This code is similar to the Node.js code. The biggest difference is the use of the asyncio library. Another difference is that the function names change from camelCase to snake_case.
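For comparison, here is a rough sketch of the synchronous variation. It is an illustration under our own assumptions: get_title and run are names we made up, while sync_playwright, new_page, goto, title, and close belong to Playwright's synchronous API.

```python
def get_title(browser, url: str) -> str:
    """Open url in a new page of the given browser and return its title."""
    page = browser.new_page()
    try:
        page.goto(url)
        return page.title()
    finally:
        # Close the page even if navigation fails.
        page.close()


def run() -> None:
    """Entry point: requires Playwright and its browsers to be installed."""
    # Imported here so get_title() can be exercised without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        print(get_title(browser, "https://books.toscrape.com/"))
        browser.close()
```

The synchronous API needs no asyncio or await, which can make short scripts easier to read. The asynchronous API is the better fit when several pages need to be handled at the same time.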

If you want to create more than one browser context or want to have finer control, you can create a context object and create multiple pages in that context. This would open pages in new tabs:

const context = await browser.newContext();
const page1 = await context.newPage();
const page2 = await context.newPage();

You may also want to handle page context in your code. It's possible to get the browser context that the page belongs to using the page.context() function.
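The same idea carries over to Python, where a context can also be used to load several pages concurrently. The sketch below is illustrative only: fetch_title and fetch_titles are hypothetical helper names, and the second URL is assumed to be the second catalogue page of Books to Scrape.

```python
import asyncio


async def fetch_title(context, url: str) -> str:
    """Open url in a new tab of the given context and return the page title."""
    page = await context.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await page.close()


async def fetch_titles(context, urls: list) -> list:
    """Load all URLs concurrently; results keep the order of urls."""
    return await asyncio.gather(*(fetch_title(context, url) for url in urls))


async def main() -> None:
    # Imported here so the helpers above can be exercised without a browser.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        context = await browser.new_context()
        titles = await fetch_titles(context, [
            "https://books.toscrape.com/",
            "https://books.toscrape.com/catalogue/page-2.html",  # assumed URL
        ])
        print(titles)
        await browser.close()


def run() -> None:
    """Entry point: requires Playwright and its browsers to be installed."""
    asyncio.run(main())
```

Because every page belongs to the same context, the tabs share cookies and storage. Creating a second context with browser.new_context() gives an isolated session instead.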

Locating elements

To extract information from any element or to click any element, the first step is to locate the element. Playwright supports both CSS and XPath selectors.

This can be understood better with a practical example. Open https://books.toscrape.com/ in Chrome. Right-click the first book and select Inspect.

You can see that all the books are under the article element, which has a class product_pod.

To select all the books, you need to run a loop over all these article elements. These article elements can be selected using the CSS selector:

.product_pod

Similarly, the XPath selector would be as follows:

//*[@class="product_pod"]

To use these selectors, the most common functions are as follows:

  • $eval(selector, function) – selects the first element, sends the element to the function, and the result of the function is returned;
  • $$eval(selector, function) – same as above, except that it selects all elements;
  • querySelector(selector) – returns the first element (exposed as page.$ in Playwright for Node.js and as query_selector in Python);
  • querySelectorAll(selector) – returns all the elements (exposed as page.$$ in Playwright for Node.js and as query_selector_all in Python).

These methods will work correctly with both CSS and XPath selectors.
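As a quick check that both selector styles find the same elements, the following Python sketch counts the matches for each selector. Treat it as an illustration: count_matches is a name we made up, and it relies only on the query_selector_all method of a Playwright page.

```python
import asyncio

CSS_SELECTOR = ".product_pod"
XPATH_SELECTOR = '//*[@class="product_pod"]'


async def count_matches(page, selectors) -> dict:
    """Return how many elements each selector matches on the page."""
    counts = {}
    for selector in selectors:
        elements = await page.query_selector_all(selector)
        counts[selector] = len(elements)
    return counts


async def main() -> None:
    # Imported here so count_matches() can be exercised without a browser.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://books.toscrape.com/")
        print(await count_matches(page, [CSS_SELECTOR, XPATH_SELECTOR]))
        await browser.close()


def run() -> None:
    """Entry point: requires Playwright and its browsers to be installed."""
    asyncio.run(main())
```

Playwright treats a selector that starts with // as XPath, so no extra prefix is needed here. If the two counts differ, the XPath expression is usually the one to inspect, since @class="product_pod" only matches when the class attribute is exactly that string.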

Scraping text

Continuing with the example of Books to Scrape, after the page has been loaded, you can use a selector to extract all book containers using the $$eval function.

const books = await page.$$eval('.product_pod', all_items => {
    // run a loop here
});

Now all the elements that contain book data can be extracted in a loop:

all_items.forEach(book => {
    const name = book.querySelector('h3').innerText;
});

Finally, the innerText attribute can be used to extract the data from each data point. Here's the complete code in Node.js:

const playwright = require('playwright');

(async () => {
    const browser = await playwright.chromium.launch();
    const page = await browser.newPage();
    await page.goto('https://books.toscrape.com/');
    const books = await page.$$eval('.product_pod', all_items => {
        const data = [];
        all_items.forEach(book => {
            const name = book.querySelector('h3').innerText;
            const price = book.querySelector('.price_color').innerText;
            const stock = book.querySelector('.availability').innerText;
            data.push({ name, price, stock });
        });
        return data;
    });
    console.log(books);
    await browser.close();
})();

The code in Python will be a bit different. Python has a function eval_on_selector, which is similar to $eval of Node.js, but it's not suitable for this scenario. The reason is that the second parameter still needs to be JavaScript. This can be good in certain scenarios, but in this case, it will be much better to write the entire code in Python.

It would be better to use query_selector and query_selector_all, which will return an element and a list of elements respectively.

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://books.toscrape.com')

        all_items = await page.query_selector_all('.product_pod')
        books = []
        for item in all_items:
            book = {}
            name_el = await item.query_selector('h3')
            book['name'] = await name_el.inner_text()
            price_el = await item.query_selector('.price_color')
            book['price'] = await price_el.inner_text()
            stock_el = await item.query_selector('.availability')
            book['stock'] = await stock_el.inner_text()
            books.append(book)
        print(books)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())

The output of both the Node.js and the Python code will be the same. You can click here to find the complete code used in this post for your convenience.
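The scraped values are plain strings, so a small clean-up step is often useful before storing them. The sketch below is one possible approach, not part of the original scripts. It assumes that prices look like '£51.77' and that the stock text may contain extra whitespace. The function names parse_price and clean_book are our own.

```python
import re


def parse_price(text: str) -> float:
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


def clean_book(book: dict) -> dict:
    """Return a copy of a scraped record with normalised fields."""
    return {
        "name": book["name"].strip(),
        "price": parse_price(book["price"]),
        # Collapse newlines and repeated spaces into single spaces.
        "stock": " ".join(book["stock"].split()),
    }
```

With the books list from the script above, the records can be normalised with [clean_book(book) for book in books].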

Playwright vs Puppeteer and Selenium

There are other tools like Selenium and Puppeteer that can also do the same thing as Playwright.

However, Puppeteer is limited when it comes to browsers and programming languages. The only language that can be used is JavaScript, and the only browser that works with it is Chromium.

Selenium, on the other hand, supports all major browsers and multiple programming languages. It is, however, slow and less developer-friendly.

Also note that Playwright can intercept network requests. For more details about network requests, see this page.
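To give a rough idea of what request interception looks like, here is a minimal Python sketch that skips images, fonts, and media while a page loads. The choice of blocked resource types and the handler name block_heavy_resources are our own assumptions. The calls page.route, route.abort, route.continue_, and request.resource_type come from Playwright's API.

```python
import asyncio

# Resource types we choose to skip; adjust the set to your needs.
BLOCKED_TYPES = {"image", "font", "media"}


async def block_heavy_resources(route) -> None:
    """Abort requests for images, fonts, and media; let the rest through."""
    if route.request.resource_type in BLOCKED_TYPES:
        await route.abort()
    else:
        await route.continue_()


async def main() -> None:
    # Imported here so the handler can be exercised without a browser.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        # Register the handler for every request the page makes.
        await page.route("**/*", block_heavy_resources)
        await page.goto("https://books.toscrape.com/")
        print(await page.title())
        await browser.close()


def run() -> None:
    """Entry point: requires Playwright and its browsers to be installed."""
    asyncio.run(main())
```

Skipping heavy resources can reduce bandwidth and load time when only the text of a page is needed.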

The following table is a quick summary of the differences and similarities:

| | PLAYWRIGHT | PUPPETEER | SELENIUM |
|---|---|---|---|
| SPEED | Fast | Fast | Slower |
| DOCUMENTATION | Excellent | Excellent | Fair |
| DEVELOPER EXPERIENCE | Best | Good | Fair |
| PROGRAMMING LANGUAGES | JavaScript, Python, C#, Java | JavaScript | Java, Python, C#, Ruby, JavaScript, Kotlin |
| BACKED BY | Microsoft | Google | Community and Sponsors |
| COMMUNITY | Small but active | Large and active | Large and active |
| BROWSER SUPPORT | Chromium, Firefox, and WebKit | Chromium | Chrome, Firefox, IE, Edge, Opera, Safari, and more |

Comparison of performance

As we mentioned in the previous section, because of the vast difference in the programming languages and supported browsers, it isn't easy to compare every scenario.

The only combination that can be compared is when scripts are written in JavaScript to automate Chromium. This is the only combination that all three tools support.

A detailed comparison would be out of the scope of this post. You can read more about the performance of Puppeteer, Selenium, and Playwright in this article. The key takeaway is that Puppeteer is the fastest, followed by Playwright. Note that in some scenarios, Playwright was faster. Selenium is the slowest of the three.

Again, remember that Playwright has other advantages, such as multi-browser support and support for multiple programming languages.

If you're looking for fast cross-browser web automation or don't know JavaScript, Playwright will be your only choice.

Conclusion

In today's post, we explored the capabilities of Playwright as a web testing tool that can be used for web scraping dynamic sites. Due to its asynchronous nature and cross-browser support, it's a popular alternative to other tools. We also covered code examples in both Node.js and Python.

Playwright can help navigate to URLs, enter text, click buttons, extract text, etc. Most importantly, it can extract text that is rendered dynamically. These things can also be done by other tools such as Puppeteer and Selenium, but if you need to work with multiple browsers, or have to work with a language other than JavaScript/Node.js, then Playwright would be a great choice.

If you're interested in reading more about other similar topics, check out our blog posts on web scraping with Selenium or our Puppeteer tutorial.

And, of course, in case you have any questions or impressions about today's tutorial, don't hesitate to leave a comment below!
