plugins – Need help creating an asynchronous data scraper in WordPress

August 20, 2022


I need some help with a data scraping script I have been working on. I need to create a script that extracts content from a web page.

I fetched the page with wp_remote_get() and wp_remote_retrieve_body(). Using the DOMDocument class, I was able to target the specific elements and add them to the database afterwards. Please take a look at the following code:

$url = "https://www.myexampleurl.com/page-to-be-scraped";

    $data = wp_remote_get( $url,
        array(
            'timeout'   =>  60
        )
    );
    $body = wp_remote_retrieve_body( $data );

    $dom = new DOMDocument();
    $dom->loadHTML( $body );

    $xpath = new DomXPath( $dom );
    $xpath->registerNamespace( 'm', $url );

    //The page has a center tag, believe it or not
    $center = $dom->getElementsByTagName('center')->item(0);

    //Targeting all links inside the center tag (the leading dot keeps the query relative to $center)
    $query = './/a';
    $entries = $xpath->query( $query, $center );

    $count = 1;
    foreach ($entries as $entry) {

        //The target elements have a 'data-lightbox' attribute
        $attr = $entry->attributes->getNamedItem( 'data-lightbox' );

        //Importing the sibling attribute of 'data-lightbox'
        if ( !empty( $attr ) ) {

            //The data fetched is uploaded to the database using this function
            my_upload_file_by_url( $attr->previousSibling->nodeValue );

        }
    }

Now, what I need to know is how to make this request asynchronous. I tried AJAX, but it times out and throws an error.
Also, $dom->loadHTML() throws the following warning:

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 241 in /Users/apple/Sites/indidev/wp-content/plugins/crawler/crawler.php on line 56
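
For context, this warning typically appears when the scraped markup contains a bare & that is not written as an entity, for example inside a link's query string. The following is a minimal sketch, not a definitive fix: it assumes the page is otherwise parseable and asks libxml to collect parse errors internally instead of emitting PHP warnings. The sample markup in $body is hypothetical and stands in for the fetched page.

```php
<?php
// Hypothetical markup standing in for the scraped page; the bare "&" in
// the href is the kind of input that produces htmlParseEntityRef warnings.
$body = '<html><body><center>'
      . '<a href="/a?x=1&y=2" data-lightbox="set">One</a>'
      . '</center></body></html>';

// Collect libxml errors internally instead of emitting PHP warnings,
// remembering the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors( true );

$dom = new DOMDocument();
$dom->loadHTML( $body );

// Keep the collected errors for logging if needed, then clear the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors( $previous );

// The document is still usable even though the parser complained.
$links = $dom->getElementsByTagName( 'a' );
$lightbox = $links->item( 0 )->getAttribute( 'data-lightbox' );
```

The parser recovers from this kind of error on its own, so the sketch only changes how the problem is reported; $errors can still be inspected or logged.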

I also tried the wp_schedule_single_event() function, but it doesn't work as expected either. Any pointers are appreciated!

PS: There are many pages that need to be scraped and simultaneously inserted into the database.
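
To make the wp_schedule_single_event() attempt concrete, here is a rough sketch of one way the many-pages case could be queued: one WP-Cron event per URL, staggered in time, so no single request has to scrape everything. This is not true concurrency, and the hook name my_scrape_single_page, the helper my_queue_scrape_jobs() and the worker my_scrape_page() are all hypothetical names that do not appear in the plugin above.

```php
<?php
// Hypothetical hook name; not part of the original plugin.
const MY_SCRAPE_HOOK = 'my_scrape_single_page';

/**
 * Queue one WP-Cron event per URL, staggered by $spacing seconds so the
 * jobs do not all fire in the same request.
 * Returns the timestamp chosen for each URL.
 */
function my_queue_scrape_jobs( array $urls, int $start, int $spacing = 30 ): array {
    $scheduled = array();
    foreach ( array_values( $urls ) as $i => $url ) {
        $when = $start + $i * $spacing;
        // wp_schedule_single_event() only exists inside WordPress, so the
        // call is skipped when this file is run on its own.
        if ( function_exists( 'wp_schedule_single_event' ) ) {
            wp_schedule_single_event( $when, MY_SCRAPE_HOOK, array( $url ) );
        }
        $scheduled[ $url ] = $when;
    }
    return $scheduled;
}

// Inside WordPress, attach the worker that scrapes one page per event.
// my_scrape_page( $url ) is assumed to wrap the fetch-and-parse code above.
if ( function_exists( 'add_action' ) ) {
    add_action( MY_SCRAPE_HOOK, 'my_scrape_page', 10, 1 );
}

// Example call with placeholder URLs and a fixed start time.
$jobs = my_queue_scrape_jobs(
    array( 'https://www.myexampleurl.com/page-1', 'https://www.myexampleurl.com/page-2' ),
    1000,
    30
);
```

WP-Cron only runs when the site receives a request, so on a low-traffic site the events may fire later than their scheduled time.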
