Saturday, August 20, 2022

plugins – Need help creating an asynchronous data scraper in WordPress


I need some help with a data scraping script I have been working on. I need to create a script that extracts content from a web page.

I fetched the page with wp_remote_get() and wp_remote_retrieve_body(). Using the DOMDocument class, I was able to target the specific elements and then add them to the database. Please take a look at the following code:

$url = "https://www.myexampleurl.com/page-to-be-scraped";

$data = wp_remote_get( $url,
    array(
        'timeout' => 60
    )
);
$body = wp_remote_retrieve_body( $data );

$dom = new DOMDocument();
$dom->loadHTML( $body );

$xpath = new DOMXPath( $dom );
$xpath->registerNamespace( 'm', $url );

// The page has a center tag, believe it or not
$center = $dom->getElementsByTagName( 'center' )->item( 0 );

// Targeting all links inside the center tag ('.//a' is relative to the
// context node; a plain '//a' would match links in the whole document)
$query = './/a';
$entries = $xpath->query( $query, $center );

$count = 1;
foreach ( $entries as $entry ) {

    // The target elements have a 'data-lightbox' attribute
    $attr = $entry->attributes->getNamedItem( 'data-lightbox' );

    // Uploading the sibling attribute to 'data-lightbox'
    if ( ! empty( $attr ) ) {

        // The fetched data is uploaded to the database using this function
        my_upload_file_by_url( $attr->previousSibling->nodeValue );

    }
}

Now, what I need to know is how to make this request asynchronous. I tried AJAX, but it times out and throws an error.
Also, the $dom->loadHTML() call throws the following warning:

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 241 in /Users/apple/Sites/indidev/wp-content/plugins/crawler/crawler.php on line 56
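
This warning usually means the page contains a bare & (for example in a URL query string) that libxml reports as a malformed entity; the document is still parsed. A minimal sketch, assuming the goal is only to keep the warning out of the output: switch libxml to internal error handling around the loadHTML() call. The $body string below is made-up sample markup standing in for the fetched page.

```php
<?php
// Made-up sample markup; the bare "&" in the href is the kind of input
// that older libxml versions report as "htmlParseEntityRef: expecting ';'".
$body = '<html><body><center>'
      . '<a href="/img.php?id=1&size=2" data-lightbox="gallery">One</a>'
      . '</center></body></html>';

// Collect parser errors internally instead of emitting PHP warnings.
// The previous setting is returned so it can be restored afterwards.
$previous = libxml_use_internal_errors( true );

$dom = new DOMDocument();
$dom->loadHTML( $body );

// Keep the collected errors for logging, then clear libxml's buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors( $previous );

// The document is usable even though the markup was not strictly valid.
$links = $dom->getElementsByTagName( 'a' );
```

The collected $errors array can be written to a log if the malformed markup needs to be inspected later; whether any errors are reported for a given page depends on the installed libxml version.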

I also tried the wp_schedule_single_event() function, but it doesn't work as expected either. Any pointers are appreciated!

PS: There are many pages that need to be scraped and simultaneously inserted into the database.
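
One possible way to avoid the AJAX timeout with many pages is to store the URLs in a queue and let WP-Cron work through them a few at a time. The sketch below rests on several assumptions: every my_scraper_* name (hook, option, functions) is hypothetical, and my_scraper_scrape_page() stands for a wrapper around the fetch-and-parse code above. A common reason a single event appears not to fire is that its callback is only registered in the request that schedules it, so here add_action() runs on every load.

```php
<?php
// All 'my_scraper_*' names are hypothetical and only for illustration.

/**
 * Split a queue of URLs into the batch to process now and the remainder.
 * Pure PHP, so it can be checked without loading WordPress.
 */
function my_scraper_split_queue( array $queue, int $batch_size ): array {
    $batch = array_slice( $queue, 0, $batch_size );
    $rest  = array_slice( $queue, $batch_size );
    return array( $batch, $rest );
}

/**
 * Store the URLs to scrape and schedule the first background run.
 */
function my_scraper_start( array $urls ) {
    update_option( 'my_scraper_queue', array_values( $urls ) );
    if ( ! wp_next_scheduled( 'my_scraper_batch_event' ) ) {
        wp_schedule_single_event( time(), 'my_scraper_batch_event' );
    }
}

/**
 * Cron callback: scrape one small batch, then schedule the next run.
 */
function my_scraper_process_batch() {
    $queue = get_option( 'my_scraper_queue', array() );
    if ( empty( $queue ) ) {
        return;
    }

    list( $batch, $rest ) = my_scraper_split_queue( $queue, 5 );

    foreach ( $batch as $url ) {
        // Assumed wrapper around the fetch-and-parse code from the question.
        my_scraper_scrape_page( $url );
    }

    update_option( 'my_scraper_queue', $rest );

    // Queue another run one minute later if URLs are left.
    if ( ! empty( $rest ) && ! wp_next_scheduled( 'my_scraper_batch_event' ) ) {
        wp_schedule_single_event( time() + 60, 'my_scraper_batch_event' );
    }
}

// Register the callback on every request, not only when scheduling.
// The guard lets this file load outside WordPress as well.
if ( function_exists( 'add_action' ) ) {
    add_action( 'my_scraper_batch_event', 'my_scraper_process_batch' );
}
```

The batch size of 5 and the 60-second delay are arbitrary starting points. WP-Cron only runs when the site receives a request, so on a low-traffic site the batches may run later than scheduled.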
