I would like some help with a data scraping script I have been working on. I need to create a script that can extract content from a web page.
I fetched the page using wp_remote_get() and wp_remote_retrieve_body(). Using the DOMDocument class, I was able to target the specific elements and add them to the database afterwards. Please take a look at the following code:
$url = "https://www.myexampleurl.com/page-to-be-scraped";
$data = wp_remote_get( $url,
    array(
        'timeout' => 60
    )
);
$body = wp_remote_retrieve_body( $data );
$dom = new DOMDocument();
$dom->loadHTML( $body );
$xpath = new DOMXPath( $dom );
$xpath->registerNamespace( 'm', $url );
// The page has a center tag, believe it or not
$center = $dom->getElementsByTagName( 'center' )->item( 0 );
// Targeting all links
$query = '//a';
$entries = $xpath->query( $query, $center );
$count = 1;
foreach ( $entries as $entry ) {
    // The target elements have a 'data-lightbox' attribute
    $attr = $entry->attributes->getNamedItem( 'data-lightbox' );
    // Uploading the sibling attribute of 'data-lightbox'
    if ( ! empty( $attr ) ) {
        // The fetched data is uploaded to the database using this function
        my_upload_file_by_url( $attr->previousSibling->nodeValue );
    }
}
Now, what I need to know is how to make this request asynchronous. I tried AJAX, but it times out and throws an error.
Also, $dom->loadHTML( $body ) throws the following warning:
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 241 in /Users/apple/Sites/indidev/wp-content/plugins/crawler/crawler.php on line 56
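For reference, this kind of warning typically comes from a bare "&" in the fetched markup (for example in a link's query string). A minimal sketch of one common way to keep libxml parse errors out of the PHP warning stream is shown below. The sample markup is made up for illustration and is only assumed to resemble the real page; whether it reproduces the exact message depends on the installed libxml version.

```php
<?php
// Hypothetical sample markup: the bare "&" in the href is the kind of input
// that can make libxml report "htmlParseEntityRef: expecting ';'".
$html = '<html><body><center>'
    . '<a href="page.php?a=1&b=2" data-lightbox="set">Link</a>'
    . '</center></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors( true );

$dom = new DOMDocument();
$dom->loadHTML( $html );

// Read whatever the parser complained about, then clear the error buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors( $previous );

foreach ( $errors as $error ) {
    // Each entry is a LibXMLError object with message, line and column.
    error_log( sprintf( 'libxml: %s (line %d)', trim( $error->message ), $error->line ) );
}

// The document is still usable even if the markup was not well-formed.
$links = $dom->getElementsByTagName( 'a' );
```

The parser still recovers from the malformed entity; the sketch only changes where the messages go, so they can be logged or ignored as needed.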
I also tried the wp_schedule_single_event function, but it doesn't work as expected either. Any pointers are appreciated!
PS: There are many pages that need to be scraped and simultaneously inserted into the database.
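To make the scheduling idea concrete, here is a rough sketch of how several pages might be queued with wp_schedule_single_event so that each one is fetched in its own WP-Cron run rather than in a single long request. The hook name, both function names, and the 30-second spacing are hypothetical choices for illustration, and the parsing step is left as a placeholder. The function_exists guard is only there so the file can be loaded outside WordPress.

```php
<?php
// Hypothetical hook name; any unique string would do.
const MY_CRAWLER_HOOK = 'my_crawler_scrape_page';

// Hypothetical worker: WP-Cron calls this with one URL per event.
function my_crawler_scrape_page( $url ) {
    $response = wp_remote_get( $url, array( 'timeout' => 60 ) );
    if ( is_wp_error( $response ) ) {
        return; // Nothing to parse if the request itself failed.
    }
    $body = wp_remote_retrieve_body( $response );
    // Parse $body and store the results here.
}

// Register the worker only when WordPress is loaded.
if ( function_exists( 'add_action' ) ) {
    add_action( MY_CRAWLER_HOOK, 'my_crawler_scrape_page', 10, 1 );
}

// Hypothetical helper: schedules one single event per URL, spaced apart
// so the pages are not all fetched in the same cron run.
function my_crawler_queue_pages( array $urls ) {
    $delay = 0;
    foreach ( $urls as $url ) {
        // Skip URLs that already have a pending event with the same args.
        if ( ! wp_next_scheduled( MY_CRAWLER_HOOK, array( $url ) ) ) {
            wp_schedule_single_event( time() + $delay, MY_CRAWLER_HOOK, array( $url ) );
        }
        $delay += 30; // Assumed spacing in seconds between pages.
    }
}
```

Note that WP-Cron only fires when the site receives a request, so on a low-traffic site the events may run later than scheduled unless a real system cron job triggers wp-cron.php.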