Web Scraping with PHP – How to Crawl Web Pages Using Open Source Tools

Web scraping lets you collect data from web pages across the internet. It’s also called web crawling or web data extraction. PHP is a widely used back-end scripting language for creating dynamic websites and web applications, and you can implement a web scraper in plain PHP or with an open source library such as Goutte, which the examples in this post use.
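For comparison with the Goutte examples in this post, here is a minimal sketch of what "plain PHP" scraping could look like using only the built-in DOM extension. The helper name extractLinks and the inline sample markup are assumptions made for illustration; the class name crd_lnk is borrowed from the selectors used later on. It also assumes the class name contains no quote characters, since it is placed directly into the XPath query.

```php
<?php

/**
 * Return the text and href of every <a> element carrying the given class.
 * (extractLinks is a hypothetical helper name, not part of any library.)
 */
function extractLinks(string $html, string $class): array
{
    // Real-world HTML is rarely valid, so collect parser warnings quietly
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // XPath has no class selector, so match the class as a whole word
    $xpath = new DOMXPath($document);
    $query = sprintf(
        "//a[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    $links = [];
    foreach ($xpath->query($query) as $anchor) {
        $links[] = [
            'text' => trim($anchor->textContent),
            'href' => $anchor->getAttribute('href'),
        ];
    }

    return $links;
}

// Inline sample markup stands in for a downloaded page
$html = <<<HTML
<html><body>
  <a class="crd_lnk" href="/news/first">First headline</a>
  <a href="/about">About</a>
  <a class="featured crd_lnk" href="/news/second">Second headline</a>
</body></html>
HTML;

foreach (extractLinks($html, 'crd_lnk') as $link) {
    echo $link['text'], ' -> ', $link['href'], "\n";
}
```

For a live page, the HTML string could come from file_get_contents($url), assuming allow_url_fopen is enabled. A library such as Goutte replaces the XPath query with CSS selectors and handles the HTTP request for you.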

Install Goutte with Composer:

composer require fabpot/goutte

Then create a file named scrap.php:

<?php

require 'vendor/autoload.php';

use Goutte\Client;

// Create a new Goutte client
$client = new Client();

// Specify the URL to scrape
$url = 'https://ndtv.in';

// Fetch the HTML content of the page
$crawler = $client->request('GET', $url);

// Extract the title
$title = $crawler->filter('.crd_lnk')->text();

// Output the title
echo "Title: $title\n";

?>
Scraping in a loop

<?php

require 'vendor/autoload.php';

use Goutte\Client;

// Create a new Goutte client
$client = new Client();

// Specify the URL to scrape
$url = 'https://ndtv.in';

// Fetch the HTML content of the page
$crawler = $client->request('GET', $url);

// Print the text of every element that matches the selector
$crawler->filter('.crd_lnk')->each(function ($node) {
    print $node->text()."\n";
});


?>
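Following a link

// Read the headline text, find the link with that text, and follow it
$news = $crawler->filter("<headline's selector>")->text();
$link = $crawler->selectLink($news)->link();
$crawler = $client->click($link);

// Inside an each() callback, read the href of the first <a> within $node
$link = $node->filter('a')->attr('href');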

// Print the src attribute of every matching image
$crawler->filter('.crd_img-full > a > img')->each(function ($node) {
    print $node->attr('src')."\n";
});
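
Submitting a form

$client = new Client();
$crawler = $client->request('GET', 'https://github.com/');

// Follow the "Sign in" link to reach the login page
$crawler = $client->click($crawler->selectLink('Sign in')->link());

// Select the form through its submit button and send the credentials
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, ['login' => 'your email', 'password' => 'your password']);

// Print the first <h1> of the page returned after the form is submitted
$h1 = $crawler->filter("h1")->text();

echo($h1."\n");
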
The methods used in these examples:

  • request(): sends a request to the specified URL and returns a Crawler object that represents the HTML content of the page
  • selectLink(): selects the links on a page whose text contains the given value
  • link(): returns a Link object for the first selected element
  • click(): follows a selected link and loads the page it points to
  • text(): returns the text content of the first selected element
  • filter(): selects the HTML elements that match a CSS selector, such as a class name, ID, or tag
  • selectButton(): selects a button by its label; calling form() on the result returns the form that contains it
  • submit(): submits a form object, optionally with specific form data
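To see several of these methods without touching the network, the sketch below builds a Crawler directly from an inline HTML string. It assumes the Composer install shown earlier, which brings in Symfony's DomCrawler and CssSelector components as Goutte dependencies. The markup, the example.com URI, and the form field names are made up for illustration.

```php
<?php

require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up login page; example.com and the field names are placeholders
$html = <<<HTML
<html><body>
  <h1>Welcome</h1>
  <a href="/login">Sign in</a>
  <form action="/session" method="post">
    <input type="text" name="login">
    <input type="password" name="password">
    <input type="submit" value="Sign in">
  </form>
</body></html>
HTML;

// The second argument is the page URI, needed to resolve relative links
$crawler = new Crawler($html, 'https://example.com/');

// filter() + text(): select by CSS selector, then read the first match's text
$heading = $crawler->filter('h1')->text();

// selectLink() + link(): find a link by its text and resolve its absolute URI
$signInUrl = $crawler->selectLink('Sign in')->link()->getUri();

// selectButton() + form(): find the button by its label and get its form
$form = $crawler->selectButton('Sign in')->form();
$form->setValues(['login' => 'your email', 'password' => 'your password']);

echo $heading, "\n";
echo $signInUrl, "\n";
echo $form->getMethod(), ' ', $form->getUri(), "\n";
print_r($form->getValues());
```

With a live client, request() would produce the Crawler and click() and submit() would send the link and the form over HTTP. Here they are left out so the selection logic can be checked offline.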

 

Scrape and save in CSV format

<?php 

// Include the required autoload file 
require 'vendor/autoload.php'; 

// Import the Goutte client class 
use Goutte\Client; 

// Create a new instance of the Goutte client 
$client = new Client(); 

// Define the URL of the web page to scrape 
$url = "https://news.ycombinator.com/"; 

// Send a GET request to the URL and retrieve the web page 
$crawler = $client->request('GET', $url); 

// Create an empty array to store the extracted data 
$data = []; 

// Filter the DOM elements with class 'titleline' and perform an action for each matched element
$crawler->filter('.titleline')->each(function ($node) use (&$data) {

    // Extract the title text from the node
    $title = $node->text();

    // Extract the link URL from the node
    $link = $node->filter('a')->attr('href');

    // Add the title and link to the data array
    $data[] = [$title, $link];
});

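// Specify the directory path where you want to save the CSV file
$directory = 'data/';

// Create the directory if it does not exist yet; otherwise fopen() fails
if (!is_dir($directory)) {
    mkdir($directory, 0777, true);
}

// Specify the CSV file path
$filePath = $directory . 'scraped_data.csv';

// Create a CSV file for writing, and stop if it cannot be opened
$csvFile = fopen($filePath, 'w');
if ($csvFile === false) {
    exit("Could not open $filePath for writing\n");
}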

// Write headers to the CSV file 
fputcsv($csvFile, ['Title', 'Link']); 

// Write each row of data to the CSV file
foreach ($data as $row) {
    fputcsv($csvFile, $row);
}

// Close the CSV file 
fclose($csvFile);
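The CSV step can also be tried on its own, without scraping anything. The sketch below wraps the write logic in a function and reads the file back to confirm the rows survive the round trip. The helper names saveRowsAsCsv and readCsv, the temporary directory, and the sample rows are all made up for illustration.

```php
<?php

/**
 * Write a header row plus data rows to a CSV file, creating the
 * target directory first when it does not exist.
 * (saveRowsAsCsv is a hypothetical helper name.)
 */
function saveRowsAsCsv(string $filePath, array $header, array $rows): void
{
    $directory = dirname($filePath);
    if (!is_dir($directory) && !mkdir($directory, 0777, true) && !is_dir($directory)) {
        throw new RuntimeException("Could not create directory: $directory");
    }

    $handle = fopen($filePath, 'w');
    if ($handle === false) {
        throw new RuntimeException("Could not open file: $filePath");
    }

    // Delimiter, enclosure, and escape are passed explicitly so the
    // call behaves the same across PHP versions
    fputcsv($handle, $header, ',', '"', '\\');
    foreach ($rows as $row) {
        fputcsv($handle, $row, ',', '"', '\\');
    }
    fclose($handle);
}

/**
 * Read every row of a CSV file into an array of arrays.
 * (readCsv is a hypothetical helper name.)
 */
function readCsv(string $filePath): array
{
    $handle = fopen($filePath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Could not open file: $filePath");
    }

    $rows = [];
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($handle);

    return $rows;
}

// Made-up sample rows stand in for live scraped data
$filePath = sys_get_temp_dir() . '/scraper_demo/scraped_data.csv';
saveRowsAsCsv($filePath, ['Title', 'Link'], [
    ['Example story, with a comma', 'https://example.com/story'],
    ['Another "quoted" story', 'https://example.com/other'],
]);

foreach (readCsv($filePath) as $row) {
    echo implode(' | ', $row), "\n";
}
```

Titles that contain commas or quotes are the usual reason to prefer fputcsv() over joining fields by hand, since it adds the enclosing quotes and escaping for you.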

 

 

 

 

 

 
