Web scraping lets you collect data from web pages programmatically. It's also called web data extraction, and it is closely related to web crawling. PHP is a widely used back-end scripting language for building dynamic websites and web applications, and you can write a web scraper in plain PHP or with a library such as Goutte, which the examples here use.
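As a rough illustration of the plain-PHP route, the sketch below parses an HTML string with the built-in DOM extension and pulls out link text and URLs. The sample markup and its href values are made up for the example and stand in for a page you would normally download first; only the crd_lnk class name is borrowed from the examples in this post.

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
$html = <<<HTML
<html><body>
  <a class="crd_lnk" href="/news/1">First headline</a>
  <a class="crd_lnk" href="/news/2">Second headline</a>
</body></html>
HTML;

// Collect libxml parse warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// XPath equivalent of the CSS selector "a.crd_lnk".
$xpath = new DOMXPath($doc);
$query = '//a[contains(concat(" ", normalize-space(@class), " "), " crd_lnk ")]';

$headlines = [];
foreach ($xpath->query($query) as $a) {
    $headlines[] = [
        'text' => trim($a->textContent),
        'href' => $a->getAttribute('href'),
    ];
}

print_r($headlines);
```

This works without any dependency, but the XPath needed for a simple class match is verbose, which is one reason a library with CSS selector support is convenient.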
composer require fabpot/goutte
scrap.php
<?php
require 'vendor/autoload.php';
use Goutte\Client;
// Create a new Goutte client
$client = new Client();
// Specify the URL to scrape
$url = 'https://ndtv.in';
// Fetch the HTML content of the page
$crawler = $client->request('GET', $url);
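// Extract the text of the first element matching the .crd_lnk selector
// (text() throws an InvalidArgumentException if nothing matches)
$title = $crawler->filter('.crd_lnk')->text();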
// Output the title
echo "Title: $title\n";
?>
Scraping in a loop
<?php
require 'vendor/autoload.php';
use Goutte\Client;
// Create a new Goutte client
$client = new Client();
// Specify the URL to scrape
$url = 'https://ndtv.in';
// Fetch the HTML content of the page
$crawler = $client->request('GET', $url);
// Print the text of every element matching the selector
$crawler->filter('.crd_lnk')->each(function ($node) {
    echo $node->text() . "\n";
});
?>
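// Follow a headline link: read the headline text, select the link with that text, and click it
$news = $crawler->filter("<headline's selector>")->text();
$link = $crawler->selectLink($news)->link();
$crawler = $client->click($link);
// Inside an each() callback: read the href attribute of the first link in the current node
$link = $node->filter('a')->attr('href');
// Print the src attribute of every matching image
$crawler->filter('.crd_img-full > a > img')->each(function ($node) {
    echo $node->attr('src') . "\n";
});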
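Note that attr('href') and attr('src') return the attribute exactly as written in the page, so the value may be a relative URL. The helper below is a simplified sketch of turning such a value into an absolute URL; resolveUrl() is a hypothetical name, the example.com URLs are placeholders, and the function deliberately ignores cases such as "../" segments and query-only references.

```php
<?php
// Hypothetical helper: resolve an href/src value against the page URL.
// Simplified on purpose: no "../" handling, no query-only or fragment-only references.
function resolveUrl(string $base, string $href): string
{
    // Already absolute (has a scheme such as https: or mailto:).
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;
    }

    $b = parse_url($base);
    $origin = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');

    // Protocol-relative: //cdn.example.com/a.png
    if (strncmp($href, '//', 2) === 0) {
        return $b['scheme'] . ':' . $href;
    }

    // Root-relative: /img/a.png
    if (strncmp($href, '/', 1) === 0) {
        return $origin . $href;
    }

    // Path-relative: resolve against the directory of the base path.
    $dir = isset($b['path']) ? preg_replace('#/[^/]*$#', '/', $b['path']) : '/';

    return $origin . $dir . $href;
}

$page = 'https://example.com/news/index.html';
echo resolveUrl($page, '/img/a.png') . "\n";
echo resolveUrl($page, 'story.html') . "\n";
echo resolveUrl($page, '//cdn.example.com/a.png') . "\n";
```

With Goutte itself, calling link()->getUri() on a link node (or image()->getUri() on an image node) should give the absolute URL without a custom helper, since the underlying Symfony DomCrawler resolves it against the page URL.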
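// Log in through a form: open the page, follow the "Sign in" link, fill in the form, and submit it
$client = new Client();
$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, ['login' => 'your email', 'password' => 'your password']);
// Print the first <h1> of the page returned after the form is submitted
$h1 = $crawler->filter("h1")->text();
echo $h1 . "\n";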
request(): sends a request to the specified URL and returns a crawler object that represents the HTML content of the page
selectLink(): selects links whose text (or image alt text) contains the given string
link(): returns a link object for the selected link element
click(): follows a selected link, as if it had been clicked
text(): returns the text content of an HTML element
filter(): selects HTML elements that match a CSS selector, such as a class name, ID, or tag
selectButton(): selects a button with a specific label; calling form() on it returns the form that contains the button
submit(): submits a form object with the given form data
Scrape and save in CSV format
<?php
// Include the required autoload file
require 'vendor/autoload.php';
// Import the Goutte client class
use Goutte\Client;
// Create a new instance of the Goutte client
$client = new Client();
// Define the URL of the web page to scrape
$url = "https://news.ycombinator.com/";
// Send a GET request to the URL and retrieve the web page
$crawler = $client->request('GET', $url);
// Create an empty array to store the extracted data
$data = [];
// Filter the DOM elements with class 'titleline' and perform an action for each matched element
$crawler->filter('.titleline')->each(function ($node) use (&$data) {
    // Extract the title text from the node
    $title = $node->text();
    // Extract the link URL from the node
    $link = $node->filter('a')->attr('href');
    // Add the title and link to the data array
    $data[] = [$title, $link];
});
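// Specify the directory where you want to save the CSV file
$directory = 'data/';
// Create the directory if it does not exist yet; otherwise fopen() below fails
if (!is_dir($directory)) {
    mkdir($directory, 0777, true);
}
// Specify the CSV file path
$filePath = $directory . 'scraped_data.csv';
// Open the CSV file for writing
$csvFile = fopen($filePath, 'w');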
// Write headers to the CSV file
fputcsv($csvFile, ['Title', 'Link']);
// Write each row of data to the CSV file
foreach ($data as $row) {
    // Write a row to the CSV file
    fputcsv($csvFile, $row);
}
// Close the CSV file
fclose($csvFile);
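To check that the saved file can be read back correctly, a round trip with fgetcsv() is a quick test. The sketch below uses made-up rows in the same [title, link] shape as $data and a temporary file instead of data/scraped_data.csv; it also passes the separator, enclosure, and escape arguments explicitly (an empty escape string needs PHP 7.4 or later), which is a choice made for this example and not something the script above does.

```php
<?php
// Hypothetical rows in the same [title, link] shape as the scraper's $data array.
$data = [
    ['Show HN: A tiny, fast parser', 'https://example.com/parser'],
    ['He said "hello"', 'https://example.com/quote'],
];

// Write the rows to a temporary CSV file.
$filePath = tempnam(sys_get_temp_dir(), 'scrape_');
$out = fopen($filePath, 'w');
fputcsv($out, ['Title', 'Link'], ',', '"', '');
foreach ($data as $row) {
    // fputcsv() quotes fields containing commas or double quotes.
    fputcsv($out, $row, ',', '"', '');
}
fclose($out);

// Read the file back, using the header row as array keys.
$in = fopen($filePath, 'r');
$header = fgetcsv($in, 0, ',', '"', '');
$rows = [];
while (($row = fgetcsv($in, 0, ',', '"', '')) !== false) {
    $rows[] = array_combine($header, $row);
}
fclose($in);
unlink($filePath);

print_r($rows);
```

Titles that contain commas or quotes survive the round trip because fputcsv() and fgetcsv() handle the quoting, which is the main reason to prefer them over joining fields with implode().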