WWW::Mechanize for PHP
I've been familiar with Perl's WWW::Mechanize for some years now, and thought it would be fun to create a PHP version of Mechanize that is similar in features and functionality. Getting a page's content with file_get_contents() is okay in some situations, but often something that acts more like a User Agent - and has the capabilities to parse, retrieve, and manipulate specific data - is needed.
This is not an exact port of WWW::Mechanize, so if you've used Mechanize's interface you'll find many similarities but also a lot of differences. Even so, it gives you a powerful PHP web scraper for visiting URLs and scraping data, and a PHP alternative to Mechanize.
Overview
Compass_Mechanize is a PHP package that allows you to visit websites, parse and collect data, and interact with the website without having to worry about the user agent portion. Rather than relying on regular expressions for finding and gathering data from a site, Compass_Mechanize utilizes xpath selectors since they are accurate and the perfect tool for selecting elements in an xhtml document. Regular expressions are still available.
Requirements
Note: The Zend Framework has a fantastic HTTP client and HTTP response object, which were the main reasons for making the Zend Framework a requirement. I also used its Validators and Filters, which will make your life much easier if you're using this script. If you'd prefer not to depend on the Zend Framework, you can always extend the class and supply your own HTTP client and response object.
Initializing
Unzip the file and place the Compass folder in your project library. For the most flexibility, set up an autoloader. I typically use this format:
/httpdocs
/library
    /Compass
    /Zend
    ....
Sample Uses
<?php
$mech = new Compass_Mechanize;
$mech->get('http://www.example.com');
echo $mech;
?>
Helper Methods
Helper methods are there to help you find common elements on the page or perform common tasks. All of the get...() methods below (except getTitle()) return a Compass_Mechanize_Elements object.
<?php
$mech = new Compass_Mechanize;
$mech->get('http://www.example.com');
$links = $mech->getLinks();
$images = $mech->getImages();
$mech->get('http://www.example.org');
$forms = $mech->getForms();
$javascript = $mech->getJavascript();
$stylesheets = $mech->getStylesheets();
$statsScripts = $mech->find("//script[starts-with(@src, 'http://stats.compasswebpublisher.com')]");
echo $mech->getTitle();
?>
find() Method
The find() method allows you to find anything on the page using an xpath selector. It takes an optional second argument - a numeric limit - to limit the number of elements it returns. find() returns a Compass_Mechanize_Elements object.
<?php
$mech->find("//div[@id='wrapper']//img", 3);
?>
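To show what an XPath lookup with a numeric limit involves, here is a minimal sketch built on PHP's own DOMDocument and DOMXPath classes rather than on Compass_Mechanize itself. The findWithLimit() helper and the sample HTML are assumptions made up for illustration; they are not part of the package.

```php
<?php
// Hypothetical helper (not part of Compass_Mechanize): run an XPath query
// against an HTML string and return at most $limit matching DOMElement nodes.
function findWithLimit(string $html, string $selector, ?int $limit = null): array
{
    $doc = new DOMDocument();
    // Suppress the warnings libxml raises for imperfect real-world HTML.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $matches = [];
    foreach ($xpath->query($selector) as $node) {
        if ($limit !== null && count($matches) >= $limit) {
            break; // stop once the limit is reached
        }
        $matches[] = $node;
    }
    return $matches;
}

// Sample markup with four images inside the wrapper div.
$html = '<html><body><div id="wrapper">'
      . '<img src="a.png"><p><img src="b.png"></p><img src="c.png"><img src="d.png">'
      . '</div></body></html>';

$images = findWithLimit($html, "//div[@id='wrapper']//img", 3);
foreach ($images as $img) {
    echo $img->getAttribute('src'), "\n";
}
```

Matches come back in document order, so a limit keeps the first matches found on the page.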
Working with Compass_Mechanize_Elements
Compass_Mechanize_Elements holds an array of elements where each element is a Compass_Mechanize_Element. These are just wrapper classes to DOMNodeList and DOMElement objects - and provide some valuable methods.
<?php
$mech = new Compass_Mechanize;
$mech->get('http://www.example.com');
$links = $mech->getLinks();
echo 'Links Found: ' . $links->length . '<br />';
foreach ($links->getElements() as $link) {
echo 'Anchor Text: ' . $link->getText() . '<br />';
echo 'URL: ' . $link->getAttribute('href') . '<br />';
echo 'Absolute URL: ' . $mech->absoluteUrl($link->getAttribute('href')) . '<br /><br />';
}
?>
Working with Compass_Mechanize_Element objects
Compass_Mechanize_Elements (plural) holds an array of Compass_Mechanize_Element (singular) objects. You can access this array by calling getElements() on a Compass_Mechanize_Elements object. Compass_Mechanize_Element is a wrapper for the DOMElement object. You have complete access to the methods defined for DOMElement, plus three additional or modified methods: getText(), getAttribute(), and extractText().
<?php
$mech = new Compass_Mechanize;
$mech->get('http://www.example.com');
$elements = $mech->getLinks();
$filter = new Zend_Filter;
$filter->addFilter(new Zend_Filter...);
$filter->addFilter(new Zend_Filter...);
foreach ($elements->getElements() as $element) {
echo 'Anchor Text: ' . $element->getText($filter) . '<br />';
echo 'Href: ' . $element->getAttribute('href', array(new Zend_Filter_StringToLower)) . '<br />';
echo 'Extracted: ' . $element->extractText($regexPattern) . '<br /><br />';
}
?>
Compass_Mechanize_Element objects also allow you to do contextual finds. Here is an example:
<?php
$mech->get('http://www.example.com');
$rows = $mech->find("//table[@id='monthly-figures']/tr");
foreach ($rows->getElements() as $row) {
$date = $row->find('td[1]')->getText();
$revenue = $row->find('td[2]')->getText();
}
?>
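A contextual find is a relative XPath query evaluated against one node instead of the whole document. As a rough sketch of the idea, using PHP's DOMXPath directly (the table contents below are invented sample data, and this is not the package's internal code):

```php
<?php
// Invented sample data standing in for a real page.
$html = '<table id="monthly-figures">'
      . '<tr><td>2010-01</td><td>1200</td></tr>'
      . '<tr><td>2010-02</td><td>1350</td></tr>'
      . '</table>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$figures = [];
// "//tr" also matches rows that a parser or the page author wrapped in <tbody>.
foreach ($xpath->query("//table[@id='monthly-figures']//tr") as $row) {
    // Passing $row as the second argument resolves the relative
    // path "td[1]" against that row only.
    $date    = $xpath->query('td[1]', $row)->item(0)->textContent;
    $revenue = $xpath->query('td[2]', $row)->item(0)->textContent;
    $figures[$date] = (int) $revenue;
}

print_r($figures);
```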
Getting the HTML
<?php
$mech = new Compass_Mechanize;
$mech->get('http://www.example.com');
$mech->find('/html/head/meta')->remove();
$html = $mech->getContents();
?>
Getting the Text-only content
<?php
$mech = new Compass_Mechanize;
$mech->get('http://www.example.com');
$content = $mech->getContents(true);
$contentWithNewlines = $mech->getContents(true, true);
?>
Narrowing Compass_Mechanize_Elements with Validators
You can further narrow down results using Zend_Validators by stripping out any elements that don't validate based on your criteria. You can very easily write your own validators as well.
<?php
$links = $mech->getLinks()
->addCriteria('_text', array(new Zend_Validate_Regex('/^Click/i')));
$links = $mech->getLinks()
->addCriteria('href', array(
new Zend_Validate_Regex('/^https?/i'),
new Zend_Validate_StringLength(15, 27)
))
->randomize()
->setLimit(2);
$images = $mech->find("//img[starts-with(@src, '/assets/images')]")
->addCriteria('src', array(
new Zend_Validate_Regex('/jpe?g$/'),
));
foreach ($images->getElements() as $image) {
echo $mech->absoluteUrl($image->getAttribute('src')) . '<br />';
}
?>
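The narrowing idea itself is simple: an element survives only if its attribute passes every validator. Here is a sketch of that logic using plain PHP closures in place of Zend_Validate objects. The filterByCriteria() helper and the sample links are hypothetical and only illustrate the concept.

```php
<?php
// Hypothetical helper: keep the elements whose attribute passes every validator.
function filterByCriteria(iterable $elements, string $attribute, array $validators): array
{
    $kept = [];
    foreach ($elements as $element) {
        $value = $element->getAttribute($attribute);
        foreach ($validators as $isValid) {
            if (!$isValid($value)) {
                continue 2; // one failed check rejects the element
            }
        }
        $kept[] = $element;
    }
    return $kept;
}

$doc = new DOMDocument();
@$doc->loadHTML(
    '<a href="http://example.com/a">A</a>'
    . '<a href="/relative">B</a>'
    . '<a href="https://example.org/a/very/long/path/indeed">C</a>'
);
$xpath = new DOMXPath($doc);

$links = filterByCriteria($xpath->query('//a'), 'href', [
    // Must start with http or https.
    function (string $v): bool { return preg_match('/^https?/i', $v) === 1; },
    // Must be between 15 and 27 characters long.
    function (string $v): bool { return strlen($v) >= 15 && strlen($v) <= 27; },
]);

foreach ($links as $link) {
    echo $link->getAttribute('href'), "\n";
}
```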
Removing Duplicate Elements with unique()
Compass_Mechanize_Elements has a unique() method that takes an attribute name and filters out duplicates based on that attribute's value. This is helpful if you are going to follow links on a page and want to remove duplicates first.
<?php
$links = $mech->find("//a")->unique('href', true);
?>
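For a sense of what de-duplicating by attribute means, here is a small sketch with PHP's DOM classes. The uniqueByAttribute() helper is an assumption for illustration and says nothing about how unique() is implemented in the package.

```php
<?php
// Hypothetical helper: keep only the first element seen for each attribute value.
function uniqueByAttribute(iterable $elements, string $attribute): array
{
    $seen = [];
    $unique = [];
    foreach ($elements as $element) {
        $value = $element->getAttribute($attribute);
        if (!isset($seen[$value])) {
            $seen[$value] = true;
            $unique[] = $element;
        }
    }
    return $unique;
}

$doc = new DOMDocument();
@$doc->loadHTML('<a href="/a">One</a><a href="/b">Two</a><a href="/a">One again</a>');
$xpath = new DOMXPath($doc);

$links = uniqueByAttribute($xpath->query('//a'), 'href');
foreach ($links as $link) {
    echo $link->getAttribute('href'), "\n";
}
```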
Working with Filters
Filters allow you to filter or change elements on the page. Compass_Mechanize_Elements uses the Composite design pattern to allow you to apply filters regardless of how many elements you have. Any changes you make will be applied to the DOM in case you need to echo or save the updated HTML. You can very easily write your own filters as well.
<?php
$mech->getLinks()
->getAttribute('href', array(
new Zend_Filter_Callback('urlencode')
));
$mech->getImages()
->getAttribute('src', array(
new Zend_Filter_PregReplace('/jpe?g$/', 'gif')
));
?>
Removing Elements from the Page
A call to remove() works on the entire Compass_Mechanize_Elements object whether there are 1 or 100 elements.
<?php
$mech->find('/html/head/meta')->remove();
$mech->find('/html/head/base')->remove();
$mech->getLinks()
->addCriteria('href', array(
new Zend_Validate_Regex('/^https?/i'),
))
->remove();
?>
Clicking Links with followLink()
<?php
$mech = new Compass_Mechanize;
$mech->get('http://www.example.com');
$mech->followLink("//a[contains(text(), 'RFC 2606')]");
?>
Submitting Forms and Authenticating
Compass_Mechanize allows you to submit forms and handles storing and sending cookies for you (using Zend_Http_Client). This allows you to authenticate against sites that require logging in. HTTPS authentication is allowed.
submitForm() accepts an array with up to four options:
- form: the xpath selector for the form you are working with
- fields: an array of fields and values to submit
- [submit]: Optional. The xpath selector for the submit button, within the context of the form. Defaults to //input[@type='submit']
- [hidden]: Optional. bool true or false. Whether or not Compass_Mechanize should automatically submit any hidden inputs on your behalf. Defaults to true
<?php
$mech = new Compass_Mechanize;
$mech->get('https://www.example.com');
$mech->submitForm(array(
'form' => "//form[@id='user_login_form']",
'fields' => array(
'email' => 'you@example.com',
'pass' => 'password1'
)
));
$mech->followLink("//a[@href='/myProfile.php']");
echo $mech;
?>
Setting Custom HTTP Headers
Sometimes you need to set custom HTTP headers for a request. A great example is a site that submits a form using AJAX - and checks for AJAX submissions. If you try to submit the form without AJAX, the site may reject your submission. Here's an example for this that shows you how to use the addHeaders() method.
<?php
$mech = new Compass_Mechanize;
$mech->get('https://www.example.com');
$mech->addHeaders(array(
'X-Requested-With' => 'XMLHttpRequest'
));
$mech->submitForm(array(
'form' => "//form[@id='user_login_form']",
'fields' => array(
'email' => 'you@example.com',
'pass' => 'password1'
)
));
?>
Utilizing the extractText() method
extractText() lets you pull a piece of text out of an element on the page using a regular expression. This is useful for anything that cannot be selected with xpath alone, like a date on the page that is not wrapped in its own html tag.
<?php
$mech = new Compass_Mechanize;
$mech->get('http://www.example.com');
$date = $mech->find("//div[@id='content']")->extractText($pattern);
?>
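The same two-step idea (XPath to find the container, then a regular expression on its text) can be sketched with built-in PHP functions. The sample HTML and the date pattern are assumptions chosen for this example.

```php
<?php
// Invented sample: the date is loose text, not wrapped in its own tag.
$html = '<div id="content">Posted on 2011-03-14 by the site admin. <p>Body text.</p></div>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Step 1: XPath narrows the search to the container element.
$container = $xpath->query("//div[@id='content']")->item(0);

// Step 2: a regular expression picks the date out of the container's text.
$pattern = '/\d{4}-\d{2}-\d{2}/'; // assumed YYYY-MM-DD format
$date = preg_match($pattern, $container->textContent, $m) ? $m[0] : null;

echo $date, "\n";
```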
Working with Zend_Http_Response
After making an http request, Compass_Mechanize stores a Zend_Http_Response object that you can access via the getResponse() method. Here are a few examples to get you going. You can see more detail in the Zend documentation.
<?php
$mech = new Compass_Mechanize;
$mech->get('http://www.example.com');
echo 'Status Code: ' . $mech->getResponse()->getStatusCode() . '<br />';
echo 'Server: ' . $mech->getResponse()->getHeader('Server') . '<br /><br />';
?>
Utilizing Delays
Delays can be helpful when you want your code to look more natural - rather than executing all of your commands so quickly. Compass_Mechanize allows you to set a minimum and maximum delay in seconds, and will randomly select a delay that falls within that range between requests. Delays are turned off by default.
<?php
$mech = new Compass_Mechanize;
$mech->enableDelay(0.5, 2.5);
$mech->get('http://www.example.com');
$mech->followLink("//a[@id='someValue']");
$mech->followLink("//a[@href='/some_content.html']");
?>
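Picking a random delay inside a minimum/maximum range takes only a few lines. The sketch below shows one way to do it in plain PHP. The randomDelay() helper is hypothetical, and the short range is chosen only so the sample finishes quickly.

```php
<?php
// Hypothetical helper: return a random delay between $min and $max seconds.
function randomDelay(float $min, float $max): float
{
    // mt_rand() / mt_getrandmax() yields a float between 0 and 1 inclusive.
    return $min + ($max - $min) * (mt_rand() / mt_getrandmax());
}

// Short range so the sample runs quickly; a scraper might use 0.5 to 2.5.
$delay = randomDelay(0.05, 0.25);
printf("Sleeping for %.3f seconds\n", $delay);

// usleep() takes microseconds, so fractional seconds are supported.
usleep((int) round($delay * 1000000));
```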
Moving forward() and back()
Compass_Mechanize keeps track of history items during the script's execution. This allows you to move forward or back should you need to.
<?php
$mech = new Compass_Mechanize;
$mech->get('http://www.example.com');
$mech->get('http://www.example.org');
$mech->back(-1);
?>
Downloading Files
Although you could call get(), Compass_Mechanize provides a getFile() method that will submit a get request and return the file contents, the content type defined by the server, and the filename.
<?php
$pdfs = $mech->find("//a")
->addCriteria('href', array(
new Zend_Validate_Regex('/\.pdf$/')
))
->unique('href', true);
foreach ($pdfs->getElements() as $pdf) {
if (($file = $mech->getFile($pdf->getAttribute('href'))) !== false) {
file_put_contents('pdfs/' . $file->filename, $file->contents);
}
}
?>
Powerful PHP Web Scraper
Compass_Mechanize can be used as a powerful web scraper for PHP programmers. Keep in mind, however, that PHP enforces a script execution time limit (usually 30-60 seconds) that limits your scraping. You can get around this by changing these settings:
<?php
ini_set('max_execution_time', '0');
ini_set('max_input_time', '0');
set_time_limit(0);
?>
You can also execute your PHP code from the command line.
Because of how PHP works, you are limited to running one get() request at a time, which dramatically increases the time required to scrape a website. You may want to look at PHP's Process Control functions to create child processes and run multiple requests at once. Unfortunately, PCNTL functions are not enabled by default and require you to recompile PHP. They are also unavailable on non-Unix platforms. However, if you do enable them, here is an untested code sample that will allow you to spawn child processes:
<?php
$items = array();
$maxChildren = 3;
$execute = 0;
$mech = new Compass_Mechanize;
$mech->get('http://www.example.com');
$mech->followLink("//a[@href='something/']");
$links = $mech->find("//p[following-sibling::h4[text()='Some Text']]/a");
foreach ($links->getElements() as $link) {
$pid = pcntl_fork();
if ($pid == -1) {
die("could not fork");
} elseif ($pid) {
$execute++;
if ($execute >= $maxChildren){
pcntl_wait($status);
$execute--;
}
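} else {
// Child process: fetch and parse a single page.
$mech->get($link->getAttribute('href'));
$items[] = array(
'title' => $mech->find('//h2')->getText(),
'body' => $mech->find("//div[@id='content']")->getText(),
'date' => $mech->find('//body')->extractText($pattern)
);
// $items exists only in this child's memory. Save it (file, database)
// before exiting if the parent needs the results.
exit(0); // stop here so the child does not continue the loop and fork again
}
}
// Parent: wait for any children that are still running.
while (pcntl_wait($status) > 0) {
$execute--;
}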
?>
The Reason for Zend Validators and Filters
Every web page is different. A solution that only allowed regular expressions was too limited for my needs. One of my first tests was a site whose link hrefs went through a tracking URL such as www.thissite.com/track.php?url=http://www.theActualSite.com. The value of the url variable was what I wanted, but I also needed to urldecode() it. Using Zend Filters, I can quickly write a regex filter that removes everything up to and including ?url= and a second filter that urldecode()s the result. This format allows you to apply one filter or a chain of filters for ultimate flexibility. Zend_Filter_Callback is handy when you need to run a PHP function as a filter, such as urldecode(), and you can also write your own custom filters quickly and easily.
Validators work in a similar way, but instead of changing values they remove the elements in Compass_Mechanize_Elements that do not pass your validation tests.
Finally, the Zend Framework already comes with a variety of Validators and Filters out of the box that will do most of the work for you. This allows you to write powerful scrapers without all of the code or need for regular expressions.
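The tracking-URL case described above boils down to a chain of two string filters. Here is a sketch of that chain using plain PHP callables instead of Zend_Filter objects. The applyFilters() helper and the sample href are assumptions for illustration.

```php
<?php
// Hypothetical helper: pass a value through each filter in order.
function applyFilters(string $value, array $filters): string
{
    foreach ($filters as $filter) {
        $value = $filter($value);
    }
    return $value;
}

$filters = [
    // 1. Drop everything up to and including the "url=" parameter name.
    function (string $href): string {
        return preg_replace('/^.*?[?&]url=/', '', $href);
    },
    // 2. Decode the percent-encoded target URL.
    'urldecode',
];

// Assumed sample href in the style of a tracking redirect.
$href = 'http://www.thissite.com/track.php?url=http%3A%2F%2Fwww.theActualSite.com%2Fpage';
$target = applyFilters($href, $filters);

echo $target, "\n";
```

Because each filter is just a callable, adding or reordering steps means editing the array, which mirrors the appeal of chaining Zend filters.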
Some XPath Selectors
XPath is surprisingly easy, and I decided to use XPath in Compass_Mechanize because it is designed for finding specific elements in an xhtml document. Most of the time you will be able to get what you need with XPath selectors, and can add in a regular expression as needed. Here are a few basics to get you started:
/
The child of an element. E.g. //p/a selects a elements that are direct children of p elements.
//
Allows you to select an element that is some descendant of another. E.g. /html/body//a finds links that are descendants (no matter how many levels deep) of the body tag.
//a[@id]
An a element that has an id attribute - regardless of the value.
//a[@id='login']
An a element that has an id attribute equal to login
//a[contains(@id, 'login')]
An a element where the id attribute contains login. contains() is a function that takes two parameters. The first parameter can be an attribute or function like text()
//img[starts-with(@src, '/assets/images')]
An img element whose src attribute starts with /assets/images
//a[@id and @href='/test']
The and lets you specify multiple criteria within the square brackets. In this case, a link that has an id attribute and href = '/test'
//a[text()='PHP Scraper']
Find all a elements where the anchor text equals PHP Scraper
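To try these selectors without any scraping library, you can run them through PHP's DOMXPath against a small document. The HTML below is invented sample markup, and the script simply counts how many nodes each selector matches.

```php
<?php
// Invented sample markup that exercises each selector.
$html = '<html><body>'
      . '<p><a id="login" href="/test">PHP Scraper</a></p>'
      . '<div><a id="user-login-help" href="/help">Help</a><a href="/plain">Plain</a></div>'
      . '<img src="/assets/images/logo.png"><img src="/other/photo.jpg">'
      . '</body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$selectors = [
    "/html/body//a",
    "//a[@id]",
    "//a[@id='login']",
    "//a[contains(@id, 'login')]",
    "//img[starts-with(@src, '/assets/images')]",
    "//a[@id and @href='/test']",
    "//a[text()='PHP Scraper']",
];

$counts = [];
foreach ($selectors as $selector) {
    // query() returns a DOMNodeList; length is the number of matching nodes.
    $counts[$selector] = $xpath->query($selector)->length;
    printf("%-45s %d match(es)\n", $selector, $counts[$selector]);
}
```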
License
Download
The package will be available shortly for download. In the meantime, leave your comments below.