WWW::Mechanize for PHP

I've been familiar with Perl's WWW::Mechanize for some years now, and thought it would be fun to create a PHP version of Mechanize that is similar in features and functionality. Getting a page's content with file_get_contents() is okay in some situations, but often something that acts more like a User Agent - and has the capabilities to parse, retrieve, and manipulate specific data - is needed.

This is not an exact port of WWW::Mechanize, so if you've used Mechanize's interface you'll find many similarities but also a lot of differences. Even so, it gives you a powerful PHP web scraper for visiting URLs and scraping data, and a PHP alternative to Mechanize.

Overview

Compass_Mechanize is a PHP package that allows you to visit websites, parse and collect data, and interact with the website without having to worry about the user agent portion. Rather than relying on regular expressions for finding and gathering data from a site, Compass_Mechanize uses XPath selectors, which are precise and well suited to selecting elements in an XHTML document. Regular expressions are still available.

Requirements

Note: The Zend Framework has a fantastic HTTP client and HTTP response object, which were the main reasons for making it a requirement. I also use its Validators and Filters, which will make your life much easier when working with this script. If you'd prefer not to depend on the Zend Framework, you can always extend the class and create your own HTTP client and response object.

Initializing

Unzip the file and place the Compass folder in your project library. For the most flexibility, set up an autoloader. I typically use this format:

/httpdocs
     /library
          /Compass
          /Zend
....
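
As a minimal sketch of such an autoloader, assuming the underscore-to-directory naming that both the Compass and Zend class names appear to follow: the classToPath() helper and the $libraryDir location are illustrative names of mine, not part of the package.

```php
<?php
// Hypothetical helper: map an underscore-separated class name to a file
// inside the library folder, e.g. Compass_Mechanize -> Compass/Mechanize.php
function classToPath($class, $libraryDir)
{
    return rtrim($libraryDir, '/') . '/' . str_replace('_', '/', $class) . '.php';
}

// Assumed location of the library folder shown in the layout above
$libraryDir = __DIR__ . '/library';

// Load a class file on first use, if it exists under the library folder
spl_autoload_register(function ($class) use ($libraryDir) {
    $file = classToPath($class, $libraryDir);
    if (is_file($file)) {
        require_once $file;
    }
});

echo classToPath('Compass_Mechanize_Elements', '/httpdocs/library'), "\n";
```

With something like this registered, writing new Compass_Mechanize should pull in the right file without explicit require calls.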

Sample Uses

    <?php
    $mech = new Compass_Mechanize;
    $mech->get('http://www.example.com');
    /* Output the page www.example.com to your screen */
    /* echo'ing the Compass_Mechanize object is equivalent to echo $mech->getHtml() */
    echo $mech;
    ?>

 

Helper Methods

Helper methods are there to help you find common elements on the page or perform common tasks. All of the get...() methods below (except getTitle()) return a Compass_Mechanize_Elements object.

    <?php
    $mech = new Compass_Mechanize;
    $mech->get('http://www.example.com');
    /* Find all links on the page */
    $links = $mech->getLinks();
    /* Find all images on the page */
    $images = $mech->getImages();
    /* Go to another URL. Upcoming functions apply to this URL */
    $mech->get('http://www.example.org');
    /* Find all the forms on example.org */
    $forms = $mech->getForms();
    /* Find Javascript files */
    $javascript = $mech->getJavascript();
    /* Find Stylesheets */
    $stylesheets = $mech->getStylesheets();
    /* Find anything on the page using an xpath selector */
    /* find() accepts an xpath selector or a Compass_XPath object */
    /* Here we'll look for any script tag that has an src starting with a particular hostname */
    $statsScripts = $mech->find("//script[starts-with(@src, 'http://stats.compasswebpublisher.com')]");
    /* Print out the page title */
    echo $mech->getTitle();
    ?>

 

find() Method

The find() method allows you to find anything on the page using an xpath selector. It takes an optional second argument - a numeric limit - to limit the number of elements it returns. find() returns a Compass_Mechanize_Elements object.

    <?php
    /* Find the first three images inside div#wrapper */
    $mech->find("//div[@id='wrapper']//img", 3);
    ?>

 

Working with Compass_Mechanize_Elements

Compass_Mechanize_Elements holds an array of elements where each element is a Compass_Mechanize_Element. These are just wrapper classes to DOMNodeList and DOMElement objects - and provide some valuable methods.

    <?php
    $mech = new Compass_Mechanize;
    $mech->get('http://www.example.com');
    $links = $mech->getLinks();
    /* Compass_Mechanize_Elements holds a length variable with the number of elements */
    echo 'Links Found: ' . $links->length . '<br />';
    /* Output the anchor text and href of all links on the page */
    /* Notice the use of getElements() */
    foreach ($links->getElements() as $link) {
        echo 'Anchor Text: ' . $link->getText() . '<br />';
        echo 'URL: ' . $link->getAttribute('href') . '<br />';
        echo 'Absolute URL: ' . $mech->absoluteUrl($link->getAttribute('href')) . '<br /><br />';
    }
    ?>

 

Working with Compass_Mechanize_Element objects

Compass_Mechanize_Elements (plural) holds an array of Compass_Mechanize_Element (singular) objects. You can access this array by calling getElements() on a Compass_Mechanize_Elements object. Compass_Mechanize_Element is a wrapper for the DOMElement object. You have complete access to the methods defined for DOMElement, plus these three additional/modified methods:

    <?php
    $mech = new Compass_Mechanize;
    $mech->get('http://www.example.com');
    $elements = $mech->getLinks();
    /* Filter chains. Can be passed to the functions below as either a filter chain
       or as an array of filters */
    $filter = new Zend_Filter;
    $filter->addFilter(new Zend_Filter...);
    $filter->addFilter(new Zend_Filter...);
    /* $element is an instance of Compass_Mechanize_Element */
    foreach ($elements->getElements() as $element) {
        /* getText() returns the value of the node */
        /* Accepts an optional filter */
        echo 'Anchor Text: ' . $element->getText($filter) . '<br />';
        /* getAttribute() returns the value of an attribute. */
        /* Accepts an optional filter */
        echo 'Href: ' . $element->getAttribute('href', array(new Zend_Filter_StringToLower)) . '<br />';
        /* extractText() allows you to extract text from the getText() method
           based on a regex. Explained in more detail below */
        echo 'Extracted: ' . $element->extractText($regexPattern) . '<br /><br />';
    }
    ?>

Compass_Mechanize_Element objects also allow you to do contextual finds. Here is an example:

    <?php
    $mech->get('http://www.example.com');
    /* Find all table rows */
    $rows = $mech->find("//table[@id='monthly-figures']/tr");
    foreach ($rows->getElements() as $row) {
        /* Do a find on the $row object, relative to that row */
        /* XPath positions start at 1, so td[1] is the first cell and td[2] the second */
        $date = $row->find('td[1]')->getText();
        $revenue = $row->find('td[2]')->getText();
    }
    ?>
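
To make the contextual lookup concrete without the package installed, here is a rough equivalent using PHP's built-in DOMDocument and DOMXPath classes. The table markup is made up for illustration, and this shows the underlying XPath behaviour rather than how Compass_Mechanize implements find() internally.

```php
<?php
// Made-up sample markup standing in for a fetched page
$html = '<table id="monthly-figures">'
      . '<tr><td>2010-01</td><td>$1,200</td></tr>'
      . '<tr><td>2010-02</td><td>$1,450</td></tr>'
      . '</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$figures = array();
foreach ($xpath->query("//table[@id='monthly-figures']//tr") as $row) {
    // Passing $row as the second argument makes the query relative to that row.
    // XPath positions are 1-based: td[1] is the first cell.
    $date    = $xpath->query('td[1]', $row)->item(0)->textContent;
    $revenue = $xpath->query('td[2]', $row)->item(0)->textContent;
    $figures[$date] = $revenue;
}

print_r($figures);
```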

Getting the HTML

    <?php
    $mech = new Compass_Mechanize;
    $mech->get('http://www.example.com');
    /* Remove Meta Tags */
    $mech->find('/html/head/meta')->remove();
    /* Get the page's HTML (think View > Source) with any modifications you've made */
    $html = $mech->getContents();
    ?>

 

Getting the Text-only content

    <?php
    $mech = new Compass_Mechanize;
    $mech->get('http://www.example.com');
    /* Get the page's content */
    $content = $mech->getContents(true);
    /* Optional parameter set to true will convert line breaks to newlines */
    $contentWithNewlines = $mech->getContents(true, true);
    ?>

 

Narrowing Compass_Mechanize_Elements with Validators

You can further narrow down results using Zend_Validators by stripping out any elements that don't validate based on your criteria. You can very easily write your own validators as well.

    <?php
    /* Find all links where the anchor text starts with 'Click' */
    /* addCriteria() takes two parameters. */
    /* First: Either an attribute (href, src, etc) or the magic _text */
    /* Second: A Zend_Validate chain or an array of Zend_Validate objects */
    $links = $mech->getLinks()
        ->addCriteria('_text', array(new Zend_Validate_Regex('/^Click/i')));
    /* Chaining example */
    /* randomize() the links and only return 2 */
    $links = $mech->getLinks()
        ->addCriteria('href', array(
            new Zend_Validate_Regex('/^https?/i'),
            new Zend_Validate_StringLength(15, 27)
        ))
        ->randomize()
        ->setLimit(2);
    /* find() also returns a Compass_Mechanize_Elements object */
    $images = $mech->find("//img[starts-with(@src, '/assets/images')]")
        ->addCriteria('src', array(
            new Zend_Validate_Regex('/jpe?g$/'),
        ));
    /* absoluteUrl() converts a relative URL to absolute for you */
    foreach ($images->getElements() as $image) {
        echo $mech->absoluteUrl($image->getAttribute('src')) . '<br />';
    }
    ?>
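
The narrowing idea can be tried in plain PHP, with closures standing in for the Zend_Validate objects. The narrow() function and the sample URLs below are hypothetical; this is only a sketch of "keep what passes every check", not the addCriteria() implementation.

```php
<?php
// Hypothetical stand-in for addCriteria(): keep only the values
// that pass every validator in the list
function narrow(array $values, array $validators)
{
    return array_values(array_filter($values, function ($value) use ($validators) {
        foreach ($validators as $validator) {
            if (!call_user_func($validator, $value)) {
                return false;
            }
        }
        return true;
    }));
}

$hrefs = array(
    '/about',
    'http://example.com/a',
    'https://www.example.org/page',
    'ftp://x',
);

$kept = narrow($hrefs, array(
    // Same idea as Zend_Validate_Regex('/^https?/i')
    function ($v) { return preg_match('/^https?/i', $v) === 1; },
    // Same idea as Zend_Validate_StringLength(15, 27)
    function ($v) { $n = strlen($v); return $n >= 15 && $n <= 27; },
));

print_r($kept);
```

Only the second URL survives: the first and last fail the scheme check, and the third is 28 characters long.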

 

Removing Duplicate Elements with unique()

Compass_Mechanize_Elements has a unique() method that allows you to specify an attribute and will filter out duplicates. This is helpful if you are going to be following links on a page and want to remove duplicates first.

    <?php
    /* Find all links on the page and ensure the href is unique */
    /* The second parameter lets unique() know these are URLs, not just strings */
    /* So it will compare absolute URLs */
    $links = $mech->find("//a")->unique('href', true);
    ?>

 

Working with Filters

Filters allow you to filter or change elements on the page. Compass_Mechanize_Elements uses the Composite design pattern to allow you to apply filters regardless of how many elements you have. Any changes you make will be applied to the DOM in case you need to echo or save the updated HTML. You can very easily write your own filters as well.

    <?php
    /* Urlencode all href attributes in links */
    $mech->getLinks()
        ->getAttribute('href', array(
            new Zend_Filter_Callback('urlencode')
        ));
    /* Replace jpg and jpeg extensions with gif in all images */
    $mech->getImages()
        ->getAttribute('src', array(
            new Zend_Filter_PregReplace('/jpe?g$/', 'gif')
        ));
    ?>

 

Removing Elements from the Page

A call to remove() works on the entire Compass_Mechanize_Elements object whether there are 1 or 100 elements.

    <?php
    /* Find and remove all Meta tags on page */
    $mech->find('/html/head/meta')->remove();
    /* Remove the base tag if it exists */
    $mech->find('/html/head/base')->remove();
    /* Remove all external links */
    $mech->getLinks()
        ->addCriteria('href', array(
            new Zend_Validate_Regex('/^https?/i'),
        ))
        ->remove();
    ?>

 

Clicking Links with followLink()

    <?php
    $mech = new Compass_Mechanize;
    $mech->get('http://www.example.com');
    $mech->followLink("//a[contains(text(), 'RFC 2606')]");
    ?>

 

Submitting Forms and Authenticating

Compass_Mechanize allows you to submit forms and handles storing and sending cookies for you (using Zend_Http_Client). This allows you to authenticate against sites that require logging in. HTTPS authentication is allowed.

submitForm() accepts an array with up to four options:

  • form: the xpath selector for the form you are working with
  • fields: an array of fields and values to submit
  • [submit]: Optional xpath selector for the submit button, evaluated within the context of the form. Defaults to //input[@type='submit']
  • [hidden]: Optional. bool true or false. Whether or not Compass_Mechanize should automatically submit any hidden inputs on your behalf. Defaults to true

    <?php
    $mech = new Compass_Mechanize;
    $mech->get('https://www.example.com');
    $mech->submitForm(array(
        'form' => "//form[@id='user_login_form']",
        'fields' => array(
            'email' => 'you@example.com',
            'pass' => 'password1'
        )
    ));
    /* We should now be logged in */
    $mech->followLink("//a[@href='/myProfile.php']");
    echo $mech;
    ?>

 

Setting Custom HTTP Headers

Sometimes you need to set custom HTTP headers for a request. A great example is a site that submits a form using AJAX - and checks for AJAX submissions. If you try to submit the form without AJAX, the site may reject your submission. Here's an example for this that shows you how to use the addHeaders() method.

    <?php
    $mech = new Compass_Mechanize;
    $mech->get('https://www.example.com');
    /* Forge an AJAX request for your form submission */
    $mech->addHeaders(array(
        'X-Requested-With' => 'XMLHttpRequest'
    ));
    /* Form will be submitted and look like an AJAX Request */
    $mech->submitForm(array(
        'form' => "//form[@id='user_login_form']",
        'fields' => array(
            'email' => 'you@example.com',
            'pass' => 'password1'
        )
    ));
    ?>

 

Utilizing the extractText() method

extractText() lets you extract a piece of text from an element on the page using a regular expression. This is useful for pulling out something that cannot be selected with XPath, like a date on the page that is not wrapped in an HTML tag.

    <?php
    $mech = new Compass_Mechanize;
    $mech->get('http://www.example.com');
    /* $pattern would hold your regex. An optional second parameter is the index of
       the array you would like returned from the matches. Defaults to 0 */
    $date = $mech->find("//div[@id='content']")->extractText($pattern);
    ?>
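
To show what the pattern and the index parameter might look like in practice, here is a small sketch built on preg_match(). The extractMatch() function, the sample text, and the date pattern are all hypothetical stand-ins; the real extractText() may behave differently on edge cases such as a failed match.

```php
<?php
// Hypothetical stand-in for extractText(): return the match at $index
// (0 = whole match, 1 = first capture group, ...) or null if nothing matched
function extractMatch($text, $pattern, $index = 0)
{
    if (preg_match($pattern, $text, $matches) && isset($matches[$index])) {
        return $matches[$index];
    }
    return null;
}

// Sample text as it might come back from getText()
$text    = 'Posted on 2010-03-15 by the editorial team';
$pattern = '/(\d{4})-(\d{2})-(\d{2})/';

echo extractMatch($text, $pattern), "\n";    // the whole date
echo extractMatch($text, $pattern, 1), "\n"; // just the year
```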

 

Working with Zend_Http_Response

After making an http request, Compass_Mechanize stores a Zend_Http_Response object that you can access via the getResponse() method. Here are a few examples to get you going. You can see more detail in the Zend documentation.

    <?php
    $mech = new Compass_Mechanize;
    $mech->get('http://www.example.com');
    /* Zend_Http_Response exposes the status code through getStatus() */
    echo 'Status Code: ' . $mech->getResponse()->getStatus() . '<br />';
    echo 'Server: ' . $mech->getResponse()->getHeader('Server') . '<br /><br />';
    ?>

 

Utilizing Delays

Delays can be helpful when you want your code to look more natural - rather than executing all of your commands so quickly. Compass_Mechanize allows you to set a minimum and maximum delay in seconds, and will randomly select a delay that falls within that range between requests. Delays are turned off by default.

    <?php
    $mech = new Compass_Mechanize;
    /* Set our delay range between 0.5 and 2.5 seconds */
    $mech->enableDelay(0.5, 2.5);
    /* Start by visiting a site and following links. Delays are now enabled */
    $mech->get('http://www.example.com');
    $mech->followLink("//a[@id='someValue']");
    $mech->followLink("//a[@href='/some_content.html']");
    ?>
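
For a sense of what a randomized delay involves, the sketch below picks a value within a range and sleeps for it. randomDelay() is a hypothetical helper of mine; how enableDelay() chooses and applies its delay internally is not shown in this article.

```php
<?php
// Hypothetical helper: pick a random delay (in seconds) between $min and $max
function randomDelay($min, $max)
{
    return $min + (mt_rand() / mt_getrandmax()) * ($max - $min);
}

$delay = randomDelay(0.5, 2.5);

// usleep() takes microseconds, so convert from fractional seconds
usleep((int) round($delay * 1000000));

printf("Slept for %.2f seconds\n", $delay);
```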

 

Moving forward() and back()

Compass_Mechanize keeps track of history items during the script's execution. This allows you to move forward or back should you need to.

    <?php
    $mech = new Compass_Mechanize;
    $mech->get('http://www.example.com');
    $mech->get('http://www.example.org');
    /* We are currently on example.org; move back one step to example.com */
    $mech->back(-1);
    ?>
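
One common way to model this kind of history is a list of URLs plus a cursor, as in a browser. The HistorySketch class below is purely illustrative and is not the package's own class; its method signatures are assumptions and do not mirror the back() call shown above.

```php
<?php
// Hypothetical history list with a cursor; not the library's actual class
class HistorySketch
{
    private $urls = array();
    private $position = -1;

    public function visit($url)
    {
        // Visiting a new URL discards any "forward" entries, as a browser does
        $this->urls = array_slice($this->urls, 0, $this->position + 1);
        $this->urls[] = $url;
        $this->position++;
    }

    public function back()
    {
        if ($this->position > 0) {
            $this->position--;
        }
        return $this->current();
    }

    public function forward()
    {
        if ($this->position < count($this->urls) - 1) {
            $this->position++;
        }
        return $this->current();
    }

    public function current()
    {
        return $this->position >= 0 ? $this->urls[$this->position] : null;
    }
}

$history = new HistorySketch();
$history->visit('http://www.example.com');
$history->visit('http://www.example.org');
echo $history->back(), "\n";    // back on example.com
echo $history->forward(), "\n"; // forward again to example.org
```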

 

Downloading Files

Although you could call get(), Compass_Mechanize provides a getFile() method that will submit a get request and return the file contents, the content type defined by the server, and the filename.

    <?php
    /* Find all links to PDF files, then remove duplicates with a call to unique() */
    $pdfs = $mech->find("//a")
        ->addCriteria('href', array(
            /* Escape the dot so it only matches a literal ".pdf" extension */
            new Zend_Validate_Regex('/\.pdf$/i')
        ))
        ->unique('href', true);
    foreach ($pdfs->getElements() as $pdf) {
        /* getFile() returns false on failure */
        /* Otherwise returns an object with filename, contents, and contentType */
        if (($file = $mech->getFile($pdf->getAttribute('href'))) !== false) {
            file_put_contents('pdfs/' . $file->filename, $file->contents);
        }
    }
    ?>

 

Powerful PHP Web Scraper

Compass_Mechanize can be used as a powerful web scraper for PHP programmers. Keep in mind, however, that PHP has a script execution time limit (usually 30-60 seconds) that restricts how much you can scrape in one run. You can get around this by changing these settings:

    <?php
    ini_set('max_execution_time', '0');
    ini_set('max_input_time', '0');
    set_time_limit(0);
    ?>

You can also execute your PHP code from the command line.

Because of how PHP works, you are limited to running one get() request at a time, which will dramatically increase the time required to scrape a website. You may want to look at PHP's Process Control (PCNTL) functions to create child processes and run multiple requests at once. Unfortunately, PCNTL functions are not enabled by default and require you to recompile PHP. They also will not run on non-Unix platforms. If you do enable them, here is an untested code sample that spawns child processes:

    <?php
    $items = array();
    $maxChildren = 3;
    $execute = 0;
    $mech = new Compass_Mechanize;
    $mech->get('http://www.example.com');
    $mech->followLink("//a[@href='something/']");
    $links = $mech->find("//p[following-sibling::h4[text()='Some Text']]/a");
    // Fork one child per link, with at most $maxChildren running at once
    foreach ($links->getElements() as $link) {
        $pid = pcntl_fork();
        if ($pid == -1) {
            die("could not fork");
        } elseif ($pid) {
            // We are the parent
            $execute++;
            if ($execute >= $maxChildren) {
                pcntl_wait($status);
                $execute--;
            }
        } else {
            // We are the child; $pattern would hold your regex
            $mech->get($link->getAttribute('href'));
            $items[] = array(
                'title' => $mech->find('//h2')->getText(),
                'body' => $mech->find("//div[@id='content']")->getText(),
                'date' => $mech->find('//body')->extractText($pattern)
            );
            // Child memory is not shared with the parent, so save $items
            // somewhere (a file, a database, etc.) before exiting
            // Exit so the child does not continue the loop and fork again
            exit(0);
        }
    }
    // Wait for any children that are still running
    while ($execute > 0) {
        pcntl_wait($status);
        $execute--;
    }
    ?>

 

The Reason for Zend Validators and Filters

Every web page is different. A solution that only allows you to use regular expressions was too limited for my needs. One of my first tests was against a site whose link hrefs went through a tracking URL such as www.thissite.com/track.php?url=http://www.theActualSite.com. The value of the url variable was what I wanted, but I also needed to urldecode() it. Using Zend Filters, I can quickly write a regex filter that removes everything up to and including ?url= and a second filter that urldecode()'s the URL. This format allows you to apply one filter or a chain of filters for ultimate flexibility. Zend_Filter_Callback is handy when you need to run a PHP function as a filter, such as urldecode(), and you can also write your own custom filters quickly and easily.
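
The two-step clean-up described above can be sketched in plain PHP, with ordinary callables playing the role of the filters. applyFilters() and stripTrackingPrefix() are hypothetical names, and the encoded sample URL is made up; with the Zend Framework loaded, Zend_Filter objects would fill the same role.

```php
<?php
// Hypothetical filter chain: each filter is a callable applied in order
function applyFilters($value, array $filters)
{
    foreach ($filters as $filter) {
        $value = call_user_func($filter, $value);
    }
    return $value;
}

// First filter: remove everything up to and including "url="
function stripTrackingPrefix($href)
{
    return preg_replace('/^.*?[?&]url=/', '', $href);
}

// Made-up tracking link with a URL-encoded destination
$href = 'http://www.thissite.com/track.php?url=http%3A%2F%2Fwww.theActualSite.com';

// Second filter: PHP's own urldecode()
$clean = applyFilters($href, array('stripTrackingPrefix', 'urldecode'));

echo $clean, "\n";
```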

Validators allow you to do the same type of thing, but are used to reduce the elements in Compass_Mechanize_Elements that do not pass your validation tests.

Finally, the Zend Framework already comes with a variety of Validators and Filters out of the box that will do most of the work for you. This allows you to write powerful scrapers without all of the code or need for regular expressions.

Some XPath Selectors

XPath is surprisingly easy, and I decided to use XPath in Compass_Mechanize because it is designed for finding specific elements in an xhtml document. Most of the time you will be able to get what you need with XPath selectors, and can add in a regular expression as needed. Here are a few basics to get you started:

/

The child of an element. For example, //p/a selects a elements that are direct children of p elements.

 

//

Allows you to select an element that is some descendant of another. I.e. /html/body//a finds links that are descendants (no matter how many levels deep) of the body tag.

 

//a[@id]

An a element that has an id attribute - regardless of the value.

 

//a[@id='login']

An a element that has an id attribute equal to login

 

//a[contains(@id, 'login')]

An a element where the id attribute contains login. contains() is a function that takes two parameters. The first parameter can be an attribute or function like text()

 

//img[starts-with(@src, '/assets/images')]

An img element whose src attribute starts with /assets/images

 

//a[@id and @href='/test']

The and lets you specify multiple criteria within the square brackets. In this case, a link that has an id attribute and href = '/test'

 

//a[text()='PHP Scraper']

Find all a elements where the anchor text equals PHP Scraper
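
If you want to experiment with these selectors outside of Compass_Mechanize, PHP's built-in DOMXPath class accepts the same expressions. The HTML snippet below is made up for illustration, and the script simply reports how many nodes each selector matches.

```php
<?php
// Made-up page used to try the selectors against
$html = '<html><body>'
      . '<p><a id="login" href="/test">Sign in</a></p>'
      . '<div><p><a href="/scraper">PHP Scraper</a></p></div>'
      . '<img src="/assets/images/logo.png" alt="Logo" />'
      . '<img src="/uploads/photo.jpg" alt="Photo" />'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$selectors = array(
    "//p/a",
    "/html/body//a",
    "//a[@id]",
    "//a[@id='login']",
    "//a[contains(@id, 'log')]",
    "//img[starts-with(@src, '/assets/images')]",
    "//a[@id and @href='/test']",
    "//a[text()='PHP Scraper']",
);

// Count the matching nodes for each selector
$counts = array();
foreach ($selectors as $selector) {
    $counts[$selector] = $xpath->query($selector)->length;
    echo $selector, ' => ', $counts[$selector], "\n";
}
```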

 

 

License

 

 

Download

The package will be available shortly for download. In the meantime, leave your comments below.

 
