Web Automation, from The PHP Cookbook -WebReference- | WebReference

PHP Cookbook: Web Automation

Chapter 11: Web Automation


Most of the time, PHP is part of a web server, sending content to browsers. Even when you run it from the command line, it usually performs a task and then prints some output. PHP can also be useful, however, playing the role of a web browser: retrieving URLs and then operating on the content. Most recipes in this chapter cover retrieving URLs and processing the results, although there are a few other tasks in here as well, such as using templates and processing server logs.

There are four ways to retrieve a remote URL in PHP. Choosing one method over another depends on your needs for simplicity, control, and portability. The four methods are to use fopen( ), fsockopen( ), the cURL extension, or the HTTP_Request class from PEAR.

Using fopen( ) is simple and convenient. We discuss it in Fetching a URL with the GET Method. The fopen( ) function automatically follows redirects, so if you use this function to retrieve the directory http://www.example.com/people and the server redirects you to http://www.example.com/people/, you'll get the contents of the directory index page, not a message telling you that the URL has moved. The fopen( ) function also works with both HTTP and FTP. The downsides of fopen( ) are that it can handle only HTTP GET requests (not HEAD or POST), that you can't send additional headers or any cookies with the request, and that you can retrieve only the response body with it, not the response headers.
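As a minimal sketch of the fopen( ) approach, the helper below reads everything from an opened resource into one string. The name fetch_body is hypothetical, and reading a remote URL this way assumes the allow_url_fopen setting is enabled in php.ini.

```php
<?php
// Hypothetical helper: read everything fopen() can open into one string.
// For http:// or ftp:// URLs this assumes allow_url_fopen is enabled.
function fetch_body($location)
{
    $fh = @fopen($location, 'r');
    if ($fh === false) {
        return false; // the URL or file could not be opened
    }
    $body = '';
    while (!feof($fh)) {
        $chunk = fread($fh, 8192);
        if ($chunk === false) {
            break; // stop on a read error instead of looping forever
        }
        $body .= $chunk;
    }
    fclose($fh);
    return $body;
}
```

A call such as fetch_body('http://www.example.com/people') would return only the page body; the status line and response headers are not part of the returned string.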

Using fsockopen( ) requires more work but gives you more flexibility. We use fsockopen( ) in Fetching a URL with the POST Method. After opening a socket with fsockopen( ), you need to print the appropriate HTTP request to that socket and then read and parse the response. This lets you add headers to the request and gives you access to all the response headers. However, you need additional code to parse the response and take any appropriate action, such as following a redirect.
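The request-and-parse cycle described above can be sketched as follows. The function names build_get_request, parse_response, and socket_get are hypothetical, and the sketch assumes HTTP/1.0 so that the server closes the connection when the response is complete.

```php
<?php
// Build the text of an HTTP/1.0 GET request. Extra headers are passed
// as an associative array, e.g. array('Cookie' => 'user=ellen').
function build_get_request($host, $path, $headers = array())
{
    $request = "GET $path HTTP/1.0\r\n";
    $request .= "Host: $host\r\n";
    foreach ($headers as $name => $value) {
        $request .= "$name: $value\r\n";
    }
    return $request . "\r\n"; // a blank line ends the header section
}

// Split a raw response into status code, headers, and body.
function parse_response($raw)
{
    $parts = explode("\r\n\r\n", $raw, 2);
    $body = isset($parts[1]) ? $parts[1] : '';
    $lines = explode("\r\n", $parts[0]);
    $status_line = array_shift($lines);
    $code = 0;
    if (preg_match('#^HTTP/\d+\.\d+ (\d{3})#', $status_line, $m)) {
        $code = (int) $m[1];
    }
    $headers = array();
    foreach ($lines as $line) {
        $pair = explode(':', $line, 2);
        if (count($pair) == 2) {
            // Header names are case-insensitive, so store them lowercased.
            $headers[strtolower(trim($pair[0]))] = trim($pair[1]);
        }
    }
    return array('code' => $code, 'headers' => $headers, 'body' => $body);
}

// Send the request over a socket and return the parsed response.
function socket_get($host, $path, $headers = array(), $port = 80)
{
    $fp = @fsockopen($host, $port, $errno, $errstr, 10);
    if ($fp === false) {
        return false;
    }
    fwrite($fp, build_get_request($host, $path, $headers));
    $raw = stream_get_contents($fp);
    fclose($fp);
    return parse_response($raw);
}
```

Nothing here follows redirects. If the returned code is 301 or 302, the calling code has to read the location header and issue a second request itself.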

If you have access to the cURL extension or PEAR's HTTP_Request class, you should use those rather than fsockopen( ). cURL supports a number of different protocols (including HTTPS, discussed in Fetching an HTTPS URL) and gives you access to response headers. We use cURL in most of the recipes in this chapter. To use cURL, you must have the cURL library installed, available at http://curl.haxx.se. Also, PHP must be built with the --with-curl configuration option.
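A minimal sketch of a cURL request that captures response headers along with the body might look like this. The helper name curl_get is hypothetical, and the code assumes the cURL extension is loaded.

```php
<?php
// Hypothetical helper: fetch a URL with cURL and return the status code,
// the raw header text, and the body as separate values.
function curl_get($url)
{
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true); // return data instead of printing it
    curl_setopt($c, CURLOPT_HEADER, true);         // include response headers in the output
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $raw = curl_exec($c);
    if ($raw === false) {
        return array('error' => curl_error($c));
    }
    // The header size covers the headers of every response in a
    // redirect chain, so 'headers' may hold more than one header block.
    $header_size = curl_getinfo($c, CURLINFO_HEADER_SIZE);
    $code = curl_getinfo($c, CURLINFO_HTTP_CODE);
    return array(
        'code'    => $code,
        'headers' => substr($raw, 0, $header_size),
        'body'    => substr($raw, $header_size),
    );
}
```

Calling curl_get('http://www.example.com/people') would return an array with code, headers, and body keys, or an array with a single error key if the transfer fails.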

PEAR's HTTP_Request class, which we use in Fetching a URL with the POST Method, Fetching a URL with Cookies, and Fetching a URL with Headers, doesn't support HTTPS, but does give you access to headers and can use any HTTP method. If this PEAR module isn't installed on your system, you can download it from http://pear.php.net/get/HTTP_Request. As long as the module's files are in your include_path, you can use it, making it a very portable solution.

Debugging the Raw HTTP Exchange helps you go behind the scenes of an HTTP request to examine the headers in a request and response. If a request you're making from a program isn't giving you the results you're looking for, examining the headers often provides clues as to what's wrong.

Once you've retrieved the contents of a web page into a program, use the recipes from Marking Up a Web Page through Removing HTML and PHP Tags to help you manipulate those page contents. Marking Up a Web Page demonstrates how to mark up certain words in a page with blocks of color. This technique is useful for highlighting search terms, for example. Extracting Links from an HTML File provides a function to find all the links in a page. This is an essential building block for a web spider or a link checker. Converting between plain ASCII and HTML is covered in Converting ASCII to HTML and Converting HTML to ASCII. Removing HTML and PHP Tags shows how to remove all HTML and PHP tags from a web page.
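To give a flavor of two of these manipulations, here is a rough sketch of link extraction and word highlighting. Both function names are hypothetical and are not the functions developed in the recipes themselves, and the regular expressions are simplifications that can miss unusual markup.

```php
<?php
// Return the href value of every <a> tag in a chunk of HTML.
function extract_links($html)
{
    preg_match_all('/<a\s[^>]*href\s*=\s*["\']([^"\']+)["\']/i', $html, $matches);
    return $matches[1];
}

// Wrap each whole-word occurrence of $word in a colored <span>.
function highlight_word($html, $word, $color = 'yellow')
{
    $pattern = '/\b(' . preg_quote($word, '/') . ')\b/i';
    $replacement = '<span style="background: ' . $color . '">$1</span>';
    // Split into tags and text so matches inside tags are left alone.
    $pieces = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($pieces as $i => $piece) {
        if ($i % 2 == 0) { // even pieces are text, odd pieces are tags
            $pieces[$i] = preg_replace($pattern, $replacement, $piece);
        }
    }
    return implode('', $pieces);
}
```

Splitting the page into tags and text before highlighting keeps a search term that appears inside an attribute, such as a URL, from being wrapped in markup. For removing tags altogether, PHP's built-in strip_tags( ) function is the usual starting point.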

Another kind of page manipulation is using a templating system. As discussed in Using Smarty Templates, templates give you the freedom to change the look and feel of your web pages without changing the PHP plumbing that populates the pages with dynamic data. Similarly, you can make changes to the code that drives the pages without affecting the look and feel. Parsing a Web Server Log File discusses a common server administration task: parsing your web server's access log files.
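As an illustration of the log-parsing task, the sketch below pulls the fields out of a single access log line. It assumes the server writes the Common Log Format; the function name parse_log_line and the array keys are hypothetical.

```php
<?php
// Parse one line of an access log in Common Log Format (an assumption;
// adjust the pattern if your server uses a different format).
function parse_log_line($line)
{
    $pattern = '/^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) ?(\S*)" (\d{3}) (\d+|-)/';
    if (!preg_match($pattern, $line, $m)) {
        return false; // the line does not look like a log entry
    }
    return array(
        'host'     => $m[1],
        'user'     => $m[3],
        'time'     => $m[4],
        'method'   => $m[5],
        'path'     => $m[6],
        'protocol' => $m[7],
        'status'   => (int) $m[8],
        'bytes'    => $m[9] === '-' ? 0 : (int) $m[9], // "-" means no body was sent
    );
}
```

Running each line of a log file through a function like this gives you arrays that are easy to count, filter, or group by status code or path.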

Two sample programs use the link extractor from Extracting Links from an HTML File. The program in Program: Finding Stale Links scans the links in a page and reports which are still valid, which have been moved, and which no longer work. The program in Program: Finding Fresh Links reports on the freshness of links. It tells you when a linked-to page was last modified and if it's been moved.
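The decisions at the heart of these two programs can be sketched in a few lines. The function names and the exact mapping of status codes to outcomes are assumptions made for illustration, not the logic of the sample programs themselves.

```php
<?php
// Hypothetical classifier for a link checker: map an HTTP status code
// to one of the three outcomes described above.
function classify_link($status_code)
{
    if ($status_code >= 200 && $status_code < 300) {
        return 'valid';
    }
    if ($status_code == 301 || $status_code == 302) {
        return 'moved';
    }
    return 'broken';
}

// Hypothetical freshness check: whole days between a Last-Modified
// header value and the timestamp $now. Returns false if the header
// cannot be parsed as a date.
function days_since_modified($last_modified, $now)
{
    $then = strtotime($last_modified);
    if ($then === false) {
        return false;
    }
    return (int) floor(($now - $then) / 86400);
}
```

A link checker would feed these functions the status code and Last-Modified header from each response, for example the values obtained with cURL or a socket-based request.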

Created: March 27, 2003
Revised: March 27, 2003

URL: http://webreference.com/programming/php/chap11/1