HTTP Compression Speeds up the Web
What is IETF Content-Encoding (or HTTP Compression)?
In a nutshell, it is a publicly defined way to compress HTTP content
being transferred from Web servers down to browsers, using nothing more than
freely available, public domain compression algorithms.
"Content-Encoding" and "Transfer-Encoding" are both clearly defined in the
public IETF Internet RFCs that govern the development and improvement of
HTTP, the 'language' of the World Wide Web. "Content-Encoding" applies to
methods of encoding and/or compression that have been applied to
documents before they are requested. This is also known as "pre-compressing
pages." The concept never really caught on because of the complex file
maintenance burden it imposes, and few Internet sites use
pre-compressed pages of any description. "Transfer-Encoding" applies to methods
of encoding and/or compression used during the actual transmission of the data
itself.
In modern practice, however, the two are now one and the
same. Since most HTTP content from major online sites is now dynamically
generated, the line has blurred between what is happening before a document is
requested and while it is being transmitted. Essentially, a dynamically
generated HTML page doesn't even exist until someone asks for it. The original concept of all pages being
"static" and already present on the disk has quickly become an 'older' concept
and the originally well-defined separation between "Content-Encoding"
and "Transfer-Encoding" has simply turned into a rather pale shade of
gray. Unfortunately, support among modern Web and proxy servers for
"Transfer-Encoding" in the form of compression is even rarer than the
spotty support for "Content-Encoding."
Suffice it to say that, regardless of the two different publicly defined
encoding specifications, if the goal is to compress the requested content
(static or dynamic) it really doesn't matter which of the two
encoding methods is used: the result is the same. The user receives
far fewer bytes than normal and everything happens much faster on the client side. The publicly defined exchange
goes like this:
- A browser that is capable of receiving compressed
content indicates this in all of its requests for documents by supplying the
following request header field:
- "Accept-Encoding: " followed by a comma-separated list of encoding names, including (hopefully)
gzip. Other encodings exist, such as "deflate" and "compress," but only gzip is supported by most modern browsers. Some very new browsers even let the user configure which HTTP headers to send: Opera 6 lets you explicitly set the HTTP level, and Mozilla 0.9.9 lets you set the "Accept-Encoding" string (which may be problematic, since Mozilla does not understand every encoding scheme a user might list).
- When the Web server sees that request field, it
knows that the browser is able to receive compressed data in one of two
formats, either standard GZIP or the UNIX "compress" format. It is up to the
server to compress the response data using one of these methods (if it
is capable of doing so).
- If a compressed static version of the requested
document that matches one of the formats the browser says it can handle
is found on the Web server's hard drive, the server can simply
send the pre-compressed version instead of the much larger uncompressed original.
- If no static document matching any of the compressed formats the browser
says it can "Accept" is found on the disk, the
server can either just send the original uncompressed version of
the document or attempt to compress it in "real time" and send the
newly compressed and much smaller version back to the browser.
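The exchange above can be sketched in a few lines of Python. This is a minimal illustration, not the behaviour of any particular server: the function names `parse_accept_encoding` and `build_response`, the plain-dict header representation, and the optional `precompressed` argument are all assumptions made for the example, and only gzip is handled.

```python
import gzip


def parse_accept_encoding(header_value):
    """Return the lowercase encoding names a client listed in Accept-Encoding.

    Encodings explicitly refused with a quality value of zero (q=0) are left out.
    """
    names = set()
    for item in header_value.split(","):
        token, _, params = item.strip().partition(";")
        token = token.strip().lower()
        if not token:
            continue
        quality = 1.0
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat an unparsable q-value as a refusal
        if quality > 0:
            names.add(token)
    return names


def build_response(request_headers, body, precompressed=None):
    """Pick between the original body and a gzip version of it.

    request_headers: dict keyed by exact header name (an assumption of this sketch).
    body:            the uncompressed document, as bytes.
    precompressed:   optional bytes of an already stored gzip copy of the document.
    Returns (response_headers, payload_bytes).
    """
    accepted = parse_accept_encoding(request_headers.get("Accept-Encoding", ""))
    if "gzip" not in accepted:
        # The browser did not ask for gzip: send the original document.
        return {"Content-Length": str(len(body))}, body
    # Prefer a stored pre-compressed copy; otherwise compress in "real time".
    payload = precompressed if precompressed is not None else gzip.compress(body)
    headers = {"Content-Encoding": "gzip", "Content-Length": str(len(payload))}
    return headers, payload
```

Calling `build_response({"Accept-Encoding": "gzip, deflate"}, page)` returns a gzip-compressed body together with a `Content-Encoding: gzip` header, while a request that lacks the field gets the original bytes back unchanged.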
Most popular Web servers are still unable to perform that final step, real-time compression:
- The Apache Web Server, which has over 50% of the Web
server market, is still incapable of providing any real-time compression of
requested documents, even though all modern browsers have been requesting compressed documents,
and have been capable of receiving them, for more than two years.
- Microsoft's Internet Information Server is nearly as deficient. If it finds a pre-compressed version of a requested document it
may send it, but it has no real-time compression capability.
IIS 5.0 uses an ISAPI filter to support GZIP compression. It works as follows: when a user requests a page, the server sends the page and then stores a compressed copy of it in a temporary folder.
The next time a user requests the page, the server sends the copy stored in the temporary
folder.
It then tries to keep the copies in the temporary
folder current: if a stored copy is out of date, it fetches the current page and
compresses it again.
- IBM's WebSphere Server has some limited support for
real-time compression, but the feature has "appeared" and "disappeared" across various release
versions of WebSphere.
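The store-then-reuse behaviour described for IIS 5.0 above can be sketched as a small file cache. This is only one reading of that description, not Microsoft's implementation: the class name `CompressedFileCache`, the hashed cache file names, and the use of file modification times to decide whether a stored copy is current are all assumptions of this example.

```python
import gzip
import hashlib
import os


class CompressedFileCache:
    """Serve a file uncompressed the first time, then from a stored gzip copy."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir  # stands in for the temporary folder

    def _cache_path(self, source_path):
        # Hypothetical naming scheme: hash of the absolute source path.
        digest = hashlib.md5(os.path.abspath(source_path).encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest + ".gz")

    def serve(self, source_path):
        """Return (body_bytes, content_encoding); content_encoding is None or "gzip"."""
        cached = self._cache_path(source_path)
        is_current = (
            os.path.exists(cached)
            and os.path.getmtime(cached) >= os.path.getmtime(source_path)
        )
        if is_current:
            # A current compressed copy exists: send it.
            with open(cached, "rb") as f:
                return f.read(), "gzip"
        # No copy yet, or the copy is stale: send the original and
        # store a fresh compressed copy for the next request.
        with open(source_path, "rb") as f:
            original = f.read()
        with open(cached, "wb") as f:
            f.write(gzip.compress(original))
        return original, None
```

The first call to `serve` for a given file returns the uncompressed bytes, and later calls return the stored gzip copy until the source file is modified again.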
The original designers of the HTTP
protocol did not foresee the current reality, with so many people using
the protocol that every single byte counts. The heavy use of
pre-compressed graphics formats such as GIF, and the difficulty of
further reducing graphics content, make it even more important that all other
exchange formats be optimized as much as possible. The same designers also did not foresee that most HTTP content
from major online vendors would be generated dynamically, so there is
no real chance of there ever being a "static" compressed version of the
requested document(s). It is possible, however, to cache even dynamic content as long as you know something about it, for example that it cannot change in real time but only on certain occasions. Public IETF Content-Encoding is still not a "complete"
specification for the reduction of Internet content, but it does work, and the
performance benefits achieved by using it are both obvious and dramatic.
What is GZIP?
It's a lossless compressed data format. The deflation algorithm used by GZIP (and also by zip and zlib)
is an open-source, patent-free variation of LZ77 (Lempel-Ziv 1977). It finds
duplicated strings in the input data. The second occurrence of a string is replaced by a pointer to the
previous string, in the form of a pair (distance, length). Distances are limited to 32K bytes, and
lengths are limited to 258 bytes. When a string does not occur anywhere in the
previous 32K bytes, it is emitted as a sequence of literal bytes. (In this description, "string" must be taken
as an arbitrary sequence of bytes, and is not restricted to printable
characters.)
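The string-matching step described above can be illustrated with a toy tokenizer. This is a deliberately slow, greedy sketch and not the real deflate implementation: the names `lz77_tokens` and `lz77_expand` are made up for the example, the minimum match length of 3 bytes is deflate's usual lower bound rather than something stated above, and the Huffman coding that real deflate applies to the resulting tokens is omitted.

```python
WINDOW = 32 * 1024  # distances are limited to 32K bytes
MAX_LEN = 258       # lengths are limited to 258 bytes
MIN_LEN = 3         # shorter repeats are emitted as literals (assumption of this sketch)


def lz77_tokens(data):
    """Turn bytes into a list of literals (ints) and (distance, length) pairs."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force search of the previous WINDOW bytes for the longest match.
        for j in range(max(0, i - WINDOW), i):
            length = 0
            while (length < MAX_LEN
                   and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= MIN_LEN:
            tokens.append((best_dist, best_len))  # pointer to the earlier string
            i += best_len
        else:
            tokens.append(data[i])                # literal byte
            i += 1
    return tokens


def lz77_expand(tokens):
    """Rebuild the original bytes from the token list."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])  # byte-by-byte so overlapping copies work
        else:
            out.append(token)
    return bytes(out)
```

For the input `b"abcabcabc"` the sketch produces three literal bytes followed by the single pair `(3, 6)`: go back 3 bytes and copy 6. Repetitive markup such as HTML table rows collapses into a handful of such pairs, which is why text content compresses so well.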
What about Benchmarking Software?
Most standard benchmarking tools are not fully HTTP 1.1 compliant, and almost none of them are
capable of handling IETF Content-Encoding. If you use a standard HTTP
benchmarking program that does not include the "Accept-Encoding:" header with at least the gzip operand, then the server will not (as per RFC
standards) actually send any compressed data. Some benchmarking programs do
not supply the "Accept-Encoding:" request field by default but do allow you to
add it yourself via a command line parameter or special configuration file.
Check the documentation for the benchmarking program itself. Everything will
still work without the "Accept-Encoding:" field in the request, but the
benchmark won't tell you much since it won't actually be receiving anything
compressed. If you need a benchmarking or testing tool to measure the compression
performance on your system and you don't have one that is capable of doing
so, contact Hyperspace
Communications Inc. They have developed custom versions of just
about all major load generating and HTTP benchmarking tools that are capable of
requesting and receiving standard IETF Content-Encoding(s).
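To see why the request field matters when measuring, the following self-contained sketch fetches the same page twice from a throwaway local server, once without and once with "Accept-Encoding: gzip", and reports the bytes received each time. Everything here is an assumption for illustration: `GzipDemoHandler` is a stand-in test server (not Apache or mod_gzip), `bytes_received` and `run_demo` are hypothetical helper names, and the server's check for the substring "gzip" is a simplification of real header parsing.

```python
import gzip
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body>" + b"<p>The quick brown fox.</p>" * 500 + b"</body></html>"


class GzipDemoHandler(BaseHTTPRequestHandler):
    """Stand-in server: compresses the page only when the client asks for gzip."""

    def do_GET(self):
        accepts = self.headers.get("Accept-Encoding", "").lower()
        body = PAGE
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        if "gzip" in accepts:
            body = gzip.compress(PAGE)
            self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def bytes_received(host, port, path="/", accept_encoding=None):
    """Return (Content-Encoding header, body bytes on the wire) for one GET."""
    headers = {"Accept-Encoding": accept_encoding} if accept_encoding else {}
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path, headers=headers)
        response = conn.getresponse()
        # http.client does not decompress, so this is the transferred size.
        return response.getheader("Content-Encoding"), len(response.read())
    finally:
        conn.close()


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), GzipDemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        plain = bytes_received(host, port)
        packed = bytes_received(host, port, accept_encoding="gzip")
    finally:
        server.shutdown()
        server.server_close()
    return plain, packed
```

`run_demo()` returns two pairs: the first request comes back with no Content-Encoding and the full page size, the second with "gzip" and a much smaller body. A benchmarking tool that never sends the field is effectively always measuring the first case.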
Download the Free Apache mod_gzip Module
You can try HTTP compression on your site with Hyperspace Communications' Apache gzip module! mod_gzip was originally authored by a company named Remote Communications, Inc. (RCI). RCI was purchased by HyperSpace Communications Inc. (HCI), and HCI is responsible for maintaining the websites.
Contact HCI for more details about mod_gzip. Remote Communications released the code into the public domain; it was the first module for the Apache Web Server to accelerate/compress data on the fly. It is available for Windows, Linux, and Solaris, with full source code included. The current version compresses dynamic output (from PHP, CGI, Perl, SSIs, EXE files, etc.).
Results
Using this module, webmasters typically see a 150-160% increase in Web server performance and a 70-80% reduction in the bandwidth used for HTML/XML/JavaScript. Overall, the bandwidth savings are approximately 30-60% once the transmission of graphics is factored in. Here's a test run by Remote Communications using their modified ApacheBench.
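The relationship between the text-only and overall percentages quoted above can be checked with simple arithmetic. The following is not the Remote Communications test; it is a minimal sketch that assumes graphics are sent unchanged and that text makes up a given share of the bytes served. The function name `overall_savings_pct` and the example shares of 50% and 75% are hypothetical values chosen for illustration, not measurements.

```python
def overall_savings_pct(text_share_pct, text_reduction_pct):
    """Percentage of total bytes saved when only the text share is compressed.

    text_share_pct:     share of all bytes served that is HTML/XML/JavaScript.
    text_reduction_pct: how much compression shrinks that text share.
    Graphics and other already-compressed content are assumed to save nothing.
    """
    return text_share_pct * text_reduction_pct / 100


# Hypothetical site where half the bytes are text, compressed by 70%.
low = overall_savings_pct(50, 70)    # 35.0

# Hypothetical site where three quarters of the bytes are text, compressed by 80%.
high = overall_savings_pct(75, 80)   # 60.0
```

Under these assumed mixes the overall savings land at 35% and 60%, which is consistent with the 30-60% range once graphics are taken into account.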