HTTP for HTML Authors, Part II - HTML with Style | 3 | WebReference



HTTP for HTML Authors, Part II

Using Content-Type to specify character encoding

Web developers, being the lazy opportunists that we are, rarely care about the Content-Type header and generally assume that the server knows how to set it correctly. However, there is one case when configuring your server to send out a different value for the Content-Type header is critical, and that is when you're using a character encoding other than ISO-8859-1 to transport your documents over the Net.

I introduced you to character sets and character encodings back in Tutorial 17, and you should have a good understanding of them by now if you've written Web pages in anything except Western European languages.

You see, in addition to the type and subtype, MIME media types can have a number of optional parameters as well. In the case of the text/html media type, there is only one parameter that you might use, and it's called charset.

As you may well have gathered by now, the people who have taken it upon themselves to design and implement the technologies we use on the Web have a penchant for cruelty that makes the Marquis de Sade resemble the Easter Bunny on Valium. Thus, the designers of HTML decided that the charset parameter designates not, as you might have imagined, the character set of the document, but its character encoding. (As you may recall, the character set of HTML 4.0 documents is always supposed to be UCS, even though some versions of certain browsers, which shall remain unnamed, have differing opinions.)

So, let's assume that I want to write a Web page that contains some text in Greek, and decide to encode it with the most commonly used encoding for this language, ISO-8859-7. When I send these documents to my readers through the magic of HTTP, I want them, or rather their browsers, to know this so that my alphas will be alphas and not left-facing accented squiggly flurbs or whatever else happens to be the imaginatively named equivalent of an alpha in ISO-8859-1. To achieve this, I need to set up my Web server to send a Content-Type header that looks like the following:

Content-Type: text/html;charset=iso-8859-7

Those of you with a keen eye for glaringly obvious facts will have gathered by now that you have to replace iso-8859-7 in the above line with the name of the encoding you're using, if it's not ISO-8859-7. You can get the canonical list of character sets (in this case used as encodings) from IANA.
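To make the alpha-versus-squiggle problem concrete, here is a minimal Python sketch of what a browser does with that header. The function name parse_content_type is my own invention for illustration; the parsing itself leans on the standard library's email.message.Message class, which understands MIME parameters. The sketch assumes a one-byte document containing only the ISO-8859-7 byte for a lowercase alpha.

```python
import unicodedata
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type value into (media type, charset or None).

    Hypothetical helper; it only wraps the standard library's MIME parser.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


# The single byte that ISO-8859-7 uses for a lowercase alpha.
raw = b"\xe1"

media_type, charset = parse_content_type("text/html;charset=iso-8859-7")

# Decode once with the encoding the server declared, and once with
# ISO-8859-1, the encoding a browser falls back on when told nothing.
as_declared = raw.decode(charset)
as_fallback = raw.decode("iso-8859-1")

print(media_type, charset)
print(unicodedata.name(as_declared))
print(unicodedata.name(as_fallback))
```

The same byte comes out as a Greek small letter alpha when the declared encoding is honored, and as a Latin small letter a with an acute accent when it is not, which is exactly the sort of squiggly flurb the header is there to prevent.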

How you achieve this depends on your Web server; consult the documentation that came with it or ask your hosting provider. Setting the charset parameter is all-important when producing properly internationalized pages, as it is the only way to guarantee that a browser that understands your chosen encoding will display your document correctly.

As mentioned in Tutorial 17, there are other ways to hint at the character encoding, but none of them work as well or are as widely supported as sending the right Content-Type HTTP header before the document.

Most Web servers decide which media type to assign to a document based on its filename extension (the .html bit in a file called index.html is the extension). If this is the case with your server, you have two options. You can configure your server to send the modified media type, including the charset parameter, for all files ending in .html; this is the ideal solution if you want to use this encoding for all of your HTML files. Alternatively, if you want a mix of encodings, you could store your ISO-8859-7-encoded documents in files ending in, for instance, .html-el and configure your Web server to send the modified Content-Type header for those files only.
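The extension-based lookup can be sketched in a few lines of Python. This is not how any particular Web server is implemented; the table CONTENT_TYPES, the function content_type_for, and the fallback type are all assumptions made up for this example, with .html-el borrowed from the scenario above.

```python
import os.path

# Hypothetical extension-to-header table for a site with mixed encodings:
# plain .html files stay in ISO-8859-1, .html-el files are Greek.
CONTENT_TYPES = {
    ".html": "text/html;charset=iso-8859-1",
    ".html-el": "text/html;charset=iso-8859-7",
}


def content_type_for(filename, default="application/octet-stream"):
    """Return the Content-Type value to send for a file, based on its extension."""
    # splitext splits at the last dot, so "index.html-el" yields ".html-el".
    _, extension = os.path.splitext(filename)
    return CONTENT_TYPES.get(extension.lower(), default)


print(content_type_for("index.html"))
print(content_type_for("index.html-el"))
```

On a real server you would express the same mapping in its configuration rather than in code, but the principle is identical: one extension, one complete Content-Type value, charset parameter included.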

Once again, the details of this procedure depend on your particular setup; Macintoshes mostly ignore filename extensions and rely on resource forks instead for identifying file types, while some Web servers can even look inside a file and figure out its media type by examining the content. Obviously, if your documents are generated on the fly by something like a CGI program or a Java Servlet, you'll have to write this program so that it sends out the correct Content-Type header.
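For the generated-on-the-fly case, here is a rough sketch of a CGI-style program in Python that announces its own encoding. The function build_cgi_response and the sample page are hypothetical. The one thing that matters is that the charset named in the header is the same one used to encode the body, and that a blank line separates the two.

```python
import sys


def build_cgi_response(html_text, charset="iso-8859-7"):
    """Return the bytes of a CGI response: header, blank line, encoded body."""
    header = "Content-Type: text/html;charset={}\r\n\r\n".format(charset)
    # Header names and values are plain ASCII; only the body uses the
    # declared encoding.
    return header.encode("ascii") + html_text.encode(charset)


if __name__ == "__main__":
    # Sample page containing the Greek letters alpha, beta and gamma.
    page = "<html><body><p>\u03b1\u03b2\u03b3</p></body></html>"
    # Write raw bytes so Python does not re-encode the output behind our back.
    sys.stdout.buffer.write(build_cgi_response(page))
```

Writing to sys.stdout.buffer rather than calling print is a deliberate choice here: print would encode the text with whatever the local environment prefers, which may well disagree with what the header promises.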




Produced by Stephanos Piperoglou
Created: January 24, 2001
Revised: February 27, 2001