HTML Unleashed. SGML and the HTML DTD: SGML Declaration for HTML 4.0 | WebReference


HTML Unleashed: SGML and the HTML DTD

SGML Declaration for HTML 4.0


The SGML declaration is a formal construct that specifies general information about an SGML application and its associated document type.  The following sections list and analyze the SGML declaration for HTML 4.0 provided by the W3C.

The SGML declaration is contained in the SGML statement, which has the following syntax:

<!SGML  "ISO 8879:1986" ...  >

The ellipsis here represents the body of the SGML declaration, and the string ISO 8879:1986 denotes the version of the SGML standard that this declaration conforms to.  In our case, this is the original ISO specification published in 1986.

In the body of the declaration, first comes the comment part:

     SGML Declaration for HyperText Markup Language
     version 4.0
     With support for Unicode UCS-2 and increased limits
     for tag and literal lengths etc.

The rest of the declaration body is divided into sections that are described next.




CHARSET Section


The CHARSET section of the SGML declaration specifies the character set to be used by documents conforming to this document type.  So what is a character set?

You probably know that the characters that appear on your display are coded inside the computer by bit combinations, usually bytes consisting of eight bits.  Unfortunately, different computers and operating systems sometimes use the same bytes to represent different characters on the screen.  The most frequent reason for this is that localized versions of programs and operating systems need to represent the non-Latin characters of a particular language's alphabet (such as the Cyrillic alphabet for Russian).

Thus, to make an SGML document as unambiguous as possible, the SGML declaration defines exactly what character set it uses, that is, what bit combinations (codes) are allowed within a conforming document and what characters they are intended to mean.  To define a character set, you need to specify three things: first, the set of codes used; second, the set of characters represented; and third, the mapping between these two sets.

The set of codes is easy to specify by simply listing the codes in decimal or hexadecimal form.  The set of characters, or character repertoire, is trickier.  You cannot simply "draw" a character in the specification because the SGML declaration itself is a plain text file in which every character is coded by a bit combination that is not guaranteed to mean the same thing on all systems.  One possible way to overcome this difficulty is to give a textual description for every character in the repertoire (for example, "CYRILLIC CAPITAL LETTER A").

However, the creators of SGML chose a less complicated way of dealing with the problem.  The SGML declaration makes use of other character set standards that have already been adopted by standard-setting bodies (mostly ISO) and that can provide a full specification of nearly any character in the world.  Having made a reference to such a standard, you can then use character numbers in that standard to identify exactly which characters you need for your document's character set.  Here's how this is done.

First comes the CHARSET keyword, which marks the beginning of the corresponding section.  It is followed by the BASESET keyword, which introduces the name of the character set standard being referred to:

    BASESET "ISO 646:1983//CHARSET
             International Reference Version
             (IRV)//ESC 2/5 4/0"

The standard specified here, commonly referred to as "ISO 646," is practically indistinguishable from what is called "7-bit ASCII."  Its 128 characters cover all Latin alphabet characters, digits, punctuation, and some special characters.  It is the greatest common subset for nearly all character sets in use now, and you're unlikely to find a computer or a program (even a localized version) that uses something other than ISO 646 for its first 128 byte codes.

However, the SGML declaration for HTML 4.0 does not use all of this character set, only a certain part of it.  The selection is made with the DESCSET keyword:

    DESCSET  0   9   UNUSED
             9   2   9
             11  2   UNUSED
             13  1   13
             14  18  UNUSED
             32  95  32
             127 1   UNUSED

Here, the target HTML character set that we need to define is divided into subranges, with a clear identification of where characters in each subrange come from.  The first number in each line specifies the starting code of the subrange; the second, its length; and the third position is occupied either by a number that identifies the code to start copying characters from the reference standard, or the UNUSED keyword, which means that the characters in this subrange are not allowed.

Thus, the first line in the preceding code means that the codes in the range 0-8 inclusive (decimal) cannot be used within documents conforming to the HTML 4.0 specification.  The second line says that, starting from code 9 onward, we borrow 2 characters that are coded 9 and 10 in the ISO 646 standard (in other words, within this two-character range our character set is identical to ISO 646).  The next two characters are again unused, then we take one character with code 13, skip 18 more characters, and so on.
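The subrange arithmetic is easy to check with a few lines of Python.  This is only an illustrative sketch: the function name expand_descset and the tuple layout are our own assumptions, not part of SGML or of any parser, and None stands in for the UNUSED keyword.

```python
# Illustrative sketch: expand DESCSET triples into a map from
# document character code to base-set code.  None means UNUSED.
# All names here are hypothetical, chosen for this example only.

DESCSET_ISO646 = [
    (0, 9, None),     # codes 0-8: unused
    (9, 2, 9),        # codes 9-10: ISO 646 codes 9-10
    (11, 2, None),    # codes 11-12: unused
    (13, 1, 13),      # code 13: ISO 646 code 13
    (14, 18, None),   # codes 14-31: unused
    (32, 95, 32),     # codes 32-126: ISO 646 codes 32-126
    (127, 1, None),   # code 127: unused
]


def expand_descset(triples):
    """Return {document code: base-set code, or None if unused}."""
    mapping = {}
    for start, length, base in triples:
        for offset in range(length):
            mapping[start + offset] = None if base is None else base + offset
    return mapping


mapping = expand_descset(DESCSET_ISO646)
allowed = sorted(code for code, base in mapping.items() if base is not None)
```

The seven lengths add up to 128, so every code from 0 to 127 is accounted for, and the allowed list contains only tab, line feed, carriage return, and the printable range 32-126.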

So, we have defined the first 128 characters of the HTML 4.0 character set.  However, to specify the remainder of the code table, we have to refer to another standard.  The syntax of the CHARSET section allows the specification of as many external standards as needed and the borrowing of characters from each of them (that is, to have as many BASESET/DESCSET pairs as necessary).

    BASESET  "ISO Registration Number 176//CHARSET
              ISO/IEC 10646-1:1993 UCS-2 with
              implementation level 3//ESC 2/5 2/15 4/5"
    DESCSET  128 32    UNUSED
             160 65375 160

The previous version of HTML, 3.2, referred to the standard named "ISO 8859-1" or "ISO Latin-1" to define the characters beyond 7-bit ASCII.  ISO Latin-1 uses 8-bit codes and therefore accommodates a total of 256 characters (coded 0-255 inclusive), with the range 160-255 containing letters with diacritical marks used in different European languages as well as some special symbols (registered trademark, copyright, fractions, and so on).  The first 128 characters of Latin-1 are identical to those of 7-bit ASCII.

However, the need for better support of languages other than English and the Western European languages led to the development of a set of provisions commonly referred to as "HTML Internationalization," initially described in RFC 2070 and then incorporated into HTML 4.0.  One of the key features of internationalized HTML is the extended character set that makes use of the Unicode coding standard.  Unicode uses 16-bit (two-byte) codes and therefore covers as many as 65,536 characters, including nearly all national alphabets of the world and hordes of special symbols.

More precisely, HTML 4.0 refers to the ISO standard named "ISO/IEC 10646-1:1993" or simply "ISO 10646," which is a superset of Unicode and generally uses four-byte codes.  However, the UCS-2 in the BASESET statement above identifies a special mode of ISO 10646 that uses two-byte codes and is in effect indistinguishable from Unicode.  All of these coding standards and related issues are covered in much more detail in Chapter 39, "Internationalizing HTML."

One question that you may have by now, however, needs to be answered immediately.  Does the SGML declaration imply that with HTML 4.0, you have to use Unicode for your documents? No, because the document character set we're defining is different from the external character encoding that the document is in when it is created, stored, and served over the network.  For the external character encoding, you may use any character set standard that is best suited for the document's content.  In practice, the only area affected by the document character set as per the SGML declaration is numeric character references such as &#160;, which in HTML 4.0 must point to Unicode code positions.  Again, for more details on these issues refer to Chapter 39.
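The distinction can be illustrated with Python's standard library.  This is a sketch, not a description of how any browser works: it uses html.unescape to resolve numeric references, and the two Cyrillic encodings (KOI8-R and Windows-1251) are simply examples we picked.

```python
import html

# A numeric character reference always names a code position in the
# document character set: 160 is NO-BREAK SPACE, 1040 is CYRILLIC
# CAPITAL LETTER A, whatever encoding the file itself is stored in.
nbsp = html.unescape("&#160;")
cyr_a = html.unescape("&#1040;")

# The same character is stored as different bytes under different
# external character encodings.
koi8_bytes = cyr_a.encode("koi8_r")
cp1251_bytes = cyr_a.encode("cp1251")

# Decoding with the right encoding recovers the same character.
same_char = koi8_bytes.decode("koi8_r") == cp1251_bytes.decode("cp1251")
```

The bytes on disk differ between the two encodings, yet &#1040; means the same letter in both files, because the reference is resolved against the document character set and not against the external encoding.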

Unicode itself is a superset of Latin-1, as the first 256 characters of Unicode are identical to those of Latin-1.  Also, the latter is likely to remain for a long time the most popular choice for the external character encoding of HTML documents.  In a separate table, we list the first 256 characters of the HTML document character set as specified by the SGML declaration for HTML 4.0.
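If you want to convince yourself of the Latin-1 claim, a short Python check will do.  It is a sketch that relies on Python's built-in "iso-8859-1" codec; the variable names are ours.

```python
# Decode every possible byte value as ISO 8859-1 (Latin-1) and compare
# each resulting character's Unicode code position with the byte value.
latin1_text = bytes(range(256)).decode("iso-8859-1")
identical = all(ord(ch) == code for code, ch in enumerate(latin1_text))
```

Every byte value maps to the Unicode code position with the same number, which is exactly what "superset" means here.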




CAPACITY Section


This section is meant to provide a rough estimate of the system resources (more specifically, different types of memory) that an SGML parser will need to allocate in order to process the DTD.  This information is not very reliable, however, because memory usage depends largely on the internal architecture of the parsing application.  Most SGML parsers do not take these values into account, and the creators of HTML simply assigned large enough numbers to these parameters to ensure that processing the DTD won't be aborted because one of the CAPACITY values has been exceeded.  The CAPACITY parameters are not discussed individually here; you can refer to the SGML specification for details.

    CAPACITY   SGMLREF
               TOTALCAP        150000
               GRPCAP          150000
               ENTCAP          150000

The SGMLREF keyword means that all CAPACITY types that are not indicated here should take their default values from the SGML reference concrete syntax.  (See the next section for more on this.)


SYNTAX Section


The next major section of the SGML declaration is introduced by the SYNTAX keyword.  It defines various syntax features of the SGML application, such as naming rules, delimiter and control characters, reserved names, and limits used by the DTD and conforming SGML documents.  This syntax is called the "application concrete syntax," as opposed to the "reference concrete syntax" of SGML itself, which is used in the SGML declaration (but not the DTD, as specified by the SCOPE parameter).  As you'll see shortly, in the case of HTML, the differences between these syntaxes are minimal.


SCOPE Declaration


Immediately before the SYNTAX section comes the SCOPE DOCUMENT declaration:

    SCOPE    DOCUMENT


Its sole purpose is to specify that the application concrete syntax to be declared will be used not only by the conforming SGML documents but also by the DTD of this SGML application.


Shunned Characters Declaration


The SYNTAX section starts with the list of shunned characters' codes preceded by the SHUNCHAR keyword:

SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11
        12 13 14 15 16 17 18 19 20 21 22 23
        24 25 26 27 28 29 30 31 127

"Shunned" doesn't mean "prohibited," and the list of shunned character codes doesn't fully coincide with the UNUSED codes in the character set declaration.  In fact, some of the shunned characters (for example, the carriage return and line feed characters) are outright necessary in any text file, SGML documents being no exception.

However, these characters should be used with care, as their meaning and usage may depend on the computer environment in which the text is processed (for example, although a text line in MS-DOS and Windows is terminated by a carriage return and line feed pair, UNIX systems use a single line feed).  The keyword CONTROLS means that if a particular computer system uses some other characters as control codes (and not displayable characters), these should be added to the SHUNCHAR list.
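To see which shunned characters a piece of text actually contains, a checker along the following lines could be used.  It is a sketch under our own assumptions: the names SHUNNED and find_shunned are hypothetical, and the CONTROLS keyword (system-specific control codes) is not modeled.

```python
# Codes listed after SHUNCHAR in the declaration: 0-31 and 127.
# System-specific codes implied by CONTROLS are not modeled here.
SHUNNED = set(range(32)) | {127}


def find_shunned(text):
    """Return (position, code) pairs for every shunned character in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if ord(ch) in SHUNNED]


crlf_line = "Hello\r\n"   # line ended by carriage return + line feed
lf_line = "Hello\n"       # line ended by a single line feed

crlf_hits = find_shunned(crlf_line)
lf_hits = find_shunned(lf_line)
```

Both lines are perfectly legal; the checker merely flags the record delimiters (codes 13 and 10) as characters whose handling depends on the environment.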


Syntax Character Set Declaration


Next comes what may be considered a duplicate of the CHARSET section: a BASESET/DESCSET pair defining a character set (see "CHARSET Section" above):

    BASESET  "ISO 646:1983//CHARSET
              International Reference Version
              (IRV)//ESC 2/5 4/0"
    DESCSET  0 128 0

What is the purpose of this additional definition?

The character set defined in the SYNTAX section is used only within that section and nowhere else.  This reminds us once again of the fact that any text document, SGML declaration included, is actually nothing but a sequence of codes, and to get to the meaning we need to know which character corresponds to each code.  Having provided a separate character set declaration within the SYNTAX section, we can ensure that the syntax definition is completely independent of the document character set (defined in the CHARSET section).  In other words, we won't have to rewrite the SYNTAX section when the content of CHARSET section is changed.


Function Characters Declaration


The FUNCTION keyword is used to identify the character codes for so-called function characters:

    FUNCTION
            RE            13
            RS            10
            SPACE         32
            TAB SEPCHAR    9

Function characters are special characters that may affect syntax.  All function characters defined here are separators whose role is identical to that of white space.  The RE and RS identifiers simply denote the carriage return and line feed characters; they are short for Record End and Record Start, respectively.  (In SGML, a line in a text file is sometimes termed a record, similar in a way to a database record.) TAB (the tabulation character) is not recognized as a separator by the SGML standard, which is why it is accompanied by the additional classifier SEPCHAR.


Naming Rules Declaration


Next comes the NAMING declaration, which regulates which characters may be used in element and entity names and as name start characters:

    NAMING  LCNMSTRT ""
            UCNMSTRT ""
            LCNMCHAR ".-"  -- ?include "~/_" for URLs? --
            UCNMCHAR ".-"

To facilitate recognition of a name by the parser, the repertoire of characters allowed in the first position of a name is limited as compared to the rest of the name.  The SGML standard itself allows only Latin letters as name start characters and Latin letters plus digits as ordinary name characters, so here we only need to specify additions to these sets.  The characters are specified by using strings in quotes (called literals), and separate parameters are provided for indicating uppercase and lowercase character versions in each class.

Thus, the preceding lines tell us that in HTML, only Latin letters are allowed as name start characters (the corresponding parameter strings are empty), while the repertoire of ordinary name characters is extended by the period and the hyphen.  These characters are caseless and thus are shown the same in both the LCNMCHAR (LowerCase NaMe CHARacters) and UCNMCHAR (UpperCase NaMe CHARacters) parameters.
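These naming rules translate directly into a regular expression.  The following Python sketch is our own illustration (NAME_RE and is_valid_name are hypothetical names), and it ignores the length limit that NAMELEN imposes.

```python
import re

# Name start character: a Latin letter only (the name-start parameters
# add nothing).  Further name characters: Latin letters, digits, and the
# "." and "-" added by LCNMCHAR/UCNMCHAR.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")


def is_valid_name(name):
    """Check a candidate name against the naming rules described above."""
    return NAME_RE.fullmatch(name) is not None


checks = {name: is_valid_name(name)
          for name in ("H1", "http-equiv", "1st", "_top", "")}
```

Names such as H1 and http-equiv pass, while 1st and _top fail because a digit or an underscore cannot start a name (and the underscore is not a name character at all).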

            NAMECASE GENERAL YES
                     ENTITY  NO

The NAMECASE declaration governs case sensitivity of the SGML application concrete syntax.  It is further subdivided into ENTITY, which applies to entity names only (for more on entities, see "Entities" below), and GENERAL, which covers all the rest, including element names.  Here's the answer to the question of why <img> and <IMG> are treated the same in HTML while &eacute; and &Eacute; aren't.
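The effect of the two settings can be mimicked in Python.  This is a sketch under stated assumptions: normalize_element_name is a hypothetical helper that folds names to upper case, as an SGML parser does for general names when case folding is on, and html.unescape stands in for entity lookup.

```python
import html


def normalize_element_name(name):
    """Fold a general name to upper case (case-insensitive names)."""
    return name.upper()


# Element names: case is folded, so both spellings name the same element.
same_element = normalize_element_name("img") == normalize_element_name("IMG")

# Entity names: case is preserved, so these are two different entities.
lower_e = html.unescape("&eacute;")   # small e with acute accent
upper_e = html.unescape("&Eacute;")   # capital E with acute accent
same_entity = lower_e == upper_e
```

Here same_element comes out true and same_entity false, which mirrors the behavior described above.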


Delimiters Declaration


The DELIM declaration allows you to change the character sequences used as tag delimiters in the SGML application.

    DELIM   GENERAL  SGMLREF
            SHORTREF SGMLREF

The SGMLREF values indicate that in this respect HTML syntax is no different from SGML syntax; you use < as the start delimiter of an opening tag, </ as the start delimiter of a closing tag, > as the end delimiter of a tag, and so on.  As this part of the SGML declaration adds very little information on HTML syntax, it need not be discussed in detail.


Reserved Names Declaration


The NAMES keyword may be used to change some of the reserved SGML names that will be used in DTD declarations.

    NAMES   SGMLREF

Again, the SGMLREF value indicates that the list of these reserved names is exactly that provided by the SGML specification.  Many of these reserved names are discussed later in the section on the DTD.


Quantity Limits Declaration


The last in the SYNTAX section is the QUANTITY declaration:

    QUANTITY SGMLREF
             ATTSPLEN 65536 -- Implementors are recommended --
             LITLEN   65536 -- to avoid fixed limits but --
             NAMELEN  65536 -- this is the best we can say here --
             PILEN    65536
             TAGLVL   100
             TAGLEN   65536
             ATTCNT   100
             GRPGTCNT 150
             GRPCNT   64

This declaration sets limits for some lengths and counters used by the parser in processing the DTD and conforming documents.  Just as in the CAPACITY section, many of these parameters are assigned arbitrarily large values that effectively mean "no limit at all"; it is difficult to imagine that one might need, for example, an element name (governed by the NAMELEN parameter) that is 65,536 characters long.  Most HTML browsers disregard these limitations (or have their own instead), so the individual QUANTITY parameters aren't discussed here.




FEATURES Section


The section of the SGML declaration introduced by the FEATURES keyword contains parameters that turn on or off some of the features of SGML syntax; that is, they allow or disallow using these features in the SGML application being defined.  These features are divided into three classes: MINIMIZE, LINK, and OTHER.  Following the HTML-oriented approach used throughout the chapter, only those features that are turned on in the SGML declaration for HTML 4.0 are considered here.




MINIMIZE class


The MINIMIZE class contains the markup minimization features that are intended to make SGML markup easier to write and more readable for humans.  Minimization features allow you to omit tags and other markup instructions in certain situations where context is sufficient to resolve the resulting ambiguity.

The OMITTAG feature allows the DTD to specify that for certain elements, start and/or end tags may be omitted.  Such an element will be opened or closed based on matching the context against the corresponding content model.  (See the upcoming "Elements" section.) The most common example in HTML is the <P> tag, whose closing tag </P> can always be safely omitted.

The SHORTTAG feature is especially interesting.  In fact, it covers a whole bunch of different features that could save a lot of typing when marking up a document.  With SHORTTAG YES, you can use the empty opening tag <>, the empty closing tag </>, type pairs of tags in the form <TAGNAME/.../, omit attribute names, and so on, with all missing information implied by the parser through simple and effective rules.  Unfortunately, common browsers do not support these features, so they are mostly of theoretical interest for HTML users.

LINK class


The LINK class contains features that affect processing attributes of elements.  None of these are allowed in HTML.


OTHER class


The OTHER class contains miscellaneous features that didn't fit into MINIMIZE or LINK classes.

The FORMAL feature, which is turned on for HTML, indicates that the PUBLIC entity declarations (see the section on public identifiers) should use the formal syntax of public identifiers to enable automatic substitution of external sources by the parser.

Created: Jun. 15, 1997
Revised: Jun. 16, 1997