RSS and Atom in Action: Newsfeed Formats | WebReference


By Dave Johnson

A little history lesson on newsfeed formats will help you understand the choices you have and the direction that newsfeed technology is headed.

One of the first challenges developers face when building applications with RSS and Atom is making sense of the many slightly different XML newsfeed formats. The most popular newsfeed format is RSS, but RSS is not really a single format. RSS is a family of informally developed and competing formats that has forked into two opposing camps, which can't even agree on what the letters RSS stand for. In this chapter, we'll help you through this confusion by explaining the history of RSS, the RSS fork, and the details of the most widely used RSS formats.

We'll also explain that, at least in the world of newsfeed formats, there is hope for the future. The Internet Engineering Task Force (IETF) has developed a new newsfeed format, known as the Atom Syndication Format, or Atom for short, which will eventually replace the RSS family of formats. We'll cover the details of the Atom format in this chapter and look at its sister specification, the Atom Publishing Protocol, in chapter 10. But for now, let's get started with our RSS history lesson.

4.1 The birth of RSS

RSS began life at Netscape as part of the My Netscape project. It was given the name RDF Site Summary (RSS) because it was an application of the Resource Description Framework (RDF), a sophisticated language for describing resources on the Web. Netscape used RSS to describe news stories and to allow users to build their own information portal, called My Netscape, by selecting the news sources they wanted to have displayed in their personal portal. RSS caught on quickly, as web sites scrambled to provide newsfeeds compatible with Netscape's innovative new portal.

By the time Netscape developer Dan Libby produced the first specification, RSS 0.9, in January 1999, the RSS user community was already starting to divide into two camps. One camp wanted Netscape to make better use of RDF in RSS, and the other wanted to simplify RSS by removing RDF altogether. Influential blogger Dave Winer of UserLand Software was among those arguing for simplicity and the removal of RDF. In the end, Winer's camp won.

4.1.1 RSS 0.91

When Netscape released the RSS 0.91 specification, all references to RDF were removed. Since RDF was no longer part of the specification, the acronym RSS no longer made sense. Dave Winer declared, "There is no consensus on what RSS stands for, so it's not an acronym, it's a name." It was around this time that he started his stewardship of RSS. In July 2000, he released his own version of the RSS 0.91 specification. He reformatted the document to make it shorter and easier to read. He also removed the specification's document type definition (DTD), making it more difficult for XML parsers to validate RSS newsfeeds.

RSS 0.91 is still in use today and is the oldest ancestor of RSS 2.0, which is currently the most widely used newsfeed format. Let's take a closer look at RSS 0.91, starting with an example newsfeed.

Listing 4.1 Example of an RSS 0.91 format newsfeed

Listing 4.1 is a newsfeed in RSS 0.91 format. The root element of an RSS newsfeed is <rss> [line 2]. Within the root element there is one <channel> element, and within that, a channel header [line 3], which consists of metadata elements, including <title> [line 4], <link> [line 5], and <description> [line 6].

Within a channel are the most recent news items from the channel's web site. As you can see, an <item> [line 11] contains a <title> [line 12], a <link> [line 13], and a <description> [line 14]. The description is meant to hold the HTML content of the newsfeed item. In some newsfeeds, the item descriptions include the full content of the news story or blog item they represent. In others, the descriptions include just an excerpt, and the reader must visit the web site to read the full story.
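
To make the channel and item structure concrete, here is a minimal sketch in Python that reads an RSS 0.91 newsfeed with the standard library's xml.etree.ElementTree module. The feed text is a hypothetical stand-in and not the book's listing 4.1; the titles, the example.com URLs, and the function name read_feed are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 0.91 feed. Titles and URLs are placeholders,
# not the contents of listing 4.1.
SAMPLE_FEED = """<rss version="0.91">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <description>Placeholder channel description</description>
    <item>
      <title>First story</title>
      <link>http://example.com/stories/1</link>
      <description>Read the &lt;a href="http://example.com/os"&gt;full story&lt;/a&gt;.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/stories/2</link>
      <description>Just an excerpt.</description>
    </item>
  </channel>
</rss>"""


def read_feed(xml_text):
    """Return the channel header and items of an RSS 0.91 feed as plain dicts."""
    root = ET.fromstring(xml_text)
    if root.tag != "rss":
        raise ValueError("root element is not <rss>")
    channel = root.find("channel")
    if channel is None:
        raise ValueError("feed has no <channel> element")

    fields = ("title", "link", "description")
    # Channel header: metadata elements that are direct children of <channel>.
    header = {name: channel.findtext(name, default="") for name in fields}
    # Items: each <item> carries its own title, link, and description.
    items = [
        {name: item.findtext(name, default="") for name in fields}
        for item in channel.findall("item")
    ]
    return {"version": root.get("version"), "channel": header, "items": items}


feed = read_feed(SAMPLE_FEED)
print(feed["channel"]["title"])
for entry in feed["items"]:
    print(entry["title"], entry["link"])
```

Note that the XML parser unescapes the description text as it reads it, so the first item's description comes back as ordinary HTML containing an <a> element.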

Escaped content

Note that the HTML content in the <description> elements in listing 4.1 is escaped. That's why it's so hard to read the link <a href="">World OS</a> that is embedded in the text. We do this to follow the rules of XML. Special characters that have meaning in XML must be replaced with escape codes. So we replace any left angle brackets (<) with &lt;, any right angle brackets (>) with &gt;, and any ampersands (&) with &amp;. We'll discuss some other ways to represent content in newsfeeds, but escaped content is the common practice for representing HTML within an RSS newsfeed.
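
The escaping rules above can be sketched in Python as follows. The escape and unescape functions come from the standard library's xml.sax.saxutils module; the sample HTML string and the helper name manual_escape are assumptions made for illustration. One detail worth noticing is the order of replacement: ampersands have to be handled first, otherwise the ampersands introduced by &lt; and &gt; would themselves be escaped.

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical HTML content for an item description.
html = 'Try the <a href="http://example.com/">Example link</a> & more'


def manual_escape(text):
    """Apply the three replacements by hand, ampersands first."""
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )


# The library function performs the same three replacements.
escaped = escape(html)
print(escaped)

# A feed reader reverses the process to recover the original HTML.
restored = unescape(escaped)
print(restored)
```

In practice, an XML library that builds the newsfeed will usually apply this escaping itself when element text is written out, so doing it by hand is mostly needed when a feed is assembled through string templates.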