HTML Well-formedness


In web page design, and generally for all markup languages such as SGML, HTML, and XML, a well-formed element is one that is:

  • Either opened and subsequently closed,
  • Or an empty element, in which case it must be terminated,
  • And properly nested so that it does not overlap other elements.

For example, in HTML: <b>word</b> is a well-formed element, while <i><b>word</i> is not, since the bold element is not closed. In XHTML, empty elements (elements that inherently have no content) should be closed by putting a slash at the end of the opening tag, e.g. <img />, <br />, <hr />, etc. In HTML, there is no closing tag for such elements, and no slash is added to the opening tag.
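
One way to try these rules out is to hand small snippets to a strict XML parser. The sketch below is only an illustration and assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as one well-formed XML element."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<b>word</b>"))     # True: opened and closed
print(is_well_formed("<i><b>word</i>"))  # False: the b element is never closed
print(is_well_formed("<br />"))          # True: empty element, terminated
print(is_well_formed("<br>"))            # False: empty element, not terminated
```

Note that this checks XML well-formedness only; an HTML parser would accept the unterminated br element.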

What Is Well-Formedness?

Well-formedness is a concept that comes from XML. Technically, it means that a document adheres to certain rigid constraints, such as every start-tag has a matching end-tag, elements begin and end in the same parent element, and every entity reference is defined.

Classic HTML is based on SGML, which allows a lot more leeway than XML does. For example, in HTML and SGML, it's perfectly OK to have a <br> or <li> tag with no corresponding </br> or </li> tag. However, this is not allowed in a well-formed document.

Well-formedness ensures that every conforming processor treats the document in the same way at a low level. For example, consider this malformed fragment:

<p>The quick <strong>brown fox</p>

jumped over the
<p>lazy</strong> dog.</p>

The strong element begins in one paragraph and ends in the next. Different browsers can and do build different internal representations of this text. For example, Firefox and Safari fill in the missing start- and end-tags (including those between the paragraphs). In essence, they treat the preceding fragment as equivalent to this markup:

<p>The quick <strong>brown fox</strong></p>

<strong>jumped over the </strong>
<p><strong>lazy</strong> dog.</p>

By contrast, Opera places the second p element inside the strong element which is inside the first p element. In essence, the Opera DOM treats the fragment as equivalent to this markup:

<p>The quick

<strong>brown fox jumped over the
<p>lazy dog.</p>
</strong>
</p>

If you've ever struggled with writing JavaScript code that works the same across browsers, you know how annoying these cross-browser idiosyncrasies can be.
By contrast, a well-formed document removes the ambiguity by requiring all the end-tags to be filled in and all the elements to have a single unique parent. Here is the well-formed markup corresponding to the preceding code:

<p>The quick <strong>brown fox</strong></p>
<strong>jumped over the </strong>
<p><strong>lazy</strong> dog.</p>

This leaves no room for browser interpretation. All modern browsers build the same tree structure from this well-formed markup. They may still differ in which methods they provide in their respective DOMs and in other aspects of behavior, but at least they can agree on what's in the HTML document. That's a huge step forward.
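
As a rough illustration of "a single unique parent", the sketch below parses a well-formed version of the fox fragment with Python's standard xml.etree.ElementTree module and lists each element together with its parent. The div wrapper is an assumption added here so that the fragment has one root; it is not part of the original markup, and parent_child_pairs is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Well-formed fox fragment, wrapped in an assumed <div> so it has one root.
MARKUP = (
    "<div>"
    "<p>The quick <strong>brown fox</strong></p>"
    "<strong>jumped over the </strong>"
    "<p><strong>lazy</strong> dog.</p>"
    "</div>"
)


def parent_child_pairs(markup: str) -> list:
    """Return (parent tag, child tag) pairs in document order of the parents."""
    root = ET.fromstring(markup)
    return [(parent.tag, child.tag) for parent in root.iter() for child in parent]


for parent, child in parent_child_pairs(MARKUP):
    print(f"{child} is a child of {parent}")
```

Every element shows up under exactly one parent, which is the property a strict parser guarantees and a tag-soup parser has to guess at.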

Anything that operates on an HTML document, be it a browser, a CSS stylesheet, an XSL transformation, a JavaScript program, or something else, will have an easier time working with a well-formed document than the malformed alternative. For many use cases such as XSLT, this may be critical.

An XSLT processor will simply refuse to operate on malformed input. You must make the document well-formed before you can apply an XSLT stylesheet to it.

Most web sites will need to make at least some and possibly all of the following fixes to become well-formed:

  • Every start-tag must have a matching end-tag.
  • Empty elements should use the empty-element tag syntax.
  • Every attribute must have a value.
  • Every attribute value must be quoted.
  • Every raw ampersand must be escaped as &amp;.
  • Every raw less-than sign must be escaped as &lt;.
  • There must be a single root element.
  • Every nonpredefined entity reference must be declared in the DTD.

In addition, namespace well-formedness requires that you add an xmlns="http://www.w3.org/1999/xhtml" attribute to the root html element.
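
The two escaping rules in the list above can be sketched with Python's standard library. This is a minimal example, assuming the text being inserted is raw (not already escaped); the paragraph helper is a made-up name for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def paragraph(raw_text: str) -> str:
    """Wrap raw text in a p element, escaping &, < and > on the way."""
    return "<p>" + escape(raw_text) + "</p>"


markup = paragraph("AT&T: 3 < 5")
print(markup)                      # <p>AT&amp;T: 3 &lt; 5</p>
print(ET.fromstring(markup).text)  # AT&T: 3 < 5
```

Running escape on text that is already escaped would escape it twice (&amp; becomes &amp;amp;), so it should be applied exactly once, at the point where raw text enters the markup.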

Although it's easy to find and fix some of these problems manually, you're unlikely to catch all of them without help. As discussed in the preceding chapter, you can use xmllint or other validators to check for well-formedness. For example:

$ xmllint --noout --loaddtd http://www.aw.com

http://www.aw-bc.com/:118: parser error : Specification mandate value for attribute nowrap
<TD class="headerBg" bgcolor="#004F99" nowrap align="left">
http://www.aw-bc.com/:118: parser error : attributes construct error
<TD class="headerBg" bgcolor="#004F99" nowrap align="left">
http://www.aw-bc.com/:118: parser error : Couldn't find end of Start-tag TD line 118
<TD class="headerBg" bgcolor="#004F99" nowrap align="left">
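
If xmllint is not at hand, a similar (much less detailed) report can be approximated in Python. The sketch below assumes the standard xml.etree.ElementTree module and reports only the first error; SAMPLE is invented for this example and merely imitates the valueless nowrap attribute from the output above.

```python
import xml.etree.ElementTree as ET

# Invented sample: the nowrap attribute on line 3 has no value.
SAMPLE = (
    "<table>\n"
    "<tr>\n"
    '<td class="headerBg" nowrap align="left">x</td>\n'
    "</tr>\n"
    "</table>"
)


def first_error(markup: str):
    """Return (line, column, message) for the first error, or None if none."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


print(first_error(SAMPLE))
print(first_error("<table><tr><td>x</td></tr></table>"))  # None
```

Unlike xmllint, the parser stops at the first problem, so the check has to be rerun after each fix.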

Change Names to Lowercase

Make all element and attribute names lowercase. Make most entity names lowercase, except for those that refer to capital letters.

XHTML uses lowercase names exclusively: all element and attribute names are written in lowercase, and in XHTML mode this is required rather than optional.
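
For element names, the conversion can be roughed out with a regular expression. The sketch below is deliberately naive and rests on several assumptions: it touches tag names only (not attribute or entity names), and it does not try to skip comments, script blocks, or text that happens to contain a less-than sign followed by a letter. The name lowercase_tag_names is hypothetical.

```python
import re

# "<" or "</" followed by a tag name that starts with a letter.
TAG_NAME = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)")


def lowercase_tag_names(markup: str) -> str:
    """Lowercase the element name in every start-tag and end-tag."""
    return TAG_NAME.sub(lambda m: "<" + m.group(1) + m.group(2).lower(), markup)


print(lowercase_tag_names('<TD class="headerBg">Total</TD>'))
# <td class="headerBg">Total</td>
```

For real pages a parser-based tool is a safer choice than a regular expression, precisely because of the cases listed above.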

No overlapping tags are allowed

XML does not allow start and end tags to overlap, but enforces a strict hierarchy within the document. The following table shows an example of these tags.

[Table not available: overlapping versus properly nested tags]
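
Overlap can be detected with a simple stack: each start-tag is pushed, and each end-tag must match the element on top. The sketch below uses Python's lenient html.parser module as the tokenizer; it assumes every element has an explicit end-tag (as well-formedness requires), and the class and function names are invented for this example.

```python
from html.parser import HTMLParser


class OverlapChecker(HTMLParser):
    """Records end-tags that do not match the most recently opened element."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of currently open element names
        self.problems = []       # end-tags that arrived out of order

    def handle_starttag(self, tag, attrs):
        self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if self.open_elements and self.open_elements[-1] == tag:
            self.open_elements.pop()
        else:
            self.problems.append(tag)


def find_overlaps(markup: str) -> list:
    checker = OverlapChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


print(find_overlaps("<b><i>text</b></i>"))  # ['b']
print(find_overlaps("<b><i>text</i></b>"))  # []
```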

Case matters

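Choose a consistent case for start and end tags: XML is case-sensitive, so <B> and </b> do not match. Because XHTML requires lowercase names, prefer lowercase for HTML elements. The following table shows how case matching should appear in well-formed HTML.

[Table not available: matching case in start and end tags]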

Quote your attributes

All attribute values must be surrounded by either single or double quotation marks. The following table shows how to appropriately include attributes.

[Table not available: unquoted versus quoted attribute values]
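
When start-tags are generated by a program, quoting can be delegated to a library routine. The sketch below assumes Python's standard xml.sax.saxutils.quoteattr; the start_tag helper is a hypothetical name. A former valueless attribute such as nowrap is given its own name as the value.

```python
from xml.sax.saxutils import quoteattr


def start_tag(name: str, attributes: dict) -> str:
    """Build a start-tag in which every attribute value is quoted and escaped."""
    parts = [name] + [f"{key}={quoteattr(str(value))}" for key, value in attributes.items()]
    return "<" + " ".join(parts) + ">"


print(start_tag("td", {"class": "headerBg", "nowrap": "nowrap"}))
# <td class="headerBg" nowrap="nowrap">
print(quoteattr('say "hi"'))
# 'say "hi"'
```

quoteattr switches to single quotes when the value itself contains double quotes, which is allowed because either kind of quotation mark may delimit an attribute value.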

Use a single root

Shortcuts that eliminate the <HTML> element as the single top-level element are not allowed. The following table shows how to properly include the <HTML> element.

[Table not available: document without and with a single root element]
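
A strict parser makes the single-root rule easy to observe. The following sketch assumes Python's standard xml.etree.ElementTree module and the wording of the underlying Expat error message ("junk after document element"); has_extra_root is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def has_extra_root(markup: str) -> bool:
    """Return True if the parser finds content after the first root element."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        return "junk after document element" in str(err)
    return False


without_root = "<head><title>t</title></head><body><p>x</p></body>"
with_root = "<html>" + without_root + "</html>"

print(has_extra_root(without_root))  # True: head and body are both top-level
print(has_extra_root(with_root))     # False: html is the only root
```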

Fewer built-in entities

XML defines only the following minimal set of built-in character entities:

  • &lt; for <
  • &gt; for >
  • &amp; for &
  • &quot; for "
  • &apos; for '

Numeric character references, such as &#169;, are also supported. Any other named entity must be declared in the DTD before it can be used.
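
The difference between built-in, numeric, and undeclared entities can be checked directly. This is a small sketch assuming Python's standard xml.etree.ElementTree module and a document with no DTD; entity_text is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def entity_text(reference: str):
    """Parse <p>REFERENCE</p>; return its text, or None if the parser rejects it."""
    try:
        return ET.fromstring(f"<p>{reference}</p>").text
    except ET.ParseError:
        return None


for ref in ("&lt;", "&amp;", "&apos;", "&#169;", "&copy;"):
    print(ref, "->", entity_text(ref))
```

The first four references resolve to their characters, while &copy; is rejected because nothing declares it; writing it as &#169; avoids the problem.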

Escape script blocks

Script blocks in HTML can contain characters that cannot be parsed, such as < and &. These must be escaped in well-formed HTML by using character entities, or by enclosing the script block in a CDATA section.
The following table shows an HTML script block that contains both a character that cannot be parsed (<) and JScript comments. The well-formed script block uses CDATA to encapsulate the script.

[Table not available: script block before and after wrapping in a CDATA section]
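
Both options, entity escaping and a CDATA section, can be compared with a strict parser. The sketch below assumes Python's standard library; SCRIPT_BODY is an invented one-line script, and the // markers are script comments that hide the CDATA delimiters from the script engine.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Invented script body containing both < and &.
SCRIPT_BODY = "if (a < b && b < c) { alert('ordered'); }"


def script_text(markup: str):
    """Return the script element's text, or None if the markup is malformed."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError:
        return None


raw = f"<script>{SCRIPT_BODY}</script>"
escaped = f"<script>{escape(SCRIPT_BODY)}</script>"
cdata = f"<script>//<![CDATA[\n{SCRIPT_BODY}\n//]]></script>"

print(script_text(raw))      # None: the raw < and & are rejected
print(script_text(escaped))  # the original script text
print(script_text(cdata))    # the script text between the two // lines
```

The CDATA form keeps the script readable in the source, but it breaks if the script itself ever contains the sequence ]]>.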