Some common bugs in HTML parsers

This page is completely legal HTML 2.0, and should pass the HAL validation service. The page includes examples of common bugs in "historical" HTML implementations, and is intended for use by implementers in checking their parsers for such bugs.

If you know of more things that belong on this page, please let us know.

If you think these things are interesting, you might want to look at some other examples of the dusty corners of HTML.

Bugs in comment processing

Your parser terminates on dash-dash-> whether it terminates the comment or not -->

Your parser thinks the first pair of -'s can terminate a comment.

If your parser is working correctly, this sentence is the first one that appears in this section.

If there is no text after the :, then your parser doesn't handle whitespace between the -- and >:
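The comment cases above can be checked mechanically. What follows is a minimal Python sketch, assuming the SGML-style rule that HTML 2.0 uses: a comment declaration is <! followed by zero or more comments, each running from -- to the next --, with white space allowed after each comment but not before the first, and then a closing >. The function name comment_declaration_end is hypothetical, and later HTML versions define comments differently.

```python
def comment_declaration_end(text: str, start: int = 0) -> int:
    """Return the index just past the '>' that closes the comment
    declaration beginning at text[start], or -1 if the declaration
    is malformed or unterminated (SGML-style rule, as an assumption)."""
    if not text.startswith("<!", start):
        return -1
    i = start + 2
    while True:
        # A '>' here ends the whole declaration.
        if text.startswith(">", i):
            return i + 1
        # Otherwise a new comment must open with '--'.
        if not text.startswith("--", i):
            return -1
        # The comment runs up to and including the next '--'.
        close = text.find("--", i + 2)
        if close == -1:
            return -1
        i = close + 2
        # White space is allowed after each comment.
        while i < len(text) and text[i] in " \t\r\n":
            i += 1


if __name__ == "__main__":
    samples = [
        "<!-- a - b -->",        # a single dash does not end the comment
        "<!-- a -- >",           # white space between -- and >
        "<!-- a -- --> b -- >",  # this --> does not end the declaration
    ]
    for sample in samples:
        end = comment_declaration_end(sample)
        print(repr(sample[:end]))
```

In the third sample, the dashes in --> open a second comment, so the declaration only ends at the final >. A parser that stops at the first --> would leak the rest into the page.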

Problems with attributes

The following form showcases some problems parsing attributes.

If the first checkbox isn't checked, then your parser doesn't handle boolean attributes in their full form.

If this element is not a checkbox, then your browser is not ignoring case properly in attribute values.

The default value of this string input element should be Single quotes around this one. If it says 'Single, then your browser doesn't handle single-quoted attributes.

This should say no.quotes.needed. If it says no, then your browser is terminating name values improperly.

This form has a newline in the middle. The newline should appear as a space character.

The default value here should be Quotes with a > in the middle. If it says Quotes with a, then your browser is terminating tags - and attribute values - improperly.

Here, you should see This should show a real greater than sign: >. If you see the text string &gt; in the text, then your browser isn't dealing with entities in attributes, and it should be.
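Most of the attribute cases above can be run through a parser programmatically. The following is a rough sketch using Python's standard html.parser module. The class name AttributeCollector, the helper collect_attributes, and the SAMPLE markup are all hypothetical; the markup is modeled on the form described above, not copied from it.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record the attributes of every start tag as (tag, dict) pairs."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


# Hypothetical markup covering the cases described in the text.
SAMPLE = (
    "<input type=CHECKBOX checked=checked>"
    "<input value='Single quotes around this one'>"
    "<input value=no.quotes.needed>"
    '<input value="Quotes with a > in the middle">'
    '<input value="A real greater than sign: &gt;">'
)


def collect_attributes(markup):
    parser = AttributeCollector()
    parser.feed(markup)
    parser.close()
    return parser.seen


if __name__ == "__main__":
    for tag, attrs in collect_attributes(SAMPLE):
        print(tag, attrs)
```

This parser keeps the case of attribute values, so a caller that wants the case-insensitive behavior described above has to compare attrs["type"].lower() against "checkbox" itself.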

Paragraph close tag

Implementations have had problems with the definition of a paragraph and line breaks. In particular, older implementations tend to treat a paragraph close tag - </P> - as creating a new paragraph, generating extra whitespace.

The following two "XXX"'s should be separated by the same vertical white space.

XXX

XXX

Some browsers fail to recognize a paragraph close tag as ending a paragraph. If that's the case, the string XXX will be at the end of this paragraph. Otherwise, it will be after this paragraph.

XXX
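The expected behavior can be sketched in code: a paragraph close tag ends the current paragraph and does nothing else. This is a minimal illustration built on Python's html.parser, not a rendering model; the names ParagraphTracker and split_paragraphs are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphTracker(HTMLParser):
    """Separate text inside <P> elements from text outside them."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # text of each paragraph, in document order
        self.outside = []     # non-blank text found outside any paragraph
        self._open = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraphs.append("")
            self._open = True

    def handle_endtag(self, tag):
        if tag == "p":
            # The close tag only ends the paragraph; it never starts one.
            self._open = False

    def handle_data(self, data):
        if self._open:
            self.paragraphs[-1] += data
        elif data.strip():
            self.outside.append(data.strip())


def split_paragraphs(markup):
    tracker = ParagraphTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.paragraphs, tracker.outside


if __name__ == "__main__":
    print(split_paragraphs("<P>First paragraph.</P>XXX<P>Second paragraph."))
```

With this handling, the XXX that follows the close tag lands outside both paragraphs, and no empty paragraph is created by the close tag.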

Problems requiring other documents

These problems require checking another document to demonstrate. Each document is just long enough to demonstrate the problem at hand.

Some browsers don't handle entities in titles, and display the name of the entity instead. This document has an entity in the title so you can check your browser.

Problems with entities

The most common entity error is not terminating them properly. This string (&) and this string (&) should be the same in all browsers.
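As an illustration of the termination case, here is a small sketch using html.unescape from Python's standard library. It assumes the two strings differ only in whether the ampersand entity ends with a semicolon; the helper name decode is hypothetical.

```python
import html


def decode(text: str) -> str:
    """Replace entity and character references with their characters."""
    return html.unescape(text)


if __name__ == "__main__":
    terminated = decode("This string (&amp;)")
    unterminated = decode("This string (&amp)")
    print(terminated)
    print(unterminated)
    print(terminated == unterminated)
```

Python's decoder follows later HTML rules, which accept a fixed set of older entity names without the semicolon, so this shows one tolerant behavior rather than the HTML 2.0 definition itself.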

Many browsers don't implement the full set of entities. This is not unexpected, as it wasn't until recently that the working group reached consensus that providing all of them was reasonable for HTML 2.0.

Just for reference, here's the complete set for HTML 2.0. The symbolic form is followed by the numeric entity in parentheses so you can tell if the browser gets them both right.

< (<)
Less than sign
> (>)
Greater than sign
& (&)
Ampersand
" (")
Double quote sign
  ( )
no-break space
¡ (¡)
inverted exclamation mark
¢ (¢)
cent sign
£ (£)
pound sterling sign
¤ (¤)
general currency sign
¥ (¥)
yen sign
¦ (¦)
broken (vertical) bar
§ (§)
section sign
¨ (¨)
umlaut (dieresis)
© (©)
copyright sign
ª (ª)
ordinal indicator, feminine
« («)
angle quotation mark, left
¬ (¬)
not sign
­ (­)
soft hyphen
® (®)
registered sign
¯ (¯)
macron
° (°)
degree sign
± (±)
plus-or-minus sign
² (²)
superscript two
³ (³)
superscript three
´ (´)
acute accent
µ (µ)
micro sign
¶ (¶)
pilcrow (paragraph sign)
· (·)
middle dot
¸ (¸)
cedilla
¹ (¹)
superscript one
º (º)
ordinal indicator, masculine
» (»)
angle quotation mark, right
¼ (¼)
fraction one-quarter
½ (½)
fraction one-half
¾ (¾)
fraction three-quarters
¿ (¿)
inverted question mark
À (À)
capital A, grave accent
Á (Á)
capital A, acute accent
Â (Â)
capital A, circumflex accent
Ã (Ã)
capital A, tilde
Ä (Ä)
capital A, dieresis or umlaut mark
Å (Å)
capital A, ring
Æ (Æ)
capital AE diphthong (ligature)
Ç (Ç)
capital C, cedilla
È (È)
capital E, grave accent
É (É)
capital E, acute accent
Ê (Ê)
capital E, circumflex accent
Ë (Ë)
capital E, dieresis or umlaut mark
Ì (Ì)
capital I, grave accent
Í (Í)
capital I, acute accent
Î (Î)
capital I, circumflex accent
Ï (Ï)
capital I, dieresis or umlaut mark
Ð (Ð)
capital Eth, Icelandic
Ñ (Ñ)
capital N, tilde
Ò (Ò)
capital O, grave accent
Ó (Ó)
capital O, acute accent
Ô (Ô)
capital O, circumflex accent
Õ (Õ)
capital O, tilde
Ö (Ö)
capital O, dieresis or umlaut mark
× (×)
multiply sign
Ø (Ø)
capital O, slash
Ù (Ù)
capital U, grave accent
Ú (Ú)
capital U, acute accent
Û (Û)
capital U, circumflex accent
Ü (Ü)
capital U, dieresis or umlaut mark
Ý (Ý)
capital Y, acute accent
Þ (Þ)
capital THORN, Icelandic
ß (ß)
small sharp s, German (sz ligature)
à (à)
small a, grave accent
á (á)
small a, acute accent
â (â)
small a, circumflex accent
ã (ã)
small a, tilde
ä (ä)
small a, dieresis or umlaut mark
å (å)
small a, ring
æ (æ)
small ae diphthong (ligature)
ç (ç)
small c, cedilla
è (è)
small e, grave accent
é (é)
small e, acute accent
ê (ê)
small e, circumflex accent
ë (ë)
small e, dieresis or umlaut mark
ì (ì)
small i, grave accent
í (í)
small i, acute accent
î (î)
small i, circumflex accent
ï (ï)
small i, dieresis or umlaut mark
ð (ð)
small eth, Icelandic
ñ (ñ)
small n, tilde
ò (ò)
small o, grave accent
ó (ó)
small o, acute accent
ô (ô)
small o, circumflex accent
õ (õ)
small o, tilde
ö (ö)
small o, dieresis or umlaut mark
÷ (÷)
divide sign
ø (ø)
small o, slash
ù (ù)
small u, grave accent
ú (ú)
small u, acute accent
û (û)
small u, circumflex accent
ü (ü)
small u, dieresis or umlaut mark
ý (ý)
small y, acute accent
þ (þ)
small thorn, Icelandic
ÿ (ÿ)
small y, dieresis or umlaut mark
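
The comparison of symbolic and numeric forms in the list above can also be automated. The sketch below uses Python's html.entities tables for code points 160 through 255, assuming those tables (which follow later HTML versions) agree with HTML 2.0 over this range. The function name latin1_entity_rows is hypothetical.

```python
import html
from html.entities import codepoint2name


def latin1_entity_rows():
    """Return (name, codepoint, symbolic, numeric) for code points 160-255,
    where symbolic and numeric are the decoded forms of each reference."""
    rows = []
    for codepoint in range(160, 256):
        name = codepoint2name[codepoint]
        symbolic = html.unescape(f"&{name};")
        numeric = html.unescape(f"&#{codepoint};")
        rows.append((name, codepoint, symbolic, numeric))
    return rows


if __name__ == "__main__":
    for name, codepoint, symbolic, numeric in latin1_entity_rows():
        status = "ok" if symbolic == numeric == chr(codepoint) else "MISMATCH"
        print(f"&{name}; &#{codepoint}; {symbolic} ({numeric}) {status}")
```

A decoder under test could replace html.unescape here, and any row marked MISMATCH would point at an entity it gets wrong.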

Mike W. Meyer