Some common bugs in HTML parsers
This page is completely legal HTML 2.0, and should pass the
HAL validation service. The page includes examples of common bugs in
"historical" HTML implementations, and is intended for use by
implementers in checking their parsers for such bugs.
If you know of more things that belong on this page, please
let us know.
If you think these things are interesting, you might want to look
at some other examples of the dusty corners
of HTML.
Bugs in comment processing
Your parser terminates on dash-dash-> whether it terminates the comment or not -->
Your parser thinks the first pair of -'s can terminate a comment.
If your parser is working correctly, this sentence is the first one
that appears in this section.
If there is no text after the :, then your parser doesn't handle
whitespace between the -- and >:
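The three comment cases above can be sketched in Python. This is a minimal illustration, not a full markup-declaration parser: it assumes a comment declaration is "<!" followed by zero or more "-- ... --" comments, each optionally followed by whitespace, and closed by ">". The names skip_comment_declaration and strip_comments are hypothetical helpers introduced for this example.

```python
import re

# One comment: "--", any text that does not contain "--", then "--",
# followed by optional whitespace (the case tested by "-- >").
_COMMENT = re.compile(r"--(?:(?!--).)*--\s*", re.DOTALL)


def skip_comment_declaration(text: str, start: int = 0) -> int:
    """Return the index just past the comment declaration at text[start:].

    Raises ValueError if the text there is not a well-formed declaration
    under the assumptions stated above.
    """
    if not text.startswith("<!", start):
        raise ValueError("not a markup declaration")
    pos = start + 2
    # Consume zero or more "-- ... --" comments inside the declaration.
    while True:
        match = _COMMENT.match(text, pos)
        if match is None:
            break
        pos = match.end()
    if not text.startswith(">", pos):
        raise ValueError("comment declaration is not closed by '>'")
    return pos + 1


def strip_comments(text: str) -> str:
    """Remove well-formed comment declarations; keep anything else as is."""
    out = []
    pos = 0
    while True:
        start = text.find("<!--", pos)
        if start == -1:
            out.append(text[pos:])
            return "".join(out)
        out.append(text[pos:start])
        try:
            pos = skip_comment_declaration(text, start)
        except ValueError:
            # Malformed declaration: keep the "<!--" literally and move on.
            out.append("<!--")
            pos = start + 4
```

With this sketch, strip_comments("<!-- dash-dash-> is not the end -->after") and strip_comments("<!-- first -- -- second -->after") both return "after", and strip_comments("a<!-- x -- >b") returns "ab".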
Problems with attributes
The following form showcases some problems parsing attributes.
Paragraph close tag
Implementations have had problems with the definition of a
paragraph and line breaks. In particular, older implementations tend
to treat a paragraph close tag - </P> - as creating a new
paragraph, generating extra whitespace.
The following two "XXX" strings should be separated by the same vertical
white space.
XXX
XXX
Some browsers fail to recognize a paragraph close tag as ending a
paragraph. If that's the case, the string XXX will be at the end
of this paragraph. Otherwise, it will be after this paragraph.
XXX
Problems requiring other documents
These problems require checking another document to demonstrate.
Each document is just long enough to demonstrate the problem at hand.
Some browsers don't handle entities in titles, and display the name
of the entity instead. This document has
an entity in the title so you can check your browser.
Problems with entities
The most common error with entities is failing to terminate them properly. This
string (&) and this string (&) should be the same in all browsers.
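A rough Python sketch of the termination rule being tested here, assuming the two strings differ only in whether the reference carries its closing semicolon. It treats a reference as ending either at ";" or at the first character that cannot be part of a name. The ENTITIES table is a deliberately tiny subset, and decode_references is a hypothetical helper, not any particular browser's code.

```python
import re

# Illustrative subset only; a real parser would carry the full entity set.
ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"'}

# "&name" or "&#digits", then an optional ";". Name characters are assumed
# to be letters, digits, "." and "-"; anything else ends the reference.
_REFERENCE = re.compile(r"&(?:#([0-9]+)|([A-Za-z][A-Za-z0-9.-]*));?")


def decode_references(text: str) -> str:
    """Replace entity and numeric character references in text."""

    def replace(match: re.Match) -> str:
        number, name = match.group(1), match.group(2)
        if number is not None:
            code = int(number)
            # Leave out-of-range numbers untouched instead of failing.
            return chr(code) if code <= 0x10FFFF else match.group(0)
        # Unknown names are left exactly as written.
        return ENTITIES.get(name, match.group(0))

    return _REFERENCE.sub(replace, text)
```

Under these assumptions decode_references("(&amp;)") and decode_references("(&amp)") both return "(&)", while "AT&T" and "a & b" pass through unchanged.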
Many browsers don't implement the full set of entities. This is not
unexpected, as it wasn't until recently that the working group reached
consensus that providing all of them was reasonable for HTML 2.0.
Just for reference, here's the complete set for HTML 2.0. The
symbolic form is followed by the numeric entity in parentheses so you
can tell if the browser gets them both right.
- < (<)
- Less than sign
- > (>)
- Greater than sign
- & (&)
- Ampersand
- " (")
- Double quote sign
- ( )
- no-break space
- ¡ (¡)
- inverted exclamation mark
- ¢ (¢)
- cent sign
- £ (£)
- pound sterling sign
- ¤ (¤)
- general currency sign
- ¥ (¥)
- yen sign
- ¦ (¦)
- broken (vertical) bar
- § (§)
- section sign
- ¨ (¨)
- umlaut (dieresis)
- © (©)
- copyright sign
- ª (ª)
- ordinal indicator, feminine
- « («)
- angle quotation mark, left
- ¬ (¬)
- not sign
- ()
- soft hyphen
- ® (®)
- registered sign
- ¯ (¯)
- macron
- ° (°)
- degree sign
- ± (±)
- plus-or-minus sign
- ² (²)
- superscript two
- ³ (³)
- superscript three
- ´ (´)
- acute accent
- µ (µ)
- micro sign
- ¶ (¶)
- pilcrow (paragraph sign)
- · (·)
- middle dot
- ¸ (¸)
- cedilla
- ¹ (¹)
- superscript one
- º (º)
- ordinal indicator, masculine
- » (»)
- angle quotation mark, right
- ¼ (¼)
- fraction one-quarter
- ½ (½)
- fraction one-half
- ¾ (¾)
- fraction three-quarters
- ¿ (¿)
- inverted question mark
- À (À)
- capital A, grave accent
- Á (Á)
- capital A, acute accent
- Â (Â)
- capital A, circumflex accent
- Ã (Ã)
- capital A, tilde
- Ä (Ä)
- capital A, dieresis or umlaut mark
- Å (Å)
- capital A, ring
- Æ (Æ)
- capital AE diphthong (ligature)
- Ç (Ç)
- capital C, cedilla
- È (È)
- capital E, grave accent
- É (É)
- capital E, acute accent
- Ê (Ê)
- capital E, circumflex accent
- Ë (Ë)
- capital E, dieresis or umlaut mark
- Ì (Ì)
- capital I, grave accent
- Í (Í)
- capital I, acute accent
- Î (Î)
- capital I, circumflex accent
- Ï (Ï)
- capital I, dieresis or umlaut mark
- Ð (Ð)
- capital Eth, Icelandic
- Ñ (Ñ)
- capital N, tilde
- Ò (Ò)
- capital O, grave accent
- Ó (Ó)
- capital O, acute accent
- Ô (Ô)
- capital O, circumflex accent
- Õ (Õ)
- capital O, tilde
- Ö (Ö)
- capital O, dieresis or umlaut mark
- × (×)
- multiply sign
- Ø (Ø)
- capital O, slash
- Ù (Ù)
- capital U, grave accent
- Ú (Ú)
- capital U, acute accent
- Û (Û)
- capital U, circumflex accent
- Ü (Ü)
- capital U, dieresis or umlaut mark
- Ý (Ý)
- capital Y, acute accent
- Þ (Þ)
- capital THORN, Icelandic
- ß (ß)
- small sharp s, German (sz ligature)
- à (à)
- small a, grave accent
- á (á)
- small a, acute accent
- â (â)
- small a, circumflex accent
- ã (ã)
- small a, tilde
- ä (ä)
- small a, dieresis or umlaut mark
- å (å)
- small a, ring
- æ (æ)
- small ae diphthong (ligature)
- ç (ç)
- small c, cedilla
- è (è)
- small e, grave accent
- é (é)
- small e, acute accent
- ê (ê)
- small e, circumflex accent
- ë (ë)
- small e, dieresis or umlaut mark
- ì (ì)
- small i, grave accent
- í (í)
- small i, acute accent
- î (î)
- small i, circumflex accent
- ï (ï)
- small i, dieresis or umlaut mark
- ð (ð)
- small eth, Icelandic
- ñ (ñ)
- small n, tilde
- ò (ò)
- small o, grave accent
- ó (ó)
- small o, acute accent
- ô (ô)
- small o, circumflex accent
- õ (õ)
- small o, tilde
- ö (ö)
- small o, dieresis or umlaut mark
- ÷ (÷)
- divide sign
- ø (ø)
- small o, slash
- ù (ù)
- small u, grave accent
- ú (ú)
- small u, acute accent
- û (û)
- small u, circumflex accent
- ü (ü)
- small u, dieresis or umlaut mark
- ý (ý)
- small y, acute accent
- þ (þ)
- small thorn, Icelandic
- ÿ (ÿ)
- small y, dieresis or umlaut mark
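For implementers who want to regenerate a test list like the one above, here is a small Python sketch. It assumes the list consists of the four basic entities plus every named character in the range 160 to 255, and it takes the names from the standard library's html.entities.name2codepoint table, which covers a larger set than HTML 2.0. The function names are hypothetical, and the output is ordered by code point, so the four basic entities come out in a different order than above.

```python
from html.entities import name2codepoint

# The four entities outside the 160-255 range that the list starts with.
BASIC = {"lt", "gt", "amp", "quot"}


def html2_entity_pairs():
    """Return (name, code point) pairs, sorted by code point."""
    pairs = [
        (name, code)
        for name, code in name2codepoint.items()
        if name in BASIC or 160 <= code <= 255
    ]
    return sorted(pairs, key=lambda pair: pair[1])


def as_test_lines():
    """Format each pair as symbolic form followed by numeric form."""
    return [f"&{name}; (&#{code};)" for name, code in html2_entity_pairs()]
```

Each line produced by as_test_lines(), such as "&yuml; (&#255;)", should render as the same character twice in a browser that handles both forms.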
Mike W. Meyer