Abstract: . . .

TagSoup: A SAX parser in Java for nasty, ugly HTML
John Cowan (cowan@ccil.org)
Copyright 2002 John Cowan; licensed under the GNU GPL

Copyright
  This presentation is: Copyright © 2004 John Cowan . . .

. . . confused if an attribute value does contain whitespace but has no quotation marks

A Few Other Points

Roll Your Own
  SAX2 properties let you:
    Specify your own scanner object (if your surface syntax is not HTML)
    Specify your own schema object (if your elements, content models, attributes, and entities are not HTML)
    Specify your own auto-detector object (if you know how to . . .

  . . . Entities must resolve to a single character
  You can clone the standard HTML schema, or start from an empty schema and build up

How Big Is TagSoup?
  7 classes: Parser, Schema, HTMLSchema, HTMLScanner, Element, ElementType, a private copy of AttributesImpl
  4 interfaces: Scanner, ScanHandler, AutoDetector, HTMLModels
  3 debug classes: PYXScanner, PYXWriter, a modified version of XMLWriter
  About 4600 lines of Java
  2000 . . .

. . .

State Table Markup Language
  STML is another special-purpose language for describing arbitrary state machines
  TagSoup's lexical analyzer is written in STML
  A RELAX NG schema for TSSL is provided
  An XSLT script converts the STML into tables for the Java HTMLScanner class

PYX Format
  PYX format is a linearized syntax for XML, almost the same as SGML ESIS format
  TagSoup provides support for PYX format on . . .

  . . . parsed

Improvements
  Suggestions and patches are welcome
  Subscribe to tagsoup-friends@yahoogroups.com

More Information
  http://tagsoup.info . . .
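To make the idea of PYX as a "linearized syntax for XML" concrete, here is a minimal sketch that turns SAX events into PYX-style lines. It assumes the commonly described PYX line prefixes: `(` for a start tag, `A` for an attribute, `-` for character data, `)` for an end tag, and `?` for a processing instruction. The class names `PyxSketch` and `PyxHandler` are hypothetical. This is not TagSoup's own PYXWriter, and it uses the JDK's built-in XML parser on well-formed input rather than TagSoup.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PyxSketch {

    /** Hypothetical handler: records each SAX event as one PYX-style line. */
    static class PyxHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        // SAX may deliver one text run in several chunks, so buffer it.
        private final StringBuilder text = new StringBuilder();

        // Emit any buffered character data as a single "-" line.
        private void flushText() {
            if (text.length() > 0) {
                out.append('-').append(escape(text.toString())).append('\n');
                text.setLength(0);
            }
        }

        // Keep every event on one line by escaping backslash, newline and tab.
        private static String escape(String s) {
            return s.replace("\\", "\\\\").replace("\n", "\\n").replace("\t", "\\t");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flushText();
            out.append('(').append(qName).append('\n');
            for (int i = 0; i < atts.getLength(); i++) {
                out.append('A').append(atts.getQName(i)).append(' ')
                   .append(escape(atts.getValue(i))).append('\n');
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flushText();
            out.append(')').append(qName).append('\n');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void processingInstruction(String target, String data) {
            flushText();
            out.append('?').append(target).append(' ').append(escape(data)).append('\n');
        }

        @Override
        public void endDocument() {
            flushText();
        }
    }

    /** Parses well-formed XML with the JDK SAX parser and returns PYX-style text. */
    static String toPyx(String xml) throws Exception {
        PyxHandler handler = new PyxHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(toPyx("<p class=\"x\">Hi <b>there</b></p>"));
    }
}
```

Because every event occupies exactly one line, output in this style is easy to process with line-oriented tools. The same handler could in principle be attached to any SAX2 parser.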
TagSoup: A SAX parser in Java for nasty, ugly HTML
from: www.ccil.org