# The Parser Model

The parser model here follows the model in section
[8.2.1](http://www.w3.org/TR/2012/CR-html5-20121217/syntax.html#parsing)
of the HTML5 specification, though we do not assume a networking layer.

     [ InputStream ]    // Generic support for reading input.
           ||
      [ Scanner ]       // Breaks down the stream into characters.
           ||
     [ Tokenizer ]      // Groups characters into syntactic units.
           ||
    [ Tree Builder ]    // Organizes units into a tree of objects.
           ||
    [ DOM Document ]    // The final state of the parsed document.


## InputStream

This is an interface with at least two concrete implementations:

- StringInputStream: Reads an HTML5 string.
- FileInputStream: Reads an HTML5 file.
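The shape described above (one interface, a string-backed and a file-backed implementation) might look roughly like the following. This is a minimal sketch, not the library's actual interface: `SketchInputStream`, `SketchStringInputStream`, `SketchFileInputStream`, and `drain()` are hypothetical names invented for illustration, and the sketch reads bytes rather than decoded characters.

```php
<?php
// Hypothetical, simplified interface; it does not reproduce the library's real InputStream.
interface SketchInputStream
{
    /** Return the next byte of input, or false once the input is exhausted. */
    public function next();
}

// Reads from a string already held in memory.
final class SketchStringInputStream implements SketchInputStream
{
    private string $data;
    private int $pos = 0;

    public function __construct(string $data)
    {
        $this->data = $data;
    }

    public function next()
    {
        return $this->pos < strlen($this->data) ? $this->data[$this->pos++] : false;
    }
}

// Reads a whole file up front, then delegates to the string implementation.
final class SketchFileInputStream implements SketchInputStream
{
    private SketchStringInputStream $inner;

    public function __construct(string $path)
    {
        $contents = file_get_contents($path);
        if ($contents === false) {
            throw new RuntimeException("Cannot read $path");
        }
        $this->inner = new SketchStringInputStream($contents);
    }

    public function next()
    {
        return $this->inner->next();
    }
}

// A consumer depends only on the interface, never on where the input came from.
function drain(SketchInputStream $in): string
{
    $out = '';
    while (($byte = $in->next()) !== false) {
        $out .= $byte;
    }
    return $out;
}
```

Because `drain()` is typed against the interface, the later stages of the pipeline would not need to know whether the markup came from a string or a file.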

## Scanner

This is a mechanical piece of the parser: it breaks the input stream down
into characters for the tokenizer.

## Tokenizer

This follows section 8.4 of the HTML5 spec. It is (roughly) a recursive
descent parser, though it contains plenty of optimizations that are less
than purely functional.

## EventHandler and DOMTree

EventHandler is the interface for tree builders. Since not all
implementations will necessarily build trees, we've chosen a more
generic name.

The tokenizer emits tokens to the event handler during tokenization.

The DOMTree is an event handler that builds a DOM tree. The output of
the DOMTree builder is a DOMDocument.
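To illustrate why the interface is named for events rather than trees, here is a minimal sketch of a handler that builds no tree at all. Everything in it is an assumption made for illustration: `SketchEventHandler`, `OutlineHandler`, and `replayEvents()` are hypothetical names, the method set is deliberately reduced, and the "tokenizer" is a stand-in that replays a fixed event sequence.

```php
<?php
// Hypothetical, simplified handler interface; not the library's real EventHandler.
interface SketchEventHandler
{
    public function startTag(string $name, array $attributes = []): void;
    public function endTag(string $name): void;
    public function text(string $data): void;
    public function eof(): void;
}

// A handler that does not build a tree: it records an indented outline of tag names.
final class OutlineHandler implements SketchEventHandler
{
    /** @var string[] */
    public array $lines = [];
    private int $depth = 0;

    public function startTag(string $name, array $attributes = []): void
    {
        $this->lines[] = str_repeat('  ', $this->depth) . $name;
        $this->depth++;
    }

    public function endTag(string $name): void
    {
        $this->depth = max(0, $this->depth - 1);
    }

    public function text(string $data): void
    {
        // Text is ignored; only element structure is of interest here.
    }

    public function eof(): void
    {
        // Nothing to finalize for an outline.
    }
}

// Stand-in for a tokenizer: replays the events for "<ul><li>a</li></ul>".
function replayEvents(SketchEventHandler $handler): void
{
    $handler->startTag('ul');
    $handler->startTag('li');
    $handler->text('a');
    $handler->endTag('li');
    $handler->endTag('ul');
    $handler->eof();
}

$outline = new OutlineHandler();
replayEvents($outline);
```

A tree-building handler would implement the same methods but append nodes to a document instead of collecting strings.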

## DOMDocument

PHP has a built-in DOMDocument class (technically, it is provided by the
DOM extension, which is built on libxml2). We use that, which makes the
output of this process compatible with SimpleXML, QueryPath, and many
other XML/HTML processing tools.
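As a rough illustration of that compatibility, the sketch below hands a `DOMDocument` to SimpleXML and to `DOMXPath` using only PHP's built-in classes. The document here is a hand-built stand-in, not actual parser output, so details such as namespaces may differ from what the parser produces.

```php
<?php
// Stand-in for a parsed document, built directly with PHP's DOM API.
$dom = new DOMDocument('1.0', 'UTF-8');
$html = $dom->createElement('html');
$body = $dom->createElement('body');
$p = $dom->createElement('p', 'Hello');
$p->setAttribute('class', 'greeting');
$body->appendChild($p);
$html->appendChild($body);
$dom->appendChild($html);

// SimpleXML can wrap the existing tree without re-parsing any markup.
$xml = simplexml_import_dom($dom);
$text = (string) $xml->body->p;
$class = (string) $xml->body->p['class'];

// The same document can also be queried with XPath.
$xpath = new DOMXPath($dom);
$count = $xpath->query('//p[@class="greeting"]')->length;
```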

For cases where the HTML5 is a fragment of an HTML5 document, a
DOMDocumentFragment is returned instead. This is another built-in class.
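For readers unfamiliar with `DOMDocumentFragment`, the following sketch shows how such a fragment behaves once you have one. It builds the fragment by hand with PHP's DOM API as a stand-in for parser output; the element names and the `div` container are illustrative assumptions.

```php
<?php
// A fragment always belongs to a document, but is not part of its tree.
$dom = new DOMDocument('1.0', 'UTF-8');
$fragment = $dom->createDocumentFragment();
$fragment->appendChild($dom->createElement('p', 'First'));
$fragment->appendChild($dom->createElement('p', 'Second'));

// Unlike a document, a fragment may hold several top-level nodes.
$isFragment = $fragment instanceof DOMDocumentFragment;
$childCount = $fragment->childNodes->length;

// Appending the fragment inserts its children, not the fragment itself.
$container = $dom->createElement('div');
$dom->appendChild($container);
$container->appendChild($fragment);
$markup = $dom->saveXML($container);
```

This is what makes a fragment convenient for snippets: it can carry several sibling nodes and be dropped into an existing document in one call.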