Uma HTML Parser

vibecode
{"vibecode": {
    "doc": "uma-parser",
    "role": "in-progress spec for Uma's pure-Caspian HTML parser; accepts well-formed-ish HTML to avoid bundling a C parser, rejects malformed input rather than attempting browser-style recovery",
    "key_concepts": ["pure_caspian_parser", "well_formed_ish_input", "void_element_handling",
        "opaque_script_style", "no_browser_recovery"]
}}

Status: design in progress. This is the pure-Caspian HTML parser that Uma will use to read existing HTML documents into a tree it can manipulate. Goal: avoid bundling a C HTML parser (gumbo or similar) at the cost of accepting "well-formed-ish" HTML rather than the full HTML5 quirky-recovery spec.

See uma.md for the broader Uma builder / DOM helper this parser feeds.


Scope

In scope:

  - Parse well-formed HTML5-style input into an element tree.
  - Recognize void elements (per Uma's schema).
  - Handle text content, attributes (single-quoted, double-quoted, unquoted, valueless), comments, CDATA, DOCTYPE, processing instructions.
  - Recognize <script> and <style> content as opaque (don't parse tags inside them).
  - Decode standard HTML named character references (&amp;, &lt;, &nbsp;, etc.) and numeric references (&#42;, &#x2A;).
  - Reject malformed input loudly (raise a flag); don't attempt browser-style recovery.

Out of scope:

  - Full HTML5 parsing-spec compliance (insertion modes, foster parenting, quirks-mode handling, foreign-element switching for SVG/MathML, encoding sniffing).
  - Recovering from messy real-world HTML. Use a Lua HTML library outside Caspian for that.
  - Streaming. Whole-document parsing only.

The trade-off: developers feeding Uma their own templates or serialized output get a fast, small, pure-Caspian parser. Anyone parsing HTML from the wild (scraping, malformed input) reaches for an external Lua HTML library instead.
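
As a point of reference for the character-reference decoding listed in scope, here is what the expected results look like. The sketch is in Python rather than Caspian, purely for illustration. It leans on the standard library's html.unescape, which follows the lenient HTML5 rules and so accepts more than this spec intends.

```python
import html

# Named references plus decimal and hexadecimal numeric references.
samples = ["&amp;", "&lt;", "&nbsp;", "&#42;", "&#x2A;"]
decoded = [html.unescape(s) for s in samples]

# Note: "&nbsp;" decodes to U+00A0 (no-break space), not an ASCII space.
for raw, out in zip(samples, decoded):
    print(f"{raw!r} -> {out!r}")
```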


Architecture

Two layers, with all tag knowledge supplied externally via a schema config:

raw input string                  schema config (e.g., html5.json)
      │                                  │
      ▼                                  ▼
[ TOKENIZER ]   ← dumb, fast,    [ SCHEMA ]  ← all rules about
      │           single linear        │       tags: which are
      │           pass; emits a        │       void, what can
      │           flat token stream    │       nest where, what
      │           with no tag-shape    │       are opaque
      │           knowledge            │       (script-like),
      │                                │       what attributes
      │                                │       are allowed, etc.
      │                                │
      ▼                                ▼
[ TREE BUILDER ] ← state machine that walks the token stream,
      │           consults the schema for tag rules, produces
      ▼           an element tree
element tree

The parser knows nothing intrinsic about HTML. It knows how to tokenize an angle-bracket-and-attribute syntax, walk a state machine, and ask a schema "is this tag void?" / "can this tag close that tag?" / "should I treat this tag's content as opaque?" Substitute a different schema and the same engine parses a different language.

HTML5 is just the first schema (Uma's html5.json). Custom schemas let developers define their own markup languages or restricted HTML subsets — see Schemas other than HTML5 below.

Splitting tokenization from tree-building keeps both layers small. The tokenizer doesn't know about any tag semantics; the tree-builder doesn't deal with character-level parsing; the schema knows nothing about parsing at all (it's just data).
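
A minimal sketch of the "schema is just data" idea, written in Python for illustration. The config layout (tags, void, opaque, closed_by) and the method names are assumptions made for this example; the real html5.json shape is not specified in this document.

```python
import json

# Hypothetical schema layout, NOT the real html5.json format.
SCHEMA_JSON = """
{
  "tags": {
    "div":    {},
    "p":      {"closed_by": ["p", "div"]},
    "br":     {"void": true},
    "img":    {"void": true},
    "script": {"opaque": true}
  }
}
"""

class Schema:
    """Read-only view over a schema config. Holds data, no parsing logic."""

    def __init__(self, config):
        self._tags = config["tags"]

    def knows(self, tag):
        return tag in self._tags

    def is_void(self, tag):
        return bool(self._tags.get(tag, {}).get("void", False))

    def is_opaque(self, tag):
        return bool(self._tags.get(tag, {}).get("opaque", False))

    def can_close(self, opening, open_tag):
        """True if seeing <opening> implicitly closes an open <open_tag>."""
        return opening in self._tags.get(open_tag, {}).get("closed_by", [])

schema = Schema(json.loads(SCHEMA_JSON))
print(schema.is_void("br"), schema.is_opaque("script"), schema.can_close("div", "p"))
```

The tree builder would only ever call the three query methods, so swapping in a different config changes the language being parsed without touching the engine.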


Tokenizer

Single linear pass over the input string. Splits on:

The result is a flat array of tokens. Each token is a small hash:

{type: 'tag-open',  value: '<'}
{type: 'word',      value: 'div'}
{type: 'space',     value: ' '}
{type: 'word',      value: 'class'}
{type: 'eq',        value: '='}
{type: 'quote',     value: '"'}
{type: 'word',      value: 'container'}
{type: 'quote',     value: '"'}
{type: 'tag-close', value: '>'}

Why this shape rather than a richer tokenizer that already knows "open-tag" vs "attribute" vs "text": the smart work belongs in the tree-builder. The tokenizer's job is splitting the string fast. Lua patterns don't have alternation, so the tokenizer either uses several string.find calls per chunk or a hand-rolled character-level scan; either way it stays simple.

Open: exact token type set. The minimum is enough types for the tree-builder to recognize structure (tag-open, tag-close, eq, quote, word, space). Adding more (e.g., comment-open for <!--) might be worth it if it simplifies the tree-builder.
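
A hand-rolled character-level scan of the kind described above could look roughly like this. It is written in Python for illustration and uses the minimum token set from the example (tag-open, tag-close, eq, quote, word, space). Treating the single quote as a quote token, and leaving "/" inside word tokens, are assumptions of this sketch, not decisions of the spec.

```python
# Single-character tokens and their type names (assumed minimum set).
PUNCT = {"<": "tag-open", ">": "tag-close", "=": "eq", '"': "quote", "'": "quote"}

def tokenize(text):
    """Split text into a flat list of {type, value} tokens in one linear pass."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in PUNCT:
            tokens.append({"type": PUNCT[ch], "value": ch})
            i += 1
            continue
        # Extend a run of whitespace, or a run of non-whitespace "word" characters.
        is_space = ch.isspace()
        j = i + 1
        while j < n and text[j] not in PUNCT and text[j].isspace() == is_space:
            j += 1
        tokens.append({"type": "space" if is_space else "word", "value": text[i:j]})
        i = j
    return tokens

tokens = tokenize('<div class="container">')
for tok in tokens:
    print(tok)
```

Because every character lands in exactly one token, joining the token values reproduces the input, which makes the tokenizer easy to test on its own.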


Tree builder

State machine over the token stream. States:

State                 Description
outside               Default. Tokens are text or < starting a tag.
tag-name              Just after <. Next word is the tag name.
attributes            After tag name, inside the tag. Tokens are attribute names, =, values.
attr-value-quoted     After = followed by a quote (" or '). Collect tokens as the value until the matching quote.
attr-value-unquoted   After = with no quote. Collect tokens as the value until whitespace or >.
comment               After <!--. Discard tokens until -->.
cdata                 After <![CDATA[. Collect as raw text until ]]>.
doctype               After <!DOCTYPE. Collect until >.
pi                    After <?. Collect until ?>.
script-content        Inside <script> / <style>. Treat all tokens (including <) as text until </script> / </style>.
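
The three attribute states in the table can be sketched as a small loop over the tokens that sit between the tag name and the closing >. This is a Python illustration under assumptions: the function name, the use of ValueError in place of the parser's flag, and storing valueless attributes as True are all choices made for the example.

```python
def tok(kind, value):
    """Helper for building example tokens by hand."""
    return {"type": kind, "value": value}

def parse_attributes(tokens):
    """Return an attrs dict from the tokens between the tag name and '>'."""
    attrs = {}
    state = "attributes"
    name = None           # attribute name seen but not yet stored
    expect_value = False  # True once '=' has been seen for `name`
    open_quote = None
    parts = []

    def store(value):
        nonlocal name, expect_value, parts
        attrs[name] = value
        name, expect_value, parts = None, False, []

    for t in tokens:
        kind, value = t["type"], t["value"]
        if state == "attr-value-quoted":
            if kind == "quote" and value == open_quote:
                store("".join(parts))
                state = "attributes"
            else:
                parts.append(value)      # anything else is part of the value
        elif state == "attr-value-unquoted":
            if kind == "space":
                store("".join(parts))
                state = "attributes"
            elif kind == "word":
                parts.append(value)
            else:
                raise ValueError(f"unexpected {value!r} in unquoted value")
        elif expect_value:
            if kind == "quote":
                open_quote, state = value, "attr-value-quoted"
            elif kind == "word":
                parts, state = [value], "attr-value-unquoted"
            elif kind != "space":
                raise ValueError("expected a value after '='")
        elif kind == "word":
            if name is not None:
                store(True)              # previous attribute was valueless
            name = value
        elif kind == "eq":
            if name is None:
                raise ValueError("'=' without an attribute name")
            expect_value = True
        elif kind != "space":
            raise ValueError(f"unexpected {value!r} between attributes")

    if state == "attr-value-quoted":
        raise ValueError("unterminated quoted value")
    if state == "attr-value-unquoted":
        store("".join(parts))
    elif expect_value:
        raise ValueError("'=' with no value")
    elif name is not None:
        store(True)
    return attrs

# Tokens for:  class="a b" id=main disabled
example = [
    tok("word", "class"), tok("eq", "="), tok("quote", '"'), tok("word", "a"),
    tok("space", " "), tok("word", "b"), tok("quote", '"'), tok("space", " "),
    tok("word", "id"), tok("eq", "="), tok("word", "main"), tok("space", " "),
    tok("word", "disabled"),
]
attrs = parse_attributes(example)
print(attrs)
```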

The state machine maintains a stack of open elements. On <tag-open><word>tagname:

  1. Look up tagname in the schema.
  2. If tagname is void, create the element and don't push it (no closing tag expected).
  3. If tagname is non-void, create the element, push it on the stack.

On </tagname>:

  1. Search the stack from the top for a matching open tag.
  2. If one is found, pop the stack down to and including that element.
  3. If no matching open tag is on the stack: malformed input; raise a flag (specific class TBD).

On reaching EOF with a non-empty stack: malformed input; raise a flag (unclosed tags).
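
The open-element stack rules above can be sketched as follows. This is Python for illustration only, working on simplified ("open" | "close" | "text", value) events rather than raw tokens. The VOID set stands in for a schema lookup, and MalformedInput stands in for the flag whose class is still TBD; both names are hypothetical, as is the node shape.

```python
VOID = {"br", "img", "hr", "input", "meta", "link"}  # stand-in for the schema

class MalformedInput(Exception):
    """Stand-in for the parser's flag; the real flag classes are TBD."""

def build_tree(events):
    """Build a nested dict tree from (kind, value) events using an open-element stack."""
    root = {"tag": "#root", "children": []}
    stack = [root]
    for kind, value in events:
        parent = stack[-1]
        if kind == "text":
            parent["children"].append({"type": "text", "value": value})
        elif kind == "open":
            node = {"tag": value, "children": []}
            parent["children"].append(node)
            if value not in VOID:          # void elements are never pushed
                stack.append(node)
        elif kind == "close":
            open_tags = [n["tag"] for n in stack[1:]]
            if value not in open_tags:
                raise MalformedInput(f"</{value}> has no matching open tag")
            # Pop down to and including the nearest matching element.
            while stack.pop()["tag"] != value:
                pass
    if len(stack) > 1:
        unclosed = ", ".join(n["tag"] for n in stack[1:])
        raise MalformedInput(f"unclosed tags at EOF: {unclosed}")
    return root

tree = build_tree([("open", "div"), ("text", "hi"), ("open", "br"), ("close", "div")])
print(tree)
```

Note that popping past intermediate elements closes them silently; a stricter variant would raise unless the schema says those elements may be closed implicitly.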

Schema-driven behavior

All tag knowledge comes from the schema config. The parser queries the schema for:

For HTML5, this schema is Uma's html5.json. For other markup languages, a different config file describes the same kinds of facts about a different tag set.

The implicit-close rules are the trickiest part. Common ones for HTML5:

These belong in the schema, not the parser code — the parser's job is "read schema, follow its rules," not "know HTML."
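
To make "rules as data" concrete, here is one way implicit-close rules could be expressed and applied, sketched in Python. The table format and the function name are assumptions for illustration. The handful of rules shown reflect end tags the HTML standard allows authors to omit (for example, an open <li> is closed by the next <li>), and the table is deliberately incomplete.

```python
# Hypothetical rule table: open tag -> start tags that implicitly close it.
CLOSED_BY = {
    "li":     {"li"},
    "dt":     {"dt", "dd"},
    "dd":     {"dt", "dd"},
    "p":      {"p", "div", "ul", "ol", "table"},
    "option": {"option", "optgroup"},
}

def open_tag(stack, tag):
    """Pop any elements the new tag implicitly closes, then push the new tag."""
    while stack and tag in CLOSED_BY.get(stack[-1], ()):
        stack.pop()
    stack.append(tag)
    return stack

# <ul><li>a<li>b : the second <li> closes the first instead of nesting in it.
stack = ["ul", "li"]
open_tag(stack, "li")
print(stack)
```

The engine code never names a tag; changing which tags close which is purely an edit to the table.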


Output: the element tree

Each element node carries:

Text nodes are represented as simple hashes:

{type: 'text', value: 'Hello world'}

Or possibly as bare strings in the children array — TBD on which is cleaner.

Comments (if preserved at all) and DOCTYPE / PI nodes get their own type tags if we keep them in the tree.

Open: comment preservation. Some Uma use cases (round-trip edit-and-serialize) want comments preserved. Others (DOM querying) treat them as noise. Probably preserve by default with an option to strip; or always preserve and let the serializer's tidy step strip them.


Schemas other than HTML5

Because all tag knowledge lives in the schema config, the same parser engine can parse any markup language that fits the angle-bracket-and-attribute syntax model. Possible use cases:

The parser engine doesn't care. Provide a schema that lists the tags, their voidness, their nesting rules, their opaque-content flags — and the parser produces a tree following those rules.

A DSL for defining schemas (open)

Writing schema configs by hand in JSON is tolerable but verbose. A DSL for defining these schemas — not for writing documents in the resulting language — could make it much easier to declare custom markup formats.

Rough sketch (purely speculative):

language 'recipe' do
    tag 'recipe' do
        contains 'metadata', 'ingredient', 'step'
    end
    tag 'ingredient', void: true
    tag 'step' do
        contains_text
    end
    tag 'temperature', void: true, attrs: ['unit']
end

The DSL emits a normalized schema config (the same shape Uma's html5.json takes), which the parser then consumes. Schemas can inherit from each other (HTML5 with extras, restricted-HTML with subtractions, etc.).

This is a separate piece of work from the parser itself — the parser only needs the resulting schema config. But it's worth keeping in mind that the schema format should be ergonomic to emit from a DSL, not just hand-written.

Filed as a future direction. Not in v1; the parser only needs to consume schemas, regardless of how they're written.


Error handling

Malformed input raises a flag rather than attempting recovery. Specific flag classes (all under puck.uno/uma/error/):

Open: should the parser accept a lenient: true mode that tries to recover from minor issues (e.g., unbalanced tags by auto-closing)? Probably not in v1 — the explicit failure is the value. If real-world HTML needs to be parsed, that's the Lua-library use case.


Performance

Rough budget:

Not the fastest HTML parser ever written. Suitable for "developer-controlled HTML" use cases (server-side templates, serialized output round-trips, builder validation). Not suitable for parsing megabytes of scraped wild-world HTML.


Open questions


Why a hand-rolled parser is worth it

If we bundle gumbo: ~150–200k of native code, well-tested, correct on real-world HTML. Pros: zero maintenance burden, handles anything. Cons: significant binary-size hit; outside our core budget; uses C; depends on Google's release cadence.

If we hand-roll: ~30–50k of Caspian source, less correct on weird HTML, but tractable to maintain and entirely inside our own ecosystem. Pros: no native dep, fits the budget, can evolve with our schema. Cons: never matches a browser exactly; rejects HTML that browsers would accept.

The trade lands well for Uma's intended use cases: developer-controlled HTML (templates, builders, internal tools). For wild-world parsing, expect to install a Lua-library adapter outside core.


Next steps


© 2026 Puck.uno