Idea: String Provenance

vibecode
{"vibecode": {
    "doc": "string-provenance",
    "role": "speculative high-security feature: every string carries its full construction history via base strings and query strings, so operations track full provenance back to original sources",
    "key_concepts": ["string_provenance", "base_string_vs_query_string",
        "construction_history", "opt_in_high_security"],
    "status": "brainstorm"
}}

Not implemented. Probably too expensive and complicated for the current implementation, but worth recording so future decisions can build on it. A reasonable direction is for this to eventually be an opt-in feature for a higher level of security, not the default behavior of the runtime.

The Idea

Every string carries its complete construction history — not just "this string exists and is tainted," but the full record of every operation and every original source that went into building it. You wouldn't just know what a string is; you would know every step that produced it and where those original strings came from.

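The mechanism is two kinds of string: base strings, which hold the original characters, and query strings, which describe an operation to run against another string.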

The developer does not interact with base strings directly. From the developer's point of view, base strings and query strings have the same methods and behave the same way. The distinction is internal.

$string = "Falstaff"    # creates a base string; $string is a query that returns it

$foo = $string[0, 2]    # returns a query string that asks for the first three
                        # characters of $string's base

$bar = $foo.upper_case  # returns a query that asks $foo for its result, then
                        # uppercases it. $foo in turn queries $string.

When $bar's value is actually needed — printed, compared, passed to a sink — the engine materializes the query: walks back through the query chain to the base string(s), runs the operations, and produces the result. Until then, no characters have been computed.
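The lazy chain described above can be sketched in Python. This is only an illustration, not the runtime's design: the names `BaseString`, `QueryString`, `slice_of`, `upper_case`, and the `source` label are all hypothetical, and the slice uses Python's half-open `[0:3]` to get the first three characters.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class BaseString:
    """Holds the original characters plus a label for where they came from."""
    chars: str
    source: str  # hypothetical provenance label, e.g. "literal"

    def materialize(self) -> str:
        return self.chars

    def history(self) -> list[str]:
        return [f"base({self.source!r})"]


@dataclass(frozen=True)
class QueryString:
    """Records one operation over a parent string; computes nothing until asked."""
    parent: Union[BaseString, "QueryString"]
    op_name: str
    op: Callable[[str], str]

    def materialize(self) -> str:
        # Walk back toward the base, then apply this operation on the way out.
        return self.op(self.parent.materialize())

    def history(self) -> list[str]:
        return self.parent.history() + [self.op_name]


def slice_of(s, start: int, stop: int) -> QueryString:
    return QueryString(s, f"slice[{start}:{stop}]", lambda t: t[start:stop])


def upper_case(s) -> QueryString:
    return QueryString(s, "upper_case", str.upper)


string = BaseString("Falstaff", source="literal")
foo = slice_of(string, 0, 3)   # no characters computed yet
bar = upper_case(foo)          # still just a chain of queries

print(bar.materialize())  # FAL
print(bar.history())      # ["base('literal')", 'slice[0:3]', 'upper_case']
```

Nothing is computed until `materialize()` is called, and `history()` returns the construction record without materializing anything.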

Why This Would Be Powerful

The coarse model of one role-tag per value (see roles.md) collapses construction history down to a single owning role per string. That's enough for "should this sink refuse to act on this string," but it loses information that could be useful in several places:

Why It's Deferred

The cost is significant, and most code doesn't need the richness:

File caching as a partial mitigation

The memory pressure on base strings could be mitigated by spilling cold base strings to disk. Query strings still reference them, but the materialization step might pay an I/O cost to read the base back. This is clunky: it adds file management, I/O latency, and complexity to what is supposed to be a language-level feature. It would help, but it would not be elegant.
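A rough Python sketch of the spill-and-reload idea, to make the extra file management concrete. The class `SpillableBase` and its methods are hypothetical. A real engine would also need a policy for deciding which bases are cold, which is not modeled here.

```python
import os
import tempfile


class SpillableBase:
    """A base string whose characters can be moved to a file and read back on demand."""

    def __init__(self, chars: str):
        self._chars = chars
        self._path = None

    @property
    def in_memory(self) -> bool:
        return self._chars is not None

    def spill(self) -> None:
        """Write the characters to a temp file and drop the in-memory copy."""
        if self._chars is None:
            return  # already spilled
        fd, self._path = tempfile.mkstemp(suffix=".base")
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as f:
            f.write(self._chars)
        self._chars = None

    def materialize(self) -> str:
        """Return the characters, paying an I/O cost if they were spilled."""
        if self._chars is None:
            with open(self._path, encoding="utf-8", newline="") as f:
                self._chars = f.read()
            os.remove(self._path)
            self._path = None
        return self._chars


base = SpillableBase("Falstaff")
base.spill()                    # cold: characters now live on disk
spilled = not base.in_memory
restored = base.materialize()   # query strings would trigger this read
```

Even this small version has to handle temp-file creation, cleanup, and encoding, which is the clunkiness the paragraph above describes.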

Other possible mitigations exist (eager materialization when a base would otherwise be freed; weak references with snapshot fallback; etc.), but each trades away some of the provenance benefit. The fundamental tension is real: keeping full provenance means keeping the inputs, and inputs accumulate.

Things to Think About When Revisiting

Relation to the Current Trust Model

The coarse tracking (one trust tag per value) is the practical floor. Full provenance is a strict superset: if you have the query chain back to base strings, you can derive the coarse tag at any time (intersection of all base trust levels). When the coarse tag is sufficient, provenance richness is overkill.
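Deriving the coarse tag from a query chain might look like the Python sketch below. It assumes, purely for illustration, that each base string carries a set of roles that trust it (`trusted_by`) and that the coarse tag is the intersection of those sets. `Base`, `Query`, `bases_of`, and `coarse_tag` are hypothetical names, not part of the current trust model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Base:
    chars: str
    trusted_by: frozenset  # hypothetical: roles that trust this source


@dataclass(frozen=True)
class Query:
    op_name: str
    parents: tuple  # Base or Query nodes this operation reads from


def bases_of(node) -> list:
    """Walk the query chain back to every base string it depends on."""
    if isinstance(node, Base):
        return [node]
    found = []
    for parent in node.parents:
        found.extend(bases_of(parent))
    return found


def coarse_tag(node) -> frozenset:
    """Collapse full provenance to one tag: the intersection of all base trust sets."""
    sets = [b.trusted_by for b in bases_of(node)]
    return frozenset.intersection(*sets)


literal = Base("Hello, ", frozenset({"app", "db"}))
user_input = Base("Falstaff", frozenset({"app"}))

greeting = Query("concat", (literal, user_input))
shouted = Query("upper_case", (greeting,))

print(sorted(coarse_tag(shouted)))  # ['app']
```

The derivation only goes one way: the chain yields the tag, but the tag cannot reconstruct the chain.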

For an eventual implementation, the cleanest separation is probably:

The developer-facing surface stays the same in both modes. The engine config is what determines which representation is used at runtime.

© 2026 Puck.uno