Example: a web spider

vibecode
{"vibecode": {
    "doc": "obrien_spider_example",
    "role": "worked example demonstrating O'Brien through a simple web spider — each URL to crawl is a job, the worker fetches and parses the page, reports back extracted data plus newly-discovered URLs to enqueue. Shows the core O'Brien patterns concretely: per-job fork isolation, manager-as-single-writer, Xeme as the result format, the two verification modes, and how a worker requests further enqueues without touching Mikobase directly.",
    "status": "illustrative example — API specifics are sketched, not committed",
    "parent_doc": "index.md (the O'Brien design)",
    "sibling_docs": ["../study.md (broader survey of job-queue patterns)"]
}}

A worked example showing how O'Brien could be used to index web pages. The spider starts from seed URLs, fetches each page, extracts links and data to index, and enqueues newly discovered URLs for further crawling. Each fetch is one job. Each worker is one fork.

This example is illustrative — exact method signatures are sketched, not committed. The point is to make O'Brien's mechanics concrete through a real use case.


What the spider does

For each URL in its queue:

  1. Fetch the page over HTTP.
  2. Parse the HTML to extract the title, the visible text, and the outbound links.
  3. Index the page (write a spider/indexed_page record to Mikobase).
  4. Enqueue any newly discovered URLs as follow-up jobs.
  5. Report back to the O'Brien manager: success with extracted data, or failure with a structured error.

When the queue drains, the spider is done. Failed pages are recorded as failed Mikobase rows with structured errors[] explaining why.
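
Step 2 above (parsing) can be sketched in Python using only the standard library. This is a rough model, not the parser the Caspian example loads: `PageExtractor` and `extract` are hypothetical names, and the sketch collects only the title and outbound links, leaving visible-text extraction out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collects the <title> text and absolute outbound links from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(base_url, html):
    """Hypothetical helper: parse one page and return its title and links."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


page = extract(
    "https://example.com/docs/",
    "<html><head><title> Docs </title></head>"
    "<body><a href='intro.html'>Intro</a> <a href='/about'>About</a></body></html>",
)
print(page)
```

Resolving links to absolute URLs before reporting them keeps the follow-up jobs self-contained: each one carries everything needed to fetch its page.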


The job function

A job is just a function the worker calls. No class is needed when the job is a single operation that takes some args and writes a Xeme to STDOUT, and most spider-style jobs have that shape.

caspian
%vibecode
    role: 'one crawl job — fetch a URL, parse it, return what was found plus URLs to enqueue next';
    notes: 'returned Xeme uses producer-namespaced `enqueue` field to ask the manager to enqueue follow-up jobs';
end

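$execute = function($url)
    # Fetch the page; $response carries the status, body, and timing
    $response = %net.fetch($url)

    if $response.ok?
        # Parse the body and collect outbound links as follow-up URLs
        $doc = %['https://puck.uno/html/parser/links']($response.body)
        $followups = $doc.links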

        puts {
            success: true,

            meta: {
                name:     $url,
                run_time: $response.run_time
            },

            title:   $doc.title,
            enqueue: $followups
        }
    else
        # Non-OK response: report a structured failure instead
        puts {
            success: false,
            meta:    {name: $url},
            errors: [
                {
                    class:  'spider/error/http_status',
                    status: $response.status
                }
            ]
        }
    end
end

Notes:


Starting the spider

Seed the queue and start the manager:

caspian
%vibecode
    role: 'driver script: seed URLs, start the manager with concurrency cap';
end

# Instantiate the O'Brien manager, telling it which function to run per job
$obrien = %puck['https://puck.uno/obrien'].new(
    job: $execute
)

# Enqueue seed URLs
['https://example.com/', 'https://example.org/'].each do($url)
    $obrien.enqueue($url)
end

# Start the worker pool; blocks until the queue drains
$obrien.run(concurrency: 10)

concurrency: 10 means at most 10 worker forks run at once. New workers spawn as old ones complete and free up slots. The total number of jobs processed is bounded only by the size of the crawled space (and whatever termination the caller adds — depth limit, max-pages cap, or seen-set deduplication).
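
The three termination options named above (depth limit, max-pages cap, seen-set deduplication) could be combined in one small gatekeeper that decides whether a discovered URL is worth enqueueing. The Python sketch below is a model under assumptions: `CrawlFrontier` and `admit` are hypothetical names, nothing like them is stated to exist in O'Brien, and where such a check would live (driver, job function, or a wrapper around enqueue) is left open.

```python
from urllib.parse import urldefrag


class CrawlFrontier:
    """Decides which discovered URLs are worth enqueueing (hypothetical helper)."""

    def __init__(self, max_pages, max_depth):
        self.max_pages = max_pages
        self.max_depth = max_depth
        self.seen = set()

    def admit(self, url, depth):
        """Return True if the URL should be enqueued, recording it as seen."""
        # Drop the #fragment so the same page is not crawled twice.
        key, _fragment = urldefrag(url)
        if depth > self.max_depth:
            return False
        if key in self.seen:
            return False
        if len(self.seen) >= self.max_pages:
            return False
        self.seen.add(key)
        return True


frontier = CrawlFrontier(max_pages=3, max_depth=1)
decisions = [
    frontier.admit("https://example.com/", 0),      # first sighting
    frontier.admit("https://example.com/#top", 1),  # same page, different fragment
    frontier.admit("https://example.com/a", 1),
    frontier.admit("https://example.com/a/b", 2),   # too deep
    frontier.admit("https://example.com/c", 1),
    frontier.admit("https://example.com/d", 1),     # page cap reached
]
print(decisions)
```

Any one of the three checks is enough to make a crawl finite; combining them just bounds it in more than one dimension.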


What happens at runtime

Walking through one URL's lifecycle:

  1. Enqueue. Driver (or a follow-up emitted by a previous run) calls $obrien.enqueue($url). Manager writes a row to Mikobase: the job's function reference, the args ($url), status: pending.

  2. Claim. Manager's main loop picks up the pending row, marks it claimed, and forks a worker. The fork is tracked by the engine; the manager holds the fork handle.

  3. Run. Inside the fork, the worker invokes the registered function with the row's args. The fork has whatever role-restricted capabilities the spider was granted — typically network access (for %net.fetch) but no Mikobase write capability.

  4. Report. The function writes its Xeme to STDOUT via puts, then the worker process exits. The manager captures the worker's STDOUT.

  5. Reap. Worker exits cleanly (exit 0). Manager calls $mgr.wait, gets the exit status, and parses the trailing JSON hash of the captured STDOUT to recover the Xeme. Anything earlier on STDOUT (debug output, progress logs, sub-process output) gets stuffed into the Xeme's io.stdout field for forensics.

  6. Record. Manager writes the Xeme to the job's Mikobase row. Status moves from claimed to done (or failed if the Xeme says so).

  7. Follow up. Manager inspects the Xeme for the producer-namespaced enqueue[] field. For each URL in it, the manager writes a new pending row to Mikobase. Those rows get claimed by future iterations of the same loop.

  8. Next. Manager dispatches the next pending job, spawns a replacement worker. As long as there are pending jobs and free concurrency slots, the loop keeps going.
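
The reap step (step 5) depends on separating the trailing JSON hash from whatever the worker printed before it. One way to do that is sketched below in Python; `split_trailing_xeme` is a hypothetical name, and the scanning strategy (try each `{` from the left and accept the first object that runs to the end of the output) is an assumption, not O'Brien's documented behavior.

```python
import json


def split_trailing_xeme(stdout_text):
    """Split captured STDOUT into (earlier_output, xeme_dict_or_None)."""
    text = stdout_text.rstrip()
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value, end = None, -1
        # Accept only an object that runs to the very end of the output.
        if isinstance(value, dict) and end == len(text):
            return text[:start], value
        start = text.find("{", start + 1)
    return stdout_text, None


captured = (
    "fetching https://example.com/\n"
    "parsed {3} links\n"
    '{"success": true, "meta": {"name": "https://example.com/"}, "enqueue": []}\n'
)
earlier, xeme = split_trailing_xeme(captured)
# Keep the earlier output for forensics, as step 5 describes.
xeme.setdefault("io", {})["stdout"] = earlier
print(xeme)
```

Scanning from the left means stray braces in log lines are skipped rather than mistaken for the result, at the cost of a few failed decode attempts.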

When the queue empties (no pending rows, no claimed rows), $obrien.run returns.
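
The lifecycle above can be modeled in a few lines of Python to show the manager-as-single-writer shape. This is a toy under several assumptions: a dict stands in for Mikobase, a plain function call stands in for fork, run, report, and reap, jobs are claimed in batches rather than slot by slot, and already-seen URLs are skipped only so that the toy link graph terminates. `run_queue`, `fake_job`, and `LINKS` are hypothetical names.

```python
from collections import deque


def run_queue(seeds, job, concurrency):
    """Single-writer loop: only this function mutates `rows` (the stand-in store)."""
    rows = {}            # url -> {"status": ..., "xeme": ...}
    pending = deque()

    def enqueue(url):
        if url not in rows:          # skip URLs that already have a row
            rows[url] = {"status": "pending", "xeme": None}
            pending.append(url)

    for url in seeds:
        enqueue(url)

    while pending:
        # Claim up to `concurrency` jobs; a real manager would fork one worker each.
        batch = [pending.popleft() for _ in range(min(concurrency, len(pending)))]
        for url in batch:
            rows[url]["status"] = "claimed"
        for url in batch:
            xeme = job(url)          # stand-in for run + report + reap
            rows[url]["xeme"] = xeme
            rows[url]["status"] = "done" if xeme.get("success") else "failed"
            # Follow-ups are requested by the job but written by the manager.
            for follow_up in xeme.get("enqueue", []):
                enqueue(follow_up)
    return rows


# Toy link graph standing in for the web.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/"],
    "https://example.com/b": [],
}


def fake_job(url):
    if url not in LINKS:
        return {"success": False, "meta": {"name": url},
                "errors": [{"class": "spider/error/http_status", "status": 404}]}
    return {"success": True, "meta": {"name": url}, "enqueue": LINKS[url]}


rows = run_queue(["https://example.com/", "https://example.com/missing"],
                 fake_job, concurrency=2)
for url, row in rows.items():
    print(row["status"], url)
```

The job never touches `rows`; it only returns a result, and the loop decides what to write. That is the property the single-writer design is meant to guarantee.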


Failure modes

The spider exercises several of O'Brien's failure paths concretely:

In every case, every job ends in a terminal Mikobase row. Nothing is ever stuck in flight.
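
The guarantee that every job ends in a terminal row can be pictured as a function that maps any reaped worker to either done or failed, writing a failure report on the worker's behalf when it produced none. The Python sketch below is speculative: `terminal_row` is a hypothetical name, the `obrien/error/worker_exit` and `obrien/error/no_xeme` classes are invented for illustration, and only `spider/error/timeout` appears elsewhere in this document (which does not say which component emits it).

```python
def terminal_row(exit_code, xeme, timed_out=False):
    """Map one reaped worker to a row that is always 'done' or 'failed'."""
    if timed_out:
        error = {"class": "spider/error/timeout"}
    elif exit_code != 0:
        # Hypothetical error class: the worker crashed or was killed.
        error = {"class": "obrien/error/worker_exit", "exit_code": exit_code}
    elif xeme is None:
        # Hypothetical error class: clean exit, but no parseable result.
        error = {"class": "obrien/error/no_xeme"}
    elif not xeme.get("success"):
        return {"status": "failed", "xeme": xeme}
    else:
        return {"status": "done", "xeme": xeme}
    # The worker never produced a usable report, so the manager writes one for it.
    return {"status": "failed", "xeme": {"success": False, "errors": [error]}}


outcomes = [
    terminal_row(0, {"success": True}),
    terminal_row(0, {"success": False, "errors": [{"class": "spider/error/http_status"}]}),
    terminal_row(1, None),
    terminal_row(0, None),
    terminal_row(None, None, timed_out=True),
]
print([o["status"] for o in outcomes])
```

Because every branch returns a row, there is no path on which a claimed job stays claimed.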


Inspecting results

Because O'Brien writes every job's Xeme to Mikobase, querying results is just a Mikobase query:

caspian
# All successfully crawled pages
$done = %mikobase.query({
    job:    $execute,
    status: 'done'
})

# All HTTP failures
$http_errs = %mikobase.query({
    job:            $execute,
    status:         'failed',
    'errors.class': 'spider/error/http_status'
})

# All pages that timed out
$timeouts = %mikobase.query({
    job:            $execute,
    status:         'failed',
    'errors.class': 'spider/error/timeout'
})

The structured Xeme failure taxonomy means classifying failures is just a query filter. No log parsing, no string-matching.
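
To make the filter semantics concrete, here is a Python model of matching rows against criteria with dotted keys such as 'errors.class'. It assumes that a list field matches when any of its elements matches, which is how the queries above read but is not specified in this document. `matches` and `values_at` are hypothetical names, and an in-memory list stands in for Mikobase.

```python
def values_at(node, path):
    """Return every value reachable from `node` by following `path`."""
    if not path:
        return [node]
    if isinstance(node, list):
        # A list matches if any element matches, so fan out over its items.
        return [v for item in node for v in values_at(item, path)]
    if isinstance(node, dict) and path[0] in node:
        return values_at(node[path[0]], path[1:])
    return []


def matches(row, criteria):
    """True if every criterion matches; dotted keys walk into nested data."""
    return all(
        expected in values_at(row, key.split("."))
        for key, expected in criteria.items()
    )


rows = [
    {"name": "https://example.com/", "status": "done"},
    {"name": "https://example.com/gone", "status": "failed",
     "errors": [{"class": "spider/error/http_status", "status": 404}]},
    {"name": "https://example.com/slow", "status": "failed",
     "errors": [{"class": "spider/error/timeout"}]},
]

http_errs = [
    r["name"] for r in rows
    if matches(r, {"status": "failed", "errors.class": "spider/error/http_status"})
]
print(http_errs)
```

Because the error class is data rather than message text, adding a new failure category needs no change to the query machinery.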


What this example shows about O'Brien

The spider exercises every load-bearing design choice:

A real spider would add features (robots.txt support, per-host rate-limiting, URL canonicalization, content-type sniffing, etc.) but those are spider-level concerns built ON TOP of O'Brien, not features O'Brien provides. O'Brien just runs jobs.

© 2026 Puck.uno