Design

vibecode
{"vibecode": {
    "doc": "obrien_design",
    "role": "the actual design for the Caspian job-queue manager we may implement — distinct from study.md, which surveys the broader space and how other languages handle this. This file is where committed-to-implementation design decisions land; study.md is the exploration.",
    "codename": "O'Brien",
    "status": "active design — being shaped collaboratively, not yet a committed spec",
    "sibling_docs": ["study.md (survey + exploration of how other languages do this)"],
    "settled_decisions": ["one_and_done_jobs",
        "one_and_done_workers",
        "mikobase_as_the_queue_store",
        "manager_is_single_writer_to_mikobase",
        "worker_reports_back_via_stdout_as_json",
        "stderr_is_for_everything_else",
        "xeme_is_the_job_result_format",
        "two_verification_modes_implicit_default_and_positive_assertion",
        "no_periodic_sweeper_needed_just_one_time_startup_sweep"]
}}

A potential design for a job-queue manager. This project is code-named O'Brien.

O'Brien manages a queue of jobs in a single Caspian process. The manager forks one worker per job (subject to a concurrency cap), waits for each worker to exit, records the outcome, and moves on. The guarantee: every job ends in a terminal state — either completed or marked as cannot-be-completed. No job is ever stuck in flight; no job is ever silently lost.

The design draws heavily on three Caspian primitives:

The sibling study.md surveys how other languages do this and discusses the broader design space. This file is the path we may actually build.


Core decisions

The following are settled for V1. Together they collapse a lot of the complexity that other job-queue systems carry.

Jobs are one-and-done

Each job is attempted exactly once. No retries, no backoff, no max-attempts cap. The worker either succeeds or fails; the outcome is recorded; the job is finished.

Callers that need retry semantics build them on top — by re-enqueuing on failure, by writing a job class whose body itself retries, etc. O'Brien doesn't decide retry policy; the caller does.

The job lifecycle is correspondingly small:

pending → claimed → done
                  → failed
                  → cannot_be_completed (in practice, same as failed)

Four states (pending, claimed, done, failed; cannot_be_completed is recorded as failed) and three transitions. Once a job leaves claimed, it's terminal.
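The lifecycle above is small enough to write down as a transition table. The following is a minimal Python sketch, not O'Brien's actual code: the names `ALLOWED_TRANSITIONS` and `advance` are hypothetical, and `cannot_be_completed` is folded into `failed`, as the diagram notes.

```python
# Hypothetical transition table for the job lifecycle described above.
# done and failed are terminal, so they allow no further transitions.
ALLOWED_TRANSITIONS = {
    "pending": {"claimed"},
    "claimed": {"done", "failed"},
    "done": set(),
    "failed": set(),
}


def advance(current: str, new: str) -> str:
    """Return the new status, or raise if the lifecycle forbids the move."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


status = advance("pending", "claimed")
status = advance(status, "done")
```

Because terminal states map to an empty set, any attempt to move a finished job (for example, a retry that rewrites the same record) raises immediately.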

Workers are one-and-done

Each worker is a fresh OS fork that runs exactly one job and exits. No worker reuse, no long-lived worker process polling for the next job, no worker dispatch loop.

This is the opposite of Sidekiq / Celery / RQ, which pool long-lived workers. The trade-off is a higher per-job fork cost (OS fork is fast but not free) in exchange for:

For jobs short enough that fork overhead dominates the work, O'Brien isn't the right tool (in-process work is). For jobs in the millisecond-to-minute range, the fork cost is noise.

The manager is the single writer to Mikobase

Only the manager process writes to job records. Workers report back to the parent via STDOUT (see Worker-to-parent IPC below); the manager translates that into Mikobase writes. Three things this buys:

Job outcomes are Xemes

Every job result is recorded as a Xeme. Xeme is the existing Caspian-standard JSON tree format for structured outcomes (test results, log entries, anything that needs a verdict plus structured reasons).

Concretely, a job record stores a Xeme that carries:

Bonus benefits of using Xeme:

Two verification modes for worker success

The worker can signal success in two ways. The default is implicit; the opt-in is positive-assertion.

Implicit mode (default). A clean exit (status 0) means success. The worker doesn't have to write anything special. Anything else (non-zero exit, abnormal exit, signal) means failure (or null, on abnormal exit).

Right for jobs where the work itself is its own verification — the email got sent, the file got moved, the database write committed. The act of completing IS the signal.

Positive-assertion mode (opt-in). A clean exit isn't enough; the worker must explicitly report success: true (write a Xeme verdict back to the parent). A clean exit with no report is treated as a failure.

Catches the "worker silently no-op'd" failure mode that implicit mode can't see: a bug where the worker caught an exception, swallowed it, ran to its end without doing the work, and exited 0. Useful when the work itself isn't externally observable, or when defensive verification is wanted.

The failure side is symmetric in both modes:

The modes only diverge when the worker exits 0 with no verdict reported:

Granularity of the mode setting (per-job, per-job-class, per-queue, global) is not yet decided — see Open questions.
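The two modes can be summarized as one decision function. This is a hedged Python sketch, not the O'Brien implementation: `job_outcome` and the mode strings are hypothetical names, and it assumes that a reported verdict is honored in both modes (a verdict without `success: true` counts as failure), which the text implies but does not spell out.

```python
from typing import Optional


def job_outcome(exit_code: Optional[int],
                verdict: Optional[dict],
                mode: str = "implicit") -> str:
    """Map a reaped worker to a terminal job status.

    exit_code is None when the worker ended abnormally (e.g. by a signal).
    verdict is the parsed report from the worker, or None if it sent none.
    """
    # Failure side, identical in both modes: anything but a clean exit fails.
    if exit_code != 0:
        return "failed"
    # Assumption: an explicit verdict wins in both modes.
    if verdict is not None:
        return "done" if verdict.get("success") is True else "failed"
    # The only point where the modes diverge: clean exit, no verdict.
    return "done" if mode == "implicit" else "failed"
```

For example, `job_outcome(0, None)` yields `"done"`, while `job_outcome(0, None, mode="positive_assertion")` yields `"failed"`.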

Fork lifecycle IS the lease

In other job systems, "worker crashed and abandoned its claim" is a real problem that requires leases, heartbeats, sweepers, etc. In O'Brien it's not, because the OS-level fork lifecycle gives the manager direct observation of worker liveness.
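To illustrate what "direct observation" means at the OS level, here is a small sketch using Python's POSIX `os.fork` and `os.waitpid` as a stand-in for Caspian's fork primitive (whose API is not shown in this document). The helper name `run_in_fork` is hypothetical. The parent learns how the child ended, whether by clean exit, non-zero exit, or signal, without any heartbeat.

```python
import os
import signal


def run_in_fork(job):
    """Fork, run job() in the child, and report how the child ended.

    Returns ("exit", code) or ("signal", signum). POSIX only.
    """
    pid = os.fork()
    if pid == 0:
        # Child: run the job, then leave without returning to the caller.
        code = 1
        try:
            job()
            code = 0
        finally:
            os._exit(code)
    # Parent: waitpid blocks until this specific child is gone.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return ("signal", os.WTERMSIG(status))
    return ("exit", os.WEXITSTATUS(status))


clean = run_in_fork(lambda: None)
crashed = run_in_fork(lambda: 1 / 0)
killed = run_in_fork(lambda: os.kill(os.getpid(), signal.SIGKILL))
```

In all three cases the parent gets a definite answer as soon as the child is gone, which is why no lease timeout is needed while the manager itself stays alive.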

The only sweeper-equivalent that remains is a one-time startup sweep to handle the case where the manager itself crashed and restarted: any rows still marked claimed from a previous run necessarily have no live worker (the manager process is fresh), so the manager marks them all failed and starts clean. One query and one update at boot; no ongoing loop.
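The startup sweep can be sketched as follows. Mikobase's API is not shown in this document, so the example uses an in-memory SQLite table as a loudly labeled stand-in; the `jobs` table, its columns, and the function name `startup_sweep` are all assumptions made for illustration.

```python
import sqlite3


def startup_sweep(conn: sqlite3.Connection) -> list:
    """Mark every job left in 'claimed' by a previous run as 'failed'.

    Returns the ids that were swept. Runs once at boot, never in a loop.
    """
    # The one query: rows a previous manager run left in flight.
    stale = [row[0] for row in conn.execute(
        "SELECT id FROM jobs WHERE status = 'claimed' ORDER BY id")]
    # The one update: no worker from the old run can still be alive.
    conn.execute("UPDATE jobs SET status = 'failed' WHERE status = 'claimed'")
    conn.commit()
    return stale


# Stand-in for the queue store (NOT Mikobase): a throwaway SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO jobs (id, status) VALUES (?, ?)",
    [(1, "pending"), (2, "claimed"), (3, "done"), (4, "claimed")])
swept = startup_sweep(conn)
```

Pending and already-terminal rows are untouched, so running the sweep a second time is a no-op.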


Architecture

The core loop, in shape:

  1. Manager queries Mikobase for pending jobs. It takes at most as many as it has room to fork: the configured concurrency cap N, less any workers still running.
  2. Manager spawns one worker fork per job, via %utils.forks.multiple(N) or %utils.forks.single() (sketched below — exact API TBD). Each job's record gets marked claimed.
  3. Each worker runs its job. On completion (or failure it caught), the worker writes its Xeme verdict as the final JSON hash on STDOUT and exits. Whatever else the worker wrote to STDOUT during execution (debug messages, progress, etc.) is preserved as forensic context.
  4. Manager reaps the worker via $mgr.wait and captures its STDOUT. Four outcome paths:
  5. Manager updates the job's Mikobase record. Status leaves claimed; goes to done or failed.
  6. Loop continues. Manager queries for the next pending job and spawns a replacement worker (as long as it's under the concurrency cap).
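The shape of this loop can be sketched in Python. Everything here is a stand-in under stated assumptions: `subprocess` replaces Caspian's fork primitive, a plain dict replaces Mikobase, the function name `run_queue` and the record layout are hypothetical, and success is judged in implicit mode (exit status 0). For brevity the sketch reaps the oldest running worker rather than whichever exits first.

```python
import subprocess
import sys


def run_queue(jobs: dict, cap: int) -> dict:
    """Drive every job to a terminal status, at most `cap` workers at once.

    jobs maps job id -> {"status": "pending", "code": <Python source>}.
    Only this function writes to the records (single-writer rule).
    """
    running = {}  # job id -> Popen handle for its worker
    while True:
        # Steps 1-2: claim pending jobs for however many slots are free.
        pending = [jid for jid, job in jobs.items()
                   if job["status"] == "pending"]
        for jid in pending[: cap - len(running)]:
            jobs[jid]["status"] = "claimed"
            running[jid] = subprocess.Popen(
                [sys.executable, "-c", jobs[jid]["code"]],
                stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
        if not running:
            break  # nothing pending, nothing in flight
        # Step 4: reap one worker and capture its output.
        jid, proc = next(iter(running.items()))
        out, err = proc.communicate()
        del running[jid]
        # Step 5: record the terminal status (implicit mode).
        jobs[jid]["status"] = "done" if proc.returncode == 0 else "failed"
        jobs[jid]["io"] = {"stdout": out, "stderr": err}
    return jobs


queue = {
    "a": {"status": "pending", "code": "print('hi')"},
    "b": {"status": "pending", "code": "import sys; sys.exit(3)"},
    "c": {"status": "pending", "code": "print('done')"},
}
run_queue(queue, cap=2)
```

With a cap of 2, job `c` is only spawned once `a` has been reaped, and every record ends in either `done` or `failed`.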

On startup, before entering the main loop:

On shutdown (SIGTERM, etc.):

(Graceful-shutdown details are subject to refinement.)

Worker-to-parent IPC: STDOUT

The worker reports its Xeme verdict to the parent by writing a JSON hash at the end of STDOUT and exiting. The manager captures the worker's full STDOUT and parses the trailing JSON hash as the Xeme.

The contract is "STDOUT must end with a valid JSON hash," not "STDOUT must contain only JSON." Workers are free to write whatever they want to STDOUT during execution — progress messages, debug output, sub-process output — as long as the very last thing they write is a JSON hash that the manager can parse as the verdict. Anything earlier on STDOUT is captured and stuffed into the verdict Xeme's io.stdout field for forensics. STDERR is also free for the worker to use; the Xeme's io.stderr field captures whatever lands there.
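One way the manager side of this contract might look is sketched below in Python. It is an illustration only: `split_stdout` is a hypothetical name, and the strategy (try each `{` from the right until the remaining suffix parses as a JSON object) is one assumed approach, not necessarily how O'Brien will locate the trailing hash.

```python
import json


def split_stdout(stdout: str):
    """Split captured STDOUT into (earlier_output, verdict).

    verdict is the trailing JSON hash as a dict, or None when STDOUT
    does not end with one (in which case the output is returned whole).
    """
    text = stdout.rstrip()
    if not text.endswith("}"):
        return stdout, None
    # Walk candidate opening braces from right to left; the first suffix
    # that parses as a JSON object is taken as the verdict.
    idx = text.rfind("{")
    while idx != -1:
        try:
            candidate = json.loads(text[idx:])
        except json.JSONDecodeError:
            candidate = None
        if isinstance(candidate, dict):
            return text[:idx], candidate
        idx = text.rfind("{", 0, idx)
    return stdout, None


captured = 'starting\nprogress 50%\n{"success": true, "details": {"n": 3}}\n'
earlier, verdict = split_stdout(captured)
```

Here `earlier` holds the progress lines (the material destined for the Xeme's io.stdout field) and `verdict` holds the parsed hash. A nested object such as `details` does not confuse the scan, because a suffix that starts at an inner brace leaves an unmatched closing brace and fails to parse.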

Concretely:

This is a deliberate O'Brien-specific choice — not necessarily the right answer for every fork-based parent-child IPC in Caspian. Other contexts (where streaming or richer message structure matters) might still prefer the shared-hash IPC primitive. For O'Brien, STDOUT-tail-JSON wins on simplicity:


Deliberate non-features

Things O'Brien explicitly does NOT do in V1:


What's NOT yet decided

These are the design questions still in flight:


See also


© 2026 Puck.uno