Job-queue manager design

vibecode
{"vibecode": {
    "doc": "job_queue_manager",
    "role": "design exploration for a Caspian job-queue manager — a process that maintains a queue of jobs and forks a child per job (capped concurrency), waits for each child to complete, harvests results, and guarantees that every job ends in either a completed state or a cannot-be-completed state. Surveys how this is done in other languages (Sidekiq, Celery, Resque, RQ, BullMQ, etc.) and proposes a Caspian-native implementation leveraging the existing fork primitives and Mikobase for persistence.",
    "status": "speculative idea report — not a commitment to build",
    "proposed_by": "Miko",
    "audience": "Caspian designers and implementers thinking about concurrency, queueing, and durable job processing",
    "key_concepts": ["job_queue_manager", "fork_per_job_worker_model",
        "capped_concurrency", "result_harvesting",
        "every_job_completed_or_marked_uncompletable",
        "mikobase_backed_durability",
        "caspian_fits_naturally_with_existing_fork_primitives"]
}}

A design exploration for a Caspian job-queue manager: first a survey of how the pattern is implemented in other languages, then a sketch of how it would land in Caspian.


The proposed design

A single Caspian process — call it the manager — holds a queue of jobs and dispatches them to child workers. The shape:

This is a classic worker-pool / job-queue pattern. The interesting parts in Caspian are (1) leaning on the fork-per-job model that Caspian's primitives already favor, and (2) figuring out what makes the "completed or marked uncompletable" guarantee robust in the face of crashes.


How it's done in other languages

Job-queue managers are a well-trodden problem; almost every modern language has at least one canonical library. The differences are in the queue store, the worker model, and the operational story.

Python: Celery, RQ, Dramatiq, Huey

The pattern across all of them: workers are separate OS processes, jobs are serialized to a broker, retries are handled at the framework level, monitoring is dashboard-based.

Ruby: Sidekiq, Resque

Node.js: BullMQ, Bee-Queue, Agenda

Elixir: Oban, Exq

Erlang/OTP: built into the platform

OTP's gen_server and supervision trees cover much of what a job-queue framework provides: every process is supervised, restart policies are declarative, and the "let it crash" philosophy means individual job failures are handled by the supervisor restarting the worker. No separate library is needed for the basics.

Go: asynq, machinery, gocraft/work

Java: Quartz, Spring Batch, JMS-based

Quartz handles scheduling-focused workloads; Spring Batch tackles large-scale ETL; for general task queues most Java shops put a message broker behind the JMS API (ActiveMQ) or talk AMQP to RabbitMQ. The ecosystem is more fragmented than most.

Managed services

AWS SQS + Lambda, Google Cloud Tasks, Azure Service Bus — each provides queue + worker infrastructure as a service. Often the right answer for cloud-native shops that don't want to operate the queue themselves.


Common patterns across the libraries

A few patterns are essentially universal:

The storage choice is often the biggest architectural decision: Redis-backed (Sidekiq, RQ, BullMQ, asynq) is fast and operationally cheap if Redis is already there; Postgres-backed (Oban) is durable and inherits the application's transactional story; in-memory (some lightweight options) loses everything on restart but is simplest.


How Caspian could implement this

Caspian's design — single-threaded with explicit forks, Mikobase available as a structured store, existing fork primitives that already support N-capped pools — makes a fork-per-job manager a natural fit. Most of the pieces already exist; the work is in the orchestration layer on top.

What's already there

Proposed architecture

The job-queue manager is itself a Caspian object (a class instance, or a top-level script) that:

  1. Maintains a queue. Jobs sit as Mikobase records of a puck.uno/job (or similar) class. The bucket carries the payload; the stack carries the job class. A status field tracks the lifecycle (pending / claimed / running / done / cannot_be_completed), and created-at and updated-at timestamps support monitoring.
  2. Spawns a pool of workers. A %utils.forks.multiple(N) do($fork) block starts N worker forks. Each worker loops: claim a pending job from Mikobase, run it, write the result back, repeat. The N cap enforces the concurrency limit naturally — only N jobs run at any moment.
  3. Claim with leases. A worker claims a job atomically, in a Mikobase transaction or an equivalent single-step update: "find one pending job, mark it claimed by my fork ID with a deadline." If the worker dies, the lease expires; a separate sweep finds expired claims and reverts them to pending.
  4. Harvest results. When a worker finishes a job successfully, it updates the Mikobase record (status = done, result = ...). On failure, it records the error and either schedules a retry or marks the job permanently failed if max-attempts is exceeded.
  5. Lease-sweeper. A separate small loop (in the manager, or as its own fork) periodically queries Mikobase for jobs whose lease has expired and reverts them to pending. This is what catches worker-fork crashes.
  6. Permanent failure tracking. Jobs that have exhausted retries get status = cannot_be_completed with the captured error info. They stay in Mikobase as a record but are not re-tried.

Sketch:

caspian
class
    function &start(concurrency: 4)
        # Spawn N worker forks; each runs its own dispatch loop
        @workers = %utils.forks.multiple($concurrency) do($fork)
            %self.&worker_loop($fork.index)
        end

        # Spawn a lease-sweeper to recover crashed jobs
        @sweeper = %utils.forks.single() do
            %self.&sweep_loop
        end
    end

    function &worker_loop($worker_id)
        while true
            $job = %self.&claim_one_job($worker_id)

            if $job.is_null
                %utils.sleep 0.5    # nothing pending; back off briefly
            else
                %self.&run_job $job
            end
        end
    end

    function &run_job($job)
        try
            $result = $job.execute
            $job.mark_done(result: $result)
        catch $err
            if $job.attempts >= @max_attempts
                $job.mark_cannot_be_completed(error: $err)
            else
                $job.mark_for_retry(error: $err)
            end
        end
    end

    function &sweep_loop()
        while true
            $expired = %mikobase.query({class: 'puck.uno/job',
                status: 'claimed', lease_expires_before: %utils.now})
            $expired.each do($job)
                $job.revert_to_pending
            end
            %utils.sleep 5
        end
    end
end

Sketch only — the exact API of $job.mark_done, %mikobase.query, etc. needs the corresponding specs to firm up. But the shape is straightforward, and every primitive used already exists or is on the design map.
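
The claim and sweep steps (items 3 and 5 of the proposed architecture) can also be illustrated outside Caspian. The following is a minimal Python sketch, not Mikobase's API: it assumes SQLite as a stand-in store, and the schema, the function names (open_store, enqueue, claim_one_job, sweep_expired), and the 30-second default lease are all hypothetical. It also folds "claimed" and "running" into a single status and counts an attempt at claim time, which is one possible choice rather than something the design above fixes.

```python
import sqlite3

def open_store():
    """In-memory stand-in for the job store (hypothetical schema)."""
    # isolation_level=None puts sqlite3 in autocommit mode, so the explicit
    # BEGIN/COMMIT statements below control the transaction boundaries.
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.execute("""CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending',
        claimed_by TEXT,
        lease_expires REAL,
        attempts INTEGER NOT NULL DEFAULT 0)""")
    return db

def enqueue(db, payload):
    """Add a pending job and return its id."""
    return db.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,)).lastrowid

def claim_one_job(db, worker_id, now, lease_seconds=30.0):
    """Atomically claim the oldest pending job; return its id, or None."""
    db.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = db.execute(
            "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            db.execute("COMMIT")
            return None
        # Counting the attempt here means a worker crash still uses one up.
        db.execute(
            "UPDATE jobs SET status = 'claimed', claimed_by = ?, "
            "lease_expires = ?, attempts = attempts + 1 WHERE id = ?",
            (worker_id, now + lease_seconds, row[0]))
        db.execute("COMMIT")
        return row[0]
    except Exception:
        db.execute("ROLLBACK")
        raise

def sweep_expired(db, now):
    """Revert claims whose lease has passed; return how many were reverted."""
    cur = db.execute(
        "UPDATE jobs SET status = 'pending', claimed_by = NULL, "
        "lease_expires = NULL WHERE status = 'claimed' AND lease_expires < ?",
        (now,))
    return cur.rowcount
```

Passing the current time in as an argument keeps the sketch deterministic to test; a real sweeper would read the clock itself, as the Caspian sweep_loop does.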

Guaranteeing "completed or cannot-be-completed"

Miko's specific requirement — every job ends in one of two terminal states — falls out of three pieces working together:

  1. Worker tries to complete. On success, write status = done. On caught error, either retry or cannot_be_completed. As long as the worker runs to completion of either branch, the guarantee holds for "expected" failures.
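  2. Lease sweeper catches unexpected failures. A worker crash (segfault, SIGKILL, host reboot) leaves a job in claimed status with a lease that eventually expires. The sweeper sees the expired lease and reverts the job to pending, where another worker picks it up. For a job that repeatedly crashes workers to reach cannot_be_completed, each crashed attempt has to count against the retry limit: the attempt counter must be incremented when the job is claimed (or by the sweeper when it reverts the claim), and the limit checked there as well, because a crashed worker never reaches its own error handler.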
  3. Mikobase persistence guarantees state survives. Even if the entire manager process dies and restarts, jobs in flight are still in Mikobase. The new manager's lease-sweeper finds the expired claims and recovers them. Pending jobs are pending; done jobs are done; cannot-be-completed jobs are not retried.

The retry counter is what bounds "indefinite retry." The lease sweeper is what catches "worker died without updating the record." Together they cover both failure paths: errors the worker catches and reports, and worker deaths that leave the record untouched. No job can stay claimed or running forever: either the worker finishes and updates the record, or the lease expires and recovery kicks in.
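
The termination argument can be checked mechanically on a toy model. The Python sketch below is illustrative only: MAX_ATTEMPTS, the outcome labels ("ok", "error", "crash"), and the dictionary job record are hypothetical, and it assumes attempts are counted at claim time so that crashes consume an attempt. It enumerates every sequence of outcomes up to the retry cap and confirms each one ends in a terminal state.

```python
from itertools import product

MAX_ATTEMPTS = 3  # hypothetical retry cap
TERMINAL = {"done", "cannot_be_completed"}

def step(job, outcome):
    """Advance one claim of `job` given what happened to the worker.

    outcome: "ok" (worker finished), "error" (worker caught an exception),
             "crash" (worker died; the lease sweeper recovers the job later).
    """
    assert job["status"] == "pending"
    job["attempts"] += 1  # counted at claim time, so crashes count too
    if outcome == "ok":
        job["status"] = "done"
    elif job["attempts"] >= MAX_ATTEMPTS:
        job["status"] = "cannot_be_completed"
    else:
        job["status"] = "pending"  # retry after error, or lease expiry after crash
    return job

def run_to_terminal(outcomes):
    """Apply outcomes in order until the job reaches a terminal state."""
    job = {"status": "pending", "attempts": 0}
    for outcome in outcomes:
        if job["status"] in TERMINAL:
            break
        step(job, outcome)
    return job

def all_paths_terminate():
    """True if every outcome sequence of length MAX_ATTEMPTS ends terminal."""
    return all(
        run_to_terminal(seq)["status"] in TERMINAL
        for seq in product(["ok", "error", "crash"], repeat=MAX_ATTEMPTS)
    )
```

The model deliberately ignores timing; it only shows that with a bounded attempt counter and a sweeper that always returns crashed jobs to pending, no path leaves a job outside the two terminal states.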

Where Caspian fits naturally

Where it gets harder


Open design questions

If Caspian builds this, the design questions worth settling early:

The simplest first cut would be: in-memory queue, fork-per-job with a fixed concurrency, no retries (job either succeeds or is marked failed once), no scheduling. That gets the core pattern working in a few hundred lines and proves the design. Persistence, retries, scheduling, priorities all add on incrementally.
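
For a sense of scale, that first cut can be sketched in Python using the operating system's fork directly. This is an analogue under stated assumptions, not a Caspian implementation: run_queue and its (name, function) job format are hypothetical, success is reported only through the child's exit code, and the code is POSIX-only because it relies on os.fork and os.wait.

```python
import os
from collections import deque

def run_queue(jobs, concurrency=2):
    """Run each (name, fn) job in its own forked child, at most `concurrency` at once.

    Returns {name: "done" | "failed"}. No retries: each job gets one attempt.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    pending = deque(jobs)
    running = {}  # child pid -> job name
    results = {}

    while pending or running:
        # Fill free slots up to the concurrency cap.
        while pending and len(running) < concurrency:
            name, fn = pending.popleft()
            pid = os.fork()
            if pid == 0:
                # Child: run the job, then exit immediately without
                # running the parent's cleanup handlers.
                try:
                    fn()
                    os._exit(0)
                except BaseException:
                    os._exit(1)
            running[pid] = name
        # Block until some child exits, then harvest its result.
        pid, status = os.wait()
        name = running.pop(pid, None)
        if name is not None:  # ignore children this function did not start
            ok = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
            results[name] = "done" if ok else "failed"
    return results
```

A child that is killed by a signal never produces a normal exit status, so it is recorded as failed by the same check that handles a non-zero exit code. That is the in-memory counterpart of what the lease sweeper does for the persistent design.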


See also


© 2026 Puck.uno