# MeterCall L4 Bot Architecture

**Last updated:** 2026-04-17 · **Owner:** L4 infra · **Status:** T+0 production, file-lock primitive

This document is the honest inventory of how the ~1,076 bots running inside MeterCall L4 actually execute — what's solid, what's a stopgap, and what the upgrade staircase looks like. No marketing. If you're reviewing, searching for a cron, or writing a new bot, this is the source of truth.

---

## 1. The current model: in-process `setInterval`

Today every bot lives inside the Node server process. We call `setInterval(fn, ms)` at boot, the function runs on the event loop, it exits, the timer fires again. That's it. No queue, no worker pool, no cron daemon, no Kubernetes cron-job, no Lambda. This is deliberate:

- **Zero ops.** One process. One `fly deploy`. One log stream.
- **Zero cold start.** Every bot can read the same in-memory state (manifest cache, rate limiter, DB pool).
- **Fast iteration.** `registerBot({...})` is three lines. New bot ships with the next deploy.

The cost is equally honest:

- **Not horizontally scalable out of the box.** Every replica would duplicate every bot.
- **A slow handler blocks the event loop.** If one bot runs 800ms of sync work, every other bot is late.
- **Process restart = missed tick.** We don't yet guarantee "this bot ran once in the last N seconds across the fleet."

`/v1/bots/health` reports any bot that hasn't ticked in `3 × intervalMs` — that's the live canary for all three failure modes above.
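The whole model fits in a few lines. A minimal sketch (names hypothetical; the real registry is `lib/bot-runtime.js`), including the `3 × intervalMs` stall rule that `/v1/bots/health` applies:

```js
// Minimal sketch of the in-process model: one setInterval per bot,
// per-tick bookkeeping, and the /v1/bots/health stall rule. Names are
// hypothetical; the real implementation lives in lib/bot-runtime.js.
const bots = new Map();

function registerBot({ id, intervalMs, handler }) {
  const state = { id, intervalMs, lastRunMs: 0, ticks: 0, errors: 0 };
  bots.set(id, state);
  setInterval(async () => {
    state.lastRunMs = Date.now();
    state.ticks += 1;
    try {
      await handler();
    } catch (err) {
      state.errors += 1; // the runtime records; the bot author just throws
    }
  }, intervalMs);
}

// Stalled = no tick in 3 × intervalMs (the live canary described above).
function stalledBots(nowMs = Date.now()) {
  return [...bots.values()].filter((b) => nowMs - b.lastRunMs > 3 * b.intervalMs);
}
```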

## 2. Leader election (the sharp edge we just shipped)

When we go multi-replica behind the Fly load balancer, every replica would run every bot. That's expensive and wrong (double-writes, double-notifications, inconsistent state). The fix is leader election: for any bot marked `leaderOnly: true`, exactly one replica in the fleet runs it at a time.

The v1 primitive is in `lib/bot-lease.js`:

- `acquireLease(botId, ttlMs)` → atomic link creation (POSIX `link(2)`, which fails with `EEXIST` if the lock is held) at `.data/bot-leases/{botId}.lock`.
- `renewLease(botId, leaseId, ttlMs)` → extend before expiry; bot-runtime renews at 1/3 TTL.
- `releaseLease` on SIGTERM for clean shutdown.
- Expired leases are stolen via atomic `rename(2)` — the race window is bounded by the TTL.

File locks are correct on **a single Fly Machine**. That's our T+0 config. When we go multi-machine, the file lock breaks — no shared filesystem. We swap the implementation behind the same interface:

| Phase | Backend | Shared state | Swap cost |
| --- | --- | --- | --- |
| T+0 (today) | filesystem `link(2)` | local disk | shipped |
| T+30 (multi-machine) | Redis `SET NX PX` | managed Redis | ~20 lines in bot-lease.js |
| T+90 (durable) | Postgres advisory lock | Postgres (already in stack) | ~20 lines in bot-lease.js |

**The call sites never change.** That's the whole point. `bot-runtime.js` and every `registerBot({leaderOnly: true})` caller speaks one interface.

## 3. Sharding primitive

Leader-only is all-or-nothing: one replica runs the bot, the rest idle. Some bots can be **sharded** — split the work, run N copies, each handles 1/N of the keyspace. We seeded this with `shardKey` on the bot config:

```js
registerBot({
  id: 'SCAN_INDEXER_BOT',
  shardKey: 'scan:shard-0',   // opaque string
  leaderOnly: true,
  intervalMs: 30_000,
  handler: async ({ shardKey }) => {
    const shardId = parseInt(shardKey.split('-')[1], 10);
    const rows = await db.query(`... WHERE hash_mod(id, 4) = $1`, [shardId]);
    // ...process each row for this shard
  },
});
```

At T+0 we don't dispatch shards automatically — you register `SCAN_INDEXER_BOT_0..3` yourself with different `shardKey` values. At T+90 the runtime will pick up a shard count from config and spawn the copies itself, each holding its own lease.
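The manual T+0 registration amounts to a loop. A sketch, with `SHARD_COUNT` and the registry stub illustrative (real code imports `registerBot` from `lib/bot-runtime`):

```js
// T+0 shard fan-out: register one leader-only bot per shard by hand.
const registry = [];
const registerBot = (cfg) => registry.push(cfg); // stand-in for bot-runtime

const SHARD_COUNT = 4; // illustrative
for (let i = 0; i < SHARD_COUNT; i++) {
  registerBot({
    id: `SCAN_INDEXER_BOT_${i}`,
    shardKey: `scan:shard-${i}`,
    leaderOnly: true,
    intervalMs: 30_000,
    handler: async ({ shardKey }) => {
      const shardId = parseInt(shardKey.split('-')[1], 10);
      // process rows where hash_mod(id, SHARD_COUNT) === shardId
    },
  });
}
```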

## 4. The T+30 / T+90 / T+365 scaling staircase

This matches the plan Pat signed off on:

- **T+0 (now).** Single Fly Machine. File-lock leader. In-process `setInterval`. Registry + metrics shipped. We are here.
- **T+30.** Multi-machine Fly. Swap file-lock for Redis lease (20 LOC). All ~1,076 bots keep running; `leaderOnly` bots now fail over between machines. `/v1/bots/leaders` starts showing different holders per bot.
- **T+90.** Task queue (BullMQ on Redis, or NATS JetStream). Bots that don't need persistent in-process state migrate to producers: `setInterval` schedules a job, any worker can consume. Automatic sharding via queue concurrency. Event-loop isolation per bot.
- **T+365.** Community bot marketplace. Outside developers publish bots; MeterCall stakes them, meters their CPU + IO, pays revenue-share on calls they generate. Isolation via V8 isolates or a Firecracker microVM per bot.

At **every step** the `registerBot({...})` signature is preserved. The bot author doesn't know or care where their handler actually runs.
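The T+90 shape is worth spelling out: the bot's tick stops doing the work and only enqueues it, so any worker can consume. A sketch against a minimal, hypothetical queue interface (BullMQ's `queue.add()` / `Worker` follow the same shape; the in-memory queue here just makes the pattern runnable):

```js
// T+90 shape: the tick handler becomes a cheap producer; workers consume.
function makeProducerBot({ id, intervalMs, queue, payload }) {
  return {
    id,
    intervalMs,
    // non-blocking: schedule a job and return immediately
    handler: async () => queue.add(id, payload()),
  };
}

// Toy in-memory queue so the pattern runs standalone; real deployments
// would back this with BullMQ on Redis or NATS JetStream.
function makeMemoryQueue(process) {
  const jobs = [];
  return {
    async add(name, data) { jobs.push({ name, data }); },
    async drain() { while (jobs.length) await process(jobs.shift()); },
  };
}
```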

## 5. Observability

Three endpoints and one page:

- `GET /v1/bots/catalog` — every registered bot, full metrics, leader, opt-out flag.
- `GET /v1/bots/bot/:id` — single bot detail.
- `GET /v1/bots/leaders` — who holds each lease right now, across the fleet.
- `GET /v1/bots/health` — aggregate + list of stalled bots (haven't ticked in 3× interval).
- `GET /v1/bots/metrics/prometheus` — Prom exposition: `metercall_bot_ticks_total`, `metercall_bot_errors_total`, `metercall_bot_last_run_timestamp_ms`, `metercall_bot_tick_duration_ms{q="0.5|0.95"}`, `metercall_bot_lease_held`.
- `/bots` — public dark-theme page with filters, live 10s refresh, opt-out toggles.

Per-tick metrics (ticks, errors, p50, p95, last run, in-cooldown) are captured inside `bot-runtime.js` — zero instrumentation cost to the bot author. Errors are rate-limited: >10 errors/minute trips a 60-second cooldown so a misbehaving bot can't saturate the log or thrash an upstream API.

## 6. Opt-out policy

Three bots touch user data:

- `DATA_AGG_BOT`
- `ANALYTICS_ROLLUP_BOT`
- `STARLINK_ORBIT_SYNC_BOT` (location-adjacent — tracks which orbit segment covers which user)

Users can opt out via `/bots` (toggle) or directly via `POST /v1/bots/opt-out/:botId` with `{userId, reason?}`. Records persist to `.data/bot-opt-outs.json` (swap to a DB-backed row at T+30). The bot handlers must consult `readOptOuts()` before processing any user — that's the bot author's responsibility; the runtime doesn't enforce it, and neither does CI until we add a lint rule.

## 7. How to write a new bot

```js
// file: my-module.js
const { registerBot } = require('./lib/bot-runtime');

registerBot({
  id: 'MY_FEATURE_BOT',
  name: 'My Feature',
  category: 'markets',       // core|markets|oracle|msg|agents|scan|ops|user
  description: 'One sentence that explains what this does to a stranger.',
  intervalMs: 60_000,
  leaderOnly: true,           // usually yes; set false only for pure local work
  shardKey: null,             // opt in later if you need horizontal split
  handler: async () => {
    // do the work. throw on failure — the runtime catches and records.
    // return anything; return value is ignored.
  },
});
```

**Rules:**

1. Keep the handler under ~500ms of sync work. If you need longer, queue a job.
2. Never hit Anthropic/OpenAI directly — route through `/gateway/claude` or `ai-router`.
3. If you touch user data, wire it into the opt-out list in `bot-catalog-module.js`.
4. `id` is a stable identifier. Don't rename it — metrics will orphan.
5. `leaderOnly: false` is the rare case — only for work that's genuinely per-replica (like heartbeats).

## 8. Community bot marketplace (roadmap)

End-state: a bot is a published npm-style package; developers stake MeterCall credits to run one; successful bots earn revenue-share (30/70 creator split, matching the module economy). The `registerBot()` interface is already the seam — we just need:

- Per-bot CPU/IO metering (trivial — wrap handler with `perf_hooks.performance.now()` and rusage deltas).
- Sandboxing (V8 isolate or Firecracker) so a community bot can't read `process.env`.
- A publish flow that pins the handler hash into the lease record.

That's T+365 work. The file-lock + registry we shipped today is step one.

---

**Files of record:**

- `lib/bot-lease.js` — leader election primitive
- `lib/bot-runtime.js` — bot wrapper, registry, metrics
- `bot-catalog-module.js` — HTTP surface + opt-out store
- `bots.html` — public catalog page

If you're reading this and something feels hand-wavy, grep the source. The code is short.
