Record once.
Your agent cuts it.
roll captures the take. Crunch makes it searchable. EdAtor makes the cut — EDL, overlays, export. AI does the judgement, FFmpeg does the work.
it works on my agent's machine.
One take in. A finished cut — and a week of shorts — out.
You hit record and talk. Three tools you own do the rest. roll captures screen, camera and mic on a single clock, plus every click, keystroke and on-screen element. Crunch reads all of it — OCR and transcription — and turns the take into something searchable. EdAtor makes the editorial calls a human editor would, writes them as an edit list, dresses the cut in the Signal overlay kit, and exports. No timeline. No scrubbing. No “I’ll fix it in post”.
The edit is a decision. So make the footage RAG-able.
click × on-screen text × transcript = labelled action events. Once the take is searchable, an LLM can make real editorial decisions from it — what to cut, where to zoom, when to bleep — instead of guessing from raw pixels. Crunch does the heavy read once, so every downstream cycle is cheap.
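That three-way join can be sketched in a few lines. This is a minimal illustration, not Crunch’s actual code: the `Click`, `OcrLine` and `Word` shapes, their field names and the 1.5-second speech window are all assumptions made for the example. The only thing taken from the text above is the idea that every stream shares one clock, so a timestamp is enough to line them up.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Click:
    t: float  # seconds on the shared clock (hypothetical field)
    x: float
    y: float


@dataclass
class OcrLine:
    t_start: float  # first moment the line was visible
    t_end: float    # last moment the line was visible
    box: Tuple[float, float, float, float]  # x0, y0, x1, y1
    text: str


@dataclass
class Word:
    t_start: float
    t_end: float
    text: str


def label_click(click: Click, ocr: List[OcrLine], words: List[Word],
                window: float = 1.5) -> dict:
    """Join one click to the text under the cursor and the speech around it."""
    # On-screen text: an OCR line visible at click time whose box holds the point.
    target: Optional[str] = None
    for line in ocr:
        x0, y0, x1, y1 = line.box
        if line.t_start <= click.t <= line.t_end and x0 <= click.x <= x1 and y0 <= click.y <= y1:
            target = line.text
            break
    # Transcript: every word overlapping [t - window, t + window].
    said = [w.text for w in words
            if w.t_end >= click.t - window and w.t_start <= click.t + window]
    return {"t": click.t, "target": target, "said": " ".join(said)}
```

The result is a small, text-only record per click, which is the kind of thing a language model can reason over without ever touching a frame.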
And it all runs on a box you own: near-zero marginal cost, nothing leaving your estate. The scary bit isn’t the AI — it’s the plumbing. The plumbing’s done.
Three tools. One clock. Your stack.
Capture the whole truth, on one clock.
A native macOS recorder. Screen, camera and mic — sub-frame synced on one shared clock, so there’s no drift to chase and nothing to re-align by hand. But roll captures more than pixels: every click, drag, keystroke and scroll, plus the Accessibility role and label of whatever you touched — a full input-and-semantic telemetry track running alongside the video.
The output is a self-describing pack (screen.mp4, camera.mp4, mic.m4a, metadata.jsonl and manifest.json), all on one host clock. That pack is the contract with everything downstream. roll’s job is capture + inspect, not edit.
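Because the pack is plain files, a downstream tool can open it with nothing but a standard library. The sketch below is a guess at what a reader might look like: the five file names come from the pack described above, but the `t` timestamp field on each telemetry event is hypothetical, as are the helper names `load_pack` and `missing_files`.

```python
import json
from pathlib import Path

# File names as listed for the pack; everything else here is illustrative.
PACK_FILES = ("screen.mp4", "camera.mp4", "mic.m4a", "metadata.jsonl", "manifest.json")


def missing_files(pack_dir):
    """Return the expected pack files that are not present, in listed order."""
    pack = Path(pack_dir)
    return [name for name in PACK_FILES if not (pack / name).is_file()]


def load_pack(pack_dir):
    """Read the manifest and the telemetry track from a pack directory."""
    pack = Path(pack_dir)
    manifest = json.loads((pack / "manifest.json").read_text(encoding="utf-8"))
    events = []
    with (pack / "metadata.jsonl").open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # JSON Lines: one JSON object per non-empty line
                events.append(json.loads(line))
    # Assumes each event carries a "t" timestamp on the shared clock.
    events.sort(key=lambda e: e["t"])
    return manifest, events
```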
Reads the take. Makes it searchable.
Crunch is a self-hosted, OpenAI-compatible inference API for the boring, brilliant models under most apps — OCR, speech-to-text, embeddings, reranking, captioning. Point roll’s pack at POST /pack and it reads the take: line-level OCR of the screen, a word-timed transcript of the mic, every click joined to what was under the cursor and what was being said. The encoder models stay warm in memory, so calls come back in tens of milliseconds.
POST /pack → crunch.json
OCR · transcribe · embed · rerank · caption
Whisper-turbo + Florence-2 sidecars · Tesseract OCR
models warm in memory · near-zero marginal cost
The output is one lean crunch.json: the join of what was on screen, what was said and what was done, all on roll’s shared clock. Not a video dump: a queryable index with scored edit moments and a beat-by-beat outline. No GPU, no per-call meter, no data leaving the box. That file is the contract EdAtor reads, the authoring index it works from instead of watching the footage.
It cuts. And it has opinions.
EdAtor reads the crunched take and makes the calls a human editor makes — which take to keep, where to kill the dead air, when to punch in on the face, which word to bleep — and writes them down as a JSON edit pack: a plain edit-decision list, not a render. A deterministic FFmpeg pipeline executes it to the frame — same pack, same cut, every time. AI does the judgement, FFmpeg does the work.
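To make the split between judgement and execution concrete, here is a rough sketch of the execution half: a pure function that turns an edit list into an FFmpeg command. The edit pack shape (`{"keep": [{"start": ..., "end": ...}]}`) and the function name are assumptions for illustration, not EdAtor’s real format. The FFmpeg pieces it uses (`trim`, `atrim`, `setpts`, `asetpts`, `concat`, `-filter_complex`, `-map`) are standard filters and options.

```python
def build_ffmpeg_args(edit_pack, video, audio, dst):
    """Turn a list of kept segments into one deterministic FFmpeg invocation."""
    keeps = sorted(edit_pack["keep"], key=lambda s: s["start"])
    if not keeps:
        raise ValueError("edit pack keeps nothing")
    parts, labels = [], []
    for i, seg in enumerate(keeps):
        start, end = float(seg["start"]), float(seg["end"])
        if end <= start:
            raise ValueError(f"segment {i} has no duration")
        # Input 0 is the video file, input 1 is the separate audio file.
        parts.append(f"[0:v]trim=start={start:.3f}:end={end:.3f},setpts=PTS-STARTPTS[v{i}]")
        parts.append(f"[1:a]atrim=start={start:.3f}:end={end:.3f},asetpts=PTS-STARTPTS[a{i}]")
        labels.append(f"[v{i}][a{i}]")
    # Stitch the kept segments back together in time order.
    parts.append(f"{''.join(labels)}concat=n={len(keeps)}:v=1:a=1[v][a]")
    return ["ffmpeg", "-y", "-i", video, "-i", audio,
            "-filter_complex", ";".join(parts),
            "-map", "[v]", "-map", "[a]", dst]
```

The function has no randomness and no hidden state, so the same edit pack always yields the same argument list. Passing that list to `subprocess.run` would perform the render.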
But it doesn't stop at the cut. EdAtor is a deadpan co-star with opinions — it corrects you when you fumble a term, calls out the beat worth calling out, turns a rambling point into a teach panel, and plays your own outtakes back at you. Then it dresses the clean cut in the private Signal kit, masters to a safe loudness, reframes the same source rolls into face-tracked 9:16 shorts, and generates the thumbnail — your likeness, locked. One take in; a finished cut, a week of verticals, and the thumbnail out.
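The 9:16 reframe comes down to a little arithmetic once a face position is known. The sketch below assumes some face tracker has already supplied the horizontal centre of the face in source pixels. That input and the function name are hypothetical, and this is not a description of how EdAtor does it. The output is a string in the syntax of FFmpeg’s standard `crop` filter (`crop=w:h:x:y`).

```python
def vertical_crop(src_w: int, src_h: int, face_cx: float) -> str:
    """Return an FFmpeg crop filter for a full-height, roughly 9:16 window."""
    # Full source height; width rounded down to an even number of pixels,
    # so the window is very slightly narrower than exact 9:16.
    crop_w = int(src_h * 9 / 16) // 2 * 2
    if crop_w > src_w:
        raise ValueError("source is already narrower than 9:16")
    # Centre the window on the face, then clamp it inside the frame.
    x = int(round(face_cx - crop_w / 2))
    x = max(0, min(x, src_w - crop_w))
    return f"crop={crop_w}:{src_h}:{x}:0"


print(vertical_crop(1920, 1080, 960))
```

Clamping is the part that matters in practice: when the face drifts toward an edge, the window stops at the frame boundary instead of running off it.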
Own the stack. Take the code.
Every piece runs on hardware I own — and the pieces are yours to copy. Same pattern, your box.
Folkestone · it works on my agent’s machine · agents → /llms.txt