Record once.
Your agent cuts it.
roll captures the take. Crunch makes it searchable. EdAtor makes the cut — EDL, overlays, export. AI does the judgement, FFmpeg does the work.
it works on my agent's machine.
One take in. A finished cut — and a week of shorts — out.
You hit record and talk. Three tools you own do the rest. roll captures screen, camera and mic on a single clock, plus every click, keystroke and on-screen element. Crunch reads all of it — OCR and transcription — and turns the take into something searchable. EdAtor makes the editorial calls a human editor would, writes them as an edit list, dresses the cut in the Signal overlay kit, and exports. No timeline. No scrubbing. No “I’ll fix it in post”.
The edit is a decision. So make the footage RAG-able.
click × on-screen text × transcript = labelled action events. Once the take is searchable, an LLM can make real editorial decisions from it — what to cut, where to zoom, when to bleep — instead of guessing from raw pixels. Crunch does the heavy read once, so every downstream cycle is cheap.
And it all runs on a box you own: near-zero marginal cost, nothing leaving your estate. The scary bit isn’t the AI — it’s the plumbing. The plumbing’s done.
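The click × on-screen text × transcript join can be sketched in a few lines. This is a minimal illustration, not Crunch's actual code: the Click, OcrLine and Word types, their fields, and the two-second window are all assumptions made for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass


# All three record types are hypothetical shapes for this sketch.
@dataclass
class Click:
    t: float  # seconds on the shared clock
    x: float
    y: float


@dataclass
class OcrLine:
    t: float      # time of the frame the line was read from
    box: tuple    # (x0, y0, x1, y1) in screen pixels
    text: str


@dataclass
class Word:
    start: float
    end: float
    text: str


def label_click(click, ocr_lines, words, window=2.0):
    """Join one click to the text under the cursor and the words spoken nearby."""
    # Pick the latest OCR frame at or before the click.
    frame_times = sorted({line.t for line in ocr_lines})
    i = bisect_right(frame_times, click.t)
    target = None
    if i:
        frame_t = frame_times[i - 1]
        for line in ocr_lines:
            x0, y0, x1, y1 = line.box
            if line.t == frame_t and x0 <= click.x <= x1 and y0 <= click.y <= y1:
                target = line.text
                break
    # Keep every word that overlaps the window around the click.
    said = " ".join(
        w.text for w in words
        if w.end >= click.t - window and w.start <= click.t + window
    )
    return {"t": click.t, "target": target, "said": said}
```

Because all three streams share one clock, the join is a timestamp lookup plus a hit test, with no alignment step in between.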
Three tools. One clock. Your stack.
Capture the whole truth, on one clock.
A native macOS recorder. Screen, camera and mic — sub-frame synced on one shared clock, so there’s no drift to chase and nothing to re-align by hand. But roll captures more than pixels: every click, drag, keystroke and scroll, plus the Accessibility role and label of whatever you touched — a full input-and-semantic telemetry track running alongside the video.
The output is a self-describing pack: screen.mp4, camera.mp4, mic.m4a, metadata.jsonl and manifest.json, all on one host clock. That pack is the contract with everything downstream. roll’s job is capture and inspect, not edit.
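A downstream tool might open a pack along these lines. The five file names come from the pack described above; the load_pack helper and the idea of one JSON object per metadata.jsonl line are assumptions, and no field names inside either file are assumed.

```python
import json
from pathlib import Path

# File names as listed for the pack; everything else here is illustrative.
PACK_FILES = ("screen.mp4", "camera.mp4", "mic.m4a", "metadata.jsonl", "manifest.json")


def load_pack(pack_dir):
    """Check that a pack is complete, then return (manifest, events)."""
    root = Path(pack_dir)
    missing = [name for name in PACK_FILES if not (root / name).exists()]
    if missing:
        raise FileNotFoundError(f"pack is missing: {', '.join(missing)}")
    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
    # Assumption: metadata.jsonl holds one JSON object per line.
    with (root / "metadata.jsonl").open(encoding="utf-8") as fh:
        events = [json.loads(line) for line in fh if line.strip()]
    return manifest, events
```

Failing fast on a missing file keeps the contract honest: a pack is either whole or it is rejected.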
Reads the take. Makes it searchable.
Crunch is a self-hosted, OpenAI-compatible inference API for the boring, brilliant models under most apps — OCR, speech-to-text, embeddings, reranking, captioning. Point roll’s pack at POST /pack and it reads the take: line-level OCR of the screen, a word-timed transcript of the mic, every click joined to what was under the cursor and what was being said. The encoder models stay warm in memory, so calls come back in tens of milliseconds.
POST /pack → crunch.json
OCR · transcribe · embed · rerank · caption
Whisper-turbo + Florence-2 sidecars · Tesseract OCR
models warm in memory · near-zero marginal cost
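Calling POST /pack from a script could look roughly like this. Only the route comes from the text above: the base URL, the JSON body with a pack_dir field, and the helper names are hypothetical, since the real request format is not documented here.

```python
import json
import urllib.request


def build_pack_request(base_url, pack_dir):
    """Build (but do not send) a POST /pack request.

    Assumption: the server runs on the same box and accepts a JSON body
    naming the pack directory. The real API may expect an upload instead.
    """
    body = json.dumps({"pack_dir": str(pack_dir)}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/pack",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def crunch(base_url, pack_dir, timeout=600):
    """Send the request and return the parsed crunch.json payload."""
    request = build_pack_request(base_url, pack_dir)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Splitting the request builder from the network call makes the shape of the call easy to inspect without a running server.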
The output is one lean crunch.json: the join of what was on screen, what was said and what was done, all on roll’s shared clock. Not a video dump: a queryable index with scored edit moments and a beat-by-beat outline. No GPU, no per-call meter, no data leaving the box. That file is the contract EdAtor reads, the authoring index it works from instead of watching the footage.
Makes the EDL. Dresses the cut. Exports.
EdAtor reads the crunched take and makes the calls a human editor makes — which take to keep, where to kill the dead air, when to punch in on the face, which word needs a bleep — and writes them down as a JSON edit pack. That pack is the one hard contract: a plain edit-decision list, not a render. A deterministic FFmpeg pipeline executes it to the frame — same pack, same cut, every time. AI does the judgement, FFmpeg does the work.
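To make "same pack, same cut" concrete, here is a minimal sketch of turning an edit list into an FFmpeg command. The edit pack schema (a "keep" list of start and end times in seconds) and the ffmpeg_args helper are assumptions for illustration; the trim, atrim, setpts, asetpts and concat filters are standard FFmpeg.

```python
def ffmpeg_args(edl, video, audio, out):
    """Translate a hypothetical edit pack into an ffmpeg argument list.

    Assumed schema: {"keep": [{"start": seconds, "end": seconds}, ...]}.
    The function is pure, so the same pack always yields the same command.
    """
    keeps = edl["keep"]
    parts = []
    labels = []
    for i, seg in enumerate(keeps):
        start, end = seg["start"], seg["end"]
        # Cut the kept range from each stream and reset its timestamps.
        parts.append(f"[0:v]trim=start={start}:end={end},setpts=PTS-STARTPTS[v{i}]")
        parts.append(f"[1:a]atrim=start={start}:end={end},asetpts=PTS-STARTPTS[a{i}]")
        labels.append(f"[v{i}][a{i}]")
    # Join the kept segments back to back.
    parts.append(f"{''.join(labels)}concat=n={len(keeps)}:v=1:a=1[v][a]")
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-i", audio,
        "-filter_complex", ";".join(parts),
        "-map", "[v]", "-map", "[a]",
        out,
    ]
```

The list can be handed to subprocess.run as-is. Nothing in it depends on a model call, which is the point: the judgement lives in the file, the execution is mechanical.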
Then it dresses the clean cut in the private Signal overlay kit, masters the audio to a safe loudness, and reframes the same source rolls — not a re-crop of the finished video — into face-tracked 9:16 shorts. One take in; a finished, on-brand cut and a week of verticals out. No timeline, no scrubbing — the whole edit is a file you can read.
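The 9:16 reframe comes down to picking a crop window that follows the face. A minimal sketch, assuming the face centre is already known for the frame (detection is not shown) and using a hypothetical vertical_crop helper that emits a standard FFmpeg crop filter:

```python
def vertical_crop(src_w, src_h, face_cx):
    """Return an ffmpeg crop filter for a full-height 9:16 window.

    face_cx is the horizontal centre of the face in source pixels,
    supplied by whatever tracker is in use (an assumption of this sketch).
    """
    # Width of a 9:16 window at full source height, rounded down to an even number.
    crop_w = int(src_h * 9 / 16) // 2 * 2
    crop_w = min(crop_w, src_w)
    # Centre the window on the face, then clamp it inside the frame.
    x = int(round(face_cx - crop_w / 2))
    x = max(0, min(x, src_w - crop_w))
    return f"crop={crop_w}:{src_h}:{x}:0"
```

For a 1920×1080 source with the face at x = 960 this gives crop=606:1080:657:0. Cropping the source roll rather than the finished video means the vertical keeps full source resolution inside the window.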
Own the stack. Take the code.
Every piece runs on hardware I own — and the pieces are yours to copy. Same pattern, your box.
Folkestone · it works on my agent’s machine · agents → /llms.txt