Training environments that compound.
930 programmatically generates training environments for computer-use models — tasks, diagnostic
grading, and reward signals — from natural language. Every failure feeds the next generation
of training data. The universe expands where models need it most.
What we believe
The ceiling on agent capabilities is the training environment.
Not compute. Not architecture. The quality and diversity of the environments your model trains in
determine what it can learn. Today’s eval and training pipelines are hand-crafted, brittle, and siloed —
every team rebuilds the same environments, and failures are discarded instead of learned from.
We’re building infrastructure where environments are programs, grading is diagnostic,
trajectories are first-class data, and the training universe improves itself.
The loop
A training universe that expands with every rollout.
Because 930 generates environments from natural language, it can close the loop: analyze your rollouts,
identify your model’s blind spots, and generate new tasks targeting exactly those weaknesses.
Do that for every team on the platform, and the universe grows — especially
where models need it most.
01 Evaluate
Run tasks with criterion-level grading. See exactly where your model fails and why.
02 Analyze blind spots
Analyze rollouts across sessions. Surface strengths, weaknesses, and missing capabilities.
03 Generate targeted tasks
Create new environments and tasks in the model’s blind spots. Curriculum that adapts to what’s actually broken.
04 Train & repeat
Export training data. Fine-tune. Re-evaluate. The loop compounds. The universe expands.
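One way to picture steps 01 to 03, as a rough sketch rather than 930's actual API: assume each rollout yields per-criterion results tagged with a capability. The names `CriterionResult`, `find_blind_spots`, `task_requests`, the `skill` tag, and the 0.5 threshold are all illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    task_id: str
    criterion: str  # e.g. "revenue_row_added"
    skill: str      # hypothetical capability tag, e.g. "spreadsheet.write"
    score: float    # 0.0 (failed) to 1.0 (passed)


def find_blind_spots(results, threshold=0.5, min_samples=2):
    """Return (skill, mean score) pairs below the threshold, weakest first."""
    by_skill = defaultdict(list)
    for r in results:
        by_skill[r.skill].append(r.score)
    # Ignore skills with too few samples to say anything about.
    means = {s: sum(v) / len(v) for s, v in by_skill.items() if len(v) >= min_samples}
    weak = [(s, m) for s, m in means.items() if m < threshold]
    return sorted(weak, key=lambda pair: pair[1])


def task_requests(blind_spots, per_skill=3):
    """Turn blind spots into natural-language generation requests."""
    return [
        f"Generate {per_skill} tasks that exercise '{skill}' "
        f"(current mean score {mean:.2f})"
        for skill, mean in blind_spots
    ]
```

The requests are plain strings because the source describes generation as driven by natural language; what a real request looks like is not specified here.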
Technical bets
Four bets on how agent training should work.
Each shapes the platform — and each is a hypothesis we’re testing in production.
Environments as programs, not prompts
Generated from natural language, compiled to executable code, validated at runtime.
Not templates — real state machines with event handlers, world generators, and grading logic.
Every environment is deterministic, forkable, and composable.
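To make "environments as programs" concrete, here is a toy sketch under our own assumptions: a tiny task-list environment with a seeded world generator, event handlers, a fork operation, and grading logic. `TodoEnv` and its method names are hypothetical and do not reflect the code 930 generates.

```python
import copy
import random
from dataclasses import dataclass, field


@dataclass
class TodoEnv:
    """Toy environment: a task list the agent must complete."""
    seed: int
    state: dict = field(default_factory=dict)

    def __post_init__(self):
        if not self.state:
            self.state = self.generate_world(self.seed)

    @staticmethod
    def generate_world(seed):
        # World generator: the same seed always yields the same initial state.
        rng = random.Random(seed)
        titles = ["Send contract", "Schedule kickoff", "Create invoice", "Share login"]
        return {"todos": {t: False for t in rng.sample(titles, k=3)}}

    def step(self, event, **payload):
        # Event dispatch: unknown events raise instead of passing silently.
        handler = getattr(self, f"on_{event}", None)
        if handler is None:
            raise ValueError(f"unknown event: {event}")
        handler(**payload)
        return self.state

    def on_complete(self, title):
        if title in self.state["todos"]:
            self.state["todos"][title] = True

    def on_add(self, title):
        self.state["todos"].setdefault(title, False)

    def fork(self):
        # Independent copy: later steps on the fork leave the original untouched.
        return TodoEnv(seed=self.seed, state=copy.deepcopy(self.state))

    def grade(self):
        # Fraction of todos completed.
        todos = self.state["todos"]
        return sum(todos.values()) / len(todos)
```

Determinism here comes from routing all randomness through one seeded `random.Random`; forking is a deep copy of state. Composition across environments is not shown.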
Diagnostic grading, not scalar scores
Each task is graded by composable rubric functions. Every criterion returns a score and
detailed textual feedback — “cell (row_1, amount) = 450, expected 500”, not just “73%”.
The same results serve as RL reward signals with per-criterion auxiliary rewards.
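A minimal sketch of what composable rubric functions could look like, assuming a spreadsheet state keyed by `(row, column)`. The names `Grade`, `cell_equals`, `rubric`, and `to_reward`, and the auxiliary-reward weighting, are our illustrative assumptions, not the platform's interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Grade:
    name: str
    score: float   # 0.0 to 1.0
    feedback: str  # human-readable diagnosis


Criterion = Callable[[dict], Grade]


def cell_equals(row, col, expected) -> Criterion:
    """Build a criterion that checks a single cell."""
    def check(state):
        actual = state.get((row, col))
        ok = actual == expected
        feedback = "ok" if ok else f"cell ({row}, {col}) = {actual}, expected {expected}"
        return Grade(f"{row}.{col}", 1.0 if ok else 0.0, feedback)
    return check


def rubric(*criteria):
    """Compose criteria into one grader that returns every Grade."""
    def run(state):
        return [c(state) for c in criteria]
    return run


def to_reward(grades, aux_weight=0.1):
    """Scalar task reward (mean score) plus a per-criterion auxiliary reward."""
    mean = sum(g.score for g in grades) / len(grades)
    aux = {g.name: aux_weight * g.score for g in grades}
    return mean, aux


state = {("row_1", "amount"): 450, ("row_2", "amount"): 200}
grade = rubric(cell_equals("row_1", "amount", 500), cell_equals("row_2", "amount", 200))
for g in grade(state):
    print(g.name, g.score, g.feedback)
```

The same list of `Grade` objects feeds both the human-readable report and `to_reward`, which is the point of the bet: one grading pass, two consumers.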
Trajectories as first-class data
Every agent action is stored with full state snapshots. Replay any moment, fork from any failure,
and reuse the seed for deterministic reproduction. One run produces eval grades, RL reward signals,
and SFT training data — no separate pipelines.
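As a sketch of how one stored run could serve replay, forking, and SFT export, assume each step records the action, a full state snapshot, and a reward. `Step`, `Trajectory`, and the "positive reward" filter for SFT are hypothetical choices made for illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Step:
    action: dict
    state: dict    # full snapshot taken after the action
    reward: float


@dataclass
class Trajectory:
    seed: int
    initial_state: dict
    steps: list = field(default_factory=list)

    def record(self, action, state, reward):
        # Deep-copy so later mutation of the live state cannot rewrite history.
        self.steps.append(Step(copy.deepcopy(action), copy.deepcopy(state), reward))

    def state_at(self, index):
        """State after `index` steps; index 0 is the initial state."""
        if index == 0:
            return copy.deepcopy(self.initial_state)
        return copy.deepcopy(self.steps[index - 1].state)

    def first_failure(self):
        """Index of the first step with non-positive reward, or None."""
        for i, step in enumerate(self.steps):
            if step.reward <= 0:
                return i
        return None

    def fork(self, index):
        """New trajectory that keeps only the first `index` steps."""
        return Trajectory(self.seed, copy.deepcopy(self.initial_state),
                          copy.deepcopy(self.steps[:index]))

    def to_sft(self):
        """(observation, action) pairs for steps that earned positive reward."""
        return [
            {"observation": self.state_at(i), "action": step.action}
            for i, step in enumerate(self.steps)
            if step.reward > 0
        ]
```

Calling `traj.fork(traj.first_failure())` gives a trajectory that ends just before the failing action, ready to be continued with a different policy.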
A shared universe that compounds
Every team’s training generates environments that benefit everyone. Failures surface
blind spots, blind spots generate new tasks, new tasks grow the universe. The harder
the frontier, the faster the coverage expands.
See it in action
Models training in real environments.
Criterion-level grading across CRM workflows, spreadsheet reasoning, spatial manipulation, and more.
Cross-app workflow
Model onboards a client across CRM, spreadsheet, and task list in one run.
Find the right contact, record the contract in revenue, and capture follow-ups in todos — each step scored on the real UI state, not a single pass/fail.
Example prompt
Onboard Quigley-Block as a new client with a $126000 contract. Update their CRM status and deal value, add the contract as revenue in Excel, and create three onboarding …
Spreadsheet reasoning
Model reconciles multiple sheets without corrupting source data.
Cross-reference tabs, infer what is missing or inconsistent, and write only the derived output. Graded on correctness and on leaving originals intact.
Example prompt
Reconcile invoices against payments: 1. Switch to the 'Invoices' sheet and review all invoice IDs 2. Switch to the 'Payments' sheet and note which invoice IDs have payme…
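A hedged sketch of how "correct output, originals intact" might be graded, assuming sheets are lists of rows captured before and after the run. The function name, the `(name, score, feedback)` tuple shape, and the `Unpaid` output sheet in the usage below are our assumptions; the truncated prompt above does not say where the result is written.

```python
def grade_reconciliation(before, after, source_sheets, output_sheet, expected_output):
    """Return one (name, score, feedback) tuple per criterion."""
    grades = []
    # Criterion per source sheet: it must be byte-for-byte unchanged.
    for sheet in source_sheets:
        intact = before.get(sheet) == after.get(sheet)
        feedback = "ok" if intact else f"source sheet '{sheet}' was modified"
        grades.append((f"{sheet}_intact", 1.0 if intact else 0.0, feedback))
    # Criterion for the derived output: same rows, order ignored.
    actual = after.get(output_sheet, [])
    missing = [row for row in expected_output if row not in actual]
    extra = [row for row in actual if row not in expected_output]
    correct = not missing and not extra
    feedback = "ok" if correct else (
        f"{len(missing)} missing row(s), {len(extra)} unexpected row(s)"
    )
    grades.append((f"{output_sheet}_correct", 1.0 if correct else 0.0, feedback))
    return grades


before = {"Invoices": [["INV-1", 100], ["INV-2", 250]], "Payments": [["INV-1", 100]]}
after = {**before, "Unpaid": [["INV-2", 250]]}
for grade in grade_reconciliation(before, after, ["Invoices", "Payments"],
                                  "Unpaid", [["INV-2", 250]]):
    print(grade)
```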
Spatial layout
Model arranges furniture under explicit spatial constraints.
A floor-plan editor with movable pieces: satisfy relationships like “sofa opposite windows” and “lamp symmetry” while avoiding collisions — precise geometric grading.
Example prompt
Arrange the living room: place the sofa against the wall opposite the windows, put the coffee table in front of the sofa, place a lamp on each side of the sofa, move the…
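Geometric grading of this kind can be sketched with axis-aligned rectangles, assuming each piece is a box on the floor plan. `Rect`, `collisions`, `lamps_symmetric`, and the 0.05 tolerance are illustrative assumptions; the editor's real geometry model is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    name: str
    x: float  # left edge
    y: float  # top edge
    w: float
    h: float

    @property
    def cx(self):
        return self.x + self.w / 2

    @property
    def cy(self):
        return self.y + self.h / 2


def overlaps(a, b):
    # Strict inequalities: pieces that only touch edges do not collide.
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def collisions(pieces):
    """All colliding pairs, each reported once."""
    return [(a.name, b.name)
            for i, a in enumerate(pieces)
            for b in pieces[i + 1:]
            if overlaps(a, b)]


def lamps_symmetric(sofa, left, right, tol=0.05):
    """Check that the lamps mirror each other across the sofa's centre line."""
    dx = abs((sofa.cx - left.cx) - (right.cx - sofa.cx))
    dy = abs(left.cy - right.cy)
    ok = dx <= tol and dy <= tol
    feedback = "ok" if ok else (
        f"lamp offsets differ by {dx:.2f} horizontally, {dy:.2f} vertically"
    )
    return ok, feedback
```

Returning a feedback string alongside the boolean keeps the check usable as a diagnostic criterion rather than a bare pass/fail.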
We’re building the training infrastructure for computer-use models.
If you’re working on agent capabilities, model evaluation, or synthetic training data —
we should talk. We’re looking for research partners who want to push
the frontier of what agents can learn.