autoresearch

Introduction: Claude Autoresearch Skill — Autonomous goal-directed iteration for Claude Code. Inspired by Karpathy's autoresearch. Modify → Verify → Keep/Discard → Repeat forever.
More: Author   ReportBugs   OfficialWebsite   
Tags:

Turn Claude Code, OpenCode, or OpenAI Codex into a relentless improvement engine.

Based on Karpathy's autoresearch — constraint + mechanical metric + autonomous iteration = compounding gains.

Claude Code Skill OpenCode Codex Version License: MIT

Based on Follow @iuditg Support


"Set the GOAL → The agent runs the LOOP → You wake up to results"

You don't need AGI. You need a goal, a metric, and a loop that never quits.

Supports Claude Code, OpenCode, and OpenAI Codex. 13 commands. 9 safety hooks. 95% fewer tokens per invocation.


How It Works · Commands · Quick Start · Guides · FAQ


      PLAN              LOOP             DEBUG              FIX            SECURE            SHIP
 ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
 │   Goal   │     │  Modify  │     │   Find   │     │   Fix    │     │  STRIDE  │     │  Stage   │
 │  Metric  │────▶│  Verify  │────▶│   Bugs   │────▶│  Errors  │────▶│  OWASP   │────▶│  Deploy  │
 │  Scope   │     │  Keep/   │     │  Trace   │     │  Repair  │     │  Red     │     │ Release  │
 └──────────┘     │  Discard │     └──────────┘     └──────────┘     │  Team    │     └──────────┘
/autoresearch:    └──────────┘    /autoresearch:    /autoresearch:   └──────────┘    /autoresearch:
  plan            /autoresearch     debug              fix          /autoresearch:      ship
                                                                     security

 ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
 │  Probe   │     │ Scenario │     │ Predict  │     │  Learn   │     │  Reason  │     │ Improve  │
 │ Require- │     │   Edge   │     │ 5-Expert │     │   Docs   │     │  Debate  │     │ Research │
 │  ments   │     │   Cases  │     │  Swarm   │     │   Gen    │     │ Converge │     │   PRDs   │
 └──────────┘     └──────────┘     └──────────┘     └──────────┘     └──────────┘     └──────────┘
/autoresearch:   /autoresearch:   /autoresearch:   /autoresearch:   /autoresearch:   /autoresearch:
  probe            scenario         predict           learn           reason            improve

                                                                                      ┌──────────┐
                                                                                      │  Evals   │
                                                                                      │ Analyze  │
                                                                                      │ Results  │
                                                                                      └──────────┘
                                                                                     /autoresearch:
                                                                                       evals

Why This Exists

Karpathy's autoresearch demonstrated that a 630-line Python script could autonomously improve ML models overnight — 100 experiments per night — by following simple principles: one metric, constrained scope, fast verification, automatic rollback, git as memory.

Claude Autoresearch generalizes these principles to ANY domain. Not just ML — code, content, marketing, sales, HR, DevOps, or anything with a number you can measure.

v2.1.0 is a major architecture rebuild. The monolithic SKILL.md (813 lines, ~100K tokens per invocation) is replaced with a thin 41-line routing file and 12 self-contained command files (94–120 lines each, ~5–8K tokens per invocation). That is a 95% token reduction with the same capability surface.


How It Works

LOOP (N iterations or until done):
  1. Review current state + git history + results log
  2. Pick the next change (based on what worked, what failed, what's untried)
  3. Make ONE focused change
  4. Git commit (before verification)
  5. Run mechanical verification (tests, benchmarks, scores)
  6. If improved → keep. If worse → git revert. If crashed → fix or skip.
  7. Log the result
  8. Repeat until N iterations complete or goal is met.

Every improvement stacks. Every failure auto-reverts. Progress is logged in TSV format.

The Setup Phase

Before looping, Claude performs a one-time setup:

  1. Read context — reads all in-scope files
  2. Define goal — extracts or asks for a mechanical metric
  3. Define scope — which files can be modified vs read-only
  4. Establish baseline — runs verification on current state (iteration #0)
  5. Confirm and go — shows setup, then begins the loop

8 Critical Rules

# Rule
1 Bounded by default — every command has a default iteration count; unlimited is opt-in via Iterations: unlimited
2 Read before write — understand full context before modifying
3 One change per iteration — atomic changes; if it breaks, you know why
4 Mechanical verification only — no subjective "looks good"; use metrics
5 Automatic rollback — failed changes revert instantly
6 Simplicity wins — equal results + less code = keep
7 Git is memory — experiments committed with experiment: prefix; agent reads git log + git diff before each iteration
8 When stuck, think harder — re-read, combine near-misses, try radical changes

Hooks & Safety

v2.1.1 ships a 9-hook safety system that protects your sessions automatically. Hooks fire on every session — not just during autoresearch commands.

What's Protected

Hook What it does Event
scout-block Blocks node_modules/, .git/, pycache/, etc. from filling your context PreToolUse
privacy-block Blocks .env, SSH keys, credentials from being read in sessions PreToolUse
dangerous-cmd-block Blocks force-push, rm -rf, git reset --hard PreToolUse
iteration-context Injects recent TSV iteration data after context compaction UserPromptSubmit
subagent-context Gives subagents awareness of active loop state SubagentStart
dev-rules-reminder Re-injects plan path and code standards after compaction UserPromptSubmit
simplify-gate Warns at 400 LOC, blocks at 800 LOC before shipping UserPromptSubmit
session-init Sets up project context at session start SessionStart
stop-notify Terminal notification + optional webhook on session end SessionEnd

Configuration

All hooks are on by default. Disable individually:

# Disable a specific hook
export AR_DISABLE_SCOUT_BLOCK=1
export AR_DISABLE_PRIVACY_BLOCK=1
export AR_DISABLE_DANGEROUS_CMD_BLOCK=1
# ... etc for each hook name

Optional webhook for session completion notifications:

export AR_NOTIFY_WEBHOOK=https://hooks.slack.com/services/...

Customize blocked directories with a .ckignore file (gitignore syntax) at your project root.

See guide/hooks.md for full reference.


Commands

Command What it does Default Iterations
/autoresearch Core iterate loop: modify → verify → keep/discard 25
/autoresearch:plan Convert goal into validated config one-shot
/autoresearch:debug Hunt bugs via hypothesis iteration 15
/autoresearch:fix Crush errors one-by-one to zero 20
/autoresearch:security STRIDE + OWASP audit with red-team 15
/autoresearch:ship Ship through 8 phases linear
/autoresearch:scenario Generate edge cases across 12 dimensions 20
/autoresearch:predict 5 expert personas debate one-shot
/autoresearch:learn Scout → generate docs → validate → fix 10
/autoresearch:reason Adversarial debate with blind judges 8
/autoresearch:probe 8 personas interrogate requirements 15
/autoresearch:improve Research ICP, discover improvements, generate PRDs 15
/autoresearch:evals Analyze iteration results: trends, plateaus one-shot

Universal flags: Iterations: N, Iterations: unlimited, --evals, --evals-interval N, --chain <targets>, --<subcommand> shorthand.

All commands use interactive setup when invoked without arguments. Just type the command — the agent asks for what it needs with smart defaults based on your codebase.

OpenCode users: Commands use underscore naming (/autoresearch_debug, /autoresearch_fix, etc.). All 13 commands available.

Codex users: Invoke via $autoresearch mention syntax. Subcommands are keywords: $autoresearch debug, $autoresearch plan, etc.

Quick Decision Guide

I want to... Use
Improve test coverage / reduce bundle size / any metric /autoresearch
Run bounded iterations Add Iterations: N to any command
Don't know what metric to use /autoresearch:plan
Run a security audit /autoresearch:security
Ship a PR / deployment / release /autoresearch:ship
Optimize without breaking existing tests Add Guard: npm test
Hunt all bugs in a codebase /autoresearch:debug
Fix all errors (tests, types, lint) /autoresearch:fix
Debug then auto-fix /autoresearch:debug --fix
Check if something is ready to ship /autoresearch:ship --checklist-only
Explore edge cases for a feature /autoresearch:scenario
Generate test scenarios /autoresearch:scenario --format test-scenarios
Get expert opinions before starting /autoresearch:predict
Analyze from multiple angles then debug /autoresearch:predict --chain debug
Generate docs for a new codebase /autoresearch:learn --mode init
Update existing docs after changes /autoresearch:learn --mode update
Debate an architecture decision /autoresearch:reason --domain software
Surface hidden constraints before starting /autoresearch:probe
Pre-flight a fuzzy goal then loop /autoresearch:probe --chain plan,autoresearch
Discover what to build next for your ICP /autoresearch:improve
Research competitors and generate PRDs /autoresearch:improve --depth deep
Probe requirements then research improvements /autoresearch:probe --improve
Analyze trends and plateaus across past runs /autoresearch:evals
Check if a run has stalled /autoresearch:evals --file *-results.tsv

Quick Start

Claude Code

Option A — npx install (recommended):

npx skills add uditgoenka/autoresearch

All 13 commands are available after restarting Claude Code.

Option B — Plugin install:

/plugin marketplace add uditgoenka/autoresearch
/plugin install autoresearch@autoresearch

Note: Start a new Claude Code session after installing. Reference files aren't resolvable in the same session where installation happened — this is a Claude Code platform limitation.

Updating (no reinstall needed):

/plugin update autoresearch

Run /reload-plugins to activate. No need to uninstall or re-clone.

Option C — Manual copy:

git clone https://github.com/uditgoenka/autoresearch.git

# Copy skill + subcommands to your project
cp -r autoresearch/.claude/skills/autoresearch .claude/skills/autoresearch
cp -r autoresearch/.claude/commands/autoresearch .claude/commands/autoresearch
cp autoresearch/.claude/commands/autoresearch.md .claude/commands/autoresearch.md

Or install globally:

cp -r autoresearch/.claude/skills/autoresearch ~/.claude/skills/autoresearch
cp -r autoresearch/.claude/commands/autoresearch ~/.claude/commands/autoresearch
cp autoresearch/.claude/commands/autoresearch.md ~/.claude/commands/autoresearch.md

Option D — Guided installer:

git clone https://github.com/uditgoenka/autoresearch.git
cd autoresearch
./scripts/install.sh --claude --global

OpenCode Quick Start

Option A — Guided installer (recommended):

git clone https://github.com/uditgoenka/autoresearch.git
cd autoresearch
./scripts/install.sh --opencode --global

Option B — Manual copy:

git clone https://github.com/uditgoenka/autoresearch.git

cp -r autoresearch/.opencode/skills/autoresearch .opencode/skills/autoresearch
cp autoresearch/.opencode/commands/autoresearch*.md .opencode/commands/

Or globally:

cp -r autoresearch/.opencode/skills/autoresearch ~/.config/opencode/skills/autoresearch
cp autoresearch/.opencode/commands/autoresearch*.md ~/.config/opencode/commands/

All 13 commands available as /autoresearch_debug, /autoresearch_fix, /autoresearch_improve, etc.

Codex Quick Start

Option A — Guided installer (recommended):

git clone https://github.com/uditgoenka/autoresearch.git
cd autoresearch
./scripts/install.sh --codex --global

Option B — Manual copy:

git clone https://github.com/uditgoenka/autoresearch.git
cp -r autoresearch/.agents/skills/autoresearch ~/.codex/skills/autoresearch

Invoke via $autoresearch mention syntax. Subcommands are keywords: $autoresearch plan, $autoresearch debug, $autoresearch evals, etc.

Run It

/autoresearch
Goal: Increase test coverage from 72% to 90%
Scope: src/**/*.test.ts, src/**/*.ts
Metric: coverage % (higher is better)
Verify: npm test -- --coverage | grep "All files"
Iterations: 25

Claude reads all files, establishes a baseline, and starts iterating — one change at a time. Keeps improvements, auto-reverts failures, logs everything. Stops after N iterations or when you interrupt.


/autoresearch:plan — Goal to Config

The hardest part isn't the loop — it's defining Scope, Metric, and Verify correctly. /autoresearch:plan converts your plain-language goal into a validated, ready-to-execute configuration.

/autoresearch:plan
Goal: Make the API respond faster

Walks through 5 steps: capture goal → define scope → define metric → define direction → validate verify command (dry-run). Every gate is mechanical — scope must resolve to files, metric must output a number, verify must pass a dry-run. Emits a handoff.json for chaining.


/autoresearch:debug — Autonomous Bug Hunter

Scientific method meets autoresearch loop. Doesn't stop at one bug — iteratively hunts ALL bugs using falsifiable hypotheses, evidence-based investigation, and 7 investigation techniques.

/autoresearch:debug
Scope: src/api/**/*.ts
Symptom: API returns 500 on POST /users
Iterations: 15

How it works: Gather symptoms → Recon → Hypothesize (specific, testable) → Test (one experiment per iteration) → Classify (confirmed/disproven/inconclusive) → Log → Repeat.

Every finding requires code evidence (file:line + reproduction steps). Every disproven hypothesis is logged — equally valuable.

Flag Purpose
--fix After hunting, auto-switch to /autoresearch:fix
--scope <glob> Limit investigation scope
--symptom "<text>" Pre-fill symptom
--severity <level> Minimum severity to report

/autoresearch:fix — Autonomous Error Crusher

Takes a broken state and iteratively repairs it until everything passes. ONE fix per iteration. Atomic, committed, verified, auto-reverted on failure.

/autoresearch:fix
Iterations: 20

Auto-detects what's broken (tests, types, lint, build) → Prioritizes (blockers first) → Fixes ONE thing → Commits → Verifies error count decreased → Guard check → Keep/Revert → Repeat. Stops automatically when error count hits zero.

Flag Purpose
--target <command> Explicit verify command
--guard <command> Safety command that must always pass
--category <type> Only fix specific type (test, type, lint, build)
--from-debug Read findings from latest debug session

Chain them: /autoresearch:debug/autoresearch:fix --from-debug


/autoresearch:security — Autonomous Security Audit

Read-only security audit using STRIDE threat modeling, OWASP Top 10 sweeps, and red-team adversarial analysis with 4 hostile personas.

/autoresearch:security
Iterations: 15

Codebase recon → asset inventory → trust boundaries → STRIDE threat model → attack surface map → autonomous testing loop → structured report. Every finding requires code evidence (file:line + attack scenario).

Flag Purpose
--diff Only audit files changed since last audit
--fix Auto-fix confirmed Critical/High findings
--fail-on <severity> Exit non-zero for CI/CD gating

Output: Creates security/{date}-{slug}/ with 7 structured report files.


/autoresearch:ship — Universal Shipping Workflow

Ship anything through 8 phases: Identify → Inventory → Checklist → Prepare → Dry-run → Ship → Verify → Log.

/autoresearch:ship --auto

Auto-detects what you're shipping (code PR, deployment, blog post, email campaign, sales deck, research paper, design assets) and generates domain-specific checklists — every item mechanically verifiable.

Flag Purpose
--dry-run Validate everything but don't ship
--auto Auto-approve if checklist passes
--force Skip non-critical items (blockers still enforced)
--rollback Undo last ship action
--monitor N Post-ship monitoring for N minutes
--checklist-only Just check readiness

9 supported types: code-pr, code-release, deployment, content, marketing-email, marketing-campaign, sales, research, design.


/autoresearch:scenario — Scenario Explorer

Autonomous scenario exploration engine. Takes a seed scenario and iteratively generates situations across 12 dimensions — happy paths, errors, edge cases, abuse, scale, concurrency, temporal, data variation, permissions, integrations, recovery, and state transitions.

/autoresearch:scenario
Scenario: User attempts to checkout with multiple payment methods
Iterations: 20

Seed analysis → Decompose into 12 dimensions → Generate ONE situation per iteration → Classify (new/variant/duplicate) → Expand edge cases → Log → Repeat.

Flag Purpose
--domain <type> software, product, business, security, marketing
--depth <level> shallow (10), standard (20), deep (50+)
--format <type> use-cases, user-stories, test-scenarios, threat-scenarios
--focus <area> edge-cases, failures, security, scale

/autoresearch:predict — Multi-Persona Prediction

Before you debug, fix, or ship — get 5 expert perspectives in 2 minutes.

Simulates a team (Architect, Security Analyst, Performance Engineer, Reliability Engineer, Devil's Advocate) who independently analyze your code, debate findings, and reach consensus.

/autoresearch:predict --chain debug
  • --chain debug — pre-ranked hypotheses before debugging
  • --chain security — multi-persona red team analysis
  • --chain scenario,debug,fix — full quality pipeline

/autoresearch:learn — Autonomous Documentation Engine

Scout codebase → generate docs → validate → fix → repeat. 4 modes: init (create from scratch), update (refresh existing), check (read-only health report), summarize (quick overview).

/autoresearch:learn --mode init --depth deep
Iterations: 10

Dynamic doc discovery, project-type detection, validation-fix loop, git-diff scoping for updates, selective single-doc update with --file. Auto-generates Mermaid architecture diagrams, API reference, testing guide, config guide, and cross-reference links.


/autoresearch:reason — Adversarial Refinement

Extends autoresearch to subjective domains where no objective metric exists. The blind judge panel is the fitness function.

/autoresearch:reason
Task: Should we use event sourcing for our order management system?
Domain: software
Iterations: 8

How it works: Generate-A → Critic attacks → Author-B responds → Synthesizer merges → Blind judge panel (randomized labels) picks winner → Winner becomes new A → Repeat until convergence. Every agent is a cold-start fresh invocation — no history bleed.

Flag Purpose
--judges N Judge count (3-7, odd preferred)
--convergence N Consecutive wins to converge (default 3)
--mode <mode> convergent (default), creative, debate
--domain <type> software, product, business, security, research, content
--chain <targets> Chain converged output to any autoresearch command

Output: Creates reason/{date}-{slug}/ with lineage.md, candidates.md, judge-transcripts.md, reason-results.tsv, handoff.json.


/autoresearch:probe — Adversarial Requirement Interrogation

Eight adversarial personas interrogate user and codebase together until net-new constraints saturate. Output is the 5 autoresearch primitives (Goal/Scope/Metric/Direction/Verify) plus a handoff.json ready to feed any downstream command.

/autoresearch:probe --chain plan,autoresearch
Topic: Add multi-tenant isolation to the database layer

The 8 personas: Skeptic, Edge-Case Hunter, Scope Sentinel, Ambiguity Detective, Contradiction Finder, Prior-Art Investigator, Success-Criteria Auditor, Constraint Excavator.

Flag Purpose
--depth <level> shallow (5 rounds), standard (15), deep (30)
--adversarial Rotate Skeptic + Contradiction Finder + Edge-Case Hunter to front
--mode <mode> interactive (default) or autonomous
--chain <targets> plan, predict, debug, scenario, reason, fix, ship, learn

Output: Creates probe/{date}-{slug}/ with probe-spec.md, constraints.tsv, autoresearch-config.yml, handoff.json.


/autoresearch:improve — Product Improvement Engine

Research what to build next. Discovers ICP challenges via deep multi-source research, scores and ranks improvements, generates per-feature PRDs with evidence chains.

/autoresearch:improve
Goal: Improve onboarding conversion
ICP: B2B SaaS product managers at 50-500 person companies

How it works: Resolve product context → Research across 5 categories (ICP challenges, competitor gaps, market trends, UX & experience, revenue & growth) → Saturate → ICP binary gate → Tiered ranking (Must-have / Nice-to-have / Moonshot) → User selects features → Generate PRDs.

Flag Purpose
--icp "<text>" Ideal customer profile
--discover Force codebase scan even with existing context
--no-discover Skip auto-discover
--depth <level> shallow (5), standard (15), deep (30+)
--seeds <categories> Override default research categories

Output: Creates improve/{date}-{slug}/ with research-findings.md, improvement-plan.md, per-feature PRDs, summary.md, improve-results.tsv, handoff.json.

Terminal emitter — improve is the last link in any autoresearch chain. PRDs are consumed by external tools (/ck:plan, /ck:cook), not by other autoresearch commands.

Chain into improve: /autoresearch:probe --improve, /autoresearch:predict --improve, /autoresearch:debug --improve.


/autoresearch:evals — Results Analyzer

Analyzes *-results.tsv files from any autoresearch run. Surfaces trends, plateau detection, convergence signals, and iteration efficiency. Backward compatible with v2.0.x TSV format.

/autoresearch:evals
/autoresearch:evals --file coverage-results.tsv

Adaptive checkpoints: floor(max_iterations/3), minimum 1 checkpoint. Reports per-checkpoint delta, stall detection, best iteration, and a recommendation (continue / stop / change strategy).

Inline evals during a run:

/autoresearch
Goal: Reduce bundle size below 200kb
Iterations: 30
--evals-interval 10

Prints a checkpoint report every 10 iterations without interrupting the loop.


Guard — Prevent Regressions

When optimizing a metric, the loop might break existing behavior. Guard is an optional safety net.

/autoresearch
Goal: Reduce API response time to under 100ms
Verify: npm run bench:api | grep "p95"
Guard: npm test
  • Verify = "Did the metric improve?" (the goal)
  • Guard = "Did anything else break?" (the safety net)

If the metric improves but the guard fails, Claude reworks the optimization (up to 2 attempts). Guard/test files are never modified.

Credit: Guard was contributed by @pronskiy (JetBrains) in PR #7.


Results Tracking

Every iteration is logged in TSV format:

iteration  commit   metric  delta   status    description
0          a1b2c3d  85.2    0.0     baseline  initial state
1          b2c3d4e  87.1    +1.9    keep      add tests for auth edge cases
2          -        86.5    -0.6    discard   refactor test helpers (broke 2 tests)
3          c3d4e5f  88.3    +1.2    keep      add error handling tests

Run /autoresearch:evals at any time to analyze trends across any TSV file. Adaptive checkpoints fire at floor(max_iterations/3) intervals.


Crash Recovery

Failure Response
Syntax error Fix immediately, don't count as iteration
Runtime error Attempt fix (max 3 tries), then move on
Resource exhaustion Revert, try smaller variant
Infinite loop / hang Kill after timeout, revert
External dependency Skip, log, try different approach

Repository Structure

autoresearch/
├── README.md
├── COMPARISON.md                                  ← Karpathy's vs Claude Autoresearch
├── guide/                                         ← Guides — one per command + advanced patterns
├── scripts/
│   ├── install.sh                                 ← Guided installer (Claude Code + OpenCode + Codex)
│   ├── transform.sh                               ← Single transform: .claude/ → .opencode/ + .agents/
│   ├── release.sh                                 ← Release automation
│   └── release.md                                 ← Release checklist
├── .claude/
│   ├── skills/autoresearch/
│   │   ├── SKILL.md                               ← Thin routing table (41 lines)
│   │   └── references/                            ← 3 focused reference files
│   │       ├── security-checklist.md              ← STRIDE + OWASP
│   │       ├── predict-personas.md                ← 5 personas + adversarial set
│   │       └── reason-judge-protocol.md           ← Adversarial refinement loop
│   └── commands/
│       ├── autoresearch.md                        ← Core loop (self-contained, ~100 lines)
│       └── autoresearch/                          ← 12 subcommand files (self-contained)
│           ├── plan.md
│           ├── debug.md
│           ├── fix.md
│           ├── security.md
│           ├── ship.md
│           ├── scenario.md
│           ├── predict.md
│           ├── learn.md
│           ├── reason.md
│           ├── improve.md
│           ├── probe.md
│           └── evals.md
├── .opencode/                                     ← OpenCode port (via transform.sh)
│   ├── skills/autoresearch/
│   └── commands/                                  ← 13 command files (autoresearch_*.md)
├── .agents/                                       ← Codex port (via transform.sh)
│   └── skills/autoresearch/
├── plugins/                                       ← Codex plugin metadata
│   └── openai.yaml
└── LICENSE

FAQ

Q: I don't know what metric to use. A: Run /autoresearch:plan — it analyzes your codebase, suggests metrics, and dry-runs the verify command before you launch.

Q: What changed in v2.1.0? A: Architecture rebuild. The monolithic SKILL.md (813 lines, 100K tokens) is replaced with a thin routing file + 12 self-contained command files (5–8K tokens each). 95% token reduction. A new /autoresearch:evals command analyzes iteration results. Every looping command now has a bounded default instead of running unlimited.

Q: How do bounded defaults work? A: Every looping command ships with a sensible default (e.g., /autoresearch defaults to 25 iterations). Override inline: Iterations: 50 for more, Iterations: unlimited for the old unbounded behavior.

Q: How does /autoresearch:evals work? A: Point it at any *-results.tsv file from a previous run. It reports trends, plateau detection, and a recommendation. Use --evals-interval N during a live run to get checkpoint reports without interrupting the loop.

Q: Does this work with any project? A: Yes. Any language, framework, or domain. Install via plugin (Claude Code), installer script, or manual copy.

Q: Does this work with OpenCode? A: Yes. Run ./scripts/install.sh --opencode --global or manually copy .opencode/ files. Commands use underscore naming (/autoresearch_debug, /autoresearch_evals, etc.). All 13 commands available.

Q: Does this work with OpenAI Codex? A: Yes. Run ./scripts/install.sh --codex --global or copy .agents/skills/autoresearch/ to ~/.codex/skills/autoresearch. Invoke via $autoresearch mention syntax.

Q: How do I stop the loop? A: Ctrl+C or add Iterations: N to your inline config. Claude commits before verifying, so your last successful state is always in git.

Q: Can I use this for non-code tasks? A: Absolutely. Sales emails, marketing copy, HR policies, runbooks — anything with a measurable metric. See Examples by Domain.

Q: Does /autoresearch:security modify my code? A: No. Read-only by default. Use --fix to opt into auto-remediation of confirmed Critical/High findings.

Q: What's the difference between /autoresearch:predict and /autoresearch:reason? A: Predict is a one-shot analysis — 5 experts debate your existing code. Reason is an iterative refinement loop — competing candidates are generated, critiqued, synthesized, and blind-judged over multiple rounds until convergence. Use predict for analysis before acting; use reason for decisions where no objective metric exists.

Q: What is handoff.json? A: A structured file emitted by plan, probe, reason, and other commands that carries Goal/Scope/Metric/Verify config for downstream commands. When you --chain plan,autoresearch, the chain reads handoff.json automatically.


Contributing

Contributions welcome. See CONTRIBUTING.md.

Areas of interest: new domain examples, verification script templates, CI/CD integrations, real-world benchmarks. All guides are in guide/.


Star History

Star History Chart

License

MIT — see LICENSE.


Credits


About the Author

Udit Goenka

Udit Goenka — AI Product Expert, Founder & Angel Investor

Self-taught builder who went from a slow internet connection in India to founding multiple companies and helping 700+ startups generate over ~$25m in revenue.

Building: TinyCheque (India's first agentic AI venture studio) · Firstsales.io (sales automation)

Investing: 38 startups backed, 6 exits. Focused on early-stage AI and SaaS.

Connect: udit.co · @iuditg · @uditgoenka · Newsletter

"Autonomy scales when you constrain scope, clarify success, mechanize verification, and let agents optimize tactics while humans optimize strategy."

Apps
About Me
GitHub: Trinea
Facebook: Dev Tools
AI Daily Digest