# autoresearch

Autoresearch is a closed-loop ML experimentation workflow:
- human writes `program.md`
- agent edits `train.py`
- `prepare.py` stays fixed
- every run gets the same 300-second budget
- lower `val_bpb` wins
- regressions get reverted

This skill should behave like a routing-first front door, not a giant tutorial. Pick the user's mode, enforce the immutable-harness rules, then hand the user to the smallest useful script or reference.

## When to use this skill

- Set up autoresearch on a real GPU machine
- Write or refine `program.md` before a session
- Run a bounded overnight search loop
- Interpret `results.tsv` after a session
- Adapt the workflow to tighter VRAM constrai…

\\t' -k2 -n\nawk -F'\\t' '{print $4}' results.tsv | sort | uniq -c\n```\n\nSummarize only four things: best gains, repeated failures, what should move into `What Has Been Tried`, and the next narrow experiment family.\n\n#### Mode E — Constrained-hardware adaptation\n\nUse this path when VRAM, platform, or runtime constraints dominate.\n\nRules:\n- choose `MAX_SEQ_LEN` and `EVAL_TOKENS` **before** the session\n- never change them mid-session\n- lower model/search ambition before mutating the evaluator\n- prefer route-outs to community forks for Apple Silicon / non-CUDA paths\n\nFor concrete values and troubleshooting, use `references/hardware-config.md`.\n\n### Step 4: Route out aggressively when the request is adjacent\n\nRoute out when:\n- the user wants to optimize instructions, prompts, or repo-local skills → `skill-autoresearch`\n- the user wants app-level traces, feedback review, observability, or online/offline eval dashboards → LangSmith / Braintrust / Weave / Promptfoo\n- the user wants general literature synthesis rather than a runnable ML loop → research or survey tooling\n\n### Step 5: Keep the heavy detail in support files\n\nUse support files instead of re-explaining everything inline:\n- `references/operating-modes-and-route-outs.md` — fast routing table, minimal response shape, and handoff logic\n- `references/architecture.md` — immutability contract, file map, metric rationale\n- `references/program-md-guide.md` — templates and update rules\n- `references/hardware-config.md` — VRAM tables and platform troubleshooting\n- `scripts/*.sh` — runnable setup / loop / reporting helpers\n\n## Available scripts\n\nRun from inside the autoresearch repository directory:\n\n| Script | Purpose | Usage |\n|--------|---------|-------|\n| `setup.sh` | One-time environment setup | `bash scripts/setup.sh [--seq-len 512]` |\n| `run-experiment.sh` | Single 5-minute experiment + metric extraction | `bash scripts/run-experiment.sh` |\n| `run-loop.sh` | Autonomous loop: 
run → keep/revert → repeat | `bash scripts/run-loop.sh [--max 20]` |\n| `show-results.sh` | Human-readable `results.tsv` report | `bash scripts/show-results.sh [--top 10]` |\n| `check-hardware.sh` | GPU/CUDA/uv readiness check (JSON output) | `bash scripts/check-hardware.sh` |\n\n## References\n\nDetailed documentation in `references/`:\n\n| File | Contents |\n|------|----------|\n| `references/operating-modes-and-route-outs.md` | Mode picker, adjacency boundaries, and minimal output contract |\n| `references/architecture.md` | System design, immutability contract, git ratcheting, metric rationale |\n| `references/program-md-guide.md` | How to write and update effective `program.md` directives |\n| `references/hardware-config.md` | VRAM settings by GPU, memory optimization, platform troubleshooting |\n\n## Examples\n\n### Example 1: First 40GB GPU session\n\nRequest: “Help me run Karpathy autoresearch on a 40GB GPU.”\n\nExpected behavior:\n- choose **Setup readiness** first\n- verify hardware and dependencies\n- run one baseline experiment\n- route to `program.md` authoring only after the baseline exists\n\n### Example 2: User wants to optimize a skill instead\n\nRequest: “Can autoresearch help me improve this `SKILL.md` with binary evals?”\n\nExpected behavior:\n- route out immediately to `skill-autoresearch`\n- explain that this skill is for real ML training search on `train.py`\n\n## Best practices\n\n1. **Start with the smallest mode that fits** — setup, authoring, run loop, interpretation, or hardware adaptation\n2. **Baseline before bravado** — confirm one successful run before talking about overnight loops\n3. **Freeze the evaluator before the session** — `prepare.py`, `TIME_BUDGET`, `MAX_SEQ_LEN`, and `EVAL_TOKENS` must stay comparable\n4. **One meaningful experiment at a time** — ablations beat mystery bundles\n5. **Keep `results.tsv` append-only** — discarded runs are still evidence\n6. 
**Push deep detail into references/scripts** — the front door should classify and route, not duplicate every table\n7. **Route adjacent jobs away early** — prompt/app eval and `SKILL.md` optimization are different lanes\n\n## References\n\n- [GitHub — karpathy/autoresearch](https://github.com/karpathy/autoresearch)\n- [Karpathy — A Recipe for Training Neural Networks](https://karpathy.github.io/2019/04/25/recipe/)\n- [MLflow Tracking](https://mlflow.org/docs/latest/ml/tracking/)\n- [Weights & Biases Tracking](https://docs.wandb.ai/guides/track/)\n- [MIT License](https://github.com/karpathy/autoresearch/blob/master/LICENSE)\n---","attachment_filenames":["evals/evals.json","references/architecture.md","references/hardware-config.md","references/operating-modes-and-route-outs.md","references/program-md-guide.md","scripts/check-hardware.sh","scripts/run-experiment.sh","scripts/run-loop.sh","scripts/setup.sh","scripts/show-results.sh"],"attachments":[{"filename":"evals/evals.json","content":"{\n \"skill_name\": \"autoresearch\",\n \"evals\": [\n {\n \"id\": 1,\n \"prompt\": \"Help me set up Karpathy autoresearch on a 40GB GPU, verify the machine, and run the first baseline experiment.\",\n \"expected_output\": \"The skill chooses setup-readiness mode, preserves the immutable harness, and points to concrete setup / hardware / baseline commands instead of dumping every downstream detail at once.\",\n \"assertions\": [\n \"The workflow clearly identifies setup or readiness as the chosen mode.\",\n \"The workflow includes hardware verification or baseline-run commands such as `check-hardware.sh`, `uv sync`, `uv run prepare.py`, or `uv run train.py`.\",\n \"The response keeps the immutable 300-second / `prepare.py` / `val_bpb` contract visible.\"\n ]\n },\n {\n \"id\": 2,\n \"prompt\": \"My repo runs, but my `program.md` just says 'try improvements'. 
Help me fix it before tonight's run.\",\n \"expected_output\": \"The skill chooses `program.md` authoring mode, asks for or fills in the baseline/priority/constraints structure, and keeps the loop tied to `train.py` rather than generic eval tooling.\",\n \"assertions\": [\n \"The workflow explicitly selects `program.md` authoring or an equivalent mode.\",\n \"The response includes sections such as current baseline, directions to explore, tried-already notes, or constraints.\",\n \"The response does not drift into prompt / app observability tooling as the main answer.\"\n ]\n },\n {\n \"id\": 3,\n \"prompt\": \"My smaller-GPU autoresearch run keeps crashing, and I want to change the evaluator halfway through to get something working.\",\n \"expected_output\": \"The skill keeps the evaluator immutable inside the session, recommends constrained-hardware adaptation, and treats any evaluator change as a new comparison track.\",\n \"assertions\": [\n \"The skill warns against modifying `prepare.py`, `TIME_BUDGET`, or the active evaluator mid-session.\",\n \"The response recommends hardware-aware adaptation such as lowering `MAX_SEQ_LEN`, adjusting `EVAL_TOKENS` before the session, or using the hardware reference guidance.\",\n \"The response frames evaluator changes as a new comparison track rather than part of the current run.\"\n ]\n },\n {\n \"id\": 4,\n \"prompt\": \"Can autoresearch help me improve this SKILL.md with binary evals and keep-or-revert scoring?\",\n \"expected_output\": \"The skill routes the request away from ML search toward `skill-autoresearch` or adjacent eval tooling instead of pretending this training-loop skill is the right fit.\",\n \"assertions\": [\n \"The skill explicitly says this is not the right tool for repo-local `SKILL.md` optimization.\",\n \"The response names `skill-autoresearch` as the preferred route-out.\",\n \"The response preserves the ML-specific `program.md` / `train.py` / `val_bpb` boundary.\"\n ]\n }\n 
]\n}\n","content_type":"application/json; charset=utf-8","language":"json","size":2942,"content_sha256":"08e3064a654464844e85f34a90b9cfc5f4b6bfdc7629620b416debc09b77c351"},{"filename":"references/architecture.md","content":"# autoresearch Architecture Reference\n\nSource: [karpathy/autoresearch](https://github.com/karpathy/autoresearch) · MIT License\n\n---\n\n## Overview\n\nautoresearch is a closed-loop ML experimentation system. A human authors `program.md`; an AI agent reads it and autonomously modifies `train.py`, executes experiments, and commits only improvements — creating a monotonically improving research branch overnight.\n\n---\n\n## File Map\n\n```\nautoresearch/\n├── train.py ← Agent's ONLY editable file (~630 lines)\n├── prepare.py ← Immutable: data pipeline + evaluate_bpb() + MAX_SEQ_LEN/TIME_BUDGET/EVAL_TOKENS\n├── program.md ← Human-written research directives (agent reads this)\n├── pyproject.toml ← Locked dependencies (no new packages allowed)\n└── results.tsv ← Persistent experiment log (all runs)\n```\n\n### Immutability Contract\n\n| File | Agent Access | Rationale |\n|------|-------------|-----------|\n| `train.py` | Read + Write | The search space — architecture, optimizer, hyperparameters |\n| `prepare.py` | Read-only | Contains `evaluate_bpb()` plus `MAX_SEQ_LEN`, `TIME_BUDGET`, and `EVAL_TOKENS` — must never change for fair comparison |\n| `program.md` | Read-only | Human's intent — agent follows, never modifies |\n| `pyproject.toml` | Read-only | Locked deps — no `pip install` during search |\n| `results.tsv` | Append-only | Monotonic experiment log — never delete rows |\n\n---\n\n## The Experiment Loop\n\n```\n┌─────────────────────────────────────────────────────────┐\n│ AGENT LOOP │\n│ │\n│ 1. Read program.md + current train.py │\n│ 2. Formulate hypothesis (architecture / optimizer) │\n│ 3. Edit train.py → git commit │\n│ 4. uv run train.py (exactly 300 seconds) │\n│ 5. 
grep \"^val_bpb:\" run.log → extract metric │\n│ │\n│ ┌─── improved? ──────┐ ┌─── not improved? ──────┐ │\n│ │ git commit stays │ │ git reset HEAD~1 │ │\n│ │ update baseline │ │ baseline unchanged │ │\n│ └────────────────────┘ └─────────────────────────┘ │\n│ │\n│ 6. Append row to results.tsv │\n│ 7. Repeat from step 1 │\n└─────────────────────────────────────────────────────────┘\n```\n\n---\n\n## Key Design Decisions\n\n### 1. Fixed 300-Second Budget\n\n`TIME_BUDGET = 300` lives in `prepare.py`. Every experiment runs for exactly 300 seconds wall-clock time regardless of GPU or model size.\n\n**Why**: Ensures every row in `results.tsv` is directly comparable. A `val_bpb` of 0.97 in experiment 3 means the same thing as 0.97 in experiment 97.\n\n**Throughput**: ~12 experiments/hour → ~100 experiments in an overnight session.\n\n### 2. Immutable Evaluation Harness\n\n`evaluate_bpb()` in `prepare.py` is never modified. It always evaluates on the same validation shard (the last FineWeb-Edu parquet file), with the same tokenizer, for the same `EVAL_TOKENS` value chosen before the session starts.\n\n**Why**: Without a fixed harness, a clever agent could modify the evaluation to make its model appear better — \"metric hacking.\" The immutable harness prevents this.\n\n### 3. Single Metric: val_bpb\n\nValidation bits-per-byte (val_bpb) measures how many bits on average the model uses to predict each byte of validation text. Lower = better.\n\n**Why val_bpb over perplexity**: val_bpb is vocabulary-size-independent. If the agent experiments with different tokenizers (different vocabulary sizes), perplexity scores would be incomparable; val_bpb remains a fair metric across all configurations.\n\n### 4. Git Ratcheting\n\nEvery experiment is a git commit. On improvement: keep. 
On regression: `git reset HEAD~1`.\n\n**Result**: The main branch is a clean, monotonically improving history of algorithmic discoveries.\n\n**Side effect**: `results.tsv` retains the full record including all discarded experiments, providing a complete picture of the search.\n\n### 5. Fits in Context Window\n\nAt ~630 lines, `train.py` fits within a modern LLM's context window. The agent always has the complete file in view — no partial reads, no retrieval augmentation needed.\n\n---\n\n## train.py Structure\n\nThe default `train.py` contains:\n\n```\n1. Imports and device setup\n2. Hyperparameters (model, training, eval)\n3. GPT model definition\n - Transformer blocks\n - Attention (default: multi-head)\n - FFN layers\n - Layer normalization\n4. Optimizer setup (Muon + AdamW)\n5. Training loop\n - Forward pass\n - Loss computation\n - Backward pass\n - Optimizer step\n6. Evaluation call\n7. Metric output: val_bpb, peak_vram_mb\n```\n\n**Agent's search space** (everything in `train.py` is fair game):\n- Model architecture: depth, width, attention variants, FFN type\n- Positional encoding: learned, RoPE, ALiBi, none\n- Normalization: LayerNorm, RMSNorm, position\n- Optimizer: learning rate, schedules, momentum values, weight decay\n- Training: batch size, gradient accumulation, mixed precision\n\n---\n\n## results.tsv Format\n\n```tsv\ncommit val_bpb peak_vram_mb status description\n```\n\n| Column | Type | Values |\n|--------|------|--------|\n| `commit` | string | 7-char git hash |\n| `val_bpb` | float | Lower = better; `crash` if OOM/error |\n| `peak_vram_mb` | integer | Peak GPU memory in MB |\n| `status` | enum | `keep`, `discard`, `crash` |\n| `description` | string | Free text summary of the change |\n\n---\n\n## Karpathy's Documented Results\n\nFrom the original repo and public statements:\n\n| Session | Experiments | Improvements | Start val_bpb | Best val_bpb |\n|---------|-------------|--------------|---------------|--------------|\n| Session 1 | 126 | 
18 | 0.9979 | 0.9697 |\n| Tobi Lütke (Shopify CEO) | 37 | ~19% gain | — | — |\n\n**Key observation**: Improvements found on depth-12 transferred cleanly to depth-24, suggesting genuine algorithmic discoveries rather than overfitting to a particular scale.\n\n---\n\n## Platform Notes\n\nautoresearch is designed for a **single NVIDIA GPU on Linux**. Community forks extend this:\n\n| Platform | Status | Notes |\n|----------|--------|-------|\n| H100 80GB + Linux | Official | Default config |\n| A100 40GB | Supported | May need MAX_SEQ_LEN reduction |\n| RTX 4090 24GB | Community | MAX_SEQ_LEN ≤ 512 |\n| GTX 1660 Ti 6GB | Community | MAX_SEQ_LEN=256, reduced EVAL_TOKENS |\n| Apple Silicon (M-series) | MLX fork | Different optimizer API required |\n| Windows | Community | WSL2 + CUDA recommended |\n\n---\n\n## What the Agent Should Never Do\n\n1. Modify `prepare.py` mid-session — breaks evaluation fairness\n2. Change `TIME_BUDGET` — makes comparisons invalid\n3. Add new packages via pip — `pyproject.toml` is locked\n4. Delete rows from `results.tsv` — permanent record\n5. Push to main before human review — results branch only\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":7433,"content_sha256":"eb462ada18529f13c4e5da33a4ae7290af706dcbbfdbfe4cdd8f6cb93dcd6a13"},{"filename":"references/hardware-config.md","content":"# Hardware Configuration Reference\n\nConfigure autoresearch for your GPU. The key levers are `MAX_SEQ_LEN` and `EVAL_TOKENS` in `prepare.py`.\n\n> **Rule**: Never change these values mid-session. 
A session's `results.tsv` is only internally comparable if all rows use the same `MAX_SEQ_LEN` and `EVAL_TOKENS`.\n\n---\n\n## Recommended Settings by GPU\n\n| GPU | VRAM | MAX_SEQ_LEN | EVAL_TOKENS | ~Experiments/hr | Notes |\n|-----|------|-------------|-------------|-----------------|-------|\n| H100 80GB | 80 GB | 2048 | 20,971,520 | ~12 | Default config |\n| A100 80GB | 80 GB | 2048 | 20,971,520 | ~12 | Same as H100 |\n| A100 40GB | 40 GB | 1024 | 10,485,760 | ~12 | Halve both |\n| RTX 4090 | 24 GB | 512 | 5,242,880 | ~12 | Quarter both |\n| RTX 3090 | 24 GB | 512 | 5,242,880 | ~12 | Same as 4090 |\n| RTX 3080 Ti | 12 GB | 256 | 2,097,152 | ~12 | Eighth both |\n| GTX 1660 Ti | 6 GB | 256 | 2,097,152 | slower | Community tested |\n| Apple M-series | unified | — | — | — | MLX fork required |\n\n> `EVAL_TOKENS` should scale proportionally with `MAX_SEQ_LEN` to maintain evaluation quality.\n\n---\n\n## How to Apply Settings\n\nEdit `prepare.py` before running `uv run prepare.py` (one-time setup):\n\n```python\n# prepare.py — find and edit this line:\nMAX_SEQ_LEN = 2048 # change to your value\n\n# prepare.py — find and edit this line:\nEVAL_TOKENS = 20_971_520 # change to your value from the table above\n```\n\nOr use the setup script with `--seq-len`:\n\n```bash\nbash scripts/setup.sh --seq-len 512\n```\n\n> The setup script currently patches `MAX_SEQ_LEN` only. Update `EVAL_TOKENS` in `prepare.py` yourself using the table above so runs stay internally comparable.\n\n---\n\n## Checking Available VRAM\n\nBefore running any experiment:\n\n```bash\n# Live VRAM status\nnvidia-smi --query-gpu=name,memory.total,memory.free --format=csv\n\n# Run the built-in check script\nbash scripts/check-hardware.sh\n```\n\n---\n\n## Memory Optimization Techniques (when at VRAM limit)\n\nThese can be added to `train.py` by the agent if VRAM is tight:\n\n### 1. 
Gradient Checkpointing\n\nTrades compute for memory — recomputes activations during backward pass instead of storing them.\n\n```python\n# In model's forward() call inside the training loop:\nfrom torch.utils.checkpoint import checkpoint\n\noutput = checkpoint(block, x, use_reentrant=False) # instead of: output = block(x)\n```\n\n**Effect**: Reduces VRAM by ~30-40% at cost of ~20% slower training.\n\n### 2. Mixed Precision (BF16)\n\nMost modern GPUs support BF16 natively.\n\n```python\n# In training loop:\nwith torch.autocast(device_type='cuda', dtype=torch.bfloat16):\n logits = model(x)\n```\n\n**Effect**: ~40-50% VRAM reduction for activations.\n\n### 3. Reduce Batch Size + Gradient Accumulation\n\n```python\n# Instead of batch_size=64:\nbatch_size = 16\naccumulation_steps = 4 # equivalent effective batch size = 64\n```\n\n**Effect**: Linear VRAM reduction. Training unchanged if gradient accumulation compensates.\n\n### 4. Reduce Model Depth Before Width\n\nDepth scales VRAM more than width (intermediate activations). Try:\n\n```python\nn_layer = 8 # instead of 12\nn_embd = 768 # keep or increase slightly\n```\n\n---\n\n## VRAM Estimation Formula\n\nA rough estimate for transformer VRAM at training time:\n\n```\nVRAM_GB ≈ (parameters × 4 bytes) # model weights (fp32)\n + (parameters × 4 bytes) # gradients\n + (parameters × 8 bytes) # optimizer state (AdamW = 2× fp32)\n + (batch × seq_len × hidden × layers × 4) # activations (fp32)\n```\n\nFor a depth-12, hidden-768, MAX_SEQ_LEN=2048, batch=32 model:\n- Parameters: ~85M × 16 bytes ≈ 1.4 GB\n- Activations: 32 × 2048 × 768 × 12 × 4 bytes ≈ 2.4 GB\n- **Total: ~3.8 GB** by this formula, which counts only one hidden-state tensor per layer. Attention, FFN, and logit buffers add to real usage, so treat the result as a lower bound and confirm against `peak_vram_mb` from a baseline run.\n\nWith BF16 activations, the activation term halves to ~1.2 GB → **~2.6 GB total** by the same formula.\n\n---\n\n## Apple Silicon (MLX) Setup\n\nThe official repo requires NVIDIA CUDA. 
Community MLX port:\n\n```bash\n# Clone MLX fork (community maintained)\ngit clone https://github.com/[community-fork]/autoresearch-mlx\ncd autoresearch-mlx\n\n# Install mlx\npip install mlx mlx-lm\n\n# Setup and run\npython prepare_mlx.py\npython train_mlx.py\n```\n\n> Note: MLX uses different optimizer APIs and may have different architectural constraints. `results.tsv` values from MLX runs are NOT comparable to CUDA runs.\n\n---\n\n## Multi-GPU Notes\n\nautoresearch is designed for **single-GPU** training. The 300-second budget and `evaluate_bpb()` assume a single device.\n\nFor multi-GPU:\n- Do NOT use `DistributedDataParallel` in experiments (changes training dynamics)\n- `torch.nn.DataParallel` can be used as a workaround but introduces overhead\n- Results from multi-GPU runs may not be comparable to single-GPU baseline\n\nIf you have multiple GPUs, run **parallel independent autoresearch sessions** on different GPUs with different `program.md` research directions — then merge insights.\n\n---\n\n## Troubleshooting\n\n| Symptom | Likely Cause | Fix |\n|---------|-------------|-----|\n| OOM crash immediately | MAX_SEQ_LEN too large | Halve MAX_SEQ_LEN and re-run prepare.py |\n| OOM mid-training | Model too large for seq len | Reduce n_layer or n_embd |\n| val_bpb = nan | Learning rate too high | Reduce max_lr by 10× |\n| All experiments crash | CUDA not available to PyTorch | Check `nvidia-smi` + `uv run python -c \"import torch; print(torch.cuda.is_available())\"` |\n| Very slow training | CPU fallback active | Confirm CUDA available; check `torch.cuda.current_device()` |\n| prepare.py takes forever | Network slow / many shards | Set `MAX_SHARDS=100` in prepare.py for a smaller dataset |\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":5544,"content_sha256":"650ee4f75cb4b20cf8fc5dea6b30ca986f79a7cbf019e69e83c08f29e9388d3b"},{"filename":"references/operating-modes-and-route-outs.md","content":"# Operating Modes and 
Route-Outs\n\nUse this page when `SKILL.md` has already chosen the lane and you need the smallest next move.\n\n## Mode picker\n\n| User situation | Choose this mode | Immediate next artifact / command |\n|----------------|------------------|-----------------------------------|\n| Repo not installed or hardware unknown | Setup readiness | `bash scripts/check-hardware.sh` |\n| Repo runs but search direction is weak | `program.md` authoring | edit `program.md` using `program-md-guide.md` |\n| Baseline exists and loop is ready | Bounded run loop | `bash scripts/run-loop.sh --max N --desc session-name` |\n| Session finished and results need meaning | Results interpretation | `bash scripts/show-results.sh --top 10` |\n| VRAM / platform constraints dominate | Constrained-hardware adaptation | update `prepare.py` before the session and cross-check `hardware-config.md` |\n\n## Immutable-harness reminders\n\nCarry these into every mode:\n- `program.md` is human-authored for the session\n- `train.py` is the main mutable search surface\n- `prepare.py` is read-only once the session starts\n- `TIME_BUDGET=300` stays fixed\n- `results.tsv` is append-only\n- lower `val_bpb` wins\n\nIf any of those must change, start a **new comparison track** instead of mutating the active session.\n\n## Route-outs\n\n| If the request is really about... | Route to... 
| Why |\n|-----------------------------------|-------------|-----|\n| Improving a `SKILL.md`, prompt packet, or repo-local instruction artifact | `skill-autoresearch` | Same ratchet idea, different mutable artifact and evaluator |\n| App-level evals, prompt regression suites, traces, feedback review, dashboards | LangSmith / Promptfoo / Braintrust / Weave | Observability/eval infrastructure, not `train.py` ML search |\n| Literature review or broad research scan | `survey` or research skills | No runnable training loop yet |\n| Generic GPU setup without the autoresearch workflow | environment / MLOps skill in the host repo | Hardware setup alone is not the full lane |\n\n## Minimal response shape\n\nA good front-door answer usually fits this template:\n\n1. `Mode:` the one mode you chose\n2. `Immutable rules:` what must not change\n3. `Next step:` the file or command to touch now\n4. `Route-out:` only if the request is adjacent or mixed\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":2288,"content_sha256":"f830ed7f666983e77d2d4b403397cc148df4fdf76c54819d728611fcf5235eff"},{"filename":"references/program-md-guide.md","content":"# program.md Authoring Guide\n\n> \"The researcher's job shifts from writing Python to writing Markdown.\" — Andrej Karpathy\n\n`program.md` is the most important file in an autoresearch session. The agent reads it at the start of every loop iteration. A vague `program.md` wastes GPU hours; a precise one focuses the search.\n\n---\n\n## What program.md Controls\n\nThe agent uses `program.md` to decide:\n- **What to try next** — which hypotheses to form\n- **What to avoid** — directions already explored or known to fail\n- **What to prioritize** — VRAM efficiency vs. val_bpb vs. 
speed\n- **What constraints to respect** — VRAM limit, MAX_SEQ_LEN, banned packages\n\nThe agent does NOT use `program.md` to determine the evaluation metric (always `val_bpb`), the time budget (always 300s), or which file to edit (always `train.py`).\n\n---\n\n## Minimal Template\n\n```markdown\n# Research Program\n\n## Goal\nMinimize val_bpb on the FineWeb-Edu validation set.\n\n## Current Baseline\nval_bpb: [FILL IN after first run]\n\n## Directions to Explore\n[What to try]\n\n## Constraints\n- TIME_BUDGET=300s (fixed)\n- Peak VRAM must stay under [X] GB\n- Do not modify prepare.py\n- No new packages (pyproject.toml is locked)\n```\n\n---\n\n## Full Template with All Sections\n\n```markdown\n# Research Program\n\n## Goal\nMinimize val_bpb on FineWeb-Edu within the 300-second training budget.\nLower val_bpb is always better. Do not optimize for anything else.\n\n## Current Baseline\nval_bpb: 0.9979\nModel: depth-12 GPT, Muon + AdamW optimizer, RoPE, SwiGLU\nHardware: H100 80GB\n\n## Directions to Explore\n\n### High Priority (try these first)\n1. Attention variants: GQA (grouped-query), MLA (multi-head latent), sliding window\n2. Layer types: MoE (mixture of experts) FFN, SwiGLU vs. GeGLU\n3. Optimizer: Muon momentum values 0.90–0.98, AdamW β1/β2 grid\n4. Normalization: RMSNorm vs. LayerNorm, pre-norm vs. post-norm\n\n### Medium Priority\n5. Learning rate schedule: cosine vs. linear warmup + decay ratios\n6. Weight tying: tie embedding and output projection weights\n7. Depth/width tradeoffs: same FLOP budget, different aspect ratios\n\n### Low Priority / Exploratory\n8. Positional encoding: ALiBi, T5-style relative, sinusoidal\n9. 
Residual connection variants: pre-gate, scaled residuals\n\n## What Has Been Tried (Do Not Repeat)\n- Learned positional embeddings: worse than RoPE by ~0.008\n- Depth-8 with wider hidden: worse than depth-12\n- Pure AdamW (no Muon): worse by ~0.015\n\n## Constraints\n- Must complete in 300 seconds (TIME_BUDGET is fixed, do not change)\n- Peak VRAM must stay under 39 GB\n- Do not modify prepare.py or pyproject.toml\n- Do not add new packages\n- Each experiment should change ONE thing at a time (clean ablations)\n\n## Notes\n- depth-12 improvements transfer to depth-24 — focus on algorithms, not scale\n- Previous session found SwiGLU activation reliably helps\n```\n\n---\n\n## Writing Principles\n\n### 1. Record the Current Baseline\n\nAlways include the current best `val_bpb` and what model configuration produced it. The agent needs this to decide whether a new experiment is an improvement.\n\n```markdown\n## Current Baseline\nval_bpb: 0.9697\nModel: depth-12, GQA (4 KV heads), SwiGLU, Muon lr=0.01\n```\n\n### 2. List What Has Been Tried\n\nThis prevents the agent from re-running experiments that already failed. Be specific.\n\n```markdown\n## What Has Been Tried\n- GQA with 2 KV heads: DISCARD (val_bpb=0.981, worse than baseline)\n- MoE with 8 experts: CRASH (OOM at MAX_SEQ_LEN=2048)\n- cosine LR schedule: KEEP (val_bpb=0.971)\n```\n\n### 3. One Change at a Time\n\nInstruct the agent to change ONE architectural component per experiment. Combined changes make it impossible to attribute improvements.\n\n```markdown\n## Constraints\n- Each experiment must change exactly ONE component of train.py\n- Do not combine architecture changes with optimizer changes\n```\n\n### 4. Specify VRAM Budget\n\nThe agent needs to know the GPU's headroom to avoid OOM crashes.\n\n```markdown\n## VRAM Constraint\nPeak VRAM must stay under 38 GB.\nIf an experiment would exceed this, try reducing hidden_size by 20%\nbefore abandoning the approach.\n```\n\n### 5. 
Prioritize Directions\n\nAn ordered list focuses the agent's first experiments on the most promising directions.\n\n```markdown\n## Exploration Priority\n1. Attention: GQA (most likely to help, low risk)\n2. Optimizer: Muon hyperparameters (easy, known to matter)\n3. FFN: SwiGLU variants (medium risk)\n4. Architecture: depth/width (expensive, do last)\n```\n\n---\n\n## Updating program.md Between Sessions\n\nAfter each session, update `program.md` with:\n\n```markdown\n## Session Log (append at bottom)\n\n### Session 2026-03-11 (126 experiments, 18 improvements)\nBest achieved: val_bpb=0.9697 (commit a3f2c91)\nTop gains:\n- SwiGLU activation: -0.012 (commit 8e2a1b3)\n- GQA 4 KV heads: -0.009 (commit 5d7f9c2)\n- Muon momentum 0.95: -0.006 (commit 2c8e4a1)\n\nWhat to try next:\n- MLA (multi-head latent attention) — not yet explored\n- Mixture of Depths — promising from literature\n```\n\n---\n\n## Common Mistakes\n\n| Mistake | Problem | Fix |\n|---------|---------|-----|\n| No baseline recorded | Agent doesn't know what \"improvement\" means | Add `Current Baseline: val_bpb: X.XXXX` |\n| Too vague (\"try things\") | Agent wastes experiments on random changes | List specific directions with priority order |\n| No VRAM constraint | Agent causes OOM crashes | Add `Peak VRAM must stay under X GB` |\n| Allowing multi-component changes | Can't attribute improvements | Add `Change exactly ONE component per experiment` |\n| Never updating after sessions | Agent re-explores exhausted directions | Append session log after each run |\n| Contradictory instructions | Agent gets confused | Review for internal consistency before each session |\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":5708,"content_sha256":"0ca65390ba3a62055e86bfaa4bda84ed9157d39ff3ff419c731e5d1c6ff065dd"},{"filename":"scripts/check-hardware.sh","content":"#!/usr/bin/env bash\n# check-hardware.sh — Verify GPU setup for autoresearch\n# Checks NVIDIA GPU, CUDA, VRAM, Python, and 
uv availability\n#\n# Usage:\n# bash scripts/check-hardware.sh\n#\n# Output:\n# JSON to stdout with fields: gpu_name, vram_mb, cuda_version, python_ok, uv_ok, ready\n# Human-readable summary to stderr\n#\n# Exit codes:\n# 0 All required checks pass\n# 1 Missing required components (no GPU, no CUDA, no uv)\n\nset -uo pipefail\n\nlog() { echo \"[check-hardware] $*\" >&2; }\nwarn() { echo \"[check-hardware] WARN: $*\" >&2; }\nerr() { echo \"[check-hardware] ERROR: $*\" >&2; }\n\nGPU_NAME=\"none\"\nVRAM_MB=0\nCUDA_VERSION=\"none\"\nPYTHON_OK=0\nUV_OK=0\nREADY=1\n\n# ── NVIDIA GPU ────────────────────────────────────────────────────────────────\nif command -v nvidia-smi &>/dev/null; then\n GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1 | xargs || echo \"unknown\")\n VRAM_MB=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null | head -1 | xargs || echo \"0\")\n CUDA_VERSION=$(nvidia-smi | grep \"CUDA Version\" | awk '{print $NF}' | head -1 || echo \"unknown\")\n log \"GPU: ${GPU_NAME} | VRAM: ${VRAM_MB} MB | CUDA: ${CUDA_VERSION}\"\nelse\n warn \"nvidia-smi not found — no NVIDIA GPU detected\"\n READY=0\nfi\n\n# VRAM check: autoresearch default requires ~38-40GB for MAX_SEQ_LEN=2048\nif [[ \"${VRAM_MB}\" -gt 0 ]]; then\n if [[ \"${VRAM_MB}\" -ge 39000 ]]; then\n log \"VRAM: ${VRAM_MB} MB — sufficient for default config (MAX_SEQ_LEN=2048)\"\n elif [[ \"${VRAM_MB}\" -ge 20000 ]]; then\n warn \"VRAM: ${VRAM_MB} MB — reduce MAX_SEQ_LEN to 512 or lower in prepare.py\"\n elif [[ \"${VRAM_MB}\" -ge 6000 ]]; then\n warn \"VRAM: ${VRAM_MB} MB — low-VRAM mode: set MAX_SEQ_LEN=256 in prepare.py\"\n else\n err \"VRAM: ${VRAM_MB} MB — insufficient (minimum ~6GB required)\"\n READY=0\n fi\nfi\n\n# ── Python ────────────────────────────────────────────────────────────────────\nif command -v python3 &>/dev/null; then\n PY_VER=$(python3 --version 2>&1 | awk '{print $2}')\n log \"Python: ${PY_VER}\"\n PYTHON_OK=1\nelse\n warn 
\"python3 not found\"\nfi\n\n# ── uv ────────────────────────────────────────────────────────────────────────\nif command -v uv &>/dev/null; then\n UV_VER=$(uv --version 2>&1 | head -1)\n log \"uv: ${UV_VER}\"\n UV_OK=1\nelse\n warn \"uv not found — run: curl -LsSf https://astral.sh/uv/install.sh | sh\"\n READY=0\nfi\n\n# ── PyTorch CUDA check (if in autoresearch repo) ──────────────────────────────\nif [[ -f \"train.py\" ]] && command -v uv &>/dev/null; then\n TORCH_CUDA=$(uv run python3 -c \"import torch; print(torch.cuda.is_available())\" 2>/dev/null || echo \"unknown\")\n log \"torch.cuda.is_available(): ${TORCH_CUDA}\"\n if [[ \"${TORCH_CUDA}\" == \"False\" ]]; then\n warn \"PyTorch cannot see CUDA. Check CUDA drivers and pytorch install.\"\n READY=0\n fi\nfi\n\n# ── Structured JSON output (stdout) ──────────────────────────────────────────\ncat \u003c\u003cJSON\n{\n \"gpu_name\": \"${GPU_NAME}\",\n \"vram_mb\": ${VRAM_MB},\n \"cuda_version\": \"${CUDA_VERSION}\",\n \"python_ok\": ${PYTHON_OK},\n \"uv_ok\": ${UV_OK},\n \"ready\": ${READY}\n}\nJSON\n\n# ── Exit with appropriate code ────────────────────────────────────────────────\nif [[ \"${READY}\" -eq 1 ]]; then\n log \"All checks passed — ready for autoresearch\"\nelse\n err \"One or more checks failed — fix the issues above before running experiments\"\n exit 1\nfi\n","content_type":"application/x-sh; charset=utf-8","language":"bash","size":4035,"content_sha256":"5a512f9ddb63b500d72309037e65c6025830a018f279dca34012705811b92b9f"},{"filename":"scripts/run-experiment.sh","content":"#!/usr/bin/env bash\n# run-experiment.sh — Run a single 5-minute autoresearch experiment\n# Executes uv run train.py, captures output, extracts val_bpb and peak_vram_mb\n#\n# Usage:\n# bash scripts/run-experiment.sh [--repo \u003cpath>] [--log \u003clogfile>]\n#\n# Options:\n# --repo \u003cpath> Path to autoresearch repo (default: .)\n# --log \u003cfile> Log file path (default: run.log in repo root)\n#\n# Output (stdout, 
tab-separated):\n# val_bpb peak_vram_mb duration_s log_path\n# Exit codes:\n# 0 Experiment completed, metrics extracted\n# 1 Missing val_bpb in output (likely crash/OOM)\n# 2 Experiment timed out (> 360s)\n\nset -uo pipefail\n\nREPO_DIR=\".\"\nLOG_FILE=\"\"\n\nwhile [[ $# -gt 0 ]]; do\n case \"$1\" in\n --repo) REPO_DIR=\"$2\"; shift 2 ;;\n --log) LOG_FILE=\"$2\"; shift 2 ;;\n --help)\n sed -n '2,16p' \"$0\" | sed 's/^# //'\n exit 0\n ;;\n *) echo \"Unknown option: $1\" >&2; exit 1 ;;\n esac\ndone\n\ncd \"${REPO_DIR}\"\n\nLOG_FILE=\"${LOG_FILE:-run.log}\"\nTIMEOUT=360 # 60s grace over the 300s TIME_BUDGET\n\nlog() { echo \"[run-experiment] $*\" >&2; }\nerr() { echo \"[run-experiment] ERROR: $*\" >&2; }\n\n# ── Sanity checks ────────────────────────────────────────────────────────────\nif [[ ! -f \"train.py\" ]]; then\n err \"train.py not found. Are you in the autoresearch repo? Use --repo \u003cpath>\"\n exit 1\nfi\n\nif ! command -v uv &>/dev/null; then\n err \"uv not found. Run setup.sh first.\"\n exit 1\nfi\n\n# ── Run experiment ───────────────────────────────────────────────────────────\nlog \"Starting experiment (TIME_BUDGET=300s)...\"\nSTART_TS=$(date +%s)\n\nif ! 
{ timeout \"${TIMEOUT}\" uv run train.py > \"${LOG_FILE}\" 2>&1; EXIT_CODE=$?; [[ \"${EXIT_CODE}\" -eq 0 ]]; }; then\n # EXIT_CODE is captured inside the group: after \"if ! cmd\", $? holds the negated status, not the status of cmd\n END_TS=$(date +%s)\n DURATION=$(( END_TS - START_TS ))\n\n if [[ \"${EXIT_CODE}\" -eq 124 ]]; then\n err \"Experiment timed out after ${DURATION}s (timeout=${TIMEOUT}s)\"\n exit 2\n fi\n\n # Non-zero but not timeout: metrics might still have been written before the crash\n log \"train.py exited with code ${EXIT_CODE} after ${DURATION}s - checking for partial metrics...\"\nfi\n\nEND_TS=$(date +%s)\nDURATION=$(( END_TS - START_TS ))\n\n# ── Extract metrics ───────────────────────────────────────────────────────────\nVAL_BPB=$(grep \"^val_bpb:\" \"${LOG_FILE}\" | tail -1 | awk '{print $2}')\nPEAK_VRAM=$(grep \"^peak_vram_mb:\" \"${LOG_FILE}\" | tail -1 | awk '{print $2}')\n\nif [[ -z \"${VAL_BPB}\" ]]; then\n err \"val_bpb not found in ${LOG_FILE} - experiment likely crashed (OOM or syntax error)\"\n err \"Last 10 lines of log:\"\n tail -10 \"${LOG_FILE}\" >&2\n exit 1\nfi\n\nPEAK_VRAM=\"${PEAK_VRAM:-N/A}\"\n\nlog \"Experiment complete in ${DURATION}s\"\nlog \"val_bpb=${VAL_BPB} peak_vram_mb=${PEAK_VRAM}\"\n\n# ── Structured output (stdout) ────────────────────────────────────────────────\nprintf \"%s\\t%s\\t%s\\t%s\\n\" \"${VAL_BPB}\" \"${PEAK_VRAM}\" \"${DURATION}\" \"$(realpath \"${LOG_FILE}\")\"\n","content_type":"application/x-sh; charset=utf-8","language":"bash","size":3271,"content_sha256":"c2237c1b0f1b50d0b012ee0e7e5b992bedb45e946f51aa3fe6873a7b8cd45991"},{"filename":"scripts/run-loop.sh","content":"#!/usr/bin/env bash\n# run-loop.sh: Autonomous autoresearch experiment loop\n# Runs experiments, evaluates val_bpb, keeps improvements, reverts failures.\n# Appends every run (keep/discard/crash) to results.tsv.\n#\n# Usage:\n# bash scripts/run-loop.sh [--repo \u003cpath>] [--max \u003cN>] [--desc \"\u003ctext>\"]\n#\n# Options:\n# --repo \u003cpath> Path to autoresearch repo (default: .)\n# --max \u003cN> Max experiments to run (default: 20, 0 = unlimited)\n# --desc \u003ctext> 
Experiment description prefix for results.tsv\n#\n# The loop:\n# 1. Run train.py for 300 seconds\n# 2. Extract val_bpb from output\n# 3. Compare against current best\n# 4. Keep commit if improved, git-reset if not\n# 5. Append to results.tsv\n# 6. Repeat\n\nset -uo pipefail\n\nREPO_DIR=\".\"\nMAX_EXPERIMENTS=20\nDESC_PREFIX=\"auto\"\n\nwhile [[ $# -gt 0 ]]; do\n case \"$1\" in\n --repo) REPO_DIR=\"$2\"; shift 2 ;;\n --max) MAX_EXPERIMENTS=\"$2\"; shift 2 ;;\n --desc) DESC_PREFIX=\"$2\"; shift 2 ;;\n --help)\n sed -n '2,17p' \"$0\" | sed 's/^# //'\n exit 0\n ;;\n *) echo \"Unknown option: $1\" >&2; exit 1 ;;\n esac\ndone\n\n# Resolve the script directory before cd so a relative invocation path still works with --repo\nSCRIPT_DIR=\"$(cd \"$(dirname \"${BASH_SOURCE[0]}\")\" && pwd)\"\ncd \"${REPO_DIR}\" || { echo \"[run-loop] ERROR: cannot cd to ${REPO_DIR}\" >&2; exit 1; }\n\nlog() { echo \"[run-loop] $*\"; }\nerr() { echo \"[run-loop] ERROR: $*\" >&2; }\nwarn() { echo \"[run-loop] WARN: $*\" >&2; }\n\n# ── Sanity checks ─────────────────────────────────────────────────────────────\nfor f in train.py results.tsv; do\n if [[ ! -f \"${f}\" ]]; then\n if [[ \"${f}\" == \"results.tsv\" ]]; then\n log \"Creating results.tsv...\"\n printf \"commit\\tval_bpb\\tpeak_vram_mb\\tstatus\\tdescription\\n\" > results.tsv\n else\n err \"${f} not found. Are you in the autoresearch repo? Use --repo \u003cpath>\"\n exit 1\n fi\n fi\ndone\n\nif ! command -v uv &>/dev/null; then\n err \"uv not found. 
Run setup.sh first.\"\n exit 1\nfi\n\n# ── Baseline ──────────────────────────────────────────────────────────────────\n# Read best val_bpb from results.tsv (kept experiments only)\nget_best_bpb() {\n awk -F'\\t' 'NR>1 && $4==\"keep\" {print $2}' results.tsv \\\n | sort -n | head -1\n}\n\nBEST_BPB=$(get_best_bpb)\nif [[ -z \"${BEST_BPB}\" ]]; then\n log \"No kept experiments found in results.tsv - running baseline first...\"\n BEST_BPB=\"99.0\" # sentinel: accept first experiment unconditionally\nfi\nlog \"Current best val_bpb: ${BEST_BPB}\"\n\n# ── Loop ──────────────────────────────────────────────────────────────────────\nEXPERIMENT=0\n\nwhile true; do\n EXPERIMENT=$(( EXPERIMENT + 1 ))\n if [[ \"${MAX_EXPERIMENTS}\" -gt 0 && \"${EXPERIMENT}\" -gt \"${MAX_EXPERIMENTS}\" ]]; then\n log \"Reached max experiments (${MAX_EXPERIMENTS}). Stopping.\"\n break\n fi\n\n echo \"\"\n echo \"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\"\n log \"Experiment ${EXPERIMENT}/${MAX_EXPERIMENTS} | best_so_far=${BEST_BPB}\"\n echo \"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\"\n\n LOG_FILE=\"run_${EXPERIMENT}.log\"\n\n # ── Run experiment ────────────────────────────────────────────────────────\n # Reset the exit status every iteration so one crash does not mark all later runs as crashed\n RUN_EXIT=0\n RESULT=$(bash \"${SCRIPT_DIR}/run-experiment.sh\" \\\n --repo \".\" \\\n --log \"${LOG_FILE}\" 2>/dev/null) || RUN_EXIT=$?\n\n # ── Handle crash / timeout ────────────────────────────────────────────────\n if [[ \"${RUN_EXIT}\" -ne 0 ]]; then\n COMMIT=$(git rev-parse --short HEAD 2>/dev/null || echo \"unknown\")\n STATUS=\"crash\"\n BPB=\"crash\"\n VRAM=\"N/A\"\n DESCRIPTION=\"${DESC_PREFIX}: experiment ${EXPERIMENT} - crashed (exit ${RUN_EXIT})\"\n warn \"Experiment crashed (exit ${RUN_EXIT}). 
Reverting commit...\"\n # --hard also restores train.py in the working tree; a plain reset would leave the rejected edit in place\n git reset --hard HEAD~1 2>/dev/null || true\n else\n BPB=$(echo \"${RESULT}\" | cut -f1)\n VRAM=$(echo \"${RESULT}\" | cut -f2)\n\n # ── Compare and decide ────────────────────────────────────────────────\n COMMIT=$(git rev-parse --short HEAD 2>/dev/null || echo \"unknown\")\n\n if awk \"BEGIN { exit !(${BPB} \u003c ${BEST_BPB}) }\"; then\n STATUS=\"keep\"\n IMPROVEMENT=$(awk \"BEGIN { printf \\\"%.4f\\\", ${BEST_BPB} - ${BPB} }\")\n BEST_BPB=\"${BPB}\"\n DESCRIPTION=\"${DESC_PREFIX}: experiment ${EXPERIMENT} - improved by ${IMPROVEMENT}\"\n log \"IMPROVED val_bpb=${BPB} (delta=-${IMPROVEMENT}) KEEPING commit ${COMMIT}\"\n else\n STATUS=\"discard\"\n DESCRIPTION=\"${DESC_PREFIX}: experiment ${EXPERIMENT} - no improvement (val_bpb=${BPB} >= ${BEST_BPB})\"\n log \"NO IMPROVEMENT val_bpb=${BPB} >= best=${BEST_BPB} REVERTING...\"\n git reset --hard HEAD~1 2>/dev/null || warn \"git reset failed - may be at initial commit\"\n fi\n fi\n\n # ── Append to results.tsv ─────────────────────────────────────────────────\n printf \"%s\\t%s\\t%s\\t%s\\t%s\\n\" \\\n \"${COMMIT}\" \"${BPB}\" \"${VRAM}\" \"${STATUS}\" \"${DESCRIPTION}\" \\\n >> results.tsv\n\n log \"Logged: ${STATUS} | ${DESCRIPTION}\"\ndone\n\n# ── Summary ───────────────────────────────────────────────────────────────────\necho \"\"\necho \"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\"\necho \" Run loop complete\"\nKEPT=$(awk -F'\\t' 'NR>1 && $4==\"keep\" {c++} END {print c+0}' results.tsv)\nTOTAL=$(awk -F'\\t' 'NR>1 {c++} END {print c+0}' results.tsv)\n# The counter is incremented once more before the loop breaks, so subtract one\necho \" Experiments run : $(( EXPERIMENT - 1 ))\"\necho \" Improvements kept: ${KEPT} / ${TOTAL} total\"\necho \" Best val_bpb : ${BEST_BPB}\"\necho \" Results : results.tsv\"\necho \"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\"\n","content_type":"application/x-sh; charset=utf-8","language":"bash","size":6572,"content_sha256":"05a7adefe3acee6e845b9b31f49ac22d3b7be226ee070df030b80e07ea5d552a"},{"filename":"scripts/setup.sh","content":"#!/usr/bin/env bash\n# 
setup.sh — One-time autoresearch environment setup\n# Installs uv, clones karpathy/autoresearch, syncs dependencies, prepares dataset\n#\n# Usage:\n# bash setup.sh [--dir \u003ctarget-dir>] [--skip-data] [--seq-len \u003cN>]\n#\n# Options:\n# --dir \u003cpath> Clone into this directory (default: ./autoresearch)\n# --skip-data Skip uv run prepare.py (dataset download, ~2 min)\n# --seq-len \u003cN> Override MAX_SEQ_LEN in prepare.py before running it\n# (useful for low-VRAM GPUs: try 256 or 512)\n\nset -euo pipefail\n\nTARGET_DIR=\"./autoresearch\"\nSKIP_DATA=0\nSEQ_LEN=\"\"\n\n# ── Argument parsing ────────────────────────────────────────────────────────\nwhile [[ $# -gt 0 ]]; do\n case \"$1\" in\n --dir) TARGET_DIR=\"$2\"; shift 2 ;;\n --skip-data) SKIP_DATA=1; shift ;;\n --seq-len) SEQ_LEN=\"$2\"; shift 2 ;;\n --help)\n sed -n '2,12p' \"$0\" | sed 's/^# //'\n exit 0\n ;;\n *) echo \"Unknown option: $1\" >&2; exit 1 ;;\n esac\ndone\n\nlog() { echo \"[setup] $*\"; }\nerr() { echo \"[setup] ERROR: $*\" >&2; }\nwarn() { echo \"[setup] WARN: $*\" >&2; }\n\n# ── Step 1: Install uv if missing ───────────────────────────────────────────\nif ! command -v uv &>/dev/null; then\n log \"Installing uv...\"\n curl -LsSf https://astral.sh/uv/install.sh | sh\n export PATH=\"${HOME}/.local/bin:${PATH}\"\n if ! command -v uv &>/dev/null; then\n err \"uv install succeeded but binary not found. 
Restart shell and retry.\"\n exit 1\n fi\n log \"uv installed: $(uv --version)\"\nelse\n log \"uv already installed: $(uv --version)\"\nfi\n\n# ── Step 2: Clone repo ──────────────────────────────────────────────────────\nif [[ -d \"${TARGET_DIR}/.git\" ]]; then\n log \"Repository already exists at ${TARGET_DIR} — skipping clone\"\nelse\n log \"Cloning karpathy/autoresearch into ${TARGET_DIR}...\"\n git clone https://github.com/karpathy/autoresearch \"${TARGET_DIR}\"\nfi\n\ncd \"${TARGET_DIR}\"\n\n# ── Step 3: Sync dependencies ───────────────────────────────────────────────\nlog \"Syncing dependencies (uv sync)...\"\nuv sync\nlog \"Dependencies installed\"\n\n# ── Step 4: Optional MAX_SEQ_LEN override ───────────────────────────────────\nif [[ -n \"${SEQ_LEN}\" ]]; then\n warn \"Overriding MAX_SEQ_LEN → ${SEQ_LEN} in prepare.py\"\n # Backup original\n cp prepare.py prepare.py.bak\n sed -i \"s/MAX_SEQ_LEN\\s*=\\s*[0-9]*/MAX_SEQ_LEN = ${SEQ_LEN}/\" prepare.py\n log \"MAX_SEQ_LEN set to ${SEQ_LEN} (backup: prepare.py.bak)\"\nfi\n\n# ── Step 5: Prepare dataset ─────────────────────────────────────────────────\nif [[ \"${SKIP_DATA}\" -eq 1 ]]; then\n warn \"Skipping data preparation (--skip-data). Run 'uv run prepare.py' manually before training.\"\nelse\n log \"Preparing dataset (FineWeb-Edu shards + BPE tokenizer). 
This takes ~2 minutes...\"\n uv run prepare.py\n log \"Dataset ready\"\nfi\n\n# ── Done ────────────────────────────────────────────────────────────────────\necho \"\"\necho \"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\"\necho \" autoresearch setup complete\"\necho \" Directory : ${TARGET_DIR}\"\necho \" Next step : bash scripts/run-experiment.sh\"\necho \"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\"\n","content_type":"application/x-sh; charset=utf-8","language":"bash","size":4008,"content_sha256":"eb4e737ac6e98afdf44a1563c3cb1229b93b41d07c3879157bfc022a34fc8849"},{"filename":"scripts/show-results.sh","content":"#!/usr/bin/env bash\n# show-results.sh — Parse and display autoresearch results.tsv\n# Shows statistics, best experiments, and improvement history.\n#\n# Usage:\n# bash scripts/show-results.sh [--repo \u003cpath>] [--top \u003cN>] [--kept-only]\n#\n# Options:\n# --repo \u003cpath> Path to autoresearch repo (default: .)\n# --top \u003cN> Show top N experiments by val_bpb (default: 10)\n# --kept-only Show only experiments that were kept\n\nset -uo pipefail\n\nREPO_DIR=\".\"\nTOP_N=10\nKEPT_ONLY=0\n\nwhile [[ $# -gt 0 ]]; do\n case \"$1\" in\n --repo) REPO_DIR=\"$2\"; shift 2 ;;\n --top) TOP_N=\"$2\"; shift 2 ;;\n --kept-only) KEPT_ONLY=1; shift ;;\n --help)\n sed -n '2,13p' \"$0\" | sed 's/^# //'\n exit 0\n ;;\n *) echo \"Unknown option: $1\" >&2; exit 1 ;;\n esac\ndone\n\ncd \"${REPO_DIR}\"\n\nif [[ ! -f \"results.tsv\" ]]; then\n echo \"results.tsv not found. 
Have you run any experiments yet?\" >&2\n exit 1\nfi\n\n# ── Overall statistics ────────────────────────────────────────────────────────\nTOTAL=$(awk -F'\\t' 'NR>1 {c++} END {print c+0}' results.tsv)\nKEPT=$(awk -F'\\t' 'NR>1 && $4==\"keep\" {c++} END {print c+0}' results.tsv)\nDISCARD=$(awk -F'\\t' 'NR>1 && $4==\"discard\" {c++} END {print c+0}' results.tsv)\nCRASHED=$(awk -F'\\t' 'NR>1 && $4==\"crash\" {c++} END {print c+0}' results.tsv)\n\nBEST_BPB=$(awk -F'\\t' 'NR>1 && $4==\"keep\" && $2~/^[0-9]/ {print $2}' results.tsv \\\n | sort -n | head -1)\nFIRST_BPB=$(awk -F'\\t' 'NR>1 && $2~/^[0-9]/ {print $2; exit}' results.tsv)\n\necho \"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\"\necho \" autoresearch — Results Summary\"\necho \"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\"\nprintf \" Total experiments : %s\\n\" \"${TOTAL}\"\nprintf \" Kept (improved) : %s\\n\" \"${KEPT}\"\nprintf \" Discarded : %s\\n\" \"${DISCARD}\"\nprintf \" Crashed : %s\\n\" \"${CRASHED}\"\necho \"\"\nif [[ -n \"${FIRST_BPB}\" && -n \"${BEST_BPB}\" ]]; then\n DELTA=$(awk \"BEGIN { printf \\\"%.4f\\\", ${FIRST_BPB} - ${BEST_BPB} }\")\n printf \" Starting val_bpb : %s\\n\" \"${FIRST_BPB}\"\n printf \" Best val_bpb : %s (delta: -%s)\\n\" \"${BEST_BPB}\" \"${DELTA}\"\nfi\necho \"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\"\n\n# ── Top experiments ───────────────────────────────────────────────────────────\necho \"\"\necho \"Top ${TOP_N} experiments by val_bpb:\"\necho \"\"\nprintf \" %-10s %-8s %-14s %-8s %s\\n\" \"COMMIT\" \"VAL_BPB\" \"PEAK_VRAM_MB\" \"STATUS\" \"DESCRIPTION\"\necho \" ─────────────────────────────────────────────────────────────────\"\n\nAWK_FILTER='NR>1 && $2~/^[0-9]/'\nif [[ \"${KEPT_ONLY}\" -eq 1 ]]; then\n AWK_FILTER='NR>1 && $4==\"keep\" && $2~/^[0-9]/'\nfi\n\nawk -F'\\t' \"${AWK_FILTER} {print}\" results.tsv \\\n | sort -t

$'\\t' -k2 -n \\\n | head -\"${TOP_N}\" \\\n | while IFS=
$'\\t' read -r commit bpb vram status desc; do\n # Truncate description to 40 chars\n short_desc=\"${desc:0:40}\"\n [[ \"${#desc}\" -gt 40 ]] && short_desc=\"${short_desc}...\"\n printf \" %-10s %-8s %-14s %-8s %s\\n\" \\\n \"${commit}\" \"${bpb}\" \"${vram}\" \"${status}\" \"${short_desc}\"\n done\n\n# ── Improvement timeline ──────────────────────────────────────────────────────\necho \"\"\necho \"Improvement timeline (kept only):\"\necho \"\"\nprintf \" %-10s %-8s %-14s %s\\n\" \"COMMIT\" \"VAL_BPB\" \"DELTA\" \"DESCRIPTION\"\necho \" ─────────────────────────────────────────────────────────────────\"\n\nPREV_BPB=\"\"\nawk -F'\\t' 'NR>1 && $4==\"keep\" && $2~/^[0-9]/ {print}' results.tsv \\\n | while IFS=
$'
\\t' read -r commit bpb vram status desc; do\n if [[ -z \"${PREV_BPB}\" ]]; then\n delta=\"baseline\"\n else\n delta=$(awk \"BEGIN { printf \\\"%.4f\\\", ${PREV_BPB} - ${bpb} }\")\n delta=\"-${delta}\"\n fi\n PREV_BPB=\"${bpb}\"\n short_desc=\"${desc:0:35}\"\n [[ \"${#desc}\" -gt 35 ]] && short_desc=\"${short_desc}...\"\n printf \" %-10s %-8s %-14s %s\\n\" \\\n \"${commit}\" \"${bpb}\" \"${delta}\" \"${short_desc}\"\n done\n\necho \"\"\n","content_type":"application/x-sh; charset=utf-8","language":"bash","size":4900,"content_sha256":"ea759e1e8cc74fe0434cae668835449b065e31e77ae7057c79fd697124a73a7b"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"autoresearch","type":"text"}]},{"type":"paragraph","content":[{"text":"Autoresearch is a ","type":"text"},{"text":"closed-loop ML experimentation workflow","type":"text","marks":[{"type":"strong"}]},{"text":":","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"human writes ","type":"text"},{"text":"program.md","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"agent edits ","type":"text"},{"text":"train.py","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"prepare.py","type":"text","marks":[{"type":"code_inline"}]},{"text":" stays fixed","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"every run gets the same 300-second budget","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"lower ","type":"text"},{"text":"val_bpb","type":"text","marks":[{"type":"code_inline"}]},{"text":" wins","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"regressions get reverted","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"This skill should behave 
like a ","type":"text"},{"text":"routing-first front door","type":"text","marks":[{"type":"strong"}]},{"text":", not a giant tutorial. Pick the user's mode, enforce the immutable-harness rules, then hand them to the smallest useful script or reference.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"When to use this skill","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Set up ","type":"text"},{"text":"karpathy/autoresearch","type":"text","marks":[{"type":"code_inline"}]},{"text":" on a real GPU machine","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Write or refine ","type":"text"},{"text":"program.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" before a session","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Run a bounded overnight ","type":"text"},{"text":"train.py","type":"text","marks":[{"type":"code_inline"}]},{"text":" search loop","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Interpret ","type":"text"},{"text":"results.tsv","type":"text","marks":[{"type":"code_inline"}]},{"text":" after a session","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Adapt the workflow to tighter VRAM constraints without invalidating comparisons","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Explain the ML-specific boundary between ","type":"text"},{"text":"autoresearch","type":"text","marks":[{"type":"code_inline"}]},{"text":" and nearby eval tooling","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Do not use this skill when","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"The user wants to optimize a 
","type":"text"},{"text":"SKILL.md","type":"text","marks":[{"type":"code_inline"}]},{"text":", prompt, or repo-local workflow with frozen prompts/evals — use ","type":"text"},{"text":"skill-autoresearch","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"The user wants app-level tracing, dataset-backed LLM evals, feedback review, or observability — use LangSmith, Braintrust, Weave, Promptfoo, or similar tools","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"The job does not involve a real training repo, ","type":"text"},{"text":"program.md","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"train.py","type":"text","marks":[{"type":"code_inline"}]},{"text":", fixed runtime budget, and ","type":"text"},{"text":"val_bpb","type":"text","marks":[{"type":"code_inline"}]},{"text":" keep/revert ratcheting","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"The user is really asking for a paper survey, general benchmark scan, or literature review with no intention to run the training loop","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Core boundary","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Concern","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"autoresearch","type":"text","marks":[{"type":"code_inline"}]},{"text":" owns","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Route 
elsewhere","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Mutable target","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"train.py","type":"text","marks":[{"type":"code_inline"}]},{"text":" in a real training repo","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"prompts, app configs, ","type":"text"},{"text":"SKILL.md","type":"text","marks":[{"type":"code_inline"}]},{"text":", product behavior","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Fixed evaluator","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"prepare.py","type":"text","marks":[{"type":"code_inline"}]},{"text":", validation shard, ","type":"text"},{"text":"TIME_BUDGET=300","type":"text","marks":[{"type":"code_inline"}]},{"text":", chosen ","type":"text"},{"text":"MAX_SEQ_LEN","type":"text","marks":[{"type":"code_inline"}]},{"text":" / ","type":"text"},{"text":"EVAL_TOKENS","type":"text","marks":[{"type":"code_inline"}]},{"text":" for the session","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"prompt/eval datasets, app scorecards, observability dashboards","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Acceptance 
rule","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"keep only lower ","type":"text"},{"text":"val_bpb","type":"text","marks":[{"type":"code_inline"}]},{"text":"; revert ties/regressions","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"human review queues, app-level release gates","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Main artifacts","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"program.md","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"results.tsv","type":"text","marks":[{"type":"code_inline"}]},{"text":", kept/discarded commits","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"prompt suites, traces, feedback datasets","type":"text"}]}]}]}]},{"type":"paragraph","content":[{"text":"If that boundary does not fit, do not stretch this skill.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Required intake packet","type":"text"}]},{"type":"paragraph","content":[{"text":"Before acting, identify:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Mode","type":"text","marks":[{"type":"strong"}]},{"text":" — setup, ","type":"text"},{"text":"program.md","type":"text","marks":[{"type":"code_inline"}]},{"text":", run loop, results interpretation, or constrained hardware","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Repository 
state","type":"text","marks":[{"type":"strong"}]},{"text":" — cloned or not, dependencies installed or not","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Hardware state","type":"text","marks":[{"type":"strong"}]},{"text":" — GPU / VRAM / CUDA / MLX / Windows path","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Session state","type":"text","marks":[{"type":"strong"}]},{"text":" — first baseline, active loop, or completed run","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Constraint state","type":"text","marks":[{"type":"strong"}]},{"text":" — target VRAM ceiling, whether ","type":"text"},{"text":"prepare.py","type":"text","marks":[{"type":"code_inline"}]},{"text":" has already been frozen for this session","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Instructions","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 1: Pick exactly one operating mode","type":"text"}]},{"type":"paragraph","content":[{"text":"Choose the smallest mode that answers the request:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Setup readiness","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"install ","type":"text"},{"text":"uv","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"clone repo","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"sync dependencies","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"verify GPU/CUDA/uv with 
","type":"text"},{"text":"scripts/check-hardware.sh","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"run the first baseline experiment","type":"text"}]}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"program.md","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" authoring","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"write or refine the human research charter","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"record current baseline ","type":"text"},{"text":"val_bpb","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"prioritize hypotheses","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"list what has already been tried","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"freeze constraints before the loop starts","type":"text"}]}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Bounded run loop","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"confirm the evaluator is already fixed","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"use ","type":"text"},{"text":"train.py","type":"text","marks":[{"type":"code_inline"}]},{"text":" as the only mutable search surface","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"run the loop with keep/revert discipline","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"log every experiment to 
","type":"text"},{"text":"results.tsv","type":"text","marks":[{"type":"code_inline"}]}]}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Results interpretation","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"summarize best kept runs","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"identify repeated failures or crash patterns","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"extract what belongs in the next ","type":"text"},{"text":"program.md","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"distinguish genuine gains from one-off anomalies","type":"text"}]}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Constrained-hardware adaptation","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"set ","type":"text"},{"text":"MAX_SEQ_LEN","type":"text","marks":[{"type":"code_inline"}]},{"text":" and ","type":"text"},{"text":"EVAL_TOKENS","type":"text","marks":[{"type":"code_inline"}]},{"text":" before the session","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"keep them unchanged once the session starts","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"adjust model/search strategy instead of cheating the evaluator mid-run","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"route to community forks when CUDA assumptions do not hold","type":"text"}]}]}]}]}]},{"type":"paragraph","content":[{"text":"Do ","type":"text"},{"text":"not","type":"text","marks":[{"type":"strong"}]},{"text":" answer all five modes at once unless the user explicitly asked for 
a full end-to-end walkthrough.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 2: Re-state the immutable harness","type":"text"}]},{"type":"paragraph","content":[{"text":"Every mode must preserve these rules:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"program.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" is human-authored and read-only during a session","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"train.py","type":"text","marks":[{"type":"code_inline"}]},{"text":" is the main mutable search surface","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"prepare.py","type":"text","marks":[{"type":"code_inline"}]},{"text":" is read-only once the session starts","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"TIME_BUDGET=300","type":"text","marks":[{"type":"code_inline"}]},{"text":" stays fixed","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"val_bpb","type":"text","marks":[{"type":"code_inline"}]},{"text":" is the main keep/revert metric","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"results.tsv","type":"text","marks":[{"type":"code_inline"}]},{"text":" is append-only","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"dependency set in ","type":"text"},{"text":"pyproject.toml","type":"text","marks":[{"type":"code_inline"}]},{"text":" stays locked","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"If the user wants to change the evaluator, start a ","type":"text"},{"text":"new comparison track","type":"text","marks":[{"type":"strong"}]},{"text":", not the current session.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 3: Execute the chosen 
mode","type":"text"}]},{"type":"heading","attrs":{"level":4},"content":[{"text":"Mode A — Setup readiness","type":"text"}]},{"type":"paragraph","content":[{"text":"Use this path when the repo is not yet runnable.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"curl -LsSf https://astral.sh/uv/install.sh | sh\ngit clone https://github.com/karpathy/autoresearch\ncd autoresearch\nuv sync\nbash scripts/check-hardware.sh\nuv run prepare.py\nuv run train.py > run.log 2>&1\ngrep \"^val_bpb:\\|^peak_vram_mb:\" run.log","type":"text"}]},{"type":"paragraph","content":[{"text":"Success condition: one baseline run completes and prints both ","type":"text"},{"text":"val_bpb","type":"text","marks":[{"type":"code_inline"}]},{"text":" and ","type":"text"},{"text":"peak_vram_mb","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":4},"content":[{"text":"Mode B — ","type":"text"},{"text":"program.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" authoring","type":"text"}]},{"type":"paragraph","content":[{"text":"Use this path when the loop exists but direction is weak.","type":"text"}]},{"type":"paragraph","content":[{"text":"Minimum sections:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"goal tied to lower ","type":"text"},{"text":"val_bpb","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"current baseline ","type":"text"},{"text":"val_bpb","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"directions to explore in priority order","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"what has been tried 
already","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"constraints: ","type":"text"},{"text":"TIME_BUDGET=300","type":"text","marks":[{"type":"code_inline"}]},{"text":", no ","type":"text"},{"text":"prepare.py","type":"text","marks":[{"type":"code_inline"}]},{"text":" mutation, no new packages, VRAM ceiling, one meaningful change per experiment","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"For fuller templates and update patterns, use ","type":"text"},{"text":"references/program-md-guide.md","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":4},"content":[{"text":"Mode C — Bounded run loop","type":"text"}]},{"type":"paragraph","content":[{"text":"Use this path only after setup and ","type":"text"},{"text":"program.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" are ready.","type":"text"}]},{"type":"paragraph","content":[{"text":"Loop contract:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"read ","type":"text"},{"text":"program.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" + current ","type":"text"},{"text":"train.py","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"form one hypothesis","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"edit ","type":"text"},{"text":"train.py","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"commit","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"run one 300-second experiment","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"extract 
","type":"text"},{"text":"val_bpb","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"keep if improved, otherwise ","type":"text"},{"text":"git reset --hard HEAD~1","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"append result to ","type":"text"},{"text":"results.tsv","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"paragraph","content":[{"text":"Typical commands:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"bash scripts/run-experiment.sh\nbash scripts/run-loop.sh --max 20 --desc \"session-1\"","type":"text"}]},{"type":"paragraph","content":[{"text":"Do not encourage multi-change hero rewrites. Clean ablations matter more than flashy edits.","type":"text"}]},{"type":"heading","attrs":{"level":4},"content":[{"text":"Mode D — Results interpretation","type":"text"}]},{"type":"paragraph","content":[{"text":"Use this path after a completed run or checkpoint.","type":"text"}]},{"type":"paragraph","content":[{"text":"Helpful commands:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"bash scripts/show-results.sh --top 10\nawk -F'\\t' '$4==\"keep\"' results.tsv | sort -t
\\t' -k2 -n\nawk -F'\\t' '{print $4}' results.tsv | sort | uniq -c","type":"text"}]},{"type":"paragraph","content":[{"text":"Summarize only four things: best gains, repeated failures, what should move into ","type":"text"},{"text":"What Has Been Tried","type":"text","marks":[{"type":"code_inline"}]},{"text":", and the next narrow experiment family.","type":"text"}]},{"type":"heading","attrs":{"level":4},"content":[{"text":"Mode E — Constrained-hardware adaptation","type":"text"}]},{"type":"paragraph","content":[{"text":"Use this path when VRAM, platform, or runtime constraints dominate.","type":"text"}]},{"type":"paragraph","content":[{"text":"Rules:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"choose ","type":"text"},{"text":"MAX_SEQ_LEN","type":"text","marks":[{"type":"code_inline"}]},{"text":" and ","type":"text"},{"text":"EVAL_TOKENS","type":"text","marks":[{"type":"code_inline"}]},{"text":" ","type":"text"},{"text":"before","type":"text","marks":[{"type":"strong"}]},{"text":" the session","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"never change them mid-session","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"lower model/search ambition before mutating the evaluator","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"prefer route-outs to community forks for Apple Silicon / non-CUDA paths","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"For concrete values and troubleshooting, use ","type":"text"},{"text":"references/hardware-config.md","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 4: Route out aggressively when the request is adjacent","type":"text"}]},{"type":"paragraph","content":[{"text":"Route out 
when:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"the user wants to optimize instructions, prompts, or repo-local skills → ","type":"text"},{"text":"skill-autoresearch","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"the user wants app-level traces, feedback review, observability, or online/offline eval dashboards → LangSmith / Braintrust / Weave / Promptfoo","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"the user wants general literature synthesis rather than a runnable ML loop → research or survey tooling","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 5: Keep the heavy detail in support files","type":"text"}]},{"type":"paragraph","content":[{"text":"Use support files instead of re-explaining everything inline:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"references/operating-modes-and-route-outs.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" — fast routing table, minimal response shape, and handoff logic","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"references/architecture.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" — immutability contract, file map, metric rationale","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"references/program-md-guide.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" — templates and update rules","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"references/hardware-config.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" — VRAM tables and platform 
troubleshooting","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"scripts/*.sh","type":"text","marks":[{"type":"code_inline"}]},{"text":" — runnable setup / loop / reporting helpers","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Available scripts","type":"text"}]},{"type":"paragraph","content":[{"text":"Run from inside the autoresearch repository directory:","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Script","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Purpose","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Usage","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"setup.sh","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"One-time environment setup","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"bash scripts/setup.sh [--seq-len 512]","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"run-experiment.sh","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Single 5-minute experiment + metric 
extraction","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"bash scripts/run-experiment.sh","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"run-loop.sh","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Autonomous loop: run → keep/revert → repeat","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"bash scripts/run-loop.sh [--max 20]","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"show-results.sh","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Human-readable ","type":"text"},{"text":"results.tsv","type":"text","marks":[{"type":"code_inline"}]},{"text":" report","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"bash scripts/show-results.sh [--top 10]","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"check-hardware.sh","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"GPU/CUDA/uv readiness check (JSON 
output)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"bash scripts/check-hardware.sh","type":"text","marks":[{"type":"code_inline"}]}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"References","type":"text"}]},{"type":"paragraph","content":[{"text":"Detailed documentation in ","type":"text"},{"text":"references/","type":"text","marks":[{"type":"code_inline"}]},{"text":":","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"File","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Contents","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/operating-modes-and-route-outs.md","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Mode picker, adjacency boundaries, and minimal output contract","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/architecture.md","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"System design, immutability contract, git ratcheting, metric 
rationale","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/program-md-guide.md","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"How to write and update effective ","type":"text"},{"text":"program.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" directives","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/hardware-config.md","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"VRAM settings by GPU, memory optimization, platform troubleshooting","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Examples","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Example 1: First 40GB GPU session","type":"text"}]},{"type":"paragraph","content":[{"text":"Request: “Help me run Karpathy autoresearch on a 40GB GPU.”","type":"text"}]},{"type":"paragraph","content":[{"text":"Expected behavior:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"choose ","type":"text"},{"text":"Setup readiness","type":"text","marks":[{"type":"strong"}]},{"text":" first","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"verify hardware and dependencies","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"run one baseline experiment","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"route to 
","type":"text"},{"text":"program.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" authoring only after the baseline exists","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Example 2: User wants to optimize a skill instead","type":"text"}]},{"type":"paragraph","content":[{"text":"Request: “Can autoresearch help me improve this ","type":"text"},{"text":"SKILL.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" with binary evals?”","type":"text"}]},{"type":"paragraph","content":[{"text":"Expected behavior:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"route out immediately to ","type":"text"},{"text":"skill-autoresearch","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"explain that this skill is for real ML training search on ","type":"text"},{"text":"train.py","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Best practices","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Start with the smallest mode that fits","type":"text","marks":[{"type":"strong"}]},{"text":" — setup, authoring, run loop, interpretation, or hardware adaptation","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Baseline before bravado","type":"text","marks":[{"type":"strong"}]},{"text":" — confirm one successful run before talking about overnight loops","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Freeze the evaluator before the session","type":"text","marks":[{"type":"strong"}]},{"text":" — ","type":"text"},{"text":"prepare.py","type":"text","marks":[{"type":"code_inline"}]},{"text":", 
","type":"text"},{"text":"TIME_BUDGET","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"MAX_SEQ_LEN","type":"text","marks":[{"type":"code_inline"}]},{"text":", and ","type":"text"},{"text":"EVAL_TOKENS","type":"text","marks":[{"type":"code_inline"}]},{"text":" must stay comparable","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"One meaningful experiment at a time","type":"text","marks":[{"type":"strong"}]},{"text":" — ablations beat mystery bundles","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Keep ","type":"text","marks":[{"type":"strong"}]},{"text":"results.tsv","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" append-only","type":"text","marks":[{"type":"strong"}]},{"text":" — discarded runs are still evidence","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Push deep detail into references/scripts","type":"text","marks":[{"type":"strong"}]},{"text":" — the front door should classify and route, not duplicate every table","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Route adjacent jobs away early","type":"text","marks":[{"type":"strong"}]},{"text":" — prompt/app eval and ","type":"text"},{"text":"SKILL.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" optimization are different lanes","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"References","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"GitHub — karpathy/autoresearch","type":"text","marks":[{"type":"link","attrs":{"href":"https://github.com/karpathy/autoresearch","title":null}}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Karpathy — A Recipe for Training Neural 
Networks","type":"text","marks":[{"type":"link","attrs":{"href":"https://karpathy.github.io/2019/04/25/recipe/","title":null}}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"MLflow Tracking","type":"text","marks":[{"type":"link","attrs":{"href":"https://mlflow.org/docs/latest/ml/tracking/","title":null}}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Weights & Biases Tracking","type":"text","marks":[{"type":"link","attrs":{"href":"https://docs.wandb.ai/guides/track/","title":null}}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"MIT License","type":"text","marks":[{"type":"link","attrs":{"href":"https://github.com/karpathy/autoresearch/blob/master/LICENSE","title":null}}]}]}]}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"autoresearch","author":"@skillopedia","source":{"stars":22,"repo_name":"oh-my-skills","origin_url":"https://github.com/akillness/oh-my-skills/blob/HEAD/.agent-skills/autoresearch/SKILL.md","repo_owner":"akillness","body_sha256":"a88519e5db85902c054f3177b3d2bc7fd46fe03fc977121f0d29f9f4c4dad6e2","cluster_key":"5bd45431bb078b0fbed45763ad33d21837e52894ef01e2efd0ff554ab57d4f41","clean_bundle":{"format":"clean-skill-bundle-v1","source":"akillness/oh-my-skills/.agent-skills/autoresearch/SKILL.md","attachments":[{"id":"08fd6bfd-b3b9-5435-9632-5e1ab1916bd8","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/08fd6bfd-b3b9-5435-9632-5e1ab1916bd8/attachment.toon","path":"SKILL.toon","size":2656,"sha256":"6acf603f275b74031ef4b0dd1961e133483923ce476a69cee1ca1306136b2bb3","contentType":"text/plain; charset=utf-8"},{"id":"acf6ae9a-3d6a-5517-a0a5-ebf70d8cca5d","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/acf6ae9a-3d6a-5517-a0a5-ebf70d8cca5d/attachment.json","path":"evals/evals.json","size":2942,"sha256":"08e3064a654464844e85f34a90b9cfc5f4b6bfdc7629620b416debc09b77c351","contentType":"application/json; 
charset=utf-8"},{"id":"6910e4ac-1dfc-58ba-8619-070c74842929","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/6910e4ac-1dfc-58ba-8619-070c74842929/attachment.md","path":"references/architecture.md","size":7433,"sha256":"eb462ada18529f13c4e5da33a4ae7290af706dcbbfdbfe4cdd8f6cb93dcd6a13","contentType":"text/markdown; charset=utf-8"},{"id":"505d5ece-096c-5a44-817c-73fd566175ea","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/505d5ece-096c-5a44-817c-73fd566175ea/attachment.md","path":"references/hardware-config.md","size":5544,"sha256":"650ee4f75cb4b20cf8fc5dea6b30ca986f79a7cbf019e69e83c08f29e9388d3b","contentType":"text/markdown; charset=utf-8"},{"id":"a982535c-5e02-584f-bb6a-242f71fb1ff5","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/a982535c-5e02-584f-bb6a-242f71fb1ff5/attachment.md","path":"references/operating-modes-and-route-outs.md","size":2288,"sha256":"f830ed7f666983e77d2d4b403397cc148df4fdf76c54819d728611fcf5235eff","contentType":"text/markdown; charset=utf-8"},{"id":"f065a778-e12b-5b83-85e2-21569010b307","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/f065a778-e12b-5b83-85e2-21569010b307/attachment.md","path":"references/program-md-guide.md","size":5708,"sha256":"0ca65390ba3a62055e86bfaa4bda84ed9157d39ff3ff419c731e5d1c6ff065dd","contentType":"text/markdown; charset=utf-8"},{"id":"2f9a363e-fc44-5e13-b01a-4cea6f073ba0","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/2f9a363e-fc44-5e13-b01a-4cea6f073ba0/attachment.sh","path":"scripts/check-hardware.sh","size":4035,"sha256":"5a512f9ddb63b500d72309037e65c6025830a018f279dca34012705811b92b9f","contentType":"application/x-sh; charset=utf-8"},{"id":"5acd0bcf-1bcd-5209-9f08-22beb8fbc820","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/5acd0bcf-1bcd-5209-9f08-22beb8fbc820/attachment.sh","path":"scripts/run-experiment.sh","size":3271,"sha256":"c2237c1b0f1b50d0b012ee0e7e5b992bedb45e946f51aa3fe6873a7b8cd45991","contentType":"application/x-sh; 
charset=utf-8"},{"id":"2e69d9d1-e639-58ab-abe1-61dcdd82a872","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/2e69d9d1-e639-58ab-abe1-61dcdd82a872/attachment.sh","path":"scripts/run-loop.sh","size":6572,"sha256":"05a7adefe3acee6e845b9b31f49ac22d3b7be226ee070df030b80e07ea5d552a","contentType":"application/x-sh; charset=utf-8"},{"id":"074cf836-4440-5ad7-9ecd-ccfb1959abaf","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/074cf836-4440-5ad7-9ecd-ccfb1959abaf/attachment.sh","path":"scripts/setup.sh","size":4008,"sha256":"eb4e737ac6e98afdf44a1563c3cb1229b93b41d07c3879157bfc022a34fc8849","contentType":"application/x-sh; charset=utf-8"},{"id":"6b073127-bb1a-5a2b-b149-d5ae301332bd","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/6b073127-bb1a-5a2b-b149-d5ae301332bd/attachment.sh","path":"scripts/show-results.sh","size":4900,"sha256":"ea759e1e8cc74fe0434cae668835449b065e31e77ae7057c79fd697124a73a7b","contentType":"application/x-sh; charset=utf-8"}],"bundle_sha256":"faa10601e31db9fc9e80e85bf9947635bad333b45638d885fcded62d5956bbc0","attachment_count":11,"text_attachments":10,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":1,"excluded_attachments":[]},"cluster_size":1,"skill_md_path":".agent-skills/autoresearch/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"integrations-apis","category_label":"Integrations"},"exact_dupes_collapsed_into_this":0},"version":"v1","category":"integrations-apis","metadata":{"tags":"autoresearch, ml-experiments, autonomous-research, karpathy, gpu, train, val-bpb, overnight, ratcheting","source":"https://github.com/karpathy/autoresearch","license":"MIT","version":"1.2.0"},"import_tag":"clean-skills-v1","description":"Run Karpathy-style autonomous ML search on a real training repository. 
Use when the user needs to set up or operate `karpathy/autoresearch`, choose the right run mode (setup, `program.md`, bounded loop, result interpretation, or constrained-hardware adaptation), and preserve the immutable `prepare.py` / 300-second / `val_bpb` contract. Not for prompt evaluation, LLM app observability, or repo-local `SKILL.md` optimization — route those to LangSmith, Promptfoo, Braintrust, or `skill-autoresearch`. Triggers on: autoresearch, autonomous ML experiments, `program.md`, `train.py`, `val_bpb`, overnight GPU loop, fixed eval harness.\n","allowed-tools":"Bash Read Write Edit Glob Grep WebFetch","compatibility":"Official path assumes a single NVIDIA GPU on Linux with CUDA; ~40GB VRAM is a comfortable default for MAX_SEQ_LEN=2048, but lower-VRAM and community-fork paths exist for RTX 4090 / 3090, GTX 1660 Ti, Apple Silicon MLX, and Windows RTX. Python 3.10+ and uv required. Dataset: ~6543 FineWeb-Edu parquet shards.\n"}},"renderedAt":1782982289287}
