firecrawl-research-patterns

Firecrawl Research Patterns

Programmatic patterns for using self-hosted Firecrawl in research workflows: search, scrape, route academic papers, run recursive deep research, and persist raw results for future re-analysis. Also covers self-hosted deployment, health checks, and recovery. For archiving AI chat conversations (ChatGPT/Gemini shares), see .

---

Self-Evolving Skill: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed, fix this file immediately; don't defer. Only update for real, reproducible issues.

FIRST — TodoWrite Task Temp…

&& echo \"8080 OK\" || echo \"8080 DOWN\"\n\n# Real end-to-end probe — proves /v1/scrape works against a known-good URL\ncurl -s --max-time 15 -X POST \"${BASE}:3002/v1/scrape\" \\\n -H 'Content-Type: application/json' \\\n -d '{\"url\":\"https://example.com\",\"formats\":[\"markdown\"]}' \\\n | python3 -c \"import sys,json; d=json.load(sys.stdin); print('OK' if d.get('success') else 'FAIL')\"\n```\n\n> **Do not** probe `/v1/health`, `/health`, or `/v0/health` on port 3002 — all three return HTTP 404 (Express's HTML error page), which looks like a service-down signal but isn't. Confirmed 2026-05-27.\n\nFor architecture diagrams, health checks, recovery commands, and deployment details, see:\n\n- [Self-Hosted Operations](./references/self-hosted-operations.md) — Architecture, health checks, recovery commands\n- [Self-Hosted Bootstrap Guide](./references/self-hosted-bootstrap-guide.md) — Fresh installation (7 steps)\n- [Self-Hosted Best Practices](./references/self-hosted-best-practices.md) — Docker restart policies, monitoring\n- [Self-Hosted Troubleshooting](./references/self-hosted-troubleshooting.md) — Symptom-based diagnosis\n\n---\n\n## Section 6 — Image and Figure Capture\n\nText-only scrapers (Jina, direct Firecrawl) capture prose but lose architecture diagrams, result plots, and attention maps. For image-rich papers, always capture figures.\n\n### When to Capture Images\n\nCapture figures when the paper contains any of:\n\n- Architecture diagrams (model structure, attention patterns)\n- Benchmark/result comparison plots\n- Qualitative examples (generated outputs, visualizations)\n- Algorithm flowcharts or pseudocode figures\n\n### arXiv HTML Figure URL Discovery\n\narXiv HTML papers store figures at sequential absolute URLs (`x1.png`, `x2.png`, ...). 
Probe to discover all figure URLs — do NOT download them locally:\n\n```bash\nARXIV_ID=\"2312.00752\"\nARXIV_VER=\"v2\"\nBASE_URL=\"https://arxiv.org/html/${ARXIV_ID}${ARXIV_VER}\"\nFIGURE_URLS=()\n\n# Probe sequential URLs until 404 — collect absolute URLs only\nfor i in $(seq 1 50); do\n url=\"${BASE_URL}/x${i}.png\"\n status=$(curl -s -o /dev/null -w \"%{http_code}\" \"$url\")\n if [ \"$status\" != \"200\" ]; then\n echo \"Stopped at x${i}.png (${status}) — found ${#FIGURE_URLS[@]} figures\"\n break\n fi\n FIGURE_URLS+=(\"$url\")\n echo \"Found: $url\"\ndone\n```\n\nThe collected absolute URLs go directly into the markdown body and frontmatter — no local copies needed.\n\n### Inline Figure Embedding (GFM)\n\nEach figure must appear inline in the corpus markdown as an absolute URL so GitHub renders it in-place:\n\n```markdown\n## Key Figures\n\n![Figure 1 — Mamba SSM architecture](https://arxiv.org/html/2312.00752v2/x1.png)\n\n![Figure 2 — Selective scan mechanism](https://arxiv.org/html/2312.00752v2/x2.png)\n\n![Figure 3 — Performance vs sequence length](https://arxiv.org/html/2312.00752v2/x3.png)\n```\n\n> **Never rewrite to relative paths** like `./figures/x1.png` — relative paths break on GitHub unless images are committed to the same repo.\n\n### Extracting Existing Inline URLs from Scraped Markdown\n\nWhen port 3003 (Playwright) already embedded absolute URLs in the scraped markdown, extract them for the frontmatter catalog:\n\n```bash\nCORPUS_FILE=\"docs/research/corpus/2026-03-13-mamba-ssm.md\"\n\n# Extract all absolute image URLs already in the markdown\ngrep -oE 'https://[^)]+\\.(png|jpg|svg|gif|webp)' \"$CORPUS_FILE\" | sort -u\n```\n\nThese URLs are already inline — just copy them into the frontmatter `figure_urls` list.\n\n### Frontmatter for Image-Rich Papers\n\nThe YAML frontmatter catalogs all figure source URLs for provenance. 
The markdown body embeds them inline:\n\n```yaml\n---\nsource_url: https://arxiv.org/html/2312.00752v2\nscraped_at: \"2026-03-13T00:00:00Z\"\nscraper: firecrawl-port3003\ntags: [ssm, state-space-model, mamba, sequence-modeling]\ncontent_tokens_approx: 4200\nhas_figures: true\nfigure_count: 12\nfigure_urls:\n - https://arxiv.org/html/2312.00752v2/x1.png\n - https://arxiv.org/html/2312.00752v2/x2.png\n - https://arxiv.org/html/2312.00752v2/x3.png\n - https://arxiv.org/html/2312.00752v2/x4.png\n - https://arxiv.org/html/2312.00752v2/x5.png\n---\n```\n\n### Corpus Index Entry with Figures\n\n```json\n{\n \"url\": \"https://arxiv.org/html/2312.00752v2\",\n \"file\": \"corpus/2026-03-13-mamba-ssm.md\",\n \"scraped_at\": \"2026-03-13T00:00:00Z\",\n \"session\": \"2026-03-13-mamba-ssm\",\n \"scraper\": \"firecrawl-port3003\",\n \"has_figures\": true,\n \"figure_count\": 12,\n \"figure_urls\": [\n \"https://arxiv.org/html/2312.00752v2/x1.png\",\n \"https://arxiv.org/html/2312.00752v2/x2.png\"\n ]\n}\n```\n\n### Port 3003 vs Jina Reader: Empirical Comparison (arXiv)\n\n**Validated on arXiv:2312.00752v2 (Mamba paper) — both scrapers running, same URL:**\n\n| Scraper | Bytes | Lines | Words | Figures (absolute inline) | Math on GitHub |\n| ------------------------ | ------ | ----- | ------ | ------------------------- | -------------------------------------- |\n| Port 3003 (Firecrawl) | 99,104 | 1,267 | 13,182 | 13 ✅ | ❌ doubled Unicode+LaTeX, no `$...$` |\n| Port 3002 (direct API) | 99,104 | 1,267 | 13,182 | 13 ✅ (identical to 3003) | ❌ doubled Unicode+LaTeX, no `$...$` |\n| Jina Reader | 84,832 | 596 | 10,761 | 12 ✅ | ❌ doubled Unicode+LaTeX, no `$...$` |\n
| Pandoc from LaTeX source | — | — | — | via `\\includegraphics` | ✅ `$inline$` + ` ```math ``` ` blocks |\n\n**Verdict**: Firecrawl (port 3002/3003) gets **17% more bytes, 2.1× as many lines, 22% more words, 1 extra figure** vs Jina. Port 3002 and 3003 produce identical markdown (3003 just wraps 3002 and saves to Caddy). **Both emit absolute inline figure URLs** — no URL reconstruction needed from either scraper.\n\n**Note on the earlier session timeout**: The March 2026 session failure was machine downtime (littleblack was offline), not a routing issue. When littleblack is up, port 3003 reaches arxiv.org fine.\n\n**Recommended arXiv workflow**:\n\n1. Port 3003 (preferred) — more complete content, figures inline, saves to Caddy\n2. Jina Reader (fallback when littleblack is down) — about 14% less content by bytes (84,832 vs 99,104) but still gets absolute figure URLs\n3. Probe loop to build `figure_urls` frontmatter catalog regardless of scraper used\n4. For human-readable math on GitHub: Pandoc from arXiv LaTeX source (see below)\n\n### Math Rendering: Empirically Validated Approaches\n\n**Validated on arXiv:2312.00752v2 (Mamba paper), March 2026.**\n\n#### Firecrawl/Jina Math Output: Unreadable on GitHub\n\nBoth Firecrawl (port 3002/3003) and Jina Reader extract math by doubling content — each equation appears as a Unicode render followed immediately by raw LaTeX source, packed into markdown table cells with `\\displaystyle` prefixes and `\\\\bm{}` escaping. Example from the empirical test:\n\n```\n| | h′(t)\\\\displaystyle h^{\\\\prime}(t) | \\=𝑨h(t)+𝑩x(t)\\\\displaystyle=\\\\bm{A}h(t)+\\\\bm{B}x(t) | | (1a) |\n```\n\nNo `$...$` delimiters — **GitHub cannot render this as math**. 
The raw LaTeX portion is parseable by an LLM (equations are present), but the output is completely unreadable to humans on GitHub.\n\n**For LLM consumption**: Firecrawl's doubled content is sufficient — the LaTeX source is embedded and an LLM can extract it.\n\n**For human-readable GitHub rendering**: Use Pandoc from the arXiv LaTeX source tarball (see below).\n\n#### Pandoc from arXiv LaTeX Source (Human-Readable Math)\n\nProduces proper `$inline$` and ` ```math ``` ` display blocks that GitHub's MathJax/KaTeX renders natively:\n\n```bash\nARXIV_ID=\"2312.00752\"\n\n# Download arXiv LaTeX source tarball\ncurl -L \"https://arxiv.org/src/${ARXIV_ID}\" -o \"${ARXIV_ID}-src.tar.gz\"\nmkdir -p \"${ARXIV_ID}-src\"\ntar xzf \"${ARXIV_ID}-src.tar.gz\" -C \"${ARXIV_ID}-src/\"\n\n# Find main .tex entry point and section files\nls \"${ARXIV_ID}-src/\"*.tex\nls \"${ARXIV_ID}-src/src/\"*.tex 2>/dev/null # some papers put sections in src/\n\n# Option A: Convert individual section files (safer — avoids macro parse errors)\npandoc \"${ARXIV_ID}-src/src/background.tex\" \\\n --to gfm+tex_math_dollars \\\n --wrap=none \\\n -o \"${ARXIV_ID}-background.md\"\n\n# Option B: Convert full main.tex (may fail on custom macros like \\iftoggle)\npandoc \"${ARXIV_ID}-src/main.tex\" \\\n --to gfm+tex_math_dollars \\\n --wrap=none \\\n -o \"${ARXIV_ID}-pandoc.md\"\n```\n\nInstall: `brew install pandoc`. 
Works on any arXiv paper that publishes LaTeX source (most do).\n\n**Pandoc output quality** (empirically validated):\n\n- Inline math: `$x(t) \\in \\R \\mapsto y(t) \\in \\R$` ✅ GitHub renders\n- Display math: ` ```math\\n\\begin{align}\\nh'(t) &= \\A h(t) + \\B x(t)\\n\\end{align}\\n``` ` ✅ GitHub renders\n- Custom macros (`\\A`, `\\B`, `\\R`, `\\dt`, `\\dA`, `\\dB`): ⚠️ **undefined in KaTeX** — macros pass through as-is and may partially fail on GitHub without the preamble's `\\newcommand` definitions\n\n**Handling custom macros**: Prepend the `\\newcommand` block from `main.tex` preamble to the output:\n\n````bash\n# Extract custom macro definitions from preamble\ngrep '\\\\newcommand\\|\\\\renewcommand\\|\\\\def ' \"${ARXIV_ID}-src/main.tex\" > macros.tex\n\n# Pandoc does not read preamble macros — include them explicitly in a math block at the top:\necho '```math' > preamble-block.md\ncat macros.tex >> preamble-block.md\necho '```' >> preamble-block.md\n\ncat preamble-block.md \"${ARXIV_ID}-pandoc.md\" > \"${ARXIV_ID}-with-macros.md\"\n````\n\n**Known Pandoc parse errors on arXiv LaTeX**:\n\n| Error trigger | Cause | Workaround |\n| -------------------- | ---------------------------------------------- | ----------------------------------------- |\n| `\\iftoggle{arxiv}` | Undefined toggle macro (etoolbox package) | Convert section files instead of main.tex |\n| `\\begin{figure*}` | Two-column figure environment breaks structure | Use `head -N` to avoid broken `\\end` tags |\n| `\\bm{}`, `\\mathbf{}` | Passes through — may not render in KaTeX | Check paper's macro file for mappings |\n\n---\n\n## Anti-Patterns\n\n| # | Anti-Pattern | Why It Fails | Correct Approach |\n| --- | --------------------------------------------- | ------------------------------------------------------------------------------------------ | 
---------------------------------------------------------------------------------------------------------------------------------------------------- |\n| 1 | Using `@mendable/firecrawl-js` SDK | `jiti` dynamic imports break in Bun | Direct `fetch()` calls |\n| 2 | Searching paywalled sites without `waitFor` | JS SPAs return empty shell | Use `waitFor: 3000` for IEEE, ACM DL |\n| 3 | Setting depth > 5 | Exponential query explosion, diminishing returns | Cap at depth 5 (`clampDepth()`) |\n| 4 | No timeout on `fetch()` | Hangs indefinitely on unreachable pages | Always use `AbortController` with 15s timeout |\n| 5 | Not trimming long page content | Exceeds LLM context window | `trimToTokenLimit(text, 25_000)` per page |\n| 6 | Aborting on partial failure | Loses all completed work | Log failures, continue with remaining queries |\n| 7 | Probing `/v1/health` for health | Returns HTTP 404 — endpoint doesn't exist; HTML 404 page looks like service-down but isn't | `GET /` against port 3002, check body contains `\"Firecrawl API\"`. See Section 1 Health Check. 
|\n| 8 | Saving only synthesis without raw originals | Loses source material, prevents re-analysis | Always persist raw Firecrawl markdown to corpus |\n| 9 | Rewriting figure URLs to local relative paths | Relative paths like `./figures/x1.png` break on GitHub — images don't render | Keep absolute URLs inline in markdown body (`![Fig](https://arxiv.org/html/{id}/x1.png)`); catalog in frontmatter `figure_urls` list — see Section 6 |\n\n---\n\n## References\n\n- [API Endpoint Reference](./references/api-endpoint-reference.md) — `/v1/search` and `/v1/scrape` contracts\n- [Academic Paper Routing](./references/academic-paper-routing.md) — Decision tree for paper sources\n- [Recursive Research Protocol](./references/recursive-research-protocol.md) — Step-by-step recursive pattern\n- [Corpus Persistence Format](./references/corpus-persistence-format.md) — Raw content archival format + directory layout\n- [Self-Hosted Operations](./references/self-hosted-operations.md) — Architecture, health checks, recovery\n- [Self-Hosted Bootstrap Guide](./references/self-hosted-bootstrap-guide.md) — Fresh installation guide\n- [Self-Hosted Best Practices](./references/self-hosted-best-practices.md) — Docker restart policies, monitoring\n- [Self-Hosted Troubleshooting](./references/self-hosted-troubleshooting.md) — Symptom-based diagnosis and recovery\n\n## Post-Execution Reflection\n\nAfter this skill completes, check before closing:\n\n1. **Did the command succeed?** — If not, fix the instruction or error table that caused the failure.\n2. **Did parameters or output change?** — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.\n3. 
**Was a workaround needed?** — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.\n\nOnly update if the issue is real and reproducible — not speculative.\n---","attachment_filenames":["references/academic-paper-routing.md","references/api-endpoint-reference.md","references/corpus-persistence-format.md","references/evolution-log.md","references/recursive-research-protocol.md","references/self-hosted-best-practices.md","references/self-hosted-bootstrap-guide.md","references/self-hosted-operations.md","references/self-hosted-troubleshooting.md"],"attachments":[{"filename":"references/academic-paper-routing.md","content":"# Academic Paper Routing\n\nDecision tree for choosing the best retrieval method based on paper source. Optimized for content quality and reliability.\n\n---\n\n## Routing Table\n\n| Source | Best Method | Why | Fallback | `waitFor` |\n| --------------------- | ------------------------------- | ---------------------------------------------------------------------------------------- | ----------------------------------- | --------- |\n| arxiv.org | Port 3003 (`/scrape?url=...`) | **+17% more content** than Jina (99KB vs 85KB), 13 figures vs 12, identical to port 3002 | Jina Reader (when littleblack down) | No |\n| Semantic Scholar | API (`api.semanticscholar.org`) | Structured JSON, free, rate-limited | Firecrawl search for paper title | No |\n| ACL Anthology | Firecrawl `/v1/scrape` | Clean HTML, free access | Direct PDF download | No |\n| NeurIPS/ICML/ICLR | Firecrawl `/v1/scrape` | JS-rendered proceedings pages | Firecrawl search by title | 2000 |\n| IEEE Xplore | Firecrawl `/v1/scrape` | Heavy JS SPA | Author's personal website | 3000 |\n| ACM Digital Library | Firecrawl `/v1/scrape` | Heavy JS SPA | Author's personal website | 3000 |\n| Author blogs/websites | Jina Reader (`r.jina.ai`) | Static HTML, fast, clean output | Firecrawl `/v1/scrape` | No |\n| Google 
Scholar | Firecrawl `/v1/search` | Needs JS rendering for results | Direct search query reformulation | No |\n\n---\n\n## Source-Specific Patterns\n\n### arxiv.org\n\narxiv provides multiple access paths. Prefer HTML over PDF for LLM consumption.\n\n```\narxiv.org/abs/2401.12345 → metadata page (abstract, authors)\narxiv.org/html/2401.12345 → full HTML paper (preferred for LLM)\narxiv.org/pdf/2401.12345 → PDF (less useful for text extraction)\n```\n\n**Primary**: Port 3003 (Firecrawl wrapper) — empirically gets 17% more content than Jina:\n\n```bash\ncurl \"http://littleblack:3003/scrape?url=https://arxiv.org/html/2401.12345&name=paper-slug\"\n# Returns: {\"url\":\"http://littleblack:8080/paper-slug-TIMESTAMP.md\",\"file\":\"...\"}\n```\n\n**Fallback** (when littleblack is down): Jina Reader:\n\n```bash\ncurl -s \"https://r.jina.ai/https://arxiv.org/html/2401.12345\" -o paper.md\n```\n\n**Empirically validated (arXiv:2312.00752v2, Mamba paper, March 2026)**:\n\n- Port 3003: 99,104 bytes, 1,267 lines, 13 figures (absolute inline URLs ✅)\n- Jina Reader: 84,832 bytes, 596 lines, 12 figures (absolute inline URLs ✅)\n- Both emit absolute figure URLs — no URL reconstruction needed\n- The earlier session timeout was machine downtime, not a routing issue — port 3003 reaches arxiv.org fine when littleblack is online\n\n**Math rendering gap** (empirically validated): Both Jina and Firecrawl double all equations — each equation appears as Unicode render + raw LaTeX source in the same table cell with `\\displaystyle` prefixes, no `$...$` delimiters. Unreadable on GitHub for humans; LaTeX is still parseable by LLMs. For human-readable GFM math, use Pandoc from the arXiv LaTeX source tarball (`--to gfm+tex_math_dollars`) — produces proper `$inline$` and ` ```math ``` ` blocks GitHub renders, but paper-specific custom macros (`\\A`, `\\B`, `\\R`, etc.) 
need the preamble's `\\newcommand` definitions prepended (see Section 6 of SKILL.md).\n\n#### arXiv Figure URL Pattern\n\narXiv HTML papers store figures at sequential absolute URLs (`x1.png`, `x2.png`, …). The correct approach is to **keep these URLs inline in the markdown body** and **catalog them in the YAML frontmatter** — do NOT download to local paths (relative paths break on GitHub).\n\n```bash\n# Probe sequential URLs to discover figure_count — collect absolute URLs for frontmatter\nARXIV_ID=\"2401.12345\"\nBASE=\"https://arxiv.org/html/${ARXIV_ID}/\"\nFIGURE_URLS=()\n\nfor i in $(seq 1 50); do\n url=\"${BASE}x${i}.png\"\n http_code=$(curl -s -o /dev/null -w \"%{http_code}\" \"$url\")\n # Stop on any non-200 (404, timeout, 5xx) so failed probes are not recorded as figures\n if [ \"$http_code\" != \"200\" ]; then\n echo \"Found ${#FIGURE_URLS[@]} figures (stopped at x${i}.png, HTTP ${http_code})\"\n break\n fi\n FIGURE_URLS+=(\"$url\")\ndone\n\n# Embed inline in GFM corpus markdown (renders on GitHub without hosting):\nfor i in \"${!FIGURE_URLS[@]}\"; do\n echo \"![Figure $((i+1))](${FIGURE_URLS[$i]})\"\ndone\n```\n\n**Frontmatter catalog** (YAML, inside the corpus `.md` file):\n\n```yaml\nhas_figures: true\nfigure_count: 12\nfigure_urls:\n - https://arxiv.org/html/2401.12345/x1.png\n - https://arxiv.org/html/2401.12345/x2.png\n - https://arxiv.org/html/2401.12345/x3.png\n```\n\n**Notes**:\n\n- Files are `x1.png`, `x2.png`, … (sequential, 1-indexed); first 404 means no more figures\n- Some papers use `.svg` or `.jpg`; probe `.png` first, then alternatives\n- Version suffix: `https://arxiv.org/html/2401.12345v2/` for a specific version\n- Port 3003 already embeds these as inline absolute URLs — just extract them with `grep -oE 'https://arxiv.org/html/[^)]+\\.png'`\n\n**Fallback**: If `/html/` is unavailable (older papers), use Firecrawl to scrape `/abs/`:\n\n```typescript\nconst res = await fetch(\"http://littleblack:3002/v1/scrape\", {\n method: \"POST\",\n headers: { \"Content-Type\": \"application/json\" },\n body: JSON.stringify({\n url: 
`https://arxiv.org/abs/${arxivId}`,\n formats: [\"markdown\"],\n }),\n});\n```\n\n### Semantic Scholar\n\nAPI-first approach for structured metadata. Free tier: 100 requests/5 minutes.\n\n```typescript\n// Search by title\nconst res = await fetch(\n `https://api.semanticscholar.org/graph/v1/paper/search?query=${encodeURIComponent(title)}&limit=5&fields=title,abstract,url,year,authors,citationCount`,\n);\nconst { data } = await res.json();\n\n// Get by paper ID (S2 ID, DOI, arxiv ID, etc.)\nconst paper = await fetch(\n `https://api.semanticscholar.org/graph/v1/paper/${paperId}?fields=title,abstract,url,year,authors,references,citations`,\n);\n```\n\n**Fallback**: If API rate-limited or paper not indexed, search via Firecrawl:\n\n```typescript\nconst res = await fetch(\"http://littleblack:3002/v1/search\", {\n method: \"POST\",\n headers: { \"Content-Type\": \"application/json\" },\n body: JSON.stringify({\n query: `\"${paperTitle}\" site:semanticscholar.org`,\n limit: 3,\n scrapeOptions: { formats: [\"markdown\"] },\n }),\n});\n```\n\n### Conference Proceedings (NeurIPS, ICML, ICLR)\n\nThese use JS-rendered pages. Always use `waitFor`:\n\n```typescript\nconst res = await fetch(\"http://littleblack:3002/v1/scrape\", {\n method: \"POST\",\n headers: { \"Content-Type\": \"application/json\" },\n body: JSON.stringify({\n url: proceedingsUrl,\n formats: [\"markdown\"],\n waitFor: 2000,\n }),\n});\n```\n\n### IEEE Xplore / ACM Digital Library\n\nHeavy JS SPAs that require extended wait times:\n\n```typescript\nconst res = await fetch(\"http://littleblack:3002/v1/scrape\", {\n method: \"POST\",\n headers: { \"Content-Type\": \"application/json\" },\n body: JSON.stringify({\n url: ieeeOrAcmUrl,\n formats: [\"markdown\"],\n waitFor: 3000, // Critical — page won't render without this\n }),\n});\n```\n\n**Note**: Paywalled content may return only abstract + metadata. 
For full text, check if the author has a preprint on arxiv or their personal website.\n\n### Author Blogs / Personal Websites\n\nStatic HTML — Jina Reader is faster and cleaner than Firecrawl:\n\n```bash\ncurl -s \"https://r.jina.ai/https://author-blog.com/post-about-paper\"\n```\n\nOr via WebFetch in Claude Code:\n\n```\nWebFetch(url=\"https://r.jina.ai/https://author-blog.com/post\", prompt=\"Extract full content\")\n```\n\n---\n\n## DOI Resolution\n\nDOIs redirect to the publisher's canonical URL. Resolve first, then route:\n\n```typescript\n// Follow redirects to get the publisher URL\nconst res = await fetch(`https://doi.org/${doi}`, { redirect: \"follow\" });\nconst publisherUrl = res.url;\n\n// Route based on publisher domain\nif (publisherUrl.includes(\"arxiv.org\")) {\n // → arxiv path\n} else if (publisherUrl.includes(\"dl.acm.org\")) {\n // → ACM DL path with waitFor: 3000\n} else if (publisherUrl.includes(\"ieeexplore.ieee.org\")) {\n // → IEEE path with waitFor: 3000\n} else {\n // → Generic Firecrawl scrape\n}\n```\n\n---\n\n## Preprint vs Published Version Detection\n\nWhen a paper exists in multiple locations:\n\n1. **Prefer arxiv HTML** — free, structured, no paywalls\n2. **Check Semantic Scholar** for citation metadata + links to all versions\n3. **Use published version** only when arxiv version is significantly outdated (check version dates)\n\n```typescript\n// Semantic Scholar returns all known versions\nconst paper = await fetch(\n `https://api.semanticscholar.org/graph/v1/paper/search?query=${title}&fields=externalIds,url`,\n);\n// externalIds: { ArXiv: \"2401.12345\", DOI: \"10.1145/...\", ... }\n```\n\n---\n\n## Citation Extraction\n\nFor extracting references from a paper's bibliography:\n\n1. **Semantic Scholar API** — best for structured citation data:\n\n```typescript\nconst refs = await fetch(\n `https://api.semanticscholar.org/graph/v1/paper/${paperId}/references?fields=title,authors,year,externalIds&limit=100`,\n);\n```\n\n1. 
**Firecrawl scrape** of references section — when API doesn't have the paper\n\n---\n\n## Complement to Existing Routing\n\nThis table extends `Skill(gh-tools:research-archival)` URL routing, which covers:\n\n- ChatGPT share URLs → Jina Reader\n- Gemini share URLs → Firecrawl\n- Claude artifacts → Jina Reader\n\nThis skill adds academic-specific routing. The two are complementary — use `research-archival` for AI chat conversations, this skill for academic papers and research content.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":10089,"content_sha256":"04c1a7bc56d36f27bf0129f4c107bb3bf9d8ddd1884304cec2f8bcf70982d4c4"},{"filename":"references/api-endpoint-reference.md","content":"# API Endpoint Reference\n\nFirecrawl self-hosted API contracts for the two endpoints used in research workflows, plus health check.\n\n**Base URL**: `http://littleblack:3002` (Tailscale primary, no API key needed; legacy ZeroTier fallback at `172.25.236.1:3002`)\n\n---\n\n## POST /v1/search\n\nCombined search + scrape. 
Searches the web for a query and returns scraped markdown for each result.\n\n### Request\n\n```json\n{\n \"query\": \"mixture of experts scaling laws\",\n \"limit\": 5,\n \"scrapeOptions\": {\n \"formats\": [\"markdown\"]\n }\n}\n```\n\n| Field | Type | Required | Default | Description |\n| ----------------------- | -------- | -------- | -------------- | ------------------------- |\n| `query` | string | Yes | — | Search query |\n| `limit` | number | No | 5 | Max results to return |\n| `scrapeOptions.formats` | string[] | No | `[\"markdown\"]` | Content formats to return |\n\n### Response (200 OK)\n\n```json\n{\n \"success\": true,\n \"data\": [\n {\n \"url\": \"https://example.com/page1\",\n \"markdown\": \"# Page Title\\n\\nContent...\",\n \"metadata\": {\n \"title\": \"Page Title\",\n \"description\": \"Meta description\",\n \"sourceURL\": \"https://example.com/page1\"\n }\n }\n ]\n}\n```\n\n| Field | Type | Description |\n| ----------------- | ------- | ---------------------------------------- |\n| `success` | boolean | Whether the search succeeded |\n| `data` | array | Array of scraped results |\n| `data[].url` | string | Source URL |\n| `data[].markdown` | string | Scraped page content as markdown |\n| `data[].metadata` | object | Page metadata (title, description, etc.) 
|\n\n### Error Responses\n\n| Status | Meaning | Action |\n| ------- | ------------------------------- | ------------------------------------------------------------------------------------ |\n| 400 | Invalid request (missing query) | Check request body |\n| 408 | Search timeout | Retry with shorter query or fewer results |\n| 500 | Internal server error | Check Firecrawl logs, restart if needed |\n| 502/503 | Service unavailable | Container may be dead — see [self-hosted-operations.md](./self-hosted-operations.md) |\n\n### curl Example\n\n```bash\ncurl -s -X POST http://littleblack:3002/v1/search \\\n -H \"Content-Type: application/json\" \\\n -d '{\n \"query\": \"transformer attention mechanism\",\n \"limit\": 3,\n \"scrapeOptions\": { \"formats\": [\"markdown\"] }\n }' | jq '.data[].url'\n```\n\n### fetch() Example\n\n```typescript\nasync function firecrawlSearch(\n query: string,\n limit = 5,\n): Promise\u003cSearchResult> {\n const controller = new AbortController();\n const timeoutId = setTimeout(() => controller.abort(), 15_000);\n\n try {\n const res = await fetch(\"http://littleblack:3002/v1/search\", {\n method: \"POST\",\n headers: { \"Content-Type\": \"application/json\" },\n body: JSON.stringify({\n query,\n limit,\n scrapeOptions: { formats: [\"markdown\"] },\n }),\n signal: controller.signal,\n });\n\n if (!res.ok) {\n throw new Error(\n `Firecrawl search failed: ${res.status} ${res.statusText}`,\n );\n }\n\n return await res.json();\n } finally {\n clearTimeout(timeoutId);\n }\n}\n```\n\n---\n\n## POST /v1/scrape\n\nSingle URL scrape. 
Fetches a specific URL and returns its content as markdown.\n\n### Request\n\n```json\n{\n \"url\": \"https://arxiv.org/abs/2401.12345\",\n \"formats\": [\"markdown\"],\n \"waitFor\": 3000\n}\n```\n\n| Field | Type | Required | Default | Description |\n| --------- | -------- | -------- | -------------- | ------------------------------------- |\n| `url` | string | Yes | — | URL to scrape |\n| `formats` | string[] | No | `[\"markdown\"]` | Content formats |\n| `waitFor` | number | No | 0 | Milliseconds to wait for JS rendering |\n\n**When to use `waitFor`**: JS-heavy SPAs (IEEE Xplore, ACM DL, NeurIPS proceedings). Static pages (arxiv, blogs) don't need it.\n\n### Response (200 OK)\n\n```json\n{\n \"success\": true,\n \"data\": {\n \"markdown\": \"# Paper Title\\n\\nAbstract...\",\n \"metadata\": {\n \"title\": \"Paper Title\",\n \"description\": \"Abstract text\",\n \"sourceURL\": \"https://arxiv.org/abs/2401.12345\"\n }\n }\n}\n```\n\n| Field | Type | Description |\n| --------------- | ------- | ---------------------------- |\n| `success` | boolean | Whether the scrape succeeded |\n| `data.markdown` | string | Page content as markdown |\n| `data.metadata` | object | Page metadata |\n\n### curl Example\n\n```bash\n# Simple static page\ncurl -s -X POST http://littleblack:3002/v1/scrape \\\n -H \"Content-Type: application/json\" \\\n -d '{\"url\":\"https://arxiv.org/abs/2401.12345\",\"formats\":[\"markdown\"]}' \\\n | jq -r '.data.markdown'\n\n# JS-heavy page (wait for rendering)\ncurl -s -X POST http://littleblack:3002/v1/scrape \\\n -H \"Content-Type: application/json\" \\\n -d '{\"url\":\"https://dl.acm.org/doi/10.1145/12345\",\"formats\":[\"markdown\"],\"waitFor\":3000}' \\\n | jq -r '.data.markdown'\n```\n\n### fetch() Example\n\n```typescript\nasync function firecrawlScrape(\n url: string,\n waitFor?: number,\n): Promise\u003cScrapeResult> {\n const controller = new AbortController();\n const timeoutId = setTimeout(() => controller.abort(), 30_000);\n\n try {\n 
const res = await fetch(\"http://littleblack:3002/v1/scrape\", {\n method: \"POST\",\n headers: { \"Content-Type\": \"application/json\" },\n body: JSON.stringify({\n url,\n formats: [\"markdown\"],\n ...(waitFor ? { waitFor } : {}),\n }),\n signal: controller.signal,\n });\n\n if (!res.ok) {\n throw new Error(\n `Firecrawl scrape failed: ${res.status} ${res.statusText}`,\n );\n }\n\n return await res.json();\n } finally {\n clearTimeout(timeoutId);\n }\n}\n```\n\n---\n\n## Health Check (GET /)\n\nThere is no `/v1/health` endpoint: `/v1/health`, `/health`, and `/v0/health` on port 3002 all return HTTP 404, which looks like a service-down signal but isn't. Before starting a research session, send `GET /` to port 3002 and check that the body contains `\"Firecrawl API\"`.\n\n### curl Example\n\n```bash\ncurl -s --max-time 5 http://littleblack:3002/ | grep -q \"Firecrawl API\" && echo \"Firecrawl OK\" || echo \"Firecrawl UNHEALTHY\"\n```\n\n### fetch() Example\n\n```typescript\nasync function checkFirecrawlHealth(): Promise\u003cboolean> {\n try {\n const res = await fetch(\"http://littleblack:3002/\", {\n signal: AbortSignal.timeout(5_000),\n });\n // Healthy only if the root page identifies itself as the Firecrawl API\n return (await res.text()).includes(\"Firecrawl API\");\n } catch {\n return false;\n }\n}\n```\n\n---\n\n## Self-Hosted Specifics\n\n| Property | Value |\n| ------------------ | -------------------------------------------------- |\n| Base URL | `http://littleblack:3002` |\n| API key | Not required (self-hosted, no auth) |\n| Network | Tailscale (must be connected) |\n| Host | littleblack |\n| Wrapper (optional) | `http://littleblack:3003/scrape?url=URL&name=NAME` |\n\nThe wrapper at `:3003` saves markdown to disk and returns a file URL. 
For programmatic research workflows, prefer the direct API at `:3002` — it gives you full control over the response.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":7689,"content_sha256":"7e043bac7ac07f6eb86eed9dd2d9b9c02a6f73eb8165d933d226eccbfb26d9e2"},{"filename":"references/corpus-persistence-format.md","content":"# Corpus Persistence Format\n\nDefines how raw Firecrawl output is saved for future Claude Code sessions to re-read and re-analyze.\n\n**Design follows existing cc-skills patterns**:\n\n- YAML frontmatter + raw markdown body from `Skill(gh-tools:research-archival)`\n- NDJSON append-only registry from `Skill(devops-tools:session-chronicle)`\n\n---\n\n## Directory Layout\n\n```\n{project-root}/\n├── docs/research/\n│ ├── corpus/ # Raw scraped pages (committed to git)\n│ │ ├── 2026-02-25-moe-scaling-arxiv-2401-12345.md\n│ │ ├── 2026-02-25-switch-transformer-google.md\n│ │ └── ...\n│ ├── sessions/ # Synthesized research reports (committed)\n│ │ ├── 2026-02-25-moe-scaling.md\n│ │ └── ...\n│ └── corpus-index.jsonl # Append-only registry (committed)\n```\n\n| Directory | Committed? 
| Purpose |\n| ---------------------------------- | ---------- | ----------------------------------------------- |\n| `docs/research/corpus/` | Yes | Raw scraped pages — one file per URL per scrape |\n| `docs/research/sessions/` | Yes | Synthesized reports referencing corpus files |\n| `docs/research/corpus-index.jsonl` | Yes | Master index for quick corpus queries |\n\n---\n\n## Raw Corpus File Format\n\nEach file in `docs/research/corpus/` = one Firecrawl-scraped URL, preserved exactly as returned.\n\n### File Naming\n\n```\nYYYY-MM-DD-{slug}.md\n```\n\n- `YYYY-MM-DD` — date of scrape\n- `slug` — kebab-case derived from page title or URL path (max 60 chars)\n\n**Examples**:\n\n- `2026-02-25-moe-scaling-arxiv-2401-12345.md`\n- `2026-02-25-switch-transformer-google-research.md`\n- `2026-02-25-expert-parallelism-deepspeed-docs.md`\n\n### YAML Frontmatter\n\n```yaml\n---\nsource_url: https://arxiv.org/html/2401.12345\nscraped_at: \"2026-02-25T14:30:00Z\"\nscraper: firecrawl\nfirecrawl_endpoint: /v1/search\nsearch_query: \"mixture of experts scaling\"\nresult_index: 2\nresearch_session: \"2026-02-25-moe-scaling\"\ndepth_level: 1\nclaude_code_uuid: SESSION_UUID\ncontent_tokens_approx: 4200\n---\n```\n\n| Field | Type | Required | Description |\n| ----------------------- | -------- | ---------------------- | --------------------------------------------- |\n| `source_url` | URL | Yes | Original URL that was scraped |\n| `scraped_at` | ISO 8601 | Yes | UTC timestamp of scrape |\n| `scraper` | Enum | Yes | `firecrawl`, `jina-reader`, or `direct` |\n| `firecrawl_endpoint` | String | If scraper=firecrawl | `/v1/search` or `/v1/scrape` |\n| `search_query` | String | If endpoint=/v1/search | The search query that found this page |\n| `result_index` | Number | If endpoint=/v1/search | Position in search results (0-based) |\n| `research_session` | String | Yes | Session slug (links to session report) |\n| `depth_level` | Number | Yes | Recursion depth when scraped (1 = top 
level) |\n| `claude_code_uuid` | UUID | Yes | Claude Code session that performed the scrape |\n| `content_tokens_approx` | Number | Yes | Approximate token count (chars / 3.5) |\n\n### Body Content\n\nEverything below the closing `---` is the **exact markdown Firecrawl returned**. Rules:\n\n1. **Never modify** — no summarization, no trimming, no reformatting\n2. **No added headers** — don't prepend `# Title` if Firecrawl didn't include one\n3. **Preserve whitespace** — keep original line breaks, spacing, formatting\n4. **Include artifacts** — if Firecrawl returned table markdown, code blocks, etc., keep them\n\n### One File Per Scrape\n\nIf the same URL is scraped in multiple sessions:\n\n- Each scrape gets its own timestamped file\n- The corpus index tracks all versions\n- Deduplication is the _index's_ job, not the file system's\n\nThis preserves temporal snapshots — content at a URL may change between scrapes.\n\n---\n\n## Corpus Index Format\n\n`docs/research/corpus-index.jsonl` — append-only NDJSON, one line per scraped page.\n\n### Schema\n\n```json\n{\n \"url\": \"https://arxiv.org/html/2401.12345\",\n \"file\": \"corpus/2026-02-25-moe-scaling-arxiv-2401-12345.md\",\n \"scraped_at\": \"2026-02-25T14:30:00Z\",\n \"session\": \"2026-02-25-moe-scaling\",\n \"tokens\": 4200,\n \"scraper\": \"firecrawl\"\n}\n```\n\n| Field | Type | Description |\n| ------------ | ------ | --------------------------------------- |\n| `url` | string | Source URL (for dedup lookups) |\n| `file` | string | Relative path within `docs/research/` |\n| `scraped_at` | string | ISO 8601 UTC timestamp |\n| `session` | string | Research session slug |\n| `tokens` | number | Approximate token count |\n| `scraper` | string | `firecrawl`, `jina-reader`, or `direct` |\n\n### Usage\n\nClaude Code can query the index to find relevant corpus files:\n\n```bash\n# Find all corpus files for a session\ngrep '\"session\":\"2026-02-25-moe-scaling\"' docs/research/corpus-index.jsonl | jq -r '.file'\n\n# 
Check if a URL is already in the corpus\ngrep '\"url\":\"https://arxiv.org/html/2401.12345\"' docs/research/corpus-index.jsonl\n\n# Count corpus entries per session\njq -r '.session' docs/research/corpus-index.jsonl | sort | uniq -c | sort -rn\n```\n\n### Append Pattern\n\n```typescript\nimport { appendFileSync } from \"node:fs\";\n\nfunction appendToCorpusIndex(entry: CorpusIndexEntry): void {\n const line = JSON.stringify(entry) + \"\\n\";\n appendFileSync(\"docs/research/corpus-index.jsonl\", line);\n}\n```\n\n---\n\n## Session Report Format\n\nSynthesized reports in `docs/research/sessions/YYYY-MM-DD-{topic-slug}.md`.\n\n### Structure\n\n```markdown\n---\ntopic: \"Mixture of Experts Scaling Laws\"\nstarted_at: \"2026-02-25T14:00:00Z\"\ncompleted_at: \"2026-02-25T15:30:00Z\"\nbreadth: 4\ndepth: 2\ntotal_queries: 12\nqueries_succeeded: 10\nqueries_failed: 2\ncorpus_files: 35\ntotal_tokens_scraped: 147000\nclaude_code_uuid: SESSION_UUID\n---\n\n# Mixture of Experts Scaling Laws\n\n## Summary\n\n[Synthesized findings organized by theme...]\n\n## Key Findings\n\n1. Finding 1 (from [source](../corpus/2026-02-25-moe-scaling-arxiv.md))\n2. Finding 2 (from [source](../corpus/2026-02-25-switch-transformer.md))\n\n## Open Questions\n\n- Question that couldn't be fully answered\n- Area needing more research\n\n## Sources\n\n| # | Title | Corpus File | Tokens |\n| --- | --------------------- | --------------------------------------------------------------------------------------------------------- | ------ |\n| 1 | Scaling MoE Models... | [corpus/2026-02-25-moe-scaling-arxiv-2401-12345.md](../corpus/2026-02-25-moe-scaling-arxiv-2401-12345.md) | 4200 |\n| 2 | Switch Transformer... 
| [corpus/2026-02-25-switch-transformer-google.md](../corpus/2026-02-25-switch-transformer-google.md) | 6100 |\n| 3 | Expert Parallelism | [corpus/2026-02-25-expert-parallelism-deepspeed.md](../corpus/2026-02-25-expert-parallelism-deepspeed.md) | 3800 |\n\n## Failed Queries\n\n- \"MoE training stability RLHF\" — timeout\n- \"expert routing load balance GPU memory\" — no results\n```\n\n### Source References\n\nEvery finding in the report should link to its source corpus file using relative paths. This lets any future Claude Code session:\n\n1. Read the synthesized report for a quick overview\n2. Drill into specific corpus files for full original content\n3. Re-analyze raw sources with different questions or newer models\n\n---\n\n## Initialization\n\nWhen starting the first research session in a project:\n\n```bash\nmkdir -p docs/research/corpus docs/research/sessions\ntouch docs/research/corpus-index.jsonl\n```\n\nAdd to `.gitignore` if raw corpus files would be too large:\n\n```gitignore\n# Uncomment if corpus files are too large for git\n# docs/research/corpus/\n```\n\nBy default, commit everything — corpus files are markdown and diff cleanly.\n\n---\n\n## Consistency with Existing Patterns\n\n| Field | This Skill | `research-archival` | Match? 
|\n| ------------------ | ---------------------------------- | ------------------------------------ | ------------------------- |\n| `source_url` | Yes | Yes | Same field name |\n| `scraped_at` | Yes | Yes | Same field name, ISO 8601 |\n| `claude_code_uuid` | Yes | Yes | Same field name |\n| `scraper` | `firecrawl`/`jina-reader`/`direct` | N/A (uses `source_type`) | Extended |\n| File naming | `YYYY-MM-DD-{slug}.md` | `YYYY-MM-DD-{slug}-{source_type}.md` | Similar pattern |\n| Index format | JSONL | N/A | From `session-chronicle` |\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":9403,"content_sha256":"fa99d26d57eaba6ddc96dd01120bb07e35ad7bc60d38673afcd5814c4f3c79f8"},{"filename":"references/evolution-log.md","content":"# Evolution Log\n\n> **Convention**: Reverse chronological order (newest on top, oldest at bottom). Prepend new entries. Refer to releases by date, not by version tag — semantic-release owns the version SSoT (see `.releaserc.yml`).\n\n---\n\n## 2026-05-27 (b): Antifragile reconciliation of the morning's URL-routing guard\n\n**Trigger**: A live session later the same day invoked the skill on a `chatgpt.com/share/*` URL and hit a contradiction — the morning's URL-routing guard said \"route AI chat shares out to `Skill(gh-tools:research-archival)`, this skill cannot handle them,\" while Section 5's port-routing table explicitly listed `Gemini/ChatGPT shares → Port 3003 (Needs JS rendering)`. The operator (Claude) had to make a judgment call mid-flow, chose Section 5, and port 3003 returned a 75 KB / 1,734-line scrape successfully. Section 5 was right; the guard was overcautious.\n\n**Root cause**: When the URL-routing guard was introduced in the morning's patch to make AI-chat-share routing visible at the top of the templates section, it was framed as a hand-off (\"Templates A–E are for research-grade source material, not AI chat transcripts\") instead of an **intent split**. 
Both skills wrap the same Firecrawl backend; the difference is what happens to the bytes after they come back (raw file vs. frontmatter + GH issue + provenance). The original line-11 reference to `gh-tools:research-archival` was a _suggestion_, but the guard upgraded it to _exclusion_ without empirical justification.\n\n**Fix 1 — Intent-based routing**: Replaced the URL-pattern hand-off table with an intent-decision table. Operator picks based on what output they want (read-only conversation text vs. full archival pipeline), not based on the URL string. Both rows are valid uses of the same backend.\n\n**Fix 2 — Documented the port-3003 → Caddy two-step**: The skill's Section 5 example showed `curl :3003/scrape?url=...&name=...` as if it returned markdown directly. It does not. It returns JSON of the shape `{\"url\": \"\u003ccaddy-url>\", \"file\": \"\u003cfilename>\"}` — a pointer. The operator must then `GET` the Caddy URL to retrieve the actual markdown. Added the two-step bash snippet, plus a note that the JSON's `url` field embeds the legacy ZeroTier IP and should be reconstructed against the operator's preferred host base.\n\n**Fix 3 — Shell-quoting trap**: Documented that `python3 -c '... print(...)'` inside command substitution leaves a trailing `\\n` which becomes `%0A` in the URL-encoded payload and is silently rejected by the wrapper. Recommend `print(..., end='')`.\n\n**Files modified**:\n\n- `SKILL.md` — replaced \"URL-routing guard\" section (now \"Intent routing — AI chat share URLs\") and Section 5 port-3003/3004 bash block.\n\n**Validation evidence**: The triggering session's port-3003 invocation against `https://chatgpt.com/share/6a168eb9-b118-83e8-8397-2a4ef1a93a5c` returned 75,353 bytes / 1,734 lines of markdown via the Caddy two-step. 
Cannot be retroactively reproduced without re-scraping; the live trace from 2026-05-27T07:10:40Z is the audit record.\n\n---\n\n## 2026-05-27 (a): Three broken-instruction bugs from the prior MINOR release\n\n**Trigger**: A diagnostic session caught — and the very next invocation of the skill demonstrated — three documented-but-unfixed bugs that survived the prior MINOR release:\n\n1. `/v1/health` does not exist on this Firecrawl build. Probing returns HTTP 404 (Express HTML error page) which looks like service-down but isn't.\n2. Bare `littleblack` hostname was labeled \"Preferred\" in the access table but doesn't resolve over HTTP on the m3max client (MagicDNS isn't pushing the search suffix to the system resolver; SSH works only because `~/.ssh/config` hard-codes the FQDN).\n3. Templates A–E had no entry-point guard against AI chat-share URLs.\n\n**Fix**: Replaced all `/v1/health` references with `GET /` (returns 200 + Firecrawl banner). Demoted bare hostname to \"Conditional\" with `dscacheutil`/`getent` preflight; promoted Tailscale FQDN to \"Preferred\". Added URL-routing guard at the top of templates section. (The guard's framing was over-strict — see entry 2026-05-27 (b) above for the reconciliation.)\n\n**Files modified**: `SKILL.md`.\n\n---\n\n## 2026-03-02: Merged firecrawl-self-hosted into this skill\n\n**What**: Absorbed `firecrawl-self-hosted` skill — its SKILL.md condensed into `self-hosted-operations.md` reference, and its 3 reference docs (bootstrap-guide, best-practices, troubleshooting) moved here.\n\n**Why**: The two skills covered the same service (self-hosted Firecrawl). 
Consolidation eliminates skill discovery friction — one skill for all Firecrawl concerns.\n\n**Files added**:\n\n- `references/self-hosted-operations.md` (new — condensed from old SKILL.md)\n- `references/self-hosted-bootstrap-guide.md` (moved + renamed)\n- `references/self-hosted-best-practices.md` (moved + renamed)\n- `references/self-hosted-troubleshooting.md` (moved + renamed)\n\n**Files modified**:\n\n- `SKILL.md` — added self-hosted triggers, Section 5, updated references, removed scope boundary note\n\n---\n\n## 2026-02-26: Initial Evolution Log\n\n**Status**: Skill is in use and maintained. Track improvements here.\n\n### Purpose\n\nThis evolution log tracks updates to the skill. Each entry should note:\n\n- What changed (content, structure, tooling)\n- Why it changed (bug fix, feature request, best practice)\n- Files affected\n\n### How to Use\n\n1. When updating SKILL.md or references, add an entry here with the date\n2. Keep entries reverse-chronological (newest first)\n3. Link to ADRs or GitHub issues when relevant\n4. Reference specific line changes when helpful\n\n---\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":5623,"content_sha256":"e61ed4d7cdd0b08ec8c33c121f1ddebe3555863b825095307dfdfd69ac24079f"},{"filename":"references/recursive-research-protocol.md","content":"# Recursive Research Protocol\n\nStep-by-step protocol for the iterative search → extract → recurse → synthesize pattern. 
Extracted from the working [deep-research Pi extension](~/fork-tools/pi-extensions/extensions/deep-research/).\n\n---\n\n## Parameters\n\n| Parameter | Default | Range | Description |\n| ------------- | ------- | ---------- | -------------------------------------------- |\n| `breadth` | 4 | 1–10 | Search queries generated per recursion level |\n| `depth` | 2 | 1–5 | Maximum recursion depth (capped at 5) |\n| `concurrency` | 2 | 1–4 | Parallel Firecrawl requests via `p-limit` |\n| `limit` | 5 | 1–10 | Results per Firecrawl `/v1/search` call |\n| `timeout` | 15000ms | 5000–60000 | Per-search request timeout |\n\n### Query Budget Estimation\n\nDepth 1 issues `breadth` queries. Each recursion halves breadth (rounded up), and every query at one level spawns that many follow-up queries at the next level.\n\n```\nDepth 1: breadth queries (e.g., 4)\nDepth 2: breadth * ceil(breadth/2) queries (e.g., 4 * 2 = 8)\nDepth 3: breadth * ceil(breadth/2) * ceil(ceil(breadth/2)/2) queries (e.g., 4 * 2 * 1 = 8)\n```\n\nFor breadth=4, depth=2: approximately 4 + 8 = 12 total search queries.\nFor breadth=4, depth=3: approximately 4 + 8 + 8 = 20 total search queries.\n\n---\n\n## Protocol Steps\n\n### Step 1: Health Check\n\nVerify Firecrawl is reachable before starting. Catching an outage here saves minutes of wasted API calls. Probe `GET /`, because this build returns HTTP 404 for `/v1/health`.\n\n```typescript\ntry {\n const res = await fetch(\"http://littleblack:3002/\", {\n signal: AbortSignal.timeout(5_000),\n });\n // A non-2xx status also counts as a failed health check\n if (!res.ok) throw new Error(`Firecrawl probe returned ${res.status}`);\n} catch {\n // Abort — see self-hosted-operations.md and self-hosted-troubleshooting.md references\n}\n```\n\nIf health check fails, do NOT proceed. 
Report the failure and suggest checking the Firecrawl deployment.\n\n### Step 2: Generate Search Queries\n\nGiven the research topic and any prior learnings, generate N search queries (N = breadth).\n\n**Input**: Topic string + accumulated learnings array\n**Output**: Array of `{ query, researchGoal }` objects\n\n```typescript\nconst queries = await generateSerpQueries(topic, breadth, priorLearnings);\n// Returns: [{ query: \"mixture of experts scaling\", researchGoal: \"understand scaling laws\" }, ...]\n```\n\nThe LLM generates diverse queries that avoid duplicating prior learnings. Each query has an explicit `researchGoal` used to focus follow-up recursion.\n\n### Step 3: Execute Searches\n\nFor each query, call Firecrawl `/v1/search` with concurrency control via `p-limit`.\n\n```typescript\nimport pLimit from \"p-limit\";\n\nconst limit = pLimit(concurrency); // default: 2\n\nconst results = await Promise.all(\n queries.map((q) =>\n limit(async () => {\n const searchResult = await firecrawlSearch(\n \"http://littleblack:3002\",\n q.query,\n { timeout: 15_000, limit: 5 },\n );\n return { query: q, data: searchResult.data ?? [] };\n }),\n ),\n);\n```\n\n### Step 4: Persist Raw Results\n\n**CRITICAL**: Save each scraped page to `docs/research/corpus/` BEFORE any LLM processing. This ensures raw content survives even if the session is interrupted.\n\nFor each search result page:\n\n1. Generate filename: `YYYY-MM-DD-{slug}.md`\n2. Write file with YAML frontmatter + raw markdown body\n3. 
Append entry to `docs/research/corpus-index.jsonl`\n\nSee [corpus-persistence-format.md](./corpus-persistence-format.md) for the exact file format.\n\n### Step 5: Extract Learnings\n\nFor each set of search results, pass the scraped content to an LLM to extract:\n\n- **Key learnings**: Factual findings, data points, conclusions\n- **Follow-up questions**: Gaps in understanding that warrant deeper investigation\n\n```typescript\n// Trim each page to fit in context window\nconst trimmedContents = contents.map((c) => trimToTokenLimit(c, 25_000));\n\n// Options object is an assumed signature; adapt to the actual helper\nconst extracted = await processSerpResult(query, trimmedContents, {\n numLearnings: 3, // Extract up to 3 learnings per result set\n numFollowUp: Math.ceil(breadth / 2), // Follow-up questions for next depth\n});\n// Returns: { learnings: string[], followUpQuestions: string[] }\n```\n\n### Step 6: Recurse\n\nFor each follow-up question, recurse with halved breadth and decremented depth.\n\n```typescript\nconst newBreadth = Math.ceil(breadth / 2);\nconst newDepth = depth - 1;\n\nif (newDepth > 0) {\n const nextQuery = `Previous research goal: ${researchGoal}\nFollow-up research directions: ${followUpQuestions.join(\"\\n- \")}`;\n\n return researchLoop(nextQuery, newBreadth, newDepth, allLearnings, ...);\n}\n```\n\n**Why halve breadth**: Deeper levels explore narrower sub-topics. 
Halving breadth prevents exponential query explosion while maintaining focus.\n\n### Step 7: Base Case\n\nWhen `depth = 0`, return accumulated learnings without further recursion.\n\n```typescript\nif (depth === 0) {\n return { learnings: allLearnings, visitedUrls: allUrls };\n}\n```\n\n### Step 8: Early Stopping\n\nStop recursion early when all new learnings duplicate prior ones (no new information being discovered):\n\n```typescript\nconst newLearnings = extracted.learnings.filter(\n (l) => !priorLearnings.some((p) => similarity(l, p) > 0.9),\n);\nif (newLearnings.length === 0) {\n // No new information — stop recursing this branch\n return { learnings: allLearnings, visitedUrls: allUrls };\n}\n```\n\n### Step 9: Synthesize Final Report\n\nPass all accumulated learnings to an LLM for a structured markdown report.\n\n```typescript\nconst report = await writeFinalReport(topic, allLearnings, visitedUrls);\n```\n\nThe report should:\n\n- Organize learnings by theme/subtopic\n- Include a Sources section referencing raw corpus files by relative path\n- Highlight areas of consensus and disagreement across sources\n- Note gaps where information was unavailable\n\n### Step 10: Write Session Report\n\nSave the synthesized report to `docs/research/sessions/YYYY-MM-DD-{topic-slug}.md`.\n\nThe session report includes a Sources table linking to raw corpus files:\n\n```markdown\n## Sources\n\n| # | Title | Corpus File | Tokens |\n| --- | ------------------ | ------------------------------------------------------------------------------------- | ------ |\n| 1 | Scaling MoE... 
| [corpus/2026-02-25-moe-scaling-arxiv.md](../corpus/2026-02-25-moe-scaling-arxiv.md) | 4200 |\n| 2 | Switch Transformer | [corpus/2026-02-25-switch-transformer.md](../corpus/2026-02-25-switch-transformer.md) | 6100 |\n```\n\n---\n\n## Handling Partial Failures\n\nThe protocol is designed to tolerate failures at every level:\n\n| Failure Point | Impact | Recovery |\n| ---------------------------- | ---------------------------- | -------------------------------------------------- |\n| Query generation fails | No queries for this level | Return accumulated learnings |\n| Single search times out | Misses one query's results | Log failure, continue with remaining queries |\n| All searches at a level fail | No new content | Return prior learnings (degraded but usable) |\n| Learning extraction fails | Misses insights from results | Raw corpus files still preserved for manual review |\n| Report generation fails | No synthesized output | Accumulated learnings array is still available |\n| Corpus persistence fails | Raw content not saved | Critical — retry or save to temp location |\n\n**Principle**: At every level, partial results are returned rather than throwing. 
The `queriesFailed` array tracks what didn't work.\n\n---\n\n## Deduplication\n\nResults are deduplicated at the learning and URL level:\n\n```typescript\nreturn {\n learnings: [...new Set(results.flatMap((r) => r.learnings))],\n visitedUrls: [...new Set(results.flatMap((r) => r.visitedUrls))],\n};\n```\n\nThe corpus index (`corpus-index.jsonl`) enables cross-session deduplication — check if a URL was already scraped before re-scraping.\n\n---\n\n## Visualization\n\n```\nTopic: \"mixture of experts scaling\"\n│\n├─ Depth 1 (breadth=4)\n│ ├─ Query 1: \"MoE scaling laws\" → 5 pages → 3 learnings\n│ ├─ Query 2: \"switch transformer efficiency\" → 5 pages → 2 learnings\n│ ├─ Query 3: \"expert parallelism GPU\" → 5 pages → 3 learnings\n│ └─ Query 4: \"MoE vs dense models\" → 5 pages → 2 learnings\n│ │\n│ └─ Depth 2 (breadth=2, per follow-up from each Query)\n│ ├─ Follow-up 1a: \"MoE load balancing\" → 5 pages → 2 learnings\n│ ├─ Follow-up 1b: \"expert dropout\" → 5 pages → 1 learning\n│ ├─ Follow-up 2a: \"MoE inference cost\" → 5 pages → 2 learnings\n│ └─ ... (more follow-ups)\n│\n└─ Synthesize: 15+ learnings → Final Report\n └─ Corpus: 20-40 raw markdown files preserved\n```\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":8817,"content_sha256":"dd7e7eb67aa4e6c01696184b44d13089403c87cfb968a2e2449788ad36c31a2a"},{"filename":"references/self-hosted-best-practices.md","content":"# Firecrawl Best Practices (Empirically Verified)\n\n## 1. Always Use `restart: unless-stopped`\n\nDocker default is `no` restart policy. Containers WILL stop on SIGINT/SIGTERM and not recover.\n\n**Anti-pattern**:\n\n```yaml\nservices:\n api:\n image: firecrawl/api\n # Missing restart policy = container dies and stays dead\n```\n\n**Correct**:\n\n```yaml\nservices:\n api:\n image: firecrawl/api\n restart: unless-stopped # Auto-restart on crash or signal\n```\n\n## 2. Use YAML Anchors for Consistency\n\nDon't repeat `restart: unless-stopped` for each service. 
Use anchors:\n\n```yaml\nx-common-service: &common-service\n restart: unless-stopped\n logging:\n driver: \"json-file\"\n options:\n max-size: \"1G\"\n max-file: \"4\"\n\nservices:\n api:\n \u003c\u003c: *common-service\n # ...\n```\n\n## 3. Verify After docker compose up\n\nALWAYS verify restart policies after `docker compose up -d`:\n\n```bash\ndocker inspect --format \"{{.Name}}: {{.HostConfig.RestartPolicy.Name}}\" \\\n $(docker ps -a --filter \"name=firecrawl\" -q)\n```\n\n## 4. Use systemd for Non-Docker Services\n\nFor Bun scripts and Caddy, use systemd with `Restart=always`:\n\n```ini\n[Service]\nRestart=always\nRestartSec=5\n```\n\n## 5. Monitor with Health Checks\n\nAdd a periodic health check to catch silent failures. Probe `GET /` rather than `/health`, which returns HTTP 404 on this build and would trigger a restart on every run:\n\n```bash\n# Add to crontab\n*/5 * * * * curl -sf -o /dev/null http://localhost:3002/ || (cd \"$HOME/firecrawl\" && docker compose restart)\n```\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":1376,"content_sha256":"6edefe4fc0de0b18ee2680f2f43b28347f1caabb7be1b2450d841eb0548949f1"},{"filename":"references/self-hosted-bootstrap-guide.md","content":"# Firecrawl Bootstrap: Fresh Installation\n\n## Prerequisites\n\n- Debian/Ubuntu server with Docker\n- Tailscale network membership (tailnet: terrylica.github)\n- Domain or static IP (optional, for public access)\n\n## Step 1: Clone Repository\n\n```bash\ncd ~\ngit clone https://github.com/mendableai/firecrawl.git\ncd firecrawl\n```\n\n## Step 2: Configure docker-compose.yaml\n\n**CRITICAL**: Add restart policy to prevent shutdown on signals:\n\n```yaml\nx-common-service: &common-service\n networks:\n - backend\n restart: unless-stopped # \u003c-- ADD THIS\n logging:\n driver: \"json-file\"\n options:\n max-size: \"1G\"\n max-file: \"4\"\n```\n\nApply to all services using the anchor:\n\n```yaml\nservices:\n api:\n \u003c\u003c: *common-service\n # ...\n playwright-service:\n \u003c\u003c: *common-service\n # ...\n redis:\n \u003c\u003c: *common-service\n # ...\n rabbitmq:\n \u003c\u003c: 
*common-service\n # ...\n```\n\n## Step 3: Environment Variables\n\nCreate `.env` from template:\n\n```bash\ncp .env.example .env\n```\n\nMinimal required settings:\n\n```bash\n# .env\nNUM_WORKERS_PER_QUEUE=2\nPORT=3002\nHOST=0.0.0.0\nREDIS_URL=redis://redis:6379\nREDIS_RATE_LIMIT_URL=redis://redis:6379\n```\n\n## Step 4: Start Services\n\n```bash\ndocker compose up -d\n```\n\n## Step 5: Verify Restart Policies\n\n```bash\ndocker inspect --format \"{{.Name}}: RestartPolicy={{.HostConfig.RestartPolicy.Name}}\" \\\n $(docker ps -a --filter \"name=firecrawl\" -q)\n```\n\nAll should show `unless-stopped`.\n\n## Step 6: Optional - Scraper Wrapper\n\nCreate `~/firecrawl-scraper.ts`:\n\n```typescript\nimport { serve } from \"bun\";\nimport { $ } from \"bun\";\n\nconst FIRECRAWL_API = \"http://localhost:3002\";\nconst OUTPUT_DIR = \"/home/kab/firecrawl-output\";\n\nserve({\n port: 3003,\n async fetch(req) {\n const url = new URL(req.url);\n\n if (url.pathname === \"/health\") {\n return new Response(\"OK\", { status: 200 });\n }\n\n if (url.pathname === \"/scrape\") {\n const targetUrl = url.searchParams.get(\"url\");\n const name = url.searchParams.get(\"name\") || \"scraped\";\n\n if (!targetUrl) {\n return Response.json(\n { error: \"url parameter required\" },\n { status: 400 },\n );\n }\n\n const response = await fetch(`${FIRECRAWL_API}/v1/scrape`, {\n method: \"POST\",\n headers: { \"Content-Type\": \"application/json\" },\n body: JSON.stringify({\n url: targetUrl,\n formats: [\"markdown\"],\n waitFor: 5000,\n }),\n });\n\n const data = await response.json();\n const markdown = data?.data?.markdown;\n\n if (!markdown) {\n return Response.json(\n { error: \"No markdown returned\" },\n { status: 500 },\n );\n }\n\n const timestamp = new Date().toISOString().replace(/[:.]/g, \"-\");\n const filename = `${name}-${timestamp}.md`;\n const filepath = `${OUTPUT_DIR}/${filename}`;\n\n await Bun.write(filepath, markdown);\n\n return Response.json({\n url: 
`http://littleblack:8080/${filename}`,\n file: filename,\n });\n }\n\n return new Response(\"Not Found\", { status: 404 });\n },\n});\n```\n\nCreate systemd user service `~/.config/systemd/user/firecrawl-scraper.service`:\n\n```ini\n[Unit]\nDescription=Firecrawl Scraper Wrapper\nAfter=network.target\n\n[Service]\nType=simple\nWorkingDirectory=/home/kab\nExecStart=/home/kab/.bun/bin/bun run firecrawl-scraper.ts\nRestart=always\nRestartSec=5\n\n[Install]\nWantedBy=default.target\n```\n\nEnable:\n\n```bash\nsystemctl --user daemon-reload\nsystemctl --user enable --now firecrawl-scraper\n```\n\n## Step 7: Optional - Caddy File Server\n\nDownload Caddy from [GitHub releases](https://github.com/caddyserver/caddy/releases) (latest version).\n\n```bash\n# Download and extract (check releases for current version)\nwget https://github.com/caddyserver/caddy/releases/download/v\u003cversion>/caddy_\u003cversion>_linux_amd64.tar.gz # SSoT-OK\ntar xzf caddy_*.tar.gz\nchmod +x caddy\n```\n\nCreate systemd user service `~/.config/systemd/user/caddy-firecrawl.service`:\n\n```ini\n[Unit]\nDescription=Caddy Firecrawl File Server\nAfter=network.target\n\n[Service]\nType=simple\nWorkingDirectory=/home/kab\nExecStart=/home/kab/caddy file-server --root /home/kab/firecrawl-output --listen :8080 --browse\nRestart=always\nRestartSec=5\n\n[Install]\nWantedBy=default.target\n```\n\nEnable:\n\n```bash\nsystemctl --user daemon-reload\nsystemctl --user enable --now caddy-firecrawl\n```\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":4338,"content_sha256":"495b7835072ed601c62ff7b40b3b3edd6ab1f7a091bf1b93c7be664a1c01f9a8"},{"filename":"references/self-hosted-operations.md","content":"# Firecrawl Self-Hosted Operations\n\nDeployment, health checks, recovery, and best practices for the self-hosted Firecrawl instance.\n\n**Host**: littleblack (Tailscale: `littleblack.tail0f299b.ts.net`, legacy ZeroTier: `172.25.236.1`). 
All 5 containers up 5+ weeks, stable.\n**Source**: \u003chttps://github.com/mendableai/firecrawl>\n\n## Architecture Overview\n\n```\n┌─────────────────────────────────────────────────────────────────┐\n│ littleblack (Tailscale) │\n├─────────────────────────────────────────────────────────────────┤\n│ │\n│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │\n│ │ Client │───▶│ Scraper │───▶│ Firecrawl │ │\n│ │ (curl) │ │ Wrapper :3003│ │ API :3002 │ │\n│ └──────────────┘ └──────────────┘ └──────────────┘ │\n│ │ │ │ │\n│ │ │ ▼ │\n│ │ │ ┌──────────────┐ │\n│ │ │ │ Playwright │ │\n│ │ │ │ Service │ │\n│ │ │ └──────────────┘ │\n│ │ │ │ │\n│ │ ▼ ▼ │\n│ │ ┌──────────────┐ ┌──────────────┐ │\n│ │ │ Caddy :8080 │ │ Redis │ │\n│ │ │ (files) │ │ RabbitMQ │ │\n│ ▼ └──────────────┘ └──────────────┘ │\n│ ┌──────────────┐ │\n│ │ Output URL │◀── http://littleblack:8080/NAME-TS.md │\n│ └──────────────┘ │\n│ │\n└─────────────────────────────────────────────────────────────────┘\n```\n\n## Quick Reference\n\n| Port | Service | Type | Purpose |\n| ---- | ----------------- | ------ | -------------------------------------------------- |\n| 3002 | Firecrawl API | Docker | Core scraping engine (direct API) |\n| 3003 | Scraper Wrapper | Bun | JS-rendered SPAs, saves to file, returns Caddy URL |\n| 3004 | Cloudflare Bypass | Bun | curl-impersonate for Cloudflare-protected sites |\n| 8080 | Caddy | Binary | Serves saved markdown from firecrawl-output/ |\n\n### When to Use Which Port\n\n| Target | Port | Reason |\n| ---------------------- | ---- | --------------------------------------------- |\n| arXiv / standard pages | 3003 | Playwright JS rendering, preserves image URLs |\n| Claude artifacts | 3004 | Cloudflare blocks Playwright |\n| Gemini/ChatGPT shares | 3003 | Needs JS rendering (SPA) |\n| Other Cloudflare sites | 3004 | If 3003 gets a Cloudflare challenge |\n\n## Usage\n\n### Recommended: Wrapper Endpoint (port 3003)\n\n```bash\ncurl 
\"http://littleblack:3003/scrape?url=URL&name=NAME\"\n```\n\nReturns:\n\n```json\n{\n \"url\": \"http://littleblack:8080/NAME-TIMESTAMP.md\",\n \"file\": \"NAME-TIMESTAMP.md\"\n}\n```\n\n### Direct API (Advanced)\n\n```bash\ncurl -s -X POST http://littleblack:3002/v1/scrape \\\n -H \"Content-Type: application/json\" \\\n -d '{\"url\":\"URL\",\"formats\":[\"markdown\"],\"waitFor\":5000}' \\\n | jq -r '.data.markdown'\n```\n\n## Health Checks\n\n### Quick Status\n\n```bash\n# All containers running?\nssh littleblack 'docker ps --filter \"name=firecrawl\" --format \"{{.Names}}: {{.Status}}\"'\n\n# API responding?\nssh littleblack 'curl -s -o /dev/null -w \"%{http_code}\" http://localhost:3002/v1/scrape'\n# Expected: 401 (no payload) or 200 (with payload)\n\n# Wrapper responding?\ncurl -s -o /dev/null -w \"%{http_code}\" \"http://littleblack:3003/health\"\n```\n\n### Detailed Status\n\n```bash\n# systemd services (services run under kab user, not yca SSH user)\nssh littleblack \"sudo systemctl --user -M kab@ status firecrawl-scraper caddy-firecrawl\"\n\n# Docker container details\nssh littleblack 'docker ps -a --filter \"name=firecrawl\" --format \"table {{.Names}}\\t{{.Status}}\\t{{.Ports}}\"'\n\n# Logs (live)\nssh littleblack \"sudo journalctl --user -M kab@ -u firecrawl-scraper -u caddy-firecrawl -f\"\n```\n\n**Note**: Firecrawl services run under the `kab` user on littleblack. The SSH user is `yca`. 
Always use `sudo systemctl --user -M kab@` — plain `systemctl --user` targets the SSH user and sees no services.\n\n## Recovery Commands Cheatsheet\n\n```bash\n# Full restart (all services)\nssh littleblack 'cd ~/firecrawl && docker compose restart'\nssh littleblack 'sudo systemctl --user -M kab@ restart firecrawl-scraper caddy-firecrawl'\n\n# Check everything\nssh littleblack 'docker ps --filter \"name=firecrawl\" && sudo systemctl --user -M kab@ status firecrawl-scraper caddy-firecrawl --no-pager'\n\n# Logs (last 100 lines)\nssh littleblack 'docker logs firecrawl-api-1 --tail 100'\nssh littleblack 'sudo journalctl --user -M kab@ -u firecrawl-scraper --no-pager -n 100'\n\n# Force recreate with new config\nssh littleblack 'cd ~/firecrawl && docker compose up -d --force-recreate'\n\n# Verify restart policies\nssh littleblack 'docker inspect --format \"{{.Name}}: RestartPolicy={{.HostConfig.RestartPolicy.Name}}\" $(docker ps -a --filter \"name=firecrawl\" -q)'\n```\n\n## Cloudflare Bypass (Port 3004)\n\nFor sites that block Playwright-based scraping (Cloudflare challenge pages), use the curl-impersonate bypass service:\n\n```bash\ncurl \"http://littleblack:3004/scrape-cf?url=URL&name=NAME\"\n```\n\nThis uses `curl-impersonate` to mimic a real browser TLS fingerprint, bypassing Cloudflare's bot detection. 
Use it when port 3003 returns a Cloudflare challenge instead of page content.\n\n## Files Reference\n\n| Path on littleblack | Purpose |\n| --------------------------------- | --------------------------------- |\n| `~/firecrawl/` | Firecrawl Docker deployment |\n| `~/firecrawl/docker-compose.yaml` | Docker orchestration (EDIT THIS) |\n| `~/firecrawl/.env` | Environment configuration |\n| `~/firecrawl-scraper.ts` | Bun wrapper script |\n| `~/firecrawl-output/` | Saved markdown files (Caddy root) |\n| `~/caddy` | Caddy binary |\n| `~/.config/systemd/user/` | User systemd services |\n\n## Related Guides\n\n- [Self-Hosted Bootstrap Guide](./self-hosted-bootstrap-guide.md) — 7-step fresh installation\n- [Self-Hosted Best Practices](./self-hosted-best-practices.md) — Docker restart policies, health monitoring\n- [Self-Hosted Troubleshooting](./self-hosted-troubleshooting.md) — Symptom-based diagnosis and recovery\n\n## External References\n\n- [Firecrawl Official Docs](https://docs.firecrawl.dev/) - API reference\n- [Docker Compose Restart](https://docs.docker.com/compose/compose-file/05-services/#restart) - Policy options\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":8196,"content_sha256":"a261f815666b9080409865547e422121104f05bdffacd06bf9d7b3d33eaf04fb"},{"filename":"references/self-hosted-troubleshooting.md","content":"# Firecrawl Troubleshooting\n\n## Symptom: API Container Stopped\n\n**Root Cause**: Docker restart policy was `no` (default). 
Container received SIGINT and didn't restart.\n\n**Diagnosis**:\n\n```bash\n# Check container status\nssh littleblack 'docker ps -a --filter \"name=firecrawl\"'\n\n# Check restart policy\nssh littleblack 'docker inspect --format \"{{.Name}}: {{.HostConfig.RestartPolicy.Name}}\" $(docker ps -a --filter \"name=firecrawl\" -q)'\n```\n\n**Fix**: Add `restart: unless-stopped` to ALL services in `docker-compose.yaml`:\n\n```yaml\n# ~/firecrawl/docker-compose.yaml\nx-common-service: &common-service\n networks:\n - backend\n restart: unless-stopped # CRITICAL: Add this line\n logging:\n driver: \"json-file\"\n options:\n max-size: \"1G\"\n max-file: \"4\"\n\nservices:\n playwright-service:\n \u003c\u003c: *common-service\n # ... rest of config\n\n api:\n \u003c\u003c: *common-service\n # ... rest of config\n\n redis:\n \u003c\u003c: *common-service\n # ... rest of config\n\n rabbitmq:\n \u003c\u003c: *common-service\n # ... rest of config\n```\n\n**Apply Fix**:\n\n```bash\nssh littleblack 'cd ~/firecrawl && docker compose up -d --force-recreate'\n```\n\n**Verify**:\n\n```bash\nssh littleblack 'docker inspect --format \"{{.Name}}: RestartPolicy={{.HostConfig.RestartPolicy.Name}}\" $(docker ps -a --filter \"name=firecrawl\" -q)'\n# All should show: RestartPolicy=unless-stopped\n```\n\n## Symptom: Scraper Wrapper Not Responding\n\n**Diagnosis**:\n\n```bash\n# Services run under the kab user, so target kab's user manager (plain --user sees nothing)\nssh littleblack \"sudo systemctl --user -M kab@ status firecrawl-scraper\"\n```\n\n**Fix**:\n\n```bash\nssh littleblack \"sudo systemctl --user -M kab@ restart firecrawl-scraper\"\n```\n\n## Symptom: Caddy File Server Down\n\n**Diagnosis**:\n\n```bash\nssh littleblack \"sudo systemctl --user -M kab@ status caddy-firecrawl\"\ncurl -I http://littleblack:8080/\n```\n\n**Fix**:\n\n```bash\nssh littleblack \"sudo systemctl --user -M kab@ restart caddy-firecrawl\"\n```\n\n## Symptom: Tailscale Unreachable\n\n**Diagnosis**:\n\n```bash\n# From local machine\ntailscale ping littleblack\n\n# Check Tailscale status\ntailscale status\n```\n\n**Fix**: Re-authorize device in Tailscale 
admin console if needed.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":2032,"content_sha256":"7e6a7c0f6fde8cbfd3a0e533b4d9eeee7c31b2eb433a5da039a2d12e84a033ad"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"Firecrawl Research Patterns","type":"text"}]},{"type":"paragraph","content":[{"text":"Programmatic patterns for using self-hosted Firecrawl in research workflows — search, scrape, route academic papers, run recursive deep research, and persist raw results for future re-analysis. Also covers self-hosted deployment, health checks, and recovery.","type":"text"}]},{"type":"paragraph","content":[{"text":"For archiving AI chat conversations (ChatGPT/Gemini shares), see ","type":"text"},{"text":"Skill(gh-tools:research-archival)","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"blockquote","content":[{"type":"paragraph","content":[{"text":"Self-Evolving Skill","type":"text","marks":[{"type":"strong"}]},{"text":": This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. 
Only update for real, reproducible issues.","type":"text"}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"FIRST — TodoWrite Task Templates","type":"text"}]},{"type":"paragraph","content":[{"text":"MANDATORY","type":"text","marks":[{"type":"strong"}]},{"text":": Select and load the appropriate template before any research work.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Intent routing — AI chat share URLs (chatgpt / gemini / claude)","type":"text"}]},{"type":"paragraph","content":[{"text":"AI chat share URLs (","type":"text"},{"text":"chatgpt.com/share/*","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"chat.openai.com/share/*","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"gemini.google.com/share/*","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"g.co/gemini/share/*","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"claude.ai/share/*","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"claude.ai/chat/*","type":"text","marks":[{"type":"code_inline"}]},{"text":") can be processed by ","type":"text"},{"text":"either","type":"text","marks":[{"type":"strong"}]},{"text":" this skill or ","type":"text"},{"text":"Skill(gh-tools:research-archival)","type":"text","marks":[{"type":"code_inline"}]},{"text":". 
Pick by ","type":"text"},{"text":"intent","type":"text","marks":[{"type":"strong"}]},{"text":", not URL pattern:","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Your intent","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Skill","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Output","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"One-off read / extract conversation text for analysis","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"This skill","type":"text","marks":[{"type":"strong"}]},{"text":" — port 3003 (Sec. 
5)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Markdown file on Caddy; no frontmatter, no Issue, no provenance.","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Long-term archive with identity verification, frontmatter, GitHub Issue cross-link","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Skill(gh-tools:research-archival)","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"docs/research/YYYY-MM-DD-{slug}-{type}.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" + issue with Discovery Provenance.","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Already have the file, just need to scrape extra content into the same corpus file","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"This skill","type":"text","marks":[{"type":"strong"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Append-mode workflow under your control.","type":"text"}]}]}]}]},{"type":"blockquote","content":[{"type":"paragraph","content":[{"text":"Both paths share the same Firecrawl backend.","type":"text","marks":[{"type":"strong"}]},{"text":" ","type":"text"},{"text":"research-archival","type":"text","marks":[{"type":"code_inline"}]},{"text":" calls Firecrawl too — it adds an archival layer on top. 
There is no scraping capability gap between the two; the difference is what happens to the bytes after they come back.","type":"text"}]}]},{"type":"paragraph","content":[{"text":"WebFetch limitation, regardless of intent","type":"text","marks":[{"type":"strong"}]},{"text":": Claude Code hard-blocks ","type":"text"},{"text":"WebFetch","type":"text","marks":[{"type":"code_inline"}]},{"text":" against ","type":"text"},{"text":"chatgpt.com","type":"text","marks":[{"type":"code_inline"}]},{"text":". Use Firecrawl (this skill, any port) or Jina Reader instead. Verified 2026-05-27.","type":"text"}]},{"type":"paragraph","content":[{"text":"Empirical note","type":"text","marks":[{"type":"strong"}]},{"text":" (2026-05-27): port 3003 successfully scrapes ChatGPT shares — ","type":"text"},{"text":"curl :3003/scrape?url=...&name=...","type":"text","marks":[{"type":"code_inline"}]},{"text":" returned a 75 KB / 1,734-line markdown for a real ChatGPT share via the Caddy two-step pattern (see Section 5). Earlier guidance that said \"route AI chat shares out\" was overcautious and contradicted Section 5's port table.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Template A — Single Firecrawl Search + Persist","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"1. Health check — GET http://littleblack.tail0f299b.ts.net:3002/ (expect 200 + {\"message\":\"Firecrawl API\",...}; NEVER use /v1/health — it 404s)\n2. Execute search — POST /v1/search with query, limit, scrapeOptions\n3. Persist raw results — save each result page to docs/research/corpus/ with frontmatter\n4. Update corpus index — append entries to docs/research/corpus-index.jsonl\n5. 
Extract findings — summarize key learnings from raw corpus files","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Template B — Academic Paper Retrieval + Persist","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"1. Identify source — classify URL/DOI per academic-paper-routing.md decision tree\n2. Route to scraper — arxiv direct HTML, Semantic Scholar API, Firecrawl, or Jina Reader\n3. Scrape content — execute fetch with appropriate method and timeout\n4. Persist raw result — save to docs/research/corpus/ with academic-specific frontmatter\n5. Update corpus index — append entry to corpus-index.jsonl\n6. Summarize paper — extract key claims, methods, results from raw corpus file","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Template C — Full Recursive Deep Research with Corpus","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"1. Health check — GET http://littleblack.tail0f299b.ts.net:3002/ (expect 200 + Firecrawl banner; NEVER /v1/health — it 404s)\n2. Initialize parameters — set breadth (default 4), depth (default 2), concurrency (default 2)\n3. Generate search queries — LLM generates N queries from topic + prior learnings\n4. Execute searches — Firecrawl /v1/search for each query via p-limit(concurrency)\n5. Persist raw results — save ALL scraped pages to docs/research/corpus/ with provenance\n6. Extract learnings — LLM extracts key findings + follow-up questions per result set\n7. Recurse — for each follow-up, recurse with breadth=ceil(breadth/2), depth=depth-1\n8. Base case — depth=0, return accumulated learnings\n9. Synthesize report — LLM generates final markdown from all learnings\n10. Write session report — save to docs/research/sessions/ with corpus file references\n11. 
Update corpus index — append all new entries to corpus-index.jsonl","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Template D — Corpus Review / Re-Analysis","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"1. Inventory corpus — read docs/research/corpus-index.jsonl, filter by session/topic/date\n2. Read raw files — load matching corpus files from docs/research/corpus/\n3. Re-analyze — extract new insights with current context/questions\n4. Update session report — amend or create new session report in docs/research/sessions/","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Template E — Image-Rich Paper with Inline Figures","type":"text"}]},{"type":"paragraph","content":[{"text":"Use when paper contains architecture diagrams, result plots, attention maps, or any critical visual content.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"1. Scrape text — use port 3003 (preferred, preserves absolute image URLs) or Jina fallback\n2. Detect figures — scan scraped markdown for ![alt](URL) patterns with .png/.jpg/.svg\n3. Extract figure URLs — for arXiv: probe https://arxiv.org/html/{id}v{n}/x{N}.png until 404\n4. Keep URLs inline — DO NOT rewrite to local relative paths (breaks GitHub rendering)\n5. Ensure inline embedding — markdown body must have ![Figure N](absolute-url) for each figure\n6. Catalog in frontmatter — add figure_count and figure_urls list (all absolute URLs)\n7. Save corpus file — GFM markdown with inline absolute URLs renders on GitHub without hosting\n8. 
Update corpus-index.jsonl — include has_figures: true, figure_count, figure_urls","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Section 1 — Programmatic Firecrawl Usage","type":"text"}]},{"type":"paragraph","content":[{"text":"Instance","type":"text","marks":[{"type":"strong"}]},{"text":": Self-hosted on ","type":"text"},{"text":"littleblack","type":"text","marks":[{"type":"strong"}]},{"text":" — Debian 12 (bookworm), kernel 6.1.0-31, hostname ","type":"text"},{"text":"kab","type":"text","marks":[{"type":"code_inline"}]},{"text":", login user ","type":"text"},{"text":"yca","type":"text","marks":[{"type":"code_inline"}]},{"text":", RTX 2080 Ti, 62 GiB RAM. No API key required for any Firecrawl endpoint.","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Access path","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"URL base","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"When to use","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Tailscale FQDN","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"http://littleblack.tail0f299b.ts.net:3002","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Preferred.","type":"text","marks":[{"type":"strong"}]},{"text":" Works on every tailnet-attached client 
regardless of MagicDNS resolver state.","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Tailscale IP","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"http://100.78.106.112:3002","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Bypasses DNS entirely; stable while the tailnet device exists.","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Tailscale MagicDNS","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"http://littleblack:3002","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Conditional — only when bare-name resolution works (see preflight below).","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Same-LAN direct","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"http://192.168.1.67:3002","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Only when the client is on the Telus PureFibre LAN (","type":"text"},{"text":"eno1","type":"text","marks":[{"type":"code_inline"}]},{"text":" 
interface).","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Legacy ZeroTier","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"http://172.25.236.1:3002","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Fragile fallback (","type":"text"},{"text":"ztksetviym","type":"text","marks":[{"type":"code_inline"}]},{"text":" interface). Prefer Tailscale.","type":"text"}]}]}]}]},{"type":"paragraph","content":[{"text":"MagicDNS preflight","type":"text","marks":[{"type":"strong"}]},{"text":" (run before relying on bare ","type":"text"},{"text":"littleblack","type":"text","marks":[{"type":"code_inline"}]},{"text":"):","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# macOS — does the OS resolver know about the bare name?\ndscacheutil -q host -a name littleblack | grep -q '^ip_address' && echo OK || echo MISSING\n\n# Cross-platform — does any path resolve?\ngetent hosts littleblack 2>/dev/null || ping -c1 -W1 littleblack 2>&1 | head -1","type":"text"}]},{"type":"paragraph","content":[{"text":"If preflight returns ","type":"text"},{"text":"MISSING","type":"text","marks":[{"type":"code_inline"}]},{"text":" / \"cannot resolve\", ","type":"text"},{"text":"use the FQDN row.","type":"text","marks":[{"type":"strong"}]},{"text":" SSH happens to work because ","type":"text"},{"text":"~/.ssh/config","type":"text","marks":[{"type":"code_inline"}]},{"text":" hard-codes the FQDN under the ","type":"text"},{"text":"Host littleblack","type":"text","marks":[{"type":"code_inline"}]},{"text":" alias — that's an SSH-only shortcut, not a system-wide DNS facility. 
Bare ","type":"text"},{"text":"littleblack","type":"text","marks":[{"type":"code_inline"}]},{"text":" over HTTP fails silently as ","type":"text"},{"text":"HTTP 000","type":"text","marks":[{"type":"code_inline"}]},{"text":" when the resolver doesn't have it; the failure mode is invisible without ","type":"text"},{"text":"ping","type":"text","marks":[{"type":"code_inline"}]},{"text":"/","type":"text"},{"text":"dscacheutil","type":"text","marks":[{"type":"code_inline"}]},{"text":". Confirmed broken on ","type":"text"},{"text":"m3max","type":"text","marks":[{"type":"code_inline"}]},{"text":" (this Mac) as of 2026-05-27.","type":"text"}]},{"type":"paragraph","content":[{"text":"SSH (for ops, not API calls): ","type":"text"},{"text":"ssh littleblack","type":"text","marks":[{"type":"code_inline"}]},{"text":" — defined in ","type":"text"},{"text":"~/.ssh/config","type":"text","marks":[{"type":"code_inline"}]},{"text":" as ","type":"text"},{"text":"HostName littleblack.tail0f299b.ts.net","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"User yca","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"IdentityFile ~/.ssh/id_ed25519_zerotier_np","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Why ","type":"text"},{"text":"fetch()","type":"text","marks":[{"type":"code_inline"}]},{"text":" Instead of ","type":"text"},{"text":"@mendable/firecrawl-js","type":"text","marks":[{"type":"code_inline"}]},{"text":" SDK","type":"text"}]},{"type":"paragraph","content":[{"text":"The official SDK uses ","type":"text"},{"text":"jiti","type":"text","marks":[{"type":"code_inline"}]},{"text":" for dynamic imports, which is incompatible with Bun's module resolution. 
Direct ","type":"text"},{"text":"fetch()","type":"text","marks":[{"type":"code_inline"}]},{"text":" calls are simpler, more reliable, and have zero dependencies.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Two Endpoints","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Endpoint","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Purpose","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"When to Use","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"POST /v1/search","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Search + scrape combo","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Research queries — returns multiple scraped pages","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"POST /v1/scrape","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Single URL scrape","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Known URL — extract markdown from one 
page","type":"text"}]}]}]}]},{"type":"paragraph","content":[{"text":"See ","type":"text"},{"text":"api-endpoint-reference.md","type":"text","marks":[{"type":"link","attrs":{"href":"./references/api-endpoint-reference.md","title":null}}]},{"text":" for full request/response contracts.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Quick Examples","type":"text"}]},{"type":"paragraph","content":[{"text":"Use the FQDN base URL — works on every tailnet-attached client regardless of MagicDNS resolver state. Pull from ","type":"text"},{"text":"$FIRECRAWL_BASE","type":"text","marks":[{"type":"code_inline"}]},{"text":" env var if your project sets one, otherwise hard-code the FQDN:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"const FIRECRAWL_BASE =\n process.env.FIRECRAWL_BASE ?? \"http://littleblack.tail0f299b.ts.net:3002\";","type":"text"}]},{"type":"paragraph","content":[{"text":"Search","type":"text","marks":[{"type":"strong"}]},{"text":" (returns multiple results with markdown):","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"const res = await fetch(`${FIRECRAWL_BASE}/v1/search`, {\n method: \"POST\",\n headers: { \"Content-Type\": \"application/json\" },\n body: JSON.stringify({\n query: \"mixture of experts scaling laws\",\n limit: 5,\n scrapeOptions: { formats: [\"markdown\"] },\n }),\n});\nconst { data } = await res.json(); // data: [{ url, markdown, metadata }]","type":"text"}]},{"type":"paragraph","content":[{"text":"Scrape","type":"text","marks":[{"type":"strong"}]},{"text":" (single URL):","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"const res = await fetch(`${FIRECRAWL_BASE}/v1/scrape`, {\n method: \"POST\",\n headers: { \"Content-Type\": \"application/json\" },\n body: JSON.stringify({\n url: \"https://arxiv.org/abs/2401.12345\",\n formats: 
[\"markdown\"],\n waitFor: 3000, // ms — for JS-heavy pages\n }),\n});\nconst { data } = await res.json(); // data: { markdown, metadata }","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Error Handling","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"// Always set a timeout\nconst controller = new AbortController();\nconst timeoutId = setTimeout(() => controller.abort(), 15_000);\n\ntry {\n const res = await fetch(url, { ...opts, signal: controller.signal });\n if (!res.ok) throw new Error(`Firecrawl: ${res.status} ${res.statusText}`);\n const json = await res.json();\n if (!json.data || (Array.isArray(json.data) && json.data.length === 0)) {\n // Empty results — not an error, but no content to process\n }\n} finally {\n clearTimeout(timeoutId);\n}","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Health Check","type":"text"}]},{"type":"blockquote","content":[{"type":"paragraph","content":[{"text":"There is no ","type":"text","marks":[{"type":"strong"}]},{"text":"/v1/health","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" endpoint on this Firecrawl build.","type":"text","marks":[{"type":"strong"}]},{"text":" Probing it returns HTTP 404 (Express's HTML error page), which looks like a service-down signal but isn't. Use the root ","type":"text"},{"text":"/","type":"text","marks":[{"type":"code_inline"}]},{"text":" endpoint, which returns HTTP 200 with ","type":"text"},{"text":"{\"message\":\"Firecrawl API\",\"documentation_url\":\"https://docs.firecrawl.dev\"}","type":"text","marks":[{"type":"code_inline"}]},{"text":". 
Confirmed 2026-05-27 against ports 3002 / FQDN / IP.","type":"text"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"// Quick health check before starting a research session.\n// Uses the Tailscale FQDN — works regardless of MagicDNS resolver state.\nconst FIRECRAWL_BASE = \"http://littleblack.tail0f299b.ts.net:3002\";\nconst res = await fetch(`${FIRECRAWL_BASE}/`);\nif (!res.ok) {\n throw new Error(\n `Firecrawl unreachable (${res.status}) — see self-hosted-operations.md and self-hosted-troubleshooting.md`,\n );\n}\nconst banner = await res.json();\nif (banner.message !== \"Firecrawl API\") {\n throw new Error(\n `Unexpected root response: ${JSON.stringify(banner).slice(0, 200)}`,\n );\n}","type":"text"}]},{"type":"paragraph","content":[{"text":"For a true end-to-end probe (proves the full search/scrape stack works, not just the HTTP listener), ","type":"text"},{"text":"POST /v1/scrape","type":"text","marks":[{"type":"code_inline"}]},{"text":" against ","type":"text"},{"text":"https://example.com","type":"text","marks":[{"type":"code_inline"}]},{"text":" and check ","type":"text"},{"text":"success: true","type":"text","marks":[{"type":"code_inline"}]},{"text":":","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"curl -s --max-time 15 -X POST \\\n \"http://littleblack.tail0f299b.ts.net:3002/v1/scrape\" \\\n -H 'Content-Type: application/json' \\\n -d '{\"url\":\"https://example.com\",\"formats\":[\"markdown\"]}' \\\n | python3 -c \"import sys, json; d=json.load(sys.stdin); print('OK' if d.get('success') else 'FAIL')\"","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Section 2 — Academic Paper Routing","type":"text"}]},{"type":"paragraph","content":[{"text":"Route paper retrieval to the most effective method based on source. 
Full decision tree in ","type":"text"},{"text":"academic-paper-routing.md","type":"text","marks":[{"type":"link","attrs":{"href":"./references/academic-paper-routing.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Quick Reference","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Source","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Best Method","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Fallback","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"arxiv.org","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Direct HTML (","type":"text"},{"text":"/html/ID","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Firecrawl ","type":"text"},{"text":"/v1/scrape","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Semantic Scholar","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"API 
(","type":"text"},{"text":"api.semanticscholar.org","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Firecrawl search by title","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"ACL Anthology","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Firecrawl ","type":"text"},{"text":"/v1/scrape","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Direct PDF download","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"NeurIPS/ICML/ICLR","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Firecrawl ","type":"text"},{"text":"/v1/scrape","type":"text","marks":[{"type":"code_inline"}]},{"text":" with ","type":"text"},{"text":"waitFor","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Search by title","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"IEEE Xplore","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Firecrawl with ","type":"text"},{"text":"waitFor: 
3000","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Author's website","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"ACM DL","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Firecrawl with ","type":"text"},{"text":"waitFor: 3000","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Author's website","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Author blogs","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Jina Reader (","type":"text"},{"text":"r.jina.ai","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Firecrawl ","type":"text"},{"text":"/v1/scrape","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Google Scholar","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Firecrawl 
","type":"text"},{"text":"/v1/search","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Direct search query","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"DOI Resolution","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"// DOI → publisher URL → route to appropriate scraper\nconst res = await fetch(`https://doi.org/${doi}`, { redirect: \"follow\" });\nconst publisherUrl = res.url; // e.g., https://dl.acm.org/doi/10.1145/...\n// Then route publisherUrl through the decision tree above","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Section 3 — Recursive Research Protocol","type":"text"}]},{"type":"paragraph","content":[{"text":"The iterative search → extract → recurse → synthesize pattern. Full step-by-step protocol in ","type":"text"},{"text":"recursive-research-protocol.md","type":"text","marks":[{"type":"link","attrs":{"href":"./references/recursive-research-protocol.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Algorithm Overview","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"deepResearch(topic, breadth=4, depth=2, concurrency=2):\n 1. Generate N search queries (N = breadth) from topic + prior learnings\n 2. For each query (via p-limit concurrency):\n 1. Firecrawl /v1/search → get results\n 2. PERSIST each raw result to docs/research/corpus/\n 3. Extract learnings + follow-up questions\n 3. For each follow-up question:\n → Recurse with breadth=ceil(breadth/2), depth=depth-1\n 4. Base case: depth=0 → return accumulated learnings\n 5. Synthesize final report from all learnings\n 6. 
Write session report to docs/research/sessions/","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Default Parameters (from working implementation)","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Parameter","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Default","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Max","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Rationale","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"breadth","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"4","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"—","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Number of parallel search queries per 
level","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"depth","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"2","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"5","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Recursion levels (depth > 5 yields diminishing returns)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"concurrency","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"2","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"—","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Parallel Firecrawl requests (self-hosted, be 
gentle)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"limit","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"5","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"—","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Results per search query","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"timeout","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"15000ms","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"—","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Per-search timeout","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Token Budget","type":"text"}]},{"type":"paragraph","content":[{"text":"Each search returns up to 5 pages. 
Trim each page to ~25,000 tokens before LLM processing:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"function trimToTokenLimit(text: string, maxTokens: number): string {\n if (!text) return \"\";\n // Rough estimate: about 3.5 characters per token.\n const estimatedTokens = Math.ceil(text.length / 3.5);\n if (estimatedTokens \u003c= maxTokens) return text;\n // Over budget: cut to 80% of the estimated character budget (0.8 leaves headroom for estimation error).\n const maxChars = Math.floor(maxTokens * 3.5 * 0.8);\n return text.slice(0, maxChars);\n}","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Partial Failure Principle","type":"text"}]},{"type":"paragraph","content":[{"text":"Partial results are better than total failure.","type":"text","marks":[{"type":"strong"}]},{"text":" If a query fails, log it and continue with the remaining queries. Never abort the entire research session because one query timed out.","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Section 4 — Raw Corpus Persistence","type":"text"}]},{"type":"paragraph","content":[{"text":"Critical principle","type":"text","marks":[{"type":"strong"}]},{"text":": Every Firecrawl-scraped page must be persisted in its ","type":"text"},{"text":"original raw markdown","type":"text","marks":[{"type":"strong"}]},{"text":" with provenance metadata. 
Synthesized reports reference these originals but never replace them.","type":"text"}]},{"type":"paragraph","content":[{"text":"Full format specification in ","type":"text"},{"text":"corpus-persistence-format.md","type":"text","marks":[{"type":"link","attrs":{"href":"./references/corpus-persistence-format.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Directory Layout","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"{project-root}/\n├── docs/research/\n│ ├── corpus/ # Raw scraped pages (committed)\n│ │ └── YYYY-MM-DD-{slug}.md # One file per scraped URL\n│ ├── sessions/ # Research session reports (committed)\n│ │ └── YYYY-MM-DD-{topic-slug}.md # Synthesized report with corpus refs\n│ └── corpus-index.jsonl # Append-only registry (committed)","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Corpus File Frontmatter","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"yaml"},"content":[{"text":"---\nsource_url: https://arxiv.org/html/2401.12345\nscraped_at: \"2026-02-25T14:30:00Z\"\nscraper: firecrawl\nfirecrawl_endpoint: /v1/search\nsearch_query: \"mixture of experts scaling\"\nresult_index: 2\nresearch_session: \"2026-02-25-moe-scaling\"\ndepth_level: 1\nclaude_code_uuid: SESSION_UUID\ncontent_tokens_approx: 4200\n---\n[RAW MARKDOWN FROM FIRECRAWL — NEVER MODIFIED]","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Key Rules","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Content below ","type":"text"},{"text":"---","type":"text","marks":[{"type":"code_inline"}]},{"text":" is the ","type":"text"},{"text":"exact markdown Firecrawl returned","type":"text","marks":[{"type":"strong"}]},{"text":" — no summarization, trimming, or 
reformatting","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"One file per URL per scrape — if the same URL is scraped in multiple sessions, each gets its own timestamped file","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"File naming: ","type":"text"},{"text":"YYYY-MM-DD-{slug}.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" where slug is kebab-case from page title or URL path (max 60 chars)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Session reports in ","type":"text"},{"text":"docs/research/sessions/","type":"text","marks":[{"type":"code_inline"}]},{"text":" reference corpus files by relative path","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Corpus Index (JSONL)","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"json"},"content":[{"text":"{\n \"url\": \"https://arxiv.org/html/2401.12345\",\n \"file\": \"corpus/2026-02-25-moe-scaling-arxiv-2401-12345.md\",\n \"scraped_at\": \"2026-02-25T14:30:00Z\",\n \"session\": \"2026-02-25-moe-scaling\",\n \"tokens\": 4200,\n \"scraper\": \"firecrawl\"\n}","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Why This Matters","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"LLM re-analysis","type":"text","marks":[{"type":"strong"}]},{"text":": Future sessions can re-read raw corpus files and extract different insights with better prompts or newer models","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"No information loss","type":"text","marks":[{"type":"strong"}]},{"text":": Synthesis drops details; raw files preserve everything Firecrawl captured","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Deduplication 
awareness","type":"text","marks":[{"type":"strong"}]},{"text":": The JSONL index lets agents skip URLs already in the corpus","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Git-friendly","type":"text","marks":[{"type":"strong"}]},{"text":": Markdown files diff cleanly, JSONL is append-only","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Section 5 — Self-Hosted Operations","type":"text"}]},{"type":"paragraph","content":[{"text":"The Firecrawl instance runs on ","type":"text"},{"text":"littleblack","type":"text","marks":[{"type":"strong"}]},{"text":" (Debian 12, RTX 2080 Ti, hostname ","type":"text"},{"text":"kab","type":"text","marks":[{"type":"code_inline"}]},{"text":"). System uptime is in the 100+ day range; Firecrawl is stable on this host. No API key needed. For the full access matrix (Tailscale FQDN / IP / MagicDNS, same-LAN, legacy ZeroTier), see Section 1 \"Instance\". 
Section 5 examples use the ","type":"text"},{"text":"Tailscale FQDN","type":"text","marks":[{"type":"strong"}]},{"text":" (","type":"text"},{"text":"littleblack.tail0f299b.ts.net","type":"text","marks":[{"type":"code_inline"}]},{"text":") since it works on every tailnet-attached client regardless of resolver state — substitute any path from the Section 1 table when appropriate.","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Port","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Service","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Type","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Purpose","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"3002","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Firecrawl API","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Docker","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Core scraping engine (direct 
API)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"3003","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Scraper Wrapper","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Bun","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"JS-rendered SPAs, saves to file, returns Caddy URL","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"3004","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Cloudflare Bypass","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Bun","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"curl-impersonate for Cloudflare-protected 
sites","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"8080","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Caddy","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Binary","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Serves saved markdown from firecrawl-output/","type":"text"}]}]}]}]},{"type":"paragraph","content":[{"text":"When to use which port:","type":"text","marks":[{"type":"strong"}]}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Site Type","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Port","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Why","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"arXiv / standard pages","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"3003","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Playwright JS rendering, preserves image 
URLs","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Claude artifacts","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"3004","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Cloudflare blocks Playwright","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Gemini/ChatGPT shares","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"3003","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Needs JS rendering (SPA)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Other Cloudflare sites","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"3004","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"If 3003 gets a Cloudflare challenge","type":"text"}]}]}]}]},{"type":"paragraph","content":[{"text":"Two-step pattern","type":"text","marks":[{"type":"strong"}]},{"text":" — ports 3003 and 3004 do not return markdown directly. They scrape, save to Caddy-served storage, and return a JSON pointer. You then fetch the markdown from Caddy (port 8080) using the returned file name. 
(Discovered 2026-05-27 — earlier snippets that ran a single ","type":"text"},{"text":"curl :3003/scrape?...","type":"text","marks":[{"type":"code_inline"}]},{"text":" and treated the response body as the scraped content were silently wrong: that body is ","type":"text"},{"text":"{\"url\":\"...\",\"file\":\"...\"}","type":"text","marks":[{"type":"code_inline"}]},{"text":", not markdown.)","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"BASE=\"http://littleblack.tail0f299b.ts.net\" # FQDN — works without MagicDNS\nURL=\"https://chatgpt.com/share/\u003cid>\" # or any JS-rendered page\nNAME=\"chatgpt-metric-stack-2026-05-27\" # slug — NO whitespace or special chars\n\n# URL-encode the target (avoid Python's trailing newline — use end='')\nENC=$(python3 -c \"import urllib.parse,sys; print(urllib.parse.quote(sys.argv[1], safe=''), end='')\" \"$URL\")\n\n# Step 1 — request the scrape (plain curl issues an HTTP GET), get JSON pointer\nSCRAPE_JSON=$(curl -s --max-time 90 \"${BASE}:3003/scrape?url=${ENC}&name=${NAME}\")\necho \"$SCRAPE_JSON\"\n# → {\"url\":\"http://172.25.236.1:8080/\u003cNAME>-\u003ctimestamp>.md\",\"file\":\"\u003cNAME>-\u003ctimestamp>.md\"}\n\n# Step 2 — take the file name from the JSON and rebuild the Caddy URL on the FQDN\n# (the returned url field carries the legacy ZeroTier IP), then fetch the actual markdown\nFILE=$(echo \"$SCRAPE_JSON\" | python3 -c \"import sys,json; print(json.load(sys.stdin)['file'])\")\ncurl -s --max-time 30 \"${BASE}:8080/${FILE}\" -o \"/tmp/${FILE}\"\nwc -c \"/tmp/${FILE}\" # sanity-check that content actually arrived","type":"text"}]},{"type":"blockquote","content":[{"type":"paragraph","content":[{"text":"The JSON response embeds the legacy ZeroTier IP","type":"text","marks":[{"type":"strong"}]},{"text":" (","type":"text"},{"text":"http://172.25.236.1:8080/...","type":"text","marks":[{"type":"code_inline"}]},{"text":") — do NOT follow that URL directly if ZeroTier isn't reachable from your client. 
Always reconstruct the Caddy URL using your preferred host base (","type":"text"},{"text":"${BASE}:8080/${FILE}","type":"text","marks":[{"type":"code_inline"}]},{"text":"), as shown above.","type":"text"}]}]},{"type":"paragraph","content":[{"text":"Shell-quoting trap","type":"text","marks":[{"type":"strong"}]},{"text":" (","type":"text"},{"text":"zsh","type":"text","marks":[{"type":"code_inline"}]},{"text":"/","type":"text"},{"text":"bash","type":"text","marks":[{"type":"code_inline"}]},{"text":"): the ","type":"text"},{"text":"&","type":"text","marks":[{"type":"code_inline"}]},{"text":" in ","type":"text"},{"text":"?url=X&name=Y","type":"text","marks":[{"type":"code_inline"}]},{"text":" is fine inside double quotes. The real hazard is a stray newline in the value being encoded: ","type":"text"},{"text":"$(...)","type":"text","marks":[{"type":"code_inline"}]},{"text":" command substitution strips trailing newlines from the encoder's output, but a newline in its input (for example a URL piped in with ","type":"text"},{"text":"echo","type":"text","marks":[{"type":"code_inline"}]},{"text":" and read from stdin) is encoded as ","type":"text"},{"text":"%0A","type":"text","marks":[{"type":"code_inline"}]},{"text":", and the server rejects the malformed target silently. 
Pass the URL as an argument as in the snippet above, or strip the input with ","type":"text"},{"text":"tr -d '\\n'","type":"text","marks":[{"type":"code_inline"}]},{"text":"; keeping ","type":"text"},{"text":"end=''","type":"text","marks":[{"type":"code_inline"}]},{"text":" in the encoder is a harmless extra guard when its output is piped rather than captured.","type":"text"}]},{"type":"paragraph","content":[{"text":"Cloudflare-bypass wrapper","type":"text","marks":[{"type":"strong"}]},{"text":" (port 3004) follows the same scrape → Caddy two-step:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"curl -s --max-time 90 \"${BASE}:3004/scrape-cf?url=${ENC}&name=${NAME}\"\n# → same JSON shape; same Caddy GET to retrieve the markdown","type":"text"}]},{"type":"paragraph","content":[{"text":"Health probes","type":"text","marks":[{"type":"strong"}]},{"text":" — none of these services expose a ","type":"text"},{"text":"/v1/health","type":"text","marks":[{"type":"code_inline"}]},{"text":" or ","type":"text"},{"text":"/health","type":"text","marks":[{"type":"code_inline"}]},{"text":" endpoint. 
Probe the root and inspect the response body for the service's identity string:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"BASE=\"http://littleblack.tail0f299b.ts.net\"\n\n# Port 3002 — Firecrawl API\n# Healthy: HTTP 200, body contains '\"message\":\"Firecrawl API\"'\ncurl -s --max-time 4 \"${BASE}:3002/\" | grep -q '\"Firecrawl API\"' && echo \"3002 OK\" || echo \"3002 DOWN\"\n\n# Port 3003 — Scraper wrapper\n# Healthy: HTTP 400, body contains 'Usage: /scrape?url=' (service up, rejects missing params)\ncurl -s --max-time 4 \"${BASE}:3003/\" | grep -q 'Usage: /scrape' && echo \"3003 OK\" || echo \"3003 DOWN\"\n\n# Port 3004 — Cloudflare bypass wrapper\n# Healthy: HTTP 200, body contains '\"service\":\"cloudflare-bypass-scraper\"'\ncurl -s --max-time 4 \"${BASE}:3004/\" | grep -q 'cloudflare-bypass-scraper' && echo \"3004 OK\" || echo \"3004 DOWN\"\n\n# Port 8080 — Caddy\n# Healthy: HTTP 200 (directory listing)\ncurl -s --max-time 4 -o /dev/null -w '%{http_code}\\n' \"${BASE}:8080/\" | grep -q '^200
