Feishu Doc Scraper Extract a Feishu/Lark source into faithful local Markdown. Prefer the lark-cli API — it extracts the body programmatically (no model paraphrasing), follows a collection's reference graph, and reads permission boundaries from error codes instead of guessing. Treat the rendered browser page as a fallback , not the source of truth: in real collection-scraping work the API path consistently does the whole job while the browser path is never needed. Scope (read this first) This skill's contract is faithful per-source Markdown + a record of what was extracted . It does not decide…

\\xef\\xbf\\xbd' .` must be empty. A replacement character means an encoding step corrupted the text.\n\n## Acceptance contract\n\nStop only when all that apply are true:\n\n- Every fetched body reached disk via `jq`/script, not retyped by the model.\n- Collections: the residual rich-media-tag grep (Path A step 5) is empty — every `mention-doc`/`sheet`/cross-tenant reference was followed to a leaf.\n- `LC_ALL=C grep -rl

\\xef\\xbf\\xbd' .` is empty.\n- docx path: rendered to an image and visually compared to the source; heading hierarchy and highlights match (see docx reference's checklist).\n- Browser fallback only: TOC coverage + scale check (see browser-failure-rules.md).\n- Each output file's frontmatter records `source` (the original URL/token) and, if any post-processing was applied, a `post_process` provenance line.\n- Permission gaps (131006 docs not exported yet, undownloadable images) are explicitly listed for the user — a transparent gap beats a silent omission.\n\n## Do NOT attempt\n\nVerified dead-ends — retrying them only wastes the session. Full table with failure modes and root causes: **[references/permission-and-failure-boundaries.md](references/permission-and-failure-boundaries.md)**. The top ones:\n\n- Bypassing `131006` permission-denied by any means (lark-cli / curl / anonymous browser) — it is a server-side boundary.\n- Downloading docx embedded images via `docs +media-download`, `api …/drive/v1/medias/\u003ct>/download` (with or without `extra`), or `schema drive.medias.download` — none work; lark-cli even mis-reports the real HTTP 400 as \"empty JSON\".\n- `WebFetch` against `open.feishu.cn/document/server-docs/...` for API specs — backend is flaky; use `open.feishu.cn/llms-docs/zh-CN/llms-\u003cmodule>.txt` instead (LLM-friendly, stable).\n- AppleScript/JXA `executeJavaScript`, Chrome CDP on port 9222 — disabled/empty in this environment (browser path only).\n- Using `minimax-docx` to convert docx→md — it is a docx *authoring* tool; use the doc-to-markdown skill instead.\n\n## Bundled resources\n\n- `scripts/feishu_extract_refs.py` — deterministic reference-token extractor; the recursion engine's core. 
Run it on every fetched body to enumerate `\u003cmention-doc>`/`\u003csheet>`/`\u003cimage>`/cross-tenant/Minutes/Tencent-Meeting references as JSON.\n- `scripts/restore_docx_headings.py` — for Path B: reads true font sizes via python-docx, maps them to heading levels, restores `w:shd` highlights to Obsidian `==…==`, without retyping body text.\n- `scripts/feishu_dom_capture.js` — Path D: injectable end-to-end browser DOM capture.\n- `scripts/download_feishu_images.py` — Path D: SSR image extraction when browser automation is unavailable.\n- `scripts/build_feishu_markdown.py` — Path D: render a capture manifest into Markdown.\n- `scripts/check_heading_coverage.py` — coverage verification (both paths).\n- `references/lark-cli-api-extraction.md` — Path A full reference (commands, recursion, sheets, cross-tenant).\n- `references/feishu-minutes-transcript.md` — Path C native transcript API + scope auth.\n- `references/permission-and-failure-boundaries.md` — error codes + the full Do-NOT-attempt table.\n- `references/docx-export-to-markdown.md` — Path B faithful conversion procedure.\n- `references/browser-dom-fallback.md` + `references/browser-failure-rules.md` — Path D.\n- `references/capture-manifest.md` — manifest shape for `build_feishu_markdown.py`.\n\n## Next step\n\nAfter extraction completes, the clean Markdown typically feeds the user's own knowledge-base ingestion (filing, indexing, dedup) — which is deliberately out of this skill's scope. If the source went through Path B (a docx), the doc-to-markdown skill is already part of that flow. 
Offer the handoff; do not auto-organize:\n\n```\nExtraction complete: [N] sources → faithful Markdown ([M] permission/image gaps listed).\n\nOptions:\nA) Hand off to your PKM/organizing workflow — file & index these (Recommended if part of a vault)\nB) Run /daymade-docs:docs-cleaner — consolidate redundant content across the extracted files\nC) Stop here — the faithful Markdown is the deliverable\n```\n---","attachment_filenames":["references/browser-dom-fallback.md","references/browser-failure-rules.md","references/capture-manifest.md","references/docx-export-to-markdown.md","references/feishu-minutes-transcript.md","references/lark-cli-api-extraction.md","references/permission-and-failure-boundaries.md","scripts/build_feishu_markdown.py","scripts/check_heading_coverage.py","scripts/download_feishu_images.py","scripts/feishu_extract_refs.py","scripts/restore_docx_headings.py"],"attachments":[{"filename":"references/browser-dom-fallback.md","content":"# Browser DOM Fallback (Path D — last resort)\n\nUse this **only** when lark-cli genuinely cannot reach the content: lark-cli cannot be installed/authenticated, *and* the doc is not permission-walled (a permission wall → Path B, not this). On real collection work this path was never needed — the API path did the whole job. It is slower, depends on a connected browser surface, and an anonymous debugging Chrome cannot read login-walled content. Keep it as the safety net, not the plan.\n\n## Contents\n\n- Tool surface selection\n- Step 1: probe (detect virtual scroll)\n- Step 2: TOC-driven capture (the injectable script)\n- Step 3: images\n- Step 4: normalize, order, dedup\n- Step 5: acceptance signal\n- The 19 battle-tested DOM rules\n\n## Tool surface selection\n\nPrefer data-bearing surfaces over purely visual ones. Order:\n\n1. 
**Chrome DevTools MCP** — structured DOM/accessibility snapshots, scripted `evaluate`, programmatic TOC clicking + per-section capture on virtual-scroll pages, and the virtual-scroll diagnostic. Best default when it can attach to the authenticated tab.\n2. **Browser Use** — direct page-text access, lower friction for repeated section capture; may not preserve every table and is still subject to virtual-scroll partial rendering.\n3. **Computer Use** — when DOM-native tooling cannot attach and the task depends on the real authenticated browser (extensions, corporate login). Slower, UI-drift-sensitive, verify after every interaction.\n4. **Screenshots + manual extraction** — only when none of the above reach the content.\n\nRejected as a primary capture path: Web Clipper on virtual-scroll pages; clipboard copy after a copy-restriction warning; one-shot \"read the whole page\" without TOC coverage checking. The in-browser extension surface frequently fails to connect at all — do not assume it is available.\n\n## Step 1: probe (detect virtual scroll)\n\nCapture ground truth before extracting: document title, source URL, authenticated+readable, visible word count (if shown), sidebar TOC, copy-restriction banners, virtual scroll.\n\nVirtual-scroll diagnostic (the decisive check): compare TOC item count vs rendered heading count, look for loading containers, total `.block` count, and identify the **real scroll container** (Feishu scrolls a nested div — `.bear-web-x-container` / `.page-main` / `[class*=\"docx-width\"]` — not `window`). If `tocItems >> renderedHeadings`, or loading blocks exist, or `totalBlocks \u003c 10` on a long doc → virtual scroll is on; one-shot extraction will silently miss sections. The full diagnostic JS is embedded in `scripts/feishu_dom_capture.js`.\n\n## Step 2: TOC-driven capture (the injectable script)\n\nDo not re-implement capture logic. 
Inject `scripts/feishu_dom_capture.js` and run its pipeline:\n\n```javascript\n// inject the file content via evaluate_script, then:\nconst result = await window.__feishuCapture.run({\n title: 'Document Title',\n docName: 'short-name-for-image-files',\n tags: ['feishu']\n});\n// → { totalCaptured, afterClean, sections, images, imagesOk }\n// window.__feishuCapture.manifest → feed scripts/build_feishu_markdown.py\n// window.__feishuCapture.cleanedBlocks → custom rendering\n```\n\nIt handles, in one pass: TOC-driven section capture (click TOC item → wait ~2.5s → capture all `.block`s between this heading and the next), nested-bullet recursion, table extraction (skipping blocks *inside* tables so cells don't leak as duplicate text; merging tables split across virtual-scroll boundaries by header row), code-block UI-noise stripping, inline-markdown conversion, image download via `fetch(credentials:'include')`, noise/aggregation-artifact removal, deduplication, and `data-block-id` numeric sort.\n\nIf there is no TOC: build a manual heading list top-to-bottom, scroll the **real scroll container** in stable increments, snapshot after each, stop when the bottom no longer changes.\n\n## Step 3: images\n\nFeishu image `src` points at authenticated internal streams (`internal-api-drive-stream.larkoffice.com` / `internal-api-drive-stream.feishu.cn`) — they 404/403 once the session ends, so they must be downloaded **during** capture (the injectable script does this). When browser automation cannot attach at all, use the SSR fallback:\n\n```bash\npython3 scripts/download_feishu_images.py --url \"\u003cfeishu-url>\" --doc-name \"\u003cdoc>\" --output-dir assets/\n```\n\nIt regex-extracts the authenticated image URLs straight from the SSR HTML (via `browser_cookie3` + `requests`) and downloads them with session cookies. Name images per-document (`assets/{doc-name}-{index}.ext`) — never generic `img-0.png` shared across docs. 
`[图片: Feishu Docs - Image]` in copy-pasted Markdown is a *real* lost-image placeholder, not noise — recover the image, do not delete the marker.\n\n## Step 4: normalize, order, dedup\n\nRender the manifest with `scripts/build_feishu_markdown.py` (shape: capture-manifest.md). Sort blocks by numeric `data-block-id` (document logical order; DOM order is unreliable under virtual scroll). Deduplicate after sorting, before rendering (virtual scroll re-renders blocks with new ids; table-cell and orphaned-nested-bullet leaks must be removed). Frontmatter minimal: `title`, `source`, `author`, `created`, `description`, `tags`. Trust the DOM class — only `docx-heading1/2/3-block` become `#/##/###`; bold-styled body text stays body text.\n\n## Step 5: acceptance signal\n\nAccept only when all hold:\n\n- final Markdown covers the expected TOC headings (run `scripts/check_heading_coverage.py`)\n- body roughly matches the visible word-count scale (when Feishu shows one)\n- >95% of sections have non-empty body (empty headings = missed virtual-scroll content)\n- tables named in the TOC (\"总览\"/\"overview\"/\"schedule\") are present as Markdown tables\n- no `docx-block-loading-container` remains unvisited\n- `LC_ALL=C grep -rl

\\xef\\xbf\\xbd' .` is empty\n\n## The 19 battle-tested DOM rules\n\nThe detailed, verified behaviors behind the above (copy walls, virtual scroll, zoom\u003c1 table placeholders, table-cell leakage, `data-block-id` ordering, nested bullets, authenticated image streams, aggregation artifacts, callout drift, code-block noise, clipboard bridge, SSR image extraction, per-doc image naming, the lost-image placeholder): **[browser-failure-rules.md](browser-failure-rules.md)**. Read it whenever the page behaves strangely.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":6369,"content_sha256":"bfe2d5dac83a40eb686558df20e98cd4d445ed6755b8ad32e808ac726e2cd552"},{"filename":"references/browser-failure-rules.md","content":"# History-Derived Rules\n\nThese rules were distilled from repeated local Feishu scraping sessions and follow verified behavior rather than guesswork.\n\n## Rule 1: Copy Warnings Mean Clipboard Is Dead\n\nIf Feishu shows a banner saying copying is restricted, treat clipboard extraction as blocked. Do not keep retrying `Cmd+C`, browser copy commands, or \"copy all\" variants as the main plan.\n\n## Rule 2: Virtual Scroll Breaks One-Shot Extraction\n\nFeishu wiki and doc pages often virtual-render only the visible region plus a small buffer. Any extractor that reads \"the page\" once can silently miss later sections.\n\nImplication:\n\n- never trust a single pass\n- **always use the real scroll container**, not `window.scrollTo`. Feishu scrolls a nested div (usually `.bear-web-x-container`, `.page-main`, or `[class*=\"docx-width\"]`). Scrolling `window` does nothing.\n- **click TOC items to trigger section rendering**, not just scroll. 
Feishu responds to TOC clicks by fetching and rendering the target section's blocks.\n- after each TOC click, wait 2.5s for rendering, then capture all `.block` elements between the target heading and the next heading\n- some sections span multiple virtual \"pages\" — scroll the content container in increments after clicking, capturing new blocks each time\n- deduplicate blocks by `data-block-id` to avoid double-counting overlap\n\n## Rule 3: Web Clipper Can Look Correct While Still Being Incomplete\n\nExtension output can capture only the rendered subset and still produce plausible Markdown or HTML. Plausibility is not acceptance.\n\nImplication:\n\n- treat Web Clipper as non-authoritative on virtual-scroll pages\n- if TOC headings or word count do not line up, discard it as the main source\n\n## Rule 4: TOC Coverage Is the Best Section-Level Contract\n\nThe left sidebar TOC is the most reliable list of meaningful document sections. Use it as the checklist for coverage validation.\n\n## Rule 5: Remove UI Noise Aggressively\n\nCommon Feishu noise to delete:\n\n- comments\n- \"you may also ask\"\n- support footer items\n- upload logs\n- \"contact support\"\n- recommendation panels\n- empty interaction controls\n\n## Rule 6: Validate Against Scale, Not Exact Word Count\n\nWhen Feishu shows a visible word count, use it as a scale check. A final Markdown body that is dramatically shorter than the page count is probably incomplete even if the saved file looks tidy.\n\n## Rule 7: Trust the DOM Class, Do Not Promote Text Blocks to Headings\n\nIf the sidebar TOC does not list a sub-section, it is not a heading. Feishu sometimes styles body text as bold to make it *look* like a heading, but the DOM class remains `docx-text-block` or `docx-quote-block`. Respect the DOM class: only `docx-heading1/2/3-block` become `#/##/###`. 
Bold body text stays as body text with inline `**` formatting.\n\n## Rule 8: Zoom \u003c 1 Causes Table Placeholders\n\nDo not zoom out to force more content into the viewport. At zoom levels below 1.0, Feishu renders `bear-virtual-renderUnit-placeholder` inside table cells, producing empty or corrupted rows. Keep zoom at 1.0 and rely on TOC-driven section extraction instead.\n\n## Rule 9: Skip Blocks Inside Tables\n\nWhen querying `.block`, table cell blocks (`docx-table_cell-block`, `.table-cell-block`) also match. If not excluded, they appear as duplicate standalone text blocks in the output, polluting the markdown with table cell values outside the table. Exclude any block whose closest `.docx-table-block` ancestor is not itself.\n\n## Rule 10: Use data-block-id Numeric Order for Document Sequence\n\nVirtual scroll unloads and re-renders blocks, which can reorder the DOM. `compareDocumentPosition` and DOM order are unreliable. Feishu assigns numeric `data-block-id` values in document logical order (lower = earlier). Sort captured blocks by numeric `data-block-id` before generating markdown.\n\n## Rule 11: Nested Bullets Have Parent-Child DOM Structure\n\nFeishu nested lists use a parent `.docx-bullet-block` containing `.list-children` with child `.docx-bullet-block` elements. Extract parent text from `.list-content` or `.ace-line`, then recursively extract direct child bullets. Skip child bullets in the main capture loop (they're handled by their parent).\n\n## Rule 12: Image URLs Are Authenticated Internal-API Streams\n\nFeishu image `src` attributes point to `internal-api-drive-stream.larkoffice.com` (or `internal-api-drive-stream.feishu.cn` for domestic). These URLs require the user's session cookie; they are not public CDN links. 
After the browser session ends or cookies expire, the images 404/403.\n\nImplication:\n\n- during capture, download every image via `fetch(src, { credentials: 'include' })` while the session is alive\n- convert each response blob to a data URL for transport, then decode to local files\n- replace remote URLs in the markdown with local relative paths (`assets/{doc-name}-{index}.ext`)\n- `blob:` URLs (Feishu's in-memory object URLs) cannot be fetched at all — skip them\n\n## Rule 13: Page-Main Container Produces Aggregation Artifacts\n\nThe first few `.block` elements (typically `data-block-id` 1–4) on a Feishu page are the outer page container whose `innerText` concatenates the entire visible content into a single giant string. These are not real content blocks.\n\nImplication:\n\n- drop any `type: text` block with payload length > 350 characters — it is almost certainly an aggregation artifact\n- real paragraphs in Feishu rarely exceed 300 characters per block\n\n## Rule 14: Callout/Quote Blocks Have Non-Sequential data-block-id\n\nFeishu callout boxes, quote blocks, and sticky notes receive `data-block-id` values that are much higher than their visual position in the document. When sorting by `data-block-id`, these blocks drift to the document's tail.\n\nImplication:\n\n- after sorting, callout content may appear after the last real section\n- either mark the tail as \"appendix: callout blocks\" or attempt to re-parent them under the correct heading using text matching\n- do not assume `data-block-id` order is perfect for all block types\n\n## Rule 15: Feishu Code Blocks Contain UI Noise Lines\n\nFeishu renders code blocks with visible UI labels: a language label line (e.g., \"Bash\"), a \"Copy\" button text, and sometimes \"Code block\" / \"代码块\" as the first line of `innerText`. 
These are not part of the code.\n\nImplication:\n\n- strip lines that exactly match: `Copy`, `Code block`, `代码块`, or a bare language name\n- extract the language from these stripped lines if no `lang` attribute is present\n\n## Rule 16: Clipboard Bridge Is the Most Reliable Transport\n\nTransporting large text (>10KB) from chrome-devtools evaluate_script to the local filesystem is unreliable via base64 heredoc (truncation), HTTP localhost (Chrome security blocks), or chunked JSON (slow). The most reliable path is:\n\n1. `navigator.clipboard.writeText(content)` in the browser\n2. `pbpaste > file.md` in the local shell (macOS)\n\nThis works for text content up to ~1MB. For binary (images), use `writeText(base64)` + `pbpaste | base64 -d > file.png`.\n\n## Rule 17: SSR HTML Contains All Image URLs — No Browser Automation Required\n\nFeishu wiki/doc pages render image URLs directly in the initial HTML response at `internal-api-drive-stream.larkoffice.com` / `internal-api-drive-stream.feishu.cn`. 
These can be extracted via regex without any browser automation, scrolling, or JavaScript execution.\n\n**Working fallback when browser automation fails:**\n\nUse the bundled script `scripts/download_feishu_images.py`:\n\n```bash\npython3 scripts/download_feishu_images.py \\\n --url \"https://my.feishu.cn/wiki/...\" \\\n --doc-name \"my-document\" \\\n --output-dir \"assets/\"\n```\n\nOr implement manually:\n\n```python\nimport browser_cookie3, requests, re\n\ncj = browser_cookie3.chrome()\nheaders = {\n 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',\n 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',\n}\nresp = requests.get(url, cookies=cj, headers=headers, timeout=30)\nimage_urls = re.findall(\n r'https?://internal-api-drive-stream[^\\s\"\\'\u003c>]+',\n resp.text\n)\n# Download each with session cookies\nfor i, img_url in enumerate(image_urls):\n img_resp = requests.get(\n img_url, cookies=cj,\n headers={'Referer': 'https://my.feishu.cn/'},\n timeout=30\n )\n```\n\n**When to use this path:**\n- AppleScript / JXA execution is disabled in Chrome\n- Chrome DevTools CDP returns 404/empty\n- Browser automation tools cannot attach to the page\n- Batch-processing many documents (faster than per-page browser automation)\n\n**Limitation:** This extracts image URLs only. For full document text + structure, browser-based DOM extraction is still required.\n\n## Rule 18: Images Must Be Named Per-Document\n\nNever use generic names like `img-0.png`, `img-1.png` across multiple documents. 
When multiple documents share an `assets/` directory, generic names collide and overwrite each other.\n\n**Correct naming:** `{sanitized_doc_name}-{index}.{ext}`\n\nExample: `million-dollar-creative-0.png`, `million-dollar-creative-1.png`\n\n## Rule 19: `[图片: Feishu Docs - Image]` Is a Real Image Placeholder\n\nWhen a Feishu document is copy-pasted into markdown and the image cannot be resolved, Feishu produces the non-standard placeholder `[图片: Feishu Docs - Image]`. **This is not invalid markdown — it indicates a real image existed in the original document but was lost during copy-paste.**\n\nImplication:\n- Do not delete these placeholders as \"noise\"\n- They are a signal that the document contains images that need recovery\n- Use Rule 17 (SSR extraction) or browser-based image download to recover the actual images\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":9607,"content_sha256":"0b50d7cca3231b332aa0c6ff819ddef74611c7b4a73fa6985804c350e4be8afe"},{"filename":"references/capture-manifest.md","content":"# Capture Manifest\n\nUse `scripts/build_feishu_markdown.py` when extraction is easier to stage as structured data before rendering.\n\n## Minimal Shape\n\n```json\n{\n \"title\": \"Document title\",\n \"source\": \"https://example.feishu.cn/wiki/...\",\n \"author\": [\"Author A\", \"Author B\"],\n \"published\": \"\",\n \"created\": \"2026-05-07\",\n \"description\": \"Short summary\",\n \"tags\": [\"clippings\", \"feishu\"],\n \"sections\": [\n {\n \"heading_level\": 1,\n \"heading\": \"Main Heading\",\n \"body\": [\n \"Paragraph one.\",\n \"- Bullet item\",\n \"| Col A | Col B |\",\n \"| --- | --- |\",\n \"| A1 | B1 |\"\n ]\n }\n ]\n}\n```\n\n## Field Rules\n\n- `title`: required\n- `source`: strongly recommended\n- `author`: string or array of strings\n- `published`: optional\n- `created`: optional, defaults to today only if the caller sets it\n- `description`: optional\n- `tags`: optional, string or array\n- `sections`: 
required array\n- `heading_level`: optional, defaults to `2`\n- `body`: string or array of Markdown blocks\n\n## Rendering Command\n\n```bash\npython3 scripts/build_feishu_markdown.py \\\n --input /path/to/capture.json \\\n --output /path/to/output.md\n```\n\nIf `--output` is omitted, the renderer prints Markdown to stdout.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":1231,"content_sha256":"fdb7f714282abdea01aeeed447d0a7a9926e389e796732f34bfe937446a1f99d"},{"filename":"references/docx-export-to-markdown.md","content":"# Owner-Exported .docx → Faithful Markdown (Path B)\n\nWhen a Feishu doc returns `131006` (permission denied) and cannot be reached by API or browser, the only correct path is: the permission holder exports it as `.docx` and sends it back out-of-band; you then convert it **faithfully**. \"Faithfully\" is the hard part — a naive pandoc conversion silently destroys the heading hierarchy and all highlights. Verified procedure (2026-05).\n\n## Contents\n\n- The two silent-corruption failure modes\n- Step 1: convert with the right tool\n- Step 2: restore heading hierarchy (font-size → heading)\n- Step 3: restore highlights (`w:shd` → `==…==`)\n- Step 4: visual verification (mandatory)\n- Step 5: provenance\n\n## The two silent-corruption failure modes\n\nFeishu-exported docx does **not** use Word heading styles. It lays out headings with **font size + bold** on otherwise-normal paragraphs, and marks emphasis with **cell/run shading (`w:shd`)**, not `w:highlight`. Consequences:\n\n1. **pandoc → 0 Markdown headings.** Every \"heading\" becomes a flat `**bold**` paragraph. In the real case: 102 flat bold paragraphs, zero `#`. A text-only check (\"no errors, word count matches\") passes while the document's entire structure is gone.\n2. **All highlights vanish.** pandoc reads `w:highlight`; Feishu uses `w:shd@fill`. 
Standard highlight APIs return nothing, so the conversion looks complete but every emphasized passage is now indistinguishable from body text.\n\nNeither is catchable without rendering and *looking*. This is why Step 4 is mandatory.\n\n## Step 1: convert with the right tool\n\nUse the **doc-to-markdown** skill (pandoc + 8 post-processing fixes), **not** `minimax-docx` (that is a docx authoring tool — wrong direction). Get a first-pass `.md` plus extracted media. Confirm the real format first — an exported `.docx` is sometimes mislabeled:\n\n```bash\nfile -b \"\u003cexported>.docx\" # expect: Microsoft Word 2007+ / Microsoft OOXML\n```\n\nThe text in this first pass is correct; only its **structure** (headings) and **emphasis** (highlights) are lost. Steps 2–3 add those back **without retyping the body** — the pandoc text stays byte-for-byte; only `#` prefixes and `==…==` wrappers are added.\n\n## Step 2 & 3: restore headings and highlights\n\nUse the bundled script — it does both, deterministically, by reading the docx's own XML via python-docx:\n\n```bash\npython3 scripts/restore_docx_headings.py \\\n --docx \"\u003cexported>.docx\" \\\n --md \"\u003cfirst-pass>.md\" \\\n --out \"\u003cfinal>.md\"\n```\n\nWhat it does (and why, so you can patch it for an odd document):\n\n- **Heading restoration**: reads each paragraph's true font size (`run.font.size.pt`), builds the size→count distribution, maps the largest distinct sizes to `H1…Hn` in descending order, and prepends the matching `#`s to the corresponding lines in the Markdown. It does not invent or move text. 
A typical observed distribution and mapping:\n\n | pt | role |\n |---|---|\n | 26 | H1 |\n | 18 | H2 |\n | 16 | H3 |\n | 15 | H4 |\n | 14 | H5 |\n | 11 | body |\n\n The exact pt values differ per document — the script derives them from the distribution rather than hard-coding, but the *descending-size → descending-level* rule is the invariant.\n\n- **Highlight restoration**: reads `rPr/w:shd@fill` per run (lxml/python-docx XML access, since python-docx has no high-level API for shading). Runs whose `fill` is a highlight color get wrapped in Obsidian `==…==` at their position in the Markdown line. Observed fills: `ffe928` (yellow), `935af6` (purple). `==text==` combined with existing `**bold**` (`**==text==**`) is valid Obsidian and renders correctly.\n\nThe script keeps the body text identical to the pandoc output; if you must do this by hand, follow the same rule — derive sizes from `run.font.size.pt`, map descending, prefix `#`, never re-transcribe.\n\n## Step 4: visual verification (mandatory)\n\nText checks cannot detect a flattened hierarchy. 
Render and look:\n\n```bash\n# first-page thumbnail\nqlmanage -t -s 1600 -o /tmp/vv \"\u003cexported>.docx\"\n\n# full document → PDF (LibreOffice), then read the PDF / screenshots\nsoffice --headless --convert-to pdf --outdir /tmp/vv \"\u003cexported>.docx\"\n```\n\nRead the rendered image(s) and compare against `\u003cfinal>.md` rendered as Markdown:\n\n- Heading levels match the visual size hierarchy in the source.\n- Highlighted passages in the source are `==…==` in the output, in the same places.\n- No body paragraph was promoted/demoted; no text added or dropped.\n\nOnly after this visual pass does the file count as done (this mirrors the general \"generated docs must be visually verified, not just text-checked\" rule).\n\n## Step 5: provenance\n\nRecord what was reshaped, so a future reader knows the body is not a raw API passthrough:\n\n```yaml\npost_process: headings restored from docx font sizes (26/18/16/15/14pt → H1–H5) via python-docx; w:shd fills (ffe928/935af6, invisible to pandoc) restored as Obsidian ==highlight==; visually verified against the source render.\n```\n\nAlso surface to the user any embedded images the docx contains that could not be downloaded (see permission-and-failure-boundaries.md) — list the tokens; do not silently drop them.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":5176,"content_sha256":"6a7ddecb6dc84fca93f5e369a2110412bafb5ef38e9b007bdd34207f8ea578a0"},{"filename":"references/feishu-minutes-transcript.md","content":"# Feishu Minutes (妙记) Transcript (Path C)\n\nHow to export the **text transcript** of a Feishu Minutes recording. 
Verified end-to-end (2026-05).\n\n## Contents\n\n- The key fact: lark-cli cannot do it directly\n- The native endpoint\n- The scope and the `99991679` error\n- Granting the scope via device-flow (and the timeout trap)\n- Permission is per-minute, not per-tenant\n- Never re-ASR\n\n## The key fact: lark-cli cannot do it directly\n\n`lark-cli minutes` exposes `minutes get` (metadata), `+download` (audio/video), `search`, `upload`. **None export the transcript text.** `lark-cli minutes minutes get --params '{\"minute_token\":\"\u003ct>\"}'` succeeds but returns only title/duration/url — no transcript. The transcript is a native endpoint not wrapped by lark-cli; call it through `lark-cli api`.\n\n## The native endpoint\n\n```\nGET https://open.feishu.cn/open-apis/minutes/v1/minutes/:minute_token/transcript\n```\n\n| Param | In | Required | Notes |\n|---|---|---|---|\n| `minute_token` | path | yes | the last segment of the Minutes URL |\n| `need_speaker` | query | no | `true` → speaker labels |\n| `need_timestamp` | query | no | `true` → per-line timestamps |\n| `file_format` | query | no | `txt` or `srt`; `txt` is best for a Markdown KB |\n\nAuth: `user_access_token` (use `--as user`) or `tenant_access_token`.\n\n```bash\nexport LARK_CLI_NO_PROXY=1\nlark-cli api GET /open-apis/minutes/v1/minutes/\u003cminute_token>/transcript \\\n --params '{\"need_speaker\":true,\"need_timestamp\":true,\"file_format\":\"txt\"}' \\\n --as user -o \u003cspeaker-and-timestamped-transcript>.txt\n```\n\nA successful run yields the full transcript with speaker + millisecond timestamps; verify with the U+FFFD check (`LC_ALL=C grep -rl

\\xef\\xbf\\xbd' .` empty).\n\n> Spec lookups: use `https://open.feishu.cn/llms-docs/zh-CN/llms-minutes.txt` (stable, LLM-friendly). `WebFetch` against `open.feishu.cn/document/server-docs/...` is flaky. If lark-cli has no wrapper for something, the `lark-openapi-explorer` skill is the systematic way to mine the native spec.\n\n## The scope and the `99991679` error\n\nWithout the export scope the call returns:\n\n```json\n{\"ok\":false,\"error\":{\"type\":\"permission\",\"code\":99991679,\n \"message\":\"Permission denied [99991679]\",\n \"detail\":{\"permission_violations\":[\n {\"subject\":\"minutes:minute:download\",\"type\":\"action_privilege_required\"},\n {\"subject\":\"minutes:minutes.transcript:export\",\"type\":\"action_privilege_required\"}]}}}\n```\n\nThe scope you need is **`minutes:minutes.transcript:export`**.\n\n## Granting the scope via device-flow (and the timeout trap)\n\n```bash\nlark-cli auth login --scope \"minutes:minutes.transcript:export\" --no-wait --json\n# → returns a device flow_id + user_code + a verify URL like:\n# https://accounts.feishu.cn/oauth/v1/device/verify?flow_id=...&user_code=XXXX-XXXX\n```\n\n- Send the **verify URL to the person who owns / can access the Minutes** so they approve it in a browser.\n- Resume polling with `lark-cli auth login --device-code \u003ccode>` — do **not** wrap the login in a short `timeout`. lark-cli explicitly warns: each restart invalidates the previous device code, so short-timeout-retry loops never converge. The login command can legitimately block for up to ~10 minutes waiting for approval.\n- After approval, re-run the `api … /transcript` call; it now succeeds.\n\n## Permission is per-minute, not per-tenant\n\nOne Minutes returning `permission deny` (e.g. code `2091005`) does **not** mean other Minutes in the same tenant are denied. Check each minute_token independently. 
Before chasing a denied one, check whether its content is already covered by another document you can access (a meeting's AI summary doc often duplicates the transcript) — if so, skip it instead of escalating the permission request.\n\n## Never re-ASR\n\nThe platform's native AI transcription is materially better than downloading the media and running ASR yourself (speaker diarization, timestamps, domain vocabulary). Downloading the mp4/mp3 and re-transcribing is a regression — do not do it, even though `lark-cli minutes +download` makes it tempting.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":4062,"content_sha256":"bb056e99664b41f4e256ffe13a3fbe2200737f2f99e4d501b39ad6122de2ea31"},{"filename":"references/lark-cli-api-extraction.md","content":"# lark-cli API Extraction (Path A — primary)\n\nThe primary, highest-fidelity way to turn a Feishu/Lark source into Markdown. Everything here was verified end-to-end on a real multi-document collection import (lark-cli 1.0.27 and 1.0.32, 2026-05).\n\n## Contents\n\n- Why API over browser\n- Step 0: proxy and auth preflight\n- Step 1: classify the URL\n- Step 2: resolve wiki node → doc token\n- Step 3: fetch the body programmatically\n- Step 4: spreadsheets\n- Step 5: the reference-graph recursion (collections/hubs)\n- Step 6: cross-tenant and personal-space sources\n- Step 7: frontmatter and provenance\n- Command troubleshooting\n- What a clean run looks like\n\n## Why API over browser\n\nOn real collection work the lark-cli path did the entire job and the browser path was never needed, because the API path:\n\n1. Recurses a hub's reference graph programmatically — a browser cannot \"follow\" `\u003cmention-doc>` references mechanically.\n2. Resolves permission boundaries from exact error codes (`131006`, `99991679`) instead of guessing from a rendered page.\n3. 
Streams the body to disk via `jq`/`cat` so the document text **never passes through the model** (paraphrasing is undetectable later — the core fidelity argument).\n4. Does not depend on a browser extension being connected (the in-browser surface frequently fails to connect; an anonymous debugging Chrome cannot read login-walled content anyway).\n\n## Step 0: proxy and auth preflight\n\n```bash\nexport LARK_CLI_NO_PROXY=1\nlark-cli --version # confirm ≥ 1.0.32 (2026-05); older works but lacks fixes\nlark-cli auth status # must be valid for the target tenant\n```\n\n`LARK_CLI_NO_PROXY=1` is mandatory for `*.feishu.cn` (mainland, direct-connect). Without it, lark-cli prints:\n\n```\n[lark-cli] [WARN] proxy detected: https_proxy=http://127.0.0.1:1082 — requests\n(including credentials) will transit through this proxy. Set LARK_CLI_NO_PROXY=1 to disable proxy.\n```\n\nThat warning is the signal — credentials would transit the proxy and Feishu's domestic DNS would be hijacked. This is host-specific and does not conflict with rules that force `claude.ai`/`anthropic.com` through a proxy; Feishu is a different, direct host.\n\n## Step 1: classify the URL\n\n| URL shape | Meaning | Action |\n|---|---|---|\n| `…/wiki/\u003cnode_token>` | wiki node (a pointer, **not** a doc) | Step 2 then Step 3 |\n| `…/docx/\u003cdoc_token>` | doc, already a doc token | Step 3 directly |\n| `…/sheets/\u003csp_token>` | spreadsheet | Step 4 |\n| `…/minutes/\u003cminute_token>` | Minutes / 妙记 | see feishu-minutes-transcript.md |\n| `…/base/\u003ctoken>`, `…/file/\u003ctoken>` | Bitable / file attachment | see reference-graph dispatch (Step 5) |\n| `https://\u003canything>.feishu.cn/docx/…` or `https://my.feishu.cn/docx/…` | cross-tenant / personal space | Step 6 (same fetch, permission is per-doc) |\n\n## Step 2: resolve wiki node → doc token\n\nA wiki `node_token` is a navigation pointer; fetching it as a doc fails. 
Resolve it:\n\n```bash\nlark-cli wiki spaces get_node --params '{\"token\":\"\u003cnode_token>\"}'\n```\n\nReturns `{\"code\":0,\"data\":{\"node\":{\"node_token\":\"…\",\"obj_token\":\"\u003cDOC_TOKEN>\",\"obj_type\":\"docx\",\"node_type\":\"origin\",\"has_child\":false,…}}}`.\n\n- Use `.data.node.obj_token` + `.data.node.obj_type` for Step 3.\n- `has_child:false` on the entry node does **not** mean \"no content\" — a collection hub is typically a single docx whose *body* references many other docs (Step 5), not a multi-node wiki tree.\n- `code 131006 … node permission denied` → this node is permission-walled; stop and go to Path B (docx-export-to-markdown.md). Do not try to bypass it.\n\n## Step 3: fetch the body programmatically\n\n```bash\nlark-cli docs +fetch --doc \u003cobj_token> --format json > /tmp/fetch.json 2> /tmp/fetch.err\njq -r '.data.markdown' /tmp/fetch.json > \"\u003csanitized-title>.md\"\n```\n\n- `.data.markdown` is clean Markdown with Feishu rich-media tags preserved (resolve them in Step 5).\n- **Keep stdout/stderr separate.** `stderr` may carry `[deprecated] docs +fetch with v1 API is deprecated` — harmless. Doing `2>/dev/null | jq` in one pipe produced a spurious `Exit code 5`; redirect to files and inspect instead.\n- **Never** reconstruct `.data.markdown` by reading and retyping it. `jq -r` it to disk. This is the fidelity guarantee that makes Path A structurally safer than any browser/LLM path.\n- `--format json` is preferred over text so you parse one field deterministically.\n\n## Step 4: spreadsheets\n\nA `\u003csheet token=\"\u003cSP>_\u003cSID>\"/>` tag (or a `…/sheets/\u003cSP>` URL) carries the spreadsheet token and sheet id joined by `_`. Split on `_`:\n\n```bash\nlark-cli sheets +info --spreadsheet-token \u003cSP> \\\n --jq '.data.sheets[]? 
| {sheet_id, title, rowCount: .gridProperties.rowCount, colCount: .gridProperties.columnCount}'\n\nlark-cli sheets +read --spreadsheet-token \u003cSP> --sheet-id \u003cSID> \\\n --range A1:AZ200 --value-render-option ToString \\\n --jq '.data.valueRange.values'\n```\n\n- `--value-render-option ToString` returns plain text cells (formulas/dates rendered), which is what Markdown tables need.\n- The result is a 2-D array; render it to a Markdown table. Size the range from `sheets +info` row/col counts; do not blind-guess a tiny range.\n\n## Step 5: the reference-graph recursion (collections/hubs)\n\nA hub is the root of a reference graph. Treat it as BFS/DFS over references until every branch reaches a leaf (a doc with no further references).\n\n**Enumerate references with the bundled extractor** (a missed reference is a missing document — the single biggest hub-scraping failure; do not hand-roll `grep` and forget the `my.feishu.cn` personal-space pattern, which is exactly what happened before this script existed):\n\n```bash\npython3 scripts/feishu_extract_refs.py \u003cfetched-body>.md\n# → JSON array of {type, token_or_url, title}\n```\n\nThe references it recognizes (the full rich-media inventory): `\u003cmention-doc token type>`, `\u003csheet token>`, `\u003clark-table>\u003clark-tr>\u003clark-td>` (inline tables — render in place, not a reference), `\u003cimage token>`, `\u003cview>\u003cfile>`, cross-tenant `https://\u003ctenant>.feishu.cn/(docx|wiki|sheets|base|file)/\u003ctoken>`, personal-space `https://my.feishu.cn/docx/\u003ctoken>`, Minutes `https://\u003ctenant>.feishu.cn/minutes/\u003ctoken>`, Tencent-Meeting `https://meeting.tencent.com/crm/\u003cid>`.\n\n**Dispatch table:**\n\n| Reference type | Handler |\n|---|---|\n| `mention-doc` type `docx` / cross-tenant `/docx/` / `my.feishu.cn/docx/` | Step 3 `docs +fetch` |\n| `mention-doc` / URL `/wiki/` | Step 2 then Step 3 |\n| `sheet` / `/sheets/` | Step 4 |\n| `/minutes/` URL | 
feishu-minutes-transcript.md (native transcript API) |\n| `meeting.tencent.com/crm/` | Tencent Meeting tooling (outside this skill — its native transcript API; never download+re-ASR) |\n| `\u003clark-table>` | render inline to a Markdown table (pandas `read_html` handles colspan/rowspan); it is content, not a link |\n| `\u003cimage token>` | register the token; lark-cli cannot download it (see permission-and-failure-boundaries.md) |\n| `\u003cview>\u003cfile>` | attachment — record token + filename; treat like an image gap unless separately retrievable |\n\n**Recursion loop:** fetch root → extract refs → for each new ref, dispatch and fetch → run the extractor on each newly fetched body → repeat until no new tokens appear. A child doc can itself embed another reference (e.g. a summary doc that embeds a third Minutes link); the loop must re-scan every newly fetched file, not only the root.\n\n**Leaf / completion gate** — before declaring the collection done, no rich-media reference may remain unresolved anywhere:\n\n```bash\ngrep -rlE '\u003c(lark-table|lark-tr|sheet token=|mention-doc|view type=)' . \\\n && echo \"UNRESOLVED — keep recursing\" || echo \"clean\"\n```\n\nThis grep being empty is a hard acceptance gate for collections.\n\n## Step 6: cross-tenant and personal-space sources\n\n`https://\u003cother-tenant>.feishu.cn/docx/…` and `https://my.feishu.cn/docx/…` (personal space) use the **same** `docs +fetch` — Feishu permission is per-document, not per-domain. A reference living in another tenant or someone's personal space is often still readable with the current token. 
Do not skip a reference just because its host differs; try the fetch and let the error code (`131006` / `0`) decide.\n\n## Step 7: frontmatter and provenance\n\nEach produced file should carry minimal frontmatter so the extraction is auditable and the host PKM can file it (this skill stops at producing it, not filing it):\n\n```yaml\n---\ntitle: \u003cdocument title>\nsource: \u003coriginal feishu URL or token>\nsource_type: docx | wiki | sheet | minutes\nextracted: \u003cYYYY-MM-DD>\npost_process: \u003cone line if any non-trivial transform was applied; omit if pure jq passthrough>\n---\n```\n\n`post_process` matters when text was reshaped (e.g. a sheet rendered to a table, or Path B's heading restoration) — it tells a future reader the body is not a byte-for-byte API passthrough.\n\n## Command troubleshooting\n\n| Symptom | Cause | Fix |\n|---|---|---|\n| `docs +fetch` \"Exit code 5\" but data looks present | `2>/dev/null` swallowed stderr while `jq` failed on mixed stream | Redirect stdout/stderr to separate files; parse the file |\n| `wiki spaces get_node` → `code 131006` | No read permission on that node | Path B (owner exports docx); do not bypass |\n| `api …/transcript` → `code 99991679` | Missing scope | feishu-minutes-transcript.md (device-flow scope grant) |\n| lark-cli reports `API returned an empty JSON response body` | lark-cli mis-renders a binary/error HTTP response | Real status is hidden — see permission-and-failure-boundaries.md; do not trust \"empty JSON\" literally |\n| Need an API lark-cli does not wrap | — | `lark-cli api \u003cMETHOD> \u003cpath> --params '{…}' --as user`; find the spec via `open.feishu.cn/llms-docs/zh-CN/llms-\u003cmodule>.txt` (the `/document/server-docs/` pages are flaky in WebFetch) |\n\n## What a clean run looks like\n\nSingle doc:\n\n```\n$ export LARK_CLI_NO_PROXY=1\n$ lark-cli wiki spaces get_node --params 
'{\"token\":\"\u003cnode_token>\"}'\n{\"code\":0,\"data\":{\"node\":{\"obj_token\":\"\u003cDOC>\",\"obj_type\":\"docx\",\"has_child\":false,...}}}\n$ lark-cli docs +fetch --doc \u003cDOC> --format json > /tmp/f.json 2> /tmp/f.err\n$ jq -r '.data.markdown' /tmp/f.json | wc -c\n 6166\n$ LC_ALL=C grep -rl

\\xef\\xbf\\xbd' . ; echo \"ffd_count=$?\"\nffd_count=1 # 1 = grep found nothing = clean\n```\n\nCollection: the same, then N rounds of `feishu_extract_refs.py` → dispatch → fetch, ending with the residual-tag grep printing `clean`.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":10433,"content_sha256":"1677b2e1d05198741f41e0bdc1df83a06c50a85d7e416202fbce59d4900b9a73"},{"filename":"references/permission-and-failure-boundaries.md","content":"# Permission Boundaries & Verified Dead-Ends\n\nThe single most valuable part of this skill: a record of what does **not** work, so the next run does not re-pay the cost of discovering it. Every entry was verified, not guessed.\n\n## Contents\n\n- Error codes you will hit\n- Dead-end table (do NOT attempt)\n- Why \"empty JSON\" from lark-cli is a lie\n- Login-wall detection\n- Wrong-tool traps\n\n## Error codes you will hit\n\n| Code | Where | Meaning | Correct response |\n|---|---|---|---|\n| `131006` | `wiki spaces get_node` / `docs +fetch` | `node permission denied, user needs read permission` — the current token cannot read this wiki node | Hard server-side boundary. Stop. Path B: ask the permission holder to export `.docx` out-of-band. Do **not** try lark-cli/curl/browser bypasses. |\n| `99991679` | `api …/minutes/.../transcript` | missing scope `minutes:minutes.transcript:export` | Grant the scope via device-flow (feishu-minutes-transcript.md). |\n| `2091005` | minutes transcript | that specific minute is permission-denied | Per-minute, not per-tenant. Check if content is covered elsewhere before escalating. |\n| `0` | any | success | proceed |\n\n`131006` is a *Feishu-side* decision. It was verified that an anonymous browser redirects to `accounts.feishu.cn/...login`, and that even a logged-in user without a share still has to *request* access. There is no client-side trick. 
The only path is the document owner exporting it.\n\n## Dead-end table (do NOT attempt)\n\n| Path | Failure mode (verified) | Root cause |\n|---|---|---|\n| Bypass `131006` via lark-cli retry / different token | still `131006` | server-side per-node ACL |\n| Bypass `131006` via anonymous `curl` of the wiki URL | HTTP 200 but body is the login page (`accounts.feishu.cn`, `login`, `passport`, empty `\u003ctitle>`) | unauthenticated request hits the login wall, not the doc |\n| Bypass `131006` via anonymous debugging Chrome | redirected to `accounts.feishu.cn/.../login?redirect_uri=...` | no session in that Chrome profile |\n| docx embedded image: `lark-cli docs +media-download --token \u003cimg> --type media` | HTTP 404 | command has no `extra` param to identify the owning docx; a bare media token out of its docx context is not resolvable |\n| docx image: `lark-cli api GET /open-apis/drive/v1/medias/\u003cimg>/download` (no `extra`) | `{\"ok\":false,...,\"API returned an empty JSON response body\"}` | lark-cli swallows the real error body |\n| docx image: same with `--params '{\"extra\":\"{\\\"drive_route_token\\\":\\\"\u003cdoc>\\\"}\"}'` | empty / fails | the `extra` format lark-cli passes is not what the endpoint needs; lark-cli does not wrap this correctly |\n| docx image: `lark-cli schema drive.medias.download` (and `.media.`, `.batch_get_tmp_download_url`) | `Unknown resource` | not in lark-cli's schema registry |\n| docx image: `lark-cli api … --dry-run` then raw `curl` | `--dry-run` returns method/url/appId/as but **not** the Bearer token → curl authenticates as nobody → real `HTTP/2 400` | lark-cli intentionally does not expose the token; the curl-around-lark-cli path is structurally closed |\n| Read the downloaded image bytes to \"check\" them | `This tool cannot read binary files` | — |\n| `WebFetch https://open.feishu.cn/document/server-docs/...` for an API spec | backend flaps, often fails | use `open.feishu.cn/llms-docs/zh-CN/llms-\u003cmodule>.txt` 
instead |\n| AppleScript `executeJavaScript` in Chrome | `\"Executing JavaScript through AppleScript is turned off\"` | Chrome disables JS-from-AppleEvents; `defaults write` + restart does not re-enable it here |\n| JXA `executeJavaScript` with async/Promise | `Can't convert types. (-1700)` | JXA cannot convert JS Promises to AppleScript types |\n| JXA with `ObjC.import` / shebang / `includeStandardAdditions` | syntax errors (`-2741`) | unsupported in this JXA-in-Chrome context |\n| Chrome DevTools CDP on `:9222` | `curl :9222/json/list` → `[]` or 404 | CDP endpoints empty even with the flag (profile/policy) |\n| `minimax-docx` to convert docx→md | wrong direction | it is a docx *authoring/editing* tool, not an extractor |\n\n**Conclusion for docx embedded images:** lark-cli (through 1.0.32) cannot download `\u003cimage>` tokens embedded in a docx — seven distinct approaches were exhausted. Register the tokens and dimensions, note \"document owner must right-click → save and send out-of-band\", and move on. The text is the deliverable; images are a tracked, transparent gap. Grinding past the established try-limit is itself the mistake.\n\n## Why \"empty JSON\" from lark-cli is a lie\n\nWhen lark-cli prints `API returned an empty JSON response body`, the server did **not** necessarily return empty — lark-cli fails to render a binary or error response and substitutes that message. The real status (e.g. `HTTP/2 400`) is only visible via `--dry-run` + `curl`, but `--dry-run` withholds the Bearer token, so that diagnostic path cannot complete an authenticated request. Net: treat \"empty JSON\" as \"unknown failure, lark-cli does not wrap this endpoint\", not as \"the resource is empty\".\n\n## Login-wall detection\n\nNever infer \"publicly accessible\" from an HTTP 200. A Feishu login wall returns 200 with a body containing any of: `accounts.feishu.cn`, `passport`, a `login` form, an empty `\u003ctitle>\u003c/title>`. Always inspect the body. 
This is why an anonymous debugging Chrome can only answer \"is this page public?\" — it can never read login-walled content.\n\n## Wrong-tool traps\n\n- **docx → Markdown**: use the `doc-to-markdown` skill (pandoc + post-processing), **not** `minimax-docx` (authoring tool, opposite direction).\n- **Finding an unwrapped native API**: use the `lark-openapi-explorer` skill rather than guessing endpoints.\n- **A search agent reporting \"file not found\"**: not authoritative — verify against authoritative sources (`git worktree list`, repo-wide `find`, `git log -S`, the transcripts directory) before concluding. Ingested recordings/transcripts commonly live in a transcripts directory, not where you first looked.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":5982,"content_sha256":"be845c4c61da61a8bb49f2c9f5a5ba7b09600ff801e1618b63c0089412de5828"},{"filename":"scripts/build_feishu_markdown.py","content":"#!/usr/bin/env python3\n\"\"\"\nRender a structured Feishu capture manifest into Markdown.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport argparse\nimport json\nimport sys\nfrom pathlib import Path\n\n\ndef to_list(value):\n if value is None:\n return []\n if isinstance(value, list):\n return [str(item) for item in value if str(item).strip()]\n return [str(value)]\n\n\ndef yaml_lines(manifest):\n lines = [\"---\"]\n simple_fields = [\n \"title\",\n \"source\",\n \"published\",\n \"created\",\n \"description\",\n ]\n for field in simple_fields:\n value = manifest.get(field, \"\")\n lines.append(f'{field}: \"{str(value).replace(chr(34), chr(39))}\"' if value else f\"{field}:\")\n\n for key in (\"author\", \"tags\"):\n values = to_list(manifest.get(key))\n if values:\n lines.append(f\"{key}:\")\n for value in values:\n lines.append(f' - \"{value.replace(chr(34), chr(39))}\"')\n else:\n lines.append(f\"{key}:\")\n\n lines.append(\"---\")\n return lines\n\n\ndef normalize_body(body):\n if body is None:\n return []\n if isinstance(body, 
list):\n return [str(block).strip() for block in body if str(block).strip()]\n text = str(body).strip()\n return [text] if text else []\n\n\ndef section_lines(section):\n level = int(section.get(\"heading_level\", 2))\n level = max(1, min(level, 6))\n heading = str(section.get(\"heading\", \"\")).strip()\n if not heading:\n raise ValueError(\"Section heading is required\")\n\n lines = [f'{\"#\" * level} {heading}']\n body_blocks = normalize_body(section.get(\"body\"))\n if body_blocks:\n lines.append(\"\")\n lines.extend(body_blocks)\n return lines\n\n\ndef render_markdown(manifest):\n title = str(manifest.get(\"title\", \"\")).strip()\n if not title:\n raise ValueError(\"Manifest title is required\")\n\n sections = manifest.get(\"sections\")\n if not isinstance(sections, list) or not sections:\n raise ValueError(\"Manifest sections must be a non-empty list\")\n\n lines = yaml_lines(manifest)\n lines.extend([\"\", f\"# {title}\"])\n\n source = str(manifest.get(\"source\", \"\")).strip()\n if source:\n lines.extend([\"\", f\"Source: \u003c{source}>\"])\n\n for section in sections:\n lines.extend([\"\", *section_lines(section)])\n\n return \"\\n\".join(lines).rstrip() + \"\\n\"\n\n\ndef parse_args():\n parser = argparse.ArgumentParser(description=__doc__)\n parser.add_argument(\"--input\", required=True, help=\"Path to capture manifest JSON\")\n parser.add_argument(\"--output\", help=\"Optional output markdown path\")\n return parser.parse_args()\n\n\ndef main():\n args = parse_args()\n manifest_path = Path(args.input)\n manifest = json.loads(manifest_path.read_text(encoding=\"utf-8\"))\n markdown = render_markdown(manifest)\n\n if args.output:\n output_path = Path(args.output)\n output_path.write_text(markdown, encoding=\"utf-8\")\n print(f\"Wrote {output_path}\")\n else:\n sys.stdout.write(markdown)\n\n\nif __name__ == \"__main__\":\n main()\n","content_type":"text/x-python; 
charset=utf-8","language":"python","size":3067,"content_sha256":"1b033ba9a694ed597ace425e6bb070227069cc5bcb14111a5df3f5472da8bc13"},{"filename":"scripts/check_heading_coverage.py","content":"#!/usr/bin/env python3\n\"\"\"\nCheck that expected Feishu headings are present in the final Markdown output.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport argparse\nimport re\nimport sys\nfrom pathlib import Path\n\nNOISE_PATTERNS = (\n \"you may also ask\",\n \"recommended content\",\n \"upload logs\",\n \"contact support\",\n \"comments\",\n)\n\n\ndef normalize(text: str) -> str:\n text = text.strip().lower()\n text = re.sub(r\"^#+\\s*\", \"\", text)\n text = re.sub(r\"\\s+\", \"\", text)\n # Remove punctuation and special characters. Using a set avoids the\n # regex escaping trap (the previous character class terminated early\n # because \\\\] was interpreted as a literal backslash followed by ]).\n _REMOVE_CHARS = set(\n chr(c) for c in (\n 0x60, 0x7E, 0x7C, 0x2C, 0x2E, 0x21, 0x3F, 0x28, 0x29,\n 0x5B, 0x5D, 0x3C, 0x3E, 0x300A, 0x300B,\n 0x22, 0x27, 0x201C, 0x201D, 0x2018, 0x2019,\n 0x5C, 0x2D, 0x2B,\n 0x3A, 0xFF1A,\n 0x3002, 0xFF0C,\n 0xFF01, 0xFF1F,\n 0xFF08, 0xFF09,\n 0x2014, 0x2013,\n )\n )\n text = \"\".join(c for c in text if c not in _REMOVE_CHARS)\n return text\n\n\ndef load_expected(headings_file: Path) -> list[str]:\n return [line.strip() for line in headings_file.read_text(encoding=\"utf-8\").splitlines() if line.strip()]\n\n\ndef extract_headings(markdown_text: str) -> list[str]:\n headings = []\n for line in markdown_text.splitlines():\n if re.match(r\"^#{1,6}\\s+\\S\", line):\n headings.append(line.strip())\n return headings\n\n\ndef detect_noise(markdown_text: str) -> list[str]:\n lowered = markdown_text.lower()\n return [pattern for pattern in NOISE_PATTERNS if pattern in lowered]\n\n\ndef parse_args():\n parser = argparse.ArgumentParser(description=__doc__)\n parser.add_argument(\"--markdown-file\", required=True, help=\"Generated markdown 
file\")\n parser.add_argument(\"--headings-file\", required=True, help=\"Plain text file with one expected heading per line\")\n return parser.parse_args()\n\n\ndef main():\n args = parse_args()\n markdown_path = Path(args.markdown_file)\n headings_path = Path(args.headings_file)\n\n markdown_text = markdown_path.read_text(encoding=\"utf-8\")\n expected = load_expected(headings_path)\n found = extract_headings(markdown_text)\n\n found_index = {normalize(item): item for item in found}\n missing = [item for item in expected if normalize(item) not in found_index]\n noise_hits = detect_noise(markdown_text)\n\n print(f\"Expected headings: {len(expected)}\")\n print(f\"Found markdown headings: {len(found)}\")\n\n if missing:\n print(\"Missing headings:\")\n for item in missing:\n print(f\" - {item}\")\n\n if noise_hits:\n print(\"Noise patterns detected:\")\n for item in noise_hits:\n print(f\" - {item}\")\n\n if missing or noise_hits:\n sys.exit(1)\n\n print(\"Heading coverage check passed.\")\n\n\nif __name__ == \"__main__\":\n main()\n","content_type":"text/x-python; charset=utf-8","language":"python","size":2998,"content_sha256":"cf31f811466b8027f473601ad6b9ca855ce8ae08e08dbff5e154838c44bdc77e"},{"filename":"scripts/download_feishu_images.py","content":"#!/usr/bin/env python3\n\"\"\"\nDownload images from Feishu/Lark documents via SSR HTML extraction.\n\nWhen browser automation (AppleScript, JXA, Chrome DevTools) is unavailable,\nthis script extracts authenticated image URLs directly from the initial HTML\nresponse and downloads them with session cookies.\n\nDependencies: pip install browser_cookie3 requests\n\nUsage (single document):\n python3 download_feishu_images.py \\\n --url \"https://my.feishu.cn/wiki/...\" \\\n --doc-name \"my-document\" \\\n --output-dir \"assets/\"\n\nUsage (batch from file):\n python3 download_feishu_images.py \\\n --batch-file urls.txt \\\n --output-dir \"assets/\"\n\nThe urls.txt format (one per line, optional doc-name prefix):\n 
my-document|https://my.feishu.cn/wiki/...\n another-doc|https://my.feishu.cn/wiki/...\n\nOutput: downloaded images + markdown image references printed to stdout.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport argparse\nimport os\nimport re\nimport sys\nfrom pathlib import Path\nfrom urllib.parse import urlparse\n\ntry:\n import browser_cookie3\n import requests\nexcept ImportError as e:\n print(f\"Missing dependency: {e}\", file=sys.stderr)\n print(\"Install: pip install browser_cookie3 requests\", file=sys.stderr)\n sys.exit(1)\n\nIMAGE_URL_RE = re.compile(r'https?://internal-api-drive-stream[^\\s\"\\'\u003c>]+')\n\nDEFAULT_HEADERS = {\n \"User-Agent\": (\n \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) \"\n \"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36\"\n ),\n \"Accept\": (\n \"text/html,application/xhtml+xml,application/xml;q=0.9,\"\n \"image/avif,image/webp,image/apng,*/*;q=0.8\"\n ),\n \"Accept-Language\": \"zh-CN,zh;q=0.9,en;q=0.8\",\n}\n\n\ndef sanitize_name(name: str) -> str:\n \"\"\"Keep alphanumerics, Chinese chars, underscore, hyphen. Max 40 chars.\"\"\"\n cleaned = re.sub(r\"[^a-zA-Z0-9一-鿿_-]\", \"\", name)\n return cleaned[:40]\n\n\ndef extract_image_urls(html_text: str) -> list[str]:\n \"\"\"Extract authenticated Feishu image URLs from raw HTML.\"\"\"\n seen: set[str] = set()\n unique: list[str] = []\n for url in IMAGE_URL_RE.findall(html_text):\n if url not in seen:\n seen.add(url)\n unique.append(url)\n return unique\n\n\ndef download_image(url: str, cookies, referer: str) -> tuple[bytes, str]:\n \"\"\"Download a single image with session cookies. 
Returns (content, content_type).\"\"\"\n headers = {\"Referer\": referer}\n resp = requests.get(url, cookies=cookies, headers=headers, timeout=30)\n resp.raise_for_status()\n content_type = resp.headers.get(\"content-type\", \"image/png\")\n return resp.content, content_type\n\n\ndef ext_from_content_type(content_type: str) -> str:\n \"\"\"Map content-type to file extension.\"\"\"\n ct = content_type.lower()\n if \"gif\" in ct:\n return \"gif\"\n if \"jpeg\" in ct or \"jpg\" in ct:\n return \"jpg\"\n if \"webp\" in ct:\n return \"webp\"\n if \"svg\" in ct:\n return \"svg\"\n return \"png\"\n\n\ndef process_document(\n url: str,\n doc_name: str,\n output_dir: Path,\n cookies,\n dry_run: bool = False,\n) -> dict:\n \"\"\"Download all images from a single Feishu document.\"\"\"\n result = {\n \"url\": url,\n \"doc_name\": doc_name,\n \"found\": 0,\n \"downloaded\": 0,\n \"errors\": 0,\n \"files\": [],\n \"markdown_refs\": [],\n }\n\n try:\n resp = requests.get(url, cookies=cookies, headers=DEFAULT_HEADERS, timeout=30)\n resp.raise_for_status()\n except requests.RequestException as e:\n print(f\" ERROR fetching page: {e}\", file=sys.stderr)\n result[\"errors\"] += 1\n return result\n\n image_urls = extract_image_urls(resp.text)\n result[\"found\"] = len(image_urls)\n\n if not image_urls:\n print(\" No image URLs found in page HTML.\")\n return result\n\n parsed = urlparse(url)\n referer = f\"{parsed.scheme}://{parsed.netloc}/\"\n safe_name = sanitize_name(doc_name)\n output_dir.mkdir(parents=True, exist_ok=True)\n\n for i, img_url in enumerate(image_urls):\n ext = \"png\"\n local_name = f\"{safe_name}-{i}.{ext}\"\n local_path = output_dir / local_name\n\n try:\n if dry_run:\n print(f\" DRY-RUN: would download -> {local_name}\")\n else:\n content, content_type = download_image(img_url, cookies, referer=referer)\n ext = ext_from_content_type(content_type)\n local_name = f\"{safe_name}-{i}.{ext}\"\n local_path = output_dir / local_name\n local_path.write_bytes(content)\n 
print(f\" OK: {local_name} ({len(content)} bytes)\")\n\n result[\"downloaded\"] += 1\n result[\"files\"].append(str(local_path))\n result[\"markdown_refs\"].append(f\"![](assets/{local_name})\")\n except requests.RequestException as e:\n print(f\" ERROR downloading image {i}: {e}\", file=sys.stderr)\n result[\"errors\"] += 1\n\n return result\n\n\ndef parse_batch_file(path: Path) -> list[tuple[str, str]]:\n \"\"\"Parse batch file. Format: doc-name|url (or just url).\"\"\"\n entries: list[tuple[str, str]] = []\n for line in path.read_text(encoding=\"utf-8\").splitlines():\n line = line.strip()\n if not line or line.startswith(\"#\"):\n continue\n if \"|\" in line:\n doc_name, url = line.split(\"|\", 1)\n entries.append((doc_name.strip(), url.strip()))\n else:\n parsed = urlparse(line)\n doc_name = parsed.path.strip(\"/\").split(\"/\")[-1] or \"doc\"\n entries.append((doc_name, line))\n return entries\n\n\ndef parse_args() -> argparse.Namespace:\n parser = argparse.ArgumentParser(description=__doc__)\n parser.add_argument(\"--url\", help=\"Single Feishu document URL\")\n parser.add_argument(\"--doc-name\", default=\"doc\", help=\"Document name for image files\")\n parser.add_argument(\"--output-dir\", default=\"assets\", help=\"Directory to save images\")\n parser.add_argument(\"--batch-file\", help=\"File with doc-name|url lines for batch processing\")\n parser.add_argument(\"--dry-run\", action=\"store_true\", help=\"Print what would be done without downloading\")\n return parser.parse_args()\n\n\ndef main() -> int:\n args = parse_args()\n\n if not args.url and not args.batch_file:\n print(\"Error: specify --url or --batch-file\", file=sys.stderr)\n return 1\n\n try:\n cookies = browser_cookie3.chrome()\n except Exception as e:\n print(f\"Error loading Chrome cookies: {e}\", file=sys.stderr)\n return 1\n\n output_dir = Path(args.output_dir)\n total_found = 0\n total_downloaded = 0\n total_errors = 0\n all_markdown_refs: list[str] = []\n\n if args.batch_file:\n 
entries = parse_batch_file(Path(args.batch_file))\n for doc_name, url in entries:\n print(f\"\\n[{doc_name}] {url}\")\n result = process_document(url, doc_name, output_dir, cookies, dry_run=args.dry_run)\n total_found += result[\"found\"]\n total_downloaded += result[\"downloaded\"]\n total_errors += result[\"errors\"]\n all_markdown_refs.extend(result[\"markdown_refs\"])\n else:\n print(f\"\\n[{args.doc_name}] {args.url}\")\n result = process_document(args.url, args.doc_name, output_dir, cookies, dry_run=args.dry_run)\n total_found = result[\"found\"]\n total_downloaded = result[\"downloaded\"]\n total_errors = result[\"errors\"]\n all_markdown_refs = result[\"markdown_refs\"]\n\n if all_markdown_refs:\n print(\"\\n--- Markdown references ---\")\n for ref in all_markdown_refs:\n print(ref)\n\n print(\"\\n--- Summary ---\")\n print(f\"Images found: {total_found}\")\n print(f\"Images downloaded: {total_downloaded}\")\n print(f\"Errors: {total_errors}\")\n\n return 0 if total_errors == 0 else 1\n\n\nif __name__ == \"__main__\":\n sys.exit(main())\n","content_type":"text/x-python; charset=utf-8","language":"python","size":7894,"content_sha256":"b677deed1553514798dd4eb33cdf50602be9e96db16f85a2054b700bd9a2115c"},{"filename":"scripts/feishu_extract_refs.py","content":"#!/usr/bin/env python3\n\"\"\"Enumerate every rich-media reference in a fetched Feishu Markdown body.\n\nThis is the recursion engine's core for Path A (lark-cli API extraction). A\ncollection/hub is a doc whose body references other docs; missing one\nreference means a missing document — the single biggest hub-scraping failure.\nHand-rolled `grep | sed` pipelines repeatedly missed the `my.feishu.cn`\npersonal-space pattern, so this enumeration is centralized and tested here.\n\nInput : a Markdown file produced by `lark-cli docs +fetch ... 
| jq -r .data.markdown`.\nOutput: JSON array on stdout, one object per *distinct* reference:\n {\"type\": ..., \"ref\": \u003ctoken-or-url>, \"title\": ..., \"dispatch\": \u003chint>}\n plus a human summary on stderr.\n\nIt only *enumerates*. Dispatching/fetching each reference is the caller's job\n(see references/lark-cli-api-extraction.md, Step 5 dispatch table).\n\nUsage:\n python3 feishu_extract_refs.py FETCHED_BODY.md\n python3 feishu_extract_refs.py FETCHED_BODY.md --type docx # filter\n\"\"\"\nfrom __future__ import annotations\n\nimport argparse\nimport json\nimport re\nimport sys\nfrom pathlib import Path\n\n# Feishu/Lark hosts. feishu.cn = mainland tenants + my.feishu.cn personal space;\n# larksuite.com = international Lark. Both serve the same /docx /wiki /sheets\n# /minutes /base /file path scheme.\n_HOST = r\"[a-z0-9-]+\\.(?:feishu\\.cn|larksuite\\.com)\"\n\n# Inline rich-media tags emitted by `docs +fetch` Markdown.\nRE_MENTION_DOC = re.compile(\n r'\u003cmention-doc\\s+token=\"([^\"]+)\"\\s+type=\"([^\"]+)\"\\s*>([^\u003c]*)\u003c/mention-doc>'\n)\nRE_SHEET_TAG = re.compile(r'\u003csheet\\s+token=\"([^\"]+)\"\\s*/?>')\nRE_IMAGE_TAG = re.compile(r'\u003cimage\\s+token=\"([^\"]+)\"')\nRE_FILE_TAG = re.compile(r'\u003cfile\\s+token=\"([^\"]+)\"[^>]*>([^\u003c]*)\u003c/file>')\nRE_LARK_TABLE = re.compile(r\"\u003clark-table\\b\")\n\n# URLs that appear in the body (cross-tenant / personal space / minutes /\n# Tencent Meeting). 
One regex covers mainland, international and personal\n# (my.feishu.cn) because the host group accepts any sub-domain.\nRE_FEISHU_URL = re.compile(\n r\"https://(\" + _HOST + r\")/(docx|wiki|sheets|base|file|minutes)/([A-Za-z0-9]+)\"\n)\nRE_TENCENT_MEETING = re.compile(\n r\"https://meeting\\.tencent\\.com/crm/([A-Za-z0-9]+)\"\n)\n\n# How the caller should handle each type (mirrors the reference's dispatch\n# table, surfaced here so the caller does not have to re-derive it).\nDISPATCH = {\n \"mention-doc-docx\": \"docs +fetch --doc \u003ctoken>\",\n \"mention-doc-wiki\": \"wiki spaces get_node then docs +fetch\",\n \"mention-doc-sheet\": \"sheets +read\",\n \"url-docx\": \"docs +fetch --doc \u003ctoken>\",\n \"url-wiki\": \"wiki spaces get_node then docs +fetch\",\n \"url-sheets\": \"sheets +read (split token on '_' -> SP, SID)\",\n \"url-base\": \"Bitable API (outside this skill) — record token\",\n \"url-file\": \"attachment — record token + name; treat like image gap\",\n \"url-minutes\": \"native transcript API (feishu-minutes-transcript.md)\",\n \"sheet-tag\": \"sheets +read (split token on '_' -> SP, SID)\",\n \"image\": \"register token; lark-cli cannot download docx images\",\n \"file\": \"attachment — record token + name; treat like image gap\",\n \"tencent-meeting\": \"Tencent Meeting native transcript (never download+re-ASR)\",\n \"lark-table\": \"inline content — render in place to a Markdown table\",\n}\n\n\ndef _read_text(path: Path) -> str:\n \"\"\"Read the body strictly as UTF-8.\n\n We deliberately do NOT use errors='replace': a decode failure means an\n upstream step corrupted the text, and the skill's acceptance contract\n checks for U+FFFD. 
Masking it here would hide exactly the failure the\n pipeline is trying to detect, so fail loudly instead.\n \"\"\"\n try:\n raw = path.read_bytes()\n except FileNotFoundError:\n sys.exit(f\"error: file not found: {path}\")\n except PermissionError:\n sys.exit(f\"error: cannot read (permission): {path}\")\n if not raw.strip():\n sys.exit(f\"error: file is empty: {path} (fetch likely failed upstream)\")\n try:\n return raw.decode(\"utf-8\")\n except UnicodeDecodeError as exc:\n sys.exit(\n f\"error: {path} is not valid UTF-8 ({exc}); an upstream extraction \"\n f\"step corrupted the body — re-fetch with `lark-cli docs +fetch \"\n f\"--format json` and `jq -r .data.markdown`, do not 'fix' encoding here.\"\n )\n\n\ndef extract(text: str) -> list[dict]:\n refs: list[dict] = []\n\n for token, doc_type, title in RE_MENTION_DOC.findall(text):\n t = doc_type.strip().lower()\n kind = \"mention-doc-sheet\" if t in (\"sheet\", \"bitable\") else (\n \"mention-doc-wiki\" if t == \"wiki\" else \"mention-doc-docx\"\n )\n refs.append({\n \"type\": kind,\n \"ref\": token,\n \"title\": title.strip(),\n \"dispatch\": DISPATCH[kind],\n })\n\n for token in RE_SHEET_TAG.findall(text):\n refs.append({\n \"type\": \"sheet-tag\",\n \"ref\": token,\n \"title\": \"\",\n \"dispatch\": DISPATCH[\"sheet-tag\"],\n })\n\n for token in RE_IMAGE_TAG.findall(text):\n refs.append({\n \"type\": \"image\",\n \"ref\": token,\n \"title\": \"\",\n \"dispatch\": DISPATCH[\"image\"],\n })\n\n for token, name in RE_FILE_TAG.findall(text):\n refs.append({\n \"type\": \"file\",\n \"ref\": token,\n \"title\": name.strip(),\n \"dispatch\": DISPATCH[\"file\"],\n })\n\n for host, seg, token in RE_FEISHU_URL.findall(text):\n kind = f\"url-{seg}\"\n refs.append({\n \"type\": kind,\n \"ref\": f\"https://{host}/{seg}/{token}\",\n \"title\": \"\",\n \"dispatch\": DISPATCH.get(kind, \"record token\"),\n })\n\n for mid in RE_TENCENT_MEETING.findall(text):\n refs.append({\n \"type\": \"tencent-meeting\",\n \"ref\": 
f\"https://meeting.tencent.com/crm/{mid}\",\n \"title\": \"\",\n \"dispatch\": DISPATCH[\"tencent-meeting\"],\n })\n\n n_tables = len(RE_LARK_TABLE.findall(text))\n if n_tables:\n # Inline content, not a link to follow — surfaced so the caller knows\n # to render it in place (pandas.read_html handles colspan/rowspan).\n refs.append({\n \"type\": \"lark-table\",\n \"ref\": f\"(inline x{n_tables})\",\n \"title\": \"\",\n \"dispatch\": DISPATCH[\"lark-table\"],\n })\n\n # De-duplicate on (type, ref); keep first title seen.\n seen: dict[tuple[str, str], dict] = {}\n for r in refs:\n key = (r[\"type\"], r[\"ref\"])\n if key not in seen:\n seen[key] = r\n return list(seen.values())\n\n\ndef main() -> None:\n ap = argparse.ArgumentParser(description=__doc__)\n ap.add_argument(\"markdown_file\", help=\"fetched Feishu body (.md)\")\n ap.add_argument(\"--type\", help=\"only emit refs of this type (e.g. docx, image)\")\n args = ap.parse_args()\n\n text = _read_text(Path(args.markdown_file))\n refs = extract(text)\n if args.type:\n refs = [r for r in refs if args.type in r[\"type\"]]\n\n json.dump(refs, sys.stdout, ensure_ascii=False, indent=2)\n sys.stdout.write(\"\\n\")\n\n # Summary to stderr so stdout stays pure JSON for piping.\n by_type: dict[str, int] = {}\n for r in refs:\n by_type[r[\"type\"]] = by_type.get(r[\"type\"], 0) + 1\n if by_type:\n summary = \", \".join(f\"{k}={v}\" for k, v in sorted(by_type.items()))\n print(f\"[feishu_extract_refs] {len(refs)} distinct refs: {summary}\",\n file=sys.stderr)\n else:\n print(\"[feishu_extract_refs] no references found — this is a leaf doc \"\n \"(nothing further to recurse).\", file=sys.stderr)\n\n\nif __name__ == \"__main__\":\n main()\n","content_type":"text/x-python; charset=utf-8","language":"python","size":7695,"content_sha256":"d91c220ce4cccfd80b312a3f9d0d981cc3709a1935e94ccf5cf8f650ed768b04"},{"filename":"scripts/restore_docx_headings.py","content":"#!/usr/bin/env python3\n\"\"\"Restore heading hierarchy and 
highlights lost when pandoc converts a\nFeishu-exported .docx (Path B).\n\nFeishu-exported docx does not use Word heading styles — it lays out headings\nwith font size + bold on normal paragraphs, and marks emphasis with run\nshading (`w:shd@fill`), not `w:highlight`. pandoc therefore produces zero\nMarkdown headings (every heading becomes flat `**bold**`) and drops every\nhighlight. A text-level check (\"no errors, word count matches\") passes while\nthe document's entire structure is gone — only visual verification catches it.\n\nThis script repairs the pandoc Markdown WITHOUT retyping the body:\n * heading levels are derived from the docx's own font-size distribution\n (largest sizes -> H1..Hn, descending) and applied as `#` prefixes;\n * run shading fills are restored as Obsidian `==highlight==`.\n\nBody text is never reconstructed — only `#` prefixes and `==` wrappers are\nadded to the existing pandoc lines. This keeps the API/pandoc text byte-exact\n(the fidelity invariant) while giving back the structure a human sees.\n\nUsage:\n python3 restore_docx_headings.py --docx SRC.docx --md PANDOC.md --out FINAL.md\n python3 restore_docx_headings.py --docx SRC.docx --md PANDOC.md --dry-run\n\n`--dry-run` prints the derived size->level mapping and match counts without\nwriting — verify the plan before applying it (plan / validate / execute).\n\"\"\"\nfrom __future__ import annotations\n\nimport argparse\nimport re\nimport sys\nfrom collections import Counter\nfrom pathlib import Path\n\ntry:\n from docx import Document\n from docx.oxml.ns import qn\nexcept ModuleNotFoundError:\n sys.exit(\n \"error: python-docx is not installed.\\n\"\n \" run with uv: uv run --with python-docx python3 \"\n \"scripts/restore_docx_headings.py ...\\n\"\n \" or: pip install python-docx\"\n )\n\n# Run-shading fills that are page/background, not emphasis. Everything else\n# applied at run level by Feishu is an intentional highlight. 
Deriving\n# \"highlight = any non-background run fill\" from the document avoids\n# hard-coding specific colors; the values verified in practice were\n# ffe928 (yellow) and 935af6 (purple) — kept here only as the known examples,\n# not as a closed allow-list.\n_BACKGROUND_FILLS = {\"auto\", \"ffffff\", \"000000\", \"\"}\n_ZERO_WIDTH = \"​‌‍\"\n\n\ndef _norm(s: str) -> str:\n \"\"\"Normalize a line for cross-format text matching.\n\n pandoc may wrap a heading as `**text**`; the source paragraph is `text`.\n Strip emphasis/heading markers, zero-width chars, and collapse whitespace\n so the same logical line matches across the two representations.\n \"\"\"\n s = s.translate({ord(c): None for c in _ZERO_WIDTH})\n s = re.sub(r\"[*_#`]\", \"\", s)\n s = re.sub(r\"\\s+\", \" \", s)\n return s.strip()\n\n\ndef _doc_default_pt(doc) -> float:\n \"\"\"Resolve the document's default body point size.\n\n Critical: body paragraphs in a Feishu/pandoc docx frequently carry NO\n explicit run size — they inherit from the Normal style or docDefaults.\n If such paragraphs are bucketed as \"unknown\" and excluded, the modal\n size becomes a *heading* size and every real heading is demoted to body\n (verified failure). 
So every paragraph must get a numeric size, falling\n back to this resolved default, so the modal size is the true body size.\n\n Resolution order: Normal style -> docDefaults rPr sz -> 11.0pt.\n 11.0pt is the de-facto Word default for the .docx era (Calibri 11); it\n is only the last resort when the file declares no default at all.\n \"\"\"\n try:\n sz = doc.styles[\"Normal\"].font.size\n if sz is not None:\n return sz.pt\n except (KeyError, AttributeError, ValueError):\n pass\n try:\n sz_el = doc.styles.element.find(\n qn(\"w:docDefaults\") + \"/\" + qn(\"w:rPrDefault\")\n + \"/\" + qn(\"w:rPr\") + \"/\" + qn(\"w:sz\")\n )\n if sz_el is not None:\n val = sz_el.get(qn(\"w:val\"))\n if val:\n return int(val) / 2.0 # OOXML sz is in half-points\n except (AttributeError, ValueError, TypeError):\n pass\n return 11.0\n\n\ndef _para_font_pt(para, default_pt: float) -> float:\n \"\"\"Effective point size of a paragraph — never None.\n\n Headings here have all runs at one large size. Take the max run size;\n fall back to the paragraph style's size; finally to the resolved\n document default so unsized body paragraphs land in the body bucket\n (not the 'unknown' void that corrupts the modal-size heuristic).\n \"\"\"\n sizes = [r.font.size.pt for r in para.runs if r.font.size is not None]\n if sizes:\n return max(sizes)\n try:\n if para.style and para.style.font and para.style.font.size:\n return para.style.font.size.pt\n except (AttributeError, ValueError):\n pass\n return default_pt\n\n\ndef _run_highlight_fill(run) -> str | None:\n \"\"\"Return the run's shading fill if it is an emphasis highlight, else None.\"\"\"\n rpr = run._element.rPr\n if rpr is None:\n return None\n shd = rpr.find(qn(\"w:shd\"))\n if shd is None:\n return None\n fill = (shd.get(qn(\"w:fill\")) or \"\").lower()\n if fill in _BACKGROUND_FILLS:\n return None\n return fill\n\n\ndef build_plan(docx_path: Path):\n \"\"\"Walk the docx once, returning the heading plan and highlight plan.\n\n heading_plan : 
list of (normalized_text, level, raw_text) in doc order\n highlight_plan: list of (normalized_text, [run_text, ...]) in doc order\n size_to_level: derived mapping for --dry-run reporting\n \"\"\"\n try:\n doc = Document(str(docx_path))\n except Exception as exc: # python-docx raises various errors for bad files\n sys.exit(f\"error: cannot open docx ({exc}). Confirm with `file -b`; an \"\n f\"exported .docx is sometimes mislabeled.\")\n\n paras = list(doc.paragraphs)\n default_pt = _doc_default_pt(doc)\n\n # Every non-empty paragraph gets a numeric size (unsized -> resolved\n # default), so the modal size is the true body size. Sizes strictly\n # larger than body, descending, become H1..Hn.\n size_counts = Counter(\n round(_para_font_pt(p, default_pt), 1)\n for p in paras if p.text.strip()\n )\n if not size_counts:\n # No text paragraphs at all — nothing to restore; let the caller\n # pass the markdown through unchanged rather than abort.\n print(\"[restore] no text paragraphs in docx — passthrough.\",\n file=sys.stderr)\n return [], [], {}, default_pt\n body_size = size_counts.most_common(1)[0][0]\n heading_sizes = sorted((s for s in size_counts if s > body_size), reverse=True)\n size_to_level = {s: i + 1 for i, s in enumerate(heading_sizes)}\n if not size_to_level:\n # One distinct size only: the doc has no font-size heading hierarchy\n # (it likely already uses Word heading styles, which doc-to-markdown\n # converts natively). 
Highlights may still need restoring, so warn\n # and continue rather than exit.\n print(f\"[restore] no font-size hierarchy above body {body_size}pt — \"\n f\"doc likely uses Word heading styles already; restoring \"\n f\"highlights only.\", file=sys.stderr)\n\n heading_plan, highlight_plan = [], []\n for p in paras:\n text = p.text.strip()\n if not text:\n continue\n lvl = size_to_level.get(round(_para_font_pt(p, default_pt), 1))\n if lvl:\n heading_plan.append((_norm(text), lvl, text))\n hi = [r.text for r in p.runs\n if r.text.strip() and _run_highlight_fill(r) is not None]\n if hi:\n highlight_plan.append((_norm(text), hi))\n\n return heading_plan, highlight_plan, size_to_level, body_size\n\n\ndef apply_plan(md_lines, heading_plan, highlight_plan):\n \"\"\"Apply heading prefixes and highlight wrappers to the pandoc lines.\n\n Matching is by normalized text, in document order, with a forward-only\n cursor so repeated identical strings map to successive occurrences.\n Returns (new_lines, n_headings_applied, n_unmatched_headings,\n n_highlights_applied).\n \"\"\"\n norm_lines = [_norm(l) for l in md_lines]\n out = list(md_lines)\n\n cursor = 0\n applied_h = unmatched_h = 0\n for ntext, level, _raw in heading_plan:\n if not ntext:\n continue\n found = -1\n for i in range(cursor, len(out)):\n if norm_lines[i] == ntext:\n found = i\n break\n if found == -1:\n unmatched_h += 1\n continue\n # Replace the whole line with a clean heading — drop pandoc's bold\n # since a heading must not also be `**...**`.\n out[found] = \"#\" * level + \" \" + ntext\n norm_lines[found] = ntext # keep in sync for subsequent matches\n cursor = found + 1\n applied_h += 1\n\n cursor = 0\n applied_hl = 0\n for ntext, run_texts in highlight_plan:\n if not ntext:\n continue\n found = -1\n for i in range(cursor, len(out)):\n if norm_lines[i] == ntext:\n found = i\n break\n if found == -1:\n continue\n line = out[found]\n for rt in run_texts:\n rt = rt.strip()\n if not rt or rt not in line:\n 
continue\n if (\"==\" + rt + \"==\") in line: # already wrapped\n continue\n line = line.replace(rt, \"==\" + rt + \"==\", 1)\n out[found] = line\n cursor = found + 1\n applied_hl += 1\n\n return out, applied_h, unmatched_h, applied_hl\n\n\ndef main() -> None:\n ap = argparse.ArgumentParser(description=__doc__)\n ap.add_argument(\"--docx\", required=True, help=\"the owner-exported source .docx\")\n ap.add_argument(\"--md\", required=True, help=\"first-pass pandoc/doc-to-markdown .md\")\n ap.add_argument(\"--out\", help=\"output path (required unless --dry-run)\")\n ap.add_argument(\"--dry-run\", action=\"store_true\",\n help=\"print the size->level mapping and counts; do not write\")\n args = ap.parse_args()\n\n docx_path, md_path = Path(args.docx), Path(args.md)\n if not docx_path.exists():\n sys.exit(f\"error: docx not found: {docx_path}\")\n if not md_path.exists():\n sys.exit(f\"error: markdown not found: {md_path}\")\n if not args.dry_run and not args.out:\n sys.exit(\"error: --out is required unless --dry-run\")\n\n heading_plan, highlight_plan, size_to_level, body_size = build_plan(docx_path)\n\n print(f\"[restore] body size = {body_size}pt (normal text)\", file=sys.stderr)\n for sz, lvl in sorted(size_to_level.items(), key=lambda kv: -kv[0]):\n n = sum(1 for t in heading_plan if t[1] == lvl)\n print(f\"[restore] {sz}pt -> H{lvl} ({n} paragraphs)\", file=sys.stderr)\n print(f\"[restore] {len(highlight_plan)} paragraphs carry run highlights\",\n file=sys.stderr)\n\n md_lines = md_path.read_text(encoding=\"utf-8\").splitlines()\n new_lines, applied_h, unmatched_h, applied_hl = apply_plan(\n md_lines, heading_plan, highlight_plan\n )\n print(f\"[restore] headings applied={applied_h} unmatched={unmatched_h}; \"\n f\"highlight lines applied={applied_hl}\", file=sys.stderr)\n if unmatched_h:\n print(f\"[restore] WARNING: {unmatched_h} heading paragraph(s) had no \"\n f\"matching Markdown line — inspect the source vs pandoc output \"\n f\"for those (often a table 
caption or an image-only paragraph).\",\n file=sys.stderr)\n\n if args.dry_run:\n print(\"[restore] dry-run: nothing written.\", file=sys.stderr)\n return\n\n out_path = Path(args.out)\n out_path.write_text(\"\\n\".join(new_lines) + \"\\n\", encoding=\"utf-8\")\n print(f\"[restore] wrote {out_path}. Next: visually verify against the docx \"\n f\"render (qlmanage / soffice --convert-to pdf) before accepting.\",\n file=sys.stderr)\n\n\nif __name__ == \"__main__\":\n main()\n","content_type":"text/x-python; charset=utf-8","language":"python","size":11982,"content_sha256":"7a6a4425741f46b7b734242bf43b3a046601d6c59e24f1e8cf3f992b7bc949a0"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"Feishu Doc Scraper","type":"text"}]},{"type":"paragraph","content":[{"text":"Extract a Feishu/Lark source into faithful local Markdown. ","type":"text"},{"text":"Prefer the lark-cli API","type":"text","marks":[{"type":"strong"}]},{"text":" — it extracts the body programmatically (no model paraphrasing), follows a collection's reference graph, and reads permission boundaries from error codes instead of guessing. Treat the rendered browser page as a ","type":"text"},{"text":"fallback","type":"text","marks":[{"type":"em"}]},{"text":", not the source of truth: in real collection-scraping work the API path consistently does the whole job while the browser path is never needed.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Scope (read this first)","type":"text"}]},{"type":"paragraph","content":[{"text":"This skill's contract is ","type":"text"},{"text":"faithful per-source Markdown + a record of what was extracted","type":"text","marks":[{"type":"strong"}]},{"text":". 
It does ","type":"text"},{"text":"not","type":"text","marks":[{"type":"em"}]},{"text":" decide how the resulting files are named, indexed, deduplicated against existing notes, or organized into a knowledge base — that belongs to the host PKM / the user's own conventions. Stopping at faithful extraction keeps this skill orthogonal and reusable. When the user wants the output filed into a vault, extract first, then hand the clean Markdown to their organizing workflow.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Choose the path","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"Is the source a Feishu/Lark URL (wiki / docx / sheets / minutes / base)?\n├── YES → is lark-cli installed and authenticated to that tenant?\n│ ├── YES → PATH A: lark-cli API extraction (primary — start here)\n│ │ └── hit code 131006 / 99991679 (permission denied)?\n│ │ └── PATH B: owner-exported .docx → faithful Markdown\n│ └── NO → install/auth lark-cli first (it is worth it); only if\n│ truly impossible → PATH D: browser DOM fallback\n├── the URL is a Minutes / 妙记 link, or a doc references one → PATH C: Minutes transcript\n└── you were handed an exported .docx (not a URL) → PATH B","type":"text"}]},{"type":"paragraph","content":[{"text":"A collection/hub is just a docx whose body references other docs — ","type":"text"},{"text":"Path A handles it by recursively following the reference graph","type":"text","marks":[{"type":"strong"}]},{"text":", not by visiting pages in a browser.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Path A — lark-cli API extraction (primary)","type":"text"}]},{"type":"paragraph","content":[{"text":"Full command catalog, recursion engine, cross-tenant and personal-space nuances: 
","type":"text"},{"text":"references/lark-cli-api-extraction.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/lark-cli-api-extraction.md","title":null}},{"type":"strong"}]},{"text":". The essentials for the common case:","type":"text"}]},{"type":"paragraph","content":[{"text":"1. Disable the proxy for Feishu domestic domains.","type":"text","marks":[{"type":"strong"}]},{"text":" Feishu's ","type":"text"},{"text":"*.feishu.cn","type":"text","marks":[{"type":"code_inline"}]},{"text":" endpoints are direct-connect in mainland China; routing them through a local proxy leaks credentials through the proxy and gets DNS-hijacked. lark-cli itself warns about this. Always:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"export LARK_CLI_NO_PROXY=1","type":"text"}]},{"type":"paragraph","content":[{"text":"This does not conflict with any \"Claude/Anthropic domains must use the proxy\" rule — Feishu is a different host and is direct.","type":"text"}]},{"type":"paragraph","content":[{"text":"2. Classify the URL, then resolve to a fetchable doc token.","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"…/wiki/\u003cnode_token>","type":"text","marks":[{"type":"code_inline"}]},{"text":" — a wiki node token is ","type":"text"},{"text":"not","type":"text","marks":[{"type":"strong"}]},{"text":" a doc token. Resolve it first:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"lark-cli wiki spaces get_node --params '{\"token\":\"\u003cnode_token>\"}'\n# → .data.node.obj_token and .data.node.obj_type (e.g. 
\"docx\")","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"…/docx/\u003cdoc_token>","type":"text","marks":[{"type":"code_inline"}]},{"text":" — already a doc token, fetch directly.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"…/sheets/\u003ctoken>","type":"text","marks":[{"type":"code_inline"}]},{"text":" — spreadsheet, use the sheets commands (see reference).","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"…/minutes/\u003ctoken>","type":"text","marks":[{"type":"code_inline"}]},{"text":" — Minutes, go to ","type":"text"},{"text":"Path C","type":"text","marks":[{"type":"strong"}]},{"text":".","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"3. Fetch the body as Markdown — programmatically, never via the model.","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"lark-cli docs +fetch --doc \u003cobj_token> --format json > /tmp/fetch.json 2> /tmp/fetch.err\n# body is .data.markdown — extract with jq, do NOT retype or summarize it\njq -r '.data.markdown' /tmp/fetch.json > source.md","type":"text"}]},{"type":"paragraph","content":[{"text":"Keep stdout and stderr separate. A harmless ","type":"text"},{"text":"[deprecated] docs +fetch with v1 API is deprecated","type":"text","marks":[{"type":"code_inline"}]},{"text":" goes to stderr; piping ","type":"text"},{"text":"2>/dev/null","type":"text","marks":[{"type":"code_inline"}]},{"text":" ","type":"text"},{"text":"and","type":"text","marks":[{"type":"em"}]},{"text":" ","type":"text"},{"text":"jq","type":"text","marks":[{"type":"code_inline"}]},{"text":" together produced a false ","type":"text"},{"text":"Exit code 5","type":"text","marks":[{"type":"code_inline"}]},{"text":" in practice — redirect to files and inspect, don't blind-pipe. 
The body must reach disk without passing through the model (paraphrasing silently corrupts source text; this is the single most important fidelity rule).","type":"text"}]},{"type":"paragraph","content":[{"text":"4. If it's a collection/hub, follow the reference graph (BFS).","type":"text","marks":[{"type":"strong"}]},{"text":" The hub body contains ","type":"text"},{"text":"\u003cmention-doc>","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"\u003csheet>","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"\u003cimage>","type":"text","marks":[{"type":"code_inline"}]},{"text":" tags and cross-tenant / Minutes / Tencent-Meeting URLs. Extract every reference, dispatch by type, fetch, and ","type":"text"},{"text":"repeat on each newly fetched doc until no new references remain","type":"text","marks":[{"type":"strong"}]},{"text":" (leaf nodes). Use the bundled extractor so nothing is silently missed (a missed reference = a missing document, the #1 hub-scraping failure):","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"python3 scripts/feishu_extract_refs.py source.md # → JSON array of {type, ref, title, dispatch}","type":"text"}]},{"type":"paragraph","content":[{"text":"Recursion loop, dispatch table, and the cross-tenant/","type":"text"},{"text":"my.feishu.cn","type":"text","marks":[{"type":"code_inline"}]},{"text":" personal-space rules are in the reference.","type":"text"}]},{"type":"paragraph","content":[{"text":"5. Final residual-tag check (acceptance gate for collections).","type":"text","marks":[{"type":"strong"}]},{"text":" Every rich-media reference must have been resolved and rendered:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"grep -rlE '\u003c(lark-table|lark-tr|sheet token=|mention-doc|view type=)' . 
&& echo \"UNRESOLVED — keep recursing\" || echo \"clean\"","type":"text"}]},{"type":"paragraph","content":[{"text":"Must be empty before you stop.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Path B — permission denied → owner-exported .docx","type":"text"}]},{"type":"paragraph","content":[{"text":"lark-cli wiki spaces get_node","type":"text","marks":[{"type":"code_inline"}]},{"text":" returning ","type":"text"},{"text":"code 131006 … node permission denied, user needs read permission","type":"text","marks":[{"type":"code_inline"}]},{"text":" (or fetch returning it) is a ","type":"text"},{"text":"hard Feishu-side boundary","type":"text","marks":[{"type":"strong"}]},{"text":". lark-cli, anonymous curl, and the browser all fail it — this has been verified exhaustively; do not spend cycles trying to bypass it. The only correct move: ask the permission holder to export the doc as ","type":"text"},{"text":".docx","type":"text","marks":[{"type":"code_inline"}]},{"text":" and send it back out-of-band, then convert with fidelity (font-size→heading and ","type":"text"},{"text":"w:shd","type":"text","marks":[{"type":"code_inline"}]},{"text":"→highlight restoration, then visual verification). Full procedure: ","type":"text"},{"text":"references/docx-export-to-markdown.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/docx-export-to-markdown.md","title":null}},{"type":"strong"}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Path C — Feishu Minutes (妙记) transcript","type":"text"}]},{"type":"paragraph","content":[{"text":"lark-cli minutes","type":"text","marks":[{"type":"code_inline"}]},{"text":" only returns metadata and can download audio/video — it ","type":"text"},{"text":"cannot","type":"text","marks":[{"type":"strong"}]},{"text":" export the text transcript. 
The transcript comes from a native endpoint called through ","type":"text"},{"text":"lark-cli api","type":"text","marks":[{"type":"code_inline"}]},{"text":", and needs an extra scope granted via a device-flow login. Native AI transcription is far better than downloading the media and re-running ASR — never do the latter. Endpoint, scope name, the device-flow timeout trap, and per-minute (not per-tenant) permission behavior: ","type":"text"},{"text":"references/feishu-minutes-transcript.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/feishu-minutes-transcript.md","title":null}},{"type":"strong"}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Path D — browser DOM fallback (last resort)","type":"text"}]},{"type":"paragraph","content":[{"text":"Only when lark-cli genuinely cannot reach the content (no install possible, and the doc is not permission-walled). This is the old virtual-scroll / TOC-driven DOM capture workflow. It is slower, depends on a connected browser surface (the in-browser extension frequently fails to connect), and an anonymous debugging Chrome can only tell you whether a page is ","type":"text"},{"text":"publicly","type":"text","marks":[{"type":"em"}]},{"text":" reachable — it cannot read login-walled content. Workflow: ","type":"text"},{"text":"references/browser-dom-fallback.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/browser-dom-fallback.md","title":null}},{"type":"strong"}]},{"text":". 
Battle-tested DOM rules (virtual scroll, ","type":"text"},{"text":"data-block-id","type":"text","marks":[{"type":"code_inline"}]},{"text":" ordering, table/bullet extraction, image streams): ","type":"text"},{"text":"references/browser-failure-rules.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/browser-failure-rules.md","title":null}},{"type":"strong"}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Hard rules","type":"text"}]},{"type":"paragraph","content":[{"text":"These are the rules whose violation silently ruins the output. Each has a reason — follow the reason, not just the letter.","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Never let the document body pass through the model.","type":"text","marks":[{"type":"strong"}]},{"text":" Extract with ","type":"text"},{"text":"jq","type":"text","marks":[{"type":"code_inline"}]},{"text":"/","type":"text"},{"text":"cat","type":"text","marks":[{"type":"code_inline"}]},{"text":"/scripts straight to disk. The model paraphrasing source text is undetectable later and destroys fidelity. 
This is why Path A beats the browser path structurally.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"export LARK_CLI_NO_PROXY=1","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" for ","type":"text","marks":[{"type":"strong"}]},{"text":"*.feishu.cn","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":".","type":"text","marks":[{"type":"strong"}]},{"text":" Otherwise credentials transit a local proxy and DNS is hijacked.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Transcripts come from the platform's native transcription, never re-ASR.","type":"text","marks":[{"type":"strong"}]},{"text":" Downloading media and transcribing again loses speaker labels, timestamps, and accuracy.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"A generated docx Markdown is not done until it has been ","type":"text","marks":[{"type":"strong"}]},{"text":"visually","type":"text","marks":[{"type":"strong"},{"type":"em"}]},{"text":" verified","type":"text","marks":[{"type":"strong"}]},{"text":" against the source (render to image, read it). Feishu-exported docx uses font-size+bold for headings rather than Word heading styles, so a \"no errors, word count matches\" check passes while the entire heading hierarchy is silently flat. Text-level checks cannot catch this.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Do not 死磕 (grind) on docx embedded-image download.","type":"text","marks":[{"type":"strong"}]},{"text":" lark-cli (through 1.0.32) cannot download ","type":"text"},{"text":"\u003cimage>","type":"text","marks":[{"type":"code_inline"}]},{"text":" tokens from a docx — exhaustively verified. 
Register the image tokens and note \"needs document owner to right-click → save\"; the text is the value, images are a tracked gap.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"HTTP 200 from anonymous curl ≠ accessible.","type":"text","marks":[{"type":"strong"}]},{"text":" A Feishu login wall returns 200 with a body containing ","type":"text"},{"text":"accounts.feishu.cn","type":"text","marks":[{"type":"code_inline"}]},{"text":" / ","type":"text"},{"text":"login","type":"text","marks":[{"type":"code_inline"}]},{"text":" / ","type":"text"},{"text":"passport","type":"text","marks":[{"type":"code_inline"}]},{"text":" / an empty ","type":"text"},{"text":"\u003ctitle>","type":"text","marks":[{"type":"code_inline"}]},{"text":". Check the body, never infer \"public\" from the status code.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"A file \"not found\" by a search agent is not authoritative.","type":"text","marks":[{"type":"strong"}]},{"text":" Verify against authoritative sources before concluding (this is general Inference Discipline; relevant when locating where ingested content already lives).","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"U+FFFD final check on every produced file:","type":"text","marks":[{"type":"strong"}]},{"text":" ","type":"text"},{"text":"LC_ALL=C grep -rl
$'\\xef\\xbf\\xbd' .","type":"text","marks":[{"type":"code_inline"}]},{"text":" must be empty. A replacement character means an encoding step corrupted the text.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Acceptance contract","type":"text"}]},{"type":"paragraph","content":[{"text":"Stop only when all that apply are true:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Every fetched body reached disk via ","type":"text"},{"text":"jq","type":"text","marks":[{"type":"code_inline"}]},{"text":"/script, not retyped by the model.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Collections: the residual rich-media-tag grep (Path A step 5) is empty — every ","type":"text"},{"text":"mention-doc","type":"text","marks":[{"type":"code_inline"}]},{"text":"/","type":"text"},{"text":"sheet","type":"text","marks":[{"type":"code_inline"}]},{"text":"/cross-tenant reference was followed to a leaf.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"LC_ALL=C grep -rl 
$'\\xef\\xbf\\xbd' .","type":"text","marks":[{"type":"code_inline"}]},{"text":" is empty.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"docx path: rendered to an image and visually compared to the source; heading hierarchy and highlights match (see docx reference's checklist).","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Browser fallback only: TOC coverage + scale check (see browser-failure-rules.md).","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Each output file's frontmatter records ","type":"text"},{"text":"source","type":"text","marks":[{"type":"code_inline"}]},{"text":" (the original URL/token) and, if any post-processing was applied, a ","type":"text"},{"text":"post_process","type":"text","marks":[{"type":"code_inline"}]},{"text":" provenance line.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Permission gaps (131006 docs not exported yet, undownloadable images) are explicitly listed for the user — a transparent gap beats a silent omission.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Do NOT attempt","type":"text"}]},{"type":"paragraph","content":[{"text":"Verified dead-ends — retrying them only wastes the session. Full table with failure modes and root causes: ","type":"text"},{"text":"references/permission-and-failure-boundaries.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/permission-and-failure-boundaries.md","title":null}},{"type":"strong"}]},{"text":". 
The top ones:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Bypassing ","type":"text"},{"text":"131006","type":"text","marks":[{"type":"code_inline"}]},{"text":" permission-denied by any means (lark-cli / curl / anonymous browser) — it is a server-side boundary.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Downloading docx embedded images via ","type":"text"},{"text":"docs +media-download","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"api …/drive/v1/medias/\u003ct>/download","type":"text","marks":[{"type":"code_inline"}]},{"text":" (with or without ","type":"text"},{"text":"extra","type":"text","marks":[{"type":"code_inline"}]},{"text":"), or ","type":"text"},{"text":"schema drive.medias.download","type":"text","marks":[{"type":"code_inline"}]},{"text":" — none work; lark-cli even mis-reports the real HTTP 400 as \"empty JSON\".","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"WebFetch","type":"text","marks":[{"type":"code_inline"}]},{"text":" against ","type":"text"},{"text":"open.feishu.cn/document/server-docs/...","type":"text","marks":[{"type":"code_inline"}]},{"text":" for API specs — backend is flaky; use ","type":"text"},{"text":"open.feishu.cn/llms-docs/zh-CN/llms-\u003cmodule>.txt","type":"text","marks":[{"type":"code_inline"}]},{"text":" instead (LLM-friendly, stable).","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"AppleScript/JXA ","type":"text"},{"text":"executeJavaScript","type":"text","marks":[{"type":"code_inline"}]},{"text":", Chrome CDP on port 9222 — disabled/empty in this environment (browser path only).","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Using ","type":"text"},{"text":"minimax-docx","type":"text","marks":[{"type":"code_inline"}]},{"text":" to 
convert docx→md — it is a docx ","type":"text"},{"text":"authoring","type":"text","marks":[{"type":"em"}]},{"text":" tool; use the doc-to-markdown skill instead.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Bundled resources","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"scripts/feishu_extract_refs.py","type":"text","marks":[{"type":"code_inline"}]},{"text":" — deterministic reference-token extractor; the recursion engine's core. Run it on every fetched body to enumerate ","type":"text"},{"text":"\u003cmention-doc>","type":"text","marks":[{"type":"code_inline"}]},{"text":"/","type":"text"},{"text":"\u003csheet>","type":"text","marks":[{"type":"code_inline"}]},{"text":"/","type":"text"},{"text":"\u003cimage>","type":"text","marks":[{"type":"code_inline"}]},{"text":"/cross-tenant/Minutes/Tencent-Meeting references as JSON.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"scripts/restore_docx_headings.py","type":"text","marks":[{"type":"code_inline"}]},{"text":" — for Path B: reads true font sizes via python-docx, maps them to heading levels, restores ","type":"text"},{"text":"w:shd","type":"text","marks":[{"type":"code_inline"}]},{"text":" highlights to Obsidian ","type":"text"},{"text":"==…==","type":"text","marks":[{"type":"code_inline"}]},{"text":", without retyping body text.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"scripts/feishu_dom_capture.js","type":"text","marks":[{"type":"code_inline"}]},{"text":" — Path D: injectable end-to-end browser DOM capture.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"scripts/download_feishu_images.py","type":"text","marks":[{"type":"code_inline"}]},{"text":" — Path D: SSR image extraction when browser automation is 
unavailable.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"scripts/build_feishu_markdown.py","type":"text","marks":[{"type":"code_inline"}]},{"text":" — Path D: render a capture manifest into Markdown.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"scripts/check_heading_coverage.py","type":"text","marks":[{"type":"code_inline"}]},{"text":" — coverage verification (both paths).","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"references/lark-cli-api-extraction.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" — Path A full reference (commands, recursion, sheets, cross-tenant).","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"references/feishu-minutes-transcript.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" — Path C native transcript API + scope auth.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"references/permission-and-failure-boundaries.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" — error codes + the full Do-NOT-attempt table.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"references/docx-export-to-markdown.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" — Path B faithful conversion procedure.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"references/browser-dom-fallback.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" + ","type":"text"},{"text":"references/browser-failure-rules.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" — Path D.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"references/capture-manifest.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" — manifest shape for 
","type":"text"},{"text":"build_feishu_markdown.py","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Next step","type":"text"}]},{"type":"paragraph","content":[{"text":"After extraction completes, the clean Markdown typically feeds the user's own knowledge-base ingestion (filing, indexing, dedup) — which is deliberately out of this skill's scope. If the source went through Path B (a docx), the doc-to-markdown skill is already part of that flow. Offer the handoff; do not auto-organize:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"Extraction complete: [N] sources → faithful Markdown ([M] permission/image gaps listed).\n\nOptions:\nA) Hand off to your PKM/organizing workflow — file & index these (Recommended if part of a vault)\nB) Run /daymade-docs:docs-cleaner — consolidate redundant content across the extracted files\nC) Stop here — the faithful Markdown is the deliverable","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"feishu-doc-scraper","author":"@skillopedia","source":{"stars":1137,"repo_name":"claude-code-skills","origin_url":"https://github.com/daymade/claude-code-skills/blob/HEAD/feishu-doc-scraper/SKILL.md","repo_owner":"daymade","body_sha256":"949d883a8aa28ef7c483cbb09e1e3e5e61b68f7d90af0c9c7b4b70cd73f3dbe3","cluster_key":"ce1997bad9ead3bb35f7b2782ae6c5a884ec7908ded3f3bff9abebbb9ccf3b32","clean_bundle":{"format":"clean-skill-bundle-v1","source":"daymade/claude-code-skills/feishu-doc-scraper/SKILL.md","attachments":[{"id":"4434a4c7-fa1c-5aed-889c-5aff0b50b242","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/4434a4c7-fa1c-5aed-889c-5aff0b50b242/attachment","path":".security-scan-passed","size":181,"sha256":"962126b31eb70da0d57b136a5db8df77257b0b4a2cded24ebc9244978109d361","contentType":"text/plain; 
charset=utf-8"},{"id":"f641088c-b320-5b72-b0b2-7670ccfe150e","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/f641088c-b320-5b72-b0b2-7670ccfe150e/attachment.md","path":"references/browser-dom-fallback.md","size":6369,"sha256":"bfe2d5dac83a40eb686558df20e98cd4d445ed6755b8ad32e808ac726e2cd552","contentType":"text/markdown; charset=utf-8"},{"id":"31497802-b945-5ed9-9175-3e6fe865c261","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/31497802-b945-5ed9-9175-3e6fe865c261/attachment.md","path":"references/browser-failure-rules.md","size":9607,"sha256":"0b50d7cca3231b332aa0c6ff819ddef74611c7b4a73fa6985804c350e4be8afe","contentType":"text/markdown; charset=utf-8"},{"id":"db813e54-7453-598e-83b7-687fe0a3cb34","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/db813e54-7453-598e-83b7-687fe0a3cb34/attachment.md","path":"references/capture-manifest.md","size":1231,"sha256":"fdb7f714282abdea01aeeed447d0a7a9926e389e796732f34bfe937446a1f99d","contentType":"text/markdown; charset=utf-8"},{"id":"10ac7240-7a01-541e-aa01-595a01c37071","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/10ac7240-7a01-541e-aa01-595a01c37071/attachment.md","path":"references/docx-export-to-markdown.md","size":5176,"sha256":"6a7ddecb6dc84fca93f5e369a2110412bafb5ef38e9b007bdd34207f8ea578a0","contentType":"text/markdown; charset=utf-8"},{"id":"dbaea441-81bc-5957-a994-e2c0e0c7994d","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/dbaea441-81bc-5957-a994-e2c0e0c7994d/attachment.md","path":"references/feishu-minutes-transcript.md","size":4062,"sha256":"bb056e99664b41f4e256ffe13a3fbe2200737f2f99e4d501b39ad6122de2ea31","contentType":"text/markdown; charset=utf-8"},{"id":"f6319fd2-d172-576d-9d06-4a7ef8e652b4","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/f6319fd2-d172-576d-9d06-4a7ef8e652b4/attachment.md","path":"references/lark-cli-api-extraction.md","size":10433,"sha256":"1677b2e1d05198741f41e0bdc1df83a06c50a85d7e416202fbce59d4900b9a73","contentType":"text/markdown; 
charset=utf-8"},{"id":"8f744072-6be7-55f4-8c83-25a3d63e82b5","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/8f744072-6be7-55f4-8c83-25a3d63e82b5/attachment.md","path":"references/permission-and-failure-boundaries.md","size":5982,"sha256":"be845c4c61da61a8bb49f2c9f5a5ba7b09600ff801e1618b63c0089412de5828","contentType":"text/markdown; charset=utf-8"},{"id":"a2a2eaea-c5fd-50d0-8739-f0b8ae623551","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/a2a2eaea-c5fd-50d0-8739-f0b8ae623551/attachment.py","path":"scripts/build_feishu_markdown.py","size":3067,"sha256":"1b033ba9a694ed597ace425e6bb070227069cc5bcb14111a5df3f5472da8bc13","contentType":"text/x-python; charset=utf-8"},{"id":"f0863449-298d-547c-8f29-c409be761190","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/f0863449-298d-547c-8f29-c409be761190/attachment.py","path":"scripts/check_heading_coverage.py","size":2998,"sha256":"cf31f811466b8027f473601ad6b9ca855ce8ae08e08dbff5e154838c44bdc77e","contentType":"text/x-python; charset=utf-8"},{"id":"999f15b5-3d4e-53d3-87ab-88295a59959a","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/999f15b5-3d4e-53d3-87ab-88295a59959a/attachment.py","path":"scripts/download_feishu_images.py","size":7894,"sha256":"b677deed1553514798dd4eb33cdf50602be9e96db16f85a2054b700bd9a2115c","contentType":"text/x-python; charset=utf-8"},{"id":"77618aa3-0499-5898-9522-d5d066c10486","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/77618aa3-0499-5898-9522-d5d066c10486/attachment.js","path":"scripts/feishu_dom_capture.js","size":14787,"sha256":"e04b8451e7760a0ea038cd042be6ed0d02a815515bbf08472aa0cffe2b82f74f","contentType":"application/javascript; charset=utf-8"},{"id":"65bd5b69-63bd-57cf-9c30-22484ea228ef","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/65bd5b69-63bd-57cf-9c30-22484ea228ef/attachment.py","path":"scripts/feishu_extract_refs.py","size":7695,"sha256":"d91c220ce4cccfd80b312a3f9d0d981cc3709a1935e94ccf5cf8f650ed768b04","contentType":"text/x-python; 
charset=utf-8"},{"id":"631f7e46-0da3-59ba-b1ba-ac9b9349a690","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/631f7e46-0da3-59ba-b1ba-ac9b9349a690/attachment.py","path":"scripts/restore_docx_headings.py","size":11982,"sha256":"7a6a4425741f46b7b734242bf43b3a046601d6c59e24f1e8cf3f992b7bc949a0","contentType":"text/x-python; charset=utf-8"}],"bundle_sha256":"187e1265a018b4da3518e31d0b0dab42c03cef8ca761875525f0a88e404c29ad","attachment_count":14,"text_attachments":13,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":1,"excluded_attachments":[]},"cluster_size":1,"skill_md_path":"feishu-doc-scraper/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"browser-automation-scraping","category_label":"Browser"},"exact_dupes_collapsed_into_this":0},"version":"v1","category":"browser-automation-scraping","import_tag":"clean-skills-v1","description":"Extract Feishu (Lark) Docs, Wiki pages, Wiki collections/hubs, spreadsheets, and Minutes (妙记) transcripts into clean high-fidelity local Markdown. The primary path is the lark-cli API — programmatic extraction with no LLM rewriting of the body — which recursively follows a collection's reference graph (mention-doc / sheet / cross-tenant links) and uses error codes to resolve permission boundaries precisely; a browser-DOM path is the fallback only when lark-cli cannot reach the content. Use this whenever the source is a Feishu/Lark URL and fidelity matters — including 导出飞书文档/合集/妙记转写, 把飞书 wiki/知识库转 markdown, scraping or archiving a Feishu collection, exporting a Feishu Minutes/妙记 transcript, or saving a Feishu page locally — even if the user only says clipping, archiving, converting, or \"save this\". 
Also covers the permission-denied path (owner-exported .docx → faithful Markdown with heading/highlight restoration).","argument-hint":["feishu-url-or-output-path"],"compatibility":"Primary path needs the `lark-cli` binary (npm `@larksuite/cli`, verified 1.0.32, 2026-05) authenticated to the target tenant. Fallback path needs a browser automation surface with an authenticated session (Chrome DevTools MCP / Browser Use / Computer Use). docx path needs `python-docx` and a docx→md converter (the bundled doc-to-markdown skill or pandoc)."}},"renderedAt":1782987679176}
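Two of the hard rules above are mechanical enough to sketch in code: the login-wall check (an HTTP 200 whose body contains `accounts.feishu.cn`, `login`, `passport`, or an empty `<title>`) and the final U+FFFD scan over every produced file. The following is a minimal Python sketch under stated assumptions, not part of the bundled scripts: the function names `looks_like_login_wall` and `files_with_replacement_char` are hypothetical, and matching the markers as plain case-insensitive substrings is a simplification that can over-match on public pages that merely mention "login".

```python
import re
from pathlib import Path

# Markers the text lists for a Feishu login wall. Treating them as plain
# substrings is an assumption of this sketch and may produce false positives.
LOGIN_WALL_MARKERS = ("accounts.feishu.cn", "login", "passport")
EMPTY_TITLE = re.compile(r"<title>\s*</title>", re.IGNORECASE)

# UTF-8 byte sequence for U+FFFD, the same bytes the grep check looks for.
REPLACEMENT_CHAR_UTF8 = b"\xef\xbf\xbd"


def looks_like_login_wall(body: str) -> bool:
    """Return True if a response body shows any login-wall marker."""
    lowered = body.lower()
    if any(marker in lowered for marker in LOGIN_WALL_MARKERS):
        return True
    return EMPTY_TITLE.search(body) is not None


def files_with_replacement_char(root: str) -> list[str]:
    """List files under root whose raw bytes contain U+FFFD.

    An empty result corresponds to the grep check printing nothing.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and REPLACEMENT_CHAR_UTF8 in path.read_bytes():
            hits.append(str(path))
    return hits
```

Reading raw bytes instead of decoded text mirrors the `LC_ALL=C` byte-level intent of the grep command, so the scan does not depend on the locale or on the file decoding cleanly.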