llm-wiki-source-extraction-coverage

llm-wiki source extraction coverage

Silent extraction failure is the highest-frequency wiki defect today. When `pdftotext` returns 0 chars on an image-PDF, when `python-docx` skips embedded objects, when `openpyxl` ignores hidden sheets, the resulting wiki page looks complete but isn't. This skill closes the gap with two measurements per extraction and an anchor format that ties every wiki claim back to the source.

The pattern: predict coverage, measure coverage, inventory the gap, anchor every claim.

---

When this skill applies

| Trigger | Action |
|---|---|
| Ingesting PDF / DOCX / XLSX / PPTX / HTML / image-only source via `llm_wiki.py ingest` | APPLY…

\\f')\nestimate=$(echo \"scale=2; $text_pages / $total_pages\" | bc)\necho \"extraction_estimate: $estimate\"\n```\n\nSet `extraction_estimate` based on:\n\n| Indicator | Estimate baseline |\n|---|---|\n| All pages have embedded fonts, `pdftotext` recovers text on all pages | 0.95 |\n| Mixed text + image pages | (text_pages / total_pages); typically 0.5–0.85 |\n| No embedded fonts, all images | 0.0 baseline (OCR can lift to 0.8–0.95) → route to `scanned-pdf-ocr-fallback.md` |\n| Encrypted | 0.0 → defer until decrypted |\n| Damaged page tree | 0.0 → defer; flag for source replacement |\n\n## Primary extractor: `pdftotext -layout`\n\n```bash\npdftotext -layout -nopgbrk \u003csource.pdf> \u003coutput.txt>\n```\n\n`-layout` preserves columns; `-nopgbrk` removes form-feed delimiters\n(use page boundary detection in post-processing instead). Omit\n`-nopgbrk` for any run whose output feeds the form-feed-based yield\ncount in \"Post-extraction yield measurement\".\n\n## Fallback: PyMuPDF (`fitz`)\n\nWhen `pdftotext` returns 0 chars on a specific page that should be text:\n\n```python\nimport fitz  # PyMuPDF\ndoc = fitz.open(\"\u003csource.pdf>\")\nfor page_num, page in enumerate(doc):\n    text = page.get_text()\n    if not text.strip():\n        # try the block-level extractor; get_text(\"rawdict\") returns a dict, not a string\n        blocks = page.get_text(\"blocks\")\n        text = \"\\n\".join(b[4] for b in blocks if b[6] == 0)  # b[4] = text, b[6] == 0 = text block\n```\n\nPer `feedback_pdf_ocr_fallback_chain`: pdftotext + PyMuPDF both returning\n0 chars on a page is the signal to route that page to OCR.\n\n## Post-extraction yield measurement\n\n```bash\n# Recovered pages = pages with > N chars of extracted text\n# Needs form-feed page breaks: run pdftotext without -nopgbrk for this count\nthreshold=200\nrecovered_pages=$(awk 'BEGIN{count=0} /^\\f/{if(curr>'$threshold')count++; curr=length($0)-1; next} {curr+=length($0)} END{print count}' \u003coutput.txt>)\nyield=$(echo \"scale=2; $recovered_pages / $total_pages\" | bc)\necho \"extraction_yield: $yield\"\n```\n\n## Anchor format for cites\n\n`[[sources/\u003cslug>]]:p\u003cpage>:¶\u003cparagraph-index>`\n\nParagraph index is 1-based within the page, counted after post-processing\nsplits the `pdftotext -layout` output on blank lines.\n\n## Spot-check\n\nOpen the source PDF in a viewer. 
Verify the extracted text matches the\nvisible content on 5–10 random pages. Flag mismatches in\n`extraction_yield_lost`.\n\n## Common pitfalls\n\n- Two-column layouts: `-layout` mostly handles, but verify\n- Mathematical content: equations render as character salad; transcribe\n to KaTeX manually for any equation cited in wiki content\n- Tables: `pdftotext` flattens; use `tabula-py` or `camelot` for\n structured table extraction when needed\n- Footnotes interleave with body text in layout mode; verify\n- Headers/footers repeat on every page; strip in post-processing\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":3094,"content_sha256":"4757b7b64785d59842b419ec4831900fff9e02242c416d95547dad6ef6e2236d"},{"filename":"references/scanned-pdf-ocr-fallback.md","content":"# Scanned PDF — OCR fallback chain\n\nFor image-only PDFs (scanned documents) and image files (PNG / JPG)\nthat carry text. Codifies the existing\n`feedback_pdf_ocr_fallback_chain` rule.\n\n## When to route here\n\nThe PDF protocol routes here when `pdftotext` AND `PyMuPDF.get_text()`\nboth return 0 chars on pages that visibly contain text.\n\nSignals:\n```bash\npdftotext \u003csource.pdf> - | wc -c # near-zero output\npdffonts \u003csource.pdf> # no embedded fonts\n```\n\n## Pre-extraction estimate\n\nFor scanned PDFs:\n\n| Page-image quality | Estimate |\n|---|---|\n| 300 DPI, clean scan, standard font | 0.90 |\n| 200 DPI or below | 0.70 |\n| Rotated / skewed pages | 0.60 (need pre-rotation) |\n| Faxed / heavily degraded | 0.40 |\n| Handwritten | 0.10 (tesseract is poor at handwriting; consider manual transcription) |\n\nFor images:\n\n| Source | Estimate |\n|---|---|\n| Screenshot of clean text | 0.95 |\n| Photo of printed document | 0.80 |\n| Photo with shadows / perspective | 0.60 |\n| Whiteboard / handwriting | 0.20 |\n\n## Extraction: `tesseract` via PyMuPDF render\n\n```python\nimport fitz\nimport subprocess\nimport tempfile\nfrom pathlib import Path\n\ndoc = 
fitz.open(\"\u003csource.pdf>\")\nextracted = []\nfor page_num, page in enumerate(doc):\n    # Render at 300 DPI\n    pix = page.get_pixmap(dpi=300)\n    with tempfile.NamedTemporaryFile(suffix=\".png\", delete=False) as tmp:\n        png_path = Path(tmp.name)\n    pix.save(str(png_path))  # write after the temp handle is closed\n    # OCR\n    result = subprocess.run(\n        [\"tesseract\", str(png_path), \"-\", \"--psm\", \"6\", \"-l\", \"eng\"],\n        capture_output=True, text=True\n    )\n    png_path.unlink()  # remove the temp image once it has been OCR'd\n    extracted.append((page_num, result.stdout))\n```\n\n`--psm 6` = \"Assume a single uniform block of text\". Other PSM modes:\n- `--psm 3` = fully automatic page segmentation without OSD (default, but slower)\n- `--psm 11` = sparse text (useful for forms / labels)\n- `--psm 12` = sparse text with OSD (orientation + script detection)\n\nFor images directly:\n```bash\ntesseract \u003csource.png> \u003coutput> --psm 6 -l eng\n```\n\n## Layout-preserving OCR\n\nFor multi-column scanned papers, single-column `--psm 6` interleaves\ncolumns. Use:\n\n```bash\ntesseract \u003csource.png> \u003coutput> --psm 3 -c preserve_interword_spaces=1\n```\n\nOr pre-segment columns with `pdf2image` + `opencv` before OCR.\n\n## Post-extraction yield\n\nTwo-pass measurement:\n\n1. **Coverage**: how many pages produced ≥ 200 chars of OCR output?\n2. **Quality spot-check**: on 5 random pages, manually compare 100-char\n   samples against the source image. Estimate per-character accuracy.\n\n```python\ntotal_pages = len(doc)\nrecovered = sum(1 for _, text in extracted if len(text.strip()) >= 200)\ncoverage = recovered / total_pages\n\n# Quality estimate is human-in-the-loop; record it in extraction_yield_lost\n```\n\nCombined yield: `coverage × quality_estimate`, capped at 0.95\n(OCR is never perfect; reserve ≥ 0.05 for known errors).\n\n## Anchor format\n\n`[[sources/\u003cslug>]]:p\u003cpage>:OCR`\n\nThe `:OCR` suffix is the trust signal: readers/reviewers know to\nverify against the source image for any cited value. 
Compare to\n`:p\u003cpage>:¶\u003cparagraph>` for text-extracted PDFs, where trust is high.\n\n## Spot-check protocol\n\nFor every standards page or methodology page that cites OCR'd content:\n\n1. Open the source PDF at the cited page\n2. Locate the cited passage visually\n3. Verify the wiki claim matches the source within OCR tolerance\n4. If mismatch: file an audit per `research/llm-wiki-audit-feedback-loop`\n with the actual source text, route to revise\n\n## Common pitfalls\n\n- **Skewed scans**: OCR accuracy drops 30%+ on >2° skew; pre-rotate with\n `deskew` or `opencv.minAreaRect` before tesseract\n- **Marginalia / page numbers**: OCR includes them as text; strip in\n post-processing\n- **Footnotes interleaved with body**: tesseract is page-flat; use\n layout analysis (`tesseract --tessdata-dir \u003cdir> --psm 1`) to\n segment\n- **Multi-language documents**: pass `-l eng+deu` etc. but accuracy drops\n- **Equations / formulas**: tesseract is poor; transcribe to KaTeX\n manually; flag in `extraction_yield_lost`\n- **Tables**: OCR flattens; use `tabula-java` or `camelot` (which itself\n requires text-based PDF) or manual transcription\n- **Stamps / handwriting / annotations**: tesseract treats them as noise;\n flag in `extraction_yield_lost` if material\n\n## Quality upgrade chain\n\nIf yield is \u003c 0.80 and the source is critical:\n\n1. **Re-scan at higher DPI** if the source is physical\n2. **Deskew + denoise** with `opencv` or `imagemagick -despeckle`\n3. **Train a tesseract model** on the document's font (advanced)\n4. 
**Manual transcription** for the lost passages, flagged in the\n   summary page\n\nRecord every upgrade attempt in `extraction_yield_lost` with date and\nnew yield.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":4665,"content_sha256":"784869c169196b6a6f2a5ee7c3d17c25b13569e2250b4c31d093a32e94e663eb"},{"filename":"references/xlsx-extraction.md","content":"# XLSX extraction: coverage protocol\n\n## Pre-extraction estimate\n\n```bash\nunzip -l \u003csource.xlsx> | grep \"xl/worksheets/\" | wc -l  # sheet count\nunzip -p \u003csource.xlsx> xl/workbook.xml | head -50  # sheet metadata\n```\n\nInspect for:\n- Hidden sheets (`state=\"hidden\"` in `xl/workbook.xml`)\n- VBA macros (`xl/vbaProject.bin`): typically contain calc logic, not data\n- Embedded objects (`xl/embeddings/`): sub-files needing separate extraction\n- External links (`xl/externalLinks/`): sheets referencing other workbooks\n\n`extraction_estimate` baseline:\n\n| Indicator | Estimate |\n|---|---|\n| Standard data sheets, no macros, no hidden content | 0.98 |\n| Hidden sheets with material data | (visible_count + hidden_count) / total |\n| Heavy macro logic; data derived at runtime | 0.50 (need macro execution context) |\n| External links to unreachable workbooks | 0.30 (broken refs) |\n| Password-protected | 0.0 |\n\n## Primary extractor: `openpyxl`\n\n```python\nimport openpyxl\nwb = openpyxl.load_workbook(\"\u003csource.xlsx>\", data_only=True)  # data_only=True returns formula values, not formulas\nfor sheet_name in wb.sheetnames:\n    ws = wb[sheet_name]\n    for row in ws.iter_rows(values_only=True):\n        if any(cell is not None for cell in row):\n            print(row)\n```\n\n`data_only=True` is critical: it returns the **last-saved cached value** of\neach formula, not the formula itself. 
For formula text, use `data_only=False`.\n\nLimitations:\n- Hidden rows / columns: `openpyxl` does not skip them by default; check\n  `ws.row_dimensions[i].hidden` and `ws.column_dimensions[col].hidden`\n- Conditional-format-styled cells: styling lost\n- Pivot tables: extract the source data, not the pivot itself\n- Charts: not extractable as data; OCR if needed\n\n## Fallback: `pandas.read_excel`\n\n```python\nimport pandas as pd\nxl = pd.ExcelFile(\"\u003csource.xlsx>\")\nfor sheet_name in xl.sheet_names:\n    df = pd.read_excel(xl, sheet_name=sheet_name)\n    print(f\"{sheet_name}: {len(df)} rows\")\n```\n\nPandas is more convenient for large tabular sheets (it reads `.xlsx`\nthrough `openpyxl` by default) but less precise on cell-by-cell metadata.\nUse it for bulk extraction; use openpyxl for selective/structural work.\n\n## Post-extraction yield\n\nCount addressable cells:\n\n```python\ntotal_cells = sum(\n    ws.max_row * ws.max_column\n    for ws in wb.worksheets\n    if ws.sheet_state == \"visible\"  # exclude hidden sheets from baseline unless they carry data\n)\nextracted_cells = sum(\n    1 for ws in wb.worksheets if ws.sheet_state == \"visible\"\n    for row in ws.iter_rows(values_only=True)\n    for cell in row\n    if cell is not None\n)\n# Guard against a workbook with no visible cells\nyield_ = extracted_cells / total_cells if total_cells else 0.0\n```\n\n## Anchor format\n\n`[[sources/\u003cslug>]]:\u003csheet>!\u003ccell-range>`\n\nExamples:\n- `[[sources/mooring-results-export]]:Lines!C12`\n- `[[sources/mooring-results-export]]:Lines!C12:F12` (range)\n- `[[sources/mooring-results-export]]:Summary!B5`\n\nSheet names with spaces: quote per Excel convention, e.g. `'Summary By Line'!C12`.\n\n## Spot-check\n\nOpen the XLSX in Excel/LibreOffice. Verify 5–10 random cells against\nextracted values. 
Pay special attention to:\n- Formula cells: did `data_only=True` return the cached value correctly?\n- Date cells: ISO format vs Excel serial number\n- Cells with custom number formats: extracted as raw value, not formatted display\n\n## Common pitfalls\n\n- `data_only=True` returns `None` if the workbook was never opened in Excel\n after the formula was authored (no cached value). Force a recalc by\n opening in LibreOffice headless: `soffice --headless --calc --convert-to xlsx \u003csource.xlsx>`\n- Merged cells: only the top-left cell carries the value; other cells in\n the merge are empty\n- Frozen panes / split views: structural metadata, not data — ignore\n- Defined names: `wb.defined_names` carries them; can be cited as anchors\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":3736,"content_sha256":"30bd551fa2225f16488cab04b2529bb7983106129db9bb0ea8121b52826369f2"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"llm-wiki source extraction coverage","type":"text"}]},{"type":"paragraph","content":[{"text":"Silent extraction failure is the highest-frequency wiki defect today. When ","type":"text"},{"text":"pdftotext","type":"text","marks":[{"type":"code_inline"}]},{"text":" returns 0 chars on an image-PDF, when ","type":"text"},{"text":"python-docx","type":"text","marks":[{"type":"code_inline"}]},{"text":" skips embedded objects, when ","type":"text"},{"text":"openpyxl","type":"text","marks":[{"type":"code_inline"}]},{"text":" ignores hidden sheets — the resulting wiki page looks complete but isn't. 
This skill closes the gap with two measurements per extraction and an anchor format for every wiki claim back to the source.","type":"text"}]},{"type":"paragraph","content":[{"text":"The pattern: ","type":"text"},{"text":"predict coverage, measure coverage, inventory the gap, anchor every claim.","type":"text","marks":[{"type":"strong"}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"When this skill applies","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Trigger","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Action","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Ingesting PDF / DOCX / XLSX / PPTX / HTML / image-only source via ","type":"text"},{"text":"llm_wiki.py ingest","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"APPLY","type":"text","marks":[{"type":"strong"}]},{"text":" — pre + post extraction metrics required","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Re-ingesting a source whose previous yield was \u003c 1.0","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"APPLY","type":"text","marks":[{"type":"strong"}]},{"text":" — record the upgrade 
attempt","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Plain-text Markdown source (no binary processing)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Skip extraction metrics; yield = 1.0 implicit","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Web page already in clean HTML","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Light-touch; record ","type":"text"},{"text":"extraction_yield","type":"text","marks":[{"type":"code_inline"}]},{"text":" only if non-trivial processing applied","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"LinkedIn post / blog post via WebFetch","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Skip extraction metrics; the conversion is the WebFetch model's job per ","type":"text"},{"text":"feedback_webfetch_first_for_linkedin","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Hand-typed notes captured as markdown","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Skip; yield = 1.0","type":"text"}]}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"The two 
metrics","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"extraction_estimate","type":"text","marks":[{"type":"code_inline"}]},{"text":" (pre-extraction)","type":"text"}]},{"type":"paragraph","content":[{"text":"Predicted upper-bound of what we expect to recover, ","type":"text"},{"text":"before","type":"text","marks":[{"type":"strong"}]},{"text":" running the extractor. Computed from cheap structural inspection of the source.","type":"text"}]},{"type":"paragraph","content":[{"text":"Frontmatter field on the resulting ","type":"text"},{"text":"sources/\u003cslug>.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" page:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"yaml"},"content":[{"text":"extraction_estimate: 0.80\nextraction_estimate_rationale: |\n PDF has 100 pages: 80 text-based (`pdffonts` shows embedded fonts), 20\n image-only (no embedded fonts, pixel-density consistent with scan). OCR\n fallback could lift toward 0.95 but baseline text-only is 0.80.","type":"text"}]},{"type":"paragraph","content":[{"text":"Range: ","type":"text"},{"text":"0.0","type":"text","marks":[{"type":"code_inline"}]},{"text":" to ","type":"text"},{"text":"1.0","type":"text","marks":[{"type":"code_inline"}]},{"text":". 
A ","type":"text"},{"text":"0.0","type":"text","marks":[{"type":"code_inline"}]},{"text":" estimate means \"this source is unreadable by current tooling without manual transcription\".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"extraction_yield","type":"text","marks":[{"type":"code_inline"}]},{"text":" (post-extraction)","type":"text"}]},{"type":"paragraph","content":[{"text":"Actual measured fraction recovered, ","type":"text"},{"text":"after","type":"text","marks":[{"type":"strong"}]},{"text":" running the extractor and inspecting output.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"yaml"},"content":[{"text":"extraction_yield: 0.94\nextraction_yield_method: pdftotext+OCR # how it was measured\nextraction_yield_lost: |\n Page 47 OCR garbled — table 4-2 numeric values unreadable.\n Page 73 figure caption truncated mid-sentence.\n Pages 91-95 dense math, KaTeX transcription deferred.","type":"text"}]},{"type":"paragraph","content":[{"text":"Range: ","type":"text"},{"text":"0.0","type":"text","marks":[{"type":"code_inline"}]},{"text":" to ","type":"text"},{"text":"1.0","type":"text","marks":[{"type":"code_inline"}]},{"text":". ","type":"text"},{"text":"If ","type":"text","marks":[{"type":"strong"}]},{"text":"yield \u003c estimate","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":", the lost-content inventory is required","type":"text","marks":[{"type":"strong"}]},{"text":" — a bullet list of what was lost.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Measurement protocol per format","type":"text"}]},{"type":"paragraph","content":[{"text":"See format-specific references for exact commands. 
Common shape:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Count addressable units (pages / paragraphs / cells / slides)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Run extractor; count units with usable output","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"yield = units_recovered / units_total","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Spot-check 5–10 random units against the source visually","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"List units that failed the spot-check in ","type":"text"},{"text":"extraction_yield_lost","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Format-specific routing","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Format","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Primary extractor","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Fallback chain","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Anchor 
format","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Reference","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"PDF (text-based)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"pdftotext -layout","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"PyMuPDF","type":"text","marks":[{"type":"code_inline"}]},{"text":" (","type":"text"},{"text":"fitz","type":"text","marks":[{"type":"code_inline"}]},{"text":") → manual","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\u003cslug>:p\u003cpage>:¶\u003cparagraph>","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/pdf-extraction.md","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"PDF (scanned image)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"PyMuPDF","type":"text","marks":[{"type":"code_inline"}]},{"text":" render @ 300 DPI → ","type":"text"},{"text":"tesseract --psm 6","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"manual 
transcription","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\u003cslug>:p\u003cpage>:OCR","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/scanned-pdf-ocr-fallback.md","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"DOCX","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"python-docx","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"pandoc -f docx -t markdown","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\u003cslug>:¶\u003cparagraph-id>","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/docx-extraction.md","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"XLSX","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"openpyxl","type":"text","marks":[{"type":"code_inline"}]},{"text":" (visible 
cells)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"pandas.read_excel","type":"text","marks":[{"type":"code_inline"}]},{"text":" (per sheet)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\u003cslug>:\u003csheet>!\u003ccell>","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/xlsx-extraction.md","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"PPTX","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"python-pptx","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"pandoc -f pptx -t markdown","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\u003cslug>:slide\u003cN>","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"(extend 
","type":"text"},{"text":"docx-extraction.md","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"HTML","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"trafilatura","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"BeautifulSoup","type":"text","marks":[{"type":"code_inline"}]},{"text":" + ","type":"text"},{"text":"readability-lxml","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\u003cslug>#\u003cheading-slug>","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/html-extraction.md","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Plain text / Markdown","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"cat","type":"text","marks":[{"type":"code_inline"}]},{"text":" (yield = 
1.0)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"n/a","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\u003cslug>:¶\u003cparagraph>","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"n/a","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Image (PNG / JPG)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"tesseract --psm 6","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"manual transcription","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\u003cslug>:OCR","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"(extend ","type":"text"},{"text":"scanned-pdf-ocr-fallback.md","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]}]}]},{"type":"paragraph","content":[{"text":"Existing implementation references:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"feedback_pdf_ocr_fallback_chain","type":"text","marks":[{"type":"code_inline"}]},{"text":" codifies the pdftotext → PyMuPDF → tesseract 
chain","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"productivity/ocr-and-documents","type":"text","marks":[{"type":"code_inline"}]},{"text":" is the existing OCR skill (this skill cites, does not duplicate)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"engineering/doc-extraction","type":"text","marks":[{"type":"code_inline"}]},{"text":" is the engineering-specific extraction skill (used for technical PDFs)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"data/document-index-pipeline","type":"text","marks":[{"type":"code_inline"}]},{"text":" is the upstream ingestion pipeline that calls this skill","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Source-anchor traceability","type":"text"}]},{"type":"paragraph","content":[{"text":"Every claim on a compiled wiki page (","type":"text"},{"text":"concepts/","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"standards/","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"methodology/","type":"text","marks":[{"type":"code_inline"}]},{"text":") that derives from an extracted source must cite a ","type":"text"},{"text":"precise location","type":"text","marks":[{"type":"strong"}]},{"text":" in the source. 
This is the revisability contract: a future reviewer can locate the original passage and verify or revise.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Anchor formats by source type","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Source type","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Anchor format","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Example","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"PDF","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"[[sources/\u003cslug>]]:p\u003cpage>:¶\u003cpara-index>","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"[[sources/dnv-os-e301-2023]]:p47:¶2","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"PDF OCR","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"[[sources/\u003cslug>]]:p\u003cpage>:OCR","type":"text","marks":[{"type":"code_inline"}]},{"text":" (note: lower 
confidence)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"[[sources/api-rp-2sk-2008]]:p23:OCR","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"DOCX","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"[[sources/\u003cslug>]]:¶\u003cparagraph-id>","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"[[sources/project-basis-of-design]]:¶47","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"XLSX","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"[[sources/\u003cslug>]]:\u003csheet>!\u003ccell-range>","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"[[sources/mooring-results-export]]:Lines!C12:F12","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"PPTX","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"[[sources/\u003cslug>]]
:slide\u003cN>:\u003celement>","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"[[sources/conference-2024-paper]]:slide12:figure","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"HTML","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"[[sources/\u003cslug>]]#\u003cheading-slug>","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"[[sources/blog-post-yaw-moments]]#stability-analysis","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Plain text","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"[[sources/\u003cslug>]]:¶\u003cparagraph>","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"[[sources/handoff-2026-05-20]]:¶3","type":"text","marks":[{"type":"code_inline"}]}]}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Anchor placement in compiled pages","type":"text"}]},{"type":"paragraph","content":[{"text":"In a ","type":"text"},{"text":"concepts/","type":"text","marks":[{"type":"code_inline"}]},{"text":" or ","type":"text"},{"text":"standards/","type":"text","marks":[{"type":"code_inline"}]},{"text":" page, anchors go at the ","type":"text"},{"text":"end of the 
sentence","type":"text","marks":[{"type":"strong"}]},{"text":" they support, in parentheses:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"markdown"},"content":[{"text":"The DNV-OS-E301 safety factor for ULS mooring conditions is **1.5**\n([[sources/dnv-os-e301-2023]]:p47:¶2), reduced from the 2018 edition's\n1.67 ([[sources/dnv-os-e301-2018]]:p41:¶3).","type":"text"}]},{"type":"paragraph","content":[{"text":"Multiple-source claims chain anchors:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"markdown"},"content":[{"text":"Empirical yield in deepwater mooring failures clusters around 14% of\nnameplate MBL ([[sources/sintef-2019-mooring-survey]]:p12:Table-3;\n[[sources/api-bulletin-2tl]]:p8:¶4).","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Anti-patterns","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Citing ","type":"text"},{"text":"[[sources/\u003cslug>]]","type":"text","marks":[{"type":"code_inline"}]},{"text":" without a sub-anchor → reviewer can't locate","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Anchor pointing at a page that lacks the claim (cut-and-paste error)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Anchor in a section the extraction yield report flagged as lost","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Using anchor format for one source type on another (e.g., ","type":"text"},{"text":":p47","type":"text","marks":[{"type":"code_inline"}]},{"text":" on a DOCX)","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Pre-extraction protocol","type":"text"}]},{"type":"paragraph","content":[{"text":"Run ","type":"text"},{"text":"before","type":"text","marks":[{"type":"strong"}]},{"text":" copying the 
binary into ","type":"text"},{"text":"wikis/\u003cdomain>/sources/","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# 1. Identify format\nfile \u003csource-path>\n\n# 2. Cheap structural inspection (format-specific):\n# PDF\npdfinfo \u003csource-path> # pages, encrypted, etc.\npdffonts \u003csource-path> | head # text-based vs scanned\n# DOCX\nunzip -l \u003csource-path> | head # embedded objects, images\n# XLSX\nunzip -l \u003csource-path> | grep sheet # sheet count\n\n# 3. Compute extraction_estimate (see format references for exact heuristics)\n# 4. Record estimate in the source page frontmatter BEFORE extraction","type":"text"}]},{"type":"paragraph","content":[{"text":"For binaries >10 MB, do ","type":"text"},{"text":"not","type":"text","marks":[{"type":"strong"}]},{"text":" copy into the wiki — create a ref pointer per ","type":"text"},{"text":"llm-wiki-page-shape-contract","type":"text","marks":[{"type":"code_inline"}]},{"text":" Rule 3:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"yaml"},"content":[{"text":"---\ntitle: refs/\u003cslug>\ntype: ref\nexternal_path: /mnt/ace/\u003crepo>/data/\u003cfile>.pdf\nsize: ~140 MB\nextraction_estimate: 0.80 # set even on ref pages\nextraction_yield: null # filled in after compiled pages cite this ref\n---","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Post-extraction protocol","type":"text"}]},{"type":"paragraph","content":[{"text":"After running the extractor:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Measure","type":"text","marks":[{"type":"strong"}]},{"text":": count addressable units recovered vs total (see format references for exact 
commands).","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Spot-check","type":"text","marks":[{"type":"strong"}]},{"text":": compare 5–10 random samples against the source, visually or programmatically.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Inventory loss","type":"text","marks":[{"type":"strong"}]},{"text":": list every unit that didn't extract cleanly, with its page/paragraph/cell anchor and a one-line reason.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Decide","type":"text","marks":[{"type":"strong"}]},{"text":": is the yield enough to proceed?","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Yield ≥ 0.90 AND no critical content lost → proceed to compile","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Yield ≥ 0.50 and \u003c 0.90 → proceed but file an audit per ","type":"text"},{"text":"research/llm-wiki-audit-feedback-loop","type":"text","marks":[{"type":"code_inline"}]},{"text":" with the loss inventory, so future passes know what to revisit","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Yield \u003c 0.50 → defer ingest; the source is not extractable enough to be useful. 
Note this in ","type":"text"},{"text":"wikis/\u003cdomain>/CLAUDE.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" \"Open research questions\" with the path and the failed yield.","type":"text"}]}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Write frontmatter","type":"text","marks":[{"type":"strong"}]},{"text":": ","type":"text"},{"text":"extraction_yield","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"extraction_yield_method","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"extraction_yield_lost","type":"text","marks":[{"type":"code_inline"}]},{"text":" go on the ","type":"text"},{"text":"sources/\u003cslug>.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" page.","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Frontmatter required on ","type":"text"},{"text":"sources/\u003cslug>.md","type":"text","marks":[{"type":"code_inline"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"yaml"},"content":[{"text":"---\ntitle: sources/\u003cslug>\ntype: source # input layer per page-shape Rule 7\nsource_format: pdf | docx | xlsx | pptx | html | image | text\nsource_url: https://... 
# if applicable\nsource_path: /path/to/local/copy.pdf # if binary copied in\nexternal_path: /mnt/ace/\u003crepo>/\u003cfile> # if ref pointer (>10 MB)\ndate: YYYY-MM-DD # original publication\ningested: YYYY-MM-DD # when extracted into wiki\n\n# Extraction coverage (this skill's required fields)\nextraction_estimate: 0.80\nextraction_estimate_rationale: |\n \u003cone-paragraph reason — what's recoverable, what isn't, why>\nextraction_yield: 0.94\nextraction_yield_method: pdftotext+OCR\nextraction_yield_lost: |\n - Page 47: OCR garbled, table 4-2 numerics unreadable\n - Page 73: figure caption truncated mid-sentence\n - Pages 91–95: dense KaTeX, transcription deferred\n\n# Wiki-shape contract fields\nsources: [] # this IS a source; empty for source pages\ntags: [\u003ctag>]\nlicense: \u003clicense-shorthand>\n---","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Decision tree per source","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"new source arrives\n │\n ├── plain text / Markdown ─────► yield = 1.0, no anchors needed beyond ¶\n │\n ├── HTML (clean) ──────────────► trafilatura; anchor by heading\n │\n ├── HTML (messy) ──────────────► trafilatura → BeautifulSoup fallback\n │\n ├── PDF\n │ ├── text-based ──────────► pdftotext -layout; anchor :p\u003cpage>:¶\n │ ├── mixed ───────────────► pdftotext + PyMuPDF where pdftotext = 0 chars\n │ └── scanned ─────────────► PyMuPDF render 300 DPI → tesseract --psm 6\n │\n ├── DOCX ──────────────────────► python-docx; fallback pandoc; anchor by ¶ id\n │\n ├── XLSX ──────────────────────► openpyxl per visible cell; anchor :\u003csheet>!\u003ccell>\n │\n ├── PPTX ──────────────────────► python-pptx; anchor :slide\u003cN>\n │\n └── image ─────────────────────► tesseract --psm 6; anchor :OCR","type":"text"}]},{"type":"paragraph","content":[{"text":"At every leaf: compute estimate before, yield after, write the inventory 
if yield \u003c estimate, record anchor format for downstream cites.","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Anti-patterns","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Writing a compiled page from a source whose ","type":"text"},{"text":"extraction_yield","type":"text","marks":[{"type":"code_inline"}]},{"text":" was never recorded — invisible failure surface","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Yield = 1.0 claimed without spot-checking — overclaim","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Yield \u003c 0.50 ingested anyway — pollutes the corpus","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Citing extracted content from a page whose lost-content inventory flagged that exact page → use the audit-feedback-loop to revise","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Reusing one source's anchor format on a different format (","type":"text"},{"text":":p47","type":"text","marks":[{"type":"code_inline"}]},{"text":" on a DOCX makes no sense)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Storing the binary in the wiki when it's >10 MB instead of using a ref pointer per ","type":"text"},{"text":"llm-wiki-page-shape-contract","type":"text","marks":[{"type":"code_inline"}]},{"text":" Rule 3","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Re-extracting a source repeatedly without recording the attempted yield upgrades — wastes compute, loses learning","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"What this skill is 
NOT","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Not a replacement for ","type":"text"},{"text":"productivity/ocr-and-documents","type":"text","marks":[{"type":"code_inline"}]},{"text":" — that skill owns the OCR tooling specifics; this skill calls into it","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Not a replacement for ","type":"text"},{"text":"engineering/doc-extraction","type":"text","marks":[{"type":"code_inline"}]},{"text":" — that's the engineering-domain extraction skill; this skill is the wiki-side contract for ","type":"text"},{"text":"recording","type":"text","marks":[{"type":"em"}]},{"text":" extraction quality","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Not a full RAG-replacement extractor — the extractor is the tool; this skill is the measurement and anchor contract","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Not for sources that are already plain text or clean HTML — those don't need pre/post metrics","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Related must-fire rules","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"feedback_pdf_ocr_fallback_chain","type":"text","marks":[{"type":"code_inline"}]},{"text":" — pdftotext+PyMuPDF=0 chars → image-PDF; fall back PyMuPDF 300 DPI → tesseract --psm 6","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"feedback_runtime_base64_blocks_binary_roundtrip","type":"text","marks":[{"type":"code_inline"}]},{"text":" — JS tool results blocked binary; download path or save_to_disk for binary 
capture","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"feedback_naive_secret_scan_false_positive_cascade","type":"text","marks":[{"type":"code_inline"}]},{"text":" — extracted content can contain false-positive regex matches; trust the hardened pre-commit hook","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"feedback_subagent_write_phantom","type":"text","marks":[{"type":"code_inline"}]},{"text":" — if a subagent runs the extractor, main session must verify ","type":"text"},{"text":"sources/\u003cslug>.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" actually landed on disk","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"feedback_silent_verdict_flip_defect_class","type":"text","marks":[{"type":"code_inline"}]},{"text":" — extracted standards pages need section+edition, not just code_id","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"llm-wiki-source-extraction-coverage","author":"@skillopedia","source":{"stars":10,"repo_name":"workspace-hub","origin_url":"https://github.com/vamseeachanta/workspace-hub/blob/HEAD/.claude/skills/research/llm-wiki-source-extraction-coverage/SKILL.md","repo_owner":"vamseeachanta","body_sha256":"752e22080eac593fa2a2402025df449d8c16ae4570c43915abec07647ba76c37","cluster_key":"6a5a89c9828e12e7421e3202ef852fe6e225c4c4ddaf3b0fadebaa7df3f90a6a","clean_bundle":{"format":"clean-skill-bundle-v1","source":"vamseeachanta/workspace-hub/.claude/skills/research/llm-wiki-source-extraction-coverage/SKILL.md","attachments":[{"id":"8517aa68-a14f-5118-9011-a93b3826dae5","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/8517aa68-a14f-5118-9011-a93b3826dae5/attachment.md","path":"references/docx-extraction.md","size":2932,"sha256":"f0f02532b45e7933ad216aa33138bb0dd54af8f4eb859b7cc66e8234944b711a","contentType":"text/markdown; 
charset=utf-8"},{"id":"09b15d6e-b5bf-5506-853f-0b54acb5c966","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/09b15d6e-b5bf-5506-853f-0b54acb5c966/attachment.md","path":"references/html-extraction.md","size":3598,"sha256":"1b19c4f7c0516bee96b37b4333432c1aaf6b43f8d00c6101f93dc1c7729db5fd","contentType":"text/markdown; charset=utf-8"},{"id":"dc45bff8-8ae5-5360-ba92-01f5412a4a8c","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/dc45bff8-8ae5-5360-ba92-01f5412a4a8c/attachment.md","path":"references/pdf-extraction.md","size":3094,"sha256":"4757b7b64785d59842b419ec4831900fff9e02242c416d95547dad6ef6e2236d","contentType":"text/markdown; charset=utf-8"},{"id":"492e7247-24b3-5b48-963b-d0997b84c263","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/492e7247-24b3-5b48-963b-d0997b84c263/attachment.md","path":"references/scanned-pdf-ocr-fallback.md","size":4665,"sha256":"784869c169196b6a6f2a5ee7c3d17c25b13569e2250b4c31d093a32e94e663eb","contentType":"text/markdown; charset=utf-8"},{"id":"cffecba5-0350-5517-8092-27a906b649b1","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/cffecba5-0350-5517-8092-27a906b649b1/attachment.md","path":"references/xlsx-extraction.md","size":3736,"sha256":"30bd551fa2225f16488cab04b2529bb7983106129db9bb0ea8121b52826369f2","contentType":"text/markdown; 
charset=utf-8"}],"bundle_sha256":"60938641f1d8ca70ed14c9282729eb3e3da4e15a40befafc072052cfd977c5f9","attachment_count":5,"text_attachments":5,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":0,"excluded_attachments":[]},"cluster_size":1,"skill_md_path":".claude/skills/research/llm-wiki-source-extraction-coverage/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"security","category_label":"Security"},"exact_dupes_collapsed_into_this":0},"version":"v1","category":"security","metadata":{"category":"research","references":["references/pdf-extraction.md","references/docx-extraction.md","references/xlsx-extraction.md","references/html-extraction.md","references/scanned-pdf-ocr-fallback.md"],"related_issues":["vamseeachanta/workspace-hub#2374","vamseeachanta/workspace-hub#2727"],"related_skills":["research/llm-wiki","research/llm-wiki-page-shape-contract","research/llm-wiki-audit-feedback-loop","engineering/doc-extraction","productivity/ocr-and-documents","data/document-index-pipeline"]},"import_tag":"clean-skills-v1","description":"Doc-type-aware extraction contract for llm-wiki source ingestion with measurable coverage and source-anchored traceability. Use when (1) ingesting a PDF, DOCX, XLSX, PPTX, HTML, or scanned-image source into a wiki `sources/` page, (2) computing the pre-extraction estimate (what fraction of the source we expect to recover) and post-extraction yield (what fraction we actually recovered), (3) anchoring wiki claims back to specific page / paragraph / cell / slide positions in the source so a reviewer can re-verify or revise against the actual document, (4) deciding whether OCR fallback or manual transcription is needed. Codifies workspace-hub's existing OCR fallback chain and python-docx / openpyxl / trafilatura patterns into a format-specific routing table. 
Companion to research/llm-wiki-page-shape-contract (Rule 7 input-layer pages) and research/llm-wiki — this skill is the defense against silent extraction failure."}},"renderedAt":1782980481798}
