Doc to Markdown

Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX post-processing.

Architecture: Pandoc (best-in-class extraction) + 8 post-processing fixes (our value-add).

Quick Start

Dual Mode

| Mode | Speed | Quality | Use Case |
|------|-------|---------|----------|
| Quick (default) | Fast | Good | Drafts, simple documents |
| Heavy | Slower | Best | Final documents, complex layouts |

Tool Selection

| Format | Quick Mode | Heavy Mode |
|--------|-----------|------------|
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pand…

, line)\n if heading_match:\n flush_segment()\n current_type = 'heading'\n current_level = len(heading_match.group(1))\n current_segment.append(line)\n flush_segment()\n continue\n\n # Table detection\n if '|' in line and re.match(r'^\\s*\\|.*\\|\\s*$'
, line):\n if not in_table:\n flush_segment()\n in_table = True\n current_type = 'table'\n current_segment.append(line)\n continue\n elif in_table:\n flush_segment()\n in_table = False\n\n # Image detection\n if re.match(r'!\\[.*\\]\\(.*\\)', line):\n flush_segment()\n current_type = 'image'\n current_segment.append(line)\n flush_segment()\n continue\n\n # List detection\n if re.match(r'^[\\s]*[-*+]\\s+', line) or re.match(r'^[\\s]*\\d+\\.\\s+', line):\n if current_type != 'list':\n flush_segment()\n current_type = 'list'\n current_segment.append(line)\n continue\n elif current_type == 'list' and line.strip() == '':\n flush_segment()\n continue\n\n # Empty line - potential paragraph break\n if line.strip() == '':\n if current_type == 'paragraph' and current_segment:\n flush_segment()\n continue\n\n # Default: paragraph\n if current_type not in ['list']:\n current_type = 'paragraph'\n current_segment.append(line)\n\n flush_segment()\n return segments\n\n\ndef score_segment(segment: Segment) -> float:\n \"\"\"Score a segment for quality comparison.\"\"\"\n score = 0.0\n content = segment.content\n\n if segment.type == 'table':\n # Count rows and columns\n rows = [l for l in content.split('\\n') if '|' in l]\n if rows:\n cols = rows[0].count('|') - 1\n score += len(rows) * 0.5 # More rows = better\n score += cols * 0.3 # More columns = better\n # Penalize separator-only tables\n if all(re.match(r'^[\\s|:-]+

$', r) for r in rows):\n score -= 5.0\n # Bonus for proper header separator\n if len(rows) > 1 and re.match(r'^[\\s|:-]+$'
, rows[1]):\n score += 1.0\n\n elif segment.type == 'heading':\n # Prefer proper heading hierarchy\n score += 1.0\n # Penalize very long headings\n if len(content) > 100:\n score -= 0.5\n\n elif segment.type == 'image':\n # Prefer images with alt text\n if re.search(r'!\\[.+\\]', content):\n score += 1.0\n # Prefer local paths over base64\n if 'data:image' not in content:\n score += 0.5\n\n elif segment.type == 'list':\n items = re.findall(r'^[\\s]*[-*+\\d.]+\\s+', content, re.MULTILINE)\n score += len(items) * 0.3\n # Bonus for nested lists\n if re.search(r'^\\s{2,}[-*+]', content, re.MULTILINE):\n score += 0.5\n\n elif segment.type == 'code':\n lines = content.split('\\n')\n score += min(len(lines) * 0.2, 3.0)\n # Bonus for language specification\n if re.match(r'^```\\w+', content):\n score += 0.5\n\n else: # paragraph\n words = len(content.split())\n score += min(words * 0.05, 2.0)\n # Penalize very short paragraphs\n if words \u003c 5:\n score -= 0.5\n\n return score\n\n\ndef find_matching_segment(\n segment: Segment,\n candidates: list[Segment],\n used_indices: set\n) -> Optional[int]:\n \"\"\"Find a matching segment in candidates by type and similarity.\"\"\"\n best_match = None\n best_similarity = 0.3 # Minimum threshold\n\n for i, candidate in enumerate(candidates):\n if i in used_indices:\n continue\n if candidate.type != segment.type:\n continue\n\n # Calculate similarity\n if segment.type == 'heading':\n # Compare heading text (ignore # symbols)\n s1 = re.sub(r'^#+\\s*', '', segment.content).lower()\n s2 = re.sub(r'^#+\\s*', '', candidate.content).lower()\n similarity = _text_similarity(s1, s2)\n elif segment.type == 'table':\n # Compare first row (header)\n h1 = segment.content.split('\\n')[0] if segment.content else ''\n h2 = candidate.content.split('\\n')[0] if candidate.content else ''\n similarity = _text_similarity(h1, h2)\n else:\n # Compare content directly\n similarity = _text_similarity(segment.content, candidate.content)\n\n if similarity > 
best_similarity:\n best_similarity = similarity\n best_match = i\n\n return best_match\n\n\ndef _text_similarity(s1: str, s2: str) -> float:\n \"\"\"Calculate simple text similarity (Jaccard on words).\"\"\"\n if not s1 or not s2:\n return 0.0\n\n words1 = set(s1.lower().split())\n words2 = set(s2.lower().split())\n\n if not words1 or not words2:\n return 0.0\n\n intersection = len(words1 & words2)\n union = len(words1 | words2)\n\n return intersection / union if union > 0 else 0.0\n\n\ndef merge_markdown_files(\n files: list[Path],\n source_names: Optional[list[str]] = None\n) -> MergeResult:\n \"\"\"Merge multiple markdown files by selecting best segments.\"\"\"\n if not files:\n return MergeResult(markdown=\"\", sources=[])\n\n if source_names is None:\n source_names = [f.stem for f in files]\n\n # Parse all files into segments\n all_segments = []\n for i, file_path in enumerate(files):\n content = file_path.read_text()\n segments = parse_segments(content)\n # Score each segment\n for seg in segments:\n seg.score = score_segment(seg)\n all_segments.append((source_names[i], segments))\n\n if len(all_segments) == 1:\n return MergeResult(\n markdown=files[0].read_text(),\n sources=[source_names[0]]\n )\n\n # Use first file as base structure\n base_name, base_segments = all_segments[0]\n merged_segments = []\n segment_sources = {}\n\n for i, base_seg in enumerate(base_segments):\n best_segment = base_seg\n best_source = base_name\n\n # Find matching segments in other files\n for other_name, other_segments in all_segments[1:]:\n used = set()\n match_idx = find_matching_segment(base_seg, other_segments, used)\n\n if match_idx is not None:\n other_seg = other_segments[match_idx]\n if other_seg.score > best_segment.score:\n best_segment = other_seg\n best_source = other_name\n\n merged_segments.append(best_segment)\n segment_sources[i] = best_source\n\n # Check for segments in other files that weren't matched\n # (content that only appears in secondary sources)\n 
base_used = set(range(len(base_segments)))\n for other_name, other_segments in all_segments[1:]:\n for j, other_seg in enumerate(other_segments):\n match_idx = find_matching_segment(other_seg, base_segments, set())\n if match_idx is None and other_seg.score > 0.5:\n # This segment doesn't exist in base - consider adding\n merged_segments.append(other_seg)\n segment_sources[len(merged_segments) - 1] = other_name\n\n # Reconstruct markdown\n merged_md = '\\n\\n'.join(seg.content for seg in merged_segments)\n\n return MergeResult(\n markdown=merged_md,\n sources=source_names,\n segment_sources=segment_sources\n )\n\n\ndef merge_from_json(json_path: Path) -> MergeResult:\n \"\"\"Merge from JSON results file (from convert.py).\"\"\"\n with open(json_path) as f:\n data = json.load(f)\n\n results = data.get('results', [])\n if not results:\n return MergeResult(markdown=\"\", sources=[])\n\n # Filter successful results\n successful = [r for r in results if r.get('success') and r.get('markdown')]\n if not successful:\n return MergeResult(markdown=\"\", sources=[])\n\n if len(successful) == 1:\n return MergeResult(\n markdown=successful[0]['markdown'],\n sources=[successful[0]['tool']]\n )\n\n # Parse and merge\n all_segments = []\n for result in successful:\n tool = result['tool']\n segments = parse_segments(result['markdown'])\n for seg in segments:\n seg.score = score_segment(seg)\n all_segments.append((tool, segments))\n\n # Same merge logic as merge_markdown_files\n base_name, base_segments = all_segments[0]\n merged_segments = []\n segment_sources = {}\n\n for i, base_seg in enumerate(base_segments):\n best_segment = base_seg\n best_source = base_name\n\n for other_name, other_segments in all_segments[1:]:\n match_idx = find_matching_segment(base_seg, other_segments, set())\n if match_idx is not None:\n other_seg = other_segments[match_idx]\n if other_seg.score > best_segment.score:\n best_segment = other_seg\n best_source = other_name\n\n 
merged_segments.append(best_segment)\n segment_sources[i] = best_source\n\n merged_md = '\\n\\n'.join(seg.content for seg in merged_segments)\n\n return MergeResult(\n markdown=merged_md,\n sources=[r['tool'] for r in successful],\n segment_sources=segment_sources\n )\n\n\ndef main():\n parser = argparse.ArgumentParser(\n description=\"Merge markdown outputs from multiple conversion tools\"\n )\n parser.add_argument(\n \"inputs\",\n nargs=\"*\",\n type=Path,\n help=\"Input markdown files to merge\"\n )\n parser.add_argument(\n \"-o\", \"--output\",\n type=Path,\n help=\"Output merged markdown file\"\n )\n parser.add_argument(\n \"--from-json\",\n type=Path,\n help=\"Merge from JSON results file (from convert.py)\"\n )\n parser.add_argument(\n \"--verbose\",\n action=\"store_true\",\n help=\"Show segment source attribution\"\n )\n\n args = parser.parse_args()\n\n if args.from_json:\n result = merge_from_json(args.from_json)\n elif args.inputs:\n # Validate inputs\n for f in args.inputs:\n if not f.exists():\n print(f\"Error: File not found: {f}\", file=sys.stderr)\n sys.exit(1)\n result = merge_markdown_files(args.inputs)\n else:\n parser.error(\"Either input files or --from-json is required\")\n\n if not result.markdown:\n print(\"Error: No content to merge\", file=sys.stderr)\n sys.exit(1)\n\n # Output\n if args.output:\n args.output.parent.mkdir(parents=True, exist_ok=True)\n args.output.write_text(result.markdown)\n print(f\"Merged output: {args.output}\")\n print(f\"Sources: {', '.join(result.sources)}\")\n else:\n print(result.markdown)\n\n if args.verbose and result.segment_sources:\n print(\"\\n--- Segment Attribution ---\", file=sys.stderr)\n for idx, source in result.segment_sources.items():\n print(f\" Segment {idx}: {source}\", file=sys.stderr)\n\n\nif __name__ == \"__main__\":\n main()\n","content_type":"text/x-python; 
charset=utf-8","language":"python","size":13559,"content_sha256":"b174d1d5871e9a393dc4e50da1f14d14f16762ad8d4e64e4a4618d361da4b6b4"},{"filename":"scripts/test_convert.py","content":"\"\"\"Tests for doc-to-markdown convert.py post-processing functions.\n\nRun: uv run pytest scripts/test_convert.py -v\n\"\"\"\n\nimport pytest\nimport re\nimport sys\nfrom pathlib import Path\n\n# Import the module under test\nsys.path.insert(0, str(Path(__file__).parent))\nfrom convert import (\n _fix_cjk_bold_spacing,\n _build_pipe_table,\n _collect_images,\n PostProcessStats,\n postprocess_docx_markdown,\n)\n\n\n# ── CJK Bold Spacing ─────────────────────────────────────────────────────────\n\n\nclass TestCjkBoldSpacing:\n \"\"\"Test _fix_cjk_bold_spacing: spaces between **bold** and CJK chars.\"\"\"\n\n def test_bold_followed_by_cjk_punctuation(self):\n \"\"\"**text** directly touching CJK colon → add space after **.\"\"\"\n inp = \"**打开阶跃开放平台链接**:https://platform.stepfun.com/\"\n out = _fix_cjk_bold_spacing(inp)\n assert \"**打开阶跃开放平台链接** :\" in out\n\n def test_cjk_before_bold(self):\n \"\"\"CJK char directly before ** → add space before **.\"\"\"\n assert _fix_cjk_bold_spacing(\"可用**手机号**进行\") == \"可用 **手机号** 进行\"\n\n def test_bold_with_emoji_neighbor(self):\n \"\"\"**text** touching emoji ➡️ → still add space (CJK content rule).\"\"\"\n inp = \"点击**【接口密码】**➡️**【创建新的密钥**】\"\n out = _fix_cjk_bold_spacing(inp)\n # Each CJK-containing bold span should have spaces on both sides\n assert \"点击 **【接口密码】** ➡️\" in out\n assert \"➡️ **【创建新的密钥**\" in out\n\n def test_full_emoji_line(self):\n \"\"\"Complete line with emoji separators between bold spans.\"\"\"\n inp = \"点击**【接口密码】**➡️**【创建新的密钥**】➡️**【输入密钥名称】**(输入你想取的名称),生成API Key\"\n out = _fix_cjk_bold_spacing(inp)\n assert \"点击 **【接口密码】** ➡️\" in out\n assert \"**【输入密钥名称】** (输入\" in out\n\n def test_bold_between_cjk(self):\n \"\"\"CJK **text** CJK → spaces on both sides.\"\"\"\n assert _fix_cjk_bold_spacing(\"打开**飞书**,就可以\") == \"打开 **飞书** 
,就可以\"\n\n def test_bold_with_chinese_quotes(self):\n \"\"\"Bold containing Chinese quotes.\"\"\"\n inp = '有个**\"企鹅戴龙虾头套的机器人\"**,开始'\n out = _fix_cjk_bold_spacing(inp)\n assert '**\"企鹅戴龙虾头套的机器人\"** ,' in out\n\n def test_multiple_bold_spans(self):\n \"\"\"Multiple bold spans in one line.\"\"\"\n assert _fix_cjk_bold_spacing(\"这是**测试**和**验证**的效果\") == \"这是 **测试** 和 **验证** 的效果\"\n\n def test_already_spaced(self):\n \"\"\"Already has spaces → no double spaces.\"\"\"\n inp = \"已有空格 **粗体** 不需要再加\"\n assert _fix_cjk_bold_spacing(inp) == inp\n\n def test_english_unchanged(self):\n \"\"\"English bold text should not be modified.\"\"\"\n inp = \"English **bold** text should not change\"\n assert _fix_cjk_bold_spacing(inp) == inp\n\n def test_line_start_bold(self):\n \"\"\"Bold at line start followed by CJK.\"\"\"\n assert _fix_cjk_bold_spacing(\"**重要**内容\") == \"**重要** 内容\"\n\n def test_line_start_bold_standalone(self):\n \"\"\"Bold at line start with no CJK neighbor → no change.\"\"\"\n assert _fix_cjk_bold_spacing(\"**这是纯粗体不需要改**\") == \"**这是纯粗体不需要改**\"\n\n def test_no_bold(self):\n \"\"\"Text without bold markers → unchanged.\"\"\"\n inp = \"这是普通文本,没有粗体\"\n assert _fix_cjk_bold_spacing(inp) == inp\n\n def test_empty_string(self):\n assert _fix_cjk_bold_spacing(\"\") == \"\"\n\n def test_bold_at_line_end(self):\n \"\"\"Bold at line end → no trailing space needed.\"\"\"\n assert _fix_cjk_bold_spacing(\"内容是**粗体**\") == \"内容是 **粗体**\"\n\n def test_mixed_cjk_and_english_bold(self):\n \"\"\"English bold between CJK → no change (no CJK in content).\"\"\"\n inp = \"请使用 **API Key** 进行认证\"\n assert _fix_cjk_bold_spacing(inp) == inp\n\n\n# ── Pipe Table Builder ────────────────────────────────────────────────────────\n\n\nclass TestBuildPipeTable:\n \"\"\"Test _build_pipe_table: rows → markdown pipe table.\"\"\"\n\n def test_basic_table(self):\n rows = [[\"a\", \"b\"], [\"c\", \"d\"]]\n result = _build_pipe_table(rows)\n assert result == [\n \"| | |\",\n \"| --- | --- |\",\n \"| a 
| b |\",\n \"| c | d |\",\n ]\n\n def test_uneven_rows(self):\n \"\"\"Rows with different column counts → padded.\"\"\"\n rows = [[\"a\", \"b\", \"c\"], [\"d\"]]\n result = _build_pipe_table(rows)\n assert \"| d | | |\" in result\n\n def test_single_cell(self):\n rows = [[\"only\"]]\n result = _build_pipe_table(rows)\n assert len(result) == 3 # header + sep + 1 row\n\n def test_empty_rows(self):\n assert _build_pipe_table([]) == []\n\n def test_image_with_caption(self):\n \"\"\"Images and captions should pair correctly in table.\"\"\"\n rows = [\n [\"![](img1.png)\", \"![](img2.png)\"],\n [\"Step 1\", \"Step 2\"],\n ]\n result = _build_pipe_table(rows)\n assert \"| ![](img1.png) | ![](img2.png) |\" in result\n assert \"| Step 1 | Step 2 |\" in result\n\n\n# ── Full Post-Processing Pipeline ─────────────────────────────────────────────\n\n\nclass TestPostprocessPipeline:\n \"\"\"Integration tests for the full postprocess_docx_markdown pipeline.\"\"\"\n\n def test_grid_table_single_column_to_blockquote(self):\n \"\"\"Single-column grid table → blockquote.\"\"\"\n inp = \"\"\"+:---+\n| 注意事项 |\n+----+\"\"\"\n out, stats = postprocess_docx_markdown(inp)\n assert \"> 注意事项\" in out\n assert \"+:---+\" not in out\n\n def test_pandoc_attributes_removed(self):\n \"\"\"Pandoc {width=...} and {.underline} removed.\"\"\"\n inp = '![](img.png){width=\"5in\" height=\"3in\"} and [text]{.underline}'\n out, stats = postprocess_docx_markdown(inp)\n assert \"{width=\" not in out\n assert \"{.underline}\" not in out\n assert \"![](img.png)\" in out\n\n def test_escaped_brackets_fixed(self):\n r\"\"\"Pandoc \\[ and \\] → [ and ].\"\"\"\n inp = r\"你 \\[在飞书里\\] 发消息\"\n out, stats = postprocess_docx_markdown(inp)\n assert \"你 [在飞书里] 发消息\" in out\n\n def test_double_bracket_links_fixed(self):\n \"\"\"[[text]](url) → [text](url).\"\"\"\n inp = \"[[点击跳转]](https://example.com)\"\n out, stats = postprocess_docx_markdown(inp)\n assert \"[点击跳转](https://example.com)\" in out\n\n def 
test_code_block_with_language(self):\n \"\"\"Indented dashed block with JSON language hint → ```json.\"\"\"\n inp = \"\"\" ------------------------------------------------------------------\n JSON\\\\\n {\\\\\n \"provider\": \"stepfun\"\\\\\n }\n ------------------------------------------------------------------\"\"\"\n out, stats = postprocess_docx_markdown(inp)\n assert \"```json\" in out\n assert '\"provider\": \"stepfun\"' in out\n assert \"---\" not in out\n\n def test_code_block_plain_text_to_blockquote(self):\n \"\"\"Indented dashed block with plain text → blockquote.\"\"\"\n inp = \"\"\" --------------------------\n 注意:这是一条重要提示\n --------------------------\"\"\"\n out, stats = postprocess_docx_markdown(inp)\n assert \"> 注意:这是一条重要提示\" in out\n\n def test_cjk_bold_spacing_in_pipeline(self):\n \"\"\"CJK bold spacing is applied in the full pipeline.\"\"\"\n inp = \"打开**飞书**,就可以看到\"\n out, stats = postprocess_docx_markdown(inp)\n assert \"打开 **飞书** ,就可以看到\" in out\n\n def test_excessive_blank_lines_collapsed(self):\n \"\"\"4+ blank lines → 2 blank lines.\"\"\"\n inp = \"line1\\n\\n\\n\\n\\nline2\"\n out, stats = postprocess_docx_markdown(inp)\n assert out.count(\"\\n\") \u003c 5\n\n def test_stats_tracking(self):\n \"\"\"Stats object correctly tracks fix counts.\"\"\"\n inp = '![](media/media/img.png){width=\"5in\"}'\n out, stats = postprocess_docx_markdown(inp)\n assert stats.attributes_removed > 0\n\n\n# ── Simple Table (pandoc) ─────────────────────────────────────────────────────\n\n\nclass TestSimpleTable:\n \"\"\"Test pandoc simple table (indented dashes with spaces) → pipe table.\"\"\"\n\n def test_two_column_image_table(self):\n \"\"\"Two images side by side in simple table → pipe table.\"\"\"\n inp = \"\"\" ---- ----\n ![](img1.png) ![](img2.png)\n\n ---- ----\"\"\"\n out, stats = postprocess_docx_markdown(inp)\n assert \"| ![](img1.png) | ![](img2.png) |\" in out\n assert \"----\" not in out\n\n def test_four_column_image_table(self):\n \"\"\"Four 
images in simple table → 4-column pipe table.\"\"\"\n inp = \"\"\" ---------- ---------- ---------- ----------\n ![](a.png) ![](b.png) ![](c.png) ![](d.png)\n\n ---------- ---------- ---------- ----------\"\"\"\n out, stats = postprocess_docx_markdown(inp)\n assert \"| ![](a.png) | ![](b.png) | ![](c.png) | ![](d.png) |\" in out\n","content_type":"text/x-python; charset=utf-8","language":"python","size":9654,"content_sha256":"a78a88b16f1ecfd7a597855aa2b285dceaf4058567d01ec8900a8d80ee0690f6"},{"filename":"scripts/validate_output.py","content":"#!/usr/bin/env python3\n\"\"\"\nQuality validator for document-to-markdown conversion.\n\nCompare original document with converted markdown to assess conversion quality.\nGenerates HTML quality report with detailed metrics.\n\nUsage:\n uv run --with pymupdf scripts/validate_output.py document.pdf output.md\n uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html\n\"\"\"\n\nimport argparse\nimport html\nimport re\nimport subprocess\nimport sys\nfrom dataclasses import dataclass, field\nfrom pathlib import Path\nfrom typing import Optional\n\n\n@dataclass\nclass ValidationMetrics:\n \"\"\"Quality metrics for conversion validation.\"\"\"\n # Text metrics\n source_char_count: int = 0\n output_char_count: int = 0\n text_retention: float = 0.0\n\n # Table metrics\n source_table_count: int = 0\n output_table_count: int = 0\n table_retention: float = 0.0\n\n # Image metrics\n source_image_count: int = 0\n output_image_count: int = 0\n image_retention: float = 0.0\n\n # Structure metrics\n heading_count: int = 0\n list_count: int = 0\n code_block_count: int = 0\n\n # Quality scores\n overall_score: float = 0.0\n status: str = \"unknown\" # pass, warn, fail\n\n # Details\n warnings: list[str] = field(default_factory=list)\n errors: list[str] = field(default_factory=list)\n\n\ndef extract_text_from_pdf(pdf_path: Path) -> tuple[str, int, int]:\n \"\"\"Extract text, table count, and image count from 
PDF.\"\"\"\n try:\n import fitz # PyMuPDF\n\n doc = fitz.open(str(pdf_path))\n text_parts = []\n table_count = 0\n image_count = 0\n\n for page in doc:\n # Extract the page text once and reuse it below\n page_text = page.get_text()\n text_parts.append(page_text)\n # Count images\n image_count += len(page.get_images())\n # Estimate tables (look for grid-like structures)\n # This is approximate: it counts pages with table-like text, not individual tables\n if re.search(r'(\\t.*){2,}', page_text) or '│' in page_text:\n table_count += 1\n\n doc.close()\n return '\\n'.join(text_parts), table_count, image_count\n\n except ImportError:\n # Fallback to pdftotext if available\n try:\n result = subprocess.run(\n ['pdftotext', '-layout', str(pdf_path), '-'],\n capture_output=True,\n text=True,\n timeout=60\n )\n return result.stdout, 0, 0 # Can't count tables/images\n except Exception:\n return \"\", 0, 0\n\n\ndef extract_text_from_docx(docx_path: Path) -> tuple[str, int, int]:\n \"\"\"Extract text, table count, and image count from DOCX.\"\"\"\n try:\n import zipfile\n from xml.etree import ElementTree as ET\n\n with zipfile.ZipFile(docx_path, 'r') as z:\n # Extract main document text\n if 'word/document.xml' not in z.namelist():\n return \"\", 0, 0\n\n with z.open('word/document.xml') as f:\n tree = ET.parse(f)\n root = tree.getroot()\n\n # Extract text\n wordprocessing_ns = 'http' + '://schemas.openxmlformats.org/wordprocessingml/2006/main'\n ns = {'w': wordprocessing_ns}\n text_parts = []\n for t in root.iter(f'{{{wordprocessing_ns}}}t'):\n if t.text:\n text_parts.append(t.text)\n\n # Count tables\n tables = root.findall('.//w:tbl', ns)\n table_count = len(tables)\n\n # Count images\n image_count = sum(1 for name in z.namelist()\n if name.startswith('word/media/'))\n\n return ' '.join(text_parts), table_count, image_count\n\n except Exception:\n # Unreadable or malformed DOCX: report nothing extracted\n return \"\", 0, 0\n\n\ndef analyze_markdown(md_path: Path) -> dict:\n \"\"\"Analyze markdown file structure and content.\"\"\"\n content = md_path.read_text()\n\n # Count tables (markdown 
tables with |)\n table_lines = [l for l in content.split('\\n')\n if re.match(r'^\\s*\\|.*\\|', l)]\n # Group consecutive table lines\n table_count = 0\n in_table = False\n for line in content.split('\\n'):\n if re.match(r'^\\s*\\|.*\\|', line):\n if not in_table:\n table_count += 1\n in_table = True\n else:\n in_table = False\n\n # Count images\n images = re.findall(r'!\\[.*?\\]\\(.*?\\)', content)\n\n # Count headings\n headings = re.findall(r'^#{1,6}\\s+.+$'
, content, re.MULTILINE)\n\n # Count lists\n list_items = re.findall(r'^[\\s]*[-*+]\\s+', content, re.MULTILINE)\n list_items += re.findall(r'^[\\s]*\\d+\\.\\s+', content, re.MULTILINE)\n\n # Count code blocks\n code_blocks = re.findall(r'```', content)\n\n # Clean text for comparison\n clean_text = re.sub(r'```.*?```', '', content, flags=re.DOTALL)\n clean_text = re.sub(r'!\\[.*?\\]\\(.*?\\)', '', clean_text)\n clean_text = re.sub(r'\\[.*?\\]\\(.*?\\)', '', clean_text)\n clean_text = re.sub(r'[#*_`|>-]', '', clean_text)\n clean_text = re.sub(r'\\s+', ' ', clean_text).strip()\n\n return {\n 'char_count': len(clean_text),\n 'table_count': table_count,\n 'image_count': len(images),\n 'heading_count': len(headings),\n 'list_count': len(list_items),\n 'code_block_count': len(code_blocks) // 2,\n 'raw_content': content,\n 'clean_text': clean_text\n }\n\n\ndef validate_conversion(\n source_path: Path,\n output_path: Path\n) -> ValidationMetrics:\n \"\"\"Validate conversion quality by comparing source and output.\"\"\"\n metrics = ValidationMetrics()\n\n # Analyze output markdown\n md_analysis = analyze_markdown(output_path)\n metrics.output_char_count = md_analysis['char_count']\n metrics.output_table_count = md_analysis['table_count']\n metrics.output_image_count = md_analysis['image_count']\n metrics.heading_count = md_analysis['heading_count']\n metrics.list_count = md_analysis['list_count']\n metrics.code_block_count = md_analysis['code_block_count']\n\n # Extract source content based on file type\n ext = source_path.suffix.lower()\n if ext == '.pdf':\n source_text, source_tables, source_images = extract_text_from_pdf(source_path)\n elif ext in ['.docx', '.doc']:\n source_text, source_tables, source_images = extract_text_from_docx(source_path)\n else:\n # For other formats, estimate from file size\n source_text = \"\"\n source_tables = 0\n source_images = 0\n metrics.warnings.append(f\"Cannot analyze source format: {ext}\")\n\n metrics.source_char_count = 
len(source_text.replace(' ', '').replace('\\n', ''))\n metrics.source_table_count = source_tables\n metrics.source_image_count = source_images\n\n # Calculate retention rates\n if metrics.source_char_count > 0:\n # Use ratio of actual/expected, capped at 1.0\n metrics.text_retention = min(\n metrics.output_char_count / metrics.source_char_count,\n 1.0\n )\n else:\n metrics.text_retention = 1.0 if metrics.output_char_count > 0 else 0.0\n\n if metrics.source_table_count > 0:\n metrics.table_retention = min(\n metrics.output_table_count / metrics.source_table_count,\n 1.0\n )\n else:\n metrics.table_retention = 1.0 # No tables expected\n\n if metrics.source_image_count > 0:\n metrics.image_retention = min(\n metrics.output_image_count / metrics.source_image_count,\n 1.0\n )\n else:\n metrics.image_retention = 1.0 # No images expected\n\n # Collect errors and warnings based on thresholds\n if metrics.text_retention \u003c 0.85:\n metrics.errors.append(f\"Low text retention: {metrics.text_retention:.1%}\")\n elif metrics.text_retention \u003c 0.95:\n metrics.warnings.append(f\"Text retention below optimal: {metrics.text_retention:.1%}\")\n\n if metrics.source_table_count > 0 and metrics.table_retention \u003c 0.9:\n metrics.errors.append(f\"Tables missing: {metrics.table_retention:.1%} retained\")\n elif metrics.source_table_count > 0 and metrics.table_retention \u003c 1.0:\n metrics.warnings.append(f\"Some tables may be incomplete: {metrics.table_retention:.1%}\")\n\n if metrics.source_image_count > 0 and metrics.image_retention \u003c 0.8:\n metrics.errors.append(f\"Images missing: {metrics.image_retention:.1%} retained\")\n elif metrics.source_image_count > 0 and metrics.image_retention \u003c 1.0:\n metrics.warnings.append(f\"Some images missing: {metrics.image_retention:.1%}\")\n\n # Calculate overall score (0-100): retention rates are fractions in [0, 1],\n # so the weights must sum to 1.0 before scaling to a percentage\n metrics.overall_score = (\n metrics.text_retention * 0.50 +\n metrics.table_retention * 0.25 +\n metrics.image_retention * 0.25\n ) * 100\n\n # Determine status\n if 
metrics.errors:\n metrics.status = \"fail\"\n elif metrics.warnings:\n metrics.status = \"warn\"\n else:\n metrics.status = \"pass\"\n\n return metrics\n\n\ndef generate_html_report(\n metrics: ValidationMetrics,\n source_path: Path,\n output_path: Path\n) -> str:\n \"\"\"Generate HTML quality report.\"\"\"\n status_colors = {\n \"pass\": \"#28a745\",\n \"warn\": \"#ffc107\",\n \"fail\": \"#dc3545\"\n }\n status_color = status_colors.get(metrics.status, \"#6c757d\")\n\n def metric_bar(value: float, thresholds: tuple) -> str:\n \"\"\"Generate colored progress bar.\"\"\"\n pct = int(value * 100)\n if value >= thresholds[0]:\n color = \"#28a745\" # green\n elif value >= thresholds[1]:\n color = \"#ffc107\" # yellow\n else:\n color = \"#dc3545\" # red\n return f'''\n \u003cdiv style=\"background: #e9ecef; border-radius: 4px; overflow: hidden; height: 20px;\">\n \u003cdiv style=\"background: {color}; width: {pct}%; height: 100%; transition: width 0.3s;\">\u003c/div>\n \u003c/div>\n \u003cspan style=\"font-size: 14px; color: #666;\">{pct}%\u003c/span>\n '''\n\n report = f'''\u003c!DOCTYPE html>\n\u003chtml>\n\u003chead>\n \u003cmeta charset=\"UTF-8\">\n \u003ctitle>Conversion Quality Report\u003c/title>\n \u003cstyle>\n body {{ font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; margin: 40px; background: #f5f5f5; }}\n .container {{ max-width: 800px; margin: 0 auto; background: white; padding: 30px; border-radius: 8px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); }}\n h1 {{ color: #333; border-bottom: 2px solid #eee; padding-bottom: 15px; }}\n .status {{ display: inline-block; padding: 8px 16px; border-radius: 4px; color: white; font-weight: bold; }}\n .metric {{ margin: 20px 0; padding: 15px; background: #f8f9fa; border-radius: 4px; }}\n .metric-label {{ font-weight: bold; color: #333; margin-bottom: 8px; }}\n .metric-value {{ font-size: 24px; color: #333; }}\n .issues {{ margin-top: 20px; }}\n .error {{ background: #f8d7da; color: #721c24; padding: 
10px; margin: 5px 0; border-radius: 4px; }}\n .warning {{ background: #fff3cd; color: #856404; padding: 10px; margin: 5px 0; border-radius: 4px; }}\n table {{ width: 100%; border-collapse: collapse; margin: 15px 0; }}\n th, td {{ padding: 10px; text-align: left; border-bottom: 1px solid #eee; }}\n th {{ background: #f8f9fa; }}\n .score {{ font-size: 48px; font-weight: bold; color: {status_color}; }}\n \u003c/style>\n\u003c/head>\n\u003cbody>\n \u003cdiv class=\"container\">\n \u003ch1>📊 Conversion Quality Report\u003c/h1>\n\n \u003cdiv style=\"text-align: center; margin: 30px 0;\">\n \u003cdiv class=\"score\">{metrics.overall_score:.0f}\u003c/div>\n \u003cdiv style=\"color: #666;\">Overall Score\u003c/div>\n \u003cdiv class=\"status\" style=\"background: {status_color}; margin-top: 10px;\">\n {metrics.status.upper()}\n \u003c/div>\n \u003c/div>\n\n \u003ch2>📄 File Information\u003c/h2>\n \u003ctable>\n \u003ctr>\u003cth>Source\u003c/th>\u003ctd>{html.escape(str(source_path))}\u003c/td>\u003c/tr>\n \u003ctr>\u003cth>Output\u003c/th>\u003ctd>{html.escape(str(output_path))}\u003c/td>\u003c/tr>\n \u003c/table>\n\n \u003ch2>📏 Retention Metrics\u003c/h2>\n\n \u003cdiv class=\"metric\">\n \u003cdiv class=\"metric-label\">Text Retention (target: >95%)\u003c/div>\n {metric_bar(metrics.text_retention, (0.95, 0.85))}\n \u003cdiv style=\"font-size: 12px; color: #666; margin-top: 5px;\">\n Source: ~{metrics.source_char_count:,} chars | Output: {metrics.output_char_count:,} chars\n \u003c/div>\n \u003c/div>\n\n \u003cdiv class=\"metric\">\n \u003cdiv class=\"metric-label\">Table Retention (target: 100%)\u003c/div>\n {metric_bar(metrics.table_retention, (1.0, 0.9))}\n \u003cdiv style=\"font-size: 12px; color: #666; margin-top: 5px;\">\n Source: {metrics.source_table_count} tables | Output: {metrics.output_table_count} tables\n \u003c/div>\n \u003c/div>\n\n \u003cdiv class=\"metric\">\n \u003cdiv class=\"metric-label\">Image Retention (target: 100%)\u003c/div>\n 
{metric_bar(metrics.image_retention, (1.0, 0.8))}\n \u003cdiv style=\"font-size: 12px; color: #666; margin-top: 5px;\">\n Source: {metrics.source_image_count} images | Output: {metrics.output_image_count} images\n \u003c/div>\n \u003c/div>\n\n \u003ch2>📊 Structure Analysis\u003c/h2>\n \u003ctable>\n \u003ctr>\u003cth>Headings\u003c/th>\u003ctd>{metrics.heading_count}\u003c/td>\u003c/tr>\n \u003ctr>\u003cth>List Items\u003c/th>\u003ctd>{metrics.list_count}\u003c/td>\u003c/tr>\n \u003ctr>\u003cth>Code Blocks\u003c/th>\u003ctd>{metrics.code_block_count}\u003c/td>\u003c/tr>\n \u003c/table>\n\n {'\u003ch2>⚠️ Issues\u003c/h2>\u003cdiv class=\"issues\">' + ''.join(f'\u003cdiv class=\"error\">❌ {html.escape(e)}\u003c/div>' for e in metrics.errors) + ''.join(f'\u003cdiv class=\"warning\">⚠️ {html.escape(w)}\u003c/div>' for w in metrics.warnings) + '\u003c/div>' if metrics.errors or metrics.warnings else ''}\n\n \u003cdiv style=\"margin-top: 30px; padding-top: 20px; border-top: 1px solid #eee; color: #666; font-size: 12px;\">\n Generated by markdown-tools validate_output.py\n \u003c/div>\n \u003c/div>\n\u003c/body>\n\u003c/html>\n'''\n return report\n\n\ndef main():\n parser = argparse.ArgumentParser(\n description=\"Validate document-to-markdown conversion quality\"\n )\n parser.add_argument(\n \"source\",\n type=Path,\n help=\"Original document (PDF, DOCX, etc.)\"\n )\n parser.add_argument(\n \"output\",\n type=Path,\n help=\"Converted markdown file\"\n )\n parser.add_argument(\n \"--report\",\n type=Path,\n help=\"Generate HTML report at this path\"\n )\n parser.add_argument(\n \"--json\",\n action=\"store_true\",\n help=\"Output metrics as JSON\"\n )\n\n args = parser.parse_args()\n\n # Validate inputs\n if not args.source.exists():\n print(f\"Error: Source file not found: {args.source}\", file=sys.stderr)\n sys.exit(1)\n if not args.output.exists():\n print(f\"Error: Output file not found: {args.output}\", file=sys.stderr)\n sys.exit(1)\n\n # Run validation\n metrics = 
validate_conversion(args.source, args.output)\n\n # Output results\n if args.json:\n import json\n print(json.dumps({\n 'text_retention': metrics.text_retention,\n 'table_retention': metrics.table_retention,\n 'image_retention': metrics.image_retention,\n 'overall_score': metrics.overall_score,\n 'status': metrics.status,\n 'warnings': metrics.warnings,\n 'errors': metrics.errors\n }, indent=2))\n else:\n # Console output\n status_emoji = {\"pass\": \"✅\", \"warn\": \"⚠️\", \"fail\": \"❌\"}.get(metrics.status, \"❓\")\n print(f\"\\n{status_emoji} Conversion Quality: {metrics.status.upper()}\")\n print(f\" Overall Score: {metrics.overall_score:.0f}/100\")\n print(f\"\\n Text Retention: {metrics.text_retention:.1%}\")\n print(f\" Table Retention: {metrics.table_retention:.1%}\")\n print(f\" Image Retention: {metrics.image_retention:.1%}\")\n\n if metrics.errors:\n print(\"\\n Errors:\")\n for e in metrics.errors:\n print(f\" ❌ {e}\")\n\n if metrics.warnings:\n print(\"\\n Warnings:\")\n for w in metrics.warnings:\n print(f\" ⚠️ {w}\")\n\n # Generate HTML report\n if args.report:\n report_html = generate_html_report(metrics, args.source, args.output)\n args.report.parent.mkdir(parents=True, exist_ok=True)\n args.report.write_text(report_html)\n print(f\"\\n📊 HTML report: {args.report}\")\n\n # Exit with appropriate code\n sys.exit(0 if metrics.status != \"fail\" else 1)\n\n\nif __name__ == \"__main__\":\n main()\n","content_type":"text/x-python; charset=utf-8","language":"python","size":16624,"content_sha256":"f7d89030dff50e3cfa849d929d46564f886fed99543cb20ff01c5c1ec1b9da11"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"Doc to Markdown","type":"text"}]},{"type":"paragraph","content":[{"text":"Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX 
post-processing.","type":"text"}]},{"type":"paragraph","content":[{"text":"Architecture","type":"text","marks":[{"type":"strong"}]},{"text":": Pandoc (best-in-class extraction) + 8 post-processing fixes (our value-add).","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Quick Start","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# DOCX → Markdown (one command, zero manual fixes)\nuv run --with pymupdf4llm --with markitdown scripts/convert.py document.docx -o output.md --assets-dir ./media\n\n# PDF → Markdown\nuv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md\n\n# Run tests\nuv run --with pytest pytest scripts/test_convert.py -v","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Dual Mode","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Mode","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Speed","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Quality","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Use Case","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Quick","type":"text","marks":[{"type":"strong"}]},{"text":" 
(default)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Fast","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Good","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Drafts, simple documents","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Heavy","type":"text","marks":[{"type":"strong"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Slower","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Best","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Final documents, complex layouts","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Tool Selection","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Format","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Quick Mode","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Heavy 
Mode","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"PDF","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"pymupdf4llm","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"pymupdf4llm + markitdown","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"DOCX","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"pandoc + post-processing","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"pandoc + markitdown","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"PPTX","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"markitdown","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"markitdown + 
pandoc","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"XLSX","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"markitdown","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"markitdown","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"DOCX Post-Processing (automatic)","type":"text"}]},{"type":"paragraph","content":[{"text":"When converting DOCX via pandoc, 8 cleanups are applied automatically:","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Problem","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Fix","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Test coverage","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Grid tables (","type":"text"},{"text":"+:---+","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Single-column → blockquote, multi-column → pipe 
table","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"TestPostprocessPipeline","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Simple tables (","type":"text"},{"text":" ---- ----","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Multi-column images → pipe table with captions","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"TestSimpleTable","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Image path nesting (","type":"text"},{"text":"media/media/","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Flatten to ","type":"text"},{"text":"media/","type":"text","marks":[{"type":"code_inline"}]},{"text":", absolute → relative","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"test_stats_tracking","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Pandoc attributes 
(","type":"text"},{"text":"{width=\"...\"}","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Removed","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"test_pandoc_attributes_removed","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"CJK bold spacing (","type":"text"},{"text":"**粗体**中文","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Add space around ","type":"text"},{"text":"**","type":"text","marks":[{"type":"code_inline"}]},{"text":" for CJK bold spans","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"TestCjkBoldSpacing","type":"text","marks":[{"type":"code_inline"}]},{"text":" (15 cases)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Indented dashed code blocks","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"→ fenced ``` with language 
detection","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"test_code_block_with_language","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Escaped brackets (","type":"text"},{"text":"\\[...\\]","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"→ ","type":"text"},{"text":"[...]","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"test_escaped_brackets_fixed","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Double-bracket links (","type":"text"},{"text":"[[text]](url)","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"→ ","type":"text"},{"text":"[text](url)","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"test_double_bracket_links_fixed","type":"text","marks":[{"type":"code_inline"}]}]}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"CJK Bold Spacing — why and how","type":"text"}]},{"type":"paragraph","content":[{"text":"DOCX uses run-level styling (no spaces between bold/normal runs in CJK text). 
Markdown renderers need whitespace around ","type":"text"},{"text":"**","type":"text","marks":[{"type":"code_inline"}]},{"text":" to recognize bold boundaries.","type":"text"}]},{"type":"paragraph","content":[{"text":"Rule","type":"text","marks":[{"type":"strong"}]},{"text":": if a ","type":"text"},{"text":"**content**","type":"text","marks":[{"type":"code_inline"}]},{"text":" span contains any CJK character, ensure both sides have a space — unless already spaced or at line boundary. This handles CJK punctuation, emoji adjacency, and mixed content.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"Before: 打开**飞书**,就可以 → some renderers fail to bold\nAfter: 打开 **飞书** ,就可以 → universally renders correctly","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Heavy Mode Workflow","type":"text"}]},{"type":"paragraph","content":[{"text":"Heavy Mode runs multiple tools in parallel and selects the best segments:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Parallel Execution","type":"text","marks":[{"type":"strong"}]},{"text":": Run all applicable tools simultaneously","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Segment Analysis","type":"text","marks":[{"type":"strong"}]},{"text":": Parse each output into segments (tables, headings, images, paragraphs)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Quality Scoring","type":"text","marks":[{"type":"strong"}]},{"text":": Score each segment based on completeness and structure","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Intelligent Merge","type":"text","marks":[{"type":"strong"}]},{"text":": Select best version of each segment across 
tools","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Merge Criteria","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Segment Type","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Selection Criteria","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Tables","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"More rows/columns, proper header separator","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Images","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Alt text present, local paths preferred","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Headings","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Proper hierarchy, appropriate length","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Lists","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"More items, nested structure 
preserved","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Paragraphs","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Content completeness","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Image Extraction","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Extract images with metadata\nuv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./extracted-images\n\n# Generate markdown references file\nuv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md","type":"text"}]},{"type":"paragraph","content":[{"text":"Output:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Images: ","type":"text"},{"text":"extracted-images/img_page1_1.png","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"extracted-images/img_page2_1.jpg","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Metadata: ","type":"text"},{"text":"extracted-images/images_metadata.json","type":"text","marks":[{"type":"code_inline"}]},{"text":" (page, position, dimensions)","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Quality Validation","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Validate conversion quality\nuv run --with pymupdf scripts/validate_output.py document.pdf output.md\n\n# Generate HTML report\nuv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Quality 
Metrics","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Metric","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Pass","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Warn","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Fail","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Text Retention","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":">95%","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"85-95%","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\u003c85%","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Table 
Retention","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"100%","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"90-99%","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\u003c90%","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Image Retention","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"100%","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"80-99%","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\u003c80%","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Merge Outputs Manually","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Merge multiple markdown files\npython scripts/merge_outputs.py output1.md output2.md -o merged.md\n\n# Show segment attribution\npython scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Path Conversion (Windows/WSL)","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Windows to WSL conversion\npython scripts/convert_path.py \"C:\\Users\\\u003cwindows-user>\\Documents\\file.pdf\"\n# Output: 
/mnt/c/Users/\u003cwindows-user>/Documents/file.pdf","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Common Issues","type":"text"}]},{"type":"paragraph","content":[{"text":"\"No conversion tools available\"","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Install all tools\npip install pymupdf4llm\nuv tool install \"markitdown[pdf]\"\nbrew install pandoc","type":"text"}]},{"type":"paragraph","content":[{"text":"FontBBox warnings during PDF conversion","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"These are harmless font-parsing warnings; the output is still correct","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Images missing from output","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use Heavy Mode for better image preservation","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Or extract separately with ","type":"text"},{"text":"scripts/extract_pdf_images.py","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"paragraph","content":[{"text":"Tables broken in output","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use Heavy Mode, which selects the most complete version of each table","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Or validate with ","type":"text"},{"text":"scripts/validate_output.py","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Bundled 
Scripts","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Script","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Purpose","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"convert.py","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Main orchestrator with Quick/Heavy mode + DOCX post-processing","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"test_convert.py","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"31 tests covering all post-processing functions","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"merge_outputs.py","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Merge multiple markdown 
outputs","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"validate_output.py","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Quality validation with HTML report","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"extract_pdf_images.py","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"PDF image extraction with metadata","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"convert_path.py","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Windows to WSL path converter","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"References","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"references/benchmark-2026-03-22.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" - 5-tool benchmark (Docling/MarkItDown/Pandoc/Mammoth/ours)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"references/heavy-mode-guide.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" - Detailed Heavy Mode 
documentation","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"references/tool-comparison.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" - Tool capabilities comparison","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"references/conversion-examples.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" - Batch operation examples","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Next Step: Clean Up Converted Content","type":"text"}]},{"type":"paragraph","content":[{"text":"After converting documents to markdown, suggest cleanup:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"Conversion complete: [N] files converted to markdown.\n\nOptions:\nA) Clean up docs — run /daymade-docs:docs-cleaner to consolidate redundant content (Recommended if multiple files)\nB) Check facts — run /fact-checker to verify claims in the converted content\nC) No thanks — the markdown conversion is 
sufficient","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"doc-to-markdown","author":"@skillopedia","source":{"stars":1137,"repo_name":"claude-code-skills","origin_url":"https://github.com/daymade/claude-code-skills/blob/HEAD/daymade-docs/doc-to-markdown/SKILL.md","repo_owner":"daymade","body_sha256":"52edd68252850eb39a909054940544a9bdcdf7658887a584aeba3dfef4a56137","cluster_key":"8bacda51eb4ead4fa455a23d31758dfbc649b1cdeae30a7ae1b4334fa8c58525","clean_bundle":{"format":"clean-skill-bundle-v1","source":"daymade/claude-code-skills/daymade-docs/doc-to-markdown/SKILL.md","attachments":[{"id":"0513a417-d690-5cd7-8394-bfb6b94d8760","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/0513a417-d690-5cd7-8394-bfb6b94d8760/attachment.md","path":"references/benchmark-2026-03-22.md","size":6321,"sha256":"cc9dc56b26de9f15b2f6ace4d7e374479ce1d14e37f2132ca8b7ed45441f2639","contentType":"text/markdown; charset=utf-8"},{"id":"b12b9f60-a8dd-5c5b-88d0-28112ffceed7","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/b12b9f60-a8dd-5c5b-88d0-28112ffceed7/attachment.md","path":"references/conversion-examples.md","size":6749,"sha256":"8ac58d7758ffadae83868e8a3d650bafa6c53203c23a5f760cebc57ada28a808","contentType":"text/markdown; charset=utf-8"},{"id":"208e3334-eb53-5b63-b5b7-917fbb7642f4","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/208e3334-eb53-5b63-b5b7-917fbb7642f4/attachment.md","path":"references/heavy-mode-guide.md","size":3864,"sha256":"746861f4a067a62da86d9f2bb3fdcf57e4c0960f08ab1921a73a8a1865ff9c6f","contentType":"text/markdown; charset=utf-8"},{"id":"04105eae-6579-5077-81b4-9504d609e4d4","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/04105eae-6579-5077-81b4-9504d609e4d4/attachment.md","path":"references/tool-comparison.md","size":3746,"sha256":"62c919cd00cb207b67438b0104e2596110bd48253ec9acd0fb840d1990ff0f6d","contentType":"text/markdown; 
charset=utf-8"},{"id":"9821f573-2b5f-573a-8064-8e9e41397b22","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/9821f573-2b5f-573a-8064-8e9e41397b22/attachment.py","path":"scripts/convert.py","size":39754,"sha256":"6d9937625eab6c7059e65ae9fe9b503e7adeb96194c5434fd9c1ad1cfd8aa99c","contentType":"text/x-python; charset=utf-8"},{"id":"b80e5da5-c6a6-5afc-96fb-e3ed12097fc9","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/b80e5da5-c6a6-5afc-96fb-e3ed12097fc9/attachment.py","path":"scripts/convert_path.py","size":1477,"sha256":"f599b1bc248b15e1af31f14e523680101a7452f0bb414755cf4eb1fe35d4a17c","contentType":"text/x-python; charset=utf-8"},{"id":"727b68db-8d93-5ca7-9c67-ccbd3083cf6b","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/727b68db-8d93-5ca7-9c67-ccbd3083cf6b/attachment.py","path":"scripts/extract_pdf_images.py","size":7800,"sha256":"b5d02d782dfb8c799597fa1e67442927b0ad6bace71474fc8ebd7077af8c6d48","contentType":"text/x-python; charset=utf-8"},{"id":"82db6b9c-dbf1-5771-b39e-d6719856e3db","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/82db6b9c-dbf1-5771-b39e-d6719856e3db/attachment.py","path":"scripts/merge_outputs.py","size":13559,"sha256":"b174d1d5871e9a393dc4e50da1f14d14f16762ad8d4e64e4a4618d361da4b6b4","contentType":"text/x-python; charset=utf-8"},{"id":"61a34ab7-9f26-573d-b530-2b58414613fb","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/61a34ab7-9f26-573d-b530-2b58414613fb/attachment.py","path":"scripts/test_convert.py","size":9654,"sha256":"a78a88b16f1ecfd7a597855aa2b285dceaf4058567d01ec8900a8d80ee0690f6","contentType":"text/x-python; charset=utf-8"},{"id":"97f26063-b9fb-523d-ab2d-c1b3c3444afd","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/97f26063-b9fb-523d-ab2d-c1b3c3444afd/attachment.py","path":"scripts/validate_output.py","size":16624,"sha256":"f7d89030dff50e3cfa849d929d46564f886fed99543cb20ff01c5c1ec1b9da11","contentType":"text/x-python; 
charset=utf-8"}],"bundle_sha256":"c162f4b4f5aa70dedc9c7872bd9de454e0202170f66ebb201ba0a41bb00b41eb","attachment_count":10,"text_attachments":10,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":0,"excluded_attachments":[]},"cluster_size":1,"skill_md_path":"daymade-docs/doc-to-markdown/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"documents-office","category_label":"Documents"},"exact_dupes_collapsed_into_this":0},"version":"v1","category":"documents-office","import_tag":"clean-skills-v1","description":"Converts DOCX/PDF/PPTX to high-quality Markdown with automatic post-processing. Fixes pandoc grid tables, simple tables, image paths, CJK bold spacing, attribute noise, and code blocks. Benchmarked best-in-class (7.6/10) against Docling, MarkItDown, Pandoc raw, and Mammoth. Trigger on \"convert document\", \"docx to markdown\", \"parse word\", \"doc to markdown\", \"解析word\", \"转换文档\"."}},"renderedAt":1782980950200}
