pdf-harvester — Skillopedia

# PDF Harvester Skill

> Extract and ingest PDF documents into RAG with proper text extraction, table handling, and metadata.

## Overview

PDFs are common for research papers, reports, manuals, and ebooks. This skill covers:

- Text extraction with layout preservation
- Table extraction and conversion to markdown
- Academic paper patterns (abstract, sections, citations)
- OCR for scanned documents
- Multi-page chunking strategies

## Prerequisites

## Extraction Methods

### Method 1: pdfplumber (Recommended)

Best for structured PDFs with tables.

### Method 2: PyMuPDF (fitz)

Faster than pdfplumber; better suited to large PDFs.

### Method 3: OCR…

, # ALL CAPS headings\n r'^(Abstract|Introduction|Conclusion|References)',\n ]\n\n full_text = \"\n\n\".join(p[\"text\"] for p in extracted[\"pages\"])\n\n # Find section boundaries\n sections = []\n current_section = {\"title\": \"Introduction\", \"content\": \"\", \"start_pos\": 0}\n\n lines = full_text.split(\"\n\")\n\n for line in lines:\n is_heading = any(\n re.match(pattern, line.strip())\n for pattern in heading_patterns\n )\n\n if is_heading and current_section[\"content\"].strip():\n sections.append(current_section)\n current_section = {\n \"title\": line.strip(),\n \"content\": \"\",\n \"start_pos\": len(sections)\n }\n else:\n current_section[\"content\"] += line + \"\n\"\n\n # Don't forget last section\n if current_section[\"content\"].strip():\n sections.append(current_section)\n\n return [\n {\n \"content\": s[\"content\"].strip(),\n \"section\": s[\"title\"],\n \"chunk_index\": i\n }\n for i, s in enumerate(sections)\n ]\n```\n\n### Strategy 3: Semantic Paragraphs\n\nChunk by paragraph with size limits.\n\n```python\ndef chunk_by_paragraphs(\n extracted: Dict,\n max_chunk_size: int = 500, # words\n overlap: int = 50\n) -> List[Dict]:\n \"\"\"Chunk by paragraphs with overlap.\"\"\"\n full_text = \"\n\n\".join(p[\"text\"] for p in extracted[\"pages\"])\n\n # Split into paragraphs\n paragraphs = [p.strip() for p in full_text.split(\"\n\n\") if p.strip()]\n\n chunks = []\n current_chunk = []\n current_size = 0\n\n for para in paragraphs:\n para_size = len(para.split())\n\n if current_size + para_size > max_chunk_size and current_chunk:\n # Save current chunk\n chunks.append({\n \"content\": \"\n\n\".join(current_chunk),\n \"chunk_index\": len(chunks),\n \"word_count\": current_size\n })\n\n # Start new chunk with overlap\n overlap_text = current_chunk[-1] if current_chunk else \"\"\n current_chunk = [overlap_text] if overlap_text else []\n current_size = len(overlap_text.split()) if overlap_text else 0\n\n current_chunk.append(para)\n current_size += 
para_size\n\n # Last chunk\n if current_chunk:\n chunks.append({\n \"content\": \"\n\n\".join(current_chunk),\n \"chunk_index\": len(chunks),\n \"word_count\": current_size\n })\n\n return chunks\n```\n\n## Academic Paper Pattern\n\nSpecial handling for research papers.\n\n```python\ndef extract_academic_paper(pdf_path: str) -> Dict:\n \"\"\"\n Extract academic paper with structure detection.\n\n Identifies: title, authors, abstract, sections, references\n \"\"\"\n extracted = extract_pdf_text(pdf_path)\n full_text = \"\n\".join(p[\"text\"] for p in extracted[\"pages\"])\n\n paper = {\n \"title\": \"\",\n \"authors\": [],\n \"abstract\": \"\",\n \"sections\": [],\n \"references\": [],\n \"tables\": extracted[\"tables\"]\n }\n\n # Title is usually first large text\n lines = full_text.split(\"\n\")\n for line in lines[:10]:\n if len(line) > 20 and len(line) \u003c 200:\n paper[\"title\"] = line.strip()\n break\n\n # Abstract\n abstract_match = re.search(\n r'Abstract[:\\s]*\n?(.*?)(?=\n(?:1\\.?\\s+)?Introduction|\n\n[A-Z])',\n full_text,\n re.DOTALL | re.IGNORECASE\n )\n if abstract_match:\n paper[\"abstract\"] = abstract_match.group(1).strip()\n\n # Sections\n section_pattern = r'\n(\\d+\\.?\\s+[A-Z][^\n]+)\n'\n section_matches = re.finditer(section_pattern, full_text)\n\n section_positions = [(m.group(1), m.start()) for m in section_matches]\n\n for i, (title, start) in enumerate(section_positions):\n end = section_positions[i+1][1] if i+1 \u003c len(section_positions) else len(full_text)\n content = full_text[start:end]\n\n paper[\"sections\"].append({\n \"title\": title.strip(),\n \"content\": content.strip()\n })\n\n # References section\n ref_match = re.search(\n r'(?:References|Bibliography)\\s*\n(.*?)
$'
,\n full_text,\n re.DOTALL | re.IGNORECASE\n )\n if ref_match:\n paper[\"references_text\"] = ref_match.group(1).strip()\n\n return paper\n```\n\n## Full Harvesting Pipeline\n\n```python\n#!/usr/bin/env python3\n\"\"\"Complete PDF harvesting pipeline.\"\"\"\n\nfrom datetime import datetime\nfrom pathlib import Path\nfrom typing import Dict, List, Optional\nimport hashlib\n\nasync def harvest_pdf(\n pdf_path: str,\n collection: str,\n chunk_strategy: str = \"paragraphs\", # pages, sections, paragraphs\n is_academic: bool = False,\n use_ocr: bool = False\n) -> Dict:\n \"\"\"\n Harvest a PDF document into RAG.\n\n Args:\n pdf_path: Path to PDF file\n collection: Target RAG collection\n chunk_strategy: How to chunk the document\n is_academic: Use academic paper extraction\n use_ocr: Force OCR extraction\n \"\"\"\n path = Path(pdf_path)\n\n # Check if OCR needed\n if use_ocr or is_scanned_pdf(pdf_path):\n extracted = extract_with_ocr(pdf_path)\n else:\n extracted = extract_pdf_text(pdf_path)\n\n # Get document metadata\n doc_metadata = {\n \"source_type\": \"pdf\",\n \"source_path\": str(path.absolute()),\n \"filename\": path.name,\n \"total_pages\": extracted[\"total_pages\"],\n \"harvested_at\": datetime.now().isoformat(),\n \"pdf_metadata\": extracted.get(\"metadata\", {})\n }\n\n # Academic paper special handling\n if is_academic:\n paper = extract_academic_paper(pdf_path)\n doc_metadata[\"title\"] = paper[\"title\"]\n doc_metadata[\"abstract\"] = paper[\"abstract\"]\n doc_metadata[\"is_academic\"] = True\n\n # Chunk based on strategy\n if chunk_strategy == \"pages\":\n chunks = chunk_by_pages(extracted)\n elif chunk_strategy == \"sections\":\n chunks = chunk_by_sections(extracted)\n else:\n chunks = chunk_by_paragraphs(extracted)\n\n # Generate document ID from content hash\n content_hash = hashlib.md5(\n \"\".join(p[\"text\"] for p in extracted[\"pages\"]).encode()\n ).hexdigest()[:12]\n doc_id = f\"pdf_{content_hash}\"\n\n # Ingest chunks\n ingested = 0\n for 
chunk in chunks:\n chunk_metadata = {\n **doc_metadata,\n \"chunk_index\": chunk[\"chunk_index\"],\n \"total_chunks\": len(chunks),\n }\n\n # Add page info if available\n if \"page_start\" in chunk:\n chunk_metadata[\"page_start\"] = chunk[\"page_start\"]\n chunk_metadata[\"page_end\"] = chunk[\"page_end\"]\n\n # Add section info if available\n if \"section\" in chunk:\n chunk_metadata[\"section\"] = chunk[\"section\"]\n\n await ingest(\n content=chunk[\"content\"],\n collection=collection,\n metadata=chunk_metadata,\n doc_id=f\"{doc_id}_chunk_{chunk['chunk_index']}\"\n )\n ingested += 1\n\n # Ingest tables separately\n for table in extracted.get(\"tables\", []):\n table_metadata = {\n **doc_metadata,\n \"content_type\": \"table\",\n \"page_number\": table[\"page_number\"],\n \"table_number\": table[\"table_number\"]\n }\n\n await ingest(\n content=table[\"markdown\"],\n collection=collection,\n metadata=table_metadata,\n doc_id=f\"{doc_id}_table_{table['page_number']}_{table['table_number']}\"\n )\n\n return {\n \"status\": \"success\",\n \"filename\": path.name,\n \"pages\": extracted[\"total_pages\"],\n \"chunks\": ingested,\n \"tables\": len(extracted.get(\"tables\", [])),\n \"collection\": collection,\n \"doc_id\": doc_id\n }\n\n\nasync def harvest_pdf_url(\n url: str,\n collection: str,\n **kwargs\n) -> Dict:\n \"\"\"Download and harvest a PDF from URL.\"\"\"\n import httpx\n import tempfile\n\n # Download PDF\n async with httpx.AsyncClient() as client:\n response = await client.get(url, follow_redirects=True)\n response.raise_for_status()\n\n # Save to temp file\n with tempfile.NamedTemporaryFile(suffix=\".pdf\", delete=False) as f:\n f.write(response.content)\n temp_path = f.name\n\n try:\n result = await harvest_pdf(temp_path, collection, **kwargs)\n result[\"source_url\"] = url\n return result\n finally:\n Path(temp_path).unlink() # Clean up\n```\n\n## Metadata Schema\n\n```yaml\n# PDF chunk metadata\nsource_type: pdf\nsource_path: 
/path/to/document.pdf\nsource_url: https://... (if downloaded)\nfilename: document.pdf\ntotal_pages: 45\npage_start: 5\npage_end: 7\nsection: \"3. Methodology\"\nchunk_index: 12\ntotal_chunks: 28\nharvested_at: \"2024-01-01T12:00:00Z\"\nis_academic: true\ntitle: \"Paper Title\"\nabstract: \"Paper abstract...\"\ncontent_type: text|table\n```\n\n## Usage Examples\n\n```python\n# Local PDF\nresult = await harvest_pdf(\n pdf_path=\"/path/to/document.pdf\",\n collection=\"research_papers\",\n chunk_strategy=\"sections\",\n is_academic=True\n)\n\n# PDF from URL\nresult = await harvest_pdf_url(\n url=\"https://arxiv.org/pdf/2301.00001.pdf\",\n collection=\"ml_papers\",\n is_academic=True\n)\n\n# Scanned document\nresult = await harvest_pdf(\n pdf_path=\"/path/to/scanned.pdf\",\n collection=\"legacy_docs\",\n use_ocr=True\n)\n```\n\n## Refinement Notes\n\n> Track improvements as you use this skill.\n\n- [ ] Text extraction tested\n- [ ] Table extraction working\n- [ ] OCR fallback tested\n- [ ] Academic paper pattern validated\n- [ ] Chunking strategies compared\n- [ ] Large PDF handling optimized\n---","attachment_filenames":[],"attachments":[],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"PDF Harvester Skill","type":"text"}]},{"type":"blockquote","content":[{"type":"paragraph","content":[{"text":"Extract and ingest PDF documents into RAG with proper text extraction, table handling, and metadata.","type":"text"}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Overview","type":"text"}]},{"type":"paragraph","content":[{"text":"PDFs are common for research papers, reports, manuals, and ebooks. 
This skill covers:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Text extraction with layout preservation","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Table extraction and conversion to markdown","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Academic paper patterns (abstract, sections, citations)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"OCR for scanned documents","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Multi-page chunking strategies","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Prerequisites","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Core extraction\npip install pdfplumber pymupdf\n\n# For OCR (scanned documents)\npip install pytesseract pdf2image\n# Also need: brew install tesseract poppler (macOS)\n\n# For academic papers\npip install arxiv # If fetching from arXiv","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Extraction Methods","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Method 1: pdfplumber (Recommended)","type":"text"}]},{"type":"paragraph","content":[{"text":"Best for structured PDFs with tables.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"#!/usr/bin/env python3\n\"\"\"PDF extraction using pdfplumber.\"\"\"\n\nimport pdfplumber\nfrom pathlib import Path\nfrom typing import Dict, List, Optional\nimport re\n\ndef extract_pdf_text(\n pdf_path: str,\n extract_tables: bool = True\n) -> Dict:\n \"\"\"\n Extract text and tables from PDF.\n\n Args:\n pdf_path: Path to PDF file\n extract_tables: Whether to extract tables separately\n\n Returns:\n Dict with pages, tables, and metadata\n \"\"\"\n 
result = {\n \"pages\": [],\n \"tables\": [],\n \"metadata\": {},\n \"total_pages\": 0\n }\n\n with pdfplumber.open(pdf_path) as pdf:\n result[\"total_pages\"] = len(pdf.pages)\n result[\"metadata\"] = pdf.metadata or {}\n\n for page_num, page in enumerate(pdf.pages, 1):\n # Extract text\n text = page.extract_text() or \"\"\n\n result[\"pages\"].append({\n \"page_number\": page_num,\n \"text\": text,\n \"width\": page.width,\n \"height\": page.height\n })\n\n # Extract tables\n if extract_tables:\n tables = page.extract_tables()\n for table_num, table in enumerate(tables, 1):\n if table and len(table) > 0:\n result[\"tables\"].append({\n \"page_number\": page_num,\n \"table_number\": table_num,\n \"data\": table,\n \"markdown\": table_to_markdown(table)\n })\n\n return result\n\n\ndef table_to_markdown(table: List[List]) -> str:\n \"\"\"Convert table data to markdown format.\"\"\"\n if not table or len(table) == 0:\n return \"\"\n\n # Clean cells\n def clean_cell(cell):\n if cell is None:\n return \"\"\n return str(cell).replace(\"\n\", \" \").strip()\n\n # Header row\n headers = [clean_cell(c) for c in table[0]]\n md = \"| \" + \" | \".join(headers) + \" |\n\"\n md += \"| \" + \" | \".join([\"---\"] * len(headers)) + \" |\n\"\n\n # Data rows\n for row in table[1:]:\n cells = [clean_cell(c) for c in row]\n # Pad if necessary\n while len(cells) \u003c len(headers):\n cells.append(\"\")\n md += \"| \" + \" | \".join(cells[:len(headers)]) + \" |\n\"\n\n return md","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Method 2: PyMuPDF (fitz)","type":"text"}]},{"type":"paragraph","content":[{"text":"Faster, better for large PDFs.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"#!/usr/bin/env python3\n\"\"\"PDF extraction using PyMuPDF.\"\"\"\n\nimport fitz # PyMuPDF\nfrom typing import Dict, List\n\ndef extract_with_pymupdf(pdf_path: str) -> Dict:\n \"\"\"\n Extract text using PyMuPDF.\n\n 
Faster than pdfplumber, good for large documents.\n \"\"\"\n doc = fitz.open(pdf_path)\n\n result = {\n \"pages\": [],\n \"metadata\": doc.metadata,\n \"total_pages\": len(doc)\n }\n\n for page_num, page in enumerate(doc, 1):\n # Get text with layout preservation\n text = page.get_text(\"text\")\n\n # Get text blocks for better structure\n blocks = page.get_text(\"dict\")[\"blocks\"]\n\n result[\"pages\"].append({\n \"page_number\": page_num,\n \"text\": text,\n \"blocks\": len(blocks)\n })\n\n doc.close()\n return result\n\n\ndef extract_with_structure(pdf_path: str) -> Dict:\n \"\"\"Extract with heading detection.\"\"\"\n doc = fitz.open(pdf_path)\n\n pages = []\n for page_num, page in enumerate(doc, 1):\n blocks = page.get_text(\"dict\")[\"blocks\"]\n\n structured_content = []\n for block in blocks:\n if block[\"type\"] == 0: # Text block\n for line in block.get(\"lines\", []):\n for span in line.get(\"spans\", []):\n text = span[\"text\"].strip()\n font_size = span[\"size\"]\n is_bold = \"bold\" in span[\"font\"].lower()\n\n # Detect headings by font size\n if font_size > 14 or is_bold:\n structured_content.append({\n \"type\": \"heading\",\n \"text\": text,\n \"size\": font_size\n })\n else:\n structured_content.append({\n \"type\": \"paragraph\",\n \"text\": text\n })\n\n pages.append({\n \"page_number\": page_num,\n \"content\": structured_content\n })\n\n doc.close()\n return {\"pages\": pages, \"total_pages\": len(pages)}","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Method 3: OCR for Scanned PDFs","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"#!/usr/bin/env python3\n\"\"\"OCR extraction for scanned PDFs.\"\"\"\n\nimport pytesseract\nfrom pdf2image import convert_from_path\nfrom typing import Dict, List\n\ndef extract_with_ocr(\n pdf_path: str,\n language: str = \"eng\",\n dpi: int = 300\n) -> Dict:\n \"\"\"\n Extract text from scanned PDF using OCR.\n\n Args:\n pdf_path: 
Path to PDF\n language: Tesseract language code\n dpi: Resolution for conversion\n \"\"\"\n # Convert PDF pages to images\n images = convert_from_path(pdf_path, dpi=dpi)\n\n pages = []\n for page_num, image in enumerate(images, 1):\n # Run OCR\n text = pytesseract.image_to_string(image, lang=language)\n\n pages.append({\n \"page_number\": page_num,\n \"text\": text,\n \"ocr\": True\n })\n\n return {\n \"pages\": pages,\n \"total_pages\": len(pages),\n \"ocr_used\": True\n }\n\n\ndef is_scanned_pdf(pdf_path: str) -> bool:\n \"\"\"Detect if PDF is scanned (image-based).\"\"\"\n import fitz\n\n doc = fitz.open(pdf_path)\n\n # Check first few pages\n for page in doc[:min(3, len(doc))]:\n text = page.get_text().strip()\n if len(text) > 100: # Has extractable text\n doc.close()\n return False\n\n doc.close()\n return True","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Chunking Strategies","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Strategy 1: Page-Based","type":"text"}]},{"type":"paragraph","content":[{"text":"Simple chunking by page boundaries.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"def chunk_by_pages(\n extracted: Dict,\n pages_per_chunk: int = 1\n) -> List[Dict]:\n \"\"\"Chunk PDF by page boundaries.\"\"\"\n chunks = []\n pages = extracted[\"pages\"]\n\n for i in range(0, len(pages), pages_per_chunk):\n page_group = pages[i:i + pages_per_chunk]\n\n text = \"\n\n\".join(p[\"text\"] for p in page_group)\n\n chunks.append({\n \"content\": text,\n \"page_start\": page_group[0][\"page_number\"],\n \"page_end\": page_group[-1][\"page_number\"],\n \"chunk_index\": len(chunks)\n })\n\n return chunks","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Strategy 2: Section-Based","type":"text"}]},{"type":"paragraph","content":[{"text":"Chunk by document 
sections/headings.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"def chunk_by_sections(\n extracted: Dict,\n heading_patterns: Optional[List[str]] = None\n) -> List[Dict]:\n \"\"\"Chunk PDF by section headings.\"\"\"\n if heading_patterns is None:\n heading_patterns = [\n r'^#+\\s', # Markdown headings\n r'^\\d+\\.\\s+[A-Z]', # Numbered sections\n r'^[A-Z][A-Z\\s]+$'
, # ALL CAPS headings\n r'^(Abstract|Introduction|Conclusion|References)',\n ]\n\n full_text = \"\n\n\".join(p[\"text\"] for p in extracted[\"pages\"])\n\n # Find section boundaries\n sections = []\n current_section = {\"title\": \"Introduction\", \"content\": \"\", \"start_pos\": 0}\n\n lines = full_text.split(\"\n\")\n\n for line in lines:\n is_heading = any(\n re.match(pattern, line.strip())\n for pattern in heading_patterns\n )\n\n if is_heading and current_section[\"content\"].strip():\n sections.append(current_section)\n current_section = {\n \"title\": line.strip(),\n \"content\": \"\",\n \"start_pos\": len(sections)\n }\n else:\n current_section[\"content\"] += line + \"\n\"\n\n # Don't forget last section\n if current_section[\"content\"].strip():\n sections.append(current_section)\n\n return [\n {\n \"content\": s[\"content\"].strip(),\n \"section\": s[\"title\"],\n \"chunk_index\": i\n }\n for i, s in enumerate(sections)\n ]","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Strategy 3: Semantic Paragraphs","type":"text"}]},{"type":"paragraph","content":[{"text":"Chunk by paragraph with size limits.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"def chunk_by_paragraphs(\n extracted: Dict,\n max_chunk_size: int = 500, # words\n overlap: int = 50\n) -> List[Dict]:\n \"\"\"Chunk by paragraphs with overlap.\"\"\"\n full_text = \"\n\n\".join(p[\"text\"] for p in extracted[\"pages\"])\n\n # Split into paragraphs\n paragraphs = [p.strip() for p in full_text.split(\"\n\n\") if p.strip()]\n\n chunks = []\n current_chunk = []\n current_size = 0\n\n for para in paragraphs:\n para_size = len(para.split())\n\n if current_size + para_size > max_chunk_size and current_chunk:\n # Save current chunk\n chunks.append({\n \"content\": \"\n\n\".join(current_chunk),\n \"chunk_index\": len(chunks),\n \"word_count\": current_size\n })\n\n # Start new chunk with overlap\n overlap_text = 
current_chunk[-1] if current_chunk else \"\"\n current_chunk = [overlap_text] if overlap_text else []\n current_size = len(overlap_text.split()) if overlap_text else 0\n\n current_chunk.append(para)\n current_size += para_size\n\n # Last chunk\n if current_chunk:\n chunks.append({\n \"content\": \"\n\n\".join(current_chunk),\n \"chunk_index\": len(chunks),\n \"word_count\": current_size\n })\n\n return chunks","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Academic Paper Pattern","type":"text"}]},{"type":"paragraph","content":[{"text":"Special handling for research papers.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"def extract_academic_paper(pdf_path: str) -> Dict:\n \"\"\"\n Extract academic paper with structure detection.\n\n Identifies: title, authors, abstract, sections, references\n \"\"\"\n extracted = extract_pdf_text(pdf_path)\n full_text = \"\n\".join(p[\"text\"] for p in extracted[\"pages\"])\n\n paper = {\n \"title\": \"\",\n \"authors\": [],\n \"abstract\": \"\",\n \"sections\": [],\n \"references\": [],\n \"tables\": extracted[\"tables\"]\n }\n\n # Title is usually first large text\n lines = full_text.split(\"\n\")\n for line in lines[:10]:\n if len(line) > 20 and len(line) \u003c 200:\n paper[\"title\"] = line.strip()\n break\n\n # Abstract\n abstract_match = re.search(\n r'Abstract[:\\s]*\n?(.*?)(?=\n(?:1\\.?\\s+)?Introduction|\n\n[A-Z])',\n full_text,\n re.DOTALL | re.IGNORECASE\n )\n if abstract_match:\n paper[\"abstract\"] = abstract_match.group(1).strip()\n\n # Sections\n section_pattern = r'\n(\\d+\\.?\\s+[A-Z][^\n]+)\n'\n section_matches = re.finditer(section_pattern, full_text)\n\n section_positions = [(m.group(1), m.start()) for m in section_matches]\n\n for i, (title, start) in enumerate(section_positions):\n end = section_positions[i+1][1] if i+1 \u003c len(section_positions) else len(full_text)\n content = full_text[start:end]\n\n 
paper[\"sections\"].append({\n \"title\": title.strip(),\n \"content\": content.strip()\n })\n\n # References section\n ref_match = re.search(\n r'(?:References|Bibliography)\\s*\n(.*?)$'
,\n full_text,\n re.DOTALL | re.IGNORECASE\n )\n if ref_match:\n paper[\"references_text\"] = ref_match.group(1).strip()\n\n return paper","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Full Harvesting Pipeline","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"#!/usr/bin/env python3\n\"\"\"Complete PDF harvesting pipeline.\"\"\"\n\nfrom datetime import datetime\nfrom pathlib import Path\nfrom typing import Dict, List, Optional\nimport hashlib\n\nasync def harvest_pdf(\n pdf_path: str,\n collection: str,\n chunk_strategy: str = \"paragraphs\", # pages, sections, paragraphs\n is_academic: bool = False,\n use_ocr: bool = False\n) -> Dict:\n \"\"\"\n Harvest a PDF document into RAG.\n\n Args:\n pdf_path: Path to PDF file\n collection: Target RAG collection\n chunk_strategy: How to chunk the document\n is_academic: Use academic paper extraction\n use_ocr: Force OCR extraction\n \"\"\"\n path = Path(pdf_path)\n\n # Check if OCR needed\n if use_ocr or is_scanned_pdf(pdf_path):\n extracted = extract_with_ocr(pdf_path)\n else:\n extracted = extract_pdf_text(pdf_path)\n\n # Get document metadata\n doc_metadata = {\n \"source_type\": \"pdf\",\n \"source_path\": str(path.absolute()),\n \"filename\": path.name,\n \"total_pages\": extracted[\"total_pages\"],\n \"harvested_at\": datetime.now().isoformat(),\n \"pdf_metadata\": extracted.get(\"metadata\", {})\n }\n\n # Academic paper special handling\n if is_academic:\n paper = extract_academic_paper(pdf_path)\n doc_metadata[\"title\"] = paper[\"title\"]\n doc_metadata[\"abstract\"] = paper[\"abstract\"]\n doc_metadata[\"is_academic\"] = True\n\n # Chunk based on strategy\n if chunk_strategy == \"pages\":\n chunks = chunk_by_pages(extracted)\n elif chunk_strategy == \"sections\":\n chunks = chunk_by_sections(extracted)\n else:\n chunks = chunk_by_paragraphs(extracted)\n\n # Generate document ID from content hash\n content_hash = hashlib.md5(\n 
\"\".join(p[\"text\"] for p in extracted[\"pages\"]).encode()\n ).hexdigest()[:12]\n doc_id = f\"pdf_{content_hash}\"\n\n # Ingest chunks\n ingested = 0\n for chunk in chunks:\n chunk_metadata = {\n **doc_metadata,\n \"chunk_index\": chunk[\"chunk_index\"],\n \"total_chunks\": len(chunks),\n }\n\n # Add page info if available\n if \"page_start\" in chunk:\n chunk_metadata[\"page_start\"] = chunk[\"page_start\"]\n chunk_metadata[\"page_end\"] = chunk[\"page_end\"]\n\n # Add section info if available\n if \"section\" in chunk:\n chunk_metadata[\"section\"] = chunk[\"section\"]\n\n await ingest(\n content=chunk[\"content\"],\n collection=collection,\n metadata=chunk_metadata,\n doc_id=f\"{doc_id}_chunk_{chunk['chunk_index']}\"\n )\n ingested += 1\n\n # Ingest tables separately\n for table in extracted.get(\"tables\", []):\n table_metadata = {\n **doc_metadata,\n \"content_type\": \"table\",\n \"page_number\": table[\"page_number\"],\n \"table_number\": table[\"table_number\"]\n }\n\n await ingest(\n content=table[\"markdown\"],\n collection=collection,\n metadata=table_metadata,\n doc_id=f\"{doc_id}_table_{table['page_number']}_{table['table_number']}\"\n )\n\n return {\n \"status\": \"success\",\n \"filename\": path.name,\n \"pages\": extracted[\"total_pages\"],\n \"chunks\": ingested,\n \"tables\": len(extracted.get(\"tables\", [])),\n \"collection\": collection,\n \"doc_id\": doc_id\n }\n\n\nasync def harvest_pdf_url(\n url: str,\n collection: str,\n **kwargs\n) -> Dict:\n \"\"\"Download and harvest a PDF from URL.\"\"\"\n import httpx\n import tempfile\n\n # Download PDF\n async with httpx.AsyncClient() as client:\n response = await client.get(url, follow_redirects=True)\n response.raise_for_status()\n\n # Save to temp file\n with tempfile.NamedTemporaryFile(suffix=\".pdf\", delete=False) as f:\n f.write(response.content)\n temp_path = f.name\n\n try:\n result = await harvest_pdf(temp_path, collection, **kwargs)\n result[\"source_url\"] = url\n return result\n 
finally:\n Path(temp_path).unlink() # Clean up","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Metadata Schema","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"yaml"},"content":[{"text":"# PDF chunk metadata\nsource_type: pdf\nsource_path: /path/to/document.pdf\nsource_url: https://... (if downloaded)\nfilename: document.pdf\ntotal_pages: 45\npage_start: 5\npage_end: 7\nsection: \"3. Methodology\"\nchunk_index: 12\ntotal_chunks: 28\nharvested_at: \"2024-01-01T12:00:00Z\"\nis_academic: true\ntitle: \"Paper Title\"\nabstract: \"Paper abstract...\"\ncontent_type: text|table","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Usage Examples","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"# Local PDF\nresult = await harvest_pdf(\n pdf_path=\"/path/to/document.pdf\",\n collection=\"research_papers\",\n chunk_strategy=\"sections\",\n is_academic=True\n)\n\n# PDF from URL\nresult = await harvest_pdf_url(\n url=\"https://arxiv.org/pdf/2301.00001.pdf\",\n collection=\"ml_papers\",\n is_academic=True\n)\n\n# Scanned document\nresult = await harvest_pdf(\n pdf_path=\"/path/to/scanned.pdf\",\n collection=\"legacy_docs\",\n use_ocr=True\n)","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Refinement Notes","type":"text"}]},{"type":"blockquote","content":[{"type":"paragraph","content":[{"text":"Track improvements as you use this skill.","type":"text"}]}]},{"type":"checkbox_list","attrs":{"id":null},"content":[{"type":"checkbox_item","attrs":{"checked":false},"content":[{"type":"paragraph","content":[{"text":"Text extraction tested","type":"text"}]}]},{"type":"checkbox_item","attrs":{"checked":false},"content":[{"type":"paragraph","content":[{"text":"Table extraction working","type":"text"}]}]},{"type":"checkbox_item","attrs":{"checked":false},"content":[{"type":"paragraph","content":[{"text":"OCR fallback 
tested","type":"text"}]}]},{"type":"checkbox_item","attrs":{"checked":false},"content":[{"type":"paragraph","content":[{"text":"Academic paper pattern validated","type":"text"}]}]},{"type":"checkbox_item","attrs":{"checked":false},"content":[{"type":"paragraph","content":[{"text":"Chunking strategies compared","type":"text"}]}]},{"type":"checkbox_item","attrs":{"checked":false},"content":[{"type":"paragraph","content":[{"text":"Large PDF handling optimized","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"pdf-harvester","author":"@skillopedia","source":{"stars":2,"repo_name":"reflex","origin_url":"https://github.com/mindmorass/reflex/blob/HEAD/plugins/reflex/skills/pdf-harvester/SKILL.md","repo_owner":"mindmorass","body_sha256":"5a0aa29428421466cf7c58b828cb428f2c344d8a58c0a4d133e9b05ce9156811","cluster_key":"415824e2cad4429ddf7f4834fdcd989b882c912677e67301950209a1755c5758","clean_bundle":{"format":"clean-skill-bundle-v1","source":"mindmorass/reflex/plugins/reflex/skills/pdf-harvester/SKILL.md","bundle_sha256":"146baed2aa2493bfdd7dd0276084181dce430d2a7de90723ee17ace6ccc66fa7","attachment_count":0,"text_attachments":0,"binary_attachments":0},"cluster_size":1,"skill_md_path":"plugins/reflex/skills/pdf-harvester/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"data-analytics","category_label":"Data"},"exact_dupes_collapsed_into_this":0},"version":"v1","category":"data-analytics","import_tag":"clean-skills-v1","description":"Extract text and data from PDF documents"}},"renderedAt":1782981232138}
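
The `chunk_by_paragraphs` function above accepts an `overlap` argument but never reads it: each new chunk is seeded with the entire last paragraph of the previous chunk, however long that paragraph is. A minimal sketch of a variant that carries over at most `overlap` trailing words might look like the following. The function name `chunk_paragraphs_with_word_overlap` and the plain-string input are assumptions made for illustration and are not part of the original skill.

```python
from typing import Dict, List


def chunk_paragraphs_with_word_overlap(
    text: str,
    max_chunk_size: int = 500,  # soft limit, in words
    overlap: int = 50,          # trailing words repeated in the next chunk
) -> List[Dict]:
    """Group paragraphs into chunks, repeating up to `overlap` trailing words.

    Hypothetical helper: mirrors the output keys of chunk_by_paragraphs
    (content, chunk_index, word_count) but takes the full text as a string.
    """
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chunk_size")

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks: List[Dict] = []
    current: List[str] = []   # pieces of the chunk being built
    current_size = 0          # word count across `current`
    has_new_text = False      # True once `current` holds more than the overlap seed

    def flush() -> None:
        chunks.append({
            "content": "\n\n".join(current),
            "chunk_index": len(chunks),
            "word_count": current_size,
        })

    for para in paragraphs:
        para_size = len(para.split())

        if has_new_text and current_size + para_size > max_chunk_size:
            flush()
            # Seed the next chunk with the last `overlap` words only.
            # The explicit check matters: words[-0:] would return every word.
            tail = "\n\n".join(current).split()[-overlap:] if overlap else []
            current = [" ".join(tail)] if tail else []
            current_size = len(tail)
            has_new_text = False

        current.append(para)
        current_size += para_size
        has_new_text = True

    # Emit the final chunk unless it would contain nothing but overlap.
    if has_new_text:
        flush()

    return chunks
```

As in the original function, `max_chunk_size` is a soft limit here: a single paragraph longer than the limit, or an overlap seed followed by a long paragraph, still yields one oversized chunk rather than being split mid-paragraph.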

Important: agents should read /llm.txt, /llms.txt, or /.well-known/skills.json to discover the public Skillopedia API.