PDF Processing Pro — Skillopedia

# PDF Processing Pro

Production-ready PDF processing toolkit with pre-built scripts, comprehensive error handling, and support for complex workflows.

## Quick start

- Extract text from PDF
- Analyze PDF form (using included script)
- Fill PDF form with validation
- Extract tables from PDF

## Features

### ✅ Production-ready scripts

All scripts include:

- **Error handling**: Graceful failures with detailed error messages
- **Validation**: Input validation and type checking
- **Logging**: Configurable logging with timestamps
- **Type hints**: Full type annotations for IDE support
- **CLI interface**: flag for all scripts
- Exit c…

\n return re.match(pattern, value) is not None\n\n elif format_type == \"phone\":\n # US phone: (555) 123-4567 or 555-123-4567\n pattern = r'^\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}$'


\n return re.match(pattern, value) is not None\n\n elif format_type == \"MM/DD/YYYY\":\n try:\n datetime.strptime(value, \"%m/%d/%Y\")\n return True\n except ValueError:\n return False\n\n elif format_type == \"SSN\":\n # XXX-XX-XXXX\n pattern = r'^\\d{3}-\\d{2}-\\d{4}$'


\n return re.match(pattern, value) is not None\n\n elif format_type == \"ZIP\":\n # XXXXX or XXXXX-XXXX\n pattern = r'^\\d{5}(-\\d{4})?$'


\n return re.match(pattern, value) is not None\n\n return True # Unknown format, skip validation\n```\n\n## Multi-page forms\n\n### Handling multi-page forms\n\n```python\nfrom pypdf import PdfReader, PdfWriter\n\nreader = PdfReader(\"multi_page_form.pdf\")\nwriter = PdfWriter()\n\n# Clone all pages\nfor page in reader.pages:\n writer.add_page(page)\n\n# Fill fields on page 1\nwriter.update_page_form_field_values(\n writer.pages[0],\n {\n \"name_page1\": \"John Doe\",\n \"email_page1\": \"[email protected]\"\n }\n)\n\n# Fill fields on page 2\nwriter.update_page_form_field_values(\n writer.pages[1],\n {\n \"address_page2\": \"123 Main St\",\n \"city_page2\": \"Springfield\"\n }\n)\n\n# Fill fields on page 3\nwriter.update_page_form_field_values(\n writer.pages[2],\n {\n \"signature_page3\": \"John Doe\",\n \"date_page3\": \"12/25/2024\"\n }\n)\n\nwith open(\"filled_multi_page.pdf\", \"wb\") as output:\n writer.write(output)\n```\n\n### Identifying page-specific fields\n\n```python\n# Analyze which fields are on which pages\nfor page_num, page in enumerate(reader.pages, 1):\n fields = page.get(\"/Annots\", [])\n\n if fields:\n print(f\"\\nPage {page_num} fields:\")\n for field_ref in fields:\n field = field_ref.get_object()\n field_name = field.get(\"/T\", \"Unknown\")\n print(f\" - {field_name}\")\n```\n\n## Flattening forms\n\n### Why flatten\n\nFlattening makes form fields non-editable, embedding values permanently:\n\n- **Security**: Prevent modifications\n- **Distribution**: Share read-only forms\n- **Printing**: Ensure correct appearance\n- **Archival**: Long-term storage\n\n### Flatten with pypdf\n\n```python\nfrom pypdf import PdfReader, PdfWriter\n\nreader = PdfReader(\"filled.pdf\")\nwriter = PdfWriter()\n\n# Add all pages\nfor page in reader.pages:\n writer.add_page(page)\n\n# Flatten all form fields\nwriter.flatten_fields()\n\n# Save flattened PDF\nwith open(\"flattened.pdf\", \"wb\") as output:\n writer.write(output)\n```\n\n### Using included 
script\n\n```bash\npython scripts/flatten_form.py filled.pdf flattened.pdf\n```\n\n## Error handling patterns\n\n### Robust form filling\n\n```python\nimport logging\nfrom pathlib import Path\nfrom pypdf import PdfReader, PdfWriter\nfrom pypdf.errors import PdfReadError\n\nlogging.basicConfig(level=logging.INFO)\nlogger = logging.getLogger(__name__)\n\ndef fill_form_safe(template_path, data, output_path):\n \"\"\"Fill form with comprehensive error handling.\"\"\"\n\n try:\n # Validate inputs\n template = Path(template_path)\n if not template.exists():\n raise FileNotFoundError(f\"Template not found: {template_path}\")\n\n # Read template\n logger.info(f\"Reading template: {template_path}\")\n reader = PdfReader(template_path)\n\n if not reader.pages:\n raise ValueError(\"PDF has no pages\")\n\n # Check if form has fields\n fields = reader.get_fields()\n if not fields:\n logger.warning(\"PDF has no form fields\")\n return False\n\n # Create writer\n writer = PdfWriter()\n for page in reader.pages:\n writer.add_page(page)\n\n # Validate data against schema\n missing_required = []\n invalid_fields = []\n\n for field_name, field_info in fields.items():\n # Check required fields\n is_required = field_info.get(\"/Ff\", 0) & 2 == 2\n if is_required and field_name not in data:\n missing_required.append(field_name)\n\n # Check invalid field names in data\n if field_name in data:\n value = data[field_name]\n # Add type validation here if needed\n\n if missing_required:\n raise ValueError(f\"Missing required fields: {missing_required}\")\n\n # Fill fields\n logger.info(\"Filling form fields\")\n writer.update_page_form_field_values(\n writer.pages[0],\n data\n )\n\n # Write output\n logger.info(f\"Writing output: {output_path}\")\n with open(output_path, \"wb\") as output:\n writer.write(output)\n\n logger.info(\"Form filled successfully\")\n return True\n\n except PdfReadError as e:\n logger.error(f\"PDF read error: {e}\")\n return False\n\n except FileNotFoundError as e:\n 
logger.error(f\"File error: {e}\")\n return False\n\n except ValueError as e:\n logger.error(f\"Validation error: {e}\")\n return False\n\n except Exception as e:\n logger.error(f\"Unexpected error: {e}\")\n return False\n\n# Usage\nsuccess = fill_form_safe(\n \"template.pdf\",\n {\"name\": \"John\", \"email\": \"[email protected]\"},\n \"filled.pdf\"\n)\n\nif not success:\n exit(1)\n```\n\n## Production examples\n\n### Example 1: Batch form processing\n\n```python\nimport json\nimport glob\nfrom pathlib import Path\nfrom fill_form_safe import fill_form_safe\n\n# Process multiple submissions\nsubmissions_dir = Path(\"submissions\")\ntemplate = \"application_template.pdf\"\noutput_dir = Path(\"completed\")\noutput_dir.mkdir(exist_ok=True)\n\nfor submission_file in submissions_dir.glob(\"*.json\"):\n print(f\"Processing: {submission_file.name}\")\n\n # Load submission data\n with open(submission_file) as f:\n data = json.load(f)\n\n # Fill form\n applicant_id = data.get(\"id\", \"unknown\")\n output_file = output_dir / f\"application_{applicant_id}.pdf\"\n\n success = fill_form_safe(template, data, output_file)\n\n if success:\n print(f\" ✓ Completed: {output_file}\")\n else:\n print(f\" ✗ Failed: {submission_file.name}\")\n```\n\n### Example 2: Form with conditional logic\n\n```python\ndef prepare_form_data(raw_data):\n \"\"\"Prepare form data with conditional logic.\"\"\"\n\n form_data = {}\n\n # Basic fields\n form_data[\"full_name\"] = raw_data[\"name\"]\n form_data[\"email\"] = raw_data[\"email\"]\n\n # Conditional fields\n if raw_data.get(\"is_student\"):\n form_data[\"student_id\"] = raw_data[\"student_id\"]\n form_data[\"school_name\"] = raw_data[\"school\"]\n else:\n form_data[\"employer\"] = raw_data.get(\"employer\", \"\")\n\n # Checkbox logic\n form_data[\"newsletter\"] = \"/Yes\" if raw_data.get(\"opt_in\") else \"/Off\"\n\n # Calculated fields\n total = sum(raw_data.get(\"items\", []))\n form_data[\"total_amount\"] = f\"${total:.2f}\"\n\n return 
form_data\n\n# Usage\nraw_input = {\n \"name\": \"Jane Smith\",\n \"email\": \"[email protected]\",\n \"is_student\": True,\n \"student_id\": \"12345\",\n \"school\": \"State University\",\n \"opt_in\": True,\n \"items\": [10.00, 25.50, 15.75]\n}\n\nform_data = prepare_form_data(raw_input)\nfill_form_safe(\"template.pdf\", form_data, \"output.pdf\")\n```\n\n## Best practices\n\n1. **Always analyze before filling**: Use `analyze_form.py` to understand structure\n2. **Validate early**: Check data before attempting to fill\n3. **Use logging**: Track operations for debugging\n4. **Handle errors gracefully**: Don't crash on invalid data\n5. **Test with samples**: Verify with small datasets first\n6. **Flatten when distributing**: Make read-only for recipients\n7. **Keep templates versioned**: Track form template changes\n8. **Document field mappings**: Maintain data-to-field documentation\n\n## Troubleshooting\n\n### Fields not filling\n\n1. Check field names match exactly (case-sensitive)\n2. Verify checkbox/radio values (`/Yes`, `/On`, etc.)\n3. Ensure PDF is not encrypted or protected\n4. 
Check if form uses XFA format (not supported by pypdf)\n\n### Encoding issues\n\n```python\n# Handle special characters\nfield_values[\"name\"] = \"José García\" # UTF-8 encoded\n```\n\n### Large batch processing\n\n```python\n# Process in chunks to avoid memory issues\nchunk_size = 100\n\nfor i in range(0, len(submissions), chunk_size):\n chunk = submissions[i:i + chunk_size]\n process_batch(chunk)\n```\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":14006,"content_sha256":"ea43f2930d53347e0cf19a205bf94315737cbb034153181ebdff6d037bac8d24"},{"filename":"OCR.md","content":"# PDF OCR Processing Guide\n\nExtract text from scanned PDFs and image-based documents.\n\n## Quick start\n\n```python\nimport pytesseract\nfrom pdf2image import convert_from_path\nfrom PIL import Image\n\n# Convert PDF to images\nimages = convert_from_path(\"scanned.pdf\")\n\n# Extract text from each page\nfor i, image in enumerate(images):\n text = pytesseract.image_to_string(image)\n print(f\"Page {i+1}:\\n{text}\\n\")\n```\n\n## Installation\n\n### Install Tesseract\n\n**macOS:**\n```bash\nbrew install tesseract\n```\n\n**Ubuntu/Debian:**\n```bash\nsudo apt-get install tesseract-ocr\n```\n\n**Windows:**\nDownload from: https://github.com/UB-Mannheim/tesseract/wiki\n\n### Install Python packages\n\n```bash\npip install pytesseract pdf2image pillow\n```\n\n## Language support\n\n```python\n# English (default)\ntext = pytesseract.image_to_string(image, lang=\"eng\")\n\n# Spanish\ntext = pytesseract.image_to_string(image, lang=\"spa\")\n\n# Multiple languages\ntext = pytesseract.image_to_string(image, lang=\"eng+spa+fra\")\n```\n\nInstall additional languages:\n```bash\n# macOS\nbrew install tesseract-lang\n\n# Ubuntu\nsudo apt-get install tesseract-ocr-spa tesseract-ocr-fra\n```\n\n## Image preprocessing\n\n```python\nfrom PIL import Image, ImageEnhance, ImageFilter\n\ndef preprocess_for_ocr(image):\n \"\"\"Optimize image for better OCR accuracy.\"\"\"\n\n # Convert 
to grayscale\n image = image.convert(\"L\")\n\n # Increase contrast\n enhancer = ImageEnhance.Contrast(image)\n image = enhancer.enhance(2.0)\n\n # Denoise\n image = image.filter(ImageFilter.MedianFilter())\n\n # Sharpen\n image = image.filter(ImageFilter.SHARPEN)\n\n return image\n\n# Usage\nimage = Image.open(\"scanned_page.png\")\nprocessed = preprocess_for_ocr(image)\ntext = pytesseract.image_to_string(processed)\n```\n\n## Best practices\n\n1. **Preprocess images** for better accuracy\n2. **Use appropriate language** models\n3. **Batch process** large documents\n4. **Cache results** to avoid re-processing\n5. **Validate output** - OCR is not 100% accurate\n6. **Consider confidence scores** for quality checks\n\n## Production example\n\n```python\nimport pytesseract\nfrom pdf2image import convert_from_path\nfrom PIL import Image\n\ndef ocr_pdf(pdf_path, output_path):\n \"\"\"OCR PDF and save to text file.\"\"\"\n\n # Convert to images\n images = convert_from_path(pdf_path, dpi=300)\n\n full_text = []\n\n for i, image in enumerate(images, 1):\n print(f\"Processing page {i}/{len(images)}\")\n\n # Preprocess\n processed = preprocess_for_ocr(image)\n\n # OCR\n text = pytesseract.image_to_string(processed, lang=\"eng\")\n full_text.append(f\"--- Page {i} ---\\n{text}\\n\")\n\n # Save\n with open(output_path, \"w\", encoding=\"utf-8\") as f:\n f.write(\"\\n\".join(full_text))\n\n print(f\"Saved to {output_path}\")\n\n# Usage\nocr_pdf(\"scanned_document.pdf\", \"extracted_text.txt\")\n```\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":2828,"content_sha256":"f28c254d9f15eed42233a3e806f722f3a92c35d429df0b50641e7ef0efda10fb"},{"filename":"scripts/analyze_form.py","content":"#!/usr/bin/env python3\n\"\"\"\nAnalyze PDF form fields and structure.\n\nUsage:\n python analyze_form.py input.pdf [--output fields.json] [--verbose]\n\nReturns:\n JSON with all form fields, types, positions, and metadata\n\nExit codes:\n 0 - Success\n 1 - File not 
found\n 2 - Invalid PDF\n 3 - Processing error\n\"\"\"\n\nimport sys\nimport json\nimport logging\nimport argparse\nfrom pathlib import Path\nfrom typing import Dict, List, Optional, Any\n\ntry:\n from pypdf import PdfReader\nexcept ImportError:\n print(\"Error: pypdf not installed. Run: pip install pypdf\", file=sys.stderr)\n sys.exit(3)\n\n# Configure logging\nlogging.basicConfig(\n level=logging.INFO,\n format='%(asctime)s - %(levelname)s - %(message)s'\n)\nlogger = logging.getLogger(__name__)\n\n\nclass FormField:\n \"\"\"Represents a PDF form field.\"\"\"\n\n def __init__(self, name: str, field_dict: Dict[str, Any]):\n self.name = name\n self.raw_data = field_dict\n\n @property\n def field_type(self) -> str:\n \"\"\"Get field type.\"\"\"\n ft = self.raw_data.get('/FT', '')\n type_map = {\n '/Tx': 'text',\n '/Btn': 'button', # checkbox or radio\n '/Ch': 'choice', # dropdown or list\n '/Sig': 'signature'\n }\n return type_map.get(ft, 'unknown')\n\n @property\n def value(self) -> Optional[str]:\n \"\"\"Get current field value.\"\"\"\n val = self.raw_data.get('/V')\n return str(val) if val else None\n\n @property\n def default_value(self) -> Optional[str]:\n \"\"\"Get default field value.\"\"\"\n dv = self.raw_data.get('/DV')\n return str(dv) if dv else None\n\n @property\n def is_required(self) -> bool:\n \"\"\"Check if field is required.\"\"\"\n flags = self.raw_data.get('/Ff', 0)\n # Bit 2 indicates required\n return bool(flags & 2)\n\n @property\n def is_readonly(self) -> bool:\n \"\"\"Check if field is read-only.\"\"\"\n flags = self.raw_data.get('/Ff', 0)\n # Bit 1 indicates read-only\n return bool(flags & 1)\n\n @property\n def options(self) -> List[str]:\n \"\"\"Get options for choice fields.\"\"\"\n if self.field_type != 'choice':\n return []\n\n opts = self.raw_data.get('/Opt', [])\n if isinstance(opts, list):\n return [str(opt) for opt in opts]\n return []\n\n @property\n def max_length(self) -> Optional[int]:\n \"\"\"Get max length for text 
fields.\"\"\"\n if self.field_type == 'text':\n return self.raw_data.get('/MaxLen')\n return None\n\n @property\n def rect(self) -> Optional[List[float]]:\n \"\"\"Get field position and size [x0, y0, x1, y1].\"\"\"\n return self.raw_data.get('/Rect')\n\n def to_dict(self) -> Dict[str, Any]:\n \"\"\"Convert to dictionary.\"\"\"\n result = {\n 'name': self.name,\n 'type': self.field_type,\n 'required': self.is_required,\n 'readonly': self.is_readonly\n }\n\n if self.value is not None:\n result['value'] = self.value\n\n if self.default_value is not None:\n result['default_value'] = self.default_value\n\n if self.options:\n result['options'] = self.options\n\n if self.max_length is not None:\n result['max_length'] = self.max_length\n\n if self.rect:\n result['position'] = {\n 'x0': float(self.rect[0]),\n 'y0': float(self.rect[1]),\n 'x1': float(self.rect[2]),\n 'y1': float(self.rect[3]),\n 'width': float(self.rect[2] - self.rect[0]),\n 'height': float(self.rect[3] - self.rect[1])\n }\n\n return result\n\n\nclass PDFFormAnalyzer:\n \"\"\"Analyzes PDF forms and extracts field information.\"\"\"\n\n def __init__(self, pdf_path: str):\n self.pdf_path = Path(pdf_path)\n self.reader: Optional[PdfReader] = None\n self._validate_file()\n\n def _validate_file(self) -> None:\n \"\"\"Validate PDF file exists and is readable.\"\"\"\n if not self.pdf_path.exists():\n logger.error(f\"PDF not found: {self.pdf_path}\")\n raise FileNotFoundError(f\"PDF not found: {self.pdf_path}\")\n\n if not self.pdf_path.is_file():\n logger.error(f\"Not a file: {self.pdf_path}\")\n raise ValueError(f\"Not a file: {self.pdf_path}\")\n\n if self.pdf_path.suffix.lower() != '.pdf':\n logger.error(f\"Not a PDF file: {self.pdf_path}\")\n raise ValueError(f\"Not a PDF file: {self.pdf_path}\")\n\n def analyze(self) -> Dict[str, Dict[str, Any]]:\n \"\"\"\n Analyze PDF and extract all form fields.\n\n Returns:\n Dictionary mapping field names to field information\n \"\"\"\n try:\n self.reader = 
PdfReader(str(self.pdf_path))\n\n if not self.reader.pages:\n logger.warning(\"PDF has no pages\")\n return {}\n\n logger.info(f\"Analyzing PDF with {len(self.reader.pages)} pages\")\n\n # Get form fields\n raw_fields = self.reader.get_fields()\n\n if not raw_fields:\n logger.warning(\"PDF has no form fields\")\n return {}\n\n logger.info(f\"Found {len(raw_fields)} form fields\")\n\n # Process fields\n fields = {}\n for field_name, field_dict in raw_fields.items():\n try:\n field = FormField(field_name, field_dict)\n fields[field_name] = field.to_dict()\n except Exception as e:\n logger.warning(f\"Error processing field {field_name}: {e}\")\n continue\n\n return fields\n\n except Exception as e:\n logger.error(f\"Error analyzing PDF: {e}\")\n raise\n\n def get_summary(self) -> Dict[str, Any]:\n \"\"\"Get summary statistics.\"\"\"\n fields = self.analyze()\n\n summary = {\n 'total_fields': len(fields),\n 'field_types': {},\n 'required_fields': [],\n 'readonly_fields': [],\n 'fields_with_values': []\n }\n\n for field_name, field_data in fields.items():\n # Count by type\n field_type = field_data['type']\n summary['field_types'][field_type] = summary['field_types'].get(field_type, 0) + 1\n\n # Required fields\n if field_data.get('required'):\n summary['required_fields'].append(field_name)\n\n # Read-only fields\n if field_data.get('readonly'):\n summary['readonly_fields'].append(field_name)\n\n # Fields with values\n if field_data.get('value'):\n summary['fields_with_values'].append(field_name)\n\n return summary\n\n\ndef main():\n \"\"\"Main entry point.\"\"\"\n parser = argparse.ArgumentParser(\n description='Analyze PDF form fields',\n formatter_class=argparse.RawDescriptionHelpFormatter,\n epilog='''\nExamples:\n %(prog)s form.pdf\n %(prog)s form.pdf --output fields.json\n %(prog)s form.pdf --output fields.json --verbose\n %(prog)s form.pdf --summary\n\nExit codes:\n 0 - Success\n 1 - File not found\n 2 - Invalid PDF\n 3 - Processing error\n '''\n )\n\n 
parser.add_argument('input', help='Input PDF file')\n parser.add_argument('--output', '-o', help='Output JSON file (default: stdout)')\n parser.add_argument('--summary', '-s', action='store_true', help='Show summary only')\n parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output')\n\n args = parser.parse_args()\n\n # Set log level\n if args.verbose:\n logger.setLevel(logging.DEBUG)\n else:\n logger.setLevel(logging.WARNING)\n\n try:\n # Analyze form\n analyzer = PDFFormAnalyzer(args.input)\n\n if args.summary:\n result = analyzer.get_summary()\n else:\n result = analyzer.analyze()\n\n # Output\n json_output = json.dumps(result, indent=2)\n\n if args.output:\n with open(args.output, 'w', encoding='utf-8') as f:\n f.write(json_output)\n logger.info(f\"Saved to {args.output}\")\n else:\n print(json_output)\n\n return 0\n\n except FileNotFoundError:\n logger.error(f\"File not found: {args.input}\")\n return 1\n\n except ValueError as e:\n logger.error(f\"Invalid input: {e}\")\n return 2\n\n except Exception as e:\n logger.error(f\"Error: {e}\")\n if args.verbose:\n import traceback\n traceback.print_exc()\n return 3\n\n\nif __name__ == '__main__':\n sys.exit(main())\n","content_type":"text/x-python; charset=utf-8","language":"python","size":8684,"content_sha256":"694c93c3e1dec5dc6a6e4ebdc2548c8a9200c52a9b261399893d0592831f4395"},{"filename":"TABLES.md","content":"# PDF Table Extraction Guide\n\nAdvanced table extraction strategies for production environments.\n\n## Table of contents\n\n- Basic table extraction\n- Multi-page tables\n- Complex table structures\n- Export formats\n- Table detection algorithms\n- Custom extraction rules\n- Performance optimization\n- Production examples\n\n## Basic table extraction\n\n### Using pdfplumber (recommended)\n\n```python\nimport pdfplumber\n\nwith pdfplumber.open(\"report.pdf\") as pdf:\n page = pdf.pages[0]\n tables = page.extract_tables()\n\n for i, table in enumerate(tables):\n print(f\"\\nTable {i + 
1}:\")\n for row in table:\n print(row)\n```\n\n### Using included script\n\n```bash\npython scripts/extract_tables.py report.pdf --output tables.csv\n```\n\nOutput:\n```csv\nName,Age,City\nJohn Doe,30,New York\nJane Smith,25,Los Angeles\nBob Johnson,35,Chicago\n```\n\n## Table extraction strategies\n\n### Strategy 1: Automatic detection\n\nLet pdfplumber auto-detect tables:\n\n```python\nimport pdfplumber\n\nwith pdfplumber.open(\"document.pdf\") as pdf:\n for page_num, page in enumerate(pdf.pages, 1):\n tables = page.extract_tables()\n\n if tables:\n print(f\"Found {len(tables)} table(s) on page {page_num}\")\n\n for table_num, table in enumerate(tables, 1):\n print(f\"\\nTable {table_num}:\")\n # First row is usually headers\n headers = table[0]\n print(f\"Columns: {headers}\")\n\n # Data rows\n for row in table[1:]:\n print(row)\n```\n\n### Strategy 2: Custom table settings\n\nFine-tune detection with custom settings:\n\n```python\nimport pdfplumber\n\ntable_settings = {\n \"vertical_strategy\": \"lines\", # or \"text\", \"lines_strict\"\n \"horizontal_strategy\": \"lines\",\n \"explicit_vertical_lines\": [],\n \"explicit_horizontal_lines\": [],\n \"snap_tolerance\": 3,\n \"join_tolerance\": 3,\n \"edge_min_length\": 3,\n \"min_words_vertical\": 3,\n \"min_words_horizontal\": 1,\n \"keep_blank_chars\": False,\n \"text_tolerance\": 3,\n \"text_x_tolerance\": 3,\n \"text_y_tolerance\": 3,\n \"intersection_tolerance\": 3\n}\n\nwith pdfplumber.open(\"document.pdf\") as pdf:\n page = pdf.pages[0]\n tables = page.extract_tables(table_settings=table_settings)\n```\n\n### Strategy 3: Explicit boundaries\n\nDefine table boundaries manually:\n\n```python\nimport pdfplumber\n\nwith pdfplumber.open(\"document.pdf\") as pdf:\n page = pdf.pages[0]\n\n # Define bounding box (x0, top, x1, bottom)\n bbox = (50, 100, 550, 700)\n\n # Extract table within bounding box\n cropped = page.within_bbox(bbox)\n tables = cropped.extract_tables()\n```\n\n## Multi-page tables\n\n### Detect 
and merge multi-page tables\n\n```python\nimport pdfplumber\n\ndef extract_multipage_table(pdf_path, start_page=0, end_page=None):\n \"\"\"Extract table that spans multiple pages.\"\"\"\n\n all_rows = []\n headers = None\n\n with pdfplumber.open(pdf_path) as pdf:\n pages = pdf.pages[start_page:end_page]\n\n for page_num, page in enumerate(pages):\n tables = page.extract_tables()\n\n if not tables:\n continue\n\n # Assume first table on page\n table = tables[0]\n\n if headers is None:\n # First page that contains a table: capture headers and data\n headers = table[0]\n all_rows.extend(table[1:])\n else:\n # Subsequent pages: skip headers if they repeat\n if table[0] == headers:\n all_rows.extend(table[1:])\n else:\n all_rows.extend(table)\n\n return [headers] + all_rows if headers else all_rows\n\n# Usage\ntable = extract_multipage_table(\"report.pdf\", start_page=2, end_page=5)\n\nprint(f\"Extracted {len(table) - 1} rows\")\nprint(f\"Columns: {table[0]}\")\n```\n\n## Complex table structures\n\n### Handling merged cells\n\n```python\nimport pdfplumber\n\ndef handle_merged_cells(table):\n \"\"\"Process table with merged cells.\"\"\"\n\n processed = []\n\n for row in table:\n new_row = []\n last_value = None\n\n for cell in row:\n if cell is None or cell == \"\":\n # Merged cell - use value from left\n new_row.append(last_value)\n else:\n new_row.append(cell)\n last_value = cell\n\n processed.append(new_row)\n\n return processed\n\n# Usage\nwith pdfplumber.open(\"document.pdf\") as pdf:\n table = pdf.pages[0].extract_tables()[0]\n clean_table = handle_merged_cells(table)\n```\n\n### Nested tables\n\n```python\ndef extract_nested_tables(page, bbox):\n \"\"\"Extract nested tables from a region.\"\"\"\n\n cropped = page.within_bbox(bbox)\n\n # Try to detect sub-regions with tables\n tables = cropped.extract_tables()\n\n result = []\n for table in tables:\n # Process each nested table\n if table:\n result.append({\n \"type\": \"nested\",\n \"data\": table\n })\n\n return result\n```\n\n### 
Tables with varying column counts\n\n```python\ndef normalize_table_columns(table):\n \"\"\"Normalize table with inconsistent column counts.\"\"\"\n\n if not table:\n return table\n\n # Find max column count\n max_cols = max(len(row) for row in table)\n\n # Pad short rows\n normalized = []\n for row in table:\n if len(row) \u003c max_cols:\n # Pad with empty strings\n row = row + [\"\"] * (max_cols - len(row))\n normalized.append(row)\n\n return normalized\n```\n\n## Export formats\n\n### Export to CSV\n\n```python\nimport csv\n\ndef export_to_csv(table, output_path):\n \"\"\"Export table to CSV.\"\"\"\n\n with open(output_path, \"w\", newline=\"\", encoding=\"utf-8\") as f:\n writer = csv.writer(f)\n writer.writerows(table)\n\n# Usage\ntable = extract_table(\"report.pdf\")\nexport_to_csv(table, \"output.csv\")\n```\n\n### Export to Excel\n\n```python\nimport pandas as pd\n\ndef export_to_excel(tables, output_path):\n \"\"\"Export multiple tables to Excel with sheets.\"\"\"\n\n with pd.ExcelWriter(output_path, engine=\"openpyxl\") as writer:\n for i, table in enumerate(tables):\n if not table:\n continue\n\n # Convert to DataFrame\n headers = table[0]\n data = table[1:]\n df = pd.DataFrame(data, columns=headers)\n\n # Write to sheet\n sheet_name = f\"Table_{i + 1}\"\n df.to_excel(writer, sheet_name=sheet_name, index=False)\n\n # Auto-adjust column widths\n worksheet = writer.sheets[sheet_name]\n for column in worksheet.columns:\n max_length = 0\n column_letter = column[0].column_letter\n for cell in column:\n if len(str(cell.value)) > max_length:\n max_length = len(str(cell.value))\n worksheet.column_dimensions[column_letter].width = max_length + 2\n\n# Usage\ntables = extract_all_tables(\"report.pdf\")\nexport_to_excel(tables, \"output.xlsx\")\n```\n\n### Export to JSON\n\n```python\nimport json\n\ndef export_to_json(table, output_path):\n \"\"\"Export table to JSON.\"\"\"\n\n if not table:\n return\n\n headers = table[0]\n data = table[1:]\n\n # Convert to list 
of dictionaries\n records = []\n for row in data:\n record = {}\n for i, header in enumerate(headers):\n value = row[i] if i \u003c len(row) else None\n record[header] = value\n records.append(record)\n\n # Save to JSON\n with open(output_path, \"w\", encoding=\"utf-8\") as f:\n json.dump(records, f, indent=2)\n\n# Usage\ntable = extract_table(\"report.pdf\")\nexport_to_json(table, \"output.json\")\n```\n\n## Table detection algorithms\n\n### Visual debugging\n\n```python\nimport pdfplumber\n\ndef visualize_table_detection(pdf_path, page_num=0, output_path=\"debug.png\"):\n \"\"\"Visualize detected table structure.\"\"\"\n\n with pdfplumber.open(pdf_path) as pdf:\n page = pdf.pages[page_num]\n\n # Draw detected table lines\n im = page.to_image(resolution=150)\n im = im.debug_tablefinder()\n im.save(output_path)\n\n print(f\"Saved debug image to {output_path}\")\n\n# Usage\nvisualize_table_detection(\"document.pdf\", page_num=0)\n```\n\n### Algorithm: Line-based detection\n\nBest for tables with visible borders:\n\n```python\ntable_settings = {\n \"vertical_strategy\": \"lines\",\n \"horizontal_strategy\": \"lines\"\n}\n\ntables = page.extract_tables(table_settings=table_settings)\n```\n\n### Algorithm: Text-based detection\n\nBest for tables without borders:\n\n```python\ntable_settings = {\n \"vertical_strategy\": \"text\",\n \"horizontal_strategy\": \"text\"\n}\n\ntables = page.extract_tables(table_settings=table_settings)\n```\n\n### Algorithm: Explicit lines\n\nFor complex layouts, define lines manually:\n\n```python\n# Define vertical lines at x-coordinates\nvertical_lines = [50, 150, 250, 350, 450, 550]\n\n# Define horizontal lines at y-coordinates\nhorizontal_lines = [100, 130, 160, 190, 220, 250]\n\ntable_settings = {\n \"explicit_vertical_lines\": vertical_lines,\n \"explicit_horizontal_lines\": horizontal_lines\n}\n\ntables = page.extract_tables(table_settings=table_settings)\n```\n\n## Custom extraction rules\n\n### Rule-based 
extraction\n\n```python\ndef extract_with_rules(page, rules):\n \"\"\"Extract table using custom rules.\"\"\"\n\n # Rule: \"Headers are bold\"\n if rules.get(\"bold_headers\"):\n chars = page.chars\n bold_chars = [c for c in chars if \"Bold\" in c.get(\"fontname\", \"\")]\n # Use bold chars to identify header row\n pass\n\n # Rule: \"First column is always left-aligned\"\n if rules.get(\"left_align_first_col\"):\n # Adjust extraction to respect alignment\n pass\n\n # Rule: \"Currency values in last column\"\n if rules.get(\"currency_last_col\"):\n # Parse currency format\n pass\n\n # Extract with adjusted settings\n return page.extract_tables()\n```\n\n### Post-processing rules\n\n```python\ndef apply_post_processing(table, rules):\n \"\"\"Apply post-processing rules to extracted table.\"\"\"\n\n processed = []\n\n for row in table:\n new_row = []\n\n for i, cell in enumerate(row):\n value = cell\n\n # Rule: Strip whitespace\n if rules.get(\"strip_whitespace\"):\n value = value.strip() if value else value\n\n # Rule: Convert currency to float\n if rules.get(\"parse_currency\") and i == len(row) - 1:\n if value and \"$\" in value:\n value = float(value.replace(\"$\", \"\").replace(\",\", \"\"))\n\n # Rule: Parse dates\n if rules.get(\"parse_dates\") and i == 0:\n # Convert to datetime\n pass\n\n new_row.append(value)\n\n processed.append(new_row)\n\n return processed\n```\n\n## Performance optimization\n\n### Process large PDFs efficiently\n\n```python\ndef extract_tables_optimized(pdf_path):\n \"\"\"Extract tables with memory optimization.\"\"\"\n\n import gc\n\n results = []\n\n with pdfplumber.open(pdf_path) as pdf:\n for page_num, page in enumerate(pdf.pages):\n print(f\"Processing page {page_num + 1}/{len(pdf.pages)}\")\n\n # Extract tables from current page\n tables = page.extract_tables()\n results.extend(tables)\n\n # Force garbage collection\n gc.collect()\n\n return results\n```\n\n### Parallel processing\n\n```python\nfrom concurrent.futures import 
ProcessPoolExecutor\nimport pdfplumber\n\ndef extract_page_tables(args):\n \"\"\"Extract tables from a single page.\"\"\"\n pdf_path, page_num = args\n\n with pdfplumber.open(pdf_path) as pdf:\n page = pdf.pages[page_num]\n return page.extract_tables()\n\ndef extract_tables_parallel(pdf_path, max_workers=4):\n \"\"\"Extract tables using multiple processes.\"\"\"\n\n with pdfplumber.open(pdf_path) as pdf:\n page_count = len(pdf.pages)\n\n # Create tasks\n tasks = [(pdf_path, i) for i in range(page_count)]\n\n # Process in parallel\n with ProcessPoolExecutor(max_workers=max_workers) as executor:\n results = list(executor.map(extract_page_tables, tasks))\n\n # Flatten results\n all_tables = []\n for page_tables in results:\n all_tables.extend(page_tables)\n\n return all_tables\n```\n\n## Production examples\n\n### Example 1: Financial report extraction\n\n```python\nimport pdfplumber\nimport pandas as pd\nfrom decimal import Decimal\n\ndef extract_financial_tables(pdf_path):\n \"\"\"Extract financial data with proper number formatting.\"\"\"\n\n tables = []\n\n with pdfplumber.open(pdf_path) as pdf:\n for page in pdf.pages:\n page_tables = page.extract_tables()\n\n for table in page_tables:\n # Convert to DataFrame\n df = pd.DataFrame(table[1:], columns=table[0])\n\n # Parse currency columns\n for col in df.columns:\n # regex=False: match a literal \"$\" (as a regex, \"$\" matches every string)\n if df[col].str.contains(\"$\", regex=False, na=False).any():\n df[col] = df[col].str.replace(r\"[$,()]\", \"\", regex=True)\n df[col] = pd.to_numeric(df[col], errors=\"coerce\")\n\n tables.append(df)\n\n return tables\n```\n\n### Example 2: Batch table extraction\n\n```python\nimport glob\nfrom pathlib import Path\n\ndef batch_extract_tables(input_dir, output_dir):\n \"\"\"Extract tables from all PDFs in directory.\"\"\"\n\n input_path = Path(input_dir)\n output_path = Path(output_dir)\n output_path.mkdir(exist_ok=True)\n\n for pdf_file in input_path.glob(\"*.pdf\"):\n print(f\"Processing: {pdf_file.name}\")\n\n try:\n # Extract tables\n tables = 
extract_all_tables(str(pdf_file))\n\n            # Export to Excel\n            output_file = output_path / f\"{pdf_file.stem}_tables.xlsx\"\n            export_to_excel(tables, str(output_file))\n\n            print(f\" ✓ Extracted {len(tables)} table(s)\")\n\n        except Exception as e:\n            print(f\" ✗ Error: {e}\")\n\n# Usage\nbatch_extract_tables(\"invoices/\", \"extracted/\")\n```\n\n## Best practices\n\n1. **Visualize first**: Use debug mode to understand table structure\n2. **Test settings**: Try different strategies for best results\n3. **Handle errors**: PDFs vary widely in quality\n4. **Validate output**: Check that the extracted data makes sense\n5. **Post-process**: Clean and normalize extracted data\n6. **Use pandas**: Leverage DataFrame operations for analysis\n7. **Cache results**: Avoid re-processing large files\n8. **Monitor performance**: Profile for bottlenecks\n\n## Troubleshooting\n\n### Tables not detected\n\n1. Try different detection strategies\n2. Use visual debugging to see structure\n3. Define explicit lines manually\n4. Check whether the table is actually an image\n\n### Incorrect cell values\n\n1. Adjust snap/join tolerance\n2. Check text extraction quality\n3. Use post-processing to clean data\n4. Verify that the PDF is not a scanned image\n\n### Performance issues\n\n1. Process pages individually\n2. Use parallel processing\n3. Reduce image resolution\n4. 
Extract only needed pages\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":14911,"content_sha256":"ae2b4c9fc07a724f25415aa19d061477e52565ad50c7cb10a9b3bdc069dfe94f"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"PDF Processing Pro","type":"text"}]},{"type":"paragraph","content":[{"text":"Production-ready PDF processing toolkit with pre-built scripts, comprehensive error handling, and support for complex workflows.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Quick start","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Extract text from PDF","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"import pdfplumber\n\nwith pdfplumber.open(\"document.pdf\") as pdf:\n    text = pdf.pages[0].extract_text()\n    print(text)","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Analyze PDF form (using included script)","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"python scripts/analyze_form.py input.pdf --output fields.json\n# Returns: JSON with all form fields, types, and positions","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Fill PDF form with validation","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"python scripts/fill_form.py input.pdf data.json output.pdf\n# Validates all fields before filling, includes error reporting","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Extract tables from PDF","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"python scripts/extract_tables.py report.pdf --output tables.csv\n# Extracts all tables with automatic column 
detection","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Features","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"✅ Production-ready scripts","type":"text"}]},{"type":"paragraph","content":[{"text":"All scripts include:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Error handling","type":"text","marks":[{"type":"strong"}]},{"text":": Graceful failures with detailed error messages","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Validation","type":"text","marks":[{"type":"strong"}]},{"text":": Input validation and type checking","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Logging","type":"text","marks":[{"type":"strong"}]},{"text":": Configurable logging with timestamps","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Type hints","type":"text","marks":[{"type":"strong"}]},{"text":": Full type annotations for IDE support","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"CLI interface","type":"text","marks":[{"type":"strong"}]},{"text":": ","type":"text"},{"text":"--help","type":"text","marks":[{"type":"code_inline"}]},{"text":" flag for all scripts","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Exit codes","type":"text","marks":[{"type":"strong"}]},{"text":": Proper exit codes for automation","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"✅ Comprehensive workflows","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"PDF Forms","type":"text","marks":[{"type":"strong"}]},{"text":": Complete form processing pipeline","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Table 
Extraction","type":"text","marks":[{"type":"strong"}]},{"text":": Advanced table detection and extraction","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"OCR Processing","type":"text","marks":[{"type":"strong"}]},{"text":": Scanned PDF text extraction","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Batch Operations","type":"text","marks":[{"type":"strong"}]},{"text":": Process multiple PDFs efficiently","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Validation","type":"text","marks":[{"type":"strong"}]},{"text":": Pre and post-processing validation","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Advanced topics","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"PDF Form Processing","type":"text"}]},{"type":"paragraph","content":[{"text":"For complete form workflows including:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Field analysis and detection","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Dynamic form filling","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Validation rules","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Multi-page forms","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Checkbox and radio button handling","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"See ","type":"text"},{"text":"FORMS.md","type":"text","marks":[{"type":"link","attrs":{"href":"FORMS.md","title":null}}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Table Extraction","type":"text"}]},{"type":"paragraph","content":[{"text":"For complex table 
extraction:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Multi-page tables","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Merged cells","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Nested tables","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Custom table detection","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Export to CSV/Excel","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"See ","type":"text"},{"text":"TABLES.md","type":"text","marks":[{"type":"link","attrs":{"href":"TABLES.md","title":null}}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"OCR Processing","type":"text"}]},{"type":"paragraph","content":[{"text":"For scanned PDFs and image-based documents:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Tesseract integration","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Language support","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Image preprocessing","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Confidence scoring","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Batch OCR","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"See ","type":"text"},{"text":"OCR.md","type":"text","marks":[{"type":"link","attrs":{"href":"OCR.md","title":null}}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Included scripts","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Form processing","type":"text"}]},{"type":"paragraph","content":[{"text":"analyze_form.py","type":"text","marks":[{"type":"strong"}]},{"text":" 
- Extract form field information","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"python scripts/analyze_form.py input.pdf [--output fields.json] [--verbose]","type":"text"}]},{"type":"paragraph","content":[{"text":"fill_form.py","type":"text","marks":[{"type":"strong"}]},{"text":" - Fill PDF forms with data","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"python scripts/fill_form.py input.pdf data.json output.pdf [--validate]","type":"text"}]},{"type":"paragraph","content":[{"text":"validate_form.py","type":"text","marks":[{"type":"strong"}]},{"text":" - Validate form data before filling","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"python scripts/validate_form.py data.json schema.json","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Table extraction","type":"text"}]},{"type":"paragraph","content":[{"text":"extract_tables.py","type":"text","marks":[{"type":"strong"}]},{"text":" - Extract tables to CSV/Excel","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"python scripts/extract_tables.py input.pdf [--output tables.csv] [--format csv|excel]","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Text extraction","type":"text"}]},{"type":"paragraph","content":[{"text":"extract_text.py","type":"text","marks":[{"type":"strong"}]},{"text":" - Extract text with formatting preservation","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"python scripts/extract_text.py input.pdf [--output text.txt] [--preserve-formatting]","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Utilities","type":"text"}]},{"type":"paragraph","content":[{"text":"merge_pdfs.py","type":"text","marks":[{"type":"strong"}]},{"text":" - Merge multiple 
PDFs","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"python scripts/merge_pdfs.py file1.pdf file2.pdf file3.pdf --output merged.pdf","type":"text"}]},{"type":"paragraph","content":[{"text":"split_pdf.py","type":"text","marks":[{"type":"strong"}]},{"text":" - Split PDF into individual pages","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"python scripts/split_pdf.py input.pdf --output-dir pages/","type":"text"}]},{"type":"paragraph","content":[{"text":"validate_pdf.py","type":"text","marks":[{"type":"strong"}]},{"text":" - Validate PDF integrity","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"python scripts/validate_pdf.py input.pdf","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Common workflows","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Workflow 1: Process form submissions","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# 1. Analyze form structure\npython scripts/analyze_form.py template.pdf --output schema.json\n\n# 2. Validate submission data\npython scripts/validate_form.py submission.json schema.json\n\n# 3. Fill form\npython scripts/fill_form.py template.pdf submission.json completed.pdf\n\n# 4. Validate output\npython scripts/validate_pdf.py completed.pdf","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Workflow 2: Extract data from reports","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# 1. Extract tables\npython scripts/extract_tables.py monthly_report.pdf --output data.csv\n\n# 2. 
Extract text for analysis\npython scripts/extract_text.py monthly_report.pdf --output report.txt","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Workflow 3: Batch processing","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"import glob\nfrom pathlib import Path\nimport subprocess\n\n# Make sure the output directory exists\noutput_dir = Path(\"processed\")\noutput_dir.mkdir(exist_ok=True)\n\n# Process all PDFs in directory\nfor pdf_file in glob.glob(\"invoices/*.pdf\"):\n    # Extracted text goes to a .txt file named after the PDF\n    output_file = output_dir / f\"{Path(pdf_file).stem}.txt\"\n\n    result = subprocess.run([\n        \"python\", \"scripts/extract_text.py\",\n        pdf_file,\n        \"--output\", str(output_file)\n    ], capture_output=True, text=True)\n\n    if result.returncode == 0:\n        print(f\"✓ Processed: {pdf_file}\")\n    else:\n        print(f\"✗ Failed: {pdf_file} - {result.stderr}\")","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Error handling","type":"text"}]},{"type":"paragraph","content":[{"text":"All scripts follow consistent error patterns:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"# Exit codes\n# 0 - Success\n# 1 - File not found\n# 2 - Invalid input\n# 3 - Processing error\n# 4 - Validation error\n\n# Example usage in automation\nimport subprocess\n\nresult = subprocess.run([\"python\", \"scripts/fill_form.py\", ...])\n\nif result.returncode == 0:\n    print(\"Success\")\nelif result.returncode == 4:\n    print(\"Validation failed - check input data\")\nelse:\n    print(f\"Error occurred: {result.returncode}\")","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Dependencies","type":"text"}]},{"type":"paragraph","content":[{"text":"All scripts require:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"pip install pdfplumber pypdf pillow pytesseract pandas","type":"text"}]},{"type":"paragraph","content":[{"text":"Optional for OCR:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Install 
tesseract-ocr system package\n# macOS: brew install tesseract\n# Ubuntu: apt-get install tesseract-ocr\n# Windows: Download from GitHub releases","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Performance tips","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use batch processing","type":"text","marks":[{"type":"strong"}]},{"text":" for multiple PDFs","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Enable multiprocessing","type":"text","marks":[{"type":"strong"}]},{"text":" with ","type":"text"},{"text":"--parallel","type":"text","marks":[{"type":"code_inline"}]},{"text":" flag (where supported)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Cache extracted data","type":"text","marks":[{"type":"strong"}]},{"text":" to avoid re-processing","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Validate inputs early","type":"text","marks":[{"type":"strong"}]},{"text":" to fail fast","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use streaming","type":"text","marks":[{"type":"strong"}]},{"text":" for large PDFs (>50MB)","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Best practices","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Always validate inputs","type":"text","marks":[{"type":"strong"}]},{"text":" before processing","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use try-except","type":"text","marks":[{"type":"strong"}]},{"text":" in custom scripts","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Log all operations","type":"text","marks":[{"type":"strong"}]},{"text":" for 
debugging","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Test with sample PDFs","type":"text","marks":[{"type":"strong"}]},{"text":" before production","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Set timeouts","type":"text","marks":[{"type":"strong"}]},{"text":" for long-running operations","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Check exit codes","type":"text","marks":[{"type":"strong"}]},{"text":" in automation","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Back up originals","type":"text","marks":[{"type":"strong"}]},{"text":" before modification","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Troubleshooting","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Common issues","type":"text"}]},{"type":"paragraph","content":[{"text":"\"Module not found\" errors","type":"text","marks":[{"type":"strong"}]},{"text":":","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"pip install -r requirements.txt","type":"text"}]},{"type":"paragraph","content":[{"text":"Tesseract not found","type":"text","marks":[{"type":"strong"}]},{"text":":","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Install tesseract system package (see Dependencies)","type":"text"}]},{"type":"paragraph","content":[{"text":"Memory errors with large PDFs","type":"text","marks":[{"type":"strong"}]},{"text":":","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"import pdfplumber\n\n# Process page by page instead of loading entire PDF\nwith pdfplumber.open(\"large.pdf\") as pdf:\n    for page in pdf.pages:\n        text = page.extract_text()\n        # Process page immediately","type":"text"}]},{"type":"paragraph","content":[{"text":"Permission 
errors","type":"text","marks":[{"type":"strong"}]},{"text":":","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"chmod +x scripts/*.py","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Getting help","type":"text"}]},{"type":"paragraph","content":[{"text":"All scripts support ","type":"text"},{"text":"--help","type":"text","marks":[{"type":"code_inline"}]},{"text":":","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"python scripts/analyze_form.py --help\npython scripts/extract_tables.py --help","type":"text"}]},{"type":"paragraph","content":[{"text":"For detailed documentation on specific topics, see:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"FORMS.md","type":"text","marks":[{"type":"link","attrs":{"href":"FORMS.md","title":null}}]},{"text":" - Complete form processing guide","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"TABLES.md","type":"text","marks":[{"type":"link","attrs":{"href":"TABLES.md","title":null}}]},{"text":" - Advanced table extraction","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"OCR.md","type":"text","marks":[{"type":"link","attrs":{"href":"OCR.md","title":null}}]},{"text":" - Scanned PDF processing","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"PDF Processing 
Pro","author":"@skillopedia","source":{"stars":27714,"repo_name":"claude-code-templates","origin_url":"https://github.com/davila7/claude-code-templates/blob/HEAD/cli-tool/components/skills/document-processing/pdf-processing-pro/SKILL.md","repo_owner":"davila7","body_sha256":"ea46955388b5f0c7f21770b59bc6bf385537d5840224aa2c33a4bc4876d75ef6","cluster_key":"94c60da414f99c0a99aca53164ca8b0209d5552fd01d222848acadf62f07b93c","clean_bundle":{"format":"clean-skill-bundle-v1","source":"davila7/claude-code-templates/cli-tool/components/skills/document-processing/pdf-processing-pro/SKILL.md","attachments":[{"id":"76ac029c-d7ad-55db-ae44-253a15a0e8d2","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/76ac029c-d7ad-55db-ae44-253a15a0e8d2/attachment.md","path":"FORMS.md","size":14006,"sha256":"ea43f2930d53347e0cf19a205bf94315737cbb034153181ebdff6d037bac8d24","contentType":"text/markdown; charset=utf-8"},{"id":"865ff473-aec4-5e1e-8a3c-c6d0acc0a836","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/865ff473-aec4-5e1e-8a3c-c6d0acc0a836/attachment.md","path":"OCR.md","size":2828,"sha256":"f28c254d9f15eed42233a3e806f722f3a92c35d429df0b50641e7ef0efda10fb","contentType":"text/markdown; charset=utf-8"},{"id":"45271ff4-d8ce-5d48-9fdd-9094ecb1203a","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/45271ff4-d8ce-5d48-9fdd-9094ecb1203a/attachment.md","path":"TABLES.md","size":14911,"sha256":"ae2b4c9fc07a724f25415aa19d061477e52565ad50c7cb10a9b3bdc069dfe94f","contentType":"text/markdown; charset=utf-8"},{"id":"fbf9826b-8a8a-5af4-ba3f-18b287fc4c39","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/fbf9826b-8a8a-5af4-ba3f-18b287fc4c39/attachment.py","path":"scripts/analyze_form.py","size":8684,"sha256":"694c93c3e1dec5dc6a6e4ebdc2548c8a9200c52a9b261399893d0592831f4395","contentType":"text/x-python; 
charset=utf-8"}],"bundle_sha256":"e8e8642e60c0e06ae3d26662fc005502d3708538a78f8a5f7e3080a175eb1361","attachment_count":4,"text_attachments":4,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":0,"excluded_attachments":[]},"cluster_size":7,"skill_md_path":"cli-tool/components/skills/document-processing/pdf-processing-pro/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"web-development","category_label":"Web"},"exact_dupes_collapsed_into_this":6},"version":"v1","category":"web-development","import_tag":"clean-skills-v1","description":"Production-ready PDF processing with forms, tables, OCR, validation, and batch operations. Use when working with complex PDF workflows in production environments, processing large volumes of PDFs, or requiring robust error handling and validation."}},"renderedAt":1782979414857}
