table-extractor — Skillopedia

Table Extractor Overview Extract tables from PDF documents with high accuracy using camelot-py. Handles complex table structures including merged cells, multi-line rows, spanning headers, and borderless tables. Outputs clean DataFrames that can be exported to CSV, Excel, or JSON. Instructions When a user asks you to extract tables from a PDF, follow this process: Step 1: Install and verify dependencies If ghostscript is not available, fall back to pdfplumber: Step 2: Inspect the PDF to locate tables Step 3: Choose the right extraction flavor Lattice flavor (for tables with visible borders/gri…

, ''))\n except (ValueError, AttributeError):\n pass # Keep as string\n\n print(f\"\\nCleaned Table {i}:\")\n print(df.head())\n```\n\n### Step 5: Handle complex table structures\n\n**Merged cells and spanning headers:**\n\n```python\n# Forward-fill merged cells (common in row headers)\ndf.iloc[:, 0] = df.iloc[:, 0].replace('', pd.NA).ffill()\n\n# Handle multi-level column headers\nif df.iloc[0:2].apply(lambda x: x.str.len().mean()).mean() \u003c 20:\n # Combine first two rows as multi-level header\n new_cols = df.iloc[0] + \" - \" + df.iloc[1]\n df.columns = new_cols.str.strip(\" - \")\n df = df[2:].reset_index(drop=True)\n```\n\n**Tables spanning multiple pages:**\n\n```python\n# Extract from all pages and concatenate\nall_tables = camelot.read_pdf(\"document.pdf\", pages=\"all\", flavor=\"lattice\")\n\n# Group tables that are continuations (same column count)\ngroups = {}\nfor t in all_tables:\n key = t.shape[1]\n groups.setdefault(key, []).append(t.df)\n\nfor col_count, dfs in groups.items():\n combined = pd.concat(dfs, ignore_index=True)\n # Remove duplicate header rows that appear at page breaks\n combined = combined[~combined.duplicated(keep='first')]\n```\n\n### Step 6: Export the results\n\n```python\n# CSV (one file per table)\nfor i, table in enumerate(tables):\n table.df.to_csv(f\"table_{i+1}.csv\", index=False)\n\n# Excel (all tables as separate sheets)\nwith pd.ExcelWriter(\"extracted_tables.xlsx\") as writer:\n for i, table in enumerate(tables):\n table.df.to_excel(writer, sheet_name=f\"Table_{i+1}\", index=False)\n\n# JSON\nfor i, table in enumerate(tables):\n table.df.to_json(f\"table_{i+1}.json\", orient=\"records\", indent=2)\n\nprint(f\"Exported {len(tables)} tables\")\n```\n\n## Examples\n\n### Example 1: Extract financial tables from an annual report\n\n**User request:** \"Extract all tables from this annual report PDF\"\n\n**Actions:**\n\n1. Scan all pages with lattice flavor (financial reports typically have bordered tables)\n2. 
Identify income statement, balance sheet, and cash flow tables by column headers\n3. Clean numeric values (remove $, commas, parentheses for negatives)\n4. Export each table to a separate CSV and combine into one Excel workbook\n\n**Output:** \"Extracted 7 tables across 42 pages. Exported to extracted_tables.xlsx with sheets: Income_Statement, Balance_Sheet, Cash_Flow, Revenue_Breakdown, Expenses, Quarterly_Summary, KPIs.\"\n\n### Example 2: Extract a specific table from a research paper\n\n**User request:** \"Get the results table from page 8 of this paper\"\n\n**Actions:**\n\n1. Target page 8 specifically: `camelot.read_pdf(\"paper.pdf\", pages=\"8\")`\n2. If multiple tables on the page, show summaries and let the user pick\n3. Clean the extracted table and handle any multi-line cells\n4. Export as CSV\n\n**Output:** A single CSV file with the results table, plus a preview of the first few rows printed to the console.\n\n### Example 3: Batch process multiple PDFs\n\n**User request:** \"Extract the summary table from each of these 20 monthly reports\"\n\n**Actions:**\n\n```python\nimport glob\n\nresults = []\nfor pdf_path in sorted(glob.glob(\"reports/*.pdf\")):\n tables = camelot.read_pdf(pdf_path, pages=\"1\", flavor=\"lattice\")\n if tables:\n df = tables[0].df # First table on first page\n df[\"source_file\"] = pdf_path\n results.append(df)\n\ncombined = pd.concat(results, ignore_index=True)\ncombined.to_csv(\"all_summaries.csv\", index=False)\n```\n\n**Output:** A single CSV combining the summary table from all 20 reports with a source_file column for traceability.\n\n## Guidelines\n\n- Always try `lattice` flavor first (bordered tables). Fall back to `stream` for borderless tables.\n- Check the `accuracy` score on each table. 
Below 80% indicates extraction issues that need manual review.\n- For scanned PDFs, run OCR first (e.g., `ocrmypdf`) before table extraction.\n- When camelot struggles, try pdfplumber as an alternative: `page.extract_table(table_settings={...})`.\n- Clean numeric data aggressively: remove currency symbols, commas, and handle parenthesized negatives.\n- For tables with merged cells, use forward-fill on the appropriate columns.\n- When extracting from multiple pages, watch for repeated header rows at page breaks.\n- Always preview the extracted data before exporting to catch alignment or parsing issues.\n- Report extraction quality metrics (accuracy, row/column count) so the user can verify correctness.\n---","attachment_filenames":["_scores.json"],"attachments":[{"filename":"_scores.json","content":"{\n \"version\": \"1.0.0\",\n \"skillHash\": \"sha256:14047c3cb1b75a7dfe6422c83f1f9d034177d847c0fc66a264a7cee1e64008f9\",\n \"scoredAt\": \"2026-05-13T10:22:43.208Z\",\n \"backend\": \"subagent\",\n \"model\": \"claude-sonnet-4-6\",\n \"quality\": {\n \"score\": 80,\n \"dimensions\": {\n \"clarity\": \"PASS\",\n \"completeness\": \"WEAK\",\n \"conciseness\": \"WEAK\",\n \"actionability\": \"PASS\",\n \"crossPlatform\": \"WEAK\",\n \"examples\": \"PASS\"\n },\n \"issues\": [\n {\n \"severity\": \"MEDIUM\",\n \"category\": \"completeness\",\n \"detail\": \"The instructions do not cover all possible error scenarios, such as missing PDF files or permission issues.\"\n },\n {\n \"severity\": \"MEDIUM\",\n \"category\": \"conciseness\",\n \"detail\": \"The skill file contains verbose explanations and multiple code blocks that could be streamlined.\"\n },\n {\n \"severity\": \"MEDIUM\",\n \"category\": \"crossPlatform\",\n \"detail\": \"The instructions rely on specific OS commands (e.g., apt install) and Python environment, limiting cross-platform compatibility.\"\n }\n ]\n },\n \"security\": {\n \"verdict\": \"SAFE\",\n \"issues\": []\n },\n \"impact\": {\n \"baselineAvg\": 
50,\n \"treatmentAvg\": 88,\n \"multiplier\": 1.76,\n \"scenarios\": [\n {\n \"name\": \"Borderless financial table extraction\",\n \"baseline\": 45,\n \"treatment\": 88,\n \"rationale\": \"Treatment correctly uses camelot with lattice-then-stream fallback, checks accuracy scores, and aggressively cleans currency symbols/commas as specified in the skill; baseline reaches for pdfplumber-only or pandas.read_html without the flavor-fallback pattern.\"\n },\n {\n \"name\": \"Multi-page table concatenation with repeated headers\",\n \"baseline\": 55,\n \"treatment\": 87,\n \"rationale\": \"Treatment uses pages='all' with lattice, groups by column count, and removes duplicated header rows at page breaks per the skill; baseline manually loops pages and lacks the duplicated-header dedup step.\"\n }\n ]\n }\n}\n","content_type":"application/json; charset=utf-8","language":"json","size":2060,"content_sha256":"af3ff2ffaaf206d1ae268d384eca39b4a96fda530365891d80e08346afc012a3"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"Table Extractor","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Overview","type":"text"}]},{"type":"paragraph","content":[{"text":"Extract tables from PDF documents with high accuracy using camelot-py. Handles complex table structures including merged cells, multi-line rows, spanning headers, and borderless tables. 
Outputs clean DataFrames that can be exported to CSV, Excel, or JSON.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Instructions","type":"text"}]},{"type":"paragraph","content":[{"text":"When a user asks you to extract tables from a PDF, follow this process:","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 1: Install and verify dependencies","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Install camelot and its dependencies\npip install \"camelot-py[base]\" ghostscript opencv-python-headless pandas\n\n# Verify ghostscript is available (required by camelot)\ngs --version 2>/dev/null || echo \"Install ghostscript: sudo apt install ghostscript\"","type":"text"}]},{"type":"paragraph","content":[{"text":"If ghostscript is not available, fall back to pdfplumber:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"pip install pdfplumber pandas","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 2: Inspect the PDF to locate tables","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"import camelot\n\n# Quick scan: how many tables are in the document?\ntables = camelot.read_pdf(\"document.pdf\", pages=\"all\", flavor=\"lattice\")\nprint(f\"Found {len(tables)} tables using lattice detection\")\n\n# If no tables found, try stream detection (for borderless tables)\nif len(tables) == 0:\n tables = camelot.read_pdf(\"document.pdf\", pages=\"all\", flavor=\"stream\")\n print(f\"Found {len(tables)} tables using stream detection\")\n\n# Summary of each table\nfor i, table in enumerate(tables):\n print(f\"\\nTable {i}: {table.shape[0]} rows x {table.shape[1]} cols (page {table.page})\")\n print(f\"Accuracy: {table.accuracy:.1f}%\")\n print(table.df.head(3))","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 3: 
Choose the right extraction flavor","type":"text"}]},{"type":"paragraph","content":[{"text":"Lattice flavor","type":"text","marks":[{"type":"strong"}]},{"text":" (for tables with visible borders/gridlines):","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"tables = camelot.read_pdf(\n \"document.pdf\",\n pages=\"1,2,3\", # Specific pages\n flavor=\"lattice\",\n line_scale=40, # Adjust line detection sensitivity\n process_background=True # Detect lines on colored backgrounds\n)","type":"text"}]},{"type":"paragraph","content":[{"text":"Stream flavor","type":"text","marks":[{"type":"strong"}]},{"text":" (for borderless tables, whitespace-separated):","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"tables = camelot.read_pdf(\n \"document.pdf\",\n pages=\"1\",\n flavor=\"stream\",\n edge_tol=50, # Tolerance for edge detection\n row_tol=10, # Tolerance for grouping text into rows\n columns=[\"72,200,350,500\"] # Manual column boundaries if auto-detect fails\n)","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 4: Clean and process extracted tables","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"import pandas as pd\n\nfor i, table in enumerate(tables):\n df = table.df\n\n # Promote first row to header if it contains column names\n if df.iloc[0].str.match(r'^[A-Za-z]').all():\n df.columns = df.iloc[0]\n df = df[1:].reset_index(drop=True)\n\n # Clean whitespace and newlines within cells\n df = df.apply(lambda col: col.str.strip().str.replace(r'\\n', ' ', regex=True))\n\n # Remove completely empty rows\n df = df.dropna(how='all').replace('', pd.NA).dropna(how='all')\n\n # Convert numeric columns\n for col in df.columns:\n try:\n df[col] = pd.to_numeric(df[col].str.replace(',', '').str.replace('

$'
, ''))\n except (ValueError, AttributeError):\n pass # Keep as string\n\n print(f\"\\nCleaned Table {i}:\")\n print(df.head())","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 5: Handle complex table structures","type":"text"}]},{"type":"paragraph","content":[{"text":"Merged cells and spanning headers:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"# Forward-fill merged cells (common in row headers)\ndf.iloc[:, 0] = df.iloc[:, 0].replace('', pd.NA).ffill()\n\n# Handle multi-level column headers\nif df.iloc[0:2].apply(lambda x: x.str.len().mean()).mean() \u003c 20:\n # Combine first two rows as multi-level header\n new_cols = df.iloc[0] + \" - \" + df.iloc[1]\n df.columns = new_cols.str.strip(\" - \")\n df = df[2:].reset_index(drop=True)","type":"text"}]},{"type":"paragraph","content":[{"text":"Tables spanning multiple pages:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"# Extract from all pages and concatenate\nall_tables = camelot.read_pdf(\"document.pdf\", pages=\"all\", flavor=\"lattice\")\n\n# Group tables that are continuations (same column count)\ngroups = {}\nfor t in all_tables:\n key = t.shape[1]\n groups.setdefault(key, []).append(t.df)\n\nfor col_count, dfs in groups.items():\n combined = pd.concat(dfs, ignore_index=True)\n # Remove duplicate header rows that appear at page breaks\n combined = combined[~combined.duplicated(keep='first')]","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 6: Export the results","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"# CSV (one file per table)\nfor i, table in enumerate(tables):\n table.df.to_csv(f\"table_{i+1}.csv\", index=False)\n\n# Excel (all tables as separate sheets)\nwith pd.ExcelWriter(\"extracted_tables.xlsx\") as writer:\n for 
i, table in enumerate(tables):\n table.df.to_excel(writer, sheet_name=f\"Table_{i+1}\", index=False)\n\n# JSON\nfor i, table in enumerate(tables):\n table.df.to_json(f\"table_{i+1}.json\", orient=\"records\", indent=2)\n\nprint(f\"Exported {len(tables)} tables\")","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Examples","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Example 1: Extract financial tables from an annual report","type":"text"}]},{"type":"paragraph","content":[{"text":"User request:","type":"text","marks":[{"type":"strong"}]},{"text":" \"Extract all tables from this annual report PDF\"","type":"text"}]},{"type":"paragraph","content":[{"text":"Actions:","type":"text","marks":[{"type":"strong"}]}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Scan all pages with lattice flavor (financial reports typically have bordered tables)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Identify income statement, balance sheet, and cash flow tables by column headers","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Clean numeric values (remove $, commas, parentheses for negatives)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Export each table to a separate CSV and combine into one Excel workbook","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Output:","type":"text","marks":[{"type":"strong"}]},{"text":" \"Extracted 7 tables across 42 pages. 
Exported to extracted_tables.xlsx with sheets: Income_Statement, Balance_Sheet, Cash_Flow, Revenue_Breakdown, Expenses, Quarterly_Summary, KPIs.\"","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Example 2: Extract a specific table from a research paper","type":"text"}]},{"type":"paragraph","content":[{"text":"User request:","type":"text","marks":[{"type":"strong"}]},{"text":" \"Get the results table from page 8 of this paper\"","type":"text"}]},{"type":"paragraph","content":[{"text":"Actions:","type":"text","marks":[{"type":"strong"}]}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Target page 8 specifically: ","type":"text"},{"text":"camelot.read_pdf(\"paper.pdf\", pages=\"8\")","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"If multiple tables on the page, show summaries and let the user pick","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Clean the extracted table and handle any multi-line cells","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Export as CSV","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Output:","type":"text","marks":[{"type":"strong"}]},{"text":" A single CSV file with the results table, plus a preview of the first few rows printed to the console.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Example 3: Batch process multiple PDFs","type":"text"}]},{"type":"paragraph","content":[{"text":"User request:","type":"text","marks":[{"type":"strong"}]},{"text":" \"Extract the summary table from each of these 20 monthly 
reports\"","type":"text"}]},{"type":"paragraph","content":[{"text":"Actions:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"import glob\n\nresults = []\nfor pdf_path in sorted(glob.glob(\"reports/*.pdf\")):\n tables = camelot.read_pdf(pdf_path, pages=\"1\", flavor=\"lattice\")\n if tables:\n df = tables[0].df # First table on first page\n df[\"source_file\"] = pdf_path\n results.append(df)\n\ncombined = pd.concat(results, ignore_index=True)\ncombined.to_csv(\"all_summaries.csv\", index=False)","type":"text"}]},{"type":"paragraph","content":[{"text":"Output:","type":"text","marks":[{"type":"strong"}]},{"text":" A single CSV combining the summary table from all 20 reports with a source_file column for traceability.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Guidelines","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Always try ","type":"text"},{"text":"lattice","type":"text","marks":[{"type":"code_inline"}]},{"text":" flavor first (bordered tables). Fall back to ","type":"text"},{"text":"stream","type":"text","marks":[{"type":"code_inline"}]},{"text":" for borderless tables.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Check the ","type":"text"},{"text":"accuracy","type":"text","marks":[{"type":"code_inline"}]},{"text":" score on each table. 
Below 80% indicates extraction issues that need manual review.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"For scanned PDFs, run OCR first (e.g., ","type":"text"},{"text":"ocrmypdf","type":"text","marks":[{"type":"code_inline"}]},{"text":") before table extraction.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"When camelot struggles, try pdfplumber as an alternative: ","type":"text"},{"text":"page.extract_table(table_settings={...})","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Clean numeric data aggressively: remove currency symbols, commas, and handle parenthesized negatives.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"For tables with merged cells, use forward-fill on the appropriate columns.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"When extracting from multiple pages, watch for repeated header rows at page breaks.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Always preview the extracted data before exporting to catch alignment or parsing issues.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Report extraction quality metrics (accuracy, row/column count) so the user can verify 
correctness.","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"table-extractor","author":"@skillopedia","source":{"stars":62,"repo_name":"skills","origin_url":"https://github.com/terminalskills/skills/blob/HEAD/skills/table-extractor/SKILL.md","repo_owner":"terminalskills","body_sha256":"087cd6cdc6260003ef41431354858921b8d547936944b8d6979fcfe68fd692cd","cluster_key":"bc9484d56e99d3766e33294b3d64680e69e716763ef6fc7be40ffc1af21f86e8","clean_bundle":{"format":"clean-skill-bundle-v1","source":"terminalskills/skills/skills/table-extractor/SKILL.md","attachments":[{"id":"e510c6e1-ba41-5920-bf47-44bb08ee1d27","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/e510c6e1-ba41-5920-bf47-44bb08ee1d27/attachment.json","path":"_scores.json","size":2060,"sha256":"af3ff2ffaaf206d1ae268d384eca39b4a96fda530365891d80e08346afc012a3","contentType":"application/json; charset=utf-8"}],"bundle_sha256":"c9060b090d74756452355e6eee70adb17837369088e0920f9d8d84118afa9f9b","attachment_count":1,"text_attachments":1,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":0,"excluded_attachments":[]},"cluster_size":1,"skill_md_path":"skills/table-extractor/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"data-analytics","category_label":"Data"},"exact_dupes_collapsed_into_this":0},"license":"Apache-2.0","version":"v1","category":"data-analytics","metadata":{"tags":["table-extraction","pdf","camelot","csv","spreadsheet"],"agents":["claude-code","openai-codex","gemini-cli","cursor"],"author":"terminal-skills","version":"1.0.0","category":"data-ai","use-cases":["Extract complex tables from PDF reports into CSV or Excel","Batch extract all tables from a multi-page PDF document","Convert financial statements or data tables from PDF to structured data"]},"import_tag":"clean-skills-v1","description":"Extract tables from PDFs with high accuracy using camelot. 
Handles complex table structures including merged cells, multi-line rows, and spanning headers. Use when a user asks to extract a table from a PDF, pull tabular data from a document, convert PDF tables to CSV or Excel, or parse structured tables from reports.","compatibility":"Requires Python 3.8+ with camelot-py and ghostscript"}},"renderedAt":1782979488921}
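The guidelines above ask for parenthesized negatives to be handled, but the Step 4 snippet only strips commas and the dollar sign. The following is a minimal sketch of one way to cover that case, assuming cells arrive as strings (as in a camelot `table.df`). The helper name `clean_numeric_column` and the set of currency symbols are illustrative assumptions, not part of camelot or pandas.

```python
import pandas as pd


def clean_numeric_column(col: pd.Series) -> pd.Series:
    """Hypothetical helper: convert accounting-style strings to numbers.

    Returns the column unchanged if any non-empty cell fails to parse,
    so text columns are left as strings.
    """
    text = col.fillna("").astype(str).str.strip()
    # Accounting notation: "(1,234.50)" means -1234.50.
    negative = text.str.match(r"^\(.*\)$")
    # Drop currency symbols, thousands separators, parentheses, and whitespace.
    stripped = text.str.replace(r"[$€£,()\s]", "", regex=True)
    numbers = pd.to_numeric(stripped, errors="coerce")
    # A non-empty cell that did not parse suggests this is a text column.
    non_empty = stripped != ""
    if (numbers.isna() & non_empty).any():
        return col
    # Flip the sign only where the original cell was parenthesized.
    return numbers.where(~negative, -numbers)
```

Under these assumptions it could be applied column by column with `df = df.apply(clean_numeric_column)`. Empty cells become `NaN`, and any column containing unparseable text is returned untouched rather than partially converted.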

Important: agents should read /llm.txt, /llms.txt, or /.well-known/skills.json to discover the public Skillopedia API.