Data Extractor

Overview

Extract structured data from documents in any format: PDF, DOCX, HTML, TXT, images, and more. Converts unstructured or semi-structured content into clean JSON, CSV, or other structured formats. Handles invoices, forms, reports, and free-text documents.

Instructions

When a user asks you to extract data from a document, follow this process:

Step 1: Identify the document format and install dependencies

Library selection by format:
- PDF: (text + tables), (fitz) for complex layouts
- DOCX:
- HTML: with
- Excel: or
- Images: (OCR) with
- JSON/XML: Python standard library

St…