schema-discoverer — Skillopedia

Schema Discoverer. Audience: Data engineers and analysts working with unfamiliar data files. Goal: Analyze data files to infer schema and generate type definitions in multiple output formats. Workflow: 1. Identify file format from extension and content; 2. Sample the data (first 1000 rows for large files); 3. Infer column types with confidence scores; 4. Detect patterns (dates, emails, IDs, categories); 5. Generate output in requested format. Type Inference Rules, Numeric Detection: Integer: all values match `^\d+$` with no leading zeros (except "0"). Float: values match `^\d+\.\d+$` or scientific notation. Currency:…

- Name: schema-discoverer
- Description: Infer schema from sample data files (CSV, JSON, Parquet) and generate type definitions.
- Author: @skillopedia, date 2026-06-05, version v1, category Data (data-analytics)
- Source: https://github.com/majesticlabs-dev/majestic-marketplace/blob/HEAD/plugins/majestic-data/skills/schema-discoverer/SKILL.md
- Allowed tools: Read Grep Glob Bash

# Schema Discoverer

**Audience:** Data engineers and analysts working with unfamiliar data files.

**Goal:** Analyze data files to infer schema and generate type definitions in multiple output formats.

## Workflow

1. **Identify file format** from extension and content
2. **Sample the data** (first 1000 rows for large files)
3. **Infer column types** with confidence scores
4. **Detect patterns** (dates, emails, IDs, categories)
5. **Generate output** in requested format

## Type Inference Rules

### Numeric Detection
- Integer: All values match `^\d+$` with no leading zeros (except "0")
- Float: Values match `^\d+\.\d+$` or scientific notation
- Currency: Values match `^\$?\d{1,3}(,\d{3})*(\.\d{2})?$`

### String Patterns
- Email: `^[\w\.-]+@[\w\.-]+\.\w+$`
- URL: `^https?://`
- UUID: `^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`
- Phone: `^\+?[\d\s\-\(\)]+$` with 10+ digits

### Date Patterns
- ISO: `YYYY-MM-DD` or `YYYY-MM-DDTHH:MM:SS`
- US: `MM/DD/YYYY`
- EU: `DD/MM/YYYY`
- Timestamp: Unix epoch (10 or 13 digits)

### Categorical Detection
- Unique ratio < 5% of total rows
- Repeated values dominate

## Output Formats

### Pydantic Model
```python
from pydantic import BaseModel, Field
from datetime import date
from typing import Literal

class Record(BaseModel):
    id: int = Field(gt=0)
    email: str = Field(pattern=r'^[\w\.-]+@[\w\.-]+\.\w+$')
    status: Literal['active', 'pending', 'inactive']
    created_at: date
    amount: float = Field(ge=0)
```

### Pandera Schema
```python
import pandera as pa

schema = pa.DataFrameSchema({
    "id": pa.Column(int, pa.Check.gt(0), unique=True),
    "email": pa.Column(str, pa.Check.str_matches(r'^[\w\.-]+@')),
    "status": pa.Column(str, pa.Check.isin(['active', 'pending', 'inactive'])),
    "created_at": pa.Column("datetime64[ns]"),
    "amount": pa.Column(float, pa.Check.ge(0)),
})
```

### SQL CREATE TABLE
```sql
CREATE TABLE records (
    id INTEGER PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    status VARCHAR(20) CHECK (status IN ('active', 'pending', 'inactive')),
    created_at DATE NOT NULL,
    amount DECIMAL(10, 2) CHECK (amount >= 0)
);
```

### TypeScript Interface
```typescript
interface Record {
    id: number;
    email: string;
    status: 'active' | 'pending' | 'inactive';
    created_at: string;
    amount: number;
}
```

## Execution Steps

1. Read file sample:
   - CSV: `pd.read_csv(path, nrows=1000)`
   - JSON: `pd.read_json(path, lines=True, nrows=1000)`
   - Parquet: `pd.read_parquet(path).head(1000)`

2. For each column, analyze: null percentage, unique count/ratio, sample values, pattern matches

3. Generate confidence score (0-100) for each type inference

4. Output schema in requested format with comments explaining inference

## Report Format

```markdown
## Schema Discovery Report

**File:** data.csv
**Rows sampled:** 1000 of 50000
**Columns:** 5

| Column | Inferred Type | Confidence | Notes |
|--------|--------------|------------|-------|
| id | integer | 100% | All positive, unique |
| email | string(email) | 98% | 2% invalid format |
| status | categorical | 100% | 3 unique values |
| created_at | date | 95% | ISO format |
| amount | float | 100% | 2 decimals, no negatives |

### Recommendations
- Add NOT NULL constraint to: id, email, created_at
- Consider UNIQUE constraint on: id, email
- Status should be ENUM: active, pending, inactive
```
---
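The per-column inference and confidence scoring described in the execution steps could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function name `infer_column`, the 90% acceptance threshold, and the choice to report the share of matching non-null values as the confidence score are all assumptions. The regular expressions are taken from the type inference rules, with the integer rule folded into a single pattern that rejects leading zeros.

```python
import re

# Patterns from the type inference rules. Order matters: on a tie,
# the earlier rule wins. The rule set here is a partial, illustrative subset.
RULES = [
    ("integer", re.compile(r"^(0|[1-9]\d*)$")),  # ^\d+$ without leading zeros
    ("float", re.compile(r"^\d+\.\d+$")),
    ("email", re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+$")),
    ("url", re.compile(r"^https?://")),
    ("uuid", re.compile(
        r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")),
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),  # ISO YYYY-MM-DD only
]


def infer_column(values, threshold=90, categorical_ratio=0.05):
    """Return (type_name, confidence) for a list of raw string cells.

    Confidence is the percentage (0-100) of non-empty values matching the
    winning pattern. Both `threshold` and this scoring are assumptions.
    """
    present = [v.strip() for v in values if v is not None and v.strip()]
    if not present:
        return ("unknown", 0)

    best_name, best_conf = "string", 0
    for name, pattern in RULES:
        hits = sum(1 for v in present if pattern.match(v))
        conf = round(100 * hits / len(present))
        if conf > best_conf:
            best_name, best_conf = name, conf

    if best_conf >= threshold:
        return (best_name, best_conf)

    # Categorical rule: unique ratio below 5% of the sampled rows.
    if len(set(present)) / len(present) < categorical_ratio:
        return ("categorical", 100)

    # Fallback: every value is trivially a valid string.
    return ("string", 100)
```

For example, a column of 49 well-formed addresses and one malformed value is reported as `("email", 98)`, which mirrors the "98%, 2% invalid format" row in the sample report. A real implementation would also need the currency, phone, US/EU date, and epoch rules, plus the null-percentage statistic.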

Important: agents should read /llm.txt, /llms.txt, or /.well-known/skills.json to discover the public Skillopedia API.