Web Scraper Overview Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Pagination, monitoring, and CSV/JSON export. When to Use This Skill - When the user mentions "scraper" or related topics - When the user mentions "scraping" or related topics - When the user mentions "extrair dados web" or related topics - When the user mentions "web scraping" or related topics - When the user mentions "raspar dados" or related topics - When the user mentions "coletar dados site" or related topics Do Not Use This Skill When - The task is unrelated…

, num_str):\n num_str = num_str.replace('.', '').replace(',', '.')\n else:\n num_str = num_str.replace(',', '')\n return float(num_str)\n```\n\n### Currency Detection\n\n| Symbol/Code | Currency | Symbol/Code | Currency |\n|:------------|:---------|:------------|:---------|\n| `$`, `US$`, `USD` | US Dollar | `R$`, `BRL` | Brazilian Real |\n| `€`, `EUR` | Euro | `£`, `GBP` | British Pound |\n| `¥`, `JPY` | Yen | `₹`, `INR` | Indian Rupee |\n| `C$`, `CAD` | Canadian Dollar | `A$`, `AUD` | Australian Dollar |\n\n### Output Format\n\n```json\n{\n \"price\": 29.99,\n \"currency\": \"USD\",\n \"rawPrice\": \"$29.99\"\n}\n```\n\nFor Markdown, show formatted: `$29.99` (right-aligned in table).\n\n---\n\n## Date Normalization\n\nNormalize all dates to ISO-8601 format.\n\n### Common Formats to Handle\n\n| Input Format | Example | Normalized |\n|:------------------------|:---------------------|:-------------------|\n| Full text | February 25, 2026 | 2026-02-25 |\n| Short text | Feb 25, 2026 | 2026-02-25 |\n| US numeric | 02/25/2026 | 2026-02-25 |\n| EU numeric | 25/02/2026 | 2026-02-25 |\n| ISO already | 2026-02-25 | 2026-02-25 |\n| Relative | 3 days ago | (compute from now) |\n| Relative | Yesterday | (compute from now) |\n| Timestamp | 1740441600 | 2025-02-25 |\n| With time | 2026-02-25T14:30:00Z | 2026-02-25 14:30 |\n\n### Ambiguous Dates\n\nWhen format is ambiguous (e.g. 
`03/04/2026`):\n- Default to US format (MM/DD/YYYY) unless site is clearly non-US\n- Check page `lang` attribute or URL TLD for locale hints\n- Note ambiguity in delivery notes\n\n### Relative Date Resolution\n\n```python\nfrom datetime import datetime, timedelta\nimport re\n\ndef resolve_relative_date(text):\n text = text.lower().strip()\n today = datetime.now()\n\n if 'today' in text: return today.strftime('%Y-%m-%d')\n if 'yesterday' in text: return (today - timedelta(days=1)).strftime('%Y-%m-%d')\n\n match = re.search(r'(\\d+)\\s*(hour|day|week|month|year)s?\\s*ago', text)\n if match:\n n, unit = int(match.group(1)), match.group(2)\n deltas = {'hour': 0, 'day': n, 'week': n*7, 'month': n*30, 'year': n*365}\n return (today - timedelta(days=deltas.get(unit, 0))).strftime('%Y-%m-%d')\n\n return text # Return as-is if can't parse\n```\n\n---\n\n## URL Resolution\n\nConvert relative URLs to absolute.\n\n### Patterns\n\n| Input | Base URL | Resolved |\n|:-------------------------|:----------------------------|:--------------------------------------|\n| `/products/item-1` | `https://example.com/shop` | `https://example.com/products/item-1` |\n| `item-1` | `https://example.com/shop/` | `https://example.com/shop/item-1` |\n| `//cdn.example.com/img` | `https://example.com` | `https://cdn.example.com/img` |\n| `https://other.com/page` | (any) | `https://other.com/page` (absolute) |\n\n### JavaScript Resolution\n\n```javascript\nfunction resolveUrl(relative, base) {\n try { return new URL(relative, base || window.location.href).href; }\n catch { return relative; }\n}\n```\n\n---\n\n## Phone Normalization\n\nFor contact mode extraction.\n\n### Pattern\n\n```python\nimport re\n\ndef normalize_phone(raw):\n if not raw:\n return None\n # Remove all non-digit chars; only a leading + is kept\n digits = re.sub(r'(?!^\\+)\\D', '', raw.strip())\n if not digits or len(digits) \u003c 7:\n return None\n # Add + prefix if looks international\n if len(digits) >= 11 and not digits.startswith('+'):\n digits = 
'+' + digits\n return digits\n```\n\n### Format by Context\n\n| Context | Format Example |\n|:-----------------|:---------------------|\n| JSON output | `\"+5511999998888\"` |\n| Markdown table | `+55 11 99999-8888` |\n| CSV output | `\"+5511999998888\"` |\n\n---\n\n## Deduplication\n\n### Exact Deduplication\n\n```python\ndef deduplicate(records, key_fields=None):\n \"\"\"Remove exact duplicate records.\n If key_fields provided, deduplicate by those fields only.\n \"\"\"\n seen = set()\n unique = []\n for record in records:\n if key_fields:\n key = tuple(record.get(f) for f in key_fields)\n else:\n key = tuple(sorted(record.items()))\n if key not in seen:\n seen.add(key)\n unique.append(record)\n return unique, len(records) - len(unique) # returns (unique_list, removed_count)\n```\n\n### Near-Duplicate Detection\n\nWhen records share key fields but differ in details:\n1. Group by key fields (e.g. product name + source)\n2. For each group, keep the record with fewest null values\n3. If tie, keep the first occurrence\n4. 
Report in notes: \"Merged N near-duplicate records\"\n\n### Dedup Key Selection by Mode\n\n| Mode | Key Fields |\n|:---------|:----------------------------------|\n| product | name + source (or name + brand) |\n| contact | name + email (or name + org) |\n| jobs | title + company + location |\n| events | title + date + location |\n| table | all fields (exact match) |\n| list | first 2-3 identifying fields |\n\n---\n\n## Text Cleaning\n\n### Remove Noise\n\nCommon noise patterns to strip from extracted text:\n\n| Pattern | Action |\n|:-----------------------------------|:--------------------------|\n| `\\[edit\\]`, `\\[citation needed\\]` | Remove (Wikipedia) |\n| `Read more...`, `See more` | Remove (truncation markers)|\n| `Sponsored`, `Ad`, `Promoted` | Remove or flag |\n| Cookie consent text | Remove |\n| Navigation breadcrumbs | Remove |\n| Footer boilerplate | Remove |\n\n### Sentence Case Normalization\n\nWhen extracting ALL-CAPS or inconsistent-case text:\n\n```python\ndef normalize_case(text):\n if text.isupper() and len(text) > 3:\n return text.title() # ALL CAPS -> Title Case\n return text\n```\n\nOnly apply when: field is clearly ALL-CAPS input (common in older sites),\nuser requests it, or data looks better normalized.\n\n---\n\n## Data Type Coercion\n\n### Automatic Type Detection\n\n| Raw Value | Detected Type | Coerced Value |\n|:--------------|:--------------|:------------------|\n| `\"123\"` | integer | `123` |\n| `\"12.99\"` | float | `12.99` |\n| `\"true\"` | boolean | `true` |\n| `\"false\"` | boolean | `false` |\n| `\"2026-02-25\"`| date string | `\"2026-02-25\"` |\n| `\"$29.99\"` | price | `29.99` + currency|\n| `\"4.5/5\"` | rating | `4.5` |\n| `\"1,234\"` | integer | `1234` |\n\n### Rating Normalization\n\n```python\nimport re\n\ndef normalize_rating(raw):\n if not raw:\n return None\n match = re.search(r'([\\d.]+)\\s*(?:/\\s*([\\d.]+))?', str(raw))\n if match:\n score = float(match.group(1))\n max_score = float(match.group(2)) if 
match.group(2) else 5.0\n return round(score / max_score * 5, 1) # Normalize to /5 scale\n return None\n```\n\n---\n\n## Enrichment Patterns\n\n### Domain Extraction\n\nAdd domain from full URLs:\n```python\nfrom urllib.parse import urlparse\n\ndef extract_domain(url):\n try:\n parsed = urlparse(url)\n domain = parsed.netloc.replace('www.', '')\n return domain\n except:\n return None\n```\n\n### Word Count\n\nFor article mode:\n```python\ndef word_count(text):\n return len(text.split()) if text else 0\n```\n\n### Relative Time\n\nAdd human-readable time since date:\n```python\ndef time_since(date_str):\n from datetime import datetime\n try:\n dt = datetime.fromisoformat(date_str)\n delta = datetime.now() - dt\n if delta.days == 0: return \"Today\"\n if delta.days == 1: return \"Yesterday\"\n if delta.days \u003c 7: return f\"{delta.days} days ago\"\n if delta.days \u003c 30: return f\"{delta.days // 7} weeks ago\"\n if delta.days \u003c 365: return f\"{delta.days // 30} months ago\"\n return f\"{delta.days // 365} years ago\"\n except:\n return None\n```\n\n---\n\n## Transform Pipeline Order\n\nApply transforms in this sequence:\n\n1. **HTML entity decode** - raw text cleanup\n2. **Unicode normalization** - character standardization\n3. **Whitespace cleanup** - spacing normalization\n4. **Empty value standardization** - null/N/A handling\n5. **URL resolution** - relative to absolute\n6. **Data type coercion** - strings to numbers/dates\n7. **Price normalization** - if applicable\n8. **Date normalization** - if applicable\n9. **Phone normalization** - if applicable\n10. **Text cleaning** - noise removal\n11. **Deduplication** - remove duplicates\n12. **Sorting** - user-requested order\n13. **Enrichment** - domain, word count, etc.\n\nNot all steps apply to every extraction. 
Apply only what's relevant\nto the data type and extraction mode.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":11666,"content_sha256":"2cd94e932b3059490116aad6104cfdbe4ebc6700674e50283e8ad89a4e04f22c"},{"filename":"references/extraction-patterns.md","content":"# Extraction Patterns Reference\n\nCSS selectors, JavaScript snippets, and domain-specific tips for\ncommon web scraping scenarios.\n\n---\n\n## CSS Selector Patterns\n\n### Tables\n\n```css\n/* Standard HTML tables */\ntable /* All tables */\ntable.data-table /* Class-based */\ntable[id*=\"result\"] /* ID contains \"result\" */\ntable thead th /* Header cells */\ntable tbody tr /* Data rows */\ntable tbody tr td /* Data cells */\ntable tbody tr td:nth-child(2) /* Specific column (2nd) */\n\n/* Grid layouts acting as tables */\n[role=\"table\"] /* ARIA table role */\n[role=\"row\"] /* ARIA row */\n[role=\"gridcell\"] /* ARIA grid cell */\n.table-responsive table /* Bootstrap responsive wrapper */\n```\n\n### Product Listings\n\n```css\n/* E-commerce product grids */\n.product-card, .product-item, .product-tile\n[data-product-id] /* Data attribute markers */\n.product-name, .product-title, h2.title\n.price, .product-price, [data-price]\n.price--sale, .price--original /* Sale vs original price */\n.rating, .stars, [data-rating]\n.availability, .stock-status\n.product-image img, .product-thumb img\n\n/* Common e-commerce patterns */\n.search-results .result-item\n.catalog-grid .catalog-item\n.listing .listing-item\n```\n\n### Search Results\n\n```css\n/* Generic search result patterns */\n.search-result, .result-item, .search-entry\n.result-title a, .result-link\n.result-snippet, .result-description\n.result-url, .result-source\n.result-date, .result-timestamp\n.pagination a, .page-numbers a, [aria-label=\"Next\"]\n```\n\n### Contact / Directory\n\n```css\n/* People and contact cards */\n.team-member, .staff-card, .person, .contact-card\n.member-name, .person-name, 
h3.name\n.member-title, .job-title, .role\n.member-email a[href^=\"mailto:\"]\n.member-phone a[href^=\"tel:\"]\n.member-bio, .person-description\n.vcard /* hCard microformat */\n```\n\n### FAQ / Accordion\n\n```css\n/* FAQ and accordion patterns */\n.faq-item, .accordion-item, [itemtype*=\"FAQPage\"] [itemprop=\"mainEntity\"]\n.faq-question, .accordion-header, [itemprop=\"name\"], summary\n.faq-answer, .accordion-body, .accordion-content, [itemprop=\"acceptedAnswer\"]\ndetails, details > summary /* Native HTML accordion */\n[role=\"tabpanel\"] /* Tab-based FAQ */\n```\n\n### Pricing Tables\n\n```css\n/* SaaS pricing page patterns */\n.pricing-table, .pricing-card, .plan-card, .pricing-tier\n.plan-name, .tier-name, .pricing-title\n.plan-price, .pricing-amount, .price-value\n.plan-period, .billing-cycle /* monthly/annually */\n.plan-features li, .feature-list li\n.plan-cta, .pricing-button\n[class*=\"popular\"], [class*=\"recommended\"], [class*=\"featured\"] /* highlighted plan */\n```\n\n### Job Listings\n\n```css\n/* Job board patterns */\n.job-listing, .job-card, .job-posting, [itemtype*=\"JobPosting\"]\n.job-title, [itemprop=\"title\"]\n.company-name, [itemprop=\"hiringOrganization\"]\n.job-location, [itemprop=\"jobLocation\"]\n.job-salary, [itemprop=\"baseSalary\"]\n.job-type, .employment-type\n.job-date, [itemprop=\"datePosted\"]\n```\n\n### Events\n\n```css\n/* Event listing patterns */\n.event-card, .event-item, [itemtype*=\"Event\"]\n.event-title, [itemprop=\"name\"]\n.event-date, [itemprop=\"startDate\"], time[datetime]\n.event-location, [itemprop=\"location\"]\n.event-description, [itemprop=\"description\"]\n.event-speaker, .speaker-name\n```\n\n### Navigation / Pagination\n\n```css\n/* Pagination controls */\n.pagination, .pager, nav[aria-label*=\"pagination\"]\n.pagination .next, a[rel=\"next\"]\n.pagination .prev, a[rel=\"prev\"]\n.page-numbers, .page-link\nbutton[data-page], a[data-page]\n.load-more, button.show-more\n```\n\n### Articles / Blog 
Posts\n\n```css\n/* Article content */\narticle, .post, .entry, .article-content\narticle h1, .post-title, .entry-title\n.author, .byline, [rel=\"author\"]\ntime, .date, .published, .post-date\n.post-content, .entry-content, .article-body\n.tags a, .categories a, .post-tags a\n```\n\n---\n\n## JavaScript Extraction Snippets\n\n### Generic Table Extractor\n\n```javascript\nfunction extractTable(selector) {\n const table = document.querySelector(selector || 'table');\n if (!table) return { error: 'No table found' };\n\n const headers = Array.from(\n table.querySelectorAll('thead th, tr:first-child th, tr:first-child td')\n ).map(el => el.textContent.trim());\n\n const rows = Array.from(table.querySelectorAll('tbody tr, tr:not(:first-child)'))\n .map(tr => {\n const cells = Array.from(tr.querySelectorAll('td'))\n .map(td => td.textContent.trim());\n return cells.length > 0 ? cells : null;\n })\n .filter(Boolean);\n\n return { headers, rows, rowCount: rows.length };\n}\nJSON.stringify(extractTable());\n```\n\n### Multi-Table Extractor\n\n```javascript\nfunction extractAllTables() {\n const tables = document.querySelectorAll('table');\n return Array.from(tables).map((table, idx) => {\n const caption = table.querySelector('caption')?.textContent?.trim()\n || table.getAttribute('aria-label') || `Table ${idx + 1}`;\n const headers = Array.from(\n table.querySelectorAll('thead th, tr:first-child th')\n ).map(el => el.textContent.trim());\n const rows = Array.from(table.querySelectorAll('tbody tr'))\n .map(tr => Array.from(tr.querySelectorAll('td')).map(td => td.textContent.trim()))\n .filter(r => r.length > 0);\n return { caption, headers, rows, rowCount: rows.length };\n });\n}\nJSON.stringify(extractAllTables());\n```\n\n### Generic List Extractor\n\n```javascript\nfunction extractList(containerSelector, itemSelector, fieldMap) {\n // fieldMap: { fieldName: { selector: 'CSS', attr: 'href'|'src'|null } }\n const container = document.querySelector(containerSelector);\n if 
(!container) return { error: 'Container not found' };\n\n const items = Array.from(container.querySelectorAll(itemSelector));\n const data = items.map(item => {\n const record = {};\n for (const [key, config] of Object.entries(fieldMap)) {\n const sel = typeof config === 'string' ? config : config.selector;\n const attr = typeof config === 'object' ? config.attr : null;\n const el = item.querySelector(sel);\n if (!el) { record[key] = null; continue; }\n record[key] = attr ? el.getAttribute(attr) : el.textContent.trim();\n }\n return record;\n });\n return { data, itemCount: data.length };\n}\n\n// Example usage:\nJSON.stringify(extractList('.results', '.result-item', {\n title: '.result-title',\n description: '.result-snippet',\n url: { selector: '.result-title a', attr: 'href' },\n date: '.result-date'\n}));\n```\n\n### JSON-LD Structured Data Extractor\n\nMany pages embed structured data that's easier to parse than DOM:\n\n```javascript\nfunction extractJsonLd(targetType) {\n const scripts = document.querySelectorAll('script[type=\"application/ld+json\"]');\n const allData = Array.from(scripts).map(s => {\n try { return JSON.parse(s.textContent); } catch { return null; }\n }).filter(Boolean);\n\n // Flatten @graph arrays\n const flat = allData.flatMap(d => d['@graph'] || [d]);\n\n if (targetType) {\n return flat.filter(d =>\n d['@type'] === targetType ||\n (Array.isArray(d['@type']) && d['@type'].includes(targetType))\n );\n }\n return flat;\n}\n// Extract products: extractJsonLd('Product')\n// Extract articles: extractJsonLd('Article')\n// Extract all: extractJsonLd()\nJSON.stringify(extractJsonLd());\n```\n\nCommon JSON-LD types and their useful fields:\n- `Product`: name, offers.price, offers.priceCurrency, aggregateRating, brand.name\n- `Article`: headline, author.name, datePublished, description, wordCount\n- `Organization`: name, address, telephone, email, url\n- `BreadcrumbList`: itemListElement[].name (navigation path)\n- `FAQPage`: mainEntity[].name 
(question), mainEntity[].acceptedAnswer.text\n- `JobPosting`: title, hiringOrganization.name, jobLocation, baseSalary\n- `Event`: name, startDate, endDate, location, performer\n\n### OpenGraph / Meta Tag Extractor\n\n```javascript\nfunction extractMeta() {\n const meta = {};\n document.querySelectorAll('meta[property^=\"og:\"], meta[name^=\"twitter:\"]')\n .forEach(el => {\n const key = el.getAttribute('property') || el.getAttribute('name');\n meta[key] = el.getAttribute('content');\n });\n meta.title = document.title;\n meta.description = document.querySelector('meta[name=\"description\"]')\n ?.getAttribute('content');\n meta.canonical = document.querySelector('link[rel=\"canonical\"]')\n ?.getAttribute('href');\n return meta;\n}\nJSON.stringify(extractMeta());\n```\n\n### Pricing Plan Extractor\n\n```javascript\nfunction extractPricingPlans() {\n const cards = document.querySelectorAll(\n '.pricing-card, .plan-card, .pricing-tier, [class*=\"pricing\"] [class*=\"card\"]'\n );\n return Array.from(cards).map(card => ({\n name: card.querySelector('[class*=\"name\"], [class*=\"title\"], h2, h3')\n ?.textContent?.trim() || null,\n price: card.querySelector('[class*=\"price\"], [class*=\"amount\"]')\n ?.textContent?.trim() || null,\n period: card.querySelector('[class*=\"period\"], [class*=\"billing\"]')\n ?.textContent?.trim() || null,\n features: Array.from(card.querySelectorAll('[class*=\"feature\"] li, ul li'))\n .map(li => li.textContent.trim()),\n highlighted: card.matches('[class*=\"popular\"], [class*=\"recommended\"], [class*=\"featured\"]'),\n ctaText: card.querySelector('a, button')?.textContent?.trim() || null,\n ctaUrl: card.querySelector('a')?.href || null,\n }));\n}\nJSON.stringify(extractPricingPlans());\n```\n\n### FAQ Extractor\n\n```javascript\nfunction extractFAQ() {\n // Try JSON-LD first\n const ldFaq = extractJsonLd('FAQPage');\n if (ldFaq.length > 0 && ldFaq[0].mainEntity) {\n return ldFaq[0].mainEntity.map(q => ({\n question: q.name,\n answer: 
q.acceptedAnswer?.text || null\n }));\n }\n\n // Try \u003cdetails>/\u003csummary> pattern\n const details = document.querySelectorAll('details');\n if (details.length > 0) {\n return Array.from(details).map(d => ({\n question: d.querySelector('summary')?.textContent?.trim() || null,\n answer: Array.from(d.children).filter(c => c.tagName !== 'SUMMARY')\n .map(c => c.textContent.trim()).join(' ')\n }));\n }\n\n // Try accordion pattern\n const items = document.querySelectorAll(\n '.faq-item, .accordion-item, [class*=\"faq\"] [class*=\"item\"]'\n );\n return Array.from(items).map(item => ({\n question: item.querySelector(\n '[class*=\"question\"], [class*=\"header\"], [class*=\"title\"], h3, h4'\n )?.textContent?.trim() || null,\n answer: item.querySelector(\n '[class*=\"answer\"], [class*=\"body\"], [class*=\"content\"], p'\n )?.textContent?.trim() || null\n }));\n}\nJSON.stringify(extractFAQ());\n```\n\n### Link Extractor\n\n```javascript\nfunction extractLinks(scope) {\n const container = scope ? document.querySelector(scope) : document;\n const links = Array.from(container.querySelectorAll('a[href]'))\n .map(a => ({\n text: a.textContent.trim(),\n href: a.href,\n title: a.title || null\n }))\n .filter(l => l.text && l.href && !l.href.startsWith('javascript:'));\n return { links, count: links.length };\n}\nJSON.stringify(extractLinks());\n```\n\n### Image Extractor\n\n```javascript\nfunction extractImages(scope) {\n const container = scope ? 
document.querySelector(scope) : document;\n const images = Array.from(container.querySelectorAll('img'))\n .map(img => ({\n src: img.src,\n alt: img.alt || null,\n width: img.naturalWidth,\n height: img.naturalHeight\n }))\n .filter(i => i.src && !i.src.includes('data:image/gif'));\n return { images, count: images.length };\n}\nJSON.stringify(extractImages());\n```\n\n### Scroll-and-Collect Pattern\n\nFor pages with lazy-loaded content, use this pattern with Browser automation:\n\n```javascript\n// Count items before scroll\nfunction countItems(selector) {\n return document.querySelectorAll(selector).length;\n}\n```\n\nThen in the workflow:\n1. `javascript_tool`: `countItems('.item')` -> get initial count\n2. `computer(action=\"scroll\", scroll_direction=\"down\")`\n3. `computer(action=\"wait\", duration=2)`\n4. `javascript_tool`: `countItems('.item')` -> get new count\n5. If new count > old count, repeat from step 2\n6. If count unchanged after 2 scrolls, all items loaded\n7. Extract all items at once\n\n---\n\n## Domain-Specific Tips\n\n### E-Commerce Sites\n- Check for JSON-LD `Product` schema first - often has cleaner data than DOM\n- Prices may have hidden original/sale price elements\n- Availability often encoded in data attributes (`data-available=\"true\"`)\n- Product variants (size, color) may require click interactions\n- Review data often loaded lazily - scroll to reviews section first\n- Many sites have internal APIs at `/api/products` - check Network tab\n\n### Wikipedia\n- Tables use class `.wikitable` - always prefer this selector\n- Infoboxes use class `.infobox`\n- References in `\u003csup class=\"reference\">` - exclude from text extraction\n- Table cells may contain complex nested HTML - use `.textContent.trim()`\n- Sortable tables have class `.sortable` with sort buttons in headers\n\n### News Sites\n- Article body often in `\u003carticle>` or `[itemprop=\"articleBody\"]`\n- Paywall indicators: `.paywall`, `.subscribe-wall`, truncated with 
\"Read more\"\n- Publication date in `\u003ctime>` element or `[itemprop=\"datePublished\"]`\n- Author in `[itemprop=\"author\"]` or `.byline`\n- JSON-LD `NewsArticle` often has complete metadata\n\n### Government / Data Portals\n- Often use HTML tables without JavaScript\n- May have download links for CSV/Excel - check for `.csv`, `.xlsx` links\n- Data dictionaries may be on separate pages\n- Look for API endpoints in page source (`/api/`, `.json` links)\n- CORS may block direct API access; use Bash curl instead\n\n### Social Media (Public Profiles)\n- Content is almost always JS-rendered - use Browser automation\n- Rate limiting is aggressive - keep requests minimal\n- Infinite scroll is the norm - set clear item limits\n- Structure changes frequently - prefer text extraction over selectors\n\n### SaaS Pricing Pages\n- Pricing often changes dynamically (monthly vs annual toggle)\n- May need to click \"Annual\" toggle to see annual prices\n- Feature comparison tables often use checkmarks (Unicode or SVG)\n- Check for hidden elements toggled by billing period selector\n\n### Job Boards\n- Most use JSON-LD `JobPosting` schema\n- Salary ranges often hidden behind \"View salary\" buttons\n- Location may include remote/hybrid indicators\n- Filters are URL-parameter based - useful for pagination\n\n---\n\n## Anti-Patterns to Avoid\n\n| Anti-Pattern | Why It Fails | Better Approach |\n|:-------------|:-------------|:----------------|\n| Selectors with generated hashes (`.css-1a2b3c`) | Change on every deploy | Use semantic selectors, ARIA roles, data attributes |\n| Deeply nested paths (`div > div > div > span`) | Fragile on layout changes | Use closest meaningful class or attribute |\n| Index-based (`:nth-child(3)`) for dynamic lists | Order may change | Use content-based identification |\n| Selecting by inline styles | Presentation, not semantics | Use classes, IDs, or data attributes |\n| Hardcoded wait times for JS content | Too short or too long | Check for content 
presence in a loop |\n| Single selector for variant pages | Different pages differ | Test selector on multiple pages first |\n\n## Robust Selector Priority\n\nPrefer selectors in this order (most stable to least):\n\n1. `[data-testid=\"...\"]`, `[data-id=\"...\"]` - test/data attributes\n2. `#unique-id` - unique IDs\n3. `[role=\"...\"]`, `[aria-label=\"...\"]` - ARIA attributes\n4. `[itemprop=\"...\"]`, `[itemtype=\"...\"]` - microdata / schema.org\n5. `.semantic-class` - meaningful class names\n6. `tag.class` - element type + class\n7. Structural selectors - last resort\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":15720,"content_sha256":"c9ebd7aef349304bed50cbe5dca86c932a928c08aeb0645de86174e947d69e6e"},{"filename":"references/output-templates.md","content":"# Output Templates Reference\n\nComplete formatting templates for all supported output formats.\nEvery output must be wrapped in a delivery envelope with metadata.\n\n---\n\n## Delivery Envelope (Required)\n\nEvery extraction result MUST include this metadata wrapper,\nregardless of output format:\n\n```markdown\n## Extraction Results\n\n**Source:** [Page Title](https://example.com/page)\n**Date:** 2026-02-25 14:30 UTC\n**Items:** 47 records\n**Confidence:** HIGH\n**Format:** Markdown Table\n\n---\n\n[DATA GOES HERE]\n\n---\n\n**Notes:**\n- Any gaps, anomalies, or observations\n- Filters or sorts applied\n- Pages scraped (if paginated)\n```\n\n---\n\n## Markdown Table Format\n\n### Standard Table\n\n```markdown\n| Name | Price | Rating | Availability |\n|:---------------|---------:|:------:|:-------------|\n| Product Alpha | $29.99 | 4.5 | In Stock |\n| Product Beta | $49.99 | 4.2 | In Stock |\n| Product Gamma | $119.00 | 4.8 | Pre-order |\n| Product Delta | $15.50 | 3.9 | Out of Stock |\n```\n\n### Alignment Rules\n\n| Data Type | Alignment | Markdown Syntax |\n|:-------------|:----------|:----------------|\n| Text | Left | `:---` |\n| Numbers | Right | `---:` |\n| Centered | 
Center | `:---:` |\n| Mixed/Status | Left | `:---` |\n\n### Table with Summary Row\n\n```markdown\n| Product | Units Sold | Revenue |\n|:---------------|----------:|-----------:|\n| Widget A | 1,234 | $12,340 |\n| Widget B | 567 | $8,505 |\n| Widget C | 2,890 | $57,800 |\n| **Total** | **4,691** | **$78,645**|\n```\n\n### Wide Data (Split Tables)\n\nWhen data has more than 10 columns, split into logical groups:\n\n```markdown\n### Basic Information\n\n| Name | Category | Brand | SKU |\n|:--------|:---------|:--------|:---------|\n| Item A | Tools | Acme | ACM-001 |\n\n### Pricing and Availability\n\n| Name | Price | Sale Price | Stock | Ships In |\n|:--------|--------:|-----------:|:------|:---------|\n| Item A | $49.99 | $39.99 | 142 | 2 days |\n```\n\n### Multi-URL Comparison Table\n\n```markdown\n| Source | Product | Price | Rating |\n|:-------------|:-----------|--------:|:------:|\n| store-a.com | Laptop X | $999 | 4.3 |\n| store-b.com | Laptop X | $949 | 4.5 |\n| store-c.com | Laptop X | $1,029 | 4.1 |\n```\n\n### Truncation Rules\n\nFor values exceeding 60 characters:\n```markdown\n| Title | Author |\n|:------------------------------------------------------------|:--------|\n| Introduction to Advanced Machine Learning Techni... | J. 
Smith|\n```\n\n---\n\n## JSON Format\n\n### Standard JSON Output\n\n```json\n{\n \"metadata\": {\n \"source\": \"https://example.com/products\",\n \"title\": \"Product Catalog - Example Store\",\n \"extractedAt\": \"2026-02-25T14:30:00Z\",\n \"itemCount\": 3,\n \"confidence\": \"HIGH\",\n \"fields\": [\"name\", \"price\", \"rating\", \"availability\"],\n \"notes\": []\n },\n \"data\": [\n {\n \"name\": \"Product Alpha\",\n \"price\": 29.99,\n \"currency\": \"USD\",\n \"rating\": 4.5,\n \"availability\": \"In Stock\"\n },\n {\n \"name\": \"Product Beta\",\n \"price\": 49.99,\n \"currency\": \"USD\",\n \"rating\": 4.2,\n \"availability\": \"In Stock\"\n },\n {\n \"name\": \"Product Gamma\",\n \"price\": 119.00,\n \"currency\": \"USD\",\n \"rating\": 4.8,\n \"availability\": \"Pre-order\"\n }\n ]\n}\n```\n\n### JSON Key Naming\n\n| Rule | Example |\n|:-----------------------|:----------------------------------|\n| camelCase | `productName`, `unitPrice` |\n| Numbers stay numeric | `29.99` not `\"29.99\"` |\n| Booleans stay boolean | `true` not `\"true\"` |\n| Missing = null | `null` not `\"\"` or `\"N/A\"` |\n| Arrays for multiples | `\"tags\": [\"sale\", \"new\"]` |\n| ISO-8601 for dates | `\"2026-02-25T14:30:00Z\"` |\n\n### Nested JSON (Product with Details)\n\n```json\n{\n \"metadata\": { \"...\" : \"...\" },\n \"data\": [\n {\n \"name\": \"Laptop Pro X\",\n \"brand\": \"TechCo\",\n \"pricing\": {\n \"current\": 999.99,\n \"original\": 1299.99,\n \"currency\": \"USD\",\n \"discount\": \"23%\"\n },\n \"rating\": {\n \"score\": 4.5,\n \"count\": 1234\n },\n \"specifications\": {\n \"processor\": \"M3 Pro\",\n \"ram\": \"16 GB\",\n \"storage\": \"512 GB SSD\",\n \"display\": \"14.2 inch Retina\"\n },\n \"availability\": {\n \"inStock\": true,\n \"shipsIn\": \"2-3 business days\"\n }\n }\n ]\n}\n```\n\n### Multi-URL JSON\n\n```json\n{\n \"metadata\": {\n \"sources\": [\n \"https://store-a.com/laptop-x\",\n \"https://store-b.com/laptop-x\"\n ],\n \"extractedAt\": 
\"2026-02-25T14:30:00Z\",\n \"itemCount\": 2,\n \"confidence\": \"HIGH\"\n },\n \"data\": [\n {\n \"source\": \"store-a.com\",\n \"name\": \"Laptop X\",\n \"price\": 999,\n \"currency\": \"USD\",\n \"rating\": 4.3\n },\n {\n \"source\": \"store-b.com\",\n \"name\": \"Laptop X\",\n \"price\": 949,\n \"currency\": \"USD\",\n \"rating\": 4.5\n }\n ]\n}\n```\n\n---\n\n## CSV Format\n\n### Standard CSV\n\n```csv\n# Source: https://example.com/products\n# Extracted: 2026-02-25 14:30 UTC\n# Items: 3 | Confidence: HIGH\nname,price,currency,rating,availability\n\"Product Alpha\",29.99,USD,4.5,\"In Stock\"\n\"Product Beta\",49.99,USD,4.2,\"In Stock\"\n\"Product Gamma\",119.00,USD,4.8,\"Pre-order\"\n```\n\n### CSV Rules\n\n| Rule | Example |\n|:-------------------------------------|:-------------------------------|\n| Always include header row | `name,price,rating` |\n| Quote fields with commas | `\"Smith, John\"` |\n| Quote fields with quotes (escape) | `\"He said \"\"hello\"\"\"` |\n| Quote fields with newlines | `\"Line 1\\nLine 2\"` |\n| UTF-8 encoding with BOM | `\\xEF\\xBB\\xBF` prefix |\n| Comma delimiter (standard) | `,` |\n| Metadata as comments (# prefix) | `# Source: URL` |\n| null/missing as empty field | `field1,,field3` |\n\n### Multi-URL CSV\n\n```csv\n# Sources: store-a.com, store-b.com\n# Extracted: 2026-02-25 14:30 UTC\nsource,name,price,currency,rating\n\"store-a.com\",\"Laptop X\",999,USD,4.3\n\"store-b.com\",\"Laptop X\",949,USD,4.5\n```\n\n---\n\n## Summary Statistics Template\n\nWhen extracted data contains numeric fields, include a summary block:\n\n```markdown\n### Summary Statistics\n\n| Metric | Price | Rating |\n|:----------|----------:|-------:|\n| Count | 47 | 47 |\n| Min | $12.99 | 2.1 |\n| Max | $299.99 | 5.0 |\n| Average | $67.42 | 4.1 |\n| Median | $54.99 | 4.3 |\n```\n\nInclude only when:\n- Data has numeric columns\n- More than 5 items extracted\n- User would likely benefit from aggregate view (prices, ratings, quantities)\n\n---\n\n## 
Contact Data Template\n\n```markdown\n| Name | Title | Email | Phone |\n|:---------------|:-------------------|:---------------------|:---------------|\n| Jane Smith | CEO | [email protected] | +1-555-0101 |\n| John Doe | CTO | [email protected] | +1-555-0102 |\n| Alice Johnson | VP Engineering | [email protected] | N/A |\n```\n\n---\n\n## Article Extraction Template\n\n```markdown\n## Article: [Title]\n\n**Author:** Author Name\n**Published:** YYYY-MM-DD\n**Source:** [Site Name](URL)\n\n### Summary\n[2-3 sentence summary of the article content]\n\n### Key Data Points\n- [Factual data point 1]\n- [Factual data point 2]\n- [Statistical finding]\n\n### Tags\n`tag1` `tag2` `tag3`\n```\n\nNote: Summarize article content. Do not reproduce full article text\ndue to copyright.\n\n---\n\n## FAQ Extraction Template\n\n```markdown\n### FAQ: [Page Title]\n\n**Source:** [Site Name](URL)\n**Items:** 12 questions\n\n| # | Question | Answer (excerpt) |\n|--:|:---------|:-----------------|\n| 1 | How do I reset my password? | Navigate to Settings > Security and click \"Reset...\" |\n| 2 | What payment methods do you accept? | We accept Visa, Mastercard, PayPal, and bank transfer... 
|\n```\n\nOr as JSON (default for FAQ mode):\n```json\n{\n \"metadata\": { \"source\": \"URL\", \"itemCount\": 12, \"confidence\": \"HIGH\" },\n \"data\": [\n { \"question\": \"How do I reset my password?\", \"answer\": \"Navigate to...\", \"category\": \"Account\" },\n { \"question\": \"What payment methods?\", \"answer\": \"We accept...\", \"category\": \"Billing\" }\n ]\n}\n```\n\n---\n\n## Pricing Plans Template\n\n```markdown\n### Pricing: [Product Name]\n\n**Source:** [Site Name](URL)\n**Plans:** 3 tiers\n\n| Plan | Monthly | Annual | Highlighted |\n|:------------|----------:|----------:|:-----------:|\n| Starter | $9/mo | $7/mo | |\n| Pro | $29/mo | $24/mo | * |\n| Enterprise | Custom | Custom | |\n\n#### Feature Comparison\n\n| Feature | Starter | Pro | Enterprise |\n|:----------------------|:-------:|:---:|:----------:|\n| Users | 1 | 10 | Unlimited |\n| Storage | 5 GB | 50 GB | Unlimited |\n| API Access | N/A | Yes | Yes |\n| Priority Support | N/A | N/A | Yes |\n```\n\n---\n\n## Job Listings Template\n\n```markdown\n| Title | Company | Location | Salary | Type | Posted |\n|:-------------------|:------------|:---------------|:----------------|:----------|:-----------|\n| Senior Engineer | TechCo | Remote, US | $150k - $200k | Full-time | 2026-02-20 |\n| Product Manager | StartupXYZ | San Francisco | $130k - $160k | Full-time | 2026-02-18 |\n| Data Analyst | DataCorp | London, UK | GBP 55k - 70k | Contract | 2026-02-22 |\n```\n\n---\n\n## Events Template\n\n```markdown\n| Event | Date | Time | Location | Speakers |\n|:-----------------------|:-----------|:--------|:------------------|:---------------|\n| Opening Keynote | 2026-03-15 | 09:00 | Main Hall | J. Smith |\n| Workshop: AI Basics | 2026-03-15 | 14:00 | Room 201 | A. 
Johnson |\n| Networking Reception | 2026-03-15 | 18:00 | Rooftop Lounge | N/A |\n```\n\n---\n\n## Differential (Diff) Output Template\n\nWhen comparing current extraction with a previous run:\n\n```markdown\n## Extraction Results (Diff)\n\n**Source:** [Page Title](URL)\n**Date:** 2026-02-25 14:30 UTC\n**Compared to:** 2026-02-20 10:00 UTC\n**Changes:** +5 new, -2 removed, 3 modified\n\n---\n\n### New Items (+5)\n\n| Name | Price | Rating |\n|:---------------|--------:|:------:|\n| Product Eta | $39.99 | 4.6 |\n| Product Theta | $24.99 | 4.1 |\n| ... | | |\n\n### Removed Items (-2)\n\n| Name | Price | Rating |\n|:---------------|--------:|:------:|\n| ~~Product Alpha~~ | ~~$29.99~~ | ~~4.5~~ |\n| ~~Product Beta~~ | ~~$49.99~~ | ~~4.2~~ |\n\n### Modified Items (3)\n\n| Name | Field | Was | Now |\n|:---------------|:--------|:-----------|:-----------|\n| Product Gamma | Price | $119.00 | $109.00 |\n| Product Gamma | Rating | 4.8 | 4.9 |\n| Product Delta | Stock | Out of Stock | In Stock |\n\n---\n\n**Summary:**\n- 5 new products added since last extraction\n- 2 products removed (possibly discontinued)\n- Product Gamma had a price drop of $10 and rating increase\n- Product Delta is back in stock\n```\n\n---\n\n## Error / Partial Result Template\n\nWhen extraction partially fails:\n\n```markdown\n## Extraction Results (Partial)\n\n**Source:** [Page Title](URL)\n**Date:** 2026-02-25 14:30 UTC\n**Items:** 23 of ~50 expected records\n**Confidence:** LOW\n**Strategy:** A (WebFetch) -> escalated to B (Browser)\n\n---\n\n[PARTIAL DATA]\n\n---\n\n**Issues:**\n- 27 items could not be extracted (content behind JS rendering)\n- Price field missing for 5 items (marked N/A)\n- Auto-escalation from WebFetch to Browser recovered 15 additional items\n\n**Suggestions:**\n- Re-run with explicit Browser automation for complete results\n- Check if site has an API endpoint for direct data access\n- Try at a different time if rate-limited\n```\n","content_type":"text/markdown; 
charset=utf-8","language":"markdown","size":12263,"content_sha256":"0b77d9ca726262f8742d9435ff57a66e5c85622809cb6868049199cd46dc491d"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"Web Scraper","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Overview","type":"text"}]},{"type":"paragraph","content":[{"text":"Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Supports pagination, monitoring, and CSV/JSON export.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"When to Use This Skill","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"When the user mentions \"scraper\" or related topics","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"When the user mentions \"scraping\" or related topics","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"When the user mentions \"extrair dados web\" or related topics","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"When the user mentions \"web scraping\" or related topics","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"When the user mentions \"raspar dados\" or related topics","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"When the user mentions \"coletar dados site\" or related topics","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Do Not Use This Skill When","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"The task is unrelated to web scraping","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"A simpler, more specific tool can handle the 
request","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"The user needs general-purpose assistance without domain expertise","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"How It Works","type":"text"}]},{"type":"paragraph","content":[{"text":"Execute phases in strict order. Each phase feeds the next.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"1. CLARIFY -> 2. RECON -> 3. STRATEGY -> 4. EXTRACT -> 5. TRANSFORM -> 6. VALIDATE -> 7. FORMAT","type":"text"}]},{"type":"paragraph","content":[{"text":"Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.","type":"text"}]},{"type":"paragraph","content":[{"text":"Fast path","type":"text","marks":[{"type":"strong"}]},{"text":": If the user provides a URL and a clear data target, and the request is simple (single page, one data type), compress Phases 1-3 into a single action: fetch, classify, and extract in one WebFetch call. 
Still validate and format.","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Capabilities","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Multi-strategy","type":"text","marks":[{"type":"strong"}]},{"text":": WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Extraction modes","type":"text","marks":[{"type":"strong"}]},{"text":": table, list, article, product, contact, FAQ, pricing, events, jobs, custom","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Output formats","type":"text","marks":[{"type":"strong"}]},{"text":": Markdown tables (default), JSON, CSV","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Pagination","type":"text","marks":[{"type":"strong"}]},{"text":": auto-detect and follow (page numbers, infinite scroll, load-more)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Multi-URL","type":"text","marks":[{"type":"strong"}]},{"text":": extract same structure across sources with comparison and diff","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Validation","type":"text","marks":[{"type":"strong"}]},{"text":": confidence ratings (HIGH/MEDIUM/LOW) on every extraction","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Auto-escalation","type":"text","marks":[{"type":"strong"}]},{"text":": WebFetch fails silently -> automatic Browser fallback","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Data transforms","type":"text","marks":[{"type":"strong"}]},{"text":": cleaning, normalization, deduplication, 
enrichment","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Differential mode","type":"text","marks":[{"type":"strong"}]},{"text":": detect changes between scraping runs","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Web Scraper","type":"text"}]},{"type":"paragraph","content":[{"text":"Multi-strategy web data extraction with intelligent approach selection, automatic fallback escalation, data transformation, and structured output.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Phase 1: Clarify","type":"text"}]},{"type":"paragraph","content":[{"text":"Establish extraction parameters before touching any URL.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Required Parameters","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Parameter","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Resolve","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Default","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Target URL(s)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Which page(s) to 
scrape?","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"(required)","type":"text","marks":[{"type":"em"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Data Target","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"What specific data to extract?","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"(required)","type":"text","marks":[{"type":"em"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Output Format","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Markdown table, JSON, CSV, or text?","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Markdown table","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Scope","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Single page, paginated, or multi-URL?","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Single page","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Optional 
Parameters","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Parameter","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Resolve","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Default","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Pagination","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Follow pagination? Max pages?","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"No, 1 page","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Max Items","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Maximum number of items to collect?","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Unlimited","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Filters","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Data to 
exclude or include?","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"None","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Sort Order","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"How to sort results?","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Source order","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Save Path","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Save to file? 
Which path?","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Display only","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Language","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Respond in which language?","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"User's lang","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Diff Mode","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Compare with previous run?","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"No","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Clarification Rules","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"If user provides a URL and clear data target, proceed directly to Phase 2. Do NOT ask unnecessary questions.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"If request is ambiguous (e.g. \"scrape this site\"), ask ONLY: \"What specific data do you want me to extract from this page?\"","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Default to Markdown table output. 
Mention alternatives only if relevant.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Accept requests in any language. Always respond in the user's language.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"If user says \"everything\" or \"all data\", perform recon first, then present what's available and let user choose.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Discovery Mode","type":"text"}]},{"type":"paragraph","content":[{"text":"When user has a topic but no specific URL:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use WebSearch to find the most relevant pages","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Present top 3-5 URLs with descriptions","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Let user choose which to scrape, or scrape all","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Proceed to Phase 2 with selected URL(s)","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Example: \"find and extract pricing data for CRM tools\" -> WebSearch(\"CRM tools pricing comparison 2026\") -> Present top results -> User selects -> Extract","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Phase 2: Reconnaissance","type":"text"}]},{"type":"paragraph","content":[{"text":"Analyze the target page before extraction.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Step 2.1: Initial Fetch","type":"text"}]},{"type":"paragraph","content":[{"text":"Use WebFetch to retrieve and analyze the page structure:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"WebFetch(\n 
url = TARGET_URL,\n prompt = \"Analyze this page structure and report:\n 1. Page type: article, product listing, search results, data table,\n directory, dashboard, API docs, FAQ, pricing page, job board, events, or other\n 2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,\n accordion/collapsible sections, tabs\n 3. Approximate number of distinct data items visible\n 4. JavaScript rendering indicators: empty containers, loading spinners,\n SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS\n 5. Pagination: next/prev links, page numbers, load-more buttons,\n infinite scroll indicators, total results count\n 6. Data density: how much structured, extractable data exists\n 7. List the main data fields/columns available for extraction\n 8. Embedded structured data: JSON-LD, microdata, OpenGraph tags\n 9. Available download links: CSV, Excel, PDF, API endpoints\"\n)","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Step 2.2: Evaluate Fetch Quality","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Signal","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Interpretation","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Action","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Rich content with data clearly visible","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Static 
page","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Strategy A (WebFetch)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Empty containers, \"loading...\", minimal text","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"JS-rendered","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Strategy B (Browser)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Login wall, CAPTCHA, 403/401 response","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Blocked","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Report to user","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Content present but poorly structured","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Needs precision","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Strategy B 
(Browser)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"JSON or XML response body","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"API endpoint","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Strategy C (Bash/curl)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Download links for CSV/Excel available","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Direct data file","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Strategy C (download)","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Step 2.3: Content Classification","type":"text"}]},{"type":"paragraph","content":[{"text":"Classify into an extraction 
mode:","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Mode","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Indicators","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Examples","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"table","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"HTML ","type":"text"},{"text":"\u003ctable>","type":"text","marks":[{"type":"code_inline"}]},{"text":", grid layout with headers","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Price comparison, statistics, specs","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"list","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Repeated similar elements, card grids","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Search results, product 
listings","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"article","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Long-form text with headings/paragraphs","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Blog post, news article, docs","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"product","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Product name, price, specs, images, rating","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"E-commerce product page","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"contact","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Names, emails, phones, addresses, roles","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Team page, staff 
directory","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"faq","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Question-answer pairs, accordions","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"FAQ page, help center","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"pricing","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Plan names, prices, features, tiers","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"SaaS pricing page","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"events","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Dates, locations, titles, descriptions","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Event listings, 
conferences","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"jobs","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Titles, companies, locations, salaries","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Job boards, career pages","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"custom","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"User specified CSS selectors or fields","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Anything not matching above","type":"text"}]}]}]}]},{"type":"paragraph","content":[{"text":"Record: ","type":"text"},{"text":"page type","type":"text","marks":[{"type":"strong"}]},{"text":", ","type":"text"},{"text":"extraction mode","type":"text","marks":[{"type":"strong"}]},{"text":", ","type":"text"},{"text":"JS rendering needed (yes/no)","type":"text","marks":[{"type":"strong"}]},{"text":", ","type":"text"},{"text":"available fields","type":"text","marks":[{"type":"strong"}]},{"text":", ","type":"text"},{"text":"structured data present (JSON-LD etc.)","type":"text","marks":[{"type":"strong"}]},{"text":".","type":"text"}]},{"type":"paragraph","content":[{"text":"If user asked for \"everything\", present the available fields and let them 
choose.","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Phase 3: Strategy Selection","type":"text"}]},{"type":"paragraph","content":[{"text":"Choose the extraction approach based on recon results.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Decision Tree","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"Structured data (JSON-LD, microdata) has what we need?\n |\n +-- YES --> STRATEGY E: Extract structured data directly\n |\n +-- NO: Content fully visible in WebFetch?\n |\n +-- YES: Need precise element targeting?\n | |\n | +-- NO --> STRATEGY A: WebFetch + AI extraction\n | +-- YES --> STRATEGY B: Browser automation\n |\n +-- NO: JavaScript rendering detected?\n |\n +-- YES --> STRATEGY B: Browser automation\n +-- NO: API/JSON/XML endpoint or download link?\n |\n +-- YES --> STRATEGY C: Bash (curl + jq)\n +-- NO --> Report access issue to user","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Strategy A: WebFetch with AI Extraction","type":"text"}]},{"type":"paragraph","content":[{"text":"Best for","type":"text","marks":[{"type":"strong"}]},{"text":": Static pages, articles, simple tables, well-structured HTML.","type":"text"}]},{"type":"paragraph","content":[{"text":"Use WebFetch with a targeted extraction prompt tailored to the mode:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"WebFetch(\n url = URL,\n prompt = \"Extract [DATA_TARGET] from this page.\n Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].\n Rules:\n - If a value is missing or unclear, use 'N/A'\n - Do not include navigation, ads, footers, or unrelated content\n - Preserve original values exactly (numbers, currencies, dates)\n - Include ALL matching items, not just the first few\n - For each item, also extract the URL/link if 
available\"\n)","type":"text"}]},{"type":"paragraph","content":[{"text":"Auto-escalation","type":"text","marks":[{"type":"strong"}]},{"text":": If WebFetch returns suspiciously few items (less than 50% of expected from recon), or mostly empty fields, automatically escalate to Strategy B without asking user. Log the escalation in notes.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Strategy B: Browser Automation","type":"text"}]},{"type":"paragraph","content":[{"text":"Best for","type":"text","marks":[{"type":"strong"}]},{"text":": JS-rendered pages, SPAs, interactive content, lazy-loaded data.","type":"text"}]},{"type":"paragraph","content":[{"text":"Sequence:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Get tab context: ","type":"text"},{"text":"tabs_context_mcp(createIfEmpty=true)","type":"text","marks":[{"type":"code_inline"}]},{"text":" -> get tabId","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Navigate to URL: ","type":"text"},{"text":"navigate(url=TARGET_URL, tabId=TAB)","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Wait for content to load: ","type":"text"},{"text":"computer(action=\"wait\", duration=3, tabId=TAB)","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Check for cookie/consent banners: ","type":"text"},{"text":"find(query=\"cookie consent or accept button\", tabId=TAB)","type":"text","marks":[{"type":"code_inline"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"If found, dismiss it (prefer privacy-preserving option)","type":"text"}]}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Read page structure: 
","type":"text"},{"text":"read_page(tabId=TAB)","type":"text","marks":[{"type":"code_inline"}]},{"text":" or ","type":"text"},{"text":"get_page_text(tabId=TAB)","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Locate target elements: ","type":"text"},{"text":"find(query=\"[DESCRIPTION]\", tabId=TAB)","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Extract with JavaScript for precise data via ","type":"text"},{"text":"javascript_tool","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"javascript"},"content":[{"text":"// Table extraction\nconst rows = document.querySelectorAll('TABLE_SELECTOR tr');\nconst data = Array.from(rows).map(row => {\n const cells = row.querySelectorAll('td, th');\n return Array.from(cells).map(c => c.textContent.trim());\n});\nJSON.stringify(data);","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"javascript"},"content":[{"text":"// List/card extraction\nconst items = document.querySelectorAll('ITEM_SELECTOR');\nconst data = Array.from(items).map(item => ({\n field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,\n field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,\n link: item.querySelector('a')?.href || null,\n}));\nJSON.stringify(data);","type":"text"}]},{"type":"ordered_list","attrs":{"order":8,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"For lazy-loaded content, scroll and re-extract: ","type":"text"},{"text":"computer(action=\"scroll\", scroll_direction=\"down\", tabId=TAB)","type":"text","marks":[{"type":"code_inline"}]},{"text":" then ","type":"text"},{"text":"computer(action=\"wait\", duration=2, 
tabId=TAB)","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Strategy C: Bash (curl + jq)","type":"text"}]},{"type":"paragraph","content":[{"text":"Best for","type":"text","marks":[{"type":"strong"}]},{"text":": REST APIs, JSON endpoints, XML feeds, CSV/Excel downloads.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# JSON API\ncurl -s \"API_URL\" | jq '[.items[] | {field1: .key1, field2: .key2}]'\n\n# CSV download\ncurl -s \"CSV_URL\" -o /tmp/scraped_data.csv\n\n# XML parsing\ncurl -s \"XML_URL\" | python3 -c \"\nimport xml.etree.ElementTree as ET, json, sys\ntree = ET.parse(sys.stdin)\n# ... parse and output JSON\n\"","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Strategy D: Hybrid","type":"text"}]},{"type":"paragraph","content":[{"text":"When a single strategy is insufficient, combine:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"WebSearch to discover relevant URLs","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"WebFetch for initial content assessment","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Browser automation for JS-heavy sections","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Bash for post-processing (jq, python for data cleaning)","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Strategy E: Structured Data Extraction","type":"text"}]},{"type":"paragraph","content":[{"text":"When JSON-LD, microdata, or OpenGraph is present:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use Browser 
","type":"text"},{"text":"javascript_tool","type":"text","marks":[{"type":"code_inline"}]},{"text":" to extract structured data:","type":"text"}]}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"javascript"},"content":[{"text":"const scripts = document.querySelectorAll('script[type=\"application/ld+json\"]');\nconst data = Array.from(scripts).map(s => {\n try { return JSON.parse(s.textContent); } catch { return null; }\n}).filter(Boolean);\nJSON.stringify(data);","type":"text"}]},{"type":"ordered_list","attrs":{"order":2,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"This often provides cleaner, more reliable data than DOM scraping","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Fall back to DOM extraction only for fields not in structured data","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Pagination Handling","type":"text"}]},{"type":"paragraph","content":[{"text":"When pagination is detected and user wants multiple pages:","type":"text"}]},{"type":"paragraph","content":[{"text":"Page-number pagination (any strategy):","type":"text","marks":[{"type":"strong"}]}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Extract data from current page","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Identify URL pattern (e.g. 
","type":"text"},{"text":"?page=N","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"/page/N","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"&offset=N","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Iterate through pages up to user's max (default: 5 pages)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Show progress: \"Extracting page 2/5...\"","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Concatenate all results, deduplicate if needed","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Infinite scroll (Browser only):","type":"text","marks":[{"type":"strong"}]}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Extract currently visible data","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Record item count","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Scroll down: ","type":"text"},{"text":"computer(action=\"scroll\", scroll_direction=\"down\", tabId=TAB)","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Wait: ","type":"text"},{"text":"computer(action=\"wait\", duration=2, tabId=TAB)","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Extract newly loaded data","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Compare count - if no new items after 2 scrolls, stop","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Repeat until no new content or max iterations (default: 
5)","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"\"Load More\" button (Browser only):","type":"text","marks":[{"type":"strong"}]}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Extract currently visible data","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Find button: ","type":"text"},{"text":"find(query=\"load more button\", tabId=TAB)","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Click it: ","type":"text"},{"text":"computer(action=\"left_click\", ref=REF, tabId=TAB)","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Wait and extract new content","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Repeat until button disappears or max iterations reached","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Phase 4: Extract","type":"text"}]},{"type":"paragraph","content":[{"text":"Execute the selected strategy using mode-specific patterns. 
See ","type":"text"},{"text":"references/extraction-patterns.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/extraction-patterns.md","title":null}}]},{"text":" for CSS selectors and JavaScript snippets.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Table Mode","type":"text"}]},{"type":"paragraph","content":[{"text":"WebFetch prompt:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"\"Extract ALL rows from the table(s) on this page.\nReturn as a markdown table with exact column headers.\nInclude every row - do not truncate or summarize.\nPreserve numeric precision, currencies, and units.\"","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"List Mode","type":"text"}]},{"type":"paragraph","content":[{"text":"WebFetch prompt:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"\"Extract each [ITEM_TYPE] from this page.\nFor each item, extract: [FIELD_LIST].\nReturn as a JSON array of objects with these keys: [KEY_LIST].\nInclude ALL items, not just the first few. Include link/URL for each item if available.\"","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Article Mode","type":"text"}]},{"type":"paragraph","content":[{"text":"WebFetch prompt:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"\"Extract article metadata:\n- title, author, date, tags/categories, word count estimate\n- Key factual data points, statistics, and named entities\nReturn as structured markdown. 
Summarize the content; do not reproduce full text.\"","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Product Mode","type":"text"}]},{"type":"paragraph","content":[{"text":"WebFetch prompt:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"\"Extract product data with these exact fields:\n- name, brand, price, currency, originalPrice (if discounted),\n availability, description (first 200 chars), rating, reviewCount,\n specifications (as key-value pairs), productUrl, imageUrl\nReturn as JSON. Use null for missing fields.\"","type":"text"}]},{"type":"paragraph","content":[{"text":"Also check for JSON-LD ","type":"text"},{"text":"Product","type":"text","marks":[{"type":"code_inline"}]},{"text":" schema (Strategy E) first.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Contact Mode","type":"text"}]},{"type":"paragraph","content":[{"text":"WebFetch prompt:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"\"Extract contact information for each person/entity:\n- name, title, role, email, phone, address, organization, website, linkedinUrl\nReturn as a markdown table. 
Only extract real contacts visible on the page.\"","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"FAQ Mode","type":"text"}]},{"type":"paragraph","content":[{"text":"WebFetch prompt:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"\"Extract all question-answer pairs from this page.\nFor each FAQ item extract:\n- question: the exact question text\n- answer: the answer text (first 300 chars if long)\n- category: the section/category if grouped\nReturn as a JSON array of objects.\"","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Pricing Mode","type":"text"}]},{"type":"paragraph","content":[{"text":"WebFetch prompt:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"\"Extract all pricing plans/tiers from this page.\nFor each plan extract:\n- planName, monthlyPrice, annualPrice, currency\n- features (array of included features)\n- limitations (array of limits or excluded features)\n- ctaText (call-to-action button text)\n- highlighted (true if marked as recommended/popular)\nReturn as JSON. Use null for missing fields.\"","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Events Mode","type":"text"}]},{"type":"paragraph","content":[{"text":"WebFetch prompt:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"\"Extract all events/sessions from this page.\nFor each event extract:\n- title, date, time, endTime, location, description (first 200 chars)\n- speakers (array of names), category, registrationUrl\nReturn as JSON. 
Use null for missing fields.\"","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Jobs Mode","type":"text"}]},{"type":"paragraph","content":[{"text":"WebFetch prompt:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"\"Extract all job listings from this page.\nFor each job extract:\n- title, company, location, salary, salaryRange, type (full-time/part-time/contract)\n- postedDate, description (first 200 chars), applyUrl, tags\nReturn as JSON. Use null for missing fields.\"","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Custom Mode","type":"text"}]},{"type":"paragraph","content":[{"text":"When user provides specific selectors or field descriptions:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use Browser automation with ","type":"text"},{"text":"javascript_tool","type":"text","marks":[{"type":"code_inline"}]},{"text":" and user's CSS selectors","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Or use WebFetch with a prompt built from user's field descriptions","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Always confirm extracted schema with user before proceeding to multi-URL extraction","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Multi-URL Extraction","type":"text"}]},{"type":"paragraph","content":[{"text":"When extracting from multiple URLs:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Extract from the ","type":"text"},{"text":"first URL","type":"text","marks":[{"type":"strong"}]},{"text":" to establish the data schema","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Show user the first results and confirm the schema is 
correct","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Extract from remaining URLs using the same schema","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Add a ","type":"text"},{"text":"source","type":"text","marks":[{"type":"code_inline"}]},{"text":" column/field to every record with the origin URL","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Combine all results into a single output","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Show progress: \"Extracting 3/7 URLs...\"","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Phase 5: Transform","type":"text"}]},{"type":"paragraph","content":[{"text":"Clean, normalize, and enrich extracted data before validation. See ","type":"text"},{"text":"references/data-transforms.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/data-transforms.md","title":null}}]},{"text":" for patterns.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Automatic Transforms (Always Apply)","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Transform","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Action","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Whitespace cleanup","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Trim, collapse multiple spaces, remove 
","type":"text"},{"text":"\\n","type":"text","marks":[{"type":"code_inline"}]},{"text":" in cells","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"HTML entity decode","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"&amp;","type":"text","marks":[{"type":"code_inline"}]},{"text":" -> ","type":"text"},{"text":"&","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"&lt;","type":"text","marks":[{"type":"code_inline"}]},{"text":" -> ","type":"text"},{"text":"\u003c","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"&#39;","type":"text","marks":[{"type":"code_inline"}]},{"text":" -> ","type":"text"},{"text":"'","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Unicode normalization","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"NFKC normalization for consistent characters","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Empty string to null","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"\"\"","type":"text","marks":[{"type":"code_inline"}]},{"text":" -> ","type":"text"},{"text":"null","type":"text","marks":[{"type":"code_inline"}]},{"text":" (for JSON), ","type":"text"},{"text":"\"\"","type":"text","marks":[{"type":"code_inline"}]},{"text":" -> 
","type":"text"},{"text":"N/A","type":"text","marks":[{"type":"code_inline"}]},{"text":" (for tables)","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Conditional Transforms (Apply When Relevant)","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Transform","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"When","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Action","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Price normalization","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Product/pricing modes","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Extract numeric value + currency symbol","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Date normalization","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Any dates found","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Normalize to ISO-8601 
(YYYY-MM-DD)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"URL resolution","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Relative URLs extracted","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Convert to absolute URLs","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Phone normalization","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Contact mode","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Standardize to E.164 format if possible","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Deduplication","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Multi-page or multi-URL","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Remove exact duplicate 
rows","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Sorting","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"User requested or natural","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Sort by user-specified field","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Data Enrichment (Only When Useful)","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Enrichment","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"When","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Action","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Currency conversion","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"User asks for single currency","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Note original + convert (approximate)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Domain 
extraction","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"URLs in data","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Add domain column from full URLs","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Word count","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Article mode","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Count words in extracted text","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Relative dates","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Dates present","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Add \"X days ago\" column if useful","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Deduplication Strategy","type":"text"}]},{"type":"paragraph","content":[{"text":"When combining data from multiple pages or URLs:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Exact match: rows with identical values in all fields -> keep first","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Near match: 
rows with same key fields (name+source) but different details -> keep most complete (fewer nulls), flag in notes","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Report: \"Removed N duplicate rows\" in delivery notes","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Phase 6: Validate","type":"text"}]},{"type":"paragraph","content":[{"text":"Verify extraction quality before delivering results.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Validation Checks","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Check","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Action","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Item count","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Compare extracted count to expected count from recon","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Empty fields","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Count N/A or null values per field","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Data type 
consistency","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Numbers should be numeric, dates parseable","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Duplicates","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Flag exact duplicate rows (post-dedup)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Encoding","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Check for HTML entities, garbled characters","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Completeness","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"All user-requested fields present in output","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Truncation","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Verify data wasn't cut off (check last 
items)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Outliers","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Flag values that seem anomalous (e.g. $0.00 price)","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Confidence Rating","type":"text"}]},{"type":"paragraph","content":[{"text":"Assign to every extraction:","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Rating","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Criteria","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"HIGH","type":"text","marks":[{"type":"strong"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"All fields populated, count matches expected, no anomalies","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"MEDIUM","type":"text","marks":[{"type":"strong"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Minor gaps (\u003c10% empty fields) or count slightly 
differs","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"LOW","type":"text","marks":[{"type":"strong"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Significant gaps (>10% empty), structural issues, partial data","type":"text"}]}]}]}]},{"type":"paragraph","content":[{"text":"Always report confidence with specifics:","type":"text"}]},{"type":"blockquote","content":[{"type":"paragraph","content":[{"text":"Confidence: ","type":"text"},{"text":"HIGH","type":"text","marks":[{"type":"strong"}]},{"text":" - 47 items extracted, all 6 fields populated, matches expected count from page analysis.","type":"text"}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Auto-Recovery (Try Before Reporting Issues)","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Issue","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Auto-Recovery Action","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Missing data","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Re-attempt with Browser if WebFetch was used","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Encoding 
problems","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Apply HTML entity decode + unicode normalization","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Incomplete results","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Check for pagination or lazy-loading, fetch more","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Count mismatch","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Scroll/paginate to find remaining items","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"All fields empty","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Page likely JS-rendered, switch to Browser strategy","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Partial fields","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Try JSON-LD extraction as supplement","type":"text"}]}]}]}]},{"type":"paragraph","content":[{"text":"Log all recovery attempts in delivery notes. 
Inform user of any irrecoverable gaps with specific details.","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Phase 7: Format And Deliver","type":"text"}]},{"type":"paragraph","content":[{"text":"Structure results according to user preference. See ","type":"text"},{"text":"references/output-templates.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/output-templates.md","title":null}}]},{"text":" for complete formatting templates.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Delivery Envelope","type":"text"}]},{"type":"paragraph","content":[{"text":"ALWAYS wrap results with this metadata header:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"markdown"},"content":[{"text":"\n## Extraction Results\n\n**Source:** [Page Title](http://example.com)\n**Date:** YYYY-MM-DD HH:MM UTC\n**Items:** N records (M fields each)\n**Confidence:** HIGH | MEDIUM | LOW\n**Strategy:** A (WebFetch) | B (Browser) | C (API) | E (Structured Data)\n**Format:** Markdown Table | JSON | CSV\n\n---\n\n[DATA HERE]\n\n---\n\n**Notes:**\n- [Any gaps, issues, or observations]\n- [Transforms applied: deduplication, normalization, etc.]\n- [Pages scraped if paginated: \"Pages 1-5 of 12\"]\n- [Auto-escalation if it occurred: \"Escalated from WebFetch to Browser\"]","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Markdown Table Rules","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Left-align text columns (","type":"text"},{"text":":---","type":"text","marks":[{"type":"code_inline"}]},{"text":"), right-align numbers (","type":"text"},{"text":"---:","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Consistent column widths for 
readability","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Include summary row for numeric data when useful (totals, averages)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Maximum 10 columns per table; split wider data into multiple tables or suggest JSON format","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Truncate long cell values to 60 chars with ","type":"text"},{"text":"...","type":"text","marks":[{"type":"code_inline"}]},{"text":" indicator","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use ","type":"text"},{"text":"N/A","type":"text","marks":[{"type":"code_inline"}]},{"text":" for missing values, never leave cells empty","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"For multi-page results, show combined table (not per-page)","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"JSON Rules","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use camelCase for keys (e.g. ","type":"text"},{"text":"productName","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"unitPrice","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Wrap in metadata envelope:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"json"},"content":[{"text":"{\n \"metadata\": {\n \"source\": \"URL\",\n \"title\": \"Page Title\",\n \"extractedAt\": \"ISO-8601\",\n \"itemCount\": 47,\n \"fieldCount\": 6,\n \"confidence\": \"HIGH\",\n \"strategy\": \"A\",\n \"transforms\": [\"deduplication\", \"priceNormalization\"],\n \"notes\": []\n },\n \"data\": [ ... 
]\n}","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Pretty-print with 2-space indentation","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Numbers as numbers (not strings), booleans as booleans","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"null for missing values (not empty strings)","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"CSV Rules","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"First row is always headers","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Quote any field containing commas, quotes, or newlines","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"UTF-8 encoding with BOM for Excel compatibility","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use ","type":"text"},{"text":",","type":"text","marks":[{"type":"code_inline"}]},{"text":" as delimiter (standard)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Include metadata as comments: ","type":"text"},{"text":"# Source: URL","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"File Output","type":"text"}]},{"type":"paragraph","content":[{"text":"When user requests file save:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Markdown: ","type":"text"},{"text":".md","type":"text","marks":[{"type":"code_inline"}]},{"text":" extension","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"JSON: ","type":"text"},{"text":".json","type":"text","marks":[{"type":"code_inline"}]},{"text":" 
extension","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"CSV: ","type":"text"},{"text":".csv","type":"text","marks":[{"type":"code_inline"}]},{"text":" extension","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Confirm path before writing","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Report full file path and item count after saving","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Multi-URL Comparison Format","type":"text"}]},{"type":"paragraph","content":[{"text":"When comparing data across multiple sources:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Add ","type":"text"},{"text":"Source","type":"text","marks":[{"type":"code_inline"}]},{"text":" as the first column/field","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use short identifiers for sources (domain name or user label)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Group by source or interleave based on user preference","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Highlight differences if user asks for comparison","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Include summary: \"Best price: $X at store-b.com\"","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Differential Output","type":"text"}]},{"type":"paragraph","content":[{"text":"When user requests change detection (diff mode):","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Compare current extraction with previous run","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Mark new items with 
","type":"text"},{"text":"[NEW]","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Mark removed items with ","type":"text"},{"text":"[REMOVED]","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Mark changed values with ","type":"text"},{"text":"[WAS: old_value]","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Include summary: \"Changes since last run: +5 new, -2 removed, 3 modified\"","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Rate Limiting","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Maximum 1 request per 2 seconds for sequential page fetches","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"For multi-URL jobs, process sequentially with pauses","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"If a site returns 429 (Too Many Requests), stop and report to user","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Access Respect","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"If a page blocks access (403, CAPTCHA, login wall), report to user","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Do NOT attempt to bypass bot detection, CAPTCHAs, or access controls","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Do NOT scrape behind authentication unless user explicitly provides access","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Respect robots.txt directives when 
known","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Copyright","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Do NOT reproduce large blocks of copyrighted article text","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"For articles: extract factual data, statistics, and structured info; summarize narrative content","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Always include source attribution (http://example.com) in output","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Data Scope","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Extract ONLY what the user explicitly requested","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Warn user before collecting potentially sensitive data at scale (emails, phone numbers, personal information)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Do not store or transmit extracted data beyond what the user sees","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Failure Protocol","type":"text"}]},{"type":"paragraph","content":[{"text":"When extraction fails or is blocked:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Explain the specific reason (JS rendering, bot detection, login, etc.)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Suggest alternatives (different URL, API if available, manual approach)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Never retry aggressively or escalate access 
attempts","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Quick Reference: Mode Cheat Sheet","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"User Says...","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Mode","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Strategy","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Output Default","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"\"extract the table\"","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"table","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"A or B","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Markdown table","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"\"get all 
products/prices\"","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"product","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"E then A","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Markdown table","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"\"scrape the listings\"","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"list","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"A or B","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Markdown table","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"\"extract contact info / team page\"","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"contact","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"A","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Markdown 
table","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"\"get the article data\"","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"article","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"A","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Markdown text","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"\"extract the FAQ\"","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"faq","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"A or B","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"JSON","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"\"get pricing plans\"","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"pricing","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"A or 
B","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Markdown table","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"\"scrape job listings\"","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"jobs","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"A or B","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Markdown table","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"\"get event schedule\"","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"events","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"A or B","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Markdown table","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"\"find and extract 
[topic]\"","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"discovery","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"WebSearch","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Markdown table","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"\"compare prices across sites\"","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"multi-URL","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"A or B","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Comparison table","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"\"what changed since last time\"","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"diff","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"any","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":"left"},"content":[{"type":"paragraph","content":[{"text":"Diff 
format","type":"text"}]}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"References","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Extraction patterns","type":"text","marks":[{"type":"strong"}]},{"text":": ","type":"text"},{"text":"references/extraction-patterns.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/extraction-patterns.md","title":null}}]},{"text":" CSS selectors, JavaScript snippets, JSON-LD parsing, domain tips.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Output templates","type":"text","marks":[{"type":"strong"}]},{"text":": ","type":"text"},{"text":"references/output-templates.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/output-templates.md","title":null}}]},{"text":" Markdown, JSON, CSV templates with complete examples.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Data transforms","type":"text","marks":[{"type":"strong"}]},{"text":": ","type":"text"},{"text":"references/data-transforms.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/data-transforms.md","title":null}}]},{"text":" Cleaning, normalization, deduplication, enrichment patterns.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Best Practices","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Provide clear, specific context about your project and requirements","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Review all suggestions before applying them to production code","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Combine with other complementary skills for comprehensive 
analysis","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Common Pitfalls","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Using this skill for tasks outside its domain expertise","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Applying recommendations without understanding your specific context","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Not providing enough project context for accurate analysis","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Limitations","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use this skill only when the task clearly matches the scope described above.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Do not treat the output as a substitute for environment-specific validation, testing, or expert review.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are 
missing.","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"web-scraper","risk":"safe","tags":["scraping","data-extraction","automation","csv"],"tools":["claude-code","antigravity","cursor","gemini-cli","codex-cli"],"author":"@skillopedia","source":{"stars":39376,"repo_name":"antigravity-awesome-skills","origin_url":"https://github.com/sickn33/antigravity-awesome-skills/blob/HEAD/skills/web-scraper/SKILL.md","repo_owner":"sickn33","body_sha256":"af7bb1d97b7420e48407f4c683ccd436851e2205b2ff3b5c90a715f70d037a58","cluster_key":"6250ac2c15a0682cbc8611547c9f8d2a81795f14bb68399c5ddfd4a7e8af89b0","clean_bundle":{"format":"clean-skill-bundle-v1","source":"sickn33/antigravity-awesome-skills/skills/web-scraper/SKILL.md","attachments":[{"id":"a01317ed-d86b-5d0e-8bda-183cdf49090c","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/a01317ed-d86b-5d0e-8bda-183cdf49090c/attachment.md","path":"references/data-transforms.md","size":11666,"sha256":"2cd94e932b3059490116aad6104cfdbe4ebc6700674e50283e8ad89a4e04f22c","contentType":"text/markdown; charset=utf-8"},{"id":"ce9c4de9-dadd-59bb-95d0-c12d2e3d2966","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/ce9c4de9-dadd-59bb-95d0-c12d2e3d2966/attachment.md","path":"references/extraction-patterns.md","size":15720,"sha256":"c9ebd7aef349304bed50cbe5dca86c932a928c08aeb0645de86174e947d69e6e","contentType":"text/markdown; charset=utf-8"},{"id":"f0e1b604-3175-5853-86b3-a577d1effcd9","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/f0e1b604-3175-5853-86b3-a577d1effcd9/attachment.md","path":"references/output-templates.md","size":12263,"sha256":"0b77d9ca726262f8742d9435ff57a66e5c85622809cb6868049199cd46dc491d","contentType":"text/markdown; 
charset=utf-8"}],"bundle_sha256":"aeaaf483c2c20d1306127875c92dca3292b64b7e2d0f91f869481805d6b23386","attachment_count":3,"text_attachments":3,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":0,"excluded_attachments":[]},"cluster_size":3,"skill_md_path":"skills/web-scraper/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"browser-automation-scraping","category_label":"Browser"},"exact_dupes_collapsed_into_this":2},"version":"v1","category":"browser-automation-scraping","date_added":"2026-03-06","import_tag":"clean-skills-v1","description":"Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Pagination, monitoring, and CSV/JSON export."}},"renderedAt":1782986681940}
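
The Confidence Rating criteria from Phase 6 (HIGH, MEDIUM, LOW) could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not part of the skill itself: the names `empty_ratio` and `rate_confidence`, the `count_tolerance` parameter, and the decision to treat exactly 10% empty cells as LOW are all hypothetical, since the skill leaves that boundary and the meaning of "count slightly differs" unspecified. Anomaly and structural checks are left out.

```python
from typing import Any, Mapping, Optional, Sequence

# Values treated as "missing" (assumption: matches the N/A / null conventions above).
MISSING_VALUES = (None, "", "N/A")


def empty_ratio(rows: Sequence[Mapping[str, Any]], fields: Sequence[str]) -> float:
    """Return the fraction of (row, field) cells that are missing."""
    total_cells = len(rows) * len(fields)
    if total_cells == 0:
        return 1.0  # nothing extracted counts as fully empty
    empty_cells = sum(
        1 for row in rows for field in fields if row.get(field) in MISSING_VALUES
    )
    return empty_cells / total_cells


def rate_confidence(
    rows: Sequence[Mapping[str, Any]],
    fields: Sequence[str],
    expected_count: Optional[int] = None,
    count_tolerance: float = 0.10,  # hypothetical reading of "slightly differs"
) -> str:
    """Map empty-field ratio and count mismatch to HIGH / MEDIUM / LOW."""
    ratio = empty_ratio(rows, fields)

    # Relative gap between extracted and expected item counts (0.0 if unknown).
    count_gap = 0.0
    if expected_count:
        count_gap = abs(len(rows) - expected_count) / expected_count

    if not rows or ratio >= 0.10 or count_gap > count_tolerance:
        return "LOW"
    if ratio == 0.0 and count_gap == 0.0:
        return "HIGH"
    return "MEDIUM"
```

A caller would pass the extracted rows, the user-requested field names, and the item count estimated during recon, for example `rate_confidence(rows, ["name", "price"], expected_count=47)`, and then report the rating together with the specifics (item count, populated fields) as the skill requires.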
