AgentBench for OpenClaw Benchmark your OpenClaw agent's general capabilities across 40 real-world tasks spanning 7 domains. Commands When the user says any of these, follow the corresponding instructions: - — Run the full benchmark suite (all 40 tasks) - — Run only easy+medium tasks (19 tasks) - — Run one domain only - — Run a single task - — Tag results as externally verified scoring - — List all tasks grouped by domain - — Show results from previous runs - — Compare two runs side-by-side Flags are combinable: Running a Benchmark Step 1: Discover Tasks Read task.yaml files from the directory…

\n return bool(re.match(pattern, email))\n\n\ndef validate_phone(phone):\n \"\"\"Check if phone number format is valid.\"\"\"\n cleaned = re.sub(r'[\\s\\-\\(\\)]', '', phone)\n return bool(re.match(r'^\\+?\\d{10,15}$', cleaned))\n\n\ndef validate_date(date_str, fmt='%Y-%m-%d'):\n \"\"\"Check if date string matches expected format.\"\"\"\n try:\n datetime.strptime(date_str, fmt)\n return True\n except ValueError:\n return False\n\n\ndef validate_required_fields(record, required):\n \"\"\"Check that all required fields exist and are non-empty.\n\n Returns a list of field names that are missing or empty.\n \"\"\"\n missing = []\n for field in required:\n if field not in record or not record[field]:\n missing.append(field)\n return missing\n\n\n# === Formatting Functions ===\n\ndef format_currency(amount, symbol='$', decimals=2):\n \"\"\"Format number as currency string.\n\n Args:\n amount: Numeric value to format.\n symbol: Currency symbol to prepend (default '$').\n decimals: Number of decimal places (default 2).\n\n Returns:\n Formatted currency string like '$1,234.56'.\n \"\"\"\n return f\"{symbol}{amount:,.{decimals}f}\"\n\n\ndef format_date(date_str, from_fmt='%Y-%m-%d', to_fmt='%B %d, %Y'):\n \"\"\"Convert date string from one format to another.\n\n Args:\n date_str: Date string in the source format.\n from_fmt: Source format (default '%Y-%m-%d').\n to_fmt: Target format (default '%B %d, %Y').\n\n Returns:\n Reformatted date string.\n \"\"\"\n dt = datetime.strptime(date_str, from_fmt)\n return dt.strftime(to_fmt)\n\n\ndef format_phone(phone):\n \"\"\"Format phone number as (XXX) XXX-XXXX.\n\n Only formats 10-digit US phone numbers; returns original string otherwise.\n \"\"\"\n cleaned = re.sub(r'\\D', '', phone)\n if len(cleaned) == 10:\n return f\"({cleaned[:3]}) {cleaned[3:6]}-{cleaned[6:]}\"\n return phone\n\n\ndef format_percentage(value, decimals=1):\n \"\"\"Format number as percentage string.\n\n Args:\n value: A decimal value (e.g. 
0.85 for 85%).\n decimals: Number of decimal places (default 1).\n\n Returns:\n Formatted percentage string like '85.0%'.\n \"\"\"\n return f\"{value * 100:.{decimals}f}%\"\n\n\n# === Calculation Functions ===\n\ndef calculate_total(items, key='amount'):\n \"\"\"Calculate sum of a specific field across items.\n\n Args:\n items: List of dicts containing the target field.\n key: Field name to sum (default 'amount').\n\n Returns:\n Sum of the field values as a float.\n \"\"\"\n return sum(float(item[key]) for item in items)\n\n\ndef calculate_average(items, key='amount'):\n \"\"\"Calculate average of a specific field across items.\n\n Args:\n items: List of dicts containing the target field.\n key: Field name to average (default 'amount').\n\n Returns:\n Average of the field values, or 0 if items is empty.\n \"\"\"\n if not items:\n return 0\n return calculate_total(items, key) / len(items)\n\n\ndef calculate_discount(price, tier, seasonal=False):\n \"\"\"Calculate discount based on customer tier.\n\n Args:\n price: Original price.\n tier: Customer tier ('standard', 'silver', 'gold', 'platinum').\n seasonal: Whether to apply extra 5% seasonal discount.\n\n Returns:\n Discounted price rounded to 2 decimal places.\n \"\"\"\n rates = {'standard': 0, 'silver': 0.05, 'gold': 0.10, 'platinum': 0.15}\n rate = rates.get(tier, 0)\n if seasonal:\n rate += 0.05\n return round(price * (1 - rate), 2)\nUTILS_EOF\n\n# ── app.py ───────────────────────────────────────────────────────────────────\ncat > app.py \u003c\u003c 'APP_EOF'\n\"\"\"Main application module.\"\"\"\nfrom utils import read_csv, validate_required_fields, format_currency, calculate_total\n\n\nclass App:\n \"\"\"Core application that loads and processes order data.\"\"\"\n\n REQUIRED_FIELDS = ['id', 'name', 'amount']\n\n def __init__(self, data_path):\n self.data = read_csv(data_path)\n\n def validate_records(self):\n \"\"\"Return list of records with missing required fields.\"\"\"\n invalid = []\n for record in 
self.data:\n missing = validate_required_fields(record, self.REQUIRED_FIELDS)\n if missing:\n invalid.append({'record': record, 'missing': missing})\n return invalid\n\n def get_total_display(self):\n \"\"\"Return the formatted total of all order amounts.\"\"\"\n total = calculate_total(self.data, 'amount')\n return format_currency(total)\n\n def summary(self):\n \"\"\"Return a summary dict with record count and total.\"\"\"\n return {\n 'record_count': len(self.data),\n 'total': self.get_total_display(),\n 'invalid_records': len(self.validate_records()),\n }\nAPP_EOF\n\n# ── reports.py ───────────────────────────────────────────────────────────────\ncat > reports.py \u003c\u003c 'REPORTS_EOF'\n\"\"\"Reporting module for generating formatted summaries.\"\"\"\nfrom utils import read_csv, format_date, format_currency, format_percentage, calculate_average\n\n\nclass ReportGenerator:\n \"\"\"Generates formatted reports from data files.\"\"\"\n\n def __init__(self, data_path):\n self.data = read_csv(data_path)\n\n def average_amount_display(self):\n \"\"\"Return the average order amount as a formatted currency string.\"\"\"\n avg = calculate_average(self.data, 'amount')\n return format_currency(avg)\n\n def average_score_display(self):\n \"\"\"Return the average score as a formatted percentage.\"\"\"\n avg = calculate_average(self.data, 'score')\n return format_percentage(avg)\n\n def format_record_dates(self, to_fmt='%B %d, %Y'):\n \"\"\"Return list of records with dates reformatted for display.\"\"\"\n formatted = []\n for record in self.data:\n entry = dict(record)\n entry['date'] = format_date(record['date'], to_fmt=to_fmt)\n formatted.append(entry)\n return formatted\nREPORTS_EOF\n\n# ── invoices.py ──────────────────────────────────────────────────────────────\ncat > invoices.py \u003c\u003c 'INVOICES_EOF'\n\"\"\"Invoice generation module.\"\"\"\nfrom utils import write_csv, validate_email, format_currency, calculate_total, calculate_discount\n\n\nclass 
InvoiceProcessor:\n \"\"\"Processes and writes invoice data.\"\"\"\n\n def __init__(self, orders):\n self.orders = orders\n\n def apply_discounts(self):\n \"\"\"Apply tier-based discounts to all orders and return updated list.\"\"\"\n result = []\n for order in self.orders:\n price = float(order['amount'])\n tier = order.get('tier', 'standard')\n discounted = calculate_discount(price, tier)\n entry = dict(order)\n entry['discounted_amount'] = discounted\n entry['discounted_display'] = format_currency(discounted)\n result.append(entry)\n return result\n\n def validate_contacts(self):\n \"\"\"Return list of orders with invalid email addresses.\"\"\"\n invalid = []\n for order in self.orders:\n if not validate_email(order.get('email', '')):\n invalid.append(order)\n return invalid\n\n def get_total(self):\n \"\"\"Return total of all order amounts.\"\"\"\n return calculate_total(self.orders, 'amount')\n\n def export(self, filepath):\n \"\"\"Apply discounts and write invoices to CSV.\"\"\"\n invoices = self.apply_discounts()\n write_csv(filepath, invoices)\nINVOICES_EOF\n\n# ── contacts.py ──────────────────────────────────────────────────────────────\ncat > contacts.py \u003c\u003c 'CONTACTS_EOF'\n\"\"\"Contact management module.\"\"\"\nfrom utils import read_json, write_json, validate_email, validate_phone, format_phone\n\n\nclass ContactManager:\n \"\"\"Manages a contact list stored as JSON.\"\"\"\n\n def __init__(self, filepath):\n self.filepath = filepath\n try:\n self.contacts = read_json(filepath)\n except FileNotFoundError:\n self.contacts = []\n\n def add_contact(self, name, email, phone):\n \"\"\"Add a new contact after validation.\"\"\"\n errors = []\n if not validate_email(email):\n errors.append(f\"Invalid email: {email}\")\n if not validate_phone(phone):\n errors.append(f\"Invalid phone: {phone}\")\n if errors:\n return {'success': False, 'errors': errors}\n self.contacts.append({\n 'name': name,\n 'email': email,\n 'phone': format_phone(phone),\n })\n 
return {'success': True, 'errors': []}\n\n def save(self):\n \"\"\"Persist contacts to the JSON file.\"\"\"\n write_json(self.filepath, self.contacts)\n\n def find_by_email(self, email):\n \"\"\"Look up a contact by email address.\"\"\"\n for contact in self.contacts:\n if contact.get('email') == email:\n return contact\n return None\nCONTACTS_EOF\n\n# ── dashboard.py ─────────────────────────────────────────────────────────────\ncat > dashboard.py \u003c\u003c 'DASHBOARD_EOF'\n\"\"\"Dashboard module for at-a-glance metrics.\"\"\"\nfrom utils import read_csv, format_percentage, calculate_average\n\n\nclass Dashboard:\n \"\"\"Provides summary metrics for a dashboard view.\"\"\"\n\n def __init__(self, data_path):\n self.data = read_csv(data_path)\n\n def get_average_score(self):\n \"\"\"Return average score as a formatted percentage.\"\"\"\n avg = calculate_average(self.data, 'score')\n return format_percentage(avg)\n\n def get_average_amount(self):\n \"\"\"Return average order amount as a float.\"\"\"\n return calculate_average(self.data, 'amount')\n\n def get_metrics(self):\n \"\"\"Return a dict of dashboard metrics.\"\"\"\n return {\n 'record_count': len(self.data),\n 'avg_score': self.get_average_score(),\n 'avg_amount': self.get_average_amount(),\n }\nDASHBOARD_EOF\n\n# ── importer.py ──────────────────────────────────────────────────────────────\ncat > importer.py \u003c\u003c 'IMPORTER_EOF'\n\"\"\"Data import module supporting CSV and JSON sources.\"\"\"\nfrom utils import read_csv, read_json, validate_required_fields, validate_date\n\n\nclass DataImporter:\n \"\"\"Imports and validates data from various file formats.\"\"\"\n\n REQUIRED_FIELDS = ['id', 'name', 'date']\n\n def import_csv(self, filepath):\n \"\"\"Import records from a CSV file with validation.\"\"\"\n records = read_csv(filepath)\n return self._validate_records(records)\n\n def import_json(self, filepath):\n \"\"\"Import records from a JSON file with validation.\"\"\"\n records = 
read_json(filepath)\n return self._validate_records(records)\n\n def _validate_records(self, records):\n \"\"\"Validate a list of records and split into valid/invalid.\"\"\"\n valid = []\n invalid = []\n for record in records:\n missing = validate_required_fields(record, self.REQUIRED_FIELDS)\n date_ok = validate_date(record.get('date', ''))\n if missing or not date_ok:\n invalid.append({\n 'record': record,\n 'missing_fields': missing,\n 'date_valid': date_ok,\n })\n else:\n valid.append(record)\n return {'valid': valid, 'invalid': invalid}\nIMPORTER_EOF\n\n# ── exporter.py ──────────────────────────────────────────────────────────────\ncat > exporter.py \u003c\u003c 'EXPORTER_EOF'\n\"\"\"Data export module supporting CSV and JSON outputs.\"\"\"\nfrom utils import write_csv, write_json, format_date, format_currency\n\n\nclass DataExporter:\n \"\"\"Exports data to various formats with optional formatting.\"\"\"\n\n def __init__(self, records):\n self.records = records\n\n def to_csv(self, filepath, format_dates=False, format_amounts=False):\n \"\"\"Export records to CSV, optionally formatting dates and amounts.\"\"\"\n output = self._prepare(format_dates, format_amounts)\n write_csv(filepath, output)\n\n def to_json(self, filepath, format_dates=False, format_amounts=False):\n \"\"\"Export records to JSON, optionally formatting dates and amounts.\"\"\"\n output = self._prepare(format_dates, format_amounts)\n write_json(filepath, output)\n\n def _prepare(self, format_dates, format_amounts):\n \"\"\"Apply formatting transformations to a copy of the records.\"\"\"\n result = []\n for record in self.records:\n entry = dict(record)\n if format_dates and 'date' in entry:\n entry['date'] = format_date(entry['date'])\n if format_amounts and 'amount' in entry:\n entry['amount'] = format_currency(float(entry['amount']))\n result.append(entry)\n return result\nEXPORTER_EOF\n\n# ── analytics.py ─────────────────────────────────────────────────────────────\ncat > analytics.py 
\u003c\u003c 'ANALYTICS_EOF'\n\"\"\"Analytics module for business intelligence calculations.\"\"\"\nfrom utils import read_csv, calculate_total, calculate_average, calculate_discount\n\n\nclass Analytics:\n \"\"\"Provides analytical computations over order data.\"\"\"\n\n def __init__(self, data_path):\n self.data = read_csv(data_path)\n\n def revenue_total(self):\n \"\"\"Return the total revenue across all orders.\"\"\"\n return calculate_total(self.data, 'amount')\n\n def revenue_average(self):\n \"\"\"Return the average revenue per order.\"\"\"\n return calculate_average(self.data, 'amount')\n\n def projected_revenue(self, seasonal=False):\n \"\"\"Calculate projected revenue after applying tier discounts.\"\"\"\n total = 0.0\n for record in self.data:\n price = float(record['amount'])\n tier = record.get('tier', 'standard')\n total += calculate_discount(price, tier, seasonal=seasonal)\n return round(total, 2)\n\n def discount_savings(self, seasonal=False):\n \"\"\"Return the total savings from discounts.\"\"\"\n original = self.revenue_total()\n discounted = self.projected_revenue(seasonal=seasonal)\n return round(original - discounted, 2)\nANALYTICS_EOF\n\n# ── tests/ ───────────────────────────────────────────────────────────────────\nmkdir -p tests\n\ncat > tests/__init__.py \u003c\u003c 'INIT_EOF'\nINIT_EOF\n\ncat > tests/test_io.py \u003c\u003c 'TEST_IO_EOF'\n\"\"\"Tests for IO utility functions.\"\"\"\nimport os\nimport json\nimport tempfile\nimport pytest\nfrom utils import read_csv, write_csv, read_json, write_json\n\n\n@pytest.fixture\ndef tmp_dir():\n \"\"\"Provide a temporary directory for test files.\"\"\"\n with tempfile.TemporaryDirectory() as d:\n yield d\n\n\nclass TestReadCSV:\n def test_reads_sample_data(self):\n \"\"\"read_csv returns a list of dicts from the sample file.\"\"\"\n data = read_csv('data/sample.csv')\n assert isinstance(data, list)\n assert len(data) == 5\n assert data[0]['name'] == 'Alice Johnson'\n\n def 
test_keys_match_header(self):\n \"\"\"Each record should have the expected column keys.\"\"\"\n data = read_csv('data/sample.csv')\n expected_keys = {'id', 'name', 'email', 'phone', 'amount', 'score', 'tier', 'date'}\n assert set(data[0].keys()) == expected_keys\n\n\nclass TestWriteCSV:\n def test_roundtrip(self, tmp_dir):\n \"\"\"Writing then reading CSV should return equivalent data.\"\"\"\n records = [\n {'a': '1', 'b': '2'},\n {'a': '3', 'b': '4'},\n ]\n path = os.path.join(tmp_dir, 'out.csv')\n write_csv(path, records)\n result = read_csv(path)\n assert result == records\n\n def test_empty_data(self, tmp_dir):\n \"\"\"Writing empty list should not create a file (no-op).\"\"\"\n path = os.path.join(tmp_dir, 'empty.csv')\n write_csv(path, [])\n assert not os.path.exists(path)\n\n\nclass TestReadJSON:\n def test_reads_json_file(self, tmp_dir):\n \"\"\"read_json should parse a JSON file into Python objects.\"\"\"\n path = os.path.join(tmp_dir, 'data.json')\n with open(path, 'w') as f:\n json.dump({'key': 'value', 'nums': [1, 2, 3]}, f)\n result = read_json(path)\n assert result == {'key': 'value', 'nums': [1, 2, 3]}\n\n\nclass TestWriteJSON:\n def test_roundtrip(self, tmp_dir):\n \"\"\"Writing then reading JSON should return equivalent data.\"\"\"\n data = {'users': [{'name': 'Alice'}, {'name': 'Bob'}]}\n path = os.path.join(tmp_dir, 'out.json')\n write_json(path, data)\n result = read_json(path)\n assert result == data\n\n def test_custom_indent(self, tmp_dir):\n \"\"\"write_json should respect custom indent parameter.\"\"\"\n path = os.path.join(tmp_dir, 'indented.json')\n write_json(path, {'a': 1}, indent=4)\n with open(path) as f:\n content = f.read()\n assert ' \"a\"' in content\nTEST_IO_EOF\n\ncat > tests/test_validation.py \u003c\u003c 'TEST_VAL_EOF'\n\"\"\"Tests for validation utility functions.\"\"\"\nimport pytest\nfrom utils import validate_email, validate_phone, validate_date, validate_required_fields\n\n\nclass TestValidateEmail:\n 
@pytest.mark.parametrize(\"email\", [\n \"[email protected]\",\n \"[email protected]\",\n \"[email protected]\",\n ])\n def test_valid_emails(self, email):\n assert validate_email(email) is True\n\n @pytest.mark.parametrize(\"email\", [\n \"\",\n \"not-an-email\",\n \"@no-user.com\",\n \"user@\",\n \"[email protected]\",\n ])\n def test_invalid_emails(self, email):\n assert validate_email(email) is False\n\n\nclass TestValidatePhone:\n @pytest.mark.parametrize(\"phone\", [\n \"5551234567\",\n \"555-123-4567\",\n \"(555) 123-4567\",\n \"+15551234567\",\n ])\n def test_valid_phones(self, phone):\n assert validate_phone(phone) is True\n\n @pytest.mark.parametrize(\"phone\", [\n \"\",\n \"123\",\n \"abcdefghij\",\n \"12345\",\n ])\n def test_invalid_phones(self, phone):\n assert validate_phone(phone) is False\n\n\nclass TestValidateDate:\n def test_valid_default_format(self):\n assert validate_date(\"2024-01-15\") is True\n\n def test_invalid_default_format(self):\n assert validate_date(\"01/15/2024\") is False\n\n def test_custom_format(self):\n assert validate_date(\"15/01/2024\", fmt='%d/%m/%Y') is True\n\n def test_invalid_date_value(self):\n assert validate_date(\"2024-13-01\") is False\n\n def test_empty_string(self):\n assert validate_date(\"\") is False\n\n\nclass TestValidateRequiredFields:\n def test_all_present(self):\n record = {'name': 'Alice', 'email': '[email protected]', 'age': '30'}\n assert validate_required_fields(record, ['name', 'email']) == []\n\n def test_missing_field(self):\n record = {'name': 'Alice'}\n result = validate_required_fields(record, ['name', 'email'])\n assert result == ['email']\n\n def test_empty_field(self):\n record = {'name': '', 'email': '[email protected]'}\n result = validate_required_fields(record, ['name', 'email'])\n assert result == ['name']\n\n def test_multiple_missing(self):\n record = {}\n result = validate_required_fields(record, ['a', 'b', 'c'])\n assert result == ['a', 'b', 'c']\nTEST_VAL_EOF\n\ncat > 
tests/test_calculations.py \u003c\u003c 'TEST_CALC_EOF'\n\"\"\"Tests for calculation utility functions.\"\"\"\nimport pytest\nfrom utils import calculate_total, calculate_average, calculate_discount\n\n\nclass TestCalculateTotal:\n def test_basic_sum(self):\n items = [{'amount': '10'}, {'amount': '20'}, {'amount': '30'}]\n assert calculate_total(items) == 60.0\n\n def test_custom_key(self):\n items = [{'price': '5.5'}, {'price': '4.5'}]\n assert calculate_total(items, key='price') == 10.0\n\n def test_single_item(self):\n items = [{'amount': '42'}]\n assert calculate_total(items) == 42.0\n\n\nclass TestCalculateAverage:\n def test_basic_average(self):\n items = [{'amount': '10'}, {'amount': '20'}, {'amount': '30'}]\n assert calculate_average(items) == 20.0\n\n def test_empty_list(self):\n assert calculate_average([]) == 0\n\n def test_single_item(self):\n items = [{'amount': '50'}]\n assert calculate_average(items) == 50.0\n\n def test_custom_key(self):\n items = [{'score': '0.8'}, {'score': '0.6'}]\n assert calculate_average(items, key='score') == 0.7\n\n\nclass TestCalculateDiscount:\n def test_standard_no_discount(self):\n assert calculate_discount(100, 'standard') == 100.0\n\n def test_silver_discount(self):\n assert calculate_discount(100, 'silver') == 95.0\n\n def test_gold_discount(self):\n assert calculate_discount(100, 'gold') == 90.0\n\n def test_platinum_discount(self):\n assert calculate_discount(100, 'platinum') == 85.0\n\n def test_unknown_tier(self):\n assert calculate_discount(100, 'unknown') == 100.0\n\n def test_seasonal_bonus(self):\n assert calculate_discount(100, 'gold', seasonal=True) == 85.0\n\n def test_seasonal_with_standard(self):\n assert calculate_discount(100, 'standard', seasonal=True) == 95.0\n\n def test_rounding(self):\n result = calculate_discount(99.99, 'silver')\n assert result == 94.99\nTEST_CALC_EOF\n\n# ── REFACTOR_PLAN.md ─────────────────────────────────────────────────────────\ncat > REFACTOR_PLAN.md \u003c\u003c 
'PLAN_EOF'\n# Refactor Plan: Split utils.py\n\n## Goal\nSplit the monolithic utils.py into 4 focused modules organized by responsibility.\n\n## New Modules\n1. `io_utils.py` — File I/O functions: read_csv, write_csv, read_json, write_json\n2. `validation.py` — Data validation: validate_email, validate_phone, validate_date, validate_required_fields\n3. `formatting.py` — Display formatting: format_currency, format_date, format_phone, format_percentage\n4. `calculations.py` — Business logic: calculate_total, calculate_average, calculate_discount\n\n## Steps\n1. Create the 4 new module files with functions moved from utils.py\n2. Update all imports in consumer files (app.py, reports.py, invoices.py, contacts.py, dashboard.py, importer.py, exporter.py, analytics.py)\n3. Update test imports\n4. Run tests to verify nothing broke\n5. Delete utils.py\n6. Commit each logical step separately\nPLAN_EOF\n\n# ── Initial commit ───────────────────────────────────────────────────────────\ngit add -A\ngit commit -q -m \"initial: add Python project with monolithic utils.py\"\n","content_type":"application/x-sh; charset=utf-8","language":"bash","size":25816,"content_sha256":"ebff5354dcf658fce4351109712d54504d5774cf8ea9ceb728063263d9b2be21"},{"filename":"tasks/multi-step/repo-refactor/task.yaml","content":"name: \"Execute Multi-File Refactor Plan\"\nid: \"multi-step-repo-refactor\"\nversion: \"1.0\"\nsuite: \"multi-step\"\ndifficulty: \"expert\"\nmode: \"real\"\n\ndescription: |\n Tests the agent's ability to execute a structured refactoring plan across a\n Python codebase. The workspace contains a monolithic utils.py with 15 functions\n across 4 groups (IO, validation, formatting, calculations), 8 consumer files\n that import from it, 3 test files, and a REFACTOR_PLAN.md. The agent must\n split utils.py into 4 focused modules, update all import statements in\n consumer and test files, verify tests still pass, delete the original\n utils.py, and commit each logical step separately. 
Evaluates multi-file\n coordination, import graph understanding, and incremental verification.\n\nuser_message: |\n Read REFACTOR_PLAN.md and execute the refactoring plan it describes.\n Split utils.py into 4 new modules (io_utils.py, validation.py, formatting.py,\n calculations.py), update all imports in consumer files and tests, verify\n tests pass with `python -m pytest tests/ -v`, remove utils.py, and commit\n each logical step separately with descriptive messages.\n\ninput_files: []\n\nexpected_outputs:\n - pattern: \"io_utils.py\"\n required: true\n validators:\n - type: \"file-exists\"\n - type: \"content-contains\"\n sections:\n - \"read_csv\"\n - \"write_csv\"\n - \"read_json\"\n - \"write_json\"\n - pattern: \"validation.py\"\n required: true\n validators:\n - type: \"file-exists\"\n - type: \"content-contains\"\n sections:\n - \"validate_email\"\n - \"validate_phone\"\n - \"validate_date\"\n - \"validate_required_fields\"\n - pattern: \"formatting.py\"\n required: true\n validators:\n - type: \"file-exists\"\n - type: \"content-contains\"\n sections:\n - \"format_currency\"\n - \"format_date\"\n - \"format_phone\"\n - \"format_percentage\"\n - pattern: \"calculations.py\"\n required: true\n validators:\n - type: \"file-exists\"\n - type: \"content-contains\"\n sections:\n - \"calculate_total\"\n - \"calculate_average\"\n - \"calculate_discount\"\n - pattern: \"pytest-output\"\n required: true\n validators:\n - type: \"command-output-contains\"\n command: \"python -m pytest tests/ -v\"\n contains:\n - \"passed\"\n - pattern: \"git-history\"\n required: true\n validators:\n - type: \"command-output-contains\"\n command: \"git log --oneline\"\n contains:\n - \"refactor\"\n\nexpected_metrics:\n tool_calls: [15, 35]\n planning_ratio: [0.10, 0.30]\n\nscoring:\n layer0_weight: 0.15\n layer1_weight: 0.25\n layer2_weight: 0.25\n layer3_weight: 0.35\n","content_type":"application/yaml; 
charset=utf-8","language":"yaml","size":2721,"content_sha256":"838b676250cf17e877fdfda3b474b41c60a312030ed72949a91247ad93d4fae4"},{"filename":"tasks/research/codebase-archaeology/setup.sh","content":"#!/usr/bin/env bash\nset -euo pipefail\n\ncd \"$1\"\n\ngit init\ngit config user.email \"[email protected]\"\ngit config user.name \"Stats Developer\"\n\n# =============================================================================\n# Commit 1: Initial commit — stats module with mean function\n# =============================================================================\nmkdir -p tests\n\ncat > stats.py \u003c\u003c 'PYEOF'\n\"\"\"Statistics module — core statistical functions.\"\"\"\n\n\ndef calculate_mean(data):\n \"\"\"Calculate the arithmetic mean of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_mean requires at least one data point\")\n return sum(data) / len(data)\nPYEOF\n\ncat > tests/__init__.py \u003c\u003c 'PYEOF'\nPYEOF\n\ncat > tests/test_mean.py \u003c\u003c 'PYEOF'\nimport unittest\nfrom stats import calculate_mean\n\n\nclass TestMean(unittest.TestCase):\n def test_mean_integers(self):\n self.assertAlmostEqual(calculate_mean([1, 2, 3, 4, 5]), 3.0)\n\n def test_mean_single_value(self):\n self.assertAlmostEqual(calculate_mean([7]), 7.0)\n\n def test_mean_floats(self):\n self.assertAlmostEqual(calculate_mean([1.5, 2.5, 3.5]), 2.5)\n\n def test_mean_empty_raises(self):\n with self.assertRaises(ValueError):\n calculate_mean([])\n\n\nif __name__ == \"__main__\":\n unittest.main()\nPYEOF\n\ngit add -A\ngit commit -m \"Initial commit: add stats module with mean function\"\n\n# =============================================================================\n# Commit 2: feat: add median calculation (CORRECT implementation)\n# =============================================================================\ncat > stats.py \u003c\u003c 'PYEOF'\n\"\"\"Statistics module — core statistical functions.\"\"\"\n\n\ndef calculate_mean(data):\n 
\"\"\"Calculate the arithmetic mean of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_mean requires at least one data point\")\n return sum(data) / len(data)\n\n\ndef calculate_median(data):\n \"\"\"Calculate the median of a list of numbers.\n\n For odd-length lists, returns the middle element.\n For even-length lists, returns the average of the two middle elements.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_median requires at least one data point\")\n sorted_data = sorted(data)\n n = len(sorted_data)\n mid = n // 2\n if n % 2 == 1:\n return sorted_data[mid]\n else:\n return (sorted_data[mid - 1] + sorted_data[mid]) / 2\nPYEOF\n\ncat > tests/test_median.py \u003c\u003c 'PYEOF'\nimport unittest\nfrom stats import calculate_median\n\n\nclass TestMedian(unittest.TestCase):\n def test_median_odd_list(self):\n self.assertAlmostEqual(calculate_median([3, 1, 2]), 2.0)\n\n def test_median_even_list(self):\n self.assertAlmostEqual(calculate_median([1, 2, 3, 4]), 2.5)\n\n def test_median_single_value(self):\n self.assertAlmostEqual(calculate_median([42]), 42.0)\n\n def test_median_empty_raises(self):\n with self.assertRaises(ValueError):\n calculate_median([])\n\n\nif __name__ == \"__main__\":\n unittest.main()\nPYEOF\n\ngit add -A\ngit commit -m \"feat: add median calculation\"\n\n# =============================================================================\n# Commit 3: feat: add standard deviation\n# =============================================================================\ncat > stats.py \u003c\u003c 'PYEOF'\n\"\"\"Statistics module — core statistical functions.\"\"\"\n\n\ndef calculate_mean(data):\n \"\"\"Calculate the arithmetic mean of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_mean requires at least one data point\")\n return sum(data) / len(data)\n\n\ndef calculate_median(data):\n \"\"\"Calculate the median of a list of numbers.\n\n For odd-length lists, returns the middle element.\n For 
even-length lists, returns the average of the two middle elements.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_median requires at least one data point\")\n sorted_data = sorted(data)\n n = len(sorted_data)\n mid = n // 2\n if n % 2 == 1:\n return sorted_data[mid]\n else:\n return (sorted_data[mid - 1] + sorted_data[mid]) / 2\n\n\ndef calculate_stddev(data):\n \"\"\"Calculate the population standard deviation of a list of numbers.\"\"\"\n if len(data) \u003c 2:\n raise ValueError(\"calculate_stddev requires at least two data points\")\n mean = calculate_mean(data)\n variance = sum((x - mean) ** 2 for x in data) / len(data)\n return variance ** 0.5\nPYEOF\n\ncat > tests/test_stddev.py \u003c\u003c 'PYEOF'\nimport unittest\nfrom stats import calculate_stddev\n\n\nclass TestStddev(unittest.TestCase):\n def test_stddev_basic(self):\n self.assertAlmostEqual(calculate_stddev([2, 4, 4, 4, 5, 5, 7, 9]), 2.0)\n\n def test_stddev_identical_values(self):\n self.assertAlmostEqual(calculate_stddev([5, 5, 5, 5]), 0.0)\n\n def test_stddev_too_few_raises(self):\n with self.assertRaises(ValueError):\n calculate_stddev([1])\n\n\nif __name__ == \"__main__\":\n unittest.main()\nPYEOF\n\ngit add -A\ngit commit -m \"feat: add standard deviation\"\n\n# =============================================================================\n# Commit 4: feat: add mode calculation\n# =============================================================================\ncat > stats.py \u003c\u003c 'PYEOF'\n\"\"\"Statistics module — core statistical functions.\"\"\"\n\n\ndef calculate_mean(data):\n \"\"\"Calculate the arithmetic mean of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_mean requires at least one data point\")\n return sum(data) / len(data)\n\n\ndef calculate_median(data):\n \"\"\"Calculate the median of a list of numbers.\n\n For odd-length lists, returns the middle element.\n For even-length lists, returns the average of the two middle elements.\n \"\"\"\n if 
not data:\n raise ValueError(\"calculate_median requires at least one data point\")\n sorted_data = sorted(data)\n n = len(sorted_data)\n mid = n // 2\n if n % 2 == 1:\n return sorted_data[mid]\n else:\n return (sorted_data[mid - 1] + sorted_data[mid]) / 2\n\n\ndef calculate_stddev(data):\n \"\"\"Calculate the population standard deviation of a list of numbers.\"\"\"\n if len(data) \u003c 2:\n raise ValueError(\"calculate_stddev requires at least two data points\")\n mean = calculate_mean(data)\n variance = sum((x - mean) ** 2 for x in data) / len(data)\n return variance ** 0.5\n\n\ndef calculate_mode(data):\n \"\"\"Return the most frequently occurring value in a list.\n\n If there are multiple modes, returns the smallest one.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_mode requires at least one data point\")\n frequency = {}\n for value in data:\n frequency[value] = frequency.get(value, 0) + 1\n max_count = max(frequency.values())\n modes = [k for k, v in frequency.items() if v == max_count]\n return min(modes)\nPYEOF\n\ncat > tests/test_mode.py \u003c\u003c 'PYEOF'\nimport unittest\nfrom stats import calculate_mode\n\n\nclass TestMode(unittest.TestCase):\n def test_mode_single_mode(self):\n self.assertEqual(calculate_mode([1, 2, 2, 3]), 2)\n\n def test_mode_multiple_modes(self):\n self.assertEqual(calculate_mode([1, 1, 2, 2, 3]), 1)\n\n def test_mode_all_same(self):\n self.assertEqual(calculate_mode([5, 5, 5]), 5)\n\n def test_mode_empty_raises(self):\n with self.assertRaises(ValueError):\n calculate_mode([])\n\n\nif __name__ == \"__main__\":\n unittest.main()\nPYEOF\n\ngit add -A\ngit commit -m \"feat: add mode calculation\"\n\n# =============================================================================\n# Commit 5: feat: add range calculation\n# =============================================================================\ncat > stats.py \u003c\u003c 'PYEOF'\n\"\"\"Statistics module — core statistical functions.\"\"\"\n\n\ndef 
calculate_mean(data):\n \"\"\"Calculate the arithmetic mean of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_mean requires at least one data point\")\n return sum(data) / len(data)\n\n\ndef calculate_median(data):\n \"\"\"Calculate the median of a list of numbers.\n\n For odd-length lists, returns the middle element.\n For even-length lists, returns the average of the two middle elements.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_median requires at least one data point\")\n sorted_data = sorted(data)\n n = len(sorted_data)\n mid = n // 2\n if n % 2 == 1:\n return sorted_data[mid]\n else:\n return (sorted_data[mid - 1] + sorted_data[mid]) / 2\n\n\ndef calculate_stddev(data):\n \"\"\"Calculate the population standard deviation of a list of numbers.\"\"\"\n if len(data) \u003c 2:\n raise ValueError(\"calculate_stddev requires at least two data points\")\n mean = calculate_mean(data)\n variance = sum((x - mean) ** 2 for x in data) / len(data)\n return variance ** 0.5\n\n\ndef calculate_mode(data):\n \"\"\"Return the most frequently occurring value in a list.\n\n If there are multiple modes, returns the smallest one.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_mode requires at least one data point\")\n frequency = {}\n for value in data:\n frequency[value] = frequency.get(value, 0) + 1\n max_count = max(frequency.values())\n modes = [k for k, v in frequency.items() if v == max_count]\n return min(modes)\n\n\ndef calculate_range(data):\n \"\"\"Calculate the range (max - min) of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_range requires at least one data point\")\n return max(data) - min(data)\nPYEOF\n\ncat > tests/test_range.py \u003c\u003c 'PYEOF'\nimport unittest\nfrom stats import calculate_range\n\n\nclass TestRange(unittest.TestCase):\n def test_range_basic(self):\n self.assertEqual(calculate_range([1, 5, 3, 9, 2]), 8)\n\n def test_range_identical(self):\n 
self.assertEqual(calculate_range([4, 4, 4]), 0)\n\n def test_range_negative(self):\n self.assertEqual(calculate_range([-3, -1, -7]), 6)\n\n def test_range_empty_raises(self):\n with self.assertRaises(ValueError):\n calculate_range([])\n\n\nif __name__ == \"__main__\":\n unittest.main()\nPYEOF\n\ngit add -A\ngit commit -m \"feat: add range calculation\"\n\n# =============================================================================\n# Commit 6: docs: add README with usage examples\n# =============================================================================\ncat > README.md \u003c\u003c 'MDEOF'\n# Stats Module\n\nA lightweight Python statistics library providing common statistical functions.\n\n## Usage\n\n```python\nfrom stats import calculate_mean, calculate_median, calculate_stddev\n\ndata = [2, 4, 4, 4, 5, 5, 7, 9]\n\nprint(calculate_mean(data)) # 5.0\nprint(calculate_median(data)) # 4.5\nprint(calculate_stddev(data)) # 2.0\n```\n\n## Available Functions\n\n- `calculate_mean(data)` — Arithmetic mean\n- `calculate_median(data)` — Median (handles odd and even length lists)\n- `calculate_stddev(data)` — Population standard deviation\n- `calculate_mode(data)` — Mode (most frequent value)\n- `calculate_range(data)` — Range (max - min)\n\n## Running Tests\n\n```bash\npython -m pytest tests/\n```\nMDEOF\n\ngit add -A\ngit commit -m \"docs: add README with usage examples\"\n\n# =============================================================================\n# Commit 7: chore: add requirements.txt\n# =============================================================================\ncat > requirements.txt \u003c\u003c 'EOF'\npytest>=7.0\nEOF\n\ngit add -A\ngit commit -m \"chore: add requirements.txt\"\n\n# =============================================================================\n# Commit 8: refactor: optimize median — INTRODUCES BUG (off-by-one)\n# =============================================================================\n# The \"optimization\" replaces the 
correct even-case formula:\n# (sorted_data[mid - 1] + sorted_data[mid]) / 2\n# with an INCORRECT formula:\n# (sorted_data[mid] + sorted_data[mid + 1]) / 2\n#\n# For [1,2,3,4]: n=4, mid=2\n# Correct: (sorted_data[1] + sorted_data[2]) / 2 = (2+3)/2 = 2.5\n# Buggy: (sorted_data[2] + sorted_data[3]) / 2 = (3+4)/2 = 3.5\n# =============================================================================\ncat > stats.py \u003c\u003c 'PYEOF'\n\"\"\"Statistics module — core statistical functions.\"\"\"\n\n\ndef calculate_mean(data):\n \"\"\"Calculate the arithmetic mean of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_mean requires at least one data point\")\n return sum(data) / len(data)\n\n\ndef calculate_median(data):\n \"\"\"Calculate the median of a list of numbers.\n\n For odd-length lists, returns the middle element.\n For even-length lists, returns the average of the two middle elements.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_median requires at least one data point\")\n sorted_data = sorted(data)\n n = len(sorted_data)\n mid = n // 2\n if n % 2 == 1:\n return sorted_data[mid]\n else:\n return (sorted_data[mid] + sorted_data[mid + 1]) / 2\n\n\ndef calculate_stddev(data):\n \"\"\"Calculate the population standard deviation of a list of numbers.\"\"\"\n if len(data) \u003c 2:\n raise ValueError(\"calculate_stddev requires at least two data points\")\n mean = calculate_mean(data)\n variance = sum((x - mean) ** 2 for x in data) / len(data)\n return variance ** 0.5\n\n\ndef calculate_mode(data):\n \"\"\"Return the most frequently occurring value in a list.\n\n If there are multiple modes, returns the smallest one.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_mode requires at least one data point\")\n frequency = {}\n for value in data:\n frequency[value] = frequency.get(value, 0) + 1\n max_count = max(frequency.values())\n modes = [k for k, v in frequency.items() if v == max_count]\n return min(modes)\n\n\ndef 
calculate_range(data):\n \"\"\"Calculate the range (max - min) of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_range requires at least one data point\")\n return max(data) - min(data)\nPYEOF\n\ngit add -A\ngit commit -m \"refactor: optimize median calculation for large datasets\"\n\n# =============================================================================\n# Commit 9: feat: add percentile calculation\n# =============================================================================\ncat > stats.py \u003c\u003c 'PYEOF'\n\"\"\"Statistics module — core statistical functions.\"\"\"\n\n\ndef calculate_mean(data):\n \"\"\"Calculate the arithmetic mean of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_mean requires at least one data point\")\n return sum(data) / len(data)\n\n\ndef calculate_median(data):\n \"\"\"Calculate the median of a list of numbers.\n\n For odd-length lists, returns the middle element.\n For even-length lists, returns the average of the two middle elements.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_median requires at least one data point\")\n sorted_data = sorted(data)\n n = len(sorted_data)\n mid = n // 2\n if n % 2 == 1:\n return sorted_data[mid]\n else:\n return (sorted_data[mid] + sorted_data[mid + 1]) / 2\n\n\ndef calculate_stddev(data):\n \"\"\"Calculate the population standard deviation of a list of numbers.\"\"\"\n if len(data) \u003c 2:\n raise ValueError(\"calculate_stddev requires at least two data points\")\n mean = calculate_mean(data)\n variance = sum((x - mean) ** 2 for x in data) / len(data)\n return variance ** 0.5\n\n\ndef calculate_mode(data):\n \"\"\"Return the most frequently occurring value in a list.\n\n If there are multiple modes, returns the smallest one.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_mode requires at least one data point\")\n frequency = {}\n for value in data:\n frequency[value] = frequency.get(value, 0) + 1\n max_count = 
max(frequency.values())\n modes = [k for k, v in frequency.items() if v == max_count]\n return min(modes)\n\n\ndef calculate_range(data):\n \"\"\"Calculate the range (max - min) of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_range requires at least one data point\")\n return max(data) - min(data)\n\n\ndef calculate_percentile(data, p):\n \"\"\"Calculate the p-th percentile of a list of numbers.\n\n Uses linear interpolation between closest ranks.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_percentile requires at least one data point\")\n if not 0 \u003c= p \u003c= 100:\n raise ValueError(\"Percentile must be between 0 and 100\")\n sorted_data = sorted(data)\n n = len(sorted_data)\n if n == 1:\n return sorted_data[0]\n rank = (p / 100) * (n - 1)\n lower = int(rank)\n upper = lower + 1\n if upper >= n:\n return sorted_data[-1]\n weight = rank - lower\n return sorted_data[lower] * (1 - weight) + sorted_data[upper] * weight\nPYEOF\n\ncat > tests/test_percentile.py \u003c\u003c 'PYEOF'\nimport unittest\nfrom stats import calculate_percentile\n\n\nclass TestPercentile(unittest.TestCase):\n def test_percentile_50(self):\n self.assertAlmostEqual(calculate_percentile([1, 2, 3, 4, 5], 50), 3.0)\n\n def test_percentile_25(self):\n self.assertAlmostEqual(calculate_percentile([1, 2, 3, 4, 5], 25), 2.0)\n\n def test_percentile_0(self):\n self.assertAlmostEqual(calculate_percentile([1, 2, 3, 4, 5], 0), 1.0)\n\n def test_percentile_100(self):\n self.assertAlmostEqual(calculate_percentile([1, 2, 3, 4, 5], 100), 5.0)\n\n def test_percentile_invalid_raises(self):\n with self.assertRaises(ValueError):\n calculate_percentile([1, 2, 3], 101)\n\n\nif __name__ == \"__main__\":\n unittest.main()\nPYEOF\n\ngit add -A\ngit commit -m \"feat: add percentile calculation\"\n\n# =============================================================================\n# Commit 10: test: skip flaky median test\n# 
=============================================================================\ncat > tests/test_median.py \u003c\u003c 'PYEOF'\nimport unittest\nfrom stats import calculate_median\n\n\nclass TestMedian(unittest.TestCase):\n def test_median_odd_list(self):\n self.assertAlmostEqual(calculate_median([3, 1, 2]), 2.0)\n\n @unittest.skip(\"flaky — investigate later\")\n def test_median_even_list(self):\n self.assertAlmostEqual(calculate_median([1, 2, 3, 4]), 2.5)\n\n def test_median_single_value(self):\n self.assertAlmostEqual(calculate_median([42]), 42.0)\n\n def test_median_empty_raises(self):\n with self.assertRaises(ValueError):\n calculate_median([])\n\n\nif __name__ == \"__main__\":\n unittest.main()\nPYEOF\n\ngit add -A\ngit commit -m 'test: skip flaky median test'\n\n# =============================================================================\n# Commit 11: feat: add variance calculation\n# =============================================================================\ncat > stats.py \u003c\u003c 'PYEOF'\n\"\"\"Statistics module — core statistical functions.\"\"\"\n\n\ndef calculate_mean(data):\n \"\"\"Calculate the arithmetic mean of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_mean requires at least one data point\")\n return sum(data) / len(data)\n\n\ndef calculate_median(data):\n \"\"\"Calculate the median of a list of numbers.\n\n For odd-length lists, returns the middle element.\n For even-length lists, returns the average of the two middle elements.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_median requires at least one data point\")\n sorted_data = sorted(data)\n n = len(sorted_data)\n mid = n // 2\n if n % 2 == 1:\n return sorted_data[mid]\n else:\n return (sorted_data[mid] + sorted_data[mid + 1]) / 2\n\n\ndef calculate_stddev(data):\n \"\"\"Calculate the population standard deviation of a list of numbers.\"\"\"\n if len(data) \u003c 2:\n raise ValueError(\"calculate_stddev requires at least two data points\")\n 
mean = calculate_mean(data)\n variance = sum((x - mean) ** 2 for x in data) / len(data)\n return variance ** 0.5\n\n\ndef calculate_mode(data):\n \"\"\"Return the most frequently occurring value in a list.\n\n If there are multiple modes, returns the smallest one.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_mode requires at least one data point\")\n frequency = {}\n for value in data:\n frequency[value] = frequency.get(value, 0) + 1\n max_count = max(frequency.values())\n modes = [k for k, v in frequency.items() if v == max_count]\n return min(modes)\n\n\ndef calculate_range(data):\n \"\"\"Calculate the range (max - min) of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_range requires at least one data point\")\n return max(data) - min(data)\n\n\ndef calculate_percentile(data, p):\n \"\"\"Calculate the p-th percentile of a list of numbers.\n\n Uses linear interpolation between closest ranks.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_percentile requires at least one data point\")\n if not 0 \u003c= p \u003c= 100:\n raise ValueError(\"Percentile must be between 0 and 100\")\n sorted_data = sorted(data)\n n = len(sorted_data)\n if n == 1:\n return sorted_data[0]\n rank = (p / 100) * (n - 1)\n lower = int(rank)\n upper = lower + 1\n if upper >= n:\n return sorted_data[-1]\n weight = rank - lower\n return sorted_data[lower] * (1 - weight) + sorted_data[upper] * weight\n\n\ndef calculate_variance(data):\n \"\"\"Calculate the population variance of a list of numbers.\"\"\"\n if len(data) \u003c 2:\n raise ValueError(\"calculate_variance requires at least two data points\")\n mean = calculate_mean(data)\n return sum((x - mean) ** 2 for x in data) / len(data)\nPYEOF\n\ncat > tests/test_variance.py \u003c\u003c 'PYEOF'\nimport unittest\nfrom stats import calculate_variance\n\n\nclass TestVariance(unittest.TestCase):\n def test_variance_basic(self):\n self.assertAlmostEqual(calculate_variance([2, 4, 4, 4, 5, 5, 7, 9]), 4.0)\n\n 
def test_variance_identical(self):\n self.assertAlmostEqual(calculate_variance([3, 3, 3, 3]), 0.0)\n\n def test_variance_too_few_raises(self):\n with self.assertRaises(ValueError):\n calculate_variance([1])\n\n\nif __name__ == \"__main__\":\n unittest.main()\nPYEOF\n\ngit add -A\ngit commit -m \"feat: add variance calculation\"\n\n# =============================================================================\n# Commit 12: docs: update README with new functions\n# =============================================================================\ncat > README.md \u003c\u003c 'MDEOF'\n# Stats Module\n\nA lightweight Python statistics library providing common statistical functions.\n\n## Usage\n\n```python\nfrom stats import (\n calculate_mean, calculate_median, calculate_stddev,\n calculate_mode, calculate_range, calculate_percentile,\n calculate_variance,\n)\n\ndata = [2, 4, 4, 4, 5, 5, 7, 9]\n\nprint(calculate_mean(data)) # 5.0\nprint(calculate_median(data)) # 4.5\nprint(calculate_stddev(data)) # 2.0\nprint(calculate_mode(data)) # 4\nprint(calculate_range(data)) # 7\nprint(calculate_percentile(data, 75)) # 5.5\nprint(calculate_variance(data)) # 4.0\n```\n\n## Available Functions\n\n- `calculate_mean(data)` — Arithmetic mean\n- `calculate_median(data)` — Median (handles odd and even length lists)\n- `calculate_stddev(data)` — Population standard deviation\n- `calculate_mode(data)` — Mode (most frequent value)\n- `calculate_range(data)` — Range (max - min)\n- `calculate_percentile(data, p)` — p-th percentile with interpolation\n- `calculate_variance(data)` — Population variance\n\n## Running Tests\n\n```bash\npython -m pytest tests/\n```\nMDEOF\n\ngit add -A\ngit commit -m \"docs: update README with new functions\"\n\n# =============================================================================\n# Commit 13: feat: add weighted mean\n# =============================================================================\ncat > stats.py \u003c\u003c 'PYEOF'\n\"\"\"Statistics 
module — core statistical functions.\"\"\"\n\n\ndef calculate_mean(data):\n \"\"\"Calculate the arithmetic mean of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_mean requires at least one data point\")\n return sum(data) / len(data)\n\n\ndef calculate_median(data):\n \"\"\"Calculate the median of a list of numbers.\n\n For odd-length lists, returns the middle element.\n For even-length lists, returns the average of the two middle elements.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_median requires at least one data point\")\n sorted_data = sorted(data)\n n = len(sorted_data)\n mid = n // 2\n if n % 2 == 1:\n return sorted_data[mid]\n else:\n return (sorted_data[mid] + sorted_data[mid + 1]) / 2\n\n\ndef calculate_stddev(data):\n \"\"\"Calculate the population standard deviation of a list of numbers.\"\"\"\n if len(data) \u003c 2:\n raise ValueError(\"calculate_stddev requires at least two data points\")\n mean = calculate_mean(data)\n variance = sum((x - mean) ** 2 for x in data) / len(data)\n return variance ** 0.5\n\n\ndef calculate_mode(data):\n \"\"\"Return the most frequently occurring value in a list.\n\n If there are multiple modes, returns the smallest one.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_mode requires at least one data point\")\n frequency = {}\n for value in data:\n frequency[value] = frequency.get(value, 0) + 1\n max_count = max(frequency.values())\n modes = [k for k, v in frequency.items() if v == max_count]\n return min(modes)\n\n\ndef calculate_range(data):\n \"\"\"Calculate the range (max - min) of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_range requires at least one data point\")\n return max(data) - min(data)\n\n\ndef calculate_percentile(data, p):\n \"\"\"Calculate the p-th percentile of a list of numbers.\n\n Uses linear interpolation between closest ranks.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_percentile requires at least one data point\")\n 
if not 0 \u003c= p \u003c= 100:\n raise ValueError(\"Percentile must be between 0 and 100\")\n sorted_data = sorted(data)\n n = len(sorted_data)\n if n == 1:\n return sorted_data[0]\n rank = (p / 100) * (n - 1)\n lower = int(rank)\n upper = lower + 1\n if upper >= n:\n return sorted_data[-1]\n weight = rank - lower\n return sorted_data[lower] * (1 - weight) + sorted_data[upper] * weight\n\n\ndef calculate_variance(data):\n \"\"\"Calculate the population variance of a list of numbers.\"\"\"\n if len(data) \u003c 2:\n raise ValueError(\"calculate_variance requires at least two data points\")\n mean = calculate_mean(data)\n return sum((x - mean) ** 2 for x in data) / len(data)\n\n\ndef calculate_weighted_mean(data, weights):\n \"\"\"Calculate the weighted arithmetic mean.\n\n Args:\n data: list of numeric values\n weights: list of weights (must be same length as data)\n\n Returns:\n Weighted mean as a float.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_weighted_mean requires at least one data point\")\n if len(data) != len(weights):\n raise ValueError(\"data and weights must have the same length\")\n if sum(weights) == 0:\n raise ValueError(\"sum of weights must be non-zero\")\n return sum(d * w for d, w in zip(data, weights)) / sum(weights)\nPYEOF\n\ncat > tests/test_weighted_mean.py \u003c\u003c 'PYEOF'\nimport unittest\nfrom stats import calculate_weighted_mean\n\n\nclass TestWeightedMean(unittest.TestCase):\n def test_weighted_mean_equal_weights(self):\n self.assertAlmostEqual(\n calculate_weighted_mean([1, 2, 3], [1, 1, 1]), 2.0\n )\n\n def test_weighted_mean_different_weights(self):\n self.assertAlmostEqual(\n calculate_weighted_mean([1, 2, 3], [3, 2, 1]), 1.6666666666666667\n )\n\n def test_weighted_mean_mismatched_lengths_raises(self):\n with self.assertRaises(ValueError):\n calculate_weighted_mean([1, 2], [1])\n\n def test_weighted_mean_zero_weights_raises(self):\n with self.assertRaises(ValueError):\n calculate_weighted_mean([1, 2], [0, 0])\n\n\nif 
__name__ == \"__main__\":\n unittest.main()\nPYEOF\n\ngit add -A\ngit commit -m \"feat: add weighted mean\"\n\n# =============================================================================\n# Commit 14: chore: add .gitignore\n# =============================================================================\ncat > .gitignore \u003c\u003c 'EOF'\n__pycache__/\n*.pyc\n*.pyo\n.pytest_cache/\n*.egg-info/\ndist/\nbuild/\nEOF\n\ngit add -A\ngit commit -m \"chore: add .gitignore\"\n\n# =============================================================================\n# Commit 15: feat: add correlation coefficient\n# =============================================================================\ncat > stats.py \u003c\u003c 'PYEOF'\n\"\"\"Statistics module — core statistical functions.\"\"\"\n\n\ndef calculate_mean(data):\n \"\"\"Calculate the arithmetic mean of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_mean requires at least one data point\")\n return sum(data) / len(data)\n\n\ndef calculate_median(data):\n \"\"\"Calculate the median of a list of numbers.\n\n For odd-length lists, returns the middle element.\n For even-length lists, returns the average of the two middle elements.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_median requires at least one data point\")\n sorted_data = sorted(data)\n n = len(sorted_data)\n mid = n // 2\n if n % 2 == 1:\n return sorted_data[mid]\n else:\n return (sorted_data[mid] + sorted_data[mid + 1]) / 2\n\n\ndef calculate_stddev(data):\n \"\"\"Calculate the population standard deviation of a list of numbers.\"\"\"\n if len(data) \u003c 2:\n raise ValueError(\"calculate_stddev requires at least two data points\")\n mean = calculate_mean(data)\n variance = sum((x - mean) ** 2 for x in data) / len(data)\n return variance ** 0.5\n\n\ndef calculate_mode(data):\n \"\"\"Return the most frequently occurring value in a list.\n\n If there are multiple modes, returns the smallest one.\n \"\"\"\n if not data:\n raise 
ValueError(\"calculate_mode requires at least one data point\")\n frequency = {}\n for value in data:\n frequency[value] = frequency.get(value, 0) + 1\n max_count = max(frequency.values())\n modes = [k for k, v in frequency.items() if v == max_count]\n return min(modes)\n\n\ndef calculate_range(data):\n \"\"\"Calculate the range (max - min) of a list of numbers.\"\"\"\n if not data:\n raise ValueError(\"calculate_range requires at least one data point\")\n return max(data) - min(data)\n\n\ndef calculate_percentile(data, p):\n \"\"\"Calculate the p-th percentile of a list of numbers.\n\n Uses linear interpolation between closest ranks.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_percentile requires at least one data point\")\n if not 0 \u003c= p \u003c= 100:\n raise ValueError(\"Percentile must be between 0 and 100\")\n sorted_data = sorted(data)\n n = len(sorted_data)\n if n == 1:\n return sorted_data[0]\n rank = (p / 100) * (n - 1)\n lower = int(rank)\n upper = lower + 1\n if upper >= n:\n return sorted_data[-1]\n weight = rank - lower\n return sorted_data[lower] * (1 - weight) + sorted_data[upper] * weight\n\n\ndef calculate_variance(data):\n \"\"\"Calculate the population variance of a list of numbers.\"\"\"\n if len(data) \u003c 2:\n raise ValueError(\"calculate_variance requires at least two data points\")\n mean = calculate_mean(data)\n return sum((x - mean) ** 2 for x in data) / len(data)\n\n\ndef calculate_weighted_mean(data, weights):\n \"\"\"Calculate the weighted arithmetic mean.\n\n Args:\n data: list of numeric values\n weights: list of weights (must be same length as data)\n\n Returns:\n Weighted mean as a float.\n \"\"\"\n if not data:\n raise ValueError(\"calculate_weighted_mean requires at least one data point\")\n if len(data) != len(weights):\n raise ValueError(\"data and weights must have the same length\")\n if sum(weights) == 0:\n raise ValueError(\"sum of weights must be non-zero\")\n return sum(d * w for d, w in zip(data, 
weights)) / sum(weights)\n\n\ndef calculate_correlation(x, y):\n \"\"\"Calculate the Pearson correlation coefficient between two lists.\n\n Args:\n x: first list of numeric values\n y: second list of numeric values (same length as x)\n\n Returns:\n Pearson correlation coefficient as a float in [-1, 1].\n \"\"\"\n if len(x) != len(y):\n raise ValueError(\"x and y must have the same length\")\n if len(x) \u003c 2:\n raise ValueError(\"calculate_correlation requires at least two data points\")\n mean_x = calculate_mean(x)\n mean_y = calculate_mean(y)\n numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))\n denom_x = sum((xi - mean_x) ** 2 for xi in x) ** 0.5\n denom_y = sum((yi - mean_y) ** 2 for yi in y) ** 0.5\n if denom_x == 0 or denom_y == 0:\n raise ValueError(\"correlation is undefined when standard deviation is zero\")\n return numerator / (denom_x * denom_y)\nPYEOF\n\ncat > tests/test_correlation.py \u003c\u003c 'PYEOF'\nimport unittest\nfrom stats import calculate_correlation\n\n\nclass TestCorrelation(unittest.TestCase):\n def test_perfect_positive(self):\n self.assertAlmostEqual(\n calculate_correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]), 1.0\n )\n\n def test_perfect_negative(self):\n self.assertAlmostEqual(\n calculate_correlation([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]), -1.0\n )\n\n def test_no_correlation(self):\n r = calculate_correlation([1, 2, 3, 4, 5], [2, 4, 1, 5, 3])\n self.assertTrue(-1 \u003c= r \u003c= 1)\n\n def test_mismatched_lengths_raises(self):\n with self.assertRaises(ValueError):\n calculate_correlation([1, 2], [1])\n\n\nif __name__ == \"__main__\":\n unittest.main()\nPYEOF\n\ngit add -A\ngit commit -m \"feat: add correlation coefficient\"\n","content_type":"application/x-sh; charset=utf-8","language":"bash","size":33805,"content_sha256":"b5dd044e2be79784a8979d13db3cd94e483594f55704711f684aa421523015f7"},{"filename":"tasks/research/codebase-archaeology/task.yaml","content":"name: \"Investigate Skipped Test via Git 
Archaeology\"\nid: \"research-codebase-archaeology\"\nversion: \"1.0\"\nsuite: \"research\"\ndifficulty: \"hard\"\nmode: \"real\"\n\ndescription: |\n Tests the agent's ability to investigate a codebase mystery using git\n history. A test has been skipped with a \"flaky\" annotation, but the real\n cause is a regression introduced in an earlier commit. The agent must use\n git log, git blame, git diff, and other archaeology tools to trace the\n problem back to its origin, identify the exact bug, and write a clear\n investigation report. A setup.sh script builds a 15-commit git repo with\n a real Python stats module before the task runner begins.\n\nuser_message: |\n This repository contains a Python statistics module. One of the test files\n has a test that has been skipped with the annotation \"flaky — investigate\n later\". Your job is to investigate:\n\n 1. Find the skipped test and understand what it is testing.\n 2. Remove the skip decorator and run the test to confirm it actually fails.\n 3. Use git log, git blame, git diff, and other git archaeology techniques\n to identify which commit introduced the regression that caused this\n test to start failing.\n 4. Determine exactly what the bug is and why the test fails.\n 5. 
Write an investigation-report.md with the following sections:\n - **Skipped Test**: Which test is skipped and in which file\n - **Bug-Introducing Commit**: The full commit hash and message\n - **Root Cause**: A clear explanation of the bug (what changed, why\n it is wrong, what the correct behavior should be)\n - **Recommended Fix**: The exact code change needed to fix the bug\n\n Do NOT actually fix the bug in stats.py — only write the report.\n\ninput_files: []\n\nexpected_outputs:\n - pattern: \"investigation-report.md\"\n required: true\n validators:\n - type: \"file-exists\"\n - type: \"content-contains\"\n sections:\n - \"test_median_even\"\n - \"commit\"\n - \"fix\"\n - type: \"word-count-range\"\n min: 150\n max: 800\n\nexpected_metrics:\n tool_calls: [8, 20]\n planning_ratio: [0.15, 0.40]\n\nscoring:\n layer0_weight: 0.15\n layer1_weight: 0.25\n layer2_weight: 0.25\n layer3_weight: 0.35\n","content_type":"application/yaml; charset=utf-8","language":"yaml","size":2232,"content_sha256":"8c0ea2e7e8dfc95c9d3d9d825af8e1622f23f8529acf6d103d9b6f207aa5a3ed"},{"filename":"tasks/research/compare-technologies/inputs/tech-a.txt","content":"PostgreSQL — Technology Overview\n\nPostgreSQL is an advanced, enterprise-class open-source relational database management system (RDBMS) that has been in active development for over 35 years. Originally developed at the University of California, Berkeley, as the POSTGRES project in 1986, it has grown into one of the most feature-rich and standards-compliant database systems available.\n\nCORE ARCHITECTURE\nPostgreSQL follows a traditional client-server architecture with a multi-process model. Each client connection is handled by a separate server process (backend), coordinated by a supervisor process (postmaster). 
It uses a shared memory architecture with write-ahead logging (WAL) for crash recovery and data integrity.\n\nThe storage engine uses a sophisticated MVCC (Multi-Version Concurrency Control) implementation that allows readers to never block writers and vice versa. This makes PostgreSQL particularly well-suited for read-heavy workloads and environments where concurrent access is critical.\n\nKEY FEATURES\n- Full ACID compliance with robust transaction support, including savepoints and two-phase commit\n- Rich SQL support covering window functions, CTEs (Common Table Expressions), lateral joins, and recursive queries\n- Advanced data types including arrays, hstore (key-value), JSON/JSONB, geometric types, network address types, and range types\n- Full-text search capabilities with support for multiple languages, custom dictionaries, and ranking algorithms\n- Extensibility through custom functions in multiple languages (PL/pgSQL, PL/Python, PL/Perl, PL/V8), custom types, and custom operators\n- Sophisticated indexing options including B-tree, Hash, GiST, SP-GiST, GIN, and BRIN indexes\n- Table partitioning (range, list, and hash) for managing very large tables\n- Logical and streaming replication for high availability\n- Row-level security policies for fine-grained access control\n- Foreign data wrappers for querying external data sources as if they were local tables\n\nSTRENGTHS\nPostgreSQL excels in scenarios requiring complex queries, data integrity, and standards compliance. Its query optimizer is considered among the best in open-source databases, capable of efficiently executing multi-join queries with hundreds of millions of rows. The JSONB data type provides document-store-like flexibility within a relational framework, allowing teams to handle semi-structured data without abandoning relational guarantees.\n\nThe extension ecosystem is a major strength. PostGIS adds world-class geospatial capabilities. TimescaleDB extends PostgreSQL for time-series workloads. 
pg_vector enables vector similarity search for AI/ML applications. Citus enables horizontal scaling for multi-tenant and real-time analytics workloads.\n\nUSE CASES\nPostgreSQL is widely used in financial systems (where transaction integrity is non-negotiable), geospatial applications (leveraging PostGIS), data warehousing and analytics, web application backends, and increasingly in AI/ML pipelines for vector storage and retrieval. Major users include Apple, Instagram, Spotify, Reddit, and the International Space Station's ground control systems.\n\nLIMITATIONS\nPostgreSQL's multi-process architecture consumes more memory per connection compared to thread-based alternatives. Write-heavy workloads with very high concurrency can encounter bottlenecks with the MVCC vacuum process. Horizontal scaling (sharding) requires extensions like Citus rather than being built into the core. The learning curve for advanced features like query optimization and configuration tuning can be steep.\n","content_type":"text/plain; charset=utf-8","language":null,"size":3587,"content_sha256":"9724c2a744f91f6eb05b86684d4e430faf592f6429c541dfe2dce538fed4c434"},{"filename":"tasks/research/compare-technologies/inputs/tech-b.txt","content":"MongoDB — Technology Overview\n\nMongoDB is a general-purpose, document-oriented NoSQL database that stores data in flexible, JSON-like documents (BSON format). First released in 2009 by MongoDB Inc. (formerly 10gen), it has become the most widely adopted NoSQL database, with an estimated 100 million+ downloads and deployments across organizations ranging from startups to Fortune 500 companies.\n\nCORE ARCHITECTURE\nMongoDB uses a distributed architecture built around the concept of replica sets and sharding. A replica set is a group of MongoDB instances that maintain the same data set, providing redundancy and automatic failover. 
Sharding distributes data across multiple servers (shards) to support horizontal scaling, with a config server tracking the distribution and mongos routers directing queries to the appropriate shard.\n\nData is stored in collections (analogous to tables) as BSON documents. Unlike relational databases, documents within a single collection can have different structures — a property known as schema flexibility. This means fields can vary from document to document, and data structures can evolve without requiring schema migrations.\n\nKEY FEATURES\n- Flexible document model with dynamic schemas — no need for upfront schema design or migrations\n- Native horizontal scaling through built-in sharding with automatic balancing\n- Rich query language supporting field queries, range queries, regular expression searches, and aggregation pipelines\n- Aggregation framework providing powerful data processing and transformation capabilities (similar to SQL GROUP BY but more flexible)\n- Built-in replication with automatic failover for high availability\n- GridFS for storing and retrieving files exceeding the 16MB BSON document size limit\n- Change streams for real-time data change notifications\n- Atlas Search powered by Apache Lucene for full-text search within the database\n- Time-series collections optimized for IoT and event data\n- Multi-document ACID transactions (added in version 4.0, improved through 7.0)\n- Queryable encryption allowing queries on encrypted data without server-side decryption\n\nSTRENGTHS\nMongoDB's primary strength is developer productivity. The document model maps naturally to objects in application code, eliminating the \"impedance mismatch\" between application objects and database rows that characterizes ORM usage with relational databases. 
Schema flexibility means development teams can iterate rapidly, modifying data structures as application requirements evolve without costly migration processes.\n\nHorizontal scalability is a built-in, first-class feature. Organizations can start with a single server and scale to hundreds of nodes without application-level changes. The sharding architecture handles data distribution, query routing, and rebalancing automatically. This makes MongoDB particularly strong for applications expecting rapid growth or unpredictable workload patterns.\n\nMongoDB Atlas, the fully managed cloud database service, provides additional capabilities including automated backups, point-in-time recovery, performance advisors, global clusters with data locality controls, and serverless instances. Atlas simplifies operations significantly, handling infrastructure management, security patching, and scaling decisions.\n\nUSE CASES\nMongoDB is widely deployed for content management systems, product catalogs (where product attributes vary widely), real-time analytics, mobile application backends, IoT data platforms, and gaming leaderboards and player profiles. Its flexibility makes it popular for projects in early stages where the data model is still evolving. Major users include Toyota, Forbes, Cisco, eBay, and the City of Chicago's open data platform.\n\nLIMITATIONS\nMongoDB's flexibility can become a liability without discipline. Without enforced schemas, data inconsistency can creep into production systems over time. While multi-document ACID transactions are supported, they carry performance overhead and are not as mature as transaction support in established relational databases. Complex joins across collections are less efficient than in relational systems — MongoDB encourages denormalization instead, which can lead to data duplication. Storage efficiency can be lower than relational alternatives due to field name repetition across documents. 
The aggregation pipeline, while powerful, has a steeper learning curve than SQL for developers accustomed to relational databases.\n","content_type":"text/plain; charset=utf-8","language":null,"size":4487,"content_sha256":"a314a38fe554cc1b0f6aa66bbae90cc725d14732160e467b9fbf249bf33bded6"},{"filename":"tasks/research/compare-technologies/task.yaml","content":"name: \"Compare Database Technologies\"\nid: \"research-compare-technologies\"\nversion: \"1.0\"\nsuite: \"research\"\ndifficulty: \"medium\"\nmode: \"sandboxed\"\n\ndescription: |\n Tests the agent's ability to read two separate technology descriptions\n and produce a structured comparison. Evaluates analytical thinking,\n balanced assessment, and the ability to synthesize information from\n multiple sources into actionable recommendations.\n\nuser_message: |\n Compare these two database technologies described in tech-a.txt and\n tech-b.txt. Create comparison.md with:\n - Structured pros and cons for each technology\n - A comparison table covering key dimensions\n - Use case recommendations (when to choose each)\n - Your final recommendation with reasoning\n\ninput_files:\n - name: \"tech-a.txt\"\n - name: \"tech-b.txt\"\n\nexpected_outputs:\n - pattern: \"comparison.md\"\n required: true\n validators:\n - type: \"file-exists\"\n - type: \"content-contains\"\n sections:\n - \"pros\"\n - \"cons\"\n - \"recommendation\"\n - type: \"word-count-range\"\n min: 300\n max: 1000\n\nexpected_metrics:\n tool_calls: [3, 10]\n planning_ratio: [0.1, 0.4]\n\nscoring:\n layer0_weight: 0.15\n layer1_weight: 0.25\n layer2_weight: 0.25\n layer3_weight: 0.35\n","content_type":"application/yaml; charset=utf-8","language":"yaml","size":1272,"content_sha256":"ede582f3f9c0e8a396b8d3c067db04161565668efd78b9de00b9666a8d2c44d9"},{"filename":"tasks/research/extract-structured-data/inputs/meeting-transcript.txt","content":"Project Phoenix — Weekly Standup Meeting\nDate: Monday, February 10, 2026, 10:00 AM EST\nLocation: Virtual 
(Zoom)\n\n[Meeting recording started at 10:02 AM]\n\nSarah Chen: Alright, let's get started. I know we're waiting on Marcus but he pinged me that he'll be a couple minutes late. Let me just kick things off. Thanks everyone for joining — I see we have myself, David Park, Lisa Nakamura, Tom Bradley, Priya Sharma, and Kenji Watanabe. Marcus Rivera should be joining shortly.\n\nSarah Chen: So first item — the Q1 roadmap. David, where are we on the API redesign?\n\nDavid Park: Yeah so we wrapped up the initial spec last Thursday. The good news is that the new REST endpoints are backward compatible. I did find a couple edge cases in the authentication flow that need attention, but nothing blocking. I think we should target having the implementation done by March 1st, but I'll need Lisa's team to review the spec before we start coding.\n\nLisa Nakamura: We can do that. My team has bandwidth this week. Can you share the spec doc in the shared drive today?\n\nDavid Park: Absolutely, I'll get that uploaded by end of day.\n\nSarah Chen: Great. So Lisa, your team will review David's API spec this week. David, please get it uploaded today. Now let's talk about the customer dashboard redesign.\n\nPriya Sharma: I've been working with the design team on the mockups. We presented three options to the stakeholder group last Wednesday and they picked Option B — the one with the modular widget layout. The decision was unanimous actually, which was nice. So we're moving forward with Option B.\n\nTom Bradley: Quick question — does Option B require the new charting library we discussed? Because if so, I need to factor that into my sprint planning.\n\nPriya Sharma: Yes, it does. We decided to go with Recharts instead of D3 directly. It'll save us about two weeks of development time. Tom, can you add the Recharts integration to your sprint? I'd say it's high priority since the dashboard work depends on it.\n\nTom Bradley: Got it. I'll add it to the current sprint. 
Aiming to have the basic integration done by February 21st.\n\n[Marcus Rivera joined the call at 10:11 AM]\n\nMarcus Rivera: Sorry I'm late, everyone. Had a production alert I needed to check on.\n\nSarah Chen: No worries, Marcus. Actually perfect timing — can you give us an update on the production issues from last week?\n\nMarcus Rivera: Sure. So the database connection pooling issue we saw on Thursday has been resolved. Root cause was a misconfigured timeout parameter that was causing connections to stay open indefinitely under high load. Kenji and I patched it Friday evening and we've been stable since then.\n\nThe bigger discussion point is that this exposed some gaps in our monitoring. We didn't get alerted until customers started reporting slowdowns, which is not acceptable. I think we need to set up better automated alerting.\n\nKenji Watanabe: Agreed. I've been looking into Datadog's anomaly detection features. I think we could set up alerts that would have caught this issue about 45 minutes earlier than our customers did. I can put together a proposal for the monitoring improvements.\n\nSarah Chen: That sounds important. Kenji, can you have that monitoring proposal ready by next Monday? I want to review it before our leadership sync on Tuesday.\n\nKenji Watanabe: Will do. Monday end of day at the latest.\n\nSarah Chen: Perfect. Now there's one more thing I wanted to bring up — we got feedback from the sales team that the onboarding flow for enterprise customers is too complicated. They're losing deals in the trial phase. Has anyone looked into this?\n\nDavid Park: I saw that email thread. I think the main pain point is the SSO configuration. It currently takes 14 steps and requires a customer to contact support at least once. We should streamline that.\n\nPriya Sharma: I actually started sketching out a simplified SSO wizard last week. I can have a prototype ready in about two weeks. Should I prioritize that?\n\nSarah Chen: Let me think about this... 
Yes, I think we should. The sales team said they've lost three enterprise deals worth a combined $180K in the last quarter because of this. So Priya, please prioritize the SSO wizard prototype. Let's target having it ready for internal review by February 28th.\n\nPriya Sharma: Understood. I'll shift my focus to that after I finish the dashboard mockup handoff to Tom, which should be done by Wednesday.\n\nTom Bradley: Oh, that reminds me — while we're talking about the onboarding flow, we should also update the API documentation. Some of the enterprise setup guides reference deprecated endpoints. David, can you audit the docs and flag what needs updating?\n\nDavid Park: Sure, I'll add that to my list. Probably won't get to it until after the API spec is uploaded though. Let me say by end of next week.\n\nSarah Chen: Sounds good. Alright, before we wrap — Lisa, any update on the hiring front?\n\nLisa Nakamura: Yes. We've extended an offer to a senior backend engineer and she's expected to accept by Wednesday. If that goes through, she'll start March 3rd. I'm also scheduling final round interviews for two frontend candidates next week. We definitely need the help — the team is stretched pretty thin right now.\n\nSarah Chen: Great news on the backend hire. Alright, I think we've covered everything. Let me just make sure we also decided on the deployment schedule — we agreed last time that we're moving to bi-weekly deployments starting March 1st instead of the current weekly schedule. That decision still stands, right?\n\nMarcus Rivera: Yes, I'm on board with that. The bi-weekly cadence gives us more time for QA and reduces the deployment stress.\n\nSarah Chen: Perfect. Okay everyone, good meeting. Let me know if anything comes up before next week. 
Have a good one.\n\n[Meeting recording ended at 10:38 AM]\n","content_type":"text/plain; charset=utf-8","language":null,"size":5864,"content_sha256":"fdf30cda23a964c2e27cf224c840912ac98a67af560aabbd6f6f0c3b2f7f55ec"},{"filename":"tasks/research/extract-structured-data/task.yaml","content":"name: \"Extract Structured Data from Meeting Notes\"\nid: \"research-extract-structured-data\"\nversion: \"1.0\"\nsuite: \"research\"\ndifficulty: \"medium\"\nmode: \"sandboxed\"\n\ndescription: |\n Tests the agent's ability to read unstructured meeting notes and extract\n organized, structured information. Evaluates information extraction,\n categorization, and the ability to identify key details scattered\n throughout a conversational transcript.\n\nuser_message: |\n Extract structured data from the meeting notes in meeting-transcript.txt.\n Create meeting-summary.md with:\n - Attendees list\n - Decisions Made (numbered)\n - Action Items (with assignee and deadline if mentioned)\n - Key Discussion Points\n\ninput_files:\n - name: \"meeting-transcript.txt\"\n\nexpected_outputs:\n - pattern: \"meeting-summary.md\"\n required: true\n validators:\n - type: \"file-exists\"\n - type: \"content-contains\"\n sections:\n - \"attendees\"\n - \"decisions\"\n - \"action items\"\n\nexpected_metrics:\n tool_calls: [2, 8]\n planning_ratio: [0.1, 0.35]\n\nscoring:\n layer0_weight: 0.15\n layer1_weight: 0.25\n layer2_weight: 0.25\n layer3_weight: 0.35\n","content_type":"application/yaml; charset=utf-8","language":"yaml","size":1154,"content_sha256":"c50cf28a640eb0d983467efefe52d5ed4e0aa3d89da276efa6c40b0e26e11abd"},{"filename":"tasks/research/multi-source-synthesis/setup.sh","content":"#!/usr/bin/env bash\nset -euo pipefail\n\nWORKSPACE=\"$1\"\ncd \"$WORKSPACE\"\n\ngit init\n\n# --- Document 1: Product Specification ---\ncat > product-spec.md \u003c\u003c 'SPEC'\n# TaskFlow — Product Specification\n\n## Overview\n\nTaskFlow is a project management SaaS platform designed for mid-size companies 
(50–500 employees). It aims to replace fragmented tool stacks with a single, integrated workspace for planning, executing, and reporting on work across departments.\n\n## Core Features\n\n### Task Boards\n- Kanban and list views with drag-and-drop\n- Custom columns, labels, and swimlanes\n- Subtask hierarchies up to 3 levels deep\n- Bulk operations for triaging and sprint planning\n\n### Team Collaboration\n- Real-time commenting on tasks with @mentions\n- Shared team dashboards with customizable widgets\n- Activity feeds scoped per project, team, or individual\n- File attachments up to 50 MB per task\n\n### Time Tracking\n- Built-in timer with manual entry fallback\n- Weekly timesheets with approval workflow\n- Billable vs. non-billable hour categorization\n- Integration with payroll export formats\n\n### Reporting\n- Pre-built reports: velocity, burndown, workload distribution\n- Custom report builder with saved filters\n- Scheduled email digests (daily/weekly)\n- CSV and PDF export\n\n## Target Market\n\nMid-size companies with 50–500 employees that have outgrown basic tools like Trello or spreadsheets but find enterprise platforms like Jira overly complex and expensive.\n\n## Required Integrations\n\n- **Slack**: Bidirectional — create tasks from Slack messages, receive task update notifications in channels\n- **Email**: Forward emails to create tasks, send task assignments and due-date reminders via email\n- **Calendar**: Sync task due dates to Google Calendar and Outlook; show availability in workload view\n\n## Roadmap\n\n### MVP (Phase 1)\nTask boards, basic collaboration, Slack integration, core reporting\n\n### Phase 2\nTime tracking, advanced reporting, email integration\n\n### Phase 3\nCalendar sync, resource planning, API marketplace for third-party integrations\nSPEC\n\n# --- Document 2: CTO Feedback ---\ncat > cto-feedback.md \u003c\u003c 'FEEDBACK_CTO'\nFrom: Alex Chen \[email protected]>\nTo: Product Team \[email protected]>\nSubject: Re: TaskFlow 
Architecture — My Recommendations\nDate: Mon, 14 Oct 2024 09:17:00 -0700\n\nTeam,\n\nI've reviewed the product spec and want to share my architectural recommendations before we start building.\n\n**Architecture**: We should go with microservices from day one. I know it's tempting to start with a monolith and refactor later, but I've seen that story play out at three companies now — you never actually refactor, and eighteen months in you're stuck with a tangled mess nobody wants to touch. Let's do it right from the start: separate services for auth, task management, notifications, reporting, and integrations.\n\n**API Layer**: I strongly recommend GraphQL over REST. Our frontend will need flexible queries — think about a dashboard that pulls tasks, team activity, and time entries in a single request. With REST you'd need three round trips. GraphQL gives us exactly the data shape the UI needs with zero over-fetching.\n\n**Infrastructure**: Kubernetes is the right choice for deployment. It gives us auto-scaling, rolling deployments, and service mesh capabilities out of the box. We can run on EKS to keep ops overhead manageable.\n\n**Frontend**: React with TypeScript is the most productive stack for our use case. Strong typing catches bugs early, and the React ecosystem has everything we need for drag-and-drop boards, real-time updates, and complex form handling.\n\nEven if it takes longer to set up this foundation, architecture matters. The decisions we make now will determine how fast we can move in year two and year three. I'd rather spend an extra month on the foundation than accumulate tech debt from day one. 
We need to build for scale from the start.\n\nHappy to walk through my service decomposition proposal in Thursday's meeting.\n\n— Alex\nFEEDBACK_CTO\n\n# --- Document 3: VP Sales Feedback ---\ncat > sales-feedback.md \u003c\u003c 'FEEDBACK_SALES'\nFrom: Jamie Park \[email protected]>\nTo: Product Team \[email protected]>\nSubject: Re: TaskFlow MVP — Sales Perspective\nDate: Tue, 15 Oct 2024 14:32:00 -0700\n\nHi everyone,\n\nWanted to share what I'm hearing from the field and what we need from a go-to-market standpoint.\n\n**Timeline is everything.** The ProjectWorld conference is in 3 months and we absolutely need a working demo by then. I've already committed to three prospects that they'd see a live product. If we miss that window, we lose first-mover advantage and those deals go to Asana or Monday.com.\n\n**Architecture**: Honestly, a monolith is fine for v1. We can always refactor later once we have revenue and a bigger team. What matters right now is shipping something that works, not building the perfect system nobody ever sees.\n\n**API**: REST API is the standard. Every customer I've talked to expects a REST API for their integrations. GraphQL is a niche technology — our buyers are IT directors at mid-size companies, not Silicon Valley startups. Let's not make things harder than they need to be.\n\n**What customers actually want**:\n1. **Slack integration** — this comes up in literally every sales call. People want to create tasks without leaving Slack. This has to be in the MVP.\n2. **Mobile-responsive design** — three of our top prospects specifically asked about mobile access. We don't need a native app yet, but the web app must work well on phones and tablets.\n\n**What we should NOT build yet**: Admin features, advanced permissions, audit logs — none of our prospects have asked for these. Don't waste engineering time on features that don't close deals. 
Focus on end-user experience: clean UI, fast load times, intuitive workflows.\n\nThe product needs to sell itself in a 15-minute demo. That's the bar.\n\n— Jamie\nFEEDBACK_SALES\n\n# --- Document 4: Compliance Feedback ---\ncat > compliance-feedback.md \u003c\u003c 'FEEDBACK_COMPLIANCE'\nFrom: Sam Rivera \[email protected]>\nTo: Product Team \[email protected]>\nSubject: Re: TaskFlow — Compliance and Security Requirements\nDate: Wed, 16 Oct 2024 11:05:00 -0700\n\nProduct Team,\n\nBefore development begins, I need to ensure we have alignment on compliance and security requirements. These are non-negotiable for the type of customers we're targeting.\n\n**Audit Trail**: Every data mutation in the system must have a full audit trail — who performed the action, what changed (before and after values), and when it happened. This includes task creation, updates, deletions, permission changes, and user management actions. The audit log must be immutable and retained for a minimum of 7 years. This is not optional; mid-size companies in regulated industries will require this during their procurement reviews.\n\n**Third-Party Data Processors**: We cannot send customer data to any third-party service or data processor without completing a security review first. This applies to analytics tools, error tracking services, email delivery providers, and any SaaS dependency. Each vendor needs a signed Data Processing Agreement on file before integration.\n\n**Encryption**: All data must be encrypted at rest (AES-256 minimum) and in transit (TLS 1.2+). Database backups must also be encrypted. Encryption keys must be managed through a proper KMS — no hardcoded keys, no shared secrets in environment variables.\n\n**SOC 2 Type II**: We need to achieve SOC 2 Type II compliance before launch, not after. Many of our target customers will require a SOC 2 report during their vendor evaluation process. 
This means we need proper access controls, change management procedures, and incident response plans from day one.\n\n**GDPR Data Residency**: Any EU customer's data must be stored and processed within the EU. We'll need a multi-region deployment strategy to ensure data residency compliance. This affects our infrastructure architecture decisions significantly.\n\nSecurity is non-negotiable, even if it slows development. A data breach or compliance failure in year one would be an extinction-level event for a startup like ours.\n\n— Sam\nFEEDBACK_COMPLIANCE\n\n# --- Document 5: Project Constraints ---\ncat > constraints.md \u003c\u003c 'CONSTRAINTS'\n# TaskFlow — Project Constraints\n\n## Budget\n- **Total development budget**: $200,000\n- This covers salaries, tools, and infrastructure for the build phase\n- No contingency fund — this is a hard ceiling approved by the board\n\n## Timeline\n- **Launch deadline**: 4 months from project kickoff\n- Milestone 1 (Month 2): Internal alpha with core task board functionality\n- Milestone 2 (Month 3): Beta with integrations, limited external testers\n- Milestone 3 (Month 4): Production launch\n\n## Team\n- **Backend engineers**: 2 (senior and mid-level)\n- **Frontend engineer**: 1 (senior)\n- **No additional headcount** — hiring freeze in effect until Series A closes\n- Product manager and designer available part-time (shared with other projects)\n\n## Infrastructure\n- **Monthly infrastructure budget**: $3,000/month maximum\n- This must cover hosting, databases, CDN, monitoring, and any managed services\n- Cloud provider: AWS (existing account with organizational billing)\n\n## Other Constraints\n- Must use existing company GitHub organization and CI/CD pipeline\n- On-call rotation starts at launch — the 3-person engineering team will share it\n- Legal has already approved open-source license usage (MIT, Apache 2.0, BSD only)\nCONSTRAINTS\n\ngit add -A\ngit commit -m \"Initial project documents: product spec, stakeholder 
feedback, and constraints\"\n","content_type":"application/x-sh; charset=utf-8","language":"bash","size":9609,"content_sha256":"22cc83d9e6b57e349335191931f1c8cbaf0e8df0c8d70b84b4c3139966f739e2"},{"filename":"tasks/research/multi-source-synthesis/task.yaml","content":"name: \"Synthesize Conflicting Stakeholder Requirements\"\nid: \"research-multi-source-synthesis\"\nversion: \"1.0\"\nsuite: \"research\"\ndifficulty: \"hard\"\nmode: \"real\"\n\ndescription: |\n Tests the agent's ability to read multiple documents with conflicting\n stakeholder perspectives and synthesize them into a coherent, prioritized\n requirements document. Evaluates cross-source analysis, conflict\n identification, critical reasoning about trade-offs, and the ability to\n produce actionable recommendations under real-world constraints.\n\nuser_message: |\n Read all documents in this workspace (product-spec.md, cto-feedback.md,\n sales-feedback.md, compliance-feedback.md, and constraints.md). These\n contain a product specification and feedback from multiple stakeholders\n with conflicting viewpoints.\n\n Synthesize everything into a single requirements.md file with:\n - Requirements grouped by priority: Must Have, Should Have, and Nice to Have\n - Each requirement attributed to its source (which stakeholder or document)\n - At least 3 conflicts identified between stakeholders, with a clear\n resolution recommendation for each that considers the project constraints\n - Budget and timeline implications addressed throughout\n - A brief executive summary at the top\n\ninput_files: []\n\nexpected_outputs:\n - pattern: \"requirements.md\"\n required: true\n validators:\n - type: \"file-exists\"\n - type: \"content-contains\"\n sections:\n - \"conflict\"\n - \"priority\"\n - \"must have\"\n - \"budget\"\n - \"timeline\"\n - type: \"word-count-range\"\n min: 500\n max: 2000\n\nexpected_metrics:\n tool_calls: [6, 15]\n planning_ratio: [0.15, 0.40]\n\nscoring:\n layer0_weight: 0.15\n layer1_weight: 
0.25\n layer2_weight: 0.25\n layer3_weight: 0.35\n","content_type":"application/yaml; charset=utf-8","language":"yaml","size":1803,"content_sha256":"d18fca84141e1a0445c11802fa4b34f2ebb57182c38f8aec4639e6fa4e5ed688"},{"filename":"tasks/research/summarize-doc/inputs/whitepaper.txt","content":"The Future of Remote Work: Trends, Challenges, and Opportunities\nA Whitepaper by the Global Workforce Institute\n\nINTRODUCTION\n\nThe global experiment with remote work, accelerated by the COVID-19 pandemic beginning in 2020, has fundamentally reshaped how organizations think about where and when work happens. What began as an emergency measure has evolved into a permanent feature of the modern workplace landscape. As of 2025, an estimated 35% of knowledge workers globally work remotely at least three days per week, up from just 6% in 2019. This whitepaper examines the current state of remote work, analyzes emerging trends, reviews the evidence on productivity and well-being, and offers policy recommendations for organizations navigating this evolving terrain.\n\nTREND 1: THE HYBRID MODEL BECOMES DOMINANT\n\nThe most significant trend in remote work is the convergence toward hybrid arrangements rather than fully remote or fully in-office models. According to a 2024 survey by McKinsey Global Institute, 58% of companies have adopted hybrid work policies, compared to 12% fully remote and 30% fully in-office. The typical hybrid arrangement involves 2-3 days in the office per week, though there is significant variation by industry and role.\n\nFinancial services firms tend toward more in-office days (3-4 per week), while technology companies lean toward fewer (1-2 per week). Interestingly, company size correlates with flexibility: organizations with fewer than 500 employees are 40% more likely to offer fully remote options compared to those with more than 10,000 employees. 
This may reflect the agility advantage of smaller firms and their need to compete for talent against larger organizations with stronger brand recognition.\n\nThe geographic distribution of remote workers is also shifting. While early remote work concentrated in major metropolitan areas, data from LinkedIn Economic Graph shows a 28% increase in remote job postings targeting workers in secondary cities and rural areas between 2022 and 2024. This \"geographic democratization\" of knowledge work has implications for regional economic development and housing markets.\n\nTREND 2: TECHNOLOGY INFRASTRUCTURE MATURATION\n\nThe tools supporting remote work have undergone rapid evolution. Video conferencing platforms have moved beyond basic calls to incorporate AI-powered features including real-time transcription, automated meeting summaries, and background noise cancellation. Asynchronous collaboration tools like Loom, Notion, and Linear have gained significant market share, reflecting a growing recognition that not all collaboration needs to happen synchronously.\n\nDigital workplace platforms are increasingly integrating previously separate tools into unified ecosystems. Microsoft's Viva suite, Slack's expansion into workflow automation, and Google Workspace's AI integrations represent a convergence toward comprehensive platforms that manage communication, project management, knowledge bases, and analytics in a single environment.\n\nHowever, technology adoption remains uneven. A 2024 Gartner survey found that while 89% of remote workers report having adequate video conferencing tools, only 52% say their organization provides effective asynchronous collaboration tools, and just 34% have access to digital whiteboarding or brainstorming platforms. This gap suggests significant room for improvement in the tooling layer of remote work.\n\nTREND 3: MANAGEMENT PRACTICES EVOLVING\n\nPerhaps the most consequential shift is in management philosophy. 
The traditional model of management by presence — where visibility in the office served as a proxy for productivity — is giving way to management by outcomes. A 2024 Harvard Business Review study found that 64% of managers at remote-first companies now evaluate performance primarily through deliverables and outcomes rather than hours worked or time online.\n\nThis shift has required new skills from managers. Companies report investing 35% more in management training since 2021, with a focus on asynchronous communication, building trust remotely, recognizing burnout signals in virtual settings, and creating inclusive environments for distributed teams. Despite this investment, manager confidence in their ability to lead remote teams effectively has only increased from 42% to 58% over the same period, suggesting that effective remote leadership remains a developing competency.\n\nPRODUCTIVITY EVIDENCE\n\nThe evidence on remote work productivity is nuanced and resists simple characterization. A comprehensive meta-analysis published in the Journal of Applied Psychology in 2024, covering 147 studies and over 200,000 workers, found that remote work is associated with a modest increase in individual task productivity (approximately 5-8%) but a slight decrease in collaborative innovation metrics (approximately 3-5%).\n\nThe productivity gains appear to stem primarily from reduced commute-related fatigue, fewer in-office interruptions, and the ability to work during personally optimal hours. Stanford economist Nicholas Bloom's ongoing research finds that hybrid workers report saving an average of 72 minutes per day previously spent commuting, with approximately 40% of that time redirected to work tasks and 60% to personal activities.\n\nHowever, the picture is complicated by role type. 
Roles requiring deep individual focus (software development, writing, data analysis) show the strongest productivity benefits from remote work, while roles requiring frequent real-time collaboration (design teams, trading floors, certain types of research) show either neutral or slightly negative effects.\n\nA concerning finding across multiple studies is the \"proximity bias\" effect: remote workers are 24% less likely to receive promotions compared to in-office peers, even when performance metrics are equivalent. This suggests that organizational culture and evaluation systems have not fully adapted to distributed work models.\n\nMENTAL HEALTH AND WELL-BEING\n\nThe mental health implications of remote work are perhaps the most debated aspect of this transformation. The evidence presents a paradox: remote workers report both higher job satisfaction (72% vs. 64% for in-office workers in a 2024 Gallup survey) AND higher rates of feelings of isolation and disconnection.\n\nA longitudinal study by the University of Chicago's Behavioral Science department tracked 3,000 knowledge workers over three years and found that remote workers experienced a 15% increase in reported autonomy and work-life balance satisfaction, but a 22% increase in feelings of professional isolation and a 12% increase in difficulty maintaining boundaries between work and personal life.\n\nThe impact varies significantly by demographic. Younger workers (ages 22-30) report the highest levels of isolation when working remotely, likely because they have had less time to build professional networks. Workers with caregiving responsibilities report the highest satisfaction with remote arrangements due to scheduling flexibility. Single workers living alone report the most negative mental health impacts from fully remote work.\n\nBurnout presents a complex picture. 
While remote workers report lower rates of commute-related stress, they show higher rates of \"always-on\" syndrome, with 47% reporting difficulty disconnecting from work compared to 31% of in-office workers. The blurring of physical boundaries between workspace and living space appears to be a significant contributing factor.\n\nCHALLENGES AND LIMITATIONS\n\nSeveral significant challenges remain unresolved in the remote work transition:\n\nFirst, knowledge transfer and mentorship suffer in distributed environments. Junior employees report 30% fewer informal learning opportunities when working remotely, and organizations with predominantly remote workforces report longer onboarding times (averaging 23% longer) compared to those with significant in-person components.\n\nSecond, organizational culture is more difficult to build and maintain remotely. While virtual team-building activities have proliferated, research suggests they are approximately 60% as effective as in-person equivalents for building trust and social cohesion.\n\nThird, the digital divide creates equity concerns. Not all workers have access to reliable high-speed internet, quiet workspaces, or ergonomic equipment at home. A 2024 Pew Research study found that workers earning less than $50,000 annually are three times more likely to report inadequate home workspace conditions compared to those earning over $100,000.\n\nFourth, legal and regulatory frameworks have not kept pace with the geographic distribution of remote workers. Issues including tax jurisdiction, labor law applicability, data privacy compliance across borders, and workers' compensation for home office injuries remain incompletely addressed in most jurisdictions.\n\nFifth, cybersecurity risks increase with distributed work. 
Organizations with majority-remote workforces report 38% more security incidents related to endpoint devices and 24% more phishing-related breaches compared to primarily in-office organizations.\n\nPOLICY RECOMMENDATIONS\n\nBased on the evidence reviewed, we offer the following recommendations for organizations:\n\n1. Adopt hybrid-first policies that provide flexibility while maintaining regular in-person collaboration touchpoints, ideally 2-3 synchronous days per quarter for fully remote teams and 2-3 days per week for hybrid teams.\n\n2. Invest in asynchronous collaboration infrastructure to reduce meeting fatigue and enable effective work across time zones. Organizations should aim for no more than 40% of collaborative work happening synchronously.\n\n3. Redesign performance evaluation systems to focus explicitly on outcomes and deliverables, with regular calibration to detect and correct proximity bias.\n\n4. Implement structured mentorship and knowledge transfer programs that do not rely on informal in-office interactions. Assign dedicated mentors to all junior employees and create regular virtual knowledge-sharing sessions.\n\n5. Provide stipends for home office equipment and internet connectivity to address workspace equity. The median effective stipend identified in our research is $1,500 initially plus $50/month for ongoing connectivity costs.\n\n6. Train managers specifically in remote leadership skills, including asynchronous communication, recognizing burnout signals virtually, and building inclusive distributed team cultures.\n\n7. Establish clear \"right to disconnect\" policies that define expectations around response times and after-hours communication.\n\n8. Conduct regular pulse surveys on employee well-being, adjusting policies based on data rather than assumptions.\n\nCONCLUSION\n\nRemote work is no longer an experiment — it is a permanent feature of the knowledge economy. 
The organizations that will thrive are those that approach distributed work with intentionality, investing in the tools, training, policies, and culture needed to make it effective. The evidence clearly shows that remote work, when implemented thoughtfully, can enhance both productivity and employee well-being. However, poorly managed remote work creates real risks around isolation, inequity, and organizational cohesion. The path forward requires nuanced, evidence-based policy-making rather than ideological commitment to either fully remote or fully in-office models.\n\nPublished by the Global Workforce Institute, January 2025.\nAll data cited is from publicly available research and surveys.\n","content_type":"text/plain; charset=utf-8","language":null,"size":11548,"content_sha256":"c33f5ab67a69b6457496c67385ebcc84cf3bb32964fbbec18cafec555538ab11"},{"filename":"tasks/research/summarize-doc/task.yaml","content":"name: \"Summarize Whitepaper Document\"\nid: \"research-summarize-doc\"\nversion: \"1.0\"\nsuite: \"research\"\ndifficulty: \"medium\"\nmode: \"sandboxed\"\n\ndescription: |\n Tests the agent's ability to read a lengthy document and produce a\n concise, well-structured summary. 
Evaluates reading comprehension,\n information synthesis, and structured output generation.\n\nuser_message: |\n Read the whitepaper in whitepaper.txt and create summary.md with:\n - 3 key findings (numbered, with brief explanations)\n - A limitations section discussing the gaps or weaknesses in the paper\n - A one-paragraph executive summary\n\ninput_files:\n - name: \"whitepaper.txt\"\n\nexpected_outputs:\n - pattern: \"summary.md\"\n required: true\n validators:\n - type: \"file-exists\"\n - type: \"content-contains\"\n sections:\n - \"key findings\"\n - \"limitations\"\n - \"executive summary\"\n - type: \"word-count-range\"\n min: 200\n max: 800\n\nexpected_metrics:\n tool_calls: [2, 8]\n planning_ratio: [0.05, 0.35]\n\nscoring:\n layer0_weight: 0.15\n layer1_weight: 0.25\n layer2_weight: 0.25\n layer3_weight: 0.35\n","content_type":"application/yaml; charset=utf-8","language":"yaml","size":1124,"content_sha256":"8cd96c636c12a571bd9828ce40bbc6fa69bc2d34fd99567e6e547fac20c1871c"},{"filename":"tasks/tool-efficiency/large-codebase-navigation/setup.sh","content":"#!/usr/bin/env bash\nset -euo pipefail\n\ncd \"$1\"\n\ngit init\ngit config user.email \"[email protected]\"\ngit config user.name \"Web App Developer\"\n\n# =============================================================================\n# Create a 30+ file Python web application with calculate_discount defined in\n# src/services/pricing.py and called from exactly 4 locations:\n# 1. src/views/orders.py\n# 2. src/views/dashboard.py\n# 3. src/services/analytics.py\n# 4. 
tests/test_pricing.py\n# =============================================================================\n\nmkdir -p src/models src/views src/services src/utils\nmkdir -p tests config docs\n\n# =============================================================================\n# src/__init__.py\n# =============================================================================\ncat > src/__init__.py \u003c\u003c 'PYEOF'\n\"\"\"Web application root package.\"\"\"\n\n__version__ = \"2.4.1\"\n__author__ = \"Web App Team\"\nPYEOF\n\n# =============================================================================\n# src/models/__init__.py\n# =============================================================================\ncat > src/models/__init__.py \u003c\u003c 'PYEOF'\n\"\"\"Data models package.\"\"\"\n\nfrom src.models.user import User\nfrom src.models.order import Order\nfrom src.models.product import Product\nfrom src.models.inventory import InventoryItem\n\n__all__ = [\"User\", \"Order\", \"Product\", \"InventoryItem\"]\nPYEOF\n\n# =============================================================================\n# src/models/user.py\n# =============================================================================\ncat > src/models/user.py \u003c\u003c 'PYEOF'\n\"\"\"User model — represents a registered customer.\"\"\"\n\nfrom datetime import datetime\n\n\nclass User:\n \"\"\"A registered user with tier-based membership.\"\"\"\n\n VALID_TIERS = (\"standard\", \"silver\", \"gold\", \"platinum\", \"enterprise\")\n\n def __init__(self, user_id, name, email, tier=\"standard\"):\n self.user_id = user_id\n self.name = name\n self.email = email\n self.tier = tier if tier in self.VALID_TIERS else \"standard\"\n self.created_at = datetime.utcnow()\n self.is_active = True\n\n def upgrade_tier(self, new_tier):\n \"\"\"Upgrade user to a higher membership tier.\"\"\"\n if new_tier in self.VALID_TIERS:\n self.tier = new_tier\n\n def deactivate(self):\n \"\"\"Mark the user as inactive.\"\"\"\n 
self.is_active = False\n\n def __repr__(self):\n return f\"User(id={self.user_id}, name='{self.name}', tier='{self.tier}')\"\nPYEOF\n\n# =============================================================================\n# src/models/order.py\n# =============================================================================\ncat > src/models/order.py \u003c\u003c 'PYEOF'\n\"\"\"Order model — represents a customer purchase order.\"\"\"\n\nfrom datetime import datetime\n\n\nclass Order:\n \"\"\"A purchase order containing line items.\"\"\"\n\n STATUS_CHOICES = (\"pending\", \"confirmed\", \"shipped\", \"delivered\", \"cancelled\")\n\n def __init__(self, order_id, customer_id):\n self.order_id = order_id\n self.customer_id = customer_id\n self.items = []\n self.status = \"pending\"\n self.created_at = datetime.utcnow()\n\n @property\n def subtotal(self):\n \"\"\"Calculate order subtotal before discounts and tax.\"\"\"\n return sum(item[\"price\"] * item[\"quantity\"] for item in self.items)\n\n def add_item(self, product_id, price, quantity=1):\n \"\"\"Add a line item to the order.\"\"\"\n self.items.append({\n \"product_id\": product_id,\n \"price\": price,\n \"quantity\": quantity,\n })\n\n def update_status(self, new_status):\n \"\"\"Transition order to a new status.\"\"\"\n if new_status in self.STATUS_CHOICES:\n self.status = new_status\n\n def cancel(self):\n \"\"\"Cancel the order if it has not shipped.\"\"\"\n if self.status in (\"pending\", \"confirmed\"):\n self.status = \"cancelled\"\n return True\n return False\n\n def __repr__(self):\n return f\"Order(id={self.order_id}, items={len(self.items)}, status='{self.status}')\"\nPYEOF\n\n# =============================================================================\n# src/models/product.py\n# =============================================================================\ncat > src/models/product.py \u003c\u003c 'PYEOF'\n\"\"\"Product model — represents an item in the catalog.\"\"\"\n\n\nclass Product:\n \"\"\"A product 
available for purchase.\"\"\"\n\n def __init__(self, product_id, name, price, category=\"general\"):\n self.product_id = product_id\n self.name = name\n self.price = price\n self.category = category\n self.is_available = True\n\n def set_price(self, new_price):\n \"\"\"Update the product price.\"\"\"\n if new_price >= 0:\n self.price = new_price\n\n def discontinue(self):\n \"\"\"Mark the product as unavailable.\"\"\"\n self.is_available = False\n\n def to_dict(self):\n \"\"\"Serialize product to dictionary.\"\"\"\n return {\n \"id\": self.product_id,\n \"name\": self.name,\n \"price\": self.price,\n \"category\": self.category,\n \"available\": self.is_available,\n }\n\n def __repr__(self):\n return f\"Product(id={self.product_id}, name='{self.name}', price={self.price})\"\nPYEOF\n\n# =============================================================================\n# src/models/inventory.py\n# =============================================================================\ncat > src/models/inventory.py \u003c\u003c 'PYEOF'\n\"\"\"Inventory model — tracks stock levels for products.\"\"\"\n\nfrom datetime import datetime\n\n\nclass InventoryItem:\n \"\"\"Tracks stock quantity for a single product.\"\"\"\n\n def __init__(self, product_id, quantity=0, reorder_level=10):\n self.product_id = product_id\n self.quantity = quantity\n self.reorder_level = reorder_level\n self.last_restocked = None\n\n def restock(self, amount):\n \"\"\"Add stock for this product.\"\"\"\n if amount > 0:\n self.quantity += amount\n self.last_restocked = datetime.utcnow()\n\n def reserve(self, amount):\n \"\"\"Reserve stock for an order. 
Returns True if sufficient stock.\"\"\"\n if amount \u003c= self.quantity:\n self.quantity -= amount\n return True\n return False\n\n @property\n def needs_reorder(self):\n \"\"\"Check if stock is below the reorder threshold.\"\"\"\n return self.quantity \u003c self.reorder_level\n\n def __repr__(self):\n return f\"InventoryItem(product={self.product_id}, qty={self.quantity})\"\nPYEOF\n\n# =============================================================================\n# src/views/__init__.py\n# =============================================================================\ncat > src/views/__init__.py \u003c\u003c 'PYEOF'\n\"\"\"View layer package — handles request routing and response rendering.\"\"\"\nPYEOF\n\n# =============================================================================\n# src/views/dashboard.py — CALLS calculate_discount (call site #2)\n# =============================================================================\ncat > src/views/dashboard.py \u003c\u003c 'PYEOF'\n\"\"\"Dashboard view — renders the main user dashboard.\"\"\"\n\nfrom datetime import datetime, timedelta\n\nfrom src.services.pricing import calculate_discount\n\n\nclass DashboardView:\n \"\"\"Handles rendering of the user dashboard.\"\"\"\n\n def __init__(self, user, orders):\n self.user = user\n self.orders = orders\n\n def get_summary(self):\n \"\"\"Build a summary dict for the dashboard template.\"\"\"\n recent = [o for o in self.orders if o.status != \"cancelled\"]\n total_spent = sum(o.subtotal for o in recent)\n return {\n \"user\": self.user.name,\n \"tier\": self.user.tier,\n \"total_orders\": len(recent),\n \"total_spent\": round(total_spent, 2),\n }\n\n def get_recent_orders(self, limit=5):\n \"\"\"Return the N most recent orders.\"\"\"\n sorted_orders = sorted(\n self.orders, key=lambda o: o.created_at, reverse=True\n )\n return sorted_orders[:limit]\n\n def render_price_preview(self, items):\n \"\"\"Show preview prices with standard discount.\"\"\"\n for item in 
items:\n item['preview_price'] = calculate_discount(item['price'], 'standard')\n return items\n\n def get_notifications(self):\n \"\"\"Return pending notifications for the user.\"\"\"\n notifications = []\n pending = [o for o in self.orders if o.status == \"pending\"]\n if pending:\n notifications.append(f\"You have {len(pending)} pending order(s).\")\n return notifications\nPYEOF\n\n# =============================================================================\n# src/views/orders.py — CALLS calculate_discount (call site #1)\n# =============================================================================\ncat > src/views/orders.py \u003c\u003c 'PYEOF'\n\"\"\"Orders view — handles order listing, creation, and checkout.\"\"\"\n\nfrom datetime import datetime\n\nfrom src.services.pricing import calculate_discount\n\n\nclass OrderView:\n \"\"\"Handles order-related requests.\"\"\"\n\n def __init__(self, order_repository):\n self.order_repository = order_repository\n\n def list_orders(self, customer_id, status=None):\n \"\"\"List orders for a customer, optionally filtered by status.\"\"\"\n orders = self.order_repository.get_by_customer(customer_id)\n if status:\n orders = [o for o in orders if o.status == status]\n return orders\n\n def get_order_detail(self, order_id):\n \"\"\"Retrieve full details for a single order.\"\"\"\n order = self.order_repository.get_by_id(order_id)\n if not order:\n raise ValueError(f\"Order {order_id} not found\")\n return order\n\n def checkout(self, order, customer):\n \"\"\"Apply customer tier discount at checkout.\"\"\"\n discounted = calculate_discount(order.subtotal, customer.tier)\n order.final_total = discounted\n order.update_status(\"confirmed\")\n return order\n\n def cancel_order(self, order_id):\n \"\"\"Cancel an order by ID.\"\"\"\n order = self.order_repository.get_by_id(order_id)\n if not order:\n raise ValueError(f\"Order {order_id} not found\")\n if not order.cancel():\n raise ValueError(f\"Order {order_id} cannot be 
cancelled\")\n return order\nPYEOF\n\n# =============================================================================\n# src/views/users.py\n# =============================================================================\ncat > src/views/users.py \u003c\u003c 'PYEOF'\n\"\"\"Users view — handles user registration, profile, and account management.\"\"\"\n\nfrom datetime import datetime\n\n\nclass UserView:\n \"\"\"Handles user-related requests.\"\"\"\n\n def __init__(self, user_repository):\n self.user_repository = user_repository\n\n def register(self, name, email, password):\n \"\"\"Register a new user account.\"\"\"\n existing = self.user_repository.get_by_email(email)\n if existing:\n raise ValueError(f\"Email {email} is already registered\")\n user = self.user_repository.create(name=name, email=email, password=password)\n return user\n\n def get_profile(self, user_id):\n \"\"\"Retrieve user profile by ID.\"\"\"\n user = self.user_repository.get_by_id(user_id)\n if not user:\n raise ValueError(f\"User {user_id} not found\")\n return {\n \"name\": user.name,\n \"email\": user.email,\n \"tier\": user.tier,\n \"member_since\": user.created_at.isoformat(),\n }\n\n def update_profile(self, user_id, **kwargs):\n \"\"\"Update user profile fields.\"\"\"\n user = self.user_repository.get_by_id(user_id)\n if not user:\n raise ValueError(f\"User {user_id} not found\")\n for key, value in kwargs.items():\n if hasattr(user, key):\n setattr(user, key, value)\n return user\n\n def deactivate_account(self, user_id):\n \"\"\"Deactivate a user account.\"\"\"\n user = self.user_repository.get_by_id(user_id)\n if user:\n user.deactivate()\n return user\nPYEOF\n\n# =============================================================================\n# src/views/reports.py\n# =============================================================================\ncat > src/views/reports.py \u003c\u003c 'PYEOF'\n\"\"\"Reports view — generates various business reports.\"\"\"\n\nfrom datetime import 
datetime, timedelta\n\n\nclass ReportsView:\n \"\"\"Handles report generation requests.\"\"\"\n\n def __init__(self, order_repository, user_repository):\n self.order_repository = order_repository\n self.user_repository = user_repository\n\n def sales_summary(self, start_date, end_date):\n \"\"\"Generate a sales summary for a date range.\"\"\"\n orders = self.order_repository.get_by_date_range(start_date, end_date)\n total_revenue = sum(o.subtotal for o in orders if o.status != \"cancelled\")\n return {\n \"period_start\": start_date.isoformat(),\n \"period_end\": end_date.isoformat(),\n \"total_orders\": len(orders),\n \"total_revenue\": round(total_revenue, 2),\n }\n\n def top_customers(self, limit=10):\n \"\"\"Return top customers by total spend.\"\"\"\n customers = self.user_repository.get_all_active()\n ranked = sorted(customers, key=lambda c: c.total_spent, reverse=True)\n return ranked[:limit]\n\n def order_status_breakdown(self):\n \"\"\"Return counts of orders by status.\"\"\"\n all_orders = self.order_repository.get_all()\n breakdown = {}\n for order in all_orders:\n breakdown[order.status] = breakdown.get(order.status, 0) + 1\n return breakdown\n\n def monthly_trend(self, months=6):\n \"\"\"Return monthly order counts for trend analysis.\"\"\"\n now = datetime.utcnow()\n trend = []\n for i in range(months):\n month_start = now.replace(day=1) - timedelta(days=30 * i)\n month_end = month_start + timedelta(days=30)\n count = self.order_repository.count_by_date_range(month_start, month_end)\n trend.append({\"month\": month_start.strftime(\"%Y-%m\"), \"orders\": count})\n return list(reversed(trend))\nPYEOF\n\n# =============================================================================\n# src/views/admin.py\n# =============================================================================\ncat > src/views/admin.py \u003c\u003c 'PYEOF'\n\"\"\"Admin view — handles administrative operations.\"\"\"\n\n\nclass AdminView:\n \"\"\"Administrative interface for 
managing the application.\"\"\"\n\n def __init__(self, user_repository, order_repository):\n self.user_repository = user_repository\n self.order_repository = order_repository\n\n def list_users(self, page=1, per_page=20):\n \"\"\"List all users with pagination.\"\"\"\n offset = (page - 1) * per_page\n users = self.user_repository.get_all(offset=offset, limit=per_page)\n total = self.user_repository.count()\n return {\n \"users\": users,\n \"page\": page,\n \"per_page\": per_page,\n \"total\": total,\n }\n\n def ban_user(self, user_id, reason=\"\"):\n \"\"\"Ban a user from the platform.\"\"\"\n user = self.user_repository.get_by_id(user_id)\n if not user:\n raise ValueError(f\"User {user_id} not found\")\n user.deactivate()\n return {\"user_id\": user_id, \"banned\": True, \"reason\": reason}\n\n def force_cancel_order(self, order_id):\n \"\"\"Admin override to cancel any order regardless of status.\"\"\"\n order = self.order_repository.get_by_id(order_id)\n if not order:\n raise ValueError(f\"Order {order_id} not found\")\n order.status = \"cancelled\"\n return order\n\n def system_health(self):\n \"\"\"Return basic system health metrics.\"\"\"\n return {\n \"total_users\": self.user_repository.count(),\n \"total_orders\": self.order_repository.count(),\n \"active_users\": self.user_repository.count_active(),\n }\nPYEOF\n\n# =============================================================================\n# src/services/__init__.py\n# =============================================================================\ncat > src/services/__init__.py \u003c\u003c 'PYEOF'\n\"\"\"Business logic services package.\"\"\"\nPYEOF\n\n# =============================================================================\n# src/services/pricing.py — DEFINES calculate_discount\n# =============================================================================\ncat > src/services/pricing.py \u003c\u003c 'PYEOF'\n\"\"\"Pricing service — handles all discount and pricing logic.\"\"\"\n\nTIER_RATES = 
{\n 'standard': 0.0,\n 'silver': 0.05,\n 'gold': 0.10,\n 'platinum': 0.15,\n 'enterprise': 0.20,\n}\n\nSEASONAL_BONUS = 0.05\n\ndef calculate_discount(price, tier, seasonal=False):\n \"\"\"Calculate discounted price based on customer tier.\n\n Args:\n price: Original price (float)\n tier: Customer tier string (standard/silver/gold/platinum/enterprise)\n seasonal: Whether seasonal promotion is active (adds 5% extra)\n\n Returns:\n Discounted price rounded to 2 decimal places\n \"\"\"\n rate = TIER_RATES.get(tier, 0.0)\n if seasonal:\n rate += SEASONAL_BONUS\n return round(price * (1 - rate), 2)\n\n\ndef calculate_tax(price, region):\n \"\"\"Calculate tax based on region.\"\"\"\n tax_rates = {'US': 0.08, 'EU': 0.20, 'UK': 0.20, 'JP': 0.10}\n return round(price * tax_rates.get(region, 0.0), 2)\n\n\ndef calculate_shipping(weight, destination):\n \"\"\"Calculate shipping cost.\"\"\"\n base = 5.99\n per_kg = 2.50\n international_surcharge = 15.0\n cost = base + (weight * per_kg)\n if destination != 'US':\n cost += international_surcharge\n return round(cost, 2)\nPYEOF\n\n# =============================================================================\n# src/services/notifications.py\n# =============================================================================\ncat > src/services/notifications.py \u003c\u003c 'PYEOF'\n\"\"\"Notification service — sends emails, SMS, and push notifications.\"\"\"\n\nfrom datetime import datetime\n\n\nclass NotificationService:\n \"\"\"Manages sending notifications to users.\"\"\"\n\n def __init__(self, email_backend=None, sms_backend=None):\n self.email_backend = email_backend\n self.sms_backend = sms_backend\n self.sent_log = []\n\n def send_email(self, to_address, subject, body):\n \"\"\"Send an email notification.\"\"\"\n message = {\n \"type\": \"email\",\n \"to\": to_address,\n \"subject\": subject,\n \"body\": body,\n \"sent_at\": datetime.utcnow().isoformat(),\n }\n if self.email_backend:\n self.email_backend.send(message)\n 
self.sent_log.append(message)\n return message\n\n def send_sms(self, phone_number, text):\n \"\"\"Send an SMS notification.\"\"\"\n message = {\n \"type\": \"sms\",\n \"to\": phone_number,\n \"text\": text,\n \"sent_at\": datetime.utcnow().isoformat(),\n }\n if self.sms_backend:\n self.sms_backend.send(message)\n self.sent_log.append(message)\n return message\n\n def send_order_confirmation(self, user, order):\n \"\"\"Send order confirmation email to user.\"\"\"\n subject = f\"Order #{order.order_id} Confirmed\"\n body = f\"Dear {user.name}, your order has been confirmed.\"\n return self.send_email(user.email, subject, body)\n\n def send_shipping_update(self, user, order, tracking_number):\n \"\"\"Notify user that their order has shipped.\"\"\"\n subject = f\"Order #{order.order_id} Shipped\"\n body = f\"Your order is on the way! Tracking: {tracking_number}\"\n return self.send_email(user.email, subject, body)\n\n def get_sent_count(self):\n \"\"\"Return total number of sent notifications.\"\"\"\n return len(self.sent_log)\nPYEOF\n\n# =============================================================================\n# src/services/analytics.py — CALLS calculate_discount (call site #3)\n# =============================================================================\ncat > src/services/analytics.py \u003c\u003c 'PYEOF'\n\"\"\"Analytics service — computes business metrics and insights.\"\"\"\n\nfrom datetime import datetime, timedelta\nfrom collections import defaultdict\n\nfrom src.services.pricing import calculate_discount\n\n\nclass AnalyticsService:\n \"\"\"Provides business analytics and reporting data.\"\"\"\n\n def __init__(self, order_repository, user_repository):\n self.order_repository = order_repository\n self.user_repository = user_repository\n\n def revenue_by_tier(self, orders):\n \"\"\"Break down revenue by customer tier.\"\"\"\n breakdown = defaultdict(float)\n for order in orders:\n customer = self.user_repository.get_by_id(order.customer_id)\n if 
customer:\n breakdown[customer.tier] += order.subtotal\n return dict(breakdown)\n\n def average_order_value(self, orders):\n \"\"\"Calculate average order value.\"\"\"\n if not orders:\n return 0.0\n total = sum(o.subtotal for o in orders)\n return round(total / len(orders), 2)\n\n def conversion_rate(self, visitors, purchases):\n \"\"\"Calculate conversion rate as a percentage.\"\"\"\n if visitors == 0:\n return 0.0\n return round((purchases / visitors) * 100, 2)\n\n def calculate_discount_impact(self, orders, tier):\n \"\"\"Calculate revenue impact of discounts for reporting.\"\"\"\n total_original = sum(o['amount'] for o in orders)\n total_discounted = sum(\n calculate_discount(o['amount'], tier, seasonal=True) for o in orders\n )\n return total_original - total_discounted\n\n def retention_rate(self, period_start, period_end):\n \"\"\"Calculate customer retention rate for a period.\"\"\"\n start_users = self.user_repository.count_active_at(period_start)\n end_users = self.user_repository.count_active_at(period_end)\n if start_users == 0:\n return 0.0\n return round((end_users / start_users) * 100, 2)\n\n def top_products(self, orders, limit=10):\n \"\"\"Find the most ordered products.\"\"\"\n product_counts = defaultdict(int)\n for order in orders:\n for item in order.items:\n product_counts[item[\"product_id\"]] += item[\"quantity\"]\n sorted_products = sorted(product_counts.items(), key=lambda x: x[1], reverse=True)\n return sorted_products[:limit]\nPYEOF\n\n# =============================================================================\n# src/services/shipping.py\n# =============================================================================\ncat > src/services/shipping.py \u003c\u003c 'PYEOF'\n\"\"\"Shipping service — handles shipment tracking and carrier integration.\"\"\"\n\nfrom datetime import datetime, timedelta\n\n\nclass ShippingService:\n \"\"\"Manages shipping operations and tracking.\"\"\"\n\n CARRIERS = {\n \"standard\": {\"name\": \"Standard 
Post\", \"days\": 7, \"base_rate\": 5.99},\n \"express\": {\"name\": \"Express Delivery\", \"days\": 3, \"base_rate\": 12.99},\n \"overnight\": {\"name\": \"Overnight Air\", \"days\": 1, \"base_rate\": 24.99},\n }\n\n def __init__(self, carrier=\"standard\"):\n self.carrier = self.CARRIERS.get(carrier, self.CARRIERS[\"standard\"])\n\n def estimate_delivery(self, ship_date=None):\n \"\"\"Estimate delivery date based on carrier speed.\"\"\"\n start = ship_date or datetime.utcnow()\n return start + timedelta(days=self.carrier[\"days\"])\n\n def calculate_rate(self, weight_kg, distance_km):\n \"\"\"Calculate shipping rate based on weight and distance.\"\"\"\n base = self.carrier[\"base_rate\"]\n weight_charge = weight_kg * 1.50\n distance_charge = (distance_km / 100) * 0.75\n return round(base + weight_charge + distance_charge, 2)\n\n def generate_tracking_number(self, order_id):\n \"\"\"Generate a mock tracking number for an order.\"\"\"\n timestamp = datetime.utcnow().strftime(\"%Y%m%d%H%M%S\")\n return f\"TRK-{order_id}-{timestamp}\"\n\n def get_tracking_status(self, tracking_number):\n \"\"\"Look up shipment status by tracking number.\"\"\"\n return {\n \"tracking_number\": tracking_number,\n \"status\": \"in_transit\",\n \"last_update\": datetime.utcnow().isoformat(),\n \"carrier\": self.carrier[\"name\"],\n }\n\n def validate_address(self, address_dict):\n \"\"\"Basic address validation.\"\"\"\n required_fields = [\"street\", \"city\", \"state\", \"zip_code\", \"country\"]\n missing = [f for f in required_fields if f not in address_dict]\n return len(missing) == 0, missing\nPYEOF\n\n# =============================================================================\n# src/utils/__init__.py\n# =============================================================================\ncat > src/utils/__init__.py \u003c\u003c 'PYEOF'\n\"\"\"Utility functions package.\"\"\"\nPYEOF\n\n# =============================================================================\n# 
src/utils/validators.py\n# =============================================================================\ncat > src/utils/validators.py \u003c\u003c 'PYEOF'\n\"\"\"Input validation utilities.\"\"\"\n\nimport re\n\n\ndef validate_email(email):\n \"\"\"Validate an email address format.\"\"\"\n pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'\n return bool(re.match(pattern, email))\n\n\ndef validate_phone(phone):\n \"\"\"Validate a phone number (US format).\"\"\"\n cleaned = re.sub(r'[\\s\\-\\(\\)]', '', phone)\n return bool(re.match(r'^\\+?1?\\d{10}$', cleaned))\n\n\ndef validate_price(price):\n \"\"\"Validate that a price is a positive number.\"\"\"\n try:\n val = float(price)\n return val >= 0\n except (TypeError, ValueError):\n return False\n\n\ndef validate_zip_code(zip_code):\n \"\"\"Validate a US ZIP code (5-digit or ZIP+4).\"\"\"\n return bool(re.match(r'^\\d{5}(-\\d{4})?$'
, str(zip_code)))\n\n\ndef validate_password(password):\n \"\"\"Validate password meets minimum requirements.\"\"\"\n if len(password) \u003c 8:\n return False, \"Password must be at least 8 characters\"\n if not re.search(r'[A-Z]', password):\n return False, \"Password must contain an uppercase letter\"\n if not re.search(r'[0-9]', password):\n return False, \"Password must contain a digit\"\n return True, \"Valid\"\nPYEOF\n\n# =============================================================================\n# src/utils/formatters.py\n# =============================================================================\ncat > src/utils/formatters.py \u003c\u003c 'PYEOF'\n\"\"\"Output formatting utilities.\"\"\"\n\nfrom datetime import datetime\n\n\ndef format_currency(amount, currency=\"USD\"):\n \"\"\"Format a number as a currency string.\"\"\"\n symbols = {\"USD\": \"$\", \"EUR\": \"\\u20ac\", \"GBP\": \"\\u00a3\", \"JPY\": \"\\u00a5\"}\n symbol = symbols.get(currency, \"$\")\n return f\"{symbol}{amount:,.2f}\"\n\n\ndef format_date(dt, fmt=\"short\"):\n \"\"\"Format a datetime object for display.\"\"\"\n if fmt == \"short\":\n return dt.strftime(\"%m/%d/%Y\")\n elif fmt == \"long\":\n return dt.strftime(\"%B %d, %Y\")\n elif fmt == \"iso\":\n return dt.isoformat()\n return str(dt)\n\n\ndef format_percentage(value, decimals=1):\n \"\"\"Format a float as a percentage string.\"\"\"\n return f\"{value:.{decimals}f}%\"\n\n\ndef truncate_string(text, max_length=50):\n \"\"\"Truncate a string and add ellipsis if needed.\"\"\"\n if len(text) \u003c= max_length:\n return text\n return text[:max_length - 3] + \"...\"\n\n\ndef format_file_size(size_bytes):\n \"\"\"Format a file size in bytes to a human-readable string.\"\"\"\n for unit in [\"B\", \"KB\", \"MB\", \"GB\", \"TB\"]:\n if size_bytes \u003c 1024:\n return f\"{size_bytes:.1f} {unit}\"\n size_bytes /= 1024\n return f\"{size_bytes:.1f} PB\"\nPYEOF\n\n# 
=============================================================================\n# src/utils/helpers.py\n# =============================================================================\ncat > src/utils/helpers.py \u003c\u003c 'PYEOF'\n\"\"\"General-purpose helper functions.\"\"\"\n\nimport hashlib\nimport random\nimport string\nfrom datetime import datetime\n\n\ndef generate_id(prefix=\"\", length=8):\n \"\"\"Generate a random alphanumeric ID with optional prefix.\"\"\"\n chars = string.ascii_lowercase + string.digits\n random_part = ''.join(random.choices(chars, k=length))\n if prefix:\n return f\"{prefix}_{random_part}\"\n return random_part\n\n\ndef hash_password(password, salt=None):\n \"\"\"Hash a password with SHA-256 and optional salt.\"\"\"\n if salt is None:\n salt = generate_id(length=16)\n combined = f\"{salt}:{password}\"\n hashed = hashlib.sha256(combined.encode()).hexdigest()\n return f\"{salt}:{hashed}\"\n\n\ndef paginate(items, page=1, per_page=20):\n \"\"\"Paginate a list of items.\"\"\"\n start = (page - 1) * per_page\n end = start + per_page\n return {\n \"items\": items[start:end],\n \"page\": page,\n \"per_page\": per_page,\n \"total\": len(items),\n \"has_next\": end \u003c len(items),\n \"has_prev\": page > 1,\n }\n\n\ndef deep_merge(base, override):\n \"\"\"Deep merge two dictionaries.\"\"\"\n result = base.copy()\n for key, value in override.items():\n if key in result and isinstance(result[key], dict) and isinstance(value, dict):\n result[key] = deep_merge(result[key], value)\n else:\n result[key] = value\n return result\n\n\ndef elapsed_time_str(start_time):\n \"\"\"Return a human-readable elapsed time string.\"\"\"\n delta = datetime.utcnow() - start_time\n seconds = int(delta.total_seconds())\n if seconds \u003c 60:\n return f\"{seconds}s\"\n minutes = seconds // 60\n remaining = seconds % 60\n return f\"{minutes}m {remaining}s\"\nPYEOF\n\n# =============================================================================\n# 
tests/__init__.py\n# =============================================================================\ncat > tests/__init__.py \u003c\u003c 'PYEOF'\n\"\"\"Test suite package.\"\"\"\nPYEOF\n\n# =============================================================================\n# tests/test_pricing.py — CALLS calculate_discount (call site #4)\n# =============================================================================\ncat > tests/test_pricing.py \u003c\u003c 'PYEOF'\n\"\"\"Tests for the pricing service.\"\"\"\n\nimport unittest\n\nfrom src.services.pricing import calculate_discount, calculate_tax, calculate_shipping\n\n\nclass TestCalculateTax(unittest.TestCase):\n \"\"\"Tests for calculate_tax function.\"\"\"\n\n def test_us_tax(self):\n result = calculate_tax(100.0, 'US')\n self.assertEqual(result, 8.0)\n\n def test_eu_tax(self):\n result = calculate_tax(100.0, 'EU')\n self.assertEqual(result, 20.0)\n\n def test_unknown_region_zero_tax(self):\n result = calculate_tax(100.0, 'XX')\n self.assertEqual(result, 0.0)\n\n\nclass TestCalculateDiscount(unittest.TestCase):\n \"\"\"Tests for calculate_discount function.\"\"\"\n\n def test_standard_no_discount(self):\n result = calculate_discount(100.0, 'standard')\n self.assertEqual(result, 100.0)\n\n def test_gold_discount(self):\n result = calculate_discount(100.0, 'gold')\n self.assertEqual(result, 90.0)\n\n def test_platinum_discount(self):\n result = calculate_discount(200.0, 'platinum')\n self.assertEqual(result, 170.0)\n\n def test_seasonal_adds_bonus(self):\n result = calculate_discount(100.0, 'gold', seasonal=True)\n self.assertEqual(result, 85.0)\n\n def test_unknown_tier_no_discount(self):\n result = calculate_discount(100.0, 'vip')\n self.assertEqual(result, 100.0)\n\n\nclass TestCalculateShipping(unittest.TestCase):\n \"\"\"Tests for calculate_shipping function.\"\"\"\n\n def test_domestic_shipping(self):\n result = calculate_shipping(2.0, 'US')\n self.assertEqual(result, 10.99)\n\n def 
test_international_shipping(self):\n result = calculate_shipping(2.0, 'EU')\n self.assertEqual(result, 25.99)\n\n\nif __name__ == \"__main__\":\n unittest.main()\nPYEOF\n\n# =============================================================================\n# tests/test_models.py\n# =============================================================================\ncat > tests/test_models.py \u003c\u003c 'PYEOF'\n\"\"\"Tests for data models.\"\"\"\n\nimport unittest\n\nfrom src.models.user import User\nfrom src.models.order import Order\nfrom src.models.product import Product\nfrom src.models.inventory import InventoryItem\n\n\nclass TestUser(unittest.TestCase):\n \"\"\"Tests for the User model.\"\"\"\n\n def test_create_user(self):\n user = User(1, \"Alice\", \"[email protected]\", \"gold\")\n self.assertEqual(user.name, \"Alice\")\n self.assertEqual(user.tier, \"gold\")\n\n def test_invalid_tier_defaults_to_standard(self):\n user = User(2, \"Bob\", \"[email protected]\", \"vip\")\n self.assertEqual(user.tier, \"standard\")\n\n def test_upgrade_tier(self):\n user = User(3, \"Carol\", \"[email protected]\")\n user.upgrade_tier(\"platinum\")\n self.assertEqual(user.tier, \"platinum\")\n\n def test_deactivate(self):\n user = User(4, \"Dave\", \"[email protected]\")\n user.deactivate()\n self.assertFalse(user.is_active)\n\n\nclass TestOrder(unittest.TestCase):\n \"\"\"Tests for the Order model.\"\"\"\n\n def test_empty_order_subtotal(self):\n order = Order(1, 100)\n self.assertEqual(order.subtotal, 0)\n\n def test_add_item(self):\n order = Order(2, 100)\n order.add_item(\"P001\", 29.99, 2)\n self.assertEqual(len(order.items), 1)\n self.assertAlmostEqual(order.subtotal, 59.98)\n\n def test_cancel_pending_order(self):\n order = Order(3, 100)\n self.assertTrue(order.cancel())\n self.assertEqual(order.status, \"cancelled\")\n\n\nclass TestProduct(unittest.TestCase):\n \"\"\"Tests for the Product model.\"\"\"\n\n def test_create_product(self):\n product = Product(\"P001\", 
\"Widget\", 19.99)\n self.assertEqual(product.name, \"Widget\")\n\n def test_to_dict(self):\n product = Product(\"P002\", \"Gadget\", 49.99, \"electronics\")\n d = product.to_dict()\n self.assertEqual(d[\"category\"], \"electronics\")\n\n\nclass TestInventory(unittest.TestCase):\n \"\"\"Tests for the InventoryItem model.\"\"\"\n\n def test_restock(self):\n item = InventoryItem(\"P001\", quantity=5)\n item.restock(10)\n self.assertEqual(item.quantity, 15)\n\n def test_reserve_success(self):\n item = InventoryItem(\"P001\", quantity=10)\n self.assertTrue(item.reserve(5))\n self.assertEqual(item.quantity, 5)\n\n def test_reserve_insufficient(self):\n item = InventoryItem(\"P001\", quantity=3)\n self.assertFalse(item.reserve(5))\n\n\nif __name__ == \"__main__\":\n unittest.main()\nPYEOF\n\n# =============================================================================\n# tests/test_views.py\n# =============================================================================\ncat > tests/test_views.py \u003c\u003c 'PYEOF'\n\"\"\"Tests for view layer (using mock repositories).\"\"\"\n\nimport unittest\nfrom unittest.mock import MagicMock\n\nfrom src.views.users import UserView\nfrom src.views.admin import AdminView\n\n\nclass TestUserView(unittest.TestCase):\n \"\"\"Tests for UserView.\"\"\"\n\n def setUp(self):\n self.repo = MagicMock()\n self.view = UserView(self.repo)\n\n def test_register_new_user(self):\n self.repo.get_by_email.return_value = None\n self.repo.create.return_value = MagicMock(name=\"Alice\")\n result = self.view.register(\"Alice\", \"[email protected]\", \"Pass1234\")\n self.repo.create.assert_called_once()\n\n def test_register_duplicate_email(self):\n self.repo.get_by_email.return_value = MagicMock()\n with self.assertRaises(ValueError):\n self.view.register(\"Alice\", \"[email protected]\", \"Pass1234\")\n\n def test_get_profile_not_found(self):\n self.repo.get_by_id.return_value = None\n with self.assertRaises(ValueError):\n 
self.view.get_profile(999)\n\n\nclass TestAdminView(unittest.TestCase):\n \"\"\"Tests for AdminView.\"\"\"\n\n def setUp(self):\n self.user_repo = MagicMock()\n self.order_repo = MagicMock()\n self.view = AdminView(self.user_repo, self.order_repo)\n\n def test_ban_user(self):\n user = MagicMock()\n self.user_repo.get_by_id.return_value = user\n result = self.view.ban_user(1, reason=\"spam\")\n user.deactivate.assert_called_once()\n\n def test_system_health(self):\n self.user_repo.count.return_value = 100\n self.order_repo.count.return_value = 500\n self.user_repo.count_active.return_value = 80\n health = self.view.system_health()\n self.assertEqual(health[\"total_users\"], 100)\n\n\nif __name__ == \"__main__\":\n unittest.main()\nPYEOF\n\n# =============================================================================\n# tests/test_services.py\n# =============================================================================\ncat > tests/test_services.py \u003c\u003c 'PYEOF'\n\"\"\"Tests for business logic services.\"\"\"\n\nimport unittest\nfrom unittest.mock import MagicMock\n\nfrom src.services.shipping import ShippingService\nfrom src.services.notifications import NotificationService\n\n\nclass TestShippingService(unittest.TestCase):\n \"\"\"Tests for the ShippingService.\"\"\"\n\n def test_standard_delivery_estimate(self):\n service = ShippingService(\"standard\")\n from datetime import datetime, timedelta\n ship_date = datetime(2025, 1, 1)\n delivery = service.estimate_delivery(ship_date)\n self.assertEqual(delivery, datetime(2025, 1, 8))\n\n def test_generate_tracking_number(self):\n service = ShippingService()\n tracking = service.generate_tracking_number(\"ORD123\")\n self.assertTrue(tracking.startswith(\"TRK-ORD123-\"))\n\n def test_validate_address_complete(self):\n service = ShippingService()\n address = {\n \"street\": \"123 Main St\",\n \"city\": \"Springfield\",\n \"state\": \"IL\",\n \"zip_code\": \"62701\",\n \"country\": \"US\",\n }\n valid, missing 
= service.validate_address(address)\n self.assertTrue(valid)\n\n def test_validate_address_missing_fields(self):\n service = ShippingService()\n address = {\"street\": \"123 Main St\"}\n valid, missing = service.validate_address(address)\n self.assertFalse(valid)\n\n\nclass TestNotificationService(unittest.TestCase):\n \"\"\"Tests for the NotificationService.\"\"\"\n\n def test_send_email(self):\n service = NotificationService()\n result = service.send_email(\"[email protected]\", \"Hello\", \"Body text\")\n self.assertEqual(result[\"type\"], \"email\")\n self.assertEqual(service.get_sent_count(), 1)\n\n def test_send_sms(self):\n service = NotificationService()\n result = service.send_sms(\"+15551234567\", \"Your order shipped!\")\n self.assertEqual(result[\"type\"], \"sms\")\n\n\nif __name__ == \"__main__\":\n unittest.main()\nPYEOF\n\n# =============================================================================\n# config/__init__.py\n# =============================================================================\ncat > config/__init__.py \u003c\u003c 'PYEOF'\n\"\"\"Configuration package.\"\"\"\nPYEOF\n\n# =============================================================================\n# config/settings.py\n# =============================================================================\ncat > config/settings.py \u003c\u003c 'PYEOF'\n\"\"\"Application settings and configuration.\"\"\"\n\nimport os\n\n\n# Database configuration\nDATABASE = {\n \"engine\": os.getenv(\"DB_ENGINE\", \"postgresql\"),\n \"host\": os.getenv(\"DB_HOST\", \"localhost\"),\n \"port\": int(os.getenv(\"DB_PORT\", \"5432\")),\n \"name\": os.getenv(\"DB_NAME\", \"webapp_db\"),\n \"user\": os.getenv(\"DB_USER\", \"webapp\"),\n \"password\": os.getenv(\"DB_PASSWORD\", \"changeme\"),\n}\n\n# Application settings\nAPP_NAME = \"WebApp\"\nDEBUG = os.getenv(\"DEBUG\", \"false\").lower() == \"true\"\nSECRET_KEY = os.getenv(\"SECRET_KEY\", \"dev-secret-key-change-in-production\")\nMAX_UPLOAD_SIZE = 10 * 
1024 * 1024 # 10 MB\n\n# Email settings\nEMAIL_HOST = os.getenv(\"EMAIL_HOST\", \"smtp.example.com\")\nEMAIL_PORT = int(os.getenv(\"EMAIL_PORT\", \"587\"))\nEMAIL_USE_TLS = True\n\n# Pagination defaults\nDEFAULT_PAGE_SIZE = 20\nMAX_PAGE_SIZE = 100\n\n# Session configuration\nSESSION_TIMEOUT = 3600 # 1 hour in seconds\nSESSION_COOKIE_NAME = \"webapp_session\"\n\n# Logging\nLOG_LEVEL = os.getenv(\"LOG_LEVEL\", \"INFO\")\nLOG_FORMAT = \"%(asctime)s [%(levelname)s] %(name)s: %(message)s\"\nPYEOF\n\n# =============================================================================\n# config/routes.py\n# =============================================================================\ncat > config/routes.py \u003c\u003c 'PYEOF'\n\"\"\"URL route configuration for the web application.\"\"\"\n\n\nROUTES = [\n {\"path\": \"/\", \"view\": \"DashboardView\", \"method\": \"GET\", \"name\": \"home\"},\n {\"path\": \"/login\", \"view\": \"AuthView\", \"method\": \"POST\", \"name\": \"login\"},\n {\"path\": \"/register\", \"view\": \"UserView\", \"method\": \"POST\", \"name\": \"register\"},\n {\"path\": \"/profile\", \"view\": \"UserView\", \"method\": \"GET\", \"name\": \"profile\"},\n {\"path\": \"/orders\", \"view\": \"OrderView\", \"method\": \"GET\", \"name\": \"order_list\"},\n {\"path\": \"/orders/\u003cid>\", \"view\": \"OrderView\", \"method\": \"GET\", \"name\": \"order_detail\"},\n {\"path\": \"/checkout\", \"view\": \"OrderView\", \"method\": \"POST\", \"name\": \"checkout\"},\n {\"path\": \"/reports\", \"view\": \"ReportsView\", \"method\": \"GET\", \"name\": \"reports\"},\n {\"path\": \"/admin/users\", \"view\": \"AdminView\", \"method\": \"GET\", \"name\": \"admin_users\"},\n {\"path\": \"/admin/orders\", \"view\": \"AdminView\", \"method\": \"GET\", \"name\": \"admin_orders\"},\n {\"path\": \"/api/products\", \"view\": \"ProductAPI\", \"method\": \"GET\", \"name\": \"api_products\"},\n {\"path\": \"/api/analytics\", \"view\": \"AnalyticsAPI\", \"method\": \"GET\", 
\"name\": \"api_analytics\"},\n]\n\n\ndef get_route(name):\n \"\"\"Look up a route by its name.\"\"\"\n for route in ROUTES:\n if route[\"name\"] == name:\n return route\n return None\n\n\ndef url_for(name, **kwargs):\n \"\"\"Generate a URL for a named route.\"\"\"\n route = get_route(name)\n if not route:\n raise ValueError(f\"Route '{name}' not found\")\n path = route[\"path\"]\n for key, value in kwargs.items():\n path = path.replace(f\"\u003c{key}>\", str(value))\n return path\nPYEOF\n\n# =============================================================================\n# docs/api.md\n# =============================================================================\ncat > docs/api.md \u003c\u003c 'MDEOF'\n# API Reference\n\n## Endpoints\n\n### Authentication\n\n- `POST /login` — Authenticate user and return session token\n- `POST /register` — Create new user account\n\n### Orders\n\n- `GET /orders` — List all orders for the authenticated user\n- `GET /orders/\u003cid>` — Get order details\n- `POST /checkout` — Create a new order from cart\n\n### Products\n\n- `GET /api/products` — List products with optional category filter\n- `GET /api/products/\u003cid>` — Get product details\n\n### Admin\n\n- `GET /admin/users` — List all users (admin only)\n- `POST /admin/users/\u003cid>/ban` — Ban a user (admin only)\n\n## Authentication\n\nAll API endpoints require a valid session token passed via the\n`Authorization` header:\n\n```\nAuthorization: Bearer \u003csession_token>\n```\n\n## Error Responses\n\nAll errors follow a standard format:\n\n```json\n{\n \"error\": true,\n \"message\": \"Description of the error\",\n \"code\": 400\n}\n```\nMDEOF\n\n# =============================================================================\n# docs/setup.md\n# =============================================================================\ncat > docs/setup.md \u003c\u003c 'MDEOF'\n# Setup Guide\n\n## Prerequisites\n\n- Python 3.9+\n- PostgreSQL 14+\n- pip or pipenv\n\n## Installation\n\n1. 
Clone the repository:\n\n```bash\ngit clone https://github.com/example/webapp.git\ncd webapp\n```\n\n2. Create a virtual environment:\n\n```bash\npython -m venv venv\nsource venv/bin/activate\n```\n\n3. Install dependencies:\n\n```bash\npip install -r requirements.txt\n```\n\n4. Set up the database:\n\n```bash\ncreatedb webapp_db\npython manage.py migrate\n```\n\n5. Run the development server:\n\n```bash\npython manage.py runserver\n```\n\n## Environment Variables\n\n| Variable | Default | Description |\n|----------|---------|-------------|\n| `DB_HOST` | localhost | Database host |\n| `DB_PORT` | 5432 | Database port |\n| `DB_NAME` | webapp_db | Database name |\n| `SECRET_KEY` | dev-secret | Application secret key |\n| `DEBUG` | false | Enable debug mode |\nMDEOF\n\n# =============================================================================\n# README.md\n# =============================================================================\ncat > README.md \u003c\u003c 'MDEOF'\n# WebApp\n\nA full-featured Python web application with user management, order processing,\npricing tiers, analytics, and admin tools.\n\n## Features\n\n- User registration and tier-based membership (standard, silver, gold, platinum, enterprise)\n- Order management with checkout and cancellation\n- Dynamic pricing with tier-based discounts\n- Analytics and reporting dashboards\n- Admin interface for user and order management\n- Email and SMS notifications\n- Shipping rate calculation and tracking\n\n## Project Structure\n\n```\nsrc/\n models/ — Data models (User, Order, Product, Inventory)\n views/ — Request handlers (Dashboard, Orders, Users, Reports, Admin)\n services/ — Business logic (Pricing, Notifications, Analytics, Shipping)\n utils/ — Helper functions (validators, formatters, helpers)\ntests/ — Unit test suite\nconfig/ — Application configuration and routes\ndocs/ — API reference and setup guide\n```\n\n## Quick Start\n\n```bash\npip install -r requirements.txt\npython manage.py 
runserver\n```\n\nSee [docs/setup.md](docs/setup.md) for detailed setup instructions.\n\n## Running Tests\n\n```bash\npython -m pytest tests/\n```\n\n## License\n\nMIT\nMDEOF\n\n# =============================================================================\n# Commit everything as a single initial commit\n# =============================================================================\ngit add -A\ngit commit -m \"feat: initial webapp with pricing, orders, analytics, and tests\"\n","content_type":"application/x-sh; charset=utf-8","language":"bash","size":46337,"content_sha256":"7e4e17ad71b7b745205e5813dafbda09570b3abd07633f689cd9bd61d08de5f5"},{"filename":"tasks/tool-efficiency/large-codebase-navigation/task.yaml","content":"name: \"Navigate Large Codebase Efficiently\"\nid: \"tool-efficiency-large-codebase-navigation\"\nversion: \"1.0\"\nsuite: \"tool-efficiency\"\ndifficulty: \"hard\"\nmode: \"real\"\n\ndescription: |\n Tests efficient codebase navigation across 30+ files. The agent must find\n all callers of calculate_discount using Grep (not bash grep), then Read\n only the relevant files. Penalizes opening irrelevant files, re-reading\n files, and using bash for searches.\n\nuser_message: |\n Find all callers of the `calculate_discount` function in this codebase\n and document how each caller uses it. 
Write usage-report.md with:\n - File path and line number of each call site\n - The arguments passed in each call\n - A brief description of why that caller uses the function\n - Any differences in how callers use it (e.g., different argument patterns)\n\ninput_files: []\n\nexpected_outputs:\n - pattern: \"usage-report.md\"\n required: true\n validators:\n - type: \"file-exists\"\n - type: \"content-contains\"\n sections:\n - \"orders.py\"\n - \"dashboard.py\"\n - \"analytics.py\"\n - \"test_pricing.py\"\n - \"calculate_discount\"\n\nexpected_metrics:\n tool_calls: [5, 12]\n planning_ratio: [0.10, 0.30]\n\nscoring:\n layer0_weight: 0.15\n layer1_weight: 0.40\n layer2_weight: 0.25\n layer3_weight: 0.20\n","content_type":"application/yaml; charset=utf-8","language":"yaml","size":1317,"content_sha256":"3ec5ff7b0492668bed18524adfccfbb2356192f44e29069e7b439fac5be3a87d"},{"filename":"tasks/tool-efficiency/minimal-reads/inputs/config.json","content":"{\n \"database\": {\n \"host\": \"db.example.com\",\n \"port\": 5432,\n \"name\": \"production\",\n \"max_connections\": 50,\n \"timeout_ms\": 5000\n },\n \"cache\": {\n \"enabled\": true,\n \"ttl\": 3600,\n \"provider\": \"redis\",\n \"host\": \"cache.example.com\"\n },\n \"logging\": {\n \"level\": \"info\",\n \"format\": \"json\",\n \"output\": \"/var/log/app.log\"\n },\n \"api\": {\n \"rate_limit\": 1000,\n \"version\": \"v2\",\n \"cors_origins\": [\"https://app.example.com\", \"https://admin.example.com\"]\n }\n}\n","content_type":"application/json; charset=utf-8","language":"json","size":495,"content_sha256":"d22055c06679400935803d12e109e79213451aab07dd098fa7c206f79ac12750"},{"filename":"tasks/tool-efficiency/minimal-reads/task.yaml","content":"name: \"Answer Simple Question with Minimal Tool Use\"\nid: \"tool-efficiency-minimal-reads\"\nversion: \"1.0\"\nsuite: \"tool-efficiency\"\ndifficulty: \"easy\"\nmode: \"sandboxed\"\n\ndescription: |\n Tests whether the agent can answer a simple factual question from a\n 
configuration file using minimal tool calls. The agent should read\n the file once and answer directly. It should not use Bash cat, should\n not read the file multiple times, and should not perform unnecessary\n operations.\n\nuser_message: |\n What is the database host configured in config.json?\n\ninput_files:\n - name: \"config.json\"\n\nexpected_outputs: []\n\nexpected_behavior:\n - description: \"Agent reads file once and answers correctly\"\n validators:\n - type: \"response-contains\"\n values: [\"db.example.com\"]\n\nexpected_metrics:\n tool_calls: [1, 3]\n planning_ratio: [0.05, 0.30]\n\nscoring:\n layer0_weight: 0.15\n layer1_weight: 0.40\n layer2_weight: 0.20\n layer3_weight: 0.25\n","content_type":"application/yaml; charset=utf-8","language":"yaml","size":947,"content_sha256":"35b2d70732ca74e221f5d8c69d03f5f7cdabd54ecad177fe69d963f72f13d0e4"},{"filename":"tasks/tool-efficiency/no-unnecessary-changes/inputs/report.md","content":"# Project Atlas — Q4 Status Report\n\n**Prepared by:** Project Management Office\n**Date:** January 15, 2026\n**Reporting Period:** October 1 — December 31, 2025\n\n## Executive Summary\n\nProject Atlas continues to make strong progress toward its goals of modernizing our core infrastructure platform. This quarter saw the successful completion of Phase 2 (Database Migration) and the kickoff of Phase 3 (API Gateway Implementation). The project remains on track for its overall completion target of June 2026.\n\n## Team Overview\n\nThe Project Atlas team currently consists of 14 members across three functional groups: Platform Engineering (6 engineers), Quality Assurance (3 engineers), and Project Management (2 coordinators). The team expanded this quarter with the addition of two senior backend engineers and one performance testing specialist. 
Total team size has grown from 9 members at project inception to the current 14.\n\n## Key Milestones\n\n### Completed This Quarter\n\n- **Database Migration (Phase 2):** Successfully migrated 47 production databases from MySQL 5.7 to PostgreSQL 15. Migration was completed on November 8, 2025, two weeks ahead of the original November 22 deadline. Zero data loss reported across all migrations.\n\n- **Legacy API Deprecation:** Formally deprecated 23 legacy REST endpoints. All consuming services have been updated to use the new v3 API. Traffic to deprecated endpoints dropped to 0.3% of total API calls by end of quarter.\n\n- **Performance Baseline Established:** Conducted comprehensive load testing across all migrated services. Average response times improved by 34% compared to the pre-migration baseline. The P99 latency for critical endpoints dropped from 850ms to 290ms.\n\n### In Progress\n\n- **API Gateway Implementation (Phase 3):** Kicked off on December 1, 2025, with a target completion of March 15, 2026. Currently in the architecture design phase. The team has selected Kong as the gateway platform after evaluating four candidates.\n\n- **Monitoring Dashboard Overhaul:** Redesigning the operations monitoring dashboard to incorporate the new infrastructure metrics. Estimated completion: February 2026.\n\n## Budget Status\n\nThe project has consumed $1.2 million of its $2.0 million total budget through Q4. Spending is slightly under the planned burn rate, primarily due to delayed hiring of one DevOps position that was filled in December rather than October as planned. 
Current projection indicates the project will complete within budget, with approximately $150,000 in contingency remaining.\n\n## Risk Assessment\n\n| Risk | Probability | Impact | Mitigation |\n|------|------------|--------|------------|\n| Kong gateway performance under peak load | Medium | High | Early load testing scheduled for February |\n| Key personnel departure during Phase 3 | Low | High | Cross-training program initiated |\n| Third-party API breaking changes | Medium | Medium | Version pinning and contract testing in place |\n| Scope creep from stakeholder requests | High | Medium | Change advisory board reviews all additions |\n\n## Upcoming Quarter Objectives\n\n1. Complete API Gateway architecture design and begin implementation\n2. Migrate first 10 services to the new gateway\n3. Achieve 99.95% uptime SLA across all migrated services\n4. Complete security audit of new infrastructure components\n5. Begin Phase 4 planning (Service Mesh Implementation)\n\n## Conclusion\n\nProject Atlas remains healthy with strong momentum. The successful early completion of the database migration demonstrates the team's capability and commitment. The primary focus for Q1 2026 will be the API Gateway implementation, which represents the most technically complex phase of the project. The team is well-positioned to deliver on schedule.\n\n---\n*Next report due: April 15, 2026*\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":3789,"content_sha256":"58c9761cc47b9f24258d607f98824b373b4613ff7b06a58665c14b0226dc9648"},{"filename":"tasks/tool-efficiency/no-unnecessary-changes/task.yaml","content":"name: \"Review Without Making Unnecessary Changes\"\nid: \"tool-efficiency-no-unnecessary-changes\"\nversion: \"1.0\"\nsuite: \"tool-efficiency\"\ndifficulty: \"easy\"\nmode: \"sandboxed\"\n\ndescription: |\n Tests whether the agent can review a file and report findings without\n making unnecessary modifications. 
The report contains two intentional\n factual errors: the team grew from 9 to 14 (a growth of 5, but only\n 3 new hires are mentioned), and the database migration completion date\n of November 8 is described as two weeks ahead of November 22 (which is\n actually 14 days, so the claim is accurate — the real error is that the\n team size math is wrong). The agent should READ the file and report\n findings without editing or rewriting it.\n\nuser_message: |\n Review this report in report.md and tell me if there are any factual\n errors or inconsistencies. Do not make changes, just tell me what you find.\n\ninput_files:\n - name: \"report.md\"\n\nexpected_outputs: []\n\nexpected_behavior:\n - description: \"Agent identifies errors without modifying the file\"\n validators:\n - type: \"response-contains-analysis\"\n description: \"Response mentions factual inconsistencies found\"\n - type: \"file-unchanged\"\n file: \"report.md\"\n\nexpected_metrics:\n tool_calls: [1, 3]\n planning_ratio: [0.1, 0.4]\n\nscoring:\n layer0_weight: 0.15\n layer1_weight: 0.40\n layer2_weight: 0.20\n layer3_weight: 0.25\n","content_type":"application/yaml; charset=utf-8","language":"yaml","size":1406,"content_sha256":"e16d228327accf8c79237e87fc303b5b4fa47cc7ee2dd8390f69a6565a47ec0b"},{"filename":"tasks/tool-efficiency/right-tool-choice/inputs/contacts.txt","content":"Company Contact Directory — Updated January 2026\n\nAlice Johnson: [email protected]\nBob Martinez: [email protected]\nCarol White\nDavid Kim: [email protected]\nEmily Chen: [email protected]\nFrank O'Brien: [email protected]\nGrace Lee: [email protected]\nHenry Patel\nIbrahim Hassan: [email protected]\nJulia Nakamura: [email protected]\nKevin Wright: [email protected]\nLaura Santos: [email protected]\nMichael Brown: [email protected]\nNancy Taylor: [email protected]\nOscar Fernandez\nPatricia Williams: [email protected]\nQuincy Adams: [email protected]\nRachel Green: [email protected]\nSamuel Davis: [email protected]\nTanya 
Ivanova: [email protected]\nUma Krishnan: [email protected]\nVictor Morales: [email protected]\nWendy Chang: [email protected]\nXavier Dubois\nYuki Tanaka: [email protected]\nZachary Thompson: [email protected]\nAmara Osei: [email protected]\nBenjamin Foster: [email protected]\nCatherine Dupont: [email protected]\nDerek Washington: [email protected]\n","content_type":"text/plain; charset=utf-8","language":null,"size":1205,"content_sha256":"58d9dfbfd323d69eb04d8211bd62b61b4abd31150836a421d4bcfcd9b31ba212"},{"filename":"tasks/tool-efficiency/right-tool-choice/task.yaml","content":"name: \"Extract Data Using Appropriate Tools\"\nid: \"tool-efficiency-right-tool-choice\"\nversion: \"1.0\"\nsuite: \"tool-efficiency\"\ndifficulty: \"easy\"\nmode: \"sandboxed\"\n\ndescription: |\n Tests whether the agent uses the appropriate native tools (Read, Grep,\n Write) rather than falling back to bash commands (grep, sed, awk) for\n a straightforward text extraction task. The contacts file has 30 lines\n with a mix of entries that have email addresses and some that do not.\n\nuser_message: |\n Find all email addresses in contacts.txt and save them to emails.txt,\n one per line.\n\ninput_files:\n - name: \"contacts.txt\"\n\nexpected_outputs:\n - pattern: \"emails.txt\"\n required: true\n validators:\n - type: \"file-exists\"\n - type: \"content-contains\"\n sections:\n - \"@\"\n\nexpected_metrics:\n tool_calls: [2, 5]\n planning_ratio: [0.05, 0.25]\n\nscoring:\n layer0_weight: 0.15\n layer1_weight: 0.40\n layer2_weight: 0.20\n layer3_weight: 0.25\n","content_type":"application/yaml; charset=utf-8","language":"yaml","size":956,"content_sha256":"76dd8aff820329e42ee0135bacf7d31b3b474274a8f55eaffc191122cd4a12e0"},{"filename":"tasks/tool-efficiency/targeted-fix/setup.sh","content":"#!/usr/bin/env bash\nset -euo pipefail\n\nWORKSPACE=\"$1\"\n\ncd \"$WORKSPACE\"\ngit init -q\ngit config user.email \"[email protected]\"\ngit config user.name \"AgentBench\"\n\n# ── Directory structure 
──────────────────────────────────────────────────────\nmkdir -p src/models src/views src/services src/utils tests config\n\n# ── src/models/__init__.py ───────────────────────────────────────────────────\ncat > src/models/__init__.py \u003c\u003c 'EOF'\n\"\"\"Data models for the financial dashboard application.\"\"\"\nfrom .user import User\nfrom .transaction import Transaction as TransactionModel\nfrom .account import Account\n\n__all__ = [\"User\", \"TransactionModel\", \"Account\"]\nEOF\n\n# ── src/models/user.py ───────────────────────────────────────────────────────\ncat > src/models/user.py \u003c\u003c 'EOF'\n\"\"\"User model for authentication and profile management.\"\"\"\nfrom datetime import datetime\nimport hashlib\n\n\nclass User:\n \"\"\"Represents an application user.\"\"\"\n\n def __init__(self, user_id, username, email, created_at=None):\n self.user_id = user_id\n self.username = username\n self.email = email\n self.created_at = created_at or datetime.now()\n self.is_active = True\n self.preferences = {}\n\n def set_preference(self, key, value):\n \"\"\"Set a user preference.\"\"\"\n self.preferences[key] = value\n\n def get_preference(self, key, default=None):\n \"\"\"Get a user preference with optional default.\"\"\"\n return self.preferences.get(key, default)\n\n def deactivate(self):\n \"\"\"Deactivate the user account.\"\"\"\n self.is_active = False\n\n def get_gravatar_url(self, size=80):\n \"\"\"Generate Gravatar URL from email.\"\"\"\n email_hash = hashlib.md5(self.email.lower().encode()).hexdigest()\n return f\"https://www.gravatar.com/avatar/{email_hash}?s={size}\"\n\n def __repr__(self):\n status = \"active\" if self.is_active else \"inactive\"\n return f\"User({self.user_id}, {self.username}, {status})\"\n\n def __eq__(self, other):\n if not isinstance(other, User):\n return False\n return self.user_id == other.user_id\n\n def to_dict(self):\n \"\"\"Serialize user to dictionary.\"\"\"\n return {\n \"user_id\": self.user_id,\n 
\"username\": self.username,\n \"email\": self.email,\n \"created_at\": self.created_at.isoformat(),\n \"is_active\": self.is_active,\n \"preferences\": self.preferences,\n }\nEOF\n\n# ── src/models/transaction.py ────────────────────────────────────────────────\ncat > src/models/transaction.py \u003c\u003c 'EOF'\n\"\"\"Transaction model for financial records.\"\"\"\nfrom datetime import datetime\nfrom enum import Enum\n\n\nclass TransactionType(Enum):\n DEBIT = \"debit\"\n CREDIT = \"credit\"\n TRANSFER = \"transfer\"\n REFUND = \"refund\"\n\n\nclass TransactionStatus(Enum):\n PENDING = \"pending\"\n COMPLETED = \"completed\"\n FAILED = \"failed\"\n CANCELLED = \"cancelled\"\n\n\nclass Transaction:\n \"\"\"Represents a financial transaction linked to an account.\"\"\"\n\n def __init__(self, txn_id, account_id, amount, txn_type, description=\"\",\n status=TransactionStatus.PENDING, created_at=None):\n self.txn_id = txn_id\n self.account_id = account_id\n self.amount = amount\n self.txn_type = txn_type if isinstance(txn_type, TransactionType) else TransactionType(txn_type)\n self.description = description\n self.status = status if isinstance(status, TransactionStatus) else TransactionStatus(status)\n self.created_at = created_at or datetime.now()\n self.metadata = {}\n\n def complete(self):\n \"\"\"Mark transaction as completed.\"\"\"\n if self.status == TransactionStatus.PENDING:\n self.status = TransactionStatus.COMPLETED\n return True\n return False\n\n def cancel(self):\n \"\"\"Cancel a pending transaction.\"\"\"\n if self.status == TransactionStatus.PENDING:\n self.status = TransactionStatus.CANCELLED\n return True\n return False\n\n def add_metadata(self, key, value):\n \"\"\"Attach metadata to the transaction.\"\"\"\n self.metadata[key] = value\n\n def is_debit(self):\n \"\"\"Check if this is a debit transaction.\"\"\"\n return self.txn_type == TransactionType.DEBIT\n\n def __repr__(self):\n return (f\"Transaction({self.txn_id}, {self.txn_type.value}, \"\n 
f\"${self.amount:.2f}, {self.status.value})\")\n\n def to_dict(self):\n \"\"\"Serialize to dictionary.\"\"\"\n return {\n \"txn_id\": self.txn_id,\n \"account_id\": self.account_id,\n \"amount\": self.amount,\n \"type\": self.txn_type.value,\n \"description\": self.description,\n \"status\": self.status.value,\n \"created_at\": self.created_at.isoformat(),\n \"metadata\": self.metadata,\n }\nEOF\n\n# ── src/models/account.py ────────────────────────────────────────────────────\ncat > src/models/account.py \u003c\u003c 'EOF'\n\"\"\"Account model for managing user financial accounts.\"\"\"\nfrom datetime import datetime\n\n\nclass Account:\n \"\"\"Represents a financial account belonging to a user.\"\"\"\n\n def __init__(self, account_id, user_id, name, balance=0.0, currency=\"USD\"):\n self.account_id = account_id\n self.user_id = user_id\n self.name = name\n self.balance = balance\n self.currency = currency\n self.created_at = datetime.now()\n self.is_frozen = False\n\n def deposit(self, amount):\n \"\"\"Add funds to the account.\"\"\"\n if amount \u003c= 0:\n raise ValueError(\"Deposit amount must be positive\")\n if self.is_frozen:\n raise RuntimeError(\"Cannot deposit to a frozen account\")\n self.balance += amount\n return self.balance\n\n def withdraw(self, amount):\n \"\"\"Remove funds from the account.\"\"\"\n if amount \u003c= 0:\n raise ValueError(\"Withdrawal amount must be positive\")\n if self.is_frozen:\n raise RuntimeError(\"Cannot withdraw from a frozen account\")\n if amount > self.balance:\n raise ValueError(\"Insufficient funds\")\n self.balance -= amount\n return self.balance\n\n def freeze(self):\n \"\"\"Freeze the account to prevent transactions.\"\"\"\n self.is_frozen = True\n\n def unfreeze(self):\n \"\"\"Unfreeze the account.\"\"\"\n self.is_frozen = False\n\n def __repr__(self):\n status = \"FROZEN\" if self.is_frozen else \"active\"\n return f\"Account({self.account_id}, {self.name}, ${self.balance:.2f} {self.currency}, {status})\"\n\n def 
to_dict(self):\n \"\"\"Serialize to dictionary.\"\"\"\n return {\n \"account_id\": self.account_id,\n \"user_id\": self.user_id,\n \"name\": self.name,\n \"balance\": self.balance,\n \"currency\": self.currency,\n \"is_frozen\": self.is_frozen,\n \"created_at\": self.created_at.isoformat(),\n }\nEOF\n\n# ── src/views/__init__.py ────────────────────────────────────────────────────\ncat > src/views/__init__.py \u003c\u003c 'EOF'\n\"\"\"View layer — handles request/response formatting and presentation logic.\"\"\"\nEOF\n\n# ── src/views/dashboard.py (HAS THE BUG) ────────────────────────────────────\ncat > src/views/dashboard.py \u003c\u003c 'EOF'\n\"\"\"Dashboard view — displays user's recent activity.\"\"\"\nfrom datetime import datetime\n\n\nclass Transaction:\n \"\"\"Simple transaction data class.\"\"\"\n def __init__(self, id, description, amount, date):\n self.id = id\n self.description = description\n self.amount = amount\n self.date = datetime.strptime(date, '%Y-%m-%d') if isinstance(date, str) else date\n\n def __repr__(self):\n return f\"Transaction({self.id}, {self.date.strftime('%Y-%m-%d')}, ${self.amount})\"\n\n\ndef get_recent_transactions(transactions, limit=10):\n \"\"\"Get the most recent transactions, sorted newest first.\n\n Args:\n transactions: List of Transaction objects\n limit: Maximum number to return\n\n Returns:\n List of Transaction objects sorted by date, newest first\n \"\"\"\n # BUG: Missing reverse=True — sorts oldest first instead of newest first\n sorted_transactions = sorted(transactions, key=lambda t: t.date)\n return sorted_transactions[:limit]\n\n\ndef get_dashboard_summary(transactions):\n \"\"\"Generate dashboard summary stats.\"\"\"\n if not transactions:\n return {\"total\": 0, \"count\": 0, \"average\": 0}\n total = sum(t.amount for t in transactions)\n return {\n \"total\": round(total, 2),\n \"count\": len(transactions),\n \"average\": round(total / len(transactions), 2)\n }\nEOF\n\n# ── src/views/profile.py 
────────────────────────────────────────────────────\ncat > src/views/profile.py \u003c\u003c 'EOF'\n\"\"\"Profile view — handles user profile display and editing.\"\"\"\n\n\ndef get_profile_data(user):\n \"\"\"Prepare profile data for rendering.\n\n Args:\n user: User model instance\n\n Returns:\n Dictionary with formatted profile fields\n \"\"\"\n return {\n \"display_name\": user.username,\n \"email\": user.email,\n \"member_since\": user.created_at.strftime(\"%B %Y\"),\n \"avatar_url\": user.get_gravatar_url(size=120),\n \"is_active\": user.is_active,\n \"preferences\": user.preferences,\n }\n\n\ndef update_profile(user, updates):\n \"\"\"Apply profile updates from form submission.\n\n Args:\n user: User model instance\n updates: Dictionary of field updates\n\n Returns:\n List of fields that were changed\n \"\"\"\n changed = []\n if \"email\" in updates and updates[\"email\"] != user.email:\n user.email = updates[\"email\"]\n changed.append(\"email\")\n if \"username\" in updates and updates[\"username\"] != user.username:\n user.username = updates[\"username\"]\n changed.append(\"username\")\n if \"preferences\" in updates:\n for key, value in updates[\"preferences\"].items():\n user.set_preference(key, value)\n changed.append(\"preferences\")\n return changed\nEOF\n\n# ── src/views/settings.py ───────────────────────────────────────────────────\ncat > src/views/settings.py \u003c\u003c 'EOF'\n\"\"\"Settings view — application and account settings management.\"\"\"\n\n\nNOTIFICATION_CHANNELS = [\"email\", \"sms\", \"push\", \"in_app\"]\nTHEME_OPTIONS = [\"light\", \"dark\", \"auto\"]\nTIMEZONE_PRESETS = [\"UTC\", \"US/Eastern\", \"US/Central\", \"US/Pacific\", \"Europe/London\"]\n\n\ndef get_settings(user):\n \"\"\"Retrieve current settings for a user.\n\n Args:\n user: User model instance\n\n Returns:\n Dictionary of current settings with defaults applied\n \"\"\"\n return {\n \"theme\": user.get_preference(\"theme\", \"light\"),\n \"timezone\": 
user.get_preference(\"timezone\", \"UTC\"),\n \"notifications\": {\n \"enabled\": user.get_preference(\"notifications_enabled\", True),\n \"channels\": user.get_preference(\"notification_channels\", [\"email\"]),\n \"frequency\": user.get_preference(\"notification_frequency\", \"daily\"),\n },\n \"currency_display\": user.get_preference(\"currency_display\", \"USD\"),\n \"date_format\": user.get_preference(\"date_format\", \"YYYY-MM-DD\"),\n \"page_size\": user.get_preference(\"page_size\", 25),\n }\n\n\ndef validate_settings(settings):\n \"\"\"Validate settings before applying.\n\n Args:\n settings: Dictionary of proposed settings\n\n Returns:\n Tuple of (is_valid, list_of_errors)\n \"\"\"\n errors = []\n\n if \"theme\" in settings and settings[\"theme\"] not in THEME_OPTIONS:\n errors.append(f\"Invalid theme. Must be one of: {', '.join(THEME_OPTIONS)}\")\n\n if \"timezone\" in settings and settings[\"timezone\"] not in TIMEZONE_PRESETS:\n errors.append(f\"Invalid timezone. Must be one of: {', '.join(TIMEZONE_PRESETS)}\")\n\n if \"notifications\" in settings:\n notif = settings[\"notifications\"]\n if \"channels\" in notif:\n invalid = [ch for ch in notif[\"channels\"] if ch not in NOTIFICATION_CHANNELS]\n if invalid:\n errors.append(f\"Invalid notification channels: {', '.join(invalid)}\")\n\n if \"page_size\" in settings:\n ps = settings[\"page_size\"]\n if not isinstance(ps, int) or ps \u003c 5 or ps > 100:\n errors.append(\"page_size must be an integer between 5 and 100\")\n\n return (len(errors) == 0, errors)\n\n\ndef apply_settings(user, settings):\n \"\"\"Apply validated settings to user preferences.\n\n Args:\n user: User model instance\n settings: Validated settings dictionary\n \"\"\"\n if \"theme\" in settings:\n user.set_preference(\"theme\", settings[\"theme\"])\n if \"timezone\" in settings:\n user.set_preference(\"timezone\", settings[\"timezone\"])\n if \"notifications\" in settings:\n notif = settings[\"notifications\"]\n if \"enabled\" in 
notif:\n user.set_preference(\"notifications_enabled\", notif[\"enabled\"])\n if \"channels\" in notif:\n user.set_preference(\"notification_channels\", notif[\"channels\"])\n if \"frequency\" in notif:\n user.set_preference(\"notification_frequency\", notif[\"frequency\"])\n if \"currency_display\" in settings:\n user.set_preference(\"currency_display\", settings[\"currency_display\"])\n if \"date_format\" in settings:\n user.set_preference(\"date_format\", settings[\"date_format\"])\n if \"page_size\" in settings:\n user.set_preference(\"page_size\", settings[\"page_size\"])\nEOF\n\n# ── src/views/transactions.py ────────────────────────────────────────────────\ncat > src/views/transactions.py \u003c\u003c 'EOF'\n\"\"\"Transactions list view — paginated transaction history.\"\"\"\nimport math\n\n\ndef paginate_transactions(transactions, page=1, per_page=20):\n \"\"\"Paginate a list of transactions.\n\n Args:\n transactions: Full list of transaction objects\n page: Current page number (1-indexed)\n per_page: Items per page\n\n Returns:\n Dictionary with paginated results and metadata\n \"\"\"\n total = len(transactions)\n total_pages = math.ceil(total / per_page) if total > 0 else 1\n page = max(1, min(page, total_pages))\n\n start = (page - 1) * per_page\n end = start + per_page\n\n return {\n \"items\": transactions[start:end],\n \"page\": page,\n \"per_page\": per_page,\n \"total\": total,\n \"total_pages\": total_pages,\n \"has_next\": page \u003c total_pages,\n \"has_prev\": page > 1,\n }\n\n\ndef filter_transactions(transactions, filters):\n \"\"\"Apply filters to transaction list.\n\n Args:\n transactions: List of transaction objects\n filters: Dictionary of filter criteria\n\n Returns:\n Filtered list of transactions\n \"\"\"\n result = list(transactions)\n\n if \"min_amount\" in filters:\n result = [t for t in result if t.amount >= filters[\"min_amount\"]]\n if \"max_amount\" in filters:\n result = [t for t in result if t.amount \u003c= 
filters[\"max_amount\"]]\n if \"type\" in filters:\n result = [t for t in result if t.txn_type.value == filters[\"type\"]]\n if \"status\" in filters:\n result = [t for t in result if t.status.value == filters[\"status\"]]\n if \"description_contains\" in filters:\n keyword = filters[\"description_contains\"].lower()\n result = [t for t in result if keyword in t.description.lower()]\n\n return result\n\n\ndef format_transaction_row(transaction):\n \"\"\"Format a transaction for display in a table row.\n\n Args:\n transaction: Transaction model instance\n\n Returns:\n Dictionary of formatted display values\n \"\"\"\n return {\n \"id\": transaction.txn_id,\n \"date\": transaction.created_at.strftime(\"%Y-%m-%d %H:%M\"),\n \"description\": transaction.description or \"No description\",\n \"amount\": f\"${transaction.amount:,.2f}\",\n \"type\": transaction.txn_type.value.capitalize(),\n \"status\": transaction.status.value.capitalize(),\n \"status_class\": _status_css_class(transaction.status.value),\n }\n\n\ndef _status_css_class(status):\n \"\"\"Map transaction status to CSS class name.\"\"\"\n mapping = {\n \"completed\": \"badge-success\",\n \"pending\": \"badge-warning\",\n \"failed\": \"badge-danger\",\n \"cancelled\": \"badge-secondary\",\n }\n return mapping.get(status, \"badge-default\")\nEOF\n\n# ── src/services/__init__.py ─────────────────────────────────────────────────\ncat > src/services/__init__.py \u003c\u003c 'EOF'\n\"\"\"Service layer — business logic and external integrations.\"\"\"\nEOF\n\n# ── src/services/auth.py ─────────────────────────────────────────────────────\ncat > src/services/auth.py \u003c\u003c 'EOF'\n\"\"\"Authentication service — handles login, logout, and session management.\"\"\"\nimport hashlib\nimport secrets\nfrom datetime import datetime, timedelta\n\n\n# In-memory store for demo purposes\n_sessions = {}\n_users_db = {\n \"alice\": {\"password_hash\": hashlib.sha256(b\"password123\").hexdigest(), \"role\": \"admin\"},\n 
\"bob\": {\"password_hash\": hashlib.sha256(b\"letmein\").hexdigest(), \"role\": \"user\"},\n \"charlie\": {\"password_hash\": hashlib.sha256(b\"s3cure!\").hexdigest(), \"role\": \"user\"},\n}\n\nSESSION_TTL_HOURS = 24\n\n\ndef authenticate(username, password):\n \"\"\"Authenticate a user with username and password.\n\n Args:\n username: The username\n password: The plaintext password\n\n Returns:\n Session token string if successful, None otherwise\n \"\"\"\n user_record = _users_db.get(username)\n if not user_record:\n return None\n\n password_hash = hashlib.sha256(password.encode()).hexdigest()\n if password_hash != user_record[\"password_hash\"]:\n return None\n\n token = secrets.token_hex(32)\n _sessions[token] = {\n \"username\": username,\n \"role\": user_record[\"role\"],\n \"created_at\": datetime.now(),\n \"expires_at\": datetime.now() + timedelta(hours=SESSION_TTL_HOURS),\n }\n return token\n\n\ndef validate_session(token):\n \"\"\"Check if a session token is valid and not expired.\n\n Args:\n token: Session token string\n\n Returns:\n Session data dictionary if valid, None otherwise\n \"\"\"\n session = _sessions.get(token)\n if not session:\n return None\n if datetime.now() > session[\"expires_at\"]:\n del _sessions[token]\n return None\n return session\n\n\ndef logout(token):\n \"\"\"Invalidate a session token.\n\n Args:\n token: Session token to invalidate\n\n Returns:\n True if session was found and removed, False otherwise\n \"\"\"\n if token in _sessions:\n del _sessions[token]\n return True\n return False\n\n\ndef get_active_sessions():\n \"\"\"Return count of currently active sessions.\"\"\"\n now = datetime.now()\n active = {k: v for k, v in _sessions.items() if now \u003c= v[\"expires_at\"]}\n return len(active)\nEOF\n\n# ── src/services/email.py ────────────────────────────────────────────────────\ncat > src/services/email.py \u003c\u003c 'EOF'\n\"\"\"Email service — sends transactional emails.\"\"\"\nfrom datetime import datetime\n\n\nclass 
EmailService:\n \"\"\"Handles composing and sending emails.\n\n In production this would integrate with an SMTP server or API\n like SendGrid. For the demo it just logs messages.\n \"\"\"\n\n def __init__(self, from_address=\"[email protected]\"):\n self.from_address = from_address\n self.sent_log = []\n\n def send_welcome(self, user):\n \"\"\"Send a welcome email to a new user.\"\"\"\n subject = f\"Welcome to FinanceApp, {user.username}!\"\n body = (\n f\"Hi {user.username},\\n\\n\"\n f\"Your account has been created successfully.\\n\"\n f\"You can start tracking your finances right away.\\n\\n\"\n f\"Best regards,\\nThe FinanceApp Team\"\n )\n return self._send(user.email, subject, body)\n\n def send_transaction_receipt(self, user, transaction):\n \"\"\"Send a receipt for a completed transaction.\"\"\"\n subject = f\"Transaction Receipt - ${transaction.amount:.2f}\"\n body = (\n f\"Hi {user.username},\\n\\n\"\n f\"Your transaction has been processed:\\n\"\n f\" Amount: ${transaction.amount:.2f}\\n\"\n f\" Type: {transaction.txn_type.value}\\n\"\n f\" Description: {transaction.description}\\n\"\n f\" Date: {transaction.created_at.strftime('%Y-%m-%d %H:%M')}\\n\\n\"\n f\"Thank you for using FinanceApp.\"\n )\n return self._send(user.email, subject, body)\n\n def send_password_reset(self, email, reset_token):\n \"\"\"Send a password reset email.\"\"\"\n subject = \"Password Reset Request\"\n body = (\n f\"A password reset was requested for your account.\\n\\n\"\n f\"Use this token to reset your password: {reset_token}\\n\\n\"\n f\"If you didn't request this, please ignore this email.\"\n )\n return self._send(email, subject, body)\n\n def _send(self, to_address, subject, body):\n \"\"\"Internal send method. 
Logs the email for demo purposes.\"\"\"\n record = {\n \"from\": self.from_address,\n \"to\": to_address,\n \"subject\": subject,\n \"body\": body,\n \"sent_at\": datetime.now().isoformat(),\n }\n self.sent_log.append(record)\n return True\n\n def get_sent_count(self):\n \"\"\"Return number of emails sent.\"\"\"\n return len(self.sent_log)\nEOF\n\n# ── src/services/reporting.py ────────────────────────────────────────────────\ncat > src/services/reporting.py \u003c\u003c 'EOF'\n\"\"\"Reporting service — generates financial reports and summaries.\"\"\"\nfrom datetime import datetime, timedelta\nfrom collections import defaultdict\n\n\ndef generate_monthly_report(transactions, year, month):\n \"\"\"Generate a monthly financial summary.\n\n Args:\n transactions: List of Transaction model instances\n year: Report year\n month: Report month (1-12)\n\n Returns:\n Dictionary with monthly totals and breakdowns\n \"\"\"\n monthly = [\n t for t in transactions\n if t.created_at.year == year and t.created_at.month == month\n ]\n\n totals_by_type = defaultdict(float)\n for t in monthly:\n totals_by_type[t.txn_type.value] += t.amount\n\n total_in = totals_by_type.get(\"credit\", 0) + totals_by_type.get(\"refund\", 0)\n total_out = totals_by_type.get(\"debit\", 0) + totals_by_type.get(\"transfer\", 0)\n\n return {\n \"year\": year,\n \"month\": month,\n \"transaction_count\": len(monthly),\n \"total_in\": round(total_in, 2),\n \"total_out\": round(total_out, 2),\n \"net\": round(total_in - total_out, 2),\n \"by_type\": dict(totals_by_type),\n }\n\n\ndef generate_spending_categories(transactions):\n \"\"\"Group spending by description keywords.\n\n Args:\n transactions: List of Transaction model instances\n\n Returns:\n Dictionary mapping category keywords to total amounts\n \"\"\"\n categories = defaultdict(float)\n for t in transactions:\n if t.is_debit():\n keyword = t.description.split()[0].lower() if t.description else \"uncategorized\"\n categories[keyword] += t.amount\n 
return dict(sorted(categories.items(), key=lambda x: x[1], reverse=True))\n\n\ndef generate_weekly_trend(transactions, weeks=4):\n \"\"\"Calculate weekly spending trend.\n\n Args:\n transactions: List of Transaction model instances\n weeks: Number of weeks to include\n\n Returns:\n List of dictionaries with weekly totals\n \"\"\"\n now = datetime.now()\n trend = []\n for i in range(weeks):\n week_end = now - timedelta(weeks=i)\n week_start = week_end - timedelta(weeks=1)\n weekly = [\n t for t in transactions\n if week_start \u003c= t.created_at \u003c= week_end and t.is_debit()\n ]\n trend.append({\n \"week_start\": week_start.strftime(\"%Y-%m-%d\"),\n \"week_end\": week_end.strftime(\"%Y-%m-%d\"),\n \"total\": round(sum(t.amount for t in weekly), 2),\n \"count\": len(weekly),\n })\n return trend\nEOF\n\n# ── src/utils/__init__.py ────────────────────────────────────────────────────\ncat > src/utils/__init__.py \u003c\u003c 'EOF'\n\"\"\"Utility functions — formatters, validators, and common helpers.\"\"\"\nEOF\n\n# ── src/utils/formatters.py ──────────────────────────────────────────────────\ncat > src/utils/formatters.py \u003c\u003c 'EOF'\n\"\"\"Formatting utilities for display values.\"\"\"\nfrom datetime import datetime\n\n\ndef format_currency(amount, currency=\"USD\", locale=\"en_US\"):\n \"\"\"Format a numeric amount as a currency string.\n\n Args:\n amount: Numeric amount\n currency: ISO currency code\n locale: Locale for formatting conventions\n\n Returns:\n Formatted currency string\n \"\"\"\n symbols = {\"USD\": \"$\", \"EUR\": \"\\u20ac\", \"GBP\": \"\\u00a3\", \"JPY\": \"\\u00a5\"}\n symbol = symbols.get(currency, currency + \" \")\n\n if amount \u003c 0:\n return f\"-{symbol}{abs(amount):,.2f}\"\n return f\"{symbol}{amount:,.2f}\"\n\n\ndef format_date(dt, style=\"short\"):\n \"\"\"Format a datetime object for display.\n\n Args:\n dt: datetime instance\n style: 'short' (2025-03-15), 'medium' (Mar 15, 2025),\n 'long' (March 15, 2025), 'relative' (3 
days ago)\n\n Returns:\n Formatted date string\n \"\"\"\n if style == \"short\":\n return dt.strftime(\"%Y-%m-%d\")\n elif style == \"medium\":\n return dt.strftime(\"%b %d, %Y\")\n elif style == \"long\":\n return dt.strftime(\"%B %d, %Y\")\n elif style == \"relative\":\n return _relative_time(dt)\n return dt.isoformat()\n\n\ndef _relative_time(dt):\n \"\"\"Convert datetime to relative time string.\"\"\"\n now = datetime.now()\n delta = now - dt\n seconds = int(delta.total_seconds())\n\n if seconds \u003c 60:\n return \"just now\"\n elif seconds \u003c 3600:\n minutes = seconds // 60\n return f\"{minutes} minute{'s' if minutes != 1 else ''} ago\"\n elif seconds \u003c 86400:\n hours = seconds // 3600\n return f\"{hours} hour{'s' if hours != 1 else ''} ago\"\n elif seconds \u003c 604800:\n days = seconds // 86400\n return f\"{days} day{'s' if days != 1 else ''} ago\"\n else:\n weeks = seconds // 604800\n return f\"{weeks} week{'s' if weeks != 1 else ''} ago\"\n\n\ndef truncate(text, max_length=50, suffix=\"...\"):\n \"\"\"Truncate text to a maximum length with suffix.\n\n Args:\n text: Input string\n max_length: Maximum length including suffix\n suffix: String to append when truncated\n\n Returns:\n Truncated string\n \"\"\"\n if len(text) \u003c= max_length:\n return text\n return text[:max_length - len(suffix)] + suffix\n\n\ndef format_percentage(value, decimals=1):\n \"\"\"Format a decimal value as percentage string.\"\"\"\n return f\"{value * 100:.{decimals}f}%\"\nEOF\n\n# ── src/utils/validators.py ──────────────────────────────────────────────────\ncat > src/utils/validators.py \u003c\u003c 'EOF'\n\"\"\"Input validation utilities.\"\"\"\nimport re\n\n\ndef validate_email(email):\n \"\"\"Validate an email address format.\n\n Args:\n email: Email string to validate\n\n Returns:\n Tuple of (is_valid, error_message_or_None)\n \"\"\"\n if not email or not isinstance(email, str):\n return (False, \"Email is required\")\n pattern = 
r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'\n if not re.match(pattern, email):\n return (False, \"Invalid email format\")\n if len(email) > 254:\n return (False, \"Email address too long\")\n return (True, None)\n\n\ndef validate_username(username):\n \"\"\"Validate a username.\n\n Args:\n username: Username string to validate\n\n Returns:\n Tuple of (is_valid, error_message_or_None)\n \"\"\"\n if not username or not isinstance(username, str):\n return (False, \"Username is required\")\n if len(username) \u003c 3:\n return (False, \"Username must be at least 3 characters\")\n if len(username) > 30:\n return (False, \"Username must be at most 30 characters\")\n if not re.match(r'^[a-zA-Z0-9_]+$'
, username):\n return (False, \"Username may only contain letters, numbers, and underscores\")\n return (True, None)\n\n\ndef validate_amount(amount):\n \"\"\"Validate a transaction amount.\n\n Args:\n amount: Numeric value to validate\n\n Returns:\n Tuple of (is_valid, error_message_or_None)\n \"\"\"\n if amount is None:\n return (False, \"Amount is required\")\n try:\n amount = float(amount)\n except (ValueError, TypeError):\n return (False, \"Amount must be a number\")\n if amount \u003c= 0:\n return (False, \"Amount must be positive\")\n if amount > 1_000_000:\n return (False, \"Amount exceeds maximum limit\")\n if round(amount, 2) != amount:\n return (False, \"Amount cannot have more than 2 decimal places\")\n return (True, None)\n\n\ndef validate_password(password):\n \"\"\"Validate password strength.\n\n Args:\n password: Password string\n\n Returns:\n Tuple of (is_valid, error_message_or_None)\n \"\"\"\n if not password or not isinstance(password, str):\n return (False, \"Password is required\")\n if len(password) \u003c 8:\n return (False, \"Password must be at least 8 characters\")\n if not re.search(r'[A-Z]', password):\n return (False, \"Password must contain at least one uppercase letter\")\n if not re.search(r'[a-z]', password):\n return (False, \"Password must contain at least one lowercase letter\")\n if not re.search(r'[0-9]', password):\n return (False, \"Password must contain at least one digit\")\n return (True, None)\nEOF\n\n# ── src/__init__.py ──────────────────────────────────────────────────────────\ncat > src/__init__.py \u003c\u003c 'EOF'\n\"\"\"FinanceApp — a personal finance dashboard application.\"\"\"\n__version__ = \"1.2.0\"\nEOF\n\n# ── tests/__init__.py ────────────────────────────────────────────────────────\ncat > tests/__init__.py \u003c\u003c 'EOF'\nEOF\n\n# ── tests/test_dashboard.py (HAS FAILING TEST) ──────────────────────────────\ncat > tests/test_dashboard.py \u003c\u003c 'EOF'\n\"\"\"Tests for dashboard 
view.\"\"\"\nimport unittest\nfrom datetime import datetime\nimport sys\nimport os\nsys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))\nfrom src.views.dashboard import Transaction, get_recent_transactions, get_dashboard_summary\n\n\nclass TestGetRecentTransactions(unittest.TestCase):\n def setUp(self):\n self.transactions = [\n Transaction(1, \"Coffee\", 4.50, \"2025-01-15\"),\n Transaction(2, \"Groceries\", 67.80, \"2025-03-01\"),\n Transaction(3, \"Gas\", 45.00, \"2025-02-10\"),\n Transaction(4, \"Restaurant\", 32.50, \"2025-03-10\"),\n Transaction(5, \"Books\", 29.99, \"2025-01-20\"),\n ]\n\n def test_transaction_sort_order(self):\n \"\"\"Most recent transactions should appear first.\"\"\"\n result = get_recent_transactions(self.transactions)\n # Newest (2025-03-10) should be first\n self.assertEqual(result[0].id, 4)\n # Second newest (2025-03-01) should be second\n self.assertEqual(result[1].id, 2)\n\n def test_limit(self):\n \"\"\"Should respect the limit parameter.\"\"\"\n result = get_recent_transactions(self.transactions, limit=3)\n self.assertEqual(len(result), 3)\n\n def test_empty_list(self):\n \"\"\"Should handle empty list.\"\"\"\n result = get_recent_transactions([])\n self.assertEqual(result, [])\n\n\nclass TestGetDashboardSummary(unittest.TestCase):\n def test_summary_calculation(self):\n transactions = [\n Transaction(1, \"A\", 10.00, \"2025-01-01\"),\n Transaction(2, \"B\", 20.00, \"2025-01-02\"),\n ]\n summary = get_dashboard_summary(transactions)\n self.assertEqual(summary[\"total\"], 30.0)\n self.assertEqual(summary[\"count\"], 2)\n self.assertEqual(summary[\"average\"], 15.0)\n\n def test_empty_summary(self):\n summary = get_dashboard_summary([])\n self.assertEqual(summary[\"total\"], 0)\n\n\nif __name__ == '__main__':\n unittest.main()\nEOF\n\n# ── tests/test_auth.py ───────────────────────────────────────────────────────\ncat > tests/test_auth.py \u003c\u003c 'EOF'\n\"\"\"Tests for authentication service.\"\"\"\nimport 
unittest\nimport sys\nimport os\nsys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))\nfrom src.services.auth import authenticate, validate_session, logout, get_active_sessions\n\n\nclass TestAuthenticate(unittest.TestCase):\n def test_valid_credentials(self):\n \"\"\"Should return a session token for valid credentials.\"\"\"\n token = authenticate(\"alice\", \"password123\")\n self.assertIsNotNone(token)\n self.assertEqual(len(token), 64) # hex token\n\n def test_invalid_password(self):\n \"\"\"Should return None for wrong password.\"\"\"\n token = authenticate(\"alice\", \"wrongpassword\")\n self.assertIsNone(token)\n\n def test_unknown_user(self):\n \"\"\"Should return None for unknown username.\"\"\"\n token = authenticate(\"nonexistent\", \"password123\")\n self.assertIsNone(token)\n\n\nclass TestValidateSession(unittest.TestCase):\n def test_valid_session(self):\n \"\"\"Should return session data for a valid token.\"\"\"\n token = authenticate(\"bob\", \"letmein\")\n session = validate_session(token)\n self.assertIsNotNone(session)\n self.assertEqual(session[\"username\"], \"bob\")\n\n def test_invalid_token(self):\n \"\"\"Should return None for invalid token.\"\"\"\n session = validate_session(\"invalid-token-abc123\")\n self.assertIsNone(session)\n\n\nclass TestLogout(unittest.TestCase):\n def test_logout_valid_session(self):\n \"\"\"Should invalidate an existing session.\"\"\"\n token = authenticate(\"charlie\", \"s3cure!\")\n self.assertTrue(logout(token))\n self.assertIsNone(validate_session(token))\n\n def test_logout_invalid_token(self):\n \"\"\"Should return False for non-existent token.\"\"\"\n self.assertFalse(logout(\"fake-token\"))\n\n\nclass TestActiveSessions(unittest.TestCase):\n def test_count_after_login(self):\n \"\"\"Active sessions should increase after login.\"\"\"\n initial = get_active_sessions()\n authenticate(\"alice\", \"password123\")\n self.assertGreaterEqual(get_active_sessions(), initial)\n\n\nif __name__ == 
'__main__':\n unittest.main()\nEOF\n\n# ── tests/test_models.py ─────────────────────────────────────────────────────\ncat > tests/test_models.py \u003c\u003c 'EOF'\n\"\"\"Tests for data models.\"\"\"\nimport unittest\nimport sys\nimport os\nsys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))\nfrom src.models.user import User\nfrom src.models.account import Account\nfrom src.models.transaction import Transaction, TransactionType, TransactionStatus\n\n\nclass TestUser(unittest.TestCase):\n def setUp(self):\n self.user = User(1, \"testuser\", \"[email protected]\")\n\n def test_creation(self):\n self.assertEqual(self.user.user_id, 1)\n self.assertEqual(self.user.username, \"testuser\")\n self.assertTrue(self.user.is_active)\n\n def test_preferences(self):\n self.user.set_preference(\"theme\", \"dark\")\n self.assertEqual(self.user.get_preference(\"theme\"), \"dark\")\n self.assertIsNone(self.user.get_preference(\"missing\"))\n self.assertEqual(self.user.get_preference(\"missing\", \"default\"), \"default\")\n\n def test_deactivate(self):\n self.user.deactivate()\n self.assertFalse(self.user.is_active)\n\n def test_to_dict(self):\n d = self.user.to_dict()\n self.assertEqual(d[\"username\"], \"testuser\")\n self.assertIn(\"created_at\", d)\n\n def test_equality(self):\n other = User(1, \"different_name\", \"[email protected]\")\n self.assertEqual(self.user, other)\n another = User(2, \"testuser\", \"[email protected]\")\n self.assertNotEqual(self.user, another)\n\n\nclass TestAccount(unittest.TestCase):\n def setUp(self):\n self.account = Account(100, 1, \"Checking\", balance=500.0)\n\n def test_deposit(self):\n result = self.account.deposit(100)\n self.assertEqual(result, 600.0)\n\n def test_withdraw(self):\n result = self.account.withdraw(200)\n self.assertEqual(result, 300.0)\n\n def test_insufficient_funds(self):\n with self.assertRaises(ValueError):\n self.account.withdraw(999)\n\n def test_negative_deposit(self):\n with 
self.assertRaises(ValueError):\n self.account.deposit(-50)\n\n def test_freeze(self):\n self.account.freeze()\n with self.assertRaises(RuntimeError):\n self.account.deposit(100)\n with self.assertRaises(RuntimeError):\n self.account.withdraw(50)\n\n def test_unfreeze(self):\n self.account.freeze()\n self.account.unfreeze()\n self.account.deposit(100) # Should not raise\n\n\nclass TestTransaction(unittest.TestCase):\n def setUp(self):\n self.txn = Transaction(1, 100, 50.0, TransactionType.DEBIT, \"Test purchase\")\n\n def test_creation(self):\n self.assertEqual(self.txn.txn_id, 1)\n self.assertEqual(self.txn.amount, 50.0)\n self.assertEqual(self.txn.status, TransactionStatus.PENDING)\n\n def test_complete(self):\n self.assertTrue(self.txn.complete())\n self.assertEqual(self.txn.status, TransactionStatus.COMPLETED)\n # Cannot complete again\n self.assertFalse(self.txn.complete())\n\n def test_cancel(self):\n self.assertTrue(self.txn.cancel())\n self.assertEqual(self.txn.status, TransactionStatus.CANCELLED)\n\n def test_is_debit(self):\n self.assertTrue(self.txn.is_debit())\n credit = Transaction(2, 100, 100.0, TransactionType.CREDIT)\n self.assertFalse(credit.is_debit())\n\n def test_metadata(self):\n self.txn.add_metadata(\"reference\", \"INV-001\")\n self.assertEqual(self.txn.metadata[\"reference\"], \"INV-001\")\n\n def test_string_type_init(self):\n txn = Transaction(3, 100, 25.0, \"refund\")\n self.assertEqual(txn.txn_type, TransactionType.REFUND)\n\n\nif __name__ == '__main__':\n unittest.main()\nEOF\n\n# ── config/settings.py ───────────────────────────────────────────────────────\ncat > config/settings.py \u003c\u003c 'EOF'\n\"\"\"Application configuration settings.\"\"\"\nimport os\n\n\nclass Config:\n \"\"\"Base configuration.\"\"\"\n APP_NAME = \"FinanceApp\"\n VERSION = \"1.2.0\"\n DEBUG = False\n TESTING = False\n\n # Database\n DATABASE_URI = os.getenv(\"DATABASE_URI\", \"sqlite:///financeapp.db\")\n DATABASE_POOL_SIZE = 5\n\n # Session\n 
SESSION_TTL_HOURS = 24\n SESSION_COOKIE_NAME = \"financeapp_session\"\n SESSION_COOKIE_SECURE = True\n\n # Email\n SMTP_HOST = os.getenv(\"SMTP_HOST\", \"localhost\")\n SMTP_PORT = int(os.getenv(\"SMTP_PORT\", \"587\"))\n SMTP_FROM = \"[email protected]\"\n\n # Pagination\n DEFAULT_PAGE_SIZE = 25\n MAX_PAGE_SIZE = 100\n\n # Security\n PASSWORD_MIN_LENGTH = 8\n MAX_LOGIN_ATTEMPTS = 5\n LOCKOUT_DURATION_MINUTES = 30\n\n\nclass DevelopmentConfig(Config):\n \"\"\"Development environment configuration.\"\"\"\n DEBUG = True\n SESSION_COOKIE_SECURE = False\n DATABASE_URI = \"sqlite:///dev.db\"\n\n\nclass TestingConfig(Config):\n \"\"\"Testing environment configuration.\"\"\"\n TESTING = True\n DATABASE_URI = \"sqlite:///:memory:\"\n\n\nclass ProductionConfig(Config):\n \"\"\"Production environment configuration.\"\"\"\n DATABASE_POOL_SIZE = 20\n SESSION_COOKIE_SECURE = True\n\n\ndef get_config(env=None):\n \"\"\"Return configuration for the given environment.\n\n Args:\n env: Environment name ('development', 'testing', 'production')\n\n Returns:\n Config class for the environment\n \"\"\"\n env = env or os.getenv(\"APP_ENV\", \"development\")\n configs = {\n \"development\": DevelopmentConfig,\n \"testing\": TestingConfig,\n \"production\": ProductionConfig,\n }\n return configs.get(env, DevelopmentConfig)\nEOF\n\n# ── BUG_REPORT.md ────────────────────────────────────────────────────────────\ncat > BUG_REPORT.md \u003c\u003c 'EOF'\n# Bug Report: Dashboard Shows Oldest Transactions First\n\n**Reported by:** Product Manager\n**Date:** 2025-03-15\n**Severity:** Medium\n\n## Description\nThe dashboard page shows transactions sorted with the oldest first. 
Users\nexpect to see their most recent transactions at the top of the list.\n\n## Expected Behavior\nTransactions on the dashboard should be sorted by date with the newest\ntransactions appearing first (descending order).\n\n## Actual Behavior\nTransactions appear in ascending date order (oldest first).\n\n## How to Reproduce\n1. View the dashboard\n2. Notice transactions are sorted oldest-first instead of newest-first\nEOF\n\n# ── README.md ────────────────────────────────────────────────────────────────\ncat > README.md \u003c\u003c 'EOF'\n# FinanceApp\n\nA personal finance dashboard application built with Python.\n\n## Project Structure\n\n```\nsrc/\n models/ — Data models (User, Transaction, Account)\n views/ — View layer (dashboard, profile, settings, transactions)\n services/ — Business logic (auth, email, reporting)\n utils/ — Helpers (formatters, validators)\ntests/ — Unit tests\nconfig/ — Application configuration\n```\n\n## Running Tests\n\n```bash\npython -m pytest tests/ -v\n```\n\n## Development\n\nThis project uses Python 3.8+ with no external dependencies beyond the\nstandard library (except pytest for testing).\nEOF\n\n# ── Initial commit ───────────────────────────────────────────────────────────\ngit add -A\ngit commit -q -m \"initial: add FinanceApp with dashboard sort bug\"\n","content_type":"application/x-sh; charset=utf-8","language":"bash","size":44035,"content_sha256":"e28d67cf766d63eeb15bae9d2fa52ece2d228dd47ca028792ccf8ccf70621088"},{"filename":"tasks/tool-efficiency/targeted-fix/task.yaml","content":"name: \"Fix Bug with Minimal File Reads\"\nid: \"tool-efficiency-targeted-fix\"\nversion: \"1.0\"\nsuite: \"tool-efficiency\"\ndifficulty: \"hard\"\nmode: \"real\"\n\ndescription: |\n Tests surgical precision in bug fixing. 
The bug report clearly points\n to dashboard sorting, and the fix is a one-line change (add reverse=True).\n An efficient agent reads the bug report, navigates to dashboard.py,\n makes the fix, and verifies — touching at most 4-5 files total.\n\nuser_message: |\n Fix the bug described in BUG_REPORT.md. Run the relevant test to verify\n your fix works.\n\ninput_files: []\n\nexpected_outputs:\n - pattern: \"src/views/dashboard.py\"\n required: true\n validators:\n - type: \"content-contains\"\n sections: [\"reverse=True\"]\n - pattern: \"**\"\n required: true\n validators:\n - type: \"command-output-contains\"\n command: \"python -m pytest tests/test_dashboard.py -v\"\n contains: [\"passed\"]\n\nexpected_metrics:\n tool_calls: [3, 8]\n planning_ratio: [0.10, 0.25]\n\nscoring:\n layer0_weight: 0.15\n layer1_weight: 0.45\n layer2_weight: 0.20\n layer3_weight: 0.20\n","content_type":"application/yaml; charset=utf-8","language":"yaml","size":1090,"content_sha256":"e3dfe205909e16d0dfda3f07114fb52e939de0e43614205cac9266bafd7e352e"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"AgentBench for OpenClaw","type":"text"}]},{"type":"paragraph","content":[{"text":"Benchmark your OpenClaw agent's general capabilities across 40 real-world tasks spanning 7 domains.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Commands","type":"text"}]},{"type":"paragraph","content":[{"text":"When the user says any of these, follow the corresponding instructions:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"/benchmark","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" — Run the full benchmark suite (all 40 tasks)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"/benchmark --fast","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" — Run only 
easy+medium tasks (19 tasks)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"/benchmark --suite \u003cname>","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" — Run one domain only","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"/benchmark --task \u003cid>","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" — Run a single task","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"/benchmark --strict","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" — Tag results as externally verified scoring","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"/benchmark-list","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" — List all tasks grouped by domain","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"/benchmark-results","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" — Show results from previous runs","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"/benchmark-compare","type":"text","marks":[{"type":"code_inline"},{"type":"strong"}]},{"text":" — Compare two runs side-by-side","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Flags are combinable: ","type":"text"},{"text":"/benchmark --fast --suite research","type":"text","marks":[{"type":"code_inline"}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Running a Benchmark","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 1: Discover Tasks","type":"text"}]},{"type":"paragraph","content":[{"text":"Read task.yaml files from the ","type":"text"},{"text":"tasks/","type":"text","marks":[{"type":"code_inline"}]},{"text":" directory in this 
skill:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"tasks/{suite-name}/{task-name}/task.yaml","type":"text"}]},{"type":"paragraph","content":[{"text":"Each task.yaml contains: name, id, suite, difficulty, mode, user_message, input_files, expected_outputs, expected_metrics, scoring weights.","type":"text"}]},{"type":"paragraph","content":[{"text":"Filter by ","type":"text"},{"text":"--suite","type":"text","marks":[{"type":"code_inline"}]},{"text":" or ","type":"text"},{"text":"--task","type":"text","marks":[{"type":"code_inline"}]},{"text":" if specified. If ","type":"text"},{"text":"--fast","type":"text","marks":[{"type":"code_inline"}]},{"text":" is set and ","type":"text"},{"text":"--task","type":"text","marks":[{"type":"code_inline"}]},{"text":" is not, filter to only tasks where difficulty is \"easy\" or \"medium\".","type":"text"}]},{"type":"paragraph","content":[{"text":"Profile is \"fast\" if ","type":"text"},{"text":"--fast","type":"text","marks":[{"type":"code_inline"}]},{"text":" was specified, otherwise \"full\".","type":"text"}]},{"type":"paragraph","content":[{"text":"List discovered tasks with count and suites.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 2: Set Up Run Directory","type":"text"}]},{"type":"paragraph","content":[{"text":"Generate a run ID from the current timestamp: ","type":"text"},{"text":"YYYYMMDD-HHmmss","type":"text","marks":[{"type":"code_inline"}]}]},{"type":"paragraph","content":[{"text":"Read ","type":"text"},{"text":"suite_version","type":"text","marks":[{"type":"code_inline"}]},{"text":" from ","type":"text"},{"text":"skill.json","type":"text","marks":[{"type":"code_inline"}]},{"text":" in this skill directory.","type":"text"}]},{"type":"paragraph","content":[{"text":"Create the results 
directory:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"agentbench-results/{run-id}/","type":"text"}]},{"type":"paragraph","content":[{"text":"Announce: ","type":"text"},{"text":"Starting AgentBench run {run-id} | Profile: {profile} | Suite version: {suite_version} | Tasks: {count}","type":"text","marks":[{"type":"code_inline"}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 3: Execute Each Task","type":"text"}]},{"type":"paragraph","content":[{"text":"For each task:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Set up workspace","type":"text","marks":[{"type":"strong"}]},{"text":":","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Create ","type":"text"},{"text":"/tmp/agentbench-task-{task-id}/","type":"text","marks":[{"type":"code_inline"}]},{"text":" as workspace","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Copy input files from ","type":"text"},{"text":"tasks/{suite}/{task}/inputs/","type":"text","marks":[{"type":"code_inline"}]},{"text":" to the workspace (if inputs/ exists)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"If the task directory contains a ","type":"text"},{"text":"setup.sh","type":"text","marks":[{"type":"code_inline"}]},{"text":": run ","type":"text"},{"text":"bash tasks/{suite}/{task}/setup.sh {workspace-path}","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"For ","type":"text"},{"text":"file-unchanged","type":"text","marks":[{"type":"code_inline"}]},{"text":" validators: compute checksums of specified files after setup, before task 
execution","type":"text"}]}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Announce","type":"text","marks":[{"type":"strong"}]},{"text":": ","type":"text"},{"text":"Running: {task.name} [{task.suite}] (difficulty: {task.difficulty})","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Record start time","type":"text","marks":[{"type":"strong"}]},{"text":" (milliseconds): ","type":"text"},{"text":"date +%s%3N","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Execute the task yourself directly","type":"text","marks":[{"type":"strong"}]},{"text":":","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Read the task's ","type":"text"},{"text":"user_message","type":"text","marks":[{"type":"code_inline"}]},{"text":" and execute it as if a real user sent you the request","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Work ONLY within the workspace directory","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"If input files are listed, read them from the workspace","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Execute naturally — use the appropriate tools (read, write, edit, exec, web_search, web_fetch, etc.)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Create any output files in the workspace directory","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"When done, write a brief ","type":"text"},{"text":"execution-trace.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" to the workspace:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"What you 
understood the task to be","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"What approach you took","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"What files you created or modified","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Any difficulties or decisions you made","type":"text"}]}]}]}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Record end time","type":"text","marks":[{"type":"strong"}]},{"text":" and compute duration","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Collect metrics","type":"text","marks":[{"type":"strong"}]},{"text":":","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"total_time_ms","type":"text","marks":[{"type":"code_inline"}]},{"text":": end - start","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"tool_calls_total","type":"text","marks":[{"type":"code_inline"}]},{"text":": count how many tool calls you made during this task","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"errors","type":"text","marks":[{"type":"code_inline"}]},{"text":": count any tool call failures","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"planning_ratio","type":"text","marks":[{"type":"code_inline"}]},{"text":": estimate the fraction of time spent reading/thinking vs producing output (approximate is fine)","type":"text"}]}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Layer 0 — Automated Structural Checks","type":"text","marks":[{"type":"strong"}]},{"text":" (compute directly): After task execution, check the workspace. 
For each entry in ","type":"text"},{"text":"expected_outputs","type":"text","marks":[{"type":"code_inline"}]},{"text":":","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"file-exists","type":"text","marks":[{"type":"code_inline"}]},{"text":": Check if file exists. 30 points if found, 0 if not.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"content-contains","type":"text","marks":[{"type":"code_inline"}]},{"text":": Read file, check each required section keyword (case-insensitive). Points proportional to matches found. Pool: 40 points.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"word-count-range","type":"text","marks":[{"type":"code_inline"}]},{"text":": Count words. In range = 30 points. Within 2x range = 15 points. Outside = 0.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"git-log-contains","type":"text","marks":[{"type":"code_inline"}]},{"text":": Check git log for expected strings. 30 points if all found, proportional otherwise.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"directory-structure","type":"text","marks":[{"type":"code_inline"}]},{"text":": Check all paths exist. 30 points if all present, proportional for partial.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"command-output-contains","type":"text","marks":[{"type":"code_inline"}]},{"text":": Run command, check output contains all strings. 30 points if match, 0 if not.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"file-unchanged","type":"text","marks":[{"type":"code_inline"}]},{"text":": Compare checksum against pre-execution checksum. 
30 points if unchanged, 0 if modified.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"link-consistency","type":"text","marks":[{"type":"code_inline"}]},{"text":": Scan files for link syntax consistency. 30 points if consistent, 15 if mostly consistent (>70% one style), 0 if mixed.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Normalize total to 0-100.","type":"text"}]}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Layer 1 — Metrics Analysis","type":"text","marks":[{"type":"strong"}]},{"text":" (compute directly): If task has expected_metrics:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Tool calls within expected range: 40 points","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Tool calls within 2x range: 20 points","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Outside 2x range: 0 points","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Planning ratio within expected range: 30 points","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Planning ratio outside but within 2x: 15 points","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Way off: 0 points","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Zero errors: 30 points","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"1-2 errors: 15 points","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"3+ errors: 0 points","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Normalize to 0-100. 
If no metrics available, score as 50.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Token estimate is tracked for reporting but NOT scored.","type":"text"}]}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Layer 2 — Behavioral Analysis","type":"text","marks":[{"type":"strong"}]},{"text":" (self-evaluate honestly, 0-100): Score based on HOW you executed:","type":"text"}]},{"type":"paragraph","content":[{"text":"Instruction Adherence (30 points):","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"30: Followed all instructions precisely","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"20: Mostly followed, minor deviations","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"10: Significant deviations","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"0: Ignored or misunderstood","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Tool Appropriateness (25 points)","type":"text","marks":[{"type":"strong"}]},{"text":" — rule-based first:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Penalty: -10 for each use of ","type":"text"},{"text":"exec cat","type":"text","marks":[{"type":"code_inline"}]},{"text":" instead of ","type":"text"},{"text":"read","type":"text","marks":[{"type":"code_inline"}]},{"text":" to read files","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Penalty: -10 for each use of ","type":"text"},{"text":"exec echo/printf","type":"text","marks":[{"type":"code_inline"}]},{"text":" instead of ","type":"text"},{"text":"write","type":"text","marks":[{"type":"code_inline"}]},{"text":" to create 
files","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Penalty: -5 for each use of ","type":"text"},{"text":"exec sed/awk","type":"text","marks":[{"type":"code_inline"}]},{"text":" instead of ","type":"text"},{"text":"edit","type":"text","marks":[{"type":"code_inline"}]},{"text":" for file edits","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Start at 25, apply penalties, floor at 0","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Approach Quality (25 points)","type":"text","marks":[{"type":"strong"}]},{"text":" — check read-before-write:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"25: Read all inputs before producing output","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"15: Read most inputs, minor gaps","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"5: Started producing output without reading context","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"0: No clear approach","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Error Recovery (20 points):","type":"text","marks":[{"type":"strong"}]}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"20: Clean recovery or no errors occurred","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"10: Partial recovery","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"0: Failed to recover","type":"text"}]}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Layer 3 — Output Quality","type":"text","marks":[{"type":"strong"}]},{"text":" (self-evaluate honestly, 0-100): Score the deliverable:","type":"text"}]},{"type":"paragraph","content":[{"text":"Completeness 
(25):","type":"text","marks":[{"type":"strong"}]},{"text":" All requirements met? Gaps? ","type":"text"},{"text":"Accuracy (25):","type":"text","marks":[{"type":"strong"}]},{"text":" Content correct? Calculations right? ","type":"text"},{"text":"Formatting (25):","type":"text","marks":[{"type":"strong"}]},{"text":" Well-structured? Correct file format? ","type":"text"},{"text":"Polish (25):","type":"text","marks":[{"type":"strong"}]},{"text":" Would a user be satisfied?","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Compute composite score","type":"text","marks":[{"type":"strong"}]},{"text":":","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"score = (L0 × 0.20) + (L1 × 0.35) + (L2 × 0.20) + (L3 × 0.25)","type":"text"}]},{"type":"paragraph","content":[{"text":"Use weights from task.yaml if specified, otherwise these defaults.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Save task result","type":"text","marks":[{"type":"strong"}]},{"text":" to ","type":"text"},{"text":"agentbench-results/{run-id}/{task-id}/","type":"text","marks":[{"type":"code_inline"}]},{"text":":","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"scores.json","type":"text","marks":[{"type":"code_inline"}]},{"text":": All layer scores, composite, breakdown, notes","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"metrics.json","type":"text","marks":[{"type":"code_inline"}]},{"text":": Timing, tool calls, errors, planning ratio","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Copy output files","type":"text"}]}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Display","type":"text","marks":[{"type":"strong"}]},{"text":": ","type":"text"},{"text":"{task.name}: {composite}/100 (L0:{l0} 
L1:{l1} L2:{l2} L3:{l3})","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 4: Generate Report","type":"text"}]},{"type":"paragraph","content":[{"text":"After all tasks:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Compute domain averages (group by suite, average composite scores)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Compute overall score (average of domain scores — equal domain weighting)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Compute aggregate metrics","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Generate three files in ","type":"text"},{"text":"agentbench-results/{run-id}/","type":"text","marks":[{"type":"code_inline"}]},{"text":":","type":"text"}]},{"type":"paragraph","content":[{"text":"results.json","type":"text","marks":[{"type":"strong"}]},{"text":" — Machine-readable with this structure:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"json"},"content":[{"text":"{\n \"run_id\": \"20260222-143022\",\n \"timestamp\": \"2026-02-22T14:30:22Z\",\n \"platform\": \"openclaw\",\n \"mode\": \"sandboxed\",\n \"profile\": \"full\",\n \"suite_version\": \"1.0.0\",\n \"scoring_method\": \"self-scored\",\n \"overall_score\": 74,\n \"duration_ms\": 754000,\n \"task_count\": 40,\n \"metrics\": {\n \"total_tool_calls\": 187,\n \"total_errors\": 3,\n \"avg_planning_ratio\": 0.28,\n \"est_tokens\": 245000\n },\n \"domain_scores\": {},\n \"tasks\": []\n}","type":"text"}]},{"type":"paragraph","content":[{"text":"If ","type":"text"},{"text":"--strict","type":"text","marks":[{"type":"code_inline"}]},{"text":" was used, set ","type":"text"},{"text":"scoring_method","type":"text","marks":[{"type":"code_inline"}]},{"text":" to 
","type":"text"},{"text":"\"externally-verified\"","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]},{"type":"paragraph","content":[{"text":"Integrity signature","type":"text","marks":[{"type":"strong"}]},{"text":": After building results.json (without signature field), compute:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"SIG=$(echo -n \"$CONTENT\" | openssl dgst -sha256 -hmac \"agentbench-v1-{run_id}-{suite_version}-integrity\" | awk '{print $2}')","type":"text"}]},{"type":"paragraph","content":[{"text":"Add as ","type":"text"},{"text":"\"signature\"","type":"text","marks":[{"type":"code_inline"}]},{"text":" field to results.json.","type":"text"}]},{"type":"paragraph","content":[{"text":"report.md","type":"text","marks":[{"type":"strong"}]},{"text":" — Markdown summary: Overall Score, Metrics, Domain Breakdown, Task Details, Top Failures, Recommendations.","type":"text"}]},{"type":"paragraph","content":[{"text":"report.html","type":"text","marks":[{"type":"strong"}]},{"text":" — Self-contained HTML dashboard (inline CSS/JS, no external deps):","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Score display with color (green 80+, yellow 60-79, red \u003c60)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Domain cards with score bars","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Task detail table (sortable, expandable)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Top failures section","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Dark mode via prefers-color-scheme","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Footer: \"Generated by AgentBench v1.0.0 (OpenClaw) | Suite v{suite_version} | Profile: 
{profile}\"","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 5: Present Results","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Display overall score","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Show domain breakdown","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Tell user where results are saved","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Mention they can submit to https://www.agentbench.app/submit","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 6: Clean Up","type":"text"}]},{"type":"paragraph","content":[{"text":"Run teardown.sh if present. Remove temp workspace directories unless ","type":"text"},{"text":"--keep-workspace","type":"text","marks":[{"type":"code_inline"}]},{"text":" was specified.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Listing Tasks (","type":"text"},{"text":"/benchmark-list","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]},{"type":"paragraph","content":[{"text":"Read all task.yaml files, group by suite, display as:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"## file-creation (9 tasks)\n - project-scaffold [easy]\n - project-proposal [medium]\n ...","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Viewing Results (","type":"text"},{"text":"/benchmark-results","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]},{"type":"paragraph","content":[{"text":"List all directories in ","type":"text"},{"text":"agentbench-results/","type":"text","marks":[{"type":"code_inline"}]},{"text":", show run ID, date, overall score, profile, and task count for 
each.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Comparing Runs (","type":"text"},{"text":"/benchmark-compare","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]},{"type":"paragraph","content":[{"text":"Show two runs side-by-side: overall scores, domain scores, and per-task deltas. Warn if profiles differ.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Key Differences from Claude Code Version","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"No hooks","type":"text","marks":[{"type":"strong"}]},{"text":" — metrics are self-tracked (timing, tool call counting)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"No subagents","type":"text","marks":[{"type":"strong"}]},{"text":" — you execute tasks directly in sequence","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Same tasks, same scoring, same output format","type":"text","marks":[{"type":"strong"}]},{"text":" — results are cross-platform comparable","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Same integrity signature","type":"text","marks":[{"type":"strong"}]},{"text":" — submissions work on the same leaderboard","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Important Notes","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Be honest in self-evaluation (L2/L3). 
Inflated scores are obvious on the leaderboard.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"The objective layers (L0 + L1) carry 55% of the weight — they can't be faked.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Token estimates are informational only, not scored.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Any link syntax is accepted in skill graph tasks — consistency is what's scored.","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"agentbench","author":"@skillopedia","source":{"stars":8,"repo_name":"agentbench-openclaw","origin_url":"https://github.com/agentbench/agentbench-openclaw/blob/HEAD/SKILL.md","repo_owner":"agentbench","body_sha256":"9396c800fcccefbe8e0f449485b401394838c2954f5a6a1479fe5e72ae891932","cluster_key":"cff52bc9481d479fc29f6b4e0c2e4d8e987dc384c69f8cedb8d6fafee03954f4","clean_bundle":{"format":"clean-skill-bundle-v1","source":"agentbench/agentbench-openclaw/SKILL.md","attachments":[{"id":"23db29a6-f0c8-53ef-b598-621de46048f7","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/23db29a6-f0c8-53ef-b598-621de46048f7/attachment","path":".gitignore","size":62,"sha256":"7760fedbd74826d6160a44dbe7f10edaa060869a2ffa79909965af02bb39a9f5","contentType":"text/plain; charset=utf-8"},{"id":"df899c5d-a110-5965-962f-d82664716210","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/df899c5d-a110-5965-962f-d82664716210/attachment.md","path":"README.md","size":2868,"sha256":"b4ea486fd5df0a68fb2495d53891ef42365ae344da038e012513736c9dd5a42a","contentType":"text/markdown; 
charset=utf-8"},{"id":"02ffcfce-5445-5268-9bc5-743526d3f8af","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/02ffcfce-5445-5268-9bc5-743526d3f8af/attachment.sh","path":"lib/metrics.sh","size":4006,"sha256":"1e21be4a094e20a552bb13c4f243a782d0f5dc5431f19fc25a702ba18101309c","contentType":"application/x-sh; charset=utf-8"},{"id":"e0160712-b3b4-5d7a-95eb-6731dc2a07f7","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/e0160712-b3b4-5d7a-95eb-6731dc2a07f7/attachment.svg","path":"logo.svg","size":766,"sha256":"69df296e1fd2e465b42659f226149ff483fac6ea69fdff2d752b2422a9d6f4c3","contentType":"image/svg+xml"},{"id":"9acdc20a-ab32-5f31-89f5-1eb76752464e","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/9acdc20a-ab32-5f31-89f5-1eb76752464e/attachment.json","path":"skill.json","size":397,"sha256":"e1d3c403cca3c12a6294502584c44ba720e016a09ed3b1222f75a2734c394152","contentType":"application/json; charset=utf-8"},{"id":"520ff079-436b-5911-8bdb-80e3ff4a6d43","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/520ff079-436b-5911-8bdb-80e3ff4a6d43/attachment.csv","path":"tasks/data-analysis/cross-reference/inputs/inventory.csv","size":1173,"sha256":"ba429364df8d94362b263d10b733aacc983d59f12ebfa9304427ae712c3529ff","contentType":"text/csv; charset=utf-8"},{"id":"b766772f-c1ba-50a2-a4fb-d97a1857da92","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/b766772f-c1ba-50a2-a4fb-d97a1857da92/attachment.csv","path":"tasks/data-analysis/cross-reference/inputs/orders.csv","size":1588,"sha256":"1d9039ed2dc9fe17d1717acd331b7fed49979bd041c9e44ec533ba6112c6c350","contentType":"text/csv; charset=utf-8"},{"id":"3da34a31-1fb9-55dc-a291-9a3cf7e4c544","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/3da34a31-1fb9-55dc-a291-9a3cf7e4c544/attachment.yaml","path":"tasks/data-analysis/cross-reference/task.yaml","size":1415,"sha256":"dd9d615a2fa80a1b2a13e87ad60ef944c43d625492fe8ea4c37fc750854184d4","contentType":"application/yaml; 
charset=utf-8"},{"id":"68e5fead-925f-5b9c-8a3b-3351c73ba286","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/68e5fead-925f-5b9c-8a3b-3351c73ba286/attachment.csv","path":"tasks/data-analysis/find-anomalies/inputs/sales-data.csv","size":4255,"sha256":"ffdccd9b8470b1c3629441083e27620af302cfeeea34d2f03568d4fd68c1d52f","contentType":"text/csv; charset=utf-8"},{"id":"bb58a7b5-b456-5d8c-ad70-8297b122e06d","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/bb58a7b5-b456-5d8c-ad70-8297b122e06d/attachment.yaml","path":"tasks/data-analysis/find-anomalies/task.yaml","size":1194,"sha256":"e35b1f20f8a2bb454af35e2172404b1cabe3fc0c19f07a6093ab0fd7619cea17","contentType":"application/yaml; charset=utf-8"},{"id":"12a56c01-4fa3-5b30-a1e6-8a445bd4b162","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/12a56c01-4fa3-5b30-a1e6-8a445bd4b162/attachment.sh","path":"tasks/data-analysis/log-pattern-detection/setup.sh","size":42939,"sha256":"d7391d18aadb2046fdd28de6e360bfbb7eca14f6863dc2309be4f9b499a269f8","contentType":"application/x-sh; charset=utf-8"},{"id":"94720930-9107-5120-956b-f3b861002aac","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/94720930-9107-5120-956b-f3b861002aac/attachment.yaml","path":"tasks/data-analysis/log-pattern-detection/task.yaml","size":1954,"sha256":"604261fa47d58aa452d16fa0e4ac8409d08d9bdd3b820525ff4226d352840c14","contentType":"application/yaml; charset=utf-8"},{"id":"24eabeb1-9b85-5320-ac8d-c7efe461401b","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/24eabeb1-9b85-5320-ac8d-c7efe461401b/attachment.sh","path":"tasks/data-analysis/multi-format-reconciliation/setup.sh","size":10446,"sha256":"ea245a25dec2c483a197083a63732a43e4764ed426142f58b89e052d77cd92ae","contentType":"application/x-sh; 
charset=utf-8"},{"id":"5e90805e-143d-5ce4-939a-1e8f728b3c11","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/5e90805e-143d-5ce4-939a-1e8f728b3c11/attachment.yaml","path":"tasks/data-analysis/multi-format-reconciliation/task.yaml","size":2168,"sha256":"2b1e4d265aad7a600eafdd41545b6ea16b2cf5642c79bcab3fc153d32e1861cf","contentType":"application/yaml; charset=utf-8"},{"id":"7910e7c2-56de-5615-8f02-335243b6204e","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/7910e7c2-56de-5615-8f02-335243b6204e/attachment.csv","path":"tasks/data-analysis/summary-statistics/inputs/survey-results.csv","size":2957,"sha256":"ef3b7254ff669b17e0cd7fdfd2c83b9ceeff9ee294d74440de6e03dfe3da4c5f","contentType":"text/csv; charset=utf-8"},{"id":"9efdd926-99b9-5bd2-84fc-ea6b314775a4","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/9efdd926-99b9-5bd2-84fc-ea6b314775a4/attachment.yaml","path":"tasks/data-analysis/summary-statistics/task.yaml","size":1279,"sha256":"545180f10f96dddb9046d8026396c9e14b5959a424c38f3769541426eed1ee64","contentType":"application/yaml; charset=utf-8"},{"id":"57047423-7d93-50da-afea-7dad62130c8b","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/57047423-7d93-50da-afea-7dad62130c8b/attachment.sh","path":"tasks/error-handling/cascading-failures/setup.sh","size":5542,"sha256":"88fc086b4416f16c6766a357603cc0fc4b5762381104ccffee0d0b102e0f51b8","contentType":"application/x-sh; charset=utf-8"},{"id":"def77a9e-05bd-5e18-9d1a-cc8472e709e6","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/def77a9e-05bd-5e18-9d1a-cc8472e709e6/attachment.yaml","path":"tasks/error-handling/cascading-failures/task.yaml","size":1398,"sha256":"b33785ae7e15251eb532fcb7a05ee82f7be101cf9ea5580d438d9f3580a5e6e0","contentType":"application/yaml; 
charset=utf-8"},{"id":"708b8578-f920-50fb-9c3f-880a7c1da900","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/708b8578-f920-50fb-9c3f-880a7c1da900/attachment.json","path":"tasks/error-handling/corrupted-input/inputs/data.json","size":751,"sha256":"ba587ddfea40797fa830543089b2ab0f20ad75e95aa89cf04a2a511d9d4b735c","contentType":"application/json; charset=utf-8"},{"id":"12a88ef8-93ae-5a3c-8852-99999b47622d","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/12a88ef8-93ae-5a3c-8852-99999b47622d/attachment.yaml","path":"tasks/error-handling/corrupted-input/task.yaml","size":1335,"sha256":"8c881ac02878c6e441faee7b125796c9aa24ba1d4b47222a679f33b8a81d412e","contentType":"application/yaml; charset=utf-8"},{"id":"6f28f65b-44f5-513d-868b-ba44d929baa7","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/6f28f65b-44f5-513d-868b-ba44d929baa7/attachment","path":"tasks/error-handling/impossible-request/inputs/.gitkeep","size":0,"sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","contentType":"text/plain; charset=utf-8"},{"id":"9f51946e-7e46-56b2-b8c8-3a13d9fa5db6","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/9f51946e-7e46-56b2-b8c8-3a13d9fa5db6/attachment.yaml","path":"tasks/error-handling/impossible-request/task.yaml","size":1408,"sha256":"0e96b8ade34d2e73e14b9791f86e6af7e8f4e3656341190808842db7c09fc1fa","contentType":"application/yaml; charset=utf-8"},{"id":"d361588e-8c6b-5280-b749-30d6c592d9f5","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/d361588e-8c6b-5280-b749-30d6c592d9f5/attachment.sh","path":"tasks/error-handling/misleading-error-message/setup.sh","size":6459,"sha256":"9a8bd8a04e251f99be41b852e0cf474f41266ec09961493bcd12d1c97dcd1689","contentType":"application/x-sh; 
charset=utf-8"},{"id":"f8c84615-5b89-5bf5-a571-22dc4f92c028","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/f8c84615-5b89-5bf5-a571-22dc4f92c028/attachment.yaml","path":"tasks/error-handling/misleading-error-message/task.yaml","size":1444,"sha256":"79831715d3bace2ea8c7f24f4fbfa30cbb8cfd35da7ec4c6a5d379bc26a6882e","contentType":"application/yaml; charset=utf-8"},{"id":"b64f37bc-57c2-5ab7-865f-c20f41a9194b","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/b64f37bc-57c2-5ab7-865f-c20f41a9194b/attachment","path":"tasks/error-handling/missing-dependency/inputs/.gitkeep","size":0,"sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","contentType":"text/plain; charset=utf-8"},{"id":"6523e75f-bf5f-5da7-ae02-478757ffbb78","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/6523e75f-bf5f-5da7-ae02-478757ffbb78/attachment.yaml","path":"tasks/error-handling/missing-dependency/task.yaml","size":1156,"sha256":"4782525ed887160dbe1ffedb7e383eee14cb3fb7892d720bd3ad222ef9c93781","contentType":"application/yaml; charset=utf-8"},{"id":"deea328a-0c8c-5eb3-a260-acccd4cc7df2","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/deea328a-0c8c-5eb3-a260-acccd4cc7df2/attachment.sh","path":"tasks/error-handling/partial-recovery/setup.sh","size":7600,"sha256":"b4e1adf478bb6fb8c03ae44e8976e15576adab22da9f3436ebc18516eccc26f3","contentType":"application/x-sh; charset=utf-8"},{"id":"537a505d-0ed2-58f2-b3f2-899643cbf895","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/537a505d-0ed2-58f2-b3f2-899643cbf895/attachment.yaml","path":"tasks/error-handling/partial-recovery/task.yaml","size":1770,"sha256":"383b5a14f7d182ea10d7ca7f4776b3a4654b2372bb277d7c0a1a7c0c77bebc1c","contentType":"application/yaml; 
charset=utf-8"},{"id":"ee2b87a3-3130-5809-9e26-83f7e6229bf4","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/ee2b87a3-3130-5809-9e26-83f7e6229bf4/attachment.txt","path":"tasks/file-creation/data-spreadsheet/inputs/raw-records.txt","size":1894,"sha256":"7d66716013d717eaa3857182217d6899537c19e89bdd1278f6aab2ee7cf26d97","contentType":"text/plain; charset=utf-8"},{"id":"06a19181-cca6-587e-b18d-6d72023ea4dc","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/06a19181-cca6-587e-b18d-6d72023ea4dc/attachment.yaml","path":"tasks/file-creation/data-spreadsheet/task.yaml","size":1139,"sha256":"0c25edd018d50f47c12e3692ec25efca4b452d0d4c310648dc672203192d915a","contentType":"application/yaml; charset=utf-8"},{"id":"0046bc2c-b3a8-5e3d-955f-b9c206b4e8b9","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/0046bc2c-b3a8-5e3d-955f-b9c206b4e8b9/attachment.md","path":"tasks/file-creation/linked-project-scaffold/inputs/architecture.md","size":4402,"sha256":"bb36c63f96a0cb23a1b3c7d64ae1ad217c8d16d24739167e85de9ab7c5b034ea","contentType":"text/markdown; charset=utf-8"},{"id":"0702656d-1ef8-5272-aa23-05349f49430f","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/0702656d-1ef8-5272-aa23-05349f49430f/attachment.yaml","path":"tasks/file-creation/linked-project-scaffold/task.yaml","size":3271,"sha256":"f0d1c89cbf1f41f8239f1323b83ac0f2b1371b5aa34006b59ecf884e790abda2","contentType":"application/yaml; charset=utf-8"},{"id":"9efce833-a677-57b0-9dc0-988433309b19","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/9efce833-a677-57b0-9dc0-988433309b19/attachment.sh","path":"tasks/file-creation/migration-script/setup.sh","size":17239,"sha256":"42dbe2af4dfd80b1deadb4c9b1d80dac4e42203b09a97b11c2eea14d8fd82c0d","contentType":"application/x-sh; 
charset=utf-8"},{"id":"6e28766f-afea-5414-9c67-4223381d99c2","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/6e28766f-afea-5414-9c67-4223381d99c2/attachment.yaml","path":"tasks/file-creation/migration-script/task.yaml","size":2431,"sha256":"2ccf48e5273551d2bbec8b18b2dd552fc48a3b3f16cd8a67290cef1a9d97f088","contentType":"application/yaml; charset=utf-8"},{"id":"d21331e2-125c-56de-941b-926ddb7122c4","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/d21331e2-125c-56de-941b-926ddb7122c4/attachment.txt","path":"tasks/file-creation/pitch-deck-outline/inputs/startup-info.txt","size":4064,"sha256":"6a6c52f283c6ebb7603d95bd177d61288ea8f4deedb144b2604d45449fab0887","contentType":"text/plain; charset=utf-8"},{"id":"399099ab-1f01-5237-b8d2-81cac91189b6","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/399099ab-1f01-5237-b8d2-81cac91189b6/attachment.yaml","path":"tasks/file-creation/pitch-deck-outline/task.yaml","size":1435,"sha256":"d7d752bab942a77b46576c7d216b448501abe55a9c384c5307181f44a42e9cbd","contentType":"application/yaml; charset=utf-8"},{"id":"27aa3572-7748-5280-8659-0d8f9afcc6d1","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/27aa3572-7748-5280-8659-0d8f9afcc6d1/attachment.txt","path":"tasks/file-creation/project-proposal/inputs/project-brief.txt","size":2222,"sha256":"600e28399b4d424cf174875ae355b8a6361b9d1ee097bfe690f2d02ead38ef96","contentType":"text/plain; charset=utf-8"},{"id":"ca55adb1-eb6e-55e0-9df0-5710095b92f5","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/ca55adb1-eb6e-55e0-9df0-5710095b92f5/attachment.yaml","path":"tasks/file-creation/project-proposal/task.yaml","size":1278,"sha256":"e0d79062504912b5a2f6b8bcfde675e579d9714217f773e07946a90aa791d67f","contentType":"application/yaml; 
charset=utf-8"},{"id":"9b684586-c87f-5c43-bc6d-0a3a6b489e0d","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/9b684586-c87f-5c43-bc6d-0a3a6b489e0d/attachment.sh","path":"tasks/file-creation/project-scaffold/setup.sh","size":5015,"sha256":"f0de48508fd8a2f282854be4e87deae960bc12f110d20f0f20672134aef94949","contentType":"application/x-sh; charset=utf-8"},{"id":"1e8a1ab8-2ffd-5dd1-81b9-bcdecf04daea","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/1e8a1ab8-2ffd-5dd1-81b9-bcdecf04daea/attachment.yaml","path":"tasks/file-creation/project-scaffold/task.yaml","size":3054,"sha256":"76608c85f7aac47c3cfecf959ee44024fd3efef78ad013126f76a96884683a43","contentType":"application/yaml; charset=utf-8"},{"id":"f567b5c8-89ee-5941-b6f3-be968eb1748d","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/f567b5c8-89ee-5941-b6f3-be968eb1748d/attachment.md","path":"tasks/file-creation/skill-graph-creation/inputs/skill-requirements.md","size":2934,"sha256":"7b28a753e7c6885518304703e7844800cb672030cf65c676328263b2f1711d99","contentType":"text/markdown; charset=utf-8"},{"id":"bec27f18-415f-5642-be6f-65bedf0b8eb5","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/bec27f18-415f-5642-be6f-65bedf0b8eb5/attachment.yaml","path":"tasks/file-creation/skill-graph-creation/task.yaml","size":3916,"sha256":"084ccf01358806d736e574388d23e9214715667aaa62cd6d2a4d4508c17a9fdd","contentType":"application/yaml; charset=utf-8"},{"id":"99361533-4f98-5368-a7e1-89104ee053f8","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/99361533-4f98-5368-a7e1-89104ee053f8/attachment.sh","path":"tasks/file-creation/skill-graph-refactor/setup.sh","size":3434,"sha256":"77912439d1adca7b1913e90678ff611a746078b9bfa421acdd3c9610b284355e","contentType":"application/x-sh; 
charset=utf-8"},{"id":"da68b142-2afa-50ba-92f0-9826a16bff02","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/da68b142-2afa-50ba-92f0-9826a16bff02/attachment.yaml","path":"tasks/file-creation/skill-graph-refactor/task.yaml","size":3978,"sha256":"9921b4b04768d870acb318283dc3dc272c6c290368c46b7e1cebbfe2f51d4845","contentType":"application/yaml; charset=utf-8"},{"id":"ab239b89-8499-52dd-901f-b79f5d7dedff","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/ab239b89-8499-52dd-901f-b79f5d7dedff/attachment.txt","path":"tasks/file-creation/structured-form/inputs/form-requirements.txt","size":4222,"sha256":"692e8a030c9dec0b7fb5d531732c0ce780fccdc0adbbc75b819d3f5dbccf4801","contentType":"text/plain; charset=utf-8"},{"id":"b4443dea-5391-5fdb-abf4-024f8cdc75e2","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/b4443dea-5391-5fdb-abf4-024f8cdc75e2/attachment.yaml","path":"tasks/file-creation/structured-form/task.yaml","size":1134,"sha256":"31188b4f1ff14fa2a46a1a69dd6d0e224b3ae57ea9d8d3c6eeab3c749f392344","contentType":"application/yaml; charset=utf-8"},{"id":"5b813a08-49f2-5087-bd1a-b1a0809544fd","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/5b813a08-49f2-5087-bd1a-b1a0809544fd/attachment.sh","path":"tasks/memory/constraint-accumulation/setup.sh","size":6100,"sha256":"9cc782ea1f55bca821e83bb25585dba32895a420a01b84727919cc396b34f615","contentType":"application/x-sh; charset=utf-8"},{"id":"0f1aa3b0-16d8-5db8-ace3-8f927f7b2eed","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/0f1aa3b0-16d8-5db8-ace3-8f927f7b2eed/attachment.yaml","path":"tasks/memory/constraint-accumulation/task.yaml","size":2366,"sha256":"48fbff980c35b207d6d496dac6b457b87c410305ffc50ba49e54849606588173","contentType":"application/yaml; 
charset=utf-8"},{"id":"6506fbea-9b63-5870-bade-b7f09e3c1808","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/6506fbea-9b63-5870-bade-b7f09e3c1808/attachment.csv","path":"tasks/memory/context-retention/inputs/sales-data.csv","size":1502,"sha256":"7eb3f9b036d2d8d14161e0427b63422e6390ab32d68dd6e230e909b1bcaa6252","contentType":"text/csv; charset=utf-8"},{"id":"9ff8d383-d8c5-5b73-b235-b2610ff18ea5","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/9ff8d383-d8c5-5b73-b235-b2610ff18ea5/attachment.yaml","path":"tasks/memory/context-retention/task.yaml","size":1858,"sha256":"6d528a08219503c1b03af3c917e4d23d45f096cf0443c739803ef93a2a5fd472","contentType":"application/yaml; charset=utf-8"},{"id":"48b068b5-6104-5df1-acd4-c26ac85d7d16","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/48b068b5-6104-5df1-acd4-c26ac85d7d16/attachment.sh","path":"tasks/memory/interleaved-projects/setup.sh","size":4691,"sha256":"412aaf49a3fe541ac5e941e244fb337e06fe9cd5e505db795d076d2016c2d449","contentType":"application/x-sh; charset=utf-8"},{"id":"f026943c-bac3-5345-9ef3-105ac4a11847","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/f026943c-bac3-5345-9ef3-105ac4a11847/attachment.yaml","path":"tasks/memory/interleaved-projects/task.yaml","size":1936,"sha256":"4dc167ec88262abce3e01b57772c73374b0158838cc753c1ed60cb996ea8b754","contentType":"application/yaml; charset=utf-8"},{"id":"c44559b7-f3d3-5aa4-b03e-07efad8a3d32","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/c44559b7-f3d3-5aa4-b03e-07efad8a3d32/attachment","path":"tasks/memory/memory-organization/inputs/.gitkeep","size":0,"sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","contentType":"text/plain; 
charset=utf-8"},{"id":"80216c42-dc0d-5e6a-b52a-c859816a99af","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/80216c42-dc0d-5e6a-b52a-c859816a99af/attachment.yaml","path":"tasks/memory/memory-organization/task.yaml","size":1162,"sha256":"dd28f58bb768b68e9649b17714a018b3f747fbc8390742d5f9227e91cf448987","contentType":"application/yaml; charset=utf-8"},{"id":"71700f2a-849b-5657-8c61-d8abe0181533","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/71700f2a-849b-5657-8c61-d8abe0181533/attachment","path":"tasks/memory/recall-distraction/inputs/.gitkeep","size":0,"sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","contentType":"text/plain; charset=utf-8"},{"id":"5011548c-150b-5c15-b36f-bf9d37db2cb4","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/5011548c-150b-5c15-b36f-bf9d37db2cb4/attachment.yaml","path":"tasks/memory/recall-distraction/task.yaml","size":1556,"sha256":"1bbf939c2688fd032c81519c98a053f393dfa14d777f728777dbe580dbf0ac7c","contentType":"application/yaml; charset=utf-8"},{"id":"3daa32ca-f4c6-5a62-b7e1-d8ab9e4121b2","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/3daa32ca-f4c6-5a62-b7e1-d8ab9e4121b2/attachment.csv","path":"tasks/multi-step/data-pipeline/inputs/raw-feedback.csv","size":5942,"sha256":"c03b65b61626ed6bc27d99517f1b00ce543ed17d936458ae27141d733e1e4686","contentType":"text/csv; charset=utf-8"},{"id":"42f003a5-4661-5633-a781-3fd0a1248fc4","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/42f003a5-4661-5633-a781-3fd0a1248fc4/attachment.yaml","path":"tasks/multi-step/data-pipeline/task.yaml","size":1657,"sha256":"b3e6e62e084b2e13e75dd34385d3e9444b1d22261bff4ca1e596784d28af924f","contentType":"application/yaml; 
charset=utf-8"},{"id":"e00ce038-d4d1-55b1-abea-1a36f4b81cfa","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/e00ce038-d4d1-55b1-abea-1a36f4b81cfa/attachment.yaml","path":"tasks/multi-step/log-analysis-report/task.yaml","size":1467,"sha256":"2aca22f102dc66eaa22783defe7b33510b3a55b52c16005d56a3cfcb211b143e","contentType":"application/yaml; charset=utf-8"},{"id":"1cf591e1-afbc-527f-b76a-7825e82f9e56","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/1cf591e1-afbc-527f-b76a-7825e82f9e56/attachment.txt","path":"tasks/multi-step/meeting-to-tasks/inputs/meeting-notes.txt","size":4288,"sha256":"116bacf86c746af3404d15281d7d9460c06f09d73071ace2edcee8f26abce7e4","contentType":"text/plain; charset=utf-8"},{"id":"125de65b-c21d-5859-9862-5365994075df","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/125de65b-c21d-5859-9862-5365994075df/attachment.yaml","path":"tasks/multi-step/meeting-to-tasks/task.yaml","size":1426,"sha256":"5b83e2825b148c070869dbd80c31d940cdcee59f3c098e15e2027d11460860ef","contentType":"application/yaml; charset=utf-8"},{"id":"8514c709-9a4a-57c0-9b7c-6471121a38b2","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/8514c709-9a4a-57c0-9b7c-6471121a38b2/attachment.sh","path":"tasks/multi-step/release-preparation/setup.sh","size":6505,"sha256":"daf9405a589178da48c4f4f2a4d3b59bbf11f9bebcff06038167884c47f58a5e","contentType":"application/x-sh; charset=utf-8"},{"id":"94f46fa4-3b9d-54aa-885a-5d2f4459da0d","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/94f46fa4-3b9d-54aa-885a-5d2f4459da0d/attachment.yaml","path":"tasks/multi-step/release-preparation/task.yaml","size":2106,"sha256":"ca930d7fbb2687358b10fa91177e29a72cbd7cf3e0b2a1c358eb063dc5499c64","contentType":"application/yaml; 
charset=utf-8"},{"id":"addcd44f-f15e-5fc4-b419-a4d1ab73fc7f","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/addcd44f-f15e-5fc4-b419-a4d1ab73fc7f/attachment.sh","path":"tasks/multi-step/repo-refactor/setup.sh","size":25816,"sha256":"ebff5354dcf658fce4351109712d54504d5774cf8ea9ceb728063263d9b2be21","contentType":"application/x-sh; charset=utf-8"},{"id":"0e978c35-8d8c-5f2e-a66e-bb151ce492ee","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/0e978c35-8d8c-5f2e-a66e-bb151ce492ee/attachment.yaml","path":"tasks/multi-step/repo-refactor/task.yaml","size":2721,"sha256":"838b676250cf17e877fdfda3b474b41c60a312030ed72949a91247ad93d4fae4","contentType":"application/yaml; charset=utf-8"},{"id":"36e2c219-b92d-50e5-b15a-2e505e9b6d1e","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/36e2c219-b92d-50e5-b15a-2e505e9b6d1e/attachment.sh","path":"tasks/research/codebase-archaeology/setup.sh","size":33805,"sha256":"b5dd044e2be79784a8979d13db3cd94e483594f55704711f684aa421523015f7","contentType":"application/x-sh; charset=utf-8"},{"id":"19e2329b-75ff-5279-8b6f-40ec704f02f8","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/19e2329b-75ff-5279-8b6f-40ec704f02f8/attachment.yaml","path":"tasks/research/codebase-archaeology/task.yaml","size":2232,"sha256":"8c0ea2e7e8dfc95c9d3d9d825af8e1622f23f8529acf6d103d9b6f207aa5a3ed","contentType":"application/yaml; charset=utf-8"},{"id":"3b116bea-f726-59d4-9d7f-03ea59a80ee5","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/3b116bea-f726-59d4-9d7f-03ea59a80ee5/attachment.txt","path":"tasks/research/compare-technologies/inputs/tech-a.txt","size":3587,"sha256":"9724c2a744f91f6eb05b86684d4e430faf592f6429c541dfe2dce538fed4c434","contentType":"text/plain; 

AgentBench for OpenClaw

Benchmark your OpenClaw agent's general capabilities across 40 real-world tasks spanning 7 domains.

Commands

When the user says any of these, follow the corresponding instructions:

- — Run the full benchmark suite (all 40 tasks)
- — Run only easy+medium tasks (19 tasks)
- — Run one domain only
- — Run a single task
- — Tag results as externally verified scoring
- — List all tasks grouped by domain
- — Show results from previous runs
- — Compare two runs side-by-side

Flags are combinable:

Running a Benchmark

Step 1: Discover Tasks

Read task.yaml files from the directory…
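
The discovery step might be sketched as follows. This is a minimal illustration, not the skill's actual implementation: it assumes a layout of `<root>/<domain>/<task>/task.yaml` (the pattern seen in bundle paths such as `tasks/research/summarize-doc/task.yaml`), and the function name `discover_tasks` and the `root` parameter are hypothetical. It only locates the files and does not parse the YAML.

```python
from collections import defaultdict
from pathlib import Path


def discover_tasks(root):
    """Group task names by domain.

    Assumes (hypothetically) that each task lives at
    <root>/<domain>/<task>/task.yaml. Directories without a
    task.yaml file are ignored.
    """
    tasks_by_domain = defaultdict(list)
    # Sorting keeps the listing stable from one run to the next.
    for task_file in sorted(Path(root).glob("*/*/task.yaml")):
        task_dir = task_file.parent
        domain = task_dir.parent.name
        tasks_by_domain[domain].append(task_dir.name)
    return dict(tasks_by_domain)
```

Calling `discover_tasks("tasks")` would return a dictionary mapping each domain directory name to its task directory names, which is enough to list tasks by domain or to filter a run to one domain.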