pandas-pro — Skillopedia

Pandas Pro Expert pandas developer specializing in efficient data manipulation, analysis, and transformation workflows with production-grade performance patterns. Core Workflow 1. Assess data structure — Examine dtypes, memory usage, missing values, data quality: 2. Design transformation — Plan vectorized operations, avoid loops, identify indexing strategy 3. Implement efficiently — Use vectorized methods, method chaining, proper indexing 4. Validate results — Check dtypes, shapes, null counts, and row counts: 5. Optimize — Profile memory, apply categorical types, use chunking if needed Refer…

, np.nan, regex=True)\n\n# Replace multiple values with NaN\ndf = df.replace(['', 'N/A', 'null', 'None', '-'], np.nan)\n\n# Using na_values when reading files\ndf = pd.read_csv('file.csv', na_values=['', 'N/A', 'null', 'None', '-'])\n```\n\n---\n\n## Handling Duplicates\n\n### Detecting Duplicates\n\n```python\ndf = pd.DataFrame({\n 'id': [1, 2, 2, 3, 4, 4],\n 'name': ['Alice', 'Bob', 'Bob', 'Charlie', 'Diana', 'Diana'],\n 'email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']\n})\n\n# Check for duplicate rows (all columns)\ndf.duplicated().sum()\n\n# Check specific columns\ndf.duplicated(subset=['id']).sum()\ndf.duplicated(subset=['name', 'email']).sum()\n\n# View duplicate rows\ndf[df.duplicated(keep=False)] # All duplicates\ndf[df.duplicated(keep='first')] # Duplicates except first occurrence\ndf[df.duplicated(keep='last')] # Duplicates except last occurrence\n\n# Count duplicates per key\ndf.groupby('id').size().loc[lambda x: x > 1]\n```\n\n### Removing Duplicates\n\n```python\n# Remove duplicate rows (keep first)\ndf_clean = df.drop_duplicates()\n\n# Keep last occurrence\ndf_clean = df.drop_duplicates(keep='last')\n\n# Remove all duplicates (keep none)\ndf_clean = df.drop_duplicates(keep=False)\n\n# Based on specific columns\ndf_clean = df.drop_duplicates(subset=['id'])\ndf_clean = df.drop_duplicates(subset=['name', 'email'], keep='last')\n\n# In-place modification\ndf.drop_duplicates(inplace=True)\n```\n\n### Handling Duplicates with Aggregation\n\n```python\n# Instead of dropping, aggregate duplicates\ndf_agg = df.groupby('id').agg({\n 'name': 'first',\n 'email': lambda x: ', '.join(x.unique())\n}).reset_index()\n\n# Keep row with max/min value\ndf_best = df.loc[df.groupby('id')['score'].idxmax()]\n\n# Rank duplicates\ndf['rank'] = df.groupby('id').cumcount() + 1\n```\n\n---\n\n## Type Conversion\n\n### Checking and Converting Types\n\n```python\n# Check current types\ndf.dtypes\ndf.info()\n\n# Convert to specific type\ndf['age'] = df['age'].astype(int)\ndf['salary'] = df['salary'].astype(float)\ndf['name'] = df['name'].astype(str)\n\n# Safe conversion with errors handling\ndf['age'] = pd.to_numeric(df['age'], errors='coerce') # Invalid -> NaN\ndf['age'] = pd.to_numeric(df['age'], errors='ignore') # Keep original if invalid\n\n# Convert multiple columns\ndf = df.astype({'age': 'int64', 'salary': 'float64'})\n\n# Convert object to string (pandas 2.0+ StringDtype)\ndf['name'] = df['name'].astype('string') # Nullable string type\n```\n\n### Datetime Conversion\n\n```python\ndf = pd.DataFrame({\n 'date_str': ['2024-01-15', '2024-02-20', 'invalid', '2024-03-10'],\n 'timestamp': [1705276800, 1708387200, 1710028800, 1710028800]\n})\n\n# String to datetime\ndf['date'] = pd.to_datetime(df['date_str'], errors='coerce')\n\n# Specify format for faster parsing\ndf['date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d', errors='coerce')\n\n# Unix timestamp to datetime\ndf['datetime'] = pd.to_datetime(df['timestamp'], unit='s')\n\n# Extract components\ndf['year'] = df['date'].dt.year\ndf['month'] = df['date'].dt.month\ndf['day_of_week'] = df['date'].dt.day_name()\n\n# Handle mixed formats\ndf['date'] = pd.to_datetime(df['date_str'], format='mixed', dayfirst=False)\n```\n\n### Categorical Conversion\n\n```python\n# Convert to categorical (memory efficient for low cardinality)\ndf['department'] = df['department'].astype('category')\n\n# Ordered categorical\ndf['size'] = pd.Categorical(\n df['size'],\n categories=['Small', 'Medium', 'Large'],\n ordered=True\n)\n\n# Check memory savings\nprint(f\"Object: {df['department'].nbytes}\")\ndf['department'] = df['department'].astype('category')\nprint(f\"Category: {df['department'].nbytes}\")\n```\n\n### Nullable Integer Types (pandas 2.0+)\n\n```python\n# Standard int doesn't support NaN\n# Use nullable integer types\ndf['age'] = df['age'].astype('Int64') # Note capital I\n\n# All nullable types\ndf = df.astype({\n 'count': 'Int64', # Nullable integer\n 'price': 'Float64', # Nullable float\n 'flag': 'boolean', # Nullable boolean\n 'name': 'string', # Nullable string\n})\n\n# Convert with NA handling\ndf['age'] = pd.array([1, 2, None, 4], dtype='Int64')\n```\n\n---\n\n## String Cleaning\n\n### Common String Operations\n\n```python\ndf = pd.DataFrame({\n 'name': [' Alice ', 'BOB', 'charlie', None, 'Diana Smith'],\n 'email': ['[email protected]', 'bob@test', 'invalid', None, '[email protected]']\n})\n\n# Strip whitespace\ndf['name'] = df['name'].str.strip()\n\n# Case normalization\ndf['name'] = df['name'].str.lower()\ndf['name'] = df['name'].str.upper()\ndf['name'] = df['name'].str.title() # Title Case\n\n# Replace patterns\ndf['name'] = df['name'].str.replace(r'\\s+', ' ', regex=True) # Multiple spaces to one\ndf['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True) # Keep only digits\n\n# Extract with regex\ndf['domain'] = df['email'].str.extract(r'@(.+)

Important: agents should read /llm.txt, /llms.txt, or /.well-known/skills.json to discover the public Skillopedia API.

)\ndf['first_name'] = df['name'].str.extract(r'^(\\w+)')\n\n# Split strings\ndf[['first', 'last']] = df['name'].str.split(' ', n=1, expand=True)\n```\n\n### String Validation\n\n```python\n# Check patterns\ndf['valid_email'] = df['email'].str.match(r'^[\\w.]+@[\\w.]+\\.\\w+

Important: agents should read /llm.txt, /llms.txt, or /.well-known/skills.json to discover the public Skillopedia API.

, na=False)\n\n# String length\ndf['name_length'] = df['name'].str.len()\ndf['valid_length'] = df['name'].str.len().between(2, 50)\n\n# Contains check\ndf['has_domain'] = df['email'].str.contains('@', na=False)\n```\n\n---\n\n## Data Validation\n\n### Validation Functions\n\n```python\ndef validate_dataframe(df: pd.DataFrame) -> dict:\n \"\"\"Comprehensive DataFrame validation.\"\"\"\n report = {\n 'rows': len(df),\n 'columns': len(df.columns),\n 'duplicates': df.duplicated().sum(),\n 'missing_by_column': df.isna().sum().to_dict(),\n 'dtypes': df.dtypes.astype(str).to_dict(),\n }\n return report\n\n# Range validation\ndef validate_range(series: pd.Series, min_val, max_val) -> pd.Series:\n \"\"\"Return boolean mask for values in range.\"\"\"\n return series.between(min_val, max_val)\n\ndf['valid_age'] = validate_range(df['age'], 0, 120)\n\n# Custom validation\ndef validate_email(series: pd.Series) -> pd.Series:\n \"\"\"Validate email format.\"\"\"\n pattern = r'^[\\w.+-]+@[\\w-]+\\.[\\w.-]+

Important: agents should read /llm.txt, /llms.txt, or /.well-known/skills.json to discover the public Skillopedia API.

\n return series.str.match(pattern, na=False)\n\ndf['valid_email'] = validate_email(df['email'])\n```\n\n### Schema Validation with pandera\n\n```python\n# Using pandera for schema validation (recommended for production)\nimport pandera as pa\nfrom pandera import Column, Check\n\nschema = pa.DataFrameSchema({\n 'name': Column(str, Check.str_length(min_value=1, max_value=100)),\n 'age': Column(int, Check.in_range(0, 120)),\n 'email': Column(str, Check.str_matches(r'^[\\w.+-]+@[\\w-]+\\.[\\w.-]+

Important: agents should read /llm.txt, /llms.txt, or /.well-known/skills.json to discover the public Skillopedia API.

)),\n 'salary': Column(float, Check.greater_than(0), nullable=True),\n})\n\n# Validate DataFrame\ntry:\n schema.validate(df)\nexcept pa.errors.SchemaError as e:\n print(f\"Validation failed: {e}\")\n```\n\n---\n\n## Data Cleaning Pipeline\n\n### Method Chaining Pattern\n\n```python\ndef clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Complete data cleaning pipeline using method chaining.\"\"\"\n return (\n df\n # Make a copy\n .copy()\n # Standardize column names\n .rename(columns=lambda x: x.lower().strip().replace(' ', '_'))\n # Drop fully empty rows\n .dropna(how='all')\n # Clean string columns\n .assign(\n name=lambda x: x['name'].str.strip().str.title(),\n email=lambda x: x['email'].str.lower().str.strip(),\n )\n # Handle missing values\n .fillna({'department': 'Unknown'})\n # Convert types\n .astype({'age': 'Int64', 'department': 'category'})\n # Remove duplicates\n .drop_duplicates(subset=['email'])\n # Reset index\n .reset_index(drop=True)\n )\n\ndf_clean = clean_dataframe(df)\n```\n\n### Pipeline with Validation\n\n```python\ndef clean_and_validate(\n df: pd.DataFrame,\n required_columns: list[str],\n unique_columns: list[str] | None = None,\n) -> tuple[pd.DataFrame, dict]:\n \"\"\"Clean DataFrame and return validation report.\"\"\"\n\n # Validate required columns exist\n missing_cols = set(required_columns) - set(df.columns)\n if missing_cols:\n raise ValueError(f\"Missing required columns: {missing_cols}\")\n\n # Track cleaning stats\n stats = {\n 'initial_rows': len(df),\n 'dropped_empty': 0,\n 'dropped_duplicates': 0,\n 'filled_missing': {},\n }\n\n # Clean\n df = df.copy()\n\n # Drop empty rows\n before = len(df)\n df = df.dropna(how='all')\n stats['dropped_empty'] = before - len(df)\n\n # Handle duplicates\n if unique_columns:\n before = len(df)\n df = df.drop_duplicates(subset=unique_columns)\n stats['dropped_duplicates'] = before - len(df)\n\n stats['final_rows'] = len(df)\n\n return df, stats\n```\n\n---\n\n## Best Practices Summary\n\n1. **Always check data quality first** - Use `.info()`, `.describe()`, and missing value analysis\n2. **Document cleaning decisions** - Track what was dropped/filled and why\n3. **Use nullable types** - `Int64`, `string`, `boolean` for proper NA handling\n4. **Validate after cleaning** - Ensure data meets expectations\n5. **Use method chaining** - Readable, maintainable cleaning pipelines\n6. **Copy before modifying** - Avoid SettingWithCopyWarning\n7. **Handle edge cases** - Empty strings, whitespace, invalid formats\n\n---\n\n## Anti-Patterns to Avoid\n\n```python\n# BAD: Dropping NaN without understanding impact\ndf = df.dropna() # May lose significant data\n\n# GOOD: Investigate first, then decide\nprint(f\"Missing values: {df.isna().sum()}\")\nprint(f\"Rows affected: {df.isna().any(axis=1).sum()}\")\n# Then make informed decision\n\n# BAD: Filling without domain knowledge\ndf['age'] = df['age'].fillna(0) # Age 0 is not valid\n\n# GOOD: Use appropriate fill strategy\ndf['age'] = df['age'].fillna(df['age'].median())\n\n# BAD: Type conversion without error handling\ndf['id'] = df['id'].astype(int) # Will fail on NaN or invalid\n\n# GOOD: Safe conversion\ndf['id'] = pd.to_numeric(df['id'], errors='coerce').astype('Int64')\n```\n\n---\n\n## Related References\n\n- `dataframe-operations.md` - Selection and filtering for targeted cleaning\n- `aggregation-groupby.md` - Aggregate duplicates instead of dropping\n- `performance-optimization.md` - Efficient cleaning of large datasets\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":12969,"content_sha256":"99468b768f3bfe7a6efb24999462d80791292861c3037ca46b8073d2828490f4"},{"filename":"references/dataframe-operations.md","content":"# DataFrame Operations\n\n---\n\n## Overview\n\nDataFrame operations form the foundation of pandas work. This reference covers indexing, selection, filtering, and sorting with pandas 2.0+ best practices.\n\n---\n\n## Indexing and Selection\n\n### Label-Based Selection with `.loc[]`\n\nUse `.loc[]` for label-based indexing. Always preferred over chained indexing.\n\n```python\nimport pandas as pd\nimport numpy as np\n\n# Sample DataFrame\ndf = pd.DataFrame({\n 'name': ['Alice', 'Bob', 'Charlie', 'Diana'],\n 'age': [25, 30, 35, 28],\n 'salary': [50000, 60000, 70000, 55000],\n 'department': ['Engineering', 'Sales', 'Engineering', 'Marketing']\n}, index=['a', 'b', 'c', 'd'])\n\n# Single value\nvalue = df.loc['a', 'name'] # 'Alice'\n\n# Single row (returns Series)\nrow = df.loc['a']\n\n# Multiple rows\nrows = df.loc[['a', 'c']]\n\n# Row and column slices (inclusive on both ends)\nsubset = df.loc['a':'c', 'name':'salary']\n\n# Boolean indexing with .loc\nadults = df.loc[df['age'] >= 30]\n\n# Boolean indexing with column selection\nadults_names = df.loc[df['age'] >= 30, 'name']\n\n# Multiple conditions\nengineering_seniors = df.loc[\n (df['department'] == 'Engineering') & (df['age'] >= 30),\n ['name', 'salary']\n]\n```\n\n### Position-Based Selection with `.iloc[]`\n\nUse `.iloc[]` for integer position-based indexing.\n\n```python\n# Single value by position\nvalue = df.iloc[0, 0] # First row, first column\n\n# Single row by position\nfirst_row = df.iloc[0]\n\n# Slice rows (exclusive end, like Python)\nfirst_three = df.iloc[:3]\n\n# Specific rows and columns by position\nsubset = df.iloc[[0, 2], [0, 2]] # Rows 0,2 and columns 0,2\n\n# Range selection\nblock = df.iloc[1:3, 0:2] # Rows 1-2, columns 0-1\n```\n\n### When to Use `.loc[]` vs `.iloc[]`\n\n| Scenario | Use | Example |\n|----------|-----|---------|\n| Known column names | `.loc[]` | `df.loc[:, 'name']` |\n| Filter by condition | `.loc[]` | `df.loc[df['age'] > 25]` |\n| First/last N rows | `.iloc[]` | `df.iloc[:5]` or `df.iloc[-5:]` |\n| Specific row positions | `.iloc[]` | `df.iloc[[0, 5, 10]]` |\n| Unknown column order | `.iloc[]` | `df.iloc[:, 0]` |\n\n---\n\n## Filtering DataFrames\n\n### Boolean Masks\n\n```python\n# Single condition\nmask = df['age'] > 25\nfiltered = df[mask]\n\n# Multiple conditions (use parentheses!)\nmask = (df['age'] > 25) & (df['salary'] \u003c 65000)\nfiltered = df[mask]\n\n# OR conditions\nmask = (df['department'] == 'Engineering') | (df['department'] == 'Sales')\nfiltered = df[mask]\n\n# NOT condition\nmask = ~(df['department'] == 'Marketing')\nfiltered = df[mask]\n```\n\n### Using `.query()` for Readable Filters\n\n```python\n# Simple query - more readable for complex conditions\nresult = df.query('age > 25 and salary \u003c 65000')\n\n# Using variables with @\nmin_age = 25\nresult = df.query('age > @min_age')\n\n# String comparisons\nresult = df.query('department == \"Engineering\"')\n\n# In-list filtering\ndepts = ['Engineering', 'Sales']\nresult = df.query('department in @depts')\n\n# Complex expressions\nresult = df.query('(age > 25) and (department != \"Marketing\")')\n```\n\n### Using `.isin()` for Multiple Values\n\n```python\n# Filter by multiple values\ndepartments = ['Engineering', 'Sales']\nfiltered = df[df['department'].isin(departments)]\n\n# Negation\nfiltered = df[~df['department'].isin(departments)]\n\n# Multiple columns\nconditions = {\n 'department': ['Engineering', 'Sales'],\n 'age': [25, 30, 35]\n}\n# Filter where department is in list AND age is in list\nmask = df['department'].isin(conditions['department']) & df['age'].isin(conditions['age'])\n```\n\n### String Filtering with `.str` Accessor\n\n```python\ndf = pd.DataFrame({\n 'email': ['[email protected]', '[email protected]', '[email protected]'],\n 'name': ['Alice Smith', 'Bob Jones', 'Charlie Brown']\n})\n\n# Contains\nmask = df['email'].str.contains('example')\n\n# Starts/ends with\nmask = df['email'].str.endswith('.com')\nmask = df['name'].str.startswith('A')\n\n# Regex matching\nmask = df['email'].str.match(r'^[a-z]+@example\\.com

Important: agents should read /llm.txt, /llms.txt, or /.well-known/skills.json to discover the public Skillopedia API.

)\n\n# Case-insensitive\nmask = df['name'].str.lower().str.contains('alice')\n# Or with case parameter\nmask = df['name'].str.contains('alice', case=False)\n\n# Handle NaN in string columns\nmask = df['email'].str.contains('example', na=False)\n```\n\n---\n\n## Sorting\n\n### Basic Sorting\n\n```python\n# Sort by single column (ascending)\nsorted_df = df.sort_values('age')\n\n# Sort descending\nsorted_df = df.sort_values('age', ascending=False)\n\n# Sort by multiple columns\nsorted_df = df.sort_values(['department', 'salary'], ascending=[True, False])\n\n# Sort by index\nsorted_df = df.sort_index()\nsorted_df = df.sort_index(ascending=False)\n```\n\n### Advanced Sorting\n\n```python\n# Sort with NaN handling\ndf_with_nan = pd.DataFrame({\n 'name': ['Alice', 'Bob', 'Charlie'],\n 'score': [85.0, np.nan, 90.0]\n})\n\n# NaN at end (default)\nsorted_df = df_with_nan.sort_values('score', na_position='last')\n\n# NaN at beginning\nsorted_df = df_with_nan.sort_values('score', na_position='first')\n\n# Custom sort order using Categorical\norder = ['Marketing', 'Sales', 'Engineering']\ndf['department'] = pd.Categorical(df['department'], categories=order, ordered=True)\nsorted_df = df.sort_values('department')\n\n# Sort by computed values without adding column\nsorted_df = df.iloc[df['name'].str.len().argsort()]\n```\n\n### In-Place Sorting\n\n```python\n# Modify DataFrame in place\ndf.sort_values('age', inplace=True)\n\n# Reset index after sorting\ndf.sort_values('age', inplace=True)\ndf.reset_index(drop=True, inplace=True)\n\n# Or chain\ndf = df.sort_values('age').reset_index(drop=True)\n```\n\n---\n\n## Column Operations\n\n### Adding and Modifying Columns\n\n```python\n# Add new column\ndf['bonus'] = df['salary'] * 0.1\n\n# Conditional column with np.where\ndf['seniority'] = np.where(df['age'] >= 30, 'Senior', 'Junior')\n\n# Multiple conditions with np.select\nconditions = [\n df['age'] \u003c 25,\n df['age'] \u003c 35,\n df['age'] >= 35\n]\nchoices = ['Junior', 'Mid', 'Senior']\ndf['level'] = np.select(conditions, choices, default='Unknown')\n\n# Using .assign() for method chaining (returns new DataFrame)\ndf_new = df.assign(\n bonus=lambda x: x['salary'] * 0.1,\n total_comp=lambda x: x['salary'] + x['salary'] * 0.1\n)\n```\n\n### Renaming Columns\n\n```python\n# Rename specific columns\ndf = df.rename(columns={'name': 'full_name', 'age': 'years'})\n\n# Rename all columns with function\ndf.columns = df.columns.str.lower().str.replace(' ', '_')\n\n# Using rename with function\ndf = df.rename(columns=str.upper)\n```\n\n### Dropping Columns\n\n```python\n# Drop single column\ndf = df.drop('bonus', axis=1)\n# Or\ndf = df.drop(columns=['bonus'])\n\n# Drop multiple columns\ndf = df.drop(columns=['bonus', 'level'])\n\n# Drop columns by condition\ncols_to_drop = [col for col in df.columns if col.startswith('temp_')]\ndf = df.drop(columns=cols_to_drop)\n```\n\n### Reordering Columns\n\n```python\n# Explicit order\nnew_order = ['name', 'department', 'age', 'salary']\ndf = df[new_order]\n\n# Move specific column to front\ncols = ['salary'] + [c for c in df.columns if c != 'salary']\ndf = df[cols]\n\n# Using .reindex()\ndf = df.reindex(columns=['name', 'age', 'salary', 'department'])\n```\n\n---\n\n## Index Operations\n\n### Setting and Resetting Index\n\n```python\n# Set column as index\ndf = df.set_index('name')\n\n# Reset index back to column\ndf = df.reset_index()\n\n# Drop index completely\ndf = df.reset_index(drop=True)\n\n# Set multiple columns as index (MultiIndex)\ndf = df.set_index(['department', 'name'])\n```\n\n### Working with MultiIndex\n\n```python\n# Create MultiIndex DataFrame\ndf = pd.DataFrame({\n 'department': ['Eng', 'Eng', 'Sales', 'Sales'],\n 'team': ['Backend', 'Frontend', 'East', 'West'],\n 'headcount': [10, 8, 15, 12]\n}).set_index(['department', 'team'])\n\n# Select from MultiIndex\ndf.loc['Eng'] # All Eng rows\ndf.loc[('Eng', 'Backend')] # Specific row\n\n# Cross-section with .xs()\ndf.xs('Backend', level='team') # All Backend teams\n\n# Reset specific level\ndf.reset_index(level='team')\n```\n\n---\n\n## Copying DataFrames\n\n### When to Use `.copy()`\n\n```python\n# ALWAYS copy when modifying a subset\nsubset = df[df['age'] > 25].copy()\nsubset['new_col'] = 100 # Safe, no SettingWithCopyWarning\n\n# Without copy - may raise warning or fail silently\n# BAD:\n# subset = df[df['age'] > 25]\n# subset['new_col'] = 100 # SettingWithCopyWarning!\n\n# Deep copy (default) - copies data\ndf_copy = df.copy() # or df.copy(deep=True)\n\n# Shallow copy - shares data, only copies structure\ndf_shallow = df.copy(deep=False)\n```\n\n---\n\n## Best Practices Summary\n\n1. **Use `.loc[]` and `.iloc[]`** - Never use chained indexing\n2. **Parenthesize conditions** - `(cond1) & (cond2)` not `cond1 & cond2`\n3. **Use `.query()` for readability** - Especially with complex filters\n4. **Copy before modifying subsets** - Always use `.copy()`\n5. **Use vectorized operations** - Avoid row iteration for filtering\n6. **Handle NaN explicitly** - Use `na=False` in string operations\n7. **Prefer method chaining** - Use `.assign()` for column creation\n\n---\n\n## Anti-Patterns to Avoid\n\n```python\n# BAD: Chained indexing\ndf['A']['B'] = value # May not work, raises warning\n\n# GOOD: Use .loc\ndf.loc[:, ('A', 'B')] = value\n# Or for row selection then assignment:\ndf.loc[df['A'] > 0, 'B'] = value\n\n# BAD: Iterating for filtering\nresult = []\nfor idx, row in df.iterrows():\n if row['age'] > 25:\n result.append(row)\n\n# GOOD: Boolean indexing\nresult = df[df['age'] > 25]\n\n# BAD: Multiple separate assignments\ndf = df[df['age'] > 25]\ndf = df[df['salary'] > 50000]\n\n# GOOD: Combined filter\ndf = df[(df['age'] > 25) & (df['salary'] > 50000)]\n```\n\n---\n\n## Related References\n\n- `data-cleaning.md` - After selection, clean the data\n- `aggregation-groupby.md` - Group and aggregate filtered data\n- `performance-optimization.md` - Optimize filtering on large datasets\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":9667,"content_sha256":"ea3807160696538f33abda69d557f2810bdff7e746f01673e94f1a349dca3e3e"},{"filename":"references/merging-joining.md","content":"# Merging and Joining\n\n---\n\n## Overview\n\nCombining DataFrames is essential for working with relational data. This reference covers merge, join, concat, and advanced combination strategies with pandas 2.0+.\n\n---\n\n## Merge (SQL-Style Joins)\n\n### Basic Merge\n\n```python\nimport pandas as pd\nimport numpy as np\n\n# Sample DataFrames\nemployees = pd.DataFrame({\n 'emp_id': [1, 2, 3, 4, 5],\n 'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],\n 'dept_id': [101, 102, 101, 103, 102],\n})\n\ndepartments = pd.DataFrame({\n 'dept_id': [101, 102, 104],\n 'dept_name': ['Engineering', 'Sales', 'Marketing'],\n})\n\n# Inner join (default) - only matching rows\nresult = pd.merge(employees, departments, on='dept_id')\n\n# Explicit how parameter\nresult = pd.merge(employees, departments, on='dept_id', how='inner')\n```\n\n### Join Types\n\n```python\n# Inner join - only matching rows from both\ninner = pd.merge(employees, departments, on='dept_id', how='inner')\n# Result: 4 rows (emp_id 4 has dept_id 103 which doesn't exist in departments)\n\n# Left join - all rows from left, matching from right\nleft = pd.merge(employees, departments, on='dept_id', how='left')\n# Result: 5 rows (Diana has NaN for dept_name)\n\n# Right join - all rows from right, matching from left\nright = pd.merge(employees, departments, on='dept_id', how='right')\n# Result: 4 rows (Marketing has no employees, but is included)\n\n# Outer join - all rows from both\nouter = pd.merge(employees, departments, on='dept_id', how='outer')\n# Result: 6 rows (includes unmatched from both sides)\n\n# Cross join - cartesian product\ncross = pd.merge(employees, departments, how='cross')\n# Result: 15 rows (5 employees x 3 departments)\n```\n\n### Merging on Different Column Names\n\n```python\nemployees = pd.DataFrame({\n 'emp_id': [1, 2, 3],\n 'name': ['Alice', 'Bob', 'Charlie'],\n 'department': [101, 102, 101],\n})\n\ndepartments = pd.DataFrame({\n 'id': [101, 102],\n 'dept_name': ['Engineering', 'Sales'],\n})\n\n# Different column names\nresult = pd.merge(\n employees,\n departments,\n left_on='department',\n right_on='id'\n)\n\n# Drop duplicate column after merge\nresult = result.drop('id', axis=1)\n```\n\n### Merging on Multiple Columns\n\n```python\nsales = pd.DataFrame({\n 'region': ['East', 'East', 'West', 'West'],\n 'product': ['A', 'B', 'A', 'B'],\n 'sales': [100, 150, 120, 180],\n})\n\ntargets = pd.DataFrame({\n 'region': ['East', 'East', 'West'],\n 'product': ['A', 'B', 'A'],\n 'target': [90, 140, 110],\n})\n\n# Merge on multiple columns\nresult = pd.merge(sales, targets, on=['region', 'product'], how='left')\n```\n\n### Merging on Index\n\n```python\n# Set index before merge\nemployees_idx = employees.set_index('emp_id')\nsalaries = pd.DataFrame({\n 'emp_id': [1, 2, 3, 4],\n 'salary': [80000, 75000, 70000, 65000],\n}).set_index('emp_id')\n\n# Merge on index\nresult = pd.merge(employees_idx, salaries, left_index=True, right_index=True)\n\n# Mix of column and index\nresult = pd.merge(\n employees,\n salaries,\n left_on='emp_id',\n right_index=True\n)\n```\n\n---\n\n## Handling Duplicate Columns\n\n### Suffixes\n\n```python\ndf1 = pd.DataFrame({\n 'id': [1, 2, 3],\n 'value': [10, 20, 30],\n 'date': ['2024-01-01', '2024-01-02', '2024-01-03'],\n})\n\ndf2 = pd.DataFrame({\n 'id': [1, 2, 3],\n 'value': [100, 200, 300],\n 'date': ['2024-02-01', '2024-02-02', '2024-02-03'],\n})\n\n# Default suffixes\nresult = pd.merge(df1, df2, on='id')\n# Columns: id, value_x, date_x, value_y, date_y\n\n# Custom suffixes\nresult = pd.merge(df1, df2, on='id', suffixes=('_jan', '_feb'))\n# Columns: id, value_jan, date_jan, value_feb, date_feb\n```\n\n### Validate Merge Cardinality\n\n```python\n# Validate merge relationships (pandas 2.0+)\n# Raises MergeError if validation fails\n\n# One-to-one: each key appears at most once in both DataFrames\nresult = pd.merge(df1, df2, on='id', validate='one_to_one') # or '1:1'\n\n# One-to-many: keys unique in left only\nresult = pd.merge(employees, salaries, on='emp_id', validate='one_to_many') # or '1:m'\n\n# Many-to-one: keys unique in right only\nresult = pd.merge(salaries, employees, on='emp_id', validate='many_to_one') # or 'm:1'\n\n# Many-to-many: no uniqueness requirement (default)\nresult = pd.merge(df1, df2, on='id', validate='many_to_many') # or 'm:m'\n```\n\n### Indicator Column\n\n```python\n# Add indicator column showing source of each row\nresult = pd.merge(\n employees,\n departments,\n on='dept_id',\n how='outer',\n indicator=True\n)\n# _merge column values: 'left_only', 'right_only', 'both'\n\n# Custom indicator name\nresult = pd.merge(\n employees,\n departments,\n on='dept_id',\n how='outer',\n indicator='source'\n)\n\n# Filter by indicator\nleft_only = result[result['_merge'] == 'left_only']\nboth = result[result['_merge'] == 'both']\n```\n\n---\n\n## Join (Index-Based)\n\n### DataFrame.join()\n\n```python\n# join() is for index-based joining (simpler syntax)\nemployees = pd.DataFrame({\n 'name': ['Alice', 'Bob', 'Charlie'],\n 'dept_id': [101, 102, 101],\n}, index=[1, 2, 3])\n\nsalaries = pd.DataFrame({\n 'salary': [80000, 75000, 70000],\n 'bonus': [5000, 4000, 3500],\n}, index=[1, 2, 3])\n\n# Join on index\nresult = employees.join(salaries)\n\n# Join types (same as merge)\nresult = employees.join(salaries, how='left')\nresult = employees.join(salaries, how='outer')\n```\n\n### Join on Column to Index\n\n```python\nemployees = pd.DataFrame({\n 'name': ['Alice', 'Bob', 'Charlie'],\n 'dept_id': [101, 102, 101],\n})\n\ndepartments = pd.DataFrame({\n 'dept_name': ['Engineering', 'Sales'],\n}, index=[101, 102])\n\n# Join left column to right index\nresult = employees.join(departments, on='dept_id')\n```\n\n### Join Multiple DataFrames\n\n```python\ndf1 = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])\ndf2 = pd.DataFrame({'b': [3, 4]}, index=['x', 'y'])\ndf3 = pd.DataFrame({'c': [5, 6]}, index=['x', 'y'])\n\n# Join multiple at once\nresult = df1.join([df2, df3])\n\n# With suffixes for duplicate columns\nresult = df1.join([df2, df3], lsuffix='_1', rsuffix='_2')\n```\n\n---\n\n## Concat (Stacking DataFrames)\n\n### Vertical Concatenation (Row-wise)\n\n```python\n# Stack DataFrames vertically\ndf1 = pd.DataFrame({\n 'name': ['Alice', 'Bob'],\n 'age': [25, 30],\n})\n\ndf2 = pd.DataFrame({\n 'name': ['Charlie', 'Diana'],\n 'age': [35, 28],\n})\n\n# Basic concat (axis=0 is default)\nresult = pd.concat([df1, df2])\n\n# Reset index\nresult = pd.concat([df1, df2], ignore_index=True)\n\n# Keep track of source\nresult = pd.concat([df1, df2], keys=['source1', 'source2'])\n# Creates MultiIndex\n```\n\n### Horizontal Concatenation (Column-wise)\n\n```python\nnames = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie']})\nages = pd.DataFrame({'age': [25, 30, 35]})\nsalaries = pd.DataFrame({'salary': [50000, 60000, 70000]})\n\n# Concat columns (axis=1)\nresult = pd.concat([names, ages, salaries], axis=1)\n```\n\n### Handling Mismatched Columns\n\n```python\ndf1 = pd.DataFrame({\n 'name': ['Alice', 'Bob'],\n 'age': [25, 30],\n})\n\ndf2 = pd.DataFrame({\n 'name': ['Charlie', 'Diana'],\n 'salary': [70000, 65000],\n})\n\n# Outer join (default) - include all columns\nresult = pd.concat([df1, df2])\n# age and salary columns have NaN where not present\n\n# Inner join - only common columns\nresult = pd.concat([df1, df2], join='inner')\n# Only 'name' column\n```\n\n### Concat with Verification\n\n```python\n# Verify no index overlap\ntry:\n result = pd.concat([df1, df2], verify_integrity=True)\nexcept ValueError as e:\n print(f\"Index overlap detected: {e}\")\n\n# Alternative: use ignore_index\nresult = pd.concat([df1, df2], ignore_index=True)\n```\n\n---\n\n## Combine and Update\n\n### combine_first() - Fill Gaps\n\n```python\n# Fill NaN values from another DataFrame\ndf1 = pd.DataFrame({\n 'A': [1, np.nan, 3],\n 'B': [np.nan, 2, 3],\n}, index=['a', 'b', 'c'])\n\ndf2 = pd.DataFrame({\n 'A': [10, 20, 30],\n 'B': [10, 20, 30],\n}, index=['a', 'b', 'c'])\n\n# Fill NaN in df1 with values from df2\nresult = df1.combine_first(df2)\n# A: [1, 20, 3], B: [10, 2, 3]\n```\n\n### update() - In-Place Update\n\n```python\ndf1 = pd.DataFrame({\n 'A': [1, 2, 3],\n 'B': [4, 5, 6],\n}, index=['a', 'b', 'c'])\n\ndf2 = pd.DataFrame({\n 'A': [10, 20],\n 'B': [40, 50],\n}, index=['a', 'b'])\n\n# Update df1 with values from df2 (in-place)\ndf1.update(df2)\n# df1 now has A: [10, 20, 3], B: [40, 50, 6]\n\n# Only update where df2 has non-NaN\ndf1.update(df2, overwrite=False) # Don't overwrite existing values\n```\n\n---\n\n## Advanced Merge Patterns\n\n### Merge with Aggregation\n\n```python\n# Merge and aggregate in one operation\norders = pd.DataFrame({\n 'order_id': [1, 2, 3, 4],\n 'customer_id': [101, 102, 101, 103],\n 'amount': [100, 200, 150, 300],\n})\n\ncustomers = pd.DataFrame({\n 'customer_id': [101, 102, 103],\n 'name': ['Alice', 'Bob', 'Charlie'],\n})\n\n# Get customer summary\ncustomer_summary = orders.groupby('customer_id').agg(\n total_orders=('order_id', 'count'),\n total_amount=('amount', 'sum'),\n).reset_index()\n\n# Merge with customer info\nresult = pd.merge(customers, customer_summary, on='customer_id')\n```\n\n### Merge Asof (Nearest Match)\n\n```python\n# Merge on nearest key (useful for time series)\ntrades = pd.DataFrame({\n 'time': pd.to_datetime(['2024-01-01 10:00:01', '2024-01-01 10:00:03', '2024-01-01 10:00:05']),\n 'ticker': ['AAPL', 'AAPL', 'AAPL'],\n 'price': [150.0, 151.0, 150.5],\n})\n\nquotes = pd.DataFrame({\n 'time': pd.to_datetime(['2024-01-01 10:00:00', '2024-01-01 10:00:02', '2024-01-01 10:00:04']),\n 'ticker': ['AAPL', 'AAPL', 'AAPL'],\n 'bid': [149.5, 150.5, 150.0],\n 'ask': [150.5, 151.5, 151.0],\n})\n\n# Merge asof - find nearest quote for each trade\nresult = pd.merge_asof(\n trades.sort_values('time'),\n quotes.sort_values('time'),\n on='time',\n by='ticker',\n direction='backward' # Use most recent quote\n)\n```\n\n### Conditional Merge\n\n```python\n# Merge with conditions beyond key equality\n# First merge, then filter\n\nproducts = pd.DataFrame({\n 'product_id': [1, 2, 3],\n 'name': ['Widget', 'Gadget', 'Gizmo'],\n 'category': ['A', 'B', 'A'],\n})\n\ndiscounts = pd.DataFrame({\n 'category': ['A', 'A', 'B'],\n 'min_qty': [10, 50, 20],\n 'discount': [0.05, 0.10, 0.08],\n})\n\n# Cross merge then filter\nmerged = pd.merge(products, discounts, on='category')\n# Then apply quantity-based filtering as needed\n```\n\n---\n\n## Performance Considerations\n\n### Pre-sorting for Merge\n\n```python\n# Sort keys before merge for better performance\ndf1 = df1.sort_values('key')\ndf2 = df2.sort_values('key')\n\n# Merge sorted DataFrames\nresult = pd.merge(df1, df2, on='key')\n```\n\n### Index Alignment\n\n```python\n# Using index for merge is often faster than columns\ndf1 = df1.set_index('key')\ndf2 = df2.set_index('key')\n\n# Join on index\nresult = df1.join(df2)\n```\n\n### Memory-Efficient Merge\n\n```python\n# For large DataFrames, reduce memory before merge\n# Convert to appropriate types\ndf1['key'] = df1['key'].astype('int32') # Instead of int64\ndf1['category'] = df1['category'].astype('category')\n\n# Select only needed columns\ncols_needed = ['key', 'value1', 'value2']\nresult = pd.merge(df1[cols_needed], df2[cols_needed], on='key')\n```\n\n---\n\n## Common Merge Patterns\n\n### Left Join with Null Check\n\n```python\n# Find unmatched rows after left join\nresult = pd.merge(employees, departments, on='dept_id', how='left')\nunmatched = result[result['dept_name'].isna()]\n```\n\n### Anti-Join (Rows Not in Other)\n\n```python\n# Find employees NOT in a specific department list\ndept_list = [101, 102]\n\n# Method 1: Using isin\nnot_in_depts = employees[~employees['dept_id'].isin(dept_list)]\n\n# Method 2: Using merge with indicator\nmerged = pd.merge(\n employees,\n pd.DataFrame({'dept_id': dept_list}),\n on='dept_id',\n how='left',\n indicator=True\n)\nnot_in_depts = merged[merged['_merge'] == 'left_only']\n```\n\n### Self-Join\n\n```python\n# Find pairs within same department\nemployees = pd.DataFrame({\n 'emp_id': [1, 2, 3, 4],\n 'name': ['Alice', 'Bob', 'Charlie', 'Diana'],\n 'dept_id': [101, 101, 102, 101],\n})\n\n# Self-join to find pairs\npairs = pd.merge(\n employees,\n employees,\n on='dept_id',\n suffixes=('_1', '_2')\n)\n# Remove self-pairs and duplicates\npairs = pairs[pairs['emp_id_1'] \u003c pairs['emp_id_2']]\n```\n\n---\n\n## Best Practices Summary\n\n1. **Choose the right join type** - Default inner may drop data\n2. **Validate cardinality** - Use `validate` parameter\n3. **Use indicator** - Debug unexpected results\n4. **Handle duplicates** - Use meaningful suffixes\n5. **Pre-sort for performance** - Especially for large DataFrames\n6. **Reset index after operations** - Keep DataFrames usable\n7. **Check for NaN after join** - Understand unmatched rows\n\n---\n\n## Anti-Patterns to Avoid\n\n```python\n# BAD: Merge without understanding cardinality\nresult = pd.merge(df1, df2, on='key') # May explode row count\n\n# GOOD: Validate relationship\nresult = pd.merge(df1, df2, on='key', validate='one_to_one')\n\n# BAD: Repeated merges\nresult = pd.merge(df1, df2, on='key')\nresult = pd.merge(result, df3, on='key')\nresult = pd.merge(result, df4, on='key')\n\n# GOOD: Chain or use reduce\nfrom functools import reduce\ndfs = [df1, df2, df3, df4]\nresult = reduce(lambda left, right: pd.merge(left, right, on='key'), dfs)\n\n# BAD: Ignoring merge indicators\nresult = pd.merge(df1, df2, on='key', how='outer')\n\n# GOOD: Check merge results\nresult = pd.merge(df1, df2, on='key', how='outer', indicator=True)\nprint(result['_merge'].value_counts())\n```\n\n---\n\n## Related References\n\n- `dataframe-operations.md` - Filter before/after merge\n- `aggregation-groupby.md` - Aggregate before merging\n- `performance-optimization.md` - Optimize large merges\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":13517,"content_sha256":"f51f06377908a6aaa9ed498a673d993826bd6d642e3d6af2c26878dd9c80bb87"},{"filename":"references/performance-optimization.md","content":"# Performance Optimization\n\n---\n\n## Overview\n\nOptimizing pandas performance is critical for production workflows. This reference covers memory optimization, vectorization, chunking, and profiling with pandas 2.0+.\n\n---\n\n## Memory Analysis\n\n### Checking Memory Usage\n\n```python\nimport pandas as pd\nimport numpy as np\n\ndf = pd.DataFrame({\n 'id': range(1_000_000),\n 'name': ['user_' + str(i) for i in range(1_000_000)],\n 'category': np.random.choice(['A', 'B', 'C', 'D'], 1_000_000),\n 'value': np.random.randn(1_000_000),\n 'count': np.random.randint(0, 100, 1_000_000),\n})\n\n# Basic memory info\nprint(df.info(memory_usage='deep'))\n\n# Detailed memory by column\nmemory_usage = df.memory_usage(deep=True)\nprint(memory_usage)\nprint(f\"Total: {memory_usage.sum() / 1e6:.2f} MB\")\n\n# Memory as percentage of total\nmemory_pct = (memory_usage / memory_usage.sum() * 100).round(2)\nprint(memory_pct)\n```\n\n### Memory Profiling Function\n\n```python\ndef memory_profile(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Profile memory usage by column with optimization suggestions.\"\"\"\n memory_bytes = df.memory_usage(deep=True)\n\n profile = pd.DataFrame({\n 'dtype': df.dtypes,\n 'non_null': df.count(),\n 'null_count': df.isna().sum(),\n 'unique': df.nunique(),\n 'memory_mb': (memory_bytes / 1e6).round(3),\n })\n\n # Add optimization suggestions\n suggestions = []\n for col in df.columns:\n dtype = df[col].dtype\n nunique = df[col].nunique()\n\n if dtype == 'object':\n if nunique / len(df) \u003c 0.5: # Less than 50% unique\n suggestions.append(f\"Convert to category (only {nunique} unique)\")\n else:\n suggestions.append(\"Consider string dtype\")\n elif dtype == 'int64':\n if df[col].max() \u003c 2**31 and df[col].min() >= -2**31:\n suggestions.append(\"Downcast to int32\")\n if df[col].max() \u003c 2**15 and df[col].min() >= -2**15:\n suggestions.append(\"Downcast to int16\")\n elif dtype == 'float64':\n suggestions.append(\"Consider float32 if precision allows\")\n else:\n suggestions.append(\"OK\")\n\n profile['suggestion'] = suggestions\n return profile\n\nprint(memory_profile(df))\n```\n\n---\n\n## Memory Optimization Techniques\n\n### Downcasting Numeric Types\n\n```python\n# Automatic downcasting for integers\ndf['count'] = pd.to_numeric(df['count'], downcast='integer')\n\n# Automatic downcasting for floats\ndf['value'] = pd.to_numeric(df['value'], downcast='float')\n\n# Manual downcasting function\ndef downcast_dtypes(df: pd.DataFrame) -> pd.DataFrame:\n \"\"\"Reduce memory by downcasting numeric types.\"\"\"\n df = df.copy()\n\n for col in df.select_dtypes(include=['int']).columns:\n df[col] = pd.to_numeric(df[col], downcast='integer')\n\n for col in df.select_dtypes(include=['float']).columns:\n df[col] = pd.to_numeric(df[col], downcast='float')\n\n return df\n\ndf_optimized = downcast_dtypes(df)\nprint(f\"Before: {df.memory_usage(deep=True).sum() / 1e6:.2f} MB\")\nprint(f\"After: {df_optimized.memory_usage(deep=True).sum() / 1e6:.2f} MB\")\n```\n\n### Using Categorical Type\n\n```python\n# Convert low-cardinality string columns to category\n# Especially effective when unique values \u003c\u003c total rows\n\n# Before\nprint(f\"Object dtype: {df['category'].memory_usage(deep=True) / 1e6:.2f} MB\")\n\n# After\ndf['category'] = df['category'].astype('category')\nprint(f\"Category dtype: {df['category'].memory_usage(deep=True) / 1e6:.2f} MB\")\n\n# Automatic conversion for low-cardinality columns\ndef optimize_categories(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:\n \"\"\"Convert object columns to category if unique ratio \u003c threshold.\"\"\"\n df = df.copy()\n\n for col in df.select_dtypes(include=['object']).columns:\n unique_ratio = df[col].nunique() / len(df)\n if unique_ratio \u003c threshold:\n df[col] = df[col].astype('category')\n\n return df\n```\n\n### Sparse Data Types\n\n```python\n# For data with many repeated values (especially zeros/NaN)\nsparse_series = pd.arrays.SparseArray([0, 0, 1, 0, 0, 0, 2, 0, 0, 0])\n\n# Create sparse DataFrame\ndf_sparse = pd.DataFrame({\n 'sparse_col': pd.arrays.SparseArray([0] * 9000 + [1] * 1000),\n 'dense_col': [0] * 9000 + [1] * 1000,\n})\n\nprint(f\"Sparse: {df_sparse['sparse_col'].memory_usage() / 1e6:.4f} MB\")\nprint(f\"Dense: {df_sparse['dense_col'].memory_usage() / 1e6:.4f} MB\")\n```\n\n### Nullable Types (pandas 2.0+)\n\n```python\n# Use nullable types for proper NA handling with memory efficiency\ndf = df.astype({\n 'id': 'Int32', # Nullable int32\n 'count': 'Int16', # Nullable int16\n 'value': 'Float32', # Nullable float32\n 'name': 'string', # Nullable string (more memory efficient)\n 'category': 'category', # Categorical\n})\n\n# Arrow-backed types for even better memory (pandas 2.0+)\ndf['name'] = df['name'].astype('string[pyarrow]')\ndf['category'] = df['category'].astype('category')\n```\n\n---\n\n## Vectorization\n\n### Replace Loops with Vectorized Operations\n\n```python\n# BAD: Row iteration (extremely slow)\nresult = []\nfor idx, row in df.iterrows():\n if row['value'] > 0:\n result.append(row['value'] * 2)\n else:\n result.append(0)\ndf['result'] = result\n\n# GOOD: Vectorized with np.where\ndf['result'] = np.where(df['value'] > 0, df['value'] * 2, 0)\n\n# GOOD: Vectorized with boolean indexing\ndf['result'] = 0\ndf.loc[df['value'] > 0, 'result'] = df.loc[df['value'] > 0, 'value'] * 2\n```\n\n### Multiple Conditions with np.select\n\n```python\n# BAD: Nested if-else in apply\ndef categorize(row):\n if row['value'] \u003c -1:\n return 'very_low'\n elif row['value'] \u003c 0:\n return 'low'\n elif row['value'] \u003c 1:\n return 'medium'\n else:\n return 'high'\n\ndf['category'] = df.apply(categorize, axis=1) # SLOW!\n\n# GOOD: Vectorized with np.select\nconditions = [\n df['value'] \u003c -1,\n df['value'] \u003c 0,\n df['value'] \u003c 1,\n]\nchoices = ['very_low', 'low', 'medium']\ndf['category'] = np.select(conditions, choices, default='high')\n```\n\n### String Operations - Vectorized\n\n```python\n# BAD: Apply for string operations\ndf['upper_name'] = df['name'].apply(lambda x: x.upper())\n\n# GOOD: Vectorized string methods\ndf['upper_name'] = df['name'].str.upper()\n\n# Combine multiple string operations\ndf['processed'] = (\n df['name']\n .str.strip()\n .str.lower()\n .str.replace(r'\\s+', '_', regex=True)\n)\n```\n\n### Avoid apply() When Possible\n\n```python\n# BAD: apply for row-wise calculation\ndf['total'] = df.apply(lambda row: row['a'] + row['b'] + row['c'], axis=1)\n\n# GOOD: Direct vectorized operation\ndf['total'] = df['a'] + df['b'] + df['c']\n\n# BAD: apply for element-wise operation\ndf['squared'] = df['value'].apply(lambda x: x ** 2)\n\n# GOOD: Vectorized\ndf['squared'] = df['value'] ** 2\n\n# When apply IS appropriate: complex custom logic\ndef complex_calculation(row):\n # Multiple dependencies and conditional logic\n if row['type'] == 'A':\n return row['value'] * row['multiplier'] + row['offset']\n else:\n return row['value'] / row['divisor'] - row['adjustment']\n\n# Consider rewriting as vectorized if performance critical\n```\n\n---\n\n## Chunked Processing\n\n### Reading Large Files in Chunks\n\n```python\n# Read CSV in chunks\nchunk_size = 100_000\nchunks = []\n\nfor chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):\n # Process each chunk\n processed = chunk[chunk['value'] > 0] # Filter\n processed = processed.groupby('category')['value'].sum() # Aggregate\n chunks.append(processed)\n\n# Combine results\nresult = pd.concat(chunks).groupby(level=0).sum()\n```\n\n### Chunked Processing Function\n\n```python\ndef process_large_csv(\n filepath: str,\n chunk_size: int = 100_000,\n filter_func=None,\n agg_func=None,\n) -> pd.DataFrame:\n \"\"\"Process large CSV files in chunks.\"\"\"\n results = []\n\n for chunk in pd.read_csv(filepath, chunksize=chunk_size):\n # Apply filter if provided\n if filter_func:\n chunk = filter_func(chunk)\n\n # Apply aggregation if provided\n if agg_func:\n chunk = agg_func(chunk)\n\n results.append(chunk)\n\n # Combine results\n combined = pd.concat(results, ignore_index=True)\n\n # Re-aggregate if needed\n if agg_func:\n combined = agg_func(combined)\n\n return combined\n\n# Usage\nresult = process_large_csv(\n 'large_file.csv',\n chunk_size=50_000,\n filter_func=lambda df: df[df['value'] > 0],\n agg_func=lambda df: df.groupby('category').agg({'value': 'sum'}),\n)\n```\n\n### Memory-Efficient Iteration\n\n```python\n# When you must iterate, use itertuples (not iterrows)\n# itertuples is 10-100x faster than iterrows\n\n# BAD: iterrows\nfor idx, row in df.iterrows():\n process(row['name'], row['value'])\n\n# BETTER: itertuples\nfor row in df.itertuples():\n process(row.name, row.value) # Access as attributes\n\n# BEST: Vectorized operations (avoid iteration entirely)\n```\n\n---\n\n## Query Optimization\n\n### Efficient Filtering\n\n```python\n# Order matters - filter early, compute late\n# BAD: Compute on all rows, then filter\ndf['expensive_calc'] = df['a'] * df['b'] + np.sin(df['c'])\nresult = df[df['category'] == 'A']\n\n# GOOD: Filter first, compute on subset\nmask = df['category'] == 'A'\nresult = df[mask].copy()\nresult['expensive_calc'] = result['a'] * result['b'] + np.sin(result['c'])\n```\n\n### Using query() for Performance\n\n```python\n# query() can be faster for large DataFrames (uses numexpr)\n# Traditional boolean indexing\nresult = df[(df['value'] > 0) & (df['category'] == 'A')]\n\n# query() syntax (faster for large data)\nresult = df.query('value > 0 and category == \"A\"')\n\n# With variables\nthreshold = 0\ncat = 'A'\nresult = df.query('value > @threshold and category == @cat')\n```\n\n### eval() for Complex Expressions\n\n```python\n# eval() uses numexpr for faster computation\n# Standard pandas\ndf['result'] = df['a'] + df['b'] * df['c'] - df['d']\n\n# Using eval (faster for large DataFrames)\ndf['result'] = pd.eval('df.a + df.b * df.c - df.d')\n\n# In-place with inplace parameter\ndf.eval('result = a + b * c - d', inplace=True)\n```\n\n---\n\n## GroupBy Optimization\n\n### Pre-sort for Faster GroupBy\n\n```python\n# Sort by groupby column first\ndf = df.sort_values('category')\n\n# Use sort=False since already sorted\nresult = df.groupby('category', sort=False)['value'].mean()\n```\n\n### Use Built-in Aggregations\n\n```python\n# BAD: Custom function via apply\nresult = df.groupby('category')['value'].apply(lambda x: x.mean())\n\n# GOOD: Built-in aggregation\nresult = df.groupby('category')['value'].mean()\n\n# Built-in aggregations available:\n# sum, mean, median, min, max, std, var, count, first, last, nth\n# size, sem, prod, cumsum, cummax, cummin, cumprod\n```\n\n### Observed Categories\n\n```python\n# For categorical columns, use observed=True (pandas 2.0+ default)\ndf['category'] = df['category'].astype('category')\n\n# Avoid computing for unobserved categories\nresult = df.groupby('category', observed=True)['value'].mean()\n```\n\n---\n\n## I/O Optimization\n\n### Efficient File Formats\n\n```python\n# Parquet - best for analytical workloads\ndf.to_parquet('data.parquet', compression='snappy')\ndf = pd.read_parquet('data.parquet')\n\n# Feather - best for pandas interchange\ndf.to_feather('data.feather')\ndf = pd.read_feather('data.feather')\n\n# CSV with optimizations\ndf.to_csv('data.csv', index=False)\ndf = pd.read_csv(\n 'data.csv',\n dtype={'category': 'category', 'count': 'int32'},\n usecols=['id', 'category', 'value'], # Only needed columns\n nrows=10000, # Limit rows for testing\n)\n```\n\n### Specify dtypes When Reading\n\n```python\n# Specify dtypes upfront to avoid inference overhead\ndtypes = {\n 'id': 'int32',\n 'name': 'string',\n 'category': 'category',\n 'value': 'float32',\n 'count': 'int16',\n}\n\ndf = pd.read_csv('data.csv', dtype=dtypes)\n\n# Parse dates efficiently\ndf = pd.read_csv(\n 'data.csv',\n dtype=dtypes,\n parse_dates=['date_column'],\n date_format='%Y-%m-%d', # Explicit format is faster\n)\n```\n\n---\n\n## Profiling and Benchmarking\n\n### Timing Operations\n\n```python\nimport time\n\n# Simple timing\nstart = time.time()\nresult = df.groupby('category')['value'].mean()\nelapsed = time.time() - start\nprint(f\"Elapsed: {elapsed:.4f} seconds\")\n\n# Using %%timeit in Jupyter\n# %%timeit\n# df.groupby('category')['value'].mean()\n```\n\n### Memory Profiling\n\n```python\n# Track memory before/after\nimport tracemalloc\n\ntracemalloc.start()\n\n# Your operation\ndf_result = df.groupby('category').agg({'value': 'sum'})\n\ncurrent, peak = tracemalloc.get_traced_memory()\nprint(f\"Current memory: {current / 1e6:.2f} MB\")\nprint(f\"Peak memory: {peak / 1e6:.2f} MB\")\n\ntracemalloc.stop()\n```\n\n### Comparison Template\n\n```python\ndef benchmark_operations(df: pd.DataFrame, operations: dict, n_runs: int = 5):\n \"\"\"Benchmark multiple operations.\"\"\"\n results = {}\n\n for name, func in operations.items():\n times = []\n for _ in range(n_runs):\n start = time.time()\n func(df)\n times.append(time.time() - start)\n\n results[name] = {\n 'mean': np.mean(times),\n 'std': np.std(times),\n 'min': np.min(times),\n }\n\n return pd.DataFrame(results).T\n\n# Usage\noperations = {\n 'iterrows': lambda df: [row['value'] for _, row in df.iterrows()],\n 'itertuples': lambda df: [row.value for row in df.itertuples()],\n 'vectorized': lambda df: df['value'].tolist(),\n}\n\nbenchmark_results = benchmark_operations(df.head(10000), operations)\nprint(benchmark_results)\n```\n\n---\n\n## Best Practices Summary\n\n1. **Profile first** - Identify actual bottlenecks before optimizing\n2. **Use appropriate dtypes** - int32/float32/category save memory\n3. **Vectorize everything** - Avoid loops and apply when possible\n4. **Filter early** - Reduce data before expensive operations\n5. **Chunk large files** - Process in manageable pieces\n6. **Use efficient file formats** - Parquet/Feather over CSV\n7. **Leverage built-in methods** - Faster than custom functions\n\n---\n\n## Performance Checklist\n\nBefore deploying pandas code:\n\n- [ ] Memory profiled with `memory_usage(deep=True)`\n- [ ] Dtypes optimized (downcast, categorical)\n- [ ] No iterrows/itertuples in hot paths\n- [ ] GroupBy uses built-in aggregations\n- [ ] Large files processed in chunks\n- [ ] Filters applied before computations\n- [ ] Appropriate file format used\n- [ ] Benchmarked with representative data size\n\n---\n\n## Anti-Patterns Summary\n\n| Anti-Pattern | Alternative |\n|--------------|-------------|\n| `iterrows()` for computation | Vectorized operations |\n| `apply(lambda)` for simple ops | Built-in methods |\n| Loading entire large file | Chunked reading |\n| String columns with low cardinality | Category dtype |\n| int64 for small integers | int32/int16 |\n| Multiple separate filters | Combined boolean mask |\n| Repeated groupby calls | Single groupby with multiple aggs |\n\n---\n\n## Related References\n\n- `dataframe-operations.md` - Efficient indexing and filtering\n- `aggregation-groupby.md` - Optimized aggregation patterns\n- `merging-joining.md` - Efficient merge strategies\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":15068,"content_sha256":"a34ad5df2cba5465ceea51ed61191b0eaf9adf38c65bc2ef18aacd21ca5ecb48"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"Pandas Pro","type":"text"}]},{"type":"paragraph","content":[{"text":"Expert pandas developer specializing in efficient data manipulation, analysis, and transformation workflows with production-grade performance patterns.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Core Workflow","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Assess data structure","type":"text","marks":[{"type":"strong"}]},{"text":" — Examine dtypes, memory usage, missing values, data quality:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"print(df.dtypes)\nprint(df.memory_usage(deep=True).sum() / 1e6, \"MB\")\nprint(df.isna().sum())\nprint(df.describe(include=\"all\"))","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Design transformation","type":"text","marks":[{"type":"strong"}]},{"text":" — Plan vectorized operations, avoid loops, identify indexing strategy","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Implement efficiently","type":"text","marks":[{"type":"strong"}]},{"text":" — Use vectorized methods, method chaining, proper indexing","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Validate results","type":"text","marks":[{"type":"strong"}]},{"text":" — Check dtypes, shapes, null counts, and row counts:","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"assert result.shape[0] == expected_rows, f\"Row count mismatch: {result.shape[0]}\"\nassert result.isna().sum().sum() == 0, \"Unexpected nulls after transform\"\nassert set(result.columns) == expected_cols","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Optimize","type":"text","marks":[{"type":"strong"}]},{"text":" — Profile memory, apply categorical types, use chunking if needed","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Reference Guide","type":"text"}]},{"type":"paragraph","content":[{"text":"Load detailed guidance based on context:","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Topic","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Reference","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Load When","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"DataFrame Operations","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/dataframe-operations.md","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Indexing, selection, filtering, sorting","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Data Cleaning","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/data-cleaning.md","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Missing values, duplicates, type conversion","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Aggregation & GroupBy","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/aggregation-groupby.md","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"GroupBy, pivot, crosstab, aggregation","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Merging & Joining","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/merging-joining.md","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Merge, join, concat, combine strategies","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Performance Optimization","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/performance-optimization.md","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Memory usage, vectorization, chunking","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Code Patterns","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Vectorized Operations (before/after)","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"# ❌ AVOID: row-by-row iteration\nfor i, row in df.iterrows():\n df.at[i, 'tax'] = row['price'] * 0.2\n\n# ✅ USE: vectorized assignment\ndf['tax'] = df['price'] * 0.2","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Safe Subsetting with ","type":"text"},{"text":".copy()","type":"text","marks":[{"type":"code_inline"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"# ❌ AVOID: chained indexing triggers SettingWithCopyWarning\ndf['A']['B'] = 1\n\n# ✅ USE: .loc[] with explicit copy when mutating a subset\nsubset = df.loc[df['status'] == 'active', :].copy()\nsubset['score'] = subset['score'].fillna(0)","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"GroupBy Aggregation","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"summary = (\n df.groupby(['region', 'category'], observed=True)\n .agg(\n total_sales=('revenue', 'sum'),\n avg_price=('price', 'mean'),\n order_count=('order_id', 'nunique'),\n )\n .reset_index()\n)","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Merge with Validation","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"merged = pd.merge(\n left_df, right_df,\n on=['customer_id', 'date'],\n how='left',\n validate='m:1', # asserts right key is unique\n indicator=True,\n)\nunmatched = merged[merged['_merge'] != 'both']\nprint(f\"Unmatched rows: {len(unmatched)}\")\nmerged.drop(columns=['_merge'], inplace=True)","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Missing Value Handling","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"# Forward-fill then interpolate numeric gaps\ndf['price'] = df['price'].ffill().interpolate(method='linear')\n\n# Fill categoricals with mode, numerics with median\nfor col in df.select_dtypes(include='object'):\n df[col] = df[col].fillna(df[col].mode()[0])\nfor col in df.select_dtypes(include='number'):\n df[col] = df[col].fillna(df[col].median())","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Time Series Resampling","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"daily = (\n df.set_index('timestamp')\n .resample('D')\n .agg({'revenue': 'sum', 'sessions': 'count'})\n .fillna(0)\n)","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Pivot Table","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"pivot = df.pivot_table(\n values='revenue',\n index='region',\n columns='product_line',\n aggfunc='sum',\n fill_value=0,\n margins=True,\n)","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Memory Optimization","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"# Downcast numerics and convert low-cardinality strings to categorical\ndf['category'] = df['category'].astype('category')\ndf['count'] = pd.to_numeric(df['count'], downcast='integer')\ndf['score'] = pd.to_numeric(df['score'], downcast='float')\nprint(df.memory_usage(deep=True).sum() / 1e6, \"MB after optimization\")","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Constraints","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"MUST DO","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use vectorized operations instead of loops","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Set appropriate dtypes (categorical for low-cardinality strings)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Check memory usage with ","type":"text"},{"text":".memory_usage(deep=True)","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Handle missing values explicitly (don't silently drop)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use method chaining for readability","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Preserve index integrity through operations","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Validate data quality before and after transformations","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use ","type":"text"},{"text":".copy()","type":"text","marks":[{"type":"code_inline"}]},{"text":" when modifying subsets to avoid SettingWithCopyWarning","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"MUST NOT DO","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Iterate over DataFrame rows with ","type":"text"},{"text":".iterrows()","type":"text","marks":[{"type":"code_inline"}]},{"text":" unless absolutely necessary","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use chained indexing (","type":"text"},{"text":"df['A']['B']","type":"text","marks":[{"type":"code_inline"}]},{"text":") — use ","type":"text"},{"text":".loc[]","type":"text","marks":[{"type":"code_inline"}]},{"text":" or ","type":"text"},{"text":".iloc[]","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Ignore SettingWithCopyWarning messages","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Load entire large datasets without chunking","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use deprecated methods (","type":"text"},{"text":".ix","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":".append()","type":"text","marks":[{"type":"code_inline"}]},{"text":" — use ","type":"text"},{"text":"pd.concat()","type":"text","marks":[{"type":"code_inline"}]},{"text":")","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Convert to Python lists for operations possible in pandas","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Assume data is clean without validation","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Output Templates","type":"text"}]},{"type":"paragraph","content":[{"text":"When implementing pandas solutions, provide:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Code with vectorized operations and proper indexing","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Comments explaining complex transformations","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Memory/performance considerations if dataset is large","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Data validation checks (dtypes, nulls, shapes)","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Documentation","type":"text","marks":[{"type":"link","attrs":{"href":"https://jeffallan.github.io/claude-skills/skills/data-ml/pandas-pro/","title":null}}]}]}]},"metadata":{"date":"2026-06-05","name":"pandas-pro","author":"@skillopedia","source":{"stars":9549,"repo_name":"claude-skills","origin_url":"https://github.com/jeffallan/claude-skills/blob/HEAD/skills/pandas-pro/SKILL.md","repo_owner":"jeffallan","body_sha256":"8a1689ff360be5bf05371195e95ce38f5c8a943528cced43a1bcb0096d5a2b2a","cluster_key":"9a3d8e820e5fcef2166cb283469b104f4bbbfb680103eec117c2d478b661eb42","clean_bundle":{"format":"clean-skill-bundle-v1","source":"jeffallan/claude-skills/skills/pandas-pro/SKILL.md","attachments":[{"id":"fa130bc9-f420-5273-b904-720354c486ad","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/fa130bc9-f420-5273-b904-720354c486ad/attachment.md","path":"references/aggregation-groupby.md","size":13206,"sha256":"d71f9e011ebbcc62e4b695932c8e526cfe45b37ec05088bf2995a6f49f26a68d","contentType":"text/markdown; charset=utf-8"},{"id":"4b40edb9-1842-506b-9488-37747c0e814a","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/4b40edb9-1842-506b-9488-37747c0e814a/attachment.md","path":"references/data-cleaning.md","size":12969,"sha256":"99468b768f3bfe7a6efb24999462d80791292861c3037ca46b8073d2828490f4","contentType":"text/markdown; charset=utf-8"},{"id":"db072661-dcc7-5d32-a4cd-3559c7b77e1e","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/db072661-dcc7-5d32-a4cd-3559c7b77e1e/attachment.md","path":"references/dataframe-operations.md","size":9667,"sha256":"ea3807160696538f33abda69d557f2810bdff7e746f01673e94f1a349dca3e3e","contentType":"text/markdown; charset=utf-8"},{"id":"ea9e7c8d-5b21-52bd-8820-3d6e1eef0d6f","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/ea9e7c8d-5b21-52bd-8820-3d6e1eef0d6f/attachment.md","path":"references/merging-joining.md","size":13517,"sha256":"f51f06377908a6aaa9ed498a673d993826bd6d642e3d6af2c26878dd9c80bb87","contentType":"text/markdown; charset=utf-8"},{"id":"e1ee4176-99a4-5060-9fcc-10c4a59af92c","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/e1ee4176-99a4-5060-9fcc-10c4a59af92c/attachment.md","path":"references/performance-optimization.md","size":15068,"sha256":"a34ad5df2cba5465ceea51ed61191b0eaf9adf38c65bc2ef18aacd21ca5ecb48","contentType":"text/markdown; charset=utf-8"}],"bundle_sha256":"3bb7972b9f50d68bec196cac62dd6002f1cb733e2efb246758d50bbe1416fbd2","attachment_count":5,"text_attachments":5,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":0,"excluded_attachments":[]},"cluster_size":1,"skill_md_path":"skills/pandas-pro/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"data-analytics","category_label":"Data"},"exact_dupes_collapsed_into_this":0},"license":"MIT","version":"v1","category":"data-analytics","metadata":{"role":"expert","scope":"implementation","author":"https://github.com/Jeffallan","domain":"data-ml","version":"1.1.0","triggers":"pandas, DataFrame, data manipulation, data cleaning, aggregation, groupby, merge, join, time series, data wrangling, pivot table, data transformation","output-format":"code","related-skills":"python-pro"},"import_tag":"clean-skills-v1","description":"Performs pandas DataFrame operations for data analysis, manipulation, and transformation. Use when working with pandas DataFrames, data cleaning, aggregation, merging, or time series analysis. Invoke for data manipulation tasks such as joining DataFrames on multiple keys, pivoting tables, resampling time series, handling NaN values with interpolation or forward-fill, groupby aggregations, type conversion, or performance optimization of large datasets."}},"renderedAt":1782981055266}

Important: agents should read /llm.txt, /llms.txt, or /.well-known/skills.json to discover the public Skillopedia API.