Logging & Observability Structured logging, distributed tracing, and metrics for production systems. Covers the full observability stack from log formatting to alert routing. When to Use Activate on: "structured logging", "distributed tracing", "OpenTelemetry", "OTel", "correlation ID", "log levels", "Grafana dashboard", "alerting thresholds", "SLI SLO", "Prometheus metrics", "PagerDuty integration", "observability stack", "Winston setup", "Pino logger", "log aggregation", "Datadog", "Honeycomb" NOT for: Performance profiling (CPU/memory flamegraphs) | Load testing | Database query optimizati…

+ amount)`\n\n**Why wrong**: You cannot filter, aggregate, or alert on string-interpolated data in any log aggregator. A Grafana query for `amount > 1000` requires `amount` to be a numeric field, not embedded in a sentence. String logs are write-only — you can read them but not query them at scale.\n\n**Impact**: Your 10 million daily log lines become unsearchable. MTTR (mean time to recovery) during incidents doubles because engineers grep through strings instead of filtering structured fields.\n\n**Fix:**\n```typescript\n// Bad: string log — amount is buried in text\nlogger.info(`User ${userId} purchased ${productId} for ${amount}`);\n\n// Good: structured — every field is queryable\nlogger.info({ userId, productId, amountDollars: amount / 100 }, 'purchase_completed');\n```\n\n**Consistent event naming**: Use `snake_case` verb-noun event names (`payment_processed`, `user_signed_up`, `order_failed`) as the message string. This creates a stable vocabulary for dashboards and alerts.\n\n**Shibboleth**: The log message string is for humans scanning log tails. All queryable data lives in structured fields. A logger that produces `{}` as its output shape is better than one that produces readable strings.\n\n---\n\n### Anti-Pattern 3: Log-and-Throw (Duplicate Log Entries)\n\n**Novice thinking**: Log the error, then re-throw so the caller also knows about it.\n\n```typescript\n// BAD: log-and-throw\nasync function processPayment(orderId: string) {\n try {\n return await chargeCard(orderId);\n } catch (err) {\n logger.error({ err, orderId }, 'Payment failed'); // Logged here\n throw err; // And the caller logs it again\n }\n}\n\nasync function handleCheckout(req, res) {\n try {\n await processPayment(req.body.orderId);\n } catch (err) {\n logger.error({ err }, 'Checkout failed'); // Same error logged twice\n res.status(500).json({ error: 'Checkout failed' });\n }\n}\n```\n\n**Why wrong**: Every error appears 2-5 times in your logs depending on call depth. 
Alerting on error count becomes unreliable. Incident review is confusing — engineers think there were multiple failures. Log volume costs money (Datadog charges per ingested GB).\n\n**Fix — Log only at the boundary where you handle the error:**\n```typescript\n// Good: log only where you decide what to do with the error\nasync function processPayment(orderId: string) {\n // No try-catch: let errors propagate naturally\n return await chargeCard(orderId);\n}\n\nasync function handleCheckout(req, res) {\n try {\n await processPayment(req.body.orderId);\n res.json({ success: true });\n } catch (err) {\n // One log, at the boundary where we're deciding to return 500\n logger.error({ err, orderId: req.body.orderId }, 'checkout_failed');\n res.status(500).json({ error: 'Checkout failed' });\n }\n}\n```\n\n**Rule**: Log where you handle. Don't log where you propagate. The call stack in the error object already tells you where it originated.\n\n**Shibboleth**: If you see the same `traceId` appear in more than two error log lines for a single request, you have a log-and-throw chain somewhere.\n\n## Quality Checklist\n\n```\n[ ] All log lines are JSON (no string concatenation)\n[ ] Log level strategy documented: what goes at each level\n[ ] PII/secrets redacted at logger config level, not call site\n[ ] Correlation IDs propagated on all outbound HTTP calls\n[ ] OTel SDK initialized before any other imports\n[ ] Error logs include the error object (not just message)\n[ ] No log-and-throw patterns in error handling\n[ ] DEBUG logs use conditional guards or sampling\n[ ] SLI/SLO defined for each critical user journey\n[ ] Alert routing: notify vs page threshold documented\n[ ] Runbook linked from every paging alert\n[ ] Log retention policy set (cost vs compliance)\n```\n\n## Output Artifacts\n\n1. **Logger configuration** — Pino/Winston/structlog setup with redaction rules\n2. **OTel bootstrap file** — SDK init with auto-instrumentation\n3. 
**Correlation middleware** — AsyncLocalStorage request context\n4. **Prometheus metrics module** — Counter/histogram/gauge definitions\n5. **Grafana dashboard JSON** — Four golden signals panels\n6. **Alertmanager rules YAML** — SLO-based alert definitions\n---","attachment_filenames":["references/alerting-patterns.md","references/opentelemetry-setup.md"],"attachments":[{"filename":"references/alerting-patterns.md","content":"# Alerting Patterns Reference\n\nSLI/SLO design, alert routing, fatigue prevention, and PagerDuty integration.\n\n## SLI/SLO Design\n\n### Vocabulary\n\n- **SLI** (Service Level Indicator): A measurement. \"Our 99th percentile latency.\"\n- **SLO** (Service Level Objective): A target. \"99th percentile latency < 500ms over 30 days.\"\n- **SLA** (Service Level Agreement): A contract with consequences if the SLO is missed.\n- **Error Budget**: How much failure your SLO allows. SLO of 99.9% = 43.8 minutes/month of allowed downtime.\n\n### The Four Golden Signals (Google SRE Book)\n\nEvery service needs SLIs for these:\n\n| Signal | Definition | Example Metric |\n|--------|-----------|----------------|\n| **Latency** | Time to serve requests | p99 request duration < 500ms |\n| **Traffic** | Volume of requests | requests per second |\n| **Errors** | Rate of failed requests | HTTP 5xx rate < 0.1% |\n| **Saturation** | How full the service is | CPU < 80%, queue depth < 1000 |\n\n### SLO Templates by Service Type\n\n**HTTP API:**\n```yaml\nslos:\n - name: availability\n sli: http_requests_total{status!~\"5..\"} / http_requests_total\n target: 0.999 # 99.9% — 43.8 min/month budget\n window: 30d\n\n - name: latency_p99\n sli: histogram_quantile(0.99, http_request_duration_seconds_bucket)\n target: 0.5 # \u003c 500ms at p99\n window: 30d\n\n - name: latency_p50\n sli: histogram_quantile(0.50, http_request_duration_seconds_bucket)\n target: 0.1 # \u003c 100ms at p50\n window: 30d\n```\n\n**Async Worker / Queue Consumer:**\n```yaml\nslos:\n - name: 
job_success_rate\n sli: jobs_completed_total / (jobs_completed_total + jobs_failed_total)\n target: 0.995 # 99.5%, more lenient for async\n\n - name: queue_lag\n sli: queue_oldest_unprocessed_message_age_seconds\n target: 60 # \u003c 60 seconds lag\n window: 1h # shorter window for queue health\n```\n\n### Error Budget Burn Rate Alerting\n\nBurn rate alerts are more actionable than static threshold alerts because they tell you how fast you are consuming your error budget.\n\n**Formula**: If your SLO is 99.9% over 30 days, the allowed error rate is 0.1% and the error budget is 43.2 minutes of full downtime (about 43.8 minutes over an average calendar month). Burn rate is the observed error rate divided by the allowed error rate.\n- Burn rate 1x = consuming budget at exactly the pace the SLO allows, so it lasts the full 30 days (sustainable)\n- Burn rate 14.4x = consuming the 30-day budget in about 2 days (page immediately)\n- Burn rate 6x = consuming the 30-day budget in 5 days (page within 1 hour)\n\n**Multi-window, multi-burn-rate alerts** (the recommended pattern from the Google SRE Workbook; the simplified rules below show only the long window of each pair, and the Workbook also requires a matching short window, 5m and 30m, to exceed the same threshold):\n\n```yaml\n# Prometheus alerting rules\ngroups:\n - name: slo.payment-service\n rules:\n # Fast burn: 2% of budget in 1 hour (14.4x rate, so 14.4 * 0.001 = 0.0144 error rate): page now\n - alert: PaymentServiceFastBurn\n expr: |\n (\n sum(rate(http_requests_total{service=\"payment\",status=~\"5..\"}[1h])) /\n sum(rate(http_requests_total{service=\"payment\"}[1h]))\n ) > 0.0144\n for: 2m\n labels:\n severity: critical\n team: payments\n annotations:\n summary: \"Fast error budget burn on payment-service\"\n description: \"Burning error budget at {{ $value | humanizePercentage }} error rate (burn rate above 14.4x)\"\n runbook: \"https://runbooks.internal/payment-service/fast-burn\"\n\n # Slow burn: 5% of budget in 6 hours (6x rate, so 6 * 0.001 = 0.006 error rate): ticket or Slack\n - alert: PaymentServiceSlowBurn\n expr: |\n (\n sum(rate(http_requests_total{service=\"payment\",status=~\"5..\"}[6h])) /\n sum(rate(http_requests_total{service=\"payment\"}[6h]))\n ) > 0.006\n for: 15m\n labels:\n severity: warning\n team: payments\n annotations:\n summary: \"Slow error budget burn on payment-service\"\n runbook: \"https://runbooks.internal/payment-service/slow-burn\"\n```\n\n## Alert Routing and Severity\n\n### Severity 
Taxonomy\n\nDo not use ad-hoc severity labels. Define a taxonomy and stick to it.\n\n| Severity | Definition | Response | Channel |\n|----------|-----------|----------|---------|\n| **critical** | SLO breaching now, users impacted | Page on-call immediately | PagerDuty High |\n| **warning** | Budget burning, will breach if not fixed | Ticket + Slack | PagerDuty Low / Slack |\n| **info** | Anomaly worth watching, no breach risk | Grafana annotation | Slack only |\n\n### PagerDuty Integration\n\n```yaml\n# Alertmanager config\nglobal:\n resolve_timeout: 5m\n\nreceivers:\n - name: pagerduty-critical\n pagerduty_configs:\n - routing_key: ${PAGERDUTY_INTEGRATION_KEY}\n severity: critical\n description: '{{ .GroupLabels.alertname }}: {{ .Annotations.summary }}'\n details:\n runbook: '{{ .Annotations.runbook }}'\n service: '{{ .Labels.service }}'\n environment: '{{ .Labels.env }}'\n\n - name: slack-warning\n slack_configs:\n - api_url: ${SLACK_WEBHOOK_URL}\n channel: '#alerts-warning'\n title: '{{ .GroupLabels.alertname }}'\n text: '{{ .Annotations.description }}'\n actions:\n - type: button\n text: 'Runbook'\n url: '{{ .Annotations.runbook }}'\n\n - name: slack-info\n slack_configs:\n - api_url: ${SLACK_WEBHOOK_URL}\n channel: '#alerts-info'\n\nroute:\n group_by: ['alertname', 'service', 'env']\n group_wait: 30s\n group_interval: 5m\n repeat_interval: 4h\n receiver: slack-info # default\n routes:\n - match:\n severity: critical\n receiver: pagerduty-critical\n repeat_interval: 1h # re-page every hour if not resolved\n - match:\n severity: warning\n receiver: slack-warning\n repeat_interval: 8h\n```\n\n## Alert Fatigue Prevention\n\n### The Alert Fatigue Death Spiral\n\n1. Team adds alert for every possible condition\n2. Alerts fire constantly, mostly for non-urgent things\n3. Team starts ignoring alerts\n4. Real incident fires, nobody notices\n5. Outage\n\n### Rules for Alert-Worthiness\n\nAn alert should only fire if:\n1. 
**It requires human action** — can the system fix it automatically? If yes, it should.\n2. **It cannot wait until morning** — if the on-call can sleep through it, it's not a page.\n3. **It's actionable** — is there a runbook? If not, write one or don't alert.\n4. **It's not already covered** — does a higher-level alert catch this?\n\n**Audit question**: For each alert in the last 30 days, was there a runbook entry written? If an alert fires and nobody writes anything down, it's noise.\n\n### Inhibition Rules\n\nSuppress child alerts when parent is already firing:\n\n```yaml\n# alertmanager inhibition rules\ninhibit_rules:\n # If the whole service is down, don't also alert on latency\n - source_match:\n alertname: ServiceDown\n target_match_re:\n alertname: (HighLatency|ErrorRateHigh|QueueLag)\n equal: [service, env]\n\n # If database is down, suppress application errors (they're caused by DB)\n - source_match:\n alertname: DatabaseDown\n target_match_re:\n alertname: (PaymentFailed|OrderCreateFailed)\n equal: [env]\n```\n\n### Alert Deduplication\n\nGroup alerts so the on-call gets one notification for a correlated event, not ten:\n\n```yaml\nroute:\n group_by: ['alertname', 'cluster', 'service']\n group_wait: 30s # collect related alerts for 30s before firing\n group_interval: 5m # how often to send updates on ongoing group\n repeat_interval: 4h # re-notify if still firing after 4h\n```\n\n## Runbook Template\n\nEvery paging alert must link to a runbook. Without this, the alert is not production-ready.\n\n```markdown\n# Runbook: PaymentServiceFastBurn\n\n## Summary\nPayment service is consuming its error budget at 14.4x the sustainable rate.\n\n## Impact\nUsers are experiencing payment failures. Checkout is degraded.\n\n## Immediate Steps (first 5 minutes)\n1. Check Grafana dashboard: [link]\n2. Look at recent deploys: `git log --oneline -10`\n3. 
Check downstream dependencies: Stripe, database, fraud service\n\n## Diagnosis\n\n### Is it a deploy?\n- Compare error rate before/after deploy\n- If yes: rollback with `kubectl rollout undo deploy/payment-service`\n\n### Is it a downstream dependency?\n- Check Stripe status: https://status.stripe.com\n- Check database: `kubectl exec -it postgres-0 -- psql -c \"SELECT 1\"`\n\n### Is it traffic-related?\n- Check traffic volume: unusual spike in requests?\n- Rate limiting might need adjustment\n\n## Escalation\n- Exhaust the above steps within 15 minutes\n- If unresolved: escalate to payment-team-lead\n- If infrastructure: escalate to platform-team\n\n## Resolution\nOnce resolved, write a postmortem in Notion: [link]\n```\n\n## Grafana Dashboard Design\n\n### Four Golden Signals Dashboard Layout\n\n```\nRow 1: Service Health Overview\n [Availability SLO gauge] [Error Budget remaining gauge] [Current error rate]\n\nRow 2: Traffic & Errors\n [Request rate time series] [HTTP 5xx rate time series]\n\nRow 3: Latency\n [p50/p95/p99 latency heatmap] [Latency distribution histogram]\n\nRow 4: Saturation\n [CPU/Memory usage] [Queue depth] [Connection pool usage]\n\nRow 5: Downstream Dependencies\n [Database query latency] [External API success rate]\n```\n\n### Key Grafana Panel Configs\n\n```json\n// Error rate panel — use a threshold annotation for SLO\n{\n \"type\": \"timeseries\",\n \"title\": \"HTTP Error Rate\",\n \"targets\": [{\n \"expr\": \"rate(http_requests_total{status=~'5..'}[5m]) / rate(http_requests_total[5m])\",\n \"legendFormat\": \"Error Rate\"\n }],\n \"thresholds\": {\n \"mode\": \"absolute\",\n \"steps\": [\n { \"color\": \"green\", \"value\": 0 },\n { \"color\": \"yellow\", \"value\": 0.001 }, // 0.1% warning\n { \"color\": \"red\", \"value\": 0.01 } // 1% critical\n ]\n }\n}\n```\n\n### SLO Gauge Panel\n\n```json\n{\n \"type\": \"gauge\",\n \"title\": \"30-Day Availability\",\n \"targets\": [{\n \"expr\": \"1 - 
(increase(http_requests_total{status=~'5..'}[30d]) / increase(http_requests_total[30d]))\",\n \"instant\": true\n }],\n \"options\": {\n \"minVizValue\": 0.99,\n \"maxVizValue\": 1,\n \"thresholds\": [\n { \"color\": \"red\", \"value\": 0 },\n { \"color\": \"yellow\", \"value\": 0.999 },\n { \"color\": \"green\", \"value\": 0.9995 }\n ]\n }\n}\n```\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":9897,"content_sha256":"7916a9fd5204b3e03c55c1ca8c59975e80a298da0972f7214e2ac7ec760e37af"},{"filename":"references/opentelemetry-setup.md","content":"# OpenTelemetry Setup Reference\n\nComplete configuration for OTel SDK initialization, collector deployment, and trace propagation.\n\n## SDK Initialization by Language\n\n### Node.js\n\nInstall:\n```bash\nnpm install @opentelemetry/sdk-node \\\n @opentelemetry/auto-instrumentations-node \\\n @opentelemetry/exporter-trace-otlp-http \\\n @opentelemetry/exporter-metrics-otlp-http \\\n @opentelemetry/resources \\\n @opentelemetry/semantic-conventions\n```\n\n`src/telemetry.ts` — must be the first import in your entrypoint:\n```typescript\nimport { NodeSDK } from '@opentelemetry/sdk-node';\nimport { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';\nimport { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';\nimport { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';\nimport { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';\nimport { Resource } from '@opentelemetry/resources';\nimport { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';\n\nconst resource = new Resource({\n [SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME ?? 'unknown-service',\n [SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '0.0.0',\n [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV ?? 
'development',\n});\n\nconst sdk = new NodeSDK({\n resource,\n traceExporter: new OTLPTraceExporter({\n url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT}/v1/traces`,\n headers: {\n 'Authorization': `Bearer ${process.env.OTEL_EXPORTER_TOKEN}`,\n },\n }),\n metricReader: new PeriodicExportingMetricReader({\n exporter: new OTLPMetricExporter({\n url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT}/v1/metrics`,\n }),\n exportIntervalMillis: 15000, // match Prometheus scrape interval\n }),\n instrumentations: [\n getNodeAutoInstrumentations({\n '@opentelemetry/instrumentation-fs': { enabled: false }, // too noisy\n '@opentelemetry/instrumentation-http': {\n ignoreIncomingRequestHook: (req) => req.url?.includes('/health'),\n },\n }),\n ],\n});\n\nsdk.start();\n\nprocess.on('SIGTERM', async () => {\n await sdk.shutdown();\n});\n```\n\n`src/index.ts` — always import telemetry first:\n```typescript\nimport './telemetry'; // Must be first\nimport express from 'express';\n// ... rest of app\n```\n\n### Python\n\nInstall:\n```bash\npip install opentelemetry-sdk \\\n opentelemetry-exporter-otlp \\\n opentelemetry-instrumentation-fastapi \\\n opentelemetry-instrumentation-requests \\\n opentelemetry-instrumentation-sqlalchemy\n```\n\n`app/telemetry.py`:\n```python\nfrom opentelemetry import trace, metrics\nfrom opentelemetry.sdk.trace import TracerProvider\nfrom opentelemetry.sdk.trace.export import BatchSpanProcessor\nfrom opentelemetry.sdk.metrics import MeterProvider\nfrom opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader\nfrom opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter\nfrom opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter\nfrom opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION\nimport os\n\ndef init_telemetry():\n resource = Resource.create({\n SERVICE_NAME: os.getenv(\"SERVICE_NAME\", \"unknown-service\"),\n SERVICE_VERSION: os.getenv(\"SERVICE_VERSION\", 
\"0.0.0\"),\n \"deployment.environment\": os.getenv(\"ENVIRONMENT\", \"development\"),\n })\n\n otlp_endpoint = os.getenv(\"OTEL_EXPORTER_OTLP_ENDPOINT\", \"http://otel-collector:4318\")\n\n # Traces\n tracer_provider = TracerProvider(resource=resource)\n tracer_provider.add_span_processor(\n BatchSpanProcessor(\n OTLPSpanExporter(endpoint=f\"{otlp_endpoint}/v1/traces\")\n )\n )\n trace.set_tracer_provider(tracer_provider)\n\n # Metrics\n metric_reader = PeriodicExportingMetricReader(\n OTLPMetricExporter(endpoint=f\"{otlp_endpoint}/v1/metrics\"),\n export_interval_millis=15000,\n )\n meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])\n metrics.set_meter_provider(meter_provider)\n\n# FastAPI app setup\nfrom fastapi import FastAPI\nfrom opentelemetry.instrumentation.fastapi import FastAPIInstrumentor\nfrom opentelemetry.instrumentation.requests import RequestsInstrumentor\n\napp = FastAPI()\ninit_telemetry()\nFastAPIInstrumentor.instrument_app(app)\nRequestsInstrumentor().instrument()\n```\n\n### Go\n\nInstall:\n```bash\ngo get go.opentelemetry.io/otel \\\n go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp \\\n go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp\n```\n\n```go\n// telemetry/setup.go\npackage telemetry\n\nimport (\n \"context\"\n \"os\"\n\n \"go.opentelemetry.io/otel\"\n \"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp\"\n \"go.opentelemetry.io/otel/sdk/resource\"\n \"go.opentelemetry.io/otel/sdk/trace\"\n semconv \"go.opentelemetry.io/otel/semconv/v1.21.0\"\n)\n\nfunc InitTracer(ctx context.Context) (func(context.Context) error, error) {\n exporter, err := otlptracehttp.New(ctx,\n otlptracehttp.WithEndpoint(os.Getenv(\"OTEL_EXPORTER_OTLP_ENDPOINT\")),\n )\n if err != nil {\n return nil, err\n }\n\n res := resource.NewWithAttributes(\n semconv.SchemaURL,\n semconv.ServiceName(os.Getenv(\"SERVICE_NAME\")),\n semconv.ServiceVersion(os.Getenv(\"SERVICE_VERSION\")),\n )\n\n tp := 
trace.NewTracerProvider(\n trace.WithBatcher(exporter),\n trace.WithResource(res),\n trace.WithSampler(trace.ParentBased(trace.TraceIDRatioBased(0.1))), // 10% sampling\n )\n otel.SetTracerProvider(tp)\n\n return tp.Shutdown, nil\n}\n```\n\n## OTel Collector Config\n\nDeploy the OTel Collector as a sidecar or daemonset. It receives from your services and fans out to Jaeger, Prometheus, and your cloud provider.\n\n`otel-collector-config.yaml`:\n```yaml\nreceivers:\n otlp:\n protocols:\n http:\n endpoint: 0.0.0.0:4318\n grpc:\n endpoint: 0.0.0.0:4317\n\nprocessors:\n batch:\n timeout: 5s\n send_batch_size: 512\n\n memory_limiter:\n limit_mib: 512\n spike_limit_mib: 128\n check_interval: 5s\n\n # Add environment and cluster metadata\n resource:\n attributes:\n - key: k8s.cluster.name\n value: ${CLUSTER_NAME}\n action: upsert\n\n # Filter health check spans\n filter:\n traces:\n exclude:\n match_type: regexp\n span_names: [\".*health.*\", \".*readiness.*\", \".*liveness.*\"]\n\nexporters:\n # Jaeger (for trace visualization)\n otlp/jaeger:\n endpoint: jaeger-collector:4317\n tls:\n insecure: true\n\n # Prometheus (for metrics scraping)\n prometheus:\n endpoint: 0.0.0.0:8889\n namespace: otel\n\n # Grafana Tempo (alternative to Jaeger)\n otlp/tempo:\n endpoint: tempo:4317\n tls:\n insecure: true\n\n # Cloud providers (pick one)\n otlp/datadog:\n endpoint: https://trace.agent.datadoghq.com\n headers:\n DD-API-KEY: ${DD_API_KEY}\n\nservice:\n pipelines:\n traces:\n receivers: [otlp]\n processors: [memory_limiter, filter, resource, batch]\n exporters: [otlp/jaeger]\n\n metrics:\n receivers: [otlp]\n processors: [memory_limiter, resource, batch]\n exporters: [prometheus]\n```\n\n## Span Attributes — Semantic Conventions\n\nUse the OpenTelemetry semantic conventions for span attributes. 
Do not invent your own attribute names for concepts the conventions already cover; reserve custom names (such as the `payment.*` attributes below) for business data.\n\n**HTTP spans** (auto-instrumented, but good to know). These are the older attribute names; recent semantic-convention releases rename them (for example `http.request.method` and `http.response.status_code`), so check which version your SDK emits:\n```\nhttp.method = \"POST\"\nhttp.url = \"https://api.example.com/payment\"\nhttp.status_code = 200\nhttp.route = \"/payment/:id\"\n```\n\n**Database spans** (auto-instrumented with sqlalchemy/pg):\n```\ndb.system = \"postgresql\"\ndb.name = \"payments\"\ndb.operation = \"INSERT\"\ndb.sql.table = \"orders\"\n```\n\n**Custom business spans** (add these in your own code):\n```typescript\nimport { trace, SpanStatusCode } from '@opentelemetry/api';\n\nconst tracer = trace.getTracer('payment-service');\n\n// chargeCard is the application's own payment call\nasync function processPayment(orderId: string, amountCents: number) {\n return tracer.startActiveSpan('payment.process', async (span) => {\n span.setAttributes({\n 'payment.order_id': orderId,\n 'payment.amount_cents': amountCents,\n 'payment.currency': 'USD',\n // Never set: cardNumber, cvv, or any PII\n });\n\n try {\n const result = await chargeCard(orderId, amountCents);\n span.setStatus({ code: SpanStatusCode.OK });\n span.setAttribute('payment.transaction_id', result.transactionId);\n return result;\n } catch (err) {\n // A caught value is not guaranteed to be an Error, so normalize it first\n const error = err instanceof Error ? err : new Error(String(err));\n span.recordException(error);\n span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });\n throw err;\n } finally {\n span.end();\n }\n });\n}\n```\n\n## Context Propagation\n\nThe W3C `traceparent` header propagates trace context across service boundaries.\n\nFormat: `00-{traceId-32hex}-{spanId-16hex}-{flags-2hex}`\nExample: `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`\n\n**Propagation in fetch/axios:**\n```typescript\nimport { context, propagation } from '@opentelemetry/api';\n\nasync function callDownstreamService(url: string, body: object) {\n const headers: Record\u003cstring, string> = { 'Content-Type': 'application/json' };\n\n // OTel injects traceparent/tracestate automatically if using auto-instrumentation\n // Manual injection if needed:\n propagation.inject(context.active(), headers);\n\n return fetch(url, { method: 'POST', headers, body: JSON.stringify(body) });\n}\n```\n\nWith 
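`getNodeAutoInstrumentations()`, the `http` and `undici` instrumentations handle this automatically for all outbound calls.\n\n## Sampling Strategy\n\nFull trace sampling at 100% is expensive and often unnecessary. Use head-based ratio sampling to control volume. A head sampler decides when the span starts, before the response status exists, so the error rows in the table below generally need tail-based sampling (for example the Collector's `tail_sampling` processor) or attributes that are already set at span creation:\n\n| Traffic Volume | Recommended Strategy |\n|---------------|---------------------|\n| < 10 req/s | 100% sampling |\n| 10-1000 req/s | 10% TraceIDRatioBased |\n| > 1000 req/s | 1% + always-on for errors |\n| Any | Always sample errors, never sample health checks |\n\n```typescript\nimport { Sampler, SamplingDecision } from '@opentelemetry/sdk-trace-base';\n\n// Sketch: sample spans whose start attributes already mark an error, plus 10% of the rest.\n// Caveat: http.status_code is normally set after the sampling decision, so the first check\n// only fires when the caller passes the attribute at span creation.\nclass ErrorAlwaysSampler implements Sampler {\n shouldSample(context, traceId, spanName, spanKind, attributes) {\n if (Number(attributes['http.status_code']) >= 400) {\n return { decision: SamplingDecision.RECORD_AND_SAMPLED };\n }\n // 10% of everything else; Math.random decides per span, whereas\n // TraceIdRatioBasedSampler keeps the decision consistent for a whole trace\n if (Math.random() \u003c 0.1) {\n return { decision: SamplingDecision.RECORD_AND_SAMPLED };\n }\n return { decision: SamplingDecision.NOT_RECORD };\n }\n\n // Required by the Sampler interface\n toString() {\n return 'ErrorAlwaysSampler';\n }\n}\n```\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":10328,"content_sha256":"52867fd70ea3cf4753ab29c11a883b3a81199a11b6c4829c307040856fa1f075"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"Logging & Observability","type":"text"}]},{"type":"paragraph","content":[{"text":"Structured logging, distributed tracing, and metrics for production systems. 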
Covers the full observability stack from log formatting to alert routing.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"When to Use","type":"text"}]},{"type":"paragraph","content":[{"text":"Activate on:","type":"text","marks":[{"type":"strong"}]},{"text":" \"structured logging\", \"distributed tracing\", \"OpenTelemetry\", \"OTel\", \"correlation ID\", \"log levels\", \"Grafana dashboard\", \"alerting thresholds\", \"SLI SLO\", \"Prometheus metrics\", \"PagerDuty integration\", \"observability stack\", \"Winston setup\", \"Pino logger\", \"log aggregation\", \"Datadog\", \"Honeycomb\"","type":"text"}]},{"type":"paragraph","content":[{"text":"NOT for:","type":"text","marks":[{"type":"strong"}]},{"text":" Performance profiling (CPU/memory flamegraphs) | Load testing | Database query optimization | Security auditing","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Decision Tree: What to Log at Each Level","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"mermaid"},"content":[{"text":"flowchart TD\n E[Event Occurs] --> Q1{Does it represent\\na system failure?}\n Q1 -->|Yes| Q2{Is it recoverable\\nwithout human?}\n Q2 -->|No| FATAL[FATAL: Service cannot\\ncontinue — trigger pager]\n Q2 -->|Yes| ERROR[ERROR: Operation failed,\\nwill retry or degrade]\n Q1 -->|No| Q3{Is it unexpected\\nbut not failing?}\n Q3 -->|Yes| WARN[WARN: Unusual condition,\\ncircuit breaker open,\\ndeprecation used]\n Q3 -->|No| Q4{Is it a meaningful\\nbusiness event?}\n Q4 -->|Yes| INFO[INFO: User action,\\npayment processed,\\nservice started]\n Q4 -->|No| Q5{Needed to debug\\na specific issue?}\n Q5 -->|Yes| DEBUG[DEBUG: DB queries,\\ncache hits/misses,\\nfunction inputs]\n Q5 -->|No| TRACE[TRACE: Fine-grained\\nloop iterations,\\nOTel spans]","type":"text"}]},{"type":"paragraph","content":[{"text":"Rule of thumb","type":"text","marks":[{"type":"strong"}]},{"text":": Production runs INFO and above. 
DEBUG is enabled only per service via dynamic config, never always-on in prod.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Core Patterns","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Structured Log Format (JSON)","type":"text"}]},{"type":"paragraph","content":[{"text":"Every log line must be machine-parseable. A string built by concatenation is not a structured log.","type":"text"}]},{"type":"paragraph","content":[{"text":"Node.js with Pino:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"import pino from 'pino';\n\nconst logger = pino({\n level: process.env.LOG_LEVEL ?? 'info',\n base: {\n service: 'payment-service',\n version: process.env.SERVICE_VERSION,\n env: process.env.NODE_ENV,\n },\n redact: {\n paths: ['req.headers.authorization', 'body.password', 'body.cardNumber', '*.ssn'],\n censor: '[REDACTED]',\n },\n});\n\n// Good: structured fields, stable snake_case event name\nlogger.info({ orderId, userId, amountCents }, 'payment_processed');\n\n// Bad: string interpolation\nlogger.info(`Payment processed for user ${userId} order ${orderId}`);","type":"text"}]},{"type":"paragraph","content":[{"text":"Python with structlog:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"python"},"content":[{"text":"import structlog\n\n# Configure once at startup, before the first log call.\n# structlog.stdlib.add_logger_name is left out: it needs\n# logger_factory=structlog.stdlib.LoggerFactory() and fails with the default logger.\nstructlog.configure(\n processors=[\n structlog.contextvars.merge_contextvars,\n structlog.stdlib.add_log_level,\n structlog.processors.TimeStamper(fmt=\"iso\"),\n structlog.processors.JSONRenderer(),\n ]\n)\n\nlog = structlog.get_logger()\n\n# Good: key-value pairs\nlog.info(\"payment_processed\", order_id=order_id, user_id=user_id, amount_cents=amount)","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Correlation IDs","type":"text"}]},{"type":"paragraph","content":[{"text":"Every request needs a trace ID that flows through all downstream calls. 
This is the minimum viable distributed tracing without OTel.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"// Express middleware\nimport { randomUUID } from 'crypto';\nimport { AsyncLocalStorage } from 'async_hooks';\n\nconst requestContext = new AsyncLocalStorage\u003c{ traceId: string; spanId: string }>();\n\nexport function correlationMiddleware(req, res, next) {\n const traceId = req.headers['x-trace-id'] ?? randomUUID();\n const spanId = randomUUID().slice(0, 8);\n\n requestContext.run({ traceId, spanId }, () => {\n res.setHeader('x-trace-id', traceId);\n next();\n });\n}\n\n// Logger that auto-includes context\nexport function getLogger(name: string) {\n return {\n info: (msg: string, fields?: object) => {\n const ctx = requestContext.getStore();\n logger.info({ ...ctx, ...fields, logger: name }, msg);\n },\n // ... error, warn, debug\n };\n}","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"OpenTelemetry Setup","type":"text"}]},{"type":"paragraph","content":[{"text":"See ","type":"text"},{"text":"references/opentelemetry-setup.md","type":"text","marks":[{"type":"code_inline"}]},{"text":" for complete OTel collector config, SDK initialization per language, and span attribute conventions.","type":"text"}]},{"type":"paragraph","content":[{"text":"Minimal Node.js bootstrap:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"// Must be first import in entrypoint\nimport { NodeSDK } from '@opentelemetry/sdk-node';\nimport { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';\nimport { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';\n\nconst sdk = new NodeSDK({\n serviceName: 'payment-service',\n traceExporter: new OTLPTraceExporter({\n url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,\n }),\n instrumentations: 
[getNodeAutoInstrumentations()],\n});\n\nsdk.start();","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Distributed Trace Propagation","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"mermaid"},"content":[{"text":"sequenceDiagram\n participant C as Client\n participant GW as API Gateway\n participant SVC as Payment Service\n participant DB as Database\n\n C->>GW: POST /checkout\u003cbr/>(no trace header)\n Note over GW: Generate trace-id: abc123\u003cbr/>span-id: 0001\n GW->>SVC: POST /payment\u003cbr/>traceparent: 00-abc123-0001-01\n Note over SVC: Inherit trace-id: abc123\u003cbr/>New span-id: 0002\n SVC->>DB: INSERT payment\u003cbr/>traceparent: 00-abc123-0002-01\n Note over DB: Inherit trace-id: abc123\u003cbr/>New span-id: 0003\n DB-->>SVC: OK (span 0003 ends)\n SVC-->>GW: 200 OK (span 0002 ends)\n GW-->>C: 200 OK (span 0001 ends)\u003cbr/>x-trace-id: abc123","type":"text"}]},{"type":"paragraph","content":[{"text":"The W3C ","type":"text"},{"text":"traceparent","type":"text","marks":[{"type":"code_inline"}]},{"text":" header format: ","type":"text"},{"text":"00-{traceId}-{spanId}-{flags}","type":"text","marks":[{"type":"code_inline"}]},{"text":". 
Always propagate this header on every downstream HTTP call.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Reference Files","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"File","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Contents","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/opentelemetry-setup.md","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"OTel SDK init per language, collector YAML config, span attributes, context propagation","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"references/alerting-patterns.md","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"SLI/SLO definitions, alert routing, PagerDuty severity mapping, alert fatigue prevention","type":"text"}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Anti-Patterns (Shibboleths)","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Anti-Pattern 1: Logging PII or Secrets in Production","type":"text"}]},{"type":"paragraph","content":[{"text":"Novice thinking","type":"text","marks":[{"type":"strong"}]},{"text":": \"I'll just log the full request body to debug this auth issue.\"","type":"text"}]},{"type":"paragraph","content":[{"text":"Why 
wrong","type":"text","marks":[{"type":"strong"}]},{"text":": GDPR violations carry fines of up to 4% of global annual turnover or EUR 20 million, whichever is higher, and CCPA adds per-violation penalties. Secrets in logs propagate to log aggregators, S3 exports, audit trails — all places with different access controls. A single ","type":"text"},{"text":"console.log(req.body)","type":"text","marks":[{"type":"code_inline"}]},{"text":" can expose thousands of user passwords in your Datadog dashboard.","type":"text"}]},{"type":"paragraph","content":[{"text":"Detection signature","type":"text","marks":[{"type":"strong"}]},{"text":": Search your logs for ","type":"text"},{"text":"password","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"ssn","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"cardNumber","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"authorization","type":"text","marks":[{"type":"code_inline"}]},{"text":" as field names that carry real values rather than a redaction marker. If any appear, you have a PII or secrets leak.","type":"text"}]},{"type":"paragraph","content":[{"text":"Fix — Allowlist approach:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"// Assumes lodash; any helper that copies only the named keys works\nimport { pick } from 'lodash';\n\n// Never log what you don't explicitly approve\nconst SAFE_BODY_FIELDS = ['orderId', 'productId', 'quantity', 'currency'];\n\nlogger.info({\n body: pick(req.body, SAFE_BODY_FIELDS), // only known-safe fields\n path: req.path,\n method: req.method,\n}, 'request_received');","type":"text"}]},{"type":"paragraph","content":[{"text":"Fix — Redaction in logger config:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"import pino from 'pino';\n\n// Pino's redact runs before any transport\nconst logger = pino({\n redact: {\n paths: [\n 'req.headers.authorization',\n 'req.headers.cookie',\n 'body.password',\n 'body.*.password', // one level down, e.g. body.user.password\n 'body.cardNumber',\n 'body.ssn',\n '*.token',\n 
'*.secret',\n ],\n censor: '[REDACTED]',\n },\n});","type":"text"}]},{"type":"paragraph","content":[{"text":"Shibboleth","type":"text","marks":[{"type":"strong"}]},{"text":": An expert sets up redaction at logger initialization, not as a reminder comment. Redaction must be structural, not ad-hoc.","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":3},"content":[{"text":"Anti-Pattern 2: Unstructured String Logs Instead of Structured JSON","type":"text"}]},{"type":"paragraph","content":[{"text":"Novice thinking","type":"text","marks":[{"type":"strong"}]},{"text":": ","type":"text"},{"text":"logger.info('User ' + userId + ' purchased ' + productId + ' for
$' + amount)","type":"text","marks":[{"type":"code_inline"}]}]},{"type":"paragraph","content":[{"text":"Why wrong","type":"text","marks":[{"type":"strong"}]},{"text":": You cannot reliably filter, aggregate, or alert on string-interpolated data in a log aggregator; at best you maintain brittle regex extractions. A Grafana query for ","type":"text"},{"text":"amount > 1000","type":"text","marks":[{"type":"code_inline"}]},{"text":" requires ","type":"text"},{"text":"amount","type":"text","marks":[{"type":"code_inline"}]},{"text":" to be a numeric field, not embedded in a sentence. String logs are effectively write-only: you can read them one at a time but not query them at scale.","type":"text"}]},{"type":"paragraph","content":[{"text":"Impact","type":"text","marks":[{"type":"strong"}]},{"text":": Your 10 million daily log lines become effectively unsearchable. MTTR (mean time to recovery) during incidents climbs because engineers grep through strings instead of filtering structured fields.","type":"text"}]},{"type":"paragraph","content":[{"text":"Fix:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"// Bad: string log — amount is buried in text\nlogger.info(`User ${userId} purchased ${productId} for ${amount}`);\n\n// Good: structured — every field is queryable\n// (amount is assumed to be in cents, hence the division)\nlogger.info({ userId, productId, amountDollars: amount / 100 }, 'purchase_completed');","type":"text"}]},{"type":"paragraph","content":[{"text":"Consistent event naming","type":"text","marks":[{"type":"strong"}]},{"text":": Use ","type":"text"},{"text":"snake_case","type":"text","marks":[{"type":"code_inline"}]},{"text":" noun-verb, past-tense event names (","type":"text"},{"text":"payment_processed","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"user_signed_up","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"order_failed","type":"text","marks":[{"type":"code_inline"}]},{"text":") as the message string. 
This creates a stable vocabulary for dashboards and alerts.","type":"text"}]},{"type":"paragraph","content":[{"text":"Shibboleth","type":"text","marks":[{"type":"strong"}]},{"text":": The log message string is for humans scanning log tails. All queryable data lives in structured fields. A logger that produces ","type":"text"},{"text":"{}","type":"text","marks":[{"type":"code_inline"}]},{"text":" as its output shape is better than one that produces readable strings.","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":3},"content":[{"text":"Anti-Pattern 3: Log-and-Throw (Duplicate Log Entries)","type":"text"}]},{"type":"paragraph","content":[{"text":"Novice thinking","type":"text","marks":[{"type":"strong"}]},{"text":": Log the error, then re-throw so the caller also knows about it.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"// BAD: log-and-throw\nasync function processPayment(orderId: string) {\n try {\n return await chargeCard(orderId);\n } catch (err) {\n logger.error({ err, orderId }, 'Payment failed'); // Logged here\n throw err; // And the caller logs it again\n }\n}\n\nasync function handleCheckout(req, res) {\n try {\n await processPayment(req.body.orderId);\n } catch (err) {\n logger.error({ err }, 'Checkout failed'); // Same error logged twice\n res.status(500).json({ error: 'Checkout failed' });\n }\n}","type":"text"}]},{"type":"paragraph","content":[{"text":"Why wrong","type":"text","marks":[{"type":"strong"}]},{"text":": Every error appears 2-5 times in your logs depending on call depth. Alerting on error count becomes unreliable. Incident review is confusing — engineers think there were multiple failures. 
Log volume costs money (Datadog charges per ingested GB).","type":"text"}]},{"type":"paragraph","content":[{"text":"Fix — Log only at the boundary where you handle the error:","type":"text","marks":[{"type":"strong"}]}]},{"type":"code_block","attrs":{"wrap":false,"language":"typescript"},"content":[{"text":"// Good: log only where you decide what to do with the error\nasync function processPayment(orderId: string) {\n // No try-catch: let errors propagate naturally\n return await chargeCard(orderId);\n}\n\nasync function handleCheckout(req, res) {\n try {\n await processPayment(req.body.orderId);\n res.json({ success: true });\n } catch (err) {\n // One log, at the boundary where we're deciding to return 500\n logger.error({ err, orderId: req.body.orderId }, 'checkout_failed');\n res.status(500).json({ error: 'Checkout failed' });\n }\n}","type":"text"}]},{"type":"paragraph","content":[{"text":"Rule","type":"text","marks":[{"type":"strong"}]},{"text":": Log where you handle. Don't log where you propagate. 
The call stack in the error object already tells you where it originated.","type":"text"}]},{"type":"paragraph","content":[{"text":"Shibboleth","type":"text","marks":[{"type":"strong"}]},{"text":": If you see the same ","type":"text"},{"text":"traceId","type":"text","marks":[{"type":"code_inline"}]},{"text":" appear in more than two error log lines for a single request, you have a log-and-throw chain somewhere.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Quality Checklist","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":""},"content":[{"text":"[ ] All log lines are JSON (no string concatenation)\n[ ] Log level strategy documented: what goes at each level\n[ ] PII/secrets redacted at logger config level, not call site\n[ ] Correlation IDs propagated on all outbound HTTP calls\n[ ] OTel SDK initialized before any other imports\n[ ] Error logs include the error object (not just message)\n[ ] No log-and-throw patterns in error handling\n[ ] DEBUG logs use conditional guards or sampling\n[ ] SLI/SLO defined for each critical user journey\n[ ] Alert routing: notify vs page threshold documented\n[ ] Runbook linked from every paging alert\n[ ] Log retention policy set (cost vs compliance)","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Output Artifacts","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Logger configuration","type":"text","marks":[{"type":"strong"}]},{"text":" — Pino/Winston/structlog setup with redaction rules","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"OTel bootstrap file","type":"text","marks":[{"type":"strong"}]},{"text":" — SDK init with auto-instrumentation","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Correlation 
middleware","type":"text","marks":[{"type":"strong"}]},{"text":" — AsyncLocalStorage request context","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Prometheus metrics module","type":"text","marks":[{"type":"strong"}]},{"text":" — Counter/histogram/gauge definitions","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Grafana dashboard JSON","type":"text","marks":[{"type":"strong"}]},{"text":" — Four golden signals panels","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Alertmanager rules YAML","type":"text","marks":[{"type":"strong"}]},{"text":" — SLO-based alert definitions","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"logging-observability","author":"@skillopedia","source":{"stars":113,"repo_name":"some_claude_skills","origin_url":"https://github.com/erichowens/some_claude_skills/blob/HEAD/.claude/skills/logging-observability/SKILL.md","repo_owner":"erichowens","body_sha256":"d3355ee526cdcd956ca1dbcb09d091675329eb3e4f1feeb693e530853339987d","cluster_key":"26457c8ee2ac8814d764a440ff28c85810593876888d2e20683372312b48be35","clean_bundle":{"format":"clean-skill-bundle-v1","source":"erichowens/some_claude_skills/.claude/skills/logging-observability/SKILL.md","attachments":[{"id":"1bc6ec78-e834-5be8-8618-3b99b31dc57d","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/1bc6ec78-e834-5be8-8618-3b99b31dc57d/attachment.md","path":"references/alerting-patterns.md","size":9897,"sha256":"7916a9fd5204b3e03c55c1ca8c59975e80a298da0972f7214e2ac7ec760e37af","contentType":"text/markdown; 
charset=utf-8"},{"id":"0698c6b8-5a45-584c-9070-da3735bcff69","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/0698c6b8-5a45-584c-9070-da3735bcff69/attachment.md","path":"references/opentelemetry-setup.md","size":10328,"sha256":"52867fd70ea3cf4753ab29c11a883b3a81199a11b6c4829c307040856fa1f075","contentType":"text/markdown; charset=utf-8"}],"bundle_sha256":"4c5016dba1f9a2f5c991a7f2b90110279ac83dda68d22667a7d9a9c3014b6521","attachment_count":2,"text_attachments":2,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":0,"excluded_attachments":[]},"cluster_size":1,"skill_md_path":".claude/skills/logging-observability/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"testing-qa","category_label":"Testing"},"exact_dupes_collapsed_into_this":0},"version":"v1","category":"testing-qa","metadata":{"tags":["observability","logging","tracing","metrics","opentelemetry","monitoring"],"category":"Code Quality & Testing","pairs-with":[{"skill":"api-architect","reason":"API request tracing and correlation IDs"},{"skill":"devops-automator","reason":"Deploying collectors and dashboards"},{"skill":"background-job-orchestrator","reason":"Distributed job observability"}]},"import_tag":"clean-skills-v1","description":"Structured logging, distributed tracing, and metrics for production applications. 
[What: OpenTelemetry setup, log level strategy, correlation IDs, SLI/SLO alerting thresholds, Grafana dashboard design, PagerDuty integration] [When: setting up production logging, adding observability to a service, debugging distributed systems, designing alerting, implementing traces/metrics/logs] [Keywords: logging, observability, OpenTelemetry, OTel, structured logs, distributed tracing, correlation ID, metrics, Grafana, Prometheus, PagerDuty, Winston, Pino, structlog, log levels, SLI, SLO, alerting] NOT for application performance profiling (use a profiler), load testing, or database query optimization.","allowed-tools":"Read,Write,Edit,Bash(npm:*,npx:*,pip:*,docker:*)","argument-hint":"[service description] [stack: node|python|go|java] [current problem: no-logging|no-tracing|alert-fatigue|pii-leak]"}},"renderedAt":1782986816863}