Go Performance Start with measurement, not rewriting. When to use this skill - Profile Go code with , , flight recording, or IDE-collected pprof-compatible profiles. - Diagnose CPU hot paths, allocation pressure, retained heap growth, goroutine pileups, scheduler delay, or lock/channel contention. - Write or repair Go benchmarks and compare performance changes with . - Decide whether a measured Go optimization is worth the added complexity. When NOT to use this skill - No measurement of an actual performance problem exists yet. Write the benchmark or capture the profile first; do not optimize…

` for benchmark-only runs.\n- For contention or scheduler issues, use trace, block, and mutex tooling instead of only CPU profiles.\n- For intermittent production latency, consider the Go 1.25+ flight recorder before building custom tracing machinery.\n\n## Go 1.26-specific posture\n\n- Re-measure old workarounds on Go 1.26; runtime and compiler changes may have made older allocation, cgo, and GC workarounds obsolete.\n- On Linux containers, remember that Go 1.25+ made `GOMAXPROCS` container-aware by default. Do not cargo-cult `automaxprocs` into modern Go services without a measured reason.\n- Use `testing.T.ArtifactDir` plus `go test -artifacts -outputdir ...` when a benchmark or perf regression test needs to retain profiles, traces, or other debugging output.\n\n## Output expectations\n\nWhen reporting findings or a fix:\n\n1. State the bottleneck and the evidence.\n2. State the specific change and why it should move the measured metric.\n3. Report before/after benchmark or profile deltas.\n4. Call out residual risks, version assumptions, or production-only gaps.\n---","attachment_filenames":["agents/openai.yaml","references/hot-path.md","references/measurement.md","references/optimization.md"],"attachments":[{"filename":"agents/openai.yaml","content":"interface:\n display_name: \"Go Performance\"\n short_description: \"Profile and optimize Go performance issues\"\n default_prompt: \"Use $go-performance to benchmark, profile, and optimize this Go hot path.\"\n","content_type":"application/yaml; charset=utf-8","language":"yaml","size":204,"content_sha256":"2e1c70350087ae6bd1c650188c99014cd984e93285ecaf7d9a9c207b985c2feb"},{"filename":"references/hot-path.md","content":"# CPU-Bound Hot-Path Techniques\n\nSource: distilled from \"Notes from optimizing CPU-bound Go hot paths\" (blog.andr2i.com, 2026-05-03) plus current Go 1.26 compiler behavior.\n\nRead this only after measurement has named a single hot kernel that dominates CPU time. 
These techniques trade source-code clarity for cycles; do not apply them speculatively.\n\n## Inlining cost budget\n\nThe Go compiler inlines functions whose internal \"cost\" stays under approximately 80 units. Once inlined, the call disappears and the body becomes part of the caller's optimization problem (better register allocation, more BCE, more constant propagation).\n\nTo check whether a function is being inlined:\n\n```bash\ngo build -gcflags=all='-m=2' ./pkg 2>&1 | rg 'can inline|inlining call|cannot inline'\n```\n\nIf the hot function is just over budget, reduce its cost rather than rewriting the algorithm:\n\n- extract panic and error paths into separate functions marked `//go:noinline`\n- collapse obvious redundancies and unused locals\n- split rarely taken branches out of the hot loop\n\nPGO raises the inlining threshold per call site based on profile evidence. It is the right escape hatch for end-product binaries; library authors cannot rely on consumers running it.\n\n## Dispatch cost: avoid abstractions in tight loops\n\nInside the inner loop of a CPU-bound kernel, abstractions that add an extra dispatch step are expensive. The blog measurement on a Brotli hash kernel (378 MiB/s baseline) shows the cost:\n\n| Form | Throughput | Delta |\n| ----------------------- | -------------- | --------- |\n| Concrete function | 378.0 MiB/s | baseline |\n| Generic `[T any]` | 320.6 MiB/s | -15.18 % |\n| Closure (`func(...)`) | 322.0 MiB/s | -14.82 % |\n| Interface method | 274.3 MiB/s | -27.44 % |\n\nWhy each costs more:\n\n- **Generics**: Go uses GC Shape Stenciling, not full monomorphization. 
Method calls on a type parameter dispatch through a generics dictionary, and the inliner explicitly does not inline them even when the concrete type is statically known.\n- **Closures**: captured variables force stack allocation; the function pointer must be loaded on every call, and the compiler cannot inline through it.\n- **Interfaces**: itab indirection on every call, and devirtualization rarely fires in real codebases.\n\nThese deltas only matter when each call does little work (byte-oriented kernels, tight inner loops). Once the call body processes ~64 bytes of input or more per invocation, the dispatch overhead falls into low single digits as a fraction of total time.\n\n### Specialization patterns\n\nWhen measurement names a hot kernel and an abstraction is in the way:\n\n1. Hand-duplicate the function for each concrete type or strategy.\n2. If the duplicate count exceeds ~5 and the bodies stay synchronized, switch to `go generate` with text templates.\n3. As a last resort, manually inline the callee into the caller (accept the maintenance cost).\n\nThe blog author kept 16 hand-duplicated variants because the bodies diverged during tuning, making templates harder to maintain than the duplication.\n\n## Compiler-friendly idioms\n\nThese rewrites are mathematically transparent but communicate invariants to the compiler so it can drop guard instructions.\n\n### Bounds Check Elimination (BCE) via hint loads\n\nA single dummy access at the top of a loop proves the upper bound for all subsequent indexed accesses:\n\n```go\n// hint to compiler: indices 0..3 are valid below\n_ = b[3]\nv := uint32(b[0]) | uint32(b[1])\u003c\u003c8 | uint32(b[2])\u003c\u003c16 | uint32(b[3])\u003c\u003c24\n```\n\nWithout the hint, each `b[i]` emits a `CMPQ` and conditional branch. After the hint, the compiler proves all four are in bounds and removes the per-access checks. 
See golang/go#14808 for the formal pattern.\n\n### Shift masking\n\nWhen the shift amount is provably less than the data width, mask it explicitly:\n\n```go\n// before: compiler emits SHLQ + CMPQ + SBBQ + ANDQ guard sequence\ny := x \u003c\u003c n\n\n// after: collapses to a single SHLQ; mask is a no-op when n \u003c 64\ny := x \u003c\u003c (n & 63)\n```\n\nThe mask is identity for valid `n`, but it is the proof the compiler needs to drop the guard.\n\n### Pointer arithmetic via `unsafe`\n\nWhen the access pattern cannot be expressed as a hint load, `unsafe.Pointer` arithmetic skips bounds checks entirely. Treat as a last resort: write a Go fallback alongside, validate with race-free tests on representative inputs, and document the safety argument inline.\n\n## Diagnose register pressure from assembly\n\nWhen optimizing a tight loop, dump the assembly:\n\n```bash\ngo test -gcflags='-S' -run='^
$' -bench='^BenchmarkFoo$'
./pkg 2>&1 | rg -A 60 BenchmarkFoo\n```\n\nLook for repeated stack reloads inside the loop body, for example:\n\n```\nMOVQ 0x20(SP), AX\nMOVQ 0x28(SP), BX\nMOVQ 0x30(SP), CX\n```\n\nMany `MOVQ ...(SP)` loads per iteration mean the compiler has spilled live values to the stack because the loop needs more live values than there are registers. Reduce live values: shorten variable lifetimes, hoist invariants, drop unused intermediates, or split the loop into two stages.\n\n## Last-resort techniques\n\nApply these only after measurement, BCE, specialization, and inlining work have run out.\n\n### Manual loop unrolling\n\nGo has no `//go:unroll`. Manually duplicate loop bodies when the per-iteration overhead (increment, compare, branch) is a meaningful fraction of the work and the unrolled body still fits register and i-cache budgets. Verify with benchmarks; unrolling sometimes regresses due to i-cache or alignment effects.\n\n### Whole-function assembly with Go fallback\n\nFor algorithms that need prefetch, hand-tuned SIMD, or precise register allocation, write the entire hot function in Plan 9 assembly:\n\n- ship an architecture-specific `_amd64.s` (and `_arm64.s` if relevant) with a build tag\n- keep a portable Go fallback for other platforms and for verification\n- never write assembly for tiny helpers; they pay the call overhead without the inlining benefit\n\nGo does not yet expose `PREFETCHT0` as an intrinsic to user code (proposal: golang/go#68769); whole-function assembly is the current workaround when prefetch matters.\n\n### Experimental SIMD\n\nOn Go 1.26 with `GOEXPERIMENT=simd`, the experimental SIMD package exposes vector intrinsics on AMD64 (golang/go#73787). It is useful for portable vector kernels, but expect API changes. Library code should keep an assembly or scalar fallback until the API stabilizes.\n\n## Trust the measurement\n\nOptimization deltas under ~3-5 % are usually code-layout noise (where the linker placed the function), not a real algorithmic win. 
Validate with the layout-noise procedure in [measurement.md](measurement.md#code-layout-noise) before claiming a small improvement.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":6687,"content_sha256":"ddd491d23563be3c18bb9863f7959e04762976dc420d2893eb2530cb1c8da908"},{"filename":"references/measurement.md","content":"# Measurement Workflow\n\nSource snapshot: refreshed 2026-05-26 from official Go 1.26 docs and selected profiling guides\n\n- Go 1.26 release notes: https://go.dev/doc/go1.26\n- Diagnostics overview: https://go.dev/doc/diagnostics\n- `testing.B.Loop`: https://go.dev/blog/testing-b-loop\n- `runtime/pprof`: https://pkg.go.dev/runtime/pprof\n- `net/http/pprof`: https://pkg.go.dev/net/http/pprof\n- `runtime/trace` and flight recorder: https://pkg.go.dev/runtime/trace and https://go.dev/blog/flight-recorder\n- PGO: https://go.dev/doc/pgo\n- `benchstat`: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat\n- JetBrains Go profiling guide: https://blog.jetbrains.com/go/2026/05/20/golang-profiling-guide/\n\n## Benchmark first\n\nPrefer a targeted benchmark before changing code.\n\nFor new or updated benchmarks on Go 1.24+:\n\n```go\nfunc BenchmarkFoo(b *testing.B) {\n fixture := newFixture()\n for b.Loop() {\n foo(fixture)\n }\n}\n```\n\nWhy `b.Loop`:\n\n- excludes setup and cleanup from timing\n- avoids most manual timer management\n- helps prevent dead-code elimination surprises\n\nUse `b.RunParallel` for concurrent code paths and pair it with `go test -cpu`.\n\n## Standard benchmark commands\n\nCollect stable results before and after a change:\n\n```bash\ngo test -run='^
$' -bench='^BenchmarkFoo$'
-benchmem -count=10 ./pkg > before.txt\ngo test -run='^
$' -bench='^BenchmarkFoo$'
-benchmem -count=10 ./pkg > after.txt\nbenchstat before.txt after.txt\n```\n\nUseful variations:\n\n```bash\ngo test -run='^
$' -bench='^BenchmarkFoo$'
-benchmem -benchtime=500ms ./pkg\ngo test -run='^
$' -bench='^BenchmarkFoo$'
-benchmem -benchtime=100x ./pkg\ngo test -run='^
$' -bench='^BenchmarkParallelFoo$'
-benchmem -cpu=1,2,4 ./pkg\n```\n\nRules:\n\n- Use `-count=10` or more before trusting `benchstat`.\n- Use `-benchmem` for almost every optimization pass.\n- Do not mix `-race`, coverage, or unrelated noisy tests into performance measurement runs.\n\n### Code-layout noise\n\nWhere the linker places a function in the binary affects its alignment to cache lines and decode windows, which can shift benchmark results by 3-4 % even when nothing about the function changed. Treat single-commit deltas under ~3-5 % as noise unless the result reproduces across an unrelated change.\n\nTo estimate the inherent layout variance of a benchmark:\n\n1. Run the benchmark 10 times and record the result.\n2. Add or remove a tiny unrelated function elsewhere in the binary (a no-op exported helper works).\n3. Run the benchmark 10 times again.\n4. The spread between the two `benchstat` outputs is the layout-noise floor; real wins must clear it.\n\nRun benchmarks longer than feels necessary, especially for small kernels.\n\nInstall `benchstat` if needed:\n\n```bash\ngo install golang.org/x/perf/cmd/benchstat@latest\n```\n\n## Profile one dimension at a time\n\nThe Go docs explicitly warn that diagnostics can interfere with each other. Collect focused data.\n\n### Profile chooser\n\nMatch the profile to the symptom before collecting data:\n\n| Symptom | Start with | What it does and does not show |\n| --- | --- | --- |\n| CPU is saturated, a benchmark regressed, or PGO needs input | CPU profile | Samples active execution. It finds compute hot paths, not time spent blocked on locks, channels, network, or timers. |\n| RSS or live heap grows | Heap profile | Defaults to `inuse_space`, which shows retained heap. Use `gc=1` or `runtime.GC()` when the question is \"what is still live after GC?\" |\n| GC work or allocation rate is high | Allocs profile or memory profile with `alloc_space` / `alloc_objects` | Shows cumulative allocation pressure, including objects that were already collected. 
Profiles are sampled, not exact. |\n| CPU is low but latency is high | Goroutine dump, then block and mutex profiles | Goroutine shows current stacks; block shows where goroutines waited; mutex shows lock holders that made others wait. |\n| Scheduler behavior or intermittent latency is unclear | Trace or Go 1.25+ flight recorder | Captures a runtime timeline instead of aggregate samples. Use for runnable queues, network polling, syscalls, and goroutine scheduling. |\n\nBlock and mutex profiles are disabled by default and add overhead. Enable them only for the narrow benchmark or a short production investigation window.\n\n### CPU\n\n```bash\ngo test -run='^
$' -bench='^BenchmarkFoo$'
-cpuprofile=cpu.pprof ./pkg\ngo tool pprof -top -cum cpu.pprof\ngo tool pprof -http=:0 cpu.pprof\n```\n\n### Heap and allocs\n\n```bash\ngo test -run='^
$' -bench='^BenchmarkFoo$'
-memprofile=mem.pprof ./pkg\ngo tool pprof -top -sample_index=inuse_space mem.pprof\ngo tool pprof -top -sample_index=inuse_objects mem.pprof\ngo tool pprof -top -sample_index=alloc_space mem.pprof\ngo tool pprof -top -sample_index=alloc_objects mem.pprof\n```\n\nHeap and allocs profiles expose the same four sample types. Heap defaults to `inuse_space` for retained live heap; allocs defaults to `alloc_space` for cumulative allocation pressure.\n\nUse `-memprofilerate=1` only when you need more precise allocation data and can tolerate the extra overhead.\n\n### Mutex and block contention\n\n```bash\ngo test -run='^
$' -bench='^BenchmarkFoo$'
-mutexprofile=mutex.pprof -mutexprofilefraction=1 ./pkg\ngo test -run='^
$' -bench='^BenchmarkFoo$'
-blockprofile=block.pprof ./pkg\ngo tool pprof -top -sample_index=contentions mutex.pprof\ngo tool pprof -top -sample_index=contentions block.pprof\ngo tool pprof -top -sample_index=delay mutex.pprof\ngo tool pprof -top -sample_index=delay block.pprof\n```\n\nFor mutex and block profiles, check both delay and contentions. High delay with low contentions means a few waits were expensive; high contentions with low delay means many short waits.\n\n### Goroutine dump\n\nUse goroutine profiles for current pileups, leaks, and deadlocks. Text is often more useful than a graph:\n\n```bash\ncurl 'http://localhost:6060/debug/pprof/goroutine?debug=2'\n```\n\n### Trace\n\nUse trace for scheduler, latency, blocking, and concurrency-path issues:\n\n```bash\ngo test -run='^
$' -bench='^BenchmarkFoo$'
-trace=trace.out ./pkg\ngo tool trace trace.out\n```\n\n## Inspecting pprof output\n\nStart with both flat and cumulative views:\n\n```bash\ngo tool pprof -text cpu.pprof\ngo tool pprof -top -cum cpu.pprof\ngo tool pprof -tree mutex.pprof\ngo tool pprof -list='mypkg.(*Cache).Get' cpu.pprof\ngo tool pprof -peek='mypkg.(*Cache).Get' cpu.pprof\ngo tool pprof -http=:0 cpu.pprof\n```\n\nInterpretation rules:\n\n- Flat cost is work done directly in the function. High flat cost points at the function body.\n- Cumulative cost includes callees. High cumulative cost points at a caller or workflow that creates expensive downstream work.\n- Flame graph width is cost; stack height is call depth, not importance.\n- For memory profiles, always state the sample index used (`inuse_space`, `alloc_space`, `inuse_objects`, or `alloc_objects`).\n- For block and mutex profiles, always state whether the result is ordered by delay or contentions.\n\n## Service profiling\n\nFor a long-running service, prefer `net/http/pprof` or `runtime/pprof`.\n\n```go\nimport _ \"net/http/pprof\"\n```\n\n```bash\n# CPU: 30-second active execution sample\ngo tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'\n\n# Live heap, forcing GC first when retained objects are the question\ngo tool pprof 'http://localhost:6060/debug/pprof/heap?gc=1'\n\n# Cumulative allocation pressure since process start\ngo tool pprof 'http://localhost:6060/debug/pprof/allocs'\n\n# Contention profiles; useful only if runtime rates were enabled\ngo tool pprof 'http://localhost:6060/debug/pprof/block'\ngo tool pprof 'http://localhost:6060/debug/pprof/mutex'\n\n# Current goroutine pileup, usually best read as text\ncurl 'http://localhost:6060/debug/pprof/goroutine?debug=2'\n\n# Runtime timeline\ncurl -o trace.out 'http://localhost:6060/debug/pprof/trace?seconds=5'\ngo tool trace trace.out\n```\n\nNotes:\n\n- `net/http/pprof` endpoints must be requested with `GET`.\n- Heap with `gc=1` forces garbage collection first; use 
`gc=0` or omit it when forcing GC would hide short-lived garbage or allocation spikes.\n- Block and mutex endpoints are useful only after `runtime.SetBlockProfileRate` or `runtime.SetMutexProfileFraction` has been configured.\n- For short investigation windows, `runtime.SetBlockProfileRate(1)` and `runtime.SetMutexProfileFraction(1)` capture every event; use lower-overhead sampling for production.\n- Heap, allocs, mutex, block, and goroutine endpoints support `seconds=N` delta profiles.\n- CPU and trace endpoints use `seconds=N` as capture duration.\n\n## Flight recorder\n\nUse the Go 1.25+ flight recorder when latency incidents are intermittent and you need the last few seconds before failure rather than a constantly running full trace.\n\nIt is a better fit than ad hoc tracing when:\n\n- the service is long-running\n- the trigger is rare or unpredictable\n- you need scheduler and goroutine context just before the incident\n\n## Runtime metrics\n\nUse `runtime/metrics` for low-overhead continuous observation in production.\n\nHigh-signal keys:\n\n- `/gc/heap/live:bytes`\n- `/gc/heap/goal:bytes`\n- `/gc/gomemlimit:bytes`\n- `/cpu/classes/gc/total:cpu-seconds`\n- `/sched/goroutines:goroutines`\n- `/sched/latencies:seconds`\n- `/sync/mutex/wait/total:seconds`\n- `/cgo/go-to-c-calls:calls`\n\nUse these to validate that a benchmark win also improves the live system shape.\n\n## PGO\n\nGo PGO consumes representative CPU pprof profiles.\n\nTypical workflow:\n\n```bash\ncurl -o cpu.pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'\ngo build -pgo=cpu.pprof ./cmd/server\n```\n\nOr place a representative profile at `default.pgo` in the main package directory and let `go build` pick it up automatically.\n\nRules:\n\n- use representative production traffic when possible\n- use benchmark-generated profiles only when they are truly representative\n- do not expect PGO to rescue bad algorithms or contention bugs\n- re-run benchmarks after enabling PGO; keep it only if 
it helps your workload\n\n## Artifact handling in tests\n\nIf a test or benchmark emits traces, profiles, or logs for later inspection, use `t.ArtifactDir()` or `b.ArtifactDir()` and run:\n\n```bash\ngo test -artifacts -outputdir \"$PWD/test-output\" ./...\n```\n\nThis keeps performance evidence attached to the failing or interesting test instead of scattering files across the repo.\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":10240,"content_sha256":"163e457b2b28dd061a5495765316ac9614f6b10cf194cc03eb97f09c5f0ae85a"},{"filename":"references/optimization.md","content":"# Optimization Heuristics\n\nThis file combines current Go 1.26-era practice with the strongest hot-path guidance from the `ipsw` `go-performance` skill.\n\nFor deep CPU-bound hot-path techniques (inlining cost budget, dispatch cost, BCE hints, assembly fallback), read [hot-path.md](hot-path.md) after profiling identifies a dominant kernel.\n\n## Fix in this order\n\n1. Eliminate unnecessary work.\n2. Improve algorithmic complexity or batching.\n3. Reduce allocations and copying.\n4. Reduce lock contention or scheduler stalls.\n5. Re-check whether PGO improves the already-good version.\n6. 
Apply micro-optimizations only on measured hot paths.\n\n## Go 1.26 reality check\n\nBefore preserving complex old workarounds, re-measure on Go 1.26:\n\n- Green Tea GC is now on by default.\n- baseline cgo overhead is lower.\n- the compiler can place more slice backing stores on the stack.\n- experimental SIMD package is available behind `GOEXPERIMENT=simd` on AMD64; see [hot-path.md](hot-path.md#experimental-simd) before adopting.\n\nPractical effect:\n\n- keep `sync.Pool`, manual reuse, and cgo batching only when benchmarks still justify them\n- remove cargo-culted allocation avoidance if the current compiler/runtime already made it cheap\n\n## Allocation and escape work\n\nIf profiles show allocation pressure:\n\n- preallocate slices and maps when final size is known\n- reduce temporary objects in inner loops\n- inspect escape and inlining output when needed:\n\n```bash\ngo test -gcflags=all=-m=2 ./pkg 2>&1 | rg 'escapes to heap|moved to heap|cannot inline'\n```\n\nUse compiler diagnostics to explain an allocation you already measured, not as a substitute for profiling.\n\n## Hot-path patterns worth carrying forward\n\nUse these only when the benchmark or profile points at them.\n\n### Prefer `strconv` over `fmt` for primitive string conversion\n\n```go\n// slower\ns := fmt.Sprint(n)\n\n// faster\ns := strconv.Itoa(n)\n```\n\n### Avoid repeated `string` to `[]byte` conversions in loops\n\n```go\n// slower\nfor b.Loop() {\n w.Write([]byte(\"hello\"))\n}\n\n// faster\ndata := []byte(\"hello\")\nfor b.Loop() {\n w.Write(data)\n}\n```\n\n### Pre-size slices and maps\n\n```go\nitems := make([]T, 0, n)\nindex := make(map[string]T, n)\n```\n\n### Pass small values directly\n\nDo not pass pointers just to avoid copying a string or interface-sized value. 
The indirection can be more expensive and makes escape behavior worse.\n\nBad examples:\n\n- `*string`\n- `*io.Reader`\n\nKeep pointer parameters when mutation, identity, or large-struct copying is the real requirement.\n\n## Concurrency and contention\n\nIf CPU is low but latency is bad, suspect waiting rather than compute:\n\n- inspect mutex and block profiles\n- inspect trace for runnable goroutine buildup and scheduler delay\n- benchmark parallel paths with `b.RunParallel` and `-cpu`\n\nOn Linux containers, do not assume `GOMAXPROCS` needs manual tuning first. Go 1.25+ already accounts for CPU quota by default. Measure before adding compatibility shims.\n\n## Memory limit tuning\n\nIf the service fights memory pressure, use `GOMEMLIMIT` or `runtime/debug.SetMemoryLimit` deliberately and verify the effect with runtime metrics.\n\nBe careful:\n\n- a memory limit that is too low can force the GC to run almost continuously\n- the Go memory limit does not include memory owned outside the Go runtime, such as C allocations or `syscall.Mmap`\n\n## What good looks like\n\nA solid optimization change usually has all of these:\n\n- a benchmark or production metric that reproduces the problem\n- a profile or trace that isolates the dominant cost\n- a targeted code change with a simple explanation\n- a `benchstat` comparison or production delta showing improvement\n- no extra complexity that lacks measured payoff\n","content_type":"text/markdown; charset=utf-8","language":"markdown","size":3690,"content_sha256":"ae3940aacaea442d3d5778c3d42723d654572a033413a8b2a371d8f3cb9f6036"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"Go Performance","type":"text"}]},{"type":"paragraph","content":[{"text":"Start with measurement, not rewriting.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"When to use this 
skill","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Profile Go code with ","type":"text"},{"text":"pprof","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"runtime/trace","type":"text","marks":[{"type":"code_inline"}]},{"text":", flight recording, or IDE-collected pprof-compatible profiles.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Diagnose CPU hot paths, allocation pressure, retained heap growth, goroutine pileups, scheduler delay, or lock/channel contention.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Write or repair Go benchmarks and compare performance changes with ","type":"text"},{"text":"benchstat","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Decide whether a measured Go optimization is worth the added complexity.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"When NOT to use this skill","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"No measurement of an actual performance problem exists yet. 
Write the benchmark or capture the profile first; do not optimize speculatively.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"The task is correctness, refactoring, or API design without a throughput, latency, or resource concern.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"The user wants a Go language tutorial or general code review, not performance work.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Read the right reference","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Read ","type":"text"},{"text":"references/measurement.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/measurement.md","title":null}}]},{"text":" for benchmark setup, ","type":"text"},{"text":"go test","type":"text","marks":[{"type":"code_inline"}]},{"text":" flags, ","type":"text"},{"text":"pprof","type":"text","marks":[{"type":"code_inline"}]},{"text":", trace, flight recording, runtime metrics, and PGO workflow.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Read ","type":"text"},{"text":"references/optimization.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/optimization.md","title":null}}]},{"text":" when changing code after measurement or reviewing hot-path code.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Read ","type":"text"},{"text":"references/hot-path.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/hot-path.md","title":null}}]},{"text":" only after profiling names a single dominant CPU kernel: covers inlining cost budget, dispatch cost (generics/interface/closure), bounds-check-elimination hints, register-pressure diagnosis, and assembly/SIMD 
escalation.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Default workflow","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Reproduce the problem and name the metric that matters: ","type":"text"},{"text":"ns/op","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"B/op","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"allocs/op","type":"text","marks":[{"type":"code_inline"}]},{"text":", throughput, tail latency, pause time, goroutine growth, or CPU saturation.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Add or repair a benchmark before changing code. On Go 1.24+ prefer ","type":"text"},{"text":"b.Loop()","type":"text","marks":[{"type":"code_inline"}]},{"text":" for new or edited benchmarks unless the repo must support older Go.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Run the benchmark repeatedly and compare with ","type":"text"},{"text":"benchstat","type":"text","marks":[{"type":"code_inline"}]},{"text":"; do not trust one run.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Choose the profile that matches the symptom: CPU for active compute, heap/allocs for memory, goroutine/block/mutex for waiting and contention, trace for scheduler timelines. 
Do not mix high-overhead diagnostics unless the issue requires correlation.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Fix the dominant cost first: algorithmic complexity, redundant work, bad data layout, excess allocation, or contention.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Re-run the same benchmark and compare with ","type":"text"},{"text":"benchstat","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Apply PGO only after the code path is correct and the profile is representative.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Validate the change under realistic service conditions with runtime metrics, ","type":"text"},{"text":"net/http/pprof","type":"text","marks":[{"type":"code_inline"}]},{"text":", or flight recording if the issue is production-only.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Rules of engagement","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Prefer algorithmic or architectural fixes over stylistic micro-optimizations.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use benchmark evidence and profiles to justify code complexity.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"For long-running services, profile the service shape you actually run; microbenchmarks alone are not enough.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use ","type":"text"},{"text":"-run='^

$'
","type":"text","marks":[{"type":"code_inline"}]},{"text":" for benchmark-only runs.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"For contention or scheduler issues, use trace, block, and mutex tooling instead of only CPU profiles.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"For intermittent production latency, consider the Go 1.25+ flight recorder before building custom tracing machinery.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Go 1.26-specific posture","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Re-measure old workarounds on Go 1.26; runtime and compiler changes may have made older allocation, cgo, and GC workarounds obsolete.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"On Linux containers, remember that Go 1.25+ made ","type":"text"},{"text":"GOMAXPROCS","type":"text","marks":[{"type":"code_inline"}]},{"text":" container-aware by default. 
Do not cargo-cult ","type":"text"},{"text":"automaxprocs","type":"text","marks":[{"type":"code_inline"}]},{"text":" into modern Go services without a measured reason.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Use ","type":"text"},{"text":"testing.T.ArtifactDir","type":"text","marks":[{"type":"code_inline"}]},{"text":" plus ","type":"text"},{"text":"go test -artifacts -outputdir ...","type":"text","marks":[{"type":"code_inline"}]},{"text":" when a benchmark or perf regression test needs to retain profiles, traces, or other debugging output.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Output expectations","type":"text"}]},{"type":"paragraph","content":[{"text":"When reporting findings or a fix:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"State the bottleneck and the evidence.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"State the specific change and why it should move the measured metric.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Report before/after benchmark or profile deltas.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Call out residual risks, version assumptions, or production-only 
gaps.","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"go-performance","author":"@skillopedia","source":{"stars":15,"repo_name":"dotfiles","origin_url":"https://github.com/blacktop/dotfiles/blob/HEAD/ai/skills/go-performance/SKILL.md","repo_owner":"blacktop","body_sha256":"b205b30bfe0e68b50e4d4ede401fe6d6d87f37abcf8b44d0e0b3918c79eca12b","cluster_key":"00aa25918ebad6f9d024c117ace724bad02e8b8c0edb870a6bd5d9de10b1793a","clean_bundle":{"format":"clean-skill-bundle-v1","source":"blacktop/dotfiles/ai/skills/go-performance/SKILL.md","attachments":[{"id":"c689946f-96d8-50f1-be23-5fb28ecc16a8","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/c689946f-96d8-50f1-be23-5fb28ecc16a8/attachment.yaml","path":"agents/openai.yaml","size":204,"sha256":"2e1c70350087ae6bd1c650188c99014cd984e93285ecaf7d9a9c207b985c2feb","contentType":"application/yaml; charset=utf-8"},{"id":"4d488e46-59c9-589b-87a9-2a3be89e0d11","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/4d488e46-59c9-589b-87a9-2a3be89e0d11/attachment.md","path":"references/hot-path.md","size":6687,"sha256":"ddd491d23563be3c18bb9863f7959e04762976dc420d2893eb2530cb1c8da908","contentType":"text/markdown; charset=utf-8"},{"id":"b960591e-70e1-5f1c-8945-27d63f4ef9c2","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/b960591e-70e1-5f1c-8945-27d63f4ef9c2/attachment.md","path":"references/measurement.md","size":10240,"sha256":"163e457b2b28dd061a5495765316ac9614f6b10cf194cc03eb97f09c5f0ae85a","contentType":"text/markdown; charset=utf-8"},{"id":"61f93dca-3490-599f-a84c-a7b6296814d5","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/61f93dca-3490-599f-a84c-a7b6296814d5/attachment.md","path":"references/optimization.md","size":3690,"sha256":"ae3940aacaea442d3d5778c3d42723d654572a033413a8b2a371d8f3cb9f6036","contentType":"text/markdown; 
charset=utf-8"}],"bundle_sha256":"df2ca4ecc9710f8676a039eb79987aec93a3012755153623e3a42921803896f9","attachment_count":4,"text_attachments":4,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":0,"excluded_attachments":[]},"cluster_size":1,"skill_md_path":"ai/skills/go-performance/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"data-analytics","category_label":"Data"},"exact_dupes_collapsed_into_this":0},"version":"v1","category":"data-analytics","import_tag":"clean-skills-v1","description":"Measure and improve Go program performance on modern Go (1.24+). Use when profiling Go code, diagnosing CPU or memory bottlenecks, investigating latency or contention, writing or fixing benchmarks, comparing benchmark results, using pprof or trace data, applying PGO, or tuning hot-path Go code."}},"renderedAt":1782988149938}
