sparse-autoencoder-training

SAELens: Sparse Autoencoders for Mechanistic Interpretability SAELens is the primary library for training and analyzing Sparse Autoencoders (SAEs) - a technique for decomposing polysemantic neural network activations into sparse, interpretable features. Based on Anthropic's groundbreaking research on monosemanticity. GitHub : jbloomAus/SAELens (1,100+ stars) The Problem: Polysemanticity & Superposition Individual neurons in neural networks are polysemantic - they activate in multiple, semantically distinct contexts. This happens because models use superposition to represent more features than…