
SAE and TensorLens: The Age of Feature Interpretability

Individual neurons are uninterpretable. Sparse Autoencoders (SAEs) extract monosemantic features from model internals, and TensorLens analyzes the entire Transformer as a single unified tensor.

In the previous two posts, we:

  • Logit/Tuned Lens: Read the model's intermediate predictions (see the first sketch below)
  • Activation Patching: Traced which activations are causally responsible for the answer (see the second sketch below)
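
To make the recap concrete, here is a minimal logit-lens sketch in PyTorch/HuggingFace. It is an illustrative reconstruction, not code from the earlier posts; the model choice (gpt2), the prompt, and the loop structure are all assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

inputs = tok("The Eiffel Tower is in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Project each layer's residual stream through the final LayerNorm and
# the unembedding matrix to read off an intermediate next-token guess.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h))
    top = logits[0, -1].argmax().item()
    print(f"layer {layer:2d}: {tok.decode([top])!r}")
```

And a correspondingly minimal activation-patching sketch, again under assumed details: two prompts that differ in one fact, a clean-run activation cached at one block, and that activation patched into the corrupted run at the final token position.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

clean = tok("The Eiffel Tower is in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is in the city of", return_tensors="pt")

LAYER = 6  # which block to patch; an arbitrary choice for this sketch
cache = {}

def save_hook(module, inp, out):
    cache["resid"] = out[0].detach()        # the block's output is a tuple

def patch_hook(module, inp, out):
    h = out[0].clone()
    h[:, -1, :] = cache["resid"][:, -1, :]  # patch only the final position
    return (h,) + out[1:]

block = model.transformer.h[LAYER]
with torch.no_grad():
    handle = block.register_forward_hook(save_hook)
    model(**clean)                          # cache the clean activation
    handle.remove()

    handle = block.register_forward_hook(patch_hook)
    logits = model(**corrupt).logits        # corrupted run with the patch
    handle.remove()

paris = tok(" Paris")["input_ids"][0]
rome = tok(" Rome")["input_ids"][0]
print("logit diff (Paris - Rome):",
      (logits[0, -1, paris] - logits[0, -1, rome]).item())
```

If the patched activation shifts the logit difference toward the clean answer, that block (at that position) is causally implicated in producing it.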

But here we hit a fundamental problem: individual neurons are polysemantic. A single neuron fires for many unrelated concepts, so neuron-level analysis does not give us interpretable units.
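
As a rough sketch of the SAE idea named in the title: an overcomplete dictionary of features trained to reconstruct activations under an L1 sparsity penalty. Everything here (the dimensions, the ReLU encoder, the loss coefficient) is an illustrative assumption, not a detail from this post.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder: far more features than model dimensions,
    with a sparsity penalty so only a few features fire per activation."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activation
        return x_hat, f

sae = SparseAutoencoder(d_model=768, d_features=768 * 8)  # 8x expansion
x = torch.randn(64, 768)                 # a batch of residual-stream activations
x_hat, f = sae(x)
loss = ((x_hat - x) ** 2).mean() + 1e-3 * f.abs().mean()  # reconstruction + L1
```

The L1 term pushes most feature activations to zero, which is what makes individual dictionary directions candidates for monosemantic features.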
