Superposition in Transformers: A Novel Way of Building Mixture of Experts

Community Article Published January 4, 2025

Forget Catastrophic Forgetting: Superposition is Here to Revolutionize How We Fine-Tune LLMs

Abstract

Catastrophic forgetting remains a major challenge when adapting large language models (LLMs) to new tasks or domains. We introduce Superposition in Transformers, a novel architecture that leverages autoencoders to superimpose the hidden representations of a base model and a fine-tuned model within a shared parameter space. Using B-spline-based blending coefficients and autoencoders that adaptively reconstruct hidden states based on the input data distribution, our method effectively mitigates catastrophic forgetting and enables a new paradigm of "in-model" superposition.

The Solution: Superposition — Merging Minds in a Shared Space

Superposition in Transformers takes a radically different approach. Instead of adding new layers or hiring more experts, it cleverly merges the knowledge of a base model and a fine-tuned model within the same set of parameters. It’s like giving your doctor the ability to seamlessly switch between their general practitioner and cardiologist mindsets, accessing the right knowledge at the right time.

Here’s how it works, in a nutshell:

1 — Blending with B-Splines, Not Replacing: The paper introduces a mathematical technique using B-splines to smoothly blend the internal representations (hidden states) of the two models, layer by layer. Think of it like a dimmer switch, smoothly transitioning between the two “expert” states.
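To make the dimmer-switch analogy concrete, here is a minimal sketch of what layer-wise B-spline blending could look like. It uses SciPy only to evaluate a fixed, clamped B-spline basis over a handful of learnable control points; the names (`BSplineAlpha`, `num_ctrl`) and sizes are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.interpolate import BSpline

def bspline_basis(num_points: int, num_ctrl: int, degree: int = 3) -> torch.Tensor:
    """Evaluate a clamped B-spline basis at num_points positions in [0, 1]."""
    knots = np.concatenate([np.zeros(degree),
                            np.linspace(0.0, 1.0, num_ctrl - degree + 1),
                            np.ones(degree)])
    xs = np.linspace(0.0, 1.0, num_points)
    cols = [BSpline(knots, np.eye(num_ctrl)[i], degree)(xs) for i in range(num_ctrl)]
    return torch.tensor(np.stack(cols, axis=1), dtype=torch.float32)  # (points, ctrl)

class BSplineAlpha(nn.Module):
    """One blending coefficient per layer, shaped by a smooth B-spline."""
    def __init__(self, num_layers: int, num_ctrl: int = 6, degree: int = 3):
        super().__init__()
        self.register_buffer("basis", bspline_basis(num_layers, num_ctrl, degree))
        self.ctrl = nn.Parameter(torch.zeros(num_ctrl))  # learnable control points

    def forward(self) -> torch.Tensor:
        # Smoothly varying alpha in (0, 1), one value per transformer layer
        return torch.sigmoid(self.basis @ self.ctrl)

# Blending one layer's hidden states is then a convex combination:
# h_merged = alpha[l] * h_base + (1 - alpha[l]) * h_finetuned
```

Because the alphas come from a low-dimensional spline rather than being free parameters per layer, neighboring layers blend in a smooth, coordinated way instead of jumping abruptly between the two models.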

2 — Autoencoders for Reconstruction and Polysemanticity: At key points in the model, autoencoders are inserted. These components play two crucial roles (see the sketch after this list):

  • Blended State Refinement: Autoencoders reconstruct the blended hidden states, ensuring that critical features from both base and fine-tuned representations are preserved. This enables the model to adapt dynamically to specific input domains.
  • Encouraging Polysemanticity: By compressing and reconstructing hidden states through bottleneck pathways, autoencoders foster a distributed representation. This mechanism promotes polysemantic neurons capable of handling multiple tasks or domains. The 2D variant introduces dual-pathway autoencoders, emphasizing both local and global feature extraction, further enhancing polysemanticity.
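Here is one way such an autoencoder could look. The bottleneck width and activation are assumptions for illustration; the point is simply that the blended hidden state is squeezed through a narrow pathway and reconstructed, which is what pushes features to become distributed and polysemantic.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenStateAutoencoder(nn.Module):
    """Bottleneck autoencoder that refines a blended hidden state (illustrative sizes)."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(hidden_dim, bottleneck_dim), nn.GELU())
        self.decoder = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, h_blended: torch.Tensor) -> torch.Tensor:
        # h_blended: (batch, seq_len, hidden_dim)
        return self.decoder(self.encoder(h_blended))

# Training signal (sketch): reconstruct the domain-appropriate hidden state,
# e.g. the base model's states for English inputs and the fine-tuned model's
# states for French inputs:
# loss = F.mse_loss(autoencoder(h_blended), h_target)
```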

3 — Train Smarter, Not Harder: Only the blending coefficients and autoencoders are trained, while the weights of the original models are frozen. This means you’re adding a tiny amount of new information, not overwriting what’s already there.
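In code, the training setup could be as simple as freezing both source models and handing only the alpha module and autoencoder parameters to the optimizer. The function and argument names below are ours, chosen for illustration:

```python
import torch

def build_optimizer(base_model, finetuned_model, alpha_module, autoencoders, lr=1e-4):
    """Freeze both source models; train only the blending module and autoencoders."""
    for model in (base_model, finetuned_model):
        for p in model.parameters():
            p.requires_grad = False  # the original knowledge stays untouched
    trainable = list(alpha_module.parameters())
    for ae in autoencoders:
        trainable += list(ae.parameters())
    return torch.optim.AdamW(trainable, lr=lr)
```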

Modified GPT-2 Architecture

The Results: Best of Both Worlds

The paper demonstrates the effectiveness of this approach by merging a standard GPT-2 model (the general practitioner) with a version fine-tuned on French text (the specialist). The results are remarkable:

  • No More Forgetting: The merged model performed well on both English and French tasks, retaining the strengths of both original models. The specialist didn’t forget how to speak English! The merged model outperformed linear interpolation and task arithmetic methods, achieving a perplexity of 47.01 compared to 60.29 and 61.30, respectively. Next-token prediction accuracy also saw a significant boost.

  • Efficient and Compact: Compared to MoE, Superposition adds very little overhead, making it practical for real-world use.

  • Polyglot Neurons: Intriguingly, the merged model developed “polysemantic” neurons — single neurons that respond to concepts in both English and French. This suggests a more efficient and integrated way of representing knowledge.

  • Dynamic Representation Switching: Visualization techniques like t-SNE revealed the model’s ability to reconstruct domain-specific hidden states based on the input language (a sketch of this projection appears after the figure below).

    • English Inputs: Hidden states reconstructed by the merged model clustered closely with those from the base model, demonstrating the model’s ability to revert to the base representation for English text.

    • French Inputs: The reconstructed hidden states aligned with the fine-tuned model, reflecting an effective adaptation to French text.

Hidden states reconstruction - Layer 6 with autoencoders
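A minimal sketch of that projection, assuming layer-6 hidden states (for example, mean-pooled per prompt) have already been collected from the base, fine-tuned, and merged models as NumPy arrays:

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_projection(h_base: np.ndarray, h_ft: np.ndarray, h_merged: np.ndarray) -> np.ndarray:
    """Project all three models' hidden states into 2D to see which cluster the merged states join."""
    # each array: (num_examples, hidden_dim)
    stacked = np.concatenate([h_base, h_ft, h_merged], axis=0)
    return TSNE(n_components=2, perplexity=30, init="pca").fit_transform(stacked)
```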

  • Evaluation of Hidden State Trajectories: To further analyze model behavior, hidden state trajectories across layers were evaluated using Principal Component Analysis (PCA); a sketch of this analysis follows the figure below. Hidden states from layers 3 to 7 were extracted and projected into a 2D space, revealing:

    • Distinct Trajectories: The base, fine-tuned, and merged models exhibited unique paths in the PCA space, with the merged model dynamically transitioning between the two domains.

    • Improved Trajectory Stability (2D Variant): PCA analyses reveal that the 2D variant stabilizes hidden state trajectories, minimizing abrupt shifts between domains.

Hidden states trajectories
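As with the t-SNE plot above, the trajectory analysis is straightforward to reproduce in outline. The helper below is a sketch under the assumption that hidden states for layers 3 to 7 have been collected for one model:

```python
import numpy as np
from sklearn.decomposition import PCA

def layer_trajectory(hidden_states: dict, layers=range(3, 8)) -> np.ndarray:
    """hidden_states[l]: (num_examples, hidden_dim) for layer l of one model.
    Returns one 2D point per layer: the model's trajectory in PCA space."""
    layers = list(layers)
    stacked = np.concatenate([hidden_states[l] for l in layers], axis=0)
    coords = PCA(n_components=2).fit_transform(stacked)
    n = hidden_states[layers[0]].shape[0]
    # Average the projected points per layer to get a single trajectory point
    return np.stack([coords[i * n:(i + 1) * n].mean(axis=0) for i in range(len(layers))])
```

To compare the base, fine-tuned, and merged trajectories in the same plane, fit the PCA on their concatenated hidden states and apply the same transform to each model.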

Analysis of the 2D Variant Architecture

The 2D variant of Superposition in Transformers introduces an enhanced blending mechanism and dual-pathway autoencoders, emphasizing both local and global feature extraction.

Architecture Highlights

  • B-Spline Alpha Module: This module computes layer-wise, dimension-specific blending coefficients. These coefficients, shaped by B-splines, allow for fine-grained control over the blending process, adapting to the nuances of different tasks and input data.
  • Regularization: Smoothness, centrality, and bias variance losses ensure the blending coefficients remain interpretable and effective.
  • Dual-Pathway Autoencoders (sketched in code after this list):
    • Global Pathway: A low-rank adapter captures global features, emphasizing overall context.
    • Local Pathway: Convolutional layers extract localized features, refining task-specific representations.
    • Reconstruction: The decoder combines these pathways, ensuring accurate recovery of blended hidden states.
  • Integration into the Transformer Stack: The merged model uses the blended hidden states from the alpha module and refines them through the autoencoders in layers 4 to 11. Layers outside this range rely on direct blending, preserving computational efficiency.
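A sketch of what such a dual-pathway autoencoder could look like, with a low-rank linear adapter as the global pathway and a depthwise convolution over the sequence as the local pathway. The ranks, kernel size, and class name are assumptions for illustration, not the exact layers used in the notebooks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathwayAutoencoder(nn.Module):
    """Refines a (per-dimension) blended hidden state via global + local pathways."""
    def __init__(self, hidden_dim: int = 768, rank: int = 64, kernel_size: int = 3):
        super().__init__()
        # Global pathway: low-rank adapter over the feature dimension
        self.global_down = nn.Linear(hidden_dim, rank)
        self.global_up = nn.Linear(rank, hidden_dim)
        # Local pathway: depthwise convolution along the sequence dimension
        self.local_conv = nn.Conv1d(hidden_dim, hidden_dim, kernel_size,
                                    padding=kernel_size // 2, groups=hidden_dim)
        # Decoder combines both pathways to reconstruct the hidden state
        self.decoder = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, h_blended: torch.Tensor) -> torch.Tensor:
        # h_blended: (batch, seq_len, hidden_dim)
        g = self.global_up(F.gelu(self.global_down(h_blended)))
        l = self.local_conv(h_blended.transpose(1, 2)).transpose(1, 2)
        return self.decoder(torch.cat([g, l], dim=-1))
```

In the merged model, an instance of this module would refine the blended hidden states in layers 4 to 11, while the remaining layers rely on direct blending as described above.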

Repository Overview

  1. 1D-Alpha_Variant_LayerBias_LinearConv.ipynb
    Demonstrates the "1D-alpha model" using scalar α values for each layer.

    • B-spline-based α blending implementation
    • Autoencoder usage for reconstructing base/fine-tuned hidden states
    • Perplexity and accuracy metrics for English-French adaptation
  2. 2D-Alpha_Variant_LayerBias_ResLinearAdapter-Conv.ipynb
    Explores the "2D-alpha model" with vector-based α per dimension.

    • Local (convolutional) and global (adapter) autoencoder pathways
    • Polysemantic neuron analysis and multi-task representation
    • t-SNE visualizations of hidden states
  3. Benchmarks_1DAlpha_LayerBias_ConvLinear.ipynb

    • Performance comparisons against baselines
    • Perplexity and Jensen-Shannon divergence analysis
    • Direct comparisons with linear interpolation methods

Beyond Language: A New Approach to Modular AI

While the initial experiments focused on language, the implications of Superposition are far broader. The paper suggests the potential to:

  • Create Multi-Talented Models: Merge an LLM with specialized experts in coding, mathematics, or even emotional intelligence, creating a truly versatile AI.
  • Dynamic Switching: Develop models that can seamlessly switch between different modes of thinking within a single conversation, just like a human expert. For instance, a model could solve a math equation using its symbolic reasoning module, and then explain the real-world applications of the solution using its general knowledge, all in one smooth response.
  • Resource Efficiency: Training compact auxiliary modules while keeping the main model frozen ensures scalability and resource efficiency.
  • Continual Learning: Easily update models with new skills and knowledge without retraining them from scratch, keeping them relevant in a constantly evolving world.

The Future is Superposed

Superposition in Transformers is more than a clever technical trick; it is a new lens on how we build and adapt LLMs. By embracing superposition and modularity, we can create AI systems that are not only more powerful but also more adaptable, more efficient, and closer to human experts in how they learn and reason. And by blending representations within a shared space and reconstructing them dynamically, this approach sets the stage for AI systems that seamlessly integrate and retain diverse knowledge domains.