Frankestein Transformer: Unified Encoder-Decoder Library, CLI, and Research-Grounded Design Notes
Abstract
Frankestein Transformer presents a unified configuration-driven toolkit for systematic experimentation with modern transformer architectures, spanning seventeen sequence mixer variants and twenty-two optimizer families. The system supports both encoder-style masked language modeling (MLM) and decoder-style autoregressive (AR) next-token prediction through flexible model class and mode configuration, with specialized fine-tuning workflows for both architectures. The research contributions are threefold: (i) a strict schema-based configuration contract that enables reproducible experimentation across diverse attention mechanisms, including standard softmax attention, sigmoid attention, retentive networks, selective state-space models, continuous-depth transformers, adaptive depth routing (Mixture-of-Depths) [57], conditional memory augmentation (Engram) [8], sparse attention patterns, and gated mechanisms; (ii) a comprehensive optimizer routing framework supporting variance-reduction methods (MARS, Adan, AdEMAMix), memory-efficient variants (Adafactor, GaLore, Lion), schedule-free approaches, second-order preconditioners (Shampoo, SOAP, Sophia), and low-rank APOLLO-family optimizers (Apollo, Apollo-Mini, Q-Apollo) [55]; and (iii) end-to-end workflows spanning quantized deployment via ternary weight packing and sentence-embedding training inspired by SBERT, backed by an expanded quality-assurance stack with broad unit test coverage, YAML example validation, and continuous integration execution. The toolkit implements a web-based configuration interface that provides schema-driven form rendering with inline documentation and real-time validation. This technical reference document includes architectural diagrams, execution-flow visualizations, decision tables, and comprehensive appendices synthesizing literature on transformer architectures, sparse attention mechanisms, gated attention variants, and optimization algorithms.
The system enables rapid iteration while maintaining reproducible experimental conditions through its schema-first design philosophy.
Erick F. Merino M.
Independent work (no institutional affiliation)
erickfmm@gmail.com
February 2026
Contents
1 Introduction
  1.1 Motivation and Problem Statement
  1.2 Contributions
  1.3 Reading Guide
  1.4 Web-Based Configuration Builder
2 Related Work
  2.1 Sequence Mixer Architectures
    2.1.1 Dense Attention Baselines
    2.1.2 Recurrent and Retentive Architectures
    2.1.3 Sparse Attention Mechanisms
    2.1.4 Gated Attention Mechanisms
  2.2 Optimization Algorithms
3 System Design and Architecture
  3.1 Configuration-Centric Architecture
  3.2 Schema Scope and Validation Rules
  3.3 Complete Model Feature Inventory
  3.4 Training Task Types
  3.5 Complete Training Feature Inventory
  3.6 Optimizer Prefix Contract (Full)
  3.7 Training Safety and Runtime Semantics
  3.8 Normalization Variants: RMSNorm, Dynamic Tanh, and Dynamic Erf
  3.9 RMSNorm
  3.10 Dynamic Tanh (DyT)
  3.11 Dynamic Erf (Derf)
  3.12 Schema Implications
4 Architecture Taxonomy and Implementation
  4.1 Attention and Sequence-Mixer Families
  4.2 Standard Attention
  4.3 Sigmoid Attention
  4.4 Retentive Formulation
  4.5 Selective SSM (Mamba)
  4.6 ODE-style Continuous Updates
  4.7 Test-time Memory (Titans)
  4.8 Sparse and Gated Extensions in the Current Codebase
  4.9 Implemented Sparse Attention Blocks (Detailed)
  4.10 Implemented Gated Attention Blocks (Detailed)
5 Optimizer Families and Training Dynamics
6 Quantization and Deployment
  6.1 Ternary Quantization
  6.2 Activation Quantization
  6.3 Size Estimates
7 SBERT Downstream Tasks
8 Summary Tables
  8.1 Attention and Sequence-Mixer Summary
  8.2 Optimizer Summary
9 Discussion
  9.1 Schema-Driven Design Trade-offs
  9.2 Architectural Coverage and Gaps
  9.3 Optimizer Landscape Fragmentation
  9.4 Deployment and Production Considerations
  9.5 Integration and Extensibility Challenges
10 Conclusion
  10.1 Limitations and Future Directions
Bibliography
A Annex A: Optimizer Families
  A.1 Memory and Complexity Comparison
  A.2 The Evolution of Optimization in Neural Networks
  A.3 Standard Baseline and Adaptive Optimizers
  A.4 Advanced Momentum and Variance Reduction (2024–2025)
  A.5 Large-Batch, Memory-Efficient, and Parameter-Free Optimizers
  A.6 Second-Order, Geometric, and Orthogonality Optimizers
B Annex B: Dense, Recurrent, and Memory-Augmented Transformers
  B.1 Dense Attention Baselines: Standard and Sigmoid
    B.1.1 Standard Softmax Attention
    B.1.2 Sigmoid Attention
  B.2 Recurrent and Retentive Architectures
    B.2.1 Retentive Networks (RetNet)
    B.2.2 Mamba: Selective State-Space Models
  B.3 Continuous-Depth Transformers: ODE Integration
    B.3.1 ODE Transformer
  B.4 Test-Time Memory: Titans
  B.5 Architectural Comparison and Synthesis
C Annex C: Comprehensive Sparse Attention Mechanisms
  C.1 Executive Summary
  C.2 Sparse Transformer: Factorized Strided and Fixed Patterns
    C.2.1 Mathematical Formulation
    C.2.2 Algorithmic Pseudocode
    C.2.3 Key Characteristics
  C.3 Longformer: Sliding Window, Dilation, and Global Tokens
    C.3.1 Mathematical Formulation
    C.3.2 Algorithmic Pseudocode
    C.3.3 Key Characteristics
  C.4 BigBird: Random, Local, and Global Sparse Graph
    C.4.1 Mathematical Formulation
    C.4.2 Theoretical Guarantee
    C.4.3 Algorithmic Pseudocode
    C.4.4 Key Characteristics
  C.5 SparseK Attention: Differentiable Top-k Selection
    C.5.1 Mathematical Formulation
    C.5.2 Algorithmic Pseudocode
    C.5.3 Key Characteristics
  C.6 NSA (Native Sparse Attention): Hardware-Aligned Hierarchical Branches
    C.6.1 Mathematical Formulation
    C.6.2 Algorithmic Pseudocode
    C.6.3 Key Characteristics
  C.7 FASA: Frequency-Aware Sparse Attention
    C.7.1 Mathematical Formulation
    C.7.2 Algorithmic Pseudocode
    C.7.3 Key Characteristics
  C.8 SpargeAttn: Two-Stage Block-Level Filtering
    C.8.1 Mathematical Formulation
    C.8.2 Algorithmic Pseudocode
    C.8.3 Key Characteristics
  C.9 Comparative Summary
  C.10 Design Space and Selection Criteria
D Annex D: Gated Attention Families—Complete Literature Analysis
  D.1 Executive Summary: Gating for Memory Control
  D.2 1. Gated Linear Attention (GLA)
    D.2.1 Mathematical Foundation
    D.2.2 Hardware-Efficient Training
    D.2.3 Strengths and Limitations
  D.3 2. DeltaNet: Error-Correcting Linear Attention
    D.3.1 Mathematical Foundation
    D.3.2 Efficient Parallel Training
    D.3.3 Strengths and Limitations
  D.4 3. Gated DeltaNet: Synthesis of Gating and Error Correction
    D.4.1 Mathematical Foundation
    D.4.2 Empirical Performance and Trade-offs
  D.5 4. HGRN2: Hierarchical Gating with Outer-Product Expansion
    D.5.1 Mathematical Foundation
    D.5.2 Scaling and Empirical Results
  D.6 5. Forgetting Transformer (FoX): Gating in Softmax Logit Space
    D.6.1 Mathematical Foundation
    D.6.2 Integration with FlashAttention
    D.6.3 Strengths and Limitations
  D.7 6. Gated Attention (Post-SDPA Sigmoid Gating)
    D.7.1 Mathematical Foundation
    D.7.2 Key Findings
  D.8 Comparative Analysis: Taxonomy of Gated Mechanisms
  D.9 Unified Gating Principle
  D.10 Implementation and Practical Guidance
    D.10.1 When to Use Each Architecture
    D.10.2 Hardware Considerations
E Annex E: Conceptual Introduction—Transformers and Attention for Beginners
  E.1 What is a Transformer?
  E.2 The Attention Mechanism: An Analogy
  E.3 Practical Example Step-by-Step
    E.3.1 Context 1: “The cat eats fish”
    E.3.2 Context 2: “The restaurant eats into profit margins”
  E.4 How Attention Works Mathematically (Simple Version)
    E.4.1 Step 1: Queries, Keys, and Values
    E.4.2 Step 2: Compatibility
    E.4.3 Step 3: Focus
    E.4.4 Step 4: Combine
  E.5 Multiple Attention Heads: Multiple Perspectives
  E.6 Stacking Layers
  E.7 Why Does It Matter?
  E.8 Practical Challenges
    E.8.1 Computational Complexity
    E.8.2 Memory Requirements
    E.8.3 Device Efficiency
  E.9 Modern Solutions
  E.10 Key Takeaways
1 Introduction
1.1 Motivation and Problem Statement
Transformer architectures have fundamentally transformed the landscape of sequence modeling across natural language processing [38], computer vision, and computational biology. The success of models such as BERT [11], GPT, and their variants has established the transformer as the de facto standard for representation learning in deep learning. However, the rapid proliferation of architectural innovations presents significant practical challenges for researchers and practitioners.

Recent years have witnessed an explosion of alternative attention and sequence-mixing mechanisms, each addressing specific limitations of standard softmax attention: quadratic computational complexity [9], memory-efficient inference requirements [36, 12], content-based selective state management [2], and hardware-aware optimizations [30]. Simultaneously, the optimization literature has diversified beyond classical AdamW, introducing variance-reduction techniques [42, 26, 49], memory-efficient variants [33, 54], schedule-free approaches [10], and second-order preconditioning methods [14, 39].

The practical consequence is a fragmented research ecosystem where experimental comparison across architectures and optimizers requires significant engineering effort. Researchers must implement and debug multiple variants from scratch, ensure consistent training pipelines, and manage complex hyperparameter spaces. This fragmentation hampers reproducibility, slows scientific progress, and increases the barrier to entry for new researchers.

This work addresses these challenges through a unified, configuration-driven experimentation toolkit available at https://github.com/erickfmm/frankestein-transformer that provides:
1. Schema-First Design: A strict, validated configuration contract that enforces reproducibility while supporting seventeen distinct sequence mixer architectures and twenty-two optimizer families.
2. Architecture Agnostic Training: Common training infrastructure supporting dense attention baselines (standard, sigmoid), recurrent alternatives (RetNet, Mamba, ODE-style blocks), adaptive depth routing (Mixture-of-Depths) [57], conditional memory blocks (Engram) [8], sparse attention patterns (Sparse Transformer, Longformer, BigBird, SparseK, NSA, SpargeAttn, FASA), and gated mechanisms (GLA, DeltaNet, Gated DeltaNet, HGRN2, FoX, Gated Softmax).
3. Optimizer Routing Framework: Prefixed hyperparameter groups enabling fine-grained control over embeddings, normalization layers, recurrent blocks, attention blocks, and other parameter subsets across diverse optimizers, including the APOLLO family (Apollo, Apollo-Mini, Q-Apollo) [55].
4. End-to-End Workflows: Integrated deployment via quantization (ternary weight packing, INT8 activations) and sentence-embedding capabilities inspired by SBERT [31].
5. Interactive Configuration: Web-based interface providing schema-driven form rendering, real-time validation, and CLI command generation.
6. Reliability Tooling: Comprehensive automated testing with unit-test suites, YAML preset validation, and CI automation across supported Python versions.
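To make the routing idea in item 3 concrete, the sketch below shows one way prefixed hyperparameter keys can resolve into per-group optimizer settings. The prefix names (embedding, norm, attn) and the flat key layout are illustrative assumptions, not the repository's actual contract; Section 3.6 documents the real prefix scheme.

```python
def route_hyperparams(flat_config, prefixes):
    """Split a flat config like {"lr": 3e-4, "embedding_lr": 1e-4} into
    per-group settings: un-prefixed keys act as defaults, and keys that
    start with a known prefix override that group's value."""
    defaults, groups = {}, {p: {} for p in prefixes}
    for key, value in flat_config.items():
        for prefix in prefixes:
            if key.startswith(prefix + "_"):
                groups[prefix][key[len(prefix) + 1:]] = value
                break
        else:
            defaults[key] = value
    # Each group inherits the defaults, then applies its own overrides.
    return {p: {**defaults, **overrides} for p, overrides in groups.items()}

cfg = {"lr": 3e-4, "weight_decay": 0.01,
       "embedding_lr": 1e-4, "norm_weight_decay": 0.0}
resolved = route_hyperparams(cfg, ["embedding", "norm", "attn"])
print(resolved["embedding"])  # embeddings get the lower lr, shared decay
print(resolved["attn"])       # attention falls back to defaults entirely
```

The resolved dictionaries can then be passed as per-parameter-group kwargs to any optimizer constructor, which is what makes the scheme optimizer-agnostic.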
The project command surface is:
frankestein-transformer
with subcommands for encoder and decoder workflows: train, finetune, deploy, quantize, infer, sbert-train, sbert-infer, web-server. The web-server command launches a Streamlit-based configuration builder that provides:
• Schema-driven form fields with parameter titles and detailed descriptions
• Real-time tooltips and help text for each configuration option
• Live YAML preview and download functionality
• Generated CLI commands for training, deployment, inference, and SBERT workflows
This interactive interface serves as an alternative to manual YAML editing, improving usability for users exploring available configuration options and understanding their impact on model behavior and training dynamics.
1.2 Contributions
This work makes the following primary contributions:
1. Unified Configuration Schema: A YAML-based schema contract with strict validation that supports seventeen distinct sequence mixer architectures across four categories (dense baselines, recurrent/retentive blocks, sparse attention patterns, and gated mechanisms), alongside adaptive depth and conditional memory controls, while enforcing reproducibility through additionalProperties: false constraints.
2. Comprehensive Architecture Support: Implementation of modern transformer variants including standard softmax attention [38], sigmoid attention [30], RetNet [36], Mamba [12], ODE-style continuous transformers [52], Titans memory-augmented attention [2], Engram conditional memory layers [8], Mixture-of-Depths token routing [57], sparse attention mechanisms [9, 3, 50, 23, 48, 53, 41], and gated attention architectures [44, 45, 46, 28, 19, 29]. The system supports both encoder-mode training with bidirectional attention for masked language modeling (MLM) and decoder-mode training with causal masking for autoregressive next-token prediction.
3. Optimizer Routing Framework: Prefixed hyperparameter system enabling per-parameter-group control across twenty-two optimizers, including variance-reduction methods (MARS, Adan, AdEMAMix, Cautious AdamW), memory-efficient variants (Adafactor, GaLore, Lion), schedule-free approaches (Schedule-Free AdamW), curvature-aware methods (Sophia), second-order preconditioners (Shampoo, SOAP), orthogonality-oriented optimizers (Muon, Turbo-Muon), large-batch scaling (LAMB), and the APOLLO family (Apollo, Apollo-Mini, Q-Apollo) [55].
4. Quantization and Deployment: Integrated deployment pipeline supporting ternary weight packing and INT8 activation quantization with size estimates following the 1.58-bit storage approximation.
5. Sentence-Embedding Workflows: SBERT-inspired training and inference pipelines supporting similarity scoring, retrieval, clustering, and persistent embedding export.
6. Interactive Configuration Interface: Streamlit-based web server providing schema-driven form generation, real-time validation, inline documentation, and CLI command generation.
7. Automated Validation Pipeline: Continuous integration and expanded unit testing that verify model training paths, optimizer/schema integration, and YAML example compatibility.
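As a concrete illustration of contribution 4, the sketch below shows absmean ternary quantization in the BitNet b1.58 style: scale by the mean absolute weight, then round each entry to {-1, 0, +1}. This is an illustrative toy under assumed conventions, not the repository's packing code.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Absmean ternary quantization sketch: each weight is mapped to
    {-1, 0, +1} after scaling by the mean absolute weight."""
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate weights from the ternary codes."""
    return q.astype(np.float32) * scale

w = np.array([[0.9, -0.05, -1.2],
              [0.4,  0.0,  -0.6]], dtype=np.float32)
q, scale = ternary_quantize(w)
print(q)  # every entry is -1, 0, or +1
```

Since log2(3) ≈ 1.58 bits of information per ternary weight, a packer can store five ternary values per byte (3^5 = 243 ≤ 256), which is where the "1.58-bit" size estimate comes from.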
1.3 Reading Guide
This document is organized as a technical reference addressing the following operational concerns:
1. Prerequisites: Appendix E provides an accessible introduction to transformers and attention mechanisms for readers new to the field.
2. Configuration Contract: Section 3.1 describes the YAML schema that enforces valid experiments and Section 3.2 explains validation rules.
3. Architecture Selection: Section 3.8 covers normalization options; Section 4.1 provides comprehensive comparison of sequence mixer families; Appendices B, C, and D synthesize supporting literature.
4. Optimization Dynamics: Section 5 details optimizer routing and training dynamics; Ap- pendix A provides comprehensive optimizer family analysis.
5. Deployment and Inference: Sections 6 and 7 describe quantized deployment and sentence-embedding workflows.
1.4 Web-Based Configuration Builder
In addition to direct YAML editing, this project provides a Streamlit-based web interface (accessed via the web-server command) that improves configuration accessibility and discoverability. The interface presents schema fields with:
• Schema-driven form rendering — All fields are dynamically generated from the authori- tative schema, ensuring consistency and validation.
• Inline parameter documentation — Each form field displays a title from the schema as its label, with the description shown as a help tooltip on hover.
• Real-time configuration preview — Users see live YAML output as they modify form fields, enabling immediate validation feedback.
• CLI command generation — The interface generates complete CLI commands for training, deployment, inference, and SBERT workflows based on current configuration.
• Accessibility improvements — Tooltips and structured forms make configuration options easier to understand, especially for users new to the project or exploring novel architectures.
This web-based approach addresses common usability barriers in configuration-driven experimentation:
• Reduces need to memorize YAML structure and field names
• Prevents typos through schema validation
• Provides educational context through inline documentation
• Enables rapid experimentation with guided parameter tuning
• Serves as both a configuration tool and a learning resource
The web interface implementation uses Streamlit’s form widgets with:
• st.checkbox() for binary toggles with help text
• st.number_input() for numeric fields with step size and format
• st.selectbox() for enum choices with options display
• st.multiselect() for array selections from defined options
• st.text_input() for string fields
• st.info(), st.caption() for supplementary information
Schema metadata (the title and description fields) is extracted and rendered systematically across all form sections, including:
• Model architecture parameters (hidden size, layers, attention heads, etc.)
• Training runtime settings (batch size, accumulation, scheduler)
• Optimizer configuration with per-parameter-group hyperparameters
• Deployment and quantization options
• SBERT-specific training and inference parameters
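The widget mapping above can be summarized as a small dispatch function. The sketch below is a plain-Python illustration (no Streamlit import) of how a JSON-Schema-style property could select a widget; the real builder's logic may differ in details such as enum and array handling.

```python
def pick_widget(prop):
    """Map one schema property dict to the widget it would render as,
    mirroring the widget list above (illustrative, not the actual code)."""
    t = prop.get("type")
    if "enum" in prop:
        return "st.selectbox"        # fixed choice among the enum options
    if t == "boolean":
        return "st.checkbox"         # binary toggle with help text
    if t in ("integer", "number"):
        return "st.number_input"     # numeric field with step size / format
    if t == "array":
        return "st.multiselect"      # multi-choice from defined item options
    return "st.text_input"           # fall back to a free-form string field

# Example dispatches for typical schema properties:
print(pick_widget({"type": "boolean", "title": "Use AMP"}))
print(pick_widget({"type": "string", "enum": ["adamw", "lion"]}))
print(pick_widget({"type": "integer", "title": "Hidden size"}))
```

A loop over the schema's properties can then call the chosen widget with the property's title as the label and its description as the help tooltip, which is exactly the pattern the bullet list above describes.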
2 Related Work
This work sits at the intersection of three active research areas: alternative attention architectures, sparse attention mechanisms, and advanced optimization algorithms. This section provides a concise survey; detailed mathematical formulations and algorithmic descriptions are deferred to the appendices.
2.1 Sequence Mixer Architectures
2.1.1 Dense Attention Baselines
Standard softmax attention [38] achieves full global context through n^2 pairwise interactions but incurs prohibitive memory and compute costs for long sequences. Sigmoid attention [30] replaces the row-wise softmax with an element-wise sigmoid, yielding faster convergence and up to 17% kernel speedup, though it requires hybrid-norm stabilization at scale. Both serve as dense baselines in this toolkit.
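A toy numerical comparison of the two baselines makes the distinction concrete: softmax rows form a probability distribution, while element-wise sigmoid rows do not. The -log(n) bias is the stabilizer suggested in the sigmoid-attention literature; the code is illustrative, not the toolkit's implementation.

```python
import numpy as np

def softmax_attn(scores):
    """Row-wise softmax: each query's weights sum to exactly 1."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid_attn(scores, bias=0.0):
    """Element-wise sigmoid: weights are independent per entry, so
    rows no longer form a distribution."""
    return 1.0 / (1.0 + np.exp(-(scores + bias)))

n = 4
scores = np.random.default_rng(0).normal(size=(n, n))
p = softmax_attn(scores)
s = sigmoid_attn(scores, bias=-np.log(n))  # -log(n) bias keeps row mass modest
```

Dropping the row-wise normalization is what removes the reduction that softmax kernels must perform, which is where the reported kernel speedups come from.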
2.1.2 Recurrent and Retentive Architectures
RetNet [36] resolves the “impossible triangle” of parallel training, O(1) inference, and strong performance via a dual attention–retention formulation. Mamba [12] introduces input-dependent selectivity into state-space models, achieving linear complexity with hardware-aware scan algorithms. ODE-style transformers [52] treat depth as numerical integration over a continuous dynamical system. Titans [2] augments inference with test-time neural memorization for extremely long contexts. Detailed formulations are provided in Appendix B.
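The duality behind RetNet's triangle can be checked numerically in a single-head toy: the parallel, decay-masked form used for training and the O(1)-state recurrence used for inference produce identical outputs. This is an illustrative sketch, not the toolkit's kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
gamma = 0.9  # per-head decay factor

# Parallel form: O = (Q K^T * D) V with decay D[t, s] = gamma^(t-s) for s <= t.
t = np.arange(n)
D = np.where(t[:, None] >= t[None, :],
             gamma ** (t[:, None] - t[None, :]), 0.0)
O_parallel = ((Q @ K.T) * D) @ V

# Recurrent form: S_t = gamma * S_{t-1} + k_t^T v_t ; o_t = q_t S_t.
# The state S is a fixed d x d matrix, hence O(1) memory per step.
S = np.zeros((d, d))
O_recurrent = np.zeros((n, d))
for i in range(n):
    S = gamma * S + np.outer(K[i], V[i])
    O_recurrent[i] = Q[i] @ S

print(np.allclose(O_parallel, O_recurrent))  # True
```

The full RetNet additionally uses complex rotations and per-head decays, but the equivalence of the two computation modes is exactly this identity.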
2.1.3 Sparse Attention Mechanisms
Sparse attention methods reduce the quadratic bottleneck by restricting token interactions: Sparse Transformer [9] uses factorized strided and fixed patterns (O(n√n)); Longformer [3] employs sliding windows with global tokens (O(n·w)); BigBird [50] combines local, random, and global paths; SparseK [23] applies differentiable top-k selection; NSA [48] uses a three-branch hierarchical design; SpargeAttn [53] performs two-stage block-level pruning; and FASA [41] leverages RoPE frequency features for token selection. Complete descriptions are in Appendix C.
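To make these sparsity patterns tangible, here is a toy Boolean mask combining a Longformer-style sliding window with a few global tokens. It is a didactic sketch, not the repository's implementation, and it materializes the full n x n mask only for illustration (efficient kernels never would).

```python
import numpy as np

def longformer_mask(n, window, global_idx=()):
    """Boolean attention mask: True where a query may attend to a key.
    Combines a local band of half-width `window` with symmetric global
    attention for the positions listed in `global_idx`."""
    i = np.arange(n)
    mask = np.abs(i[:, None] - i[None, :]) <= window   # local sliding window
    for g in global_idx:
        mask[g, :] = True   # global token attends everywhere
        mask[:, g] = True   # and every token attends to it
    return mask

m = longformer_mask(8, window=1, global_idx=(0,))
# Each non-global row has ~(2*window + 1) local entries plus the globals,
# so total work scales as O(n * w) instead of O(n^2).
```

BigBird's pattern would add a few random True entries per row on top of the same local and global components.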
2.1.4 Gated Attention Mechanisms
Gated architectures control information flow through learnable gates: GLA [44] adds data-dependent diagonal gating to linear attention; DeltaNet [45] applies the delta learning rule to recurrent state updates; Gated DeltaNet [46] synthesizes both mechanisms; HGRN2 [28] introduces hierarchical forget gates with outer-product state expansion; FoX [19] embeds forget gates directly into softmax attention; and Gated Softmax [29] applies post-SDPA channel gating. Full mathematical derivations appear in Appendix D.
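The simplest member of this family to sketch is post-SDPA channel gating: the attention output is rescaled element-wise by a sigmoid gate computed from the layer input. The code below is an illustrative toy; the weight shape and gate parameterization are assumptions rather than the published architecture's exact form.

```python
import numpy as np

def gated_output(attn_out, x, Wg):
    """Post-SDPA channel gating sketch: modulate the attention output
    element-wise by a sigmoid gate derived from the layer input x."""
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))   # values strictly in (0, 1)
    return attn_out * gate

rng = np.random.default_rng(0)
n, d = 4, 8
attn_out = rng.normal(size=(n, d))   # stand-in for the SDPA result
x = rng.normal(size=(n, d))          # layer input that drives the gate
Wg = rng.normal(size=(d, d)) * 0.1   # gate projection (toy initialization)
y = gated_output(attn_out, x, Wg)
```

Because the gate lies in (0, 1), it can only attenuate channels, never amplify them, which is the "memory control" role the gated-attention literature emphasizes.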
2.2 Optimization Algorithms
Optimization of highly parameterized transformer architectures requires navigating non-convex loss landscapes with heterogeneous Hessian spectra. The literature has diversified into several families beyond the classical AdamW baseline [22]:
• Variance reduction and momentum: RAdam [21], Adan [42], ADOPT [37], AdEMAMix [26], MARS [49], and Cautious optimizers [18] address early-step instability, convergence guarantees, and dual-EMA history mixing.
• Memory-efficient: Adafactor [33], GaLore [54], Lion [7], and APOLLO [55] reduce optimizer-state memory through factorization, low-rank projection, or sign-based updates.
• Schedule-free and parameter-free: Schedule-Free AdamW [10] and Prodigy [24] absorb scheduler or learning-rate tuning into the optimizer dynamics.
• Second-order and curvature-aware: Shampoo [14], SOAP [39], and Sophia [20] incorporate approximate second-order information.
• Geometry-oriented: Muon [34] and Turbo-Muon [4] reshape update geometry through orthogonalization.
Detailed algorithmic descriptions, pseudocode, and a comprehensive complexity comparison are provided in Appendix A.
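As one worked example of how lean these update rules can be, here is a sketch of the Lion step: the sign of an interpolated momentum drives a fixed-magnitude update, so only a single EMA buffer is kept. Hyperparameter defaults are illustrative, and this is a didactic sketch rather than the toolkit's optimizer code.

```python
import numpy as np

def lion_step(w, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update. The sign nonlinearity means every coordinate
    moves by exactly lr (plus decoupled weight decay), and the single
    momentum buffer m is the optimizer's only state."""
    update = np.sign(beta1 * m + (1 - beta1) * g)  # interpolated momentum
    w = w * (1 - lr * wd) - lr * update            # decoupled weight decay
    m = beta2 * m + (1 - beta2) * g                # EMA of raw gradients
    return w, m

w = np.zeros(3)
m = np.zeros(3)
g = np.array([0.5, -2.0, 0.0])
w, m = lion_step(w, g, m, lr=0.1)
print(w)  # [-0.1, 0.1, 0.0] -- each nonzero coordinate moved by exactly lr
```

Compared with AdamW's two buffers, halving the optimizer state is what places Lion in the memory-efficient family above.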
3 System Design and Architecture
3.1 Configuration-Centric Architecture
The core design choice in this repository is that experimentation is schema-first. Instead of exposing a large number of loosely checked flags, the project forces model topology, optimizer family, training limits, and telemetry options through a single validated configuration document. This reduces ambiguity when reproducing results and makes it possible to compare many architectures under a consistent operational interface.

The authoritative contract is configs/schema.yaml. It enforces three top-level objects:
• model_class
• model
• training
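A minimal sketch of what root-level strictness buys, assuming all three objects are required. This is hand-rolled for illustration; the repository validates against configs/schema.yaml rather than this code.

```python
def check_top_level(config):
    """Reject any key outside the three schema-mandated top-level objects,
    mimicking `additionalProperties: false` at the document root."""
    allowed = {"model_class", "model", "training"}
    unknown = set(config) - allowed
    if unknown:
        raise ValueError(f"unknown top-level keys: {sorted(unknown)}")
    missing = allowed - set(config)
    if missing:
        raise ValueError(f"missing top-level keys: {sorted(missing)}")

# A well-formed document passes silently:
check_top_level({"model_class": "mini", "model": {}, "training": {}})
```

The practical benefit is that a typo such as "trainning" fails loudly at load time instead of being silently ignored during a long run.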
Model Class Selection. The model_class field determines the architectural variant instantiated by the training pipeline. Three options are supported:
• frankenstein: Mixed-architecture encoder models supporting diverse attention mechanisms (standard, sigmoid, retentive, state-space, sparse, and gated mixers) with MoE (Mixture of Experts) and advanced features. Optimized for bidirectional encoder-style training with masked language modeling (MLM) objectives.
• mini: Simplified encoder variant designed for smaller-scale training scenarios. Provides reduced parameter overhead and faster iteration for experimentation and prototyping.
• frankesteindecoder: Autoregressive causal decoder for LLM-style next-token generation. Enables causal attention masking for sequential text generation tasks. When this class is selected, runtime enforces mode=’decoder’.
Training Mode Selection. The model.mode field controls attention masking behavior across the model:
• encoder: Uses bidirectional attention where all tokens attend to all other tokens in the sequence. Suitable for masked language modeling (MLM) pre-training tasks where the model learns to predict randomly masked tokens based on full context.
• decoder: Uses causal masking where each token can only attend to previous tokens in the sequence. Required for autoregressive (AR) next-token prediction tasks such as language modeling and text generation. When model_class=’frankesteindecoder’, the system automatically forces mode=’decoder’ at runtime.
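The class/mode coupling described above can be sketched as a small resolution step. This is an illustrative simplification, not the repository's actual code; the function name resolve_mode is our assumption, while the field names mirror the schema:

```python
# Sketch of the runtime rule: selecting the decoder model class forces
# mode='decoder' regardless of what the user wrote in model.mode.
def resolve_mode(config: dict) -> dict:
    cfg = dict(config)            # avoid mutating the caller's config
    model = dict(cfg.get("model", {}))
    if cfg.get("model_class") == "frankesteindecoder":
        model["mode"] = "decoder"  # runtime override, per the schema contract
    cfg["model"] = model
    return cfg
```

A frankenstein or mini encoder configuration passes through unchanged; only the decoder class triggers the override.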
This dual architecture support enables the system to handle both encoder-style pre-training (MLM on bidirectional contexts) and decoder-style generation (causal autoregressive prediction) through a unified configuration interface. The model.layer_pattern supports legacy, sparse, and gated blocks:
• Retentive Network (RetNet) — internal reference: sun_retentive_2023 — code name: retnet, retnet_attn
• Mamba (Selective State Space Model) — internal reference: gu_mamba_2023 — code name: mamba
• ODE-style Continuous Depth Block — internal reference: zhang_continuous_2021 — code name: ode
• Titans Memory-Augmented Attention — internal reference: behrouz_titans_2025 — code name: titan_attn
• Standard Softmax Attention — internal reference: vaswani_attention_2017 — code name: standard_attn
• Sigmoid Self-Attention — internal reference: ramapuram_theory_2024 — code name: sigmoid_attn
• Sparse Transformer — internal reference: child_sparse_transformer_2019 — code name: sparse_transformer_attn
• Longformer — internal reference: beltagy_longformer_2020 — code name: longformer_attn
• BigBird — internal reference: zaheer_bigbird_2020 — code name: bigbird_attn
[Figure 1 diagram: YAML config → validation (schema + rules) → CLI (train/deploy/infer) and Web (Streamlit) front-ends → model (17 mixers), training (AMP + scheduler), optimizer (22 families) → model build (pattern + loops), train loop (checks + logs), optimizer step (group routing) → SBERT (search/cluster) and deployment (quantization).]
Figure 1: System architecture: configuration flows through validation, partitions into model/training/optimizer, executes runtime, produces deployment/SBERT artifacts.
• SparseK Attention — internal reference: lou_sparsek_2024 — code name: sparsek_attn
• Native Sparse Attention (NSA) — internal reference: yuan_nsa_2025 — code name: nsa_attn
• SpargeAttn — internal reference: zhang_spargeattn_2025 — code name: sparge_attn
• FASA (Frequency-aware Sparse Attention) — internal reference: wang_fasa_2026 — code name: fasa_attn
• Gated Linear Attention (GLA) — internal reference: yang_gla_2023 — code name: gla_attn
• DeltaNet — internal reference: yang_deltanet_2024 — code name: deltanet_attn
• Gated DeltaNet — internal reference: yang_gated_deltanet_2024 — code name: gated_deltanet_attn
• HGRN2 — internal reference: qin_hgrn2_2024 — code name: hgrn2_attn
• Forgetting Transformer (FoX) — internal reference: lin_forgetting_transformer_2025 — code name: fox_attn
• Gated Softmax Attention — internal reference: qiu_gated_attention_2025 — code name: gated_softmax_attn
which corresponds to current attention and sequence-mixer literature [36, 12, 52, 2, 30, 38, 9, 3, 50, 23, 48, 53, 41, 44, 45, 46, 28, 19, 29]. The training.optimizer.optimizer_class field supports a broad optimizer family: sgd_momentum, adamw, adafactor, galore_adamw, prodigy, lion, sophia, muon, turbo_muon, radam, adan, adopt, ademamix, mars_adamw, cautious_adamw, lamb, schedulefree_adamw, shampoo, soap, apollo, apollo_mini, and q_apollo.
3.2 Schema Scope and Validation Rules
The schema is strict: top-level and nested objects set additionalProperties: false. This guarantees that unknown keys fail fast instead of being silently ignored. The training.optimizer.parameters object is additionally constrained by optimizer-specific prefix rules through allOf+if/then pattern checks. Normalization values currently accepted by the schema are:
norm_type ∈{layer_norm, dynamic_tanh, derf}
Thus, rms_norm is not a valid schema value in the current contract.
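The fail-fast behavior of additionalProperties: false can be illustrated with a minimal key check. The real contract lives in configs/schema.yaml and is far richer (types, ranges, prefix rules); this sketch, with an assumed helper name, only shows why unknown keys raise instead of being silently ignored:

```python
# Minimal illustration of strict top-level validation in the spirit of
# additionalProperties: false. Unknown keys raise instead of being dropped.
ALLOWED_TOP_LEVEL = {"model_class", "model", "training"}

def check_top_level(config: dict) -> None:
    unknown = set(config) - ALLOWED_TOP_LEVEL
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
```

A config with a misspelled key (e.g. trainig) fails immediately at validation time rather than producing a silently default-configured run.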
3.3 Complete Model Feature Inventory
Field | Type/Range | Meaning
model_class | enum | Architecture variant: frankenstein (mixed-architecture encoder), mini (simplified encoder), frankesteindecoder (autoregressive decoder).
model.mode | enum | Attention mode: encoder (bidirectional, for MLM) or decoder (causal, for AR). When model_class=’frankesteindecoder’, forces mode=’decoder’.
vocab_size | int ≥1 | Vocabulary size.
hidden_size | int ≥1 | Hidden dimension.
num_layers | int ≥1 | Physical layer count.
num_loops | int ≥1 | Logical loop count (looped blocks).
num_heads | int ≥1 | Attention heads.
retention_heads | int ≥1 | Retention heads for RetNet-style mixers.
num_experts | int ≥1 | MoE expert count.
top_k_experts | int ≥1 | Top-k expert routing in MoE.
dropout | float [0, 1] | Global dropout.
layer_pattern | array enum | Ordered block list: legacy (retnet, retnet_attn, mamba, ode, titan_attn, standard_attn, sigmoid_attn), sparse (sparse_transformer_attn, longformer_attn, bigbird_attn, sparsek_attn, nsa_attn, sparge_attn, fasa_attn), and gated (gla_attn, deltanet_attn, gated_deltanet_attn, hgrn2_attn, fox_attn, gated_softmax_attn).
ode_solver | enum | rk4 or euler.
ode_steps | int ≥1 | ODE integration steps.
use_bitnet | bool | Enable low-bit BitLinear path.
norm_type | enum | layer_norm, dynamic_tanh, derf.
use_factorized_embedding | bool | Enable factorized embeddings.
factorized_embedding_dim | int ≥1 | Reduced embedding dimension for factorization.
use_embedding_conv | bool | Enable Conv1d over embedding stream.
embedding_conv_kernel | int ≥1 | Conv1d kernel size.
hope_base | float ≥0 | HoPE base value (optional in schema).
hope_damping | float ≥0 | HoPE damping (optional in schema).
use_hope | bool | Apply HoPE in titan_attn.
use_moe | bool | Enable MoE FFN routing path.
ffn_hidden_size | int ≥1 | FFN intermediate width.
ffn_activation | enum | silu or gelu.
Looped depth induced by the schema is:

L_logical = num_layers × num_loops

which is the configuration-level definition of looped blocks.
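The looped-block semantics can be sketched in a few lines. Both helper names here are illustrative, not the repository's API; the point is that the physical stack is re-entered with shared weights, so effective depth multiplies:

```python
# Logical depth induced by looped blocks: L_logical = num_layers * num_loops.
def logical_depth(num_layers: int, num_loops: int) -> int:
    return num_layers * num_loops

def run_looped(blocks, h, num_loops: int):
    # Re-enter the whole physical stack num_loops times (weight sharing:
    # the same block objects are reused on every pass).
    for _ in range(num_loops):
        for block in blocks:
            h = block(h)
    return h
```

For example, num_layers=4 with num_loops=3 yields a logical depth of 12 while storing parameters for only 4 physical blocks.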
3.4 Training Task Types
The training.task field determines the training objective, working in conjunction with model.mode to define how the model learns:
Masked Language Modeling (MLM).
• Mode: Requires mode=’encoder’ for bidirectional attention.
• Objective: Randomly mask tokens in the input sequence (typically 15%) and train the model to predict the masked tokens based on full bidirectional context.
• Use Case: Pre-training encoders for representation learning, following BERT-style methodology [11]. The model learns bidirectional representations capturing context from both left and right.
• Configuration: Uses mlm_probability parameter to control masking fraction.
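The masking objective can be sketched as follows. This is an illustrative simplification and not this repository's collator: the helper name, the -100 ignore-index convention, and the omission of BERT's 80/10/10 replacement split are all assumptions:

```python
import random

def mlm_mask(tokens, mask_id, mlm_probability=0.15, rng=None):
    # Replace a random ~mlm_probability fraction of tokens with the mask id
    # and record the original token as the label at masked positions only.
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for t in tokens:
        if rng.random() < mlm_probability:
            inputs.append(mask_id)
            labels.append(t)       # predict the original token here
        else:
            inputs.append(t)
            labels.append(-100)    # conventionally ignored by the loss
    return inputs, labels
```

The loss is then computed only over positions whose label is not the ignore index, which is exactly what makes the objective "fill in the blanks" rather than full reconstruction.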
Autoregressive (AR) Next-Token Prediction.
• Mode: Requires mode=’decoder’ for causal masking.
• Objective: Train model to predict next token in sequence given all previous tokens only.
• Use Case: Language generation and LLM-style tasks following GPT methodology. Model learns sequential dependencies with causal attention where each token can only attend to preceding tokens.
• Model Class: Typically uses model_class=’frankesteindecoder’ for autoregressive decoder architectures.
This dual task support enables unified experimentation across both encoder-style pre-training (MLM for bidirectional understanding) and decoder-style generation (AR for sequential text production) within the same codebase.
3.5 Complete Training Feature Inventory
Field | Type/Range | Meaning
batch_size | int ≥1 | Loader batch size.
dataloader_workers | int ≥0 | PyTorch dataloader workers.
max_length | int ≥1 | Sequence length cap.
task | enum | Training objective: mlm (masked language modeling) or sbert (sentence embedding).
mlm_probability | float [0, 1] | MLM masking probability (applies only when task=’mlm’).
max_samples | int ≥1 | Maximum streamed samples.
dataset_batch_size | int ≥1 | Internal streaming dataset chunk size.
num_workers | int ≥0 | Streaming dataset workers.
cache_dir | string | Dataset cache directory.
local_parquet_dir | string | Optional local parquet path.
prefer_local_cache | bool | Prefer local cache when available.
stream_local_parquet | bool | Stream from local parquet mode.
use_amp | bool | Mixed precision toggle.
gradient_accumulation_steps | int ≥1 | Effective batch through accumulation.
optimizer | object | Contains optimizer_class and prefixed parameters.
scheduler_total_steps | int ≥1 | Scheduler horizon.
scheduler_warmup_ratio | float [0, 1] | Warmup ratio.
scheduler_type | enum | cosine, constant, linear_warmup_then_constant.
grad_clip_max_norm | float ≥0 | Global norm clipping threshold.
inf_post_clip_threshold | float ≥0 | Exploding-gradient guard threshold after clipping.
max_nan_retries | int ≥0 | Retry budget for NaN/Inf instability.
checkpoint_every_n_steps | int ≥1 | Rolling checkpoint frequency.
max_rolling_checkpoints | int ≥1 | Number of rolling checkpoints to keep.
num_best_checkpoints | int ≥1 | Number of best checkpoints tracked.
nan_check_interval | int ≥1 | NaN/Inf check cadence.
log_gradient_stats | bool | Enable gradient statistics logging.
gradient_log_interval | int ≥1 | Gradient logging cadence.
csv_log_path | string | Step-level CSV output path.
csv_rotate_on_schema_change | bool | Rotate CSV if logging schema changes.
gpu_metrics_backend | enum | nvml or none.
nvml_device_index | int ≥0 | Device index for NVML telemetry.
enable_block_grad_norms | bool | Include per-block gradient norm telemetry.
telemetry_log_interval | int ≥1 | Heavy telemetry interval (optimizer steps).
use_galore | bool | Enable GaLore strategy.
galore_rank | int ≥1 | GaLore low-rank projection dimension.
galore_update_interval | int ≥1 | Projection refresh interval.
galore_scale | float ≥0 | Gradient scaling in projected space.
galore_max_dim | int ≥1 | Maximum tensor dimension for GaLore projection.
3.6 Optimizer Prefix Contract (Full)
Supported optimizer_class values are: sgd_momentum, adamw, adafactor, galore_adamw, prodigy, lion, sophia, muon, turbo_muon, radam, adan, adopt, ademamix, mars_adamw, cautious_adamw, lamb, schedulefree_adamw, shampoo, soap, apollo, apollo_mini, q_apollo. Shared per-group suffix families (all prefixed by optimizer name) are:
• LR groups: lr_embeddings, lr_norms, lr_ode, lr_retnet, lr_mamba, lr_attention, lr_other
• Weight decay groups: wd_embeddings, wd_norms, wd_ode, wd_retnet, wd_mamba, wd_attention, wd_other
• Beta groups: betas_embeddings, betas_norms, betas_ode, betas_retnet, betas_mamba, betas_attention, betas_other
• Epsilon groups: eps_embeddings, eps_norms, eps_ode, eps_retnet, eps_mamba, eps_attention, eps_other
Optimizer-specific global suffixes:
• sgd_momentum: momentum, nesterov
• adafactor: beta2_decay, clip_threshold, eps1, eps2
• galore_adamw: rank, update_proj_gap
• prodigy: d_coef
• sophia: rho, update_k
• muon / turbo_muon: momentum, nesterov, ns_steps, ns_eps
• cautious_adamw: cautious_clip
• apollo: rank, update_proj_gap, scale, scale_type, proj_type, scale_front, disable_nl
• apollo_mini: update_proj_gap, scale, proj_type, scale_front, disable_nl
• q_apollo: rank, update_proj_gap, scale, scale_type, proj_type, scale_front, disable_nl, quant_bits
All other classes in the list above accept only prefixed shared groups.
3.7 Training Safety and Runtime Semantics
Schema-level safety features include accumulation, clipping, post-clip explosion checks, and NaN retries:

g_acc = (1/K) Σ_{i=1}^{K} g_i,  K = gradient_accumulation_steps

g_clip = g_acc · min(1, τ / (‖g_acc‖_2 + ε)),  τ = grad_clip_max_norm

then overflow guards use inf_post_clip_threshold, and retry logic is bounded by max_nan_retries.
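The clipping and guard formulas can be sketched in plain Python over a flattened gradient vector. Both helper names are illustrative (the repository's training loop operates on parameter tensors, not flat lists):

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    # g_clip = g * min(1, tau / (||g||_2 + eps)), matching the formula above.
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + eps))
    return [g * scale for g in grads], norm

def guard_step(grads, max_norm, post_clip_threshold):
    # Returns None when the step must be skipped (caller consumes one
    # NaN retry from the max_nan_retries budget); otherwise the safe update.
    clipped, _ = clip_by_global_norm(grads, max_norm)
    post_norm = math.sqrt(sum(g * g for g in clipped))
    bad = any(math.isnan(g) or math.isinf(g) for g in clipped)
    if bad or post_norm > post_clip_threshold:
        return None
    return clipped
```

Note that clipping rescales the whole gradient vector uniformly, so the update direction is preserved; the post-clip threshold then catches non-finite values that clipping alone cannot repair.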
3.8 Normalization Variants: RMSNorm, Dynamic Tanh, and Dynamic Erf
Normalization determines how activation scale is controlled across depth. In this repository, normalization is not only a modeling choice but also a schema compatibility question, because only certain values are currently accepted by norm_type. The three formulations most relevant to this codebase are:
3.9 RMSNorm
RMSNorm removes mean-centering and only rescales by root-mean-square magnitude [51]:

RMS(x) = √( (1/d) Σ_{i=1}^{d} x_i^2 + ε ),  y_i = γ_i · x_i / RMS(x)

Compared with LayerNorm, RMSNorm is computationally simpler (no subtraction of the feature mean) and is often used when reducing normalization overhead is important.
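A scalar-level sketch of the RMSNorm formula (pure Python over a feature vector; real implementations vectorize this and keep γ as a learned parameter):

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    # y_i = gamma_i * x_i / sqrt(mean(x^2) + eps); unlike LayerNorm,
    # no mean is subtracted before rescaling.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]
```

Because only one reduction (the mean of squares) is needed instead of two (mean and variance), the per-token cost is lower than LayerNorm's.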
Algorithm 1 Schema-Driven Training Step with Stability Controls
Require: Batch stream, config C
1: Initialize retry counter r ←0
2: for each optimizer step do
3: Accumulate gradients for K = C.gradient_accumulation_steps micro-batches
4: Apply global norm clipping with τ = C.grad_clip_max_norm
5: if post-clip gradient exceeds C.inf_post_clip_threshold or NaN/Inf detected then
6: if r < C.max_nan_retries then
7: restore safe state / skip step; r ←r + 1
8: continue
9: else
10: stop training with failure state
11: end if
12: end if
13: run optimizer step selected by optimizer_class
14: update scheduler (cosine, constant, or linear_warmup_then_constant)
15: if step mod checkpoint_every_n_steps = 0 then
16: save rolling checkpoint and prune to max_rolling_checkpoints
17: end if
18: update best checkpoints up to num_best_checkpoints
19: emit CSV + telemetry following gradient_log_interval and telemetry_log_interval
20: end for
3.10 Dynamic Tanh (DyT)
Dynamic Tanh proposes replacing explicit normalization with a bounded elementwise map [56]:
DyT(x) = tanh(αx)
where α is learned. The core idea is that bounded nonlinear contraction can provide stable signal scaling without explicitly computing per-token normalization statistics.
3.11 Dynamic Erf (Derf)
Derf extends the same normalization-free direction by using an error-function based map [6]:
Derf(x) = erf(αx + s)
with learnable scale/shift. Reported results in the cited work indicate stronger performance than DyT and common normalization baselines across multiple domains.
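Both normalization-free maps above are elementwise and bounded, which a short sketch makes concrete (α and s would be learnable parameters in a real module; here they are plain floats):

```python
import math

def dyt(x, alpha):
    # Dynamic Tanh: bounded elementwise map, no per-token statistics needed.
    return [math.tanh(alpha * v) for v in x]

def derf(x, alpha, shift):
    # Dynamic Erf: erf(alpha * x + s) with learnable scale and shift.
    return [math.erf(alpha * v + shift) for v in x]
```

Both outputs are confined to (−1, 1), which is the bounded-contraction property the cited works rely on for stable signal scaling without normalization statistics.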
3.12 Schema Implications
The current configuration contract in this repository allows:
norm_type ∈{layer_norm, dynamic_tanh, derf}
so DyT and Derf are directly available in schema-driven runs, while RMSNorm is not currently an accepted enum value and would require code/schema extension.
Method | Formula | Stats Needed | Notes
RMSNorm | γ_i x_i / √((1/d) Σ_j x_j^2 + ε) | RMS only | Lower overhead; widely used baseline [51].
Dynamic Tanh | tanh(αx) | none | Normalization-free bounded transform; drop-in replacement [56].
Dynamic Erf | erf(αx + s) | none | Normalization-free alternative; improves over DyT [6].
4 Architecture Taxonomy and Implementation
4.1 Attention and Sequence-Mixer Families
This system implements seventeen distinct sequence mixer architectures organized into four functional categories reflecting research trends in sequence modeling design. The taxonomical organization reflects evolving understanding of how to balance expressivity, computational efficiency, and memory constraints.
1. Dense Attention Baselines: Standard softmax attention and sigmoid attention provide full global contextualization at quadratic computational cost, serving as reference baselines for comparison with more efficient alternatives.
2. Recurrent and Retentive Architectures: RetNet, Mamba, ODE-style blocks, and Titans maintain state representations enabling O(1) inference cost while preserving expressivity through recurrent dynamics, selective parameters, or test-time memory adaptation.
3. Sparse Attention Patterns: Seven sparse variants (Sparse Transformer, Longformer, BigBird, SparseK, NSA, SpargeAttn, FASA) reduce quadratic complexity through structured sparsity, token selection, or training-free pruning strategies.
4. Gated Memory Mechanisms: Six gated architectures (GLA, DeltaNet, Gated DeltaNet, HGRN2, FoX, Gated Softmax) introduce data-dependent control over memory retention, forgetting, and update strength.
4.2 Standard Attention
Given projected matrices (Q, K, V):

Attn(Q, K, V) = softmax(QK^⊤ / √d_k) V

This is the baseline mechanism for content routing [38].
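A minimal numpy rendering of the formula above (single head, no batching or masking; a sketch rather than the repository's SDPA path):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V
```

Each output row is a convex combination of the value rows, with weights given by the softmax over scaled query-key similarities; this is the quadratic-cost baseline the sparse and gated variants below try to approximate more cheaply.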
4.3 Sigmoid Attention
Sigmoid attention removes row-wise probability normalization:

SigmoidAttn(Q, K, V) = σ(QK^⊤ / √d_k + b) V

and has different training stability requirements, often needing additional normalization [30].
[Figure 2 diagram: Sequence Mixer Registry (17 variants). Dense (2): standard_attn, sigmoid_attn. Recurrent (4): retnet / retnet_attn, mamba, ode, titan_attn. Sparse (7): sparse_transformer_attn, longformer_attn, bigbird_attn, sparsek_attn, nsa_attn, sparge_attn, fasa_attn. Gated (6): gla_attn, deltanet_attn, gated_deltanet_attn, hgrn2_attn, fox_attn, gated_softmax_attn.]
Figure 2: Comprehensive taxonomy of seventeen supported sequence mixer architectures. Dense baselines provide full global routing at quadratic cost. Recurrent architectures enable constant-time inference through state compression. Sparse variants reduce complexity through structured patterns or token selection. Gated mechanisms introduce data-dependent control over memory retention and forgetting.
4.4 Retentive Formulation
RetNet uses retention with decay matrix D:

Retention(Q, K, V) = (QK^⊤ ⊙ D) V

with recurrent form:

S_n = γ S_{n−1} + k_n^⊤ v_n,  o_n = q_n S_n

enabling low-cost recurrent inference [36, 43].
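The recurrent form can be sketched directly: the state S is a d_k × d_v matrix that is decayed and updated once per token, which is what makes per-step inference O(1) in sequence length. A single-head numpy sketch (not the repository's chunkwise implementation):

```python
import numpy as np

def retention_recurrent(q, k, v, gamma):
    # S_n = gamma * S_{n-1} + k_n^T v_n ; o_n = q_n S_n
    # q, k: (T, d_k); v: (T, d_v); returns outputs of shape (T, d_v).
    d_k, d_v = k.shape[1], v.shape[1]
    S = np.zeros((d_k, d_v))
    outs = []
    for qn, kn, vn in zip(q, k, v):
        S = gamma * S + np.outer(kn, vn)   # decay old state, write new pair
        outs.append(qn @ S)                # read with the current query
    return np.stack(outs)
```

With γ = 0 the state carries no history and each output collapses to (q_n · k_n) v_n, which is a convenient sanity check on the recurrence.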
4.5 Selective SSM (Mamba)
The discrete selective state-space recurrence is:

h_t = Ā_t h_{t−1} + B̄_t x_t,  y_t = C_t h_t

where (Ā_t, B̄_t, C_t) depend on the input, preserving linear-time scaling with a hardware-aware scan [12, 15].
4.6 ODE-style Continuous Updates
Continuous-depth framing:

dh(t)/dt = f_θ(h(t), t)

with practical Runge-Kutta integrators for discrete execution [52].
4.7 Test-time Memory (Titans)
A memory-augmented update can be written:

M_t = (1 − α_t) M_{t−1} + S_t,  S_t = η_t S_{t−1} − θ_t ∇ℓ(M_{t−1}; x_t)

to adapt memory at inference time [2, 1].
[Figure 3 diagram: a token sequence feeds three sparse design strategies: local/windowed (Longformer), hybrid sparse graph (BigBird / Sparse Transformer), and selective pruning (SparseK / NSA / FASA / SpargeAttn), i.e., locality, structured graph sparsity, and learned or predicted selection.]
Figure 3: Conceptual map of sparse attention design choices used in the codebase. Different methods reduce cost by restricting neighborhoods, constructing sparse graphs, or selecting only high-value tokens/blocks.
4.8 Sparse and Gated Extensions in the Current Codebase
The mixer registry now includes sparse blocks: sparse_transformer_attn, longformer_attn, bigbird_attn, sparsek_attn, nsa_attn, sparge_attn, and fasa_attn; and gated blocks: gla_attn, deltanet_attn, gated_deltanet_attn, hgrn2_attn, fox_attn, and gated_softmax_attn. The implementation enforces an explicit execution policy for training-free sparse methods: fasa_attn and sparge_attn are eval/inference-only and raise runtime errors if used while the model is in training mode.
4.9 Implemented Sparse Attention Blocks (Detailed)
This codebase includes seven sparse attention families [9, 3, 50, 23, 48, 53, 41].
Sparse Transformer (sparse_transformer_attn). Uses factorized sparse masks (strided + fixed) to approximate dense connectivity at lower cost than full O(n^2) attention:

Attn_i = softmax(q_i K_{A_i}^⊤ / √d_k) V_{A_i}

where A_i is the sparse neighborhood induced by stride/fixed rules. [9]
Longformer (longformer_attn). Uses sliding-window locality with optional global tokens:
A_i = {j : |i − j| ≤ w/2} ∪ G
yielding linear scaling in sequence length for fixed window w. [3]
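The neighborhood definition above can be made concrete with a small helper (illustrative name; real implementations build a boolean mask over the whole sequence instead of per-token sets):

```python
def longformer_neighborhood(i, n, w, global_tokens=()):
    # A_i = { j : |i - j| <= w/2 } ∪ G, clipped to valid positions [0, n).
    half = w // 2
    local = set(range(max(0, i - half), min(n, i + half + 1)))
    return local | set(global_tokens)
```

Each token attends to at most w + 1 local positions plus the fixed global set, so total attention work grows linearly in n for a fixed window w.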
BigBird (bigbird_attn). Combines local windows, random links, and global tokens:
A_i = A_i^window ∪ A_i^random ∪ A_i^global
to preserve strong long-context connectivity with sparse computation. [50]
SparseK (sparsek_attn). Uses a differentiable top-k style projection over importance scores before attention, so only selected KV pairs participate in the expensive dot-product path. [23]
NSA (nsa_attn). Implements a three-branch sparse design: a compressed branch, a selected branch, and a local window branch, then combines them with learned gates:

o_t = Σ_{c ∈ {cmp, sel, win}} g_t^c · Attn(q_t, K̃_t^c, Ṽ_t^c)
[Figure 4 diagram: the previous state S_{t−1} and gate(s) (α_t, β_t, G_t, f_t) combine with a new key/value or SDPA output to produce the updated memory/output S_t or O_t.]
Figure 4: Generic gating template. A gate can decay existing memory, regulate write strength, or modulate dense attention outputs, depending on the block family.
SpargeAttn (sparge_attn). Two-stage training-free block filtering: first predicts negligible block interactions, then applies softmax-aware pruning to remove low-contribution blocks. [53]
FASA (fasa_attn). Frequency-aware training-free attention: uses dominant RoPE frequency chunks for token importance prediction, then applies full attention only on selected tokens. [41]
Block | Trainable | Asymptotic Trend | Primary Sparsity Unit | Current Integration Notes | Ref
Sparse Transformer | Yes | sub-quadratic | mask pattern (token-level) | Factorized strided/fixed masks inside SDPA pipeline. | [9]
Longformer | Yes | linear in n (fixed w) | sliding window + global tokens | Window mask with optional global indices. | [3]
BigBird | Yes | near-linear | window + random + global edges | Randomized sparse mask plus local/global paths. | [50]
SparseK | Yes | linear-like (selected KV) | differentiable top-k KV selection | Learned score net + SparseK projection + gathered KV attention. | [23]
NSA | Yes | reduced-token multi-branch | compressed blocks + selected blocks + local window | Three sparse branches gated into one output tensor. | [48]
SpargeAttn | No (training-free) | sparse block dependent | block-level predicted sparsity | Eval-only in this repo; raises in training mode. | [53]
FASA | No (training-free) | selected-token dependent | dominant frequency chunks + selected tokens | Eval-only in this repo; raises in training mode. | [41]
4.10 Implemented Gated Attention Blocks (Detailed)
This codebase includes seven gated block entries (the six gated mixers plus a RetNet alias) [44, 45, 46, 36, 28, 19, 29]. The unifying idea is that gating controls what information survives. Some gates act on recurrent state updates (GLA, DeltaNet variants, HGRN2), while others modify the full attention path itself (FoX and Gated Softmax). This makes gating especially useful when the model must trade off recall, recency, and bounded memory.
GLA (gla_attn). Gated Linear Attention applies data-dependent multiplicative decay in recurrent state updates:

S_t = G_t ⊙ S_{t−1} + v_t k_t^⊤,  o_t = S_t q_t

to control memory accumulation. [44]
DeltaNet (deltanet_attn). Uses a delta-rule error-correcting write with learned write strength β_t:

S_t = S_{t−1}(I − β_t k_t k_t^⊤) + β_t v_t k_t^⊤

which improves targeted memory replacement. [45]
Gated DeltaNet (gated_deltanet_attn). Adds a decay gate α_t on top of delta-rule writes:

S_t = α_t S_{t−1}(I − β_t k_t k_t^⊤) + β_t v_t k_t^⊤

for both global forgetting and local corrective updates. [46]
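The delta-rule write admits a compact sketch. With S stored as a d_v × d_k matrix, one update both erases the old association at key k and writes the new value v; α = 1 recovers plain DeltaNet (helper name is ours, not the repository's):

```python
import numpy as np

def delta_write(S, k, v, beta, alpha=1.0):
    # Gated DeltaNet update: S_t = alpha * S_{t-1} (I - beta k k^T) + beta v k^T
    # S: (d_v, d_k) memory; k: (d_k,) key; v: (d_v,) value; beta: write strength.
    d_k = k.shape[0]
    return alpha * S @ (np.eye(d_k) - beta * np.outer(k, k)) + beta * np.outer(v, k)
```

For a unit-norm key and β = 1, reading the updated memory back at k (i.e., S_t k) returns exactly v regardless of the previous contents along that key direction, which is the "targeted memory replacement" property.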
RetNet Attn Alias (retnet_attn). Provides an explicit gated-package alias wrapping multi-scale retention behavior for naming consistency in layer registries. [36]
HGRN2 (hgrn2_attn). Uses lower-bounded forget gates with outer-product state expansion:

S_t = diag(g_t) S_{t−1} + v_t k_t^⊤

to increase recurrent state expressiveness while remaining efficient. [28]
FoX (fox_attn). Injects a token-wise forget bias directly into the softmax logits:

O = softmax(QK^⊤ + D) V

where D is derived from cumulative log-forget gates. [19]
Gated Softmax (gated_softmax_attn). Applies a post-SDPA sigmoid gate:

Y′ = SDPA(Q, K, V) ⊙ σ(X W_g)

which adds multiplicative channel gating without replacing softmax attention. [29]
Block | State Type | Gate Mechanism | Softmax Path | Current Integration Notes | Ref
GLA | matrix recurrent state | data-dependent multiplicative decay | No (linear recurrent) | Recurrent update with low-rank gate projection. | [44]
DeltaNet | matrix recurrent state | write gate (β) | No (linear recurrent) | Delta-rule correction with normalized Q/K. | [45]
Gated DeltaNet | matrix recurrent state | decay + write gates (α, β) | No (linear recurrent) | Combined forgetting and targeted writing. | [46]
RetNet Attn | matrix recurrent state | fixed multi-scale decay | No (retention) | Alias wrapper over existing RetNet mixer. | [36]
HGRN2 | matrix recurrent state | lower-bounded forget gate | No (linear recurrent) | Hierarchical recurrent gating with outer products. | [28]
Algorithm 2 Pattern-Driven Mixer Forward (Conceptual)
Require: Hidden states H, pattern P, layer index ℓ
1: m ← P[ℓ mod |P|]
2: if m ∈{fasa_attn, sparge_attn} and model is in training mode then
3: raise configuration/runtime error (training-free block in train mode)
4: else if m = standard_attn then
5: H ←softmax-attention(H)
6: else if m = sigmoid_attn then
7: H ←sigmoid-attention(H)
8: else if m ∈{retnet, retnet_attn} then
9: H ←retention(H)
10: else if m = mamba then
11: H ←selective-ssm(H)
12: else if m = ode then
13: H ←rk-step(H)
14: else if m is a sparse attention key then
15: H ←sparse-attention-family(H)
16: else if m is a gated attention key then
17: H ←gated-attention-family(H)
18: else
19: H ←memory-augmented-attn(H)
20: end if
21: return H
Block | State Type | Gate Mechanism | Softmax Path | Current Integration Notes | Ref
FoX | full attention matrix | logit-space forget gate | Yes | Forget bias added before softmax. | [19]
Gated Softmax | full attention matrix | post-attention sigmoid gate | Yes | Sigmoid gating applied after SDPA output. | [29]
5 Optimizer Families and Training Dynamics
Optimization of highly parameterized transformer architectures presents significant challenges due to non-convex loss landscapes, saddle points, and block heterogeneity across parameter groups. This system addresses these challenges through a unified framework supporting twenty-two optimizer families spanning six algorithmic categories: (1) classical baselines (SGD, AdamW), (2) advanced momentum and variance reduction (Adan, ADOPT, AdEMAMix, MARS, Cautious), (3) memory-efficient variants (Adafactor, GaLore, Lion, APOLLO, APOLLO-Mini, Q-APOLLO), (4) schedule-free and parameter-free methods (Schedule-Free AdamW, Prodigy) plus large-batch scaling through LAMB, (5) curvature-aware and second-order (Sophia, Shampoo, SOAP), and (6) geometry-oriented (Muon, Turbo-Muon). Detailed algorithmic descriptions, pseudocode, and a memory/complexity comparison table for all optimizers are provided in Appendix A.
6 Quantization and Deployment
The deploy stack uses ternary weight packing plus INT8 activation quantization for efficient artifacts.
[Figure 5 diagram: choose the optimizer objective: reliable baseline (AdamW / RAdam), lower optimizer memory (Adafactor / GaLore / Lion), or aggressive/structured (Adan, ADOPT, Sophia, Shampoo, SOAP, Muon); schema-prefixed groups then set LR, weight decay, betas, and eps per parameter group.]
Figure 5: Optimizer selection in practice: the class determines the update rule, while the schema controls how hyperparameters are applied across embeddings, norms, recurrent blocks, attention blocks, and other parameters.
[Figure 6 diagram: trained checkpoint → weight packing (ternary / low-bit) → activation scaling (INT8 path) → deployable artifact with a smaller storage footprint.]
Figure 6: Deployment path from a trained checkpoint to a compact artifact. The codebase treats quantization as a deploy-stage transformation rather than a separate model family.
6.1 Ternary Quantization
Given a weight tensor W, a practical scaling is:

s = mean(|W|),  W̃ = clip(round(W/s), −1, 1)

which approximates BitNet-style low-bit updates [40]. The packed mapping uses two bits per weight symbol for storage efficiency.
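The scaling rule and the two-bit packing can be sketched as follows. This is an illustration of the stated scheme, not the repository's deploy code; in particular, the {−1, 0, +1} → {0, 1, 2} code assignment and the little-endian bit order within each byte are our assumptions:

```python
import numpy as np

def ternary_quantize(W, eps=1e-8):
    # s = mean(|W|); W_tilde = clip(round(W / s), -1, 1), values in {-1, 0, +1}.
    s = np.abs(W).mean() + eps
    return np.clip(np.round(W / s), -1, 1).astype(np.int8), s

def pack_ternary(symbols):
    # Map {-1, 0, +1} -> {0, 1, 2}, then pack four 2-bit codes per byte.
    codes = (symbols + 1).astype(np.uint8).ravel()
    out = bytearray()
    for i in range(0, len(codes), 4):
        b = 0
        for j, c in enumerate(codes[i:i + 4]):
            b |= int(c) << (2 * j)
        out.append(b)
    return bytes(out)
```

Dequantization multiplies the unpacked symbols back by s, so only the packed bytes and one scale per tensor need to be stored.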
6.2 Activation Quantization
For activations x:

q = round(x · 127 / (max(|x|) + ε)),  q ∈ [−128, 127]

with dequantization x ≈ q/α for the implied scale α = 127/(max(|x|) + ε).
6.3 Size Estimates
For N parameters:

FP32 size ≈ 4N bytes,  FP16 size ≈ 2N bytes,  1.58-bit size ≈ (1.58/8) · N bytes

before metadata and packing overhead. This aligns with lightweight deployment goals [32, 5].
7 SBERT Downstream Tasks
Sentence embedding is built on Siamese-style training [31]. For a sentence pair (s_1, s_2) with embeddings (e_1, e_2):

cos(e_1, e_2) = e_1^⊤ e_2 / (‖e_1‖ ‖e_2‖)
[Figure 7 diagram: a shared encoder feeds similarity (cosine score), search (top-k retrieval), and cluster/encode (offline analysis) modes.]
Figure 7: SBERT workflow reuse. A single encoder supports online pair scoring, corpus retrieval, clustering, and persistent embedding export.
Algorithm 3 SBERT Inference Mode Router
Require: mode m, model E, inputs X
1: if m = similarity then
2: return cos(E(x1), E(x2))
3: else if m = search then
4: return top-k by dot-product/cosine against corpus embeddings
5: else if m = cluster then
6: return clustering labels over E(X)
7: else
8: return serialized embeddings E(X)
9: end if
Training combines this with a regression-style cosine loss:

L_cos = (cos(e_1, e_2) − y)^2

with y ∈ [−1, 1] in this pipeline. Supported downstream modes:
• Similarity: pairwise score between two sentences.
• Search: top-k nearest neighbors over a corpus.
• Cluster: grouping embeddings (e.g., k-means).
• Encode: persistent embedding export for later retrieval.
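The similarity and search modes reduce to cosine scoring over encoder outputs, which a short sketch makes concrete (helper names are ours; corpus embeddings are assumed precomputed by the shared encoder):

```python
import numpy as np

def cosine(e1, e2, eps=1e-12):
    # cos(e1, e2) = e1 . e2 / (||e1|| ||e2||)
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2) + eps))

def search(query_emb, corpus_embs, k=3):
    # Top-k retrieval: rank precomputed corpus embeddings by cosine score.
    scores = [cosine(query_emb, c) for c in corpus_embs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```

Clustering and encoding reuse the same embeddings: cluster runs, e.g., k-means over them, and encode simply serializes them for later retrieval.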
8 Summary Tables
8.1 Attention and Sequence-Mixer Summary
Type | Core Equation | Train | Infer | Notes
Standard Attention | softmax(QK^⊤/√d_k)V | O(n^2 d) | O(n)/step | Baseline expressive global routing [38].
Sigmoid Attention | σ(QK^⊤/√d_k + b)V | O(n^2 d) | O(n)/step | Element-wise gating; often needs stabilization norm [30].
RetNet | (QK^⊤ ⊙ D)V | O(n^2 d) or chunkwise | O(1)/step | Parallel/recurrent dual form with decay retention [36].
Mamba | h_t = Ā_t h_{t−1} + B̄_t x_t | O(nd) | O(1)/step | Selective state-space with hardware-aware scan [12].
ODE-style block | dh/dt = f_θ(h, t) | solver-dependent | solver-dependent | Continuous-depth interpretation; RK integration [52].
Titans memory | M_t = (1 − α_t)M_{t−1} + S_t | approx. O(nd) | retrieval-centric | Test-time memory updates with surprise-driven dynamics [2].
8.2 Optimizer Summary
Optimizer | Family | State Cost | Key Idea | Ref
SGD+Momentum | Classical first-order | Low | Momentum-accelerated baseline with minimal state | [27]
AdamW | Adaptive first/second moment | High | Decoupled weight decay baseline | [22]
RAdam | Adaptive variance-corrected | High | Rectifies early adaptive variance | [21]
Adan | Momentum + variance reduction | High | Nesterov-style adaptive update | [42]
ADOPT | Adam variant | High | Reordered updates with improved convergence guarantees | [37]
AdEMAMix | Multi-EMA adaptive | High | Mixes short and long horizon EMAs | [26]
MARS | Variance-reduced preconditioned | High | Recursive momentum correction | [49]
Cautious AdamW | Masked momentum | High | Apply updates only on sign-consistent directions | [18]
LAMB | Layer-wise adaptive moments | High | Trust-ratio scaling for very large-batch training | [47]
Schedule-free AdamW | Scheduler-free adaptive | High | Remove explicit LR schedule dependence | [10]
Adafactor | Memory-efficient adaptive | Medium | Factorized second moments for matrix tensors | [33]
GaLore AdamW | Low-rank gradient projection | Medium | Optimize in projected low-rank gradient space | [54]
APOLLO | Low-rank adaptive projection | Medium | Structured scaling from projected Adam-style moments | [55]
APOLLO-Mini | Rank-1 adaptive projection | Low | Tensor-wise scaled APOLLO variant for extreme memory savings | [55]
Q-APOLLO | Quantized low-rank projection | Low | Quantized APOLLO state for ultra-low-memory training | [55]
Prodigy | Parameter-free adaptation | Medium | Distance-adaptive step calibration | [24]
Lion | Sign momentum | Low | Momentum sign update, reduced state | [7]
Sophia | Approx. second-order | Medium | Diagonal Hessian preconditioning with clipping | [20]
Shampoo | Matrix preconditioner | High | Kronecker-structured second-order statistics | [14]
SOAP | Shampoo + Adam basis | High | Adam-like tracking in preconditioner eigenbasis | [39]
Muon | Orthogonality-based | Medium | Orthogonalized matrix updates | [34]
Turbo-Muon | Accelerated orthogonalization | Medium | Preconditioned Newton-Schulz speedup | [4]
9 Discussion
The design choices in Frankestein Transformer reflect several engineering and research tensions in modern deep learning tooling.
9.1 Schema-Driven Design Trade-offs
The schema-first approach provides significant reproducibility benefits by enforcing explicit contracts and failing fast on invalid configurations. However, this approach also introduces rigidity: adding new architectures or optimizers requires schema extensions rather than loose command-line arguments. The prefixed hyperparameter system enables fine-grained control but increases configuration complexity for users accustomed to simpler interfaces. The decision to enforce additionalProperties: false at all schema levels eliminates silent parameter swallowing that has plagued earlier configuration systems, but this strictness requires careful schema maintenance when extending system capabilities. Each new attention mechanism or optimizer variant must be properly integrated into the validation framework, including schema field definitions with appropriate types and constraints, prefixed hyperparameter mapping for optimizer-specific groups, default values aligned with research best practices, and documentation strings for web interface rendering.
9.2 Architectural Coverage and Gaps
The seventeen implemented mixer architectures span major research directions in sequence modeling, but certain gaps remain. The system lacks recent hybrid architectures such as Griffin [35] and Jamba [13], which combine gating with state-space models. MoE (Mixture of Experts) routing is implemented for FFN layers but not for attention computation, where recent work has shown benefits [17]. The sparse attention coverage is comprehensive, but the training-free methods (FASA, SpargeAttn) raise runtime errors during training, reflecting architectural constraints: these methods require pretrained checkpoints from full-attention models or specific fine-tuning procedures that are not currently automated. The gated mechanism coverage is strong across major categories (GLA, DeltaNet, Gated DeltaNet, HGRN2, FoX, Gated Softmax).
9.3 Optimizer Landscape Fragmentation
The support for twenty-two optimizer families across six algorithmic categories demonstrates comprehensiveness but also highlights the fragmented state of optimization research. Users face significant decision complexity when choosing among variance-reduction methods (Adan, MARS), memory-efficient variants (GaLore, Adafactor, APOLLO), and curvature-aware approaches (Sophia, Shampoo). The prefixed hyperparameter system, while powerful, requires understanding of which parameters are relevant for each optimizer class.
The implementation quality varies across optimizers: classical methods (AdamW, SGD with momentum) are highly optimized in PyTorch, while newer methods (Muon, Turbo-Muon, SOAP) may require custom implementations that affect numerical stability and performance characteristics.
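The routing idea behind prefixed hyperparameters can be sketched in a few lines. The key names (`adamw_lr`, `muon_momentum`) are illustrative assumptions; the actual prefix scheme is defined by the toolkit's schema.

```python
def collect_prefixed(config: dict, prefix: str) -> dict:
    """Strip `<prefix>_` from matching keys; leave other families' keys alone.

    Each optimizer family only ever sees the hyperparameters addressed to
    it, so e.g. `muon_momentum` can never be silently consumed by the
    AdamW constructor, and vice versa.
    """
    tag = prefix + "_"
    return {k[len(tag):]: v for k, v in config.items() if k.startswith(tag)}

cfg = {"adamw_lr": 3e-4, "adamw_weight_decay": 0.01, "muon_momentum": 0.95}
adamw_kwargs = collect_prefixed(cfg, "adamw")
muon_kwargs = collect_prefixed(cfg, "muon")
```

The collected dictionaries can then be passed as keyword arguments to the corresponding optimizer constructor, which is what makes the routing fine-grained yet explicit.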
9.4 Deployment and Production Considerations
The quantization pipeline demonstrates practical deployment concerns but makes specific engineering trade-offs. Ternary weight packing reduces storage to approximately 1.58 bits per parameter, but this aggressive compression may degrade performance, especially for smaller models where quantization error is more significant. The current implementation applies quantization uniformly across parameter types.
The SBERT workflows provide practical utility for semantic similarity and retrieval tasks, but the implementation assumes standard pooling strategies (CLS token, mean pooling). Recent advances such as Matryoshka embeddings [25] and contrastive learning refinements [16] are not yet incorporated.
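The ~1.58 bits/parameter figure is log2(3). One common realization, shown here as a sketch (the toolkit's actual packing layout may differ), stores five ternary digits per byte, since 3^5 = 243 ≤ 256, i.e. 1.6 bits per weight:

```python
def pack_ternary(trits):
    """Pack ternary digits {-1, 0, +1} five-per-byte (3**5 = 243 <= 256)."""
    assert len(trits) % 5 == 0, "pad to a multiple of 5 before packing"
    out = bytearray()
    for i in range(0, len(trits), 5):
        value = 0
        for t in trits[i:i + 5]:
            value = value * 3 + (t + 1)  # map {-1, 0, 1} -> base-3 digits {0, 1, 2}
        out.append(value)
    return bytes(out)

def unpack_ternary(packed, n):
    """Invert pack_ternary, recovering the first n ternary digits."""
    trits = []
    for byte in packed:
        group = []
        for _ in range(5):
            group.append(byte % 3 - 1)   # least-significant trit first
            byte //= 3
        trits.extend(reversed(group))
    return trits[:n]

w = [-1, 0, 1, 1, 0, -1, -1, 0, 0, 1]
assert unpack_ternary(pack_ternary(w), len(w)) == w
# 8 bits / 5 trits = 1.6 bits per weight, close to log2(3) ~ 1.585
```

A per-tensor scale factor (as in BitNet-style schemes [40]) would be stored alongside the packed bytes; it is omitted here for brevity.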
9.5 Integration and Extensibility Challenges
The current codebase structure, while functional, presents maintenance challenges as architecture and optimizer families expand. The dispatcher pattern for mixer selection and optimizer routing handles extensibility but risks becoming a "kitchen sink" of conditional logic. Future versions would benefit from plugin-based architectures where new mixers and optimizers could be registered declaratively rather than by modifying core dispatch logic.
The web configuration interface provides significant usability improvements but introduces deployment complexity: running Streamlit alongside training jobs requires additional resources and infrastructure considerations that may not be appropriate for all environments, particularly HPC clusters without web access.
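The plugin direction suggested above can be sketched as a declarative registry; the class and key names are illustrative, not the current dispatcher's API.

```python
MIXER_REGISTRY = {}

def register_mixer(name):
    """Declarative registration: a new mixer plugs in via a decorator
    instead of adding another branch to a central dispatch function."""
    def decorator(cls):
        MIXER_REGISTRY[name] = cls
        return cls
    return decorator

@register_mixer("softmax")
class SoftmaxAttention:
    def __init__(self, heads=8):
        self.heads = heads

def build_mixer(name, **kwargs):
    """Construct a registered mixer, failing fast on unknown names."""
    if name not in MIXER_REGISTRY:
        raise ValueError(f"unknown mixer '{name}'; known: {sorted(MIXER_REGISTRY)}")
    return MIXER_REGISTRY[name](**kwargs)

mixer = build_mixer("softmax", heads=4)
```

The same pattern extends to optimizers and normalization methods; the schema's enum of valid names could then be generated from the registry, keeping validation and dispatch in sync.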
10 Conclusion
Frankestein Transformer presents a unified, configuration-driven experimentation platform addressing critical challenges in modern deep learning research: architectural fragmentation across dense attention, recurrent models, sparse patterns, and gated mechanisms; optimizer landscape complexity spanning classical baselines, variance-reduction methods, memory-efficient variants, schedule-free approaches, curvature-aware algorithms, and geometry-oriented methods; and end-to-end deployment workflows spanning quantization and sentence embedding applications. The system's primary contributions are:
1. Schema-First Design: A strict YAML-based configuration contract with validation and prefixed hyperparameter routing enabling reproducible experiments across seventeen mixer architectures and twenty-two optimizer families.
2. Comprehensive Architecture Support: Implementation spanning major research categories including dense baselines (standard, sigmoid attention), recurrent alternatives (RetNet, Mamba, ODE-style, Titans), sparse attention (Sparse Transformer, Longformer, BigBird, SparseK, NSA, SpargeAttn, FASA), and gated mechanisms (GLA, DeltaNet, Gated DeltaNet, HGRN2, FoX, Gated Softmax).
3. Unified Optimizer Framework: Prefixed hyperparameter groups enabling fine-grained control over embeddings, normalization layers, recurrent blocks, attention weights, and FFN parameters across classical baselines (SGD+Momentum, AdamW), variance-reduction (Adan, ADOPT, AdEMAMix, MARS, Cautious), memory-efficient (Adafactor, GaLore, Lion, APOLLO, APOLLO-Mini, Q-APOLLO), large-batch and schedule simplification (LAMB, Schedule-Free AdamW, Prodigy), curvature-aware (Sophia), second-order (Shampoo, SOAP), and geometry-oriented (Muon, Turbo-Muon) optimizers.
4. End-to-End Workflows: Integrated deployment pipeline supporting ternary weight packing and INT8 activation quantization; SBERT-inspired training and inference for semantic similarity, retrieval, and clustering tasks.
5. Interactive Configuration: Streamlit-based web interface providing schema-driven form generation, real-time validation, inline documentation, and CLI command synthesis.
This system enables rapid experimental iteration while maintaining reproducibility through strict configuration contracts. By consolidating diverse research contributions into a unified toolkit, it lowers barriers to exploring novel architectures and optimization strategies, particularly for researchers who may lack resources to implement and validate each variant independently.
10.1 Limitations and Future Directions
Several limitations and promising directions for future work emerge from this system’s design and implementation:
1. Architecture Integration: Recent hybrid architectures (Griffin, Jamba, Mamba-X) demonstrate benefits of combining multiple mechanisms into unified blocks. Future versions should integrate these architectures and explore systematic composition patterns.
2. Advanced Quantization: Current implementation uses uniform ternary packing across all parameters. Research on layer-wise, channel-wise, and importance-aware quantization suggests more sophisticated strategies could improve quality-efficiency trade-offs.
3. Plugin-Based Extensibility: The current dispatch pattern becomes increasingly complex with each new addition. A plugin architecture allowing declarative registration of new mixers, optimizers, and normalization methods would improve maintainability and reduce risk of bugs in core dispatch logic.
4. Automated Hyperparameter Optimization: The schema supports extensive hyperparameter spaces, but users must manually explore these spaces. Integration with Bayesian optimization, multi-armed bandit strategies, or gradient-based hyperparameter tuning could automate effective configuration discovery.
5. Production Deployment: The web interface improves usability but may not be appropriate for all deployment environments. Headless configuration modes, API-based configuration management, or improved CLI ergonomics could serve HPC and production workflows.
6. Evaluation Benchmarking: While the system enables training with diverse architectures, comprehensive benchmarking comparing performance across mixers and optimizers on standardized tasks would provide valuable guidance for configuration selection.
7. Training Stability Guarantees: Current implementation includes NaN/Inf guards and gradient clipping, but formal analysis of stability conditions for different mixer-optimizer combinations, particularly with looped blocks and aggressive quantization, remains open.
8. Multimodal and Task-Specific Extensions: The current design focuses on sequence modeling. Extensions for vision-language models, multimodal architectures, and task-specific fine-tuning workflows (e.g., instruction tuning, RLHF) would broaden applicability.
The research trajectory of sequence modeling continues toward hybrid approaches that combine strengths of multiple paradigms—compression from recurrence, selectivity from attention, gating for memory management, and sparsity for efficiency. A unified experimentation platform like Frankestein Transformer is increasingly valuable as this convergence accelerates, enabling researchers to systematically explore this expanding design space with reproducible, well-engineered infrastructure.
Bibliography
[1] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. URL http://arxiv.org/abs/2501.00663.
[2] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. URL https://arxiv.org/abs/2501.00663.
[3] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020. URL https://arxiv.org/abs/2004.05150.
[4] Thibaut Boissin, Thomas Massena, Franck Mamalet, and Mathieu Serrurier. Turbo-Muon: Accelerating orthogonality-based optimization with pre-conditioning. URL https://arxiv.org/abs/2512.04632.
[5] Riccardo Bravin, Massimo Pavan, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, and Manuel Roveri. EmbBERT: Attention under 2 MB memory. URL http://arxiv.org/abs/2502.10001.
[6] Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, and Zhuang Liu. Stronger normalization-free transformers. URL http://arxiv.org/abs/2512.10938.
[7] Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms. URL https://arxiv.org/abs/2302.06675.
[8] Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, Han Zhang, Huishuai Zhang, Dongyan Zhao, and Wenfeng Liang. Conditional memory via scalable lookup: A new axis of sparsity for large language models, 2026. URL https://arxiv.org/abs/2601.07372.
[9] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019. URL https://arxiv.org/abs/1904.10509.
[10] Aaron Defazio, Xingyu Alice Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, and Ashok Cutkosky. The road less scheduled. URL https://arxiv.org/abs/2405.15682.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. URL http://arxiv.org/abs/1810.04805.
[12] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. URL https://arxiv.org/abs/2312.00752.
[13] Albert Gu, AI21 Labs, et al. Jamba: A hybrid transformer-mamba language model, 2024. URL https://arxiv.org/abs/2403.19887.
[14] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. URL https://arxiv.org/abs/1802.09568.
[15] Sukjun Hwang, Aakash Lahoti, Tri Dao, and Albert Gu. Hydra: Bidirectional state space models through generalized matrix mixers. URL http://arxiv.org/abs/2407.09941.
[16] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning, 2020. URL https://arxiv.org/abs/2004.11362.
[17] Mike Lewis, Shruti Bhosale, Tim Dettmers, Douwe Kiela, and Luke Zettlemoyer. BASE layers: Simplifying training of large, sparse models, 2021. URL https://arxiv.org/abs/2103.16716.
[18] Kaizhao Liang, Lizhang Chen, Bo Liu, and Qiang Liu. Cautious optimizers: Improving training with one line of code. URL https://arxiv.org/abs/2411.16085.
[19] Zhixuan Lin, Ke Wang, et al. Forgetting transformer: Softmax attention with a forget gate, 2025. URL https://arxiv.org/abs/2503.02130.
[20] Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. URL https://arxiv.org/abs/2305.14342.
[21] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. URL https://arxiv.org/abs/1908.03265.
[22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. URL https://arxiv.org/abs/1711.05101.
[23] Tianyu Lou, Zheyu Chen, Tao Yu, et al. Efficient sparse attention for long-range transformers, 2024. URL https://arxiv.org/abs/2406.16747.
[24] Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner. URL https://arxiv.org/abs/2306.06101.
[25] Niklas Muennighoff et al. Matryoshka representation learning, 2022. URL https://arxiv.org/abs/2205.13147.
[26] Matteo Pagliardini, Pierre Ablin, and David Grangier. The AdEMAMix optimizer: Better, faster, older. URL https://arxiv.org/abs/2409.03137.
[27] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964. doi: 10.1016/0041-5553(64)90137-5.
[28] Zhen Qin, Xu Han, et al. HGRN2: Gated linear RNNs with state expansion, 2024. URL https://arxiv.org/abs/2404.07904.
[29] Yuxiang Qiu, Qwen Team, et al. Gated attention for large language models, 2025. URL https://arxiv.org/abs/2505.06708.
[30] Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, and Russ Webb. Theory, analysis, and best practices for sigmoid self-attention. URL https://arxiv.org/abs/2409.04431.
[31] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. URL http://arxiv.org/abs/1908.10084.
[32] Hema Hariharan Samson. Lightweight transformer architectures for edge devices in real-time applications. URL http://arxiv.org/abs/2601.03290.
[33] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. URL https://arxiv.org/abs/1804.04235.
[34] Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of Muon. URL https://arxiv.org/abs/2505.23737.
[35] Rafael Soares et al. Griffin: Mixing gated linear recurrences with local attention for efficient sequence modeling, 2024. URL https://arxiv.org/abs/2402.19427.
[36] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. URL https://arxiv.org/abs/2307.08621.
[37] Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, and Yutaka Matsuo. ADOPT: Modified adam can converge with any β2 with the optimal rate. URL https://arxiv.org/abs/2411.02853.
[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. URL https://arxiv.org/abs/1706.03762.
[39] Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and stabilizing shampoo using adam. URL https://arxiv.org/abs/2409.11321.
[40] Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. BitNet: Scaling 1-bit transformers for large language models. URL http://arxiv.org/abs/2310.11453.
[41] Zhe Wang, Ming Liu, et al. FASA: Frequency-aware sparse attention, 2026. URL https://arxiv.org/abs/2602.03152.
[42] Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. URL https://arxiv.org/abs/2208.06677.
[43] Haiqi Yang, Zhiyuan Li, Yi Chang, and Yuan Wu. A survey of retentive network. URL http://arxiv.org/abs/2506.06708.
[44] Songlin Yang, Bailin Wang, et al. Gated linear attention transformers with hardware-efficient training, 2023. URL https://arxiv.org/abs/2312.06635.
[45] Songlin Yang, Bailin Wang, et al. Parallelizing linear transformers with the delta rule over sequence length, 2024. URL https://arxiv.org/abs/2406.06484.
[46] Songlin Yang, Bailin Wang, et al. Gated delta networks: Improving mamba2 with delta rule, 2024. URL https://arxiv.org/abs/2412.06464.
[47] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. URL https://arxiv.org/abs/1904.00962.
[48] Han Yuan, DeepSeek-AI, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention, 2025. URL https://arxiv.org/abs/2502.11089.
[49] Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, and Quanquan Gu. MARS: Unleashing the power of variance reduction for training large models. URL https://arxiv.org/abs/2411.10438.
[50] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems, 2020. URL https://arxiv.org/abs/2007.14062.
[51] Biao Zhang and Rico Sennrich. Root mean square layer normalization. URL http://arxiv.org/abs/1910.07467.
[52] Jing Zhang, Peng Zhang, Baiwen Kong, Junqiu Wei, and Xin Jiang. Continuous self-attention models with neural ODE networks. Proceedings of the AAAI Conference on Artificial Intelligence, 35(16):14393–14401. doi: 10.1609/aaai.v35i16.17692. URL https://ojs.aaai.org/index.php/AAAI/article/view/17692.
[53] Yichi Zhang, Yizhong Wang, et al. Accurate and training-free sparse attention accelerating any model inference, 2025. URL https://arxiv.org/abs/2502.18137.
[54] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. URL https://arxiv.org/abs/2403.03507.
[55] Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, and Jinwon Lee. APOLLO: SGD-like memory, adamw-level performance, 2025. URL https://arxiv.org/abs/2412.05270.
[56] Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, and Zhuang Liu. Transformers without normalization. URL http://arxiv.org/abs/2503.10622.
[57] Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, and Xinggang Wang. Mixture-of-depths attention, 2026. URL https://arxiv.org/abs/2603.15619.
A Annex A: Optimizer Families
A.1 Memory and Complexity Comparison
Table 8 summarizes the memory overhead (number of state buffers per parameter), per-step computational complexity, and key hyperparameters for all supported optimizers. Memory overhead is expressed in terms of the number of state tensors maintained per model parameter, where each tensor has the same shape as the parameter.
Optimizer | State Buffers | Per-Step Cost | Key Hyperparams
SGD+Momentum | 1 (m) | O(n) | lr, momentum, wd
AdamW | 2 (m, v) | O(n) | lr, β1, β2, eps, wd
RAdam | 2 (m, v) | O(n) | lr, β1, β2, eps, wd
Adan | 3 (m, v, s) | O(n) | lr, β1, β2, β3, eps, wd
ADOPT | 2 (m, v) | O(n) | lr, β1, β2, eps, wd
AdEMAMix | 3 (m1, m2, v) | O(n) | lr, β1, β2, β3, eps, wd
MARS | 3 (m, v, z) | O(n) | lr, β1, β2, eps, wd, γ
Cautious AdamW | 2 (m, v) | O(n) | lr, β1, β2, eps, wd
LAMB | 2 (m, v) | O(n) | lr, β1, β2, eps, wd
Schedule-Free | 3 (z, x, n) | O(n) | lr, β1, β2, wd
Adafactor | 1–2 (row/col) | O(n) | lr, β1, β2, eps, wd, factored
GaLore | 2 (m, v) + SVD/proj | O(nr) | lr, rank r, β1, β2, eps, wd
Prodigy | 3 (m, v, d) | O(n) | β1, β2, eps, wd, d0
Lion | 1 (m) | O(n) | lr, β1, β2, wd
Shampoo | 2d (Li, Ri) | O(n^(1+2/d)) | lr, eps, wd, matrix eps
SOAP | 2d + 2 (m, v) | O(n^(1+1/d)) | lr, β1, β2, eps, wd, shampoo eps
Sophia | 3 (m, v, h) | O(n) | lr, β1, β2, eps, wd, k
Muon | 2 (m, v) | O(n · k) | lr, momentum, wd, NS steps k
Turbo-Muon | 2 (m, v) | O(n · k) | lr, momentum, wd, NS steps k
APOLLO | 2 low-rank (mR, vR) + proj | O(nr) | lr, rank r, update gap, scale, betas, eps
APOLLO-Mini | 2 rank-1 (mR, vR) + proj | O(n) | lr, update gap, scale, betas, eps, wd
Q-APOLLO | 2 quantized low-rank + proj | O(nr) | lr, rank r, update gap, scale, quant bits
Table 8: Memory and complexity comparison of all supported optimizers. n = number of parameters, d = tensor dimensionality, r = low-rank dimension, k = Newton–Schulz iteration count. State buffers lists the number of optimizer state tensors maintained per parameter group. Per-step cost focuses on the dominant additional cost beyond the gradient computation itself.
A.2 The Evolution of Optimization in Neural Networks
The optimizer survey frames transformer optimization as a response to three structural pressures: non-convex loss landscapes, severe curvature heterogeneity across parameter blocks, and the memory cost of storing optimizer state for very large models. The report argues that the field has diverged into several trajectories: adaptive first-order baselines, variance-reduction methods, memory-efficient methods, structured second-order preconditioners, schedule-free methods, and orthogonality-oriented updates.
A.3 Standard Baseline and Adaptive Optimizers
SGD with Momentum. The classical update accumulates a momentum buffer and then applies a fixed learning rate. At each iteration t, the algorithm maintains an exponential moving average mt of past gradients:
mt = βmt−1 + gt (1)
θt+1 = θt −ηmt (2)
where gt = ∇f(θt), β ∈[0.8, 0.99] controls the momentum decay, and η is the fixed learning rate. The momentum term acts as velocity: it accelerates movement in consistent directions while dampening oscillations in variable directions. Its strengths are low memory overhead (single momentum buffer per parameter) and strong generalization when tuned carefully. Its
main weakness in transformer workloads is poor robustness to heterogeneous curvature (different Hessian spectra across parameter groups) and strong dependence on learning-rate schedules.
Algorithm 4 SGD with Momentum
Require: Initial parameters θ0, learning rate η, momentum coefficient β, weight decay λ
Ensure: Updated parameters θT
1: Initialize: momentum buffer m ← 0
2: for t = 0, 1, 2, . . . , T − 1 do
3: Compute gradient: gt ← ∇f(θt)
4: if weight_decay > 0 then
5: gt ← gt + λθt ▷ L2 regularization
6: end if
7: Update momentum: m ← β · m + gt
8: Update parameters: θt+1 ← θt − η · m
9: end for
return θT
Adam and AdamW. Adam tracks exponential moving averages of both first moment (mean) and second moment (uncentered variance) to achieve element-wise adaptive learning rates. The update rule is:
mt = β1 mt−1 + (1 − β1) gt (3)
vt = β2 vt−1 + (1 − β2) gt² (4)
m̂t = mt/(1 − β1^t), v̂t = vt/(1 − β2^t) (bias correction) (5)
θt+1 = θt − η m̂t/(√v̂t + ϵ) (6)
AdamW introduces a crucial modification: weight decay is applied directly to parameters (decoupled from gradients): θt ← θt(1 − ηλ) rather than adding λθt to the gradient. This decoupling prevents the adaptive scaling from interfering with regularization strength. The report treats AdamW as the practical baseline for transformer fine-tuning because it converges quickly and is relatively forgiving to hyperparameter variation. The tradeoff is memory cost: both moment tensors must be stored for every parameter, doubling optimizer-state memory relative to SGD.
Algorithm 5 AdamW (Adam with Decoupled Weight Decay)
Require: Initial parameters θ0, learning rate η, exponential decay rates β1, β2 ∈ [0, 1)
Require: Numerical tolerance ϵ, weight decay λ
Ensure: Updated parameters θT
1: Initialize: first moment m ← 0, second moment v ← 0, step counter t ← 0
2: for t = 1, 2, . . . , T do
3: Compute gradient: gt ← ∇f(θt−1)
4: Weight decay (decoupled): θt−1 ← θt−1(1 − ηλ)
5: Update moments: m ← β1 m + (1 − β1) gt
6: v ← β2 v + (1 − β2) gt²
7: Bias correction: m̂ ← m/(1 − β1^t), v̂ ← v/(1 − β2^t)
8: Update parameters: θt ← θt−1 − η m̂/(√v̂ + ϵ)
9: end for
return θT
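Algorithm 5 condenses to a few lines of framework-free Python. This is a scalar sketch for illustration (the hyperparameter values are the common defaults, not a recommendation):

```python
import math

def adamw_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step on a scalar parameter.

    Note the order: decoupled weight decay shrinks theta directly and is
    NOT added to the gradient, so the adaptive scaling never touches it.
    """
    theta = theta * (1 - lr * wd)                 # decoupled decay
    m = b1 * m + (1 - b1) * g                     # first moment EMA
    v = b2 * v + (1 - b2) * g * g                 # second moment EMA
    m_hat = m / (1 - b1 ** t)                     # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 (gradient 2x) from x = 1.0.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    x, m, v = adamw_step(x, 2 * x, m, v, t, lr=0.05, wd=0.0)
```

After a couple hundred steps x is driven close to the minimum at 0; with wd > 0 the fixed point would instead balance the loss gradient against the decay shrinkage.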
RAdam (Rectified Adam). RAdam addresses Adam's instability in early training by dynamically rectifying the adaptive learning rate. The key observation is that the second moment vt has very high variance in the first few steps, causing unreliable adaptive scaling. RAdam computes the effective simple moving average (SMA) window length:
ρt = ρ∞ − 2t β2^t/(1 − β2^t), where ρ∞ = 2/(1 − β2) − 1
When ρt > 4 (sufficient samples for variance estimation), RAdam applies adaptive scaling with a rectification term:
rt = √( ((ρt − 4)(ρt − 2) ρ∞) / ((ρ∞ − 4)(ρ∞ − 2) ρt) ), θt+1 = θt − η rt m̂t/(√v̂t + ϵ)
When ρt ≤ 4, RAdam falls back to SGD with momentum: θt+1 = θt − η m̂t. This graceful transition eliminates the need for manual learning-rate warmup schedules and improves robustness.
Algorithm 6 RAdam (Rectified Adam)
Require: Initial parameters θ0, learning rate η, β1, β2, weight decay λ
Ensure: Updated parameters θT
1: Initialize: m ← 0, v ← 0, t ← 0
2: Compute: ρ∞ ← 2/(1 − β2) − 1
3: for t = 1, 2, . . . , T do
4: gt ← ∇f(θt−1)
5: if weight_decay > 0 then
6: θt−1 ← θt−1(1 − ηλ)
7: end if
8: m ← β1 m + (1 − β1) gt
9: v ← β2 v + (1 − β2) gt²
10: m̂ ← m/(1 − β1^t), v̂ ← v/(1 − β2^t)
11: ρt ← ρ∞ − 2t β2^t/(1 − β2^t)
12: if ρt > 4 then
13: rt ← √(((ρt − 4)(ρt − 2) ρ∞)/((ρ∞ − 4)(ρ∞ − 2) ρt))
14: θt ← θt−1 − η rt m̂/(√v̂ + ϵ)
15: else
16: θt ← θt−1 − η m̂ ▷ SGD with momentum
17: end if
18: end for
return θT
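The rectification logic reduces to a few lines. This sketch computes ρt and rt and reports which branch of Algorithm 6 is active at step t:

```python
import math

def radam_rectifier(t, b2=0.999):
    """Return (use_adaptive, r_t) for step t per the RAdam rule:
    fall back to plain momentum while rho_t <= 4, otherwise rescale by r_t."""
    rho_inf = 2.0 / (1.0 - b2) - 1.0
    rho_t = rho_inf - 2.0 * t * b2 ** t / (1.0 - b2 ** t)
    if rho_t <= 4.0:
        return False, 0.0                  # SGD-with-momentum branch
    r_t = math.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                    / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))
    return True, r_t
```

The first few steps take the non-adaptive branch (an implicit warmup), after which rt climbs toward 1 and RAdam behaves like bias-corrected Adam.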
A.4 Advanced Momentum and Variance Reduction (2024–2025)
Adan (Adaptive Nesterov Momentum). Adan reformulates Nesterov acceleration without the extra gradient computation required by classical Nesterov SGD. The algorithm maintains three momentum buffers:
mt = (1 − β1) mt−1 + β1 gt (first moment) (7)
vt = (1 − β2) vt−1 + β2 (gt − gt−1) (velocity/gradient difference) (8)
nt = (1 − β3) nt−1 + β3 [gt + (1 − β1)(gt − gt−1)]² (Nesterov second moment) (9)
The Nesterov Momentum Estimation (NME) term ḡt = gt + (1 − β1)(gt − gt−1) estimates the gradient at a future position without evaluating it. The update combines momentum with velocity:
m̄t = mt + (1 − β1) vt, θt+1 = θt − η m̄t/(√nt + ϵ)
Adan achieves fast convergence across diverse architectures (CNNs, GANs, Transformers) through this acceleration. The cost is maintaining three momentum-like buffers, increasing memory overhead relative to Adam.
Algorithm 7 Adan (Adaptive Nesterov Momentum)
Require: Initial parameters θ0, learning rate η, β1, β2, β3 ∈ (0, 1)
Ensure: Updated parameters θT
1: Initialize: m ← 0, v ← 0, n ← 0, g−1 ← 0 (previous gradient)
2: for t = 1, 2, . . . , T do
3: gt ← ∇f(θt−1)
4: m ← (1 − β1) m + β1 gt
5: ∆gt ← gt − gt−1
6: v ← (1 − β2) v + β2 ∆gt
7: Nesterov estimation: ḡt ← gt + (1 − β1) ∆gt
8: n ← (1 − β3) n + β3 ḡt²
9: Combined momentum: m̄ ← m + (1 − β1) v
10: θt ← θt−1 − η m̄/(√n + ϵ)
11: gt−1 ← gt
12: end for
return θT
ADOPT (Modified Adam with the Optimal Convergence Rate) [37]. ADOPT fixes a fundamental theoretical issue in Adam: the gradient appears in both the first moment and second moment estimates, creating circularity. ADOPT decouples the two by using the previous step's second moment for the denominator:
vt = β2 vt−1 + (1 − β2) gt² (10)
mt = β1 mt−1 + (1 − β1) gt/(√vt−1 + ϵ) (uses vt−1, not vt) (11)
θt+1 = θt − η mt (12)
This simple reordering achieves the optimal convergence rate O(1/√T) with any choice of β2 ∈ (0, 1), without bounded-noise assumptions. The practical consequence is that ADOPT is a drop-in replacement for Adam with stronger theoretical guarantees and comparable or superior empirical performance across vision, NLP, RL, and generative modeling domains.
Algorithm 8 ADOPT
Require: Initial parameters θ0, learning rate η, β1, β2, tolerance ϵ
Ensure: Updated parameters θT
1: Initialize: m ← 0, v ← 0
2: for t = 1, 2, . . . , T do
3: gt ← ∇f(θt−1)
4: if weight_decay > 0 then
5: θt−1 ← θt−1(1 − ηλ)
6: end if
7: Use previous second moment: denom ← √(max(v, ϵ)) + ϵ
8: Update first moment using previous variance: m ← β1 m + (1 − β1)(gt/denom)
9: Update second moment with current gradient: v ← β2 v + (1 − β2) gt²
10: θt ← θt−1 − η m
11: end for
return θT
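The decoupling is easiest to see in a scalar sketch: the denominator is computed from v before v sees the current gradient. The numbers below are illustrative (in practice v would be initialized from the first gradient, as in the ADOPT paper, to avoid a blow-up at step 1):

```python
import math

def adopt_step(theta, g, m, v, lr=1e-3, b1=0.9, b2=0.9999, eps=1e-6):
    """One ADOPT step on a scalar parameter: normalize g_t by the PREVIOUS
    second moment, then update the second moment afterwards."""
    denom = math.sqrt(max(v, eps)) + eps   # built from v at step t-1
    m = b1 * m + (1 - b1) * (g / denom)    # first moment of normalized gradient
    v = b2 * v + (1 - b2) * g * g          # only now does v see g_t
    theta = theta - lr * m
    return theta, m, v

# One step with a warm second moment v = g^2 = 0.25:
theta, m, v = adopt_step(1.0, 0.5, 0.0, 0.25, lr=0.1)
```

Compare with Adam, where g_t enters both m_t and the √v_t denominator of the same step, the correlation ADOPT's reordering removes.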
AdEMAMix (Exponential Moving Average Mixture). AdEMAMix replaces Adam's single EMA of gradients with a mixture of two EMAs: one fast-decaying and one slow-decaying. This addresses the observation that gradients remain informative over tens of thousands of steps, not just hundreds:
m1,t = β1 m1,t−1 + (1 − β1) gt (fast EMA, β1 ≈ 0.9) (13)
m2,t = β3 m2,t−1 + (1 − β3) gt (slow EMA, β3 ≈ 0.9999) (14)
vt = β2 vt−1 + (1 − β2) gt² (15)
mt = m1,t + αt m2,t (mixture with scheduled weight αt) (16)
θt+1 = θt − η mt/(√vt + ϵ) (17)
The mixture weight αt typically increases during training, allowing the fast EMA to provide immediate adaptation while the slow EMA accumulates long-range gradient correlations. Empirically, a 1.3B model on 101B tokens achieves similar loss to AdamW on 197B tokens (95% data efficiency gain), suggesting the slow EMA significantly reduces model forgetting.
Algorithm 9 AdEMAMix (Exponential Moving Average Mixture)
Require: Initial parameters θ0, learning rate η, β1 (fast), β3 (slow), β2 (second moment)
Require: Mixture weight αt (typically increasing), tolerance ϵ
Ensure: Updated parameters θT
1: Initialize: m1 ← 0 (fast EMA), m2 ← 0 (slow EMA), v ← 0
2: for t = 1, 2, . . . , T do
3: gt ← ∇f(θt−1)
4: if weight_decay > 0 then
5: θt−1 ← θt−1(1 − ηλ)
6: end if
7: m1 ← β1 m1 + (1 − β1) gt ▷ Fast EMA
8: m2 ← β3 m2 + (1 − β3) gt ▷ Slow EMA
9: v ← β2 v + (1 − β2) gt²
10: mt ← m1 + αt m2 ▷ Mixture
11: θt ← θt−1 − η mt/(√v + ϵ)
12: end for
return θT
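A scalar sketch of Algorithm 9, with a constant mixture weight α for brevity (the paper schedules αt upward during training):

```python
import math

def ademamix_step(theta, g, m1, m2, v, lr=1e-3, b1=0.9, b2=0.999,
                  b3=0.9999, alpha=5.0, eps=1e-8):
    """One AdEMAMix step: mix a fast EMA (immediate adaptation) with a
    slow EMA (long-range gradient memory) before the Adam-style division."""
    m1 = b1 * m1 + (1 - b1) * g          # fast EMA, reacts within ~10 steps
    m2 = b3 * m2 + (1 - b3) * g          # slow EMA, remembers ~10k steps
    v = b2 * v + (1 - b2) * g * g
    mixed = m1 + alpha * m2
    theta = theta - lr * mixed / (math.sqrt(v) + eps)
    return theta, m1, m2, v

# First step from cold buffers with gradient g = 1:
theta, m1, m2, v = ademamix_step(0.0, 1.0, 0.0, 0.0, 0.0)
```

Note how after one step the slow EMA holds only (1 − β3) of the gradient; its influence builds up over thousands of steps, which is exactly the long-range memory the mixture is meant to add.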
MARS (Make vAriance Reduction Shine). MARS combines preconditioned gradient methods (e.g., AdamW) with variance reduction via scaled stochastic recursive momentum. The core innovation is a variance-reduced gradient estimate:
ct = gt + γ (ct−1 − gt−1) (SVRG-style recursive estimate) (18)
mt = β1 mt−1 + (1 − β1) ct (19)
vt = β2 vt−1 + (1 − β2) ct² (20)
θt+1 = θt − η mt/(√vt + ϵ) (21)
where γ ∈ [0.01, 0.1] controls how much historical gradient information is retained. The variance-reduced gradient ct acts as an implicit noise filter, amplifying consistent signal directions and dampening contradictory noise. MARS-AdamW consistently outperforms bare AdamW by significant margins on GPT-2 pretraining, suggesting that variance reduction substantially improves convergence in mini-batch stochastic training.
Algorithm 10 MARS
Require: Initial parameters θ0, learning rate η, β1, β2, weight decay λ
Require: Variance reduction coefficient γ ∈ [0.01, 0.1]
Ensure: Updated parameters θT
1: Initialize: m ←0, v ←0, c ←0 (variance-reduced gradient), g−1 ←0
2: for t = 1, 2, . . . , T do
3: gt ←∇f(θt−1)
4: Variance reduction: c ←gt + γ(c −gt−1)
5: m ←β1m + (1 −β1)c
6: v ←β2v + (1 −β2)c2
7: if weight_decay > 0 then
8: θt−1 ←θt−1(1 −ηλ)
9: end if
10: θt ←θt−1 −η(m/(√v + ϵ))
11: gt−1 ← gt
12: end for
return θT
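The recursive estimate of Eq. (18) is worth isolating: with a noise-free, constant gradient stream the correction term vanishes and ct settles exactly on gt, so MARS only differs from AdamW when gradients fluctuate.

```python
def mars_estimate(g, g_prev, c_prev, gamma=0.05):
    """SVRG-style recursive estimate c_t = g_t + gamma * (c_{t-1} - g_{t-1});
    c_t then replaces g_t in the usual Adam-style moment updates."""
    return g + gamma * (c_prev - g_prev)

# Constant gradients: the estimate is a fixed point at g.
c, g_prev = 0.0, 0.0
for _ in range(10):
    c = mars_estimate(1.0, g_prev, c)
    g_prev = 1.0
```

With noisy gradients, the γ-weighted difference carries over the part of the previous estimate not explained by the previous gradient, which is the variance-reduction effect.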
Cautious Optimizers (Cautious AdamW, Cautious Lion). The Cautious framework applies a one-line modification to any momentum-based optimizer: mask the update so that only dimensions where momentum and gradient directions agree are applied:
mask_i = 1 if mt[i] · gt[i] > 0 (agreement), 0 otherwise (22)
ut = (mt/(√vt + ϵ)) ⊙ mask (element-wise masking) (23)
θt+1 = θt − η ut (24)
The intuition: momentum mt estimates the gradient direction from history; the current gradient gt is instantaneous signal. When both agree, the optimizer is confident and should update aggressively. When they disagree, the optimizer is conflicted—the historical trend points toward a region that may have been good before, but current evidence contradicts it. By masking conflicted dimensions, Cautious becomes more conservative and avoids corrupted steps. Empirically, this simple masking achieves up to 1.47× speedup on Llama and MAE pretraining while preserving convergence guarantees.
Algorithm 11 Cautious AdamW (Consensus-based Update Masking)
Require: Initial parameters θ₀, learning rate η, β₁, β₂, weight decay λ
Ensure: Updated parameters θ_T
1: Initialize: m ← 0, v ← 0
2: for t = 1, 2, …, T do
3:   g_t ← ∇f(θ_{t−1})
4:   m ← β₁m + (1 − β₁)g_t
5:   v ← β₂v + (1 − β₂)g_t²
6:   Consensus mask: mask_i ← 1 if m[i] · g_t[i] > 0, else 0   ▷ Agreement
7:   Base Adam update: u ← m/(√v + ε)
8:   Apply mask: u_masked ← u ⊙ mask   ▷ Element-wise
9:   if weight_decay > 0 then
10:    θ_{t−1} ← θ_{t−1}(1 − ηλ)
11:  end if
12:  θ_t ← θ_{t−1} − η u_masked
13: end for
14: return θ_T
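The consensus mask is a genuinely one-line change on top of Adam, as this pure-Python sketch shows (names are illustrative, not the library's API):

```python
import math

def cautious_adam_step(theta, grad, state, lr=0.05, beta1=0.9,
                       beta2=0.999, eps=1e-8):
    """One Cautious-Adam update: zero the step where momentum and gradient disagree."""
    m, v = state["m"], state["v"]
    out = []
    for i, (p, g) in enumerate(zip(theta, grad)):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        mask = 1.0 if m[i] * g > 0 else 0.0        # consensus mask
        u = m[i] / (math.sqrt(v[i]) + eps) * mask  # masked Adam step
        out.append(p - lr * u)
    return out
```

When a small gradient contradicts the accumulated momentum, the masked dimension is left untouched; a large contradictory gradient flips the momentum sign and passes the mask, so the optimizer is conservative only while the conflict persists.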
A.5 Large-Batch, Memory-Efficient, and Parameter-Free Optimizers
LAMB (Layer-wise Adaptive Moments optimizer for Batch training). LAMB extends Adam with layer-wise adaptive rate scaling (inspired by LARS), enabling stable training with extreme batch sizes (e.g., 64K on BERT):
m_t^L = β₁ m_{t−1}^L + (1 − β₁) g_t^L              (layer-wise first moment)           (25)
v_t^L = β₂ v_{t−1}^L + (1 − β₂) (g_t^L)²                                               (26)
u_adam^L = m_t^L / (√(v_t^L) + ε) + λ θ_{t−1}^L    (base adaptive step with decay)     (27)
φ^L = ‖θ_{t−1}^L‖₂ / ‖u_adam^L‖₂                   (trust ratio: layer normalization)  (28)
θ_t^L = θ_{t−1}^L − η · φ^L · u_adam^L                                                 (29)
The trust ratio ϕL = ∥θL∥2/∥uL adam∥2 normalizes the effective update step relative to weight magnitude. In very large batches, stochastic gradient noise can exceed signal, destabilizing layer-wise learning rates. By scaling updates proportionally to weight norms, LAMB ensures relative changes rather than absolute ones, preventing small confident updates in large vectors from dominating. LAMB preserves small-batch generalization benefits while enabling efficient large-batch training.
Algorithm 12 LAMB (Layer-wise Adaptive Moments optimizer for Batch training)
Require: Initial parameters θ₀, learning rate η, β₁, β₂, weight decay λ
Ensure: Updated parameters θ_T
1: Initialize: for each layer L: m^L ← 0, v^L ← 0
2: for t = 1, 2, …, T do
3:   for each layer L do
4:     g_t^L ← ∇f(θ_t^L)
5:     m^L ← β₁m^L + (1 − β₁)g_t^L
6:     v^L ← β₂v^L + (1 − β₂)(g_t^L)²
7:     Base Adam: u_adam^L ← m^L/(√(v^L) + ε)
8:     if weight_decay > 0 then
9:       u_adam^L ← u_adam^L + λθ_{t−1}^L
10:    end if
11:    Trust ratio: φ^L ← ‖θ_{t−1}^L‖₂/‖u_adam^L‖₂ if both nonzero, else 1
12:    θ_t^L ← θ_{t−1}^L − ηφ^Lu_adam^L
13:  end for
14: end for
15: return θ_T
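The trust ratio has a simple invariant: the norm of the applied update equals η · ‖θ^L‖, independent of the raw Adam step's scale. A minimal sketch for one layer stored as a flat list (names are illustrative):

```python
import math

def lamb_layer_update(theta, grad, state, lr=0.01, beta1=0.9, beta2=0.999,
                      weight_decay=0.01, eps=1e-8):
    """One LAMB step for a single layer; returns (new params, trust ratio)."""
    m, v = state["m"], state["v"]
    u = []
    for i, g in enumerate(grad):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        u.append(m[i] / (math.sqrt(v[i]) + eps) + weight_decay * theta[i])
    w_norm = math.sqrt(sum(p * p for p in theta))
    u_norm = math.sqrt(sum(x * x for x in u))
    phi = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0  # trust ratio
    return [p - lr * phi * x for p, x in zip(theta, u)], phi
```

Because φ = ‖θ‖/‖u‖, the update η·φ·u always has norm η·‖θ‖, which is exactly the "relative rather than absolute change" property described above.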
Schedule-Free AdamW. Schedule-free methods remove explicit scheduler design from the optimization recipe. The algorithm maintains two parameter streams: zt (exploration) and xt (smooth average):
v_t = β₂ v_{t−1} + (1 − β₂) g_t²               (variance)                                    (30)
z_t = z_{t−1}(1 − ηλ) − η g_t / (√v_t + ε)     (adaptive step with decay)                    (31)
c_t = 1/(t + 1)                                (averaging coefficient, decreases over time)  (32)
x_t = (1 − c_t) x_{t−1} + c_t z_t              (iterate averaging)                           (33)
y_t = (1 − β) z_t + β x_t                      (interpolation for evaluation point)          (34)
The averaging coefficient ct = 1/(t + 1) implements an implicit learning-rate schedule without explicitly specifying total steps T. This framework unifies scheduling and iterate averaging: the algorithm explores via zt while accumulating stable direction via xt. The evaluation point yt (used for gradient computation) interpolates between exploration and stability. Schedule-Free achieves state-of-the-art convergence across convex optimization, large-scale deep learning, and reinforcement learning, while removing a major hyperparameter.
Algorithm 13 Schedule-Free AdamW
Require: Initial parameters θ₀, learning rate η (fixed), β₁, β₂, weight decay λ
Ensure: Updated parameters θ_T or average x_T
1: Initialize: z ← θ₀ (exploration), x ← θ₀ (average), v ← 0, t ← 0
2: for t = 1, 2, …, T do
3:   g_t ← ∇f(y_{t−1})   ▷ Gradient at interpolation point
4:   v ← β₂v + (1 − β₂)g_t²   ▷ Variance
5:   Weight decay on z: z ← z(1 − ηλ)
6:   z ← z − ηg_t/(√v + ε)   ▷ Adaptive step on raw iterate
7:   c_t ← 1/(t + 1)   ▷ Averaging coefficient
8:   x ← (1 − c_t)x + c_t z   ▷ Iterate averaging
9:   y_t ← (1 − β₁)z + β₁x   ▷ Interpolation for next evaluation
10: end for
11: return x_T or y_T
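The two-stream structure fits in a few lines; the sketch below runs the whole loop on a single scalar parameter (function name and signature are illustrative):

```python
import math

def schedule_free_adamw(grad_fn, theta0, steps=200, lr=0.1, beta1=0.9,
                        beta2=0.999, weight_decay=0.0, eps=1e-8):
    """Schedule-Free AdamW on one scalar parameter; returns the averaged iterate x."""
    z = x = y = theta0                         # exploration, average, eval point
    v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(y)                         # gradient at interpolation point
        v = beta2 * v + (1 - beta2) * g * g    # second moment
        z = z * (1 - lr * weight_decay)        # decoupled weight decay on z
        z = z - lr * g / (math.sqrt(v) + eps)  # adaptive step
        c = 1.0 / (t + 1)                      # averaging coefficient
        x = (1 - c) * x + c * z                # iterate averaging
        y = (1 - beta1) * z + beta1 * x        # evaluation point for next step
    return x
```

Note that x is exactly the running mean of {θ₀, z₁, …, z_T}, which is the implicit 1/(t+1) schedule described above.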
Adafactor. Adafactor reduces optimizer memory by factorizing second-moment statistics for matrix-shaped parameters, storing only row and column accumulators rather than dense variance states. For a gradient matrix Gt ∈Rm×n:
R_t = β₂ R_{t−1} + (1 − β₂)(G_t²)1_n           (row variance)                              (35)
C_t = β₂ C_{t−1} + (1 − β₂)1_m^⊤(G_t²)         (column variance)                           (36)
V̂_t = R_t C_t / (1_m^⊤ R_t)                    (reconstructed variance via outer product)  (37)
U_t = G_t / √(V̂_t + ε)                         (normalized adaptive step)                  (38)
Û_t = U_t / max(1, RMS(U_t))                   (stability clipping)                        (39)
Instead of storing m × n second-moment values, Adafactor stores only m + n accumulators, reducing memory from O(n_params) to O(√n_params). This is most attractive when VRAM is dominated by optimizer state rather than activations. The tradeoff is reduced optimization expressiveness and potential instability on some tasks.
Algorithm 14 Adafactor (Factorized Second-Moment Statistics)
Require: Gradient matrix G_t ∈ R^{m×n}, learning rate η, β₂ ≈ 0.999
Ensure: Updated parameters
1: Initialize: row accumulator R ← ε1_m, column accumulator C ← ε1_n
2: for t = 1, 2, …, T do
3:   G_t ← ∇f(θ_t)   ▷ Matrix gradient
4:   R ← β₂R + (1 − β₂)(G_t²)1_n   ▷ Row sums of squared gradient
5:   C ← β₂C + (1 − β₂)1_m^⊤(G_t²)   ▷ Column sums of squared gradient
6:   Reconstructed variance: V̂ ← (R C)/(1_m^⊤ R)   ▷ Rank-1 outer product
7:   Normalized adaptive step: U_t ← G_t/√(V̂ + ε)
8:   RMS clipping: Û_t ← U_t/max(1, RMS(U_t))
9:   θ_{t+1} ← θ_t − η Û_t
10: end for
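The factored reconstruction is exact whenever the squared gradient happens to be rank-1; a minimal NumPy sketch (names illustrative) makes both the memory claim and that special case easy to check:

```python
import numpy as np

def adafactor_update(G, R, C, beta2=0.999, lr=0.01, eps=1e-30, clip=1.0):
    """One factored second-moment update for a matrix gradient G (sketch).

    R and C are the row/column accumulators, mutated in place.
    Returns the parameter delta -lr * U_hat.
    """
    R[:] = beta2 * R + (1 - beta2) * (G * G).sum(axis=1)  # row accumulator, (m,)
    C[:] = beta2 * C + (1 - beta2) * (G * G).sum(axis=0)  # column accumulator, (n,)
    V_hat = np.outer(R, C) / R.sum()                      # rank-1 variance estimate
    U = G / (np.sqrt(V_hat) + eps)
    rms = np.sqrt((U * U).mean())
    U = U / max(1.0, rms / clip)                          # RMS stability clipping
    return -lr * U
```

Only m + n accumulator entries persist between steps; the dense m × n variance estimate V_hat is rebuilt on the fly each step.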
GaLore (Gradient Low-Rank Projection). GaLore projects 2D gradients into a low-rank subspace before optimization, reducing optimizer-state memory while maintaining adaptive step scaling. For a 2D parameter matrix W ∈Rm×n:
U, S, V = SVD(G_t)                                        (compute singular decomposition)   (40)
P = U[:, :r] ∈ R^{m×r} (left) or P = V[:, :r] ∈ R^{n×r}   (select top-r singular vectors)    (41)
G_low = P^⊤ G_t                                           (project to low-rank space)        (42)
Δ_low = Adam(G_low)                                       (optimize in compressed space)     (43)
Δ = P Δ_low                                               (reconstruct in original space)    (44)
By projecting into a rank-r subspace (typically r ≪min(m, n)), optimizer state is reduced from O(mn) to O(r(m + n)) for 2D parameters. This complementary memory-saving approach is especially relevant when the model is too large for full-rank optimizer state. GaLore works synergistically with other techniques and has shown strong empirical results on billion-parameter models.
Algorithm 15 GaLore (Gradient Low-Rank Projection)
Require: 2D parameter matrix W ∈ R^{m×n}, learning rate η, rank r ≪ min(m, n), SVD refresh interval K
Ensure: Updated parameters
1: Initialize: Adam state in the low-rank space
2: for t = 1, 2, …, T do
3:   G_t ← ∇f(W_t)
4:   if t mod K = 1 then   ▷ Periodic SVD
5:     U_t, S_t, V_t^⊤ ← SVD(G_t)   (full or thin)
6:     Select projection: P ← U_t[:, :r] or P ← V_t[:, :r]
7:   end if
8:   Project gradient: G_low ← P^⊤G_t   (or G_tP for right projection)
9:   Run Adam in the low-rank space: Δ_low ← Adam(G_low)
10:  Reconstruct in original space: Δ ← PΔ_low   (or Δ_lowP^⊤)
11:  W_t ← W_t − ηΔ
12: end for
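The projection/reconstruction pair is short enough to sketch directly; this NumPy fragment (helper names are illustrative) also shows that a gradient of true rank ≤ r survives the round trip exactly:

```python
import numpy as np

def galore_project(G, r):
    """Rank-r left projector from the gradient SVD, plus the compressed gradient."""
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    P = U[:, :r]                 # top-r left singular vectors, shape (m, r)
    G_low = P.T @ G              # compressed gradient, shape (r, n)
    return P, G_low

def galore_reconstruct(P, delta_low):
    """Map the low-rank optimizer step back to the original parameter space."""
    return P @ delta_low
```

Optimizer state (moments of G_low) lives in r × n instead of m × n, which is the O(r(m + n)) figure quoted above once both projection layouts are counted.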
APOLLO, APOLLO-Mini, and Q-APOLLO. The APOLLO family [55] begins from the observation that AdamW's element-wise denominator can be coarsened into a structured learning-rate update. Rather than storing dense moments for every parameter entry, APOLLO projects a matrix gradient G_t ∈ R^{m×n} into a compact random subspace and tracks Adam-style moments there:
R_t = P_t G_t  or  R_t = G_t P_t^⊤                  (45)
M_t^R = β₁ M_{t−1}^R + (1 − β₁) R_t                 (46)
V_t^R = β₂ V_{t−1}^R + (1 − β₂) R_t²                (47)
R̃_t = M_t^R / (√(V_t^R) + ε)                        (48)
The projected state is not expanded back into a dense low-rank update as in SVD-based methods. Instead, APOLLO estimates a structured scaling tensor St in the original space. In the standard APOLLO variant, that scaling is channel-wise: each row or channel receives its own norm ratio. In APOLLO-Mini, the scaling is reduced to a single tensor-wise scalar, corresponding to the rank-1 extreme described in the paper. The resulting parameter update is therefore Adam-like in adaptation but much closer to SGD in state cost:
W_t ← (1 − ηλ) W_{t−1} − η α (G_t ⊙ S_t)
where α is the extra scale factor used to stabilize highly compressed variants. The paper’s practical contribution is twofold. First, APOLLO replaces expensive repeated SVD with periodically refreshed Gaussian random projection, so the systems burden is ordinary matrix multiplication rather than spectral decomposition. Second, the optimizer is unusually tolerant to extreme compression: even APOLLO-Mini, which keeps only rank-1 auxiliary state, remains competitive with or better than AdamW in the reported pre-training experiments while approaching SGD-level memory cost.
Algorithm 16 APOLLO / APOLLO-Mini (Structured Gradient Scaling)
Require: Weight matrix W ∈ R^{m×n} with m ≤ n, learning rate η, scale factor α, decay rates (β₁, β₂), weight decay λ, rank r, projection refresh interval T
Ensure: Updated parameters W_T
1: Initialize projected moments: M^R ← 0, V^R ← 0, step t ← 0
2: repeat
3:   Compute gradient: G_t ← ∇_W φ(W_t)
4:   if t mod T = 0 then
5:     Sample Gaussian projector P_t ∼ N(0, 1/r) with a fresh seed
6:   end if
7:   Project gradient: R_t ← P_tG_t   ▷ or G_tP_t^⊤ depending on layout
8:   Update projected AdamW moments: M_t^R, V_t^R ← AdamWState(R_t; β₁, β₂)
9:   Normalize projected state: R̃_t ← M_t^R/(√(V_t^R) + ε)
10:  if APOLLO then
11:    S_t ← diag(s_1^R, …, s_m^R), where s_i^R = ‖R̃_t[i, :]‖₂/‖R_t[i, :]‖₂
12:  else   ▷ APOLLO-Mini
13:    S_t ← s_t^R, where s_t^R = ‖R̃_t‖₂/‖R_t‖₂
14:  end if
15:  Update weights: W_t ← (1 − ηλ)W_{t−1} − η α (G_t ⊙ S_t)
16:  t ← t + 1
17: until convergence
APOLLO-Mini. APOLLO-Mini is the family member optimized for extreme memory efficiency. The paper motivates it by arguing that, in a rank-1 compact space, channel-wise scaling becomes too noisy, so the update is coarsened further to a single tensor-wise scale. The loss of granularity is partially offset by the explicit scaling factor α (the paper discusses values such as 128 in this regime), giving a useful point on the optimization Pareto frontier: very small state, no SVD overhead, and still strong pre-training behavior.
Q-APOLLO. The APOLLO paper also stresses that the family combines naturally with quantization for ultra-low-memory training. This repository turns that systems observation into a concrete optimizer variant, q_apollo. In the implementation, APOLLO’s low-rank first and second moments are stored in quantized form together with per-tensor scales and offsets, and are dequantized only when needed for the next step. Q-APOLLO therefore preserves APOLLO’s projected-gradient logic while reducing the precision of the remaining optimizer state, making it the most aggressive memory-saving member of the local optimizer stack.
Prodigy (Approximating the Distance Estimate). Prodigy adapts the effective step scale through a running distance-like statistic, eliminating the need for explicit learning-rate tuning. The algorithm maintains a cumulative distance estimate:
u_t = g_t / (√v_t + ε)                            (unnormalized adaptive step)          (49)
s_t = s_{t−1} + ⟨u_t, θ_t − θ₀⟩                   (cumulative signed distance)          (50)
d_t = max(d_{t−1}, d₀ + d_coef · s_t)             (distance estimate with lower bound)  (51)
θ_{t+1} = θ_t − η · d_t · u_t                     (52)
where d0 is an initial bound and dcoef controls how aggressively the estimate adapts. The distance estimate dt captures the combined magnitude of past gradients weighted by actual parameter displacement, implementing an implicit effective learning rate. This distance-aware scaling achieves robust convergence across problem scales and batch sizes without requiring a learning-rate schedule. Prodigy reduces hyperparameter sensitivity by estimating step sizes from optimization geometry rather than problem-specific prior knowledge.
Algorithm 17 Prodigy (Approximating the Distance Estimate)
Require: Initial parameters θ₀, learning rate η, coefficient d_coef, initial distance d₀ > 0
Ensure: Updated parameters θ_T
1: Initialize: s ← 0 (cumulative signed distance), d ← d₀
2: for t = 1, 2, …, T do
3:   g_t ← ∇f(θ_{t−1})
4:   v_t ← β₂v_{t−1} + (1 − β₂)g_t²   ▷ Second moment
5:   u_t ← g_t/(√v_t + ε)   ▷ Unnormalized adaptive step
6:   Update distance: s ← s + ⟨u_t, θ_t − θ₀⟩   ▷ Signed distance
7:   d ← max(d, d₀ + d_coef · s)   ▷ Lower-bounded distance
8:   θ_t ← θ_{t−1} − η · d · u_t   ▷ Distance-scaled update
9: end for
10: return θ_T
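A scalar sketch shows the mechanism: d grows from a tiny d₀ while the iterate travels away from θ₀, then freezes once the optimum is bracketed. Note one assumption in this sketch: the correlation is accumulated as ⟨u_t, θ₀ − θ_t⟩, the sign convention of the Prodigy paper's numerator, so that the sum is positive during the initial approach. All names are illustrative:

```python
import math

def prodigy_run(grad_fn, theta0, steps=300, lr=1.0, beta2=0.999,
                d0=1e-6, d_coef=1.0, eps=1e-8):
    """Prodigy on a single scalar parameter; returns (theta, distance estimate)."""
    theta = theta0
    v = s = 0.0
    d = d0
    for _ in range(steps):
        g = grad_fn(theta)
        v = beta2 * v + (1 - beta2) * g * g
        u = g / (math.sqrt(v) + eps)           # unnormalized adaptive step
        s = s + u * (theta0 - theta)           # correlation with displacement from theta0
        d = max(d, d0 + d_coef * s)            # lower-bounded distance estimate
        theta = theta - lr * d * u
    return theta, d
```

The estimate stops growing as soon as steps start pointing back toward θ₀, which is why d roughly tracks the distance to the optimum rather than growing without bound.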
A.6 Second-Order, Geometric, and Orthogonality Optimizers
Algorithm 18 Shampoo (Matrix Preconditioning via Kronecker Products)
Require: Initial parameters θ₀, learning rate η, eigendecomposition interval K
Ensure: Updated parameters θ_T
1: Initialize: L ← εI_m, R ← εI_n (left and right Gram matrices)
2: for t = 1, 2, …, T do
3:   G ← ∇f(θ_{t−1})   ▷ Gradient
4:   Update Gram matrices: L ← L + GG^⊤, R ← R + G^⊤G
5:   if t mod K = 0 then
6:     Q_L, Λ_L ← eigh(L)   ▷ Eigendecomposition of L
7:     Q_R, Λ_R ← eigh(R)   ▷ Eigendecomposition of R
8:     L^{−1/4} ← Q_L(Λ_L + ε)^{−1/4}Q_L^⊤
9:     R^{−1/4} ← Q_R(Λ_R + ε)^{−1/4}Q_R^⊤
10:  end if
11:  Δθ ← L^{−1/4}GR^{−1/4}   ▷ Preconditioned gradient
12:  θ_t ← θ_{t−1} − η · Δθ
13: end for
14: return θ_T
Shampoo (Matrix Preconditioning via Kronecker Products). Shampoo is a structured second-order method that computes matrix preconditioners from Kronecker-structured outer-product statistics. For a 2D parameter matrix W ∈ R^{m×n}:
L_t = L_{t−1} + G_tG_t^⊤            (left/row Gram matrix, m × m)        (53)
R_t = R_{t−1} + G_t^⊤G_t            (right/column Gram matrix, n × n)    (54)
L_t^{−1/4} = Q_L(Λ_L)^{−1/4}Q_L^⊤   (via eigendecomposition)             (55)
R_t^{−1/4} = Q_R(Λ_R)^{−1/4}Q_R^⊤                                        (56)
ΔW_t = L_t^{−1/4}G_tR_t^{−1/4}      (preconditioned gradient)            (57)
Shampoo approximates the full-matrix AdaGrad preconditioner H^{−1/2} (where H accumulates gradient outer products) using Kronecker-factored structure. Instead of storing an (mn) × (mn) preconditioner, Shampoo maintains two smaller matrices (m × m and n × n), capturing cross-parameter correlations within and across groups. The method has proven effective at scale (Google production systems) and is well-suited to transformer architectures with high-rank structure. Eigendecompositions are performed periodically (e.g., every K = 10 steps) to amortize cost. Tradeoff: higher per-step compute and periodic O(m³ + n³) eigendecomposition versus improved conditioning and accelerated convergence.
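The per-step preconditioning can be sketched compactly in NumPy (eigendecomposition every step here for brevity; a real implementation would amortize it over K steps, and these names are illustrative):

```python
import numpy as np

def shampoo_precondition(G, L, R, eps=1e-6):
    """Accumulate Gram matrices in place and return L^{-1/4} G R^{-1/4}."""
    L += G @ G.T                       # left statistics, m x m
    R += G.T @ G                       # right statistics, n x n
    def inv_quarter(M):
        lam, Q = np.linalg.eigh(M)     # symmetric eigendecomposition
        return Q @ np.diag((lam + eps) ** -0.25) @ Q.T
    return inv_quarter(L) @ G @ inv_quarter(R)
```

For an isotropic gradient the preconditioner reduces to (approximately) the identity, and for a rectangular G the two factors stay m × m and n × n rather than (mn) × (mn).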
SOAP (Shampoo with Adam in eigenbasis). SOAP is a simplified variant of Shampoo that decouples preconditioning from momentum tracking. Instead of complex matrix algebra on preconditioned gradients, SOAP runs standard Adam in the eigenbasis of Shampoo's preconditioners:
L_t = L_{t−1} + G_tG_t^⊤,  R_t = R_{t−1} + G_t^⊤G_t        (Gram accumulation)                (58)
Q_L, Λ_L = eigh(L_t),  Q_R, Λ_R = eigh(R_t)                 (periodic eigendecomposition)      (59)
G_t^rot = Q_L^⊤ G_t Q_R                                     (rotate gradient into eigenbasis)  (60)
m_t^rot = β₁ m_{t−1}^rot + (1 − β₁) G_t^rot                                                    (61)
v_t^rot = β₂ v_{t−1}^rot + (1 − β₂) (G_t^rot)²              (Adam in rotated space)            (62)
U_t^rot = m_t^rot / (√(v_t^rot) + ε),  U_t = Q_L U_t^rot Q_R^⊤   (rotate back)                 (63)
The key insight: all momentum tracking occurs in the well-conditioned eigenbasis, simplifying numerical stability and theoretical analysis. SOAP combines benefits of Shampoo (explicit curvature structure) and Adam (proven momentum mechanics), while being cleaner and often more stable than full Shampoo. Eigendecompositions are recomputed every K steps, amortizing the cost.
Algorithm 19 SOAP (Shampoo with Adam in Eigenbasis)
Require: Initial parameters θ₀, learning rate η, β₁, β₂, eigendecomposition interval K
Ensure: Updated parameters θ_T
1: Initialize: L ← εI_m, R ← εI_n, m ← 0, v ← 0
2: for t = 1, 2, …, T do
3:   G ← ∇f(θ_{t−1})   ▷ Gradient
4:   Update Gram: L ← L + GG^⊤, R ← R + G^⊤G
5:   if t mod K = 0 then
6:     Q_L, Λ_L ← eigh(L),  Q_R, Λ_R ← eigh(R)
7:   end if
8:   G^rot ← Q_L^⊤GQ_R   ▷ Rotate gradient into eigenbasis
9:   m ← β₁m + (1 − β₁)G^rot   ▷ Exponential moving average (bias-correct outside)
10:  v ← β₂v + (1 − β₂)(G^rot)²   ▷ Second moment (bias-correct outside)
11:  u ← m/(√v + ε)   ▷ Adaptive step in eigenbasis
12:  Δθ ← Q_LuQ_R^⊤   ▷ Rotate back to parameter space
13:  θ_t ← θ_{t−1} − η · Δθ
14: end for
15: return θ_T
Lion (EvoLved Sign Momentum). Lion achieves minimal memory overhead by using sign-based updates instead of full adaptive scaling:
c_t = β₁ m_t + (1 − β₁) g_t                (momentum input)         (64)
θ_{t+1} = θ_t − η (sign(c_t) + λθ_t)       (sign operator update)   (65)
m_{t+1} = β₂ m_t + (1 − β₂) g_t            (momentum accumulation)  (66)
All element-wise scaling operations are replaced with sign(), which outputs {−1, 0, +1}. This dramatically reduces memory compared to Adam-like methods: Lion stores only the momentum buffer, no second moment variance. The tradeoff: sign-based updates sacrifice the element-wise learning-rate adaptation that makes Adam effective. Lion is positioned not as a universally superior optimizer but as a specialized low-memory, high-throughput alternative for scenarios where activation memory dominates and some optimizer sophistication can be sacrificed. Works well in large-batch regimes and when hardware throughput is the primary constraint.
Algorithm 20 Lion (EvoLved Sign Momentum)
Require: Initial parameters θ₀, learning rate η, β₁, β₂, weight decay λ
Ensure: Updated parameters θ_T
1: Initialize: m ← 0 (momentum buffer)
2: for t = 1, 2, …, T do
3:   g ← ∇f(θ_{t−1})   ▷ Gradient
4:   c ← β₁m + (1 − β₁)g   ▷ Momentum input
5:   θ_t ← θ_{t−1} − η(sign(c) + λθ_{t−1})   ▷ Sign-based update with weight decay
6:   m ← β₂m + (1 − β₂)g   ▷ Momentum accumulation (decoupled)
7: end for
8: return θ_T
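A defining property of the sign update is that the step magnitude is exactly η per coordinate, regardless of gradient scale, and that only one state buffer exists. A small pure-Python sketch (names illustrative):

```python
def lion_step(theta, grad, m, lr=0.1, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update over lists of scalars: sign-based step, single momentum buffer."""
    sign = lambda x: (x > 0) - (x < 0)
    out = []
    for i, (p, g) in enumerate(zip(theta, grad)):
        c = beta1 * m[i] + (1 - beta1) * g          # interpolated momentum input
        out.append(p - lr * (sign(c) + weight_decay * p))
        m[i] = beta2 * m[i] + (1 - beta2) * g       # decoupled momentum accumulation
    return out
```

Because the update is ±η per coordinate, the learning rate directly sets the step size, which is why Lion typically wants a smaller η and larger weight decay than AdamW.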
Sophia (Second-Order Hessian Information with Optimized Approximation). Sophia uses diagonal Hessian estimates for curvature-aware scaling without the memory and compute cost of dense second-order methods.
m_t = β₁ m_{t−1} + (1 − β₁) g_t                   (first moment)                                (67)
h_t = β₂ h_{t−k} + (1 − β₂) ĥ_t                   (diagonal Hessian, updated every k steps)     (68)
clip(x, C) = min(max(x, −C), C)                   (element-wise clipping)                       (69)
θ_{t+1} = θ_t − η · clip(m_t / max(γ h_t, ε), 1)                                                (70)
where ĥ_t is a diagonal estimate (e.g., Hutchinson-trace estimator) and γ is a scaling factor. The diagonal Hessian h_t captures local curvature, allowing the optimizer to take smaller steps in sharp directions and larger steps in flat directions. The clipping operation clip(·, 1) prevents adaptive steps from exploding. Sophia belongs to the family of curvature-aware methods seeking better conditioning without the O(n²) or O(n³) cost of full second-order methods. Works particularly well in second-pass fine-tuning scenarios.
Algorithm 21 Sophia (Diagonal Hessian with Clipped Updates)
Require: Initial parameters θ₀, learning rate η, β₁, β₂, Hessian update frequency k, clip threshold C
Ensure: Updated parameters θ_T
1: Initialize: m ← 0, h ← ε (diagonal Hessian estimate)
2: for t = 1, 2, …, T do
3:   g ← ∇f(θ_{t−1})
4:   m ← β₁m + (1 − β₁)g   ▷ First moment (bias-correct outside)
5:   if t mod k = 0 then
6:     ĥ ← HutchinsonEstimate(g)   ▷ Diagonal Hessian via Hutchinson estimator
7:     h ← β₂h + (1 − β₂)ĥ   ▷ Update Hessian estimate
8:   end if
9:   u ← m/max(γh, ε)   ▷ Adaptive step (element-wise)
10:  u ← clip(u, C)   ▷ Element-wise clipping to [−C, C]
11:  θ_t ← θ_{t−1} − η · u
12: end for
13: return θ_T
Muon and Turbo-Muon (Orthogonality-Based Optimizers). Muon and Turbo-Muon are orthogonality-oriented optimizers that reshape update geometry using Newton-Schulz polynomial iterations for orthogonalization. For a 2D parameter matrix W with gradient G_t:
X₀ = G_t / (‖G_t‖_F + ε)       (normalized gradient)                                     (71)
A_k = X_kX_k^⊤                 (Gramian)                                                 (72)
B_k = bA_k + cA_k²             (polynomial step, coefficients b, c from Newton-Schulz)   (73)
X_{k+1} = aX_k + B_kX_k        (iteration k = 0, 1, …, 4)                                (74)
W_{t+1} = W_t − ηX_K           (update with orthogonalized direction)                    (75)
Muon performs 5 Newton-Schulz iterations to produce an approximately orthogonal direction, ensuring updates respect geometric constraints. Turbo-Muon adds an almost-orthogonal preconditioning (AOL) step before the iterations, reducing the number of required orthogonalization steps to 4. The rationale: orthogonal updates preserve norms during training, avoiding the norm creep observed in momentum-based methods. These optimizers show promise on large-scale training but require custom CUDA kernels for efficiency. Tradeoff: high per-step compute (matrix multiplications and polynomial iterations) versus fundamentally better-conditioned update geometry.
Algorithm 22 Muon (Newton-Schulz Orthogonalization)
Require: Initial parameters θ (matrices), learning rate η, Newton-Schulz iterations K
Ensure: Updated parameters θ
1: for each 2D parameter matrix W do
2:   G ← ∇f(W)   ▷ Gradient
3:   X₀ ← G/(‖G‖_F + ε)   ▷ Normalize gradient by Frobenius norm
4:   for k = 0, 1, …, K − 1 do
5:     A_k ← X_kX_k^⊤   ▷ Gramian (cost: O(mn²) or O(m²n) for tall/wide matrices)
6:     X_{k+1} ← (3/2)X_k − (1/2)A_kX_k   ▷ Cubic Newton-Schulz step
7:   end for
8:   W ← W − ηX_K   ▷ Update with orthogonalized direction
9: end for
10: return updated θ
Algorithm 23 Turbo-Muon (AOL-Preconditioned Orthogonalization)
Require: Initial parameters θ (matrices), learning rate η, Newton-Schulz iterations K′ (typically 4)
Ensure: Updated parameters θ
1: for each 2D parameter matrix W do
2:   G ← ∇f(W)   ▷ Gradient
3:   X₀ ← G/(‖G‖_F + ε)   ▷ Normalize gradient
4:   AOL (almost-orthogonal) preconditioning:
5:   P ← 1.5I − 0.5X₀X₀^⊤   ▷ Preconditioning matrix
6:   X₀ ← PX₀   ▷ Preconditioned start
7:   for k = 0, 1, …, K′ − 1 do
8:     A_k ← X_kX_k^⊤
9:     X_{k+1} ← (3/2)X_k − (1/2)A_kX_k
10:  end for   ▷ Fewer iterations suffice thanks to AOL preconditioning
11:  W ← W − ηX_{K′}   ▷ Update with preconditioned orthogonalized direction
12: end for
13: return updated θ
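The core orthogonalization step is easy to sketch. The fragment below uses the simple cubic iteration X ← 1.5X − 0.5(XXᵀ)X, which pushes all singular values toward 1; note that production Muon uses a tuned quintic polynomial (the a, b, c coefficients of Eq. 73–74), so this cubic is a simplified stand-in:

```python
import numpy as np

def newton_schulz(G, iters=5, eps=1e-7):
    """Cubic Newton-Schulz iteration driving singular values of X toward 1 (sketch)."""
    X = G / (np.linalg.norm(G) + eps)          # Frobenius normalization: sigma <= 1
    for _ in range(iters):
        A = X @ X.T                            # Gramian
        X = 1.5 * X - 0.5 * A @ X              # polynomial step: sigma -> 1.5s - 0.5s^3
    return X
```

After a few iterations on a reasonably conditioned gradient, X is close to the orthogonal polar factor of G, i.e. XXᵀ ≈ I.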
Group | Methods | Primary Goal | Interpretation from the Survey
Classical baseline | SGD, AdamW, RAdam | stability and reference baselines | These define the comparison floor for newer optimizer claims.
Momentum redesign | Adan, AdEMAMix, MARS, Cautious AdamW | faster or safer first-order adaptation | Best when convergence speed or noisy-gradient stability is the main concern.
Large-batch and schedule simplification | LAMB, Schedule-Free AdamW | operational robustness at scale | Reduce brittleness from batch-size growth or schedule engineering.
Memory-efficient | Adafactor, GaLore, APOLLO, APOLLO-Mini, Q-APOLLO, Lion | optimizer-state reduction | Most useful when VRAM is dominated by optimizer state rather than activations; APOLLO-family methods replace dense AdamW moments with projected or quantized structured scaling.
Curvature-aware | Shampoo, SOAP, Sophia | better conditioning | Prefer when richer geometry is worth implementation and compute overhead.
Geometry-oriented | Muon, Turbo-Muon | orthogonalized update structure | Specialized options for matrix geometry and representation shaping.
B Annex B: Dense, Recurrent, and Memory-Augmented Transformers
This comprehensive annex synthesizes modern transformer architectures beyond the standard softmax baseline. The field has evolved toward six primary paradigms: dense global attention with variants, recurrent state compression with decay, selective state-space models, continuous-depth numerical integration, memory-augmented architectures, and hybrid approaches. Understanding this taxonomy illuminates fundamental tradeoffs between expressiveness, computational cost, memory footprint, and deployment simplicity.
B.1 Dense Attention Baselines: Standard and Sigmoid
B.1.1 Standard Softmax Attention
Standard multi-head self-attention [38] computes scaled dot-product similarity between token embeddings:
1: Input: query matrix Q ∈ R^{n×d}, key matrix K ∈ R^{n×d}, value matrix V ∈ R^{n×d}
2: scores ← QK^⊤/√d ∈ R^{n×n}   ▷ compute pairwise similarities
3: attention_weights ← softmax(scores, axis = 1) ∈ R^{n×n}   ▷ normalize across keys
4: Output: Y = attention_weights · V ∈ R^{n×d}   ▷ weighted value combination
For autoregressive generation, KV caching stores past keys and values to avoid O(n²) recomputation. However, this linearly growing cache occupies ∼n · d_hidden bytes, which can consume hundreds of gigabytes for billion-parameter models. Standard attention achieves perfect expressiveness within the context window (any token can attend to any other token with learned weights) but pays the price of dense computation.
• Training complexity: O(n² · d) time, O(n²) space (attention matrix materialization).
• Inference complexity: O(n) time per token, O(n) space (KV cache).
• Strengths: Unparalleled expressiveness; perfect history recall; highly parallelizable.
• Weaknesses: Quadratic bottleneck prohibits very long contexts; KV cache dominates memory during generation.
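The baseline is compact enough to state exactly; a NumPy sketch with an optional causal mask (names illustrative, not the toolkit's API):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """Dense scaled dot-product attention; the O(n^2) score matrix is materialized."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) pairwise similarities
    if causal:
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    return softmax(scores, axis=-1) @ V              # weighted value combination
```

With the causal mask, position 0 can only attend to itself, so its output is exactly V[0]; that per-row normalization to a probability distribution is precisely what sigmoid attention (next subsection) relaxes.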
B.1.2 Sigmoid Attention
Sigmoid attention replaces row-wise softmax with element-wise sigmoid activation [30]:
1: Input: Q, K, V as above, learnable bias b ∈ R^{n×n}
2: logits ← QK^⊤/√d + b
3: attention_weights ← σ(logits)   ▷ element-wise sigmoid, not softmax
4: Output: Y = attention_weights · V
Unlike softmax, sigmoid does not enforce a probability distribution (weights need not sum to 1), enabling stronger token independence. Theoretical analysis via mixture-of-experts shows sigmoid achieves superior sample complexity: O(n^{−0.51}) convergence for ReLU experts versus softmax's O(n^{−0.24}). However, empirical training revealed gradient instabilities at scale. The remedy is hybrid-norm (adding normalization after the attention output), which stabilizes gradients without sacrificing the theoretical benefits of element-wise gating.
• Training complexity: O(n² · d) (identical to standard), but element-wise operations enable a 17% inference speedup via FlashSigmoid.
• Inference complexity: O(n) per token with KV cache (asymptotically the same, but lower constant factors).
• Strengths: Overcomes zero-sum competition; avoids row-wise synchronization; hardware- friendly implementation.
• Weaknesses: Requires careful stabilization (hybrid-norm); training instability at large scales without auxiliary loss.
B.2 Recurrent and Retentive Architectures
B.2.1 Retentive Networks (RetNet)
RetNet [36] unifies three computation paradigms: parallel training, recurrent inference, and chunkwise deployment. Its core innovation is the retention mechanism, which uses a fixed exponential decay matrix to model temporal importance:
1: Parallel (training) form:
2:   decay_matrix[i, j] ← γ^{i−j} for i ≥ j   ▷ causal exponential decay
3:   decay_matrix[i, j] ← 0 for i < j   ▷ causal masking
4:   Y_parallel ← (QK^⊤ ⊙ decay_matrix)V   ▷ element-wise multiplication with decay
1: Recurrent (inference) form:
2: for t = 1, …, n do
3:   s_t ← γs_{t−1} + k_tv_t^⊤   ▷ state update with decay
4:   y_t ← q_ts_t   ▷ output via query-state interaction
5: end for
The decay scalar γ ∈ (0, 1) controls the temporal window. RetNet uses multi-scale retention with a different γ per head (e.g., γ = 1 − 2^{−5}, 1 − 2^{−6}, …), allowing short-term and long-term dependencies to be modeled simultaneously. Chunkwise recurrent mode divides sequences into chunks, processes each chunk in parallel, and threads a recurrent state between chunks.
• Training complexity: O(n² · d) (parallel), or O(n · c · d) (chunkwise recurrent with chunk size c).
• Inference complexity: O(1) per token, O(d²) state space (fixed matrix).
• Strengths: Constant-time inference; triple computation paradigm; multi-scale consolidation.
• Weaknesses: Fixed decay imposes rigid inductive bias; may truncate learned long-range patterns.
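The parallel and recurrent forms compute the same outputs, which is what makes the triple-mode deployment possible. A NumPy sketch verifying the equivalence (function names illustrative):

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Training form: dense scores masked by the causal exponential-decay matrix."""
    n = Q.shape[0]
    idx = np.arange(n)
    D = np.where(idx[:, None] >= idx[None, :],
                 float(gamma) ** (idx[:, None] - idx[None, :]), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Inference form: fixed-size state S of shape (d, d_v), O(1) per token."""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = []
    for t in range(Q.shape[0]):
        S = gamma * S + np.outer(K[t], V[t])   # decayed state update
        out.append(Q[t] @ S)                   # query-state readout
    return np.stack(out)
```

Unrolling the recurrence gives S_t = Σ_{j≤t} γ^{t−j} k_j v_j^⊤, so q_t S_t reproduces the decay-masked parallel sum term by term.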
B.2.2 Mamba: Selective State-Space Models
Mamba [12] frames recurrence as a continuous-time dynamical system with input-dependent parameters, achieving both linear training and constant-time inference:
1: Continuous dynamics: h′(t) = Ah(t) + Bx(t),  y(t) = Ch(t)
2: Discretization with step size Δ_t:
3:   Ā_t ← exp(Δ_tA)
4:   B̄_t ← (Δ_tA)^{−1}(exp(Δ_tA) − I)Δ_tB_t
5: Recurrent update:
6: for t = 1, …, n do
7:   Δ_t ← softplus(Linear(x_t))   ▷ input-dependent step size
8:   B_t ← Linear(x_t), C_t ← Linear(x_t)   ▷ input-dependent projection matrices
9:   h_t ← Ā_th_{t−1} + B̄_tx_t   ▷ state transition
10:  y_t ← C_th_t   ▷ output projection
11: end for
The critical innovation is that A (the system matrix) is not input-dependent, but Δ_t, B_t, and C_t are, making the system time-varying. This selectivity allows the model to ignore irrelevant information by setting Δ_t near zero, effectively creating a gate. A hardware-aware parallel scan algorithm implements the recurrence efficiently on GPUs by fusing computation within SRAM, avoiding expensive HBM bandwidth.
• Training complexity: O(n · d) via hardware-aware scan (linear in sequence length).
• Inference complexity: O(1) per token, O(d) state (hidden vector).
• Strengths: Achieves linear training and constant-inference simultaneously; practical on long sequences (millions of tokens).
• Weaknesses: State vector compression can weaken exact copying and dense associative recall versus full attention.
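The "Δ_t near zero acts as a gate" behavior can be seen in a deliberately tiny toy: a diagonal SSM on scalar inputs, with scalar/vector weights (W_delta, W_B, W_C) standing in for the learned Linear projections, and a simple Euler-style discretization of B instead of the exact ZOH formula. All of this is a hypothetical reduction for illustration, not the Mamba kernel:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def selective_scan(x, A, W_delta, W_B, W_C):
    """Toy diagonal selective SSM over scalar inputs (illustrative only)."""
    h = np.zeros(A.shape[0])
    ys = []
    for xt in x:
        delta = softplus(W_delta * xt)     # input-dependent step size
        Abar = np.exp(delta * A)           # discretized diagonal transition
        Bbar = delta * (W_B * xt)          # simplified discretization of B_t
        h = Abar * h + Bbar * xt           # state transition
        ys.append(float(W_C @ h))          # output projection
    return np.array(ys)
```

Driving Δ_t toward zero (here via a large negative W_delta) makes Ā ≈ I and B̄ ≈ 0, so the state, and therefore the output, ignores the input stream entirely.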
B.3 Continuous-Depth Transformers: ODE Integration
B.3.1 ODE Transformer
The ODE Transformer [52] interprets network depth as numerical integration of a continuous dynamical system, using higher-order Runge-Kutta solvers to reduce truncation error:
1: Continuous formulation: dh(t)/dt = f_θ(h(t), t), where f_θ is the transformer sub-network
2: Runge-Kutta-4 discrete approximation:
3:   k₁ ← f_θ(h_t, t)
4:   k₂ ← f_θ(h_t + ½k₁, t + ½Δt)
5:   k₃ ← f_θ(h_t + ½k₂, t + ½Δt)
6:   k₄ ← f_θ(h_t + k₃, t + Δt)
7:   h_{t+1} ← h_t + (Δt/6)(k₁ + 2k₂ + 2k₃ + k₄)   ▷ weighted combination
Instead of simple Euler residual connections h_{t+1} = h_t + f_θ(h_t), the RK4 block computes four intermediate evaluations and combines them with classical RK4 weights. To avoid vanishing gradients, the architecture introduces learned gating that interpolates between intermediate approximations:
1: g ← σ(Linear([k₁, k₂, k₃, k₄]))   ▷ learnable gate
2: h_{t+1} ← h_t + g · k₁ + (1 − g) · k₂   ▷ interpolated refinement
This formulation reduces the effective number of parameters through weight sharing (the same f_θ is evaluated multiple times) while providing richer trajectory refinement.
• Training complexity: O(k · n² · d) where k is the RK order (4 for RK4).
• Inference complexity: O(k · n) per token (higher constant overhead).
• Strengths: Significantly higher accuracy on generation tasks (state-of-the-art BLEU); parameter-efficient via weight sharing.
• Weaknesses: High per-step compute; inference latency increases by a constant factor k; complex gating required for stability.
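The numerical core is the classical RK4 step; the generic sketch below (with f standing in for the shared transformer sub-network f_θ) shows the four-evaluation structure and its accuracy advantage over an Euler residual step:

```python
def rk4_step(f, h, t, dt):
    """One classical RK4 update, used in place of the Euler residual h + f(h)."""
    k1 = f(h, t)
    k2 = f(h + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = f(h + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = f(h + dt * k3, t + dt)
    return h + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```

Integrating dh/dt = −h over the unit interval in ten steps lands within about 1e-6 of the exact exp(−1), whereas the same number of Euler steps is off by a few percent; this is the truncation-error reduction the ODE Transformer exploits, at the cost of k = 4 evaluations of f_θ per block.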
B.4 Test-Time Memory: Titans
The Titans architecture [2] introduces an orthogonal dimension: instead of static weights, the model maintains a learnable memory that is updated during inference based on a surprise-driven signal:
1: Local short-term attention: apply standard or sparse attention within a fixed context window c
2:   y_t^local ← Attention(q_t, k_{[t−c:t]}, v_{[t−c:t]})
3: Memory update signal (surprise):
4:   S_t ← η_tS_{t−1} − θ_t∇ℓ(M_{t−1}; x_t)   ▷ surprise as gradient of loss w.r.t. memory
5:   M_t ← (1 − α_t)M_{t−1} + S_t   ▷ memory updated via EMA
6: Long-term memory retrieval:
7:   y_t^memory ← M_t^*(q_t)   ▷ query memory module for retrieved information
8: Output combination:
9:   y_t ← y_t^local + gate(y_t^memory)   ▷ combine local and retrieved memory
The memory module M is literally updated during the forward pass by computing gradients of an associative loss and applying SGD steps with momentum. The decay rates η_t, θ_t, α_t are themselves input-dependent, allowing the model to switch memory paradigms when context shifts.
• Training complexity: Roughly O(c² + n · f) where c is the local window and f is the memory update overhead.
• Inference complexity: O(c²) local attention plus O(1) memory retrieval per token.
• Strengths: Handles extreme context lengths; enables true associative recall; memory adapts to input.
• Weaknesses: Significantly more complex; inference includes gradient computation; higher coordination overhead.
B.5 Architectural Comparison and Synthesis
Architecture | Train | Infer | State | Key Characteristic
Standard Attention | O(n²d) | O(n) cache | O(nd) | Perfect expressiveness, full token routing, KV bottleneck.
Sigmoid Attention | O(n²d) | O(n) cache | O(nd) | Element-wise gating, hardware-efficient, stable at scale with hybrid-norm.
RetNet | O(n²d) | O(1) | O(d²) | Multi-scale decay, triple computation mode, fixed forgetting pattern.
Mamba | O(nd) | O(1) | O(d) | Selective input-dependent dynamics, linear training and inference, hardware-aware scan.
ODE Transformer | O(kn²d) | O(kn) | O(nd) | Numerical-integration refinement, weight sharing via stages, higher accuracy.
Titans | O(c² + nf) | O(c² + 1) | O(1) | Test-time memory adaptation, extreme context, on-inference parameter updates.
The field exhibits a clear progression along two axes: (i) computational efficiency, moving from O(n²) to O(n) or O(1), and (ii) memory adaptivity, shifting from static weights to dynamic, test-time updated models. Standard attention remains the expressiveness baseline; Mamba and RetNet represent the practical efficiency frontier; Titans introduces an orthogonal innovation axis (test-time learning). The choice of architecture reflects the fundamental engineering constraint: expressiveness versus deployment cost.
C Annex C: Comprehensive Sparse Attention Mechanisms
C.1 Executive Summary
Sparse attention mechanisms address the prohibitive O(n²) computational and memory complexity of standard scaled dot-product attention by restricting the set of attended key-value pairs. Modern sparse attention spans a rich design space distinguished by four orthogonal dimensions: (i) sparsity pattern (fixed geometric, data-dependent, or frequency-based), (ii) trainability (end-to-end trained or inference-only), (iii) sparsity unit (individual tokens, local windows, or blocks), and (iv) mechanism (masking, selection, filtering, or compression). This annex synthesizes seven major sparse attention families, Sparse Transformer, Longformer, BigBird, SparseK, NSA, SpargeAttn, and FASA, providing mathematical formulations, algorithmic pseudocode, and comprehensive architectural comparisons.
C.2 Sparse Transformer: Factorized Strided and Fixed Patterns
C.2.1 Mathematical Formulation
The Sparse Transformer [9] reduces quadratic complexity to O(n√n) by factorizing sparse attention into two complementary sparse heads. Let n be the sequence length and l = ⌊√n⌋ be the stride.
Strided attention head: Each position i attends to every l-th previous position:

A_i^(strided) = { j : j ≤ i, (i − j) mod l = 0 }
Fixed attention head: Each position i attends to positions within its local block plus fixed summary columns:

A_i^(fixed) = { j : ⌊j/l⌋ = ⌊i/l⌋ } ∪ { j : j mod l ∈ {l − c, …, l − 1} }
where c is a hyperparameter controlling the number of summary columns (typically c = 1 or c = 2). The attention computation for each head h follows standard scaled dot-product rules restricted to the sparse set:

Attn^h(Q, K, V)_i = softmax( q_i^h (K_{A_i}^h)^⊤ / √d_k ) V_{A_i}^h
The key insight is that with the two factorized heads operating in parallel across a depth of L transformer layers, every position can reach every other position through a path of length at most L + 1, preserving long-range reachability at a fraction of the dense attention cost.
C.2.2 Algorithmic Pseudocode
Algorithm 24 Sparse Transformer Attention: Strided + Fixed Dual Heads
Require: Query Q ∈ R^{n×d}, Key K ∈ R^{n×d}, Value V ∈ R^{n×d}
Require: Stride l = ⌊√n⌋, summary columns c ∈ {1, 2}
Ensure: Output Y ∈ R^{n×d}
1: Head 1: Strided Attention
2: for i = 1, …, n do
3:   A_i ← {j : j ≤ i and (i − j) mod l = 0}  ▷ Every l-th position
4:   scores_i ← q_i K_{A_i}^⊤ / √d_k
5:   attn_i ← softmax(scores_i)
6:   y_i^(1) ← attn_i V_{A_i}
7: end for
8: Head 2: Fixed Block + Summary Attention
9: for i = 1, …, n do
10:   block_start ← ⌊i/l⌋ · l, block_end ← min(i + 1, block_start + l)
11:   A_i ← [block_start, block_end) ∪ {positions in the last c columns of each block}
12:   scores_i ← q_i K_{A_i}^⊤ / √d_k
13:   attn_i ← softmax(scores_i)
14:   y_i^(2) ← attn_i V_{A_i}
15: end for
16: Concatenate: Y = [y^(1) ∥ y^(2)] and project via output linear layer
return Y
C.2.3 Key Characteristics
• Complexity: O(n√n · d) during training; O(n) per token during generation.
• Trainable: Yes; full backpropagation through selected positions.
• Sparsity pattern: Fixed and data-agnostic; patterns do not adapt to content.
• Strength: Structured reachability guarantees long-range dependencies; proven effective on long sequences (Enwik8 text, images).
• Limitation: Fixed patterns may miss important but non-contiguous relationships; requires custom CUDA kernels for practical efficiency.
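The two factorized index sets are easy to enumerate directly. The following NumPy-free sketch (an illustrative reference for the set definitions above, not the paper's CUDA-kernel implementation) builds the strided and fixed sets for one query position:

```python
def strided_set(i, l):
    """Positions j <= i with (i - j) divisible by the stride l."""
    return [j for j in range(i + 1) if (i - j) % l == 0]

def fixed_set(i, l, c, n):
    """Causal positions in i's block, plus the last c 'summary' columns
    (j mod l >= l - c) of every preceding block."""
    block = [j for j in range(n) if j // l == i // l and j <= i]
    summary = [j for j in range(n) if j % l >= l - c and j <= i]
    return sorted(set(block) | set(summary))

# e.g. with n = 16 and stride l = 4:
# strided_set(9, 4)      -> [1, 5, 9]
# fixed_set(9, 4, 1, 16) -> [3, 7, 8, 9]
```

Each position touches O(√n) keys per head, which is where the overall O(n√n) cost comes from.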
C.3 Longformer: Sliding Window, Dilation, and Global Tokens
C.3.1 Mathematical Formulation
Longformer [3] achieves linear O(n·w) complexity by combining three complementary attention patterns.
Sliding window attention: Local neighborhood of size w:

A_i^(window) = { j : |i − j| ≤ w/2 }
Dilated sliding window: Windowed attention with dilation d, attending only every (d + 1)-th position:

A_i^(dilated) = { j : |i − j| ≤ (w/2)(d + 1) and (i − j) mod (d + 1) = 0 }
Global attention: Designated global tokens (e.g., [CLS]) attend to all positions and are attended by all:

A_i^(global) = {1, …, n} if i ∈ G, and A_i^(global) = A_i^(window) ∪ G otherwise
The three patterns are applied via separate projections. The local and global attention can use either the same projections or task-specific separate ones.
C.3.2 Algorithmic Pseudocode
Algorithm 25 Longformer: Sliding Window + Dilated + Global
Require: Query Q ∈ R^{n×d}, Key K, Value V, window size w, dilation d, global token indices G
Ensure: Output Y
1: for i = 1, …, n do
2:   if i ∈ G then
3:     A_i ← {1, …, n}  ▷ Global token attends to all
4:   else
5:     Local window: A^(local) ← [i − w/2, i + w/2] ∩ [1, n]
6:     Dilated offsets: A^(dilated) ← {j : (i − j) mod (d + 1) = 0} ∩ [i − (w/2)(d + 1), i + (w/2)(d + 1)]
7:     Combine: A_i ← A^(local) ∪ A^(dilated) ∪ G
8:   end if
9:   scores_i ← q_i K_{A_i}^⊤ / √d_k
10:   attn_i ← softmax(scores_i)
11:   y_i ← attn_i V_{A_i}
12: end for
return Y
C.3.3 Key Characteristics
• Complexity: O(n · w) where w is the window size (linear when w ≪n).
• Trainable: Yes; drop-in replacement for standard attention.
• Coverage: With L layers and window size w, receptive field grows to L × w, covering entire sequence with shallow networks.
• Strength: Practical for document-level tasks; dilation expands local receptive fields with minimal overhead; global tokens provide query-document correspondence.
• Limitation: Window size remains a design hyperparameter; suboptimal for tasks requiring dense attention patterns.
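The three patterns compose into a single boolean attention mask. The sketch below (a minimal NumPy illustration of the set definitions, not Longformer's banded-matrix kernels) builds that mask; `w`, `d`, and `global_idx` correspond to the window size, dilation, and global-token set G above:

```python
import numpy as np

def longformer_mask(n, w, d, global_idx):
    """Boolean mask M[i, j] = True iff position i may attend to j.
    Combines sliding window (size w), dilation d, and global tokens."""
    M = np.zeros((n, n), dtype=bool)
    for i in range(n):
        if i in global_idx:
            M[i, :] = True                  # global token attends everywhere
            continue
        for j in range(n):
            local = abs(i - j) <= w // 2
            dilated = (abs(i - j) <= (w // 2) * (d + 1)
                       and (i - j) % (d + 1) == 0)
            M[i, j] = local or dilated or j in global_idx
    M[:, list(global_idx)] = True           # everyone attends to global tokens
    return M
```

The mask has O(n · w) True entries plus the global rows/columns, matching the stated linear complexity.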
C.4 BigBird: Random, Local, and Global Sparse Graph
C.4.1 Mathematical Formulation
BigBird [50] preserves universal approximation and Turing completeness using a principled mix of three sparsity patterns.
Random connections: Each position connects to r randomly sampled positions:

A_i^(random) = RandomSample({1, …, n}, r)
Local window: Sliding window neighborhood:

A_i^(local) = { j : |i − j| ≤ w/2 }
Global tokens: Task-specific global anchors:

A_i^(global) = {1, …, n} if i ∈ G, and A_i^(global) = A_i^(random) ∪ A_i^(local) ∪ G otherwise
The composite sparse mask M is:
M_ij = 1[ j ∈ A_i^(random) ∪ A_i^(local) ∪ A_i^(global) ]
In practice, BigBird groups tokens into blocks of size b and applies the sparsity decision at the block level, enabling efficient block-sparse implementations.
C.4.2 Theoretical Guarantee
The authors prove that with O(1) random connections and global tokens, the sparse graph maintains the same universal approximation and Turing completeness properties as dense attention. Formally, any computable function can be represented and any sequence of operations can be simulated within the BigBird connectivity graph.
C.4.3 Algorithmic Pseudocode
Algorithm 26 BigBird: Random + Local + Global Block-Sparse Attention
Require: Query Q, Key K, Value V, block size b, random links per block r, global indices G
Ensure: Output Y
1: Reshape into blocks: n_b = ⌈n/b⌉ blocks
2: for block i = 0, …, n_b − 1 do
3:   for position j in block i do
4:     Local window: A_j^(local) ← nearby positions in block ∪ boundary positions
5:     Random block sampling: sample r blocks uniformly at random; add all positions from those blocks to A_j^(random)
6:     Global: add all global token positions
7:     A_j ← A_j^(local) ∪ A_j^(random) ∪ G
8:   end for
9: end for
10: for i = 1, …, n do
11:   scores_i ← q_i K_{A_i}^⊤ / √d_k
12:   attn_i ← softmax(scores_i)
13:   y_i ← attn_i V_{A_i}
14: end for
return Y
C.4.4 Key Characteristics
• Complexity: O(n) or near-linear in optimal block-sparse implementation.
• Trainable: Yes.
• Theoretical grounding: Provably maintains universal approximation and Turing-completeness with sparse connectivity.
• Strength: Strong long-document performance on question answering, summarization, and genomics tasks.
• Limitation: Random sampling introduces variance and non-determinism; tuning of r, w, g hyperparameters remains task-dependent.
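The block-level sparsity decision of Algorithm 26 can be sketched as a mask builder. This NumPy illustration (with an assumed `seed` parameter to tame the randomness the limitation above mentions; it is not the official block-sparse kernel) marks local, random, and global connectivity at block granularity:

```python
import numpy as np

def bigbird_mask(n, b, r, w, global_idx, seed=0):
    """Block-level BigBird mask: w local blocks each side + r random blocks
    per row block + global tokens. b is the block size."""
    rng = np.random.default_rng(seed)
    nb = -(-n // b)                          # ceil(n / b)
    M = np.zeros((n, n), dtype=bool)
    for bi in range(nb):
        rows = slice(bi * b, min((bi + 1) * b, n))
        # local window of w blocks on each side
        for bj in range(max(0, bi - w), min(nb, bi + w + 1)):
            M[rows, bj * b:min((bj + 1) * b, n)] = True
        # r random blocks (sampled without replacement)
        for bj in rng.choice(nb, size=min(r, nb), replace=False):
            M[rows, bj * b:min((bj + 1) * b, n)] = True
    M[:, list(global_idx)] = True            # global columns
    M[list(global_idx), :] = True            # global rows
    return M
```

With b = O(√n) blocks and constant r, w, the number of True entries is near-linear in n, consistent with the complexity claim above.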
C.5 SparseK Attention: Differentiable Top-k Selection
C.5.1 Mathematical Formulation
SparseK [23] enables learnable sparsity via a differentiable top-k operator. A scoring network ϕθ evaluates key-value pair importance:
u_j = ϕ_θ(k_j, q_i) ∈ R
The SparseK operator selects the top-k scores while remaining differentiable. It computes a threshold τ(u) such that the sum of active scores equals k:
SparseK(u, k)_j = max(u_j − τ(u), 0), where Σ_j max(u_j − τ(u), 0) = k
The threshold τ is found via bisection (or in closed form after sorting). The attention becomes:

m_j = SparseK(u, k)_j, Attn_i = softmax( q_i K_sel^⊤ / √d_k ) V_sel

where K_sel, V_sel contain only the top-k entries (those with m_j > 0).
C.5.2 Algorithmic Pseudocode
Algorithm 27 SparseK: Differentiable Top-k Attention
Require: Query q ∈ R^d, Key matrix K ∈ R^{n×d}, Value matrix V ∈ R^{n×d}
Require: Scoring network ϕ_θ, target sparsity k
Ensure: Attention output y
1: Compute importance scores:
2: for j = 1, …, n do
3:   u_j ← ϕ_θ(k_j, q)
4: end for
5: Compute differentiable top-k via threshold:
6: u_sorted ← sort(u, descending = True)
7: cumsum ← cumsum(u_sorted)
8: Find ρ = max{i : u_sorted[i] > (cumsum[i] − k)/i}
9: τ ← (cumsum[ρ] − k)/ρ  ▷ Threshold so that Σ_j max(u_j − τ, 0) = k
10: m ← max(u − τ, 0)  ▷ Differentiable selection mask
11: Apply mask and compute attention:
12: K_sel ← K[m > 0], V_sel ← V [m > 0]
13: scores ← q K_sel^⊤ / √d_k
14: attn ← softmax(scores)
15: y ← attn · V_sel
return y
C.5.3 Key Characteristics
• Complexity: O(n · d) during training (linear in sequence length); O(k) per token during generation.
• Trainable: Yes; end-to-end gradient flow through the SparseK operator.
• Incremental generation: Supports efficient constant-memory autoregressive generation.
• Strength: Seamlessly integrates into existing LLM architectures; minimal fine-tuning needed.
• Limitation: Scattered memory access from top-k selection may limit cache efficiency on some hardware; scoring network adds overhead.
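The threshold computation at the heart of the SparseK operator is a small piece of code. The sketch below (a NumPy illustration of the sorting-based closed form standing in for the paper's bisection; the scoring network ϕ_θ is omitted and scores u are taken as given) returns the mask m_j = max(u_j − τ, 0) whose active entries sum exactly to k:

```python
import numpy as np

def sparsek_mask(u, k):
    """SparseK selection mask: m_j = max(u_j - tau, 0) with tau chosen
    so that sum_j m_j = k. Sorting gives tau in closed form."""
    u = np.asarray(u, dtype=float)
    s = np.sort(u)[::-1]                    # scores in descending order
    csum = np.cumsum(s)
    idx = np.arange(1, len(u) + 1)
    # largest rho such that the rho-th score is still above the threshold
    rho = np.max(idx[s > (csum - k) / idx])
    tau = (csum[rho - 1] - k) / rho
    return np.maximum(u - tau, 0.0)
```

Because max(· − τ, 0) is piecewise linear, gradients flow through the non-zero entries, which is what makes the selection trainable end-to-end.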
C.6 NSA (Native Sparse Attention): Hardware-Aligned Hierarchical Branches
C.6.1 Mathematical Formulation
NSA [48] decomposes sparse attention into three parallel branches that are combined through learned gates.
Compression branch: A learnable MLP φ compresses key-value blocks:

K̃_t^cmp = [φ(k_{id+1:id+l})]_{1≤i≤⌊(t−l)/d⌋}, Ṽ_t^cmp = [φ(v_{id+1:id+l})]_{1≤i≤⌊(t−l)/d⌋}

Selection branch: High-importance blocks are selected based on compressed scores:

p_t^cmp = softmax(q_t^⊤ K̃_t^cmp / √d), I_t = TopK(p_t^cmp, n), K̃_t^sel = Gather(K, I_t)

Sliding window branch: Fixed local context:

K̃_t^win = K_{t−w:t}, Ṽ_t^win = V_{t−w:t}

Gated combination: The three attention outputs are combined via learned gates:

α_t^(cmp) = σ(Linear_cmp(q_t)), α_t^(sel) = σ(Linear_sel(q_t)), α_t^(win) = σ(Linear_win(q_t))

y_t = α_t^(cmp) · Attn(q_t, K̃^cmp, Ṽ^cmp) + α_t^(sel) · Attn(q_t, K̃^sel, Ṽ^sel) + α_t^(win) · Attn(q_t, K̃^win, Ṽ^win)
C.6.2 Algorithmic Pseudocode
Algorithm 28 NSA: Compressed + Selected + Window Branches with Learned Gating
Require: Query q_t, cached Key/Value sequences, block size l, stride d, selection count n, window w
Require: Compression MLP φ, gate networks Linear_cmp, Linear_sel, Linear_win
Ensure: Output y_t
1: Compression branch:
2: for i = 1 to ⌊t/d⌋ do
3:   k_i^cmp ← φ(K_{id+1:id+l})  ▷ Compress block via MLP
4:   v_i^cmp ← φ(V_{id+1:id+l})
5: end for
6: K̃^cmp ← [k_1^cmp, …, k_{⌊t/d⌋}^cmp], Ṽ^cmp ← [v_1^cmp, …, v_{⌊t/d⌋}^cmp]
7: y_t^(cmp) ← SDPA(q_t, K̃^cmp, Ṽ^cmp)
8: Selection branch:
9: Compute scores on compressed keys: s ← softmax(q_t (K̃^cmp)^⊤ / √d)
10: Select top-n important blocks: I ← TopK(s, n)
11: Gather full blocks: K̃^sel ← [K_{I_i d+1:I_i d+l}]_i, Ṽ^sel ← [V_{I_i d+1:I_i d+l}]_i
12: y_t^(sel) ← SDPA(q_t, K̃^sel, Ṽ^sel)
13: Window branch:
14: y_t^(win) ← SDPA(q_t, K_{t−w:t}, V_{t−w:t})
15: Gated fusion:
16: α^(cmp) ← σ(Linear_cmp(q_t)), α^(sel) ← σ(Linear_sel(q_t)), α^(win) ← σ(Linear_win(q_t))
17: y_t ← α^(cmp) y_t^(cmp) + α^(sel) y_t^(sel) + α^(win) y_t^(win)
return y_t
C.6.3 Key Characteristics
• Complexity: O(t/d + n · l + w) tokens attended per step (significantly reduced versus O(t) for dense).
• Trainable: Yes; direct training with all branches differentiable.
• Hardware alignment: Designed for efficient tensor core utilization; compression and selec- tion reduce memory bandwidth pressure.
• Strength: 11.6× decoding speedup and 9× forward speedup on 64k sequences; outperforms full attention on many tasks.
• Limitation: Requires custom Triton/CUDA kernels; complex multi-branch architecture in- creases engineering overhead.
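The three-branch structure can be sketched without the custom kernels. In this toy NumPy version, mean pooling stands in for the learned compression MLP φ, a single `W_gates` matrix (an assumed simplification of the three gate networks) produces the sigmoid gates, and `block`, `n_sel`, `w` are illustrative hyperparameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sdpa_1q(q, K, V):
    """Single-query scaled dot-product attention."""
    return softmax(K @ q / np.sqrt(len(q))) @ V

def nsa_step(q, K, V, W_gates, block=4, n_sel=2, w=4):
    """Toy NSA decode step: compression + selection + window, gated fusion."""
    d = len(q)
    nb = len(K) // block
    Kc = K[:nb * block].reshape(nb, block, d).mean(axis=1)  # mean-pool "phi"
    Vc = V[:nb * block].reshape(nb, block, d).mean(axis=1)
    y_cmp = sdpa_1q(q, Kc, Vc)                              # compression branch
    scores = softmax(Kc @ q / np.sqrt(d))                   # selection branch
    top = np.argsort(scores)[::-1][:n_sel]
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in top])
    y_sel = sdpa_1q(q, K[idx], V[idx])
    y_win = sdpa_1q(q, K[-w:], V[-w:])                      # window branch
    a = 1.0 / (1.0 + np.exp(-(W_gates @ q)))                # three sigmoid gates
    return a[0] * y_cmp + a[1] * y_sel + a[2] * y_win
```

The per-step work is ⌊t/d⌋ compressed tokens + n·l selected tokens + w window tokens, mirroring the complexity expression above.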
C.7 FASA: Frequency-Aware Sparse Attention
C.7.1 Mathematical Formulation
FASA [41] is a training-free inference-time method that exploits rotary position embedding (RoPE) structure. Under RoPE, the token embedding is rotated by θ_i = B^{−2(i−1)/d}, decomposing the d-dimensional space into d/2 frequency chunks (FCs).
Key insight: Only a small subset (< 1%) of FCs matter for contextual awareness; most encode positional patterns.
Dominant frequency chunk identification: For each layer l and head h, identify dominant FCs via contextual agreement (CA):

CA_{l,h,i} = |TopK(α^{l,h}) ∩ TopK(α^{l,h,i})| / K

where α^{l,h} is the full attention mask and α^{l,h,i} is the mask using only FC i. Dominant FCs have high CA—they contribute meaningfully to the final attention pattern.
Token importance prediction (TIP) stage: Using only dominant FCs, compute lightweight importance per token:

S_t^{l,h} = Σ_{i ∈ I_dom^{l,h}} α^{l,h,i}(q_t, K_{1:t}), T_t = TopK-Indices(S_t, N_fac)
Focused attention computation (FAC) stage: Compute full-precision attention on selected tokens:

α̂^FAC = softmax( q_t K_{T_t}^⊤ / √d ), o_t = α̂^FAC V_{T_t}
C.7.2 Algorithmic Pseudocode
Algorithm 29 FASA: Frequency-Aware Sparse Attention (Inference-Time)
Require: Current query q_t, cached Key/Value K_{1:t}, V_{1:t}, RoPE base B, dominant FCs I_dom
Require: TIP token budget N_tip, FAC token budget N_fac
Ensure: Output o_t
1: Stage 1: Token Importance Prediction (TIP)
2: for layer l and head h do
3:   for position i = 1 to t do
4:     Initialize importance from dominant FCs: s_i ← 0
5:     for fc ∈ I_dom^{l,h} do
6:       Rotate query and key into FC fc: q^(fc), k_i^(fc)
7:       s_i^(fc) ← softmax(q^(fc) (k_i^(fc))^⊤ / √d)
8:       s_i ← s_i + s_i^(fc)
9:     end for
10:   end for
11:   T_t ← TopK-Indices([s_1, …, s_t], N_fac)  ▷ Select top tokens
12: end for
13: Stage 2: Focused Attention Computation (FAC)
14: scores_fac ← q_t K_{T_t}^⊤ / √d  ▷ Full-precision dot product
15: α_fac ← softmax(scores_fac)  ▷ Full softmax
16: o_t ← α_fac V_{T_t}  ▷ Weighted value sum
return o_t
C.7.3 Key Characteristics
• Complexity: TIP is O(t · N_tip); FAC is O(N_fac · d).
• Trainable: No; training-free inference optimization.
• Applicability: Requires RoPE-based models; extended to ALiBi and MLA with modifica- tions.
• Strength: Near-oracle accuracy with ≤256 tokens out of millions; up to 2.56× decoding speedup; orthogonal to quantization.
• Limitation: Offline calibration required; dominant FC identification adds preprocessing overhead.
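The two-stage flow can be sketched in a few lines. This toy NumPy version scores tokens with only a small subset of embedding dimensions (`dom_dims`, standing in for the dominant RoPE frequency chunks; the RoPE rotation and per-head calibration are omitted) and then runs exact attention on the top-N_fac tokens:

```python
import numpy as np

def fasa_step(q, K, V, dom_dims, n_fac):
    """Toy FASA decode step: cheap importance from 'dominant' dimensions,
    then full-precision attention on the selected tokens. Training-free."""
    d = len(q)
    # Stage 1 (TIP): importance using only the dominant dimensions
    s = K[:, dom_dims] @ q[dom_dims] / np.sqrt(len(dom_dims))
    sel = np.argsort(s)[::-1][:n_fac]
    # Stage 2 (FAC): exact softmax attention over the selected tokens
    logits = K[sel] @ q / np.sqrt(d)
    a = np.exp(logits - logits.max())
    a /= a.sum()
    return a @ V[sel], sel
```

The expensive full dot product runs over only n_fac tokens, which is where the O(N_fac · d) FAC cost comes from.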
C.8 SpargeAttn: Two-Stage Block-Level Filtering
C.8.1 Mathematical Formulation
SpargeAttn [53] is a universal training-free method that predicts which blocks of the attention matrix will contain negligible values and skips computation for those blocks.
Stage 1 — Sparse prediction: Compute a low-cost proxy importance ŝ_ij for block (i, j):

ŝ_ij = f_pred(Q_i, K_j; hyperparams), skip block (i, j) if ŝ_ij < ϵ_1
Common predictors include block-mean similarity or self-similarity statistics.
Stage 2 — Softmax-aware filtering: After computing P̃_ij = softmax(Q_i K_j^⊤ / √d), check whether the block's maximum probability is negligible relative to the running softmax maximum:

skip P̃_ij V_j if max(P̃_ij) < e^{m_old − m_new} · ϵ_2

where m_old, m_new are online softmax running maxima (computed via FlashAttention-style numerics).
C.8.2 Algorithmic Pseudocode
Algorithm 30 SpargeAttn: Two-Stage Block-Sparse Filtering
Require: Query blocks Q_i for i = 1, …, n_b, Key blocks K_j for j = 1, …, n_b, Value blocks V_j
Require: Prediction threshold ϵ_1, softmax threshold ϵ_2, block size b_s
Ensure: Output Y
1: Initialize: online softmax max m_global ← −∞
2: for block row i = 1 to n_b do
3:   Initialize: block row output Y_i ← 0, local max m_i ← −∞
4:   for block column j = 1 to n_b do
5:     Stage 1: Sparse Prediction
6:     Compute proxy importance: ŝ_ij ← f_pred(Q_i, K_j)
7:     if ŝ_ij < ϵ_1 then
8:       Skip block (i, j)  ▷ Early termination
9:       continue  ▷ Move to next block
10:     end if
11:     Stage 2: Compute and Filter
12:     P̃_ij ← softmax(Q_i K_j^⊤ / √d)
13:     m_block ← max(P̃_ij)
14:     if m_block < e^{m_i − m_global} · ϵ_2 then
15:       Block contributes negligibly; skip  ▷ Softmax-aware pruning
16:       continue
17:     end if
18:     Update local max: m_i ← max(m_i, m_block)
19:     Accumulate: Y_i ← Y_i + P̃_ij V_j
20:   end for
21:   m_global ← max(m_global, m_i)
22: end for
return Y
C.8.3 Key Characteristics
• Complexity: Empirical O(n2 · s) where s is the fraction of non-skipped blocks (typically 0.2–0.5).
• Trainable: No; plug-and-play acceleration for existing models.
• Universality: Works on language models, image diffusion models, and video generation.
• Strength: 2.5–5× speedup compared to dense or previous sparse methods; compatible with quantization (SageAttention integration).
• Limitation: Speedup depends on inherent sparsity; block-level granularity may miss finer patterns; threshold tuning required per model family.
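Stage 1 of the scheme can be sketched compactly. This NumPy illustration implements only the block-mean proxy prediction (the stage-2 online-softmax filtering and FlashAttention numerics are omitted for clarity, and the fallback of keeping at least one block per row is an assumption, not from the paper):

```python
import numpy as np

def sparge_attention(Q, K, V, bs, eps1):
    """Simplified SpargeAttn stage 1: a block-mean dot-product proxy decides
    which key blocks each query block attends over. Returns the output and
    the fraction of blocks actually computed."""
    n, d = Q.shape
    nb = n // bs
    Y = np.zeros_like(Q)
    kept = 0
    for i in range(nb):
        qb = Q[i * bs:(i + 1) * bs]
        proxy = np.array([qb.mean(0) @ K[j * bs:(j + 1) * bs].mean(0)
                          for j in range(nb)])
        keep = np.where(proxy >= eps1)[0]
        if len(keep) == 0:
            keep = np.array([int(proxy.argmax())])   # never drop every block
        kept += len(keep)
        idx = np.concatenate([np.arange(j * bs, (j + 1) * bs) for j in keep])
        logits = qb @ K[idx].T / np.sqrt(d)
        a = np.exp(logits - logits.max(axis=1, keepdims=True))
        a /= a.sum(axis=1, keepdims=True)
        Y[i * bs:(i + 1) * bs] = a @ V[idx]
    return Y, kept / (nb * nb)
```

The returned kept-block fraction corresponds to the sparsity factor s in the empirical O(n²s) complexity.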
C.9 Comparative Summary
| Method | Complexity | Trainable | Unit | Year | Primary Contribution |
|---|---|---|---|---|---|
| Sparse Transformer | O(n√n·d) | Y | Token pattern | 2019 | Factorized strided + fixed masks |
| Longformer | O(nwd) | Y | Local + global | 2020 | Linear scaling with dilation |
| BigBird | O(n) | Y | Graph sparse | 2020 | Theoretical completeness proof |
| SparseK | O(nd) | Y | Top-k tokens | 2024 | Differentiable selection |
| NSA | O((t/d + nl + w)d) | Y | Multi-branch | 2025 | Hardware-aligned 3-branch design |
| FASA | O(tN_tip + N_fac d) | N | Freq chunks | 2026 | RoPE frequency insight |
| SpargeAttn | O(n²s·d) | N | Block filter | 2025 | Two-stage universal filtering |
C.10 Design Space and Selection Criteria
The seven sparse attention methods occupy complementary positions in a multidimensional design space:
• Geometric/fixed patterns (Sparse Transformer, Longformer): Simple to implement and analyze, but patterns do not adapt to content.
• Learned selection (SparseK, NSA): Enable adaptation through trainable selection networks, at the cost of higher implementation complexity.
• Training-free acceleration (FASA, SpargeAttn): Enable drop-in speedup of existing models without retraining, suited for deployed systems.
• Theoretical guarantees (BigBird): Provide formal proofs of expressive completeness, important for understanding safety margins.
Selection guidance:
1. New model training where resources permit: NSA or SparseK for maximal inference speedup; BigBird for theoretical assurance.
2. Long-context document understanding: Longformer or FASA for practical simplicity; NSA for extreme scale.
3. Accelerating existing models: FASA for RoPE-based LLMs; SpargeAttn for any archi- tecture.
4. Constrained environments: Sparse Transformer for simplicity and proven efficiency at moderate scale.
D Annex D: Gated Attention Families—Complete Literature Anal- ysis
D.1 Executive Summary: Gating for Memory Control
The emerging field of gated attention mechanisms addresses a fundamental challenge in sequence modeling: how should information be retained, forgotten, and updated as the model processes new tokens? Traditional additive recurrent states accumulate all information without erasure, leading to memory saturation and an inability to adapt to changing context. Softmax attention provides full expressiveness but uses an O(Ld) KV cache during inference, limiting context window size. Gated mechanisms provide a middle ground: selective control over information survival. The unifying principle across gated architectures is that gating controls what information survives. Gates operate at four distinct levels:
1. Recurrent state decay (GLA, HGRN2): Multiplicative gates on the memory matrix St control how much previous state persists into the next step.
2. Write strength in recurrent updates (DeltaNet, Gated DeltaNet): Gates control the magnitude of new information written into memory.
3. Softmax logit biasing (FoX): Gates inject token-level recency bias into attention logit computation.
4. Post-attention sparsity (Gated Softmax): Gates selectively scale SDPA output channels.
This appendix synthesizes seven major gated-attention architectures published 2023–2025, analyzing their mathematical foundations, hardware efficiency characteristics, and empirical trade-offs.
D.2 1. Gated Linear Attention (GLA)
D.2.1 Mathematical Foundation
Gated Linear Attention [44] augments vanilla linear attention (which uses a matrix-valued recurrent state) with data-dependent multiplicative decay. Standard linear attention reformulates the traditional softmax mechanism as a recurrence over hidden states:

S_t = S_{t−1} + v_t k_t^⊤, o_t = S_t q_t

where the state S_t ∈ R^{d_v×d_k} accumulates outer products. This purely additive formulation suffers from "memory overload": information accumulates monotonically and cannot be erased, degrading performance on tasks requiring context selection. GLA introduces a data-dependent gating matrix G_t ∈ [0, 1]^{d_v×d_k}:

S_t = G_t ⊙ S_{t−1} + v_t k_t^⊤, o_t = S_t q_t

The gate G_t is parameterized with an outer-product structure:

G_t = 1 α_t^⊤, α_t = exp( logsigmoid(W_gk x_t) / c )

where W_gk is a low-rank projection (R^d → R^{d_k}) and c ≈ 16 is a temperature normalizer. Since logsigmoid outputs lie in (−∞, 0), division by c contracts them toward 0, so α_t ≈ 1 and the gate is near the identity at initialization; as training drives the gate logits more negative, α_t → 0 and historical information is decayed away.
Algorithm 31 Gated Linear Attention (Recurrent Form)
Require: Input x_t; prior state S_{t−1}
Require: Projections W_q, W_k, W_v (standard); W_gk (gate projection)
Ensure: Output o_t and updated state S_t
1: q_t ← W_q x_t
2: k_t ← W_k x_t
3: v_t ← W_v x_t
4: Compute gate: α_t ← exp(logsigmoid(W_gk x_t)/16)  ▷ Per-key-dim decay in (0, 1)
5: Apply outer-product gating: G_t ← 1 α_t^⊤  ▷ Rank-1 gate matrix
6: Decay previous state: S_t^decay ← G_t ⊙ S_{t−1}  ▷ Element-wise product
7: Add new association: S_t ← S_t^decay + v_t k_t^⊤
8: Retrieve via query: o_t^raw ← S_t q_t
9: Apply output gate: o_t ← GroupNorm(o_t^raw) ⊗ SiLU(W_g x_t)
return o_t, S_t
D.2.2 Hardware-Efficient Training
For parallel training, GLA uses a chunkwise block-parallel algorithm that groups tokens into chunks of size C and computes recurrent transitions in parallel:

O_[t] = Q̃_[t] S_[t]^⊤ + ( (Q_[t] K_[t]^⊤) ⊙ Γ_[t] ) V_[t]

where Q̃_[t] carries the cumulative decay back to the chunk boundary (the "lookback" term attending to the state at the chunk boundary), and Γ_[t] is the causal mask with decay-aware scaling:

Γ_[t],ij = ∏_{l=j+1}^{i} α_l if i ≥ j, and 0 otherwise

This design maximizes matmul operations amenable to tensor-core acceleration, achieving sub-quadratic total complexity O(nLd²/C), where L is the number of attention heads and C is the chunk size.
D.2.3 Strengths and Limitations
| Aspect | Strengths | Limitations |
|---|---|---|
| Computational efficiency | Sub-quadratic training via chunkwise parallelism; constant-memory inference O(d²). | Requires custom CUDA/Triton kernels; not available in all frameworks. |
| Context generalization | Extends 2K-token training to 20K+ tokens with negligible perplexity degradation. | Still underperforms softmax on retrieval-heavy benchmarks (needle-in-haystack). |
| Selectivity | Per-key-dimension data-dependent decay. | Gate structure limits per-key-value selectivity; no direct control over which specific associations to erase. |
D.3 2. DeltaNet: Error-Correcting Linear Attention
D.3.1 Mathematical Foundation
DeltaNet [45] applies a classical delta learning rule to linear attention state updates. Instead of additive accumulation, DeltaNet computes the difference between the predicted value (retrieved from memory) and the target value, then performs an error-correcting update:

S_t = S_{t−1}(I − β_t k_t k_t^⊤) + β_t v_t k_t^⊤

where β_t ∈ (0, 1) is a data-dependent writing-strength scalar. This update can be decomposed as an erase-write operation:

v_t^old = S_{t−1} k_t (current prediction), v_t^new = β_t v_t + (1 − β_t) v_t^old (blend old/new)

so that:

S_t = S_{t−1} − v_t^old k_t^⊤ + v_t^new k_t^⊤

The key insight is that this update rule minimizes an online MSE loss at each timestep:

L_t(S) = ½ ∥S k_t − v_t∥², ∇_S L_t = (S k_t − v_t) k_t^⊤

One step of SGD with learning rate β_t yields the delta-rule update. This connection to test-time training (TTT) provides theoretical grounding: DeltaNet directly optimizes value prediction at inference time.
D.3.2 Efficient Parallel Training
DeltaNet’s transition matrix I − β_t k_t k_t^⊤ is a generalized Householder matrix (a rank-1 update to the identity). For chunkwise training, DeltaNet uses the WY representation, which represents the product of m such rank-1 updates compactly:

∏_{i=1}^{m} (I − β_i k_i k_i^⊤) = I − W Y^⊤

where W, Y ∈ R^{d×m} are constructed recursively. This factorization enables matmul-rich parallelism without materializing the full product, keeping chunkwise training efficient.
Algorithm 32 DeltaNet State Update with WY Representation
Require: Input chunk X_[t]; prior state S_{t−1}
Require: Queries Q_t, keys K_t, values V_t
Ensure: Output O_[t] and updated state S_t
1: Chunkwise phase:
2: for position i = 1 to chunk_size do
3:   Normalize: k̂_i ← K_t[i]/∥K_t[i]∥, q̂_i ← Q_t[i]/∥Q_t[i]∥  ▷ L2 normalization
4:   Compute write strength: β_i ← sigmoid(W_β x_i)
5:   Prediction: v̂^pred ← S_{i−1}^{in-chunk} k̂_i
6:   Blend old and new values: v_i^upd ← β_i v_i + (1 − β_i) v̂^pred
7:   Erase-write: S_i^{in-chunk} ← S_{i−1}^{in-chunk} − v̂^pred k̂_i^⊤ + v_i^upd k̂_i^⊤
8:   Output: o_i ← S_i^{in-chunk} q̂_i
9: end for
10: Inter-chunk phase: use the WY representation of the transition products
return O_[t], S_t
D.3.3 Strengths and Limitations
| Aspect | Strengths | Limitations |
|---|---|---|
| Associative recall | Perfect recall on Multi-Query Associative Recall (MQAR) benchmark; error-correcting semantics naturally fit retrieval patterns. | Without global forgetting, memory still crowds over extreme sequence lengths; requires periodic reset mechanisms for unbounded context. |
| Theory | Grounded in online MSE optimization; clear connection to the test-time training paradigm. | Requires L2-normalized keys for numerical stability; double-width normalization adds overhead. |
| Throughput | WY representation enables efficient chunkwise training. | Slightly slower than Mamba2 per token due to richer transition matrices (rank-1 instead of diagonal). |
D.4 3. Gated DeltaNet: Synthesis of Gating and Error Correction
D.4.1 Mathematical Foundation
Gated DeltaNet [46] synthesizes the strengths of GLA and DeltaNet by combining a decay gate α_t (from GLA) with the writing-strength gate β_t (from DeltaNet):

S_t = α_t S_{t−1}(I − β_t k_t k_t^⊤) + β_t v_t k_t^⊤
The two gates are complementary:
• α_t → 0 rapidly erases historical state: S_t ≈ β_t v_t k_t^⊤ (context reset).
• α_t → 1 recovers pure delta-rule behavior: S_t ≈ S_{t−1}(I − β_t k_t k_t^⊤) + β_t v_t k_t^⊤ (targeted updates).
• β_t → 0 skips writing but still decays: S_t ≈ α_t S_{t−1} (pure forgetting).
• β_t → 1 fully overwrites the k_t slot: S_t ≈ α_t S_{t−1}(I − k_t k_t^⊤) + v_t k_t^⊤ (replacement).
From an online learning perspective, Gated DeltaNet corresponds to:

min_{S_t} ∥S_t − α_t S_{t−1}∥_F² − 2⟨S_t k_t, β_t(v_t − α_t S_{t−1} k_t)⟩

which introduces an adaptive weight decay α_t into an SGD-like update, analogous to decoupled weight decay in deep learning optimization.
Algorithm 33 Gated DeltaNet Parallel Chunkwise Forward
Require: Input chunk X_[i]; prior state S_{i−1}; chunk size C
Require: Projections W_q, W_k, W_v, W_α, W_β
Ensure: Output O_[i] and state S_i
1: In-chunk computation:
2: for j = 1 to C do
3:   q_j ← W_q x_j, k_j ← W_k x_j, v_j ← W_v x_j
4:   α_j ← sigmoid(W_α x_j), β_j ← sigmoid(W_β x_j)
5:   Apply combined gate: S_j ← α_j S_{j−1}(I − β_j k_j k_j^⊤) + β_j v_j k_j^⊤
6:   o_j ← S_j q_j
7: end for
8: Inter-chunk recurrence:
9: S_i ← S_C from in-chunk; propagate across chunks via cumulative ∏ α masks
10: Output gating: O_[i] ← GroupNorm(O^raw) ⊗ SiLU(W_g X_[i])
return O_[i], S_i
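The combined update at the heart of the recurrence can be sketched in one function (an illustrative NumPy step, not the chunkwise kernel), and the limiting cases listed above fall out directly:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """Gated DeltaNet state update: decay gate alpha scales the whole state,
    write gate beta controls the strength of the delta-rule update."""
    d = S.shape[1]
    return alpha * S @ (np.eye(d) - beta * np.outer(k, k)) + beta * np.outer(v, k)
```

Setting alpha = 0 yields the context-reset state β v k^⊤, while beta = 0 yields pure forgetting α S, exactly the two extreme regimes described above.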
D.4.2 Empirical Performance and Trade-offs
Gated DeltaNet has been integrated into Alibaba’s Qwen3-Next and Qwen3.5 production models. Empirically:
| Task | Gated DeltaNet | Mamba2 | DeltaNet | GLA |
|---|---|---|---|---|
| Language Modeling | 1.0× | 1.08× | 1.05× | 1.12× |
| Associative Recall | 1.0× | — (not perfect) | 1.0× | 0.75× |
| Length Extrapolation | 1.0× | 0.98× | 0.96× | 0.94× |
| Throughput (tokens/sec) | 0.85× | 1.0× | 0.82× | 0.88× |
Gated DeltaNet achieves the best balance across diverse tasks at the cost of slightly reduced throughput due to richer transition matrices.
D.5 4. HGRN2: Hierarchical Gating with Outer-Product Expansion
D.5.1 Mathematical Foundation
HGRN2 [28] uses an outer-product-based state expansion with hierarchically lower-bounded forget gates. The state update is:

S_t = diag(g_t) · S_{t−1} + v_t k_t^⊤, o_t = S_t q_t

where g_t ∈ [b_ℓ, 1]^d is a lower-bounded forget gate for layer ℓ. The bounds satisfy:

0 ≤ b_shallow < b_middle < b_deep ≤ 1

i.e., the lower bounds increase (retention becomes more strongly enforced) in deeper layers. The gate itself is computed as:

g_t = b_ℓ + (1 − b_ℓ) · σ(W_g x_t)

where σ is the sigmoid. When W_g outputs strongly negative logits, g_t ≈ b_ℓ (the fastest forgetting that layer permits); when outputs are strongly positive, g_t ≈ 1 (full retention). The hierarchical structure encourages different layers to specialize to different timescales: shallow layers, whose low bound allows rapid forgetting, model local dependencies, while deeper layers, forced toward retention by a high lower bound, capture long-range structure.
Algorithm 34 HGRN2 Forward with Hierarchical Bounds
Require: Input x_t; prior state S_{t−1}; layer index ℓ ∈ [0, L − 1]
Require: Non-decreasing bounds 0 ≤ b_0 < b_1 < ⋯ < b_{L−1} ≤ 1
Ensure: Output o_t and state S_t
1: Layer-specific retain bound: b ← b_ℓ
2: Project: q_t ← W_q x_t, k_t ← W_k x_t, v_t ← W_v x_t
3: Compute forget gate with bound: g_t ← b + (1 − b) · σ(W_g x_t)  ▷ Constrained to [b, 1]
4: Diagonal multiplication (vectorized): S_t ← diag(g_t) · S_{t−1} + v_t k_t^⊤
5: Retrieve: o_t ← S_t q_t
6: Optional normalization: o_t ← RMSNorm(o_t)
return o_t, S_t
D.5.2 Scaling and Empirical Results
At 3B scale on 100B tokens, HGRN2 slightly outperforms Mamba2 and LLaMA-architecture Transformers on language modeling. The hierarchical gating provides multi-scale temporal modeling without explicit layer-wise architectural variants. However, vector-valued diagonal gating (as opposed to element-wise selection) provides less per-item flexibility than delta-rule variants.
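The lower-bounded gate of Algorithm 34 is a one-liner in practice. This NumPy sketch (an illustrative single step; the diagonal gate is applied along the key dimension here, and RMSNorm is omitted) makes the [b, 1] constraint explicit:

```python
import numpy as np

def hgrn2_step(S, x, Wq, Wk, Wv, Wg, b):
    """One HGRN2 step with a layer-specific lower bound b on the forget gate:
    g_t = b + (1 - b) * sigmoid(Wg x) is constrained to [b, 1]."""
    q, k, v = Wq @ x, Wk @ x, Wv @ x
    g = b + (1 - b) / (1 + np.exp(-(Wg @ x)))
    S = S * g[None, :] + np.outer(v, k)     # diagonal gate, then outer-product write
    return S, S @ q
```

Passing b = 1 forces g_t = 1 (full retention), which reduces the step to the additive linear-attention update; a small b lets the gate forget almost completely.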
D.6 5. Forgetting Transformer (FoX): Gating in Softmax Logit Space
D.6.1 Mathematical Foundation
Forgetting Transformer (FoX), proposed by Lin et al. [19], embeds a forget gate directly into the softmax attention logits. Rather than replacing softmax with a linear recurrence, FoX preserves full softmax expressiveness while adding recency control through a data-dependent logit bias. For each token position t, compute a scalar forget gate:

f_t = σ(w_f^⊤ x_t + b_f)
The logit bias at position (i, j), for j ≤ i, is the cumulative log-forget product:

d_ij = Σ_{l=j+1}^{i} log f_l = log ∏_{l=j+1}^{i} f_l
The full attention output becomes:

O = softmax( QK^⊤/√d_k + D ) V

where D_ij = d_ij. Equivalently:

Attention(i, j) = exp(q_i^⊤ k_j/√d_k + d_ij) / Σ_{j′=1}^{i} exp(q_i^⊤ k_{j′}/√d_k + d_{ij′})
This weighting down-weights older tokens multiplicatively: a token from 10 steps ago is scaled by ∏_{l=j+1}^{i} f_l, which is typically much less than 1 if f_t < 1 on average. The mechanism is a data-dependent, learnable generalization of ALiBi (Attention with Linear Biases), where earlier work used a fixed linear bias proportional to −(i − j).
Algorithm 35 FoX: Forgetting Attention with Recency Bias
Require: Queries Q, Keys K, Values V; input X
Require: Forget gate parameters w_f ∈ R^d, b_f ∈ R
Ensure: Attention output O
1: Compute forget gates:
2: for t = 1 to T do
3:   f_t ← σ(w_f^⊤ x_t + b_f)  ▷ Per-token forget probability
4: end for
5: Compute cumulative log-forget biases:
6: for i = 1 to T do
7:   for j = 1 to i do
8:     d_ij ← Σ_{l=j+1}^{i} log f_l  ▷ Cumulative product in log space
9:   end for
10: end for
11: Standard SDPA with bias:
12: Compute scores: S ← QK^⊤/√d_k
13: Add bias: S ← S + D
14: Apply softmax: A ← softmax(S, dim = −1)
15: Weight values: O ← AV
return O
D.6.2 Integration with FlashAttention
The key advantage of FoX is compatibility with FlashAttention and existing optimized implementations. The forget biases are added in the softmax logit computation, which is already a core operation in FlashAttention’s block-wise algorithm. The overhead is:

Overhead ≈ 0.5–2% wall-clock time

since the bias addition is amortized across matmul operations.
D.6.3 Strengths and Limitations
FoX achieves state-of-the-art results on length extrapolation, near-perfect needle-in-haystack retrieval, and superior long-context understanding compared to standard Transformers. However, it remains O(L²d) at training and O(Ld) per step at inference with a KV cache, limiting extreme sequence lengths.
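The bias matrix D needs only a single cumulative sum of log f, since d_ij = cum(i) − cum(j). This NumPy sketch (an illustrative dense single-head version, not the FlashAttention-integrated kernel) computes forgetting attention end to end:

```python
import numpy as np

def fox_attention(Q, K, V, X, wf, bf):
    """Forgetting attention: causal SDPA with the cumulative log-forget bias
    D[i, j] = sum_{l=j+1..i} log f_l added to the logits."""
    T, d = Q.shape
    f = 1.0 / (1.0 + np.exp(-(X @ wf + bf)))        # per-token forget gates
    cum = np.concatenate([[0.0], np.cumsum(np.log(f))])
    D = cum[1:, None] - cum[None, 1:]               # D[i, j] = cum(i) - cum(j)
    S = Q @ K.T / np.sqrt(d) + D
    S[np.triu_indices(T, k=1)] = -np.inf            # causal mask
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V
```

With f_t ≈ 1 (large positive gate logits) D vanishes and the computation reduces to plain causal attention, which is why the mechanism is a strict generalization of softmax attention.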
D.7 6. Gated Attention (Post-SDPA Sigmoid Gating)
D.7.1 Mathematical Foundation
The NeurIPS 2025 Best Paper by the Qwen team proposes applying a sigmoid gate after scaled dot-product attention (SDPA):

Y = SDPA(Q, K, V) = softmax( QK^⊤/√d_k ) V, Y′ = Y ⊙ σ(X W_θ)

where the gate score σ(X W_θ) can have shape R^{n×q×d_k} (element-wise gating per query head) or R^{n×q} (head-wise fixed gating). Across 30+ variants and 15B MoE + 1.7B dense models trained on 3.5T tokens, the team found:
• Headwise gating adds only ∼1.6M parameters to a 15B model (negligible overhead).
• Effective gates are sparse: mean activation ≈0.116 (i.e., 88% are zero or near-zero).
• Gates act as query-dependent filters, suppressing low-value channels while preserving high- value signal.
• Gates demonstrably eliminate the attention sink phenomenon, where the attention pattern becomes dominated by a single early token.
Algorithm 36 Gated Softmax Attention (Post-SDPA) Require: Queries Q, Keys K, Values V; input X Require: Gate projection matrix Wg ∈Rd→dgate where dgate ∈{1, dk} Ensure: Gated output Y′
1: Standard SDPA:
2: Compute attention: A ←softmax(QK⊤/√dk)
3: Weight values: Y ←AV
4: Compute gates:
5: Project input: G ←XWg ▷shape: [n, q, dgate]
6: Apply sigmoid: G ←σ(G) ▷Element-wise gating
7: Apply gating:
8: if dgate = 1 then
9: Broadcast and scale: Y′ ←Y · G ▷Head-wise scaling
10: else ▷ d_gate = d_k
11: Element-wise modulation: Y′ ← Y ⊙ G ▷ Per-dimension gating
12: end if
return Y′
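Algorithm 36 can be sketched compactly in NumPy. The shapes, the head-major gate layout, and the function name are illustrative assumptions for this sketch, not the toolkit’s API:

```python
import numpy as np

def gated_attention(Q, K, V, X, W_g):
    """Post-SDPA sigmoid gating. Q, K, V: [H, L, d_k]; X: [L, d_model];
    W_g: [d_model, H * d_gate] with d_gate in {1, d_k}."""
    H, L, dk = Q.shape
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(dk)
    A = np.exp(S - S.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)                 # standard SDPA softmax
    Y = A @ V                                     # ungated attention output
    G = 1.0 / (1.0 + np.exp(-(X @ W_g)))          # sigmoid gate scores in (0, 1)
    G = G.reshape(L, H, -1).transpose(1, 0, 2)    # [H, L, d_gate]
    return Y * G   # broadcasts over d_k when d_gate == 1 (head-wise gating)
```

Because every gate lies in (0, 1), gating can only attenuate each attention channel, which is how the sparse, query-dependent filtering described above arises.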
D.7.2 Key Findings
• Attention-sink suppression: Gating breaks the feedback loop where softmax concentrates on the first token, enabling more balanced attention.
• Training stability: Models with gating tolerate larger learning rates (+30% higher stable ηmax).
• Minimal overhead: Latency impact is only 1.6% on H100 GPUs; throughput drops at most 2–3%.
• Scaling law improvement: Consistently improves loss across model scales from 800M to 110B parameters.
D.8 Comparative Analysis: Taxonomy of Gated Mechanisms
| Architecture | Publication Year | State Representation | Gate Type | Training Complexity | Inference Complexity | Ref |
|---|---|---|---|---|---|---|
| GLA | 2023 | Matrix (recurrent) | Diagonal decay | O(Ld²) | O(d²) memory | [44] |
| DeltaNet | 2024 | Matrix (recurrent) | Scalar write (β) | O(Ld²) | O(d²) memory | [45] |
| Gated DeltaNet | 2024 | Matrix (recurrent) | Decay + write | O(Ld²) | O(d²) memory | [46] |
| RetNet | 2023 | Matrix (recurrent) | Fixed exponential | O(Ld²) | O(d) per-step | [36] |
| HGRN2 | 2024 | Matrix (recurrent) | Diagonal (bounded) | O(Ld²) | O(d²) memory | [28] |
| FoX | 2025 | Full attention | Logit-space bias | O(L²d) | O(L) KV cache | [19] |
| Gated Softmax | 2025 | Full attention | Post-SDPA sigmoid | O(L²d) | O(L) KV cache | [29] |
| Architecture | Trainable | In-Context Recall | Associative Recall | Length Extrapolation | Key Advantage |
|---|---|---|---|---|---|
| GLA | Yes | Moderate | Weak | Good (20K) | Hardware efficiency; chunkwise parallelism |
| DeltaNet | Yes | Strong | Perfect (MQAR) | Good | Error-correcting semantics; strong retrieval |
| Gated DeltaNet | Yes | Very strong | Perfect | Excellent | Balanced: forgetting + targeted updates |
| RetNet | Yes | Weak | Weak | Good | Simplicity; no KV cache needed |
| HGRN2 | Yes | Moderate | Weak | Moderate | Multi-scale temporal modeling |
| FoX | Yes | Very strong | Very strong (near-perfect) | Excellent | Preserves softmax; minimal overhead |
| Gated Softmax | Yes | Strong | Good | Good | Simplicity; no replacement of SDPA |
| Architecture | Typical Throughput | Memory per Inference | Primary Use Case |
|---|---|---|---|
| GLA | High (chunkwise CUDA) | O(d²) constant | Long-context with memory constraints |
| DeltaNet | Medium-High | O(d²) constant | Retrieval and associative reasoning |
| Gated DeltaNet | Medium | O(d²) constant | Production: Qwen3-Next, balanced performance |
| RetNet | High | O(d) per-step | Extremely fast inference; simple models |
| HGRN2 | Medium-High | O(d²) constant | Multi-scale temporal patterns |
| FoX | Moderate (quadratic) | O(Ld) KV cache | Length extrapolation; long-context generation |
| Gated Softmax | High (minimal overhead) | O(Ld) KV cache | Drop-in improvement for existing models |
D.9 Unified Gating Principle
All seven architectures share a common principle: gating is a mechanism to control infor- mation flow and selective memory retention. The specific design choices diverge along several dimensions:
1. Location of gating: Recurrent state updates (GLA, DeltaNet, Gated DeltaNet, HGRN2) versus softmax logit/output space (FoX, Gated Softmax, RetNet).
2. Granularity of control: Dimension-wise (GLA, HGRN2 diagonal), key-specific (DeltaNet), token-wise (FoX), or head-wise (Gated Softmax).
3. Data dependence: Fully data-dependent (GLA, DeltaNet, Gated DeltaNet, FoX, Gated Softmax) versus fixed schedules (RetNet with exponential decay).
4. Expressiveness trade-off: Linear recurrent efficiency (GLA, DeltaNet, Gated DeltaNet, HGRN2, RetNet) versus full softmax expressiveness (FoX, Gated Softmax).
D.10 Implementation and Practical Guidance
D.10.1 When to Use Each Architecture
1. GLA: Fast inference with moderate-to-long contexts (2K–20K tokens), when custom CUDA kernels are acceptable, and retrieval performance is not critical.
2. DeltaNet: Strong associative recall and in-context learning tasks; good for synthetic tasks (MQAR) and key-value association testing.
3. Gated DeltaNet: Production models requiring the best balance of recall, forgetting, and throughput (proven in Qwen3-Next; ICLR 2025).
4. RetNet: Extremely efficient inference without KV caching; suitable for edge/mobile deployments or toy models.
5. HGRN2: Multi-scale temporal hierarchies; when layer-specific forget bounds are desirable.
6. FoX: Length extrapolation and long-context understanding; when quadratic training is acceptable and replacing softmax is not desired.
7. Gated Softmax: Quick improvement to existing softmax models with minimal code changes; best for practitioners wanting immediate gains without major refactoring.
D.10.2 Hardware Considerations
| Hardware Target | Recommended Architectures | Notes |
|---|---|---|
| GPU (H100/A100) | Gated Softmax, FoX, Gated DeltaNet | Mature softmax/linear implementations; GLA/DeltaNet require custom kernels. |
| TPU (high memory) | GLA, DeltaNet, Gated DeltaNet, HGRN2 | Hardware supports efficient matmuls; linear recurrent methods shine. |
| Mobile/Edge | RetNet, lightweight Gated Softmax | Constant-memory inference critical. |
E Annex E: Conceptual Introduction—Transformers and Attention for Beginners
This appendix provides an accessible introduction to the fundamental concepts of transformers and the attention mechanism, aimed at readers without deep prior knowledge in deep learning. Transformers are the engine behind modern artificial intelligence systems like ChatGPT, but their operation can be understood through real-world analogies.
E.1 What is a Transformer?
A transformer is a neural architecture that processes text or data sequences by interpreting the meaning of each word while considering all other words in context. Imagine you read a sentence:
“The bank was full of people waiting. Mary went to the bank.”
What does “bank” mean in each case? On its own, “Mary went to the bank” is ambiguous: it could refer to a financial institution or a riverbank. A transformer resolves this automatically by looking at all the surrounding words. Here, the first sentence mentions people waiting in line, so the model infers that “bank” most likely refers to the financial institution where Mary goes to conduct a transaction. This ability to understand the complete meaning of a word by considering the entire context is called contextualization, and it is the heart of how modern transformers work.
E.2 The Attention Mechanism: An Analogy
The attention mechanism is the “magic trick” that allows transformers to understand context. Think of it as participating in an important conversation in a noisy room:
1. Lots of noise: Someone is speaking, and what they are saying is very important to you.
2. You ignore the rest: Your brain automatically focuses attention on that person, ignoring other sounds.
3. You understand better: By concentrating, you catch every word clearly.
The neural attention mechanism works the same way: each word “asks” how important every other word is to understanding its meaning, then “focuses” its attention on the most relevant ones.
E.3 Practical Example Step-by-Step
Let’s see how a transformer understands the word “eats” in two different contexts:
E.3.1 Context 1: “The cat eats fish”
The transformer, when processing “eats”, internally asks:
• How relevant is “The”? Little (it’s just an article). Low attention.
• How relevant is “cat”? Very relevant (it’s the subject, who eats). High attention.
• How relevant is “fish”? Very relevant (it’s what is eaten). High attention.
Thus, the mental representation of “eats” focuses primarily on “cat” and “fish”.
E.3.2 Context 2: “The restaurant eats into profit margins”
Here the transformer asks in a different context:
• How relevant is “restaurant”? Very relevant (it’s the subject). High attention.
• How relevant is “profit”? Very relevant (it’s affected). High attention.
• How relevant is “margins”? Very relevant (it’s what’s being consumed). High attention.
The word “eats” gets a completely different representation because it “attends” to different words in different contexts.
E.4 How Attention Works Mathematically (Simple Version)
Behind the attention shift is mathematics. Here is the simplified version without too much technical detail:
E.4.1 Step 1: Queries, Keys, and Values
Each word in the sentence is transformed into three versions:
• Query: “What information do I need to know?”
• Key: “What information do I have?”
• Value: “Here is my important information.”
Imagine in a library, each visitor is a “query”, each book is a “key”, and the content is the “value”. The librarian (attention) matches queries with the most relevant keys to access the correct value.
E.4.2 Step 2: Compatibility
The system examines how compatible each query is with each key. Queries similar to keys get a high compatibility score. This is calculated by multiplying the query by the key (mathematically, the dot product).
E.4.3 Step 3: Focus
The compatibility scores are converted into “focus weights”. High compatibilities mean “focus attention here”, and low ones mean “ignore this”. Mathematically, a function called softmax converts scores into percentages (like: 40% attention here, 35% here, 25% there).
E.4.4 Step 4: Combine
Finally, the system combines all values, weighted by the focus weights. If a word has 80% attention on the subject, 80% of the subject’s information gets blended into that word’s representation.
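The four steps above can be traced in a few lines of NumPy. All vectors and weight matrices below are random stand-ins for real learned values, so only the mechanics (not the particular attention numbers) are meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))        # 3 words ("The", "cat", "eats"), 4 dims each

Wq, Wk, Wv = rng.standard_normal((3, 4, 4))
Q, K, V = X @ Wq, X @ Wk, X @ Wv       # Step 1: query, key, value per word

scores = Q @ K.T / np.sqrt(4)          # Step 2: compatibility via dot products
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # Step 3: softmax -> focus percentages
output = weights @ V                   # Step 4: blend values by focus weight

print(np.round(weights[2], 2))         # how much "eats" focuses on each word
```

Each row of `weights` sums to 1, exactly like the “40% here, 35% here, 25% there” percentages described above.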
E.5 Multiple Attention Heads: Multiple Perspectives
A key feature is that transformers do not use a single attention but multiple attention heads in parallel. It’s as if you had 8 people analyzing the sentence simultaneously, each paying attention to different aspects:
• Person 1: “I focus on subject-verb relationships.”
• Person 2: “I search for the direct object.”
• Person 3: “I track information about verb tense.”
• Person 4: “I search for modifiers and adjectives.”
Each “head” learns to focus on different language patterns. Together, they capture much richer understanding than a single head.
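In code, the “multiple people” are implemented by splitting each word’s vector into equal slices, one per head. The shapes below (3 words, model dimension 8, 2 heads) are hypothetical, chosen only to keep the sketch small:

```python
import numpy as np

def split_heads(X, n_heads):
    """Reshape [L, d] activations into [n_heads, L, d // n_heads]."""
    L, d = X.shape
    return X.reshape(L, n_heads, d // n_heads).transpose(1, 0, 2)

X = np.arange(24, dtype=float).reshape(3, 8)   # 3 words, 8-dim model
heads = split_heads(X, 2)                      # 2 heads, each sees a 4-dim slice
# Each head runs its own attention over its slice; afterwards the
# per-head outputs are merged back into one 8-dim vector per word.
merged = heads.transpose(1, 0, 2).reshape(3, 8)
assert np.array_equal(merged, X)               # split and merge are inverses
```

Because each head only sees its own slice, the heads are free to specialize on different patterns, exactly as in the analogy above.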
E.6 Stacking Layers
Transformers process information in multiple layers (usually 12 to 48 for practical models). Imagine editing an essay:
1. First pass: You fix spelling and basic grammar.
2. Second pass: You improve clarity and sentence structure.
3. Third pass: You ensure consistency and narrative flow.
Transformers work the same way: each layer refines text understanding. The first layer captures basic features (simple words, genders), while later layers understand complex concepts (word relationships, abstract meanings).
E.7 Why Does It Matter?
The attention mechanism was revolutionary because:
1. Parallelism: It finds context for any word with any other word, without processing them sequentially (unlike earlier systems). This makes it very fast.
2. Flexibility: It learns what patterns to look for automatically from data, without you needing to program rules manually.
3. Scalability: It works from small texts to contexts with millions of words.
4. General Capabilities: The same mechanism works for translation, summarization, question-answering, text generation, computer vision, and more.
E.8 Practical Challenges
Despite the extraordinary success of transformers, they present challenges:
E.8.1 Computational Complexity
If a sentence has 1,000 words, the standard attention mechanism must compare each word with all 1,000 words. That’s 1,000 × 1,000 = 1,000,000 comparisons. For long documents with millions of words, this becomes prohibitively expensive in time and memory.
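The arithmetic is easy to check directly. The loop below prints, for a few hypothetical document lengths, how many score entries a full attention matrix needs and roughly how much float32 memory that costs:

```python
# The quadratic blow-up of standard attention, in plain numbers.
for n in (1_000, 10_000, 100_000):
    comparisons = n * n                 # every word scored against every word
    mem_gb = comparisons * 4 / 1e9      # one 4-byte float per score entry
    print(f"{n:>7} words -> {comparisons:>15,} scores, ~{mem_gb:,.1f} GB")
```

Multiplying the length by 10 multiplies the cost by 100, which is why a 100,000-word document already needs tens of gigabytes just for one attention matrix.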
E.8.2 Memory Requirements
Especially during inference (when using trained models), storing the entire attention matrix can consume enormous amounts of RAM or GPU memory, limiting the sequence lengths you can process.
E.8.3 Device Efficiency
Transformers were designed for powerful TPUs and GPUs. Running them on phones or embed- ded devices is challenging.
E.9 Modern Solutions
To solve these challenges, research has proposed many variants:
• Sparse Attention: Instead of comparing each word with ALL others, it compares only with a strategic subset (nearby neighbors, periodic patterns). This reduces complexity from O(n²) to O(n log n) or O(n).
• Recurrent Models: Like Mamba, they incorporate aspects of old recurrent neural networks but with better modern efficiency.
• Gated Attention: Mechanisms that selectively learn what information to pass forward, reducing storage needs.
• Quantization: Use smaller numbers (integers instead of decimals) to reduce memory without losing too much precision.
These innovations are what this toolkit (Frankenstein) allows you to experiment with easily.
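As a concrete illustration of the quantization idea, the sketch below rounds a handful of made-up float weights to 8-bit integers with a single shared scale (absmax quantization; this is a generic example, not the toolkit’s ternary packing scheme):

```python
import numpy as np

# Five arbitrary float32 "weights" standing in for a real weight tensor.
w = np.array([0.42, -1.37, 0.08, 2.91, -0.55], dtype=np.float32)

scale = np.abs(w).max() / 127.0                  # map largest |weight| to int8 range
w_int8 = np.round(w / scale).astype(np.int8)     # 1 byte per weight instead of 4
w_restored = w_int8.astype(np.float32) * scale   # dequantize before use

print(w_int8)                          # compact integer storage
print(np.abs(w_restored - w).max())    # worst-case rounding error (small)
```

Storage drops 4x (one byte per weight plus one scale), while the rounding error stays below half a quantization step, which is why quantized models usually lose very little precision.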
E.10 Key Takeaways
• A transformer is a neural network that understands language context very well.
• The attention mechanism allows each word to “focus” on which other words are relevant to understanding its meaning.
• Attention works by computing compatibility between each word (query) and all others (keys), then combining information based on that compatibility (values).
• Multiple heads of attention explore different patterns in parallel.
• Multiple layers progressively refine understanding.
• The main challenge is that standard attention has quadratic complexity, which is expensive for long texts.
• Many modern solutions (sparse attention, recurrent models, quantization) address these challenges, and this toolkit lets you explore all of them.