# Frankestein Transformer: Unified Encoder-Decoder Library, CLI, and Research-Grounded Design Notes

## Abstract

Frankestein Transformer is a unified, configuration-driven toolkit for systematic experimentation with modern transformer architectures, spanning seventeen sequence mixer variants and twenty-two optimizer families. The system supports both encoder-style masked language modeling (MLM) and decoder-style autoregressive (AR) next-token prediction through flexible model class and mode configuration, with specialized fine-tuning workflows for both architectures. The research contributions are threefold: (i) a strict schema-based configuration contract that enables reproducible experimentation across diverse attention mechanisms, including standard softmax attention, sigmoid attention, retentive networks, selective state-space models, continuous-depth transformers, adaptive depth routing (Mixture-of-Depths) [57], conditional memory augmentation (Engram) [8], sparse attention patterns, and gated mechanisms; (ii) a comprehensive optimizer routing framework supporting variance-reduction methods (MARS, Adan, AdEMAMix), memory-efficient variants (Adafactor, GaLore, Lion), schedule-free approaches, second-order preconditioners (Shampoo, SOAP, Sophia), and low-rank APOLLO-family optimizers (Apollo, Apollo-Mini, Q-Apollo) [55]; and (iii) end-to-end workflows spanning quantized deployment via ternary weight packing and sentence-embedding training inspired by SBERT, backed by an expanded quality-assurance stack with broad unit test coverage, YAML example validation, and continuous integration execution. The toolkit implements a web-based configuration interface that provides schema-driven form rendering with inline documentation and real-time validation. This technical reference document includes architectural diagrams, execution-flow visualizations, decision tables, and comprehensive appendices synthesizing literature on transformer architectures, sparse attention mechanisms, gated attention variants, and optimization algorithms. The system enables rapid iteration while maintaining reproducible experimental conditions through its schema-first design philosophy.

---

## Full Text

**Erick F. Merino M.** (erickfmm@gmail.com)
This work is not affiliated with any institution.

February 2026


### Contents

- 1 Introduction
  - 1.1 Motivation and Problem Statement
  - 1.2 Contributions
  - 1.3 Reading Guide
  - 1.4 Web-Based Configuration Builder
- 2 Related Work
  - 2.1 Sequence Mixer Architectures
    - 2.1.1 Dense Attention Baselines
    - 2.1.2 Recurrent and Retentive Architectures
    - 2.1.3 Sparse Attention Mechanisms
    - 2.1.4 Gated Attention Mechanisms
  - 2.2 Optimization Algorithms
- 3 System Design and Architecture
  - 3.1 Configuration-Centric Architecture
  - 3.2 Schema Scope and Validation Rules
  - 3.3 Complete Model Feature Inventory
  - 3.4 Training Task Types
  - 3.5 Complete Training Feature Inventory
  - 3.6 Optimizer Prefix Contract (Full)
  - 3.7 Training Safety and Runtime Semantics
  - 3.8 Normalization Variants: RMSNorm, Dynamic Tanh, and Dynamic Erf
  - 3.9 RMSNorm
  - 3.10 Dynamic Tanh (DyT)
  - 3.11 Dynamic Erf (Derf)
  - 3.12 Schema Implications
- 4 Architecture Taxonomy and Implementation
  - 4.1 Attention and Sequence-Mixer Families
  - 4.2 Standard Attention
  - 4.3 Sigmoid Attention
  - 4.4 Retentive Formulation
  - 4.5 Selective SSM (Mamba)
  - 4.6 ODE-style Continuous Updates
  - 4.7 Test-time Memory (Titans)
  - 4.8 Sparse and Gated Extensions in the Current Codebase
  - 4.9 Implemented Sparse Attention Blocks (Detailed)
  - 4.10 Implemented Gated Attention Blocks (Detailed)
- 5 Optimizer Families and Training Dynamics
- 6 Quantization and Deployment
  - 6.1 Ternary Quantization
  - 6.2 Activation Quantization
  - 6.3 Size Estimates
- 7 SBERT Downstream Tasks
- 8 Summary Tables
  - 8.1 Attention and Sequence-Mixer Summary
  - 8.2 Optimizer Summary
- 9 Discussion
  - 9.1 Schema-Driven Design Trade-offs
  - 9.2 Architectural Coverage and Gaps
  - 9.3 Optimizer Landscape Fragmentation
  - 9.4 Deployment and Production Considerations
  - 9.5 Integration and Extensibility Challenges
- 10 Conclusion
  - 10.1 Limitations and Future Directions
- Bibliography
- A Annex A: Optimizer Families
  - A.1 Memory and Complexity Comparison
  - A.2 The Evolution of Optimization in Neural Networks
  - A.3 Standard Baseline and Adaptive Optimizers
  - A.4 Advanced Momentum and Variance Reduction (2024–2025)
  - A.5 Large-Batch, Memory-Efficient, and Parameter-Free Optimizers
  - A.6 Second-Order, Geometric, and Orthogonality Optimizers
- B Annex B: Dense, Recurrent, and Memory-Augmented Transformers
  - B.1 Dense Attention Baselines: Standard and Sigmoid
    - B.1.1 Standard Softmax Attention
    - B.1.2 Sigmoid Attention
  - B.2 Recurrent and Retentive Architectures
    - B.2.1 Retentive Networks (RetNet)
    - B.2.2 Mamba: Selective State-Space Models
  - B.3 Continuous-Depth Transformers: ODE Integration
    - B.3.1 ODE Transformer
  - B.4 Test-Time Memory: Titans
  - B.5 Architectural Comparison and Synthesis
- C Annex C: Comprehensive Sparse Attention Mechanisms
  - C.1 Executive Summary
  - C.2 Sparse Transformer: Factorized Strided and Fixed Patterns
    - C.2.1 Mathematical Formulation
    - C.2.2 Algorithmic Pseudocode
    - C.2.3 Key Characteristics
  - C.3 Longformer: Sliding Window, Dilation, and Global Tokens
    - C.3.1 Mathematical Formulation
    - C.3.2 Algorithmic Pseudocode
    - C.3.3 Key Characteristics
  - C.4 BigBird: Random, Local, and Global Sparse Graph
    - C.4.1 Mathematical Formulation
    - C.4.2 Theoretical Guarantee
    - C.4.3 Algorithmic Pseudocode
    - C.4.4 Key Characteristics
  - C.5 SparseK Attention: Differentiable Top-k Selection
    - C.5.1 Mathematical Formulation
    - C.5.2 Algorithmic Pseudocode
    - C.5.3 Key Characteristics
  - C.6 NSA (Native Sparse Attention): Hardware-Aligned Hierarchical Branches
    - C.6.1 Mathematical Formulation
    - C.6.2 Algorithmic Pseudocode
    - C.6.3 Key Characteristics
  - C.7 FASA: Frequency-Aware Sparse Attention
    - C.7.1 Mathematical Formulation
    - C.7.2 Algorithmic Pseudocode
    - C.7.3 Key Characteristics
  - C.8 SpargeAttn: Two-Stage Block-Level Filtering
    - C.8.1 Mathematical Formulation
    - C.8.2 Algorithmic Pseudocode
    - C.8.3 Key Characteristics
  - C.9 Comparative Summary
  - C.10 Design Space and Selection Criteria
- D Annex D: Gated Attention Families—Complete Literature Analysis
  - D.1 Executive Summary: Gating for Memory Control
  - D.2 1. Gated Linear Attention (GLA)
    - D.2.1 Mathematical Foundation
    - D.2.2 Hardware-Efficient Training
    - D.2.3 Strengths and Limitations
  - D.3 2. DeltaNet: Error-Correcting Linear Attention
    - D.3.1 Mathematical Foundation
    - D.3.2 Efficient Parallel Training
    - D.3.3 Strengths and Limitations
  - D.4 3. Gated DeltaNet: Synthesis of Gating and Error Correction
    - D.4.1 Mathematical Foundation
    - D.4.2 Empirical Performance and Trade-offs
  - D.5 4. HGRN2: Hierarchical Gating with Outer-Product Expansion
    - D.5.1 Mathematical Foundation
    - D.5.2 Scaling and Empirical Results
  - D.6 5. Forgetting Transformer (FoX): Gating in Softmax Logit Space
    - D.6.1 Mathematical Foundation
    - D.6.2 Integration with FlashAttention
    - D.6.3 Strengths and Limitations
  - D.7 6. Gated Attention (Post-SDPA Sigmoid Gating)
    - D.7.1 Mathematical Foundation
    - D.7.2 Key Findings
  - D.8 Comparative Analysis: Taxonomy of Gated Mechanisms
  - D.9 Unified Gating Principle
  - D.10 Implementation and Practical Guidance
    - D.10.1 When to Use Each Architecture
    - D.10.2 Hardware Considerations
- E Annex E: Conceptual Introduction—Transformers and Attention for Beginners
  - E.1 What is a Transformer?
  - E.2 The Attention Mechanism: An Analogy
  - E.3 Practical Example Step-by-Step
    - E.3.1 Context 1: “The cat eats fish”
    - E.3.2 Context 2: “The restaurant eats into profit margins”
  - E.4 How Attention Works Mathematically (Simple Version)
    - E.4.1 Step 1: Queries, Keys, and Values
    - E.4.2 Step 2: Compatibility
    - E.4.3 Step 3: Focus
    - E.4.4 Step 4: Combine
  - E.5 Multiple Attention Heads: Multiple Perspectives
  - E.6 Stacking Layers
  - E.7 Why Does It Matter?
  - E.8 Practical Challenges
    - E.8.1 Computational Complexity
    - E.8.2 Memory Requirements
    - E.8.3 Device Efficiency
  - E.9 Modern Solutions
  - E.10 Key Takeaways

### 1 Introduction

#### 1.1 Motivation and Problem Statement

Transformer architectures have fundamentally transformed the landscape of sequence modeling across natural language processing [38], computer vision, and computational biology. The success of models such as BERT [11], GPT, and their variants has established the transformer as the de facto standard for representation learning in deep learning. However, the rapid proliferation of architectural innovations presents significant practical challenges for researchers and practitioners.

Recent years have witnessed an explosion of alternative attention and sequence-mixing mechanisms, each addressing specific limitations of standard softmax attention: quadratic computational complexity [9], memory-efficient inference requirements [36, 12], content-based selective state management [2], and hardware-aware optimizations [30]. Simultaneously, the optimization literature has diversified beyond classical AdamW, introducing variance-reduction techniques [42, 26, 49], memory-efficient variants [33, 54], schedule-free approaches [10], and second-order preconditioning methods [14, 39].

The practical consequence is a fragmented research ecosystem in which experimental comparison across architectures and optimizers requires significant engineering effort. Researchers must implement and debug multiple variants from scratch, ensure consistent training pipelines, and manage complex hyperparameter spaces. This fragmentation hampers reproducibility, slows scientific progress, and raises the barrier to entry for new researchers.

This work addresses these challenges through a unified, configuration-driven experimentation toolkit, available at https://github.com/erickfmm/frankestein-transformer, that provides:

1. **Schema-First Design**: A strict, validated configuration contract that enforces reproducibility while supporting seventeen distinct sequence mixer architectures and twenty-two optimizer families.

2. **Architecture-Agnostic Training**: Common training infrastructure supporting dense attention baselines (standard, sigmoid), recurrent alternatives (RetNet, Mamba, ODE-style blocks), adaptive depth routing (Mixture-of-Depths) [57], conditional memory blocks (Engram) [8], sparse attention patterns (Sparse Transformer, Longformer, BigBird, SparseK, NSA, SpargeAttn, FASA), and gated mechanisms (GLA, DeltaNet, Gated DeltaNet, HGRN2, FoX, Gated Softmax).

3. **Optimizer Routing Framework**: Prefixed hyperparameter groups enabling fine-grained control over embeddings, normalization layers, recurrent blocks, attention blocks, and other parameter subsets across diverse optimizers, including the APOLLO family (Apollo, Apollo-Mini, Q-Apollo) [55].

4. **End-to-End Workflows**: Integrated deployment via quantization (ternary weight packing, INT8 activations) and sentence-embedding capabilities inspired by SBERT [31].

5. **Interactive Configuration**: Web-based interface providing schema-driven form rendering, real-time validation, and CLI command generation.

6. **Reliability Tooling**: Comprehensive automated testing with unit-test suites, YAML preset validation, and CI automation across supported Python versions.

The project command surface is `frankestein-transformer`, with subcommands for encoder and decoder workflows: `train`, `finetune`, `deploy`, `quantize`, `infer`, `sbert-train`, `sbert-infer`, and `web-server`.

The `web-server` command launches a Streamlit-based configuration builder that provides:

- Schema-driven form fields with parameter titles and detailed descriptions
- Real-time tooltips and help text for each configuration option
- Live YAML preview and download functionality
- Generated CLI commands for training, deployment, inference, and SBERT workflows

This interactive interface serves as an alternative to manual YAML editing, improving usability for users exploring available configuration options and understanding their impact on model behavior and training dynamics.

#### 1.2 Contributions

This work makes the following primary contributions:

1. **Unified Configuration Schema**: A YAML-based schema contract with strict validation that supports seventeen distinct sequence mixer architectures across four categories (dense baselines, recurrent/retentive blocks, sparse attention patterns, and gated mechanisms), alongside adaptive depth and conditional memory controls, while enforcing reproducibility through `additionalProperties: false` constraints.

2. **Comprehensive Architecture Support**: Implementation of modern transformer variants including standard softmax attention [38], sigmoid attention [30], RetNet [36], Mamba [12], ODE-style continuous transformers [52], Titans memory-augmented attention [2], Engram conditional memory layers [8], Mixture-of-Depths token routing [57], sparse attention mechanisms [9, 3, 50, 23, 48, 53, 41], and gated attention architectures [44, 45, 46, 28, 19, 29]. The system supports both encoder-mode training with bidirectional attention for masked language modeling (MLM) and decoder-mode training with causal masking for autoregressive next-token prediction.

3. **Optimizer Routing Framework**: Prefixed hyperparameter system enabling per-parameter-group control across twenty-two optimizers, including variance-reduction methods (MARS, Adan, AdEMAMix, Cautious AdamW), memory-efficient variants (Adafactor, GaLore, Lion), schedule-free approaches (Schedule-Free AdamW), curvature-aware methods (Sophia), second-order preconditioners (Shampoo, SOAP), orthogonality-oriented optimizers (Muon, Turbo-Muon), large-batch scaling (LAMB), and the APOLLO family (Apollo, Apollo-Mini, Q-Apollo) [55].

4. **Quantization and Deployment**: Integrated deployment pipeline supporting ternary weight packing and INT8 activation quantization, with size estimates following a 1.58-bit storage approximation.

5. **Sentence-Embedding Workflows**: SBERT-inspired training and inference pipelines supporting similarity scoring, retrieval, clustering, and persistent embedding export.

6. **Interactive Configuration Interface**: Streamlit-based web server providing schema-driven form generation, real-time validation, inline documentation, and CLI command generation.

7. **Automated Validation Pipeline**: Continuous integration and expanded unit testing that verify model training paths, optimizer/schema integration, and YAML example compatibility.

#### 1.3 Reading Guide

This document is organized as a technical reference addressing five operational concerns:

1. **Prerequisites**: Appendix E provides an accessible introduction to transformers and attention mechanisms for readers new to the field.

2. **Configuration Contract**: Section 3.1 describes the YAML schema that enforces valid experiments, and Section 3.2 explains validation rules.

3. **Architecture Selection**: Section 3.8 covers normalization options; Section 4.1 provides a comprehensive comparison of sequence mixer families; Appendices B, C, and D synthesize supporting literature.

4. **Optimization Dynamics**: Section 5 details optimizer routing and training dynamics; Appendix A provides a comprehensive optimizer family analysis.

5. **Deployment and Inference**: Sections 6 and 7 describe quantized deployment and sentence-embedding workflows.

#### 1.4 Web-Based Configuration Builder

In addition to direct YAML editing, this project provides a Streamlit-based web interface (accessed via the `web-server` command) that improves configuration accessibility and discoverability. The interface presents schema fields with:

- **Schema-driven form rendering** — All fields are dynamically generated from the authoritative schema, ensuring consistency and validation.
- **Inline parameter documentation** — Each form field displays a title from the schema as its label, with the description shown as a help tooltip on hover.
- **Real-time configuration preview** — Users see live YAML output as they modify form fields, enabling immediate validation feedback.
- **CLI command generation** — The interface generates complete CLI commands for training, deployment, inference, and SBERT workflows based on the current configuration.
- **Accessibility improvements** — Tooltips and structured forms make configuration options easier to understand, especially for users new to the project or exploring novel architectures.

This web-based approach addresses common usability barriers in configuration-driven experimentation:

- Reduces the need to memorize YAML structure and field names
- Prevents typos through schema validation
- Provides educational context through inline documentation
- Enables rapid experimentation with guided parameter tuning
- Serves as both a configuration tool and a learning resource

The web interface implementation uses Streamlit's form widgets (a minimal rendering sketch appears at the end of this subsection):

- `st.checkbox()` for binary toggles with help text
- `st.number_input()` for numeric fields with step size and format
- `st.selectbox()` for enum choices with options display
- `st.multiselect()` for array selections from defined options
- `st.text_input()` for string fields
- `st.info()` and `st.caption()` for supplementary information

Schema metadata (`title` and `description` fields) is extracted and rendered systematically across all form sections, including:

- Model architecture parameters (hidden size, layers, attention heads, etc.)
- Training runtime settings (batch size, accumulation, scheduler)
- Optimizer configuration with per-parameter-group hyperparameters
- Deployment and quantization options
- SBERT-specific training and inference parameters
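As a minimal sketch of this schema-driven rendering pattern, the fragment below maps a field's `title` to the widget label and its `description` to the hover tooltip. The `field` dictionary is an illustrative fragment, not an actual entry from `configs/schema.yaml`:

```python
# Sketch: render one numeric schema field as a Streamlit widget.
# The schema title becomes the label; the description becomes the tooltip.
import streamlit as st

field = {  # hypothetical schema fragment for illustration
    "title": "Hidden size",
    "description": "Model hidden dimension.",
    "type": "integer",
    "minimum": 1,
    "default": 256,
}

value = st.number_input(
    label=field["title"],
    min_value=field.get("minimum", 0),
    value=field.get("default", 0),
    step=1,
    help=field["description"],  # inline documentation shown on hover
)
```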

### 2 Related Work

This work sits at the intersection of three active research areas: alternative attention architectures, sparse attention mechanisms, and advanced optimization algorithms. This section provides a concise survey; detailed mathematical formulations and algorithmic descriptions are deferred to the appendices.

#### 2.1 Sequence Mixer Architectures

##### 2.1.1 Dense Attention Baselines

Standard softmax attention [38] achieves full global context through $n^2$ pairwise interactions but incurs prohibitive memory and compute costs for long sequences. Sigmoid attention [30] replaces row-wise softmax with element-wise sigmoid, yielding faster convergence and up to 17% kernel speedup, though requiring hybrid-norm stabilization at scale. Both serve as dense baselines in this toolkit.

##### 2.1.2 Recurrent and Retentive Architectures

RetNet [36] resolves the “impossible triangle” of parallel training, O(1) inference, and strong performance via a dual attention–retention formulation. Mamba [12] introduces input-dependent selectivity into state-space models, achieving linear complexity with hardware-aware scan algorithms. ODE-style transformers [52] treat depth as numerical integration over a continuous dynamical system. Titans [2] augments inference with test-time neural memorization for extremely long contexts. Detailed formulations are provided in Appendix B.

##### 2.1.3 Sparse Attention Mechanisms

Sparse attention methods reduce the quadratic bottleneck by restricting token interactions: Sparse Transformer [9] uses factorized strided and fixed patterns ($O(n\sqrt{n})$); Longformer [3] employs sliding windows with global tokens ($O(n \cdot w)$); BigBird [50] combines local, random, and global paths; SparseK [23] applies differentiable top-k selection; NSA [48] uses a three-branch hierarchical design; SpargeAttn [53] performs two-stage block-level pruning; and FASA [41] leverages RoPE frequency features for token selection. Complete descriptions are in Appendix C.

##### 2.1.4 Gated Attention Mechanisms

Gated architectures control information flow through learnable gates: GLA [44] adds data-dependent diagonal gating to linear attention; DeltaNet [45] applies the delta learning rule to recurrent state updates; Gated DeltaNet [46] synthesizes both mechanisms; HGRN2 [28] introduces hierarchical forget gates with outer-product state expansion; FoX [19] embeds forget gates directly into softmax attention; and Gated Softmax [29] applies post-SDPA channel gating. Full mathematical derivations appear in Appendix D.

#### 2.2 Optimization Algorithms

Optimization of highly parameterized transformer architectures requires navigating non-convex loss landscapes with heterogeneous Hessian spectra. The literature has diversified into several families beyond the classical AdamW baseline [22]:

- **Variance reduction and momentum**: RAdam [21], Adan [42], ADOPT [37], AdEMAMix [26], MARS [49], and Cautious optimizers [18] address early-step instability, convergence guarantees, and dual-EMA history mixing.
- **Memory-efficient**: Adafactor [33], GaLore [54], Lion [7], and APOLLO [55] reduce optimizer-state memory through factorization, low-rank projection, or sign-based updates.
- **Schedule-free and parameter-free**: Schedule-Free AdamW [10] and Prodigy [24] absorb scheduler or learning-rate tuning into the optimizer dynamics.
- **Second-order and curvature-aware**: Shampoo [14], SOAP [39], and Sophia [20] incorporate approximate second-order information.
- **Geometry-oriented**: Muon [34] and Turbo-Muon [4] reshape update geometry through orthogonalization.

Detailed algorithmic descriptions, pseudocode, and a comprehensive complexity comparison are provided in Appendix A.

### 3 System Design and Architecture

#### 3.1 Configuration-Centric Architecture

The core design choice in this repository is that experimentation is schema-first. Instead of exposing a large number of loosely checked flags, the project forces model topology, optimizer family, training limits, and telemetry options through a single validated configuration document. This reduces ambiguity when reproducing results and makes it possible to compare many architectures under a consistent operational interface.

The authoritative contract is `configs/schema.yaml`. It enforces three top-level objects:

- `model_class`
- `model`
- `training`
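A minimal configuration sketch under this contract follows. Field names are drawn from the inventories in Sections 3.3 and 3.5; the values (and the exact nesting of model fields under `model`) are illustrative assumptions rather than recommended defaults, and `configs/schema.yaml` remains the authoritative reference:

```yaml
# Hypothetical minimal configuration; values are illustrative only.
model_class: frankenstein
model:
  mode: encoder
  vocab_size: 30522
  hidden_size: 256
  num_layers: 4
  num_loops: 1
  num_heads: 4
  dropout: 0.1
  norm_type: layer_norm
  layer_pattern: [standard_attn, retnet, mamba, gla_attn]
training:
  task: mlm
  mlm_probability: 0.15
  batch_size: 32
  gradient_accumulation_steps: 4
  grad_clip_max_norm: 1.0
  scheduler_type: cosine
  scheduler_total_steps: 10000
  scheduler_warmup_ratio: 0.05
  optimizer:
    optimizer_class: adamw
    parameters:
      adamw_lr_embeddings: 1.0e-4   # per-group LR via the prefix contract
      adamw_lr_other: 3.0e-4
      adamw_wd_other: 0.01
```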

**Model Class Selection.** The `model_class` field determines the architectural variant instantiated by the training pipeline. Three options are supported:

- `frankenstein`: Mixed-architecture encoder models supporting diverse attention mechanisms (standard, sigmoid, retentive, state-space, sparse, and gated mixers) with MoE (Mixture of Experts) and advanced features. Optimized for bidirectional encoder-style training with masked language modeling (MLM) objectives.
- `mini`: Simplified encoder variant designed for smaller-scale training scenarios. Provides reduced parameter overhead and faster iteration for experimentation and prototyping.
- `frankesteindecoder`: Autoregressive causal decoder for LLM-style next-token generation. Enables causal attention masking for sequential text generation tasks. When this class is selected, the runtime enforces `mode='decoder'`.

**Training Mode Selection.** The `model.mode` field controls attention masking behavior across the model:

- `encoder`: Uses bidirectional attention where all tokens attend to all other tokens in the sequence. Suitable for masked language modeling (MLM) pre-training tasks where the model learns to predict randomly masked tokens based on full context.
- `decoder`: Uses causal masking where each token can only attend to previous tokens in the sequence. Required for autoregressive (AR) next-token prediction tasks such as language modeling and text generation. When `model_class='frankesteindecoder'`, the system automatically forces `mode='decoder'` at runtime.

This dual architecture support enables the system to handle both encoder-style pre-training (MLM on bidirectional contexts) and decoder-style generation (causal autoregressive prediction) through a unified configuration interface.

The `model.layer_pattern` supports legacy, sparse, and gated blocks:

- **Retentive Network (RetNet)** — internal reference: sun_retentive_2023 — code name: `retnet`, `retnet_attn`
- **Mamba (Selective State Space Model)** — internal reference: gu_mamba_2023 — code name: `mamba`
- **ODE-style Continuous Depth Block** — internal reference: zhang_continuous_2021 — code name: `ode`
- **Titans Memory-Augmented Attention** — internal reference: behrouz_titans_2025 — code name: `titan_attn`
- **Standard Softmax Attention** — internal reference: vaswani_attention_2017 — code name: `standard_attn`
- **Sigmoid Self-Attention** — internal reference: ramapuram_theory_2024 — code name: `sigmoid_attn`
- **Sparse Transformer** — internal reference: child_sparse_transformer_2019 — code name: `sparse_transformer_attn`
- **Longformer** — internal reference: beltagy_longformer_2020 — code name: `longformer_attn`
- **BigBird** — internal reference: zaheer_bigbird_2020 — code name: `bigbird_attn`

*Figure 1: System architecture: a YAML configuration flows through schema validation, partitions into model (17 mixers), training (AMP + scheduler), and optimizer (22 families, group routing) concerns, executes the build and train loop with checks and logs, and produces deployment (quantization) and SBERT (search/cluster) artifacts.*

- **SparseK Attention** — internal reference: lou_sparsek_2024 — code name: `sparsek_attn`
- **Native Sparse Attention (NSA)** — internal reference: yuan_nsa_2025 — code name: `nsa_attn`
- **SpargeAttn** — internal reference: zhang_spargeattn_2025 — code name: `sparge_attn`
- **FASA (Frequency-aware Sparse Attention)** — internal reference: wang_fasa_2026 — code name: `fasa_attn`
- **Gated Linear Attention (GLA)** — internal reference: yang_gla_2023 — code name: `gla_attn`
- **DeltaNet** — internal reference: yang_deltanet_2024 — code name: `deltanet_attn`
- **Gated DeltaNet** — internal reference: yang_gated_deltanet_2024 — code name: `gated_deltanet_attn`
- **HGRN2** — internal reference: qin_hgrn2_2024 — code name: `hgrn2_attn`
- **Forgetting Transformer (FoX)** — internal reference: lin_forgetting_transformer_2025 — code name: `fox_attn`
- **Gated Softmax Attention** — internal reference: qiu_gated_attention_2025 — code name: `gated_softmax_attn`

This list corresponds to current attention and sequence-mixer literature [36, 12, 52, 2, 30, 38, 9, 3, 50, 23, 48, 53, 41, 44, 45, 46, 28, 19, 29].
The `training.optimizer.optimizer_class` field supports a broad optimizer family: `sgd_momentum`, `adamw`, `adafactor`, `galore_adamw`, `prodigy`, `lion`, `sophia`, `muon`, `turbo_muon`, `radam`, `adan`, `adopt`, `ademamix`, `mars_adamw`, `cautious_adamw`, `lamb`, `schedulefree_adamw`, `shampoo`, `soap`, `apollo`, `apollo_mini`, and `q_apollo`.

#### 3.2 Schema Scope and Validation Rules

The schema is strict: top-level and nested objects set `additionalProperties: false`. This guarantees that unknown keys fail fast instead of being silently ignored. The `training.optimizer.parameters` object is additionally constrained by optimizer-specific prefix rules through `allOf` + `if`/`then` pattern checks.

Normalization values currently accepted by the schema are:

`norm_type ∈ {layer_norm, dynamic_tanh, derf}`

Thus, `rms_norm` is not a valid schema value in the current contract.
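As a minimal sketch of the fail-fast behavior, the fragment below validates a toy schema with the `jsonschema` package; the repository's actual validation entry point and schema contents may differ:

```python
# Sketch: additionalProperties: false rejects unknown keys immediately.
import yaml
from jsonschema import validate, ValidationError

schema = {  # toy fragment, not the real configs/schema.yaml
    "type": "object",
    "additionalProperties": False,
    "required": ["model_class"],
    "properties": {
        "model_class": {"enum": ["frankenstein", "mini", "frankesteindecoder"]},
    },
}

good = yaml.safe_load("model_class: mini")
bad = yaml.safe_load("model_clas: mini")  # typo in the key name

validate(good, schema)                    # passes silently
try:
    validate(bad, schema)
except ValidationError as e:
    print("rejected:", e.message)         # unknown key fails fast
```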

#### 3.3 Complete Model Feature Inventory

| Field | Type/Range | Meaning |
|---|---|---|
| `model_class` | enum | Architecture variant: `frankenstein` (mixed-architecture encoder), `mini` (simplified encoder), `frankesteindecoder` (autoregressive decoder). |
| `model.mode` | enum | Attention mode: `encoder` (bidirectional, for MLM) or `decoder` (causal, for AR). When `model_class='frankesteindecoder'`, forces `mode='decoder'`. |
| `vocab_size` | int ≥ 1 | Vocabulary size. |
| `hidden_size` | int ≥ 1 | Hidden dimension. |
| `num_layers` | int ≥ 1 | Physical layer count. |
| `num_loops` | int ≥ 1 | Logical loop count (looped blocks). |
| `num_heads` | int ≥ 1 | Attention heads. |
| `retention_heads` | int ≥ 1 | Retention heads for RetNet-style mixers. |
| `num_experts` | int ≥ 1 | MoE expert count. |
| `top_k_experts` | int ≥ 1 | Top-k expert routing in MoE. |
| `dropout` | float [0, 1] | Global dropout. |
| `layer_pattern` | array enum | Ordered block list: legacy (`retnet`, `retnet_attn`, `mamba`, `ode`, `titan_attn`, `standard_attn`, `sigmoid_attn`), sparse (`sparse_transformer_attn`, `longformer_attn`, `bigbird_attn`, `sparsek_attn`, `nsa_attn`, `sparge_attn`, `fasa_attn`), and gated (`gla_attn`, `deltanet_attn`, `gated_deltanet_attn`, `hgrn2_attn`, `fox_attn`, `gated_softmax_attn`). |
| `ode_solver` | enum | `rk4` or `euler`. |
| `ode_steps` | int ≥ 1 | ODE integration steps. |
| `use_bitnet` | bool | Enable low-bit BitLinear path. |
| `norm_type` | enum | `layer_norm`, `dynamic_tanh`, `derf`. |
| `use_factorized_embedding` | bool | Enable factorized embeddings. |
| `factorized_embedding_dim` | int ≥ 1 | Reduced embedding dimension for factorization. |
| `use_embedding_conv` | bool | Enable Conv1d over embedding stream. |
| `embedding_conv_kernel` | int ≥ 1 | Conv1d kernel size. |
| `hope_base` | float ≥ 0 | HoPE base value (optional in schema). |
| `hope_damping` | float ≥ 0 | HoPE damping (optional in schema). |
| `use_hope` | bool | Apply HoPE in `titan_attn`. |
| `use_moe` | bool | Enable MoE FFN routing path. |
| `ffn_hidden_size` | int ≥ 1 | FFN intermediate width. |
| `ffn_activation` | enum | `silu` or `gelu`. |

The looped depth induced by the schema is

$$L_{\text{logical}} = \texttt{num\_layers} \times \texttt{num\_loops}$$

which is the configuration-level definition of looped blocks.

#### 3.4 Training Task Types

The `training.task` field determines the training objective, working in conjunction with `model.mode` to define how the model learns:

**Masked Language Modeling (MLM).**

- **Mode**: Requires `mode='encoder'` for bidirectional attention.
- **Objective**: Randomly mask tokens in the input sequence (typically 15%) and train the model to predict the masked tokens from the full bidirectional context.
- **Use Case**: Pre-training encoders for representation learning, following BERT-style methodology [11]. The model learns bidirectional representations capturing context from both left and right.
- **Configuration**: Uses the `mlm_probability` parameter to control the masking fraction.

**Autoregressive (AR) Next-Token Prediction.**

- **Mode**: Requires `mode='decoder'` for causal masking.
- **Objective**: Train the model to predict the next token in the sequence given only the previous tokens.
- **Use Case**: Language generation and LLM-style tasks following GPT methodology. The model learns sequential dependencies with causal attention where each token can only attend to preceding tokens.
- **Model Class**: Typically uses `model_class='frankesteindecoder'` for autoregressive decoder architectures.

This dual task support enables unified experimentation across both encoder-style pre-training (MLM for bidirectional understanding) and decoder-style generation (AR for sequential text production) within the same codebase.
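As a minimal sketch of BERT-style MLM input corruption (a simplified stand-in for the repository's data pipeline: every selected position is replaced by the mask token, without the 80/10/10 split used by BERT, and special tokens are not excluded):

```python
# Sketch: corrupt a batch for MLM and build labels so that the loss
# is computed only on masked positions (ignore_index elsewhere).
import torch

def mask_tokens(input_ids, mask_token_id, mlm_probability=0.15, ignore_index=-100):
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mlm_probability  # Bernoulli selection
    labels[~masked] = ignore_index        # loss ignores unmasked positions
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id     # replace selected tokens with [MASK]
    return corrupted, labels

ids = torch.randint(5, 100, (2, 16))      # toy batch of token ids
x, y = mask_tokens(ids, mask_token_id=4)
```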

#### 3.5 Complete Training Feature Inventory

| Field | Type/Range | Meaning |
|---|---|---|
| `batch_size` | int ≥ 1 | Loader batch size. |
| `dataloader_workers` | int ≥ 0 | PyTorch dataloader workers. |
| `max_length` | int ≥ 1 | Sequence length cap. |
| `task` | enum | Training objective: `mlm` (masked language modeling) or `sbert` (sentence embedding). |
| `mlm_probability` | float [0, 1] | MLM masking probability (applies only when `task='mlm'`). |
| `max_samples` | int ≥ 1 | Maximum streamed samples. |
| `dataset_batch_size` | int ≥ 1 | Internal streaming dataset chunk size. |
| `num_workers` | int ≥ 0 | Streaming dataset workers. |
| `cache_dir` | string | Dataset cache directory. |
| `local_parquet_dir` | string | Optional local parquet path. |
| `prefer_local_cache` | bool | Prefer local cache when available. |
| `stream_local_parquet` | bool | Stream from local parquet mode. |
| `use_amp` | bool | Mixed precision toggle. |
| `gradient_accumulation_steps` | int ≥ 1 | Effective batch through accumulation. |
| `optimizer` | object | Contains `optimizer_class` and prefixed `parameters`. |
| `scheduler_total_steps` | int ≥ 1 | Scheduler horizon. |
| `scheduler_warmup_ratio` | float [0, 1] | Warmup ratio. |
| `scheduler_type` | enum | `cosine`, `constant`, `linear_warmup_then_constant`. |
| `grad_clip_max_norm` | float ≥ 0 | Global norm clipping threshold. |
| `inf_post_clip_threshold` | float ≥ 0 | Exploding-gradient guard threshold after clipping. |
| `max_nan_retries` | int ≥ 0 | Retry budget for NaN/Inf instability. |
| `checkpoint_every_n_steps` | int ≥ 1 | Rolling checkpoint frequency. |
| `max_rolling_checkpoints` | int ≥ 1 | Number of rolling checkpoints to keep. |
| `num_best_checkpoints` | int ≥ 1 | Number of best checkpoints tracked. |
| `nan_check_interval` | int ≥ 1 | NaN/Inf check cadence. |
| `log_gradient_stats` | bool | Enable gradient statistics logging. |
| `gradient_log_interval` | int ≥ 1 | Gradient logging cadence. |
| `csv_log_path` | string | Step-level CSV output path. |
| `csv_rotate_on_schema_change` | bool | Rotate CSV if logging schema changes. |
| `gpu_metrics_backend` | enum | `nvml` or `none`. |
| `nvml_device_index` | int ≥ 0 | Device index for NVML telemetry. |
| `enable_block_grad_norms` | bool | Include per-block gradient norm telemetry. |
| `telemetry_log_interval` | int ≥ 1 | Heavy telemetry interval (optimizer steps). |
| `use_galore` | bool | Enable GaLore strategy. |
| `galore_rank` | int ≥ 1 | GaLore low-rank projection dimension. |
| `galore_update_interval` | int ≥ 1 | Projection refresh interval. |
| `galore_scale` | float ≥ 0 | Gradient scaling in projected space. |
| `galore_max_dim` | int ≥ 1 | Maximum tensor dimension for GaLore projection. |

#### 3.6 Optimizer Prefix Contract (Full)

Supported `optimizer_class` values are: `sgd_momentum`, `adamw`, `adafactor`, `galore_adamw`, `prodigy`, `lion`, `sophia`, `muon`, `turbo_muon`, `radam`, `adan`, `adopt`, `ademamix`, `mars_adamw`, `cautious_adamw`, `lamb`, `schedulefree_adamw`, `shampoo`, `soap`, `apollo`, `apollo_mini`, `q_apollo`.

Shared per-group suffix families (all prefixed by the optimizer name) are:

- **LR groups**: `lr_embeddings`, `lr_norms`, `lr_ode`, `lr_retnet`, `lr_mamba`, `lr_attention`, `lr_other`
- **Weight decay groups**: `wd_embeddings`, `wd_norms`, `wd_ode`, `wd_retnet`, `wd_mamba`, `wd_attention`, `wd_other`
- **Beta groups**: `betas_embeddings`, `betas_norms`, `betas_ode`, `betas_retnet`, `betas_mamba`, `betas_attention`, `betas_other`
- **Epsilon groups**: `eps_embeddings`, `eps_norms`, `eps_ode`, `eps_retnet`, `eps_mamba`, `eps_attention`, `eps_other`

Optimizer-specific global suffixes:

- `sgd_momentum`: `momentum`, `nesterov`
- `adafactor`: `beta2_decay`, `clip_threshold`, `eps1`, `eps2`
- `galore_adamw`: `rank`, `update_proj_gap`
- `prodigy`: `d_coef`
- `sophia`: `rho`, `update_k`
- `muon` / `turbo_muon`: `momentum`, `nesterov`, `ns_steps`, `ns_eps`
- `cautious_adamw`: `cautious_clip`
- `apollo`: `rank`, `update_proj_gap`, `scale`, `scale_type`, `proj_type`, `scale_front`, `disable_nl`
- `apollo_mini`: `update_proj_gap`, `scale`, `proj_type`, `scale_front`, `disable_nl`
- `q_apollo`: `rank`, `update_proj_gap`, `scale`, `scale_type`, `proj_type`, `scale_front`, `disable_nl`, `quant_bits`

All other classes in the list above accept only prefixed shared groups.
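A hypothetical `optimizer` block illustrating how the prefix contract composes (the key spelling assumes the pattern `<optimizer_class>_<suffix>`; values are placeholders, and the schema remains authoritative):

```yaml
# Illustrative only: shared per-group suffixes plus optimizer-specific
# global suffixes, all prefixed by the optimizer class name.
optimizer:
  optimizer_class: sophia
  parameters:
    sophia_lr_embeddings: 1.0e-4
    sophia_lr_attention: 3.0e-4
    sophia_lr_other: 3.0e-4
    sophia_wd_attention: 0.1
    sophia_betas_other: [0.96, 0.99]
    sophia_rho: 0.04        # optimizer-specific global suffix
    sophia_update_k: 10     # optimizer-specific global suffix
```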

#### 3.7 Training Safety and Runtime Semantics

Schema-level safety features include accumulation, clipping, post-clip explosion checks, and NaN retries:

$$g_{\text{acc}} = \frac{1}{K} \sum_{i=1}^{K} g_i, \qquad K = \texttt{gradient\_accumulation\_steps}$$

$$g_{\text{clip}} = g_{\text{acc}} \cdot \min\left(1,\ \frac{\tau}{\lVert g_{\text{acc}} \rVert_2 + \epsilon}\right), \qquad \tau = \texttt{grad\_clip\_max\_norm}$$

Overflow guards then use `inf_post_clip_threshold`, with retry logic bounded by `max_nan_retries`.
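A simplified sketch of this accumulate/clip/guard sequence follows (the repository's trainer additionally handles AMP, checkpointing, and telemetry; `loss_fn` and the micro-batch iterable are stand-ins, and `clip_grad_norm_` returns the pre-clip norm, which the guard treats as the instability signal):

```python
# Sketch: one optimizer step with accumulation, clipping, and a NaN/Inf guard.
import math
import torch

def guarded_step(model, optimizer, micro_batches, loss_fn, K, tau, post_clip_threshold):
    optimizer.zero_grad()
    for batch in micro_batches:            # K micro-batches
        loss = loss_fn(model, batch) / K   # mean of accumulated gradients
        loss.backward()
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), tau)
    if not math.isfinite(total_norm.item()) or total_norm.item() > post_clip_threshold:
        optimizer.zero_grad()              # skip step; caller counts a NaN retry
        return False
    optimizer.step()
    return True
```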

#### 3.8 Normalization Variants: RMSNorm, Dynamic Tanh, and Dynamic Erf

Normalization determines how activation scale is controlled across depth. In this repository, normalization is not only a modeling choice but also a schema compatibility question, because only certain values are currently accepted by `norm_type`. The three formulations most relevant to this codebase are:

#### 3.9 RMSNorm

RMSNorm removes mean-centering and only rescales by root-mean-square magnitude [51]:

$$\mathrm{RMS}(x) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2 + \epsilon}, \qquad y_i = \gamma_i \frac{x_i}{\mathrm{RMS}(x)}$$

Compared with LayerNorm, RMSNorm is computationally simpler (no subtraction of the feature mean) and is often used when reducing normalization overhead is important.
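A minimal RMSNorm module matching the formula above (for reference only, since `rms_norm` is not a schema-accepted `norm_type` in this repository):

```python
# Sketch: RMSNorm rescales by root-mean-square, no mean subtraction.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # RMS over the feature dimension, epsilon inside the square root
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * x / rms
```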

**Algorithm 1: Schema-Driven Training Step with Stability Controls**

Require: batch stream, config C

1. Initialize retry counter r ← 0
2. For each optimizer step:
   1. Accumulate gradients for K = `C.gradient_accumulation_steps` micro-batches
   2. Apply global norm clipping with τ = `C.grad_clip_max_norm`
   3. If the post-clip gradient exceeds `C.inf_post_clip_threshold` or NaN/Inf is detected:
      - If r < `C.max_nan_retries`: restore safe state / skip step; r ← r + 1; continue
      - Else: stop training with failure state
   4. Run the optimizer step selected by `optimizer_class`
   5. Update the scheduler (`cosine`, `constant`, or `linear_warmup_then_constant`)
   6. If step mod `checkpoint_every_n_steps` = 0: save a rolling checkpoint and prune to `max_rolling_checkpoints`
   7. Update best checkpoints up to `num_best_checkpoints`
   8. Emit CSV + telemetry following `gradient_log_interval` and `telemetry_log_interval`

#### 3.10 Dynamic Tanh (DyT)

Dynamic Tanh proposes replacing explicit normalization with a bounded elementwise map [56]:

$$\mathrm{DyT}(x) = \tanh(\alpha x)$$

where α is learned. The core idea is that bounded nonlinear contraction can provide stable signal scaling without explicitly computing per-token normalization statistics.

#### 3.11 Dynamic Erf (Derf)

Derf extends the same normalization-free direction by using an error-function-based map [6]:

$$\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$$

with learnable scale/shift. Reported results in the cited work indicate stronger performance than DyT and common normalization baselines across multiple domains.
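Minimal sketches of both normalization-free maps (per-channel parameter shapes and initial values are assumptions; the cited works include further affine details):

```python
# Sketch: the two schema-accepted normalization-free transforms.
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """DyT(x) = tanh(alpha * x) with a learnable scale alpha [56]."""
    def __init__(self, dim, init_alpha=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((dim,), init_alpha))

    def forward(self, x):
        return torch.tanh(self.alpha * x)

class DynamicErf(nn.Module):
    """Derf(x) = erf(alpha * x + s) with learnable scale and shift [6]."""
    def __init__(self, dim, init_alpha=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((dim,), init_alpha))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return torch.erf(self.alpha * x + self.shift)
```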

#### 3.12 Schema Implications

The current configuration contract in this repository allows

`norm_type ∈ {layer_norm, dynamic_tanh, derf}`

so DyT and Derf are directly available in schema-driven runs, while RMSNorm is not currently an accepted enum value and would require a code/schema extension.

| Method | Formula | Stats Needed | Notes |
|---|---|---|---|
| RMSNorm | $\gamma_i x_i / \sqrt{\tfrac{1}{d} \sum_j x_j^2 + \epsilon}$ | RMS only | Lower overhead; widely used baseline [51]. |
| Dynamic Tanh | $\tanh(\alpha x)$ | none | Normalization-free bounded transform; drop-in replacement [56]. |
| Dynamic Erf | $\mathrm{erf}(\alpha x + s)$ | none | Normalization-free alternative; improves over DyT [6]. |

### 4 Architecture Taxonomy and Implementation

#### 4.1 Attention and Sequence-Mixer Families

This system implements seventeen distinct sequence mixer architectures organized into four functional categories reflecting research trends in sequence modeling design. The taxonomical organization reflects evolving understanding of how to balance expressivity, computational efficiency, and memory constraints.

1. **Dense Attention Baselines**: Standard softmax attention and sigmoid attention provide full global contextualization at quadratic computational cost, serving as reference baselines for comparison with more efficient alternatives.

2. **Recurrent and Retentive Architectures**: RetNet, Mamba, ODE-style blocks, and Titans maintain state representations enabling O(1) inference cost while preserving expressivity through recurrent dynamics, selective parameters, or test-time memory adaptation.

3. **Sparse Attention Patterns**: Seven sparse variants (Sparse Transformer, Longformer, BigBird, SparseK, NSA, SpargeAttn, FASA) reduce quadratic complexity through structured sparsity, token selection, or training-free pruning strategies.

4. **Gated Memory Mechanisms**: Six gated architectures (GLA, DeltaNet, Gated DeltaNet, HGRN2, FoX, Gated Softmax) introduce data-dependent control over memory retention, forgetting, and update strength.

#### 4.2 Standard Attention

Given projected matrices (Q, K, V):

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

This is the baseline mechanism for content routing [38].
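A reference single-head implementation of this baseline (batch-first; the `causal` flag corresponds to decoder mode):

```python
# Sketch: scaled dot-product attention with an optional causal mask.
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    # q, k, v: (batch, seq, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if causal:  # decoder mode: each token attends only to the past
        n = scores.size(-1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=scores.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```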

#### 4.3 Sigmoid Attention

Sigmoid attention removes row-wise probability normalization:

$$\mathrm{SigmoidAttn}(Q, K, V) = \sigma\left(\frac{QK^\top}{\sqrt{d_k}} + b\right) V$$

and has different training stability requirements, often with additional normalization [30].

*Figure 2: Taxonomy of the seventeen supported sequence mixers: Dense (`standard_attn`, `sigmoid_attn`), Recurrent (`retnet` / `retnet_attn`, `mamba`, `ode`, `titan_attn`), Sparse (`sparse_transformer_attn`, `longformer_attn`, `bigbird_attn`, `sparsek_attn`, `nsa_attn`, `sparge_attn`, `fasa_attn`), and Gated (`gla_attn`, `deltanet_attn`, `gated_deltanet_attn`, `hgrn2_attn`, `fox_attn`, `gated_softmax_attn`). Dense baselines provide full global routing at quadratic cost. Recurrent architectures enable constant-time inference through state compression. Sparse variants reduce complexity through structured patterns or token selection. Gated mechanisms introduce data-dependent control over memory retention and forgetting.*

#### 4.4 Retentive Formulation

RetNet uses retention with decay matrix D:

$$\mathrm{Retention}(Q, K, V) = \left(QK^\top \odot D\right) V$$

with recurrent form

$$S_n = \gamma S_{n-1} + k_n^\top v_n, \qquad o_n = q_n S_n$$

enabling low-cost recurrent inference [36, 43].
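A sketch of the recurrent form, which carries an O(d²) state instead of a growing KV cache (single head, unbatched, fixed scalar decay for illustration; RetNet uses multi-scale per-head decays):

```python
# Sketch: recurrent retention, S_n = gamma*S_{n-1} + k_n^T v_n, o_n = q_n S_n.
import torch

def retention_recurrent(q, k, v, gamma=0.9):
    # q, k, v: (seq, d); state S: (d, d)
    d = q.size(-1)
    S = torch.zeros(d, d)
    outputs = []
    for t in range(q.size(0)):
        S = gamma * S + torch.outer(k[t], v[t])  # decayed state update
        outputs.append(q[t] @ S)                 # readout o_t = q_t S_t
    return torch.stack(outputs)
```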

#### 4.5 Selective SSM (Mamba)

The discrete selective state-space recurrence is

$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t h_t$$

where $(\bar{A}_t, \bar{B}_t, C_t)$ depend on the input, preserving linear-time scaling with a hardware-aware scan [12, 15].
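A naive, non-fused sketch of this recurrence for a single input channel (real Mamba fuses the scan into a hardware-aware kernel and derives the discretized parameters from the input [12]):

```python
# Sketch: sequential selective scan, h_t = A_bar_t*h + B_bar_t*u_t, y_t = C_t h_t.
import torch

def selective_scan(u, A_bar, B_bar, C):
    # u: (seq,) scalar input channel; A_bar, B_bar, C: (seq, d_state),
    # already input-dependent (the "selective" part) and discretized.
    h = torch.zeros(A_bar.size(1))
    ys = []
    for t in range(u.size(0)):
        h = A_bar[t] * h + B_bar[t] * u[t]  # diagonal state transition
        ys.append(torch.dot(C[t], h))       # readout y_t = C_t h_t
    return torch.stack(ys)
```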

#### 4.6 ODE-style Continuous Updates

Continuous-depth framing:

$$\frac{dh(t)}{dt} = f_\theta(h(t), t)$$

with practical Runge–Kutta integrators for discrete execution [52].
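For example, one classical RK4 integration step over the layer function (here `f` stands in for $f_\theta$, and `dt` is the step size implied by `ode_steps`):

```python
# Sketch: one classical fourth-order Runge-Kutta step h(t) -> h(t + dt).
def rk4_step(f, h, t, dt):
    k1 = f(h, t)
    k2 = f(h + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = f(h + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = f(h + dt * k3, t + dt)
    return h + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```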

#### 4.7 Test-time Memory (Titans)

A memory-augmented update can be written as

$$M_t = (1 - \alpha_t) M_{t-1} + S_t, \qquad S_t = \eta_t S_{t-1} - \theta_t \nabla \ell(M_{t-1}; x_t)$$

to adapt memory at inference time [2, 1].

*Figure 3: Conceptual map of sparse attention design choices used in the codebase: local/windowed attention (Longformer), hybrid sparse graphs (BigBird / Sparse Transformer), and selective pruning (SparseK / NSA / FASA / SpargeAttn). Different methods reduce cost by restricting neighborhoods, constructing sparse graphs, or selecting only high-value tokens/blocks.*

#### 4.8 Sparse and Gated Extensions in the Current Codebase

The mixer registry now includes sparse blocks (`sparse_transformer_attn`, `longformer_attn`, `bigbird_attn`, `sparsek_attn`, `nsa_attn`, `sparge_attn`, and `fasa_attn`) and gated blocks (`gla_attn`, `deltanet_attn`, `gated_deltanet_attn`, `hgrn2_attn`, `fox_attn`, and `gated_softmax_attn`). The implementation enforces an explicit execution policy for training-free sparse methods: `fasa_attn` and `sparge_attn` are eval/inference-only and raise runtime errors if used while the model is in training mode.

#### 4.9 Implemented Sparse Attention Blocks (Detailed)

This codebase includes seven sparse attention families [9, 3, 50, 23, 48, 53, 41].

**Sparse Transformer (`sparse_transformer_attn`).** Uses factorized sparse masks (strided + fixed) to approximate dense connectivity at lower cost than full $O(n^2)$ attention:

$$\mathrm{Attn}_i = \mathrm{softmax}\left(\frac{q_i K_{A_i}^\top}{\sqrt{d_k}}\right) V_{A_i}$$

where $A_i$ is the sparse neighborhood induced by stride/fixed rules. [9]

**Longformer (`longformer_attn`).** Uses sliding-window locality with optional global tokens:

$$A_i = \{\, j : |i - j| \le w/2 \,\} \cup G$$

yielding linear scaling in sequence length for fixed window $w$. [3]

**BigBird (`bigbird_attn`).** Combines local windows, random links, and global tokens:

$$A_i = A_i^{\text{window}} \cup A_i^{\text{random}} \cup A_i^{\text{global}}$$

to preserve strong long-context connectivity with sparse computation. [50]

**SparseK (`sparsek_attn`).** Uses a differentiable top-k style projection over importance scores before attention, so only selected KV pairs participate in the expensive dot-product path. [23]

**NSA (`nsa_attn`).** Implements a three-branch sparse design: a compressed branch, a selected branch, and a local window branch, combined with learned gates:

$$o_t = \sum_{c \in \{\text{cmp}, \text{sel}, \text{win}\}} g_t^{c}\, \mathrm{Attn}\!\left(q_t, \tilde{K}_t^{c}, \tilde{V}_t^{c}\right)$$
*Figure 4: Generic gating template. A gate ($\alpha_t$, $\beta_t$, $G_t$, $f_t$) can decay the previous state $S_{t-1}$, regulate the write strength of a new key/value, or modulate the SDPA output, yielding the updated memory or output ($S_t$ or $O_t$), depending on the block family.*

**SpargeAttn (`sparge_attn`).** Two-stage training-free block filtering: first predicts negligible block interactions, then applies softmax-aware pruning to remove low-contribution blocks. [53]

**FASA (`fasa_attn`).** Frequency-aware training-free attention: uses dominant RoPE frequency chunks for token importance prediction, then applies full attention only on selected tokens. [41]

| Block | Trainable | Asymptotic Trend | Primary Sparsity Unit | Current Integration Notes | Ref |
|---|---|---|---|---|---|
| Sparse Transformer | Yes | sub-quadratic | mask pattern (token-level) | Factorized strided/fixed masks inside SDPA pipeline. | [9] |
| Longformer | Yes | linear in n (fixed w) | sliding window + global tokens | Window mask with optional global indices. | [3] |
| BigBird | Yes | near-linear | window + random + global edges | Randomized sparse mask plus local/global paths. | [50] |
| SparseK | Yes | linear-like (selected KV) | differentiable top-k KV selection | Learned score net + SparseK projection + gathered KV attention. | [23] |
| NSA | Yes | reduced-token multi-branch | compressed blocks + selected blocks + local window | Three sparse branches gated into one output tensor. | [48] |
| SpargeAttn | No (training-free) | sparse block dependent | block-level predicted sparsity | Eval-only in this repo; raises in training mode. | [53] |
| FASA | No (training-free) | selected-token dependent | dominant frequency chunks + selected tokens | Eval-only in this repo; raises in training mode. | [41] |

#### 4.10 Implemented Gated Attention Blocks (Detailed)

This codebase includes seven gated blocks [44, 45, 46, 36, 28, 19, 29]. The unifying idea is that gating controls what information survives. Some gates act on recurrent state updates (GLA, DeltaNet variants, HGRN2), while others modify the full attention path itself (FoX and Gated Softmax). This makes gating especially useful when the model must trade off recall, recency, and bounded memory.

**GLA (`gla_attn`).** Gated Linear Attention applies data-dependent multiplicative decay in recurrent state updates:

$$S_t = G_t \odot S_{t-1} + v_t k_t^\top, \qquad o_t = S_t q_t$$

to control memory accumulation. [44]
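A sketch of this recurrence (unbatched; the full-matrix gate shape is an assumption for clarity, since GLA typically parameterizes $G_t$ through a low-rank or diagonal projection [44]):

```python
# Sketch: gated linear attention state recurrence,
# S_t = G_t ⊙ S_{t-1} + v_t k_t^T, o_t = S_t q_t.
import torch

def gla_recurrent(q, k, v, G):
    # q, k, v: (seq, d); G: (seq, d, d) multiplicative decay gates in (0, 1)
    d = q.size(-1)
    S = torch.zeros(d, d)
    outs = []
    for t in range(q.size(0)):
        S = G[t] * S + torch.outer(v[t], k[t])  # gated decay, then write
        outs.append(S @ q[t])                   # readout o_t = S_t q_t
    return torch.stack(outs)
```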

**DeltaNet (`deltanet_attn`).** Uses a delta-rule error-correcting write with learned write strength $\beta_t$:

$$S_t = S_{t-1}\left(I - \beta_t k_t k_t^\top\right) + \beta_t v_t k_t^\top$$

which improves targeted memory replacement. [45]

**Gated DeltaNet (`gated_deltanet_attn`).** Adds a decay gate $\alpha_t$ on top of delta-rule writes:

$$S_t = \alpha_t S_{t-1}\left(I - \beta_t k_t k_t^\top\right) + \beta_t v_t k_t^\top$$

for both global forgetting and local corrective updates. [46]

**RetNet Attn Alias (`retnet_attn`).** Provides an explicit gated-package alias wrapping multi-scale retention behavior for naming consistency in layer registries. [36]

**HGRN2 (`hgrn2_attn`).** Uses lower-bounded forget gates with outer-product state expansion:

$$S_t = \mathrm{diag}(g_t)\, S_{t-1} + v_t k_t^\top$$

to increase recurrent state expressiveness while remaining efficient. [28]

**FoX (`fox_attn`).** Injects a token-wise forget bias directly into the softmax logits:

$$O = \mathrm{softmax}\left(QK^\top + D\right) V$$

where $D$ is derived from cumulative log-forget gates. [19]

**Gated Softmax (`gated_softmax_attn`).** Applies a post-SDPA sigmoid gate:

$$Y' = \mathrm{SDPA}(Q, K, V) \odot \sigma(X W_g)$$

which adds multiplicative channel gating without replacing softmax attention. [29]

| Block | State Type | Gate Mechanism | Softmax Path | Current Integration Notes | Ref |
|---|---|---|---|---|---|
| GLA | matrix recurrent state | data-dependent multiplicative decay | No (linear recurrent) | Recurrent update with low-rank gate projection. | [44] |
| DeltaNet | matrix recurrent state | write gate (β) | No (linear recurrent) | Delta-rule correction with normalized Q/K. | [45] |
| Gated DeltaNet | matrix recurrent state | decay + write gates (α, β) | No (linear recurrent) | Combined forgetting and targeted writing. | [46] |
| RetNet Attn | matrix recurrent state | fixed multi-scale decay | No (retention) | Alias wrapper over existing RetNet mixer. | [36] |
| HGRN2 | matrix recurrent state | lower-bounded forget gate | No (linear recurrent) | Hierarchical recurrent gating with outer products. | [28] |
| FoX | full attention matrix | logit-space forget gate | Yes | Forget bias added before softmax. | [19] |
| Gated Softmax | full attention matrix | post-attention sigmoid gate | Yes | Sigmoid gating applied after SDPA output. | [29] |

Algorithm 2 Pattern-Driven Mixer Forward (Conceptual)
Require: Hidden states H, pattern P, layer index ℓ
1: m ← P[ℓ mod |P|]
2: if m ∈ {fasa_attn, sparge_attn} and model is in training mode then
3:   raise configuration/runtime error (training-free block in train mode)
4: else if m = standard_attn then
5:   H ← softmax-attention(H)
6: else if m = sigmoid_attn then
7:   H ← sigmoid-attention(H)
8: else if m ∈ {retnet, retnet_attn} then
9:   H ← retention(H)
10: else if m = mamba then
11:   H ← selective-ssm(H)
12: else if m = ode then
13:   H ← rk-step(H)
14: else if m is a sparse attention key then
15:   H ← sparse-attention-family(H)
16: else if m is a gated attention key then
17:   H ← gated-attention-family(H)
18: else
19:   H ← memory-augmented-attn(H)
20: end if
21: return H
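A minimal Python rendering of this dispatch, assuming a registry dict that maps pattern keys to mixer callables (the names are illustrative, not the repository's exact API):

```python
# Hypothetical keys for the two training-free sparse blocks.
TRAINING_FREE = {"fasa_attn", "sparge_attn"}

def mixer_forward(h, pattern, layer_idx, registry, training):
    """Pick this layer's mixer from the repeating pattern and apply it."""
    key = pattern[layer_idx % len(pattern)]
    if key in TRAINING_FREE and training:
        # Mirrors lines 2-3 of Algorithm 2: fail fast instead of training
        # through a block that has no gradient-producing implementation.
        raise RuntimeError(f"{key} is training-free; remove it for training runs")
    return registry[key](h)  # e.g. registry["standard_attn"] = softmax_attention
```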

| Block | State Type | Gate Mechanism | Softmax Path | Current Integration Notes | Ref |
|---|---|---|---|---|---|
| FoX | full attention matrix | logit-space forget gate | Yes | Forget bias added before softmax. | [19] |
| Gated Softmax | full attention matrix | post-attention sigmoid gate | Yes | Sigmoid gating applied after SDPA output. | [29] |

5 Optimizer Families and Training Dynamics

Optimization of highly parameterized transformer architectures presents significant challenges due to non-convex loss landscapes, saddle points, and block heterogeneity across parameter groups. This system addresses these challenges through a unified framework supporting twenty-two optimizer families spanning six algorithmic categories: (1) classical baselines (SGD, AdamW), (2) advanced momentum and variance reduction (Adan, ADOPT, AdEMAMix, MARS, Cautious), (3) memory-efficient variants (Adafactor, GaLore, Lion, APOLLO, APOLLO-Mini, Q-APOLLO), (4) schedule-free and parameter-free methods (Schedule-Free AdamW, Prodigy) plus large-batch scaling through LAMB, (5) curvature-aware and second-order (Sophia, Shampoo, SOAP), and (6) geometry-oriented (Muon, Turbo-Muon). Detailed algorithmic descriptions, pseudocode, and a memory/complexity comparison table for all optimizers are provided in Appendix A.

6 Quantization and Deployment

The deploy stack uses ternary weight packing plus INT8 activation quantization to produce efficient artifacts.

Choose optimizer objective:

- Reliable baseline: AdamW / RAdam
- Lower optimizer memory: Adafactor / GaLore / Lion
- Aggressive or structured: Adan, ADOPT, Sophia, Shampoo, SOAP, Muon
- Schema-prefixed groups: LR, weight decay, betas, eps

Figure 5: Optimizer selection in practice: the class determines the update rule, while the schema controls how hyperparameters are applied across embeddings, norms, recurrent blocks, attention blocks, and other parameters.

Trained checkpoint → Weight packing (ternary / low-bit) → Activation scaling (INT8 path) → Deployable artifact (smaller storage footprint)

Figure 6: Deployment path from a trained checkpoint to a compact artifact. The codebase treats quantization as a deploy-stage transformation rather than a separate model family.

6.1 Ternary Quantization

Given a weight tensor $W$, a practical scaling is:

$$s = \mathrm{mean}(|W|), \qquad \tilde{W} = \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{W}{s}\right), -1, 1\right)$$

which approximates BitNet-style low-bit updates [40]. The packed mapping uses two bits per weight symbol for storage efficiency.
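A sketch of this scaling and a two-bit packing in PyTorch; the packing layout (four symbols per byte) is an assumption for illustration and may differ from the repository's actual format:

```python
import torch

def ternary_quantize(w: torch.Tensor):
    """Sketch of s = mean(|W|), W~ = clip(round(W / s), -1, 1)."""
    s = w.abs().mean().clamp(min=1e-8)            # per-tensor scale
    w_q = torch.clamp(torch.round(w / s), -1, 1)  # symbols in {-1, 0, +1}
    return w_q.to(torch.int8), s                  # dequantize as w ≈ s * w_q

def pack_two_bit(w_q: torch.Tensor) -> torch.Tensor:
    """Pack four ternary symbols per byte; length divisible by 4 assumed."""
    u = (w_q.flatten() + 1).to(torch.uint8)       # map {-1, 0, 1} -> {0, 1, 2}
    u = u.reshape(-1, 4)
    return u[:, 0] | (u[:, 1] << 2) | (u[:, 2] << 4) | (u[:, 3] << 6)
```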

6.2 Activation Quantization

For activations $x$, define the scale $\alpha = 127 / (\max(|x|) + \epsilon)$ and quantize:

$$q = \mathrm{round}(\alpha x), \qquad q \in [-128, 127]$$

with dequantization $x \approx q/\alpha$.
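A matching per-tensor sketch (symmetric scaling assumed):

```python
import torch

def quantize_activations(x: torch.Tensor):
    """Per-tensor symmetric INT8 quantization matching the formulas above."""
    alpha = 127.0 / (x.abs().max() + 1e-8)        # maps max(|x|) near 127
    q = torch.clamp(torch.round(x * alpha), -128, 127).to(torch.int8)
    return q, alpha                               # dequantize as x ≈ q / alpha
```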

6.3 Size Estimates

For $N$ parameters:

$$\text{FP32 size} \approx 4N \text{ bytes}, \qquad \text{FP16 size} \approx 2N \text{ bytes}, \qquad \text{1.58-bit size} \approx \tfrac{1.58}{8} N \text{ bytes}$$

before metadata and packing overhead. This aligns with lightweight deployment goals [32, 5].
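As a worked example of this arithmetic: a 125M-parameter model occupies roughly 500 MB in FP32 and 250 MB in FP16, but only about 25 MB under 1.58-bit packing ($125 \times 10^6 \times 1.58 / 8 \approx 24.7$ MB), before metadata.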

7 SBERT Downstream Tasks

Sentence embedding is built on Siamese-style training [31]. For a sentence pair $(s_1, s_2)$ with embeddings $(e_1, e_2)$:

$$\cos(e_1, e_2) = \frac{e_1^\top e_2}{\|e_1\|\,\|e_2\|}$$

Shared encoder → {Similarity: cosine score | Search: top-k retrieval | Cluster / encode: offline analysis}

Figure 7: SBERT workflow reuse. A single encoder supports online pair scoring, corpus retrieval, clustering, and persistent embedding export.

Algorithm 3 SBERT Inference Mode Router
Require: mode m, model E, inputs X
1: if m = similarity then
2:   return cos(E(x1), E(x2))
3: else if m = search then
4:   return top-k by dot-product/cosine against corpus embeddings
5: else if m = cluster then
6:   return clustering labels over E(X)
7: else
8:   return serialized embeddings E(X)
9: end if

Training pairs the shared encoder with a regression-style cosine loss:

$$L_{\cos} = (\cos(e_1, e_2) - y)^2$$

with $y \in [-1, 1]$ in this pipeline.
Supported downstream modes:

• Similarity: pairwise score between two sentences.

• Search: top-k nearest neighbors over a corpus.

• Cluster: grouping embeddings (e.g., k-means).

• Encode: persistent embedding export for later retrieval.
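A minimal sketch of the first two modes, assuming a shared `encode` function that maps a sentence to a 1-D embedding (names and shapes here are illustrative, not the repository's exact interface):

```python
import torch
import torch.nn.functional as F

def similarity(encode, s1: str, s2: str) -> float:
    """Pairwise cosine score between two sentences."""
    e1, e2 = encode(s1), encode(s2)              # (d,) embeddings, shared encoder
    return F.cosine_similarity(e1, e2, dim=0).item()

def search(encode, query: str, corpus_emb: torch.Tensor, k: int = 5):
    """Top-k retrieval over pre-encoded corpus embeddings of shape (n, d)."""
    q = F.normalize(encode(query), dim=0)
    scores = F.normalize(corpus_emb, dim=1) @ q  # cosine via normalized dot product
    return torch.topk(scores, k)                 # (values, indices) of best matches
```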

8 Summary Tables

8.1 Attention and Sequence-Mixer Summary

| Type | Core Equation | Train | Infer | Notes |
|---|---|---|---|---|
| Standard Attention | $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$ | $O(n^2 d)$ | $O(n)$/step | Baseline expressive global routing [38]. |
| Sigmoid Attention | $\sigma(QK^\top/\sqrt{d_k} + b)\,V$ | $O(n^2 d)$ | $O(n)$/step | Element-wise gating; often needs stabilization norm [30]. |
| RetNet | $(QK^\top \odot D)\,V$ | $O(n^2 d)$ or chunkwise | $O(1)$/step | Parallel/recurrent dual form with decay retention [36]. |
| Mamba | $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t$ | $O(nd)$ | $O(1)$/step | Selective state-space with hardware-aware scan [12]. |
| ODE-style block | $\frac{dh}{dt} = f_\theta(h, t)$ | solver-dependent | solver-dependent | Continuous-depth interpretation; RK integration [52]. |
| Titans memory | $M_t = (1-\alpha_t)M_{t-1} + S_t$ | approx. $O(nd)$ | retrieval-centric | Test-time memory updates with surprise-driven dynamics [2]. |

8.2 Optimizer Summary

| Optimizer | Family | State Cost | Key Idea | Ref |
|---|---|---|---|---|
| SGD+Momentum | Classical first-order | Low | Momentum-accelerated baseline with minimal state | [27] |
| AdamW | Adaptive first/second moment | High | Decoupled weight decay baseline | [22] |
| RAdam | Adaptive variance-corrected | High | Rectifies early adaptive variance | [21] |
| Adan | Momentum + variance reduction | High | Nesterov-style adaptive update | [42] |
| ADOPT | Adam variant | High | Reordered updates with improved convergence guarantees | [37] |
| AdEMAMix | Multi-EMA adaptive | High | Mixes short and long horizon EMAs | [26] |
| MARS | Variance-reduced preconditioned | High | Recursive momentum correction | [49] |
| Cautious AdamW | Masked momentum | High | Apply updates only on sign-consistent directions | [18] |
| LAMB | Layer-wise adaptive moments | High | Trust-ratio scaling for very large-batch training | [47] |
| Schedule-free AdamW | Scheduler-free adaptive | High | Remove explicit LR schedule dependence | [10] |
| Adafactor | Memory-efficient adaptive | Medium | Factorized second moments for matrix tensors | [33] |
| GaLore AdamW | Low-rank gradient projection | Medium | Optimize in projected low-rank gradient space | [54] |
| APOLLO | Low-rank adaptive projection | Medium | Structured scaling from projected Adam-style moments | [55] |
| APOLLO-Mini | Rank-1 adaptive projection | Low | Tensor-wise scaled APOLLO variant for extreme memory savings | [55] |
| Q-APOLLO | Quantized low-rank projection | Low | Quantized APOLLO state for ultra-low-memory training | [55] |
| Prodigy | Parameter-free adaptation | Medium | Distance-adaptive step calibration | [24] |
| Lion | Sign momentum | Low | Momentum sign update, reduced state | [7] |
| Sophia | Approx. second-order | Medium | Diagonal Hessian preconditioning with clipping | [20] |
| Shampoo | Matrix preconditioner | High | Kronecker-structured second-order statistics | [14] |
| SOAP | Shampoo + Adam basis | High | Adam-like tracking in preconditioner eigenbasis | [39] |
| Muon | Orthogonality-based | Medium | Orthogonalized matrix updates | [34] |
| Turbo-Muon | Accelerated orthogonalization | Medium | Preconditioned Newton-Schulz speedup | [4] |

9 Discussion

The design choices in Frankestein Transformer reflect several engineering and research tensions in modern deep learning tooling.

9.1 Schema-Driven Design Trade-offs

The schema-first approach provides significant reproducibility benefits by enforcing explicit contracts and failing fast on invalid configurations. However, this approach also introduces rigidity: adding new architectures or optimizers requires schema extensions rather than loose command-line arguments. The prefixed hyperparameter system enables fine-grained control but increases configuration complexity for users accustomed to simpler interfaces.

The decision to enforce additionalProperties: false at all schema levels eliminates the silent parameter swallowing that has plagued earlier configuration systems, but this strictness requires careful schema maintenance when extending system capabilities. Each new attention mechanism or optimizer variant must be properly integrated into the validation framework, including schema field definitions with appropriate types and constraints, prefixed hyperparameter mapping for optimizer-specific groups, default values aligned with research best practices, and documentation strings for web interface rendering.
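As a minimal illustration of this strictness (using the standalone `jsonschema` package; the field names are hypothetical, not the repository's actual schema):

```python
from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "additionalProperties": False,  # unknown keys fail fast, never silently dropped
    "properties": {
        "mixer": {"type": "string"},
        "optimizer": {"type": "string"},
    },
    "required": ["mixer", "optimizer"],
}

try:
    # The misspelled key 'optimiser' would be silently ignored by a permissive
    # loader; here it is rejected before any training resources are spent.
    validate({"mixer": "gla_attn", "optimizer": "adamw", "optimiser": "adam"}, schema)
except ValidationError as e:
    print(e.message)
```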

9.2 Architectural Coverage and Gaps

The seventeen implemented mixer architectures span major research directions in sequence modeling, but certain gaps remain. The system lacks recent hybrid architectures such as Griffin [35] and Jamba [13], which combine gating with state-space models. MoE (Mixture of Experts) routing is implemented for FFN layers but not for attention computation, where recent work has shown benefits [17].

The sparse attention coverage is comprehensive, but the implementation of training-free methods (FASA, SpargeAttn) raises runtime errors during training, reflecting architectural constraints: these methods require pretrained checkpoints from full-attention models or specific fine-tuning procedures that are not currently automated.

The gated mechanism coverage is strong across major categories (GLA, DeltaNet, Gated DeltaNet, HGRN2, FoX, Gated Softmax).

9.3 Optimizer Landscape Fragmentation

The support for twenty-two optimizer families across six algorithmic categories demonstrates comprehensiveness but also highlights the fragmented state of optimization research. Users face significant decision complexity when choosing among variance-reduction methods (Adan, MARS), memory-efficient variants (GaLore, Adafactor, APOLLO), and curvature-aware approaches (Sophia, Shampoo). The prefixed hyperparameter system, while powerful, requires understanding of which parameters are relevant for each optimizer class.

Implementation quality also varies across optimizers: classical methods (AdamW, SGD with momentum) are highly optimized in PyTorch, while newer methods (Muon, Turbo-Muon, SOAP) may require custom implementations that affect numerical stability and performance characteristics.

9.4 Deployment and Production Considerations

The quantization pipeline demonstrates practical deployment concerns but makes specific engineering trade-offs. Ternary weight packing reduces storage to approximately 1.58 bits per parameter, but this aggressive compression may degrade performance, especially for smaller models where quantization error is more significant. The current implementation applies quantization uniformly across parameter types.

The SBERT workflows provide practical utility for semantic similarity and retrieval tasks, but the implementation assumes standard pooling strategies (CLS token, mean pooling). Recent advances such as Matryoshka embeddings [25] and contrastive learning refinements [16] are not yet incorporated.

9.5 Integration and Extensibility Challenges

The current codebase structure, while functional, presents maintenance challenges as the architecture and optimizer families expand. The dispatcher pattern for mixer selection and optimizer routing handles extensibility but risks becoming a "kitchen sink" of conditional logic. Future versions would benefit from plugin-based architectures where new mixers and optimizers could be registered declaratively rather than by modifying core dispatch logic.

The web configuration interface provides significant usability improvements but introduces deployment complexity: running Streamlit alongside training jobs requires additional resources and infrastructure considerations that may not be appropriate for all environments, particularly HPC clusters without web access.

10 Conclusion

Frankestein Transformer presents a unified, configuration-driven experimentation platform addressing critical challenges in modern deep learning research: architectural fragmentation across dense attention, recurrent models, sparse patterns, and gated mechanisms; optimizer landscape complexity spanning classical baselines, variance-reduction methods, memory-efficient variants, schedule-free approaches, curvature-aware algorithms, and geometry-oriented methods; and end-to-end deployment workflows spanning quantization and sentence embedding applications.
The system’s primary contributions are:

1. Schema-First Design: A strict YAML-based configuration contract with validation and prefixed hyperparameter routing, enabling reproducible experiments across seventeen mixer architectures and twenty-two optimizer families.

2. Comprehensive Architecture Support: Implementation spanning major research categories, including dense baselines (standard, sigmoid attention), recurrent alternatives (RetNet, Mamba, ODE-style, Titans), sparse attention (Sparse Transformer, Longformer, BigBird, SparseK, NSA, SpargeAttn, FASA), and gated mechanisms (GLA, DeltaNet, Gated DeltaNet, HGRN2, FoX, Gated Softmax).

3. Unified Optimizer Framework: Prefixed hyperparameter groups enabling fine-grained control over embeddings, normalization layers, recurrent blocks, attention weights, and FFN parameters across classical baselines (SGD+Momentum, AdamW), variance-reduction (Adan, ADOPT, AdEMAMix, MARS, Cautious), memory-efficient (Adafactor, GaLore, Lion, APOLLO, APOLLO-Mini, Q-APOLLO), large-batch and schedule simplification (LAMB, Schedule-Free AdamW, Prodigy), curvature-aware (Sophia), second-order (Shampoo, SOAP), and geometry-oriented (Muon, Turbo-Muon) optimizers.

4. End-to-End Workflows: An integrated deployment pipeline supporting ternary weight packing and INT8 activation quantization; SBERT-inspired training and inference for semantic similarity, retrieval, and clustering tasks.

5. Interactive Configuration: A Streamlit-based web interface providing schema-driven form generation, real-time validation, inline documentation, and CLI command synthesis.

This system enables rapid experimental iteration while maintaining reproducibility through strict configuration contracts. By consolidating diverse research contributions into a unified toolkit, it lowers barriers to exploring novel architectures and optimization strategies, particularly for researchers who may lack resources to implement and validate each variant independently.

10.1 Limitations and Future Directions

Several limitations and promising directions for future work emerge from this system's design and implementation:

1. Architecture Integration: Recent hybrid architectures (Griffin, Jamba, Mamba-X) demonstrate the benefits of combining multiple mechanisms into unified blocks. Future versions should integrate these architectures and explore systematic composition patterns.

2. Advanced Quantization: The current implementation uses uniform ternary packing across all parameters. Research on layer-wise, channel-wise, and importance-aware quantization suggests more sophisticated strategies could improve quality-efficiency trade-offs.

3. Plugin-Based Extensibility: The current dispatch pattern becomes increasingly complex with each new addition. A plugin architecture allowing declarative registration of new mixers, optimizers, and normalization methods would improve maintainability and reduce the risk of bugs in core dispatch logic.

4. Automated Hyperparameter Optimization: The schema supports extensive hyperparameter spaces, but users must manually explore these spaces. Integration with Bayesian optimization, multi-armed bandit strategies, or gradient-based hyperparameter tuning could automate effective configuration discovery.

5. Production Deployment: The web interface improves usability but may not be appropriate for all deployment environments. Headless configuration modes, API-based configuration management, or improved CLI ergonomics could serve HPC and production workflows.

6. Evaluation Benchmarking: While the system enables training with diverse architectures, comprehensive benchmarking comparing performance across mixers and optimizers on standardized tasks would provide valuable guidance for configuration selection.

7. Training Stability Guarantees: The current implementation includes NaN/Inf guards and gradient clipping, but formal analysis of stability conditions for different mixer-optimizer combinations, particularly with looped blocks and aggressive quantization, remains open.

8. Multimodal and Task-Specific Extensions: The current design focuses on sequence modeling. Extensions for vision-language models, multimodal architectures, and task-specific fine-tuning workflows (e.g., instruction tuning, RLHF) would broaden applicability.

The research trajectory of sequence modeling continues toward hybrid approaches that combine the strengths of multiple paradigms: compression from recurrence, selectivity from attention, gating for memory management, and sparsity for efficiency. A unified experimentation platform like Frankestein Transformer is increasingly valuable as this convergence accelerates, enabling researchers to systematically explore this expanding design space with reproducible, well-engineered infrastructure.

Bibliography

[1] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. URL http://arxiv.org/abs/2501.00663.

[2] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. URL https://arxiv.org/abs/2501.00663.

[3] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020. URL https://arxiv.org/abs/2004.05150.

[4] Thibaut Boissin, Thomas Massena, Franck Mamalet, and Mathieu Serrurier. Turbo-Muon: Accelerating orthogonality-based optimization with pre-conditioning. URL https://arxiv.org/abs/2512.04632.

[5] Riccardo Bravin, Massimo Pavan, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, and Manuel Roveri. EmbBERT: Attention under 2 MB memory. URL http://arxiv.org/abs/2502.10001.

[6] Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, and Zhuang Liu. Stronger normalization-free transformers. URL http://arxiv.org/abs/2512.10938.

[7] Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms. URL https://arxiv.org/abs/2302.06675.

[8] Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, Han Zhang, Huishuai Zhang, Dongyan Zhao, and Wenfeng Liang. Conditional memory via scalable lookup: A new axis of sparsity for large language models, 2026. URL https://arxiv.org/abs/2601.07372.

[9] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019. URL https://arxiv.org/abs/1904.10509.

[10] Aaron Defazio, Xingyu Alice Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, and Ashok Cutkosky. The road less scheduled. URL https://arxiv.org/abs/2405.15682.

[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. URL http://arxiv.org/abs/1810.04805.

[12] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. URL https://arxiv.org/abs/2312.00752.

[13] Albert Gu, AI21 Labs, et al. Jamba: A hybrid transformer-mamba language model, 2024. URL https://arxiv.org/abs/2403.19887.

[14] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. URL https://arxiv.org/abs/1802.09568.

[15] Sukjun Hwang, Aakash Lahoti, Tri Dao, and Albert Gu. Hydra: Bidirectional state space models through generalized matrix mixers. URL http://arxiv.org/abs/2407.09941.

[16] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning, 2020. URL https://arxiv.org/abs/2004.11362.

[17] Mike Lewis, Shruti Bhosale, Tim Dettmers, Douwe Kiela, and Luke Zettlemoyer. BASE layers: Simplifying training of large, sparse models, 2021. URL https://arxiv.org/abs/2103.16716.

[18] Kaizhao Liang, Lizhang Chen, Bo Liu, and Qiang Liu. Cautious optimizers: Improving training with one line of code. URL https://arxiv.org/abs/2411.16085.

[19] Zhixuan Lin, Ke Wang, et al. Forgetting transformer: Softmax attention with a forget gate, 2025. URL https://arxiv.org/abs/2503.02130.

[20] Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. URL https://arxiv.org/abs/2305.14342.

[21] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. URL https://arxiv.org/abs/1908.03265.

[22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. URL https://arxiv.org/abs/1711.05101.

[23] Tianyu Lou, Zheyu Chen, Tao Yu, et al. Efficient sparse attention for long-range transformers, 2024. URL https://arxiv.org/abs/2406.16747.

[24] Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner. URL https://arxiv.org/abs/2306.06101.

[25] Niklas Muennighoff et al. Matryoshka representation learning, 2022. URL https://arxiv.org/abs/2205.13147.

[26] Matteo Pagliardini, Pierre Ablin, and David Grangier. The AdEMAMix optimizer: Better, faster, older. URL https://arxiv.org/abs/2409.03137.

[27] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964. doi: 10.1016/0041-5553(64)90137-5.

[28] Zhen Qin, Xu Han, et al. HGRN2: Gated linear RNNs with state expansion, 2024. URL https://arxiv.org/abs/2404.07904.

[29] Yuxiang Qiu, Qwen Team, et al. Gated attention for large language models, 2025. URL https://arxiv.org/abs/2505.06708.

[30] Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, and Russ Webb. Theory, analysis, and best practices for sigmoid self-attention. URL https://arxiv.org/abs/2409.04431.

[31] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. URL http://arxiv.org/abs/1908.10084.

[32] Hema Hariharan Samson. Lightweight transformer architectures for edge devices in real-time applications. URL http://arxiv.org/abs/2601.03290.

[33] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. URL https://arxiv.org/abs/1804.04235.

[34] Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of Muon. URL https://arxiv.org/abs/2505.23737.

[35] Rafael Soares et al. Griffin: Mixing gated linear recurrences with local attention for efficient sequence modeling, 2024. URL https://arxiv.org/abs/2402.19427.

[36] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. URL https://arxiv.org/abs/2307.08621.

[37] Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, and Yutaka Matsuo. ADOPT: Modified Adam can converge with any β2 with the optimal rate. URL https://arxiv.org/abs/2411.02853.

[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. URL https://arxiv.org/abs/1706.03762.

[39] Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and stabilizing Shampoo using Adam. URL https://arxiv.org/abs/2409.11321.

[40] Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. BitNet: Scaling 1-bit transformers for large language models. URL http://arxiv.org/abs/2310.11453.

[41] Zhe Wang, Ming Liu, et al. FASA: Frequency-aware sparse attention, 2026. URL https://arxiv.org/abs/2602.03152.

[42] Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive Nesterov momentum algorithm for faster optimizing deep models. URL https://arxiv.org/abs/2208.06677.

[43] Haiqi Yang, Zhiyuan Li, Yi Chang, and Yuan Wu. A survey of retentive network. URL http://arxiv.org/abs/2506.06708.

[44] Songlin Yang, Bailin Wang, et al. Gated linear attention transformers with hardware-efficient training, 2023. URL https://arxiv.org/abs/2312.06635.

[45] Songlin Yang, Bailin Wang, et al. Parallelizing linear transformers with the delta rule over sequence length, 2024. URL https://arxiv.org/abs/2406.06484.

[46] Songlin Yang, Bailin Wang, et al. Gated delta networks: Improving Mamba2 with delta rule, 2024. URL https://arxiv.org/abs/2412.06464.

[47] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. URL https://arxiv.org/abs/1904.00962.

[48] Han Yuan, DeepSeek-AI, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention, 2025. URL https://arxiv.org/abs/2502.11089.

[49] Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, and Quanquan Gu. MARS: Unleashing the power of variance reduction for training large models. URL https://arxiv.org/abs/2411.10438.

[50] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big Bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems, 2020. URL https://arxiv.org/abs/2007.14062.

[51] Biao Zhang and Rico Sennrich. Root mean square layer normalization. URL http://arxiv.org/abs/1910.07467.

[52] Jing Zhang, Peng Zhang, Baiwen Kong, Junqiu Wei, and Xin Jiang. Continuous self-attention models with neural ODE networks. Proceedings of the AAAI Conference on Artificial Intelligence, 35(16):14393–14401. doi: 10.1609/aaai.v35i16.17692. URL https://ojs.aaai.org/index.php/AAAI/article/view/17692.

[53] Yichi Zhang, Yizhong Wang, et al. Accurate and training-free sparse attention accelerating any model inference, 2025. URL https://arxiv.org/abs/2502.18137.

[54] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. URL https://arxiv.org/abs/2403.03507.

[55] Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, and Jinwon Lee. APOLLO: SGD-like memory, AdamW-level performance, 2025. URL https://arxiv.org/abs/2412.05270.

[56] Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, and Zhuang Liu. Transformers without normalization. URL http://arxiv.org/abs/2503.10622.

[57] Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, and Xinggang Wang. Mixture-of-depths attention, 2026. URL https://arxiv.org/abs/2603.15619.

A Annex A: Optimizer Families


![Table 1](paper-38-v4_images/table_1.png)
*Table 1*

A.1 Memory and Complexity Comparison

Table 8 summarizes the memory overhead (number of state buffers per parameter), per-step computational complexity, and key hyperparameters for all supported optimizers. Memory overhead is expressed in terms of the number of state tensors maintained per model parameter, where each tensor has the same shape as the parameter.

![Table 2](paper-38-v4_images/table_2.png)
*Table 2*

| Optimizer | State Buffers | Per-Step Cost | Key Hyperparams |
|---|---|---|---|
| SGD+Momentum | 1 (m) | O(n) | lr, momentum, wd |
| AdamW | 2 (m, v) | O(n) | lr, β1, β2, eps, wd |
| RAdam | 2 (m, v) | O(n) | lr, β1, β2, eps, wd |
| Adan | 3 (m, v, s) | O(n) | lr, β1, β2, β3, eps, wd |
| ADOPT | 2 (m, v) | O(n) | lr, β1, β2, eps, wd |
| AdEMAMix | 3 (m1, m2, v) | O(n) | lr, β1, β2, β3, eps, wd |
| MARS | 3 (m, v, z) | O(n) | lr, β1, β2, eps, wd, γ |
| Cautious AdamW | 2 (m, v) | O(n) | lr, β1, β2, eps, wd |
| LAMB | 2 (m, v) | O(n) | lr, β1, β2, eps, wd |
| Schedule-Free | 3 (z, x, n) | O(n) | lr, β1, β2, wd |
| Adafactor | 1–2 (row/col) | O(n) | lr, β1, β2, eps, wd, factored |
| GaLore | 2 (m, v) + SVD/proj | O(nr) | lr, rank r, β1, β2, eps, wd |
| Prodigy | 3 (m, v, d) | O(n) | β1, β2, eps, wd, d0 |
| Lion | 1 (m) | O(n) | lr, β1, β2, wd |
| Shampoo | 2d (Li, Ri) | O(n^(1+2/d)) | lr, eps, wd, matrix eps |
| SOAP | 2d + 2 (m, v) | O(n^(1+1/d)) | lr, β1, β2, eps, wd, shampoo eps |
| Sophia | 3 (m, v, h) | O(n) | lr, β1, β2, eps, wd, k |
| Muon | 2 (m, v) | O(n · k) | lr, momentum, wd, NS steps k |
| Turbo-Muon | 2 (m, v) | O(n · k) | lr, momentum, wd, NS steps k |
| APOLLO | 2 low-rank (mR, vR) + proj | O(nr) | lr, rank r, update gap, scale, betas, eps |
| APOLLO-Mini | 2 rank-1 (mR, vR) + proj | O(n) | lr, update gap, scale, betas, eps, wd |
| Q-APOLLO | 2 quantized low-rank + proj | O(nr) | lr, rank r, update gap, scale, quant bits |

Table 8: Memory and complexity comparison of all supported optimizers. n = number of parameters, d = tensor dimensionality, r = low-rank dimension, k = Newton–Schulz iteration count. State buffers lists the number of optimizer state tensors maintained per parameter group. Per-step cost focuses on the dominant additional cost beyond the gradient computation itself.

A.2 The Evolution of Optimization in Neural Networks

The optimizer survey frames transformer optimization as a response to three structural pressures:
non-convex loss landscapes, severe curvature heterogeneity across parameter blocks, and the
memory cost of storing optimizer state for very large models. The report argues that the field
has diverged into several trajectories: adaptive first-order baselines, variance-reduction methods,
memory-efficient methods, structured second-order preconditioners, schedule-free methods, and
orthogonality-oriented updates.

A.3 Standard Baseline and Adaptive Optimizers

SGD with Momentum. The classical update accumulates a momentum buffer and then applies a fixed learning rate. At each iteration $t$, the algorithm maintains an exponential moving average $m_t$ of past gradients:

$$m_t = \beta m_{t-1} + g_t \tag{1}$$

$$\theta_{t+1} = \theta_t - \eta m_t \tag{2}$$

where $g_t = \nabla f(\theta_t)$, $\beta \in [0.8, 0.99]$ controls the momentum decay, and $\eta$ is the fixed learning rate. The momentum term acts as velocity: it accelerates movement in consistent directions while dampening oscillations in variable directions. Its strengths are low memory overhead (a single momentum buffer per parameter) and strong generalization when tuned carefully. Its main weakness in transformer workloads is poor robustness to heterogeneous curvature (different Hessian spectra across parameter groups) and strong dependence on learning-rate schedules.

Algorithm 4 SGD with Momentum
Require: Initial parameters θ0, learning rate η, momentum coefficient β, weight decay λ
Ensure: Updated parameters θT
1: Initialize: momentum buffer m ← 0
2: for t = 0, 1, 2, . . . , T − 1 do
3:   Compute gradient: gt ← ∇f(θt)
4:   if weight_decay > 0 then
5:     gt ← gt + λθt ▷ L2 regularization
6:   end if
7:   Update momentum: m ← β · m + gt
8:   Update parameters: θt+1 ← θt − η · m
9: end for
10: return θT

Adam and AdamW. Adam tracks exponential moving averages of both the first moment (mean) and second moment (uncentered variance) to achieve element-wise adaptive learning rates. The update rule is:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{3}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{4}$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad \text{(bias correction)} \tag{5}$$

$$\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{6}$$

AdamW introduces a crucial modification: weight decay is applied directly to parameters (decoupled from gradients), $\theta_t \leftarrow \theta_t(1 - \eta\lambda)$, rather than adding $\lambda\theta_t$ to the gradient. This decoupling prevents the adaptive scaling from interfering with regularization strength. The report treats AdamW as the practical baseline for transformer fine-tuning because it converges quickly and is relatively forgiving to hyperparameter variation. The tradeoff is memory cost: both moment tensors must be stored for every parameter, doubling optimizer-state memory relative to SGD.

Algorithm 5 AdamW (Adam with Decoupled Weight Decay)
Require: Initial parameters θ0, learning rate η, exponential decay rates β1, β2 ∈ [0, 1)
Require: Numerical stability constant ϵ, weight decay λ
Ensure: Updated parameters θT
1: Initialize: first moment m ← 0, second moment v ← 0, step counter t ← 0
2: for t = 1, 2, . . . , T do
3:   Compute gradient: gt ← ∇f(θt−1)
4:   Weight decay (decoupled): θt−1 ← θt−1(1 − ηλ)
5:   Update moments: m ← β1m + (1 − β1)gt
6:   v ← β2v + (1 − β2)gt²
7:   Bias correction: m̂ ← m/(1 − β1^t), v̂ ← v/(1 − β2^t)
8:   Update parameters: θt ← θt−1 − η(m̂/(√v̂ + ϵ))
9: end for
10: return θT
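For concreteness, one AdamW step written as plain tensor operations, mirroring Algorithm 5 (a sketch for exposition, not PyTorch's internal implementation):

```python
import torch

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One decoupled-weight-decay Adam step; tensors are updated in place."""
    p.mul_(1 - lr * wd)                            # decay acts on weights directly
    m.mul_(beta1).add_(g, alpha=1 - beta1)         # first moment
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)  # second moment
    m_hat = m / (1 - beta1 ** t)                   # bias correction
    v_hat = v / (1 - beta2 ** t)
    p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)
    return p, m, v
```

Note that the weight-decay line touches `p` before the adaptive step and never enters `m` or `v`, which is exactly the decoupling that distinguishes AdamW from L2-regularized Adam.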

RAdam (Rectified Adam). RAdam addresses Adam's instability in early training by dynamically rectifying the adaptive learning rate. The key observation is that the second moment $v_t$ has very high variance in the first few steps, causing unreliable adaptive scaling. RAdam computes the effective simple moving average (SMA) window length:

$$\rho_t = \rho_\infty - \frac{2 t \beta_2^t}{1 - \beta_2^t}, \qquad \text{where} \quad \rho_\infty = \frac{2}{1 - \beta_2} - 1$$

When $\rho_t > 4$ (sufficient samples for variance estimation), RAdam applies adaptive scaling with a rectification term:

$$r_t = \sqrt{\frac{(\rho_t - 4)(\rho_t - 2)\,\rho_\infty}{(\rho_\infty - 4)(\rho_\infty - 2)\,\rho_t}}, \qquad \theta_{t+1} = \theta_t - \eta\, r_t\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

When $\rho_t \le 4$, RAdam falls back to SGD with momentum: $\theta_{t+1} = \theta_t - \eta \hat{m}_t$. This graceful transition eliminates the need for manual learning-rate warmup schedules and improves robustness.

Algorithm 6 RAdam (Rectified Adam)
Require: Initial parameters θ0, learning rate η, β1, β2, weight decay λ
Ensure: Updated parameters θT
1: Initialize: m ← 0, v ← 0, t ← 0
2: Compute: ρ∞ ← 2/(1 − β2) − 1
3: for t = 1, 2, . . . , T do
4:   gt ← ∇f(θt−1)
5:   if weight_decay > 0 then
6:     θt−1 ← θt−1(1 − ηλ)
7:   end if
8:   m ← β1m + (1 − β1)gt
9:   v ← β2v + (1 − β2)gt²
10:   m̂ ← m/(1 − β1^t), v̂ ← v/(1 − β2^t)
11:   ρt ← ρ∞ − 2tβ2^t/(1 − β2^t)
12:   if ρt > 4 then
13:     rt ← √(((ρt − 4)(ρt − 2)ρ∞)/((ρ∞ − 4)(ρ∞ − 2)ρt))
14:     θt ← θt−1 − ηrt m̂/(√v̂ + ϵ)
15:   else
16:     θt ← θt−1 − ηm̂ ▷ SGD with momentum
17:   end if
18: end for
19: return θT

A.4 Advanced Momentum and Variance Reduction (2024–2025)

Adan (Adaptive Nesterov Momentum). Adan reformulates Nesterov acceleration without the extra gradient computation required by classical Nesterov SGD. The algorithm maintains three momentum buffers:

$$m_t = (1 - \beta_1)m_{t-1} + \beta_1 g_t \quad \text{(first moment)} \tag{7}$$

$$v_t = (1 - \beta_2)v_{t-1} + \beta_2 (g_t - g_{t-1}) \quad \text{(velocity/gradient difference)} \tag{8}$$

$$n_t = (1 - \beta_3)n_{t-1} + \beta_3 \left[g_t + (1 - \beta_1)(g_t - g_{t-1})\right]^2 \quad \text{(Nesterov second moment)} \tag{9}$$

The Nesterov Momentum Estimation (NME) term $\bar{g}_t = g_t + (1 - \beta_1)(g_t - g_{t-1})$ estimates the gradient at a future position without evaluating it. The update combines momentum with velocity:

$$\bar{m}_t = m_t + (1 - \beta_1)v_t, \qquad \theta_{t+1} = \theta_t - \eta\, \frac{\bar{m}_t}{\sqrt{n_t} + \epsilon}$$

Adan achieves fast convergence across diverse architectures (CNNs, GANs, Transformers) through this acceleration. The cost is maintaining three momentum-like buffers, increasing memory overhead relative to Adam.

Algorithm 7 Adan (Adaptive Nesterov Momentum)
Require: Initial parameters θ0, learning rate η, β1, β2, β3 ∈ (0, 1)
Ensure: Updated parameters θT
1: Initialize: m ← 0, v ← 0, n ← 0, g−1 ← 0 (previous gradient)
2: for t = 1, 2, . . . , T do
3:   gt ← ∇f(θt−1)
4:   m ← (1 − β1)m + β1gt
5:   ∆gt ← gt − gt−1
6:   v ← (1 − β2)v + β2∆gt
7:   Nesterov estimation: ḡt ← gt + (1 − β1)∆gt
8:   n ← (1 − β3)n + β3ḡt²
9:   Combined momentum: m̄ ← m + (1 − β1)v
10:   θt ← θt−1 − η(m̄/(√n + ϵ))
11:   gt−1 ← gt
12: end for
13: return θT

ADOPT (Adam with Optimal Pruned Tuning). ADOPT fixes a fundamental theoretical issue in Adam: the gradient appears in both the first moment and second moment estimates, creating circularity. ADOPT decouples them by using the previous step's second moment for the denominator:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{10}$$

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, \frac{g_t}{\sqrt{v_{t-1}} + \epsilon} \quad \text{(uses } v_{t-1}\text{, not } v_t\text{)} \tag{11}$$

$$\theta_{t+1} = \theta_t - \eta m_t \tag{12}$$

This simple reordering achieves the optimal convergence rate $O(1/\sqrt{T})$ with any choice of $\beta_2 \in (0, 1)$, without bounded-noise assumptions. The practical consequence is that ADOPT is a drop-in replacement for Adam with stronger theoretical guarantees and comparable or superior empirical performance across vision, NLP, RL, and generative modeling domains.

Algorithm 8 ADOPT (Adam with Optimal Pruned Tuning)
Require: Initial parameters θ0, learning rate η, β1, β2, tolerance ϵ
Ensure: Updated parameters θT
1: Initialize: m ← 0, v ← 0
2: for t = 1, 2, . . . , T do
3:   gt ← ∇f(θt−1)
4:   if weight_decay > 0 then
5:     θt−1 ← θt−1(1 − ηλ)
6:   end if
7:   Use previous second moment: denom ← √(max(v, ϵ)) + ϵ
8:   Update first moment using previous variance: m ← β1m + (1 − β1)(gt/denom)
9:   Update second moment with current gradient: v ← β2v + (1 − β2)gt²
10:   θt ← θt−1 − ηm
11: end for
12: return θT

AdEMAMix (Exponential Moving Average Mixture). AdEMAMix replaces Adam's single EMA of gradients with a mixture of two EMAs: one fast-decaying and one slow-decaying. This addresses the observation that gradients remain informative over tens of thousands of steps, not just hundreds:

$$m_{1,t} = \beta_1 m_{1,t-1} + (1 - \beta_1) g_t \quad \text{(fast EMA, } \beta_1 \approx 0.9\text{)} \tag{13}$$

$$m_{2,t} = \beta_3 m_{2,t-1} + (1 - \beta_3) g_t \quad \text{(slow EMA, } \beta_3 \approx 0.9999\text{)} \tag{14}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{15}$$

$$m_t = m_{1,t} + \alpha_t m_{2,t} \quad \text{(mixture with scheduled weight } \alpha_t\text{)} \tag{16}$$

$$\theta_{t+1} = \theta_t - \eta\, \frac{m_t}{\sqrt{v_t} + \epsilon} \tag{17}$$

The mixture weight $\alpha_t$ typically increases during training, allowing the fast EMA to provide immediate adaptation while the slow EMA accumulates long-range gradient correlations. Empirically, a 1.3B model on 101B tokens achieves similar loss to AdamW on 197B tokens (a 95% data-efficiency gain), suggesting the slow EMA significantly reduces model forgetting.

Algorithm 9 AdEMAMix (Exponential Moving Average Mixture)
Require: Initial parameters θ0, learning rate η, β1 (fast), β3 (slow), β2 (second moment)
Require: Mixture weight αt (typically increasing), tolerance ϵ
Ensure: Updated parameters θT
1: Initialize: m1 ← 0 (fast EMA), m2 ← 0 (slow EMA), v ← 0
2: for t = 1, 2, . . . , T do
3:   gt ← ∇f(θt−1)
4:   if weight_decay > 0 then
5:     θt−1 ← θt−1(1 − ηλ)
6:   end if
7:   m1 ← β1m1 + (1 − β1)gt ▷ Fast EMA
8:   m2 ← β3m2 + (1 − β3)gt ▷ Slow EMA
9:   v ← β2v + (1 − β2)gt²
10:   mt ← m1 + αtm2 ▷ Mixture
11:   θt ← θt−1 − η(mt/(√v + ϵ))
12: end for
13: return θT

MARS (Make Adaptive learning Rates Shine). MARS combines preconditioned gradient methods (e.g., AdamW) with variance reduction via scaled stochastic recursive momentum. The core innovation is a variance-reduced gradient estimate:

$$c_t = g_t + \gamma (c_{t-1} - g_{t-1}) \quad \text{(SVRG-style recursive estimate)} \tag{18}$$

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) c_t \tag{19}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) c_t^2 \tag{20}$$

$$\theta_{t+1} = \theta_t - \eta\, \frac{m_t}{\sqrt{v_t} + \epsilon} \tag{21}$$

where $\gamma \in [0.01, 0.1]$ controls how much historical gradient information is retained. The variance-reduced gradient $c_t$ acts as an implicit noise filter, amplifying consistent signal directions and dampening contradictory noise. MARS-AdamW consistently outperforms bare AdamW by significant margins on GPT-2 pretraining, suggesting that variance reduction substantially improves convergence in mini-batch stochastic training.

Algorithm 10 MARS (Make Adaptive learning Rates Shine)
Require: Initial parameters θ0, learning rate η, β1, β2, weight decay λ
Require: Variance reduction coefficient γ ∈ [0.01, 0.1]
Ensure: Updated parameters θT
1: Initialize: m ← 0, v ← 0, c ← 0 (variance-reduced gradient), g−1 ← 0
2: for t = 1, 2, . . . , T do
3:   gt ← ∇f(θt−1)
4:   Variance reduction: c ← gt + γ(c − gt−1)
5:   m ← β1m + (1 − β1)c
6:   v ← β2v + (1 − β2)c²
7:   if weight_decay > 0 then
8:     θt−1 ← θt−1(1 − ηλ)
9:   end if
10:   θt ← θt−1 − η(m/(√v + ϵ))
11:   gt−1 ← gt
12: end for
13: return θT

Cautious Optimizers (Cautious AdamW, Cautious Lion). The Cautious framework applies a one-line modification to any momentum-based optimizer: mask the update so that only dimensions where the momentum and gradient directions agree are applied:

$$\text{mask}_i = \begin{cases} 1 & \text{if } m_t[i] \cdot g_t[i] > 0 \text{ (agreement)} \\ 0 & \text{otherwise} \end{cases} \tag{22}$$

$$u_t = \frac{m_t}{\sqrt{v_t} + \epsilon} \odot \text{mask} \quad \text{(element-wise masking)} \tag{23}$$

$$\theta_{t+1} = \theta_t - \eta u_t \tag{24}$$

The intuition: momentum $m_t$ estimates the gradient direction from history; the current gradient $g_t$ is an instantaneous signal. When both agree, the optimizer is confident and should update aggressively. When they disagree, the optimizer is conflicted: the historical trend points toward a region that may have been good before, but current evidence contradicts it. By masking conflicted dimensions, Cautious becomes more conservative and avoids corrupted steps. Empirically, this simple masking achieves up to 1.47× speedup on Llama and MAE pretraining while preserving convergence guarantees.

Algorithm 11 Cautious AdamW (Consensus-Based Update Masking)
Require: Initial parameters θ0, learning rate η, β1, β2, weight decay λ
Ensure: Updated parameters θT
1: Initialize: m ← 0, v ← 0
2: for t = 1, 2, . . . , T do
3:   gt ← ∇f(θt−1)
4:   m ← β1m + (1 − β1)gt
5:   v ← β2v + (1 − β2)gt²
6:   Consensus mask: maski ← 1 if m[i] · gt[i] > 0, else 0 ▷ Agreement
7:   Base Adam update: u ← m/(√v + ϵ)
8:   Apply mask: umasked ← u ⊙ mask ▷ Element-wise
9:   if weight_decay > 0 then
10:     θt−1 ← θt−1(1 − ηλ)
11:   end if
12:   θt ← θt−1 − ηumasked
13: end for
14: return θT
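The "one line of code" is nearly literal; a sketch of the masking applied to any base update `u` (names hypothetical; the paper also discusses rescaling the surviving coordinates, which is omitted here):

```python
import torch

def cautious_mask(u: torch.Tensor, m: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """Zero out update dimensions where momentum and current gradient disagree."""
    mask = (m * g > 0).to(u.dtype)  # 1 where signs agree, 0 otherwise
    return u * mask                 # conflicted coordinates are skipped this step
```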

A.5 Large-Batch, Memory-Efficient, and Parameter-Free Optimizers

LAMB (Layer-wise Adaptive Moments optimizer for Batch training). LAMB extends Adam with layer-wise adaptive rate scaling (inspired by LARS), enabling stable training with extreme batch sizes (e.g., 64K on BERT):

$$m_t^L = \beta_1 m_{t-1}^L + (1 - \beta_1) g_t^L \quad \text{(layer-wise first moment)} \tag{25}$$

$$v_t^L = \beta_2 v_{t-1}^L + (1 - \beta_2) (g_t^L)^2 \tag{26}$$

$$u_{\text{adam}}^L = \frac{m_t^L}{\sqrt{v_t^L} + \epsilon} + \lambda \theta_{t-1}^L \quad \text{(base adaptive step with decay)} \tag{27}$$

$$\phi^L = \frac{\|\theta_{t-1}^L\|_2}{\|u_{\text{adam}}^L\|_2} \quad \text{(trust ratio: layer normalization)} \tag{28}$$

$$\theta_t^L = \theta_{t-1}^L - \eta \cdot \phi^L \cdot u_{\text{adam}}^L \tag{29}$$

The trust ratio $\phi^L = \|\theta^L\|_2 / \|u_{\text{adam}}^L\|_2$ normalizes the effective update step relative to the weight magnitude. In very large batches, stochastic gradient noise can exceed signal, destabilizing layer-wise learning rates. By scaling updates proportionally to weight norms, LAMB ensures relative changes rather than absolute ones, preventing small confident updates in large vectors from dominating. LAMB preserves small-batch generalization benefits while enabling efficient large-batch training.

Algorithm 12 LAMB (Layer-wise Adaptive Moments optimizer for Batch training)
Require: Initial parameters θ0, learning rate η, β1, β2, weight decay λ
Ensure: Updated parameters θT
1: Initialize: for each layer L: mL ← 0, vL ← 0
2: for t = 1, 2, . . . , T do
3:   for each layer L do
4:     gtL ← ∇f(θtL)
5:     mL ← β1mL + (1 − β1)gtL
6:     vL ← β2vL + (1 − β2)(gtL)²
7:     Base Adam: uLadam ← mL/(√vL + ϵ)
8:     if weight_decay > 0 then
9:       uLadam ← uLadam + λθLt−1
10:     end if
11:     Trust ratio: φL ← ∥θLt−1∥2/∥uLadam∥2 if both nonzero, else 1
12:     θLt ← θLt−1 − ηφLuLadam
13:   end for
14: end for
15: return θT

Schedule-Free AdamW. Schedule-free methods remove explicit scheduler design from the optimization recipe. The algorithm maintains two parameter streams: $z_t$ (exploration) and $x_t$ (smooth average):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(variance)} \tag{30}$$

$$z_t = z_{t-1}(1 - \eta\lambda) - \eta\, \frac{g_t}{\sqrt{v_t} + \epsilon} \quad \text{(adaptive step with decay)} \tag{31}$$

$$c_t = \frac{1}{t + 1} \quad \text{(averaging coefficient, decreases over time)} \tag{32}$$

$$x_t = (1 - c_t) x_{t-1} + c_t z_t \quad \text{(iterate averaging)} \tag{33}$$

$$y_t = (1 - \beta) z_t + \beta x_t \quad \text{(interpolation for evaluation point)} \tag{34}$$

The averaging coefficient $c_t = 1/(t+1)$ implements an implicit learning-rate schedule without explicitly specifying the total number of steps $T$. This framework unifies scheduling and iterate averaging: the algorithm explores via $z_t$ while accumulating a stable direction via $x_t$. The evaluation point $y_t$ (used for gradient computation) interpolates between exploration and stability. Schedule-Free achieves state-of-the-art convergence across convex optimization, large-scale deep learning, and reinforcement learning, while removing a major hyperparameter.

Algorithm 13 Schedule-Free AdamW
Require: Initial parameters θ0, learning rate η (fixed), β1, β2, weight decay λ
Ensure: Updated parameters θT or average xT
1: Initialize: z ← θ0 (exploration), x ← θ0 (average), v ← 0, t ← 0
2: for t = 1, 2, . . . , T do
3:   gt ← ∇f(yt−1) (gradient at interpolation point)
4:   v ← β2v + (1 − β2)gt² ▷ Variance
5:   Weight decay on z: z ← z(1 − ηλ)
6:   z ← z − ηgt/(√v + ϵ) ▷ Adaptive step on raw iterate
7:   ct ← 1/(t + 1) ▷ Averaging coefficient
8:   x ← (1 − ct)x + ctz ▷ Iterate averaging
9:   yt ← (1 − β1)z + β1x ▷ Interpolation for next eval
10: end for
11: return xT or yT

Adafactor. Adafactor reduces optimizer memory by factorizing second-moment statistics for matrix-shaped parameters, storing only row and column accumulators rather than dense variance states. For a gradient matrix $G_t \in \mathbb{R}^{m \times n}$:

$$R_t = \beta_2 R_{t-1} + (1 - \beta_2)(G_t^2)\mathbf{1}_n \quad \text{(row variance)} \tag{35}$$

$$C_t = \beta_2 C_{t-1} + (1 - \beta_2)\mathbf{1}_m^\top (G_t^2) \quad \text{(column variance)} \tag{36}$$

$$\hat{V}_t = \frac{R_t C_t}{\mathbf{1}_m^\top R_t} \quad \text{(reconstructed variance via outer product)} \tag{37}$$

$$U_t = \frac{G_t}{\sqrt{\hat{V}_t} + \epsilon} \quad \text{(normalized adaptive step)} \tag{38}$$

$$\hat{U}_t = \frac{U_t}{\max(1, \mathrm{RMS}(U_t))} \quad \text{(stability clipping)} \tag{39}$$

Instead of storing $m \times n$ second-moment values, Adafactor stores only $m + n$ accumulators, reducing memory from $O(n_{\text{params}})$ to roughly $O(\sqrt{n_{\text{params}}})$ for square matrices. This is most attractive when VRAM is dominated by optimizer state rather than activations. The tradeoff is reduced optimization expressiveness and potential instability on some tasks.

Algorithm 14 Adafactor (Factorized Second-Moment Statistics)
Require: Gradient matrix Gt ∈ R^(m×n), learning rate η, β2 ≈ 0.999
Ensure: Updated parameters
1: Initialize: row accumulator R ← ϵ1m, column accumulator C ← ϵ1n⊤
2: for t = 1, 2, . . . , T do
3:   Gt ← ∇f(θt) (matrix gradient)
4:   R ← β2R + (1 − β2)(Gt²)1n ▷ Row sums of squared gradient
5:   C ← β2C + (1 − β2)1m⊤(Gt²) ▷ Column sums of squared gradient
6:   Reconstructed variance: V̂ ← (R · C)/(1m⊤R) ▷ Via outer product
7:   Normalized adaptive step: Ut ← Gt/(√V̂ + ϵ)
8:   RMS clipping: Ût ← Ut/max(1, RMS(Ut))
9:   θt+1 ← θt − ηÛt
10: end for
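The factorization can be checked numerically in a few lines; this is a sketch of the rank-1 variance reconstruction only, with the accumulator EMA and epsilon initialization omitted:

```python
import torch

g = torch.randn(512, 2048)             # matrix gradient G
r = (g * g).sum(dim=1, keepdim=True)   # row accumulator, shape (512, 1)
c = (g * g).sum(dim=0, keepdim=True)   # column accumulator, shape (1, 2048)
v_hat = r @ c / r.sum()                # rank-1 reconstruction of the variance
u = g / (v_hat.sqrt() + 1e-30)         # normalized adaptive step
print(v_hat.shape)                     # full (512, 2048) from 512 + 2048 numbers
```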

GaLore (Gradient Low-Rank Projection). GaLore projects 2D gradients into a low-rank subspace before optimization, reducing optimizer-state memory while maintaining adaptive step scaling. For a 2D parameter matrix $W \in \mathbb{R}^{m \times n}$:

$$U, S, V = \mathrm{SVD}(G_t) \quad \text{(compute singular decomposition)} \tag{40}$$

$$P \in \mathbb{R}^{n \times r} \quad \text{or} \quad P^\top \in \mathbb{R}^{r \times m} \quad \text{(select top-}r\text{ singular vectors)} \tag{41}$$

$$G_{\text{low}} = P^\top G_t \quad \text{(project to low-rank space)} \tag{42}$$

$$\Delta_{\text{low}} = \mathrm{Adam}(G_{\text{low}}) \quad \text{(optimize in compressed space)} \tag{43}$$

$$\Delta = P \Delta_{\text{low}} \quad \text{(reconstruct in original space)} \tag{44}$$

By projecting into a rank-$r$ subspace (typically $r \ll \min(m, n)$), optimizer state is reduced from $O(mn)$ to $O(r(m + n))$ for 2D parameters. This complementary memory-saving approach is especially relevant when the model is too large for full-rank optimizer state. GaLore works synergistically with other techniques and has shown strong empirical results on billion-parameter models.

Algorithm 15 GaLore (Gradient Low-Rank Projection)
Require: 2D parameter matrix W ∈ R^(m×n), learning rate η, rank r ≪ min(m, n)
Ensure: Updated parameters
1: Initialize: Adam state in the low-rank space
2: for t = 1, 2, . . . , T do
3:   Gt ← ∇f(Wt)
4:   if t mod K = 1 then ▷ Periodic SVD
5:     Ut, St, Vt⊤ ← SVD(Gt) (full or thin)
6:     Select projection: P ← Ut[:, :r] or P ← Vt[:, :r]
7:   end if
8:   Project gradient: Glow ← P⊤Gt (or GtP⊤ for right projection)
9:   Run Adam on the low-rank gradient: ∆low ← Adam(Glow)
10:   Reconstruct in original space: ∆ ← P∆low (or ∆lowP⊤)
11:   Wt ← Wt − η∆
12: end for

APOLLO, APOLLO-Mini, and Q-APOLLO. The APOLLO family [55] begins from the observation that AdamW's element-wise denominator can be coarsened into a structured learning-rate update. Rather than storing dense moments for every parameter entry, APOLLO projects a matrix gradient $G_t \in \mathbb{R}^{m \times n}$ into a compact random subspace and tracks Adam-style moments there:

$$R_t = P_t G_t \quad \text{or} \quad R_t = G_t P_t^\top \tag{45}$$

$$M_t^R = \beta_1 M_{t-1}^R + (1 - \beta_1) R_t \tag{46}$$

$$V_t^R = \beta_2 V_{t-1}^R + (1 - \beta_2) R_t^2 \tag{47}$$

$$\tilde{R}_t = \frac{M_t^R}{\sqrt{V_t^R} + \epsilon} \tag{48}$$

The projected state is not expanded back into a dense low-rank update as in SVD-based methods. Instead, APOLLO estimates a structured scaling tensor $S_t$ in the original space. In the standard APOLLO variant, that scaling is channel-wise: each row or channel receives its own norm ratio. In APOLLO-Mini, the scaling is reduced to a single tensor-wise scalar, corresponding to the rank-1 extreme described in the paper. The resulting parameter update is therefore Adam-like in adaptation but much closer to SGD in state cost:

$$W_t \leftarrow (1 - \eta\lambda) W_{t-1} - \eta\, \alpha\, (G_t \odot S_t)$$

where $\alpha$ is an extra scale factor used to stabilize highly compressed variants.

The paper's practical contribution is twofold. First, APOLLO replaces expensive repeated SVD with periodically refreshed Gaussian random projection, so the systems burden is ordinary matrix multiplication rather than spectral decomposition. Second, the optimizer is unusually tolerant to extreme compression: even APOLLO-Mini, which keeps only rank-1 auxiliary state, remains competitive with or better than AdamW in the reported pre-training experiments while approaching SGD-level memory cost.

Algorithm 16 APOLLO / APOLLO-Mini Structured Gradient Scaling
Require: Weight matrix W ∈ R^(m×n) with m ≤ n, learning rate η, scale factor α, decay rates (β1, β2), weight decay λ, rank r, projection refresh interval T
Ensure: Updated parameters WT
1: Initialize projected moments: MR ← 0, VR ← 0, step t ← 0
2: repeat
3:   Compute gradient: Gt ← ∇W φ(Wt)
4:   if t mod T = 0 then
5:     Sample Gaussian projector Pt ∼ N(0, 1/r) with a fresh seed
6:   end if
7:   Project gradient: Rt ← PtGt ▷ or GtPt⊤ depending on layout
8:   Update projected AdamW moments: MRt, VRt ← AdamWState(Rt; β1, β2)
9:   Normalize projected state: R̃t ← MRt/(√VRt + ϵ)
10:   if APOLLO then
11:     St ← diag(sR1, . . . , sRm), where sRi = ∥R̃t[i, :]∥2/∥Rt[i, :]∥2
12:   else
13:     St ← sRt, where sRt = ∥R̃t∥2/∥Rt∥2
14:   end if
15:   Update weights: Wt ← (1 − ηλ)Wt−1 − ηα(Gt ⊙ St)
16:   t ← t + 1
17: until convergence
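A sketch of the structured scaling step in isolation (the projector refresh and moment tracking are elided; it assumes the projected gradient `r` was formed so that its rows align with the rows of `g`):

```python
import torch

def apollo_scale(g, r, r_tilde, mini=False):
    """Scale G by comparing projected Adam-step norms to raw projected norms.

    g: (m, n) raw gradient; r, r_tilde: (m, k) projected gradient and its
    normalized Adam-style state (assumed row-aligned with g).
    """
    if mini:  # APOLLO-Mini: one tensor-wise scalar (the rank-1 extreme)
        s = r_tilde.norm() / (r.norm() + 1e-8)
        return g * s
    # APOLLO: one scale per row/channel, as in line 11 of Algorithm 16
    s = r_tilde.norm(dim=1) / (r.norm(dim=1) + 1e-8)  # (m,)
    return g * s.unsqueeze(1)                         # broadcast over columns
```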

APOLLO-Mini. APOLLO-Mini is the family member optimized for extreme memory efficiency. The paper motivates it by arguing that, in a rank-1 compact space, channel-wise scaling becomes too noisy, so the update is coarsened further to a single tensor-wise scale. The loss of granularity is partially offset by the explicit scaling factor α (the paper discusses values such as 128 in this regime), giving a useful point on the optimization Pareto frontier: very small state, no SVD overhead, and still strong pre-training behavior.

Q-APOLLO.
The APOLLO paper also stresses that the family combines naturally with
quantization for ultra-low-memory training. This repository turns that systems observation
into a concrete optimizer variant, q_apollo. In the implementation, APOLLO’s low-rank first
and second moments are stored in quantized form together with per-tensor scales and offsets, and
are dequantized only when needed for the next step. Q-APOLLO therefore preserves APOLLO’s
projected-gradient logic while reducing the precision of the remaining optimizer state, making
it the most aggressive memory-saving member of the local optimizer stack.

Prodigy (Approximating the Distance Estimate). Prodigy adapts the effective step scale through a running distance-like statistic, eliminating the need for explicit learning-rate tuning. The algorithm maintains a cumulative distance estimate:

$$u_t = \frac{g_t}{\sqrt{v_t} + \epsilon} \quad \text{(unnormalized adaptive step)} \tag{49}$$

$$s_t = s_{t-1} + \langle u_t, \theta_t - \theta_0 \rangle \quad \text{(cumulative signed distance)} \tag{50}$$

$$d_t = \max(d_{t-1},\; d_0 + d_{\text{coef}} \cdot s_t) \quad \text{(distance estimate with lower bound)} \tag{51}$$

$$\theta_{t+1} = \theta_t - \eta \cdot d_t \cdot u_t \tag{52}$$

where $d_0$ is an initial bound and $d_{\text{coef}}$ controls how aggressively the estimate adapts. The distance estimate $d_t$ captures the combined magnitude of past gradients weighted by actual parameter displacement, implementing an implicit effective learning rate. This distance-aware scaling achieves robust convergence across problem scales and batch sizes without requiring a learning-rate schedule. Prodigy reduces hyperparameter sensitivity by estimating step sizes from optimization geometry rather than problem-specific prior knowledge.

Algorithm 17 Prodigy (Approximating the Distance Estimate)
Require: Initial parameters θ0, learning rate η, coefficient dcoef, initial distance d0 > 0
Ensure: Updated parameters θT
1: Initialize: s ← 0 (cumulative signed distance), d ← d0
2: for t = 1, 2, . . . , T do
3:   gt ← ∇f(θt−1)
4:   vt ← β2vt−1 + (1 − β2)gt² ▷ Second moment
5:   ut ← gt/(√vt + ϵ) ▷ Unnormalized adaptive step
6:   Update distance: s ← s + ⟨ut, θt − θ0⟩ ▷ Signed distance
7:   d ← max(d, d0 + dcoef · s) ▷ Lower-bounded distance
8:   θt ← θt−1 − η · d · ut ▷ Distance-scaled update
9: end for
10: return θT

A.6 Second-Order, Geometric, and Orthogonality Optimizers

Algorithm 18 Shampoo (Matrix Preconditioning via Kronecker Products)
Require: Initial parameters θ0, learning rate η, eigendecomposition interval K
Ensure: Updated parameters θT
1: Initialize: L ← ϵIm, R ← ϵIn (left and right Gram matrices)
2: for t = 1, 2, . . . , T do
3:   G ← ∇f(θt−1) ▷ Gradient
4:   Update Gram matrices: L ← L + GG⊤, R ← R + G⊤G
5:   if t mod K = 0 then
6:     QL, ΛL ← eigh(L) ▷ Eigendecomposition of L
7:     QR, ΛR ← eigh(R) ▷ Eigendecomposition of R
8:     L^(−1/4) ← QL(ΛL + ϵ)^(−1/4)QL⊤
9:     R^(−1/4) ← QR(ΛR + ϵ)^(−1/4)QR⊤
10:   end if
11:   ∆θ ← L^(−1/4)GR^(−1/4) ▷ Preconditioned gradient
12:   θt ← θt−1 − η · ∆θ
13: end for
14: return θT

Shampoo (Matrix Preconditioning via Kronecker Products). Shampoo is a structured second-order method that computes matrix preconditioners from Kronecker-structured outer-product statistics. For a 2D parameter matrix $W \in \mathbb{R}^{m \times n}$:

$$L_t = L_{t-1} + G_t G_t^\top \quad \text{(left/row Gram matrix, } m \times m\text{)} \tag{53}$$

$$R_t = R_{t-1} + G_t^\top G_t \quad \text{(right/column Gram matrix, } n \times n\text{)} \tag{54}$$

$$L_t^{-1/4} = Q_L (\Lambda_L)^{-1/4} Q_L^\top \quad \text{(via eigendecomposition)} \tag{55}$$

$$R_t^{-1/4} = Q_R (\Lambda_R)^{-1/4} Q_R^\top \tag{56}$$

$$\Delta W_t = L_t^{-1/4} G_t R_t^{-1/4} \quad \text{(preconditioned gradient)} \tag{57}$$

Shampoo approximates the full-matrix Adagrad preconditioner $H^{-1/2}$ (where $H$ is the Hessian) using Kronecker-factored structure. Instead of storing an $(mn) \times (mn)$ preconditioner, Shampoo maintains two smaller matrices ($m \times m$ and $n \times n$), capturing cross-parameter correlations within and across groups. The method has proven effective at scale (Google production systems) and is well-suited to transformer architectures with high-rank structure. Eigendecompositions are performed periodically (e.g., every $K = 10$ steps) to amortize their cost. Tradeoff: higher per-step compute and periodic $O(m^3 + n^3)$ eigendecompositions versus improved conditioning and accelerated convergence.

SOAP (Shampoo with Adam in eigenbasis). SOAP is a simplified variant of Shampoo that decouples preconditioning from momentum tracking. Instead of complex matrix algebra on preconditioned gradients, SOAP runs standard Adam in the eigenbasis of Shampoo's preconditioners:

$$L_t = L_{t-1} + G_t G_t^\top, \qquad R_t = R_{t-1} + G_t^\top G_t \quad \text{(Gram accumulation)} \tag{58}$$

$$Q_L, \Lambda_L = \mathrm{eigh}(L_t), \qquad Q_R, \Lambda_R = \mathrm{eigh}(R_t) \quad \text{(periodic eigendecomposition)} \tag{59}$$

$$G_t^{\text{rot}} = Q_L^\top G_t Q_R \quad \text{(rotate gradient into eigenbasis)} \tag{60}$$

$$m_t^{\text{rot}} = \beta_1 m_{t-1}^{\text{rot}} + (1 - \beta_1) G_t^{\text{rot}} \tag{61}$$

$$v_t^{\text{rot}} = \beta_2 v_{t-1}^{\text{rot}} + (1 - \beta_2) (G_t^{\text{rot}})^2 \quad \text{(Adam in rotated space)} \tag{62}$$

$$U_t^{\text{rot}} = \frac{m_t^{\text{rot}}}{\sqrt{v_t^{\text{rot}}} + \epsilon}, \qquad U_t = Q_L U_t^{\text{rot}} Q_R^\top \quad \text{(rotate back)} \tag{63}$$

The key insight: all momentum tracking occurs in the well-conditioned eigenbasis, simplifying numerical stability and theoretical analysis. SOAP combines the benefits of Shampoo (explicit curvature structure) and Adam (proven momentum mechanics), while being cleaner and often more stable than full Shampoo. Eigendecompositions are recomputed every $K$ steps, amortizing the cost.

Algorithm 19 SOAP (Shampoo with Adam in Eigenbasis)
Require: Initial parameters θ0, learning rate η, β1, β2, eigendecomposition interval K
Ensure: Updated parameters θT
1: Initialize: L ← ϵIm, R ← ϵIn, m ← 0, v ← 0
2: for t = 1, 2, . . . , T do
3:   G ← ∇f(θt−1) ▷ Gradient
4:   Update Gram: L ← L + GG⊤, R ← R + G⊤G
5:   if t mod K = 0 then
6:     QL, ΛL ← eigh(L), QR, ΛR ← eigh(R)
7:   end if
8:   Grot ← QL⊤GQR ▷ Rotate gradient into eigenbasis
9:   m ← β1m + (1 − β1)Grot ▷ Exponential moving average (bias-correct outside)
10:   v ← β2v + (1 − β2)(Grot)² ▷ Second moment (bias-correct outside)
11:   u ← m/(√v + ϵ) ▷ Adaptive step in eigenbasis
12:   ∆θ ← QLuQR⊤ ▷ Rotate back to parameter space
13:   θt ← θt−1 − η · ∆θ
14: end for
15: return θT

Lion (EvoLved Sign Momentum). Lion achieves minimal memory overhead by using sign-based updates instead of full adaptive scaling:

$$c_t = \beta_1 m_t + (1 - \beta_1) g_t \quad \text{(momentum input)} \tag{64}$$

$$\theta_{t+1} = \theta_t - \eta\, (\mathrm{sign}(c_t) + \lambda \theta_t) \quad \text{(sign-operator update)} \tag{65}$$

$$m_{t+1} = \beta_2 m_t + (1 - \beta_2) g_t \quad \text{(momentum accumulation)} \tag{66}$$

All element-wise scaling operations are replaced with sign(), which outputs {−1, 0, +1}. This dramatically reduces memory compared to Adam-like methods: Lion stores only the momentum buffer and no second-moment variance. The tradeoff: sign-based updates sacrifice the element-wise learning-rate adaptation that makes Adam effective. Lion is positioned not as a universally superior optimizer but as a specialized low-memory, high-throughput alternative for scenarios where activation memory dominates and some optimizer sophistication can be sacrificed. It works well in large-batch regimes and when hardware throughput is the primary constraint.

Algorithm 20 Lion (EvoLved Sign Momentum)
Require: Initial parameters θ_0, learning rate η, β1, β2, weight decay λ
Ensure: Updated parameters θ_T
 1: Initialize: m ← 0 (momentum buffer)
 2: for t = 1, 2, . . . , T do
 3:   g ← ∇f(θ_{t−1})    ▷ Gradient
 4:   c ← β1 m + (1 − β1) g    ▷ Momentum input
 5:   θ_t ← θ_{t−1} − η (sign(c) + λ θ_{t−1})    ▷ Sign-based update with weight decay
 6:   m ← β2 m + (1 − β2) g    ▷ Momentum accumulation (decoupled)
 7: end for
 8: return θ_T
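
The whole Lion update fits in a few lines; this PyTorch sketch mirrors Algorithm 20, with illustrative defaults rather than the toolkit's configuration values. Note that only the single state buffer `m` is kept, which is the point of the method.

```python
import torch

@torch.no_grad()
def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.01):
    """One Lion update (Eqs. 64-66): sign step plus decoupled momentum."""
    c = beta1 * m + (1 - beta1) * grad             # momentum input
    param -= lr * (c.sign() + wd * param)          # sign-based update with weight decay
    m.mul_(beta2).add_(grad, alpha=1 - beta2)      # momentum accumulation (decoupled)
    return param, m
```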

Sophia (Second-Order Hessian Information with Optimized Approximation).  Sophia uses diagonal Hessian estimates for curvature-aware scaling without the memory and compute cost of dense second-order methods:

m_t = β1 m_{t−1} + (1 − β1) g_t    (first moment)    (67)

h_t = β2 h_{t−k} + (1 − β2) ĥ_t    (diagonal Hessian, updated every k steps)    (68)

clip(x, C) = min(max(x, −C), C)    (element-wise clipping)    (69)

θ_{t+1} = θ_t − η · clip(m_t / max(γ h_t, ϵ), 1)    (70)

where ĥ_t is a diagonal estimate (e.g., a Hutchinson trace estimator) and γ is a scaling factor. The diagonal Hessian h_t captures local curvature, allowing the optimizer to take smaller steps in sharp directions and larger steps in flat directions. The clipping operation clip(·, 1) prevents adaptive steps from exploding. Sophia belongs to the family of curvature-aware methods seeking better conditioning without the O(n²) or O(n³) cost of full second-order methods. It works particularly well in second-pass fine-tuning scenarios.

Algorithm 21 Sophia (Diagonal Hessian with Clipped Updates)
Require: Initial parameters θ_0, learning rate η, β1, β2, Hessian update freq k, clip threshold C
Ensure: Updated parameters θ_T
 1: Initialize: m ← 0, h ← ϵ (diagonal Hessian estimate)
 2: for t = 1, 2, . . . , T do
 3:   g ← ∇f(θ_{t−1})
 4:   m ← β1 m + (1 − β1) g    ▷ First moment (bias-correct outside)
 5:   if t mod k == 0 then
 6:     ĥ ← HutchinsonEstimate(g)    ▷ Diagonal Hessian via Hutchinson
 7:     h ← β2 h + (1 − β2) ĥ    ▷ Update Hessian estimate
 8:   end if
 9:   u ← m / max(γh, ϵ)    ▷ Adaptive step (element-wise)
10:   u ← clip(u, C)    ▷ Element-wise clipping to [−C, C]
11:   θ_t ← θ_{t−1} − η · u
12: end for
13: return θ_T
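
A compact sketch of the clipped update in Algorithm 21. The diagonal Hessian estimate is passed in precomputed (e.g., from a Hutchinson probe), and all names and defaults are illustrative assumptions rather than the toolkit's API.

```python
import torch

@torch.no_grad()
def sophia_step(param, grad, m, h, t, hess_est=None, lr=1e-4,
                beta1=0.96, beta2=0.99, k=10, gamma=0.01, eps=1e-12, clip_c=1.0):
    """One Sophia-style step: EMA moment, periodic Hessian refresh, clipped update."""
    m.mul_(beta1).add_(grad, alpha=1 - beta1)           # first moment
    if hess_est is not None and t % k == 0:             # refresh every k steps
        h.mul_(beta2).add_(hess_est, alpha=1 - beta2)   # diagonal Hessian EMA
    u = m / torch.clamp(gamma * h, min=eps)             # curvature-scaled step
    param -= lr * u.clamp(-clip_c, clip_c)              # element-wise clipping
    return param, m, h
```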

Muon and Turbo-Muon (Orthogonality-Based Optimizers).  Muon and Turbo-Muon are orthogonality-oriented optimizers that reshape update geometry using Newton-Schulz polynomial iterations for orthogonalization. For a 2D parameter matrix W with gradient G_t:

X_0 = G_t / (∥G_t∥_F + ϵ)    (normalized gradient)    (71)

A_k = X_k X_k^⊤    (Gramian)    (72)

B_k = b A_k + c A_k²    (polynomial step, coefficients b, c from Newton-Schulz)    (73)

X_{k+1} = a X_k + B_k X_k    (iteration k = 0, 1, . . . , 4)    (74)

W_{t+1} = W_t − η X_K    (update with orthogonalized direction)    (75)

Muon performs 5 Newton-Schulz iterations to produce an approximately orthogonal direction,
ensuring updates respect geometric constraints. Turbo-Muon adds an almost-orthogonal pre-
conditioning (AOL) step before iterations, reducing the number of required orthogonalization
steps to 4. The rationale: orthogonal updates preserve norms during training, avoiding the
norm creep observed in momentum-based methods. These optimizers show promise on large-
scale training but require custom CUDA kernels for efficiency. Tradeoff: high per-step compute
(matrix multiplications and polynomial iterations) versus fundamentally better-conditioned up-
date geometry.

Algorithm 22 Muon (Newton-Schulz Orthogonalization)
Require: Initial parameters θ (matrices), learning rate η, Newton-Schulz iters K
Ensure: Updated parameters θ
 1: for each 2D parameter matrix W do
 2:   G ← ∇f(W)    ▷ Gradient
 3:   X_0 ← G / (∥G∥_F + ϵ)    ▷ Normalize gradient by Frobenius norm
 4:   for k = 0, 1, . . . , K − 1 do
 5:     A_k ← X_k X_k^⊤    ▷ Gramian (cost: O(mn²) or O(m²n) for tall/wide)
 6:     X_{k+1} ← (3/2) X_k − (1/2) A_k X_k    ▷ Cubic Newton-Schulz step (a = 3/2, b = −1/2)
 7:   end for
 8:   W ← W − η X_K    ▷ Update with orthogonalized direction
 9: end for
10: return updated θ

Algorithm 23 Turbo-Muon (AOL-Preconditioned Orthogonalization)
Require: Initial parameters θ (matrices), learning rate η, Newton-Schulz iters K′ (typically 4)
Ensure: Updated parameters θ
 1: for each 2D parameter matrix W do
 2:   G ← ∇f(W)    ▷ Gradient
 3:   X_0 ← G / (∥G∥_F + ϵ)    ▷ Normalize gradient
 4:   AOL (almost-orthogonal preconditioner):    ▷ Preconditioning step
 5:   P ← 1.5 I − 0.5 X_0 X_0^⊤    ▷ Preconditioning matrix
 6:   X_0 ← P X_0    ▷ Preconditioned start
 7:   for k = 0, 1, . . . , K′ − 1 do
 8:     A_k ← X_k X_k^⊤
 9:     X_{k+1} ← (3/2) X_k − (1/2) A_k X_k    ▷ Reduced iterations due to AOL preconditioning
10:   end for
11:   W ← W − η X_{K′}    ▷ Update with preconditioned orthogonalized direction
12: end for
13: return updated θ
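
The core of both optimizers is the orthogonalization loop. Below is a minimal sketch using the convergent cubic Newton-Schulz iteration X ← (3/2)X − (1/2)(XX^⊤)X; production Muon kernels use tuned higher-order polynomial coefficients (the a, b, c of Eqs. 73–74), and the function name here is illustrative.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, iters: int = 5, eps: float = 1e-7):
    """Approximately orthogonalize a gradient matrix (Algorithm 22 core loop)."""
    X = G / (torch.linalg.norm(G) + eps)      # Eq. 71: Frobenius normalization
    for _ in range(iters):
        A = X @ X.T                           # Gramian (Eq. 72)
        X = 1.5 * X - 0.5 * (A @ X)           # cubic Newton-Schulz step
    return X

# Usage sketch: W -= lr * newton_schulz_orthogonalize(W.grad)
```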

| Group | Methods | Primary Goal | Interpretation from the Survey |
|---|---|---|---|
| Classical baseline | SGD, AdamW, RAdam | stability and reference baselines | These define the comparison floor for newer optimizer claims. |
| Momentum redesign | Adan, AdEMAMix, MARS, Cautious AdamW | faster or safer first-order adaptation | Best when convergence speed or noisy-gradient stability is the main concern. |
| Large-batch and schedule simplification | LAMB, Schedule-Free AdamW | operational robustness at scale | Reduce brittleness from batch-size growth or schedule engineering. |
| Memory-efficient | Adafactor, GaLore, APOLLO, APOLLO-Mini, Q-APOLLO, Lion | optimizer-state reduction | Most useful when VRAM is dominated by optimizer state rather than activations; APOLLO-family methods replace dense AdamW moments with projected or quantized structured scaling. |
| Curvature-aware | Shampoo, SOAP, Sophia | better conditioning | Prefer when richer geometry is worth the implementation and compute overhead. |
| Geometry-oriented | Muon, Turbo-Muon | orthogonalized update structure | Specialized options for matrix geometry and representation shaping. |

B  Annex B: Dense, Recurrent, and Memory-Augmented Transformers

This comprehensive annex synthesizes modern transformer architectures beyond the standard softmax baseline. The field has evolved toward six primary paradigms: dense global attention with variants, recurrent state compression with decay, selective state-space models, continuous-depth numerical integration, memory-augmented architectures, and hybrid approaches. Understanding this taxonomy illuminates fundamental tradeoffs between expressiveness, computational cost, memory footprint, and deployment simplicity.

B.1  Dense Attention Baselines: Standard and Sigmoid

B.1.1  Standard Softmax Attention

Standard multi-head self-attention [38] computes scaled dot-product similarity between token embeddings:

1: Input: Query matrix Q ∈ R^{n×d}, Key matrix K ∈ R^{n×d}, Value matrix V ∈ R^{n×d}
2: scores ← QK^⊤/√d ∈ R^{n×n}    ▷ compute pairwise similarities
3: attention_weights ← softmax(scores, axis = 1) ∈ R^{n×n}    ▷ normalize across keys
4: Output: Y = attention_weights · V ∈ R^{n×d}    ▷ weighted value combination
For autoregressive generation, KV caching stores past keys and values to avoid O(n²) recomputation. However, this cache grows linearly, on the order of n · d_hidden entries per layer, and can reach hundreds of gigabytes for billion-parameter models at long contexts. Standard attention achieves perfect expressiveness within the context window—any token can attend to any other token with learned weights—but pays the price of dense computation.

• Training complexity: O(n² · d) time, O(n²) space (attention matrix materialization).

• Inference complexity: O(n) time per token, O(n) space (KV cache).

• Strengths: Unparalleled expressiveness; perfect history recall; highly parallelizable.

• Weaknesses: Quadratic bottleneck prohibits very long contexts; KV cache dominates memory during generation.
ory during generation.

B.1.2  Sigmoid Attention

Sigmoid attention replaces row-wise softmax with element-wise sigmoid activation [30]:

1: Input: Q, K, V as above, learnable bias b ∈ R^{n×n}
2: logits ← QK^⊤/√d + b
3: attention_weights ← σ(logits)    ▷ element-wise sigmoid, not softmax
4: Output: Y = attention_weights · V
Unlike softmax, sigmoid does not enforce a probability distribution (weights need not sum to 1), enabling stronger token independence. Theoretical analysis via mixture-of-experts shows sigmoid achieves superior sample complexity: O(n^{−0.51}) convergence for ReLU experts versus softmax's O(n^{−0.24}). However, empirical training revealed gradient instabilities at scale. The remedy is hybrid-norm—adding normalization after the attention output—which stabilizes gradients without sacrificing the theoretical benefits of element-wise gating.

• Training complexity: O(n² · d) (identical to standard), but element-wise operations enable a 17% inference speedup via FlashSigmoid.

• Inference complexity: O(n) per token with KV cache (asymptotically the same, but lower constant factors).

• Strengths: Overcomes zero-sum competition; avoids row-wise synchronization; hardware-friendly implementation.

• Weaknesses: Requires careful stabilization (hybrid-norm); training instability at large scales without auxiliary loss.
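
The change relative to the softmax baseline is a one-line swap, sketched here; the hybrid-norm stabilization discussed above would wrap the output in an extra normalization layer:

```python
import torch

def sigmoid_attention(Q, K, V, bias):
    """Element-wise sigmoid attention; bias is a learnable [n, n] tensor."""
    d = Q.shape[-1]
    logits = Q @ K.T / d ** 0.5 + bias
    weights = torch.sigmoid(logits)       # rows need not sum to 1
    return weights @ V
```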

B.2  Recurrent and Retentive Architectures

B.2.1  Retentive Networks (RetNet)

RetNet [36] unifies three computation paradigms: parallel training, recurrent inference, and chunkwise deployment. Its core innovation is the retention mechanism, which uses a fixed exponential decay matrix to model temporal importance:

1: Parallel (training) form:
2: decay_matrix[i, j] ← γ^{i−j} for i ≥ j    ▷ causal exponential decay
3: decay_matrix[i, j] ← 0 for i < j    ▷ causal masking
4: Y_parallel ← (QK^⊤ ⊙ decay_matrix) V    ▷ element-wise multiplication with decay

1: Recurrent (inference) form:
2: for t = 1, . . . , n do
3:   s_t ← γ s_{t−1} + k_t v_t^⊤    ▷ state update with decay
4:   y_t ← q_t s_t    ▷ output via query-state interaction
5: end for
The decay scalar γ ∈ (0, 1) controls the temporal window. RetNet uses multi-scale retention with different γ values per head (e.g., γ = 1 − 2^{−5}, 1 − 2^{−6}, . . .), allowing short-term and long-term dependencies simultaneously. Chunkwise recurrent mode divides sequences into chunks, processes each chunk in parallel, and threads a recurrent state between chunks.

• Training complexity: O(n² · d) (parallel), or O(n · c · d) (chunkwise recurrent with chunk size c).

• Inference complexity: O(1) per token, O(d²) state space (fixed-size matrix).

• Strengths: Constant-time inference; triple computation paradigm; multi-scale consolidation.

• Weaknesses: Fixed decay imposes a rigid inductive bias; may truncate learned long-range patterns.
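
The recurrent inference form is easy to state directly in code; a single-head sketch with a scalar decay γ (multi-scale retention runs this per head with different γ values):

```python
import torch

def retnet_recurrent(q, k, v, gamma: float):
    """Recurrent retention: q, k, v are [n, d]; the state is a fixed [d, d] matrix."""
    n, d = q.shape
    state = torch.zeros(d, d)
    outputs = []
    for t in range(n):
        state = gamma * state + torch.outer(k[t], v[t])  # decayed state update
        outputs.append(q[t] @ state)                     # query-state readout
    return torch.stack(outputs)                          # [n, d]
```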

B.2.2  Mamba: Selective State-Space Models

Mamba [12] frames recurrence as a continuous-time dynamical system with input-dependent parameters, achieving both linear training and constant-time inference:

1: Continuous dynamics: h′(t) = Ah(t) + Bx(t),  y(t) = Ch(t)
2: Discretization with step size ∆_t:
3:   Ā_t ← exp(∆_t A)
4:   B̄_t ← (∆_t A)^{−1}(exp(∆_t A) − I) ∆_t B_t
5: Recurrent update:
6: for t = 1, . . . , n do
7:   ∆_t ← softplus(Linear(x_t))    ▷ input-dependent step size
8:   B_t ← Linear(x_t),  C_t ← Linear(x_t)    ▷ input-dependent projection matrices
9:   h_t ← Ā_t h_{t−1} + B̄_t x_t    ▷ state transition
10:  y_t ← C_t h_t    ▷ output projection
11: end for
The critical innovation is that A (the system matrix) is not input-dependent, but ∆t, Bt, and
Ct are, making the system time-varying. This selectivity allows the model to ignore irrelevant
information by setting ∆t near zero, effectively creating a gate. A hardware-aware parallel scan
algorithm implements the recurrence efficiently on GPUs by fusing computation within SRAM,
avoiding expensive HBM bandwidth.

• Training complexity: O(n · d) via hardware-aware scan (linear in sequence length).

• Inference complexity: O(1) per token, O(d) state (hidden vector).

• Strengths: Achieves linear training and constant-inference simultaneously; practical on long
sequences (millions of tokens).

• Weaknesses: State vector compression can weaken exact copying and dense associative recall
versus full attention.
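
A toy single-channel recurrence under simplifying assumptions (scalar input per step, diagonal A, zero-order-hold discretization); `lin_delta`, `lin_B`, and `lin_C` are hypothetical `nn.Linear(1, ·)` modules standing in for the Linear(·) projections in the pseudocode, and real Mamba fuses all of this into the hardware-aware parallel scan described above:

```python
import torch
import torch.nn.functional as F

def selective_ssm_step(x_t, h_prev, A, lin_delta, lin_B, lin_C):
    """One recurrent step of a selective SSM. h_prev, A: [N] (A negative, diagonal)."""
    delta = F.softplus(lin_delta(x_t.view(1))).squeeze()  # input-dependent step size
    A_bar = torch.exp(delta * A)                          # discretized transition
    B_t = lin_B(x_t.view(1))                              # input-dependent B_t: [N]
    B_bar = (A_bar - 1.0) / A * B_t                       # (Δt A)^{-1}(exp(Δt A) − I)·Δt B_t
    h_t = A_bar * h_prev + B_bar * x_t                    # state transition
    y_t = torch.dot(lin_C(x_t.view(1)), h_t)              # input-dependent readout
    return y_t, h_t
```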

B.3  Continuous-Depth Transformers: ODE Integration

B.3.1  ODE Transformer

The ODE Transformer [52] interprets network depth as numerical integration of a continuous dynamical system, using higher-order Runge-Kutta solvers to reduce truncation error:

1: Continuous formulation: dh(t)/dt = f_θ(h(t), t), where f_θ is the transformer sub-network
2: Runge-Kutta-4 discrete approximation:
3: k1 ← f_θ(h_t, t)
4: k2 ← f_θ(h_t + (1/2)k1, t + (1/2)∆t)
5: k3 ← f_θ(h_t + (1/2)k2, t + (1/2)∆t)
6: k4 ← f_θ(h_t + k3, t + ∆t)
7: h_{t+1} ← h_t + (∆t/6)(k1 + 2k2 + 2k3 + k4)    ▷ weighted combination

Instead of simple Euler residual connections h_{t+1} = h_t + f_θ(h_t), the RK4 block computes four intermediate evaluations and combines them with classical RK4 weights. To avoid vanishing gradients, the architecture introduces learned gating that interpolates between intermediate approximations:

1: g ← σ(Linear([k1, k2, k3, k4]))    ▷ learnable gate
2: h_{t+1} ← h_t + g · k1 + (1 − g) · k2    ▷ interpolated refinement
▷interpolated refinement
This formulation reduces the effective number of parameters through weight sharing—the
same fθ is evaluated multiple times—while providing richer trajectory refinement.

• Training complexity: O(k · n² · d), where k is the RK order (4 for RK4).

• Inference complexity: O(k · n) per token (higher constant overhead).

• Strengths: Significantly higher accuracy on generation tasks (state-of-the-art BLEU); parameter-efficient via weight sharing.

• Weaknesses: High per-step compute; inference latency increases by the constant factor k; complex gating required for stability.
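
The RK4 block is straightforward to express as a module that reuses one shared sub-network `f`; this is a sketch of the integration scheme only (the learned gating variant above would replace the fixed weighted combination):

```python
import torch.nn as nn

class RK4Block(nn.Module):
    """Runge-Kutta-4 residual block: four evaluations of a shared sub-network f."""
    def __init__(self, f: nn.Module, dt: float = 1.0):
        super().__init__()
        self.f, self.dt = f, dt

    def forward(self, h):
        k1 = self.f(h)
        k2 = self.f(h + 0.5 * self.dt * k1)
        k3 = self.f(h + 0.5 * self.dt * k2)
        k4 = self.f(h + self.dt * k3)
        return h + (self.dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```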

B.4  Test-Time Memory: Titans

The Titans architecture [2] introduces an orthogonal dimension: instead of static weights, the model maintains a learnable memory that is updated during inference based on a surprise-driven signal:

1: Local short-term attention: apply standard or sparse attention within a fixed context window c
2: y_t^local ← Attention(q_t, k_{[t−c:t]}, v_{[t−c:t]})
3: Memory update signal (surprise):
4: S_t ← η_t S_{t−1} − θ_t ∇ℓ(M_{t−1}; x_t)    ▷ surprise as gradient of loss w.r.t. memory
5: M_t ← (1 − α_t) M_{t−1} + S_t    ▷ memory updated via EMA
6: Long-term memory retrieval:
7: y_t^memory ← M_t^∗(q_t)    ▷ query memory module for retrieved information
8: Output combination:
9: y_t ← y_t^local + gate(y_t^memory)    ▷ combine local and retrieved memory
The memory module M is literally updated during the forward pass by computing gradients
of an associative loss and applying SGD steps with momentum. The decay rates ηt, θt, αt are
themselves input-dependent, allowing the model to switch memory paradigms when context
shifts.

• Training complexity: roughly O(c² + n · f), where c is the local window and f is the memory-update overhead.

• Inference complexity: O(c²) local attention plus O(1) memory retrieval per token.

• Strengths: Handles extreme context lengths; enables true associative recall; memory adapts
to input.

• Weaknesses: Significantly more complex; inference includes gradient computation; higher
coordination overhead.

B.5  Architectural Comparison and Synthesis

| Architecture | Train | Infer | State | Key Characteristic |
|---|---|---|---|---|
| Standard Attention | O(n²d) | O(n) cache | O(nd) | Perfect expressiveness, full token routing, KV bottleneck. |
| Sigmoid Attention | O(n²d) | O(n) cache | O(nd) | Element-wise gating, hardware-efficient, stable at scale with hybrid-norm. |
| RetNet | O(n²d) | O(1) | O(d²) | Multi-scale decay, triple computation mode, fixed forgetting pattern. |
| Mamba | O(nd) | O(1) | O(d) | Selective input-dependent dynamics, linear training and inference, hardware scan. |
| ODE Transformer | O(kn²d) | O(kn) | O(nd) | Numerical integration refinement, weight sharing via stages, higher accuracy. |
| Titans | O(c² + nf) | O(c² + 1) | O(1) | Test-time memory adaptation, extreme context, parameter updates during inference. |

The field exhibits a clear progression along two axes: (i) computational efficiency, moving from O(n²) to O(n) or O(1), and (ii) memory adaptivity, shifting from static weights to dynamic, test-time updated models. Standard attention remains the expressiveness baseline; Mamba and RetNet represent the practical efficiency frontier; Titans introduces an orthogonal innovation axis (test-time learning). The choice of architecture reflects the fundamental engineering constraint: expressiveness versus deployment cost.

C  Annex C: Comprehensive Sparse Attention Mechanisms

C.1  Executive Summary

Sparse attention mechanisms address the prohibitive O(n²) computational and memory complexity of standard scaled dot-product attention by restricting the set of attended key-value pairs. Modern sparse attention spans a rich design space distinguished by four orthogonal dimensions: (i) sparsity pattern (fixed geometric, data-dependent, or frequency-based), (ii) trainability (end-to-end trained or inference-only), (iii) sparsity unit (individual tokens, local windows, or blocks), and (iv) mechanism (masking, selection, filtering, or compression). This annex synthesizes seven major sparse attention families—Sparse Transformer, Longformer, BigBird, SparseK, NSA, SpargeAttn, and FASA—providing mathematical formulations, algorithmic pseudocode, and comprehensive architectural comparisons.

C.2  Sparse Transformer: Factorized Strided and Fixed Patterns

C.2.1  Mathematical Formulation

The Sparse Transformer [9] reduces quadratic complexity to O(n√n) by factorizing sparse attention into two complementary sparse heads. Let n be the sequence length and l = ⌊√n⌋ be the stride.

Strided attention head: each position i attends to every l-th previous position:

A_i^(strided) = {j : j ≤ i, (i − j) mod l = 0}

Fixed attention head: each position i attends to positions within its local block plus fixed summary columns:

A_i^(fixed) = {j : ⌊j/l⌋ = ⌊i/l⌋} ∪ {j : j mod l ∈ {l − c, . . . , l − 1}}

where c is a hyperparameter controlling the number of summary columns (typically c = 1 or c = 2).

The attention computation for each head h follows standard scaled dot-product rules restricted to the sparse set:

Attn_h(Q, K, V)_i = softmax(q_i^h (K_{A_i}^h)^⊤ / √d_k) V_{A_i}^h

The key insight is that with the two factorized heads operating in parallel across a depth of L transformer layers, every position can reach every other position through a path of length at most L + 1, preserving long-range reachability at a fraction of dense attention cost.

C.2.2  Algorithmic Pseudocode

Algorithm 24 Sparse Transformer Attention: Strided + Fixed Dual Heads
Require: Query Q ∈ R^{n×d}, Key K ∈ R^{n×d}, Value V ∈ R^{n×d}
Require: Stride l = ⌊√n⌋, summary columns c ∈ {1, 2}
Ensure: Output Y ∈ R^{n×d}
 1: Head 1: Strided Attention
 2: for i = 1, . . . , n do
 3:   A_i ← {j : j ≤ i and (i − j) mod l = 0}    ▷ Every l-th position
 4:   scores_i ← q_i K_{A_i}^⊤ / √d_k
 5:   attn_i ← softmax(scores_i)
 6:   y_i^(1) ← attn_i V_{A_i}
 7: end for
 8: Head 2: Fixed Block + Summary Attention
 9: for i = 1, . . . , n do
10:   block_start ← ⌊i/l⌋ · l,  block_end ← min(i + 1, block_start + l)
11:   A_i ← [block_start, block_end) ∪ {positions in the last c columns of each block}
12:   scores_i ← q_i K_{A_i}^⊤ / √d_k
13:   attn_i ← softmax(scores_i)
14:   y_i^(2) ← attn_i V_{A_i}
15: end for
16: Concatenate: Y = [y^(1) ∥ y^(2)] and project via output linear layer
17: return Y

C.2.3  Key Characteristics

• Complexity: O(n√n · d) during training; O(n) per token during generation.

• Trainable: Yes; full backpropagation through selected positions.

• Sparsity pattern: Fixed and data-agnostic; patterns do not adapt to content.

• Strength: Structured reachability guarantees long-range dependencies; proven effective on long sequences (Enwik8, text, and images).

• Limitation: Fixed patterns may miss important but non-contiguous relationships; requires
custom CUDA kernels for practical efficiency.
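
Both factorized patterns reduce to boolean masks that can be precomputed; a small sketch of the mask construction only, not the block-sparse kernels the method relies on in practice:

```python
import torch

def strided_fixed_masks(n: int, l: int, c: int = 1):
    """Boolean masks for the strided and fixed heads; True = may attend (causal)."""
    i = torch.arange(n).unsqueeze(1)          # query positions
    j = torch.arange(n).unsqueeze(0)          # key positions
    causal = j <= i
    strided = causal & ((i - j) % l == 0)     # every l-th previous position
    same_block = (j // l) == (i // l)         # local block of width l
    summary = (j % l) >= (l - c)              # last c summary columns per block
    fixed = causal & (same_block | summary)
    return strided, fixed
```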

C.3  Longformer: Sliding Window, Dilation, and Global Tokens

C.3.1  Mathematical Formulation

Longformer [3] achieves linear O(n · w) complexity by combining three complementary attention patterns.

Sliding window attention: local neighborhood of size w:

A_i^(window) = {j : |i − j| ≤ w/2}

Dilated sliding window: windowed attention with dilation d, attending every (d + 1)-th position:

A_i^(dilated) = {j : |i − j| ≤ (w/2)(d + 1) and (i − j) mod (d + 1) = 0}

Global attention: designated global tokens (e.g., [CLS]) attend to all positions and are attended by all:

A_i^(global) = {1, . . . , n} if i ∈ G;  A_i^(window) ∪ G otherwise

The three patterns are applied via separate projections. The local and global attention can use either the same projections or task-specific separate ones.

C.3.2  Algorithmic Pseudocode

Algorithm 25 Longformer: Sliding Window + Dilated + Global
Require: Query Q ∈ R^{n×d}, Key K, Value V, window size w, dilation d, global token indices G
Ensure: Output Y
 1: for i = 1, . . . , n do
 2:   if i ∈ G then
 3:     A_i ← {1, . . . , n}    ▷ Global token attends to all
 4:   else
 5:     Local window: A^(local) ← [i − w/2, i + w/2] ∩ [1, n]
 6:     Dilated offsets: A^(dilated) ← {j : (i − j) mod (d + 1) = 0} ∩ [i − (w/2)(d + 1), i + (w/2)(d + 1)]
 7:     Combine: A_i ← A^(local) ∪ A^(dilated) ∪ G
 8:   end if
 9:   scores_i ← q_i K_{A_i}^⊤ / √d_k
10:   attn_i ← softmax(scores_i)
11:   y_i ← attn_i V_{A_i}
12: end for
13: return Y

C.3.3  Key Characteristics

• Complexity: O(n · w) where w is the window size (linear when w ≪n).

• Trainable: Yes; drop-in replacement for standard attention.

• Coverage: With L layers and window size w, receptive field grows to L × w, covering entire
sequence with shallow networks.

• Strength: Practical for document-level tasks; dilation expands local receptive fields with
minimal overhead; global tokens provide query-document correspondence.

• Limitation: Window size remains a design hyperparameter; suboptimal for tasks requiring
dense attention patterns.
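
As with the Sparse Transformer, the pattern is precomputable as a mask; a sketch with dilation omitted for brevity:

```python
import torch

def longformer_mask(n: int, w: int, global_idx):
    """Sliding-window + global attention mask; True = may attend."""
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    mask = (i - j).abs() <= w // 2            # local window of size w
    g = torch.zeros(n, dtype=torch.bool)
    g[global_idx] = True
    return mask | g.unsqueeze(0) | g.unsqueeze(1)  # global tokens attend and are attended
```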

C.4  BigBird: Random, Local, and Global Sparse Graph

C.4.1  Mathematical Formulation

BigBird [50] preserves universal approximation and Turing completeness using a principled mix of three sparsity patterns.

Random connections: each position connects to r randomly sampled positions:

A_i^(random) = RandomSample({1, . . . , n}, r)

Local window: sliding window neighborhood:

A_i^(local) = {j : |i − j| ≤ w/2}

Global tokens: task-specific global anchors:

A_i^(global) = {1, . . . , n} if i ∈ G;  A_i^(random) ∪ A_i^(local) ∪ G otherwise

The composite sparse mask M is:

M_ij = 1[j ∈ A_i^(random) ∪ A_i^(local) ∪ A_i^(global)]

In practice, BigBird groups tokens into blocks of size b and applies the sparsity decision at the block level, enabling efficient block-sparse implementations.

C.4.2  Theoretical Guarantee

The authors prove that with O(1) random connections and global tokens, the sparse graph main-
tains the same universal approximation and Turing completeness properties as dense attention.
Formally, any computable function can be represented and any sequence of operations can be
simulated within the BigBird connectivity graph.

C.4.3  Algorithmic Pseudocode

Algorithm 26 BigBird: Random + Local + Global Block-Sparse Attention
Require: Query Q, Key K, Value V, block size b, random links per block r, global indices G
Ensure: Output Y
 1: Reshape into blocks: n_b = ⌈n/b⌉ blocks
 2: for block i = 0, . . . , n_b − 1 do
 3:   for position j in block i do
 4:     Local window: A_j^(local) ← nearby positions in block ∪ boundary positions
 5:     Random block sampling: sample r blocks uniformly at random; add all their positions
 6:     Global: add all global token positions
 7:     A_j ← A_j^(local) ∪ A_j^(random) ∪ G
 8:   end for
 9: end for
10: for i = 1, . . . , n do
11:   scores_i ← q_i K_{A_i}^⊤ / √d_k
12:   attn_i ← softmax(scores_i)
13:   y_i ← attn_i V_{A_i}
14: end for
15: return Y

C.4.4  Key Characteristics

• Complexity: O(n) or near-linear in optimal block-sparse implementation.

• Trainable: Yes.

• Theoretical grounding: Provably maintains universal approximation and Turing-completeness
with sparse connectivity.

• Strength: Strong long-document performance on question answering, summarization, and
genomics tasks.

• Limitation: Random sampling introduces variance and non-determinism; tuning of r, w, g
hyperparameters remains task-dependent.

C.5  SparseK Attention: Differentiable Top-k Selection

C.5.1  Mathematical Formulation

SparseK [23] enables learnable sparsity via a differentiable top-k operator. A scoring network ϕ_θ evaluates key-value pair importance:

u_j = ϕ_θ(k_j, q_i) ∈ R

The SparseK operator selects the top-k scores while remaining differentiable. It computes a threshold τ(u) such that the sum of active scores equals k:

SparseK(u, k)_j = max(u_j − τ(u), 0),  where  Σ_j max(u_j − τ, 0) = k

The threshold τ is found via bisection. The attention becomes:

m_j = SparseK(u, k)_j,  Attn_i = softmax(q_i K_sel^⊤ / √d_k) V_sel

where K_sel, V_sel contain only the top-k entries (those with m_j > 0).

C.5.2  Algorithmic Pseudocode

Algorithm 27 SparseK: Differentiable Top-k Attention
Require: Query q ∈ R^d, Key matrix K ∈ R^{n×d}, Value matrix V ∈ R^{n×d}
Require: Scoring network ϕ_θ, target sparsity k
Ensure: Attention output y
 1: Compute importance scores:
 2: for j = 1, . . . , n do
 3:   u_j ← ϕ_θ(k_j, q)
 4: end for
 5: Compute differentiable top-k via threshold:
 6: u_sorted ← sort(u, descending = True)
 7: cumsum ← cumsum(u_sorted)
 8: Find ρ = max{i : u_sorted[i] > 0 and cumsum[i] ≤ k}
 9: τ ← (cumsum[ρ] − k)/(ρ + 1)    ▷ Threshold for k active elements
10: m ← max(u − τ, 0)    ▷ Differentiable selection mask
11: Apply mask and compute attention:
12: K_sel ← K[m > 0],  V_sel ← V [m > 0]
13: scores ← q K_sel^⊤ / √d_k
14: attn ← softmax(scores)
15: y ← attn · V_sel
16: return y

C.5.3  Key Characteristics

• Complexity: O(n · d) during training (linear in sequence length); O(k) per token during
generation.

• Trainable: Yes; end-to-end gradient flow through the SparseK operator.

• Incremental generation: Supports efficient constant-memory autoregressive generation.

• Strength: Seamlessly integrates into existing LLM architectures; minimal fine-tuning needed.

• Limitation: Scattered memory access from top-k selection may limit cache efficiency on some
hardware; scoring network adds overhead.
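
The differentiable threshold is the same simplex-projection trick used by sparsemax; a sketch of the mask computation from Algorithm 27 (the scoring network and the attention over selected entries are unchanged):

```python
import torch

def sparsek_mask(u: torch.Tensor, k: float) -> torch.Tensor:
    """m_j = max(u_j - tau, 0) with sum(m) = k, differentiable almost everywhere."""
    u_sorted, _ = torch.sort(u, descending=True)
    cumsum = torch.cumsum(u_sorted, dim=0)
    ranks = torch.arange(1, u.numel() + 1, dtype=u.dtype)
    support = u_sorted * ranks > cumsum - k   # entries kept active by the threshold
    rho = int(support.sum())                  # support size
    tau = (cumsum[rho - 1] - k) / rho         # threshold so active scores sum to k
    return torch.clamp(u - tau, min=0.0)
```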

C.6  NSA (Native Sparse Attention): Hardware-Aligned Hierarchical Branches

C.6.1  Mathematical Formulation

NSA [48] decomposes sparse attention into three parallel branches that are combined through learned gates.

Compression branch: a learnable MLP φ compresses key-value blocks:

K̃_t^cmp = [φ(k_{id+1:id+l})]_{1≤i≤⌊(t−l)/d⌋},  Ṽ_t^cmp = [φ(v_{id+1:id+l})]

Selection branch: high-importance blocks are selected based on compressed scores:

p_t^cmp = softmax(q_t^⊤ K̃_t^cmp / √d),  I_t = TopK(p_t^cmp, n),  K̃_t^sel = Gather(K, I_t)

Sliding window branch: fixed local context:

K̃_t^win = K_{t−w:t},  Ṽ_t^win = V_{t−w:t}

Gated combination: the three attention outputs are combined via learned gates:

α_t^(cmp) = σ(Linear_cmp(q_t)),  α_t^(sel) = σ(Linear_sel(q_t)),  α_t^(win) = σ(Linear_win(q_t))

y_t = α_t^(cmp) · Attn(q_t, K̃^cmp, Ṽ^cmp) + α_t^(sel) · Attn(q_t, K̃^sel, Ṽ^sel) + α_t^(win) · Attn(q_t, K̃^win, Ṽ^win)

C.6.2  Algorithmic Pseudocode

Algorithm 28 NSA: Compressed + Selected + Window Branches with Learned Gating
Require: Query q_t, cached Key/Value sequences, block size l, stride d, selection count n, window w
Require: Compression MLP φ, gate networks Linear_cmp, Linear_sel, Linear_win
Ensure: Output y_t
 1: Compression branch:
 2: for i = 1 to ⌊t/d⌋ do
 3:   k_i^cmp ← φ(K_{id+1:id+l})    ▷ Compress block via MLP
 4:   v_i^cmp ← φ(V_{id+1:id+l})
 5: end for
 6: K̃_cmp ← [k_1^cmp, . . . , k_{⌊t/d⌋}^cmp],  Ṽ_cmp ← [v_1^cmp, . . . , v_{⌊t/d⌋}^cmp]
 7: y_t^(cmp) ← SDPA(q_t, K̃_cmp, Ṽ_cmp)
 8: Selection branch:
 9: Compute scores on compressed: s ← softmax(q_t K̃_cmp^⊤ / √d)
10: Select top-n important blocks: I ← TopK(s, n)
11: Gather full blocks: K̃_sel ← [K_{I_i d+1:I_i d+l}]_i,  Ṽ_sel ← [V_{I_i d+1:I_i d+l}]_i
12: y_t^(sel) ← SDPA(q_t, K̃_sel, Ṽ_sel)
13: Window branch:
14: y_t^(win) ← SDPA(q_t, K_{t−w:t}, V_{t−w:t})
15: Gated fusion:
16: α^(cmp) ← σ(Linear_cmp(q_t)),  α^(sel) ← σ(Linear_sel(q_t)),  α^(win) ← σ(Linear_win(q_t))
17: y_t ← α^(cmp) y_t^(cmp) + α^(sel) y_t^(sel) + α^(win) y_t^(win)
18: return y_t

C.6.3  Key Characteristics

• Complexity: O(t/d + n · l + w) tokens attended per step (significantly reduced versus O(t)
for dense).

• Trainable: Yes; direct training with all branches differentiable.

• Hardware alignment: Designed for efficient tensor core utilization; compression and selec-
tion reduce memory bandwidth pressure.

• Strength: 11.6× decoding speedup and 9× forward speedup on 64k sequences; outperforms
full attention on many tasks.

• Limitation: Requires custom Triton/CUDA kernels; complex multi-branch architecture in-
creases engineering overhead.

C.7  FASA: Frequency-Aware Sparse Attention

C.7.1  Mathematical Formulation

FASA [41] is a training-free inference-time method that exploits rotary position embedding (RoPE) structure. Under RoPE, the token embedding is rotated by θ_i = B^{−2(i−1)/d}, decomposing the d-dimensional space into d/2 frequency chunks (FCs).

Key insight: only a small subset (< 1%) of FCs matter for contextual awareness; most encode positional patterns.

Dominant frequency chunk identification: for each layer l and head h, identify dominant FCs via contextual agreement (CA):

CA_{l,h,i} = |TopK(α_{l,h}) ∩ TopK(α_{l,h,i})| / K

where α_{l,h} is the full attention mask and α_{l,h,i} is the mask using only FC i. Dominant FCs have high CA—they contribute meaningfully to the final attention pattern.

Token importance prediction (TIP) stage: using only dominant FCs, compute lightweight importance per token:

S_t^{l,h} = Σ_{i∈I_dom^{l,h}} α_{l,h,i}(q_t, K_{1:t}),  T_t = TopK-Indices(S_t, N_fac)

Focused attention computation (FAC) stage: compute full-precision attention on selected tokens:

α̂_FAC = softmax(q_t K_{T_t}^⊤ / √d),  o_t = α̂_FAC V_{T_t}

C.7.2  Algorithmic Pseudocode

Algorithm 29 FASA: Frequency-Aware Sparse Attention (Inference-Time)
Require: Current query q_t, cached Key/Value K_{1:t}, V_{1:t}, RoPE base B, dominant FCs I_dom
Require: TIP token budget N_tip, FAC token budget N_fac
Ensure: Output o_t
 1: Stage 1: Token Importance Prediction (TIP)
 2: for layer l and head h do
 3:   for position i = 1 to t do
 4:     Compute importance from dominant FCs: s_i ← 0
 5:     for fc ∈ I_dom^{l,h} do
 6:       Rotate query and key into FC fc: q^(fc), k_i^(fc)
 7:       s_i^(fc) ← softmax(q^(fc) k_i^(fc)⊤ / √d)
 8:       s_i ← s_i + s_i^(fc)
 9:     end for
10:   end for
11:   T_t ← TopK-Indices([s_1, . . . , s_t], N_fac)    ▷ Select top tokens
12: end for
13: Stage 2: Focused Attention Computation (FAC)
14: scores_fac ← q_t K_{T_t}^⊤ / √d    ▷ Full-precision dot product
15: α_fac ← softmax(scores_fac)    ▷ Full softmax
16: o_t ← α_fac V_{T_t}    ▷ Weighted value sum
17: return o_t

C.7.3  Key Characteristics

• Complexity: TIP is O(t · Ntip); FAC is O(Nfac · d).

• Trainable: No; training-free inference optimization.

• Applicability: Requires RoPE-based models; extended to ALiBi and MLA with modifica-
tions.

• Strength: Near-oracle accuracy with ≤256 tokens out of millions; up to 2.56× decoding
speedup; orthogonal to quantization.

• Limitation: Offline calibration is required; dominant FC identification adds preprocessing overhead.

C.8  SpargeAttn: Two-Stage Block-Level Filtering

C.8.1  Mathematical Formulation

SpargeAttn [53] is a universal training-free method that predicts which blocks of the attention matrix will contain negligible values and skips computation for those blocks.

Stage 1 — Sparse prediction: compute a low-cost proxy importance ŝ_ij for block (i, j):

ŝ_ij = f_pred(Q_i, K_j; hyperparams),  skip if ŝ_ij < ϵ_1

Common predictors include block-mean similarity or self-similarity statistics.

Stage 2 — Softmax-aware filtering: after computing P̃_ij = softmax(Q_i K_j^⊤ / √d), check if the block's maximum probability is negligible relative to the running softmax maximum:

skip P_ij V_j  if  max(P̃_ij) < e^{m_old − m_new} · ϵ_2

where m_old, m_new are online softmax running maxima (computed via FlashAttention-style numerics).

C.8.2  Algorithmic Pseudocode

Algorithm 30 SpargeAttn: Two-Stage Block-Sparse Filtering
Require: Query blocks Q_i for i = 1, . . . , n_b, Key blocks K_j for j = 1, . . . , n_b, Value blocks V_j
Require: Prediction threshold ϵ_1, softmax threshold ϵ_2, block size b_s
Ensure: Output Y
 1: Initialize: online softmax max m_global ← −∞
 2: for block row i = 1 to n_b do
 3:   Initialize: block row output Y_i ← 0, local max m_i ← −∞
 4:   for block column j = 1 to n_b do
 5:     Stage 1: Sparse Prediction
 6:     Compute proxy importance: ŝ_ij ← f_pred(Q_i, K_j)
 7:     if ŝ_ij < ϵ_1 then
 8:       skip this block [i, j]    ▷ Early termination
 9:       continue    ▷ Move to next block
10:     end if
11:     Stage 2: Compute and Filter
12:     P̃_ij ← softmax(Q_i K_j^⊤ / √d)
13:     m_block ← max(P̃_ij)
14:     if m_block < e^{m_i − m_global} · ϵ_2 then
15:       Block contributes negligibly; skip    ▷ Softmax-aware pruning
16:       continue
17:     end if
18:     Update local max: m_i ← max(m_i, m_block)
19:     Accumulate: Y_i ← Y_i + P̃_ij V_j
20:   end for
21:   m_global ← max(m_global, m_i)
22: end for
23: return Y

C.8.3  Key Characteristics

• Complexity: empirically O(n² · s), where s is the fraction of non-skipped blocks (typically 0.2–0.5).

• Trainable: No; plug-and-play acceleration for existing models.

• Universality: Works on language models, image diffusion models, and video generation.

• Strength: 2.5–5× speedup compared to dense or previous sparse methods; compatible with
quantization (SageAttention integration).

• Limitation: Speedup depends on inherent sparsity; block-level granularity may miss finer
patterns; threshold tuning required per model family.

C.9  Comparative Summary

| Method | Complexity | Trainable | Unit | Year | Primary Contribution |
|---|---|---|---|---|---|
| Sparse Transformer | O(n√n · d) | Y | Token pattern | 2019 | Factorized strided + fixed masks |
| Longformer | O(nwd) | Y | Local + global | 2020 | Linear scaling with dilation |
| BigBird | O(n) | Y | Graph sparse | 2020 | Theoretical completeness proof |
| SparseK | O(nd) | Y | Top-k tokens | 2024 | Differentiable selection |
| NSA | O((t/d + nl′ + w)d) | Y | Multi-branch | 2025 | Hardware-aligned 3-branch design |
| FASA | O(tN_tip + N_fac·d) | N | Frequency chunks | 2026 | RoPE frequency insight |
| SpargeAttn | O(n²s · d) | N | Block filter | 2025 | Two-stage universal filtering |

C.10  Design Space and Selection Criteria

The seven sparse attention methods occupy complementary positions in a multidimensional
design space:

• Geometric/fixed patterns (Sparse Transformer, Longformer): Simple to implement and
analyze, but patterns do not adapt to content.

• Learned selection (SparseK, NSA): Enable adaptation through trainable selection networks,
at the cost of higher implementation complexity.

• Training-free acceleration (FASA, SpargeAttn): Enable drop-in speedup of existing models
without retraining, suited for deployed systems.

• Theoretical guarantees (BigBird): Provide formal proofs of expressive completeness, im-
portant for understanding safety margins.

Selection guidance:

1. New model training where resources permit: NSA or SparseK for maximal inference
speedup; BigBird for theoretical assurance.

2. Long-context document understanding: Longformer or FASA for practical simplicity;
NSA for extreme scale.

3. Accelerating existing models: FASA for RoPE-based LLMs; SpargeAttn for any archi-
tecture.

4. Constrained environments: Sparse Transformer for simplicity and proven efficiency at
moderate scale.

D  Annex D: Gated Attention Families—Complete Literature Analysis

D.1  Executive Summary: Gating for Memory Control

The emerging field of gated attention mechanisms addresses a fundamental challenge in sequence modeling: how should information be retained, forgotten, and updated as the model processes new tokens? Traditional additive recurrent states accumulate all information without erasure, leading to memory saturation and an inability to adapt to changing context. Softmax attention provides full expressiveness but uses an O(Ld) KV cache during inference, limiting context window size. Gated mechanisms provide a middle ground: selective control over information survival. The unifying principle across gated architectures is that gating controls what information survives. Gates operate at four distinct levels:

1. Recurrent state decay (GLA, HGRN2): Multiplicative gates on the memory matrix St
control how much previous state persists into the next step.

2. Write strength in recurrent updates (DeltaNet, Gated DeltaNet): Gates control the
magnitude of new information written into memory.

3. Softmax logit biasing (FoX): Gates inject token-level recency bias into attention logit
computation.

4. Post-attention sparsity (Gated Softmax): Gates selectively scale SDPA output channels.

This appendix synthesizes seven major gated-attention architectures published 2023–2025,
analyzing their mathematical foundations, hardware efficiency characteristics, and empirical
trade-offs.

D.2  1. Gated Linear Attention (GLA)

D.2.1  Mathematical Foundation

Gated Linear Attention [44] augments vanilla linear attention (which uses a matrix-valued recurrent state) with data-dependent multiplicative decay. Standard linear attention reformulates the traditional softmax mechanism as a recurrence over hidden states:

S_t = S_{t−1} + v_t k_t^⊤,  o_t = S_t q_t

where the state S_t ∈ R^{d_v×d_k} accumulates outer products. This purely additive formulation suffers from "memory overload": information accumulates monotonically and cannot be erased, degrading performance on tasks requiring context selection.

GLA introduces a data-dependent gating matrix G_t ∈ [0, 1]^{d_v×d_k} (rank-one, acting diagonally per key dimension):

S_t = G_t ⊙ S_{t−1} + v_t k_t^⊤,  o_t = S_t q_t

The gate G_t is parameterized with an outer-product structure:

G_t = 1 α_t^⊤,  α_t = logsigmoid(W_gk x_t)/c

where W_gk is a low-rank projection R^d → R^{d_k} and c ≈ 16 is a normalizer. The logsigmoid activation maps α_t into approximately [−1/2, 0], and division by c further contracts this toward zero for initialization. The resulting gate G_t = 1 + α_t ≈ 1 near initialization, then learns to decay historical information (α_t → −∞).

Algorithm 31 Gated Linear Attention (Recurrent Form)
Require: Input x_t; prior state S_{t−1}
Require: Projections: W_q, W_k, W_v (standard); W_gk (gate projection)
Ensure: Output o_t and updated state S_t
 1: q_t ← W_q x_t
 2: k_t ← W_k x_t
 3: v_t ← W_v x_t
 4: Compute gate: α_t ← logsigmoid(W_gk x_t)/16    ▷ Per-key-dim decay
 5: Apply outer-product gating: G_t ← 1 α_t^⊤    ▷ diagonal outer product
 6: Decay previous state: S_t^decay ← G_t ⊙ S_{t−1}    ▷ element-wise product
 7: Add new association: S_t ← S_t^decay + v_t k_t^⊤
 8: Retrieve via query: o_t^raw ← S_t q_t
 9: Apply output gate: o_t ← GroupNorm(o_t^raw) ⊗ SiLU(W_g x_t)
10: return o_t, S_t
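
A minimal recurrent-form sketch (output gate omitted), with the decay exponentiated into (0, 1): `exp(logsigmoid(·)/c)` is one common way to realize the logsigmoid/c parameterization that Algorithm 31 states in log space. Weight matrices are plain tensors here rather than the toolkit's modules.

```python
import torch
import torch.nn.functional as F

def gla_recurrent_step(x_t, S_prev, W_q, W_k, W_v, W_gk, c=16.0):
    """One GLA step: decay the [d_v, d_k] state per key dimension, then write."""
    q, k, v = W_q @ x_t, W_k @ x_t, W_v @ x_t
    alpha = torch.exp(F.logsigmoid(W_gk @ x_t) / c)        # per-key-dim decay in (0, 1)
    S_t = alpha.unsqueeze(0) * S_prev + torch.outer(v, k)  # decay, then add association
    return S_t @ q, S_t                                    # retrieve via query
```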

D.2.2  Hardware-Efficient Training

For parallel training, GLA uses a chunkwise block-wise algorithm that groups tokens into chunks of size C and computes recurrent transitions in parallel:

O_[t] = Q̃_[t] S_[t]^⊤ + (Q_[t] K_[t]^⊤ ⊙ Γ_[t]) V_[t]

where Q̃_[t] contains decay-scaled queries attending to the state at the chunk boundary (lookback), and Γ_[t] is the causal mask with decay-aware scaling:

Γ_[t],ij = ∏_{l=j+1}^{i} α_l  if i ≥ j,  0 otherwise

with the lookback queries in Q̃_[t] scaled by the cumulative in-chunk decay ∏_{l=1}^{i} α_l. This design maximizes matmul operations amenable to tensor-core acceleration, achieving sub-quadratic total complexity O(nLd²/C), where L is the number of attention heads and C is the chunk size.

D.2.3  Strengths and Limitations

| Aspect | Strengths | Limitations |
|---|---|---|
| Computational efficiency | Sub-quadratic training via chunkwise parallelism; constant-memory inference O(d²). | Requires custom CUDA/Triton kernels; not available in all frameworks. |
| Context generalization | Extends 2K-token training to 20K+ tokens with negligible perplexity degradation. | Still underperforms softmax on retrieval-heavy benchmarks (needle-in-haystack). |
| Selectivity | Per-key-dimension data-dependent decay. | Gate structure limits per-key-value selectivity; no direct control over which specific associations to erase. |

D.3  2. DeltaNet: Error-Correcting Linear Attention

D.3.1  Mathematical Foundation

DeltaNet [45] applies a classical delta learning rule to linear attention state updates. Instead of additive accumulation, DeltaNet computes the difference between the predicted value (retrieved from memory) and the target value, then performs an error-correcting update:

S_t = S_{t−1}(I − β_t k_t k_t^⊤) + β_t v_t k_t^⊤

where β_t ∈ (0, 1) is a data-dependent writing strength scalar. This update can be decomposed as an erase-write operation:

v_t^old = S_{t−1} k_t    (current prediction),  v_t^updated = β_t v_t + (1 − β_t) v_t^old    (blend old/new)

so that:

S_t = S_{t−1} − v_t^old k_t^⊤ + v_t^updated k_t^⊤

The key insight is that this update rule minimizes an online MSE loss at each timestep:

L_t(S) = (1/2)∥S k_t − v_t∥²  ⇒  ∇_S L_t = (S k_t − v_t) k_t^⊤

One step of SGD with learning rate β_t yields the delta rule update. This connection to test-time training (TTT) provides theoretical grounding: DeltaNet directly optimizes for value prediction at inference time.
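
The erase-write decomposition makes the update a single rank-1 correction, as this sketch shows (keys assumed L2-normalized, per the text):

```python
import torch

def deltanet_step(S_prev, k_t, v_t, beta_t: float):
    """Delta-rule update: S_prev [d_v, d_k], k_t [d_k] (unit norm), v_t [d_v]."""
    v_old = S_prev @ k_t                              # current prediction for this key
    v_new = beta_t * v_t + (1 - beta_t) * v_old       # error-aware blend
    return S_prev + torch.outer(v_new - v_old, k_t)   # erase and write in one rank-1 step
```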

D.3.2  Efficient Parallel Training

DeltaNet's transition matrix I − β_t k_t k_t^⊤ is a generalized Householder matrix (a rank-1 update to the identity). For chunkwise training, DeltaNet uses the WY representation, which expresses the product of m such rank-1 updates compactly:

∏_{i=1}^{m} (I − β_i k_i k_i^⊤) = I − W Y^⊤

where W, Y ∈ R^{d×m} are constructed recursively. This factorization enables matmul-rich parallelism without materializing the full product, keeping chunkwise training efficient.

Algorithm 32 DeltaNet State Update with WY Representation
Require: Input chunk X_[t]; prior state S_{t−1}
Require: L2-normalized queries Q̂_t, keys K̂_t, values V_t
Ensure: Output O_[t] and updated state S_t
 1: Chunkwise phase:
 2: for position i = 1 to chunk_size do
 3:   Normalize: k̂_i ← K̂_t[i]/∥K̂_t[i]∥,  q̂_i ← Q̂_t[i]/∥Q̂_t[i]∥
 4:   Compute write strength: β_i ← sigmoid(W_β x_i)
 5:   Prediction: v̂_pred ← S_{i−1}^in-chunk k̂_i
 6:   Error-aware blend: v_i^update ← β_i(v_i − v̂_pred)
 7:   Erase-write: S_i^in-chunk ← S_{i−1}^in-chunk − v̂_pred k̂_i^⊤ + (β_i v_i + (1 − β_i)v̂_pred) k̂_i^⊤
 8:   Output: o_i ← S_i^in-chunk q̂_i
 9: end for
10: Inter-chunk phase: use the WY representation of transition products
11: return O_[t], S_t

D.3.3  Strengths and Limitations

| Aspect | Strengths | Limitations |
|---|---|---|
| Associative recall | Perfect recall on the Multi-Query Associative Recall (MQAR) benchmark; error-correcting semantics naturally fit retrieval patterns. | Without global forgetting, memory still crowds over extreme sequence lengths; requires periodic reset mechanisms for unbounded context. |
| Theory | Grounded in online MSE optimization; clear connection to the test-time-training paradigm. | Requires L2-normalized keys for numerical stability; double-width normalization adds overhead. |
| Throughput | WY representation enables efficient chunkwise training. | Slightly slower than Mamba2 per token due to richer transition matrices (rank-1 instead of diagonal). |

D.4  3. Gated DeltaNet: Synthesis of Gating and Error Correction

D.4.1  Mathematical Foundation

Gated DeltaNet [46] synthesizes the strengths of GLA and DeltaNet by combining a decay gate α_t (from GLA) with the writing strength gate β_t (from DeltaNet):

S_t = S_{t−1}(α_t(I − β_t k_t k_t^⊤)) + β_t v_t k_t^⊤

Rewriting for clarity:

S_t = α_t S_{t−1}(I − β_t k_t k_t^⊤) + β_t v_t k_t^⊤

The two gates are complementary:

• α_t → 0 rapidly erases historical state: S_t ≈ β_t v_t k_t^⊤ (context reset).

• α_t → 1 recovers pure delta-rule behavior: S_t ≈ S_{t−1}(I − β_t k_t k_t^⊤) + β_t v_t k_t^⊤ (targeted updates).

• β_t → 0 skips writing but decays: S_t ≈ α_t S_{t−1} (pure forgetting).

• β_t → 1 replaces memory aggressively: S_t ≈ α_t S_{t−1} + v_t k_t^⊤ (replacement).

From an online learning perspective, Gated DeltaNet corresponds to:

min_{S_t} ∥S_t − α_t S_{t−1}∥_F² − 2⟨S_t k_t, β_t(v_t − α_t S_{t−1} k_t)⟩

which introduces an adaptive weight decay α_t into an SGD-like update—analogous to decoupled weight decay in deep learning optimization.

Algorithm 33 Gated DeltaNet Parallel Chunkwise Forward
Require: Input chunk X_[i]; prior state S_{i−1}; chunk size C
Require: Projections: W_q, W_k, W_v, W_α, W_β
Ensure: Output O_[i] and state S_i
 1: In-chunk computation:
 2: for j = 1 to C do
 3:   q_j ← W_q x_j,  k_j ← W_k x_j,  v_j ← W_v x_j
 4:   α_j ← sigmoid(W_α x_j),  β_j ← sigmoid(W_β x_j)
 5:   Apply combined gate: S_j ← α_j S_{j−1}(I − β_j k_j k_j^⊤) + β_j v_j k_j^⊤
 6:   o_j ← S_j q_j
 7: end for
 8: Inter-chunk recurrence:
 9: S_i ← S_C from in-chunk; propagate across chunks via cumulative ∏α masks
10: Output gating: O_[i] ← GroupNorm(O_raw) ⊗ SiLU(W_g X_[i])
11: return O_[i], S_i

D.4.2  Empirical Performance and Trade-offs

Gated DeltaNet has been integrated into Alibaba's Qwen3-Next and Qwen3.5 production models. Empirically (relative scores, Gated DeltaNet = 1.0×):

| Task | Gated DeltaNet | Mamba2 | DeltaNet | GLA |
|---|---|---|---|---|
| Language Modeling | 1.0× | 1.08× | 1.05× | 1.12× |
| Associative Recall | 1.0× | – (not perfect) | 1.0× | 0.75× |
| Length Extrapolation | 1.0× | 0.98× | 0.96× | 0.94× |
| Throughput (tokens/sec) | 0.85× | 1.0× | 0.82× | 0.88× |

Gated DeltaNet achieves the best balance across diverse tasks at the cost of slightly reduced throughput due to richer transition matrices.

D.5  4. HGRN2: Hierarchical Gating with Outer-Product Expansion

D.5.1  Mathematical Foundation

HGRN2 [28] uses an outer-product-based state expansion with hierarchically lower-bounded forget gates. The state update is:

S_t = diag(g_t) · S_{t−1} + v_t k_t^⊤,  o_t = S_t q_t

where g_t ∈ [b_ℓ, 1]^d is a lower-bounded forget gate for layer ℓ. The bounds satisfy:

0 ≤ b_{ℓ_shallow} < b_{ℓ_middle} < b_{ℓ_deep} ≤ 1

i.e., bounds increase (become less restrictive to retention) in deeper layers. The gate itself is computed as:

g_t = b_ℓ + (1 − b_ℓ) · σ(W_g x_t)

where σ is the sigmoid. When W_g outputs strongly negative logits, g_t ≈ b_ℓ (the gate sits at the layer's floor). When outputs are strongly positive, g_t ≈ 1 (memory refresh at any layer).

The hierarchical structure encourages different layers to specialize to different timescales: shallow layers model local dependencies (a low forget-gate floor permits rapid forgetting), while deeper layers capture long-range structure (a high floor forces retention).

Algorithm 34 HGRN2 Forward with Hierarchical Bounds
Require: Input x_t; prior state S_{t−1}; layer index ℓ ∈ [0, L − 1]
Require: Non-decreasing bounds: 0 ≤ b_0 < b_1 < · · · < b_{L−1} ≤ 1
Ensure: Output o_t and state S_t
 1: Layer-specific retain bound: b ← b_ℓ
 2: Project: q_t ← W_q x_t,  k_t ← W_k x_t,  v_t ← W_v x_t
 3: Compute forget gate with binding: g_t ← b + (1 − b) · σ(W_g x_t)    ▷ Constrained to [b, 1]
 4: Diagonal decay (vectorized): S_t ← diag(g_t) · S_{t−1} + v_t k_t^⊤
 5: Retrieve: o_t ← S_t q_t
 6: Optional normalization: o_t ← RMSNorm(o_t)
 7: return o_t, S_t

D.5.2  Scaling and Empirical Results

At 3B scale on 100B tokens, HGRN2 slightly outperforms Mamba2 and LLaMA-architecture
Transformers on language modeling. The hierarchical gating provides multi-scale temporal mod-
eling without explicit layer-wise architectural variants. However, vector-valued diagonal gating
(as opposed to element-wise selection) provides less per-item flexibility than delta-rule variants.

D.6  5. Forgetting Transformer (FoX): Gating in Softmax Logit Space

D.6.1  Mathematical Foundation

The Forgetting Transformer (FoX), proposed by Lin et al. [19], embeds a forget gate directly into softmax attention logits. Rather than replacing softmax with linear recurrence, FoX preserves full softmax expressiveness while adding recency control through a data-dependent logit bias. For each token position t, compute a scalar forget gate:

f_t = σ(w_f^⊤ x_t + b_f)

The logit bias at position (i, j) is the cumulative log-forget product:

d_ij = Σ_{l=j+1}^{i} log f_l = log ∏_{l=j+1}^{i} f_l

The full attention output becomes:

O = softmax(QK^⊤/√d_k + D) V

where D_ij = d_ij. Equivalently:

Attention(i, j) = exp(q_i^⊤ k_j/√d_k + d_ij) / Σ_{j′=1}^{i} exp(q_i^⊤ k_{j′}/√d_k + d_{ij′})

This weighting down-weights older tokens multiplicatively: a token from 10 steps ago is scaled by ∏_{l=j+1}^{i} f_l, which is typically much less than 1 if f_t < 1 on average. The mechanism is mathematically a data-dependent, learnable generalization of ALiBi (Attention with Linear Biases), where earlier work used fixed distance-based biases such as d_ij = −m · (i − j) with a per-head slope m.

Algorithm 35 FoX: Forgetting Attention with Recency Bias
Require: Queries Q, keys K, values V; input X
Require: Forget gate parameters: w_f ∈ R^d, b_f ∈ R
Ensure: Output attention O
 1: Compute forget gates:
 2: for t = 1 to T do
 3:   f_t ← σ(w_f^⊤ x_t + b_f)    ▷ Per-token forget probability
 4: end for
 5: Compute cumulative log-forget biases:
 6: for i = 1 to T do
 7:   for j = 1 to i do
 8:     d_ij ← Σ_{l=j+1}^{i} log f_l    ▷ Cumulative product in log space
 9:   end for
10: end for
11: Standard SDPA with bias:
12: Compute scores: S ← QK^⊤/√d_k
13: Add bias: S ← S + D
14: Apply softmax: A ← softmax(S, dim = −1)
15: Weight values: O ← AV
16: return O
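
Because d_ij depends only on prefix sums of log f_l, the whole bias matrix D can be built with one cumulative sum; a single-head sketch of Algorithm 35:

```python
import torch
import torch.nn.functional as F

def fox_attention(Q, K, V, X, w_f, b_f):
    """Forgetting attention: causal SDPA with cumulative log-forget bias. All [T, d]."""
    T, d = Q.shape
    log_f = F.logsigmoid(X @ w_f + b_f)          # log f_t per token, shape [T]
    cum = torch.cumsum(log_f, dim=0)
    D = cum.unsqueeze(1) - cum.unsqueeze(0)      # D[i, j] = sum_{l=j+1..i} log f_l
    scores = Q @ K.T / d ** 0.5 + D
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ V
```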

D.6.2  Integration with FlashAttention

The key advantage of FoX is compatibility with FlashAttention and existing optimized implementations. The forget biases are added in the softmax logit computation, which is already a core operation in FlashAttention's block-wise algorithm. The overhead is roughly 0.5–2% of wall-clock time, since the bias addition is amortized across matmul operations.

D.6.3  Strengths and Limitations

FoX achieves state-of-the-art results on length extrapolation, near-perfect needle-in-haystack retrieval, and superior long-context understanding compared to standard Transformers. However, it remains O(L²d) at training and O(Ld) per step at inference with a KV cache, limiting extreme sequence lengths.

D.7  6. Gated Attention (Post-SDPA Sigmoid Gating)

D.7.1  Mathematical Foundation

The NeurIPS 2025 Best Paper by the Qwen team proposes applying a sigmoid gate after scaled dot-product attention (SDPA):

Y = SDPA(Q, K, V) = softmax(QK^⊤/√d_k) V

Y′ = Y ⊙ σ(X W_θ)

where the gate score σ(X W_θ) can have shape R^{n×q×d_k} (element-wise gating per query head) or R^{n×q} (head-wise fixed gating). Across 30+ variants and 15B MoE + 1.7B dense models trained on 3.5T tokens, the team found:

• Head-wise gating adds only ∼1.6M parameters to a 15B model (negligible overhead).

• Effective gates are sparse: mean activation ≈ 0.116 (i.e., 88% are zero or near-zero).

• Gates act as query-dependent filters, suppressing low-value channels while preserving high-value signal.

• Gates demonstrably eliminate the attention-sink phenomenon, where the attention pattern becomes dominated by a single early token.

Algorithm 36 Gated Softmax Attention (Post-SDPA)
Require: Queries Q, Keys K, Values V; input X
Require: Gate projection matrix W_g ∈ R^{d×d_gate} where d_gate ∈ {1, d_k}
Ensure: Gated output Y′
 1: Standard SDPA:
 2: Compute attention: A ← softmax(QK^⊤/√d_k)
 3: Weight values: Y ← AV
 4: Compute gates:
 5: Project input: G ← X W_g    ▷ shape: [n, q, d_gate]
 6: Apply sigmoid: G ← σ(G)    ▷ Element-wise gating
 7: Apply gating:
 8: if d_gate = 1 then
 9:   Broadcast and scale: Y′ ← Y · G    ▷ Head-wise scaling
10: else    ▷ d_gate = d_k
11:   Element-wise modulation: Y′ ← Y ⊙ G    ▷ Per-dimension gating
12: end if
13: return Y′
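
The mechanism is essentially a one-liner on top of any existing attention output, which is why it is attractive as a drop-in change; shapes here are illustrative, and broadcasting covers both the head-wise and per-dimension variants:

```python
import torch

def gated_sdpa_output(Y, X, W_g):
    """Post-SDPA sigmoid gating: Y' = Y * sigmoid(X W_g)."""
    return Y * torch.sigmoid(X @ W_g)   # W_g: [d, 1] (head-wise) or [d, d_k] (per-dim)
```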

D.7.2  Key Findings

• Attention-sink suppression: Gating breaks the feedback loop where softmax concentrates on the first token, enabling more balanced attention.

• Training stability: Models with gating tolerate larger learning rates (about 30% higher stable η_max).

• Minimal overhead: Latency impact is only 1.6% on H100 GPUs; throughput drops at most 2–3%.

• Scaling-law improvement: Improves loss consistently across model scales from 800M to 110B parameters.

D.8  Comparative Analysis: Taxonomy of Gated Mechanisms

| Architecture | Publication Year | State Representation | Gate Type | Training Complexity | Inference Complexity | Ref |
|---|---|---|---|---|---|---|
| GLA | 2023 | Matrix (recurrent) | Diagonal decay | O(Ld²) | O(d²) memory | [44] |
| DeltaNet | 2024 | Matrix (recurrent) | Scalar write (β) | O(Ld²) | O(d²) memory | [45] |
| Gated DeltaNet | 2024 | Matrix (recurrent) | Decay + write | O(Ld²) | O(d²) memory | [46] |
| RetNet | 2023 | Matrix (recurrent) | Fixed exponential | O(Ld²) | O(d) per-step | [36] |
| HGRN2 | 2024 | Matrix (recurrent) | Diagonal (bounded) | O(Ld²) | O(d²) memory | [28] |
| FoX | 2025 | Full attention | Logit-space bias | O(L²d) | O(L) KV cache | [19] |
| Gated Softmax | 2025 | Full attention | Post-SDPA sigmoid | O(L²d) | O(L) KV cache | [29] |

| Architecture | Trainable | In-Context Recall | Associative Recall | Length Extrapolation | Key Advantage |
|---|---|---|---|---|---|
| GLA | Yes | Moderate | Weak | Good (20K) | Hardware efficiency; chunkwise parallelism |
| DeltaNet | Yes | Strong | Perfect (MQAR) | Good | Error-correcting semantics; strong retrieval |
| Gated DeltaNet | Yes | Very strong | Perfect | Excellent | Balanced: forgetting + targeted updates |
| RetNet | Yes | Weak | Weak | Good | Simplicity; no KV cache needed |
| HGRN2 | Yes | Moderate | Weak | Moderate | Multi-scale temporal modeling |
| FoX | Yes | Very strong | Very strong (near-perfect) | Excellent | Preserves softmax; minimal overhead |
| Gated Softmax | Yes | Strong | Good | Good | Simplicity; no replacement of SDPA |

| Architecture | Typical Throughput | Memory per Inference | Primary Use Case |
|---|---|---|---|
| GLA | High (chunkwise CUDA) | O(d²) constant | Long-context with memory constraints |
| DeltaNet | Medium-high | O(d²) constant | Retrieval and associative reasoning |
| Gated DeltaNet | Medium | O(d²) constant | Production: Qwen3-Next, balanced performance |
| RetNet | High | O(d) per-step | Extremely fast inference; simple models |
| HGRN2 | Medium-high | O(d²) constant | Multi-scale temporal patterns |
| FoX | Moderate (quadratic) | O(Ld) KV cache | Length extrapolation; long-context generation |
| Gated Softmax | High (minimal overhead) | O(Ld) KV cache | Drop-in improvement for existing models |

D.9  Unified Gating Principle

All seven architectures share a common principle: gating is a mechanism to control information flow and selective memory retention. The specific design choices diverge along several dimensions:

1. Location of gating: Recurrent state updates (GLA, DeltaNet, Gated DeltaNet, HGRN2)
versus softmax logit/output space (FoX, Gated Softmax, RetNet).

2. Granularity of control: Dimension-wise (GLA, HGRN2 diagonal), key-specific (DeltaNet),
token-wise (FoX), or head-wise (Gated Softmax).

3. Data dependence: Fully data-dependent (GLA, DeltaNet, Gated DeltaNet, FoX, Gated
Softmax) versus fixed schedules (RetNet with exponential decay).

4. Expressiveness trade-off: Linear recurrent efficiency (GLA, DeltaNet, Gated DeltaNet,
HGRN2, RetNet) versus full softmax expressiveness (FoX, Gated Softmax).

D.10  Implementation and Practical Guidance

D.10.1  When to Use Each Architecture

1. GLA: Fast inference with moderate-to-long contexts (2K–20K tokens), when custom CUDA
kernels are acceptable, and retrieval performance is not critical.

2. DeltaNet: Strong associative recall and in-context learning tasks; good for synthetic tasks
(MQAR) and key-value association testing.

3. Gated DeltaNet: Production models requiring the best balance of recall, forgetting, and
throughput (proven in Qwen3-Next; ICLR 2025).

4. RetNet: Extremely efficient inference without KV caching; suitable for edge/mobile deploy-
ments or toy models.

5. HGRN2: Multi-scale temporal hierarchies; when layer-specific forget bounds are desirable.

6. FoX: Length extrapolation and long-context understanding; when quadratic training is ac-
ceptable and replacing softmax is not desired.

7. Gated Softmax: Quick improvement to existing softmax models with minimal code changes;
best for practitioners wanting immediate gains without major refactoring.

D.10.2  Hardware Considerations

| Hardware Target | Recommended Architectures | Notes |
|---|---|---|
| GPU (H100/A100) | Gated Softmax, FoX, Gated DeltaNet | Mature softmax/linear implementations; GLA/DeltaNet require custom kernels. |
| TPU (high memory) | GLA, DeltaNet, Gated DeltaNet, HGRN2 | Hardware supports efficient matmuls; linear recurrent methods shine. |
| Mobile/Edge | RetNet, lightweight Gated Softmax | Constant-memory inference critical. |

E  Annex E: Conceptual Introduction—Transformers and Attention for Beginners

This appendix provides an accessible introduction to the fundamental concepts of transformers
and the attention mechanism, aimed at readers without deep prior knowledge in deep learning.
Transformers are the engine behind modern artificial intelligence systems like ChatGPT, but
their operation can be understood through real-world analogies.

E.1  What is a Transformer?

A transformer is a neural architecture that processes text or data sequences by interpreting the meaning of each word while considering all other words in context. Imagine you read a sentence:

“The bank was full of people waiting. Mary went to the bank.”

What does “bank” mean in each case? In the first sentence, the crowd of waiting people suggests a financial institution (or perhaps a bench). In the second, the word alone is ambiguous. A transformer resolves this automatically by looking at all surrounding words: depending on the rest of the passage, “Mary went to the bank” could mean a riverbank (where she goes to swim) or the financial institution (where she goes to conduct a transaction).
This ability to understand the complete meaning of a word by considering the entire sentence
is called contextualization, and it is the heart of how modern transformers work.

E.2  The Attention Mechanism: An Analogy

The attention mechanism is the “magic trick” that allows transformers to understand context.
Think of it as participating in an important conversation in a noisy room:

1. Lots of noise: Someone is speaking, and what they say is very important to you.

2. You ignore the rest: Your brain automatically focuses attention on that person, ignoring
other sounds.

3. You understand better: By concentrating, you catch every word clearly.

The neural attention mechanism works the same way: each word “asks” how important every
other word is to understanding its meaning, then “focuses” its attention on the most relevant
ones.

E.3  Practical Example Step-by-Step

Let’s see how a transformer understands the word “eats” in two different contexts:

E.3.1  Context 1: “The cat eats fish”

The transformer, when processing “eats”, internally asks:

• How relevant is “The”? Little (it’s just an article). Low attention.

• How relevant is “cat”? Very relevant (it’s the subject, who eats). High attention.

• How relevant is “fish”? Very relevant (it’s what is eaten). High attention.

Thus, the mental representation of “eats” focuses primarily on “cat” and “fish”.

E.3.2 Context 2: “The restaurant eats into profit margins”

Here the transformer asks in a different context:

• How relevant is “restaurant”? Very relevant (it’s the subject). High attention.

• How relevant is “profit”? Very relevant (it’s affected). High attention.

• How relevant is “margins”? Very relevant (it’s what’s being consumed). High attention.

The word “eats” gets a completely different representation because it “attends” to different
words in different contexts.
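To make this concrete, here is a minimal numpy sketch of the idea. The vectors and attention weights below are invented for illustration (they do not come from a trained model); the point is only that blending the same word with different neighbors yields different contextual representations:

```python
import numpy as np

# Toy 4-dimensional embeddings; the values are arbitrary stand-ins.
eats       = np.array([1.0, 0.0, 0.0, 0.0])
cat        = np.array([0.0, 1.0, 0.0, 0.0])
fish       = np.array([0.0, 0.0, 1.0, 0.0])
restaurant = np.array([0.0, 1.0, 1.0, 0.0])
margins    = np.array([0.0, 0.0, 0.0, 1.0])

def blend(values, weights):
    """Weighted average of value vectors (the 'combine' step described later)."""
    return sum(w * v for w, v in zip(weights, values))

# Context 1: "eats" attends mostly to "cat" and "fish".
ctx1 = blend([eats, cat, fish], [0.2, 0.4, 0.4])
# Context 2: "eats" attends mostly to "restaurant" and "margins".
ctx2 = blend([eats, restaurant, margins], [0.2, 0.4, 0.4])

print(ctx1)  # [0.2 0.4 0.4 0. ]  -> one representation of "eats"
print(ctx2)  # [0.2 0.4 0.4 0.4]  -> a different one, from a different context
```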

E.4 How Attention Works Mathematically (Simple Version)

Behind this shifting of attention is mathematics. Here is a simplified version without too much technical detail:

E.4.1 Step 1: Queries, Keys, and Values

Each word in the sentence is transformed into three versions:

• Query: “What information do I need to know?”

• Key: “What information do I have?”

• Value: “Here is my important information.”

Imagine a library: each visitor’s question is a “query”, each book’s catalog entry is a “key”, and the book’s content is the “value”. The librarian (attention) matches queries with the most relevant keys to retrieve the correct values.
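In code, the three “versions” of a word are simply three learned linear transformations of the same embedding. A minimal sketch follows; the random matrices stand in for weights that a real model would learn during training:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding size (illustrative)

# One word embedding; W_q, W_k, W_v are random stand-ins for learned weights.
x = rng.normal(size=d)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

query = x @ W_q  # "What information do I need to know?"
key   = x @ W_k  # "What information do I have?"
value = x @ W_v  # "Here is my important information."
```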

E.4.2 Step 2: Compatibility

The system examines how compatible each query is with each key. Queries similar to keys get a
high compatibility score. This is calculated by multiplying the query by the key (mathematically,
the dot product).

E.4.3 Step 3: Focus

The compatibility scores are converted into “focus weights”. High compatibilities mean “focus
attention here”, and low ones mean “ignore this”. Mathematically, a function called softmax
converts scores into percentages (like: 40% attention here, 35% here, 25% there).
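A minimal softmax in numpy makes this concrete; the scores below are invented so that the output matches the percentages mentioned above:

```python
import numpy as np

def softmax(scores):
    # Subtracting the max improves numerical stability without changing the result.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

scores = np.array([0.47, 0.34, 0.0])  # raw query-key compatibilities
print(softmax(scores).round(2))       # [0.4  0.35 0.25] -> 40%, 35%, 25%
```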

E.4.4 Step 4: Combine

Finally, the system combines all values, weighted by the focus weights.
If a word has 80%
attention on the subject, 80% of the subject’s information gets blended into that word’s repre-
sentation.
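Putting the four steps together, here is a minimal self-attention sketch in numpy. It follows the standard scaled dot-product formulation, including one detail omitted from the prose for simplicity: scores are divided by the square root of the embedding size to keep them in a reasonable range. The embeddings and weights are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                             # 5 words, 8-dimensional embeddings
X = rng.normal(size=(n, d))

# Step 1: project every word into queries, keys, and values.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Step 2: compatibility of every query with every key (dot products).
scores = Q @ K.T / np.sqrt(d)           # shape (n, n)

# Step 3: softmax turns each row of scores into focus weights summing to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Step 4: each word's new representation is a weighted blend of all values.
output = weights @ V                    # shape (n, d)
print(weights.sum(axis=-1))             # each row sums to 1.0
```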

E.5 Multiple Attention Heads: Multiple Perspectives

A key feature is that transformers do not use a single attention mechanism but several attention heads running in parallel. It’s as if you had 8 people analyzing the sentence simultaneously, each paying attention to different aspects:

• Person 1: “I focus on subject-verb relationships.”

• Person 2: “I search for the direct object.”

• Person 3: “I track information about verb tense.”

• Person 4: “I search for modifiers and adjectives.”

Each “head” learns to focus on different language patterns. Together, they capture much
richer understanding than a single head.
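In implementations, the “multiple people” are obtained by splitting the embedding into equal slices, one per head, and running the same attention computation independently on each slice. A minimal sketch, reusing the single-head attention from the previous example:

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention, as in the previous sketch."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n, d, heads = 5, 8, 2                   # 2 heads of size 4 each (illustrative)
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Split each projection into per-head slices, attend, then concatenate.
hd = d // heads
outputs = [attention(Q[:, i*hd:(i+1)*hd], K[:, i*hd:(i+1)*hd], V[:, i*hd:(i+1)*hd])
           for i in range(heads)]
multi_head = np.concatenate(outputs, axis=-1)  # shape (n, d)
# A real model applies one more learned projection (often called W_o) here.
```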

E.6 Stacking Layers

Transformers process information in multiple layers (usually 12 to 48 for practical models).
Imagine editing an essay:

1. First pass: You fix spelling and basic grammar.

2. Second pass: You improve clarity and sentence structure.

3. Third pass: You ensure consistency and narrative flow.

Transformers work the same way: each layer refines the model’s understanding of the text. The first layers capture basic features (word identity, simple grammatical properties such as number or gender), while later layers capture complex concepts (relationships between words, abstract meanings).
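Structurally, “stacking” just means feeding one layer’s output into the next. A minimal sketch (attention only; a real transformer layer also contains a feed-forward sublayer, residual connections, and normalization):

```python
import numpy as np

def self_attention_layer(X, W_q, W_k, W_v):
    """One deliberately simplified layer: self-attention only."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(X.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n, d, n_layers = 5, 8, 4
X = rng.normal(size=(n, d))

# Each layer has its own weights and refines the previous layer's output.
for _ in range(n_layers):
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    X = self_attention_layer(X, W_q, W_k, W_v)
```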

E.7 Why Does It Matter?

The attention mechanism was revolutionary because:

1. Parallelism: It relates every word to every other word at once, without processing them one by one (unlike earlier recurrent systems). This makes training very fast on parallel hardware.

2. Flexibility: It learns what patterns to look for automatically from data, without you needing
to program rules manually.

3. Scalability: It works from small texts to contexts with millions of words.

4. General Capabilities: The same mechanism works for translation, summarization, question-
answering, text generation, computer vision, and more.

E.8 Practical Challenges

Despite the extraordinary success of transformers, they present challenges:

E.8.1 Computational Complexity

If a sentence has 1,000 words, the standard attention mechanism must compare each of those words with all 1,000 words. That’s 1,000 × 1,000 = 1,000,000 comparisons. For long documents with millions of words, this becomes prohibitively expensive in time and memory.

E.8.2 Memory Requirements

Especially during inference (when using trained models), storing the entire attention matrix can
consume enormous amounts of RAM or GPU memory, limiting the sequence lengths you can
process.
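A few lines of arithmetic make the last two subsections concrete. Assuming 2-byte half-precision entries and a single attention matrix (one head, one layer; real models multiply this by the number of heads and layers), the memory grows quadratically:

```python
# Quadratic growth of the attention matrix (one head, one layer).
BYTES_PER_ENTRY = 2  # assuming float16 storage

for n in (1_000, 10_000, 100_000, 1_000_000):
    comparisons = n * n
    gib = comparisons * BYTES_PER_ENTRY / 2**30
    print(f"{n:>9,} words -> {comparisons:>19,} comparisons, {gib:10.2f} GiB")
```

At a million words, this single matrix alone would need almost two terabytes, far beyond the memory of any single accelerator.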

E.8.3 Device Efficiency

Transformers were designed for powerful TPUs and GPUs. Running them on phones or embedded devices is challenging.

E.9 Modern Solutions

To solve these challenges, research has proposed many variants:

• Sparse Attention: Instead of comparing each word with ALL others, it compares only with a strategic subset (nearby neighbors, periodic patterns). This reduces complexity from O(n²) to O(n log n) or O(n); see the sketch after this list.

• Recurrent Models: Like Mamba, they revive ideas from older recurrent neural networks in modern, hardware-efficient form.

• Gated Attention: Mechanisms that selectively learn what information to pass forward,
reducing storage needs.

• Quantization: Uses lower-precision numbers (small integers instead of 32-bit floating point) to reduce memory without losing too much accuracy.

These innovations are exactly what this toolkit (Frankestein) lets you experiment with easily.
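As a taste of how one of these variants changes the computation, here is a minimal sliding-window sparse attention mask in numpy. This is one common sparse pattern among many, and the window size is illustrative: each word compares itself only with neighbors inside the window, so the work grows roughly linearly in sequence length rather than quadratically:

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where position i may attend to position j, i.e. |i - j| <= window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n, window = 1_000, 64
mask = sliding_window_mask(n, window)

# Dense attention does n*n comparisons; the window keeps only a narrow band.
print(int(mask.sum()), "comparisons instead of", n * n)  # ~125,000 vs 1,000,000
```

In a full implementation, scores outside the mask are set to negative infinity before the softmax, so masked-out positions receive exactly zero attention weight.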

E.10 Key Takeaways

• A transformer is a neural network that understands language context very well.

• The attention mechanism allows each word to “focus” on which other words are relevant
to understanding its meaning.

• Attention works by computing compatibility between each word (query) and all others
(keys), then combining information based on that compatibility (values).

• Multiple heads of attention explore different patterns in parallel.

• Multiple layers progressively refine understanding.

• The main challenge is that standard attention has quadratic complexity, which is expensive
for long texts.

• Many modern solutions (sparse attention, recurrent models, quantization) address these
challenges, and this toolkit lets you explore all of them.


