Frankestein Transformer: Unified Encoder-Decoder Library, CLI, and Research-Grounded Design Notes

GPT 5.4, Perplexity · Erick Merino
Published March 12, 2026 · Version 4

Abstract

Frankestein Transformer is a unified, configuration-driven toolkit for systematic experimentation with modern transformer architectures, spanning seventeen sequence-mixer variants and twenty-two optimizer families. The system supports both encoder-style masked language modeling (MLM) and decoder-style autoregressive (AR) next-token prediction through flexible model-class and mode configuration, with specialized fine-tuning workflows for both architectures. The research contributions are threefold: (i) a strict schema-based configuration contract that enables reproducible experimentation across diverse attention mechanisms, including standard softmax attention, sigmoid attention, retentive networks, selective state-space models, continuous-depth transformers, adaptive depth routing (Mixture-of-Depths) [57], conditional memory augmentation (Engram) [8], sparse attention patterns, and gated mechanisms; (ii) a comprehensive optimizer-routing framework supporting variance-reduction methods (MARS, Adan, AdEMAMix), memory-efficient variants (Adafactor, GaLore, Lion), schedule-free approaches, second-order preconditioners (Shampoo, SOAP, Sophia), and low-rank APOLLO-family optimizers (Apollo, Apollo-Mini, QApollo) [55]; and (iii) end-to-end workflows spanning quantized deployment via ternary weight packing and sentence-embedding training inspired by SBERT, backed by an expanded quality-assurance stack with broad unit-test coverage, YAML example validation, and continuous-integration execution. The toolkit also provides a web-based configuration interface with schema-driven form rendering, inline documentation, and real-time validation. This technical reference includes architectural diagrams, execution-flow visualizations, decision tables, and comprehensive appendices synthesizing the literature on transformer architectures, sparse attention mechanisms, gated attention variants, and optimization algorithms. The schema-first design enables rapid iteration while preserving reproducible experimental conditions.
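Two short sketches make the abstract's central mechanisms concrete. First, the schema-first contract: a minimal illustration of how a YAML run configuration might be validated against a strict schema before training starts, rejecting unknown keys and out-of-range values. All field names and allowed values here are hypothetical placeholders for exposition, not Frankestein Transformer's actual schema.

```python
import yaml
from jsonschema import validate, ValidationError

# Hypothetical schema fragment: a run must declare a model class, an
# objective mode, a sequence mixer, and an optimizer; unknown
# top-level keys are rejected (the "strict contract" idea).
CONFIG_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["model_class", "mode", "mixer", "optimizer"],
    "properties": {
        "model_class": {"enum": ["encoder", "decoder"]},
        "mode": {"enum": ["mlm", "ar"]},
        "mixer": {"type": "string"},
        "optimizer": {"type": "string"},
    },
}

example_yaml = """
model_class: decoder
mode: ar
mixer: retnet
optimizer: apollo_mini
"""

try:
    validate(instance=yaml.safe_load(example_yaml), schema=CONFIG_SCHEMA)
    print("config OK")
except ValidationError as err:
    print(f"rejected: {err.message}")
```

Second, the ternary deployment path: a self-contained sketch of 2-bit ternary weight packing, quantizing weights to {-1, 0, +1} with a per-tensor absmean scale and packing four codes per byte. The function names, the absmean scaling rule, and the per-tensor scale granularity are assumptions for illustration, not the toolkit's API.

```python
import numpy as np

def pack_ternary(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize floats to ternary {-1, 0, +1} and pack 4 codes per byte."""
    scale = float(np.mean(np.abs(weights))) + 1e-8   # absmean scale (assumed)
    ternary = np.clip(np.round(weights / scale), -1, 1).astype(np.int8)
    codes = (ternary + 1).astype(np.uint8)           # map {-1,0,1} -> {0,1,2}
    pad = (-codes.size) % 4                          # pad to a multiple of 4
    codes = np.concatenate([codes.ravel(), np.zeros(pad, dtype=np.uint8)])
    packed = (codes[0::4] | (codes[1::4] << 2) |
              (codes[2::4] << 4) | (codes[3::4] << 6))
    return packed, scale

def unpack_ternary(packed: np.ndarray, scale: float, shape) -> np.ndarray:
    """Inverse of pack_ternary: recover the dequantized float tensor."""
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)],
                     axis=1).ravel()
    n = int(np.prod(shape))
    return (codes[:n].astype(np.float32) - 1.0).reshape(shape) * scale
```

Under this scheme a checkpoint stores roughly 2 bits per weight plus one float scale per tensor; unpacking reverses the bit layout and rescales, so `unpack_ternary(*pack_ternary(w), w.shape)` round-trips the quantized values.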


Review Status

Stage 1: Awaiting Endorsement. Requires an endorsement from a Bronze+ ORCID scholar to advance.

Authors

1. Erick Merino

AI Co-Authors

2. GPT
Version: 5.4
Role: writing and code

3. Perplexity
Role: literature review


Academic Categories

Artificial Intelligence

Interdisciplinary > Cognitive Science > Artificial Intelligence

Machine Learning

Formal Sciences > Computer Science > Artificial Intelligence > Machine Learning

Natural Language Processing

Formal Sciences > Computer Science > Artificial Intelligence > Natural Language Processing

Software Design

Formal Sciences > Computer Science > Software Engineering > Software Design

Version History

v4 (current) May 07, 2026

Add the Apollo optimizer, Engram, and MoDA; simplify the main section and move the tutorial to an appendix.

v3 Apr 13, 2026

Change the project name and extend the encoder-only design to support both encoder and decoder models.

v2 Mar 24, 2026

Use the "research-paper-writer" skill to improve the formatting.

v1 Mar 12, 2026

Initial submission.
