The Sequence Knowledge #850: The Unexpected Comeback of RNNs

What it is
Think of transformers as reading an entire book at once to understand context. RNNs read word by word, maintaining a running memory. For years, transformers won because they parallelize better during training and capture long-range connections more reliably. New RNN designs like RWKV and Mamba keep the sequential efficiency at inference time while borrowing ideas from the transformer era (careful normalization, better gating mechanisms, and training modes that parallelize across the sequence) to address the slow training and vanishing-gradient problems that sidelined classic RNNs in the 2010s.
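To make the "running memory" idea concrete, here is a minimal sketch of a gated linear recurrence in plain NumPy. It is a toy, not the exact RWKV or Mamba update rule, and the dimensions, decay values, and weights are made up for illustration; the point is that the state stays the same size no matter how many tokens have been processed.

```python
import numpy as np

def recurrent_step(state, x_t, decay, W_in):
    """One token of a toy gated linear recurrence (illustrative, not RWKV/Mamba exactly).

    state : (d,) running memory, same size regardless of sequence length
    x_t   : (d,) current token's embedding
    decay : (d,) per-channel forget factor in (0, 1)
    W_in  : (d, d) input projection
    """
    return decay * state + (1.0 - decay) * (W_in @ x_t)

d = 8
rng = np.random.default_rng(0)
state = np.zeros(d)                          # fixed-size memory
decay = np.full(d, 0.9)                      # placeholder decay; real models learn this
W_in = rng.standard_normal((d, d)) / np.sqrt(d)

# Process 1,000 tokens; memory footprint is still a single (d,) vector.
for x_t in rng.standard_normal((1000, d)):
    state = recurrent_step(state, x_t, decay, W_in)
```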
Why it matters
If you're running inference on long documents or real-time streams, RNNs could cut your compute bill. They scale linearly with sequence length; transformers scale quadratically. For edge deployment or streaming applications—think voice assistants that process continuous audio—RNNs suddenly make sense again. Don't rip out your transformer stack yet, but watch this space. The monoculture might be ending.
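A back-of-envelope way to see the crossover: count the dominant term for self-attention (roughly n² · d per layer) against the per-token work of a recurrent layer (roughly n · d² per layer). The constants below are invented and ignore real implementation details; only the growth rates are the point.

```python
def attention_ops(seq_len, d_model=1024):
    # Self-attention compares every token with every other token: ~n^2 * d work.
    return seq_len ** 2 * d_model

def recurrent_ops(seq_len, d_model=1024):
    # A recurrent layer does a fixed amount of work per token: ~n * d^2 work.
    return seq_len * d_model ** 2

for n in (1_000, 16_000, 128_000):
    print(n, attention_ops(n) / recurrent_ops(n))
# The ratio grows roughly like n / d_model, so the quadratic attention term
# starts to dominate once sequences reach the tens of thousands of tokens.
```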
Key details
- Modern RNN variants (RWKV, Mamba, LRU-based models) combine recurrent processing with parallelizable training (see the sketch after this list)
- Memory advantage: RNNs maintain a constant-size state regardless of sequence length, while transformers carry a growing KV cache and pay O(n²) attention compute
- Speed gains appear most dramatically on sequences longer than 8k-16k tokens
- Hybrid architectures emerging: attention layers where training parallelism matters, recurrent layers for efficient inference
- Research momentum from labs frustrated by transformer compute costs and scaling limits
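On the parallelizable-training point: the core trick in several of these models is that a linear recurrence of the form h_t = a_t * h_{t-1} + b_t composes associatively, so all the states can be computed as a prefix scan instead of a strict left-to-right loop. The sketch below only demonstrates the combine rule and checks it against the sequential loop (it runs sequentially itself); frameworks exploit the associativity to evaluate the scan in parallel. Shapes and values are illustrative.

```python
import numpy as np

def combine(left, right):
    """Compose two recurrence steps (a, b): applying `left` then `right`.

    This operator is associative, which is what allows a parallel prefix scan.
    """
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

rng = np.random.default_rng(1)
T, d = 16, 4
a = rng.uniform(0.5, 1.0, size=(T, d))   # per-token decay (toy values)
b = rng.standard_normal((T, d))          # per-token input contribution

# Sequential evaluation: what inference does, one token at a time.
h = np.zeros(d)
sequential = []
for t in range(T):
    h = a[t] * h + b[t]
    sequential.append(h)

# Scan over composed operators, starting from the identity step (1, 0).
acc = (np.ones(d), np.zeros(d))
scanned = []
for t in range(T):
    acc = combine(acc, (a[t], b[t]))
    scanned.append(acc[1])

# Same states either way; because `combine` is associative, a framework can
# evaluate the scan tree-style across the whole sequence during training.
assert np.allclose(sequential, scanned)
```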