EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages

What it is
Think of it as a Turing test for genuine logic. EsoLang-Bench gives AI models problems written in esoteric programming languages—intentionally obscure systems like Brainfuck (uses only 8 symbols) or Befunge (code runs in 2D grids). Because these languages have tiny footprints in training data, models can't coast on memorized solutions. They have to actually parse syntax and trace logic.
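To see what models are up against, here is a minimal Brainfuck interpreter sketched in Python (illustrative only, not part of the benchmark). The whole language is eight commands operating on a tape of byte cells, so evaluating even a short program demands genuine state tracking rather than pattern recall:

```python
def run_brainfuck(code: str, input_bytes: bytes = b"") -> bytes:
    """Interpret Brainfuck: 8 commands, a tape of byte cells, a data pointer."""
    # Pre-match brackets so loops can jump in O(1).
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i

    tape = [0] * 30000
    ptr = pc = in_pos = 0
    out = bytearray()
    while pc < len(code):
        c = code[pc]
        if c == ">":   ptr += 1                          # move pointer right
        elif c == "<": ptr -= 1                          # move pointer left
        elif c == "+": tape[ptr] = (tape[ptr] + 1) % 256 # increment cell
        elif c == "-": tape[ptr] = (tape[ptr] - 1) % 256 # decrement cell
        elif c == ".": out.append(tape[ptr])             # output cell as byte
        elif c == ",":                                   # read one input byte
            tape[ptr] = input_bytes[in_pos] if in_pos < len(input_bytes) else 0
            in_pos += 1
        elif c == "[" and tape[ptr] == 0: pc = jumps[pc] # skip loop if zero
        elif c == "]" and tape[ptr] != 0: pc = jumps[pc] # repeat loop if nonzero
        pc += 1
    return bytes(out)

# Builds 65 (8 * 8 + 1) in cell 1, then prints it as ASCII "A".
print(run_brainfuck("++++++++[>++++++++<-]>+."))  # b'A'
```

Answering "what does this program print?" requires simulating the loop step by step, which is exactly the kind of tracing the benchmark probes.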
Why it matters
This cuts through the hype around 'reasoning' models. If you're building AI systems that need to handle novel problems—debugging custom code, analyzing unusual data formats, adapting to proprietary systems—this benchmark shows you which models can generalize versus which are sophisticated autocomplete. It's a filter for distinguishing real adaptability from memorization at scale.
Key details
- Tests multiple esoteric languages: Brainfuck (minimalist), Befunge (2D execution), and others with <0.01% web presence
- Benchmark available at esolang-bench.vercel.app with live model comparisons
- Evaluation covers code execution, debugging, and logical reasoning tasks
- Exposes failure modes in frontier models that ace standard benchmarks
- Open framework—you can test your own models against the same tasks
