AI Revolution – May 25, 2026
Monday, May 25, 2026 · 9:45
Show Notes
Daily AI briefing — frontier models, research, and infrastructure.
Episode Summary
Today's episode covers 6 stories across 4 topic areas, including: Google Deepmind's AlphaProof Nexus solves decades-old math problems for a few hundred dollars; AI models often give the right answers but point to the wrong sources; Gemma 4 Multi-Token Prediction Delivers Up to ~3x Faster Token Generation.
Stories Covered
• Research
Google Deepmind's AlphaProof Nexus solves decades-old math problems for a few hundred dollars
The Decoder · May 25 · Relevance: █████████░ 9/10
Why it matters: Formal verification via the Lean compiler gives AlphaProof Nexus a fundamentally different—and auditable—proof pipeline compared to natural-language reasoning approaches, signaling a maturation in AI-assisted mathematical and logical reasoning that has downstream implications for verified software and cryptographic proof systems.
- Solved nine open Erdős problems autonomously, including two unsolved for 56 years
- Uses the Lean formal proof compiler to verify every proof step automatically, eliminating hallucinated proofs
- Inference cost is only a few hundred dollars per problem, but overall success rate remains low at 2.5%
AI models often give the right answers but point to the wrong sources
The Decoder · May 25 · Relevance: ████████░░ 8/10
Why it matters: Attribution hallucination—where models produce correct answers backed by fabricated or mismatched citations—represents a distinct and underappreciated failure mode that poses direct liability risk for RAG-based systems deployed in legal, medical, and compliance contexts.
- Leading models including GPT and Gemini routinely cite passages that do not actually support their stated answers
- Researchers at Peking University coined the term 'attribution hallucination' and introduced the CiteVQA benchmark to test for it systematically
- The failure is especially dangerous in regulated fields like law and medicine where source traceability is a compliance requirement
ByteDance study finds that asking LMMs questions beats making it transcribe text for long document training
The Decoder · May 24 · Relevance: ███████░░░ 7/10
Why it matters: ByteDance Seed's finding that question-answering-based training outperforms transcription for long multimodal documents challenges prevailing data curation assumptions and suggests a practical path to stronger document AI with smaller, cheaper models.
- A 7B model trained with question-answering supervision outperforms much larger models on long image-heavy document tasks
- The model generalizes to documents four times longer than anything seen during training
- The approach replaces page transcription with passage-retrieval-style question answering during training
• Model Release
Gemma 4 Multi-Token Prediction Delivers Up to ~3x Faster Token Generation
InfoQ AI/ML · May 25 · Relevance: ███████░░░ 7/10
Why it matters: Multi-token prediction with speculative decoding achieving ~3x throughput gains without quality degradation is a meaningful inference efficiency advance that directly reduces serving costs and latency for production deployments of open-weight models.
- Gemma 4 paired with MTP drafters achieves up to ~3x faster token generation via speculative decoding
- Multiple tokens are generated in parallel and verified in a single forward pass
- Throughput gains come with no reported quality loss on benchmarks
• Applications
George Hotz says coding agents will be "one of the most costly mistakes" in software development
The Decoder · May 25 · Relevance: ██████░░░░ 6/10
Why it matters: Hotz's empirically grounded critique—rooted in six months of hands-on testing rather than speculation—adds a concrete counterweight to the prevailing enthusiasm around agentic coding tools and highlights the growing hidden cost of AI-introduced bugs in production codebases.
- After six months of testing, Hotz concludes LLM coding agents produce fast prototypes but introduce increasingly hard-to-detect bugs
- His position reflects a deep split in the technical community over the production readiness of coding agents
- Hotz frames the risk as systemic, not just individual—a potential industry-wide costly mistake
• Industry
Deepmind's Hassabis sees humanity "in the foothills of the singularity" while LeCun says current AI isn't intelligent
The Decoder · May 24 · Relevance: ██████░░░░ 6/10
Why it matters: The public divergence between Hassabis, LeCun, and Vinyals on the trajectory of current AI reflects genuine disagreement at the frontier about architectural sufficiency and the path to general reasoning—context that matters for anyone making multi-year technology bets.
- Demis Hassabis (DeepMind) believes humanity is already in 'the foothills of the singularity'
- Yann LeCun maintains current AI systems are not genuinely intelligent and current architectures are insufficient
- Gemini co-lead Oriol Vinyals notes today's models lack the ability to learn from experience or produce genuine scientific breakthroughs
Further Reading
- Google Deepmind's AlphaProof Nexus solves decades-old math problems for a few hundred dollars · The Decoder
- AI models often give the right answers but point to the wrong sources · The Decoder
- Gemma 4 Multi-Token Prediction Delivers Up to ~3x Faster Token Generation · InfoQ AI/ML
- ByteDance study finds that asking LMMs questions beats making it transcribe text for long document training · The Decoder
- George Hotz says coding agents will be "one of the most costly mistakes" in software development · The Decoder
- Deepmind's Hassabis sees humanity "in the foothills of the singularity" while LeCun says current AI isn't intelligent · The Decoder
Full Transcript
Sam: AlphaProof Nexus just solved nine open Erdős problems — including two that have been unsolved for 56 years — and each solution cost a few hundred dollars in inference. The proofs aren't natural language arguments that might be wrong. Every step is verified through the Lean formal proof compiler. So either the proof checks or it doesn't. That's where we're starting today.
Priya: Good morning, welcome to AI Revolution for Monday, May 25th. I'm Priya Nair.
Sam: And I'm Sam Kim.
Priya: We've got a packed show. We're going deep on that AlphaProof Nexus result and what formal verification means for AI reasoning beyond math. Then we're covering a really important failure mode researchers are calling attribution hallucination — your model gets the answer right but cites the wrong evidence, which is a nightmare for regulated industries. We'll talk about Gemma 4's multi-token prediction hitting three-x inference speedups, a ByteDance study that rethinks how you train models on long documents, George Hotz's six-month verdict on coding agents, and a fascinating disagreement between Hassabis and LeCun about where we actually are on the path to general intelligence. Let's get into it.
Sam: So let's spend some time on AlphaProof Nexus because the technical approach here is genuinely different from what we've seen with other reasoning systems. Most AI math work — think of what OpenAI has been doing — operates in natural language. The model generates a proof in English or a mix of English and notation, and then humans have to verify whether it's correct. That's expensive, slow, and it doesn't scale. AlphaProof Nexus takes a completely different approach. It generates proofs in Lean, which is a formal proof language where every logical step is machine-checked. If the proof compiles, it's correct. Period. There's no ambiguity, no hallucinated reasoning steps that look plausible but are wrong.
Priya: So the Lean compiler is essentially acting as a ground-truth oracle. The model proposes, and the compiler disposes. That's a fundamentally different verification loop than having a second model or a human check the work.
Sam: Exactly. And this matters because one of the persistent problems with LLM reasoning is that models can generate arguments that are convincing but subtly wrong. When you have a formal verifier in the loop, that failure mode is eliminated. The model might fail to find a proof — and it does, 97.5 percent of the time — but when it succeeds, the proof is guaranteed correct.
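To make the "compiles or it doesn't" point concrete, here is a minimal Lean 4 sketch. It is a toy illustration only and is not taken from AlphaProof Nexus; the theorem name is made up for this example. Lean accepts the theorem only if the supplied proof type-checks against the stated claim.

```lean
-- Toy illustration only; not one of the AlphaProof Nexus proofs.
-- Lean accepts this theorem only because the proof term
-- `Nat.add_comm a b` has exactly the stated type.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- An unjustified claim does not compile. For instance,
--   theorem bad (a b : Nat) : a + b = a := rfl
-- is rejected, because the two sides are not equal by computation.
```

The checker does not judge whether an argument sounds convincing; it only accepts or rejects, which is the verification loop described above.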
Priya: That 2.5 percent success rate is worth sitting with. On one hand, it solved problems that stumped human mathematicians for decades. On the other hand, it fails on 39 out of 40 attempts. So you're looking at a system where the ceiling is extraordinary but the hit rate is very low. The economics work because inference is cheap — a few hundred dollars per problem — so you can afford a lot of failures.
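The economics here can be sketched with a few lines of arithmetic. The 2.5 percent success rate comes from the episode; the 300-dollar figure is a hypothetical stand-in for "a few hundred dollars", and treating each attempted problem as an independent trial with the same success rate is a simplifying assumption.

```python
def expected_attempts_per_success(success_rate: float) -> float:
    """Mean number of attempts per success, assuming independent trials."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


def expected_cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Average spend per success, counting the failed attempts too."""
    return cost_per_attempt * expected_attempts_per_success(success_rate)


SUCCESS_RATE = 0.025       # from the episode: 2.5 percent
COST_PER_ATTEMPT = 300.0   # hypothetical value for "a few hundred dollars"

attempts = expected_attempts_per_success(SUCCESS_RATE)
cost = expected_cost_per_success(COST_PER_ATTEMPT, SUCCESS_RATE)
print(f"about {attempts:.0f} attempts and ${cost:,.0f} per solved problem")
```

Under these assumptions, one success in forty attempts still lands in the low tens of thousands of dollars per solved problem, which is the sense in which cheap inference makes a low hit rate tolerable.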
Sam: Right. And there's a scaling argument here. As inference costs continue to drop and models improve, that 2.5 percent could climb significantly. But even at current rates, the system is already producing novel mathematical results. These aren't benchmark problems with known answers. These are open problems in combinatorics where the solutions are genuinely new.
Priya: The downstream implications go beyond pure mathematics. Formal verification in Lean is the same technology used for verified software and verified cryptographic protocols. If AI systems get better at generating Lean proofs, that capability transfers directly to proving properties about code — things like absence of buffer overflows, correctness of cryptographic implementations, compliance with formal specifications. That's a bridge from mathematical reasoning to software assurance.
Sam: And it's worth noting the competitive dynamic. OpenAI has been pursuing math reasoning through natural language. DeepMind is betting on formal verification. These are genuinely different architectural philosophies, and AlphaProof Nexus is a strong data point for the formal verification approach.
Priya: Let's shift to our second story, which is about a failure mode that I think a lot of teams deploying RAG systems need to hear about. Researchers at Peking University have identified what they're calling attribution hallucination. The model gives you the right answer but points to the wrong source.
Sam: This is subtle and that's what makes it dangerous. If a model gives you a wrong answer, you might catch it. But if the answer is correct and there's a citation next to it, most people stop checking. The problem is the citation might point to a passage that doesn't actually support the claim. The model essentially confabulates the connection between its answer and its source.
Priya: Think about what this means in practice. You're using a RAG system in a legal context. It tells you the correct statute applies to your case and cites paragraph three of a document. Your lawyer sees the citation, maybe even clicks through, but paragraph three is about something else entirely. The actual support was in paragraph seventeen, which the model never referenced. In a compliance audit, that's a finding. In litigation, that's a liability.
Sam: The researchers built a benchmark called CiteVQA specifically to test for this. And they found that leading models — GPT, Gemini — routinely exhibit this behavior. The answers are often right, but the evidential chain is broken. What's interesting technically is that this suggests the model's answer generation and its citation generation are somewhat decoupled processes. The model isn't actually reasoning from the cited passage to the answer. It's generating the answer from its broader understanding and then separately picking a citation that seems related.
Priya: Which means retrieval-augmented generation isn't really augmenting the generation in the way we assume. The retrieval provides context that improves the answer, but the citation is more like a post-hoc rationalization than a genuine provenance chain. For anyone building systems where source traceability is a requirement — legal, medical, financial compliance — this is a problem you need to test for explicitly. CiteVQA gives you a starting framework.
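One way to start testing for this explicitly is to check each citation against the passage it points to. The sketch below is not CiteVQA's methodology, which the episode does not describe; it uses a crude word-overlap score as a stand-in for a real entailment or support model, and the function names, threshold, and sample passages are all hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a real support model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_score(claim: str, passage: str) -> float:
    """Fraction of the claim's words that also appear in the passage."""
    claim_words = tokens(claim)
    if not claim_words:
        return 0.0
    return len(claim_words & tokens(passage)) / len(claim_words)


def check_citation(claim: str, cited_id: int, passages: dict[int, str],
                   threshold: float = 0.6) -> dict:
    """Score the cited passage and report which passage scores best."""
    scores = {pid: support_score(claim, text) for pid, text in passages.items()}
    best_id = max(scores, key=scores.get)
    return {
        "cited_supports": scores[cited_id] >= threshold,
        "cited_score": scores[cited_id],
        "best_id": best_id,
    }


# Hypothetical document mirroring the paragraph 3 vs paragraph 17 scenario.
passages = {
    3: "Filing fees are due within thirty days of the hearing date.",
    17: "The statute applies to contracts signed after January 2020.",
}
claim = "The statute applies to contracts signed after January 2020."
result = check_citation(claim, cited_id=3, passages=passages)
print(result)
```

Here the answer is correct but the cited paragraph fails the support check while another paragraph passes it, which is the pattern to flag. A production check would need a stronger support model than word overlap.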
Sam: Let's talk about something more optimistic. Gemma 4 now supports multi-token prediction through speculative decoding, and the throughput gains are significant — up to roughly three-x faster token generation with no reported quality loss.
Priya: Explain how speculative decoding works for folks who haven't dug into it.
Sam: Sure. Standard autoregressive generation produces one token at a time: the model runs a full forward pass, picks a token, then runs another forward pass for the next token. It's sequential and slow. Speculative decoding uses a smaller, faster draft model to predict several tokens ahead in parallel. Then the main model verifies those predictions in a single forward pass. If the draft model guessed correctly, which it often does for predictable sequences, you've generated multiple tokens for the cost of one verification pass. If it guessed wrong at some position, you keep the tokens before the mismatch and fall back to the main model's choice from there. So you never lose quality, you only gain speed.
Priya: And the practical impact is real. At the full three-x figure, you can serve the same traffic with roughly a third of the GPUs, or roughly three times the traffic with the same hardware. For teams running open-weight models in production, this is a direct cost reduction. And because Gemma 4 is open-weight, you can deploy these MTP drafters in your own infrastructure without API dependencies.
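The draft-then-verify loop can be sketched with toy models. This is a minimal greedy variant under stated assumptions, not Gemma 4's implementation: the two "models" are hypothetical functions over integer tokens, the draft proposes its tokens one after another, and the verification loop stands in for what a real system does in one batched forward pass of the main model.

```python
def speculative_step(context, draft_next, target_next, k):
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    proposal, ctx = [], list(context)
    for _ in range(k):                      # cheap draft proposals
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in proposal:                    # stands in for one batched target pass
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)       # fall back to the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))       # all accepted: one bonus token
    return accepted


def generate(prompt, draft_next, target_next, n_tokens, k=4):
    """Return (generated tokens, number of verification passes)."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        passes += 1
    return out[len(prompt):len(prompt) + n_tokens], passes


def greedy_baseline(prompt, target_next, n_tokens):
    """Plain one-token-at-a-time decoding with the target model."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


# Hypothetical toy models: the target counts upward mod 10,
# and the draft agrees except right after the token 2.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


spec_tokens, passes = generate([0], draft_next, target_next, n_tokens=12)
base_tokens = greedy_baseline([0], target_next, n_tokens=12)
print(spec_tokens == base_tokens, passes)
```

The speculative output matches plain greedy decoding token for token, while the number of verification passes is well below the number of tokens produced. With sampling instead of greedy decoding, the accept step becomes a probabilistic rejection test, which this sketch does not cover.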
Sam: The fact that this comes with no quality degradation on benchmarks is key. You're not trading accuracy for speed. The verification step guarantees the output distribution matches what you'd get from standard decoding.
Priya: Next up, ByteDance Seed published a study that challenges a common assumption about training multimodal models on long documents. The conventional approach has been to train these models by having them transcribe pages — essentially OCR as a training signal. ByteDance found that training with question-answering supervision instead produces dramatically better results.
Sam: The numbers are striking. A 7-billion parameter model trained with QA supervision outperforms much larger models on long document tasks. And it generalizes to documents four times longer than anything it saw during training. The intuition is that transcription trains the model to be a copier — it learns to reproduce what's on the page. QA training forces the model to understand the content, locate relevant passages, and synthesize answers. It's the difference between reading to copy and reading to comprehend.
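The difference between the two training signals is easiest to see as data. The sketch below is purely illustrative: the episode does not describe ByteDance's actual data format, so the `TrainingExample` fields, prompts, file names, and sample document are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    page_images: list[str]   # hypothetical paths or IDs of page images
    prompt: str              # what the model is asked to do
    target: str              # what the model is trained to output


def transcription_example(page_images: list[str], page_texts: list[str]) -> TrainingExample:
    """Transcription-style supervision: the target is the page text itself."""
    return TrainingExample(
        page_images=page_images,
        prompt="Transcribe the text on these pages.",
        target="\n".join(page_texts),
    )


def qa_example(page_images: list[str], question: str, answer: str) -> TrainingExample:
    """QA-style supervision: the target is an answer the model can only
    produce by locating and reading the relevant passage."""
    return TrainingExample(page_images=page_images, prompt=question, target=answer)


# Hypothetical two-page document.
images = ["report_p1.png", "report_p2.png"]
texts = ["Revenue grew 12 percent in 2024.", "Headcount stayed flat at 340."]

copy_task = transcription_example(images, texts)
read_task = qa_example(images, "How did headcount change?", "It stayed flat at 340.")
print(copy_task.prompt, "|", read_task.prompt)
```

Both examples use the same page images; only the prompt and target change, which is why switching the format is a data-curation decision rather than an architecture change.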
Priya: For teams building document AI, this is actionable. You might get better performance from a smaller, cheaper model by changing your training data format rather than scaling up parameters. That's a good trade.
Sam: Quick hit on George Hotz. After six months of hands-on testing with coding agents, his conclusion is that they produce fast prototypes but introduce bugs that compound and become increasingly hard to detect over time. He's calling it potentially one of the industry's most costly mistakes.
Priya: Hotz is an experienced systems programmer, and this isn't speculation — it's empirical experience. But I'd note that the community is genuinely split on this. Many teams report significant productivity gains from coding agents. The truth probably varies by use case. Greenfield prototyping is different from maintaining a large production codebase with complex invariants.
Sam: And finally, there's a fascinating public disagreement between Demis Hassabis, who says we're in the foothills of the singularity, Yann LeCun, who says current AI isn't genuinely intelligent, and Oriol Vinyals, who notes these systems still can't learn from experience or produce real scientific breakthroughs. Three deeply informed people looking at the same technology and reaching very different conclusions about what it means.
Priya: It's a useful reminder that the people closest to the frontier disagree about the most fundamental questions. When experts diverge this sharply, anyone claiming certainty about AI's trajectory should be viewed skeptically.
Sam: Looking ahead, I think the AlphaProof Nexus result opens a really interesting question: what happens when you combine formal verification with other domains? If AI can generate machine-checked proofs for open math problems, the same pipeline could eventually generate verified proofs about software correctness, protocol security, hardware designs. The 2.5 percent success rate needs to improve, but the architecture is sound.
Priya: And on the attribution hallucination front, I expect we'll see CiteVQA or something like it become a standard evaluation for RAG deployments, especially in regulated industries. This is the kind of failure mode that creates real legal exposure, and now there's a benchmark to test for it. Teams should be adding this to their evaluation suites now, not waiting for an incident.
Sam: The speculative decoding results from Gemma 4 are also worth watching. If three-x throughput becomes standard for open-weight models, it changes the economics of self-hosted inference significantly. That shifts the build-versus-buy calculation for a lot of organizations.
Priya: Lots to digest today. That's our show. Show notes and links to every story we covered are at cleartext.fm.
Sam: Thanks for listening. We'll see you tomorrow.
AI Revolution is an automated daily podcast covering AI advancements. Generated 2026-05-25.
Sources: MIT Technology Review, VentureBeat AI, The Verge, Wired, TechCrunch AI, Ars Technica, IEEE Spectrum, The Decoder, The Gradient, Hugging Face Blog, Google AI Blog, AI News, SemiAnalysis, and The Register.