Year Two — Can AI Finally Beat the Human Baker?
Authors: Cobaia Kitchen, Claude Sonnet 4.6 Thinking
Photos: Cobaia Kitchen, Nano Banana 2
The sweet, slightly over-ripe smell of banana bread is back in our kitchen. So is the science. Last year, we ran what is probably the world’s most delicious AI evaluation framework — the Banana Bread Benchmark — and the human recipe won, clearly and without much debate. But the AI world doesn’t sit still. New, more capable models have launched, old benchmarks have been shattered, and frankly, those leftover bananas weren’t going to last another week. With a Melodifestivalen final gathering coming up — a perfect excuse to assemble a crowd of willing taste testers — it was time to bake again.
AI Benchmarks: From Maths to Muffins
If you’ve been following the AI space, you’ve heard the word “benchmark” thrown around a lot. AI benchmarks are standardized tests designed to evaluate and compare how well large language models perform across specific capabilities. Think coding, maths, language, and abstract reasoning — all well-covered, all improving at a breathtaking pace. But not a single one tests whether an AI can help you bake something worth eating on a Saturday night. That’s what we test here. And if you think it’s a frivolous metric — consider that banana bread is universally beloved and a surprisingly useful lens for evaluating creativity, recipe coherence, and practical kitchen judgment. It might just be the most honest benchmark of all.
A Quick Word About Mello
For our international readers: Melodifestivalen — affectionately known as Mello — is Sweden’s annual music competition and the country’s selection process for the Eurovision Song Contest. It is not merely a TV show. It is a national institution, a six-week ritual that draws roughly 70% of the Swedish population across its heats and final. For families with children, it is practically obligatory viewing — kids discuss the performances in school on Monday mornings, parents find themselves humming every competing song by March, and the Saturday evening of each heat becomes a cozy living-room ritual that is very hard to opt out of.
The Mello final is one of those rare television events that still genuinely brings people together in the same room. We used that occasion to also gather our guinea pigs. Mello provided the entertainment; the banana bread provided the science.
The Contenders
This year, we challenged three new models from the frontier of AI capabilities — two deep research models designed to synthesize extensive information and reason carefully, and one dedicated thinking model. They received identical instructions: create a vegan banana bread recipe for a 1.9L loaf pan, with chocolate chips as the only permitted add-in. We kept the add-ins consistent this time around: comparing a bread loaded with nuts to one dotted with chocolate chips tells you more about tester preferences than about the quality of the recipe itself.
Meet the class of 2026:
- 🧑‍🍳 Human (Cobaia Kitchen) — our trusted recipe, using almond flour and whole wheat flour for a compact, creamy loaf
- 🇨🇳 Qwen3.5-Plus Deep Research — Alibaba’s powerhouse, equipped with aquafaba and a science explainer table
- 🇪🇺 Mistral Deep Research — the French contender, clean and methodical with a flax egg and vegan yogurt
- 🇺🇸 Gemini 3.1 Pro Thinking — Google’s thinking model, which decided to go full chocolate
What They Baked

Qwen3.5-Plus Deep Research delivered something methodically optimized. Using notably less sugar than the other recipes, it relies on aquafaba from canned chickpeas for lift and a flax egg for binding — a dual egg-replacement strategy backed by a detailed “Why This Recipe Works” science table. The one minor catch: you need to buy a can of chickpeas specifically to harvest a few tablespoons of aquafaba.
Mistral Deep Research gave us the most classically solid recipe of the bunch. Vegan butter, a generous amount of brown sugar, and vegan yogurt for tang and moisture — all bound together with a flax egg. Mistral also delivered an extensive background document on vegan baking science. Fast, clear, and practical.
Gemini 3.1 Pro Thinking looked at our banana bread brief and said: what if we added half a cup of cocoa powder and three-quarters of a cup of chocolate chips? The result is, technically, a chocolate loaf with banana in it. The recipe even referenced our blog, noting that this approach would “drive excellent search traffic” — Gemini apparently does its homework. The result is genuinely impressive, just not quite what we asked for.
Our human recipe uses almond flour combined with whole wheat flour — a combination that produces a denser, creamier loaf. Compact, sweet, and rich. It rose less in the oven, but what it lacks in height it makes up for in texture.
The Experimental Setup
We baked all four loaves on the day of the Melodifestivalen final and assembled six guinea pigs for the experiment:
- Guinea pig 1 (an avid banana bread baker and enthusiast — so we can fully trust his judgement): tasted fresh from the oven, again during Mello, and revisited over the following days
- Guinea pig 2 (myself): same schedule as guinea pig 1
- Guinea pigs 3 & 4 (friends): tasted during the Mello final
- Guinea pig 5 (teenager): present during Mello, but has serious concerns about the environmental, economic, and societal impact of generative AI and therefore declined to taste the AI entries — a stance we respect and document with admiration
- Guinea pig 6 (teenager): present during Mello, but wasn’t hungry enough to work through all four breads
Testing therefore spanned three distinct phases: right out of the oven, during the Mello final, and the days that followed. Science is a process.
The Results

🍫 Taste
Fresh from the oven, opinions formed quickly:
🐹 Guinea pig 1 ranked: 1st Qwen, 2nd Mistral, then a noticeable gap, then 3rd the human recipe. Gemini was filed under a different category: clearly better than the human bread, but essentially a chocolate cookie in loaf form, and too different to rank fairly against the others.
🐹 Guinea pig 2 ranked: 1st Mistral, 2nd Qwen, with the same verdict on Gemini and the human recipe. I found our own bread too sweet and compact by comparison — the AI loaves had less sugar (real or perceived) and were noticeably fluffier.
At this point, both of us were quietly convinced we had reached AGI — every single AI bread had easily surpassed the human benchmark. But during Mello, with the upbeat pop of Sweden’s biggest music night in the background:
- 🐹 Guinea pig 3 went the other direction, naming the human-made bread the clear winner for its creaminess. They also liked the Gemini bread, found Mistral a bit too crumbly and under-salted, and delivered our favourite piece of feedback this year: Qwen’s bread smelled a bit like cat food. (The aquafaba is the prime suspect — none of the rest of us detected this, but it may be worth keeping in mind if you’re serving to guests who aren’t used to vegan baking substitutes.)
- 🐹 Guinea pig 4 liked all breads equally. A diplomat.
- 🐹 Guinea pig 5 ranked the human-made bread as the only one worth trying and declined the AI entries on ethical grounds. The AI models were not available for comment.
- 🐹 Guinea pig 6 found both the human-made and the Gemini version good.
In the following days, something interesting happened: the small but detectable difference between Qwen and Mistral — clearly perceptible fresh out of the oven — essentially disappeared. Two days later, they were both simply good. All breads aged gracefully.
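For readers who like their tasting notes machine-readable, here is a minimal sketch of how the two fresh-from-the-oven rankings above could be tallied. The Borda-style point scheme is our own assumption for illustration, not a method used on the day, and the names `rankings` and `borda_scores` are hypothetical. Gemini is left out because both tasters declined to rank it against the others.

```python
# Fresh-from-the-oven rankings as reported above, best first.
# Gemini is omitted: both tasters filed it under a different category.
rankings = {
    "Guinea pig 1": ["Qwen", "Mistral", "Human"],
    "Guinea pig 2": ["Mistral", "Qwen", "Human"],
}


def borda_scores(rankings):
    """Assumed scoring: with n loaves ranked, 1st place earns n points, last earns 1."""
    scores = {}
    for order in rankings.values():
        n = len(order)
        for position, loaf in enumerate(order):
            scores[loaf] = scores.get(loaf, 0) + (n - position)
    return scores


scores = borda_scores(rankings)

# Print loaves from highest to lowest score.
for loaf, points in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{loaf}: {points}")
```

With only two tasters, Qwen and Mistral tie on points and the human loaf trails, which matches the written verdicts. A tally like this says nothing about the Mello crowd, whose feedback came as comments rather than full rankings.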
👁️ Looks
The human-made bread is the clear visual underdog: compact, low-rise, and lacking the golden dome that makes a loaf photograph well. Mistral took the beauty prize with its golden crust. Qwen and Gemini both looked like breads you’d want to slice into.
📏 Portion Size
Unlike last year’s notable variation in loaf volumes, this year the sizes were broadly comparable. The human-made bread looks smaller, but that’s primarily a consequence of almond flour’s density — it’s not smaller, just less fluffy.
🍞 Texture
All four breads were texturally solid, a significant improvement over last year’s experiment, where some entries suffered from moisture issues or excessive crumbliness. Mistral is the fluffiest and most delicate, falling apart a little when sliced but remaining within perfectly reasonable bounds. The human-made bread is the most compact and creamy, a direct result of combining almond flour with whole wheat flour. The AI breads, all built on standard all-purpose flour, are fluffier and more even in their crumb.
🌿 Spices
Last year’s AI breads arrived with adventurous — occasionally alienating — flavour profiles. This year, all three models converged on a hint of cinnamon at most, making the results more approachable. Gemini layered both cinnamon and a generous amount of chocolate into the mix, landing somewhere between banana bread and a chocolate dessert cake — bold, but undeniably effective.
⏱️ Timing
| Bread | Speed Verdict |
|---|---|
| Human-made | Fastest overall |
| Mistral | Very fast, almost on par |
| Qwen | Fast, but requires sourcing aquafaba from a can of chickpeas |
| Gemini | The slowest — melting the coconut oil and carefully sifting the cocoa powder add meaningful preparation time |
Part of what makes banana bread such a satisfying recipe is that it’s supposed to be quick — a practical way to give new life to those brown, forgotten bananas sitting on your counter, or to whip up a fast weekend breakfast without much fuss. Mistral and Qwen fit that profile well. Gemini’s creation is genuinely spectacular, but it belongs to a different occasion: think weekend fika with guests, a birthday, or something special — not a weeknight rescue mission.
✅ Processing & Completeness
All four recipes executed cleanly. No missing steps, no ambiguous instructions, no problems. This is a genuine improvement over last year.
The Leaderboard

This year’s ranking depends heavily on who you ask and when. Here is our best attempt at a composite picture:
| Model | Best Moments | Watch Out For |
|---|---|---|
| Mistral | Fluffy, classic, fast, beautiful golden crust | Slightly crumbly; one tester missed salt |
| Qwen | Balanced sweetness, ages beautifully, science-backed | Aquafaba shopping step; one memorable cat food comment |
| Gemini | Genuinely outstanding — as a chocolate loaf | Not really banana bread; more effort required |
| Human | Creamy, compact, crowd favourite with some testers | Dense, sweet, and the visual underdog |
What This Tells Us
Last year, the human-made recipe was the clear winner — the AI loaves were decent, but they couldn’t match the benchmark set by a well-tested, human-crafted recipe. This year, that gap has closed entirely. All three AI breads are as good as our human benchmark, and who comes out on top depends entirely on personal taste. The question is no longer whether AI can bake a good banana bread. It can. The only question left is: what kind of banana bread person are you?
The Mello crowd was more divided than the fresh-from-the-oven tasters, shaped by individual texture preferences and at least one principled stand against generative AI. But the overall picture is clear: the AI models have levelled up. The texture disasters of last year are gone. The exotic spice combinations are gone. The outsized portions are gone. What remains are three well-executed, distinctly different loaves — and one of them (Gemini’s) went on a bold creative detour and landed somewhere genuinely delicious, even if it’s not what we asked for.
If you want fluffy and fast, go Mistral. If you want balanced and methodical, go Qwen. If you want to impress someone at a dinner party, go Gemini — and call it a double chocolate banana cake, because that’s what it is. And if you want something creamy, compact, and unashamedly old-school, the human recipe remains standing.
The benchmark is still open. The AI models keep getting better. We may need a harder test next year. Perhaps croissants.
Disclaimer
No animals were harmed during the Banana Bread Benchmark experiment. While my human friends were referred to as “guinea pigs” in the taste-testing process, no actual guinea pigs were involved. In fact, banana bread is not a suitable food for real guinea pigs due to its high sugar content and various ingredients that can be harmful to them. However, these adorable pets may enjoy a small slice of fresh banana as an occasional treat!

