The Banana Bread Benchmark Returns

Year Two — Can AI Finally Beat the Human Baker?

Authors: Cobaia Kitchen, Claude Sonnet 4.6 Thinking
Photos: Cobaia Kitchen, Nano Banana 2

The sweet, slightly over-ripe smell of banana bread is back in our kitchen. So is the science. Last year, we ran what is probably the world’s most delicious AI evaluation framework — the Banana Bread Benchmark — and the human recipe won, clearly and without much debate. But the AI world doesn’t sit still. New, more capable models have launched, old benchmarks have been shattered, and frankly, those leftover bananas weren’t going to last another week. With a Melodifestivalen final gathering coming up — a perfect excuse to assemble a crowd of willing taste testers — it was time to bake again.

AI Benchmarks: From Maths to Muffins

If you’ve been following the AI space, you’ve heard the word “benchmark” thrown around a lot. AI benchmarks are standardized tests designed to evaluate and compare how well large language models perform across specific capabilities. Think coding, maths, language, and abstract reasoning — all well-covered, all improving at a breathtaking pace. But not a single one tests whether an AI can help you bake something worth eating on a Saturday night. That’s what we test here. And if you think it’s a frivolous metric — consider that banana bread is universally beloved and a surprisingly useful lens for evaluating creativity, recipe coherence, and practical kitchen judgment. It might just be the most honest benchmark of all.

A Quick Word About Mello

For our international readers: Melodifestivalen — affectionately known as Mello — is Sweden’s annual music competition and the country’s selection process for the Eurovision Song Contest. It is not merely a TV show. It is a national institution, a six-week ritual that draws roughly 70% of the Swedish population across its heats and final. For families with children, it is practically obligatory viewing — kids discuss the performances in school on Monday mornings, parents find themselves humming every competing song by March, and the Saturday evening of each heat becomes a cozy living-room ritual that is very hard to opt out of.

The Mello final is one of those rare television events that still genuinely brings people together in the same room. We used that occasion to also gather our guinea pigs. Mello provided the entertainment; the banana bread provided the science.

The Contenders

This year, we challenged three new models from the frontier of AI capabilities — two deep research models designed to synthesize extensive information and reason carefully, and one dedicated thinking model. They received identical instructions: create a vegan banana bread recipe for a 1.9L loaf pan, with chocolate chips as the only permitted add-in. We kept the add-ins consistent this time around: comparing a bread loaded with nuts to one dotted with chocolate chips tells you more about tester preferences than about the quality of the recipe itself.

Meet the class of 2026:

  1. 🧑‍🍳 Human (Cobaia Kitchen) — our trusted recipe, using almond flour and whole wheat flour for a compact, creamy loaf
  2. 🇨🇳 Qwen3.5-Plus Deep Research — Alibaba’s powerhouse, equipped with aquafaba and a science explainer table
  3. 🇪🇺 Mistral Deep Research — the French contender, clean and methodical with a flax egg and vegan yogurt
  4. 🇺🇸 Gemini 3.1 Pro Thinking — Google’s thinking model, which decided to go full chocolate

What They Baked

Four vegan banana bread loaves lined up side by side on a wooden terrace table, each marked with a small flag on a toothpick: Chinese flag for Qwen (far left, deep brown with chocolate chips), a white guinea pig flag for the human recipe (second from left, golden brown and compact), US flag for Gemini (second from right, very dark almost black from cocoa), and EU flag for Mistral (far right, golden and evenly risen). In front of each loaf lies a single cut slice, clearly showing the distinct colour and texture differences between the four recipes. Bright outdoor daylight.

Qwen3.5-Plus Deep Research delivered something methodically optimized. Using notably less sugar than the other recipes, it relies on aquafaba from canned chickpeas for lift and a flax egg for binding — a dual egg-replacement strategy backed by a detailed “Why This Recipe Works” science table. The one minor catch: you need to buy a can of chickpeas specifically to harvest a few tablespoons of aquafaba.

Mistral Deep Research gave us the most classically solid recipe of the bunch. Vegan butter, a generous amount of brown sugar, and vegan yogurt for tang and moisture — all bound together with a flax egg. Mistral also delivered an extensive background document on vegan baking science. Fast, clear, and practical.

Gemini 3.1 Pro Thinking looked at our banana bread brief and said: what if we added half a cup of cocoa powder and three-quarters of a cup of chocolate chips? The result is, technically, a chocolate loaf with banana in it. The recipe even referenced our blog, noting that this approach would “drive excellent search traffic” — Gemini apparently does its homework. The result is genuinely impressive, just not quite what we asked for.

Our human recipe uses almond flour combined with whole wheat flour — a combination that produces a denser, creamier loaf. Compact, sweet, and rich. It rose less in the oven, but what it lacks in height it makes up for in texture.

The Experimental Setup

We baked all four loaves on the day of the Melodifestivalen final and assembled six guinea pigs for the experiment:

  • Guinea pig 1 (an avid banana bread baker and enthusiast — so we can fully trust his judgement): tasted fresh from the oven, again during Mello, and revisited over the following days
  • Guinea pig 2 (myself): same schedule as guinea pig 1
  • Guinea pigs 3 & 4 (friends): tasted during the Mello final
  • Guinea pig 5 (teenager): present during Mello, but has serious concerns about the environmental, economic, and societal impact of generative AI and therefore declined to taste the AI entries — a stance we respect and document with admiration
  • Guinea pig 6 (teenager): present during Mello, but wasn’t hungry enough to work through all four breads

Testing therefore spanned three distinct phases: right out of the oven, during the Mello final, and the days that followed. Science is a process.

The Results

Four freshly baked vegan banana bread loaves arranged on a wooden cutting board on an outdoor terrace table, each labelled with a small flag: Chinese flag for Qwen, US flag for Gemini, EU flag for Mistral, and a white guinea pig flag for the human recipe. In the foreground, two patterned plates hold slices of all four breads side by side for tasting, each slice also marked with its corresponding flag. Two double-walled glass espresso cups sit beside the plates in the sunshine.

🍫 Taste

Fresh from the oven, opinions formed quickly:

🐹 Guinea pig 1 ranked: 1st Qwen, 2nd Mistral, then a noticeable gap, then 3rd the human recipe. Gemini was filed under a different category: clearly better than the human bread, but essentially a chocolate cookie in loaf form and too different to rank fairly against the others.

🐹 Guinea pig 2 ranked: 1st Mistral, 2nd Qwen, with the same verdict on Gemini and the human recipe. I found our own bread too sweet and compact by comparison — the AI loaves had less sugar (real or perceived) and were noticeably fluffier.

At this point, both of us were quietly convinced we had reached AGI — every single AI bread had easily surpassed the human benchmark. But during Mello, with the upbeat pop of Sweden’s biggest music night in the background:

  • 🐹 Guinea pig 3 went the other direction, naming the human-made bread the clear winner for its creaminess. They also liked the Gemini bread, found Mistral a bit too crumbly and under-salted, and delivered our favourite piece of feedback this year: Qwen’s bread smelled a bit like cat food. (The aquafaba is the prime suspect — none of the rest of us detected this, but it may be worth keeping in mind if you’re serving to guests who aren’t used to vegan baking substitutes.)
  • 🐹 Guinea pig 4 liked all breads equally. A diplomat.
  • 🐹 Guinea pig 5 ranked the human-made bread as the only one worth trying and declined the AI entries on ethical grounds. The AI models were not available for comment.
  • 🐹 Guinea pig 6 found both the human-made and the Gemini version good.

In the following days, something interesting happened: the small but detectable difference between Qwen and Mistral — clearly perceptible fresh out of the oven — essentially disappeared. Two days later, they were both simply good. All breads aged gracefully.

👁️ Looks

The human-made bread is the clear visual underdog — compact, low-rise, and lacking the golden dome that makes a loaf photograph well. Mistral took the beauty prize with a beautifully golden crust. Qwen and Gemini both looked like breads you’d want to slice into.

📏 Portion Size

Unlike last year’s notable variation in loaf volumes, this year the sizes were broadly comparable. The human-made bread looks smaller, but that’s primarily a consequence of almond flour’s density — it’s not smaller, just less fluffy.

🍞 Texture

All four breads were texturally solid across the board — a significant improvement over last year’s experiment, where some entries suffered from moisture issues or excessive crumbliness. Mistral is the fluffiest and most delicate, falling apart a little when sliced but remaining within perfectly reasonable bounds. The human-made bread is the most compact and creamy, a direct result of combining almond flour with whole wheat. The AI breads, all built on standard all-purpose flour, are fluffier and more even in their crumb.

🌿 Spices

Last year’s AI breads arrived with adventurous — occasionally alienating — flavour profiles. This year, all three models converged on a hint of cinnamon at most, making the results more approachable. Gemini layered both cinnamon and a generous amount of chocolate into the mix, landing somewhere between banana bread and a chocolate dessert cake — bold, but undeniably effective.

⏱️ Timing

Bread | Speed Verdict
Human-made | Fastest overall
Mistral | Very fast, almost on par
Qwen | Fast, but requires sourcing aquafaba from a can of chickpeas
Gemini | The slowest: melting the coconut oil and carefully sifting the cocoa powder add meaningful preparation time

Part of what makes banana bread such a satisfying recipe is that it’s supposed to be quick — a practical way to give new life to those brown, forgotten bananas sitting on your counter, or to whip up a fast weekend breakfast without much fuss. Mistral and Qwen fit that profile well. Gemini’s creation is genuinely spectacular, but it belongs to a different occasion: think weekend fika with guests, a birthday, or something special — not a weeknight rescue mission.

✅ Processing & Completeness

All four recipes executed cleanly. No missing steps, no ambiguous instructions, no problems. This is a genuine improvement from last year across the board.

The Leaderboard

An illustrated judging panel of six guinea pigs sitting in a row behind a long desk in a cozy living room, evaluating four banana bread recipes in a food competition setting. Each guinea pig displays a distinct personality: the first raises a scorecard enthusiastically, the second wears tiny glasses and looks analytical, the third points emphatically at the human-made bread slice, the fourth shrugs happily surrounded by all four plates, the fifth crosses its arms and deliberately turns away from the AI bread plates, and the sixth looks mildly bored with only two plates in front of it. The desk is scattered with scoring paddles, chocolate chips, and crumbs, with four labelled bread plates reading Qwen, My own, Gemini, and Mistral. In the background, a large TV displays a colourful music show in a warmly lit living room.

This year’s ranking depends heavily on who you ask and when. Here is our best attempt at a composite picture:

Model | Best Moments | Watch Out For
Mistral | Fluffy, classic, fast, beautiful golden crust | Slightly crumbly; one tester missed salt
Qwen | Balanced sweetness, ages beautifully, science-backed | Aquafaba shopping step; one memorable cat food comment
Gemini | Genuinely outstanding, as a chocolate loaf | Not really banana bread; more effort required
Human | Creamy, compact, crowd favourite with some testers | Dense, sweet, and the visual underdog
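
If you would like to tally the verdicts yourself, here is a minimal Python sketch. It is not how the composite picture above was put together. It encodes each tester's top pick(s) as reported in the Taste section and splits one vote evenly across ties. The names TOP_PICKS and tally, the vote-splitting rule, and the reading of "liked all equally" and "found both good" as ties are all assumptions made for this illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Top picks as reported in the Taste section. How ties and abstentions
# are encoded here is an assumption of this sketch, not the authors' method.
TOP_PICKS = {
    "guinea pig 1": ["Qwen"],                                # ranked Qwen 1st
    "guinea pig 2": ["Mistral"],                             # ranked Mistral 1st
    "guinea pig 3": ["Human"],                               # named the human bread the clear winner
    "guinea pig 4": ["Qwen", "Mistral", "Gemini", "Human"],  # liked all breads equally
    "guinea pig 5": ["Human"],                               # tasted only the human bread
    "guinea pig 6": ["Human", "Gemini"],                     # found both good
}


def tally(top_picks, skip=()):
    """Give each tester one vote, split evenly across their top picks."""
    scores = defaultdict(Fraction)
    for tester, picks in top_picks.items():
        if tester in skip:
            continue  # leave out testers we do not want to count
        for loaf in picks:
            scores[loaf] += Fraction(1, len(picks))
    return dict(scores)


if __name__ == "__main__":
    ranked = sorted(tally(TOP_PICKS).items(), key=lambda kv: kv[1], reverse=True)
    for loaf, score in ranked:
        print(f"{loaf}: {float(score):.2f}")
```

The outcome is sensitive to these choices. Calling tally(TOP_PICKS, skip={"guinea pig 5"}), for example, leaves out the tester who declined the AI entries, which is one reason to read any single number from such a small panel with caution.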

What This Tells Us

Last year, the human-made recipe was the clear winner — the AI loaves were decent, but they couldn’t match the benchmark set by a well-tested, human-crafted recipe. This year, that gap has closed entirely. All three AI breads are as good as our human benchmark, and who comes out on top depends entirely on personal taste. The question is no longer whether AI can bake a good banana bread. It can. The only question left is: what kind of banana bread person are you?

The Mello crowd was more divided than the fresh-from-the-oven tasters, shaped by individual texture preferences and at least one principled stand against generative AI. But the overall picture is clear: the AI models have levelled up. The texture disasters of last year are gone. The exotic spice combinations are gone. The outsized portions are gone. What remains are three well-executed, distinctly different loaves — and one of them (Gemini’s) went on a bold creative detour and landed somewhere genuinely delicious, even if it’s not what we asked for.

If you want fluffy and fast, go Mistral. If you want balanced and methodical, go Qwen. If you want to impress someone at a dinner party, go Gemini — and call it a double chocolate banana cake, because that’s what it is. And if you want something creamy, compact, and unashamedly old-school, the human recipe remains standing.

The benchmark is still open. The AI models keep getting better. We may need a harder test next year. Perhaps croissants.

Disclaimer

No animals were harmed during the Banana Bread Benchmark experiment. While my human friends were referred to as “guinea pigs” in the taste-testing process, no actual guinea pigs were involved. In fact, banana bread is not a suitable food for real guinea pigs due to its high sugar content and various ingredients that can be harmful to them. However, these adorable pets may enjoy a small slice of fresh banana as an occasional treat!