I Threw 5 AI Models at a High-Stakes Life Decision. They All Needed Babysitting.

I wanted to punch my screen. Maybe throw my laptop out the window. Definitely wanted to grab Claude by its virtual collar and shake it.

Here I was, helping my cousin fill out her college application — the single most consequential form of her 18 years on this planet — and I had assembled a roster of AI models to help. Claude. Gemini. GPT. Yuanbao. Doubao. The whole starting lineup.

What I got was a masterclass in managed disappointment.

AI doesn’t fail because it’s stupid. It fails because we keep pretending it’s something it’s not: an autonomous decision-maker instead of a brilliant, easily confused intern.

The Setup: A Decision Worth Everything

Let me set the scene. My cousin is an arts-track student in Sichuan, where 2025 marks the first year of a brand-new exam reform. Nobody in my family had filled out a college application in over a decade. The rules had changed. The scoring system had changed. Everything we thought we knew was obsolete.

So I did what any reasonable tech person would do: I threw AI at it.

Not the specialized tools. I didn’t even know those existed until after I finished. I used the general-purpose models I actually trust day-to-day. And what I discovered over the course of this grueling, multi-day odyssey is that the gap between what AI can do and what we need it to do isn’t about intelligence. It’s about trust, context, and the dirty work of delegation.

Round 1: Claude Reads the Rulebook (And Actually Helps)

Before scores even came out, I spent a full day with Claude Opus, building a detailed guide to the new arts exam rules. And honestly? Claude was impressive. It proactively reminded me to use 2026 official policies, flagged that pre-2023 data was basically useless, and helped me construct a framework I could actually work with.

But then its usage limits kicked in. One question, a single sentence, would burn 25% of my quota. Context window too large, quota gone, Claude benched.

Here’s the thing though: Claude was the only model that, when it raised a concern, would go verify it itself before reporting back to me. Every other model just threw the concern over the fence and said “you should probably check this.” That difference matters enormously when you’re exhausted and stressed and just need someone to handle something.

Round 2: The Data Hunt — Where AI Hits Its Ceiling

Once scores dropped, the demand for information exploded. Historical admission scores, percentile rankings, program details, tuition, employment outcomes: all of it scattered across individual university websites, some of which AI couldn’t even access.

My sisters used Doubao. They’d voice-ask it: “What’s the 2025 cutoff for this program at this school?” And every single time, after Doubao answered, they’d look at me and say: “But is Doubao telling the truth?”

In high-stakes moments, no one trusts AI. And they shouldn’t. Trust isn’t granted by capability — it’s earned through verification, and verification takes exactly as long as doing it yourself.

I ended up doing most of the searching on Baidu. Plain old search. Clicking through to official university websites. Reading PDFs. The same thing I would have done fifteen years ago, just slightly faster.

Gemini did shine when I asked it to compile school characteristics, core strengths, and employment data into a summary table. That was genuinely useful — information too scattered for any human to gather manually, synthesized into something I could actually work with. I had it include source links for every claim so I could verify.

But then Gemini started deleting columns from my table without being asked. Columns I needed. Columns I never complained about. Just… gone. I still don’t know why it does this.
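
A silent column deletion like this is cheap to catch mechanically. The following is a minimal sketch in Python, assuming the table comes back as pipe-delimited text with the column names in its first row; the function names and the sample tables are made up for illustration and are not anything Gemini provides.

```python
from typing import List


def header_columns(table: str) -> List[str]:
    """Return the column names from the first row of a pipe-delimited table."""
    first_row = table.strip().splitlines()[0]
    return [cell.strip() for cell in first_row.strip().strip("|").split("|")]


def dropped_columns(before: str, after: str) -> List[str]:
    """List columns present in `before` but missing from `after`."""
    kept = set(header_columns(after))
    return [col for col in header_columns(before) if col not in kept]


# Illustrative tables only; the values are placeholders, not real data.
before = "| School | Tuition | Source |\n|---|---|---|\n| A | 5000 | link |"
after = "| School | Source |\n|---|---|\n| A | link |"

missing = dropped_columns(before, after)
print(missing)
```

Running a check like this on every revised table turns "I think something is missing" into a concrete list of what to ask the model to restore.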

Round 3: Yuanbao and the Art of Task Confusion

I had an HTML document of admission plans that I couldn’t search. So I took screenshots and sent them to Yuanbao, asking it to extract the data into a table AND cross-reference each school’s latest information from their official websites.

It refused. Told me it couldn’t access the internet, then smugly offered to “teach me a trick” for doing it myself. I was furious.

Days later, while writing this, I tested it again with the same images — but this time I only asked it to extract the data. No cross-referencing. It worked perfectly. Recognized everything.

The problem wasn’t the AI. It was me. I’d bundled two tasks into one prompt, and when it couldn’t do the second, it gave up on the first entirely. Bad AI results are almost always bad human delegation wearing a disguise.
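
The same lesson can be sketched in a few lines of Python: send each task as its own prompt, and record a failure instead of letting it sink everything else. This is only an illustration under stated assumptions. `ask_model` stands for whatever function sends a prompt to a chat model, and `fake_model` is a hypothetical stand-in that cannot browse the web; neither is a real Yuanbao API.

```python
from typing import Callable, Dict, List


def run_steps(steps: List[str], ask_model: Callable[[str], str]) -> Dict[str, str]:
    """Send each step as its own prompt; record failures instead of aborting."""
    results: Dict[str, str] = {}
    for step in steps:
        try:
            results[step] = ask_model(step)
        except Exception as err:  # one failed step must not sink the others
            results[step] = f"FAILED: {err}"
    return results


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a chat model with no internet access."""
    if "official website" in prompt:
        raise RuntimeError("no internet access")
    return "table extracted"


steps = [
    "Extract the admission plan data in these screenshots into a table.",
    "Cross-reference each school against its official website.",
]
results = run_steps(steps, fake_model)
for step, outcome in results.items():
    print(f"{outcome}  <-  {step}")
```

With the two requests separated, the extraction succeeds even though the cross-referencing step fails, which is exactly the result the bundled prompt threw away.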

Round 4: The Cross-Check Circus

With a shortlist of twenty-plus schools, I needed to finalize the order. This is where it got subjective: in-province over out-of-province, public over private, tuition caps, school prestige versus program fit. Real human trade-offs with no right answer.

I asked every model for input. Here’s what happened:

Yuanbao (DeepSeek): Got the year wrong. Assumed 2025 instead of 2026, even though that context was in our conversation history. Then it couldn’t find one school in its database and recommended dropping it. When I told it the school had become newly independent this year, it suddenly recommended moving it UP the list. Same school, opposite advice, triggered by one sentence from me. It also condensed my twenty-plus options down to seven, apparently unaware that the parallel application system allows forty-five choices.

GPT (Deep Research): The only model that validated my existing plan. I was immediately suspicious — was it just agreeing with me? But when I explained our subjective reasoning, it understood and offered genuinely sharp analysis. It cut straight to the core logic of my arrangement. Few words, lethal precision.

Gemini (Chrome): Gave me formatting I could directly copy into the application system, which was unexpectedly practical. Also suggested strategies I hadn’t considered — like leveraging newly added programs as “bargain picks.” But it had the same annoying habit of throwing up alarm-bell warnings without verifying them first. Bold red text, siren emojis, zero follow-through.

Claude: Knew the most context, gave the most personalized reminders — “you mentioned your cousin didn’t like this school, but it’s in your list, keep or drop?” But the usage limits made sustained work impossible.

The Real Bottleneck Isn’t the Model. It’s You.

Looking back, my AI orchestration was chaos. I used whichever model had quota remaining, whichever was convenient, whichever hadn’t annoyed me in the last hour. There was no system.

And here’s the uncomfortable truth I keep circling back to: the bottleneck in AI adoption isn’t the model’s intelligence. It’s the human’s ability to decompose problems, assign them clearly, manage context across tools, and verify outputs. We are the bottleneck. Us. The humans.

When I gave Yuanbao a muddled prompt, it failed. When I didn’t explicitly state the year, DeepSeek assumed the wrong one. When I didn’t specify “keep all columns,” Gemini deleted them. Every failure traced back to a gap in my instructions, not a gap in the model’s capability.
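
One way to close those gaps is to stop relying on conversation history and restate the non-negotiable facts in every prompt. Below is a minimal sketch in Python, assuming a plain-text prompt; `build_prompt` and its fields are hypothetical names for illustration, and the wording of the rules is only an example.

```python
from typing import Dict, List


def build_prompt(task: str, context: Dict[str, str], rules: List[str]) -> str:
    """Prefix a task with the facts and constraints the model must not guess."""
    lines = ["Context:"]
    lines += [f"- {key}: {value}" for key, value in context.items()]
    lines.append("Rules:")
    lines += [f"- {rule}" for rule in rules]
    lines.append(f"Task: {task}")
    return "\n".join(lines)


prompt = build_prompt(
    task="Rank the shortlisted schools.",
    context={"Application year": "2026", "Province": "Sichuan"},
    rules=["Keep every column of the table.", "Do not shorten the list."],
)
print(prompt)
```

The point is not the helper itself but the habit: the year, the province, and the "do not delete anything" rule travel with every request, whichever model happens to have quota left.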

That’s not to let the models off the hook. They hallucinate. They lose context. They make confident claims about wrong data. But in high-stakes, highly contextual human decisions — the kind where a wrong answer reshapes someone’s life — AI is not an oracle. It’s a very fast, occasionally brilliant, perpetually confused assistant that needs hand-holding every step of the way.

What This Actually Means

We keep waiting for AI to become autonomous enough to hand off our hardest decisions. But this experience taught me something different: the hardest decisions will always require human judgment, and AI’s real value isn’t in making the decision — it’s in helping us see the decision more clearly.

Claude helped me understand rules I didn’t know. Gemini surfaced information I couldn’t find. GPT validated logic I was uncertain about. None of them filled out the application. All of them made me smarter while I did.

My cousin’s application is submitted. And honestly? I wasn’t just helping her. I was helping the version of myself from fifteen years ago — the one who had no tools, no guidance, and no idea what she was doing. The one who got it wrong.

AI didn’t make the decision for me. It made sure I had enough information to make it myself. And maybe that’s exactly what it should be.

FAQ

Q: If AI needs this much hand-holding, is it even worth using for complex decisions?

A: Yes, but reframe your expectations. AI won't make the decision for you — it makes you faster and better-informed while you make it yourself. The value is in surfacing information and validating logic, not in autonomous judgment.

Q: So the problem is just bad prompting? That sounds like blaming the user.

A: It's both. Models genuinely fail — they hallucinate, lose context, delete data unprompted. But the majority of failures I experienced traced back to unclear task decomposition on my end. If you bundle three requests into one prompt and the model chokes, that's on you.

Q: Which AI model is actually best for high-stakes decisions?

A: None of them alone. Claude had the best contextual awareness but was crippled by usage limits. GPT cut straight to core logic. Gemini was best at data synthesis. The real answer is using multiple models as cross-checks, and trusting none of them blindly.
