TL;DR: When a team or vendor brings you an AI proposal, three questions separate the serious from the performative: What’s the failure mode you’re most worried about? Where is the human in the loop, and what does that cost? What does this look like at week 12, not week 2? Each one tests something AI demos systematically hide. Asked in sequence, they let you form a real judgment in the room without pretending to be the technical expert.

The scene repeats in every company I work with. A team — sometimes internal, sometimes a vendor — walks in with a polished AI proposal. Slides are clean, the demo runs, the room nods. You sense something is off, but you don’t have the language to push back without sounding dismissive of the future. So you nod too, and the proposal moves forward on momentum rather than judgment.

What follows are the three questions I’ve watched senior operators use to interrupt that momentum. They aren’t a procurement checklist. They are an interrogation pattern — short, portable, and specifically designed to surface the gap between what AI looks like in a demo and what it costs to run in production. By the end of this piece you should be able to recall all three and use them in your next meeting.

1. What’s the failure mode you’re most worried about?

The question, verbatim: “What’s the failure mode you’re most worried about?”

This question tests whether the team has thought about how the system fails — not whether it fails, but how. AI systems do not fail the way traditional software fails. They fail probabilistically, silently, and asymmetrically. A weak proposal treats failure as binary: it works, or we patch it. A strong proposal can describe the specific shape of being wrong.

A strong answer sounds like this: “The model is most likely to hallucinate citations on edge-case queries from our long-tail customer segment. We’ve measured the rate at roughly 4% on our eval set, and the cost is asymmetric: one bad citation in a regulated workflow can wipe out the value of a hundred good ones. We’re mitigating with retrieval grounding plus a confidence-gated escalation.” Notice what’s happening: they have an eval set, they have a measured rate, they understand asymmetry, and they have a mitigation tied to a specific failure shape.
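If you want to see why a measured rate and an asymmetry belong in the same sentence, the arithmetic is short. The sketch below is illustrative only: the eval counts (40 failures in 1,000 cases) and the 100-to-1 cost ratio are assumptions chosen to mirror the quoted answer, the function names are hypothetical, and the Wilson score interval is simply one common way to put error bars on a measured rate.

```python
import math


def wilson_interval(failures: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a measured failure rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = failures / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def net_value_per_output(failure_rate: float, value_per_good: float, cost_per_bad: float) -> float:
    """Expected value of one output when good and bad outcomes are priced differently."""
    return (1 - failure_rate) * value_per_good - failure_rate * cost_per_bad


def break_even_failure_rate(value_per_good: float, cost_per_bad: float) -> float:
    """Failure rate at which the expected value per output is exactly zero."""
    return value_per_good / (value_per_good + cost_per_bad)


# Assumed numbers: 40 failures in a 1,000-case eval set (4%),
# and one bad output costing 100x what one good output is worth.
low, high = wilson_interval(failures=40, total=1000)
print(f"measured rate 4.0%, plausible range {low:.1%} to {high:.1%}")
print(f"net value per output at 4%: {net_value_per_output(0.04, 1.0, 100.0):.2f}")
print(f"break-even failure rate: {break_even_failure_rate(1.0, 100.0):.2%}")
```

Under those assumptions a 4% error rate is value-destroying on its own, and the break-even rate sits just under 1%. That is why the mitigation has to be tied to the specific failure rather than promised in general.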

A weak answer sounds like this: “The model is highly accurate, and we’ll add guardrails.” That sentence contains zero information. “Highly accurate” without a measured rate against a representative dataset is marketing language. “Guardrails” without naming the specific failure being guarded against is a synonym for hope. The MIT NANDA initiative’s 2025 review of enterprise GenAI pilots found that roughly 95% failed to deliver measurable returns, and the through-line in the failures was rarely the model itself — it was teams that hadn’t characterized failure before deployment.

Takeaway: If the team can’t describe the specific shape of being wrong, they haven’t built the system yet. They’ve built the demo.

2. Where is the human in the loop, and what does that cost?

The question, verbatim: “Where is the human in the loop, and what does that cost?”

This question tests whether the team has priced human supervision honestly. Most AI demos hide labor. The autonomous magic on stage gets quietly reintroduced as review, escalation, correction, and rework once the system meets real users. The cost of that labor — measured in headcount, hours, or response latency — is often the entire economic case for the project, and it is rarely on the slide.

A strong answer is specific about three things — where humans intervene, what triggers the intervention, and what the steady-state intervention rate is. It sounds like: “A human reviews any output flagged by our confidence model, plus a 5% sample of unflagged outputs for drift detection. At current volumes that’s two FTEs of review work, which we’ve staffed. We expect the flag rate to drop from 12% to 6% over the first quarter as we tune the confidence threshold.” Numbers, roles, expected trajectory.
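The review load in an answer like that can be checked on the back of an envelope. Here is a minimal sketch, assuming the reviewed share is every flagged output plus a fixed sample of the unflagged remainder. The weekly volume, minutes per review, and productive hours per FTE are hypothetical inputs, not figures from the quoted answer, and the function names are illustrative.

```python
def review_fraction(flag_rate: float, sample_rate: float) -> float:
    """Share of all outputs a human sees: every flagged output
    plus a sample of the unflagged remainder."""
    return flag_rate + sample_rate * (1.0 - flag_rate)


def review_ftes(
    weekly_volume: float,
    flag_rate: float,
    sample_rate: float,
    minutes_per_review: float,
    productive_hours_per_week: float,
) -> float:
    """Full-time reviewers needed to keep up with the weekly review queue."""
    reviews = weekly_volume * review_fraction(flag_rate, sample_rate)
    return reviews * minutes_per_review / (productive_hours_per_week * 60.0)


# Hypothetical inputs: 5,000 outputs a week, 5 minutes per review,
# 35 productive hours per reviewer per week, 5% sampling of unflagged outputs.
for flag_rate in (0.12, 0.06):
    share = review_fraction(flag_rate, 0.05)
    ftes = review_ftes(5000, flag_rate, 0.05, 5.0, 35.0)
    print(f"flag rate {flag_rate:.0%}: {share:.1%} of outputs reviewed, {ftes:.2f} FTEs")
```

With these inputs, halving the flag rate from 12% to 6% moves the workload from roughly 1.95 to 1.27 FTEs, not to half, because the drift-detection sample sets a floor. A team that has priced supervision honestly will already know where that floor is.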

A weak answer collapses the human into a verb. “There will be human oversight.” “A reviewer can step in if needed.” “It’s fully automated, with optional review.” Optional review is not review. If the team can’t tell you who the human is, when they intervene, and what their week looks like, they have not built an operating model — they have built an interface. And in practice, the interventions don’t disappear. They migrate. They show up as escalations to your most senior people, who become a hidden subsidy to a system marketed as autonomous.

Takeaway: Every AI system has a human cost. The question is whether the team has named it, staffed it, and put it in the budget — or whether it will land on someone else’s plate after launch.

3. What does this look like at week 12, not week 2?

The question, verbatim: “What does this look like at week 12, not week 2?”

This question tests whether the team is building for the demo or for production. AI systems do not behave on a flat curve. They decay. Models drift as input distributions change, retrieval indexes go stale, prompt templates calcify against evolving user behavior, and edge cases surface that no one anticipated. Week 2 is when everyone is still excited. Week 12 is when you find out whether the system has an owner.

A strong answer describes the operating infrastructure that exists because the team expects degradation. It sounds like: “By week 12 we expect to have logged about 40,000 production interactions. We’ve built a labeled eval harness that runs weekly against a held-out slice. We have alerting on three drift signals — input distribution shift, output confidence drop, and human-override rate. The owning team has a standing 30-minute weekly review of those signals, and a defined retraining trigger.” That is what production-grade AI looks like — evaluation infrastructure, drift detection, named owners, and a cadence.
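For a sense of how small the first version of that alerting can be, here is a rough sketch of a weekly check on the three signals named in the answer. Everything specific in it is an assumption: the WeeklySnapshot fields, the threshold values, and the use of the Population Stability Index (PSI) as the measure of input shift are illustrative choices, not a description of any particular team's setup.

```python
import math
from dataclasses import dataclass


def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions (as proportions)."""
    if len(expected) != len(actual):
        raise ValueError("bin counts must match")
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total


@dataclass
class WeeklySnapshot:
    input_bins: list[float]  # share of traffic per input category
    mean_confidence: float   # average model confidence on outputs
    override_rate: float     # share of outputs a human overrode


def drift_alerts(
    baseline: WeeklySnapshot,
    current: WeeklySnapshot,
    psi_limit: float = 0.2,              # assumed threshold
    confidence_drop_limit: float = 0.05,  # assumed threshold
    override_rise_limit: float = 0.03,    # assumed threshold
) -> list[str]:
    """Return the names of the drift signals that crossed their thresholds."""
    alerts = []
    if psi(baseline.input_bins, current.input_bins) > psi_limit:
        alerts.append("input_distribution_shift")
    if baseline.mean_confidence - current.mean_confidence > confidence_drop_limit:
        alerts.append("output_confidence_drop")
    if current.override_rate - baseline.override_rate > override_rise_limit:
        alerts.append("human_override_rate")
    return alerts


# Hypothetical launch-week baseline versus a later week.
launch = WeeklySnapshot(input_bins=[0.5, 0.3, 0.2], mean_confidence=0.90, override_rate=0.05)
later = WeeklySnapshot(input_bins=[0.2, 0.3, 0.5], mean_confidence=0.82, override_rate=0.06)
print(drift_alerts(launch, later))
```

The code is the easy part. What the week-12 question probes is whether someone owns the thresholds, reads the alerts every week, and knows what happens when one fires.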

A weak answer treats the launch as the finish line. “Once it’s deployed, we’ll monitor it.” “We’ll iterate based on user feedback.” “It should improve over time as we collect more data.” None of those sentences describe a system. They describe a wish. Stanford’s AI Index has tracked the gap between AI proof-of-concept activity and durable enterprise adoption for years, and the consistent pattern is that the ratio is dismal precisely because organizations under-invest in the unglamorous infrastructure that keeps a model honest after launch.

Takeaway: AI proposals that don’t have a week-12 plan are not proposals. They are pilots dressed as products. Approve them on those terms, or send them back for the missing half.

Common ways to misuse the questions, and what to do when a proposal fails one of the three

The three questions are not a scoring rubric. They are a diagnostic. The most common ways operators misuse them are worth naming.

The first failure mode is asking all three at once. Stack them and the team will pick the easiest to answer and treat that as the response. Ask one, listen to the full answer, then move on.

The second is treating a weak answer as disqualifying. It usually isn’t. A weak answer is a signal that the proposal isn’t ready, not that the project isn’t worth doing. The right move is to send it back with the specific question it failed: “Come back when you can describe the failure mode you’re most worried about, with a measured rate.” That is a constructive ask, not a rejection.

The third — and the one I see most in senior leaders — is asking the questions but accepting confident-sounding non-answers because the room is moving. Confidence is not specificity. If you can’t restate their answer in your own words, in two sentences, with concrete numbers, they have not actually answered. The composure of the speaker is not the quality of the answer.

The fourth is using the questions defensively, to kill projects you were skeptical of for other reasons. The pattern is corrosive. The questions work because they are neutral: they reward serious teams and expose unserious ones. Ask them of every proposal, including the ones you like. Once your team learns you ask them every time, the quality of what reaches your desk goes up on its own.

How the three questions work together

Read individually, each question tests one thing. Read together, they form a single judgment pattern — can this team describe the system’s failure shape, the human cost, and the twelve-week trajectory? A team that answers all three with specificity has done the work. A team that answers only one or two has built a piece of the system in their heads and is asking you to fund the rest on faith. A team that answers none has brought you a deck, not a proposal.

You do not need to be a technical expert to run this pattern. You need to be willing to ask the question, listen for specificity, and notice the difference between an answer and a posture. That is the entire skill. The three questions are: What’s the failure mode you’re most worried about? Where is the human in the loop, and what does that cost? What does this look like at week 12, not week 2? Memorize them. Use them on the next proposal that crosses your desk.


FAQ

Q: How do I ask these questions without sounding like I’m trying to kill the project?
A: Frame them as the questions you’d want your CEO or board to ask you about the project. You’re not pushing back; you’re stress-testing the proposal so it survives the next room. Serious teams welcome that framing.

Q: What if the team answers the questions confidently but I still sense something is off?
A: Confidence is not specificity. Ask them to restate their answer in numbers: measured failure rate, FTEs of review work, expected drift signals. If the numbers aren’t there, the answer wasn’t either. Trust the absence of specificity over the presence of polish.

Q: Are these questions appropriate for vendor pitches as well as internal proposals?
A: Yes, and vendors typically fail them faster than internal teams, because their incentive is to make the system sound autonomous and the timeline sound short. The week-12 question in particular is designed for vendor demos.

Q: My team is technical and will resist non-technical interrogation. How do I handle that?
A: None of the three questions requires technical knowledge to ask, and none of them can be answered with technical jargon alone. They’re questions about operating reality: failure, labor, time. Strong technical teams answer them well. Teams that bristle at them are usually the teams that need them most.

Q: What if the proposal fails one question but passes the other two?
A: Send it back with the specific gap named. A proposal that knows its failure mode and its human cost but has no week-12 plan is a strong start with a missing operating model. That’s a constructive revision, not a rejection. The questions are designed to make the next version of the proposal better, not to settle a verdict in the room.