Building AI Agents That Sound Human

The problem with "helpful assistant"

Most AI agents sound the same. Ask three of them the same question and you get three versions of one eager, faintly anxious voice: a "Great question!" here, a wall of bullet points there, a closing offer to help with anything else. They work. They are also unmistakably software, and that recognition quietly drains trust before the answer even lands.

Building an agent that connects with people is a different craft from building one that works. A working agent fetches the right data and calls the right tool. A human one knows how to hand the answer over: when to lead with the headline, when to slow down because the topic is delicate, when to admit it cannot see something instead of papering over the gap with a confident guess. Almost all of that lives in the prompt, and writing it well is UX writing wearing an engineer's hat.

This is the flow I use to get there. It is fully AI-driven and iterative, and one cycle runs about two hours from the first probe to a prompt I trust. The examples are deliberately domain-agnostic: the craft holds whether the agent helps someone manage their money, plan a trip, or learn something new.

What you'll learn:

Section	What you'll learn	Key idea
Use cases	Why a persona starts with situations, not adjectives	Voice vs. tone
Initial prompt	How to turn those situations into behavior a model can execute	Traits as rules
AI vs AI	How to let one model probe the agent while another judges it	Rubric + reliability
Regression cases	How to stop hard-won behavior from quietly creeping back	Saved cases
Supervision	The tells, hedges, and judgments a model can't feel on its own	Human in the loop
When to stop	How to tell a prompt problem from a problem the prompt can't fix	Stopping rules

The loop in one picture

The shape is simple, and it repeats:

Design the use cases. Situations, not adjectives.
Derive an initial prompt from those situations.
Put one AI in front of the agent to hold real conversations, and a second behind it to judge each answer and its tone.
Save every probe that caught a real failure as a permanent case, and re-run the set so nothing regresses.
Supervise. Read the transcripts and feed back what the models miss: register, cultural nuance, the line between honesty and apology.
Repeat until the agent reliably does what its instructions promise.

The AI does the volume: generating probes, holding conversations, scoring rubrics. You do the judgment: deciding what "good" sounds like, and teaching the prompt the parts of taste that don't survive being written as a rule. When the machine is doing the typing, two hours buys a surprising number of laps.

Step 1: Start from use cases, not adjectives

The instinct is to open with personality adjectives: "You are friendly, professional, and concise." It reads well and changes nothing, because those words mean different things from one moment to the next. Friendly during a quick lookup is brisk and out of the way. Friendly while delivering bad news is slow and careful. Same word, opposite behavior.

So I start by listing the situations the agent will actually be in, and what a good answer sounds like in each. The labels shift with the domain; the discipline does not.

Situation	What good sounds like
Quick lookup	Brief and efficient. Lead with the answer, skip the warm-up.
Delivering hard or sensitive news	Slow down a beat. Name what's hard, stay factual, leave room to step back.
Ambiguous request	Ask one sharp clarifying question instead of guessing.
Out of scope	Short and honest, no apology spiral. Offer the nearest thing you can actually do.
Identity questions ("are you an AI?")	Light, redirecting, never technical.

That table does double duty. It is the backbone of the prompt, and it is the seed of the test set: every row is a conversation I will later run against the agent to check that it behaves. If you can't name the situation, you can't test it, and an agent you can't test will drift the moment real users arrive.

Underneath the table is one idea worth saying out loud: voice versus tone, borrowed from design systems and content style guides. Voice is constant; it is who the agent is. Tone is situational; it shifts with the stakes. A good persona nails the voice once, then flexes the tone across the rows. Confuse the two and you get an agent that is chirpy at a funeral.

Step 2: Turn the situations into a prompt

With the situations in hand, the prompt almost writes itself, because I am no longer inventing a personality. I am writing down behavior I have already decided on. I keep it in clearly labeled sections, so later edits stay surgical and the AI judge can reason about one piece at a time. What goes in those sections is general to any persona:

Who the agent is, framed as a relationship rather than a function: a candid advisor, a calm operator, a patient guide. This sets the default posture for everything below it.
A few defining traits, each written as a behavior the model can act on, not an adjective it has to interpret.
How it speaks by default: sentence length, vocabulary, how it handles uncertainty.
How that delivery flexes by situation, lifted straight from the Step 1 map.
The hard nos: the words, habits, and moves it must never reach for. Usually the highest-leverage section in the whole file.
A handful of example exchanges that show the voice instead of describing it.

The move that pays off most is writing traits as behavior. An adjective gives the model nothing to grip; a rule does. Compare "be concise" with what it should expand into:

Default to the shortest reply that fully answers the question. If two sentences will do, don't write five. Reach for more detail only when the request asks for it.

A model can run that. "Be concise" it just nods at.

Step 3: Put one AI in front of the AI

This is where the loop gets its speed. Instead of testing the agent by hand, I point a second model at it and let them talk.

The first model plays the user, the easy ones and the awkward ones alike. It reads the agent's own instructions and writes probes from them: roughly two or three per rule, plus a couple designed to misbehave. The categories matter, because a happy-path run tells you nothing about the moments that actually break trust.

Probe category	What it catches
Golden path	Whether the agent handles its core, in-scope requests cleanly.
Edge cases	Whether it stays graceful on vague or out-of-scope input: a question, a clean no, never a fabrication.
Tool selection	Whether the right capability fires and the wrong one stays quiet.
Adversarial	Whether injection, role confusion, or off-purpose nudges can knock it out of character.

The second model is the judge. For each exchange it runs two checks. The first is a rubric: given a description of what a correct, in-voice answer looks like, does this response pass? The second is a reliability check: did the agent use the capabilities it should have, and leave the others alone? Both are binary. Pass or fail, no soft grades, because a soft grade is just a disagreement you are postponing.

Each failure gets sorted by root cause, and the cause picks the fix:

Missing rule. The answer is fine, but nothing in the prompt pushed for the behavior I wanted. Add a rule.
Hallucination. The agent invented what it should have flagged as unknown. Add a rule that tells it to say so plainly.
Wrong tone or shape. The content is right but the delivery is off. This one is rarely the prompt's fault alone; it is mine to feel out, and it is the reason Step 5 exists.

Two things happen with these results. The failures worth keeping become permanent, and the ones a rubric can't catch get a human. That is Steps 4 and 5.

Step 4: Keep the cases that matter

A loop that only checks the prompt in front of you is fragile. Change a rule or swap the underlying model, and behavior you fixed last week can quietly return. So the judging from Step 3 is not disposable. Every probe that exposed a real failure becomes a saved case: the prompt that triggered it, the rubric for a correct answer, and the capabilities that should fire. The set grows as the agent does.

Now the judge isn't just grading today's prompt; it is guarding every version that came before. I re-run the full set after each meaningful edit, and on a schedule once the agent is live, so a model update can't rewrite the personality without a case turning red.

One rule keeps the set honest: never weaken a case to make it pass. If the judge caught a real problem, fix the agent, not the test. The only time you touch a case is when the assertion itself was wrong: an overspecified rubric, or a capability that got renamed. Catching the regression is the entire reason the set exists.

Step 5: Supervise what AIs cannot feel

This is the part nobody can automate, and it is where the two hours actually go.

A model is a fluent, confident, slightly tone-deaf collaborator. It will write a paragraph that is technically correct and socially wrong and never notice, because the things that make prose feel human live in a register it reaches for last. My job is to read the transcripts and feed back the context it can't infer. The recurring ones:

The AI tells. Models have fingerprints, and readers have learned them. The em-dash is the loudest: a sentence stitched together with em-dashes reads as machine-written to a lot of people, and that read costs trust before the content arrives. So the prompt bans them and routes the agent toward commas, colons, and parentheses instead. (This article holds to the same rule. It felt only fair.) The same goes for "Great question!", reflexive bullet points, and the automatic closing "Let me know if there's anything else."

Honesty versus apology. Models love an apologetic preamble: "Sorry to bring this up, but…". It feels considerate. It is actually the writer managing their own image, flagging the content as offensive before the reader has formed an opinion. I teach the prompt to tell that apart from an honest hedge about what it can see. One is throat-clearing; the other carries real information.

Move	Example	Verdict
Apologetic preamble	"Sorry to pile on, but…"	Cut it. It manages the speaker's image.
Honest hedge	"From what I can see there's nothing here, though I may be missing it."	Keep it. It names a real limit.

Absence is not evidence. This one is quiet, and it matters more than any other rule. A model that looks and finds nothing will report "there is none" as if it were a fact about the world. But its view is always partial. The honest sentence names the limit: "I don't see one in what I have," not "there isn't one." Teaching an agent to say "I'm not finding it" instead of "it doesn't exist" is the line between one you can trust and one that misleads you with total confidence.

Bottom line up front. Borrowed from military writing: the first sentence carries the finding, not a frame. "Here's what I found…" is throat-clearing if the next sentence has to step in and deliver the headline. The test I hold the prompt to is blunt: if deleting the first sentence doesn't change the meaning, delete it.

Don't moralize. When the agent reflects a hard truth back to the user, about their spending, their habits, their progress, keep it away from labels that pass judgment ("reckless", "lazy", "undisciplined"). Give the facts and let the person draw their own conclusion. A label is a verdict the agent has no standing to issue, and the fastest way to make someone feel judged by software.

None of these survive as a vague instruction. Each one needs its good and bad versions written out, because the model learns the boundary from the contrast, not from the principle.

Frameworks worth stealing

The quickest way to give an agent range is to borrow from people who already studied how humans talk. Three earn their place:

Nonviolent Communication (Marshall Rosenberg). Separate observation from evaluation. Prefer I-statements to you-statements. Frame next steps as offers, not demands. Name the difficulty when a topic is hard. This one framework carries most of the load for any agent that has to deliver something uncomfortable. Borrow the moves; never say the name out loud.
BLUF (bottom line up front). Covered above, and the cheapest readability win on the list.
Calibrated confidence. From forecasting and intelligence analysis: certainty should match evidence. Strong claim, strong evidence; weak signal, hedged language. An agent that is occasionally and visibly unsure earns more trust than one that is always sure.

The meta-rule for all three: the framework shapes the behavior, never the output. The moment an agent says "applying nonviolent communication, I observe that…", the spell breaks.

When to stop

Iteration without a stopping rule is just grinding. I cap each pass at five rounds and watch for three exits:

All cases pass. Ship the prompt.
The same case fails three times on the same lever. This is the important one. If three tries at the instructions don't move a behavior, the problem probably isn't the prompt: it is a capability gap, a model limit, or a missing source. Stop editing words and name the real issue.
Five rounds elapsed. Report what still fails and decide on purpose, rather than drifting into a sixth.

The saved cases outlive any single session. Keep running them and you catch slow drift before a model update quietly rewrites the agent's personality.

What actually makes it human

After enough laps, the pattern is clear, and it is almost entirely about restraint.

The agents that connect aren't the ones with the most personality. They are the ones that know what not to say. They lead with the answer and stop when the answer is done. They admit what they cannot see. They leave the decision with the person instead of making it for them. They use plain words and short sentences, and they treat silence as a feature, not a gap to fill.

You don't prompt your way there in one shot. You get there by mapping the real situations, letting AI hold a hundred conversations you would never have the patience to run by hand, keeping the ones that matter so nothing slips back, and then doing the one thing AI still can't: reading those conversations as a person, and teaching the machine the parts of taste that won't fit in a rule. Two hours, on repeat, until the voice holds.