Three AI Agents Walk Into a Recording Studio

Here is a question that sounds simple until you actually try to answer it: how should a voice sound when it reads you an essay during your morning walk?

Not how should it sound technically — sample rate, bit depth, audio format. Those are solved problems. The question is about something harder to measure. Should the voice be warm or authoritative? Should it shift tone when the essay shifts from anecdote to argument? Should it speed up during exciting passages and slow down during reflective ones? Or should it stay steady, like a metronome, so you can tune back in after the dog pulls you toward a squirrel?

We didn't know. So we did what felt right for a company that converts essays into audio: we asked three AI agents to argue about it.

The Setup

Modern text-to-speech engines — the kind built on large language models — don't just read words aloud. They interpret direction the way an actor interprets a brief. You can tell the engine who the speaker is, where they're speaking, and how they should deliver each line. Three layers of natural language guidance, all prepended before the content itself. It's the most powerful control mechanism available in any TTS system, and until recently, we weren't using it at all.

Our code had a flag — `supportsInstructions: false` — with a TODO comment beside it. Every essay we converted was being read by a capable voice with zero direction. Like hiring a trained actor and handing them the script without a single note.

So we set up a structured debate. Three agents, three perspectives, five questions each. The same questions, answered from fundamentally different priorities.

The Performance Director

The first agent arrived with conviction. Every essay, it argued, is a one-person show. Not a document read aloud — a performance. And performances need direction.

Its core claim: direction should vary from chunk to chunk as the essay progresses. The opening needs to draw the listener in with quiet confidence. The middle should build. The key insight — wherever it lands, usually around the two-thirds mark — deserves deliberate weight. The closing should ease you out, slower cadence, warmth, a sense of arrival.

It proposed a five-act arc model borrowed from dramatic structure. Hook, context, development, pivot, resolution. Each act gets its own emotional register. Not dramatic swings — the agent was clear about that — but distinct enough that the essay's natural shape comes through in the voice.

On voice selection, the Performance Director wanted the system to cast intelligently based on content. A reflective personal essay gets a warm, intimate voice. A news briefing gets someone crisp and authoritative. A narrative piece gets a natural storyteller. Not random assignment — casting, the way a director would cast a play.

And the direction itself? Theatrical. Evocative. Not technical parameters like pitch percentages and words-per-minute targets, but emotional scenarios. The agent argued that the engine responds to metaphor better than to metrics. That telling a voice to "speak as if the morning itself is listening" produces better audio than telling it to "reduce rate by 15%."

Its closing line was memorable: "We are not building a text-to-speech converter. We are building a pocket theater company that performs a new show every time someone pastes a URL."

The Pragmatic Engineer

The second agent showed up with a spreadsheet and a healthy skepticism about everything the first agent said.

Per-chunk direction? Over-engineering that solves a problem nobody has proven exists. The math: each chunk of direction overhead costs tokens, and those tokens compound across thousands of conversions. More importantly, varying direction between chunks introduces tonal discontinuity — the very thing the walker will notice while half-paying attention to a fire hydrant.

Emotional arc management? Trust the text. A well-written essay already carries its own emotional arc in punctuation, word choice, and sentence structure. The TTS engine is a language model — it understands language. Telling it to sound sad when the text is sad is redundant at best and contradictory at worst.

Voice casting? Let the user pick. Voice preference is deeply personal and irrational. Someone might love a particular voice for everything because it reminds them of a favorite podcast host. Auto-switching based on content type breaks the relationship a daily listener builds with their chosen voice.

And direction detail? Minimal. One short paragraph. A warm persona description, a calm scene, a note about pacing. Roughly six hundred bytes total — enough to establish identity, not enough to over-constrain. Pre-written templates, not generated on the fly. Predictable, testable, and free at runtime.

The Pragmatic Engineer's summary was a table with one row that mattered: "Total additional cost of the pragmatic approach: effectively zero."

The Listener Advocate

The third agent didn't open with architecture or cost analysis. It opened with a scene.

It's 7:15 on a Tuesday morning. You're three minutes into a walk with your dog. Your earbuds are in. You just tapped play on a twenty-two-minute essay about the history of public libraries. The air is cool. The neighborhood is waking up.

This agent's entire argument was built on one observation: the listener's body is a metronome. When someone walks at a steady pace, their footfalls create a rhythm — roughly 100 to 120 beats per minute. Their breathing settles. Their heart rate finds a groove. The parasympathetic nervous system activates. This is the state where essays land deepest, where ideas feel less like information and more like thoughts you're having yourself. And the voice we deliver has to harmonize with that rhythm, not fight it.

The Listener Advocate's position on emotional arcs was the most nuanced of the three. Not flat — that would be fatiguing. But gentle gradient. Slow, broad shifts that mirror how a thoughtful person's mood evolves over the course of a walk. When the essay demands emotional variation, those shifts should be legato, not staccato. A gradual warming, not a sudden heat. The body needs time to follow, and the voice should respect that.

On direction detail, this agent introduced a concept it called the "diminishing returns cliff." Listeners reliably distinguish between bad audio and good audio. But they cannot distinguish between good audio and meticulously produced audio. The jump from no direction to a solid audio profile is enormous. The jump from a solid profile to an exhaustive one? The listener can't hear it. They're dodging a puddle. They're waving at a neighbor. They don't have the attentional bandwidth.

Its north star metric was a question: "Did the listener forget the app exists?"

If the listener is absorbed in the essay's ideas — not noticing the voice, not thinking about the technology, just walking and learning — the direction has succeeded. If they're aware of the voice at all, something went wrong.

Where They Agreed

Despite arguing from different planets, the three agents converged on more than you'd expect.

All three agreed that the voice needs a strong, consistent identity — an audio profile that establishes who is speaking and doesn't change mid-essay. The casting stays constant. That's the foundation.

All three agreed that natural language direction works better than technical parameters. Emotional scenarios over pitch percentages. Character archetypes over abstract adjectives. The engine is a language model. Talk to it like one.

All three agreed that the walker context constrains everything. Compressed dynamic range — no whispers, no shouts. Clear articulation that cuts through ambient noise. Warm companionship over authority. Natural pauses that serve as re-entry points for a listener whose attention drifts and returns.

And all three agreed, perhaps most importantly, that the text itself does most of the work. A well-written essay carries its own emotional arc. The direction system's job is to set the stage, not rewrite the script.

Where They Disagreed

The real tension was about how much to direct.

The Performance Director wanted per-chunk emotional arcs, intelligent voice casting, and theatrical direction language that fills the engine's full capacity. Treat every essay as opening night.

The Pragmatic Engineer wanted one template per content type, user-selected voices, and six hundred bytes of direction that never change. Ship it, measure retention, iterate.

The Listener Advocate landed between them — a rich per-article profile with lightweight per-chunk notes that track content shifts so subtly the listener never perceives the seams.

What We Built

We built a direction system that leans pragmatic but thinks like the Listener Advocate. Here's the resolution:

Per-article direction, not per-chunk. The audio profile and scene description are constant across every chunk. The voice's identity doesn't shift. This is the casting, and it's locked. We may add subtle per-chunk variation later, but only after we can A/B test it against real listening behavior. The Performance Director's five-act arc is elegant, but the Pragmatic Engineer is right that unproven complexity is the enemy of shipping.

Templates, not generation. Five pre-written direction templates — essay, news, story, technical, podcast — each around six hundred bytes. No additional API calls. No dynamic generation. A dictionary lookup from content type to direction. The templates use the emotional, narrative language the Performance Director advocated for, but they're fixed and testable, as the Pragmatic Engineer demanded.

The walker constraint is baked into everything. Every template describes a walk. Every scene places the listener outdoors. Every direction note specifies steady pacing, compressed dynamic range, and warmth. The Listener Advocate's insight that the body is a metronome informs every word of every template.

Voice selection stays with the user. Smart defaults by content type exist in our recommendation system, but we don't auto-switch. The Pragmatic Engineer's argument about voice consistency and personal preference won this one. When someone builds a daily listening habit, they build a relationship with a voice. We don't break that.

The Invisible Art

There's a phrase in audio engineering: if you notice the mix, the mix failed. The best sound design is the kind that makes you feel something without making you aware of how you're feeling it. Film scores work this way. Good podcast production works this way. And text-to-speech direction, done well, should work this way too.

The three agents debated architecture and cost and emotional arcs for thousands of words. But they all pointed at the same destination: audio that disappears. A voice that settles in beside you like a walking companion who knows when to speak and when to let the morning do the talking. Technology so invisible that the content flows from author to mind with nothing in between.

You clip the leash. You tap play. You walk. And somewhere between the second block and the third, you forget you're listening to an app at all. You're just thinking about public libraries, and the morning is cool, and the dog is happy, and the ideas are good.

That's what we're building toward. Three agents argued about how to get there. The answer, as it often is, turned out to be simple: direct less, listen more, and respect the walk.

Listen to this article