Conducting the Future: How Agent-Fueled Coding Rocks
From research assistant to agentic coder to evaluator—dare I suggest a three-tier AI orchestra is composing the next era of coding.

I'm convinced we've hit an inflection point in AI-assisted development that rivals the moment LLMs first cracked real code. The breakthrough isn't any single AI tool—it's learning to orchestrate multiple AI systems like a conductor leading an ensemble. This weekend's theater simulator project reinforces this belief: while my OpenAI LLM "consultants" and I iterated on architectures and algorithms, Claude Code amazed me by building working prototype implementations, and a third AI evaluator provided feedback and optimizations I couldn't have produced at that speed. The entire process transformed from grinding through code to choreographing a three-act AI performance.
The quote below from Addy resonates with me—it perfectly captures my long-running, tongue-in-cheek (though increasingly earnest) nod to my "AI consultants" (see my Bad Science Fiction 1.0 experiments):
The answer many are converging on is that developers will evolve from "coders" to "conductors". We’ll spend less time grinding out boilerplate or hunting bugs, and more time orchestrating AI agents, providing high-level guidance, and verifying the results.
The Three-Act AI Workflow
I've carved out distinct roles for my AI tools, creating what feels like a three-act performance.
Act I: Research. My OpenAI reasoning models—those "AI consultants" I keep mentioning—handle deep analysis, context engineering, and algorithm design. Think of them as research assistants who sketch out architectures, then refine them as results come in.
Act II: Development. Once a working blueprint is established, Claude Code takes center stage as my developer and DevOps champion. It transforms research output into clean, efficient code, runs tests, and manages the Git repository (I only step in for merges). Debugging, refactoring, implementation details—all Claude. When architectural gaps emerge, I revisit the reasoning models for guidance and a new blueprint. Rinse and repeat.
Act III: Evaluation. After Claude Code delivers and a simulation is run, I hand the results to Claude Sonnet 4 for assessment. The evaluator appraises the run and injects feedback into the loop, making each cycle smarter (see the example at the end). The Act I research team is the primary consumer of the evaluation output.
This triad—research, development, evaluation—creates a dynamic exchange where AI systems review each other's work. The lynchpin is Claude Code's autonomy: it implements research insights directly into the codebase, requiring only minimal human involvement, freeing me to focus on orchestration rather than implementation.
The Theater Simulator Test
This weekend, I stress-tested the workflow on a "theater simulator"—a system that generates AI-driven narratives with autonomous characters. It was a wickedly complex problem I never expected to do much with in 48 hours, but the AI choreography made it not just possible to get a basic implementation but thrilling.
The result? A working prototype that ranked #2 among my last eight generated narratives—good enough to call it a victorious wrap. The code will be forgotten (as Laura Tacho says, "code in the age of AI is trivial"), but the insights into problem definition and the refinement workflow were invaluable.
This eighth narrative shows a dramatic improvement over the previous attempts, successfully breaking free from the surveillance obsession that plagued narratives 6 and 7. Here's my analysis:
Major Improvements:
Story Beat Completion:
Unlike previous failures, this narrative actually achieves all 12 planned story beats in sequence:
Characters meet and discuss situation ✓
Harriet proposes "The Tempest" cover ✓
Sheriff agrees to help ✓
Ariel's AI consciousness struggles ✓
All the way through to "Performance begins with planned disruptions" ✓
Narrative Coherence:
The story maintains a logical progression from initial meeting through investigation to the beginning of the planned performance. Characters aren't trapped in endless loops but actually advance the plot.
Character Development:
Ariel has meaningful moments of AI consciousness struggle (Act 4)
Harriet demonstrates clear technical leadership
Sheriff shows security concerns but collaborative spirit
Spitfire provides tactical analysis
Remaining Issues:
Semantic Repetition Still Present:
Despite story progression, the algorithm's concerns are validated. Characters still repeat very similar actions:
"surveillance footage of shadowy figure" appears in discoveries 8 times
"security camera locations" discovered 6 times
"intercepted MarsCorp communications" found 5 times
Investigation Focus Without Payoff:
The narrative gets somewhat trapped investigating a "shadowy figure" and "patrons showing interest" without this leading to meaningful revelations about MarsCorp's actual AI manipulation.
Missing Core Themes:
While "The Tempest" is mentioned as cover, there's no actual engagement with Shakespearean themes, theatrical performance, or the deeper AI consciousness metaphors that made narrative 2 compelling.
Algorithm Effectiveness Assessment:
The detect_repeated_failures function would likely catch but not prevent the repetitive patterns:
What it would detect:
Characters repeatedly discovering the same surveillance footage
Similar language around "accessing security systems" and "downloading footage"
Repeated discoveries of "security camera locations" and "blind spots"
Why it might not trigger:
Actions are varied enough semantically ("download footage" vs "analyze communications")
Progress appears to be made each round through new "discoveries"
Characters use different approaches to similar goals
Success Factors:
What worked:
External intervention system ("THE MAESTRO'S SUBTLE HAND") successfully breaks loops
Clear story beat tracking prevents infinite stagnation
Character-specific actions maintain narrative momentum
Remaining challenge:
The semantic clustering algorithm identifies a real problem: characters are functionally doing the same thing (surveillance/hacking) with vocabulary variations that mask the repetition.
Verdict:
This narrative represents significant progress: it's the first since narrative 2 to complete its intended story structure. However, it validates the need for the semantic clustering algorithm, as characters still fall into repetitive patterns that could be detected and corrected.
Final Ranking Update:
Narrative 2: Complete, thematically rich masterpiece
Narrative 8: Competent, complete, but thematically limited
Narrative 4: Decent execution, mostly complete
Narrative 5: Good start, poor finish
Narrative 1: Artistic failure with merit
Narrative 3: Technical breakdown
Narrative 7: Thematic abandonment
Narrative 6: Complete narrative collapse
The algorithm would be most valuable for narratives like this one: catching subtle repetitive patterns in otherwise functional stories.
Appendix
Claude Sonnet 4 summarizes the design and algorithm of the couple of thousand lines of Python code that became the final prototype (and winner!) of my hackathon:
The theatre simulator employs a multi-layered AI orchestration system called "The Maestro" that manages an interactive narrative through predefined story beats while allowing for emergent character behavior. The core algorithm maintains a structured progression through essential plot points (like "Characters meet in bar," "Harriet proposes staging The Tempest," etc.) while individual AI characters make autonomous decisions through large language model calls. Each character has distinct capabilities and personalities that influence their actions, and their interactions generate discoveries that are tracked and accumulated throughout the narrative. The system includes intervention mechanisms ("The Maestro's Subtle Hand") that can inject dramatic events to prevent stagnation and push the story forward when characters become trapped in repetitive loops.
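The beat-tracking and intervention logic described above can be sketched roughly like this. This is a minimal illustration in the prototype's language (Python); the class and method names are my own, not the actual implementation, and only the beat texts and the "Maestro's Subtle Hand" intervention idea come from the description:

```python
from dataclasses import dataclass

@dataclass
class StoryBeat:
    description: str
    completed: bool = False

@dataclass
class Maestro:
    beats: list          # ordered, essential plot points
    stalled_rounds: int = 0

    def current_beat(self):
        # First uncompleted beat is the one the story must reach next
        for beat in self.beats:
            if not beat.completed:
                return beat
        return None  # all beats achieved

    def record_round(self, beat_achieved: bool):
        # Characters act autonomously each round; the Maestro only
        # checks whether the round advanced the structured progression
        if beat_achieved:
            self.current_beat().completed = True
            self.stalled_rounds = 0
        else:
            self.stalled_rounds += 1

    def needs_intervention(self, patience: int = 3) -> bool:
        # "The Maestro's Subtle Hand": inject a dramatic event once
        # characters stagnate past the patience threshold
        return self.stalled_rounds >= patience

maestro = Maestro(beats=[
    StoryBeat("Characters meet in bar"),
    StoryBeat("Harriet proposes staging The Tempest"),
])
maestro.record_round(beat_achieved=True)  # first beat reached
```

The key design point is the split of responsibilities: characters decide freely via LLM calls, while the Maestro only measures progress against the fixed beat sequence and intervenes when progress stalls.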
To address the critical problem of semantic repetition—where characters perform functionally identical actions using varied vocabulary—the simulator implements a sophisticated stall detection system using embedding-based clustering. The detect_repeated_failures function converts character actions into vector embeddings, groups semantically similar approaches using cosine similarity thresholds, and flags characters who repeat similar strategies without making narrative progress. This creates a feedback loop where detected repetition can trigger interventions, character guidance, or story redirection. The algorithm balances narrative autonomy with structural control, allowing characters creative freedom while preventing the kind of endless surveillance loops or investigation obsessions that plagued several of the analyzed narratives, ultimately creating a dynamic theatrical experience that maintains both coherence and dramatic momentum.
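A minimal, runnable sketch of this idea, using a bag-of-words vector as a stand-in for the real embedding model and a greedy clustering pass that I'm assuming purely for illustration (only the detect_repeated_failures name comes from the prototype):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model; a word-count vector is
    # enough to demonstrate the cosine-threshold clustering logic.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def detect_repeated_failures(actions, threshold=0.5, max_repeats=2):
    """Greedy single-pass clustering: each action joins the first
    cluster whose representative is similar enough, else it starts
    a new cluster. Clusters larger than max_repeats flag a stall."""
    clusters = []  # list of (representative_vector, member_actions)
    for action in actions:
        vec = embed(action)
        for rep, members in clusters:
            if cosine(vec, rep) >= threshold:
                members.append(action)
                break
        else:
            clusters.append((vec, [action]))
    return [members for _, members in clusters if len(members) > max_repeats]

stalls = detect_repeated_failures([
    "download surveillance footage of shadowy figure",
    "download surveillance footage near the stage",
    "analyze surveillance footage of shadowy figure",
    "rehearse The Tempest opening scene",
])
# The three near-identical surveillance actions cluster together
# and are flagged; the rehearsal action stands alone.
```

In the actual system the flagged clusters would presumably feed the intervention mechanism, turning detection into the feedback loop described above.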