Marching to Nines with a Robot
To go big with AI software development, we may need to think really small?

For those of you reading along with my AI-ish Substack, thank you. If you’ve been here for a while, you’ll know that one of the major themes is experimentation. Over time, experiments big and small have come and gone, sometimes leaking into a little sci-fi around the edges. The common thread is AI—mostly generative AI.
The current “big experiment” is this: can an (agentic) LLM develop a logic-programming engine that can be applied to an interactive fiction game? A key sub-goal is to document the journey and my thoughts along the way, particularly from a software engineering perspective.
At many tens of thousands of lines of code, this experiment is now pretty close to “feature complete” for what I need in its 1.0 game debut. The implementation quality, however, is still very much demo-ware. It passes thousands of unit tests that track its logic-programming abilities, but internally it has all the familiar complexities of a large software project, and then some. Most of my time with the logic engine is now spent “marching it to nines” and learning from the process:
I’ve seen this constantly in AI native software. Getting an LLM feature to work impressively in a demo takes a few days. Getting it to work reliably enough that you’d actually ship it to users takes months. Same amount of work for each nine.
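For readers unfamiliar with the shorthand: “nines” counts the leading nines in a reliability figure, so 90% is one nine, 99% is two, 99.9% is three. A quick sketch of the arithmetic (the function name is mine, just for illustration):

```python
import math

def nines(reliability: float) -> float:
    """Count the 'nines' in a reliability figure: 0.9 -> 1, 0.99 -> 2, 0.999 -> 3."""
    return -math.log10(1.0 - reliability)

# Each additional nine cuts the remaining failure rate by 10x, so the step
# from 99% to 99.9% removes as much risk, proportionally, as the step from
# 90% to 99% -- which is why each nine tends to cost about as much work.
for r in (0.9, 0.99, 0.999):
    print(f"{r:.1%} reliable = {nines(r):.0f} nine(s), {1 - r:.1%} failure rate")
```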
This next part is beyond the scope of the current experiment, but some recent work out of MIT really resonated with me. My experience with this LLM development project—my first large-scale software effort with agentic AI—has surfaced a number of lessons I’m still unpacking. One big one: we may need a deeper kind of modularization than what traditional software engineering usually imagines.
My experience suggests that generative AI performs brilliantly when it can operate on swarms of small, self-contained chunks of code or functionality. It’s sublime at tracking and manipulating local details, but much worse at holding a huge, tangled context in its “head” at once. Breaking systems into smaller pieces helps—but that’s only half the problem. You then need a way to organize those pieces and orchestrate how they collaborate. I think the latter is the hard part.
The MIT work describes this as tackling “feature fragmentation,” a central obstacle to software reliability:
“The way we build software today, the functionality is not localized. You want to understand how ‘sharing’ works, but you have to hunt for it in three or four different places, and when you find it, the connections are buried in low-level code…
Think of concepts as modules that are completely clean and independent. Synchronizations then act like contracts — they say exactly how concepts are supposed to interact. That’s powerful because it makes the system both easier for humans to understand and easier for tools like LLMs to generate correctly…
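To make that concrete, here is a toy sketch of the concept/synchronization idea in Python. Everything here — the class names, the event tuple, the glue function — is my own invention for illustration, not the MIT team’s actual notation or API:

```python
class Comments:
    """A self-contained 'concept': all commenting behavior lives here."""
    def __init__(self):
        self.by_post = {}

    def add(self, post_id, text):
        self.by_post.setdefault(post_id, []).append(text)
        return ("comment_added", post_id)


class Notifications:
    """Another concept: knows only about delivering notifications."""
    def __init__(self):
        self.sent = []

    def notify(self, user, message):
        self.sent.append((user, message))


# A 'synchronization' is the explicit contract wiring concepts together.
# Neither concept imports or mentions the other; only this glue code
# knows they interact, so the interaction is localized and inspectable.
def on_comment_added(comments, notifications, post_author, post_id, text):
    event = comments.add(post_id, text)
    if event[0] == "comment_added":
        notifications.notify(post_author, f"New comment on post {post_id}")


comments, notifications = Comments(), Notifications()
on_comment_added(comments, notifications, "alice", 42, "Nice post!")
print(notifications.sent)
```

The appeal for LLM-driven development is that each concept is exactly the kind of small, self-contained chunk the model handles well, while the synchronizations keep the cross-cutting behavior in one findable place instead of scattered across the codebase.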
A couple of links for your thoughts:
Sounds like a once-and-future experiment: massively rearchitecting the logic engine along these lines.
How would that go?

