AI-First Innovation: How Novemberkiloecho Builds Software That Thinks With You

May 23, 2026 · Novemberkiloecho

"AI-first" is one of those phrases that has been hollowed out by repetition. Vendors slap it on landing pages. Decks promise 10x. Meanwhile the engineers actually shipping production code with agents are quietly converging on a much narrower, much more demanding definition.

This post — the anchor for a 14-day series on agentic coding — is our attempt to put that definition on paper. What does AI-first mean inside a Novemberkiloecho engagement? What changes for our clients? What practices do we insist on, and what do we refuse to do?

No hype. Just the working method.

What "AI-first" actually means here

When we say AI-first, we do not mean "we use Copilot." We mean that coding agents are present from the first discovery conversation through to the last production deployment, and the artifacts we produce — specs, tests, review checklists, deployment scripts — are shaped to be legible to both humans and models.

This is a change in method, not a tooling upgrade. An AI-native SDLC puts AI in every phase, from planning to deployment, and shifts emphasis toward requirements gathering, architecture design, and continuous validation, so less time goes to implementation. That last point matters. The bottleneck moves. When code generation is no longer the slow step, the work that was always undervalued (clear requirements, sharp interfaces, real test coverage) becomes the work.

It also means treating context as a first-class engineering artifact. Context engineering builds on earlier prompt engineering techniques but broadens the scope from individual prompts to the systematic design of all information supplied to a tool, and practitioner accounts show that structuring project-specific context improves agent effectiveness in complex codebases. An AGENTS.md file, a curated retrieval index, an architectural decision log readable by the agent — these are not nice-to-haves. They are the substrate.

How we embed AI from discovery through delivery

We do not bolt AI on. We start with it.

Discovery. Agents help us interrogate the existing codebase before the kickoff call ends — running structural analyses, surfacing dead code, mapping call graphs. We come to week two with questions a traditional consultancy would still be writing slide decks to ask.
Specification. Specs are written to be dual-audience: clear for the human staff engineer reviewing them, and structured enough that an agent can decompose them into a task list without hallucinating scope.
Implementation. Agents draft, humans direct. We break work into small, incremental tasks, a few lines to a few dozen lines of code each, that fit within the AI's attention span and can be reviewed easily. This is non-negotiable: the 2025 DORA report found that working in small batches is still crucial even with AI, and that AI tended to increase PR sizes by 154% when unmanaged.
Delivery. Eval harnesses, CI gates, and human review queues are wired up before the first feature lands, not after.

The throughline: AI is in every phase, but its role narrows as stakes rise. By the time we are merging to main, the agent's job is to suggest and explain; the human's job is to decide.
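
The small-batch rule from the implementation step lends itself to an automated check. What follows is a minimal sketch, not a prescribed tool: it assumes the change is summarized with `git diff --numstat`, and the 60-line limit, the `FileChange` type, and the function names are illustrative assumptions chosen to match "a few dozen lines."

```python
from dataclasses import dataclass

# Assumed threshold for illustration; tune it per repository.
MAX_CHANGED_LINES = 60


@dataclass(frozen=True)
class FileChange:
    path: str
    added: int
    deleted: int


def parse_numstat(numstat: str) -> list[FileChange]:
    """Parse `git diff --numstat` output: added<TAB>deleted<TAB>path per line."""
    changes = []
    for line in numstat.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        # Git reports binary files as "-"; count them as zero changed lines.
        changes.append(
            FileChange(
                path=path,
                added=0 if added == "-" else int(added),
                deleted=0 if deleted == "-" else int(deleted),
            )
        )
    return changes


def check_batch_size(numstat: str, limit: int = MAX_CHANGED_LINES) -> tuple[bool, int]:
    """Return (within_limit, total_changed_lines) for one proposed change."""
    total = sum(c.added + c.deleted for c in parse_numstat(numstat))
    return total <= limit, total


if __name__ == "__main__":
    sample = "12\t3\tsrc/billing.py\n40\t10\ttests/test_billing.py\n"
    within_limit, total = check_batch_size(sample)
    print(within_limit, total)  # False 65
```

A check like this would normally fail the CI job and ask the agent to split the change, rather than silently truncating it.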

What changes for clients

If you've worked with a traditional consultancy, the cadence of an AI-first engagement will feel different. Some of those differences are uncomfortable. We think they're the right discomforts.

Faster iteration loops

With agents in the working session, feedback loops tighten and learning speeds up: tasks that once required weeks of cross-team coordination can become focused working sessions. In practice, a question that would have been "we'll get back to you next Tuesday" becomes "let's spike it now and look at the diff together in twenty minutes."

Different artifacts

You will get fewer 40-page Word documents and more living artifacts: an AGENTS.md checked into your repo, an eval suite that runs on every PR, a retrieval index over your domain docs. Repository-level context files serve as persistent configuration mechanisms that encode architectural constraints, build commands, and workflow conventions. These are the deliverables. The PDF is a byproduct.

Different review cadence

Review is more frequent and more granular. Instead of one big design review at the end of a sprint, expect daily review touchpoints scoped to single agent-authored changes. This is not micromanagement; it is the only way to keep the agent honest at scale.

A different conversation about scope

Engineers describe developing intuitions for AI delegation over time: they tend to delegate tasks that are easily verifiable or low-stakes, and the more conceptually difficult or design-dependent a task, the more likely they are to keep it for themselves or work through it collaboratively with AI. We will be explicit with you about which work the agent should drive, which work a senior engineer should drive, and why. That conversation replaces a lot of the old "resourcing" discussion.
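
To make the delegation conversation concrete, the heuristic above can be written down as a small routing function. This is only a sketch under stated assumptions: the three boolean inputs, the `Driver` enum, and the exact decision order are hypothetical simplifications of what is really a judgment call.

```python
from enum import Enum


class Driver(Enum):
    AGENT = "agent drives"
    COLLABORATIVE = "engineer and agent work through it together"
    ENGINEER = "senior engineer drives"


def choose_driver(easily_verifiable: bool, low_stakes: bool, design_dependent: bool) -> Driver:
    """Route a task using the verifiable / low-stakes / design-dependent heuristic."""
    delegable = easily_verifiable or low_stakes
    if design_dependent:
        # Design-heavy work stays with a person; the agent assists only
        # when mistakes would be cheap to catch or cheap to make.
        return Driver.COLLABORATIVE if delegable else Driver.ENGINEER
    if delegable:
        return Driver.AGENT
    # Hard to verify and high stakes, even without design questions.
    return Driver.COLLABORATIVE


if __name__ == "__main__":
    # A rename covered by tests: verifiable and not design-dependent.
    print(choose_driver(easily_verifiable=True, low_stakes=False, design_dependent=False).value)
```

The value of writing it down is less the function itself than the fact that each branch can be argued about with the client.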

The practical best practices

Four practices form the spine of every engagement. The next two weeks of posts will unpack them, one topic per day, with concrete examples. Here is the short version.

1. Context engineering

Treat the agent's working context as a designed system. That means a curated AGENTS.md, project-specific style guides, an architectural decision record the agent can read, and retrieval over your actual domain — not just the public web. Teams invest heavily in providing the AI with the right context: up-to-date internal documentation, architectural guidelines, coding standards, and even organization-specific knowledge bases — the agent is not coding in a vacuum.
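
Treating context as a designed system implies deciding what goes in first when space is limited. Here is a minimal sketch of that idea, assuming each source carries a hand-assigned priority and the budget is counted in characters; `ContextSource`, `build_context`, and the ADR name in the demo are hypothetical, and a real setup would more likely budget in tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextSource:
    name: str      # e.g. "AGENTS.md" or an ADR filename
    priority: int  # lower number = more important, considered first
    text: str


def build_context(sources: list[ContextSource], char_budget: int) -> str:
    """Concatenate sources in priority order without exceeding char_budget.

    A source that does not fit in full is skipped rather than truncated,
    so the agent never sees half of a rule.
    """
    sections = []
    used = 0
    for src in sorted(sources, key=lambda s: s.priority):
        section = f"## {src.name}\n{src.text.strip()}\n\n"
        if used + len(section) > char_budget:
            continue
        sections.append(section)
        used += len(section)
    return "".join(sections)


if __name__ == "__main__":
    sources = [
        ContextSource("ADR-007 (hypothetical)", 2, "Use Postgres for all new services."),
        ContextSource("AGENTS.md", 1, "Run `make test` before proposing a change."),
    ]
    print(build_context(sources, char_budget=500))
```

Skipping instead of truncating is a deliberate choice in this sketch: a partially included style guide can be worse than none.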

2. Evaluation harnesses

Before we let an agent touch a subsystem, we build the eval. Unit tests, integration tests, golden datasets, and LLM-as-judge scorers where appropriate are wired into CI so every agent-authored change is measured. LLM-as-judge scoring scales evaluation coverage through rubrics but cannot replace human judgment on domain-specific correctness; a stack combining automated checks, LLM scoring, human-in-the-loop (HITL) review, monitoring, and feedback loops delivers the strongest reliability. This is the part most teams skip and then regret.
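
The deterministic layer of such a harness can be very small. The sketch below covers only the golden-dataset check with a pass-rate gate, not LLM-as-judge scoring or monitoring; `GoldenCase`, `run_eval`, the `slugify` candidate, and the gate value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    name: str
    input_text: str
    expected: str


def run_eval(candidate: Callable[[str], str], cases: list[GoldenCase], min_pass_rate: float) -> dict:
    """Run candidate over every golden case and report whether the gate passes."""
    failures = []
    for case in cases:
        try:
            actual = candidate(case.input_text)
        except Exception as exc:  # a crash counts as a failed case
            failures.append((case.name, f"raised {type(exc).__name__}"))
            continue
        if actual != case.expected:
            failures.append((case.name, f"expected {case.expected!r}, got {actual!r}"))
    pass_rate = (len(cases) - len(failures)) / len(cases) if cases else 0.0
    return {
        "pass_rate": pass_rate,
        "passed_gate": pass_rate >= min_pass_rate,
        "failures": failures,
    }


# Hypothetical agent-authored function under evaluation.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())


CASES = [
    GoldenCase("simple", "Hello World", "hello-world"),
    GoldenCase("extra spaces", "  AI  First ", "ai-first"),
    GoldenCase("punctuation", "Ship it!", "ship-it"),
]

if __name__ == "__main__":
    report = run_eval(slugify, CASES, min_pass_rate=1.0)
    print(report["passed_gate"], report["failures"])
```

In this demo the punctuation case fails, so the gate blocks the change; the failure message is what a reviewer or the agent would act on next.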

3. Tight feedback loops

Small diffs. Fast tests. Immediate review. The discipline that good engineering teams already aspire to becomes existential when an agent is generating ten times the volume. Agentic engineering often requires that the organization have strong engineering practices already in place — continuous integration, high test coverage, linting, security scans — and DORA's 2025 research found that AI acts as an amplifier where good practices yield even better results with AI, while poor practices just get amplified into bigger problems.

4. Human-in-the-loop review gates

Every change passes a human gate before merge. The gate is not theater. LLM code reviews can help suggest improvements and assess correctness, but their outputs can be faulty; a human-in-the-loop review process mitigates that risk and promotes knowledge sharing along the way. Recent benchmarks are sobering: evaluated against a validated LLM-as-judge framework, 8 frontier models detected only 15–31% of human-flagged issues on a diff-only configuration, which suggests that AI code review remains far below human expert performance despite strong results on code generation benchmarks. The agent reviews. The human decides.
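
"The agent reviews, the human decides" can be stated as a merge policy. This is a minimal sketch assuming review records are already available as simple objects; the `Review` type, the `can_merge` function, and the reviewer names are hypothetical, and in practice the same rule would usually be enforced through the code host's branch protection settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    reviewer: str
    is_human: bool
    approved: bool


def can_merge(author: str, reviews: list[Review]) -> bool:
    """Allow a merge only when a human other than the author has approved.

    Agent reviews are advisory: they are kept for the record but never
    satisfy the gate on their own.
    """
    return any(
        r.is_human and r.approved and r.reviewer != author
        for r in reviews
    )


if __name__ == "__main__":
    reviews = [Review("review-bot", is_human=False, approved=True)]
    print(can_merge("agent-1", reviews))  # False
    reviews.append(Review("dana", is_human=True, approved=True))
    print(can_merge("agent-1", reviews))  # True
```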

What we don't do

This list is as important as everything above.

No autonomous merging. Agents do not push to main. Period. They open PRs that humans approve.
No unreviewed shipping. Every artifact — code, infra change, migration, prompt — passes a named human reviewer before it touches production.
No "the AI will figure it out." If a spec is ambiguous, we fix the spec. We do not hope the model guesses well.
No replacing senior judgment. The human developer is not reduced to a passive prompter; the developer remains in full control of the design direction, using the model as a context-rich assistant rather than a disposable code generator.
No silent autonomy. If an agent took an action, the trace is reviewable. Every step is logged.
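
The last rule, a reviewable trace for every agent action, is simple to sketch. The example below assumes an in-memory, append-only log serialized as one JSON object per line; `AgentAction`, `ActionTrace`, and the action names are illustrative assumptions, and a real deployment would write to durable storage.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AgentAction:
    agent: str
    action: str
    target: str
    timestamp: str


class ActionTrace:
    """Append-only record of agent steps, stored as JSON lines."""

    def __init__(self) -> None:
        self._lines: list[str] = []

    def record(self, agent: str, action: str, target: str,
               now: Optional[datetime] = None) -> AgentAction:
        ts = (now or datetime.now(timezone.utc)).isoformat()
        entry = AgentAction(agent, action, target, ts)
        self._lines.append(json.dumps(asdict(entry), sort_keys=True))
        return entry

    def for_target(self, target: str) -> list[AgentAction]:
        """Return every recorded step that touched the given target, in order."""
        entries = (json.loads(line) for line in self._lines)
        return [AgentAction(**e) for e in entries if e["target"] == target]

    def dump(self) -> str:
        return "\n".join(self._lines)


if __name__ == "__main__":
    trace = ActionTrace()
    trace.record("agent-1", "edit_file", "src/billing.py")
    trace.record("agent-1", "run_tests", "tests/")
    trace.record("agent-1", "open_pr", "src/billing.py")
    print([a.action for a in trace.for_target("src/billing.py")])
```

A reviewer asking "what did the agent do to this file?" gets an ordered answer from `for_target`, which is the property the rule is after.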

The industry is, generally, learning these lessons the hard way. Enterprises are being cautious about rolling out agentic systems and are focused on structured implementations of agents rather than open experimentation; they are not letting every developer freely use AI agents and are instead forming centralized AI enablement teams. Our position is that structure is not a constraint on speed — it is the precondition for it.

Key Takeaways

AI-first is a discipline, not a toolchain. It changes what artifacts you produce, how often you review, and where the bottleneck sits.
Context is engineering work. AGENTS.md, retrieval indexes, and ADRs are the substrate that makes agents useful on real codebases.
Eval harnesses come before features. If you cannot measure agent output, you cannot trust it.
Small batches still win. AI amplifies whatever practices you already have — good or bad.
Humans hold the gates. No autonomous merging, no unreviewed shipping, no abdication of judgment.

Over the next two weeks, we'll publish one post per day, each unpacking a single practice from this anchor — concrete examples, the failure modes we've seen, and the exact templates we use in client work. Tomorrow: how we structure an AGENTS.md for a brownfield codebase, and what we deliberately leave out.