Last week I built and deployed an application where Claude Code wrote every line of the code. I did this deliberately, outside my core professional expertise, and with enough at stake to take the governance seriously. Working through it brought something into focus that I keep returning to in conversations with engineering leaders about AI-assisted delivery.

The term “vibe coding” is both accurate and misleading. Accurate, because there genuinely is something collaborative in a good session with an AI coding assistant, where you describe what you want, it responds, you refine, it adapts. Misleading, because it implies the process is inherently informal, that governance and rigour are somehow at odds with the approach.

They are not, but the work required to make AI-assisted development something you can actually rely on is frequently underestimated, and the mistakes organisations make when they skip it are predictable.

The project

I’d been keeping manual notes on exercise and nutrition and wanted something better: a web application with charts, trends, AI-generated insights, and a proper interface rather than a notebook. I built it using Claude Code, which wrote every line of the application and infrastructure code.

This was a deliberate exercise in working outside my comfort zone. My technical background is in platform architecture, infrastructure, and operational controls, not application development. I can read and reason about code, but the Node.js backend, the frontend, and the data handling were all layers I was, in a meaningful sense, taking on trust. I could not provide a confident, independent review of application code I don’t work with day-to-day.

That constraint shaped everything about how I approached the project. It also mirrors, more closely than you might expect, the position many organisations find themselves in as they start to adopt AI-assisted development seriously.

Knowing which layers you can verify

The first decision was the most important one: being honest about which layers I could actually scrutinise and which I couldn’t.

This is not a question unique to AI-assisted development. It is the same question any technical leader faces when deciding how much to rely on an external supplier, a managed service, or a library maintained by a third party. You cannot verify everything. The question is whether you have compensating controls for what you cannot verify, and whether the scope of what you’re relying on without verification is bounded and understood.

In my case, infrastructure was territory I could verify confidently: I have sufficient depth in public cloud networking, infrastructure as code, operational controls, and security-first design to assess whether things are correctly configured. CDN platforms like Cloudflare are also familiar territory. What I could not do was give the application code the same level of scrutiny.

Application Architecture: verified infrastructure and trusted application layers


So I compensated at the layers I could control, using Terraform to manage the infrastructure on AWS. The core decision was architectural: the EC2 instance never accepts inbound HTTP or HTTPS traffic from the internet. Instead, a Cloudflare Tunnel opens an outbound-only encrypted connection from the instance to Cloudflare’s edge, which handles TLS termination and applies an IP-based access policy before any traffic reaches my infrastructure. Inbound SSH is restricted to my home IP address. Data at rest is encrypted at the volume level.
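As a rough illustration of that perimeter, the security group and instance configuration can be sketched in Terraform. The resource names, variables, and instance details below are hypothetical, not the project’s actual code; the point is the shape of the rules: no inbound 80/443 at all, SSH from a single address, and outbound open so cloudflared can dial out to Cloudflare’s edge.

```hcl
# Illustrative sketch only — names and values are hypothetical.

variable "home_ip_cidr" {
  type        = string
  description = "Home IP in CIDR form, e.g. 203.0.113.10/32"
}

variable "ami_id" {
  type = string
}

resource "aws_security_group" "app" {
  name        = "app-instance"
  description = "No inbound HTTP/HTTPS; SSH from home IP only"

  # The only inbound rule: SSH, restricted to one address.
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.home_ip_cidr]
  }

  # Outbound open, so cloudflared on the instance can establish
  # the outbound-only encrypted tunnel to Cloudflare's edge.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "app" {
  ami                    = var.ami_id
  instance_type          = "t3.micro"
  vpc_security_group_ids = [aws_security_group.app.id]

  # Data at rest encrypted at the volume level.
  root_block_device {
    encrypted = true
  }
}
```

The Cloudflare-side pieces (the tunnel itself and the IP-based access policy) live in Cloudflare’s configuration rather than AWS, so they are omitted here.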

None of this made the application code trustworthy in isolation. But it meant the attack surface was deliberately constrained: an exploitable vulnerability in the Node.js backend would be considerably harder to reach than it would have been with a conventional inbound-exposed deployment. The scope of what I was relying on without independent verification was bounded, understood, and mitigated at the perimeter.

Given my background, I wouldn’t have made code publicly accessible that I hadn’t written myself, or that an experienced engineer hadn’t reviewed. The compensating controls allowed me to run this in a constrained environment with a risk posture I was comfortable with. That distinction matters: I was clear about what the controls were achieving and what they were not.

Cloud Infrastructure: Cloudflare Tunnel, AWS Security Group Rules, and Data Storage


The broader principle applies directly to organisations: when your engineering team cannot independently review the code AI is producing, you need other controls. The reason may be an unfamiliar language, libraries no one has depth in, or simply a pace of production that outstrips review capacity, but the implication is the same. Network boundaries, access restrictions, monitoring, and graceful failure modes all matter more when the code review has been less thorough than you would ideally want.

Spec-driven development, not just vibe coding

The second decision was about process: I chose to develop the specification before any code was written.

This matters more than it might sound, because the real risk in vibe coding is that the prompts become the plan. You describe something, the model produces it, you refine it, and gradually an application grows. But there is no stable reference point to tell you whether what you have is what you intended, whether the latest session has drifted from the previous one, or whether what you have built is actually correct.

I used Claude to help develop the specification itself, which meant the model was being used as a thinking partner rather than a code generator at that stage. Working through the application requirements, security considerations, and user experience expectations before a line of code was written gave me something to build against and, critically, something to review against.

I maintained a running document throughout the build, a living specification covering both application and infrastructure that evolved as the project did but retained the thread of what I had originally intended. Claude generated both, and this is where the distinction from the earlier section becomes concrete: I could work through the Terraform it produced and assess whether it was doing what I expected, in a way I simply could not do for the application code. When the build was complete, I reviewed everything against a brief covering security, UX, functionality, and infrastructure.

The mechanism for keeping that specification current was itself AI-driven. I configured a Claude Code agent whose job was to maintain the document: logging changes as they were made, flagging new issues or risks, and tracking what had been resolved. Commit discipline was part of the same brief, with the agent ensuring changes were committed at sensible points with accurate, descriptive messages. The practical effect was that any new session could open the document alongside the codebase and immediately understand what had been built, what was outstanding, and what needed attention first. Security issues were explicitly prioritised within that list.
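Claude Code lets you define a subagent as a markdown file with a short frontmatter header, and a maintenance brief along the lines described might look something like this sketch. The agent name, wording, and the SPEC.md filename are all illustrative, not the project’s actual configuration:

```markdown
---
name: spec-maintainer
description: Keeps the living specification and commit history in step with the codebase.
---

After each change to the codebase:
1. Update SPEC.md: log the change, flag any new issues or risks, and mark resolved items.
2. Keep security issues at the top of the outstanding list.
3. Commit at sensible checkpoints with accurate, descriptive messages.

At the start of a session, read SPEC.md alongside the codebase before proposing work.
```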

That review process is what made this something more than just vibe coding. There was an objective standard to assess against, not just a subjective sense of whether the result felt right. The specification was also the document I took into each new session, and the reason this mattered became clear as the project progressed: each new session brought a fresh assessment of the codebase, and without a stable reference, there was nothing to anchor it to what had been decided and built before.

The shifting assessment problem

Every time I started a new Claude session, its assessment of the codebase shifted. Not dramatically: the model did not suddenly claim the code was broken when it had been working, or reverse a previous judgement entirely. But its characterisation of what was present, what might need attention, and what the priorities were for a given piece of work varied noticeably from one session to the next.

This is not a flaw but an inherent property of how language models work, and it has a straightforward implication: model confidence is not a substitute for independent review.

The fact that a model tells you the code is well-structured, that it follows best practices, that there are no obvious security issues: none of that constitutes assurance. Assurance comes from an engineer with the relevant expertise reviewing the output against a defined standard, or from automated testing that verifies behaviour against a specification. It does not come from asking the tool that produced the code whether the code is good.

The stable specification I maintained gave me something to anchor each session to. Rather than asking Claude to assess the codebase from scratch, I could direct it against a defined set of requirements. This reduced, though did not eliminate, the drift between sessions.

For organisations, this problem is compounded by scale. If multiple engineers are working with an AI coding assistant and relying on the model’s self-assessment to gauge quality, you have a structural confidence problem: every piece of the codebase has been declared good by the tool that created it, and nobody has established independently whether that declaration is reliable.

Where organisations are getting this wrong

In customer conversations, I encounter two failure modes regularly.

The first is the absence of a specification. Teams start with AI-assisted development because of the speed, and the speed is genuine: you can produce working code significantly faster than with traditional approaches. But working code is not the same as correct, secure, maintainable code. Without a specification to review against, you cannot assess whether what you have is what you intended, and the speed advantage gradually erodes as the codebase becomes harder to reason about.

The second is misplaced confidence in the model itself. Teams observe that their AI coding assistant produces code that looks professional, is well-commented, handles edge cases, and passes basic testing. They treat that as assurance, but it is not. The quality of AI-generated code in isolation tells you relatively little about whether the overall system is trustworthy. Trustworthiness is a property of the system, the process, the review gates, and the operational controls, not of individual artefacts examined in isolation.

There is a third failure mode that is less visible but equally significant: the absence of accountability. Vibe coding is fast partly because it diffuses accountability in ways that are genuinely unclear. Who is responsible for code that an AI produced, that an engineer reviewed without full expertise, in response to a prompt that no one has documented?

This matters because accountability is not just a governance formality. It is the mechanism by which organisations learn from failures and improve their processes. When something goes wrong with AI-generated code in production (and at some point, it will), the question “who approved this?” needs a real answer. The prompt is not documentation, and the model’s confidence assessment is not a review record. Organisations that have not established clear accountability chains before AI-assisted delivery becomes widespread will find that gap acutely visible at exactly the wrong moment.

What this means in practice

Know your layers. Before using AI to build anything, be explicit about which parts of the output you can verify and which you cannot. This is a function of your team’s expertise, not the model’s capability. An AI coding assistant can produce excellent Python if your senior Python engineers review it. It cannot substitute for those engineers.

Compensate for what you cannot verify. If you are deploying something into production that includes layers you have not independently verified, you need controls that bound the risk. Network boundaries, access controls, monitoring, and graceful failure modes all become more important when the code review has been lighter than you would ideally want.

Write the specification first. This predates AI entirely, but it becomes more important with AI-assisted development, not less. The speed at which code can be generated creates real pressure to skip the thinking and go straight to building. That pressure should be resisted. A specification gives you something to build against, something to review against, and a stable reference that persists across sessions and across team members.

Treat model confidence as noise. The model will tell you the code is good. That is what models do. It is not meaningful signal unless you have independent verification to support it. Building review processes that treat model self-assessment as a starting point rather than an answer is one of the more important cultural shifts teams need to make.

Put experienced engineers in the review loop for what matters. For any code that is publicly accessible, handles sensitive data, or is customer-facing, an experienced engineer needs to review what AI has produced.

The question worth asking

The question I keep returning to is not “can AI write good code?” It can. The question is whether your governance framework would tell you if something had gone wrong, and most of the time, without deliberate effort, it would not.

The organisations getting this right are not necessarily the ones moving fastest. They are the ones that have been honest about what changes and what does not, have adapted their review and assurance processes accordingly, and have established clear accountability for what AI produces and how it gets verified.

Vibe coding can be a legitimate part of your development workflow, but the governance framework has to exist regardless.


Working through AI governance or delivery assurance? If you’re establishing governance frameworks for AI-assisted delivery, or navigating where to draw the line between “AI-assisted” and “safe to release”, I work through these questions with engineering and technology leadership teams regularly. Get in touch.