Vibe coding ships 81% faster, but 45% of its code is insecure


A developer built an entire SaaS product in a weekend. No framework knowledge, no security training, no code review. The app worked beautifully. It also exposed every user's API keys to the open internet.

This is the paradox at the heart of vibe coding in 2026: the fastest way to ship software is also the fastest way to ship vulnerabilities.

Vibe coding security: the numbers nobody wants to hear

Veracode tested over 100 large language models across 80 real-world coding tasks. The result: 45% of AI-generated code failed security tests, introducing vulnerabilities from the OWASP Top 10 directly into production systems. Java was the worst offender, with LLM-generated code failing more than 70% of the time. Python, C#, and JavaScript hovered between 38% and 45%.

Here is what makes this unsettling: models have dramatically improved at writing code that compiles. Two years ago, fewer than 20% of AI code samples even ran successfully. Today, 90% compile on the first try. But the security failure rate has barely moved. As Veracode's CTO Jens Wessling put it: "Models are getting better at coding accurately but are not improving at security."

Bigger models do not fix this either. Newer, larger LLMs produce code that is no more secure than their smaller predecessors.

The iteration trap: every revision makes it worse

If you have ever asked an AI to "just add this feature" five times in a row, you have likely made your codebase significantly more dangerous. Kaspersky researchers found that after just five prompt iterations, code contained 37% more critical vulnerabilities than the initial generation. Feature-focused prompts across the test set produced 158 new vulnerabilities, including 29 critical ones.

The mechanism is straightforward. Each revision adds complexity without architectural awareness. The model optimizes for "does it work?" while quietly accumulating security vulnerabilities that compound with every iteration. Wiz research confirmed this pattern: 20% of vibe-coded applications had serious vulnerabilities or configuration errors at the infrastructure level.

Where AI code actually fails (and where it does not)

A December 2025 study by security startup Tenzai tested five major AI coding tools (Claude Code, Codex, Cursor, Replit, and Devin) by building three identical applications with each. They found 69 vulnerabilities across the 15 applications.

The surprising nuance: none of the tools produced exploitable SQL injection or cross-site scripting flaws. These well-documented attack patterns have been trained out of the models. The real danger lies in business logic vulnerabilities, the kind of flaws that require understanding what an application is supposed to do, not just how to write syntactically correct code.

An e-commerce checkout that lets users set their own prices. An API endpoint that trusts client-side authorization. These are the gaps that teams consistently miss, because spotting them requires the kind of contextual reasoning that current models simply lack.
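The checkout flaw above can be sketched in a few lines. This is a minimal, hypothetical illustration, not code from any of the cited studies; the function and catalog names are invented. The insecure handler trusts whatever price the client sends, while the secure one looks the price up server-side. Prices are in integer cents to keep the arithmetic exact.

```python
# Hypothetical sketch of the business-logic flaw described above.
# All names (CATALOG, insecure_checkout, secure_checkout) are illustrative.

CATALOG = {"sku-123": 4999}  # server-side source of truth, price in cents


def insecure_checkout(cart_item: dict) -> int:
    # VULNERABLE: uses whatever price the client supplied in the request body
    return cart_item["price"] * cart_item["quantity"]


def secure_checkout(cart_item: dict) -> int:
    # Safe: the price comes from the server-side catalog, never the client
    price = CATALOG[cart_item["sku"]]
    return price * cart_item["quantity"]


# An attacker-controlled request body that claims a 1-cent price:
malicious = {"sku": "sku-123", "price": 1, "quantity": 3}

print(insecure_checkout(malicious))  # 3 cents: the attacker set the price
print(secure_checkout(malicious))    # 14997 cents: the catalog price wins
```

Nothing about this is syntactically wrong, which is exactly the point: a model optimizing for "does it compile?" will happily produce the insecure version.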

The speed illusion: 81% faster, but at what cost?

Ox Security analyzed 300 open-source projects and described AI-generated code as "highly functional but systematically lacking in architectural judgment." The anti-patterns appeared in 80-90% of AI-generated projects: excessive comments, avoidance of refactors, over-specification, and recurring bugs that should have been caught once and eliminated.

This creates what analyst Ana Bildea calls exponential tech debt: "Traditional technical debt accumulates linearly. AI technical debt is different. It compounds." Organizations that measured success by feature velocity discovered, within 18 months, that they could no longer ship features: every hour the AI saved had vanished into rework.

The METR study quantified this perception gap precisely: developers felt 20% faster while actually measuring 19% slower in real-world codebases. The speed is real in the moment. The cost arrives later, in security patches, architectural rewrites, and incident response.

The fix nobody wants to implement

The data points to one uncomfortable conclusion: security-focused prompting works, but almost nobody uses it. Kaspersky's research found that even a generic instruction like "make sure the code follows best practices for secure code" cut vulnerability rates in half. Language-specific security guidance reduced them further.

Yet the entire appeal of vibe coding is avoiding exactly this kind of deliberate thinking. Telling your AI to "build me a payment system" is fast. Telling it to "build me a payment system that validates server-side authorization, sanitizes all inputs, implements rate limiting, and follows OWASP guidelines for financial transactions" takes the same effort as understanding the security requirements yourself.
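The Kaspersky finding suggests a cheap mitigation: systematically prepend security requirements to every code-generation request instead of relying on ad-hoc prompts. A minimal sketch, assuming a simple prompt-wrapping step in whatever tooling calls the model; the preamble text and function name are illustrative, not a canonical checklist.

```python
# Minimal sketch of security-focused prompting. The preamble wording is
# illustrative; Kaspersky found even a generic instruction like this
# cut vulnerability rates roughly in half.

SECURITY_PREAMBLE = (
    "Make sure the code follows best practices for secure code: validate "
    "authorization server-side, sanitize all inputs, apply rate limiting, "
    "and follow OWASP guidelines.\n\n"
)


def harden_prompt(task: str) -> str:
    """Wrap a feature request with an explicit security instruction."""
    return SECURITY_PREAMBLE + task


prompt = harden_prompt("Build me a payment system.")
```

Wiring this into the tool once means the deliberate thinking happens in one place, rather than being re-typed (or skipped) on every request.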

The companies getting this right treat AI-generated code not as a replacement for engineering judgment, but as a first draft that requires human architectural review. Every line generated, verified. Every feature added, threat-modeled.

Vibe coding ships fast. The question is whether you can afford to fix what it ships.


Sources and References

  1. Veracode: Tested 100+ LLMs across 80 coding tasks; 45% failed security tests, introducing OWASP Top 10 vulnerabilities. Java >70% failure rate.
  2. Kaspersky: After 5 prompt iterations, 37% more critical vulnerabilities. Feature-focused prompts produced 158 new vulnerabilities, including 29 critical.
  3. Ox Security (via InfoQ): 300 open-source projects; AI code highly functional but lacking architectural judgment. Anti-patterns in 80-90%.
  4. Tenzai (via CSO Online): Tested 5 AI tools across 15 apps; 69 vulnerabilities. Zero SQLi/XSS but pervasive business-logic flaws.
  5. Veracode (via Help Net Security): AI models compile 90% of the time but security failure rates unchanged. Larger models no more secure.
