A Proposal for a Standardized Contextual Integrity Score (CIS)
Quantifying Intent in AI-Generated Code
The Implementation-Intent Gap
The proliferation of Generative AI has created a crisis in software quality. AI models excel at generating code that is superficially correct, but they cannot grasp the deeper architectural coherence or human intent behind it. The result is a large, unquantified liability, invisible to traditional static analysis tools, that manifests only after deployment.
Existing software benchmarks are not equipped for this new reality. Metrics like Cyclomatic Complexity capture structural complexity but say nothing about intent, and Code Coverage has become a vanity metric, validating only that a line of code was executed, not that its semantic output was correct. We are measuring the artifacts of code generation while remaining blind to the intent behind them.
The Contextual Integrity Score (CIS)
To address this critical gap, we propose the Contextual Integrity Score (CIS). The CIS is a standardized, composite metric designed to provide a quantifiable, multi-dimensional assessment of a software artifact's contextual integrity. It is an essential "nutritional label" for AI-generated code, enabling organizations to make informed risk-reward decisions before merging code into a production baseline.
The CIS provides a holistic evaluation by triangulating context through three distinct, mutually reinforcing pillars:
Pillar I: Rationale Integrity Score (RIS)
This pillar quantifies the "Why?" by measuring the clarity of intent. It assesses the traceability and alignment of the code to a discernible business or functional requirement.
- Opaque / Contradictory: The code's purpose is not discernible.
- Misaligned: The code's purpose is inferred but misaligned with the domain.
- Traceable: The code's purpose is clear and aligns with requirements.
- Explicit & Aligned: The code's purpose is explicitly and accurately tied to the business rationale.
Pillar II: Architectural Integrity Score (AIS)
This pillar quantifies the "How-it-fits?" by measuring structural soundness and conformance. It assesses the code's structural maintainability and its programmatic adherence to prescribed architectural patterns.
- Chaotic: The code is both locally complex and globally non-compliant.
- Brittle / Deceptive: The code appears clean locally but is fundamentally wrong, violating core architectural rules.
- Compliant: The code is architecturally sound but could benefit from refactoring.
- Sound & Maintainable: The code is both locally clean and globally sound.
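The "programmatic adherence to prescribed architectural patterns" that AIS measures can be checked mechanically. As a minimal sketch, assuming a three-tier layered architecture ("ui" over "services" over "db", names invented for this example), the function below flags imports that point from a lower layer to a higher one; a real AIS would aggregate many such conformance rules alongside local maintainability measures.

```python
import ast

# Hypothetical layering rule: higher layers may depend on lower ones,
# never the reverse ("ui" -> "services" -> "db").
LAYER_OF = {"ui": 2, "services": 1, "db": 0}

def import_violations(module_layer: str, source: str) -> list[str]:
    """Return imports in `source` that break the prescribed layering.

    A module in a lower layer importing a higher layer (e.g. "db"
    importing "ui") is one concrete AIS non-compliance signal.
    """
    violations = []
    for node in ast.walk(ast.parse(source)):
        names = []
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        for name in names:
            top = name.split(".")[0]
            if top in LAYER_OF and LAYER_OF[top] > LAYER_OF[module_layer]:
                violations.append(name)
    return violations
```

Here `import_violations("db", "from ui.views import render")` reports the upward dependency, while the same import from the "ui" layer passes cleanly. This is the "Brittle / Deceptive" case made measurable: code that looks clean in isolation but violates a global rule.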
Pillar III: Testing Integrity Score (TIS)
This pillar quantifies the "What-it-does?" by measuring semantic and behavioral validation. It assesses the quality and relevance of the test suite, not merely its line coverage.
- Vanity Coverage: The test suite may have high code coverage but validates almost none of the actual requirements.
- Gappy: The tests are relevant and cover the "happy paths" but miss critical edge cases.
- Robust: The test suite is semantically sound and covers all critical business requirements.
- Semantically Validated: The test suite fully validates the intent and behavior of the requirements.
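The three pillar scores must ultimately be combined into the single composite CIS. The proposal does not fix an aggregation rule, so the sketch below is one assumption-laden possibility: each pillar is scored 0-3 on its ordinal rubric, and the composite is their geometric mean, chosen here because it makes the pillars mutually reinforcing; a single Opaque, Chaotic, or Vanity Coverage pillar (score 0) zeroes the whole score instead of being averaged away.

```python
def contextual_integrity_score(ris: int, ais: int, tis: int) -> float:
    """Aggregate the three pillar scores (each 0-3) into a composite CIS.

    The geometric mean is an assumption of this sketch, not part of
    the proposal: it rewards balance across pillars and collapses to
    zero when any single pillar fails entirely.
    """
    for score in (ris, ais, tis):
        if not 0 <= score <= 3:
            raise ValueError("pillar scores must be in the range 0..3")
    return (ris * ais * tis) ** (1.0 / 3.0)
```

Under this scheme a fully aligned artifact (3, 3, 3) scores 3.0, while an artifact with excellent architecture and tests but an opaque rationale (0, 3, 3) scores 0.0, reflecting the triangulation idea: no pillar can compensate for the total absence of another.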
A Call for Standardization
The CIS is proposed as a foundational framework for a new, essential conversation about software quality. The rapid accumulation of Contextual Debt, code whose rationale, structure, and validation can no longer be traced back to intent, represents a systemic risk to the software industry. We must pivot from reactive debugging to proactive, automated quality gates that can measure intent.
We believe that this requires a new, broad-based collaboration between academia and industry to refine, test, and adopt the CIS as a universal standard, ensuring that the future of AI-accelerated software development is not only fast, but also safe, reliable, and fundamentally trustworthy.