
WHITEPAPER · APRIL 2026

The Measurement Gap

A Framework for AI Code Observability in the Enterprise

For CTOs, VPs of Engineering, and CIOs accountable for AI investment outcomes.

5% of enterprises see real AI ROI
86% of enterprises raised AI budgets in 2026
1.7× defect density in AI code

Section 1

Executive Summary

In 2026, 86% of enterprises increased their AI coding tool budgets[1]. Yet only 5% report measurable ROI[2]. The gap is not the technology; it is the absence of observability.

Standard engineering intelligence platforms (DX, LinearB, Faros) measure team-level activity: cycle time, deployment frequency, PR velocity. None of them measure what fraction of a commit was authored by AI, which agent produced it, or whether that code survives in production.

Without per-line attribution, three critical questions remain unanswered:

  1. How much of our shipped code is AI-generated?
  2. Does that code survive — or does it generate rework?
  3. Which vendors and teams use AI most effectively?

This paper proposes a measurement framework based on three KPIs (Adoption, Durability, Churn) and an open standard (git-ai v3.0.0) for capturing per-line AI attribution at commit time. The approach is local-first, vendor-portable, and compatible with existing engineering intelligence tools.

Section 2

The State of AI Coding in 2026

AI coding adoption is at saturation. The relevant question is no longer "should we use AI?" but "is our AI investment producing measurable outcomes?"

97% of tech leaders have integrated AI into their backends[3]

66% have not saved a single human headcount[3]

20–30% advertised productivity gains from AI coding tools[4]

19% actual measured slowdown in some studies[5]

48% of AI-generated code contains security vulnerabilities[6]

7 hrs lost per developer per week to AI rework when the AI share exceeds 40%[7]

The 39-point perception gap. Developers feel 20% faster with AI but, in measured studies, are 19% slower due to longer reviews and higher bug density[5]. Self-reported productivity is no longer a reliable signal.

Section 3

The Measurement Gap

Engineering organizations have invested in measurement platforms for over a decade. DORA metrics, SPACE framework, DXI scores. None of them measure AI authorship.

Only 16.8% of organizations track investment per AI tool versus benefit[8]. Of the remaining 83.2%, most rely on developer surveys, anecdotal feedback, or no measurement at all. When the board asks if the AI budget is paying off, the honest answer is "we don't know."

What current platforms measure

| Platform category | What it tracks | Measures AI? |
| --- | --- | --- |
| Engineering intelligence (DX, LinearB, Faros) | Cycle time, throughput, PR velocity, DORA | No |
| Code quality (SonarQube, Code Climate) | Static analysis, test coverage, complexity | No |
| AI gateways (LLM proxies) | Token consumption, API cost | Inputs only |
| Developer surveys | Self-reported satisfaction | Subjective |
| AI code observability (Obsly AI, git-ai) | Per-line attribution, durability, churn | Yes |

The gap is not in the data — git already stores everything needed. The gap is in capturing AI authorship at the moment of creation, before the signal disappears.

Section 4

A Framework for AI Code Observability

The framework is built on three principles:

1. Capture at the source, not after the fact

AI agents (Claude Code, Cursor, Codex, Windsurf) emit PreToolUse and PostToolUse hooks when they edit files. Capturing the diff at this moment is deterministic. Detection after the fact (e.g., AI classifiers on diffs) achieves <60% accuracy.
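As an illustration of capture at the source, agents such as Claude Code expose these hooks through project settings. The sketch below follows Claude Code's hooks settings format; the `obsly-ai capture` command is a hypothetical stand-in for whatever tool consumes the hook event:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "obsly-ai capture" }
        ]
      }
    ]
  }
}
```

Claude Code passes the tool-use event (tool name, file path, and inputs) to the command as JSON on stdin, which is the deterministic moment at which the diff can be attributed.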

2. Persist as git-native metadata

Attribution data is stored in refs/notes/ai as structured git notes following the open git-ai v3.0.0 standard. The data travels with the code, survives rebases via post-rewrite hooks, and is accessible to any tool that reads git.
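The mechanics can be sketched on a throwaway repository with plain git. The JSON payload below is a simplified illustration, not the exact git-ai v3.0.0 schema:

```shell
# Create a scratch repo to demonstrate the notes namespace.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email dev@example.com
git config user.name Dev
git commit -q --allow-empty -m "feat: add login endpoint"

# Attach attribution metadata as a note in the ai namespace.
git notes --ref=ai add -m '{"agent":"claude-code","lines":[1,2,3]}' HEAD

# Read it back; any git client can do this with no extra tooling.
git notes --ref=ai show HEAD
```

Because the note is ordinary git data, it clones, pushes, and diffs like any other ref, which is what makes the attribution portable across tools.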

3. Aggregate without sending source code

Only metadata leaves the developer's machine: line numbers, agent identifiers, model names, timestamps. Source code is never transmitted. Compliance reviews pass on day one.

Architecture

Architecture (diagram summary): On the developer's machine, AI agents (Claude Code, Cursor, Codex, Windsurf) fire hooks into the obsly-ai CLI, which computes the diff and attributes lines, persisting the result to refs/notes/ai per the open git-ai v3.0.0 standard. `git push` transmits metadata only, never source. In the optional Obsly AI cloud, data is aggregated per vendor, per repository, and per developer (private), yielding the Adoption, Durability, and Churn KPIs with export to PDF, CSV, and BI tools.

Section 5

Three KPIs That Matter

Per-line attribution is the foundation, but the value comes from three derived metrics. These are the numbers a CTO should be able to recite for any quarter, repository, or vendor.

1. Adoption

% of code lines attributed to AI agents in a given period

Adoption alone is not a quality metric. A vendor at 80% AI may be performing better or worse than one at 30%. The value of Adoption is contextual: it sets the denominator for the other two KPIs.

Adoption = (lines AI-attributed) / (total lines added) × 100

Industry benchmark: healthy teams operate in the 25–40% range. Above 40%, rework rates increase 20–25%[7].

2. Durability

% of AI-attributed lines still present in HEAD after N days

The single most important metric. Durability separates valuable AI code from rework. A line that survives 30 days in production was worth generating. A line rewritten the same week was not — it consumed prompt tokens, review time, and trust.

Durability(30d) = (AI lines unchanged 30 days later) / (AI lines added) × 100

Why it matters: Two vendors at 70% Adoption can have wildly different outcomes. One at 90% Durability is delivering value. One at 55% is generating rework you pay for twice.

3. Churn

% of AI-attributed lines rewritten by humans within N days

Churn is the inverse signal of Durability and the leading indicator of trouble. High churn means humans are systematically correcting AI output. It points to wrong tool choice, wrong prompts, or wrong domain fit.

Churn(7d) = (AI lines rewritten by human within 7 days) / (AI lines added) × 100

Diagnostic value: Churn segmented by agent reveals whether the issue is the tool (Cursor 18% vs Claude 6% on the same repo) or the developer (one team 4%, another team 22% on the same agent).

The reading order matters. Adoption tells you the volume. Durability tells you the value. Churn tells you the friction. Reporting any of these three in isolation is misleading.
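The three formulas above can be sketched against per-line attribution records. The `Line` record and its fields are hypothetical stand-ins for what a git-ai note would yield, not the actual schema:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

@dataclass
class Line:
    ai_authored: bool
    added: date
    rewritten: Optional[date] = None  # date a human rewrote the line, if ever

def adoption(lines: List[Line]) -> float:
    # % of all added lines attributed to AI.
    return 100 * sum(l.ai_authored for l in lines) / len(lines)

def durability(lines: List[Line], as_of: date, window=timedelta(days=30)) -> float:
    # % of AI lines old enough to judge that survived the full window.
    mature = [l for l in lines if l.ai_authored and l.added + window <= as_of]
    surviving = [l for l in mature
                 if l.rewritten is None or l.rewritten > l.added + window]
    return 100 * len(surviving) / len(mature)

def churn(lines: List[Line], window=timedelta(days=7)) -> float:
    # % of AI lines rewritten by a human within the window.
    ai = [l for l in lines if l.ai_authored]
    rewritten = [l for l in ai if l.rewritten and l.rewritten - l.added <= window]
    return 100 * len(rewritten) / len(ai)

sample = [
    Line(True, date(2026, 1, 1)),                    # AI, survives
    Line(True, date(2026, 1, 1), date(2026, 1, 4)),  # AI, churned in 3 days
    Line(False, date(2026, 1, 2)),                   # human-authored
    Line(True, date(2026, 1, 2)),                    # AI, survives
]
print(adoption(sample))                      # → 75.0
print(durability(sample, date(2026, 3, 1)))  # ≈ 66.7
print(churn(sample))                         # ≈ 33.3
```

Note that Durability only counts lines old enough to have completed the window; mixing fresh lines into the denominator would inflate the score.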

Section 6

Implementation Path

AI code observability does not require a new SDLC. It is a layer added to existing repositories without disturbing developer workflow.

  1. Week 1 — Pilot on a single repository

    Install the CLI on three engineers' machines. Confirm hooks fire on every commit. Validate git notes appear under refs/notes/ai.

  2. Week 2 — Connect the dashboard

    Install the GitHub App. Verify that pushed commits appear with their attribution. Establish baseline Adoption / Durability / Churn for the pilot repo.

  3. Weeks 3–4 — Roll out to one team

    Onboard a complete engineering team. Compare per-developer metrics privately. Identify high-Durability and high-Churn patterns for coaching.

  4. Month 2 — Vendor visibility

    For organizations with external vendors, invite them as data providers. Establish quarterly review cadence with KPIs as agenda.

  5. Quarter 1 — Board-ready report

    First quarterly report with three numbers: Adoption, Durability, Churn. Trend line. Vendor comparison. The report your CFO has been asking for.
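The Week 1 checks boil down to a handful of git commands. The sketch below stands up a scratch repository and bare remote in place of the pilot repo, with a hypothetical note payload:

```shell
# Scratch repo and bare remote standing in for the pilot repository.
work=$(mktemp -d) && remote=$(mktemp -d)
git init -q --bare "$remote"
git init -q "$work" && cd "$work"
git config user.email dev@example.com
git config user.name Dev
git remote add origin "$remote"
git commit -q --allow-empty -m "pilot commit"
git notes --ref=ai add -m '{"agent":"cursor","lines":[10,11]}' HEAD

# Check 1: attribution notes exist locally under refs/notes/ai.
git notes --ref=ai list

# Check 2: the note appears alongside the latest commit in the log.
git log -1 --notes=ai

# Check 3: the notes ref pushes as plain metadata next to the code.
git push -q origin HEAD refs/notes/ai
git ls-remote "$remote" refs/notes/ai
```

If all three checks pass on every commit from the three pilot machines, the capture layer is working and the dashboard step can begin.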

Section 7

Illustrative Scenario

The following scenario is illustrative and based on patterns observed in 2026 industry research. Names are anonymized.

European bank · Two-vendor procurement review

A European retail bank engages two consultancies to deliver a new mobile banking platform. Both vendors charge equivalent rates per developer. After Q1, the bank's procurement team requests AI code attribution data from both vendors.

Vendor A: Adoption 68% · Durability (30d) 91% · Churn (7d) 4%

Vendor B: Adoption 82% · Durability (30d) 58% · Churn (7d) 23%

The reading: Vendor B uses AI more aggressively (82% vs 68%) but produces code that gets rewritten almost a quarter of the time. The bank pays for both the original generation and the rework. Vendor A uses AI less but with substantially better outcomes.

The conversation that follows: The bank does not need to terminate Vendor B. With this data, they can ask specific questions: which agents are being used, on which file types, by which teams. The data turns a vague concern into a structured procurement discussion.

Section 8

Conclusion

AI coding tools are not the problem. The absence of measurement is. Enterprise budgets have grown faster than the instruments to evaluate their return.

The framework proposed in this paper is intentionally minimal: three KPIs, one open standard, no source code transmission. It complements rather than replaces existing engineering intelligence platforms. It produces numbers a CTO can take to a board meeting and a procurement team can take to a vendor review.

The companies that adopt this layer in 2026 will be the ones that can answer, in twelve months, the only question that matters: did the AI investment pay off?

In one sentence

Without per-line attribution, AI coding investment is a faith-based exercise. With it, it becomes a managed program with KPIs, like any other infrastructure spend.

Appendix

The Open Standard

Obsly AI implements the git-ai v3.0.0 specification, an open standard for AI code attribution stored as git notes under refs/notes/ai. The format is human-readable, version-controlled, and portable across tools.

Organizations adopting the standard retain full data portability. If a tool change is required for any reason, the underlying attribution data is independent of the analytics platform reading it. This is the same principle that made OpenTelemetry the default for observability instrumentation: the data outlives the vendor.

For the technical specification, see github.com/git-ai-project/git-ai.

References

Sources

  [1] Constellation Research, Enterprise Technology 2026: AI, SaaS, Data Trends. 86% of respondents reported AI budget increases for 2026.
  [2] Master of Code, AI ROI: Why Only 5% of Enterprises See Real Returns in 2026.
  [3] Reported in multiple 2026 enterprise software analyses, including erp.today, Enterprise Software Faces AI-Driven Disruption as Development Productivity Gains Fail to Materialize.
  [4] NVIDIA, State of AI Report 2026. AI-accelerated coding tools deliver 20–30% productivity gains in software development.
  [5] METR / industry surveys (2026). Developer perception of velocity vs measured outcome: a 39-point gap (perceived +20%, measured −19%).
  [6] Security audits aggregated by the Sonar State of Code Developer Survey 2026: at least 48% of AI-generated code contains security vulnerabilities.
  [7] Exceeds AI, AI Code Benchmarks: Safe Productivity Thresholds 2026. Above 40% AI code generation, rework increases 20–25%, costing approximately 7 hours per developer per week.
  [8] Larridin, State of Enterprise AI Q1 2026. Only 16.8% of organizations track investment per AI tool versus benefit.

Build your measurement layer.

Obsly AI is the reference implementation of the framework described in this paper. Free for individual developers. Per-seat for teams. Custom for enterprise.