The NDIA Copilot Trial: A Case Study in Vendor Capture, Missing Safety Controls and Systemic Governance Failure

This article was also published on LinkedIn and Substack.


With thanks to the persistence of Josh Taylor, the NDIA Microsoft Copilot risk assessment (all 109 pages) was finally released under FOI after eight months.

So over the weekend - while the NDIS systems were shut down yet again (I’ll have more on this in an EOY outages roundup) - I buried myself in this document, expecting to discover the symptoms of technology distress and an administration dangerously out of its depth.

And what a read. No wonder the NDIA did not want to release it.

Here’s what I found.

When a major public sector agency experiments with generative AI, the public should expect rigour, independence and accountability. The NDIA’s “Microsoft Copilot Trial” and its subsequent “risk assessment” deliver none of these. What the Agency presents as evidence is, in reality, a blend of unverified user enthusiasm, vendor talking points, and policy shortcuts dressed up as due diligence.

I started to see what appeared to be Microsoft marketing. So I asked Copilot’s competitor, ChatGPT, to do an analysis of the content: what was regurgitated from Microsoft; what was boilerplate from the DTA; and what was NDIA “narrative”.

After a detailed analysis of the NDIA’s Copilot trial documents, here is the uncomfortable truth:

Approximately 70% of the risk assessment document is directly derivative of Microsoft marketing language and Digital Transformation Agency boilerplate. Another 20% is NDIA wrapping paper.

Only about 10% shows genuine, NDIA-specific commentary - and even that lacks technical depth. No surprise. This “risk assessment” contains no research method or evaluation methodology.

What is in this “risk assessment” is not independent governance. This is vendor-led narrative construction. And it matters - because this Agency makes decisions that materially affect people’s safety, health, independence and lives.

The Three Structural Flaws That Undermine the Entire NDIA Copilot Trial

1️⃣ Structural Flaw #1: No quantification. No measurement. No baseline. The NDIA declares: “Productivity improved”. “Time savings achieved”. “High user satisfaction”. But there is no experimental design whatsoever. No baseline timings. No controlled comparisons. No metrics on accuracy, error rates or rework. No examination of quality.

The entire “evaluation” rests on self-reported sentiment surveys. In other words: we asked people how they felt about using Copilot, and they said it was good. That is not evaluation. It is a vendor-approved satisfaction poll masquerading as evidence.
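For contrast, here is a minimal sketch of the kind of baseline-versus-Copilot measurement the trial never attempted. This is illustrative only: every task name, timing and rework figure below is invented for the example, not drawn from the NDIA documents.

```python
"""Illustrative only: a minimal measurement framework of the kind the
trial lacked. Every task name and number below is hypothetical."""

from statistics import mean

# Hypothetical paired observations: minutes to complete the same drafting
# task, and whether the output needed rework, with and without Copilot.
baseline = [
    {"task": "summarise report", "minutes": 42, "rework": False},
    {"task": "summarise report", "minutes": 38, "rework": True},
    {"task": "draft letter", "minutes": 25, "rework": False},
]
with_copilot = [
    {"task": "summarise report", "minutes": 18, "rework": True},
    {"task": "summarise report", "minutes": 22, "rework": True},
    {"task": "draft letter", "minutes": 12, "rework": False},
]

def summarise(label, rows):
    """Report mean task time and rework (error) rate for one condition."""
    avg_minutes = mean(r["minutes"] for r in rows)
    rework_rate = sum(r["rework"] for r in rows) / len(rows)
    print(f"{label}: {avg_minutes:.1f} min/task, {rework_rate:.0%} rework")
    return avg_minutes, rework_rate

base_time, base_err = summarise("Baseline", baseline)
ai_time, ai_err = summarise("Copilot", with_copilot)

# A claimed "time saving" only counts if quality does not degrade with it.
print(f"Time saved: {base_time - ai_time:.1f} min/task; "
      f"rework change: {ai_err - base_err:+.0%}")
```

Even this toy level of rigour - timed tasks, counted rework, a baseline to compare against - is more than the trial documents contain.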

2️⃣ Structural Flaw #2: No assessment of participant-facing risks (including risk to life). This is the heart of the problem. The NDIA trial entirely fails to consider how Copilot could: distort participant evidence; mis-summarise reports; introduce hallucinated facts into staff workflows; contaminate decision-making; contribute to “shadow automation” where AI shapes outcomes without authority; create systemic unfairness; and introduce bias for complex and intersectional disability profiles.

And shockingly:

The NDIA documents do not include “risk to life” as a category - despite APS AI guidance classifying any technology that affects people’s welfare, safety or human rights as high-risk.

Given that NDIS supports include equipment, personal care, crisis supports, behavioural supports and decisions that directly affect survival and safety - this omission is not just an oversight.

It is a profound governance failure. This is how RoboDebt happened. Not because of bad tech - but because of bad assumptions, unchecked automation, and a failure to ask the most important question: “What harm could this cause?” The NDIA repeats the same pattern.

3️⃣ Structural Flaw #3: Accessibility benefits were never independently verified. The NDIA trumpets Copilot as: “Revolutionary for hearing-impaired staff”. “Transformational for dyslexic staff”. “A game changer for JAWS users”.

These claims are anecdotal, self-reported, and completely unverified. There is: no Assistive Technology compatibility matrix; no WCAG-aligned assessment; no independent accessibility testing; no structured task comparisons; no error-rate or cognitive-load measurement; no usability studies; no negative findings documented; no formal verification that Copilot is more accessible than existing tools.

It is on the public record that NDIA staff have sued the Agency over accessibility failures, including inadequate assistive technology, inaccessible IT systems and workspaces lacking physical adjustments. These failures made it difficult for employees with disabilities to perform their jobs effectively.

Accessibility claims without structured evaluation are marketing, not evidence. And in a disability agency, treating accessibility anecdotes as proof is both dangerous and irresponsible.

A Forensic Critique of NDIA’s Evaluation Methodology

The Copilot trial reads like something written for Microsoft, not for the NDIA Board or the Australian public.

  • There is no experimental design. There is no control group. No baseline. No measurement framework.

  • There are no defined success criteria. The trial could only ever “succeed”, because nothing was specified that would allow it to fail.

There is an over-reliance on vendor claims. These include: Microsoft’s IRAP assessment. Microsoft’s encryption story. Microsoft’s “no training on your data” marketing. Microsoft’s “only accesses what you can already see” line. All repeated almost verbatim. And there is no independent verification.

Let’s take a look at the data storage misinformation. The NDIA states: “All Copilot data is stored in Australia.” Absolute claims ALWAYS need to be tested, so I asked ChatGPT to do some digging.

This is not what Microsoft publicly commits to. Microsoft’s public position is far more nuanced: core M365 data is stored in Australian regions; Copilot interaction data is tied to the Preferred Data Location / local region; and full in-country Copilot processing is only rolling out now, through 2026.

So it appears that the NDIA turned “Microsoft is working towards sovereign controls for Copilot” into an absolute claim of “All data stays here”. Meanwhile, the Commonwealth has launched a sovereign-hosted GPT-4o platform (GovAI Chat) - meaning ChatGPT-class tools can now be operated onshore, secure, and within APS controls.

NDIA’s narrative (“Copilot is safe, ChatGPT is risky”) is now technically obsolete.

Microsoft’s competitive positioning becomes NDIA’s policy. The document repeats the Microsoft sales line that: “Copilot uses the same technology that powers ChatGPT, but with enterprise-level functionality designed for government.” But this ignores that ChatGPT Enterprise does have enterprise controls, and that GPT-4o is now sovereign-hosted for the APS. The NDIA is simply accepting Microsoft’s commercial framing as truth.

This is textbook vendor capture.

The Real Participant Safety Risks NDIA Should Have Assessed

Here is what NDIA should have evaluated, but did not. Risk to life and human safety are nowhere to be seen.

Hallucination risk: This is a risk with ALL generative AI. Copilot can generate plausible but false: summaries of participant evidence; interpretations of reports; legislative references; planning logic; assessments of functional capacity. In NDIA, a hallucination isn’t a typo - it can be a life-impacting error.
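To illustrate what even a crude accuracy control could look like (something entirely absent from the trial), here is a toy Python sketch that flags summary sentences with little support in the source document. The report text, summary text, word-overlap heuristic and 50% threshold are all my own invented assumptions for the example, not anything the NDIA or Microsoft actually does.

```python
"""Illustrative only: a naive grounding check that flags summary sentences
with little word overlap against the source document. Real assurance would
need clinical review and proper NLP tooling; this is a toy example."""

import re

def tokens(text: str) -> set[str]:
    """Lowercase word set, ignoring very short words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}

def flag_unsupported(source: str, summary: str, threshold: float = 0.5):
    """Return summary sentences whose content words mostly do not appear
    in the source - candidate hallucinations for human review."""
    source_words = tokens(source)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        words = tokens(sentence)
        if not words:
            continue
        support = len(words & source_words) / len(words)
        if support < threshold:
            flagged.append((support, sentence))
    return flagged

# Hypothetical example: the summary goes beyond what the report states.
report = "The participant requires daily personal care and a powered wheelchair."
summary = "Participant requires weekly personal care only. Funding can be reduced."
for support, sentence in flag_unsupported(report, summary):
    print(f"[{support:.0%} supported] {sentence}")
```

A check this simple would not catch every error, but the trial documents contain nothing of the kind, automated or manual.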

Compression and loss of nuance. Let’s take a look at the compression risk. This is a VERY significant risk, and it needs to be examined in a control framework by clinicians - not the IT folk. It is noted that while there is reference to a yet-to-be-convened “Ethics Committee”, no such committee is reported on the NDIS website.

Copilot’s summarisation capabilities can inadvertently erase: complex psychosocial nuance; fluctuating conditions; cultural factors; trauma history; co-existing conditions.

Disability information does not compress safely.

Shadow automation. Staff under pressure will inevitably begin trusting Copilot summaries as “good enough”. This is how automation creep happens. This is how systemic bias forms. This is how administrative unfairness hardens into practice.

Bias and normative assumptions. General LLMs are trained on normative functioning and under-represent disabled voices. That bias flows into tone, structure and interpretation.

Risk to life. There is no declared risk to life. Delayed supports, misinterpreted evidence, incorrect summaries, or copy-paste errors can lead to: personal care failures; equipment delays; mental health crises; behavioural escalation; hospitalisation; risks of injury; risks of death.

NDIA should have declared this explicitly. It did not.

What Real Accessibility Validation Should Have Looked Like

The NDIA’s current defective systems and the NDIS website demonstrate that accessibility is not, and has not been for a long time, a priority for the Agency. This risk assessment shows it yet again.

Instead of “users loved it”, NDIA should have conducted:

  • Formal co-design with disabled staff. Structured involvement, not anecdotes.

  • Assistive Technology compatibility testing. JAWS, NVDA, VoiceOver, magnifiers, switches, dictation - all should have been tested systematically (a sketch of such a matrix follows this list).

  • Measured usability testing. Task time, error rates, cognitive load comparisons.

  • WCAG and AT interoperability reviews. With independent accessibility expert evaluation.

  • Full documentation of failures. Because accessible for some ≠ accessible for all. And generative AI introduces new failure modes. None of this was done. Why?
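As flagged above, here is a hedged sketch of what an Assistive Technology compatibility matrix with measured outcomes might look like. The product names (JAWS, NVDA, Dragon) are real, but every task, timing and pass/fail result below is hypothetical and exists only to show the shape of the evidence that is missing.

```python
"""Illustrative only: a sketch of the AT compatibility matrix and measured
task comparisons the trial never produced. All results are hypothetical."""

from dataclasses import dataclass

@dataclass
class ATResult:
    assistive_tech: str   # e.g. JAWS, NVDA, Dragon dictation
    task: str             # structured, repeatable task
    completed: bool       # pass/fail, not sentiment
    minutes: float        # measured task time
    errors: int           # counted errors, not anecdotes

results = [
    ATResult("JAWS", "review Copilot meeting summary", True, 14.0, 2),
    ATResult("NVDA", "review Copilot meeting summary", False, 0.0, 0),
    ATResult("Dragon", "dictate prompt and edit output", True, 9.5, 1),
]

# A compatibility matrix documents failures as rigorously as successes.
for r in results:
    status = "PASS" if r.completed else "FAIL"
    print(f"{r.assistive_tech:<8} {r.task:<35} {status:<5} "
          f"{r.minutes:>5.1f} min  {r.errors} errors")
```

A table like this, filled in across real assistive technologies and real tasks, is the minimum evidence needed before calling any tool “a game changer” for disabled staff.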

The Bottom Line: This Trial Is Not Evidence. It Is Marketing-Led Narrative

The NDIA’s Copilot trial is the antithesis of responsible AI governance. It is an example of: vendor capture; methodological failure; misstated technical claims; unverified accessibility benefits; no participant safety assessment; no consideration of risk to life; blind acceptance of Microsoft’s competitive framing; and poor alignment with APS AI risk guidance.

It suggests, unintentionally but unmistakably, that the NDIA has learned little from the systemic failures that produced RoboDebt - or the concerns increasingly referred to as “RoboNDIS”.

AI in government requires: independent verification; rigorous testing; transparent methodology; safety-first design; participant-first governance; human rights considerations; and a commitment to truth, not marketing.

This trial delivered none of these.

This is an extraordinary abdication of governance, raising questions about misfeasance, and about whether the NDIA Board is fit for purpose as the Agency sleepwalks, no questions asked, into an AI and automation wormhole.

What this “risk assessment” shows is that the NDIA is in capability oblivion, dangerously out of its depth. No wonder the Agency fought to keep it secret. The secret’s out.

Time for the ANAO, Ombudsman, and legal system to step in and step up.

