Generative AI in Public Decision-Making: The Deloitte Welfare Report as a Case Study



In October 2025, Deloitte Australia agreed to refund part of its fee for a government-commissioned report on Australia’s welfare-compliance IT system, produced for the Department of Employment and Workplace Relations. The work was presented as an independent review of a system that automates penalties for welfare recipients who fail to meet certain obligations. Since then, it has become a case study of how generative AI can malfunction when incorporated into formal institutional processes.

The 237-page report was intended to evaluate the design and operation of the welfare-compliance system. However, a welfare law researcher at the University of Sydney examined the document and found numerous inaccuracies, “including a fabricated quote from a federal court judgment and references to nonexistent academic research papers.” Even the technical material appeared to contain fabricated details.

After the errors came to light, Deloitte provided a revised report and acknowledged using a generative AI toolchain based on Azure OpenAI to prepare parts of the document. The firm refunded the final installment of its fee, and the department stated that the main recommendations remained unchanged.

The Department of Employment and Workplace Relations’ website itself notes that the report “was updated on 26 September 2025 to address a small number of corrections to references and footnotes. It replaces the report dated 4 July 2025.”

On the surface, the incident ends there: errors were corrected, payment was adjusted, and recommendations were retained. However, the episode highlights several deeper issues.

From Model Error to Governance Failure

It is well known that generative AI models hallucinate: they produce text that is not supported by sources, and they do so in a stylistically confident way. (For a discussion of hallucinations in AI, see Sun et al., 2024.) As long as this behavior is confined to tests or drafts, it is merely a technical issue. However, once the output is incorporated into a government-commissioned report or similar product, it also becomes a governance matter.

The Deloitte case is informative because of how the errors spread. The hallucinations did not remain confined to a draft or a testing environment. Instead, they appeared in an official document issued under the firm’s name and delivered through the usual professional channels. At this point, the relevant question is not only whether the model invented sources but also how the surrounding organizational procedures allowed those inventions to reach a finished report.

Regulators have begun to respond to this type of development. In auditing, for instance, oversight bodies have raised concerns that large firms are incorporating AI tools into their processes without establishing clear methods for evaluating their impact on audit quality. The concern is not that the use of AI will always result in failure, but rather that standard safeguards, such as traceability, documented reviews, and accountability, may not keep pace with the introduction of these tools. From this perspective, the Deloitte refund looks less like an anomaly and more like an early example of a recurring pattern.

How Trust Is Amplified and Undermined

Consulting reports, audits, regulatory assessments, and policy reviews occupy an unusual position in public life. Many are only partially read, yet they influence decisions and establish a foundation for future actions. Their authority rests not only on their content but also on the expectations surrounding them, such as those related to internal review, professional standards, and the care taken in verifying sources and methods.

When generative AI is used in this setting, its characteristic errors can be amplified by the same arrangements. For example, a hallucinated citation in an academic draft reflects negatively on the author. However, a hallucinated citation in a government-commissioned review raises questions about the contractor’s internal controls, the client’s review process, and eventually, the system of professional accountability. In that sense, Deloitte’s partial refund is not just a financial adjustment, but also an acknowledgment that the report’s reliability fell short of expectations.

AI is often introduced into knowledge-intensive work with the expectation that it will expand the scope of analysis by scanning more material, comparing more sources, and drafting more quickly than human analysts alone. However, if these tools are not accompanied by verification and accountability practices, they can have the opposite effect. They can make it easier for unsupported claims to pass through review because they are presented in polished prose that resembles the work of experts.

The Structural Risk Behind “More Scandals”

It is possible to view the Deloitte episode as unique to one firm in one jurisdiction. However, that interpretation overlooks the broader forces at play. Organizations across sectors are under pressure from markets, clients, and internal leadership to integrate AI into their operations. Under these conditions, the likelihood of other cases occurring does not depend on unusual behavior. It follows from ordinary adoption patterns.

Several structural features point in this direction. The first is the gap between the ease of technical deployment and the effort required to establish adequate oversight. Adding a generative model through a major cloud platform has become straightforward, whereas constructing a comprehensive regime of documentation, testing, and human review around that model takes time and resources.
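
To make that asymmetry concrete, the sketch below shows roughly how little code is needed to wire a hosted generative model into a drafting workflow. It is a minimal illustration, assuming the openai Python SDK (v1+) and an Azure OpenAI deployment; the deployment name, prompt, and environment variables are placeholders, not details of the Deloitte engagement.

```python
# Minimal sketch: calling a hosted generative model through Azure OpenAI.
# Assumes the `openai` Python SDK (v1+); credentials, deployment name, and
# prompt below are illustrative placeholders.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

response = client.chat.completions.create(
    model="report-drafting-model",  # hypothetical deployment name
    messages=[
        {"role": "system", "content": "You draft sections of policy review reports."},
        {"role": "user", "content": "Summarise the penalty workflow of the compliance framework."},
    ],
)

# The returned prose is fluent and confident regardless of whether its claims
# are supported; nothing in this call verifies citations or factual accuracy.
print(response.choices[0].message.content)
```

Everything the paragraph above describes as costly (documentation, testing, human review) sits outside this snippet, which is precisely the gap at issue.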

Another issue is the opacity of AI-assisted work. A report that blends human-drafted content with AI-generated segments does not look significantly different from a report written entirely by humans, and without explicit disclosure and systematic checking it is difficult to tell the two apart. At present, the difference is sometimes detectable only because the technology is still maturing and its use is not yet widespread. Once AI-assisted drafting becomes commonplace, oversight will have to be built into the production process by design rather than inferred after the fact.

A third feature concerns liability. In standard professional services contracts, the firm is responsible to the client for the results it delivers, whereas the provider of an AI model has no direct contractual relationship with the end client. If errors appear in an AI-assisted report, the client looks to the firm that signed the contract, not to the model vendor. Current circumstances therefore encourage firms to use AI tools to increase efficiency while formal responsibility remains where it has always been. The combination of greater dependence on AI and unchanged external accountability creates conditions under which similar incidents are likely to recur.

Reliability, Not Catastrophe

The Deloitte case does not demonstrate that generative AI is unusable; the technology can help with drafting, summarizing, and exploring alternative formulations and perspectives. What the case shows is that, in practice, reliability depends on the entire system in which the model is embedded, not just on the model itself.

The revised report, the partial refund, and the department’s statement that the main recommendations still stand all reflect an underlying tension that will likely appear elsewhere. On one side is the claim that the core analysis remains intact. On the other is the fact that demonstrably incorrect references and quotations affect how that analysis is evaluated. Once supporting details are in doubt, confidence in the overall work cannot be taken for granted.

In this light, the notable feature of the case is not the presence of hallucinations, which are to be expected, but rather the apparent absence of a process capable of identifying them prior to publication. The report presumably underwent internal review, discussion with the client, and quality control procedures. At each stage, the fabricated references remained in place. Therefore, the central reliability issue is less about a particular model output and more about a system that did not treat that output with structured skepticism.
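
To give a sense of what even a narrow layer of structured skepticism could look like, the sketch below checks whether cited DOIs resolve in the Crossref registry and flags those that do not for human review. This is an illustrative sketch only, not a description of any process used by Deloitte or the department, and it would catch just a fraction of fabricated references; court judgments and sources without DOIs require different checks entirely.

```python
# Illustrative sketch: flag cited DOIs that cannot be found in Crossref.
# Assumes the `requests` library and network access; the DOI list is a
# hypothetical placeholder, not drawn from the Deloitte report.
import requests

def doi_resolves(doi: str) -> bool:
    """Return True if Crossref has a record for the DOI, False otherwise."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

cited_dois = [
    "10.1000/example.0001",  # hypothetical citation extracted from a draft
]

for doi in cited_dois:
    if not doi_resolves(doi):
        print(f"Flag for human review: no Crossref record found for {doi}")
```

A check like this does not replace expert review; its only role is to ensure that obviously unverifiable references cannot pass through silently.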

Conclusion: Trust as the Real Battleground

Discussions of generative AI often focus on capabilities, such as benchmark scores, parameter counts, and performance on standardized tasks. However, the Deloitte refund shifts the focus to a different dimension: the trustworthiness of AI-assisted work in specific institutional contexts. This raises questions about who relies on which outputs, under what conditions, and what correction mechanisms are used.

For governments, the case shows that commissioning work does not absolve them of responsibility for how that work is produced. For firms such as Deloitte, it demonstrates that a change in production methods does not lower the standards the final product must meet. For outside observers, it signals a shift in how AI errors are perceived: mistakes in informal online content are treated one way, while similar mistakes in documents used to shape welfare policy are judged differently because different systems and expectations are involved.

Further incidents are likely not because AI systems will suddenly become less capable, but rather because the adoption of AI systems is proceeding faster than the ability of institutions to adapt. Systems designed for human expertise alone are being used in human-machine combinations without corresponding redesigns that make roles, review practices, and dependencies explicit. As long as this mismatch persists, every new report, audit, or policy review that relies on generative AI introduces new uncertainties.

Although the Deloitte incident has moved out of the headlines, it will remain a reference point in later discussions about the use of AI in public decision-making. The incident has already entered legal and policy debates about hallucinations, responsibility, and liability. From that perspective, the refund is more than just a closing step. It marks a transition from viewing generative AI primarily as an efficiency tool to evaluating it in terms of institutional reliability and public trust.