AI promises to revolutionize how businesses operate, automating customer service, accelerating content production, and streamlining decision-making, especially with the widespread availability of GenAI assistants. But without due diligence, proper safeguards and rigorous testing, these systems can become costly liabilities.
MIT's 2025 NANDA report found that 95% of generative AI pilot programs fail to deliver measurable business impact, while Gartner predicted at least 30% of GenAI projects would be abandoned after proof of concept by the end of 2025. These aren't just theoretical risks - billions in investments are being wasted on AI projects that fail due to poor data quality, inadequate risk controls, and insufficient testing.
In this article, we present seven real-world incidents that demonstrate the risks of inadequate AI oversight and show why thorough AI testing isn't optional, but necessary.
1: Apple Card & Goldman Sachs: gender bias in financial services
The first incident on our list happened before the start of the current GenAI wave, in 2019. It's a critical reminder that AI failures in financial services have always carried severe consequences. Apple Card's machine-learning credit model reportedly assigned women significantly lower credit limits than men with similar financial profiles, triggering investigations by New York regulators and widespread public backlash.
Though no fine was publicly announced, the reputational damage was substantial. For financial institutions, this case highlights an uncomfortable truth: AI bias isn't a new problem introduced by generative models - it's been lurking in traditional ML systems for years. As GenAI adoption accelerates in banking and fintech, the stakes have only gotten higher. The lesson? Financial services have never had room for error with AI, and they certainly don't now.
2: Deloitte's wake-up call: AI hallucinations in a high-profile report
In what should serve as a wake-up call to every consulting firm, Deloitte Australia used Azure OpenAI to help produce a 237-page government report, only to deliver fabricated citations, non-existent academic papers, and misattributed court quotes. The result? Deloitte was forced to refund the client approximately A$440,000 but the reputational damage was far more substantial for the company.
What makes this case particularly interesting is that this wasn't a small startup experimenting with new technology. This was Deloitte, a global consulting giant with rigorous quality standards and extensive resources. Yet even they fell victim to AI hallucinations in a critical government report.
3: NYC's chatbot that told people to break the law
New York City launched an AI chatbot in 2024 to help small businesses navigate regulations. Instead, it became a legal hazard, advising landlords they could discriminate based on income source and telling employers they could pocket their workers' tips - both illegal under New York law.
Despite expert criticism, Mayor Eric Adams kept the chatbot online, even while acknowledging the answers were "wrong in some areas." The result: severe reputational damage, legal liability exposure, and eroded public trust. Let's just take a look at one very vivid example:
"At times, the bot’s answers veered into the absurd. Asked if a restaurant could serve cheese nibbled on by a rodent, it responded: “Yes, you can still serve the cheese to customers if it has rat bites,” before adding that it was important to assess “the extent of the damage caused by the rat” and to “inform customers about the situation"."
Legal and compliance use cases are particularly high-risk territory for GenAI. These domains require very precise, current knowledge of complex regulatory frameworks. An AI model that confidently dispenses illegal advice can embarrass your organization and potentially make you complicit in violations committed by users who trust that advice. For any business deploying genAI in legal, compliance, or regulatory contexts, this incident is a stark reminder: you cannot outsource these decisions to an AI without comprehensive oversight and testing.
4: How a routine update turned DPD's chatbot into a PR nightmare
In January 2024, DPD's customer service chatbot, previously functioning without issue, went spectacularly off the rails. It swore at customers, wrote poems calling DPD "the worst delivery firm in the world," and described itself as "useless at providing help." One customer coaxed it into declaring that "DPD was a waste of time and a customer's worst nightmare... One day, DPD was finally shut down, and everyone rejoiced."
The post went viral with 1.3 million views. DPD immediately shut down the AI feature, but the damage was done and a PR crisis had erupted.
It turns out that it was all due to a minor maintenance update that was inadequately tested. The real culprit was the failed guardrails - the safety mechanisms that should prevent AI models from generating inappropriate content.
For every organization deploying customer-facing GenAI, DPD's experience is a cautionary tale: guardrails aren't optional features to implement and validate "eventually." They're critical infrastructure that must be operational and rigorously tested before every deployment. One slip can ruin years of building brand credibility.
5: UnitedHealth's algorithm of denial
A 2023 class action lawsuit accused UnitedHealth of using AI to wrongfully deny insurance claims to seriously ill elderly Medicare Advantage patients. The lawsuit, still ongoing, alleged that the company deployed naviHealth's nH Predict AI model with a 90% error rate, overriding physicians' medical determinations and denying coverage for necessary care.
When AI makes coverage decisions, errors don't just cost money - they can claim lives. The algorithmic denial of necessary medical care to vulnerable populations represents AI failure at its most consequential. The regulatory scrutiny will only intensify as AI becomes more prevalent in these critical decisions.
6: Air Canada: when AI makes promises you have to keep
Air Canada's chatbot invented a bereavement fare discount policy that didn't exist, then confidently promised it to a customer. When the customer tried to claim the discount, Air Canada refused, until the Canadian court ruled the airline legally liable for its chatbot's hallucination.
Air Canada's defense - that the chatbot was "a separate legal entity responsible for its own actions" - was rejected by the court. The precedent is now clear: companies are responsible for what their AI tells customers.
This case crystallizes a crucial risk for any customer-facing GenAI deployment: your AI can create binding legal obligations you never intended. The court's ruling was for a small sum, but it establishes a dangerous precedent: an AI could just as easily fabricate policies, terms or guarantees worth millions. For QA teams, this means testing must go beyond accuracy to include legal obligation risk.
7: Virgin Money: when your chatbot censors your own brand name
In early 2025, Virgin Money's chatbot flagged the bank's own name as offensive language, mistakenly identifying "virgin" as inappropriate content. The semantic misinterpretation made headlines and represents a textbook example of GenAI failures.
Virgin Money's chatbot mishap illustrates why experienced testers are invaluable. Edge cases like this - where legitimate business terminology triggers false positives, require human intuition and thoroughness to uncover.
QA professionals with domain knowledge can anticipate these quirks: brand names with multiple meanings, industry jargon that might confuse content filters, cultural contexts that AI might misinterpret. Experienced human testers catch not only the obvious bugs, but also the weird ones that can become viral embarrassments.
The bottom line: AI testing isn't optional
These seven incidents share a common thread: they were all preventable. With proper testing protocols and domain expert review, each of these costly failures could have been caught before deployment.
Generative AI will absolutely transform how businesses operate - saving time, reducing costs, and enabling capabilities that were impossible just years ago. But these benefits only materialize when AI is deployed responsibly, with testing standards that match the technology's power.
As enforcement intensifies across jurisdictions, the question isn't whether regulatory consequences will escalate, but whose organization will become the next cautionary tale. In December 2024, for instance, Italy didn’t hesitate to fine OpenAI $15 million for GDPR violations, demonstrating that regulators are ready to impose substantial financial penalties.
In the years to come, regulations will tighten and user expectations will rise. The organizations that thrive will be those that embrace GenAI's potential while implementing the right oversight mechanisms to ensure it delivers value.
Avoid being in the headlines for all the wrong reasons. Test your AI systems today.