Anonymised case studies from real UK engagements across financial services, healthcare, retail, legal, and government sectors. NDA signed before all technical discussions.
A private financial institution was preparing to deploy an LLM system to automate regulatory disclosures, drawing on legacy filing templates from multiple internal sources — many of which carried ambiguous table formatting built up over years of manual updates.
Internal QA had signed off the system. The model risk team wanted independent validation before the system went live for the Q1 2025 disclosure cycle, in line with their obligations to UK regulatory institutions.
The four-week engagement combined RAGAS evaluation, DeepEval assessment, and adversarial stress tests written around the specific filing formats used. We found a class of hallucination that internal QA had missed: when source documents contained legacy table formatting — merged cells and footnote conventions common in UK regulatory filings — the model fabricated financial figures rather than flagging the ambiguity.
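The fabrication class above can be caught mechanically: compare the numeric figures in a model answer against those in the source extract, and treat any figure that appears only in the answer as a potential fabrication. The sketch below is illustrative only; `extract_figures`, the regex, and the sample table text are our own simplifications, not the client's actual filing format.

```python
import re

def extract_figures(text: str) -> set[str]:
    """Pull numeric figures (e.g. 1,250 or 12.5%) out of free text."""
    return set(re.findall(r"\d[\d,]*(?:\.\d+)?%?", text))

def fabricated_figures(source: str, answer: str) -> set[str]:
    """Figures present in the answer but absent from the source extract:
    the signature of the merged-cell hallucination class."""
    return extract_figures(answer) - extract_figures(source)

# Ambiguous source: a merged cell spreads one figure across two periods.
source = "Net fees | 1,250 (merged across FY2023/FY2024) |"
good = "The FY2024 net fee figure is ambiguous in the source; flag for review."
bad = "Net fees were 1,250 in FY2023 and 1,310 in FY2024."

assert fabricated_figures(source, good) == set()      # flags, invents nothing
assert fabricated_figures(source, bad) == {"1,310"}   # fabricated figure caught
```

A check of this shape is deliberately conservative: it cannot judge whether a figure is *correct*, only whether it has any grounding in the source at all, which is exactly the property that failed here.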
A single fabricated figure in a published regulatory disclosure would be a potentially material misstatement — with consequences for the institution and its standing with UK regulatory institutions well beyond the cost of the engagement.
We produced a findings report with severity ratings, reproducible test cases, and remediation guidance. The engineering team resolved all findings before the Q1 2025 cycle. We also handed over a CI/CD regression suite covering the hallucination triggers we identified — it now runs automatically before each quarterly disclosure release. A follow-on engagement is planned for Q3 2025 to extend coverage to further disclosure types.
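The regression suite follows a simple release-gate shape: each known hallucination trigger is stored as a fixture stating whether the model must flag ambiguity, and the gate fails the release if any case regresses. Everything below is a stand-in sketch; the trigger names, fixture format, and `model_output` stub are hypothetical, not the client's suite.

```python
# Each fixture: (source extract, must the model flag it as ambiguous?)
TRIGGERS = {
    "merged_cells": ("Net fees | 1,250 (merged) |", True),
    "plain_table":  ("Net fees | 1,250 | 1,310 |", False),
}

def model_output(source: str) -> str:
    # Stub: the remediated model flags ambiguity instead of inventing figures.
    return "AMBIGUOUS: manual review" if "(merged)" in source else "1,310"

def release_gate() -> list[str]:
    """Return the names of trigger cases that regressed; empty means ship."""
    failures = []
    for name, (source, must_flag) in TRIGGERS.items():
        flagged = model_output(source).startswith("AMBIGUOUS")
        if flagged != must_flag:
            failures.append(name)
    return failures

assert release_gate() == []  # empty list: safe to release
```

In CI this runs as an ordinary test step before each quarterly release is tagged, so a model or prompt change that re-introduces a known trigger blocks the pipeline rather than reaching a published disclosure.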
A public sector healthcare organisation's digital programme team was deploying an LLM triage assistant, integrated with their Electronic Patient Record system. The system was designed to pre-classify patient queries, suggest triage pathways, and surface relevant clinical guidelines before handover to clinical staff.
The system had cleared all internal functional tests. The programme team was concerned about performance at peak — the Monday morning appointment surge and the winter demand spike — and needed to navigate UK sector regulator guidance on AI as a Medical Device classification before go-live.
Performance testing under simulated Monday morning peak load revealed two issues. First, response quality, measured by clinical guideline faithfulness and triage pathway accuracy, dropped significantly once concurrency rose beyond the midweek baseline. Second, P99 latency breached the organisation's internal SLA thresholds. A triage assistant that degrades precisely at peak demand is the worst possible outcome in a clinical setting.
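The key measurement is tail latency as a function of concurrency, which low-concurrency functional tests never exercise. A minimal sketch of that concurrency sweep is below; `triage` is a hypothetical stub whose delay grows with load to mimic the cache-contention behaviour we observed, not the client's system.

```python
import asyncio
import statistics
import time

async def triage(query: str, concurrency: int) -> str:
    # Stub assistant: latency degrades as concurrent load rises,
    # standing in for the cache-miss behaviour seen at peak.
    await asyncio.sleep(0.001 * concurrency)
    return "pathway: same-day GP review"

async def p99_latency(concurrency: int, requests: int = 200) -> float:
    """Fire `requests` concurrent calls and return the P99 latency."""
    async def timed(i: int) -> float:
        t0 = time.perf_counter()
        await triage(f"query {i}", concurrency)
        return time.perf_counter() - t0
    latencies = await asyncio.gather(*(timed(i) for i in range(requests)))
    return statistics.quantiles(latencies, n=100)[98]  # 99th percentile

baseline = asyncio.run(p99_latency(concurrency=5))
peak = asyncio.run(p99_latency(concurrency=80))
assert peak > baseline  # the degradation only appears under load
```

Sweeping concurrency from baseline to projected peak, and recording quality metrics alongside latency at each step, is what made both failure modes visible when single-request functional tests could not.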
Neither failure was visible in functional testing, which had been run at low concurrency. Both were infrastructure and caching problems, not model problems — fully fixable without touching the model itself.
We revised the caching and fallback architecture and re-validated the system under the corrected configuration, confirming latency SLAs and quality thresholds held at projected peak load. We also produced the technical documentation needed for the sector compliance standards submission and a written assessment supporting the sector regulatory standards classification decision. The clinical go-live proceeded on schedule.
A large private sector retail organisation had deployed a four-agent LLM shopping assistant — product discovery, personalisation, stock availability, and checkout support — ahead of peak trading season. The system had performed well in standard testing. The engineering team wanted independent validation before Black Friday.
During chaos testing simulating Black Friday traffic, injecting a primary API outage caused the multi-agent pipeline to stall completely. No fallback activated, no graceful degradation, no user-facing message. Customers would have faced a non-responsive assistant with no explanation during the highest-revenue trading period of the year.
The incident logging configuration also had a gap: a primary API failure of this type would not automatically have generated a log entry reportable to the UK data protection authority — a potential UK GDPR compliance issue if a data-related failure coincided with a provider outage.
We designed a recovery architecture with fallback routing to a secondary provider, context preservation across agent transitions, and incident logging that satisfies UK data protection authority reporting requirements. The revised architecture was validated across a 14-hour chaos test at Black Friday load. The organisation went into peak trading season with confirmed recovery times within SLA and a compliant logging posture.
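The core of the pattern is that conversation context travels with the request, so a mid-pipeline failover to the secondary provider does not lose the state accumulated by earlier agents. The sketch below stubs both providers; `Context`, `route`, and the provider functions are illustrative names, not the client's codebase.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    """Conversation state carried across agent transitions."""
    history: list[str] = field(default_factory=list)

class ProviderDown(Exception):
    pass

def primary(prompt: str, ctx: Context) -> str:
    raise ProviderDown("simulated primary API outage")

def secondary(prompt: str, ctx: Context) -> str:
    return f"[fallback] answered with {len(ctx.history)} turns of context"

def route(prompt: str, ctx: Context) -> str:
    ctx.history.append(prompt)
    try:
        return primary(prompt, ctx)
    except ProviderDown:
        # Real system: emit a reportable incident log entry here,
        # then degrade gracefully rather than stalling the pipeline.
        return secondary(prompt, ctx)

ctx = Context()
route("is this jacket in stock?", ctx)
reply = route("what about in blue?", ctx)
assert "2 turns" in reply  # context survived the failover
```

The try/except boundary is also the natural place to hook incident logging, which is how the logging gap and the fallback gap were closed in one architectural change.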
A private sector legal services organisation was preparing a model update ahead of onboarding new clients. Their legal advisory board had raised concerns about AI use guidance from UK regulatory institutions and whether the model's outputs were consistent enough for use in regulated legal workflows.
Red team testing used semantically equivalent clauses across hundreds of variations reflecting English, Welsh, and Scots law conventions — Schedule vs Annex, Purchaser vs Buyer, "governed by English and Welsh law" vs "governed by the laws of Scotland", warranty vs representation, reasonable endeavours vs best endeavours. The model produced materially inconsistent risk ratings across clauses that differed only in standard legal phrasing.
For a legal product, that level of inconsistency is a real liability. A clause rated high-risk in one phrasing and medium-risk in an equivalent phrasing produces inconsistent advice, with professional indemnity implications under UK regulatory institution guidance.
We produced a consistency audit report with statistical analysis of rating variance across phrasing groups and a root cause assessment. We then built a consistency regression suite covering all identified phrasing variation groups, now integrated into the release pipeline. Every model update is tested for consistency regressions before deployment.
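The consistency check itself reduces to a simple invariant: every clause in a variation group is semantically equivalent, so any spread in risk ratings across the group is a regression. The sketch below uses a deliberately broken stub rater that keys off surface phrasing, to show the failure mode being measured; `rate_clause` and the sample clauses are hypothetical, not the client's model or corpus.

```python
RATINGS = {"low": 0, "medium": 1, "high": 2}

def rate_clause(clause: str) -> str:
    # Stub rater that (incorrectly) keys off surface phrasing,
    # reproducing the inconsistency class found in red teaming.
    return "high" if "Purchaser" in clause else "medium"

# One variation group: identical meaning, different drafting convention.
GROUP = [
    "The Purchaser shall indemnify the Seller against all losses.",
    "The Buyer shall indemnify the Seller against all losses.",
]

def rating_spread(group: list[str]) -> int:
    """Max minus min rating across a group; non-zero means inconsistency."""
    scores = [RATINGS[rate_clause(c)] for c in group]
    return max(scores) - min(scores)

assert rating_spread(GROUP) == 1  # the inconsistency the suite must flag
```

In the release pipeline the suite asserts `rating_spread(group) == 0` for every phrasing group, so a model update that rates Purchaser and Buyer clauses differently fails before deployment.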
Standard functional testing at low concurrency consistently fails to catch the quality degradation, latency breaches, and hallucination triggers that appear under realistic production load. In every engagement to date, we have found issues that internal QA did not.
Most of the significant issues we find are not model hallucinations in isolation. They are pipeline failures: missing fallback chains, poor context preservation, incorrect caching, absent incident logging. Infrastructure problems with infrastructure solutions.
Most organisations reach us with their system nearly ready for go-live, without yet considering what audit trail their regulator will expect to see. Building compliance evidence into the testing process from the start costs far less than retrofitting it afterwards.
Tell us about your LLM deployment and we'll identify your three highest-risk areas within two working days — no cost, no obligation, NDA first.