🏦 Private Financial Institution 4-week engagement · London

Critical Hallucination Risks in a Regulatory Disclosure LLM Identified Before Go-Live

The Situation

A private financial institution was preparing to deploy an LLM system to automate regulatory disclosures, drawing on legacy filing templates from multiple internal sources — many of which carried ambiguous table formatting built up over years of manual updates.

Internal QA had signed off the system. The model risk team wanted independent validation before the system went live for the Q1 2025 disclosure cycle, in line with their obligations to UK regulatory institutions.

What We Found

The four-week engagement combined RAGAS evaluation, DeepEval assessment, and adversarial stress tests written around the specific filing formats used. We found a class of hallucination that internal QA had missed: when source documents contained legacy table formatting — merged cells and footnote conventions common in UK regulatory filings — the model fabricated financial figures rather than flagging the ambiguity.

A single fabricated figure in a published regulatory disclosure would be a potentially material misstatement — with consequences for the institution and its standing with UK regulatory institutions well beyond the cost of the engagement.

What We Delivered

We produced a findings report with severity ratings, reproducible test cases, and remediation guidance. The engineering team resolved all findings before the Q1 2025 cycle. We also handed over a CI/CD regression suite covering the hallucination triggers we identified — it now runs automatically before each quarterly disclosure release. A follow-on engagement is planned for Q3 2025 to extend coverage to further disclosure types.
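The regression gate described above can be sketched in miniature. This is an illustrative sketch, not the delivered suite: the figure-extraction regex, the ambiguity phrases, and the sample table text are all assumptions. The idea is that a generated disclosure passes only if every figure it contains appears verbatim in the source document, or the model explicitly flags the ambiguity instead of guessing.

```python
import re

def extract_figures(text: str) -> set[str]:
    """Pull numeric figures (e.g. '4,210', '42%') out of text."""
    return set(re.findall(r"\d[\d,]*(?:\.\d+)?%?", text))

# Hypothetical phrases the model is expected to use when a table is ambiguous
AMBIGUITY_PHRASES = ("ambiguous", "unable to determine", "requires review")

def hallucination_check(source: str, answer: str) -> bool:
    """Pass if every figure in the answer appears verbatim in the source,
    or the answer flags the ambiguity rather than fabricating a value."""
    if any(p in answer.lower() for p in AMBIGUITY_PHRASES):
        return True
    return extract_figures(answer) <= extract_figures(source)

# A merged-cell table flattened to text: two candidate values for one line item
source = "Tier 1 capital | 4,210 | 4,380 (restated, see note 3)"
assert hallucination_check(source, "Tier 1 capital was 4,210.")      # grounded
assert not hallucination_check(source, "Tier 1 capital was 4,295.")  # fabricated
assert hallucination_check(source, "Figure is ambiguous; requires review.")
```

In a real pipeline each reproducible test case becomes one such source/answer pair, and the suite fails the release if any fabricated figure slips through.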

"The PGN Limited team found failure modes our internal testing had missed entirely. Going live with no issues in Q1 — and having automated regression before every release — gave our model risk function and auditors exactly the evidence they needed."
Head of Model Risk, Private Financial Institution (anonymised)
Tools Used
RAGAS · DeepEval · LangSmith · MLflow · Azure OpenAI (UK South) · Custom UK Banking Test Suite
Regulatory Frameworks
UK regulatory framework · UK Regulatory Institutions · Regulatory Accountability · Regulatory Accounting
Primary Outcome
Zero issues at go-live
Q1 2025 regulatory disclosure cycle completed with no hallucination incidents. Internal audit signed off.
Engagement length
4 weeks
Discovery → testing → report → remediation support → CI/CD handover
Regulatory framework
UK regulatory framework
Model risk management alignment — evidence accepted by internal audit
Ongoing coverage
CI/CD Regression
Automated suite runs before every quarterly disclosure cycle
Services used
Performance + Resilience + Eval Framework
Three service lines in a single engagement
🏥 Public Sector Healthcare 3-week engagement · Midlands

Clinical Decision-Support LLM Passed Internal QA but Failed Under Peak Production Load

The Situation

A public sector healthcare organisation's digital programme team was deploying an LLM triage assistant, integrated with their Electronic Patient Record system. The system was designed to pre-classify patient queries, suggest triage pathways, and surface relevant clinical guidelines before handover to clinical staff.

The system had cleared all internal functional tests. The programme team was concerned about performance at peak — the Monday morning appointment surge and the winter demand spike — and needed to navigate the UK sector regulator's guidance on AI as a Medical Device classification before go-live.

What We Found

Performance testing under simulated Monday morning peak load revealed two issues. First, response quality — measured by clinical guideline faithfulness and triage pathway accuracy — dropped by 31% once concurrent users rose beyond the midweek baseline. Second, P99 latency breached the organisation's internal SLA thresholds. A triage assistant that degrades precisely at peak demand is the worst possible outcome in a clinical setting.

Neither failure was visible in functional testing, which had been run at low concurrency. Both were infrastructure and caching problems, not model problems — fully fixable without touching the model itself.
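Why low-concurrency functional tests miss this can be shown with a toy load probe. Everything here is simulated — the latency model, the 50-user contention threshold, and the 1-second SLA are illustrative assumptions — but the shape of the finding is real: a system can look healthy at midweek concurrency and breach its P99 SLA at peak.

```python
import concurrent.futures
import random

def p99(samples: list[float]) -> float:
    """99th-percentile latency, nearest-rank method."""
    ordered = sorted(samples)
    return ordered[max(0, int(len(ordered) * 0.99) - 1)]

def simulated_call(concurrency: int) -> float:
    """Hypothetical per-request latency: flat at low load, then queueing
    delay grows once concurrency exceeds the cache's comfortable range."""
    base = 0.4                                       # seconds
    contention = 0.02 * max(0, concurrency - 50)     # queueing above ~50 users
    return base + contention * random.random()

def run_load(concurrency: int, requests: int = 500) -> float:
    """Fire `requests` calls at the given concurrency, return P99 latency."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=min(concurrency, 64)) as ex:
        latencies = list(ex.map(lambda _: simulated_call(concurrency), range(requests)))
    return p99(latencies)

assert run_load(10) < 1.0    # midweek baseline: comfortably inside a 1s SLA
assert run_load(400) > 1.0   # Monday-morning peak: P99 breaches the same SLA
```

The same probe run only at concurrency 10 — which is effectively what functional QA did — would never surface the breach.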

What We Delivered

We revised the caching and fallback architecture and re-validated the system under the corrected configuration, confirming latency SLAs and quality thresholds held at projected peak load. We also produced the technical documentation needed for the sector compliance standards submission and a written assessment supporting the sector regulatory standards classification decision. The clinical go-live proceeded on schedule.

Tools Used
k6 · DeepEval · Arize Phoenix · AWS Bedrock (eu-west-2) · Healthcare Clinical Guidelines Dataset
Regulatory Frameworks
Sector compliance standards · Sector regulatory standards · UK Regulatory Institutions · Sector regulator standards
Primary Outcome
On-schedule go-live
Clinical deployment proceeded on schedule with sector compliance standards and UK sector regulator documentation complete.
Engagement length
3 weeks
Key finding
31% quality drop
Under peak concurrent load vs. test conditions — invisible to functional QA
Regulatory frameworks
sector compliance standards + sector regulatory standards
🛒 Private Sector Retail 2-week engagement · Remote

Multi-Agent Shopping Assistant Pipeline Stalled Completely Under Simulated Black Friday Load

The Situation

A large private sector retail organisation had deployed a four-agent LLM shopping assistant — product discovery, personalisation, stock availability, and checkout support — ahead of peak trading season. The system had performed well in standard testing. The engineering team wanted independent validation before Black Friday.

What We Found

During chaos testing simulating Black Friday traffic, injecting a primary API outage caused the multi-agent pipeline to stall completely. No fallback activated, no graceful degradation, no user-facing message. Customers would have faced a non-responsive assistant with no explanation during the highest-revenue trading period of the year.

The incident logging configuration also had a gap: a primary API failure of this type would not automatically have generated a log entry reportable to the UK data protection authority — a potential UK GDPR compliance issue if a data-related failure coincided with a provider outage.

What We Delivered

We designed a recovery architecture with fallback routing to a secondary provider, context preservation across agent transitions, and incident logging compliant with UK data protection authority requirements. The revised architecture was validated in a 14-hour chaos test at Black Friday load. The organisation went into peak trading season with confirmed recovery times within SLA and a compliant logging posture.
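The three properties delivered — fallback routing, context preservation, and automatic incident logging — can be sketched together. This is a minimal illustration under assumed names (the provider functions, the basket field, and the log format are all hypothetical), not the production architecture.

```python
class ProviderDown(Exception):
    pass

def flaky_primary(prompt: str, context: dict) -> str:
    raise ProviderDown("primary API outage")  # injected chaos fault

def secondary(prompt: str, context: dict) -> str:
    return f"[secondary] {prompt} (basket={context['basket']})"

def call_with_fallback(prompt: str, context: dict, providers) -> str:
    """Try each provider in order. The context dict (basket, agent state)
    travels with the request, so nothing is lost on failover, and every
    failure is recorded for the incident log before moving on."""
    incident_log = context.setdefault("incidents", [])
    for provider in providers:
        try:
            return provider(prompt, context)
        except ProviderDown as exc:
            incident_log.append({"provider": provider.__name__, "error": str(exc)})
    return "Assistant temporarily unavailable - please try again."  # graceful degradation

ctx = {"basket": ["sku-123"]}
reply = call_with_fallback("any stock in size M?", ctx, [flaky_primary, secondary])
assert reply.startswith("[secondary]")                      # fallback activated
assert ctx["basket"] == ["sku-123"]                         # context preserved
assert ctx["incidents"][0]["provider"] == "flaky_primary"   # loggable incident entry
```

The stalled pipeline found in testing is the case where the provider list has length one and the final graceful-degradation message does not exist.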

Tools Used
Locust · Helicone · Chaos Toolkit · AWS Bedrock (eu-west-2) · Custom Agent Test Harness
Regulatory Frameworks
UK Regulatory · UK GDPR · Incident Logging
Primary Outcome
Zero context lost
Full context preservation across agent transitions during provider outage at Black Friday scale.
Recovery time
< 400ms
RTO confirmed within SLA — fallback active before users notice disruption
Compliance outcome
UK data protection authority logging
Automated UK data protection authority-compliant incident log entries on every provider failure event
⚖️ Private Sector Legal Services 3-week engagement · Remote

Contract Review LLM Showed Significant Inconsistency Across English Law Clause Variations

The Situation

A private sector legal services organisation was preparing a model update ahead of onboarding new clients. Their legal advisory board had raised concerns about AI use guidance from UK regulatory institutions and whether the model's outputs were consistent enough for use in regulated legal workflows.

What We Found

Red team testing ran hundreds of semantically equivalent clause variations reflecting English, Welsh, and Scots law drafting conventions — 'Schedule' vs 'Annex', 'Purchaser' vs 'Buyer', 'governed by the laws of England and Wales' vs 'governed by the laws of Scotland', 'warranty' vs 'representation', 'reasonable endeavours' vs 'best endeavours'. The model produced materially inconsistent risk ratings across clauses that differed only in standard legal phrasing.

For a legal product, that level of inconsistency is a real liability. A clause rated high-risk in one phrasing and medium-risk in an equivalent phrasing produces inconsistent advice, with professional indemnity implications under UK regulatory institution guidance.

What We Delivered

We produced a consistency audit report with statistical analysis of rating variance across phrasing groups and a root cause assessment. We then built a consistency regression suite covering all identified phrasing variation groups, now integrated into the release pipeline. Every model update is tested for consistency regressions before deployment.
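The core of such a consistency gate is small. This sketch uses invented clause fragments and a stand-in rating function — the real suite rates clauses with the model under test — but the logic is the same: every group of equivalent phrasings must receive exactly one risk rating, and any group that splits fails the release.

```python
from collections import Counter

# Hypothetical equivalence groups: each inner list should get ONE rating
EQUIVALENT_CLAUSES = [
    ["The Purchaser shall indemnify the Seller ...",
     "The Buyer shall indemnify the Seller ..."],
    ["... as set out in the Schedule ...",
     "... as set out in the Annex ..."],
]

def rate_clause(clause: str) -> str:
    """Stand-in for the contract-review model; returns a risk rating."""
    return "high" if "indemnify" in clause else "medium"

def consistency_failures(groups, rate) -> list[Counter]:
    """Return the rating distribution of every group whose members
    did not all receive the same rating."""
    failures = []
    for group in groups:
        ratings = Counter(rate(c) for c in group)
        if len(ratings) > 1:
            failures.append(ratings)
    return failures

# A consistent model passes the gate ...
assert consistency_failures(EQUIVALENT_CLAUSES, rate_clause) == []

# ... while a model that rates 'Purchaser' and 'Buyer' differently fails it
inconsistent = lambda c: "high" if "Purchaser" in c else "medium"
assert len(consistency_failures(EQUIVALENT_CLAUSES, inconsistent)) == 1
```

The statistical analysis in the audit report then summarises rating variance per group rather than a simple pass/fail, but the gate above is what runs on every model update.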

Tools Used
DeepEval · Promptfoo · LangSmith · Custom English Law Clause Dataset
Regulatory Frameworks
UK Regulatory Guidance · UK Regulatory Institutions · Equality Act 2010
Primary Outcome
CI/CD regression live
Consistency regression suite running on every model update before deployment.
Inconsistency identified
23%
Rating variance across equivalent phrasing groups — invisible to functional QA
Regulatory alignment
UK Regulatory
Test suite structured around UK regulatory institution guidance for legal services AI deployments
Cross-Engagement Insights

What We See
Across Engagements

Finding 01

Internal QA Misses Production Failure Modes

Standard functional testing at low concurrency consistently fails to catch the quality degradation, latency breaches, and hallucination triggers that appear under realistic production load. In every engagement to date, we have found issues that internal QA did not.

Finding 02

Pipeline Failures Are More Common Than Model Failures

Most of the significant issues we find are not model hallucinations in isolation. They are pipeline failures: missing fallback chains, poor context preservation, incorrect caching, absent incident logging. Infrastructure problems with infrastructure solutions.

Finding 03

Regulatory Evidence Is Usually an Afterthought

Most organisations reach us with their system nearly ready for go-live, without yet considering what audit trail their regulator will expect to see. Building compliance evidence into the testing process from the start costs far less than retrofitting it afterwards.

Ready to Start?

Tell us about your LLM deployment and we'll identify your three highest-risk areas within two working days — no cost, no obligation, NDA first.