01 — Performance Testing

LLM Performance Testing

Standard functional testing validates that your model produces the correct output — but it does not tell you whether it will still produce the correct output when 500 concurrent users hit it simultaneously, or whether response latency will breach your SLA thresholds during peak demand.

We measure latency distribution (P50, P95, P99), quality degradation under load, token efficiency, and throughput ceiling — benchmarked against UK sector-specific SLA expectations and your own production traffic patterns.

Deliverables
Latency distribution report (P50/P95/P99) at production load profiles
Quality degradation curve — measuring accuracy against concurrency
Token efficiency and cost analysis across load scenarios
SLA breach risk assessment with sector-specific benchmarks
k6 load test scripts, ready for your CI/CD pipeline
k6 Locust DeepEval Helicone Azure OpenAI (UK South) AWS Bedrock (eu-west-2)
UK Regulatory Institutions Industry SLA Standards
latency_benchmark.py — live run
$ pgn run --profile uk-financial --concurrency 500
# Warming up — 100 req baseline
P50 312ms
P95 1,847ms
P99 6,203ms ⚠ SLA breach
# Quality degradation at 500 concurrent
Accuracy 68.4% (baseline: 94.1%)
SectorP95 SLAP99 SLA
Financial Services< 2,000ms< 5,000ms
Healthcare< 1,500ms< 3,000ms
Legal Services< 3,000ms< 8,000ms
02 — Resilience & Red Team

LLM Resilience & Red Team Testing

Your model may perform flawlessly in normal operation — but what happens when an adversarial user deliberately tries to manipulate it? Prompt injection, jailbreaks, data exfiltration attempts, and social engineering attacks are active threats for any publicly accessible LLM system.

Our red team testing is structured against the OWASP LLM Top 10, with sector-specific attack scenarios designed around the actual threat landscape for UK financial services, healthcare, and legal applications.

Deliverables
Full OWASP LLM Top 10 coverage report with findings by severity
Prompt injection and jailbreak attempt log with reproduction steps
Hallucination audit across domain-specific factual scenarios
Bias assessment across protected characteristics (Equality Act 2010)
Adversarial test suite for ongoing regression in CI/CD
Promptfoo Garak DeepEval LangSmith
OWASP LLM Top 10 UK Regulatory Institutions Equality Act 2010
OWASP LLM Top 10 Coverage
LLM01 — Prompt Injection
LLM02 — Insecure Output Handling
LLM03 — Training Data Poisoning
LLM04 — Model Denial of Service
LLM05 — Supply Chain Vulnerabilities
LLM06 — Sensitive Info Disclosure
LLM07 — Insecure Plugin Design
LLM08 — Excessive Agency
LLM09 — Overreliance
LLM10 — Model Theft
03 — Recovery & Continuity

Recovery & Continuity Testing

When your primary LLM provider goes down — and all major providers have experienced outages — does your system gracefully degrade, activate a fallback, and recover with no context loss? Or does it stall completely, leaving users with a non-responsive experience at exactly the wrong moment?

We run structured chaos engineering tests simulating provider outages, network degradation, and context window overflow. We validate your fallback chain, test recovery time objectives against your SLA, and verify compliant incident logging is in place.

Deliverables
Chaos engineering report — fallback activation, context preservation, RTO
Fallback architecture assessment with remediation recommendations
Compliant incident logging configuration review
DR validation at production load profile with recovery time measurement
Continuity test harness for periodic DR validation
Chaos Toolkit Locust Helicone Arize Phoenix AWS Bedrock (eu-west-2)
UK Regulatory Institutions Industry SLA Standards
chaos_test.py — provider outage simulation
$ pgn chaos --inject provider-outage
T+0ms Primary API outage injected
T+120ms Health check failed
T+340ms Fallback activated → Bedrock eu-west-2
T+490ms Context preserved — no data loss
T+510ms Incident log entry created
RTO: 340ms ✓ within SLA (<500ms)
04 — RAG Pipeline Testing

RAG Pipeline Testing

Retrieval-augmented generation pipelines introduce a layer of risk that model testing alone cannot address. Poor retrieval quality, source attribution errors, context window mismanagement, and knowledge cutoff issues can all cause your RAG system to produce plausible-sounding but factually incorrect answers.

We run end-to-end RAG pipeline evaluations using RAGAS — measuring faithfulness, context precision, answer relevancy, and hallucination rate — with domain-specific ground-truth datasets across UK industry verticals.

Deliverables
RAGAS evaluation report — faithfulness, context precision, answer relevancy
Retrieval quality analysis — hit rate, MRR, false retrieval patterns
Source attribution audit — hallucinated vs. retrieved vs. fabricated
Context window overflow testing at document scale
Automated RAG regression suite for CI/CD integration
RAGAS DeepEval LangSmith Arize Phoenix OpenAI Evals
UK Regulatory Institutions OWASP LLM Top 10
RAGAS evaluation results
MetricScoreThreshold
Faithfulness0.91≥ 0.85
Answer Relevancy0.87≥ 0.80
Context Precision0.73≥ 0.80
Context Recall0.89≥ 0.85
Hallucination Rate12.4%≤ 3%
05 — Evaluation Framework Design

Evaluation Framework Design

A one-time test engagement is not enough for a production LLM system. You need the ability to continuously monitor quality as your model, prompts, and data evolve. We design and build evaluation frameworks that your team can run and own independently — integrated into your CI/CD pipeline with full documentation and training.

Our frameworks combine automated LLM-as-judge scoring for high-throughput regression, human-in-the-loop review for critical edge cases, and MLflow-tracked experiment history that provides a defensible audit trail over time.

Deliverables
LLM-as-judge evaluation suite — prompt templates, scoring rubrics, calibration data
CI/CD regression integration — GitHub Actions or your preferred pipeline
MLflow experiment tracking configuration with audit trail
Domain-specific test dataset — built around your regulatory context
Knowledge transfer session and full documentation for your team
MLflow DeepEval Weights & Biases OpenAI Evals GitHub Actions
UK Regulatory Institutions Model Risk Standards
eval_framework.yaml — architecture
Automated Layer

RAGAS + DeepEval — runs on every PR merge. Flags regressions automatically.

LLM-as-Judge

GPT-4o evaluator with calibrated rubrics. Runs on 10% sample + all flagged cases.

Human Review

Domain expert review queue for high-risk outputs. Feeds calibration dataset.

Audit Trail

MLflow tracks all runs, scores, and human decisions. Full audit trail history.

06 — AI Governance & Regulatory

AI Governance & Regulatory Readiness

Most organisations approach governance as an afterthought — assembling evidence after the system is built, rather than building evidence generation into the testing process from the start. Retrofitting an audit trail for UK regulatory institutions is costly and often incomplete.

We work alongside your legal, compliance, and technical teams to produce the specific documentation, bias audits, risk classifications, and audit-ready evidence packs your regulator will expect to see.

Deliverables
Bias audit report aligned to Equality Act 2010 and UK fairness standards
Risk classification assessment aligned to UK regulatory institution expectations
Model validation evidence pack for internal audit and regulatory review
Technical security measures documentation for sector-specific compliance
AI use compliance assessment for regulated sector deployments
MLflow Weights & Biases LangSmith Custom UK DPIA Templates
UK Regulatory Institutions OWASP LLM Top 10 Equality Act 2010
regulatory_coverage_matrix
FrameworkCoverage
OWASP LLM Top 10Full coverage
Performance & LoadFull coverage
Resilience & Red TeamFull coverage
RAG PipelineFull coverage
Bias & FairnessFull coverage
UK RegulatoryFull coverage
FAQ

Common Questions

Most engagements run two to six weeks, depending on scope. A focused performance or resilience test for a single deployment is typically two to three weeks. A comprehensive multi-service engagement covering performance, resilience, RAG testing, and governance documentation runs four to six weeks. We agree a fixed timeline and scope before the engagement begins.
We prefer to test in a staging environment that mirrors production closely — with production traffic data replayed, not live user traffic. We can work entirely within your infrastructure (including air-gapped or private-cloud environments) or through secure access arrangements agreed under NDA. All testing is conducted under UK GDPR with data residency options within Azure UK South or AWS eu-west-2.
We are model-agnostic. We have tested GPT-4o and GPT-4 Turbo via Azure OpenAI (UK South), Claude 3.5 Sonnet and Claude 3 Opus via AWS Bedrock (eu-west-2), Gemini 1.5 Pro via Vertex AI (europe-west2), and open-weight models including Llama 3, Mistral 7B, and Falcon 40B on private UK-hosted infrastructure. If your model is not listed, get in touch.
We sign an NDA before any technical discussion. All testing is conducted under UK GDPR and Data Protection Act 2018. We work within your chosen infrastructure — we do not require access to client or patient data for testing; we use synthetic data generation and representative test datasets. Where real data must be used, we agree data handling protocols in writing before the engagement begins.
Yes — knowledge transfer is included in every engagement. We deliver CI/CD-ready test suites with full documentation, runbooks, and a training session for your team. We do not build dependencies on our continued involvement. You will own the test infrastructure entirely.
We produce structured evidence packs tailored to the specific regulatory audience — covering model validation reports, independent challenge documentation, bias assessment reports, technical security measures evidence, and audit-ready test logs. Each pack is designed to answer the specific questions UK regulatory institutions and internal auditors will ask.

Not Sure Which Service
You Need?

Tell us about your LLM deployment and we'll identify your highest-risk areas within two working days — at no cost, under NDA.