Services — PGN Limited

01 — Performance Testing

LLM Performance Testing

Standard functional testing validates that your model produces the correct output — but it does not tell you whether it will still produce the correct output when 500 concurrent users hit it simultaneously, or whether response latency will breach your SLA thresholds during peak demand.

We measure latency distribution (P50, P95, P99), quality degradation under load, token efficiency, and throughput ceiling — benchmarked against UK sector-specific SLA expectations and your own production traffic patterns.

Deliverables

Latency distribution report (P50/P95/P99) at production load profiles

Quality degradation curve — measuring accuracy against concurrency

Token efficiency and cost analysis across load scenarios

SLA breach risk assessment with sector-specific benchmarks

k6 load test scripts, ready for your CI/CD pipeline

k6 Locust DeepEval Helicone Azure OpenAI (UK South) AWS Bedrock (eu-west-2)

UK Regulatory Institutions Industry SLA Standards

latency_benchmark.py — live run

$ pgn run --profile uk-financial --concurrency 500

# Warming up — 100 req baseline

✓ P50 → 312ms

✓ P95 → 1,847ms

✗ P99 → 6,203ms ⚠ SLA breach

# Quality degradation at 500 concurrent

✗ Accuracy → 68.4% (baseline: 94.1%)

Sector	P95 SLA	P99 SLA
Financial Services	< 2,000ms	< 5,000ms
Healthcare	< 1,500ms	< 3,000ms
Legal Services	< 3,000ms	< 8,000ms

02 — Resilience & Red Team

LLM Resilience & Red Team Testing

Your model may perform flawlessly in normal operation — but what happens when an adversarial user deliberately tries to manipulate it? Prompt injection, jailbreaks, data exfiltration attempts, and social engineering attacks are active threats for any publicly accessible LLM system.

Our red team testing is structured against the OWASP LLM Top 10, with sector-specific attack scenarios designed around the actual threat landscape for UK financial services, healthcare, and legal applications.

Deliverables

Full OWASP LLM Top 10 coverage report with findings by severity

Prompt injection and jailbreak attempt log with reproduction steps

Hallucination audit across domain-specific factual scenarios

Bias assessment across protected characteristics (Equality Act 2010)

Adversarial test suite for ongoing regression in CI/CD

Promptfoo Garak DeepEval LangSmith

OWASP LLM Top 10 UK Regulatory Institutions Equality Act 2010

OWASP LLM Top 10 Coverage

LLM01 — Prompt Injection✓

LLM02 — Insecure Output Handling✓

LLM03 — Training Data Poisoning✓

LLM04 — Model Denial of Service✓

LLM05 — Supply Chain Vulnerabilities✓

LLM06 — Sensitive Info Disclosure✓

LLM07 — Insecure Plugin Design✓

LLM08 — Excessive Agency✓

LLM09 — Overreliance✓

LLM10 — Model Theft✓

03 — Recovery & Continuity

Recovery & Continuity Testing

When your primary LLM provider goes down — and all major providers have experienced outages — does your system gracefully degrade, activate a fallback, and recover with no context loss? Or does it stall completely, leaving users with a non-responsive experience at exactly the wrong moment?

We run structured chaos engineering tests simulating provider outages, network degradation, and context window overflow. We validate your fallback chain, test recovery time objectives against your SLA, and verify compliant incident logging is in place.

Deliverables

Chaos engineering report — fallback activation, context preservation, RTO

Fallback architecture assessment with remediation recommendations

Compliant incident logging configuration review

DR validation at production load profile with recovery time measurement

Continuity test harness for periodic DR validation

Chaos Toolkit Locust Helicone Arize Phoenix AWS Bedrock (eu-west-2)

UK Regulatory Institutions Industry SLA Standards

chaos_test.py — provider outage simulation

$ pgn chaos --inject provider-outage

⚠ T+0ms Primary API outage injected

⚠ T+120ms Health check failed

✓ T+340ms Fallback activated → Bedrock eu-west-2

✓ T+490ms Context preserved — no data loss

✓ T+510ms Incident log entry created

RTO: 340ms ✓ within SLA (<500ms)

04 — RAG Pipeline Testing

RAG Pipeline Testing

Retrieval-augmented generation pipelines introduce a layer of risk that model testing alone cannot address. Poor retrieval quality, source attribution errors, context window mismanagement, and knowledge cutoff issues can all cause your RAG system to produce plausible-sounding but factually incorrect answers.

We run end-to-end RAG pipeline evaluations using RAGAS — measuring faithfulness, context precision, answer relevancy, and hallucination rate — with domain-specific ground-truth datasets across UK industry verticals.

Deliverables

RAGAS evaluation report — faithfulness, context precision, answer relevancy

Retrieval quality analysis — hit rate, MRR, false retrieval patterns

Source attribution audit — hallucinated vs. retrieved vs. fabricated

Context window overflow testing at document scale

Automated RAG regression suite for CI/CD integration

RAGAS DeepEval LangSmith Arize Phoenix OpenAI Evals

UK Regulatory Institutions OWASP LLM Top 10

RAGAS evaluation results

Metric	Score	Threshold
Faithfulness	0.91	≥ 0.85
Answer Relevancy	0.87	≥ 0.80
Context Precision	0.73	≥ 0.80
Context Recall	0.89	≥ 0.85
Hallucination Rate	12.4%	≤ 3%

05 — Evaluation Framework Design

Evaluation Framework Design

A one-time test engagement is not enough for a production LLM system. You need the ability to continuously monitor quality as your model, prompts, and data evolve. We design and build evaluation frameworks that your team can run and own independently — integrated into your CI/CD pipeline with full documentation and training.

Our frameworks combine automated LLM-as-judge scoring for high-throughput regression, human-in-the-loop review for critical edge cases, and MLflow-tracked experiment history that provides a defensible audit trail over time.

Deliverables

LLM-as-judge evaluation suite — prompt templates, scoring rubrics, calibration data

CI/CD regression integration — GitHub Actions or your preferred pipeline

MLflow experiment tracking configuration with audit trail

Domain-specific test dataset — built around your regulatory context

Knowledge transfer session and full documentation for your team

MLflow DeepEval Weights & Biases OpenAI Evals GitHub Actions

UK Regulatory Institutions Model Risk Standards

eval_framework.yaml — architecture

Automated Layer

RAGAS + DeepEval — runs on every PR merge. Flags regressions automatically.

LLM-as-Judge

GPT-4o evaluator with calibrated rubrics. Runs on 10% sample + all flagged cases.

Human Review

Domain expert review queue for high-risk outputs. Feeds calibration dataset.

Audit Trail

MLflow tracks all runs, scores, and human decisions. Full audit trail history.

06 — AI Governance & Regulatory

AI Governance & Regulatory Readiness

Most organisations approach governance as an afterthought — assembling evidence after the system is built, rather than building evidence generation into the testing process from the start. Retrofitting an audit trail for UK regulatory institutions is costly and often incomplete.

We work alongside your legal, compliance, and technical teams to produce the specific documentation, bias audits, risk classifications, and audit-ready evidence packs your regulator will expect to see.

Deliverables

Bias audit report aligned to Equality Act 2010 and UK fairness standards

Risk classification assessment aligned to UK regulatory institution expectations

Model validation evidence pack for internal audit and regulatory review

Technical security measures documentation for sector-specific compliance

AI use compliance assessment for regulated sector deployments

MLflow Weights & Biases LangSmith Custom UK DPIA Templates

UK Regulatory Institutions OWASP LLM Top 10 Equality Act 2010

regulatory_coverage_matrix

Framework	Coverage
OWASP LLM Top 10	Full coverage
Performance & Load	Full coverage
Resilience & Red Team	Full coverage
RAG Pipeline	Full coverage
Bias & Fairness	Full coverage
UK Regulatory	Full coverage

FAQ

Common Questions

Most engagements run two to six weeks, depending on scope. A focused performance or resilience test for a single deployment is typically two to three weeks. A comprehensive multi-service engagement covering performance, resilience, RAG testing, and governance documentation runs four to six weeks. We agree a fixed timeline and scope before the engagement begins.

We prefer to test in a staging environment that mirrors production closely — with production traffic data replayed, not live user traffic. We can work entirely within your infrastructure (including air-gapped or private-cloud environments) or through secure access arrangements agreed under NDA. All testing is conducted under UK GDPR with data residency options within Azure UK South or AWS eu-west-2.

We are model-agnostic. We have tested GPT-4o and GPT-4 Turbo via Azure OpenAI (UK South), Claude 3.5 Sonnet and Claude 3 Opus via AWS Bedrock (eu-west-2), Gemini 1.5 Pro via Vertex AI (europe-west2), and open-weight models including Llama 3, Mistral 7B, and Falcon 40B on private UK-hosted infrastructure. If your model is not listed, get in touch.

We sign an NDA before any technical discussion. All testing is conducted under UK GDPR and Data Protection Act 2018. We work within your chosen infrastructure — we do not require access to client or patient data for testing; we use synthetic data generation and representative test datasets. Where real data must be used, we agree data handling protocols in writing before the engagement begins.

Yes — knowledge transfer is included in every engagement. We deliver CI/CD-ready test suites with full documentation, runbooks, and a training session for your team. We do not build dependencies on our continued involvement. You will own the test infrastructure entirely.

We produce structured evidence packs tailored to the specific regulatory audience — covering model validation reports, independent challenge documentation, bias assessment reports, technical security measures evidence, and audit-ready test logs. Each pack is designed to answer the specific questions UK regulatory institutions and internal auditors will ask.

LLM Testing That Survives Production.
Performance, Resilience, Recovery.

LLM Performance Testing

Deliverables

latency_benchmark.py — live run

LLM Resilience & Red Team Testing

Deliverables

OWASP LLM Top 10 Coverage

Recovery & Continuity Testing

Deliverables

chaos_test.py — provider outage simulation

RAG Pipeline Testing

Deliverables

RAGAS evaluation results

Evaluation Framework Design

Deliverables

eval_framework.yaml — architecture

Automated Layer

LLM-as-Judge

Human Review

Audit Trail

AI Governance & Regulatory Readiness

Deliverables

regulatory_coverage_matrix

Common Questions

Not Sure Which Service
You Need?

LLM Testing That Survives Production.Performance, Resilience, Recovery.

LLM Performance Testing

Deliverables

latency_benchmark.py — live run

LLM Resilience & Red Team Testing

Deliverables

OWASP LLM Top 10 Coverage

Recovery & Continuity Testing

Deliverables

chaos_test.py — provider outage simulation

RAG Pipeline Testing

Deliverables

RAGAS evaluation results

Evaluation Framework Design

Deliverables

eval_framework.yaml — architecture

Automated Layer

LLM-as-Judge

Human Review

Audit Trail

AI Governance & Regulatory Readiness

Deliverables

regulatory_coverage_matrix

Common Questions

Not Sure Which ServiceYou Need?

LLM Testing That Survives Production.
Performance, Resilience, Recovery.

Not Sure Which Service
You Need?