Testing Toolkit

The Tools
We Use

Every tool we use is chosen for a specific purpose. We do not apply a generic stack — our tooling combinations are built around your deployment architecture, regulatory context, and testing objectives.

pgn test --list-tools --certified-uk-compliant
# UK-GDPR certified tool stack
ragas evaluation · uk-financial, nhs, legal
deepeval evaluation · all sectors
promptfoo security · fca, ico adversarial
garak security · ncsc-aligned
k6 load · uk sector peak patterns
locust load · chaos + black friday
langsmith tracing · audit trail generation
helicone observ. · uk data residency
arize-phoenix observ. · semantic drift
mlflow tracking · ico-defensible audit
Evaluation

RAGAS

RAG evaluation framework, extended with UK domain-specific ground-truth datasets for financial services, healthcare clinical guidelines, and legal clause libraries.

RAG Pipelines
Evaluation

DeepEval

LLM unit testing framework with pytest integration. Supports LLM-as-judge scoring, hallucination detection, and CI/CD-ready test suites across any model or provider.

All Service Lines
Security

Promptfoo

Red team testing with UK-specific adversarial scenarios — regulatory evasion, UK financial regulator disclosure manipulation, UK data protection authority data extraction attempts, and social engineering patterns.

Red Team
Security

Garak

LLM vulnerability scanner aligned to UK AI Security Guidance Principles and OWASP LLM Top 10. Sector-specific probe libraries for UK financial, clinical, and legal contexts.

Security Audit
Load Testing

k6 & Locust

Load generation at UK sector peak traffic patterns — healthcare surge periods, financial quarter-end reporting, and peak retail load profiles.

Performance
Observability

LangSmith

Pipeline tracing and debugging with audit trail generation. Produces the structured logs required for UK regulatory framework model validation and regulatory accountability documentation.

Audit Trail
Observability

Helicone

LLM observability and cost monitoring with UK data residency support. Configured for GDPR-compliant log storage within Azure UK South or AWS eu-west-2.

Monitoring
Observability

Arize Phoenix

Tracing and semantic drift detection for production LLM systems. Alerts on quality degradation before it becomes a regulatory or operational issue.

Production Monitoring
Tracking

MLflow

Experiment tracking and model registry with UK data protection authority-defensible audit trail configuration. Every test run is logged with full reproducibility — essential for regulatory evidence packs.

Compliance Evidence
Tracking

Weights & Biases

Model performance tracking with visualised evidence packs designed for presentation to UK regulatory institutions and internal auditors — not just technical teams.

Regulatory Evidence
Evaluation

OpenAI Evals

Standardised evaluation harness extended with UK domain-specific test sets for financial services, healthcare, and legal clause consistency.

Benchmarking
Cloud

Azure OpenAI & AWS Bedrock

All testing conducted within UK data residency boundaries — Azure UK South and AWS eu-west-2. GCP europe-west2 available on request. Data sovereignty guaranteed.

UK Data Residency
Regulatory Frameworks

UK Compliance
Coverage

Every engagement produces evidence aligned to the specific regulatory frameworks your organisation must satisfy.

Financial Services

UK regulatory framework Model Risk

Three lines of defence documentation for AI model risk
Model inventory and validation evidence
Ongoing monitoring framework design
Independent challenge documentation for internal audit
Change management evidence for model updates
Data & Privacy

Regulatory Accountability & UK GDPR

DPIA-supporting technical assessment documentation
Automated decision-making transparency audit
Subject access and data minimisation review
Incident response logging configuration
Bias assessment for protected characteristics
Healthcare

sector compliance standards & sector regulatory standards

Technical security measures documentation for sector-specific compliance
UK sector regulator AI as a Medical Device classification support
Clinical safety documentation (DCB0129 / DCB0160)
sector regulator standards alignment evidence
Sector-specific security boundary configuration review
Government & AI Safety

UK AI Standards & UK cybersecurity guidance

UK AI Security Guidance Principles alignment assessment
UK AI Standards Responsible AI evaluation framework
OWASP LLM Top 10 coverage report
Online Safety Act 2023 content risk assessment
EU AI Act cross-border compliance mapping
Expertise

Who Does the Work

Our team is drawn from UK AI research, information security, and regulatory compliance backgrounds.

🤖

ML Practitioners & NLP Researchers

AI engineers and researchers with hands-on experience in production LLM deployments across UK regulated sectors.

PyTorch Transformers RAGAS MLflow
🛡️

AI Red Team Specialists

Penetration testing and adversarial AI specialists with experience in UK financial services and government security contexts.

OWASP LLM Garak Promptfoo UK cybersecurity guidance
📋

UK Regulatory Specialists

Former compliance professionals with direct experience of UK regulatory institutions across financial services, healthcare, legal, and public sector environments.

UK regulatory framework UK data protection authority GDPR sector compliance standards UK sector regulator
☁️

UK Cloud & MLOps Engineers

Cloud architects with UK data residency specialism — Azure UK South, AWS eu-west-2, and private on-premises LLM infrastructure.

Azure UK South AWS eu-west-2 Kubernetes CI/CD
Quarterly Reports

UK AI Risk Reports

Practical guidance on UK regulatory developments and LLM risk — written for technical and compliance teams, not for press releases.

Subscribe to the Quarterly Report

Delivered to your inbox each quarter. No marketing email — just the report.

Subscribe →
Q1 2025

UK regulatory framework: What UK Financial Services Need to Know About LLM Model Risk

Practical guide to the regulatory guidance model risk management framework as applied to LLM deployments — validation requirements, ongoing monitoring, and what internal audit will ask.

Request report
Q4 2024

AI Risk Classification: A Practical Guide for UK Healthcare Digital Teams

How to navigate UK sector regulator AI as a Medical Device classification for clinical LLM applications — software as a medical device criteria, regulatory pathway options, and documentation requirements.

Request report
Q3 2024

OWASP LLM Top 10: UK Implications and Practical Testing Approaches

What the OWASP LLM Top 10 means in practice for UK regulated deployments — from prompt injection to model theft — with sector-specific risk prioritisation guidance.

Request report
Q2 2024

AI Use in Regulated Legal Services Guidance: Implications for Legal Services Organisations

What UK regulatory institution guidance on AI use means in practice — consistency obligations, professional indemnity implications, and testing requirements for contract review and legal research tools.

Request report
Q1 2024

RAG in Regulated Environments: Hidden Risks and Testing Approaches

Why retrieval-augmented generation pipelines require a different testing approach — and what happens when retrieval quality, source attribution, or context window management goes wrong.

Request report
Q4 2023

UK AI Safety Institute: What the Evaluation Framework Means for Regulated Sectors

An overview of the UK AI Standards evaluation framework and its practical implications for financial services, healthcare, and government AI deployments under UK law.

Request report

Start with a Free
Risk Assessment

Tell us about your LLM deployment and we'll identify your three highest-risk areas within two working days — no cost, no obligation, NDA first.