AI Laboratory

Which model for your business?

Test different AI models on a specific task in your workflow. Measure. Compare. Decide.

Choose a workflow

Select the business process where you want to test an AI agent.

Select the task to optimize

Click on the workflow step where you want to place an AI agent.

Extraction

Validation (test here)

Routing

Approval

Validation

Classification

The agent must detect anomalies in invoices: inconsistent amounts, duplicates, missing purchase orders.

Input: Invoice data (supplier, amount, PO)

Expected output: Valid / Anomaly + reason

Test dataset preview

Real anonymized cases from your history, with their expected classification (ground truth).

150 test cases, anonymized history

#INV-2024-0892 Valid

Supplier: ACME Corp | Amount: 2 450,00 € | Purchase order: PO-2024-1234

#INV-2024-0893 Anomaly

Supplier: Tech Solutions | Amount: 18 750,00 € | Purchase order: Missing

Reason: missing PO on amount > 15k€

#INV-2024-0894 Anomaly

Supplier: Global Services | Amount: 5 200,00 € | Purchase order: PO-2024-0087

Reason: duplicate detected (same supplier, same amount, same month)
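The two anomaly reasons shown in the preview can be written as plain rules, which is useful as a baseline to compare an AI agent against. The following is a minimal sketch, not the product's logic: the `Invoice` type, the `invoice_date` field, and the `check_invoice` function are hypothetical names, and only the 15k€ threshold and the "same supplier, same amount, same month" duplicate rule come from the examples above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional, Tuple

# Threshold taken from the example reason "missing PO on amount > 15k€".
PO_REQUIRED_ABOVE_EUR = 15_000


@dataclass(frozen=True)
class Invoice:
    """Hypothetical record mirroring the fields shown in the preview."""
    invoice_id: str
    supplier: str
    amount_eur: float
    invoice_date: date              # assumed field; the preview shows no dates
    purchase_order: Optional[str]   # None when the PO is missing


def check_invoice(
    invoice: Invoice, history: Iterable[Invoice]
) -> Tuple[str, Optional[str]]:
    """Return ("Valid", None) or ("Anomaly", reason) using the two preview rules."""
    # Rule 1: a missing PO is only flagged above the threshold (assumption:
    # smaller invoices without a PO are accepted, as the preview does not say).
    if invoice.purchase_order is None and invoice.amount_eur > PO_REQUIRED_ABOVE_EUR:
        return "Anomaly", "missing PO on amount > 15k€"

    # Rule 2: duplicate = same supplier, same amount, same calendar month.
    month = (invoice.invoice_date.year, invoice.invoice_date.month)
    for past in history:
        if past.invoice_id == invoice.invoice_id:
            continue  # do not compare an invoice with itself
        past_month = (past.invoice_date.year, past.invoice_date.month)
        if (
            past.supplier == invoice.supplier
            and past.amount_eur == invoice.amount_eur
            and past_month == month
        ):
            return "Anomaly", "duplicate detected (same supplier, same amount, same month)"

    return "Valid", None
```

A rule set like this only covers the cases someone thought to encode; the benchmark measures how well a model handles the full anonymized history, including the "inconsistent amounts" case that is not sketched here.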

68% valid cases

32% anomalies

Compare two models

Select the current model and the challenger to evaluate on this task.

Current

~1.8s / $0.002

Challenger

~1.2s / $0.005
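To put the two latency and cost figures in perspective, they can be projected over the 150-case test set. This sketch assumes the figures are averages per processed test case and that cases run one after another; the page does not state either, and `project_run` is a hypothetical helper.

```python
def project_run(cases: int, seconds_per_case: float, usd_per_case: float) -> dict:
    """Project total sequential run time and cost for a benchmark run.

    Assumption: both figures are per-case averages and cases are
    processed sequentially (no batching or parallel calls).
    """
    return {
        "total_seconds": round(cases * seconds_per_case, 1),
        "total_usd": round(cases * usd_per_case, 2),
    }


TEST_CASES = 150  # size of the test dataset described above

current = project_run(TEST_CASES, seconds_per_case=1.8, usd_per_case=0.002)
challenger = project_run(TEST_CASES, seconds_per_case=1.2, usd_per_case=0.005)

print("Current:   ", current)
print("Challenger:", challenger)
```

Under these assumptions the challenger finishes the run sooner but costs more per case, so the decision depends on how much an accuracy gain is worth at production volume.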

Current GPT-3.5 Turbo

78.2% Overall accuracy

False positives 15%

False negatives 8%

Average time 1.8s

Challenger GPT-4o (best for this task)

91.4% Overall accuracy

False positives 5%

False negatives 4%

Average time 1.2s

Overall accuracy: GPT-3.5 Turbo 78.2%, GPT-4o 91.4%

Response examples

#INV-2024-0893 Ground truth: Anomaly

GPT-3.5 Turbo Anomaly

"Missing purchase order"

GPT-4o Anomaly

"PO missing + amount > threshold (15k€)"

How do we measure?

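Each model processes the same 150 test cases. We compare its response (Valid/Anomaly) to the ground truth established by your experts. A false positive is a valid invoice flagged as an anomaly, which generates unnecessary review work; a false negative is an anomaly classified as valid, which lets the anomaly through.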
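The comparison described above can be sketched in a few lines. This is an illustration under assumptions, not the benchmark's actual code: `score` is a hypothetical function, "Anomaly" is treated as the positive class, and the false positive and false negative figures are computed as shares of all test cases, since the page does not say which denominator it uses.

```python
from collections import Counter
from typing import Dict, Sequence


def score(predictions: Sequence[str], ground_truth: Sequence[str]) -> Dict[str, float]:
    """Compare model labels ("Valid"/"Anomaly") with expert ground truth."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("at least one test case is required")

    # Count (truth, prediction) pairs; missing pairs count as 0.
    pairs = Counter(zip(ground_truth, predictions))
    total = len(ground_truth)

    correct = pairs[("Valid", "Valid")] + pairs[("Anomaly", "Anomaly")]
    false_positives = pairs[("Valid", "Anomaly")]   # valid invoice flagged
    false_negatives = pairs[("Anomaly", "Valid")]   # anomaly let through

    return {
        "accuracy": correct / total,
        # Assumption: shares of all test cases, not per-class rates.
        "false_positive_share": false_positives / total,
        "false_negative_share": false_negatives / total,
    }
```

Calling `score` once per model on the same ground truth gives directly comparable numbers. If per-class rates are preferred, divide false positives by the number of valid cases and false negatives by the number of anomalies instead.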

Ready to test on your real processes?

Connect your digital twin for benchmarks on your actual data.

Request a demo