AI Laboratory

Which model for your business?

Test different AI models on a specific task in your workflow. Measure. Compare. Decide.

1. Choose a workflow

Select the business process where you want to test an AI agent.

2. Select the task to optimize

Click on the workflow step where you want to place an AI agent.

Extraction
Validation (test here)
Routing
Approval

Validation

Classification

The agent must detect anomalies in invoices: inconsistent amounts, duplicates, missing purchase orders.

Input: Invoice data (supplier, amount, PO)
Expected output: Valid / Anomaly + reason
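
The input and expected output described above could be modeled as follows. This is a minimal sketch: the class names `Invoice` and `Verdict` and all field names are illustrative assumptions, not part of the product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Invoice:
    """Input to the agent (field names are hypothetical)."""
    invoice_id: str
    supplier: str
    amount_eur: float
    purchase_order: Optional[str]  # None when the PO is missing


@dataclass(frozen=True)
class Verdict:
    """Expected output: Valid, or Anomaly with a reason."""
    label: str
    reason: Optional[str] = None

    def __post_init__(self):
        # Only the two labels named in the task description are accepted.
        if self.label not in ("Valid", "Anomaly"):
            raise ValueError(f"unknown label: {self.label!r}")
        # Assumption: an Anomaly verdict must always carry a reason.
        if self.label == "Anomaly" and not self.reason:
            raise ValueError("an Anomaly verdict needs a reason")
```

Validating the verdict at construction time means a malformed model response fails early instead of being silently scored as wrong.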

3. Test dataset preview

Real anonymized cases from your history, with their expected classification (ground truth).

150 test cases (anonymized history)

#INV-2024-0892: Valid
Supplier: ACME Corp | Amount: 2 450,00 € | Purchase order: PO-2024-1234

#INV-2024-0893: Anomaly
Supplier: Tech Solutions | Amount: 18 750,00 € | Purchase order: Missing
Reason: missing PO on amount > 15k€

#INV-2024-0894: Anomaly
Supplier: Global Services | Amount: 5 200,00 € | Purchase order: PO-2024-0087
Reason: duplicate detected (same supplier, same amount, same month)

68% valid cases, 32% anomalies
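
The two reasons shown in the preview (missing PO above 15k€, and a duplicate on the same supplier, amount and month) can be expressed as a simple rule-based check. The sketch below is only an illustration of those two rules, not how the models or the ground truth are produced. The `month` field and the function name `classify` are assumptions, since the preview shows no invoice dates.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

PO_THRESHOLD_EUR = 15_000  # threshold taken from the reason shown in the preview


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    supplier: str
    amount_eur: float
    purchase_order: Optional[str]  # None when the PO is missing
    month: str  # hypothetical field, e.g. "2024-03"


def classify(invoices):
    """Return {invoice_id: (label, reason)} using only the two preview rules."""
    # Count how often each (supplier, amount, month) combination occurs.
    seen = Counter((i.supplier, i.amount_eur, i.month) for i in invoices)
    results = {}
    for inv in invoices:
        key = (inv.supplier, inv.amount_eur, inv.month)
        if inv.purchase_order is None and inv.amount_eur > PO_THRESHOLD_EUR:
            results[inv.invoice_id] = ("Anomaly", "missing PO on amount > 15k€")
        elif seen[key] > 1:
            # Every member of a duplicate group is flagged, including the first.
            results[inv.invoice_id] = ("Anomaly", "duplicate detected")
        else:
            results[inv.invoice_id] = ("Valid", None)
    return results
```

A baseline like this is useful mainly as a sanity check: a model that scores below fixed rules on the same test cases is unlikely to be worth its cost.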

4. Compare two models

Select the current model and the challenger to evaluate on this task.

Current: ~1.8s / $0.002
VS
Challenger: ~1.2s / $0.005
Current: GPT-3.5 Turbo
Overall accuracy: 78.2%
False positives: 15%
False negatives: 8%
Average time: 1.8s

VS

Challenger: GPT-4o (best for this task)
Overall accuracy: 91.4%
False positives: 5%
False negatives: 4%
Average time: 1.2s

Overall accuracy (chart): GPT-3.5 78.2%, GPT-4o 91.4%

Response examples

#INV-2024-0893 (ground truth: Anomaly)

GPT-3.5 Turbo: Anomaly
"Missing purchase order"

GPT-4o: Anomaly
"PO missing + amount > threshold (15k€)"

How do we measure?

Each model processes the same 150 test cases. We compare its response (Valid/Anomaly) to the ground truth established by your experts. A false positive (a valid invoice flagged as an anomaly) generates unnecessary work; a false negative (an anomaly classified as valid) lets the anomaly through.
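
The comparison described above can be sketched in a few lines. The function name `score` is hypothetical, and the sketch assumes that "Anomaly" is the positive class and that false positives and false negatives are reported as a share of all test cases; the page does not state which denominator it uses.

```python
def score(predictions, ground_truth):
    """Compare model labels to expert labels ("Valid" / "Anomaly")."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("need one prediction per non-empty ground-truth list")
    pairs = list(zip(predictions, ground_truth))
    n = len(pairs)
    correct = sum(p == t for p, t in pairs)
    # False positive: a valid invoice flagged as an anomaly.
    false_pos = sum(p == "Anomaly" and t == "Valid" for p, t in pairs)
    # False negative: an anomaly classified as valid.
    false_neg = sum(p == "Valid" and t == "Anomaly" for p, t in pairs)
    return {
        "accuracy": correct / n,
        "false_positive_share": false_pos / n,  # assumption: share of all cases
        "false_negative_share": false_neg / n,  # assumption: share of all cases
    }
```

Running `score` once per model on the same test cases gives directly comparable figures for the current model and the challenger.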

Ready to test on your real processes?

Connect your digital twin for benchmarks on your actual data.

Request a demo