Les-Audits-Affaires: The First Comprehensive French Business Legal AI Benchmark
TL;DR: We built legmlai/les-audits-affaires, the first benchmark showing how badly general-purpose AI struggles with French business law. AI hallucinations in financial NLP occur in up to 41 percent of cases, costing billions annually. Our 2,670 test cases across 9 French business codes point to specialized models as the only viable solution.
Current AI Performance is Abysmal
OpenAI's latest reasoning systems, according to their own report, show hallucination rates reaching 33% for their o3 model and a staggering 48% for o4-mini when answering questions about public figures. For comparison, Google's Gemini-2.0-Flash-001 achieved an industry-leading hallucination rate of just 0.7% in 2025, proving improvement is possible with the right approach.
In banking specifically, a recent BCG survey finds that only 25% of institutions have woven these capabilities into their strategic playbook. The remaining 75% are burning money on pilots that go nowhere.
Real Court Disasters Keep Mounting
AI legal expert Damien Charlotin tracks legal decisions in which lawyers have used evidence that featured AI hallucinations. His database details more than 30 instances in May 2025. Major firms aren't immune:
Date | Firm/Court | Error | Consequence |
---|---|---|---|
Feb 2025 | Morgan & Morgan (1,000+ lawyers) | Fictitious cases vs Walmart | Urgent firm-wide warning |
May 2025 | K&L Gates (1,700 lawyers) | 9 of 27 citations wrong | $31,000 sanctions |
May 2025 | Latham & Watkins | AI-hallucinated expert report | Motion to exclude evidence |
Nov 2024 | Texas Federal Court | Nonexistent cases | $2,000 penalty + AI course |
Source: AI's penchant for generating legal fiction in case filings has led courts around the country to question or discipline lawyers in at least seven cases over the last year
Les-Audits-Affaires: The Benchmark
We built a benchmark that reflects real French business complexity. Here's how:
2,670 Real-World Test Cases
Our 400+ personas aren't abstract - they're real business situations:
- Marie, 34: CFO at Lyon tech startup, dealing with R&D tax credits (CIR), BSPCE stock options, GDPR compliance
- Jean-Pierre, 52: Bordeaux restaurant owner, facing VAT irregularities (a sector where higher levels of VAT irregularities are seen)
- Amélie, 28: Paris corporate counsel, handling M&A due diligence, DORA compliance
- Philippe, 45: Bank manager, managing CESOP reporting, in effect since 2024
Coverage Across 9 Essential Codes
Legal Code | Test Cases | Focus Areas |
---|---|---|
Financial Law | 350 | Banking regulations, AML/CFT, payment services |
Commercial Law | 320 | Contracts, company formation, insolvency |
Tax Code (CGI) | 310 | VAT, corporate tax, deductions |
Insurance Law | 300 | Policies, claims, broker regulations |
Tax Law | 290 | International tax, transfer pricing |
Consumer Law | 290 | GDPR, e-commerce, warranties |
Employment Law | 280 | Contracts, termination, benefits |
IP Law | 270 | Patents, trademarks, licensing |
Procurement | 260 | Public tenders, compliance |
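As a quick sanity check, the per-code counts in the table can be tallied against the 2,670 total quoted in the TL;DR. This is a minimal sketch that only re-uses the numbers printed above; the dictionary name `cases_per_code` is an illustrative choice, not a field of the dataset.

```python
# Test-case counts copied from the coverage table above.
cases_per_code = {
    "Financial Law": 350,
    "Commercial Law": 320,
    "Tax Code (CGI)": 310,
    "Insurance Law": 300,
    "Tax Law": 290,
    "Consumer Law": 290,
    "Employment Law": 280,
    "IP Law": 270,
    "Procurement": 260,
}

total = sum(cases_per_code.values())
print(f"Total test cases: {total}")

# Share of the benchmark covered by each code, largest first.
for code, count in sorted(cases_per_code.items(), key=lambda kv: -kv[1]):
    print(f"{code:<16} {count:>4}  {count / total:.1%}")
```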
The 5-Dimension Evaluation
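Every test case evaluates what actually matters to businesses, along the five dimensions illustrated in the example below: the action to take, the deadline (délai), the required documents, the financial impact, and the consequences of non-compliance.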
Real Example: E-commerce VAT Compliance
Scenario: Sophie, e-commerce manager, €120k revenue, selling to Germany and Spain
Correct Answer:
- Action: Register for EU VAT, file monthly returns with EC Sales List
- Délai: Within 15 days of crossing €100k threshold
- Documents: VAT returns, EC Sales List, Intrastat declarations
- Impact: 20% VAT collection, €200 monthly filing costs
- Conséquences: €750 penalty + 0.4% monthly interest on unpaid VAT
Common AI Failures:
- Cites outdated €35k threshold (changed to €100k in 2025)
- Misses Intrastat requirement
- Invents "simplified quarterly filing" that doesn't exist
- Calculates penalties using wrong rates
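One way to picture the five-dimension answer is as a small record with one field per dimension. The sketch below is an assumption for illustration only: the class `FiveDimensionAnswer` and the helper `missing_dimensions` are hypothetical names, and the benchmark's actual schema and scoring may differ. It simply flags which dimensions a reply leaves empty.

```python
from dataclasses import dataclass, fields


@dataclass
class FiveDimensionAnswer:
    """Hypothetical container for the five dimensions shown in the example."""
    action: str = ""
    delai: str = ""          # deadline
    documents: str = ""
    impact: str = ""
    consequences: str = ""


def missing_dimensions(answer: FiveDimensionAnswer) -> list[str]:
    """Return the names of the dimensions the answer leaves empty."""
    return [f.name for f in fields(answer) if not getattr(answer, f.name).strip()]


# Reference answer, copied from the e-commerce VAT example above.
reference = FiveDimensionAnswer(
    action="Register for EU VAT, file monthly returns with EC Sales List",
    delai="Within 15 days of crossing €100k threshold",
    documents="VAT returns, EC Sales List, Intrastat declarations",
    impact="20% VAT collection, €200 monthly filing costs",
    consequences="€750 penalty + 0.4% monthly interest on unpaid VAT",
)

# A made-up model reply that only covers two of the five dimensions.
partial = FiveDimensionAnswer(action="Register for EU VAT", documents="VAT returns")

print(missing_dimensions(reference))
print(missing_dimensions(partial))
```

A completeness check like this says nothing about correctness; it only shows why a reply that gets the action right can still fail a case.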
Why Domain-Specific Models Win
The evidence is overwhelming. Models trained on carefully curated datasets show a 40% reduction in hallucinations compared to those trained on raw internet data:
Model Type | Training Data | Legal Content | Hallucination Rate |
---|---|---|---|
General LLM | 13 trillion tokens | 0.3% | 41-75% |
Domain-Specific | 500 billion tokens | 100% | 5-15% |
Improvement | 26x less data | 333x more relevant | 88% better |
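The "Improvement" row follows from simple ratios of the two rows above it. The sketch below reproduces that arithmetic; reading "88% better" as the relative reduction between the lower bounds of the two hallucination ranges (41% and 5%) is an assumption, since the table does not say which bounds were used.

```python
# Training data: 13 trillion vs 500 billion tokens.
general_tokens = 13e12
domain_tokens = 500e9
data_ratio = general_tokens / domain_tokens

# Share of legal content in the training data, in percent.
general_legal_share = 0.3
domain_legal_share = 100.0
relevance_ratio = domain_legal_share / general_legal_share

# Relative reduction in hallucination rate, using the lower bounds (assumption).
reduction_low = (41 - 5) / 41
# The same calculation on the upper bounds, for comparison.
reduction_high = (75 - 15) / 75

print(f"Data ratio: {data_ratio:.0f}x less data")
print(f"Relevance ratio: {relevance_ratio:.0f}x more legal content")
print(f"Reduction (lower bounds): {reduction_low:.0%}")
print(f"Reduction (upper bounds): {reduction_high:.0%}")
```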
The French AI Opportunity
France is investing heavily: Microsoft announced that it was investing €4 billion in cloud and AI infrastructure in France, bringing up to 25,000 of the most advanced GPUs to the country by the end of 2025. Add to that the R&D tax credits, which cost about €6.5 billion a year as of 2018 and which the French government estimates will, in the long run, add 0.8% to the country's GDP and create 60,000 jobs. The infrastructure exists.
What's missing are models that understand French business law. "Accuracy costs money. Being helpful drives adoption." But when accuracy prevents €750 penalties, €16 billion in tax adjustments, and professional sanctions, accuracy is the only metric that matters.
Getting Started
# Load the benchmark (load_dataset returns a DatasetDict keyed by split name)
from datasets import load_dataset
dataset = load_dataset("legmlai/les-audits-affaires")
# Pick the first available split rather than assuming its name
split_name = next(iter(dataset))
cases = dataset[split_name]
# Explore the data
print(f"Total cases in '{split_name}': {len(cases)}")
print(f"Example case: {cases[0]}")
# Each case contains:
# - persona: business context and demographics
# - scenario: specific legal situation
# - ground_truth: correct answers for all 5 dimensions
# - legal_refs: articles from Légifrance
Run evaluations:
git clone [github]/les-audits-evaluation-harness
cd les-audits-evaluation-harness
python evaluate.py --model your_model --output results.json
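To make the evaluation step more concrete, here is a minimal sketch of what a prediction loop over the cases could look like. It is not the harness's actual code: `build_prompt`, `run_model`, the prompt wording and the stand-in `echo_model` are all hypothetical, and only the field names `persona`, `scenario` and `ground_truth` come from the dataset description above.

```python
import json
from typing import Callable


def build_prompt(case: dict) -> str:
    """Combine the persona and scenario fields into a single question."""
    return (
        f"Persona: {case['persona']}\n"
        f"Scenario: {case['scenario']}\n"
        "Answer with: action, délai, documents, impact, conséquences."
    )


def run_model(cases: list[dict], model: Callable[[str], str]) -> list[dict]:
    """Query the model once per case and keep the reference answer alongside."""
    results = []
    for case in cases:
        prompt = build_prompt(case)
        results.append({
            "prompt": prompt,
            "prediction": model(prompt),
            "ground_truth": case["ground_truth"],
        })
    return results


# Stand-in data and model so the sketch runs without network access.
demo_cases = [{
    "persona": "Sophie, e-commerce manager",
    "scenario": "€120k revenue, selling to Germany and Spain",
    "ground_truth": {"action": "Register for EU VAT"},
}]


def echo_model(prompt: str) -> str:
    """Placeholder model that always returns the same answer."""
    return "Register for EU VAT"


results = run_model(demo_cases, echo_model)
print(json.dumps(results, ensure_ascii=False, indent=2))
```

Passing the model as a plain callable keeps the loop independent of any particular API client; swapping in a real model only means replacing `echo_model`.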
Anti-Contamination Measures
We prevent benchmark gaming through:
- Open pipeline: Regenerate test cases with different personas
- Cross-LLM evaluation: GPT-4o generates, a different model evaluates
- Real-time updates: Connected to current Légifrance data
- Variation: Same laws, different business contexts
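Two of these measures are easy to sketch: pairing the same legal references with different personas, and refusing a setup where one model both writes and grades the cases. The code below is an illustration under stated assumptions; `make_variants`, `check_cross_llm`, the placeholder article strings and the model names are hypothetical and do not describe the actual pipeline.

```python
from itertools import product


def make_variants(legal_refs: list[str], personas: list[str]) -> list[dict]:
    """Pair every legal reference with every persona to get distinct contexts."""
    return [
        {"legal_ref": ref, "persona": persona}
        for ref, persona in product(legal_refs, personas)
    ]


def check_cross_llm(generator: str, evaluator: str) -> None:
    """Refuse a setup where the same model writes and grades the cases."""
    if generator == evaluator:
        raise ValueError("generator and evaluator must be different models")


legal_refs = ["<article A>", "<article B>"]  # placeholders, not real citations
personas = [
    "Lyon tech startup CFO",
    "Bordeaux restaurant owner",
    "Paris corporate counsel",
]

variants = make_variants(legal_refs, personas)
print(f"{len(variants)} variants from {len(legal_refs)} articles")

check_cross_llm("gpt-4o", "some-other-model")  # passes silently
```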
The Path Forward
Most financial authorities have not issued AI regulations specific to financial institutions, as existing frameworks already address most of these risks, but that's changing fast. This is one of the reasons behind the European DORA Regulation, which has applied since January 2025.
With 77% of businesses in one study concerned about AI hallucinations, and enterprises spending an average of $14,200 per employee per year to catch and correct them, the market desperately needs specialized models.
About legml.ai: We're building specialized AI models for French business law in Paris. Because when compliance matters, general AI isn't good enough.
Resources:
- Dataset: legmlai/les-audits-affaires
- GitHub - Evaluation Harness: les-audits-affaires-eval-harness
- Website: legml.ai
Built on French legal codes from louisbrulenaudet's comprehensive corpus.