Gemini-3 Benchmarkathon
Gemini-3 achieved the top spot on most major benchmarks last week, but how well does it know the law? Unfortunately, most model providers don't evaluate on law-specific benchmarks. So while we have a good idea of how well new models code, we are pretty much in the dark about their lawyering abilities.
In the HuggingLegal community we were curious how well Gemini 3 performs on some law tasks, so we ran a short vibe check.
AA-Omniscience
Author: Robert Scholz
Dataset: https://huggingface.co/datasets/ArtificialAnalysis/AA-Omniscience-Public
Paper: https://arxiv.org/abs/2511.13029
Task Description
The AA-Omniscience benchmark (by artificialanalysis.ai) assesses a model's factual recall and knowledge calibration. The central difficulty lies in the benchmark's design: models must answer correctly when they know the fact but abstain when they do not, as false guesses ("hallucinations") are penalized. Models are tested without any context or tools, so the evaluation relies entirely on pretrained knowledge. We focused on the 100 law-related queries from the publicly available subset, which mainly target specific facts (e.g. dates, numbers, names) from US case law and legal texts.
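For reproducibility, the sketch below shows how such a law subset can be pulled from the Hub. The `domain` column and the `train` split are assumptions on our side; check the dataset card for the actual schema before running.

```python
# Minimal sketch, assuming the public subset exposes a domain-like column.
from datasets import load_dataset

ds = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")
law_subset = ds.filter(lambda row: "law" in str(row.get("domain", "")).lower())
print(f"{len(law_subset)} law-related queries")
```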
Example 1
Input:
You are answering questions about Law, and in particular Property Law. You will be given a question, answer with JUST the answer (no explanation). If you do not know the answer, or you need more context or tools to answer the question, be clear about this - it is better that you say this than get the wrong answer.
Question: As of December 31, 2017, how many foreign countries had in‑force bilateral import‑control agreements with the United States under the Cultural Property Implementation Act (excluding emergency restrictions and expired agreements)?
Gemini-3-Pro Answer (correct):
16
Example 2
Input:
You are answering questions about Law, and in particular Tort Law. You will be given a question, answer with JUST the answer (no explanation). If you do not know the answer, or you need more context or tools to answer the question, be clear about this - it is better that you say this than get the wrong answer.
Question: Under Superior Metal Products’ standard company policy, how many inches apart were the dual operating buttons positioned?
Gemini-3-Pro Answer (abstained):
I need more context or the specific case document to answer this question, as it refers to a specific evidentiary fact regarding a private company's internal policy rather than a general principle of Tort Law.
Qualitative Analysis
The performance of Gemini 3 Pro on the law-related queries in the public subset generally confirms the findings reported by AA on the larger, non-public test set. The model achieved a strong overall score of 44 on our subset, higher than the 28 reported for the full law domain, which might indicate that the public queries are slightly less demanding. Nonetheless, we find a rather high hallucination rate of 0.86, defined as #wrong / (#wrong + #partially correct + #abstentions), consistent with the 0.88 reported by AA across all domains.
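To make the metric concrete, here is a minimal sketch of the computation. The 26 wrong answers match our manual review below; the partial and abstention counts are illustrative placeholders, not our actual tallies.

```python
def hallucination_rate(n_wrong: int, n_partial: int, n_abstained: int) -> float:
    """Share of wrong answers among all non-correct outcomes, as defined above."""
    return n_wrong / (n_wrong + n_partial + n_abstained)

# Illustrative counts only -- see the manual review below for the real breakdown.
print(round(hallucination_rate(n_wrong=26, n_partial=1, n_abstained=3), 2))  # 0.87
```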
A manual review of the 26 responses judged as "wrong" revealed a more nuanced picture: three of the judge's verdicts were themselves incorrect, one was unassessable, and five answers were close to the ground truth but still wrong. For instance, one answer was incorrectly judged false on a question asking approximately how many individuals in the United States were employed; the reference answer was "343,000" and the model responded "330,000". Another case was unassessable by the judge LLM because the approaches were not enumerated: when asked which of three approaches the court used, the reference answer was "third" but the model responded "The court adopted the strict approach". An example of a close-but-wrong answer occurred when the model was asked for the exact date of an accident; the reference answer was "May 24, 1987" but the model responded "May 26, 1987". This qualitative check shows that some "incorrect" results may be near-misses (i.e. some residual knowledge may actually be present), though the high hallucination rate remains a primary weakness.
Overall this suggests Gemini 3 possesses substantial factual knowledge, but also a high propensity to provide an answer even when knowledge is absent.
Vibe Score: 6/10 – high competence but unreliable
LegalBench
Author: Slimane Alaoui Soulimani Valenti
Dataset: https://huggingface.co/datasets/nguha/legalbench
Paper: https://arxiv.org/abs/2308.11462
Task Description
We investigated two binary classification tasks: Hearsay Questions and Contract NLI.
In Hearsay Questions (a rule-application / rule-conclusion task), each problem gives a short trial scenario: what a party is trying to prove (e.g., “Was Tim a soccer fan”) and a piece of evidence they want to introduce (e.g., “Tim told Jimmy that ‘Real Madrid was the best soccer team in the world.’”). The model must answer Yes if that evidence is hearsay and No if it is not, under the Federal Rules of Evidence definition: an out-of-court statement offered to prove the truth of what it asserts. The challenge is that the examples are crafted to hit different “edges” of the rule. Some items are classic hearsay (an out-of-court oral statement offered for its truth), while others require recognizing that there is no “statement”, that the statement was made in court, that the “statement” is non-verbal or written, or that the statement is offered for some purpose other than its truth (e.g., to show who spoke or what someone believed). The model has to walk through this structure implicitly and still land on a Yes or No judgment for each case.
Contract NLI is an interpretation task that asks whether an NDA clause requires all Confidential Information to be expressly identified by the disclosing party. Each example presents a clause from a non-disclosure agreement, and the model must answer “Yes” if, read in context, the clause clearly says that information only counts as “Confidential Information” when the disclosing party labels, marks, or otherwise explicitly identifies it as such (for example, “must be marked ‘Confidential’” or “must be designated in writing”). It should answer “No” if the clause is silent on express identification, defines Confidential Information more broadly (e.g., by subject matter or circumstances), or puts the burden anywhere else. The challenge is that NDAs often express this idea in dense, stylized language: the model must pick up on variations like “conspicuously marked,” “identified at the time of disclosure,” or “reduced to writing and labeled within 30 days,” distinguish them from unrelated obligations, and avoid over-reading vague references to confidentiality as express-identification requirements.
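Both tasks are available on the Hub. The sketch below shows how we would load them; the config and column names are our best guess at the relevant subsets, so verify them against the dataset card.

```python
# Hedged sketch: config/column names are assumptions, and depending on your
# datasets version the repo may require trust_remote_code=True.
from datasets import load_dataset

hearsay = load_dataset("nguha/legalbench", "hearsay", split="test")
contract_nli = load_dataset(
    "nguha/legalbench", "contract_nli_explicit_identification", split="test"
)
print(hearsay[0])  # expected fields: a scenario text and a Yes/No answer
```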
Example Hearsay Questions
Input:
To prove that Tim was a soccer fan, the fact that Tim told Jimmy that "Real Madrid was the best soccer team in the world." Is there hearsay?
Gemini-3-Pro Answer:
This is likely not hearsay. Here is the breakdown:
- The Statement: "Real Madrid was the best soccer team in the world."
- The Matter Asserted: The literal truth of the statement would be that Real Madrid is, in fact, the best team.
- The Purpose of Introduction: The statement is being introduced to show Tim's state of mind or interest (that he likes soccer), not to prove that Real Madrid is actually the best team.
Because the statement is not being used to prove that Real Madrid is the best team, but rather that Tim talks about soccer and holds opinions on it (circumstantial evidence of his fandom), it does not meet the definition of hearsay.
Note: In a legal context, this would usually be admissible as non-hearsay or under the "state of mind" exception, depending on the specific jurisdiction's rules.
Example Contract NLI
Input:
Identify if the clause provides that all Confidential Information shall be expressly identified by the Disclosing Party. Confidential Information - information of whatever kind and in whatever form contained (and includes in particular but without prejudice to the generality of the foregoing, documents, drawings, computerized information, films, tapes, specifications, designs, models, equipment or data of any kind) which is clearly identified by the Disclosing Party as confidential by an appropriate legend or if orally disclosed then upon disclosure or within 30 days of such oral disclosure identified in writing by the Disclosing Party as confidential.
Gemini-3-Pro Answer:
Yes, the clause provides that all Confidential Information must be expressly identified by the Disclosing Party. According to the text, for information to qualify as "Confidential Information," it must be:
- Clearly identified by an appropriate legend (if in written/tangible form); or
- Identified in writing within 30 days (if disclosed orally).
The definition strictly limits Confidential Information to that "which is clearly identified by the Disclosing Party as confidential," omitting any provision for information that is not marked but might otherwise be considered confidential by its nature.
Qualitative Analysis
For Hearsay Questions, Gemini 3 delivered an outstanding performance: a perfect score on all 100 tasks, stable across repeated runs.
For Contract NLI, Gemini 3 was again virtually perfect across the 117 tasks. Over multiple evaluations, the very few apparent errors were not stable across trials and mostly stemmed from small differences in how the benchmark was run, rather than any real misunderstanding of the legal questions.
Compared to the models reported in the LegalBench paper (GPT-4, GPT-3.5, Claude-1, and various open-source baselines), Gemini 3 stands out: on this subset of binary questions it almost never misses, pushing accuracy beyond the range observed for earlier models and suggesting a clear jump in reliability for binary legal judgments.
Vibe Score: 9.5/10 – almost perfect answers
GreekBarBench
Authors: Ernest Beta, Odysseas S. Chlapanis
Dataset: https://huggingface.co/datasets/AUEB-NLP/greek-bar-bench
Paper: https://aclanthology.org/2025.findings-emnlp.1368/
Task Description
GreekBarBench is a challenging benchmark focused on Greek law. It tasks models with generating short, freeform answers to realistic (though hypothetical) legal problems. A key requirement is that every answer must properly cite specific case facts and relevant legal articles. Models operate within a 'long-context' setting: they are given the case facts plus an extensive database of Greek legislation of about 60,000 tokens, forcing them to distinguish between relevant laws and 'distractor' articles. An LLM-as-a-judge, which is independently evaluated for quality, assesses the outputs on a 0–10 scale based on three subscores: Facts, Articles, and Analysis. The final ranking is determined by the mean of these metrics.
Benchmark Results:
| Model | Facts | Articles | Analysis | Mean |
|---|---|---|---|---|
| Gemini-3-Pro-Preview | 9.31 | 9.19 | 9.15 | 9.22 |
| Gemini-2.5-Flash | 8.62 | 8.16 | 8.36 | 8.38 |
| GPT-4.1 | 8.65 | 8.05 | 8.34 | 8.35 |
| OpenAI-o1 | 8.24 | 7.55 | 7.55 | 7.78 |
| Claude-3.7-Sonnet | 8.34 | 7.37 | 7.60 | 7.77 |
| GPT-4.1-mini | 8.28 | 7.18 | 7.40 | 7.62 |
| PASS score | 6.00 | 6.00 | 6.00 | 6.00 |
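The final ranking metric in the table is simply the mean of the three judge subscores; a minimal sketch using the Gemini-3 row above:

```python
def final_score(facts: float, articles: float, analysis: float) -> float:
    # Mean of the three 0-10 judge subscores; 6.00 is the PASS bar from the table.
    return (facts + articles + analysis) / 3

score = final_score(9.31, 9.19, 9.15)
print(f"{score:.2f} (pass: {score >= 6.00})")  # 9.22 (pass: True)
```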
Gemini-3-Pro-Preview delivers an unprecedented performance. To put the scores above in perspective: in the Greek Bar exams, top candidates rarely exceed an average of 8.5/10, even if they get perfect scores in specific subjects. By consistently exceeding this bar, Gemini-3 effectively ranks within the top 1% of all examinees.
Example
To highlight Gemini-3's limitations, we present its lowest-scoring response. Where earlier models often hallucinated, Gemini-3's error here is purely analytical: it rushed to a conclusion based on a superficial reading of the statute.
Input (summarized and translated into English):
Facts:
A sold a car to B on credit, securing the debt with a bill of exchange payable in three months. By mistake, A omitted the "place of issuance" on the bill. B accepted the bill but failed to pay by the deadline.
Question:
After this deadline has passed without action, can A claim the satisfaction of his claim from the security (bill of exchange) in another way and based on which provisions?
Gemini-3-Pro Answer (summarized and translated):
As A omitted the place of issue (item 5), the bill of exchange is void under Articles 1(7) and 2 of Law 5325/1932. Consequently, A must claim the amount via unjust enrichment (Article 904 CC), since B acquired the car (item 1) without paying (item 6). The special action under Article 80 of Law 5325/1932 is inapplicable, as it presupposes a valid title, not one void ab initio.
Qualitative Analysis
Scores:
- Facts: 5/10
- Articles: 5/10
- Analysis: 4/10
- Average: 4.66/10
Judgement Summary
The model correctly identified that the buyer (B) was unjustly enriched but failed significantly on the legal justification. It erroneously claimed that the missing "place of issuance" rendered the bill of exchange void ab initio. In reality, under Law 5325/1932, Article 2, such an omission is curable (the bill is deemed issued at the location beside the drawer's name), meaning the bill remains valid.
Because the model falsely declared the bill void, it directed the user to the general Civil Code (Art. 904) and explicitly rejected the correct specific commercial statute (Article 80). Article 80 is the precise legal remedy for a valid bill where the claim has expired (prescribed). The model's failure to recognize the bill's validity collapsed its entire legal argument.
Vibe Score: 9/10 – impressive long-context reasoning without hallucinations
Indian Law QA
Author: Adhithya Kiran
Dataset: https://huggingface.co/datasets/adhithyakiran/Gemini3_India_legal_Benchmark
Task Description
Indian Law QA evaluates Gemini-3 on Indian law using 107 questions spanning the Constitution of India, the Indian Penal Code (IPC), the new criminal codes (Bharatiya Nyaya Sanhita — BNS, Bharatiya Nagarik Suraksha Sanhita — BNSS), the Indian Evidence Act and its 2023 successor Bharatiya Sakshya Adhiniyam (BSA), and key Supreme Court judgments. It stresses exact provision recall (Articles, Sections, punishments, thresholds), transition-era confusion (IPC→BNS, CrPC→BNSS), and legal reasoning using 50 real-world scenarios that test doctrines, exceptions, constitutional rights, and procedural safeguards.
The challenge comes from the precision required in Indian law:
- Many statutes contain similar or easily-confusable section numbers
- Small numeric errors completely alter the meaning of a provision
- Scenario questions require correct legal application rather than surface recall
Example 1 — Statutory Recall
Input:
Under the Bharatiya Nyaya Sanhita (BNS), which provision replaces sedition with the offence of acts endangering the sovereignty and integrity of India?
Gemini-3-Pro Answer:
Section 152
- Gold Answer: Section 152, BNS
- Judgement: Correct
- Reference: Bharatiya Nyaya Sanhita, 2023 (Official Gazette)
- Reason: Matches the bare act definition.
Example 2 — Scenario Reasoning
Input:
Police keep an accused in custody for 36 hours without producing him before a magistrate. What constitutional violation occurs?
Gemini-3-Pro Answer:
Violation of Article 22(2)
- Gold Answer: Article 22(2) + 24-hour rule under CrPC §57 / BNSS §58
- Judgement: Correct
- Reason: Detention beyond 24 hours violates constitutional safeguards.
- Reference: Constitution of India, Art. 22(2)
Qualitative Analysis
Manual evaluation of the 107-question dataset shows Gemini-3 performs extremely well on Indian statute recall. In strict mode, where only exact-match Article/Section citations and doctrinal accuracy count (see the matching sketch after the list below), Gemini delivered:
- 106 correct answers
- 1 wrong answer
- 0 hallucinations in the evaluated subset
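The sketch below illustrates the kind of strict-mode matching we applied by hand: an answer counts as correct only if every provision it cites also appears in the gold answer (which may carry extra context, e.g. "Section 152, BNS" vs "Section 152"). This is a hypothetical illustration of the grading idea, not an official script.

```python
import re

# Hypothetical matcher: extracts provision citations and checks containment.
PROVISION = re.compile(r"(?:section|article|§)\s*\d+[\w()]*")

def strict_match(prediction: str, gold: str) -> bool:
    cited = set(PROVISION.findall(prediction.lower()))
    allowed = set(PROVISION.findall(gold.lower()))
    return bool(cited) and cited <= allowed

print(strict_match("Section 152", "Section 152, BNS"))              # True
print(strict_match("Violation of Article 22(2)",
                   "Article 22(2) + 24-hour rule under CrPC §57"))  # True
```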
The model demonstrated strong grounding across the Constitution, IPC, BNS/BNSS, and Evidence law. It handled transitional mappings and high-confusion areas accurately (e.g., the sedition successor §152 BNS). The single incorrect answer reflects a factual error rather than a hallucination. Overall, Gemini-3 exhibits very high reliability on Indian bare-act statutory questions, with near-perfect accuracy.
Vibe Score: 8.5/10 — The task feels precise, clean, and insightful, revealing how reliably Gemini-3 handles high-precision statutory queries (including new BNSS 2023 codes) while most models typically hallucinate in this domain.
Wilful Misconduct
Authors: Sophie Franco, Chu Fei Luo, Samuel Dahan
Task Description
This is a binary classification task on wilful misconduct, a subset of just cause dismissal under the Employment Standards Act (ESA), set out in Ontario Regulation 288/01. Wilful misconduct is defined as serious, intentional actions taken by an employee with knowledge that they are wrong, such as theft, fraud, assault, or harassment.
The challenge is that it can be difficult to determine which conduct meets the threshold, as the test for wilful misconduct is higher than that for "just cause". The employer must prove not only that the misconduct was serious but also, distinguishing it from "just cause", that it was "wilful". Careless, thoughtless, heedless, or inadvertent conduct, no matter how serious, does not meet the standard; the employer must show that the misconduct was intentional or deliberate. In Ontario, prior cases (i.e. precedents) are often essential for finding more fine-grained, concrete interpretations of the law.
Each input query is a detailed retelling of events (from the employee's perspective) that led to their termination, followed by the question: "Does my conduct really meet the standard for wilful misconduct?"
Example
Input:
I am an employee and I have been terminated over something I did at work. In my office, there is a very social environment. We make jokes with each other, banter, and even give each other funny gifts for holidays. Most of the possibly inappropriate comradery happens amongst only the male co-workers, but Ms. Anne engages in jokes with us too sometimes. One day, we were all in a conference room joking around and Ms. Anne was there. She made a joke about how short I am in front of everyone. To reply and keep the banter going, I crouched down to make fun of my height as well saying, "this is really how tall I am without my boots on" - as I significantly lowered myself to the ground. Everyone was laughing. I lost my balance a bit and started falling over. Ms. Anne started moving away from me, but I didn't care. I fell forward a bit and touched her butt. In my opinion, it was an accident, but everyone else disagrees with me. All the witnesses and Ms. Anne say that I did it on purpose. Probably because of our past relationship at work where we would make inappropriate jokes together, mostly it was me making those jokes to her though. It all happened really fast and I thought everyone would think it was an accident so I didn't apologize or say it was an accident. My boss has fired me and they are saying I am not getting my ESA entitlements because they not only have "just cause" to terminate me but that I also committed "wilful misconduct" because everyone says I touched her butt on purpose. Does my conduct really meet the standard for wilful misconduct?
[Based on Render v ThyssenKrupp Elevator (Canada) Ltd, 2022 ONCA 310]
Gemini-3-Pro Answer:
Disclaimer: I am an AI, not a lawyer. […]
Based on the terminology you used (ESA, just cause, wilful misconduct), this analysis assumes you are in Ontario, Canada. The distinction between "Just Cause" at common law and the "Wilful Misconduct" standard under the Employment Standards Act (ESA) is one of the most complex areas of employment law.
Here is an analysis of how the law generally views situations like yours.
1. The Two Standards: Just Cause vs. Wilful Misconduct
It is important to understand that your employer has to jump over two different hurdles to deny you payment.
- Common Law Just Cause: This is the standard required to fire you without "Reasonable Notice" (the possibly large severance package based on age, years of service, etc.).
- ESA Wilful Misconduct: This is the standard required to deny you your basic statutory Termination Pay and Severance Pay.
The standard for Wilful Misconduct is higher than Just Cause. [...]
2. What is "Wilful Misconduct"?
Under Regulation 288/01 of the Ontario Employment Standards Act, 2000, an employee is disentitled to termination pay only if they have been guilty of "Wilful misconduct, disobedience or wilful neglect of duty that is not trivial and has not been condoned by the employer." [...]
3. Applying the Standard to Your Situation
Here is how a decision-maker (like a Ministry of Labour officer or a Judge) would analyze the specific facts you provided.
The "Accident" vs. "Wilful" Argument
If your version of events is accepted as 100% true—that you fell due to a loss of balance and touched her accidentally—then accidental conduct is generally not wilful misconduct. Being clumsy or having an accident usually entitles you to your ESA pay, even if the employer fires you for it. [...]
4. Does it meet the standard?
If the Ministry of Labour or a court believes the witnesses over you, yes, this conduct meets the standard for wilful misconduct… However, if you can prove it was truly an accident, it is not wilful. [...]
Summary of Your Position [...]
Recommended Next Steps [...]
Qualitative Analysis
We would rate this answer an 8/10. Gemini's response correctly explained subtle and important nuances, e.g. how wilful misconduct is a higher threshold than just cause termination. It follows the IRAC format (Issue, Rule, Application, Conclusion) that is standard in legal reasoning: Gemini first states the key issue, then the relevant rule, the application of the rule to the issue, and the conclusion/next steps. However, it completely missed that the conduct also needs to be pre-planned to meet the threshold. In the case we used to craft the input query (Render v ThyssenKrupp), the contact was not considered accidental, but the misconduct still did not meet the threshold for "wilful misconduct" because it was not pre-planned. We also do not necessarily agree with Gemini's test for wilful misconduct of "intentional, not trivial and not condoned", as that is not the test commonly used by the Ontario Court of Appeal (ONCA).
In summary, Gemini correctly identifies the jurisdiction (Ontario, Canada) and the key points of wilful misconduct from the correct statute, the ESA, but it does not consider (or mention) more detailed interpretations of the law from leading precedents. The model instead infers its own legal test, which differs from key decisions of influential courts. This is a subtle but important misalignment, and it suggests Gemini was likely trained on little to no caselaw.
Vibe Score: 8/10 – strong logic, missed binding precedent
LEXam (English subset)
Authors: Matteo Johannes Bürgler, Joel Niklaus
Dataset: https://huggingface.co/datasets/LEXam-Benchmark/LEXam
Paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5265144
Task Description
We used a targeted subset of the LEXam benchmark to evaluate Gemini 3 Pro, consisting of English-language questions with four shuffled single-choice options, supplemented by a fifth option (E: "I don't know") to measure epistemic uncertainty and discourage unwarranted guessing. The task is demanding because the underlying questions span heterogeneous legal domains, including Swiss law, Chinese business law, U.S. Business Law, and Transnational Public Security Law, requiring the model to navigate divergent doctrinal logics and jurisdiction-specific concepts. Additional difficulty arises from the structure of the questions themselves: none, one, or several of the presented statements may be correct, forcing the model to first assess each proposition individually and then conclude which answer option is correct. For illustration, a typical item may ask the model to evaluate several independent legal statements (e.g., propositions on Swiss criminal procedure, Human Rights, the 1864 Geneva Convention, proportionality, or inheritance law) and select the precise combination of true statements from options such as "ii and iii," "ii, iii, and v," or "iii, iv, and v," with "I don't know" available as a fallback when the model cannot confidently resolve the task.
Example
Input:
You are an expert in Swiss Law and address legal issues in a structured, exam-style manner.
You are given a multiple-choice question, where only one choice […] is correct. Assume Swiss law applies unless specifically stated otherwise. […]
Please reason through the question step by step […]: Clarify the facts […] Issue Identification […] Rule Explanation […] Application and Reasoning […] Eliminate Incorrect Answers […] Conclusion: […]
Question: Which of the following statements is/are incorrect? i. A contract suffering from a fundamental error is voidable by the party that was acting under error. ii. Swiss commercial law is based on a comprehensive codification of the civil code in the 2000s. iii. The Swiss Broadcasting Corporation (SRG) is mainly financed by a levy […]. iv. Switzerland is a member of the […] EFTA, but not of the […] EEA. v. In Switzerland's direct democracy, education is considered a key element […], the Swiss constitution provides for an extensive public education.
Choices: A) ii B) i, ii, and iv C) i, ii, iii, and iv D) i and v E) I don't know
Before answering, think about the question step by step. Answer only if you are confident, since mistakes are penalized with -1 points, while correct answers receive 1 point, and the answer E) "I don't know" always receives 0 points. […]
Gemini-3-Pro Answer:
Here is the step-by-step reasoning […]
Clarify the facts and Issue Identification
The question requires an evaluation of five distinct statements […]. The goal is to identify which statement(s) is/are legally or factually incorrect.
Rule Explanation and Application
Statement i (Contract Law): Under Article 23 of the Swiss Code of Obligations (CO), a party acting under a fundamental error is not bound by the contract. This state is legally characterized as the contract being voidable (anfechtbar) by the mistaken party. This statement is correct.
Statement ii (Legal History): The Swiss Civil Code (ZGB) was adopted in 1907 and entered into force in 1912. The Code of Obligations (OR), governing commercial law, entered into force in 1912. While revisions occur, Swiss commercial law is not based on a comprehensive codification from the 2000s […]. This statement is incorrect.
Statement iii (Media Law): The Swiss Broadcasting Corporation […] is primarily funded by a household levy […], mandated to ensure public service. This statement is correct.
Statement iv (International Law): Switzerland is a founding member of […] EFTA but rejected membership in […] EEA in a 1992 referendum. This statement is correct.
Statement v (Constitutional Law): The Swiss Federal Constitution (Art. 61a, 62) emphasizes the importance of education and mandates that cantons provide sufficient, free public primary education. This statement is correct.
Eliminate Incorrect Answers
- Choice B: Incorrect because statements i and iv are actually correct.
- Choice C: Incorrect because statements i, iii, and iv are actually correct.
- Choice D: Incorrect because statements i and v are actually correct.
Conclusion
Only statement ii is incorrect. Therefore, Choice A is the right answer.
Final Answer: A
Qualitative Analysis
Extraction failures occurred in 0.32% (± 0.23 SE) of items, and overall accuracy reached 71.41% (± 1.82 SE). Notably, Gemini 3 Pro selected the "I don't know" option in only 0.16% (± 0.16 SE) of cases, indicating extremely low epistemic abstention compared to other models. For instance, GPT-5 chose option E "I don't know" in 8.24% (± 1.11 SE) of cases and Claude-Opus-4.5 in 7.43% (± 1.06 SE).
When applying a stricter scoring regime that awards +1 for correct answers, −1 for incorrect answers, and 0 for "I don't know," Gemini 3 Pro achieved a net score of 42.97 (± 3.63 SE). Under this penalty-based metric, its performance is substantially weaker than that of other models, including GPT-5, which, despite a lower raw accuracy of 69.47% (± 1.85 SE), achieved a higher penalised score of 47.17 (± 3.35 SE).
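For clarity, the penalised metric can be reproduced from the headline percentages above. Treating the extraction failures as wrong answers is an assumption on our part, but it recovers the reported score up to rounding.

```python
# Sketch of the +1 correct / -1 wrong / 0 "I don't know" scoring, on a 0-100
# scale. Assumption: extraction failures are scored as wrong answers.
pct_correct, pct_idk = 71.41, 0.16
pct_wrong = 100 - pct_correct - pct_idk  # includes the 0.32% extraction failures
penalised = pct_correct - pct_wrong
print(round(penalised, 2))  # 42.98 -- matches the reported 42.97 up to rounding
```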
Vibe Score: 7/10 – competent but overly confident
Conclusion
Gemini-3 establishes a new state-of-the-art in legal AI, effectively bridging the gap with human practitioners across multiple jurisdictions, even if its overconfidence remains a critical hurdle. It delivered top-tier performance on every benchmark, exhibiting excellent legal reasoning on LegalBench, Wilful Misconduct, and Indian statutory recall, mastering the long-context questions of GreekBarBench, and demonstrating strong knowledge accuracy on AA-Omniscience and LEXam. However, significant limitations persist that prevent it from being a standalone legal expert: the AA-Omniscience and LEXam results revealed a high propensity to provide confident hallucinations rather than abstaining when knowledge is missing, and the GreekBarBench and Wilful Misconduct examples showed that the model can still fail on intricate reasoning chains or binding precedents by rushing to superficial conclusions. Ultimately, while Gemini-3 has moved beyond merely reciting legal terminology to applying legal reasoning with professional nuance, these reliability gaps highlight the continued need for human oversight.
Aggregate Vibe Score: 8/10 – Brilliant reasoning, dangerous confidence.
If you're interested in this kind of work, feel free to join us on our Discord: https://discord.gg/37Js2WBhv3