from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str

# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("bias", "score", "Bias")
    task1 = Task("chatbot_abilities", "score", "Chatbot Abilities")
    task2 = Task("entity_extraction", "score", "Entity Extraction")
    task3 = Task("knowledge", "score", "Knowledge")
    task4 = Task("legal", "score", "Legal")
    task5 = Task("news_classification", "score", "News Classification")
    task6 = Task("product_help", "score", "Product Help")
    task7 = Task("qa_finance", "score", "QA Finance")
    task8 = Task("rag_general", "score", "RAG General")
    task9 = Task("reasoning", "score", "Reasoning")
    task10 = Task("safety", "score", "Safety")
    task11 = Task("summarization", "score", "Summarization")
    task12 = Task("translation", "score", "Translation")
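
# Illustrative sketch (not part of the original leaderboard code): assuming each
# per-model results JSON maps a task's `benchmark` key to a dict of metric values,
# a leaderboard column could be filled roughly as shown below. The results layout
# used here is an assumption, not the app's actual schema.
def get_task_score(results: dict, task: Task):
    """Return the metric value for `task` from a results dict, or None if absent."""
    task_results = results.get(task.benchmark, {})  # e.g. results["reasoning"]
    return task_results.get(task.metric)            # e.g. task_results["score"]
# Hypothetical usage: get_task_score(loaded_json, Tasks.task9.value)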
# ---------------------------------------------------
# Your leaderboard name
TITLE_IMAGE = """<img src="https://raw.githubusercontent.com/IBM/unitxt/main/assets/catalog/blue_bench_high_res_01.png" style="display: block; margin-left: auto; margin-right: auto; width: 10%;"/>"""
TITLE = """<h1 align="center" id="space-title">BlueBench Leaderboard</h1>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
BlueBench is an open-source benchmark developed by domain experts to represent the needs of enterprise users.
It is constructed using state-of-the-art benchmarking methodologies to ensure validity, robustness, and efficiency by utilizing <a href="https://www.unitxt.ai">unitxt</a>’s abilities for dynamic and flexible text processing.
As a dynamic and evolving benchmark, BlueBench currently encompasses diverse domains such as legal, finance, customer support, and news. It also evaluates a range of capabilities, including RAG, pro-social behavior, summarization, and chatbot performance, with additional tasks and domains to be integrated over time.
"""
# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = """
## How it works
BlueBench was designed with four goals in mind: representativeness, validity, robustness, and efficiency.

* **Representative**: the task distribution reflects the skills required in an enterprise setting
* **Valid**: tasks measure what they aim to measure
* **Robust**: evaluation goes beyond a single prompt, since models are brittle to prompt phrasing
* **Efficient**: evaluation is fast and inexpensive

BlueBench comprises the following subtasks:
<style>
table th:first-of-type {
width: 20%;
}
table th:nth-of-type(2) {
width: 20%;
}
table th:nth-of-type(3) {
width: 60%;
}
</style>
| Task | Datasets | Description |
|------|----------|-------------|
| Reasoning | <pre><p><b>Hellaswag</b></p>[Dataset](https://huggingface.co/datasets/Rowan/hellaswag), [Paper](https://arxiv.org/abs/1905.07830), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.hellaswag.html)</pre> | <p>Commonsense natural language inference</p>Given an event description such as "A woman sits at a piano," a machine must select the most likely follow-up: "She sets her fingers on the keys." Gathered via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. |
| Reasoning | <pre><p><b>Openbook QA</b></p>[Dataset](https://huggingface.co/datasets/allenai/openbookqa), [Paper](https://aclanthology.org/D18-1260/), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.openbook_qa.html)</pre> | <p>Question answering dataset using open book exams</p>Accompanying the questions is a set of 1326 elementary-level science facts. Roughly 6000 questions probe an understanding of these facts and their application to novel situations. This requires combining an open book fact (e.g., metals conduct electricity) with broad common knowledge (e.g., a suit of armor is made of metal) obtained from other sources. |
| Machine Translation | <pre><p><b>Flores 101</b></p>[Dataset](https://huggingface.co/datasets/gsarti/flores_101), [Paper](https://arxiv.org/abs/2106.03193), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.mt.flores_101.__dir__.html)</pre> | <p>Benchmark dataset for machine translation</p>There are 101 languages in this dataset; each sentence appears in all languages, for a total of `2k` sentences. We use the following language pairs: `["ara_eng", "deu_eng", "eng_ara", "eng_deu", "eng_fra", "eng_kor", "eng_por", "eng_ron", "eng_spa", "fra_eng", "jpn_eng", "kor_eng", "por_eng", "ron_eng", "spa_eng"]`. |
| Chatbot Abilities | <pre><p><b>Arena Hard</b></p>[Dataset](https://huggingface.co/datasets/lmsys/arena-hard-auto-v0.1), [Paper](https://arxiv.org/abs/2406.11939), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.arena_hard.generation.english_gpt_4_0314_reference.html)</pre> | <p>An automatic evaluation tool for instruction-tuned LLMs</p>Contains 500 challenging user queries sourced from Chatbot Arena. We prompt GPT-4-Turbo as judge to compare the models' responses against a baseline model (default: GPT-4-0314; here we use `llama-3.3-70b`). |
| Classification | <pre><p><b>20_newsgroups</b></p>[Dataset](https://huggingface.co/datasets/SetFit/20_newsgroups), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.20_newsgroups.html)</pre> | <p>News article classification</p>The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon messages posted before and after a specific date. |
| Bias | <pre><p><b>BBQ</b></p>[Dataset](https://huggingface.co/datasets/heegyu/bbq), [Paper](https://arxiv.org/abs/2110.08193), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.safety.bbq.__dir__.html)</pre> | <p>Question sets constructed to highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts.</p>It is well documented that NLP models learn social biases, but little work has been done on how these biases manifest in model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts. Our task evaluates model responses at two levels: (i) given an under-informative context, we test how strongly responses reflect social biases, and (ii) given an adequately informative context, we test whether the model's biases override a correct answer choice. We find that models often rely on stereotypes when the context is under-informative, meaning the model's outputs consistently reproduce harmful biases in this setting. Though models are more accurate when the context provides an informative answer, they still rely on stereotypes and average up to 3.4 percentage points higher accuracy when the correct answer aligns with a social bias than when it conflicts, with this difference widening to over 5 points on examples targeting gender for most models tested. |
| Legal Reasoning | <pre><p><b>Legalbench</b></p>[Dataset](https://huggingface.co/datasets/nguha/legalbench), [Paper](https://arxiv.org/abs/2308.11462), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.legalbench.__dir__.html)</pre> | <p>Evaluating legal reasoning in English large language models (LLMs).</p>LegalBench tasks span multiple types (binary classification, multi-class classification, extraction, generation, entailment), multiple types of text (statutes, judicial opinions, contracts, etc.), and multiple areas of law (evidence, contracts, civil procedure, etc.). For more information on tasks, we recommend visiting the website, where you can search through task descriptions, or the Github repository, which contains more granular task descriptions. We also recommend reading the paper, which provides more background on task significance and construction process. |
| Product Help | <pre><p><b>CFPB</b></p>[Dataset](https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud/data/cfpb_complaints/cfpb_compliants.csv), [Unitxt Card](https://www.unitxt.ai/en/1.7.0_a/catalog.cards.CFPB.product.2023.html)</pre> | <p>A collection of complaints about consumer financial products and services that we sent to companies for response.</p>It is a special, high-quality subset that was gathered and refined by teams at IBM. The Consumer Complaint Database is a collection of complaints about consumer financial products and services that we sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. Complaints referred to other regulators, such as complaints about depository institutions with less than $10 billion in assets, are not published in the Consumer Complaint Database. The database generally updates daily. Complaints can give us insights into problems people are experiencing in the marketplace and help us regulate consumer financial products and services under existing federal consumer financial laws, enforce those laws judiciously, and educate and empower consumers to make informed financial decisions. We also report on complaint trends annually in Consumer Response’s Annual Report to Congress. |
| General Knowledge | <pre><p><b>MMLU Pro</b></p>[Dataset](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro), [Paper](https://arxiv.org/abs/2406.01574), [Unitxt Card](https://www.unitxt.ai/en/1.11.0/catalog/catalog.cards.mmlu_pro.__dir__.html)</pre> | <p>Massive multi-task understanding.</p>The MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models’ capabilities. This dataset contains 12K complex questions across various disciplines. MMLU-Pro is an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. |
| Entity Extraction | <pre><p><b>Universal NER</b></p>[Dataset](https://huggingface.co/datasets/universalner/universal_ner), [Paper](https://aclanthology.org/2024.naacl-long.243/), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.universal_ner.da.ddt.html)</pre> | <p>Benchmarks for Named Entity Recognition (NER) across multiple languages.</p>Universal NER (UNER) is an open, community-driven initiative aimed at creating gold-standard benchmarks for Named Entity Recognition (NER) across multiple languages. The primary objective of UNER is to offer high-quality, cross-lingually consistent annotations, thereby standardizing and advancing multilingual NER research. UNER v1 includes 19 datasets with named entity annotations, uniformly structured across 13 diverse languages. In BlueBench, we only use the English subset ("en.ewt"). |
| Safety | <pre><p><b>AttaQ 500</b></p>[Dataset](https://huggingface.co/datasets/ibm/AttaQ), [Paper](https://aclanthology.org/2023.gem-1.10/), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.attaq_500.html)</pre> | <p>Evaluate Large Language Models (LLMs) by assessing their tendency to generate harmful or undesirable responses.</p>The AttaQ red teaming dataset, consisting of 1402 carefully crafted adversarial questions, is designed to evaluate Large Language Models (LLMs) by assessing their tendency to generate harmful or undesirable responses. It may serve as a benchmark to assess the potential harm of responses produced by LLMs. The dataset is categorized into seven distinct classes of questions: deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence. Researchers and developers can use this dataset to assess the behavior of LLMs and explore the various factors that influence their responses, ultimately aiming to enhance their harmlessness and ethical usage. |
| Bill Summarization | <pre><p><b>BillSUM</b></p>[Dataset](https://huggingface.co/datasets/FiscalNote/billsum), [Paper](https://aclanthology.org/D19-5406/), [Unitxt Card](https://www.unitxt.ai/en/stable/catalog/catalog.cards.billsum.html)</pre> | <p>Summarization of US Congressional and California state bills.</p>The data consists of three parts: US training bills, US test bills and California test bills. The US bills were collected from the Govinfo service provided by the United States Government Publishing Office (GPO) under a CC0-1.0 license. The California bills, from the 2015-2016 session, are available from the legislature’s website. |
| Post Summarization | <pre><p><b>TL;DR</b></p>[Dataset](https://huggingface.co/datasets/webis/tldr-17), [Paper](https://aclanthology.org/W17-4508/), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.tldr.html)</pre> | <p>Summarization dataset.</p>A large Reddit crawl, taking advantage of the common practice of appending a “TL;DR” to long posts. |
| RAG Response Generation | <pre><p><b>ClapNQ</b></p>[Dataset](https://huggingface.co/datasets/PrimeQA/clapnq), [Paper](https://arxiv.org/abs/2404.02103), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.rag.response_generation.clapnq.html)</pre> | <p>A benchmark for Long-form Question Answering.</p>CLAP NQ includes long answers with grounded gold passages from Natural Questions (NQ) and a corpus to perform either retrieval, generation, or the full RAG pipeline. The CLAP NQ answers are concise, 3x smaller than the full passage, and cohesive, with multiple pieces of the passage that are not contiguous. CLAP NQ is created from the subset of Natural Questions (NQ) that have a long answer but no short answer. NQ consists of ~380k examples. There are ~30k questions that have long answers but no short answers, excluding tables and lists. To increase the likelihood of longer answers we only explored ones that have more than 5 sentences in the passage. The subset that was annotated consists of ~12k examples. All examples where cohesion of non-consecutive sentences was required for the answer were annotated a second time. The final dataset is made up of all data that went through two rounds of annotation. (We also provide the single-round annotations; they are training data only.) An equal number of unanswerable questions has also been added from the original NQ train/dev sets. |
| QA Finance | <pre><p><b>FinQA</b></p>[Dataset](https://huggingface.co/datasets/ibm/finqa), [Paper](https://arxiv.org/abs/2109.00122), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.fin_qa.html)</pre> | <p>A large-scale dataset with 2.8k financial reports for 8k Q&A pairs to study numerical reasoning with structured and unstructured evidence.</p>The FinQA dataset is designed to facilitate research and development in the area of question answering (QA) using financial texts. It consists of a subset of QA pairs from a larger dataset, originally created through a collaboration between researchers from the University of Pennsylvania, J.P. Morgan, and Amazon. The original dataset includes 8,281 QA pairs built against publicly available earnings reports of S&P 500 companies from 1999 to 2019 (FinQA: A Dataset of Numerical Reasoning over Financial Data). This subset, specifically curated by Aiera, consists of 91 QA pairs. Each entry in the dataset includes a context, a question, and an answer, with each component manually verified for accuracy and formatting consistency. |
## Reproducibility
To reproduce our results, here are the commands you can run:
```
pip install "unitxt[bluebench]"
unitxt-evaluate --tasks "benchmarks.bluebench" --model cross_provider --model_args "model_name=$MODEL_TO_EVALUATE_IN_LITELLM_FORMAT,max_tokens=256" --output_path ./results/bluebench --log_samples --trust_remote_code --batch_size 8
```
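Here, `$MODEL_TO_EVALUATE_IN_LITELLM_FORMAT` is a placeholder for the model under evaluation, written as a LiteLLM-style identifier (typically a `provider/model-name` string); which providers are usable depends on the inference credentials configured in your environment.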
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
"""