from datetime import datetime

import pytz

ABOUT_TEXT = """
## Overview

HREF is an evaluation benchmark that evaluates language models' ability to follow human instructions. It consists of 4,258 instructions covering 11 distinct categories, including general chat capabilities such as brainstorming, question answering, and summarization, as well as categories focused on scientific text understanding, such as reasoning over numerical data.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64dff1ddb5cc372803af964d/0TK6xku0gdJPDs_nfwzns.png)

## Generation Configuration

For reproducibility, we use greedy decoding for all models by default. We apply chat templates to the instructions if they are implemented in the model's tokenizer or explicitly recommended by the model's creators. Please contact us if you would like to change this default configuration.

## Why HREF

| Benchmark      | Size  | Evaluation Method | Baseline Model          | Judge Model            | Task Oriented | Contamination Resistant | Contains Human Reference |
|----------------|-------|-------------------|-------------------------|------------------------|---------------|-------------------------|--------------------------|
| MT-Bench       | 80    | Score             | ---                     | gpt4                   | ✓             | ✗                       | ✗                        |
| AlpacaEval 2.0 | 805   | PWC               | gpt4-turbo              | gpt4-turbo             | ✗             | ✗                       | ✗                        |
| Chatbot Arena  | ---   | PWC               | ---                     | Human                  | ✗             | ✓                       | ✗                        |
| Arena-Hard     | 500   | PWC               | gpt4-0314               | gpt4-turbo             | ✗             | ✗                       | ✗                        |
| WildBench      | 1,024 | Score/PWC         | gpt4-turbo              | three models           | ✗             | ✗                       | ✗                        |
| **HREF**       | 4,258 | PWC               | Llama-3.1-405B-Instruct | Llama-3.3-70B-Instruct | ✓             | ✓                       | ✓                        |

- **Human Reference**: HREF leverages human-written responses to provide more reliable evaluation than previous methods.
- **Large**: HREF has the largest evaluation set among similar benchmarks, making its evaluation more reliable.
- **Contamination-resistant**: HREF's evaluation set is hidden, and it uses public models as both the baseline model and the judge model, which makes it completely free of contamination.
- **Task Oriented**: Instead of collecting prompts from users, HREF contains instructions written specifically to target 11 distinct categories that are commonly used for instruction tuning, which allows it to provide more insight into how to improve language models.
"""

# Get the Pacific time zone (handles PST/PDT automatically).
pacific_tz = pytz.timezone('America/Los_Angeles')
current_time = datetime.now(pacific_tz).strftime("%H:%M %Z, %d %b %Y")

TOP_TEXT = f"""# HREF: Human Response-Guided Evaluation of Instruction Following in Language Models
[Code](https://github.com/allenai/href) | [Validation Set](https://huggingface.co/datasets/allenai/href) | [Human Agreement Set](https://huggingface.co/datasets/allenai/href_preference) | [Results](https://huggingface.co/datasets/allenai/href_results) | [Paper](https://arxiv.org/abs/2412.15524) | Total models: {{}} | Last restart (PST): {current_time}
"""
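

# Illustrative sketch only (not HREF's actual generation code): one way the default
# generation configuration described in ABOUT_TEXT could be implemented with
# Hugging Face `transformers` -- greedy decoding, applying the tokenizer's chat
# template when one is available. The model name and `max_new_tokens` value below
# are placeholders, not values taken from the benchmark.
def generate_response(model_name: str, instruction: str, max_new_tokens: int = 1024) -> str:
    # Imported lazily so this module stays importable without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

    # Apply the chat template if the tokenizer implements one; otherwise fall back
    # to the raw instruction text.
    if tokenizer.chat_template is not None:
        prompt = tokenizer.apply_chat_template(
            [{"role": "user", "content": instruction}],
            tokenize=False,
            add_generation_prompt=True,
        )
    else:
        prompt = instruction

    inputs = tokenizer(prompt, return_tensors="pt")
    # do_sample=False => greedy decoding, the benchmark's default for reproducibility.
    outputs = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens and return only the newly generated continuation.
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)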
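

# Illustrative sketch only (not HREF's actual judging pipeline): how a pairwise-comparison
# (PWC) win rate against the baseline model could be aggregated. `judge_prefers_model` is a
# hypothetical callable standing in for the judge model (Llama-3.3-70B-Instruct) comparing a
# model response with the baseline (Llama-3.1-405B-Instruct) response, with the human-written
# reference available for guidance. Tie handling here is a common convention and may differ
# from HREF's.
def pairwise_win_rate(model_responses, baseline_responses, references, judge_prefers_model):
    """Fraction of instructions on which the judge prefers the model over the baseline,
    counting ties as half a win."""
    wins = 0.0
    for model_resp, base_resp, ref in zip(model_responses, baseline_responses, references):
        verdict = judge_prefers_model(model_resp, base_resp, ref)  # "model", "baseline", or "tie"
        if verdict == "model":
            wins += 1.0
        elif verdict == "tie":
            wins += 0.5
    return wins / len(model_responses)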