# Evaluation Guidelines

## 1. Selected Metrics

### 1.1 Relevance Metric

Combines elements of **equivalence** (semantic match with the ground truth) and **relevance** (the degree to which the answer directly addresses the question). Graded on a four-point scale:

- **2:** Correct and relevant (no irrelevant information).
- **1:** Correct but contains irrelevant information.
- **0:** No answer provided (abstention).
- **-1:** Incorrect answer.

### 1.2 Faithfulness Metric

Assesses whether the response is **grounded in the retrieved passages**. Graded on a three-point scale:

- **1:** Full support. All parts of the answer are grounded.
- **0:** Partial support. Some parts of the answer are not grounded.
- **-1:** No support. No part of the answer is grounded.

### 1.3 Combination of Metrics

Both **relevance** and **faithfulness** contribute to the final evaluation score. The specific formula for combining these metrics is not disclosed to participants, but it will prioritize correctness and grounding. An illustrative scoring sketch is given at the end of this section.

## 2. Manual and Automated Evaluation

### 2.1 First Stage

- Automated evaluation by the LLM **Claude 3.5 Sonnet**, using the **relevance** and **faithfulness** metrics to rank the participating teams.

### 2.2 Final Stage

- **Manual evaluation** of the top-ranked submissions (e.g., the **top 10 teams**) to determine the winners.

## 3. Other Notable Points

- A strict **length limit of 200 tokens** will be imposed to encourage concise answers.
- Participants will submit:
  - **The answer.**
  - **All supporting passages.**
  - **The full prompt used for generation.**

These measures align the evaluation framework with the challenge's emphasis on **retrieval-augmented systems**.
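The following Python sketch encodes the two grading scales and shows one *hypothetical* way a per-question score could be combined and the 200-token limit checked. The averaging in `combined_score`, the whitespace `count_tokens` placeholder, and the field names in `Submission` are illustrative assumptions, not the organizers' actual formula, tokenizer, or schema.

```python
from dataclasses import dataclass

# Grading scales as defined in Section 1.
RELEVANCE_SCALE = {
    2: "Correct and relevant (no irrelevant information)",
    1: "Correct but contains irrelevant information",
    0: "No answer provided (abstention)",
    -1: "Incorrect answer",
}

FAITHFULNESS_SCALE = {
    1: "Full support: all parts of the answer are grounded",
    0: "Partial support: some parts of the answer are not grounded",
    -1: "No support: no part of the answer is grounded",
}

MAX_ANSWER_TOKENS = 200  # strict length limit from Section 3


@dataclass
class Submission:
    """Illustrative submission record (field names are assumptions)."""
    answer: str
    supporting_passages: list[str]
    prompt: str


def count_tokens(text: str) -> int:
    """Placeholder tokenizer: whitespace split. The official tokenizer
    is not specified in the guidelines; substitute the real one here."""
    return len(text.split())


def combined_score(relevance: int, faithfulness: int) -> float:
    """Hypothetical combination: the real formula is not disclosed.
    This sketch simply averages the two grades, reflecting only the
    stated intent that correctness and grounding both matter."""
    if relevance not in RELEVANCE_SCALE or faithfulness not in FAITHFULNESS_SCALE:
        raise ValueError("grade outside the defined scales")
    return (relevance + faithfulness) / 2


def within_length_limit(submission: Submission) -> bool:
    """True if the answer respects the 200-token limit."""
    return count_tokens(submission.answer) <= MAX_ANSWER_TOKENS
```

For example, `combined_score(2, 1)` yields 1.5 under this placeholder averaging; the undisclosed official formula may weight or scale the two metrics differently.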