# Evaluation Guidelines
## 1. Selected Metrics
### 1.1 Relevance Metric
Combines elements of **equivalence** (semantic match with the ground truth) and **relevance** (the degree to which the answer directly addresses the question).
Graded on a four-point scale:
- **2:** Correct and relevant (no irrelevant information).
- **1:** Correct but contains irrelevant information.
- **0:** No answer provided (abstention).
- **-1:** Incorrect answer.
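
The four-point scale can be represented as a small enumeration. The sketch below is only an illustrative Python encoding; the class and member names are not part of the guidelines.

```python
from enum import IntEnum

class Relevance(IntEnum):
    """Illustrative encoding of the four-point relevance scale."""
    CORRECT_AND_RELEVANT = 2   # correct, no irrelevant information
    CORRECT_WITH_NOISE = 1     # correct but contains irrelevant information
    NO_ANSWER = 0              # abstention
    INCORRECT = -1             # incorrect answer
```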
### 1.2 Faithfulness Metric
Assesses whether the response is **grounded in the retrieved passages**.
Graded on a three-point scale:
- **1:** Full support. All answer parts are grounded.
- **0:** Partial support. Some answer parts are not grounded.
- **-1:** No support. No answer part is grounded.
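
Analogously, a minimal sketch of the three-point faithfulness scale (names are illustrative, not official):

```python
from enum import IntEnum

class Faithfulness(IntEnum):
    """Illustrative encoding of the three-point faithfulness scale."""
    FULL_SUPPORT = 1      # every answer part is grounded in the retrieved passages
    PARTIAL_SUPPORT = 0   # some answer parts are not grounded
    NO_SUPPORT = -1       # no answer part is grounded
```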
### 1.3 Combination of Metrics
Both **relevance** and **faithfulness** will contribute to the final evaluation score.
The specific formula for combining these metrics is not disclosed to participants, but it will prioritize correctness and grounding.
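
Since the official formula is undisclosed, the following is purely a hypothetical example of how the two scales might be normalized and weighted; the weights and the normalization are assumptions, not the organizers' method.

```python
def combined_score(relevance: int, faithfulness: int,
                   w_rel: float = 0.7, w_faith: float = 0.3) -> float:
    """Hypothetical combination of the two metrics (NOT the official formula).

    Each scale is mapped to [0, 1] and the two are mixed with assumed
    weights that favor relevance (correctness) over faithfulness (grounding).
    """
    rel_norm = (relevance + 1) / 3       # maps {-1, 0, 1, 2} to [0, 1]
    faith_norm = (faithfulness + 1) / 2  # maps {-1, 0, 1} to [0, 1]
    return w_rel * rel_norm + w_faith * faith_norm
```

Under these assumed weights, a fully correct but only partially grounded answer would score `combined_score(2, 0) = 0.85`.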
## 2. Manual and Automated Evaluation
### **2.1 First Stage**
- Automated evaluation by the LLM **Claude 3.5 Sonnet**, using the **relevance** and **faithfulness** metrics to rank the participating teams.
### **2.2 Final Stage**
- **Manual evaluation** of the top-ranked submissions (e.g., the **top 10 teams**) to determine the winners.
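
The guidelines do not publish the judge prompt used in the automated stage. The sketch below only illustrates how a rubric-based LLM-as-judge request could be assembled; the function name, wording, and JSON output format are assumptions.

```python
def build_judge_prompt(question: str, answer: str, passages: list[str]) -> str:
    """Hypothetical rubric prompt for an LLM judge; not the organizers' actual prompt."""
    context = "\n\n".join(f"[Passage {i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Grade the answer to the question below using the retrieved passages.\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        f"Passages:\n{context}\n\n"
        "Relevance scale: 2 correct and relevant, 1 correct with irrelevant "
        "information, 0 no answer, -1 incorrect.\n"
        "Faithfulness scale: 1 full support, 0 partial support, -1 no support.\n"
        'Reply with JSON: {"relevance": <int>, "faithfulness": <int>}.'
    )
```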
## 3. Other Notable Points
- A strict **length limit of 200 tokens** will be imposed to encourage concise answers.
- Participants will submit (see the sketch below):
  - **The answer**.
  - **All supporting passages**.
  - **The full prompt used for generation**.

These measures align the evaluation framework with the challenge's emphasis on **retrieval-augmented systems**.
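
A minimal sketch of what a single submission record might look like, together with a rough length check. The field names are illustrative, and the guidelines do not specify which tokenizer defines the 200-token limit, so whitespace-separated words are used here as a stand-in.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    """Illustrative submission record; field names are not the official schema."""
    answer: str                     # the generated answer
    supporting_passages: list[str]  # all passages used to ground the answer
    generation_prompt: str          # the full prompt used for generation

def within_length_limit(answer: str, limit: int = 200) -> bool:
    """Rough check of the 200-token limit.

    The official tokenizer is not specified in these guidelines, so this
    sketch approximates tokens by whitespace-separated words.
    """
    return len(answer.split()) <= limit
```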