# Evaluation Guidelines

## 1. Selected Metrics

### 1.1 Relevance Metric

Combines elements of **equivalence** (semantic match with the ground truth) and **relevance** (the degree to which the answer directly addresses the question). Graded on a four-point scale:

- **2:** Correct and relevant (no irrelevant information).
- **1:** Correct but contains irrelevant information.
- **0:** No answer provided (abstention).
- **-1:** Incorrect answer.

### 1.2 Faithfulness Metric

Assesses whether the response is **grounded in the retrieved passages**. Graded on a three-point scale:

- **1:** Full support. All parts of the answer are grounded.
- **0:** Partial support. Some parts of the answer are not grounded.
- **-1:** No support. No part of the answer is grounded.

### 1.3 Combination of Metrics

Both **relevance** and **faithfulness** contribute to the final evaluation score. The specific formula for combining these metrics is not disclosed to participants, but it will prioritize correctness and grounding. An illustrative scoring sketch is given at the end of this section.

## 2. Manual and Automated Evaluation

### 2.1 First Stage

- Automated evaluation by the LLM **Claude 3.5 Sonnet**, using the **relevance** and **faithfulness** metrics to rank the participating teams.

### 2.2 Final Stage

- **Manual evaluation** of the top-ranked submissions (e.g., the **top 10 teams**) to determine the winners.

## 3. Other Notable Points

- A strict **length limit of 200 tokens** will be imposed to encourage concise answers.
- Participants will submit:
  - **The answer.**
  - **All supporting passages.**
  - **The full prompt used for generation.**

These measures align the evaluation framework with the challenge's emphasis on **retrieval-augmented systems**.
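The following Python sketch encodes the two grading scales and shows one *hypothetical* way a per-question score could be combined and the 200-token limit checked. The averaging in `combined_score`, the whitespace `count_tokens` placeholder, and the field names in `Submission` are illustrative assumptions, not the organizers' actual formula, tokenizer, or schema.

```python
from dataclasses import dataclass

# Grading scales as defined in Section 1.
RELEVANCE_SCALE = {
    2: "Correct and relevant (no irrelevant information)",
    1: "Correct but contains irrelevant information",
    0: "No answer provided (abstention)",
    -1: "Incorrect answer",
}

FAITHFULNESS_SCALE = {
    1: "Full support: all parts of the answer are grounded",
    0: "Partial support: some parts of the answer are not grounded",
    -1: "No support: no part of the answer is grounded",
}

MAX_ANSWER_TOKENS = 200  # strict length limit from Section 3


@dataclass
class Submission:
    """Illustrative submission record (field names are assumptions)."""
    answer: str
    supporting_passages: list[str]
    prompt: str


def count_tokens(text: str) -> int:
    """Placeholder tokenizer: whitespace split. The official tokenizer
    is not specified in the guidelines; substitute the real one here."""
    return len(text.split())


def combined_score(relevance: int, faithfulness: int) -> float:
    """Hypothetical combination: the real formula is not disclosed.
    This sketch simply averages the two grades, reflecting only the
    stated intent that correctness and grounding both matter."""
    if relevance not in RELEVANCE_SCALE or faithfulness not in FAITHFULNESS_SCALE:
        raise ValueError("grade outside the defined scales")
    return (relevance + faithfulness) / 2


def within_length_limit(submission: Submission) -> bool:
    """True if the answer respects the 200-token limit."""
    return count_tokens(submission.answer) <= MAX_ANSWER_TOKENS
```

For example, `combined_score(2, 1)` yields 1.5 under this placeholder averaging; the undisclosed official formula may weight or scale the two metrics differently.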