Delete Operational_Instructions/Evaluation_Guidelines_for_LiveRAG.md
Operational_Instructions/Evaluation_Guidelines_for_LiveRAG.md
DELETED
@@ -1,43 +0,0 @@
# Evaluation Guidelines
## 1. Selected Metrics
### 1.1 Relevance Metric
Combines elements of **equivalence** (semantic match with ground truth) and **relevance** (degree to which the answer directly addresses the question).
Graded on a four-point scale:
- **2:** Correct and relevant (no irrelevant information).
- **1:** Correct but contains irrelevant information.
- **0:** No answer provided (abstention).
- **-1:** Incorrect answer.
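
The four-point relevance scale above can be sketched as a small enumeration. This is only an illustration: the names `Relevance` and `grade_relevance`, and the three boolean judgments the helper takes as input, are assumptions made for this example and are not part of the guidelines.

```python
from enum import IntEnum


class Relevance(IntEnum):
    """Four-point relevance scale from Section 1.1."""
    CORRECT_AND_RELEVANT = 2
    CORRECT_WITH_IRRELEVANT_INFO = 1
    ABSTAINED = 0
    INCORRECT = -1


def grade_relevance(abstained: bool, correct: bool,
                    has_irrelevant_info: bool) -> Relevance:
    """Map three judgments onto the scale.

    Hypothetical helper: how a judge reaches these three booleans
    is outside the scope of this sketch.
    """
    if abstained:
        return Relevance.ABSTAINED
    if not correct:
        return Relevance.INCORRECT
    if has_irrelevant_info:
        return Relevance.CORRECT_WITH_IRRELEVANT_INFO
    return Relevance.CORRECT_AND_RELEVANT
```

Note that under this scale an incorrect answer scores lower than an abstention, so a system that cannot answer reliably is better off abstaining.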
### 1.2 Faithfulness Metric
Assesses whether the response is **grounded in the retrieved passages**.
Graded on a three-point scale:
- **1:** Full support. All answer parts are grounded.
- **0:** Partial support. Not all answer parts are grounded.
- **-1:** No support. None of the answer parts are grounded.
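
A minimal sketch of the three-point faithfulness scale, assuming each answer has already been split into parts and each part has been judged as grounded or not. The names `Faithfulness` and `grade_faithfulness` are hypothetical, and raising an error on an answer with no parts is a choice made for this example, since the guidelines do not cover that case.

```python
from enum import IntEnum
from typing import Sequence


class Faithfulness(IntEnum):
    """Three-point faithfulness scale from Section 1.2."""
    FULL_SUPPORT = 1
    PARTIAL_SUPPORT = 0
    NO_SUPPORT = -1


def grade_faithfulness(parts_grounded: Sequence[bool]) -> Faithfulness:
    """Score an answer from per-part grounding judgments.

    Assumption: `parts_grounded` holds one boolean per answer part,
    True when that part is supported by the retrieved passages.
    """
    if not parts_grounded:
        # Not specified in the guidelines; rejected here for clarity.
        raise ValueError("answer has no parts to grade")
    if all(parts_grounded):
        return Faithfulness.FULL_SUPPORT
    if any(parts_grounded):
        return Faithfulness.PARTIAL_SUPPORT
    return Faithfulness.NO_SUPPORT
```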
### 1.3 Combination of Metrics
Both **relevance** and **faithfulness** will contribute to the final evaluation score.
The specific formula for combining these metrics is not disclosed to participants but will prioritize correctness and grounding.
## 2. Manual and Automated Evaluation
### 2.1 First Stage
- Automated evaluation by the LLM **Claude 3.5 Sonnet**, using the **relevance** and **faithfulness** metrics to rank the participant teams.
### 2.2 Final Stage
- **Manual evaluation** for the top-ranked submissions (e.g., **top 10 teams**) to determine winners.
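
The hand-off between the two stages can be pictured as a top-k selection over the automated ranking. The sketch below assumes each team already has a single automated score; how that score is computed is not disclosed, so it is taken as given. The function name `select_finalists`, the default `k=10` (taken from the "top 10 teams" example), and the alphabetical tie-break are all assumptions.

```python
from typing import Dict, List


def select_finalists(automated_scores: Dict[str, float],
                     k: int = 10) -> List[str]:
    """Return the k highest-scoring teams, best first.

    Assumption: ties are broken alphabetically by team name,
    only so that the result is deterministic.
    """
    ranked = sorted(automated_scores,
                    key=lambda team: (-automated_scores[team], team))
    return ranked[:k]
```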
## 3. Other Notable Points
- Answer length is **unlimited**, but only the first **300 words** will be evaluated.
- Participants will submit:
  - **The answer**.
  - **All supporting passages**.
  - **The full prompt used for generation**.
These measures align the evaluation framework with the challenge's emphasis on **retrieval-augmented systems**.
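
As a rough illustration of the points in Section 3, the sketch below models a submission with its three required components and extracts the portion of an answer that would be evaluated. The class and field names are hypothetical and do not reflect an official submission format, and treating a "word" as a whitespace-delimited token is an assumption, since the guidelines do not define how words are counted.

```python
from dataclasses import dataclass
from typing import List

EVALUATED_WORD_LIMIT = 300  # from Section 3


@dataclass
class Submission:
    """Hypothetical container for the three required components."""
    answer: str
    supporting_passages: List[str]
    full_prompt: str


def evaluated_portion(answer: str,
                      limit: int = EVALUATED_WORD_LIMIT) -> str:
    """Return the first `limit` words of the answer.

    Assumption: words are whitespace-delimited tokens, so the
    original spacing is normalized to single spaces.
    """
    return " ".join(answer.split()[:limit])
```

Since anything past the limit is ignored, it is worth placing the core of the answer first.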