Orensomekh committed · Commit 771014b (verified) · 1 Parent(s): 9f41814

Delete Operational_Instructions/Evaluation_Guidelines_for_LiveRAG.md

Operational_Instructions/Evaluation_Guidelines_for_LiveRAG.md DELETED
@@ -1,43 +0,0 @@
- # Evaluation Guidelines
-
- ## 1. Selected Metrics
-
- ### 1.1 Relevance Metric
- Combines elements of **equivalence** (semantic match with ground truth) and **relevance** (degree to which the answer directly addresses the question).
-
- Graded on a four-point scale:
- - **2:** Correct and relevant (no irrelevant information).
- - **1:** Correct but contains irrelevant information.
- - **0:** No answer provided (abstention).
- - **-1:** Incorrect answer.
-
- ### 1.2 Faithfulness Metric
- Assesses whether the response is **grounded in the retrieved passages**.
-
- Graded on a three-point scale:
- - **1:** Full support. All answer parts are grounded.
- - **0:** Partial support. Not all answer parts are grounded.
- - **-1:** No support. All answer parts are not grounded.
-
- ### 1.3 Combination of Metrics
- Both **relevance** and **faithfulness** will contribute to the final evaluation score.
-
- The specific formula for combining these metrics is not disclosed to participants but will prioritize correctness and grounding.
-
-
- ## 2. Manual and Automated Evaluation
-
- ### **2.1 First Stage:**
- - Automated evaluation by LLM **Claude 3.5 Sonnet**, using **relevance** and **faithfulness** metrics to rank the participant teams.
-
- ### **2.2 Final Stage:**
- - **Manual evaluation** for the top-ranked submissions (e.g., **top 10 teams**) to determine winners.
-
- ## 3. Other Notable Points
- - Answer length is **unlimited** but only the first **300 words** will be evaluated.
- - Participants will submit:
-   - **The answer**.
-   - **All supporting passages**.
-   - **The full prompt used for generation**.
-
- These measures align the evaluation framework with the challenge's emphasis on **retrieval-augmented systems**.
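
For readers who want to work with the grading scales from the deleted guidelines, the following is a minimal Python sketch, not part of the repository. It encodes the relevance scale (section 1.1) and the faithfulness scale (section 1.2) as enums and applies the 300-word cutoff from section 3. The names `Relevance`, `Faithfulness`, `evaluated_portion`, and `parse_grades` are hypothetical, and treating a "word" as a whitespace-delimited token is an assumption, since the guidelines do not say how words are counted. The formula for combining the two metrics is undisclosed, so none is shown.

```python
from enum import IntEnum
from typing import Tuple


class Relevance(IntEnum):
    """Relevance grades from section 1.1 (four-point scale)."""
    CORRECT_AND_RELEVANT = 2
    CORRECT_WITH_IRRELEVANT_INFO = 1
    ABSTENTION = 0
    INCORRECT = -1


class Faithfulness(IntEnum):
    """Faithfulness grades from section 1.2 (three-point scale)."""
    FULL_SUPPORT = 1
    PARTIAL_SUPPORT = 0
    NO_SUPPORT = -1


# Section 3: only the first 300 words of an answer are evaluated.
MAX_EVALUATED_WORDS = 300


def evaluated_portion(answer: str, limit: int = MAX_EVALUATED_WORDS) -> str:
    """Return the first `limit` words of the answer.

    Assumption: words are whitespace-delimited tokens.
    """
    words = answer.split()
    return " ".join(words[:limit])


def parse_grades(relevance: int, faithfulness: int) -> Tuple[Relevance, Faithfulness]:
    """Validate raw integer grades; raises ValueError if a grade is off-scale."""
    return Relevance(relevance), Faithfulness(faithfulness)
```

For example, `parse_grades(2, -1)` would describe an answer judged correct and relevant but with no support in the retrieved passages, while `parse_grades(3, 0)` raises `ValueError` because 3 is not on the relevance scale.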