Spaces: LiveRAG/Challenge


Orensomekh committed on Apr 27

Commit 771014b (verified) · 1 Parent(s): 9f41814

Delete Operational_Instructions/Evaluation_Guidelines_for_LiveRAG.md


Files changed (1)

Operational_Instructions/Evaluation_Guidelines_for_LiveRAG.md +0 -43

Operational_Instructions/Evaluation_Guidelines_for_LiveRAG.md DELETED

@@ -1,43 +0,0 @@
-# Evaluation Guidelines
-## 1. Selected Metrics
-### 1.1 Relevance Metric
-Combines elements of **equivalence** (semantic match with ground truth) and **relevance** (degree to which the answer directly addresses the question).
-Graded on a four-point scale:
-- **2:** Correct and relevant (no irrelevant information).
-- **1:** Correct but contains irrelevant information.
-- **0:** No answer provided (abstention).
-- **-1:** Incorrect answer.
-### 1.2 Faithfulness Metric
-Assesses whether the response is **grounded in the retrieved passages**.
-Graded on a three-point scale:
-- **1:** Full support. All answer parts are grounded.
-- **0:** Partial support. Not all answer parts are grounded.
-- **-1:** No support. None of the answer parts are grounded.
-### 1.3 Combination of Metrics
-Both **relevance** and **faithfulness** will contribute to the final evaluation score.
-The specific formula for combining these metrics is not disclosed to participants but will prioritize correctness and grounding.
-## 2. Manual and Automated Evaluation
-### 2.1 First Stage
-- Automated evaluation by an LLM, **Claude 3.5 Sonnet**, using the **relevance** and **faithfulness** metrics to rank the participant teams.
-### 2.2 Final Stage
-- **Manual evaluation** of the top-ranked submissions (e.g., **top 10 teams**) to determine the winners.
-## 3. Other Notable Points
-- Answer length is **unlimited**, but only the first **300 words** will be evaluated.
-- Participants will submit:
-  - **The answer**.
-  - **All supporting passages**.
-  - **The full prompt used for generation**.
-These measures align the evaluation framework with the challenge's emphasis on **retrieval-augmented systems**.
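
The grading scales and the 300-word cutoff described in the deleted guidelines can be sketched in a few lines of Python. This is an illustrative sketch only, not the organizers' evaluation code: the names `evaluated_portion`, `Grade`, and `Submission` are hypothetical, and treating a "word" as a whitespace-separated token is an assumption, since the guidelines do not define how words are counted. The formula that combines the two metrics is undisclosed, so no combined score is computed here.

```python
from dataclasses import dataclass
from typing import List

# Allowed grades, copied from sections 1.1 and 1.2 of the guidelines.
RELEVANCE_GRADES = {2, 1, 0, -1}
FAITHFULNESS_GRADES = {1, 0, -1}

# Section 3: only the first 300 words of an answer are evaluated.
MAX_EVALUATED_WORDS = 300


def evaluated_portion(answer: str, limit: int = MAX_EVALUATED_WORDS) -> str:
    """Return the first `limit` words of `answer`.

    Assumption: words are whitespace-separated tokens. Runs of
    whitespace are collapsed to single spaces as a side effect.
    """
    return " ".join(answer.split()[:limit])


@dataclass(frozen=True)
class Grade:
    """A pair of per-answer grades, validated against the published scales."""

    relevance: int
    faithfulness: int

    def __post_init__(self) -> None:
        if self.relevance not in RELEVANCE_GRADES:
            raise ValueError(f"relevance must be one of {sorted(RELEVANCE_GRADES)}")
        if self.faithfulness not in FAITHFULNESS_GRADES:
            raise ValueError(f"faithfulness must be one of {sorted(FAITHFULNESS_GRADES)}")


@dataclass(frozen=True)
class Submission:
    """The three items participants submit (hypothetical field names)."""

    answer: str
    passages: List[str]
    prompt: str

    def text_to_grade(self) -> str:
        # Only this truncated portion would be seen by the evaluator.
        return evaluated_portion(self.answer)
```

Keeping validation inside `Grade` means an out-of-range value, such as a relevance grade of 3, fails at construction time rather than silently skewing a ranking.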