Introduction

Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval

For more details, please refer to our paper and GitHub repository.

Usage

Using SGLang

import re
import sglang as sgl


PROMPT_TEMPLATE = """\
Here is the **relevance definition** in a retrieval task: {relevance_definition}

Now given a **query** ({query_type}) and a **document** ({doc_type}) in this retrieval task, your mission is to perform the following steps.

1. Query Analysis: Think to reason and describe what information would be most helpful in answering the query.
2. Document Analysis: Discuss how the information provided by the document fulfills or fails to fulfill the requirements implied by the query.
3. Relevance Annotation: Based on the relevance definition and the insights from the previous two steps, clearly justify your final relevance annotation result and annotate an integer score from a scale of 0 to 100. Please use the following guide:
    - **80-100 (Highly Relevant):** The document directly and comprehensively addresses the query's intent. It is a core and authoritative answer.
    - **60-80 (Relevant):** The document substantially addresses the query's intent, providing most of the key information, but might miss some minor details.
    - **40-60 (Moderately Relevant):** The document is on-topic and addresses a part of the query's intent, but it is not a comprehensive answer.
    - **20-40 (Slightly Relevant):** The document mentions keywords from the query, but its main topic is different. It offers very limited value.
    - **0-20 (Irrelevant):** The document does not address the query's intent at all and is off-topic.

After providing your detailed analysis and justification for all the steps above, conclude your entire response with the final relevance score. The score must be placed strictly between the <score> tags. There should be no other text or explanation inside the tags:
<score>
[From a scale of 0 to 100, annotate the degree of relevance between the query and the document.]
</score>

Query ({query_type}):
[Begin of Query]
{query}
[End of Query]

Document ({doc_type}):
[Begin of Document]
{doc}
[End of Document]
"""


def main():
    query = "In a party, how many guests do you need to have to ensure that either four people all know each other or four people are all complete strangers to one another?"
    doc = "\\section{Infinite Ramsey's Theorem}\nTags: Ramsey Theory, Named Theorems\n\n\\begin{theorem}\nLet $k, n \\in \\N$.\nFor any set $S$, let $S^{\\paren n}$ denote the set $\\set {\\set {s_1, \\ldots, s_n}: \\text{each } s_i \\in S}$ of cardinality $n$ subsets of $S$.\nLet $X$ be an infinite set.\nThen:\n:for every partition $P$ of $X^{\\paren n}$ into $k$ many components\n:there is an infinite subset $Y \\subseteq X$\nsuch that:\n:each member of $Y^{\\paren n}$ is in the same component of $P$.\n\\end{theorem}\n\n\\begin{proof}\nWe will prove the theorem for fixed $k$ by induction on $n$.\n\\end{proof}\n\n"
    query_type = "math problem"
    doc_type = "math-related passage"
    relevance_definition = "Given a query (math problem) and a document (math-related passage), the document is relevant to the query if the theorem described in the document can help solve the problem in the query."

    prompts = [
        PROMPT_TEMPLATE.format(
            relevance_definition=relevance_definition,
            query_type=query_type,
            doc_type=doc_type,
            query=query,
            doc=doc
        )
    ]
    
    llm = sgl.Engine(
        model_path="ljw13/retro-star-qwen3-14b-0928",
        tp_size=8,
        dp_size=1,
    )

    tokenizer = llm.tokenizer_manager.tokenizer
    messages = [[{"role": "user", "content": prompt}] for prompt in prompts]
    input_texts = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False
    )

    sampling_params = {
        "n": 1,
        "temperature": 0.6,
        "max_new_tokens": 1024,
        "skip_special_tokens": False,
        "spaces_between_special_tokens": False,
    }

    outputs = llm.generate(
        input_texts,
        sampling_params=sampling_params,
    )

    llm.shutdown()

    scores = []
    for output in outputs:
        print(output["text"])
        print("==" * 30)
        # Fall back to 0 if the model did not emit a well-formed <score> tag.
        match = re.search(r"<score>\s*(\d+)\s*</score>", output["text"])
        score = int(match.group(1)) if match else 0
        scores.append(score)

    print("Scores:", scores)

if __name__ == "__main__":
    main()
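
Since the model annotates each (query, document) pair with an integer relevance score between 0 and 100, a common downstream step is to rerank candidate documents by that score. The sketch below is a minimal, hedged example in plain Python (independent of SGLang); the helper names `extract_score` and `rerank` are our own and not part of the Retro* repository:

```python
import re


def extract_score(text: str, default: int = 0) -> int:
    """Pull the integer out of the <score> tags; fall back to `default`
    if the model did not emit a well-formed tag."""
    match = re.search(r"<score>\s*(\d+)\s*</score>", text)
    if match is None:
        return default
    # Clamp to the 0-100 range requested by the prompt.
    return max(0, min(100, int(match.group(1))))


def rerank(doc_ids, responses):
    """Sort document ids by annotated relevance score, descending."""
    scored = [(extract_score(r), d) for d, r in zip(doc_ids, responses)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored]


if __name__ == "__main__":
    responses = [
        "... analysis ...\n<score>\n30\n</score>",
        "... analysis ...\n<score>\n85\n</score>",
        "no tags emitted",
    ]
    print(rerank(["doc_a", "doc_b", "doc_c"], responses))
    # ['doc_b', 'doc_a', 'doc_c']
```

Documents whose responses lack a parsable tag receive the default score and sink to the bottom of the ranking.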

Citation

If you find this model useful, please consider giving it a like and citing our work:

@article{lan2025retro,
  title={Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval},
  author={Lan, Junwei and Chen, Jianlyu and Liu, Zheng and Li, Chaofan and Bao, Siqi and Lian, Defu},
  journal={arXiv preprint arXiv:2509.24869},
  year={2025}
}