---
pipeline_tag: text-ranking
library_name: transformers
language:
- en
base_model:
- Qwen/Qwen3-14B
tags:
- text-generation-inference
---

# Introduction

> Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval

For more details, please refer to our [paper](https://arxiv.org/pdf/2509.24869) and [GitHub repository](https://github.com/VectorSpaceLab/agentic-search).

# Usage

## Using SGLang

```python
import re

import sglang as sgl

PROMPT_TEMPLATE = """\
Here is the **relevance definition** in a retrieval task: {relevance_definition}

Now given a **query** ({query_type}) and a **document** ({doc_type}) in this retrieval task, your mission is to perform the following steps.
1. Query Analysis: Think to reason and describe what information would be most helpful in answering the query.
2. Document Analysis: Discuss how the information provided by the document fulfills or fails to fulfill the requirements implied by the query.
3. Relevance Annotation: Based on the relevance definition and the insights from the previous two steps, clearly justify your final relevance annotation result and annotate an integer score from a scale of 0 to 100. Please use the following guide:
- **80-100 (Highly Relevant):** The document directly and comprehensively addresses the query's intent. It is a core and authoritative answer.
- **60-80 (Relevant):** The document substantially addresses the query's intent, providing most of the key information, but might miss some minor details.
- **40-60 (Moderately Relevant):** The document is on-topic and addresses a part of the query's intent, but it is not a comprehensive answer.
- **20-40 (Slightly Relevant):** The document mentions keywords from the query, but its main topic is different. It offers very limited value.
- **0-20 (Irrelevant):** The document does not address the query's intent at all and is off-topic.

After providing your detailed analysis and justification for all the steps above, conclude your entire response with the final relevance score. The score must be placed strictly between the tags. There should be no other text or explanation inside the tags:

<score>
[From a scale of 0 to 100, annotate the degree of relevance between the query and the document.]
</score>

Query ({query_type}):
[Begin of Query]
{query}
[End of Query]

Document ({doc_type}):
[Begin of Document]
{doc}
[End of Document]
"""


def main():
    query = "In a party, how many guests do you need to have to ensure that either four people all know each other or four people are all complete strangers to one another?"
    doc = "\\section{Infinite Ramsey's Theorem}\nTags: Ramsey Theory, Named Theorems\n\n\\begin{theorem}\nLet $k, n \\in \\N$.\nFor any set $S$, let $S^{\\paren n}$ denote the set $\\set {\\set {s_1, \\ldots, s_n}: \\text{each } s_i \\in S}$ of cardinality $n$ subsets of $S$.\nLet $X$ be an infinite set.\nThen:\n:for every partition $P$ of $X^{\\paren n}$ into $k$ many components\n:there is an infinite subset $Y \\subseteq X$\nsuch that:\n:each member of $Y^{\\paren n}$ is in the same component of $P$.\n\\end{theorem}\n\n\\begin{proof}\nWe will prove the theorem for fixed $k$ by induction on $n$.\n\\end{proof}\n\n"
    query_type = "math problem"
    doc_type = "math-related passage"
    relevance_definition = "Given a query (math problem) and a document (math-related passage), the document is relevant to the query if the theorem described in the document can help solve the problem in the query."

    prompts = [
        PROMPT_TEMPLATE.format(
            relevance_definition=relevance_definition,
            query_type=query_type,
            doc_type=doc_type,
            query=query,
            doc=doc,
        )
    ]

    llm = sgl.Engine(
        model_path="ljw13/retro-star-qwen3-14b-0928",
        tp_size=8,
        dp_size=1,
    )
    tokenizer = llm.tokenizer_manager.tokenizer

    messages = [[{"role": "user", "content": prompt}] for prompt in prompts]
    input_texts = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )

    sampling_params = {
        "n": 1,
        "temperature": 0.6,
        "max_new_tokens": 1024,
        "skip_special_tokens": False,
        "spaces_between_special_tokens": False,
    }
    outputs = llm.generate(input_texts, sampling_params=sampling_params)
    llm.shutdown()

    scores = []
    for output in outputs:
        print(output["text"])
        print("==" * 30)
        try:
            score = int(re.search(r"<score>\s*(\d+)\s*</score>", output["text"]).group(1))
        except AttributeError:
            # re.search returned None: the output contains no well-formed <score> tag.
            score = 0
        scores.append(score)
    print("Scores:", scores)


if __name__ == "__main__":
    main()
```

# Citation

If you find this model useful, please consider giving it a like and citing our paper:

```
@article{lan2025retro,
  title={Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval},
  author={Lan, Junwei and Chen, Jianlyu and Liu, Zheng and Li, Chaofan and Bao, Siqi and Lian, Defu},
  journal={arXiv preprint arXiv:2509.24869},
  year={2025}
}
```
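The score-parsing step in the usage example above can also be factored into a small reusable helper. The sketch below is illustrative only: the `extract_score` name, the fallback heuristic, and the clamping to 0-100 are our own additions, and it assumes the model wraps its final answer in `<score>...</score>` tags as the prompt requests.

```python
import re


def extract_score(text: str, default: int = 0) -> int:
    """Extract the final relevance score from a raw model response.

    Prefers an integer wrapped in <score>...</score> tags (an assumed
    format); falls back to the last standalone 1-3 digit integer in the
    text, then to `default`. The result is clamped to [0, 100].
    """
    match = re.search(r"<score>\s*(\d+)\s*</score>", text)
    if match:
        return max(0, min(100, int(match.group(1))))
    candidates = re.findall(r"\b(\d{1,3})\b", text)
    if candidates:
        return max(0, min(100, int(candidates[-1])))
    return default
```

With a helper like this, the per-output parsing in the loop reduces to a single call such as `score = extract_score(output["text"])`, and malformed outputs degrade to a best-effort guess instead of a hard failure.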