DeAR-3B-Reranker-CE-LoRA-v1

Part of the DeAR-Reranking collection: DeAR (Deep Agent Rank), Dual-Stage Document Reranking with Reasoning Agents. Accepted at EMNLP Findings 2025.
DeAR-3B-Reranker-CE-LoRA-v1 is an ultra-efficient LoRA adapter for neural reranking with Binary Cross-Entropy loss. At only ~40MB, this is the most storage-efficient model in the DeAR family while delivering fast, reliable reranking performance suitable for production deployments.
- Minimal storage: only ~40MB on disk
- Ultra fast: ~1.5s to rerank 100 documents
- Stable training: BCE loss converges reliably
- Production ready: reliable reranking performance
- Easy updates: adapters can be swapped without retraining the base model
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig

# Load LoRA adapter
adapter_path = "abdoelsayed/dear-3b-reranker-ce-lora-v1"
config = PeftConfig.from_pretrained(adapter_path)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    config.base_model_name_or_path,
    num_labels=1,
    torch_dtype=torch.bfloat16,
)

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()
model.eval().cuda()

# Score a query-document pair
query = "What is machine learning?"
document = "Machine learning is a subset of artificial intelligence..."
inputs = tokenizer(
    f"query: {query}",
    f"document: {document}",
    return_tensors="pt",
    truncation=True,
    max_length=228,
    padding="max_length",
)
inputs = {k: v.cuda() for k, v in inputs.items()}
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(f"Relevance score: {score}")
```
```python
from typing import List, Tuple

@torch.inference_mode()
def efficient_rerank(tokenizer, model, query: str, docs: List[Tuple[str, str]], batch_size=128):
    """Optimized reranking for the 3B LoRA model."""
    scores = []
    device = next(model.parameters()).device
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        # Prepare inputs
        queries = [f"query: {query}"] * len(batch)
        documents = [f"document: {title} {text}" for title, text in batch]
        # Tokenize and move to device
        inputs = tokenizer(
            queries,
            documents,
            return_tensors="pt",
            truncation=True,
            max_length=228,
            padding=True,
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        # Get scores
        logits = model(**inputs).logits.squeeze(-1)
        scores.extend(logits.cpu().tolist())
    # Sort by relevance, highest score first
    return sorted(enumerate(scores), key=lambda x: x[1], reverse=True)

# Example
query = "When did Thomas Edison invent the light bulb?"
docs = [
    ("", "Thomas Edison invented the light bulb in 1879"),
    ("", "Coffee is good for diet"),
    ("", "Lightning strike at Seoul"),
]
ranking = efficient_rerank(tokenizer, model, query, docs)
print(ranking)
# Output: [(0, -6.1), (2, -11.2), (1, -12.1)]
```
```python
# Load different domain-specific adapters on the same base architecture.
# Note: merge_and_unload() folds adapter weights into the base model in
# place, so each merged model needs its own fresh copy of the base weights.
def load_merged(adapter_path):
    base_model = AutoModelForSequenceClassification.from_pretrained(
        "meta-llama/Llama-3.2-3B",
        num_labels=1,
        torch_dtype=torch.bfloat16,
    )
    return PeftModel.from_pretrained(base_model, adapter_path).merge_and_unload()

# Medical domain adapter
medical_model = load_merged("your-org/dear-3b-medical-lora")

# Legal domain adapter
legal_model = load_merged("your-org/dear-3b-legal-lora")

# General domain (this model)
general_model = load_merged("abdoelsayed/dear-3b-reranker-ce-lora-v1")
```
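When memory is the constraint, a single PeftModel can instead switch adapters at request time rather than keeping three merged copies loaded. A sketch using peft's load_adapter/set_adapter, reusing the hypothetical medical/legal adapter names from above:

```python
import torch
from transformers import AutoModelForSequenceClassification
from peft import PeftModel

# One copy of the base weights, several named adapters on top.
base = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.2-3B", num_labels=1, torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(
    base, "abdoelsayed/dear-3b-reranker-ce-lora-v1", adapter_name="general"
)
model.load_adapter("your-org/dear-3b-medical-lora", adapter_name="medical")
model.load_adapter("your-org/dear-3b-legal-lora", adapter_name="legal")

model.set_adapter("medical")   # route medical queries through the medical adapter
model.set_adapter("general")   # switch back to this model
```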
LoRA configuration:

```json
{
  "r": 16,
  "lora_alpha": 32,
  "target_modules": [
    "q_proj", "v_proj", "k_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj"
  ],
  "lora_dropout": 0.05,
  "bias": "none",
  "task_type": "SEQ_CLS"
}
```
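For reference, the same configuration expressed as a peft LoraConfig (a sketch for reproducing the setup; the actual training script is not included here):

```python
from peft import LoraConfig, TaskType

# Mirrors the JSON adapter config above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS,
)
```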
Binary Cross-Entropy with Knowledge Distillation:

L = (1 - α) · BCE(y_pred, y_true) + α · KL(σ(z_s/T) ‖ σ(z_t/T))

where α = 0.1, T = 2.0, z_s and z_t are the student and teacher logits, and σ is the sigmoid.
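A minimal sketch of this objective for a pointwise, single-logit reranker; the KL direction and reduction follow the formula above, and the paper's actual training code may differ:

```python
import torch
import torch.nn.functional as F

def distill_bce_loss(student_logits, teacher_logits, labels, alpha=0.1, T=2.0):
    """L = (1 - alpha) * BCE + alpha * KL over temperature-softened scores."""
    # Hard-label term: BCE against the gold relevance labels.
    bce = F.binary_cross_entropy_with_logits(student_logits, labels)
    # Soft-label term: each tempered logit defines a Bernoulli distribution;
    # KL(student || teacher) has a closed form for Bernoullis.
    p_s = torch.sigmoid(student_logits / T)
    p_t = torch.sigmoid(teacher_logits / T)
    kl = p_s * (p_s / p_t).log() + (1 - p_s) * ((1 - p_s) / (1 - p_t)).log()
    return (1 - alpha) * bce + alpha * kl.mean()
```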
Storage footprint:

| Model | Storage | Ratio to 8B |
|---|---|---|
| 3B-CE-LoRA | 40MB | 0.25% |
| 3B-CE-Full | 6GB | 37.5% |
| 8B-CE-Full | 16GB | 100% |

Inference speed:

| Model | Time (100 docs) | Throughput |
|---|---|---|
| 3B-CE-LoRA | 1.5s | 67 docs/s |
| 8B-CE-Full | 2.2s | 45 docs/s |

GPU memory:

| Model | Inference GPU | Training GPU |
|---|---|---|
| 3B-CE-LoRA | 10GB | 18GB |
| 3B-CE-Full | 12GB | 24GB |
| 8B-CE-Full | 18GB | 38GB |
Storage cost:
- Full 8B: $100/month (cloud storage)
- 3B LoRA: $0.25/month
- Savings: 99.75%

Training cost:
- Full 3B: 18 GPU hours
- 3B LoRA: 6 GPU hours
- Savings: 67%

Inference cost:
- Same throughput as the full 3B model (the adapter is merged at load time)
- Lower memory footprint allows cheaper GPUs
- Estimated savings: 30-40%
Use cases:
- High-volume search
- Edge deployment
- Multi-domain systems (adapter swapping)
- A/B testing

Best for: deployments where storage, memory, or cost is the binding constraint.
Use the full 3B model for: slightly higher accuracy when a ~6GB footprint is acceptable.
Use the 8B models for: maximum accuracy when latency and cost are secondary.
```python
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig

app = FastAPI()

class RerankRequest(BaseModel):
    query: str
    documents: List[str]

# Load model at startup
tokenizer, model = None, None

@app.on_event("startup")
async def load_model():
    global tokenizer, model
    adapter_path = "abdoelsayed/dear-3b-reranker-ce-lora-v1"
    config = PeftConfig.from_pretrained(adapter_path)
    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    base = AutoModelForSequenceClassification.from_pretrained(
        config.base_model_name_or_path,
        num_labels=1,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base, adapter_path)
    model = model.merge_and_unload()
    model.eval()

@app.post("/rerank")
async def rerank(request: RerankRequest):
    # efficient_rerank is defined in the batch-reranking example above.
    docs = [("", doc) for doc in request.documents]
    ranking = efficient_rerank(tokenizer, model, request.query, docs)
    return {"ranking": ranking}
```
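A quick client-side check once the service is running; the module name and port below are illustrative assumptions:

```python
# Assumes the app above lives in app.py and was started with:
#   uvicorn app:app --host 0.0.0.0 --port 8000
import requests

resp = requests.post(
    "http://localhost:8000/rerank",
    json={
        "query": "What is machine learning?",
        "documents": ["ML is a subset of AI.", "Coffee is good for diet."],
    },
)
print(resp.json())  # {"ranking": [[doc_index, score], ...]}, best match first
```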
Pareto frontier (TREC DL19), NDCG@10 vs storage size:

├─ Teacher-13B:   73.8 @ 26GB
├─ DeAR-8B:       74.5 @ 16GB
├─ DeAR-8B-LoRA:  74.2 @ 100MB
├─ DeAR-3B:       71.2 @ 6GB
└─ DeAR-3B-LoRA:  70.5 @ 40MB  ← best efficiency!

Efficiency score: 70.5 NDCG / 40MB ≈ 1.76 per MB
Related models:
- Full version: the fully fine-tuned 3B checkpoint (~6GB, slightly higher accuracy)
- Same size: other 3B adapters in the DeAR-Reranking collection
- Larger: the 8B DeAR rerankers for maximum accuracy
- Resources: paper (arXiv:2508.16998) and the DeAR-Reranking collection
```bibtex
@article{abdallah2025dear,
  title={DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation},
  author={Abdallah, Abdelrahman and Mozafari, Jamshid and Piryani, Bhawna and Jatowt, Adam},
  journal={arXiv preprint arXiv:2508.16998},
  year={2025}
}
```
License: MIT
Base model: meta-llama/Llama-3.2-3B