DeAR-3B-Reranker-CE-LoRA-v1

Model Description

DeAR-3B-Reranker-CE-LoRA-v1 is a lightweight LoRA adapter for pointwise neural reranking, trained with a Binary Cross-Entropy (BCE) objective and knowledge distillation. At only ~40MB, it is the most storage-efficient model in the DeAR family while delivering fast, reliable reranking suitable for production deployments.

Model Details

  • Model Type: LoRA Adapter for Pointwise Reranking
  • Base Model: meta-llama/Llama-3.2-3B
  • Adapter Size: ~40MB
  • Training Method: LoRA with Binary Cross-Entropy + Knowledge Distillation
  • LoRA Rank: 16
  • LoRA Alpha: 32
  • Trainable Parameters: 25M (0.8% of total)
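
These hyperparameters are stored in the adapter's config file; a quick sanity check before downloading the base model (a sketch, assuming a recent peft release where the returned config exposes these attributes):

from peft import PeftConfig

config = PeftConfig.from_pretrained("abdoelsayed/dear-3b-reranker-ce-lora-v1")
print(config.base_model_name_or_path)  # meta-llama/Llama-3.2-3B
print(config.r, config.lora_alpha)     # expected: 16 32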

Key Features

✅ Minimal Storage: Only ~40MB on disk
✅ Ultra Fast: Reranks 100 documents in ~1.5s
✅ Stable Training: Smooth BCE loss convergence
✅ Production Ready: Reliable performance
✅ Easy Updates: Simple adapter swapping

Usage

Quick Load and Score

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig

# Load LoRA adapter
adapter_path = "abdoelsayed/dear-3b-reranker-ce-lora-v1"
config = PeftConfig.from_pretrained(adapter_path)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    config.base_model_name_or_path,
    num_labels=1,
    torch_dtype=torch.bfloat16
)

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()
model.eval().cuda()

# Score a query-document pair
query = "What is machine learning?"
document = "Machine learning is a subset of artificial intelligence..."

inputs = tokenizer(
    f"query: {query}",
    f"document: {document}",
    return_tensors="pt",
    truncation=True,
    max_length=228,
    padding="max_length"
)
inputs = {k: v.cuda() for k, v in inputs.items()}

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
    
print(f"Relevance score: {score}")

High-Throughput Reranking

from typing import List, Tuple

@torch.inference_mode()
def efficient_rerank(tokenizer, model, query: str, docs: List[Tuple[str, str]], batch_size=128):
    """Optimized reranking for 3B LoRA model."""
    scores = []
    device = next(model.parameters()).device
    
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        
        # Prepare inputs
        queries = [f"query: {query}"] * len(batch)
        documents = [f"document: {title} {text}" for title, text in batch]
        
        # Tokenize and move to device
        inputs = tokenizer(
            queries,
            documents,
            return_tensors="pt",
            truncation=True,
            max_length=228,
            padding=True
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        # Get scores
        logits = model(**inputs).logits.squeeze(-1)
        scores.extend(logits.cpu().tolist())
    
    # Sort by relevance
    return sorted(enumerate(scores), key=lambda x: x[1], reverse=True)


# Example
query = "When did Thomas Edison invent the light bulb?"
docs = [
    ("", "Thomas Edison invented the light bulb in 1879"),
    ("", "Coffee is good for diet"),
    ("", "Lightning strike at Seoul"),
]

ranking = efficient_rerank(tokenizer, model, query, docs)
print(ranking)
# Output: [(0, -6.1), (2, -11.2), (1, -12.1)]
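
The function returns (index, score) pairs, so recovering the documents in ranked order is a one-liner:

ranked_docs = [docs[i] for i, _ in ranking]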

Multi-Domain Deployment

# Serve several domain-specific adapters on a single copy of the base model.
# Note: merge_and_unload() folds adapter weights into the base model in place, so
# merging multiple adapters into the same base object would corrupt it. Keeping the
# adapters unmerged and switching with set_adapter() avoids reloading the 3B weights.
base_model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    num_labels=1,
    torch_dtype=torch.bfloat16
)

# General domain (this model)
model = PeftModel.from_pretrained(
    base_model,
    "abdoelsayed/dear-3b-reranker-ce-lora-v1",
    adapter_name="general"
)

# Additional domain adapters (placeholder repository IDs)
model.load_adapter("your-org/dear-3b-medical-lora", adapter_name="medical")
model.load_adapter("your-org/dear-3b-legal-lora", adapter_name="legal")

# Switch domains at request time
model.set_adapter("medical")   # medical reranking
model.set_adapter("general")   # back to the general-domain adapter

Training Details

LoRA Configuration

{
    "r": 16,
    "lora_alpha": 32,
    "target_modules": [
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "SEQ_CLS"
}
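
For reference, the same settings expressed with peft's LoraConfig (a sketch for reproducing the setup; the original training script is not included here):

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=[
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)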

Training Hyperparameters

  • Learning Rate: 1e-4
  • Batch Size: 8
  • Gradient Accumulation: 2
  • Epochs: 2
  • Max Length: 228
  • Hardware: 4x A100 (40GB)
  • Training Time: ~6 hours
  • Memory Usage: ~18GB per GPU
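
A sketch of how these values map onto Hugging Face TrainingArguments, assuming the batch size of 8 is per device and using an illustrative output directory (the actual training script may differ):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="dear-3b-reranker-ce-lora",  # illustrative path
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=2,
    bf16=True,  # matches the bfloat16 dtype used at inference
)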

Loss Function

Binary Cross-Entropy with Knowledge Distillation:

L = (1 - α) * BCE(y_pred, y_true) + α * KL(σ(z_s/T), σ(z_t/T))
where α = 0.1, T = 2.0
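
A minimal PyTorch sketch of this objective, with illustrative names: z_s are student logits, z_t teacher logits, y binary relevance labels; the KL term follows the argument order written above:

import torch
import torch.nn.functional as F

def bce_kd_loss(z_s, z_t, y, alpha=0.1, T=2.0, eps=1e-7):
    # Supervised term: BCE between student logits and ground-truth labels
    bce = F.binary_cross_entropy_with_logits(z_s, y.float())
    # Distillation term: binary KL between temperature-scaled probabilities
    p_s = torch.sigmoid(z_s / T).clamp(eps, 1 - eps)
    p_t = torch.sigmoid(z_t / T).clamp(eps, 1 - eps)
    kl = (p_s * (p_s / p_t).log() + (1 - p_s) * ((1 - p_s) / (1 - p_t)).log()).mean()
    return (1 - alpha) * bce + alpha * kl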

Efficiency Metrics

Storage Comparison

Model         Storage   Ratio to 8B
3B-CE-LoRA    40MB      0.25%
3B-CE-Full    6GB       37.5%
8B-CE-Full    16GB      100%

Speed Comparison

Model         Time (100 docs)   Throughput
3B-CE-LoRA    1.5s              67 docs/s
8B-CE-Full    2.2s              45 docs/s

Memory Comparison

Model         Inference GPU   Training GPU
3B-CE-LoRA    10GB            18GB
3B-CE-Full    12GB            24GB
8B-CE-Full    18GB            38GB

Advantages of LoRA Version

Cost Savings

Storage Cost:
- Full 8B: $100/month (cloud storage)
- 3B LoRA: $0.25/month
- Savings: 99.75%

Training Cost:
- Full 3B: 18 GPU hours
- 3B LoRA: 6 GPU hours  
- Savings: 67%

Inference Cost:
- Throughput on par with the full 3B model
- Lower memory → cheaper GPUs
- Estimated savings: 30-40%

Use Cases

Production Scenarios

  1. High-Volume Search
     • Process millions of queries/day
     • Cost-effective at scale
     • Fast response times
  2. Edge Deployment
     • Deploy on smaller GPUs
     • Minimal storage footprint
     • Quick model updates
  3. Multi-Domain Systems
     • Store multiple domain adapters
     • Switch adapters dynamically
     • Total storage: N × 40MB
  4. A/B Testing
     • Test multiple variants quickly
     • Minimal storage overhead
     • Easy rollback

When to Use

Best for:

  • ✅ Extreme cost optimization
  • ✅ Multi-tenant systems
  • ✅ Frequent model updates
  • ✅ Limited storage environments
  • ✅ Edge/mobile deployment

Use full 3B for:

  • ❌ Marginal accuracy improvement
  • ❌ Single model deployment

Use 8B for:

  • ❌ Maximum accuracy required

Deployment Example

FastAPI Service

from fastapi import FastAPI
from pydantic import BaseModel
from typing import List
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig

app = FastAPI()

class RerankRequest(BaseModel):
    query: str
    documents: List[str]

# Load model at startup
tokenizer, model = None, None

@app.on_event("startup")
async def load_model():
    global tokenizer, model
    
    adapter_path = "abdoelsayed/dear-3b-reranker-ce-lora-v1"
    config = PeftConfig.from_pretrained(adapter_path)
    
    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    base = AutoModelForSequenceClassification.from_pretrained(
        config.base_model_name_or_path,
        num_labels=1,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    
    model = PeftModel.from_pretrained(base, adapter_path)
    model = model.merge_and_unload()
    model.eval()

@app.post("/rerank")
async def rerank(request: RerankRequest):
    docs = [("", doc) for doc in request.documents]
    # Reuses efficient_rerank() from the "High-Throughput Reranking" example above
    ranking = efficient_rerank(tokenizer, model, request.query, docs)
    return {"ranking": ranking}
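
Once the service is running (for example with uvicorn app:app), it can be called over HTTP; the JSON fields mirror RerankRequest above:

import requests

resp = requests.post(
    "http://localhost:8000/rerank",
    json={
        "query": "When did Thomas Edison invent the light bulb?",
        "documents": [
            "Thomas Edison invented the light bulb in 1879",
            "Coffee is good for diet",
        ],
    },
)
print(resp.json())  # {"ranking": [[0, <score>], [1, <score>]]}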

Performance vs Storage

Pareto Frontier (TREC DL19):

NDCG@10 vs Storage Size:
├─ Teacher-13B: 73.8 @ 26GB
├─ DeAR-8B: 74.5 @ 16GB
├─ DeAR-8B-LoRA: 74.2 @ 100MB
├─ DeAR-3B: 71.2 @ 6GB
└─ DeAR-3B-LoRA: 70.5 @ 40MB ← Best efficiency!

Efficiency Score: 70.5 NDCG / 40MB = 1.76 per MB

Limitations

  1. Accuracy: ~4 NDCG@10 lower than the 8B models
  2. Requires Base Model: the Llama-3.2-3B base weights must be downloaded and loaded separately
  3. Merging Overhead: one-time cost if you merge the adapter into the base model
  4. Complex Queries: may miss subtle nuances that larger models capture

Related Models

Full Version:

Same Size:

Larger:

Resources:

Citation

@article{abdallah2025dear,
  title={DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation},
  author={Abdallah, Abdelrahman and Mozafari, Jamshid and Piryani, Bhawna and Jatowt, Adam},
  journal={arXiv preprint arXiv:2508.16998},
  year={2025}
}

License

MIT License

More Information
