
FLUX.1-dev LoRA Fine-tuned with Flow-GRPO

This LoRA (Low-Rank Adaptation) model is a fine-tuned version of FLUX.1-dev using Flow-GRPO (Flow-based Group Relative Policy Optimization), a novel reinforcement learning technique for flow matching models.

Model Description

This model was trained using the Flow-GRPO methodology described in the paper "Flow-GRPO: Training Flow Matching Models via Online RL". Flow-GRPO integrates online reinforcement learning into flow matching models by:

  1. ODE-to-SDE conversion: Transforms deterministic flow matching into stochastic sampling for RL exploration
  2. Denoising reduction: Uses fewer denoising steps during training while maintaining full quality at inference
  3. Human preference optimization: Trained with PickScore reward to align with human preferences
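
The "group relative" part of GRPO means advantages are computed within a group of images sampled for the same prompt: each image is scored by the reward model, and its advantage is the reward normalized against the group's mean and standard deviation. A minimal sketch of that idea (illustrative only, not the exact training code):

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (group_size,), e.g. PickScore values for the images
    # generated from a single prompt
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 images sampled for one prompt and scored by the reward model
rewards = torch.tensor([0.21, 0.18, 0.25, 0.19])
advantages = group_relative_advantages(rewards)
# Images scoring above the group mean get positive advantages (reinforced),
# those below the mean get negative advantages (discouraged)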

Training Details

Core Configuration

  • Base Model: FLUX.1-dev
  • Training Method: Flow-GRPO with PickScore reward
  • Resolution: 512×512
  • Mixed Precision: bfloat16
  • Seed: 42

LoRA Configuration

  • LoRA Enabled: True
  • Rank: Not specified in config (typically 32-64)
  • Target Modules: Transformer layers

Training Hyperparameters

  • Learning Rate: 5e-5
  • Batch Size: 1 (with gradient accumulation: 32 steps)
  • Optimizer: 8-bit AdamW
    • β₁: 0.9
    • β₂: 0.999
    • Weight Decay: 1e-4
    • Epsilon: 1e-8
  • Gradient Clipping: Max norm 1.0
  • Max Epochs: 100,000
  • Save Frequency: Every 100 steps
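
As a rough illustration of the setup above, the bitsandbytes 8-bit AdamW with these hyperparameters can be constructed as follows. This is a sketch: the dummy parameter stands in for the FLUX transformer's trainable LoRA weights, and the loss is a placeholder for the actual Flow-GRPO objective.

import bitsandbytes as bnb
import torch

# Placeholder parameters standing in for the transformer's LoRA weights
lora_params = [torch.nn.Parameter(torch.randn(16, 16))]

optimizer = bnb.optim.AdamW8bit(
    lora_params,
    lr=5e-5,
    betas=(0.9, 0.999),
    weight_decay=1e-4,
    eps=1e-8,
)

# One optimizer update: 32 accumulated micro-batches of size 1
# (effective batch size 32), then gradient clipping at max norm 1.0
loss = (lora_params[0] ** 2).sum()  # stand-in for the Flow-GRPO policy loss
(loss / 32).backward()
torch.nn.utils.clip_grad_norm_(lora_params, max_norm=1.0)
optimizer.step()
optimizer.zero_grad()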

Flow-GRPO Specific

  • Reward Function: PickScore (human preference)
  • Beta (KL penalty): 0.001
  • Clip Range: 0.2
  • Advantage Clipping: Max 5.0
  • Timestep Fraction: 0.2
  • Guidance Scale: 3.5
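
To make the clip range, advantage clipping, and KL penalty above concrete, a GRPO/PPO-style per-sample loss using these values might look like the sketch below. This is illustrative pseudocode, not the repository's implementation, and the KL term is written as a simple log-probability penalty against the frozen reference model rather than the paper's exact estimator.

import torch

def flow_grpo_loss(
    logp_new: torch.Tensor,    # log-prob of the sampled denoising actions under the current policy
    logp_old: torch.Tensor,    # log-prob under the behavior policy that generated the samples
    logp_ref: torch.Tensor,    # log-prob under the frozen reference (pre-RL) model
    advantages: torch.Tensor,  # group-relative advantages
    clip_range: float = 0.2,
    adv_clip_max: float = 5.0,
    beta: float = 0.001,
) -> torch.Tensor:
    advantages = advantages.clamp(-adv_clip_max, adv_clip_max)
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1.0 - clip_range, 1.0 + clip_range) * advantages
    policy_loss = -torch.minimum(unclipped, clipped).mean()
    kl_penalty = beta * (logp_new - logp_ref).mean()  # simplified KL surrogate
    return policy_loss + kl_penalty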

Sampling Configuration

  • Training Steps: 2 (denoising reduction)
  • Evaluation Steps: 4
  • Images per Prompt: 4
  • Batches per Epoch: 4

Usage

With Diffusers

import torch
from diffusers import FluxPipeline

# Load the base model
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Load the LoRA weights
pipe.load_lora_weights("ighoshsubho/lora-grpo-flux-dev", adapter_name="flow_grpo")

# Generate an image
prompt = "A serene landscape with mountains and a lake at sunset"
image = pipe(
    prompt,
    height=512,
    width=512,
    guidance_scale=3.5,
    num_inference_steps=20,
    max_sequence_length=256,
).images[0]

image.save("generated_image.png")

Adjusting LoRA Strength

# You can adjust the LoRA influence via the adapter name set above
pipe.set_adapters(["flow_grpo"], adapter_weights=[0.8])  # 80% LoRA influence
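
If you prefer to bake the adapter into the base weights, or to remove it again, diffusers also provides fusing and unloading helpers:

# Fuse the LoRA into the base weights at a chosen scale
pipe.fuse_lora(lora_scale=0.8)

# ...or undo the fusion and drop the adapter to get back the plain FLUX.1-dev
pipe.unfuse_lora()
pipe.unload_lora_weights()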

Training Data & Objectives

  • Dataset: Custom PickScore dataset for human preference alignment
  • Prompt Function: General OCR prompts
  • Optimization Target: Maximizing PickScore while maintaining image quality
  • KL Regularization: Prevents reward hacking and maintains model stability
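
For reference, a PickScore reward of the kind used as the optimization target can be computed with the publicly released PickScore_v1 checkpoint roughly as follows (a sketch based on the PickScore authors' example usage; the exact reward code used during training may differ):

import torch
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
reward_model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval()

def pickscore(prompt, image):
    # Score how well a PIL image matches the prompt under PickScore
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=prompt, padding=True, truncation=True,
                            max_length=77, return_tensors="pt")
    with torch.no_grad():
        img_emb = reward_model.get_image_features(**image_inputs)
        txt_emb = reward_model.get_text_features(**text_inputs)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        score = reward_model.logit_scale.exp() * (txt_emb @ img_emb.T)
    return score.item()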

Performance Improvements

This model demonstrates improvements in:

  • Human preference alignment through PickScore optimization
  • Text rendering quality via OCR-focused training
  • Compositional understanding enhanced by Flow-GRPO's exploration mechanism
  • Stable training with minimal reward hacking due to KL regularization

Technical Notes

  • Uses denoising reduction during training (2 steps) for efficiency
  • Maintains full quality with standard inference steps (20-50)
  • Trained with mixed precision (bfloat16) for memory efficiency
  • 8-bit AdamW optimizer reduces memory footprint
  • Gradient accumulation (32 steps) enables effective large batch training

Limitations

  • Optimized for 512×512 resolution
  • Focused on PickScore preferences (may not generalize to all aesthetic preferences)
  • LoRA adaptation may have reduced capacity compared to full fine-tuning

Citation

If you use this model, please cite the Flow-GRPO paper:

@article{liu2025flow,
  title={Flow-GRPO: Training Flow Matching Models via Online RL},
  author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Li, Yangguang and Liu, Jiaheng and Wang, Xintao and Wan, Pengfei and Zhang, Di and Ouyang, Wanli},
  journal={arXiv preprint arXiv:2505.05470},
  year={2025}
}

License

This LoRA adapter is released under the Apache 2.0 License. Note that the base FLUX.1-dev model is distributed under the FLUX.1 [dev] Non-Commercial License, which continues to govern use of the base model weights.
