Post
2552
Diffusers supports a good variety of quantization backends. It can be challenging to navigate through them, given the complex nature of diffusion pipelines in general.
So, @derekl35 set out to write a comprehensive guide that puts users in the driver's seat. Explore the different backends we support, learn the trade-offs they offer, and finally, check out the cool space we built that lets you compare quantization results.
Give it a go here:
https://lnkd.in/gf8Pi4-2
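For a quick taste, here's a minimal sketch of one of those backends: 4-bit bitsandbytes quantization of the Flux transformer via Diffusers. The model ID, prompt, and dtype choices are illustrative; the guide covers the full set of backends and their trade-offs.

```python
# Minimal sketch: 4-bit NF4 quantization of the Flux transformer with bitsandbytes.
# Assumes diffusers, transformers, bitsandbytes, and a CUDA GPU are available.
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantize the memory-heavy transformer; keep the rest of the pipeline in bf16.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keeps peak VRAM low

image = pipe("a photo of a corgi astronaut", num_inference_steps=28).images[0]
image.save("corgi.png")
```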
Post
1705
Despite the emergence of approaches that combine LLM and DiT architectures for T2I synthesis, this design space remains severely understudied.
This was done long ago and got into CVPR25 -- super excited to finally share it now, along with the data and code ♥️
We explore several architectural choices that affect this design. We provide an open & reproducible training recipe that works at scale.
Works like Playground v3 have already explored a deep fusion between an LLM and a DiT, sharing their representations through layerwise attention. They exhibit excellent performance on T2I.
Despite its compelling results and other performance virtues, this design remains underexplored, which is what we aim to improve in our work. Specifically, we take a pre-trained LLM (Gemma-2B) and a trainable DiT, and explore what makes a "good deep fusion" between the two for T2I.
We explore several key questions in the work, such as:
Q1: How should we do attention? We considered several alternatives; PixArt-Alpha-style attention (cross-attention) is very promising (see the sketch after this post).
Q2: Should we incorporate additional text modulation?
Q3: Can we eliminate timestep conditioning?
Q4: How do we do positional encodings?
Q5: Do instruction-tuned LLMs help deep fusion?
Q6: Would using a decoder LLM from a multimodal model be helpful?
Q7: Does using a better variant of Gemma help?
Based on the findings of our experiments, we arrive at FuseDiT, with the following components on top of the base architecture:
* No AdaLN-Zero modules
* 1D + 2D-RoPE
* Gemma 2 2B, adjusting DiT configurations accordingly
We trained FuseDiT on a mixture from CC12M, JourneyDB, & SA (~26M image-text pairs) for 800 steps. While not the best model, it's encouraging to develop something in a guided manner using open datasets.
To know more (code and models are all available), please check out the paper:
https://lnkd.in/gg6qyqZX
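To make the deep-fusion idea concrete, here is an illustrative PyTorch sketch (not the paper's actual code) of a DiT block whose image tokens cross-attend to hidden states from a frozen LLM, in the PixArt-Alpha style discussed in Q1. Dimensions, layer names, and the plain LayerNorms are assumptions; RoPE and the other recipe details are omitted for brevity.

```python
import torch
import torch.nn as nn

class FusedDiTBlock(nn.Module):
    """Illustrative DiT block: self-attention over image tokens, then
    cross-attention to (projected) hidden states of a frozen LLM."""

    def __init__(self, dim: int = 1024, n_heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Queries come from image tokens; keys/values from LLM states.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, img_tokens: torch.Tensor, llm_states: torch.Tensor):
        # img_tokens: (B, N_img, dim); llm_states: (B, N_txt, dim),
        # e.g. projected hidden states from a frozen Gemma 2 2B.
        h = self.norm1(img_tokens)
        img_tokens = img_tokens + self.self_attn(h, h, h)[0]
        h = self.norm2(img_tokens)
        img_tokens = img_tokens + self.cross_attn(h, llm_states, llm_states)[0]
        return img_tokens + self.mlp(self.norm3(img_tokens))

# Smoke test with random tensors.
block = FusedDiTBlock()
out = block(torch.randn(2, 256, 1024), torch.randn(2, 77, 1024))
print(out.shape)  # torch.Size([2, 256, 1024])
```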

sayakpaul authored a paper · about 1 month ago

Aurelien-Morgan posted an update · about 1 month ago
Post
402
Hey, I'll be presenting @retrain-pipelines and almighty function-calling at the Hugging Face Paris HQ, you guys.
Monday evening. Lightning-talk style. With AI Tinkerers.
Come hang!
https://paris.aitinkerers.org/p/ai-tinkerers-paris-ai21-labs-takeover-on-may-19th
https://huggingface.co/blog/Aurelien-Morgan/the-almighty-function-caller
Post
2578
PSA for anyone using Nymbo/Nymbo_Theme or Nymbo/Nymbo_Theme_5 in a Gradio space ~
Both of these themes have been updated to fix some of the long-standing inconsistencies ever since the transition to Gradio v5. Textboxes are no longer bright green and in-line code is readable now! Both themes are now visually identical across versions.
If your space is already using one of these themes, you just need to restart your space to get the latest version. No code changes needed.
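For reference, applying a Hub-hosted theme takes a single argument; the demo body below is just an illustration.

```python
# Minimal sketch: pointing a Gradio space at a Hub-hosted theme.
import gradio as gr

with gr.Blocks(theme="Nymbo/Nymbo_Theme") as demo:  # or "Nymbo/Nymbo_Theme_5"
    gr.Textbox(label="Prompt")  # no longer bright green
    gr.Code(value="print('in-line code is readable now')", language="python")

demo.launch()
```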

shivalikasingh authored a paper · about 2 months ago

Aurelien-Morgan posted an update · about 2 months ago
Post
3136
The Almighty function-caller
How would you like to build smart GenAI infrastructure?
Give extensive tool memory to your edge agentic system, and optimize the resources it takes to run a high-performance set of agents?
We came up with a novel approach to function-calling at scale for smart companies and corporate-grade use cases.
Read our full-fledged blog article on this here on Hugging Face:
https://huggingface.co/blog/Aurelien-Morgan/the-almighty-function-caller

Aurelien-Morgan posted an update · about 2 months ago
Post
664
@retrain-pipelines 0.1.2 finally dropped. It comes with a hot Hugging Face Hub integration. Go check it out. We have 2 articles about it coming up; one is already fully written, so be on the lookout!
Also, I'll be volunteering at GOSIM AI Paris 2025. If you're interested in chatting, hmu.

sayakpaul authored a paper · 2 months ago

Aurelien-Morgan posted an update · 3 months ago

osanseviero authored a paper · 3 months ago

sayakpaul authored a paper · 3 months ago
Post
3600
🚀AraClip is now fully integrated with Hugging Face 🤗
AraClip is a specialized CLIP model that was created by @pain and optimized for Arabic text-image retrieval tasks🔥
🔗 Try it out 🔗
🤖 model: Arabic-Clip/araclip
🧩 Gradio demo: Arabic-Clip/Araclip-Simplified
🌐 website: https://arabic-clip.github.io/Arabic-CLIP/
Post
3866
Inference-time scaling meets Flux.1-Dev (and others) 🔥
Presenting a simple re-implementation of "Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps" by Ma et al.
I did the simplest random search strategy, but results can potentially be improved with better-guided search methods.
Supports Gemini 2 Flash & Qwen2.5 as verifiers for "LLMGrading" 🤗
The steps are simple:
For each round:
1> Start by sampling 2 noises with different seeds.
2> Score the generations w.r.t a metric.
3> Obtain the best generation from the current round.
If you have more compute budget, go to the next search round: scale the noise pool (2 ** search_round) and repeat steps 1-3 (see the sketch below).
This constitutes the random search method as done in the paper by Google DeepMind.
Code, more results, and a bunch of other stuff are in the repository. Check it out here: https://github.com/sayakpaul/tt-scale-flux/ 🤗
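For clarity, here is a minimal sketch of that loop. `generate` and `score` are hypothetical stand-ins for the pipeline call and the verifier; the real implementation lives in the repository above.

```python
# Minimal sketch of the random search loop; stand-in functions, not the repo's API.
import torch

def generate(prompt: str, seed: int) -> torch.Tensor:
    # Stand-in for the diffusion pipeline call; returns a dummy "image".
    g = torch.Generator().manual_seed(seed)
    return torch.rand(3, 64, 64, generator=g)

def score(image: torch.Tensor, prompt: str) -> float:
    # Stand-in for the verifier (e.g. Gemini 2 Flash or Qwen2.5 LLM grading).
    return float(image.mean())

def random_search(prompt: str, num_rounds: int = 4):
    best_image, best_score = None, float("-inf")
    for search_round in range(1, num_rounds + 1):
        num_noises = 2 ** search_round  # the noise pool doubles every round
        for seed in torch.randint(0, 2**31, (num_noises,)).tolist():
            image = generate(prompt, seed)
            s = score(image, prompt)
            if s > best_score:  # keep the best generation seen so far
                best_image, best_score = image, s
    return best_image, best_score

image, s = random_search("a photo of a corgi astronaut")
print(f"best score: {s:.4f}")
```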

vumichien authored a paper · 4 months ago

lucifertrj posted an update · 5 months ago
Post
661
Bhagavad Gita GPT assistant - Build a fast RAG pipeline to index 1000+ pages using DeepSeek R1 and Qdrant Binary Quantization
Check out the latest tutorial where we build a Bhagavad Gita GPT assistant—covering:
- DeepSeek R1 vs OpenAI O1
- Using the Qdrant client with Binary Quantization (sketched after this post)
- Building the RAG pipeline with LlamaIndex
- Running inference with DeepSeek R1 Distill model on Groq
- Developing a Streamlit app for chatbot inference
Watch the full implementation here: https://www.youtube.com/watch?v=NK1wp3YVY4Q
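As a taste of the Binary Quantization piece, here's a minimal sketch of creating a Qdrant collection with it enabled. The collection name and vector size are illustrative; match the size to your embedding model.

```python
# Minimal sketch: a Qdrant collection with Binary Quantization enabled.
from qdrant_client import QdrantClient
from qdrant_client.models import (
    BinaryQuantization, BinaryQuantizationConfig, Distance, VectorParams,
)

client = QdrantClient(url="http://localhost:6333")  # assumes a local Qdrant instance

client.create_collection(
    collection_name="bhagavad_gita",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    # Binary quantization keeps 1-bit vectors in RAM for fast approximate search,
    # while the full-precision vectors remain available for rescoring.
    quantization_config=BinaryQuantization(
        binary=BinaryQuantizationConfig(always_ram=True),
    ),
)
```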
Post
2114
We have been cooking a couple of fine-tuning runs on CogVideoX with finetrainers, smol datasets, and LoRA to generate cool video effects like crushing, dissolving, etc.
We are also releasing a LoRA extraction utility for fully fine-tuned checkpoints (a conceptual sketch follows the links below). I know that kind of stuff has existed for ages, but the quality on video models was nothing short of spectacular. Below are some links:
* Models and datasets: finetrainers
* finetrainers: https://github.com/a-r-r-o-w/finetrainers
* LoRA extraction: https://github.com/huggingface/diffusers/blob/main/scripts/extract_lora_from_model.py
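For intuition, here's a conceptual sketch of how LoRA extraction from a full fine-tune typically works: low-rank-approximate the weight delta between the tuned and base layers with a truncated SVD. This illustrates the general technique, not the diffusers script itself; shapes and the rank are arbitrary.

```python
import torch

def extract_lora(w_base: torch.Tensor, w_tuned: torch.Tensor, rank: int = 64):
    """Return LoRA factors (A, B) such that B @ A approximates w_tuned - w_base."""
    delta = (w_tuned - w_base).float()
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    # Split each retained singular value between the two factors.
    sqrt_s = torch.sqrt(s[:rank])
    lora_B = u[:, :rank] * sqrt_s          # (out_features, rank)
    lora_A = sqrt_s[:, None] * vh[:rank]   # (rank, in_features)
    return lora_A, lora_B

# Sanity check on random weights: the rank-r product should track the delta.
w0, w1 = torch.randn(256, 256), torch.randn(256, 256)
A, B = extract_lora(w0, w1, rank=64)
print(torch.dist(B @ A, w1 - w0))  # approximation error of the extracted LoRA
```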