AI & ML interests

Remote Sensing, Earth Observation

Recent Activity

prithivMLmods posted an update about 23 hours ago
Qwen Image: The Latest Image Generation Model 🔥

Below are some samples generated using the Qwen Image Diffusion Model. Qwen-Image, a 20B MMDiT model for next-generation text-to-image generation, preserves typographic details, layout coherence, and contextual harmony with high fidelity. It is especially strong at creating graphic posters with native text. The model is now open-source. [ Qwen-Image : Qwen/Qwen-Image ]

⤷ Try the Qwen Image demo here: prithivMLmods/Qwen-Image-Diffusion, Qwen/Qwen-Image & more ...

⤷ Technical report : Qwen-Image Technical Report (2508.02324)
⤷ Qwen Image [GitHub] : https://github.com/QwenLM/Qwen-Image

Even more impressively, it demonstrates a strong ability to understand images. The model supports a wide range of vision-related tasks such as object detection, semantic segmentation, depth and edge (Canny) estimation, novel view synthesis, and image super-resolution. While each task is technically distinct, they can all be viewed as advanced forms of intelligent image editing driven by deep visual understanding. Collectively, these capabilities position Qwen-Image as more than just a tool for generating appealing visuals; it serves as a versatile foundation model for intelligent visual creation and transformation, blending language, layout, and imagery.

Qwen-Image uses a dual-stream MMDiT architecture with a frozen Qwen2.5-VL as the condition encoder, a VAE encoder for images, RMSNorm for QK-Norm, LayerNorm elsewhere, and a custom MSRoPE scheme for joint image-text positional encoding.

.
.
.
To learn more, visit the model card of the respective model!
prithivMLmods posted an update 4 days ago
Introducing Camel-Doc-OCR-080125 (v2), a document content-structure retrieval VLM designed for content extraction and summarization. This is the second model in the Camel Doc OCR VLM series, following Camel-Doc-OCR-062825 (v1). The new version fixes formal table reconstruction issues in both English (en) and Chinese (zh) and performs better on long-context inference. 🤗🐪

⤷ Camel-Doc-OCR(v2) : prithivMLmods/Camel-Doc-OCR-080125
⤷ Camel-Doc-OCR(v1) : prithivMLmods/Camel-Doc-OCR-062825
⤷ Demo : prithivMLmods/core-OCR

Multimodal Model Collections and Spaces:

➝ Camel-Doc-OCR : prithivMLmods/camel-doc-ocr-080125-688c0c61c5dba648756f31f8
➝ Vision-Language (VLr) : prithivMLmods/vision-language-for-reasoning-vlr-6889b3f45917352b5e3a6f7a
➝ Multimodal Spaces : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
➝ Multimodal VLMs : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027

.
.
.
To learn more, visit the model card of the respective model!
  • 2 replies
Β·
prithivMLmods posted an update 6 days ago
Excited to introduce Lumian-VLR-7B-Thinking, an explicitly grounded experimental reasoning model built on top of Qwen2.5-VL, featuring reasoning-aware trajectories with enhanced spatial perception. Along with this, we've also added a demo for the model, together with some of the latest and most interesting models available on the Hub, to make full use of the remaining resources.

✨ Multimodal-VLM-Thinking : prithivMLmods/Multimodal-VLM-Thinking
✨ Multimodal-VLM-OCR : prithivMLmods/Multimodal-VLM-OCR

✦ Models used in these spaces:

✨ Lumian-VLR-7B-Thinking : prithivMLmods/Lumian-VLR-7B-Thinking
✨ Enesidaon-VLR-7B-no-Thinking : prithivMLmods/Enesidaon-VLR-7B-no-Thinking
✨ GLM-4.1V-9B-Thinking : zai-org/GLM-4.1V-9B-Thinking
✨ DREX-062225-exp : prithivMLmods/DREX-062225-exp & more ...

✦ Multimodal Model Collections and Spaces:

✨ Vision-Language (VLr) : prithivMLmods/vision-language-for-reasoning-vlr-6889b3f45917352b5e3a6f7a
✨ Multimodal Spaces : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
✨ Multimodal VLMs : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027

.
.
.
To learn more, visit the model card of the respective model!
prithivMLmods posted an update 9 days ago
Explore OCR, Captioning, and Visual Understanding with Cutting-Edge Models on Hugging Face. 🤗🧪

I've put together a collection of Google Colab notebooks to experiment with some of the most exciting models available on the Hugging Face Hub focused on OCR, image captioning, and visual understanding tasks. [Image-to-Text] / [Image-Text-to-Text]

> 📖 OCR-ReportLab-Notebooks : prithivMLmods/OCR-ReportLab-Notebooks

These notebooks are built for quick prototyping and run on free T4 GPUs, making them perfect for experimentation, testing ideas, or just exploring what's possible with modern vision-language models.

Note: The experimental notebooks are compiled with models that fit within the T4 GPU (free-tier) limits. More models along with their notebooks will be added over time.
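
As a rough illustration of what "fits within the T4 limits" means, the sketch below estimates weight memory from parameter count and precision and compares it with the T4's nominal 16 GB. It is only back-of-the-envelope arithmetic, not how the notebooks select models: the function names are hypothetical, and the 2 GB headroom for activations and caches is an assumed placeholder that varies with sequence length and image resolution.

```python
T4_MEMORY_GB = 16  # nominal memory of an NVIDIA T4

# Approximate storage cost per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params, dtype="fp16"):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits_on_t4(num_params, dtype="fp16", headroom_gb=2.0):
    """True if weights plus an assumed headroom fit in T4 memory.

    headroom_gb is a hypothetical allowance for activations and caches.
    """
    return weight_memory_gb(num_params, dtype) + headroom_gb <= T4_MEMORY_GB


if __name__ == "__main__":
    for params, dtype in [(3e9, "fp16"), (7e9, "int4"), (7e9, "fp32")]:
        print(f"{params / 1e9:.0f}B {dtype}: fits = {fits_on_t4(params, dtype)}")
```

By this estimate, larger models generally need reduced precision or quantization to leave room for inference on a free-tier T4.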
prithivMLmods posted an update 12 days ago
Excited to introduce the new experimental model "Qwen2.5-VL-7B-Abliterated-Caption-it", which performs exceptionally well on image captioning tasks. This variant is specifically tailored for abliterated (uncensored) image captioning. It is designed to generate highly detailed and descriptive captions across a broad range of visual categories, including images with complex, sensitive, or nuanced content, while handling varying aspect ratios and resolutions. 🧪🤗

✨ Try the demo here : prithivMLmods/Qwen2.5-VL
✨ Qwen2.5-VL-7B-Abliterated-Caption-it : prithivMLmods/Qwen2.5-VL-7B-Abliterated-Caption-it
✨ Multimodal VLMs : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
✨ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

.
.
.
To learn more, visit the model card of the respective model!

Remove duplicates (#4 opened 12 days ago by yichiac)
prithivMLmods posted an update 13 days ago
olmOCR [Allen AI] just got an upgrade! 📈🧑‍🍳

allenai/olmOCR-7B-0725 is fine-tuned with allenai/olmOCR-mix-0225 on top of Qwen/Qwen2.5-VL-7B-Instruct, pushing the boundaries of OCR technology. It takes a single document image as input, with the longest side resized to 1288 pixels. It is a high-quality, openly available approach to optical character recognition for parsing PDFs and other complex documents.
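
The "longest side resized to 1288 pixels" step can be sketched as plain arithmetic. This is a minimal illustration, not olmOCR's own preprocessing code: the function name is hypothetical, and it assumes the aspect ratio is preserved and that smaller pages are scaled up to the target as well.

```python
def resize_longest_side(width, height, target=1288):
    """Return (new_width, new_height) with the longest side equal to target.

    Assumes aspect ratio is preserved and that images smaller than
    target are upscaled; both are assumptions of this sketch.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target / max(width, height)
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return new_width, new_height


if __name__ == "__main__":
    # A landscape scan of 2576 x 2000 pixels is halved.
    print(resize_longest_side(2576, 2000))
```

The resulting size could then be passed to an image library's resize call before handing the page to the model.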

Try the demo here: prithivMLmods/Multimodal-OCR

✨ Model: allenai/olmOCR-7B-0725
✨ Model [fp8]: allenai/olmOCR-7B-0725-FP8
✨ Multimodal Implementations Space Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

.
.
.
To learn more, visit the model card of the respective model!
prithivMLmods posted an update 17 days ago
Upgraded the step-by-step notebook for fine-tuning SigLIP2 on domain-specific image classification tasks. The notebook supports both datasets with predefined train/test splits and those with only a train split, making it suitable for low-resource, custom, and real-world classification scenarios. 📢👉

➺ FineTuning-SigLIP2-Notebook : prithivMLmods/FineTuning-SigLIP2-Notebook

➺ GitHub : https://github.com/PRITHIVSAKTHIUR/FineTuning-SigLIP-2

➺ In the first, datasets include predefined train and test splits, enabling conventional supervised learning and generalization evaluation : prithivMLmods/FineTuning-SigLIP2-Notebook (.ipynb)

➺ In the second scenario, only a training split is available; in such cases, the training set is either partially reserved for validation or reused entirely for evaluation : prithivMLmods/FineTuning-SigLIP2-Notebook (.ipynb)

This flexibility supports experimentation in constrained or domain-specific settings, where standard test annotations may not exist.
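
The two scenarios above can be sketched with a small split-selection helper. This is an illustrative sketch rather than the notebook's code: `select_splits`, `val_fraction`, `seed`, and `reuse_train_for_eval` are hypothetical names, and the dataset is modeled as a plain dict of lists instead of a Hugging Face dataset object.

```python
import random


def select_splits(splits, val_fraction=0.2, seed=42, reuse_train_for_eval=False):
    """Pick (train, eval) example lists from a dict of named splits.

    Scenario 1: a predefined "test" split exists and is used for evaluation.
    Scenario 2: only "train" exists; either hold out a fraction of it,
    or reuse the whole training set for evaluation.
    """
    train = list(splits["train"])
    if "test" in splits:
        return train, list(splits["test"])
    if reuse_train_for_eval:
        return train, list(train)
    if len(train) < 2:
        raise ValueError("need at least two examples to hold out a validation set")
    rng = random.Random(seed)  # fixed seed keeps the hold-out reproducible
    rng.shuffle(train)
    n_val = max(1, int(len(train) * val_fraction))
    return train[n_val:], train[:n_val]


if __name__ == "__main__":
    only_train = {"train": list(range(10))}
    train_part, eval_part = select_splits(only_train)
    print(len(train_part), len(eval_part))
```

Reusing the training set for evaluation only checks that the model has fit the data, so the hold-out path is usually the better default when enough examples are available.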