VIREX-062225-exp

VIREX-062225-exp (Video Information Retrieval and Extraction eXpert, experimental) is a fine-tuned version of Qwen2.5-VL-7B-Instruct, optimized for video understanding, image comprehension, physical common-sense reasoning, and natural-language decision-making through long chain-of-thought (CoT) reasoning. Built on the Qwen2.5-VL architecture, this experimental model is designed to extract insights from visual content through video-image frame sampling and multimodal reasoning.

Key Enhancements

Advanced Video Information Retrieval: Capable of understanding complex video sequences, extracting key information, and providing detailed analysis of visual narratives across extended durations.
Enhanced Image Understanding with Physical Common Sense: Designed to comprehend real-world physics, spatial relationships, and contextual understanding in both static images and dynamic video content.
Long Chain-of-Thought Reasoning: Implements sophisticated reasoning pathways to provide detailed, logical explanations and decision-making processes in natural language.
Custom Video-Image Frame Sampling: Uses a redesigned dataset methodology with frame sampling techniques tailored to training on video understanding tasks.
Multimodal Decision Making: Enables complex decision-making through integration of visual information and natural language processing with contextual understanding.
State-of-the-Art Video Comprehension: Targets strong performance on video understanding benchmarks through a modular combination of the FineVideo and UltraVideo datasets.
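
The card does not spell out how frames are sampled. As a rough illustration only, and not the author's actual method, evenly spaced sampling is one common baseline. The function name and the frame counts below are hypothetical.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` frame indices spread evenly across a clip.

    Illustrative baseline only; the sampling used to build the VIREX
    training set is not documented in this card.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Fewer frames than requested samples: keep every frame.
        return list(range(total_frames))
    # Split the clip into equal-width bins and take the midpoint of each bin.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# Hypothetical clip: 300 frames, reduced to 8 representative frames.
indices = uniform_frame_indices(total_frames=300, num_samples=8)
print(indices)
```

Taking bin midpoints rather than bin starts avoids always selecting the very first frame, which is often a title card or a black frame.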

Quick Start with Transformers

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/VIREX-062225-exp", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/VIREX-062225-exp")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "path/to/your/video.mp4",
            },
            {"type": "text", "text": "Analyze this video and explain the physical interactions you observe using chain-of-thought reasoning."},
        ],
    }
]

# Render the chat template to a prompt string, then load the visual inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
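# Move inputs to the device the model was loaded on (device_map="auto" may not pick "cuda").
inputs = inputs.to(model.device)

# Generate, then drop the prompt tokens so only the newly generated tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]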
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
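
The message list above can also carry per-video sampling hints. The sketch below assumes the installed `qwen_vl_utils` honors the optional `fps` and `max_pixels` keys on a video entry, as the upstream Qwen2.5-VL usage examples show. Check your installed version. The helper name `build_video_messages` and the default values are hypothetical, not part of this model's API.

```python
def build_video_messages(video, prompt, fps=1.0, max_pixels=360 * 420):
    """Build a single-turn message list for one video and one text prompt.

    `fps` and `max_pixels` are optional hints read by qwen_vl_utils when it
    loads the video (assumption: the installed version supports these keys).
    """
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video,
                    "fps": fps,                # frames sampled per second of video
                    "max_pixels": max_pixels,  # upper bound on pixels per frame
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Drop-in replacement for the hand-written `messages` list in the quick start.
messages = build_video_messages(
    "path/to/your/video.mp4",
    "Analyze this video and explain the physical interactions you observe.",
    fps=0.5,
)
```

Lowering `fps` or `max_pixels` reduces the number of visual tokens, which trades detail for memory and latency.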

Intended Use

This model is intended for:

Video Content Analysis: Deep understanding of video sequences, temporal relationships, and narrative comprehension.
Physical Common Sense Reasoning: Analysis of real-world physics, object interactions, and spatial relationships in visual content.
Chain-of-Thought Video Q&A: Detailed reasoning and explanation for complex video-based questions with step-by-step logical analysis.
Temporal Information Extraction: Retrieval of time-sensitive information and sequential understanding from video content.
Multimodal Decision Support: Integration of visual understanding with natural language reasoning for decision-making applications.
Educational and Research Applications: Analysis of instructional videos, research content, and educational material with detailed explanations.
Content Summarization: Intelligent summarization of video content with contextual understanding and key insight extraction.

Limitations

Experimental Status: As an experimental model, performance may vary across different use cases and requires further validation.
Computational Requirements: Video understanding tasks demand substantial memory and compute, and the model is not optimized for real-time applications.
Video Length Constraints: Performance may degrade with extremely long videos due to context window limitations.
Domain Specificity: Optimized primarily for general video understanding; specialized domains may require additional fine-tuning.
Frame Sampling Dependency: Performance depends on the quality and relevance of the frames sampled during inference.
Reasoning Complexity: While capable of chain-of-thought reasoning, extremely complex logical chains may still present challenges.
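
The video-length and frame-sampling limitations above interact: a longer clip at a fixed sampling rate produces more frames and therefore more context tokens. One way to reason about this, sketched under assumptions, is to cap the frame count and derive the sampling rate from it. The card states no frame limit, so the budget of 120 frames below is a made-up value, and the function name is hypothetical.

```python
def fps_for_budget(duration_s: float, max_frames: int, preferred_fps: float = 1.0) -> float:
    """Return the highest sampling rate <= preferred_fps that fits the frame budget."""
    if duration_s <= 0 or max_frames <= 0:
        raise ValueError("duration_s and max_frames must be positive")
    # Frames sampled is roughly duration_s * fps, so fps <= max_frames / duration_s.
    return min(preferred_fps, max_frames / duration_s)


# A 1-minute clip fits the hypothetical 120-frame budget at the preferred rate.
print(fps_for_budget(duration_s=60, max_frames=120))
# A 10-minute clip has to be sampled more sparsely to stay within the budget.
print(fps_for_budget(duration_s=600, max_frames=120))
```

Sparser sampling keeps long videos within the context window, but it can miss short events, which is the trade-off the frame-sampling limitation describes.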

Model Capabilities

Video Understanding

Temporal sequence analysis
Object tracking and identification
Scene transition recognition
Action and activity recognition

Physical Common Sense

Physics-based reasoning
Spatial relationship understanding
Cause-and-effect analysis
Real-world interaction comprehension

Chain-of-Thought Reasoning

Step-by-step logical analysis
Detailed explanation generation
Multi-step problem solving
Contextual reasoning pathways

Training Details

Parameter	Value
Dataset Size	11,750 samples (Modular Combination of FineVideo and UltraVideo)
Model Architecture	`Qwen2_5_VLForConditionalGeneration`
Hardware	3 × NVIDIA A40 (27 vCPUs)
Total Disk	250,000 MB
Training Time	4,489 seconds (~1.25 hours)
Learning Rate	1e-5
Scheduler	Linear Decay
Warmup Steps	500
Precision	bfloat16
Training Method	Custom dataset with redesigned video-to-image frame sampling
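
The learning-rate rows in the table (1e-5, linear decay, 500 warmup steps) suggest a schedule of the kind sketched below: a linear ramp to the base rate, then a linear decay to zero. The total number of optimizer steps is not reported, so `total_steps=2000` is a hypothetical value, and whether the run used exactly this shape is an assumption.

```python
def linear_warmup_decay_lr(step, total_steps, base_lr=1e-5, warmup_steps=500):
    """Learning rate at `step`: linear warmup to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length; the card does not report the real step count.
for step in (0, 250, 500, 1250, 2000):
    print(step, linear_warmup_decay_lr(step, total_steps=2000))
```

With only 11,750 samples, a 500-step warmup can cover a large share of training, depending on the batch size, which the card does not state.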

References

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
https://arxiv.org/pdf/2409.12191
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
https://arxiv.org/pdf/2308.12966
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
https://arxiv.org/pdf/2201.11903
Video Understanding with Large Language Models: A Survey
https://arxiv.org/pdf/2312.17432
