VIREX-062225-exp

The VIREX-062225-exp (Video Information Retrieval and Extraction eXpert - experimental) model is a fine-tuned version of Qwen2.5-VL-7B-Instruct, optimized for video understanding, image comprehension, common-sense reasoning, and natural language decision-making through long chain-of-thought (CoT) reasoning. Built on the Qwen2.5-VL architecture, this experimental model is designed to extract insights from visual content through video-image frame sampling and multimodal reasoning.

VIREX: Video Information Retrieval and Extraction eXpert [ experimental ]

Key Enhancements

  • Advanced Video Information Retrieval: Capable of understanding complex video sequences, extracting key information, and providing detailed analysis of visual narratives across extended durations.

  • Enhanced Image Understanding with Physical Common Sense: Designed to comprehend real-world physics, spatial relationships, and contextual understanding in both static images and dynamic video content.

  • Long Chain-of-Thought Reasoning: Implements sophisticated reasoning pathways to provide detailed, logical explanations and decision-making processes in natural language.

  • Custom Video-Image Frame Sampling: Uses a redesigned dataset methodology with frame sampling techniques tailored to training on video understanding tasks.

  • Multimodal Decision Making: Enables complex decision-making through integration of visual information and natural language processing with contextual understanding.

  • Video Comprehension from Combined Datasets: Trained on a modular combination of the FineVideo and UltraVideo datasets to improve performance on video understanding tasks.

Quick Start with Transformers

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/VIREX-062225-exp", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/VIREX-062225-exp")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "path/to/your/video.mp4",
            },
            {"type": "text", "text": "Analyze this video and explain the physical interactions you observe using chain-of-thought reasoning."},
        ],
    }
]

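# Render the chat template to a prompt string, then extract the video frames.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to wherever device_map="auto" placed the model.
inputs = inputs.to(model.device)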

# Generate, then drop the prompt tokens so only newly generated tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
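
Frame sampling can also be controlled per video at inference time. The sketch below builds the same kind of `messages` list as the Quick Start, with `fps` and `max_pixels` keys added to the video entry, following the upstream Qwen2.5-VL examples for `qwen_vl_utils`. The helper name `build_video_message` and the default values are illustrative assumptions, and suitable values for this fine-tune have not been verified.

```python
from typing import Any


def build_video_message(
    video_path: str,
    prompt: str,
    fps: float = 1.0,
    max_pixels: int = 360 * 420,
) -> list[dict[str, Any]]:
    """Build a single-turn chat message with per-video sampling hints.

    Hypothetical helper: the defaults are illustrative, not tuned values.
    """
    video_entry = {
        "type": "video",
        "video": video_path,
        "fps": fps,                # frames sampled per second of video
        "max_pixels": max_pixels,  # upper bound on pixels per sampled frame
    }
    return [
        {
            "role": "user",
            "content": [video_entry, {"type": "text", "text": prompt}],
        }
    ]


messages = build_video_message(
    "path/to/your/video.mp4",
    "Summarize the key events in this video step by step.",
    fps=0.5,
)
print(messages[0]["content"][0])
```

The resulting `messages` can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the Quick Start. A lower `fps` or `max_pixels` reduces the number of visual tokens, which may help with memory use on longer videos.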

Intended Use

This model is intended for:

  • Video Content Analysis: Deep understanding of video sequences, temporal relationships, and narrative comprehension.
  • Physical Common Sense Reasoning: Analysis of real-world physics, object interactions, and spatial relationships in visual content.
  • Chain-of-Thought Video Q&A: Detailed reasoning and explanation for complex video-based questions with step-by-step logical analysis.
  • Temporal Information Extraction: Retrieval of time-sensitive information and sequential understanding from video content.
  • Multimodal Decision Support: Integration of visual understanding with natural language reasoning for decision-making applications.
  • Educational and Research Applications: Analysis of instructional videos, research content, and educational material with detailed explanations.
  • Content Summarization: Intelligent summarization of video content with contextual understanding and key insight extraction.

Limitations

  • Experimental Status: As an experimental model, its performance may vary across use cases, and it requires further validation.
  • Computational Requirements: Video understanding tasks have high memory and processing demands, and the model is not optimized for real-time applications.
  • Video Length Constraints: Performance may degrade with extremely long videos due to context window limitations.
  • Domain Specificity: Optimized primarily for general video understanding; specialized domains may require additional fine-tuning.
  • Frame Sampling Dependency: Performance is dependent on the quality and relevance of frame sampling during inference.
  • Reasoning Complexity: While capable of chain-of-thought reasoning, extremely complex logical chains may still present challenges.
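
To illustrate the frame sampling dependency noted above, here is a minimal sketch of uniform frame selection, one common way to keep a long video within a fixed frame budget. This is a generic technique, not the custom sampling method used to train this model, which this card does not describe. The function name `uniform_frame_indices` is hypothetical.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick up to num_samples frame indices spread evenly across a video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Fewer frames than the budget: keep every frame.
        return list(range(total_frames))
    # Split the video into num_samples equal segments and take each midpoint.
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


print(uniform_frame_indices(total_frames=300, num_samples=4))
# → [37, 112, 187, 262]
```

With a fixed budget, the gap between sampled frames grows with video length, so short events in long videos are more likely to fall between samples.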

Model Capabilities

Video Understanding

  • Temporal sequence analysis
  • Object tracking and identification
  • Scene transition recognition
  • Action and activity recognition

Physical Common Sense

  • Physics-based reasoning
  • Spatial relationship understanding
  • Cause-and-effect analysis
  • Real-world interaction comprehension

Chain-of-Thought Reasoning

  • Step-by-step logical analysis
  • Detailed explanation generation
  • Multi-step problem solving
  • Contextual reasoning pathways

Training Details

| Parameter          | Value                                                            |
|--------------------|------------------------------------------------------------------|
| Dataset Size       | 11,750 samples (modular combination of FineVideo and UltraVideo) |
| Model Architecture | Qwen2_5_VLForConditionalGeneration                               |
| Hardware           | 3 × NVIDIA A40 (27 vCPUs)                                        |
| Total Disk         | 250,000 MB                                                       |
| Training Time      | 4,489 seconds (~1.25 hours)                                      |
| Learning Rate      | 1e-5                                                             |
| Scheduler          | Linear Decay                                                     |
| Warmup Steps       | 500                                                              |
| Precision          | bfloat16                                                         |
| Training Method    | Custom dataset with redesigned video-to-image frame sampling     |
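
The learning rate, scheduler, and warmup settings listed above can be read as the following schedule. This is a sketch under assumptions: it uses a linear warmup (the card lists only the number of warmup steps, not their shape) followed by linear decay to zero, which is the common formulation of a linear schedule with warmup. The total number of training steps is not stated in this card, so `total_steps=2000` below is a made-up value for illustration.

```python
def lr_at_step(
    step: int,
    total_steps: int,
    base_lr: float = 1e-5,
    warmup_steps: int = 500,
) -> float:
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        # Assumed linear warmup from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# total_steps=2000 is hypothetical; the real value is not given in this card.
for step in (0, 250, 500, 1250, 2000):
    print(step, lr_at_step(step, total_steps=2000))
```

Under these assumptions the rate peaks at 1e-5 at step 500 and falls back to half that value midway through the decay phase.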

Model size: 8.29B params (Safetensors, tensor type BF16)