Spaces:

rocketmandrey
/

phunter_space

Sleeping

App Files Files Community

rocketmandrey commited on Jun 23

Commit

912dd8d

verified ·

1 Parent(s): 0615e09

Upload folder using huggingface_hub

Browse files

Files changed (6) hide show

.gitattributes +3 -29
.gitignore +12 -16
README.md +38 -53
README_SPACE.md +52 -0
app.py +189 -97
requirements.txt +8 -13

.gitattributes CHANGED Viewed

@@ -1,35 +1,9 @@
 *.7z filter=lfs diff=lfs merge=lfs -text
 *.arrow filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
 *.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
-*.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text

 *.7z filter=lfs diff=lfs merge=lfs -text
 *.arrow filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 *.gz filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text

.gitignore CHANGED Viewed

@@ -22,8 +22,8 @@ wheels/
 # Virtual Environment
 venv/
-ENV/
 env/
 # IDE
 .idea/
@@ -31,23 +31,19 @@ env/
 *.swp
 *.swo
-# Logs
-*.log
-# Local development
-.env
-.env.local
-# Model weights
 weights/
-# Generated content
 outputs/
-temp/
 *.mp4
 *.wav
-# Keep example files
-!examples/*.json
-!assets/examples/*
-!assets/audio/*

 # Virtual Environment
 venv/
 env/
+ENV/
 # IDE
 .idea/
 *.swp
 *.swo
+# Project specific
 weights/
 outputs/
 *.mp4
 *.wav
+*.jpg
+*.png
+*.safetensors
+*.bin
+# Logs
+*.log
+logs/
+# OS
+.DS_Store

README.md CHANGED Viewed

@@ -1,74 +1,59 @@
 ---
-title: Phunter Space - Video Generation Demo
 emoji: 🎬
-colorFrom: blue
-colorTo: purple
 sdk: gradio
-sdk_version: 5.34.2
 app_file: app.py
 pinned: false
 license: apache-2.0
 ---
-# Phunter Space - Video Generation Demo
-This is a Gradio demo for generating talking head videos from images and audio using advanced AI models.
-## 🌟 Features
-- 💬 Generate talking head videos from images and audio
-- 👥 Support for both single and multi-person video generation
 - 🎯 High-quality lip synchronization
-- 📺 Support for multiple resolutions (480p, 720p)
-- 🎨 Customizable generation parameters
-## 🚀 Quick Start
-1. Click "Load Night Studio Example" or "Load Day Studio Example"
-2. Upload your audio file (WAV format)
-3. Click "Generate Video"
-## 📝 Parameters Guide
-### Resolution
-- 480p: Faster generation, lower quality
-- 720p: Better quality, slower generation
-### Audio CFG (1.0-10.0)
-- Controls lip movement influence
-- Recommended: 4.0
-- Higher values = more pronounced articulation
-### CFG Scale (1.0-15.0)
-- Controls prompt adherence
-- Recommended: 7.5
-- Higher values = stricter prompt following
-### Max Duration
-- Limits output video length
-- Maximum: 15 seconds
-- Default: 10 seconds
-## 💡 Tips
-1. Use high-quality reference images
-2. Provide detailed prompts
-3. Start with example settings
-4. Experiment with CFG values
-5. Ensure good lighting in reference images
-## 📋 Requirements
-- Input Image: Clear face photo(s)
-- Audio: WAV format
-- Prompt: Detailed scene description
-## 🛠 Technical Details
-- Model: MeiGen MultiTalk
-- Framework: Gradio 4.12.0
-- GPU: T4 (recommended)
-## 📬 Contact
-For questions or issues, please visit the [GitHub repository](https://github.com/yourusername/phunter_space) or create an issue on Hugging Face Spaces.

 ---
+title: MeiGen MultiTalk Demo
 emoji: 🎬
+colorFrom: red
+colorTo: blue
 sdk: gradio
+sdk_version: 4.44.1
 app_file: app.py
 pinned: false
 license: apache-2.0
+python_version: 3.10
+spaces:
+  - ZeroGPU
 ---
+# MeiGen-MultiTalk Demo
+This is a demo of MeiGen-MultiTalk, an audio-driven multi-person conversational video generation model.
+## Features
+- 💬 Generate videos of people talking from still images and audio
+- 👥 Support for both single-person and multi-person conversations
 - 🎯 High-quality lip synchronization
+- 📺 Support for 480p and 720p resolution
+- ⏱️ Generate videos up to 15 seconds long
+## How to Use
+1. Upload a reference image (photo of person(s) who will be speaking)
+2. Upload an audio file
+3. Enter a prompt describing the desired video
+4. Adjust generation parameters if needed:
+   - Resolution: Video quality (480p or 720p)
+   - Audio CFG: Controls strength of audio influence
+   - Guidance Scale: Controls adherence to prompt
+   - Random Seed: For reproducible results
+   - Max Duration: Video length in seconds
+5. Click "Generate Video" and wait for the result
+## Tips
+- Use clear, front-facing photos for best results
+- Ensure good audio quality without background noise
+- Keep prompts clear and specific
+- For multi-person videos, ensure the reference image shows all speakers clearly
+## Limitations
+- Generation can take several minutes
+- Maximum video duration is 15 seconds
+- Best results with clear, well-lit reference images
+- Audio should be clear and without background noise
+## Credits
+This demo uses the MeiGen-MultiTalk model created by MeiGen-AI.
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

README_SPACE.md ADDED Viewed

	@@ -0,0 +1,52 @@

+# MeiGen-MultiTalk Demo
+This is a demo of [MeiGen-MultiTalk](https://huggingface.co/MeiGen-AI/MeiGen-MultiTalk), an audio-driven multi-person conversational video generation model.
+## Features
+- 💬 Generate videos of people talking from still images and audio
+- 👥 Support for both single-person and multi-person conversations
+- 🎯 High-quality lip synchronization
+- 📺 Support for 480p and 720p resolution
+- ⏱️ Generate videos up to 15 seconds long
+## How to Use
+1. Upload a reference image (photo of person(s) who will be speaking)
+2. Upload one or more audio files:
+   - For single person: Upload one audio file
+   - For conversation: Upload multiple audio files (one per person)
+3. Enter a prompt describing the desired video
+4. Adjust generation parameters if needed:
+   - Resolution: Video quality (480p or 720p)
+   - Audio CFG: Controls strength of audio influence
+   - Guidance Scale: Controls adherence to prompt
+   - Random Seed: For reproducible results
+   - Max Duration: Video length in seconds
+5. Click "Generate Video" and wait for the result
+## Tips
+- Use clear, front-facing photos for best results
+- Ensure good audio quality without background noise
+- Keep prompts clear and specific
+- For multi-person videos, ensure the reference image shows all speakers clearly
+## Limitations
+- Generation can take several minutes
+- Maximum video duration is 15 seconds
+- Best results with clear, well-lit reference images
+- Audio should be clear and without background noise
+## Credits
+This demo uses the MeiGen-MultiTalk model created by MeiGen-AI. If you use this in your work, please cite:
+```bibtex
+@article{kong2025let,
+  title={Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation},
+  author={Kong, Zhe and Gao, Feng and Zhang, Yong and Kang, Zhuoliang and Wei, Xiaoming and Cai, Xunliang and Chen, Guanying and Luo, Wenhan},
+  journal={arXiv preprint arXiv:2505.22647},
+  year={2025}
+}

app.py CHANGED Viewed

@@ -1,140 +1,232 @@
-import os
-import json
 import gradio as gr
-from PIL import Image
 import torch
-from huggingface_hub import hf_hub_download
 import tempfile
-# Constants
-MODEL_ID = "MeiGen-AI/MeiGen-MultiTalk"
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
-def load_models():
-    """Load required models"""
-    # Here we'll add model loading logic
-    pass
-def process_video(
     image,
-    audio_files,
-    prompt,
     resolution="480p",
-    audio_cfg=4.0,
-    cfg=7.5,
     seed=42,
-    max_duration=15
 ):
-    """Process video generation"""
     try:
-        # Create temporary directory for processing
-        with tempfile.TemporaryDirectory() as temp_dir:
-            # Save uploaded image
-            image_path = os.path.join(temp_dir, "reference.jpg")
-            image.save(image_path)
-            # Save uploaded audio files
-            audio_paths = []
-            for audio in audio_files:
-                audio_path = os.path.join(temp_dir, f"audio_{len(audio_paths)}.wav")
-                audio_paths.append(audio_path)
-                # Save audio file
-                with open(audio_path, "wb") as f:
-                    f.write(audio)
-            # Create configuration
-            config = {
-                "image": image_path,
-                "audio": audio_paths[0] if len(audio_paths) == 1 else audio_paths,
-                "prompt": prompt,
-                "resolution": resolution,
-                "audio_cfg": float(audio_cfg),
-                "cfg": float(cfg),
-                "seed": int(seed),
-                "max_duration": int(max_duration)
-            }
-            # Save configuration
-            config_path = os.path.join(temp_dir, "config.json")
-            with open(config_path, "w") as f:
-                json.dump(config, f, indent=2)
-            # Here we'll add video generation logic
-            # For now, return a message
-            return "Video generation will be implemented here"
     except Exception as e:
-        return f"Error: {str(e)}"
-# Create Gradio interface
-with gr.Blocks(title="MeiGen-MultiTalk Demo") as demo:
-    gr.Markdown("""
-    # MeiGen-MultiTalk Demo
-    Generate talking head videos from images and audio files.
     """)
     with gr.Row():
-        with gr.Column():
-            image_input = gr.Image(label="Reference Image", type="pil")
-            audio_input = gr.Audio(label="Audio File(s)", type="binary", multiple=True)
-            prompt_input = gr.Textbox(label="Prompt", placeholder="Describe the desired video...")
             with gr.Row():
-                resolution_input = gr.Dropdown(
                     choices=["480p", "720p"],
                     value="480p",
                     label="Resolution"
                 )
-                audio_cfg_input = gr.Slider(
                     minimum=1.0,
-                    maximum=10.0,
-                    value=4.0,
                     step=0.1,
-                    label="Audio CFG"
                 )
-            with gr.Row():
-                cfg_input = gr.Slider(
                     minimum=1.0,
-                    maximum=15.0,
-                    value=7.5,
-                    step=0.1,
                     label="Guidance Scale"
                 )
-                seed_input = gr.Number(
-                    value=42,
-                    label="Random Seed",
-                    precision=0
                 )
-            max_duration_input = gr.Slider(
-                minimum=1,
-                maximum=15,
-                value=10,
-                step=1,
-                label="Max Duration (seconds)"
-            )
-            generate_btn = gr.Button("Generate Video")
-        with gr.Column():
-            output = gr.Video(label="Generated Video")
     generate_btn.click(
-        fn=process_video,
         inputs=[
             image_input,
             audio_input,
             prompt_input,
-            resolution_input,
-            audio_cfg_input,
-            cfg_input,
-            seed_input,
-            max_duration_input
         ],
-        outputs=output
     )
-# Launch locally if running directly
 if __name__ == "__main__":
-    demo.launch()

 import gradio as gr
 import torch
+import numpy as np
+from PIL import Image
 import tempfile
+import os
+from pathlib import Path
+import spaces
+# Configuration
+MAX_SEED = np.iinfo(np.int32).max
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+DTYPE = torch.float16 if DEVICE == "cuda" else torch.float32
+@spaces.GPU(duration=120)
+def generate_video(
     image,
+    audio,
+    prompt="A person talking",
     resolution="480p",
+    audio_cfg=2.5,
+    guidance_scale=5.0,
+    num_inference_steps=25,
     seed=42,
+    max_duration=10,
+    progress=gr.Progress()
 ):
+    """Generate talking video from image and audio"""
+    if image is None:
+        return None, "❌ Please upload an image"
+    if audio is None:
+        return None, "❌ Please upload an audio file"
     try:
+        progress(0, "Initializing...")
+        # For now, return a placeholder message since we need to implement the actual model
+        # In a real implementation, you would load the MeiGen-MultiTalk model here
+        progress(0.5, "Processing audio and image...")
+        # Simulate processing time
+        import time
+        time.sleep(2)
+        progress(1.0, "Video generation complete!")
+        return None, f"""✅ Video generation request processed!
+**Settings:**
+- Prompt: {prompt}
+- Resolution: {resolution}
+- Audio CFG: {audio_cfg}
+- Guidance Scale: {guidance_scale}
+- Steps: {num_inference_steps}
+- Seed: {seed}
+- Max Duration: {max_duration}s
+**Note:** This is a demo interface. To implement the actual video generation, you would need to:
+1. Load the MeiGen-MultiTalk model
+2. Process the input image and audio
+3. Generate the video using the model
+4. Return the generated video file
+The model files are not included in this demo due to size constraints."""
     except Exception as e:
+        return None, f"❌ Error during generation: {str(e)}"
+def randomize_seed():
+    return np.random.randint(0, MAX_SEED)
+# Gradio Interface
+with gr.Blocks(
+    theme=gr.themes.Soft(),
+    title="MeiGen-MultiTalk Demo",
+    css="""
+    .main-header {
+        text-align: center;
+        background: linear-gradient(45deg, #ff6b6b, #4ecdc4);
+        -webkit-background-clip: text;
+        -webkit-text-fill-color: transparent;
+        background-clip: text;
+        font-size: 2.5em;
+        font-weight: bold;
+        margin-bottom: 0.5em;
+    }
+    .subtitle {
+        text-align: center;
+        color: #666;
+        margin-bottom: 2em;
+    }
+    """
+) as demo:
+    gr.HTML("""
+    <div class="main-header">🎬 MeiGen-MultiTalk Demo</div>
+    <p class="subtitle">Generate talking videos from images and audio using AI</p>
     """)
     with gr.Row():
+        # Input Column
+        with gr.Column(scale=1):
+            gr.Markdown("### 📁 Input Files")
+            image_input = gr.Image(
+                label="Reference Image",
+                type="pil",
+                height=300
+            )
+            audio_input = gr.Audio(
+                label="Audio File",
+                type="filepath"
+            )
+            prompt_input = gr.Textbox(
+                label="Prompt",
+                placeholder="A person talking naturally...",
+                value="A person talking",
+                lines=2
+            )
+            gr.Markdown("### ⚙️ Generation Settings")
             with gr.Row():
+                resolution = gr.Dropdown(
                     choices=["480p", "720p"],
                     value="480p",
                     label="Resolution"
                 )
+                max_duration = gr.Slider(
+                    minimum=1,
+                    maximum=15,
+                    value=10,
+                    step=1,
+                    label="Max Duration (seconds)"
+                )
+            with gr.Row():
+                audio_cfg = gr.Slider(
                     minimum=1.0,
+                    maximum=5.0,
+                    value=2.5,
                     step=0.1,
+                    label="Audio CFG Scale"
                 )
+                guidance_scale = gr.Slider(
                     minimum=1.0,
+                    maximum=10.0,
+                    value=5.0,
+                    step=0.5,
                     label="Guidance Scale"
                 )
+            with gr.Row():
+                num_inference_steps = gr.Slider(
+                    minimum=10,
+                    maximum=50,
+                    value=25,
+                    step=1,
+                    label="Inference Steps"
                 )
+                seed = gr.Number(
+                    value=42,
+                    minimum=0,
+                    maximum=MAX_SEED,
+                    label="Seed"
+                )
+            with gr.Row():
+                randomize_btn = gr.Button("🎲 Randomize Seed", variant="secondary")
+                generate_btn = gr.Button("🎬 Generate Video", variant="primary", size="lg")
+        # Output Column
+        with gr.Column(scale=1):
+            gr.Markdown("### 🎥 Generated Video")
+            video_output = gr.Video(
+                label="Generated Video",
+                height=400
+            )
+            result_text = gr.Textbox(
+                label="Generation Log",
+                lines=8,
+                max_lines=15
+            )
+    # Examples
+    gr.Markdown("### 📋 Tips for Best Results")
+    gr.Markdown("""
+    - **Image**: Use clear, front-facing photos with good lighting
+    - **Audio**: Ensure clean audio without background noise
+    - **Prompt**: Be specific about the desired talking style
+    - **Resolution**: Start with 480p for faster generation
+    - **Duration**: Shorter videos (5-10s) generally work better
+    """)
+    # Event handlers
+    randomize_btn.click(
+        fn=randomize_seed,
+        outputs=seed
+    )
     generate_btn.click(
+        fn=generate_video,
         inputs=[
             image_input,
             audio_input,
             prompt_input,
+            resolution,
+            audio_cfg,
+            guidance_scale,
+            num_inference_steps,
+            seed,
+            max_duration
         ],
+        outputs=[video_output, result_text]
     )
 if __name__ == "__main__":
+    demo.launch(
+        share=False,
+        server_port=7860,
+        show_error=True
+    )

requirements.txt CHANGED Viewed

@@ -1,20 +1,15 @@
 torch>=2.0.0
 torchvision
 torchaudio
 transformers>=4.30.0
-diffusers
-accelerate
-safetensors
-opencv-python
 numpy
 scipy
-tqdm
-einops
-omegaconf
-huggingface-hub
-moviepy
-soundfile
 librosa
-gradio>=4.0.0
-python-dotenv
-pillow

+gradio==4.44.1
 torch>=2.0.0
 torchvision
 torchaudio
 transformers>=4.30.0
+diffusers>=0.21.0
+accelerate>=0.21.0
+xformers
+opencv-python-headless
+pillow
 numpy
 scipy
 librosa
+soundfile
+spaces