VACE ControlNet Guide

VACE is a powerful ControlNet that enables Video-to-Video and Reference-to-Video generation. It allows you to inject your own images into output videos, animate characters, perform inpainting/outpainting, and continue existing videos.

Overview

VACE is probably one of the most powerful Wan models available. With it, you can:

Inject people or objects into scenes
Animate characters
Perform video inpainting and outpainting
Continue existing videos
Transfer motion from one video to another
Change the style of scenes while preserving the structure of the scenes

Getting Started

Model Selection

Select either "Vace 1.3B" or "Vace 13B" from the dropdown menu
Note: VACE works best with videos up to 7 seconds with the Riflex option enabled

You can also use any derived Vace models such as Vace Fusionix or combine Vace with Loras accelerator such as Causvid.

Input Types

1. Control Video

The Control Video is the source material that contains the instructions about what you want. So Vace expects in the Control Video some visual hints about the type of processing expected: for instance replacing an area by something else, converting an Open Pose wireframe into a human motion, colorizing an Area, transferring the depth of an image area, ...

For example, anywhere your control video contains the color 127 (grey), it will be considered as an area to be inpainting and replaced by the content of your text prompt and / or a reference image (see below). Likewise if the frames of a Control Video contains an Open Pose wireframe (basically some straight lines tied together that describes the pose of a person), Vace will automatically turn this Open Pose into a real human based on the text prompt and any reference Images (see below).

You can either build yourself the Control Video with the annotators tools provided by the Vace team (see the Vace ressources at the bottom) or you can let WanGP (recommended option) generates on the fly a Vace formatted Control Video based on information you provide.

WanGP wil need the following information to generate a Vace Control Video:

A Control Video : this video shouldn't have been altered by an annotator tool and can be taken straight from youtube or your camera
Control Video Process : This is the type of process you want to apply on the control video. For instance Transfer Human Motion will generate the Open Pose information from your video so that you can transfer this same motion to a generated character. If you want to do only Spatial Outpainting or Temporal Inpainting / Outpainting you may want to choose the Keep Unchanged process.
Area Processed : you can target the processing to a specific area. For instance even if there are multiple people in the Control Video you may want to replace only one them. If you decide to target an area you will need to provide a Video Mask as well. These types of videos can be easily created using the Matanyone tool embedded with WanGP (see the doc of Matanyone below). WanGP can apply different types of process, one the mask and another one on the outside the mask.

Another nice thing is that you can combine all effects above with Outpainting since WanGP will create automatically an outpainting area in the Control Video if you ask for this.

By default WanGP will ask Vace to generate new frames in the "same spirit" of the control video if the latter is shorter than the number frames that you have requested.

Be aware that the Control Video and Video Mask will be before anything happens resampled to the number of frames per second of Vace (usually 16) and resized to the output size you have requested.

2. Reference Images

With Reference Images you can inject people or objects of your choice in the Video. You can also force Images to appear at a specific frame nos in the Video.

If the Reference Image is a person or an object, it is recommended to turn on the background remover that will replace the background by the white color. This is not needed for a background image or an injected frame at a specific position.

It is recommended to describe injected objects/people explicitly in your text prompt so that Vace can connect the Reference Images to the new generated video and this will increase the chance that you will find your injected people or objects.

Understanding Vace Control Video and Mask format

As stated above WanGP will adapt the Control Video and the Video Mask to meet your instructions. You can preview the first frames of the new Control Video and of the Video Mask in the Generation Preview box (just click a thumbnail) to check that your request has been properly interpreted. You can as well ask WanGP to save in the main folder of WanGP the full generated Control Video and Video Mask by launching the app with the --save-masks command.

Look at the background colors of both the Control Video and the Video Mask: The Mask Video is the most important because depending on the color of its pixels, the Control Video will be interpreted differently. If an area in the Mask is black, the corresponding Control Video area will be kept as is. On the contrary if an area of the Mask is plain white, a Vace process will be applied on this area. If there isn't any Mask Video the Vace process will apply on the whole video frames. The nature of the process itself will depend on what there is in the Control Video for this area.

if the area is grey (127) in the Control Video, this area will be replaced by new content based on the text prompt or image references
if an area represents a person in the wireframe Open Pose format, it will be replaced by a person animated with motion described by the Open Pose.The appearance of the person will depend on the text prompt or image references
if an area contains multiples shades of grey, these will be assumed to represent different levels of image depth and Vace will try to generate new content located at the same depth

There are more Vace representations. For all the different mapping please refer the official Vace documentation.

Other Processing

Most of the processing below and the ones related to Control Video can be combined together.

Temporal Outpainting
Temporal Outpainting requires an existing Source Video or Control Video and it amounts to adding missing frames. It is implicit if you use a Source Video that you want to continue (new frames will be added at the end of this Video) or if you provide a Control Video that contains fewer frames than the number that you have requested to generate.
Temporal Inpainting
With temporal inpainting you are asking Vace to generate missing frames that should exist between existing frames. There are two ways to do that:
- Injected Reference Images : Each Image is injected a position of your choice and Vace will fill the gaps between these frames
- Frames to keep in Control Video : If using a Control Video, you can ask WanGP to hide some of these frames to let Vace generate "alternate frames" for these parts of the Control Video.
Spatial Outpainting
This feature creates new content to the top, bottom, left or right of existing frames of a Control Video. You can set the amount of content for each direction by specifying a percentage of extra content in relation to the existing frame. Please note that the resulting video will target the resolution you specified. So if this Resolution corresponds to that of your Control Video you may lose details. Therefore it may be relevant to pick a higher resolution with Spatial Outpainting.
There are two ways to do Spatial Outpainting:
- Injected Reference Frames : new content will be added around Injected Frames
- Control Video : new content will be added on all the frames of the whole Control Video

Example 1 : Replace a Person in one video by another one by keeping the Background

In Vace, select Control Video Process=Transfer human pose, Area processed=Masked area
In Matanyone Video Mask Creator, load your source video and create a mask where you targetted a specific person
Click Export to Control Video Input and Video Mask Input to transfer both the original video that now becomes the Control Video and the black & white mask that now defines the Video Mask Area
Back in Vace, in Reference Image select Inject Landscapes / People / Objects and upload one or several pictures of the new person
Generate

This works also with several people at the same time (you just need to mask several people in Matanyone), you can also play with the slider Expand / Shrink Mask if the new person is larger than the original one and of course, you can also use the text Prompt if you dont want to use an image for the swap.

Example 2 : Change the Background behind some characters

In Vace, select Control Video Process=Inpainting, Area processed=Non Masked area
In Matanyone Video Mask Creator, load your source video and create a mask where you targetted the people you want to keep
Click Export to Control Video Input and Video Mask Input to transfer both the original video that now becomes the Control Video and the black & white mask that now defines the Video Mask Area
Generate

If instead Control Video Process=Depth, then the background although it will be still different it will have a similar geometry than in the control video

Example 3 : Outpaint a Video to the Left and Inject a Character in this new area

In Vace, select Control Video Process=Keep Unchanged
Control Video Outpainting in Percentage enter the value 40 to the Left entry
In Reference Image select Inject Landscapes / People / Objects and upload one or several pictures of a person
Enter the Prompt such as "a person is coming from the left" (you will need of course a more accurate description)
Generate

Creating Face / Object Replacement Masks

Matanyone is a tool that will generate the Video Mask that needs to be combined with the Control Video. It is very useful as you just need to indicate in the first frame the area you want to mask and it will compute masked areas for the following frames by taking into account the motion.

Load your video in Matanyone
Click on the face or object in the first frame
Validate the mask by clicking Set Mask
Generate a copy of the control video (for easy transfers) and a new mask video by clicking "Generate Video Matting"
Export to VACE with Export to Control Video Input and Video Mask Input

Advanced Matanyone Tips

Negative Point Prompts: Remove parts from current selection if the mask goes beyond the desired area
Sub Masks: Create multiple independent masks, then combine them. This may be useful if you are struggling to select exactly what you want.

Window Sliding for Long Videos

Generate videos up to 1 minute by merging multiple windows: The longer the video the greater the quality degradation. However the effect will be less visible if your generated video reuses mostly non altered control video.

When this feature is enabled it is important to keep in mind that every positional argument of Vace (frames positions of Injected Reference Frames, Frames to keep in Control Video) are related to the first frame of the first Window. This is convenient as changing the size of a sliding window won't have any impact and this allows you define in advance the inject frames for all the windows.

Likewise, if you use Continue Video File by providing a Source Video, this Source Video will be considered as the first window and the positional arguments will be calculated in relation to the first frame of this Source Video. Also the overlap window size parameter will correspond to the number of frames used of the Source Video that is temporally outpainted to produce new content.

How It Works

Each window uses the corresponding time segment of the Control Video
Example: 0-4s control video → first window, 4-8s → second window, etc.
Automatic overlap management ensures smooth transitions

Formula

This formula gives the number of Generated Frames for a specific number of Sliding Windows :

Generated Frames = [Nb Windows - 1] × [Window Size - Overlap - Discard] + Window Size

Multi-Line Prompts (Experimental)

If you enable Text Prompts separated by a Carriage Return will be used for a new Sliding Window, you can define in advance a different prompt for each window.:

Each prompt is separated by a Carriage Return
Each line of prompt will be used for a different window
If more windows than prompt lines, last line repeats

Recommended Settings

Quality Settings

Skip Layer Guidance: Turn ON with default configuration for better results (useless with FusioniX of Causvid are there is no cfg)
Long Prompts: Use detailed descriptions, especially for background elements not in reference images
Steps: Use at least 15 steps for good quality, 30+ for best results if you use the original Vace model. But only 8-10 steps are sufficient with Vace Funsionix or if you use Loras such as Causvid or Self-Forcing.

Sliding Window Settings

For very long videos, configure sliding windows properly:

Window Size: Set appropriate duration for your content
Overlap Frames: Long enough for motion continuity, short enough to avoid blur propagation
Discard Last Frames: Remove at least 4 frames from each window (VACE 1.3B tends to blur final frames)
Add Overlapped Noise: May or may not reduce quality degradation over time

Background Removal

WanGP includes automatic background removal options:

Use for reference images containing people/objects
Don't use this for landscape/setting reference images (the first reference image)
If you are not happy with the automatic background removal tool you can use the Image version of Matanyone for a precise background removal

External Resources

Official VACE Resources

GitHub: https://github.com/ali-vilab/VACE/tree/main/vace/gradios
User Guide: https://github.com/ali-vilab/VACE/blob/main/UserGuide.md
Preprocessors: Gradio tools for preparing materials

Recommended External Tools

Annotation Tools: For creating precise masks
Video Editors: For preparing control videos
Background Removal: For cleaning reference images

Troubleshooting

Poor Quality Results

Use longer, more detailed prompts
Enable Skip Layer Guidance
Increase number of steps (30+)
Check reference image quality
Ensure proper mask creation

Inconsistent Windows

Increase overlap frames
Use consistent prompting across windows
Add noise to overlapped frames
Reduce discard frames if losing too much content

Memory Issues

Use VACE 1.3B instead of 13B
Reduce video length or resolution
Decrease window size
Enable quantization

Blurry Results

Reduce overlap frames
Increase discard last frames
Use higher resolution reference images
Check control video quality

Tips for Best Results

Detailed Prompts: Describe everything in the scene, especially elements not in reference images
Quality Reference Images: Use high-resolution, well-lit reference images
Proper Masking: Take time to create precise masks with Matanyone
Iterative Approach: Start with short videos, then extend successful results
Background Preparation: Remove complex backgrounds from object/person reference images
Consistent Lighting: Match lighting between reference images and intended scene