xrenaa and JunGaoNVIDIA committed on
Commit db6d2a8 · verified · 1 Parent(s): d965ac5

Update model card (#1)


- Update model card (653bd41e316966d858b7989d4135f359be6e08e5)


Co-authored-by: Gao <JunGaoNVIDIA@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +144 -21
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- license: apache-2.0
  ---

  # GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
@@ -16,28 +16,151 @@ CVPR 2025 (Highlight)
  [Sanja Fidler](https://www.cs.toronto.edu/~fidler/),
  [Jun Gao](https://www.cs.toronto.edu/~jungao/) <br>
  \* indicates equal contribution <br>
  **[Paper](https://arxiv.org/pdf/2503.03751), [Project Page](https://research.nvidia.com/labs/toronto-ai/GEN3C/)**

- Abstract: We present GEN3C, a generative video model with precise Camera Control and
- temporal 3D Consistency. Prior video models already generate realistic videos,
- but they tend to leverage little 3D information, leading to inconsistencies,
- such as objects popping in and out of existence. Camera control, if implemented
- at all, is imprecise, because camera parameters are mere inputs to the neural
- network which must then infer how the video depends on the camera. In contrast,
- GEN3C is guided by a 3D cache: point clouds obtained by predicting the
- pixel-wise depth of seed images or previously generated frames. When generating
- the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with
- the new camera trajectory provided by the user. Crucially, this means that
- GEN3C neither has to remember what it previously generated nor does it have to
- infer the image structure from the camera pose. The model, instead, can focus
- all its generative power on previously unobserved regions, as well as advancing
- the scene state to the next frame. Our results demonstrate more precise camera
- control than prior work, as well as state-of-the-art results in sparse-view
- novel view synthesis, even in challenging settings such as driving scenes and
- monocular dynamic video. Results are best viewed in videos.
-
- ## Use the GEN3C Model
- Please visit the [GEN3C repository](https://github.com/nv-tlabs/GEN3C/) to access all relevant files and code.


  ## Citation
 
  ---
+ {}
  ---

  # GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control

  [Sanja Fidler](https://www.cs.toronto.edu/~fidler/),
  [Jun Gao](https://www.cs.toronto.edu/~jungao/) <br>
  \* indicates equal contribution <br>
+
+ ## Description: <br>
+
+ GEN3C is a generative video model with precise camera control and temporal three-dimensional (3D) consistency. We achieve this with a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the two-dimensional (2D) renderings of the 3D cache with the new camera trajectory provided by the user. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video.
+
+ This model is ready for commercial and non-commercial use. <br>
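
The loop below is a minimal, illustrative sketch of the cache-and-render process described above, not the released implementation; all function names (`estimate_depth`, `unproject_to_points`, `render_points`, `video_model`) are hypothetical placeholders passed in by the caller.

```python
import numpy as np

def generate_with_cache(seed_image, trajectory, video_model,
                        estimate_depth, unproject_to_points, render_points):
    """Illustrative loop: condition each new frame on a rendering of the 3D cache."""
    # Build the initial cache from the seed image's predicted pixel-wise depth.
    depth = estimate_depth(seed_image)                              # (H, W)
    cache = unproject_to_points(seed_image, depth, trajectory[0])   # (P, 6) xyz + rgb
    frames = [seed_image]
    for camera in trajectory[1:]:
        # Render the cache from the user-provided camera; unseen regions stay empty.
        guidance = render_points(cache, camera)                     # (H, W, 3)
        # The video model only has to fill unobserved regions and advance the scene.
        frame = video_model(guidance, camera)
        frames.append(frame)
        # Grow the cache with the newly generated frame.
        new_points = unproject_to_points(frame, estimate_depth(frame), camera)
        cache = np.concatenate([cache, new_points], axis=0)
    return np.stack(frames)                                         # (N, H, W, 3)
```
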
+
+ ### License/Terms of Use:
+ This model is released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). For a custom license, please contact cosmos-license@nvidia.com.
+
+ Important Note: If you bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism contained in the Model, your rights under the NVIDIA Open Model License Agreement will automatically terminate.
+
+ ### Deployment Geography:
+ Global
+
+ ### Use Case: <br>
+ This model is intended for researchers developing consistent video generation, and it allows users to control the final generation with camera trajectories. For autonomous vehicle (AV) applications, it enables users to generate driving videos with user-specified camera trajectories, such as switching from the viewpoint of a sedan to that of a truck, or looking at a different lane.
+
+ ### Release Date: <br>
+ GitHub 06/10/2025 via https://github.com/nv-tlabs/Gen3C <br>
+
+ ## Reference:
+ GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
  **[Paper](https://arxiv.org/pdf/2503.03751), [Project Page](https://research.nvidia.com/labs/toronto-ai/GEN3C/)**

+ ## Model Architecture: <br>
+ **Architecture Type:** Convolutional Neural Network (CNN), Transformer <br>
+ **Network Architecture:** Transformer <br>
+
+ This model was developed based on [Cosmos Predict 1](https://github.com/nvidia-cosmos/cosmos-predict1/tree/main). <br>
+ This model has 7B parameters. <br>
+
+ ## Input: <br>
+ **Input Type(s):** Camera Parameters, Image <br>
+ **Input Format(s):** 1D Array of Camera Poses, 2D Array of Images <br>
+ **Input Parameters:** Camera Poses (1D), Images (2D) <br>
+ **Other Properties Related to Input:** The input image should be 720 × 1080 resolution, and we recommend using 121 frames for the camera parameters (see the shape sketch below). <br>
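
A minimal sketch of shaping these inputs with NumPy, purely as a reading aid: the flattened 4 × 4 camera matrix used for each pose is an assumption (the card only states that poses are passed as a 1D array), and the file name is a placeholder.

```python
import numpy as np
from PIL import Image

NUM_FRAMES = 121           # recommended number of camera poses
HEIGHT, WIDTH = 720, 1080  # expected input image resolution

# Seed image as an (H, W, 3) uint8 array at the expected resolution.
image = np.asarray(Image.open("seed_image.png").convert("RGB").resize((WIDTH, HEIGHT)))
assert image.shape == (HEIGHT, WIDTH, 3)

# One camera pose per frame, each flattened to a 1D vector of 16 values
# (identity extrinsics here, purely as a placeholder trajectory).
poses = np.tile(np.eye(4, dtype=np.float32).reshape(-1), (NUM_FRAMES, 1))
assert poses.shape == (NUM_FRAMES, 16)
```
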
+ ## Output: <br>
+ **Output Type(s):** Videos <br>
+ **Output Format:** MP4 video <br>
+ **Output Parameters:** 3D (N x H x W), with 3 channels (Red, Green, Blue (RGB)) <br>
+ **Other Properties Related to Output:** A sequence of images (N x H x W x 3), where N is the number of frames, H is the height, and W is the width. Three (3) refers to the number of RGB channels (see the example below). <br>
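
As an illustration only, one way (not the only way) to write such an (N x H x W x 3) array to an MP4 file is with `imageio` and its FFmpeg backend; the array here is a synthetic placeholder and the frame rate is an assumption.

```python
import numpy as np
import imageio.v2 as imageio  # MP4 output requires the imageio-ffmpeg backend

# Placeholder for the generated output: N frames of H x W RGB pixels.
frames = np.zeros((121, 720, 1080, 3), dtype=np.uint8)

# Write the frame sequence to an MP4 file; fps=24 is an arbitrary choice.
imageio.mimsave("generated.mp4", list(frames), fps=24)
```
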
+
+ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems (A100 and H100). By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
+
+ ## Software Integration: <br>
+ **Runtime Engine(s):**
+ * [Cosmos-Predict1](https://github.com/nvidia-cosmos/cosmos-predict1) <br>
+
+ **Supported Hardware Microarchitecture Compatibility:** <br>
+ * NVIDIA Ampere <br>
+ * NVIDIA Blackwell <br>
+ * NVIDIA Hopper <br>
+
+ **Supported Operating System(s):** <br>
+ * Linux
+
+ ## Model Version(s):
+ - V1.0
+
+ ## Inference:
+ **Engine:** [Cosmos-Predict1](https://github.com/nvidia-cosmos/cosmos-predict1) <br>
+ **Test Hardware:** <br>
+ * NVIDIA Ampere <br>
+ * NVIDIA Blackwell <br>
+ * NVIDIA Hopper <br>
+
+ ## Ethical Considerations:
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+
+ Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
+
+ For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [link to subcard](https://gitlab-master.nvidia.com/jung/gen3c_modelcard_subcard/-/blob/main/modelcard.md?ref_type=heads).
+
+ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
+
+ ### Plus Plus (++) Promise
+
+ We value you, the datasets, the diversity they represent, and what we have been entrusted with. This model and its associated data have been:
+
+ - Verified to comply with current applicable disclosure laws, regulations, and industry standards.
+ - Verified to comply with applicable privacy labeling requirements.
+ - Annotated to describe the collector/source (NVIDIA or a third-party).
+ - Characterized for technical limitations.
+ - Reviewed to ensure proper disclosure is accessible to, maintained for, and in compliance with NVIDIA data subjects and their requests.
+ - Reviewed before release.
+ - Tagged for known restrictions and potential safety implications.
+
+ ### Bias
117
+
118
+ Field | Response
119
+ :---------------------------------------------------------------------------------------------------|:---------------
120
+ Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None
121
+ Measures taken to mitigate against unwanted bias: | None
122
+
123
+
124
+ ### Explainability
125
+
126
+
127
+
128
+ Field | Response
129
+ :------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------
130
+ Intended Task/Domain: | Novel view synthesis, video generation
131
+ Model Type: | Transformer
132
+ Intended Users: | Physical AI developers.
133
+ Output: | Videos
134
+ Describe how the model works: | We first predict depth for the input image, unproject it in to 3D to maintain a 3D cache. The 3D cache is then projected into a incomplete 2D video, which will be used as a condition for Cosmos to generate final video.
135
+ Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable.
136
+ Technical Limitations & Mitigation: | While the model aims to create photorealistic scenes that replicate real-world conditions, it may generate outputs that are not entirely visually accurate and may require augmentation and/or real-world data depending on the scope and use case.
137
+ Verified to have met prescribed NVIDIA quality standards: | Yes
138
+ Performance Metrics: | Qualitative and Quantitative Evaluation including PSNR, SSIM, LPIPS metrics. See [Gen3C](https://research.nvidia.com/labs/toronto-ai/GEN3C/paper.pdf) paper Section 5. for details.
139
+ Potential Known Risks: | This model may inaccurately characterize depth, which will make the generated video un-realistic and prone to artifacts.
140
+ Licensing: | [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license)
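
The depth-unprojection step referenced in the "Describe how the model works" row above is, in standard pinhole-camera terms, the back-projection sketched below. This is a generic sketch assuming a 3 × 3 intrinsics matrix `K` and a 4 × 4 camera-to-world matrix, not code from the GEN3C release.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift an (H, W) depth map to world-space 3D points of shape (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates, one row per pixel.
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = (np.linalg.inv(K) @ pixels.T).T          # camera-space ray directions
    points_cam = rays * depth.reshape(-1, 1)        # scale each ray by its depth
    points_h = np.concatenate([points_cam, np.ones((H * W, 1))], axis=1)
    return (cam_to_world @ points_h.T).T[:, :3]     # transform to world coordinates
```
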
+ ### Privacy
+
+ Field | Response
+ :---|:---
+ Generatable or reverse engineerable personal data? | None Known
+ Personal data used to create this model? | None Known
+ Was consent obtained for any personal data used? | None Known
+ How often is dataset reviewed? | Before Release
+ Does data labeling (annotation, metadata) comply with privacy laws? | Yes
+ Applicable Privacy Policy: | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/
+
+ ### Safety
+
+ Field | Response
+ :---|:---
+ Model Application Field(s): | World Generation
+ Describe the life critical impact (if present): | None Known
+ Use Case Restrictions: | Abide by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license)
+ Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.


  ## Citation