Diffusers
Safetensors
English
DifixPipeline
shsolanki committed on
Commit
1c2d0a5
·
1 Parent(s): 4f645db

Difix Model Release

.gitattributes CHANGED
@@ -22,7 +22,7 @@
  *.pt filter=lfs diff=lfs merge=lfs -text
  *.pth filter=lfs diff=lfs merge=lfs -text
  *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
+ **/*.safetensors filter=lfs diff=lfs merge=lfs -text
  saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.tar.* filter=lfs diff=lfs merge=lfs -text
  *.tar filter=lfs diff=lfs merge=lfs -text
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
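The two rules this commit touches can be checked mechanically. The sketch below lists the patterns that a `.gitattributes` file routes through Git LFS, assuming the common `pattern attr1 attr2 ...` line layout. The helper name `lfs_patterns` is hypothetical, and the parser deliberately ignores quoting and macro attributes.

```python
def lfs_patterns(text):
    """Return the patterns whose attribute list contains filter=lfs."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            patterns.append(pattern)
    return patterns


# The two rules added on the new side of the diff above.
NEW_RULES = """\
**/*.safetensors filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
"""

print(lfs_patterns(NEW_RULES))
```

Running it on the full file instead of `NEW_RULES` would give the complete list of LFS-tracked patterns.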
LICENSE.txt ADDED
@@ -0,0 +1,76 @@
1
+ NVIDIA License
2
+
3
+ 1. Definitions
4
+
5
+ “Licensor” means any person or entity that distributes its Work.
6
+ “Work” means (a) the original work of authorship made available under this license, which may include software, documentation, or other files, and (b) any additions to or derivative works thereof that are made available under this license.
7
+ The terms “reproduce,” “reproduction,” “derivative works,” and “distribution” have the meaning as provided under U.S. copyright law; provided, however, that for the purposes of this license, derivative works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work.
8
+ Works are “made available” under this license by including in or with the Work either (a) a copyright notice referencing the applicability of this license to the Work, or (b) a copy of this license.
9
+
10
+ 2. License Grant
11
+
12
+ 2.1 Copyright Grant. Subject to the terms and conditions of this license, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free, copyright license to use, reproduce, prepare derivative works of, publicly display, publicly perform, sublicense and distribute its Work and any resulting derivative works in any form.
13
+
14
+ 3. Limitations
15
+
16
+ 3.1 Redistribution. You may reproduce or distribute the Work only if (a) you do so under this license, (b) you include a complete copy of this license with your distribution, and (c) you retain without modification any copyright, patent, trademark, or attribution notices that are present in the Work.
17
+
18
+ 3.2 Derivative Works. You may specify that additional or different terms apply to the use, reproduction, and distribution of your derivative works of the Work (“Your Terms”) only if (a) Your Terms provide that the use limitation in Section 3.3 applies to your derivative works, and (b) you identify the specific derivative works that are subject to Your Terms. Notwithstanding Your Terms, this license (including the redistribution requirements in Section 3.1) will continue to apply to the Work itself.
19
+
20
+ 3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use non-commercially. Notwithstanding the foregoing, NVIDIA Corporation and its affiliates may use the Work and any derivative works commercially. As used herein, “non-commercially” means for research or evaluation purposes only.
21
+
22
+ 3.4 Patent Claims. If you bring or threaten to bring a patent claim against any Licensor (including any claim, cross-claim or counterclaim in a lawsuit) to enforce any patents that you allege are infringed by any Work, then your rights under this license from such Licensor (including the grant in Section 2.1) will terminate immediately.
23
+
24
+ 3.5 Trademarks. This license does not grant any rights to use any Licensor’s or its affiliates’ names, logos, or trademarks, except as necessary to reproduce the notices described in this license.
25
+
26
+ 3.6 Termination. If you violate any term of this license, then your rights under this license (including the grant in Section 2.1) will terminate immediately.
27
+
28
+ 4. Disclaimer of Warranty.
29
+
30
+ THE WORK IS PROVIDED “AS IS” WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF
31
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER THIS LICENSE.
32
+
33
+ 5. Limitation of Liability.
34
+
35
+ EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING BUT NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER DAMAGES OR LOSSES), EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
36
+
37
+ STABILITY AI COMMUNITY LICENSE AGREEMENT
38
+
39
+ Last Updated: July 5, 2024
40
+
41
+ INTRODUCTION
42
+ This Agreement applies to any individual person or entity (“You”, “Your” or “Licensee”) that uses or distributes any portion or element of the Stability AI Materials or Derivative Works thereof for any Research & Non-Commercial or Commercial purpose. Capitalized terms not otherwise defined herein are defined in Section V below.
43
+
44
+ This Agreement is intended to allow research, non-commercial, and limited commercial uses of the Models free of charge. In order to ensure that certain limited commercial uses of the Models continue to be allowed, this Agreement preserves free access to the Models for people or organizations generating annual revenue of less than US $1,000,000 (or local currency equivalent).
45
+
46
+ By clicking “I Accept” or by using or distributing or using any portion or element of the Stability Materials or Derivative Works, You agree that You have read, understood and are bound by the terms of this Agreement. If You are acting on behalf of a company, organization or other entity, then “You” includes you and that entity, and You agree that You: (i) are an authorized representative of such entity with the authority to bind such entity to this Agreement, and (ii) You agree to the terms of this Agreement on that entity’s behalf.
47
+
48
+ RESEARCH & NON-COMMERCIAL USE LICENSE
49
+ Subject to the terms of this Agreement, Stability AI grants You a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable and royalty-free limited license under Stability AI’s intellectual property or other rights owned by Stability AI embodied in the Stability AI Materials to use, reproduce, distribute, and create Derivative Works of, and make modifications to, the Stability AI Materials for any Research or Non-Commercial Purpose. “Research Purpose” means academic or scientific advancement, and in each case, is not primarily intended for commercial advantage or monetary compensation to You or others. “Non-Commercial Purpose” means any purpose other than a Research Purpose that is not primarily intended for commercial advantage or monetary compensation to You or others, such as personal use (i.e., hobbyist) or evaluation and testing.
50
+
51
+ COMMERCIAL USE LICENSE
52
+ Subject to the terms of this Agreement (including the remainder of this Section III), Stability AI grants You a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable and royalty-free limited license under Stability AI’s intellectual property or other rights owned by Stability AI embodied in the Stability AI Materials to use, reproduce, distribute, and create Derivative Works of, and make modifications to, the Stability AI Materials for any Commercial Purpose. “Commercial Purpose” means any purpose other than a Research Purpose or Non-Commercial Purpose that is primarily intended for commercial advantage or monetary compensation to You or others, including but not limited to, (i) creating, modifying, or distributing Your product or service, including via a hosted service or application programming interface, and (ii) for Your business’s or organization’s internal operations. If You are using or distributing the Stability AI Materials for a Commercial Purpose, You must register with Stability AI at (https://stability.ai/community-license). If at any time You or Your Affiliate(s), either individually or in aggregate, generate more than USD $1,000,000 in annual revenue (or the equivalent thereof in Your local currency), regardless of whether that revenue is generated directly or indirectly from the Stability AI Materials or Derivative Works, any licenses granted to You under this Agreement shall terminate as of such date. You must request a license from Stability AI at (https://stability.ai/enterprise) , which Stability AI may grant to You in its sole discretion. If you receive Stability AI Materials, or any Derivative Works thereof, from a Licensee as part of an integrated end user product, then Section III of this Agreement will not apply to you.
53
+
54
+ GENERAL TERMS
55
+ Your Research, Non-Commercial, and Commercial License(s) under this Agreement are subject to the following terms. a. Distribution & Attribution. If You distribute or make available the Stability AI Materials or a Derivative Work to a third party, or a product or service that uses any portion of them, You shall: (i) provide a copy of this Agreement to that third party, (ii) retain the following attribution notice within a "Notice" text file distributed as a part of such copies: "This Stability AI Model is licensed under the Stability AI Community License, Copyright © Stability AI Ltd. All Rights Reserved”, and (iii) prominently display “Powered by Stability AI” on a related website, user interface, blogpost, about page, or product documentation. If You create a Derivative Work, You may add your own attribution notice(s) to the “Notice” text file included with that Derivative Work, provided that You clearly indicate which attributions apply to the Stability AI Materials and state in the “Notice” text file that You changed the Stability AI Materials and how it was modified. b. Use Restrictions. Your use of the Stability AI Materials and Derivative Works, including any output or results of the Stability AI Materials or Derivative Works, must comply with applicable laws and regulations (including Trade Control Laws and equivalent regulations) and adhere to the Documentation and Stability AI’s AUP, which is hereby incorporated by reference. Furthermore, You will not use the Stability AI Materials or Derivative Works, or any output or results of the Stability AI Materials or Derivative Works, to create or improve any foundational generative AI model (excluding the Models or Derivative Works). c. Intellectual Property. (i) Trademark License. 
No trademark licenses are granted under this Agreement, and in connection with the Stability AI Materials or Derivative Works, You may not use any name or mark owned by or associated with Stability AI or any of its Affiliates, except as required under Section IV(a) herein. (ii) Ownership of Derivative Works. As between You and Stability AI, You are the owner of Derivative Works You create, subject to Stability AI’s ownership of the Stability AI Materials and any Derivative Works made by or for Stability AI. (iii) Ownership of Outputs. As between You and Stability AI, You own any outputs generated from the Models or Derivative Works to the extent permitted by applicable law. (iv) Disputes. If You or Your Affiliate(s) institute litigation or other proceedings against Stability AI (including a cross-claim or counterclaim in a lawsuit) alleging that the Stability AI Materials, Derivative Works or associated outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by You, then any licenses granted to You under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Stability AI from and against any claim by any third party arising out of or related to Your use or distribution of the Stability AI Materials or Derivative Works in violation of this Agreement. (v) Feedback. From time to time, You may provide Stability AI with verbal and/or written suggestions, comments or other feedback related to Stability AI’s existing or prospective technology, products or services (collectively, “Feedback”). You are not obligated to provide Stability AI with Feedback, but to the extent that You do, You hereby grant Stability AI a perpetual, irrevocable, royalty-free, fully-paid, sub-licensable, transferable, non-exclusive, worldwide right and license to exploit the Feedback in any manner without restriction. 
Your Feedback is provided “AS IS” and You make no warranties whatsoever about any Feedback. d. Disclaimer Of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE STABILITY AI MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OR LAWFULNESS OF USING OR REDISTRIBUTING THE STABILITY AI MATERIALS, DERIVATIVE WORKS OR ANY OUTPUT OR RESULTS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE STABILITY AI MATERIALS, DERIVATIVE WORKS AND ANY OUTPUT AND RESULTS. e. Limitation Of Liability. IN NO EVENT WILL STABILITY AI OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF STABILITY AI OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING. f. Term And Termination. The term of this Agreement will commence upon Your acceptance of this Agreement or access to the Stability AI Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Stability AI may terminate this Agreement if You are in breach of any term or condition of this Agreement. Upon termination of this Agreement, You shall delete and cease use of any Stability AI Materials or Derivative Works. Section IV(d), (e), and (g) shall survive the termination of this Agreement. g. Governing Law. 
This Agreement will be governed by and constructed in accordance with the laws of the United States and the State of California without regard to choice of law principles, and the UN Convention on Contracts for International Sale of Goods does not apply to this Agreement.
56
+
57
+ DEFINITIONS
58
+ “Affiliate(s)” means any entity that directly or indirectly controls, is controlled by, or is under common control with the subject entity; for purposes of this definition, “control” means direct or indirect ownership or control of more than 50% of the voting interests of the subject entity.
59
+
60
+ "Agreement" means this Stability AI Community License Agreement.
61
+
62
+ “AUP” means the Stability AI Acceptable Use Policy available at (https://stability.ai/use-policy), as may be updated from time to time.
63
+
64
+ "Derivative Work(s)” means (a) any derivative work of the Stability AI Materials as recognized by U.S. copyright laws and (b) any modifications to a Model, and any other model created which is based on or derived from the Model or the Model’s output, including “fine tune” and “low-rank adaptation” models derived from a Model or a Model’s output, but do not include the output of any Model.
65
+
66
+ “Documentation” means any specifications, manuals, documentation, and other written information provided by Stability AI related to the Software or Models.
67
+
68
+ “Model(s)" means, collectively, Stability AI’s proprietary models and algorithms, including machine-learning models, trained model weights and other elements of the foregoing listed on Stability’s Core Models Webpage available at (https://stability.ai/core-models), as may be updated from time to time.
69
+
70
+ "Stability AI" or "we" means Stability AI Ltd. and its Affiliates.
71
+
72
+ "Software" means Stability AI’s proprietary software made available under this Agreement now or in the future.
73
+
74
+ “Stability AI Materials” means, collectively, Stability’s proprietary Models, Software and Documentation (and any portion or combination thereof) made available under this Agreement.
75
+
76
+ “Trade Control Laws” means any applicable U.S. and non-U.S. export control and trade sanctions laws and regulations.
NOTICE ADDED
@@ -0,0 +1 @@
1
+ This Stability AI Model is licensed under the Stability AI Community License, Copyright © Stability AI Ltd. All Rights Reserved
README.md ADDED
@@ -0,0 +1,159 @@
1
+ ---
2
+ datasets:
3
+ - DL3DV/DL3DV-10K-Sample
4
+ language:
5
+ - en
6
+ ---
7
+ # **Difix: Improving 3D Reconstructions with Single-Step Diffusion Models**
8
+ CVPR 2025 (Oral)
9
+ [**Code**](https://github.com/nv-tlabs/Difix3D) | [**Project Page**](https://research.nvidia.com/labs/toronto-ai/difix3d/) | [**Paper**](https://arxiv.org/abs/2503.01774)
10
+
11
+ ## Description:
12
+ Difix is a single-step image diffusion model trained to enhance rendered novel views and remove artifacts caused by
13
+ underconstrained regions of the 3D representation. The technology behind Difix is based on the concepts outlined in the paper titled
14
+ [DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models](https://arxiv.org/abs/2503.01774).
15
+
16
+ Difix has two operation modes:
17
+
18
+ * Offline mode: Used during the reconstruction phase to clean up pseudo-training views that are rendered from the reconstruction
19
+ and then distill them back into 3D. This greatly enhances underconstrained regions and improves the overall 3D representation quality.
20
+ * Online mode: Acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D
21
+ supervision and the limited capacity of current reconstruction models.
22
+
23
+ Difix is a single model compatible with both NeRF and 3DGS representations.
24
+
25
+ **This model is ready for research and development/non-commercial use only.**
26
+
27
+ **Model Developer:** NVIDIA
28
+
29
+ **Model Versions:** Difix-SD-FP32
30
+
31
+ **Deployment Geography:** Global
32
+
33
+ ### License/Terms of Use:
34
+ The use of the model and code is governed by the NVIDIA License. Additional Information: [LICENSE.md · stabilityai/sd-turbo at main](https://huggingface.co/stabilityai/sd-turbo/blob/main/LICENSE.md)
35
+
36
+
37
+ ### Use Case:
38
+ Difix is intended for Physical AI developers looking to enhance and improve their Neural Reconstruction pipelines. The model takes an image as input and outputs a fixed image.
39
+
40
+ **Release Date:** GitHub: [June 2025](https://github.com/nv-tlabs/Difix3D)
41
+
42
+ ## Model Architecture
43
+
44
+ **Architecture Type**: UNet
45
+
46
+ **Network Architecture**: A latent diffusion-based UNet coupled with a variational autoencoder (VAE).
47
+
48
+ ## Input
49
+
50
+ **Input Type(s)**: Image
51
+
52
+ **Input Format(s)**: Red, Green, Blue (RGB)
53
+
54
+ **Input Parameters**: Two-Dimensional (2D)
55
+
56
+ **Other Properties Related to Input**:
57
+ * Specific Resolution: [576px x 1024px]
58
+
59
+ ## Output
60
+
61
+ **Output Type(s)**: Image
62
+
63
+ **Output Format(s)**: Red, Green, Blue (RGB)
64
+
65
+ **Output Parameters**: Two-Dimensional (2D)
66
+
67
+ **Other Properties Related to Output**:
68
+ * Specific Resolution: [576px x 1024px]
69
+
70
+ ## Software Integration
71
+
72
+ **Runtime Engine(s)**: PyTorch
73
+
74
+ **Supported Hardware Microarchitecture Compatibility**:
75
+ * NVIDIA Ampere
76
+ * NVIDIA Hopper
77
+
78
+ **Note**: We are testing with FP32 Precision.
79
+
80
+ ## Inference
81
+ **Acceleration Engine**: [PyTorch](https://pytorch.org/)
82
+
83
+ **Test Hardware**:
84
+ * A100
85
+ * H100
86
+
87
+ **Operating System(s):** Linux (We have not tested on other operating systems.)
88
+
89
+ **System Requirements and Performance:**
90
+ This model requires X GB of GPU VRAM.
91
+ The following table shows inference time for a single generation across different NVIDIA GPU hardware:
92
+
93
+ | GPU Hardware | Inference Runtime |
94
+ |--------------|----------------------------|
95
+ | NVIDIA A100 | 0.355 sec |
96
+ | NVIDIA H100 | 0.223 sec |
97
+
98
+ ## Use the Difix Model
99
+ Please visit the [Difix3D repository](https://github.com/nv-tlabs/Difix3D) to access all relevant files and code needed to use Difix.
100
+
101
+
102
+ ## Difix Dataset
103
+ - Data Collection Method: Human
104
+ - Labeling Method by Dataset: Human
105
+ - Properties: Difix was trained, tested, and evaluated using the [DL3DV-10K dataset](https://huggingface.co/datasets/DL3DV/DL3DV-10K-Sample), where 80% of the data was used for training, 10% for evaluation, and 10% for testing. DL3DV-10K is a large-scale dataset consisting of 10,510 high-resolution (4K) real-world video sequences, totaling approximately 51.2 million frames. The scenes span 65 diverse categories across indoor and outdoor environments. Each video is accompanied by metadata describing environmental conditions such as lighting (natural, artificial, mixed), surface materials (e.g., reflective or transparent), and texture complexity. The dataset is designed to support the development and evaluation of learning-based 3D vision methods.
106
+
107
+
108
+ ## Ethical Considerations:
109
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
110
+
111
+ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
112
+
113
+ ---
114
+
115
+ ## ModelCard++
116
+
117
+ ### Bias
118
+
119
+ | Field | Response |
120
+ | :--------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------- |
121
+ | Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
122
+ | Measures taken to mitigate against unwanted bias: | None |
123
+
124
+ ### Explainability
125
+
126
+ | Field | Response |
127
+ | :-------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------- |
128
+ | Intended Domain: | Advanced Driver Assistance Systems |
129
+ | Model Type: | Image-to-Image |
130
+ | Intended Users: | Autonomous Vehicles developers enhancing and improving Neural Reconstruction pipelines. |
131
+ | Output: | Image |
132
+ | Describe how the model works: | The model takes an image as input and outputs a fixed image. |
133
+ | Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | None |
134
+ | Technical Limitations: | The reconstruction relies on the quality and consistency of input images and camera calibrations; any deficiencies in these areas can negatively impact the final output. |
135
+ | Verified to have met prescribed NVIDIA quality standards: | Yes |
136
+ | Performance Metrics: | FID (Fréchet Inception Distance), PSNR (Peak Signal-to-Noise Ratio), LPIPS (Learned Perceptual Image Patch Similarity) |
137
+ | Potential Known Risks: | The model is not guaranteed to fix 100% of the image artifacts. Please verify that the generated images are appropriate for the context and use case. |
138
+ | Licensing: | The use of the model and code is governed by the NVIDIA License. Additional Information: [LICENSE.md · stabilityai/sd-turbo at main](https://huggingface.co/stabilityai/sd-turbo/blob/main/LICENSE.md). |
139
+
140
+ ### Privacy
141
+
142
+ | Field | Response |
143
+ | :------------------------------------------------------------------ | :------------- |
144
+ | Generatable or reverse engineerable personal data? | No |
145
+ | Personal data used to create this model? | No |
146
+ | How often is the dataset reviewed? | Before release |
147
+ | Is there provenance for all datasets used in training? | Yes |
148
+ | Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
149
+ | Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |
150
+
151
+ ### Safety & Security
152
+
153
+ | Field | Response |
154
+ | :---------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
155
+ | Model Application(s): | Image Enhancement |
156
+ | List types of specific high-risk AI systems, if any, in which the model can be integrated: | The model can be used to develop Autonomous Vehicles stacks that can be integrated inside vehicles. The Difix model should not be deployed in a vehicle. |
157
+ | Describe the life critical impact (if present). | N/A - The model should not be deployed in a vehicle and will not perform life-critical tasks. |
158
+ | Use Case Restrictions: | Your use of the model and code is governed by the NVIDIA License. Additional Information: [LICENSE.md · stabilityai/sd-turbo at main](https://huggingface.co/stabilityai/sd-turbo/blob/main/LICENSE.md). |
159
+ | Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to. |
model_index.json ADDED
@@ -0,0 +1,38 @@
1
+ {
2
+ "_class_name": "DifixPipeline",
3
+ "_diffusers_version": "0.25.1",
4
+ "_name_or_path": "nvidia/Difix3D",
5
+ "feature_extractor": [
6
+ null,
7
+ null
8
+ ],
9
+ "image_encoder": [
10
+ null,
11
+ null
12
+ ],
13
+ "requires_safety_checker": true,
14
+ "safety_checker": [
15
+ null,
16
+ null
17
+ ],
18
+ "scheduler": [
19
+ "diffusers",
20
+ "DDPMScheduler"
21
+ ],
22
+ "text_encoder": [
23
+ "transformers",
24
+ "CLIPTextModel"
25
+ ],
26
+ "tokenizer": [
27
+ "transformers",
28
+ "CLIPTokenizer"
29
+ ],
30
+ "unet": [
31
+ "diffusers",
32
+ "UNet2DConditionModel"
33
+ ],
34
+ "vae": [
35
+ "autoencoder_kl",
36
+ "AutoencoderKL"
37
+ ]
38
+ }
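The index above maps each pipeline slot to a `[library, class]` pair, with `null` pairs for slots that ship no component. As a rough sketch of how such a file can be inspected (not how `DifixPipeline` itself loads it, which this diff does not show), the hypothetical helper `list_components` below separates populated slots from empty ones, using an inline copy of the JSON.

```python
import json

# Inline copy of model_index.json from the diff above.
MODEL_INDEX = """
{
  "_class_name": "DifixPipeline",
  "_diffusers_version": "0.25.1",
  "_name_or_path": "nvidia/Difix3D",
  "feature_extractor": [null, null],
  "image_encoder": [null, null],
  "requires_safety_checker": true,
  "safety_checker": [null, null],
  "scheduler": ["diffusers", "DDPMScheduler"],
  "text_encoder": ["transformers", "CLIPTextModel"],
  "tokenizer": ["transformers", "CLIPTokenizer"],
  "unet": ["diffusers", "UNet2DConditionModel"],
  "vae": ["autoencoder_kl", "AutoencoderKL"]
}
"""


def list_components(index):
    """Split pipeline slots into populated (library, class) pairs and empty ones."""
    populated, empty = {}, []
    for name, value in index.items():
        if name.startswith("_") or not isinstance(value, list):
            continue  # metadata keys and plain flags are not component slots
        library, cls = value
        if library is None or cls is None:
            empty.append(name)
        else:
            populated[name] = (library, cls)
    return populated, empty


populated, empty = list_components(json.loads(MODEL_INDEX))
for name, (library, cls) in populated.items():
    print(f"{name}: {library}.{cls}")
print("empty slots:", empty)
```

Note that `requires_safety_checker` is `true` while the `safety_checker` slot is empty; how the pipeline reconciles the two is not visible in this commit.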
scheduler/scheduler_config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "_class_name": "DDPMScheduler",
3
+ "_diffusers_version": "0.25.1",
4
+ "beta_end": 0.012,
5
+ "beta_schedule": "scaled_linear",
6
+ "beta_start": 0.00085,
7
+ "clip_sample": false,
8
+ "clip_sample_range": 1.0,
9
+ "dynamic_thresholding_ratio": 0.995,
10
+ "interpolation_type": "linear",
11
+ "num_train_timesteps": 1000,
12
+ "prediction_type": "epsilon",
13
+ "rescale_betas_zero_snr": false,
14
+ "sample_max_value": 1.0,
15
+ "set_alpha_to_one": false,
16
+ "sigma_max": null,
17
+ "sigma_min": null,
18
+ "skip_prk_steps": true,
19
+ "steps_offset": 1,
20
+ "thresholding": false,
21
+ "timestep_spacing": "trailing",
22
+ "timestep_type": "discrete",
23
+ "trained_betas": null,
24
+ "use_karras_sigmas": false,
25
+ "variance_type": "fixed_small"
26
+ }
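To make the scheduler settings concrete, the sketch below reproduces two of them in plain Python. It assumes that `"scaled_linear"` means betas spaced linearly in square-root space and then squared, and that `"trailing"` spacing counts timesteps back from the end of the training range. Both follow the usual diffusers conventions as I understand them, and the function names are illustrative rather than library API.

```python
import math


def scaled_linear_betas(beta_start, beta_end, num_train_timesteps):
    """Betas that are linear in sqrt(beta), then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    step = (hi - lo) / (num_train_timesteps - 1)
    return [(lo + i * step) ** 2 for i in range(num_train_timesteps)]


def alphas_cumprod(betas):
    """Running product of (1 - beta_t), the signal fraction kept at each step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def trailing_timesteps(num_train_timesteps, num_inference_steps):
    """Evenly spaced timesteps counted back from the last training step."""
    ratio = num_train_timesteps / num_inference_steps
    return [round(num_train_timesteps - i * ratio) - 1
            for i in range(num_inference_steps)]


# Values taken from scheduler_config.json above.
betas = scaled_linear_betas(0.00085, 0.012, 1000)
acp = alphas_cumprod(betas)
print(betas[0], betas[-1])
print(trailing_timesteps(1000, 1), trailing_timesteps(1000, 4))
```

With a single inference step, trailing spacing selects the last training timestep by default. A pipeline is free to pass its own timestep instead, so this shows only the scheduler's default and not necessarily what Difix uses.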
text_encoder/config.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "_name_or_path": "nvidia/Difix3D/text_encoder",
3
+ "architectures": [
4
+ "CLIPTextModel"
5
+ ],
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 0,
8
+ "dropout": 0.0,
9
+ "eos_token_id": 2,
10
+ "hidden_act": "gelu",
11
+ "hidden_size": 1024,
12
+ "initializer_factor": 1.0,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": 4096,
15
+ "layer_norm_eps": 1e-05,
16
+ "max_position_embeddings": 77,
17
+ "model_type": "clip_text_model",
18
+ "num_attention_heads": 16,
19
+ "num_hidden_layers": 23,
20
+ "pad_token_id": 1,
21
+ "projection_dim": 512,
22
+ "torch_dtype": "float32",
23
+ "transformers_version": "4.48.3",
24
+ "vocab_size": 49408
25
+ }
text_encoder/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:67e013543d4fac905c882e2993d86a2d454ee69dc9e8f37c0c23d33a48959d15
3
+ size 1361596304
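The LFS pointer size can be sanity-checked against `text_encoder/config.json`. The sketch below estimates the parameter count from the config values, assuming a standard CLIP text transformer layout: token and position embeddings, then per layer four attention projections with biases, a two-layer MLP, and two layer norms, plus one final layer norm and no projection head. That layout is an assumption, since the diff does not list the stored tensors.

```python
def clip_text_params(vocab_size, max_positions, hidden, intermediate, layers):
    """Estimate parameters of a CLIP-style text transformer (assumed layout)."""
    embeddings = vocab_size * hidden + max_positions * hidden
    attention = 4 * (hidden * hidden + hidden)  # q, k, v, out projections with biases
    mlp = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    layer_norms = 2 * (2 * hidden)              # two norms per layer, weight + bias
    final_norm = 2 * hidden
    return embeddings + layers * (attention + mlp + layer_norms) + final_norm


# Values from text_encoder/config.json and the LFS pointer above.
params = clip_text_params(
    vocab_size=49408, max_positions=77, hidden=1024, intermediate=4096, layers=23
)
weight_bytes = params * 4        # torch_dtype is float32: 4 bytes per parameter
pointer_size = 1361596304        # "size" field of the LFS pointer
print(params, pointer_size - weight_bytes)
```

Under these assumptions the estimate is 340,387,840 parameters. At 4 bytes each that leaves 44,944 bytes unaccounted for, which would be consistent with the safetensors header and tensor metadata.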
tokenizer/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|startoftext|>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|endoftext|>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "!",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "unk_token": {
24
+ "content": "<|endoftext|>",
25
+ "lstrip": false,
26
+ "normalized": true,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ }
30
+ }
tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "0": {
5
+ "content": "!",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "49406": {
13
+ "content": "<|startoftext|>",
14
+ "lstrip": false,
15
+ "normalized": true,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "49407": {
21
+ "content": "<|endoftext|>",
22
+ "lstrip": false,
23
+ "normalized": true,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ }
28
+ },
29
+ "bos_token": "<|startoftext|>",
30
+ "clean_up_tokenization_spaces": true,
31
+ "do_lower_case": true,
32
+ "eos_token": "<|endoftext|>",
33
+ "errors": "replace",
34
+ "extra_special_tokens": {},
35
+ "model_max_length": 77,
36
+ "pad_token": "!",
37
+ "tokenizer_class": "CLIPTokenizer",
38
+ "unk_token": "<|endoftext|>"
39
+ }
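The tokenizer settings above imply a fixed-length input: sequences start with `<|startoftext|>` (id 49406), end with `<|endoftext|>` (id 49407), and are padded with `!` (id 0) up to `model_max_length` of 77. The sketch below shows only that framing step on made-up token ids. It does not run BPE, the helper `wrap_and_pad` is hypothetical, and keeping the end token when truncating is an assumption about the real tokenizer's behavior.

```python
BOS_ID, EOS_ID, PAD_ID = 49406, 49407, 0  # ids from added_tokens_decoder
MAX_LENGTH = 77                           # model_max_length


def wrap_and_pad(token_ids, max_length=MAX_LENGTH):
    """Add start/end tokens, truncate the body if needed, and pad to max_length."""
    body = list(token_ids)[: max_length - 2]  # leave room for BOS and EOS
    ids = [BOS_ID] + body + [EOS_ID]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding


# Three made-up token ids standing in for a short prompt.
ids, mask = wrap_and_pad([320, 1125, 539])
print(ids[:6], len(ids), sum(mask))
```

The attention mask marks the five real positions (start token, three prompt tokens, end token) and leaves the 72 padded positions at zero.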
tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
unet/config.json ADDED
@@ -0,0 +1,73 @@
+ {
+   "_class_name": "UNet2DConditionModel",
+   "_diffusers_version": "0.25.1",
+   "_name_or_path": "nvidia/Difix3D/unet",
+   "act_fn": "silu",
+   "addition_embed_type": null,
+   "addition_embed_type_num_heads": 64,
+   "addition_time_embed_dim": null,
+   "attention_head_dim": [
+     5,
+     10,
+     20,
+     20
+   ],
+   "attention_type": "default",
+   "block_out_channels": [
+     320,
+     640,
+     1280,
+     1280
+   ],
+   "center_input_sample": false,
+   "class_embed_type": null,
+   "class_embeddings_concat": false,
+   "conv_in_kernel": 3,
+   "conv_out_kernel": 3,
+   "cross_attention_dim": 1024,
+   "cross_attention_norm": null,
+   "down_block_types": [
+     "CrossAttnDownBlock2D",
+     "CrossAttnDownBlock2D",
+     "CrossAttnDownBlock2D",
+     "DownBlock2D"
+   ],
+   "downsample_padding": 1,
+   "dropout": 0.0,
+   "dual_cross_attention": false,
+   "encoder_hid_dim": null,
+   "encoder_hid_dim_type": null,
+   "flip_sin_to_cos": true,
+   "freq_shift": 0,
+   "in_channels": 4,
+   "layers_per_block": 2,
+   "mid_block_only_cross_attention": null,
+   "mid_block_scale_factor": 1,
+   "mid_block_type": "UNetMidBlock2DCrossAttn",
+   "norm_eps": 1e-05,
+   "norm_num_groups": 32,
+   "num_attention_heads": null,
+   "num_class_embeds": null,
+   "only_cross_attention": false,
+   "out_channels": 4,
+   "projection_class_embeddings_input_dim": null,
+   "resnet_out_scale_factor": 1.0,
+   "resnet_skip_time_act": false,
+   "resnet_time_scale_shift": "default",
+   "reverse_transformer_layers_per_block": null,
+   "sample_size": 64,
+   "time_cond_proj_dim": null,
+   "time_embedding_act_fn": null,
+   "time_embedding_dim": null,
+   "time_embedding_type": "positional",
+   "timestep_post_act": null,
+   "transformer_layers_per_block": 1,
+   "up_block_types": [
+     "UpBlock2D",
+     "CrossAttnUpBlock2D",
+     "CrossAttnUpBlock2D",
+     "CrossAttnUpBlock2D"
+   ],
+   "upcast_attention": null,
+   "use_linear_projection": true
+ }
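Note on the config above: `num_attention_heads` is null, so, as far as I understand diffusers' `UNet2DConditionModel`, the `attention_head_dim` list is interpreted as the number of heads per block (a known naming quirk in that library). Under that assumption, the per-head width follows from the channel counts. The sketch below is plain arithmetic on the config values, not library code.

```python
# Values copied from unet/config.json.
block_out_channels = [320, 640, 1280, 1280]
attention_head_dim = [5, 10, 20, 20]   # assumed to act as head counts here
sample_size = 64                       # latent height/width the UNet expects

# Per-head width for each resolution level: channels divided by head count.
head_widths = [c // h for c, h in zip(block_out_channels, attention_head_dim)]

# The companion VAE has four blocks, i.e. three 2x downsampling steps,
# so one latent pixel covers 2**3 = 8 image pixels per side (assumption
# based on vae/config.json in this same commit).
vae_downsample = 2 ** (4 - 1)
nominal_image_size = sample_size * vae_downsample
```

With these numbers each attention head is 64 channels wide at every level, and the nominal image resolution is 512 pixels per side.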
unet/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3815819b0009d16b5f7538ecbf2dd0ac4a6b07a238ab82d869465c347864bb70
+ size 3463726504
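Note on the pointer above: the repository stores only this three-line Git LFS pointer; the weights themselves live in LFS storage. A minimal sketch of reading the `key value` lines of such a pointer follows; `parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tooling.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:3815819b0009d16b5f7538ecbf2dd0ac4a6b07a238ab82d869465c347864bb70
size 3463726504
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),   # size of the real file in bytes
    }


pointer = parse_lfs_pointer(POINTER)
size_gib = pointer["size"] / 2**30     # roughly 3.2 GiB for the UNet weights
```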
vae/autoencoder_kl.py ADDED
@@ -0,0 +1,559 @@
+ # Copyright 2023 The HuggingFace Team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ from typing import Dict, Optional, Tuple, Union
+
+ import torch
+ import torch.nn as nn
+ from peft import LoraConfig
+
+ from diffusers.configuration_utils import ConfigMixin, register_to_config
+ from diffusers.loaders import FromOriginalVAEMixin
+ from diffusers.utils.accelerate_utils import apply_forward_hook
+ from diffusers.models.attention_processor import (
+     ADDED_KV_ATTENTION_PROCESSORS,
+     CROSS_ATTENTION_PROCESSORS,
+     Attention,
+     AttentionProcessor,
+     AttnAddedKVProcessor,
+     AttnProcessor,
+ )
+ from diffusers.models.modeling_outputs import AutoencoderKLOutput
+ from diffusers.models.modeling_utils import ModelMixin
+ from diffusers.models.autoencoders.vae import Decoder, DecoderOutput, DiagonalGaussianDistribution, Encoder
+
+
+ def my_vae_encoder_fwd(self, sample):
+     sample = self.conv_in(sample)
+     l_blocks = []
+     # down
+     for down_block in self.down_blocks:
+         l_blocks.append(sample)
+         sample = down_block(sample)
+     # middle
+     sample = self.mid_block(sample)
+     sample = self.conv_norm_out(sample)
+     sample = self.conv_act(sample)
+     sample = self.conv_out(sample)
+     self.current_down_blocks = l_blocks
+     return sample
+
+
+ def my_vae_decoder_fwd(self, sample, latent_embeds=None):
+     sample = self.conv_in(sample)
+     upscale_dtype = next(iter(self.up_blocks.parameters())).dtype
+     # middle
+     sample = self.mid_block(sample, latent_embeds)
+     sample = sample.to(upscale_dtype)
+     if not self.ignore_skip:
+         skip_convs = [self.skip_conv_1, self.skip_conv_2, self.skip_conv_3, self.skip_conv_4]
+         # up
+         for idx, up_block in enumerate(self.up_blocks):
+             skip_in = skip_convs[idx](self.incoming_skip_acts[::-1][idx] * self.gamma)
+             # add skip
+             sample = sample + skip_in
+             sample = up_block(sample, latent_embeds)
+     else:
+         for idx, up_block in enumerate(self.up_blocks):
+             sample = up_block(sample, latent_embeds)
+     # post-process
+     if latent_embeds is None:
+         sample = self.conv_norm_out(sample)
+     else:
+         sample = self.conv_norm_out(sample, latent_embeds)
+     sample = self.conv_act(sample)
+     sample = self.conv_out(sample)
+     return sample
+
+
+ class AutoencoderKL(ModelMixin, ConfigMixin, FromOriginalVAEMixin):
+     r"""
+     A VAE model with KL loss for encoding images into latents and decoding latent representations into images.
+
+     This model inherits from [`ModelMixin`]. Check the superclass documentation for its generic methods implemented
+     for all models (such as downloading or saving).
+
+     Parameters:
+         in_channels (int, *optional*, defaults to 3): Number of channels in the input image.
+         out_channels (int, *optional*, defaults to 3): Number of channels in the output.
+         down_block_types (`Tuple[str]`, *optional*, defaults to `("DownEncoderBlock2D",)`):
+             Tuple of downsample block types.
+         up_block_types (`Tuple[str]`, *optional*, defaults to `("UpDecoderBlock2D",)`):
+             Tuple of upsample block types.
+         block_out_channels (`Tuple[int]`, *optional*, defaults to `(64,)`):
+             Tuple of block output channels.
+         act_fn (`str`, *optional*, defaults to `"silu"`): The activation function to use.
+         latent_channels (`int`, *optional*, defaults to 4): Number of channels in the latent space.
+         sample_size (`int`, *optional*, defaults to `32`): Sample input size.
+         scaling_factor (`float`, *optional*, defaults to 0.18215):
+             The component-wise standard deviation of the trained latent space computed using the first batch of the
+             training set. This is used to scale the latent space to have unit variance when training the diffusion
+             model. The latents are scaled with the formula `z = z * scaling_factor` before being passed to the
+             diffusion model. When decoding, the latents are scaled back to the original scale with the formula: `z = 1
+             / scaling_factor * z`. For more details, refer to sections 4.3.2 and D.1 of the [High-Resolution Image
+             Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) paper.
+         force_upcast (`bool`, *optional*, defaults to `True`):
+             If enabled, it will force the VAE to run in float32 for high image resolution pipelines, such as SD-XL. The
+             VAE can be fine-tuned / trained to a lower range without losing too much precision, in which case
+             `force_upcast` can be set to `False` - see: https://huggingface.co/madebyollin/sdxl-vae-fp16-fix
+     """
+
+     _supports_gradient_checkpointing = True
+
+     @register_to_config
+     def __init__(
+         self,
+         in_channels: int = 3,
+         out_channels: int = 3,
+         down_block_types: Tuple[str] = ("DownEncoderBlock2D",),
+         up_block_types: Tuple[str] = ("UpDecoderBlock2D",),
+         block_out_channels: Tuple[int] = (64,),
+         layers_per_block: int = 1,
+         act_fn: str = "silu",
+         latent_channels: int = 4,
+         norm_num_groups: int = 32,
+         sample_size: int = 32,
+         scaling_factor: float = 0.18215,
+         force_upcast: bool = True,
+         lora_rank: int = 4,
+         gamma: float = 1.0,
+         ignore_skip: bool = False,
+     ):
+         super().__init__()
+
+         # pass init params to Encoder
+         self.encoder = Encoder(
+             in_channels=in_channels,
+             out_channels=latent_channels,
+             down_block_types=down_block_types,
+             block_out_channels=block_out_channels,
+             layers_per_block=layers_per_block,
+             act_fn=act_fn,
+             norm_num_groups=norm_num_groups,
+             double_z=True,
+         )
+
+         # pass init params to Decoder
+         self.decoder = Decoder(
+             in_channels=latent_channels,
+             out_channels=out_channels,
+             up_block_types=up_block_types,
+             block_out_channels=block_out_channels,
+             layers_per_block=layers_per_block,
+             norm_num_groups=norm_num_groups,
+             act_fn=act_fn,
+         )
+
+         self.quant_conv = nn.Conv2d(2 * latent_channels, 2 * latent_channels, 1)
+         self.post_quant_conv = nn.Conv2d(latent_channels, latent_channels, 1)
+
+         self.use_slicing = False
+         self.use_tiling = False
+
+         # only relevant if vae tiling is enabled
+         self.tile_sample_min_size = self.config.sample_size
+         sample_size = (
+             self.config.sample_size[0]
+             if isinstance(self.config.sample_size, (list, tuple))
+             else self.config.sample_size
+         )
+         self.tile_latent_min_size = int(sample_size / (2 ** (len(self.config.block_out_channels) - 1)))
+         self.tile_overlap_factor = 0.25
+
+         self.encoder.forward = my_vae_encoder_fwd.__get__(self.encoder, self.encoder.__class__)
+         self.decoder.forward = my_vae_decoder_fwd.__get__(self.decoder, self.decoder.__class__)
+         # add the skip connection convs
+         self.decoder.skip_conv_1 = torch.nn.Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
+         self.decoder.skip_conv_2 = torch.nn.Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
+         self.decoder.skip_conv_3 = torch.nn.Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
+         self.decoder.skip_conv_4 = torch.nn.Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
+         self.decoder.ignore_skip = ignore_skip
+         self.decoder.gamma = gamma
+
+         target_modules_vae = ["conv1", "conv2", "conv_in", "conv_shortcut", "conv", "conv_out",
+             "skip_conv_1", "skip_conv_2", "skip_conv_3", "skip_conv_4",
+             "to_k", "to_q", "to_v", "to_out.0",
+         ]
+         target_modules = []
+         for id, (name, param) in enumerate(self.named_modules()):
+             if 'decoder' in name and any(name.endswith(x) for x in target_modules_vae):
+                 target_modules.append(name)
+         target_modules_vae = target_modules
+
+         vae_lora_config = LoraConfig(r=lora_rank, init_lora_weights="gaussian", target_modules=target_modules_vae)
+         self.add_adapter(vae_lora_config, adapter_name="vae_skip")
+
+     def _set_gradient_checkpointing(self, module, value=False):
+         if isinstance(module, (Encoder, Decoder)):
+             module.gradient_checkpointing = value
+
+     def enable_tiling(self, use_tiling: bool = True):
+         r"""
+         Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
+         compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
+         processing larger images.
+         """
+         self.use_tiling = use_tiling
+
+     def disable_tiling(self):
+         r"""
+         Disable tiled VAE decoding. If `enable_tiling` was previously enabled, this method will go back to computing
+         decoding in one step.
+         """
+         self.enable_tiling(False)
+
+     def enable_slicing(self):
+         r"""
+         Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
+         compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
+         """
+         self.use_slicing = True
+
+     def disable_slicing(self):
+         r"""
+         Disable sliced VAE decoding. If `enable_slicing` was previously enabled, this method will go back to computing
+         decoding in one step.
+         """
+         self.use_slicing = False
+
+     @property
+     # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.attn_processors
+     def attn_processors(self) -> Dict[str, AttentionProcessor]:
+         r"""
+         Returns:
+             `dict` of attention processors: A dictionary containing all attention processors used in the model,
+             indexed by weight name.
+         """
+         # set recursively
+         processors = {}
+
+         def fn_recursive_add_processors(name: str, module: torch.nn.Module, processors: Dict[str, AttentionProcessor]):
+             if hasattr(module, "get_processor"):
+                 processors[f"{name}.processor"] = module.get_processor(return_deprecated_lora=True)
+
+             for sub_name, child in module.named_children():
+                 fn_recursive_add_processors(f"{name}.{sub_name}", child, processors)
+
+             return processors
+
+         for name, module in self.named_children():
+             fn_recursive_add_processors(name, module, processors)
+
+         return processors
+
+     # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.set_attn_processor
+     def set_attn_processor(
+         self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]], _remove_lora=False
+     ):
+         r"""
+         Sets the attention processor to use to compute attention.
+
+         Parameters:
+             processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
+                 The instantiated processor class or a dictionary of processor classes that will be set as the processor
+                 for **all** `Attention` layers.
+
+                 If `processor` is a dict, the key needs to define the path to the corresponding cross attention
+                 processor. This is strongly recommended when setting trainable attention processors.
+
+         """
+         count = len(self.attn_processors.keys())
+
+         if isinstance(processor, dict) and len(processor) != count:
+             raise ValueError(
+                 f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
+                 f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
+             )
+
+         def fn_recursive_attn_processor(name: str, module: torch.nn.Module, processor):
+             if hasattr(module, "set_processor"):
+                 if not isinstance(processor, dict):
+                     module.set_processor(processor, _remove_lora=_remove_lora)
+                 else:
+                     module.set_processor(processor.pop(f"{name}.processor"), _remove_lora=_remove_lora)
+
+             for sub_name, child in module.named_children():
+                 fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
+
+         for name, module in self.named_children():
+             fn_recursive_attn_processor(name, module, processor)
+
+     # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.set_default_attn_processor
+     def set_default_attn_processor(self):
+         """
+         Disables custom attention processors and sets the default attention implementation.
+         """
+         if all(proc.__class__ in ADDED_KV_ATTENTION_PROCESSORS for proc in self.attn_processors.values()):
+             processor = AttnAddedKVProcessor()
+         elif all(proc.__class__ in CROSS_ATTENTION_PROCESSORS for proc in self.attn_processors.values()):
+             processor = AttnProcessor()
+         else:
+             raise ValueError(
+                 f"Cannot call `set_default_attn_processor` when attention processors are of type {next(iter(self.attn_processors.values()))}"
+             )
+
+         self.set_attn_processor(processor, _remove_lora=True)
+
+     @apply_forward_hook
+     def encode(
+         self, x: torch.FloatTensor, return_dict: bool = True
+     ) -> Union[AutoencoderKLOutput, Tuple[DiagonalGaussianDistribution]]:
+         """
+         Encode a batch of images into latents.
+
+         Args:
+             x (`torch.FloatTensor`): Input batch of images.
+             return_dict (`bool`, *optional*, defaults to `True`):
+                 Whether to return a [`~models.autoencoder_kl.AutoencoderKLOutput`] instead of a plain tuple.
+
+         Returns:
+                 The latent representations of the encoded images. If `return_dict` is True, a
+                 [`~models.autoencoder_kl.AutoencoderKLOutput`] is returned, otherwise a plain `tuple` is returned.
+         """
+         if self.use_tiling and (x.shape[-1] > self.tile_sample_min_size or x.shape[-2] > self.tile_sample_min_size):
+             return self.tiled_encode(x, return_dict=return_dict)
+
+         if self.use_slicing and x.shape[0] > 1:
+             encoded_slices = [self.encoder(x_slice) for x_slice in x.split(1)]
+             h = torch.cat(encoded_slices)
+         else:
+             h = self.encoder(x)
+
+         moments = self.quant_conv(h)
+         posterior = DiagonalGaussianDistribution(moments)
+
+         if not return_dict:
+             return (posterior,)
+
+         return AutoencoderKLOutput(latent_dist=posterior)
+
+     def _decode(self, z: torch.FloatTensor, return_dict: bool = True) -> Union[DecoderOutput, torch.FloatTensor]:
+         if self.use_tiling and (z.shape[-1] > self.tile_latent_min_size or z.shape[-2] > self.tile_latent_min_size):
+             return self.tiled_decode(z, return_dict=return_dict)
+
+         z = self.post_quant_conv(z)
+         dec = self.decoder(z)
+
+         if not return_dict:
+             return (dec,)
+
+         return DecoderOutput(sample=dec)
+
+     @apply_forward_hook
+     def decode(
+         self, z: torch.FloatTensor, return_dict: bool = True, generator=None
+     ) -> Union[DecoderOutput, torch.FloatTensor]:
+         """
+         Decode a batch of images.
+
+         Args:
+             z (`torch.FloatTensor`): Input batch of latent vectors.
+             return_dict (`bool`, *optional*, defaults to `True`):
+                 Whether to return a [`~models.vae.DecoderOutput`] instead of a plain tuple.
+
+         Returns:
+             [`~models.vae.DecoderOutput`] or `tuple`:
+                 If return_dict is True, a [`~models.vae.DecoderOutput`] is returned, otherwise a plain `tuple` is
+                 returned.
+
+         """
+         if self.use_slicing and z.shape[0] > 1:
+             decoded_slices = [self._decode(z_slice).sample for z_slice in z.split(1)]
+             decoded = torch.cat(decoded_slices)
+         else:
+             decoded = self._decode(z).sample
+
+         if not return_dict:
+             return (decoded,)
+
+         return DecoderOutput(sample=decoded)
+
+     def blend_v(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
+         blend_extent = min(a.shape[2], b.shape[2], blend_extent)
+         for y in range(blend_extent):
+             b[:, :, y, :] = a[:, :, -blend_extent + y, :] * (1 - y / blend_extent) + b[:, :, y, :] * (y / blend_extent)
+         return b
+
+     def blend_h(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
+         blend_extent = min(a.shape[3], b.shape[3], blend_extent)
+         for x in range(blend_extent):
+             b[:, :, :, x] = a[:, :, :, -blend_extent + x] * (1 - x / blend_extent) + b[:, :, :, x] * (x / blend_extent)
+         return b
+
+     def tiled_encode(self, x: torch.FloatTensor, return_dict: bool = True) -> AutoencoderKLOutput:
+         r"""Encode a batch of images using a tiled encoder.
+
+         When this option is enabled, the VAE will split the input tensor into tiles to compute encoding in several
+         steps. This is useful to keep memory use constant regardless of image size. The end result of tiled encoding is
+         different from non-tiled encoding because each tile is encoded independently. To avoid tiling artifacts, the
+         tiles overlap and are blended together to form a smooth output. You may still see tile-sized changes in the
+         output, but they should be much less noticeable.
+
+         Args:
+             x (`torch.FloatTensor`): Input batch of images.
+             return_dict (`bool`, *optional*, defaults to `True`):
+                 Whether or not to return a [`~models.autoencoder_kl.AutoencoderKLOutput`] instead of a plain tuple.
+
+         Returns:
+             [`~models.autoencoder_kl.AutoencoderKLOutput`] or `tuple`:
+                 If return_dict is True, a [`~models.autoencoder_kl.AutoencoderKLOutput`] is returned, otherwise a plain
+                 `tuple` is returned.
+         """
+         overlap_size = int(self.tile_sample_min_size * (1 - self.tile_overlap_factor))
+         blend_extent = int(self.tile_latent_min_size * self.tile_overlap_factor)
+         row_limit = self.tile_latent_min_size - blend_extent
+
+         # Split the image into overlapping tiles of side `tile_sample_min_size` and encode them separately.
+         rows = []
+         for i in range(0, x.shape[2], overlap_size):
+             row = []
+             for j in range(0, x.shape[3], overlap_size):
+                 tile = x[:, :, i : i + self.tile_sample_min_size, j : j + self.tile_sample_min_size]
+                 tile = self.encoder(tile)
+                 tile = self.quant_conv(tile)
+                 row.append(tile)
+             rows.append(row)
+         result_rows = []
+         for i, row in enumerate(rows):
+             result_row = []
+             for j, tile in enumerate(row):
+                 # blend the above tile and the left tile
+                 # to the current tile and add the current tile to the result row
+                 if i > 0:
+                     tile = self.blend_v(rows[i - 1][j], tile, blend_extent)
+                 if j > 0:
+                     tile = self.blend_h(row[j - 1], tile, blend_extent)
+                 result_row.append(tile[:, :, :row_limit, :row_limit])
+             result_rows.append(torch.cat(result_row, dim=3))
+
+         moments = torch.cat(result_rows, dim=2)
+         posterior = DiagonalGaussianDistribution(moments)
+
+         if not return_dict:
+             return (posterior,)
+
+         return AutoencoderKLOutput(latent_dist=posterior)
+
+     def tiled_decode(self, z: torch.FloatTensor, return_dict: bool = True) -> Union[DecoderOutput, torch.FloatTensor]:
+         r"""
+         Decode a batch of images using a tiled decoder.
+
+         Args:
+             z (`torch.FloatTensor`): Input batch of latent vectors.
+             return_dict (`bool`, *optional*, defaults to `True`):
+                 Whether or not to return a [`~models.vae.DecoderOutput`] instead of a plain tuple.
+
+         Returns:
+             [`~models.vae.DecoderOutput`] or `tuple`:
+                 If return_dict is True, a [`~models.vae.DecoderOutput`] is returned, otherwise a plain `tuple` is
+                 returned.
+         """
+         overlap_size = int(self.tile_latent_min_size * (1 - self.tile_overlap_factor))
+         blend_extent = int(self.tile_sample_min_size * self.tile_overlap_factor)
+         row_limit = self.tile_sample_min_size - blend_extent
+
+         # Split z into overlapping tiles of side `tile_latent_min_size` and decode them separately.
+         # The tiles have an overlap to avoid seams between tiles.
+         rows = []
+         for i in range(0, z.shape[2], overlap_size):
+             row = []
+             for j in range(0, z.shape[3], overlap_size):
+                 tile = z[:, :, i : i + self.tile_latent_min_size, j : j + self.tile_latent_min_size]
+                 tile = self.post_quant_conv(tile)
+                 decoded = self.decoder(tile)
+                 row.append(decoded)
+             rows.append(row)
+         result_rows = []
+         for i, row in enumerate(rows):
+             result_row = []
+             for j, tile in enumerate(row):
+                 # blend the above tile and the left tile
+                 # to the current tile and add the current tile to the result row
+                 if i > 0:
+                     tile = self.blend_v(rows[i - 1][j], tile, blend_extent)
+                 if j > 0:
+                     tile = self.blend_h(row[j - 1], tile, blend_extent)
+                 result_row.append(tile[:, :, :row_limit, :row_limit])
+             result_rows.append(torch.cat(result_row, dim=3))
+
+         dec = torch.cat(result_rows, dim=2)
+         if not return_dict:
+             return (dec,)
+
+         return DecoderOutput(sample=dec)
+
+     def forward(
+         self,
+         sample: torch.FloatTensor,
+         sample_posterior: bool = False,
+         return_dict: bool = True,
+         generator: Optional[torch.Generator] = None,
+     ) -> Union[DecoderOutput, torch.FloatTensor]:
+         r"""
+         Args:
+             sample (`torch.FloatTensor`): Input sample.
+             sample_posterior (`bool`, *optional*, defaults to `False`):
+                 Whether to sample from the posterior.
+             return_dict (`bool`, *optional*, defaults to `True`):
+                 Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
+         """
+         x = sample
+         posterior = self.encode(x).latent_dist
+         if sample_posterior:
+             z = posterior.sample(generator=generator)
+         else:
+             z = posterior.mode()
+         dec = self.decode(z).sample
+
+         if not return_dict:
+             return (dec,)
+
+         return DecoderOutput(sample=dec)
+
+     # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.fuse_qkv_projections
+     def fuse_qkv_projections(self):
+         """
+         Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query,
+         key, value) are fused. For cross-attention modules, key and value projection matrices are fused.
+
+         <Tip warning={true}>
+
+         This API is 🧪 experimental.
+
+         </Tip>
+         """
+         self.original_attn_processors = None
+
+         for _, attn_processor in self.attn_processors.items():
+             if "Added" in str(attn_processor.__class__.__name__):
+                 raise ValueError("`fuse_qkv_projections()` is not supported for models having added KV projections.")
+
+         self.original_attn_processors = self.attn_processors
+
+         for module in self.modules():
+             if isinstance(module, Attention):
+                 module.fuse_projections(fuse=True)
+
+     # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.unfuse_qkv_projections
+     def unfuse_qkv_projections(self):
+         """Disables the fused QKV projection if enabled.
+
+         <Tip warning={true}>
+
+         This API is 🧪 experimental.
+
+         </Tip>
+
+         """
+         if self.original_attn_processors is not None:
+             self.set_attn_processor(self.original_attn_processors)
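Note on `vae/autoencoder_kl.py`: the four hard-coded `skip_conv_*` shapes only line up with one channel layout. The sketch below checks that bookkeeping in plain Python, assuming `block_out_channels = (128, 256, 512, 512)` from `vae/config.json` and assuming the usual diffusers `Encoder`/`Decoder` channel progression (each block takes the previous block's output width, and the decoder walks the list in reverse). It is an illustration of the arithmetic, not the model code.

```python
block_out_channels = (128, 256, 512, 512)   # from vae/config.json

# Encoder side: conv_in produces block_out_channels[0] channels, and
# my_vae_encoder_fwd records the activation *before* each down block runs.
recorded = []
channels = block_out_channels[0]
for out_ch in block_out_channels:
    recorded.append(channels)   # width of the tensor appended to l_blocks
    channels = out_ch           # width after this down block

# Decoder side: up blocks use the reversed channel list; the input width of
# block i is the output width of block i - 1 (or of the mid block for i = 0).
reversed_out = block_out_channels[::-1]
up_inputs = []
channels = reversed_out[0]
for out_ch in reversed_out:
    up_inputs.append(channels)  # width of `sample` when the skip is added
    channels = out_ch

# The decoder consumes the recorded activations in reverse order, so each
# 1x1 skip conv must map recorded[::-1][i] channels to up_inputs[i] channels.
skip_conv_shapes = list(zip(recorded[::-1], up_inputs))
```

Under these assumptions `skip_conv_shapes` reproduces the `(in, out)` pairs of `skip_conv_1` through `skip_conv_4`, which suggests the class is only usable as written with this particular `block_out_channels` setting.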
vae/config.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "_class_name": "AutoencoderKL",
+   "_diffusers_version": "0.25.1",
+   "_name_or_path": "nvidia/Difix3D/vae",
+   "act_fn": "silu",
+   "block_out_channels": [
+     128,
+     256,
+     512,
+     512
+   ],
+   "down_block_types": [
+     "DownEncoderBlock2D",
+     "DownEncoderBlock2D",
+     "DownEncoderBlock2D",
+     "DownEncoderBlock2D"
+   ],
+   "force_upcast": true,
+   "gamma": 1.0,
+   "ignore_skip": false,
+   "in_channels": 3,
+   "latent_channels": 4,
+   "layers_per_block": 2,
+   "lora_rank": 4,
+   "norm_num_groups": 32,
+   "out_channels": 3,
+   "sample_size": 768,
+   "scaling_factor": 0.18215,
+   "up_block_types": [
+     "UpDecoderBlock2D",
+     "UpDecoderBlock2D",
+     "UpDecoderBlock2D",
+     "UpDecoderBlock2D"
+   ]
+ }
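Note on the config above: with `sample_size` = 768 and four blocks, the tiling attributes computed in `AutoencoderKL.__init__` and used by `tiled_encode` / `tiled_decode` take concrete values. The sketch below repeats that arithmetic outside the model; `tiling_parameters` is a hypothetical helper whose formulas mirror the expressions in `vae/autoencoder_kl.py`.

```python
def tiling_parameters(sample_size: int, num_blocks: int, overlap_factor: float = 0.25) -> dict:
    """Reproduce the tile sizes, strides and blend widths used for tiled VAE passes."""
    tile_sample = sample_size
    # Each block except the last halves the resolution.
    tile_latent = int(sample_size / (2 ** (num_blocks - 1)))

    enc_blend = int(tile_latent * overlap_factor)
    dec_blend = int(tile_sample * overlap_factor)
    return {
        "tile_sample_min_size": tile_sample,
        "tile_latent_min_size": tile_latent,
        "encode": {
            "stride": int(tile_sample * (1 - overlap_factor)),  # pixels between tile origins
            "blend_extent": enc_blend,                          # latent rows/cols cross-faded
            "row_limit": tile_latent - enc_blend,               # latent rows/cols kept per tile
        },
        "decode": {
            "stride": int(tile_latent * (1 - overlap_factor)),  # latent cells between tile origins
            "blend_extent": dec_blend,                          # pixels cross-faded
            "row_limit": tile_sample - dec_blend,               # pixels kept per tile
        },
    }


params = tiling_parameters(sample_size=768, num_blocks=4)
```

For this config the encoder stride of 576 pixels corresponds to 72 latent cells, which equals the encoder `row_limit`, so the cropped tiles butt up against each other without gaps.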
vae/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:20a5e872469d801876e448ec1d499b1e99cc666497a6aa133ed22c9e0a7a1a25
+ size 338717612