Update README.md

README.md CHANGED
@@ -1,214 +1,12 @@
[Paper](https://arxiv.org/abs/2301.12503) [Project page](https://audioldm.github.io/) [Hugging Face Space](https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation) [Colab](https://colab.research.google.com/github/olaviinha/NeuralTextToAudio/blob/main/AudioLDM_pub.ipynb?force_theme=dark) [Replicate](https://replicate.com/jagilley/audio-ldm)

**Generate speech, sound effects, music and beyond.**

This repo currently supports:

- **Text-to-Audio Generation**: Generate audio given text input.
- **Audio-to-Audio Generation**: Given an audio clip, generate another audio clip containing the same type of sound.
- **Text-guided Audio-to-Audio Style Transfer**: Transfer the sound of an audio clip into another one guided by a text description.

<hr>

## Important tricks to make your generated audio sound better

1. Try to provide more hints to AudioLDM, such as using more adjectives to describe your sound (e.g., clearly, high quality) or making your target more specific (e.g., "water stream in a forest" instead of "stream"). This helps make sure AudioLDM understands what you want.
2. Try different random seeds, which can sometimes affect the generation quality significantly (see the example after this list).
3. It's best to use general terms like 'man' or 'woman' instead of specific names of individuals or abstract objects that humans may not be familiar with.
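
For example, tips 1 and 2 can be combined on the command line. The flags below are taken from `audioldm -h`; the prompt and the seed value are only illustrative:

```shell
# A descriptive, specific prompt plus an explicit seed.
# Re-running with a few different --seed values and keeping the best result often helps.
audioldm -t "High quality recording of a water stream flowing in a forest" --seed 1234
```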

# Change Log

**2023-04-10**: Finetuned AudioLDM with the MusicCaps and AudioCaps datasets. Added three more checkpoints: audioldm-m-text-ft, audioldm-s-text-ft, and audioldm-m-full.

**2023-03-04**: Added two more checkpoints: a small model with more training steps and a large model. Added model selection to the Gradio app.

**2023-02-24**: Added audio-to-audio generation, test cases, and a pipeline (Python function) for audio super-resolution and inpainting.

**2023-02-15**: Added audio style transfer and more generation options.

## Web APP

The web app currently only supports Text-to-Audio generation. For full functionality, please refer to the [Commandline Usage](https://github.com/haoheliu/AudioLDM#commandline-usage).

1. Prepare the running environment

```shell
conda create -n audioldm python=3.8; conda activate audioldm
pip3 install git+https://github.com/haoheliu/AudioLDM.git
git clone https://github.com/haoheliu/AudioLDM; cd AudioLDM
```

2. Start the web application (powered by Gradio)

```shell
python3 app.py
```

3. A link will be printed out. Click the link to open it in your browser and play with the app.

## Commandline Usage

Prepare the running environment:

```shell
# Optional
conda create -n audioldm python=3.8; conda activate audioldm
# Install AudioLDM
pip3 install git+https://github.com/haoheliu/AudioLDM.git
```

:star2: **Text-to-Audio Generation**: generate audio guided by a text prompt

```shell
# The default --mode is "generation"
audioldm -t "A hammer is hitting a wooden surface"
# Result will be saved in "./output/generation"
```

:star2: **Audio-to-Audio Generation**: generate audio guided by an audio file (the output will contain similar audio events to the input audio file)

```shell
audioldm --file_path trumpet.wav
# Result will be saved in "./output/generation_audio_to_audio/trumpet"
```

:star2: **Text-guided Audio-to-Audio Style Transfer**

```shell
# Test run
# --file_path is the original audio file for transfer
# -t is the text AudioLDM uses for the transfer
# Please make sure that --file_path exists
audioldm --mode "transfer" --file_path trumpet.wav -t "Children Singing"
# Result will be saved in "./output/transfer/trumpet"

# Tuning the value of --transfer_strength is important!
# --transfer_strength: a value between 0 and 1. 0 means the original audio without transfer, 1 means completely transferring to the audio indicated by the text
audioldm --mode "transfer" --file_path trumpet.wav -t "Children Singing" --transfer_strength 0.25
```

:gear: How to choose between the different model checkpoints?

```shell
# Add the --model_name parameter, choice={audioldm-m-text-ft, audioldm-s-text-ft, audioldm-m-full, audioldm-s-full, audioldm-l-full, audioldm-s-full-v2}
audioldm --model_name audioldm-s-full
```

- :star: audioldm-m-full (default, **recommended**): the medium AudioLDM without finetuning, trained with audio embeddings as the condition *(added 2023-04-10)*.
- :star: audioldm-s-full (**recommended**): the original open-sourced version *(added 2023-02-01)*.
- :star: audioldm-s-full-v2 (**recommended**): more training steps compared with audioldm-s-full *(added 2023-03-04)*.
- audioldm-s-text-ft: the small AudioLDM finetuned with AudioCaps and MusicCaps audio-text pairs *(added 2023-04-10)*.
- audioldm-m-text-ft: the medium AudioLDM finetuned with AudioCaps and MusicCaps audio-text pairs *(added 2023-04-10)*.
- audioldm-l-full: a larger model compared with audioldm-s-full *(added 2023-03-04)*.

> @haoheliu personally ran an evaluation of the overall quality of the checkpoints, which gave audioldm-m-full (6.85/10), audioldm-s-full (6.62/10), audioldm-s-text-ft (6/10), and audioldm-m-text-ft (5.46/10). These scores are only for reference and may not reflect the true performance of each checkpoint; performance also varies with the text input.

:grey_question: For more options on guidance scale, batch size, seed, ddim steps, etc., please run

```shell
audioldm -h
```

```console
usage: audioldm [-h] [--mode {generation,transfer}] [-t TEXT] [-f FILE_PATH] [--transfer_strength TRANSFER_STRENGTH] [-s SAVE_PATH] [--model_name {audioldm-s-full,audioldm-l-full,audioldm-s-full-v2}] [-ckpt CKPT_PATH]
                [-b BATCHSIZE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE] [-dur DURATION] [-n N_CANDIDATE_GEN_PER_TEXT] [--seed SEED]

optional arguments:
  -h, --help            show this help message and exit
  --mode {generation,transfer}
                        generation: text-to-audio generation; transfer: style transfer
  -t TEXT, --text TEXT  Text prompt to the model for audio generation, DEFAULT ""
  -f FILE_PATH, --file_path FILE_PATH
                        (--mode transfer): Original audio file for style transfer; Or (--mode generation): the guidance audio file for generating similar audio, DEFAULT None
  --transfer_strength TRANSFER_STRENGTH
                        A value between 0 and 1. 0 means original audio without transfer, 1 means completely transfer to the audio indicated by text, DEFAULT 0.5
  -s SAVE_PATH, --save_path SAVE_PATH
                        The path to save model output, DEFAULT "./output"
  --model_name {audioldm-s-full,audioldm-l-full,audioldm-s-full-v2}
                        The checkpoint you are going to use, DEFAULT "audioldm-s-full"
  -ckpt CKPT_PATH, --ckpt_path CKPT_PATH
                        (deprecated) The path to the pretrained .ckpt model, DEFAULT None
  -b BATCHSIZE, --batchsize BATCHSIZE
                        Generate how many samples at the same time, DEFAULT 1
  --ddim_steps DDIM_STEPS
                        The sampling step for DDIM, DEFAULT 200
  -gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE
                        Guidance scale (Large => better quality and relevancy to text; Small => better diversity), DEFAULT 2.5
  -dur DURATION, --duration DURATION
                        The duration of the samples, DEFAULT 10
  -n N_CANDIDATE_GEN_PER_TEXT, --n_candidate_gen_per_text N_CANDIDATE_GEN_PER_TEXT
                        Automatic quality control. This number controls the number of candidates (e.g., generate three audios and choose the best to show you). A larger value usually leads to better quality with heavier computation, DEFAULT 3
  --seed SEED           Changing this value (any integer number) will lead to a different generation result. DEFAULT 42
```
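
As an illustration, several of these options can be combined in one call. The flag names come from the help text above; the specific values here are only example settings, not recommendations:

```shell
# More DDIM sampling steps, a slightly higher guidance scale, a 5-second clip, and a fixed seed.
audioldm -t "A hammer is hitting a wooden surface" --ddim_steps 100 -gs 3 -dur 5 --seed 0
```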

For the evaluation of audio generative models, please refer to [audioldm_eval](https://github.com/haoheliu/audioldm_eval).

# Hugging Face 🧨 Diffusers

AudioLDM is available in the Hugging Face [🧨 Diffusers](https://github.com/huggingface/diffusers) library from v0.15.0 onwards. The official checkpoints can be found on the [Hugging Face Hub](https://huggingface.co/cvssp), alongside [documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm) and [example scripts](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm).

To install Diffusers and Transformers, run:

```bash
pip install --upgrade diffusers transformers
```

You can then load pre-trained weights into the [AudioLDM pipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm) and generate text-conditional audio outputs:

```python
from diffusers import AudioLDMPipeline
import torch

repo_id = "cvssp/audioldm-s-full-v2"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
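
# Optional extra step (not part of the original example): save the generated
# waveform to a WAV file with SciPy. AudioLDM checkpoints generate audio at a
# 16 kHz sampling rate; adjust `rate` if your checkpoint differs.
import scipy.io.wavfile

scipy.io.wavfile.write("techno.wav", rate=16000, data=audio.astype("float32"))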
```

# Web Demo

Integrated into [Hugging Face Spaces 🤗](https://huggingface.co/spaces) using [Gradio](https://github.com/gradio-app/gradio). Try out the [Web Demo](https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation).

# TuneFlow Demo

Try out AudioLDM as a [TuneFlow](https://tuneflow.com) plugin: [tuneflow/AudioLDM](https://github.com/tuneflow/AudioLDM). See how it can work in a real DAW (Digital Audio Workstation).

# TODO

[Buy me a coffee](https://www.buymeacoffee.com/haoheliuP)

- [x] Update the checkpoint with more training steps.
- [x] Update the checkpoint with more parameters (audioldm-l).
- [ ] Add the AudioCaps-finetuned AudioLDM-S model
- [x] Build a pip-installable package for commandline use
- [x] Build the Gradio web application
- [ ] Add super-resolution and inpainting to the Gradio web application
- [ ] Add style transfer to the Gradio web application
- [x] Add text-guided style transfer
- [x] Add audio-to-audio generation
- [x] Add audio super-resolution
- [x] Add audio inpainting

## Cite this work

If you find this tool useful, please consider citing

```bibtex
@article{liu2023audioldm,
  title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
  journal={Proceedings of the International Conference on Machine Learning},
  year={2023},
  pages={21450-21474}
}
```

# Hardware requirements

- GPU with 8GB of dedicated VRAM
- A system with a 64-bit operating system (Windows 7, 8.1 or 10; Ubuntu 16.04 or later; or macOS 10.13 or later) and 16GB or more of system RAM

## Reference

Part of the code is borrowed from the following repos. We would like to thank the authors of these repos for their contributions.

> https://github.com/LAION-AI/CLAP

> https://github.com/CompVis/stable-diffusion

> https://github.com/v-iashin/SpecVQGAN

> https://github.com/toshas/torch-fidelity

We built the model with data from AudioSet, Freesound and the BBC Sound Effects library. We share this demo based on the UK copyright exception for data used in academic research.

<!-- This code repo is strictly for research demo purposes only. For commercial use please contact us. -->

---
title: AudioLDM
emoji: 🌍
colorFrom: purple
colorTo: red
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
short_description: Audio Gen
---