inoculatemedia committed
Commit d2e5801 · verified · 1 Parent(s): 8815caa

Update README.md

Files changed (1):
  1. README.md +11 -213

README.md CHANGED
@@ -1,214 +1,12 @@
- # :sound: Audio Generation with AudioLDM (ICML 2023)
-
- [![arXiv](https://img.shields.io/badge/arXiv-2301.12503-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2301.12503) [![githubio](https://img.shields.io/badge/GitHub.io-Audio_Samples-blue?logo=Github&style=flat-square)](https://audioldm.github.io/) [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/olaviinha/NeuralTextToAudio/blob/main/AudioLDM_pub.ipynb?force_theme=dark) [![Replicate](https://replicate.com/jagilley/audio-ldm/badge)](https://replicate.com/jagilley/audio-ldm)
-
- <!-- # [![PyPI version](https://badge.fury.io/py/voicefixer.svg)](https://badge.fury.io/py/voicefixer) -->
-
- **Generate speech, sound effects, music and beyond.**
-
- This repo currently supports:
-
- - **Text-to-Audio Generation**: Generate audio given text input.
- - **Audio-to-Audio Generation**: Given an audio clip, generate another clip that contains the same type of sound.
- - **Text-guided Audio-to-Audio Style Transfer**: Transfer the sound of an audio clip toward a target described by text.
-
- <hr>
-
- ## Important tricks to make your generated audio sound better
- 1. Try to provide more hints to AudioLDM, such as using more adjectives to describe your sound (e.g., clear, high quality) or making your target more specific (e.g., "water stream in a forest" instead of "stream"). This helps AudioLDM understand what you want; see the prompt sketch after this list.
- 2. Try different random seeds, which can sometimes affect the generation quality significantly.
- 3. It's best to use general terms like "man" or "woman" instead of specific names of individuals or abstract objects that the model may not be familiar with.
-
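- As a small illustration of trick 1 with the commandline interface described below (the prompts here are hypothetical examples, not tested settings):
- ```shell
- # Vague prompt: leaves the model to guess the scene
- audioldm -t "stream"
- # Descriptive prompt: adjectives and context pin down the target sound
- audioldm -t "A clear, high quality recording of a water stream in a forest"
- ```
-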
- # Change Log
-
- **2023-04-10**: Finetune AudioLDM with the MusicCaps and AudioCaps datasets. Add three more checkpoints: audioldm-m-text-ft, audioldm-s-text-ft, and audioldm-m-full.
-
- **2023-03-04**: Add two more checkpoints: a small model with more training steps and a large model. Add model selection in the Gradio app.
-
- **2023-02-24**: Add audio-to-audio generation. Add test cases. Add a pipeline (Python function) for audio super-resolution and inpainting.
-
- **2023-02-15**: Add audio style transfer. Add more generation options.
-
- ## Web APP
-
- The web APP currently only supports Text-to-Audio generation. For full functionality please refer to the [Commandline Usage](https://github.com/haoheliu/AudioLDM#commandline-usage).
-
- 1. Prepare the running environment
- ```shell
- conda create -n audioldm python=3.8; conda activate audioldm
- pip3 install git+https://github.com/haoheliu/AudioLDM.git
- git clone https://github.com/haoheliu/AudioLDM; cd AudioLDM
- ```
- 2. Start the web application (powered by Gradio)
- ```shell
- python3 app.py
- ```
- 3. A link will be printed out. Click the link to open it in your browser and play.
-
- ## Commandline Usage
- Prepare the running environment
- ```shell
- # Optional
- conda create -n audioldm python=3.8; conda activate audioldm
- # Install AudioLDM
- pip3 install git+https://github.com/haoheliu/AudioLDM.git
- ```
-
- :star2: **Text-to-Audio Generation**: generate an audio clip guided by a text prompt
- ```shell
- # The default --mode is "generation"
- audioldm -t "A hammer is hitting a wooden surface"
- # Result will be saved in "./output/generation"
- ```
-
- :star2: **Audio-to-Audio Generation**: generate an audio clip guided by another audio clip (the output will have similar audio events to the input audio file).
- ```shell
- audioldm --file_path trumpet.wav
- # Result will be saved in "./output/generation_audio_to_audio/trumpet"
- ```
-
- :star2: **Text-guided Audio-to-Audio Style Transfer**
- ```shell
- # Test run
- # --file_path is the original audio file for transfer
- # -t is the text AudioLDM uses for the transfer.
- # Please make sure that --file_path exists
- audioldm --mode "transfer" --file_path trumpet.wav -t "Children Singing"
- # Result will be saved in "./output/transfer/trumpet"
-
- # Tuning the value of --transfer_strength is important!
- # --transfer_strength: A value between 0 and 1. 0 means the original audio without transfer, 1 means completely transferring to the audio indicated by the text
- audioldm --mode "transfer" --file_path trumpet.wav -t "Children Singing" --transfer_strength 0.25
- ```
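-
- A hypothetical way to pick a good transfer strength is to sweep a few values (a sketch using only the documented flags; the three values below are arbitrary):
- ```shell
- # Generate one transferred clip per strength and compare by ear
- for s in 0.25 0.5 0.75; do
-   audioldm --mode "transfer" --file_path trumpet.wav -t "Children Singing" --transfer_strength $s
- done
- ```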
-
- :gear: How to choose between different model checkpoints?
- ```shell
- # Add the --model_name parameter, choice={audioldm-m-text-ft, audioldm-s-text-ft, audioldm-m-full, audioldm-s-full, audioldm-l-full, audioldm-s-full-v2}
- audioldm --model_name audioldm-s-full
- ```
-
- - :star: audioldm-m-full (default, **recommended**): the medium AudioLDM without finetuning, trained with audio embeddings as the condition *(added 2023-04-10)*.
- - :star: audioldm-s-full (**recommended**): the original open-sourced version *(added 2023-02-01)*.
- - :star: audioldm-s-full-v2 (**recommended**): more training steps compared with audioldm-s-full *(added 2023-03-04)*.
- - audioldm-s-text-ft: the small AudioLDM finetuned with AudioCaps and MusicCaps audio-text pairs *(added 2023-04-10)*.
- - audioldm-m-text-ft: the medium AudioLDM finetuned with AudioCaps and MusicCaps audio-text pairs *(added 2023-04-10)*.
- - audioldm-l-full: a larger model compared with audioldm-s-full *(added 2023-03-04)*.
-
- > @haoheliu personally did an evaluation of the overall quality of the checkpoints, which gave audioldm-m-full (6.85/10), audioldm-s-full (6.62/10), audioldm-s-text-ft (6/10), and audioldm-m-text-ft (5.46/10). These scores are only for reference and may not reflect the true performance of each checkpoint. Checkpoint performance also varies with different text inputs.
-
- :grey_question: For more options on guidance scale, batch size, seed, DDIM steps, etc., please run
- ```shell
- audioldm -h
- ```
- ```console
- usage: audioldm [-h] [--mode {generation,transfer}] [-t TEXT] [-f FILE_PATH] [--transfer_strength TRANSFER_STRENGTH] [-s SAVE_PATH] [--model_name {audioldm-s-full,audioldm-l-full,audioldm-s-full-v2}] [-ckpt CKPT_PATH]
-                 [-b BATCHSIZE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE] [-dur DURATION] [-n N_CANDIDATE_GEN_PER_TEXT] [--seed SEED]
-
- optional arguments:
-   -h, --help            show this help message and exit
-   --mode {generation,transfer}
-                         generation: text-to-audio generation; transfer: style transfer
-   -t TEXT, --text TEXT  Text prompt to the model for audio generation, DEFAULT ""
-   -f FILE_PATH, --file_path FILE_PATH
-                         (--mode transfer): Original audio file for style transfer; Or (--mode generation): the guidance audio file for generating similar audio, DEFAULT None
-   --transfer_strength TRANSFER_STRENGTH
-                         A value between 0 and 1. 0 means original audio without transfer, 1 means completely transfer to the audio indicated by text, DEFAULT 0.5
-   -s SAVE_PATH, --save_path SAVE_PATH
-                         The path to save model output, DEFAULT "./output"
-   --model_name {audioldm-s-full,audioldm-l-full,audioldm-s-full-v2}
-                         The checkpoint you are going to use, DEFAULT "audioldm-s-full"
-   -ckpt CKPT_PATH, --ckpt_path CKPT_PATH
-                         (deprecated) The path to the pretrained .ckpt model, DEFAULT None
-   -b BATCHSIZE, --batchsize BATCHSIZE
-                         Generate how many samples at the same time, DEFAULT 1
-   --ddim_steps DDIM_STEPS
-                         The sampling step for DDIM, DEFAULT 200
-   -gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE
-                         Guidance scale (large => better quality and relevance to text; small => better diversity), DEFAULT 2.5
-   -dur DURATION, --duration DURATION
-                         The duration of the samples, DEFAULT 10
-   -n N_CANDIDATE_GEN_PER_TEXT, --n_candidate_gen_per_text N_CANDIDATE_GEN_PER_TEXT
-                         Automatic quality control. This number controls the number of candidates (e.g., generate three audio clips and choose the best to show you). A larger value usually leads to better quality with heavier computation, DEFAULT 3
-   --seed SEED           Changing this value (any integer number) will lead to a different generation result, DEFAULT 42
- ```
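-
- A minimal sketch combining several of these flags (the specific values are arbitrary illustrations, not recommended settings):
- ```shell
- # Generate a 5-second clip with a fixed seed, 100 DDIM steps, and a slightly higher guidance scale
- audioldm -t "A hammer is hitting a wooden surface" --seed 1234 -dur 5 --ddim_steps 100 -gs 3 -n 3 -s "./output_custom"
- ```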
-
- For the evaluation of audio generative models, please refer to [audioldm_eval](https://github.com/haoheliu/audioldm_eval).
-
- # Hugging Face 🧨 Diffusers
-
- AudioLDM is available in the Hugging Face [🧨 Diffusers](https://github.com/huggingface/diffusers) library from v0.15.0 onwards. The official checkpoints can be found on the [Hugging Face Hub](https://huggingface.co/cvssp), alongside [documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm) and [example scripts](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm).
-
- To install Diffusers and Transformers, run:
- ```bash
- pip install --upgrade diffusers transformers
- ```
-
- You can then load pre-trained weights into the [AudioLDM pipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm) and generate text-conditional audio outputs:
- ```python
- from diffusers import AudioLDMPipeline
- import torch
-
- repo_id = "cvssp/audioldm-s-full-v2"
- pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
- pipe = pipe.to("cuda")
-
- prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
- audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
- ```
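-
- The pipeline returns the waveform as a NumPy array. A minimal follow-up sketch for saving it to disk (assuming SciPy is installed; AudioLDM generates audio at a 16 kHz sample rate):
- ```python
- import scipy.io.wavfile
-
- # Write the generated mono waveform to a 16 kHz WAV file
- scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
- ```
- For reproducible results you can also pass a seeded generator to the pipeline call, e.g. `pipe(prompt, generator=torch.Generator("cuda").manual_seed(42), ...)`.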
-
- # Web Demo
-
- Integrated into [Hugging Face Spaces 🤗](https://huggingface.co/spaces) using [Gradio](https://github.com/gradio-app/gradio). Try out the Web Demo [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation)
-
- # TuneFlow Demo
-
- Try out AudioLDM as a [TuneFlow](https://tuneflow.com) plugin [![TuneFlow x AudioLDM](https://img.shields.io/badge/TuneFlow-AudioLDM-%23C563E6%20)](https://github.com/tuneflow/AudioLDM). See how it can work in a real DAW (Digital Audio Workstation).
-
- # TODO
-
- [!["Buy Me A Coffee"](https://www.buymeacoffee.com/assets/img/custom_images/orange_img.png)](https://www.buymeacoffee.com/haoheliuP)
-
- - [x] Update the checkpoint with more training steps.
- - [x] Update the checkpoint with more parameters (audioldm-l).
- - [ ] Add AudioCaps finetuned AudioLDM-S model
- - [x] Build pip installable package for commandline use
- - [x] Build Gradio web application
- - [ ] Add super-resolution, inpainting into Gradio web application
- - [ ] Add style-transfer into Gradio web application
- - [x] Add text-guided style transfer
- - [x] Add audio-to-audio generation
- - [x] Add audio super-resolution
- - [x] Add audio inpainting
-
- ## Cite this work
-
- If you find this tool useful, please consider citing
- ```bibtex
- @article{liu2023audioldm,
-   title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
-   author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
-   journal={Proceedings of the International Conference on Machine Learning},
-   year={2023},
-   pages={21450-21474}
- }
- ```
-
- # Hardware requirement
- - GPU with 8GB of dedicated VRAM
- - A system with a 64-bit operating system (Windows 7, 8.1 or 10; Ubuntu 16.04 or later; or macOS 10.13 or later) and 16GB or more of system RAM
-
- ## Reference
- Part of the code is borrowed from the following repos. We would like to thank the authors of these repos for their contributions.
-
- > https://github.com/LAION-AI/CLAP
-
- > https://github.com/CompVis/stable-diffusion
-
- > https://github.com/v-iashin/SpecVQGAN
-
- > https://github.com/toshas/torch-fidelity
-
- We built the model with data from AudioSet, Freesound and the BBC Sound Effects library. We share this demo based on the UK copyright exception for data used in academic research.
-
- <!-- This code repo is strictly for research demo purpose only. For commercial use please contact us. -->
 
+ ---
+ title: AudioLDM
+ emoji: 🌍
+ colorFrom: purple
+ colorTo: red
+ sdk: gradio
+ sdk_version: 4.44.1
+ app_file: app.py
+ pinned: true
+ short_description: Audio Gen
+ ---