---
title: BARK Text to Audio with Batch Inference
emoji: 🪄
colorFrom: purple
colorTo: pink
sdk: gradio
python_version: "3.10.13"
sdk_version: "5.23.3"
suggested_hardware: cpu-upgrade
suggested_storage: small
app_file: app.py
short_description: Generate natural sounding speech audio from text
pinned: true
startup_duration_timeout: 45m
tags:
- text-to-audio
- gradio
- bark
preload_from_hub:
- suno/bark
---
# Generate Audio from Text and Clone Voices with BARK
You can generate natural-sounding speech from text and clone any voice (the cloning is not perfect).

The code was tested on Python 3.12 and may also work on other versions.
Example generated audio files are in the `/assets/audio` folder.
## Features
- **Text-to-Audio Generation:** Generate speech from text using the BARK model (supports 'small' and 'large' variants); a minimal generation sketch follows this list.
- **Parameter Control:** Adjust semantic, coarse, and fine temperature settings for generation diversity. Set a generation seed for reproducibility.
- **Device Selection:** Run inference on available devices (CPU, CUDA, MPS).
- **Standard Voice Prompts:** Utilize built-in BARK voice prompts (`.npz` files) located in the `bark_prompts` directory.
- **Custom Voice Prompt Creation (Voice Cloning):**
- Upload your own audio file (.wav, .mp3).
- Generate a BARK-compatible semantic prompt (`.npz` file) using a custom-trained HuBERT model.
- The generated prompt appears in the "Select Voice Prompt" dropdown for immediate use.
- **Audio Management:** View, play, and delete generated audio files directly within the interface.
- **Training Scripts:** Includes scripts to generate the necessary dataset (`generate_audio_semantic_dataset.py`) and train the custom HuBERT model (`train_hubert.py`).
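
For orientation, the text-to-audio feature above boils down to a call like the one below. This is a minimal sketch using the `transformers` Bark implementation, not the app's actual code; the temperature and seed controls exposed in the UI are applied on top of a call like this, and `suno/bark-small` / `suno/bark` are the standard small/large checkpoints.
```python
# Minimal BARK text-to-audio sketch via Hugging Face transformers (not app.py itself).
import scipy.io.wavfile
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark-small")  # "suno/bark" for the large variant
model = BarkModel.from_pretrained("suno/bark-small")          # move to "cuda"/"mps" for GPU inference

# voice_preset selects one of the built-in voice prompts (.npz files)
inputs = processor("Hello, this is a BARK demo.", voice_preset="v2/en_speaker_6")
audio_array = model.generate(**inputs).cpu().numpy().squeeze()

# BARK generates 24 kHz audio; the exact rate is stored on the generation config
scipy.io.wavfile.write("bark_out.wav", rate=model.generation_config.sample_rate, data=audio_array)
```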
## Custom Voice Cloning Model
The core of the custom voice prompt generation relies on a fine-tuned HuBERT model.
- **Model:** `sleeper371/hubert-for-bark-semantic` on Hugging Face ([Link](https://huggingface.co/sleeper371/hubert-for-bark-semantic))
- **Architecture:** This model uses a HuBERT base feature extractor followed by a Transformer decoder head.
- **Training:** It was trained on over 4,700 pairs of audio waveforms and the corresponding semantic tokens produced by BARK's semantic model, using a cross-entropy loss objective.
- **Dataset:** The training dataset is available at `sleeper371/bark-wave-semantic` on Hugging Face ([Link](https://huggingface.co/datasets/sleeper371/bark-wave-semantic)).
- **Comparison:** This approach is inspired by projects like [gitmylo/bark-data-gen](https://github.com/gitmylo/bark-data-gen), but differs in the head architecture: gitmylo used an LSTM head, whereas this project uses a Transformer decoder head.
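
As a rough illustration of that architecture, the sketch below wires a pretrained HuBERT encoder to a Transformer decoder that predicts BARK semantic tokens under a cross-entropy objective. Class and hyperparameter names are invented for the sketch and are not taken from the repo or the published checkpoint.
```python
# Illustrative only: HuBERT features -> Transformer decoder -> BARK semantic tokens.
import torch.nn as nn
from transformers import HubertModel

class HubertWithDecoderHead(nn.Module):
    def __init__(self, vocab_size=10_000, d_model=768, n_heads=8, n_layers=6):
        super().__init__()
        # HuBERT base yields 768-dim features per ~20 ms audio frame (16 kHz input)
        self.encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.token_emb = nn.Embedding(vocab_size, d_model)  # 10_000 = BARK's semantic vocab size
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, waveform, semantic_tokens):
        # waveform: (batch, samples); semantic_tokens: (batch, seq) teacher-forced targets
        memory = self.encoder(waveform).last_hidden_state  # (batch, frames, 768)
        tgt = self.token_emb(semantic_tokens)               # (batch, seq, 768)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)   # causal self-attn + cross-attn to audio features
        return self.lm_head(hidden)                         # logits over the semantic vocabulary

# Training pairs each waveform with the semantic tokens BARK produces for the same
# sentence and minimizes cross-entropy between these logits and those tokens.
```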
## Setup and Installation
Follow these steps to set up the environment and run the application.
1. **Clone the Repository:**
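A placeholder URL is used below; substitute this repository's actual clone URL.
```bash
# <repository-url> and <repository-folder> are placeholders for this repo
git clone <repository-url>
cd <repository-folder>
```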
2. **Create a Virtual Environment:**
It's highly recommended to use a virtual environment to manage dependencies.
```bash
# For Linux/macOS
python3 -m venv venv
source venv/bin/activate
# For Windows
python -m venv venv
.\venv\Scripts\activate
```
3. **Install Requirements:**
Make sure you have a `requirements.txt` file in the repository root containing the necessary packages (e.g., `gradio`, `torch`, `transformers`, `soundfile`).
```bash
pip install -r requirements.txt
```
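If the file is missing, an unpinned list along these lines would cover the packages mentioned above (versions left open on purpose; adjust to match your setup):
```text
gradio
torch
transformers
soundfile
```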
## Running the Application
Once the setup is complete, run the Gradio application:
```bash
python app.py
```
This will launch the Gradio interface, typically accessible at http://127.0.0.1:7860 in your web browser. The console output will provide the exact URL.
## Training Your Own Custom HuBERT Model
If you want to train your own HuBERT model for voice cloning:
1. **Generate the Dataset:**
   - Use the `generate_audio_semantic_dataset.py` script.
2. **Train the Model:**
   - Use the `train_hubert.py` script (minimal example invocations follow this list).
   - This script takes the generated dataset (audio paths and semantic token paths) and fine-tunes a HuBERT model with a Transformer decoder head.
   - Configure training parameters (batch size, learning rate, epochs, output directory) within the script or via command-line arguments (if implemented).
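
Assuming the scripts run with their built-in defaults (check each script for the arguments it actually supports), a minimal workflow looks like:
```bash
# 1) Build the (audio, semantic-token) dataset used for fine-tuning
python generate_audio_semantic_dataset.py

# 2) Fine-tune the HuBERT + Transformer-decoder model on that dataset
python train_hubert.py
```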
## License
MIT
## Acknowledgements
- Suno AI, for training and releasing the BARK models
- gitmylo, whose work inspired me to use HuBERT to predict semantic tokens from audio