Improve model card for AudioStory: add metadata and paper link

#3
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +19 -7
README.md CHANGED
@@ -1,5 +1,13 @@
+ ---
+ license: apache-2.0
+ pipeline_tag: text-to-audio
+ library_name: transformers
+ ---
+
  # AudioStory: Generating Long-Form Narrative Audio with Large Language Models
 
+ This model is presented in the paper [AudioStory: Generating Long-Form Narrative Audio with Large Language Models](https://huggingface.co/papers/2508.20088).
+
  [[github]](https://github.com/TencentARC/AudioStory/)
 
  ✨ TL; DR: We propose a model for long-form narrative audio generation built upon a unified understanding–generation framework, capable of handling video dubbing, audio continuation, and long-form narrative audio synthesis.
@@ -23,7 +31,7 @@
 
  ## 🔎 Introduction
 
- ![audiostory](audiostory.png)
+ ![audiostory](https://github.com/TencentARC/AudioStory/raw/main/audiostory.png)
 
  Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features:
 
@@ -51,17 +59,21 @@ Extensive experiments show the superiority of AudioStory on both single-audio ge
  ### 2. Cross-domain Video Dubbing (Tom & Jerry style)
 
  <table class="center">
+ <td><video src="https://github.com/user-attachments/assets/4089493c-2a26-4093-9709-0827c6dafcde"></video></td>
+ <td><video src="https://github.com/user-attachments/assets/67fafed1-2547-49ba-afaa-75fc7f9d58ca"></video></td>
+ <td><video src="https://github.com/user-attachments/assets/abbc9192-894c-49a2-9b55-8cc4852483c2"></video></td>
+ <tr>
  <td><video src="https://github.com/user-attachments/assets/e62d0c09-cdf0-4e51-b550-0a2c23f8d68d"></video></td>
- <td><video src="https://github.com/user-attachments/assets/736d22ca-6636-4ef0-99f3-768e4dfb112a"></video></td>
+ <td><video src="https://github.com/user-attachments/assets/38339d5b-b96a-4ffd-8607-c94eb254beb6"></video></td>
  <td><video src="https://github.com/user-attachments/assets/f2f7c94c-7f72-4cc0-8edc-290910980b04"></video></td>
  <tr>
  <td><video src="https://github.com/user-attachments/assets/d3e58dd4-31ae-4e32-aef1-03f1e649cb0c"></video></td>
- <td><video src="https://github.com/user-attachments/assets/4f68199f-e48a-4be7-b6dc-1acb8d377a6e"></video></td>
+ <td><video src="https://github.com/user-attachments/assets/ab7e46d5-f42c-472e-b66e-df786b658210"></video></td>
  <td><video src="https://github.com/user-attachments/assets/062236c3-1d26-4622-b843-cc0cd0c58053"></video></td>
  <tr>
  <td><video src="https://github.com/user-attachments/assets/8931f428-dd4d-430f-9927-068f2912dd36"></video></td>
- <td><video src="https://github.com/user-attachments/assets/ab7e46d5-f42c-472e-b66e-df786b658210"></video></td>
- <td><video src="https://github.com/user-attachments/assets/9a0998ad-b5a4-42ac-bdaf-ceaf796fc586"></video></td>
+ <td><video src="https://github.com/user-attachments/assets/4f68199f-e48a-4be7-b6dc-1acb8d377a6e"></video></td>
+ <td><video src="https://github.com/user-attachments/assets/736d22ca-6636-4ef0-99f3-768e4dfb112a"></video></td>
  <tr>
  </table >
 
@@ -86,7 +98,7 @@ Extensive experiments show the superiority of AudioStory on both single-audio ge
 
  ## 🔎 Methods
 
- ![audiostory_framework](audiostory_framework.png)
+ ![audiostory_framework](https://github.com/TencentARC/AudioStory/raw/main/audiostory_framework.png)
 
  To achieve effective instruction-following audio generation, the ability to understand the input instruction or audio stream and reason about relevant audio sub-events is essential. To this end, AudioStory adopts a unified understanding-generation framework (Fig.). Specifically, given textual instruction or audio input, the LLM analyzes and decomposes it into structured audio sub-events with context. Based on the inferred sub-events, the LLM performs **interleaved reasoning generation**, sequentially producing captions, semantic tokens, and residual tokens for each audio clip. These two types of tokens are fused and passed to the DiT, effectively bridging the LLM with the audio generator. Through progressive training, AudioStory ultimately achieves both strong instruction comprehension and high-quality audio generation.
 
@@ -169,4 +181,4 @@ This repository is under the [Apache 2 License](https://github.com/mashijie1028/
 
  If you have further questions, feel free to contact me: guoyuxin2021@ia.ac.cn
 
- Discussions and potential collaborations are also welcome.
+ Discussions and potential collaborations are also welcome.
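
A quick way to sanity-check the two kinds of change in this diff (the new metadata block at the top of README.md, and the image links switched from relative paths to absolute GitHub URLs) is sketched below. It is a minimal illustration, not tooling used by this PR: the helper names `parse_front_matter` and `relative_images` are hypothetical, the `README` string is an abbreviated stand-in for the real file, and the front-matter parser only understands flat `key: value` lines rather than full YAML.

```python
import re

# Abbreviated, hypothetical stand-in for README.md after this change.
README = """---
license: apache-2.0
pipeline_tag: text-to-audio
library_name: transformers
---

# AudioStory: Generating Long-Form Narrative Audio with Large Language Models

![audiostory](https://github.com/TencentARC/AudioStory/raw/main/audiostory.png)
"""


def parse_front_matter(text):
    """Return flat key/value pairs from a leading '---' delimited block.

    Only simple `key: value` lines are handled, which is all this card
    uses; nested or list values would need a real YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat as no front matter


def relative_images(text):
    """List Markdown image targets that are not absolute http(s) URLs."""
    targets = re.findall(r"!\[[^\]]*\]\(([^)\s]+)\)", text)
    return [t for t in targets if not re.match(r"https?://", t)]


metadata = parse_front_matter(README)
print(metadata)
print(relative_images(README))
print(relative_images("![audiostory](audiostory.png)"))
```

The last call uses the old relative link from the removed line, so it is reported, while the updated README yields an empty list. Presumably the images were switched to absolute URLs because the PNG files live in the GitHub repository rather than alongside the model card, but the PR itself does not state the reason.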