Improve model card for AudioStory: add metadata and paper link
#3 · opened by nielsr (HF Staff)

README.md CHANGED
@@ -1,5 +1,13 @@
+---
+license: apache-2.0
+pipeline_tag: text-to-audio
+library_name: transformers
+---
+
 # AudioStory: Generating Long-Form Narrative Audio with Large Language Models
 
+This model is presented in the paper [AudioStory: Generating Long-Form Narrative Audio with Large Language Models](https://huggingface.co/papers/2508.20088).
+
 [[github]](https://github.com/TencentARC/AudioStory/)
 
 ✨ TL; DR: We propose a model for long-form narrative audio generation built upon a unified understanding–generation framework, capable of handling video dubbing, audio continuation, and long-form narrative audio synthesis.
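The YAML block added in this hunk is what the Hub consumes: `pipeline_tag: text-to-audio` files the model under that task, and `library_name: transformers` controls the auto-generated usage snippet on the model page. Below is a minimal sketch of what that metadata implies for users, assuming the checkpoint lives at a repo id like `TencentARC/AudioStory` (hypothetical, inferred from the GitHub org) and that the generic pipeline can actually load this custom architecture; the GitHub instructions take precedence.

```python
# Minimal sketch based on the declared metadata only; the repo id is an
# assumption, and AudioStory's custom architecture may require the
# project's own loading code (see the GitHub link) instead.
from transformers import pipeline

pipe = pipeline("text-to-audio", model="TencentARC/AudioStory")  # hypothetical id

out = pipe("A thunderstorm fades into gentle rain, then distant birdsong.")
audio, sr = out["audio"], out["sampling_rate"]  # waveform and sampling rate
```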
@@ -23,7 +31,7 @@
 
 ## 🔎 Introduction
 
-
+
 
 Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features:
 
@@ -51,17 +59,21 @@ Extensive experiments show the superiority of AudioStory on both single-audio ge
 ### 2. Cross-domain Video Dubbing (Tom & Jerry style)
 
 <table class="center">
+<td><video src="https://github.com/user-attachments/assets/4089493c-2a26-4093-9709-0827c6dafcde"></video></td>
+<td><video src="https://github.com/user-attachments/assets/67fafed1-2547-49ba-afaa-75fc7f9d58ca"></video></td>
+<td><video src="https://github.com/user-attachments/assets/abbc9192-894c-49a2-9b55-8cc4852483c2"></video></td>
+<tr>
 <td><video src="https://github.com/user-attachments/assets/e62d0c09-cdf0-4e51-b550-0a2c23f8d68d"></video></td>
-<td><video src="https://github.com/user-attachments/assets/
+<td><video src="https://github.com/user-attachments/assets/38339d5b-b96a-4ffd-8607-c94eb254beb6"></video></td>
 <td><video src="https://github.com/user-attachments/assets/f2f7c94c-7f72-4cc0-8edc-290910980b04"></video></td>
 <tr>
 <td><video src="https://github.com/user-attachments/assets/d3e58dd4-31ae-4e32-aef1-03f1e649cb0c"></video></td>
-<td><video src="https://github.com/user-attachments/assets/
+<td><video src="https://github.com/user-attachments/assets/ab7e46d5-f42c-472e-b66e-df786b658210"></video></td>
 <td><video src="https://github.com/user-attachments/assets/062236c3-1d26-4622-b843-cc0cd0c58053"></video></td>
 <tr>
 <td><video src="https://github.com/user-attachments/assets/8931f428-dd4d-430f-9927-068f2912dd36"></video></td>
-<td><video src="https://github.com/user-attachments/assets/
-<td><video src="https://github.com/user-attachments/assets/
+<td><video src="https://github.com/user-attachments/assets/4f68199f-e48a-4be7-b6dc-1acb8d377a6e"></video></td>
+<td><video src="https://github.com/user-attachments/assets/736d22ca-6636-4ef0-99f3-768e4dfb112a"></video></td>
 <tr>
 </table >
 
@@ -86,7 +98,7 @@ Extensive experiments show the superiority of AudioStory on both single-audio ge
 
 ## 🔎 Methods
 
-
+
 
 To achieve effective instruction-following audio generation, the ability to understand the input instruction or audio stream and reason about relevant audio sub-events is essential. To this end, AudioStory adopts a unified understanding-generation framework (Fig.). Specifically, given textual instruction or audio input, the LLM analyzes and decomposes it into structured audio sub-events with context. Based on the inferred sub-events, the LLM performs **interleaved reasoning generation**, sequentially producing captions, semantic tokens, and residual tokens for each audio clip. These two types of tokens are fused and passed to the DiT, effectively bridging the LLM with the audio generator. Through progressive training, AudioStory ultimately achieves both strong instruction comprehension and high-quality audio generation.
 
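To make the interleaved flow in that Methods paragraph concrete, the sketch below paraphrases the loop it describes. Every name in it (`decompose`, `generate_clip_tokens`, `synthesize`, the `fuse` step) is a hypothetical placeholder for AudioStory's internal components, not its actual API.

```python
import numpy as np

def fuse(semantic_toks: np.ndarray, residual_toks: np.ndarray) -> np.ndarray:
    # Placeholder fusion of the two token streams; the real mechanism that
    # bridges the LLM and the DiT is internal to AudioStory.
    return np.concatenate([semantic_toks, residual_toks], axis=-1)

def generate_narrative_audio(llm, dit, instruction: str) -> np.ndarray:
    # 1. The LLM decomposes the instruction (or input audio stream) into
    #    temporally ordered sub-events with contextual cues.
    sub_events = llm.decompose(instruction)

    clips = []
    for event in sub_events:
        # 2. Interleaved reasoning generation: for each sub-event the LLM
        #    sequentially emits a caption, semantic tokens, and residual tokens.
        caption, semantic_toks, residual_toks = llm.generate_clip_tokens(event)

        # 3. The fused tokens condition the DiT, which renders one clip.
        clips.append(dit.synthesize(fuse(semantic_toks, residual_toks), caption))

    # 4. Concatenated clips form the coherent long-form narrative audio.
    return np.concatenate(clips)
```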
@@ -169,4 +181,4 @@ This repository is under the [Apache 2 License](https://github.com/mashijie1028/
 
 If you have further questions, feel free to contact me: guoyuxin2021@ia.ac.cn
 
-Discussions and potential collaborations are also welcome.
+Discussions and potential collaborations are also welcome.