[ISSUE] Transformers inference script doesn't work.
Thanks for the release of a great model. 🤗
When running transformer_inference_script from the repo, it throws the following error.
Error:
```
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[2], line 51
49 codes_tensor = [torch.tensor(c, dtype=torch.long, device="cuda").unsqueeze(0) for c in codes]
50 with torch.inference_mode():
---> 51 audio = snac_model.decoder(snac_model.quantizer.from_codes(codes_tensor))[0, 0].cpu().numpy()
53 # Save your emotional voice output
54 sf.write("output.wav", audio, 24000)
File ~/miniconda3/lib/python3.12/site-packages/snac/vq.py:95, in ResidualVectorQuantize.from_codes(self, codes)
93 for i in range(self.n_codebooks):
94 z_p_i = self.quantizers[i].decode_code(codes[i])
---> 95 z_q_i = self.quantizers[i].out_proj(z_p_i)
96 z_q_i = z_q_i.repeat_interleave(self.quantizers[i].stride, dim=-1)
97 z_q += z_q_i
File ~/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py:1773, in Module._wrapped_call_impl(self, *args, **kwargs)
1771 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1772 else:
-> 1773 return self._call_impl(*args, **kwargs)
File ~/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py:1784, in Module._call_impl(self, *args, **kwargs)
1779 # If we don't have any hooks, we want to skip the rest of the logic in
1780 # this function, and just call forward.
1781 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1782 or _global_backward_pre_hooks or _global_backward_hooks
1783 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1784 return forward_call(*args, **kwargs)
1786 result = None
1787 called_always_called_hooks = set()
File ~/miniconda3/lib/python3.12/site-packages/torch/nn/modules/conv.py:371, in Conv1d.forward(self, input)
370 def forward(self, input: Tensor) -> Tensor:
--> 371 return self._conv_forward(input, self.weight, self.bias)
File ~/miniconda3/lib/python3.12/site-packages/torch/nn/modules/conv.py:366, in Conv1d._conv_forward(self, input, weight, bias)
354 if self.padding_mode != "zeros":
355 return F.conv1d(
356 F.pad(
357 input, self._reversed_padding_repeated_twice, mode=self.padding_mode
(...)
364 self.groups,
365 )
--> 366 return F.conv1d(
367 input, weight, bias, self.stride, self.padding, self.dilation, self.groups
368 )
RuntimeError: Calculated padded input size per channel: (0). Kernel size: (1). Kernel size can't be greater than actual input size
```
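For context on what this error means: the Conv1d in SNAC's `out_proj` is being handed an input of length 0, which happens when the code tensors passed to `from_codes` are empty, i.e. no SNAC audio tokens were extracted from the generation. A minimal guard (my own sketch, not from the repo script; `codes_tensor` is the list built near the end of the script below):

```python
# codes_tensor holds one (1, n) LongTensor per SNAC codebook level.
# If generation produced no audio tokens, n == 0 and the decoder's Conv1d
# fails with "Kernel size can't be greater than actual input size".
if any(c.shape[-1] == 0 for c in codes_tensor):
    raise RuntimeError(
        "No SNAC tokens extracted from the generation - "
        "check the token-id range filter and the chat prompt format."
    )
```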
ENV:
- snac==1.2.1
- transformers==4.57.1
- torch==2.8.0
- torchaudio==2.8.0

Hardware:
- H100
- CUDA 12.4
With the vLLM script from vllm_streaming_inference.py, inference works.
[UPD] With vLLM it runs, but the results are awful:

1. Just noise:

```python
# Example 1: Professional voice
description = (
    "Realistic male voice in the 30s age with american accent. "
    "Normal pitch, warm timbre, conversational pacing, neutral tone delivery at med intensity."
)
text = "Hello! This is a test of the Maya-1-Voice text-to-speech system."
```
2. Random speech that doesn't follow the provided text:

```python
# Example 2: Dark villain character
description = (
    "Creative, dark_villain character. Male voice in their 40s with british accent. "
    "Low pitch, gravelly timbre, slow pacing, angry tone at high intensity."
)
text = "The darkness isn't coming... <angry> it's already here!"
```
Same experience. However, the second example did give me audio that says the correct text. For reference, the full script:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC
import soundfile as sf

# Load the best open source voice AI model
model = AutoModelForCausalLM.from_pretrained(
    "maya-research/maya1",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("maya-research/maya1")
llm_device = next(model.parameters()).device

# Load SNAC audio decoder (24kHz)
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to("cuda")
snac_device = next(snac_model.parameters()).device

# Design your voice with natural language
description = "Robotic male voice with unnatural intonation, beeps, boops, and fast pacing"
text = "Hello! This is Maya1 <laugh> the best open source voice AI model with emotions."

# Create chat-formatted prompt so the model emits SNAC audio tokens
messages = [
    {"role": "system", "content": "You are Maya, a voice AI that responds with SNAC audio tokens."},
    {"role": "user", "content": f'<description="{description}"> {text}'},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(llm_device)
attention_mask = torch.ones_like(input_ids)
# Generate emotional speech
with torch.inference_mode():
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=500,
        temperature=0.4,
        top_p=0.9,
        do_sample=True,
    )
# Extract SNAC audio tokens
generated_ids = outputs[0, input_ids.shape[1]:]
snac_tokens = [t.item() for t in generated_ids if 128266 <= t <= 156937]
if not snac_tokens:
    raise RuntimeError("Model did not emit any SNAC audio tokens. Try adjusting the prompt or sampling settings.")
# Decode SNAC tokens to audio frames
frames = len(snac_tokens) // 7
snac_tokens = snac_tokens[: frames * 7] # drop incomplete frames, if any
codes = [[], [], []]
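# Each 7-token frame interleaves the three SNAC codebook levels:
#   s[0]                   -> level 0 (coarse, 1 code per frame)
#   s[1], s[4]             -> level 1 (2 codes per frame)
#   s[2], s[3], s[5], s[6] -> level 2 (fine, 4 codes per frame)
# Subtracting the 128266 offset and taking % 4096 recovers raw codebook indices.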
for i in range(frames):
    s = snac_tokens[i * 7 : (i + 1) * 7]
    codes[0].append((s[0] - 128266) % 4096)
    codes[1].extend([(s[1] - 128266) % 4096, (s[4] - 128266) % 4096])
    codes[2].extend(
        [
            (s[2] - 128266) % 4096,
            (s[3] - 128266) % 4096,
            (s[5] - 128266) % 4096,
            (s[6] - 128266) % 4096,
        ]
    )
# Generate final audio with SNAC decoder
codes_tensor = [torch.tensor(c, dtype=torch.long, device=snac_device).unsqueeze(0) for c in codes]
with torch.inference_mode():
    audio = snac_model.decoder(snac_model.quantizer.from_codes(codes_tensor))[0, 0].cpu().numpy()
# Save your emotional voice output
sf.write("output.wav", audio, 24000)
print("Voice generated successfully! Play output.wav")```
I was able to get it past the SNAC errors by improving how the chat template is followed, but it produces random voices and words.
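If it helps others hitting the same thing, one check worth doing (an assumption on my part, not from the repo) is to decode the formatted prompt back to text before generating; a malformed chat template is immediately visible:

```python
# Hypothetical sanity check: inspect the exact prompt string the chat
# template builds, including special tokens, before calling generate().
prompt_text = tokenizer.decode(input_ids[0], skip_special_tokens=False)
print(repr(prompt_text))
```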
Thanks, this worked, but it is still generating random speech.
Same problem for me: "Kernel size can't be greater than actual input size".
I have it "working" too but the output is always fucked up sounds except if you literally just give it the demo prompts lmao like why even release this ?
HF Space: https://huggingface.co/spaces/maya-research/maya1
Repo with FastAPI implementation: https://github.com/MayaResearch/maya1-fastapi