[ISSUE] Transformers inference script doesn't work.

#2
by CCRss - opened

Thanks for the release of this great model. 🤗

When running `transformer_inference_script` from the repo, it throws this error.

Error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[2], line 51
     49 codes_tensor = [torch.tensor(c, dtype=torch.long, device="cuda").unsqueeze(0) for c in codes]
     50 with torch.inference_mode():
---> 51     audio = snac_model.decoder(snac_model.quantizer.from_codes(codes_tensor))[0, 0].cpu().numpy()
     53 # Save your emotional voice output
     54 sf.write("output.wav", audio, 24000)

File ~/miniconda3/lib/python3.12/site-packages/snac/vq.py:95, in ResidualVectorQuantize.from_codes(self, codes)
     93 for i in range(self.n_codebooks):
     94     z_p_i = self.quantizers[i].decode_code(codes[i])
---> 95     z_q_i = self.quantizers[i].out_proj(z_p_i)
     96     z_q_i = z_q_i.repeat_interleave(self.quantizers[i].stride, dim=-1)
     97     z_q += z_q_i

File ~/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py:1773, in Module._wrapped_call_impl(self, *args, **kwargs)
   1771     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1772 else:
-> 1773     return self._call_impl(*args, **kwargs)

File ~/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py:1784, in Module._call_impl(self, *args, **kwargs)
   1779 # If we don't have any hooks, we want to skip the rest of the logic in
   1780 # this function, and just call forward.
   1781 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1782         or _global_backward_pre_hooks or _global_backward_hooks
   1783         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1784     return forward_call(*args, **kwargs)
   1786 result = None
   1787 called_always_called_hooks = set()

File ~/miniconda3/lib/python3.12/site-packages/torch/nn/modules/conv.py:371, in Conv1d.forward(self, input)
    370 def forward(self, input: Tensor) -> Tensor:
--> 371     return self._conv_forward(input, self.weight, self.bias)

File ~/miniconda3/lib/python3.12/site-packages/torch/nn/modules/conv.py:366, in Conv1d._conv_forward(self, input, weight, bias)
    354 if self.padding_mode != "zeros":
    355     return F.conv1d(
    356         F.pad(
    357             input, self._reversed_padding_repeated_twice, mode=self.padding_mode
   (...)
    364         self.groups,
    365     )
--> 366 return F.conv1d(
    367     input, weight, bias, self.stride, self.padding, self.dilation, self.groups
    368 )

RuntimeError: Calculated padded input size per channel: (0). Kernel size: (1). Kernel size can't be greater than actual input size
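
The "padded input size per channel: (0)" part means the SNAC decoder received zero-length code tensors, i.e. no audio tokens survived extraction before decoding. A minimal sketch reproducing the same failure outside SNAC (nothing maya1-specific here):

```python
import torch
import torch.nn as nn

# A kernel-size-1 Conv1d over a zero-length sequence raises the same
# RuntimeError that SNAC's out_proj hits when from_codes() is given
# empty code tensors.
conv = nn.Conv1d(in_channels=8, out_channels=8, kernel_size=1)
empty = torch.zeros(1, 8, 0)  # (batch, channels, length=0)
conv(empty)  # RuntimeError: Kernel size can't be greater than actual input size
```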

ENV:

snac==1.2.1
transformers==4.57.1
torch==2.8.0
torchaudio==2.8.0

Hardware:

  • H100
  • CUDA 12.4

With the vLLM script from vllm_streaming_inference.py, inference works.
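
For context, the non-streaming vLLM path looks roughly like the following. This is a minimal sketch using vLLM's offline API and the same chat prompt as the transformers script further down, not the repo's vllm_streaming_inference.py itself:

```python
# Minimal vLLM sketch (offline API; the repo's streaming script may differ).
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("maya-research/maya1")
llm = LLM(model="maya-research/maya1", dtype="bfloat16")

description = "Realistic male voice in the 30s age with american accent."
text = "Hello! This is a test of the Maya-1-Voice text-to-speech system."
messages = [
    {"role": "system", "content": "You are Maya, a voice AI that responds with SNAC audio tokens."},
    {"role": "user", "content": f'<description="{description}"> {text}'},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

params = SamplingParams(temperature=0.4, top_p=0.9, max_tokens=500)
outputs = llm.generate([prompt], params)
token_ids = outputs[0].outputs[0].token_ids  # unpack these exactly like the transformers script
```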

[UPD] With vLLM it works, but the results are awful:

1. Just noise:

```python
# Example 1: Professional voice
description = (
    "Realistic male voice in the 30s age with american accent. "
    "Normal pitch, warm timbre, conversational pacing, neutral tone delivery at med intensity."
)
text = "Hello! This is a test of the Maya-1-Voice text-to-speech system."
```

2. Random speech that does not follow the provided text:

```python
description = (
    "Creative, dark_villain character. Male voice in their 40s with british accent. "
    "Low pitch, gravelly timbre, slow pacing, angry tone at high intensity."
)
text = "The darkness isn't coming... <angry> it's already here!"
```

Same experience. However, the second audio I got does say the correct text. Here is the script I used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC
import soundfile as sf

# Load the best open source voice AI model
model = AutoModelForCausalLM.from_pretrained(
    "maya-research/maya1",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("maya-research/maya1")
llm_device = next(model.parameters()).device

# Load SNAC audio decoder (24kHz)
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to("cuda")
snac_device = next(snac_model.parameters()).device

# Design your voice with natural language
description = "Robotic male voice with unnatural intonation, beeps, boops, and fast pacing"
text = "Hello! This is Maya1 <laugh> the best open source voice AI model with emotions."

# Create chat-formatted prompt so the model emits SNAC audio tokens
messages = [
    {"role": "system", "content": "You are Maya, a voice AI that responds with SNAC audio tokens."},
    {"role": "user", "content": f'<description="{description}"> {text}'},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(llm_device)
attention_mask = torch.ones_like(input_ids)

# Generate emotional speech
with torch.inference_mode():
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=500,
        temperature=0.4,
        top_p=0.9,
        do_sample=True,
    )

# Extract SNAC audio tokens
generated_ids = outputs[0, input_ids.shape[1]:]
snac_tokens = [t.item() for t in generated_ids if 128266 <= t <= 156937]
if not snac_tokens:
    raise RuntimeError("Model did not emit any SNAC audio tokens. Try adjusting the prompt or sampling settings.")

# Decode SNAC tokens to audio frames
frames = len(snac_tokens) // 7
snac_tokens = snac_tokens[: frames * 7]  # drop incomplete frames, if any
# Each 7-token frame interleaves SNAC's 3 codebook levels:
# 1 coarse code (s[0]), 2 medium (s[1], s[4]), 4 fine (s[2], s[3], s[5], s[6]).
codes = [[], [], []]
for i in range(frames):
    s = snac_tokens[i * 7 : (i + 1) * 7]
    codes[0].append((s[0] - 128266) % 4096)
    codes[1].extend([(s[1] - 128266) % 4096, (s[4] - 128266) % 4096])
    codes[2].extend(
        [
            (s[2] - 128266) % 4096,
            (s[3] - 128266) % 4096,
            (s[5] - 128266) % 4096,
            (s[6] - 128266) % 4096,
        ]
    )

# Generate final audio with SNAC decoder
codes_tensor = [torch.tensor(c, dtype=torch.long, device=snac_device).unsqueeze(0) for c in codes]
with torch.inference_mode():
    audio = snac_model.decoder(snac_model.quantizer.from_codes(codes_tensor))[0, 0].cpu().numpy()

# Save your emotional voice output
sf.write("output.wav", audio, 24000)
print("Voice generated successfully! Play output.wav")```

I was able to get it past the SNAC errors by fixing the chat-template formatting, but it produces random voices and words.

Thanks, this worked, but it is generating random speech.

Same problem for me: "Kernel size can't be greater than actual input size".

I have it "working" too, but the output is always messed-up sounds unless you literally just give it the demo prompts. Why even release this?

