Solving NaN Tensors and Pickling Errors in a ZeroGPU Space
Hi! In this post I'm going to talk about a recent difficulty I had involving an XTTS Space and ZeroGPU from Hugging Face.
The problem taught me a lot about Python, Hugging Face and ZeroGPU, and I hope it can help someone who is going through something similar!
The Space
The Space involved is this: https://huggingface.co/spaces/rrg92/xtts
This Space contains a version of XTTS, which is a model for text-to-speech (TTS) and voice cloning!
Someone who tried to use the Space commented that the voice clone wasn't working.
When I went to test, it wasn't working either, but everything else was.
So that you understand the main components involved in this problem, I'll summarize the structure:
- It's a Gradio Space, using version 5.5.0 of Gradio.
- There are two main files (modules), xtts and app:
  - xtts is where I put all the imports to invoke the XTTS model and the functions that interact directly with it.
  - app is where the Gradio app is located, with its respective event_listeners. They invoke the functions of the xtts module.
- This structure is a small adaptation of the xtts-streaming-server project. I put the API and the model in the same app, so I could use Gradio on Hugging Face and get the main benefit: use ZeroGPU!
Of all the functions, the ones relevant to the problem are these:
- xtts.predict_speaker
  This is the function that invokes model inference to clone a voice.
  Basically, it receives the binary of the reference audio file and calculates the voice embeddings by calling the model.
  It invokes the model using model.get_conditioning_latents, passing the file binary. It returns these embeddings, which can later be sent to xtts.predict_speech as the speaker voice.
- xtts.predict_speech
  This is the function that converts text to speech.
  Of the parameters it accepts, the most relevant for us are the text to be converted and the speaker.
  This speaker is the set of embeddings that represent the voice.
  XTTS comes with a range of standard, high-quality studio voices, and we can also generate new embeddings using xtts.predict_speaker.
  One way or another, these are the main parameters. The function returns the binary of the generated audio.
- app.clone_speaker
  This is the function triggered when someone clicks the button to clone the voice.
  It receives, as its first parameter, the reference audio provided by the user in the Gradio interface. It is a string containing a file path.
  Then we open the file using Python's open function (rb mode) and invoke the xtts.predict_speaker function, passing the file object returned by open.
- app.tts
  This is the function invoked when the user clicks the TTS button in the Gradio interface.
  The function does a series of operations, but it all boils down to determining the text and the embeddings of the speaker chosen in the interface, and invoking xtts.predict_speech.
And to finish, as I wanted to run the TTS using ZeroGPU, I decorated the xtts.predict_speech function with the @spaces.GPU decorator. This is the official procedure documented by Hugging Face for when we want to use the GPU.
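To make the decorator step concrete, here is a minimal sketch of how a function can be marked for ZeroGPU. The name predict_speech_sketch and its body are placeholders, not the Space's real code, and the no-op fallback is an assumption added only so the snippet also runs where the spaces package is not installed.

```python
try:
    import spaces  # provided on Hugging Face Spaces

    gpu = spaces.GPU
except ImportError:
    # Assumption for this sketch: outside a Space, fall back to a
    # decorator that returns the function unchanged.
    def gpu(func):
        return func


@gpu
def predict_speech_sketch(text: str) -> str:
    # Placeholder body; the real function would run model inference here.
    return f"audio for: {text}"


print(predict_speech_sketch("hello"))
```

Any function that touches the GPU needs the decorator individually; decorating one function does not cover the others it does not call.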
Now you know the Space structure. Let's dive into the two problems I found!
Problem 1: probability tensor contains either inf, nan or element < 0
The first problem I noticed in the cloning process was the error returned when trying to generate speech with a cloned voice:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 256, in thread_wrapper
res = future.result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/user/app/xtts.py", line 185, in predict_speech
out = model.inference(
File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/TTS/tts/models/xtts.py", line 548, in inference
gpt_codes = self.gpt.generate(
File "/usr/local/lib/python3.10/site-packages/TTS/tts/layers/xtts/gpt.py", line 592, in generate
gen = self.gpt_inference.generate(
File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2215, in generate
result = self._sample(
File "/usr/local/lib/python3.10/site-packages/transformers/generation/utils.py", line 3249, in _sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/gradio/queueing.py", line 624, in process_events
response = await route_utils.call_process_api(
File "/usr/local/lib/python3.10/site-packages/gradio/route_utils.py", line 323, in call_process_api
output = await app.get_blocks().process_api(
File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 2015, in process_api
result = await self.call_function(
File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 1562, in call_function
prediction = await anyio.to_thread.run_sync( # type: ignore
File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2441, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 943, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.10/site-packages/gradio/utils.py", line 865, in wrapper
response = f(*args, **kwargs)
File "/home/user/app/app.py", line 218, in tts
generated_audio = xtts.predict_speech(ipts)
File "/usr/local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 214, in gradio_handler
raise res.value
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
This error only occurred when trying to use a cloned voice, and not a studio voice.
And it occurred at the time of the TTS, not at the time of cloning the voice. In other words, it occurred in the xtts.predict_speech function.
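The error is raised at sampling time, which can be far from where the bad values originate. The sketch below is a pure-Python illustration, not the PyTorch or transformers code: softmax and check_probabilities are hypothetical helpers that mimic, under simplifying assumptions, how a single NaN spreads to every probability and then trips a validity check like the one named in the error message.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def check_probabilities(probs):
    """Hypothetical check, modeled on the wording of the error message."""
    if any(math.isnan(p) or math.isinf(p) or p < 0 for p in probs):
        raise RuntimeError(
            "probability tensor contains either `inf`, `nan` or element < 0"
        )
    return probs


clean = softmax([1.0, 2.0, 3.0])
poisoned = softmax([1.0, float("nan"), 3.0])

check_probabilities(clean)  # valid probabilities pass silently
print(poisoned)  # the single NaN input has contaminated every entry

try:
    check_probabilities(poisoned)
except RuntimeError as error:
    print("sampling would fail:", error)
```

This is why an error in the sampling step is a hint to inspect the inputs of the model rather than the sampling code itself.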
Also, in my local tests, I had no problems.
If you look at the Space files, you'll see that there's a Dockerfile.
I use this Docker setup when I want to test locally.
If you want to try the Space locally, just run git clone and then docker compose up.
On top of that, the last frame of the stack trace references a file from the spaces lib.
All this led me to believe that the cause was something related to ZeroGPU, since it was one of the main differences between the Space and my local environment.
Since the message mentioned tensors and the stack included the predict_speech function, the first thing I decided to do was to print the voice embeddings. Specifically, I added print calls at two points of this function:
@spaces.GPU
def predict_speech(parsed_input: TTSInputs):
    print("device", model.device)
    speaker_embedding = torch.tensor(parsed_input.speaker_embedding).unsqueeze(0).unsqueeze(-1)
    gpt_cond_latent = torch.tensor(parsed_input.gpt_cond_latent).reshape((-1, 1024)).unsqueeze(0)
    print(speaker_embedding)
    print("latent:")
    print(gpt_cond_latent)
My hope was to confirm that at least some of the tensor values were NaN... And bingo:
Not only was the value of one of the tensors NaN, but ALL of them were.
If you look at the function, it uses two values that represent the speaker. Both are tensors, and they were all NaN.
Remember that, in the case of the cloned voice, these tensors were generated by the xtts.predict_speaker function.
So, I decided to go a little deeper into the source, and added the prints directly to the output of this function:
def predict_speaker(wav_file):
    """Compute conditioning inputs from reference audio file."""
    temp_audio_name = next(tempfile._get_candidate_names())
    with open(temp_audio_name, "wb") as temp, torch.inference_mode():
        print("device", model.device)
        temp.write(io.BytesIO(wav_file.read()).getbuffer())
        gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
            temp_audio_name
        )
    print(gpt_cond_latent)
    print(speaker_embedding)
    result = {
        "gpt_cond_latent": gpt_cond_latent.cpu().squeeze().half().tolist(),
        "speaker_embedding": speaker_embedding.cpu().squeeze().half().tolist(),
    }
    print(result)
    return result
And, again, I saw that the tensors were already NaN in the output of model.get_conditioning_latents.
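Since predict_speaker returns plain nested lists (via .tolist()), a check like the following could fail fast at the source instead of at sampling time. This is only a sketch: find_non_finite is a hypothetical helper, not part of the Space or of XTTS, and the sample dictionary is made-up data with the same keys as the result above.

```python
import math


def find_non_finite(value, path="result"):
    """Return the paths of all NaN or infinite floats in nested dicts/lists."""
    bad = []
    if isinstance(value, dict):
        for key, item in value.items():
            bad.extend(find_non_finite(item, f"{path}[{key!r}]"))
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            bad.extend(find_non_finite(item, f"{path}[{index}]"))
    elif isinstance(value, float) and not math.isfinite(value):
        bad.append(path)
    return bad


# Made-up data shaped like the dictionary returned by predict_speaker.
sample = {
    "speaker_embedding": [0.1, float("nan")],
    "gpt_cond_latent": [[0.0, 1.0]],
}
bad_paths = find_non_finite(sample)
if bad_paths:
    print("non-finite values at:", bad_paths)
```

Raising an exception when bad_paths is non-empty would surface the problem at clone time, where it actually happens.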
I went deeper into the XTTS source code to understand how this was done:
This is part of the forked coqui-tts code.
As the two calculated embeddings were NaN, I looked at speaker_embedding, which is calculated first.
What this function does, basically, is convert the sample rate of the audio and invoke a method of the hifigan_decoder object.
I didn't know about this, but I saw that there's this paper about a neural network called HiFi-GAN: https://arxiv.org/abs/2010.05646
But, from a quick read, I saw that it's a network for synthesizing speech... which, obviously, makes perfect sense for voice cloning!
Despite my limited knowledge at this level, I noticed that at this point the to method is invoked a lot, to move the tensors to another device. This made me wonder how this code could be working, considering that there is no GPU involved here, only CPU. Then I remembered a simple detail: the predict_speaker function was running on CPU, and the predict_speech function on GPU... I imagined there could be some incompatibility problem with this...
This became even stranger, when I added logs to see on which device the XTTS model was loaded. This is the snippet:
And here is the log that was generated:
And what caught my attention was the following:
- The device variable starts with the value "cuda". So far so good, since this is the intention.
- Next, right below, there is a check: if cuda is not available in torch, raise an error... But no error is raised... It means that, even though the code is running in a Space with ZeroGPU, and without the decorator, it detects that cuda is indeed available.
- Then the model is loaded and, as expected, on the CPU. The "before" message shows the value "cpu".
- However, the model is moved to CUDA, and curiously, this succeeds... Even though this code runs without the decorator... That is, I hadn't noticed this, but the model loads easily on the GPU in a ZeroGPU Space without the decorator...
This made me believe that, when a function that doesn't have the decorator runs, these moves to a device can somehow generate NaN in the tensors. I still haven't figured out exactly why, and I'm doing tests in this Space: https://huggingface.co/spaces/rrg92/zero-test to try to simulate the scenario. When I have updates, I'll post.
To solve it at this point, I just added the @spaces.GPU decorator to the predict_speaker function.
This generated the tensors correctly... However, when trying to clone, a new error appeared...
Problem 2: cannot pickle '_io.BufferedReader' object
After adding the decorator to the xtts.predict_speaker function, an error was generated when trying to clone:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/spaces/utils.py", line 43, in put
super().put(obj)
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 371, in put
obj = _ForkingPickler.dumps(obj)
File "/usr/local/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_io.BufferedReader' object
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/gradio/queueing.py", line 624, in process_events
response = await route_utils.call_process_api(
File "/usr/local/lib/python3.10/site-packages/gradio/route_utils.py", line 323, in call_process_api
output = await app.get_blocks().process_api(
File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 2015, in process_api
result = await self.call_function(
File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 1562, in call_function
prediction = await anyio.to_thread.run_sync( # type: ignore
File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2441, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 943, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.10/site-packages/gradio/utils.py", line 865, in wrapper
response = f(*args, **kwargs)
File "/home/user/app/app.py", line 127, in clone_speaker
embeddings = xtts.predict_speaker(open(upload_file,"rb"))
File "/usr/local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 202, in gradio_handler
worker.arg_queue.put(((args, kwargs), GradioPartialContext.get()))
File "/usr/local/lib/python3.10/site-packages/spaces/utils.py", line 51, in put
raise PicklingError(message)
_pickle.PicklingError: cannot pickle '_io.BufferedReader' object
This was the error generated when I tried to clone a voice.
Now it was a pickle error... I didn't know what it was, and after some research, I understood that it was related to object serialization, which is a process I know from other languages.
Basically, something in the call of my function couldn't be serialized.
And, as the only thing different was the decorator, I went to look at the decorator code again, in the part where the problem occurs:
I saw that the problematic part puts something in a queue... And looking at the code of this queue, which wasn't very complex, I noticed that it basically needs to serialize the objects put into it.
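The failure can be reproduced without ZeroGPU at all, using only the standard library. In the sketch below, can_pickle is a hypothetical helper and the temporary file merely stands in for the uploaded reference audio.

```python
import os
import pickle
import tempfile


def can_pickle(obj) -> bool:
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


# A throwaway file standing in for the uploaded reference audio.
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(b"not real audio")
    audio_path = tmp.name

with open(audio_path, "rb") as handle:
    handle_ok = can_pickle(handle)  # an open file object (_io.BufferedReader)

path_ok = can_pickle(audio_path)  # a plain string
os.remove(audio_path)

print("open file picklable:", handle_ok)
print("path string picklable:", path_ok)
```

An open file wraps an operating-system handle that has no meaning in another process, which is why pickle refuses it while a path string goes through without trouble.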
Since the error message mentioned _io.BufferedReader, and I saw that the arguments are serialized, I immediately turned to the parameter passed to this function: wav_file. This parameter is the file that the user provided in the interface. Specifically, an open binary file object. It is passed in this way by app.clone_speaker:
That is, we open the file in binary mode and pass it to the function... With this, xtts.predict_speaker receives a binary file object. I imagined that, instead of passing the file object, I could try passing the path, which would be a string. So I rewrote it as follows to maintain compatibility:
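The original snippet is not reproduced in this text. As a rough sketch of the idea only, a function that accepts either a path string or an open binary file could look like this. compute_embeddings_stub is a hypothetical stand-in for the model call, and predict_speaker_sketch is not the Space's actual implementation.

```python
import io
import os
import tempfile


def compute_embeddings_stub(audio_path):
    """Hypothetical stand-in for the model call; just reports the file size."""
    return {"num_bytes": os.path.getsize(audio_path)}


def predict_speaker_sketch(wav_file):
    """Accept a file path (picklable) or a binary file-like object (legacy)."""
    if isinstance(wav_file, (str, os.PathLike)):
        return compute_embeddings_stub(wav_file)

    # Legacy callers pass an open binary file: spool it to a temporary file.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp:
        temp.write(wav_file.read())
        temp_path = temp.name
    try:
        return compute_embeddings_stub(temp_path)
    finally:
        os.remove(temp_path)


# Usage: both call styles give the same result.
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as sample:
    sample.write(b"fake audio bytes")
    sample_path = sample.name

from_path = predict_speaker_sketch(sample_path)
from_file = predict_speaker_sketch(io.BytesIO(b"fake audio bytes"))
os.remove(sample_path)
print(from_path, from_file)
```

Only the path-based call is safe on a function decorated with @spaces.GPU, since a string can be pickled and an open file cannot. The file-like branch exists purely for callers that have not been updated.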
And voilà! The clone started to work! So, in summary, there were two problems:
- The xtts.predict_speaker function was not decorated with the Space decorator and, for some reason that I still don't know, instead of raising an error or falling back to the CPU, the model generated tensors with NaN.
  Resolution: Added the @spaces.GPU decorator to the xtts.predict_speaker function.
- Including the function in ZeroGPU caused an error due to the type of the parameter, because ZeroGPU pickles the arguments.
  Resolution: Pass the string with the file path and open it inside the xtts.predict_speaker function.
Final Thoughts
Curiously, this sparked a new question for me: how does Hugging Face implement ZeroGPU? I've always wondered whether it dynamically adds the video card, moves the machine, or uses a custom driver that intercepts the calls and manages to send only the request to a machine with ZeroGPU... Anyway, many questions...
I created this Space: https://huggingface.co/spaces/rrg92/zero-test
And in it I'm doing tests to help me answer all the questions that are still left.
Anyway, doing this whole process helped me learn a lot more about Python, PyTorch, Hugging Face, ZeroGPU and XTTS. It was already worth it!
When I have more answers, I'll update this post and/or post a new one!
Also, I realize that my debugging approach (adding print statements and pushing changes from my local repo) is not the best practice. I tried using dev space mode, which is a nice feature, but I encountered some difficulties and ended up choosing that archaic method instead. However, I did learn a few things from the experience and hope to use it more effectively next time.
Special thanks to @p3nGu1nZz for reviewing this article and providing invaluable guidance with numerous tips that have deepened my understanding of AI. They’re currently working on an exciting project called Tau. Be sure to follow them on GitHub to stay updated on their work!
Thank you for reading!