It seems that this implementation fails with a ZeroDivisionError when the generation contains unpronounceable sequences.
['She pauses, watching you make your way over to the chair and collapse into it with relief.']
Processing time: 1.938103199005127
Real-time factor: 0.2928350478159128
Text splitted to sentences.
Processing time: 0.0
Traceback (most recent call last):
File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\queueing.py", line 407, in call_prediction
output = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\route_utils.py", line 226, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\blocks.py", line 1550, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\blocks.py", line 1199, in call_function
prediction = await utils.async_iteration(iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\utils.py", line 519, in async_iteration
return await iterator.anext()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\utils.py", line 512, in anext
return await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\anyio\to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\anyio_backends_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\anyio_backends_asyncio.py", line 807, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\utils.py", line 495, in run_sync_iterator_async
return next(iterator)
^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\utils.py", line 649, in gen_wrapper
yield from f(*args, **kwargs)
File "D:\oobabooga\text-generation-webui\modules\chat.py", line 342, in generate_chat_reply_wrapper
for i, history in enumerate(generate_chat_reply(text, state, regenerate, _continue, loading_message=True)):
File "D:\oobabooga\text-generation-webui\modules\chat.py", line 310, in generate_chat_reply
for history in chatbot_wrapper(text, state, regenerate=regenerate, _continue=_continue, loading_message=loading_message):
File "D:\oobabooga\text-generation-webui\modules\chat.py", line 278, in chatbot_wrapper
output['visible'][-1][1] = apply_extensions('output', output['visible'][-1][1], state, is_chat=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\modules\extensions.py", line 224, in apply_extensions
return EXTENSION_MAP[typ](*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\modules\extensions.py", line 82, in _apply_string_extensions
text = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\extensions\XTTSv2\script.py", line 153, in output_modifier
return tts_narrator(string)
^^^^^^^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\extensions\XTTSv2\script.py", line 135, in tts_narrator
tts.tts_to_file(text=turn,
File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\TTS\api.py", line 403, in tts_to_file
wav = self.tts(text=text, speaker=speaker, language=language, speaker_wav=speaker_wav, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\TTS\api.py", line 341, in tts
wav = self.synthesizer.tts(
^^^^^^^^^^^^^^^^^^^^^
File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\TTS\utils\synthesizer.py", line 492, in tts
print(f" > Real-time factor: {process_time / audio_time}")
~~~~~~~~~~~~~^~~~~~~~~~~~
ZeroDivisionError: float division by zero
do you know what the text was?
It was a stop token '</s>' after the asterisk '*' that caused the problem. It does work normally when the stop token is not preceded by an asterisk.
*Mishka explains her understanding of the Chinese city based on your description.*</s>
> Text splitted to sentences.
['Mishka explains her understanding of the Chinese city based on your description.']
Processing time: 1.7906074523925781
> Real-time factor: 0.3192755719146748
Text splitted to sentences.
> Processing time: 0.0
Traceback (most recent call last):
...
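A hypothetical pre-processing step for this case: strip the stop token and stray markup before the text is split into sentences, so no segment reduces to an unpronounceable `'*'`. The function name and the assumed stop token are illustrative only; the actual model's stop token may differ.

```python
# Assumed stop token; depends on the model's tokenizer configuration.
STOP_TOKENS = ("</s>",)

def clean_for_tts(text: str) -> str:
    """Remove stop tokens and asterisk markup before sentence splitting."""
    for token in STOP_TOKENS:
        text = text.replace(token, "")
    # Drop asterisks used as narrative markup; keep the words inside them.
    text = text.replace("*", "")
    return text.strip()
```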
I made the structure more similar to silero_tts and made various other fixes. I think this looks pretty good now, and it's working reliably.
@kanttouchthis I ended up removing the narrator feature for simplicity and will accept your PR to text-generation-webui-extensions for people who want to try it.
The only remaining issue is that the TTS library apparently re-downloads the model every time instead of using the existing cache. I'll merge this PR and try to find a solution to that in a future one.
The model cache issue was fixed in TTS 0.20.6
I'm seeing some oddity with the asterisk issue mentioned above. It causes the TTS to generate 2-4 seconds of strange sounds, or sometimes to cut out some of the speech, before restarting a sentence or two later.
What you see in the web interface
*This is a narrative description.* "This is the character speaking."
What you see if you look at the command prompt/text generation
"*This is a narrative description.", '*', '"This is the character speaking."'
I've listened to quite a few generations now and looked at a lot of the command prompt/terminal output, and as best I can tell, it happens when that asterisk gets split/broken out. I'm not sure if it's specific to some models or a general issue.
I also suspect it's badly impacting generation time, as generations that suffer this issue seem to take a bit longer to process, even though the actual audio output isn't any longer.
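One way to avoid the stranded `'*'` segment (and which would also support the separate narrator-voice feature mentioned below) would be to split narration and dialogue into whole spans before any sentence tokenization. This is a sketch of an assumed approach, not the extension's actual code:

```python
import re

def split_roles(text: str):
    """Split text into (role, segment) pairs: *narration* vs "dialogue".
    A lone '*' can never survive, because asterisks are consumed as
    delimiters rather than passed through as segments."""
    pattern = r"\*([^*]+)\*|\"([^\"]+)\""
    segments = []
    for m in re.finditer(pattern, text):
        role = "narrator" if m.group(1) else "character"
        segments.append((role, (m.group(1) or m.group(2)).strip()))
    return segments
```

Applied to the example above, `split_roles('*This is a narrative description.* "This is the character speaking."')` yields two clean segments, one per voice, with no stray asterisk in between.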
I'm on the current build of the coqui_tts extension (at time of writing).
Coqui also supports using different voices for the narrator etc. Can this feature be added? Said feature already exists in the extension located here: https://github.com/kanttouchthis/text_generation_webui_xtts
Nice job! I've noticed XTTSv2 also supports streaming. Do you think its possible to use it conjunction with token streaming or have it generated immediately after one sentence is finished? Since the TTS model keeps being in VRAM, using it simultaneously with text generation should be possible.
I'd also love to see that, but I think there is more to it than just calling the TTS engine's streaming mode. I opened a feature request regarding this topic that also describes the difference between text-generation streaming and TTS streaming, which would need to be reconciled: #4706
Maybe the text-generation team can comment on it but I guess it makes sense to have a dedicated issue for this topic.
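The sentence-level variant could be pipelined with a simple producer/consumer setup: completed sentences are queued while the LLM keeps generating, and a worker thread synthesizes them in order. This is a sketch under assumed names (`synthesize` stands in for the real TTS call), not the extension's design:

```python
import queue
import threading

def tts_worker(sentence_queue, synthesize, results):
    """Consume finished sentences and synthesize them in arrival order."""
    while True:
        sentence = sentence_queue.get()
        if sentence is None:  # sentinel: text generation has finished
            break
        results.append(synthesize(sentence))

def pipeline(sentences, synthesize):
    """Feed sentences to the TTS worker as they complete."""
    q = queue.Queue()
    results = []
    worker = threading.Thread(target=tts_worker, args=(q, synthesize, results))
    worker.start()
    for s in sentences:  # stands in for the token-streaming loop
        q.put(s)
    q.put(None)
    worker.join()
    return results
```

Since a single worker consumes the queue, output order matches input order, and synthesis of sentence N overlaps with generation of sentence N+1.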
@oobabooga @kanttouchthis Please could you have a look at #4712. I have found a solution to speeding up speech generation for people in a low-VRAM situation. I've written some code (badly) that works, but, not actually being a coder, someone would need to integrate it properly into script.py and tidy up the code.
Thanks
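One plausible shape for such a low-VRAM approach (an assumption on my part; #4712 may do something different) is to keep the TTS model on the CPU while the LLM is generating and move it to the GPU only for synthesis. The model is assumed to expose a PyTorch-style `.to(device)` method:

```python
from contextlib import contextmanager

@contextmanager
def on_device(model, device="cuda", fallback="cpu"):
    """Temporarily move a model to `device`, restoring `fallback` afterward
    so VRAM is freed for text generation again."""
    model.to(device)
    try:
        yield model
    finally:
        model.to(fallback)
```

Usage would be `with on_device(tts_model): tts_model.tts_to_file(...)`, at the cost of a host-to-device transfer per synthesis call.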
@oobabooga I'm also curious about the source of the voice files.
Checklist:
Description
Adds XTTSv2 for multilingual TTS with voice cloning.
Installation needs further testing but seems to work on Windows. Dependencies may cause conflicts.
Edit: example