add XTTSv2 #4673

oobabooga merged 19 commits into oobabooga:dev from main
kanttouchthis · 1 year ago (edited)

Description

Adds XTTSv2 for multilingual TTS with voice cloning.
Installation needs further testing, but it seems to work on Windows. The dependencies may cause conflicts.
Edit: example

oobabooga Merge pull request #4579 from oobabooga/dev
454fcf39
oobabooga Merge pull request #4606 from oobabooga/dev
2337aebe
oobabooga Merge pull request #4608 from oobabooga/dev
8a2af87d
oobabooga Merge pull request #4627 from oobabooga/dev
0ee8d2b6
oobabooga Merge pull request #4628 from oobabooga/dev
f889302d
oobabooga Merge pull request #4632 from oobabooga/dev
3146124e
oobabooga Merge pull request #4660 from oobabooga/dev
d1bba48a
oobabooga Merge pull request #4662 from oobabooga/dev
22e7a22d
oobabooga Merge pull request #4664 from oobabooga/dev
f11092ac
kanttouchthis add XTTSv2
d51a9891
kanttouchthis fix installation
64a1c1d3
TeuMasaki · 1 year ago

It seems that this implementation fails with a ZeroDivisionError when the generation contains unpronounceable sequences.

```
['She pauses, watching you make your way over to the chair and collapse into it with relief.']
Processing time: 1.938103199005127
Real-time factor: 0.2928350478159128
Text splitted to sentences.

Processing time: 0.0
Traceback (most recent call last):
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\blocks.py", line 1550, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\blocks.py", line 1199, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\utils.py", line 519, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\utils.py", line 512, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\utils.py", line 495, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\utils.py", line 649, in gen_wrapper
    yield from f(*args, **kwargs)
  File "D:\oobabooga\text-generation-webui\modules\chat.py", line 342, in generate_chat_reply_wrapper
    for i, history in enumerate(generate_chat_reply(text, state, regenerate, _continue, loading_message=True)):
  File "D:\oobabooga\text-generation-webui\modules\chat.py", line 310, in generate_chat_reply
    for history in chatbot_wrapper(text, state, regenerate=regenerate, _continue=_continue, loading_message=loading_message):
  File "D:\oobabooga\text-generation-webui\modules\chat.py", line 278, in chatbot_wrapper
    output['visible'][-1][1] = apply_extensions('output', output['visible'][-1][1], state, is_chat=True)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\modules\extensions.py", line 224, in apply_extensions
    return EXTENSION_MAP[typ](*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\modules\extensions.py", line 82, in _apply_string_extensions
    text = func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\extensions\XTTSv2\script.py", line 153, in output_modifier
    return tts_narrator(string)
           ^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\extensions\XTTSv2\script.py", line 135, in tts_narrator
    tts.tts_to_file(text=turn,
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\TTS\api.py", line 403, in tts_to_file
    wav = self.tts(text=text, speaker=speaker, language=language, speaker_wav=speaker_wav, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\TTS\api.py", line 341, in tts
    wav = self.synthesizer.tts(
          ^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\TTS\utils\synthesizer.py", line 492, in tts
    print(f" > Real-time factor: {process_time / audio_time}")
                                  ~~~~~~~~~~~~~^~~~~~~~~~~~
ZeroDivisionError: float division by zero
```
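The crash comes from the `Real-time factor` log line in `synthesizer.py`, which divides processing time by the duration of the produced audio; when a chunk contains nothing pronounceable, `audio_time` is zero. A minimal guard could look like this (a sketch of mine, not the TTS library's code; `safe_rtf` and `rtf_line` are hypothetical helper names):

```python
from typing import Optional

# Sketch of a guard for the division in TTS/utils/synthesizer.py.
# safe_rtf / rtf_line are illustrative helpers, not part of the TTS library.

def safe_rtf(process_time: float, audio_time: float) -> Optional[float]:
    """Return the real-time factor, or None when no audio was produced."""
    if audio_time <= 0.0:
        return None  # nothing was synthesized; skip the stat instead of dividing
    return process_time / audio_time

def rtf_line(process_time: float, audio_time: float) -> str:
    """Format the log line the way synthesizer.py does, but safely."""
    rtf = safe_rtf(process_time, audio_time)
    return f" > Real-time factor: {rtf}" if rtf is not None else " > Real-time factor: n/a"
```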

kanttouchthis · 1 year ago

Do you know what the text was?

Dampfinchen · 1 year ago (edited)

Nice job! I've noticed XTTSv2 also supports streaming. Do you think it's possible to use it in conjunction with token streaming, or to have audio generated as soon as one sentence is finished? Since the TTS model stays in VRAM, using it simultaneously with text generation should be possible.

TeuMasaki · 1 year ago (edited)

> Do you know what the text was?

It was the stop token `</s>` after the asterisk `*` causing the problem. It works normally when the stop token is not preceded by an asterisk, though.

*Mishka explains her understanding of the Chinese city based on your description.*</s>

```
 > Text splitted to sentences.
['Mishka explains her understanding of the Chinese city based on your description.']
 > Processing time: 1.7906074523925781
 > Real-time factor: 0.3192755719146748
 > Text splitted to sentences.
 > Processing time: 0.0
Traceback (most recent call last):
...
```
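One way to avoid feeding such fragments to the synthesizer is to strip stop tokens and bare asterisks from each split segment and drop anything with no pronounceable text left. This is a sketch under my own assumptions (the helper names and the `STOP_TOKENS` list are illustrative, not the extension's actual code):

```python
import re

# Sketch: pre-filter split segments before handing them to tts.tts_to_file(),
# so a trailing '</s>' stop token or a bare '*' never reaches the synthesizer
# as an unpronounceable chunk. STOP_TOKENS is an assumed, model-specific list.
STOP_TOKENS = ("</s>",)

def clean_segment(segment: str) -> str:
    """Remove stop tokens and surrounding asterisks from one segment."""
    for tok in STOP_TOKENS:
        segment = segment.replace(tok, "")
    return segment.strip().strip("*").strip()

def speakable_segments(segments):
    """Keep only segments that still contain a word character after cleanup."""
    cleaned = (clean_segment(s) for s in segments)
    return [s for s in cleaned if re.search(r"\w", s)]
```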

oobabooga Move the folder XTTSv2 -> xttsv2
84478f97
oobabooga Sort imports
4a8f8344
oobabooga Move the folder xttsv2 -> coqui_tts
9b66e976
oobabooga Make the requirements just TTS==0.20.*
62d32b78
oobabooga Style changes
eab10499
oobabooga Make structure more similar to silero_tts + multiple fixes
ca270cfd
oobabooga Minor bug fix
334fabef
oobabooga Warn people about installing the requirements
4d096e49
oobabooga · 1 year ago

I made the structure more similar to silero_tts and made various fixes. I think this looks pretty good now, and it's working reliably.

@kanttouchthis I ended up removing the narrator feature for simplicity and will accept your PR to text-generation-webui-extensions for people who want to try it.


The only remaining issue is that the TTS library apparently re-downloads the model every time instead of using the existing cache. I'll merge this PR and try to find a solution to that in a future one.
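Until that is fixed, a workaround is to check whether the model files already exist on disk before letting the library trigger a download. This sketch rests on assumptions I have not verified against the TTS source: the default cache location and the `--`-separated model directory name may differ by platform and version.

```python
from pathlib import Path

# Assumed default Coqui TTS cache location on Linux; on Windows it is
# typically under %USERPROFILE%\AppData\Local\tts. Verify for your setup.
DEFAULT_CACHE = Path.home() / ".local" / "share" / "tts"

# Assumed directory name: model IDs like
# "tts_models/multilingual/multi-dataset/xtts_v2" appear to be stored
# with "/" replaced by "--".
XTTS_DIR = "tts_models--multilingual--multi-dataset--xtts_v2"

def model_cached(cache_dir: Path = DEFAULT_CACHE, model_dir: str = XTTS_DIR) -> bool:
    """True if the model directory exists and contains at least one file."""
    path = cache_dir / model_dir
    return path.is_dir() and any(path.iterdir())
```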

oobabooga merged 8dc9ec34 into dev 1 year ago
kanttouchthis · 1 year ago

The model cache issue was fixed in TTS 0.20.6
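For anyone pinning dependencies, that suggests raising the floor of the current `TTS==0.20.*` requirement to 0.20.6; the exact bounds below are my suggestion, not taken from the repo's requirements file:

```shell
# Stay on the 0.20 series, but no older than the release with the cache fix
pip install "TTS>=0.20.6,<0.21"
```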

erew123 · 1 year ago (edited)

I'm seeing some oddity related to the asterisk issue mentioned above. It causes the TTS to generate 2-4 seconds of strange sounds, or sometimes cut out part of the speech, before picking up again a sentence or two later.

What you see in the web interface:
*This is a narrative description.* "This is the character speaking."

What you see in the command prompt during text generation:
"*This is a narrative description.", '*', '"This is the character speaking."'

I've listened to quite a few generations now and looked at a lot of the command prompt/terminal output, and as best I can tell, it happens when that asterisk gets split out into its own segment. I'm not sure if it's specific to some models or a general issue.

I also suspect it's hurting generation time: generations that seem to suffer this issue take a bit longer to process, even though the actual audio output isn't any longer.

I'm on the current build of the coqui_tts extension (at time of writing).
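To check that suspicion, one could time each synthesis call and log the cost per segment; if the stray `*` segments correlate with the slow generations, that would confirm it. A generic sketch (the wrapper name is mine, not the extension's code):

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) so per-segment cost can be logged."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical usage inside the extension:
#   wav, secs = timed_call(tts.tts_to_file, text=segment, file_path=out_path)
#   print(f"segment {segment!r} took {secs:.2f}s")
```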

ElhamAhmedian · 1 year ago

Which loader should be used in the extension?

[screenshot]

Thanks

allenhs · 1 year ago

Coqui also supports using different voices for the narrator and the character. Could this feature be added? It already exists in the extension located here: https://github.com/kanttouchthis/text_generation_webui_xtts
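A narrator/character split like the one in that standalone extension could work by routing text inside `*asterisks*` to a narrator voice and everything else to the character voice. This is my own illustrative sketch, not code from either extension:

```python
import re

# Sketch: split a chat message into (speaker, segment) turns. Text wrapped
# in *asterisks* is treated as narration; everything else as character speech.
# The function name and speaker labels are hypothetical.

def split_turns(text):
    """Yield ('narrator', ...) for *...* spans and ('character', ...) otherwise."""
    for part in re.split(r'(\*[^*]+\*)', text):
        part = part.strip()
        if not part:
            continue
        if part.startswith('*') and part.endswith('*'):
            yield 'narrator', part.strip('*').strip()
        else:
            yield 'character', part
```

Each turn could then be synthesized with a different `speaker_wav`, one per voice.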

aios-ai · 1 year ago (edited)

> Nice job! I've noticed XTTSv2 also supports streaming. Do you think it's possible to use it in conjunction with token streaming, or to have audio generated as soon as one sentence is finished? Since the TTS model stays in VRAM, using it simultaneously with text generation should be possible.

I'd also love to see that, but I think there is more to it than just calling the TTS engine's streaming mode. I opened a feature request on this topic that also describes the difference between text-generation streaming and TTS streaming, which need to be made compatible: #4706

Maybe the text-generation team can comment on it, but I think it makes sense to have a dedicated issue for this topic.
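A middle ground short of true audio streaming is sentence-level interleaving: buffer the streamed tokens and hand each completed sentence to a synthesis callback as soon as it ends. In this sketch, `speak()` is a hypothetical stand-in for a real synthesis call; XTTSv2's own streaming API (discussed in #4706) would replace it in a real integration.

```python
import re

# A sentence ends at '.', '!' or '?' followed by whitespace (simplification).
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')

def stream_to_sentences(token_iter, speak):
    """Consume a token iterator, invoking speak() once per finished sentence."""
    buffer = ""
    for token in token_iter:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:   # everything but the tail is complete
            if sentence.strip():
                speak(sentence.strip())
        buffer = parts[-1]            # the tail may still be growing
    if buffer.strip():                # flush whatever remains at end of stream
        speak(buffer.strip())
```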

erew123 · 1 year ago

@oobabooga @kanttouchthis Could you please have a look at #4712? I have found a solution for speeding up speech generation for people in a low-VRAM situation. I've written some code (badly) that works, but not actually being a coder, someone would need to integrate it properly into script.py and tidy up the code.

Thanks
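The low-VRAM idea in #4712 amounts to keeping the TTS model in system RAM between requests and moving it onto the GPU only for the duration of a synthesis call. A generic sketch of that pattern (`on_device` is my own name, not code from the PR; `model` is anything with a `.to(device)` method, such as a `torch.nn.Module`):

```python
from contextlib import contextmanager

@contextmanager
def on_device(model, device="cuda", fallback="cpu"):
    """Temporarily move a model to `device`, restoring `fallback` afterwards."""
    model.to(device)        # load weights into VRAM just before synthesis
    try:
        yield model
    finally:
        model.to(fallback)  # release VRAM back to the text-generation model
```

Hypothetical usage: `with on_device(tts_model): tts_model.tts_to_file(...)`. The trade-off is the transfer cost on every call versus the VRAM freed between calls.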

morozig · 1 year ago

Hi guys! Can you please tell me where those voices came from? Are they Creative Commons licensed in any way? I'm wondering if I can use them in a video game.
[screenshot]

101100 · 1 year ago

@oobabooga I'm also curious about the source of the voice files.
