TTS: Quickly Fine-Tuning VITS with VITS-fast-fine-tuning

Published: 2025-05-27

1. Project Overview

VITS is a speech-synthesis method and a fully end-to-end TTS model: it converts text directly into speech waveforms within a single model, with no separately trained vocoder stage or hand-engineered intermediate features.

At a high level, VITS works as follows: the input text is first normalized and converted into a phoneme sequence; a text encoder turns that sequence into a latent representation; a stochastic duration predictor decides how many frames each phoneme should occupy; and a HiFi-GAN-style decoder finally generates the waveform from the expanded latent sequence.
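To make that flow concrete, here is a deliberately toy PyTorch sketch. Every module name and size below is a hypothetical simplification (the real VITS has a transformer text encoder, a normalizing flow, and an adversarially trained decoder); it only illustrates the text → latent → expand-by-duration → waveform shape of the computation:

import torch
import torch.nn as nn

class TinyVITS(nn.Module):
    # Hypothetical toy model, NOT the real VITS architecture.
    def __init__(self, n_phonemes=78, hidden=192):
        super().__init__()
        self.text_encoder = nn.Embedding(n_phonemes, hidden)  # stands in for the transformer text encoder (enc_p)
        self.duration = nn.Linear(hidden, 1)                  # stands in for the stochastic duration predictor
        self.decoder = nn.Sequential(                         # stands in for the HiFi-GAN-style waveform decoder
            nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    @torch.no_grad()
    def forward(self, phoneme_ids):
        h = self.text_encoder(phoneme_ids)                    # [T_text, hidden] latent text representation
        frames = self.duration(h).exp().clamp(1, 10).long().squeeze(-1)
        h = h.repeat_interleave(frames, dim=0)                # expand each phoneme to its predicted length
        return self.decoder(h).flatten()                      # pseudo waveform samples

wav = TinyVITS()(torch.tensor([5, 12, 33]))
print(wav.shape)  # length depends on the predicted durations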

Its strength is that it can produce speech whose quality rivals real human voices; the downsides are that training from scratch requires a large amount of speech data, and the training procedure is fairly involved.

VITS-fast-fine-tuning is therefore a one-stop, multi-speaker fine-tuning tool built on top of VITS. By fine-tuning a pretrained VITS model, it lets you adapt the model in under an hour; the resulting model can then synthesize speech and clone voices in the target timbre.

【Project】https://github.com/Plachtaa/VITS-fast-fine-tuning

【Data format】https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA_EN.MD

2. Local Deployment

Make sure you have Python 3.8, CMake, a C/C++ compiler, and ffmpeg installed;
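Optionally, a quick Python sanity check for these prerequisites (a sketch; adjust to taste):

import shutil
import sys

assert sys.version_info[:2] == (3, 8), "the repo targets Python 3.8"
for tool in ("cmake", "ffmpeg"):
    assert shutil.which(tool) is not None, f"{tool} not found on PATH"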

pip install -r requirements.txt;

# install the libraries needed for processing video data
pip install imageio==2.4.1
pip install moviepy

Build monotonic align (necessary for training)

cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace
cd ..
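This compiles the Cython implementation of Monotonic Alignment Search (MAS), which training uses to align each text token to a run of spectrogram frames. As a rough illustration of what the compiled extension computes, here is a naive NumPy sketch (not the repo's optimized code; assumes at least as many frames as tokens):

import numpy as np

def monotonic_alignment_search(log_p):
    # log_p[i, j]: log-likelihood of text token i explaining mel frame j
    T_text, T_mel = log_p.shape
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):
            stay = Q[i, j - 1]
            move = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = log_p[i, j] + max(stay, move)
    # backtrack the best monotonic path
    path = np.zeros_like(log_p, dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        path[i, j] = 1
        if i > 0 and j > 0 and Q[i - 1, j - 1] > Q[i, j - 1]:
            i -= 1
    return path  # path[i, j] == 1 where frame j is assigned to token i

print(monotonic_alignment_search(np.log(np.random.rand(3, 6) + 1e-9)))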

Download the auxiliary training data:

mkdir pretrained_models
# download data for fine-tuning
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/sampled_audio4ft_v2.zip
unzip sampled_audio4ft_v2.zip
# create necessary directories
mkdir video_data
mkdir raw_audio
mkdir denoised_audio
mkdir custom_character_voice
mkdir segmented_character_voice
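Judging by the scripts run in section 3, data flows through these directories roughly as video_data → raw_audio → denoised_audio → segmented_character_voice, while custom_character_voice holds your own short recordings.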

Download one of the pretrained models; the available options are:

CJE: Trilingual (Chinese, Japanese, English)
wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/D_trilingual.pth -O ./pretrained_models/D_0.pth
wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/G_trilingual.pth -O ./pretrained_models/G_0.pth
wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/configs/uma_trilingual.json -O ./configs/finetune_speaker.json

CJ: Bilingual (Chinese, Japanese)
wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/D_0-p.pth -O ./pretrained_models/D_0.pth
wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/G_0-p.pth -O ./pretrained_models/G_0.pth
wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/config.json -O ./configs/finetune_speaker.json

C: Chinese only
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/D_0.pth -O ./pretrained_models/D_0.pth
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/G_0.pth -O ./pretrained_models/G_0.pth
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/config.json -O ./configs/finetune_speaker.json
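Since mixing files from different options causes a shape-mismatch error later (see section 3), a quick consistency check can save time. A sketch, assuming the usual VITS checkpoint layout ({'model': state_dict, ...}) and a "symbols" list in the config:

import json
import torch

with open("configs/finetune_speaker.json") as f:
    n_symbols = len(json.load(f)["symbols"])
ckpt = torch.load("pretrained_models/G_0.pth", map_location="cpu")
emb_rows = ckpt["model"]["enc_p.emb.weight"].shape[0]
assert n_symbols == emb_rows, f"config has {n_symbols} symbols but checkpoint embeds {emb_rows}; re-download a matching pair"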

Place your custom data under custom_character_voice; each subfolder is one speaker, and the folder name is used as the speaker name:

custom_character_voice
    - XiJun
        - XiJun_1.wav
        - XiJun_2.wav
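A tiny sketch to confirm the layout before transcription (a hypothetical helper, not part of the repo):

from pathlib import Path

for spk in sorted(Path("custom_character_voice").iterdir()):
    if spk.is_dir():
        print(spk.name, len(list(spk.glob("*.wav"))), "wav files")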

3. Local Training

【Speech recognition】Transcription uses whisper-large; run only the scripts that match the data you actually have:

python scripts/video2audio.py     # extract audio tracks from videos in ./video_data
python scripts/denoise_audio.py   # denoise ./raw_audio into ./denoised_audio
python scripts/long_audio_transcribe.py --languages "{PRETRAINED_MODEL}" --whisper_size large
python scripts/short_audio_transcribe.py --languages "{PRETRAINED_MODEL}" --whisper_size large
# run only if you use the auxiliary training data; remember to adjust the directory paths
python scripts/resample.py

【Error】Given groups=1, weight of size [1280, 128, 3], expected input[1, 80, 3000]

【Fix】short_audio_transcribe.py, line 24:

mel = whisper.log_mel_spectrogram(audio).to(model.device)
👇
mel = whisper.log_mel_spectrogram(audio, n_mels=128).to(model.device)
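Background: recent whisper releases resolve the "large" checkpoint to large-v3, whose audio encoder expects 128 mel bins instead of the 80 used by earlier models, hence the channel mismatch. A variant that should work for both old and new checkpoints (assuming model is the loaded whisper model, as in the script):

mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)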

【Data preprocessing】python preprocess_v2.py --add_auxiliary_data True --languages "{PRETRAINED_MODEL}" (this step also writes ./configs/modified_finetune_speaker.json, which training reads by default)

【Training】python finetune_speaker_v2.py -m ./OUTPUT_MODEL --max_epochs "{Maximum_epochs}" --drop_speaker_embed True

  • Training for at least 100 epochs is recommended
  • Silencing some noisy warnings and logs keeps the console readable:

import logging
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)
logging.getLogger('numba').setLevel(logging.WARNING)
warnings.filterwarnings(
    "ignore",
    message="stft with return_complex=False is deprecated"
)

【Error】Could not find module 'libtorio_ffmpeg6.pyd' (or one of its dependencies).

【Fix】add at the very top of finetune_speaker_v2.py (Windows):

from torchaudio._extension.utils import _init_dll_path
_init_dll_path()
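Note that _init_dll_path() is a private torchaudio helper that registers the DLL search paths the torio ffmpeg extension needs on Windows; being internal, it may move or disappear in other torchaudio versions, so treat this as a version-specific workaround.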

【Error】RuntimeError: use_libuv was requested but PyTorch was build without libuv support

【Fix】add at the start of main() in finetune_speaker_v2.py, before the distributed process group is initialized:

os.environ['USE_LIBUV'] = '0'

【Error】size mismatch for enc_p.emb.weight: copying a param with shape torch.Size([50, 192]) from checkpoint, the shape in current model is torch.Size([78, 192]).

【Fix 1】The downloaded pretrained model and config file may be mismatched, e.g. because repeated downloads of different options overwrote each other.

【Fix 2】change the default config path in utils.py (point it at your own finetune_speaker.json):

parser.add_argument('-c', '--config', type=str, default="./configs/modified_finetune_speaker.json", help='JSON file for configuration')
👇
parser.add_argument('-c', '--config', type=str, default="D:\\PyCharmWorkSpace\\TTS\\VITS-fast-fine-tuning\\configs\\finetune_speaker.json", help='JSON file for configuration')

【Error】mel() takes 0 positional arguments but 5 were given

【Fix】newer librosa releases made librosa.filters.mel keyword-only; downgrade: pip install librosa==0.8.0

4. Inference Results

VITS (from scratch): a week of training on 4 V100 GPUs, and the model still could not speak intelligibly.

VITS-fast-fine-tuning: 20 minutes on a single RTX 4070 (200 epochs), with fairly good results.

【Note】use the fine-tuned config.json; this is configured mainly in VC_inference.py.

python VC_inference.py
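The upstream README passes the fine-tuned generator explicitly, along these lines (flags taken from the project docs; verify against the script's argparse):

python VC_inference.py --model_dir ./G_latest.pth --share True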

【Error】__init__() got an unexpected keyword argument 'source'

【Fix】edit VC_inference.py:

record_audio = gr.Audio(label="record your voice", source="microphone")
upload_audio = gr.Audio(label="or upload audio here", source="upload")
👇
record_audio = gr.Audio(label="record your voice")
upload_audio = gr.Audio(label="or upload audio here")
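If you would rather keep the inputs restricted than drop the argument, Gradio 4.x renamed it: gr.Audio now takes a sources list, so an alternative fix is:

record_audio = gr.Audio(label="record your voice", sources=["microphone"])
upload_audio = gr.Audio(label="or upload audio here", sources=["upload"])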