How to set up the GPT-SoVITS TTS model

[GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)

Features:

  • Zero-shot TTS: Input a 5-second vocal sample and experience instant text-to-speech conversion.
  • Few-shot TTS: Fine-tune the model with just 1 minute of training data for improved voice similarity and realism.
  • Cross-lingual Support: Inference in languages different from the training dataset, currently supporting English, Japanese, Korean, Cantonese and Chinese.
  • WebUI Tools: Integrated tools include voice accompaniment separation, automatic training set segmentation, Chinese ASR, and text labeling, assisting beginners in creating training datasets and GPT/SoVITS models.

Install locally

  • Clone the repo: git clone https://github.com/RVC-Boss/GPT-SoVITS.git
  • Set up the virtual environment
conda create -n GPTSoVits python=3.9
conda activate GPTSoVits
bash install.sh
# or install the dependencies from the China PyPI mirror
# pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
  • Download the pretrained models to the GPT_SoVITS/pretrained_models subfolder
# run the following inside the GPTSoVits virtual environment
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
# the following command downloads the HuggingFace repo into the local directory
# HuggingFace repo: https://huggingface.co/lj1995/GPT-SoVITS/tree/main
huggingface-cli download --resume-download lj1995/GPT-SoVITS --local-dir /GPT-SoVITS/GPT_SoVITS/pretrained_models
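Before moving on, it is worth confirming that the weights actually landed in the target folder. A minimal sanity-check sketch (`check_pretrained` is a made-up helper name; run it from the repo root):

```shell
#!/usr/bin/env bash
# Sanity check (a sketch): count the *.pth / *.ckpt weight files under
# pretrained_models before starting the API. check_pretrained is a
# hypothetical helper, not part of GPT-SoVITS.
check_pretrained() {
  dir="$1"
  n=$(find "$dir" -name '*.pth' -o -name '*.ckpt' 2>/dev/null | wc -l | tr -d ' ')
  if [ "$n" -gt 0 ]; then
    echo "found $n weight files in $dir"
  else
    echo "no weights found in $dir -- re-run the download"
  fi
}
check_pretrained "GPT_SoVITS/pretrained_models"
```

If the download was interrupted, re-running the `huggingface-cli download --resume-download` command above picks up where it left off.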
  • Upload a human voice sample model
# for example: a Chinese voice sample model
# https://huggingface.co/baicai1145/GPT-SoVITS-STAR/tree/main
# download one of the archives and upload it to the `GPT-SoVITS/char_model/example001` subfolder
# uncompress the archive to get the `*.pth`, `*.ckpt`, and reference `*.wav` files
huggingface-cli download --resume-download baicai1145/GPT-SoVITS-STAR 三月七.zip --local-dir /GPT-SoVITS/char_model/
  • Launch the GPT-SoVITS API
# this command starts the GPT-SoVITS API service on port 9880
# and loads the model weights
python api_v2.py
  • Load the human voice role weights via the API
# use the previously uploaded human voice model
curl -X GET "https://xxxx.ngrok-free.app/set_gpt_weights?weights_path=char_model/yq/yq-e10.ckpt"
  • Test voice generation
# quote the URL so the shell does not interpret the `&` separators,
# and URL-encode the spaces in the text
curl -X GET "https://xxxx.ngrok-free.app/tts?text=Hello%20There,%20nice%20to%20meet%20you&text_lang=zh&ref_audio_path=char_model/yq/ref/yq01.wav&prompt_lang=zh&top_k=5&top_p=1&temperature=1&text_split_method=cut1&batch_size=1&batch_threshold=0.75&split_bucket=true&speed_factor=1&fragment_interval=0.3&seed=-1&media_type=wav&streaming_mode=false&repetition_penalty=1.35&parallel_infer=true" -o output.wav
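Hand-building that query string is error-prone; `curl -G` with `--data-urlencode` sends the same GET request and encodes the text automatically. A minimal sketch (`tts_request` is a made-up helper, `BASE_URL` is a placeholder, and the trimmed-down parameter set is illustrative — the remaining parameters should fall back to the API defaults):

```shell
#!/usr/bin/env bash
# Sketch: the /tts request built with curl -G + --data-urlencode, which
# URL-encodes the text instead of pasting it raw into the query string.
# tts_request is a hypothetical helper; BASE_URL is a placeholder address.
BASE_URL="${BASE_URL:-http://127.0.0.1:9880}"
tts_request() {
  curl -G "$BASE_URL/tts" \
    --data-urlencode "text=$1" \
    --data-urlencode "text_lang=zh" \
    --data-urlencode "ref_audio_path=char_model/yq/ref/yq01.wav" \
    --data-urlencode "prompt_lang=zh" \
    --data-urlencode "media_type=wav" \
    -o out.wav
}
# tts_request "Hello There, nice to meet you"
```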

Use GPT-SoVITS to train your own voice model

  • Start the WebUI
# this launches the WebUI locally on port 9874
python webui.py
  • Launch Ngrok as a reverse proxy for external access
ngrok http --url=xxxxx.ngrok-free.app 9874
  • Access the WebUI and start training
  • Upload the source voice in *.wav format to a server folder and fill its path into the Audio slicer input, for example: GPT-SoVITS/data/common/six-movies.wav
  • Work step by step through the 0-Fetch dataset panel in the WebUI; mainly change the input path of the *.wav files and the output folder path
  • Downgrade ctranslate2==4.5.0 if you use faster-whisper
# reinstall ctranslate2 if the following error appears:
# Unable to load any of {libcudnn_ops.so.9.1.0, libcudnn_ops.so.9.1, libcudnn_ops.so.9, libcudnn_ops.so}
conda activate GPTSoVits
pip uninstall ctranslate2
pip install ctranslate2==4.4.0
  • Check and label in the WebUI: bind Ngrok to port 9871, verify that the text matches the audio, then submit the text or click the next index to keep checking
ngrok http --url=xxxxx.ngrok-free.app 9871
  • Go to the 1-GPT-SOVITS-TTS panel
    • Click the 1Aa-Text buttons in order and wait for each to complete
      • Finally click One-click formatting
    • Go to the 1B-Fine-tuned training panel
      • Start the 1Ba-SoVITS and 1Bb-GPT training
      • Wait for the process to complete
    • Go to the 1C-inference panel
      • Click Refreshing model paths and select the previously trained common-tts models
      • Click Open TTS Inference WebUI
      • Re-bind Ngrok to port 9872
  • Test and run inference with the models
    • Upload the source voice in wav format (it can be downloaded from slicer_opt)
    • Type the test text
    • You may need to download the NLTK data if the text contains multiple languages
    # download the NLTK data
    # if GitHub is slow, you can use the mirror: git clone http://gitclone.com/github.com/nltk/nltk_data.git
    git clone https://github.com/nltk/nltk_data
    # delete the existing packages
    rm -rf /root/nltk_data/corpora
    rm -rf /root/nltk_data/taggers/
    # override the packages with the ones from the clone
    mv -f ./nltk_data/packages/* /root/nltk_data/
    
    • Uncompress the *.zip files in /root/nltk_data with a shell command, or uncompress only /root/nltk_data/taggers/averaged_perceptron_tagger_eng.zip if you mix Chinese and English only
    #!/usr/bin/env bash
    for zipfile in $(find /root/nltk_data -name "*.zip"); do
        unzip -o "$zipfile" -d "$(dirname "$zipfile")"
    done
    

Use Ngrok as a reverse proxy for external access

  • [Ngrok](https://dashboard.ngrok.com/)
  • Install the Ngrok command-line interface locally, copying the install command from the Ngrok dashboard
  • Launch the Ngrok client: ngrok http http://127.0.0.1:9880