
I'm trying to decide which one to use for my project, but I can't find any specific differences or comparisons of the models anywhere.

Hiperfly

1 Answer


From what I've seen, these two models have similar architectures: both take the architecture of SoftVC and combine it with the design of VITS. RVC is a successor to SoVITS and adds some improvements.

First, RVC uses ContentVec as the content encoder rather than HuBERT. ContentVec is an improved version of HuBERT that is trained to discard speaker information and focus only on content.

Second, RVC uses top-1 retrieval to reduce tone leakage. It works much like the codebook in VQ-VAE: each unseen input feature is mapped to the closest known feature from the training dataset.
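To make the retrieval idea concrete, here is a minimal sketch of top-1 nearest-neighbour lookup in NumPy. It is only an illustration of the idea, not code from either repo: the function name `top1_retrieve`, the feature dimension of 256, and the random "training" features are all assumptions, and the real projects may use an approximate nearest-neighbour index and blend the retrieved features with the originals instead of replacing them outright.

```python
import numpy as np

def top1_retrieve(query, index_feats):
    """Map each query frame to its nearest stored frame (L2 distance).

    Hypothetical helper for illustration; not the API of RVC or SoVITS.
    query:       (T, D) content features of the input audio
    index_feats: (N, D) content features collected from the training set
    """
    # Squared Euclidean distance between every query frame and every
    # stored frame, via ||q||^2 - 2 q.x + ||x||^2. Result shape: (T, N).
    d2 = (
        (query ** 2).sum(axis=1, keepdims=True)
        - 2.0 * query @ index_feats.T
        + (index_feats ** 2).sum(axis=1)
    )
    nearest = d2.argmin(axis=1)          # top-1 neighbour per frame
    return index_feats[nearest], nearest

# Toy data: 1000 stored frames with an assumed feature size of 256.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1000, 256))

# "Unseen" input: slightly perturbed copies of three stored frames.
query = train_feats[[3, 10, 42]] + 0.01 * rng.normal(size=(3, 256))

retrieved, idx = top1_retrieve(query, train_feats)
print(idx)
```

Because every output frame is taken from the training set, whatever speaker colouring was still present in the input features is swapped for features the model has already seen, which is the same role a codebook lookup plays in VQ-VAE.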

However, according to the 4.1 update in the SoVITS repo, they replaced HuBERT with ContentVec and also added feature retrieval, so the two should perform about the same now.

7/7 Update:

I just talked to their developers and confirmed that the architectures are nearly the same, except that in SoVITS you can select which content encoder to use (HuBERT or ContentVec). I drew a diagram to illustrate the RVC model: [image: RVC Arch]

Cat ALog