Demo page for "Neural network-based speech waveform generative models"

Last update: 11 June 2022 (samples synthesized by JETS added)
More synthesized speech samples and models will be added as they become available.

Review article

T. Okamoto, "Neural network-based speech waveform generative models," J. Acoust. Soc. Jpn., vol. 78, no. 6, pp. 328–337, June 2022. (in Japanese)

Ground truth

English
CMU ARCTIC slt (24 kHz) CMU ARCTIC bdl (24 kHz) LJSpeech 001-0001 (22.05 kHz) LJSpeech 050-0029 (22.05 kHz)
HiFi TTS 92_clean (22.05 kHz) HiFi TTS 92_clean (44.1 kHz) HiFi TTS 9017_clean (22.05 kHz) HiFi TTS 9017_clean (44.1 kHz)
Japanese
JSUT (24 kHz) JSUT (44.1 kHz) jvs004 (24 kHz) jvs001 (24 kHz)

Unconditional WaveNet (9 bit, noise shaping applied)

slt bdl

WaveNet vocoder (9 bit, conditioned on mel-spectrograms, noise shaping applied)

slt bdl jsut

Multi-speaker WaveNet vocoder (9 bit, trained using jvs005 to jvs100, noise shaping applied)

jvs004 (unseen speaker) jvs001 (unseen speaker) slt (crosslingual condition) bdl (crosslingual condition)

LPCNet

slt bdl JSUT

WaveGlow

slt (limited training data) bdl (limited training data) LJSpeech JSUT

Parallel WaveGAN

slt bdl LJSpeech JSUT

HiFi-GAN

LJSpeech
HiFi TTS 92_clean (22.05 kHz) HiFi TTS 92_clean (44.1 kHz) HiFi TTS 9017_clean (22.05 kHz) HiFi TTS 9017_clean (44.1 kHz)
JSUT (22.05 kHz) JSUT (44.1 kHz)

DiffWave (10 sub-modeling, Fibonacci-based 25 iterations)

slt bdl LJSpeech

Multi-speaker DiffWave (trained using the VCTK corpus, 10 sub-modeling, Fibonacci-based 25 iterations)

slt (unseen speaker) bdl (unseen speaker) LJSpeech (unseen speaker)

Entire end-to-end neural text-to-speech: VITS

slt (trainable with little data!!) bdl (trainable with little data!!) LJSpeech
HiFi TTS 92_clean (22.05 kHz) HiFi TTS 92_clean (44.1 kHz) HiFi TTS 9017_clean (22.05 kHz) HiFi TTS 9017_clean (44.1 kHz)
JSUT (22.05 kHz) JSUT (44.1 kHz)

Pipeline neural text-to-speech: Conformer-FastSpeech 2 + HiFi-GAN (Joint fine-tuning applied)

LJSpeech
HiFi TTS 92_clean (22.05 kHz) HiFi TTS 92_clean (44.1 kHz) HiFi TTS 9017_clean (22.05 kHz) HiFi TTS 9017_clean (44.1 kHz)
JSUT (22.05 kHz) JSUT (44.1 kHz)

Entire end-to-end neural text-to-speech: JETS (FastSpeech 2 + HiFi-GAN)

Not cited in the review
slt (trainable with little data!!) bdl (trainable with little data!!) LJSpeech
JSUT (24 kHz) JSUT (48 kHz, trainable with full-band)

Update history

11 June 2022: Samples synthesized by JETS added
27 May 2022: Demo speech samples uploaded

Acknowledgement

The synthesized samples for LPCNet (all) and Parallel WaveGAN (JSUT only) were produced by Keisuke Matsubara of Kobe University (graduated in March 2022) during an internship at NICT.