Neural network-based speech waveform generative models

Demo page for "Neural network-based speech waveform generative models"

Last update: 11 June 2022 (samples synthesized by JETS added)
More synthesized speech samples and models will be added as they become available.

Review article

T. Okamoto, "Neural network-based speech waveform generative models," J. Acoust. Soc. Jpn., vol. 78, no. 6, pp. 328–337, June 2022. (in Japanese)

Ground truth

English

CMU ARCTIC slt (24 kHz)	CMU ARCTIC bdl (24 kHz)	LJSpeech 001-0001 (22.05 kHz)	LJSpeech 050-0029 (22.05 kHz)

HiFi TTS 92_clean (22.05 kHz)	HiFi TTS 92_clean (44.1 kHz)	HiFi TTS 9017_clean (22.05 kHz)	HiFi TTS 9017_clean (44.1 kHz)

Japanese

JSUT (24 kHz)	JSUT (44.1 kHz)	jvs004 (24 kHz)	jvs001 (24 kHz)

Unconditional WaveNet (9 bit, noise shaping applied)

slt	bdl

WaveNet vocoder (9 bit, conditioned on mel-spectrograms, noise shaping applied)

slt	bdl	JSUT

Multi-speaker WaveNet vocoder (9 bit, trained using jvs005 to jvs100, noise shaping applied)

jvs004 (unseen speaker)	jvs001 (unseen speaker)	slt (crosslingual condition)	bdl (crosslingual condition)

LPCNet

slt	bdl	JSUT

WaveGlow

slt (limited training data)	bdl (limited training data)	LJSpeech	JSUT

Parallel WaveGAN

slt	bdl	LJSpeech	JSUT

HiFi-GAN

LJSpeech

HiFi TTS 92_clean (22.05 kHz)	HiFi TTS 92_clean (44.1 kHz)	HiFi TTS 9017_clean (22.05 kHz)	HiFi TTS 9017_clean (44.1 kHz)

JSUT (22.05 kHz)	JSUT (44.1 kHz)

DiffWave (10 sub-modeling, Fibonacci-based 25 iterations)

slt	bdl	LJSpeech

Multi-speaker DiffWave (trained using the VCTK corpus, 10 sub-modeling, Fibonacci-based 25 iterations)

slt (unseen speaker)	bdl (unseen speaker)	LJSpeech (unseen speaker)

Entire end-to-end neural text-to-speech: VITS

slt (trainable with little data!!)	bdl (trainable with little data!!)	LJSpeech

HiFi TTS 92_clean (22.05 kHz)	HiFi TTS 92_clean (44.1 kHz)	HiFi TTS 9017_clean (22.05 kHz)	HiFi TTS 9017_clean (44.1 kHz)

JSUT (22.05 kHz)	JSUT (44.1 kHz)

Pipeline neural text-to-speech: Conformer-FastSpeech 2 + HiFi-GAN (Joint fine-tuning applied)

LJSpeech

HiFi TTS 92_clean (22.05 kHz)	HiFi TTS 92_clean (44.1 kHz)	HiFi TTS 9017_clean (22.05 kHz)	HiFi TTS 9017_clean (44.1 kHz)

JSUT (22.05 kHz)	JSUT (44.1 kHz)

Entire end-to-end neural text-to-speech: JETS (FastSpeech 2 + HiFi-GAN)

Not cited in the review

slt (trainable with little data!!)	bdl (trainable with little data!!)	LJSpeech

JSUT (24 kHz)	JSUT (48 kHz, trainable on full-band speech)

Update history

11 June 2022: Samples synthesized by JETS added
27 May 2022: Demo speech samples uploaded

Acknowledgement

The synthesized samples for LPCNet (all) and Parallel WaveGAN (JSUT only) were produced while Keisuke Matsubara of Kobe University (graduated in March 2022) was interning at NICT.