Abstract:
We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages preceding acoustic context and bilateral textual context to improve the prosody of synthetic speech.
Previous work uses either unilateral context or only a single kind of context, and therefore cannot fully exploit the available context information.
The proposed method uses an acoustic context encoder and a textual context encoder to aggregate context information and feed it to the TTS model, which enables the model to predict context-dependent prosody.
We conducted comprehensive objective and subjective evaluations on a multi-speaker Japanese audiobook dataset.
Experimental results demonstrate that the proposed method significantly outperforms two previous methods.
We also conducted experiments to find the best combination of context type, laterality (unilateral or bilateral), and length for audiobook TTS, which has not been discussed in the literature before.
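
To make the notions of context type, laterality, and length concrete, the following is a minimal sketch of how context could be selected for one target utterance: acoustic context is taken only from preceding utterances, while textual context is taken from both sides. The names `ContextWindow` and `select_context`, and the default lengths, are hypothetical illustrations and are not taken from the system described above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ContextWindow:
    """Context selected for one target utterance (hypothetical container)."""
    acoustic: List[str]   # IDs of preceding utterances whose audio is used
    text_prev: List[str]  # text of preceding sentences
    text_next: List[str]  # text of succeeding sentences


def select_context(
    ids: Sequence[str],
    texts: Sequence[str],
    index: int,
    n_acoustic: int = 2,   # assumed length, not the paper's setting
    n_text_prev: int = 2,  # assumed length, not the paper's setting
    n_text_next: int = 2,  # assumed length, not the paper's setting
) -> ContextWindow:
    """Pick preceding acoustic context and bilateral textual context."""
    if len(ids) != len(texts):
        raise ValueError("ids and texts must have the same length")
    if not 0 <= index < len(ids):
        raise IndexError("index out of range")

    # Windows are clipped at the boundaries of the book or chapter.
    acoustic_start = max(0, index - n_acoustic)
    text_start = max(0, index - n_text_prev)
    text_end = min(len(texts), index + 1 + n_text_next)

    return ContextWindow(
        acoustic=list(ids[acoustic_start:index]),
        text_prev=list(texts[text_start:index]),
        text_next=list(texts[index + 1:text_end]),
    )
```

Setting `n_text_next=0` in this sketch gives a unilateral textual context, so the same function covers both sides of the laterality comparison.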
Below are audiobook samples synthesized by ATCE-suc. (the best model), ATCE-bi, and the two baseline methods.
None of the utterances below are included in the training set.