Audio Samples of the paper "Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts"

Authors: Detai Xin, Sharath Adavanne, Federico Ang, Ashish Kulkarni, Shinnosuke Takamichi, Hiroshi Saruwatari

Note: For the best listening quality, we strongly encourage listeners to use headphones.

Abstract: We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages preceding acoustic context and bilateral textual context to improve the prosody of synthetic speech. Previous work uses either unilateral context or only a single kind of context, and therefore cannot fully utilize the available context information. The proposed method uses an acoustic context encoder and a textual context encoder to aggregate context information and feeds it to the TTS model, which enables the model to predict context-dependent prosody. We conducted comprehensive objective and subjective evaluations on a multi-speaker Japanese audiobook dataset. Experimental results demonstrate that the proposed method significantly outperforms two previous works. We also conducted experiments to find the best combination of types, laterals, and lengths of context for audiobook TTS, which has not been discussed in the literature before.
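To make the notions of "preceding acoustic context" and "bilateral textual context" concrete, the following is a minimal sketch of how the context for one target utterance might be collected from an audiobook. It is not the authors' implementation: the names `ContextWindow`, `gather_context`, `n_acoustic`, and `n_text`, as well as the default context lengths, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ContextWindow:
    """Context gathered for one target utterance (hypothetical container)."""
    acoustic_past: List[str]  # IDs of preceding utterances whose audio serves as context
    text_past: List[str]      # sentences preceding the target
    text_future: List[str]    # sentences succeeding the target


def gather_context(sentences: Sequence[str], utt_ids: Sequence[str], index: int,
                   n_acoustic: int = 2, n_text: int = 2) -> ContextWindow:
    """Collect preceding acoustic context and bilateral textual context.

    `n_acoustic` and `n_text` are assumed context lengths, counted in utterances.
    """
    if len(sentences) != len(utt_ids):
        raise ValueError("sentences and utt_ids must have the same length")
    if not 0 <= index < len(sentences):
        raise IndexError("index is outside the audiobook")

    # Clamp the window starts so that utterances near the beginning
    # of the book simply receive a shorter context.
    acoustic_start = max(0, index - n_acoustic)
    text_start = max(0, index - n_text)

    return ContextWindow(
        acoustic_past=list(utt_ids[acoustic_start:index]),
        text_past=list(sentences[text_start:index]),
        text_future=list(sentences[index + 1:index + 1 + n_text]),
    )


# Toy audiobook with five utterances (placeholder strings, not real data).
sentences = ["s0", "s1", "s2", "s3", "s4"]
utt_ids = ["u0", "u1", "u2", "u3", "u4"]
window = gather_context(sentences, utt_ids, index=2, n_acoustic=1, n_text=2)
```

In this sketch the acoustic context is drawn only from utterances before the target, while the textual context is drawn from both sides, mirroring the description in the abstract. Varying `n_acoustic` and `n_text`, or emptying one side of the textual window, corresponds to the different types, laterals, and lengths of context that the abstract mentions.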

We show the audiobooks synthesized by ATCE-suc. (the best model), ATCE-bi, and the two baseline methods. None of the utterances below are included in the training set.

Audiobook TTS

The following audiobook is ごんぎつね, synthesized in a male speaker's voice by the ATCE-suc. model.

Because the audiobook files are very large, please find the rest of them at this link.

The link above contains the following audiobooks. These are all well-known works, so their texts can easily be found online.

ごんぎつね Female speaker
ごんぎつね Male speaker
手袋を買いに Female speaker
桜桃 Female speaker
蜘蛛の糸 Female speaker
蜘蛛の糸 Male speaker

Context-dependent prosody prediction

All of the following audio samples are synthesized from the same text with different contexts.

More samples with different contexts are available at this link.

Text:「おれと同じ一人ぼっちの兵十か」


Ground Truth

Synthesized 1
(Correct context)

Synthesized 2
(Low tension)

Synthesized 3
(High tension)