Abstract
Audiobook readers modulate their voices to emphasize certain passages and to highlight discourse changes or significant events, which makes listening easier and more entertaining. Dialogue is central in audiobooks: there the reader applies significant voice transformations, mainly prosodic modifications, to convey each character's properties and changes. However, these intra-speaker modifications are hard to reproduce with simple text-to-speech synthesis. The way the characters of a given story are vocalized depends on the text style and differs from one reader to another. In this work, we investigate this problem through the prism of voice conversion: we propose to modify the narrator's voice to fit the context of the story, such as the character who is speaking. To this end, two complementary experiments are designed. The first assesses the quality of our Phonetic PosteriorGram (PPG)-based voice conversion system using parallel data; objective and subjective evaluations with naive raters are conducted to estimate the quality of the generated signal and the speaker similarity. The second applies intra-speaker voice conversion, treating narration passages and character speech passages as two distinct speakers. The data are then non-parallel, and the dissimilarity between character and narrator is measured subjectively.
Tasks
- Inter-speaker voice conversion: convert the voice of a source speaker into that of a target speaker.
- Intra-speaker voice conversion: change the voice style of a single speaker.
Dataset
Inter-speaker
- Boule de suif (author: Guy de Maupassant): FFR0009LV (Ezwa, female voice), FFR0012LA (Victoria, female voice), MFR0015LA (Jean-Luc Fischer, male voice)
- La petite comtesse (author: Octave Feuillet): FFR0011LA (Pomme, female voice), MFR0013LA (Daniel Luttringer, male voice), MFR0014LA (René Depasse, male voice)
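As a rough illustration of how the speakers listed above could be grouped into conversion pairs, the sketch below enumerates ordered source/target pairs and labels each one intra-gender or inter-gender. It assumes that pairs are only formed between readers of the same book, since parallel data requires the same text; the page does not state this explicitly. The names `Speaker` and `conversion_pairs` are hypothetical and not part of the authors' system.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Speaker:
    speaker_id: str
    gender: str  # "F" or "M", taken from the speaker list above
    book: str


# Speakers as listed in the dataset description.
SPEAKERS = [
    Speaker("FFR0009LV", "F", "Boule de suif"),
    Speaker("FFR0012LA", "F", "Boule de suif"),
    Speaker("MFR0015LA", "M", "Boule de suif"),
    Speaker("FFR0011LA", "F", "La petite comtesse"),
    Speaker("MFR0013LA", "M", "La petite comtesse"),
    Speaker("MFR0014LA", "M", "La petite comtesse"),
]


def conversion_pairs(speakers):
    """Yield (source_id, target_id, mode) for ordered pairs reading the same book."""
    for source, target in permutations(speakers, 2):
        if source.book != target.book:
            continue  # assumption: no parallel data across different books
        mode = "intra-gender" if source.gender == target.gender else "inter-gender"
        yield source.speaker_id, target.speaker_id, mode


pairs = list(conversion_pairs(SPEAKERS))
for source_id, target_id, mode in pairs:
    print(f"{source_id} -> {target_id} ({mode})")
```

Under this assumption, a target such as FFR0009LV has one intra-gender source (FFR0012LA) and one inter-gender source (MFR0015LA).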
Intra-speaker
- Intra-speaker voice conversion (voice style conversion) uses the SynPaFlex corpus [1].
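The abstract describes treating narration passages and character speech passages from one reader as two distinct speakers. A minimal sketch of that partitioning step is given below. The `Utterance` class, the `is_character_speech` flag, and the toy sentences are all hypothetical and invented for illustration; they do not reflect the corpus's actual annotation format.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    utt_id: str
    text: str
    is_character_speech: bool  # hypothetical annotation, not the corpus's real schema


def split_pseudo_speakers(utterances):
    """Group one reader's utterances into two pseudo-speakers."""
    groups = {"narration": [], "character": []}
    for utt in utterances:
        key = "character" if utt.is_character_speech else "narration"
        groups[key].append(utt)
    return groups


# Made-up toy data, only to show the call.
toy = [
    Utterance("u1", "She opened the door.", False),
    Utterance("u2", "Who is there?", True),
    Utterance("u3", "Nobody answered.", False),
]
groups = split_pseudo_speakers(toy)
print({name: [u.utt_id for u in utts] for name, utts in groups.items()})
```

Each group can then play the role of a source or target "speaker" in a non-parallel conversion setup, since the two groups do not share the same text.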
Inter-speaker voice conversion samples
Speaker ID | Target Voc. | Synth. mode (SynMode) | Intra-gender Conv. mode | Inter-gender Conv. mode | Intra-gender Source Voc. | Inter-gender Source Voc. |
---|---|---|---|---|---|---|
FFR0009LV | ||||||
FFR0009LV | ||||||
FFR0009LV | ||||||
FFR0012LA | ||||||
FFR0012LA | ||||||
FFR0012LA | ||||||
MFR0013LA | ||||||
MFR0013LA | ||||||
MFR0013LA | ||||||
MFR0014LA | ||||||
MFR0014LA | ||||||
MFR0014LA | ||||||
Intra-speaker voice conversion
Reference A | Reference B | IS Voc. | Syn. mode - IS | Conv. mode - IS to DS | Syn. mode - DS | DS Voc. |
---|---|---|---|---|---|---|
References
[1] Sini, A., Lolive, D., Vidal, G., Tahon, M., & Delais-Roussarie, E. (2018, May). SynPaFlex-Corpus: An expressive French audiobooks corpus dedicated to expressive speech synthesis. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).