Title: High-Coverage Multi-Level Text Corpus for Non-Professional Voice Conservation
Authors: Markéta Jůzová, Daniel Tihelka, Jindřich Matoušek
Note: This corpus constitutes the research outcome TH02010307-V2 of the project "Automatic voice banking and reconstruction for patients after total laryngectomy" (TH02010307).

-----

This text corpus contains a carefully optimized set of sentences intended for the preparation of a speech corpus for the development of a personalized text-to-speech (TTS) system. It was designed primarily for the voice conservation procedure, which must be carried out in the relatively short period before a person loses his/her own voice, typically because of total laryngectomy.

Total laryngectomy is a radical treatment procedure which is often unavoidable to save the lives of patients diagnosed with severe laryngeal cancer. Despite being very effective with respect to the primary treatment, it significantly handicaps patients through the permanent loss of their ability to use their voice and produce speech. Fortunately, modern methods of computer text-to-speech synthesis offer the possibility of a "digital conservation" of the patient's original voice for his/her future speech communication -- a procedure called voice banking or voice conservation. Moreover, the banking procedure can be undertaken by any person facing voice degradation or loss in the more distant future, or by anyone who simply wishes to keep his/her voice print.

The key aspect is the design of the speech recording process, since the speakers are required to record speech data suitable for a personalized TTS system of reasonable quality. The fact that there can be very little time between diagnosis and surgery in the case of laryngectomy, together with the fact that a common speaker is completely untrained in speech recording (and sometimes has limited computer skills), makes the recording conditions and the speech corpus design very different from the recording of a professional or semi-professional voice. Therefore, the source material to be recorded (presented here) was designed with the aim of efficiently recording as much speech data from non-professional speakers as possible, given their limited time and speaking abilities. It is arranged into multiple levels, each maximizing a given phonetic and/or prosodic coverage, with more details being handled as the level increases. Depending on the amount of data recorded, either statistical parametric speech synthesis or unit selection can easily be used as the actual speech synthesis method.

-----

The sentences in the given XML were selected from a large set of texts by the procedure described in detail in [1]. The pre-selection step reduced the whole set to approximately 120,000 sentences of 3 to 8 words each. The sentences were then selected from this reduced subset using the following 6 levels of unit coverage optimization, as described in [1] and summarized here:

Level 1: each phone at least 15 times (94 sentences)
Level 2: each phone with prosodeme at least 15 times (373 sentences; see "Detailed description of PROSODY codes" for the explanation of "prosodeme")
Level 3: each diphone at least 2 times (532 sentences)
Level 4: each diphone at least 5 times (996 sentences). Note that some phrases with rare diphones, containing words more difficult to read, were added to the selection.
Level 5: each diphone in prosodeme at least 3 times, but stopped when 505 sentences were selected (giving 2500 in total), since more than 6000 sentences would be required to fully meet this requirement
Level 6: uniform balancing up to 3500 phrases in total (see [4])

All the sentences in the resulting set were manually checked not to contain unknown, nonsensical or hard-to-read words, and were corrected where necessary.
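Each level above can be viewed as a greedy coverage optimization: sentences are repeatedly picked so as to cover units (phones, diphones, possibly combined with prosodemes) that are still missing their required number of occurrences. The following Python sketch illustrates that idea only; it is not the exact algorithm of [1], and "units_of" and "required" are placeholders for the unit type and target count of a given level.

    # Minimal sketch of one level of greedy coverage-driven selection.
    # Illustrative only -- NOT the exact algorithm of [1].
    from collections import Counter

    def greedy_select(sentences, units_of, required):
        counts = Counter()                    # unit occurrences covered so far
        selected = []
        remaining = set(range(len(sentences)))

        def gain(i):
            # how many still-needed unit occurrences sentence i would add
            c = Counter(units_of(sentences[i]))
            return sum(min(n, required - counts[u])
                       for u, n in c.items() if counts[u] < required)

        while remaining:
            best = max(remaining, key=gain)
            if gain(best) == 0:               # no missing unit can be covered
                break
            counts.update(units_of(sentences[best]))
            selected.append(sentences[best])
            remaining.remove(best)
        return selected

For Level 1, units_of would return the phones of a sentence and required would be 15; for Level 3, the diphones with required = 2, and so on. Since the levels build on one another, the counts accumulated at earlier levels would be carried over.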
------

In the XML file, every element contains a single sentence to be recorded. Its content comprises:

... the orthographic form of the sentence - the text to read
... the phonetic transcription of the sentence in the alphabet specified by the type="" attribute. The individual phones are separated by spaces, words are separated by a vertical bar [|], and phrases within the sentence are separated by a colon [:]
    type="SAMPA" ... Czech SAMPA phonetic alphabet (http://www.phon.ucl.ac.uk/home/sampa/czech-uni.htm)
    type="IPA" ... IPA, the International Phonetic Alphabet
    type="PROSODY" ... prosodic assignment of the phones, see "Detailed description of PROSODY codes"

Additional attributes of the tag:
ID ... unique sentence identifier which can be used, for example, as a file name
order ... the order of recording to keep the highest unit coverage (as described in [1])
type ... sentence type ("oznam" = declarative, "zjist" = Y/N-question, "dopln" = Wh-question, etc.)
level ... the level in which the sentence was selected (see above)

An illustrative parsing sketch is given at the end of this file.

Detailed description of PROSODY codes:

The prosody description used (in the type="PROSODY" transcription) is based on the prosody phrase grammar published in [2,3]. To summarize, each prosodic word (a base rhythm unit) is assigned to a "prosodeme" with codes representing:

1.1 ... falling prosody pattern, representing a declarative phrase
3.1 ... rising prosody pattern, representing the end of a phrase within a sentence
2.1 ... rising prosody pattern, representing a Y/N question
0 ... any other prosody pattern, not functionally involved

References:

[1] Jůzová, M., Romportl, J. and Tihelka, D.: Speech Corpus Preparation for Voice Banking of Laryngectomised Patients. Text, Speech, and Dialogue, Lecture Notes in Artificial Intelligence, vol. 9302, pp. 282-290, Springer, Berlin, Heidelberg, 2015.
[2] Romportl, J., Matoušek, J. and Tihelka, D.: Advanced Prosody Modelling. Text, Speech and Dialogue, Lecture Notes in Artificial Intelligence, vol. 3206, pp. 441-447, Springer, Berlin, Heidelberg, 2004.
[3] Romportl, J.: Structural Data-Driven Prosody Model for TTS Synthesis. Proceedings of the Speech Prosody 2006 Conference, pp. 549-552, TUDpress, Dresden, 2006.
[4] Matoušek, J. and Romportl, J.: On Building Phonetically and Prosodically Rich Speech Corpus for Text-to-Speech Synthesis. Proceedings of the Second IASTED International Conference on Computational Intelligence, pp. 442-447, ACTA Press, San Francisco, 2006.
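Illustrative parsing sketch:

The actual element names are not shown in the description above, so the sketch below uses the hypothetical names "sentence" and "transcription" purely for illustration; replace them with the real tag names of the XML. The attribute names (ID, order, type, level) and the separator conventions (spaces for phones, [|] for words, [:] for phrases) are as documented in this file.

    # Illustrative sketch of reading the corpus XML with the standard library.
    # NOTE: the element names "sentence" and "transcription" are hypothetical
    # placeholders; only the attributes and separators are documented above.
    import xml.etree.ElementTree as ET

    tree = ET.parse("corpus.xml")             # file name is an assumption
    for sent in tree.getroot().iter("sentence"):
        sid = sent.get("ID")                  # unique identifier (e.g. file name)
        order = int(sent.get("order"))        # recording order (see [1])
        stype = sent.get("type")              # "oznam", "zjist", "dopln", ...
        level = int(sent.get("level"))        # selection level 1..6
        for tr in sent.iter("transcription"):
            if tr.get("type") == "SAMPA" and tr.text:
                # phrases separated by [:], words by [|], phones by spaces
                phrases = [[word.split() for word in phrase.split("|")]
                           for phrase in tr.text.split(":")]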