A Prosodic Lexicon of Lhasa Tibetan: An Experimental Study Based on Speech Synthesis
Lu Chen, Zu Yiqing, Liu Chenning, Zhang Xiao
Submitted 2025-08-07 | ChinaXiv: chinaxiv-202508.00194

Abstract

This study proposes a method for constructing a prosodic lexicon for Lhasa Tibetan based on a continuous speech database, applicable to low-resource and complex languages. The prosodic lexicon constructed from a small set of high-quality data (3.77 hours, 2,526 sentences) significantly improves speech synthesis quality for Lhasa Tibetan. The study reveals that the linguistic components involved in tone sandhi in continuous Lhasa Tibetan discourse are constrained by semantics, morphology, and syntax, reflecting hierarchical organization and chunking rules in the linguistic cognitive system. Tone sandhi components have three phonetic manifestations: monosyllabic citation tone, monosyllabic tone loss, and disyllabic tone sandhi. Which manifestation a syllable takes in connected speech is determined by three types of constraints: the first imposed by morphology, the second by syntax, and the third by high-frequency grammatical constructions. Based on AI speech synthesis experiments, this study uses the first and third types of tone sandhi components, together with their constraint rules in continuous discourse, as the foundation for prosodic lexicon construction when building language models, in place of the conventional Tibetan dictionaries and word segmentation rules used in information processing. Inspired by Usage-Based Theory in cognitive linguistics, the experiment extracted prefabricated chunks (prefabs) from the 2,526 sentences, and a Prefabs Lexicon containing 175,000 entries was constructed from the semantic and grammatical features of these chunks. To evaluate the lexicon, a word segmentation experiment was conducted using a 56-minute dataset from another Lhasa Tibetan speaker as the test set. In a comparison with traditional Tibetan dictionaries, the prefabricated chunk lexicon based on tone sandhi features achieved an F1-score of 0.92. Furthermore, in synthesis experiments on toneless Amdo Tibetan, the synthesis MOS (Mean Opinion Score) improved to 4.17, indicating that the prefabricated chunk lexicon built on tone sandhi features is applicable across dialects.

LU Chen, LIU Chenning, ZHANG Xiao, ZU Yiqing*

Keywords: Tone sandhi, Prefabricated chunks, Prosodic lexicon, Speech synthesis, Lhasa Tibetan

*Corresponding author: ZU Yiqing, iFLYTEK Co., Ltd. & Interdisciplinary Research Center for Language Sciences, University of Science and Technology of China, yqzu@iflytek.com.

1. Introduction

In psychology, research on the mental lexicon examines the lexical activities that language users engage in during everyday language comprehension and production. Jarema and Libben (2007:2) define the mental lexicon as "the cognitive system that constitutes the ability for conscious and unconscious lexical activity," emphasizing that the mental lexicon is the lexical activity itself, such as word comprehension, rather than the entity that facilitates lexical activity. Psycholinguistic research in this area has had a direct impact on Natural Language Processing (NLP). According to Faber and Mairal Usón's (1999:20) study of English lexicons, an NLP lexicon typically needs to include: phonological knowledge of the language, including the structure, stress, and intonation of words and expressions; morphological information about words; the syntactic configurations of words in phrases and sentences; the meanings of words and how these meanings combine to form sentence meaning; and pragmatic information such as communicative intentions. A lexicon is therefore not merely a list of words. It must also contain multi-level information related to words, including phonology, morphology, and syntax, as well as knowledge of how words change dynamically in utterances. Lexicon research thus cannot be separated from actual discourse and requires dynamic analysis of large amounts of continuous speech.

In the field of speech science, synthetic speech is an important tool to help us verify the results of speech analysis. As Kent and Read (1992:262) state, only when we can reproduce a process do we truly understand it. Current speech synthesis experimental platforms based on sequence-to-sequence models can, on the basis of small amounts of high-quality continuous speech data, incorporate linguistic features at the textual, phonetic, and grammatical levels to build language models. This not only allows us to test whether our phonetic analysis is correct, but also helps us verify whether certain linguistic concepts and knowledge systems are reasonable.

Text segmentation is a fundamental task in natural language processing. For Chinese and Tibetan, text segmentation typically refers to word segmentation. However, the concept of "word" does not naturally exist in these languages. From the perspective of natural text in these languages, there are no word boundaries, only syllable boundaries corresponding to a Chinese character or Tibetan syllable, which also reflects native speakers' cognitive understanding of their language's characteristics. This feature makes text segmentation a challenging task for Chinese and Tibetan. There have been many insightful studies and discussions, generally aiming to balance multiple criteria such as the smallest independent usage unit, semantic completeness, consistent grammatical properties, high frequency, pauses, and syllable count (Sun et al. 2001; Feng 2001; Wang 2001; Jiang 2003; Guan 2009, 2010; Ministry of Education Language and Information Management Division 2015; Long and Liu 2016; GB/T 36452-2018 2018).

Generative large language models such as ChatGPT have demonstrated remarkable capabilities in natural language processing, yet they cannot reliably complete simple word reversal tasks. For example, when GPT-3.5 is asked to "reverse the word letter by letter: synthesis," the output is "sisyhtnes." The same problem appears in Chinese. When GPT-3.5 is asked to "reverse this sentence character by character: 你是一个很聪明的机器人," the output is "机器人的明聪很一个是你." Both tests contain local fragments that fail to reverse correctly: "-yhtne-," "-机器人-," and "-一个-." The cause is that natural language processing must first decompose text into minimal semantic units, a step known as tokenization. The basic unit of this analysis is the token, which may be a whole word, a fragment of a word, or a Chinese character. Large models themselves lack a deep understanding of these linguistic units and conflate units at different levels, whereas the human brain stores extremely rich linguistic units and related knowledge and can draw on them effortlessly in language comprehension and production.
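For reference, the reversal task itself is trivial once the unit of operation is fixed at the character level. The following minimal Python sketch (the function name `reverse_chars` is illustrative and not part of the original study) shows the outputs that a correct character-by-character reversal would produce for the two test strings above.

```python
def reverse_chars(text: str) -> str:
    """Reverse a string one Unicode code point at a time."""
    return text[::-1]


# The two prompts quoted in the text.
print(reverse_chars("synthesis"))               # sisehtnys
print(reverse_chars("你是一个很聪明的机器人"))      # 人器机的明聪很个一是你
```

Comparing these with the model outputs quoted above makes the unreversed fragments easy to spot.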

Early linguistic theories held that due to limited memory capacity, the lexicon should only include unpredictable, most basic morphemes, while using rules to represent predictable structural information, thereby achieving a separation between lexicon and rules. The purpose of this approach was to avoid "redundant" information, thus excluding compound words from the lexicon. For example, Lieber (1980) and Selkirk (1984) likened the lexicon to a calculator, containing a morpheme repository and a rule system for combining basic morphemes into complex words. This "pocket calculator" language model is characterized by maximizing computation and minimizing storage, but this also means it lacks a memory that stores computational processes and evaluates them, and therefore cannot learn from past experiences and steps (Baayen 2007:82).

However, the human brain clearly often acquires knowledge step by step from intermediate processes, gradually understanding and mastering complex things; it does not always learn from minimal units and final outcomes alone. Usage-Based Theory (Bybee and Beckner 2015) therefore posits that the specific instances learned in language and the generalization patterns that gradually emerge from them are stored together in memory. Speakers maintain rich memory representations that retain the details and experiences associated with each instance.

The viewpoint of this paper is that the mental lexicon pre-stores linguistic units at multiple levels and ranks, enabling us to simultaneously track and analyze linguistic information at multiple levels quickly during language use. In a neuroscientific study on speech comprehension, Ding et al. (2016) found that when humans comprehend language, neural activity in the brain tracks different levels of linguistic structure simultaneously according to different time scales. Based on our work with Lhasa Tibetan speech synthesis experiments, we believe that the linguistic fragments involved in tone sandhi are key units in Lhasa Tibetan language comprehension and use, and are core elements that connect lexical, syntactic, and prosodic levels. When building natural language processing lexicons, tone sandhi-related linguistic information needs to be explicitly expressed in the lexicon.

2. Introduction to Lhasa Tibetan Speech Synthesis Experiments

The training data for this study is a self-constructed Lhasa Tibetan speech database of 2,526 sentences (3.77 hours, sampled at 16 kHz) read by a professional broadcaster. The experiment employs a sequence-to-sequence speech synthesis method, performing direct encoding and decoding between input and output sequences. The synthesis model combines an autoregressive acoustic model with the STRAIGHT vocoder and is trained on the 2,526 parallel sentences of the database. During training, 128 samples are input per step and 4 frames are predicted at a time; during inference, a stepwise monotonic attention mechanism is used, and the traditional STRAIGHT vocoder reconstructs the waveform. The input sequence is text in the Wylie transliteration scheme (Wylie 1959), which corresponds one-to-one with Tibetan characters. The experiment found that the choice of linguistic unit in the text segmentation stage significantly affects the final synthesis quality.

Early experiments used a conventional Tibetan dictionary to segment the 2,526 training sentences, with manual correction of the automatic segmentation results. However, the MOS evaluation* of the speech synthesis system trained this way was not ideal: numerous segmental and tonal errors impaired sentence comprehension, and the rhythm was unnatural. The root cause was that the system could not stably reproduce the tone sandhi of natural speech. Tone sandhi is a very common phonetic phenomenon in Lhasa Tibetan, as it is in the Wu and Min dialects of Chinese; it involves most components in language use and directly affects lexical semantic understanding and the prosodic naturalness of speech.

Every syllable in Lhasa Tibetan has a tone, i.e., a citation tone. When syllables enter words or sentences, tone sandhi occurs, manifesting in two forms: single-syllable linguistic components lose their original tone and are weakened; two-syllable components change their original tones, undergoing tone sandhi. Based on this phonetic feature, we changed the rules of text segmentation and manually annotated the 2,526 sentences in the speech database. The segmentation units are no longer conventional dictionary words, but three types of prosodic components: single-syllable citation tone components, labeled as se1; single-syllable toneless components, labeled as se0; and two-syllable tone sandhi components, labeled as se2. We collectively refer to se1, se0, and se2 as SE units (sense elements) in Lhasa Tibetan. During data annotation, each sentence is segmented into a linear combination sequence of se1, se0, and se2. Consequently, there are two comparable approaches to text segmentation for the 2,526-sentence speech database: one is conventional dictionary word segmentation, and the other is SE unit segmentation. MOS comparative experiments showed that the synthesis system using SE segmentation achieved significantly better results. The detailed process of this experiment is described in "Basic Linguistic Operating Units SE in Continuous Speech—Experimental Evidence from Tone Sandhi in Lhasa Tibetan" (Zu et al. 2022). The experimental results showed that synthesis based on conventional dictionary segmentation achieved a MOS of 3.45, while synthesis based on tone sandhi SE units achieved a MOS of 4.25. We conducted objective error statistics on 50 synthesized sentences from this experiment. As shown in Figure 1 [FIGURE:1], for the 50 synthesized sentences using SE units, not only did tone sandhi errors decrease significantly, but the synthesis accuracy of citation tones and segmental phonemes also improved. 
This shows that the choice of linguistic unit affects not only higher-level features such as pauses and prosody, but also the machine learning of lower-level phonetic features of segments and syllables.
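The SE annotation described above can be pictured as a simple data structure. The sketch below is an illustration only: the class name `SEUnit`, the helper functions, and the three-unit example sequence are hypothetical and are not taken from the annotated corpus. It encodes just the constraint stated in the text, namely that se1 and se0 units are one syllable long and se2 units are two syllables long, with Wylie syllables separated by spaces.

```python
from dataclasses import dataclass

# Syllable counts implied by the three labels described in the text.
SYLLABLES_PER_LABEL = {"se1": 1, "se0": 1, "se2": 2}


@dataclass(frozen=True)
class SEUnit:
    wylie: str  # space-separated Wylie syllables
    label: str  # "se1", "se0", or "se2"

    def __post_init__(self):
        # Reject labels outside the SE inventory and mismatched lengths.
        if self.label not in SYLLABLES_PER_LABEL:
            raise ValueError(f"unknown SE label: {self.label}")
        expected = SYLLABLES_PER_LABEL[self.label]
        actual = len(self.wylie.split())
        if actual != expected:
            raise ValueError(f"{self.label} expects {expected} syllable(s), got {actual}")


def label_sequence(units):
    """Return the linear sequence of SE labels for one sentence."""
    return [u.label for u in units]


def syllable_count(units):
    """Total number of syllables covered by the SE units."""
    return sum(SYLLABLES_PER_LABEL[u.label] for u in units)


# Hypothetical sequence built from forms cited in this paper; not a corpus sentence.
example = [SEUnit("nyi ma", "se2"), SEUnit("ni", "se0"), SEUnit("ma thub", "se2")]
print(label_sequence(example))  # ['se2', 'se0', 'se2']
print(syllable_count(example))  # 5
```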

Figure 1. Objective Error Statistics Comparison Between Dictionary Words and SE Lexicon

This experiment suggests that, compared with conventional dictionary words, SE units extracted on the basis of tone sandhi features may be the dominant units underlying the operation of the Lhasa Tibetan lexicon, better reflecting the phonological, lexical, and syntactic processes of actual language use. Two questions follow. The first is linguistic: what is the scope of tone sandhi in continuous speech, i.e., how are tone sandhi domains delimited in sentences, and what lexical and syntactic constraints govern them? The second is an engineering question: given that the MOS experiments above show that segmenting text into SE units yields better results, how can SE units be segmented automatically, and what lexical and syntactic rules are needed to ensure the accuracy of automatic segmentation? To address these questions, this study analyzed and annotated the 2,526-sentence speech database for tone sandhi, lexical, and syntactic information, summarized the lexical and syntactic constraints under which tone sandhi occurs, and divided the analysis and prediction of tone sandhi components in speech synthesis into two levels: lexicon and syntax. On this basis, a prosodic lexicon with 175,000 entries was constructed, and its effectiveness in improving speech synthesis quality was demonstrated through two experiments.

*MOS (Mean Opinion Score) is a commonly used measure for evaluating video, audio, and audiovisual quality. A test panel large enough to be statistically meaningful scores the quality of the test items based on subjective experience, on a scale from 1 to 5 (5 Excellent, 4 Good, 3 Fair, 2 Poor, 1 Bad), and the MOS is the arithmetic mean of these scores.

3. Constraints on Tone Sandhi in Lhasa Tibetan

3.1 Tone Sandhi Patterns and Domains

Research on tone sandhi typically involves two tasks: tone sandhi patterns and tone sandhi domains. We must first understand tone sandhi patterns, i.e., the possible tonal patterns when two syllables undergo tone sandhi. However, to deeply investigate the scope of tone sandhi implementation, i.e., tone sandhi domains, we need to consider more constraint factors—specifically, under what constraints components in a sentence undergo tone sandhi. Through in-depth investigation of these constraints, we can better understand the application rules of tone sandhi in language.

This is of great significance for building machine learning models that can automatically segment tone sandhi domains and predict tone sandhi. Therefore, our work aims to explore these constraints in order to accurately determine tone sandhi domains when analyzing continuous speech.

Previous research on Lhasa Tibetan tone sandhi has achieved considerable results in the analysis of tones and tone sandhi patterns (Qu 1981b; Hu et al. 1982; Zhou 1983; Yu 1983; Xu 2015). Modern Lhasa Tibetan has four citation tones, which we label H, R, L, and F in our data annotation, representing high, rise, low, and fall respectively. The phonological information of every syllable (Tibetan character) in Lhasa Tibetan needs to be stored in the lexicon. When two syllables undergo tone sandhi under lexical and syntactic processes, there are five tone sandhi patterns: HH, HF, LR, LF, LH. Which pattern two syllables exhibit depends on the onset type of the initial syllable and the rime type of the final syllable. Based on Tibetan orthographic information, syllable onsets are divided into two categories, and rimes are divided into three categories according to their final consonant letters:

  1. Rimes with consonant letters l, r, m, n, ng, etc., are smooth rimes;
  2. Rimes with consonant letters s, d, ms, ngs, b, bs, g, gs, etc., are checked rimes;
  3. Rimes without consonant letters are open rimes.
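The three rime categories above can be sketched as a small lookup. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `rime_category` is hypothetical, the final is taken to be whatever follows the last vowel letter of a Wylie syllable, and the letter sets contain only the finals named in the list (the source lists end with "etc.", so any other final is reported as "unknown" rather than guessed).

```python
# Final consonant letters named in the text; the source lists are open-ended
# ("etc."), so these sets are deliberately limited to the letters given.
SMOOTH_FINALS = {"l", "r", "m", "n", "ng"}
CHECKED_FINALS = {"s", "d", "ms", "ngs", "b", "bs", "g", "gs"}
VOWELS = "aeiou"


def rime_category(syllable: str) -> str:
    """Classify a Wylie syllable's rime as smooth, checked, open, or unknown.

    Assumption: the final is the letter sequence after the last vowel letter.
    """
    last_vowel = max(syllable.rfind(v) for v in VOWELS)
    if last_vowel < 0:
        raise ValueError(f"no vowel letter found in {syllable!r}")
    final = syllable[last_vowel + 1:]
    if final == "":
        return "open"
    if final in SMOOTH_FINALS:
        return "smooth"
    if final in CHECKED_FINALS:
        return "checked"
    return "unknown"


for syl in ["nya", "khrab", "phyir", "skabs"]:
    print(syl, rime_category(syl))
# nya open
# khrab checked
# phyir smooth
# skabs checked
```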

Tone sandhi derivation rules are shown in Table 1 [TABLE:1]:

Table 1. Two-syllable Tone Sandhi Rules in Lhasa Tibetan

Most previous Lhasa Tibetan studies have focused on analyzing the phonetic manifestations of tone sandhi given that two syllables have undergone sandhi, summarizing five sandhi patterns: HH, HF, LR, LF, LH. However, few studies have used continuous speech databases as analytical material to investigate the scope of tone sandhi rules in speech flow, i.e., tone sandhi domains. Our work is based on a large number of sentences, investigating how to enable machines to automatically segment tone sandhi domains in text and predict the tone sandhi performance of any given sentence. This work can provide empirical evidence for inferring the operational levels and organizational patterns of tone sandhi units in the human brain.

In Lhasa Tibetan text, a Tibetan character may exhibit various forms of coarticulation and tone sandhi in different linguistic environments. For example, the word "tshod lta" may be a noun meaning "experiment" or "pilot," in which case the two syllables coarticulate and exhibit the HH sandhi pattern. Alternatively, "tshod lta" may be a verb meaning "to try" or "to experiment," in which case the two syllables do not undergo sandhi and are read with citation tones F and H respectively. This is similar to the Chinese word "好" (hao), which has different tones in the idioms "好(hao3)事成双" (good deeds come in pairs) and "好(hao4)事之徒" (troublemaker). This is because the semantic and grammatical structures of "好事" differ—one is a modifier-head structure, the other a verb-object structure—requiring more comprehensive linguistic context to accurately parse its tone. Therefore, although tone sandhi manifests at the lexical level, the constraints on tone sandhi domains also involve phrasal and syntactic factors.

To ensure the synthesis system can accurately express these features of Lhasa Tibetan, we conducted annotation of tone sandhi and grammar for 2,526 audio sentences and summarized the tone sandhi performance of all linguistic fragments in these sentences. In continuous Lhasa Tibetan speech, there exist single-syllable components that do not undergo sandhi, including numerous basic nouns, verbs, adjectives, pronouns, and adverbs. However, there are even more tonal change units, which can be divided into two situations: single-syllable components losing independent tone, manifesting as neutral tone; and two-syllable components undergoing tone sandhi. We label non-sandhi single-syllable fragments as se1, weakened single-syllable fragments as se0, and two-syllable sandhi fragments as se2, collectively referred to as SE units. Through segmentation of 2,526 continuous utterances, we found that weakened se0 and sandhi se2 fragments dominate in actual language use.

Table 2. Statistics of SE Components in 2,526 Sentences

SE structure | Occurrences in 2,526 sentences (proportion) | Unique items (proportion)
Citation tone se1 | 10,929 (34.6%) | 1,490 (21.0%)
Weak se0 | 8,853 (28.1%) | 129 (1.8%)
Sandhi se2 | 11,773 (37.3%) | 5,476 (77.2%)
All SE | 31,555 | 7,095

According to the statistical data in Table 2, we found that a total of 31,555 SE units appeared in the 2,526 sentences, comprising 7,095 distinct SE units after deduplication. Among them, sandhi se2 and weak se0 account for 79.0% of the lexicon entries and 65.4% of occurrences in the 2,526 sentences. This indicates that in actual usage, Lhasa Tibetan sentences are primarily composed of tonal change components, and this feature needs to be fully reflected in language model development.
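The proportions quoted above follow directly from the counts in Table 2. A short Python check (variable names are illustrative) reproduces the totals and the two combined percentages:

```python
# (occurrences, unique items) per SE type, copied from Table 2.
table2 = {
    "se1": (10929, 1490),
    "se0": (8853, 129),
    "se2": (11773, 5476),
}

total_occurrences = sum(occ for occ, _ in table2.values())
total_unique = sum(uniq for _, uniq in table2.values())


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Tonal change components are the weak (se0) and sandhi (se2) units.
changed_occurrences = table2["se0"][0] + table2["se2"][0]
changed_unique = table2["se0"][1] + table2["se2"][1]

print(total_occurrences, total_unique)              # 31555 7095
print(pct(changed_unique, total_unique))            # 79.0
print(pct(changed_occurrences, total_occurrences))  # 65.4
```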

3.2 Three Constraints on Tone Sandhi in Lhasa Tibetan

Based on the annotated data from the 2,526-sentence speech database and related literature (Wang 1956; Hu 1980; Qu 1981a, 1981b; Hu et al. 1982; Tan 1982; Zhou 1983; Huang 1994; Qu and Jing 2000), we found that the tonal change mechanism of syllables in connected speech is influenced by three types of constraints: the first type is constrained by word formation, the second type by syntax, and the third type involves constraints from high-frequency grammatical constructions. These three types of constraints correspond to different sandhi manifestations.

Table 3. Constraint Factors for Three Types of Tone Sandhi Components

Tone sandhi component | Phonetic manifestation | Constraint factor | Statistics in 2,526-sentence database
Type 1 | Two-syllable sandhi | Word formation | 5,243 entries, 11,360 occurrences
Type 2 | Postposed enclitic with neutral tone | Syntax | 8,853 occurrences
Type 3 | Special sandhi from grammatical marker + verb/adjective | High-frequency grammatical constructions | 160 entries, 271 occurrences

The first type of tone sandhi component results from two-syllable word formation and can be segmented automatically by listing such words exhaustively in the lexicon. It covers the following:

Two-syllable sandhi is widespread in Lhasa Tibetan; its constraint is two-syllable compound word formation, and it corresponds to two-syllable content words. In the 2,526 sentences, this type comprises 5,243 distinct entries with 11,360 occurrences (Table 3), making it the primary target for collection in a continuous speech lexicon. Wang Zhijing (1994:23, 25) argues that Tibetan monosyllabic morphemes are more fundamental than words, yet discussing morphemes sometimes cannot be separated from words. From a diachronic perspective of lexical development, Tibetan shows a trend of shifting from monosyllabic to disyllabic or polysyllabic words. For example: two morphemes form a disyllabic content word through compounding, nya "fish" + khrab "scale, armor" → nya khrab "fish scale, armor," with sandhi pattern LF; two morphemes form a new word through symbolic representation, ka (first Tibetan letter) + kha (second Tibetan letter) → ka kha "letter, Tibetan alphabet," with sandhi pattern HH; a content morpheme plus a derivational affix forms a new word, re "hope (verb root)" + ba (affix) → re ba "hope (noun)," with sandhi pattern LH, and so on.

The second type of tone sandhi component results from syntactic functions. The objects include:

Single-syllable syntactic function words lose their citation tone. Tibetan postposed function words or enclitics typically occur at chunk boundaries, which are often also prosodic boundaries. Affected by articulatory mechanisms, components at chunk endings lose their original tonal contours to varying degrees. Postpositions still retain lexical meaning; for example, tang "and" has an incomplete tonal shape but maintains its original tone category and can be restored under emphasis. However, postposed enclitics completely lose their citation tone, and such monosyllabic components are labeled as se0. These words are few in number but have high usage frequency. In the 2,526 sentences, there are 129 postposed enclitics, appearing a total of 8,853 times, such as case markers kyi, kyis, tu, nas; topic-marking and pause-indicating enclitics ni; adversative relation enclitics mos "although, however"; postposed mood particles such as declarative enclitics so, ngo, and interrogative enclitics lam, dam, etc.

The third type of tone sandhi component consists of two syllables, with the structure of a grammatical marker plus a verb or adjective. In this case, the grammatical marker does not become neutral tone. We believe this is because they frequently combine with verbs and adjectives to form a fixed construction, and this combination pattern has high-frequency salience in language cognition. During tone sandhi, they do not follow the sandhi rules in Table 1 but form a special grammatical sandhi, where the preceding syllable's tone depends on the following syllable, similar to the third-tone sandhi in Mandarin Chinese, where the preceding character changes tone according to the following character's tone. In the 2,526 sentences, there are four types, totaling 160 words, appearing 271 times.

  1. [Negative marker ma/mi] + [verb/adjective] structure for verb/adjective negation
  2. [Verb/adjective] + [postposed conjunction na] for conditional or hypothetical adverbials
  3. [Verb] + [nominalization components skabs, dus, phyir, etc.] for verb nominalization
  4. [Verb] + [imperfective aspectual affix gi/gyi/kyi]; [verb] + [continuous aspectual affix gin/gyin/kyin]; [verb] + [imperfective nominalization affix rgyu] and its case-marked forms

These four types of special grammatical sandhi structures all involve a functional marker attached to a verb or adjective, with the notable characteristic that the verb or adjective's part of speech is retained within the entire syntactic word, and such components are still labeled as se2.

Bybee (2002) notes that frequently repeated sequences become more fluent because they are automatized into a whole, i.e., prefabs (prefabricated chunks), which have independent representations in memory and can be accessed and executed as a unit. Traditional information processing lexicons have obvious deficiencies in including these prefabs. We believe that the third type of sandhi component belongs to prefabs formed by grammatical constructions and should still be included in the lexicon. For such components, when constructing the prosodic lexicon, we first create inventories of verbs, adjectives, and grammatical markers, and then write the constraint rules for the occurrence of such sandhi components into the lexicon.
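The inventory-plus-rule approach for the third type can be sketched as follows. This is a hypothetical illustration rather than the lexicon format actually used: the function name `type3_rule` and the tiny verb inventory are assumptions, the adjective inventory is left empty because the text gives no adjective examples, the case-marked forms of rgyu are not enumerated, and allomorph agreement between verb and affix is not checked. The marker sets contain only the forms listed in the four structures above.

```python
# Illustrative inventories; a real lexicon would be far larger.
VERBS = {"thub", "re", "dgos"}   # hypothetical subset of the verb inventory
ADJECTIVES = set()               # left empty: no adjective examples in the text

NEG_MARKERS = {"ma", "mi"}                 # structure 1 (preposed)
CONDITIONAL = {"na"}                       # structure 2 (postposed)
NOMINALIZERS = {"skabs", "dus", "phyir"}   # structure 3 (list is open-ended)
ASPECT_AFFIXES = {"gi", "gyi", "kyi", "gin", "gyin", "kyin", "rgyu"}  # structure 4


def type3_rule(first: str, second: str):
    """Return the matching structure number (1-4) for a syllable pair, else None."""
    predicates = VERBS | ADJECTIVES
    if first in NEG_MARKERS and second in predicates:
        return 1
    if first in predicates and second in CONDITIONAL:
        return 2
    if first in VERBS and second in NOMINALIZERS:
        return 3
    if first in VERBS and second in ASPECT_AFFIXES:
        return 4
    return None


print(type3_rule("ma", "thub"))   # 1
print(type3_rule("thub", "na"))   # 2
print(type3_rule("nyi", "ma"))    # None
```

A pair that matches one of the four structures would then be labeled se2 with the special grammatical sandhi rather than the regular patterns of Table 1.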

4. Principles for Constructing a Lhasa Tibetan Prosodic Lexicon

4.1 Homographs in Lhasa Tibetan

There are two purposes for constructing a prosodic lexicon. On one hand, we aim to incorporate tone sandhi information into the lexicon to enrich the linguistic knowledge of low-resource language synthesis models. On the other hand, we need to resolve ambiguities of homographs with different pronunciations and meanings in the text-to-speech process. By evaluating the disambiguation level of the language model, we can understand the extent of our comprehension of human language systems and linguistic competence.

Discourse communication and text reading/writing are the most common language application scenarios. Text and speech are the most reliable sources of information for studying human language systems. Language is a symbolic representation system of thought, while writing is the secondary symbolic result of this system. Compared to machines, humans can seemingly effortlessly use one-dimensional linear text symbols to record complex language in reality. When faced with numerous polyphonic and polysemous words in text, humans can understand their communicative intent, quickly resolve ambiguities, and produce correct pronunciation. This benefits from the mature working network established in our brain's language system among text symbols, semantic knowledge, grammatical knowledge, and speech signals. This ability is a manifestation of human linguistic and cognitive competence and is the core issue of our research.

Tibetan is a phonographic writing system, but it does not directly express tonal information in text, let alone tone sandhi information. Due to the presence of tones and tone sandhi in Lhasa Tibetan, the text contains many homographs that are written identically but differ in semantics or grammatical function. Native speakers can quickly read Tibetan character-based sentences and convert text symbols into correct speech streams. We believe the cognitive basis for this behavior is that the brain pre-stores SE units reflecting tone sandhi information and their high-frequency combination patterns. The reason the human brain can quickly read numerous homographs in text is that the lexicon contains a large number of prefabs and includes the construction template information that forms these prefabs.

The disambiguation problems posed by Tibetan text are of two types: first, a monosyllabic Tibetan character can represent different morphemes and grammatical functions through citation tone, changed tone, and sandhi; second, a two-syllable fragment can belong to different word classes depending on whether it undergoes sandhi. For example, the Tibetan character ma appears 430 times in the Lhasa Tibetan speech database with three pronunciations corresponding to three grammatical functions, which creates a disambiguation challenge:

  • When ma functions alone as a noun "mother," it is read with citation tone R;
  • When ma functions as a postposed word-forming affix, such as in nyi ma "sun," nyi and ma undergo regular sandhi, LH;
  • When ma functions as a preposed negative marker, such as in ma thub "cannot," ma and thub undergo special grammatical sandhi, HF.

Because the pronunciation of a polyphonic character is basically stable within a given prefabricated structure, such structures need to be stored in the lexicon as wholes. In particular, ma thub "cannot" involves special grammatical sandhi, and in the past it would not have been included in the lexicon as a single entry.

In research on disambiguation methods for Lhasa Tibetan homographs, Laba Dunzhu et al. (2018) compiled 140 disyllabic polyphonic words. These polyphonic words mainly take two forms: those ending with affixes ba and pa, and those that are compound words not ending with ba and pa. Through statistical analysis of word forms, part-of-speech tags, and tone sandhi information from the 2,526 sentences, we obtained data on these two types of polyphonic words, detailed in Table 4 [TABLE:4].

Table 4 [TABLE:4] Statistics of Homograph Phenomena in Lhasa Tibetan Speech Database

| Example Words | Polyphonic Type | SE Structure | Occurrences |
|---|---|---|---|
| dgos pa "use, meaning, need" | sandhi, LH | se2 | |
| dgos pa "[v+nmlz] needing..." | no sandhi, R+0 | se1+se0 | |
| tshod lta "experiment, pilot" | sandhi, HH | se2 | |
| tshod lta "to test, to experiment" | no sandhi, F+H | se1+se1 | |

The key to the first category is that ba and pa have both word-forming (derivational) and word-shaping (inflectional) functions. As word-forming components, they combine tightly with the preceding root morpheme, forming the se2 sandhi pattern. As perfective nominalization components of verbs, ba and pa take the neutral tone (tone 0), while the verb morpheme keeps its verbal part of speech and its original tone, forming the se1+se0 combination pattern.

The key to the second category is that two monosyllabic morphemes form a noun when sandhi occurs, but a verb when no sandhi occurs. Similar cases can be found in English, such as "record" and "perfect," where stress position distinguishes their part of speech and function in speech. In the 2,526 sentences, 617 sentences contain these two types of polyphonic words. Although the number of polyphonic words is not large, they are widely distributed. MOS evaluation shows that these words have a significant impact on the semantic understanding of entire sentences.

Disambiguation of polyphonic words is an important criterion for evaluating synthesis quality and an effective measure of the quality of our linguistic analysis. As mentioned earlier, the difficulty of Lhasa Tibetan lies mainly in the variation of sandhi components. We therefore need to observe the environments in which sandhi occurs in continuous speech and supplement the relevant linguistic knowledge in order to deepen our understanding of the language and strengthen our analysis.

4.2 Scheme for Constructing the Prosodic Lexicon

Cognitive psychology research suggests that only by treating mental phenomena as organized, structured wholes can we better understand them. To reduce the number of items that need to be processed, we chunk them, which brings order and coherence to perception (Sternberg and Sternberg 2016). Directly storing a large number of high-frequency chunks can effectively reduce the computational burden on the brain's language system, thereby reserving cognitive resources for more important and novel analytical tasks. Beckner et al. (2009) point out that the cognitive organization of language is directly built upon linguistic experience, and frequently co-occurring components at phonological and syntactic levels gradually form retrievable chunks in the language system, influencing online language processing.

Conventional NLP lexicons typically focus on morphemes, words, phrases, and named entities. However, we believe that lexical sequences formed by high-frequency reusable construction templates should also be included in the lexicon if they exhibit phonological features such as sandhi or weakening, and their construction information regarding tone sandhi should be recorded. For Lhasa Tibetan, in addition to specific se1, se2, and se0 units, we also include relevant prefabs based on the sandhi performance of identical text fragments in different sentences, so that the lexicon contains more contrastive and disambiguation information.

Prefabs, also known as lexical bundles or formulaic sequences, are fixed collocations consisting of more than one word. Becker (1975) argues that actual speech production is based primarily on previously known phrases and proceeds through repetition, modification, and concatenation. For ease of communication and understanding, the main mode of language production is to splice together previously heard fragments. A speech perception experiment on English by Kapatsinski and Radicke (2009) also supports the prefab hypothesis, suggesting that words and ultra-high-frequency phrases are stored in the lexicon for access. In addition, language acquisition research has found that children acquire language through prefabs: through repeated exposure to and use of chunks, children induce construction rules and develop grammatical competence, and these chunks are stored holistically in the mental lexicon (Nattinger and DeCarrico 1992).

Usage-Based Theory posits that multi-word phrases can be stored in memory and enter the lexicon. Although the semantics and forms of some multi-word sequences are transparent, these prefabs provide typical combination patterns. From the perspective of language use, there is no need to choose between storage of unanalyzable units and compositional assembly, because speakers may have rich and diverse representations of sequences (Erman and Warren 2000; Bybee and Beckner 2015).

Prefab templates are necessarily high-frequency reusable patterns in language. The so-called high frequency can refer either to the high usage frequency of language fragments formed by these templates, or to the strong generative capacity of these templates themselves, which are frequently invoked in language use.

This study identifies three characteristics of prefab templates in Lhasa Tibetan:

First, prefab templates typically contain tone sandhi information. For example, two-syllable fragments with tone sandhi are prefabs. According to the statistical results in Table 2, when we segmented the 2,526 sentences of training data into citation tone syllables, neutral tone syllables, and two-syllable sandhi components, two-syllable sandhi fragments had the highest usage rate at 37.3%. The lexicon must include these fragments and their operational rules as completely as possible.

Second, when the same Tibetan character shows different sandhi behavior in different sentence structures, both the sandhi and the non-sandhi structures involving this character are prefab templates. From the perspective of language cognition, the lexicon should include this contrastive structural information, which helps to identify and distinguish polyphonic words quickly. Based on the 2,526 annotated sentences, we can analyze and count the occurrence frequency, distribution patterns, and phonological-grammatical behavior of identical Tibetan text fragments across the training data, which is equivalent to observing how static linguistic fragments change dynamically in actual language use. For example, in the case from Table 4, when tshod lta functions as a verb meaning "to test, to experiment," neither character undergoes sandhi, forming an se1+se1 structure; when it functions as a noun meaning "experiment, pilot," the two characters undergo sandhi together, forming an se2. Both structures are therefore prefab templates that make language comprehension more efficient. Furthermore, if tshod lta is followed by the verbal morpheme byed "to do, to perform," it forms a common trisyllabic verb structure, tshod lta byed "to test, to put to the test," with the sandhi structure se2+se1. This 2+1 trisyllabic verb pattern is very common in modern colloquial Tibetan (Gesang Jumian 2004:394) and is also a prefab construction in language use. Storing such structures helps eliminate ambiguity in tone sandhi.
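The contrast among the tshod lta entries can be sketched as a lexicon that keeps several SE structures for the same written fragment and prefers the longest stored match. This is an illustrative sketch only: the `Entry` class, the `LEXICON` dictionary, and the longest-match strategy in `longest_match` are assumptions made here and are not described in the paper as its actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Entry:
    gloss: str
    pos: str
    se_structure: str

# Hypothetical entries (assumption), taken from the examples in the text.
LEXICON: Dict[Tuple[str, ...], List[Entry]] = {
    ("tshod", "lta"): [
        Entry("experiment, pilot", "noun", "se2"),
        Entry("to test, to experiment", "verb", "se1+se1"),
    ],
    ("tshod", "lta", "byed"): [
        Entry("to test, to put to the test", "verb", "se2+se1"),
    ],
}

def longest_match(syllables: List[str], start: int, max_len: int = 3):
    """Return (key, entries) for the longest lexicon key starting at `start`.

    If nothing matches, the single syllable is returned with no entries.
    """
    for n in range(min(max_len, len(syllables) - start), 0, -1):
        key = tuple(syllables[start:start + n])
        if key in LEXICON:
            return key, LEXICON[key]
    return (syllables[start],), []

key, entries = longest_match(["tshod", "lta", "byed"], 0)
print(key, [e.se_structure for e in entries])
key, entries = longest_match(["tshod", "lta"], 0)
print(key, [e.se_structure for e in entries])
```

When only tshod lta is present, both candidate structures are returned and the choice still depends on context; when byed follows, the stored trisyllabic prefab resolves the ambiguity directly.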

Third, grammatical constructions in linguistics can be large or small, but the prefabs defined in this study are relatively compact structures in language, typically without inserted pauses, approximating prosodic words in prosodic phonology.

In summary, the prefab templates we mined from the 2,526-sentence speech database are shown in Table 5 [TABLE:5].

Table 5 [TABLE:5] Prefab Construction Templates and Their Frequencies in the Database

| Prefab template | SE Structure | Entry Count | Occurrences |
|---|---|---|---|
| Predicate prefabs | | | |
| [Negative marker ma/mi] + [verb/adjective] | | | |
| [Verb/adjective] + [postposed conjunction na] | | | |
| [Verb] + [imperfective aspectual affix gi/gyi/kyi] | | | |
| [Verb] + [continuous aspectual affix gin/gyin/kyin] | | | |
| [Verb] + [imperfective nominalization affix rgyu] and its case forms | | | |
| [Verb] + [nominalization components skabs, dus, phyir, etc.] | | | |
| [Verb] + [perfective nominalization component pa/ba] and its case forms | se1+se0 | | |
| [Verb] + [progressive affix bzhin] | se1+se0 | | |

These prefabs and their structural templates are the key to better disambiguation, and they are components that previous lexicons mostly failed to include. For example, [verb] + [progressive affix bzhin] is an se1+se0 structure. As a noun meaning "face, countenance, appearance," bzhin is read with citation tone R and is an se1 unit. As a progressive marker attached after a verb to express "in progress," bzhin takes the neutral tone and is an se0 unit. Because bzhin as a progressive marker necessarily co-occurs with a verb, and [v] + [bzhin] is typically followed by a clause-final particle such as 'dug or yod, we classify the structure [verb] + [progressive affix bzhin] + [clause-final particle 'dug/yod, etc.] as a prefab and record its SE structure, se1+se0+se0, in the lexicon, thereby providing disambiguation information. Since the 2,526 sentences are only a small sample of Lhasa Tibetan, we gradually accumulated 183 groups of polyphonic words and 339 prefab structures through literature review. Through matching against large text corpora and manual screening, we have so far added approximately 15,000 prefabs.
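The bzhin template can be sketched as a simple rule over part-of-speech-tagged syllables. Everything below is a hypothetical illustration: the tag names ("v", "n", "aff", "part"), the set `CLAUSE_FINAL`, the function `label_bzhin`, and the default of labelling every other syllable se1 are assumptions made for the sketch, not the study's annotation scheme.

```python
from typing import List

# Clause-final particles named in the text; the real inventory is larger.
CLAUSE_FINAL = {"'dug", "yod"}

def label_bzhin(syllables: List[str], pos_tags: List[str]) -> List[str]:
    """Label each syllable se1 or se0 according to the bzhin template.

    Simplifying assumption: every syllable defaults to se1 (citation tone);
    only the [verb] + bzhin (+ clause-final particle) pattern is handled.
    """
    labels = ["se1"] * len(syllables)
    for i, syl in enumerate(syllables):
        if syl == "bzhin" and i > 0 and pos_tags[i - 1] == "v":
            labels[i] = "se0"  # progressive marker: neutral tone
            if i + 1 < len(syllables) and syllables[i + 1] in CLAUSE_FINAL:
                labels[i + 1] = "se0"  # clause-final particle in the prefab
    return labels

# byed "to do" is the verb used as an example elsewhere in the text.
print(label_bzhin(["byed", "bzhin", "'dug"], ["v", "aff", "part"]))
print(label_bzhin(["bzhin"], ["n"]))
```

The noun reading of bzhin keeps its citation tone because the verb context required by the template is absent.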

The traditional Tibetan dictionary used in the early stage of this study contained 160,000 entries. By supplementing SE information for these 160,000 entries and adding the 15,000 prefabs accumulated later, the final prefabs lexicon containing prosodic information has 175,000 entries.

5. Experimental Validation

To validate the effectiveness of the prosodic lexicon, we conducted two experiments: F1-score model evaluation and MOS evaluation for Amdo Tibetan.

First, using a new speech database, we compared the accuracy of tone sandhi domain boundary prediction under segmentation with the prefabs lexicon and under conventional dictionary segmentation, measured by F1-score. The results show that segmentation with the prosodic lexicon predicts tone sandhi boundaries more accurately than conventional dictionary segmentation, and that this improvement also holds on data from another speaker.

Specifically, we used the 2,526 annotated sentences described above as the training set to build two tone sandhi domain prediction models, one using the prefabs lexicon and one using the conventional dictionary. We then used 282 sentences (56 minutes) of speech data from another Lhasa speaker as the test set, with manual tone annotation as the ground truth. Each model predicted tone sandhi domains for the unannotated test-set text, and we used F1-score to evaluate how well the machine predictions matched the human annotation.
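To compare a predicted segmentation with the manual annotation, each segmentation first has to be turned into a set of boundary positions. The sketch below shows one possible convention, which is an assumption made here and not necessarily the one used in the experiment: a boundary is the syllable offset at which a unit ends, and the sentence-final edge is not counted. The function name `boundary_positions` and the example segmentations are hypothetical.

```python
from typing import List, Set

def boundary_positions(units: List[List[str]]) -> Set[int]:
    """Return the syllable offsets at which a unit ends.

    The end of the sentence is excluded, since every segmentation shares it.
    """
    positions: Set[int] = set()
    offset = 0
    for unit in units[:-1]:
        offset += len(unit)
        positions.add(offset)
    return positions

# Hypothetical example: annotation groups tshod lta as one sandhi domain,
# while a syllable-by-syllable segmentation splits it.
gold = boundary_positions([["tshod", "lta"], ["byed"]])
pred = boundary_positions([["tshod"], ["lta"], ["byed"]])
print(sorted(gold), sorted(pred))
```

With boundaries expressed as sets, correctly predicted, spurious, and missed boundaries are simply the intersection and the two set differences.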

F1-score is a metric for measuring the accuracy of binary classification models, which comprehensively considers precision and recall as their harmonic mean. Precision reflects the model's accuracy rate, i.e., how many of the predicted positive samples are true positive samples; recall reflects the model's coverage rate, i.e., how many of all true positive samples are correctly predicted as positive. In this experiment, tone sandhi domain boundaries are treated as positive samples and non-tone sandhi domain boundaries as negative samples. Therefore:

  • TP: Number of correctly predicted tone sandhi domain boundaries
  • FP: Number of non-boundaries predicted as boundaries
  • FN: Number of boundaries predicted as non-boundaries

Experimental results show that the conventional dictionary yields an F1-score of 0.84. With the prefabs lexicon, precision on the 282-sentence test set is 0.894 and recall is 0.937, raising the F1-score from 0.84 to 0.92. Generally, the F1-score for Chinese word segmentation is approximately 0.95. Considering the widespread phenomenon of homographs with multiple meanings and pronunciations in Tibetan, and that high-quality resources for Tibetan are relatively scarce compared to Mandarin Chinese, an F1-score of 0.92 can be considered a good result.
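The metric definitions above can be written out as a short calculation. This is a minimal sketch using the standard formulas, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2PR / (P + R); the function names and the toy boundary sets are hypothetical and are not taken from the experiment.

```python
from typing import Set, Tuple

def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def boundary_prf(gold: Set[int], predicted: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for predicted tone sandhi domain boundaries."""
    tp = len(gold & predicted)   # boundaries predicted correctly
    fp = len(predicted - gold)   # non-boundaries predicted as boundaries
    fn = len(gold - predicted)   # boundaries predicted as non-boundaries
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

# Toy example with made-up boundary positions.
print(boundary_prf({2, 5, 7}, {2, 5, 6}))

# Harmonic mean of the precision and recall reported for the prefabs lexicon.
print(round(f1_from(0.894, 0.937), 3))
```

Applied to the reported precision of 0.894 and recall of 0.937, the harmonic mean comes out at roughly 0.915, in line with the reported F1-score of 0.92.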

Additionally, we demonstrated through MOS evaluation experiments that the prosodic lexicon also has cross-dialect transferability. Yixi Weisa Acuo (2003, 2004) points out that although Amdo Tibetan does not have lexically distinctive tones, it has a habitual pitch with two phonetic patterns for disyllabic words: "high-low" and "low-high," reflecting the difference between nouns and verbs. Xu Shiliang (2015) argues that tonal Tibetan has a sandhi pattern of "front not high, back not low," while toneless Tibetan has a habitual pitch pattern of "front low, back high," both possibly deriving from pre-tonal pitch concomitant features of Tibetan vocabulary. In the process of analyzing and annotating Lhasa and Amdo Tibetan speech data, we found significant consistency between the rhythmic units of Amdo Tibetan and the sandhi units of Lhasa Tibetan. After directly applying the Lhasa Tibetan prefabs lexicon to the Amdo Tibetan synthesis system, the MOS score improved to 4.17, indicating that the prosodic lexicon has universal linguistic value for different Tibetan dialects.

6. Conclusion

This study, based on an AI speech synthesis experimental platform, analyzes the chunking and dynamic changes of tone sandhi units in Lhasa Tibetan continuous speech. Through analysis and annotation of experimental data and observation of speech synthesis effects, the research results support the "rich memory representation" viewpoint. Humans often solve problems through step-by-step thinking, hypothesis generation, and reasoning to reach final answers. In language acquisition, children obtain linguistic input by observing and perceiving their surrounding language environment, gradually inferring grammatical rules and sentence structures, and applying them to produce new sentences and expressions. This reasoning process involves gradual understanding and application of vocabulary and grammar. Our view is that in addition to containing the most basic atomic components (such as characters or morphemes), the lexicon system also pre-stores prefabricated chunks containing rich prosodic and grammatical information. The linguistic fragments related to Lhasa Tibetan tone sandhi are dominant units in the Tibetan lexicon and are core elements that connect lexical, syntactic, and prosodic levels, reflecting the operational rules of the language system.

References

[1] GB/T 36452-2018. 2018. Specification for Tibetan Word Segmentation for Information Processing. State Administration for Market Regulation; Standardization Administration of China.

[2] Feng, Zhiwei. 2001. "Certain Non-grammatical Factors in Determining Segmentation Units." Journal of Chinese Information Processing (5).

[3] Gesang Jumian, Gesang Yangjing. 2004. Practical Tibetan Grammar Tutorial (Revised Edition). Chengdu: Sichuan Nationalities Publishing House.

[4] Guan, Bai. 2009. "Analysis of Several Concepts in Tibetan Word Segmentation." Journal of Tibet University (Natural Science Edition) (1).

[5] Guan, Bai. 2010. "Research on Tibetan Segmentation Units for Information Processing." Journal of Chinese Information Processing (3).

[6] Hu, Tan, Qu, Aitang, and Lin, Lianhe. 1982. "Experiments on Tibetan (Lhasa Dialect) Tones." Language Research (1).

[7] Hu, Tan. 1980. "Research on Tibetan (Lhasa Dialect) Tones." Minzu Chinese (1).

[8] Huang, Bufan. 1994. "The Emergence and Differentiation Conditions of Tones in Tibetan Dialects." Minzu Chinese (3).

[9] Jiang, Di. 2003. "Methods and Processes of Modern Tibetan Chunk-based Segmentation." Minzu Chinese (4).

[10] Ministry of Education Language and Information Management Division (Ed.). 2015. Draft Tibetan Latin Alphabet Transliteration Scheme, Draft Modern Tibetan Word Segmentation Specification for Information Processing, Draft Modern Tibetan Part-of-Speech Tagging Specification for Information Processing. Beijing: The Commercial Press.

[11] Laba Dunzhu, Ou, Zhu, Zu, Yiqing, and Pei, Chunbao. 2018. "Research on Disambiguation Methods for Tibetan Homographs." Journal of Chinese Information Processing (7).

[12] Long, Congjun, and Liu, Huidan. 2016. Theoretical and Methodological Research on Automatic Tibetan Word Segmentation. Beijing: Intellectual Property Publishing House.

[13] Qu, Aitang, and Jing, Song. 2000. Theories and Methods of Sino-Tibetan Language Research. Beijing: China Tibetology Publishing House.

[14] Qu, Aitang. 1981a. "Tibetan Tones and Their Development." Language Research (1).

[15] Qu, Aitang. 1981b. "Tone Sandhi in Tibetan." Minzu Chinese (4).

[16] Sun, Maosong, Wang, Hongjun, Li, Xingjian, Fu, Li, Huang, Changning, Chen, Songcen, Xie, Zili, and Zhang, Weiguo. 2001. "Word List for Modern Chinese Segmentation for Information Processing." Applied Linguistics (4).

[17] Tan, Kerang. 1982. "Discussion on Tone Classification and Notation in Lhasa Tibetan." Minzu Chinese (3).

[18] Wang, Yao. 1956. "Tones in Tibetan." Chinese Language (6).

[19] Wang, Hongjun. 2001. "The Internal Structure of the Word List for Modern Chinese Segmentation for Information Processing and the Structural Characteristics of Chinese." Applied Linguistics (4).

[20] Wang, Zhijing. 1994. Lhasa Tibetan Spoken Grammar. Beijing: Minzu University of China Press.

[21] Xu, Shiliang. 2015. "Habitual Pitch in Toneless Tibetan and Tone Sandhi in Tonal Tibetan." Language Research (4).

[22] Yixi Weisa Acuo. 2003. Research on the Mixing of Tibetan and Chinese Languages in "Daohua" and Deep Language Contact Studies. PhD Dissertation, Nankai University.

[23] Yixi Weisa Acuo. 2004. Research on Daohua. Beijing: Nationalities Publishing House.

[24] Yu, Daoquan. 1983. Tibetan-Chinese Lhasa Colloquial Dictionary. Beijing: Nationalities Publishing House.

[25] Zhou, Jiwen. 1983. Tibetan Phonetic Teaching Materials (Lhasa Sound). Beijing: Nationalities Publishing House.

[26] Zu, Yiqing, Lu, Chen, Ou, Zhu, Zhu, Ronghua, Liu, Chenning, Shao, Pengfei, Lu, Buta, Zhang, Xiao, and Hu, Guoping. 2022. "Basic Linguistic Operating Units SE in Continuous Speech—Experimental Evidence from Tone Sandhi in Lhasa Tibetan." Contemporary Linguistics (4).

[27] Baayen, R. H. 2007. "Storage and Computation in the Mental Lexicon." In The Mental Lexicon (pp. 81–104). Brill.

[28] Becker, J. D. 1975. "The Phrasal Lexicon." In Theoretical Issues in Natural Language Processing.

[29] Beckner, Clay, et al. 2009. "Language is a Complex Adaptive System: Position Paper." Language Learning, 59(s1): 1–26.

[30] Bybee, J. 2002. "Sequentiality as the Basis of Constituent Structure." In T. Givón and B. F. Malle (Eds.), The Evolution of Language Out of Pre-Language (pp. 109–132). Amsterdam: Benjamins.

[31] Bybee, J. L., and Beckner, C. 2015. "Usage-Based Theory." In B. Heine and H. Narrog (Eds.), The Oxford Handbook of Linguistic Analysis. Oxford University Press.

[32] Ding, N., Melloni, L., Zhang, H., Tian, X., and Poeppel, D. 2016. "Cortical Tracking of Hierarchical Linguistic Structures in Connected Speech." Nature Neuroscience, 19(1), 158–164.

[33] Erman, B., and Warren, B. 2000. "The Idiom Principle and the Open Choice Principle." Text & Talk, 20(1), 29–62.

[34] Faber, P. B., and Mairal Usón, R. 1999. Constructing a Lexicon of English Verbs. Berlin; New York: Mouton de Gruyter.

[35] Jarema, G., and Libben, G. 2007. "Introduction: Matters of Definition and Core Perspectives." In The Mental Lexicon (pp. 1–6). Brill.

[36] Kapatsinski, V., and Radicke, J. 2009. "Frequency and the Emergence of Prefabs: Evidence from Monitoring." Formulaic Language, 2, 499–520.

[37] Kent, R. D., and Read, C. 1992. The Acoustic Analysis of Speech. Singular Publishing Group.

[38] Lieber, R. 1980. On the Organization of the Lexicon (PhD Thesis). Massachusetts Institute of Technology.

[39] Nattinger, J. R., and DeCarrico, J. S. 1992. Lexical Phrases and Language Teaching. Oxford University Press.

[40] Selkirk, E. 1984. Phonology and Syntax: The Relation between Sound and Structure. Cambridge, MA: MIT Press.

[41] Sternberg, R. J., Sternberg, K., and Mio, J. 2016. Cognitive Psychology. Cengage Learning Press.

[42] Wylie, T. 1959. "A Standard System of Tibetan Transcription." Harvard Journal of Asiatic Studies, 22, 261–267.

Correspondence:

LU Chen (First Author)
510632, School of Chinese Language and Literature, Jinan University, Guangzhou; No. 601, West Huangpu Avenue, Tianhe District, Guangzhou, Guangdong Province, China; 13001023722; yousiruan@qq.com; WeChat: yousiruan

ZU Yiqing (Corresponding Author)
230088, iFLYTEK Co., Ltd. & Interdisciplinary Research Center for Language Sciences, University of Science and Technology of China; Room 410, Block A, No. 789 Tianxi Road, Changning District, Shanghai (iFLYTEK Shanghai Technology Co., Ltd.); 13501684302; yqzu@iflytek.com; WeChat: wxid_u7djzi0994mv22

LIU Chenning
230088, iFLYTEK Co., Ltd.

ZHANG Xiao
230088, iFLYTEK Co., Ltd.
