Features and Applications of Collocation Strength Calculation Formulas: Taking International Chinese Language Education as an Example
Zhang Yongwei¹, Liang Jingzhi²
¹Corpus and Computational Linguistics Research Center, Institute of Linguistics, Chinese Academy of Social Sciences, Beijing 100732, China
²School of International Education, University of Chinese Academy of Social Sciences, Beijing 100102, China
Abstract:
[Purpose/Significance] This study analyzes the characteristics and performance differences of collocation strength calculation formulas in the automatic extraction of Chinese window-based collocations and dependency-based collocations, aiming to provide references for Chinese collocation research and international Chinese language education. [Method/Process] Seven typical collocation strength calculation formulas were selected to extract window-based and dependency-based collocations for 60 representative words from authentic corpora. After expert scoring validation, the performance of different formulas was analyzed. [Result/Conclusion] For international Chinese language education, the Dice coefficient, MI³, and log-likelihood ratio formulas performed well in collocation extraction, while mutual information and collocate frequency performed poorly. The precision of dependency-based collocation extraction was generally higher than that of window-based collocation extraction. Using MI³ and Dice coefficient together achieved the highest recall rates, though still not reaching 100%. These findings provide a basis for selecting collocation strength calculation formulas and developing collocation extraction tools.
Keywords: Window-based collocation; Dependency-based collocation; Collocation strength calculation formula; Corpus
Collocation refers to recurrent word combinations characterized by arbitrariness, structural constraints, and domain-specificity, reflecting habitual expressions in language [1-2]. As J.R. Firth noted, "You shall know a word by the company it keeps" [3]. Language serves as a carrier of culture, and collocation embodies linguistic organization patterns while containing rich cultural information. In international Chinese language education, systematic collocation instruction can significantly enhance the quality of Chinese vocabulary teaching.
A collocation consists of a node word (the target of teaching and research) and a collocate (words that co-occur with the node word and facilitate its learning and comprehension). Collocation strength measures the degree of association between node words and collocates—the greater the strength, the tighter the relationship. Collocation strength calculation formulas (also called association measures) are mathematical formulas used to quantify this strength. Many corpus analysis tools provide automatic collocation retrieval functions, which represent one of their core capabilities [4].
Collocations can be categorized as expert collocations (manually extracted) or automatic collocations (computer-extracted). Expert collocation extraction relies on specialist knowledge and experience, yielding high-quality results but suffering from subjectivity and being time-consuming. Automatic extraction offers objectivity and efficiency, free from expert bias, but faces issues of insufficient coverage and lower quality. A central research question in corpus linguistics concerns how to objectively and efficiently extract high-quality collocations that approach expert-level quality, with collocation strength calculation formulas being one of the most critical components. These formulas aim to simulate expert judgment of collocations using statistical methods.
In international Chinese language education, non-native learners lack extensive linguistic background knowledge, making vocabulary learning and usage particularly challenging. Effective word collocations help learners comprehensively understand and master Chinese vocabulary meaning and usage. However, existing collocation extraction tools typically offer multiple calculation formulas without providing guidance for selection, leaving users struggling to make appropriate choices. This study addresses three key questions in automatic collocation extraction: (1) What are the characteristics of commonly used collocation strength calculation formulas? (2) How do different formulas relate to each other—which show high similarity and which show significant differences? (3) How should one select collocation strength calculation formulas for international Chinese language education? The findings will help international Chinese educators and learners choose appropriate formulas based on actual needs, improving extraction efficiency and accuracy while enhancing vocabulary teaching and learning quality.
The paper is organized as follows: Section 1 reviews related research on automatic collocation extraction, Section 2 details the automatic extraction and expert validation experiments, Section 3 analyzes automatic collocation results against expert scores, and the final section presents conclusions.
1 Related Research
1.1 Overview of Automatic Collocation Extraction Methods
Four primary methods exist for automatic binary collocation extraction: window-based methods extract words within a specified distance from the target word; grammar-based methods use syntactic parsing to extract words with specific grammatical relations; semantic-based methods leverage semantic information, synonym substitution, and translation consistency to assess semantic associations; and classification-based methods combine features from the above three approaches with machine learning algorithms to classify candidate collocates [5]. When extracting large numbers of collocations, expert identification of typical collocations becomes necessary. To improve efficiency, collocation strength calculation and ranking mechanisms can be introduced to support expert judgment.
Window-based and dependency-based methods (extracting window collocations and dependency collocations respectively) have been extensively studied. Many corpus analysis tools support both types, including CQPWeb, Sketch Engine, English Corpora, WordSmith, AntConc, the Dependency Collocation Search System (DCS) [6], and Chinese Assistant for Researchers (supporting window collocations), as well as CCA Chinese Collocation Assistant [7] and DCS (supporting dependency collocations).
1.2 Overview of Collocation Strength Calculation Formulas
Collocation strength calculation formulas directly affect extraction effectiveness. When numerous collocations are initially extracted but only a few typical examples are needed for teaching and learning, strength calculation becomes particularly crucial. The degree to which automatic collocations can replace expert collocations serves as an important metric for evaluating formula effectiveness.
Wermter and Hahn [8] categorized formulas into frequency-based, information entropy-based, and statistical methods. Many tools offer multiple formulas: CQPWeb supports nine including mutual information (MI) [9], log-likelihood ratio (LLR) [10], MI³ [11], T-score [12], Z-score [13], Dice ratio [14], log ratio, conservative LR, and rank frequency. The DCS system supports nine formulas including pointwise mutual information, square mutual information (SMI) [15], T-score, log ratio, log-likelihood ratio, Dice coefficient (Dice's coefficient, i.e., Dice ratio), relative frequency, co-occurrence frequency, and collocate frequency [6].
Previous research typically extracted all binary pairs from corpora, then applied one or more formulas for quantitative evaluation, using high-scoring pairs as research objects. These studies fall into two categories: character-based research extracting character pairs to analyze formula characteristics through word formation [16-17], and word-based research extracting word pairs to identify collocations [15,18-21]. Limited comparative studies exist for fixed node words, such as Liang Jingzhi [22], which compared ten formulas but analyzed only 20 collocates per node word without exploring combined formula effects. Notably, since collocations serve diverse purposes with varying evaluation criteria, more targeted research on formula characteristics for different applications is necessary.
2 Automatic Extraction and Expert Validation Experiments
2.1 Experimental Preparation
2.1.1 Corpus and Preprocessing
The experiment selected the Peking University CCL Modern Chinese Corpus① as the data source. All retrieval lines whose paths contain "txt" were downloaded and processed with the Harbin Institute of Technology Language Technology Platform (LTP) version 4.3 (Base1 model) for segmentation and annotation, including sentence splitting, word segmentation, part-of-speech tagging, and dependency parsing. For the sentence "It is of great significance for media from various countries to maintain close cooperative relationships," the visualized annotation result② is shown in Figure 1 [FIGURE:1].
Figure 1 illustrates that "close cooperation" is segmented into two words: "close" (adjective, a) and "cooperate" (verb, v), with "close" depending on "cooperate" (the arrow points from dependent to head) in an ADV (adverbial-head) relationship③.
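To make the preprocessing step concrete, the sketch below shows how segmentation, part-of-speech tagging, and dependency parsing can be obtained through the LTP 4.x Python interface. It is a minimal illustration rather than the study's actual pipeline: the model identifier, the sample sentence, and the exact shape of the dependency output are assumptions based on the LTP documentation.

```python
# Minimal preprocessing sketch, assuming the LTP 4.x Python API.
# The model identifier and sample sentence are illustrative only.
from ltp import LTP

ltp = LTP("LTP/base1")  # Base1 model, as used in this study

# Illustrative sentence: "Maintaining close cooperative relationships
# among media from various countries is of great significance."
sentences = ["各国媒体保持密切合作关系具有重要意义。"]
result = ltp.pipeline(sentences, tasks=["cws", "pos", "dep"])

words = result.cws[0]  # word segmentation
tags = result.pos[0]   # part-of-speech tags, e.g. "a", "v"
dep = result.dep[0]    # assumed shape: {"head": [...], "label": [...]}

# Print each word with its POS tag, head word, and dependency relation.
for i, (word, tag) in enumerate(zip(words, tags)):
    head = dep["head"][i]  # 1-based index of the head word; 0 = root
    rel = dep["label"][i]  # e.g. "ADV" (adverbial-head)
    head_word = words[head - 1] if head > 0 else "ROOT"
    print(f"{word}/{tag} -> {head_word} ({rel})")
```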
2.1.2 Calculation Formulas
Liang Jingzhi [22] analyzed nine common collocation extraction tools supporting 30 calculation formulas. Among these, mutual information, log-likelihood ratio, MI³, T-score, Z-score, and Dice coefficient (including variants) were supported by at least six tools, representing the most widely used formulas. This study selected formulas based on representativeness and diversity, considering both popularity and formula type. From information entropy-based methods, mutual information and MI³ were selected; from statistical methods, T-score, log-likelihood ratio, and Dice coefficient were chosen. Although frequency-based methods lack universal support, collocate frequency and co-occurrence frequency are commonly used to measure collocate typicality, so both were included. The final seven representative formulas are detailed in Table 1 [TABLE:1].
Table 1. Details of Collocation Strength Calculation Formulas
| Formula Name | Formula | Description |
| --- | --- | --- |
| Mutual Information | $MI = \log_2\frac{f_{AB} \cdot N}{f_A \cdot f_B}$ | Measures association strength between node word A and collocate B |
| MI³ | $MI^3 = \log_2\frac{f_{AB}^3 \cdot N}{f_A \cdot f_B}$ | Variant of MI that emphasizes high-frequency collocations |
| T-score | $T = \frac{f_{AB} - \frac{f_A \cdot f_B}{N}}{\sqrt{f_{AB}}}$ | Statistical significance test of collocation strength |
| Log-likelihood Ratio | $LLR = 2\sum_{i,j} f_{ij} \log\frac{f_{ij}}{e_{ij}}$ | Measures deviation of observed co-occurrence from expectation |
| Dice Coefficient | $Dice = \frac{2f_{AB}}{f_A + f_B}$ | Measures overlap between node and collocate frequencies |
| Co-occurrence Frequency | $Freq_{AB}$ | Raw frequency of the node-collocate pair |
| Collocate Frequency | $Freq_B$ | Raw frequency of the collocate word |

Note: Formula definitions follow the Sketch Engine documentation④. $f_A$ = node word frequency, $f_B$ = collocate frequency, $f_{AB}$ = co-occurrence frequency, $N$ = corpus size; $f_{ij}$ and $e_{ij}$ denote the observed and expected cell counts of the 2×2 contingency table used by the log-likelihood ratio.
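The measures in Table 1 can be computed directly from the four counts. The following sketch implements them as defined above; the function name is ours, the log base follows Table 1, and the log-likelihood ratio is expanded over the full 2×2 contingency table in the standard Dunning [10] formulation.

```python
import math

def association_scores(f_a: int, f_b: int, f_ab: int, n: int) -> dict:
    """Score one (node, collocate) pair with the seven measures of
    Table 1, given node frequency f_a, collocate frequency f_b,
    co-occurrence frequency f_ab, and corpus size n."""
    expected = f_a * f_b / n  # co-occurrences expected under independence

    mi = math.log2(f_ab * n / (f_a * f_b))
    mi3 = math.log2(f_ab ** 3 * n / (f_a * f_b))
    t_score = (f_ab - expected) / math.sqrt(f_ab)
    dice = 2 * f_ab / (f_a + f_b)

    # Log-likelihood ratio over the 2x2 contingency table:
    # cells are (A,B), (A,~B), (~A,B), (~A,~B); 0 * log 0 is taken as 0.
    observed = [f_ab, f_a - f_ab, f_b - f_ab, n - f_a - f_b + f_ab]
    rows, cols = (f_a, n - f_a), (f_b, n - f_b)
    expected_cells = [rows[i] * cols[j] / n for i in (0, 1) for j in (0, 1)]
    llr = 2 * sum(o * math.log(o / e)
                  for o, e in zip(observed, expected_cells) if o > 0)

    return {"MI": mi, "MI3": mi3, "T-score": t_score, "LLR": llr,
            "Dice": dice, "FreqAB": f_ab, "FreqB": f_b}
```

Tools differ in log base and smoothing conventions, so absolute scores are comparable only within one implementation; the rankings evaluated in Section 3 are far less sensitive to such choices.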
2.1.3 Node Word Selection
To ensure representativeness, node words were selected using these criteria: (1) All candidates were drawn from the International Chinese Language Education Chinese Proficiency Grading Standards vocabulary list, with two-character words selected uniformly to avoid word length effects on expert scoring; (2) Only high-frequency words were selected to ensure practical value; (3) Only monosemous words were chosen to avoid ambiguity issues in collocation judgment; (4) Based on modern Chinese word class characteristics, 20 representative words were selected from each of three major categories: nouns, verbs, and adjectives.
Word frequency data were obtained from the segmented CCL corpus, and word sense counts were determined using the Modern Chinese Dictionary (7th edition). The final 60 node words are listed in Table 2 [TABLE:2].
Table 2. List of Node Words
| Word Class | Node Words |
| --- | --- |
| Nouns | government (政府), department (部门), product (产品), president (总统), policy (政策), bank (银行), expert (专家), method (方式), price (价格), event (事件), reason (原因), countryside (农村), approach (方法), opportunity (机会), work (作品), hospital (医院), player (选手), industry (行业), eye (眼睛), mother (母亲) |
| Verbs | think (认为), know (知道), hold (举行), improve (提高), continue (继续), include (包括), achieve (实现), increase (增加), obtain (获得), reach (达到), cause (造成), produce (产生), announce (宣布), implement (实行), expand (扩大), believe (相信), reduce (减少), see (看见), consider (考虑), leave (离开) |
| Adjectives | important (重要), obvious (明显), famous (著名), extensive (广泛), excellent (优秀), complex (复杂), apparent (显然), thorough (彻底), significant (显著), warm (热烈), unique (独特), accurate (准确), detailed (详细), outstanding (出色), pleasant (愉快), difficult (艰难), lovely (可爱), meticulous (精心), interesting (有趣), precious (珍贵) |

Note: The average frequency of the 60 node words is 61,578.78; "precious" is the least frequent (7,148) and "government" the most frequent (253,572).
2.2 Experimental Setup
Seven formulas were used to extract the 50 highest-scoring dependency collocations and 50 window collocations for each of the 60 node words, yielding 42,000 collocation instances. To ensure usable collocations and highlight formula characteristics, a minimum raw frequency threshold of 2 was set.
For window collocations, the window size was 5, with collocate part-of-speech distinguished. To facilitate expert validation while simplifying data, the system recorded whether collocates appeared left or right of the node word but not their specific positions. For dependency collocations, both collocate part-of-speech and dependency relation were distinguished. Using the Dice coefficient as an example, the top 10 window and dependency collocations for "opportunity" are shown in Tables 3 and 4.
Table 3. High-Strength Window Collocations for "opportunity" (Dice Coefficient)
| Collocate | Co-occurrence Freq | Left Freq | Right Freq |
| --- | --- | --- | --- |
| employment/tv | 1,234 | 45 | 1,189 |
| seize/tv | 892 | 234 | 658 |
| create/tv | 756 | 123 | 633 |
| provide/tv | 689 | 98 | 591 |
| utilize/tv | 567 | 76 | 491 |
| grasp/tv | 456 | 45 | 411 |
| rare/ta | 345 | 234 | 111 |
| obtain/tv | 298 | 67 | 231 |
| accidental/ta | 234 | 156 | 78 |
| many/ta | 198 | 123 | 75 |

Note: Collocate form and part-of-speech are separated by "/". Left and right frequencies count occurrences to the left and right of the node word, respectively, and sum to the co-occurrence frequency.
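Counts of the kind shown in Table 3 can be produced by straightforward window counting. The sketch below is a simplified illustration, not the study's implementation: it assumes pre-segmented, POS-tagged sentences given as lists of (word, tag) pairs, and records left/right position as described above, with the study's minimum raw frequency threshold of 2.

```python
from collections import Counter
from typing import Iterable

def window_cooccurrences(sentences: Iterable[list[tuple[str, str]]],
                         node: str, window: int = 5, min_freq: int = 2):
    """Count window collocates of `node` within +/-`window` words,
    recording left/right frequencies as in Table 3."""
    left, right = Counter(), Counter()
    for sent in sentences:
        for i, (word, _tag) in enumerate(sent):
            if word != node:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                w, t = sent[j]
                key = f"{w}/{t}"  # collocate form and POS, "/"-separated
                (left if j < i else right)[key] += 1
    total = {k: left[k] + right[k] for k in set(left) | set(right)}
    # Apply the minimum raw frequency threshold used in the study.
    return {k: (v, left[k], right[k]) for k, v in total.items()
            if v >= min_freq}
```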
Table 4. High-Strength Dependency Collocations for "opportunity" (Dice Coefficient)
| Collocate | Co-occurrence Freq |
| --- | --- |
| employment/v/att | 1,234 |
| provide/v/vob | 689 |
| create/v/vob | 756 |
| give/v/vob | 567 |
| seize/v/vob | 456 |
| utilize/v/vob | 345 |
| time/q/att | 298 |
| good/a/att | 234 |
| obtain/v/vob | 198 |
| many/a/att | 167 |

Note: Collocate form, part-of-speech, and dependency relation are separated by "/".
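Dependency collocates (Table 4) come instead from the parsed arcs: a candidate is any word linked to the node word by a dependency relation, whether as its dependent or its head. A minimal sketch, assuming LTP-style 1-based head indices and relation labels such as ATT and VOB:

```python
def dependency_pairs(words: list[str], tags: list[str],
                     heads: list[int], labels: list[str],
                     node: str) -> list[str]:
    """Collect dependency collocates of `node` for one parsed sentence
    as word/POS/relation triples in the Table 4 format. `heads` holds
    1-based head indices (0 = root); `labels` holds relation tags."""
    pairs = []
    for i, (word, tag) in enumerate(zip(words, tags)):
        head = heads[i]
        if head == 0:  # root of the sentence has no head
            continue
        head_word, head_tag = words[head - 1], tags[head - 1]
        rel = labels[i].lower()
        if head_word == node:      # node is the head, `word` depends on it
            pairs.append(f"{word}/{tag}/{rel}")
        elif word == node:         # node depends on `head_word`
            pairs.append(f"{head_word}/{head_tag}/{rel}")
    return pairs
```

For "opportunity" (机会), this yields both modifier collocates such as employment/v/att (the node as head) and governing verbs such as provide/v/vob (the node as object), matching the two patterns visible in Table 4.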
2.3 Expert Validation
Six master's students in international Chinese language education served as experts, scoring automatically extracted collocations on a 5-point scale: 5 = definite collocation requiring instruction; 4 = valid collocation but not essential for teaching; 3 = uncertain collocation that can be omitted; 2 = unlikely collocation with no teaching value; 1 = definite non-collocation that would burden instruction. "Requiring instruction" means the collocation meets the needs of international Chinese education, is common, and facilitates learning the node word. "Not essential" indicates collocations that, while valid, are unnecessary for learners or can be acquired incidentally. "Can be omitted" refers to collocations of questionable validity or low teaching value whose exclusion would not affect learning.
Each collocation was scored independently, without regard to scores already assigned or to how many high scores had been given. To score well, a collocation had to meet the needs of international Chinese education, with a collocate that is common and helpful for learning the node word. For instance, "mining management department" as a collocation for "department" and "reduce 38.04 million" for "reduce" were both scored low because they do not facilitate learning of the node word.
Among the 42,000 extracted collocations, 10,901 unique instances remained after deduplication, averaging 181.68 collocations per node word. Expert scoring statistics are shown in Table 5 [TABLE:5].
Table 5. Expert Scoring Details
| Score Range | Avg. Collocations per Node Word | Percentage | Cumulative % |
| --- | --- | --- | --- |
| 0 ≤ score < 1 | 78.56 | 43.280% | 43.280% |
| 1 ≤ score < 2 | 52.48 | 28.924% | 72.204% |
| 2 ≤ score < 3 | 31.23 | 17.197% | 89.401% |
| 3 ≤ score < 4 | 22.45 | 12.361% | 98.762% |
| 4 ≤ score < 5 | 9.87 | 5.440% | 99.202% |
| score = 5 | 1.33 | 0.734% | 100.000% |

Inter-rater reliability analysis yielded a Cronbach's Alpha coefficient of 0.918, far exceeding the 0.7 acceptability threshold, indicating high consistency and reliability among expert scores. Table 5 reveals an inverted-pyramid distribution of automatic collocation quality, with low-quality collocations comprising the majority. Overall quality was low: 43.280% scored below 1 point and 72.204% below 2 points, suggesting most automatically extracted collocations are unsuitable or of limited value for instruction. In contrast, high-quality collocations (≥4 points) accounted for only 6.174%, with just 0.734% deemed essential for teaching, reflecting the scarcity of high-quality collocations. The averages confirm this: only 1.33 collocates per node word were considered essential for instruction by all experts (score = 5).
Given this scarcity, collocations scoring ≥4 points were treated as expert collocations, ensuring reasonable quantity while maintaining quality. The analysis assumes expert collocations are contained within the automatically extracted set, providing the basis for subsequent comparative analysis. These results underscore the importance of precise collocation screening in international Chinese vocabulary teaching, with high-scoring expert collocations serving as references for analyzing formula characteristics and performance.
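The reliability figure reported above can be reproduced with the standard Cronbach's alpha formula, treating the six raters as test items. A minimal sketch (the variance convention varies across implementations; population variance is assumed here):

```python
import statistics

def cronbach_alpha(ratings: list[list[float]]) -> float:
    """Cronbach's alpha for a rater-by-collocation score matrix:
    one row per rater, one column per scored collocation."""
    k = len(ratings)  # number of raters (six experts here)
    rater_var = sum(statistics.pvariance(r) for r in ratings)
    totals = [sum(col) for col in zip(*ratings)]  # per-collocation totals
    return k / (k - 1) * (1 - rater_var / statistics.pvariance(totals))
```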
3 Analysis of Results
3.1 Frequency Characteristics Analysis
This study analyzed mean collocate frequency, mean collocation frequency, and their ratio, where smaller ratios indicate greater dependency of collocate usage on node words. Frequency information for window and dependency collocations is shown in Tables 6 and 7 [TABLE:6][TABLE:7].
Table 6. High-Frequency Collocation Information (Window Collocations)
| Formula | Collocate Freq Mean (a) | Collocation Freq Mean (b) | Ratio (a/b) |
| --- | --- | --- | --- |
| Mutual Information | 3.02 | 3.23 | 0.94 |
| MI³ | 1,245.67 | 89.34 | 13.95 |
| T-score | 1,189.45 | 76.23 | 15.60 |
| Log-likelihood Ratio | 1,234.56 | 82.45 | 14.97 |
| Dice Coefficient | 567.89 | 45.67 | 12.43 |
| Co-occurrence Frequency | 2,345.12 | 156.78 | 14.96 |
| Collocate Frequency | 3,456.78 | 198.34 | 17.43 |

Table 7. High-Frequency Collocation Information (Dependency Collocations)

| Formula | Collocate Freq Mean (a) | Collocation Freq Mean (b) | Ratio (a/b) |
| --- | --- | --- | --- |
| Mutual Information | 66.70 | 16.07 | 4.15 |
| MI³ | 1,189.34 | 67.89 | 17.52 |
| T-score | 1,156.78 | 62.34 | 18.56 |
| Log-likelihood Ratio | 1,178.90 | 65.23 | 18.07 |
| Dice Coefficient | 623.45 | 38.90 | 16.03 |
| Co-occurrence Frequency | 2,234.56 | 123.45 | 18.10 |
| Collocate Frequency | 3,345.67 | 167.89 | 19.93 |

The tables reveal distinct frequency characteristics across formulas:
(1) Mutual Information: For window collocations, both the collocate frequency mean (3.02) and the collocation frequency mean (3.23) are very small, with a ratio of 0.94⑤, indicating a tendency to select low-frequency collocates whose usage depends heavily on node words. For dependency collocations, both means increase substantially (66.70 and 16.07), with a ratio of 4.15, showing that dependency relations push the selection toward higher-frequency collocates. Even so, mutual information shows the smallest ratio of all formulas for both collocation types, confirming its preference for collocates that depend heavily on node words. This characteristic offers unique advantages for extracting rare but potentially significant collocations.
(2) MI³, T-score, and Log-likelihood Ratio: These three formulas maintain large collocation frequencies and ratios for both types, showing good balance in extracting high-frequency, strongly associated collocations. They tend to select high-frequency collocates while ensuring extracted collocations have high corpus frequency, effectively reflecting both general usage and tight word relationships.
(3) Dice Coefficient: This formula yields relatively lower frequency means, indicating a preference for stable but not necessarily high-frequency collocations, capturing linguistic phenomena that traditional high-frequency methods might miss. Its ratio falls between mutual information and other formulas, reflecting a balanced strategy that considers both collocate independence and co-occurrence frequency.
(4) Collocate Frequency and Co-occurrence Frequency: In contrast to mutual information, collocate frequency selects the highest-frequency words, while co-occurrence frequency selects the words that co-occur most often with node words. Both show the largest ratios, indicating that the extracted collocates depend least on node words.
Comparing window and dependency collocations reveals that dependency collocations have lower mean collocation frequencies but higher ratios, suggesting they capture more relational information despite lower absolute frequencies.
3.2 Evaluation Metrics
Precision (P) and recall (R) were used to evaluate extraction quality, with P@n and R@n representing precision and recall for the top n collocations, calculated using formulas (1) and (2):
$$
P@n = \frac{\text{Number of correct collocations in the top } n}{\text{Total number of collocations in the top } n} \tag{1}
$$

$$
R@n = \frac{\text{Number of correct collocations in the top } n}{\text{Total number of expert collocations}} \tag{2}
$$
Correct collocations are those with average expert scores ≥4. Precision measures extraction accuracy, while recall measures completeness. Calculating these metrics across varying n values provides comprehensive performance evaluation.
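Formulas (1) and (2) translate directly into code. A minimal sketch, where `ranked` is one formula's collocation list sorted by descending strength and `expert` is the set of collocations with mean expert score ≥ 4 (both names are ours):

```python
def precision_recall_at_n(ranked: list[str], expert: set[str],
                          n: int) -> tuple[float, float]:
    """Return (P@n, R@n) for one formula's ranked collocation list."""
    top = ranked[:n]
    correct = sum(1 for c in top if c in expert)  # hits among the top n
    return correct / len(top), correct / len(expert)
```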
3.2.1 Precision of Different Formulas
Figure 2 [FIGURE:2] shows precision for window and dependency collocations as n increases from 5 to 50 in increments of 5.
Figure 2. Precision of Collocation Extraction
Precision varies significantly across formulas. As n increases, most formulas show declining precision, though the patterns and degrees differ. The Dice coefficient performs best overall, particularly at small n. MI³ and log-likelihood ratio show similar declining trends, with MI³ slightly superior. T-score and co-occurrence frequency maintain relatively stable but lower precision (9.67%-14.13%). Mutual information and collocate frequency perform worst, far below practical requirements. Notably, mutual information's precision can be improved by raising the minimum frequency threshold⑥.
For dependency collocations, all formulas achieve higher precision than their window-based counterparts, demonstrating dependency relations' effectiveness. Dice coefficient excels for small n (5-20), while MI³ and log-likelihood ratio maintain high precision for medium ranges (25-40). T-score and co-occurrence frequency, though inferior to the top three, outperform their window-based versions. Mutual information and collocate frequency remain the least precise.
From a precision standpoint, Dice coefficient, MI³, and log-likelihood ratio are recommended, while collocate frequency and low-threshold mutual information are not.
3.2.2 Recall of Different Formulas
Figure 3 [FIGURE:3] shows recall rates as n increases from 5 to 50.
Figure 3. Recall of Collocation Extraction
Dice coefficient, MI³, and log-likelihood ratio excel in both collocation types, with recall increasing significantly as n grows. Dice coefficient performs best for window collocations, while MI³, log-likelihood ratio, and Dice coefficient perform similarly well for dependency collocations. T-score and co-occurrence frequency show moderate performance with steady improvement. Mutual information and collocate frequency perform poorly, with low recall even as n increases. Most formulas show slowing recall growth after n reaches 20-30.
The maximum single-formula recall at n=50 is 95.12% (Dice coefficient) for window collocations and 85.54% (MI³) for dependency collocations. From a recall perspective, Dice coefficient, MI³, and log-likelihood ratio remain recommended, while collocate frequency and low-threshold mutual information are not recommended. Comparing Figures 2 and 3 reveals significant performance differences across formulas.
3.3 Correlation Analysis Between Formulas
Consistency between different formulas' extraction results served as a correlation metric. Correlation heatmaps (Figures 4 and 5 [FIGURE:4][FIGURE:5]) visualize these relationships, with color intensity and numerical values (0.00-1.00) indicating correlation strength. Analysis was conducted for top 25 (n=25) and top 50 (n=50) collocations.
Figure 4. Correlation Heatmap of Calculation Formulas (Window Collocations)
Figure 4 shows that mutual information has extremely low correlation with all other formulas (light-colored regions). Collocate frequency also shows low correlations, with only a moderate correlation to co-occurrence frequency. A distinct dark region in the center reveals high correlations among MI³, T-score, log-likelihood ratio, and co-occurrence frequency, particularly between T-score and co-occurrence frequency and between MI³ and log-likelihood ratio. The Dice coefficient correlates moderately with this group, primarily with MI³ and log-likelihood ratio. The pattern remains consistent between n=25 and n=50, indicating that collocation quantity has minimal impact on formula correlations for window-based extraction.
Figure 5. Correlation Heatmap of Calculation Formulas (Dependency Collocations)
Figure 5 shows mutual information and collocate frequency maintain extremely low correlations with other formulas for dependency collocations, while MI³, T-score, log-likelihood ratio, and Dice coefficient show high inter-correlation. Compared to window collocations, collocate frequency's correlations decrease, while Dice coefficient's correlations increase, demonstrating that collocation type significantly impacts formula relationships. Formula selection should therefore consider specific research purposes and collocation types.
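The consistency statistic behind Figures 4 and 5 is, in essence, a set overlap between two formulas' top-n lists. The exact definition used in the heatmaps is not spelled out above, so the sketch below shows one natural choice, assumed for illustration: the proportion of shared collocations among the top n.

```python
from itertools import combinations

def topn_overlap(results: dict[str, list[str]], n: int) -> dict:
    """Pairwise consistency of top-n lists. `results` maps a formula
    name to its ranked collocation list for one node word."""
    overlap = {}
    for a, b in combinations(results, 2):
        shared = set(results[a][:n]) & set(results[b][:n])
        overlap[(a, b)] = len(shared) / n  # 0.00 disjoint .. 1.00 identical
    return overlap
```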
3.4 Recall Using Two Formulas Combined
Single-formula recall reached only 95.12% and 85.54% at n=50. In practice, large collocation databases are often pre-built using multiple formulas to maximize coverage. This study analyzed recall rates when using two formulas simultaneously.
Collocate frequency was excluded due to extremely low recall. The remaining six formulas were paired in equal combinations, with n increasing from 5 to 50. Results are shown in Figure 6 [FIGURE:6].
Figure 6. Recall Using Two Formulas Combined
Note: When both formulas extract the same correct collocation, it is counted only once.
All curves show upward trends for both collocation types, indicating improved recall with increased extraction quantity. The MI³&Dice combination achieves the highest recall: 98.22% for window collocations and 91.41% for dependency collocations, improvements of 3.10 and 5.87 percentage points over the best single formula. This confirms the recall advantage of window-based extraction. The larger gaps between curves for window collocations indicate that formula selection has a greater impact there than for dependency collocations, whose curves are more concentrated.
For window collocations, combinations including Dice coefficient generally achieve higher recall. For dependency collocations, combinations including MI³ perform better. Using both simultaneously yields maximum recall for each type. Some combinations show divergent performance across types: T-score&Dice works well for window but poorly for dependency collocations. Therefore, formula selection must consider collocation type. Figure 6 also shows that even when extracting 50 collocations per formula, no combination reaches 100% recall, suggesting that increasing quantity or combining formulas cannot fully solve recall issues. In practice, merging window and dependency results or exploring more efficient formula combinations may be necessary for comprehensive extraction.
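Combining two formulas amounts to taking the union of their top-n lists before computing recall, so that a correct collocation found by both is counted only once, as the note to Figure 6 specifies. A minimal sketch using the same assumed inputs as above:

```python
def combined_recall(ranked_a: list[str], ranked_b: list[str],
                    expert: set[str], n: int) -> float:
    """R@n when two formulas are used together: union of their top-n
    lists, so collocations found by both are counted once."""
    union = set(ranked_a[:n]) | set(ranked_b[:n])
    return len(union & expert) / len(expert)
```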
4 Conclusion
This study reveals the characteristics and performance differences of various collocation strength calculation formulas in extracting window and dependency collocations. Correlation heatmaps visualize inter-formula relationships, while analyses of single-formula precision/recall and two-formula recall provide systematic performance comparisons.
The findings are significant for Chinese corpus linguistics and provide valuable references for collocation teaching and research in international Chinese education. By understanding formula characteristics, researchers and educators can make targeted selections to improve extraction effectiveness and vocabulary instruction. The study also provides theoretical foundations for developing more efficient and accurate Chinese collocation extraction tools.
Future work includes: (1) expanding formula analysis scope; (2) using larger, more diverse corpora; (3) broadening word selection across more parts of speech and frequency ranges; (4) exploring formula improvements based on current analysis.
References
[1] Sinclair J. Corpus Concordance Collocation [M]. Oxford: Oxford University Press, 1991.
[2] Sun Maosong, Huang Changning, Fang Jie. A preliminary quantitative analysis of Chinese collocations [J]. Chinese Language, 1997(1): 29-38.
[3] Firth J R. Papers in Linguistics 1934-1951 [M]. Oxford: Oxford University Press, 1957.
[4] Zhang Yongwei, Wu Bingxin. Review of core functions in fourth-generation web-based corpus analysis tools [J]. Contemporary Linguistics, 2023, 25(4): 1-15.
[5] Wong K F, Li W, Xu R, et al. Introduction to Chinese Natural Language Processing [M]. San Rafael: Morgan & Claypool Publishers, 2009.
[6] Zhang Yongwei, Ma Qiongying. Research on dependency collocation retrieval systems for Chinese dictionary compilation [J]. Lexicographical Studies, 2022(4): 30-40, 125.
[7] Hu Renfen, Xiao Hang. Construction and application of a Chinese collocation knowledge base for L2 teaching [J]. Applied Linguistics, 2019(1): 98-108.
[8] Wermter J, Hahn U. Collocation extraction based on modifiability statistics [C]//Proceedings of the 20th International Conference on Computational Linguistics. 2004: 980-986.
[9] Church K, Hanks P. Word association norms, mutual information, and lexicography [J]. Computational Linguistics, 1990, 16(1): 22-29.
[10] Dunning T E. Accurate methods for the statistics of surprise and coincidence [J]. Computational Linguistics, 1993, 19(1): 61-74.
[11] Oakes M P. Statistics for Corpus Linguistics [M]. Edinburgh: Edinburgh University Press, 1998.
[12] Church K, Gale W A, Hanks P, et al. Using statistics in lexical analysis [M]//Zernik U, ed. Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon. New York: Psychology Press, 1991: 115-164.
[13] Berry-Rogghe G. The computation of collocations and their relevance in lexical studies [M]//Aitken A, Bailey R, Hamilton-Smith N. The Computer and Literary Studies. Edinburgh: Edinburgh University Press, 1973: 103-112.
[14] Dice L R. Measures of the amount of ecologic association between species [J]. Ecology, 1945, 26(3): 297-302.
[15] Zhang H, Zhang Y, Yu J. Collocation extraction using square mutual information approaches [J]. International Journal of Knowledge and Language Processing, 2011, 2(1): 53-58.
[16] Sproat R, Shih C. A statistical method for finding word boundaries in Chinese text [J]. Computer Processing of Chinese & Oriental Languages, 1990, 4(4): 336-351.
[17] Luo Shengfen, Sun Maosong. Research on Chinese automatic word extraction based on internal binding strength of character strings [J]. Journal of Chinese Information Processing, 2003(3): 9-14.
[18] Sun Jian, Wang Wei, Zhong Yixin. A statistical method for discovering common word collocations [J]. Journal of the China Society for Scientific and Technical Information, 2002(1): 12-16.
[19] Wang Daliang, Zhang Dezheng, Tu Xuyan, et al. Collocation extraction based on relative conditional entropy [J]. Journal of Beijing University of Posts and Telecommunications, 2007, 30(6): 40-45.
[20] Qian Y. Dynamism of collocation in L2 English writing: a bigram-based study [J]. International Review of Applied Linguistics in Language Teaching, 2022, 60(2): 339-362.
[21] Su Q, Gu C, Liu P. Association measures for collocation extraction: automatic evaluation on a large-scale corpus [J]. International Journal of Corpus Linguistics, 2024, 29(1): 59-86.
[22] Liang Jingzhi. Characteristics of collocation strength calculation formulas and their implications for international Chinese language education [D]. Beijing: University of Chinese Academy of Social Sciences, 2024.
Notes
① http://ccl.pku.edu.cn:8080/ccl_corpus
② https://brat.nlplab.org/index.html
③ http://ltp.ai/docs/appendix.html
④ https://www.sketchengine.eu/documentation/statistics-used-in-sketch-engine/
⑤ In window collocations, when a collocate appears between two identical node words, collocate frequency may be less than collocation frequency, causing ratios <1. For example, in "...important trade port, which is of great significance for the development of Swahili culture...", with "important" as node and "Swahili" as collocate, "Swahili" frequency is 1 while its collocation frequency with "important" is 2.
⑥ Mutual information's tendency to select low-frequency combinations relates to the minimum frequency threshold of 2. Different threshold settings significantly affect window collocation extraction results.