Research article: "Vocabulary richness of Early Chinese texts: macroanalysis of the Thirteen Classics and the Zhuangzi"

Field: Linguistics and Literary Studies

CC BY
Keywords
CHINESE CANONS / THE THIRTEEN CLASSICS / COMPUTATIONAL LINGUISTICS / QUANTITATIVE LINGUISTICS / VOCABULARY RICHNESS / LEXICAL DIVERSITY / TYPE-TOKEN RATIO / DIGITAL CORPORA / STYLOMETRY / EARLY CHINESE TEXTS / MATHEMATICAL LINGUISTICS / CHARACTER INVENTORY

Abstract of the research article in linguistics and literary studies. Author: Zinin Sergey

This study analyzes statistical data regarding the vocabulary richness of the Warring States Project CTexts collection of Chinese classics[97]. Vocabulary richness has been used primarily in quantitative linguistics for authorship identification and style analysis, and it is increasingly applied in other linguistic fields, for example in language acquisition research. This study lays the foundation for a quantitative linguistic analysis of the vocabulary of early Chinese texts. It conducts a macroanalysis of the data, which includes calculating several vocabulary richness indices and charting vocabulary growth. The study finds significant differences in the vocabulary growth of the corpus texts. It shows that the Shi Jing and the Yi Li occupy the two extreme ends of the vocabulary growth spectrum and identifies some historical texts in the middle of the spectrum as a distinct group. Finally, the study takes a closer look at specific components of vocabulary growth: hapax legomena, dis legomena, and the most frequent characters.


Text of the research article "Vocabulary richness of Early Chinese texts: macroanalysis of the Thirteen Classics and the Zhuangzi"

S.V. Zinin*

Comparative analysis of the vocabulary of the early Chinese canons: a macroanalysis of the texts of the Thirteen Classics

ABSTRACT: The article presents a comparative analysis of the quantitative characteristics of the vocabulary of most of the early Chinese canons included in the Thirteen Classics. The author seeks to build a statistical basis for the quantitative analysis of the vocabularies of Chinese texts by determining various indicators of the range of the vocabulary (in practice, of the character inventory) and the pattern of vocabulary growth. Comparing the texts on this basis reveals significant differences between them, in particular between the Shi Jing and the Li Ji, which in this respect represent the poles of the Thirteen Classics system. The historical texts of the Thirteen Classics form, in this respect, a group with similar characteristics. The article also presents the results of an analysis of changes in the composition of rare and frequent characters.

KEYWORDS: Thirteen Classics, early Chinese texts, canons, mathematical linguistics, vocabulary, character inventory, lexical diversity, type-token ratio, stylometry.

CONTENTS

1. Introduction

1.1. Importance of WSP CTexts vocabulary

1.2. Character as type and token (vocabulary unit)

* Zinin Sergey Vasilievich, Candidate of Sciences (k.f.n.), Toronto, Canada; Warring States Project, University of Massachusetts (Amherst); E-mail: [email protected]

© Zinin S.V., 2016

1.3. Functional and content types of characters

1.4. WSP corpus sample size and character stream abstraction

1.5. Previous work

1.6. Acknowledgments

2. Measuring Vocabulary Richness

2.1. TTR as an Index of Vocabulary Richness

2.2. Other indices

3. Final-Value Index Approach

3.1. Clustering Type Token Ratio (TTR) Values

3.2. Guiraud's R

3.3. Herdan's C

3.4. Rubet's k

3.5. Maas' A2

3.6. Lukyanenko-Nesytoj's LN

3.7. Brunet's W

3.8. Honore's H

3.9. Sichel S

3.10. Michea's M

3.11. Yule's K

3.12. Herdan's Vm

3.13. Partial TTR Measurements

3.14. Discussion of Results for Final-Value and Partial Approaches

4. Vocabulary development profiles

4.1. Complete TTR developmental profile

4.2. Partial TTR Developmental Profiles

4.3. Results discussion for developmental profiles

5. Developmental profiles of rare and frequent characters

5.1. Hapax legomena (V1) and dis legomena (V2)

5.2. Frequent words (V50+)

5.3. Discussion of results for rare and most frequent characters

6. Conclusions

Literature

1. Introduction

Several terms in quantitative linguistics "refer to the range of different words used in a text, with a greater range indicating a higher diversity" (McCarthy and Jarvis, "vocd-D and HD-D," 381). These terms are vocabulary richness, lexical richness1, lexical diversity, and vocabulary diversity.

1 The term "lexical diversity," defined as "a complex property that summarizes the range of vocabulary and the avoidance of repetition in the sample" (Malvern and Richards, Measures of Lexical Richness, 1), is used interchangeably with "lexical richness." There are also other terms that are close in meaning, e.g., Pilar


However, since there may not be much difference between the meanings of these terms2, the term "vocabulary richness" could be used as a suitable representative.

There are various methods to calculate the measures of vocabulary richness (or "indices")3. Their values may allow a comparison of texts in various areas including "first and second language acquisition, linguistic input, interaction, demographic influences on language performance, language impairment, delay, aphasia, schizophrenia, stylistics, and forensic linguistics" (Oakes, "Corpus Linguistics and Stylometry," 1073-74). For classical Chinese, vocabulary richness can be useful for comparing texts from the perspective of their vocabulary diversity, lexical sophistication, and so on.

1.1. Importance of Warring States Project (WSP) CTexts vocabulary

The texts in CTexts may be considered the most important source for studying character vocabulary in classical Chinese. The canon system that formed between the Han and the Song dynasties strongly affected all aspects of Chinese literary discourse, including the general character vocabulary of classical Chinese. The Thirteen Classics and their set of characters (i.e., their character vocabulary) were memorized by generations of Chinese scholars and officials, and this has affected most texts produced by these scholars4. While the corpus of the WSP CTexts is not large compared to the entire pre-Qin literature5, its character vocabulary could be very close to the general character vocabulary of the

Duran cites such terms as "flexibility," "verbal creativity," and "lexical range and balance" (Duran et al., "Developmental trends," 221-222). Other authors add terms such as "lexical originality," "lexical sophistication," "lexical density," and "lexical variation" (Laufer and Nation, "A Vocabulary-size Test," 309-320).

2 They have been used interchangeably to describe the same value (e.g., in Malvern and Richards, "Measures of Lexical Richness," and other works).

3 Tweedie and Baayen in "How Variable May a Constant Be?" (323) use the term "constants," referring to the claim that these variables were thought to be constant by the researchers that created them.

4 This is reflected in the official status of the late Qing reference work on the characters in the Thirteen Classics by Li Hongzao (Li, Hanyuan Shisanjing ji zi). According to some calculations, Li Hongzao's work implies that the Thirteen Classics vocabulary contained 6544 unique characters (Qiu, Written Chinese, 49-50). CTexts contains 6055 characters; the Thirteen Classics in the CTexts corpus (excluding the Er Ya) contain 5628 characters. The author is grateful to Rodo Pfister, who pointed out this fact.

5 At approximately a half-million characters, it represents about 17% of all pre-Han literature (about three million) and about 6% of the Han and pre-Han literature (eight million), as stated by McLeod (McLeod, "Sinological Indexes," 50).


era. According to Qiu Xigui, "the number of characters in general use during that period would probably fall short of the total number used in the Thirteen Classics" (Qiu, Written Chinese, 50).

The present study of The Thirteen Classics and the Zhuangzi6 will try to lay the foundation for such analysis7. Moreover, the WSP CTexts corpus contains texts of various lengths (from about 1,800 characters to nearly 200,000 characters; see Table 2.1), which allows this study to test the validity of various methods.

1.2. Character as type and token (vocabulary unit)

In most languages, the basic unit of texts and vocabularies is the word (word stem). This study utilizes single characters as vocabulary and text units8. Linguists generally experience difficulties defining the word for word segmentation purposes. Classical Chinese, which contains both single- and multi-character words, poses additional problems for word segmentation and for using words as tokens and types. In the present study, in the absence of well-segmented texts, tokens are defined as single characters in a text and types as unique characters in a text9. Therefore, the term "vocabulary of text" in this study means the list of unique characters (variants are treated as separate characters) or "character vocabulary"10, as opposed to "word vocabulary" (similar to the modern Chinese terms zidian and cidian)11.

6 The Zhuangzi has been added to offset partly the predominantly "Confucian" character of the WSP corpus. The Er Ya has been omitted since it is not a sample of narrative prose or poetry.

7 Jun Da describes the general situation with a list of frequencies of characters and concludes that there is not much structured information available (Da, "A Corpus-based Study," 1).

8 This is examined in more detail in the previous article of this series, i.e., Zinin, "Pre-Qin Digital Classics."

9 As in the previous study, the texts were cleaned of punctuation and other non-character symbols, and the titles of chapters were removed (see explanation in Zinin, ibid). Character variants, if they include different Unicode representations, are treated as different characters. To be more precise, the Unicode codes of characters serve as types and tokens in this study. It would be a much better situation if the digital versions of The Thirteen Classics with standardized resolved variants were available.

10 Naturally, it does not mean that the author accepts or prefers the idea of classical Chinese being a monosyllabic language. The character-as-token approach is one possible method and the most feasible approach to study classical Chinese texts.

11 As shown by Peng et al., this approach of using character vocabulary instead of word vocabulary could be applied even to modern texts (Peng et al., "Language Independent Authorship," 272).
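The character-as-token convention described in this subsection can be illustrated with a minimal Python sketch. It is not the study's own code: the function names are hypothetical, and the filter (keep every character whose Unicode general category starts with "L") is an assumption standing in for the cleaning step described in footnote 9. Types are distinct Unicode code points, so graphic variants with different code points count as different types.

```python
import unicodedata
from collections import Counter


def character_tokens(text):
    """Return the characters of `text` that are counted as tokens.

    Assumption: a character is a token if its Unicode general category
    starts with "L" (letters, which includes CJK ideographs). Punctuation,
    digits, whitespace, and symbols are dropped.
    """
    return [ch for ch in text if unicodedata.category(ch).startswith("L")]


def types_and_tokens(text):
    """Return (N, V, counts): token count, type count, per-type frequencies."""
    tokens = character_tokens(text)
    counts = Counter(tokens)  # one key per distinct code point (type)
    return len(tokens), len(counts), counts


# Opening lines of the Lun Yu, used here only as a toy sample.
sample = "學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?"
n, v, counts = types_and_tokens(sample)
print(n, v, counts.most_common(3))

# Variants with different code points are different types.
variant_n, variant_v, _ = types_and_tokens("說说")
```

On this sample the sketch counts 19 tokens and 16 types, because three characters occur twice each.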


1.3. Functional and content types of characters

This study does not distinguish between functional ("empty") and content characters, as is often done in stylistic analysis in quantitative linguistics. However, the fifth section attempts separate analyses of hapax legomena, dis legomena, and the most frequent characters.

1.4. WSP corpus sample size and "character stream" abstraction

The vocabulary data for this study were retrieved from the WSP corpus. The WSP corpus is an online open corpus built on open-source classical Chinese texts, which the present author considers a sufficient source for a quantitative study of vocabulary richness12. Conducting research on an open-source corpus ensures replicability and reproducibility, since any researcher can replicate the vocabulary data (first of all, the numbers of types and tokens) and attempt to reproduce the results by applying the same methods13. Along with the corpus itself, this study offers all related data (too large to be placed in the Appendix) as an accompanying MS Excel spreadsheet, available on GitHub14.

The WSP corpus is considered small in relation to some modern Chinese corpora. However, it can be viewed as being large enough for vocabulary richness analysis. Vocabulary richness analysis (especially in practical areas, e.g., in language acquisition and medical studies) is often conducted on short samples of texts (tens or hundreds of words). Popescu suggests the maximum length of vocabulary study sample as 10,000 words (Popescu, "Word Frequency Studies," 3). Many texts in the Ctexts corpus are much larger than this figure.

The reason Popescu suggests this maximum length is text "homogeneity." Not only are the WSP CTexts long, but they are also not homogeneous narratives created by the same author or even in the same period. In fact, most of the texts in The Thirteen Classics took their current form considerably later than their subtexts were written. In other words, they are heterogeneous. Moreover, each text in the corpus can be considered a mini-corpus for a vocabulary study in itself, especially since it is often a compilation of subtexts, each of which is an independent text in its own right. Thus,

12 Like any digital corpus of classical Chinese, the WSP corpus has some philological problems, the nature of which was discussed by the present author in Zinin, ibid. The corpus can be found at http://www.umass.edu/ctexts/index.php (login and password are provided in the pop-up window).

13 There is a problem with the free availability of reliable classical Chinese corpora for research. See the previous article by the present author in Zinin, ibid.

14 See the file "Voc_ref.xlsx" at DOI: https://github.com/wsw-ctexts/vocabularv richness.


the vocabularies of these texts should be investigated separately15 and such "text patchwork" should be considered as normal for the Chinese tradition.

Indeed, studies of the assemblages of early manuscripts (e.g., Meyer, "Philosophy on Bamboo") demonstrate that texts considered to be a single unit today were often broken down into smaller meaningful units and mixed with other texts.

The authorial unity of style, to some degree, can be present in only a few of them16. Treating this mix of smaller texts as one large text allows interpreting this large text as a stream of characters17, which can be sampled at any length. Further analysis will concentrate on the specifics of individual texts and how they relate to the larger body of the text.

15 E.g., the Zhou Li is a compilation of pre-Han texts, probably assembled by one person (William Boltz in Loewe (Ed.), "Early Chinese Texts," 27-29); in the Zhuangzi, H.D. Roth indicates the presence of five large groupings of heterogeneous collections of chapters and supposes that it is a collective compilation of the early Han (H.D. Roth in Loewe (Ed.), "Early Chinese Texts," 56-57); the Chun Qiu, the Gongyang Zhuan, the Guliang Zhuan, and the Zuo Zhuan were traditionally all ascribed to one person (Anne Cheng in Loewe (Ed.), "Early Chinese Texts," 67-71), but the Gongyang and the Guliang probably come from a school tradition, while the Zuo Zhuan could have one author-compiler; the Zhou Yi (Edward Shaughnessy in Loewe (Ed.), "Early Chinese Texts," 219) was always ascribed to one person; the Yi Li (William Boltz in Loewe (Ed.), "Early Chinese Texts," 234-237), "detailed and specific descriptions of the ritual ceremonies of a shi" (Loewe (Ed.), "Early Chinese Texts," 234), is probably a part of a larger corpus of ceremonial writings (Loewe (Ed.), "Early Chinese Texts," 237); the Li Ji (Jeffrey K. Riegel in Loewe (Ed.), "Early Chinese Texts," 293-295) is "a ritualist's anthology of ancient usages, prescription, definitions and anecdotes" (Loewe (Ed.), "Early Chinese Texts," 293) with "no apparent overall structure," unlike the Zhou Li and the Yi Li, and not of the same time or origin; its 49 pian (11 groupings) are "extremely diverse and miscellaneous in their style and contents as well as in the origins of the materials of which they are constituted" (Loewe (Ed.), "Early Chinese Texts," 295); the Lun Yu (Anne Cheng in Loewe (Ed.), "Early Chinese Texts," 314) is now considered to be "a composite work of various layers, contributed by different hands"; the Shu Jing is a compilation of texts of "heterogenous nature" (Edward Shaughnessy in Loewe (Ed.), "Early Chinese Texts," 376); the heterogeneous nature of the Shi Jing was not contested by the tradition itself, etc.

16 It is obviously the Xiao Jing, but also the Zuo Zhuan, the Guliang Zhuan, and the Gongyang Zhuan. In addition, the Chun Qiu and the Zhou Yi contain considerable amounts of formulaic expressions, which create some unity of style. The Lun Yu and the Mengzi, while coming from various sources, have probably been heavily edited in order to appear to have authorial unity. Some researchers, e.g., Dirk Meyer (Meyer, "Philosophy on Bamboo"), essentially deny the idea of single "authorship" for it.

17 It should be stressed again that this is a stream of characters, not words.


Meanwhile, The Thirteen Classics were viewed as distinctive stylistic bodies by the Chinese tradition, which often ascribed them to one person as either an author or editor. Definitely, for a reader, the perceived style of the Shi Jing is different from that of the Lun Yu, which is different from that of the Zhou Yi or the Guliang Zhuan. These stylistic differences are often dependent on subject-specific characters or formulaic expressions. Formulaic expressions and repetitive characters strongly affect vocabulary content and growth behavior. The analysis of the vocabulary of these entities is necessary to delineate the area of discussion.

On a larger scale, these text bodies can be considered members of wider genre groups. For example, the Shi Jing represents early Chinese poetry, while the Chun Qiu and the Zuo Zhuan (and two accompanying zhuan) belong to historical prose. In addition, the Lun Yu and the Mengzi belong to philosophical prose, while the Li Ji, the Yi Li, and the Zhou Yi belong to ritualistic prose. One of the objectives of this study is to understand whether some vocabulary richness measures can be useful for the genre attribution of texts.

In a way, these texts can be compared to the Bible, which consists of texts of different genres. The Bible corpus can also be considered as being more heterogeneous than any of the WSP Ctexts samples. However, the analysis of the Bible vocabulary as a whole still makes sense. The present study is a macroanalysis; i.e., a large-scale investigation of the vocabulary richness of large and heterogeneous texts intended to establish a quantitative framework for further text analysis (it may be more useful to use the term "text richness" instead of "vocabulary richness"18).

Therefore, it often treats texts as a stream of characters, which can be sampled at any point, ignoring subtext borders. The macroanalysis should be followed by a microanalysis of the vocabularies of individual texts as well as sections or chapters of these texts (e.g., the Shi Jing's songs, texts of the Shu Jing). However, this work should be conducted in the future based on the results of this study.
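The "character stream" abstraction can be sketched as follows: the text is read as one sequence of characters, and the number of types V is recorded at regular sample sizes N, regardless of chapter or subtext borders. This is a minimal illustration under stated assumptions, not the study's procedure; the function name `vocabulary_growth` and the step size are hypothetical.

```python
def vocabulary_growth(stream, step):
    """Return a list of (N, V) pairs for a character stream.

    `stream` is a string or list of characters already cleaned of
    punctuation. A point is recorded after every `step` tokens and at
    the end of the stream; subtext borders are deliberately ignored.
    """
    seen = set()
    points = []
    for position, character in enumerate(stream, start=1):
        seen.add(character)
        if position % step == 0 or position == len(stream):
            points.append((position, len(seen)))
    return points


# Toy stream of 21 characters (opening of the Lun Yu, punctuation removed).
stream = "子曰學而時習之不亦說乎有朋自遠方來不亦樂乎"
points = vocabulary_growth(stream, step=7)
for n_tokens, n_types in points:
    print(n_tokens, n_types, round(n_types / n_tokens, 3))
```

Each recorded pair gives one point of a vocabulary growth curve; dividing V by N at each point gives the corresponding partial TTR.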

1.5. Previous work

To the present author's knowledge, there have been few vocabulary richness studies of classical Chinese literature. More specifically, the majority of the studies regarding the vocabulary of classical texts have consisted of character frequency studies19, with no systematic analysis of the vocabulary richness of the Chinese classics20. Therefore, the main objective of the present study is to lay the statistical foundation for further analysis of the vocabulary richness of classical Chinese texts.

18 As Gejza Wimmer writes about the main index used as vocabulary richness measure in this study, TTR, "the TTR as a measure of vocabulary richness is a misnomer. As a measure of the richness of the text it can perhaps function if some problems could be solved" (Wimmer, "Type-token Relation," 362).

19 See the review of this literature in Zinin, ibid.

The remainder of this study is organized as follows. The second and third sections introduce vocabulary indices (or "constants," according to Tweedie and Baayen in "How Variable May a Constant Be?") and present the final or partial values of these constants for the corpus, along with diagrams of the hierarchical clustering of texts. The fourth section displays developmental profiles (mostly for the type-token ratio (TTR) index) for entire samples and normalized lengths. The fifth section introduces the developmental analysis of hapax legomena (V1) and dis legomena (V2) as well as the most frequent words (V50+). Finally, a discussion of the results and conclusions is presented.

1.6. Acknowledgments

E. Bruce Brooks, who has supported the WSP Ctexts project from its beginning, has read the initial draft of the manuscript, made many important suggestions, and the present author has enjoyed an extremely fruitful discussion with him. Brooks as well as Rodo Pfister (to whom the present author is also grateful for reading the early draft) raised a very important issue regarding the validity of "character counting" in the study of early Chinese texts. The ensuing discussion with Brooks and Pfister made this author review several important views on the statistical approach to the texts in the WSP Ctexts corpus. Their comments also helped improve many academic aspects of this study's initial draft21.

2. Measuring Vocabulary Richness

The most basic approach to measuring vocabulary richness is to use final-value indices, in which a formula is applied to the entire sample. The simplest such index, the TTR, is known to depend on text length, which makes it impractical to compare final-value indices for texts of different lengths. However, there are other indices that claim independence from this parameter (Tweedie and Baayen call them "constants"22). Thus, this study will first test this final-value approach and discuss the results23.

20 Nevertheless, it is worth mentioning the Le Guan Ha article that examines Zipf's rank distribution on large modern Chinese corpora (and compares the curves with English) (Ha et al., "Extension of Zipf's Law").


21 All of the remaining factual, typographical and grammatical errors are the sole responsibility of the present author.

22 Tweedie and Baayen, "How Variable May a Constant Be?" 343.

23 There are many methods.


2.1. TTR as an Index of Vocabulary Richness

The most basic quantitative index of vocabulary richness is the TTR, which is the ratio of the number of types to the number of tokens in a given text sample. It is well established that the final values of the TTR (values calculated for the entire text sample) are not permanent characteristics of vocabulary richness. Moreover, as Vulanovic and Koehler note, "statistical distribution of this index is unknown and, therefore, tests of significance of differences in the TTR between authors or texts cannot be conducted" (Vulanovic and Koehler, "Syntactic Units and Structures," 284). However, TTR values can be helpful for comparing texts of similar origin and sample size. Hence, this index is still used in forensic authorship and style studies.

The main problem with the TTR is its dependency on sample size. Table 2.1 lists the WSP texts ordered by their final TTR values. Much the same order would be obtained by sorting the texts by length: the shorter the text, the higher it is on the list24. This means that the TTR values of complete texts will not be helpful in a comparative style analysis.

Table 2.1. TTR final values for the WSP corpus25

TEXT    N         V       TTR
XJ      1800      374     0.207778
SHI     29622     2833    0.095638
LY      15923     1361    0.085474
SHU     24537     1910    0.077842
ZY      13348     1030    0.077165
CQ      16791     941     0.056042
MZ      35354     1892    0.053516
ZHZ     65251     2968    0.045486
ZL      49410     2212    0.044768
GL      40835     1594    0.039035
GY      44224     1640    0.037084
LJ      97994     3041    0.031033
YL      53882     1536    0.028507
ZZ      178563    3235    0.018117
CQZZ    195354    3251    0.016642

24 With notable exceptions such as the Shi Jing, the Shu Jing, the Zhuangzi, and the Yi Li.

25 The first column features the abbreviated text name (here and hereafter, see the Abbreviations section for full text names); the second column, N, is the number of tokens (characters) in the text, i.e., the sample size; the third column, V, is the number of types in the text; and the fourth column is the TTR, calculated as the ratio V/N for the complete text. The structure of all further index tables is the same.
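The length dependence visible in Table 2.1 can be checked directly from its N and V columns. The sketch below recomputes TTR = V/N and compares the ordering by TTR with the ordering by length. The Spearman rank correlation is an illustrative choice made for this sketch, not a statistic reported in the study, and the names `TABLE_2_1`, `ttr`, and `rank_positions` are hypothetical.

```python
# (N, V) pairs copied from Table 2.1.
TABLE_2_1 = {
    "XJ": (1800, 374), "SHI": (29622, 2833), "LY": (15923, 1361),
    "SHU": (24537, 1910), "ZY": (13348, 1030), "CQ": (16791, 941),
    "MZ": (35354, 1892), "ZHZ": (65251, 2968), "ZL": (49410, 2212),
    "GL": (40835, 1594), "GY": (44224, 1640), "LJ": (97994, 3041),
    "YL": (53882, 1536), "ZZ": (178563, 3235), "CQZZ": (195354, 3251),
}


def ttr(n_tokens, n_types):
    """Type-token ratio V/N for a complete text."""
    return n_types / n_tokens


def rank_positions(keys_in_order):
    """Map each key to its 1-based position in an ordered list."""
    return {key: pos for pos, key in enumerate(keys_in_order, start=1)}


# Highest TTR first, and shortest text first.
by_ttr = sorted(TABLE_2_1, key=lambda t: ttr(*TABLE_2_1[t]), reverse=True)
by_length = sorted(TABLE_2_1, key=lambda t: TABLE_2_1[t][0])

rank_ttr = rank_positions(by_ttr)
rank_len = rank_positions(by_length)

# Spearman's rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
n = len(TABLE_2_1)
d2 = sum((rank_ttr[t] - rank_len[t]) ** 2 for t in TABLE_2_1)
spearman_rho = 1 - 6 * d2 / (n * (n * n - 1))

# Texts whose TTR rank differs from their length rank by 3 or more places.
displaced = [t for t in by_ttr if abs(rank_ttr[t] - rank_len[t]) >= 3]
print(round(spearman_rho, 3), displaced)
```

With the figures of Table 2.1, the two orderings agree closely (rho is about 0.89), and only SHI, ZY, and ZHZ are displaced by three or more positions.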

2.2. Other indices

The issues with the TTR have been known for a long time, and many researchers have attempted to create a length-independent measure of lexical richness ("length-invariant statistics")26. Tweedie and Baayen conveniently summarized these attempts in their article, "How Variable May a Constant Be?," and conducted a study to demonstrate that these indices still depend on text length, although some of them are "less dependent" than others. The present study follows Tweedie and Baayen's approach by applying it to the texts in the WSP CTexts corpus27.

Table 2.2 presents a list of several nonparametrical and parametrical indices28.

Table 2.2. List of static indices of lexical diversity29

Index (author)            Short name   Calculation method
Guiraud                   R30          $R = V / \sqrt{N}$
Herdan                    C            $C = \log V / \log N$
Rubet                     k            $k = \log V / \log(\log N)$
Maas                      A2           $a^2 = (\log N - \log V) / \log^2 N$
Luk'janenkov & Nesitoj    LN           $LN = (1 - V^2) / (V^2 \log N)$
Brunet                    W            $W = N^{V^{-a}}$, with $a = 0.172$
Honoré                    H            $H = 100 \log N / (1 - V(1,N)/V)$
Sichel                    S            $S = V(2,N) / V$
Michéa                    M            $M = V / V(2,N)$
Yule                      K            $K = 10^4 \left[ \sum_i i^2 V(i,N) - N \right] / N^2$
Herdan                    Vm           $V_m = \sqrt{ \sum_i V(i,N) (i/N)^2 - 1/V }$

26 Fiona Tweedie and Harald Baayen use the term "lexical constants" (Tweedie and Baayen, "How Variable May a Constant Be?").

27 Since the publication of their article (Tweedie and Baayen, "How Variable May a Constant Be?"), several more indices have been invented, with varying degrees of success. The present study will only use those in the original Tweedie and Baayen article. Some newer articles will be mentioned, but they do not add much progress to the already known methods. David Mitchell (Mitchell, "Type-token Models: A Comparative Study") provides an even larger list.

28 Nonparametrical models usually depend on the sample size (number of tokens) N and the number of types V, while parametrical models introduce extra textual parameters (e.g., Brunet's formula for W includes the parameter "a," which is usually set to 0.172; see Table 2.2).

29 The first column contains the name(s) of the researcher(s), the second the abbreviated index notation, and the third its formula, following Tweedie and Baayen's "How Variable May a Constant Be?" 326-331. "N" is the sample size in tokens (characters) and "V" is the vocabulary size in types. V(N) is the number of types in the sample of size N; it is usually more convenient to simply use "V" when "N" is obvious. V(i,N) is the number of types that occur exactly i times in the sample of size N, so V(1,N) is the number of hapax legomena and V(2,N) is the number of dis legomena in the sample.
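As an illustration of how the static indices listed in Table 2.2 are computed from a sample, the sketch below derives N, V, and the frequency spectrum V(i,N) from a list of tokens and evaluates each index using the standard definitions summarized by Tweedie and Baayen. Several points are assumptions of this sketch rather than statements about the study's own calculations: natural logarithms are used throughout, Brunet's parameter is set to 0.172, indices that would divide by zero are returned as infinity, and the function name `richness_indices` is hypothetical.

```python
import math
from collections import Counter


def richness_indices(tokens, a=0.172):
    """Compute final-value indices for a token list (needs N > 1, V > 1)."""
    freq = Counter(tokens)             # type -> number of occurrences
    spectrum = Counter(freq.values())  # i -> V(i, N), types occurring i times
    n = sum(freq.values())             # N, sample size in tokens
    v = len(freq)                      # V, number of types
    v1 = spectrum.get(1, 0)            # hapax legomena
    v2 = spectrum.get(2, 0)            # dis legomena
    log_n, log_v = math.log(n), math.log(v)
    sum_i2 = sum(i * i * v_i for i, v_i in spectrum.items())
    return {
        "N": n, "V": v, "V1": v1, "V2": v2,
        "TTR": v / n,
        "R": v / math.sqrt(n),                       # Guiraud
        "C": log_v / log_n,                          # Herdan
        "k": log_v / math.log(log_n),                # Rubet
        "A2": (log_n - log_v) / log_n ** 2,          # Maas
        "LN": (1 - v * v) / (v * v * log_n),         # Luk'janenkov & Nesitoj
        "W": n ** (v ** -a),                         # Brunet
        "H": 100 * log_n / (1 - v1 / v) if v1 < v else math.inf,  # Honoré
        "S": v2 / v,                                 # Sichel
        "M": v / v2 if v2 else math.inf,             # Michéa
        "K": 1e4 * (sum_i2 - n) / (n * n),           # Yule
        "Vm": math.sqrt(max(0.0, sum_i2 / (n * n) - 1 / v)),  # Herdan
    }


# Toy sample of 21 characters; real texts are thousands of tokens long.
indices = richness_indices(list("子曰學而時習之不亦說乎有朋自遠方來不亦樂乎"))
for name, value in indices.items():
    print(name, round(value, 4))
```

Because only Herdan's C is a ratio of two logarithms, it is the only index here whose value does not depend on the choice of logarithm base; the others would change if a different base were assumed.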

Table 2.3 contains the values of these indices for the entire corpus31. This study calculated these indices because they were presumed to reflect intrinsic characteristics of the texts as expressed in their vocabulary, and some of them are still popular in research. Using English prose, Tweedie and Baayen demonstrated that the constants are not actually "constant." Nevertheless, it is interesting to test them against classical Chinese texts, especially groups of texts of varying lengths such as those in the WSP CTexts corpus.

30 This formula counts all tokens in the sample as N. If only content words such as nouns are counted (no function words), the formula V/SQRT(2N) may be used instead. See Daller, "Guiraud's Index" and Van Hout and Vermeer, "Comparing measures of lexical richness".

31 Yule's K and Herdan's Vm are not presented in Table 2.3.


Table 2.3. Lexical diversity indices32. The first column features the text names; the second and third columns give the number of tokens in the text sample (N) and the number of types (V); the remaining columns give the other indices, labeled by their abbreviations.

[Table 2.3: the numerical values are not recoverable from this copy of the text.]

32 See "voc_ref.xlsx/MAIN_LOOKUP/Lexical diversity indices".


3. Final Value Index Approach

Quantitative linguistics is not very popular in philological studies, partly because its results cannot be immediately applied or interpreted in philological (e.g., stylistic) analyses. This applies particularly to "word counts," which have even been declared useless33. It is true that the final values of the indices do not reveal much about these texts and, by themselves, are not very useful34. As Van Hout and Vermeer put it, "does a higher outcome really reflect a richer underlying lexicon? Can we be certain and happy about the values produced by lexical measures?"35

However, these indices can be used to compare texts of similar length, which is why vocabulary richness indices are still used in authorship identification and style characterization. In the remainder of this section, each index (or "constant") is discussed separately, and its values are presented both numerically (as a sorted list of values) and visually (as a dendrogram).

3.1. Clustering TTR values

The simplest (and most controversial) index, the TTR, is presented in Column 4 of Table 2.3. The highest TTR (0.208) is produced by the Xiao Jing, while the lowest (0.017) is produced by the combined Chun Qiu Zuo Zhuan. This does not mean that the vocabulary of the Xiao Jing is "richer" than that of the Chun Qiu Zuo Zhuan. The Xiao Jing includes only 374 unique characters (types), but it is very short (1,800 characters in the WSP CTexts version). The Chun Qiu Zuo Zhuan includes 3,251 unique characters, but it is the longest text in the WSP corpus at 195,354 characters. The number of types increases with the sample size, but not proportionally, so the ratio of types to tokens changes as the sample grows. Therefore, the TTR value for the entire text (the final value) depends on the sample size.

While the TTR generally diminishes with sample size, some larger texts have higher TTR values than smaller ones. It can be useful to group the texts by these values using a "similarity" metric. One way to group the items is hierarchical clustering.

33 "A word frequency analysis of a text can reveal nothing about its characteristics (e.g., author, language, style, type of literature). The only exception appears to be Shakespeare" (Naranan and Balasubrahmanyan, "Models for Power Law Relations," 38). See the critique of this position by Sampson (Sampson, "Review of Harald Baayen.")

34 Duran et al., with their D (vocd), attempt to introduce a new index of lexical diversity. However, McCarthy et al. ("vocd-D and HD-D") argue that this index also depends on sample length (McCarthy et al., ibid, 382). The present study also includes a review of post-Baayen indices (McCarthy et al., ibid, 382).

35 Hout van and Vermee, "Comparing Measures," 94.


In addition, the results of hierarchical clustering can be graphically presented as a cluster dendrogram36.

The cluster dendrogram presents the same data as a regular table, but a clustering algorithm attempts to combine the texts (as "geometrical points") into larger groups (clusters) based on their closeness as "points" (starting from two). Moreover, it further combines smaller groups of points into larger clusters based on the Euclidean distance between the centers of the clusters37.

The standard cluster dendrogram algorithm (Euclidean metric with the "average method") produced the graph in Figure 3.1. If the dendrogram is cut at the 0.04 level on the y-axis38, then the algorithm groups the texts into three wide groups: Group 1, consisting of the outlier, the Xiao Jing; Group 2, consisting of the Shi Jing, the Shu Jing, the Lun Yu, and the Zhou Yi; and Group 3, consisting of all other texts. Group 2 features texts with a higher TTR level, so it is not surprising that small- to medium-sized texts belong there. Group 3 features texts with a lower TTR level. What is surprising, for example, is that the Chun Qiu and the Mengzi were also placed in Group 3, while the Shi Jing falls into Group 2 (i.e., it is treated as a shorter text).
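The three groups obtained at the 0.04 cut can be illustrated with a small pure-Python sketch of agglomerative clustering with average linkage. This is not the R routine used in the study; it is a simplified re-implementation under the assumption that each text is a one-dimensional point (its final TTR value from Table 3.1), and the function names are hypothetical.

```python
from itertools import combinations

# Final TTR values as listed in Table 3.1.
TTR = {
    "XJ": 0.2078, "SHI": 0.0956, "LY": 0.0855, "SHU": 0.0778, "ZY": 0.0772,
    "CQ": 0.056, "MZ": 0.0535, "ZHZ": 0.0455, "ZL": 0.0448, "GL": 0.039,
    "GY": 0.0371, "LJ": 0.031, "YL": 0.0285, "ZZ": 0.0181, "CQZZ": 0.0166,
}


def average_linkage(a, b, values):
    """Mean absolute difference over all cross-cluster pairs."""
    total = sum(abs(values[x] - values[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster_at_cut(values, cut):
    """Merge the closest clusters until the next merge would exceed `cut`."""
    clusters = [[name] for name in values]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], values),
        )
        if average_linkage(clusters[i], clusters[j], values) > cut:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


groups = cluster_at_cut(TTR, 0.04)
for group in groups:
    print(group)
```

Stopping the merging at a distance threshold corresponds to cutting the dendrogram horizontally at that height, which is why the sketch needs no tree structure.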

Figure 3.1. TTR dendrogram

36 This study utilized clustering software provided by the standard R language package. It also used agglomerative clustering with the Euclidean average distance as the metric.

37 In other words, "the average method." Other methods were attempted in this study, but they did not produce a significant difference.

38 Hereafter, the value for the horizontal cut is chosen so as to identify the largest meaningful groups.

TTR = V(N)/N


Table 3.1. TTR values (sorted by TTR in decreasing order)39

Text N V TTR

XJ 1800 374 0.2078

SHI 29622 2833 0.0956

LY 15923 1361 0.0855

SHU 24537 1910 0.0778

ZY 13348 1030 0.0772

CQ 16791 941 0.056

MZ 35354 1892 0.0535

ZHZ 65251 2968 0.0455

ZL 49410 2212 0.0448

GL 40835 1594 0.039

GY 44224 1640 0.0371

LJ 97994 3041 0.031

YL 53882 1536 0.0285

ZZ 178563 3235 0.0181

CQZZ 195354 3251 0.0166

It is difficult to see much "stylistic meaning" in grouping together, e.g., the Zhou Li, the Zhuangzi, the Gongyang Zhuan, and the Guliang Zhuan, except for their ordering according to the TTR final values. In addition, combining the Chun Qiu and the Mengzi definitely contradicts stylistic expectations. Otherwise, the results of the TTR approach are basically what could be expected, and they mostly reflect text sample size. However, the indices to be discussed below claim independence from text length. They will be reviewed in the order of their position in Table 2.2, which is the order of their presentation in Tweedie and Baayen's article.

3.2. Guiraud's R

39 See "voc_ref.xlsx/MAIN_LOOKUP/TTR values for WSP corpus".


Figure 3.2. Guiraud's R dendrogram

Table 3.2. Guiraud's R values (sorted by R in decreasing order)40

Text N V R

SHI 29622 2833 16.46036

SHU 24537 1910 12.19334

ZHZ 65251 2968 11.61904

LY 15923 1361 10.78563

MZ 35354 1892 10.06241

ZL 49410 2212 9.951251

LJ 97994 3041 9.714416

ZY 13348 1030 8.915160

XJ 1800 374 8.815265

GL 40835 1594 7.888093

GY 44224 1640 7.798568

ZZ 178563 3235 7.655588

CQZZ 195354 3251 7.355392

CQ 16791 941 7.261918

YL 53882 1536 6.617125

40 See "voc_ref.xlsx/MAIN_LOOKUP Guiraud's R values (sorted by R in decreasing order)".


Guiraud's R demonstrates less dependency on text length since text order according to R is not the text size order. The Shi Jing goes to the top of the ordered list, which is closed by the Yi Li. The longest texts, such as the Zuo Zhuan and the Zuo Zhuan with the Chun Qiu, are still placed closer to the end, while other long texts, such as the Zhuangzi and the Li Ji, are placed in the first half of the list. The dendrogram cut at the 2.0 level provides four groups: 1) the singular Shi Jing; 2) the Shu Jing and the Zhuangzi; 3) the Yi Li; and 4) the Chun Qiu, the Zuo Zhuan with the Chun Qiu, the Zuo Zhuan, the Gongyang Zhuan, and the Guliang Zhuan (i.e., mostly "historical"41 prosaic texts). This arrangement indicates some relationship to stylistic characteristics42.
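As a quick check of the values in Table 3.2, Guiraud's R can be computed as V divided by the square root of N. This is the standard definition of the index; the function name below is an illustrative choice.

```python
import math


def guiraud_r(types: int, tokens: int) -> float:
    """Guiraud's R = V / sqrt(N)."""
    return types / math.sqrt(tokens)


# (N, V) pairs for the two texts at the ends of the ordered list in Table 3.2.
shi_jing = guiraud_r(2833, 29622)
yi_li = guiraud_r(1536, 53882)

print(round(shi_jing, 3), round(yi_li, 3))
```

Dividing by the square root rather than by N itself dampens the penalty for long texts, which is consistent with the weaker length dependence noted above.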

However, this clustering does not offer a meaningful stylistic grouping, especially since the Xiao Jing is paired with the Zhou Yi and the Mengzi is paired with the Zhou Li. Yet, Van Hout and Vermee (Hout van and Vermee, "Comparing Measures," 100) consider Guiraud's R to be the most productive measure of lexical richness (for measuring proficiency in second language learning).

3.3. Herdan's C

41 Here, the adjective "historical" is not a genre definition. Some of these texts are not really "historic," e.g., the Chun Qiu "chronicle itself may be seen as a developed form of omen record" (Brooks and Brooks, "Emergence of China," 22), not a consciously written historical text that could be extended to the Gongyang Zhuan and the Guliang Zhuan. These texts are not even "narratives," according to the following popular definition: "[N]either does narrative exist without integration into the unity of a plot, but only chronology, an enunciation of a succession of uncoordinated facts" (Bremond, "Logic of Narrative Possibilities," 390). The Zuo Zhuan contains some narratives and historical prose. However, these texts are not only close stylistically but they also record and interpret events in history.

42 However, this author does not want to state that vocabulary richness values can be directly linked to genre stylistics. This issue will be discussed more in Section 4.3. However, it is worth noting any discovered correlation between quantitative indices and genre stylistics.


Figure 3.3. Herdan's C dendrogram

Table 3.3. Herdan's C values (sorted by C in decreasing order)43

Text N V C

XJ 1800 374 0.790371

SHI 29622 2833 0.772036

SHU 24537 1910 0.747418

LY 15923 1361 0.745797

ZY 13348 1030 0.730311

ZHZ 65251 2968 0.721238

MZ 35354 1892 0.72045

ZL 49410 2212 0.712594

CQ 16791 941 0.703795

LJ 97994 3041 0.697832

GL 40835 1594 0.694527

GY 44224 1640 0.69201

YL 53882 1536 0.67345

ZZ 178563 3235 0.668319

CQZZ 195354 3251 0.663794

43 See "voc_ref.xlsx/MAIN_LOOKUP/ Herdan's C values (sorted by C in decreasing order)".


Unlike Guiraud's R, Herdan's C follows text size more closely, although not as closely as the TTR. If a cut on its dendrogram is made at the 0.03 level, then the clustering produces several groups that vaguely depend on size. That is, it groups the Chun Qiu, the Gongyang Zhuan, and the Guliang Zhuan with the Li Ji, but it groups the Zuo Zhuan and the Chun Qiu Zuo Zhuan together with the Yi Li. It also groups the Lun Yu and the Shu Jing, while placing the Mengzi into one group with the Zhuangzi.
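For reference, the values in Table 3.3 are consistent with the usual definition of Herdan's C as the ratio of the logarithms of V and N. The sketch below uses that definition; the function name is hypothetical.

```python
import math


def herdan_c(types: int, tokens: int) -> float:
    """Herdan's C = log V / log N (the ratio does not depend on the log base)."""
    return math.log(types) / math.log(tokens)


shi_jing = herdan_c(2833, 29622)
xiao_jing = herdan_c(374, 1800)

print(round(shi_jing, 4), round(xiao_jing, 4))
```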

3.4. Rubet's k

Figure 3.4. Rubet's k dendrogram


Table 3.4. Rubet's k values (sorted by k in decreasing order)44

Text N V K

SHI 29622 2833 5.307357

SHU 24537 1910 5.107089

ZHZ 65251 2968 5.087419

LY 15923 1361 5.02657

XJ 1800 374 5.019382

LJ 97994 3041 4.98853

MZ 35354 1892 4.981165

ZL 49410 2212 4.980872

ZY 13348 1030 4.895199

ZZ 178563 3235 4.872745

CQZZ 195354 3251 4.854049

GL 40835 1594 4.824491

GY 44224 1640 4.819515

CQ 16791 941 4.751399

YL 53882 1536 4.720624

Rubet's k is similar to Guiraud's R in four characteristics that were indicated above: 1) the order of the text, structured by decreasing k, does not follow the text lengths' ordering; 2) it groups "historical texts" together; 3) it places the Shi Jing at the top; and 4) it places the Yi Li at the bottom of the k-ordered list.
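The values in Table 3.4 can be reproduced with the definition k = log V / log(log N). Unlike Herdan's C, this index depends on the logarithm base; base 10 is an assumption here, chosen because it matches the tabulated values. The function name is illustrative.

```python
import math


def rubet_k(types: int, tokens: int) -> float:
    """Rubet's k = log10(V) / log10(log10(N)); base 10 is assumed."""
    return math.log10(types) / math.log10(math.log10(tokens))


shi_jing = rubet_k(2833, 29622)
xiao_jing = rubet_k(374, 1800)

print(round(shi_jing, 3), round(xiao_jing, 3))
```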

3.5. Maas' A2

44 See "voc_ref.xlsx/MAIN_LOOKUP/ Rubet's k values (sorted by k in decreasing order)".


Figure 3.5. Maas's A2 dendrogram

Table 3.5. Maas's A2 values (sorted by A2 in decreasing order)45

Text N V A2

CQ 16791 941 0.070106

YL 53882 1536 0.069017

GY 44224 1640 0.066296

GL 40835 1594 0.066248

ZY 13348 1030 0.065373

XJ 1800 374 0.064397

CQZZ 195354 3251 0.063545

ZZ 178563 3235 0.063156

MZ 35354 1892 0.061461

ZL 49410 2212 0.061231

LJ 97994 3041 0.06054

LY 15923 1361 0.060495

ZHZ 65251 2968 0.057899

SHU 24537 1910 0.057538

SHI 29622 2833 0.05098

Maas' A2, like Rubet's k and Guiraud's R, does not display a correlation of text lengths and index values. In addition, it does not offer any specific genre grouping (e.g., historic texts). Meanwhile, it selects the Shi Jing as a text at the list's extreme and places the Yi Li close to this extreme.
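Table 3.5 is consistent with the definition a² = (log N − log V) / (log N)². The sketch below assumes base-10 logarithms, since that choice reproduces the tabulated values; note that a larger value indicates a less diverse vocabulary, which is why the Shi Jing closes the decreasing list.

```python
import math


def maas_a2(types: int, tokens: int) -> float:
    """Maas' a^2 = (log10 N - log10 V) / (log10 N)^2; base 10 is assumed."""
    log_n = math.log10(tokens)
    return (log_n - math.log10(types)) / log_n ** 2


chun_qiu = maas_a2(941, 16791)
shi_jing = maas_a2(2833, 29622)

print(round(chun_qiu, 4), round(shi_jing, 4))
```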

45 See "voc_ref.xlsx/MAIN_LOOKUP/ Maas's A2 values (sorted by A2 in decreasing order)".


3.6. Lukyanenko-Nesytoj's LN

Figure 3.6. Lukyanenko and Nesitoj's LN dendrogram

Table 3.6. LN values (sorted by LN in increasing order)46

Text N V LN

XJ 1800 374 -0.30719

ZY 13348 1030 -0.2424

LY 15923 1361 -0.23798

CQ 16791 941 -0.23668

SHU 24537 1910 -0.2278

SHI 29622 2833 -0.22363

MZ 35354 1892 -0.21986

GL 40835 1594 -0.21687

GY 44224 1640 -0.21525

ZL 49410 2212 -0.21305

YL 53882 1536 -0.21135

ZHZ 65251 2968 -0.2077

LJ 97994 3041 -0.20035

ZZ 178563 3235 -0.19041

CQZZ 195354 3251 -0.18901

46 See "voc_ref.xlsx/MAIN_LOOKUP/ LN values (sorted by LN in inreasing order)".

Lukyanenko-Nesytoj's LN, similar to the TTR, basically displays a correlation between text lengths and index values. Moreover, it separates "historical" texts. Here the Shi Jing is placed in the middle of the ordered list, while the groupings in the dendrogram (cut at the 0.02 level) do not offer much stylistic meaning.

3.7. Brunet's W

Figure 3.7. Brunet's W dendrogram


Table 3.7. Brunet's W values (sorted by W in increasing order)47

Text N V W

SHI 29622 2833 13.78493

XJ 1800 374 14.96381

SHU 24537 1910 15.74132

LY 15923 1361 16.39098

ZHZ 65251 2968 16.48212

MZ 35354 1892 17.47091

ZL 49410 2212 17.70207

ZY 13348 1030 17.82408

LJ 97994 3041 18.04657

GL 40835 1594 19.81939

GY 44224 1640 19.97337

CQ 16791 941 20.0124

ZZ 178563 3235 20.32376

CQZZ 195354 3251 20.73038

YL 53882 1536 21.85115

If the cut is made at the 1.0 level, then Brunet's W provides five subgroups, one of which groups the historical texts together. In addition, it places the Yi Li and the Shi Jing at the extreme ends of the ordered list.
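The values in Table 3.7 agree with the common definition W = N^(V^-a) with a = 0.172. The constant 0.172 is the value conventionally used for this index and is assumed here because it reproduces the table; lower W corresponds to a richer vocabulary.

```python
def brunet_w(types: int, tokens: int, a: float = 0.172) -> float:
    """Brunet's W = N ** (V ** -a), with the conventional a = 0.172 (assumed)."""
    return tokens ** (types ** -a)


shi_jing = brunet_w(2833, 29622)
yi_li = brunet_w(1536, 53882)

print(round(shi_jing, 2), round(yi_li, 2))
```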

3.8. Honore's H

47 See "voc_ref.xlsx/MAIN_LOOKUP/ Brunet's W values (sorted by W in increasing order)".


Figure 3.8. Honore's H dendrogram

Table 3.8. Honore's H values (sorted by H in increasing order)48


Text N V(N) H

ZY 13348 1030 533.8164

XJ 1800 374 571.5831

GY 44224 1640 611.467

YL 53882 1536 613.2909

SHU 24537 1910 621.5388

SHI 29622 2833 622.8163

GL 40835 1594 622.8802

MZ 35354 1892 623.146

ZL 49410 2212 631.5522

CQZZ 195354 3251 640.3747

ZZ 178563 3235 642.0841

LY 15923 1361 646.9407

CQ 16791 941 647.5239

LJ 97994 3041 656.4982

ZHZ 65251 2968 679.8142

Honore's H does not correlate text lengths and index values, but it is difficult to find stylistic meaning in its groupings.

48 See "voc_ref.xlsx/MAIN_LOOKUP/ Honore's H values (sorted by H in increasing order)".


3.9. Sichel's S

Figure 3.9. Sichel's S dendrogram

Table 3.9. Sichel's S values (sorted by S in decreasing order)49

Text N V S

ZY 13348 1030 0.193204

XJ 1800 374 0.187166

SHI 29622 2833 0.171903

LY 15923 1361 0.164585

CQ 16791 941 0.160468

MZ 35354 1892 0.154334

SHU 24537 1910 0.146597

ZHZ 65251 2968 0.143531

GY 44224 1640 0.142073

ZL 49410 2212 0.135624

YL 53882 1536 0.133464

LJ 97994 3041 0.125617

GL 40835 1594 0.125471

CQZZ 195354 3251 0.100277

ZZ 178563 3235 0.099845

49 See "voc_ref.xlsx/MAIN_LOOKUP/ Sichel's S values (sorted by S in decreasing order)".

If the dendrogram is cut at the 0.02 level, then Sichel's S clustering produces four groups, grouping together (among others) the two longest texts and then the Xiao Jing and the Zhou Yi. While Sichel's S order is not exactly the TTR order, it vaguely correlates to text size.
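Sichel's S is the share of dis legomena among all types, S = V2 / V, and Michea's M (next section) is its reciprocal, M = V / V2. The sketch below illustrates this for the Zhou Yi. The count V2 = 199 is not stated in the text; it is inferred here from the tabulated S and V and should be treated as an assumption.

```python
def sichel_s(dis_legomena: int, types: int) -> float:
    """Sichel's S = V2 / V, the proportion of types that occur exactly twice."""
    return dis_legomena / types


def michea_m(dis_legomena: int, types: int) -> float:
    """Michea's M = V / V2, the reciprocal of Sichel's S."""
    return types / dis_legomena


ZHOU_YI_V = 1030
ZHOU_YI_V2 = 199  # inferred from S = 0.193204 in Table 3.9, not given in the text

s_value = sichel_s(ZHOU_YI_V2, ZHOU_YI_V)
m_value = michea_m(ZHOU_YI_V2, ZHOU_YI_V)

print(round(s_value, 6), round(m_value, 6))
```

Because one index is the inverse of the other, the two ordered lists are mirror images, which explains why both show the same loose correlation with text size.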

3.10. Michea's M

Figure 3.10. Michea's M dendrogram


Table 3.10. Michea's M values (sorted by M in increasing order)50

Text N V M

ZY 13348 1030 5.175879

XJ 1800 374 5.342857

SHI 29622 2833 5.817248

LY 15923 1361 6.075893

CQ 16791 941 6.231788

MZ 35354 1892 6.479452

SHU 24537 1910 6.821429

ZHZ 65251 2968 6.967136

GY 44224 1640 7.038627

ZL 49410 2212 7.373333

YL 53882 1536 7.492683

LJ 97994 3041 7.960733

GL 40835 1594 7.97

CQZZ 195354 3251 9.972393

ZZ 178563 3235 10.01548

Michea's M index, in some ways, is similar to Sichel's S and other indices that correlate index values and text sizes.

3.11. Yule's K

K = 10^4 · [Σ i² · V(i,N) − N] / N²

where V(i,N) is the number of types that occur exactly i times in a sample of N tokens.

50 See "voc_ref.xlsx/MAIN_LOOKUP/ Miches's M values (sorted by M in increasing order)".


Figure 3.11. Yule's K dendrogram

Table 3.11. Yule's K values (sorted by K in decreasing order)51

Text N V K

LY 15923 1361 135.486

CQ 16791 941 119.086

XJ 1800 374 114.580

MZ 35354 1892 105.034

ZY 13348 1030 102.313

GL 40835 1594 95.4268

ZHZ 65251 2968 90.8390

ZL 49410 2212 84.9890

GY 44224 1640 82.4085

LJ 97994 3041 71.3906

ZZ 178563 3235 67.0797

YL 53882 1536 66.3595

CQZZ 195354 3251 63.5706

SHU 24537 1910 54.0901

SHI 29622 2833 49.1837

Yule's K index does not display direct dependency on the text length. Its groupings, provided by clustering (cut at 14), are very different than those of other indices.
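Unlike the indices above, Yule's K cannot be computed from N and V alone, because it needs the whole frequency distribution. The following minimal sketch uses the standard definition K = 10^4 · (Σ f² − N) / N², where f runs over the frequencies of the individual types; the toy string is a made-up example, not corpus data.

```python
from collections import Counter


def yule_k(tokens: str) -> float:
    """Yule's K = 10^4 * (sum of squared type frequencies - N) / N^2."""
    n = len(tokens)
    if n == 0:
        raise ValueError("empty sample")
    frequencies = Counter(tokens).values()
    return 1e4 * (sum(f * f for f in frequencies) - n) / (n * n)


# Toy sample: 'a' occurs twice, 'b' three times, 'c' once (N = 6).
toy_k = yule_k("aabbbc")
print(round(toy_k, 2))
```

Since K grows with repetition, higher values indicate a less diverse vocabulary, so the Shi Jing at the bottom of the decreasing list is again the richest text.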

51 See "voc_ref.xlsx/MAIN_LOOKUP/ Yule's K values (sorted by K in decreasing order)".


3.12. Herdan's Vm

Figure 3.12. Herdan's Vm dendrogram

Table 3.12. Herdan's Vm values (sorted by Vm in increasing order)52

Text N V(N) Vm

XJ 1800 374 0.035509

SHU 24537 1910 0.062502

SHI 29622 2833 0.063014

YL 53882 1536 0.077046

CQZZ 195354 3251 0.078906

ZZ 178563 3235 0.081092

LJ 97994 3041 0.083106

GY 44224 1640 0.087152

ZL 49410 2212 0.090117

ZY 13348 1030 0.093307

ZHZ 65251 2968 0.093677

GL 40835 1594 0.095377

MZ 35354 1892 0.10003

CQ 16791 941 0.102799

LY 15923 1361 0.112172

52 See "voc_ref.xlsx/MAIN_LOOKUP/ Herdan's Vm values (sorted by Vm in increasing order)".

Herdan's Vm is interesting since it singles out the Lun Yu (similar to Yule's K). Otherwise, it does not offer any interesting stylistic grouping.

3.13. Partial TTR measurements

The results of the analysis based on the final values of indices seem to be extremely diverse. Only a few constants allowed the grouping of texts (by clustering) in a way that could be remotely interpreted as stylistically meaningful.

It is possible to measure WSP texts at equal sample sizes. The TTR values could be calculated at some fixed sample sizes, e.g., at 15,000 and 30,000 characters. The results for the 30,000-token samples are presented in Figure 3.13 and Table 3.13, while the results for the 15,000-token samples are presented in Figure 3.14 and Table 3.14. These points were selected because the 30,000-token sample size includes all large- and medium-sized texts, while the 15,000-token sample size includes all texts except the Xiao Jing.

Figure 3.13. Partial TTR dendrogram for sample lengths of 30,000 characters (shorter texts are assigned 0 values)


Table 3.13. Partial TTR values (sorted in decreasing order by TTR) for sample lengths of 30,000 characters (shorter texts are assigned n/a values)53

Text N V V(30000) TTR(30000)

SHI 29622 2833 2833 0.094433

ZHZ 65251 2968 2161 0.072033

LJ 97994 3041 2069 0.068967

ZZ 178563 3235 1902 0.0634

CQZZ 195354 3251 1825 0.060833

MZ 35354 1892 1768 0.058933

ZL 49410 2212 1627 0.054233

GL 40835 1594 1392 0.0464

GY 44224 1640 1381 0.046033

YL 53882 1536 1082 0.036067

CQ 16791 941 n/a n/a

LY 15923 1361 n/a n/a

SHU 24537 1910 n/a n/a

XJ 1800 374 n/a n/a

ZY 13348 1030 n/a n/a


Figure 3.14. Partial TTR dendrogram for sample lengths of 15,000 characters (shorter texts are assigned 0 values)

53 See "voc_ref.xlsx/MAIN_LOOKUP/ Partial TTR values (sorted in decreasing order by TTR)".


Table 3.14. Partial TTR values (sorted in decreasing order by TTR) for sample lengths of 15,000 characters (shorter texts are assigned n/a values)54

Text N V V(15000) TTR(15000)

SHI 29622 2833 2067 0.1378

SHU 24537 1910 1658 0.110533

ZHZ 65251 2968 1611 0.1074

LJ 97994 3041 1510 0.100667

ZZ 178563 3235 1461 0.0974

MZ 35354 1892 1375 0.091667

CQZZ 195354 3251 1364 0.090933

LY 15923 1361 1328 0.088533

ZL 49410 2212 1125 0.075

GL 40835 1594 1037 0.069133

ZY 13348 1030 1030 0.068667

GY 44224 1640 978 0.0652

CQ 16791 941 888 0.0592

YL 53882 1536 872 0.058133

XJ 1800 374 n/a n/a

While some texts are shorter than 30,000 tokens, it is possible to compare the results with clustering at 15,000 tokens. Cutting both dendrograms at the 0.02 level provides similar results. Unlike the final values dendrogram, the Shi Jing (the Xiao Jing is missing) is singled out in clustering and placed at the top of the list, while the Yi Li is placed at the bottom. In both cases, historical texts are split. The values of the TTR in both cases of partial samples do not correlate with text lengths, unlike the situation with final values55. This supports the idea that the TTR index (under certain conditions) can be beneficial for evaluating vocabulary richness.
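The partial measurement can be sketched as follows: count the types in the first n characters and divide by n. This is a minimal illustration with a hypothetical function name and a made-up toy string. It returns `None` for any text shorter than the sample size, which is stricter than the tables above, where texts only slightly shorter than the sample size are reported with their final V.

```python
from typing import Optional


def partial_ttr(text: str, sample_size: int) -> Optional[float]:
    """TTR of the first `sample_size` characters, or None if the text is shorter."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if len(text) < sample_size:
        return None
    return len(set(text[:sample_size])) / sample_size


toy = "abcabcabd"
print(partial_ttr(toy, 6))   # first six characters contain three types
print(partial_ttr(toy, 20))  # text is too short for this sample size

# Consistency check against Table 3.14: V(15000) = 2067 for the Shi Jing.
shi_jing_ttr_15000 = 2067 / 15000
print(round(shi_jing_ttr_15000, 4))
```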

3.14. Discussion of the results for the final value and partial index approaches

54 See "voc_ref.xlsx/MAIN_LOOKUP/ Partial TTR values for 15000 samplees (sorted in decresing order by TTR)".

55 In Section 4, these results will be discussed in more detail.

As Hoover notes, "various measures of vocabulary richness produce further interesting differences in how they rank texts on the basis of vocabulary richness - differences that reflect their radically different bases and methods of calculation" (Hoover, "Another Perspective," 169). Hoover, similar to Tweedie and Baayen (336), had the benefit of controlling these differences by authorship of texts in their sample and grouping the constants based on correct ranking56. In the case of the WSP corpus, other criteria could be used for grouping indices.

Table 3.15. Indices matching four criteria

Index | Short name | Length-ordered | Grouping historical texts | Shi Jing / top | Yi Li / bottom

TTR | TTR | yes | no | (second) | no

TTR at 30000 sample | TTR | no | no | yes | yes

TTR at 15000 sample | TTR | no | no | yes | yes

Guiraud | R | no | yes | yes | yes

Herdan | C | almost | no | (second) | no

Rubet | k | no | yes | yes | yes

Maas | A2 | no | no | yes | almost

Luk'janenkov-Nesitoj | LN | yes | no | no | no

Brunet | W | no | yes | yes | yes

Honoré | H | no | no | no | no

Sichel | S | almost | no | no | no

Michéa | M | almost | no | no | no

Yule | K | no | no | yes | no

Herdan | VM | no | no | no | no

Four recurrent binary features have been noted earlier. First, the order of the final values may or may not reflect text lengths. Second, some indices group "historical" texts together (which might be seen as stylistic selection), while others do not. Third, some indices clearly favor the Shi Jing, placing it at the top of the list. Fourth, some indices similarly place the Yi Li at the opposite end of the list from the Shi Jing. These four binary features could form the criteria for grouping indices. Table 3.15 presents the breakdown. Guiraud's R, Rubet's k, and Brunet's W satisfy all four criteria. The partial TTR values match three of the four criteria, excluding genre grouping. This result will be discussed in more detail in the fourth section.

56 Hoover further notes that "these variations in richness order merely emphasize the fact that different measures of vocabulary richness measure different aspects of vocabulary structure" (Hoover, "Another Perspective," 169).


4. Vocabulary Development Profiles

The TTR index could still be used for the description of vocabulary richness and growth by charting vocabulary growth dynamically as developmental profiles, which was demonstrated by, e.g., Tweedie and Baayen's "How Variable May a Constant Be?"57.


Unlike most other indices, the TTR has an immediate and clear meaning of measuring the relative rate of vocabulary growth. If a text demonstrates a comparatively faster growth of vocabulary, then its developmental curve remains higher on the chart compared to the curves of the texts with a lower rate of vocabulary growth. The TTR developmental profile allows analyzing the dynamics of the rate of adding new characters to the existing vocabulary. Analysis of developmental profiles helps visually (as well as numerically58) estimate how fast some texts add new characters to their vocabulary compared to other texts59. Figure 4.1 displays a view of the TTR's complete developmental profile for the WSP Ctexts texts60.

Developmental profiles are based on the abstraction of the character stream. This means that texts are perceived as one long string of characters, presumably stochastically generated by one source for each text (i.e., which vocabulary is being evaluated). The TTR values are calculated at even intervals in the sample and the subtext borders are not taken into consideration. Considering the average size of text samples, a 1,000-character interval was chosen as the interval for this study.
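The construction of a developmental profile from a character stream can be sketched in a few lines. The function name and the toy string are illustrative assumptions; for the corpus texts the step would be 1,000 characters, as stated above.

```python
from typing import List, Tuple


def ttr_profile(text: str, step: int) -> List[Tuple[int, int, float]]:
    """Return (n, V(n), TTR(n)) at every `step` characters of the stream."""
    if step <= 0:
        raise ValueError("step must be positive")
    seen = set()
    profile = []
    for position, character in enumerate(text, start=1):
        seen.add(character)
        if position % step == 0:
            profile.append((position, len(seen), len(seen) / position))
    return profile


# Toy stream measured every 2 characters instead of every 1,000.
for n, v, ratio in ttr_profile("abacabad", step=2):
    print(n, v, round(ratio, 2))
```

The running set makes a single pass over the text, so subtext borders play no role, in line with the character-stream abstraction described above. A trailing remainder shorter than one step is ignored in this sketch.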

4.1. Complete TTR developmental profile

The curves in Figure 4.1 demonstrate that the longer WSP texts gradually converge into an asymptote, flattening out at approximately 70,000 tokens61, meaning that there is no significant change in the rate of vocabulary growth beyond this sample size; i.e., after this point, character vocabularies are saturated. However, the curve slopes (i.e., rates of vocabulary growth) vary in the zone preceding this area.

57 These authors offer a more advanced study (Tweedie and Baayen, "How Variable May a Constant Be?"). The present study does not investigate randomization, intervals, coherent prose, and so on.

58 This difference could further be the foundation for evaluating text stylistic differences, while it is too early to make definite quantitative suggestions. This is why there are no quantitative indices for curve slopes in the present study, especially since it is still unclear how they could be used. Therefore, this study will only perform some visual observations.

59 "Trimming the texts to equal size allows the number of types to be used as a direct measure of vocabulary richness and lays the groundwork for an examination of intratextual variability" (Hoover, "Another Perspective," 159).

60 The TTR values (y-axis) are presented for every 1,000 characters of texts (x-axis).

Figure 4.1. TTR complete developmental profiles for WSP Texts62

As David Malvern and Bryan Richards state, "[t]he more slowly a curve falls and the higher up in the space it is, the nearer that sample is to the most diverse text possible. The more quickly a curve falls, the lower in the space it lies and the nearer it is to the least diverse text there can be" (Malvern and Richards, "Measures of Lexical Richness," 3). The Shi Jing clearly remains at the top. All of this explains why the ability to select the Shi Jing and the Yi Li as two extreme points of the spectrum was earlier referred to as a "criterion for classifying indices." The indices that placed them on the top and bottom positions need to reflect the tendency.

To better understand the slopes of the curves in Figure 4.1, they can be fitted to some function. It is not exactly known what law governs the TTR developmental profile curves, but it is generally presumed that a power law (y = a·x^b) is a good approximation. The TTR curves were fitted using a free online curve-fitting package, MyCurveFit63, to increase the reproducibility of this study. The main results (the curve images and the numerical parameters) are presented in the accompanying Excel spreadsheet64.

61 There have been several hypotheses regarding the nature of TTR curves by several researchers, from Poisson to inverse Gaussian distribution (see a review in Mitchell, "Type-token Models," 2).

62 This chart is built based on data presented in "voc_ref.xlsx/TTR_ALL". The main table presents V(N) values for all WSP texts, taken at sample sizes of each thousand, and the respective TTR value for this V(N). (The final values for some texts could be slightly different from the exact values, as N is a multiple of 1000.) The scalable chart of TTR complete developmental profiles for WSP Texts is situated right below the main table.

The power law curves become straight lines when the X- and Y-axes are converted to a logarithmic scale. In this case, the parameter "a" determines the Y-axis intercept (log a) and the parameter "b" becomes the line's slope. These numbers are presented on the Main Lookup sheet ("voc_ref.xlsx/MAIN_LOOKUP").
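The log-log linearization can be illustrated with an ordinary least-squares line through (log x, log y). This sketch is not the MyCurveFit procedure used in the study, which may fit the curve differently; the synthetic points below are generated from the Shi Jing parameters in Table 4.1 purely to show that the slope and intercept recover "b" and "a", and the x units are an assumption.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * x**b by least squares on log-transformed data; return (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    covariance = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    variance = sum((lx - mean_x) ** 2 for lx in log_x)
    b = covariance / variance                 # slope of the straight line
    a = math.exp(mean_y - b * mean_x)         # intercept is log(a)
    return a, b


# Synthetic, noise-free points on y = 0.364123 * x**-0.36189 (hypothetical x units).
xs = list(range(1, 31))
ys = [0.364123 * x ** -0.36189 for x in xs]

a, b = fit_power_law(xs, ys)
print(round(a, 6), round(b, 5))
```

On real profile data the points scatter around the line, and a log-space fit weights small TTR values more heavily than a direct nonlinear fit would, so the two approaches can give slightly different parameters.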

Table 4.1 presents these numbers for the corpus texts, sorted from the shallowest to the steepest slope (i.e., by decreasing "b"). It shows that the Yi Li and the Xiao Jing demonstrate the steepest drop in the TTR curves, while the Shi Jing, the Lun Yu, and the Zhou Li demonstrate more vocabulary diversity.

Table 4.1. Power Method Curve Fitting Parameters y = a·x^b (complete samples)65

Text a b

SHI 0.364123 -0.36189

LY 0.303624 -0.43227

ZL 0.268362 -0.46971

SHU 0.415781 -0.47475

MZ 0.321539 -0.47974

GY 0.239313 -0.48198

ZHZ 0.383938 -0.48299

ZY 0.282812 -0.50061

CQ 0.234874 -0.50537

GL 0.268645 -0.50955

CQZZ 0.353877 -0.51907

ZZ 0.370022 -0.51933

LJ 0.413892 -0.52272

YL 0.276727 -0.55608

XJ 0.276 -0.56163

63 See www.mycurvefit.com. This free site will not accept sequences larger than 100 points. Thus, several larger texts were truncated to 90-100 points. In a few cases, e.g., the Zhou Li, a few points at the beginning were definitely out of range, which skewed the fitting. Therefore, a couple of points were taken out of the sequence (in the case of the Zhou Li) to better fit the other points. Otherwise, the fitting process was very basic. Besides the power law, other fitting methods were tested, such as the polynomial and exponential functions. They also demonstrated good results (especially for some curves, see the spreadsheet), but the power law was the best for most of the curves.

64 See "voc_ref.xlsx/ FIT_ALL".

65 See "voc_ref.xlsx/MAIN_LOOKUP/ Power Method Curve Fitting Parameters".

4.2. Partial TTR developmental profiles

The sample size of 70,000 tokens is the approximate area beyond which the curves of texts in the corpus visibly flatten. Meanwhile, 30,000 tokens is another interesting sample size, allowing a better scale for comparison of most texts, except for the shortest (Figure 4.2). Numerical values of 30,000 sample end point values are presented in Table 4.2.

Figure 4.2. TTR profile for WSP text samples at 30000 characters66

Table 4.2. TTR values for WSP text samples at 30000 characters67

Text N V V(30000) TTR(30000)

SHI 29622 2833 2833 0.094433

ZHZ 65251 2968 2161 0.072033

LJ 97994 3041 2069 0.068967

ZZ 178563 3235 1902 0.063400

CQZZ 195354 3251 1825 0.060833

MZ 35354 1892 1768 0.058933

ZL 49410 2212 1627 0.054233

GL 40835 1594 1392 0.046400

GY 44224 1640 1381 0.046033

YL 53882 1536 1082 0.036067

CQ 16791 941 n/a n/a

LY 15923 1361 n/a n/a

SHU 24537 1910 n/a n/a

XJ 1800 374 n/a n/a

ZY 13348 1030 n/a n/a

66 This chart is also built based on data presented in "voc_ref.xlsx/ TTR_ALL". The scalable TTR profile for WSP Text samples at 30000 characters is situated below the main table.

67 See "voc_ref.xlsx/MAIN_LOOKUP/ TTR values for WSP Text samples at 30000 characters".

It is possible to group texts by curve gradients into three subgroups68. At the top, there is the Shi Jing (with a TTR of 0.094, which is almost three times larger than the lowest one, i.e., the Yi Li with a TTR of 0.036). The Shu Jing should have come next, but it is slightly shorter than 30,000 tokens in the WSP version69. After that come the Zuo Zhuan, the Chun Qiu Zuo Zhuan, the Lun Yu, and the Mengzi. Finally, there are the others, which include the Zhou Li, the Guliang Zhuan, the Gongyang Zhuan, the Xiao Jing, the Yi Li, and the Chun Qiu. This is close to what the cluster algorithm displays, except that the Shi Jing represents the top group, not the Xiao Jing.

It is also possible to group the texts as follows:

1) The Shi Jing;

2) The Shu Jing, the Zhuangzi, the Li Ji, the Zuo Zhuan, the Chun Qiu Zuo Zhuan, the Mengzi, the Zhou Li, the Guliang Zhuan, and the Gongyang Zhuan;

3) The Yi Li.

The sample size of 15,000 tokens allows a clearer picture, including practically all of the WSP corpus texts (Figure 4.3). This profile allows the identification of a more articulate grouping:

1) The Shi Jing;

2) The Shu Jing, the Zhuangzi, the Li Ji, the Zuo Zhuan, the Chun Qiu Zuo Zhuan, the Mengzi, and the Zhou Li;

3) The Zhou Yi, the Chun Qiu, the Guliang Zhuan, and the Gongyang Zhuan;

4) The Yi Li.

68 Even if the text samples are smaller than 30,000 token size. These subgroups are not "clusters" since the grouping is based on simple visual sorting.

69 However, for a continued curve, the values will be less than those of the Zhuangzi and the Li Ji.


Figure 4.3. TTR profile for WSP Text samples at 15000 characters70

Table 4.3. TTR values for WSP Text samples at 15000 characters71

Text N V V(15000) TTR(15000)

SHI 29622 2833 2067 0.1378

SHU 24537 1910 1658 0.110533

ZHZ 65251 2968 1611 0.1074

LJ 97994 3041 1510 0.100667

ZZ 178563 3235 1461 0.0974

MZ 35354 1892 1375 0.091667

CQZZ 195354 3251 1364 0.090933

LY 15923 1361 1328 0.088533

ZL 49410 2212 1125 0.075

GL 40835 1594 1037 0.069133

ZY 13348 1030 1030 0.068667

GY 44224 1640 978 0.0652

CQ 16791 941 888 0.0592

YL 53882 1536 872 0.058133

XJ 1800 374 n/a n/a

70 This chart is also built based on data presented in "voc_ref.xlsx/ TTR_ALL". The scalable chart TTR profile for WSP Text samples at 15000 characters is situated right below the main table.

71 See "voc_ref.xlsx/MAIN_LOOKUP/ TTR values for WSP Text samples at 15000 characters".


Table 4.4. Power Method Curve Fitting Parameters y = a·x^b (samples of 15000 characters)72

Text a b

SHU 0.351897 -0.32366

SHI 0.351897 -0.32366

LY 0.303005 -0.42924

ZHZ 0.368025 -0.43461

ZZ 0.346208 -0.443

CQZZ 0.333212 -0.44924

ZL 0.256942 -0.44937

LJ 0.392175 -0.45649

MZ 0.316927 -0.46192

GY 0.236047 -0.46464

YL 0.265421 -0.48969

ZY 0.282812 -0.50061

GL 0.267284 -0.50331

CQ 0.234784 -0.50475

XJ n/a n/a

The curve fitting was also conducted for the partial sample of 15,000 characters. The results are presented in Table 4.4. This table better demonstrates that the Shu Jing, the Shi Jing, and the Lun Yu could have the most diverse vocabularies, while the Chun Qiu and the Xiao Jing have the least (the Yi Li is not the last text on this list).

4.3. Discussion of the results of developmental profiles

The results of the developmental profiles demonstrate that the WSP texts display varying TTR tendencies. The Shi Jing's vocabulary consistently grows faster than that of any other text, i.e., this text is the most lexically diverse. The Yi Li's vocabulary demonstrates distinctively weaker growth in the WSP corpus. The texts that could be grouped as "historical" (the Zuo Zhuan, the Chun Qiu, the Guliang Zhuan, and the Gongyang Zhuan) tend to be close to one another in the middle of the spectrum (or slightly lower than the average). This is why the final value indices that place the Shi Jing and the Yi Li at opposite ends of the spectrum and also group "historical" texts together are of greater interest.

Is it possible to associate these observations regarding the TTR values, or specific vocabulary richness values in the WSP corpus, with stylistic or genre characteristics such as texts being historical or poetic? Is it possible that vocabulary grows differently for historical texts ascribed to one author, while this index for a historical text by another author could be closer to that of a poetic text? In other words, could vocabulary richness be an index of genre or of a stylistic feature rather than of individual style? How do we evaluate vocabulary richness data for large heterogeneous collections like the texts in the WSP corpus?

72 See "voc_ref.xlsx/MAIN_LOOKUP/ Power Method Curve Fitting Parameters".

At this point, the present author can only admit the complexity of the problem. However, the author cannot agree with the statement that "word count" is irrelevant, i.e., that it cannot be used for any stylistic or authorship analysis. The developmental profiling provides information on the texts' comparative vocabulary richness, singling out the Shi Jing as the richest text and the Yi Li as the simplest, while the other texts take intermediate positions between them.

5. Developmental Profiles of Rare and Frequent Characters

There are two main approaches to utilizing word frequencies for stylistic analysis. In one approach, infrequent words (hapax legomena and dis legomena, i.e., the characters that occur only once or only twice) are analyzed to evaluate the vocabulary richness of texts or to compare texts. In the opposite approach, the most frequent words (functional, "empty" or "noncontent") are considered the key to stylistic analysis. This article will analyze the developmental profiles of hapax legomena (V1) and dis legomena (V2) as well as characters with a frequency of 50 or higher73.

5.1. Hapax legomena (V1) and dis legomena (V2)

One of the important objects of quantitative linguistic analysis is hapax legomena (singletons, unique words, V(1,N) or V1). According to the large number of rare events (LNRE) model of word frequency distributions developed by Baayen74, they, together with dis legomena (V2), play an important role in defining vocabulary growth. The TTR method can be applied to V1 and V2 to create V1 TTR75 and V2 TTR indices. V1 TTR curves demonstrate hapax legomena tendencies during text growth76. The question is: should they follow the general TTR distribution?

73 These are different for most texts, but they definitely include all function characters as well as some of the most frequent content words.

74 Baayen, following Khmaladze, describes word frequency distributions as "Large Number of Rare Events (LNRE) distributions, distributions characterized by the presence of large number of words with very low probabilities of occurrence" (Baayen, "Word Frequency," 54-55), the outcome of which is that "sample relative word frequencies cannot be used to obtain the expected values of the vocabulary size" (ibid., 57).

75 It is sometimes called "the index of diversity" (Tuldava, "Stylistics, Author Identification," 375).

Table 5.1 contains V1 TTR numbers for 30,000 token samples. There is roughly the same order of texts as that for regular TTR values at this sample size, but there are definitely no pronounced groups. The V1 TTR chart (Figure 5.1) shows that most curves are extremely close to one another. Unlike the regular TTR chart, the V1 curves in Figure 5.1 converge very close to 30,000 tokens and they do not have considerable differences in their slopes. Yet, the V1(30,000) TTR value of the Shi Jing (0.027) is roughly three times higher than that of the Yi Li (0.009). Thus, the ratio between the extremes remains.
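The V1 TTR and V2 TTR values can be illustrated with a short sketch. It assumes that a sample consists of the first N characters of a text (the sampling procedure is not restated in this section) and that texts shorter than the sample size are reported as n/a. The function name `spectrum_ttr` and the toy sample are hypothetical.

```python
from collections import Counter


def spectrum_ttr(chars, sample_size, m):
    """Return (V_m, V_m / N) for the first `sample_size` tokens.

    V_m is the number of types that occur exactly m times in the sample
    (m=1 gives hapax legomena, m=2 gives dis legomena).
    Returns None when the text is shorter than the sample ("n/a").
    """
    if len(chars) < sample_size:
        return None
    freq = Counter(chars[:sample_size])
    v_m = sum(1 for count in freq.values() if count == m)
    return v_m, v_m / sample_size


sample = "關關雎鳩在河之洲窈窕淑女君子好逑"  # toy sample, 16 characters
print(spectrum_ttr(sample, 16, 1))     # hapax legomena in the sample
print(spectrum_ttr(sample, 16, 2))     # dis legomena in the sample
print(spectrum_ttr(sample, 30000, 1))  # sample longer than the text
```

Under these assumptions, the last column of Table 5.1 is simply the V1 count divided by 30,000, e.g., 799 / 30000 for the Shi Jing.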

Table 5.1. TTR V1 values for WSP Text (samples of 30000 characters)77

Text N V V1(30000) TTR(30000)

SHI 29622 2833 799 0.026633

ZHZ 65251 2968 713 0.023767

LJ 97994 3041 581 0.019367

ZZ 178563 3235 556 0.018533

CQZZ 195354 3251 515 0.017167

MZ 35354 1892 487 0.016233

ZL 49410 2212 442 0.014733

GL 40835 1594 370 0.012333

GY 44224 1640 341 0.011367

YL 53882 1536 274 0.009133

CQ 16791 941 n/a n/a

LY 15923 1361 n/a n/a

SHU 24537 1910 n/a n/a

XJ 1800 374 n/a n/a

ZY 13348 1030 n/a n/a

76 Some interesting statistics on V1 distribution in modern Chinese web corpora are presented in Hsieh, "Why Chinese Web-as-Corpus is Wacky?".

77 See "voc_ref.xlsx/MAIN_LOOKUP/ TTR V1 values for WSP Text (samples of 30000 characters)".

Figure 5.1. TTR chart for V1 WSP Text samples at 30000 characters78

Dis legomena (V2), the words that are encountered exactly twice, should be, unlike V1, "real content words." Table 5.2 presents the V2(30,000) TTR values, while Figure 5.2 displays the V2(30,000) TTR curves. The dynamic curve view for V2 TTR is similar to the general TTR distribution. This raises the following questions. How consistent is this growth? Is it possible that, since the Shi Jing consists of many different pieces of poetry with flowery language and rare characters, there is constant growth in hapax legomena that do not become dis legomena quickly enough (and, with it, a more slowly diminishing TTR)?

Table 5.2. TTR V2 values for WSP Text (samples of 30000 characters)79

Text N V V2(30000) TTR(30000)

SHI 29622 2833 487 0.0162

ZHZ 65251 2968 361 0.0120

LJ 97994 3041 293 0.0098

ZZ 178563 3235 246 0.0082

CQZZ 195354 3251 253 0.0084

MZ 35354 1892 287 0.0096

ZL 49410 2212 225 0.0075

GL 40835 1594 196 0.0065

GY 44224 1640 222 0.0074

YL 53882 1536 146 0.0049

CQ 16791 941 n/a n/a

LY 15923 1361 n/a n/a

SHU 24537 1910 n/a n/a

XJ 1800 374 n/a n/a

ZY 13348 1030 n/a n/a

78 See "voc_ref.xlsx/V1 spreadsheet".

79 See "voc_ref.xlsx/MAIN_LOOKUP/ TTR V2 values for WSP Text (samples of 30000 characters)".

Figure 5.2. TTR chart for V2 WSP Text samples at 30000 characters80

The V2 TTR distribution is visually closer to the general TTR distribution. The rate of hapax legomena growth is similar for all of the texts. The dis legomena, which could be the content descriptors of the texts, behave similarly to general TTR curves.

According to Baayen, the distribution of hapax legomena is important for understanding when text vocabulary is nearing saturation or is moving from the central LNRE zone to the late LNRE zone. The central LNRE zone is the range of sample sizes "where the expected number of hapax legomena is increasing" (Baayen, "Word Frequency," 56). In the late LNRE zone, the growth of hapax legomena stops, and its curve first flattens and then decreases. The stalling of the growth of hapax legomena is only observed in three texts of the WSP corpus (Figure 5.3). First, for the Zuo Zhuan, when it reaches sample sizes of more than 60,000 (where its TTR curve approaches a horizontal asymptote). Second, for the Yi Li, where it occurs comparatively early, at 40,000 tokens. This fact also places the Yi Li into a category of texts with rather poor vocabulary growth81. The flattening of the V1 curve for the Zuo Zhuan after 60,000, together with the flattening of its TTR curve, should allow one to make some projections regarding the general vocabulary size of the writers in the Warring States period.

80 See "voc_ref.xlsx/V2 spreadsheet".

81 Naturally, this only refers to the WSP corpus.

V(1, N) chart for WSP CTexts at complete sample

Figure 5.3. Hapax legomena for WSP corpus
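The hapax legomena curves of Figure 5.3 can be approximated with the sketch below, which tracks V(1, N) as a text grows and reports where the curve peaks. Taking the maximum of the curve as a marker of the transition toward the late LNRE zone is a simplifying assumption made here for illustration; in the study this is read from the chart. The function names and the synthetic sample are hypothetical.

```python
from collections import Counter


def hapax_curve(chars, step):
    """Return [(N, V1)] recorded every `step` tokens."""
    freq = Counter()
    v1 = 0
    curve = []
    for i, ch in enumerate(chars, start=1):
        freq[ch] += 1
        if freq[ch] == 1:
            v1 += 1   # a new type starts as a hapax legomenon
        elif freq[ch] == 2:
            v1 -= 1   # a second occurrence turns it into a dis legomenon
        if i % step == 0:
            curve.append((i, v1))
    return curve


def peak_of(curve):
    """Return the first (N, V1) point at which V1 is largest."""
    return max(curve, key=lambda point: point[1])


# Synthetic sample: six new characters, then the same six repeated.
curve = hapax_curve("abcdefabcdef", step=3)
print(curve)
print(peak_of(curve))
```

On real texts the curve is noisy, so a peak found this way should be treated as a rough indication rather than a precise boundary.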

5.2. Frequent words (V50+)


The final index reviewed in this study is based on the distribution of the most frequent words. What part of the corpus is covered by the most frequent words?82 Researchers often use parameters such as "corpus coverage," "lexical coverage," or "cumulative frequency" (Da, "A Corpus-based Study of Character and Bigram Frequencies"). The studies that implement them mostly operate with frequency lists, e.g., the top 1000 characters (ordered by frequency). This approach seems poorly suited to the WSP corpus, which is so heterogeneous that it needs an individual list of frequent characters for each text. Moreover, such lists can be difficult to merge into a single frequency list for the entire corpus83.

82 Van Hout and Vermeer describe LFP (Lexical Frequency Profiles), dividing words into nine groups (Hout van and Vermeer, "Comparing Measures of Lexical Richness," 107).

83 Smith and Witten recommend starting from the top 1% of the frequency list, merging the lists for the corpus texts. However, this approach was not chosen for the present study (Smith and Witten, "Language Inference from Function-Word"). As Bin Li et al. note, of the 100 most frequent characters for their corpus, "25 characters surprisingly do not occur in all the literatures" (Li et al., "Corpus-based Statistics," 148). Moreover, "The general characters are themselves of high frequency, but they are not necessarily distributed uniformly" and "this non-uniform distribution reflects the diversity of these literatures in domains, ages, and the writing styles" (Ibid.).

Therefore, for this study, another method was selected: the analysis of the distribution of characters that are found in each individual text fifty times or more (V50+)84. This study will not reject functional words since it is searching for general character frequencies not just the frequencies of content words85.

Figure 5.4 [86] presents the chart of the complete V50+ TTR distributions, while Figure 5.5 [87] displays the charts for V50+ TTR at 30,000 token samples [88]. Table 5.3 presents V50+ data for the complete WSP CTexts, including the ratios of the number of V50+ tokens to all vocabulary as well as the V50+TTR [89]. Table 5.4 presents the V50+ TTR at the sample size of 30,000 tokens [90]. Finally, Table 5.5 presents the data regarding complete text coverage by V50+ characters [91].

According to Table 5.5, V50+ characters tend to cover, on average, more than 70% of texts longer than 30,000 tokens. In addition, this is comparable to using the most frequent words in other methods92. This also justifies the V50+ approach.
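The quantities reported in Tables 5.3 and 5.5 can be sketched as follows, assuming that V50+ is the number of distinct characters occurring at least fifty times in a text and that coverage is the share of all tokens accounted for by those characters. The function name `frequent_char_stats`, the dictionary keys, and the synthetic sample are illustrative assumptions.

```python
from collections import Counter


def frequent_char_stats(chars, threshold=50):
    """Summarize characters that occur at least `threshold` times.

    Assumption: `chars` is a non-empty sequence of characters.
    """
    freq = Counter(chars)
    n = len(chars)                      # N, number of tokens
    v = len(freq)                       # V, number of types
    frequent = {c: k for c, k in freq.items() if k >= threshold}
    v_freq = len(frequent)              # V50+ when threshold is 50
    token_sum = sum(frequent.values())  # tokens covered by frequent types
    return {
        "N": n,
        "V": v,
        "V50+": v_freq,
        "V50+/V": v_freq / v,
        "V50+TTR": v_freq / n,
        "coverage": token_sum / n,
    }


# Synthetic sample: two frequent characters and eight rare ones.
sample = "a" * 60 + "b" * 50 + "cdefghij" * 5
print(frequent_char_stats(sample))
```

With the figures of Table 5.5, the same coverage formula gives 170269 / 195354 ≈ 0.8716 for the Chun Qiu Zuo Zhuan.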

84 See "Voc_ref.xlsx/V50+CHARS spreadsheet," which lists these V50+ characters for each text in the WSP corpus, along with the number of entries and the total sum of the entries of V50+ characters for each text. These sums allow calculating text coverage.

85 Another approach is to reject content words: "A near must of stylometric investigations is to exclude content words from the start. The reason for this is obvious: the use of content words depends on content, and the content of a text (topic) is, by definition, not covered by stylometry" (Golcher, "A New Text Statistical Measure," 3), although Felix Golcher himself offers a content-word ignorant method. On keeping content words and the two urns method, see Kornai, "How Many Words," 50. See also Evert on hapax legomena and dis legomena (Evert, "The Statistics of Word Co-occurrences").

86 See "Voc_ref.xlsx/V50+ spreadsheet."

87 See "Voc_ref.xlsx/V50+ spreadsheet."

88 V50+ characters are not all functional words, naturally. See the list of characters for all of the texts at "Voc_ref.xlsx/V50+CHARS spreadsheet."

89 See "Voc_ref.xlsx/MAIN_LOOKUP/TTR" for V50+ characters in the WSP CTexts (complete samples, sorted by V50+/V(N) decreasing).

90 See "Voc_ref.xlsx/MAIN_LOOKUP/" regarding the TTR V50+ values for the WSP CTexts samples at 30,000 characters (sorted by decreasing TTR).

91 See "Voc_ref.xlsx/MAIN_LOOKUP/" regarding the TTR V50+ for the WSP CTexts complete size (sorted by decreasing V50+ coverage).

92 Jun Da (Da, Jun, "A Corpus-based Study," 6) reports that cumulative coverage of the top 705 characters in their study is approximately 75%.


Figure 5.4. TTR chart for V50+ WSP Complete Text Samples


Figure 5.5. TTR chart for V50+ WSP Text samples at 30000 characters

Table 5.3. TTR of V50+ characters for WSP CTexts (complete samples, sorted by V50+/V(N) decreasing)

Text N V V50+ V50+/V(N) V50+TTR

CQZZ 195354 3251 593 0.18240541 0.0030355

ZZ 178563 3235 560 0.17310665 0.0031361

YL 53882 1536 209 0.13606771 0.0038788

LJ 97994 3041 364 0.11969747 0.0037145

GY 44224 1640 170 0.10365854 0.0038441

GL 40835 1594 155 0.09723965 0.0037958

ZL 49410 2212 187 0.08453888 0.0037847

ZHZ 65251 2968 222 0.07479784 0.0034022

CQ 16791 941 69 0.07332625 0.0041093

MZ 35354 1892 125 0.06606765 0.0035357


ZY 13348 1030 51 0.04951456 0.0038208

SHU 24537 1910 94 0.04921466 0.0038309

LY 15923 1361 53 0.03894195 0.0033285

SHI 29622 2833 94 0.03318037 0.0031733

XJ 1800 374 3 0.00802139 0.0016667

Table 5.4. TTR V50+ values for WSP CTexts samples at 30000 characters (sorted by TTR decreasing)

Text N V50+ TTR

YL 53882 136 0.004533

CQZZ 195354 120 0.004000

ZZ 178563 119 0.003967

GL 40835 118 0.003933

GY 44224 117 0.003900

LJ 97994 115 0.003833

ZL 49410 113 0.003767

MZ 35354 103 0.003433

SHI 29622 94 0.003133

ZHZ 65251 94 0.003133

SHU 24537 n/a 0

LY 15923 n/a 0

ZY 13348 n/a 0

CQ 16791 n/a 0

XJ 1800 n/a 0

Table 5.5. TTR V50+ for WSP CTexts complete size (sorted by V50+ coverage decreasing)

Text N V50+ V50+ sum V50+ coverage

CQZZ 195354 593 170269 0.8716

ZZ 178563 560 153336 0.8587

YL 53882 209 43366 0.8048

LJ 97994 364 76523 0.7809

GY 44224 170 33745 0.7630

GL 40835 155 30718 0.7522

ZHZ 65251 222 47079 0.7215

CQ 16791 69 11826 0.7043

ZL 49410 187 34388 0.6960

MZ 35354 125 23454 0.6634

LY 15923 53 8902 0.5591

ZY 13348 51 7314 0.5479

SHU 24537 94 12818 0.5224

SHI 29622 94 13329 0.4500

XJ 1800 3 206 0.1144

Figure 5.4, and especially Figure 5.5, demonstrate a curve order that is the reverse of what was observed for the regular TTR at 30,000 token samples (Figure 4.2). The Yi Li's curve comes at the top of the most frequent word curves, while the Shi Jing tends to be at the bottom. It appears that the more frequent words (V50+) there are in a text, the lower its TTR score and curve position93.

5.3. Discussion of results for rare and the most frequent words

The data analysis for V1, V2, and V50+ demonstrates that content words (words found in texts two times or more but not too often) provide considerable input into the separation of texts into groups based on the TTR developmental profiles. Hapax legomena tend to be smoother models than dis legomena (and probably include a higher degree of words). For the most frequent words, which are responsible for most of the coverage of the samples, the situation is the reverse of what was observed for the TTR. The highest ratio of such words at the sample size of 30,000 is found in the Yi Li94, while the Shi Jing has the smallest V50+ TTR index value. This could explain the positioning of their regular TTR curves.

93 This could be a similar result to the "law of decreasing new vocabulary growth" described, e.g., by Feng Zhiwei (Feng, "Introduction of Modern Terminology," 1996) and formulated by Li and Zhang as "The repeated occurrences of high frequency words indicate a tendency of decreasing new vocabulary growth" (Li and Zhang, "Inter-textual Vocabulary," 14).

94 According to Table 5.3, the Yi Li is close to the absolute top regarding the V50+/V(N) ratio, directly behind the Chun Qiu Zuo Zhuan and the Zuo Zhuan.

6. Conclusions

This study examined the vocabulary richness of the WSP CTexts corpus with the main objective of establishing the quantitative foundation for a general analysis of text vocabularies, mostly based on developmental profiles of TTR. The WSP CTexts corpus is an open corpus of classical Chinese texts, which allows downloading and independent processing of data. All of the numerical data used in this study are available on GitHub. This study is an attempt to create reproducible research, and all of its components are available for independent processing.

In the first section of this study, traditional final value approaches were utilized to identify whether vocabulary richness indices (constants) could supply important information regarding the stylistic groupings of texts. As previously shown by many researchers (particularly by Tweedie and Baayen), such indices are not really "constants," since they depend on sample size. However, this approach has almost never been applied to classical Chinese texts, and that is one of the reasons why this article presents these indices.

The analysis of the final value indices did not examine the dependency of the indices on sample size, but it demonstrated that a majority are not very useful for the stylistic grouping of texts. However, the comparison of indices was still valuable since it revealed that not all of them were directly related to sample size. Some of the indices allowed the grouping of prosaic historical texts and demonstrated the proclivity of placing the Shi Jing and the Yi Li on opposite ends of the vocabulary richness spectrum.

The first result of this study is the identification that Guiraud's R, Rubet's K, and Brunet's W meet all of these criteria. Consequently, they can be considered suitable candidates for further stylistic analysis.
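For orientation, the three indices named above can be computed as in the sketch below. The formulas are the forms commonly given in the literature on lexical richness (for example, in the Tweedie and Baayen survey listed in the references), and it is an assumption that the present study uses exactly these variants, including the exponent 0.172 in Brunet's W. The N and V inputs are the complete-text values for the Shi Jing and the Yi Li from Table 5.1.

```python
import math


def guiraud_r(n, v):
    """Guiraud's R = V / sqrt(N); higher values suggest a richer vocabulary."""
    return v / math.sqrt(n)


def rubet_k(n, v):
    """Rubet's k = log V / log(log N); higher values suggest a richer vocabulary."""
    return math.log(v) / math.log(math.log(n))


def brunet_w(n, v, a=0.172):
    """Brunet's W = N ** (V ** -a); lower values suggest a richer vocabulary."""
    return n ** (v ** -a)


texts = {"SHI": (29622, 2833), "YL": (53882, 1536)}  # (N, V) from Table 5.1
for name, (n, v) in texts.items():
    print(name, round(guiraud_r(n, v), 2), round(rubet_k(n, v), 2),
          round(brunet_w(n, v), 2))
```

Under these assumed definitions, all three indices place the Shi Jing on the richer side of the Yi Li, in line with the grouping described above.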

In the second section, instead of the final value approach, development profiles were analyzed, mostly for the TTR95. The TTR developmental profiles allowed observing at what rate new words were added to the existing vocabulary. It also allowed observation of the direct change of vocabulary over comparable text sizes and comparing texts from this relative viewpoint.

The WSP Ctexts is a collection of large-sized heterogeneous texts, usually consisting of many chapters that must be considered independent texts themselves. In addition, there is no single narrative structure. Therefore, the texts were abstracted as streams of characters. The analysis of the complete length curves showed that some type of vocabulary saturation occurs around the sample size of 60,000 characters for the longest texts. Furthermore, their TTR curves approach a horizontal asymptote, and their hapax legomena numbers stop increasing. Therefore, the sample sizes of 15,000 and 30,000 tokens were chosen as cross-cut points.

The comparison of the TTR developmental curves showed that the Shi Jing and the Yi Li create upper and lower borders for other curves, serving as extremes of the spectrum of developmental curves. The texts between them can also be separated (at the 15,000 sample size) into a group of "historical texts" (the Chun Qiu, the Zuo Zhuan, the Guliang Zhuan, and the Gongyang Zhuan) and the remainder.

95 Similar analysis is possible for other indices, especially Guiraud's R, Rubet's k, and Brunet's W. However, the TTR is "transparent" and it was good enough for this study.

Among the WSP corpus texts, the Shi Jing demonstrated the most diverse vocabulary with the highest rate of growth. The Yi Li showed the lowest rate of inflow of new characters in the entire corpus as well as the least diverse vocabulary. At the 30,000 sample size, the Shi Jing's TTR value was practically three times larger than that of the Yi Li, i.e., at that sample size its vocabulary was three times larger than that of the Yi Li. Roughly the same ratio was observed for the first two spectral elements (V1 and V2).
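As a rough check against Tables 5.1 and 5.2, the Shi Jing to Yi Li ratios for V1 and V2 at the 30,000 sample size can be worked out directly from the tabulated counts, since the common denominator of 30,000 cancels:

```latex
\[
\frac{V_1^{\mathrm{SHI}}/30000}{V_1^{\mathrm{YL}}/30000}
  = \frac{799}{274} \approx 2.92,
\qquad
\frac{V_2^{\mathrm{SHI}}/30000}{V_2^{\mathrm{YL}}/30000}
  = \frac{487}{146} \approx 3.34 .
\]
```

Both values are close to three, which is the order of magnitude referred to in the text.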

However, this ratio was reversed for the most frequent characters (V50+, found 50 times or more in a text). The numbers presented in the third section concern the influx of hapax legomena and dis legomena as well as the most frequent words. The Yi Li had the highest rate of accumulating frequent words per 1,000 tokens, while the Shi Jing had the lowest. In other words, the Yi Li utilized functional and high-frequency characters to a much higher degree than the Shi Jing, while the Shi Jing had a more diverse vocabulary. Hapax legomena in the Yi Li stopped increasing and even began decreasing from about 70% of the sample size. This signaled stagnation of vocabulary growth, whereas in the Shi Jing the rate of increase in hapax legomena did not slow.

The second result of this study is the discovery that, in regard to vocabulary richness, the Shi Jing and the Yi Li formed two extremes of The Thirteen Classics96. Unfortunately, there is no clear separation (based on the TTR developmental profiles) of the texts into genre categories. The Shi Jing stands out as the only poetic text. In addition, the "historical" texts such as the Chun Qiu, the Zuo Zhuan, the Guliang Zhuan, and the Gongyang Zhuan display very similar types of vocabulary growth, which makes them a special group in the corpus. The other texts, such as the Zhuangzi, the Shu Jing, and the Mengzi, also demonstrate higher vocabulary growth but do not form a special "philosophical" group. The same is true for "ritualistic texts."

It was mentioned, a few times in this study, that a vocabulary richness analysis of large texts can be used in stylistic analysis. Is it possible to stylistically interpret the results of the analysis presented in this study? Is it possible to interpret the data regarding the growth of the Shi Jing vocabulary? Can this interpretation be performed in comparison with the Yi Li? This could possibly be evidence that the Shi Jing tends to incorporate the vocabulary of many diverse poems, while the Yi Li tends to be formulaic and utilizes many standard expressions and function characters. The present author agrees with Hoover's opinion:


96 Without the Er Ya.


Such measures cannot provide a consistent, reliable, or satisfactory means of identifying an author or describing a style. There is so much intratextual and intertextual variation among texts and authors that measures of vocabulary richness should be used with great caution, if at all, and should be treated only as preliminary indications of authorship, as rough suggestions about the style of a text or author, as characterizations of texts at the extremes of the range from richness to concentration. Perhaps their only significant usefulness is as an index of what texts or sections of texts may repay further analysis by more robust methods. (Hoover, "Another Perspective," 173)

Further analysis of the individual vocabularies of these subtexts is necessary. Combined with the general framework presented in this study, this analysis should clarify the lexical landscape of The Thirteen Classics. The continuation of this study will conduct a more detailed examination regarding the statistical setup of the vocabularies of the corpus.

Abbreviations

Chun Qiu CQ

Chun Qiu Zuo Zhuan CQZZ

Gongyang Zhuan GY

Guliang Zhuan GL

Li Ji LJ

Lun Yu LY

Mengzi MZ

Shi Jing SHI

Shu Jing SHU

Xiao Jing XJ

Yi Li YL

Zhou Yi ZY

Zhuangzi ZHZ

Zhou Li ZL

Zuo Zhuan ZZ

Literature

Baayen, Harald R. Word frequency distributions. Dordrecht: Text, speech, and language technology No. 18, Kluwer Academic, 2001.

Bremond, Claude. "The Logic of Narrative Possibilities." 1966. Trans. Elaine D. Cancalon. New Literary History: A Journal of Theory and Interpretation 11:3 (1980): 387-411.

Brooks, Bruce E., Brooks, Taeko A. The emergence of China: from Confucius to the empire. Amherst: Ancient China in Context, University of Massachusetts at Amherst, 2015.


Da, Jun. "A corpus-based study of character and bigram frequencies in Chinese e-texts and its implications for Chinese language instruction." In Zhang, Pu, Tianwei Xie and Juan Xu (eds.) The studies on the theory and methodology of the digitalized Chinese teaching to foreigners: Proceedings of the Fourth International Conference on New Technologies in Teaching and Learning Chinese. Beijing: Tsinghua University Press (2004): 501-511.

Daller, Michael H. "Guiraud's index of lexical richness". In: British Association of Applied Linguistics, September 2010 (DOI: http://eprints.uwe.ac.uk/11902/).

Duran, Pilar, D. Malvern, B. Richards and N. Chipere. "Developmental trends in lexical diversity." Applied Linguistics 25:2 (2004): 220-242.

Evert, Stefan. The Statistics of Word Co-occurrences: Word Pairs and Collocations. PhD Thesis, Institut für maschinelle Sprachverarbeitung, Universität Stuttgart, Stuttgart, 2005.

Feng, Zhiwei. Introduction of Modern Terminology. Beijing: The Language Publishing House, 1996.

Feng, Zhiwei. "Evolution and present situation of corpus research in China." International Journal of Corpus Linguistics 11:2 (2006), 173-207.

Golcher, Felix. "A New Text Statistical Measure and its Application to Stylometry." In Proc. of the Corpus Linguistics conference (CL'07), Article 71, 2007.

Herdan, Gustav. Type-token mathematics; a textbook of mathematical linguistics. 'S-Gravenhage Mouton: Janua linguarum, studia memoriae Nicolai van Wijk dedicate No. 4, 1960.

Herdan, Gustav. The advanced theory of language as choice and chance. Kommunikation und Kybernetik in Einzeldarstellungen; Bd. 4 New York: Springer-Verlag, 1966.

Hoover, David L. "Another Perspective on Vocabulary Richness." Computers and the Humanities 37 (2003) 151-178.

Hout, Roeland van and Anne Vermeer. "Comparing measures of lexical richness." In: Helmut Daller, James Milton and Jeanine Treffers-Daller (Eds.), Modelling and Assessing Vocabulary Knowledge. Cambridge, UK: Cambridge University Press, Ch. 5, 93-115.

Hsieh, Shu-Kai. "Why Chinese Web-as-Corpus is Wacky? Or: How Big Data is Killing Chinese Corpus Linguistics." In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), eds. Nicoletta Calzolari (Conference Chair) ,Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis, 2014, May, 26-31, Reykjavik, European Language Resources Association (ELRA).

Kornai, Andras. "How many words are there?" Glottometrics 4 (2002), 61-86.

Köhler, Reinhard , Gabriel Altmann and Rajmund G. Piotrowski (Eds.) Quantitative Linguistik / Quantitative Linguistics: Ein internationales Handbuch / An International Handbook. Walter de Gruyter, 2005.

Kytö, Merja and Anke Lüdeling (Eds.). Corpus linguistics: an international handbook. Berlin, New York: Walter de Gruyter: Handbooks of linguistics and communication science, 29.1-29.2 Handbücher zur Sprach- und Kommunikationswissenschaft Bd. 29.1-29.2, 2008-2009.


Laufer, Batia and Paul Nation. "A vocabulary-size test of controlled productive ability." Language Testing 16:1 (1999), 33-51.

Le Quan Ha, E. I. Sicilia-Garcia, Ji Ming and F. J. Smith. "Extension of Zipf's Law to Word and Character N-grams for English and Chinese". Computational Linguistics and Chinese Language Processing 8:1 (2003), 77-102.

Li, Bin, Ning Xi, Minxuan Feng, and Xiaohe Chen. "Corpus-Based Statistics of Pre-Qin Chinese." In Chinese Lexical Semantics — 13th Workshop, CLSW 2012, Wuhan, China, July 6-8, 2012, ed. by Donghong Ji and Guozheng Xiao 145-153, Berlin-Heidelberg: Springer-Verlag, 2013.

Li, J. and F. Zhang. Inter-textual vocabulary growth patterns for marine engineering English. Beijing: Editorial office for contemporary foreign languages, 2011.

Li Hongzao's Hanyuan Shisanjing ji zi 翰苑十三經集字 (A collection of characters in the Thirteen Classics compiled by the National Academy), 1889.

Loewe, Michael (Ed.) Early Chinese Texts: a Bibliographical Guide. Berkeley: The Society for the Study of Early China and the Institute of East Asian Studies, University of California, 1993.

Malvern, David D., Ngoni Chipere, Brian J. Richards and Pilar Duran Lexical Diversity and Language Development. Houndmills, Basingstoke, Hampshire, New York: Palgrave Macmillan, 2004.

Malvern, David D. and Bryan Richards. "Measures of Lexical Richness." In: The Encyclopedia of Applied Linguistics. Ed. by Carol A. Chapelle. Blackwell Publishing Ltd., 2013.

McCarthy, Philip M. and Scott Jarvis. "MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment." Behavior Research Methods, 42:2 (2010), 381-392.

McLeod, Russell. "Sinological Indexes in the Computer Age: The ICS Ancient Chinese Text Concordance Series." China Review International 1 no. 1 (1994): 48-53.

Meyer, Dirk. Philosophy on Bamboo: Text and the Production of Meaning in Early China. Leiden: HCT 2, Brill, 2012.

Michell, Colin Simon. Investigating the Use of Forensic Stylistic and Stylometric Techniques in the Analysis of Authorship on a Publicly Accessible Social Networking Site (Facebook). MA Thesis 2013.

Mitchell, David. "Type-token models: a comparative study." Journal of Quantitative Linguistics, 22:1 (2015), 1-21.

Naranan, S. and V.K. Balasubrahmanyan. "Models for Power Law Relations in Linguistics and Information Science." Journal of Quantitative Linguistics 5:12 (1998), 35-61.

Nelson, Robert. "Issues with the capture-recapture measure of vocabulary size." The Mental Lexicon 10:1 (2015), 168-179.

Oakes, Michael Philip. "Corpus Linguistics and Stylometry". In A. Lüdeling and M. Kytö (Eds.), Corpus Linguistics: An International Handbook. Mouton de Gruyter, 2009: 1070-1090.

Peng, Fuchun, Dale Schuurmans, Shaojun Wang, and Vlado Keselj. "Language independent authorship attribution using character level language models". In Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics, 1 (2003), 267-274.

Piantadosi, Steven T. "Zipf's word frequency law in natural language: A critical review and future directions." Psychonomic Bulletin & Review, 21 (2014), 1112-1130.

Popescu, Ioan-Iovitz. Word Frequency Studies, Quantitative linguistics 64, Walter de Gruyter, 2009.

Popescu, Ioan-Iovitz, J. Macutek and Gabriel Altmann, Aspects of Word Frequencies. Lüdenscheid, 2009.

Qiu Xigui 裘錫圭. Chinese Writing. Translated by Gilbert L. Mattos and Jerry Norman. Early China Special Monograph Series No. 4. Berkeley: The Society for the Study of Early China and The Institute of East Asian Studies, University of California, 2000.

Read, John. Assessing vocabulary. Cambridge, England: Cambridge University Press, 2000.

Sampson, Geoffrey. "Review of Harald Baayen: Word Frequency Distributions." Computational Linguistics 28 (2002): 565-569.

Smith, Tony C. and Ian H.Witten. "Language Inference from Function-Word". Working Paper 93/3, University of Waikato, New Zealand, 1993.

Tweedie, Fiona J. and R. Harald Baayen. "How Variable May a Constant be? Measures of Lexical Richness in Perspective." Computers and the Humanities 32:5 (1998), 323-352.

Tuldava, Juhan. "Stylistics, author identification." In Köhler, Reinhard, Gabriel Altmann and Rajmund G. Piotrowski (Eds.) Quantitative Linguistik / Quantitative Linguistics: Ein internationales Handbuch / An International Handbook. Walter de Gruyter, Berlin, New York, 2005, 368-387.

Vulanovic , Relja and Köhler, Reinhard. "Syntactic units and structures". In Köhler, Reinhard , Gabriel Altmann and Rajmund G. Piotrowski (Eds.) Quantitative Linguistik / Quantitative Linguistics: Ein internationales Handbuch / An International Handbook. Walter de Gruyter, Berlin, New York, 2005, 274-291.

Wang Dahui, Menghui Li and Zengru Di. "True reason for Zipf's law in language." Physica A 358 (2005): 545-550.

Wimmer, Gejza. "The Type-Token relation." In Köhler, Reinhard , Gabriel Altmann and Rajmund G. Piotrowski (Eds.) Quantitative Linguistik / Quantitative Linguistics: Ein internationales Handbuch / An International Handbook. Walter de Gruyter, Berlin, New York, 2005, 361-368.

Wimmer, Gejza and Gabriel Altmann. "On Vocabulary Richness." Journal of Quantitative Linguistics 6:1 (1999), 1-9.

Xiao, Hang. "On the Applicability of Zipf's Law in Chinese Word Frequency Distribution." Journal of Chinese Language and Computing 18:1 (2008), 33-46.

Yang, Yuting, Yunhua Qu, Chenyao Bao and Xiaowen Zhang. "A Model-based Feature Optimization Approach to Chinese Language Processing." Journal of Quantitative Linguistics 22:1 (2015): 55-81.

Zhang, Dongbo and Shouhui Zhao. "The Totality of Chinese Characters — A Digital Perspective." Journal of Chinese Language and Computing 17:2 (2007), 107-125.

Zinin, Sergey. "Pre-Qin Digital Classics: Study of Text Length Variations". — Учёные записки отдела Китая, выпуск 15, 44 научная конференция Общество и государство в Китае, том XLIV, ч. 2, М., Институт Востоковедения РАН (Scholarly Reports of the Department of China of the Institute of Oriental Studies, Russian Academy of Sciences, issue 15, The 44th Conference "Society and State in China", vol. XLIV, pt. 2, Moscow) (2014): 270-311.

S.V. Zinin*

Vocabulary richness of early Chinese texts: macroanalysis of the Thirteen Classics and the Zhuangzi

ABSTRACT: This study analyzes statistical data on the vocabulary richness of the Warring States Project CTexts collection of Chinese classics[97]. In quantitative linguistics, vocabulary richness has been used primarily for authorship identification and style analysis, and it is increasingly applied in other linguistic fields, for example in the study of language acquisition. This study lays the foundation for a quantitative linguistic analysis of the vocabulary of early Chinese texts. It also conducts a macroanalysis of the data, which includes calculating several vocabulary richness indices and building charts of vocabulary growth. The study finds significant differences in vocabulary growth among the corpus texts. It shows that the Shi Jing and the Yi Li lie at the two extremes of the vocabulary growth spectrum, and it identifies some historical texts in the middle of the spectrum as a distinct group. Furthermore, the study takes a closer look at specific aspects of vocabulary growth, such as hapax legomena, dis legomena, and the most frequent characters.

KEYWORDS: Chinese canons, The Thirteen Classics, computational linguistics, quantitative linguistics, vocabulary richness, lexical diversity, type-token ratio, digital corpora, stylometry.

* Zinin Sergey, Warring States Project, University of Massachusetts, Amherst.

97 The collection contains the Thirteen Classics (excluding the Er Ya) and adds the Zhuangzi to balance the "Confucian" texts.
