CC BY
PHILOLOGICAL SCIENCES

STRUCTURE OF 'COCA' (CORPUS OF CONTEMPORARY AMERICAN ENGLISH) AND SIMPLE QUERIES ON IT

Ataboev N.B.

Ataboev Nozimjon Bobojon o'g'li - Doctoral Student, UZBEK STATE WORLD LANGUAGES UNIVERSITY, TASHKENT, REPUBLIC OF UZBEKISTAN

Abstract: this article discusses the structure of COCA and its components. The content of the corpus is analyzed in terms of the number of words, the type and genre of its materials, and other features. The use of collocations is analyzed by means of searches on COCA. Adjective + noun constructions formed with the adjectives beautiful and handsome are studied empirically, using quantitative results on their use with common nouns denoting members of each gender.

Keywords: corpus linguistics, corpus, COCA, national corpus, corpus classification, corpus structure.

UDC 800

Our world is undergoing great changes day by day, and these changes affect the pace of development in linguistics as well. One of the best examples is the emergence of corpus linguistics, which has created many opportunities to ease the work of translators and interpreters. This article deals with the application of corpus analysis in the practice of translation. Corpus linguistics is a relatively new field of linguistics dedicated to the design, creation and use of text corpora. The term is associated with corpus practices that developed in the 1960s, and the field has been based on computer technology since the 1980s.

To be more precise, both terms, corpus linguistics and corpus, emerged in linguistics in the second half of the twentieth century. According to N. Dash, corpus linguistics is an important area of applied linguistics and plays an important role in linguistic research. It provides researchers with a quantitative empirical database on language use. This database is a corpus, which is compiled using well-defined statistical methods and technologies for collecting sources [5; P. 3-4]. A linguistic corpus is a collection of speech units assembled by linguists according to systematic principles. Its value lies in the frequency results generated by a concordance search engine, which facilitate empirical analysis.

Any corpus user expects the corpus to show: 1) which words are associated with the search word or phrase and in which contexts it can be used; 2) how the meaning or use of a word differs across genres; 3) the differences in use between words that are close in meaning; 4) the frequency and distribution across genres of words in the same semantic field [4, P. 13]. According to the classification by N. Dash [5; P. 10-11], COCA is defined as: a) both a written and a spoken corpus by genre; b) a general corpus by nature; c) a monolingual corpus by type of text; d) a non-annotated corpus by purpose of design.

In his 2009 work, Mark Davies considers the size of the COCA corpus and compares it with the BNC and the COBUILD / Bank of English, noting that the former is about one-fourth the size of COCA and the latter is slightly larger. As of August 2009, COCA contained 160,000 texts with a total volume of more than 400 million words [4, P. 15].

As of December 2017, the COCA database contains 560 million words in 220,225 texts. The texts are divided by year from 1990 to 2017, with about 20 million words added annually. Within each year, and in the corpus as a whole, the materials are grouped into five genres: spoken, fiction, popular magazines, newspapers, and academic journals. Texts are drawn from well-defined sources [2]:

Spoken texts comprise about 118 million words (118,167,133), drawn from transcripts of more than 150 television and radio programs. Examples include All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, and so on.

Fiction texts comprise about 113 million words (113,404,735). They include short stories and plays from literary magazines, children's magazines and popular magazines, first chapters of books published from the 1990s to the present, and movie scripts.

Popular magazine texts comprise about 118 million words (118,450,563), selected from nearly 100 different magazines covering a range of domains (news, health, home and gardening, women, finance, religion, sports, etc.). Examples include Time, Men's Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, Sports Illustrated, etc.

Newspaper texts comprise about 114 million words (114,341,164), selected from ten newspapers from across the United States, for example USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc. In most cases the selection also aims at a diversity of themes.

Academic journal texts comprise about 112 million words (111,537,393), collected from around 100 peer-reviewed journals. The journals were selected to cover the range of the Library of Congress classification system. For example, texts are collected from classes such as B (philosophy, psychology, religion), D (world history), L (education), T (technology), etc.

Table 1. Number of words added to COCA per year, by genre (2011-2017)

YEAR       SPOK       FIC       MAG      NEWS      ACAD      TOTAL
2011    4760687   4166029   4199378   3986321   4551005   21663420
2012    4336058   4335155   4294190   4173813   4337823   21477039
2013    4019619   4225162   4173336   4133917   3531695   20083729
2014    4004868   4134220   4266683   4142500   3456761   20005032
2015    4005894   4255674   4195487   4130818   3609226   20197099
2016    4371199   4197883   4087037   4134560   4005824   20796503
2017    4404291   4228709   4252889   4242760   4109588   21238237
TOTAL  29902616  29542832  29469000  28944689  27601922  145461059

Source: www.english-corpora.org.
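The yearly and per-genre totals in Table 1 can be recomputed from the individual cells, which also gives each genre's share of the 2011-2017 material. The following Python sketch is only illustrative: the figures are copied from Table 1, while the names COUNTS, year_totals, genre_totals and genre_shares are hypothetical helpers introduced here and are not part of COCA or its website.

```python
# Word counts per year, copied from Table 1.
# Tuple order follows GENRES: SPOK, FIC, MAG, NEWS, ACAD.
GENRES = ("SPOK", "FIC", "MAG", "NEWS", "ACAD")
COUNTS = {
    2011: (4760687, 4166029, 4199378, 3986321, 4551005),
    2012: (4336058, 4335155, 4294190, 4173813, 4337823),
    2013: (4019619, 4225162, 4173336, 4133917, 3531695),
    2014: (4004868, 4134220, 4266683, 4142500, 3456761),
    2015: (4005894, 4255674, 4195487, 4130818, 3609226),
    2016: (4371199, 4197883, 4087037, 4134560, 4005824),
    2017: (4404291, 4228709, 4252889, 4242760, 4109588),
}


def year_totals(counts):
    """Sum the five genre counts for each year (the TOTAL column)."""
    return {year: sum(row) for year, row in counts.items()}


def genre_totals(counts):
    """Sum each genre over all years (the TOTAL row)."""
    return {
        genre: sum(row[i] for row in counts.values())
        for i, genre in enumerate(GENRES)
    }


def genre_shares(counts):
    """Return each genre's fraction of all words in the table."""
    totals = genre_totals(counts)
    grand_total = sum(totals.values())
    return {genre: total / grand_total for genre, total in totals.items()}


if __name__ == "__main__":
    for year, total in year_totals(COUNTS).items():
        print(year, total)
    for genre, share in genre_shares(COUNTS).items():
        print(f"{genre}: {share:.1%}")
```

Run as a script, this prints one total per year followed by the percentage share of each genre; the recomputed sums agree with the TOTAL row and column of Table 1.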

Referring to the term collocation, M. Halliday argues that lexical units should be considered not only at the syntactic level but also in their textual context. As an example, he observes that strong tea and powerful car are combinations that can be used in English, whereas the variants powerful tea and strong car do not exist in the language [3]. However, no attested examples are given to support this claim. Corpus queries can give quantitative results that help to test such views. At the same time, the frequency of a single designated collocation does not by itself give an accurate picture; it only provides numerical information about the search terms in a corpus [1]. For example, in order to differentiate between the use of the adjectives handsome and beautiful in relation to a person, it is wrong to draw conclusions about all members of a gender based only on the frequency of handsome man or handsome woman and of beautiful woman or beautiful man. The corpus contains further nouns denoting gender, such as wife and husband, girl and boy, female and male, etc., and these adjectives are also used with those nouns.

Table 2. Frequencies of beautiful + noun and handsome + noun in COCA

Beautiful + noun                    Handsome + noun
Female nouns      Male nouns        Female nouns      Male nouns
Woman      2210   Man         363   Woman       161   Man        1110
Women      1039   Men         155   Women        41   Men         165
Girl        915   Boy         232   Girl         28   Boy         159
Wife        469   Husband      69   Lady          9   Guy         151
Mother      383   Father      108   Wife         12   Husband      79
Daughter    281   Son          73   Daughter      6   Son          72
Lady        227   Gentleman     3   Mother       31   Father       63
Ladies       69   Gentlemen     6   Sister        0   Brother      31
Total      5593   Total      1009   Total       288   Total      1830

Source: compiled by the author from data at https://www.english-corpora.org/coca/.

The table shows that, in COCA, the use of beautiful and handsome with nouns of either gender cannot be reduced to a strict rule stating which combination is possible and which is impossible, or that one exists and the other does not. Instead, corpus results help to identify the most common or most widely applicable words and collocations. For example, the adjective beautiful is used more commonly to describe a female, while the adjective handsome is used more commonly to describe a male, as is apparent from the frequency totals in Table 2.
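The tendency described above can be expressed as proportions. The Python sketch below is a minimal illustration under stated assumptions: the frequencies are copied from the beautiful/handsome table above, and the names BEAUTIFUL, HANDSOME and share_by_gender are hypothetical, introduced only for this example. It does not query COCA itself.

```python
# Frequencies of adjective + noun pairs, copied from the table above.
BEAUTIFUL = {
    "female": {"woman": 2210, "women": 1039, "girl": 915, "wife": 469,
               "mother": 383, "daughter": 281, "lady": 227, "ladies": 69},
    "male": {"man": 363, "men": 155, "boy": 232, "husband": 69,
             "father": 108, "son": 73, "gentleman": 3, "gentlemen": 6},
}
HANDSOME = {
    "female": {"woman": 161, "women": 41, "girl": 28, "lady": 9,
               "wife": 12, "daughter": 6, "mother": 31, "sister": 0},
    "male": {"man": 1110, "men": 165, "boy": 159, "guy": 151,
             "husband": 79, "son": 72, "father": 63, "brother": 31},
}


def share_by_gender(freqs):
    """Return the fraction of all hits that falls on each gender group."""
    totals = {gender: sum(nouns.values()) for gender, nouns in freqs.items()}
    grand_total = sum(totals.values())
    return {gender: total / grand_total for gender, total in totals.items()}


if __name__ == "__main__":
    for name, freqs in (("beautiful", BEAUTIFUL), ("handsome", HANDSOME)):
        shares = share_by_gender(freqs)
        print(name, {gender: f"{share:.1%}" for gender, share in shares.items()})
```

On these figures, roughly 85% of the beautiful + noun hits involve female nouns and roughly 86% of the handsome + noun hits involve male nouns, which is a preference rather than an absolute restriction.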

References

1. Ataboev N. B. Problematic issues of corpus analysis and its shortcomings // ISJ Theoretical & Applied Science. 10 (78), 2019. P. 170-173.

2. Corpus of Contemporary American English // English-corpora.org [Electronic Resource]. URL: https://www.english-corpora.org/coca/ (date of access: 30.12.2019).

3. Halliday M. An introduction to Functional Grammar. 3rd ed. / Revised by C. Matthiessen. London: Hodder Arnold, 2004. 700 p.

4. Davies M. Semantically-based, learner-oriented queries with the 400+ million word Corpus of Contemporary American English // Explorations across Languages and Corpora / Ed. by Stanislaw Gozdz-Roszkowski. Berlin, 2009. 622 p. P. 13-28.

5. Niladri Sekhar Dash. Corpus Linguistics: An Introduction. Encyclopedia of Life Support Systems (EOLSS), India, 2015. 11 p.
