




Digitizing Cyrillic Manuscripts for the Historical Dictionary of the Serbian Language Using Handwritten Text Recognition Technology*

Vladimir Polomac

University of Kragujevac, Kragujevac, Serbia

Marina Kurešević, Isidora Bjelaković, Aleksandra Colić Jovanović, Sanja Petrović

University of Novi Sad, Novi Sad, Serbia

Оцифровка кириллических рукописей для исторического словаря сербского языка с использованием технологии распознавания рукописного текста

Владимир Поломац

Университет в Крагуеваце, Сербия

Марина Курешевич Исидора Бьелакович Александра Цолич Йованович Саня Петрович

Университет в Нови-Саде, Нови-Сад, Сербия

Цитирование: Поломац В., Курешевич М, Бьелакович И., Цолич Йованович А, Петрович С. Оцифровка кириллических рукописей для исторического словаря сербского языка с использованием технологии распознавания рукописного текста // Slovene. 2023. Vol. 12, № 1. C. 295-316.

Citation: Polomac V., Kurešević M., Bjelaković I., Colić Jovanović A., Petrović S. (2023) Digitizing Cyrillic Manuscripts for the Historical Dictionary of the Serbian Language Using Handwritten Text Recognition Technology. Slovene, Vol. 12, № 1, p. 295-316.

DOI: 10.31168/2305-6754.2023.1.08

This is an open access article distributed under the Creative Commons Attribution-NoDerivatives 4.0 International license.

Abstract

The paper explores the possibilities of using information technologies based on machine learning and artificial intelligence in digitizing Cyrillic manuscripts for the purpose of creating a historical dictionary of the Serbian language. The empirical research is based on the use of the Transkribus software platform to create a model for automatic text recognition of the manuscripts of Gavril Stefanović Venclović, the most significant and prolific Serbian cultural enthusiast of the 18th century, whose extensive manuscript legacy in Serbian vernacular is the most important primary source for the historical dictionary of the Serbian language of this period. The results show that the digitization of Cyrillic manuscripts for the historical dictionary can be significantly accelerated with Transkribus by creating specific and generic models for automatic text recognition. The advantage of automatic text recognition over traditional methods lies particularly in the possibility of continuously improving the performance of specific and generic models as transcription progresses and as the amount of digitized text available for training a new version of the model increases.

Keywords

Transkribus, automatic text recognition, artificial intelligence, machine learning, historical lexicography, Serbian language, Gavril Stefanović Venclović

Резюме

В статье исследуются возможности использования информационных технологий, основанных на принципах машинного обучения и искусственного интеллекта, в процессе оцифровки кириллических рукописей в целях создания исторического словаря сербского языка. Эмпирическое исследование основано на использовании программной платформы Transkribus при создании модели автоматического распознавания текста рукописей Гаврила Стефановича Венцловича, самого значительного и плодовитого сербского культурного энтузиаста XVIII в., чье обширное рукописное наследие в сербском народном языке представляет собой наиболее значительный первоисточник исторического словаря сербского языка, относящегося к этому периоду. По результатам проведенного исследования можно сделать вывод, что процесс оцифровки кириллических рукописей в целях создания исторического словаря сербского языка можно значительно ускорить с помощью Transkribus через создание определенных и генерических моделей для автоматического распознавания текста. Преимущество автоматического распознавания текста по сравнению с традиционным, в частности, выражается в возможности постоянного улучшения производительности определенных и генерических моделей в соответствии с ходом процесса транскрипции и увеличением объема оцифрованного текста, который можно использовать для обучения новой версии модели.

Ключевые слова

Transkribus, автоматическое распознавание текста, искусственный интеллект, машинное обучение, историческая лексикография, сербский язык, Гаврил Стефанович Венцлович

* The paper was financed by the Ministry of Education, Science and Technological Development of the Republic of Serbia and the German Academic Exchange Service (DAAD) (project: Automatic Text Recognition of Serbian Medieval Manuscripts and Early Printed Books: Problems and Perspectives). A previous version, entitled Serbian Written Heritage of the 18th century: Towards Automatic Text Recognition of Gavril Stefanović Venclović's Manuscripts, was presented at the 17th annual conference of the Slavic Linguistics Society (19-21 September 2022, Hokkaido University, Sapporo, Japan). The authors would like to express their gratitude to Academician Vasilije Krestić (manager) and Dr. Miroslav Jovanović (vice manager) of the SASA (Serbian Academy of Sciences and Arts) Archives for providing the digital copies of Venclović's manuscripts used in this paper.

1. Introduction

Recent research on the use of the Transkribus software platform¹ for the automatic recognition of Russian and Serbian Church Slavonic Cyrillic manuscripts and printed books [Rabus 2019; Polomac, Lutovac Kaznovac 2021; Polomac 2022a; 2022b] prompted the investigation presented in this article. In his pioneering study, the German Slavist A. Rabus [2019a] demonstrated that, with this software platform, a first version of the automatically recognized text can be obtained with only 3-4% of misrecognized characters. Furthermore, this can be done in significantly less time and with fewer human and financial resources. The output can then be used for further philological and linguistic research, especially after manual correction (editing) by a competent philologist. A further contribution of that study is that its models for automatic recognition have been made available to all Transkribus users, so that their performance can be checked on other Slavic medieval manuscripts. A paper by V. Polomac and T. Lutovac Kaznovac [2021] examined the performance of Rabus's models on Serbian medieval manuscripts written in different types of Cyrillic script. The authors concluded that Rabus's models yielded relatively favorable results on Serbian medieval manuscripts written in poluustav ('semi-majuscule Cyrillic script'), while the creation of specific models was suggested for manuscripts written in brzopis ('diplomatic minuscule Cyrillic script'). The same paper underscored the necessity of creating specific models for the recognition of Serbian medieval manuscripts and printed books, in particular in order to speed up work on current projects in the historical corpus linguistics and lexicography of the Serbian language. The creation of such a model for Serbian Church Slavonic printed books was the focus of the studies by V. Polomac [2022a; 2022b]. Their most important result is a publicly available generic model for the automatic recognition of Serbian Church Slavonic printed books of the 15th and 16th centuries, entitled Dionisio 2.0.

Continuing this research, the authors of the present study investigate whether the Transkribus software platform can be used for the automatic recognition of the Serbian manuscript heritage of the 18th century, and thus to speed up the preparation of the electronic corpus for the historical dictionary of Serbian. The empirical research was conducted on the manuscripts of Gavril Stefanović Venclović, one of the most significant and prolific Serbian cultural forerunners of the 18th century, whose legacy, written in Serbian vernacular and in Serbian Church Slavonic, includes more than 20 manuscripts with around 10,000 pages in total.² The second section of the paper presents the conception of the Serbian historical dictionary in more detail, with particular reference to the principles of text digitization. After a brief review of Venclović's manuscripts written in Serbian vernacular, the third and central section presents and discusses the results of the experiments in creating and evaluating models for the automatic recognition of these texts with the Transkribus software platform. The fourth and final section summarizes the results and outlines perspectives for further research.

1 The Transkribus software platform (https://readcoop.eu/transkribus/) is a tool for the manual and automatic reading and searching of old manuscripts and printed books, regardless of their time of creation, language or script. The key advantage of Transkribus over similar applications is that users can create their own models for automatic text reading. Training such a model is an example of machine learning based on advanced neural networks, in which the model compares photographs of manuscripts with the corresponding letters, words and lines of text in the diplomatic edition. For more information on the technological background and the way this platform works, see [Mühlberger et al. 2019; Rabus 2019a].

2 All his writings are autographs: in some of them he left his name in the preface, afterword or inscription, while others were attributed to him on the basis of paleographic and orthographic analysis and, in some cases, language or illumination. Venclović's Serbian Church Slavonic manuscript corpus is somewhat more extensive (about 6,200 pages) than the one in Serbian vernacular (for more details see section 3.1), and it consists mainly of manuscripts for liturgical purposes. In them, Venclović appears primarily as a copyist and illuminator, and less often as an editor or translator [Павић 1972: 98]. Most of the Serbian Church Slavonic manuscripts are preserved in the SASA Archives (see [Стојановић 1901: 19-21, 34-36, 38-39, 102-120]) and in the Szentendre Archives (see [Синдик et al. 1991: 107-117, 120-121]).

2. On the Historical Dictionary of the Serbian Language and Principles of Digitization

A project entitled the Dictionary of the Serbian Language from the 12th to the 18th Century was established in 2013³ as part of the activities of the Department of Language and Literature of Matica srpska. Since Serbian historical lexicography still lags, to a large extent, behind that of the rest of the Slavic world,⁴ the creation of such a dictionary is one of the primary goals of Serbian diachronic research. The corpus for the dictionary consists of material ranging from the oldest preserved manuscripts in Serbian (end of the 12th century) to the beginning of the pre-standard phase in the development of the Serbian literary language (end of the 18th century).⁵ The upper time limit coincides with the period from which the oldest material for the Dictionary of Serbo-Croatian Literary and Vernacular Language [РСАНУ] dates, thus ensuring continuity in the lexicographic processing of Serbian corpora. The corpus was divided into primary, secondary and tertiary sources for several reasons. First, Serbian vernacular functioned as the complementary and functionally marked member of a dichotomy during the period of diglossia (12th-18th century),⁶ i.e. of polyglossia in the 18th century.⁷ Second, it is well known that the boundaries between literary languages and the vernacular are often not sharply delineated but blurred [Радовановић 2015; Курешевић 2016], which should likewise be taken into consideration. Thus, the primary corpus consists of texts written in Serbian vernacular, which will be lexicographically processed by total excerption. The secondary corpus comprises texts in which both Serbian vernacular and the literary language(s) are attested. These sources will be processed selectively: only Serbian lexis not recorded in the primary sources⁸ will be excerpted. Tertiary sources are editions, which are taken into account only once it has been established that neither the original nor a transcription of it has been preserved and that the edition can be regarded as a valuable historical and linguistic resource [Грковић-Мејџор 2021: 19]. To ensure reliability in the lexicographic processing of the material, the team intends to excerpt exclusively from the original, that is, from its digital copy. In addition to the theoretical concept of the dictionary [Грковић-Мејџор 2021], previous work on the project has proposed theoretical and methodological solutions for various issues of its microstructure (cf. [Савић, Милановић 2021; Курешевић 2021; Павловић 2021; Грковић-Мејџор, Бјелаковић 2021; Бјелаковић 2021]), as well as theoretical principles of digitization [Курешевић et al. 2021]. As part of the practical work on the project, comprehensive registers of the corpora for the dictionary have been compiled. Furthermore, digital copies of most sources have been acquired, and manual digitization has already begun.

3 The project leader is Academician Jasmina Grković-Major (a full member of SASA), and the project team gathers language historians (experts on the history of Serbian vernacular and literary language idioms) from the Republic of Serbia, the Republic of Srpska and the Republic of Montenegro.

4 The only historical dictionary based on Serbian linguistic material is [Даничић 1863-1864], which contains not only Serbian but also Serbian Church Slavonic corpora.

5 It should be emphasized that the possibility remains open for Serbian words recorded in earlier sources to be processed lexicographically as well [Грковић-Мејџор 2021: 14].

6 For more information about diglossia in Serbian medieval literacy see [Грковић-Мејџор 2007: 443-459].

7 For more information about polyglossia in Serbian literacy in the 18th century see [Суботић 2004].

8 For more on the secondary corpus of the Serbian historical dictionary see [Цветковић Теофиловић 2021; Јовић 2021].

Since the materials for the historical dictionary of Serbian were written in different types of Cyrillic and Latin script, the basic principle of digitization was to standardize the paleographic variants of the letters according to the corresponding forms in the Civil script. For this purpose the BeogradPro font was used, since it contains all modern Cyrillic and Latin characters, as well as additional characters for specific old Cyrillic and Latin graphemes. The following principles were applied in digitizing the Cyrillic corpora: 1) punctuation is transferred as in the original; 2) abbreviations are transferred without being resolved, with a titlo mark and/or with the superscript letter written in the exponent at the place where it belongs in the word structure; 3) letter variants are generalized, and only those that have an orthographic function are retained (e.g. the letters e and e are retained, as are o, o and w, while the letter đerv is rendered as ћ); 4) of the superscript marks, the titlo (as a mark of an abbreviated word or of the numerical value of a letter) and the pajerak are transferred in their original place; 5) ligatures are resolved, and only the traditional ligatures are retained: «, ra, k, ro, w and w; 6) phonetic clusters with proclitics and enclitics are separated; 7) the spelling of compounds, as well as of certain adverbs and conjunctions created by grammaticalization, is standardized in favor of a hyphen in texts that are not consistent in this respect; 8) the beginning of a line is marked with a vertical line followed by the number of the line in the exponent, and the end of the sheet is marked with a double vertical line followed by the number of the sheet/page in the exponent (more details in [Курешевић et al. 2021]).
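Principle 8 above is mechanical enough to be illustrated in code. The following Python sketch is only an illustration under stated assumptions and is not the project's actual tooling: the function names `to_superscript` and `mark_page` are hypothetical, the "exponent" numbers are approximated with Unicode superscript characters rather than with font-level superscript formatting, and the sample lines are invented placeholders, not manuscript text.

```python
# Hypothetical helpers illustrating principle 8 (line and sheet markers).
# Assumption: "exponent" numbers are approximated with Unicode superscripts.
_SUPERSCRIPT = str.maketrans("0123456789ab", "⁰¹²³⁴⁵⁶⁷⁸⁹ᵃᵇ")


def to_superscript(label: str) -> str:
    """Render a line number or sheet label (e.g. '90b') in superscript characters."""
    return label.translate(_SUPERSCRIPT)


def mark_page(lines: list[str], sheet_label: str) -> str:
    """Prefix every line with '|' plus its number; close the sheet with '||' plus its label."""
    marked = [
        f"|{to_superscript(str(number))} {text}"
        for number, text in enumerate(lines, start=1)
    ]
    return " ".join(marked) + f" ||{to_superscript(sheet_label)}"


if __name__ == "__main__":
    # Placeholder text only; real input would be the transcribed manuscript lines.
    print(mark_page(["first line", "second line"], "90b"))
```

Called on the two placeholder lines with the sheet label `90b`, the sketch returns `|¹ first line |² second line ||⁹⁰ᵇ`. Keeping the markers inline preserves the running text for lexicographic excerption while still allowing every word to be traced back to its line and sheet.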

Given the chronological span of the historical dictionary of Serbian (12th-18th century) and the volume of its corpus, digitization is currently the primary task in the realization of the project. How demanding this process is, especially with limited human and financial resources, is illustrated by the fact that from 2017 until today about 500,000 words (mostly from business and legal documents and literary works) have been digitized, which is only a small fraction of the entire corpus.⁹ If the process continues in the traditional way and at this pace, it is likely to take decades rather than years to complete. Including automatic text recognition technology in the digitization process could significantly improve and speed up work on the historical dictionary. The Transkribus software platform is suitable for this endeavor for several reasons.¹⁰ First, the software has a fairly simple user interface. Second, demanding computational tasks are performed on the server, so the user does not need special computer equipment. Finally, starting from version 1.18.0, Transkribus has allowed the training and recognition of textual tags (including text styles such as bold, italic, superscript, etc.) through the Include Properties option. This last reason is particularly important since, in accordance with the aforementioned principles of transferring material for the dictionary into electronic form, superscript letters and titlo marks are transferred by raising them to an exponent (Tag as a superscript).

9 At this moment, it is not possible to provide even an approximate estimate of the size of the corpus in number of words.

10 Transkribus is not the only software platform for automatic text recognition. At Université Paris Sciences et Lettres, the open-access software platform eScriptorium was developed within the Scripta-PSL project; it is currently most widely used for the automatic recognition of Hebrew, Syriac and Arabic manuscripts. For more on the project and the platform itself, see https://escripta.hypotheses.org/, as well as [Kiesling et al. 2019].

3. Creating and Evaluating Models for Automatic Text Recognition of Venclović's Manuscripts in Serbian Vernacular

3.1. Reflecting upon Venclović's Legacy Written in Serbian Vernacular

Venclović's manuscripts in Serbian vernacular were selected for investigating whether the Transkribus software platform can be included in the digitization of the historical dictionary corpus, as they are the most extensive (about 4,400 pages) and most important primary source for a dictionary of the 18th-century language. The advantage of automatic digitization by means of artificial intelligence and machine learning over traditional manual digitization is especially evident with voluminous manuscripts such as these. Manual digitization requires enormous human, temporal and financial resources. It is not surprising, therefore, that even though Venclović's manuscripts were discovered in the second half of the 19th century,¹¹ they have not yet received a complete critical edition. In Serbian vernacular, Venclović composed texts directly addressed to Orthodox believers: sermons, letters, and lessons. In several places he explains the choice of language by the need for his exposition to be understandable (according to [Павић 1972: 120-121]). This part of Venclović's written legacy includes: 1) Поученија и слова разлика (САНУ 94 (271), 1732); 2) Мач духовны I (САНУ 92 (267), 1733/34); 3) Мач духовны II (САНУ 93 (268), 1733/34); 4) Великопосник (САНУ 97 (136), 1740/41); 5) Слова изабрана (САНУ 101 (137), 1743); 6) Пентикости (САНУ 98 (272), 1743); 7) Житија, слова и поуке (САНУ 84 (270), 1744/45); 8) Поучение изабраноје I (САНУ 99 (139), 1745); 9) Поучение изабраноје II (САНУ 100 (269), 1746).¹²

3.2. Creation and Quantitative Evaluation of the Model

The initial methodological problem was that we had neither high-quality digital copies of any of Venclović's manuscripts written in Serbian vernacular nor transcripts that could be used to train models for automatic text recognition. By courtesy of the SASA Archives, digital copies of the first 100 pages of the manuscript Слова изабрана (САНУ 101 (137)) (hereinafter САНУ 137) were made available to us. This manuscript was chosen because it is one of Venclović's most voluminous manuscripts in Serbian vernacular (745 pages in total), with a very neat and uniform ductus throughout. The creation of a model for automatic text recognition started with the manual digitization of the first 35 pages of the manuscript in Transkribus. This gave us the minimum amount of Ground Truth data¹³ (about 15,000 words) necessary for training the model.¹⁴ During the process, we adhered to the principles of digitizing Cyrillic manuscripts for the historical dictionary specified in section 2, except in the case of compounds and certain conjunctions and adverbs. These were always transferred as one word (without a hyphen), since Venclović's texts belong to a period when the grammaticalization of the words in question had already been completed.

11 Venclović's manuscripts reached the SASA Archives in 1870 thanks to Gavrilo Vitković, who was engaged in collecting antiquities in southern and central Hungary [Синдик et al. 1991: 3].

12 All the books mentioned were created in the parishes of Komárno and Győr. They were described for the first time in [Стојановић 1901: 42-51, 84-171]. Based on the analysis of watermarks, M. Grozdanović-Pajić [1992] offered more precise or slightly different dates of origin for many of them, which we adopt in this paper. Although the degree of Venclović's originality is also open to question here, since these are to a considerable extent adaptations/translations [Павић 1972: 243-246; Трифуновић 2009: 68], these manuscripts are a very important resource for the study of the history of the Serbian language [Ивић 2014: 112-113].

13 In machine learning, the term Ground Truth data refers to completely accurate data used to train the model. In our case, these are exact transcripts of digital photographs of the manuscript. For more details on this term, see the Transkribus Glossary at https://readcoop.eu/glossary/ground-truth/.

14 The minimum amount of data necessary to train a model for manuscript recognition is about 15,000 words, while training a model for recognizing printed books requires much less data (about 5,000 words) [Mühlberger et al. 2019: 959].

The parameters and performance of the first version of the model, named Venclović 0.1, are shown in the following table.

Table 1. Parameters and performance of the Venclović 0.1 model

| Engine¹⁵ | Word count on Train Set | Word count on Validation Set | Number of epochs¹⁶ | CER on Test Set | CER on Validation Set¹⁷ |
|---|---|---|---|---|---|
| CITlab HTR+ | 15,806 | 717 | 50 | 0.57% | 6.87% |

To continue the transcription process, we used the Venclović 0.1 model to automatically digitize the next 35 pages of the САНУ 137 manuscript. After manually correcting the automatically obtained transcripts, we had twice as much Ground Truth data at our disposal for training the second version of the model. The parameters and performance of this second version, entitled Venclović 0.2, are displayed in the following table.

Table 2. Parameters and performance of the Venclović 0.2 model

| Engine | Word count on Train Set | Word count on Validation Set | Number of epochs | CER on Test Set | CER on Validation Set |
|---|---|---|---|---|---|
| CITlab HTR+ | 32,039 | 1,675 | 50 | 1.39% | 4.87% |

We digitized the remaining 30 pages using the Venclović 0.2 model. After manually correcting the transcripts, we trained the Venclović 0.3 model, whose parameters and performance are presented in the following table.

15 Users of the Transkribus software platform have two engines for model training and automatic text recognition at their disposal: CITlab HTR+ and PyLaia. Training a model on the same material with the different engines yields almost identical results, which our research also showed. The only advantage of the PyLaia engine is that it allows certain changes to its structure and is thus suitable for adaptation to the specific needs of users who are familiar with the technical aspects of machine learning. For more detailed information see the Transkribus Glossary at https://readcoop.eu/glossary/htr-plus/ and https://readcoop.eu/glossary/pylaia/.

16 In machine learning, the term epoch stands for "one complete presentation of the data set to be learned to a learning machine" [Burlacu, Rabus 2021: 1].

17 In all the models trained to recognize Venclović's manuscripts described in this paper, the amount of data in the validation set was 5% of the total training set.

Table 3. Parameters and performance of the Venclović 0.3 model

| Engine | Word count on Train Set | Word count on Validation Set | Number of epochs | CER on Test Set | CER on Validation Set |
|---|---|---|---|---|---|
| CITlab HTR+ | 46,118 | 2,421 | 50 | 2.04% | 4.49% |

The quantitative indicators of the models for the automatic recognition of Venclović's manuscripts can be rated as exceptional, since already in the second version of the model, Venclović 0.2, the percentage of incorrectly recognized characters fell below 5%.¹⁸ In other words, a model trained on only about one tenth of the manuscript can automatically recognize the rest of it with 95% accuracy. The progress in quantitative performance is most pronounced between the first and second versions of the model (cf. CER on Validation Set for Venclović 0.1 and Venclović 0.2). The Venclović 0.3 model suggests that each subsequent version brings only a small further improvement in quantitative performance with new training material. Unfortunately, as we did not possess digital copies of the rest of the manuscript, we were not able to continue the process of automatic recognition and model enhancement. However, on the basis of this experiment, as well as of the experience with training models for the automatic recognition of Serbian Church Slavonic printed books [Polomac 2022a; 2022b], we can reasonably assume that further refinement of the model could bring the percentage of misrecognized characters even lower. That said, pushing this percentage lower still contributes little in practical terms, since the text obtained by automatic recognition must in any case be edited by a competent philologist.¹⁹
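The figures quoted in this paragraph can be cross-checked against Tables 1-3 with a few lines of arithmetic. The Python sketch below is an illustration only and not part of the original study. The word counts and CER values are copied from the tables, and the page counts (745 pages in total, 35 + 35 pages transcribed before training the second model) come from the text. All names are hypothetical, and reading the validation share as a percentage of the combined train and validation word counts is an assumption.

```python
# Values copied from Tables 1-3; every name below is an illustrative assumption.
MODELS = {
    "Venclović 0.1": {"train_words": 15806, "val_words": 717, "cer_val": 6.87},
    "Venclović 0.2": {"train_words": 32039, "val_words": 1675, "cer_val": 4.87},
    "Venclović 0.3": {"train_words": 46118, "val_words": 2421, "cer_val": 4.49},
}
TOTAL_PAGES = 745        # size of the САНУ 137 manuscript, as stated in the text
PAGES_FOR_MODEL_02 = 70  # 35 manually transcribed + 35 corrected pages


def validation_share(train_words: int, val_words: int) -> float:
    """Validation words as a percentage of all Ground Truth words (assumed reading)."""
    return 100 * val_words / (train_words + val_words)


def accuracy(cer_percent: float) -> float:
    """Character accuracy implied by a character error rate given in percent."""
    return 100 - cer_percent


if __name__ == "__main__":
    for name, model in MODELS.items():
        share = validation_share(model["train_words"], model["val_words"])
        print(f"{name}: validation share {share:.2f}%, "
              f"accuracy {accuracy(model['cer_val']):.2f}%")
    manuscript_share = 100 * PAGES_FOR_MODEL_02 / TOTAL_PAGES
    print(f"Share of the manuscript behind Venclović 0.2: {manuscript_share:.1f}%")
```

Under these assumptions the validation share comes out at roughly 4.3%, 5.0% and 5.0% for the three models, and the 70 pages behind Venclović 0.2 amount to about 9.4% of the manuscript, which is consistent with the "one tenth" and "95% accuracy" statements above.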

All three versions of Venclovic's manuscripts recognition model were trained in a fifty-epoch process. The dependency of the training results expressed by the percentage of incorrectly recognized characters and the number of epochs for training the model can be shown for each model using the learning curve. A typical learning curve can be seen in the example of the Venclovic 0.3. model in Figure 1.

18 According to [Mühlberger et al. 2019: 962], a result can be considered exceptional if the percentage of incorrectly recognized characters during automatic manuscript recognition is less than 5%. In the case of printed books, this percentage can be lower and amount to about 1-2%. Cf. our results on the material of Serbian Church Slavonic printed books in [Polomac 2022a; 2022b].

19 The paper by J. Besters-Dilger and A. Rabus [2021] presents a very interesting thesis: a large amount of material obtained by automatic text recognition and tagging can be used for quantitative linguistic research even without manual correction of the text.

Figure 1. The learning curve of the Venclovic 0.3. model

The learning curve demonstrates that, in the process of machine learning, the model achieves the most significant progress during the first few epochs of training. Subsequently, the percentage of incorrectly recognized characters stabilizes very quickly at a certain level (after only ten epochs). By the end of the training process, it decreases only slightly, which means that increasing the number of epochs would not necessarily lead to a lower percentage of misrecognized characters.
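The observation that the error rate levels off after the first epochs can be checked numerically on any learning curve. The following is a minimal sketch, assuming the per-epoch CER values on the validation set are available as a list; the function name `plateau_epoch`, the tolerance, and the sample values are hypothetical illustrations and are not taken from Figure 1.

```python
from typing import Sequence


def plateau_epoch(cer_per_epoch: Sequence[float], tolerance: float = 0.1) -> int:
    """Return the 1-based epoch at which the final flat stretch of the curve begins.

    Walking backwards from the last epoch, an epoch belongs to the flat stretch
    as long as its CER is within `tolerance` percentage points of the best CER
    reached at that epoch or later.
    """
    if not cer_per_epoch:
        raise ValueError("the learning curve must contain at least one epoch")
    suffix_min = float("inf")
    result = len(cer_per_epoch)
    for epoch in range(len(cer_per_epoch), 0, -1):
        value = cer_per_epoch[epoch - 1]
        suffix_min = min(suffix_min, value)
        if value - suffix_min <= tolerance:
            result = epoch
        else:
            break
    return result


# Hypothetical CER values (in %) for eight epochs, for illustration only.
sample_curve = [20.0, 9.0, 6.0, 5.0, 4.7, 4.55, 4.5, 4.5]
print(plateau_epoch(sample_curve))
```

With a curve of this shape, training beyond the returned epoch changes the CER by less than the chosen tolerance, which is the situation described above for the fifty-epoch training runs.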

3.3. The Qualitative Analysis of the Venclovic 0.3. Model

Previous research (see, e.g., [Rabus 2019b: 13]) has shown that the percentage of incorrectly recognized characters is not always a realistic indicator of model quality. Since the automatic statistical calculation of the percentage of incorrectly recognized characters takes into account all interventions in the text (e.g. insertion, deletion or replacement of characters, including spaces and punctuation marks),20 qualitative indicators of the model's success are often better than quantitative ones. For the qualitative analysis of the Venclovic 0.3. model, a part of sheet 90b of САНУ 137 is displayed alongside the automatically digitized text in the following figure.

20 For more precise data on the method of calculating the percentage of incorrectly recognized characters see Transkribus Glossary at https://readcoop.eu/glossary/character-error-rate-cer/.
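The calculation described above, in which every inserted, deleted or replaced character (including spaces and punctuation marks) counts as an error, corresponds to the common definition of the character error rate as an edit distance divided by the length of the reference text. The following is a minimal sketch of that common definition; it is an assumption that it matches the Transkribus implementation in every detail, and the function names are our own.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance relative to the reference length."""
    if not reference:
        raise ValueError("the reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# A single wrongly inserted space counts as one error, like any other character.
print(cer("зановетани", "занове тани"))
```

Because spaces are treated like any other character, a transcription in which every letter is correct but the word boundaries are misplaced still receives a non-zero CER, which is one reason why the quantitative and qualitative assessments can diverge.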

1-1 слово .'а.
1-2 салимск8 съборнн8 апслк8 CT8 цркв8, неточны законъ .
1-3 когано се прозивнлк: по прсэрочаском сказшваню мти свшм
1-4 црква, и бжТе оу нкои боравилк:н1е . како нам апелъ пава-
1-5 оуказ8е . еднна вЪра, еднно крнщенк: кщанъ бъ, а не ви-
1-6 ше що . ето те кщине вЪре нърсод, оузе и хс . на свош раме-
1-7 на, пакъ з дрнвеным поделоном с чегним крето 8здиже тамо
1-8 до rophHiera живота . и 81еднно с4 аггли хр'сЬане здр8жТи .
1-9 коще кано 8 помрнчины еъ свЪЬом ев8даа попровалТа по-
1-10 С8рнд8цы паклены и шнпилш поиска . диже 8вшсъ каа ве-
1-11 к8 ламнбад8 rop8ti8 зн бжетвом CBi»h8 на крхъ свое тЪло .

Figure 2. САНУ 137, part of sheet 90b and the automatically digitized text

A comparison of the photo and the automatically digitized text demonstrates that the Venclovic 0.3. model failed to recognize the superscript textual tag, which, in accordance with the principles of digitizing corpora for the historical dictionary of Serbian, marks a superscript letter. The automatically recognized text was corrected in two steps: in the first step, all character recognition errors were corrected; in the second, the superscript textual tags marking superscript letters were corrected. The percentage of misrecognized characters was identical in both versions (2.69% in total). This indicates that Transkribus not only failed to recognize the superscript textual tag, but also did not take its correction into account when calculating misrecognized characters. This fact sheds a different light on the quantitative indicators of the model presented in Tables 1-3: the percentage of incorrectly recognized characters on the validation set does not include the correction of the superscript textual tag. In practice, this means that the quantitative indicators of the model are slightly worse because of the specific way of marking superscript letters when digitizing the dictionary material.
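Why the two corrected versions receive the same percentage can be illustrated with a small sketch. It assumes, purely for illustration, that a superscript letter is marked with an inline `<sup>…</sup>` tag; this markup is hypothetical and is not the way Transkribus stores textual tags. If the comparison is made on the plain character sequence, the tagged and untagged versions are indistinguishable.

```python
import re

# Hypothetical inline markup for a superscript letter (an assumption for this sketch).
SUPERSCRIPT_TAG = re.compile(r"</?sup>")


def strip_superscript_tags(text: str) -> str:
    """Remove the hypothetical superscript markup, keeping the letters themselves."""
    return SUPERSCRIPT_TAG.sub("", text)


first_correction = "вам"               # characters corrected, tag not yet added
second_correction = "ва<sup>м</sup>"   # same characters, superscript now marked

# On the plain character level both versions are the same string,
# so a character-based error measure cannot tell them apart.
print(strip_superscript_tags(second_correction) == first_correction)
```

Any error measure computed only on the character sequence therefore reports the same value before and after the tags are corrected, which is consistent with the identical 2.69% reported for both versions.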

If we return to our example of a part of sheet 90b and analyze individual errors in automatic text recognition, we can conclude that, most frequently, the Venclovic 0.3. model makes mistakes in recognizing superscript letters and a titlo mark: so instead of салимск8 2, црквам 4, пава"" 4, крсто" 7, крстъ 11 the model incorrectly reads салимск8 2, црква 4, пава- 4, крсто 7, крхъ 11. In two examples, errors were recorded in the recognition of spaces between words: instead of по провала 8, по с8р"д8цы 8/9 the model incorrectly reads попровал1а 8, пос8р"д8цы 8/9. Other errors refer to superscript letters and titlo marks that are recognized but not raised to a superscript: instead of ап"лк8 2, источны 2, проро-часком 3, свшм 3, нам 4, аплъ 4, нърмд 6, др"веным 7, подслоном 7, ч^тни" 7, там°" 7, хрсТ1ане 8, свЪЬо" 9, бж^твом 10 the model reads апслк8 2, источны 2, проро-часком 3, свшм 3, нам 4, апслъ 4, нърмд 6, др"веным 7, подслоном 7, чстним 7, тамо" 7, хрсйане 8, свЪЬом 9, бжством 10. Taking the aforementioned errors into consideration, it seems that the Venclovic 0.3. model can be rated as excellent in the qualitative sense, as well. We hope that the problem of not recognizing textual tags will be solved in the future by technical improvement of Transkribus. However, even if it stays the same, the process of digitizing texts for the historical dictionary of the Serbian language will be accelerated significantly.

3.4. Application of the Venclovic 0.3. Model on Other Manuscripts in Serbian Vernacular

In continuation of the research, we hypothesized that the Venclovic 0.3. model, trained on САНУ 137 material, would be able to successfully recognize Venclovic's other manuscripts in Serbian vernacular. To test the hypothesis, we designed an experiment in which the Venclovic 0.3. model was applied to the first ten pages of the manuscripts Великопосник (САНУ 97 (136) from the year 1740/41) (hereinafter САНУ 136) and Поученија изабрана I (САНУ 99 (139) from 1745) (hereinafter САНУ 139). These two manuscripts were chosen for the experiment because they were written in the same style as САНУ 137, and because high-quality digital recordings of them already existed in the SASA Archives.

As can be seen in Graph 1, the quantitative performance of the Venclovic 0.3. model on the manuscripts САНУ 136 and 139 is only negligibly lower than on САНУ 137. Excluding the superscript letters and titlo marks in the superscript, the percentage of misrecognized characters on САНУ 136 is 5.95%, and on САНУ 139 it is even lower, at 4.90%.

Graph 1. Application of the Venclovic 0.3. model on manuscripts САНУ 136 and САНУ 139

For the qualitative evaluation of the result, a comparative view of a part of sheet 9а of the САНУ 136 manuscript is presented in the following figure.

Figure 3. САНУ 136, part of sheet 9а and the automatically recognized text

Along with the errors in recognizing superscripts (instead of ком 2, др«гом 3, х^во 3, х^ва 5, вас 5, аПли 5 the model outputs the incorrect ком 2, драгом 3, хсво - 3, хсва 5, вас 5, апсли 5), the model makes errors in recognizing superscript letters and titlo marks, as well as spaces between words, more frequently than in the case of the manuscript САНУ 137. Thus, instead of провъзгла^на 1, несносным 2, браЬо 4, а и нгеговы 5, црско 6, ва" 7, the model incorrectly reads провъ зглаш 1, не сносны 2, браЬом 4, м"нгеговым 5, цр"ском 6, вам 7; likewise, instead of провъзгла^ш 1, несносны" 2, х'во 3, нисте 4 the model incorrectly reads провъ зглаш 1, не сносны 2, хсво- 3, ни сте 4. Unlike in САНУ 137, errors in recognizing the pajerak mark are also recorded here: instead of с нгега пошз"ли 4, а и нгеговы 5, мб"новлгенш 6, мб"гавити 7 the model incorrectly reads с" нгега пошзли 4, м"нгеговым 5, мб"нов"лгенш 6, мбгавити 7. Errors in recognizing letters in the examples др« - 2, а и нгеговы 5 (incorrect зр« - 2, м"нгего-вым 5) can be explained by the poor legibility of the digital recording.

For the qualitative evaluation of the results of the efficiency of the model on the САНУ 139 manuscript, a comparative view of a part of sheet 1b and the automatically recognized text is displayed in the following figure.

1-2 безн частна гаарезнлива и многр-Ьш^ива, ама
1-3 ло добина живота . тога радъ и смр'тном бысмо
1-4 сосйгкны . и врагй велико нашемъ д8шнмани
1-5 Н8, под р«ке м8 оу владя с послови м« своим занове
1-6 тани, подлож"ни се дирин"ценю 8чини смо лаком"о .
1-7 и задил ове намъ мало врЧше ово свЪтске полаа-
1-8 не слаасти, и сваке желиве наше волк . пакъ е-

Figure 4. САНУ 139, part of sheet 1b and the automatically recognized text

The comparison in Figure 4 shows that the Venclovic 0.3. model most frequently makes errors in recognizing spaces between words: instead of по«чен1е 1, безчастна 2, а малодоб"на 2/3, д«ш"манин« 4/5, зановетани 5/6, «чинисмо 6, маловр"сне 7, овосвЪ'ске 7 the model incorrectly reads по «чеше 1, без" частна 2, ама ло доб"на 2/3, д«ш"манин« 4/5, занове тани 5/6, «чини смо 6, мало вр"сне 7, ово свЪтске 7. Digitizing superscripts is an issue in the following examples: instead of смр"тно 3, велико" 4, свои 5, полаат- 7, зарР1ди 7 the model incorrectly outputs смр"тном 3, велико 4, своим 5, полаа- 7, задил 7. Within this category, as in the case of САНУ 137, there are a few examples with an unrecognized superscript textual tag: instead of бысмо 3, под 5, подлож" -ни 6, лаком°" 6, овосвЪ'ске 7 the model outputs бысм'о 3, под 5, подлож"ни 6, лакомо 6, ово свЪтске 7. Errors in the recognition of the pajerak mark appear in only two examples: instead of безчастна 2, диринценю 6 the model incorrectly outputs без" частна 2, дирин"ценю 6.
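The distinction drawn above between spacing errors and errors in the characters themselves can be approximated automatically. The following is a minimal sketch, assuming pairs of expected and recognized forms are available as plain strings; the function `classify_error` and its category labels are hypothetical and do not reproduce the manual philological analysis, which also separates superscripts, titlo marks and the pajerak.

```python
def classify_error(expected: str, recognized: str) -> str:
    """Roughly classify a recognition result for one expected/recognized pair.

    Returns "correct" when the strings match, "spacing" when they differ only
    in the placement of spaces, and "character" in every other case.
    """
    if expected == recognized:
        return "correct"
    if expected.replace(" ", "") == recognized.replace(" ", ""):
        return "spacing"
    return "character"


# Two pairs taken from the examples discussed above.
pairs = [("зановетани", "занове тани"), ("велико", "великом")]
for expected, recognized in pairs:
    print(expected, recognized, classify_error(expected, recognized))
```

A count of the resulting labels over a whole page gives a quick first impression of which error type dominates before a philologist examines the individual cases.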

4. Concluding Remarks and Future Research Perspectives

The results of the research presented above point to the conclusion that, by following the principles of artificial intelligence and machine learning and using the Transkribus software platform, the digitization of Cyrillic manuscripts for the purpose of creating an electronic corpus for the historical dictionary of Serbian can be significantly accelerated. Using the example of Gavril Stefanovic Venclovic's manuscripts written in the Serbian vernacular of the 18th century, the study shows that the transcription of voluminous manuscripts can be accelerated by creating specific models for automatic text recognition. The Venclovic 0.3. model was created on the material of 100 pages of the voluminous САНУ 137 manuscript (774 pages in total), with an acceptable percentage of misrecognized characters of about 4-5%. Using the same model, the rest of САНУ 137 can likewise be digitized in a significantly shorter time and with significantly fewer human and financial resources, provided, of course, that it is followed by final proofreading and editing by a competent philologist. The Venclovic 0.3. model can also be used fairly successfully for the automatic recognition of Venclovic's other manuscripts in Serbian vernacular written in a similar style. The percentage of incorrectly recognized characters on the САНУ 136 and 139 manuscripts was around 5-6%, which is only slightly higher than on the САНУ 137 manuscript, on which the model was trained. The qualitative analysis of the most common errors in automatic recognition leads to the conclusion that the model's most frequent problems pertain to recognizing superscript letters, titlo marks and spaces between words. Errors in the recognition of the pajerak mark are much less frequent, and errors in the recognition of regular letters occur only exceptionally.

The recognition of the superscript textual tag, which is used to mark superscript letters in accordance with the principles of digitizing dictionary materials, is fairly problematic. Although Transkribus has offered the possibility of training and recognizing textual tags since version 1.18.0., our research has shown that this option is still not fully applicable.21 Transkribus does not read textual tags during initial recognition, but only when a version of the digitized text already exists in Transkribus. In neither case does it take textual tags into account when calculating misrecognized characters. Therefore, the actual performance of the Venclovic 0.3. model is slightly worse than the percentage of incorrectly recognized characters suggests, but it is still excellent, especially compared to traditional manual digitizing. This problem could be overcome in the near future either by further improving the technical performance of Transkribus or by minimally modifying the principles of digitization and improving the font, so that superscript letters could be marked with special characters compliant with the Unicode standard. Once this problem is solved, the manually digitized material obtained so far in the project of digitizing the historical dictionary of Serbian could, after prior preparation, be imported into Transkribus and used for training specific and generic models to automatically recognize other Cyrillic manuscripts.
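One way to picture the proposal of marking superscript letters with Unicode-compliant characters is sketched below. It assumes that the combining letters of the Unicode block Cyrillic Extended-A (U+2DE0 to U+2DFF) would be suitable candidates and that superscripts are currently marked with a hypothetical inline `<sup>…</sup>` tag; neither assumption is taken from the digitization principles of the project, and the function names are our own.

```python
import re
import unicodedata

# Hypothetical inline markup for one superscript letter (an assumption for this sketch).
SUPERSCRIPT = re.compile(r"<sup>(.)</sup>")


def to_combining(letter: str) -> str:
    """Map a Cyrillic small letter to its combining counterpart, if Unicode has one."""
    name = unicodedata.name(letter)
    prefix = "CYRILLIC SMALL LETTER "
    if not name.startswith(prefix):
        raise ValueError(f"not a Cyrillic small letter: {letter!r}")
    # unicodedata.lookup raises KeyError when no combining form with this name exists.
    return unicodedata.lookup("COMBINING CYRILLIC LETTER " + name[len(prefix):])


def encode_superscripts(text: str) -> str:
    """Replace every tagged superscript letter with a combining character."""
    return SUPERSCRIPT.sub(lambda match: to_combining(match.group(1)), text)


encoded = encode_superscripts("ва<sup>м</sup>")
print([f"U+{ord(char):04X}" for char in encoded])
```

With such an encoding the superscript becomes an ordinary character of the transcription, so it would be learned by the recognition model and counted in the character error rate like any other character, at the cost of requiring a font that renders the combining letters correctly.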

The advantage of automatic text recognition over the traditional process is especially evident in the possibility of constantly improving the performance of specific and generic models as the transcription process progresses and the amount of digitized text that can be used to train a new version of the model increases. In order to further improve the model for automatic text recognition of Venclovic's manuscripts written in Serbian vernacular, it seems necessary to completely digitize the manuscripts within the SASA Archives, and to establish cooperation with scientific and cultural institutions (SASA and Matica srpska), which could potentially lead a dedicated project for preparing and publishing a critical edition of Venclovic's manuscripts in Serbian vernacular. With the development of technology for automatic text recognition, we are approaching not only the critical edition of Venclovic's manuscripts, but also the possibility of creating a digital edition and a special electronic corpus.

21 Transkribus software, version 1.20.0. was used for all the experiments described in the paper.

Bibliography

Primary Sources

САНУ 136

Венцловић Стефановић Г. [Великопосник], Архив Српске академије наука и уметности, сигнатура САНУ 97 (136).

САНУ 137

Венцловић Стефановић Г. [Слова изабрана], Архив Српске академије наука и уметности, Београд, сигнатура САНУ 101 (137).

САНУ 139

Венцловић Стефановић Г. [Поученија изабрана, први део], Архив Српске академије наука и уметности, Београд, сигнатура САНУ 99 (139).

Literature

Besters-Dilger, Rabus 2021

Besters-Dilger J., Rabus A., Neural Morphological Tagging for Slavic: Strengths and Weaknesses, Scripta&e-Scripta, 2021, 21, 79-92.

Burlacu, Rabus 2021

Burlacu C., Rabus A., Digitising (Romanian) Cyrillic using Transkribus: new perspectives, Diacronia, 2021, 14, 1-9.

Kiesling et al. 2019

Kiesling B., Tissot R., Stokes P., Stökl Ben Ezra D., eScriptorium: An Open Source Platform for Historical Document Analysis, 2019 International Conference on Document Analysis and Recognition Workshop (ICDARW), Sydney, 2019, 19-24.

Mühlberger et al. 2019

Mühlberger G., Seaward L., Terras M., Oliveira Ares S., Bosch V., Bryan M., Colutto S., Déjean H., Diem M., Fiel S., Gatos B., Greinoecker A., Grüning T., Hackl G., Haukkovaara V., Heyer G., Hirvonen L., Hodel T., Jokinen M., Kahle P., Kallio M., Kaplan F., Kleber F., Labahn R., Lang M., Laube S., Leifert G., Louloudis G., McNicholl R., Meunier J., Michael J., Mühlbauer E., Philipp N., Pratikakis J., Puigcerver Pérez J., Putz H., Retsinas G., Romero V., Sablatnig R., Sánchez J., Schofield P., Sfikas G., Sieber C., Stamatopoulos N., Strauss T., Terbul T., Toselli A., Ulreich B., Villegas M., Vidal E., Walcher J., Wiedermann M., Wurster H., Zagoris K., Transforming scholarship in the archives through handwritten text recognition, Journal of Documentation, 2019, 75/5, 954-976.

Polomac, Lutovac Kaznovac 2021

Polomac V., Lutovac Kaznovac T., Automatic Recognition of Serbian Medieval Manuscripts by Applying the Transkribus Software Platform: Current State and Future Perspectives, Зборник Матице српске за филологију и лингвистику, 2021, LXIV/2, 7-26.

Polomac 2022a

Polomac V., Serbian Early Printed Books from Venice. Creating Models for Automatic Text Recognition using Transkribus, Scripta&e-Scripta, 2022, 22, 11-29.

Polomac 2022b

Polomac V., Serbian Early Printed Books. Towards Generic Model for Automatic Text Recognition using Transkribus, D. Fišer, T. Erjavec, eds., Proceedings of the Conference on Language Technologies and Digital Humanities, Ljubljana, 2022, 154-161.

Rabus 2019a

Rabus A., Recognizing Handwritten Text in Slavic Manuscripts: a Neural-Network Approach using Transkribus, Scripta&e-Scripta, 2019, 19, 9-32.

Rabus 2019b

Rabus A., Training generic models for Handwritten Text Recognition using Transkribus: opportunities and pitfalls, Proceedings of the Dark Archives Conference, Oxford, 2019, in print.

Васиљев 1996

Васиљев Љ., Буквар из 1717. године – дело Гаврила Стефановића Венцловића, Зборник Матице српске за филологију и лингвистику, 1996, 39/2, 169-184.

Бјелаковић 2021

Бјелаковић И., Предлог микроструктуре историјског речника српског језика, Ј. Грковић-Мејџор, И. Бјелаковић, М. Курешевић, ур., Историјска лексикографија српског језика, Нови Сад, 2021, 387-400.

Грковић-Мејџор 2007

Грковић-Мејџор Ј., Списи из историјске лингвистике, Сремски Карловци, Нови Сад, 2007.

Грковић-Мејџор 2021

Грковић-Мејџор Ј., Ка историјском речнику српског језика, Ј. Грковић-Мејџор, И. Бјелаковић, М. Курешевић, ур., Историјска лексикографија српског језика, Нови Сад, 2021, 11-24.

Грковић-Мејџор, Бјелаковић 2021

Грковић-Мејџор Ј., Бјелаковић И., Дефинисање лексичког значења у историјском речнику српског језика, Ј. Грковић-Мејџор, И. Бјелаковић, М. Курешевић, ур., Историјска лексикографија српског језика, Нови Сад, 2021, 367-386.

Гроздановић-Пајић 1992

Гроздановић-Пајић М., Хартија и водени знаци у Венцловићевим рукописима писаним у Коморану и Ђуру, Сентандрејски зборник, 1992, 2, 177-197.

Даничић 1863-1864

Даничић Ђ., Речник из књижевних старина српских, I-III, У Биограду, 1863-1864.

Ивић 2014

Ивић П., Преглед историје српског језика, Сремски Карловци, Нови Сад, 2014.

Јовић 2021

Јовић Н., Медицински списи као извор за историјски речник српског језика, Ј. Грковић-Мејџор, И. Бјелаковић, М. Курешевић, ур., Историјска лексикографија српског језика, Нови Сад, 2021, 185-198.

Курешевић 2016

Курешевић М., Језик Слова Акира Премудрог из рукописног зборника Народне библиотеке Србије бр. 53, Јужнословенски филолог, 2016, 72/1-2, 105-126.

Курешевић 2021

Курешевић М., Граматичке информације у историјском речнику српског језика: полазни принципи, Ј. Грковић-Мејџор, И. Бјелаковић, М. Курешевић, ур., Историјска лексикографија српског језика, Нови Сад, 2021, 319-345.

Курешевић et al. 2021

Курешевић М., Лутовац Казновац Т., Цолић Јовановић А., Бајић В., Рашчитавање и пренос у електронску форму ћирилске грађе за историјски речник српског језика: недоумице и могућа решења, Ј. Грковић-Мејџор, И. Бјелаковић, М. Курешевић, ур., Историјска лексикографија српског језика, Нови Сад, 2021, 81-113.

Павић 1972

Павић М., Гаврил Стефановић Венцловић, Београд, 1972.

Павловић 2021

Павловић С., Лексикографска обрада граматичких речи у историјским речницима, Ј. Грковић-Мејџор, И. Бјелаковић, М. Курешевић, ур., Историјска лексикографија српског језика, Нови Сад, 2021, 345-366.

Радовановић 2015

Радовановић М., Фази лингвистика, Сремски Карловци, Нови Сад, 2015.

РСАНУ, 1-

Речник српскохрватског књижевног и народног језика, Београд: 1959-.

Савић, Милановић 2021

Савић В., Милановић А., Идентификација и формирање одредница у српском историјском речнику, Ј. Грковић-Мејџор, И. Бјелаковић, М. Курешевић, ур., Историјска лексикографија српског језика, Нови Сад, 2021, 277-318.

Синдик et al. 1991

Синдик Н., Гроздановић-Пајић М., Мано-Зиси К., Опис рукописа и старих штампаних књига Библиотеке Српске православне епархије будимске у Сентандреји, Београд, Нови Сад, 1991.

Стефановић, Јовановић 2013

Стефановић Д., Јовановић Т., Венцловићев сентандрејски буквар: 1717, Будимпешта, Београд, 2013.

Стојановић 1901

Стојановић Љ., Каталог рукописа и старих штампаних књига Српске краљевске академије, Београд, 1901.

Суботић 2004

Суботић Љ., Из историје књижевног језика: питање језика, В. Васић, ур., Предавања из историје језика, Нови Сад, 2004, 142-191.

Трифуновић 2009

Трифуновић Ђ., Стара српска књижевност: основи, Београд, 2009.

Цветковић Теофиловић 2021

Цветковић Теофиловић И., Путописи као извори за израду речника српског језика XII-XVIII века, Ј. Грковић-Мејџор, И. Бјелаковић, М. Курешевић, ур., Историјска лексикографија српског језика, Нови Сад, 2021, 165-184.

References

Besters-Dilger J., Rabus A., Neural Morphological Tagging for Slavic: Strengths and Weaknesses, Scripta&e-Scripta, 2021, 21, 79-92.

Bjelakovic I., Predlog mikrostrukture istorijskog recnika srpskog jezika, Istorijska leksikografija srpskog jezika, Novi Sad, 2021, 387-400.

Burlacu C., Rabus A., Digitising (Romanian) Cyrillic using Transkribus: new perspectives, Diacronia, 2021, 14, 1-9.

Cvetkovic Teofilovic I., Putopisi kao izvori za izradu recnika srpskog jezika XII-XVIII veka, Istorijska leksikografija srpskog jezika, Novi Sad, 2021, 165-184.

Grkovic-Mejdzor J., Spisi iz istorijske lingvistike, Sremski Karlovci, Novi Sad, 2007.

Grkovic-Mejdzor J., Ka istorijskom recniku srpskog jezika, Istorijska leksikografija srpskog jezika, Novi Sad, 2021, 11-24.

Grkovic-Mejdzor J., Bjelakovic I., Definisanje leksickog znacenja u istorijskom recniku srpskog jezika, Istorijska leksikografija srpskog jezika, Novi Sad, 2021, 367-386.

Grozdanovic-Pajic M., Hartija i vodeni znaci u Venclovicevim rukopisima pisanim u Komoranu i Duru, Sentandrejski zbornik, 1992, 2, 177-197.

Ivic P., Pregled istorije srpskog jezika, Sremski Karlovci, Novi Sad, 2014.

Jovic N., Medicinski spisi kao izvor za istorijski recnik srpskog jezika, Istorijska leksikografija srpskog jezika, Novi Sad, 2021, 185-198.

Kiesling B., Tissot R., Stokes P., Stökl Ben Ezra D., eScriptorium: An Open Source Platform for Historical Document Analysis, 2019 International Conference on Document Analysis and Recognition Workshop (ICDARW), Sydney, 2019, 19-24.

Kuresevic M., The Language of the Story of the Sage Ahiquar from Serbian Manuscript No. 53 of the National Library of Serbia, Juznoslovenski filolog, 2016, 72/1-2, 105-126.

Kuresevic M., Gramaticke informacije u istorijskom recniku srpskog jezika: polazni principi, Istorijska leksikografija srpskog jezika, Novi Sad, 2021, 319-345.

Kuresevic M., Lutovac Kaznovac T., Colic Jovanovic A., Bajic V., Rascitavanje i prenos u elektronsku formu cirilske grade za istorijski recnik srpskog jezika: nedoumice i moguca resenja, Istorijska leksikografija srpskog jezika, Novi Sad, 2021, 81-113.

Mühlberger G., Seaward L., Terras M., Oliveira Ares S., Bosch V., Bryan M., Colutto S., Déjean H., Diem M., Fiel S., Gatos B., Greinoecker A., Grüning T., Hackl G., Haukkovaara V., Heyer G., Hirvonen L., Hodel T., Jokinen M., Kahle P., Kallio M., Kaplan F., Kleber F., Labahn R., Lang M., Laube S., Leifert G., Louloudis G., McNicholl R., Meunier J., Michael J., Mühlbauer E., Philipp N., Pratikakis J., Puigcerver Pérez J., Putz H., Retsinas G., Romero V., Sablatnig R., Sánchez J., Schofield P., Sfikas G., Sieber C., Stamatopoulos N., Strauss T., Terbul T., Toselli A., Ulreich B., Villegas M., Vidal E., Walcher J., Wiedermann M., Wurster H., Zagoris K., Transforming scholarship in the archives through handwritten text recognition, Journal of Documentation, 2019, 75/5, 954-976.

Pavic M., Gavril Stefanovic Venclovic, Beograd, 1972.

Pavlovic S., Leksikografska obrada gramatickih reci u istorijskim recnicima, Istorijska leksikografija srpskog jezika, Novi Sad, 2021, 345-366.

Polomac V., Lutovac Kaznovac T., Automatic Recognition of Serbian Medieval Manuscripts by Applying the Transkribus Software Platform: Current State and Future Perspectives, Matica Srpska Journal of Philology and Linguistics, 2021, LXIV/2, 7-26.

Polomac V., Serbian Early Printed Books from Venice. Creating Models for Automatic Text Recognition using Transkribus, Scripta&e-Scripta, 2022, 22, 11-29.

Polomac V., Serbian Early Printed Books. Towards Generic Model for Automatic Text Recognition using Transkribus, D. Fišer, T. Erjavec, eds., Proceedings of the Conference on Language Technologies and Digital Humanities, Ljubljana, 2022, 154-161.

Rabus A., Recognizing Handwritten Text in Slavic Manuscripts: a Neural-Network Approach using Transkribus, Scripta&e-Scripta, 2019, 19, 9-32.

Radovanovic M., Fazi lingvistika, Sremski Karlovci, Novi Sad, 2015.

Savic V., Milanovic A., Identifikacija i formiranje odrednica u srpskom istorijskom recniku, Istorijska leksikografija srpskog jezika, Novi Sad, 2021, 277-318.

Sindik N., Grozdanovic-Pajic M., Mano-Zisi K., Opis rukopisa i starih stampanih knjiga Biblioteke Srpske pravoslavne eparhije budimske u Sentandreji, Beograd, Novi Sad, 1991.

Stefanovic D., Jovanovic T., Venclovicev sentandrejski bukvar: 1717, Budimpesta, Beograd, 2013.

Subotic Lj., Iz istorije knjizevnog jezika: pitanje jezika, Predavanja iz istorije jezika, Novi Sad, 2004, 142-191.

Trifunovic D., Stara srpska knjizevnost: osnovi, Beograd, 2009.

Vasiljev Lj., Bukvar iz 1717. godine – delo Gavrila Stefanovica Venclovica, Matica Srpska Journal of Philology and Linguistics, 1996, 39/2, 169-184.

Vladimir Polomac, PhD, redovni profesor Univerzitet u Kragujevcu Filolosko-umetnicki fakultet Jovana Cvijica bb, 34000 Kragujevac Srbija / Serbia v.polomac@filum.kg.ac.rs

Marina Kuresevic, PhD, redovni profesor Univerzitet u Novom Sadu Filozofski fakultet Zorana Dindica 2, 21 000 Novi Sad Srbija / Serbia

marina.kuresevic@gmail.com

Isidora Bjelakovic, PhD, redovni profesor Univerzitet u Novom Sadu Filozofski fakultet Zorana Dindica 2, 21 000 Novi Sad Srbija / Serbia

isidora.bjelakovic@gmail.com

Aleksandra Colic Jovanovic, PhD, asistent sa doktoratom

Univerzitet u Novom Sadu

Filozofski fakultet

Zorana Dindica 2, 21 000 Novi Sad

Srbija / Serbia

aleksandra.colic@ff.uns.ac.rs

Sanja Petrovic, doktorand Univerzitet u Novom Sadu Filozofski fakultet Zorana Dindica 2, 21 000 Novi Sad Srbija / Serbia

sanja.lj.petrovic@gmail.com

Received September 27, 2022
