Gos 2: a new reference corpus of spoken Slovenian

Darinka Verdonik, Kaja Dobrovoljc, Tomaž Erjavec, Nikola Ljubešić, 2024

This paper introduces a new version of the Gos reference corpus of spoken Slovenian, which was recently extended to more than double its original size (300 hours, 2.4 million words) by adding speech recordings and transcriptions from two related initiatives: the Gos VideoLectures corpus of public academic speech and the Artur speech recognition database. We first present the criteria guiding the balanced selection of the newly added data and the challenges encountered when merging language resources with divergent designs. We then describe the other major enhancements of the new Gos corpus, such as improvements in lemmatization and morphosyntactic annotation, word-level speech alignment, a new XML schema, and the development of a specialized online concordancer.

Authors:: Darinka Verdonik, Kaja Dobrovoljc, Tomaž Erjavec, Nikola Ljubešić
Year:: 2024
Publishers:: ELRA Language Resources Association (ELRA); International Committee on Computational Linguistics
Source:: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) : main conference proceedings : 20-25 May, 2024, Torino, Italia
