Research Publications
Gos 2: a new reference corpus of spoken Slovenian
Darinka Verdonik, Kaja Dobrovoljc, Tomaž Erjavec, Nikola Ljubešić, 2024
This paper introduces a new version of the Gos reference corpus of spoken Slovenian, which was recently extended to more than double the original size (300 hours, 2.4 million words) by adding speech recordings and transcriptions from two related initiatives: the Gos VideoLectures corpus of public academic speech and the Artur speech recognition database. We first present the criteria guiding the balanced selection of the newly added data and the challenges encountered when merging language resources with divergent designs. We then present the other major enhancements of the new Gos corpus: improved lemmatization and morphosyntactic annotation, word-level speech alignment, a new XML schema, and a specialized online concordancer.
- Authors: Darinka Verdonik, Kaja Dobrovoljc, Tomaž Erjavec, Nikola Ljubešić
- Year: 2024
- Publishers: ELRA Language Resources Association (ELRA); International Committee on Computational Linguistics
- Source: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) : main conference proceedings : 20-25 May, 2024, Torino, Italia