ParlaMint I: Towards Comparable Parliamentary Corpora
The ParlaMint project focused on creating comparable, uniformly annotated corpora of parliamentary debates in Europe. The first stage of the project compiled 17 corpora. The second stage extended the time span of the corpora, added corpora for new countries and autonomous regions, provided a machine-translated English version of the corpora, enriched them with additional metadata, and improved their usability.
ParlaMint I (July 2020 – May 2021)
Tasks
- Creating a multilingual set of uniformly annotated corpora of parliamentary proceedings dating from November 2019 to July 2020 (thus covering the then-current COVID-19 pandemic situation).
- Creating a set of comparable multilingual reference corpora of parliamentary data from 2015 to October 2019.
- Processing the corpora linguistically to add Universal Dependencies syntactic structures as well as named entity annotation.
- Making the corpora available through concordancers and Parlameter.
- Building use cases in Political Sciences and Digital Humanities based on the corpus data.
ParlaMint I Work Plan (2020-2021)
WP 1: Testing the approach for four languages (Lead: Maciej Ogrodniczuk (IPI-PAN), Petya Osenova (IICT-BAS))
- T1.1: Preparation of the reference parliamentary corpora
- T1.2: Creation of COVID-19 parliamentary corpora
- T1.3: Mounting of the corpora on the NoSketch Engine and KonText concordancers
- T1.4: Preparation of guidelines and mini-grant procedure
WP 2: Extending the corpora and showcasing (Lead: Tomaž Erjavec (IJS))
- T2.1: Adding additional corpora to the infrastructure
- T2.2: Preparation of showcases
- T2.3: Preparation of the documentation for usage by interested parties
More detailed information can be found below in the section ParlaMint I (July 2020 – May 2021).
The ParlaMint I project (2020-2021)
- Created comparable corpora of parliamentary debates:
  - Of 29 European countries and autonomous regions
  - From 2015 until 2022
  - Containing over 1 billion words
- Created uniformly encoded corpora
- Inclusion of rich metadata about 24000 speakers
- Linguistically annotated
- Project type:
- international research project
- Period:
- July 2020 – May 2021
- Funders:
- CLARIN ERIC
INZ Research Group
Andrej Pančur, PhD
Director and Research Associate