Korpusnik

Slovene corpus summarizer

Help

Korpusnik is a tool that summarizes statistical and textual data from five corpora of the Slovenian language. Each corpus has a separate tab in the tool:

-- Standard Slovene: the Gigafida 2.0 reference corpus of written standard Slovenian (1.3 billion tokens, texts from 1990-2018).
-- Trending Slovene: the Trendi monitor corpus, which is updated monthly and draws texts from online media portals. It contains texts from 2019 to the present.
-- Academic Slovene: the OSS 1.0 corpus of academic Slovene (3.2 billion words), which contains more than 150,000 scientific texts from the National Open Science Portal.
-- Internet Slovene: the JANES 1.0 corpus of Internet Slovene (more than 252 million tokens), which contains texts from Slovenian social networks (blogs, comments on news, tweets).
-- Spoken Slovene: the Gos 2.0 reference corpus of spoken Slovene (2.5 million tokens), which contains approximately 300 hours of speech.

In addition to these five tabs, Korpusnik has a special Highlights tab, which the user sees first and which presents the main characteristics of the searched word across all five corpora.

An important feature of the tool is the semi-automatically generated Main Points, which summarize the most relevant aspects of the results presented on each tab. The templates behind them are “triggered” by specific results. Similarly, a semi-automatically generated description of each chart summarizes the data on the screen and is available by clicking the Description button.