Creator | Dobrovoljc, Kaja | |
Creator | Krek, Simon | |
Creator | Čibej, Jaka | |
Creator | Arhar Holdt, Špela | |
Date | 2019-11-13T08:53:47Z | |
dc.date.accessioned | 2021-07-24T21:27:34Z | |
dc.date.available | 2021-07-24T21:27:34Z | |
Identifier | http://hdl.handle.net/11356/1271 | |
dc.identifier.uri | https://linghub.org/handle/123456789/924940 | |
Description | Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all word-level 2-, 3-, 4- and 5-grams occurring in the corpus along with their absolute and relative frequencies, percentages, distribution across the text-types included in the corpus taxonomy, and six collocation measures: Dice, t-score, MI, MI3, logDice, and simple LL.
The n-grams were extracted from lower-case word forms, normalized word forms, and morphosyntactic tags.
For large lists, shortened versions with the first 150,000 lines were also prepared to facilitate further processing in spreadsheet analysis software. | |
Publisher | Jožef Stefan Institute | |
Publisher | Centre for Language Resources and Technologies, University of Ljubljana | |
Rights | https://creativecommons.org/licenses/by-sa/4.0/ | |
Rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) | |
Subject | word sets | |
Subject | spoken corpus | |
Subject | word forms | |
Subject | normalized forms | |
Subject | words | |
Subject | n-grams | |
Subject | morphosyntactic tags | |
Title | Frequency lists of word-level n-grams from the GOS 1.0 corpus | |
Type | lexicalConceptualResource | |
Type | Text | |
dcterms.available | 2019-11-13T08:53:47Z | |
dcterms.bibliographicCitation | http://hdl.handle.net/11356/1271 | |
dcterms.creator | Dobrovoljc, Kaja | |
dcterms.creator | Krek, Simon | |
dcterms.creator | Čibej, Jaka | |
dcterms.creator | Arhar Holdt, Špela | |
dcterms.date | 2019-11-13T08:53:47Z | |
dcterms.description | Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all word-level 2-, 3-, 4- and 5-grams occurring in the corpus along with their absolute and relative frequencies, percentages, distribution across the text-types included in the corpus taxonomy, and six collocation measures: Dice, t-score, MI, MI3, logDice, and simple LL.
The n-grams were extracted from lower-case word forms, normalized word forms, and morphosyntactic tags.
For large lists, shortened versions with the first 150,000 lines were also prepared to facilitate further processing in spreadsheet analysis software. | |
dcterms.identifier | http://hdl.handle.net/11356/1271 | |
dcterms.isReplacedBy | http://hdl.handle.net/11356/1365 | |
dcterms.publisher | Jožef Stefan Institute | |
dcterms.publisher | Centre for Language Resources and Technologies, University of Ljubljana | |
dcterms.rights | https://creativecommons.org/licenses/by-sa/4.0/ | |
dcterms.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) | |
dcterms.subject | word sets | |
dcterms.subject | spoken corpus | |
dcterms.subject | word forms | |
dcterms.subject | normalized forms | |
dcterms.subject | words | |
dcterms.subject | n-grams | |
dcterms.subject | morphosyntactic tags | |
dcterms.title | Frequency lists of word-level n-grams from the GOS 1.0 corpus | |
dcterms.type | lexicalConceptualResource | |
dcterms.type | Text | |
odrl.Policy | http://purl.org/net/rdflicense/cc-by-sa4.0 | |