Frequency lists of word-level n-grams from the GOS 1.0 corpus

Creator: Dobrovoljc, Kaja
Creator: Krek, Simon
Creator: Čibej, Jaka
Creator: Arhar Holdt, Špela
Date: 2019-11-13T08:53:47Z
dc.date.accessioned: 2021-07-24T21:27:34Z
dc.date.available: 2021-07-24T21:27:34Z
Identifier: http://hdl.handle.net/11356/1271
dc.identifier.uri: https://linghub.org/handle/123456789/924940
Description: Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all word-level 2-, 3-, 4- and 5-grams occurring in the corpus, along with their absolute and relative frequencies, percentages, distribution across the text types included in the corpus taxonomy, and six collocation measures: Dice, t-score, MI, MI3, logDice, and simple LL. The n-grams were extracted from lower-case word forms, normalized word forms, and morphosyntactic tags. For large lists, shortened versions containing the first 150,000 lines were also prepared to facilitate further processing in spreadsheet software.
Publisher: Jožef Stefan Institute
Publisher: Centre for Language Resources and Technologies, University of Ljubljana
Rights: https://creativecommons.org/licenses/by-sa/4.0/
Rights: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Subject: word sets
Subject: spoken corpus
Subject: word forms
Subject: normalized forms
Subject: words
Subject: n-grams
Subject: morphosyntactic tags
Title: Frequency lists of word-level n-grams from the GOS 1.0 corpus
Type: lexicalConceptualResource
Type: Text
dcterms.available: 2019-11-13T08:53:47Z
dcterms.bibliographicCitation: http://hdl.handle.net/11356/1271
dcterms.creator: Dobrovoljc, Kaja
dcterms.creator: Krek, Simon
dcterms.creator: Čibej, Jaka
dcterms.creator: Arhar Holdt, Špela
dcterms.date: 2019-11-13T08:53:47Z
dcterms.description: Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all word-level 2-, 3-, 4- and 5-grams occurring in the corpus, along with their absolute and relative frequencies, percentages, distribution across the text types included in the corpus taxonomy, and six collocation measures: Dice, t-score, MI, MI3, logDice, and simple LL. The n-grams were extracted from lower-case word forms, normalized word forms, and morphosyntactic tags. For large lists, shortened versions containing the first 150,000 lines were also prepared to facilitate further processing in spreadsheet software.
dcterms.identifier: http://hdl.handle.net/11356/1271
dcterms.isReplacedBy: http://hdl.handle.net/11356/1365
dcterms.publisher: Jožef Stefan Institute
dcterms.publisher: Centre for Language Resources and Technologies, University of Ljubljana
dcterms.rights: https://creativecommons.org/licenses/by-sa/4.0/
dcterms.rights: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dcterms.subject: word sets
dcterms.subject: spoken corpus
dcterms.subject: word forms
dcterms.subject: normalized forms
dcterms.subject: words
dcterms.subject: n-grams
dcterms.subject: morphosyntactic tags
dcterms.title: Frequency lists of word-level n-grams from the GOS 1.0 corpus
dcterms.type: lexicalConceptualResource
dcterms.type: Text
odrl.Policy: http://purl.org/net/rdflicense/cc-by-sa4.0
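
Note on the collocation measures: the record names the measures but does not give their formulas. As a rough illustration only, the sketch below computes the named measures for a single 2-gram using commonly cited textbook definitions. These definitions are an assumption and may differ from what the LIST tool actually implements, particularly for 3-, 4- and 5-grams. The function name, parameter names, and the example counts are hypothetical and are not taken from the GOS data.

```python
import math


def collocation_measures(f_xy, f_x, f_y, n):
    """Score one bigram with commonly cited association measures.

    Hypothetical helper; the formulas are assumed textbook definitions,
    not confirmed to match the LIST corpus extraction tool.
    f_xy: frequency of the bigram, f_x / f_y: frequencies of its two
    words, n: corpus size in tokens.
    """
    if min(f_xy, f_x, f_y, n) <= 0:
        raise ValueError("all frequencies must be positive")

    # Expected bigram frequency if the two words were independent.
    expected = f_x * f_y / n
    dice = 2 * f_xy / (f_x + f_y)

    return {
        "dice": dice,
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        "mi": math.log2(f_xy / expected),
        "mi3": math.log2(f_xy ** 3 / expected),
        "log_dice": 14 + math.log2(dice),
        # "Simple" log-likelihood: uses only observed vs. expected counts.
        "simple_ll": 2 * (f_xy * math.log(f_xy / expected) - (f_xy - expected)),
    }


# Invented counts, for illustration only.
scores = collocation_measures(f_xy=10, f_x=100, f_y=50, n=10_000)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed definitions, MI and MI3 differ only in cubing the observed frequency, which reduces MI's tendency to favour rare pairs, and logDice is a rescaled Dice score that does not depend on corpus size.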


Files in this item

There are no files associated with this item.

This item appears in the following Collection(s)

  • OLAC
    Main data from the OLAC dataset


Copyright  © 2020 All Rights Reserved by Prêt-à-LLOD Project.

Horizon 2020

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825182.