Frequency lists of character-level n-grams from the GOS 1.0 corpus 1.1

Creator: Dobrovoljc, Kaja
Creator: Krek, Simon
Creator: Čibej, Jaka
Creator: Arhar Holdt, Špela
Date: 2020-11-02T12:35:03Z
dc.date.accessioned: 2021-07-24T21:29:57Z
dc.date.available: 2021-07-24T21:29:57Z
Identifier: http://hdl.handle.net/11356/1363
dc.identifier.uri: https://linghub.org/handle/123456789/925004
Description: Frequency lists of character-level n-grams were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain character n-grams of length 1 to 5 occurring in the corpus, along with their absolute and relative frequencies, percentages, and distribution across the text types in the corpus taxonomy. Character-level n-grams were extracted from lemmas (5 files), lower-case word forms (5 files), and standardized word forms (5 files). Compared to the previous version (http://hdl.handle.net/11356/1268), this version fixes several typos and replaces all instances of "normalized forms" with the more appropriate term "standardized forms" (as used in the SSJ project).
Publisher: Jožef Stefan Institute
Publisher: Centre for Language Resources and Technologies, University of Ljubljana
Rights: https://creativecommons.org/licenses/by-sa/4.0/
Rights: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Subject: characters
Subject: n-grams
Subject: spoken corpus
Subject: frequency list
Title: Frequency lists of character-level n-grams from the GOS 1.0 corpus 1.1
Type: lexicalConceptualResource
Type: Text
dcterms.available: 2020-11-02T12:35:03Z
dcterms.bibliographicCitation: http://hdl.handle.net/11356/1363
dcterms.creator: Dobrovoljc, Kaja
dcterms.creator: Krek, Simon
dcterms.creator: Čibej, Jaka
dcterms.creator: Arhar Holdt, Špela
dcterms.date: 2020-11-02T12:35:03Z
dcterms.description: Frequency lists of character-level n-grams were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain character n-grams of length 1 to 5 occurring in the corpus, along with their absolute and relative frequencies, percentages, and distribution across the text types in the corpus taxonomy. Character-level n-grams were extracted from lemmas (5 files), lower-case word forms (5 files), and standardized word forms (5 files). Compared to the previous version (http://hdl.handle.net/11356/1268), this version fixes several typos and replaces all instances of "normalized forms" with the more appropriate term "standardized forms" (as used in the SSJ project).
dcterms.identifier: http://hdl.handle.net/11356/1363
dcterms.publisher: Jožef Stefan Institute
dcterms.publisher: Centre for Language Resources and Technologies, University of Ljubljana
dcterms.replaces: http://hdl.handle.net/11356/1268
dcterms.rights: https://creativecommons.org/licenses/by-sa/4.0/
dcterms.rights: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dcterms.subject: characters
dcterms.subject: n-grams
dcterms.subject: spoken corpus
dcterms.subject: frequency list
dcterms.title: Frequency lists of character-level n-grams from the GOS 1.0 corpus 1.1
dcterms.type: lexicalConceptualResource
dcterms.type: Text
odrl.Policy: http://purl.org/net/rdflicense/cc-by-sa4.0
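
The description in this record mentions character-level n-grams with absolute and relative frequencies. As a rough illustration of that idea only, here is a minimal Python sketch. It is not the LIST tool's method: the function name `char_ngram_frequencies`, the sample tokens, the choice to keep n-grams inside word boundaries, and the definition of relative frequency as a share of all counted n-grams are all assumptions made for this example.

```python
from collections import Counter


def char_ngram_frequencies(tokens, n):
    """Return {ngram: (absolute, relative)} for character n-grams of length n.

    Assumptions (not taken from the record): n-grams never span two tokens,
    and relative frequency is the n-gram's share of all counted n-grams.
    """
    counts = Counter()
    for token in tokens:
        # Slide a window of width n over the characters of each token.
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: (count, count / total) for gram, count in counts.items()}


# Hypothetical lower-case word forms, used only to exercise the function.
tokens = ["dober", "dan"]
unigrams = char_ngram_frequencies(tokens, 1)
bigrams = char_ngram_frequencies(tokens, 2)
```

Running the same function for n = 1 to 5 over lemmas, lower-case word forms, and standardized word forms would give fifteen lists, matching the file counts in the description. The per-text-type distribution mentioned there is not covered by this sketch.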


Files in this item

There are no files associated with this item.

This item appears in the following Collection(s)

  • OLAC
    Main data from the OLAC dataset

Copyright © 2020 All Rights Reserved by Prêt-à-LLOD Project.

Horizon 2020

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825182.