CMC training corpus Janes-Tag 1.1

Creator: Fišer, Darja
Creator: Erjavec, Tomaž
Creator: Čibej, Jaka
Creator: Arhar Holdt, Špela
Date: 2016-12-28T11:40:50Z
dc.date.accessioned: 2021-07-24T21:26:27Z
dc.date.available: 2021-07-24T21:26:27Z
Identifier: http://hdl.handle.net/11356/1081
dc.identifier.uri: https://linghub.org/handle/123456789/924808
Description: Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations. The corpus is further described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf. Note that a related corpus, Janes-Norm, is also available; cf. http://hdl.handle.net/11356/1083.
Publisher: Jožef Stefan Institute
Rights: https://creativecommons.org/licenses/by-sa/4.0/
Rights: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Subject: lemmatisation
Subject: computer-mediated communication
Subject: manual annotation
Subject: word normalisation
Subject: tagging
Subject: tokenisation
Subject: TEI
Title: CMC training corpus Janes-Tag 1.1
Type: corpus
Type: Text
dcterms.available: 2016-12-28T11:40:50Z
dcterms.bibliographicCitation: http://hdl.handle.net/11356/1081
dcterms.creator: Fišer, Darja
dcterms.creator: Erjavec, Tomaž
dcterms.creator: Čibej, Jaka
dcterms.creator: Arhar Holdt, Špela
dcterms.date: 2016-12-28T11:40:50Z
dcterms.description: Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations. The corpus is further described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf. Note that a related corpus, Janes-Norm, is also available; cf. http://hdl.handle.net/11356/1083.
dcterms.identifier: http://hdl.handle.net/11356/1081
dcterms.isReplacedBy: http://hdl.handle.net/11356/1085
dcterms.publisher: Jožef Stefan Institute
dcterms.replaces: http://hdl.handle.net/11356/1079
dcterms.rights: https://creativecommons.org/licenses/by-sa/4.0/
dcterms.rights: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dcterms.subject: lemmatisation
dcterms.subject: computer-mediated communication
dcterms.subject: manual annotation
dcterms.subject: word normalisation
dcterms.subject: tagging
dcterms.subject: tokenisation
dcterms.subject: TEI
dcterms.title: CMC training corpus Janes-Tag 1.1
dcterms.type: corpus
dcterms.type: Text
odrl.Policy: http://purl.org/net/rdflicense/cc-by-sa4.0


Files in this item

There are no files associated with this item.

This item appears in the following Collection(s)

  • OLAC
    Main data from the OLAC dataset

Copyright © 2020 All Rights Reserved by Prêt-à-LLOD Project.

Horizon 2020

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825182.