Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.1

Creator: Miličević, Maja
Creator: Batanović, Vuk
Creator: Erjavec, Tomaž
Creator: Ljubešić, Nikola
Creator: Samardžić, Tanja
Date: 2019-09-11T15:39:00Z
dc.date.accessioned: 2021-07-24T21:27:19Z
dc.date.available: 2021-07-24T21:27:19Z
Identifier: http://hdl.handle.net/11356/1241
dc.identifier.uri: https://linghub.org/handle/123456789/924913
Description: ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). As an update to version 2.0, version 2.1 corrects some annotation errors and adds morphosyntactic annotations in the Universal Dependencies formalism in addition to the MULTEXT-East morphosyntactic descriptions. The corpus is now also available in CoNLL-U format.
Publisher: Jožef Stefan Institute
Rights: https://creativecommons.org/licenses/by-sa/4.0/
Rights: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Subject: lemmatisation
Subject: computer-mediated communication
Subject: named entities
Subject: manual annotation
Subject: word normalisation
Subject: tokenisation
Subject: part-of-speech tagging
Subject: TEI
Title: Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.1
Type: corpus
Type: Text
dcterms.available: 2019-09-11T15:39:00Z
dcterms.bibliographicCitation: http://hdl.handle.net/11356/1241
dcterms.creator: Miličević, Maja
dcterms.creator: Batanović, Vuk
dcterms.creator: Erjavec, Tomaž
dcterms.creator: Ljubešić, Nikola
dcterms.creator: Samardžić, Tanja
dcterms.date: 2019-09-11T15:39:00Z
dcterms.description: ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). As an update to version 2.0, version 2.1 corrects some annotation errors and adds morphosyntactic annotations in the Universal Dependencies formalism in addition to the MULTEXT-East morphosyntactic descriptions. The corpus is now also available in CoNLL-U format.
dcterms.identifier: http://hdl.handle.net/11356/1241
dcterms.publisher: Jožef Stefan Institute
dcterms.replaces: http://hdl.handle.net/11356/1170
dcterms.rights: https://creativecommons.org/licenses/by-sa/4.0/
dcterms.rights: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dcterms.subject: lemmatisation
dcterms.subject: computer-mediated communication
dcterms.subject: named entities
dcterms.subject: manual annotation
dcterms.subject: word normalisation
dcterms.subject: tokenisation
dcterms.subject: part-of-speech tagging
dcterms.subject: TEI
dcterms.title: Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.1
dcterms.type: corpus
dcterms.type: Text
odrl.Policy: http://purl.org/net/rdflicense/cc-by-sa4.0


This item appears in the following Collection(s)

  • OLAC
    Main data from the OLAC dataset

Copyright © 2020 All Rights Reserved by Prêt-à-LLOD Project.

Horizon 2020

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825182.