Show simple item record

Dataset of normalised Slovene text KonvNormSl 1.0

CreatorFišer, Darja
CreatorLjubešić, Nikola
CreatorErjavec, Tomaž
CreatorZupan, Katja
Date2016-07-27T12:15:05Z
dc.date.accessioned2021-07-24T21:26:22Z
dc.date.available2021-07-24T21:26:22Z
Identifierhttp://hdl.handle.net/11356/1068
dc.identifier.urihttps://linghub.org/handle/123456789/924797
DescriptionData used in the experiments described in: Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany. https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf (https://www.linguistics.rub.de/konvens16/) Data are split into the "token" folder (experiment on normalising individual tokens) and "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample, the original data (*.orig.txt) and the data with hand-normalised words (*.norm.txt). The files are aligned by lines. There are four datasets: - goo300k-bohoric: historical Slovene, hard case (<1850) - goo300k-gaj: historical Slovene, easy case (1850 - 1900) - tweet-L3: Slovene tweets, hard case (non-standard language) - tweet-L1: Slovene tweets, easy case (mostly standard language) The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (http://nl.ijs.si/janes/english/). The text in the files has been split by inserting spaces between characters, with underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark '¿' character.
PublisherJožef Stefan Institute
Rightshttps://creativecommons.org/licenses/by-sa/4.0/
RightsCreative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Subjectcomputer-mediated communication
Subjectmanual annotation
Subjectword normalisation
Subjecthistorical language
Subjectexperimental data
TitleDataset of normalised Slovene text KonvNormSl 1.0
Typecorpus
TypeText
dcterms.available2016-09-01T23:00:07Z
dcterms.bibliographicCitationhttp://hdl.handle.net/11356/1068
dcterms.creatorFišer, Darja
dcterms.creatorLjubešić, Nikola
dcterms.creatorErjavec, Tomaž
dcterms.creatorZupan, Katja
dcterms.date2016-07-27T12:15:05Z
dcterms.descriptionData used in the experiments described in: Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany. https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf (https://www.linguistics.rub.de/konvens16/) Data are split into the "token" folder (experiment on normalising individual tokens) and "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample, the original data (*.orig.txt) and the data with hand-normalised words (*.norm.txt). The files are aligned by lines. There are four datasets: - goo300k-bohoric: historical Slovene, hard case (<1850) - goo300k-gaj: historical Slovene, easy case (1850 - 1900) - tweet-L3: Slovene tweets, hard case (non-standard language) - tweet-L1: Slovene tweets, easy case (mostly standard language) The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (http://nl.ijs.si/janes/english/). The text in the files has been split by inserting spaces between characters, with underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark '¿' character.
dcterms.identifierhttp://hdl.handle.net/11356/1068
dcterms.publisherJožef Stefan Institute
dcterms.rightshttps://creativecommons.org/licenses/by-sa/4.0/
dcterms.rightsCreative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dcterms.subjectcomputer-mediated communication
dcterms.subjectmanual annotation
dcterms.subjectword normalisation
dcterms.subjecthistorical language
dcterms.subjectexperimental data
dcterms.titleDataset of normalised Slovene text KonvNormSl 1.0
dcterms.typecorpus
dcterms.typeText
odrl.Policyhttp://purl.org/net/rdflicense/cc-by-sa4.0


Check resource access

Authorized
Reason

Files in this item

FilesSizeFormatView

There are no files associated with this item.

This item appears in the following Collection(s)

  • OLAC
    Main data from the OLAC dataset

Show simple item record


Copyright  © 2020 All Rights Reserved by Prêt-à-LLOD Project.

Horizon 2020

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825182.