Collocation and Term Extractor

Instance of: Dataset
Contributor Nikola Ljubešić
Description CollTerm is a language-independent tool for collocation and term extraction. It collects collocation and term candidates in one of two ways: for multiword units (MWUs, i.e. collocations) it uses five different co-occurrence measures, and for single-word units it applies the TF-IDF measure to capture distributional differences from a large representative corpus. The language-dependent part consists of a stop-word list and a list of MWU MSD patterns, which can also be coded as regular expressions. The application is described in the paper presented at TKE2012: Pinnis, M., Ljubešić, N., Ştefănescu, D., Skadiņa, I., Tadić, M., Gornostay, T. Term Extraction, Tagging, and Mapping Tools for Under-Resourced Languages. The first version of the application is available as an integral part of the ACCURAT Toolkit, which is released under the Apache 2.0 license (http://www.accurat-project.eu/index.php?p=accurat-toolkit). In this version of the tool, a calibration of MWU MSD patterns has been provided for Croatian, thus enhancing the usability of the tool. The plan is to provide calibration for other CESAR languages as well.
Rights ApacheLicence_2.0
See Also http://metashare.elda.org/repository/browse/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75/
Source META-SHARE
Title Collocation and Term Extractor
Type Resource Info
Type Tool Service
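
The two scoring strategies named in the description can be illustrated with a minimal sketch. This is not CollTerm's code or interface: the function names, the input shapes, the smoothed IDF formula, and the choice of the Dice coefficient as the co-occurrence measure are all assumptions made for illustration (the record does not say which five measures the tool implements). The sketch is written for Python 3, whereas the record lists Python 2.6 or higher as the requirement for the tool itself.

```python
import math
import re
from collections import Counter


def tfidf_terms(domain_tokens, reference_docs, stop_words):
    """Rank single-word term candidates by TF-IDF (hypothetical helper).

    TF is counted in the domain text; IDF is estimated from a reference
    corpus given as a list of tokenised documents.
    """
    tf = Counter(t for t in domain_tokens if t not in stop_words)
    doc_sets = [set(doc) for doc in reference_docs]
    n_docs = len(doc_sets)
    scores = {}
    for word, freq in tf.items():
        df = sum(1 for doc in doc_sets if word in doc)
        # Smoothed IDF (an assumption) so unseen words do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df))
        scores[word] = freq * idf
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def dice_bigrams(tagged_tokens, msd_pattern, stop_words):
    """Score two-word candidates with the Dice coefficient (hypothetical helper).

    tagged_tokens is a list of (word, tag) pairs. A pair of adjacent words is
    kept only if its space-joined tag sequence fully matches msd_pattern and
    neither word is a stop word.
    """
    pattern = re.compile(msd_pattern)
    unigrams = Counter(word for word, _ in tagged_tokens)
    bigrams = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if w1 in stop_words or w2 in stop_words:
            continue
        if pattern.fullmatch(t1 + " " + t2):
            bigrams[(w1, w2)] += 1
    # Dice = 2 * f(w1, w2) / (f(w1) + f(w2))
    return {
        pair: 2.0 * freq / (unigrams[pair[0]] + unigrams[pair[1]])
        for pair, freq in bigrams.items()
    }
```

In this sketch the stop-word list and the tag pattern are the only language-dependent inputs, which mirrors the split described in the record; a pattern such as `N\S* N\S*` would keep noun-noun pairs under a tagset whose noun tags start with `N` (again an assumption about the tagset).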

Contact Point

Affiliation
Communication Info Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#communication Info2
Organization Name University of Zagreb, Faculty of Humanities and Social Sciences, Department of Information Sciences
Type Organization Info Type
Communication Info Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#communication Info
Given Name Nikola
Position Assistant Professor
Surname Ljubešić
Type Contact Person
Person
Person Info Type

Distribution Info

Availability Available-unrestricted Use
Ipr Holder
Organization Info
Communication Info Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#communication Info2
Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#communication Info
Department Name Department/Institute of Linguistics, Department of Information Sciences
Organization Name University of Zagreb, Faculty of Humanities and Social Sciences
Organization Short Name FFZG
Type Organization Info Type
Person Info Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#person Info
Type Actor
License
Delivery Channel Downloadable
Distribution Rights Holder
Organization Info Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#organization Info
Type Actor
Permission
Action http://creativecommons.org/ns/Notify
Duty Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#permission
Type Duty
Permission
Restrictions Of Use
Same As http://www.apache.org/licenses/LICENSE-2.0.html
Type Licence Info
Type Distribution
Distribution Info

Identification Info

Description CollTerm is a language-independent tool for collocation and term extraction. It collects collocation and term candidates in one of two ways: for multiword units (MWUs, i.e. collocations) it uses five different co-occurrence measures, and for single-word units it applies the TF-IDF measure to capture distributional differences from a large representative corpus. The language-dependent part consists of a stop-word list and a list of MWU MSD patterns, which can also be coded as regular expressions. The application is described in the paper presented at TKE2012: Pinnis, M., Ljubešić, N., Ştefănescu, D., Skadiņa, I., Tadić, M., Gornostay, T. Term Extraction, Tagging, and Mapping Tools for Under-Resourced Languages. The first version of the application is available as an integral part of the ACCURAT Toolkit, which is released under the Apache 2.0 license (http://www.accurat-project.eu/index.php?p=accurat-toolkit). In this version of the tool, a calibration of MWU MSD patterns has been provided for Croatian, thus enhancing the usability of the tool. The plan is to provide calibration for other CESAR languages as well.
Distribution
Access URL http://hnk.ffzg.hr/
http://www.nljubesic.net/resources/tools/collterm/
Type Distribution
URL
Identifier 312
Meta Share Id NOT_DEFINED_FOR_V2
Resource Short Name CollTerm
Title Collocation and Term Extractor
Type Identification Info

Resource Creation Info

Creation Start Date 2011-04-01 Date
Creator
Organization Info
Communication Info
Address Ivana Lučića 3
City Zagreb
Country Croatia
Distribution Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#Dist URL3
Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#Dist URL
Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#Dist URL2
Email nljubesi@ffzg.hr
zzl@ffzg.hr
Fax Number +385 1 6156 879
Telephone Number +385 1 6002 323
+385 1 6120 066
Type Communication Info
Zip Code 10000
Organization Name Univ. of Zagreb, Faculty of Humanities and Social Sciences, Depts. of Linguistics & Information Sci.
Type Organization Info Type
Type Actor
Funding Project
Distribution Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#Dist URL3
Access URL http://www.cesar-project.net
Type Distribution
URL
Funder University of Zagreb, Faculty of Humanities and Social Sciences (25%)
European Commission (50%)
University of Zagreb, Faculty of Humanities and Social Sciences (50%)
European Commission (75%)
Funding Type National Funds
Eu Funds
Project End Date 2013-01-31 Date
2012-06-30 Date
Project Name Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation
Central and South-East European Resources
Project Short Name ACCURAT
CESAR
Project Start Date 2010-01-01 Date
2011-02-01 Date
Type Project Info Type
Type Resource Creation Info

Tool Service Info

Input Info
Media Type Text
Modality Type Written Language
Resource Type Language Description
Type Input Info
Language Dependent false Boolean
Output Info
Media Type Text
Modality Type Written Language
Resource Type Lexical Conceptual Resource
Type Output Info
Resource Type Tool Service
Tool Service Creation Info
Implementation Language Python
Type Tool Service Creation Info
Tool Service Evaluation Info
Evaluated true Boolean
Evaluation Criteria Intrinsic
Evaluation Level Diagnostic
Evaluation Measure Human
Evaluation Type Black Box
Evaluator
Person Info
Affiliation
Communication Info
Address Ivana Lučića 3
City Zagreb
Country Croatia
Distribution
Access URL http://www.nljubesic.net/
Type Distribution
URL
Email nljubesi@ffzg.hr
Fax Number +385 1 6156 879
Telephone Number +385 1 6002 323
Type Communication Info
Zip Code 10000
Organization Name University of Zagreb, Faculty of Humanities and Social Sciences, Department of Information Sciences
Type Organization Info Type
Communication Info
Address Ivana Lučića 3
City Zagreb
Country Croatia
Distribution Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#Dist URL2
Email nljubesi@ffzg.hr
Fax Number +385 1 6156 879
Telephone Number +385 1 6002 323
Type Communication Info
Zip Code 10000
Address Ivana Lučića 3
City Zagreb
Country Croatia
Distribution Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#Dist URL
Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#Dist URL2
Access URL http://www.accurat-project.eu
http://hnk.ffzg.hr
Type Distribution
URL
Email marko.tadic@ffzg.hr
zzl@ffzg.hr
nljubesi@ffzg.hr
Fax Number +385 1 6156 879
Telephone Number +385 1 6002 323
+385 1 6120 066
Type Communication Info
Zip Code 10000
Given Name Nikola
Marko
Surname Ljubešić
Tadić
Type Person
Person Info Type
Type Actor
Type Tool Service Evaluation Info
Tool Service Operation Info
Operating System Linux
Running Environment Info
Required Software Python (version 2.6 or higher)
Type Running Environment Info
Type Tool Service Operation Info
Tool Service Type Tool
Type Tool Service Info

Version Info

Has Version 1.0
Modified 2012-07-30 Date
Type Version Info

Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#metadata Creator

Instance of: Actor
Is Creator of Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#metadata Info

Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#Header

Instance of: Catalog Record
Issued 2014-09-23T00:19:21Z Date
Primary Topic Collocation and Term Extractor
Set Spec toolService:tool
toolService