MSTperl is a Perl reimplementation of the MST parser of Ryan McDonald (http://www.seas.upenn.edu/~strctlrn/MSTParser/MSTParser.html).
The MST parser (Maximum Spanning Tree parser) is a state-of-the-art natural language dependency parser -- a tool that takes a sentence and returns its dependency tree.
In MSTperl, only some functionality was implemented; the limitations include the following:
the parser is a non-projective one, currently with no possibility of enforcing the requirement of projectivity of the parse trees;
only first-order features are supported, i.e. no second-order or third-order features are possible;
the implementation of MIRA is that of a single-best MIRA, with a closed-form update instead of using quadratic programming.
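The closed-form update mentioned in the last item can be sketched as follows. This is a minimal illustration of a generic single-best MIRA step (in its passive-aggressive form), not MSTperl's actual code; the function name `mira_update`, the dict-based feature representation and the optional cap `C` are assumptions made for the example.

```python
from collections import Counter


def mira_update(weights, gold_feats, pred_feats, loss, C=float("inf")):
    """Apply one single-best MIRA step in place and return the step size.

    weights    -- dict: feature name -> weight (modified in place)
    gold_feats -- dict: feature name -> count in the gold tree
    pred_feats -- dict: feature name -> count in the current best parse
    loss       -- non-negative loss of the predicted tree (e.g. wrong heads)
    C          -- hypothetical upper bound on the step size
    """
    # Feature difference between the gold tree and the predicted tree.
    diff = Counter(gold_feats)
    diff.subtract(pred_feats)

    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0:
        return 0.0  # identical feature vectors: nothing to update

    # Current score margin of the gold tree over the predicted tree.
    margin = sum(weights.get(f, 0.0) * v for f, v in diff.items())

    # Smallest step that makes the margin at least as large as the loss.
    tau = min(C, max(0.0, (loss - margin) / norm_sq))

    for f, v in diff.items():
        if v:
            weights[f] = weights.get(f, 0.0) + tau * v
    return tau


if __name__ == "__main__":
    w = {}
    step = mira_update(w, {"a": 1, "b": 1}, {"b": 1, "c": 1}, loss=2.0)
    print(step, w)
```

With a single constraint the quadratic program has this analytic solution, so no QP solver is needed.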
On the other hand, the parser supports several advanced features:
parallel features, i.e. enriching the parser input with a word-aligned sentence in another language;
adding large-scale information, i.e. enriching the feature set with features corresponding to the pointwise mutual information of word pairs in a large corpus (CzEng);
weighted/unweighted parser model interpolation;
combination of several instances of the MSTperl parser (through MST algorithm);
combination of several existing parses from any parsers (through MST algorithm).
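For readers unfamiliar with pointwise mutual information, the quantity behind the large-scale features in the list above can be sketched as follows. This is only an illustration of the standard PMI formula, log2(p(x,y) / (p(x) p(y))), estimated from pair counts; the function name `pmi_table` and the toy word pairs are hypothetical, and how MSTperl extracts pairs from CzEng or turns the scores into features is not shown here.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Return {(x, y): PMI in bits} for a list of observed word pairs."""
    pair_counts = Counter(pairs)
    left_counts = Counter(x for x, _ in pairs)   # marginal counts of x
    right_counts = Counter(y for _, y in pairs)  # marginal counts of y
    total = len(pairs)
    # p(x,y) / (p(x) p(y)) simplifies to c(x,y) * N / (c(x) * c(y)).
    return {
        (x, y): math.log2(c * total / (left_counts[x] * right_counts[y]))
        for (x, y), c in pair_counts.items()
    }


if __name__ == "__main__":
    toy_pairs = [("read", "book"), ("read", "book"),
                 ("eat", "apple"), ("eat", "book")]
    for pair, score in sorted(pmi_table(toy_pairs).items()):
        print(pair, round(score, 3))
```

A positive score means the two words co-occur more often than chance would predict, a negative score means less often.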
The MSTperl parser is tuned for parsing Czech. Trained models are available for Czech, English and German. We can train the parser for other languages on demand, or you can train it yourself -- the guidelines are part of the documentation.
The parser, together with detailed documentation, is available on CPAN (http://search.cpan.org/~rur/Treex-Parser-MSTperl/). The research has been supported by the EU Seventh Framework Programme under grant agreement 247762 (Faust), and by the grants GAUK116310 and GA201/09/H057.
NameTag is an open-source tool for named entity recognition (NER). NameTag identifies proper names in text and classifies them into predefined categories, such as names of persons, locations, organizations, etc. NameTag is distributed as a standalone tool or a library, along with trained linguistic models. In the Czech language, NameTag achieves state-of-the-art performance (Straková et al. 2013). NameTag is free software under the LGPL license, and the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
NER models for NameTag 2, a named entity recognition tool, for English, German, Dutch, Spanish and Czech. Model documentation, including performance, can be found here: https://ufal.mff.cuni.cz/nametag/2/models . NameTag 2 itself can be found here: https://ufal.mff.cuni.cz/nametag/2 .
OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file.
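Since each record is one JSON object per line, the files can be read lazily with the standard library alone. The following is a minimal sketch; the function name `parse_jsonl`, the file name in the usage note and the key names `title`, `abstract` and `keywords` are assumptions for illustration and should be checked against the actual files.

```python
import json


def parse_jsonl(lines):
    """Yield one decoded record per non-empty line of a JSON lines source."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. at the end of a file
            yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical sample in the assumed record layout.
    sample = [
        '{"title": "a toy title", "abstract": "a toy abstract .", '
        '"keywords": "toy , example"}\n',
    ]
    for record in parse_jsonl(sample):
        print(sorted(record))
```

To read a real file, pass an open file handle instead of the list, for example `with open("some_part.txt", encoding="utf-8") as fh: records = parse_jsonl(fh)`, where `some_part.txt` stands for whichever text file of the dataset is being read.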
This data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY licence.
This data (OAGK Keyword Generation Dataset) is released under CC-BY licence (https://creativecommons.org/licenses/by/4.0/).
If using it, please cite the following paper:
Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2019, Minneapolis, USA
OAGKX is a keyword extraction/generation dataset consisting of 22674436 abstracts, titles and keyword strings from scientific articles. The texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file.
The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY license.
This data (OAGKX Keyword Generation Dataset) is released under CC-BY license (https://creativecommons.org/licenses/by/4.0/).
If using it, please cite the following paper:
Çano Erion, Bojar Ondřej. Keyphrase Generation: A Multi-Aspect Survey. FRUCT 2019, Proceedings of the 25th Conference of the Open Innovations Association FRUCT, Helsinki, Finland, Nov. 2019
To reproduce the experiments in the above paper, you can use the first 100000 lines of part_0_0.txt file.
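Taking only the first 100000 lines does not require loading the whole file. A minimal sketch, assuming one JSON record per line as described above; the helper name `first_records` is hypothetical, and the script expects part_0_0.txt in the working directory.

```python
import json
from itertools import islice


def first_records(lines, limit=100_000):
    """Decode at most `limit` leading lines of a JSON lines source."""
    # islice stops reading after `limit` lines, so the rest is never loaded.
    for line in islice(lines, limit):
        if line.strip():
            yield json.loads(line)


if __name__ == "__main__":
    with open("part_0_0.txt", encoding="utf-8") as fh:
        subset = list(first_records(fh))
    print(len(subset))
```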
OAGL is a paper metadata dataset consisting of 17528680 records which comprise various scientific publication attributes like abstracts, titles, keywords, publication years, venues, etc. The last field of each record is the page length of the corresponding publication. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY license. This data (OAGL Paper Metadata Dataset) is released under CC-BY license (https://creativecommons.org/licenses/by/4.0/).
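Because the page length is the last field of each OAGL record, a prediction target can be separated from the remaining metadata without knowing the field's name. The sketch below assumes each line is a JSON object whose keys appear in file order (Python's `json` module preserves that order); the function name `split_target` and the key names in the sample are hypothetical.

```python
import json


def split_target(line):
    """Split one JSON record into (metadata fields, value of the last field)."""
    record = json.loads(line)  # dict keys keep the order found in the file
    *feature_keys, target_key = record
    features = {key: record[key] for key in feature_keys}
    return features, record[target_key]


if __name__ == "__main__":
    # Hypothetical record; the real key names may differ.
    sample = '{"title": "a toy title", "year": 2019, "page_len": 8}'
    features, pages = split_target(sample)
    print(features, pages)
```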
If using it, please cite the following paper:
Çano Erion, Bojar Ondřej: How Many Pages? Paper Length Prediction from the Metadata.
NLPIR 2020, Proceedings of the 4th International Conference on Natural Language
Processing and Information Retrieval, Seoul, Korea, December 2020.
OAGS is a title generation dataset consisting of 34993700 abstracts and titles from scientific articles. Texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file.
The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY licence.
This data (OAGS Title Generation Dataset) is released under CC-BY licence (https://creativecommons.org/licenses/by/4.0/).
If using it, please cite the following paper:
Çano, Erion and Bojar, Ondřej, 2019, "Efficiency Metrics for Data-Driven Models: A Text Summarization Case Study", INLG 2019, The 12th International Conference on Natural Language Generation, November 2019, Tokyo, Japan.
To reproduce the experiments in the above paper, you can use oags_train1.txt, oags_train2.txt, oags_train3.txt, oags_test.txt and oags_val.txt files. If you need more data samples you can get them from oags_train_backup.txt and oags_val-test_backup.txt.
OAGSX is a title generation dataset consisting of 34408509 abstracts and titles from scientific articles. The texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file.
The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY license.
This data (OAGSX Title Generation Dataset) is released under CC-BY license (https://creativecommons.org/licenses/by/4.0/).
If using it, please also consider citing the following paper:
Çano Erion, Bojar Ondřej. Two Huge Title and Keyword Generation Corpora of Research Articles.
LREC 2020, Proceedings of the 12th International Conference on Language Resources and Evaluation,
Marseille, France, May 2020.