Search Results
132. jusText
- Creator:
- Pomikálek, Jan
- Publisher:
- Masaryk University, NLP Centre
- Type:
- toolService and tool
- Subject:
- boilerplate, web documents, text cleaning, boilerplate removal, and text corpora
- Language:
- English
- Description:
- jusText is a heuristic-based boilerplate removal tool useful for cleaning documents in large textual corpora. The tool is implemented in Python, licensed under the New BSD License, and released as open-source software (available for download, including the source code, at http://code.google.com/p/justext/). It is successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The boilerplate removal algorithm is able to remove most non-grammatical sentences from a web page, such as navigation, advertisements, tables, short notes and so on. It has been shown to outperform or at least keep up with its competitors (according to a comparison with participants of the Cleaneval competition in the author's Ph.D. thesis). The precise removal of unwanted content and the scalability of the algorithm have been demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: over 20 TB of HTML pages were processed, resulting in corpora of 70 billion tokens altogether. and PRESEMT, Lexical Computing Ltd
- Rights:
- Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0), http://creativecommons.org/licenses/by-sa/3.0/, and PUB
133. KAMOKO-Digitalizer
- Creator:
- Rüdiger, Jan Oliver
- Publisher:
- Rüdiger, Jan Oliver
- Type:
- tool and toolService
- Subject:
- learner corpus, corpus, and annotation
- Language:
- German
- Description:
- This editor was developed especially for the needs of the KAMOKO project (https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-3261). The editor allows the quick entry of example sentences and sentence variants as well as the corresponding speaker ratings.
- Rights:
- Affero General Public License 3 (AGPL-3.0), http://opensource.org/licenses/AGPL-3.0, and PUB
134. KER - Keyword Extractor
- Creator:
- Libovický, Jindřich
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- service and toolService
- Subject:
- keyword extraction
- Language:
- Czech and English
- Description:
- KER is a keyword extractor designed for scanned texts in Czech and English. It is based on the standard tf-idf algorithm, with the idf tables trained on texts from Wikipedia. To deal with data sparsity, texts are preprocessed by MorphoDiTa, a morphological dictionary and tagger.
- Rights:
- Apache License 2.0, http://opensource.org/licenses/Apache-2.0, and PUB
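The tf-idf scoring that the KER description refers to can be sketched roughly as follows. This is a minimal illustration of the general technique, not KER's actual code: the function names `build_idf` and `extract_keywords`, the smoothed idf formula, and the toy background documents are all assumptions, and the tokens are assumed to be already lemmatized (the step the description assigns to MorphoDiTa).

```python
import math
from collections import Counter


def build_idf(documents):
    """Build a smoothed idf table from tokenized background documents.

    Assumption: a smoothed idf, log((1 + N) / (1 + df)) + 1; the exact
    formula used by KER is not stated in the description.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each term once per document.
        doc_freq.update(set(tokens))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def extract_keywords(tokens, idf, top_k=3):
    """Return the top_k terms of one document ranked by tf-idf."""
    if not tokens:
        return []
    counts = Counter(tokens)
    # Terms missing from the background table are treated as maximally rare.
    default_idf = max(idf.values(), default=1.0)
    scores = {
        term: (count / len(tokens)) * idf.get(term, default_idf)
        for term, count in counts.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_k]


# Toy background collection standing in for the Wikipedia-trained idf tables.
background = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "bird", "sang"],
]
idf = build_idf(background)

# Hypothetical, already lemmatized input document.
text = ["the", "tagger", "labels", "the", "tagger", "words"]
keywords = extract_keywords(text, idf, top_k=1)
```

Lemmatizing first means inflected forms of a word share one count, which is how morphological preprocessing reduces data sparsity for a highly inflected language such as Czech.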
135. KinOath Kinship Archiver
- Publisher:
- Max Planck Institute for Psycholinguistics
- Type:
- toolService
- Description:
- KinOath Kinship Archiver is a kinship application with the primary goal of connecting kinship data with archived data, such as audio, video or written resources, while also being closely integrated with archive software such as Arbil. Beyond this primary goal, it is designed to be flexible and culturally nonspecific, so that culturally different social structures can be represented equally. Kin type strings are used throughout the application for constructing and searching data sets. The representation of kin terms is also integrated into the application, allowing comparative diagrams of kin terms. Graphical representation of the data is an important part of the application, and the diagrams produced are intended to be very flexible and of publishable quality.
- Rights:
- Not specified
136. KonText Web Demo
- Creator:
- Josífko, Michal
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- toolService and tool
- Subject:
- web service, corpus, parallel corpus, and demo
- Language:
- Czech and English
- Description:
- An interactive web demo for querying selected ÚFAL and LINDAT corpora. LINDAT/CLARIN KonText is a fork of ÚČNK KonText (https://github.com/czcorpus/kontext, maintained by Tomáš Machálek) that contains some modifications and additional features. KonText, in turn, is a fork of the Bonito 2.68 Python web interface to the corpus management tool Manatee (http://nlp.fi.muni.cz/trac/noske, created by Pavel Rychlý).
- Rights:
- GNU General Public License, version 2, http://www.gnu.org/licenses/gpl-2.0.html, and PUB
137. Korektor
- Creator:
- Richter, Michal
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- toolService and tool
- Subject:
- grammar checker and spellchecker
- Language:
- Czech
- Description:
- Statistical spell- and (occasional) grammar-checker. There are three versions: a Unix command-line utility, an OS X SpellServer with a System Service that integrates with native OS X GUI applications, and a web service run by LINDAT-CLARIN that can be used either through a web form in a browser or by web applications through an API. and The LINDAT-CLARIN project (LM2010013), fully supported by the Ministry of Education, Sports and Youth of the Czech Republic under the programme LM of "Large Infrastructures"
- Rights:
- BSD 2-Clause "Simplified" or "FreeBSD" license, http://opensource.org/licenses/BSD-2-Clause, and PUB
138. Korektor 2
- Creator:
- Straka, Milan and Richter, Michal
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- tool and toolService
- Subject:
- Korektor, spellchecker, spellchecking, grammar checker, and diacritical marks generation
- Language:
- English
- Description:
- Korektor is a statistical spell-checker and (occasionally) grammar-checker. It is released under the 2-Clause BSD license (http://opensource.org/licenses/BSD-2-Clause). Korektor started with Michal Richter's diploma thesis Advanced Czech Spellchecker (https://redmine.ms.mff.cuni.cz/documents/1), and it continues to be developed. There are two versions: a command-line utility (tested on Linux, Windows and OS X) and a REST service with a publicly available API (http://lindat.mff.cuni.cz/services/korektor/api-reference.php) and an HTML front end (https://lindat.mff.cuni.cz/services/korektor/).
- Rights:
- BSD 2-Clause "Simplified" or "FreeBSD" license, http://opensource.org/licenses/BSD-2-Clause, and PUB
139. kwic
- Publisher:
- Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
- Type:
- toolService
- Description:
- Word concordancer.
- Rights:
- Not specified
140. LAMUS
- Publisher:
- Max Planck Institute for Psycholinguistics
- Type:
- toolService
- Description:
- Language Archive Management and Upload System (LAMUS) is a web-based application that allows users to organize and update the content in the extensive archive of and IMDI-based corpus
- Rights:
- Not specified