KER - Keyword Extractor

KER is a keyword extractor designed for scanned texts in Czech and English. It is based on the standard tf-idf algorithm, with idf tables trained on texts from Wikipedia. To deal with data sparsity, texts are preprocessed by MorphoDiTa, a morphological dictionary and tagger. The web interface is limited to files of at most 2 MB; please use the API for larger files.
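The tf-idf scoring mentioned above can be illustrated with a minimal Python sketch. It is not KER's actual implementation: the function name, the relative-frequency normalization, and the idf values are assumptions made for illustration, and the input is assumed to be already reduced to base forms by a tagger.

```python
import math
from collections import Counter


def tf_idf_scores(lemmas, idf_table, default_idf=0.0):
    """Score each distinct lemma as relative term frequency times idf."""
    counts = Counter(lemmas)
    total = len(lemmas)
    return {
        lemma: (count / total) * idf_table.get(lemma, default_idf)
        for lemma, count in counts.items()
    }


# Hypothetical idf values; a real table would be estimated from a large
# corpus, for example as log(N / document_frequency).
idf_table = {"odhad": math.log(1000 / 10), "a": math.log(1000 / 990)}
scores = tf_idf_scores(["odhad", "a", "odhad", "a"], idf_table)
```

Both terms occur equally often here, but the rare term "odhad" scores far higher than the common conjunction "a" because of its larger idf value.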

language:
tf-idf threshold:
maximum number of keywords:

file with content: (UTF-8 plaintext or ALTO OCR XML)

Keywords:

Documentation

Introduction

This section describes how to use KER via the REST API. The service accepts standard HTTP requests and returns its output as JSON in UTF-8 encoding.

Request

#	Parameter	Mandatory	Data type	Description
1	file	yes	file	Plain-text file, ALTO XML file or a zip archive containing multiple files.
2	language	no	string	Language the text is in. Supported languages are cs and en. The default value is cs.
3	threshold	no	float	The minimum value of tf-idf score of a term to be considered a keyword. The default value is 0.2.
4	maximum-words	no	int	The maximum number of words that can be returned. The default value is 15.

All text is expected in UTF-8. Regardless of the threshold and maximum-words parameters, at least two keywords are always returned.
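One way to read the interaction of threshold, maximum-words and the two-keyword minimum is sketched below in Python. This is an assumption about the selection logic, not the service's actual code: the function name is hypothetical, and treating the threshold as inclusive is a guess.

```python
def select_keywords(scores, threshold=0.2, maximum_words=15, minimum_words=2):
    """Pick keywords from a {term: tf-idf score} mapping.

    Assumed behaviour: keep terms scoring at least `threshold`, cap the
    list at `maximum_words`, but never return fewer than `minimum_words`
    when that many terms exist.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    selected = [pair for pair in ranked if pair[1] >= threshold][:maximum_words]
    if len(selected) < minimum_words:
        # Fall back to the best-scoring terms regardless of the limits.
        selected = ranked[:minimum_words]
    return selected


example = {"odhad": 0.6, "rok": 0.1, "a": 0.05}
picked = select_keywords(example)
```

With the default threshold only "odhad" qualifies, so the fallback adds the next best term to reach two keywords.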

CURL Example

curl --form 'file=@test.zip' 'http://lindat.mff.cuni.cz/services/ker?language=cs'
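The same request can be sketched in Python using only the standard library. The helper names are hypothetical, and passing threshold and maximum-words in the query string is an assumption: the cURL example above only shows language being passed that way.

```python
import json
import os
import urllib.parse
import urllib.request
import uuid

KER_URL = "http://lindat.mff.cuni.cz/services/ker"


def build_ker_request(filename, content, language="cs",
                      threshold=None, maximum_words=None):
    """Build (but do not send) a multipart/form-data POST request."""
    params = {"language": language}
    if threshold is not None:
        params["threshold"] = str(threshold)
    if maximum_words is not None:
        params["maximum-words"] = str(maximum_words)
    url = KER_URL + "?" + urllib.parse.urlencode(params)

    # A single form part named "file", mirroring curl's --form 'file=@...'.
    boundary = uuid.uuid4().hex
    body = b"".join([
        b"--" + boundary.encode("ascii") + b"\r\n",
        b'Content-Disposition: form-data; name="file"; filename="'
        + os.path.basename(filename).encode("utf-8") + b'"\r\n',
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        b"\r\n--" + boundary.encode("ascii") + b"--\r\n",
    ])
    headers = {"Content-Type": "multipart/form-data; boundary=" + boundary}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def extract_keywords(path, **options):
    """Send the file at `path` to the service and return the decoded JSON."""
    with open(path, "rb") as handle:
        request = build_ker_request(path, handle.read(), **options)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling extract_keywords("test.zip", language="cs") would then perform the upload; building the request separately makes it possible to inspect the URL and body without contacting the server.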

JSON Response

{
    "keywords": ["odhad", "výběr", "úhrn", "rozvržení", "průměr"],
    "keyword_scores": [0.6033144150404524, 0.4630133659942532, 0.35208990668596857, 0.2560803125496312, 0.22390298829472924],
    "morphodita_calls": 27
}
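A response like the one above can be turned into (keyword, score) pairs with a few lines of Python. The sketch assumes that keywords and keyword_scores are parallel lists in the same order, which the sample suggests but the documentation does not state; the function name is hypothetical.

```python
import json


def keywords_with_scores(response_text):
    """Pair each keyword with the score at the same list position."""
    payload = json.loads(response_text)
    return list(zip(payload["keywords"], payload["keyword_scores"]))


# Shortened copy of the sample response above.
sample = (
    '{"keywords": ["odhad", "výběr"], '
    '"keyword_scores": [0.6033144150404524, 0.4630133659942532], '
    '"morphodita_calls": 27}'
)
pairs = keywords_with_scores(sample)
```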