KER - Keyword Extractor

KER is a keyword extractor designed for scanned texts in Czech and English. It is based on the standard tf-idf algorithm, with idf tables trained on texts from Wikipedia. To deal with data sparsity, texts are preprocessed by MorphoDiTa, a morphological dictionary and tagger. The web interface is limited to files of at most 2 MB; please use the API for larger files.
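
As a rough illustration of the scoring described above, the following minimal sketch computes tf-idf over a list of lemmas using a precomputed idf table. The function name `tf_idf`, the `idf_table` contents, and the `default_idf` fallback are hypothetical; the exact weighting and normalization KER uses are not stated here, and the lemma list only stands in for MorphoDiTa's output.

```python
from collections import Counter


def tf_idf(lemmas, idf_table, default_idf=1.0):
    """Score each lemma as (relative term frequency) * (idf from a lookup table).

    `idf_table` is a hypothetical stand-in for tables trained on a large corpus;
    lemmas missing from the table fall back to `default_idf` (an assumption).
    """
    counts = Counter(lemmas)
    total = len(lemmas)
    return {
        lemma: (count / total) * idf_table.get(lemma, default_idf)
        for lemma, count in counts.items()
    }


# Toy input: lemmatized tokens of a document and a made-up idf table.
lemmas = ["odhad", "odhad", "a", "výběr"]
idf_table = {"odhad": 2.0, "a": 0.1, "výběr": 3.0}
scores = tf_idf(lemmas, idf_table)
```

Working on lemmas rather than surface forms is what reduces sparsity: all inflected forms of a word share a single count and a single idf entry.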

    Documentation

    Introduction

    This section describes how to use KER via the REST API. The service accepts standard HTTP requests and returns its output as UTF-8 encoded JSON.

    Request

    #  Parameter      Mandatory  Data type  Description
    1  file           yes        file       Plain-text file, ALTO XML file, or a zip archive containing multiple files.
    2  language       no         string     Language of the text. Supported languages are cs and en. The default value is cs.
    3  threshold      no         float      Minimum tf-idf score a term must reach to be considered a keyword. The default value is 0.2.
    4  maximum-words  no         int        Maximum number of keywords that can be returned. The default value is 15.

    All text is expected to be in UTF-8. Regardless of the threshold and maximum-words values, at least two keywords are always returned.
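
One way to picture how these three rules interact is the sketch below. It is an assumption about the selection logic, not the service's actual implementation: the function name `select_keywords`, the `minimum_words` argument, and the use of `>=` for the threshold comparison are all hypothetical.

```python
def select_keywords(scored_terms, threshold=0.2, maximum_words=15, minimum_words=2):
    """Pick keywords from a {term: score} mapping.

    Assumed rule order: rank by score, keep terms at or above the threshold,
    cap at maximum_words, but never return fewer than minimum_words
    (as long as that many terms exist).
    """
    ranked = sorted(scored_terms.items(), key=lambda item: item[1], reverse=True)
    above_threshold = sum(1 for _, score in ranked if score >= threshold)
    count = max(minimum_words, min(above_threshold, maximum_words))
    return ranked[:count]


# Made-up scores for illustration only.
example_scores = {"odhad": 0.60, "výběr": 0.46, "úhrn": 0.35, "průměr": 0.22, "a": 0.01}
default_selection = select_keywords(example_scores)
strict_selection = select_keywords(example_scores, threshold=0.9)
```

With the defaults, the four terms scoring at least 0.2 are kept; with a threshold of 0.9 no term qualifies, so the two highest-scoring terms are returned anyway.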


    CURL Example

    curl --form 'file=@test.zip' 'http://lindat.mff.cuni.cz/services/ker?language=cs'
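
The same request can be sketched in Python using only the standard library. The helper names `build_ker_request` and `extract_keywords` are hypothetical, and the multipart body is assembled by hand to mirror what `curl --form` sends; as in the cURL example, `language` is passed in the query string.

```python
import json
import os
import urllib.parse
import urllib.request
import uuid

KER_URL = "http://lindat.mff.cuni.cz/services/ker"  # endpoint from the cURL example


def build_ker_request(filename, content, language="cs"):
    """Build a multipart/form-data POST with a single form field named 'file'."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode("utf-8"),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode("utf-8"),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode("utf-8"),
    ])
    url = KER_URL + "?" + urllib.parse.urlencode({"language": language})
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def extract_keywords(path, language="cs"):
    """Upload the file at `path` and return the decoded JSON response as a dict."""
    with open(path, "rb") as handle:
        request = build_ker_request(os.path.basename(path), handle.read(), language)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `extract_keywords("test.zip")` would perform the upload; `build_ker_request` is kept separate so the request can be inspected without touching the network.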

    JSON Response

    {
        "keywords": ["odhad", "výběr", "úhrn", "rozvržení", "průměr"],
        "keyword_scores": [0.6033144150404524, 0.4630133659942532, 0.35208990668596857, 0.2560803125496312, 0.22390298829472924],
        "morphodita_calls": 27
    }
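
A response like the one above can be turned into (keyword, score) pairs as sketched below. This assumes that `keywords` and `keyword_scores` are parallel arrays in the same order, which the example suggests but the documentation does not state explicitly; the function name `pair_keywords` is hypothetical.

```python
import json

# The sample response from this section, as it would arrive over HTTP.
response_text = """
{
    "keywords": ["odhad", "výběr", "úhrn", "rozvržení", "průměr"],
    "keyword_scores": [0.6033144150404524, 0.4630133659942532, 0.35208990668596857, 0.2560803125496312, 0.22390298829472924],
    "morphodita_calls": 27
}
"""


def pair_keywords(text):
    """Parse a KER JSON response and pair each keyword with its score."""
    data = json.loads(text)
    # Assumption: the i-th score belongs to the i-th keyword.
    return list(zip(data["keywords"], data["keyword_scores"]))


pairs = pair_keywords(response_text)
for keyword, score in pairs:
    print(f"{keyword}\t{score:.3f}")
```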