KER is a keyword extractor designed for scanned texts in Czech and English. It is based on the standard tf-idf algorithm, with the idf tables trained on texts from Wikipedia. To deal with data sparsity, texts are preprocessed by MorphoDiTa, a morphological dictionary and tagger. The web interface is limited to files of at most 2 MB; please use the API for larger files.
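As a rough illustration of the tf-idf scoring mentioned above, the following is a minimal textbook sketch, not KER's actual implementation. The function name `tf_idf`, the document-frequency table `doc_freq`, the corpus size `n_docs`, and the add-one smoothing are all assumptions made for this example; the exact weighting and normalization used by KER are not described here.

```python
import math
from collections import Counter


def tf_idf(tokens, doc_freq, n_docs):
    """Score each distinct term of one document with a textbook tf-idf.

    tokens   -- list of (already lemmatized) terms of the document
    doc_freq -- hypothetical table: term -> number of corpus documents containing it
    n_docs   -- hypothetical total number of documents in the corpus
    """
    counts = Counter(tokens)
    total = len(tokens)
    scores = {}
    for term, count in counts.items():
        tf = count / total  # relative frequency of the term in this document
        # Add-one smoothing keeps the ratio finite for terms unseen in the corpus.
        idf = math.log(n_docs / (1 + doc_freq.get(term, 0)))
        scores[term] = tf * idf
    return scores


# A frequent corpus word ("a") scores zero, a rarer one ("odhad") scores higher.
example = tf_idf(["odhad", "odhad", "a"], {"odhad": 9, "a": 999}, 1000)
```

Lemmatizing the tokens first, as the preprocessing step does, lets inflected forms of one word share a single count, which is what reduces the sparsity.
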
# | Parameter | Mandatory | Data type | Description |
---|---|---|---|---|
1 | file | yes | file | Plain-text file, ALTO XML file or a zip archive containing multiple files. |
2 | language | no | string | Language the text is in. Supported languages are cs and en. The default value is cs. |
3 | threshold | no | float | The minimum tf-idf score a term must reach to be considered a keyword. The default value is 0.2. |
4 | maximum-words | no | int | The maximum number of words that can be returned. The default value is 15. |
All text is expected to be in UTF-8. Regardless of the threshold and maximum-words fields, at least two keywords are always returned.
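One way to read the interplay of threshold, maximum-words and the two-keyword minimum is sketched below. This is an assumption about the selection logic, not the service's actual code: the function name `select_keywords`, the `minimum_words` argument, and the order in which the rules are applied are all hypothetical.

```python
def select_keywords(scores, threshold=0.2, maximum_words=15, minimum_words=2):
    """Pick keywords from a term -> score mapping (hypothetical logic).

    Terms are ranked by score, filtered by the threshold and capped at
    maximum_words; if fewer than minimum_words survive, the top-ranked
    terms are returned instead, whatever their scores.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    selected = [(term, score) for term, score in ranked if score >= threshold]
    selected = selected[:maximum_words]
    if len(selected) < minimum_words:
        # Fall back to the best-scoring terms so the minimum is honoured.
        selected = ranked[:minimum_words]
    return selected


scores = {"odhad": 0.60, "výběr": 0.46, "úhrn": 0.10}
print(select_keywords(scores))                 # default threshold and cap
print(select_keywords(scores, threshold=0.9))  # nothing passes, top two are kept
```
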
curl --form 'file=@test.zip' 'http://lindat.mff.cuni.cz/services/ker?language=cs'
{ "keywords": ["odhad", "výběr", "úhrn", "rozvržení", "průměr"], "keyword_scores": [0.6033144150404524, 0.4630133659942532, 0.35208990668596857, 0.2560803125496312, 0.22390298829472924], "morphodita_calls": 27 }
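The request URL and the response above can be handled with a few lines of Python. The sketch below is only an illustration: the helper names `build_url` and `pair_keywords` are hypothetical, and passing threshold and maximum-words in the query string is an assumption extrapolated from the way language is passed in the curl example. No network request is made; the sample response is copied from above.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://lindat.mff.cuni.cz/services/ker"


def build_url(language="cs", threshold=None, maximum_words=None):
    """Build the service URL; optional parameters are added only when given."""
    params = {"language": language}
    if threshold is not None:
        params["threshold"] = threshold
    if maximum_words is not None:
        params["maximum-words"] = maximum_words
    return BASE_URL + "?" + urlencode(params)


def pair_keywords(response_text):
    """Pair each returned keyword with its score, in the order given."""
    data = json.loads(response_text)
    return list(zip(data["keywords"], data["keyword_scores"]))


sample = (
    '{ "keywords": ["odhad", "výběr", "úhrn", "rozvržení", "průměr"], '
    '"keyword_scores": [0.6033144150404524, 0.4630133659942532, '
    '0.35208990668596857, 0.2560803125496312, 0.22390298829472924], '
    '"morphodita_calls": 27 }'
)

for keyword, score in pair_keywords(sample):
    print(f"{keyword}\t{score:.3f}")
```

The two lists in the response are parallel, so zipping them keeps each keyword next to its score.
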