KER is a keyword extractor designed for scanned texts in Czech and English. It is based on the standard tf-idf algorithm, with idf tables trained on texts from Wikipedia. To deal with data sparsity, texts are preprocessed with MorphoDiTa, a morphological dictionary and tagger. The web interface accepts files of at most 2 MB; please use the API for larger files.
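For readers unfamiliar with tf-idf, the idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not KER's implementation: the function name `tf_idf_scores`, the relative-frequency weighting, and the toy idf values are all assumptions, and KER's real idf tables and any score normalization are not described here.

```python
from collections import Counter


def tf_idf_scores(terms, idf, default_idf=0.0):
    """Score each distinct term by (relative frequency in the text) * (idf weight).

    `terms` stands for the tokens of one document after preprocessing;
    `idf` is a precomputed table mapping a term to its idf weight.
    Terms missing from the table get `default_idf` (an assumption of this sketch).
    """
    total = len(terms)
    if total == 0:
        return {}
    counts = Counter(terms)
    return {
        term: (count / total) * idf.get(term, default_idf)
        for term, count in counts.items()
    }


# Toy idf table with made-up values: rare words weigh more than common ones.
toy_idf = {"odhad": 5.0, "výběr": 4.0, "a": 0.1}
scores = tf_idf_scores(["odhad", "a", "výběr", "odhad"], toy_idf)
```

A term that is frequent in the document but rare in the reference corpus ends up with the highest score, which is why common function words such as "a" fall to the bottom of the ranking.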
| # | Parameter | Mandatory | Data type | Description |
|---|---|---|---|---|
| 1 | file | yes | file | Plain-text file, ALTO XML file or a zip archive containing multiple files. |
| 2 | language | no | string | Language of the text. Supported languages are cs and en. The default value is cs. |
| 3 | threshold | no | float | The minimum tf-idf score a term must reach to be considered a keyword. The default value is 0.2. |
| 4 | maximum-words | no | int | The maximum number of words that can be returned. The default value is 15. |
All text is expected to be encoded in UTF-8. Regardless of the threshold and maximum-words values, at least two keywords are always returned.
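The interaction between threshold, maximum-words, and the two-keyword minimum can be illustrated with a short Python sketch. It is one plausible reading of the rules above, not the service's code: the function `select_keywords`, the inclusive comparison with the threshold, and the toy scores are assumptions.

```python
def select_keywords(scores, threshold=0.2, maximum_words=15, minimum_words=2):
    """Pick keywords from a {term: score} mapping.

    Terms are ranked by score. Those at or above `threshold` are kept,
    capped at `maximum_words`, but at least `minimum_words` terms are
    returned whenever the text has that many (assumed behaviour).
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    above_threshold = sum(1 for _, score in ranked if score >= threshold)
    count = max(above_threshold, minimum_words)
    # The minimum wins even over a smaller maximum-words setting.
    count = min(count, max(maximum_words, minimum_words))
    return ranked[:count]


toy_scores = {"odhad": 0.6, "výběr": 0.1, "úhrn": 0.05}

# Only one term reaches the default threshold, yet two are returned.
fallback = select_keywords(toy_scores)

# A maximum of one word is overridden by the two-keyword minimum.
capped = select_keywords(toy_scores, threshold=0.0, maximum_words=1)
```

In both calls the result holds the two best-scoring terms, which matches the statement that the minimum applies regardless of the other two settings.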
curl --form 'file=@test.zip' 'http://lindat.mff.cuni.cz/services/ker?language=cs'
{
"keywords": ["odhad", "výběr", "úhrn", "rozvržení", "průměr"],
"keyword_scores": [0.6033144150404524, 0.4630133659942532, 0.35208990668596857, 0.2560803125496312, 0.22390298829472924],
"morphodita_calls": 27
}
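The same request can be made from Python. The following is a minimal sketch using only the standard library, assuming the endpoint and the `file` form field behave as in the curl call above; the helper names `build_multipart`, `extract_keywords`, and `ranked_keywords` are hypothetical, and error handling is left out.

```python
import json
import os
import urllib.parse
import urllib.request
import uuid

KER_URL = "http://lindat.mff.cuni.cz/services/ker"  # endpoint from the curl example


def build_multipart(field_name, filename, payload):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract_keywords(path, language="cs"):
    """POST the file to KER and return the decoded JSON response."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart("file", os.path.basename(path), handle.read())
    # The language goes into the query string, as in the curl example.
    url = KER_URL + "?" + urllib.parse.urlencode({"language": language})
    request = urllib.request.Request(url, data=body, headers={"Content-Type": content_type})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def ranked_keywords(result):
    """Pair each keyword with its score, in the order the service returned them."""
    return list(zip(result["keywords"], result["keyword_scores"]))
```

Calling `ranked_keywords(extract_keywords("test.zip"))` would then yield pairs such as `("odhad", 0.6033144150404524)` for the response shown above. The sketch relies on `keywords` and `keyword_scores` being parallel lists, as they appear to be in that sample.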