
Czech HS Contracts Dataset (CHSC) 1.0

Please use the following text to cite this item:
Szabó, Adam and Straka, Milan, 2021, Czech HS Contracts Dataset (CHSC) 1.0, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-3731.
Date issued
2021-07-22
Size
97000 texts
Language(s)
Czech
Description
The Czech Contracts dataset was created as part of the thesis Low-resource Text Classification (A. Szabó, 2021, MFF UK). The contracts were obtained from the Hlídač Státu web portal. Labels in the training and development sets were assigned automatically using the keyword method described in the thesis Automatická klasifikace smluv pro portál HlidacSmluv.cz (J. Maroušek, 2020, MFF UK). Because these automatic labels contain a certain amount of noise, the goal of classification is not to reach 100% on the development set. The test set is manually annotated. The dataset contains a total of 97493 contracts.
 Files in this item
Name
CHSC-1.0.tar.xz
Size
452.31 MB
Format
application/x-xz
Description
Czech HS Contracts Dataset (CHSC) 1.0
MD5
ace2df821cafeef61984ebfa47b05d99
Preview
  • CHSC-1.0
    • categories.json (4 kB)
    • dev10.jsonl (26 MB)
    • train.jsonl (2 GB)
    • test.jsonl (43 MB)
    • dev.jsonl (268 MB)
    • README.md (7 kB)
    • LICENSE.txt (20 kB)
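
The data files in the listing use the .jsonl extension, which conventionally means one JSON value per line. The sketch below shows one way to stream such a file without loading it into memory, which matters for train.jsonl at about 2 GB. It assumes the archive has already been unpacked into a CHSC-1.0 directory and that the files are UTF-8 encoded. The record schema is not described on this page, so the example only inspects the keys of the first few records; the helper names iter_jsonl and peek are illustrative and not part of the dataset.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    # Assumption: the files are UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def peek(path, limit=3):
    """Return up to `limit` records from the start of the file."""
    records = []
    for record in iter_jsonl(path):
        if len(records) >= limit:
            break
        records.append(record)
    return records


if __name__ == "__main__":
    # dev10.jsonl is the smallest data file in the listing (26 MB).
    sample = Path("CHSC-1.0") / "dev10.jsonl"
    if sample.exists():
        for record in peek(sample):
            # The schema is not documented here, so only report what is found.
            if isinstance(record, dict):
                print(sorted(record))
            else:
                print(type(record).__name__)
    else:
        print(f"{sample} not found; unpack CHSC-1.0.tar.xz first")
```

The README.md inside the archive is the authoritative place to look up the actual field names before building a loader on top of this.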
Name
CHSC-1.0.DESCRIPTION.pdf
Size
144.33 KB
Format
application/pdf
Description
Description of Czech HS Contracts Dataset (CHSC) 1.0
MD5
fdf5e64f529af54d9e4f55e36305efe4
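
The MD5 values listed for the two files can be used to check that a download completed without corruption. The following is a minimal sketch, assuming the files were saved under their listed names in the current directory; the function names md5_of_file and verify_download are illustrative only. MD5 is suitable here as an integrity check against transfer errors, not as a security guarantee.

```python
import hashlib
from pathlib import Path

# MD5 values as listed on this page for each file.
EXPECTED_MD5 = {
    "CHSC-1.0.tar.xz": "ace2df821cafeef61984ebfa47b05d99",
    "CHSC-1.0.DESCRIPTION.pdf": "fdf5e64f529af54d9e4f55e36305efe4",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the value listed for its name."""
    path = Path(path)
    return md5_of_file(path) == EXPECTED_MD5[path.name]


if __name__ == "__main__":
    for name in EXPECTED_MD5:
        if Path(name).exists():
            print(name, "OK" if verify_download(name) else "MISMATCH")
        else:
            print(name, "not found in the current directory")
```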