Czech HS Contracts Dataset (CHSC) 1.0
Please use the following text to cite this item:
Szabó, Adam and Straka, Milan, 2021,
Czech HS Contracts Dataset (CHSC) 1.0, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-3731.
Authors
Szabó, Adam; Straka, Milan
Item identifier
http://hdl.handle.net/11234/1-3731
Date issued
2021-07-22
Size
97000 texts
Language(s)
Czech
Description
The Czech HS Contracts Dataset was created as part of the thesis Low-resource Text Classification (A. Szabó, 2021, MFF UK).
The contracts were obtained from the Hlídač Státu web portal. Labels in the training and development sets were assigned automatically, using the keyword method described in the thesis Automatická klasifikace smluv pro portál HlidacSmluv.cz (Automatic Classification of Contracts for the HlidacSmluv.cz Portal; J. Maroušek, 2020, MFF UK). Because these automatic labels contain a certain amount of noise, reaching 100% on the development set is not the goal of the classification. The test set is manually annotated. The dataset contains 97,493 contracts in total.
This item is publicly available. The license text is included in the archive as LICENSE.txt.
Files in this item
- Name: CHSC-1.0.tar.xz
- Size: 452.31 MB
- Format: application/x-xz
- Description: Czech HS Contracts Dataset (CHSC) 1.0
- MD5: ace2df821cafeef61984ebfa47b05d99

- CHSC-1.0
  - categories.json (4 kB)
  - dev10.jsonl (26 MB)
  - train.jsonl (2 GB)
  - test.jsonl (43 MB)
  - dev.jsonl (268 MB)
  - README.md (7 kB)
  - LICENSE.txt (20 kB)
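
The data files in the archive are in JSON Lines format. As a rough sketch of how one of them could be inspected without loading it fully into memory, the snippet below counts records and tallies their top-level keys. It assumes UTF-8 text with one JSON value per line, which is the usual JSON Lines convention; the field names are not documented on this page, so the code does not rely on any of them (README.md in the archive should describe the actual schema). The helper names `iter_jsonl` and `summarize` are illustrative only.

```python
import json
from collections import Counter


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a .jsonl file."""
    # Assumption: the file is UTF-8 encoded, one JSON value per line.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path, limit=None):
    """Count records and top-level keys without assuming a schema.

    If `limit` is given, stop after that many records.
    """
    keys = Counter()
    count = 0
    for record in iter_jsonl(path):
        if limit is not None and count >= limit:
            break
        count += 1
        if isinstance(record, dict):
            keys.update(record.keys())
    return count, keys
```

For example, `summarize("CHSC-1.0/dev10.jsonl", limit=1000)` would look at the first 1000 records of the smallest split, assuming the archive was extracted into the current directory. Streaming line by line matters here mainly for train.jsonl, which is listed at about 2 GB.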
- Name: CHSC-1.0.DESCRIPTION.pdf
- Size: 144.33 KB
- Format: application/pdf
- Description: Description of Czech HS Contracts Dataset (CHSC) 1.0
- MD5: fdf5e64f529af54d9e4f55e36305efe4
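
The MD5 values listed for the two files can be used to check that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module; the checksums are copied from the listing above, while the helper names `file_md5` and `verify_download` and the chunk size are illustrative choices, not part of the dataset distribution.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing on this page.
EXPECTED_MD5 = {
    "CHSC-1.0.tar.xz": "ace2df821cafeef61984ebfa47b05d99",
    "CHSC-1.0.DESCRIPTION.pdf": "fdf5e64f529af54d9e4f55e36305efe4",
}


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the value listed for its name.

    Hypothetical helper: looks the expected digest up by file name.
    """
    path = Path(path)
    return file_md5(path) == expected[path.name]
```

Calling `verify_download("CHSC-1.0.tar.xz")` on the downloaded archive should return `True` if the file matches the listed checksum. MD5 is adequate for detecting accidental corruption, but it is not a safeguard against deliberate tampering.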


