
Paraphrase Identification (PI) datasets in Czech

Please use the following text to cite this item:
Javorský, Dávid and Popel, Martin, 2023, Paraphrase Identification (PI) datasets in Czech, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-5228.
Date issued
2023-10-01
Size
484072 sentences
Language(s)
Czech, English
Description
The goal of the Paraphrase Identification (PI) task is to determine whether two sentences have the same meaning. The repository contains two PI datasets, namely paws (https://huggingface.co/datasets/paws) and quora (https://huggingface.co/datasets/quora). Each dataset is provided in two versions: the original English version and a Czech translation that we produced with CUBBITT, the Charles University Block-Backtranslation-Improved Transformer Translation model (https://lindat.mff.cuni.cz/services/translation/). The record also includes target labels for the Czech datasets; note, however, that these labels may no longer be correct for the Czech translation because of errors made by the translation model. The licence of this record (CC BY-SA) applies to the translated part of the dataset. For the original English datasets, follow their respective licence descriptions.
This item is Publicly Available
and licensed under:
 Files in this item
Name
quora.zip
Size
38.09 MB
Format
application/zip
Description
quora
MD5
4057b86e5365c334bc6f797e79cd2953
Preview
  File Preview
  • source
    • train.label (789 kB)
    • train.sentence_2 (23 MB)
    • train.sentence_1 (23 MB)
  • translation
    • train.label (789 kB)
    • train.sentence_2 (25 MB)
    • train.sentence_1 (24 MB)
    • README.txt (609 B)
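The file names in the preview suggest that each split is stored as three parallel files (`sentence_1`, `sentence_2`, `label`). The following is a minimal Python sketch for reading such a split, assuming the files are UTF-8 encoded and line-aligned (line i of each file belongs to the same example); this layout is an assumption and should be checked against README.txt. The names `PairExample` and `read_pairs` are hypothetical helpers, not part of the dataset.

```python
from pathlib import Path
from typing import List, NamedTuple


class PairExample(NamedTuple):
    sentence_1: str
    sentence_2: str
    label: str  # kept as a raw string; the label encoding is not documented here


def read_pairs(directory, prefix: str = "train") -> List[PairExample]:
    """Read <prefix>.sentence_1, <prefix>.sentence_2 and <prefix>.label.

    Assumption: one example per line, with the three files line-aligned.
    """
    columns = []
    for suffix in ("sentence_1", "sentence_2", "label"):
        path = Path(directory) / f"{prefix}.{suffix}"
        # newline="\n" keeps a stray "\r" inside a sentence from splitting a line.
        with path.open(encoding="utf-8", newline="\n") as fh:
            columns.append([line.rstrip("\r\n") for line in fh])

    lengths = {len(column) for column in columns}
    if len(lengths) != 1:
        raise ValueError(f"Files for prefix {prefix!r} are not line-aligned")

    return [PairExample(*row) for row in zip(*columns)]
```

For example, `read_pairs("translation")` would read the Czech quora training split from an extracted `translation` folder (the path is hypothetical and depends on where the archive is unpacked). For paws, the same helper should work with a prefix such as `"train.labeled_final"`.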
Name
paws.zip
Size
21.22 MB
Format
application/zip
Description
paws
MD5
1fb26434062cc14b6c2b0b1d1e34b4be
Preview
  File Preview
  • source
    • train.labeled_swap.sentence_2 (3 MB)
    • train.labeled_swap.sentence_1 (3 MB)
    • valid.labeled_final.sentence_2 (893 kB)
    • valid.labeled_final.sentence_1 (895 kB)
    • valid.labeled_swap.sentence_2 (893 kB)
    • valid.labeled_swap.sentence_1 (895 kB)
    • train.labeled_final.label (96 kB)
    • test.labeled_swap.sentence_2 (900 kB)
    • test.labeled_swap.sentence_1 (899 kB)
    • test.labeled_final.sentence_2 (900 kB)
    • test.labeled_final.sentence_1 (899 kB)
    • train.labeled_swap.label (59 kB)
    • valid.labeled_swap.label (15 kB)
    • train.labeled_final.sentence_2 (5 MB)
    • train.labeled_final.sentence_1 (5 MB)
    • test.labeled_final.label (15 kB)
    • valid.labeled_final.label (15 kB)
    • test.labeled_swap.label (15 kB)
  • translation
    • train.labeled_swap.sentence_2 (3 MB)
    • train.labeled_swap.sentence_1 (3 MB)
    • valid.labeled_final.sentence_2 (883 kB)
    • valid.labeled_final.sentence_1 (884 kB)
    • valid.labeled_swap.sentence_2 (883 kB)
    • valid.labeled_swap.sentence_1 (884 kB)
    • train.labeled_final.label (96 kB)
    • test.labeled_swap.sentence_2 (891 kB)
    • test.labeled_swap.sentence_1 (890 kB)
    • test.labeled_final.sentence_2 (891 kB)
    • test.labeled_final.sentence_1 (890 kB)
    • train.labeled_swap.label (59 kB)
    • valid.labeled_swap.label (15 kB)
    • train.labeled_final.sentence_2 (5 MB)
    • train.labeled_final.sentence_1 (5 MB)
    • test.labeled_final.label (15 kB)
    • valid.labeled_final.label (15 kB)
    • test.labeled_swap.label (15 kB)
    • README.txt (608 B)
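
The MD5 values listed for quora.zip and paws.zip can be used to check that a download is intact. The following is a minimal Python sketch using the standard `hashlib` module; the helper names `md5_of_file` and `verify_download` are hypothetical, and the sketch assumes the archives were saved under their original file names.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from this record's file listing.
EXPECTED_MD5 = {
    "quora.zip": "4057b86e5365c334bc6f797e79cd2953",
    "paws.zip": "1fb26434062cc14b6c2b0b1d1e34b4be",
}


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path) -> bool:
    """Compare a downloaded archive against the checksum listed in the record.

    Raises KeyError if the file name is not one of the two listed archives.
    """
    expected = EXPECTED_MD5[Path(path).name]
    return md5_of_file(path) == expected
```

Calling `verify_download("quora.zip")` returns `True` only if the local file matches the listed checksum; a `False` result usually indicates an incomplete or corrupted download.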