COSTRA 1.0: A Dataset of Complex Sentence Transformations

LINDAT / CLARIAH-CZ

Authors: Barančíková, Petra and Bojar, Ondřej

Item identifier: http://hdl.handle.net/11234/1-3123

Date issued: 2019-12-03

Type: corpus, text

Size: 5544 sentences

Language(s): Czech

Description: COSTRA 1.0 is a dataset of complex sentence transformations in Czech. It is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing. The dataset consists of 4,262 unique sentences with an average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation. We hope that this dataset will make it possible to test semantic properties of sentence embeddings and perhaps even to find a topologically interesting “skeleton” in the sentence embedding space.

Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)

Subject(s): sentences, sentence embeddings, paraphrases, semantic relations

Collection(s): LINDAT / CLARIAH-CZ Data & Tools

Files in this item

This item is Publicly Available and licensed under:
Creative Commons - Attribution 4.0 International (CC BY 4.0)

Name: costra_1.0.zip
Size: 116.75 KB
Format: application/zip
Description: data
MD5: bc43603c67b940f62299f441be420eba
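
The MD5 value listed above can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch, not part of the item record: it assumes the file was saved as `costra_1.0.zip` in the current directory, and the helper names `md5_of_file` and `matches_record` are hypothetical.

```python
import hashlib
from pathlib import Path

# MD5 value copied from the item record above.
EXPECTED_MD5 = "bc43603c67b940f62299f441be420eba"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("costra_1.0.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if matches_record(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

A mismatch usually indicates an incomplete or corrupted download; MD5 is suitable for this kind of integrity check but not as a guard against deliberate tampering.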

File Preview

costra_1.0.
- round_2
  - annotations.tsv (407 kB)
  - source_sentences.tsv (12 kB)
- round_1
  - annotations.tsv (116 kB)
  - source_sentences.tsv (1 kB)
- README (3 kB)
- .README.swp (12 kB)