CantusCorpus v1.0
Please use the following text to cite this item:
Dvořáková, Anna; Lacoste, Debra and Hajič jr., Jan, 2025,
CantusCorpus v1.0, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-6041.
Authors
Anna Dvořáková; Debra Lacoste; Jan Hajič jr.
Item identifier
http://hdl.handle.net/11234/1-6041
Project URL
Date issued
2025-11-19
Size
888010 entries
Language(s)
Description
CantusCorpus 1.0 is a large dataset of Gregorian chant intended for computational research.
The dataset consists of all chants that are accessible through the Cantus Index federated search interface, combining data from 10 individual chant databases. Primarily these are catalogue records: which chants appear in which manuscripts. What allows us to identify multiple instances of a chant across different manuscripts is the Cantus ID mechanism, established from the long history of the Cantus Database.
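To illustrate how a shared Cantus ID ties together instances of one chant across manuscripts, here is a minimal sketch. The field names `cantus_id` and `source`, the helper name, and the toy records are assumptions made for illustration only; the actual column names should be checked against the header of chants.csv.

```python
from collections import defaultdict


def sources_by_cantus_id(records, id_field="cantus_id", source_field="source"):
    """Map each Cantus ID to the set of sources in which it is catalogued.

    `records` is any iterable of dict-like rows. The default field names
    are hypothetical and may differ from the real chants.csv columns.
    """
    index = defaultdict(set)
    for record in records:
        cid = record.get(id_field)
        src = record.get(source_field)
        # Skip rows where either value is missing or empty.
        if cid and src:
            index[cid].add(src)
    return dict(index)


# Made-up toy records; these are not real Cantus IDs or sources.
toy_records = [
    {"cantus_id": "id-A", "source": "MS-1"},
    {"cantus_id": "id-A", "source": "MS-2"},
    {"cantus_id": "id-B", "source": "MS-1"},
    {"cantus_id": "", "source": "MS-3"},
]

index = sources_by_cantus_id(toy_records)
for cid in sorted(index):
    print(cid, sorted(index[cid]))
```

The same function can be fed rows from `csv.DictReader` once the real column names are known.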
Thus, CantusCorpus 1.0 has two components: chant records (chants.csv) and source records (sources.csv), the sources being overwhelmingly manuscripts. CantusCorpus lies inherently downstream of the Cantus Database and the whole Cantus Index network of compatible chant databases: we do not revisit anyone's editorial decisions. The value of this dataset, however, is that the sum of all the editorial decisions made over the databases' decades of existence is now available as a dataset for computational research.
The PyCantus library (https://github.com/dact-chant/PyCantus) then makes handling this dataset (almost) easy.
The accompanying source code (CantusCorpus-1.0.zip) contains a subdirectory with code and documentation for this particular version of CantusCorpus (v1.0). We expect to re-collect the dataset annually, as the Cantus network grows by tens of thousands of chant records each year.
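Before committing to any particular tooling, the two CSV components can be inspected with the Python standard library alone. The sketch below reads the header and counts data rows in a streaming fashion; it assumes a UTF-8, comma-separated file with one header row, which should be verified against the README shipped with the dataset. The helper name `summarize_csv` is ours.

```python
import csv
import io


def summarize_csv(stream):
    """Return (column_names, data_row_count) for an open CSV text stream.

    Rows are consumed one at a time, so a large file such as chants.csv
    does not have to fit in memory.
    """
    reader = csv.reader(stream)
    header = next(reader, None)  # None if the stream is empty
    row_count = sum(1 for _ in reader)
    return (header or [], row_count)


# Small in-memory demonstration with made-up columns.
demo = io.StringIO("col_a,col_b\n1,2\n3,4\n")
columns, n_rows = summarize_csv(demo)
print(columns, n_rows)
```

For the real files, open them with `open("chants.csv", newline="", encoding="utf-8")` and pass the resulting handle to `summarize_csv`.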
Acknowledgement
Social Sciences and Humanities Research Council of Canada
Project code: 895-2023-1002
Project name: Digital Analysis of Chant Transmission
Subject(s)
Collections
This item is Publicly Available
and licensed under:
Files in this item
- Name: CantusCorpus-1.0.zip
- Size: 1.36 MB
- Format: application/zip
- MD5: f6628216d3599ede54ea20755ed4942e

- CantusCorpus-1.0/
- .gitignore (4 kB)
- README.md (713 B)
- cantuscorpus_1.0/
- get_dataset_from_scrapes.ipynb (46 kB)
- volpiano_utils.py (15 kB)
- README.md (7 kB)
- dataset_stats.ipynb (414 kB)
- scraping/
- get_chants_collected_by_genre.sh (390 B)
- README.md (5 kB)
- prepare_slurm_scripts.sh (3 kB)
- cantus_json_to_csv.py (16 kB)
- static/
- feast.csv (47 kB)
- genre.csv (5 kB)
- scrape_cid_values.py (7 kB)
- get_scripts_prep_by_genre.sh (629 B)
- db_scrapers.py (42 kB)
- get_cids_lists.sh (685 B)
- scrape_ci_feasts_list.py (4 kB)
- scrape_cdb_feasts_list.py (1 kB)
- collect_slurm_results.sh (2 kB)
- scrape_sources_csv.py (5 kB)
- scrape_ci_jsons.sh (1 kB)
- run_slurm_scripts.sh (1 kB)
- scrape_ci_genre_list.py (5 kB)
- img/
- mode_distr.png (27 kB)
- cids_by_db.png (32 kB)
- chants_by_db.png (38 kB)
- office_distr.png (25 kB)
- most_com_modes.png (27 kB)
- century_distr_two.png (27 kB)
- sources_by_db.png (28 kB)
- source_distr_two.png (29 kB)
- source_metadata_support.png (120 kB)
- unique_cids_by_db.png (38 kB)
- genre_distr.png (43 kB)
- manuscript_size_distr.png (27 kB)
- big_sources_by_db.png (39 kB)
- century_distr.png (27 kB)
- dataset_stats.pdf (555 kB)
- Report_on_harmonization_issues.pdf (142 kB)
- get_dataset_from_scrapes.pdf (401 kB)
- LICENSE (20 kB)
- Name: chants.csv
- Size: 235.13 MB
- Format: text/csv
- MD5: f76e07ec358779ed33866c09b6905081

- Name: sources.csv
- Size: 299.95 KB
- Format: text/csv
- MD5: 2d3e0d2cd1d1f4443a345d72abcc45a5
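The MD5 checksums listed for the three files can be used to confirm that a download is intact. The following is a minimal sketch using Python's `hashlib`; the helper names and the assumption that the files are stored locally under their listed names are ours.

```python
import hashlib

# Checksums as published in the file listing for this item.
PUBLISHED_MD5 = {
    "CantusCorpus-1.0.zip": "f6628216d3599ede54ea20755ed4942e",
    "chants.csv": "f76e07ec358779ed33866c09b6905081",
    "sources.csv": "2d3e0d2cd1d1f4443a345d72abcc45a5",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, name):
    """Return True if the file at `path` has the checksum published for `name`."""
    return md5_of_file(path) == PUBLISHED_MD5[name]
```

For example, `matches_published("chants.csv", "chants.csv")` should return `True` for an uncorrupted download placed in the current directory.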


