CooccurrenceFieldSampler (CFS)

Please use the following text to cite this item or export to a predefined format:
Jan Oliver Rüdiger, 2026, CooccurrenceFieldSampler (CFS), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11372/LRT-6050.
Date issued
2026-01-01
Description
The CooccurrenceFieldSampler (CFS) was developed for sampling from corpora to facilitate lexicographical data analysis. It works with corpora from different sources, text types or years.

In random sentence sampling (random/opportunistic sampling), corpora containing different text types and lengths (per source) cannot always be mixed optimally, because they usually differ in size and in topic weighting, for example. The CFS was designed to solve this problem.

The CFS first calculates all co-occurrences for all tokens within sentences, separately for each source. These corpora are then combined in a 1:1 mixture and the co-occurrences for the entire data set are recalculated. The tool evaluates which co-occurrences disappear and which new ones are created, resulting in quotas that control the random mixing of the corpora sentence by sentence. The end result is a sentence-based corpus that (A) strives to retain the maximum number of co-occurrences from all sources (as accurately as possible) and (B) minimises the rejection of corpus data.

To use the CFS tool, follow these steps:
1. Unzip the ZIP file containing the necessary files.
2. For Windows, Linux, and macOS, you will find precompiled binaries that run exclusively on x64 processors.
3. If you are using a different processor type, such as ARM or ARM64, please use the Universal folder.
4. Windows users should run "cfs.exe" directly.
5. For Linux and macOS users, you must mark the cfs file as executable.
6. If using the Universal version, ensure .NET 10.0 is installed for compiling. You can then run the program with "dotnet cfs.dll".
7. To display help information, use the --help parameter.
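The comparison of per-source and mixed co-occurrences described above can be illustrated with a small Python sketch. It is not the CFS implementation: the function names are hypothetical, the 1:1 mixture is assumed to mean "the same number of sentences from every source" (taken here as the first n sentences rather than a random draw), and only a minimum absolute frequency is applied, without the Poisson significance test or the quota calculation.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(sentences, min_frequency=1):
    """Count unordered token pairs that occur within the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # Each distinct pair is counted at most once per sentence.
        for pair in combinations(sorted(set(sentence)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_frequency}


def compare_mix(sources, min_frequency=2):
    """Return (lost, created) pairs after an equal-size mixture.

    Assumption: a 1:1 mixture takes the same number of sentences
    from each source (the first n, to keep the sketch deterministic).
    """
    per_source = set()
    for sentences in sources.values():
        per_source |= set(cooccurrences(sentences, min_frequency))

    n = min(len(sentences) for sentences in sources.values())
    pooled = [s for sentences in sources.values() for s in sentences[:n]]
    mixed = set(cooccurrences(pooled, min_frequency))

    return per_source - mixed, mixed - per_source


# Toy data, invented for illustration only.
sources = {
    "news": [["the", "cat", "sat"], ["the", "cat", "ran"],
             ["a", "dog", "ran"], ["a", "dog", "sat"]],
    "forum": [["a", "dog", "barked"], ["the", "bird", "sat"]],
}
lost, created = compare_mix(sources, min_frequency=2)
print("lost:", lost)
print("created:", created)
```

In this toy data one pair falls below the frequency threshold once the larger source is cut down, and one pair only reaches the threshold after pooling. These are the two effects the tool is described as evaluating when it derives its quotas.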
Help/Parameter:
--from (Default: cec / recommended: cec) import file format (valid: cec, bnc, catma, clan, conll, cora, cwd, dewac, dta, folia, fln, korap, leipzig, xces, relannis, salt, json, sketch, speedy, tiger, tlv, treetagger, tsv, txm, weblicht)
--input (Default: input/) folder with input files (mix per file)
--to (Default: cec / recommended: cec) export file format (valid: cec, catma, conll, cwd, csv, dta, folia, i5, korap, xces, plain, salt, json, sketch, speedy, tlv, tsv, treetagger, txm, weblicht)
--layer (Default: Wort) use this layer to calculate the co-occurrences
--output (Default: output.cec6) output file (every round and logfile)
--minFrequency (Default: 1 / recommended: 5) min. absolute frequency
--minSignificance (Default: 1.0 / recommended: 1.0) min. significance (Poisson distribution)
--minChangeRate (Default: 0.1 / recommended: 0.1) min. change rate
--maxRounds (Default: 10 / recommended: 5) max. number of rounds
--help Display this help screen.
--version Display version information.
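As a rough illustration of how the parameters above fit together, the following Python sketch assembles a command line using the recommended values. The helper name build_cfs_command is hypothetical, and passing each value as a separate argument after its flag (for example "--input input/") is an assumption about the argument syntax; check the --help output before relying on it.

```python
import platform


def build_cfs_command(input_dir="input/", output="output.cec6",
                      min_frequency=5, max_rounds=5, universal=False):
    """Assemble a CFS call as an argument list (hypothetical helper)."""
    if universal:
        # Universal build: started through the .NET runtime.
        launcher = ["dotnet", "cfs.dll"]
    elif platform.system() == "Windows":
        launcher = ["cfs.exe"]
    else:
        # Linux/macOS: the cfs file must be marked executable first.
        launcher = ["./cfs"]
    # Assumption: each flag is followed by its value as a separate argument.
    return launcher + [
        "--from", "cec",
        "--input", input_dir,
        "--to", "cec",
        "--output", output,
        "--minFrequency", str(min_frequency),
        "--maxRounds", str(max_rounds),
    ]


print(" ".join(build_cfs_command(universal=True)))
```

The resulting list can be handed to subprocess.run once the binaries have been unpacked into the working directory.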
Supported corpus formats (input/output):
cec - CorpusExplorer Corpus (v6) - http://corpusexplorer.de
bnc - British National Corpus - http://www.natcorp.ox.ac.uk/
catma - CATMA (Computer assisted text markup and analysis) - https://catma.de/
clan - CLAN/CHILDES - https://talkbank.org/childes/
conll - CoNLL-U - https://universaldependencies.org/format.html
cora - CORA XML - https://cora.readthedocs.io/en/latest/coraxml/
cwd - IMS Open Corpus Workbench (CWB) - https://cwb.sourceforge.io/
dewac - https://wacky.sslmit.unibo.it/doku.php?id=corpora
dta - DTA TCF-XML - https://www.deutschestextarchiv.de/download
folia - FoLiA XML - https://proycon.github.io/folia/
fln - Folker/OrthoNormal - https://exmaralda.org/de/folker-de/
korap - KorAP - http://korap.ids-mannheim.de/
leipzig - Wortschatz Leipzig - https://wortschatz.uni-leipzig.de/en/download/
xces - XCes XML - http://www.xces.org/ / https://www.cs.vassar.edu/CES/
relannis - https://corpus-tools.org/annis/
salt - https://corpus-tools.org/archive-2015-2025/salt/
json - https://de.wikipedia.org/wiki/JSON
sketch - SketchEngine VERT - https://www.sketchengine.eu/glossary/vertical-file/
speedy - SPEEDy Annotation Editor - http://kups.ub.uni-koeln.de/id/eprint/55224
tiger - TiGER-XML - https://www.ims.uni-stuttgart.de/documents/ressourcen/werkzeuge/tigersearch/doc/html/TigerXML.html
tlv - TLV-XML
treetagger - TreeTagger - https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
tsv - Tab-separated values - https://en.wikipedia.org/wiki/Tab-separated_values
txm - TXM - https://txm.gitpages.huma-num.fr/textometrie/?lang=en
weblicht - Weblicht - https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/Main_Page.html
csv - Comma-separated values - https://en.wikipedia.org/wiki/Comma-separated_values
i5 - i5-XML - https://www.ids-mannheim.de/en/digspra/pb-s1/projects/corpus-development/ids-text-model/
plain - Plaintext - https://en.wikipedia.org/wiki/Plain_text
This item is Publicly Available
and licensed under:
 Files in this item
Name
CFS.zip
Size
95.52 MB
Format
application/zip
Description
Zip
MD5
9fbb7469dfdd8da63833dbe618ad446c