Explorer Enrique Stanko Vráz with his colleague Josef Kořenský and botanist Karel Domin in the Botanical Garden in Prague-Na Slupi in the documentary Několik čelných cestovatelů českých (Leading Czech Explorers, Masaryk's People's Institute, 1928).
In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish present important shortcomings: they are either too small in comparison with other languages or of low quality owing to sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license.
Segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1943, issue no. 11B, reports on a workers' holiday organized by the Reinhard Heydrich Foundation for Workers' Recuperation at the Gymnasion Health Resort in Jarov u Dolních Břežan. Workers do warm-up exercises and practise the shot put. Everyone gets an apple as a snack. Minister of Agriculture and Forestry Adolf Hrubý comes for a visit.
Dramaturgist Ferdinand Pujman at the funeral of writer Marie Pujmanová in Vyšehrad Cemetery in Prague in May 1958 in a fragmented segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1958, issue no. 22. Pujman with his son, translator Petr Pujman. Pujmanová celebrating her 60th birthday in a fragmented segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1953, issue no. 25. Poet Vítězslav Nezval is seen on her left behind the platform.
František Fiala, better known by his stage name Ferenc Futurista, in Učitel orientálních jazyků (The Oriental Language Teacher, dir. Olga Rautenkrauzová and Jan S. Kolár, 1918). Studies of facial expressions by actor Ferenc Futurista captured by Bohumil Veselý. Futurista with actress Betty Kyslíková during the shooting of Proč se nesměješ (Why Aren't You Laughing?, dir. Eman Fiala, 1922). Ferenc Futurista with his wife (actress Anna Filípková Ferencová), his brother Eman Fiala, his wife and their daughter Milena, and R. A. Dvorský on Bohumil Veselý's balcony.
Segment from Český zvukový týdeník Aktualita (Czech Aktualita Sound Newsreel) issue no. 16B from 1945 captures an event organised by the Board of Trustees for the Education of Youth aimed against the mite infestation of bees. The Veterinary Laboratory of the City of Prague, where beekeepers had sent thirty bees from each beehive, sought different ways to stop the mite epidemic. Trained female instructors from the Board of Trustees for the Education of Youth helped with the research.
Filip Hauptmann, physical education promoter and an official of the Czechoslovak Sokol society, on a bench in the garden of his house. Hauptmann portrayed with four unidentified people: one woman and three men. One of the men gives Hauptmann Jiří Ota Parma's book Mořeplavci (Seafarers).