Module `tf.about.corpora`

Corpora

TF corpora are usually stored on GitHub / GitLab, and TF can download a corpus from there if you specify its org/repo.

Most corpora are configured by metadata in a directory app in the repo.

You can load a corpus into a Python data structure like this:

from tf.app import use
A = use("org/repo")

And you can get it in the TF browser by saying this on a command prompt:

tf org/repo

Here is a list of corpora that can be loaded this way. Since everybody can put a TF corpus on GitHub / GitLab, the list may not be complete!

annotation/banks: modern English Iain M. Banks, 1954 - 2013, 99 words from the SF novel Consider Phlebas, Dirk Roorda
annotation/mobydick: English Herman Melville, 1819 - 1891, Novel, 1851; converted from TEI in the Oxford Text Archive, with NLP output from Spacy woven in; Dirk Roorda
annotation/mondriaan: English Piet Mondriaan, 1872 - 1944, Test corpus of 14 letters (proeftuin), with NLP output from Spacy woven in; converted from TEI from the Huygens Institute, together with RKD and HuC; many people involved

Cambridge Semitics Lab

CambridgeSemiticsLab/nena_tf: Aramaic North Eastern Neo-Aramaic Corpus, 2000, Nena Cambridge, with a client-side, offline search interface in JavaScript; Cody Kingham

CenterBLC Andrews University

CenterBLC/LXX: Greek Septuagint, 300 - 100 BCE, LXX Rahlfs' edition 1935 plus additional features by CenterBLC; earliest extant Greek translation of Hebrew Bible books; Oliver Glanz, Adrian Negrea
CenterBLC/N1904: Greek New Testament, 100 - 400, GNT Nestle-Aland edition 1904 with new features by CenterBLC, using data from Clear Bible - Macula Greek; Tony Jurg, Saulo Oliveira de Cantanhêde, Oliver Glanz, Willem van Peursen
CenterBLC/SBLGNT: Greek New Testament, 100 - 400, converted from James Tauber's morphgnt / sblgnt with additional features by CenterBLC; Adrian Negrea, Clacir Virmes, Oliver Glanz, Krysten Thomas

CLARIAH

CLARIAH/descartes-tf: French, Latin, Dutch Letters from and to Descartes, 1619 - 1650, René Descartes - Correspondance, with math display and illustrations; Ch. Adam et G. Milhaud (eds. and illustrations, 1896-1911); Katsuzo Murakami, Meguru Sasaki, Takehumi Tokoro (ASCII digitization, 1998); Erik-Jan Bos (ed, 2011); Dirk Roorda (converter TEI, 2011 and TF 2023)
CLARIAH/wp6-ferdinandhuyck: Dutch a novel by Jacob van Lennep, 1840, Jacob van Lennep - Ferdinand Huyck, with NLP output from Spacy woven in; From DBNL, TEI-Lite; Dirk Roorda (converter TEI to TF), see also tff.convert.tei
CLARIAH/wp6-missieven: Dutch General Missives, 1600 - 1800, General Missives, Dutch East-Indian Company, Jesse van der Does, Sophie Arnoult, Dirk Roorda
CLARIAH/wp6-daghregisters: Dutch Dagh Registers Batavia, 1640 - 1641, Daily events at Batavia, Indonesia, historical source for the operation of the Dutch East-Indian Company, work in progress, currently only volume 4, with many OCR errors and an attempt to detect them, Lodewijk Petram, Dirk Roorda.

Cody Kingham

codykingham/tischendorf_tf: Greek New Testament, 50 - 450, Tischendorf 8th Edition, Cody Kingham, Dirk Roorda

Digital Theologians of the University of Copenhagen

DT-UCPH/sp: Hebrew Samaritan Pentateuch, 516 BCE - 70 AD, MS Dublin Chester Beatty Library 751 + MS Garizim 1, Martijn Naaijer, Christian Canu Højgaard
DT-UCPH/cuc: Hebrew Copenhagen Ugaritic Corpus, 1223 BCE - 1172 AD, selected clay tablets, Martijn Naaijer, Christian Canu Højgaard

Eep Talstra Center for Bible and Computer

ETCBC/bhsa

Hebrew Bible (Old Testament), 1000 BCE - 900 AD, Biblia Hebraica Stuttgartensia (Amstelodamensis), the canonical TF dataset, where it all started; ETCBC + Dirk Roorda


I am horrified by the genocidal violence that Israel is committing against the Palestine people

(2025-05-19) israel-palestine

ETCBC/dhammapada

Pāli and Latin Ancient Buddhist verses, 300 BCE and 1855 AD, Transcription with Latin translations based on Viggo Fausbøll's book, Bee Scherer, Yvonne Mataar, Dirk Roorda

ETCBC/dss

Hebrew Dead Sea Scrolls, 300 BCE - 100 AD, Transcriptions with morphology based on Martin Abegg's data files, Martijn Naaijer, Jarod Jacobs, Dirk Roorda

ETCBC/nestle1904

Greek New Testament, 100 - 400, GNT Nestle-Aland edition 1904 from LOWFAT-XML syntax trees, converted from biblicalhumanities/greek-new-testament contributed by Jonathan Robie and Micheal Palmer; Oliver Glanz, Tony Jurg, Saulo de Oliveira Cantanhêde, Dirk Roorda

ETCBC/peshitta

Syriac Peshitta (Old Testament), 1000 BCE - 900 AD, Vetus Testamentum Syriace, Hannes Vlaardingerbroek, Dirk Roorda

ETCBC/syrnt

Syriac New Testament, 0 - 1000, Novum Testamentum Syriace, Hannes Vlaardingerbroek, Dirk Roorda

KNAW/HuygensING and gitlab.huc.knaw.nl

hermans/works: Dutch Complete Works of W.F. Hermans. The conversion to TF is work in progress. So far these works have been done: Paranoia, Sadistisch Universum, Nooit meer slapen. Not publicly accessible, the book is under copyright. With a critical apparatus; Bram Oostveen, Peter Kegel, Dirk Roorda
mondriaan/letters: Dutch Letters of Piet Mondriaan, 1892-1923, Work in progress, test set only ("proeftuin"), with NLP output from Spacy woven in. Straight conversion from TEI to TF, Peter Boot et al., Dirk Roorda
HuygensING/suriano: Italian Correspondence of Christofforo Suriano, 1616-1623, with additional metadata, named entities, and page scans. Complex conversion from DOCX through simple TEI to TF. From the TF, a stream of annotations is generated (WATM) that drives the publishing machinery of HuC Team Text, leading to the website edition.suriano.huygens.knaw.nl. Nina Lamal, Helmer Helmers, Sebastiaan van Dalen, Bram Buitendijk, Hayco de Jong, Hennie Brugman, Dirk Roorda
HuygensING/translatin-manif: Latin The transnational impact of Latin drama from the early modern Netherlands, a qualitative and computational analysis, Work in progress. Conversion from PageXML to TF, Jirsi Reinders, Hayco de Jong, et al., Dirk Roorda

NINO Cuneiform

Nino-cunei/ninmed: Akkadian / cuneiform Medical Encyclopedia from Nineveh, ca. 800 BCE, Medical documents with lemma annotations, Cale Johnson, Dirk Roorda
Nino-cunei/oldassyrian: Akkadian / cuneiform Old Assyrian documents, 2000 - 1600 BCE, Documents from Ashur, Cale Johnson, Alba de Ridder, Martijn Kokken, Dirk Roorda
Nino-cunei/oldbabylonian: Akkadian / cuneiform Old Babylonian letters, 1900 - 1600 BCE, Altbabylonische Briefe in Umschrift und Übersetzung, Cale Johnson, Dirk Roorda
Nino-cunei/uruk: proto-cuneiform Uruk, 4000 - 3100 BCE, Archaic tablets from Uruk with lots of illustrations; Cale Johnson, Dirk Roorda

Protestant Theological University

Greek Literature: Greek Literature, -400 - +400, Perseus Digital Library and Open Greek and Latin Project. The result of a massive conversion effort by Ernst Boogert.
pthu/athenaeus: Greek Works of Athenaeus, 80 - 170, Deipnosophistae, Ernst Boogert

Quran

q-ran/quran: Arabic Quran, 600 - 900, Quranic Arabic Corpus, Cornelis van Lit, Dirk Roorda

University of Utrecht: Cornelis van Lit

among/fusus: Arabic Fusus Al Hikam, 1165 - 2000, editions (Lakhnawi and Afifi) of Ibn Arabi's Fusus plus commentaries in the centuries thereafter, Cornelis van Lit, Dirk Roorda

Get corpora

Automatically

TF downloads corpus data and apps from GitHub / GitLab on demand.

See tf.about.use.

Data ends up in a logical place under your ~/text-fabric-data/.
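To see what has already been downloaded, you can inspect that directory yourself. The sketch below is a minimal helper, not part of TF: the function name `list_downloaded` is hypothetical, and the layout `<base>/<backend>/<org>/<repo>` is an assumption, since the text only says that data lands in a logical place under `~/text-fabric-data/`.

```python
from pathlib import Path


def list_downloaded(base=None):
    """Return 'backend/org/repo' strings for directories found under base.

    Assumption: downloads are laid out as <base>/<backend>/<org>/<repo>.
    This helper is hypothetical and not part of the TF API.
    """
    base = Path(base) if base is not None else Path.home() / "text-fabric-data"
    if not base.is_dir():
        # Nothing has been downloaded yet, or the base lives elsewhere.
        return []
    found = []
    for repo_dir in base.glob("*/*/*"):
        # Only directories at the third level count as repos; skip stray files.
        if repo_dir.is_dir():
            found.append(repo_dir.relative_to(base).as_posix())
    return sorted(found)


if __name__ == "__main__":
    for entry in list_downloaded():
        print(entry)
```

If your data directory is organized differently, adjust the glob pattern accordingly.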

The TF data is fairly compact.

Size of data

There might be sizable additional data for some corpora, images for example. In that case, take care to have a good internet connection when you use a TF app for the first time.

Manually

TF data of corpora reside in a back-end repo. You can manually clone such a data repository and point TF to that data.

First, take care that your clone ends up in github/orgName or gitlab/orgName (relative to your home directory), where orgName is the organization, person, or group on GitHub / GitLab under which you have found the repo.
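The location rule above can be checked with a small helper before you invoke the app. This is a sketch only: `expected_clone_dir` is a hypothetical name, not a TF function, and it assumes the clone directory is named after the repo inside github/orgName or gitlab/orgName.

```python
from pathlib import Path


def expected_clone_dir(org, repo, backend="github", home=None):
    """Return <home>/<backend>/<org>/<repo>, the place a local clone should live.

    Hypothetical helper, not part of TF. `backend` is "github" or "gitlab",
    matching the two directory names mentioned in the text.
    """
    home = Path(home) if home is not None else Path.home()
    return home / backend / org / repo


if __name__ == "__main__":
    # "org" and "repo" are placeholders; substitute the real names.
    clone_dir = expected_clone_dir("org", "repo")
    status = "found" if clone_dir.is_dir() else "missing"
    print(f"{clone_dir}: {status}")
```

If the directory is reported as missing, clone the repo into that location (or move an existing clone there).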

Then, when you invoke the app, pass the specifier :clone. This instructs TF to look in your local GitHub / GitLab clone, rather than online or in your local ~/text-fabric-data, where downloaded data is stored.

from tf.app import use
A = use("org/repo:clone", checkout="clone")

tf org/repo:clone --checkout=clone

In this way, you can work with data that is under your control.

Size of data

Cloning a data repository is more costly than letting TF download the data. A data repository may contain several versions and representations of the data, including their change histories. There might also be other material in the repo, such as source data, tutorials, and programs.

For example, the ETCBC/bhsa repo is several gigabytes, but the TF data for a specific version is only 25MB.
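To make this comparison for your own clone, you can measure directory sizes on disk. The helpers below are an illustrative sketch using only the standard library. The names `dir_size_bytes` and `format_mb` are hypothetical, and which subdirectory holds the TF data for a given version depends on the repo.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files below path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def format_mb(n_bytes):
    """Format a byte count as megabytes (1 MB = 1,000,000 bytes)."""
    return f"{n_bytes / 1_000_000:.1f} MB"


# Usage (paths are placeholders; point them at your own clone):
#   whole_repo = dir_size_bytes(os.path.expanduser("~/github/org/repo"))
#   print(format_mb(whole_repo))
```

Running it once on the whole clone and once on the directory with the TF files of a single version shows how much of the clone is history and auxiliary material.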

Extra data

Researchers are continually adding new insights in the form of new feature data. TF apps make it easy to use that data alongside the main data source. Read more about the data life cycle in tf.about.datasharing.