Module tf.about.corpora

Corpora

Corpora are usually stored in an online repository, such as GitHub or a research data archive such as DANS.

Some corpora are supported by Text-Fabric apps.

These apps provide a browser interface for the corpus, and they enhance the API for working with it programmatically.

A TF app can also download and update the corpus data.

All existing apps can be found in the annotation organization on GitHub. Each repo named app-appName hosts the app named appName.
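The naming convention above can be expressed as a small sketch. The helper `app_repo` is hypothetical and not part of Text-Fabric; it only spells out the app-appName rule for repos under annotation.

```python
def app_repo(app_name: str, org: str = "annotation") -> str:
    """Return the org/repo slug that, by the convention above, hosts a TF app.

    `app_repo` is an illustrative helper, not a Text-Fabric function.
    """
    return f"{org}/app-{app_name}"


print(app_repo("bhsa"))
```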

Corpora prefixed with a star do not have dedicated TF apps.

Each entry gives the acronym on one line, followed by language, name, period, description, and converters.

* Greek Literature
Greek, Greek Literature, -400 - +400, Perseus Digital Library and Open Greek and Latin Project. The result of a massive conversion effort by Ernst Boogert.
athenaeus
Greek, Works of Athenaeus, 80 - 170, Deipnosophistae, Ernst Boogert
banks
modern english, Iain M. Banks, 1984 - 1987, 99 words from the SF novel Consider Phlebas, Dirk Roorda
bhsa
Hebrew, Hebrew Bible, 1000 BC - 900 AD, Biblia Hebraica Stuttgartensia (Amstelodamensis), ETCBC + Dirk Roorda
dss
Hebrew, Dead Sea Scrolls, 300 BC - 100 AD, Transcriptions with morphology based on Martin Abegg's data files, Martijn Naaijer, Jarod Jacobs, Dirk Roorda
fusus
Arabic, Fusus Al Hikam, 1165 - 2000, editions (Lakhnawi and Afifi) of Ibn Arabi's Fusus plus commentaries in the centuries thereafter, Cornelis van Lit, Dirk Roorda
missieven
Dutch, General Missives, 1600 - 1800, General Missives, Dutch East-Indian Company, Jesse van der Does, Sophie Arnoult, Dirk Roorda
nena
Aramaic, North Eastern Neo-Aramaic Corpus, 2000, Nena Cambridge, Cody Kingham
oldassyrian
Akkadian / cuneiform, Old Assyrian documents, 2000 - 1600 BC, Documents from Ashur, Cale Johnson, Alba de Ridder, Martijn Kokken, Dirk Roorda
oldbabylonian
Akkadian / cuneiform, Old Babylonian letters, 1900 - 1600 BC, Altbabylonische Briefe in Umschrift und Übersetzung, Cale Johnson, Dirk Roorda
peshitta
Syriac, Syriac Old Testament, 1000 BC - 900 AD, Vetus Testamentum Syriace, Hannes Vlaardingerbroek, Dirk Roorda
quran
Arabic, Quran, 600 - 900, Quranic Arabic Corpus, Cornelis van Lit, Dirk Roorda
syrnt
Syriac, Syriac New Testament, 0 - 1000, Novum Testamentum Syriace, Hannes Vlaardingerbroek, Dirk Roorda
tisch
Greek, New Testament, 50 - 450, Greek New Testament in Tischendorf 8th Edition, Cody Kingham, Dirk Roorda
uruk
proto-cuneiform, Uruk, 4000 - 3100 BC, Archaic tablets from Uruk, Cale Johnson, Dirk Roorda

Intentions

oldroyal
Akkadian-Sumerian cuneiform, Bilingual royal inscriptions, 2000 - 1600, more info to come, Martijn Kokken, Dirk Roorda

Get corpora

Automatically

Text-Fabric downloads apps from annotation automatically when you use them.

See tf.about.use.

Downloaded data ends up in a logical place under ~/text-fabric-data/ in your home directory.

The TF data is fairly compact.

Size of data

There might be sizable additional data for some corpora, images for example. In that case, take care to have a good internet connection when you use a TF app for the first time.
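To see how much space downloaded data actually takes, a minimal sketch like the following can help. It uses only the Python standard library; `dir_size_bytes` is a hypothetical helper, not part of Text-Fabric, and the location ~/text-fabric-data is taken from the text above.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files below `root`; 0 if `root` is not a directory."""
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


# Default download location mentioned above.
data_dir = Path.home() / "text-fabric-data"
print(f"{data_dir}: {dir_size_bytes(data_dir) / 1e6:.1f} MB")
```

The same function can be pointed at a local clone of a data repository to compare its size with the downloaded TF data.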

Manually

Corpus data of app-supported corpora reside in a GitHub repo. You can manually clone such a data repository and point Text-Fabric to that data.

First, take care that your clone ends up in github/orgName (relative to your home directory) where orgName is the organization or person on GitHub under which you have found the repo.

Then, when you invoke the app, pass the specifier :clone. This instructs Text-Fabric to look in your local GitHub clone, rather than online or in text-fabric-data, where downloaded data is stored.

use('xxxx:clone', checkout="clone")

text-fabric xxxx:clone --checkout=clone

In this way, you can work with data that is under your control.
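Before invoking an app with :clone, it can be useful to check that the clone sits where Text-Fabric will look for it. The sketch below assumes the repo is cloned as a subdirectory named after the repo inside github/orgName; `expected_clone_dir` is a hypothetical helper, not a Text-Fabric function, and etcbc/bhsa is used only as an example.

```python
from pathlib import Path
from typing import Optional


def expected_clone_dir(org: str, repo: str, home: Optional[Path] = None) -> Path:
    """Build <home>/github/<org>/<repo>, the clone layout assumed above."""
    base = Path.home() if home is None else home
    return base / "github" / org / repo


# Example: where a clone of the etcbc/bhsa repo would be expected.
clone = expected_clone_dir("etcbc", "bhsa")
print(clone, "exists" if clone.is_dir() else "is missing")
```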

Size of data

Cloning a data repository is more costly than letting Text-Fabric download the data. A data repository may contain several versions and representations of the data, including their change histories. There might also be other material in the repo, such as source data, tutorials, and programs.

For example, the etcbc/bhsa repo is several GB, but the TF data for a specific version is only 25MB.

Extra data

Researchers are continually adding new insights in the form of new feature data. TF apps make it easy to use that data alongside the main data source. Read more about the data life cycle in tf.about.datasharing.
