Corpora are usually stored in an online repository such as GitHub, or in a research data archive such as DANS.
Some corpora are supported by Text-Fabric apps.
These apps provide a browser interface for the corpus, and they enhance the API for working with it programmatically.
A TF app can also download and update the corpus data.
All existing apps can be found in the annotation organization on GitHub.
Each repo named app-appName hosts the app named appName.
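The naming convention above can be sketched as a small helper. This is only an illustration: the function names `app_repo` and `app_repo_url` are hypothetical and not part of Text-Fabric, and `bhsa` is used as an assumed example app name.

```python
# Hypothetical helpers illustrating the "app-appName" repo naming convention.
# They are not part of the Text-Fabric API.

GITHUB_ORG = "annotation"  # the GitHub location named in the text above


def app_repo(app_name: str) -> str:
    """Return the org/repo string that hosts the app called app_name."""
    return f"{GITHUB_ORG}/app-{app_name}"


def app_repo_url(app_name: str) -> str:
    """Return the GitHub URL of the repo for the app called app_name."""
    return f"https://github.com/{app_repo(app_name)}"


# "bhsa" is an assumed example app name.
print(app_repo_url("bhsa"))
```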
Corpora prefixed with a star do not have dedicated TF apps.
- Greek Greek Literature, -400 - +400, Perseus Digital Library and Open Greek and Latin Project. The result of a massive conversion effort by Ernst Boogert.
- Greek Works of Athenaeus, 80 - 170, Deipnosophistae, Ernst Boogert
- modern english Iain M. Banks, 1984 - 1987, 99 words from the SF novel Consider Phlebas, Dirk Roorda
- Hebrew Hebrew Bible, 1000 BC - 900 AD, Biblia Hebraica Stuttgartensia (Amstelodamensis), ETCBC + Dirk Roorda
- Hebrew Dead Sea Scrolls, 300 BC - 100 AD, Transcriptions with morphology based on Martin Abegg's data files, Martijn Naaijer, Jarod Jacobs, Dirk Roorda
- Arabic Fusus Al Hikam, 1165 - 2000, editions (Lakhnawi and Afifi) of Ibn Arabi's Fusus plus commentaries in the centuries thereafter, Cornelis van Lit, Dirk Roorda
- Dutch General Missives, 1600 - 1800, General Missives, Dutch East-Indian Company, Jesse van der Does, Sophie Arnoult, Dirk Roorda
- Aramaic North Eastern Neo-Aramaic Corpus, 2000, Nena Cambridge, Cody Kingham
- Akkadian / cuneiform Old Assyrian documents, 2000 - 1600 BC, Documents from Ashur, Cale Johnson, Alba de Ridder, Martijn Kokken, Dirk Roorda
- Akkadian / cuneiform Old Babylonian letters, 1900 - 1600 BC, Altbabylonische Briefe in Umschrift und Übersetzung, Cale Johnson, Dirk Roorda
- Syriac Syriac Old Testament, 1000 BC - 900 AD, Vetus Testamentum Syriace, Hannes Vlaardingerbroek, Dirk Roorda
- Arabic Quran, 600 - 900, Quranic Arabic Corpus, Cornelis van Lit, Dirk Roorda
- Syriac Syriac New Testament, 0 - 1000, Novum Testamentum Syriace, Hannes Vlaardingerbroek, Dirk Roorda
- Greek New Testament, 50 - 450, Greek New Testament in Tischendorf 8th Edition, Cody Kingham, Dirk Roorda
- proto-cuneiform Uruk, 4000 - 3100 BC, Archaic tablets from Uruk, Cale Johnson, Dirk Roorda
- Akkadian-Sumerian cuneiform Bilingual royal inscriptions, 2000 - 1600, more info to come, Martijn Kokken, Dirk Roorda
Text-Fabric downloads apps from annotation automatically when you use them.
Data ends up in a logical place under your
The TF data is fairly compact.
Size of data
There might be sizable additional data for some corpora, images for example. In that case, take care to have a good internet connection when you use a TF app for the first time.
Corpus data of app-supported corpora reside in a GitHub repo. You can manually clone such a data repository and point Text-Fabric to that data.
First, take care that your clone ends up in
(relative to your home directory),
where orgName is the organization or person on GitHub under which you
found the repo.
Then, when you invoke the app, pass the specifier
This instructs Text-Fabric to look in your local GitHub clone, rather than online or in text-fabric-data, where downloaded data is stored.
    use('xxxx:clone', checkout="clone")
    text-fabric xxxx:clone --checkout=clone
In this way, you can work with data that is under your control.
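As a rough sketch, the two invocations above can be wrapped so that the app name is filled in once. The helper names (`clone_specifier`, `clone_command`, `load_from_clone`) are hypothetical, and `bhsa` is an assumed example app name. Only `use(..., checkout="clone")` and the `text-fabric ... --checkout=clone` command line come from the text above. `load_from_clone` needs Text-Fabric installed and a local clone in place, so it is defined here but not called.

```python
import shlex


def clone_specifier(app_name: str) -> str:
    """Build the app specifier that points at a local clone, e.g. 'bhsa:clone'."""
    return f"{app_name}:clone"


def clone_command(app_name: str) -> list:
    """Build the command line that starts the browser on the local clone."""
    return ["text-fabric", clone_specifier(app_name), "--checkout=clone"]


def load_from_clone(app_name: str):
    """Load the corpus from the local clone through the Python API."""
    # Imported lazily so the helpers above work without Text-Fabric installed.
    from tf.app import use

    return use(clone_specifier(app_name), checkout="clone")


# "bhsa" is an assumed example app name.
print(shlex.join(clone_command("bhsa")))
```

Keeping the specifier in one helper means the Python call and the command line cannot drift apart.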
Size of data
Cloning a data repository is more costly than letting Text-Fabric download the data. A data repository may contain several versions and representations of the data, including their change histories. There might also be other material in the repo, such as source data, tutorials, and programs.
For example, the etcbc/bhsa repo is several GB, but the TF data for a specific version is only 25MB.
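To compare the two footprints on your own machine, you can measure directories on disk. The sketch below uses only the standard library. The helper names are hypothetical, and you supply the paths yourself: none are assumed here, and the demo line simply measures the current directory.

```python
from pathlib import Path


def dir_size_bytes(root) -> int:
    """Sum the sizes of all regular files below root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def format_size(num_bytes: float) -> str:
    """Format a byte count using decimal units (1 MB = 1,000,000 bytes)."""
    for unit in ("B", "KB", "MB", "GB"):
        if num_bytes < 1000 or unit == "GB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000


# Demo on the current directory; substitute the path of a clone or of
# downloaded TF data to compare them.
print(format_size(dir_size_bytes(".")))
```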
Researchers are continually adding new insights in the form of new feature
data. TF apps make it easy to use that data alongside the main data source.
Read more about the data life cycle in