Module tf.about.background
Background
Research data management
TF supports the research data cycle of retrieving/analysing/generating/archiving research results.
Share data
When using the TF browser, results can be
exported, see export()
.
When programming in a notebook, TF generates many useful links after having been invoked. In this way the provenance of your data will be shared wherever you share the notebook (GitHub / GitLab, NBViewer, Software Heritage Archive).
Contribute data
Researchers can produce new data (FabricCore.save()
)
out of their findings and package their new data into modules and
distribute it to GitHub / GitLab, see tf.about.datasharing
.
Other people can use that data just by mentioning the GitHub / GitLab location.
TF will auto-load it for them.
Factory
TF can be used to construct websites,
for example SHEBANQ.
In the case of SHEBANQ, data has been converted to MYSQL databases.
However, with the built-in TF kernel] (tf.browser.kernel
),
it is also possible to use TF itself as a database to
serve multiple connections and requests.
API organization
All corpora are different, and it shows when we have to display the materials.
TF offers a plain display of corpus text and a pretty display of feature-enriched
structures.
These functions are supported by advanced configuration settings, derived from
the corpus itself. Where these default settings are not enough, the corpus designer
can add and tweak corpus settings.
Moreover, custom code can be written and hooked into the display functions.
The combination of a custom configuration file (config.yaml) and a bit of
application code (app.py), together with additional styling (display.css) is an
app.
A well-configured app can auto-download the corpus data, holds provenance information
of all data sources that are being used for a corpus, and takes care of an optimal display
of the patterns in the corpus.
- Corpora:
tf.about.corpora
- advanced: API:
tf.advanced
- core API:
tf.core
Design principles
There are a number of things that set TF apart from most other ways to encode corpora.
Minimalistic model
TF is based on a minimalistic data model for text plus annotations.
A defining characteristic is that TF stores text as a bunch of features in plain text files.
These features are interpreted against a graph of nodes and edges, which make up the abstract fabric of the text.
A graph is a more general concept than a tree. Whilst trees are ubiquitous in linguistic analysis, there is much structure in a corpus that is not strictly tree-like.
Therefore, we do not adopt technologies that have the tree as their first class data model. Hence, almost by definition, TF does not make use of XML technology.
Performance matters
Based on this model, TF offers a core API (tf.fabric
)
to search, navigate and process text and its annotations.
A lot of care has been taken to make this API work as fast as possible.
Efficiency in data processing has been a design criterion from the start.
Comparisons
See e.g. the comparisons between the TF way of serializing (pickle + gzip) and avro, joblib, and marshal.
Code organization and statistics
To get an impression of the software that is TF,
in terms of organization and size, see tf.about.code
.
History
The foundational ideas derive from work done in and around the ETCBC avant-la-lettre from 1970 onward by Eep Talstra, Crist-Jan Doedens, (Ph.D. thesis), Henk Harmsen, Ulrik Sandborg-Petersen (Emdros), and many others.
I entered in that world in 2007 as a DANS employee, doing a joint small data project, and a bigger project SHEBANQ in 2013/2014. In 2013 I developed LAF-Fabric as a tool for constructing the website SHEBANQ.
House cleaning
LAF-Fabric is based on the ISO standard Linguistic Annotation Framework (LAF). LAF is an attempt to marry graph models to the Text Encoding Initiative (TEI) which lives in XML. It is a good try, but it turns out that using XML technology for graphs is a pain. All the usual advantages of using the XML tool chain evaporate.
So I decided to leave XML and its associated syntactical complexity. While I was at it, I took out everything that makes LAF-Fabric complicated and all things that are not essential for the sake of raw data processing. That became TF version 1 at the end of 2016.
It turned out that this move has freed the way to work towards higher-level goals:
- a new search engine (inspired by MQL and
- support for research data workflows.
Time moves on, and nowhere is that felt as keenly as in computing science. Programming has become easier, humanists become better programmers, and personal computers have become powerful enough to do a sizable amount of data science on them.
That leads to exciting tipping points:
In sociology, a tipping point is a point in time when a group - or a large number of group members — rapidly and dramatically changes its behaviour by widely adopting a previously rare practice.
TF is an attempt to tip the scales by providing digital humanists with the functions they need now, based on technology that appeals now.
Hence, my implementation of TF search has been done from the ground up, and uses a strategy that is very different from Ulrik's MQL search engine.
I continued working on TF while I was at DANS, till 2022.
Now, at KNAW/Humanities Cluster, I apply it to new contexts: it can work with GitLab as well.
Expand source code Browse git
"""
.. include:: ../docs/about/background.md
"""