STAM: Stand-off Text Annotation Model

Schematic introduction to the STAM data model

Data Model

STAM is a standalone data model for stand-off annotation on text. It allows you to describe annotations on text in your own terms. STAM does not prescribe any vocabulary.

Texts are kept as-is (UTF-8 plain text), devoid of any special markup, and annotations target text segments via character offsets. This approach can be called radical stand-off. Any further decisions about the annotation paradigm are up to you. STAM aims to be generic and flexible.
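
To make the idea of character offsets concrete, here is a minimal plain-Python sketch that does not use any STAM library. It assumes that offsets count Unicode characters from zero and that the end offset is exclusive, which matches the 6-13 offset used for "världen" in the example further down this page.

```python
text = "Hallå världen"

# Assumption: offsets count Unicode characters, start at 0, end is exclusive.
begin, end = 6, 13
segment = text[begin:end]
print(segment)  # världen

# Byte offsets in the UTF-8 encoding differ as soon as non-ASCII characters
# occur, because "å" and "ä" each take two bytes.
raw = text.encode("utf-8")
byte_begin = len(text[:begin].encode("utf-8"))
byte_end = len(text[:end].encode("utf-8"))
print(byte_begin, byte_end)  # 7 15

# Both coordinate systems select the same segment.
assert raw[byte_begin:byte_end].decode("utf-8") == segment
```

Keeping the text free of markup is what makes such offsets stable: nothing is ever inserted into the text, so an offset keeps pointing at the same segment.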

Annotation is the central notion in STAM: almost everything is an annotation, and each annotation has annotation data and selects a target. The data vocabulary is entirely up to you; not even the notion of a word, token or sentence is predefined. Annotations can represent whatever you want, be it linguistic, structural, presentational, editorial or otherwise.

STAM implementations offer you the means to efficiently encode your annotation model without data duplication. We then offer means to query and manipulate this data. We implement the boring core logic for things like coordinate mappings, import, export, validation, and computation of textual relations like overlap, so you no longer have to!
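
As an illustration of the kind of core logic meant here, the following plain-Python sketch shows an overlap test and a simple coordinate mapping on character spans. The `Span` type and the functions `overlaps`, `embeds` and `to_relative` are hypothetical names made up for this example; they are not the API of any STAM library.

```python
from typing import NamedTuple, Optional


class Span(NamedTuple):
    """Hypothetical helper: a half-open character range [begin, end)."""
    begin: int
    end: int


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a.begin < b.end and b.begin < a.end


def embeds(outer: Span, inner: Span) -> bool:
    """True if inner lies entirely within outer."""
    return outer.begin <= inner.begin and inner.end <= outer.end


def to_relative(inner: Span, outer: Span) -> Optional[Span]:
    """Map inner from document coordinates to coordinates relative to outer."""
    if not embeds(outer, inner):
        return None
    return Span(inner.begin - outer.begin, inner.end - outer.begin)


text = "Hallå världen"
greeting = Span(0, 5)   # "Hallå"
word = Span(6, 13)      # "världen"
prefix = Span(6, 9)     # "vär"

print(overlaps(word, prefix))     # True
print(overlaps(word, greeting))   # False
print(to_relative(prefix, word))  # Span(begin=0, end=3)
```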

Practical Tooling

STAM comes with practical programming libraries and tools to work with annotations on text. These aim to be fairly low-level, standalone software with minimal dependencies and high reusability, not reliant on any wider infrastructure.

All are available as free open-source software under the GNU General Public Licence v3. The tools are aimed at developers, technical researchers and data scientists. You can build upon STAM to construct your own applications that deal with annotation on text, for purposes such as natural language processing (NLP) and information retrieval, among others.

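from stam import AnnotationStore, Offset, Selector

# Create a store and add a plain-text resource to it
store = AnnotationStore(id="example")
resource = store.add_resource(id="document.txt", text="Hallå världen")

# Annotate characters 6-13 ("världen") with the key/value pair pos=noun
annotation = store.annotate(
    target=Selector.textselector(resource, Offset.simple(6, 13)),
    data={"key": "pos", "value": "noun", "set": "testset"},
)

# Associate a filename with the store and save it
store.set_filename("example.stam.store.json")
store.save()

# Iterate over all annotations and print their text and data
for annotation in store.annotations():
    for data in annotation.data():
        print(f"{annotation}, {data.key()}={data.value()}")
# Output: världen, pos=noun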

Learn more? See the STAM Python Tutorial!

STAM data model

Formal specification & extensibility

There is a core STAM model that is kept minimal yet expressive, and several optional extensions that define additional modelling capabilities you may or may not use.

All are documented in a formal specification, independent of any implementations.

STAM aims to provide a solid generic foundation for stand-off text annotation upon which other initiatives can build. The model is specialised for text annotation rather than general knowledge graphs.

Simplicity, reusability & interoperability

STAM provides a canonical STAM JSON format (shown below) and, as an extension, also a STAM CSV format and an optimised binary format. Data can also be easily imported from and exported to simple ad-hoc formats like TSV. The model itself is independent of any serialisation format.

STAM may act as a pivot model in conversions between different annotation formats, paradigms and vocabularies. It does not seek to replace existing projects but offers a common foundation in which other vocabularies may be expressed, allowing you to benefit from a common machinery.

We seek a fair degree of interoperability with W3C Web Annotations, linked open data in general, FoLiA, Text Encoding Initiative (TEI), CoNLL-U and Text Fabric. Various converters and mappings to this end have been implemented, such as a W3C Web Annotation exporter (specification), and a generic XML importer that effectively untangles an XML file with inline annotations/markup (like TEI) and splits it into plain text and pure stand-off annotations referencing that text.
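
The principle of untangling inline markup can be sketched in a few lines of plain Python with the standard library. This is only an illustration of the idea, not the actual STAM XML importer: the function `untangle`, the dictionary layout of the resulting annotations, and the tiny input document are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def untangle(xml_source):
    """Hypothetical helper: split inline XML into plain text plus
    stand-off annotations (element name, character offsets, attributes)."""
    root = ET.fromstring(xml_source)
    chunks = []       # pieces of plain text, in document order
    annotations = []  # one entry per XML element
    position = 0      # current character offset in the plain text

    def visit(element):
        nonlocal position
        begin = position
        if element.text:
            chunks.append(element.text)
            position += len(element.text)
        for child in element:
            visit(child)
            # Text following a child still belongs to the parent element.
            if child.tail:
                chunks.append(child.tail)
                position += len(child.tail)
        annotations.append({
            "element": element.tag,
            "begin": begin,
            "end": position,  # exclusive
            "attributes": dict(element.attrib),
        })

    visit(root)
    return "".join(chunks), annotations


text, annotations = untangle('<p>Hallå <w pos="noun">världen</w>!</p>')
print(text)
for annotation in annotations:
    print(annotation, "->", text[annotation["begin"]:annotation["end"]])
```

Each element becomes an annotation that points into the plain text by offset, so overlapping or competing markup layers no longer have to fit into a single XML hierarchy.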

{
  "@type": "AnnotationStore", "@id": "example",
  "resources": [ { "@type": "TextResource", "@id": "document.txt", "text": "Hallå världen" } ],
  "annotationsets": [
    {
      "@type": "AnnotationDataSet", "@id": "testset",
      "keys": [ { "@type": "DataKey", "@id": "pos" } ],
      "data": [ { "@type": "AnnotationData", "@id": "!D0", "key": "pos", "value": { "@type": "String", "value": "noun" } } ]
    }
  ],
  "annotations": [
    {
      "@type": "Annotation", "@id": "!A0",
      "target": {
        "@type": "TextSelector",
        "resource": "document.txt",
        "offset": {
          "@type": "Offset",
          "begin": { "@type": "BeginAlignedCursor", "value": 6 },
          "end": { "@type": "BeginAlignedCursor", "value": 13 }
        }
      },
      "data": [ { "@type": "AnnotationData", "@id": "!D0", "set": "testset" } ]
    }
  ]
}

$ stam query --query \
  'SELECT ANNOTATION ?a WHERE 
          DATA "testset" "pos" = "noun";' \
   example.stam.store.json
Type       Id  Text    TextSelection     testset/pos
Annotation !A0 världen document.txt#6-13 noun
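
To show how the JSON structure and the query result relate, the following sketch walks a compact copy of the example store with nothing but the Python standard library. It is not STAMQL and not how the STAM tools evaluate queries; `select_annotations` is a hypothetical helper, and it only handles the case seen in the example (a TextSelector whose offsets are both BeginAlignedCursor values).

```python
import json

# Compact copy of the example store shown on this page.
STORE_JSON = """
{
  "@type": "AnnotationStore", "@id": "example",
  "resources": [{"@type": "TextResource", "@id": "document.txt", "text": "Hallå världen"}],
  "annotationsets": [{
    "@type": "AnnotationDataSet", "@id": "testset",
    "keys": [{"@type": "DataKey", "@id": "pos"}],
    "data": [{"@type": "AnnotationData", "@id": "!D0", "key": "pos",
              "value": {"@type": "String", "value": "noun"}}]
  }],
  "annotations": [{
    "@type": "Annotation", "@id": "!A0",
    "target": {
      "@type": "TextSelector", "resource": "document.txt",
      "offset": {"@type": "Offset",
                 "begin": {"@type": "BeginAlignedCursor", "value": 6},
                 "end": {"@type": "BeginAlignedCursor", "value": 13}}
    },
    "data": [{"@type": "AnnotationData", "@id": "!D0", "set": "testset"}]
  }]
}
"""


def select_annotations(store, set_id, key, value):
    """Hypothetical helper: find annotations carrying set_id/key = value."""
    texts = {res["@id"]: res["text"] for res in store["resources"]}
    # Index the data items of every set by (set id, data id).
    data_index = {
        (dataset["@id"], item["@id"]): (item["key"], item["value"]["value"])
        for dataset in store["annotationsets"]
        for item in dataset["data"]
    }
    rows = []
    for annotation in store["annotations"]:
        for ref in annotation["data"]:
            found_key, found_value = data_index[(ref["set"], ref["@id"])]
            if (ref["set"], found_key, found_value) != (set_id, key, value):
                continue
            target = annotation["target"]
            begin = target["offset"]["begin"]["value"]
            end = target["offset"]["end"]["value"]
            text = texts[target["resource"]][begin:end]
            selection = f'{target["resource"]}#{begin}-{end}'
            rows.append((annotation["@id"], text, selection, found_value))
    return rows


rows = select_annotations(json.loads(STORE_JSON), "testset", "pos", "noun")
for row in rows:
    print(*row)  # !A0 världen document.txt#6-13 noun
```

Note that the annotation only refers to its data by set and id; the key and value are stored once in the annotation data set, which is how the model avoids data duplication.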

Querying & Visualisation

Tools are also provided for querying and visualisation. A STAM Query Language (STAMQL) is available to that end. Visualisation works either in the browser (HTML) or in the terminal (ANSI text), and is also exposed via stam-python for use in, for example, Jupyter Notebooks.

$ stam view --query 'SELECT RESOURCE ?res' \
  --query '@KEYVALUETAG SELECT ANNOTATION ?a WHERE 
          RESOURCE ?res; 
          DATA "testset" "pos" = "noun";' \
   example.stam.store.json
[Figure: visualisation of an annotation in the browser]
[Figure: visualisation of an annotation on the terminal]
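
The difference between the two output targets can be illustrated with a small plain-Python sketch that highlights one annotated span, once with ANSI escape codes and once as HTML. This is not the rendering code of `stam view`; the functions `highlight_ansi` and `highlight_html` and the chosen styling are assumptions made for this example.

```python
import html

ANSI_BOLD_YELLOW = "\x1b[1;33m"
ANSI_RESET = "\x1b[0m"


def highlight_ansi(text, begin, end):
    """Hypothetical helper: mark text[begin:end] for display in a terminal."""
    return text[:begin] + ANSI_BOLD_YELLOW + text[begin:end] + ANSI_RESET + text[end:]


def highlight_html(text, begin, end, label):
    """Hypothetical helper: mark text[begin:end] for display in a browser."""
    return (
        html.escape(text[:begin])
        + f'<mark title="{html.escape(label)}">'
        + html.escape(text[begin:end])
        + "</mark>"
        + html.escape(text[end:])
    )


text = "Hallå världen"
print(highlight_ansi(text, 6, 13))
print(highlight_html(text, 6, 13, "pos: noun"))
# Hallå <mark title="pos: noun">världen</mark>
```

In both cases the source text stays untouched; the markup is generated on the fly from the offsets of the annotation.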

Computing with annotations

STAM, through its Rust and Python libraries and the command-line tools, offers a lot of functionality for various computations on texts and their annotations.

[Figure: stam-tools demo]

Demos, presentations & publications