Module tf.convert.tei

TEI import

You can convert any TEI source into TF by specifying a few details about the source.

TF then invokes the tf.convert.walker machinery to produce a TF dataset out of the source.

TF knows the TEI elements, because it will read and parse the complete TEI schema. From this the set of complex, mixed elements is distilled.

If the TEI source conforms to a customised TEI schema, it will be detected and the importer will read it and override the generic information of the TEI elements.

It is also possible to pass a choice of template and adaptation in a processing instruction. This does not influence validation, but it may influence further processing.

If the TEI consists of multiple source files, it is possible to specify different templates and adaptations for different files.

The possible values for models, templates, and adaptations should be declared in the configuration file. For each model there should be a corresponding schema in the schema directory, either an RNG or an XSD file.

The converter goes the extra mile: it generates a TF app and documentation (an about.md file and a transcription.md file), in such a way that the TF browser is instantly usable.

The TEI conversion is rather straightforward because of some conventions that cannot be changed.

Configuration and customization

We assume that you have a programs directory at the top-level of your repo. In this directory we'll look for two optional files:

  • a file tei.yaml in which you specify a bunch of values to get the conversion off the ground.

  • a file tei.py in which you define custom functions that are executed at certain specific hooks:

    • transform(text) which takes a text string argument and delivers a text string as result. The converter will call this on every TEI input file it reads before feeding it to the XML parser. This can be used to solve some quirks in the input, e.g. replacing two consecutive commas (,,) by a single Unicode character („ = U+201E);
    • beforeTag: just before the walker starts processing the start tag of a TEI element;
    • beforeChildren: just after processing the start tag, but before processing the element content (text and child elements);
    • afterChildren: just after processing the complete element content (text and child elements), but before processing the end tag of the TEI element;
    • afterTag: just after processing the end tag of a TEI element.

      The before and after functions should take the following arguments

      • cv: the walker converter object;
      • cur: the dictionary with information that has been gathered during the conversion so far and that can be used to dump new information into; it is nonlocal, i.e. all invocations of the hooks get the same dictionary object passed to them;
      • xnode: the LXML node corresponding to the TEI element;
      • tag: the tag name of the element, without namespaces; this is a bit redundant, because it can also be extracted from the xnode, but it is convenient.
      • atts: the attributes (names and values) of the element, without namespaces; this is a bit redundant, because it can also be extracted from the xnode, but it is convenient.

      These functions should not return anything, but they can write things to the cur dictionary. And they can create slots, nodes, and terminate them, in short, they can do every cv-based action that is needed.

      You can define these functions out of this context, but it is good to know what information in cur is guaranteed to be available:

      • xnest: the stack of XML tag names seen at this point;
      • tnest: the stack of TF nodes built at this point;
      • tsiblings (only if sibling nodes are being recorded): the list of preceding TF nodes corresponding to the TEI sibling elements of the current TEI element.
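
To make the hook signatures concrete, here is a minimal sketch of what a tei.py could look like. It only uses what is described above: the function names, the five arguments, and the fact that cur is shared between calls. The keys tagCount and lastPage are hypothetical names chosen for this illustration, not keys that the converter itself defines.

```python
def transform(text):
    """Repair input quirks before the XML parser sees the file."""
    # Two consecutive commas stand for a low double quotation mark (U+201E).
    return text.replace(",,", "\u201e")


def beforeChildren(cv, cur, xnode, tag, atts):
    """Count the elements seen so far, per tag name.

    "tagCount" is a hypothetical key of our own inside cur.
    """
    counts = cur.setdefault("tagCount", {})
    counts[tag] = counts.get(tag, 0) + 1


def afterChildren(cv, cur, xnode, tag, atts):
    """Remember the n attribute of the last <pb> element that was met.

    "lastPage" is again a hypothetical key of our own.
    """
    if tag == "pb" and "n" in atts:
        cur["lastPage"] = atts["n"]
```

The hooks return nothing; whatever they want to keep goes into cur, where later invocations can find it.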

Keys and values of the tei.yaml file

generic

dict, optional {}

Metadata for all generated TF features. The actual source version of the TEI files does not have to be stated here, it will be inserted based on the version that the converter will actually use. That version depends on the tei argument passed to the program. The key under which the source version will be inserted is teiVersion.

extra

dict, optional {}

Instructions and metadata for specific generated TF features, namely those that have not been generated by the vanilla TEI conversion, but by extra code in one of the customised hooks. The dict is keyed by feature name, the values are again dictionaries. These value dictionaries have a key meta under which any number of metadata key-value pairs can be given, such as description="xxx".

If you put the string «base» in such a field, it will be expanded on the basis of the contents of the path key, see below.

You must provide the key valueType and pass int or str there, depending on the values of the feature. You may provide extra keys, such as conversionMethod="derived", so that other programs can determine what to do with these features. The information in this dict will also end up in the generated feature docs.

Besides the meta key, there may also be the keys path and nodeType. Together they contain an instruction to produce a feature value from element content that can be found on the current stack of XML nodes and attributes. The value found will be put in the feature in question for the most recently constructed node of the type specified in nodeType.

Example:

extra:
  letterid:
    meta:
      description: The identifier of a letter; «base»
      valueType: str
      conversionMethod: derived
      conversionCode: tt
    path:
      - idno:
        type: letterId
      - altIdentifier
      - msIdentifier
      - msDesc
      - sourceDesc
    nodeType: letter
    feature: letterid

The meaning is:

  • if, while parsing the XML, I encounter an element idno,
  • and if that element has an attribute type with value letterId,
  • and if it has parent altIdentifier,
  • and grandparent msIdentifier,
  • and great-grandparent msDesc,
  • and great-great-grandparent sourceDesc,
  • then look up the last created node of type letter
  • and get the text content of the current XML node (the idno one),
  • and put it in the feature letterid for that node.
  • Moreover, the feature letterid gets metadata as specified under the key meta, where the description will be filled with the text
    The identifier of a letter; the content is taken from sourceDesc/msDesc/msIdentifier/altIdentifier/idno[type=letterId]
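
The matching described in this list can be sketched in a few lines of Python. This is an illustration under assumptions, not the converter's own code: the path is taken as it would come out of a YAML parser (a step is either a tag name or a dict whose first key is the tag and whose other keys are attribute conditions), and the XML stack is modelled as a list of (tag, attributes) pairs, outermost element first. The function names are made up for this sketch.

```python
def parse_step(step):
    """Split one path step into (tag, required attributes)."""
    if isinstance(step, str):
        return step, {}
    # A dict step: the first key is the tag (with an empty value),
    # the remaining keys are conditions on attributes.
    keys = list(step)
    return keys[0], {key: step[key] for key in keys[1:]}


def path_matches(path, stack):
    """Check a path (innermost element first) against the XML stack."""
    if len(path) > len(stack):
        return False
    for step, (tag, atts) in zip(path, reversed(stack)):
        want_tag, want_atts = parse_step(step)
        if tag != want_tag:
            return False
        if any(atts.get(key) != value for key, value in want_atts.items()):
            return False
    return True


def describe(path):
    """Render the path outermost first, as in the expanded description."""
    parts = []
    for step in reversed(path):
        tag, atts = parse_step(step)
        cond = ",".join(f"{key}={value}" for key, value in atts.items())
        parts.append(f"{tag}[{cond}]" if cond else tag)
    return "/".join(parts)


path = [
    {"idno": None, "type": "letterId"},
    "altIdentifier",
    "msIdentifier",
    "msDesc",
    "sourceDesc",
]
stack = [
    ("TEI", {}),
    ("teiHeader", {}),
    ("fileDesc", {}),
    ("sourceDesc", {}),
    ("msDesc", {}),
    ("msIdentifier", {}),
    ("altIdentifier", {}),
    ("idno", {"type": "letterId"}),
]
```

With these values, path_matches(path, stack) holds, and describe(path) yields the string that replaces «base» in the description above.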
    

models

list, optional []

Which TEI-based schemas are to be used. For each model there should be an XSD or RNG file with that name in the schema directory. The tei_all schema is known to TF, no need to specify that one.

We'll try a RelaxNG schema (.rng) first. If that exists, we use it for validation with JING, and we also convert it with TRANG to an XSD schema, which we use for analysing the schema: we want to know which elements are mixed and which are pure.

If there is no RelaxNG schema, we try an XSD schema (.xsd). If that exists, we can do the analysis, and we will use it also for validation.

Problems with RelaxNG validation

RelaxNG validation is not always reliable when performed with LXML, or any tool based on libxml, for that matter. That's why we try to avoid it. Even if we translate the RelaxNG schema to an XSD schema by means of TRANG, the resulting validation is not always reliable. So we use JING to validate against the RelaxNG schema.

See also JING-TRANG.

templates

list, optional []

Which template(s) are to be used. A template is just a keyword, associated with an XML file, that can be used to switch to a specific kind of processing, such as letter, bibliolist, artworklist.

You may specify an element or processing instruction with an attribute that triggers the template for the file in which it is found.

This will be retrieved from the file before XML parsing starts. For example,

    templateTrigger="?editem@template"

will read the file and extract the value of the template attribute of the editem processing instruction and use that as the template for this file. If no template is found in this way, the empty template is assumed.
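
As an illustration of how such a trigger could be resolved before XML parsing, here is a sketch based on a regular expression. It assumes that a leading ? marks a processing instruction, that the part before @ is the name and the part after @ is the attribute. The function name and the pattern are made up for this example and may differ from what the converter really does.

```python
import re


def find_trigger_value(text, trigger):
    """Return the attribute value that a trigger points at, or "" if absent."""
    is_pi = trigger.startswith("?")
    name, att = trigger.lstrip("?").split("@", 1)
    opener = r"<\?" if is_pi else "<"
    pattern = re.compile(
        opener
        + re.escape(name)
        + r"\b[^>]*?\s"
        + re.escape(att)
        + r"""\s*=\s*["']([^"']*)["']"""
    )
    match = pattern.search(text)
    # No match means: fall back to the empty template or adaptation.
    return match.group(1) if match else ""


source = (
    '<?xml version="1.0"?>\n'
    '<?editem template="letter" adaptation="short"?>\n'
    "<TEI/>"
)
```

Here find_trigger_value(source, "?editem@template") gives "letter", and a file without the processing instruction gives the empty string.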

adaptations

list, optional []

Which adaptation(s) are to be used. An adaptation is just a keyword, associated with an XML file, that can be used to switch to a specific kind of processing. It is meant to trigger tweaks on top of the behaviour of a template.

You may specify an element or processing instruction with an attribute that triggers the adaptation for the file in which it is found.

This will be retrieved from the file before XML parsing starts. For example,

    adaptationTrigger="?editem@adaptation"

will read the file and extract the value of the adaptation attribute of the editem processing instruction and use that as the adaptation for this file. If no adaptation is found in this way, the empty adaptation is assumed.

prelim

boolean, optional True

Whether to work with the pre TF versions. Use this if you convert TEI to a preliminary TF dataset, which will receive NLP additions later on. That version will then lose the pre.

granularity

string, optional token

What to take as the basic entities (slots). Possible values:

  • word: words are slots, even if they cross element boundaries. This leads to some imprecisions: words containing an element boundary will belong to just one of both elements around the boundary.
  • char: all individual characters are separate slots. Very precise, but the dataset gets expensive with so many slots.
  • token: every sequence of alphanumeric characters becomes a token, insofar as there is no intervening markup. Non-alphanumeric characters become separate tokens. There are some additional rules: . or , tightly surrounded by digits also count as tokens.

The datasets with granularity word and token have features str for the string content of the slots, and after for the material after the slots. In the case of word, the feature after can contain whitespace and punctuation characters, in the case of token, it only contains whitespace.

With granularity char, the individual characters are taken as basic entities.

If you use an NLP pipeline to detect tokens, use the value False. The preliminary dataset is then based on characters, but the final dataset that we build from there is based on tokens, which are mostly words and non-word characters.
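
To get a feel for the token granularity, the sketch below splits plain text into tokens and records the whitespace after each token, mimicking the features str and after. It is not the converter's code: it ignores markup altogether, and it reads the rule about . and , between digits as "such a number stays together as one token", which is an assumption.

```python
import re

# A number with inner . or , | a run of alphanumerics | any other single
# non-space character; each followed by optional whitespace.
TOKEN_RE = re.compile(r"(\d+(?:[.,]\d+)+|[^\W_]+|\S)(\s*)")


def tokenize(text):
    """Return a list of (str, after) pairs for the tokens in text."""
    return [(m.group(1), m.group(2)) for m in TOKEN_RE.finditer(text)]


tokens = tokenize("Pi is 3.14, ok.")
```

In this sketch the punctuation ends up in tokens of its own, so the after values contain whitespace only.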

parentEdges

boolean, optional True

Whether to create edges between nodes that correspond to XML elements and their parents.

siblingEdges

boolean, optional False

Whether to create edges between nodes that correspond to XML elements and siblings. Edges will be created between each sibling and its preceding siblings. If you use these edges in the binary way, you can also find the following siblings. The edges are labeled with the distance between the siblings, adjacent siblings get distance 1.

Overwhelming space requirement

If the corpus is divided into relatively few elements that each have very many direct children, the number of sibling edges is comparable to the size of the corpus squared. That means that 50-99% of the TF dataset will consist of sibling edges! An example is ETCBC/nestle1904 (Greek New Testament) where each book element has all of its sentences as direct children. In that dataset, the siblings would occupy 40% of the size, and we have taken care not to produce sibling edges for sentences.
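
The quadratic growth is easy to see in a small sketch. The function below is an illustration, not the converter's code: it links every child to each of its preceding siblings and labels the edge with the distance between them, so n children give n × (n − 1) / 2 edges.

```python
def sibling_edges(children):
    """Map (node, preceding sibling) to the distance between the two."""
    edges = {}
    for i, node in enumerate(children):
        for j in range(i):
            edges[(node, children[j])] = i - j
    return edges


small = sibling_edges([10, 11, 12, 13])
large = sibling_edges(list(range(1000)))
```

Four children give 6 edges, but a single element with 1000 direct children already gives 499500 of them.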

procins

boolean, optional False

If True, processing instructions will be handled. Processing instruction <?foo bar="xxx"?> will be converted as if it were an empty element named foo with attribute bar with value xxx.
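
The correspondence between a processing instruction and an empty element can be sketched as follows. This is a standalone illustration with regular expressions; the function name is made up, and the converter itself may well rely on the XML parser for this.

```python
import re

PI_RE = re.compile(r"<\?([\w.-]+)\s*(.*?)\?>", re.S)
ATT_RE = re.compile(r"""([\w:.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def pi_as_element(pi):
    """Turn the text of a processing instruction into (tag, attributes)."""
    match = PI_RE.fullmatch(pi.strip())
    if match is None:
        raise ValueError(f"not a processing instruction: {pi!r}")
    target, data = match.groups()
    # findall yields "" for the quote style that was not used.
    atts = {name: dq or sq for name, dq, sq in ATT_RE.findall(data)}
    return target, atts


tag, atts = pi_as_element('<?foo bar="xxx"?>')
```

After this, tag is foo and atts is a dict with the single key bar and value xxx.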

lineModel

dict, optional False

If not passed, or an empty dict, line model I is assumed. A line model must be specified with the parameters relevant for the model:

dict(
    model="I",
)

(model I does not require any parameters)

or

dict(
    model="II",
    element="p",
    nodeType="ln",
)

For model II, the default parameters are:

element="p",
nodeType="ln",

Model I is the default, and nothing special happens to the <lb> elements.

In model II the <lb> elements translate to nodes of type ln, which span content, whereas the original lb elements just mark positions. Instead of ln, you can also specify another node type by the parameter nodeType.

We assume that the material that the <lb> elements divide up is the material that corresponds to their <p> parent element. Instead of <p>, you can also specify another element in the parameter element.

We assume that lines start and end at the start and end of the <p> elements and the <lb> elements. For the material between these boundaries, we build ln nodes. If an <lb> element follows a <p> start tag without intervening slots, a ln node will be created but not linked to slots, and it will be deleted later in the conversion. Likewise, if an <lb> element is followed by a <p> end tag without intervening slots, a ln node is created that is not linked to slots.

The attributes of the <lb> elements become features of the ln node that starts with that <lb> element. If there is no explicit <lb> element at the start of a paragraph, the first ln node of that paragraph gets no features.
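
The effect of line model II can be imitated on a toy representation of a paragraph. In the sketch below, which is an illustration and not the converter's code, the content of a <p> is a list in which integers stand for slots and dicts stand for <lb> elements (the dict holds the attributes of the <lb>). Lines without slots are dropped at the end, in the spirit of the deletion described above.

```python
def lines_from_paragraph(content):
    """Chop the content of one paragraph into (slots, features) lines."""
    lines = []
    current = ([], {})
    for item in content:
        if isinstance(item, dict):
            # An <lb>: close the running line, open a new one that
            # carries the attributes of this <lb> as features.
            lines.append(current)
            current = ([], item)
        else:
            current[0].append(item)
    lines.append(current)
    # Lines that are not linked to any slot do not survive.
    return [(slots, feats) for slots, feats in lines if slots]


no_initial_lb = lines_from_paragraph([1, 2, {"n": "2"}, 3, 4, {"n": "3"}])
initial_lb = lines_from_paragraph([{"n": "1"}, 1, 2, {"n": "2"}, 3])
```

In the first call the paragraph does not start with an <lb>, so its first line has no features, and the trailing <lb> right before the end of the paragraph leaves no line behind.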

pageModel

dict, optional False

If not passed, or an empty dict, page model I is assumed. A page model must be specified with the parameters relevant for the model:

dict(
    model="I",
)

(model I does not require any parameters)

or

dict(
    model="II",
    element="div",
    attributes=dict(type=["original", "translation"]),
    pbAtTop=True,
    nodeType="page",
)

For model II, the default parameters are:

element="div",
pbAtTop=True,
nodeType="page",
attributes={},

Model I is the default, and nothing special happens to the <pb> elements.

In model II the <pb> elements translate to nodes of type page, which span content, whereas the original pb elements just mark positions. Instead of page, you can also specify another node type by the parameter nodeType.

We assume that the material that the <pb> elements divide up is the material that corresponds to their <div> parent element. Instead of <div>, you can also specify another element in the parameter element. If you want to restrict the parent elements of pages, you can do so by specifying attributes, like type="original". Then only parents that carry those attributes will be chopped up into pages. You can specify multiple values for each attribute. Elements that carry one of these values are candidates for having their content divided into pages.

We assume that the material to be divided starts with a <pb> (as the TEI-guidelines prescribe) and we translate it to a page element that we close either at the next <pb> or at the end of the div.

But if you specify pbAtTop=False, we assume that the <pb> marks the end of the corresponding page element. We start the first page at the start of the enclosing element. If there is material between the last <pb> and the end of the enclosing element, we generate an extra page node without features.
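
The difference between the two settings of pbAtTop can be shown on a toy representation of the enclosing element. The sketch below is an illustration, not the converter's code: integers stand for slots, dicts stand for <pb> elements with their attributes, and pages without slots are left out, which is an assumption of this sketch.

```python
def pages_from_content(content, pb_at_top=True):
    """Chop the content of one element into (slots, features) pages."""
    pages = []
    slots, feats = [], {}
    for item in content:
        if isinstance(item, dict):
            if pb_at_top:
                # The <pb> opens a new page: close the running one first.
                if slots:
                    pages.append((slots, feats))
                slots, feats = [], item
            else:
                # The <pb> closes the running page and supplies its features.
                if slots:
                    pages.append((slots, item))
                slots, feats = [], {}
        else:
            slots.append(item)
    if slots:
        # Remaining material: the last page (pb at the top) or an extra
        # page without features (pb at the bottom).
        pages.append((slots, feats))
    return pages


top = pages_from_content([{"n": "1"}, 1, 2, {"n": "2"}, 3])
bottom = pages_from_content(
    [1, 2, {"n": "1"}, 3, {"n": "2"}, 4], pb_at_top=False
)
```

With the <pb> at the bottom, the slot after the last <pb> ends up in an extra page that has no features.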

sectionModel

dict, optional {}

If not passed, or an empty dict, section model I is assumed. A section model must be specified with the parameters relevant for the model:

dict(
    model="II",
    levels=["chapter", "chunk"],
    element="head",
    attributes=dict(rend="h3"),
)

(model I does not require the element and attribute parameters)

or

dict(
    model="I",
    levels=["folder", "file", "chunk"],
)

This section model (I) accepts a few other parameters:

    backMatter="backmatter"

This is the name of the folder that should not be treated as an ordinary folder, but as the folder with the sources for the back-matter, such as references, lists, indices, bibliography, biographies, etc.

    drillDownDivs=True

Whether the chunks are the immediate children of body elements, or whether we should drill through all intervening div levels.

For model II, the default parameters are:

element="head"
levels=["chapter", "chunk"],
attributes={}

In model I, there are three section levels in total. The corpus is divided in folders (section level 1), files (section level 2), and chunks within files. The parameter levels allows you to choose names for the node types of these section levels.

In model II, there are 2 section levels in total. The corpus consists of a single file, and section nodes will be added for nodes at various levels, mainly outermost <div> and <p> elements and their siblings of other element types. The section heading for the second level is taken from elements in the neighbourhood, whose name is given in the parameter element, but only if they carry some attributes, which can be specified in the attributes parameter.

Usage

Command-line

tf-fromtei tasks flags

From Python

from tf.convert.tei import TEI

T = TEI()
T.task(**tasks, **flags)

For a short overview of the tasks and flags, see HELP.

Tasks

We have the following conversion tasks:

  1. check: makes an inventory of all XML elements and attributes used.
  2. convert: produces actual TF files by converting XML files.
  3. load: loads the generated TF for the first time, by which the pre-computation step is triggered. During pre-computation some checks are performed. Once this has succeeded, we have a workable TF dataset.
  4. app: creates or updates a corpus specific TF app with minimal sensible settings, plus basic documentation.
  5. apptoken: updates a corpus specific TF app from a character-based dataset to a token-based dataset.
  6. browse: starts the TF browser on the newly created dataset.

Tasks can be run by passing any choice of task keywords to the TEI.task() method.

Note on versions

The TEI source files come in versions, indicated with a date. The converter picks the most recent one, unless you specify another one:

tf-fromtei tei=-2  # previous version
tf-fromtei tei=0  # first version
tf-fromtei tei=3  # third version
tf-fromtei tei=2019-12-23  # explicit version

The resulting TF data is independently versioned, like 1.2.3 or 1.2.3pre. When the converter runs, by default it overwrites the most recent version, unless you specify another one.

It looks at the latest version and then bumps a part of the version number.

tf-fromtei tf=3  # minor version, 1.2.3 becomes 1.2.4; 1.2.3pre becomes 1.2.4pre
tf-fromtei tf=2  # intermediate version, 1.2.3 becomes 1.3.0
tf-fromtei tf=1  # major version, 1.2.3 becomes 2.0.0
tf-fromtei tf=1.8.3  # explicit version
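
The bumping rules shown in these commands can be written down as a small function. This is a sketch of the arithmetic only, not the converter's code. Keeping the pre suffix is documented above for the minor bump; that the other bumps keep it as well is an assumption.

```python
def bump(version, part):
    """Bump part 1 (major), 2 (intermediate) or 3 (minor) of a version."""
    suffix = "pre" if version.endswith("pre") else ""
    core = version[: -len(suffix)] if suffix else version
    numbers = [int(n) for n in core.split(".")]
    index = part - 1
    numbers[index] += 1
    # Everything to the right of the bumped part is reset to zero.
    for i in range(index + 1, len(numbers)):
        numbers[i] = 0
    return ".".join(str(n) for n in numbers) + suffix
```

For example, bump("1.2.3", 2) gives 1.3.0 and bump("1.2.3pre", 3) gives 1.2.4pre, in line with the comments in the commands above.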

Examples

Exactly how you can call the methods of this module is demonstrated in the small corpus of 14 letters by the Dutch artist Piet Mondriaan.

"""
# TEI import

You can convert any TEI source into TF by specifying a few details about the source.

TF then invokes the `tf.convert.walker` machinery to produce a TF
dataset out of the source.

TF knows the TEI elements, because it will read and parse the complete
TEI schema. From this the set of complex, mixed elements is distilled.

If the TEI source conforms to a customised TEI schema, it will be detected and
the importer will read it and override the generic information of the TEI elements.

It is also possible to pass a choice of template and adaptation in a processing
instruction. This does not influence validation, but it may influence further
processing.

If the TEI consists of multiple source files, it is possible to specify different
templates and adaptations for different files.

The possible values for models, templates, and adaptations should be declared in
the configuration file.
For each model there should be a corresponding schema in the schema directory,
either an RNG or an XSD file.

The converter goes the extra mile: it generates a TF app and documentation
(an *about.md* file and a *transcription.md* file), in such a way that the TF
browser is instantly usable.

The TEI conversion is rather straightforward because of some conventions
that cannot be changed.

# Configuration and customization

We assume that you have a `programs` directory at the top-level of your repo.
In this directory we'll look for two optional files:

*   a file `tei.yaml` in which you specify a bunch of values to
    get the conversion off the ground.

*   a file `tei.py` in which you define custom functions that are executed at certain
    specific hooks:

    *   `transform(text)` which takes a text string argument and delivers a
        text string as result. The converter will call this on every TEI input
        file it reads *before* feeding it to the XML parser.
        This can be used to solve some quirks in the input, e.g. replacing two
        consecutive commas (`,,`) by a single unicode character (`„` = 201E);
    *   `beforeTag`: just before the walker starts processing the start tag of
        a TEI element;
    *   `beforeChildren`: just after processing the start tag, but before processing
        the element content (text and child elements);
    *   `afterChildren`: just after processing the complete element content
        (text and child elements), but before processing the end tag of the
        TEI element;
    *   `afterTag`: just after processing the end tag of a TEI element.

        The  `before` and `after` functions should take the following arguments

        *   `cv`: the walker converter object;
        *   `cur`: the dictionary with information that has been gathered during the
            conversion so far and that can be used to dump new information
            into; it is nonlocal, i.e. all invocations of the hooks get the same
            dictionary object passed to them;
        *   `xnode`: the LXML node corresponding to the TEI element;
        *   `tag`: the tag name of the element, without namespaces;
            this is a bit redundant, because it can also be extracted from
            the `xnode`, but it is convenient.
        *   `atts`: the attributes (names and values) of the element,
            without namespaces;
            this is a bit redundant, because it can also be extracted from
            the `xnode`, but it is convenient.

        These functions should not return anything, but they can write things to
        the `cur` dictionary.
        And they can create slots, nodes, and terminate them, in short, they
        can do every `cv`-based action that is needed.

        You can define these functions out of this context, but it is good to know
        what information in `cur` is guaranteed to be available:

        *   `xnest`: the stack of XML tag names seen at this point;
        *   `tnest`: the stack of TF nodes built at this point;
        *   `tsiblings` (only if sibling nodes are being recorded): the list of
            preceding TF nodes corresponding to the TEI sibling elements of the
            current TEI element.

## Keys and values of the `tei.yaml` file

### `generic`

dict, optional `{}`

Metadata for all generated TF features.
The actual source version of the TEI files does not have to be stated here,
it will be inserted based on the version that the converter will actually use.
That version depends on the `tei` argument passed to the program.
The key under which the source version will be inserted is `teiVersion`.

### `extra`

dict, optional `{}`

Instructions and metadata for specific generated TF features, namely those that
have not been generated by the vanilla TEI conversion, but by extra code in one
of the customised hooks.
The dict is keyed by feature name, the values are again dictionaries.
These value dictionaries have a key `meta` under which any number of metadata key-value
pairs can be given, such as `description="xxx"`.

If you put the string «base» in such a field, it will be expanded on the
basis of the contents of the `path` key, see below.

You must provide the key `valueType` and pass `int` or `str` there, depending on the
values of the feature.
You may provide extra keys, such as `conversionMethod="derived"`, so that other programs
can determine what to do with these features.
The information in this dict will also end up in the generated feature docs.

Besides the `meta` key, there may also be the keys `path` and `nodeType`.
Together they contain an instruction to produce a feature value from element content
that can be found on the current stack of XML nodes and attributes.
The value found will be put in the feature in question
for the most recently constructed node of the type specified in `nodeType`.

Example:

``` yaml
extra:
  letterid:
    meta:
      description: The identifier of a letter; «base»
      valueType: str
      conversionMethod: derived
      conversionCode: tt
    path:
      - idno:
        type: letterId
      - altIdentifier
      - msIdentifier
      - msDesc
      - sourceDesc
    nodeType: letter
    feature: letterid
```

The meaning is:

*   if, while parsing the XML, I encounter an element `idno`,
*   and if that element has an attribute `type` with value `letterId`,
*   and if it has parent `altIdentifier`,
*   and grandparent `msIdentifier`,
*   and great-grandparent `msDesc`,
*   and great-great-grandparent `sourceDesc`,
*   then look up the last created node of type `letter`
*   and get the text content of the current XML node (the `idno` one),
*   and put it in the feature `letterid` for that node.
*   Moreover, the feature `letterid` gets metadata as specified under the key `meta`,
    where the `description` will be filled with the text

        The identifier of a letter; the content is taken from sourceDesc/msDesc/msIdentifier/altIdentifier/idno[type=letterId]

### `models`

list, optional `[]`

Which TEI-based schemas are to be used.
For each model there should be an XSD or RNG file with that name in the `schema`
directory. The `tei_all` schema is known to TF, no need to specify that one.

We'll try a RelaxNG schema (`.rng`) first. If that exists, we use it for validation
with JING, and we also convert it with TRANG to an XSD schema, which we use for
analysing the schema: we want to know which elements are mixed and which are pure.

If there is no RelaxNG schema, we try an XSD schema (`.xsd`). If that exists,
we can do the analysis, and we will use it also for validation.

!!! note "Problems with RelaxNG validation"
    RelaxNG validation is not always reliable when performed with LXML, or any tool
    based on `libxml`, for that matter. That's why we try to avoid it. Even if we
    translate the RelaxNG schema to an XSD schema by means of TRANG, the resulting
    validation is not always reliable. So we use JING to validate against the RelaxNG schema.

See also [JING-TRANG](https://code.google.com/archive/p/jing-trang/downloads).

### `templates`

list, optional `[]`

Which template(s) are to be used.
A template is just a keyword, associated with an XML file, that can be used to switch
to a specific kind of processing, such as `letter`, `bibliolist`, `artworklist`.

You may specify an element or processing instruction with an attribute
that triggers the template for the file in which it is found.

This will be retrieved from the file before XML parsing starts.
For example,

``` python
    templateTrigger="?editem@template"
```

will read the file and extract the value of the `template` attribute of the `editem`
processing instruction and use that as the template for this file.
If no template is found in this way, the empty template is assumed.

### `adaptations`

list, optional `[]`

Which adaptation(s) are to be used.
An adaptation is just a keyword, associated with an XML file, that can be used to switch
to a specific kind of processing.
It is meant to trigger tweaks on top of the behaviour of a template.

You may specify an element or processing instruction with an attribute
that triggers the adaptation for the file in which it is found.

This will be retrieved from the file before XML parsing starts.
For example,

``` python
    adaptationTrigger="?editem@adaptation"
```

will read the file and extract the value of the `adaptation` attribute of the `editem`
processing instruction and use that as the adaptation for this file.
If no adaptation is found in this way, the empty adaptation is assumed.

### `prelim`

boolean, optional `True`

Whether to work with the `pre` TF versions.
Use this if you convert TEI to a preliminary TF dataset, which will
receive NLP additions later on. That version will then lose the `pre`.

### `granularity`

string, optional `token`

What to take as the basic entities (slots). Possible values:

*   `word`: words are slots, even if they cross element boundaries. This leads to some
    imprecisions: words containing an element boundary will belong to just one
    of both elements around the boundary.
*   `char`: all individual characters are separate slots. Very precise, but the dataset
    gets expensive with so many slots.
*   `token`: every sequence of alphanumeric characters becomes a token, insofar as there
    is no intervening markup. Non-alphanumeric characters become separate tokens.
    There are some additional rules: `.` or `,` tightly surrounded by digits also
    count as tokens.

The datasets with granularity `word` and `token` have features `str` for the string
content of the slots, and `after` for the material after the slots.
In the case of `word`, the feature `after` can contain whitespace and punctuation
characters, in the case of `token`, it only contains whitespace.

With granularity `char`, the individual characters are taken as basic entities.

If you use an NLP pipeline to detect tokens, use the value `False`.
The preliminary dataset is then based on characters, but the final dataset that we build
from there is based on tokens, which are mostly words and non-word characters.

### `parentEdges`

boolean, optional `True`

Whether to create edges between nodes that correspond to XML elements and their parents.

### `siblingEdges`

boolean, optional `False`

Whether to create edges between nodes that correspond to XML elements and siblings.
Edges will be created between each sibling and its *preceding* siblings.
If you use these edges in the binary way, you can also find the following siblings.
The edges are labeled with the distance between the siblings, adjacent siblings
get distance 1.

!!! caution "Overwhelming space requirement"
    If the corpus is divided into relatively few elements that each have very many
    direct children, the number of sibling edges is comparable to the size of the
    corpus squared. That means that 50-99% of the TF dataset will consist of
    sibling edges!
    An example is [`ETCBC/nestle1904`](https://github.com/ETCBC/nestle1904) (Greek New
    Testament) where each book element has all of its sentences as direct children.
    In that dataset, the siblings would occupy 40% of the size, and we have taken care
    not to produce sibling edges for sentences.

### `procins`

boolean, optional `False`

If `True`, processing instructions will be handled.
Processing instruction `<?foo bar="xxx"?>` will be converted as if it were an empty
element named `foo` with attribute `bar` with value `xxx`.


### `lineModel`

dict, optional `False`

If not passed, or an empty dict, line model I is assumed.
A line model must be specified with the parameters relevant for the
model:

``` python
dict(
    model="I",
)
```

(model I does not require any parameters)

or

``` python
dict(
    model="II",
    element="p",
    nodeType="ln",
)
```

For model II, the default parameters are:

``` python
element="p",
nodeType="ln",
```

Model I is the default, and nothing special happens to the `<lb>` elements.

In model II the `<lb>` elements translate to nodes of type `ln`, which span
content, whereas the original `lb` elements just mark positions.
Instead of `ln`, you can also specify another node type by the parameter `nodeType`.

We assume that the material that the `<lb>` elements divide up is the material
that corresponds to their `<p>` parent element. Instead of `<p>`,
you can also specify another element in the parameter `element`.

We assume that lines start and end at the start and end of the `<p>` elements and
the `<lb>` elements. For the material between these boundaries, we build `ln` nodes.
If an `<lb>` element follows a `<p>` start tag without intervening slots, a `ln`
node will be created but not linked to slots, and it will be deleted later in
the conversion.
Likewise, if an `<lb>` element is followed by a `<p>` end tag without
intervening slots, a `ln` node is created that is not linked to slots.

The attributes of the `<lb>` elements become features of the `ln` node that starts
with that `<lb>` element. If there is no explicit `<lb>` element at the start of
a paragraph, the first `ln` node of that paragraph gets no features.
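
The rules above can be sketched as follows. This is a simplified illustration, not
the converter's own code: the content of one `<p>` is represented as a flat list of
events, where a string stands for a slot and a dict stands for an `<lb>` element with
its attributes. The name `linesOf` is hypothetical.

```python
def linesOf(events):
    # Each line is a pair (features, slots).
    # The first line starts at the start of the paragraph, without features.
    lines = []
    current = ({}, [])

    for event in events:
        if isinstance(event, dict):
            # an <lb>: close the current line and start a new one
            # that carries the attributes of this <lb>
            lines.append(current)
            current = (event, [])
        else:
            current[1].append(event)

    lines.append(current)

    # lines that are not linked to slots are deleted
    return [(features, slots) for (features, slots) in lines if slots]


lines = linesOf(["a", {"n": "2"}, "b", "c", {"n": "3"}])
```

The first line has no features, because no `<lb>` precedes it. The `<lb>` with `n="3"`
is directly followed by the end of the paragraph, so its line has no slots and is
dropped.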


### `pageModel`

dict, optional `{}`

If not passed, or an empty dict, page model I is assumed.
A page model must be specified with the parameters relevant for the
model:

``` python
dict(
    model="I",
)
```

(model I does not require any parameters)

or

``` python
dict(
    model="II",
    element="div",
    attributes=dict(type=["original", "translation"]),
    pbAtTop=True,
    nodeType="page",
)
```

For model II, the default parameters are:

``` python
element="div",
pbAtTop=True,
nodeType="page",
attributes={},
```

Model I is the default, and nothing special happens to the `<pb>` elements.

In model II the `<pb>` elements translate to nodes of type `page`, which span
content, whereas the original `pb` elements just mark positions.
Instead of `page`, you can also specify another node type by the parameter `nodeType`.

We assume that the material that the `<pb>` elements divide up is the material
that corresponds to their `<div>` parent element. Instead of `<div>`,
you can also specify another element in the parameter `element`.
If you want to restrict the parent elements of pages, you can do so by specifying
attributes, like `type="original"`. Then only parents that carry those attributes
will be chopped up into pages.
You can specify multiple values for each attribute. Elements that carry one of these
values are candidates for having their content divided into pages.
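
A minimal sketch of such a test, assuming that the attributes of an element are
available as a dict and that every attribute named in the page model must match.
The function `isCandidate` is a hypothetical helper, not a function of TF.

```python
def isCandidate(atts, required):
    # atts: the attributes of an element, as a dict
    # required: attribute name -> list of acceptable values,
    # like the `attributes` parameter of the page model
    return all(atts.get(name) in values for (name, values) in required.items())


required = dict(type=["original", "translation"])
yes = isCandidate(dict(type="original", n="1"), required)
no = isCandidate(dict(type="notes"), required)
```

With an empty `attributes` dict, every element with the right name is a candidate.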

We assume that the material to be divided starts with a `<pb>` (as the TEI-guidelines
prescribe) and we translate it to a page element that we close either at the
next `<pb>` or at the end of the `div`.

But if you specify `pbAtTop=False`, we assume that the `<pb>` marks the end of
the corresponding page element. We start the first page at the start of the enclosing
element. If there is material between the last `<pb>` and the end of the enclosing
element, we generate an extra page node without features.
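
The difference between the two settings of `pbAtTop` can be sketched like this.
It is a simplified illustration under the same assumptions as before: the content of
the enclosing element is a flat list of events, a string stands for a slot and a dict
stands for a `<pb>` element with its attributes. The name `pagesOf` is hypothetical.

```python
def pagesOf(events, pbAtTop=True):
    # Each page is a pair (features, slots).
    pages = []
    features = {}
    slots = []

    for event in events:
        if not isinstance(event, dict):
            slots.append(event)
        elif pbAtTop:
            # the <pb> starts a new page: close the page before it, if any
            if slots:
                pages.append((features, slots))
            features, slots = event, []
        else:
            # the <pb> ends the page that has been collected so far
            pages.append((event, slots))
            slots = []

    if slots:
        # remaining material: with pbAtTop=False this is the extra page
        # without features
        pages.append((features if pbAtTop else {}, slots))

    return pages


top = pagesOf([{"n": "1"}, "a", "b", {"n": "2"}, "c"])
bottom = pagesOf(["a", "b", {"n": "1"}, "c"], pbAtTop=False)
```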


### `sectionModel`

dict, optional `{}`

If not passed, or an empty dict, section model I is assumed.
A section model must be specified with the parameters relevant for the
model:

``` python
dict(
    model="II",
    levels=["chapter", "chunk"],
    element="head",
    attributes=dict(rend="h3"),
)
```

(model I does not require the *element* and *attributes* parameters)

or

``` python
dict(
    model="I",
    levels=["folder", "file", "chunk"],
)
```

This section model (I) accepts a few other parameters:

``` python
    backMatter="backmatter"
```

This is the name of the folder that should not be treated as an ordinary folder, but
as the folder with the sources for the back-matter, such as references, lists, indices,
bibliography, biographies, etc.

``` python
    drillDownDivs=True
```

Whether the chunks are the immediate children of `body` elements, or whether
we should drill through all intervening `div` levels.

For model II, the default parameters are:

``` python
element="head",
levels=["chapter", "chunk"],
attributes={},
```

In model I, there are three section levels in total.
The corpus is divided into folders (section level 1), files (section level 2),
and chunks within files. The parameter `levels` allows you to choose names for the
node types of these section levels.

In model II, there are 2 section levels in total.
The corpus consists of a single file, and section nodes will be added
for nodes at various levels, mainly outermost `<div>` and `<p>` elements and their
siblings of other element types.
The section heading for the second level is taken from elements in the neighbourhood,
whose name is given in the parameter `element`, but only if they carry some attributes,
which can be specified in the `attributes` parameter.
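
The names in `levels` end up in the text configuration of the dataset. The sketch
below mirrors how the converter fills the `sectionFeatures` and `sectionTypes` keys
from the level names; the function `sectionConfig` itself is a hypothetical wrapper
for the purpose of illustration.

```python
def sectionConfig(levels):
    # the level names serve both as node types and as feature names
    names = ",".join(levels)
    return dict(sectionFeatures=names, sectionTypes=names)


modelI = sectionConfig(["folder", "file", "chunk"])
modelII = sectionConfig(["chapter", "chunk"])
```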


# Usage

## Command-line

``` sh
tf-fromtei tasks flags
```

## From Python

``` python
from tf.convert.tei import TEI

T = TEI()
T.task(**tasks, **flags)
```

For a short overview of the tasks and flags, see `HELP`.

## Tasks

We have the following conversion tasks:

1.  `check`: makes an inventory of all XML elements and attributes used.
1.  `convert`: produces actual TF files by converting XML files.
1.  `load`: loads the generated TF for the first time, by which the pre-computation
    step is triggered. During pre-computation some checks are performed. Once this
    has succeeded, we have a workable TF dataset.
1.  `app`: creates or updates a corpus specific TF app with minimal sensible settings,
    plus basic documentation.
1.  `apptoken`: updates a corpus specific TF app from a character-based dataset
    to a token-based dataset.
1.  `browse`: starts the TF browser on the newly created dataset.

Tasks can be run by passing any choice of task keywords to the
`TEI.task()` method.

## Note on versions

The TEI source files come in versions, indicated with a date.
The converter picks the most recent one, unless you specify another one:

``` sh
tf-fromtei tei=-2  # the version before the previous one
tf-fromtei tei=0  # latest version
tf-fromtei tei=3  # third version
tf-fromtei tei=2019-12-23  # explicit version
```
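
The selection follows the description of the `tei` parameter of the `TEI` class:
`0` and `latest` select the latest version, negative numbers count back from the
latest version, and positive numbers count from the first version. A minimal sketch,
assuming that the list of versions is sorted from old to new; the function
`pickVersion` is hypothetical and does not handle out-of-range numbers gracefully.

```python
def pickVersion(versions, tei="latest"):
    # versions: the available source versions, sorted from old to new
    tei = str(tei)

    if tei in {"latest", ""}:
        return versions[-1]

    if tei.lstrip("-").isdecimal():
        # 0 -> latest, -1 -> previous, 1 -> first, 2 -> second, ...
        return versions[int(tei) - 1]

    # anything else is an explicit version name
    return tei


versions = ["2019-12-23", "2020-05-01", "2021-01-10"]
```

For example, `pickVersion(versions, "-1")` yields `2020-05-01` and
`pickVersion(versions, "1")` yields `2019-12-23`.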

The resulting TF data is independently versioned, like `1.2.3` or `1.2.3pre`.
When the converter runs, by default it overwrites the most recent version,
unless you specify another one.

It looks at the latest version and then bumps a part of the version number.

``` sh
tf-fromtei tf=3  # minor version, 1.2.3 becomes 1.2.4; 1.2.3pre becomes 1.2.4pre
tf-fromtei tf=2  # intermediate version, 1.2.3 becomes 1.3.0
tf-fromtei tf=1  # major version, 1.2.3 becomes 2.0.0
tf-fromtei tf=1.8.3  # explicit version
```
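
The bumping logic can be sketched as below. It is an illustration and not the
converter's own code: the function `bumpVersion` is hypothetical, the suffix `pre` is
taken from the examples above, and `str.removesuffix` requires Python 3.9 or higher.

```python
PRE = "pre"  # assumed suffix of preliminary versions


def bumpVersion(latest, part):
    # part: 1 = major, 2 = intermediate, 3 = minor
    isPre = latest.endswith(PRE)
    numbers = [int(p) for p in latest.removesuffix(PRE).split(".")]

    if not 1 <= part <= len(numbers):
        raise ValueError(f"Cannot bump part {part} of version {latest}")

    # increase the chosen part and reset all parts after it
    numbers[part - 1] += 1
    for i in range(part, len(numbers)):
        numbers[i] = 0

    return ".".join(str(n) for n in numbers) + (PRE if isPre else "")


bumped = [bumpVersion("1.2.3", part) for part in (1, 2, 3)]
```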

## Examples

Exactly how you can call the methods of this module is demonstrated in the small
corpus of 14 letters by the Dutch artist Piet Mondriaan.

*   [Mondriaan](https://nbviewer.org/github/annotation/mondriaan/blob/master/programs/convertExpress.ipynb).
"""

import sys
import collections
import re
from textwrap import dedent, wrap
from io import BytesIO
from subprocess import run
from importlib import util

from ..capable import CheckImport
from .helpers import (
    setUp,
    tweakTrans,
    checkModel,
    matchModel,
    lookupSource,
    tokenize,
    getWhites,
    NODE,
    FILE,
    PAGE,
    LINE,
    PRE,
    ZWSP,
    XNEST,
    TNEST,
    TSIB,
    WORD,
    TOKEN,
    T,
    CHAR,
    CONVERSION_METHODS,
    CM_LIT,
    CM_LITP,
    CM_LITC,
    CM_PROV,
)
from ..parameters import BRANCH_DEFAULT_NEW
from ..fabric import Fabric
from ..core.helpers import console, versionSort, mergeDict
from ..convert.walker import CV
from ..core.timestamp import AUTO, DEEP, TERSE
from ..core.command import readArgs
from ..core.files import (
    fileOpen,
    abspath,
    expanduser as ex,
    unexpanduser as ux,
    getLocation,
    initTree,
    fileNm,
    dirNm,
    dirExists,
    dirContents,
    fileExists,
    fileCopy,
    scanDir,
    readYaml,
    writeYaml,
)

from ..tools.xmlschema import Analysis


(HELP, TASKS, TASKS_EXCLUDED, PARAMS, FLAGS) = setUp("TEI")

CSS_REND = dict(
    h1=(
        "heading of level 1",
        dedent(
            """
        font-size: xx-large;
        font-weight: bold;
        margin-top: 3rem;
        margin-bottom: 1rem;
        """
        ),
    ),
    h2=(
        "heading of level 2",
        dedent(
            """
        font-size: x-large;
        font-weight: bold;
        margin-top: 2rem;
        margin-bottom: 1rem;
        """
        ),
    ),
    h3=(
        "heading of level 3",
        dedent(
            """
        font-size: large;
        font-weight: bold;
        margin-top: 1rem;
        margin-bottom: 0.5rem;
        """
        ),
    ),
    h4=(
        "heading of level 4",
        dedent(
            """
        font-size: large;
        font-style: italic;
        margin-top: 1rem;
        margin-bottom: 0.5rem;
        """
        ),
    ),
    h5=(
        "heading of level 5",
        dedent(
            """
        font-size: medium;
        font-weight: bold;
        font-variant: small-caps;
        margin-top: 0.5rem;
        margin-bottom: 0.25rem;
        """
        ),
    ),
    h6=(
        "heading of level 6",
        dedent(
            """
        font-size: medium;
        font-weight: normal;
        font-variant: small-caps;
        margin-top: 0.25rem;
        margin-bottom: 0.125rem;
        """
        ),
    ),
    italic=(
        "cursive font style",
        dedent(
            """
        font-style: italic;
        """
        ),
    ),
    bold=(
        "bold font weight",
        dedent(
            """
        font-weight: bold;
        """
        ),
    ),
    underline=(
        "underlined text",
        dedent(
            """
        text-decoration: underline;
        """
        ),
    ),
    center=(
        "horizontally centered text",
        dedent(
            """
        text-align: center;
        """
        ),
    ),
    large=(
        "large font size",
        dedent(
            """
        font-size: large;
        """
        ),
    ),
    spaced=(
        "widely spaced between characters",
        dedent(
            """
        letter-spacing: .2rem;
        """
        ),
    ),
    margin=(
        "in the margin",
        dedent(
            """
        position: relative;
        top: -0.3em;
        font-weight: bold;
        color: #0000ee;
        """
        ),
    ),
    above=(
        "above the line",
        dedent(
            """
        position: relative;
        top: -0.3em;
        """
        ),
    ),
    below=(
        "below the line",
        dedent(
            """
        position: relative;
        top: 0.3em;
        """
        ),
    ),
    small_caps=(
        "small-caps font variation",
        dedent(
            """
        font-variant: small-caps;
        """
        ),
    ),
    sub=(
        "as subscript",
        dedent(
            """
        vertical-align: sub;
        font-size: small;
        """
        ),
    ),
    super=(
        "as superscript",
        dedent(
            """
        vertical-align: super;
        font-size: small;
        """
        ),
    ),
)
CSS_REND_ALIAS = dict(
    italic="italics i",
    bold="b",
    underline="ul",
    spaced="spat",
    small_caps="smallcaps sc",
    super="sup",
)


PROGRESS_LIMIT = 5
KNOWN_RENDS = set()
REND_DESC = {}


REFERENCING = dict(
    ptr="target",
    ref="target",
    rs="ref",
)


def makeCssInfo():
    """Make the CSS info for the style sheet."""
    rends = ""

    for rend, (description, css) in sorted(CSS_REND.items()):
        aliases = CSS_REND_ALIAS.get(rend, "")
        aliases = sorted(set(aliases.split()) | {rend})
        for alias in aliases:
            KNOWN_RENDS.add(alias)
            REND_DESC[alias] = description
        selector = ",".join(f".r_{alias}" for alias in aliases)
        contribution = f"\n{selector} {{{css}}}\n"
        rends += contribution

    return rends


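def getRefs(tag, atts, xmlFile):
    """Collect the local reference targets of a referencing element.

    Only the tags in `REFERENCING` are considered, and attribute values that
    start with `http` are skipped.

    Returns a list of tuples `(attribute, targetFile, targetId)`, one for each
    whitespace-separated part of the attribute value. A part without a file name
    (such as `#id`) refers to `xmlFile` itself.
    """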
    refAtt = REFERENCING.get(tag, None)
    result = []

    if refAtt is not None:
        refVal = atts.get(refAtt, None)
        if refVal is not None and not refVal.startswith("http"):
            for refv in refVal.split():
                parts = refv.split("#", 1)
                if len(parts) == 1:
                    targetFile = refv
                    targetId = ""
                else:
                    (targetFile, targetId) = parts
                if targetFile == "":
                    targetFile = xmlFile
                result.append((refAtt, targetFile, targetId))
    return result


class TEI(CheckImport):
    def __init__(
        self,
        tei=PARAMS["tei"][1],
        tf=PARAMS["tf"][1],
        validate=PARAMS["validate"][1],
        verbose=FLAGS["verbose"][1],
    ):
        """Converts TEI to TF.

        For documentation of the resulting encoding, read the
        [transcription template](https://github.com/annotation/text-fabric/blob/master/tf/convert/app/transcription.md).

        Below we describe how to control the conversion machinery.

        We adopt a fair bit of "convention over configuration" here, in order to lessen
        the burden for the user of specifying so many details.

        Based on the current directory from which the script is called,
        it defines all the ingredients to carry out
        a `tf.convert.walker` conversion of the TEI input.

        This function is assumed to work in the context of a repository,
        i.e. a directory on your computer relative to which the input directory exists,
        and various output directories: `tf`, `app`, `docs`.

        Your current directory must be at

        ```
        ~/backend/org/repo/relative
        ```

        where

        *   `~` is your home directory;
        *   `backend` is an online back-end name,
            like `github`, `gitlab`, `git.huc.knaw.nl`;
        *   `org` is an organization, person, or group in the back-end;
        *   `repo` is a repository in the `org`;
        *   `relative` is a directory path within the repo (0 or more components).

        This is only about the directory structure on your local computer;
        it is not required that you have online incarnations of your repository
        in that back-end.
        Even your local repository does not have to be a git repository.

        The only thing that matters is that the full path to your repo can be parsed
        as a sequence of `home/backend/org/repo/relative`.

        Relative to this directory the program expects and creates
        input / output directories.

        ## Input directories

        ### `tei`

        *Location of the TEI-XML sources.*

        **If it does not exist, the program aborts with an error.**

        Several levels of subdirectories are assumed:

        1.  the version of the source (this could be a date string).
        1.  volumes / collections of documents. The subdirectory `__ignore__` is ignored.
        1.  the TEI documents themselves, conforming to the TEI schema or some
            customization of it.

        ### `schema`

        *TEI or other XML schemas against which the sources can be validated.*

        They should be XSD or RNG files.

        !!! note "Multiple XSD files"
            When you started with a RNG file and used `tf.tools.xmlschema` to
            convert it to XSD, you may have got multiple XSD files.
            One of them has the same base name as the original RNG file,
            and you should pass that name. It will import the remaining XSD files,
            so do not throw them away.

        We use these files as custom TEI schemas,
        but to be sure, we still analyse the full TEI schema and
        use the schemas here as a set of overriding element definitions.

        ## Output directories

        ### `report`

        Directory to write the results of the `check` task to: an inventory
        of elements / attributes encountered, and possible validation errors.
        If the directory does not exist, it will be created.
        The default value is `.` (i.e. the current directory in which
        the script is invoked).

        ### `tf`

        The directory under which the TF output files (with extension `.tf`)
        are placed.
        If it does not exist, it will be created.
        The TF files will be generated in a folder named by a version number,
        determined by the parameter `tf`.

        ### `app` and `docs`

        Location of additional TF app configuration and documentation files.
        If they do not exist, they will be created with some sensible default
        settings and generated documentation.
        These settings can be overridden in the `app/config_custom.yaml` file.
        Also a default `display.css` file and a logo are added.

        Custom content for these files can be provided in files
        with `_custom` appended to their base name.

        ### `docs`

        Location of additional documentation.
        This can be generated or hand-written material, or a mixture of the two.

        Parameters
        ----------
        tei: string, optional ""
            If empty, use the latest version under the `tei` directory with sources.
            If it can be parsed as an integer, it is an index into the
            sorted list of versions there:

            *   `0` or `latest`: latest version;
            *   `-1`, `-2`, ... : previous version, version before previous, ...;
            *   `1`, `2`, ...: first version, second version, ...;
            *   everything else is used as the exact version name.

        tf: string, optional ""
            If empty, the TF version used will be the latest one under the `tf`
            directory. If the setting `prelim` in `tei.yaml` is true (the default),
            only versions ending in `pre` will be taken into account.

            If it can be parsed as the integers 1, 2, or 3 it will bump the latest
            relevant TF version:

            *   `0` or `latest`: overwrite the latest version
            *   `1` will bump the major version
            *   `2` will bump the intermediate version
            *   `3` will bump the minor version
            *   everything else is an explicit version

            Otherwise, the value is taken as the exact version name.

        verbose: integer, optional -1
            Produce no (-1), some (0) or many (1) progress and reporting messages

        """
        super().__init__("lxml")
        if self.importOK(hint=True):
            self.etree = self.importGet()
        else:
            return

        self.good = True

        (backend, org, repo, relative) = getLocation()

        if any(s is None for s in (backend, org, repo, relative)):
            console(
                (
                    "Not working in a repo: "
                    f"backend={backend} org={org} repo={repo} relative={relative}"
                ),
                error=True,
            )
            self.good = False
            return

        if verbose == 1:
            console(
                f"Working in repository {org}/{repo}{relative} in back-end {backend}"
            )

        base = ex(f"~/{backend}")
        repoDir = f"{base}/{org}/{repo}"
        refDir = f"{repoDir}{relative}"
        programDir = f"{refDir}/programs"
        schemaDir = f"{refDir}/schema"
        convertSpec = f"{programDir}/tei.yaml"
        convertCustom = f"{programDir}/tei.py"

        self.schemaDir = schemaDir

        settings = readYaml(asFile=convertSpec, plain=True)

        customKeys = set(
            """
            transform
            beforeTag
            beforeChildren
            afterChildren
            afterTag
        """.strip().split()
        )

        functionType = type(lambda x: x)

        if fileExists(convertCustom):
            hooked = []

            try:
                spec = util.spec_from_file_location("teicustom", convertCustom)
                code = util.module_from_spec(spec)
                sys.path.insert(0, dirNm(convertCustom))
                spec.loader.exec_module(code)
                sys.path.pop(0)
                for method in customKeys:
                    if not hasattr(code, method):
                        continue

                    func = getattr(code, method)
                    typeFunc = type(func)
                    if typeFunc is not functionType:
                        console(
                            (
                                f"custom member {method} should be a function, "
                                f"but it is a {typeFunc.__name__}"
                            ),
                            error=True,
                        )
                        continue

                    methodC = f"{method}Custom"
                    setattr(self, methodC, func)
                    hooked.append(method)

            except Exception as e:
                console(str(e), error=True)
                for method in customKeys:
                    if not hasattr(self, method):
                        methodC = f"{method}Custom"
                        setattr(self, methodC, None)

            if verbose >= 0:
                console("With custom behaviour hooked in at:")
                for method in hooked:
                    methodC = f"{method}Custom"
                    console(f"\t{methodC} = {ux(convertCustom)}.{method}")

        generic = settings.get("generic", {})
        extra = settings.get("extra", {})
        models = settings.get("models", [])
        templates = settings.get("templates", [])
        templateTrigger = settings.get("templateTrigger", None)
        adaptations = settings.get("adaptations", [])
        adaptationTrigger = settings.get("adaptationTrigger", None)
        prelim = settings.get("prelim", True)
        granularity = settings.get("granularity", TOKEN)
        wordAsSlot = granularity == WORD
        tokenAsSlot = granularity == TOKEN
        charAsSlot = granularity == CHAR
        parentEdges = settings.get("parentEdges", True)
        siblingEdges = settings.get("siblingEdges", True)
        procins = settings.get("procins", False)

        lineModel = settings.get("lineModel", {})
        lineModel = checkModel(LINE, lineModel, verbose)

        if not lineModel:
            self.good = False
            return

        makeLineElems = lineModel["model"] == "II"
        lineProperties = lineModel.get("properties", None)
        lineModel = lineModel["model"]

        self.makeLineElems = makeLineElems
        self.lineModel = lineModel
        self.lineProperties = lineProperties

        pageModel = settings.get("pageModel", {})
        pageModel = checkModel(PAGE, pageModel, verbose)

        if not pageModel:
            self.good = False
            return

        makePageElems = pageModel["model"] == "II"
        pageProperties = pageModel.get("properties", None)
        pageModel = pageModel["model"]

        self.makePageElems = makePageElems
        self.pageModel = pageModel
        self.pageProperties = pageProperties

        sectionModel = settings.get("sectionModel", {})
        sectionModel = checkModel("section", sectionModel, verbose)
        if not sectionModel:
            self.good = False
            return

        sectionProperties = sectionModel.get("properties", None)
        sectionModel = sectionModel["model"]
        self.sectionModel = sectionModel
        self.sectionProperties = sectionProperties

        self.generic = generic
        self.extra = extra
        self.models = models
        self.templates = templates
        self.adaptations = adaptations

        if templateTrigger is None:
            self.templateAtt = None
            self.templateTag = None
        else:
            (tag, att) = templateTrigger.split("@")
            self.templateAtt = att
            self.templateTag = tag

        if adaptationTrigger is None:
            self.adaptationAtt = None
            self.adaptationTag = None
        else:
            (tag, att) = adaptationTrigger.split("@")
            self.adaptationAtt = att
            self.adaptationTag = tag

        templateTag = self.templateTag
        templateAtt = self.templateAtt
        adaptationTag = self.adaptationTag
        adaptationAtt = self.adaptationAtt

        triggers = {}
        self.triggers = triggers

        for kind, theAtt, theTag in (
            ("template", templateAtt, templateTag),
            ("adaptation", adaptationAtt, adaptationTag),
        ):
            triggerRe = None

            if theAtt is not None and theTag is not None:
                tagPat = re.escape(theTag)
                triggerRe = re.compile(
                    rf"""<{tagPat}\b[^>]*?{theAtt}=['"]([^'"]+)['"]"""
                )
            triggers[kind] = triggerRe

        self.A = Analysis(verbose=verbose)
        self.readSchemas()

        self.prelim = prelim
        self.wordAsSlot = wordAsSlot
        self.tokenAsSlot = tokenAsSlot
        self.charAsSlot = charAsSlot
        self.parentEdges = parentEdges
        self.siblingEdges = siblingEdges
        self.procins = procins

        reportDir = f"{refDir}/report"
        appDir = f"{refDir}/app"
        docsDir = f"{refDir}/docs"
        teiDir = f"{refDir}/tei"
        tfDir = f"{refDir}/tf"

        teiVersions = sorted(dirContents(teiDir)[1], key=versionSort)
        nTeiVersions = len(teiVersions)

        if tei in {"latest", "", "0", 0} or str(tei).lstrip("-").isdecimal():
            teiIndex = (0 if tei in {"latest", ""} else int(tei)) - 1

            try:
                teiVersion = teiVersions[teiIndex]
            except Exception:
                absIndex = teiIndex + (nTeiVersions if teiIndex < 0 else 0) + 1
                console(
                    (
                        f"no item {absIndex} in {nTeiVersions} source versions "
                        f"in {ux(teiDir)}"
                    )
                    if len(teiVersions)
                    else f"no source versions in {ux(teiDir)}",
                    error=True,
                )
                self.good = False
                return
        else:
            teiVersion = tei

        teiPath = f"{teiDir}/{teiVersion}"
        reportPath = f"{reportDir}/{teiVersion}"

        if not dirExists(teiPath):
            console(
                f"source version {teiVersion} does not exist in {ux(teiDir)}",
                error=True,
            )
            self.good = False
            return

        teiStatuses = {tv: i for (i, tv) in enumerate(reversed(teiVersions))}
        teiStatus = teiStatuses[teiVersion]
        teiStatusRep = (
            "most recent"
            if teiStatus == 0
            else "previous"
            if teiStatus == 1
            else f"{teiStatus - 1} before previous"
        )
        if teiStatus == len(teiVersions) - 1 and len(teiVersions) > 1:
            teiStatusRep = "oldest"

        if verbose >= 0:
            console(f"TEI data version is {teiVersion} ({teiStatusRep})")

        tfVersions = sorted(dirContents(tfDir)[1], key=versionSort)
        if prelim:
            tfVersions = [tv for tv in tfVersions if tv.endswith(PRE)]

        latestTfVersion = (
            tfVersions[-1] if len(tfVersions) else ("0.0.0" + (PRE if prelim else ""))
        )
        if tf in {"latest", "", "0", 0}:
            tfVersion = latestTfVersion
            vRep = "latest"
        elif tf in {"1", "2", "3", 1, 2, 3}:
            bump = int(tf)
            parts = latestTfVersion.split(".")

            def getVer(b):
                return (
                    int(parts[b].removesuffix(PRE))
                    if prelim and b == len(parts) - 1
                    else int(parts[b])
                )

            def setVer(b, val):
                parts[b] = f"{val}{PRE}" if prelim and b == len(parts) - 1 else f"{val}"

            if bump > len(parts):
                console(
                    f"Cannot bump part {bump} of latest TF version {latestTfVersion}",
                    error=True,
                )
                self.good = False
                return
            else:
                b1 = bump - 1
                old = getVer(b1)
                setVer(b1, old + 1)
                for b in range(b1 + 1, len(parts)):
                    setVer(b, 0)
                tfVersion = ".".join(str(p) for p in parts)
                vRep = (
                    "major" if bump == 1 else "intermediate" if bump == 2 else "minor"
                )
                vRep = f"next {vRep}"
        else:
            tfVersion = tf
            status = "existing" if dirExists(f"{tfDir}/{tfVersion}") else "new"
            vRep = f"explicit {status}"

        tfPath = f"{tfDir}/{tfVersion}"

        if verbose >= 0:
            console(f"TF data version is {tfVersion} ({vRep})")
            console(
                f"Processing instructions are {'treated' if procins else 'ignored'}"
            )

        self.refDir = refDir
        self.teiVersion = teiVersion
        self.teiPath = teiPath
        self.tfVersion = tfVersion
        self.tfPath = tfPath
        self.reportPath = reportPath
        self.tfDir = tfDir
        self.appDir = appDir
        self.docsDir = docsDir
        self.backend = backend
        self.org = org
        self.repo = repo
        self.relative = relative

        levelNames = sectionProperties["levels"]
        self.levelNames = levelNames
        self.chunkLevel = levelNames[-1]

        if sectionModel == "II":
            self.chapterSection = levelNames[0]
            self.chunkSection = levelNames[1]
        else:
            self.folderSection = levelNames[0]
            self.fileSection = levelNames[1]
            self.chunkSection = levelNames[2]
            self.backMatter = sectionProperties.get("backMatter", None)

        chunkSection = self.chunkSection
        intFeatures = {"empty", chunkSection}
        self.intFeatures = intFeatures

        if siblingEdges:
            intFeatures.add("sibling")

        slotType = WORD if wordAsSlot else T if tokenAsSlot else CHAR
        self.slotType = slotType

        sectionFeatures = ",".join(levelNames)
        sectionTypes = ",".join(levelNames)

        textFeatures = "{ch}" if charAsSlot else "{str}{after}"
        otext = {
            "fmt:text-orig-full": textFeatures,
            "sectionFeatures": sectionFeatures,
            "sectionTypes": sectionTypes,
        }
        self.otext = otext

        featureMeta = dict(
            str=dict(
                description="the text of a word or token",
                conversionMethod=CM_LITC,
                conversionCode=CONVERSION_METHODS[CM_LITC],
            ),
            after=dict(
                description="the text after a word till the next word",
                conversionMethod=CM_LITC,
                conversionCode=CONVERSION_METHODS[CM_LITC],
            ),
            empty=dict(
                description="whether a slot has been inserted in an empty element",
                conversionMethod=CM_PROV,
                conversionCode=CONVERSION_METHODS[CM_PROV],
            ),
            is_meta=dict(
                description="whether a slot or word is in the teiHeader element",
                conversionMethod=CM_LITC,
                conversionCode=CONVERSION_METHODS[CM_LITC],
            ),
            is_note=dict(
                description="whether a slot or word is in the note element",
                conversionMethod=CM_LITC,
                conversionCode=CONVERSION_METHODS[CM_LITC],
            ),
        )
        if charAsSlot:
            featureMeta["extraspace"] = dict(
                description=(
                    "whether a space has been added after a character, "
                    "when it is in the direct child of a pure XML element"
                ),
                conversionMethod=CM_LITP,
                conversionCode=CONVERSION_METHODS[CM_LITP],
            )
            featureMeta["ch"] = dict(
                description="the UNICODE character of a slot",
                conversionMethod=CM_LITC,
                conversionCode=CONVERSION_METHODS[CM_LITC],
            )
        if parentEdges:
            featureMeta["parent"] = dict(
                description="edge between a node and its parent node",
                conversionMethod=CM_LITP,
                conversionCode=CONVERSION_METHODS[CM_LITP],
            )
        if siblingEdges:
            featureMeta["sibling"] = dict(
                description=(
                    "edge between a node and its preceding sibling nodes; "
                    "labeled with the distance between them"
                ),
                conversionMethod=CM_LITP,
                conversionCode=CONVERSION_METHODS[CM_LITP],
            )
        featureMeta[chunkSection] = dict(
            description=f"number of a {chunkSection} within a document",
            conversionMethod=CM_PROV,
            conversionCode=CONVERSION_METHODS[CM_PROV],
        )

        if sectionModel == "II":
            chapterSection = self.chapterSection
            featureMeta[chapterSection] = dict(
                description=f"name of {chapterSection}",
                conversionMethod=CM_PROV,
                conversionCode=CONVERSION_METHODS[CM_PROV],
            )
        else:
            folderSection = self.folderSection
            fileSection = self.fileSection
            featureMeta[folderSection] = dict(
                description=f"name of source {folderSection}",
                conversionMethod=CM_PROV,
                conversionCode=CONVERSION_METHODS[CM_PROV],
            )
            featureMeta[fileSection] = dict(
                description=f"name of source {fileSection}",
                conversionMethod=CM_PROV,
                conversionCode=CONVERSION_METHODS[CM_PROV],
            )

        self.featureMeta = featureMeta

        generic["sourceFormat"] = "TEI"
        generic["version"] = tfVersion
        generic["teiVersion"] = teiVersion
        generic["schema"] = "TEI" + ((" + " + " + ".join(models)) if models else "")

        extraInstructions = []

        for feat, featSpecs in extra.items():
            featMeta = featSpecs.get("meta", {})
            if "valueType" in featMeta:
                if featMeta["valueType"] == "int":
                    intFeatures.add(feat)
                del featMeta["valueType"]

            featPath = featSpecs.get("path", None)
            featPathRep = "" if featPath is None else "the content is taken from "
            featPathLogical = []

            sep = ""
            for comp in reversed(featPath or []):
                if isinstance(comp, str):
                    featPathRep += f"{sep}{comp}"
                    featPathLogical.append((comp, None))
                else:
                    for tag, atts in comp.items():
                        # there is only one item in this dict
                        featPathRep += f"{sep}{tag}["
                        featPathRep += ",".join(
                            f"{att}={v}" for (att, v) in sorted(atts.items())
                        )
                        featPathRep += "]"
                        featPathLogical.append((tag, atts))
                sep = "/"

            featureMeta[feat] = {
                k: v.replace("«base»", featPathRep) for (k, v) in featMeta.items()
            }
            nodeType = featSpecs.get("nodeType", None)
            if nodeType is not None and featPath:
                extraInstructions.append(
                    (list(reversed(featPathLogical)), nodeType, feat)
                )

        self.extraInstructions = tuple(extraInstructions)

        self.verbose = verbose
        self.validate = validate
        myDir = dirNm(abspath(__file__))
        self.myDir = myDir

    def readSchemas(self):
        """Collect the schema files for the base TEI and for each model.

        For the base TEI (model `None`) and for every declared model,
        the RNG and XSD schema files are looked up.
        If a model has an RNG schema but no XSD schema,
        an XSD schema is generated from the RNG one.

        The results are stored in the attributes `schemaFiles`, `modelInfo`,
        `modelXsd` and `modelInv`.
        """
        schemaDir = self.schemaDir
        models = self.models
        A = self.A

        schemaFiles = dict(rng={}, xsd={})
        self.schemaFiles = schemaFiles
        modelInfo = {}
        self.modelInfo = modelInfo
        modelXsd = {}
        self.modelXsd = modelXsd
        modelInv = {}
        self.modelInv = modelInv

        for model in [None] + models:
            for kind in ("rng", "xsd"):
                schemaFile = (
                    A.getBaseSchema()[kind]
                    if model is None
                    else f"{schemaDir}/{model}.{kind}"
                )
                if fileExists(schemaFile):
                    schemaFiles[kind][model] = schemaFile
                    if kind == "rng" or (
                        kind == "xsd" and model not in schemaFiles["rng"]
                    ):
                        modelInfo[model] = schemaFile
            if model in schemaFiles["rng"] and model not in schemaFiles["xsd"]:
                schemaFileXsd = f"{schemaDir}/{model}.xsd"
                A.fromrelax(schemaFiles["rng"][model], schemaFileXsd)
                schemaFiles["xsd"][model] = schemaFileXsd

        baseSchema = schemaFiles["xsd"][None]
        modelXsd[None] = baseSchema
        modelInv[(baseSchema, None)] = None

        for model in models:
            override = schemaFiles["xsd"][model]
            modelXsd[model] = override
            modelInv[(baseSchema, override)] = model

    def getSwitches(self, xmlPath):
        """Detect the model, adaptation and template that an XML file asks for.

        Parameters
        ----------
        xmlPath: string
            Path of the XML file to inspect.

        Returns
        -------
        tuple
            The model, adaptation and template found in the file, in that order.
            A member is `None` if it is not specified in the file,
            or if its value is not among the available ones.
            The model `tei_all` counts as not specified.
        """
        verbose = self.verbose
        models = self.models
        adaptations = self.adaptations
        templates = self.templates
        triggers = self.triggers
        A = self.A

        with fileOpen(xmlPath) as fh:
            text = fh.read()

        found = {}

        for kind, allOfKind in (
            ("model", models),
            ("adaptation", adaptations),
            ("template", templates),
        ):
            if kind == "model":
                result = A.getModel(text)
                if result is None or result == "tei_all":
                    result = None
            else:
                result = None
                triggerRe = triggers[kind]
                if triggerRe is not None:
                    match = triggerRe.search(text)
                    result = match.group(1) if match else None

            if result is not None and result not in allOfKind:
                if verbose >= 0:
                    console(f"unavailable {kind} {result} in {ux(xmlPath)}")
                result = None
            found[kind] = result

        return (found["model"], found["adaptation"], found["template"])

    def getParser(self):
        """Configure the LXML parser.

        See [parser options](https://lxml.de/parsing.html#parser-options).

        Returns
        -------
        object | None
            A configured LXML parser object,
            or `None` if the import check (`importOK()`) fails.
        """
        if not self.importOK():
            return None

        etree = self.etree
        procins = self.procins

        return etree.XMLParser(
            remove_blank_text=False,
            collect_ids=False,
            remove_comments=True,
            remove_pis=not procins,
            huge_tree=True,
        )

    def getXML(self):
        """Make an inventory of the TEI source files.

        Returns
        -------
        tuple of tuple | string | None
            If section model I is in force:

            The outer tuple has sorted entries corresponding to folders under the
            TEI input directory.
            Each such entry consists of the folder name and an inner tuple
            that contains the file names in that folder, sorted.
            The folder with back matter, if present, comes last.

            If section model II is in force:

            The name of the first XML file found in the TEI input directory,
            or `None` if there is no such file.
        """
        verbose = self.verbose
        teiPath = self.teiPath
        sectionModel = self.sectionModel
        if verbose == 1:
            console(f"Section model {sectionModel}")

        if sectionModel == "I":
            backMatter = self.backMatter

            IGNORE = "__ignore__"

            xmlFilesRaw = collections.defaultdict(list)

            with scanDir(teiPath) as dh:
                for folder in dh:
                    folderName = folder.name
                    if folderName == IGNORE:
                        continue
                    if not folder.is_dir():
                        continue
                    with scanDir(f"{teiPath}/{folderName}") as fh:
                        for file in fh:
                            fileName = file.name
                            if not (
                                fileName.lower().endswith(".xml") and file.is_file()
                            ):
                                continue
                            xmlFilesRaw[folderName].append(fileName)

            xmlFiles = []
            hasBackMatter = False

            for folderName in sorted(xmlFilesRaw, key=versionSort):
                if folderName == backMatter:
                    hasBackMatter = True
                else:
                    fileNames = xmlFilesRaw[folderName]
                    xmlFiles.append((folderName, tuple(sorted(fileNames))))

            if hasBackMatter:
                fileNames = xmlFilesRaw[backMatter]
                xmlFiles.append((backMatter, tuple(sorted(fileNames))))

            xmlFiles = tuple(xmlFiles)

            return xmlFiles

        if sectionModel == "II":
            xmlFile = None
            with scanDir(teiPath) as fh:
                for file in fh:
                    fileName = file.name
                    if not (fileName.lower().endswith(".xml") and file.is_file()):
                        continue
                    xmlFile = fileName
                    break
            return xmlFile

    def checkTask(self):
        """Implementation of the "check" task.

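        It validates the TEI, but only if validation has been switched on
        when constructing the `TEI()` object.
        Each file is validated against the schema of the model it declares;
        files that do not declare a model are validated against the generic
        TEI schema.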

        Then it makes an inventory of all elements and attributes in the TEI files.

        If tags are used in multiple namespaces, it will be reported.

        !!! caution "Conflation of namespaces"
            The TEI to TF conversion does construct node types and attributes
            without taking namespaces into account.
            However, the parsing process is namespace aware.

        The inventory lists all elements and attributes, and many attribute values.
        But it represents every digit as `N`, and the values of the attributes
        `id`, `key`, `target` and `value` are reduced to `X`.

        This information reduction helps to get a clear overview.

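        It writes reports to the `reportPath`:

        *   `errors.txt`: validation errors
        *   `elements.txt`: element / attribute inventory
        *   `types.txt`: the type and content model (pure or mixed) of the
            elements, per model
        *   `namespaces.txt`: the namespaces in which each tag occurs
        *   `lb-parents.txt`: the parent elements of `lb` elements, with counts
        *   `ids.txt` and `refs.txt`: identifiers and the references to them.

        It also writes `elements.md`, a markdown version of the inventory,
        to the docs directory.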
        """
        if not self.importOK():
            return

        if not self.good:
            return

        verbose = self.verbose
        procins = self.procins
        validate = self.validate
        modelInfo = self.modelInfo
        modelInv = self.modelInv
        modelXsd = self.modelXsd
        A = self.A
        etree = self.etree

        teiPath = self.teiPath
        reportPath = self.reportPath
        docsDir = self.docsDir
        sectionModel = self.sectionModel

        if verbose == 1:
            console(f"TEI to TF checking: {ux(teiPath)} => {ux(reportPath)}")
        if verbose >= 0:
            console(
                f"Processing instructions are {'treated' if procins else 'ignored'}"
            )
            console(f"XML validation will be {'performed' if validate else 'skipped'}")

        kindLabels = dict(
            format="Formatting Attributes",
            keyword="Keyword Attributes",
            rest="Remaining Attributes and Elements",
        )
        getStore = lambda: collections.defaultdict(  # noqa: E731
            lambda: collections.defaultdict(collections.Counter)
        )
        analysis = {x: getStore() for x in kindLabels}
        errors = []
        tagByNs = collections.defaultdict(collections.Counter)
        refs = collections.defaultdict(collections.Counter)
        ids = collections.defaultdict(collections.Counter)

        parser = self.getParser()
        baseSchema = modelXsd[None]
        overrides = [
            override for (model, override) in modelXsd.items() if model is not None
        ]
        A.getElementInfo(baseSchema, overrides, verbose=verbose)
        elementDefs = A.elementDefs

        initTree(reportPath)
        initTree(docsDir)

        nProcins = 0

        lbParents = collections.Counter()

        def analyse(root, analysis, xmlFile):
            FORMAT_ATTS = set(
                """
                dim
                level
                place
                rend
            """.strip().split()
            )

            KEYWORD_ATTS = set(
                """
                facs
                form
                function
                lang
                reason
                type
                unit
                who
            """.strip().split()
            )

            TRIM_ATTS = set(
                """
                id
                key
                target
                value
            """.strip().split()
            )

            NUM_RE = re.compile(r"""[0-9]""", re.S)

            def nodeInfo(xnode):
                nonlocal nProcins

                if procins and isinstance(xnode, etree._ProcessingInstruction):
                    target = xnode.target
                    tag = f"?{target}"
                    ns = ""
                    nProcins += 1
                else:
                    qName = etree.QName(xnode.tag)
                    tag = qName.localname
                    ns = qName.namespace

                atts = {etree.QName(k).localname: v for (k, v) in xnode.attrib.items()}

                tagByNs[tag][ns] += 1

                if tag == "lb":
                    parentTag = etree.QName(xnode.getparent().tag).localname
                    lbParents[parentTag] += 1

                if len(atts) == 0:
                    kind = "rest"
                    analysis[kind][tag][""][""] += 1
                else:
                    idv = atts.get("id", None)

                    if idv is not None:
                        ids[xmlFile][idv] += 1

                    for refAtt, targetFile, targetId in getRefs(tag, atts, xmlFile):
                        refs[xmlFile][(targetFile, targetId)] += 1

                    for k, v in atts.items():
                        kind = (
                            "format"
                            if k in FORMAT_ATTS
                            else "keyword"
                            if k in KEYWORD_ATTS
                            else "rest"
                        )
                        dest = analysis[kind]

                        if kind == "rest":
                            vTrim = "X" if k in TRIM_ATTS else NUM_RE.sub("N", v)
                            dest[tag][k][vTrim] += 1
                        else:
                            words = v.strip().split()
                            for w in words:
                                dest[tag][k][w.strip()] += 1

                for child in xnode.iterchildren(
                    tag=(etree.Element, etree.ProcessingInstruction)
                    if procins
                    else etree.Element
                ):
                    nodeInfo(child)

            nodeInfo(root)

        def writeErrors():
            """Write the errors to a file."""

            errorFile = f"{reportPath}/errors.txt"

            nErrors = 0
            nFiles = 0

            with fileOpen(errorFile, mode="w") as fh:
                prevFolder = None
                prevFile = None

                for folder, file, line, col, kind, text in errors:
                    newFolder = prevFolder != folder
                    newFile = newFolder or prevFile != file

                    if newFile:
                        nFiles += 1

                    if kind == "error":
                        nErrors += 1

                    indent1 = f"{folder}\n\t" if newFolder else "\t"
                    indent2 = f"{file}\n\t\t" if newFile else "\t"
                    loc = f"{line or ''}:{col or ''}"
                    text = "\n".join(wrap(text, width=80, subsequent_indent="\t\t\t"))
                    fh.write(f"{indent1}{indent2}{loc} {kind or ''} {text}\n")
                    prevFolder = folder
                    prevFile = file

            if nErrors:
                console(
                    (
                        f"{nErrors} validation error(s) in {nFiles} file(s) "
                        f"written to {errorFile}"
                    ),
                    error=True,
                )
            else:
                if verbose >= 0:
                    if validate:
                        console("Validation OK")
                    else:
                        console("No validation performed")

        def writeNamespaces():
            errorFile = f"{reportPath}/namespaces.txt"

            nErrors = 0

            nTags = len(tagByNs)

            with fileOpen(errorFile, mode="w") as fh:
                for tag, nsInfo in sorted(
                    tagByNs.items(), key=lambda x: (-len(x[1]), x[0])
                ):
                    label = "OK"
                    nNs = len(nsInfo)
                    if nNs > 1:
                        nErrors += 1
                        label = "XX"

                    for ns, amount in sorted(
                        nsInfo.items(), key=lambda x: (-x[1], x[0])
                    ):
                        fh.write(
                            f"{label} {nNs:>2} namespace for "
                            f"{tag:<16} : {amount:>5}x {ns}\n"
                        )

            if verbose >= 0:
                if procins:
                    plural = "" if nProcins == 1 else "s"
                    console(f"{nProcins} processing instruction{plural} encountered.")

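                # the enclosing test already guarantees verbose >= 0,
                # so test on verbose == 1 to keep the short message reachable
                console(
                    f"{nTags} tags of which {nErrors} with multiple namespaces "
                    f"written to {errorFile}"
                    if verbose == 1 or nErrors
                    else "Namespaces OK"
                )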

        def writeReport():
            reportFile = f"{reportPath}/elements.txt"
            with fileOpen(reportFile, mode="w") as fh:
                fh.write(
                    "Inventory of tags and attributes in the source XML file(s).\n"
                    "Contains the following sections:\n"
                )
                for label in kindLabels.values():
                    fh.write(f"\t{label}\n")
                fh.write("\n\n")

                infoLines = 0

                def writeAttInfo(tag, att, attInfo):
                    nonlocal infoLines
                    nl = "" if tag == "" else "\n"
                    tagRep = "" if tag == "" else f"<{tag}>"
                    attRep = "" if att == "" else f"{att}="
                    atts = sorted(attInfo.items())
                    (val, amount) = atts[0]
                    fh.write(
                        f"{nl}\t{tagRep:<18} " f"{attRep:<11} {amount:>5}x {val}\n"
                    )
                    infoLines += 1
                    for val, amount in atts[1:]:
                        fh.write(
                            f"""\t{'':<7}{'':<18} {'"':<18} {amount:>5}x {val}\n"""
                        )
                        infoLines += 1

                def writeTagInfo(tag, tagInfo):
                    nonlocal infoLines
                    tags = sorted(tagInfo.items())
                    (att, attInfo) = tags[0]
                    writeAttInfo(tag, att, attInfo)
                    infoLines += 1
                    for att, attInfo in tags[1:]:
                        writeAttInfo("", att, attInfo)

                for kind, label in kindLabels.items():
                    fh.write(f"\n{label}\n")
                    for tag, tagInfo in sorted(analysis[kind].items()):
                        writeTagInfo(tag, tagInfo)

            if verbose >= 0:
                console(f"{infoLines} info line(s) written to {reportFile}")

        def writeElemTypes():
            elemsCombined = {}

            modelSet = set()

            for schemaOverride, eDefs in elementDefs.items():
                model = modelInv[schemaOverride]
                modelSet.add(model)
                for tag, (typ, mixed) in eDefs.items():
                    elemsCombined.setdefault(tag, {}).setdefault(model, {})
                    elemsCombined[tag][model]["typ"] = typ
                    elemsCombined[tag][model]["mixed"] = mixed

            tagReport = {}

            for tag, tagInfo in elemsCombined.items():
                tagLines = []
                tagReport[tag] = tagLines

                if None in tagInfo:
                    teiInfo = tagInfo[None]
                    teiTyp = teiInfo["typ"]
                    teiMixed = teiInfo["mixed"]
                    teiTypRep = "??" if teiTyp is None else teiTyp
                    teiMixedRep = (
                        "??" if teiMixed is None else "mixed" if teiMixed else "pure"
                    )
                    mds = ["TEI"]

                    for model in sorted(x for x in tagInfo if x is not None):
                        info = tagInfo[model]
                        typ = info["typ"]
                        mixed = info["mixed"]
                        if typ == teiTyp and mixed == teiMixed:
                            mds.append(model)
                        else:
                            typRep = (
                                "" if typ == teiTyp else "??" if typ is None else typ
                            )
                            mixedRep = (
                                ""
                                if mixed == teiMixed
                                else "??"
                                if mixed is None
                                else "mixed"
                                if mixed
                                else "pure"
                            )
                            tagLines.append((tag, [model], typRep, mixedRep))
                    tagLines.insert(0, (tag, mds, teiTypRep, teiMixedRep))
                else:
                    for model in sorted(tagInfo):
                        info = tagInfo[model]
                        typ = info["typ"]
                        mixed = info["mixed"]
                        typRep = "??" if typ is None else typ
                        mixedRep = (
                            "??" if mixed is None else "mixed" if mixed else "pure"
                        )
                        tagLines.append((tag, [model], typRep, mixedRep))

            reportFile = f"{reportPath}/types.txt"
            with fileOpen(reportFile, mode="w") as fh:
                for tag in sorted(tagReport):
                    tagLines = tagReport[tag]
                    for tag, mds, typ, mixed in tagLines:
                        model = ",".join(mds)
                        fh.write(f"{tag:<18} {model:<18} {typ:<7} {mixed:<5}\n")

            if verbose >= 0:
                console(
                    f"{len(elemsCombined)} tag(s) type info written to {reportFile}"
                )

        def writeLbParents():
            reportFile = f"{reportPath}/lb-parents.txt"

            with fileOpen(reportFile, mode="w") as fh:
                for parent, n in sorted(lbParents.items()):
                    fh.write(f"{n:>5} x {parent}\n")

            if verbose >= 0:
                console(f"lb-parent info written to {reportFile}")

        def writeIdRefs():
            reportIdFile = f"{reportPath}/ids.txt"
            reportRefFile = f"{reportPath}/refs.txt"

            ih = fileOpen(reportIdFile, mode="w")
            rh = fileOpen(reportRefFile, mode="w")

            refdIds = collections.Counter()
            missingIds = set()

            totalRefs = 0
            totalRefsU = 0

            totalResolvable = 0
            totalResolvableU = 0
            totalDangling = 0
            totalDanglingU = 0

            seenItems = set()

            for file, items in refs.items():
                rh.write(f"{file}\n")

                resolvable = 0
                resolvableU = 0
                dangling = 0
                danglingU = 0

                for item, n in sorted(items.items()):
                    totalRefs += n

                    if item in seenItems:
                        newItem = False
                    else:
                        seenItems.add(item)
                        newItem = True
                        totalRefsU += 1

                    (target, idv) = item

                    if target not in ids or idv not in ids[target]:
                        status = "dangling"
                        dangling += n

                        if newItem:
                            missingIds.add((target, idv))
                            danglingU += 1
                    else:
                        status = "ok"
                        resolvable += n
                        refdIds[(target, idv)] += n

                        if newItem:
                            resolvableU += 1
                    rh.write(f"\t{status:<10} {n:>5} x {target} # {idv}\n")

                msgs = (
                    f"\tDangling:   {dangling:>4} x {danglingU:>4}",
                    f"\tResolvable: {resolvable:>4} x {resolvableU:>4}",
                )
                for msg in msgs:
                    rh.write(f"{msg}\n")

                totalResolvable += resolvable
                totalResolvableU += resolvableU
                totalDangling += dangling
                totalDanglingU += danglingU

            if verbose >= 0:
                console(f"Refs written to {reportRefFile}")
                msgs = (
                    f"\tresolvable: {totalResolvableU:>4} in {totalResolvable:>4}",
                    f"\tdangling:   {totalDanglingU:>4} in {totalDangling:>4}",
                    f"\tALL:        {totalRefsU:>4} in {totalRefs:>4} ",
                )
                for msg in msgs:
                    console(msg)

            totalIds = 0
            totalIdsU = 0
            totalIdsM = 0
            totalIdsRefd = 0
            totalIdsRefdU = 0
            totalIdsUnused = 0

            for file, items in ids.items():
                totalIds += len(items)

                ih.write(f"{file}\n")

                unique = 0
                multiple = 0
                refd = 0
                refdU = 0
                unused = 0

                for item, n in sorted(items.items()):
                    nRefs = refdIds.get((file, item), 0)

                    if n == 1:
                        unique += 1
                    else:
                        multiple += 1

                    if nRefs == 0:
                        unused += 1
                    else:
                        refd += nRefs
                        refdU += 1

                    status1 = f"{n}x"
                    plural = "" if nRefs == 1 else "s"
                    status2 = f"{nRefs}ref{plural}"

                    ih.write(f"\t{status1:<8} {status2:<8} {item}\n")

                msgs = (
                    f"\tUnique:     {unique:>4}",
                    f"\tNon-unique: {multiple:>4}",
                    f"\tUnused:     {unused:>4}",
                    f"\tReferenced: {refd:>4} x {refdU:>4}",
                )
                for msg in msgs:
                    ih.write(f"{msg}\n")

                totalIdsU += unique
                totalIdsM += multiple
                totalIdsRefdU += refdU
                totalIdsRefd += refd
                totalIdsUnused += unused

            if verbose >= 0:
                console(f"Ids written to {reportIdFile}")
                msgs = (
                    f"\treferenced: {totalIdsRefdU:>4} by {totalIdsRefd:>4}",
                    f"\tnon-unique: {totalIdsM:>4}",
                    f"\tunused:     {totalIdsUnused:>4}",
                    f"\tALL:        {totalIdsU:>4} in {totalIds:>4}",
                )
                for msg in msgs:
                    console(msg)

            # close the report files explicitly, so that all content is flushed
            ih.close()
            rh.close()

        def writeDoc():
            teiUrl = "https://tei-c.org/release/doc/tei-p5-doc/en/html"
            elUrlPrefix = f"{teiUrl}/ref-"
            attUrlPrefix = f"{teiUrl}/REF-ATTS.html#"
            docFile = f"{docsDir}/elements.md"
            with fileOpen(docFile, mode="w") as fh:
                fh.write(
                    dedent(
                        """
                        # Element and attribute inventory

                        Table of contents

                        """
                    )
                )
                for label in kindLabels.values():
                    labelAnchor = label.replace(" ", "-")
                    fh.write(f"*\t[{label}](#{labelAnchor})\n")

                fh.write("\n")

                tableHeader = dedent(
                    """
                    | element | attribute | value | amount
                    | --- | --- | --- | ---
                    """
                )

                def writeAttInfo(tag, att, attInfo):
                    tagRep = " " if tag == "" else f"[{tag}]({elUrlPrefix}{tag}.html)"
                    attRep = " " if att == "" else f"[{att}]({attUrlPrefix}{att})"
                    atts = sorted(attInfo.items())
                    (val, amount) = atts[0]
                    valRep = f"`{val}`" if val else ""
                    fh.write(
                        "| "
                        + (
                            " | ".join(
                                str(x)
                                for x in (
                                    tagRep,
                                    attRep,
                                    valRep,
                                    amount,
                                )
                            )
                        )
                        + "\n"
                    )
                    for val, amount in atts[1:]:
                        valRep = f"`{val}`" if val else ""
                        fh.write(f"""| | | {valRep} | {amount}\n""")

                def writeTagInfo(tag, tagInfo):
                    tags = sorted(tagInfo.items())
                    (att, attInfo) = tags[0]
                    writeAttInfo(tag, att, attInfo)
                    for att, attInfo in tags[1:]:
                        writeAttInfo("", att, attInfo)

                for kind, label in kindLabels.items():
                    fh.write(f"## {label}\n{tableHeader}")
                    for tag, tagInfo in sorted(analysis[kind].items()):
                        writeTagInfo(tag, tagInfo)
                    fh.write("\n")

        def filterError(msg):
            return msg == (
                "Element 'graphic', attribute 'url': [facet 'pattern'] "
                "The value '' is not accepted by the pattern '\\S+'."
            )

        def doXMLFile(xmlPath):
            tree = etree.parse(xmlPath, parser)
            root = tree.getroot()
            xmlFile = fileNm(xmlPath)
            ids[xmlFile][""] = 1
            analyse(root, analysis, xmlFile)

        xmlFilesByModel = collections.defaultdict(list)

        if sectionModel == "I":
            i = 0
            for xmlFolder, xmlFiles in self.getXML():
                msg = "Start " if verbose >= 0 else "\t"
                console(f"{msg}folder {xmlFolder}:")
                j = 0
                cr = ""
                nl = True

                for xmlFile in xmlFiles:
                    i += 1
                    j += 1
                    if j > PROGRESS_LIMIT:
                        cr = "\r"
                        nl = False
                    xmlPath = f"{teiPath}/{xmlFolder}/{xmlFile}"
                    (model, adapt, tpl) = self.getSwitches(xmlPath)
                    mdRep = model or "TEI"
                    tplRep = tpl or ""
                    adRep = adapt or ""

                    label = f"{mdRep:<12} {tplRep:<12} {adRep:<12}"

                    if verbose >= 0:
                        console(f"{cr}{i:>4} {label} {xmlFile:<50}", newline=nl)
                    xmlFilesByModel[model].append(xmlPath)
                if verbose >= 0:
                    console("")
                    console(f"End   folder {xmlFolder}")

        elif sectionModel == "II":
            xmlFile = self.getXML()
            if xmlFile is None:
                console("No XML files found!", error=True)
                return False

            xmlPath = f"{teiPath}/{xmlFile}"
            (model, adapt, tpl) = self.getSwitches(xmlPath)
            xmlFilesByModel[model].append(xmlPath)

        good = True

        for model, xmlPaths in xmlFilesByModel.items():
            if verbose >= 0:
                console(f"{len(xmlPaths)} {model or 'TEI'} file(s) ...")

            thisGood = True

            if validate:
                if verbose >= 0:
                    console("\tValidating ...")

                schemaFile = modelInfo.get(model, None)

                if schemaFile is None:
                    if verbose >= 0:
                        console(f"\t\tNo schema file for {model}")
                    if good is not None and good is not False:
                        good = None
                    continue

                (thisGood, info, theseErrors) = A.validate(schemaFile, xmlPaths)

                for line in info:
                    if verbose >= 0:
                        console(f"\t\t{line}")

            if not thisGood:
                good = False
                errors.extend(theseErrors)

            if verbose >= 0:
                console("\tMaking inventory ...")
            for xmlPath in xmlPaths:
                doXMLFile(xmlPath)

        if not good:
            self.good = False

        if verbose >= 0:
            console("")
        writeErrors()
        writeReport()
        writeElemTypes()
        writeDoc()
        writeNamespaces()
        writeIdRefs()
        writeLbParents()

    # SET UP CONVERSION

    def getConverter(self):
        """Initializes a converter.

        Returns
        -------
        object
            The `tf.convert.walker.CV` converter object, initialized.
        """
        verbose = self.verbose
        tfPath = self.tfPath

        silent = AUTO if verbose == 1 else TERSE if verbose == 0 else DEEP
        TF = Fabric(locations=tfPath, silent=silent)
        return CV(TF, silent=silent)

    # DIRECTOR

    def getDirector(self):
        """Factory for the director function.

        The `tf.convert.walker` relies on a corpus-dependent `director` function
        that walks through the source data and spits out actions that
        produce the TF dataset.

        The director function that walks through the TEI input must be conditioned
        by the properties defined in the TEI schema and the customised schema, if any,
        that describes the source.

        Also some special additions need to be programmed, such as an extra section
        level, word boundaries, etc.

        We collect all needed data, store it, and define a local director function
        that has access to this data.

        Returns
        -------
        function
            The local director function that has been constructed.
        """
        if not self.importOK():
            return

        if not self.good:
            return

        TEI_HEADER = "teiHeader"

        TEXT_ANCESTOR = "text"
        TEXT_ANCESTORS = set(
            """
            front
            body
            back
            group
            """.strip().split()
        )
        CHUNK_PARENTS = TEXT_ANCESTORS | {TEI_HEADER}

        CHUNK_ELEMS = set(
            """
            facsimile
            fsdDecl
            sourceDoc
            standOff
            """.strip().split()
        )

        PASS_THROUGH = set(
            """
            TEI
            """.strip().split()
        )

        # CHECKING

        HY = "\u2010"  # hyphen

        IN_WORD_HYPHENS = {HY, "-"}

        procins = self.procins
        verbose = self.verbose
        teiPath = self.teiPath
        wordAsSlot = self.wordAsSlot
        tokenAsSlot = self.tokenAsSlot
        parentEdges = self.parentEdges
        siblingEdges = self.siblingEdges
        featureMeta = self.featureMeta
        intFeatures = self.intFeatures
        transform = getattr(self, "transformCustom", None)
        chunkLevel = self.chunkLevel
        modelInv = self.modelInv
        modelInfo = self.modelInfo
        modelXsd = self.modelXsd
        A = self.A
        etree = self.etree

        transformFunc = (
            (lambda x: BytesIO(x.encode("utf-8")))
            if transform is None
            else lambda x: BytesIO(transform(x).encode("utf-8"))
        )

        parser = self.getParser()

        baseSchema = modelXsd[None]
        overrides = [
            override for (model, override) in modelXsd.items() if model is not None
        ]
        A.getElementInfo(baseSchema, overrides, verbose=-1)

        refs = collections.defaultdict(lambda: collections.defaultdict(set))
        ids = collections.defaultdict(dict)

        # WALKERS

        WHITE_TRIM_RE = re.compile(r"\s+", re.S)
        NON_NAME_RE = re.compile(r"[^a-zA-Z0-9_ ]+", re.S)

        NOTE_LIKE = set(
            """
            note
            """.strip().split()
        )
        EMPTY_ELEMENTS = set(
            """
            addSpan
            alt
            anchor
            anyElement
            attRef
            binary
            caesura
            catRef
            cb
            citeData
            classRef
            conversion
            damageSpan
            dataFacet
            default
            delSpan
            elementRef
            empty
            equiv
            fsdLink
            gb
            handShift
            iff
            lacunaEnd
            lacunaStart
            lb
            link
            localProp
            macroRef
            milestone
            move
            numeric
            param
            path
            pause
            pb
            ptr
            redo
            refState
            specDesc
            specGrpRef
            symbol
            textNode
            then
            undo
            unicodeProp
            unihanProp
            variantEncoding
            when
            witEnd
            witStart
            """.strip().split()
        )
        NEWLINE_ELEMENTS = set(
            """
            ab
            addrLine
            cb
            l
            lb
            lg
            list
            p
            pb
            seg
            table
            u
            """.strip().split()
        )
        CONTINUOUS_ELEMENTS = set(
            """
            choice
            """.strip().split()
        )

        def makeNameLike(x):
            return NON_NAME_RE.sub("_", x).strip("_")

        def walkNode(cv, cur, xnode):
            """Internal function to deal with a single element.

            Will be called recursively.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

                The subdictionary `cur["node"]` is used to store the currently generated
                nodes by node type.
            xnode: object
                An LXML element node.
            """
            if procins and isinstance(xnode, etree._ProcessingInstruction):
                target = xnode.target
                tag = f"?{target}"
            else:
                tag = etree.QName(xnode.tag).localname

            atts = {etree.QName(k).localname: v for (k, v) in xnode.attrib.items()}

            beforeTag(cv, cur, xnode, tag, atts)

            cur[XNEST].append((tag, atts))

            curNode = beforeChildren(cv, cur, xnode, tag, atts)

            if curNode is not None:
                if parentEdges:
                    if len(cur[TNEST]):
                        parentNode = cur[TNEST][-1]
                        cv.edge(curNode, parentNode, parent=None)

                cur[TNEST].append(curNode)

                if siblingEdges:
                    if len(cur[TSIB]):
                        siblings = cur[TSIB][-1]

                        nSiblings = len(siblings)
                        for i, sib in enumerate(siblings):
                            cv.edge(sib, curNode, sibling=nSiblings - i)
                        siblings.append(curNode)

                    cur[TSIB].append([])

            for child in xnode.iterchildren(
                tag=(etree.Element, etree.ProcessingInstruction)
                if procins
                else etree.Element
            ):
                walkNode(cv, cur, child)

            afterChildren(cv, cur, xnode, tag, atts)

            if curNode is not None:
                xmlFile = cur["xmlFile"]

                for refAtt, targetFile, targetId in getRefs(tag, atts, xmlFile):
                    refs[refAtt][(targetFile, targetId)].add(curNode)

                idVal = atts.get("id", None)
                if idVal is not None:
                    ids[xmlFile][idVal] = curNode

                if len(cur[TNEST]):
                    cur[TNEST].pop()
                if siblingEdges:
                    if len(cur[TSIB]):
                        cur[TSIB].pop()

            cur[XNEST].pop()
            afterTag(cv, cur, xnode, tag, atts)

        def isChapter(cur):
            """Whether the current element counts as a chapter node.

            ## Model I

            Not relevant: there are no chapter nodes inside an XML file.

            ## Model II

            Chapters are the highest section level (the only lower level is chunks).

            Chapters come in three kinds:

            *   the TEI header;
            *   the immediate children of `<text>`
                except `<front>`, `<body>`, `<back>`, `<group>`;
            *   the immediate children of
                `<front>`, `<body>`, `<back>`, `<group>`.

            Parameters
            ----------
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

            Returns
            -------
            boolean
            """
            sectionModel = self.sectionModel

            if sectionModel == "II":
                nest = cur[XNEST]
                nNest = len(nest)

                if nNest > 0 and nest[-1][0] in EMPTY_ELEMENTS:
                    return False

                outcome = nNest > 0 and (
                    nest[-1][0] == TEI_HEADER
                    or (
                        nNest > 1
                        and (
                            nest[-2][0] in TEXT_ANCESTORS
                            or nest[-2][0] == TEXT_ANCESTOR
                            and nest[-1][0] not in TEXT_ANCESTORS
                        )
                    )
                )
                if outcome:
                    cur["chapterElems"].add(nest[-1][0])

                return outcome

            return False

        def isChunk(cur):
            """Whether the current element counts as a chunk node.

            It depends on the section model, but also on the template.

            Note that we can only have distinct templates if we deal with
            multiple files, so only when we are in section model I.

            ## Model I

            Chunks are the lowest section level (the higher levels are folders
            and then files).

            The default is that chunks are the immediate children of the
            `<teiHeader>`, `<front>`, `<body>`, `<back>`, and `<group>`
            elements; a few other elements also count as chunks.

            However, if `drillDownDivs` is True and if the chunk appears to be
            a `<div>` element, we drill further down, until we arrive at a
            non-`<div>` element.

            But in specific templates we have different rules:

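            ### `biolist`

            *   The TEI Header is a chunk, and nothing inside the TEI header is a chunk;
            *   Everything at level 5, except `<listPerson>`, is a chunk;
            *   The children of `<listPerson>` are chunks, provided they are at level 6.

            ### `bibliolist`

            *   The TEI Header is a chunk, and nothing inside the TEI header is a chunk;
            *   Everything at level 5, except `<listBibl>`, is a chunk;
            *   The children of `<listBibl>` are chunks (the `<bibl>` elements
                and a few others), provided they are at level 6.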

            ### `artworklist`

            *   The TEI Header is a chunk, and nothing inside the TEI header is a chunk;
            *   Everything at level 5 is a chunk.

            ## Model II

            Chunks are the lowest section level (the only higher level is chapters).

            Chunks are the immediate children of the chapters, and they come in two
            kinds: the ones that are `<p>` elements, and the rest.

            Deviation from this rule:

            *   If a chapter is a mixed content node, then it is also a chunk,
                and its subelements are not chunks.

            Parameters
            ----------
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

            Returns
            -------
            boolean
            """
            sectionModel = self.sectionModel

            nest = cur[XNEST]
            nNest = len(nest)
            model = cur["model"]

            if nNest == 0:
                return False

            thisTag = nest[-1][0]

            if sectionModel == "II":
                if nNest == 1:
                    outcome = False
                else:
                    parentTag = nest[-2][0]
                    meChptChnk = (
                        isChapter(cur) and thisTag not in cur["pureElems"][model]
                    )

                    if meChptChnk:
                        outcome = True
                    elif parentTag == TEI_HEADER:
                        outcome = True
                    elif nNest <= 2:
                        outcome = False
                    elif parentTag not in cur["pureElems"][model]:
                        outcome = False
                    else:
                        grandParentTag = nest[-3][0]
                        outcome = (
                            grandParentTag in TEXT_ANCESTORS
                            and thisTag not in EMPTY_ELEMENTS
                        ) or (
                            grandParentTag == TEXT_ANCESTOR
                            and parentTag not in TEXT_ANCESTORS
                        )

            elif sectionModel == "I":
                template = cur["template"]

                if template == "biolist":
                    if thisTag == TEI_HEADER:
                        outcome = True
                    elif any(n[0] == TEI_HEADER for n in nest[0:-1]):
                        outcome = False
                    elif nNest not in {5, 6}:
                        outcome = False
                    else:
                        parentTag = nest[-2][0]
                        if nNest == 5:
                            outcome = thisTag != "listPerson"
                        else:
                            outcome = parentTag == "listPerson"

                elif template == "bibliolist":
                    if thisTag == TEI_HEADER:
                        outcome = True
                    elif any(n[0] == TEI_HEADER for n in nest[0:-1]):
                        outcome = False
                    elif nNest not in {5, 6}:
                        outcome = False
                    else:
                        parentTag = nest[-2][0]
                        if nNest == 5:
                            outcome = thisTag != "listBibl"
                        else:
                            outcome = parentTag == "listBibl"

                elif template == "artworklist":
                    if thisTag == TEI_HEADER:
                        outcome = True
                    elif any(n[0] == TEI_HEADER for n in nest[0:-1]):
                        outcome = False
                    else:
                        outcome = nNest == 5

                else:
                    if thisTag in CHUNK_ELEMS:
                        outcome = True
                    elif nNest == 1:
                        outcome = False
                    else:
                        sectionProperties = self.sectionProperties
                        drillDownDivs = sectionProperties["drillDownDivs"]

                        parentTag = nest[-2][0]
                        if drillDownDivs:
                            if thisTag == "div":
                                outcome = False
                            else:
                                dParentTag = None
                                for ancestor in reversed(nest[0:-1]):
                                    if ancestor[0] != "div":
                                        dParentTag = ancestor[0]
                                        break
                                outcome = (
                                    dParentTag in CHUNK_PARENTS
                                    and thisTag not in EMPTY_ELEMENTS
                                ) or (
                                    dParentTag == TEXT_ANCESTOR
                                    and thisTag not in TEXT_ANCESTORS
                                )
                        else:
                            outcome = (
                                parentTag in CHUNK_PARENTS
                                and thisTag not in EMPTY_ELEMENTS
                            ) or (
                                parentTag == TEXT_ANCESTOR
                                and thisTag not in TEXT_ANCESTORS
                            )

            if outcome:
                cur["chunkElems"].add(nest[-1][0])

            return outcome

        def isPure(cur):
            """Whether the current tag has pure content.

            Parameters
            ----------
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

            Returns
            -------
            boolean
            """
            nest = cur[XNEST]
            model = cur["model"]
            return len(nest) == 0 or nest[-1][0] in cur["pureElems"][model]

        def isEndInPure(cur):
            """Whether the current end tag occurs in an element with pure content.

            If that is the case, then it is very likely that the end tag also
            marks the end of the current word.

            And we should not strip spaces after it.

            Parameters
            ----------
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

            Returns
            -------
            boolean
            """
            nest = cur[XNEST]
            model = cur["model"]
            return len(nest) > 1 and nest[-2][0] in cur["pureElems"][model]

        def hasMixedAncestor(cur):
            """Whether the current tag has an ancestor with mixed content.

            We use this in case a tag ends in an element with pure content.
            We should then add white-space to separate it from the next
            element of its parent.

            If the whole stack of elements has pure content, we add
            a newline, because then we are probably in the TEI header,
            and things are most clear if they are on separate lines.

            But if one of the ancestors has mixed content, we are typically
            in some structured piece of information within running text,
            such as change markup. In this case we want to add merely a space.

            And we should not strip spaces after it.

            Parameters
            ----------
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

            Returns
            -------
            boolean
            """
            nest = cur[XNEST]
            model = cur["model"]
            return any(n[0] in cur["mixedElems"][model] for n in nest[0:-1])

        def hasContinuousAncestor(cur):
            """Whether an ancestor tag is a continuous pure element.

            A continuous pure element is an element whose child elements do not
            imply word separation, e.g. `<choice>`.

            Parameters
            ----------
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

            Returns
            -------
            boolean
            """
            nest = cur[XNEST]
            return any(n[0] in CONTINUOUS_ELEMENTS for n in nest[0:-1])

        def startWord(cv, cur, ch):
            """Start a word node if necessary.

            Whenever we encounter a character, we determine
            whether it starts or ends a word, and if it starts
            one, this function takes care of the necessary actions.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            ch: string
                A single character, the next character in the result data.
            """
            curWord = cur[NODE][WORD]

            if not curWord:
                prevWord = cur["prevWord"]
                if prevWord is not None:
                    cv.feature(prevWord, after=cur["afterStr"])
                if ch is not None:
                    if wordAsSlot:
                        curWord = cv.slot()
                    else:
                        curWord = cv.node(WORD)
                    cur[NODE][WORD] = curWord
                    addSlotFeatures(cv, cur, curWord)

            if ch is not None:
                cur["wordStr"] += ch

        def finishWord(cv, cur, ch, spaceChar):
            """Terminate a word node if necessary.

            Whenever we encounter a character, we determine
            whether it starts or ends a word, and if it ends
            one, this function takes care of the necessary actions.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            ch: string
                A single character, the next slot in the result data.
            spaceChar: string | void
                If None, no extra space or newline will be added.
                Otherwise, the `spaceChar` (a single space or newline) will be added.
            """
            curWord = cur[NODE][WORD]
            if curWord:
                cv.feature(curWord, str=cur["wordStr"])
                if not wordAsSlot:
                    cv.terminate(curWord)
                cur[NODE][WORD] = None
                cur["wordStr"] = ""
                cur["prevWord"] = curWord
                cur["afterStr"] = ""

            if ch is not None:
                cur["afterStr"] += ch
            if spaceChar is not None:
                cur["afterStr"] = cur["afterStr"].rstrip() + spaceChar
                if not wordAsSlot:
                    addSpace(cv, cur, spaceChar)
                cur["afterSpace"] = True
            else:
                cur["afterSpace"] = False

        def addSlotFeatures(cv, cur, s):
            """Add generic features to a slot.

            These features depend on the context in which the slot occurs:
            `is_meta` if we are inside the TEI header, `is_note` if we are
            inside a note, and `rend_r` for every `rend` value `r` that is
            currently active.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            s: slot
                A previously added (slot) node
            """
            if cur["inHeader"]:
                cv.feature(s, is_meta=1)
            if cur["inNote"]:
                cv.feature(s, is_note=1)
            for r, stack in cur.get("rend", {}).items():
                if len(stack) > 0:
                    cv.feature(s, **{f"rend_{r}": 1})

        def addTokens(cv, cur, text):
            """Adds text as a series of tokens.

            Only meant for the case where slots are tokens.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            text: string
                The text to be added.
            """
            (beforew, material, afterw) = getWhites(text)

            if beforew:
                makeSpace(cv, cur)

            s = None

            for tx, after in tokenize(material):
                s = cv.slot()
                cv.feature(s, str=tx, after=after)
                addSlotFeatures(cv, cur, s)

            if afterw:
                if s is None:
                    makeSpace(cv, cur)
                else:
                    cv.feature(s, after=" ")

        def addSlot(cv, cur, ch):
            """Add a slot.

            Whenever we encounter a character, we add it as a new slot, unless
            `wordAsSlot` is in force. In that case we suppress the triggering of a
            slot node.
            If needed, we start / terminate word nodes as well.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            ch: string
                A single character, the next slot in the result data.
            """
            if ch in {"_", None} or ch.isalnum() or ch in IN_WORD_HYPHENS:
                startWord(cv, cur, ch)
            else:
                finishWord(cv, cur, ch, None)

            if wordAsSlot:
                s = cur[NODE][WORD]
            elif ch is None:
                s = None
            else:
                s = cv.slot()
                cv.feature(s, ch=ch)
            if s is not None:
                addSlotFeatures(cv, cur, s)

        def addEmpty(cv, cur):
            """Add an empty slot.

            We also terminate the current word.
            If words are slots, the empty slot is a word on its own.

            Returns
            -------
            node
                The empty slot
            """
            if tokenAsSlot:
                emptyNode = cv.slot()
                cv.feature(emptyNode, str=ZWSP, after="", empty=1)
            else:
                finishWord(cv, cur, None, None)
                startWord(cv, cur, ZWSP)
                emptyNode = cur[NODE][WORD]
                cv.feature(emptyNode, empty=1)

                if not wordAsSlot:
                    emptyNode = cv.slot()
                    cv.feature(emptyNode, ch=ZWSP, empty=1)

                finishWord(cv, cur, None, None)

            return emptyNode

        def addSpace(cv, cur, spaceChar):
            """Adds a space or a new line.

            Only meant for the case where slots are characters or tokens.

            Suppressed when not in a lowest-level section.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            spaceChar: string
                The character to add (supposed to be either a space or a newline).
            """
            if chunkLevel in cv.activeTypes():
                s = cv.slot()
                if tokenAsSlot:
                    cv.feature(s, str="", after=spaceChar, extraspace=1)
                else:
                    cv.feature(s, ch=spaceChar, extraspace=1)
                addSlotFeatures(cv, cur, s)

        def makeSpace(cv, cur):
            """Adds a space.

            Only meant for the case where slots are tokens.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            """
            s = cv.slot()
            cv.feature(s, str="", after=" ", extraspace=1)
            addSlotFeatures(cv, cur, s)

        def endLine(cv, cur):
            """Ends a line node.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            """
            lineProperties = self.lineProperties
            lineType = lineProperties["nodeType"]

            slots = cv.linked(cur[NODE][lineType])
            empty = len(slots) == 0

            if empty:
                lastSlot = addEmpty(cv, cur)
                if cur["inNote"]:
                    cv.feature(lastSlot, is_note=1)
            else:
                lastSlot = (T, slots[-1])

            if not wordAsSlot:
                after = cv.get("after", lastSlot)
                if after is not None and "\n" not in after:
                    cv.feature(lastSlot, after=f"{after.rstrip()}\n")
            cv.terminate(cur[NODE][lineType])

        def endPage(cv, cur):
            """Ends a page node.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            """
            pageProperties = self.pageProperties
            pageType = pageProperties["nodeType"]

            slots = cv.linked(cur[NODE][pageType])
            empty = len(slots) == 0

            if empty:
                lastSlot = addEmpty(cv, cur)
                if cur["inNote"]:
                    cv.feature(lastSlot, is_note=1)
            cv.terminate(cur[NODE][pageType])

        def beforeTag(cv, cur, xnode, tag, atts):
            """Actions before dealing with the element's tag.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            xnode: object
                An LXML element node.
            tag: string
                The tag of the LXML node.
            atts: dict
                The attributes of the LXML node, with namespaces stripped.
            """
            beforeTagCustom = getattr(self, "beforeTagCustom", None)
            if beforeTagCustom is not None:
                beforeTagCustom(cv, cur, xnode, tag, atts)

        def beforeChildren(cv, cur, xnode, tag, atts):
            """Actions before dealing with the element's children.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            xnode: object
                An LXML element node.
            tag: string
                The tag of the LXML node.
            atts: dict
                The attributes of the LXML node, with namespaces stripped.
            """
            makeLineElems = self.makeLineElems

            if makeLineElems:
                lineProperties = self.lineProperties
                lineElem = lineProperties["element"]
                lineType = lineProperties["nodeType"]
                isLineContainer = tag == lineElem
                inLine = cur["inLine"]

                if isLineContainer:
                    cur["inLine"] = True

                    # the line starts with the container
                    cur[NODE][lineType] = cv.node(lineType)

            makePageElems = self.makePageElems

            if makePageElems:
                pageProperties = self.pageProperties
                pageType = pageProperties["nodeType"]
                isPageContainer = matchModel(pageProperties, tag, atts)
                inPage = cur["inPage"]

                pbAtTop = pageProperties["pbAtTop"]

                if isPageContainer:
                    cur["inPage"] = True

                    if pbAtTop:
                        # material before the first pb in the container is not in a page
                        pass
                    else:
                        # the page starts with the container
                        cur[NODE][pageType] = cv.node(pageType)

            sectionModel = self.sectionModel
            sectionProperties = self.sectionProperties

            if sectionModel == "II":
                chapterSection = self.chapterSection
                chunkSection = self.chunkSection

                if isChapter(cur):
                    cur["chapterNum"] += 1
                    cur["prevChapter"] = cur[NODE].get(chapterSection, None)
                    cur[NODE][chapterSection] = cv.node(chapterSection)
                    cv.link(cur[NODE][chapterSection], cur["danglingSlots"])

                    value = {chapterSection: f"{cur['chapterNum']} {tag}"}
                    cv.feature(cur[NODE][chapterSection], **value)
                    cur["chunkPNum"] = 0
                    cur["chunkONum"] = 0
                    cur["prevChunk"] = cur[NODE].get(chunkSection, None)
                    cur[NODE][chunkSection] = cv.node(chunkSection)
                    cv.link(cur[NODE][chunkSection], cur["danglingSlots"])
                    cur["danglingSlots"] = set()
                    cur["infirstChunk"] = True

                # N.B. A node can count both as chapter and as chunk,
                # e.g. a <trailer> sibling of the chapter <div>s
                # A trailer has mixed content, so its subelements aren't typical chunks.
                if isChunk(cur):
                    if cur["infirstChunk"]:
                        cur["infirstChunk"] = False
                    else:
                        cur[NODE][chunkSection] = cv.node(chunkSection)
                        cv.link(cur[NODE][chunkSection], cur["danglingSlots"])
                        cur["danglingSlots"] = set()
                    if tag == "p":
                        cur["chunkPNum"] += 1
                        cn = cur["chunkPNum"]
                    else:
                        cur["chunkONum"] -= 1
                        cn = cur["chunkONum"]
                    value = {chunkSection: cn}
                    cv.feature(cur[NODE][chunkSection], **value)

                if matchModel(sectionProperties, tag, atts):
                    heading = etree.tostring(
                        xnode, encoding="unicode", method="text", with_tail=False
                    ).replace("\n", " ")
                    value = {chapterSection: heading}
                    cv.feature(cur[NODE][chapterSection], **value)
                    chapterNum = cur["chapterNum"]
                    if verbose >= 0:
                        console(
                            f"\rchapter {chapterNum:>4} {heading:<50}", newline=False
                        )
            else:
                chunkSection = self.chunkSection

                if isChunk(cur):
                    cur["chunkNum"] += 1
                    cur["prevChunk"] = cur[NODE].get(chunkSection, None)
                    cur[NODE][chunkSection] = cv.node(chunkSection)
                    cv.link(cur[NODE][chunkSection], cur["danglingSlots"])
                    cur["danglingSlots"] = set()
                    value = {chunkSection: cur["chunkNum"]}
                    cv.feature(cur[NODE][chunkSection], **value)

            if tag == TEI_HEADER:
                cur["inHeader"] = True
                if sectionModel == "II":
                    value = {chapterSection: "TEI header"}
                    cv.feature(cur[NODE][chapterSection], **value)
            if tag in NOTE_LIKE:
                cur["inNote"] = True
                if not tokenAsSlot:
                    finishWord(cv, cur, None, None)

            curNode = None

            if makeLineElems:
                if inLine and tag == "lb":
                    if cur[NODE][lineType] is not None:
                        if cur["lineAtts"] is not None and len(cur["lineAtts"]):
                            cv.feature(cur[NODE][lineType], **cur["lineAtts"])
                        endLine(cv, cur)
                    cur[NODE][lineType] = cv.node(lineType)
                    cur["lineAtts"] = atts

            if makePageElems:
                if inPage and tag == "pb":
                    if pbAtTop:
                        if cur[NODE][pageType] is not None:
                            endPage(cv, cur)
                        cur[NODE][pageType] = cv.node(pageType)
                        if len(atts):
                            cv.feature(cur[NODE][pageType], **atts)
                    else:
                        if cur[NODE][pageType] is not None:
                            if cur["pageAtts"] is not None and len(cur["pageAtts"]):
                                cv.feature(cur[NODE][pageType], **cur["pageAtts"])
                            endPage(cv, cur)
                        cur[NODE][pageType] = cv.node(pageType)
                        cur["pageAtts"] = atts

            isBoundaryElem = (
                makeLineElems and tag == "lb" or makePageElems and tag == "pb"
            )

            if tag not in PASS_THROUGH and not isBoundaryElem:
                cur["afterSpace"] = False
                cur[NODE][tag] = cv.node(tag)
                curNode = cur[NODE][tag]
                if wordAsSlot:
                    if cur[NODE][WORD]:
                        cv.link(curNode, [cur[NODE][WORD][1]])
                if len(atts):
                    cv.feature(curNode, **atts)
                    if "rend" in atts:
                        rValue = atts["rend"]
                        r = makeNameLike(rValue)
                        if r:
                            for q in r.split():
                                cur.setdefault("rend", {}).setdefault(q, []).append(
                                    True
                                )

            beforeChildrenCustom = getattr(self, "beforeChildrenCustom", None)
            if beforeChildrenCustom is not None:
                beforeChildrenCustom(cv, cur, xnode, tag, atts)

            if not hasattr(xnode, "target") and xnode.text:
                textMaterial = WHITE_TRIM_RE.sub(" ", xnode.text)
                if isPure(cur):
                    if textMaterial and textMaterial != " ":
                        console(
                            (
                                "WARNING: Text material at the start of "
                                f"pure-content element <{tag}>"
                            ),
                            error=True,
                        )
                        stack = "-".join(n[0] for n in cur[XNEST])
                        console(f"\tElement stack: {stack}", error=True)
                        console(f"\tMaterial: `{textMaterial}`", error=True)
                else:
                    if tokenAsSlot:
                        addTokens(cv, cur, textMaterial)
                    else:
                        for ch in textMaterial:
                            addSlot(cv, cur, ch)

            return curNode

        def afterChildren(cv, cur, xnode, tag, atts):
            """Node actions after dealing with the children, but before the end tag.

            Here we make sure that the newline elements will get their last slot
            having a newline at the end of their `after` feature.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            xnode: object
                An LXML element node.
            tag: string
                The tag of the LXML node.
            atts: dict
                The attributes of the LXML node, with namespaces stripped.
            """
            chunkSection = self.chunkSection
            makeLineElems = self.makeLineElems

            if makeLineElems:
                lineProperties = self.lineProperties
                lineType = lineProperties["nodeType"]
                lineElem = lineProperties["element"]

            makePageElems = self.makePageElems

            if makePageElems:
                pageProperties = self.pageProperties
                pageType = pageProperties["nodeType"]

            sectionModel = self.sectionModel

            if sectionModel == "II":
                chapterSection = self.chapterSection

            extraInstructions = self.extraInstructions

            if len(extraInstructions):
                lookupSource(cv, cur, tokenAsSlot, extraInstructions)

            isChap = isChapter(cur)
            isChnk = isChunk(cur)

            afterChildrenCustom = getattr(self, "afterChildrenCustom", None)
            if afterChildrenCustom is not None:
                afterChildrenCustom(cv, cur, xnode, tag, atts)

            if makeLineElems:
                isLineContainer = tag == lineElem
                inLine = cur["inLine"]

            if makePageElems:
                isPageContainer = matchModel(pageProperties, tag, atts)
                inPage = cur["inPage"]

            hasFinishedWord = False

            if makeLineElems and inLine and tag == "lb":
                pass

            if makePageElems and inPage and tag == "pb":
                pass

            isBoundaryElem = (
                makeLineElems and tag == "lb" or makePageElems and tag == "pb"
            )

            if makeLineElems and isLineContainer:
                # the line ends with the container
                if cur[NODE][lineType] is not None:
                    endLine(cv, cur)
                cur["inLine"] = False

            if makePageElems and isPageContainer:
                pbAtTop = pageProperties["pbAtTop"]
                if pbAtTop:
                    # the page ends with the container
                    if cur[NODE][pageType] is not None:
                        endPage(cv, cur)
                else:
                    # material after the last pb is not in a page
                    if cur[NODE][pageType] is not None:
                        cv.delete(cur[NODE][pageType])
                cur["inPage"] = False

            if tag not in PASS_THROUGH and not isBoundaryElem:
                curNode = cur[TNEST][-1]
                slots = cv.linked(curNode)
                empty = len(slots) == 0

                newLineTag = tag in NEWLINE_ELEMENTS

                if (
                    newLineTag
                    or isEndInPure(cur)
                    and not hasContinuousAncestor(cur)
                    and not cur["afterSpace"]
                ) and not empty:
                    spaceChar = "\n" if newLineTag or not hasMixedAncestor(cur) else " "
                    if tokenAsSlot:
                        cv.feature((T, slots[-1]), after=spaceChar)
                    else:
                        finishWord(cv, cur, None, spaceChar)
                        hasFinishedWord = True

                slots = cv.linked(curNode)
                empty = len(slots) == 0

                if empty:
                    lastSlot = addEmpty(cv, cur)
                    if cur["inHeader"]:
                        cv.feature(lastSlot, is_meta=1)
                    if cur["inNote"]:
                        cv.feature(lastSlot, is_note=1)
                    # take care that this empty slot falls under all sections
                    # for folders and files this is already guaranteed
                    # We need only to watch out for chapters and chunks
                    if cur[NODE].get(chunkSection, None) is None:
                        prevChunk = cur.get("prevChunk", None)
                        if prevChunk is None:
                            cur["danglingSlots"].add(lastSlot[1])
                        else:
                            cv.link(prevChunk, [lastSlot[1]])
                    if sectionModel == "II":
                        if cur[NODE].get(chapterSection, None) is None:
                            prevChapter = cur.get("prevChapter", None)
                            if prevChapter is None:
                                cur["danglingSlots"].add(lastSlot[1])
                            else:
                                cv.link(prevChapter, [lastSlot[1]])

                cv.terminate(curNode)

            if isChnk:
                if tokenAsSlot:
                    addSpace(cv, cur, "\n")
                else:
                    if not hasFinishedWord:
                        finishWord(cv, cur, None, "\n")
                cv.terminate(cur[NODE][chunkSection])

            if sectionModel == "II":
                if isChap:
                    if tokenAsSlot:
                        addSpace(cv, cur, "\n")
                    else:
                        if not hasFinishedWord:
                            finishWord(cv, cur, None, "\n")
                    cv.terminate(cur[NODE][chapterSection])

        def afterTag(cv, cur, xnode, tag, atts):
            """Node actions after dealing with the children and after the end tag.

            This is the place where we process the `tail` of an LXML node: the
            text material after the element and before the next open/close
            tag of any element.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            xnode: object
                An LXML element node.
            tag: string
                The tag of the LXML node.
            atts: dict
                The attributes of the LXML node, with namespaces stripped.
            """
            if tag == TEI_HEADER:
                cur["inHeader"] = False
            elif tag in NOTE_LIKE:
                cur["inNote"] = False

            if tag not in PASS_THROUGH:
                if "rend" in atts:
                    rValue = atts["rend"]
                    r = makeNameLike(rValue)
                    if r:
                        for q in r.split():
                            cur["rend"][q].pop()

            if xnode.tail:
                if tag == "lb" and self.makeLineElems:
                    tail = xnode.tail.lstrip()
                    if not wordAsSlot:
                        pass
                else:
                    tail = xnode.tail

                tailMaterial = WHITE_TRIM_RE.sub(" ", tail)
                if isPure(cur):
                    if tailMaterial and tailMaterial != " ":
                        elem = cur[XNEST][-1][0]
                        console(
                            (
                                "WARNING: Text material after "
                                f"<{tag}> in pure-content element <{elem}>"
                            ),
                            error=True,
                        )
                        stack = "-".join(n[0] for n in cur[XNEST])
                        console(f"\tElement stack: {stack}-{tag}", error=True)
                        console(f"\tMaterial: `{tailMaterial}`", error=True)
                else:
                    if tokenAsSlot:
                        addTokens(cv, cur, tailMaterial)
                    else:
                        for ch in tailMaterial:
                            addSlot(cv, cur, ch)

            afterTagCustom = getattr(self, "afterTagCustom", None)
            if afterTagCustom is not None:
                afterTagCustom(cv, cur, xnode, tag, atts)

        def director(cv):
            """Director function.

            Here we program a walk through the TEI sources.
            At every step of the walk we fire some actions that build TF nodes
            and assign features for them.

            Because everything is rather dynamic, we generate fairly standard
            metadata for the features, namely a link to the
            [TEI website](https://tei-c.org).

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            """
            makeLineElems = self.makeLineElems

            if makeLineElems:
                lineProperties = self.lineProperties
                lineType = lineProperties["nodeType"]

            makePageElems = self.makePageElems

            if makePageElems:
                pageProperties = self.pageProperties
                pageType = pageProperties["nodeType"]

            sectionModel = self.sectionModel
            A = self.A
            elementDefs = A.elementDefs

            cur = {}
            cur["pureElems"] = {
                modelInv[schemaOverride]: {
                    x for (x, (typ, mixed)) in eDefs.items() if not mixed
                }
                for (schemaOverride, eDefs) in elementDefs.items()
            }
            cur["mixedElems"] = {
                modelInv[schemaOverride]: {
                    x for (x, (typ, mixed)) in eDefs.items() if mixed
                }
                for (schemaOverride, eDefs) in elementDefs.items()
            }
            cur[NODE] = {}

            if sectionModel == "I":
                folderSection = self.folderSection
                fileSection = self.fileSection

                i = 0
                for xmlFolder, xmlFiles in self.getXML():
                    msg = "Start " if verbose >= 0 else "\t"
                    console(f"{msg}folder {xmlFolder}:")

                    cur[NODE][folderSection] = cv.node(folderSection)
                    value = {folderSection: xmlFolder}
                    cv.feature(cur[NODE][folderSection], **value)

                    j = 0
                    cr = ""
                    nl = True

                    for xmlFile in xmlFiles:
                        i += 1
                        j += 1
                        if j > PROGRESS_LIMIT:
                            cr = "\r"
                            nl = False

                        cur["xmlFile"] = xmlFile
                        xmlPath = f"{teiPath}/{xmlFolder}/{xmlFile}"
                        (model, adapt, tpl) = self.getSwitches(xmlPath)
                        cur["model"] = model
                        cur["template"] = tpl
                        cur["adaptation"] = adapt
                        modelRep = model or "TEI"
                        tplRep = tpl or ""
                        adRep = adapt or ""
                        label = f"{modelRep:<12} {adRep:<12} {tplRep:<12}"
                        if verbose >= 0:
                            console(
                                f"{cr}{i:>4} {label} {xmlFile:<50}",
                                newline=nl,
                            )

                        cur[NODE][fileSection] = cv.node(fileSection)
                        ids[xmlFile][""] = cur[NODE][fileSection]
                        value = {fileSection: xmlFile.removesuffix(".xml")}
                        cv.feature(cur[NODE][fileSection], **value)
                        if tpl:
                            cur[NODE][tpl] = cv.node(tpl)
                            cv.feature(cur[NODE][tpl], **value)

                        with fileOpen(xmlPath) as fh:
                            text = fh.read()
                            if transformFunc is not None:
                                text = transformFunc(text)
                            tree = etree.parse(text, parser)
                            root = tree.getroot()

                            if makeLineElems:
                                cur[NODE][lineType] = None
                                cur["inLine"] = False
                                cur["lineAtts"] = None

                            if makePageElems:
                                cur[NODE][pageType] = None
                                cur["inPage"] = False
                                cur["pageAtts"] = None

                            if not tokenAsSlot:
                                cur[NODE][WORD] = None
                            cur["inHeader"] = False
                            cur["inNote"] = False
                            cur[XNEST] = []
                            cur[TNEST] = []
                            cur[TSIB] = []
                            cur["chunkNum"] = 0
                            cur["prevChunk"] = None
                            cur["danglingSlots"] = set()
                            cur["prevWord"] = None
                            cur["wordStr"] = ""
                            cur["afterStr"] = ""
                            cur["afterSpace"] = True
                            cur["chunkElems"] = set()
                            walkNode(cv, cur, root)

                        if not tokenAsSlot:
                            addSlot(cv, cur, None)
                        if tpl:
                            cv.terminate(cur[NODE][tpl])
                        cv.terminate(cur[NODE][fileSection])

                    if verbose >= 0:
                        console("")
                        console(f"End   folder {xmlFolder}")

                    cv.terminate(cur[NODE][folderSection])

            elif sectionModel == "II":
                xmlFile = self.getXML()
                if xmlFile is None:
                    console("No XML files found!", error=True)
                    return False

                xmlPath = f"{teiPath}/{xmlFile}"
                (cur["model"], cur["adaptation"], cur["template"]) = self.getSwitches(
                    xmlPath
                )

                with fileOpen(f"{teiPath}/{xmlFile}") as fh:
                    cur["xmlFile"] = xmlFile
                    text = fh.read()
                    if transformFunc is not None:
                        text = transformFunc(text)
                    tree = etree.parse(text, parser)
                    root = tree.getroot()

                    if makeLineElems:
                        cur[NODE][lineType] = None
                        cur["inLine"] = False
                        cur["lineAtts"] = None

                    if makePageElems:
                        cur[NODE][pageType] = None
                        cur["inPage"] = False
                        cur["pageAtts"] = None

                    if not tokenAsSlot:
                        cur[NODE][WORD] = None
                    cur["inHeader"] = False
                    cur["inNote"] = False
                    cur[XNEST] = []
                    cur[TNEST] = []
                    cur[TSIB] = []
                    cur["chapterNum"] = 0
                    cur["chunkPNum"] = 0
                    cur["chunkONum"] = 0
                    cur["prevChunk"] = None
                    cur["prevChapter"] = None
                    cur["danglingSlots"] = set()
                    cur["prevWord"] = None
                    cur["wordStr"] = ""
                    cur["afterStr"] = ""
                    cur["afterSpace"] = True
                    cur["chunkElems"] = set()
                    cur["chapterElems"] = set()
                    for child in root.iterchildren(tag=etree.Element):
                        walkNode(cv, cur, child)

                if not tokenAsSlot:
                    addSlot(cv, cur, None)

            if verbose >= 0:
                console("")

            if verbose >= 0:
                console("Resolving links into edges ...")

            unresolvedRefs = {}
            unresolved = 0
            unresolvedUnique = 0
            resolved = 0
            resolvedUnique = 0

            for att, attRefs in refs.items():
                feature = f"link_{att}"
                edgeFeat = {feature: None}

                for (targetFile, targetId), sourceNodes in attRefs.items():
                    nSourceNodes = len(sourceNodes)
                    targetNode = ids[targetFile].get(targetId, None)
                    if targetNode is None:
                        unresolvedRefs.setdefault(targetFile, set()).add(targetId)
                        unresolvedUnique += 1
                        unresolved += nSourceNodes
                    else:
                        for sourceNode in sourceNodes:
                            cv.edge(sourceNode, targetNode, **edgeFeat)
                        resolvedUnique += 1
                        resolved += nSourceNodes

            if verbose >= 0:
                console(f"\t{resolvedUnique} in {resolved} reference(s) resolved")
                if unresolvedRefs:
                    console(
                        f"\t{unresolvedUnique} in {unresolved} reference(s): "
                        "could not be resolved"
                    )
                    if verbose == 1:
                        for targetFile, targetIds in sorted(unresolvedRefs.items()):
                            examples = " ".join(sorted(targetIds)[0:3])
                            console(f"\t\t{targetFile}: {len(targetIds)} x: {examples}")

            for fName in featureMeta:
                if not cv.occurs(fName):
                    cv.meta(fName)
            for fName in cv.features():
                if fName not in featureMeta:
                    if fName.startswith("rend_"):
                        r = fName[5:]
                        cv.meta(
                            fName,
                            description=f"whether text is to be rendered as {r}",
                            valueType="int",
                            conversionMethod=CM_LITC,
                            conversionCode=CONVERSION_METHODS[CM_LITC],
                        )
                        intFeatures.add(fName)
                    elif fName.startswith("link_"):
                        r = fName[5:]
                        cv.meta(
                            fName,
                            description=(
                                f"links to node identified by xml:id in attribute {r}"
                            ),
                            valueType="str",
                            conversionMethod=CM_LITP,
                            conversionCode=CONVERSION_METHODS[CM_LITP],
                        )
                    else:
                        cv.meta(
                            fName,
                            description=f"this is TEI attribute {fName}",
                            valueType="str",
                            conversionMethod=CM_LIT,
                            conversionCode=CONVERSION_METHODS[CM_LIT],
                        )

            levelConstraints = ["note < chunk, p", "salute < opener, closer"]
            if "chapterElems" in cur:
                for elem in cur["chapterElems"]:
                    levelConstraints.append(f"{elem} < chapter")
            if "chunkElems" in cur:
                for elem in cur["chunkElems"]:
                    levelConstraints.append(f"{elem} < chunk")

            levelConstraints = "; ".join(levelConstraints)

            cv.meta("otext", levelConstraints=levelConstraints)

            if verbose == 1:
                console("source reading done")
            return True

        return director

    def convertTask(self):
        """Implementation of the "convert" task.

        It sets up the `tf.convert.walker` machinery and runs it.

        Returns
        -------
        boolean
            Whether the conversion was successful.
        """
        if not self.importOK():
            return

        if not self.good:
            return

        procins = self.procins
        verbose = self.verbose
        slotType = self.slotType
        generic = self.generic
        otext = self.otext
        featureMeta = self.featureMeta
        intFeatures = self.intFeatures

        makeLineElems = self.makeLineElems
        lineModel = self.lineModel
        if makeLineElems:
            lineProperties = self.lineProperties
            lineType = lineProperties["nodeType"]

        makePageElems = self.makePageElems
        pageModel = self.pageModel

        if makePageElems:
            pageProperties = self.pageProperties
            pageType = pageProperties["nodeType"]
            pbAtTop = pageProperties["pbAtTop"]

        tfPath = self.tfPath
        teiPath = self.teiPath

        if verbose >= 0:
            if verbose == 1:
                console(f"TEI to TF converting: {ux(teiPath)} => {ux(tfPath)}")
            if makeLineElems:
                lbRep = f" with {lineType} nodes for lines between lb elements"
                console(f"Line model {lineModel}{lbRep}")

            if makePageElems:
                wrt = "started" if pbAtTop else "ended"
                pbRep = f" with {pageType} nodes for pages {wrt} by pb elements"
                console(f"Page model {pageModel}{pbRep}")

            console(
                f"Processing instructions are {'treated' if procins else 'ignored'}"
            )

        initTree(tfPath, fresh=True, gentle=True)

        cv = self.getConverter()

        self.good = cv.walk(
            self.getDirector(),
            slotType,
            otext=otext,
            generic=generic,
            intFeatures=intFeatures,
            featureMeta=featureMeta,
            generateTf=True,
        )

    def loadTask(self):
        """Implementation of the "load" task.

        It loads the TF data that resides in the directory where the "convert" task
        delivers its results.

        During loading there are additional checks. If they succeed, we have evidence
        that we have a valid TF dataset.

        Also, during the first load intensive pre-computation of TF data takes place,
        the results of which will be cached in the invisible `.tf` directory there.

        That makes the TF data ready to be loaded fast, next time it is needed.

        Returns
        -------
        boolean
            Whether the loading was successful.
        """
        if not self.importOK():
            return

        if not self.good:
            return

        tfPath = self.tfPath
        verbose = self.verbose
        silent = AUTO if verbose == 1 else TERSE if verbose == 0 else DEEP

        if not dirExists(tfPath):
            console(f"Directory {ux(tfPath)} does not exist.", error=True)
            console("No TF found, nothing to load", error=True)
            self.good = False
            return

        TF = Fabric(locations=[tfPath], silent=silent)
        allFeatures = TF.explore(silent=True, show=True)
        loadableFeatures = allFeatures["nodes"] + allFeatures["edges"]
        api = TF.load(loadableFeatures, silent=silent)
        if api:
            if verbose >= 0:
                console(f"max node = {api.F.otype.maxNode}")
            self.good = True
            return

        self.good = False

    # APP CREATION/UPDATING

    def appTask(self, tokenBased=False):
        """Implementation of the "app" task.

        It creates / updates a corpus-specific app plus specific documentation files.
        There should be a valid TF dataset in place, because some
        settings in the app derive from it.

        It will also read custom additions that are present in the target app directory.
        These files are:

        *   `about_custom.md`:
            A markdown file with specific colophon information about the dataset.
            In the generated file, this information will be put at the start.
        *   `transcription_custom.md`:
            A markdown file with specific encoding information about the dataset.
            In the generated file, this information will be put at the start.
        *   `config_custom.yaml`:
            A YAML file with configuration data that will be *merged* into the generated
            config.yaml.
        *   `app_custom.py`:
            A python file with named snippets of code to be inserted
            at corresponding places in the generated `app.py`
        *   `display_custom.css`:
            Additional CSS definitions that will be appended to the generated
            `display.css`.

        If the TF app for this resource needs custom code, this is the way to retain
        that code between automatic generation of files.

        Returns
        -------
        boolean
            Whether the operation was successful.
        """
        if not self.importOK():
            return

        if not self.good:
            return

        verbose = self.verbose

        refDir = self.refDir
        myDir = self.myDir
        procins = self.procins
        wordAsSlot = self.wordAsSlot
        tokenAsSlot = self.tokenAsSlot
        charAsSlot = self.charAsSlot
        parentEdges = self.parentEdges
        siblingEdges = self.siblingEdges
        sectionModel = self.sectionModel
        sectionProperties = self.sectionProperties
        tfVersion = self.tfVersion

        # key | parentDir | file | justCopy

        # if parentDir is a tuple, the first part is the parentDir of the source
        # and the second part is the parentDir of the destination

        itemSpecs = (
            ("about", "docs", "about.md", False),
            ("trans", ("app", "docs"), "transcription.md", False),
            ("logo", "app/static", "logo.png", True),
            ("display", "app/static", "display.css", False),
            ("config", "app", "config.yaml", False),
            ("app", "app", "app.py", False),
        )
        genTasks = {
            s[0]: dict(parentDir=s[1], file=s[2], justCopy=s[3]) for s in itemSpecs
        }
        cssInfo = makeCssInfo()

        version = tfVersion.removesuffix(PRE) if tokenBased else tfVersion

        def createConfig(sourceText, customText):
            text = sourceText.replace("«version»", f'"{version}"')

            settings = readYaml(text=text, plain=True)
            settings.setdefault("provenanceSpec", {})["branch"] = BRANCH_DEFAULT_NEW

            if tokenBased:
                if "typeDisplay" in settings and "word" in settings["typeDisplay"]:
                    del settings["typeDisplay"]["word"]

            customSettings = (
                {} if not customText else readYaml(text=customText, plain=True)
            )

            mergeDict(settings, customSettings)

            text = writeYaml(settings)

            return text

        def createDisplay(sourceText, customText):
            """Copies and tweaks the display.css file of a TF app.

            We generate CSS code for certain text formatting styles,
            triggered by `rend` attributes in the source.
            """

            css = sourceText.replace("«rends»", cssInfo)
            return f"{css}\n\n{customText}\n"

        def createApp(sourceText, customText):
            """Copies and tweaks the app.py file of a TF app.

            The template app.py provides text formatting functions.
            It retrieves text from features, but that is dependent on
            the settings of the conversion, in particular whether we have words as
            slots or characters.

            Depending on that we insert some code in the template.

            The template contains the string `F.matérial`, and it will be replaced
            by something like

            ```
            F.ch.v(n)
            ```

            or

            ```
            f"{F.str.v(n)}{F.after.v(n)}"
            ```

            That's why the variable `materialCode` in the body gets a rather
            unusual value: it is interpreted later on as code.
            """

            materialCode = (
                '''F.ch.v(n) or ""'''
                if charAsSlot or tokenBased
                else """f'{F.str.v(n) or ""}{F.after.v(n) or ""}'"""
            )
            rendValues = repr(KNOWN_RENDS)

            code = sourceText.replace("F.matérial", materialCode)
            code = code.replace('"rèndValues"', rendValues)

            hookStartRe = re.compile(r"^# DEF (import|init|extra)\s*$", re.M)
            hookEndRe = re.compile(r"^# END DEF\s*$", re.M)
            hookInsertRe = re.compile(r"^\s*# INSERT (import|init|extra)\s*$", re.M)

            custom = {}
            section = None

            for line in (customText or "").split("\n"):
                line = line.rstrip()

                if section is None:
                    match = hookStartRe.match(line)
                    if match:
                        section = match.group(1)
                        custom[section] = []
                else:
                    match = hookEndRe.match(line)
                    if match:
                        section = None
                    else:
                        custom[section].append(line)

            codeLines = []

            for line in code.split("\n"):
                line = line.rstrip()

                match = hookInsertRe.match(line)
                if match:
                    section = match.group(1)
                    codeLines.extend(custom.get(section, []))
                else:
                    codeLines.append(line)

            return "\n".join(codeLines) + "\n"

        def createTranscription(sourceText, customText):
            """Copies and tweaks the transcription.md file for a TF corpus."""
            org = self.org
            repo = self.repo
            relative = self.relative
            intFeatures = self.intFeatures
            extra = self.extra

            def metaRep(feat, meta):
                valueType = "int" if feat in intFeatures else "str"
                description = meta.get("description", "")
                extraFieldRep = "\n".join(
                    f"*   `{field}`: `{value}`"
                    for (field, value) in meta.items()
                    if field not in {"description", "valueType"}
                )

                return (
                    f"""{description}\n"""
                    f"""The values of this feature have type {valueType}.\n"""
                    f"""{extraFieldRep}"""
                )

            extra = "\n\n".join(
                f"## `{feat}`\n\n{metaRep(feat, info['meta'])}\n"
                for (feat, info) in extra.items()
            )

            text = (
                dedent(
                    f"""
                # Corpus {org} - {repo}{relative}

                """
                )
                + tweakTrans(
                    sourceText,
                    procins,
                    wordAsSlot,
                    tokenAsSlot,
                    charAsSlot,
                    parentEdges,
                    siblingEdges,
                    tokenBased,
                    sectionModel,
                    sectionProperties,
                    REND_DESC,
                    extra,
                )
                + dedent(
                    """

                    ## See also

                    *   [about](about.md)
                    """
                )
            )
            return f"{text}\n\n{customText}\n"

        def createAbout(sourceText, customText):
            org = self.org
            repo = self.repo
            relative = self.relative
            generic = self.generic
            if tokenBased:
                generic["version"] = version

            generic = "\n\n".join(
                f"## `{key}`\n\n`{value}`\n" for (key, value) in generic.items()
            )

            return f"{customText}\n\n{sourceText}\n\n" + (
                dedent(
                    f"""
                # Corpus {org} - {repo}{relative}

                """
                )
                + generic
                + dedent(
                    """

                    ## Conversion

                    Converted from TEI to TF

                    ## See also

                    *   [transcription](transcription.md)
                    """
                )
            )

        extraRep = " with NLP output " if tokenBased else ""

        if verbose >= 0:
            console(f"App updating {extraRep} ...")

        for name, info in genTasks.items():
            parentDir = info["parentDir"]
            (sourceBit, targetBit) = (
                parentDir if type(parentDir) is tuple else (parentDir, parentDir)
            )
            file = info[FILE]
            fileParts = file.rsplit(".", 1)
            if len(fileParts) == 1:
                fileParts = [file, ""]
            (fileBase, fileExt) = fileParts
            if fileExt:
                fileExt = f".{fileExt}"
            targetDir = f"{refDir}/{targetBit}"
            itemTarget = f"{targetDir}/{file}"
            itemCustom = f"{targetDir}/{fileBase}_custom{fileExt}"
            itemPre = f"{targetDir}/{fileBase}_orig{fileExt}"

            justCopy = info["justCopy"]
            teiDir = f"{myDir}/{sourceBit}"
            itemSource = f"{teiDir}/{file}"

            # If there is custom info, we do not have to preserve the previous version.
            # Otherwise we save the target before overwriting it, unless it
            # has been saved before.

            preExists = fileExists(itemPre)
            targetExists = fileExists(itemTarget)
            customExists = fileExists(itemCustom)

            msg = ""

            if justCopy:
                if targetExists:
                    msg = "(already exists, not overwritten)"
                    safe = False
                else:
                    msg = "(copied)"
                    safe = True
            else:
                if targetExists:
                    if customExists:
                        msg = "(generated with custom info)"
                    else:
                        if preExists:
                            msg = "(no custom info, older original exists)"
                        else:
                            msg = "(no custom info, original preserved)"
                            fileCopy(itemTarget, itemPre)
                else:
                    msg = "(created)"

            initTree(targetDir, fresh=False)

            if justCopy:
                if safe:
                    fileCopy(itemSource, itemTarget)
            else:
                if fileExists(itemSource):
                    with fileOpen(itemSource) as fh:
                        sourceText = fh.read()
                else:
                    sourceText = ""

                if fileExists(itemCustom):
                    with fileOpen(itemCustom) as fh:
                        customText = fh.read()
                else:
                    customText = ""

                targetText = (
                    createConfig
                    if name == "config"
                    else createApp
                    if name == "app"
                    else createDisplay
                    if name == "display"
                    else createTranscription
                    if name == "trans"
                    else createAbout
                    if name == "about"
                    else fileCopy  # this cannot occur because justCopy is False
                )(sourceText, customText)

                with fileOpen(itemTarget, mode="w") as fh:
                    fh.write(targetText)

            if verbose >= 0:
                console(f"\t{ux(itemTarget):30} {msg}")

        if verbose >= 0:
            console("Done")
        else:
            console(f"App updated{extraRep}")

    # START the TEXT-FABRIC BROWSER on this CORPUS

    def browseTask(self):
        """Implementation of the "browse" task.

        It gives a shell command to start the TF browser on
        the newly created corpus.
        There should be a valid TF dataset and app configuration in place.

        Returns
        -------
        boolean
            Whether the operation was successful.
        """
        if not self.importOK():
            return

        if not self.good:
            return

        org = self.org
        repo = self.repo
        relative = self.relative
        backend = self.backend
        tfVersion = self.tfVersion

        backendOpt = "" if backend == "github" else f"--backend={backend}"
        versionOpt = f"--version={tfVersion}"
        versionOpt = ""
        try:
            run(
                (
                    f"tf {org}/{repo}{relative}:clone --checkout=clone "
                    f"{versionOpt} {backendOpt}"
                ),
                shell=True,
            )
        except KeyboardInterrupt:
            pass

    def task(
        self,
        check=False,
        convert=False,
        load=False,
        app=False,
        apptoken=False,
        browse=False,
        verbose=None,
        validate=None,
    ):
        """Carry out any task, possibly modified by any flag.

        This is a higher level function that can execute a selection of tasks.

        The tasks will be executed in a fixed order:
        `check`, `convert`, `load`, `app`, `apptoken`, `browse`.
        But you can select which one(s) must be executed.

        If multiple tasks must be executed and one fails, the subsequent tasks
        will not be executed.

        Parameters
        ----------
        check: boolean, optional False
            Whether to carry out the `check` task.
        convert: boolean, optional False
            Whether to carry out the `convert` task.
        load: boolean, optional False
            Whether to carry out the `load` task.
        app: boolean, optional False
            Whether to carry out the `app` task.
        apptoken: boolean, optional False
            Whether to carry out the `apptoken` task.
        browse: boolean, optional False
            Whether to carry out the `browse` task.
        verbose: integer, optional -1
            Produce no (-1), some (0) or many (1) progress and reporting messages
        validate: boolean, optional True
            Whether to perform XML validation during the check task

        Returns
        -------
        boolean
            Whether all tasks have executed successfully.
        """
        if not self.importOK():
            return

        if verbose is not None:
            verboseSav = self.verbose
            self.verbose = verbose

        if validate is not None:
            self.validate = validate

        if not self.good:
            return False

        for condition, method, kwargs in (
            (check, self.checkTask, {}),
            (convert, self.convertTask, {}),
            (load, self.loadTask, {}),
            (app, self.appTask, {}),
            (apptoken, self.appTask, dict(tokenBased=True)),
            (browse, self.browseTask, {}),
        ):
            if condition:
                method(**kwargs)
                if not self.good:
                    break

        if verbose is not None:
            self.verbose = verboseSav
        return self.good


def main():
    (good, tasks, params, flags) = readArgs(
        "tf-fromtei", HELP, TASKS, PARAMS, FLAGS, notInAll=TASKS_EXCLUDED
    )
    if not good:
        return False

    Obj = TEI(**params, **flags)
    Obj.task(**tasks, **flags)

    return Obj.good


if __name__ == "__main__":
    sys.exit(0 if main() else 1)
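
The hook mechanism used by `createApp` above can be tried in isolation. The following is a minimal sketch, not the converter's own code: it reuses the same marker conventions (`# DEF name` ... `# END DEF` in the custom file, `# INSERT name` in the template), but the function names `collect_snippets` and `fill_template` and the sample file contents are hypothetical.

```python
import re

# Same marker conventions as in createApp; the three hook names come from the source.
HOOK_START = re.compile(r"^# DEF (import|init|extra)\s*$")
HOOK_END = re.compile(r"^# END DEF\s*$")
HOOK_INSERT = re.compile(r"^\s*# INSERT (import|init|extra)\s*$")


def collect_snippets(custom_text):
    """Collect the named snippets from the text of an app_custom.py file."""
    snippets = {}
    section = None
    for line in custom_text.split("\n"):
        line = line.rstrip()
        if section is None:
            match = HOOK_START.match(line)
            if match:
                section = match.group(1)
                snippets[section] = []
        elif HOOK_END.match(line):
            section = None
        else:
            snippets[section].append(line)
    return snippets


def fill_template(template_text, snippets):
    """Replace every `# INSERT name` line by the snippet with that name."""
    out = []
    for line in template_text.split("\n"):
        line = line.rstrip()
        match = HOOK_INSERT.match(line)
        if match:
            # A hook without a custom snippet simply disappears.
            out.extend(snippets.get(match.group(1), []))
        else:
            out.append(line)
    return "\n".join(out)


# Hypothetical file contents, for illustration only.
custom = "# DEF import\nimport unicodedata\n# END DEF\n"
template = "import re\n# INSERT import\n\n# INSERT init\nx = 1"

result = fill_template(template, collect_snippets(custom))
print(result)
```

Lines outside a `# DEF` ... `# END DEF` pair in the custom file are ignored, so such a file can carry free-form comments between its snippets.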

Functions

def getRefs(tag, atts, xmlFile)
def getRefs(tag, atts, xmlFile):
    refAtt = REFERENCING.get(tag, None)
    result = []

    if refAtt is not None:
        refVal = atts.get(refAtt, None)
        if refVal is not None and not refVal.startswith("http"):
            for refv in refVal.split():
                parts = refv.split("#", 1)
                if len(parts) == 1:
                    targetFile = refv
                    targetId = ""
                else:
                    (targetFile, targetId) = parts
                if targetFile == "":
                    targetFile = xmlFile
                result.append((refAtt, targetFile, targetId))
    return result
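
To see what `getRefs` does with a reference value, here is a simplified sketch of the same splitting logic. It leaves out the lookup in `REFERENCING` (which is defined elsewhere and not shown here); the name `split_refs` and the sample file names are hypothetical.

```python
def split_refs(ref_value, current_file):
    """Split a whitespace-separated reference value into (file, id) pairs.

    References that start with "http" are treated as external and skipped,
    as in getRefs. An empty file part means: the current file.
    """
    result = []
    if ref_value is None or ref_value.startswith("http"):
        return result
    for ref in ref_value.split():
        parts = ref.split("#", 1)
        if len(parts) == 1:
            target_file, target_id = ref, ""
        else:
            target_file, target_id = parts
        result.append((target_file or current_file, target_id))
    return result


refs = split_refs("#n1 notes.xml#n2 bibl.xml", "letter01.xml")
print(refs)
```

The three references resolve to an id in the current file, an id in another file, and another file as a whole, respectively.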
def main()
def main():
    (good, tasks, params, flags) = readArgs(
        "tf-fromtei", HELP, TASKS, PARAMS, FLAGS, notInAll=TASKS_EXCLUDED
    )
    if not good:
        return False

    Obj = TEI(**params, **flags)
    Obj.task(**tasks, **flags)

    return Obj.good
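
`main` passes the selected tasks to `TEI.task`, which runs them in a fixed order and stops at the first failure. The control flow can be sketched without Text-Fabric as follows; `run_tasks` and the stand-in task functions are hypothetical and only illustrate the ordering and the short-circuit.

```python
# The fixed order documented for TEI.task.
TASK_ORDER = ("check", "convert", "load", "app", "apptoken", "browse")


def run_tasks(selected, runners):
    """Run the selected tasks in TASK_ORDER and stop after the first failure.

    selected: dict mapping task names to booleans
    runners: dict mapping task names to zero-argument functions returning a boolean
    Returns a pair (ok, executed) where executed lists the tasks that were run.
    """
    executed = []
    for name in TASK_ORDER:
        if not selected.get(name, False):
            continue
        executed.append(name)
        if not runners[name]():
            return (False, executed)
    return (True, executed)


# Stand-in tasks: everything succeeds, except convert.
runners = {name: (lambda: True) for name in TASK_ORDER}
runners["convert"] = lambda: False

outcome = run_tasks(dict(load=True, convert=True, check=True), runners)
print(outcome)
```

The order in which the tasks are passed does not matter: `check` still runs before `convert`, and `load` is never reached because `convert` fails.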
def makeCssInfo()

Make the CSS info for the style sheet.

def makeCssInfo():
    """Make the CSS info for the style sheet."""
    rends = ""

    for rend, (description, css) in sorted(CSS_REND.items()):
        aliases = CSS_REND_ALIAS.get(rend, "")
        aliases = sorted(set(aliases.split()) | {rend})
        for alias in aliases:
            KNOWN_RENDS.add(alias)
            REND_DESC[alias] = description
        selector = ",".join(f".r_{alias}" for alias in aliases)
        contribution = f"\n{selector} {{{css}}}\n"
        rends += contribution

    return rends
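
The shape of the generated CSS is easiest to see with a small example. The sketch below follows the same selector logic as `makeCssInfo`, but the tables passed in are hypothetical stand-ins for `CSS_REND` and `CSS_REND_ALIAS`, and it leaves out the side effects on `KNOWN_RENDS` and `REND_DESC`.

```python
# Hypothetical tables; the real CSS_REND and CSS_REND_ALIAS are defined elsewhere.
css_rend = {
    "italic": ("cursive font style", "font-style: italic;"),
    "bold": ("heavy font weight", "font-weight: bold;"),
}
css_rend_alias = {"italic": "i it", "bold": "b"}


def make_css(rend_table, alias_table):
    """Build one CSS rule per rend value; all aliases share the rule."""
    rules = []
    for rend, (description, css) in sorted(rend_table.items()):
        aliases = sorted(set(alias_table.get(rend, "").split()) | {rend})
        selector = ",".join(f".r_{alias}" for alias in aliases)
        rules.append(f"{selector} {{{css}}}")
    return "\n".join(rules)


css_text = make_css(css_rend, css_rend_alias)
print(css_text)
```

Every rend value and each of its aliases becomes a class with the prefix `r_`, so a `rend="it"` in the source ends up styled the same way as `rend="italic"`.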

Classes

class TEI (tei='latest', tf='latest', validate=True, verbose=-1)

Converts TEI to TF.

For documentation of the resulting encoding, read the transcription template.

Below we describe how to control the conversion machinery.

We adopt a fair bit of "convention over configuration" here, in order to lessen the burden for the user of specifying so many details.

Based on the current directory from which the script is called, it defines all the ingredients to carry out a tf.convert.walker conversion of the TEI input.

This function is assumed to work in the context of a repository, i.e. a directory on your computer relative to which the input directory exists, and various output directories: tf, app, docs.

Your current directory must be at

~/backend/org/repo/relative

where

  • ~ is your home directory;
  • backend is an online back-end name, like github, gitlab, git.huc.knaw.nl;
  • org is an organization, person, or group in the back-end;
  • repo is a repository in the org.
  • relative is a directory path within the repo (0 or more components)

This is only about the directory structure on your local computer; it is not required that you have online incarnations of your repository in that back-end. Even your local repository does not have to be a git repository.

The only thing that matters is that the full path to your repo can be parsed as a sequence of home/backend/org/repo/relative.
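
As an illustration of this path convention, the sketch below splits a directory into its four components. It is not the `getLocation` function that the converter uses; `parse_location` is a hypothetical helper, and the rule that fewer than three components below the home directory yields `None` values is an assumption.

```python
from pathlib import PurePosixPath


def parse_location(cur_dir, home):
    """Split cur_dir into (backend, org, repo, relative) below home.

    relative is "" or a path that starts with a slash, so that it can be
    appended directly to the repo directory.
    """
    nothing = (None, None, None, None)
    try:
        parts = PurePosixPath(cur_dir).relative_to(home).parts
    except ValueError:
        # cur_dir is not below the home directory
        return nothing
    if len(parts) < 3:
        return nothing
    backend, org, repo = parts[:3]
    relative = "".join(f"/{part}" for part in parts[3:])
    return (backend, org, repo, relative)


loc = parse_location("/home/user/github/myorg/myrepo/sub/part", "/home/user")
print(loc)
```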

Relative to this directory the program expects and creates input / output directories.

Input directories

tei

Location of the TEI-XML sources.

If it does not exist, the program aborts with an error.

Several levels of subdirectories are assumed:

  1. the version of the source (this could be a date string).
  2. volumes / collections of documents. The subdirectory __ignore__ is ignored.
  3. the TEI documents themselves, conforming to the TEI schema or some customization of it.
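
The three levels can be walked as in the following sketch. It builds a throw-away directory tree first, so that it runs anywhere. The helper `list_documents`, the sample names, and the assumption that documents carry the extension `.xml` are illustrative only.

```python
import tempfile
from pathlib import Path


def list_documents(tei_dir, version):
    """Yield (volume, file name) pairs for one version of the sources."""
    version_dir = Path(tei_dir) / version
    for volume in sorted(p for p in version_dir.iterdir() if p.is_dir()):
        if volume.name == "__ignore__":
            continue
        for doc in sorted(volume.glob("*.xml")):
            yield (volume.name, doc.name)


with tempfile.TemporaryDirectory() as tmp:
    # Level 1: version, level 2: volume, level 3: TEI document.
    for rel in (
        "2023-01-01/vol1/a.xml",
        "2023-01-01/vol1/b.xml",
        "2023-01-01/vol2/c.xml",
        "2023-01-01/__ignore__/x.xml",
    ):
        path = Path(tmp) / "tei" / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("<TEI/>")

    docs = list(list_documents(Path(tmp) / "tei", "2023-01-01"))

print(docs)
```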

schema

TEI or other XML schemas against which the sources can be validated.

They should be XSD or RNG files.

Multiple XSD files

When you started with an RNG file and used tf.tools.xmlschema to convert it to XSD, you may have got multiple XSD files. One of them has the same base name as the original RNG file, and you should pass that name. It will import the remaining XSD files, so do not throw them away.

We use these files as custom TEI schemas, but to be sure, we still analyse the full TEI schema and use the schemas here as a set of overriding element definitions.

Output directories

report

Directory to write the results of the check task to: an inventory of elements / attributes encountered, and possible validation errors. If the directory does not exist, it will be created. The default value is . (i.e. the current directory in which the script is invoked).

tf

The directory under which the TF output files (with extension .tf) are placed. If it does not exist, it will be created. The TF files will be generated in a folder named by a version number, passed as tfVersion.

app and docs

Location of additional TF app configuration and documentation files. If they do not exist, they will be created with some sensible default settings and generated documentation. These settings can be overridden in the app/config_custom.yaml file. Also a default display.css file and a logo are added.
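
Overriding here means that the custom settings are merged into the generated ones. How the converter's `mergeDict` resolves conflicts is not shown on this page; the sketch below assumes a recursive merge in which the custom value wins, and `merge_settings` and the sample settings are hypothetical.

```python
def merge_settings(base, custom):
    """Recursively merge custom into base, in place; custom wins on conflicts."""
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            merge_settings(base[key], value)
        else:
            base[key] = value
    return base


# Hypothetical generated settings and custom additions.
generated = {"provenanceSpec": {"branch": "main", "version": "0.1"}, "docSettings": {}}
custom = {"provenanceSpec": {"version": "0.2"}, "typeDisplay": {"word": {"base": True}}}

merged = merge_settings(generated, custom)
print(merged)
```

Under this assumption, a custom file only needs to list the keys that differ from the generated defaults.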

Custom content for these files can be provided in files with _custom appended to their base name.
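
The naming rule follows the file handling in the `appTask` source: the suffix goes between the base name and the extension, and a backup of a previously generated file gets the suffix `_orig` in the same way. A small sketch, with the hypothetical helper `companion_names`:

```python
def companion_names(file):
    """Return the (custom, orig) file names that belong to a generated file."""
    parts = file.rsplit(".", 1)
    base, ext = parts if len(parts) == 2 else (file, "")
    if ext:
        ext = f".{ext}"
    return (f"{base}_custom{ext}", f"{base}_orig{ext}")


names = [companion_names(f) for f in ("config.yaml", "app.py", "README")]
print(names)
```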

docs

Location of additional documentation. This can be generated or hand-written material, or a mixture of the two.

Parameters

tei : string, optional "latest"

If empty, use the latest version under the tei directory with sources. Otherwise, if it is a valid integer, it is used as the index in the sorted list of versions there.

  • 0 or latest: latest version;
  • -1, -2, … : previous version, version before previous, …;
  • 1, 2, …: first version, second version, ....
  • everything else that is not a number is an explicit version

If the value cannot be parsed as an integer, it is used as the exact version name.
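
The selection rules can be sketched as follows. This is an illustration, not the converter's code: `pick_version` is hypothetical, versions are sorted as plain strings, and returning `None` for an index that is out of range is an assumption.

```python
def pick_version(versions, spec):
    """Resolve a version specifier against the available source versions."""
    versions = sorted(versions)
    if spec in ("", "latest"):
        spec = "0"
    try:
        index = int(spec)
    except ValueError:
        # Not a number: the specifier is the exact version name.
        return spec
    # 1, 2, ... count from the first version; 0, -1, ... count back from the latest.
    position = index - 1 if index > 0 else len(versions) - 1 + index
    if not 0 <= position < len(versions):
        return None
    return versions[position]


available = ["2022-03-15", "2021-05-01", "2023-01-01"]

for spec in ("latest", "0", "-1", "1", "2", "draft"):
    print(f"{spec!r:10} => {pick_version(available, spec)}")
```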

tf : string, optional "latest"

If empty, the TF version used will be the latest one under the tf directory. If the parameter prelim was used in the initialization of the TEI object, only versions ending in pre will be taken into account.

If it can be parsed as the integers 1, 2, or 3 it will bump the latest relevant TF version:

  • 0 or latest: overwrite the latest version
  • 1 will bump the major version
  • 2 will bump the intermediate version
  • 3 will bump the minor version
  • everything else is an explicit version

Otherwise, the value is taken as the exact version name.
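
Bumping can be pictured with the sketch below. It assumes a purely numeric, dot-separated version and assumes that the components after the bumped one are reset to zero; neither the reset nor the handling of versions ending in `pre` is specified here, and `bump_version` is a hypothetical helper.

```python
def bump_version(latest, level):
    """Bump component number `level` (1 = major, 2 = intermediate, 3 = minor).

    Assumption: the components after the bumped one are reset to 0.
    """
    parts = [int(part) for part in latest.split(".")]
    if not 1 <= level <= len(parts):
        raise ValueError(f"cannot bump part {level} of version {latest}")
    parts[level - 1] += 1
    for i in range(level, len(parts)):
        parts[i] = 0
    return ".".join(str(part) for part in parts)


bumped = [bump_version("1.2.3", level) for level in (1, 2, 3)]
print(bumped)
```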

verbose : integer, optional -1
Produce no (-1), some (0) or many (1) progress and reporting messages
class TEI(CheckImport):
    def __init__(
        self,
        tei=PARAMS["tei"][1],
        tf=PARAMS["tf"][1],
        validate=PARAMS["validate"][1],
        verbose=FLAGS["verbose"][1],
    ):
        """Converts TEI to TF.

        For documentation of the resulting encoding, read the
        [transcription template](https://github.com/annotation/text-fabric/blob/master/tf/convert/app/transcription.md).

        Below we describe how to control the conversion machinery.

        We adopt a fair bit of "convention over configuration" here, in order to lessen
        the burden for the user of specifying so many details.

        Based on the current directory from which the script is called,
        it defines all the ingredients to carry out
        a `tf.convert.walker` conversion of the TEI input.

        This function is assumed to work in the context of a repository,
        i.e. a directory on your computer relative to which the input directory exists,
        and various output directories: `tf`, `app`, `docs`.

        Your current directory must be at

        ```
        ~/backend/org/repo/relative
        ```

        where

        *   `~` is your home directory;
        *   `backend` is an online back-end name,
            like `github`, `gitlab`, `git.huc.knaw.nl`;
        *   `org` is an organization, person, or group in the back-end;
        *   `repo` is a repository in the `org`.
        *   `relative` is a directory path within the repo (0 or more components)

        This is only about the directory structure on your local computer;
        it is not required that you have online incarnations of your repository
        in that back-end.
        Even your local repository does not have to be a git repository.

        The only thing that matters is that the full path to your repo can be parsed
        as a sequence of `home/backend/org/repo/relative`.

        Relative to this directory the program expects and creates
        input / output directories.

        ## Input directories

        ### `tei`

        *Location of the TEI-XML sources.*

        **If it does not exist, the program aborts with an error.**

        Several levels of subdirectories are assumed:

        1.  the version of the source (this could be a date string).
        1.  volumes / collections of documents. The subdirectory `__ignore__` is ignored.
        1.  the TEI documents themselves, conforming to the TEI schema or some
            customization of it.

        ### `schema`

        *TEI or other XML schemas against which the sources can be validated.*

        They should be XSD or RNG files.

        !!! note "Multiple XSD files"
            When you started with an RNG file and used `tf.tools.xmlschema` to
            convert it to XSD, you may have got multiple XSD files.
            One of them has the same base name as the original RNG file,
            and you should pass that name. It will import the remaining XSD files,
            so do not throw them away.

        We use these files as custom TEI schemas,
        but to be sure, we still analyse the full TEI schema and
        use the schemas here as a set of overriding element definitions.

        ## Output directories

        ### `report`

        Directory to write the results of the `check` task to: an inventory
        of elements / attributes encountered, and possible validation errors.
        If the directory does not exist, it will be created.
        The default value is `.` (i.e. the current directory in which
        the script is invoked).

        ### `tf`

        The directory under which the TF output files (with extension `.tf`)
        are placed.
        If it does not exist, it will be created.
        The TF files will be generated in a folder named by a version number,
        passed as `tfVersion`.

        ### `app` and `docs`

        Location of additional TF app configuration and documentation files.
        If they do not exist, they will be created with some sensible default
        settings and generated documentation.
        These settings can be overridden in the `app/config_custom.yaml` file.
        Also a default `display.css` file and a logo are added.

        Custom content for these files can be provided in files
        with `_custom` appended to their base name.

        ### `docs`

        Location of additional documentation.
        This can be generated or hand-written material, or a mixture of the two.

        Parameters
        ----------
        tei: string, optional "latest"
            If empty, use the latest version under the `tei` directory with sources.
            Otherwise, if it is a valid integer, it is used as the index in the
            sorted list of versions there.

            *   `0` or `latest`: latest version;
            *   `-1`, `-2`, ... : previous version, version before previous, ...;
            *   `1`, `2`, ...: first version, second version, ....
            *   everything else that is not a number is an explicit version

            If the value cannot be parsed as an integer, it is used as the exact
            version name.

        tf: string, optional "latest"
            If empty, the TF version used will be the latest one under the `tf`
            directory. If the parameter `prelim` was used in the initialization of
            the TEI object, only versions ending in `pre` will be taken into account.

            If it can be parsed as the integers 1, 2, or 3 it will bump the latest
            relevant TF version:

            *   `0` or `latest`: overwrite the latest version
            *   `1` will bump the major version
            *   `2` will bump the intermediate version
            *   `3` will bump the minor version
            *   everything else is an explicit version

            Otherwise, the value is taken as the exact version name.

        verbose: integer, optional -1
            Produce no (-1), some (0) or many (1) progress and reporting messages

        """
        super().__init__("lxml")
        if self.importOK(hint=True):
            self.etree = self.importGet()
        else:
            return

        self.good = True

        (backend, org, repo, relative) = getLocation()

        if any(s is None for s in (backend, org, repo, relative)):
            console(
                (
                    "Not working in a repo: "
                    f"backend={backend} org={org} repo={repo} relative={relative}"
                ),
                error=True,
            )
            self.good = False
            return

        if verbose == 1:
            console(
                f"Working in repository {org}/{repo}{relative} in back-end {backend}"
            )

        base = ex(f"~/{backend}")
        repoDir = f"{base}/{org}/{repo}"
        refDir = f"{repoDir}{relative}"
        programDir = f"{refDir}/programs"
        schemaDir = f"{refDir}/schema"
        convertSpec = f"{programDir}/tei.yaml"
        convertCustom = f"{programDir}/tei.py"

        self.schemaDir = schemaDir

        settings = readYaml(asFile=convertSpec, plain=True)

        customKeys = set(
            """
            transform
            beforeTag
            beforeChildren
            afterChildren
            afterTag
        """.strip().split()
        )

        functionType = type(lambda x: x)

        if fileExists(convertCustom):
            hooked = []

            try:
                spec = util.spec_from_file_location("teicustom", convertCustom)
                code = util.module_from_spec(spec)
                sys.path.insert(0, dirNm(convertCustom))
                spec.loader.exec_module(code)
                sys.path.pop(0)
                for method in customKeys:
                    if not hasattr(code, method):
                        continue

                    func = getattr(code, method)
                    typeFunc = type(func)
                    if typeFunc is not functionType:
                        console(
                            (
                                f"custom member {method} should be a function, "
                                f"but it is a {typeFunc.__name__}"
                            ),
                            error=True,
                        )
                        continue

                    methodC = f"{method}Custom"
                    setattr(self, methodC, func)
                    hooked.append(method)

            except Exception as e:
                console(str(e), error=True)
                for method in customKeys:
                    if not hasattr(self, method):
                        methodC = f"{method}Custom"
                        setattr(self, methodC, None)

            if verbose >= 0:
                console("With custom behaviour hooked in at:")
                for method in hooked:
                    methodC = f"{method}Custom"
                    console(f"\t{methodC} = {ux(convertCustom)}.{method}")

        generic = settings.get("generic", {})
        extra = settings.get("extra", {})
        models = settings.get("models", [])
        templates = settings.get("templates", [])
        templateTrigger = settings.get("templateTrigger", None)
        adaptations = settings.get("adaptations", [])
        adaptationTrigger = settings.get("adaptationTrigger", None)
        prelim = settings.get("prelim", True)
        granularity = settings.get("granularity", TOKEN)
        wordAsSlot = granularity == WORD
        tokenAsSlot = granularity == TOKEN
        charAsSlot = granularity == CHAR
        parentEdges = settings.get("parentEdges", True)
        siblingEdges = settings.get("siblingEdges", True)
        procins = settings.get("procins", False)

        lineModel = settings.get("lineModel", {})
        lineModel = checkModel(LINE, lineModel, verbose)

        if not lineModel:
            self.good = False
            return

        makeLineElems = lineModel["model"] == "II"
        lineProperties = lineModel.get("properties", None)
        lineModel = lineModel["model"]

        self.makeLineElems = makeLineElems
        self.lineModel = lineModel
        self.lineProperties = lineProperties

        pageModel = settings.get("pageModel", {})
        pageModel = checkModel(PAGE, pageModel, verbose)

        if not pageModel:
            self.good = False
            return

        makePageElems = pageModel["model"] == "II"
        pageProperties = pageModel.get("properties", None)
        pageModel = pageModel["model"]

        self.makePageElems = makePageElems
        self.pageModel = pageModel
        self.pageProperties = pageProperties

        sectionModel = settings.get("sectionModel", {})
        sectionModel = checkModel("section", sectionModel, verbose)
        if not sectionModel:
            self.good = False
            return

        sectionProperties = sectionModel.get("properties", None)
        sectionModel = sectionModel["model"]
        self.sectionModel = sectionModel
        self.sectionProperties = sectionProperties

        self.generic = generic
        self.extra = extra
        self.models = models
        self.templates = templates
        self.adaptations = adaptations

        if templateTrigger is None:
            self.templateAtt = None
            self.templateTag = None
        else:
            (tag, att) = templateTrigger.split("@")
            self.templateAtt = att
            self.templateTag = tag

        if adaptationTrigger is None:
            self.adaptationAtt = None
            self.adaptationTag = None
        else:
            (tag, att) = adaptationTrigger.split("@")
            self.adaptationAtt = att
            self.adaptationTag = tag

        templateTag = self.templateTag
        templateAtt = self.templateAtt
        adaptationTag = self.adaptationTag
        adaptationAtt = self.adaptationAtt

        triggers = {}
        self.triggers = triggers

        for kind, theAtt, theTag in (
            ("template", templateAtt, templateTag),
            ("adaptation", adaptationAtt, adaptationTag),
        ):
            triggerRe = None

            if theAtt is not None and theTag is not None:
                tagPat = re.escape(theTag)
                attPat = re.escape(theAtt)
                triggerRe = re.compile(
                    rf"""<{tagPat}\b[^>]*?{attPat}=['"]([^'"]+)['"]"""
                )
            triggers[kind] = triggerRe

        self.A = Analysis(verbose=verbose)
        self.readSchemas()

        self.prelim = prelim
        self.wordAsSlot = wordAsSlot
        self.tokenAsSlot = tokenAsSlot
        self.charAsSlot = charAsSlot
        self.parentEdges = parentEdges
        self.siblingEdges = siblingEdges
        self.procins = procins

        reportDir = f"{refDir}/report"
        appDir = f"{refDir}/app"
        docsDir = f"{refDir}/docs"
        teiDir = f"{refDir}/tei"
        tfDir = f"{refDir}/tf"

        teiVersions = sorted(dirContents(teiDir)[1], key=versionSort)
        nTeiVersions = len(teiVersions)

        if tei in {"latest", "", "0", 0} or str(tei).lstrip("-").isdecimal():
            # "latest" and "" both mean: the last version; int("") would raise
            teiIndex = (0 if tei in {"latest", ""} else int(tei)) - 1

            try:
                teiVersion = teiVersions[teiIndex]
            except Exception:
                absIndex = teiIndex + (nTeiVersions if teiIndex < 0 else 0) + 1
                console(
                    (
                        f"no item {absIndex} in {nTeiVersions} source versions "
                        f"in {ux(teiDir)}"
                    )
                    if len(teiVersions)
                    else f"no source versions in {ux(teiDir)}",
                    error=True,
                )
                self.good = False
                return
        else:
            teiVersion = tei

        teiPath = f"{teiDir}/{teiVersion}"
        reportPath = f"{reportDir}/{teiVersion}"

        if not dirExists(teiPath):
            console(
                f"source version {teiVersion} does not exist in {ux(teiDir)}",
                error=True,
            )
            self.good = False
            return

        teiStatuses = {tv: i for (i, tv) in enumerate(reversed(teiVersions))}
        teiStatus = teiStatuses[teiVersion]
        teiStatusRep = (
            "most recent"
            if teiStatus == 0
            else "previous"
            if teiStatus == 1
            else f"{teiStatus - 1} before previous"
        )
        if teiStatus == len(teiVersions) - 1 and len(teiVersions) > 1:
            teiStatusRep = "oldest"

        if verbose >= 0:
            console(f"TEI data version is {teiVersion} ({teiStatusRep})")

        tfVersions = sorted(dirContents(tfDir)[1], key=versionSort)
        if prelim:
            tfVersions = [tv for tv in tfVersions if tv.endswith(PRE)]

        latestTfVersion = (
            tfVersions[-1] if len(tfVersions) else ("0.0.0" + (PRE if prelim else ""))
        )
        if tf in {"latest", "", "0", 0}:
            tfVersion = latestTfVersion
            vRep = "latest"
        elif tf in {"1", "2", "3", 1, 2, 3}:
            bump = int(tf)
            parts = latestTfVersion.split(".")

            def getVer(b):
                return (
                    int(parts[b].removesuffix(PRE))
                    if prelim and b == len(parts) - 1
                    else int(parts[b])
                )

            def setVer(b, val):
                parts[b] = f"{val}{PRE}" if prelim and b == len(parts) - 1 else f"{val}"

            if bump > len(parts):
                console(
                    f"Cannot bump part {bump} of latest TF version {latestTfVersion}",
                    error=True,
                )
                self.good = False
                return
            else:
                b1 = bump - 1
                old = getVer(b1)
                setVer(b1, old + 1)
                for b in range(b1 + 1, len(parts)):
                    setVer(b, 0)
                tfVersion = ".".join(str(p) for p in parts)
                vRep = (
                    "major" if bump == 1 else "intermediate" if bump == 2 else "minor"
                )
                vRep = f"next {vRep}"
        else:
            tfVersion = tf
            status = "existing" if dirExists(f"{tfDir}/{tfVersion}") else "new"
            vRep = f"explicit {status}"

        tfPath = f"{tfDir}/{tfVersion}"

        if verbose >= 0:
            console(f"TF data version is {tfVersion} ({vRep})")
            console(
                f"Processing instructions are {'treated' if procins else 'ignored'}"
            )

        self.refDir = refDir
        self.teiVersion = teiVersion
        self.teiPath = teiPath
        self.tfVersion = tfVersion
        self.tfPath = tfPath
        self.reportPath = reportPath
        self.tfDir = tfDir
        self.appDir = appDir
        self.docsDir = docsDir
        self.backend = backend
        self.org = org
        self.repo = repo
        self.relative = relative

        levelNames = sectionProperties["levels"]
        self.levelNames = levelNames
        self.chunkLevel = levelNames[-1]

        if sectionModel == "II":
            self.chapterSection = levelNames[0]
            self.chunkSection = levelNames[1]
        else:
            self.folderSection = levelNames[0]
            self.fileSection = levelNames[1]
            self.chunkSection = levelNames[2]
            self.backMatter = sectionProperties.get("backMatter", None)

        chunkSection = self.chunkSection
        intFeatures = {"empty", chunkSection}
        self.intFeatures = intFeatures

        if siblingEdges:
            intFeatures.add("sibling")

        slotType = WORD if wordAsSlot else T if tokenAsSlot else CHAR
        self.slotType = slotType

        sectionFeatures = ",".join(levelNames)
        sectionTypes = ",".join(levelNames)

        textFeatures = "{ch}" if charAsSlot else "{str}{after}"
        otext = {
            "fmt:text-orig-full": textFeatures,
            "sectionFeatures": sectionFeatures,
            "sectionTypes": sectionTypes,
        }
        self.otext = otext

        featureMeta = dict(
            str=dict(
                description="the text of a word or token",
                conversionMethod=CM_LITC,
                conversionCode=CONVERSION_METHODS[CM_LITC],
            ),
            after=dict(
                description="the text after a word till the next word",
                conversionMethod=CM_LITC,
                conversionCode=CONVERSION_METHODS[CM_LITC],
            ),
            empty=dict(
                description="whether a slot has been inserted in an empty element",
                conversionMethod=CM_PROV,
                conversionCode=CONVERSION_METHODS[CM_PROV],
            ),
            is_meta=dict(
                description="whether a slot or word is in the teiHeader element",
                conversionMethod=CM_LITC,
                conversionCode=CONVERSION_METHODS[CM_LITC],
            ),
            is_note=dict(
                description="whether a slot or word is in the note element",
                conversionMethod=CM_LITC,
                conversionCode=CONVERSION_METHODS[CM_LITC],
            ),
        )
        if charAsSlot:
            featureMeta["extraspace"] = dict(
                description=(
                    "whether a space has been added after a character, "
                    "when it is in the direct child of a pure XML element"
                ),
                conversionMethod=CM_LITP,
                conversionCode=CONVERSION_METHODS[CM_LITP],
            )
            featureMeta["ch"] = dict(
                description="the UNICODE character of a slot",
                conversionMethod=CM_LITC,
                conversionCode=CONVERSION_METHODS[CM_LITC],
            )
        if parentEdges:
            featureMeta["parent"] = dict(
                description="edge between a node and its parent node",
                conversionMethod=CM_LITP,
                conversionCode=CONVERSION_METHODS[CM_LITP],
            )
        if siblingEdges:
            featureMeta["sibling"] = dict(
                description=(
                    "edge between a node and its preceding sibling nodes; "
                    "labeled with the distance between them"
                ),
                conversionMethod=CM_LITP,
                conversionCode=CONVERSION_METHODS[CM_LITP],
            )
        featureMeta[chunkSection] = dict(
            description=f"number of a {chunkSection} within a document",
            conversionMethod=CM_PROV,
            conversionCode=CONVERSION_METHODS[CM_PROV],
        )

        if sectionModel == "II":
            chapterSection = self.chapterSection
            featureMeta[chapterSection] = dict(
                description=f"name of {chapterSection}",
                conversionMethod=CM_PROV,
                conversionCode=CONVERSION_METHODS[CM_PROV],
            )
        else:
            folderSection = self.folderSection
            fileSection = self.fileSection
            featureMeta[folderSection] = dict(
                description=f"name of source {folderSection}",
                conversionMethod=CM_PROV,
                conversionCode=CONVERSION_METHODS[CM_PROV],
            )
            featureMeta[fileSection] = dict(
                description=f"name of source {fileSection}",
                conversionMethod=CM_PROV,
                conversionCode=CONVERSION_METHODS[CM_PROV],
            )

        self.featureMeta = featureMeta

        generic["sourceFormat"] = "TEI"
        generic["version"] = tfVersion
        generic["teiVersion"] = teiVersion
        generic["schema"] = "TEI" + ((" + " + " + ".join(models)) if models else "")

        extraInstructions = []

        for feat, featSpecs in extra.items():
            featMeta = featSpecs.get("meta", {})
            if "valueType" in featMeta:
                if featMeta["valueType"] == "int":
                    intFeatures.add(feat)
                del featMeta["valueType"]

            featPath = featSpecs.get("path", None)
            featPathRep = "" if featPath is None else "the content is taken from "
            featPathLogical = []

            sep = ""
            for comp in reversed(featPath or []):
                if type(comp) is str:
                    featPathRep += f"{sep}{comp}"
                    featPathLogical.append((comp, None))
                else:
                    for tag, atts in comp.items():
                        # there is only one item in this dict
                        featPathRep += f"{sep}{tag}["
                        featPathRep += ",".join(
                            f"{att}={v}" for (att, v) in sorted(atts.items())
                        )
                        featPathRep += "]"
                        featPathLogical.append((tag, atts))
                sep = "/"

            featureMeta[feat] = {
                k: v.replace("«base»", featPathRep) for (k, v) in featMeta.items()
            }
            nodeType = featSpecs.get("nodeType", None)
            if nodeType is not None and featPath:
                extraInstructions.append(
                    (list(reversed(featPathLogical)), nodeType, feat)
                )

        self.extraInstructions = tuple(extraInstructions)

        self.verbose = verbose
        self.validate = validate
        myDir = dirNm(abspath(__file__))
        self.myDir = myDir

    def readSchemas(self):
        schemaDir = self.schemaDir
        models = self.models
        A = self.A

        schemaFiles = dict(rng={}, xsd={})
        self.schemaFiles = schemaFiles
        modelInfo = {}
        self.modelInfo = modelInfo
        modelXsd = {}
        self.modelXsd = modelXsd
        modelInv = {}
        self.modelInv = modelInv

        for model in [None] + models:
            for kind in ("rng", "xsd"):
                schemaFile = (
                    A.getBaseSchema()[kind]
                    if model is None
                    else f"{schemaDir}/{model}.{kind}"
                )
                if fileExists(schemaFile):
                    schemaFiles[kind][model] = schemaFile
                    if kind == "rng" or (
                        kind == "xsd" and model not in schemaFiles["rng"]
                    ):
                        modelInfo[model] = schemaFile
            if model in schemaFiles["rng"] and model not in schemaFiles["xsd"]:
                schemaFileXsd = f"{schemaDir}/{model}.xsd"
                A.fromrelax(schemaFiles["rng"][model], schemaFileXsd)
                schemaFiles["xsd"][model] = schemaFileXsd

        baseSchema = schemaFiles["xsd"][None]
        modelXsd[None] = baseSchema
        modelInv[(baseSchema, None)] = None

        for model in models:
            override = schemaFiles["xsd"][model]
            modelXsd[model] = override
            modelInv[(baseSchema, override)] = model

    def getSwitches(self, xmlPath):
        verbose = self.verbose
        models = self.models
        adaptations = self.adaptations
        templates = self.templates
        triggers = self.triggers
        A = self.A

        text = None

        found = {}

        for kind, allOfKind in (
            ("model", models),
            ("adaptation", adaptations),
            ("template", templates),
        ):
            if text is None:
                with fileOpen(xmlPath) as fh:
                    text = fh.read()

            found[kind] = None

            if kind == "model":
                result = A.getModel(text)
                if result is None or result == "tei_all":
                    result = None
            else:
                result = None
                triggerRe = triggers[kind]
                if triggerRe is not None:
                    match = triggerRe.search(text)
                    result = match.group(1) if match else None

            if result is not None and result not in allOfKind:
                if verbose >= 0:
                    console(f"unavailable {kind} {result} in {ux(xmlPath)}")
                result = None
            found[kind] = result

        return (found["model"], found["adaptation"], found["template"])

    def getParser(self):
        """Configure the LXML parser.

        See [parser options](https://lxml.de/parsing.html#parser-options).

        Returns
        -------
        object
            A configured LXML parser object, or `None` if the import check fails.
        """
        if not self.importOK():
            return None

        etree = self.etree
        procins = self.procins

        return etree.XMLParser(
            remove_blank_text=False,
            collect_ids=False,
            remove_comments=True,
            remove_pis=not procins,
            huge_tree=True,
        )

    def getXML(self):
        """Make an inventory of the TEI source files.

        Returns
        -------
        tuple of tuple | string
            If section model I is in force:

            The outer tuple has sorted entries corresponding to folders under the
            TEI input directory.
            Each such entry consists of the folder name and an inner tuple
            that contains the file names in that folder, sorted.

            If section model II is in force:

            It is the name of the single XML file.
        """
        verbose = self.verbose
        teiPath = self.teiPath
        sectionModel = self.sectionModel
        if verbose == 1:
            console(f"Section model {sectionModel}")

        if sectionModel == "I":
            backMatter = self.backMatter

            IGNORE = "__ignore__"

            xmlFilesRaw = collections.defaultdict(list)

            with scanDir(teiPath) as dh:
                for folder in dh:
                    folderName = folder.name
                    if folderName == IGNORE:
                        continue
                    if not folder.is_dir():
                        continue
                    with scanDir(f"{teiPath}/{folderName}") as fh:
                        for file in fh:
                            fileName = file.name
                            if not (
                                fileName.lower().endswith(".xml") and file.is_file()
                            ):
                                continue
                            xmlFilesRaw[folderName].append(fileName)

            xmlFiles = []
            hasBackMatter = False

            for folderName in sorted(xmlFilesRaw, key=versionSort):
                if folderName == backMatter:
                    hasBackMatter = True
                else:
                    fileNames = xmlFilesRaw[folderName]
                    xmlFiles.append((folderName, tuple(sorted(fileNames))))

            if hasBackMatter:
                fileNames = xmlFilesRaw[backMatter]
                xmlFiles.append((backMatter, tuple(sorted(fileNames))))

            xmlFiles = tuple(xmlFiles)

            return xmlFiles

        if sectionModel == "II":
            xmlFile = None
            with scanDir(teiPath) as fh:
                for file in fh:
                    fileName = file.name
                    if not (fileName.lower().endswith(".xml") and file.is_file()):
                        continue
                    xmlFile = fileName
                    break
            return xmlFile

    def checkTask(self):
        """Implementation of the "check" task.

        It validates the TEI, but only if a schema file has been passed explicitly
        when constructing the `TEI()` object.

        Then it makes an inventory of all elements and attributes in the TEI files.

        If tags are used in multiple namespaces, it will be reported.

        !!! caution "Conflation of namespaces"
            The TEI to TF conversion does construct node types and attributes
            without taking namespaces into account.
            However, the parsing process is namespace aware.

        The inventory lists all elements and attributes, and many attribute values.
        But it represents any digit with `N`, and the values of some attributes that
        contain ids or keywords are reduced to `X`.

        This information reduction helps to get a clear overview.

        It writes reports to the `reportPath`:

        *   `errors.txt`: validation errors
        *   `namespaces.txt`: tags and the namespaces in which they occur
        *   `elements.txt`: element / attribute inventory.
        """
        if not self.importOK():
            return

        if not self.good:
            return

        verbose = self.verbose
        procins = self.procins
        validate = self.validate
        modelInfo = self.modelInfo
        modelInv = self.modelInv
        modelXsd = self.modelXsd
        A = self.A
        etree = self.etree

        teiPath = self.teiPath
        reportPath = self.reportPath
        docsDir = self.docsDir
        sectionModel = self.sectionModel

        if verbose == 1:
            console(f"TEI to TF checking: {ux(teiPath)} => {ux(reportPath)}")
        if verbose >= 0:
            console(
                f"Processing instructions are {'treated' if procins else 'ignored'}"
            )
            console(f"XML validation will be {'performed' if validate else 'skipped'}")

        kindLabels = dict(
            format="Formatting Attributes",
            keyword="Keyword Attributes",
            rest="Remaining Attributes and Elements",
        )
        getStore = lambda: collections.defaultdict(  # noqa: E731
            lambda: collections.defaultdict(collections.Counter)
        )
        analysis = {x: getStore() for x in kindLabels}
        errors = []
        tagByNs = collections.defaultdict(collections.Counter)
        refs = collections.defaultdict(collections.Counter)
        ids = collections.defaultdict(collections.Counter)

        parser = self.getParser()
        baseSchema = modelXsd[None]
        overrides = [
            override for (model, override) in modelXsd.items() if model is not None
        ]
        A.getElementInfo(baseSchema, overrides, verbose=verbose)
        elementDefs = A.elementDefs

        initTree(reportPath)
        initTree(docsDir)

        nProcins = 0

        lbParents = collections.Counter()

        def analyse(root, analysis, xmlFile):
            FORMAT_ATTS = set(
                """
                dim
                level
                place
                rend
            """.strip().split()
            )

            KEYWORD_ATTS = set(
                """
                facs
                form
                function
                lang
                reason
                type
                unit
                who
            """.strip().split()
            )

            TRIM_ATTS = set(
                """
                id
                key
                target
                value
            """.strip().split()
            )

            NUM_RE = re.compile(r"""[0-9]""", re.S)

            def nodeInfo(xnode):
                nonlocal nProcins

                if procins and isinstance(xnode, etree._ProcessingInstruction):
                    target = xnode.target
                    tag = f"?{target}"
                    ns = ""
                    nProcins += 1
                else:
                    qName = etree.QName(xnode.tag)
                    tag = qName.localname
                    ns = qName.namespace

                atts = {etree.QName(k).localname: v for (k, v) in xnode.attrib.items()}

                tagByNs[tag][ns] += 1

                if tag == "lb":
                    parentTag = etree.QName(xnode.getparent().tag).localname
                    lbParents[parentTag] += 1

                if len(atts) == 0:
                    kind = "rest"
                    analysis[kind][tag][""][""] += 1
                else:
                    idv = atts.get("id", None)

                    if idv is not None:
                        ids[xmlFile][idv] += 1

                    for refAtt, targetFile, targetId in getRefs(tag, atts, xmlFile):
                        refs[xmlFile][(targetFile, targetId)] += 1

                    for k, v in atts.items():
                        kind = (
                            "format"
                            if k in FORMAT_ATTS
                            else "keyword"
                            if k in KEYWORD_ATTS
                            else "rest"
                        )
                        dest = analysis[kind]

                        if kind == "rest":
                            vTrim = "X" if k in TRIM_ATTS else NUM_RE.sub("N", v)
                            dest[tag][k][vTrim] += 1
                        else:
                            words = v.strip().split()
                            for w in words:
                                dest[tag][k][w.strip()] += 1

                for child in xnode.iterchildren(
                    tag=(etree.Element, etree.ProcessingInstruction)
                    if procins
                    else etree.Element
                ):
                    nodeInfo(child)

            nodeInfo(root)

        def writeErrors():
            """Write the errors to a file."""

            errorFile = f"{reportPath}/errors.txt"

            nErrors = 0
            nFiles = 0

            with fileOpen(errorFile, mode="w") as fh:
                prevFolder = None
                prevFile = None

                for folder, file, line, col, kind, text in errors:
                    newFolder = prevFolder != folder
                    newFile = newFolder or prevFile != file

                    if newFile:
                        nFiles += 1

                    if kind == "error":
                        nErrors += 1

                    indent1 = f"{folder}\n\t" if newFolder else "\t"
                    indent2 = f"{file}\n\t\t" if newFile else "\t"
                    loc = f"{line or ''}:{col or ''}"
                    text = "\n".join(wrap(text, width=80, subsequent_indent="\t\t\t"))
                    fh.write(f"{indent1}{indent2}{loc} {kind or ''} {text}\n")
                    prevFolder = folder
                    prevFile = file

            if nErrors:
                console(
                    (
                        f"{nErrors} validation error(s) in {nFiles} file(s) "
                        f"written to {errorFile}"
                    ),
                    error=True,
                )
            else:
                if verbose >= 0:
                    if validate:
                        console("Validation OK")
                    else:
                        console("No validation performed")

        def writeNamespaces():
            errorFile = f"{reportPath}/namespaces.txt"

            nErrors = 0

            nTags = len(tagByNs)

            with fileOpen(errorFile, mode="w") as fh:
                for tag, nsInfo in sorted(
                    tagByNs.items(), key=lambda x: (-len(x[1]), x[0])
                ):
                    label = "OK"
                    nNs = len(nsInfo)
                    if nNs > 1:
                        nErrors += 1
                        label = "XX"

                    for ns, amount in sorted(
                        nsInfo.items(), key=lambda x: (-x[1], x[0])
                    ):
                        fh.write(
                            f"{label} {nNs:>2} namespace for "
                            f"{tag:<16} : {amount:>5}x {ns}\n"
                        )

            if verbose >= 0:
                if procins:
                    plural = "" if nProcins == 1 else "s"
                    console(f"{nProcins} processing instruction{plural} encountered.")

                console(
                    f"{nTags} tags of which {nErrors} with multiple namespaces "
                    f"written to {errorFile}"
                )

        def writeReport():
            reportFile = f"{reportPath}/elements.txt"
            with fileOpen(reportFile, mode="w") as fh:
                fh.write(
                    "Inventory of tags and attributes in the source XML file(s).\n"
                    "Contains the following sections:\n"
                )
                for label in kindLabels.values():
                    fh.write(f"\t{label}\n")
                fh.write("\n\n")

                infoLines = 0

                def writeAttInfo(tag, att, attInfo):
                    nonlocal infoLines
                    nl = "" if tag == "" else "\n"
                    tagRep = "" if tag == "" else f"<{tag}>"
                    attRep = "" if att == "" else f"{att}="
                    atts = sorted(attInfo.items())
                    (val, amount) = atts[0]
                    fh.write(
                        f"{nl}\t{tagRep:<18} {attRep:<11} {amount:>5}x {val}\n"
                    )
                    infoLines += 1
                    for val, amount in atts[1:]:
                        fh.write(
                            f"""\t{'':<7}{'':<18} {'"':<18} {amount:>5}x {val}\n"""
                        )
                        infoLines += 1

                def writeTagInfo(tag, tagInfo):
                    nonlocal infoLines
                    tags = sorted(tagInfo.items())
                    (att, attInfo) = tags[0]
                    writeAttInfo(tag, att, attInfo)
                    infoLines += 1
                    for att, attInfo in tags[1:]:
                        writeAttInfo("", att, attInfo)

                for kind, label in kindLabels.items():
                    fh.write(f"\n{label}\n")
                    for tag, tagInfo in sorted(analysis[kind].items()):
                        writeTagInfo(tag, tagInfo)

            if verbose >= 0:
                console(f"{infoLines} info line(s) written to {reportFile}")

        def writeElemTypes():
            elemsCombined = {}

            modelSet = set()

            for schemaOverride, eDefs in elementDefs.items():
                model = modelInv[schemaOverride]
                modelSet.add(model)
                for tag, (typ, mixed) in eDefs.items():
                    elemsCombined.setdefault(tag, {}).setdefault(model, {})
                    elemsCombined[tag][model]["typ"] = typ
                    elemsCombined[tag][model]["mixed"] = mixed

            tagReport = {}

            for tag, tagInfo in elemsCombined.items():
                tagLines = []
                tagReport[tag] = tagLines

                if None in tagInfo:
                    teiInfo = tagInfo[None]
                    teiTyp = teiInfo["typ"]
                    teiMixed = teiInfo["mixed"]
                    teiTypRep = "??" if teiTyp is None else teiTyp
                    teiMixedRep = (
                        "??" if teiMixed is None else "mixed" if teiMixed else "pure"
                    )
                    mds = ["TEI"]

                    for model in sorted(x for x in tagInfo if x is not None):
                        info = tagInfo[model]
                        typ = info["typ"]
                        mixed = info["mixed"]
                        if typ == teiTyp and mixed == teiMixed:
                            mds.append(model)
                        else:
                            typRep = (
                                "" if typ == teiTyp else "??" if typ is None else typ
                            )
                            mixedRep = (
                                ""
                                if mixed == teiMixed
                                else "??"
                                if mixed is None
                                else "mixed"
                                if mixed
                                else "pure"
                            )
                            tagLines.append((tag, [model], typRep, mixedRep))
                    tagLines.insert(0, (tag, mds, teiTypRep, teiMixedRep))
                else:
                    for model in sorted(tagInfo):
                        info = tagInfo[model]
                        typ = info["typ"]
                        mixed = info["mixed"]
                        typRep = "??" if typ is None else typ
                        mixedRep = (
                            "??" if mixed is None else "mixed" if mixed else "pure"
                        )
                        tagLines.append((tag, [model], typRep, mixedRep))

            reportFile = f"{reportPath}/types.txt"
            with fileOpen(reportFile, mode="w") as fh:
                for tag in sorted(tagReport):
                    tagLines = tagReport[tag]
                    for tag, mds, typ, mixed in tagLines:
                        model = ",".join(mds)
                        fh.write(f"{tag:<18} {model:<18} {typ:<7} {mixed:<5}\n")

            if verbose >= 0:
                console(
                    f"{len(elemsCombined)} tag(s) type info written to {reportFile}"
                )

        def writeLbParents():
            reportFile = f"{reportPath}/lb-parents.txt"

            with fileOpen(reportFile, mode="w") as fh:
                for parent, n in sorted(lbParents.items()):
                    fh.write(f"{n:>5} x {parent}\n")

            if verbose >= 0:
                console(f"lb-parent info written to {reportFile}")

        def writeIdRefs():
            reportIdFile = f"{reportPath}/ids.txt"
            reportRefFile = f"{reportPath}/refs.txt"

            ih = fileOpen(reportIdFile, mode="w")
            rh = fileOpen(reportRefFile, mode="w")

            refdIds = collections.Counter()
            missingIds = set()

            totalRefs = 0
            totalRefsU = 0

            totalResolvable = 0
            totalResolvableU = 0
            totalDangling = 0
            totalDanglingU = 0

            seenItems = set()

            for file, items in refs.items():
                rh.write(f"{file}\n")

                resolvable = 0
                resolvableU = 0
                dangling = 0
                danglingU = 0

                for item, n in sorted(items.items()):
                    totalRefs += n

                    if item in seenItems:
                        newItem = False
                    else:
                        seenItems.add(item)
                        newItem = True
                        totalRefsU += 1

                    (target, idv) = item

                    if target not in ids or idv not in ids[target]:
                        status = "dangling"
                        dangling += n

                        if newItem:
                            missingIds.add((target, idv))
                            danglingU += 1
                    else:
                        status = "ok"
                        resolvable += n
                        refdIds[(target, idv)] += n

                        if newItem:
                            resolvableU += 1
                    rh.write(f"\t{status:<10} {n:>5} x {target} # {idv}\n")

                msgs = (
                    f"\tDangling:   {dangling:>4} x {danglingU:>4}",
                    f"\tResolvable: {resolvable:>4} x {resolvableU:>4}",
                )
                for msg in msgs:
                    rh.write(f"{msg}\n")

                totalResolvable += resolvable
                totalResolvableU += resolvableU
                totalDangling += dangling
                totalDanglingU += danglingU

            rh.close()

            if verbose >= 0:
                console(f"Refs written to {reportRefFile}")
                msgs = (
                    f"\tresolvable: {totalResolvableU:>4} in {totalResolvable:>4}",
                    f"\tdangling:   {totalDanglingU:>4} in {totalDangling:>4}",
                    f"\tALL:        {totalRefsU:>4} in {totalRefs:>4} ",
                )
                for msg in msgs:
                    console(msg)

            totalIds = 0
            totalIdsU = 0
            totalIdsM = 0
            totalIdsRefd = 0
            totalIdsRefdU = 0
            totalIdsUnused = 0

            for file, items in ids.items():
                totalIds += len(items)

                ih.write(f"{file}\n")

                unique = 0
                multiple = 0
                refd = 0
                refdU = 0
                unused = 0

                for item, n in sorted(items.items()):
                    nRefs = refdIds.get((file, item), 0)

                    if n == 1:
                        unique += 1
                    else:
                        multiple += 1

                    if nRefs == 0:
                        unused += 1
                    else:
                        refd += nRefs
                        refdU += 1

                    status1 = f"{n}x"
                    plural = "" if nRefs == 1 else "s"
                    status2 = f"{nRefs}ref{plural}"

                    ih.write(f"\t{status1:<8} {status2:<8} {item}\n")

                msgs = (
                    f"\tUnique:     {unique:>4}",
                    f"\tNon-unique: {multiple:>4}",
                    f"\tUnused:     {unused:>4}",
                    f"\tReferenced: {refd:>4} x {refdU:>4}",
                )
                for msg in msgs:
                    ih.write(f"{msg}\n")

                totalIdsU += unique
                totalIdsM += multiple
                totalIdsRefdU += refdU
                totalIdsRefd += refd
                totalIdsUnused += unused

            ih.close()

            if verbose >= 0:
                console(f"Ids written to {reportIdFile}")
                msgs = (
                    f"\treferenced: {totalIdsRefdU:>4} by {totalIdsRefd:>4}",
                    f"\tnon-unique: {totalIdsM:>4}",
                    f"\tunused:     {totalIdsUnused:>4}",
                    f"\tALL:        {totalIdsU:>4} in {totalIds:>4}",
                )
                for msg in msgs:
                    console(msg)

        def writeDoc():
            teiUrl = "https://tei-c.org/release/doc/tei-p5-doc/en/html"
            elUrlPrefix = f"{teiUrl}/ref-"
            attUrlPrefix = f"{teiUrl}/REF-ATTS.html#"
            docFile = f"{docsDir}/elements.md"
            with fileOpen(docFile, mode="w") as fh:
                fh.write(
                    dedent(
                        """
                        # Element and attribute inventory

                        Table of contents

                        """
                    )
                )
                for label in kindLabels.values():
                    labelAnchor = label.replace(" ", "-")
                    fh.write(f"*\t[{label}](#{labelAnchor})\n")

                fh.write("\n")

                tableHeader = dedent(
                    """
                    | element | attribute | value | amount
                    | --- | --- | --- | ---
                    """
                )

                def writeAttInfo(tag, att, attInfo):
                    tagRep = " " if tag == "" else f"[{tag}]({elUrlPrefix}{tag}.html)"
                    attRep = " " if att == "" else f"[{att}]({attUrlPrefix}{att})"
                    atts = sorted(attInfo.items())
                    (val, amount) = atts[0]
                    valRep = f"`{val}`" if val else ""
                    fh.write(
                        "| "
                        + (
                            " | ".join(
                                str(x)
                                for x in (
                                    tagRep,
                                    attRep,
                                    valRep,
                                    amount,
                                )
                            )
                        )
                        + "\n"
                    )
                    for val, amount in atts[1:]:
                        valRep = f"`{val}`" if val else ""
                        fh.write(f"""| | | {valRep} | {amount}\n""")

                def writeTagInfo(tag, tagInfo):
                    tags = sorted(tagInfo.items())
                    (att, attInfo) = tags[0]
                    writeAttInfo(tag, att, attInfo)
                    for att, attInfo in tags[1:]:
                        writeAttInfo("", att, attInfo)

                for kind, label in kindLabels.items():
                    fh.write(f"## {label}\n{tableHeader}")
                    for tag, tagInfo in sorted(analysis[kind].items()):
                        writeTagInfo(tag, tagInfo)
                    fh.write("\n")

        def filterError(msg):
            return msg == (
                "Element 'graphic', attribute 'url': [facet 'pattern'] "
                "The value '' is not accepted by the pattern '\\S+'."
            )

        def doXMLFile(xmlPath):
            tree = etree.parse(xmlPath, parser)
            root = tree.getroot()
            xmlFile = fileNm(xmlPath)
            ids[xmlFile][""] = 1
            analyse(root, analysis, xmlFile)

        xmlFilesByModel = collections.defaultdict(list)

        if sectionModel == "I":
            i = 0
            for xmlFolder, xmlFiles in self.getXML():
                msg = "Start " if verbose >= 0 else "\t"
                console(f"{msg}folder {xmlFolder}:")
                j = 0
                cr = ""
                nl = True

                for xmlFile in xmlFiles:
                    i += 1
                    j += 1
                    if j > PROGRESS_LIMIT:
                        cr = "\r"
                        nl = False
                    xmlPath = f"{teiPath}/{xmlFolder}/{xmlFile}"
                    (model, adapt, tpl) = self.getSwitches(xmlPath)
                    mdRep = model or "TEI"
                    tplRep = tpl or ""
                    adRep = adapt or ""

                    label = f"{mdRep:<12} {tplRep:<12} {adRep:<12}"

                    if verbose >= 0:
                        console(f"{cr}{i:>4} {label} {xmlFile:<50}", newline=nl)
                    xmlFilesByModel[model].append(xmlPath)
                if verbose >= 0:
                    console("")
                    console(f"End   folder {xmlFolder}")

        elif sectionModel == "II":
            xmlFile = self.getXML()
            if xmlFile is None:
                console("No XML files found!", error=True)
                return False

            xmlPath = f"{teiPath}/{xmlFile}"
            (model, adapt, tpl) = self.getSwitches(xmlPath)
            xmlFilesByModel[model].append(xmlPath)

        good = True

        for model, xmlPaths in xmlFilesByModel.items():
            if verbose >= 0:
                console(f"{len(xmlPaths)} {model or 'TEI'} file(s) ...")

            thisGood = True

            if validate:
                if verbose >= 0:
                    console("\tValidating ...")

                schemaFile = modelInfo.get(model, None)

                if schemaFile is None:
                    if verbose >= 0:
                        console(f"\t\tNo schema file for {model}")
                    if good is not None and good is not False:
                        good = None
                    continue

                (thisGood, info, theseErrors) = A.validate(schemaFile, xmlPaths)

                for line in info:
                    if verbose >= 0:
                        console(f"\t\t{line}")

            if not thisGood:
                good = False
                errors.extend(theseErrors)

            if verbose >= 0:
                console("\tMaking inventory ...")
            for xmlPath in xmlPaths:
                doXMLFile(xmlPath)

        if not good:
            self.good = False

        if verbose >= 0:
            console("")
        writeErrors()
        writeReport()
        writeElemTypes()
        writeDoc()
        writeNamespaces()
        writeIdRefs()
        writeLbParents()

    # SET UP CONVERSION

    def getConverter(self):
        """Initializes a converter.

        Returns
        -------
        object
            The `tf.convert.walker.CV` converter object, initialized.
        """
        verbose = self.verbose
        tfPath = self.tfPath

        silent = AUTO if verbose == 1 else TERSE if verbose == 0 else DEEP
        TF = Fabric(locations=tfPath, silent=silent)
        return CV(TF, silent=silent)

    # DIRECTOR

    def getDirector(self):
        """Factory for the director function.

        The `tf.convert.walker` relies on a corpus-dependent `director` function
        that walks through the source data and spits out actions that
        produce the TF dataset.

        The director function that walks through the TEI input must be conditioned
        by the properties defined in the TEI schema and the customised schema, if any,
        that describes the source.

        Also some special additions need to be programmed, such as an extra section
        level, word boundaries, etc.

        We collect all needed data, store it, and define a local director function
        that has access to this data.

        Returns
        -------
        function
            The local director function that has been constructed.
        """
        if not self.importOK():
            return

        if not self.good:
            return

        TEI_HEADER = "teiHeader"

        TEXT_ANCESTOR = "text"
        TEXT_ANCESTORS = set(
            """
            front
            body
            back
            group
            """.strip().split()
        )
        CHUNK_PARENTS = TEXT_ANCESTORS | {TEI_HEADER}

        CHUNK_ELEMS = set(
            """
            facsimile
            fsdDecl
            sourceDoc
            standOff
            """.strip().split()
        )

        PASS_THROUGH = set(
            """
            TEI
            """.strip().split()
        )

        # CHECKING

        HY = "\u2010"  # hyphen

        IN_WORD_HYPHENS = {HY, "-"}

        procins = self.procins
        verbose = self.verbose
        teiPath = self.teiPath
        wordAsSlot = self.wordAsSlot
        tokenAsSlot = self.tokenAsSlot
        parentEdges = self.parentEdges
        siblingEdges = self.siblingEdges
        featureMeta = self.featureMeta
        intFeatures = self.intFeatures
        transform = getattr(self, "transformCustom", None)
        chunkLevel = self.chunkLevel
        modelInv = self.modelInv
        modelInfo = self.modelInfo
        modelXsd = self.modelXsd
        A = self.A
        etree = self.etree

        transformFunc = (
            (lambda x: BytesIO(x.encode("utf-8")))
            if transform is None
            else lambda x: BytesIO(transform(x).encode("utf-8"))
        )

        parser = self.getParser()

        baseSchema = modelXsd[None]
        overrides = [
            override for (model, override) in modelXsd.items() if model is not None
        ]
        A.getElementInfo(baseSchema, overrides, verbose=-1)

        refs = collections.defaultdict(lambda: collections.defaultdict(set))
        ids = collections.defaultdict(dict)

        # WALKERS

        WHITE_TRIM_RE = re.compile(r"\s+", re.S)
        NON_NAME_RE = re.compile(r"[^a-zA-Z0-9_ ]+", re.S)

        NOTE_LIKE = set(
            """
            note
            """.strip().split()
        )
        EMPTY_ELEMENTS = set(
            """
            addSpan
            alt
            anchor
            anyElement
            attRef
            binary
            caesura
            catRef
            cb
            citeData
            classRef
            conversion
            damageSpan
            dataFacet
            default
            delSpan
            elementRef
            empty
            equiv
            fsdLink
            gb
            handShift
            iff
            lacunaEnd
            lacunaStart
            lb
            link
            localProp
            macroRef
            milestone
            move
            numeric
            param
            path
            pause
            pb
            ptr
            redo
            refState
            specDesc
            specGrpRef
            symbol
            textNode
            then
            undo
            unicodeProp
            unihanProp
            variantEncoding
            when
            witEnd
            witStart
            """.strip().split()
        )
        NEWLINE_ELEMENTS = set(
            """
            ab
            addrLine
            cb
            l
            lb
            lg
            list
            p
            pb
            seg
            table
            u
            """.strip().split()
        )
        CONTINUOUS_ELEMENTS = set(
            """
            choice
            """.strip().split()
        )

        def makeNameLike(x):
            return NON_NAME_RE.sub("_", x).strip("_")

        def walkNode(cv, cur, xnode):
            """Internal function to deal with a single element.

            Will be called recursively.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

                The subdictionary `cur["node"]` is used to store the currently generated
                nodes by node type.
            xnode: object
                An LXML element node.
            """
            if procins and isinstance(xnode, etree._ProcessingInstruction):
                target = xnode.target
                tag = f"?{target}"
            else:
                tag = etree.QName(xnode.tag).localname

            atts = {etree.QName(k).localname: v for (k, v) in xnode.attrib.items()}

            beforeTag(cv, cur, xnode, tag, atts)

            cur[XNEST].append((tag, atts))

            curNode = beforeChildren(cv, cur, xnode, tag, atts)

            if curNode is not None:
                if parentEdges:
                    if len(cur[TNEST]):
                        parentNode = cur[TNEST][-1]
                        cv.edge(curNode, parentNode, parent=None)

                cur[TNEST].append(curNode)

                if siblingEdges:
                    if len(cur[TSIB]):
                        siblings = cur[TSIB][-1]

                        nSiblings = len(siblings)
                        for i, sib in enumerate(siblings):
                            cv.edge(sib, curNode, sibling=nSiblings - i)
                        siblings.append(curNode)

                    cur[TSIB].append([])

            for child in xnode.iterchildren(
                tag=(etree.Element, etree.ProcessingInstruction)
                if procins
                else etree.Element
            ):
                walkNode(cv, cur, child)

            afterChildren(cv, cur, xnode, tag, atts)

            if curNode is not None:
                xmlFile = cur["xmlFile"]

                for refAtt, targetFile, targetId in getRefs(tag, atts, xmlFile):
                    refs[refAtt][(targetFile, targetId)].add(curNode)

                idVal = atts.get("id", None)
                if idVal is not None:
                    ids[xmlFile][idVal] = curNode

                if len(cur[TNEST]):
                    cur[TNEST].pop()
                if siblingEdges:
                    if len(cur[TSIB]):
                        cur[TSIB].pop()

            cur[XNEST].pop()
            afterTag(cv, cur, xnode, tag, atts)

        def isChapter(cur):
            """Whether the current element counts as a chapter node.

            ## Model I

            Not relevant: there are no chapter nodes inside an XML file.

            ## Model II

            Chapters are the highest section level (the only lower level is chunks).

            Chapters come in three kinds:

            *   the TEI header;
            *   the immediate children of `<text>`
                except `<front>`, `<body>`, `<back>`, `<group>`;
            *   the immediate children of
                `<front>`, `<body>`, `<back>`, `<group>`.

            Parameters
            ----------
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

            Returns
            -------
            boolean
            """
            sectionModel = self.sectionModel

            if sectionModel == "II":
                nest = cur[XNEST]
                nNest = len(nest)

                if nNest > 0 and nest[-1][0] in EMPTY_ELEMENTS:
                    return False

                outcome = nNest > 0 and (
                    nest[-1][0] == TEI_HEADER
                    or (
                        nNest > 1
                        and (
                            nest[-2][0] in TEXT_ANCESTORS
                            or nest[-2][0] == TEXT_ANCESTOR
                            and nest[-1][0] not in TEXT_ANCESTORS
                        )
                    )
                )
                if outcome:
                    cur["chapterElems"].add(nest[-1][0])

                return outcome

            return False

        def isChunk(cur):
            """Whether the current element counts as a chunk node.

            It depends on the section model, but also on the template.

            Note that we can only have distinct templates if we deal with
            multiple files, so only when we are in section model I.

            ## Model I

            Chunks are the lowest section level (the higher levels are folders
            and then files).

            The default is that chunks are the immediate children of the
            `<teiHeader>`, `<front>`, `<body>`, `<back>` and `<group>`
            elements; a few other elements also count as chunks.

            However, if `drillDownDivs` is True and if the chunk appears to be
            a `<div>` element, we drill further down, until we arrive at a
            non-`<div>` element.

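            But in specific templates we have different rules:

            ### `biolist`

            *   The TEI Header is a chunk, and nothing inside the TEI header is a chunk;
            *   Everything at level 5, except `<listPerson>`, is a chunk;
            *   The children of `<listPerson>` are chunks,
                provided they are at level 6.

            ### `bibliolist`

            *   The TEI Header is a chunk, and nothing inside the TEI header is a chunk;
            *   Everything at level 5, except `<listBibl>`, is a chunk;
            *   The children of `<listBibl>` are chunks (the `<bibl>` elements
                and a few others), provided they are at level 6.

            ### `artworklist`

            *   The TEI Header is a chunk, and nothing inside the TEI header is a chunk;
            *   Everything at level 5 is a chunk.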

            ## Model II

            Chunks are the lowest section level (the only higher level is chapters).

            Chunks are the immediate children of the chapters, and they come in two
            kinds: the ones that are `<p>` elements, and the rest.

            Deviation from this rule:

            *   If a chapter is a mixed content node, then it is also a chunk,
                and its subelements are not chunks.

            Parameters
            ----------
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

            Returns
            -------
            boolean
            """
            sectionModel = self.sectionModel

            nest = cur[XNEST]
            nNest = len(nest)
            model = cur["model"]

            if nNest == 0:
                return False

            thisTag = nest[-1][0]

            if sectionModel == "II":
                if nNest == 1:
                    outcome = False
                else:
                    parentTag = nest[-2][0]
                    meChptChnk = (
                        isChapter(cur) and thisTag not in cur["pureElems"][model]
                    )

                    if meChptChnk:
                        outcome = True
                    elif parentTag == TEI_HEADER:
                        outcome = True
                    elif nNest <= 2:
                        outcome = False
                    elif parentTag not in cur["pureElems"][model]:
                        outcome = False
                    else:
                        grandParentTag = nest[-3][0]
                        outcome = (
                            grandParentTag in TEXT_ANCESTORS
                            and thisTag not in EMPTY_ELEMENTS
                        ) or (
                            grandParentTag == TEXT_ANCESTOR
                            and parentTag not in TEXT_ANCESTORS
                        )

            elif sectionModel == "I":
                template = cur["template"]

                if template == "biolist":
                    if thisTag == TEI_HEADER:
                        outcome = True
                    elif any(n[0] == TEI_HEADER for n in nest[0:-1]):
                        outcome = False
                    elif nNest not in {5, 6}:
                        outcome = False
                    else:
                        parentTag = nest[-2][0]
                        if nNest == 5:
                            outcome = thisTag != "listPerson"
                        else:
                            outcome = parentTag == "listPerson"

                elif template == "bibliolist":
                    if thisTag == TEI_HEADER:
                        outcome = True
                    elif any(n[0] == TEI_HEADER for n in nest[0:-1]):
                        outcome = False
                    elif nNest not in {5, 6}:
                        outcome = False
                    else:
                        parentTag = nest[-2][0]
                        if nNest == 5:
                            outcome = thisTag != "listBibl"
                        else:
                            outcome = parentTag == "listBibl"

                elif template == "artworklist":
                    if thisTag == TEI_HEADER:
                        outcome = True
                    elif any(n[0] == TEI_HEADER for n in nest[0:-1]):
                        outcome = False
                    else:
                        outcome = nNest == 5

                else:
                    if thisTag in CHUNK_ELEMS:
                        outcome = True
                    elif nNest == 1:
                        outcome = False
                    else:
                        sectionProperties = self.sectionProperties
                        drillDownDivs = sectionProperties["drillDownDivs"]

                        parentTag = nest[-2][0]
                        if drillDownDivs:
                            if thisTag == "div":
                                outcome = False
                            else:
                                dParentTag = None
                                for ancestor in reversed(nest[0:-1]):
                                    if ancestor[0] != "div":
                                        dParentTag = ancestor[0]
                                        break
                                outcome = (
                                    dParentTag in CHUNK_PARENTS
                                    and thisTag not in EMPTY_ELEMENTS
                                ) or (
                                    dParentTag == TEXT_ANCESTOR
                                    and thisTag not in TEXT_ANCESTORS
                                )
                        else:
                            outcome = (
                                parentTag in CHUNK_PARENTS
                                and thisTag not in EMPTY_ELEMENTS
                            ) or (
                                parentTag == TEXT_ANCESTOR
                                and thisTag not in TEXT_ANCESTORS
                            )

            if outcome:
                cur["chunkElems"].add(nest[-1][0])

            return outcome

        def isPure(cur):
            """Whether the current tag has pure content.

            Parameters
            ----------
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

            Returns
            -------
            boolean
            """
            nest = cur[XNEST]
            model = cur["model"]
            return len(nest) == 0 or nest[-1][0] in cur["pureElems"][model]

        def isEndInPure(cur):
            """Whether the current end tag occurs in an element with pure content.

            If that is the case, then it is very likely that the end tag also
            marks the end of the current word.

            And we should not strip spaces after it.

            Parameters
            ----------
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

            Returns
            -------
            boolean
            """
            nest = cur[XNEST]
            model = cur["model"]
            return len(nest) > 1 and nest[-2][0] in cur["pureElems"][model]

        def hasMixedAncestor(cur):
            """Whether the current tag has an ancestor with mixed content.

            We use this in case a tag ends in an element with pure content.
            We should then add white-space to separate it from the next
            element of its parent.

            If the whole stack of elements has pure content, we add
            a newline, because then we are probably in the TEI header,
            and things are most clear if they are on separate lines.

            But if one of the ancestors has mixed content, we are typically
            in some structured piece of information within running text,
            such as change markup. In this case we want to add merely a space.

            And we should not strip spaces after it.

            Parameters
            ----------
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

            Returns
            -------
            boolean
            """
            nest = cur[XNEST]
            model = cur["model"]
            return any(n[0] in cur["mixedElems"][model] for n in nest[0:-1])

        def hasContinuousAncestor(cur):
            """Whether an ancestor tag is a continuous pure element.

            A continuous pure element is an element whose child elements do not
            imply word separation, e.g. `<choice>`.

            Parameters
            ----------
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

            Returns
            -------
            boolean
            """
            nest = cur[XNEST]
            return any(n[0] in CONTINUOUS_ELEMENTS for n in nest[0:-1])

        def startWord(cv, cur, ch):
            """Start a word node if necessary.

            Whenever we encounter a character, we determine
            whether it starts or ends a word, and if it starts
            one, this function takes care of the necessary actions.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            ch: string
                A single character, the next character in the result data.
            """
            curWord = cur[NODE][WORD]

            if not curWord:
                prevWord = cur["prevWord"]
                if prevWord is not None:
                    cv.feature(prevWord, after=cur["afterStr"])
                if ch is not None:
                    if wordAsSlot:
                        curWord = cv.slot()
                    else:
                        curWord = cv.node(WORD)
                    cur[NODE][WORD] = curWord
                    addSlotFeatures(cv, cur, curWord)

            if ch is not None:
                cur["wordStr"] += ch

        def finishWord(cv, cur, ch, spaceChar):
            """Terminate a word node if necessary.

            Whenever we encounter a character, we determine
            whether it starts or ends a word, and if it ends
            one, this function takes care of the necessary actions.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            ch: string
                A single character, the next slot in the result data.
            spaceChar: string | void
                If None, no extra space or newline will be added.
                Otherwise, the `spaceChar` (a single space or newline) will be added.
            """
            curWord = cur[NODE][WORD]
            if curWord:
                cv.feature(curWord, str=cur["wordStr"])
                if not wordAsSlot:
                    cv.terminate(curWord)
                cur[NODE][WORD] = None
                cur["wordStr"] = ""
                cur["prevWord"] = curWord
                cur["afterStr"] = ""

            if ch is not None:
                cur["afterStr"] += ch
            if spaceChar is not None:
                cur["afterStr"] = cur["afterStr"].rstrip() + spaceChar
                if not wordAsSlot:
                    addSpace(cv, cur, spaceChar)
                cur["afterSpace"] = True
            else:
                cur["afterSpace"] = False

        def addSlotFeatures(cv, cur, s):
            """Add generic features to a slot.

            Marks the slot as metadata when we are inside the TEI header,
            as note material when we are inside a note-like element, and
            sets a `rend_` feature for every `rend` value that is
            currently in force.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            s: slot
                A previously added (slot) node
            """
            if cur["inHeader"]:
                cv.feature(s, is_meta=1)
            if cur["inNote"]:
                cv.feature(s, is_note=1)
            for r, stack in cur.get("rend", {}).items():
                if len(stack) > 0:
                    cv.feature(s, **{f"rend_{r}": 1})

        def addTokens(cv, cur, text):
            """Adds text as a series of tokens.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            text: string
                The text to be added.

            Only meant for the case where slots are tokens.
            """
            (beforew, material, afterw) = getWhites(text)

            if beforew:
                makeSpace(cv, cur)

            s = None

            for tx, after in tokenize(material):
                s = cv.slot()
                cv.feature(s, str=tx, after=after)
                addSlotFeatures(cv, cur, s)

            if afterw:
                if s is None:
                    makeSpace(cv, cur)
                else:
                    cv.feature(s, after=" ")

        def addSlot(cv, cur, ch):
            """Add a slot.

            Whenever we encounter a character, we add it as a new slot, unless
            `wordAsSlot` is in force. In that case we suppress the triggering of a
            slot node.
            If needed, we start / terminate word nodes as well.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            ch: string
                A single character, the next slot in the result data.
            """
            if ch in {"_", None} or ch.isalnum() or ch in IN_WORD_HYPHENS:
                startWord(cv, cur, ch)
            else:
                finishWord(cv, cur, ch, None)

            if wordAsSlot:
                s = cur[NODE][WORD]
            elif ch is None:
                s = None
            else:
                s = cv.slot()
                cv.feature(s, ch=ch)
            if s is not None:
                addSlotFeatures(cv, cur, s)

        def addEmpty(cv, cur):
            """Add an empty slot.

            We also terminate the current word.
            If words are slots, the empty slot is a word on its own.

            Returns
            -------
            node
                The empty slot
            """
            if tokenAsSlot:
                emptyNode = cv.slot()
                cv.feature(emptyNode, str=ZWSP, after="", empty=1)
            else:
                finishWord(cv, cur, None, None)
                startWord(cv, cur, ZWSP)
                emptyNode = cur[NODE][WORD]
                cv.feature(emptyNode, empty=1)

                if not wordAsSlot:
                    emptyNode = cv.slot()
                    cv.feature(emptyNode, ch=ZWSP, empty=1)

                finishWord(cv, cur, None, None)

            return emptyNode

        def addSpace(cv, cur, spaceChar):
            """Adds a space or a new line.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            spaceChar: string
                The character to add (supposed to be either a space or a newline).

            Only meant for the case where slots are characters or tokens.

            Suppressed when not in a lowest-level section.
            """
            if chunkLevel in cv.activeTypes():
                s = cv.slot()
                if tokenAsSlot:
                    cv.feature(s, str="", after=spaceChar, extraspace=1)
                else:
                    cv.feature(s, ch=spaceChar, extraspace=1)
                addSlotFeatures(cv, cur, s)

        def makeSpace(cv, cur):
            """Adds a space.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.

            Only meant for the case where slots are tokens.
            """
            s = cv.slot()
            cv.feature(s, str="", after=" ", extraspace=1)
            addSlotFeatures(cv, cur, s)

        def endLine(cv, cur):
            """Ends a line node.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            """
            lineProperties = self.lineProperties
            lineType = lineProperties["nodeType"]

            slots = cv.linked(cur[NODE][lineType])
            empty = len(slots) == 0

            if empty:
                lastSlot = addEmpty(cv, cur)
                if cur["inNote"]:
                    cv.feature(lastSlot, is_note=1)
            else:
                lastSlot = (T, slots[-1])

            if not wordAsSlot:
                after = cv.get("after", lastSlot)
                if after is not None and "\n" not in after:
                    cv.feature(lastSlot, after=f"{after.rstrip()}\n")
            cv.terminate(cur[NODE][lineType])

        def endPage(cv, cur):
            """Ends a page node.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            """
            pageProperties = self.pageProperties
            pageType = pageProperties["nodeType"]

            slots = cv.linked(cur[NODE][pageType])
            empty = len(slots) == 0

            if empty:
                lastSlot = addEmpty(cv, cur)
                if cur["inNote"]:
                    cv.feature(lastSlot, is_note=1)
            cv.terminate(cur[NODE][pageType])

        def beforeTag(cv, cur, xnode, tag, atts):
            """Actions before dealing with the element's tag.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            xnode: object
                An LXML element node.
            tag: string
                The tag of the LXML node.
            atts: dict
                The attributes of the LXML node, with namespaces stripped.
            """
            beforeTagCustom = getattr(self, "beforeTagCustom", None)
            if beforeTagCustom is not None:
                beforeTagCustom(cv, cur, xnode, tag, atts)

        def beforeChildren(cv, cur, xnode, tag, atts):
            """Actions before dealing with the element's children.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            xnode: object
                An LXML element node.
            tag: string
                The tag of the LXML node.
            atts: dict
                The attributes of the LXML node, with namespaces stripped.
            """
            makeLineElems = self.makeLineElems

            if makeLineElems:
                lineProperties = self.lineProperties
                lineElem = lineProperties["element"]
                lineType = lineProperties["nodeType"]
                isLineContainer = tag == lineElem
                inLine = cur["inLine"]

                if isLineContainer:
                    cur["inLine"] = True

                    # the line starts with the container
                    cur[NODE][lineType] = cv.node(lineType)

            makePageElems = self.makePageElems

            if makePageElems:
                pageProperties = self.pageProperties
                pageType = pageProperties["nodeType"]
                isPageContainer = matchModel(pageProperties, tag, atts)
                inPage = cur["inPage"]

                pbAtTop = pageProperties["pbAtTop"]

                if isPageContainer:
                    cur["inPage"] = True

                    if pbAtTop:
                        # material before the first pb in the container is not in a page
                        pass
                    else:
                        # the page starts with the container
                        cur[NODE][pageType] = cv.node(pageType)

            sectionModel = self.sectionModel
            sectionProperties = self.sectionProperties

            if sectionModel == "II":
                chapterSection = self.chapterSection
                chunkSection = self.chunkSection

                if isChapter(cur):
                    cur["chapterNum"] += 1
                    cur["prevChapter"] = cur[NODE].get(chapterSection, None)
                    cur[NODE][chapterSection] = cv.node(chapterSection)
                    cv.link(cur[NODE][chapterSection], cur["danglingSlots"])

                    value = {chapterSection: f"{cur['chapterNum']} {tag}"}
                    cv.feature(cur[NODE][chapterSection], **value)
                    cur["chunkPNum"] = 0
                    cur["chunkONum"] = 0
                    cur["prevChunk"] = cur[NODE].get(chunkSection, None)
                    cur[NODE][chunkSection] = cv.node(chunkSection)
                    cv.link(cur[NODE][chunkSection], cur["danglingSlots"])
                    cur["danglingSlots"] = set()
                    cur["infirstChunk"] = True

                # N.B. A node can count both as chapter and as chunk,
                # e.g. a <trailer> sibling of the chapter <div>s
                # A trailer has mixed content, so its subelements aren't typical chunks.
                if isChunk(cur):
                    if cur["infirstChunk"]:
                        cur["infirstChunk"] = False
                    else:
                        cur[NODE][chunkSection] = cv.node(chunkSection)
                        cv.link(cur[NODE][chunkSection], cur["danglingSlots"])
                        cur["danglingSlots"] = set()
                    if tag == "p":
                        cur["chunkPNum"] += 1
                        cn = cur["chunkPNum"]
                    else:
                        cur["chunkONum"] -= 1
                        cn = cur["chunkONum"]
                    value = {chunkSection: cn}
                    cv.feature(cur[NODE][chunkSection], **value)

                if matchModel(sectionProperties, tag, atts):
                    heading = etree.tostring(
                        xnode, encoding="unicode", method="text", with_tail=False
                    ).replace("\n", " ")
                    value = {chapterSection: heading}
                    cv.feature(cur[NODE][chapterSection], **value)
                    chapterNum = cur["chapterNum"]
                    if verbose >= 0:
                        console(
                            f"\rchapter {chapterNum:>4} {heading:<50}", newline=False
                        )
            else:
                chunkSection = self.chunkSection

                if isChunk(cur):
                    cur["chunkNum"] += 1
                    cur["prevChunk"] = cur[NODE].get(chunkSection, None)
                    cur[NODE][chunkSection] = cv.node(chunkSection)
                    cv.link(cur[NODE][chunkSection], cur["danglingSlots"])
                    cur["danglingSlots"] = set()
                    value = {chunkSection: cur["chunkNum"]}
                    cv.feature(cur[NODE][chunkSection], **value)

            if tag == TEI_HEADER:
                cur["inHeader"] = True
                if sectionModel == "II":
                    value = {chapterSection: "TEI header"}
                    cv.feature(cur[NODE][chapterSection], **value)
            if tag in NOTE_LIKE:
                cur["inNote"] = True
                if not tokenAsSlot:
                    finishWord(cv, cur, None, None)

            curNode = None

            if makeLineElems:
                if inLine and tag == "lb":
                    if cur[NODE][lineType] is not None:
                        if cur["lineAtts"] is not None and len(cur["lineAtts"]):
                            cv.feature(cur[NODE][lineType], **cur["lineAtts"])
                        endLine(cv, cur)
                    cur[NODE][lineType] = cv.node(lineType)
                    cur["lineAtts"] = atts

            if makePageElems:
                if inPage and tag == "pb":
                    if pbAtTop:
                        if cur[NODE][pageType] is not None:
                            endPage(cv, cur)
                        cur[NODE][pageType] = cv.node(pageType)
                        if len(atts):
                            cv.feature(cur[NODE][pageType], **atts)
                    else:
                        if cur[NODE][pageType] is not None:
                            if cur["pageAtts"] is not None and len(cur["pageAtts"]):
                                cv.feature(cur[NODE][pageType], **cur["pageAtts"])
                            endPage(cv, cur)
                        cur[NODE][pageType] = cv.node(pageType)
                        cur["pageAtts"] = atts

            isBoundaryElem = (
                makeLineElems and tag == "lb" or makePageElems and tag == "pb"
            )

            if tag not in PASS_THROUGH and not isBoundaryElem:
                cur["afterSpace"] = False
                cur[NODE][tag] = cv.node(tag)
                curNode = cur[NODE][tag]
                if wordAsSlot:
                    if cur[NODE][WORD]:
                        cv.link(curNode, [cur[NODE][WORD][1]])
                if len(atts):
                    cv.feature(curNode, **atts)
                    if "rend" in atts:
                        rValue = atts["rend"]
                        r = makeNameLike(rValue)
                        if r:
                            for q in r.split():
                                cur.setdefault("rend", {}).setdefault(q, []).append(
                                    True
                                )

            beforeChildrenCustom = getattr(self, "beforeChildrenCustom", None)
            if beforeChildrenCustom is not None:
                beforeChildrenCustom(cv, cur, xnode, tag, atts)

            if not hasattr(xnode, "target") and xnode.text:
                textMaterial = WHITE_TRIM_RE.sub(" ", xnode.text)
                if isPure(cur):
                    if textMaterial and textMaterial != " ":
                        console(
                            (
                                "WARNING: Text material at the start of "
                                f"pure-content element <{tag}>"
                            ),
                            error=True,
                        )
                        stack = "-".join(n[0] for n in cur[XNEST])
                        console(f"\tElement stack: {stack}", error=True)
                        console(f"\tMaterial: `{textMaterial}`", error=True)
                else:
                    if tokenAsSlot:
                        addTokens(cv, cur, textMaterial)
                    else:
                        for ch in textMaterial:
                            addSlot(cv, cur, ch)

            return curNode

        def afterChildren(cv, cur, xnode, tag, atts):
            """Node actions after dealing with the children, but before the end tag.

            Here we make sure that the last slot of each newline element
            has a newline at the end of its `after` feature.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            xnode: object
                An LXML element node.
            tag: string
                The tag of the LXML node.
            atts: dict
                The attributes of the LXML node, with namespaces stripped.
            """
            chunkSection = self.chunkSection
            makeLineElems = self.makeLineElems

            if makeLineElems:
                lineProperties = self.lineProperties
                lineType = lineProperties["nodeType"]
                lineElem = lineProperties["element"]

            makePageElems = self.makePageElems

            if makePageElems:
                pageProperties = self.pageProperties
                pageType = pageProperties["nodeType"]

            sectionModel = self.sectionModel

            if sectionModel == "II":
                chapterSection = self.chapterSection

            extraInstructions = self.extraInstructions

            if len(extraInstructions):
                lookupSource(cv, cur, tokenAsSlot, extraInstructions)

            isChap = isChapter(cur)
            isChnk = isChunk(cur)

            afterChildrenCustom = getattr(self, "afterChildrenCustom", None)
            if afterChildrenCustom is not None:
                afterChildrenCustom(cv, cur, xnode, tag, atts)

            if makeLineElems:
                isLineContainer = tag == lineElem
                inLine = cur["inLine"]

            if makePageElems:
                isPageContainer = matchModel(pageProperties, tag, atts)
                inPage = cur["inPage"]

            hasFinishedWord = False

            isBoundaryElem = (
                makeLineElems and tag == "lb" or makePageElems and tag == "pb"
            )

            if makeLineElems and isLineContainer:
                # the line ends with the container
                if cur[NODE][lineType] is not None:
                    endLine(cv, cur)
                cur["inLine"] = False

            if makePageElems and isPageContainer:
                pbAtTop = pageProperties["pbAtTop"]
                if pbAtTop:
                    # the page ends with the container
                    if cur[NODE][pageType] is not None:
                        endPage(cv, cur)
                else:
                    # material after the last pb is not in a page
                    if cur[NODE][pageType] is not None:
                        cv.delete(cur[NODE][pageType])
                cur["inPage"] = False

            if tag not in PASS_THROUGH and not isBoundaryElem:
                curNode = cur[TNEST][-1]
                slots = cv.linked(curNode)
                empty = len(slots) == 0

                newLineTag = tag in NEWLINE_ELEMENTS

                if (
                    newLineTag
                    or isEndInPure(cur)
                    and not hasContinuousAncestor(cur)
                    and not cur["afterSpace"]
                ) and not empty:
                    spaceChar = "\n" if newLineTag or not hasMixedAncestor(cur) else " "
                    if tokenAsSlot:
                        cv.feature((T, slots[-1]), after=spaceChar)
                    else:
                        finishWord(cv, cur, None, spaceChar)
                        hasFinishedWord = True

                slots = cv.linked(curNode)
                empty = len(slots) == 0

                if empty:
                    lastSlot = addEmpty(cv, cur)
                    if cur["inHeader"]:
                        cv.feature(lastSlot, is_meta=1)
                    if cur["inNote"]:
                        cv.feature(lastSlot, is_note=1)
                    # take care that this empty slot falls under all sections
                    # for folders and files this is already guaranteed
                    # We need only to watch out for chapters and chunks
                    if cur[NODE].get(chunkSection, None) is None:
                        prevChunk = cur.get("prevChunk", None)
                        if prevChunk is None:
                            cur["danglingSlots"].add(lastSlot[1])
                        else:
                            cv.link(prevChunk, lastSlot)
                    if sectionModel == "II":
                        if cur[NODE].get(chapterSection, None) is None:
                            prevChapter = cur.get("prevChapter", None)
                            if prevChapter is None:
                                cur["danglingSlots"].add(lastSlot[1])
                            else:
                                cv.link(prevChapter, lastSlot)

                cv.terminate(curNode)

            if isChnk:
                if tokenAsSlot:
                    addSpace(cv, cur, "\n")
                else:
                    if not hasFinishedWord:
                        finishWord(cv, cur, None, "\n")
                cv.terminate(cur[NODE][chunkSection])

            if sectionModel == "II":
                if isChap:
                    if tokenAsSlot:
                        addSpace(cv, cur, "\n")
                    else:
                        if not hasFinishedWord:
                            finishWord(cv, cur, None, "\n")
                    cv.terminate(cur[NODE][chapterSection])

        def afterTag(cv, cur, xnode, tag, atts):
            """Node actions after dealing with the children and after the end tag.

            This is the place where we process the `tail` of an LXML node: the
            text material after the element and before the next open/close
            tag of any element.

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            cur: dict
                Various pieces of data collected during walking
                and relevant for some next steps in the walk.
            xnode: object
                An LXML element node.
            tag: string
                The tag of the LXML node.
            atts: dict
                The attributes of the LXML node, with namespaces stripped.
            """
            if tag == TEI_HEADER:
                cur["inHeader"] = False
            elif tag in NOTE_LIKE:
                cur["inNote"] = False

            if tag not in PASS_THROUGH:
                if "rend" in atts:
                    rValue = atts["rend"]
                    r = makeNameLike(rValue)
                    if r:
                        for q in r.split():
                            cur["rend"][q].pop()

            if xnode.tail:
                if tag == "lb" and self.makeLineElems:
                    # text right after a line break starts a new line:
                    # drop its leading white-space
                    tail = xnode.tail.lstrip()
                else:
                    tail = xnode.tail

                tailMaterial = WHITE_TRIM_RE.sub(" ", tail)
                if isPure(cur):
                    if tailMaterial and tailMaterial != " ":
                        elem = cur[XNEST][-1][0]
                        console(
                            (
                                "WARNING: Text material after "
                                f"<{tag}> in pure-content element <{elem}>"
                            ),
                            error=True,
                        )
                        stack = "-".join(n[0] for n in cur[XNEST])
                        console(f"\tElement stack: {stack}-{tag}", error=True)
                        console(f"\tMaterial: `{tailMaterial}`", error=True)
                else:
                    if tokenAsSlot:
                        addTokens(cv, cur, tailMaterial)
                    else:
                        for ch in tailMaterial:
                            addSlot(cv, cur, ch)

            afterTagCustom = getattr(self, "afterTagCustom", None)
            if afterTagCustom is not None:
                afterTagCustom(cv, cur, xnode, tag, atts)

        def director(cv):
            """Director function.

            Here we program a walk through the TEI sources.
            At every step of the walk we fire some actions that build TF nodes
            and assign features for them.

            Because everything is rather dynamic, we generate fairly standard
            metadata for the features, namely a link to the
            [TEI website](https://tei-c.org).

            Parameters
            ----------
            cv: object
                The converter object, needed to issue actions.
            """
            makeLineElems = self.makeLineElems

            if makeLineElems:
                lineProperties = self.lineProperties
                lineType = lineProperties["nodeType"]

            makePageElems = self.makePageElems

            if makePageElems:
                pageProperties = self.pageProperties
                pageType = pageProperties["nodeType"]

            sectionModel = self.sectionModel
            A = self.A
            elementDefs = A.elementDefs

            cur = {}
            cur["pureElems"] = {
                modelInv[schemaOverride]: {
                    x for (x, (typ, mixed)) in eDefs.items() if not mixed
                }
                for (schemaOverride, eDefs) in elementDefs.items()
            }
            cur["mixedElems"] = {
                modelInv[schemaOverride]: {
                    x for (x, (typ, mixed)) in eDefs.items() if mixed
                }
                for (schemaOverride, eDefs) in elementDefs.items()
            }
            cur[NODE] = {}

            if sectionModel == "I":
                folderSection = self.folderSection
                fileSection = self.fileSection

                i = 0
                for xmlFolder, xmlFiles in self.getXML():
                    msg = "Start " if verbose >= 0 else "\t"
                    console(f"{msg}folder {xmlFolder}:")

                    cur[NODE][folderSection] = cv.node(folderSection)
                    value = {folderSection: xmlFolder}
                    cv.feature(cur[NODE][folderSection], **value)

                    j = 0
                    cr = ""
                    nl = True

                    for xmlFile in xmlFiles:
                        i += 1
                        j += 1
                        if j > PROGRESS_LIMIT:
                            cr = "\r"
                            nl = False

                        cur["xmlFile"] = xmlFile
                        xmlPath = f"{teiPath}/{xmlFolder}/{xmlFile}"
                        (model, adapt, tpl) = self.getSwitches(xmlPath)
                        cur["model"] = model
                        cur["template"] = tpl
                        cur["adaptation"] = adapt
                        modelRep = model or "TEI"
                        tplRep = tpl or ""
                        adRep = adapt or ""
                        label = f"{modelRep:<12} {adRep:<12} {tplRep:<12}"
                        if verbose >= 0:
                            console(
                                f"{cr}{i:>4} {label} {xmlFile:<50}",
                                newline=nl,
                            )

                        cur[NODE][fileSection] = cv.node(fileSection)
                        ids[xmlFile][""] = cur[NODE][fileSection]
                        value = {fileSection: xmlFile.removesuffix(".xml")}
                        cv.feature(cur[NODE][fileSection], **value)
                        if tpl:
                            cur[NODE][tpl] = cv.node(tpl)
                            cv.feature(cur[NODE][tpl], **value)

                        with fileOpen(xmlPath) as fh:
                            text = fh.read()
                            if transformFunc is not None:
                                text = transformFunc(text)
                            tree = etree.parse(text, parser)
                            root = tree.getroot()

                            if makeLineElems:
                                cur[NODE][lineType] = None
                                cur["inLine"] = False
                                cur["lineAtts"] = None

                            if makePageElems:
                                cur[NODE][pageType] = None
                                cur["inPage"] = False
                                cur["pageAtts"] = None

                            if not tokenAsSlot:
                                cur[NODE][WORD] = None
                            cur["inHeader"] = False
                            cur["inNote"] = False
                            cur[XNEST] = []
                            cur[TNEST] = []
                            cur[TSIB] = []
                            cur["chunkNum"] = 0
                            cur["prevChunk"] = None
                            cur["danglingSlots"] = set()
                            cur["prevWord"] = None
                            cur["wordStr"] = ""
                            cur["afterStr"] = ""
                            cur["afterSpace"] = True
                            cur["chunkElems"] = set()
                            walkNode(cv, cur, root)

                        if not tokenAsSlot:
                            addSlot(cv, cur, None)
                        if tpl:
                            cv.terminate(cur[NODE][tpl])
                        cv.terminate(cur[NODE][fileSection])

                    if verbose >= 0:
                        console("")
                        console(f"End   folder {xmlFolder}")

                    cv.terminate(cur[NODE][folderSection])

            elif sectionModel == "II":
                xmlFile = self.getXML()
                if xmlFile is None:
                    console("No XML files found!", error=True)
                    return False

                xmlPath = f"{teiPath}/{xmlFile}"
                (cur["model"], cur["adaptation"], cur["template"]) = self.getSwitches(
                    xmlPath
                )

                with fileOpen(f"{teiPath}/{xmlFile}") as fh:
                    cur["xmlFile"] = xmlFile
                    text = fh.read()
                    if transformFunc is not None:
                        text = transformFunc(text)
                    tree = etree.parse(text, parser)
                    root = tree.getroot()

                    if makeLineElems:
                        cur[NODE][lineType] = None
                        cur["inLine"] = False
                        cur["lineAtts"] = None

                    if makePageElems:
                        cur[NODE][pageType] = None
                        cur["inPage"] = False
                        cur["pageAtts"] = None

                    if not tokenAsSlot:
                        cur[NODE][WORD] = None
                    cur["inHeader"] = False
                    cur["inNote"] = False
                    cur[XNEST] = []
                    cur[TNEST] = []
                    cur[TSIB] = []
                    cur["chapterNum"] = 0
                    cur["chunkPNum"] = 0
                    cur["chunkONum"] = 0
                    cur["prevChunk"] = None
                    cur["prevChapter"] = None
                    cur["danglingSlots"] = set()
                    cur["prevWord"] = None
                    cur["wordStr"] = ""
                    cur["afterStr"] = ""
                    cur["afterSpace"] = True
                    cur["chunkElems"] = set()
                    cur["chapterElems"] = set()
                    for child in root.iterchildren(tag=etree.Element):
                        walkNode(cv, cur, child)

                if not tokenAsSlot:
                    addSlot(cv, cur, None)

            if verbose >= 0:
                console("")

            if verbose >= 0:
                console("Resolving links into edges ...")

            unresolvedRefs = {}
            unresolved = 0
            unresolvedUnique = 0
            resolved = 0
            resolvedUnique = 0

            for att, attRefs in refs.items():
                feature = f"link_{att}"
                edgeFeat = {feature: None}

                for (targetFile, targetId), sourceNodes in attRefs.items():
                    nSourceNodes = len(sourceNodes)
                    targetNode = ids[targetFile].get(targetId, None)
                    if targetNode is None:
                        unresolvedRefs.setdefault(targetFile, set()).add(targetId)
                        unresolvedUnique += 1
                        unresolved += nSourceNodes
                    else:
                        for sourceNode in sourceNodes:
                            cv.edge(sourceNode, targetNode, **edgeFeat)
                        resolvedUnique += 1
                        resolved += nSourceNodes

            if verbose >= 0:
                console(f"\t{resolvedUnique} in {resolved} reference(s) resolved")
                if unresolvedRefs:
                    console(
                        f"\t{unresolvedUnique} in {unresolved} reference(s): "
                        "could not be resolved"
                    )
                    if verbose == 1:
                        for targetFile, targetIds in sorted(unresolvedRefs.items()):
                            examples = " ".join(sorted(targetIds)[0:3])
                            console(f"\t\t{targetFile}: {len(targetIds)} x: {examples}")

            for fName in featureMeta:
                if not cv.occurs(fName):
                    cv.meta(fName)
            for fName in cv.features():
                if fName not in featureMeta:
                    if fName.startswith("rend_"):
                        r = fName[5:]
                        cv.meta(
                            fName,
                            description=f"whether text is to be rendered as {r}",
                            valueType="int",
                            conversionMethod=CM_LITC,
                            conversionCode=CONVERSION_METHODS[CM_LITC],
                        )
                        intFeatures.add(fName)
                    elif fName.startswith("link_"):
                        r = fName[5:]
                        cv.meta(
                            fName,
                            description=(
                                f"links to node identified by xml:id in attribute {r}"
                            ),
                            valueType="str",
                            conversionMethod=CM_LITP,
                            conversionCode=CONVERSION_METHODS[CM_LITP],
                        )
                    else:
                        cv.meta(
                            fName,
                            description=f"this is TEI attribute {fName}",
                            valueType="str",
                            conversionMethod=CM_LIT,
                            conversionCode=CONVERSION_METHODS[CM_LIT],
                        )

            levelConstraints = ["note < chunk, p", "salute < opener, closer"]
            if "chapterElems" in cur:
                for elem in cur["chapterElems"]:
                    levelConstraints.append(f"{elem} < chapter")
            if "chunkElems" in cur:
                for elem in cur["chunkElems"]:
                    levelConstraints.append(f"{elem} < chunk")

            levelConstraints = "; ".join(levelConstraints)

            cv.meta("otext", levelConstraints=levelConstraints)

            if verbose == 1:
                console("source reading done")
            return True

        return director

    def convertTask(self):
        """Implementation of the "convert" task.

        It sets up the `tf.convert.walker` machinery and runs it.

        Returns
        -------
        boolean
            Whether the conversion was successful.
        """
        if not self.importOK():
            return

        if not self.good:
            return

        procins = self.procins
        verbose = self.verbose
        slotType = self.slotType
        generic = self.generic
        otext = self.otext
        featureMeta = self.featureMeta
        intFeatures = self.intFeatures

        makeLineElems = self.makeLineElems
        lineModel = self.lineModel
        if makeLineElems:
            lineProperties = self.lineProperties
            lineType = lineProperties["nodeType"]

        makePageElems = self.makePageElems
        pageModel = self.pageModel

        if makePageElems:
            pageProperties = self.pageProperties
            pageType = pageProperties["nodeType"]
            pbAtTop = pageProperties["pbAtTop"] if makePageElems else None

        tfPath = self.tfPath
        teiPath = self.teiPath

        if verbose >= 0:
            if verbose == 1:
                console(f"TEI to TF converting: {ux(teiPath)} => {ux(tfPath)}")
            if makeLineElems:
                lbRep = f" with {lineType} nodes for lines between lb elements"
                console(f"Line model {lineModel}{lbRep}")

            if makePageElems:
                wrt = "started" if pbAtTop else "ended"
                pbRep = f" with {pageType} nodes for pages {wrt} by pb elements"
                console(f"Page model {pageModel}{pbRep}")

            console(
                f"Processing instructions are {'treated' if procins else 'ignored'}"
            )

        initTree(tfPath, fresh=True, gentle=True)

        cv = self.getConverter()

        self.good = cv.walk(
            self.getDirector(),
            slotType,
            otext=otext,
            generic=generic,
            intFeatures=intFeatures,
            featureMeta=featureMeta,
            generateTf=True,
        )

    def loadTask(self):
        """Implementation of the "load" task.

        It loads the TF data that resides in the directory where the "convert" task
        delivers its results.

        During loading there are additional checks. If they succeed, we have evidence
        that we have a valid TF dataset.

        Also, during the first load intensive pre-computation of TF data takes place,
        the results of which will be cached in the invisible `.tf` directory there.

        That makes the TF data ready to be loaded fast, next time it is needed.

        Returns
        -------
        boolean
            Whether the loading was successful.
        """
        if not self.importOK():
            return

        if not self.good:
            return

        tfPath = self.tfPath
        verbose = self.verbose
        silent = AUTO if verbose == 1 else TERSE if verbose == 0 else DEEP

        if not dirExists(tfPath):
            console(f"Directory {ux(tfPath)} does not exist.", error=True)
            console("No TF found, nothing to load", error=True)
            self.good = False
            return

        TF = Fabric(locations=[tfPath], silent=silent)
        allFeatures = TF.explore(silent=True, show=True)
        loadableFeatures = allFeatures["nodes"] + allFeatures["edges"]
        api = TF.load(loadableFeatures, silent=silent)
        if api:
            if verbose >= 0:
                console(f"max node = {api.F.otype.maxNode}")
            self.good = True
            return

        self.good = False

    # APP CREATION/UPDATING

    def appTask(self, tokenBased=False):
        """Implementation of the "app" task.

        It creates / updates a corpus-specific app plus specific documentation files.
        There should be a valid TF dataset in place, because some
        settings in the app derive from it.

        It will also read custom additions that are present in the target app directory.
        These files are:

        *   `about_custom.md`:
            A markdown file with specific colophon information about the dataset.
            In the generated file, this information will be put at the start.
        *   `transcription_custom.md`:
            A markdown file with specific encoding information about the dataset.
            In the generated file, this information will be put at the end.
        *   `config_custom.yaml`:
            A YAML file with configuration data that will be *merged* into the generated
            config.yaml.
        *   `app_custom.py`:
            A python file with named snippets of code to be inserted
            at corresponding places in the generated `app.py`
        *   `display_custom.css`:
            Additional CSS definitions that will be appended to the generated
            `display.css`.

        If the TF app for this resource needs custom code, this is the way to retain
        that code between automatic generation of files.

        Returns
        -------
        boolean
            Whether the operation was successful.
        """
        if not self.importOK():
            return

        if not self.good:
            return

        verbose = self.verbose

        refDir = self.refDir
        myDir = self.myDir
        procins = self.procins
        wordAsSlot = self.wordAsSlot
        tokenAsSlot = self.tokenAsSlot
        charAsSlot = self.charAsSlot
        parentEdges = self.parentEdges
        siblingEdges = self.siblingEdges
        sectionModel = self.sectionModel
        sectionProperties = self.sectionProperties
        tfVersion = self.tfVersion

        # key | parentDir | file | justCopy

        # if parentDir is a tuple, the first part is the parentDir of the source
        # and the second part is the parentDir of the destination

        itemSpecs = (
            ("about", "docs", "about.md", False),
            ("trans", ("app", "docs"), "transcription.md", False),
            ("logo", "app/static", "logo.png", True),
            ("display", "app/static", "display.css", False),
            ("config", "app", "config.yaml", False),
            ("app", "app", "app.py", False),
        )
        genTasks = {
            s[0]: dict(parentDir=s[1], file=s[2], justCopy=s[3]) for s in itemSpecs
        }
        cssInfo = makeCssInfo()

        version = tfVersion.removesuffix(PRE) if tokenBased else tfVersion

        def createConfig(sourceText, customText):
            text = sourceText.replace("«version»", f'"{version}"')

            settings = readYaml(text=text, plain=True)
            settings.setdefault("provenanceSpec", {})["branch"] = BRANCH_DEFAULT_NEW

            if tokenBased:
                if "typeDisplay" in settings and "word" in settings["typeDisplay"]:
                    del settings["typeDisplay"]["word"]

            customSettings = (
                {} if not customText else readYaml(text=customText, plain=True)
            )

            mergeDict(settings, customSettings)

            text = writeYaml(settings)

            return text

        def createDisplay(sourceText, customText):
            """Copies and tweaks the display.css file of a TF app.

            We generate CSS code for certain text formatting styles,
            triggered by `rend` attributes in the source.
            """

            css = sourceText.replace("«rends»", cssInfo)
            return f"{css}\n\n{customText}\n"

        def createApp(sourceText, customText):
            """Copies and tweaks the app.py file of a TF app.

            The template app.py provides text formatting functions.
            It retrieves text from features, but that is dependent on
            the settings of the conversion, in particular whether we have words as
            slots or characters.

            Depending on that we insert some code in the template.

            The template contains the string `F.matérial`, and it will be replaced
            by something like

            ```
            F.ch.v(n)
            ```

            or

            ```
            f"{F.str.v(n)}{F.after.v(n)}"
            ```

            That's why the variable `materialCode` in the body gets a rather
            unusual value: it is interpreted later on as code.
            """

            materialCode = (
                '''F.ch.v(n) or ""'''
                if charAsSlot or tokenBased
                else """f'{F.str.v(n) or ""}{F.after.v(n) or ""}'"""
            )
            rendValues = repr(KNOWN_RENDS)

            code = sourceText.replace("F.matérial", materialCode)
            code = code.replace('"rèndValues"', rendValues)

            hookStartRe = re.compile(r"^# DEF (import|init|extra)\s*$", re.M)
            hookEndRe = re.compile(r"^# END DEF\s*$", re.M)
            hookInsertRe = re.compile(r"^\s*# INSERT (import|init|extra)\s*$", re.M)

            custom = {}
            section = None

            for line in (customText or "").split("\n"):
                line = line.rstrip()

                if section is None:
                    match = hookStartRe.match(line)
                    if match:
                        section = match.group(1)
                        custom[section] = []
                else:
                    match = hookEndRe.match(line)
                    if match:
                        section = None
                    else:
                        custom[section].append(line)

            codeLines = []

            for line in code.split("\n"):
                line = line.rstrip()

                match = hookInsertRe.match(line)
                if match:
                    section = match.group(1)
                    codeLines.extend(custom.get(section, []))
                else:
                    codeLines.append(line)

            return "\n".join(codeLines) + "\n"

        def createTranscription(sourceText, customText):
            """Copies and tweaks the transcription.md file for a TF corpus."""
            org = self.org
            repo = self.repo
            relative = self.relative
            intFeatures = self.intFeatures
            extra = self.extra

            def metaRep(feat, meta):
                valueType = "int" if feat in intFeatures else "str"
                description = meta.get("description", "")
                extraFieldRep = "\n".join(
                    f"*   `{field}`: `{value}`"
                    for (field, value) in meta.items()
                    if field not in {"description", "valueType"}
                )

                return (
                    f"""{description}\n"""
                    f"""The values of this feature have type {valueType}.\n"""
                    f"""{extraFieldRep}"""
                )

            extra = "\n\n".join(
                f"## `{feat}`\n\n{metaRep(feat, info['meta'])}\n"
                for (feat, info) in extra.items()
            )

            text = (
                dedent(
                    f"""
                # Corpus {org} - {repo}{relative}

                """
                )
                + tweakTrans(
                    sourceText,
                    procins,
                    wordAsSlot,
                    tokenAsSlot,
                    charAsSlot,
                    parentEdges,
                    siblingEdges,
                    tokenBased,
                    sectionModel,
                    sectionProperties,
                    REND_DESC,
                    extra,
                )
                + dedent(
                    """

                    ## See also

                    *   [about](about.md)
                    """
                )
            )
            return f"{text}\n\n{customText}\n"

        def createAbout(sourceText, customText):
            org = self.org
            repo = self.repo
            relative = self.relative
            generic = self.generic
            if tokenBased:
                generic["version"] = version

            generic = "\n\n".join(
                f"## `{key}`\n\n`{value}`\n" for (key, value) in generic.items()
            )

            return f"{customText}\n\n{sourceText}\n\n" + (
                dedent(
                    f"""
                # Corpus {org} - {repo}{relative}

                """
                )
                + generic
                + dedent(
                    """

                    ## Conversion

                    Converted from TEI to TF

                    ## See also

                    *   [transcription](transcription.md)
                    """
                )
            )

        extraRep = " with NLP output " if tokenBased else ""

        if verbose >= 0:
            console(f"App updating {extraRep} ...")

        for name, info in genTasks.items():
            parentDir = info["parentDir"]
            (sourceBit, targetBit) = (
                parentDir if type(parentDir) is tuple else (parentDir, parentDir)
            )
            file = info[FILE]
            fileParts = file.rsplit(".", 1)
            if len(fileParts) == 1:
                fileParts = [file, ""]
            (fileBase, fileExt) = fileParts
            if fileExt:
                fileExt = f".{fileExt}"
            targetDir = f"{refDir}/{targetBit}"
            itemTarget = f"{targetDir}/{file}"
            itemCustom = f"{targetDir}/{fileBase}_custom{fileExt}"
            itemPre = f"{targetDir}/{fileBase}_orig{fileExt}"

            justCopy = info["justCopy"]
            teiDir = f"{myDir}/{sourceBit}"
            itemSource = f"{teiDir}/{file}"

            # If there is custom info, we do not have to preserve the previous version.
            # Otherwise we save the target before overwriting it, unless it
            # has been saved before.

            preExists = fileExists(itemPre)
            targetExists = fileExists(itemTarget)
            customExists = fileExists(itemCustom)

            msg = ""

            if justCopy:
                if targetExists:
                    msg = "(already exists, not overwritten)"
                    safe = False
                else:
                    msg = "(copied)"
                    safe = True
            else:
                if targetExists:
                    if customExists:
                        msg = "(generated with custom info)"
                    else:
                        if preExists:
                            msg = "(no custom info, older original exists)"
                        else:
                            msg = "(no custom info, original preserved)"
                            fileCopy(itemTarget, itemPre)
                else:
                    msg = "(created)"

            initTree(targetDir, fresh=False)

            if justCopy:
                if safe:
                    fileCopy(itemSource, itemTarget)
            else:
                if fileExists(itemSource):
                    with fileOpen(itemSource) as fh:
                        sourceText = fh.read()
                else:
                    sourceText = ""

                if fileExists(itemCustom):
                    with fileOpen(itemCustom) as fh:
                        customText = fh.read()
                else:
                    customText = ""

                targetText = (
                    createConfig
                    if name == "config"
                    else createApp
                    if name == "app"
                    else createDisplay
                    if name == "display"
                    else createTranscription
                    if name == "trans"
                    else createAbout
                    if name == "about"
                    else fileCopy  # this cannot occur because justCopy is False
                )(sourceText, customText)

                with fileOpen(itemTarget, mode="w") as fh:
                    fh.write(targetText)

            if verbose >= 0:
                console(f"\t{ux(itemTarget):30} {msg}")

        if verbose >= 0:
            console("Done")
        else:
            console(f"App updated{extraRep}")

    # START the TEXT-FABRIC BROWSER on this CORPUS

    def browseTask(self):
        """Implementation of the "browse" task.

        It runs a shell command that starts the TF browser on
        the newly created corpus.
        There should be a valid TF dataset and app configuration in place.

        Returns
        -------
        boolean
            Whether the operation was successful.
        """
        if not self.importOK():
            return

        if not self.good:
            return

        org = self.org
        repo = self.repo
        relative = self.relative
        backend = self.backend
        tfVersion = self.tfVersion

        backendOpt = "" if backend == "github" else f"--backend={backend}"
        versionOpt = f"--version={tfVersion}"
        versionOpt = ""
        try:
            run(
                (
                    f"tf {org}/{repo}{relative}:clone --checkout=clone "
                    f"{versionOpt} {backendOpt}"
                ),
                shell=True,
            )
        except KeyboardInterrupt:
            pass

    def task(
        self,
        check=False,
        convert=False,
        load=False,
        app=False,
        apptoken=False,
        browse=False,
        verbose=None,
        validate=None,
    ):
        """Carry out any task, possibly modified by any flag.

        This is a higher level function that can execute a selection of tasks.

        The tasks will be executed in a fixed order:
        `check`, `convert`, `load`, `app`, `apptoken`, `browse`.
        But you can select which one(s) must be executed.

        If multiple tasks must be executed and one fails, the subsequent tasks
        will not be executed.

        Parameters
        ----------
        check: boolean, optional False
            Whether to carry out the `check` task.
        convert: boolean, optional False
            Whether to carry out the `convert` task.
        load: boolean, optional False
            Whether to carry out the `load` task.
        app: boolean, optional False
            Whether to carry out the `app` task.
        apptoken: boolean, optional False
            Whether to carry out the `apptoken` task.
        browse: boolean, optional False
            Whether to carry out the `browse` task.
        verbose: integer, optional -1
            Produce no (-1), some (0) or many (1) progress and reporting messages
        validate: boolean, optional True
            Whether to perform XML validation during the check task

        Returns
        -------
        boolean
            Whether all tasks have executed successfully.
        """
        if not self.importOK():
            return

        if verbose is not None:
            verboseSav = self.verbose
            self.verbose = verbose

        if validate is not None:
            self.validate = validate

        if not self.good:
            return False

        for condition, method, kwargs in (
            (check, self.checkTask, {}),
            (convert, self.convertTask, {}),
            (load, self.loadTask, {}),
            (app, self.appTask, {}),
            (apptoken, self.appTask, dict(tokenBased=True)),
            (browse, self.browseTask, {}),
        ):
            if condition:
                method(**kwargs)
                if not self.good:
                    break

        if verbose is not None:
            self.verbose = verboseSav
        return self.good

Ancestors

Methods

def appTask(self, tokenBased=False)

Implementation of the "app" task.

It creates / updates a corpus-specific app plus specific documentation files. There should be a valid TF dataset in place, because some settings in the app derive from it.

It will also read custom additions that are present in the target app directory. These files are:

  • about_custom.md: A markdown file with specific colophon information about the dataset. In the generated file, this information will be put at the start.
  • transcription_custom.md: A markdown file with specific encoding information about the dataset. In the generated file, this information will be put at the end.
  • config_custom.yaml: A YAML file with configuration data that will be merged into the generated config.yaml.
  • app_custom.py: A python file with named snippets of code to be inserted at corresponding places in the generated app.py
  • display_custom.css: Additional CSS definitions that will be appended to the generated display.css.
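To make the idea of merging `config_custom.yaml` into the generated `config.yaml` concrete, here is a minimal sketch. It assumes a recursive merge in which custom values win. The actual behaviour of `mergeDict` is not shown on this page, so the helper `merge_dict` below is a hypothetical stand-in, and the setting values are invented for illustration.

```python
def merge_dict(source, overrides):
    """Recursively merge overrides into source, in place.

    Hypothetical stand-in for the mergeDict helper used by appTask;
    the real function may differ in its details.
    """
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(source.get(key), dict):
            # both sides are dicts: descend instead of overwriting
            merge_dict(source[key], value)
        else:
            source[key] = value


# Invented settings, standing in for the parsed config.yaml template
generated = {
    "provenanceSpec": {"branch": "main", "version": "0.1"},
    "typeDisplay": {"word": {"label": "w"}},
}
# Invented settings, standing in for the parsed config_custom.yaml
custom = {
    "provenanceSpec": {"branch": "dev"},
    "docSettings": {"docBase": "docs"},
}

merge_dict(generated, custom)
print(generated)
```

Under this assumption, a custom file only needs to list the keys it wants to change or add. Everything else in the generated configuration is left as it is.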

If the TF app for this resource needs custom code, this is the way to retain that code between automatic generation of files.
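The named snippets in `app_custom.py` are delimited by `# DEF import`, `# DEF init` or `# DEF extra` and closed by `# END DEF`. The template `app.py` marks the insertion points with `# INSERT import`, `# INSERT init` or `# INSERT extra`. The sketch below reproduces that mechanism outside the converter. The function name `merge_hooks` and the two sample texts are made up for illustration, while the regular expressions follow the source code of `appTask`.

```python
import re

# Hook markers as they appear in the source code of appTask
HOOK_START = re.compile(r"^# DEF (import|init|extra)\s*$")
HOOK_END = re.compile(r"^# END DEF\s*$")
HOOK_INSERT = re.compile(r"^\s*# INSERT (import|init|extra)\s*$")


def merge_hooks(template, custom_text):
    """Insert named snippets from custom_text at the INSERT markers of template."""
    snippets = {}
    section = None

    # Pass 1: collect the lines between "# DEF name" and "# END DEF"
    for line in custom_text.split("\n"):
        line = line.rstrip()
        if section is None:
            match = HOOK_START.match(line)
            if match:
                section = match.group(1)
                snippets[section] = []
        elif HOOK_END.match(line):
            section = None
        else:
            snippets[section].append(line)

    # Pass 2: replace each "# INSERT name" line by the collected snippet
    out = []
    for line in template.split("\n"):
        line = line.rstrip()
        match = HOOK_INSERT.match(line)
        if match:
            out.extend(snippets.get(match.group(1), []))
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical file contents, for illustration only
template = (
    "import types\n# INSERT import\n\ndef fmt(n):\n    # INSERT extra\n    return n\n"
)
custom = "# DEF import\nimport re\n# END DEF\n"

result = merge_hooks(template, custom)
print(result)
```

Two details follow from this logic: an `# INSERT` marker without a matching snippet simply disappears from the output, and snippet lines are inserted with the indentation they have in the custom file, not the indentation of the marker.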

Returns

boolean
Whether the operation was successful.
def appTask(self, tokenBased=False):
    """Implementation of the "app" task.

    It creates / updates a corpus-specific app plus specific documentation files.
    There should be a valid TF dataset in place, because some
    settings in the app derive from it.

    It will also read custom additions that are present in the target app directory.
    These files are:

    *   `about_custom.md`:
        A markdown file with specific colophon information about the dataset.
        In the generated file, this information will be put at the start.
    *   `transcription_custom.md`:
        A markdown file with specific encoding information about the dataset.
        In the generated file, this information will be put at the end.
    *   `config_custom.yaml`:
        A YAML file with configuration data that will be *merged* into the generated
        config.yaml.
    *   `app_custom.py`:
        A python file with named snippets of code to be inserted
        at corresponding places in the generated `app.py`
    *   `display_custom.css`:
        Additional CSS definitions that will be appended to the generated
        `display.css`.

    If the TF app for this resource needs custom code, this is the way to retain
    that code between automatic generation of files.

    Returns
    -------
    boolean
        Whether the operation was successful.
    """
    if not self.importOK():
        return

    if not self.good:
        return

    verbose = self.verbose

    refDir = self.refDir
    myDir = self.myDir
    procins = self.procins
    wordAsSlot = self.wordAsSlot
    tokenAsSlot = self.tokenAsSlot
    charAsSlot = self.charAsSlot
    parentEdges = self.parentEdges
    siblingEdges = self.siblingEdges
    sectionModel = self.sectionModel
    sectionProperties = self.sectionProperties
    tfVersion = self.tfVersion

    # key | parentDir | file | justCopy

    # if parentDir is a tuple, the first part is the parentDir of the source
    # and the second part is the parentDir of the destination

    itemSpecs = (
        ("about", "docs", "about.md", False),
        ("trans", ("app", "docs"), "transcription.md", False),
        ("logo", "app/static", "logo.png", True),
        ("display", "app/static", "display.css", False),
        ("config", "app", "config.yaml", False),
        ("app", "app", "app.py", False),
    )
    genTasks = {
        s[0]: dict(parentDir=s[1], file=s[2], justCopy=s[3]) for s in itemSpecs
    }
    cssInfo = makeCssInfo()

    version = tfVersion.removesuffix(PRE) if tokenBased else tfVersion

    def createConfig(sourceText, customText):
        text = sourceText.replace("«version»", f'"{version}"')

        settings = readYaml(text=text, plain=True)
        settings.setdefault("provenanceSpec", {})["branch"] = BRANCH_DEFAULT_NEW

        if tokenBased:
            if "typeDisplay" in settings and "word" in settings["typeDisplay"]:
                del settings["typeDisplay"]["word"]

        customSettings = (
            {} if not customText else readYaml(text=customText, plain=True)
        )

        mergeDict(settings, customSettings)

        text = writeYaml(settings)

        return text

    def createDisplay(sourceText, customText):
        """Copies and tweaks the display.css file of an TF app.

        We generate CSS code for certain text formatting styles,
        triggered by `rend` attributes in the source.
        """

        css = sourceText.replace("«rends»", cssInfo)
        return f"{css}\n\n{customText}\n"

    def createApp(sourceText, customText):
        """Copies and tweaks the app.py file of an TF app.

        The template app.py provides text formatting functions.
        It retrieves text from features, but that is dependent on
        the settings of the conversion, in particular whether we have words as
        slots or characters.

        Depending on that we insert some code in the template.

        The template contains the string `F.matérial`, and it will be replaced
        by something like

        ```
        F.ch.v(n)
        ```

        or

        ```
        f"{F.str.v(n)}{F.after.v(n)}"
        ```

        That's why the variable `materialCode` in the body gets a rather
        unusual value: it is interpreted later on as code.
        """

        materialCode = (
            '''F.ch.v(n) or ""'''
            if charAsSlot or tokenBased
            else """f'{F.str.v(n) or ""}{F.after.v(n) or ""}'"""
        )
        rendValues = repr(KNOWN_RENDS)

        code = sourceText.replace("F.matérial", materialCode)
        code = code.replace('"rèndValues"', rendValues)

        hookStartRe = re.compile(r"^# DEF (import|init|extra)\s*$", re.M)
        hookEndRe = re.compile(r"^# END DEF\s*$", re.M)
        hookInsertRe = re.compile(r"^\s*# INSERT (import|init|extra)\s*$", re.M)

        custom = {}
        section = None

        for line in (customText or "").split("\n"):
            line = line.rstrip()

            if section is None:
                match = hookStartRe.match(line)
                if match:
                    section = match.group(1)
                    custom[section] = []
            else:
                match = hookEndRe.match(line)
                if match:
                    section = None
                else:
                    custom[section].append(line)

        codeLines = []

        for line in code.split("\n"):
            line = line.rstrip()

            match = hookInsertRe.match(line)
            if match:
                section = match.group(1)
                codeLines.extend(custom.get(section, []))
            else:
                codeLines.append(line)

        return "\n".join(codeLines) + "\n"

    def createTranscription(sourceText, customText):
        """Copies and tweaks the transcription.md file for a TF corpus."""
        org = self.org
        repo = self.repo
        relative = self.relative
        intFeatures = self.intFeatures
        extra = self.extra

        def metaRep(feat, meta):
            valueType = "int" if feat in intFeatures else "str"
            description = meta.get("description", "")
            extraFieldRep = "\n".join(
                f"*   `{field}`: `{value}`"
                for (field, value) in meta.items()
                if field not in {"description", "valueType"}
            )

            return (
                f"""{description}\n"""
                f"""The values of this feature have type {valueType}.\n"""
                f"""{extraFieldRep}"""
            )

        extra = "\n\n".join(
            f"## `{feat}`\n\n{metaRep(feat, info['meta'])}\n"
            for (feat, info) in extra.items()
        )

        text = (
            dedent(
                f"""
            # Corpus {org} - {repo}{relative}

            """
            )
            + tweakTrans(
                sourceText,
                procins,
                wordAsSlot,
                tokenAsSlot,
                charAsSlot,
                parentEdges,
                siblingEdges,
                tokenBased,
                sectionModel,
                sectionProperties,
                REND_DESC,
                extra,
            )
            + dedent(
                """

                ## See also

                *   [about](about.md)
                """
            )
        )
        return f"{text}\n\n{customText}\n"

    def createAbout(sourceText, customText):
        org = self.org
        repo = self.repo
        relative = self.relative
        generic = self.generic
        if tokenBased:
            generic["version"] = version

        generic = "\n\n".join(
            f"## `{key}`\n\n`{value}`\n" for (key, value) in generic.items()
        )

        return f"{customText}\n\n{sourceText}\n\n" + (
            dedent(
                f"""
            # Corpus {org} - {repo}{relative}

            """
            )
            + generic
            + dedent(
                """

                ## Conversion

                Converted from TEI to TF

                ## See also

                *   [transcription](transcription.md)
                """
            )
        )

    extraRep = " with NLP output " if tokenBased else ""

    if verbose >= 0:
        console(f"App updating {extraRep} ...")

    for name, info in genTasks.items():
        parentDir = info["parentDir"]
        (sourceBit, targetBit) = (
            parentDir if type(parentDir) is tuple else (parentDir, parentDir)
        )
        file = info[FILE]
        fileParts = file.rsplit(".", 1)
        if len(fileParts) == 1:
            fileParts = [file, ""]
        (fileBase, fileExt) = fileParts
        if fileExt:
            fileExt = f".{fileExt}"
        targetDir = f"{refDir}/{targetBit}"
        itemTarget = f"{targetDir}/{file}"
        itemCustom = f"{targetDir}/{fileBase}_custom{fileExt}"
        itemPre = f"{targetDir}/{fileBase}_orig{fileExt}"

        justCopy = info["justCopy"]
        teiDir = f"{myDir}/{sourceBit}"
        itemSource = f"{teiDir}/{file}"

        # If there is custom info, we do not have to preserve the previous version.
        # Otherwise we save the target before overwriting it, unless it
        # has been saved before.

        preExists = fileExists(itemPre)
        targetExists = fileExists(itemTarget)
        customExists = fileExists(itemCustom)

        msg = ""

        if justCopy:
            if targetExists:
                msg = "(already exists, not overwritten)"
                safe = False
            else:
                msg = "(copied)"
                safe = True
        else:
            if targetExists:
                if customExists:
                    msg = "(generated with custom info)"
                else:
                    if preExists:
                        msg = "(no custom info, older original exists)"
                    else:
                        msg = "(no custom info, original preserved)"
                        fileCopy(itemTarget, itemPre)
            else:
                msg = "(created)"

        initTree(targetDir, fresh=False)

        if justCopy:
            if safe:
                fileCopy(itemSource, itemTarget)
        else:
            if fileExists(itemSource):
                with fileOpen(itemSource) as fh:
                    sourceText = fh.read()
            else:
                sourceText = ""

            if fileExists(itemCustom):
                with fileOpen(itemCustom) as fh:
                    customText = fh.read()
            else:
                customText = ""

            targetText = (
                createConfig
                if name == "config"
                else createApp
                if name == "app"
                else createDisplay
                if name == "display"
                else createTranscription
                if name == "trans"
                else createAbout
                if name == "about"
                else fileCopy  # this cannot occur because justCopy is False
            )(sourceText, customText)

            with fileOpen(itemTarget, mode="w") as fh:
                fh.write(targetText)

        if verbose >= 0:
            console(f"\t{ux(itemTarget):30} {msg}")

    if verbose >= 0:
        console("Done")
    else:
        console(f"App updated{extraRep}")
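The hook mechanism that `createApp` applies to `app_custom.py` can be looked at in isolation. The sketch below is a simplified, standalone re-implementation, not the converter's own code: the function name `mergeHooks` and the sample texts are hypothetical, and only the `# DEF …`, `# END DEF` and `# INSERT …` markers with the hook names `import`, `init` and `extra` are taken from the source above.

```python
import re

HOOK_START = re.compile(r"^# DEF (import|init|extra)\s*$")
HOOK_END = re.compile(r"^# END DEF\s*$")
HOOK_INSERT = re.compile(r"^\s*# INSERT (import|init|extra)\s*$")


def mergeHooks(templateText, customText):
    """Insert named snippets from customText at the markers in templateText.

    Hypothetical helper that mirrors the logic of createApp.
    """
    custom = {}
    section = None

    # Collect the lines between "# DEF name" and "# END DEF".
    for line in (customText or "").split("\n"):
        line = line.rstrip()
        if section is None:
            match = HOOK_START.match(line)
            if match:
                section = match.group(1)
                custom[section] = []
        elif HOOK_END.match(line):
            section = None
        else:
            custom[section].append(line)

    # Replace every "# INSERT name" line by the collected snippet, if any.
    merged = []
    for line in templateText.split("\n"):
        line = line.rstrip()
        match = HOOK_INSERT.match(line)
        if match:
            merged.extend(custom.get(match.group(1), []))
        else:
            merged.append(line)

    return "\n".join(merged)


# Made-up template and custom file, for illustration only.
template = (
    "import types\n# INSERT import\n\nclass TfApp:\n    # INSERT extra\n    pass"
)
custom = "# DEF import\nimport re\n# END DEF\n"
result = mergeHooks(template, custom)
```

A marker without a matching snippet simply disappears from the output. Snippet lines are inserted verbatim, so a snippet meant for an indented position has to carry its own indentation.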
def browseTask(self)

Implementation of the "browse" task.

It gives a shell command to start the TF browser on the newly created corpus. There should be a valid TF dataset and app configuration in place.

Returns

boolean
Whether the operation was successful.
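As a rough illustration of the shell command that this task assembles, the sketch below rebuilds the command string outside the class. The helper name `browseCommand` and the values `myorg`, `myrepo` and `/tf` are hypothetical. Unlike the source, which concatenates the options with fixed spaces, this sketch drops empty options; no `--version` option is included, because the source resets `versionOpt` to the empty string before running the command.

```python
def browseCommand(org, repo, relative="", backend="github"):
    """Build the `tf` command line (hypothetical helper, for illustration)."""
    # The backend option is only passed for backends other than GitHub.
    backendOpt = "" if backend == "github" else f"--backend={backend}"

    parts = [f"tf {org}/{repo}{relative}:clone", "--checkout=clone"]
    if backendOpt:
        parts.append(backendOpt)
    return " ".join(parts)


cmd = browseCommand("myorg", "myrepo", relative="/tf")
```

The task itself passes such a string to `run(..., shell=True)` and ignores a `KeyboardInterrupt`, so that stopping the browser with Ctrl-C does not produce a traceback.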
def browseTask(self):
    """Implementation of the "browse" task.

    It gives a shell command to start the TF browser on
    the newly created corpus.
    There should be a valid TF dataset and app configuration in place.

    Returns
    -------
    boolean
        Whether the operation was successful.
    """
    if not self.importOK():
        return

    if not self.good:
        return

    org = self.org
    repo = self.repo
    relative = self.relative
    backend = self.backend
    tfVersion = self.tfVersion

    backendOpt = "" if backend == "github" else f"--backend={backend}"
    versionOpt = f"--version={tfVersion}"
    versionOpt = ""
    try:
        run(
            (
                f"tf {org}/{repo}{relative}:clone --checkout=clone "
                f"{versionOpt} {backendOpt}"
            ),
            shell=True,
        )
    except KeyboardInterrupt:
        pass
def checkTask(self)

Implementation of the "check" task.

It validates the TEI, but only if a schema file has been passed explicitly when constructing the TEI object.

Then it makes an inventory of all elements and attributes in the TEI files.

If tags are used in multiple namespaces, it will be reported.

Conflation of namespaces

The TEI to TF conversion constructs node types and attributes without taking namespaces into account. However, the parsing process is namespace aware.
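To make the effect of conflation concrete, here is a minimal sketch that counts, per local tag name, the namespaces in which it occurs. It is not the converter's code: the converter works with `lxml` and `etree.QName`, whereas this sketch uses the standard library's `xml.etree.ElementTree`, and the sample document, the second namespace URI and the helper `splitTag` are made up for the example.

```python
import collections
import xml.etree.ElementTree as ET

# Made-up sample: the local name "p" occurs in two different namespaces.
source = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" '
    'xmlns:x="http://example.org/other">'
    "<text><p>one</p><x:p>two</x:p></text>"
    "</TEI>"
)


def splitTag(tag):
    """Split an ElementTree tag "{namespace}local" into (namespace, local)."""
    if tag.startswith("{"):
        (ns, local) = tag[1:].split("}", 1)
        return (ns, local)
    return ("", tag)


# tagByNs[local][namespace] = number of occurrences
tagByNs = collections.defaultdict(collections.Counter)

for elem in ET.fromstring(source).iter():
    (ns, local) = splitTag(elem.tag)
    tagByNs[local][ns] += 1

# Tags that would be conflated when namespaces are ignored.
conflated = sorted(tag for (tag, nsInfo) in tagByNs.items() if len(nsInfo) > 1)
```

Here both `p` elements would end up as the same node type `p`, which is why tags that occur in more than one namespace are reported.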

The inventory lists all elements and attributes, and many attribute values. But it represents any digit with N, and the values of some attributes that contain ids or keywords are reduced to X.

This information reduction helps to get a clear overview.
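The reduction can be sketched in a few lines. The attribute set and the regular expression follow the source code of this task, where the replacement characters are the uppercase letters `N` and `X`; the helper name `reduceValue` and the sample values are hypothetical. In the task itself, this reduction is applied only to attributes that are neither formatting nor keyword attributes.

```python
import re

# Attributes whose values are replaced wholesale (taken from the task's source).
TRIM_ATTS = {"id", "key", "target", "value"}
NUM_RE = re.compile(r"[0-9]")


def reduceValue(att, value):
    """Reduce an attribute value for the inventory (hypothetical helper)."""
    return "X" if att in TRIM_ATTS else NUM_RE.sub("N", value)


# Made-up attribute/value pairs.
samples = [("n", "12"), ("id", "p-0042"), ("when", "1634-05-17")]
reduced = [reduceValue(att, value) for (att, value) in samples]
```

With this reduction, thousands of distinct dates or identifiers collapse into a handful of patterns, which keeps the inventory short enough to read.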

It writes reports to the reportPath:

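  • errors.txt: validation errors
  • elements.txt: element / attribute inventory
  • namespaces.txt: tags and the namespaces in which they occur
  • types.txt: element type information per model
  • ids.txt and refs.txt: identifiers and the references to them
  • lb-parents.txt: parent elements of lb elements.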
def checkTask(self):
    """Implementation of the "check" task.

    It validates the TEI, but only if a schema file has been passed explicitly
    when constructing the `TEI()` object.

    Then it makes an inventory of all elements and attributes in the TEI files.

    If tags are used in multiple namespaces, it will be reported.

    !!! caution "Conflation of namespaces"
        The TEI to TF conversion constructs node types and attributes
        without taking namespaces into account.
        However, the parsing process is namespace aware.

    The inventory lists all elements and attributes, and many attribute values.
    But it represents any digit with `N`, and the values of some attributes that
    contain ids or keywords are reduced to `X`.

    This information reduction helps to get a clear overview.

    It writes reports to the `reportPath`:

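    *   `errors.txt`: validation errors
    *   `elements.txt`: element / attribute inventory
    *   `namespaces.txt`: tags and the namespaces in which they occur
    *   `types.txt`: element type information per model
    *   `ids.txt` and `refs.txt`: identifiers and the references to them
    *   `lb-parents.txt`: parent elements of `lb` elements.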
    """
    if not self.importOK():
        return

    if not self.good:
        return

    verbose = self.verbose
    procins = self.procins
    validate = self.validate
    modelInfo = self.modelInfo
    modelInv = self.modelInv
    modelXsd = self.modelXsd
    A = self.A
    etree = self.etree

    teiPath = self.teiPath
    reportPath = self.reportPath
    docsDir = self.docsDir
    sectionModel = self.sectionModel

    if verbose == 1:
        console(f"TEI to TF checking: {ux(teiPath)} => {ux(reportPath)}")
    if verbose >= 0:
        console(
            f"Processing instructions are {'treated' if procins else 'ignored'}"
        )
        console(f"XML validation will be {'performed' if validate else 'skipped'}")

    kindLabels = dict(
        format="Formatting Attributes",
        keyword="Keyword Attributes",
        rest="Remaining Attributes and Elements",
    )
    getStore = lambda: collections.defaultdict(  # noqa: E731
        lambda: collections.defaultdict(collections.Counter)
    )
    analysis = {x: getStore() for x in kindLabels}
    errors = []
    tagByNs = collections.defaultdict(collections.Counter)
    refs = collections.defaultdict(lambda: collections.Counter())
    ids = collections.defaultdict(lambda: collections.Counter())

    parser = self.getParser()
    baseSchema = modelXsd[None]
    overrides = [
        override for (model, override) in modelXsd.items() if model is not None
    ]
    A.getElementInfo(baseSchema, overrides, verbose=verbose)
    elementDefs = A.elementDefs

    initTree(reportPath)
    initTree(docsDir)

    nProcins = 0

    lbParents = collections.Counter()

    def analyse(root, analysis, xmlFile):
        FORMAT_ATTS = set(
            """
            dim
            level
            place
            rend
        """.strip().split()
        )

        KEYWORD_ATTS = set(
            """
            facs
            form
            function
            lang
            reason
            type
            unit
            who
        """.strip().split()
        )

        TRIM_ATTS = set(
            """
            id
            key
            target
            value
        """.strip().split()
        )

        NUM_RE = re.compile(r"""[0-9]""", re.S)

        def nodeInfo(xnode):
            nonlocal nProcins

            if procins and isinstance(xnode, etree._ProcessingInstruction):
                target = xnode.target
                tag = f"?{target}"
                ns = ""
                nProcins += 1
            else:
                qName = etree.QName(xnode.tag)
                tag = qName.localname
                ns = qName.namespace

            atts = {etree.QName(k).localname: v for (k, v) in xnode.attrib.items()}

            tagByNs[tag][ns] += 1

            if tag == "lb":
                parentTag = etree.QName(xnode.getparent().tag).localname
                lbParents[parentTag] += 1

            if len(atts) == 0:
                kind = "rest"
                analysis[kind][tag][""][""] += 1
            else:
                idv = atts.get("id", None)

                if idv is not None:
                    ids[xmlFile][idv] += 1

                for refAtt, targetFile, targetId in getRefs(tag, atts, xmlFile):
                    refs[xmlFile][(targetFile, targetId)] += 1

                for k, v in atts.items():
                    kind = (
                        "format"
                        if k in FORMAT_ATTS
                        else "keyword"
                        if k in KEYWORD_ATTS
                        else "rest"
                    )
                    dest = analysis[kind]

                    if kind == "rest":
                        vTrim = "X" if k in TRIM_ATTS else NUM_RE.sub("N", v)
                        dest[tag][k][vTrim] += 1
                    else:
                        words = v.strip().split()
                        for w in words:
                            dest[tag][k][w.strip()] += 1

            for child in xnode.iterchildren(
                tag=(etree.Element, etree.ProcessingInstruction)
                if procins
                else etree.Element
            ):
                nodeInfo(child)

        nodeInfo(root)

    def writeErrors():
        """Write the errors to a file."""

        errorFile = f"{reportPath}/errors.txt"

        nErrors = 0
        nFiles = 0

        with fileOpen(errorFile, mode="w") as fh:
            prevFolder = None
            prevFile = None

            for folder, file, line, col, kind, text in errors:
                newFolder = prevFolder != folder
                newFile = newFolder or prevFile != file

                if newFile:
                    nFiles += 1

                if kind == "error":
                    nErrors += 1

                indent1 = f"{folder}\n\t" if newFolder else "\t"
                indent2 = f"{file}\n\t\t" if newFile else "\t"
                loc = f"{line or ''}:{col or ''}"
                text = "\n".join(wrap(text, width=80, subsequent_indent="\t\t\t"))
                fh.write(f"{indent1}{indent2}{loc} {kind or ''} {text}\n")
                prevFolder = folder
                prevFile = file

        if nErrors:
            console(
                (
                    f"{nErrors} validation error(s) in {nFiles} file(s) "
                    f"written to {errorFile}"
                ),
                error=True,
            )
        else:
            if verbose >= 0:
                if validate:
                    console("Validation OK")
                else:
                    console("No validation performed")

    def writeNamespaces():
        errorFile = f"{reportPath}/namespaces.txt"

        nErrors = 0

        nTags = len(tagByNs)

        with fileOpen(errorFile, mode="w") as fh:
            for tag, nsInfo in sorted(
                tagByNs.items(), key=lambda x: (-len(x[1]), x[0])
            ):
                label = "OK"
                nNs = len(nsInfo)
                if nNs > 1:
                    nErrors += 1
                    label = "XX"

                for ns, amount in sorted(
                    nsInfo.items(), key=lambda x: (-x[1], x[0])
                ):
                    fh.write(
                        f"{label} {nNs:>2} namespace for "
                        f"{tag:<16} : {amount:>5}x {ns}\n"
                    )

        if verbose >= 0:
            if procins:
                plural = "" if nProcins == 1 else "s"
                console(f"{nProcins} processing instruction{plural} encountered.")

            console(
                f"{nTags} tags of which {nErrors} with multiple namespaces "
                f"written to {errorFile}"
                if verbose >= 0 or nErrors
                else "Namespaces OK"
            )

    def writeReport():
        reportFile = f"{reportPath}/elements.txt"
        with fileOpen(reportFile, mode="w") as fh:
            fh.write(
                "Inventory of tags and attributes in the source XML file(s).\n"
                "Contains the following sections:\n"
            )
            for label in kindLabels.values():
                fh.write(f"\t{label}\n")
            fh.write("\n\n")

            infoLines = 0

            def writeAttInfo(tag, att, attInfo):
                nonlocal infoLines
                nl = "" if tag == "" else "\n"
                tagRep = "" if tag == "" else f"<{tag}>"
                attRep = "" if att == "" else f"{att}="
                atts = sorted(attInfo.items())
                (val, amount) = atts[0]
                fh.write(
                    f"{nl}\t{tagRep:<18} " f"{attRep:<11} {amount:>5}x {val}\n"
                )
                infoLines += 1
                for val, amount in atts[1:]:
                    fh.write(
                        f"""\t{'':<7}{'':<18} {'"':<18} {amount:>5}x {val}\n"""
                    )
                    infoLines += 1

            def writeTagInfo(tag, tagInfo):
                nonlocal infoLines
                tags = sorted(tagInfo.items())
                (att, attInfo) = tags[0]
                writeAttInfo(tag, att, attInfo)
                infoLines += 1
                for att, attInfo in tags[1:]:
                    writeAttInfo("", att, attInfo)

            for kind, label in kindLabels.items():
                fh.write(f"\n{label}\n")
                for tag, tagInfo in sorted(analysis[kind].items()):
                    writeTagInfo(tag, tagInfo)

        if verbose >= 0:
            console(f"{infoLines} info line(s) written to {reportFile}")

    def writeElemTypes():
        elemsCombined = {}

        modelSet = set()

        for schemaOverride, eDefs in elementDefs.items():
            model = modelInv[schemaOverride]
            modelSet.add(model)
            for tag, (typ, mixed) in eDefs.items():
                elemsCombined.setdefault(tag, {}).setdefault(model, {})
                elemsCombined[tag][model]["typ"] = typ
                elemsCombined[tag][model]["mixed"] = mixed

        tagReport = {}

        for tag, tagInfo in elemsCombined.items():
            tagLines = []
            tagReport[tag] = tagLines

            if None in tagInfo:
                teiInfo = tagInfo[None]
                teiTyp = teiInfo["typ"]
                teiMixed = teiInfo["mixed"]
                teiTypRep = "??" if teiTyp is None else teiTyp
                teiMixedRep = (
                    "??" if teiMixed is None else "mixed" if teiMixed else "pure"
                )
                mds = ["TEI"]

                for model in sorted(x for x in tagInfo if x is not None):
                    info = tagInfo[model]
                    typ = info["typ"]
                    mixed = info["mixed"]
                    if typ == teiTyp and mixed == teiMixed:
                        mds.append(model)
                    else:
                        typRep = (
                            "" if typ == teiTyp else "??" if typ is None else typ
                        )
                        mixedRep = (
                            ""
                            if mixed == teiMixed
                            else "??"
                            if mixed is None
                            else "mixed"
                            if mixed
                            else "pure"
                        )
                        tagLines.append((tag, [model], typRep, mixedRep))
                tagLines.insert(0, (tag, mds, teiTypRep, teiMixedRep))
            else:
                for model in sorted(tagInfo):
                    info = tagInfo[model]
                    typ = info["typ"]
                    mixed = info["mixed"]
                    typRep = "??" if typ is None else typ
                    mixedRep = (
                        "??" if mixed is None else "mixed" if mixed else "pure"
                    )
                    tagLines.append((tag, [model], typRep, mixedRep))

        reportFile = f"{reportPath}/types.txt"
        with fileOpen(reportFile, mode="w") as fh:
            for tag in sorted(tagReport):
                tagLines = tagReport[tag]
                for tag, mds, typ, mixed in tagLines:
                    model = ",".join(mds)
                    fh.write(f"{tag:<18} {model:<18} {typ:<7} {mixed:<5}\n")

        if verbose >= 0:
            console(
                f"{len(elemsCombined)} tag(s) type info written to {reportFile}"
            )

    def writeLbParents():
        reportFile = f"{reportPath}/lb-parents.txt"

        with fileOpen(reportFile, mode="w") as fh:
            for parent, n in sorted(lbParents.items()):
                fh.write(f"{n:>5} x {parent}\n")

        if verbose >= 0:
            console(f"lb-parent info written to {reportFile}")

    def writeIdRefs():
        reportIdFile = f"{reportPath}/ids.txt"
        reportRefFile = f"{reportPath}/refs.txt"

        ih = fileOpen(reportIdFile, mode="w")
        rh = fileOpen(reportRefFile, mode="w")

        refdIds = collections.Counter()
        missingIds = set()

        totalRefs = 0
        totalRefsU = 0

        totalResolvable = 0
        totalResolvableU = 0
        totalDangling = 0
        totalDanglingU = 0

        seenItems = set()

        for file, items in refs.items():
            rh.write(f"{file}\n")

            resolvable = 0
            resolvableU = 0
            dangling = 0
            danglingU = 0

            for item, n in sorted(items.items()):
                totalRefs += n

                if item in seenItems:
                    newItem = False
                else:
                    seenItems.add(item)
                    newItem = True
                    totalRefsU += 1

                (target, idv) = item

                if target not in ids or idv not in ids[target]:
                    status = "dangling"
                    dangling += n

                    if newItem:
                        missingIds.add((target, idv))
                        danglingU += 1
                else:
                    status = "ok"
                    resolvable += n
                    refdIds[(target, idv)] += n

                    if newItem:
                        resolvableU += 1
                rh.write(f"\t{status:<10} {n:>5} x {target} # {idv}\n")

            msgs = (
                f"\tDangling:   {dangling:>4} x {danglingU:>4}",
                f"\tResolvable: {resolvable:>4} x {resolvableU:>4}",
            )
            for msg in msgs:
                rh.write(f"{msg}\n")

            totalResolvable += resolvable
            totalResolvableU += resolvableU
            totalDangling += dangling
            totalDanglingU += danglingU

        rh.close()

        if verbose >= 0:
            console(f"Refs written to {reportRefFile}")
            msgs = (
                f"\tresolvable: {totalResolvableU:>4} in {totalResolvable:>4}",
                f"\tdangling:   {totalDanglingU:>4} in {totalDangling:>4}",
                f"\tALL:        {totalRefsU:>4} in {totalRefs:>4} ",
            )
            for msg in msgs:
                console(msg)

        totalIds = 0
        totalIdsU = 0
        totalIdsM = 0
        totalIdsRefd = 0
        totalIdsRefdU = 0
        totalIdsUnused = 0

        for file, items in ids.items():
            totalIds += len(items)

            ih.write(f"{file}\n")

            unique = 0
            multiple = 0
            refd = 0
            refdU = 0
            unused = 0

            for item, n in sorted(items.items()):
                nRefs = refdIds.get((file, item), 0)

                if n == 1:
                    unique += 1
                else:
                    multiple += 1

                if nRefs == 0:
                    unused += 1
                else:
                    refd += nRefs
                    refdU += 1

                status1 = f"{n}x"
                plural = "" if nRefs == 1 else "s"
                status2 = f"{nRefs}ref{plural}"

                ih.write(f"\t{status1:<8} {status2:<8} {item}\n")

            msgs = (
                f"\tUnique:     {unique:>4}",
                f"\tNon-unique: {multiple:>4}",
                f"\tUnused:     {unused:>4}",
                f"\tReferenced: {refd:>4} x {refdU:>4}",
            )
            for msg in msgs:
                ih.write(f"{msg}\n")

            totalIdsU += unique
            totalIdsM += multiple
            totalIdsRefdU += refdU
            totalIdsRefd += refd
            totalIdsUnused += unused

        ih.close()

        if verbose >= 0:
            console(f"Ids written to {reportIdFile}")
            msgs = (
                f"\treferenced: {totalIdsRefdU:>4} by {totalIdsRefd:>4}",
                f"\tnon-unique: {totalIdsM:>4}",
                f"\tunused:     {totalIdsUnused:>4}",
                f"\tALL:        {totalIdsU:>4} in {totalIds:>4}",
            )
            for msg in msgs:
                console(msg)

    def writeDoc():
        teiUrl = "https://tei-c.org/release/doc/tei-p5-doc/en/html"
        elUrlPrefix = f"{teiUrl}/ref-"
        attUrlPrefix = f"{teiUrl}/REF-ATTS.html#"
        docFile = f"{docsDir}/elements.md"
        with fileOpen(docFile, mode="w") as fh:
            fh.write(
                dedent(
                    """
                    # Element and attribute inventory

                    Table of contents

                    """
                )
            )
            for label in kindLabels.values():
                labelAnchor = label.replace(" ", "-")
                fh.write(f"*\t[{label}](#{labelAnchor})\n")

            fh.write("\n")

            tableHeader = dedent(
                """
                | element | attribute | value | amount
                | --- | --- | --- | ---
                """
            )

            def writeAttInfo(tag, att, attInfo):
                tagRep = " " if tag == "" else f"[{tag}]({elUrlPrefix}{tag}.html)"
                attRep = " " if att == "" else f"[{att}]({attUrlPrefix}{att})"
                atts = sorted(attInfo.items())
                (val, amount) = atts[0]
                valRep = f"`{val}`" if val else ""
                fh.write(
                    "| "
                    + (
                        " | ".join(
                            str(x)
                            for x in (
                                tagRep,
                                attRep,
                                valRep,
                                amount,
                            )
                        )
                    )
                    + "\n"
                )
                for val, amount in atts[1:]:
                    valRep = f"`{val}`" if val else ""
                    fh.write(f"""| | | {valRep} | {amount}\n""")

            def writeTagInfo(tag, tagInfo):
                tags = sorted(tagInfo.items())
                (att, attInfo) = tags[0]
                writeAttInfo(tag, att, attInfo)
                for att, attInfo in tags[1:]:
                    writeAttInfo("", att, attInfo)

            for kind, label in kindLabels.items():
                fh.write(f"## {label}\n{tableHeader}")
                for tag, tagInfo in sorted(analysis[kind].items()):
                    writeTagInfo(tag, tagInfo)
                fh.write("\n")

    def filterError(msg):
        return msg == (
            "Element 'graphic', attribute 'url': [facet 'pattern'] "
            "The value '' is not accepted by the pattern '\\S+'."
        )

    def doXMLFile(xmlPath):
        tree = etree.parse(xmlPath, parser)
        root = tree.getroot()
        xmlFile = fileNm(xmlPath)
        ids[xmlFile][""] = 1
        analyse(root, analysis, xmlFile)

    xmlFilesByModel = collections.defaultdict(list)

    if sectionModel == "I":
        i = 0
        for xmlFolder, xmlFiles in self.getXML():
            msg = "Start " if verbose >= 0 else "\t"
            console(f"{msg}folder {xmlFolder}:")
            j = 0
            cr = ""
            nl = True

            for xmlFile in xmlFiles:
                i += 1
                j += 1
                if j > PROGRESS_LIMIT:
                    cr = "\r"
                    nl = False
                xmlPath = f"{teiPath}/{xmlFolder}/{xmlFile}"
                (model, adapt, tpl) = self.getSwitches(xmlPath)
                mdRep = model or "TEI"
                tplRep = tpl or ""
                adRep = adapt or ""

                label = f"{mdRep:<12} {tplRep:<12} {adRep:<12}"

                if verbose >= 0:
                    console(f"{cr}{i:>4} {label} {xmlFile:<50}", newline=nl)
                xmlFilesByModel[model].append(xmlPath)
            if verbose >= 0:
                console("")
                console(f"End   folder {xmlFolder}")

    elif sectionModel == "II":
        xmlFile = self.getXML()
        if xmlFile is None:
            console("No XML files found!", error=True)
            return False

        xmlPath = f"{teiPath}/{xmlFile}"
        (model, adapt, tpl) = self.getSwitches(xmlPath)
        xmlFilesByModel[model].append(xmlPath)

    good = True

    for model, xmlPaths in xmlFilesByModel.items():
        if verbose >= 0:
            console(f"{len(xmlPaths)} {model or 'TEI'} file(s) ...")

        thisGood = True

        if validate:
            if verbose >= 0:
                console("\tValidating ...")

            schemaFile = modelInfo.get(model, None)

            if schemaFile is None:
                if verbose >= 0:
                    console(f"\t\tNo schema file for {model}")
                if good is not None and good is not False:
                    good = None
                continue

            (thisGood, info, theseErrors) = A.validate(schemaFile, xmlPaths)

            for line in info:
                if verbose >= 0:
                    console(f"\t\t{line}")

        if not thisGood:
            good = False
            errors.extend(theseErrors)

        if verbose >= 0:
            console("\tMaking inventory ...")
        for xmlPath in xmlPaths:
            doXMLFile(xmlPath)

    if not good:
        self.good = False

    if verbose >= 0:
        console("")
    writeErrors()
    writeReport()
    writeElemTypes()
    writeDoc()
    writeNamespaces()
    writeIdRefs()
    writeLbParents()
def convertTask(self)

Implementation of the "convert" task.

It sets up the tf.convert.walker machinery and runs it.

Returns

boolean
Whether the conversion was successful.
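
The division of labour between a director and a converter can be illustrated with a small, self-contained sketch. It does not use Text-Fabric at all: `ToyCV` is a hypothetical stand-in that merely records the kinds of actions (`slot`, `node`, `feature`, `terminate`) that a director issues, and `toy_director` is an invented director that walks a tiny XML string with the standard library parser. The real `tf.convert.walker.CV` object does far more bookkeeping than this.

```python
import xml.etree.ElementTree as ET


class ToyCV:
    """Hypothetical stand-in for a converter: it only records actions."""

    def __init__(self):
        self.actions = []
        self.counts = {}

    def _new(self, node_type):
        # number the nodes of each type consecutively, starting at 1
        self.counts[node_type] = self.counts.get(node_type, 0) + 1
        return (node_type, self.counts[node_type])

    def slot(self):
        node = self._new("word")
        self.actions.append(("slot", node))
        return node

    def node(self, node_type):
        node = self._new(node_type)
        self.actions.append(("node", node))
        return node

    def feature(self, node, **features):
        self.actions.append(("feature", node, features))

    def terminate(self, node):
        self.actions.append(("terminate", node))


def toy_director(cv, xml_text):
    """Walk the XML and issue one node per element and one slot per word."""

    def walk(elem):
        cur = cv.node(elem.tag)
        for word in (elem.text or "").split():
            cv.feature(cv.slot(), str=word)
        for child in elem:
            walk(child)
            # text after the child still belongs to the current element
            for word in (child.tail or "").split():
                cv.feature(cv.slot(), str=word)
        cv.terminate(cur)

    walk(ET.fromstring(xml_text))


cv = ToyCV()
toy_director(cv, "<p>Hello <hi>big</hi> world</p>")
words = [a[2]["str"] for a in cv.actions if a[0] == "feature"]
print(words)
```

The point of the design is that the director only decides what to emit and when; all storage and consistency checks stay on the converter side.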
def convertTask(self):
    """Implementation of the "convert" task.

    It sets up the `tf.convert.walker` machinery and runs it.

    Returns
    -------
    boolean
        Whether the conversion was successful.
    """
    if not self.importOK():
        return

    if not self.good:
        return

    procins = self.procins
    verbose = self.verbose
    slotType = self.slotType
    generic = self.generic
    otext = self.otext
    featureMeta = self.featureMeta
    intFeatures = self.intFeatures

    makeLineElems = self.makeLineElems
    lineModel = self.lineModel
    if makeLineElems:
        lineProperties = self.lineProperties
        lineType = lineProperties["nodeType"]

    makePageElems = self.makePageElems
    pageModel = self.pageModel

    if makePageElems:
        pageProperties = self.pageProperties
        pageType = pageProperties["nodeType"]
        pbAtTop = pageProperties["pbAtTop"]

    tfPath = self.tfPath
    teiPath = self.teiPath

    if verbose >= 0:
        if verbose == 1:
            console(f"TEI to TF converting: {ux(teiPath)} => {ux(tfPath)}")
        if makeLineElems:
            lbRep = f" with {lineType} nodes for lines between lb elements"
            console(f"Line model {lineModel}{lbRep}")

        if makePageElems:
            wrt = "started" if pbAtTop else "ended"
            pbRep = f" with {pageType} nodes for pages {wrt} by pb elements"
            console(f"Page model {pageModel}{pbRep}")

        console(
            f"Processing instructions are {'treated' if procins else 'ignored'}"
        )

    initTree(tfPath, fresh=True, gentle=True)

    cv = self.getConverter()

    self.good = cv.walk(
        self.getDirector(),
        slotType,
        otext=otext,
        generic=generic,
        intFeatures=intFeatures,
        featureMeta=featureMeta,
        generateTf=True,
    )
def getConverter(self)

Initializes a converter.

Returns

object
The CV converter object, initialized.
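
The converter is created with a silence level derived from the `verbose` setting. The sketch below reproduces only that mapping. The string values given to `AUTO`, `TERSE` and `DEEP` are placeholders chosen for illustration (an assumption), and `silent_for` is a hypothetical helper name; in Text-Fabric the constants are imported from elsewhere in the package.

```python
# Placeholder values: assumptions for this sketch only.
AUTO = "auto"
TERSE = "terse"
DEEP = "deep"


def silent_for(verbose):
    """Map a verbosity level to a silence level.

    1 maps to AUTO, 0 maps to TERSE, anything else (e.g. -1) maps to DEEP.
    """
    if verbose == 1:
        return AUTO
    if verbose == 0:
        return TERSE
    return DEEP


for level in (-1, 0, 1):
    print(level, silent_for(level))
```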
def getConverter(self):
    """Initializes a converter.

    Returns
    -------
    object
        The `tf.convert.walker.CV` converter object, initialized.
    """
    verbose = self.verbose
    tfPath = self.tfPath

    silent = AUTO if verbose == 1 else TERSE if verbose == 0 else DEEP
    TF = Fabric(locations=tfPath, silent=silent)
    return CV(TF, silent=silent)
def getDirector(self)

Factory for the director function.

The tf.convert.walker relies on a corpus-dependent director function that walks through the source data and spits out actions that produce the TF dataset.

The director function that walks through the TEI input must be conditioned by the properties defined in the TEI schema and the customised schema, if any, that describes the source.

Also some special additions need to be programmed, such as an extra section level, word boundaries, etc.

We collect all needed data, store it, and define a local director function that has access to this data.

Returns

function
The local director function that has been constructed.
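
As a rough illustration of the factory pattern and of the word-boundary rule applied during the walk (a character belongs to a word if it is alphanumeric, an underscore, or one of the in-word hyphens), here is a minimal standalone sketch. The names `make_splitter` and `split_words` are hypothetical and not part of Text-Fabric; the sketch only pairs each word with the non-word material that follows it, analogous to the `str` and `after` features.

```python
def make_splitter(in_word_hyphens=("\u2010", "-")):
    """Factory: captures the configuration and returns a local function."""
    hyphens = set(in_word_hyphens)

    def is_word_char(ch):
        return ch == "_" or ch.isalnum() or ch in hyphens

    def split_words(text):
        """Return (word, after) pairs; material before the first word is dropped."""
        pairs = []
        word = ""
        after = ""
        for ch in text:
            if is_word_char(ch):
                if after and word:
                    # a new word starts: flush the previous one
                    pairs.append((word, after))
                    word = ""
                    after = ""
                elif after:
                    # non-word material before the first word
                    after = ""
                word += ch
            else:
                after += ch
        if word:
            pairs.append((word, after))
        return pairs

    return split_words


split_words = make_splitter()
print(split_words("well-known facts, indeed."))
```

Because `split_words` is defined inside `make_splitter`, it keeps access to the collected configuration (`hyphens`) without needing global state, which is the same reason the director is built as a local function.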
def getDirector(self):
    """Factory for the director function.

    The `tf.convert.walker` relies on a corpus-dependent `director` function
    that walks through the source data and spits out actions that
    produce the TF dataset.

    The director function that walks through the TEI input must be conditioned
    by the properties defined in the TEI schema and the customised schema, if any,
    that describes the source.

    Also some special additions need to be programmed, such as an extra section
    level, word boundaries, etc.

    We collect all needed data, store it, and define a local director function
    that has access to this data.

    Returns
    -------
    function
        The local director function that has been constructed.
    """
    if not self.importOK():
        return

    if not self.good:
        return

    TEI_HEADER = "teiHeader"

    TEXT_ANCESTOR = "text"
    TEXT_ANCESTORS = set(
        """
        front
        body
        back
        group
        """.strip().split()
    )
    CHUNK_PARENTS = TEXT_ANCESTORS | {TEI_HEADER}

    CHUNK_ELEMS = set(
        """
        facsimile
        fsdDecl
        sourceDoc
        standOff
        """.strip().split()
    )

    PASS_THROUGH = set(
        """
        TEI
        """.strip().split()
    )

    # CHECKING

    HY = "\u2010"  # hyphen

    IN_WORD_HYPHENS = {HY, "-"}

    procins = self.procins
    verbose = self.verbose
    teiPath = self.teiPath
    wordAsSlot = self.wordAsSlot
    tokenAsSlot = self.tokenAsSlot
    parentEdges = self.parentEdges
    siblingEdges = self.siblingEdges
    featureMeta = self.featureMeta
    intFeatures = self.intFeatures
    transform = getattr(self, "transformCustom", None)
    chunkLevel = self.chunkLevel
    modelInv = self.modelInv
    modelInfo = self.modelInfo
    modelXsd = self.modelXsd
    A = self.A
    etree = self.etree

    transformFunc = (
        (lambda x: BytesIO(x.encode("utf-8")))
        if transform is None
        else lambda x: BytesIO(transform(x).encode("utf-8"))
    )

    parser = self.getParser()

    baseSchema = modelXsd[None]
    overrides = [
        override for (model, override) in modelXsd.items() if model is not None
    ]
    A.getElementInfo(baseSchema, overrides, verbose=-1)

    refs = collections.defaultdict(lambda: collections.defaultdict(set))
    ids = collections.defaultdict(dict)

    # WALKERS

    WHITE_TRIM_RE = re.compile(r"\s+", re.S)
    NON_NAME_RE = re.compile(r"[^a-zA-Z0-9_ ]+", re.S)

    NOTE_LIKE = set(
        """
        note
        """.strip().split()
    )
    EMPTY_ELEMENTS = set(
        """
        addSpan
        alt
        anchor
        anyElement
        attRef
        binary
        caesura
        catRef
        cb
        citeData
        classRef
        conversion
        damageSpan
        dataFacet
        default
        delSpan
        elementRef
        empty
        equiv
        fsdLink
        gb
        handShift
        iff
        lacunaEnd
        lacunaStart
        lb
        link
        localProp
        macroRef
        milestone
        move
        numeric
        param
        path
        pause
        pb
        ptr
        redo
        refState
        specDesc
        specGrpRef
        symbol
        textNode
        then
        undo
        unicodeProp
        unihanProp
        variantEncoding
        when
        witEnd
        witStart
        """.strip().split()
    )
    NEWLINE_ELEMENTS = set(
        """
        ab
        addrLine
        cb
        l
        lb
        lg
        list
        p
        pb
        seg
        table
        u
        """.strip().split()
    )
    CONTINUOUS_ELEMENTS = set(
        """
        choice
        """.strip().split()
    )

    def makeNameLike(x):
        return NON_NAME_RE.sub("_", x).strip("_")

    def walkNode(cv, cur, xnode):
        """Internal function to deal with a single element.

        Will be called recursively.

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.

            The subdictionary `cur["node"]` is used to store the currently generated
            nodes by node type.
        xnode: object
            An LXML element node.
        """
        if procins and isinstance(xnode, etree._ProcessingInstruction):
            target = xnode.target
            tag = f"?{target}"
        else:
            tag = etree.QName(xnode.tag).localname

        atts = {etree.QName(k).localname: v for (k, v) in xnode.attrib.items()}

        beforeTag(cv, cur, xnode, tag, atts)

        cur[XNEST].append((tag, atts))

        curNode = beforeChildren(cv, cur, xnode, tag, atts)

        if curNode is not None:
            if parentEdges:
                if len(cur[TNEST]):
                    parentNode = cur[TNEST][-1]
                    cv.edge(curNode, parentNode, parent=None)

            cur[TNEST].append(curNode)

            if siblingEdges:
                if len(cur[TSIB]):
                    siblings = cur[TSIB][-1]

                    nSiblings = len(siblings)
                    for i, sib in enumerate(siblings):
                        cv.edge(sib, curNode, sibling=nSiblings - i)
                    siblings.append(curNode)

                cur[TSIB].append([])

        for child in xnode.iterchildren(
            tag=(etree.Element, etree.ProcessingInstruction)
            if procins
            else etree.Element
        ):
            walkNode(cv, cur, child)

        afterChildren(cv, cur, xnode, tag, atts)

        if curNode is not None:
            xmlFile = cur["xmlFile"]

            for refAtt, targetFile, targetId in getRefs(tag, atts, xmlFile):
                refs[refAtt][(targetFile, targetId)].add(curNode)

            idVal = atts.get("id", None)
            if idVal is not None:
                ids[xmlFile][idVal] = curNode

            if len(cur[TNEST]):
                cur[TNEST].pop()
            if siblingEdges:
                if len(cur[TSIB]):
                    cur[TSIB].pop()

        cur[XNEST].pop()
        afterTag(cv, cur, xnode, tag, atts)

    def isChapter(cur):
        """Whether the current element counts as a chapter node.

        ## Model I

        Not relevant: there are no chapter nodes inside an XML file.

        ## Model II

        Chapters are the highest section level (the only lower level is chunks).

        Chapters come in three kinds:

        *   the TEI header;
        *   the immediate children of `<text>`
            except `<front>`, `<body>`, `<back>`, `<group>`;
        *   the immediate children of
            `<front>`, `<body>`, `<back>`, `<group>`.

        Parameters
        ----------
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.

        Returns
        -------
        boolean
        """
        sectionModel = self.sectionModel

        if sectionModel == "II":
            nest = cur[XNEST]
            nNest = len(nest)

            if nNest > 0 and nest[-1][0] in EMPTY_ELEMENTS:
                return False

            outcome = nNest > 0 and (
                nest[-1][0] == TEI_HEADER
                or (
                    nNest > 1
                    and (
                        nest[-2][0] in TEXT_ANCESTORS
                        or nest[-2][0] == TEXT_ANCESTOR
                        and nest[-1][0] not in TEXT_ANCESTORS
                    )
                )
            )
            if outcome:
                cur["chapterElems"].add(nest[-1][0])

            return outcome

        return False

    def isChunk(cur):
        """Whether the current element counts as a chunk node.

        It depends on the section model, but also on the template.

        Note that we can only have distinct templates if we deal with
        multiple files, that is, only when we are in section model I.

        ## Model I

        Chunks are the lowest section level (the higher levels are folders
        and then files).

        The default is that chunks are the immediate children of the
        `<teiHeader>` and the `<body>`
        elements; a few other elements also count as chunks.

        However, if `drillDownDivs` is True and if the chunk appears to be
        a `<div>` element, we drill further down, until we arrive at a
        non-`<div>` element.

        But in specific templates we have different rules:

        ### `biolist`

        The same rules as for `bibliolist` below, with `<listPerson>`
        in the role of `<listBibl>`.

        ### `bibliolist`

        *   The TEI Header is a chunk, and nothing inside the TEI header is a chunk;
        *   Everything at level 5, except `<listBibl>`, is a chunk;
        *   The children of `<listBibl>` are chunks (the `<bibl>` elements
            and a few others), provided they are at level 6.

        ### `artworklist`

        *   The TEI Header is a chunk, and nothing inside the TEI header is a chunk;
        *   Everything at level 5 is a chunk.

        ## Model II

        Chunks are the lowest section level (the only higher level is chapters).

        Chunks are the immediate children of the chapters, and they come in two
        kinds: the ones that are `<p>` elements, and the rest.

        Deviation from this rule:

        *   If a chapter is a mixed content node, then it is also a chunk,
            and its subelements are not chunks.

        Parameters
        ----------
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.

        Returns
        -------
        boolean
        """
        sectionModel = self.sectionModel

        nest = cur[XNEST]
        nNest = len(nest)
        model = cur["model"]

        if nNest == 0:
            return False

        thisTag = nest[-1][0]

        if sectionModel == "II":
            if nNest == 1:
                outcome = False
            else:
                parentTag = nest[-2][0]
                meChptChnk = (
                    isChapter(cur) and thisTag not in cur["pureElems"][model]
                )

                if meChptChnk:
                    outcome = True
                elif parentTag == TEI_HEADER:
                    outcome = True
                elif nNest <= 2:
                    outcome = False
                elif parentTag not in cur["pureElems"][model]:
                    outcome = False
                else:
                    grandParentTag = nest[-3][0]
                    outcome = (
                        grandParentTag in TEXT_ANCESTORS
                        and thisTag not in EMPTY_ELEMENTS
                    ) or (
                        grandParentTag == TEXT_ANCESTOR
                        and parentTag not in TEXT_ANCESTORS
                    )

        elif sectionModel == "I":
            template = cur["template"]

            if template == "biolist":
                if thisTag == TEI_HEADER:
                    outcome = True
                elif any(n[0] == TEI_HEADER for n in nest[0:-1]):
                    outcome = False
                elif nNest not in {5, 6}:
                    outcome = False
                else:
                    parentTag = nest[-2][0]
                    if nNest == 5:
                        outcome = thisTag != "listPerson"
                    else:
                        outcome = parentTag == "listPerson"

            elif template == "bibliolist":
                if thisTag == TEI_HEADER:
                    outcome = True
                elif any(n[0] == TEI_HEADER for n in nest[0:-1]):
                    outcome = False
                elif nNest not in {5, 6}:
                    outcome = False
                else:
                    parentTag = nest[-2][0]
                    if nNest == 5:
                        outcome = thisTag != "listBibl"
                    else:
                        outcome = parentTag == "listBibl"

            elif template == "artworklist":
                if thisTag == TEI_HEADER:
                    outcome = True
                elif any(n[0] == TEI_HEADER for n in nest[0:-1]):
                    outcome = False
                else:
                    outcome = nNest == 5

            else:
                if thisTag in CHUNK_ELEMS:
                    outcome = True
                elif nNest == 1:
                    outcome = False
                else:
                    sectionProperties = self.sectionProperties
                    drillDownDivs = sectionProperties["drillDownDivs"]

                    parentTag = nest[-2][0]
                    if drillDownDivs:
                        if thisTag == "div":
                            outcome = False
                        else:
                            dParentTag = None
                            for ancestor in reversed(nest[0:-1]):
                                if ancestor[0] != "div":
                                    dParentTag = ancestor[0]
                                    break
                            outcome = (
                                dParentTag in CHUNK_PARENTS
                                and thisTag not in EMPTY_ELEMENTS
                            ) or (
                                dParentTag == TEXT_ANCESTOR
                                and thisTag not in TEXT_ANCESTORS
                            )
                    else:
                        outcome = (
                            parentTag in CHUNK_PARENTS
                            and thisTag not in EMPTY_ELEMENTS
                        ) or (
                            parentTag == TEXT_ANCESTOR
                            and thisTag not in TEXT_ANCESTORS
                        )

        if outcome:
            cur["chunkElems"].add(nest[-1][0])

        return outcome

    def isPure(cur):
        """Whether the current tag has pure content.

        Parameters
        ----------
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.

        Returns
        -------
        boolean
        """
        nest = cur[XNEST]
        model = cur["model"]
        return len(nest) == 0 or nest[-1][0] in cur["pureElems"][model]

    def isEndInPure(cur):
        """Whether the current end tag occurs in an element with pure content.

        If that is the case, then it is very likely that the end tag also
        marks the end of the current word.

        And we should not strip spaces after it.

        Parameters
        ----------
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.

        Returns
        -------
        boolean
        """
        nest = cur[XNEST]
        model = cur["model"]
        return len(nest) > 1 and nest[-2][0] in cur["pureElems"][model]

    def hasMixedAncestor(cur):
        """Whether the current tag has an ancestor with mixed content.

        We use this in case a tag ends in an element with pure content.
        We should then add white-space to separate it from the next
        element of its parent.

        If the whole stack of elements has pure content, we add
        a newline, because then we are probably in the TEI header,
        and things are most clear if they are on separate lines.

        But if one of the ancestors has mixed content, we are typically
        in some structured piece of information within running text,
        such as change markup. In this case we want to add merely a space.

        And we should not strip spaces after it.

        Parameters
        ----------
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.

        Returns
        -------
        boolean
        """
        nest = cur[XNEST]
        model = cur["model"]
        return any(n[0] in cur["mixedElems"][model] for n in nest[0:-1])

    def hasContinuousAncestor(cur):
        """Whether an ancestor tag is a continuous pure element.

        A continuous pure element is an element whose child elements do not
        imply word separation, e.g. `<choice>`.

        Parameters
        ----------
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.

        Returns
        -------
        boolean
        """
        nest = cur[XNEST]
        return any(n[0] in CONTINUOUS_ELEMENTS for n in nest[0:-1])

    def startWord(cv, cur, ch):
        """Start a word node if necessary.

        Whenever we encounter a character, we determine
        whether it starts or ends a word, and if it starts
        one, this function takes care of the necessary actions.

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.
        ch: string
            A single character, the next character in the result data.
        """
        curWord = cur[NODE][WORD]

        if not curWord:
            prevWord = cur["prevWord"]
            if prevWord is not None:
                cv.feature(prevWord, after=cur["afterStr"])
            if ch is not None:
                if wordAsSlot:
                    curWord = cv.slot()
                else:
                    curWord = cv.node(WORD)
                cur[NODE][WORD] = curWord
                addSlotFeatures(cv, cur, curWord)

        if ch is not None:
            cur["wordStr"] += ch

    def finishWord(cv, cur, ch, spaceChar):
        """Terminate a word node if necessary.

        Whenever we encounter a character, we determine
        whether it starts or ends a word, and if it ends
        one, this function takes care of the necessary actions.

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.
        ch: string
            A single character, the next slot in the result data.
        spaceChar: string | void
            If None, no extra space or newline will be added.
            Otherwise, the `spaceChar` (a single space or newline) will be added.
        """
        curWord = cur[NODE][WORD]
        if curWord:
            cv.feature(curWord, str=cur["wordStr"])
            if not wordAsSlot:
                cv.terminate(curWord)
            cur[NODE][WORD] = None
            cur["wordStr"] = ""
            cur["prevWord"] = curWord
            cur["afterStr"] = ""

        if ch is not None:
            cur["afterStr"] += ch
        if spaceChar is not None:
            cur["afterStr"] = cur["afterStr"].rstrip() + spaceChar
            if not wordAsSlot:
                addSpace(cv, cur, spaceChar)
            cur["afterSpace"] = True
        else:
            cur["afterSpace"] = False

    def addSlotFeatures(cv, cur, s):
        """Add generic features to a slot.

        If we are inside the TEI header, the slot gets the feature `is_meta`;
        if we are inside a note, it gets the feature `is_note`.
        For every rendition value that is currently active, the slot gets
        a corresponding `rend_` feature.

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.
        s: slot
            A previously added (slot) node
        """
        if cur["inHeader"]:
            cv.feature(s, is_meta=1)
        if cur["inNote"]:
            cv.feature(s, is_note=1)
        for r, stack in cur.get("rend", {}).items():
            if len(stack) > 0:
                cv.feature(s, **{f"rend_{r}": 1})

    def addTokens(cv, cur, text):
        """Adds text as a series of tokens.

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.
        text: string
            The text to be added.

        Only meant for the case where slots are tokens.
        """
        (beforew, material, afterw) = getWhites(text)

        if beforew:
            makeSpace(cv, cur)

        s = None

        for tx, after in tokenize(material):
            s = cv.slot()
            cv.feature(s, str=tx, after=after)
            addSlotFeatures(cv, cur, s)

        if afterw:
            if s is None:
                makeSpace(cv, cur)
            else:
                cv.feature(s, after=" ")

    def addSlot(cv, cur, ch):
        """Add a slot.

        Whenever we encounter a character, we add it as a new slot, unless
        `wordAsSlot` is in force. In that case we suppress the triggering of a
        slot node.
        If needed, we start / terminate word nodes as well.

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.
        ch: string
            A single character, the next slot in the result data.
        """
        if ch in {"_", None} or ch.isalnum() or ch in IN_WORD_HYPHENS:
            startWord(cv, cur, ch)
        else:
            finishWord(cv, cur, ch, None)

        if wordAsSlot:
            s = cur[NODE][WORD]
        elif ch is None:
            s = None
        else:
            s = cv.slot()
            cv.feature(s, ch=ch)
        if s is not None:
            addSlotFeatures(cv, cur, s)

    def addEmpty(cv, cur):
        """Add an empty slot.

        We also terminate the current word.
        If words are slots, the empty slot is a word on its own.

        Returns
        -------
        node
            The empty slot
        """
        if tokenAsSlot:
            emptyNode = cv.slot()
            cv.feature(emptyNode, str=ZWSP, after="", empty=1)
        else:
            finishWord(cv, cur, None, None)
            startWord(cv, cur, ZWSP)
            emptyNode = cur[NODE][WORD]
            cv.feature(emptyNode, empty=1)

            if not wordAsSlot:
                emptyNode = cv.slot()
                cv.feature(emptyNode, ch=ZWSP, empty=1)

            finishWord(cv, cur, None, None)

        return emptyNode

    def addSpace(cv, cur, spaceChar):
        """Adds a space or a new line.

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.
        spaceChar: string
            The character to add (supposed to be either a space or a newline).

        Only meant for the case where slots are characters or tokens.

        Suppressed when not in a lowest-level section.
        """
        if chunkLevel in cv.activeTypes():
            s = cv.slot()
            if tokenAsSlot:
                cv.feature(s, str="", after=spaceChar, extraspace=1)
            else:
                cv.feature(s, ch=spaceChar, extraspace=1)
            addSlotFeatures(cv, cur, s)

    def makeSpace(cv, cur):
        """Adds a space.

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.

        Only meant for the case where slots are tokens.
        """
        s = cv.slot()
        cv.feature(s, str="", after=" ", extraspace=1)
        addSlotFeatures(cv, cur, s)

    def endLine(cv, cur):
        """Ends a line node.

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.
        """
        lineProperties = self.lineProperties
        lineType = lineProperties["nodeType"]

        slots = cv.linked(cur[NODE][lineType])
        empty = len(slots) == 0

        if empty:
            lastSlot = addEmpty(cv, cur)
            if cur["inNote"]:
                cv.feature(lastSlot, is_note=1)
        else:
            lastSlot = (T, slots[-1])

        if not wordAsSlot:
            after = cv.get("after", lastSlot)
            if after is not None and "\n" not in after:
                cv.feature(lastSlot, after=f"{after.rstrip()}\n")
        cv.terminate(cur[NODE][lineType])

    def endPage(cv, cur):
        """Ends a page node.

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.
        """
        pageProperties = self.pageProperties
        pageType = pageProperties["nodeType"]

        slots = cv.linked(cur[NODE][pageType])
        empty = len(slots) == 0

        if empty:
            lastSlot = addEmpty(cv, cur)
            if cur["inNote"]:
                cv.feature(lastSlot, is_note=1)
        cv.terminate(cur[NODE][pageType])

    def beforeTag(cv, cur, xnode, tag, atts):
        """Actions before dealing with the element's tag.

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.
        xnode: object
            An LXML element node.
        tag: string
            The tag of the LXML node.
        atts: dict
            The attributes of the LXML node, with namespaces stripped.
        """
        beforeTagCustom = getattr(self, "beforeTagCustom", None)
        if beforeTagCustom is not None:
            beforeTagCustom(cv, cur, xnode, tag, atts)

    def beforeChildren(cv, cur, xnode, tag, atts):
        """Actions before dealing with the element's children.

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.
        xnode: object
            An LXML element node.
        tag: string
            The tag of the LXML node.
        atts: dict
            The attributes of the LXML node, with namespaces stripped.
        """
        makeLineElems = self.makeLineElems

        if makeLineElems:
            lineProperties = self.lineProperties
            lineElem = lineProperties["element"]
            lineType = lineProperties["nodeType"]
            isLineContainer = tag == lineElem
            inLine = cur["inLine"]

            if isLineContainer:
                cur["inLine"] = True

                # the line starts with the container
                cur[NODE][lineType] = cv.node(lineType)

        makePageElems = self.makePageElems

        if makePageElems:
            pageProperties = self.pageProperties
            pageType = pageProperties["nodeType"]
            isPageContainer = matchModel(pageProperties, tag, atts)
            inPage = cur["inPage"]

            pbAtTop = pageProperties["pbAtTop"]

            if isPageContainer:
                cur["inPage"] = True

                if pbAtTop:
                    # material before the first pb in the container is not in a page
                    pass
                else:
                    # the page starts with the container
                    cur[NODE][pageType] = cv.node(pageType)

        sectionModel = self.sectionModel
        sectionProperties = self.sectionProperties

        if sectionModel == "II":
            chapterSection = self.chapterSection
            chunkSection = self.chunkSection

            if isChapter(cur):
                cur["chapterNum"] += 1
                cur["prevChapter"] = cur[NODE].get(chapterSection, None)
                cur[NODE][chapterSection] = cv.node(chapterSection)
                cv.link(cur[NODE][chapterSection], cur["danglingSlots"])

                value = {chapterSection: f"{cur['chapterNum']} {tag}"}
                cv.feature(cur[NODE][chapterSection], **value)
                cur["chunkPNum"] = 0
                cur["chunkONum"] = 0
                cur["prevChunk"] = cur[NODE].get(chunkSection, None)
                cur[NODE][chunkSection] = cv.node(chunkSection)
                cv.link(cur[NODE][chunkSection], cur["danglingSlots"])
                cur["danglingSlots"] = set()
                cur["infirstChunk"] = True

            # N.B. A node can count both as chapter and as chunk,
            # e.g. a <trailer> sibling of the chapter <div>s
            # A trailer has mixed content, so its subelements aren't typical chunks.
            if isChunk(cur):
                if cur["infirstChunk"]:
                    cur["infirstChunk"] = False
                else:
                    cur[NODE][chunkSection] = cv.node(chunkSection)
                    cv.link(cur[NODE][chunkSection], cur["danglingSlots"])
                    cur["danglingSlots"] = set()
                if tag == "p":
                    cur["chunkPNum"] += 1
                    cn = cur["chunkPNum"]
                else:
                    cur["chunkONum"] -= 1
                    cn = cur["chunkONum"]
                value = {chunkSection: cn}
                cv.feature(cur[NODE][chunkSection], **value)

            if matchModel(sectionProperties, tag, atts):
                heading = etree.tostring(
                    xnode, encoding="unicode", method="text", with_tail=False
                ).replace("\n", " ")
                value = {chapterSection: heading}
                cv.feature(cur[NODE][chapterSection], **value)
                chapterNum = cur["chapterNum"]
                if verbose >= 0:
                    console(
                        f"\rchapter {chapterNum:>4} {heading:<50}", newline=False
                    )
        else:
            chunkSection = self.chunkSection

            if isChunk(cur):
                cur["chunkNum"] += 1
                cur["prevChunk"] = cur[NODE].get(chunkSection, None)
                cur[NODE][chunkSection] = cv.node(chunkSection)
                cv.link(cur[NODE][chunkSection], cur["danglingSlots"])
                cur["danglingSlots"] = set()
                value = {chunkSection: cur["chunkNum"]}
                cv.feature(cur[NODE][chunkSection], **value)

        if tag == TEI_HEADER:
            cur["inHeader"] = True
            if sectionModel == "II":
                value = {chapterSection: "TEI header"}
                cv.feature(cur[NODE][chapterSection], **value)
        if tag in NOTE_LIKE:
            cur["inNote"] = True
            if not tokenAsSlot:
                finishWord(cv, cur, None, None)

        curNode = None

        if makeLineElems:
            if inLine and tag == "lb":
                if cur[NODE][lineType] is not None:
                    if cur["lineAtts"] is not None and len(cur["lineAtts"]):
                        cv.feature(cur[NODE][lineType], **cur["lineAtts"])
                    endLine(cv, cur)
                cur[NODE][lineType] = cv.node(lineType)
                cur["lineAtts"] = atts

        if makePageElems:
            if inPage and tag == "pb":
                if pbAtTop:
                    if cur[NODE][pageType] is not None:
                        endPage(cv, cur)
                    cur[NODE][pageType] = cv.node(pageType)
                    if len(atts):
                        cv.feature(cur[NODE][pageType], **atts)
                else:
                    if cur[NODE][pageType] is not None:
                        if cur["pageAtts"] is not None and len(cur["pageAtts"]):
                            cv.feature(cur[NODE][pageType], **cur["pageAtts"])
                        endPage(cv, cur)
                    cur[NODE][pageType] = cv.node(pageType)
                    cur["pageAtts"] = atts

        isBoundaryElem = (
            makeLineElems and tag == "lb" or makePageElems and tag == "pb"
        )

        if tag not in PASS_THROUGH and not isBoundaryElem:
            cur["afterSpace"] = False
            cur[NODE][tag] = cv.node(tag)
            curNode = cur[NODE][tag]
            if wordAsSlot:
                if cur[NODE][WORD]:
                    cv.link(curNode, [cur[NODE][WORD][1]])
            if len(atts):
                cv.feature(curNode, **atts)
                if "rend" in atts:
                    rValue = atts["rend"]
                    r = makeNameLike(rValue)
                    if r:
                        for q in r.split():
                            cur.setdefault("rend", {}).setdefault(q, []).append(
                                True
                            )

        beforeChildrenCustom = getattr(self, "beforeChildrenCustom", None)
        if beforeChildrenCustom is not None:
            beforeChildrenCustom(cv, cur, xnode, tag, atts)

        if not hasattr(xnode, "target") and xnode.text:
            textMaterial = WHITE_TRIM_RE.sub(" ", xnode.text)
            if isPure(cur):
                if textMaterial and textMaterial != " ":
                    console(
                        (
                            "WARNING: Text material at the start of "
                            f"pure-content element <{tag}>"
                        ),
                        error=True,
                    )
                    stack = "-".join(n[0] for n in cur[XNEST])
                    console(f"\tElement stack: {stack}", error=True)
                    console(f"\tMaterial: `{textMaterial}`", error=True)
            else:
                if tokenAsSlot:
                    addTokens(cv, cur, textMaterial)
                else:
                    for ch in textMaterial:
                        addSlot(cv, cur, ch)

        return curNode

    def afterChildren(cv, cur, xnode, tag, atts):
        """Node actions after dealing with the children, but before the end tag.

        Here we make sure that the last slot of each newline element
        gets a newline at the end of its `after` feature.

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.
        xnode: object
            An LXML element node.
        tag: string
            The tag of the LXML node.
        atts: dict
            The attributes of the LXML node, with namespaces stripped.
        """
        chunkSection = self.chunkSection
        makeLineElems = self.makeLineElems

        if makeLineElems:
            lineProperties = self.lineProperties
            lineType = lineProperties["nodeType"]
            lineElem = lineProperties["element"]

        makePageElems = self.makePageElems

        if makePageElems:
            pageProperties = self.pageProperties
            pageType = pageProperties["nodeType"]

        sectionModel = self.sectionModel

        if sectionModel == "II":
            chapterSection = self.chapterSection

        extraInstructions = self.extraInstructions

        if len(extraInstructions):
            lookupSource(cv, cur, tokenAsSlot, extraInstructions)

        isChap = isChapter(cur)
        isChnk = isChunk(cur)

        afterChildrenCustom = getattr(self, "afterChildrenCustom", None)
        if afterChildrenCustom is not None:
            afterChildrenCustom(cv, cur, xnode, tag, atts)

        if makeLineElems:
            isLineContainer = tag == lineElem
            inLine = cur["inLine"]

        if makePageElems:
            isPageContainer = matchModel(pageProperties, tag, atts)
            inPage = cur["inPage"]

        hasFinishedWord = False

        if makeLineElems and inLine and tag == "lb":
            pass

        if makePageElems and inPage and tag == "pb":
            pass

        isBoundaryElem = (
            makeLineElems and tag == "lb" or makePageElems and tag == "pb"
        )

        if makeLineElems and isLineContainer:
            # the line ends with the container
            if cur[NODE][lineType] is not None:
                endLine(cv, cur)
            cur["inLine"] = False

        if makePageElems and isPageContainer:
            pbAtTop = pageProperties["pbAtTop"]
            if pbAtTop:
                # the page ends with the container
                if cur[NODE][pageType] is not None:
                    endPage(cv, cur)
            else:
                # material after the last pb is not in a page
                if cur[NODE][pageType] is not None:
                    cv.delete(cur[NODE][pageType])
            cur["inPage"] = False

        if tag not in PASS_THROUGH and not isBoundaryElem:
            curNode = cur[TNEST][-1]
            slots = cv.linked(curNode)
            empty = len(slots) == 0

            newLineTag = tag in NEWLINE_ELEMENTS

            if (
                newLineTag
                or isEndInPure(cur)
                and not hasContinuousAncestor(cur)
                and not cur["afterSpace"]
            ) and not empty:
                spaceChar = "\n" if newLineTag or not hasMixedAncestor(cur) else " "
                if tokenAsSlot:
                    cv.feature((T, slots[-1]), after=spaceChar)
                else:
                    finishWord(cv, cur, None, spaceChar)
                    hasFinishedWord = True

            slots = cv.linked(curNode)
            empty = len(slots) == 0

            if empty:
                lastSlot = addEmpty(cv, cur)
                if cur["inHeader"]:
                    cv.feature(lastSlot, is_meta=1)
                if cur["inNote"]:
                    cv.feature(lastSlot, is_note=1)
                # take care that this empty slot falls under all sections
                # for folders and files this is already guaranteed
                # We need only to watch out for chapters and chunks
                if cur[NODE].get(chunkSection, None) is None:
                    prevChunk = cur.get("prevChunk", None)
                    if prevChunk is None:
                        cur["danglingSlots"].add(lastSlot[1])
                    else:
                        cv.link(prevChunk, lastSlot)
                if sectionModel == "II":
                    if cur[NODE].get(chapterSection, None) is None:
                        prevChapter = cur.get("prevChapter", None)
                        if prevChapter is None:
                            cur["danglingSlots"].add(lastSlot[1])
                        else:
                            cv.link(prevChapter, lastSlot)

            cv.terminate(curNode)

        if isChnk:
            if tokenAsSlot:
                addSpace(cv, cur, "\n")
            else:
                if not hasFinishedWord:
                    finishWord(cv, cur, None, "\n")
            cv.terminate(cur[NODE][chunkSection])

        if sectionModel == "II":
            if isChap:
                if tokenAsSlot:
                    addSpace(cv, cur, "\n")
                else:
                    if not hasFinishedWord:
                        finishWord(cv, cur, None, "\n")
                cv.terminate(cur[NODE][chapterSection])

    def afterTag(cv, cur, xnode, tag, atts):
        """Node actions after dealing with the children and after the end tag.

        This is the place where we process the `tail` of an LXML node: the
        text material after the element and before the next open/close
        tag of any element.

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        cur: dict
            Various pieces of data collected during walking
            and relevant for some next steps in the walk.
        xnode: object
            An LXML element node.
        tag: string
            The tag of the LXML node.
        atts: dict
            The attributes of the LXML node, with namespaces stripped.
        """
        if tag == TEI_HEADER:
            cur["inHeader"] = False
        elif tag in NOTE_LIKE:
            cur["inNote"] = False

        if tag not in PASS_THROUGH:
            if "rend" in atts:
                rValue = atts["rend"]
                r = makeNameLike(rValue)
                if r:
                    for q in r.split():
                        cur["rend"][q].pop()

        if xnode.tail:
            if tag == "lb" and self.makeLineElems:
                tail = xnode.tail.lstrip()
                if not wordAsSlot:
                    pass
            else:
                tail = xnode.tail

            tailMaterial = WHITE_TRIM_RE.sub(" ", tail)
            if isPure(cur):
                if tailMaterial and tailMaterial != " ":
                    elem = cur[XNEST][-1][0]
                    console(
                        (
                            "WARNING: Text material after "
                            f"<{tag}> in pure-content element <{elem}>"
                        ),
                        error=True,
                    )
                    stack = "-".join(n[0] for n in cur[XNEST])
                    console(f"\tElement stack: {stack}-{tag}", error=True)
                    console(f"\tMaterial: `{tailMaterial}`", error=True)
            else:
                if tokenAsSlot:
                    addTokens(cv, cur, tailMaterial)
                else:
                    for ch in tailMaterial:
                        addSlot(cv, cur, ch)

        afterTagCustom = getattr(self, "afterTagCustom", None)
        if afterTagCustom is not None:
            afterTagCustom(cv, cur, xnode, tag, atts)

    def director(cv):
        """Director function.

        Here we program a walk through the TEI sources.
        At every step of the walk we fire some actions that build TF nodes
        and assign features for them.

        Because everything is rather dynamic, we generate fairly standard
        metadata for the features, namely a link to the
        [TEI website](https://tei-c.org).

        Parameters
        ----------
        cv: object
            The converter object, needed to issue actions.
        """
        makeLineElems = self.makeLineElems

        if makeLineElems:
            lineProperties = self.lineProperties
            lineType = lineProperties["nodeType"]

        makePageElems = self.makePageElems

        if makePageElems:
            pageProperties = self.pageProperties
            pageType = pageProperties["nodeType"]

        sectionModel = self.sectionModel
        A = self.A
        elementDefs = A.elementDefs

        cur = {}
        cur["pureElems"] = {
            modelInv[schemaOverride]: {
                x for (x, (typ, mixed)) in eDefs.items() if not mixed
            }
            for (schemaOverride, eDefs) in elementDefs.items()
        }
        cur["mixedElems"] = {
            modelInv[schemaOverride]: {
                x for (x, (typ, mixed)) in eDefs.items() if mixed
            }
            for (schemaOverride, eDefs) in elementDefs.items()
        }
        cur[NODE] = {}

        if sectionModel == "I":
            folderSection = self.folderSection
            fileSection = self.fileSection

            i = 0
            for xmlFolder, xmlFiles in self.getXML():
                msg = "Start " if verbose >= 0 else "\t"
                console(f"{msg}folder {xmlFolder}:")

                cur[NODE][folderSection] = cv.node(folderSection)
                value = {folderSection: xmlFolder}
                cv.feature(cur[NODE][folderSection], **value)

                j = 0
                cr = ""
                nl = True

                for xmlFile in xmlFiles:
                    i += 1
                    j += 1
                    if j > PROGRESS_LIMIT:
                        cr = "\r"
                        nl = False

                    cur["xmlFile"] = xmlFile
                    xmlPath = f"{teiPath}/{xmlFolder}/{xmlFile}"
                    (model, adapt, tpl) = self.getSwitches(xmlPath)
                    cur["model"] = model
                    cur["template"] = tpl
                    cur["adaptation"] = adapt
                    modelRep = model or "TEI"
                    tplRep = tpl or ""
                    adRep = adapt or ""
                    label = f"{modelRep:<12} {adRep:<12} {tplRep:<12}"
                    if verbose >= 0:
                        console(
                            f"{cr}{i:>4} {label} {xmlFile:<50}",
                            newline=nl,
                        )

                    cur[NODE][fileSection] = cv.node(fileSection)
                    ids[xmlFile][""] = cur[NODE][fileSection]
                    value = {fileSection: xmlFile.removesuffix(".xml")}
                    cv.feature(cur[NODE][fileSection], **value)
                    if tpl:
                        cur[NODE][tpl] = cv.node(tpl)
                        cv.feature(cur[NODE][tpl], **value)

                    with fileOpen(xmlPath) as fh:
                        text = fh.read()
                        if transformFunc is not None:
                            text = transformFunc(text)
                        tree = etree.parse(text, parser)
                        root = tree.getroot()

                        if makeLineElems:
                            cur[NODE][lineType] = None
                            cur["inLine"] = False
                            cur["lineAtts"] = None

                        if makePageElems:
                            cur[NODE][pageType] = None
                            cur["inPage"] = False
                            cur["pageAtts"] = None

                        if not tokenAsSlot:
                            cur[NODE][WORD] = None
                        cur["inHeader"] = False
                        cur["inNote"] = False
                        cur[XNEST] = []
                        cur[TNEST] = []
                        cur[TSIB] = []
                        cur["chunkNum"] = 0
                        cur["prevChunk"] = None
                        cur["danglingSlots"] = set()
                        cur["prevWord"] = None
                        cur["wordStr"] = ""
                        cur["afterStr"] = ""
                        cur["afterSpace"] = True
                        cur["chunkElems"] = set()
                        walkNode(cv, cur, root)

                    if not tokenAsSlot:
                        addSlot(cv, cur, None)
                    if tpl:
                        cv.terminate(cur[NODE][tpl])
                    cv.terminate(cur[NODE][fileSection])

                if verbose >= 0:
                    console("")
                    console(f"End   folder {xmlFolder}")

                cv.terminate(cur[NODE][folderSection])

        elif sectionModel == "II":
            xmlFile = self.getXML()
            if xmlFile is None:
                console("No XML files found!", error=True)
                return False

            xmlPath = f"{teiPath}/{xmlFile}"
            (cur["model"], cur["adaptation"], cur["template"]) = self.getSwitches(
                xmlPath
            )

            with fileOpen(f"{teiPath}/{xmlFile}") as fh:
                cur["xmlFile"] = xmlFile
                text = fh.read()
                if transformFunc is not None:
                    text = transformFunc(text)
                tree = etree.parse(text, parser)
                root = tree.getroot()

                if makeLineElems:
                    cur[NODE][lineType] = None
                    cur["inLine"] = False
                    cur["lineAtts"] = None

                if makePageElems:
                    cur[NODE][pageType] = None
                    cur["inPage"] = False
                    cur["pageAtts"] = None

                if not tokenAsSlot:
                    cur[NODE][WORD] = None
                cur["inHeader"] = False
                cur["inNote"] = False
                cur[XNEST] = []
                cur[TNEST] = []
                cur[TSIB] = []
                cur["chapterNum"] = 0
                cur["chunkPNum"] = 0
                cur["chunkONum"] = 0
                cur["prevChunk"] = None
                cur["prevChapter"] = None
                cur["danglingSlots"] = set()
                cur["prevWord"] = None
                cur["wordStr"] = ""
                cur["afterStr"] = ""
                cur["afterSpace"] = True
                cur["chunkElems"] = set()
                cur["chapterElems"] = set()
                for child in root.iterchildren(tag=etree.Element):
                    walkNode(cv, cur, child)

            if not tokenAsSlot:
                addSlot(cv, cur, None)

        if verbose >= 0:
            console("")

        if verbose >= 0:
            console("Resolving links into edges ...")

        unresolvedRefs = {}
        unresolved = 0
        unresolvedUnique = 0
        resolved = 0
        resolvedUnique = 0

        for att, attRefs in refs.items():
            feature = f"link_{att}"
            edgeFeat = {feature: None}

            for (targetFile, targetId), sourceNodes in attRefs.items():
                nSourceNodes = len(sourceNodes)
                targetNode = ids[targetFile].get(targetId, None)
                if targetNode is None:
                    unresolvedRefs.setdefault(targetFile, set()).add(targetId)
                    unresolvedUnique += 1
                    unresolved += nSourceNodes
                else:
                    for sourceNode in sourceNodes:
                        cv.edge(sourceNode, targetNode, **edgeFeat)
                    resolvedUnique += 1
                    resolved += nSourceNodes

        if verbose >= 0:
            console(f"\t{resolvedUnique} in {resolved} reference(s) resolved")
            if unresolvedRefs:
                console(
                    f"\t{unresolvedUnique} in {unresolved} reference(s): "
                    "could not be resolved"
                )
                if verbose == 1:
                    for targetFile, targetIds in sorted(unresolvedRefs.items()):
                        examples = " ".join(sorted(targetIds)[0:3])
                        console(f"\t\t{targetFile}: {len(targetIds)} x: {examples}")

        for fName in featureMeta:
            if not cv.occurs(fName):
                cv.meta(fName)
        for fName in cv.features():
            if fName not in featureMeta:
                if fName.startswith("rend_"):
                    r = fName[5:]
                    cv.meta(
                        fName,
                        description=f"whether text is to be rendered as {r}",
                        valueType="int",
                        conversionMethod=CM_LITC,
                        conversionCode=CONVERSION_METHODS[CM_LITC],
                    )
                    intFeatures.add(fName)
                elif fName.startswith("link_"):
                    r = fName[5:]
                    cv.meta(
                        fName,
                        description=(
                            f"links to node identified by xml:id in attribute {r}"
                        ),
                        valueType="str",
                        conversionMethod=CM_LITP,
                        conversionCode=CONVERSION_METHODS[CM_LITP],
                    )
                else:
                    cv.meta(
                        fName,
                        description=f"this is TEI attribute {fName}",
                        valueType="str",
                        conversionMethod=CM_LIT,
                        conversionCode=CONVERSION_METHODS[CM_LIT],
                    )

        levelConstraints = ["note < chunk, p", "salute < opener, closer"]
        if "chapterElems" in cur:
            for elem in cur["chapterElems"]:
                levelConstraints.append(f"{elem} < chapter")
        if "chunkElems" in cur:
            for elem in cur["chunkElems"]:
                levelConstraints.append(f"{elem} < chunk")

        levelConstraints = "; ".join(levelConstraints)

        cv.meta("otext", levelConstraints=levelConstraints)

        if verbose == 1:
            console("source reading done")
        return True

    return director
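The link-resolution step at the end of `director` can be isolated as a small sketch. It assumes the same shapes that the listing above uses for `refs` and `ids`; the function name `resolve_links`, the returned edge triples, and the use of plain integers as nodes are illustrative assumptions, not part of the converter.

```python
def resolve_links(refs, ids):
    """Turn recorded references into edges and count the outcomes.

    refs: {attribute: {(targetFile, targetId): [sourceNode, ...]}}
    ids:  {targetFile: {targetId: targetNode}}
    """
    edges = []
    unresolved_refs = {}
    stats = dict(resolved=0, resolvedUnique=0, unresolved=0, unresolvedUnique=0)

    for att, att_refs in refs.items():
        feature = f"link_{att}"
        for (target_file, target_id), source_nodes in att_refs.items():
            target_node = ids.get(target_file, {}).get(target_id)
            if target_node is None:
                # Remember the missing target once, count every referrer.
                unresolved_refs.setdefault(target_file, set()).add(target_id)
                stats["unresolvedUnique"] += 1
                stats["unresolved"] += len(source_nodes)
            else:
                for source_node in source_nodes:
                    edges.append((source_node, target_node, feature))
                stats["resolvedUnique"] += 1
                stats["resolved"] += len(source_nodes)

    return edges, unresolved_refs, stats


# Plain integers stand in for TF nodes here (an assumption for illustration).
refs = {"target": {("a.xml", "n1"): [10, 11], ("a.xml", "missing"): [12]}}
ids = {"a.xml": {"n1": 5}}
edges, unresolved_refs, stats = resolve_links(refs, ids)
```

The "unique" counters count distinct targets, while the plain counters count referring nodes. That is why the progress message reports them as "N in M reference(s)".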
def getParser(self)

Configure the LXML parser.

See parser options.

Returns

object
A configured LXML parser object.
def getParser(self):
    """Configure the LXML parser.

    See [parser options](https://lxml.de/parsing.html#parser-options).

    Returns
    -------
    object
        A configured LXML parser object.
    """
    if not self.importOK():
        return None

    etree = self.etree
    procins = self.procins

    return etree.XMLParser(
        remove_blank_text=False,
        collect_ids=False,
        remove_comments=True,
        remove_pis=not procins,
        huge_tree=True,
    )
def getSwitches(self, xmlPath)
def getSwitches(self, xmlPath):
    verbose = self.verbose
    models = self.models
    adaptations = self.adaptations
    templates = self.templates
    triggers = self.triggers
    A = self.A

    text = None

    found = {}

    for kind, allOfKind in (
        ("model", models),
        ("adaptation", adaptations),
        ("template", templates),
    ):
        if text is None:
            with fileOpen(xmlPath) as fh:
                text = fh.read()

        found[kind] = None

        if kind == "model":
            result = A.getModel(text)
            if result is None or result == "tei_all":
                result = None
        else:
            result = None
            triggerRe = triggers[kind]
            if triggerRe is not None:
                match = triggerRe.search(text)
                result = match.group(1) if match else None

        if result is not None and result not in allOfKind:
            if verbose >= 0:
                console(f"unavailable {kind} {result} in {ux(xmlPath)}")
            result = None
        found[kind] = result

    return (found["model"], found["adaptation"], found["template"])
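The trigger lookup in `getSwitches` comes down to a regular-expression search followed by a check against the declared values. The sketch below shows that pattern in isolation. The processing-instruction name `example`, the pattern `TEMPLATE_RE`, and the helper `find_switch` are hypothetical; the real trigger expressions come from the converter's configuration and are not shown in this section.

```python
import re

# Hypothetical trigger: matches template="..." inside an <?example ...?>
# processing instruction. The real patterns are defined elsewhere.
TEMPLATE_RE = re.compile(r'<\?example\s[^>]*?template="([^"]*)"')


def find_switch(text, trigger_re, allowed):
    """Return the value captured by trigger_re if it is a declared one."""
    if trigger_re is None:
        return None
    match = trigger_re.search(text)
    result = match.group(1) if match else None
    if result is not None and result not in allowed:
        # An undeclared value is treated as if nothing was specified.
        return None
    return result


sample = '<?example template="letter"?>\n<TEI><text/></TEI>'
chosen = find_switch(sample, TEMPLATE_RE, ["letter", "artifact"])
rejected = find_switch(sample, TEMPLATE_RE, ["artifact"])
```

In the listing above, an undeclared value is additionally reported on the console before it is discarded.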
def getXML(self)

Make an inventory of the TEI source files.

Returns

tuple of tuple | string

If section model I is in force:

The outer tuple has sorted entries corresponding to folders under the TEI input directory. Each such entry consists of the folder name and an inner tuple that contains the file names in that folder, sorted.

If section model II is in force:

It is the name of the single XML file.

def getXML(self):
    """Make an inventory of the TEI source files.

    Returns
    -------
    tuple of tuple | string
        If section model I is in force:

        The outer tuple has sorted entries corresponding to folders under the
        TEI input directory.
        Each such entry consists of the folder name and an inner tuple
        that contains the file names in that folder, sorted.

        If section model II is in force:

        It is the name of the single XML file.
    """
    verbose = self.verbose
    teiPath = self.teiPath
    sectionModel = self.sectionModel
    if verbose == 1:
        console(f"Section model {sectionModel}")

    if sectionModel == "I":
        backMatter = self.backMatter

        IGNORE = "__ignore__"

        xmlFilesRaw = collections.defaultdict(list)

        with scanDir(teiPath) as dh:
            for folder in dh:
                folderName = folder.name
                if folderName == IGNORE:
                    continue
                if not folder.is_dir():
                    continue
                with scanDir(f"{teiPath}/{folderName}") as fh:
                    for file in fh:
                        fileName = file.name
                        if not (
                            fileName.lower().endswith(".xml") and file.is_file()
                        ):
                            continue
                        xmlFilesRaw[folderName].append(fileName)

        xmlFiles = []
        hasBackMatter = False

        for folderName in sorted(xmlFilesRaw, key=versionSort):
            if folderName == backMatter:
                hasBackMatter = True
            else:
                fileNames = xmlFilesRaw[folderName]
                xmlFiles.append((folderName, tuple(sorted(fileNames))))

        if hasBackMatter:
            fileNames = xmlFilesRaw[backMatter]
            xmlFiles.append((backMatter, tuple(sorted(fileNames))))

        xmlFiles = tuple(xmlFiles)

        return xmlFiles

    if sectionModel == "II":
        xmlFile = None
        with scanDir(teiPath) as fh:
            for file in fh:
                fileName = file.name
                if not (fileName.lower().endswith(".xml") and file.is_file()):
                    continue
                xmlFile = fileName
                break
        return xmlFile
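The inventory for section model I can be tried out with a standalone sketch. It uses `os.scandir` and plain `sorted` in place of the `scanDir` and `versionSort` helpers, so folder ordering may differ from the converter's for names that contain numbers. The function name `inventory` and the folder names in the demo are made up for illustration.

```python
import os
import tempfile
from collections import defaultdict


def inventory(tei_path, back_matter=None, ignore="__ignore__"):
    """List XML files per folder, sorted, with the back matter folder last."""
    raw = defaultdict(list)
    with os.scandir(tei_path) as dh:
        for folder in dh:
            if folder.name == ignore or not folder.is_dir():
                continue
            with os.scandir(folder.path) as fh:
                for file in fh:
                    if file.name.lower().endswith(".xml") and file.is_file():
                        raw[folder.name].append(file.name)

    ordered = [name for name in sorted(raw) if name != back_matter]
    if back_matter in raw:
        ordered.append(back_matter)
    return tuple((name, tuple(sorted(raw[name]))) for name in ordered)


# Build a throwaway directory tree and take its inventory.
layout = {"front": ["b.xml", "a.xml"], "back": ["z.xml"], "__ignore__": ["x.xml"]}
with tempfile.TemporaryDirectory() as tei_path:
    for folder, files in layout.items():
        os.makedirs(os.path.join(tei_path, folder))
        for name in files:
            open(os.path.join(tei_path, folder, name), "w").close()
    result = inventory(tei_path, back_matter="back")
```

Folders without any XML file never enter the result, because a folder key is only created when a file is appended.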
def loadTask(self)

Implementation of the "load" task.

It loads the TF data that resides in the directory where the "convert" task delivers its results.

During loading there are additional checks. If they succeed, we have evidence that we have a valid TF dataset.

Also, during the first load, intensive pre-computation of TF data takes place, the results of which will be cached in the invisible .tf directory there.

That makes the TF data ready to be loaded fast, next time it is needed.

Returns

boolean
Whether the loading was successful.
def loadTask(self):
    """Implementation of the "load" task.

    It loads the TF data that resides in the directory where the "convert" task
    delivers its results.

    During loading there are additional checks. If they succeed, we have evidence
    that we have a valid TF dataset.

    Also, during the first load, intensive pre-computation of TF data takes place,
    the results of which will be cached in the invisible `.tf` directory there.

    That makes the TF data ready to be loaded fast, next time it is needed.

    Returns
    -------
    boolean
        Whether the loading was successful.
    """
    if not self.importOK():
        return

    if not self.good:
        return

    tfPath = self.tfPath
    verbose = self.verbose
    silent = AUTO if verbose == 1 else TERSE if verbose == 0 else DEEP

    if not dirExists(tfPath):
        console(f"Directory {ux(tfPath)} does not exist.", error=True)
        console("No TF found, nothing to load", error=True)
        self.good = False
        return

    TF = Fabric(locations=[tfPath], silent=silent)
    allFeatures = TF.explore(silent=True, show=True)
    loadableFeatures = allFeatures["nodes"] + allFeatures["edges"]
    api = TF.load(loadableFeatures, silent=silent)
    if api:
        if verbose >= 0:
            console(f"max node = {api.F.otype.maxNode}")
        self.good = True
        return

    self.good = False
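
The chained conditional that derives `silent` from `verbose` in the listing above can be hard to read at a glance. The following is a minimal sketch of the same mapping written as a plain function. The string values given to `AUTO`, `TERSE` and `DEEP` are assumptions made for illustration (the converter imports the real constants from TF), and `silent_for` is a hypothetical helper name, not part of the library.

```python
# Assumed stand-ins for the TF verbosity constants; the real ones are imported
# by the converter and their values are not shown in this section.
AUTO, TERSE, DEEP = "auto", "terse", "deep"


def silent_for(verbose):
    """Translate the converter's verbose level into a TF `silent` level."""
    if verbose == 1:
        return AUTO
    if verbose == 0:
        return TERSE
    return DEEP  # any other value, including -1


levels = {v: silent_for(v) for v in (-1, 0, 1)}
print(levels)
```

Under these assumptions, the most verbose converter setting maps to the least silent TF setting, and every value other than 0 or 1 falls through to the quietest one.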
def readSchemas(self)
def readSchemas(self):
    schemaDir = self.schemaDir
    models = self.models
    A = self.A

    schemaFiles = dict(rng={}, xsd={})
    self.schemaFiles = schemaFiles
    modelInfo = {}
    self.modelInfo = modelInfo
    modelXsd = {}
    self.modelXsd = modelXsd
    modelInv = {}
    self.modelInv = modelInv

    for model in [None] + models:
        for kind in ("rng", "xsd"):
            schemaFile = (
                A.getBaseSchema()[kind]
                if model is None
                else f"{schemaDir}/{model}.{kind}"
            )
            if fileExists(schemaFile):
                schemaFiles[kind][model] = schemaFile
                # Prefer RNG; fall back to XSD only when the model has no RNG.
                if kind == "rng" or (
                    kind == "xsd" and model not in schemaFiles["rng"]
                ):
                    modelInfo[model] = schemaFile
        if model in schemaFiles["rng"] and model not in schemaFiles["xsd"]:
            schemaFileXsd = f"{schemaDir}/{model}.xsd"
            A.fromrelax(schemaFiles["rng"][model], schemaFileXsd)
            schemaFiles["xsd"][model] = schemaFileXsd

    baseSchema = schemaFiles["xsd"][None]
    modelXsd[None] = baseSchema
    modelInv[(baseSchema, None)] = None

    for model in models:
        override = schemaFiles["xsd"][model]
        modelXsd[model] = override
        modelInv[(baseSchema, override)] = model
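
The listing above records, for each model, an RNG schema when one exists and an XSD schema otherwise. The sketch below isolates just that preference rule in a self-contained form. It is an illustration under stated assumptions, not the converter's code: `pick_schemas` and the model names `letters`, `diary` and `missing` are hypothetical, and the step that converts an RNG file to XSD is left out.

```python
import tempfile
from pathlib import Path


def pick_schemas(schema_dir, models):
    """Map each model to its preferred schema file: RNG first, else XSD."""
    chosen = {}
    for model in models:
        for kind in ("rng", "xsd"):
            candidate = Path(schema_dir) / f"{model}.{kind}"
            if candidate.is_file():
                chosen[model] = candidate.name
                break  # RNG is tried first, so it wins when both exist
    return chosen


with tempfile.TemporaryDirectory() as schema_dir:
    # Hypothetical schema directory: one model with both kinds, one with XSD only.
    for name in ("letters.rng", "letters.xsd", "diary.xsd"):
        (Path(schema_dir) / name).write_text("<schema/>")
    result = pick_schemas(schema_dir, ["letters", "diary", "missing"])

print(result)  # {'letters': 'letters.rng', 'diary': 'diary.xsd'}
```

A model without any schema file is simply absent from the result in this sketch, so a caller would need to check for that case before looking the model up.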
def task(self, check=False, convert=False, load=False, app=False, apptoken=False, browse=False, verbose=None, validate=None)

Carry out any task, possibly modified by any flag.

This is a higher level function that can execute a selection of tasks.

The tasks will be executed in a fixed order: check, convert, load, app, apptoken, browse. But you can select which one(s) must be executed.

If multiple tasks must be executed and one fails, the subsequent tasks will not be executed.

Parameters

check : boolean, optional False
Whether to carry out the check task.
convert : boolean, optional False
Whether to carry out the convert task.
load : boolean, optional False
Whether to carry out the load task.
app : boolean, optional False
Whether to carry out the app task.
apptoken : boolean, optional False
Whether to carry out the apptoken task.
browse : boolean, optional False
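Whether to carry out the browse task.
verbose : integer, optional None
Produce no (-1), some (0) or many (1) progress and reporting messages. If None, the current setting of the object is used; otherwise the given value applies for the duration of this call only.
validate : boolean, optional None
Whether to perform XML validation during the check task. If None, the current setting of the object is used.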

Returns

boolean
Whether all tasks have executed successfully.
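
The control flow described above (a fixed task order, selection by flags, stopping at the first failure) can be sketched without any TF dependency. The class below is a hypothetical stand-in, not the converter itself: `TaskRunner`, its `failing` parameter and its `log` attribute exist only for this illustration. One deliberate difference is that the sketch restores the verbosity with `try`/`finally`, so it is reset on every exit path.

```python
class TaskRunner:
    """Hypothetical stand-in for the converter object; not the TF class."""

    ORDER = ("check", "convert", "load", "app", "apptoken", "browse")

    def __init__(self, failing=()):
        self.good = True
        self.verbose = -1
        self.failing = set(failing)  # task names that simulate a failure
        self.log = []  # names of the tasks that actually ran

    def _run(self, name):
        self.log.append(name)
        if name in self.failing:
            self.good = False

    def task(self, verbose=None, **selected):
        saved = self.verbose
        if verbose is not None:
            self.verbose = verbose  # temporary override for this call
        try:
            for name in self.ORDER:
                if not self.good:
                    break  # an earlier task failed: skip the rest
                if selected.get(name, False):
                    self._run(name)
        finally:
            self.verbose = saved
        return self.good


runner = TaskRunner(failing={"convert"})
ok = runner.task(check=True, convert=True, load=True, verbose=1)
print(ok, runner.log, runner.verbose)  # False ['check', 'convert'] -1
```

Here `load` was requested but never ran, because `convert` failed before it. The order of the keyword arguments does not matter: tasks always run in the fixed order given by `ORDER`.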
def task(
    self,
    check=False,
    convert=False,
    load=False,
    app=False,
    apptoken=False,
    browse=False,
    verbose=None,
    validate=None,
):
    """Carry out any task, possibly modified by any flag.

    This is a higher level function that can execute a selection of tasks.

    The tasks will be executed in a fixed order:
    `check`, `convert`, `load`, `app`, `apptoken`, `browse`.
    But you can select which one(s) must be executed.

    If multiple tasks must be executed and one fails, the subsequent tasks
    will not be executed.

    Parameters
    ----------
    check: boolean, optional False
        Whether to carry out the `check` task.
    convert: boolean, optional False
        Whether to carry out the `convert` task.
    load: boolean, optional False
        Whether to carry out the `load` task.
    app: boolean, optional False
        Whether to carry out the `app` task.
    apptoken: boolean, optional False
        Whether to carry out the `apptoken` task.
    browse: boolean, optional False
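        Whether to carry out the `browse` task.
    verbose: integer, optional None
        Produce no (-1), some (0) or many (1) progress and reporting messages.
        If `None`, the current setting of the object is used; otherwise the
        given value applies for the duration of this call only.
    validate: boolean, optional None
        Whether to perform XML validation during the check task.
        If `None`, the current setting of the object is used.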

    Returns
    -------
    boolean
        Whether all tasks have executed successfully.
    """
    if not self.importOK():
        return False

    if verbose is not None:
        verboseSav = self.verbose
        self.verbose = verbose

    if validate is not None:
        self.validate = validate

    if not self.good:
        return False

    for condition, method, kwargs in (
        (check, self.checkTask, {}),
        (convert, self.convertTask, {}),
        (load, self.loadTask, {}),
        (app, self.appTask, {}),
        (apptoken, self.appTask, dict(tokenBased=True)),
        (browse, self.browseTask, {}),
    ):
        if condition:
            method(**kwargs)
            if not self.good:
                break

    if verbose is not None:
        self.verbose = verboseSav
    return self.good

Inherited members