Module ti.info.tei
"""
# TEI info
TI knows the TEI elements because it reads and parses the complete
TEI schema. From this it distils the set of complex, mixed elements.
If the TEI source conforms to a customised TEI schema, this is detected, and
the importer reads that schema to override the generic information about the
TEI elements.
It is also possible to pass a choice of template and adaptation in a processing
instruction. This does not influence validation, but it may influence further
processing.
If the TEI consists of multiple source files, it is possible to specify different
templates and adaptations for different files.
The possible values for models, templates, and adaptations should be declared in
the configuration file.
For each model there should be a corresponding schema in the schema directory,
either an RNG or an XSD file.
# Configuration and customization
You have to pass a specific additional file to the initializer of the TEI class:
* `path/tei.yml` in which you specify a bunch of values to
get the conversion off the ground.
## Keys and values of the `tei.yml` file
### `models`, `templates` and `adaptations`
list, optional `[]`
Which TEI-based schemas and editem templates and adaptations are to be used.
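For orientation, here is a sketch of these three keys, written as the Python values
they amount to once the configuration file has been read (the specific names are
hypothetical):

``` python
settings = dict(
    models=["suriano"],  # expects suriano.rng or suriano.xsd in the schema dir
    templates=["letter", "bibliolist"],
    adaptations=[],
)
```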
#### `models`
For each *model* there should be an XSD or RNG file with that name in the `schema`
directory. The `tei_all` schema is known to TF, no need to specify that one.
We'll try a RelaxNG schema (`.rng`) first. If that exists, we use it for validation
with JING, and we also convert it with TRANG to an XSD schema, which we use for
analysing the schema: we want to know which elements are mixed and pure.
If there is no RelaxNG schema, we try an XSD schema (`.xsd`). If that exists,
we can do the analysis, and we will use it also for validation.
!!! note "Problems with RelaxNG validation"
RelaxNG validation is not always reliable when performed with LXML, or any tool
based on `libxml`, for that matter. That's why we try to avoid it. Even if we
translate the RelaxNG schema to an XSD schema by means of TRANG, the resulting
validation is not always reliable. So we use JING to validate the RelaxNG schema.
See also [JING-TRANG](https://code.google.com/archive/p/jing-trang/downloads).
Suppose we have a model declared like so:
```
models:
- suriano
```
The model is typically referenced in the TEI source file like so (it calls for the
`suriano` model):
```
<?xml-model
href="https://xmlschema.huygens.knaw.nl/suriano.rng"
type="application/xml"
schematypens="http://relaxng.org/ns/structure/1.0"
?>
```
The convertor matches the `href` attribute with the `suriano` model by taking the
trailing part of the `href` value, without its extension.
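A minimal sketch of that matching rule (a hypothetical helper, not the actual
implementation):

``` python
def modelFromHref(href):
    # "https://xmlschema.huygens.knaw.nl/suriano.rng" => "suriano"
    return href.rsplit("/", 1)[-1].rsplit(".", 1)[0]
```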
In cases where this fails, you can specify the model as a dict in the yaml file.
Suppose we have a href attribute like this, which refers to the `dracor` model:
```
<?xml-model
href="https://dracor.org/schema.rng"
type="application/xml"
schematypens="http://relaxng.org/ns/structure/1.0"
?>
```
You can specify this in the yaml file as follows:
```
models:
- dracor: https://dracor.org/schema.rng
```
#### `templates`
Which template(s) are to be used.
A template is just a keyword, associated with an XML file, that can be used to switch
to a specific kind of processing, such as `letter`, `bibliolist`, `artworklist`.
You may specify an element or processing instruction with an attribute
that triggers the template for the file in which it is found.
This will be retrieved from the file before XML parsing starts.
For example,
``` python
templateTrigger="?editem@template"
```
will read the file and extract the value of the `template` attribute of the `editem`
processing instruction and use that as the template for this file.
If no template is found in this way, the empty template is assumed.
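The trigger is applied to the raw file contents by means of a regular expression,
along these lines (a sketch of what the convertor does internally; the adaptation
trigger below works the same way):

``` python
import re

# "?editem@template" splits into a tag and an attribute
(tag, att) = "?editem@template".split("@")
triggerRe = re.compile(rf"""<{re.escape(tag)}\b[^>]*?{att}=['"]([^'"]+)['"]""")

text = '<?editem template="letter" ?>'
match = triggerRe.search(text)
template = match.group(1) if match else None  # => "letter"
```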
#### `adaptations`
Which adaptation(s) are to be used.
An adaptation is just a keyword, associated with an XML file, that can be used to switch
to a specific kind of processing.
It is meant to trigger tweaks on top of the behaviour of a template.
You may specify an element or processing instruction with an attribute
that triggers the adaptation for the file in which it is found.
This will be retrieved from the file before XML parsing starts.
For example,
``` python
adaptationTrigger="?editem@adaptation"
```
will read the file and extract the value of the `adaptation` attribute of the `editem`
processing instruction and use that as the adaptation for this file.
If no adaptation is found in this way, the empty adaptation is assumed.
### `sectionModel`
dict, optional `{}`
In model I, there are three section levels in total.
The corpus is divided in folders (section level 1), files (section level 2),
and chunks within files. The parameter `levels` allows you to choose names for the
node types of these section levels.
In model II, there are two section levels in total.
The corpus consists of a single file, and section nodes will be added
for nodes at various levels, mainly outermost `<div>` and `<p>` elements and their
siblings of other element types.
The section heading for the second level is taken from nearby elements whose name
is given in the parameter `element`, but only if they carry the attributes
specified in the `attributes` parameter.
These elements should be immediate children of the section elements in question.
In model III, there are 3 section levels in total.
The corpus consists of a single folder with several files (section level 1),
with two levels of sections per file, as in model II.
If not passed, or an empty dict, section model I is assumed.
A section model must be specified with the parameters relevant for the
model:
``` python
dict(
model="II",
levels=["chapter", "chunk"],
element="head",
attributes=dict(rend="h3"),
)
```
or
``` python
dict(
model="III",
levels=["file", "part", "chunk"],
element="head",
attributes=dict(rend="h3"),
)
```
(model I does not require the *element* and *attributes* parameters)
or
``` python
dict(
model="I",
levels=["folder", "file", "chunk"],
)
```
This section model (I) accepts a few other parameters:
``` python
backMatter="backmatter"
```
This is the name of the folder that should not be treated as an ordinary folder, but
as the folder with the sources for the back-matter, such as references, lists, indices,
bibliography, biographies, etc.
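So a model I specification with a back-matter folder could look like this (the
folder name is just an example):

``` python
dict(
    model="I",
    levels=["folder", "file", "chunk"],
    backMatter="backmatter",
)
```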
For model II, the default parameters are:
``` python
element="head"
levels=["chapter", "chunk"],
attributes={}
```
For model III, the default parameters are:
``` python
element="head"
levels=["file", "part", "chunk"],
attributes={}
```
### `zoneBased`
boolean, optional `false`
Whether the `facs` attributes in `pb` elements refer to identifiers of
`surface` or `zone` elements.
If not, the `facs` attributes refer directly to file names.
These `surface` or `zone` elements must occur inside a `facsimile` element just
after the TEI header. Inside the referred element
is a `graphic` element whose `url` attribute contains the file name of the page
scan. This file name is a path with or without leading directories but without
extension.
On the `zone` element we expect the attributes `ulx`, `uly`, `lrx`, `lry` which
specify a region on the surface by their upper left and lower right points as
percentages from the origin of the surface. The origin is the upper left corner
of a surface. We transform these numbers into IIIF region specifications:
`pct:`*ulx*`,`*uly*`,`*lrx-ulx*`,`*lry-uly*
If we end up at a `surface`, instead of a `zone`, we provide the region specifier
`full`.
See the [IIIF Image API 3](https://iiif.io/api/image/3.0/#41-region).
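A worked example of that transformation (the zone coordinates are hypothetical):

``` python
# upper left at (10, 20), lower right at (60, 90), all percentages
(ulx, uly, lrx, lry) = (10, 20, 60, 90)
region = f"pct:{ulx},{uly},{lrx - ulx},{lry - uly}"  # => "pct:10,20,50,70"
```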
In either case a report file `facs.yml` will be generated.
It has a key for each file, and then a list of all `facs`
attribute values on `pb` elements.
If `zoneBased` is true, several more files are generated:
* `facsMapping.yml`:
  a key for each file, and then for each declared surface or
  zone id within the facsimile element in that file: the url value of the
  graphic element encountered there, followed by `«»` and then the IIIF
  specification of the region as explained above.
* `facsProblems.yml`:
  Two top-level keys: `facsNotDeclared` and `facsNotUsed`
  (see the sketch after this list).
  Under each of these keys we have file keys and then:
* in case of `facsNotDeclared`: facs-attribute values that have no entry in the
`facsMapping`;
* in case of `facsNotUsed`: graphic-url values that are not referred to by any
`pb` element.
* `zoneErrors.yml`:
If zones lack one of their required metrics, they are listed here, plus the
default that has been filled in for them.
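A sketch of the shape of `facsProblems.yml`, written as the Python data from
which it is generated (file and id names are hypothetical):

``` python
facsProblems = dict(
    facsNotDeclared={
        # facs values on pb elements without an entry in the facsMapping
        "letters/brief001.xml": ["zone99"],
    },
    facsNotUsed={
        # declared surfaces / zones that no pb element refers to
        "letters/brief001.xml": dict(surface=["surface3"]),
    },
)
```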
Last but not least, if `zoneBased` is True, the page nodes will get two extra features:
* `facsfile`: the filename without extension of the page scan
* `facsregion`: the region specifier of the page on the page scan
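Finally, a sketch of a typical run (the paths are hypothetical; `inventory`
implements the check task):

``` python
from ti.info.tei import TEI

T = TEI("corpus/tei", "corpus/config/tei.yml", verbose=0)
if T.good:
    T.inventory("corpus/schema", "corpus/report", carryon=False)
```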
"""
import collections
import re
from textwrap import wrap
from lxml import etree
from ..kit.helpers import console, versionSort, readCfg
from ..kit.files import (
fileOpen,
unexpanduser as ux,
initTree,
dirExists,
fileExists,
scanDir,
writeYaml,
)
from ..kit.generic import AttrDict
from .helpers import checkSectionModel
from ..tools.xmlschema import Analysis
FACS_MAPPING_YML = "facsMapping.yml"
TASKS_EXCLUDED = {"apptoken", "browse"}
PROGRESS_LIMIT = 5
REFERENCING = dict(
ptr="target",
ref="target",
rs="ref",
)
ZONE_ATTS = (("ulx", 0), ("uly", 0), ("lrx", 100), ("lry", 100))
def getRefs(tag, atts, xmlFile):
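    """Collect the references present in an element's referencing attribute.

    For tags listed in REFERENCING, the designated attribute is inspected;
    http(s) references are skipped. Every reference yields a triple
    (refAtt, targetFile, targetId): an empty file part refers to the current
    file `xmlFile`, and an absent `#` part refers to the whole target file.
    """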
refAtt = REFERENCING.get(tag, None)
result = []
if refAtt is not None:
refVal = atts.get(refAtt, None)
if refVal is not None and not refVal.startswith("http"):
for refv in refVal.split():
parts = refv.split("#", 1)
if len(parts) == 1:
targetFile = refv
targetId = ""
else:
(targetFile, targetId) = parts
if targetFile == "":
targetFile = xmlFile
result.append((refAtt, targetFile, targetId))
return result
class TEI:
def __init__(self, sourceDir, cfgFile, verbose=0):
"""Sets up information retrieval from a TEI source.
Parameters
----------
sourceDir: string
Directory of the TEI files.
Divided as follows:
1. volumes / collections of documents. The subdirectory
`__ignore__` is ignored.
1. the TEI documents themselves, conforming to the TEI schema or
some customization of it.
cfgFile: string
Path to the configuration file (yaml)
verbose: integer, optional -1
Produce no (-1), some (0) or many (1) progress and reporting messages
!!! note "Multiple XSD files"
When you started with a RNG file and used `ti.tools.xmlschema` to
convert it to XSD, you may have got multiple XSD files.
One of them has the same base name as the original RNG file,
and you should pass that name. It will import the remaining XSD files,
so do not throw them away.
"""
self.sourceDir = sourceDir
self.cfgFile = cfgFile
self.verbose = verbose
if not dirExists(sourceDir):
console("Source directory does not exist: {sourceDir}", error=True)
self.good = False
return
self.good = True
self.severeError = False
self.fatalError = False
(ok, settings) = readCfg(cfgFile, "tei", verbose=verbose, plain=True)
        if not ok:
            self.good = False
            return
param = AttrDict()
self.param = param
param.models = settings.get("models", [])
param.procins = settings.get("procins", False)
param.zoneBased = settings.get("zoneBased", False)
sectionModel = settings.get("sectionModel", {})
sectionModel = checkSectionModel(sectionModel, verbose)
if not sectionModel:
self.good = False
return
sectionProperties = sectionModel["properties"]
param.sectionModel = sectionModel["model"]
param.backMatter = sectionProperties.get("backMatter", None)
param.templates = settings.get("templates", [])
param.adaptations = settings.get("adaptations", [])
templateTrigger = settings.get("templateTrigger", None)
adaptationTrigger = settings.get("adaptationTrigger", None)
if templateTrigger is None:
templateAtt = None
templateTag = None
else:
(tag, att) = templateTrigger.split("@")
templateAtt = att
templateTag = tag
if adaptationTrigger is None:
adaptationAtt = None
adaptationTag = None
else:
(tag, att) = adaptationTrigger.split("@")
adaptationAtt = att
adaptationTag = tag
triggers = {}
param.triggers = triggers
for kind, theAtt, theTag in (
("template", templateAtt, templateTag),
("adaptation", adaptationAtt, adaptationTag),
):
triggerRe = None
if theAtt is not None and theTag is not None:
tagPat = re.escape(theTag)
triggerRe = re.compile(
rf"""<{tagPat}\b[^>]*?{theAtt}=['"]([^'"]+)['"]"""
)
triggers[kind] = triggerRe
if not self.good:
return
def inventory(self, schemaDir, reportDir, carryon=False, verbose=None):
"""Implementation of the "check" task.
It validates the TEI.
Then it makes an inventory of all elements and attributes in the TEI files.
If tags are used in multiple namespaces, it will be reported.
!!! caution "Conflation of namespaces"
The TEI to TF conversion does construct node types and attributes
without taking namespaces into account.
However, the parsing process is namespace aware.
The inventory lists all elements and attributes, and many attribute values.
        But it represents any digit with `N`, and the values of some attributes
        that contain ids or keywords are reduced to the value `X`.
This information reduction helps to get a clear overview.
It writes reports to the `reportDir`:
* `errors.txt`: validation errors
* `elements.txt`: element / attribute inventory.
!!! note "Thoroughness of validation"
All xml files for the same model will be validated by a single call
to the validator. This is fast, but the
consequence is that after a fatal error the process terminates without
validating the remaining files. In that case, we'll redo validation
for each file separately.
Parameters
----------
reportDir: string
The directory where the report files will be generated
schemaDir: string
Directory of the RNG/XSD schema files.
We use these files as custom TEI schemas,
but to be sure, we still analyse the full TEI schema and
use the schemas here as a set of overriding element definitions.
carryon: boolean, optional False
            Whether to carry on making an inventory if validation has failed.
            Normally, validation errors make it unlikely that further processing of
            the XML will succeed. But if the validation errors appear to be mild,
            and you want an inventory, you can pass `True` to this parameter,
            at your own risk.
verbose: integer, optional None
Produce no (-1), some (0) or many (1) progress and reporting messages
If `None`, the value will be taken from the corresponding object member.
"""
if not self.good:
return
        if not reportDir:
            console("No report directory specified", error=True)
            self.good = False
            return
sourceDir = self.sourceDir
self.schemaDir = schemaDir
self.reportDir = reportDir
self.carryon = carryon
if verbose is None:
verbose = self.verbose
param = self.param
procins = param.procins
zoneBased = param.zoneBased
param.kindLabels = dict(
format="Formatting Attributes",
keyword="Keyword Attributes",
rest="Remaining Attributes and Elements",
)
out = AttrDict()
self.out = out
self.readSchemas(verbose=verbose)
A = self.A
self.parser = self.getParser()
modelXsd = out.modelXsd
if verbose == 1:
console(f"TEI to TF checking: {ux(sourceDir)} => {ux(reportDir)}")
if verbose >= 0:
console(
f"Processing instructions are {'treated' if procins else 'ignored'}"
)
console("XML validation will be performed")
baseSchema = modelXsd[None]
overrides = [
override for (model, override) in modelXsd.items() if model is not None
]
A.getElementInfo(baseSchema, overrides, verbose=verbose)
out.elementDefs = A.elementDefs
getStore = lambda: collections.defaultdict( # noqa: E731
lambda: collections.defaultdict(collections.Counter)
)
out.report = {x: getStore() for x in param.kindLabels}
out.errors = []
out.tagByNs = collections.defaultdict(collections.Counter)
out.refs = collections.defaultdict(lambda: collections.Counter())
out.ids = collections.defaultdict(lambda: collections.Counter())
out.lbParents = collections.Counter()
out.folders = []
out.pageScans = {}
        out.facsMapping = {}
out.facsKind = {}
out.facsNotDeclared = {}
out.facsNoId = {}
out.zoneRegionIncomplete = {}
out.nProcins = 0
out.nPagesNoFacs = 0
out.inFacsimile = False
out.surfaceId = None
out.scanFile = None
out.zoneId = None
out.zoneRegion = None
initTree(reportDir)
self.validate(verbose=verbose)
for xmlPath in out.toBeInventoried:
self.fileInventory(xmlPath)
if verbose >= 0:
console("")
self.writeElemTypes(verbose=verbose)
if not self.severeError:
self.writeErrors(verbose=verbose)
if self.good or carryon:
self.writeFacs(verbose=verbose)
self.writeNamespaces(verbose=verbose)
self.writeReport(verbose=verbose)
self.writeIdRefs(verbose=verbose)
self.writeLbParents(verbose=verbose)
def validate(self, verbose=0):
sourceDir = self.sourceDir
carryon = self.carryon
A = self.A
param = self.param
sectionModel = param.sectionModel
out = self.out
errors = out.errors
modelInfo = out.modelInfo
out.toBeInventoried = []
xmlFilesByModel = collections.defaultdict(list)
out.files = self.getXML()
self.writeFileInfo()
if sectionModel == "I":
for xmlFolder, xmlFiles in out.files:
msg = "Start " if verbose >= 0 else "\t"
if verbose >= 0:
console(f"\t{msg}folder {xmlFolder}")
for xmlFile in xmlFiles:
xmlPath = f"{xmlFolder}/{xmlFile}"
xmlFullPath = f"{sourceDir}/{xmlPath}"
(model, adapt, tpl) = self.getSwitches(xmlFullPath)
xmlFilesByModel[model].append(xmlPath)
elif sectionModel == "II":
xmlFile = out.files
if xmlFile is None:
console("No XML files found!", error=True)
return False
xmlFullPath = f"{sourceDir}/{xmlFile}"
(model, adapt, tpl) = self.getSwitches(xmlFullPath)
xmlFilesByModel[model].append(xmlFile)
elif sectionModel == "III":
for xmlFile in out.files:
xmlFullPath = f"{sourceDir}/{xmlFile}"
(model, adapt, tpl) = self.getSwitches(xmlFullPath)
xmlFilesByModel[model].append(xmlFile)
good = True
severeError = False
fatalError = False
for model, xmlPaths in xmlFilesByModel.items():
if verbose >= 0:
console(f"{len(xmlPaths)} {model or 'TEI'} file(s) ...")
thisGood = True
if verbose >= 0:
console("\tValidating ...")
schemaFile = modelInfo.get(model, None)
if schemaFile is None:
if verbose >= 0:
console(f"\t\tNo schema file for {model}")
if good is not None and good is not False:
good = None
continue
(thisGood, info, theseErrors) = A.validate(
True,
schemaFile,
[f"{sourceDir}/{xmlPath}" for xmlPath in xmlPaths],
)
if thisGood == -1: # severe error, validation machinery not good
severeError = True
elif thisGood is None:
fatalError = True
# redo validation for each file separately in order to get all
# fatal errors
console("Fatal error in one of the XML files", error=True)
rInfo = [*info]
rTheseErrors = [*theseErrors]
rXmlPaths = [*xmlPaths]
iteration = 0
maxIter = 20
while True:
iteration += 1
if iteration > maxIter:
console(
"Stopped looking for more fatal errors after "
f"{maxIter} iterations",
error=True,
)
break
fatalPath = None
for e in rTheseErrors:
kind = e[4]
if kind == "fatal":
(folder, file) = e[0:2]
fatalPath = f"{folder}/{file}"
if fatalPath is None:
console("No more fatal errors", error=True)
break
console(
"Check for more fatal errors "
f"(iteration {iteration} of up to {maxIter}) "
f"after {fatalPath}",
error=True,
)
newRXmlPaths = []
skipping = True
for xmlPath in rXmlPaths:
if skipping:
if xmlPath == fatalPath:
skipping = False
else:
newRXmlPaths.append(xmlPath)
if not len(newRXmlPaths):
console("No more files to examine", error=True)
break
rXmlPaths = newRXmlPaths
(thisRGood, rInfo, rTheseErrors) = A.validate(
True,
schemaFile,
[f"{sourceDir}/{xmlPath}" for xmlPath in rXmlPaths],
verbose=True,
)
info.extend(rInfo)
theseErrors.extend(rTheseErrors)
if thisRGood is not None:
console("Last fatal error encountered", error=True)
break
for line in info:
if verbose >= 0:
console(f"\t\t{line}")
if severeError:
for err in theseErrors:
console(err, error=True)
self.severeError = True
break
if fatalError:
self.fatalError = True
if not thisGood:
good = False
errors.extend(theseErrors)
if not carryon:
continue
            if good or carryon:
                out.toBeInventoried.extend(xmlPaths)
def analyse(self, root, xmlPath):
FORMAT_ATTS = set(
"""
dim
level
place
rend
""".strip().split()
)
KEYWORD_ATTS = set(
"""
facs
form
function
lang
reason
type
unit
who
""".strip().split()
)
TRIM_ATTS = set(
"""
id
key
target
value
""".strip().split()
)
NUM_RE = re.compile(r"""[0-9]""", re.S)
param = self.param
procins = param.procins
zoneBased = param.zoneBased
out = self.out
report = out.report
tagByNs = out.tagByNs
refs = out.refs
ids = out.ids
lbParents = out.lbParents
pageScans = out.pageScans
facsMapping = out.facsMapping
facsKind = out.facsKind
facsNotDeclared = out.facsNotDeclared
facsNoId = out.facsNoId
zoneRegionIncomplete = out.zoneRegionIncomplete
def nodeInfo(xnode):
if procins and isinstance(xnode, etree._ProcessingInstruction):
target = xnode.target
tag = f"?{target}"
ns = ""
out.nProcins += 1
else:
qName = etree.QName(xnode.tag)
tag = qName.localname
ns = qName.namespace
atts = {etree.QName(k).localname: v for (k, v) in xnode.attrib.items()}
tagByNs[tag][ns] += 1
if tag == "lb":
parentTag = etree.QName(xnode.getparent().tag).localname
lbParents[parentTag] += 1
elif tag == "pb":
facsv = atts.get("facs", "")
if zoneBased:
facsv = facsv.removeprefix("#")
if facsv:
(scanName, scanRegion) = facsMapping[xmlPath].get(
facsv, ["", "full"]
)
if not scanName:
facsNotDeclared[xmlPath].add(facsv)
if facsv:
pageScans[xmlPath].append(facsv)
else:
out.nPagesNoFacs += 1
elif zoneBased:
if tag == "facsimile":
out.inFacsimile = True
elif out.inFacsimile:
if tag == "surface":
out.surfaceId = atts.get("id", None)
out.scanFile = None
if not out.surfaceId:
facsNoId[xmlPath]["surface"] += 1
elif tag == "zone":
out.zoneId = atts.get("id", None)
if out.zoneId:
out.zoneRegion = []
for a, aDefault in ZONE_ATTS:
aVal = atts.get(a, None)
if aVal is None:
aVal = aDefault
zoneRegionIncomplete.setdefault(out.zoneId, {})[
a
] = f"None => {aDefault}"
elif aVal.isdecimal():
aVal = int(aVal)
                                else:
                                    zoneRegionIncomplete.setdefault(out.zoneId, {})[
                                        a
                                    ] = f"{aVal} => {aDefault}"
                                    # fill in the default, so that the region
                                    # arithmetic below cannot fail on a string
                                    aVal = aDefault
out.zoneRegion.append(aVal)
(ulx, uly, lrx, lry) = out.zoneRegion
out.zoneRegion = f"pct:{ulx},{uly},{lrx - ulx},{lry - uly}"
if out.scanFile:
facsMapping[xmlPath][out.zoneId] = [
out.scanFile,
out.zoneRegion,
]
facsKind[xmlPath][out.zoneId] = "zone"
else:
facsNoId[xmlPath]["zone"] += 1
elif tag == "graphic":
# can be inside zone or inside surface
# if inside surface, it holds for all zones without
# own scanFile
thisScanFile = atts.get("url", None)
if thisScanFile is not None:
if out.zoneId:
facsMapping[xmlPath][out.zoneId] = [
thisScanFile,
out.zoneRegion,
]
facsKind[xmlPath][out.zoneId] = "zone"
else:
# this is a graphic outside the zones
# we set the surface wide scan file
# so that subsequent zones without graphic
# can pick this up
out.scanFile = thisScanFile
if out.surfaceId:
facsMapping[xmlPath][out.surfaceId] = [
out.scanFile,
"full",
]
facsKind[xmlPath][out.surfaceId] = "surface"
if len(atts) == 0:
kind = "rest"
report[kind][tag][""][""] += 1
else:
idv = atts.get("id", None)
if idv is not None:
ids[xmlPath][idv] += 1
for refAtt, targetFile, targetId in getRefs(tag, atts, xmlPath):
refs[xmlPath][(targetFile, targetId)] += 1
for k, v in atts.items():
kind = (
"format"
if k in FORMAT_ATTS
else "keyword" if k in KEYWORD_ATTS else "rest"
)
dest = report[kind]
if kind == "rest":
vTrim = "X" if k in TRIM_ATTS else NUM_RE.sub("N", v)
dest[tag][k][vTrim] += 1
else:
words = v.strip().split()
for w in words:
dest[tag][k][w.strip()] += 1
for child in xnode.iterchildren(
tag=(
(etree.Element, etree.ProcessingInstruction)
if procins
else etree.Element
)
):
nodeInfo(child)
if zoneBased:
if tag == "facsimile":
out.inFacsimile = False
elif out.inFacsimile:
if tag == "surface":
out.surfaceId = None
out.scanFile = None
elif tag == "zone":
out.zoneId = None
nodeInfo(root)
def fileInventory(self, xmlPath):
sourceDir = self.sourceDir
xmlFullPath = f"{sourceDir}/{xmlPath}"
out = self.out
ids = out.ids
pageScans = out.pageScans
facsMapping = out.facsMapping
facsKind = out.facsKind
facsNotDeclared = out.facsNotDeclared
facsNoId = out.facsNoId
pageScans[xmlPath] = []
facsMapping[xmlPath] = {}
facsKind[xmlPath] = {}
facsNotDeclared[xmlPath] = set()
facsNoId[xmlPath] = collections.Counter()
root = self.parseXML(xmlPath, xmlFullPath)
if root is None:
return
ids[xmlPath][""] = 1
self.analyse(root, xmlPath)
def writeFileInfo(self, verbose=0):
"""Write the folder/file info to a file."""
reportDir = self.reportDir
infoFile = f"{reportDir}/files.yml"
out = self.out
info = out.files
writeYaml(info, asFile=infoFile)
def writeErrors(self, verbose=0):
"""Write the errors to a file."""
reportDir = self.reportDir
errorFile = f"{reportDir}/errors.txt"
out = self.out
errors = out.errors
nErrors = 0
nFiles = 0
with fileOpen(errorFile, mode="w") as fh:
prevFolder = None
prevFile = None
for folder, file, line, col, kind, text in errors:
newFolder = prevFolder != folder
newFile = newFolder or prevFile != file
if newFile:
nFiles += 1
if kind in {"error", "fatal"}:
nErrors += 1
indent1 = f"{folder}\n\t" if newFolder else "\t"
indent2 = f"{file}\n\t\t" if newFile else "\t"
loc = f"{line or ''}:{col or ''}"
text = "\n".join(wrap(text, width=80, subsequent_indent="\t\t\t"))
fh.write(f"{indent1}{indent2}{loc} {kind or ''} {text}\n")
prevFolder = folder
prevFile = file
if nErrors:
console(
(
f"{nErrors} validation error(s) in {nFiles} file(s) "
f"written to {errorFile}"
),
error=True,
)
else:
if verbose >= 0:
console("Validation OK")
def writeFacs(self, verbose=0):
reportDir = self.reportDir
infoFile = f"{reportDir}/facsNoId.yml"
param = self.param
zoneBased = param.zoneBased
out = self.out
pageScans = out.pageScans
facsMapping = out.facsMapping
facsKind = out.facsKind
facsNotDeclared = out.facsNotDeclared
facsNoId = out.facsNoId
zoneRegionIncomplete = out.zoneRegionIncomplete
nPagesNoFacs = out.nPagesNoFacs
writeYaml(
{
f: {k: n for (k, n) in v.items() if n}
for (f, v) in facsNoId.items()
if len(v)
},
asFile=infoFile,
)
nSurfaces = sum(x["surface"] for x in facsNoId.values())
nZones = sum(x["zone"] for x in facsNoId.values())
if verbose >= 0:
pluralS = "" if nSurfaces == 1 else "s"
pluralZ = "" if nZones == 1 else "s"
if nSurfaces:
console(f"{nSurfaces} surface{pluralS} without id")
if nZones:
console(f"{nZones} zone{pluralZ} without id")
infoFile = f"{reportDir}/facs.yml"
nItems = sum(len(x) for x in pageScans.values())
nUnique = sum(len(set(x)) for x in pageScans.values())
writeYaml(pageScans, asFile=infoFile)
if verbose >= 0:
plural = "" if nPagesNoFacs == 1 else "s"
console(f"{nPagesNoFacs} pagebreak{plural} without facs attribute.")
plural = "" if nItems == 1 else "s"
console(f"{nItems} pagebreak{plural} encountered.")
plural = "" if nUnique == 1 else "s"
console(f"{nUnique} distinct scan{plural} referred to by pagebreaks.")
if not zoneBased:
return
infoFile = f"{reportDir}/facsKind.yml"
writeYaml(facsKind, asFile=infoFile)
infoFile = f"{reportDir}/{FACS_MAPPING_YML}"
writeYaml(facsMapping, asFile=infoFile)
if verbose >= 0:
nSurfaces = sum(
sum(1 for y in x.values() if y == "surface") for x in facsKind.values()
)
nZones = sum(
sum(1 for y in x.values() if y == "zone") for x in facsKind.values()
)
plural = "" if nSurfaces == 1 else "s"
console(f"{nSurfaces} surface{plural} declared")
plural = "" if nZones == 1 else "s"
console(f"{nZones} zone{plural} declared")
nItems = sum(len(x) for x in facsMapping.values())
plural = "" if nItems == 1 else "s"
console(f"{nItems} scan{plural} declared and mapped.")
infoFile = f"{reportDir}/facsProblems.yml"
facsNotUsed = {}
for xmlPath, mapping in facsMapping.items():
facsEncountered = set(pageScans[xmlPath])
thisFacsNotUsed = {}
for facs in mapping:
if facs not in facsEncountered:
kind = facsKind[xmlPath][facs]
thisFacsNotUsed.setdefault(kind, []).append(facs)
if len(thisFacsNotUsed):
facsNotUsed[xmlPath] = thisFacsNotUsed
facsProblems = {}
nFacsNotDeclared = sum(len(x) for x in facsNotDeclared.values())
nSurfacesNotUsed = sum(len(x.get("surface", [])) for x in facsNotUsed.values())
nZonesNotUsed = sum(len(x.get("zone", [])) for x in facsNotUsed.values())
if nFacsNotDeclared:
plural = "" if nFacsNotDeclared == 1 else "s"
console(f"{nFacsNotDeclared} undeclared scan{plural}", error=True)
facsProblems["facsNotDeclared"] = {
xmlPath: sorted(x) for (xmlPath, x) in facsNotDeclared.items() if len(x)
}
if nSurfacesNotUsed:
plural = "" if nSurfacesNotUsed == 1 else "s"
console(f"{nSurfacesNotUsed} unused surface{plural}", error=True)
if nZonesNotUsed:
plural = "" if nZonesNotUsed == 1 else "s"
console(f"{nZonesNotUsed} unused zone{plural}", error=True)
facsProblems["facsNotUsed"] = facsNotUsed
writeYaml(facsProblems, asFile=infoFile)
infoFile = f"{reportDir}/zoneErrors.yml"
nIncomplete = len(zoneRegionIncomplete)
plural = "" if nIncomplete == 1 else "s"
if nIncomplete:
console(f"{nIncomplete} missing zone region specifier{plural}", error=True)
console(f"See {infoFile}", error=True)
writeYaml(zoneRegionIncomplete, asFile=infoFile)
def writeNamespaces(self, verbose=0):
reportDir = self.reportDir
errorFile = f"{reportDir}/namespaces.txt"
param = self.param
procins = param.procins
out = self.out
tagByNs = out.tagByNs
nProcins = out.nProcins
nErrors = 0
nTags = len(tagByNs)
with fileOpen(errorFile, mode="w") as fh:
for tag, nsInfo in sorted(
tagByNs.items(), key=lambda x: (-len(x[1]), x[0])
):
label = "OK"
nNs = len(nsInfo)
if nNs > 1:
nErrors += 1
label = "XX"
for ns, amount in sorted(nsInfo.items(), key=lambda x: (-x[1], x[0])):
fh.write(
f"{label} {nNs:>2} namespace for "
f"{tag:<16} : {amount:>5}x {ns}\n"
)
if verbose >= 0:
if procins:
plural = "" if nProcins == 1 else "s"
console(f"{nProcins} processing instruction{plural} encountered.")
console(
(
f"{nTags} tags of which {nErrors} with multiple namespaces "
f"written to {errorFile}"
if verbose >= 0 or nErrors
else "Namespaces OK"
),
error=nErrors > 0,
)
def writeReport(self, verbose=0):
reportDir = self.reportDir
reportFile = f"{reportDir}/elements.txt"
param = self.param
kindLabels = param.kindLabels
out = self.out
report = out.report
with fileOpen(reportFile, mode="w") as fh:
fh.write(
"Inventory of tags and attributes in the source XML file(s).\n"
"Contains the following sections:\n"
)
for label in kindLabels.values():
fh.write(f"\t{label}\n")
fh.write("\n\n")
infoLines = 0
def writeAttInfo(tag, att, attInfo):
nonlocal infoLines
nl = "" if tag == "" else "\n"
tagRep = "" if tag == "" else f"<{tag}>"
attRep = "" if att == "" else f"{att}="
atts = sorted(attInfo.items())
(val, amount) = atts[0]
fh.write(f"{nl}\t{tagRep:<18} " f"{attRep:<11} {amount:>7}x {val}\n")
infoLines += 1
for val, amount in atts[1:]:
fh.write(f"""\t{'':<18} {'':<11} {amount:>7}x {val}\n""")
infoLines += 1
def writeTagInfo(tag, tagInfo):
nonlocal infoLines
tags = sorted(tagInfo.items())
(att, attInfo) = tags[0]
writeAttInfo(tag, att, attInfo)
infoLines += 1
for att, attInfo in tags[1:]:
writeAttInfo("", att, attInfo)
for kind, label in kindLabels.items():
fh.write(f"\n{label}\n")
for tag, tagInfo in sorted(report[kind].items()):
writeTagInfo(tag, tagInfo)
if verbose >= 0:
console(f"{infoLines} info line(s) written to {reportFile}")
def writeElemTypes(self, verbose=0):
reportDir = self.reportDir
out = self.out
elementDefs = out.elementDefs
modelInv = out.modelInv
elemsCombined = {}
modelSet = set()
for schemaOverride, eDefs in elementDefs.items():
model = modelInv[schemaOverride]
modelSet.add(model)
for tag, (typ, mixed) in eDefs.items():
elemsCombined.setdefault(tag, {}).setdefault(model, {})
elemsCombined[tag][model]["typ"] = typ
elemsCombined[tag][model]["mixed"] = mixed
tagReport = {}
for tag, tagInfo in elemsCombined.items():
tagLines = []
tagReport[tag] = tagLines
if None in tagInfo:
teiInfo = tagInfo[None]
teiTyp = teiInfo["typ"]
teiMixed = teiInfo["mixed"]
teiTypRep = "??" if teiTyp is None else typ
teiMixedRep = (
"??" if teiMixed is None else "mixed" if teiMixed else "pure"
)
mds = ["TEI"]
for model in sorted(x for x in tagInfo if x is not None):
info = tagInfo[model]
typ = info["typ"]
mixed = info["mixed"]
if typ == teiTyp and mixed == teiMixed:
mds.append(model)
else:
typRep = "" if typ == teiTyp else "??" if typ is None else typ
mixedRep = (
""
if mixed == teiMixed
else (
"??" if mixed is None else "mixed" if mixed else "pure"
)
)
tagLines.append((tag, [model], typRep, mixedRep))
tagLines.insert(0, (tag, mds, teiTypRep, teiMixedRep))
else:
for model in sorted(tagInfo):
info = tagInfo[model]
typ = info["typ"]
mixed = info["mixed"]
typRep = "??" if typ is None else typ
mixedRep = "??" if mixed is None else "mixed" if mixed else "pure"
tagLines.append((tag, [model], typRep, mixedRep))
reportFile = f"{reportDir}/types.txt"
with fileOpen(reportFile, mode="w") as fh:
for tag in sorted(tagReport):
tagLines = tagReport[tag]
for tag, mds, typ, mixed in tagLines:
model = ",".join(mds)
fh.write(f"{tag:<18} {model:<18} {typ or '':<7} {mixed or '':<5}\n")
if verbose >= 0:
console(f"{len(elemsCombined)} tag(s) type info written to {reportFile}")
def writeLbParents(self, verbose=0):
reportDir = self.reportDir
reportFile = f"{reportDir}/lb-parents.txt"
out = self.out
lbParents = out.lbParents
with fileOpen(reportFile, "w") as fh:
for parent, n in sorted(lbParents.items()):
fh.write(f"{n:>5} x {parent}\n")
if verbose >= 0:
console(f"lb-parent info written to {reportFile}")
def writeIdRefs(self, verbose=0):
reportDir = self.reportDir
reportIdFile = f"{reportDir}/ids.txt"
reportRefFile = f"{reportDir}/refs.txt"
out = self.out
refs = out.refs
ids = out.ids
ih = fileOpen(reportIdFile, mode="w")
rh = fileOpen(reportRefFile, mode="w")
refdIds = collections.Counter()
missingIds = set()
totalRefs = 0
totalRefsU = 0
totalResolvable = 0
totalResolvableU = 0
totalDangling = 0
totalDanglingU = 0
seenItems = set()
for file, items in refs.items():
rh.write(f"{file}\n")
resolvable = 0
resolvableU = 0
dangling = 0
danglingU = 0
for item, n in sorted(items.items()):
totalRefs += n
if item in seenItems:
newItem = False
else:
seenItems.add(item)
newItem = True
totalRefsU += 1
(target, idv) = item
if target not in ids or idv not in ids[target]:
status = "dangling"
dangling += n
if newItem:
missingIds.add((target, idv))
danglingU += 1
else:
status = "ok"
resolvable += n
refdIds[(target, idv)] += n
if newItem:
resolvableU += 1
rh.write(f"\t{status:<10} {n:>5} x {target} # {idv}\n")
msgs = (
f"\tDangling: {dangling:>4} x {danglingU:>4}",
f"\tResolvable: {resolvable:>4} x {resolvableU:>4}",
)
for msg in msgs:
rh.write(f"{msg}\n")
totalResolvable += resolvable
totalResolvableU += resolvableU
totalDangling += dangling
totalDanglingU += danglingU
if verbose >= 0:
console(f"Refs written to {reportRefFile}")
msgs = (
f"\tresolvable: {totalResolvableU:>4} in {totalResolvable:>4}",
f"\tdangling: {totalDanglingU:>4} in {totalDangling:>4}",
f"\tALL: {totalRefsU:>4} in {totalRefs:>4} ",
)
for msg in msgs:
console(msg)
totalIds = 0
totalIdsU = 0
totalIdsM = 0
totalIdsRefd = 0
totalIdsRefdU = 0
totalIdsUnused = 0
for file, items in ids.items():
totalIds += len(items)
ih.write(f"{file}\n")
unique = 0
multiple = 0
refd = 0
refdU = 0
unused = 0
for item, n in sorted(items.items()):
nRefs = refdIds.get((file, item), 0)
if n == 1:
unique += 1
else:
multiple += 1
if nRefs == 0:
unused += 1
else:
refd += nRefs
refdU += 1
status1 = f"{n}x"
plural = "" if nRefs == 1 else "s"
status2 = f"{nRefs}ref{plural}"
ih.write(f"\t{status1:<8} {status2:<8} {item}\n")
msgs = (
f"\tUnique: {unique:>4}",
f"\tNon-unique: {multiple:>4}",
f"\tUnused: {unused:>4}",
f"\tReferenced: {refd:>4} x {refdU:>4}",
)
for msg in msgs:
ih.write(f"{msg}\n")
totalIdsU += unique
totalIdsM += multiple
totalIdsRefdU += refdU
totalIdsRefd += refd
totalIdsUnused += unused
if verbose >= 0:
console(f"Ids written to {reportIdFile}")
msgs = (
f"\treferenced: {totalIdsRefdU:>4} by {totalIdsRefd:>4}",
f"\tnon-unique: {totalIdsM:>4}",
f"\tunused: {totalIdsUnused:>4}",
f"\tALL: {totalIdsU:>4} in {totalIds:>4}",
)
for msg in msgs:
console(msg)
def readSchemas(self, verbose=0):
schemaDir = self.schemaDir
param = self.param
models = param.models
out = self.out
out.modelXsd = {}
out.modelMap = {}
out.modelInfo = {}
out.modelInv = {}
A = Analysis(verbose=verbose)
self.A = A
newModels = []
schemaFiles = dict(rng={}, xsd={})
for model in [None] + models:
if type(model) is dict:
(model, href) = list(model.items())[0]
out.modelMap[href] = model
if model is not None:
newModels.append(model)
for kind in ("rng", "xsd"):
schemaFile = (
A.getBaseSchema()[kind]
if model is None
else f"{schemaDir}/{model}.{kind}"
)
if fileExists(schemaFile):
schemaFiles[kind][model] = schemaFile
                    if kind == "rng" or (
                        kind == "xsd" and model not in schemaFiles["rng"]
                    ):
out.modelInfo[model] = schemaFile
if model in schemaFiles["rng"] and model not in schemaFiles["xsd"]:
schemaFileXsd = f"{schemaDir}/{model}.xsd"
result = A.fromrelax(schemaFiles["rng"][model], schemaFileXsd)
if not result:
console(
f"Could not convert relax schema {model} to xsd", error=True
)
self.good = False
if result is None:
self.severeError = True
return
schemaFiles["xsd"][model] = schemaFileXsd
baseSchema = schemaFiles["xsd"][None]
out.modelXsd[None] = baseSchema
out.modelInv[(baseSchema, None)] = None
for model in newModels:
override = schemaFiles["xsd"][model]
out.modelXsd[model] = override
out.modelInv[(baseSchema, override)] = model
def getSwitches(self, xmlPath):
verbose = self.verbose
A = self.A
param = self.param
models = param.models
templates = param.templates
adaptations = param.adaptations
triggers = param.triggers
out = self.out
modelMap = out.modelMap
text = None
found = {}
for kind, allOfKind in (
("model", models),
("adaptation", adaptations),
("template", templates),
):
if text is None:
with fileOpen(xmlPath) as fh:
text = fh.read()
found[kind] = None
if kind == "model":
result = A.getModel(text, modelMap)
if result is None or result == "tei_all":
result = None
else:
result = None
triggerRe = triggers[kind]
if triggerRe is not None:
match = triggerRe.search(text)
result = match.group(1) if match else None
if result is not None and result not in allOfKind:
if verbose >= 0:
console(f"unavailable {kind} {result} in {ux(xmlPath)}")
result = None
found[kind] = result
return (found["model"], found["adaptation"], found["template"])
def getParser(self):
"""Configure the LXML parser.
See [parser options](https://lxml.de/parsing.html#parser-options).
Returns
-------
object
A configured LXML parse object.
"""
param = self.param
procins = param.procins
return etree.XMLParser(
remove_blank_text=False,
collect_ids=False,
remove_comments=True,
remove_pis=not procins,
huge_tree=True,
)
def parseXML(self, fileName, fileOrText):
"""Parse an XML source.
This is not meant to validate the XML, only to parse the XML into elements,
attributes, and processing instructions, etc. Validity can be checked by means
        of `ti.tools.xmlschema.Analysis.validate` as is done in the check task.
        Parameters
        ----------
        fileName: string
            An indicator of the file name; it does not have to be the full path,
            it is only used in error messages.
        fileOrText: string
            Either the full path of an XML file, or a string of raw XML text.
Returns
-------
object | void
The root of the resulting parse tree if the parsing succeeded, else None.
If the parsing failed, a message is written to stderr.
"""
parser = self.parser
try:
tree = etree.parse(fileOrText, parser)
result = tree.getroot()
except Exception as e:
console(f"{fileName}: {str(e)}", error=True)
result = None
return result
def getXML(self):
"""Make an inventory of the TEI source files.
Returns
-------
list of list | list of string | string
If section model I is in force:
The outer list has sorted entries corresponding to folders under the
TEI input directory.
Each such entry consists of the folder name and an inner list
that contains the file names in that folder, sorted.
If section model II is in force:
It is the name of the single XML file.
If section model III is in force:
            It is a sorted list of the XML file names.
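        For example, under section model I the result could look like this
        (hypothetical folder and file names):

            [
                ["berichten", ["brief001.xml", "brief002.xml"]],
                ["backmatter", ["bibliography.xml"]],
            ]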
"""
verbose = self.verbose
sourceDir = self.sourceDir
param = self.param
sectionModel = param.sectionModel
if verbose == 1:
console(f"Section model {sectionModel}")
if sectionModel == "I":
backMatter = param.backMatter
IGNORE = "__ignore__"
xmlFilesRaw = collections.defaultdict(list)
with scanDir(sourceDir) as dh:
for folder in dh:
folderName = folder.name
if folderName == IGNORE:
continue
if not folder.is_dir():
continue
with scanDir(f"{sourceDir}/{folderName}") as fh:
for file in fh:
fileName = file.name
if not (
fileName.lower().endswith(".xml") and file.is_file()
):
continue
xmlFilesRaw[folderName].append(fileName)
xmlFiles = []
hasBackMatter = False
for folderName in sorted(xmlFilesRaw, key=versionSort):
if folderName == backMatter:
hasBackMatter = True
else:
fileNames = xmlFilesRaw[folderName]
xmlFiles.append([folderName, sorted(fileNames)])
if hasBackMatter:
fileNames = xmlFilesRaw[backMatter]
xmlFiles.append([backMatter, sorted(fileNames)])
return xmlFiles
if sectionModel == "II":
xmlFile = None
with scanDir(sourceDir) as fh:
for file in fh:
fileName = file.name
if not (fileName.lower().endswith(".xml") and file.is_file()):
continue
xmlFile = fileName
break
return xmlFile
if sectionModel == "III":
xmlFiles = []
with scanDir(sourceDir) as fh:
for file in fh:
fileName = file.name
if not (fileName.lower().endswith(".xml") and file.is_file()):
continue
xmlFiles.append(fileName)
return sorted(xmlFiles)
facsNotDeclared = out.facsNotDeclared facsNoId = out.facsNoId pageScans[xmlPath] = [] facsMapping[xmlPath] = {} facsKind[xmlPath] = {} facsNotDeclared[xmlPath] = set() facsNoId[xmlPath] = collections.Counter() root = self.parseXML(xmlPath, xmlFullPath) if root is None: return ids[xmlPath][""] = 1 self.analyse(root, xmlPath) def writeFileInfo(self, verbose=0): """Write the folder/file info to a file.""" reportDir = self.reportDir infoFile = f"{reportDir}/files.yml" out = self.out info = out.files writeYaml(info, asFile=infoFile) def writeErrors(self, verbose=0): """Write the errors to a file.""" reportDir = self.reportDir errorFile = f"{reportDir}/errors.txt" out = self.out errors = out.errors nErrors = 0 nFiles = 0 with fileOpen(errorFile, mode="w") as fh: prevFolder = None prevFile = None for folder, file, line, col, kind, text in errors: newFolder = prevFolder != folder newFile = newFolder or prevFile != file if newFile: nFiles += 1 if kind in {"error", "fatal"}: nErrors += 1 indent1 = f"{folder}\n\t" if newFolder else "\t" indent2 = f"{file}\n\t\t" if newFile else "\t" loc = f"{line or ''}:{col or ''}" text = "\n".join(wrap(text, width=80, subsequent_indent="\t\t\t")) fh.write(f"{indent1}{indent2}{loc} {kind or ''} {text}\n") prevFolder = folder prevFile = file if nErrors: console( ( f"{nErrors} validation error(s) in {nFiles} file(s) " f"written to {errorFile}" ), error=True, ) else: if verbose >= 0: console("Validation OK") def writeFacs(self, verbose=0): reportDir = self.reportDir infoFile = f"{reportDir}/facsNoId.yml" param = self.param zoneBased = param.zoneBased out = self.out pageScans = out.pageScans facsMapping = out.facsMapping facsKind = out.facsKind facsNotDeclared = out.facsNotDeclared facsNoId = out.facsNoId zoneRegionIncomplete = out.zoneRegionIncomplete nPagesNoFacs = out.nPagesNoFacs writeYaml( { f: {k: n for (k, n) in v.items() if n} for (f, v) in facsNoId.items() if len(v) }, asFile=infoFile, ) nSurfaces = sum(x["surface"] for x in facsNoId.values()) nZones = sum(x["zone"] for x in facsNoId.values()) if verbose >= 0: pluralS = "" if nSurfaces == 1 else "s" pluralZ = "" if nZones == 1 else "s" if nSurfaces: console(f"{nSurfaces} surface{pluralS} without id") if nZones: console(f"{nZones} zone{pluralZ} without id") infoFile = f"{reportDir}/facs.yml" nItems = sum(len(x) for x in pageScans.values()) nUnique = sum(len(set(x)) for x in pageScans.values()) writeYaml(pageScans, asFile=infoFile) if verbose >= 0: plural = "" if nPagesNoFacs == 1 else "s" console(f"{nPagesNoFacs} pagebreak{plural} without facs attribute.") plural = "" if nItems == 1 else "s" console(f"{nItems} pagebreak{plural} encountered.") plural = "" if nUnique == 1 else "s" console(f"{nUnique} distinct scan{plural} referred to by pagebreaks.") if not zoneBased: return infoFile = f"{reportDir}/facsKind.yml" writeYaml(facsKind, asFile=infoFile) infoFile = f"{reportDir}/{FACS_MAPPING_YML}" writeYaml(facsMapping, asFile=infoFile) if verbose >= 0: nSurfaces = sum( sum(1 for y in x.values() if y == "surface") for x in facsKind.values() ) nZones = sum( sum(1 for y in x.values() if y == "zone") for x in facsKind.values() ) plural = "" if nSurfaces == 1 else "s" console(f"{nSurfaces} surface{plural} declared") plural = "" if nZones == 1 else "s" console(f"{nZones} zone{plural} declared") nItems = sum(len(x) for x in facsMapping.values()) plural = "" if nItems == 1 else "s" console(f"{nItems} scan{plural} declared and mapped.") infoFile = f"{reportDir}/facsProblems.yml" facsNotUsed = {} for xmlPath, mapping in 
facsMapping.items(): facsEncountered = set(pageScans[xmlPath]) thisFacsNotUsed = {} for facs in mapping: if facs not in facsEncountered: kind = facsKind[xmlPath][facs] thisFacsNotUsed.setdefault(kind, []).append(facs) if len(thisFacsNotUsed): facsNotUsed[xmlPath] = thisFacsNotUsed facsProblems = {} nFacsNotDeclared = sum(len(x) for x in facsNotDeclared.values()) nSurfacesNotUsed = sum(len(x.get("surface", [])) for x in facsNotUsed.values()) nZonesNotUsed = sum(len(x.get("zone", [])) for x in facsNotUsed.values()) if nFacsNotDeclared: plural = "" if nFacsNotDeclared == 1 else "s" console(f"{nFacsNotDeclared} undeclared scan{plural}", error=True) facsProblems["facsNotDeclared"] = { xmlPath: sorted(x) for (xmlPath, x) in facsNotDeclared.items() if len(x) } if nSurfacesNotUsed: plural = "" if nSurfacesNotUsed == 1 else "s" console(f"{nSurfacesNotUsed} unused surface{plural}", error=True) if nZonesNotUsed: plural = "" if nZonesNotUsed == 1 else "s" console(f"{nZonesNotUsed} unused zone{plural}", error=True) facsProblems["facsNotUsed"] = facsNotUsed writeYaml(facsProblems, asFile=infoFile) infoFile = f"{reportDir}/zoneErrors.yml" nIncomplete = len(zoneRegionIncomplete) plural = "" if nIncomplete == 1 else "s" if nIncomplete: console(f"{nIncomplete} missing zone region specifier{plural}", error=True) console(f"See {infoFile}", error=True) writeYaml(zoneRegionIncomplete, asFile=infoFile) def writeNamespaces(self, verbose=0): reportDir = self.reportDir errorFile = f"{reportDir}/namespaces.txt" param = self.param procins = param.procins out = self.out tagByNs = out.tagByNs nProcins = out.nProcins nErrors = 0 nTags = len(tagByNs) with fileOpen(errorFile, mode="w") as fh: for tag, nsInfo in sorted( tagByNs.items(), key=lambda x: (-len(x[1]), x[0]) ): label = "OK" nNs = len(nsInfo) if nNs > 1: nErrors += 1 label = "XX" for ns, amount in sorted(nsInfo.items(), key=lambda x: (-x[1], x[0])): fh.write( f"{label} {nNs:>2} namespace for " f"{tag:<16} : {amount:>5}x {ns}\n" ) if verbose >= 0: if procins: plural = "" if nProcins == 1 else "s" console(f"{nProcins} processing instruction{plural} encountered.") console( ( f"{nTags} tags of which {nErrors} with multiple namespaces " f"written to {errorFile}" if verbose >= 0 or nErrors else "Namespaces OK" ), error=nErrors > 0, ) def writeReport(self, verbose=0): reportDir = self.reportDir reportFile = f"{reportDir}/elements.txt" param = self.param kindLabels = param.kindLabels out = self.out report = out.report with fileOpen(reportFile, mode="w") as fh: fh.write( "Inventory of tags and attributes in the source XML file(s).\n" "Contains the following sections:\n" ) for label in kindLabels.values(): fh.write(f"\t{label}\n") fh.write("\n\n") infoLines = 0 def writeAttInfo(tag, att, attInfo): nonlocal infoLines nl = "" if tag == "" else "\n" tagRep = "" if tag == "" else f"<{tag}>" attRep = "" if att == "" else f"{att}=" atts = sorted(attInfo.items()) (val, amount) = atts[0] fh.write(f"{nl}\t{tagRep:<18} " f"{attRep:<11} {amount:>7}x {val}\n") infoLines += 1 for val, amount in atts[1:]: fh.write(f"""\t{'':<18} {'':<11} {amount:>7}x {val}\n""") infoLines += 1 def writeTagInfo(tag, tagInfo): nonlocal infoLines tags = sorted(tagInfo.items()) (att, attInfo) = tags[0] writeAttInfo(tag, att, attInfo) infoLines += 1 for att, attInfo in tags[1:]: writeAttInfo("", att, attInfo) for kind, label in kindLabels.items(): fh.write(f"\n{label}\n") for tag, tagInfo in sorted(report[kind].items()): writeTagInfo(tag, tagInfo) if verbose >= 0: console(f"{infoLines} info line(s) written 
to {reportFile}") def writeElemTypes(self, verbose=0): reportDir = self.reportDir out = self.out elementDefs = out.elementDefs modelInv = out.modelInv elemsCombined = {} modelSet = set() for schemaOverride, eDefs in elementDefs.items(): model = modelInv[schemaOverride] modelSet.add(model) for tag, (typ, mixed) in eDefs.items(): elemsCombined.setdefault(tag, {}).setdefault(model, {}) elemsCombined[tag][model]["typ"] = typ elemsCombined[tag][model]["mixed"] = mixed tagReport = {} for tag, tagInfo in elemsCombined.items(): tagLines = [] tagReport[tag] = tagLines if None in tagInfo: teiInfo = tagInfo[None] teiTyp = teiInfo["typ"] teiMixed = teiInfo["mixed"] teiTypRep = "??" if teiTyp is None else typ teiMixedRep = ( "??" if teiMixed is None else "mixed" if teiMixed else "pure" ) mds = ["TEI"] for model in sorted(x for x in tagInfo if x is not None): info = tagInfo[model] typ = info["typ"] mixed = info["mixed"] if typ == teiTyp and mixed == teiMixed: mds.append(model) else: typRep = "" if typ == teiTyp else "??" if typ is None else typ mixedRep = ( "" if mixed == teiMixed else ( "??" if mixed is None else "mixed" if mixed else "pure" ) ) tagLines.append((tag, [model], typRep, mixedRep)) tagLines.insert(0, (tag, mds, teiTypRep, teiMixedRep)) else: for model in sorted(tagInfo): info = tagInfo[model] typ = info["typ"] mixed = info["mixed"] typRep = "??" if typ is None else typ mixedRep = "??" if mixed is None else "mixed" if mixed else "pure" tagLines.append((tag, [model], typRep, mixedRep)) reportFile = f"{reportDir}/types.txt" with fileOpen(reportFile, mode="w") as fh: for tag in sorted(tagReport): tagLines = tagReport[tag] for tag, mds, typ, mixed in tagLines: model = ",".join(mds) fh.write(f"{tag:<18} {model:<18} {typ or '':<7} {mixed or '':<5}\n") if verbose >= 0: console(f"{len(elemsCombined)} tag(s) type info written to {reportFile}") def writeLbParents(self, verbose=0): reportDir = self.reportDir reportFile = f"{reportDir}/lb-parents.txt" out = self.out lbParents = out.lbParents with fileOpen(reportFile, "w") as fh: for parent, n in sorted(lbParents.items()): fh.write(f"{n:>5} x {parent}\n") if verbose >= 0: console(f"lb-parent info written to {reportFile}") def writeIdRefs(self, verbose=0): reportDir = self.reportDir reportIdFile = f"{reportDir}/ids.txt" reportRefFile = f"{reportDir}/refs.txt" out = self.out refs = out.refs ids = out.ids ih = fileOpen(reportIdFile, mode="w") rh = fileOpen(reportRefFile, mode="w") refdIds = collections.Counter() missingIds = set() totalRefs = 0 totalRefsU = 0 totalResolvable = 0 totalResolvableU = 0 totalDangling = 0 totalDanglingU = 0 seenItems = set() for file, items in refs.items(): rh.write(f"{file}\n") resolvable = 0 resolvableU = 0 dangling = 0 danglingU = 0 for item, n in sorted(items.items()): totalRefs += n if item in seenItems: newItem = False else: seenItems.add(item) newItem = True totalRefsU += 1 (target, idv) = item if target not in ids or idv not in ids[target]: status = "dangling" dangling += n if newItem: missingIds.add((target, idv)) danglingU += 1 else: status = "ok" resolvable += n refdIds[(target, idv)] += n if newItem: resolvableU += 1 rh.write(f"\t{status:<10} {n:>5} x {target} # {idv}\n") msgs = ( f"\tDangling: {dangling:>4} x {danglingU:>4}", f"\tResolvable: {resolvable:>4} x {resolvableU:>4}", ) for msg in msgs: rh.write(f"{msg}\n") totalResolvable += resolvable totalResolvableU += resolvableU totalDangling += dangling totalDanglingU += danglingU if verbose >= 0: console(f"Refs written to {reportRefFile}") msgs = ( f"\tresolvable: 
{totalResolvableU:>4} in {totalResolvable:>4}", f"\tdangling: {totalDanglingU:>4} in {totalDangling:>4}", f"\tALL: {totalRefsU:>4} in {totalRefs:>4} ", ) for msg in msgs: console(msg) totalIds = 0 totalIdsU = 0 totalIdsM = 0 totalIdsRefd = 0 totalIdsRefdU = 0 totalIdsUnused = 0 for file, items in ids.items(): totalIds += len(items) ih.write(f"{file}\n") unique = 0 multiple = 0 refd = 0 refdU = 0 unused = 0 for item, n in sorted(items.items()): nRefs = refdIds.get((file, item), 0) if n == 1: unique += 1 else: multiple += 1 if nRefs == 0: unused += 1 else: refd += nRefs refdU += 1 status1 = f"{n}x" plural = "" if nRefs == 1 else "s" status2 = f"{nRefs}ref{plural}" ih.write(f"\t{status1:<8} {status2:<8} {item}\n") msgs = ( f"\tUnique: {unique:>4}", f"\tNon-unique: {multiple:>4}", f"\tUnused: {unused:>4}", f"\tReferenced: {refd:>4} x {refdU:>4}", ) for msg in msgs: ih.write(f"{msg}\n") totalIdsU += unique totalIdsM += multiple totalIdsRefdU += refdU totalIdsRefd += refd totalIdsUnused += unused if verbose >= 0: console(f"Ids written to {reportIdFile}") msgs = ( f"\treferenced: {totalIdsRefdU:>4} by {totalIdsRefd:>4}", f"\tnon-unique: {totalIdsM:>4}", f"\tunused: {totalIdsUnused:>4}", f"\tALL: {totalIdsU:>4} in {totalIds:>4}", ) for msg in msgs: console(msg) def readSchemas(self, verbose=0): schemaDir = self.schemaDir param = self.param models = param.models out = self.out out.modelXsd = {} out.modelMap = {} out.modelInfo = {} out.modelInv = {} A = Analysis(verbose=verbose) self.A = A newModels = [] schemaFiles = dict(rng={}, xsd={}) for model in [None] + models: if type(model) is dict: (model, href) = list(model.items())[0] out.modelMap[href] = model if model is not None: newModels.append(model) for kind in ("rng", "xsd"): schemaFile = ( A.getBaseSchema()[kind] if model is None else f"{schemaDir}/{model}.{kind}" ) if fileExists(schemaFile): schemaFiles[kind][model] = schemaFile if ( kind == "rng" or kind == "xsd" and model not in schemaFiles["rng"] ): out.modelInfo[model] = schemaFile if model in schemaFiles["rng"] and model not in schemaFiles["xsd"]: schemaFileXsd = f"{schemaDir}/{model}.xsd" result = A.fromrelax(schemaFiles["rng"][model], schemaFileXsd) if not result: console( f"Could not convert relax schema {model} to xsd", error=True ) self.good = False if result is None: self.severeError = True return schemaFiles["xsd"][model] = schemaFileXsd baseSchema = schemaFiles["xsd"][None] out.modelXsd[None] = baseSchema out.modelInv[(baseSchema, None)] = None for model in newModels: override = schemaFiles["xsd"][model] out.modelXsd[model] = override out.modelInv[(baseSchema, override)] = model def getSwitches(self, xmlPath): verbose = self.verbose A = self.A param = self.param models = param.models templates = param.templates adaptations = param.adaptations triggers = param.triggers out = self.out modelMap = out.modelMap text = None found = {} for kind, allOfKind in ( ("model", models), ("adaptation", adaptations), ("template", templates), ): if text is None: with fileOpen(xmlPath) as fh: text = fh.read() found[kind] = None if kind == "model": result = A.getModel(text, modelMap) if result is None or result == "tei_all": result = None else: result = None triggerRe = triggers[kind] if triggerRe is not None: match = triggerRe.search(text) result = match.group(1) if match else None if result is not None and result not in allOfKind: if verbose >= 0: console(f"unavailable {kind} {result} in {ux(xmlPath)}") result = None found[kind] = result return (found["model"], found["adaptation"], 
found["template"]) def getParser(self): """Configure the LXML parser. See [parser options](https://lxml.de/parsing.html#parser-options). Returns ------- object A configured LXML parse object. """ param = self.param procins = param.procins return etree.XMLParser( remove_blank_text=False, collect_ids=False, remove_comments=True, remove_pis=not procins, huge_tree=True, ) def parseXML(self, fileName, fileOrText): """Parse an XML source. This is not meant to validate the XML, only to parse the XML into elements, attributes, and processing instructions, etc. Validity can be checked by means of `tff.tools.xmlschema.Analysis.validate` as is done in the check task. Parameters ---------- fileName: indicator of the file name, does not have to be the full path, only used in error messages. fileOrText: string Either the full path of an XML file, or a string of raw XML text. parser: object A configured LXML parser object. Returns ------- object | void The root of the resulting parse tree if the parsing succeeded, else None. If the parsing failed, a message is written to stderr. """ parser = self.parser try: tree = etree.parse(fileOrText, parser) result = tree.getroot() except Exception as e: console(f"{fileName}: {str(e)}", error=True) result = None return result def getXML(self): """Make an inventory of the TEI source files. Returns ------- list of list | list of string | string If section model I is in force: The outer list has sorted entries corresponding to folders under the TEI input directory. Each such entry consists of the folder name and an inner list that contains the file names in that folder, sorted. If section model II is in force: It is the name of the single XML file. If section model III is in force: It is a list of multiple XML files """ verbose = self.verbose sourceDir = self.sourceDir param = self.param sectionModel = param.sectionModel if verbose == 1: console(f"Section model {sectionModel}") if sectionModel == "I": backMatter = param.backMatter IGNORE = "__ignore__" xmlFilesRaw = collections.defaultdict(list) with scanDir(sourceDir) as dh: for folder in dh: folderName = folder.name if folderName == IGNORE: continue if not folder.is_dir(): continue with scanDir(f"{sourceDir}/{folderName}") as fh: for file in fh: fileName = file.name if not ( fileName.lower().endswith(".xml") and file.is_file() ): continue xmlFilesRaw[folderName].append(fileName) xmlFiles = [] hasBackMatter = False for folderName in sorted(xmlFilesRaw, key=versionSort): if folderName == backMatter: hasBackMatter = True else: fileNames = xmlFilesRaw[folderName] xmlFiles.append([folderName, sorted(fileNames)]) if hasBackMatter: fileNames = xmlFilesRaw[backMatter] xmlFiles.append([backMatter, sorted(fileNames)]) return xmlFiles if sectionModel == "II": xmlFile = None with scanDir(sourceDir) as fh: for file in fh: fileName = file.name if not (fileName.lower().endswith(".xml") and file.is_file()): continue xmlFile = fileName break return xmlFile if sectionModel == "III": xmlFiles = [] with scanDir(sourceDir) as fh: for file in fh: fileName = file.name if not (fileName.lower().endswith(".xml") and file.is_file()): continue xmlFiles.append(fileName) return sorted(xmlFiles)
Methods
def analyse(self, root, xmlPath)
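The zone handling in `analyse()` turns the four zone coordinates into an IIIF-style percentage region. A worked example of just that computation, with made-up coordinates, assuming (as the unpacking in the source suggests) that `ZONE_ATTS` lists the attributes in the order ulx, uly, lrx, lry:

# hypothetical zone: upper left at (10, 20), lower right at (40, 70)
(ulx, uly, lrx, lry) = (10, 20, 40, 70)

# the formula used in analyse(): x, y, width, height, all in percentages
zoneRegion = f"pct:{ulx},{uly},{lrx - ulx},{lry - uly}"
print(zoneRegion)  # pct:10,20,30,50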
def fileInventory(self, xmlPath)
def getParser(self)
def getSwitches(self, xmlPath)
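The template and adaptation triggers that `getSwitches()` applies are the regexes compiled in the constructor. A small sketch of how such a trigger behaves, with a made-up trigger `doc@template` (tag `doc`, attribute `template`); the regex construction is copied from the constructor above:

import re

(theTag, theAtt) = ("doc", "template")  # hypothetical trigger "doc@template"
tagPat = re.escape(theTag)
triggerRe = re.compile(rf"""<{tagPat}\b[^>]*?{theAtt}=['"]([^'"]+)['"]""")

text = '<doc n="1" template="letter">'
match = triggerRe.search(text)
print(match.group(1) if match else None)  # letter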
def getXML(self)
Make an inventory of the TEI source files.

Returns: list of list | list of string | string

If section model I is in force: the outer list has sorted entries corresponding to folders under the TEI input directory. Each such entry consists of the folder name and an inner list that contains the file names in that folder, sorted.

If section model II is in force: it is the name of the single XML file.

If section model III is in force: it is the sorted list of XML files.
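To make the three shapes concrete, here are illustrative return values; the folder and file names are made up:

# section model I: a list of [folderName, sorted fileNames] pairs
[["vol1", ["01.xml", "02.xml"]], ["vol2", ["03.xml"]]]

# section model II: the name of the single XML file
"corpus.xml"

# section model III: the sorted list of XML file names
["a.xml", "b.xml", "c.xml"]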
def inventory(self, schemaDir, reportDir, carryon=False, verbose=None)
Implementation of the "check" task.

It validates the TEI. Then it makes an inventory of all elements and attributes in the TEI files. If tags are used in multiple namespaces, it will be reported.

Conflation of namespaces
The TEI to TF conversion does construct node types and attributes without taking namespaces into account. However, the parsing process is namespace aware.

The inventory lists all elements and attributes, and many attribute values. But it represents any digit with `N`, and some attributes that contain ids or keywords are reduced to the value `X`. This information reduction helps to get a clear overview; see the sketch after this description.

It writes reports to the `reportDir`:

`errors.txt`: validation errors
`elements.txt`: element / attribute inventory.

Thoroughness of validation
All XML files for the same model will be validated by a single call to the validator. This is fast, but the consequence is that after a fatal error the process terminates without validating the remaining files. In that case, we'll redo validation for each file separately.

Parameters

schemaDir: string
    Directory of the RNG/XSD schema files. We use these files as custom TEI schemas, but to be sure, we still analyse the full TEI schema and use the schemas here as a set of overriding element definitions.
reportDir: string
    The directory where the report files will be generated.
carryon: boolean, optional False
    Whether to carry on with making an inventory if validation has failed. Normally, validation errors make it unlikely that further processing of the XML will succeed. But if the validation errors appear to be mild, and you want an inventory, you can pass `True` to this parameter at your own risk.
verbose: integer, optional None
    Produce no (-1), some (0) or many (1) progress and reporting messages. If `None`, the value will be taken from the corresponding object member.
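The value reduction described above is easy to see in isolation. This sketch mirrors the substitution that `analyse()` applies to attribute values; the date value is made up:

import re

NUM_RE = re.compile(r"[0-9]")

print(NUM_RE.sub("N", "1578-06-12"))  # NNNN-NN-NN

# values of id-like attributes (id, key, target, value) are not inventoried
# per value at all: they are collapsed to the single value "X"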
def parseXML(self, fileName, fileOrText)
Parse an XML source.

This is not meant to validate the XML, only to parse it into elements, attributes, processing instructions, etc. Validity can be checked by means of `ti.tools.xmlschema.Analysis.validate`, as is done in the check task.

Parameters

fileName: string
    Indicator of the file name; it does not have to be the full path, and it is only used in error messages.
fileOrText: string
    Either the full path of an XML file, or a string of raw XML text.

Returns

object | void
    The root of the resulting parse tree if the parsing succeeded, else None. If the parsing failed, a message is written to stderr.
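A minimal usage sketch, assuming a TEI instance `tei` on which `inventory()` has already set up the parser; the file path is made up:

xmlPath = "vol1/01.xml"  # hypothetical
root = tei.parseXML(xmlPath, f"{tei.sourceDir}/{xmlPath}")

if root is not None:
    print(etree.QName(root.tag).localname)  # e.g. TEI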
def readSchemas(self, verbose=0)
def validate(self, verbose=0)
def writeElemTypes(self, verbose=0)
def writeErrors(self, verbose=0)
Write the errors to a file.
def writeFacs(self, verbose=0)
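The `facsMapping` structure that this method writes out maps, per XML file, a surface or zone id to a scan file plus a region, as built up in `analyse()`. An illustrative value, with made-up ids and scan names:

# per XML file: id => [scanFile, region]
{
    "vol1/01.xml": {
        "s1": ["scan01.jpg", "full"],             # a surface
        "z1": ["scan01.jpg", "pct:10,20,30,50"],  # a zone on that surface
    }
}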
def writeFileInfo(self, verbose=0)
Write the folder/file info to a file.
def writeIdRefs(self, verbose=0)
def writeLbParents(self, verbose=0)
def writeNamespaces(self, verbose=0)
def writeReport(self, verbose=0)