# Nodemaps

When different versions of a TF dataset exist, it makes sense to map the nodes from the older version to nodes of the newer version.

If we have annotations on the older version (possibly created with considerable man-machine labor), we can use such node mappings to transfer the annotations to the newer version without redoing all the work.

Mapping the nodes consists of two stages:

1. **slot mapping** This is very dependent on what happened between versions.
2. **node mappings per node type** These can be generically constructed once we have a slot mapping.

This module contains the function `makeVersionMapping`, which furnishes node mappings for node types, given an existing slot mapping.

# Nodes in general

The basic idea we use for the general case is that nodes are linked to slots. In Text-Fabric, the standard `oslots` edge feature lists, for each non-slot node, the slots it is linked to.

Combining the newly created slot mapping between versions with the `oslots` feature, we can extend the slot mapping into a general node mapping.

In order to map a node $n$ in version $V$, we look at its slots $s$, use the already established slot mapping to map these to slots $t$ in version $W$, and collect the nodes $m$ in version $W$ that are linked to those $t$. They are good candidates for the mapping.

![5](../images/diffs.005.png)
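The mapping step just described can be sketched with plain Python data structures. This is a sketch only: `slotsOf`, `slotMap`, and `nodesOfSlot` are stand-ins for the real TF APIs (`E.oslots.s`, the slot mapping dict, and `L.u` respectively).

```python
from itertools import chain

def candidates(n, slotsOf, slotMap, nodesOfSlot):
    """Collect candidate counterparts in version W for node n of version V."""
    # Map the slots of node n (version V) through the slot mapping ...
    mappedSlots = set(
        chain.from_iterable(slotMap[s] for s in slotsOf(n) if s in slotMap)
    )
    # ... and collect every node in version W linked to any mapped slot.
    return set(chain.from_iterable(nodesOfSlot(t) for t in mappedSlots))
```

Every node in $W$ that shares at least one slot with the mapped slot set shows up as a candidate; the refinement below narrows these down.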

# Refinements

When we try to match nodes across versions, based on slot containment, we also respect their otypes. So we will not try to match a clause to a phrase. We make implicit use of the fact that for most otypes, the members contain disjoint slots.

# Multiple candidates

Of course, things are not always as neat as in the diagram. Textual objects may have split, shifted, or collapsed. In general we find 0 or more candidates. Even if we find exactly one candidate, it does not have to be a perfect match. A typical situation is this:

![6](../images/diffs.006.png)

We do not find a node $m\in W$ that occupies the mapped slots exactly. Instead, we find that the target area is split between two candidates that also reach outside the target area.

In such cases, we make edges to all such candidates, but we add a dissimilarity measure. If $m$ is the collection of slots mapped from $n$, and $m_1$ is a candidate for $n$, meaning $m_1$ has overlap with $m$, then the *dissimilarity* of $m_1$ is defined as:

$$|m_1\cup m| - |m_1\cap m|$$

In words: the number of slots in the union of $m_1$ and $m$ minus the number of slots in their intersection.

In other words: $m_1$ gets a penalty for

• each slot $s\in m_1$ that is not in the mapped slots $m$;
• each mapped slot $t\in m$ that is not in $m_1$.

If a candidate occupies exactly the mapped slots, the dissimilarity is 0. If there is only one such candidate of the right type, the case is completely clear, and we do not add a dissimilarity value to the edge.

If there are more candidates, all of them will get an edge, and those edges will contain the dissimilarity value, even if that value is $0$.
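In code, the dissimilarity is just the size of the symmetric difference of the two slot sets; a minimal sketch (the module itself computes the same quantity inline from the `oslots` data):

```python
def dissimilarity(mapped, candidate):
    # |m1 ∪ m| - |m1 ∩ m|: the slots that are in either set but not in both
    return len(mapped | candidate) - len(mapped & candidate)
```

A perfect candidate has dissimilarity 0; every extra or missing slot adds 1 to the penalty.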

# Diagnostic labels

We report the success in establishing the match between non-slot nodes. We do so per node type, and for each node type we list a few statistics, both in absolute numbers and as a percentage of the total number of nodes of that type in the source version.

We count the nodes that fall in each of the following cases. The list of cases is ordered by decreasing success of the mapping.

• b **unique, perfect**: there is only one match for the mapping and it is a perfect one in terms of slots linked to it;
• d **multiple, one perfect**: there are multiple matches, but at least one is perfect; this occurs typically if nodes of a type are linked to nested and overlapping sequences of slots;
• c **unique, imperfect**: there is only one match, but it is not perfect; this indicates that some boundary reorganization has happened between the two versions, and that some slots of the source node have been cut off in the target node; yet the fact that the source node and the target node correspond is clear;
• f **multiple, cleanly composed**: in this case the source node corresponds to a set of matches which together cleanly cover the mapped slots of the source node; in other words: the original node has been split into several parts;
• e **multiple, non-perfect**: all remaining cases where there are matches; these situations can be the result of more intrusive changes; if only a small set of nodes is involved, they require closer inspection;
• a **not mapped**: these are nodes for which no match could be found.
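The case analysis above can be condensed into a small classifier over slot sets. This is a sketch mirroring the logic inside the module's `makeNodeMapping`; the function name and signature are illustrative, not part of the API.

```python
def diagnose(mappedSlots, candidateSlotSets):
    """Return the diagnostic label for one source node.

    mappedSlots: the slots of the source node, mapped to the target version;
    candidateSlotSets: one slot set per candidate target node.
    """
    if not candidateSlotSets:
        return "a"  # not mapped
    dis = [
        len(mappedSlots | ms) - len(mappedSlots & ms)
        for ms in candidateSlotSets
    ]
    if len(candidateSlotSets) == 1:
        # unique: perfect (b) or imperfect (c)
        return "b" if dis[0] == 0 else "c"
    if min(dis) == 0:
        return "d"  # multiple, one perfect
    # cleanly composed: the candidates tile the mapped slots exactly;
    # for disjoint candidates inside mappedSlots, each dissimilarity is
    # |mappedSlots| - |candidate|, so the sum is |mappedSlots| * (k - 1)
    allSlots = set().union(*candidateSlotSets)
    composed = allSlots == mappedSlots and sum(dis) == len(mappedSlots) * (
        len(candidateSlotSets) - 1
    )
    return "f" if composed else "e"
```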

An involved example of creating node mappings between versions (not using this code) is [versionMappings.ipynb](https://nbviewer.jupyter.org/github/ETCBC/bhsa/blob/master/programs/versionMappings.ipynb).

A simpler example, using this code, is [map.ipynb](https://nbviewer.jupyter.org/github/clariah/wp6-missieven/blob/master/programs/map.ipynb).

## Usage

```python
from tf.dataset import Versions

V = Versions(api, va, vb, slotMap)
V.makeVersionMapping()
```

"""
# Nodemaps

When different versions of a TF dataset exist, it makes sense to map the nodes from the
older version to nodes of the newer version.

If we have annotations on the older version
(possibly created with considerable man-machine labor),
we can use such node mappings to transfer the annotations to the newer version without
redoing all the work.

Mapping the nodes consists of two stages:

1. **slot mapping** This is very dependent on what happened between versions.
2. **node mappings per node type** These can be generically constructed
once we have a slot mapping.

This module contains the function `makeVersionMapping`, which furnishes node mappings
for node types, given an existing slot mapping.

# Nodes in general
The basic idea we use for the general case is that nodes are linked to slots.
In text-fabric, the standard oslots edge feature lists for each non-slot node
the slots it is linked to.

Combining the just created slot mappings between versions and the oslots feature,
we can extend the slot mapping into a general node mapping.

In order to map a node $n$ in version $V$, we look at its slots $s$,
use the already established *slot mapping* to map these to slots
$t$ in version $W$,
and collect the nodes $m$ in version $W$ that are linked to those $t$.
They are good candidates for the mapping.

![5](../images/diffs.005.png)

# Refinements

When we try to match nodes across versions, based on slot containment, we also respect
their otypes. So we will not try to match a clause to a phrase.
We make implicit use of the fact that for most otypes, the members
contain disjoint slots.

# Multiple candidates
Of course, things are not always as neat as in the diagram.
Textual objects may have split, or shifted,
or collapsed.
In general we find 0 or more candidates.
Even if we find exactly one candidate, it does not have to be a perfect match.
A typical situation is this:

![6](../images/diffs.006.png)

We do not find a node $m\in W$ that occupies the mapped slots exactly.
Instead, we find that the target area is split between two candidates who
also reach outside the target area.

In such cases, we make edges to all such candidates, but we add a dissimilarity measure.
If $m$ is the collection of slots mapped from $n$, and $m_1$ is a candidate
for $n$, meaning $m_1$ has overlap with $m$,
then the *dissimilarity* of $m_1$ is defined as:

$$|m_1\cup m| - |m_1\cap m|$$

In words: the number of slots in the union of $m_1$ and $m$
minus the number of slots in their intersection.

In other words: $m_1$ gets a penalty for

*   each slot $s\in m_1$ that is not in the mapped slots $m$;
*   each mapped slot $t\in m$ that is not in $m_1$.

If a candidate occupies exactly the mapped slots, the dissimilarity is 0.
If there is only one such candidate of the right type, the case is completely clear,
and we do not add a dissimilarity value to the edge.

If there are more candidates, all of them will get an edge,
and those edges will contain the dissimilarity value, even if that value is $0$.

# Diagnostic labels
We report the success in establishing the match between non-slot nodes.
We do so per node type, and for each node type we list a few statistics,
both in absolute numbers and in percentage of the total amount of nodes of that
type in the source version.

We count the nodes that fall in each of the following cases.
The list of cases is ordered by decreasing success of the mapping.

*   b **unique, perfect**:
there is only one match for the mapping and it is a perfect one
in terms of slots linked to it;
*   d
**multiple, one perfect**:
there are multiple matches, but at least one is perfect;
this occurs typically if nodes of a type are linked to nested
and overlapping sequences of slots;
*   c
**unique, imperfect**:
there is only one match, but it is not perfect;
this indicates that some boundary reorganization has happened
between the two versions,
and that some slots of the source node have been cut off in the target node;
yet the fact that the source node and the target node correspond is clear;
*   f
**multiple, cleanly composed**:
in this case the source node corresponds to a bunch of matches,
which together cleanly cover the mapped slots of the source node;
in other words: the original node has been split in several parts;
*   e
**multiple, non-perfect**:
all remaining cases where there are matches;
these situations can be the result of more intrusive changes;
if it turns out to be a small set they do require closer inspection;
*   a
**not mapped**:
these are nodes for which no match could be found.

An involved example of creating node mappings between versions (not using this code) is
[versionMappings.ipynb](https://nbviewer.jupyter.org/github/ETCBC/bhsa/blob/master/programs/versionMappings.ipynb).

A simpler example, using this code is
[map.ipynb](https://nbviewer.jupyter.org/github/clariah/wp6-missieven/blob/master/programs/map.ipynb).

## Usage

```python
from tf.dataset import Versions

V = Versions(api, va, vb, slotMap)
V.makeVersionMapping()
```

"""

import sys
import collections
import time
from itertools import chain

STAT_LABELS = collections.OrderedDict(
    b="unique, perfect",
    d="multiple, one perfect",
    c="unique, imperfect",
    f="multiple, cleanly composed",
    e="multiple, non-perfect",
    a="not mapped",
)

TIMESTAMP = None


def _duration():
    global TIMESTAMP
    if TIMESTAMP is None:
        TIMESTAMP = time.time()

    interval = time.time() - TIMESTAMP
    if interval < 10:
        return "{: 2.2f}s".format(interval)
    interval = int(round(interval))
    if interval < 60:
        return "{:>2d}s".format(interval)
    if interval < 3600:
        return "{:>2d}m {:>02d}s".format(interval // 60, interval % 60)
    return "{:>2d}h {:>02d}m {:>02d}s".format(
        interval // 3600, (interval % 3600) // 60, interval % 60
    )

def caption(level, heading, good=None, newLine=True, continuation=False):
    prefix = "" if good is None else "SUCCESS " if good else "FAILURE "
    duration = "" if continuation else "{:>11} ".format(_duration())
    reportHeading = "{}{}{}".format(duration, prefix, heading)

    if level == 0:  # non-heading message
        decoration = "" if continuation else "| "
        formattedString = "{}{}".format(decoration, reportHeading)
    elif level == 1:  # pipeline level
        formattedString = """
##{}##
# {} #
# {:<90} #
# {} #
##{}##
""".format(
            "#" * 90, " " * 90, reportHeading, " " * 90, "#" * 90
        )
    elif level == 2:  # repo level
        formattedString = """
**{}**
* {} *
* {:<90} *
* {} *
**{}**
""".format(
            "*" * 90, " " * 90, reportHeading, " " * 90, "*" * 90
        )
    elif level == 3:  # task level
        formattedString = """
--{}--
- {:<90} -
--{}--
""".format(
            "-" * 90, reportHeading, "-" * 90
        )
    elif level == 4:  # caption within task execution
        formattedString = """..{}..
. {:<90} .
..{}..""".format(
            "." * 90, reportHeading, "." * 90
        )
    if newLine:
        print(formattedString)
    else:
        sys.stdout.write(formattedString)

class Versions:
    def __init__(self, api, va, vb, slotMap=None):
        """Map the nodes of a node type between two versions of a TF dataset.

        There are two scenarios in which you can use this object:

        **Extend a slot map to a node mapping**.

        If this object is initialized with a slotMap, it can
        extend that mapping to a complete node mapping and save the mapping
        as an f"omap@{va}-{vb}" edge feature in the dataset version vb.

        In this use case, the apis for both va and vb should be full TF APIs,
        in particular, they should provide the tf.core.locality API.

        **Use an existing node map to upgrade features from one version to another**.

        In this use case, you do not have to pass a slotMap,
        but in the vb dataset there should be an omap@a-b feature,
        i.e. a node map from the nodes of va to those of vb.

        The apis for both va and vb may be partial TF APIs, which means that they
        only have to provide the tf.core.nodefeature.NodeFeatures (F) and
        tf.core.edgefeature.EdgeFeatures (E) apis.

        Parameters
        ----------
        api: dict
            TF-API objects to several versions of a dataset, keyed by version label
        va: string
            version label of the version whose nodes are the source of the mapping
        vb: string
            version label of the version whose nodes are the target of the mapping
        slotMap: dict, optional None
            The actual mapping between slots of the old version and the new version
        """
        self.api = api
        self.va = va
        self.vb = vb
        self.edge = slotMap
        self.good = True
        for v in (va, vb):
            if v not in api:
                print(f"No TF-API for version {v} in the api parameter.")
                self.good = False

        if not self.good:
            return

        # self.La = api[va].L
        if slotMap is not None:
            self.Lb = api[vb].L
        self.Ea = api[va].E
        self.Esa = api[va].Es
        self.Eb = api[vb].E
        self.Esb = api[vb].Es
        self.Fa = api[va].F
        self.Fsa = api[va].Fs
        self.Fb = api[vb].F
        self.Fsb = api[vb].Fs
        self.TFa = api[va].TF
        self.TFb = api[vb].TF

        self.diagnosis = {}

    def makeNodeMapping(self, nodeType):
        edge = self.edge

        if edge is None:
            print("Cannot make node mapping if no slot mapping is given")
            return False

        va = self.va
        vb = self.vb
        Lb = self.Lb
        Ea = self.Ea
        Eb = self.Eb
        Fa = self.Fa

        Lub = Lb.u
        Eosa = Ea.oslots.s
        Eosb = Eb.oslots.s
        otypesa = Fa.otype.s

        caption(2, "Mapping {} nodes {} ==> {}".format(nodeType, va, vb))

        diag = {}
        caption(
            0, "Extending slot mapping {} ==> {} for {} nodes".format(va, vb, nodeType)
        )
        for n in otypesa(nodeType):
            slots = Eosa(n)
            mappedSlots = set(
                chain.from_iterable(set(edge[s]) for s in slots if s in edge)
            )
            mappedNodes = set(
                chain.from_iterable(set(Lub(s, nodeType)) for s in mappedSlots)
            )
            result = {}
            nMs = len(mappedNodes)
            if nMs == 0:
                diag[n] = "a"

            elif nMs >= 1:
                theseMSlots = {}
                for m in mappedNodes:
                    mSlots = set(Eosb(m))
                    dis = len(mappedSlots | mSlots) - len(mappedSlots & mSlots)
                    result[m] = dis
                    theseMSlots[m] = mSlots

                # we await further case analysis before we put these
                # counterparts of n into the edge

                if nMs == 1:
                    m = list(mappedNodes)[0]
                    dis = result[m]
                    if dis == 0:
                        diag[n] = "b"
                        edge[n] = {m: None}
                        # this is the most frequent case, hence an optimization:
                        # no dis value.
                        # all other cases require the dis value to be passed on,
                        # even if 0
                    else:
                        diag[n] = "c"
                        edge[n] = {m: dis}
                else:
                    edge[n] = result
                    dis = min(result.values())
                    if dis == 0:
                        diag[n] = "d"
                    else:
                        allMSlots = set(
                            chain.from_iterable(
                                set(theseMSlots[m]) for m in mappedNodes
                            )
                        )
                        composed = allMSlots == mappedSlots and sum(
                            result.values()
                        ) == len(mappedSlots) * (len(mappedNodes) - 1)

                        if composed:
                            diag[n] = "f"
                        else:
                            diag[n] = "e"

        self.diagnosis[nodeType] = diag
        caption(0, "\tDone")

    def exploreNodeMapping(self, nodeType):
        va = self.va
        vb = self.vb
        diagnosis = self.diagnosis

        caption(4, "Statistics for {} ==> {} ({})".format(va, vb, nodeType))

        diag = diagnosis[nodeType]
        total = len(diag)
        if total == 0:
            return

        reasons = collections.Counter()

        for (n, dia) in diag.items():
            reasons[dia] += 1

        caption(0, "\t{:<30} : {:6.2f}% {:>7}x".format("TOTAL", 100, total))
        for stat in STAT_LABELS:
            statLabel = STAT_LABELS[stat]
            amount = reasons[stat]
            if amount == 0:
                continue
            perc = 100 * amount / total
            caption(0, "\t{:<30} : {:6.2f}% {:>7}x".format(statLabel, perc, amount))

    def getDiagnosis(self, node=None, label=None):
        """Show the diagnosis of a mapping.

        Depending on the parameters, the result is one of:

        *   a dictionary that maps all nodes of a given type to a diagnostic label
        *   a list of nodes of a node type that have a given diagnostic label
        *   the label of a given node

        Parameters
        ----------
        node: string | int, optional None
            The node or node type that you want to diagnose.
            If a string, it is a node type, and you get information about the nodes
            of that node type.
            If an int, it is a node, and you get information about that node.
            If None, you get information about all nodes.

        label: char, optional None
            If given, it is a diagnostic label (see tf.dataset.nodemaps),
            and the result depends on the value passed to *node*:
            if that is a single node, a boolean is returned telling whether that node
            has the given label; if that is a node type, a tuple is returned of all
            nodes in that node type that have this label;
            if that is None, a tuple is returned of all nodes with that label.

            If not given, the result depends again on the value passed to *node*:
            if that is a single node, you get its diagnostic label;
            if that is a node type, you get a dict, keyed by node, and valued
            by diagnostic label, for all nodes in the node type;
            if that is None, you get a dictionary keyed by node, and valued
            by diagnostic label, but now for all nodes.
        """

        diagnosis = self.diagnosis
        Fa = self.Fa
        otypesa = Fa.otype.s

        if label is None:
            return (
                dict(chain.from_iterable(x.items() for x in diagnosis.values()))
                if node is None
                else diagnosis.get(node, {})
                if type(node) is str
                else diagnosis.get(otypesa(node), {}).get(node, None)
                if type(node) is int
                else None
            )
        else:
            return (
                tuple(
                    y[0]
                    for y in chain.from_iterable(
                        x.items() for x in diagnosis.values()
                    )
                    if y[1] == label
                )
                if node is None
                else tuple(
                    y[0] for y in diagnosis.get(node, {}).items() if y[1] == label
                )
                if type(node) is str
                else diagnosis.get(otypesa(node), {}).get(node, None) == label
                if type(node) is int
                else None
            )

    def legend(self):
        """Show the labels and descriptions of the diagnosis classes.

        When the mapping turns out to be not perfect for a node,
        the result can be categorized according to severity.

        This function shows that classification.
        See also tf.dataset.nodemaps.
        """

        for (acro, desc) in STAT_LABELS.items():
            print(f"{acro} = {desc}")

    def omapName(self):
        va = self.va
        vb = self.vb
        return f"omap@{va}-{vb}"

    def writeMap(self):
        TFb = self.TFb
        va = self.va
        vb = self.vb
        edge = self.edge

        fName = self.omapName()
        caption(4, "Write edge as TF feature {}".format(fName))

        edgeFeatures = {fName: edge}
        metaData = {
            fName: {
                "about": "Mapping from the slots of version {} to version {}".format(
                    va, vb
                ),
                "encoder": "Text-Fabric tf.dataset.nodemaps",
                "valueType": "int",
                "edgeValues": True,
            }
        }
        TFb.save(
            nodeFeatures={},
            edgeFeatures=edgeFeatures,
            metaData=metaData,
        )

    def makeVersionMapping(self):
        Fa = self.Fa

        self.diagnosis = {}

        nodeTypes = Fa.otype.all

        for nodeType in nodeTypes[0:-1]:
            self.makeNodeMapping(nodeType)
            self.exploreNodeMapping(nodeType)

        self.writeMap()

    def migrateFeatures(self, featureNames, location=None):
        """Migrate features from one version to another based on a node map.

        If you have a dataset with several features, and if there is a node map between
        the versions, you can migrate features from the older version to the newer.

        The naive approach per feature is:

        For each node in the old version that has a value for the feature,
        map the node to the new version.
        That could be several nodes.
        Then give all those nodes the value found on the node in the old set.

        The problem with that is this:

        Suppose node n in the old version maps to nodes p and q in the
        new version, but not with the same quality.
        Suppose the value is v. Then the feature assignment will assign
        v to nodes p and q.

        But there could very well be a later node m with value w in the old version
        that also maps to p, and with a lower quality.
        Then later the node p in the new version will be assigned the value w
        because of this, and this is suboptimal.

        So we make sure that for each feature and each node in the new version
        the value assigned to it is the value of the node in the old version that
        maps with the highest quality to the new node.

        Parameters
        ----------
        featureNames: tuple
            The names of the features to migrate.
            They may be node features or edge features or both.
        location: string, optional None
            If absent, the migrated features will be saved in the newer dataset.
            Otherwise it is a path where the new features should be saved.
        """

        omapName = self.omapName()
        TFa = self.TFa
        TFb = self.TFb
        Fsa = self.Fsa
        Esa = self.Esa
        Esb = self.Esb
        va = self.va
        vb = self.vb

        emap = Esb(omapName).f

        nodeFeatures = {}
        edgeFeatures = {}

        def getBestMatches(n, dest):
            result = []
            mapped = emap(n)

            if mapped:
                perfect = [x for x in mapped if x[1] is None]
                if perfect:
                    return [perfect[0]]

                nearPerfect = [x for x in mapped if x[1] == 0]
                if nearPerfect:
                    (m, q) = nearPerfect[0]
                    if m not in dest or m in dest and dest[m][1] is not None:
                        return [nearPerfect[0]]

                for (m, q) in mapped:
                    # these q cannot be None or 0
                    if m in dest:
                        (origVal, origQ) = dest[m]
                        higherQuality = origQ is not None and q < origQ
                        if origVal == val:
                            if higherQuality:
                                result.append((m, q))
                        elif higherQuality:
                            result.append((m, q))
                    else:
                        result.append((m, q))
            return result

        for featureName in featureNames:
            isEdge = TFa.features[featureName].isEdge
            featureObj = (Esa if isEdge else Fsa)(featureName)
            featureInfo = dict(featureObj.items())

            edgeValues = featureObj.doValues if isEdge else False

            dest = edgeFeatures if isEdge else nodeFeatures
            dest[featureName] = {}
            dest = dest[featureName]

            # we will not only store the feature value in dest,
            # but also the quality of the mapping that
            # produced the value.
            # We only override an existing value
            # if the quality of the mapping is better.
            # Moreover, if one mapping has quality 0,
            # it means that the mapping is perfect regarding the material
            # in both nodes, but there are lesser mappings as well.
            # In those cases we do not extend the value to those lesser mappings.
            # Later, we strip the qualities again.

            if isEdge:
                if edgeValues:
                    for (n, ts) in featureInfo.items():
                        matched = getBestMatches(n, dest)
                        if matched:
                            matchedTs = {}
                            for (t, val) in ts.items():
                                for (t, r) in getBestMatches(t, matchedTs):
                                    matchedTs[t] = (val, r)
                            for t in matchedTs:
                                matchedTs[t] = matchedTs[t][0]
                            for (m, q) in matched:
                                dest[m] = (matchedTs, q)
                else:
                    for (n, ts) in featureInfo.items():
                        matched = getBestMatches(n, dest)
                        if matched:
                            matchedTs = {}
                            for t in ts:
                                for (t, r) in getBestMatches(t, matchedTs):
                                    matchedTs[t] = (None, r)
                            matchedTs = set(matchedTs)
                            for (m, q) in matched:
                                dest[m] = (matchedTs, q)
            else:
                for (n, val) in featureInfo.items():
                    for (m, q) in getBestMatches(n, dest):
                        dest[m] = (val, q)

            for m in dest:
                dest[m] = dest[m][0]

        TFb.save(
            nodeFeatures=nodeFeatures,
            edgeFeatures=edgeFeatures,
            location=location,
            module=vb,
        )

## Functions

 def caption(level, heading, good=None, newLine=True, continuation=False) 

## Classes

 class Versions (api, va, vb, slotMap=None) 

Map the nodes of a node type between two versions of a TF dataset.

There are two scenarios in which you can use this object:

**Extend a slot map to a node mapping**.

If this object is initialized with a slotMap, it can extend that mapping to a complete node mapping and save the mapping as an f"omap@{va}-{vb}" edge feature in the dataset version vb.

In this use case, the apis for both va and vb should be full TF APIs, in particular, they should provide the tf.core.locality API.

**Use an existing node map to upgrade features from one version to another**.

In this use case, you do not have to pass a slotMap, but in the vb dataset there should be an omap@a-b feature, i.e. a node map from the nodes of va to those of vb.

The apis for both va and vb may be partial TF APIs, which means that they only have to provide the NodeFeatures (F) and EdgeFeatures (E) apis.

## Parameters

api : dict
    TF-API objects to several versions of a dataset, keyed by version label
va : string
    version label of the version whose nodes are the source of the mapping
vb : string
    version label of the version whose nodes are the target of the mapping
slotMap : dict, optional None
    The actual mapping between slots of the old version and the new version
Expand source code Browse git
class Versions:
def __init__(self, api, va, vb, slotMap=None):
"""Map the nodes of a nodetype between two versions of a TF dataset.

There are two scenarios in which you can use this object:

**Extend a slot map to a node mapping**.

If this object is initialized with a slotMap, it can
extend that mapping to a complete node mapping and save the mapping
as an f"omap@{va}-{vb}" edge feature in the dataset version vb.

In this use case, the apis for both va and vb should be full TF APIs,
in particular, they should provide the tf.core.locality API.

**Use an existing node map to upgrade features from one version to another**.

In this use case, you do not have to pass a slotMap,
but in the vb dataset there should be an omap@a-b feature,
i.e. a node map from the nodes of va to those of vb.

The apis for both va and vb may be partial TF APIs, which means that they
only have to provide the tf.core.nodefeature.NodeFeatures (F) and
tf.core.edgefeature.EdgeFeatures (E) apis.

Parameters
----------
api: dict
TF-API objects to several versions of a dataset, keyed by version label
va: string
version label of the version whose nodes are the source of the mapping
vb: string
version label of the version whose nodes are the target of the mapping
slotMap: dict, optional None
The actual mapping between slots of the old version and the new version
"""
self.api = api
self.va = va
self.vb = vb
self.edge = slotMap
self.good = True
for v in (va, vb):
if va not in api:
print(f"No TF-API for version {va} in the api parameter.")
self.good = False

if not self.good:
return

# self.La = api[va].L
if slotMap is not None:
self.Lb = api[vb].L
self.Ea = api[va].E
self.Esa = api[va].Es
self.Eb = api[vb].E
self.Esb = api[vb].Es
self.Fa = api[va].F
self.Fsa = api[va].Fs
self.Fb = api[vb].F
self.Fsb = api[vb].Fs
self.TFa = api[va].TF
self.TFb = api[vb].TF

self.diagnosis = {}

def makeNodeMapping(self, nodeType):
edge = self.edge

if edge is None:
print("Cannot make node mapping if no slot mapping is given")
return False

va = self.va
vb = self.vb
Lb = self.Lb
Ea = self.Ea
Eb = self.Eb
Fa = self.Fa

Lub = Lb.u
Eosa = Ea.oslots.s
Eosb = Eb.oslots.s
otypesa = Fa.otype.s

caption(2, "Mapping {} nodes {} ==> {}".format(nodeType, va, vb))

diag = {}
caption(
0, "Extending slot mapping {} ==> {} for {} nodes".format(va, vb, nodeType)
)
for n in otypesa(nodeType):
slots = Eosa(n)
mappedSlots = set(
chain.from_iterable(set(edge[s]) for s in slots if s in edge)
)
mappedNodes = set(
chain.from_iterable(set(Lub(s, nodeType)) for s in mappedSlots)
)
result = {}
nMs = len(mappedNodes)
if nMs == 0:
diag[n] = "a"

elif nMs >= 1:
theseMSlots = {}
for m in mappedNodes:
mSlots = set(Eosb(m))
dis = len(mappedSlots | mSlots) - len(mappedSlots & mSlots)
result[m] = dis
theseMSlots[m] = mSlots

# we wait further case analysis before we put these counterparts of n into the edge

if nMs == 1:
m = list(mappedNodes)[0]
dis = result[m]
if dis == 0:
diag[n] = "b"
edge[n] = {m: None}
# this is the most freqent case, hence an optimization: no dis value.
# all other cases require the dis value to be passed on, even if 0
else:
diag[n] = "c"
edge[n] = {m: dis}
else:
edge[n] = result
dis = min(result.values())
if dis == 0:
diag[n] = "d"
else:
allMSlots = set(
chain.from_iterable(
set(theseMSlots[m]) for m in mappedNodes
)
)
composed = allMSlots == mappedSlots and sum(
result.values()
) == len(mappedSlots) * (len(mappedNodes) - 1)

if composed:
diag[n] = "f"
else:
diag[n] = "e"

self.diagnosis[nodeType] = diag
caption(0, "\tDone")

def exploreNodeMapping(self, nodeType):
va = self.va
vb = self.vb
diagnosis = self.diagnosis

caption(4, "Statistics for {} ==> {} ({})".format(va, vb, nodeType))

diag = diagnosis[nodeType]
total = len(diag)
if total == 0:
return

reasons = collections.Counter()

for (n, dia) in diag.items():
reasons[dia] += 1

caption(0, "\t{:<30} : {:6.2f}% {:>7}x".format("TOTAL", 100, total))
for stat in STAT_LABELS:
statLabel = STAT_LABELS[stat]
amount = reasons[stat]
if amount == 0:
continue
perc = 100 * amount / total
caption(0, "\t{:<30} : {:6.2f}% {:>7}x".format(statLabel, perc, amount))

def getDiagnosis(self, node=None, label=None):
"""Show the diagnosis of a mapping.

*    A dictionary that maps all nodes of a given type to a diagnostic label
*    A list of nodes of a node type that have a given diagnostic label
*    The label of a given node

Parameters
----------
node: string | int, optional None
The node or node type that you want to diagnose.
If a string, it is a node type, and you get information about the nodes
of that node type.
If an int, it is a node, and you get information about that node.
If None, you get information about all nodes.

label: char, optional None
If given, it is a diagnostic label (see tf.dataset.nodemaps),
and the result is dependent on the value passed to *node*:
if that is a single node, a boolean is returned telling whether that node
has the given label; if that is a node type, a tuple is returned of all
nodes in that node type that have this label;
if that is None, a tuple is returned of all nodes with that label.

If not given, the result depends again on the value passed to *node*:
if that is a single node, you get its diagnostic label;
if that is a node type, you get a dict, keyed by node, and valued
by diagnostic label, for all nodes in the node type;
if that is None, you get a dictionary keyed by node, and valued
by diagnostic label but now for all nodes.
"""

diagnosis = self.diagnosis
Fa = self.Fa
otypesa = Fa.otype.s

if label is None:
return (
dict(chain.from_iterable(x.items() for x in diagnosis.values()))
if node is None
else diagnosis.get(node, {})
if type(node) is str
else diagnosis.get(otypesa(node), {}).get(node, None)
if type(node) is int
else None
)
else:
return (
tuple(
y[0]
for y in chain.from_iterable(x.items() for x in diagnosis.values())
if y[1] == label
)
if node is None
else tuple(
y[0] for y in diagnosis.get(node, {}).items() if y[1] == label
)
if type(node) is str
else diagnosis.get(otypesa(node), {}).get(node, None) == label
if type(node) is int
else None
)

def legend(self):
"""Show the labels and descriptions of the diagnosis classes.

When the mapping turns out to be not perfect for a node,
the result can be categorized according to severity.

This function shows that classification.
See also tf.dataset.nodemaps.
"""

for (acro, desc) in STAT_LABELS.items():
print(f"{acro} = {desc}")

def omapName(self):
va = self.va
vb = self.vb
return f"omap@{va}-{vb}"

def writeMap(self):
TFb = self.TFb
va = self.va
vb = self.vb
edge = self.edge

fName = self.omapName()
caption(4, "Write edge as TF feature {}".format(fName))

edgeFeatures = {fName: edge}
metaData = {
fName: {
"about": "Mapping from the slots of version {} to version {}".format(
va, vb
),
"encoder": "Text-Fabric tf.dataset.nodemaps",
"valueType": "int",
"edgeValues": True,
}
}
TFb.save(
nodeFeatures={},
edgeFeatures=edgeFeatures,
metaData=metaData,
)

def makeVersionMapping(self):
Fa = self.Fa

self.diagnosis = {}

nodeTypes = Fa.otype.all

for nodeType in nodeTypes[0:-1]:
self.makeNodeMapping(nodeType)
self.exploreNodeMapping(nodeType)

self.writeMap()

def migrateFeatures(self, featureNames, location=None):
"""Migrate features from one version to another based on a node map.

If you have a dataset with several features, and if there is a node map between
the versions, you can migrate features from the older version to the newer.

The naive approach per feature is:

For each node in the old version that has a value for the feature,
map the node to the new version.
That could be several nodes.
Then give all those nodes the value found on the node in the old set.

The problem with that is this:

Suppose node n in the old version maps to nodes p and q in the
new version, but not with the same quality.
Suppose the value is v. Then the feature assignment will assign
v to nodes p and q.

But there could very well be a later node m with value w in the old version
that also maps to p, but with a lower quality.
Then the node p in the new version will later be assigned the value w
because of this, and this is suboptimal.

So we make sure that for each feature and each node in the new version
the value assigned to it is the value of the node in the old version that
maps with the highest quality to the new node.

Parameters
----------
featureNames: tuple
The names of the features to migrate.
They may be node features or edge features or both.
location: string, optional None
If absent, the migrated features will be saved in the newer dataset.
Otherwise it is a path where the new features should be saved.
"""

omapName = self.omapName()
TFa = self.TFa
TFb = self.TFb
Fsa = self.Fsa
Esa = self.Esa
Esb = self.Esb
va = self.va
vb = self.vb

emap = Esb(omapName).f

nodeFeatures = {}
edgeFeatures = {}

def getBestMatches(n, dest):
result = []
mapped = emap(n)

if mapped:
perfect = [x for x in mapped if x[1] is None]
if perfect:
return [perfect[0]]

nearPerfect = [x for x in mapped if x[1] == 0]
if nearPerfect:
(m, q) = nearPerfect[0]
if m not in dest or m in dest and dest[m][1] is not None:
return [nearPerfect[0]]

for (m, q) in mapped:
# these q cannot be None or 0
if m in dest:
(origVal, origQ) = dest[m]
higherQuality = origQ is not None and q < origQ
if origVal == val:
if higherQuality:
result.append((m, q))
elif higherQuality:
result.append((m, q))
else:
result.append((m, q))
return result

for featureName in featureNames:
isEdge = TFa.features[featureName].isEdge
featureObj = (Esa if isEdge else Fsa)(featureName)
featureInfo = dict(featureObj.items())

edgeValues = featureObj.doValues if isEdge else False

dest = edgeFeatures if isEdge else nodeFeatures
dest[featureName] = {}
dest = dest[featureName]

# we will not only store the feature value in dest,
# but also the quality of the mapping that
# produced the value.
# We only override an existing value
# if the quality of the mapping is better.
# Moreover, if one mapping has quality 0,
# it means that the mapping is perfect regarding the material in both nodes
# but there are lesser mappings as well.
# In those cases we do not extend the value to those lesser mappings
# Later, we strip the qualities again.

if isEdge:
if edgeValues:
for (n, ts) in featureInfo.items():
matched = getBestMatches(n, dest)
if matched:
matchedTs = {}
for (t, val) in ts.items():
for (t, r) in getBestMatches(t, matchedTs):
matchedTs[t] = (val, r)
for t in matchedTs:
matchedTs[t] = matchedTs[t][0]
for (m, q) in matched:
dest[m] = (matchedTs, q)

"""
mapped = emap(n)
if mapped:
mappedts = {}
for (t, val) in ts.items():
tmapped = emap(t)
if tmapped:
for tEntry in tmapped:
mappedts[tEntry[0]] = val
for entry in mapped:
dest[entry[0]] = mappedts
"""
else:
for (n, ts) in featureInfo.items():
matched = getBestMatches(n, dest)
if matched:
matchedTs = {}
for t in ts:
for (t, r) in getBestMatches(t, matchedTs):
matchedTs[t] = (None, r)
matchedTs = set(matchedTs)
for (m, q) in matched:
dest[m] = (matchedTs, q)

"""
mapped = emap(n)
if mapped:
mappedts = set()
for t in ts:
tmapped = emap(t)
if tmapped:
for tEntry in tmapped:
for entry in mapped:
dest[entry[0]] = mappedts
"""
else:
for (n, val) in featureInfo.items():
for (m, q) in getBestMatches(n, dest):
dest[m] = (val, q)

for m in dest:
dest[m] = dest[m][0]

TFb.save(
nodeFeatures=nodeFeatures,
edgeFeatures=edgeFeatures,
location=location,
module=vb,
)

### Methods

 def exploreNodeMapping(self, nodeType) 
def exploreNodeMapping(self, nodeType):
va = self.va
vb = self.vb
diagnosis = self.diagnosis

caption(4, "Statistics for {} ==> {} ({})".format(va, vb, nodeType))

diag = diagnosis[nodeType]
total = len(diag)
if total == 0:
return

reasons = collections.Counter()

for (n, dia) in diag.items():
reasons[dia] += 1

caption(0, "\t{:<30} : {:6.2f}% {:>7}x".format("TOTAL", 100, total))
for stat in STAT_LABELS:
statLabel = STAT_LABELS[stat]
amount = reasons[stat]
if amount == 0:
continue
perc = 100 * amount / total
caption(0, "\t{:<30} : {:6.2f}% {:>7}x".format(statLabel, perc, amount))
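The statistics above are a plain frequency count over diagnostic labels. A minimal standalone sketch of the same computation, using a toy diagnosis dict and made-up labels rather than the real `STAT_LABELS`:

```python
import collections

# Toy diagnosis dict: node -> diagnostic label
# (real labels come from STAT_LABELS in tf.dataset.nodemaps)
diag = {1: "b", 2: "b", 3: "a", 4: "c"}

reasons = collections.Counter(diag.values())
total = len(diag)

# Same layout as the caption() calls above: label, percentage, count
lines = ["\t{:<30} : {:6.2f}% {:>7}x".format("TOTAL", 100, total)]
for (label, amount) in sorted(reasons.items()):
    perc = 100 * amount / total
    lines.append("\t{:<30} : {:6.2f}% {:>7}x".format(label, perc, amount))
```

Labels with a zero count are simply absent from the `Counter`, which is why the real method can skip them with a `continue`.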
 def getDiagnosis(self, node=None, label=None) 

Show the diagnosis of a mapping.

• A dictionary that maps all nodes of a given type to a diagnostic label
• A list of nodes of a node type that have a given diagnostic label
• The label of a given node

## Parameters

node : string | int, optional None
The node or node type that you want to diagnose. If a string, it is a node type, and you get information about the nodes of that node type. If an int, it is a node, and you get information about that node. If None, you get information about all nodes.
label : char, optional None

If given, it is a diagnostic label (see tf.dataset.nodemaps), and the result is dependent on the value passed to node: if that is a single node, a boolean is returned telling whether that node has the given label; if that is a node type, a tuple is returned of all nodes in that node type that have this label; if that is None, a tuple is returned of all nodes with that label.

If not given, the result depends again on the value passed to node: if that is a single node, you get its diagnostic label; if that is a node type, you get a dict, keyed by node, and valued by diagnostic label, for all nodes in the node type; if that is None, you get a dictionary keyed by node, and valued by diagnostic label but now for all nodes.
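The dispatch on the type of *node* can be illustrated with a small standalone sketch; the `diagnosis` store and `otype` lookup below are toy data, not the real attributes of the class:

```python
# Toy diagnosis store: node type -> {node: diagnostic label}
diagnosis = {
    "phrase": {10: "b", 11: "c"},
    "clause": {20: "b"},
}
otype = {10: "phrase", 11: "phrase", 20: "clause"}  # toy otype lookup

def getDiagnosis(node=None, label=None):
    # mirrors, in simplified form, the branching described above
    if label is None:
        if node is None:
            return {n: lab for d in diagnosis.values() for (n, lab) in d.items()}
        if type(node) is str:
            return diagnosis.get(node, {})
        return diagnosis.get(otype[node], {}).get(node)
    if node is None:
        return tuple(
            n for d in diagnosis.values() for (n, lab) in d.items() if lab == label
        )
    if type(node) is str:
        return tuple(
            n for (n, lab) in diagnosis.get(node, {}).items() if lab == label
        )
    return diagnosis.get(otype[node], {}).get(node) == label
```

So `getDiagnosis(node=11)` yields a single label, `getDiagnosis(node="phrase", label="b")` yields the phrase nodes carrying label `b`, and `getDiagnosis(node=20, label="b")` yields a boolean.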

 def legend(self) 

Show the labels and descriptions of the diagnosis classes.

When the mapping turns out to be not perfect for a node, the result can be categorized according to severity.

This function shows that classification. See also tf.dataset.nodemaps.

 def makeNodeMapping(self, nodeType) 
def makeNodeMapping(self, nodeType):
edge = self.edge

if edge is None:
print("Cannot make node mapping if no slot mapping is given")
return False

va = self.va
vb = self.vb
Lb = self.Lb
Ea = self.Ea
Eb = self.Eb
Fa = self.Fa

Lub = Lb.u
Eosa = Ea.oslots.s
Eosb = Eb.oslots.s
otypesa = Fa.otype.s

caption(2, "Mapping {} nodes {} ==> {}".format(nodeType, va, vb))

diag = {}
caption(
0, "Extending slot mapping {} ==> {} for {} nodes".format(va, vb, nodeType)
)
for n in otypesa(nodeType):
slots = Eosa(n)
mappedSlots = set(
chain.from_iterable(set(edge[s]) for s in slots if s in edge)
)
mappedNodes = set(
chain.from_iterable(set(Lub(s, nodeType)) for s in mappedSlots)
)
result = {}
nMs = len(mappedNodes)
if nMs == 0:
diag[n] = "a"

elif nMs >= 1:
theseMSlots = {}
for m in mappedNodes:
mSlots = set(Eosb(m))
dis = len(mappedSlots | mSlots) - len(mappedSlots & mSlots)
result[m] = dis
theseMSlots[m] = mSlots

# we await further case analysis before putting these counterparts of n into the edge

if nMs == 1:
m = list(mappedNodes)[0]
dis = result[m]
if dis == 0:
diag[n] = "b"
edge[n] = {m: None}
# this is the most frequent case, hence an optimization: no dis value.
# all other cases require the dis value to be passed on, even if 0
else:
diag[n] = "c"
edge[n] = {m: dis}
else:
edge[n] = result
dis = min(result.values())
if dis == 0:
diag[n] = "d"
else:
allMSlots = set(
chain.from_iterable(
set(theseMSlots[m]) for m in mappedNodes
)
)
composed = allMSlots == mappedSlots and sum(
result.values()
) == len(mappedSlots) * (len(mappedNodes) - 1)

if composed:
diag[n] = "f"
else:
diag[n] = "e"

self.diagnosis[nodeType] = diag
caption(0, "\tDone")
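The candidate ranking above rests on a symmetric-difference measure on slot sets: `dis = len(A | B) - len(A & B)`, which is 0 exactly when the mapped slots and the candidate's slots coincide. A minimal standalone sketch of just that measure:

```python
def dissimilarity(mappedSlots, candidateSlots):
    """Size of the symmetric difference between two slot sets.

    0 means a perfect match; larger values mean the candidate misses
    part of the target area and/or reaches outside it.
    """
    a = set(mappedSlots)
    b = set(candidateSlots)
    return len(a | b) - len(a & b)
```

For example, a candidate that shifts by one slot against a three-slot target scores 2, and fully disjoint slot sets score the sum of both sizes.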
 def makeVersionMapping(self) 
 def migrateFeatures(self, featureNames, location=None) 

Migrate features from one version to another based on a node map.

If you have a dataset with several features, and if there is a node map between the versions, you can migrate features from the older version to the newer.

The naive approach per feature is:

For each node in the old version that has a value for the feature, map the node to the new version. That could be several nodes. Then give all those nodes the value found on the node in the old set.

The problem with that is this:

Suppose node n in the old version maps to nodes p and q in the new version, but not with the same quality. Suppose the value is v. Then the feature assignment will assign v to nodes p and q.

But there could very well be a later node m with value w in the old version that also maps to p, but with a lower quality. Then the node p in the new version will later be assigned the value w because of this, and this is suboptimal.

So we make sure that for each feature and each node in the new version the value assigned to it is the value of the node in the old version that maps with the highest quality to the new node.

## Parameters

featureNames : tuple
The names of the features to migrate. They may be node features or edge features or both.
location : string, optional None
If absent, the migrated features will be saved in the newer dataset. Otherwise it is a path where the new features should be saved.
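The highest-quality-wins rule can be sketched in isolation. This is a simplified illustration, not the actual `getBestMatches` from the source (which also special-cases near-perfect mappings): quality `None` stands for a perfect mapping, and otherwise a lower dissimilarity number is better.

```python
def assignBest(dest, candidates, val):
    """Assign val to each candidate node, keeping per node the value
    that arrived via the best-quality mapping.

    candidates: iterable of (node, quality); quality None means a
    perfect mapping, lower numbers are better otherwise.
    dest: node -> (value, quality) of the best assignment so far.
    """
    def better(q, origQ):
        if origQ is None:
            return False  # existing assignment is perfect: keep it
        if q is None:
            return True   # new mapping is perfect: take it
        return q < origQ  # otherwise lower dissimilarity wins

    for (m, q) in candidates:
        if m not in dest or better(q, dest[m][1]):
            dest[m] = (val, q)
```

Just as in the real method, the qualities are kept alongside the values while migrating and stripped at the end, so that a later, lower-quality mapping can never overwrite an earlier, better one.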
 def omapName(self) 
 def writeMap(self) 