Module tf.ner.scopes
Scope handling
Scopes are a column in entity spreadsheets that limit the effects of entity triggers. This module defines data structures for scopes and some fundamental operations on them, such as representing them as strings, parsing them from strings, comparing and sorting them.
A scope specification is a comma-separated list of interval strings.
An interval string is either a section or an interval between two sections.
A section is written as a section heading, an interval as two section headings with a `-` in between.
An interval is taken from the start of the first section to the end of the second section.
A section on its own is taken from its start to its end.
Examples:
- `3` is the whole of section `3`
- `3-3.4` is from the start of section `3` to the end of section `3.4`
- `3.4-3` is from the start of section `3.4` to the end of section `3`
- `3.4-5` is from the start of section `3.4`, through the rest of section `3`, through the whole of section `4`, till the end of section `5`
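The syntax above can be sketched in a few lines of Python. This is a minimal illustration, not the module's parser: `split_scope` is a hypothetical helper that only splits the string into start and end headings and leaves the headings unresolved.

```python
def split_scope(scope_str):
    """Split a scope specification into (start, end) heading pairs.

    Hypothetical helper for illustration; headings stay plain strings.
    """
    pairs = []
    for intv_str in scope_str.split(","):
        intv_str = intv_str.strip()
        if not intv_str:
            continue  # skip empty items, as in "3,,4"
        # split on the first "-" only: one separator between start and end
        parts = [p.strip() for p in intv_str.split("-", 1)]
        start = parts[0]
        # a section on its own runs from its own start to its own end
        end = parts[1] if len(parts) == 2 else parts[0]
        pairs.append((start, end))
    return pairs


print(split_scope("3, 3.4-5"))
```

A heading that itself contains a `-` would be split incorrectly by this sketch.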
A section heading is written as it appears when TF represents sections. If you browse the corpus in the TF browser you'll see how those sections are represented. Even if section headings are not numeric, TF knows which sections you mean:
- `Genesis 1:5-Deuteronomy 26:6` is from the start of book Genesis chapter 1 verse 5, through the rest of Genesis, through the books of Exodus, Leviticus and Numbers, till the end of Deuteronomy chapter 26 verse 6.
Even if the headings given at the boundaries of an interval are numeric, TF knows the exact ordering of all sections, and will fill them in.
Suppose in a corpus you have main sections `1`, `4`, `3`, `3b`, `5` in that order, then:
- `1-5` is from the start of `1`, through `4`, `3`, `3b`, to the end of `5`
- `1-4` is sections `1` and `4` only
- `1-3` is sections `1`, `4`, `3` only
- `3-5` is sections `3`, `3b` and `5` only
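The point of these examples is that corpus order decides which sections fall inside an interval, not numeric order. A small sketch, assuming the section order from the example is available as a plain list (`SECTIONS` and `sections_in_interval` are illustrative names, not part of TF):

```python
SECTIONS = ["1", "4", "3", "3b", "5"]  # corpus order, not numeric order


def sections_in_interval(start, end, sections=SECTIONS):
    """Return the sections from start through end, following corpus order."""
    b = sections.index(start)
    e = sections.index(end)
    return sections[b : e + 1]


print(sections_in_interval("1", "3"))
print(sections_in_interval("3", "5"))
```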
Implementation detail: Text-Fabric pre-computes the sequence of all sections in your corpus, and maps them onto a legal numbering system, where each section corresponds to a tuple of sequence numbers. For example, in the Hebrew Bible, `Exodus 3:4` is mapped to `(2, 3, 4)` since `Exodus` is the second top level section in that corpus. In the Suriano letters, `04@027:5` is mapped to `(3, 27, 5)`, since the main sections start at `02`; there is no `01`.
The inverse mapping is also present in the pre-computed data of the corpus. These mappings are from section nodes to legal number tuples and back.
The mapping from section node to heading string and back is done by TF functions `sectionStrFromNode()` and `nodeFromSectionStr()`.
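To make the numbering concrete, here is a rough sketch of how top-level headings could be mapped to sequence numbers and back. It only covers the first component of the tuple and assumes the headings are given in corpus order; `build_seq_maps` is a hypothetical helper, not the pre-computation that Text-Fabric actually performs.

```python
def build_seq_maps(headings):
    """Number top-level headings 1, 2, 3, ... in corpus order.

    Hypothetical helper: returns (heading -> number, number -> heading).
    """
    to_seq = {heading: i for i, heading in enumerate(headings, start=1)}
    from_seq = {i: heading for heading, i in to_seq.items()}
    return to_seq, from_seq


# main sections that start at "02", as in the Suriano example above
to_seq, from_seq = build_seq_maps(["02", "03", "04"])
print(to_seq["04"])
print(from_seq[3])
```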
"""Scope handling
Scopes are a column in entity spreadsheets that limit the effects of entity
triggers. This module defines data structures for scopes and some fundamental
operations on them, such as representing them as string, parsing them from strings,
comparing and sorting them.
A scope specification is a comma-separated list of interval strings.
An interval string is either a section or an interval between two sections.
A section is written as a section heading, an interval as two section headings
with a `-` in between.
An interval is taken from the start of the first section to the end of the
second section.
A section on its own is taken from its start to its end.
Examples:
* `3` is the whole of section `3`
* `3-3.4` is from the start of section `3` to the end of section `3.4`
* `3.4-3` is from the start of section `3.4` to the end of section `3`
* `3.4-5` is from the start of section `3.4`, through the rest of section `3`,
through the whole of section `4`, till the end of section `5`
A section heading is written as it appears when TF represents sections.
If you browse the corpus in the TF browser you'll see how those
sections are represented. Even if section headings are not numeric, TF knows
which sections you mean:
* `Genesis 1:5-Deuteronomy 26:6` is from the start of book Genesis
chapter 1 verse 5, through the rest of Genesis, through the books of
Exodus, Leviticus and Numbers, till the end of Deuteronomy chapter 26 verse 6.
Even if the headings given at the boundaries of an interval are numeric, TF
knows the exact ordering of all sections, and will fill them in.
Suppose in a corpus you have main sections `1`, `4`, `3`, `3b`, `5` in that
order, then:
* `1-5` is from the start of `1`, through `4`, `3`, `3b`, to the end of `5`
* `1-4` is sections `1` and `4` *only*
* `1-3` is sections `1`, `4`, `3` *only*
* `3-5` is sections `3`, `3b` and `5` *only*
Implementation detail: Text-Fabric pre-computes the sequence of all sections
in your corpus, and maps them onto a legal numbering system, where each
section corresponds to a tuple of sequence numbers. For example, in the
[Hebrew Bible](https://github.com/ETCBC/bhsa),
`Exodus 3:4` is mapped to `(2, 3, 4)` since `Exodus` is the
second top level section in that corpus.
In the [Suriano letters](https://gitlab.huc.knaw.nl/suriano/letters),
`04@027:5` is mapped to `(3, 27, 5)`, since the main sections start at
`02`, there is no `01`.
The inverse mapping is also present in the pre-computed data of the corpus.
These mappings are from section nodes to legal number tuples and back.
The mapping from section node to heading string and back is done by
TF functions `tf.advanced.sections.sectionStrFromNode()` and
`tf.advanced.sections.nodeFromSectionStr()`.
"""
import functools

from ..core.helpers import console


def _prevLoc(x):
    """Find the previous location.

    Only for locs that can start a scope.

    Parameters
    ----------
    x: tuple
        The tuple is a "legal" section number, possibly with `0` in it, but not `-1`.

    Returns
    -------
    tuple
        Nearly the same tuple, but with the last element decreased by one, if possible,
        otherwise the last element is set to `-1` and the next last element is
        decreased by one, if possible, and so on.
    """
    if x[2] != 0:
        return (x[0], x[1], x[2] - 1)
    if x[1] != 0:
        return (x[0], x[1] - 1, -1)
    if x[0] != 0:
        return (x[0] - 1, -1, -1)
    return (0, 0, 0)


def _nextLoc(x):
    """Find the next location.

    Only for locs that can end a scope.

    Parameters
    ----------
    x: tuple
        The tuple is a "legal" section number, possibly with `-1` in it, but not `0`.

    Returns
    -------
    tuple
        Nearly the same tuple, but with the last element increased by one, if it is not
        `-1`, otherwise the last element is set to `0` and the next last element is
        increased by one, if it is not `-1`, and so on.
    """
    if x[2] != -1:
        return (x[0], x[1], x[2] + 1)
    if x[1] != -1:
        return (x[0], x[1] + 1, 0)
    if x[0] != -1:
        return (x[0] + 1, 0, 0)
    return (-1, -1, -1)


def _pointCmp(x, y):
    """Compares two section sequence numbers, only one component of them.

    We reckon with the value `-1` under the interpretation that it is the last one.

    Parameters
    ----------
    x: integer
        The first point
    y: integer
        The second point

    Returns
    -------
    integer
        0 if they are equal, -1 if the first comes before the second, 1 otherwise.
    """
    return 0 if x == y else -1 if y == -1 or (x != -1 and x < y) else 1


def _locCmp(xLoc, yLoc):
    """Compares two full section sequence numbers.

    Parameters
    ----------
    xLoc: tuple
        The first section
    yLoc: tuple
        The second section

    Returns
    -------
    integer
        0 if they are equal, -1 if the first comes before the second, 1 otherwise.
    """
    for x, y in zip(xLoc, yLoc):
        if x != y:
            eq = _pointCmp(x, y)
            if eq != 0:
                return eq
    return 0


def _locMax(xLoc, yLoc):
    """Which of two sections comes last?

    Parameters
    ----------
    xLoc: tuple
        The first section
    yLoc: tuple
        The second section

    Returns
    -------
    tuple
        either `xLoc` or `yLoc`, whichever comes last.
    """
    return xLoc if _locCmp(xLoc, yLoc) == 1 else yLoc


def _locMin(xLoc, yLoc):
    """Which of two sections comes first?

    Parameters
    ----------
    xLoc: tuple
        The first section
    yLoc: tuple
        The second section

    Returns
    -------
    tuple
        either `xLoc` or `yLoc`, whichever comes first.
    """
    return xLoc if _locCmp(xLoc, yLoc) == -1 else yLoc


def _sameLoc(xLoc, yLoc):
    """Whether two section tuples correspond to the same section.

    Example: `(2, 0, 0)` is the same section as `(2,)`, likewise
    `(2, -1, -1)` is the same section as `(2,)`, although they point
    to different places within that section.
    This function abstracts from that difference in representation.

    Parameters
    ----------
    xLoc: tuple
        The first section
    yLoc: tuple
        The second section

    Returns
    -------
    boolean
        whether both tuples refer to the same section.
    """
    return all(
        xc == yc or ((xc == -1 or xc == 0) and (yc == -1 or yc == 0))
        for (xc, yc) in zip(xLoc, yLoc)
    )


def _intervalCmp(xIntv, yIntv):
    """Compares two intervals of section sequence numbers.

    We extend the notion of before and after from sections to intervals.
    The canonical rule is this:

    The interval that starts earlier comes first, no matter what.
    If both intervals have the same start point, the one that ends *later* comes
    first.

    So containing intervals precede contained intervals, and this makes this
    relationship a generalization of the pre-order in trees.
    If you have a nest of intervals and all pairs of intervals either are disjoint,
    or the one lies embedded in the other, then this nest is a tree. The children of
    an interval are the intervals properly embedded in it. The pre-order on this tree
    is exactly defined as the canonical rule above.

    But if there are intervals that properly overlap each other, the canonical rule
    still makes sense, hence we have a generalization of the tree order.

    Parameters
    ----------
    xIntv: tuple
        The first interval
    yIntv: tuple
        The second interval

    Returns
    -------
    integer
        0 if they are equal, -1 if the first comes before the second, 1 otherwise.
    """
    (xB, xE) = xIntv
    (yB, yE) = yIntv
    eqx = _locCmp(xB, yB)
    return _locCmp(yE, xE) if eqx == 0 else eqx


_locSort = functools.cmp_to_key(_locCmp)
"""Sort key for locations (individual sections).
"""

_intervalSort = functools.cmp_to_key(_intervalCmp)
"""Sort key for intervals of sections.
"""


def _sortLocs(locs):
    """Sort function for locations (individual sections).

    Parameters
    ----------
    locs: iterable of tuple of integer
        The sequence of locations

    Returns
    -------
    tuple
        The sorted locations.
    """
    return tuple(sorted(locs, key=_locSort))


def _sortIntervals(scopes):
    """Sort function for intervals of sections.

    Parameters
    ----------
    scopes: iterable of tuple of tuple of integer
        The sequence of intervals

    Returns
    -------
    tuple
        The sorted intervals.
    """
    return tuple(sorted(scopes, key=_intervalSort))


def _locInInterval(loc, intv):
    """Whether a location (section) is contained in an interval.

    Parameters
    ----------
    loc: tuple of integer
        The location, which is a section given by its sequence numbers
    intv: tuple of tuple of integer
        The interval, given by its start and end location

    Returns
    -------
    boolean
        Whether the location is contained in the interval
    """
    (b, e) = intv
    return _locCmp(b, loc) <= 0 and _locCmp(loc, e) <= 0


def locInScope(loc, scope):
    """Whether a location (section) is contained in a scope (sequence of intervals).

    Parameters
    ----------
    loc: tuple of integer
        The location, which is a section given by its sequence numbers
    scope: tuple of tuple of tuple of integer
        The scope, given as a tuple of intervals

    Returns
    -------
    boolean
        Whether the location is contained in the scope
    """
    if not len(scope):
        return False
    return any(_locInInterval(loc, intv) for intv in scope)


def partitionScopes(scopeDict):
    """Partition a set of scopes into intervals.

    The idea is to create a set of intervals such that:

    *   there is no scope boundary within any interval;
    *   every interval has at least one scope boundary at its start and one at its end.

    Parameters
    ----------
    scopeDict: dict
        The scopes are given as a mapping from string representations of scopes to
        logical scopes, i.e. the data structures you get when you parse scopes.

    Returns
    -------
    list
        The sorted sequence of resulting intervals
    """
    scopeFromStr = {}
    strFromScope = {}
    boundaries = {}
    intervals = []

    for scopeStr, scope in scopeDict.items():
        for intv in scope:
            (b, e) = intv
            boundaries.setdefault(b, {}).setdefault("b", set()).add(intv)
            boundaries.setdefault(e, {}).setdefault("e", set()).add(intv)
            scopeFromStr.setdefault(scopeStr, set()).add(intv)
            strFromScope.setdefault(intv, set()).add(scopeStr)

    curScopes = set()
    inScope = False

    for x in _sortLocs(boundaries):
        beginScopes = boundaries[x].get("b", set())
        endScopes = boundaries[x].get("e", set())
        inScope = len(curScopes) > 0
        hasB = len(beginScopes) > 0
        hasE = len(endScopes) > 0

        if hasB and hasE:
            intervals[-1][1] = _prevLoc(x)
            if intervals[-1][1] < intervals[-1][0]:
                intervals.pop()
            curScopes |= beginScopes
            intervals.append([x, x, _sortIntervals(curScopes)])
            curScopes -= endScopes
            inScope = len(curScopes) > 0
            if inScope:
                intervals.append([_nextLoc(x), None, _sortIntervals(curScopes)])
        elif hasB:
            if inScope:
                intervals[-1][1] = _prevLoc(x)
                if intervals[-1][1] < intervals[-1][0]:
                    intervals.pop()
            curScopes |= beginScopes
            intervals.append([x, None, _sortIntervals(curScopes)])
        elif hasE:
            intervals[-1][1] = x
            curScopes -= endScopes
            inScope = len(curScopes) > 0
            if inScope:
                intervals.append([_nextLoc(x), None, _sortIntervals(curScopes)])

    if inScope:
        intervals[-1][1] = x

    for x in intervals:
        newX = []
        seen = set()
        for intv in x[2]:
            for scopeStr in strFromScope[intv]:
                if scopeStr in seen:
                    continue
                seen.add(scopeStr)
                newX.append(scopeStr)
        x[2] = tuple(newX)

    return intervals


def getIntvIndex(buckets, instructions, getSeqFromNode):
    """Map buckets in the corpus on intervals in a given set of intervals.

    When we are going to look up triggers in the corpus, we do so bucket by bucket.
    The validity of triggers is constrained to their scope. Whenever we leave an
    interval and go to the next, those scopes change.
    We need a quick way to determine for each bucket in the corpus to which
    interval it belongs.

    Parameters
    ----------
    buckets: iterable of integer
        The nodes corresponding to the lowest level sections in the corpus
    instructions: iterable of tuple of tuple of integer
        This is a sequence of intervals. You may pass the `instructions` member
        of sheet data, since that is a mapping from intervals to search
        instructions for those intervals.
        When treated as an iterable, such a dict is a sequence of intervals.
        We will sort the intervals before computing the index.
    getSeqFromNode:
        Corpus dependent function that gives the "legal" sequence number for
        sections, passed as nodes. See `tf.ner.corpus.Corpus.getSeqFromNode()`

    Returns
    -------
    dict
        Maps bucket nodes to the interval in the sequence of intervals to which they
        belong.
    """
    intervals = _sortIntervals(x for x in instructions if x != ())
    nIntervals = len(intervals)

    if nIntervals == 0:
        return {bucket: () for bucket in buckets}

    intvIndex = {}
    i = 0
    intv = intervals[i]
    (b, e) = intv

    for bucket in buckets:
        hd = getSeqFromNode(bucket)
        assigned = False

        if _locCmp(b, hd) == 1:
            intvIndex[bucket] = ()
            continue

        while i < nIntervals:
            if _locCmp(e, hd) == -1:
                i += 1
                if i < nIntervals:
                    intv = intervals[i]
                    (b, e) = intv
                else:
                    intvIndex[bucket] = ()
                    assigned = True
                    break
            else:
                intvIndex[bucket] = () if _locCmp(b, hd) == 1 else intv
                assigned = True
                break

        if not assigned:
            intvIndex[bucket] = ()

    return intvIndex


class Scopes:
    """Functions that do scope handling.

    These functions will be added as methods to the class that inherits this class.
    """

    def repLoc(self, loc):
        """Represent a location as a string.

        A location identifies a section by means of legal number.
        We now want the heading of that section, as given by the features and
        settings of the corpus. We use the corpus dependent function
        `tf.ner.corpus.Corpus.getStrFromSeq()` for that.

        Parameters
        ----------
        loc: tuple of integer
            The "legal" number of a section.

        Returns
        -------
        string
            The section heading of the corresponding section.
        """
        getStrFromSeq = self.getStrFromSeq
        return getStrFromSeq(tuple(x for x in loc if x != 0 and x != -1))

    def parseLoc(self, locStr, plain=True):
        """Parses a location string.

        A location string is a section heading in the corpus.
        Now we want the "legal" number of that section.
        We use the corpus dependent function
        `tf.ner.corpus.Corpus.getSeqFromStr()` for that.

        Parameters
        ----------
        locStr: string
            The section heading as it appears in the corpus
        plain: boolean, optional True
            Whether to return just the result or additional information as well.

        Returns
        -------
        tuple of integer or dict
            The plain result is a tuple of the numbers that make up the "legal"
            number of the section. This is returned when `plain=True` is passed.
            If there are errors, None is returned.

            But if `plain=False` is passed, a dict is returned, with keys **result**
            for the plain result; **warning** for warnings if the parsing failed; and
            **normal** for a normalized representation of the section.
        """
        getSeqFromStr = self.getSeqFromStr
        locStr = locStr.strip()

        if not locStr:
            result = ()
            return (
                ()
                if plain
                else dict(result=(), warning=None, normal=self.repLoc(result))
            )

        (error, result) = getSeqFromStr(locStr)

        if error:
            result = None
            normal = None
            warning = error
        else:
            normal = self.repLoc(result)
            warning = None

        return result if plain else dict(result=result, warning=warning, normal=normal)

    def repInterval(self, intv):
        """Represent an interval of sections as string.

        Parameters
        ----------
        intv: tuple of tuple of integer
            The interval given as start and end section tuples

        Returns
        -------
        string
            Either a single section heading, if the start and end section are the same,
            or `-` surrounded by the start and end section headings.
        """
        if intv is None or len(intv) == 0:
            return "()"

        (b, e) = intv
        return (
            self.repLoc(b) if _sameLoc(b, e) else f"{self.repLoc(b)}-{self.repLoc(e)}"
        )

    def repScope(self, scope):
        """Represent a scope as string.

        Parameters
        ----------
        scope: tuple of tuple of tuple of integer
            The scope given as sequence of intervals

        Returns
        -------
        string
            The representations of the intervals in the scope, as given by
            `repInterval()`, joined by `, `.
        """
        return ", ".join(self.repInterval(intv) for intv in scope)

    def parseInterval(self, intvStr, plain=True):
        """Parses an interval given as string.

        An interval string is two section headings separated by a `-`.
        We parse it into a 2-tuple of the section headings, both also parsed into
        tuples of integers.

        If the resulting section tuples have fewer than the maximum number of
        components, we fill them up: the start section will be filled up with
        `0`-s, and the end section will be filled up with `-1`-s. This corresponds
        to the interpretation that the start section represents its start point,
        and the end section represents its end point.

        Parameters
        ----------
        intvStr: string
            The interval string
        plain: boolean, optional True
            Whether to return just the result or additional information as well.

        Returns
        -------
        tuple of tuple of integer or dict
            The plain result is a 2-tuple of tuples of integer.
            This is returned when `plain=True` is passed.
            If there are errors, None is returned.

            But if `plain=False` is passed, a dict is returned, with keys **result**
            for the plain result; **warning** for warnings if the parsing failed; and
            **normal** for a normalized representation of the interval.
        """
        result = None
        warnings = []
        intvStr = intvStr.strip()

        if not intvStr:
            result = ((0, 0, 0), (-1, -1, -1))
            return (
                result
                if plain
                else dict(
                    result=result, warning=warnings, normal=self.repInterval(result)
                )
            )

        parts = intvStr.split("-", 1)

        if len(parts) == 1:
            info = self.parseLoc(intvStr, plain=False)
            w = info["warning"]

            if w:
                warnings.append(w)
            else:
                s = info["result"]
                result = (
                    (s + (0, 0, 0))[0:3],
                    (s + (-1, -1, -1))[0:3],
                )
        else:
            part = parts[0].strip()
            info1 = self.parseLoc(part, plain=False)
            w1 = info1["warning"]
            part = parts[1].strip()
            info2 = self.parseLoc(part, plain=False)
            w2 = info2["warning"]

            if w1 or w2:
                if w1:
                    warnings.append(w1)
                if w2:
                    warnings.append(w2)
            else:
                s1 = info1["result"]
                s2 = info2["result"]
                result = (
                    (s1 + (0, 0, 0))[0:3],
                    (s2 + (-1, -1, -1))[0:3],
                )

        normal = None if result is None else self.repInterval(result)
        return result if plain else dict(result=result, warning=warnings, normal=normal)

    def parseScope(self, scopeStr, plain=True):
        """Parse a scope specification into logical specifiers of regions in the corpus.

        A scope specification is a comma-separated list of interval strings.
        An interval string is either a section or an interval between two sections.
        A section is written as a section heading, an interval as two section headings
        with a `-` in between.
        An interval is taken from the start of the first section to the end of the
        second section.
        A section on its own is taken from its start to its end.

        Parameters
        ----------
        scopeStr: string
            The scope string
        plain: boolean, optional True
            Whether to return just the result or additional information as well.

        Returns
        -------
        tuple of tuple of tuple of integer or dict
            The plain result is a tuple of 2-tuples of tuples of integer.
            This is returned when `plain=True` is passed.
            Intervals that fail to parse are left out of the result.

            But if `plain=False` is passed, a dict is returned, with keys **result**
            for the plain result; **warning** for warnings if the parsing failed; and
            **normal** for a normalized representation of the scope.
        """
        results = []
        warnings = []

        if not scopeStr:
            return () if plain else dict(result=(), warning=warnings, normal="")

        for intvStr in scopeStr.split(","):
            if not intvStr:
                continue

            info = self.parseInterval(intvStr, plain=plain)
            result = info if plain else info["result"]

            if result is None:
                if not plain:
                    warnings.extend(info["warning"])
                continue

            results.append(result)

        results = tuple(_sortIntervals(results))
        return (
            results
            if plain
            else dict(result=results, warning=warnings, normal=self.repScope(results))
        )

    def intersectScopes(self, *scopeStrs):
        """Produce the intersection of several scopes.

        We use this function to test whether two triggers have a region where they
        are both in scope.

        Parameters
        ----------
        scopeStrs: iterable
            Sequence of scope specifiers

        Returns
        -------
        tuple of tuple of tuple of integer
            This is a tuple of intervals, forming the intersection of all given scopes
        """
        curIntersection = [self.parseInterval("")]

        for scopeStr in scopeStrs:
            newIntersection = []

            for bLoc, eLoc in self.parseScope(scopeStr):
                for ibLoc, ieLoc in curIntersection:
                    if _locCmp(ieLoc, bLoc) == -1:
                        # ieLoc < bLoc
                        continue
                    if _locCmp(ibLoc, eLoc) == 1:
                        # ibLoc > eLoc
                        break

                    # now
                    # bLoc <= ieLoc
                    # ibLoc <= eLoc
                    newIbLoc = _locMax(ibLoc, bLoc)
                    newIeLoc = _locMin(ieLoc, eLoc)
                    newIntersection.append((newIbLoc, newIeLoc))

            curIntersection = newIntersection

        return tuple(curIntersection)

    def testPartitioning(self, scopeStrs):
        """Test the partitioning of scopes.

        Only for debugging purposes.

        Parameters
        ----------
        scopeStrs: iterable of string
            The scope specifiers

        Returns
        -------
        list
            A list of intervals that make up the resulting partition.
        """
        scopeIndex = {}

        for scopeStr in scopeStrs:
            info = self.parseScope(scopeStr, plain=False)
            warning = info["warning"]

            if len(warning):
                console(f"Errors in {scopeStr}: {'; '.join(warning)}")
            else:
                scopes = info["result"]
                normScopeStr = info["normal"]
                console(
                    f"{scopeStr} => {normScopeStr}\n"
                    f"\t{self.repScope(_sortIntervals(scopes))}"
                )
                scopeIndex[normScopeStr] = scopes

        # return the partition, as the docstring promises
        return partitionScopes(scopeIndex)
Functions
def getIntvIndex(buckets, instructions, getSeqFromNode)
Map buckets in the corpus on intervals in a given set of intervals.
When we are going to look up triggers in the corpus, we do so bucket by bucket. The validity of triggers is constrained to their scope. Whenever we leave an interval and go to the next, those scopes change. We need a quick way to determine for each bucket in the corpus to which interval it belongs.
Parameters
buckets: iterable of integer
- The nodes corresponding to the lowest level sections in the corpus
instructions: iterable of tuple of tuple of integer
- This is a sequence of intervals. You may pass the `instructions` member of sheet data, since that is a mapping from intervals to search instructions for those intervals. When treated as an iterable, such a dict is a sequence of intervals. We will sort the intervals before computing the index.
getSeqFromNode
- Corpus dependent function that gives the "legal" sequence number for sections, passed as nodes. See `Corpus.getSeqFromNode`
Returns
dict
- Maps bucket nodes to the interval in the sequence of intervals to which they belong.
def locInScope(loc, scope)
Whether a location (section) is contained in a scope (sequence of intervals).
Parameters
loc: tuple of integer
- The location, which is a section given by its sequence numbers
scope: tuple of tuple of tuple of integer
- The scope, given as a tuple of intervals
Returns
boolean
- Whether the location is contained in the scope
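The containment test rests on comparing section tuples component by component, where `-1` counts as "the last one". The sketch below restates that logic in a self-contained form; the snake_case names are illustrative stand-ins for the module's private helpers, and the scope value is made up for the example.

```python
def point_cmp(x, y):
    """Compare two components; -1 stands for "the last one"."""
    if x == y:
        return 0
    return -1 if y == -1 or (x != -1 and x < y) else 1


def loc_cmp(x_loc, y_loc):
    """Compare two section tuples component by component."""
    for x, y in zip(x_loc, y_loc):
        c = point_cmp(x, y)
        if c != 0:
            return c
    return 0


def loc_in_scope(loc, scope):
    """True if loc lies within at least one (start, end) interval of scope."""
    return any(loc_cmp(b, loc) <= 0 and loc_cmp(loc, e) <= 0 for (b, e) in scope)


# the whole of section 2, plus section 4.3 up to the end of section 5
scope = (((2, 0, 0), (2, -1, -1)), ((4, 3, 0), (5, -1, -1)))
print(loc_in_scope((2, 3, 4), scope))
print(loc_in_scope((3, 1, 1), scope))
```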
def partitionScopes(scopeDict)
Partition a set of scopes into intervals.
The idea is to create a set of intervals such that:
- there is no scope boundary within any interval;
- every interval has at least one scope boundary at its start and one at its end.
Parameters
scopeDict: dict
- The scopes are given as a mapping from string representations of scopes to logical scopes, i.e. the data structures you get when you parse scopes.
Returns
list
- The sorted sequence of resulting intervals
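The partitioning idea is easier to see on plain integers than on section tuples. The following is a simplified sketch under that assumption: closed integer intervals stand in for scopes, and `partition` is a hypothetical function, not the module's algorithm.

```python
def partition(named_intervals):
    """Split closed integer intervals into pieces without inner boundaries.

    named_intervals: dict mapping a name to a (start, end) pair, end inclusive.
    Returns a list of (start, end, names) triples, sorted by start.
    """
    spans = list(named_intervals.values())
    # every start opens a piece; the position after every end opens the next one
    cuts = sorted({b for b, e in spans} | {e + 1 for b, e in spans})
    pieces = []
    for lo, nxt in zip(cuts, cuts[1:]):
        hi = nxt - 1
        names = tuple(
            sorted(n for n, (b, e) in named_intervals.items() if b <= lo and hi <= e)
        )
        if names:  # drop gaps that no interval covers
            pieces.append((lo, hi, names))
    return pieces


print(partition({"A": (1, 5), "B": (3, 8)}))
```

Each resulting piece carries the names of all intervals that cover it, which is the information needed to know which scopes are active in that stretch.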
Classes
class Scopes
Functions that do scope handling.
These functions will be added as methods to the class that inherits this class.
Subclasses
Methods
def intersectScopes(self, *scopeStrs)
Produce the intersection of several scopes.
We use this function to test whether two triggers have a region where they are both in scope.
Parameters
scopeStrs: iterable
- Sequence of scope specifiers
Returns
tuple of tuple of tuple of integer
- A tuple of intervals that together form the intersection of all given scopes.
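The intersection step can be illustrated on plain tuples. The following is a minimal sketch, not the module's own code: it assumes that locations are equal-length tuples of integers that compare lexicographically, and it ignores the -1 end markers that the module's helpers have to deal with. The name intersect_intervals is hypothetical.

```python
def intersect_intervals(a, b):
    """Intersect two sorted lists of closed intervals (start, end).

    Hypothetical helper: locations are plain tuples of integers compared
    lexicographically; the -1 end markers of the real module are not handled.
    """
    result = []
    for b_start, b_end in b:
        for a_start, a_end in a:
            if a_end < b_start:
                # this interval of a lies entirely before the b interval
                continue
            if a_start > b_end:
                # a is sorted, so all later intervals of a lie after it too
                break
            # the intervals overlap: keep the common part
            result.append((max(a_start, b_start), min(a_end, b_end)))
    return result


a = [((1, 0), (2, 9)), ((4, 0), (5, 9))]
b = [((2, 3), (4, 5))]
overlap = intersect_intervals(a, b)
# overlap == [((2, 3), (2, 9)), ((4, 0), (4, 5))]
```

Intersecting more than two scopes then amounts to folding this function over the list of scopes, starting from an interval that covers everything.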
def parseInterval(self, intvStr, plain=True)
Parses an interval given as string.
An interval string is two section headings separated by a -. We parse it into a 2-tuple of the section headings, both also parsed into tuples of integers.
If the resulting section tuples have fewer than the maximum number of components, we fill them up: the start section is filled up with 0s, and the end section is filled up with -1s. This corresponds to the interpretation that the start section represents its start point, and the end section represents its end point.
Parameters
intvStr: string
- The interval string
plain: boolean, optional True
- Whether to return just the result or additional information as well.
Returns
tuple of integer or dict
- The plain result is a 2-tuple of tuples of integer. This is returned when plain=True is passed. If there are errors, None is returned.
But if plain=False is passed, a dict is returned, with keys result for the plain result; warning for warnings if the parsing failed; and normal for a normalized representation of the interval.
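The padding rule can be sketched as follows. This is an illustration only: pad_interval is a hypothetical name, the headings are assumed to be already converted to integer tuples, and depth=3 stands in for the maximum number of section levels of the corpus.

```python
def pad_interval(start, end, depth=3):
    """Pad start with 0s and end with -1s up to `depth` components.

    Hypothetical helper; `depth` is an assumed maximum number of
    section levels, not a value taken from the module.
    """
    padded_start = tuple(start) + (0,) * (depth - len(start))
    padded_end = tuple(end) + (-1,) * (depth - len(end))
    return (padded_start, padded_end)


# the interval "3.4-5" with numeric headings, already parsed into tuples
interval = pad_interval((3, 4), (5,))
# interval == ((3, 4, 0), (5, -1, -1))
```

The 0s make the start tuple point at the very beginning of section 3.4, while the -1s act as markers for the very end of section 5.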
def parseLoc(self, locStr, plain=True)
Parses a location string.
A location string is a section heading in the corpus. We want the "legal" number of that section, and we use the corpus-dependent function Corpus.getSeqFromStr to obtain it.
Parameters
locStr: string
- The section heading as it appears in the corpus
plain: boolean, optional True
- Whether to return just the result or additional information as well.
Returns
tuple of integer or dict
- The plain result is a tuple of the numbers that make up the "legal" number of the section. This is returned when plain=True is passed. If there are errors, None is returned.
But if plain=False is passed, a dict is returned, with keys result for the plain result; warning for warnings if the parsing failed; and normal for a normalized representation of the section.
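The two return shapes can be illustrated with a toy lookup. This sketch does not call Corpus.getSeqFromStr; it assumes a hypothetical corpus with main sections 1, 4, 3, 3b, 5 in that order, numbered from 1, and the names HEADINGS, SEQ_FROM_STR and parse_loc are made up for the example.

```python
# Hypothetical corpus with main sections in this order: 1, 4, 3, 3b, 5
HEADINGS = ["1", "4", "3", "3b", "5"]
# Assumption: sequence numbers start at 1 and follow the corpus order
SEQ_FROM_STR = {heading: (i + 1,) for i, heading in enumerate(HEADINGS)}


def parse_loc(loc_str, plain=True):
    """Toy stand-in for parseLoc: look up the sequence tuple of a heading."""
    heading = loc_str.strip()
    result = SEQ_FROM_STR.get(heading)
    if plain:
        # plain mode: just the tuple, or None if the heading is unknown
        return result
    warning = [] if result is not None else [f"unknown section: {heading}"]
    normal = heading if result is not None else ""
    return dict(result=result, warning=warning, normal=normal)


seq = parse_loc("3b")
# seq == (4,)
```

With plain=False the caller gets the warnings as data instead of having to interpret a bare None.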
def parseScope(self, scopeStr, plain=True)
Parse a scope specification into logical specifiers of regions in the corpus.
A scope specification is a comma-separated list of interval strings.
An interval string is either a section or an interval between two sections.
A section is written as a section heading, an interval as two section headings with a - in between.
An interval is taken from the start of the first section to the end of the second section.
A section on its own is taken from its start to its end.
Parameters
scopeStr: string
- The scope string
plain: boolean, optional True
- Whether to return just the result or additional information as well.
Returns
tuple of tuple of integer or dict
- The plain result is a tuple of 2-tuples of tuples of integer. This is returned when plain=True is passed. If there are errors, None is returned.
But if plain=False is passed, a dict is returned, with keys result for the plain result; warning for warnings if the parsing failed; and normal for a normalized representation of the scope.
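The splitting of a scope specification into intervals can be sketched as below. This is a simplified illustration under stated assumptions: split_scope is a hypothetical name, it only separates the headings without looking them up in a corpus, it strips surrounding whitespace, and it assumes that headings contain neither "-" nor ",".

```python
def split_scope(scope_str):
    """Split a scope string into (start heading, end heading) pairs.

    Hypothetical helper: no corpus lookup, no sorting, and headings are
    assumed to contain no "-" or ",".
    """
    intervals = []
    for intv_str in scope_str.split(","):
        intv_str = intv_str.strip()
        if not intv_str:
            # skip empty items, e.g. the one produced by a trailing comma
            continue
        start, sep, end = intv_str.partition("-")
        start = start.strip()
        # a section on its own runs from its start to its own end
        end = end.strip() if sep else start
        intervals.append((start, end))
    return intervals


pairs = split_scope("3, 3.4-5,")
# pairs == [("3", "3"), ("3.4", "5")]
```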
def repInterval(self, intv)
Represent an interval of sections as string.
Parameters
intv: tuple of tuple of integer
- The interval given as start and end section tuples
Returns
string
- Either a single section heading, if the start and end section are the same, or the start and end section headings with a - in between.
def repLoc(self, loc)
Represent a location as a string.
A location identifies a section by means of its "legal" number. We want the heading of that section, as given by the features and settings of the corpus, and we use the corpus-dependent function Corpus.getStrFromSeq to obtain it.
Parameters
loc: tuple of integer
- The "legal" number of a section.
Returns
string
- The section heading of the corresponding section.
def repScope(self, scope)
Represent a scope as string.
Parameters
scope: tuple of tuple of tuple of integer
- The scope given as sequence of intervals
Returns
string
- The scope as a comma-separated list of interval strings. Each interval string is either a single section heading, if the start and end section are the same, or the start and end section headings with a - in between.
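How intervals and scopes turn back into strings can be sketched on pairs of headings. This is only an illustration: rep_interval and rep_scope are hypothetical names, they work on heading strings rather than on tuples of integers, and the choice of a bare "," as separator is an assumption.

```python
def rep_interval(start, end):
    """Render one interval; a single heading if start and end coincide."""
    return start if start == end else f"{start}-{end}"


def rep_scope(pairs):
    """Render a scope as a comma-separated list of interval strings.

    Assumption: intervals are joined with "," and no extra whitespace.
    """
    return ",".join(rep_interval(start, end) for start, end in pairs)


text = rep_scope([("3", "3"), ("3.4", "5")])
# text == "3,3.4-5"
```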
def testPartitioning(self, scopeStrs)
Test the partitioning of scopes.
Only for debugging purposes.
Parameters
scopeStrs: iterable of string
- The scope specifiers
Returns
list
- A list of intervals that make up the resulting partition.