Module tf.ner.data
Annotation data module.
This module manages the data of annotation sets.
Annotation sets are either the sets of pre-existing entities in the corpus, or the result of actions by the user of this tool (through the TF browser or through the API in their own programs), or the result of looking up the triggers in a spreadsheet.
Annotation sets must be stored on file, must be read from file, and must be represented in memory in various ways in order to make the API functions of the tool efficient.
We have set up the functions in such a way that sets are only loaded and processed if they are needed and out of date.
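The "only if needed and out of date" rule can be sketched as follows. This is a minimal illustration, not the module's own code: `needs_reload` is a hypothetical helper, a plain dict stands in for the set data, and `os.path` replaces the `fileExists` and `mTime` helpers that the module imports from `tf.core.files`.

```python
import os


def needs_reload(set_data, set_file):
    """Return True when the in-memory copy of a set is missing or stale.

    set_data is a plain dict standing in for the per-set data (assumption);
    set_file is the path of the TSV file that holds the entities.
    """
    # Nothing loaded yet, or no record of when it was loaded.
    if "entities" not in set_data or "dateLoaded" not in set_data:
        return True

    exists = os.path.exists(set_file)

    # Entities in memory, but the file on disk has disappeared.
    if len(set_data["entities"]) > 0 and not exists:
        return True

    # The file on disk was modified after the last load.
    return exists and set_data["dateLoaded"] < os.path.getmtime(set_file)
```

A caller would reload the TSV file and refresh `dateLoaded` only when this check returns True; otherwise the data already in memory is reused.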
"""Annotation data module.
This module manages the data of annotation sets.
Annotation sets are either the sets of pre-existing entities in the corpus or the
result of actions by the user of this tool, whether he uses the TF browser, or the API
in his own programs, or the result of looking up the triggers in a spreadsheet.
Annotation sets must be stored on file, must be read from file,
and must be represented in memory in various ways in order to make the
API functions of the tool efficient.
We have set up the functions in such a way that sets are only loaded and
processed if they are needed and out of date.
"""
import collections
import time
from ..core.generic import AttrDict
from ..core.helpers import console
from ..core.files import fileOpen, mTime, fileExists, initTree
from .corpus import Corpus
class Data(Corpus):
def __init__(self, sets=None):
"""Manages annotation sets and their corresponding data.
This class is also responsible for adding entities to a set and deleting
entities from them.
Both addition and deletion are implemented by first figuring out what has to be
done, and then applying it to the entity data on disk; after that we
perform a data load from the update file.
Parameters
----------
sets: object, optional None
Entity sets to start with.
If None, a fresh store of sets will be created.
When the tool runs in browser context, each request will create a
`Data` object from scratch. If no sets are passed to the initializer,
it will need to load the required sets from file.
This is wasteful.
We have set up the web server in such a way that it incorporates the
annotation sets. The web server will pass them to the
`tf.ner.ner.NER` object initializer, which passes
it to the initializer here.
In that way, the `Data` object can start with the sets already in memory.
"""
Corpus.__init__(self)
if not self.properlySetup:
return
self.sets = sets
annoDir = self.annoDir
initTree(annoDir, fresh=False)
def loadSetData(self):
"""Loads the current annotation set into memory.
It has two phases:
* loading the source set (see `Data.fromSourceSet()`)
* processing the loaded set (see `Data.processSet()`)
"""
if not self.properlySetup:
return
sets = self.sets
setName = self.setName
if setName not in sets:
sets[setName] = AttrDict()
changed = self.fromSourceSet()
self.processSet(changed)
def _clearSetData(self):
"""Clears the current annotation set data from memory."""
if not self.properlySetup:
return
sets = self.sets
setName = self.setName
setData = AttrDict()
setData.entities = AttrDict()
sets[setName] = setData
self.processSet(True)
def fromSourceSet(self):
"""Loads an annotation set from source.
If the current annotation set is `""`, the annotation set is already present in
the TF data, and we compile it into a dict of entity data keyed
by entity node.
Otherwise, we read the corresponding TSV file from disk and compile it
into a dict of entity data keyed by line number.
After collection of the set it is stored under the following keys:
* `dateLoaded`: datetime when the set was last loaded from disk;
* `entities`: the list of entities as loaded from the source;
it is a dict of entities, keyed by nodes or line numbers;
each entity specifies a tuple of feature values and a list of slots
that are part of the entity.
"""
if not self.properlySetup:
return None
settings = self.settings
setName = self.setName
setIsSrc = self.setIsSrc
setData = self.sets[setName]
annoDir = self.annoDir
features = settings.features
featureDefault = self.featureDefault
nF = len(features)
checkFeature = self.checkFeature
fvalFromNode = self.fvalFromNode
slotsFromNode = self.slotsFromNode
setFile = f"{annoDir}/{setName}/entities.tsv"
if "buckets" not in setData:
setData.buckets = self.getBucketNodes()
changed = False
if setIsSrc:
if "entities" not in setData:
entities = {}
hasFeature = {feat: checkFeature(feat) for feat in features}
for e in self.getEntityNodes():
slots = slotsFromNode(e)
entities[e] = (
tuple(
(
fvalFromNode(feat, e)
if hasFeature[feat]
else featureDefault[feat](slots)
)
for feat in features
),
tuple(slots),
)
setData.entities = entities
else:
if (
"entities" not in setData
or "dateLoaded" not in setData
or (len(setData.entities) > 0 and not fileExists(setFile))
or (fileExists(setFile) and setData.dateLoaded < mTime(setFile))
):
changed = True
entities = {}
if fileExists(setFile):
with fileOpen(setFile) as df:
for e, line in enumerate(df):
fields = tuple(line.rstrip("\n").split("\t"))
entities[e] = (
tuple(fields[0:nF]),
tuple(int(f) for f in fields[nF:]),
)
setData.entities = entities
setData.dateLoaded = time.time()
return changed
def processSet(self, changed):
"""Generates derived data structures out of the source set.
After loading we process the set into derived data structures.
We try to be lazy. We only load a set from disk if it is not
already in memory, or if the set on disk has been updated since the last load.
The resulting data is stored in the current set under the various keys.
After processing, the time of processing is recorded, so that it can be
observed if the processed set is no longer up to date w.r.t. the source.
For each such set we produce several data structures, which we store
under the following keys:
* `dateProcessed`: datetime when the set was last processed
* `entityText`: dict, text of entity by entity node or line number in
TSV file;
* `entityTextVal`: dict of dict, set of feature values of entity, keyed by
feature name and then by text of the entity;
* `entitySummary`: dict, list of entity nodes / line numbers, keyed by value
of entity kind;
* `entityIdent`: dict, list of entity nodes / line numbers, keyed by tuple of
entity feature values (these tuples are identifying for an entity);
* `entityFreq`: dict of counters, a counter for each feature name; the
counter gives the number of times each value of that feature occurs in an
entity;
* `entityIdentFirst`: dict, keyed by entity id, and valued by the number
of that entity in the list. If multiple entities in the list happen to
have this id, the number of the first entity of them is chosen as value;
* `entityIndex`: dict of dict, a dict for each feature name; the sub-dict
gives for each position the values that entities occupying that position
can have; positions are tuples of slots;
* `entityVal`: dict, keyed by value tuples gives the set of positions
that entities with that value tuple occupy;
* `entitySlotVal`: dict, keyed by positions gives the set of values
that entities occupying that position can have;
* `entitySlotAll`: dict, keyed by single first slots gives the set of
ending slots that entities starting at that first slot have;
* `entitySlotIndex`: dict, keyed by single slot gives list of items
corresponding to entities that occupy that slot;
* if an entity starts there, an entry `[True, -n, values]` is made;
* if an entity ends there, an entry `[False, n, values]` is made;
* if an entity occupies that slot without starting or ending there,
an entry `None` is made;
Above, `n` is the length of the entity in tokens and `values` is the
tuple of feature values of that entity.
This is precisely the information we need if we want to mark up a set of
entities in the surrounding context of tokens.
Parameters
----------
changed: boolean
Whether the set has changed since last processing.
"""
if not self.properlySetup:
return
settings = self.settings
textFromSlots = self.textFromSlots
features = settings.features
summaryIndices = settings.summaryIndices
setName = self.setName
setData = self.sets[setName]
dateLoaded = setData.dateLoaded
dateProcessed = setData.dateProcessed
if (
changed
or "dateProcessed" not in setData
or "entityText" not in setData
or "entityTextVal" not in setData
or "entitySummary" not in setData
or "entityIdent" not in setData
or "entityIdentFirst" not in setData
or "entityFreq" not in setData
or "entityIndex" not in setData
or "entityVal" not in setData
or "entitySlotVal" not in setData
or "entitySlotAll" not in setData
or "entitySlotIndex" not in setData
or dateLoaded is not None
and dateProcessed < dateLoaded
):
entityItems = setData.entities.items()
entityText = {}
entityTextVal = {feat: collections.defaultdict(set) for feat in features}
entitySummary = {}
entityIdent = {}
entityIdentFirst = {}
entityFreq = {feat: collections.Counter() for feat in features}
entityIndex = {feat: {} for feat in features}
entityVal = {}
entitySlotVal = {}
entitySlotAll = {}
entitySlotIndex = {}
for e, (fVals, slots) in entityItems:
txt = textFromSlots(slots)
ident = fVals
summary = tuple(fVals[i] for i in summaryIndices)
entityText[e] = txt
entityVal.setdefault(fVals, set()).add(slots)
for feat, val in zip(features, fVals):
entityFreq[feat][val] += 1
entityIndex[feat].setdefault(slots, set()).add(val)
entityTextVal[feat][txt].add(val)
entityIdent.setdefault(ident, []).append(e)
if ident not in entityIdentFirst:
entityIdentFirst[ident] = e
entitySummary.setdefault(summary, []).append(e)
entitySlotVal.setdefault(slots, set()).add(fVals)
firstSlot = slots[0]
lastSlot = slots[-1]
entitySlotAll.setdefault(firstSlot, set()).add(lastSlot)
for slot in slots:
isFirst = slot == firstSlot
isLast = slot == lastSlot
if isFirst or isLast:
if isFirst:
entitySlotIndex.setdefault(slot, []).append(
[True, firstSlot - lastSlot - 1, ident]
)
if isLast:
entitySlotIndex.setdefault(slot, []).append(
[False, lastSlot - firstSlot + 1, ident]
)
else:
entitySlotIndex.setdefault(slot, []).append(None)
setData.entityText = entityText
setData.entityTextVal = entityTextVal
setData.entitySummary = entitySummary
setData.entityIdent = entityIdent
setData.entityIdentFirst = entityIdentFirst
setData.entityFreq = {
feat: sorted(entityFreq[feat].items()) for feat in features
}
setData.entityIndex = entityIndex
setData.entityVal = entityVal
setData.entitySlotVal = entitySlotVal
setData.entitySlotAll = entitySlotAll
setData.entitySlotIndex = entitySlotIndex
setData.dateProcessed = time.time()
def delEntity(self, vals, allMatches=None, returns=True):
"""Delete entity occurrences from the current set.
This operation is not allowed if the current set is a read-only set
(from a spreadsheet or the already baked-in entities).
The entities to delete are selected by their feature values.
So you can use this function to delete all entities with a certain
entity id and kind.
Moreover, you can also specify a set of locations and restrict the entity
removal to the entities that occupy those locations.
Parameters
----------
vals: tuple
For each entity feature it has a value of that feature. This specifies
which entities have to go.
allMatches: iterable of tuple of integer, optional None
A number of slot tuples. They are the locations from which the candidate
entities will be deleted.
If it is None, the entity candidates will be removed wherever they occur.
returns: boolean, optional True
If False, the function reports how many entities have been deleted
and how many were not present in the specified locations.
Otherwise, these numbers are returned.
Returns
-------
(int, int) or void
If `returns`, it returns the number of non-existing entities that were
asked to be deleted and the number of actually deleted entities.
If the operation is not allowed, both integers above are set to -1.
"""
if not self.properlySetup:
return
setIsRo = self.setIsRo
setNameRep = self.setNameRep
if setIsRo:
if returns:
return (-1, -1)
console(f"Entity deletion not allowed on {setNameRep}", error=True)
return
setData = self.getSetData()
oldEntities = setData.entities
delEntities = set()
oldEntitiesBySlots = set()
for e, (fVals, slots) in oldEntities.items():
if fVals == vals:
oldEntitiesBySlots.add(slots)
missing = 0
deleted = 0
delSlots = oldEntitiesBySlots if allMatches is None else allMatches
for slots in delSlots:
if slots not in oldEntitiesBySlots:
missing += 1
continue
delEntities.add((vals, slots))
deleted += 1
if len(delEntities):
self._weedEntities(delEntities)
self.loadSetData()
if returns:
return (missing, deleted)
self.console(f"Not present: {missing:>5} x")
self.console(f"Deleted: {deleted:>5} x")
def delEntityRich(self, deletions, buckets, excludedTokens=set()):
"""Delete specified entity occurrences from the current set.
This operation is not allowed if the current set is a read-only set
(from a spreadsheet or the already baked-in entities).
This function has more detailed instructions as to which entities
should be deleted than `Data.delEntity()`.
It is a handy function for the TF browser to call, but not so much when you
are manipulating entities yourself in a Jupyter notebook.
Parameters
----------
deletions: tuple of tuple or string
Each member of the tuple corresponds to an entity feature.
It is either a single value of such a feature, or an iterable
of such values.
The tuple together specifies a set of entities whose entity features
have values that are either equal to the corresponding member of
`deletions` or contained in it.
buckets: iterable of list
Restricts the scope where entities should be removed.
This is typically the result of
`tf.ner.corpus.Corpus.filterContent()`.
The only important thing is that member 2 of each bucket is the list
of entity matches in that bucket.
Only entities that occupy these places will be removed.
excludedTokens: set, optional set()
This is the set of token positions that define the entities that must be
skipped from deletion. If the last slot of an entity is in this set,
the entity will not be deleted.
"""
if not self.properlySetup:
return
setNameRep = self.setNameRep
setIsRo = self.setIsRo
browse = self.browse
if setIsRo:
msg = f"Entity deletion not allowed on {setNameRep}"
if browse:
return [[msg]]
else:
console(msg, error=True)
return
settings = self.settings
features = settings.features
setData = self.getSetData()
oldEntities = setData.entities
report = []
delEntities = set()
delEntitiesByE = set()
deletions = tuple([x] if type(x) is str else x for x in deletions)
if any(len(x) > 0 for x in deletions):
oldEntitiesBySlots = collections.defaultdict(set)
for e, info in oldEntities.items():
oldEntitiesBySlots[info[1]].add(e)
excl = 0
fValTuples = [()]
for vals in deletions:
delTuples = []
for val in vals:
delTuples.extend([ft + (val,) for ft in fValTuples])
fValTuples = delTuples
stats = collections.Counter()
for bucket in buckets:
allMatches = bucket[2]
for slots in allMatches:
if slots[-1] in excludedTokens:
excl += 1
continue
candidates = oldEntitiesBySlots.get(slots, set())
for e in candidates:
toBeDeleted = False
fVals = oldEntities[e][0]
if fVals in fValTuples:
toBeDeleted = True
if toBeDeleted:
if e not in delEntitiesByE:
delEntitiesByE.add(e)
delEntities.add((fVals, slots))
stats[fVals] += 1
report.append(
tuple(sorted(stats.items())) if len(stats) else ["Nothing deleted"]
)
if excl:
report.append(f"Deletion: occurrences excluded: {excl}")
if len(delEntities):
self._weedEntities(delEntities)
if browse:
return report
self.loadSetData()
(stats, *rest) = report
if type(stats) is list:
self.console("\n".join(stats))
else:
for vals, freq in stats:
repVals = " ".join(
f"{feat}={val}" for (feat, val) in zip(features, vals)
)
self.console(f"Deleted {freq:>5} x {repVals}")
if len(rest):
self.console("\n".join(rest))
def addEntity(self, vals, allMatches, returns=True):
"""Add entity occurrences to the current set.
This operation is not allowed if the current set is a read-only set
(from a spreadsheet or the already baked-in entities).
The entities to add are specified by their feature values.
So you can use this function to add entities with a certain
entity id and kind.
You also have to specify a set of locations where the entities should be added.
Parameters
----------
vals: tuple
For each entity feature it has a value of that feature. This specifies
which entities will be added.
allMatches: iterable of tuple of integer
A number of slot tuples. They are the locations where the entities will be
added.
returns: boolean, optional True
If False, reports how many entities have been added and how many
were already present in the specified locations.
Otherwise, these numbers are returned by the function.
Returns
-------
(int, int) or void
If `returns`, it returns the number of already existing entities that were
asked to be added and the number of actually added entities.
If the operation is not allowed, both integers above are set to -1.
"""
if not self.properlySetup:
return
setNameRep = self.setNameRep
setIsRo = self.setIsRo
if setIsRo:
if returns:
return (-1, -1)
console(f"Entity addition not allowed on {setNameRep}", error=True)
return
setData = self.getSetData()
oldEntities = setData.entities
addE = set()
oldEntitiesBySlots = set()
for e, (fVals, slots) in oldEntities.items():
if fVals == vals:
oldEntitiesBySlots.add(slots)
present = 0
added = 0
for slots in allMatches:
if slots in oldEntitiesBySlots:
present += 1
continue
info = (vals, slots)
if info not in addE:
addE.add(info)
added += 1
if len(addE):
self._mergeEntities(addE)
self.loadSetData()
if returns:
return (present, added)
self.console(f"Already present: {present:>5} x")
self.console(f"Added: {added:>5} x")
def addEntities(self, newEntities, returns=True, _lowlevel=False):
"""Add multiple entities efficiently to the current set.
This operation is not allowed if the current set is a read-only set, unless
`_lowlevel` is True.
If you have multiple entities to add, it is wasteful to do multiple passes over
the corpus to find them.
This method does them all in one fell swoop.
Parameters
----------
newEntities: iterable of tuples of tuples
each new entity consists of
* a tuple of entity feature values, specifying the entity to add
* a list of slot tuples, specifying where to add this entity
_lowlevel: boolean, optional False
Whether this function is executed in low-level mode.
Some calls of this function are done in specific contexts, where certain
conditions are known to be fulfilled and do not have to be checked.
The intention is that only this codebase will ever pass `_lowlevel=True`,
and that outside functions never pass this parameter.
returns: boolean, optional True
If False, reports how many entities have been added and how many were
already present in the specified locations, unless `_lowlevel` is True,
in which case nothing is reported.
Otherwise it returns these numbers.
Returns
-------
(int, int) or void
If `returns`, it returns the number of already existing entities that were
asked to be added and the number of actually added entities.
If the operation is not allowed, both integers above are set to -1.
"""
if not self.properlySetup:
return
setNameRep = self.setNameRep
setIsRo = self.setIsRo
setIsX = self.setIsX
if not _lowlevel and setIsRo:
if returns:
return (-1, -1)
console(f"Entities addition not allowed on {setNameRep}", error=True)
return
if _lowlevel and not setIsX:
return
setData = self.getSetData()
oldEntities = set(setData.entities.values())
addE = set()
present = 0
added = 0
for fVals, allMatches in newEntities:
for slots in allMatches:
if (fVals, slots) in oldEntities:
present += 1
elif (fVals, slots) in addE:
continue
else:
added += 1
addE.add((fVals, slots))
if len(addE):
self._mergeEntities(addE, _lowlevel=_lowlevel)
self.loadSetData()
if returns:
return (present, added)
if _lowlevel:
return
self.console(f"Already present: {present:>5} x")
self.console(f"Added: {added:>5} x")
def addEntityRich(self, additions, buckets, excludedTokens=set()):
"""Add specified entity occurrences to the current set.
This operation is not allowed if the current set is a read-only set
(from a spreadsheet or the already baked-in entities).
This function has more detailed instructions as to which entities
should be added than `Data.addEntity()`.
It is a handy function for the TF browser to call, but not so much when you
are manipulating entities yourself in a Jupyter notebook.
Parameters
----------
additions: tuple of tuple or string
Each member of the tuple corresponds to an entity feature.
It is either a single value of such a feature, or an iterable
of such values.
The tuple together specifies a set of entities whose entity features
have values that are either equal to the corresponding member of
`additions` or contained in it.
buckets: iterable of list
This is typically the result of
`tf.ner.corpus.Corpus.filterContent()`.
The only important thing is that member 2 of each bucket is the list
of entity matches in that bucket.
Entities will only be added at these places.
excludedTokens: set, optional set()
This is the set of token positions that define the locations that must not
receive new entities. If the last slot of an entity is in this set,
no entity will be added there.
"""
if not self.properlySetup:
return
setNameRep = self.setNameRep
setIsRo = self.setIsRo
browse = self.browse
if setIsRo:
msg = f"Entity addition not allowed on {setNameRep}"
if browse:
return [[msg]]
else:
console(msg, error=True)
return
settings = self.settings
features = settings.features
setData = self.getSetData()
oldEntities = setData.entities
report = []
addEnts = set()
additions = tuple([x] if type(x) is str else x for x in additions)
if all(len(x) > 0 for x in additions):
oldEntitiesBySlots = collections.defaultdict(set)
for e, (fVals, slots) in oldEntities.items():
oldEntitiesBySlots[slots].add(fVals)
excl = 0
fValTuples = [()]
for vals in additions:
newTuples = []
for val in vals:
newTuples.extend([ft + (val,) for ft in fValTuples])
fValTuples = newTuples
stats = collections.Counter()
for bucket in buckets:
allMatches = bucket[2]
for slots in allMatches:
if slots[-1] in excludedTokens:
excl += 1
continue
existing = oldEntitiesBySlots.get(slots, set())
for fVals in fValTuples:
if fVals in existing:
continue
info = (fVals, slots)
if info not in addEnts:
addEnts.add(info)
stats[fVals] += 1
report.append(
tuple(sorted(stats.items())) if len(stats) else ["Nothing added"]
)
if excl:
report.append(f"Addition: occurrences excluded: {excl}")
if len(addEnts):
self._mergeEntities(addEnts)
if browse:
return report
self.loadSetData()
(stats, *rest) = report
if type(stats) is list:
self.console("\n".join(stats))
else:
for vals, freq in stats:
repVals = " ".join(
f"{feat}={val}" for (feat, val) in zip(features, vals)
)
self.console(f"Added {freq:>5} x {repVals}")
if len(rest):
self.console("\n".join(rest))
def saveEntitiesAs(self, dataFile):
"""Export an annotation set to a file.
This function is used when a set has to be duplicated:
`tf.ner.sets.Sets.setDup()`.
Parameters
----------
dataFile: string
The path of the file to write to.
"""
if not self.properlySetup:
return
setData = self.getSetData()
entities = setData.entities
with fileOpen(dataFile, mode="a") as fh:
for fVals, slots in entities.values():
fh.write("\t".join(str(x) for x in (*fVals, *slots)) + "\n")
def _weedEntities(self, delEntities):
"""Performs deletions to the current annotation set.
This operation is not allowed if the current set is a read-only set
(from a spreadsheet or the already baked-in entities).
Parameters
----------
delEntities: set
The set consists of entity specs: each spec is a tuple of values of entity
features, and a tuple of slots where the entity is located.
"""
if not self.properlySetup:
return
setName = self.setName
setNameRep = self.setNameRep
setIsRo = self.setIsRo
if setIsRo:
console(f"Entity weeding not allowed on {setNameRep}", error=True)
return
settings = self.settings
features = settings.features
nF = len(features)
annoDir = self.annoDir
dataFile = f"{annoDir}/{setName}/entities.tsv"
newEntities = []
with fileOpen(dataFile) as fh:
for line in fh:
fields = tuple(line.rstrip("\n").split("\t"))
fVals = tuple(fields[0:nF])
slots = tuple(int(f) for f in fields[nF:])
info = (fVals, slots)
if info in delEntities:
continue
newEntities.append(line)
with fileOpen(dataFile, mode="w") as fh:
fh.write("".join(newEntities))
def _mergeEntities(self, newEntities, _lowlevel=False):
"""Performs additions to the current annotation set.
This operation is not allowed if the current set is the read-only set with the
empty name.
Parameters
----------
newEntities: set
The set consists of entity specs: each spec is a tuple of values of entity
features, and a tuple of slots where the entity is located.
_lowlevel: boolean, optional False
Whether this function is executed in low-level mode.
Some calls of this function are done in specific contexts, where certain
conditions are known to be fulfilled and do not have to be checked.
The intention is that only this codebase will ever pass `_lowlevel=True`,
and that outside functions never pass this parameter.
"""
if not self.properlySetup:
return
setName = self.setName
setNameRep = self.setNameRep
setIsRo = self.setIsRo
setIsX = self.setIsX
if not _lowlevel and setIsRo:
console(f"Entity merging not allowed on {setNameRep}", error=True)
return
if _lowlevel and not setIsX:
return
annoDir = self.annoDir
dataFile = f"{annoDir}/{setName}/entities.tsv"
with fileOpen(dataFile, mode="w" if _lowlevel else "a") as fh:
for fVals, slots in newEntities:
fh.write("\t".join(str(x) for x in (*fVals, *slots)) + "\n")
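The `entitySlotIndex` structure documented under `processSet` can be illustrated with a small standalone sketch. It follows the same rules as the listing above, but the function name `slot_index` and the sample entity are hypothetical, and entities are passed as a plain list of `(feature values, slots)` pairs rather than read from a set.

```python
def slot_index(entities):
    """Build a per-slot index of entity boundaries.

    entities is an iterable of (f_vals, slots) pairs, where f_vals is a
    tuple of feature values and slots is a tuple of integer slot numbers.
    """
    index = {}

    for f_vals, slots in entities:
        first, last = slots[0], slots[-1]
        # Span length from first to last slot, as computed in the listing.
        n = last - first + 1

        for slot in slots:
            items = index.setdefault(slot, [])
            if slot == first:
                # An entity starts here: negative length marks a start.
                items.append([True, -n, f_vals])
            if slot == last:
                # An entity ends here: positive length marks an end.
                items.append([False, n, f_vals])
            if slot != first and slot != last:
                # The slot lies strictly inside the entity.
                items.append(None)

    return index


example = slot_index([(("e1", "PER"), (3, 4, 5)), (("e2", "LOC"), (7,))])
```

A single-slot entity produces both a start entry and an end entry on the same slot, which is what a renderer needs in order to open and close the markup around one token.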
Classes
class Data (sets=None)
Manages annotation sets and their corresponding data.
This class is also responsible for adding entities to a set and deleting entities from it.
Both addition and deletion are implemented by first figuring out what has to be done, and then applying it to the entity data on disk; after that we perform a data load from the update file.
Parameters
sets : object, optional None
Entity sets to start with. If None, a fresh store of sets will be created.
When the tool runs in browser context, each request will create a `Data` object from scratch. If no sets are passed to the initializer, it will need to load the required sets from file. This is wasteful.
We have set up the web server in such a way that it incorporates the annotation sets. The web server will pass them to the `tf.ner.ner.NER` object initializer, which passes them to the initializer here.
In that way, the `Data` object can start with the sets already in memory.
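The "figure out what has to be done, then apply it to the data on disk" pattern can be sketched on the TSV line format used for entities: the feature values come first, followed by the slot numbers, all tab-separated. The helper names `parse_line`, `format_line` and `weed` are hypothetical, and a list of lines stands in for the file so that the sketch runs without touching disk.

```python
def parse_line(line, n_features):
    """Split one TSV line into (feature values, slots)."""
    fields = line.rstrip("\n").split("\t")
    f_vals = tuple(fields[:n_features])
    slots = tuple(int(f) for f in fields[n_features:])
    return (f_vals, slots)


def format_line(f_vals, slots):
    """Serialize one entity as a TSV line, the inverse of parse_line."""
    return "\t".join(str(x) for x in (*f_vals, *slots)) + "\n"


def weed(lines, to_delete, n_features):
    """Keep only the lines whose entity is not in the set to_delete."""
    return [line for line in lines if parse_line(line, n_features) not in to_delete]


lines = [
    format_line(("e1", "PER"), (3, 4)),
    format_line(("e2", "LOC"), (7,)),
]
remaining = weed(lines, {(("e1", "PER"), (3, 4))}, n_features=2)
```

After such a rewrite the in-memory copy is stale, which is why the class reloads the set from file once the change has been written.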
""" if not self.properlySetup: return setIsRo = self.setIsRo setNameRep = self.setNameRep if setIsRo: if returns: return (-1, -1) console(f"Entity deletion not allowed on {setNameRep}", error=True) return setData = self.getSetData() oldEntities = setData.entities delEntities = set() oldEntitiesBySlots = set() for e, (fVals, slots) in oldEntities.items(): if fVals == vals: oldEntitiesBySlots.add(slots) missing = 0 deleted = 0 delSlots = oldEntitiesBySlots if allMatches is None else allMatches for slots in delSlots: if slots not in oldEntitiesBySlots: missing += 1 continue delEntities.add((vals, slots)) deleted += 1 if len(delEntities): self._weedEntities(delEntities) self.loadSetData() if returns: return (missing, deleted) self.console(f"Not present: {missing:>5} x") self.console(f"Deleted: {deleted:>5} x") def delEntityRich(self, deletions, buckets, excludedTokens=set()): """Delete specified entity occurrences from the current set. This operation is not allowed if the current set is a read-only set (from a spreadsheet or the already baked-in entities). This function has more detailed instructions as to which entities should be deleted than `Data.delEntity()` . It is a handy function for the TF browser to call, but not so much when you are manipulating entities yourself in a Jupyter notebook. Parameters ---------- deletions: tuple of tuple or string Each member of the tuple corresponds to an entity feature. It is either a single value of such a feature, or an iterable of such values. The tuple together specifies a set of entities whose entity features have values that are either equal to the corresponding member of `deletions` or contained in it. buckets: iterable of list Restricts the scope where entities should be removed. This is typically the result of `tf.ner.corpus.Corpus.filterContent()`. The only important thing is that member 2 of each bucket is the list of entity matches in that bucket. Only entities that occupy these places will be removed. 
excludedTokens: set, optional set() This is the set of token positions that define the entities that must be skipped from deletion. If the last slot of an entity is in this set, the entity will not be deleted. """ if not self.properlySetup: return setNameRep = self.setNameRep setIsRo = self.setIsRo browse = self.browse if setIsRo: msg = f"Entity deletion not allowed on {setNameRep}" if browse: return [[msg]] else: console(msg, error=True) return settings = self.settings features = settings.features setData = self.getSetData() oldEntities = setData.entities report = [] delEntities = set() delEntitiesByE = set() deletions = tuple([x] if type(x) is str else x for x in deletions) if any(len(x) > 0 for x in deletions): oldEntitiesBySlots = collections.defaultdict(set) for e, info in oldEntities.items(): oldEntitiesBySlots[info[1]].add(e) excl = 0 fValTuples = [()] for vals in deletions: delTuples = [] for val in vals: delTuples.extend([ft + (val,) for ft in fValTuples]) fValTuples = delTuples stats = collections.Counter() for bucket in buckets: allMatches = bucket[2] for slots in allMatches: if slots[-1] in excludedTokens: excl += 1 continue candidates = oldEntitiesBySlots.get(slots, set()) for e in candidates: toBeDeleted = False fVals = oldEntities[e][0] if fVals in fValTuples: toBeDeleted = True if toBeDeleted: if e not in delEntitiesByE: delEntitiesByE.add(e) delEntities.add((fVals, slots)) stats[fVals] += 1 report.append( tuple(sorted(stats.items())) if len(stats) else ["Nothing deleted"] ) if excl: report.append(f"Deletion: occurrences excluded: {excl}") if len(delEntities): self._weedEntities(delEntities) if browse: return report self.loadSetData() (stats, *rest) = report if type(stats) is list: self.console("\n".join(stats)) else: for vals, freq in stats: repVals = " ".join( f"{feat}={val}" for (feat, val) in zip(features, vals) ) self.console(f"Deleted {freq:>5} x {repVals}") if len(rest): self.console("\n".join(rest)) def addEntity(self, vals, allMatches, 
returns=True): """Add entity occurrences to the current set. This operation is not allowed if the current set is a read-only set (from a spreadsheet or the already baked-in entities). The entities to add are specified by their feature values. So you can use this function to add entities with a certain entity id and kind. You also have to specify a set of locations where the entities should be added. Parameters ---------- vals: tuple For each entity feature it has a value of that feature. This specifies which entities have will be added. allMatches: iterable of tuple of integer A number of slot tuples. They are the locations where the entities will be added. returns: boolean, optional False If True, reports how many entities have been added and how many were already present in the specified locations. Otherwise, these numbers are returned by the function. Returns ------- (int, int) or void If `returns`, it returns the number of already existing entities that were asked to be deleted and the number of actually deleted entities. If the operation is not allowed, both integers above are set to -1. """ if not self.properlySetup: return setNameRep = self.setNameRep setIsRo = self.setIsRo if setIsRo: if returns: return (-1, -1) console(f"Entity addition not allowed on {setNameRep}", error=True) return setData = self.getSetData() oldEntities = setData.entities addE = set() oldEntitiesBySlots = set() for e, (fVals, slots) in oldEntities.items(): if fVals == vals: oldEntitiesBySlots.add(slots) present = 0 added = 0 for slots in allMatches: if slots in oldEntitiesBySlots: present += 1 continue info = (vals, slots) if info not in addE: addE.add(info) added += 1 if len(addE): self._mergeEntities(addE) self.loadSetData() if returns: return (present, added) self.console(f"Already present: {present:>5} x") self.console(f"Added: {added:>5} x") def addEntities(self, newEntities, returns=True, _lowlevel=False): """Add multiple entities efficiently to the current set. 
This operation is not allowed if the current set is a read-only set, unless `_lowlevel` is True. If you have multiple entities to add, it is wasteful to do multiple passes over the corpus to find them. This method does them all in one fell swoop. Parameters ---------- newEntites: iterable of tuples of tuples each new entity consists of * a tuple of entity feature values, specifying the entity to add * a list of slot tuples, specifying where to add this entity _lowlevel: boolean, optional False Whether this function is executed in low-level mode. Some calls of this function are done in specific contexts, where certain conditions are known to be fulfilled and do not have to be checked. The intention is that only this codebase will ever pass `_lowlevel=True`, and that outside functions never pass this parameter. returns: boolean, optional False If True, eports how many entities have been added and how many were already present in the specified locations. Otherwise it returns these numbers, unless `_lowlevel` is True, in which case it returns nothing. Returns ------- (int, int) or void If `returns`, it returns the number of already existing entities that were asked to be deleted and the number of actually deleted entities. If the operation is not allowed, both integers above are set to -1. 
""" if not self.properlySetup: return setNameRep = self.setNameRep setIsRo = self.setIsRo setIsX = self.setIsX if not _lowlevel and setIsRo: if returns: return (-1, -1) console(f"Entities addition not allowed on {setNameRep}", error=True) return if _lowlevel and not setIsX: return setData = self.getSetData() oldEntities = set(setData.entities.values()) addE = set() present = 0 added = 0 for fVals, allMatches in newEntities: for slots in allMatches: if (fVals, slots) in oldEntities: present += 1 elif (fVals, slots) in addE: continue else: added += 1 addE.add((fVals, slots)) if len(addE): self._mergeEntities(addE, _lowlevel=_lowlevel) self.loadSetData() if returns: return (present, added) if _lowlevel: return self.console(f"Already present: {present:>5} x") self.console(f"Added: {added:>5} x") def addEntityRich(self, additions, buckets, excludedTokens=set()): """Add specified entity occurrences to the current set. This operation is not allowed if the current set is a read-only set (from a spreadsheet or the already baked-in entities). This function has more detailed instructions as to which entities should be added than `Data.addEntity()` . It is a handy function for the TF browser to call, but not so much when you are manipulating entities yourself in a Jupyter notebook. Parameters ---------- additions: tuple of tuple or string Each member of the tuple corresponds to an entity feature. It is either a single value of such a feature, or an iterable of such values. The tuple together specifies a set of entities whose entity features have values that are either equal to the corresponding member of `additions` or contained in it. buckets: iterable of list This is typically the result of `tf.ner.corpus.Corpus.filterContent()`. The only important thing is that member 2 of each bucket is the list of entity matches in that bucket. Entities will only be added at these places. 
excludedTokens: set, optional set() This is the set of token positions that define the locations that must not receive new entities. If the last slot of an entity is in this set, no entity will be added there. """ if not self.properlySetup: return setNameRep = self.setNameRep setIsRo = self.setIsRo browse = self.browse if setIsRo: msg = f"Entity addition not allowed on {setNameRep}" if browse: return [[msg]] else: console(msg, error=True) return settings = self.settings features = settings.features setData = self.getSetData() oldEntities = setData.entities report = [] addEnts = set() additions = tuple([x] if type(x) is str else x for x in additions) if all(len(x) > 0 for x in additions): oldEntitiesBySlots = collections.defaultdict(set) for e, (fVals, slots) in oldEntities.items(): oldEntitiesBySlots[slots].add(fVals) excl = 0 fValTuples = [()] for vals in additions: newTuples = [] for val in vals: newTuples.extend([ft + (val,) for ft in fValTuples]) fValTuples = newTuples stats = collections.Counter() for bucket in buckets: allMatches = bucket[2] for slots in allMatches: if slots[-1] in excludedTokens: excl += 1 continue existing = oldEntitiesBySlots.get(slots, set()) for fVals in fValTuples: if fVals in existing: continue info = (fVals, slots) if info not in addEnts: addEnts.add(info) stats[fVals] += 1 report.append( tuple(sorted(stats.items())) if len(stats) else ["Nothing added"] ) if excl: report.append(f"Addition: occurrences excluded: {excl}") if len(addEnts): self._mergeEntities(addEnts) if browse: return report self.loadSetData() (stats, *rest) = report if type(stats) is list: self.console("\n".join(stats)) else: for vals, freq in stats: repVals = " ".join( f"{feat}={val}" for (feat, val) in zip(features, vals) ) self.console(f"Added {freq:>5} x {repVals}") if len(rest): self.console("\n".join(rest)) def saveEntitiesAs(self, dataFile): """Export an annotation set to a file. This function is used when a set has to be duplicated: `tf.ner.sets.Sets.setDup()`. 
Parameters ---------- dataFile: string The path of the file to write to. """ if not self.properlySetup: return setData = self.getSetData() entities = setData.entities with fileOpen(dataFile, mode="a") as fh: for fVals, slots in entities.values(): fh.write("\t".join(str(x) for x in (*fVals, *slots)) + "\n") def _weedEntities(self, delEntities): """Performs deletions to the current annotation set. This operation is not allowed if the current set is a read-only set (from a spreadsheet or the already baked-in entities). Parameters ---------- delEntities: set The set consists of entity specs: a tuple of values of entity features, and an iterable of slot tuples where the entity is located. """ if not self.properlySetup: return setName = self.setName setNameRep = self.setNameRep setIsRo = self.setIsRo if setIsRo: console(f"Entity weeding not allowed on {setNameRep}", error=True) return settings = self.settings features = settings.features nF = len(features) annoDir = self.annoDir dataFile = f"{annoDir}/{setName}/entities.tsv" newEntities = [] with fileOpen(dataFile) as fh: for line in fh: fields = tuple(line.rstrip("\n").split("\t")) fVals = tuple(fields[0:nF]) slots = tuple(int(f) for f in fields[nF:]) info = (fVals, slots) if info in delEntities: continue newEntities.append(line) with fileOpen(dataFile, mode="w") as fh: fh.write("".join(newEntities)) def _mergeEntities(self, newEntities, _lowlevel=False): """Performs additions to the current annotation set. This operation is not allowed if the current set is the read-only set with the empty name. Parameters ---------- newEntities: set The set consists of entity specs: a tuple of values of entity features, and an iterable of slot tuples where the entity is located. _lowlevel: boolean, optional False Whether this function is executed in low-level mode. Some calls of this function are done in specific contexts, where certain conditions are known to be fulfilled and do not have to be checked. 
The intention is that only this codebase will ever pass `_lowlevel=True`, and that outside functions never pass this parameter. """ if not self.properlySetup: return setName = self.setName setNameRep = self.setNameRep setIsRo = self.setIsRo setIsX = self.setIsX if not _lowlevel and setIsRo: console(f"Entity merging not allowed on {setNameRep}", error=True) return if _lowlevel and not setIsX: return annoDir = self.annoDir dataFile = f"{annoDir}/{setName}/entities.tsv" with fileOpen(dataFile, mode="w" if _lowlevel else "a") as fh: for fVals, slots in newEntities: fh.write("\t".join(str(x) for x in (*fVals, *slots)) + "\n")
Ancestors
Subclasses
Methods
def addEntities(self, newEntities, returns=True)
-
Add multiple entities efficiently to the current set.
This operation is not allowed if the current set is a read-only set, unless `_lowlevel` is True.
If you have multiple entities to add, it is wasteful to do multiple passes over the corpus to find them. This method does them all in one fell swoop.
Parameters
newEntities: iterable of tuples of tuples
- Each new entity consists of
  - a tuple of entity feature values, specifying the entity to add
  - a list of slot tuples, specifying where to add this entity
_lowlevel: boolean, optional False
- Whether this function is executed in low-level mode. Some calls of this function are done in specific contexts, where certain conditions are known to be fulfilled and do not have to be checked. The intention is that only this codebase will ever pass `_lowlevel=True`, and that outside functions never pass this parameter.
returns: boolean, optional True
- If True, the function returns how many entities have been added and how many were already present in the specified locations. If False, it reports these numbers on the console instead, unless `_lowlevel` is True, in which case it reports nothing and returns nothing.
Returns
(int, int) or void
- If `returns`, it returns the number of entities that were asked to be added but were already present, and the number of entities actually added. If the operation is not allowed, both integers are set to -1.
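The bookkeeping behind the two returned numbers can be sketched as follows. This is a minimal illustration, not the library's own code: the helper name `plan_additions` and the sample feature values are hypothetical, and entities are assumed to be `(feature values, slots)` pairs as described above.

```python
def plan_additions(existing, new_entities):
    """Count already present and newly added occurrences for a batch.

    existing: set of (f_vals, slots) pairs already in the set (assumed layout).
    new_entities: iterable of (f_vals, list of slot tuples).
    """
    to_add = set()
    present = 0
    added = 0
    for f_vals, all_matches in new_entities:
        for slots in all_matches:
            info = (f_vals, slots)
            if info in existing:
                # Asked for, but already in the set.
                present += 1
            elif info not in to_add:
                # New occurrence; repeats within the batch count only once.
                to_add.add(info)
                added += 1
    return present, added, to_add


# Hypothetical sample data: feature values are (entity id, kind).
existing = {(("john", "PER"), (1, 2))}
batch = [
    (("john", "PER"), [(1, 2), (10, 11)]),
    (("rome", "LOC"), [(5,), (5,)]),
]
present, added, to_add = plan_additions(existing, batch)
```

Collecting the additions in a set first means the data file only has to be written once for the whole batch.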
def addEntity(self, vals, allMatches, returns=True)
-
Add entity occurrences to the current set.
This operation is not allowed if the current set is a read-only set (from a spreadsheet or the already baked-in entities).
The entities to add are specified by their feature values. So you can use this function to add entities with a certain entity id and kind.
You also have to specify a set of locations where the entities should be added.
Parameters
vals: tuple
- For each entity feature it has a value of that feature. This specifies which entities will be added.
allMatches: iterable of tuple of integer
- A number of slot tuples. They are the locations where the entities will be added.
returns: boolean, optional True
- If False, the function reports how many entities have been added and how many were already present in the specified locations. Otherwise, these numbers are returned by the function.
Returns
(int, int) or void
- If `returns`, it returns the number of entities that were asked to be added but were already present, and the number of entities actually added. If the operation is not allowed, both integers are set to -1.
def addEntityRich(self, additions, buckets, excludedTokens=set())
-
Add specified entity occurrences to the current set.
This operation is not allowed if the current set is a read-only set (from a spreadsheet or the already baked-in entities).
This function has more detailed instructions as to which entities should be added than `Data.addEntity()`.
It is a handy function for the TF browser to call, but not so much when you are manipulating entities yourself in a Jupyter notebook.
Parameters
additions: tuple of tuple or string
- Each member of the tuple corresponds to an entity feature. It is either a single value of such a feature, or an iterable of such values. The tuple together specifies a set of entities whose entity features have values that are either equal to the corresponding member of `additions` or contained in it.
buckets: iterable of list
- This is typically the result of `Corpus.filterContent()`. The only important thing is that member 2 of each bucket is the list of entity matches in that bucket. Entities will only be added at these places.
excludedTokens: set, optional set()
- This is the set of token positions that define the locations that must not receive new entities. If the last slot of an entity is in this set, no entity will be added there.
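How an `additions` tuple expands into concrete feature value tuples can be sketched with `itertools.product`. The helper name `expand_feature_values` and the sample values are hypothetical; the actual method builds the combinations with nested loops, so the order of the result may differ from this sketch.

```python
from itertools import product


def expand_feature_values(spec):
    """Expand a spec of single values or iterables into all value tuples."""
    # A bare string is one value, not an iterable of characters.
    normalized = [[x] if isinstance(x, str) else list(x) for x in spec]
    return list(product(*normalized))


# One entity id combined with two alternative kinds (hypothetical values).
combos = expand_feature_values(("john", ("PER", "ORG")))

# If any member is empty, there is nothing to combine.
nothing = expand_feature_values(("john", ()))
```

Each resulting tuple plays the role of `vals` in `Data.addEntity()`, applied to every match location in the buckets.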
def delEntity(self, vals, allMatches=None, returns=True)
-
Delete entity occurrences from the current set.
This operation is not allowed if the current set is a read-only set (from a spreadsheet or the already baked-in entities).
The entities to delete are selected by their feature values. So you can use this function to delete all entities with a certain entity id and kind.
Moreover, you can also specify a set of locations and restrict the entity removal to the entities that occupy those locations.
Parameters
vals: tuple
- For each entity feature it has a value of that feature. This specifies which entities have to go.
allMatches: iterable of tuple of integer, optional None
- A number of slot tuples. They are the locations from which the candidate entities will be deleted. If it is None, the entity candidates will be removed wherever they occur.
returns: boolean, optional True
- If False, the function reports how many entities have been deleted and how many were not present in the specified locations. Otherwise, these numbers are returned.
Returns
(int, int) or void
- If `returns`, it returns the number of non-existing entities that were asked to be deleted and the number of actually deleted entities. If the operation is not allowed, both integers are set to -1.
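The selection logic can be sketched as below. The helper name `plan_deletions` and the sample data are hypothetical; the sketch only assumes that entities are stored as a dict whose values are `(feature values, slots)` pairs.

```python
def plan_deletions(entities, vals, all_matches=None):
    """Return (missing, deleted, to_delete) for a deletion request."""
    # Locations currently occupied by entities with exactly these values.
    occupied = {slots for (f_vals, slots) in entities.values() if f_vals == vals}
    # Without explicit locations, every occurrence is a target.
    targets = occupied if all_matches is None else all_matches
    missing = 0
    deleted = 0
    to_delete = set()
    for slots in targets:
        if slots not in occupied:
            missing += 1
            continue
        to_delete.add((vals, slots))
        deleted += 1
    return missing, deleted, to_delete


# Hypothetical sample data keyed by line number.
entities = {
    1: (("john", "PER"), (1, 2)),
    2: (("john", "PER"), (7,)),
    3: (("rome", "LOC"), (5,)),
}
everywhere = plan_deletions(entities, ("john", "PER"))
restricted = plan_deletions(entities, ("john", "PER"), [(1, 2), (9,)])
```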
def delEntityRich(self, deletions, buckets, excludedTokens=set())
-
Delete specified entity occurrences from the current set.
This operation is not allowed if the current set is a read-only set (from a spreadsheet or the already baked-in entities).
This function has more detailed instructions as to which entities should be deleted than `Data.delEntity()`.
It is a handy function for the TF browser to call, but not so much when you are manipulating entities yourself in a Jupyter notebook.
Parameters
deletions: tuple of tuple or string
- Each member of the tuple corresponds to an entity feature. It is either a single value of such a feature, or an iterable of such values. The tuple together specifies a set of entities whose entity features have values that are either equal to the corresponding member of `deletions` or contained in it.
buckets: iterable of list
- Restricts the scope where entities should be removed. This is typically the result of `Corpus.filterContent()`. The only important thing is that member 2 of each bucket is the list of entity matches in that bucket. Only entities that occupy these places will be removed.
excludedTokens: set, optional set()
- This is the set of token positions that define the entities that must be skipped from deletion. If the last slot of an entity is in this set, the entity will not be deleted.
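The role of `buckets` and `excludedTokens` can be illustrated with a small filter. The helper name `matches_to_process` is hypothetical, and so is the bucket layout apart from member 2, which is the only part the description above relies on.

```python
def matches_to_process(buckets, excluded_tokens=frozenset()):
    """Collect match locations from buckets, skipping excluded ones."""
    kept = []
    excluded = 0
    for bucket in buckets:
        # Member 2 of each bucket is the list of entity matches.
        for slots in bucket[2]:
            if slots[-1] in excluded_tokens:
                # The last slot of the match decides whether it is skipped.
                excluded += 1
            else:
                kept.append(slots)
    return kept, excluded


# Members 0 and 1 are placeholders; only member 2 matters here.
buckets = [
    ("b1", None, [(1, 2), (4,)]),
    ("b2", None, [(8, 9, 10)]),
]
kept, excluded = matches_to_process(buckets, {2, 10})
```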
def fromSourceSet(self)
-
Loads an annotation set from source.
If the current annotation set is `""`, the annotation set is already present in the TF data, and we compile it into a dict of entity data keyed by entity node.
Otherwise, we read the corresponding TSV file from disk and compile it into a dict of entity data keyed by line number.
After collection of the set it is stored under the following keys:
- `dateLoaded`: datetime when the set was last loaded from disk;
- `entities`: the list of entities as loaded from the source; it is a dict of entities, keyed by nodes or line numbers; each entity specifies a tuple of feature values and a list of slots that are part of the entity.
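Reading the TSV variant can be sketched as follows, assuming each line holds the feature values first and the slots after them, separated by tabs. The helper name `parse_entities_tsv` is hypothetical, and numbering lines from 1 is an assumption of this sketch.

```python
def parse_entities_tsv(lines, n_features):
    """Compile TSV lines into a dict of entity data keyed by line number."""
    entities = {}
    # Assumption: line numbers start at 1.
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # The first n_features fields are feature values, the rest are slots.
        f_vals = tuple(fields[:n_features])
        slots = tuple(int(f) for f in fields[n_features:])
        entities[line_no] = (f_vals, slots)
    return entities


# Hypothetical file content with two features per entity.
lines = ["john\tPER\t1\t2\n", "rome\tLOC\t5\n"]
entities = parse_entities_tsv(lines, n_features=2)
```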
def loadSetData(self)
-
Loads the current annotation set into memory.
It has two phases:
- loading the source set (see `Data.fromSourceSet()`)
- processing the loaded set (see `Data.processSet()`)
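The two phases and their laziness can be sketched with timestamps. This is an illustration under stated assumptions, not the module's implementation: the function names, the plain dict used for set data, the `derived` key, and the callbacks `read_source` and `derive` are all hypothetical.

```python
import time


def load_if_stale(set_data, source_mtime, read_source):
    """Phase 1: read the source only if memory is empty or outdated."""
    date_loaded = set_data.get("dateLoaded")
    up_to_date = (
        "entities" in set_data
        and date_loaded is not None
        and date_loaded >= source_mtime
    )
    if up_to_date:
        return False
    set_data["entities"] = read_source()
    set_data["dateLoaded"] = time.time()
    return True


def process_if_needed(set_data, changed, derive):
    """Phase 2: rebuild derived data only if the source changed or is newer."""
    date_processed = set_data.get("dateProcessed")
    date_loaded = set_data.get("dateLoaded")
    stale = date_processed is None or (
        date_loaded is not None and date_processed < date_loaded
    )
    if not (changed or stale):
        return False
    set_data["derived"] = derive(set_data["entities"])
    set_data["dateProcessed"] = time.time()
    return True


reads = []


def read_source():
    reads.append(1)
    return {1: (("john", "PER"), (1, 2))}


set_data = {}
first_load = load_if_stale(set_data, 100.0, read_source)
first_proc = process_if_needed(set_data, first_load, len)
second_load = load_if_stale(set_data, 100.0, read_source)
second_proc = process_if_needed(set_data, second_load, len)
```

On the second round nothing is read or recomputed, because the recorded times show that memory is still current.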
def processSet(self, changed)
-
Generates derived data structures out of the source set.
After loading we process the set into derived data structures.
We try to be lazy. We only load a set from disk if it is not already in memory, or if the set on disk has been updated since the last load.
The resulting data is stored in the current set under the various keys.
After processing, the time of processing is recorded, so that it can be observed if the processed set is no longer up to date w.r.t. the source.
For each such set we produce several data structures, which we store under the following keys:
- `dateProcessed`: datetime when the set was last processed;
- `entityText`: dict, text of entity by entity node or line number in TSV file;
- `entityTextVal`: dict of dict, set of feature values of entity, keyed by feature name and then by text of the entity;
- `entitySummary`: dict, list of entity nodes / line numbers, keyed by value of entity kind;
- `entityIdent`: dict, list of entity nodes / line numbers, keyed by tuple of entity feature values (these tuples are identifying for an entity);
- `entityFreq`: dict of counters, a counter for each feature name; the counter gives the number of times each value of that feature occurs in an entity; each counter is stored as a sorted list of (value, count) pairs;
- `entityIdentFirst`: dict, keyed by the identifying tuple of feature values, and valued by the node / line number of that entity; if multiple entities share the same identifying tuple, the first of them is chosen as value;
- `entityIndex`: dict of dict, a dict for each feature name; the sub-dict gives for each position the values that entities occupying that position can have; positions are tuples of slots;
- `entityVal`: dict, keyed by value tuples gives the set of positions that entities with that value tuple occupy;
- `entitySlotVal`: dict, keyed by positions gives the set of values that entities occupying that position can have;
- `entitySlotAll`: dict, keyed by single first slots gives the set of ending slots that entities starting at that first slot have;
- `entitySlotIndex`: dict, keyed by single slot gives list of items corresponding to entities that occupy that slot:
  - if an entity starts there, an entry `[True, -n, values]` is made;
  - if an entity ends there, an entry `[False, n, values]` is made;
  - if an entity occupies that slot without starting or ending there, an entry `None` is made.
Above, `n` is the length of the entity in tokens and `values` is the tuple of feature values of that entity.
This is precisely the information we need if we want to mark up a set of entities in the surrounding context of tokens.
Parameters
changed: boolean
- Whether the set has changed since last processing.
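The construction of `entitySlotIndex` can be sketched as follows. The helper name `build_slot_index` and the sample entities are hypothetical; the length `n` is taken as last slot minus first slot plus one, which equals the token count for entities that occupy consecutive slots.

```python
def build_slot_index(entities):
    """Build a per-slot list of start, end and inside markers."""
    index = {}
    for f_vals, slots in entities.values():
        first, last = slots[0], slots[-1]
        n = last - first + 1
        for slot in slots:
            entries = index.setdefault(slot, [])
            if slot == first:
                # Start marker carries the negated length.
                entries.append([True, -n, f_vals])
            if slot == last:
                # End marker carries the length.
                entries.append([False, n, f_vals])
            if slot != first and slot != last:
                # The slot lies strictly inside the entity.
                entries.append(None)
    return index


# Hypothetical entities: a three-token one and a one-token one that overlap.
entities = {
    1: (("john", "PER"), (3, 4, 5)),
    2: (("rome", "LOC"), (5,)),
}
index = build_slot_index(entities)
```

A one-token entity gets both a start and an end marker on the same slot, so markup code can open and close it in one place.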
def saveEntitiesAs(self, dataFile)
-
Export an annotation set to a file.
This function is used when a set has to be duplicated: `Sets.setDup()`.
Parameters
dataFile: string
- The path of the file to write to.
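The exported format can be sketched as one tab-separated line per entity, feature values first and slots after them. The helper name `write_entities` is hypothetical, and an in-memory buffer stands in for the data file.

```python
import io


def write_entities(entities, fh):
    """Write each entity as a tab-separated line of feature values and slots."""
    for f_vals, slots in entities.values():
        fh.write("\t".join(str(x) for x in (*f_vals, *slots)) + "\n")


# Hypothetical entities keyed by line number.
entities = {
    1: (("john", "PER"), (1, 2)),
    2: (("rome", "LOC"), (5,)),
}
buffer = io.StringIO()
write_entities(entities, buffer)
exported = buffer.getvalue()
```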
Inherited members
Corpus:
bucketType
checkBuckets
checkFeature
console
consoleLine
featureDefault
filterContent
fvalFromNode
getAfter
getBucketNodes
getContext
getEid
getEntityNodes
getKind
getSeqFromNode
getSeqFromStr
getStr
getStrFromSeq
properlySetup
sectionHead
slotType
slotsFromNode
stringsFromTokens
textFromNode
textFromSlots
tokensFromNode