Module tf.ner.variants
Detect name variants in the corpus.
We provide a class that can detect variants of the triggers in a NER spreadsheet, that is, variants as they occur in the corpus.
One way (and the way supported here) of obtaining the variants is by the tool analiticcl by Martin Reynaert and Maarten van Gompel.
"""Detect name variants in the corpus.
We provide a class that can detect variants of the triggers in a NER spreadsheet,
that is, variants as they occur in the corpus.
One way (and the way supported here) of obtaining the variants is
by the tool
[analiticcl](https://github.com/proycon/analiticcl) by Martin
Reynaert and Maarten van Gompel.
"""
import collections
import re
# from analiticcl import VariantModel, Weights, SearchParameters
from ..capable import CheckImport
from ..core.files import initTree, fileExists, readJson, writeJson, fileOpen
from ..core.helpers import console
from ..convert.recorder import Recorder
HTML_PRE = """<html>
<head>
<meta charset="utf-8"/>
<style>
«css»
</style>
</head>
<body>
"""
HTML_POST = """
</body>
</html>
"""
class Detect:
def __init__(self, NE, sheet=None):
"""Detect spelling variants.
In order to use this class you may have to install analiticcl. First
make sure that the programming language Rust is operational on your machine,
then install analiticcl as a Rust program, and finally do
```
pip install analiticcl
```
See the
[python bindings](https://github.com/proycon/analiticcl/tree/master/bindings/python)
for detailed instructions.
If you have an instance of the `tf.ner.ner.NER` class, e.g. from an earlier
call like this
```
NE = A.makeNer()
```
and if you have pointed this `NE` to a NER spreadsheet, e.g. by
```
NE.setTask(".persons")
```
then you can get an instance of this class by saying
```
D = NE.variantDetection()
```
There are methods to produce a plain text of the corpus, a lexicon based on the
triggers in the spreadsheet, and then to search the plain text for variants
of the lexicon by passing some well-chosen parameters to analiticcl.
After the search, which may take a long time, the raw results are cached and then
filtered, with the help of an optional exceptions file.
If you change the filtering parameters, you do not have to rerun the expensive
search by analiticcl.
The variants found can then be merged with the original triggers, and saved as
a new spreadsheet, next to the original one.
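A typical session with `D` is sketched below (all of these methods are defined
in this class; the parameters shown are only illustrations):
```
D.prepare()             # plain text, alphabet, lexicon, analiticcl model
D.search()              # expensive analiticcl run (cached), then filtering
D.listVariants()        # tabular overview, also written to variants.tsv
D.showResults(0, 10)    # first 10 variants with their context in the corpus
D.mergeTriggers()       # save an enriched spreadsheet next to the original
```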
The settings for analiticcl are ultimately given in the config.yaml in the
`ner` directory of the corpus, where the other settings for the NER process
are also given. See `tf.ner.settings`.
Parameters
----------
NE: object
A `tf.ner.ner.NER` instance.
sheet: string, optional None
The name of a NER sheet that serves as input of the variant detection
process. If not passed, it is assumed that the `NE` instance is
already switched to a NER sheet before setting up this object.
However, in any case, we switch again to the sheet in question to make
sure we do so in case-sensitive mode (even if we have used the sheet
in case-insensitive mode for the lookup).
"""
CI = CheckImport("analiticcl")
if CI.importOK(hint=True):
an = CI.importGet()
self.VariantModel = an.VariantModel
self.Weights = an.Weights
self.SearchParameters = an.SearchParameters
else:
self.properlySetup = False
return None
self.properlySetup = True
self.NE = NE
if sheet is None:
sheet = NE.sheetName
NE.setSheet(sheet, caseSensitive=True, force=True, forceSilent=True)
else:
NE.setSheet(sheet, caseSensitive=True, force=True)
if not NE.setIsX:
console(
(
"Setting up this instance requires having a current "
f"NER spreadsheet selected. '{sheet}' is not a spreadsheet."
),
error=True,
)
self.properlySetup = False
return
self.sheet = sheet
self.sheetData = NE.getSheetData()
app = NE.app
self.app = app
settings = NE.settings.variants
self.settings = settings
workDir = f"{app.context.localDir}/{app.context.extraData}/analyticcl"
self.workDir = workDir
initTree(workDir, fresh=False)
NE.setSheet(sheet, caseSensitive=True, force=True)
sheetData = NE.getSheetData()
NE.console("Overview of names by length:")
triggers = set(sheetData.rowMap)
lengths = collections.defaultdict(list)
for trigger in triggers:
lengths[len(trigger.split())].append(trigger)
for n, trigs in sorted(lengths.items(), key=lambda x: -x[0]):
examples = "\n ".join(sorted(trigs, key=lambda x: x.lower())[0:5])
NE.console(f" {n} tokens: {len(trigs):>3} names e.g.:\n {examples}")
def prepare(self):
"""Prepare the data for the search of spelling variants.
We construct an alphabet and a plain text out of the corpus,
and we construct a lexicon from the triggers in the current spreadsheet.
"""
if not self.properlySetup:
console("This instance is not properly set up", error=True)
return
self.makeAlphabet()
self.makeText()
self.makeLexicon()
self.setupAnaliticcl()
def search(self, start=None, end=None, force=0):
"""Search for spelling variants in the corpus.
We search part of the corpus or the whole corpus for spelling variants.
The process has two stages:
1. a run of analiticcl
2. filtering of the results
The run of analiticcl is expensive, more than 10 minutes on the
[Suriano corpus](https://gitlab.huc.knaw.nl/suriano/letters).
The results of stage 1 will be cached. For every choice of search parameters
and boundary points in the corpus there is a separate cache.
A number of analiticcl-specific parameters will influence the search.
They can be tweaked in the config file of the NER module, under the variants
section.
Parameters
----------
start: integer, optional None
The place in the corpus where the search has to start; it is the offset
in the plain text. If `None`, start from the beginning of the corpus.
end: integer, optional None
The place in the corpus where the search must end; it is the offset
in the plain text. If `None`, the search will be performed till the end
of the corpus.
force: integer, optional 0
Valid values are `0`, `1`, `2`.
Meaning: `0`: use the cached result, if it is available.
`1`: use the cached result for stage 1, if available, but perform the
filtering (again).
`2`: do not use the cache, but compute everything again.
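A sketch of typical calls (the offsets and `force` values depend on your
situation):
```
D.search()                     # whole corpus, use the caches if they exist
D.search(force=1)              # keep the cached analiticcl run, redo the filtering
D.search(start=0, end=100000)  # only the first 100000 characters of the plain text
```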
"""
if not self.properlySetup:
console("This instance is not properly set up", error=True)
return
SearchParameters = self.SearchParameters
ansettings = self.settings.analiticcl
searchParams = ansettings.searchParams
suoffsets = searchParams.unicodeoffsets
smaxngram = searchParams.max_ngram
sfreqweight = searchParams.freq_weight
scoring = ansettings.scoring
sthreshold = scoring.threshold
NE = self.NE
app = self.app
model = self.model
rec = self.rec
textComplete = self.textComplete
workDir = self.workDir
lexiconOccs = self.lexiconOccs
text = textComplete[start:end]
nText = len(text)
offset = 0 if start is None else (len(textComplete) + start if start < 0 else start)
NE.console(f"{nText:>8} text length")
NE.console(f"{offset:>8} offset in complete text")
slug = (
f"{suoffsets}-{smaxngram}-{sfreqweight}-{sthreshold}-"
f"{suoffsets}-{smaxngram}-{sfreqweight}-{sthreshold}"
)
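# the cache files are keyed by the slice boundaries (start, end) and by the
# search settings, so each combination of parameters gets its own cache on disk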
matchesFile = f"{workDir}/matches-{start}-{end}-settings={slug}.json"
matchesPosFile = f"{workDir}/matchespos-{start}-{end}-settings={slug}.json"
rawMatchesFile = f"{workDir}/rawmatches-{start}-{end}-settings={slug}.json"
app.indent(reset=True)
if force == 2 or not fileExists(rawMatchesFile):
app.info("Compute variants of the lexicon words ...")
rawMatches = model.find_all_matches(
text,
SearchParameters(
unicodeoffsets=suoffsets,
max_ngram=smaxngram,
freq_weight=sfreqweight,
score_threshold=sthreshold,
),
)
writeJson(rawMatches, asFile=rawMatchesFile)
else:
app.info("Read previously computed variants of the lexicon words ...")
rawMatches = readJson(asFile=rawMatchesFile, plain=False)
app.info(f"{len(rawMatches):>8} raw matches")
if force == 1 or not fileExists(matchesFile) or not fileExists(matchesPosFile):
app.info("Filter variants of the lexicon words ...")
positions = rec.positions(simple=True)
matches = {}
matchPositions = collections.defaultdict(list)
for match in rawMatches:
text = match["input"].replace("\n", " ")
textL = text.lower()
if text in lexiconOccs:
continue
candidates = match["variants"]
if len(candidates) == 0:
continue
candidates = {
cand["text"]: s
for cand in candidates
if (s := cand["score"]) >= sthreshold
}
if len(candidates) == 0:
continue
textRemove = set()
for cand in candidates:
candL = cand.lower()
if candL == textL:
textRemove.add(cand)
for cand in textRemove:
del candidates[cand]
if len(candidates) == 0:
continue
# if the match ends with 's we remove the part without it from the
# candidates
if text.endswith("'s"):
head = text.removesuffix("'s")
if head in candidates:
del candidates[head]
if len(candidates) == 0:
continue
# we have another need to filter: if the text of a match is one short
# word longer than a candidate we remove that candidate
# provided the extra word is lower case and has at most 3 letters
# this is to prevent cases like
# «Adam Schivelbergh» versus «Adam Schivelbergh di»
#
# We do this also when the extra word is at the start, like
# «di monsignor Mangot» versus «monsignor Mangot»
parts = text.split()
if len(parts) > 0:
(head, tail) = (parts[0:-1], parts[-1])
# if len(tail) <= 3 and tail.islower():
if len(tail) <= 3:
head = " ".join(head)
if head in candidates:
del candidates[head]
if len(candidates) == 0:
continue
(head, tail) = (parts[0], parts[1:])
# if len(head) <= 3 and head.islower():
if len(head) <= 3:
tail = " ".join(tail)
if tail in candidates:
del candidates[tail]
if len(candidates) == 0:
continue
position = match["offset"]
start = position["begin"]
end = position["end"]
nodes = sorted(
{positions[i] for i in range(offset + start, offset + end)}
)
matches[text] = candidates
matchPositions[text].append(nodes)
writeJson(matches, asFile=matchesFile)
writeJson(matchPositions, asFile=matchesPosFile)
else:
app.info("Read previously filtered variants of the lexicon words ...")
matches = readJson(asFile=matchesFile, plain=False)
matchPositions = readJson(asFile=matchesPosFile, plain=False)
app.info(f"{len(matches):>8} filtered matches")
self.matches = matches
self.matchPositions = matchPositions
console(f"{len(matches)} variants found")
def listVariants(self, start=None, end=None):
"""List the search results to the console.
Show (part of) the variants found on the console as a plain text table.
This content will also be written to the file `variants.tsv` in the
work directory.
Parameters
----------
start: integer, optional None
The sequence number of the first result to show.
If `None`, start with the first result.
end: integer, optional None
The sequence number of the last result to show.
If `None`, continue to the last result.
"""
if not self.properlySetup:
console("This instance is not properly set up", error=True)
return
workDir = self.workDir
matches = self.matches
lines = []
head = ("variant", "score", "candidate")
dash = f"{'-' * 4} | {'-' * 25} | {'-' * 5} | {'-' * 25}"
console(f"{'i':>4} | {head[0]:<25} | {head[1]} | {head[2]}")
console(f"{dash}")
startN = start or 0
for text, candidates in sorted(matches.items(), key=lambda x: x[0].lower()):
for cand, score in sorted(candidates.items(), key=lambda x: x[0].lower()):
lines.append((text, score, cand))
for i, (text, score, cand) in enumerate(lines[start:end]):
console(f"{i + startN:>4} | {text:<25} | {score:4.2f} | {cand}")
console(f"{dash}")
file = f"{workDir}/variants.tsv"
with fileOpen(file, "w") as fh:
headStr = "\t".join(head)
fh.write(f"{headStr}\n")
for text, score, cand in lines:
fh.write(f"{text}\t{score:4.2f}\t{cand}\n")
console(f"{len(matches)} variants found and written to {file}")
def mergeTriggers(self, level=1):
"""Merge spelling variants of triggers into a NER sheet.
When we have found spelling variants of triggers, we want to include
them in the entity lookup. This function places the variants in the same
cells as the triggers they are variants of. However, it will not
overwrite the original spreadsheet, but create a new, enriched spreadsheet.
We collect the necessary information as follows:
* `matches: dict`:
The spelling variants are keys, and their values are again dicts, keyed
by the words in the triggers that come closest, and valued by a measure
of the proximity.
It is assumed that all of these variants are good variants, in that the
scores are always above a certain threshold, e.g. 0.8.
* `mergedFile: string`:
The path of the new spreadsheet with the merged triggers: it will sit next
to the original spreadsheet, but with an extension such as `-merged` added
to its name (this can be configured in the NER config file near the corpus).
* `exclusionFile: string, optional None`
The path to an optional file with exclusions, one per line.
Variants that occur in the exclusion list will not be merged in.
The file sits next to the original spreadsheet, but with an extension such
as `-notmerged` (this is configurable) and file extension `.txt`.
This function also writes a report of the merges and prints a summary to the console.
Parameters
----------
level: integer, optional 1
Only relevant for reporting the new variants. Occurrences of the
new variants are counted by section. This parameter specifies the
level of those sections. It should be 1, 2 or 3.
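For example (a sketch: `persons` is a hypothetical sheet name, and the
extensions shown are the kind of values configured in the NER config file):
```
D.mergeTriggers(level=2)
```
This reads `persons-notmerged.txt` (the optional exclusion list) if it exists,
and writes `persons-merged.xlsx` (original triggers plus variants) and
`persons-merged-report.tsv` (row, trigger, variant, occurrences per section)
next to the original spreadsheet.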
"""
if not self.properlySetup:
console("This instance is not properly set up", error=True)
return
NE = self.NE
trigI = NE.trigI
commentI = NE.commentI
sheetDir = NE.sheetDir
sheetName = NE.sheetName
settings = self.settings
mergedExtension = settings.mergedExtension
notMergedExtension = settings.notMergedExtension
mergedFile = f"{sheetDir}/{sheetName}{mergedExtension}.xlsx"
mergedReportFile = f"{sheetDir}/{sheetName}{mergedExtension}-report.tsv"
exclusionFile = f"{sheetDir}/{sheetName}{notMergedExtension}.txt"
if not fileExists(exclusionFile):
NE.console(f"File with excluded variants not found: {exclusionFile}")
exclusionFile = None
matches = self.matches
matchPositions = self.matchPositions
sectionHead = NE.sectionHead
noVariant = set()
if exclusionFile is not None and fileExists(exclusionFile):
with fileOpen(exclusionFile) as fh:
for line in fh:
noVariant.add(line.strip())
nNoVariant = len(noVariant)
pl = "" if nNoVariant == 1 else "s"
NE.console(f"{nNoVariant} excluded variant{pl} found in {exclusionFile}")
mapping = {}
excluded = 0
for text, candidates in matches.items():
if text in noVariant:
excluded += 1
continue
for cand in candidates:
mapping.setdefault(cand, set()).add(text)
ple = "" if excluded == 1 else "s"
NE.console(f"{excluded} variant{ple} excluded as trigger")
rows = NE.readSheetData()
nAdded = 0
totAdded = 0
variantsAdded = {}
for r, row in enumerate(rows):
if r == 0 or r == 1 or row[commentI].startswith("#"):
continue
rn = r + 1
triggers = set(row[trigI])
nPrev = len(triggers)
newTriggers = []
for trigger in triggers:
newTriggers.append(trigger)
for variant in mapping.get(trigger, []):
newTriggers.append(variant)
variantsAdded.setdefault(rn, []).append((variant, trigger))
newTriggers = sorted(set(newTriggers))
row[trigI] = newTriggers
nPost = len(newTriggers)
nDiff = nPost - nPrev
if nDiff != 0:
nAdded += 1
totAdded += nDiff
lines = [("row", "trigger", "variant", "occurences")]
for rn in sorted(variantsAdded):
for variant, trigger in variantsAdded[rn]:
sectionInfo = collections.Counter()
for occ in matchPositions[variant]:
slot = occ[0]
section = sectionHead(slot, level=level)
sectionInfo[section] += 1
hitData = [
f"{section}x{hits}" for section, hits in sorted(sectionInfo.items())
]
for hits in hitData:
lines.append((rn, trigger, variant, hits))
with fileOpen(mergedReportFile, "w") as fh:
nLines = len(lines)
for i, line in enumerate(lines):
if i < 10 or i > nLines - 10:
(row, trigger, variant, hits) = line
NE.console(f"{row:<4} {trigger:<40} ~> {variant:<40} = {hits}")
lineStr = "\t".join(str(x) for x in line)
fh.write(f"{lineStr}\n")
pls = "" if nAdded == 1 else "s"
plt = "" if totAdded == 1 else "s"
NE.console(f"{nAdded} triggerset{pls} expanded with {totAdded} trigger{plt}")
NE.console(f"Wrote merge report to file {mergedReportFile}")
NE.writeSheetData(rows, asFile=mergedFile)
NE.console(f"Wrote merged triggers to sheet {mergedFile}")
def showResults(self, start=None, end=None):
"""Show the search results to the console.
Show (part of) the variants found on the console with additional context.
Parameters
----------
start: integer, optional None
The sequence number of the first result to show.
If `None`, start with the first result.
end: integer, optional None
The sequence number of the last result to show.
If `None`, continue to the last result.
"""
if not self.properlySetup:
console("This instance is not properly set up", error=True)
return
app = self.app
F = app.api.F
matches = self.matches
matchPositions = self.matchPositions
i = 0
for text, candidates in sorted(matches.items())[start:end]:
i += 1
nCand = len(candidates)
pl = "" if nCand == 1 else "s"
console(f"{i:>4} Variant «{text}» of {nCand} candidate{pl}")
console(" Occurrences:")
occs = matchPositions[text]
for nodes in occs:
sectionStart = app.sectionStrFromNode(nodes[0])
sectionEnd = app.sectionStrFromNode(nodes[-1])
section = (
sectionStart
if sectionStart == sectionEnd
else f"{sectionStart} - {sectionEnd}"
)
preStart = max((nodes[0] - 10, 1))
preEnd = nodes[0]
postStart = nodes[-1] + 1
postEnd = min((nodes[-1] + 10, F.otype.maxSlot + 1))
preText = "".join(
f"{F.str.v(n)}{F.after.v(n)}" for n in range(preStart, preEnd)
)
inText = "".join(f"{F.str.v(n)}{F.after.v(n)}" for n in nodes)
postText = "".join(
f"{F.str.v(n)}{F.after.v(n)}" for n in range(postStart, postEnd)
)
context = f"{section}: {preText}«{inText}»{postText}".replace("\n", " ")
console(f" {context}")
console(" Candidates with score:")
for cand, score in sorted(candidates.items(), key=lambda x: (-x[1], x[0])):
console(f"\t{score:4.2f} {cand}")
console("-----")
def displayResults(self, start=None, end=None, asFile=None):
"""Display the results as HTML files.
This content will also be written to the files under the subdirectory
`extra` in the work directory.
Parameters
----------
start: integer, optional None
The sequence number of the first result to show.
If `None`, start with the first result.
end: integer, optional None
The sequence number of the last result to show.
If `None`, continue to the last result.
asFile: string, optional None
If None, the results will be displayed as HTML on the console.
Otherwise, the results will be written down as a set of HTML files,
whose names start with this string.
"""
if not self.properlySetup:
console("This instance is not properly set up", error=True)
return
app = self.app
L = app.api.L
matches = self.matches
matchPositions = self.matchPositions
lexiconOccs = self.lexiconOccs
workDir = self.workDir
if asFile is not None:
content = []
htmlStart = HTML_PRE.replace("«css»", app.context.css)
htmlEnd = HTML_POST
content.append([htmlStart])
empty = True
app.indent(reset=True)
app.info("Gathering information on extra triggers ...")
i = 0
s = 0
for varText, candidates in sorted(
matches.items(), key=lambda x: (-len(x[1]), x[0])
)[start:end]:
i += 1
# make a list of where the candidates are and include the score
varOccs = matchPositions[varText]
nVar = len(varOccs)
candOccs = []
candRep1 = "`" + "` or `".join(candidates) + "`"
candRep2 = "<code>" + "</code> or <code>".join(candidates) + "</code>"
for cand, score in candidates.items():
myOccs = lexiconOccs[cand]
for occ in myOccs:
candOccs.append((occ[0], score, cand, occ))
# use this list later to find the nearest/best variant
if asFile is None:
app.dm(
f"# {i}: {nVar} x variant `{varText}` on "
f"candidate {candRep1}\n\n"
)
else:
content[-1].append(
f"<h1>{i}: {nVar} x variant <code>{varText}</code> on "
f"candidate {candRep2}</h1>"
)
empty = False
sections = set()
highlights = {}
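# colour legend for the highlights assigned below:
#   lightgreen - slots of the best-scoring candidate occurrence
#   goldenrod  - slots of the variant occurrences themselves
#   yellow     - nearest-candidate slots that already carry a highlight
#   cyan       - nearest-candidate slots that were not yet highlighted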
bestCand = None
for candOcc in candOccs:
if bestCand is None or bestCand[1] < candOcc[1]:
bestCand = candOcc
for n in bestCand[3]:
highlights[n] = "lightgreen"
section = L.u(bestCand[0], otype="chunk")[0]
sections.add(section)
for varNodes in varOccs:
highlights |= {n: "goldenrod" for n in varNodes}
nFirst = varNodes[0]
section = L.u(nFirst, otype="chunk")[0]
sections.add(section)
nearestCand = None
for candOcc in candOccs:
if nearestCand is None or abs(nearestCand[0] - nFirst) > abs(
candOcc[0] - nFirst
):
nearestCand = candOcc
for n in nearestCand[3]:
if n in highlights:
highlights[n] = "yellow"
else:
highlights[n] = "cyan"
section = L.u(nearestCand[0], otype="chunk")[0]
sections.add(section)
sections = tuple((s,) for s in sorted(sections))
s += len(sections)
if asFile is None:
app.table(sections, highlights=highlights, full=True)
else:
content[-1].append(
app.table(
sections, highlights=highlights, full=True, _asString=True
)
)
if i % 10 == 0:
app.info(f"{i:>4} variants done giving {s:>4} chunks")
if i % 100 == 0:
content[-1].append(htmlEnd)
content.append([htmlStart])
empty = True
app.info(f"{i:>4} matches done")
if asFile is not None:
content[-1].append(htmlEnd)
if empty:
content.pop()
extraFileBase = f"{workDir}/extra"
initTree(extraFileBase, fresh=True, gentle=True)
for i, material in enumerate(content):
extraFile = f"{extraFileBase}/{asFile}{i + 1:>02}.html"
with fileOpen(extraFile, "w") as fh:
fh.write("\n".join(material))
console(f"Extra triggers written to {extraFile}")
def makeAlphabet(self):
"""Gathers the alphabet on which the corpus is based.
The characters of the corpus have already been collected by Text-Fabric,
and that is from where we pick them up.
We separate the digits from the rest.
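The resulting `alphabet.tsv` has one line per alphabetic character in the corpus
(ordered by frequency) and one final line with all the digits, separated by tabs;
other characters are ignored. An illustrative shape (the characters themselves
depend on the corpus):
```
e
a
i
...
0   1   2   ...   9
```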
"""
if not self.properlySetup:
console("This instance is not properly set up", error=True)
return
app = self.app
C = app.api.C
workDir = self.workDir
alphabetFile = f"{workDir}/alphabet.tsv"
self.alphabetFile = alphabetFile
with fileOpen(alphabetFile, "w") as fh:
# This file will consist of one character per line,
# for each distinct alpha character in the corpus, ordered by frequency.
# Numeric characters will be put on a single line, with tabs in between.
# All other characters will be ignored.
digits = []
for c, freq in C.characters.data["text-orig-full"]:
if c.isalpha():
fh.write(f"{c}\n")
elif c.isdigit():
digits.append(c)
fh.write("\t".join(digits))
console(f"Alphabet written to {alphabetFile}")
def makeText(self):
"""Generate a plain text from the corpus.
We make sure that we resolve the hyphenation of words across line
boundaries.
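For example (illustrative slots, not corpus data): a line ending in the slots
`Governa` `-`, followed by a line starting with the slot `tore`, ends up in the
plain text as the single word `Governatore`; the hyphen slot itself is skipped.
```
line n:     ... Governa -
line n+1:   tore ...
plain text: ... Governatore
```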
"""
if not self.properlySetup:
console("This instance is not properly set up", error=True)
return
NE = self.NE
app = self.app
api = app.api
F = app.api.F
L = app.api.L
workDir = self.workDir
rec = Recorder(api)
lineType = self.settings.lineType
slotType = F.otype.slotType
maxSlot = F.otype.maxSlot
lines = F.otype.s(lineType)
lineEnds = {L.d(ln, otype=slotType)[-1] for ln in lines}
skipTo = None
for t in range(1, maxSlot + 1):
tp = t + 1
tpp = t + 2
if tp in lineEnds and tp < maxSlot and F.str.v(tp) == "-":
rec.start(t)
rec.add(f"{F.str.v(t)}{F.after.v(t)}")
rec.end(t)
rec.start(tpp)
rec.add(f"{F.str.v(tpp)}{F.after.v(tpp)}\n")
rec.end(tpp)
skipTo = tpp
elif skipTo is not None:
if t < skipTo:
continue
else:
skipTo = None
else:
rec.start(t)
rec.add(f"{F.str.v(t)}{F.after.v(t)}")
rec.end(t)
self.rec = rec
textComplete = rec.text()
self.textComplete = textComplete
textFile = f"{workDir}/text.txt"
with fileOpen(textFile, "w") as fh:
fh.write(textComplete)
NE.console(f"Text written to {textFile} - {len(textComplete)} characters")
def makeLexicon(self):
"""Make a lexicon out of the triggers of a spreadsheet."""
if not self.properlySetup:
console("This instance is not properly set up", error=True)
return
NE = self.NE
app = self.app
workDir = self.workDir
sheetData = self.sheetData
NEinventory = sheetData.inventory
app.indent(reset=True)
app.info("Collecting the triggers for the lexicon")
inventory = {}
for eidkind, triggers in NEinventory.items():
for trigger, scopes in triggers.items():
inventory.setdefault(trigger, set())
for occs in scopes.values():
for slots in occs:
inventory[trigger].add(tuple(slots))
app.info(f"{len(inventory)} triggers collected")
remSpaceRe = re.compile(r""" +([^A-Za-z])""")
accentSpaceRe = re.compile(r"""([’']) +""")
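# remSpaceRe removes spaces before a non-letter (e.g. before punctuation);
# accentSpaceRe removes spaces after an apostrophe; presumably this brings
# the triggers closer to how the names are spelled in the corpus text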
lexicon = {}
mapNormal = {}
lexiconOccs = {}
self.lexiconOccs = lexiconOccs
for name, occs in inventory.items():
occStr = name
occNormal = remSpaceRe.sub(r"\1", occStr)
occNormal = accentSpaceRe.sub(r"\1", occNormal)
nOccs = len(occs)
lexicon[occNormal] = nOccs
mapNormal[occNormal] = occStr
lexiconOccs[occNormal] = occs
sortedLexicon = sorted(lexicon.items(), key=lambda x: (-x[1], x[0].lower()))
for name, n in sortedLexicon[0:10]:
NE.console(f" {n:>3} x {name}")
NE.console(" ...")
for name, n in sortedLexicon[-10:]:
NE.console(f" {n:>3} x {name}")
NE.console(f"{len(lexicon):>8} lexicon length")
lexiconFile = f"{workDir}/lexicon.tsv"
self.lexiconFile = lexiconFile
with fileOpen(lexiconFile, "w") as fh:
for name, n in sorted(lexicon.items()):
fh.write(f"{name}\t{n}\n")
NE.console(f"Lexicon written to {lexiconFile}")
def setupAnaliticcl(self):
"""Configure analiticcl for the big search.
We gather the parameters from the variants section of the NER
config file (see `tf.ner.settings`).
For the description of the parameters, see the
[analiticcl tutorial](https://github.com/proycon/analiticcl/blob/master/tutorial.ipynb)
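The sketch below shows the shape of the `variants` section as it is read by
this module; the key names come from this code, but the values are only
placeholders, not recommendations:
```
variants:
  lineType: line
  mergedExtension: -merged
  notMergedExtension: -notmerged
  analiticcl:
    weights:
      ld: 1.0
      lcs: 1.0
      prefix: 0.25
      suffix: 0.25
      case: 0.0
    searchParams:
      unicodeoffsets: true
      max_ngram: 3
      freq_weight: 0.0
    scoring:
      threshold: 0.7
```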
"""
if not self.properlySetup:
console("This instance is not properly set up", error=True)
return
NE = self.NE
VariantModel = self.VariantModel
Weights = self.Weights
ansettings = self.settings.analiticcl
weights = ansettings.weights
wld = weights.ld
wlcs = weights.lcs
wprefix = weights.prefix
wsuffix = weights.suffix
wcase = weights.case
alphabetFile = self.alphabetFile
lexiconFile = self.lexiconFile
NE.console("Set up analiticcl")
model = VariantModel(
alphabetFile,
Weights(ld=wld, lcs=wlcs, prefix=wprefix, suffix=wsuffix, case=wcase),
)
self.model = model
model.read_lexicon(lexiconFile)
model.build()
Classes
class Detect(NE, sheet=None)

Detect spelling variants.

In order to use this class you may have to install analiticcl. First make sure that the programming language Rust is operational on your machine, then install analiticcl as a Rust program, and finally do

    pip install analiticcl

See the python bindings (https://github.com/proycon/analiticcl/tree/master/bindings/python) for detailed instructions.

If you have an instance of the `tf.ner.ner.NER` class, e.g. from an earlier call like this

    NE = A.makeNer()

and if you have pointed this `NE` to a NER spreadsheet, e.g. by

    NE.setTask(".persons")

then you can get an instance of this class by saying

    D = NE.variantDetection()

There are methods to produce a plain text of the corpus, a lexicon based on the triggers in the spreadsheet, and then to search the plain text for variants of the lexicon by passing some well-chosen parameters to analiticcl.

After the search, which may take a long time, the raw results are cached and then filtered, with the help of an optional exceptions file. If you change the filtering parameters, you do not have to rerun the expensive search by analiticcl.

The variants found can then be merged with the original triggers, and saved as a new spreadsheet, next to the original one.

The settings for analiticcl are ultimately given in the config.yaml in the `ner` directory of the corpus, where the other settings for the NER process are also given. See `tf.ner.settings`.

Parameters

NE: object
A `tf.ner.ner.NER` instance.

sheet: string, optional None
The name of a NER sheet that serves as input of the variant detection process. If not passed, it is assumed that the `NE` instance is already switched to a NER sheet before setting up this object. However, in any case, we switch again to the sheet in question to make sure we do so in case-sensitive mode (even if we have used the sheet in case-insensitive mode for the lookup).
Methods
def displayResults(self, start=None, end=None, asFile=None)

Display the results as HTML files. This content will also be written to the files under the subdirectory `extra` in the work directory.

Parameters

start: integer, optional None
The sequence number of the first result to show. If `None`, start with the first result.

end: integer, optional None
The sequence number of the last result to show. If `None`, continue to the last result.

asFile: string, optional None
If None, the results will be displayed as HTML on the console. Otherwise, the results will be written down as a set of HTML files, whose names start with this string.
def listVariants(self, start=None, end=None)

List the search results to the console. Show (part of) the variants found on the console as a plain text table. This content will also be written to the file `variants.tsv` in the work directory.

Parameters

start: integer, optional None
The sequence number of the first result to show. If `None`, start with the first result.

end: integer, optional None
The sequence number of the last result to show. If `None`, continue to the last result.
def makeAlphabet(self)

Gathers the alphabet on which the corpus is based. The characters of the corpus have already been collected by Text-Fabric, and that is from where we pick them up. We separate the digits from the rest.

def makeLexicon(self)

Make a lexicon out of the triggers of a spreadsheet.

def makeText(self)

Generate a plain text from the corpus. We make sure that we resolve the hyphenation of words across line boundaries.
def mergeTriggers(self, level=1)

Merge spelling variants of triggers into a NER sheet. When we have found spelling variants of triggers, we want to include them in the entity lookup. This function places the variants in the same cells as the triggers they are variants of. However, it will not overwrite the original spreadsheet, but create a new, enriched spreadsheet.

We collect the necessary information as follows:

* `matches: dict`: The spelling variants are keys, and their values are again dicts, keyed by the words in the triggers that come closest, and valued by a measure of the proximity. It is assumed that all of these variants are good variants, in that the scores are always above a certain threshold, e.g. 0.8.
* `mergedFile: string`: The path of the new spreadsheet with the merged triggers: it will sit next to the original spreadsheet, but with an extension such as `-merged` added to its name (this can be configured in the NER config file near the corpus).
* `exclusionFile: string, optional None`: The path to an optional file with exclusions, one per line. Variants that occur in the exclusion list will not be merged in. The file sits next to the original spreadsheet, but with an extension such as `-notmerged` (this is configurable) and file extension `.txt`.

This function also writes a report of the merges.

Parameters

level: integer, optional 1
Only relevant for reporting the new variants. Occurrences of the new variants are counted by section. This parameter specifies the level of those sections. It should be 1, 2 or 3.
def prepare(self)

Prepare the data for the search of spelling variants. We construct an alphabet and a plain text out of the corpus, and we construct a lexicon from the triggers in the current spreadsheet.
def search(self, start=None, end=None, force=0)

Search for spelling variants in the corpus. We search part of the corpus or the whole corpus for spelling variants. The process has two stages:

1. a run of analiticcl
2. filtering of the results

The run of analiticcl is expensive, more than 10 minutes on the Suriano corpus (https://gitlab.huc.knaw.nl/suriano/letters).

The results of stage 1 will be cached. For every choice of search parameters and boundary points in the corpus there is a separate cache.

A number of analiticcl-specific parameters will influence the search. They can be tweaked in the config file of the NER module, under the variants section.

Parameters

start: integer, optional None
The place in the corpus where the search has to start; it is the offset in the plain text. If `None`, start from the beginning of the corpus.

end: integer, optional None
The place in the corpus where the search must end; it is the offset in the plain text. If `None`, the search will be performed till the end of the corpus.

force: integer, optional 0
Valid values are `0`, `1`, `2`. Meaning: `0`: use the cached result, if it is available. `1`: use the cached result for stage 1, if available, but perform the filtering (again). `2`: do not use the cache, but compute everything again.
def setupAnaliticcl(self)

Configure analiticcl for the big search. We gather the parameters from the variants section of the NER config file (see `tf.ner.settings`). For the description of the parameters, see the analiticcl tutorial (https://github.com/proycon/analiticcl/blob/master/tutorial.ipynb).
def showResults(self, start=None, end=None)

Show the search results to the console. Show (part of) the variants found on the console with additional context.

Parameters

start: integer, optional None
The sequence number of the first result to show. If `None`, start with the first result.

end: integer, optional None
The sequence number of the last result to show. If `None`, continue to the last result.