Module tf.ner.variants

Detect name variants in the corpus.

We provide a class that can detect variants of the triggers in a NER spreadsheet, that is, variants as they occur in the corpus.

One way (and the way supported here) of obtaining the variants is by the tool analiticcl by Martin Reynaert and Maarten van Gompel.

Classes

class Detect (NE, sheet=None)
class Detect:
    def __init__(self, NE, sheet=None):
        """Detect spelling variants.

        In order to use this class you may have to install analiticcl. First
        you have to make sure you have the programming language Rust operational on
        your machine, and then you have to install analiticcl as a Rust program,
        and finally you have to do

        ```
        pip install analiticcl
        ```

        See the
        [python bindings](https://github.com/proycon/analiticcl/tree/master/bindings/python)
        for detailed instructions.

        If you have an instance of the `tf.ner.ner.NER` class, e.g. from an earlier
        call like this

        ```
        NE = A.makeNer()
        ```

        and if you have pointed this `NE` to a NER spreadsheet, e.g. by

        ```
        NE.setTask(".persons")
        ```

        then you can get an instance of this class by saying

        ```
        D = NE.variantDetection()
        ```

        There are methods to produce a plain text of the corpus, a lexicon based on the
        triggers in the spreadsheet, and then to search the plain text for variants
        of the lexicon by passing some well-chosen parameters to analiticcl.

        After the search, which may take a long time, the raw results are cached, and then
        filtered, with the help of an optional exceptions file.
        If you change the filtering parameters, you do not have to rerun the expensive
        search by analiticcl.

        The variants found can then be merged with the original triggers, and saved as
        a new spreadsheet, next to the original one.

        The settings for analiticcl are ultimately given in the config.yaml in the
        `ner` directory of the corpus, where the other settings for the NER process
        are also given. See `tf.ner.settings`.

        Parameters
        ----------
        NE: object
            A `tf.ner.ner.NER` instance.
        sheet: string, optional None
            The name of a NER sheet that serves as input of the variant detection
            process. If not passed, it is assumed that the `NE` instance is
            already switched to a NER sheet before setting up this object.
            However, in any case, we switch again to the sheet in question to make
            sure we do so in case-sensitive mode (even if we have used the sheet
            in case-insensitive mode for the lookup).
        """
        CI = CheckImport("analiticcl")

        if CI.importOK(hint=True):
            an = CI.importGet()
            self.VariantModel = an.VariantModel
            self.Weights = an.Weights
            self.SearchParameters = an.SearchParameters
        else:
            self.properlySetup = False
            return None

        self.properlySetup = True

        self.NE = NE

        if sheet is None:
            sheet = NE.sheetName
            NE.setSheet(sheet, caseSensitive=True, force=True, forceSilent=True)
        else:
            NE.setSheet(sheet, caseSensitive=True, force=True)

        if not NE.setIsX:
            console(
                (
                    "Setting up this instance requires having a current "
                    f"NER spreadsheet selected. '{sheet}' is not a spreadsheet."
                ),
                error=True,
            )
            self.properlySetup = False
            return

        self.sheet = sheet
        self.sheetData = NE.getSheetData()

        app = NE.app
        self.app = app

        settings = NE.settings.variants
        self.settings = settings

        workDir = f"{app.context.localDir}/{app.context.extraData}/analyticcl"
        self.workDir = workDir
        initTree(workDir, fresh=False)

        NE.setSheet(sheet, caseSensitive=True, force=True)
        sheetData = NE.getSheetData()

        NE.console("Overview of names by length:")
        triggers = set(sheetData.rowMap)

        lengths = collections.defaultdict(list)

        for trigger in triggers:
            lengths[len(trigger.split())].append(trigger)

        for n, trigs in sorted(lengths.items(), key=lambda x: -x[0]):
            examples = "\n      ".join(sorted(trigs, key=lambda x: x.lower())[0:5])
            NE.console(f"  {n} tokens: {len(trigs):>3} names e.g.:\n      {examples}")

    def prepare(self):
        """Prepare the data for the search of spelling variants.

        We construct an alphabet and a plain text out of the corpus,
        and we construct a lexicon from the triggers in the current spreadsheet.
        """
        if not self.properlySetup:
            console("This instance is not properly set up", error=True)
            return

        self.makeAlphabet()
        self.makeText()
        self.makeLexicon()
        self.setupAnaliticcl()

    def search(self, start=None, end=None, force=0):
        """Search for spelling variants in the corpus.

        We search part of the corpus or the whole corpus for spelling variants.
        The process has two stages:

        1.  a run of analiticcl
        2.  filtering of the results

        The run of analiticcl is expensive, more than 10 minutes on the
        [Suriano corpus](https://gitlab.huc.knaw.nl/suriano/letters).

        The results of stage 1 will be cached. For every choice of search parameters
        and boundary points in the corpus there is a separate cache.

        A number of analiticcl-specific parameters will influence the search.
        They can be tweaked in the config file of the NER module, under the variants
        section.

        Parameters
        ----------
        start: integer, optional None
            The place in the corpus where the search has to start; it is the offset
            in the plain text. If `None`, start from the beginning of the corpus.
        end: integer, optional None
            The place in the corpus where the search must end; it is the offset
            in the plain text. If `None`, the search will be performed till the end
            of the corpus.
        force: integer, optional 0
            Valid values are `0`, `1`, `2`.
            Meaning: `0`: use the cached result, if it is available.
            `1`: use the cached result for stage 1, if available, but perform the
            filtering (again).
            `2`: do not use the cache, but compute everything again.
        """
        if not self.properlySetup:
            console("This instance is not properly set up", error=True)
            return

        SearchParameters = self.SearchParameters
        ansettings = self.settings.analiticcl
        searchParams = ansettings.searchParams
        suoffsets = searchParams.unicodeoffsets
        smaxngram = searchParams.max_ngram
        sfreqweight = searchParams.freq_weight
        scoring = ansettings.scoring
        sthreshold = scoring.threshold

        NE = self.NE
        app = self.app
        model = self.model
        rec = self.rec
        textComplete = self.textComplete
        workDir = self.workDir
        lexiconOccs = self.lexiconOccs

        text = textComplete[start:end]
        nText = len(text)

        offset = 0 if start is None else nText + start if start < 0 else start

        NE.console(f"{nText:>8} text  length")
        NE.console(f"{offset:>8} offset in complete text")

        slug = (
            f"{suoffsets}-{smaxngram}-{sfreqweight}-{sthreshold}-"
            f"{suoffsets}-{smaxngram}-{sfreqweight}-{sthreshold}"
        )
        matchesFile = f"{workDir}/matches-{start}-{end}-settings={slug}.json"
        matchesPosFile = f"{workDir}/matchespos-{start}-{end}-settings={slug}.json"
        rawMatchesFile = f"{workDir}/rawmatches-{start}-{end}-settings={slug}.json"

        app.indent(reset=True)

        if force == 2 or not fileExists(rawMatchesFile):
            app.info("Compute variants of the lexicon words ...")
            rawMatches = model.find_all_matches(
                text,
                SearchParameters(
                    unicodeoffsets=suoffsets,
                    max_ngram=smaxngram,
                    freq_weight=sfreqweight,
                    score_threshold=sthreshold,
                ),
            )
            writeJson(rawMatches, asFile=rawMatchesFile)
        else:
            app.info("Read previously computed variants of the lexicon words ...")
            rawMatches = readJson(asFile=rawMatchesFile, plain=False)

        app.info(f"{len(rawMatches):>8} raw   matches")

        if force == 1 or not fileExists(matchesFile) or not fileExists(matchesPosFile):
            app.info("Filter variants of the lexicon words ...")
            positions = rec.positions(simple=True)

            matches = {}
            matchPositions = collections.defaultdict(list)

            for match in rawMatches:
                text = match["input"].replace("\n", " ")
                textL = text.lower()

                if text in lexiconOccs:
                    continue

                candidates = match["variants"]

                if len(candidates) == 0:
                    continue

                candidates = {
                    cand["text"]: s
                    for cand in candidates
                    if (s := cand["score"]) >= sthreshold
                }

                if len(candidates) == 0:
                    continue

                textRemove = set()

                for cand in candidates:
                    candL = cand.lower()
                    if candL == textL:
                        textRemove.add(cand)

                for cand in textRemove:
                    del candidates[cand]

                if len(candidates) == 0:
                    continue

                # if the match ends with 's we remove the part without it from the
                # candidates

                if text.endswith("'s"):
                    head = text.removesuffix("'s")
                    if head in candidates:
                        del candidates[head]

                        if len(candidates) == 0:
                            continue

                # we have another need to filter: if the text of a match is one short
                # word longer than a candidate we remove that candidate
                # provided the extra word is lower case and has at most 3 letters
                # this is to prevent cases like
                # «Adam Schivelbergh» versus «Adam Schivelbergh di»
                #
                # We do this also when the extra word is at the start, like
                # «di monsignor Mangot» versus «monsignor Mangot»

                parts = text.split()

                if len(parts) > 0:
                    (head, tail) = (parts[0:-1], parts[-1])

                    # if len(tail) <= 3 and tail.islower():
                    if len(tail) <= 3:
                        head = " ".join(head)

                        if head in candidates:
                            del candidates[head]

                            if len(candidates) == 0:
                                continue

                    (head, tail) = (parts[0], parts[1:])

                    # if len(head) <= 3 and head.islower():
                    if len(head) <= 3:
                        tail = " ".join(tail)

                        if tail in candidates:
                            del candidates[tail]

                            if len(candidates) == 0:
                                continue

                position = match["offset"]
                start = position["begin"]
                end = position["end"]
                nodes = sorted(
                    {positions[i] for i in range(offset + start, offset + end)}
                )

                matches[text] = candidates
                matchPositions[text].append(nodes)

            writeJson(matches, asFile=matchesFile)
            writeJson(matchPositions, asFile=matchesPosFile)
        else:
            app.info("Read previously filtered variants of the lexicon words ...")
            matches = readJson(asFile=matchesFile, plain=False)
            matchPositions = readJson(asFile=matchesPosFile, plain=False)

        app.info(f"{len(matches):>8} filtered matches")

        self.matches = matches
        self.matchPositions = matchPositions
        console(f"{len(matches)} variants found")

    def listVariants(self, start=None, end=None):
        """List the search results to the console.

        Show (part of) the variants found on the console as a plain text table.

        This content will also be written to the file `variants.tsv` in the
        work directory.

        Parameters
        ----------
        start: integer, optional None
            The sequence number of the first result to show.
            If `None`, start with the first result.
        end: integer, optional None
            The sequence number of the last result to show.
            If `None`, continue to the last result.
        """
        if not self.properlySetup:
            console("This instance is not properly set up", error=True)
            return

        workDir = self.workDir
        matches = self.matches

        lines = []

        head = ("variant", "score", "candidate")
        dash = f"{'-' * 4} | {'-' * 25} | {'-' * 5} | {'-' * 25}"
        console(f"{'i':>4} | {head[0]:<25} | {head[1]} | {head[2]}")
        console(f"{dash}")
        startN = start or 0

        for text, candidates in sorted(matches.items(), key=lambda x: x[0].lower()):
            for cand, score in sorted(candidates.items(), key=lambda x: x[0].lower()):
                lines.append((text, score, cand))

        for i, (text, score, cand) in enumerate(lines[start:end]):
            console(f"{i + startN:>4} | {text:<25} |  {score:4.2f} | {cand}")

        console(f"{dash}")

        file = f"{workDir}/variants.tsv"

        with fileOpen(file, "w") as fh:
            headStr = "\t".join(head)
            fh.write(f"{headStr}\n")
            for text, score, cand in lines:
                fh.write(f"{text}\t{score:4.2f}\t{cand}\n")

        console(f"{len(matches)} variants found and written to {file}")

    def mergeTriggers(self, level=1):
        """Merge spelling variants of triggers into a NER sheet.

        When we have found spelling variants of triggers, we want to include
        them in the entity lookup. This function places the variants in the same
        cells as the triggers they are variants of. However, it will not
        overwrite the original spreadsheet, but create a new, enriched spreadsheet.

        We collect the necessary information as follows:

        *   `matches: dict`:
            The spelling variants are keys, and their values are again dicts, keyed
            by the words in the triggers that come closest, and valued by a measure
            of the proximity.
            It is assumed that all of these variants are good variants, in that the
            scores are always above a certain threshold, e.g. 0.8.
        *   `mergedFile: string`:
            The path of the new spreadsheet with the merged triggers: it will sit next
            to the original spreadsheet, but with an extension such as `-merged` added
            to its name (this can be configured in the NER config file near the corpus).
        *   `exclusionFile: string, optional None`
            The path to an optional file with exclusions, one per line.
            Variants that occur in the exclusion list will not be merged in.
            The file sits next to the original spreadsheet, but with an extension such
            as `-notmerged` (this is configurable) and file extension `.txt`.

        This function will also produce a report of the added variants and their
        occurrences.

        Parameters
        ----------
        level: integer, optional 1
            Only relevant for reporting the new variants. Occurrences of the
            new variants are counted by section. This parameter specifies the
            level of those sections. It should be 1, 2 or 3.
        """
        if not self.properlySetup:
            console("This instance is not properly set up", error=True)
            return

        NE = self.NE
        trigI = NE.trigI
        commentI = NE.commentI
        sheetDir = NE.sheetDir
        sheetName = NE.sheetName
        settings = self.settings
        mergedExtension = settings.mergedExtension
        notMergedExtension = settings.notMergedExtension
        mergedFile = f"{sheetDir}/{sheetName}{mergedExtension}.xlsx"
        mergedReportFile = f"{sheetDir}/{sheetName}{mergedExtension}-report.tsv"
        exclusionFile = f"{sheetDir}/{sheetName}{notMergedExtension}.txt"

        if not fileExists(exclusionFile):
            NE.console(f"File with excluded variants not found: {exclusionFile}")
            exclusionFile = None

        matches = self.matches
        matchPositions = self.matchPositions
        sectionHead = NE.sectionHead

        noVariant = set()

        if exclusionFile is not None and fileExists(exclusionFile):
            with fileOpen(exclusionFile) as fh:
                for line in fh:
                    noVariant.add(line.strip())

            nNoVariant = len(noVariant)
            pl = "" if nNoVariant == 1 else "s"
            NE.console(f"{nNoVariant} excluded variant{pl} found in {exclusionFile}")

        mapping = {}
        excluded = 0

        for text, candidates in matches.items():
            if text in noVariant:
                excluded += 1
                continue
            for cand in candidates:
                mapping.setdefault(cand, set()).add(text)

        ple = "" if excluded == 1 else "s"
        NE.console(f"{excluded} variant{ple} excluded as trigger")

        rows = NE.readSheetData()

        nAdded = 0
        totAdded = 0

        variantsAdded = {}

        for r, row in enumerate(rows):
            if r == 0 or r == 1 or row[commentI].startswith("#"):
                continue

            rn = r + 1
            triggers = set(row[trigI])
            nPrev = len(triggers)

            newTriggers = []

            for trigger in triggers:
                newTriggers.append(trigger)

                for variant in mapping.get(trigger, []):
                    newTriggers.append(variant)
                    variantsAdded.setdefault(rn, []).append((variant, trigger))

            newTriggers = sorted(set(newTriggers))
            row[trigI] = newTriggers
            nPost = len(newTriggers)

            nDiff = nPost - nPrev

            if nDiff != 0:
                nAdded += 1
                totAdded += nDiff

        lines = [("row", "trigger", "variant", "occurences")]

        for rn in sorted(variantsAdded):
            for variant, trigger in variantsAdded[rn]:
                sectionInfo = collections.Counter()

                for occ in matchPositions[variant]:
                    slot = occ[0]

                    section = sectionHead(slot, level=level)
                    sectionInfo[section] += 1

                hitData = [
                    f"{section}x{hits}" for section, hits in sorted(sectionInfo.items())
                ]
                for hits in hitData:
                    lines.append((rn, trigger, variant, hits))

        with fileOpen(mergedReportFile, "w") as fh:
            nLines = len(lines)

            for i, line in enumerate(lines):
                if i < 10 or i > nLines - 10:
                    (row, trigger, variant, hits) = line
                    NE.console(f"{row:<4} {trigger:<40} ~> {variant:<40} = {hits}")

                lineStr = "\t".join(str(x) for x in line)
                fh.write(f"{lineStr}\n")

        pls = "" if nAdded == 1 else "s"
        plt = "" if totAdded == 1 else "s"
        NE.console(f"{nAdded} triggerset{pls} expanded with {totAdded} trigger{plt}")
        NE.console(f"Wrote merge report to file {mergedReportFile}")

        NE.writeSheetData(rows, asFile=mergedFile)
        NE.console(f"Wrote merged triggers to sheet {mergedFile}")

    def showResults(self, start=None, end=None):
        """Show the search results to the console.

        Show (part of) the variants found on the console with additional context.

        Parameters
        ----------
        start: integer, optional None
            The sequence number of the first result to show.
            If `None`, start with the first result.
        end: integer, optional None
            The sequence number of the last result to show.
            If `None`, continue to the last result.
        """
        if not self.properlySetup:
            console("This instance is not properly set up", error=True)
            return

        app = self.app
        F = app.api.F
        matches = self.matches
        matchPositions = self.matchPositions

        i = 0

        for text, candidates in sorted(matches.items())[start:end]:
            i += 1
            nCand = len(candidates)
            pl = "" if nCand == 1 else "s"
            console(f"{i:>4} Variant «{text}» of {nCand} candidate{pl}")
            console("  Occurrences:")
            occs = matchPositions[text]

            for nodes in occs:
                sectionStart = app.sectionStrFromNode(nodes[0])
                sectionEnd = app.sectionStrFromNode(nodes[-1])
                section = (
                    sectionStart
                    if sectionStart == sectionEnd
                    else f"{sectionStart} - {sectionEnd}"
                )
                preStart = max((nodes[0] - 10, 1))
                preEnd = nodes[0]
                postStart = nodes[-1] + 1
                postEnd = min((nodes[-1] + 10, F.otype.maxSlot + 1))
                preText = "".join(
                    f"{F.str.v(n)}{F.after.v(n)}" for n in range(preStart, preEnd)
                )
                inText = "".join(f"{F.str.v(n)}{F.after.v(n)}" for n in nodes)
                postText = "".join(
                    f"{F.str.v(n)}{F.after.v(n)}" for n in range(postStart, postEnd)
                )
                context = f"{section}: {preText}«{inText}»{postText}".replace("\n", " ")
                console(f"    {context}")

            console("  Candidates with score:")

            for cand, score in sorted(candidates.items(), key=lambda x: (-x[1], x[0])):
                console(f"\t{score:4.2f} {cand}")

            console("-----")

    def displayResults(self, start=None, end=None, asFile=None):
        """Display the results as HTML files.

        This content will also be written to the files under the subdirectory
        `extra` in the work directory.

        Parameters
        ----------
        start: integer, optional None
            The sequence number of the first result to show.
            If `None`, start with the first result.
        end: integer, optional None
            The sequence number of the last result to show.
            If `None`, continue to the last result.
        asFile: string, optional None
            If None, the results will be displayed as HTML on the console.
            Otherwise, the results will be written down as a set of HTML files,
            whose names start with this string.
        """

        if not self.properlySetup:
            console("This instance is not properly set up", error=True)
            return

        app = self.app
        L = app.api.L
        matches = self.matches
        matchPositions = self.matchPositions
        lexiconOccs = self.lexiconOccs
        workDir = self.workDir

        if asFile is not None:
            content = []

            htmlStart = HTML_PRE.replace("«css»", app.context.css)
            htmlEnd = HTML_POST

            content.append([htmlStart])
            empty = True

            app.indent(reset=True)
            app.info("Gathering information on extra triggers ...")

        i = 0
        s = 0

        for varText, candidates in sorted(
            matches.items(), key=lambda x: (-len(x[1]), x[0])
        )[start:end]:
            i += 1

            # make a list of where the candidates are and include the score

            varOccs = matchPositions[varText]
            nVar = len(varOccs)

            candOccs = []
            candRep1 = "`" + "` or `".join(candidates) + "`"
            candRep2 = "<code>" + "</code> or <code>".join(candidates) + "</code>"

            for cand, score in candidates.items():
                myOccs = lexiconOccs[cand]

                for occ in myOccs:
                    candOccs.append((occ[0], score, cand, occ))

            # use this list later to find the nearest/best variant

            if asFile is None:
                app.dm(
                    f"# {i}: {nVar} x variant `{varText}` on "
                    f"candidate {candRep1}\n\n"
                )
            else:
                content[-1].append(
                    f"<h1>{i}: {nVar} x variant <code>{varText}</code> on "
                    f"candidate {candRep2}</h1>"
                )
                empty = False

            sections = set()
            highlights = {}
            bestCand = None

            for candOcc in candOccs:
                if bestCand is None or bestCand[1] < candOcc[1]:
                    bestCand = candOcc

            for n in bestCand[3]:
                highlights[n] = "lightgreen"

            section = L.u(bestCand[0], otype="chunk")[0]
            sections.add(section)

            for varNodes in varOccs:
                highlights |= {n: "goldenrod" for n in varNodes}

                nFirst = varNodes[0]
                section = L.u(nFirst, otype="chunk")[0]
                sections.add(section)

                nearestCand = None

                for candOcc in candOccs:
                    if nearestCand is None or abs(nearestCand[0] - nFirst) > abs(
                        candOcc[0] - nFirst
                    ):
                        nearestCand = candOcc

                for n in nearestCand[3]:
                    if n in highlights:
                        highlights[n] = "yellow"
                    else:
                        highlights[n] = "cyan"

                section = L.u(nearestCand[0], otype="chunk")[0]
                sections.add(section)

            sections = tuple((s,) for s in sorted(sections))

            s += len(sections)

            if asFile is None:
                app.table(sections, highlights=highlights, full=True)
            else:
                content[-1].append(
                    app.table(
                        sections, highlights=highlights, full=True, _asString=True
                    )
                )

                if i % 10 == 0:
                    app.info(f"{i:>4} variants done giving {s:>4} chunks")
                if i % 100 == 0:
                    content[-1].append(htmlEnd)
                    content.append([htmlStart])
                    empty = True

        app.info(f"{i:>4} matches done")

        if asFile is not None:
            content[-1].append(htmlEnd)
            if empty:
                content.pop()

            extraFileBase = f"{workDir}/extra"
            initTree(extraFileBase, fresh=True, gentle=True)

            for i, material in enumerate(content):
                extraFile = f"{extraFileBase}/{asFile}{i + 1:>02}.html"

                with fileOpen(extraFile, "w") as fh:
                    fh.write("\n".join(material))

                console(f"Extra triggers written to {extraFile}")

    def makeAlphabet(self):
        """Gathers the alphabet on which the corpus is based.

        The characters of the corpus have already been collected by Text-Fabric,
        and that is from where we pick them up.

        We separate the digits from the rest.
        """
        if not self.properlySetup:
            console("This instance is not properly set up", error=True)
            return

        app = self.app
        C = app.api.C
        workDir = self.workDir

        alphabetFile = f"{workDir}/alphabet.tsv"
        self.alphabetFile = alphabetFile

        with fileOpen(alphabetFile, "w") as fh:
            # This file will consist of one character per line,
            # for each distinct alpha character in the corpus, ordered by frequency.
            # Numeric characters will be put on a single line, with tabs in between.
            # All other characters will be ignored.

            digits = []

            for c, freq in C.characters.data["text-orig-full"]:
                if c.isalpha():
                    fh.write(f"{c}\n")
                elif c.isdigit():
                    digits.append(c)
            fh.write("\t".join(digits))

        console(f"Alphabet written to {alphabetFile}")

    def makeText(self):
        """Generate a plain text from the corpus.

        We make sure that we resolve the hyphenation of words across line
        boundaries.
        """
        if not self.properlySetup:
            console("This instance is not properly set up", error=True)
            return

        NE = self.NE
        app = self.app
        api = app.api
        F = app.api.F
        L = app.api.L
        workDir = self.workDir

        rec = Recorder(api)

        lineType = self.settings.lineType
        slotType = F.otype.slotType
        maxSlot = F.otype.maxSlot
        lines = F.otype.s(lineType)
        lineEnds = {L.d(ln, otype=slotType)[-1] for ln in lines}
        skipTo = None

        for t in range(1, maxSlot + 1):
            tp = t + 1
            tpp = t + 2

            if tp in lineEnds and tp < maxSlot and F.str.v(tp) == "-":
                rec.start(t)
                rec.add(f"{F.str.v(t)}{F.after.v(t)}")
                rec.end(t)
                rec.start(tpp)
                rec.add(f"{F.str.v(tpp)}{F.after.v(tpp)}\n")
                rec.end(tpp)
                skipTo = tpp
            elif skipTo is not None:
                if t < skipTo:
                    continue
                else:
                    skipTo = None
            else:
                rec.start(t)
                rec.add(f"{F.str.v(t)}{F.after.v(t)}")
                rec.end(t)

        self.rec = rec
        textComplete = rec.text()
        self.textComplete = textComplete

        textFile = f"{workDir}/text.txt"

        with fileOpen(textFile, "w") as fh:
            fh.write(textComplete)

        NE.console(f"Text written to {textFile} - {len(textComplete)} characters")

    def makeLexicon(self):
        """Make a lexicon out of the triggers of a spreadsheet."""
        if not self.properlySetup:
            console("This instance is not properly set up", error=True)
            return

        NE = self.NE
        app = self.app
        workDir = self.workDir

        sheetData = self.sheetData
        NEinventory = sheetData.inventory

        app.indent(reset=True)
        app.info("Collecting the triggers for the lexicon")

        inventory = {}

        for eidkind, triggers in NEinventory.items():
            for trigger, scopes in triggers.items():
                inventory.setdefault(trigger, set())

                for occs in scopes.values():
                    for slots in occs:
                        inventory[trigger].add(tuple(slots))

        app.info(f"{len(inventory)} triggers collected")

        remSpaceRe = re.compile(r""" +([^A-Za-z])""")
        accentSpaceRe = re.compile(r"""([’']) +""")

        lexicon = {}
        mapNormal = {}

        lexiconOccs = {}
        self.lexiconOccs = lexiconOccs

        for name, occs in inventory.items():
            occStr = name
            occNormal = remSpaceRe.sub(r"\1", occStr)
            occNormal = accentSpaceRe.sub(r"\1", occNormal)
            nOccs = len(occs)
            lexicon[occNormal] = nOccs
            mapNormal[occNormal] = occStr
            lexiconOccs[occNormal] = occs

        sortedLexicon = sorted(lexicon.items(), key=lambda x: (-x[1], x[0].lower()))

        for name, n in sortedLexicon[0:10]:
            NE.console(f"  {n:>3} x {name}")

        NE.console("  ...")

        for name, n in sortedLexicon[-10:]:
            NE.console(f"  {n:>3} x {name}")

        NE.console(f"{len(lexicon):>8} lexicon length")

        lexiconFile = f"{workDir}/lexicon.tsv"
        self.lexiconFile = lexiconFile

        with fileOpen(lexiconFile, "w") as fh:
            for name, n in sorted(lexicon.items()):
                fh.write(f"{name}\t{n}\n")

        NE.console(f"Lexicon written to {lexiconFile}")

    def setupAnaliticcl(self):
        """Configure analiticcl for the big search.

        We gather the parameters from the variants section of the NER
        config file (see `tf.ner.settings`).

        For the description of the parameters, see the
        [analiticcl tutorial](https://github.com/proycon/analiticcl/blob/master/tutorial.ipynb)
        """
        if not self.properlySetup:
            console("This instance is not properly set up", error=True)
            return

        NE = self.NE
        VariantModel = self.VariantModel
        Weights = self.Weights
        ansettings = self.settings.analiticcl
        weights = ansettings.weights
        wld = weights.ld
        wlcs = weights.lcs
        wprefix = weights.prefix
        wsuffix = weights.suffix
        wcase = weights.case

        alphabetFile = self.alphabetFile
        lexiconFile = self.lexiconFile

        NE.console("Set up analiticcl")

        model = VariantModel(
            alphabetFile,
            Weights(ld=wld, lcs=wlcs, prefix=wprefix, suffix=wsuffix, case=wcase),
        )
        self.model = model
        model.read_lexicon(lexiconFile)
        model.build()

Detect spelling variants.

In order to use this class you may have to install analiticcl. First you have to make sure you have the programming language Rust operational on your machine, and then you have to install analiticcl as a Rust program, and finally you have to do

pip install analiticcl

See the python bindings (https://github.com/proycon/analiticcl/tree/master/bindings/python) for detailed instructions.

If you have an instance of the NER class, e.g. from an earlier call like this

NE = A.makeNer()

and if you have pointed this NE to a NER spreadsheet, e.g. by

NE.setTask(".persons")

then you can get an instance of this class by saying

D = NE.variantDetection()

There are methods to produce a plain text of the corpus, a lexicon based on the triggers in the spreadsheet, and then to search the plain text for variants of the lexicon by passing some well-chosen parameters to analiticcl.

After the search, which may take a long time, the raw results are cached and then filtered, with the help of an optional exceptions file. If you change the filtering parameters, you do not have to rerun the expensive analiticcl search.

The variants found can then be merged with the original triggers, and saved as a new spreadsheet, next to the original one.

The settings for analiticcl are ultimately given in the config.yaml in the ner directory of the corpus, where the other settings for the NER process are also given. See tf.ner.settings.

Parameters

NE : object
A NER instance.
sheet : string, optional None
The name of a NER sheet that serves as input to the variant detection process. If not passed, it is assumed that the NE instance is already switched to a NER sheet before setting up this object. However, in any case, we switch again to the sheet in question to make sure we do so in case-sensitive mode (even if we have used the sheet in case-insensitive mode for the lookup).
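
Putting the pieces together, a typical session with the instance D obtained above looks like this sketch (the comments indicate what each step produces in the work directory):

D.prepare()          # alphabet.tsv, text.txt, lexicon.tsv; sets up the analiticcl model
D.search()           # expensive analiticcl run; raw and filtered results are cached
D.listVariants()     # tabular overview on the console, also written to variants.tsv
D.mergeTriggers()    # enriched spreadsheet next to the original, plus a merge report

After adjusting the filtering settings or the exclusion file, D.search(force=1) redoes only the filtering stage and reuses the cached analiticcl results.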

Methods

def displayResults(self, start=None, end=None, asFile=None)
def displayResults(self, start=None, end=None, asFile=None):
    """Display the results as HTML files.

    This content will also be written to the files under the subdirectory
    `extra` in the work directory.

    Parameters
    ----------
    start: integer, optional None
        The sequence number of the first result to show.
        If `None`, start with the first result.
    end: integer, optional None
        The sequence number of the last result to show.
        If `None`, continue to the last result.
    asFile: string, optional None
        If None, the results will be displayed as HTML on the console.
        Otherwise, the results will be written down as a set of HTML files,
        whose names start with this string.
    """

    if not self.properlySetup:
        console("This instance is not properly set up", error=True)
        return

    app = self.app
    L = app.api.L
    matches = self.matches
    matchPositions = self.matchPositions
    lexiconOccs = self.lexiconOccs
    workDir = self.workDir

    if asFile is not None:
        content = []

        htmlStart = HTML_PRE.replace("«css»", app.context.css)
        htmlEnd = HTML_POST

        content.append([htmlStart])
        empty = True

        app.indent(reset=True)
        app.info("Gathering information on extra triggers ...")

    i = 0
    s = 0

    for varText, candidates in sorted(
        matches.items(), key=lambda x: (-len(x[1]), x[0])
    )[start:end]:
        i += 1

        # make a list of where the candidates are and include the score

        varOccs = matchPositions[varText]
        nVar = len(varOccs)

        candOccs = []
        candRep1 = "`" + "` or `".join(candidates) + "`"
        candRep2 = "<code>" + "</code> or <code>".join(candidates) + "</code>"

        for cand, score in candidates.items():
            myOccs = lexiconOccs[cand]

            for occ in myOccs:
                candOccs.append((occ[0], score, cand, occ))

        # use this list later to find the nearest/best variant

        if asFile is None:
            app.dm(
                f"# {i}: {nVar} x variant `{varText}` on "
                f"candidate {candRep1}\n\n"
            )
        else:
            content[-1].append(
                f"<h1>{i}: {nVar} x variant <code>{varText}</code> on "
                f"candidate {candRep2}</h1>"
            )
            empty = False

        sections = set()
        highlights = {}
        bestCand = None

        for candOcc in candOccs:
            if bestCand is None or bestCand[1] < candOcc[1]:
                bestCand = candOcc

        for n in bestCand[3]:
            highlights[n] = "lightgreen"

        section = L.u(bestCand[0], otype="chunk")[0]
        sections.add(section)

        for varNodes in varOccs:
            highlights |= {n: "goldenrod" for n in varNodes}

            nFirst = varNodes[0]
            section = L.u(nFirst, otype="chunk")[0]
            sections.add(section)

            nearestCand = None

            for candOcc in candOccs:
                if nearestCand is None or abs(nearestCand[0] - nFirst) > abs(
                    candOcc[0] - nFirst
                ):
                    nearestCand = candOcc

            for n in nearestCand[3]:
                if n in highlights:
                    highlights[n] = "yellow"
                else:
                    highlights[n] = "cyan"

            section = L.u(nearestCand[0], otype="chunk")[0]
            sections.add(section)

        sections = tuple((s,) for s in sorted(sections))

        s += len(sections)

        if asFile is None:
            app.table(sections, highlights=highlights, full=True)
        else:
            content[-1].append(
                app.table(
                    sections, highlights=highlights, full=True, _asString=True
                )
            )

            if i % 10 == 0:
                app.info(f"{i:>4} variants done giving {s:>4} chunks")
            if i % 100 == 0:
                content[-1].append(htmlEnd)
                content.append([htmlStart])
                empty = True

    app.info(f"{i:>4} matches done")

    if asFile is not None:
        content[-1].append(htmlEnd)
        if empty:
            content.pop()

        extraFileBase = f"{workDir}/extra"
        initTree(extraFileBase, fresh=True, gentle=True)

        for i, material in enumerate(content):
            extraFile = f"{extraFileBase}/{asFile}{i + 1:>02}.html"

            with fileOpen(extraFile, "w") as fh:
                fh.write("\n".join(material))

            console(f"Extra triggers written to {extraFile}")

Display the results as HTML files.

This content will also be written to the files under the subdirectory extra in the work directory.

Parameters

start : integer, optional None
The sequence number of the first result to show. If None, start with the first result.
end : integer, optional None
The sequence number of the last result to show. If None, continue to the last result.
asFile : string, optional None
If None, the results will be displayed as HTML on the console. Otherwise, the results will be written down as a set of HTML files, whose names start with this string.
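
For example, a call like the one below (the base name "variants" is an arbitrary choice) writes the results as numbered files variants01.html, variants02.html, … to the extra subdirectory of the work directory, starting a new file after every 100 variants:

D.displayResults(asFile="variants")
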
def listVariants(self, start=None, end=None)
def listVariants(self, start=None, end=None):
    """List the search results to the console.

    Show (part of) the variants found on the console as a plain text table.

    This content will also be written to the file `variants.tsv` in the
    work directory.

    Parameters
    ----------
    start: integer, optional None
        The sequence number of the first result to show.
        If `None`, start with the first result.
    end: integer, optional None
        The sequence number of the last result to show.
        If `None`, continue to the last result.
    """
    if not self.properlySetup:
        console("This instance is not properly set up", error=True)
        return

    workDir = self.workDir
    matches = self.matches

    lines = []

    head = ("variant", "score", "candidate")
    dash = f"{'-' * 4} | {'-' * 25} | {'-' * 5} | {'-' * 25}"
    console(f"{'i':>4} | {head[0]:<25} | {head[1]} | {head[2]}")
    console(f"{dash}")
    startN = start or 0

    for text, candidates in sorted(matches.items(), key=lambda x: x[0].lower()):
        for cand, score in sorted(candidates.items(), key=lambda x: x[0].lower()):
            lines.append((text, score, cand))

    for i, (text, score, cand) in enumerate(lines[start:end]):
        console(f"{i + startN:>4} | {text:<25} |  {score:4.2f} | {cand}")

    console(f"{dash}")

    file = f"{workDir}/variants.tsv"

    with fileOpen(file, "w") as fh:
        headStr = "\t".join(head)
        fh.write(f"{headStr}\n")
        for text, score, cand in lines:
            fh.write(f"{text}\t{score:4.2f}\t{cand}\n")

    console(f"{len(matches)} variants found and written to {file}")

List the search results to the console.

Show (part of) the variants found on the console as a plain text table.

This content will also be written to the file variants.tsv in the work directory.

Parameters

start : integer, optional None
The sequence number of the first result to show. If None, start with the first result.
end : integer, optional None
The sequence number of the last result to show. If None, continue to the last result.
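
For example, the call below (the slice bounds are arbitrary) prints only the first 20 rows of the table, while the complete list is still written to variants.tsv:

D.listVariants(0, 20)
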
def makeAlphabet(self)
def makeAlphabet(self):
    """Gathers the alphabet on which the corpus is based.

    The characters of the corpus have already been collected by Text-Fabric,
    and that is from where we pick them up.

    We separate the digits from the rest.
    """
    if not self.properlySetup:
        console("This instance is not properly set up", error=True)
        return

    app = self.app
    C = app.api.C
    workDir = self.workDir

    alphabetFile = f"{workDir}/alphabet.tsv"
    self.alphabetFile = alphabetFile

    with fileOpen(alphabetFile, "w") as fh:
        # This file will consist of one character per line,
        # for each distinct alpha character in the corpus, ordered by frequency.
        # Numeric characters will be put on a single line, with tabs in between.
        # All other characters will be ignored.

        digits = []

        for c, freq in C.characters.data["text-orig-full"]:
            if c.isalpha():
                fh.write(f"{c}\n")
            elif c.isdigit():
                digits.append(c)
        fh.write("\t".join(digits))

    console(f"Alphabet written to {alphabetFile}")

Gathers the alphabet on which the corpus is based.

The characters of the corpus have already been collected by Text-Fabric, and that is from where we pick them up.

We separate the digits from the rest.
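
As a purely illustrative sketch (the actual characters and their frequency order depend on the corpus), the resulting alphabet.tsv has one alphabetic character per line and all digits together on a final tab-separated line:

e
a
i
...
0	1	2	3	4	5	6	7	8	9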

def makeLexicon(self)
def makeLexicon(self):
    """Make a lexicon out of the triggers of a spreadsheet."""
    if not self.properlySetup:
        console("This instance is not properly set up", error=True)
        return

    NE = self.NE
    app = self.app
    workDir = self.workDir

    sheetData = self.sheetData
    NEinventory = sheetData.inventory

    app.indent(reset=True)
    app.info("Collecting the triggers for the lexicon")

    inventory = {}

    for eidkind, triggers in NEinventory.items():
        for trigger, scopes in triggers.items():
            inventory.setdefault(trigger, set())

            for occs in scopes.values():
                for slots in occs:
                    inventory[trigger].add(tuple(slots))

    app.info(f"{len(inventory)} triggers collected")

    remSpaceRe = re.compile(r""" +([^A-Za-z])""")
    accentSpaceRe = re.compile(r"""([’']) +""")

    lexicon = {}
    mapNormal = {}

    lexiconOccs = {}
    self.lexiconOccs = lexiconOccs

    for name, occs in inventory.items():
        occStr = name
        occNormal = remSpaceRe.sub(r"\1", occStr)
        occNormal = accentSpaceRe.sub(r"\1", occNormal)
        nOccs = len(occs)
        lexicon[occNormal] = nOccs
        mapNormal[occNormal] = occStr
        lexiconOccs[occNormal] = occs

    sortedLexicon = sorted(lexicon.items(), key=lambda x: (-x[1], x[0].lower()))

    for name, n in sortedLexicon[0:10]:
        NE.console(f"  {n:>3} x {name}")

    NE.console("  ...")

    for name, n in sortedLexicon[-10:]:
        NE.console(f"  {n:>3} x {name}")

    NE.console(f"{len(lexicon):>8} lexicon length")

    lexiconFile = f"{workDir}/lexicon.tsv"
    self.lexiconFile = lexiconFile

    with fileOpen(lexiconFile, "w") as fh:
        for name, n in sorted(lexicon.items()):
            fh.write(f"{name}\t{n}\n")

    NE.console(f"Lexicon written to {lexiconFile}")

Make a lexicon out of the triggers of a spreadsheet.

def makeText(self)
def makeText(self):
    """Generate a plain text from the corpus.

    We make sure that we resolve the hyphenation of words across line
    boundaries.
    """
    if not self.properlySetup:
        console("This instance is not properly set up", error=True)
        return

    NE = self.NE
    app = self.app
    api = app.api
    F = app.api.F
    L = app.api.L
    workDir = self.workDir

    rec = Recorder(api)

    lineType = self.settings.lineType
    slotType = F.otype.slotType
    maxSlot = F.otype.maxSlot
    lines = F.otype.s(lineType)
    lineEnds = {L.d(ln, otype=slotType)[-1] for ln in lines}
    skipTo = None

    for t in range(1, maxSlot + 1):
        tp = t + 1
        tpp = t + 2

        if tp in lineEnds and tp < maxSlot and F.str.v(tp) == "-":
            rec.start(t)
            rec.add(f"{F.str.v(t)}{F.after.v(t)}")
            rec.end(t)
            rec.start(tpp)
            rec.add(f"{F.str.v(tpp)}{F.after.v(tpp)}\n")
            rec.end(tpp)
            skipTo = tpp
        elif skipTo is not None:
            if t < skipTo:
                continue
            else:
                skipTo = None
        else:
            rec.start(t)
            rec.add(f"{F.str.v(t)}{F.after.v(t)}")
            rec.end(t)

    self.rec = rec
    textComplete = rec.text()
    self.textComplete = textComplete

    textFile = f"{workDir}/text.txt"

    with fileOpen(textFile, "w") as fh:
        fh.write(textComplete)

    NE.console(f"Text written to {textFile} - {len(textComplete)} characters")

Generate a plain text from the corpus.

We make sure that we resolve the hyphenation of words across line boundaries.
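
For instance (a hypothetical fragment): if a line ends with the tokens «Schivel» «-» and the next line starts with «bergh», the hyphen slot is skipped and the plain text receives «Schivelbergh» followed by a newline, so that analiticcl sees a single word.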

def mergeTriggers(self, level=1)
def mergeTriggers(self, level=1):
    """Merge spelling variants of triggers into a NER sheet.

    When we have found spelling variants of triggers, we want to include
    them in the entity lookup. This function places the variants in the same
    cells as the triggers they are variants of. However, it will not
    overwrite the original spreadsheet, but create a new, enriched spreadsheet.

    We collect the necessary information as follows:

    *   `matches: dict`:
        The spelling variants are keys, and their values are again dicts, keyed
        by the words in the triggers that come closest, and valued by a measure
        of the proximity.
        It is assumed that all of these variants are good variants, in that the
        scores are always above a certain threshold, e.g. 0.8.
    *   `mergedFile: string`:
        The path of the new spreadsheet with the merged triggers: it will sit next
        to the original spreadsheet, but with an extension such as `-merged` added
        to its name (this can be configured in the NER config file near the corpus).
    *   `exclusionFile: string, optional None`
        The path to an optional file with exclusions, one per line.
        Variants that occur in the exclusion list will not be merged in.
        The file sits next to the original spreadsheet, but with an extension such
        as `-notmerged` (this is configurable) and file extension `.txt`.

    This function will also produce a report of the added variants and their
    occurrences.

    Parameters
    ----------
    level: integer, optional 1
        Only relevant for reporting the new variants. Occurrences of the
        new variants are counted by section. This parameter specifies the
        level of those sections. It should be 1, 2 or 3.
    """
    if not self.properlySetup:
        console("This instance is not properly set up", error=True)
        return

    NE = self.NE
    trigI = NE.trigI
    commentI = NE.commentI
    sheetDir = NE.sheetDir
    sheetName = NE.sheetName
    settings = self.settings
    mergedExtension = settings.mergedExtension
    notMergedExtension = settings.notMergedExtension
    mergedFile = f"{sheetDir}/{sheetName}{mergedExtension}.xlsx"
    mergedReportFile = f"{sheetDir}/{sheetName}{mergedExtension}-report.tsv"
    exclusionFile = f"{sheetDir}/{sheetName}{notMergedExtension}.txt"

    if not fileExists(exclusionFile):
        NE.console(f"File with excluded variants not found: {exclusionFile}")
        exclusionFile = None

    matches = self.matches
    matchPositions = self.matchPositions
    sectionHead = NE.sectionHead

    noVariant = set()

    if exclusionFile is not None and fileExists(exclusionFile):
        with fileOpen(exclusionFile) as fh:
            for line in fh:
                noVariant.add(line.strip())

        nNoVariant = len(noVariant)
        pl = "" if nNoVariant == 1 else "s"
        NE.console(f"{nNoVariant} excluded variant{pl} found in {exclusionFile}")

    mapping = {}
    excluded = 0

    for text, candidates in matches.items():
        if text in noVariant:
            excluded += 1
            continue
        for cand in candidates:
            mapping.setdefault(cand, set()).add(text)

    ple = "" if excluded == 1 else "s"
    NE.console(f"{excluded} variant{ple} excluded as trigger")

    rows = NE.readSheetData()

    nAdded = 0
    totAdded = 0

    variantsAdded = {}

    for r, row in enumerate(rows):
        if r == 0 or r == 1 or row[commentI].startswith("#"):
            continue

        rn = r + 1
        triggers = set(row[trigI])
        nPrev = len(triggers)

        newTriggers = []

        for trigger in triggers:
            newTriggers.append(trigger)

            for variant in mapping.get(trigger, []):
                newTriggers.append(variant)
                variantsAdded.setdefault(rn, []).append((variant, trigger))

        newTriggers = sorted(set(newTriggers))
        row[trigI] = newTriggers
        nPost = len(newTriggers)

        nDiff = nPost - nPrev

        if nDiff != 0:
            nAdded += 1
            totAdded += nDiff

    lines = [("row", "trigger", "variant", "occurences")]

    for rn in sorted(variantsAdded):
        for variant, trigger in variantsAdded[rn]:
            sectionInfo = collections.Counter()

            for occ in matchPositions[variant]:
                slot = occ[0]

                section = sectionHead(slot, level=level)
                sectionInfo[section] += 1

            hitData = [
                f"{section}x{hits}" for section, hits in sorted(sectionInfo.items())
            ]
            for hits in hitData:
                lines.append((rn, trigger, variant, hits))

    with fileOpen(mergedReportFile, "w") as fh:
        nLines = len(lines)

        for i, line in enumerate(lines):
            if i < 10 or i > nLines - 10:
                (row, trigger, variant, hits) = line
                NE.console(f"{row:<4} {trigger:<40} ~> {variant:<40} = {hits}")

            lineStr = "\t".join(str(x) for x in line)
            fh.write(f"{lineStr}\n")

    pls = "" if nAdded == 1 else "s"
    plt = "" if totAdded == 1 else "s"
    NE.console(f"{nAdded} triggerset{pls} expanded with {totAdded} trigger{plt}")
    NE.console(f"Wrote merge report to file {mergedReportFile}")

    NE.writeSheetData(rows, asFile=mergedFile)
    NE.console(f"Wrote merged triggers to sheet {mergedFile}")

Merge spelling variants of triggers into a NER sheet.

When we have found spelling variants of triggers, we want to include them in the entity lookup. This function places the variants in the same cells as the triggers they are variants of. However, it will not overwrite the original spreadsheet, but create a new, enriched spreadsheet.

We collect the necessary information as follows:

  • matches: dict: The spelling variants are keys, and their values are again dicts, keyed by the words in the triggers that come closest, and valued by a measure of the proximity. It is assumed that all of these variants are good variants, in that the scores are always above a certain threshold, e.g. 0.8.
  • mergedFile: string: The path of the new spreadsheet with the merged triggers: it will sit next to the original spreadsheet, but with an extension such as -merged added to its name (this can be configured in the NER config file near the corpus).
  • exclusionFile: string, optional None The path to an optional file with exclusions, one per line. Variants that occur in the exclusion list will not be merged in. The file sits next to the original spreadsheet, but with an extension such as -notmerged (this is configurable) and file extension .txt.

This function will also produce a report of the added variants and their occurrences.

Parameters

level : integer, optional 1
Only relevant for reporting the new variants. Occurrences of the new variants are counted by section. This parameter specifies the level of those sections. It should be 1, 2 or 3.
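
As a sketch (the extensions -merged and -notmerged are the examples mentioned above; the real values come from the NER config file), for a sheet named ".persons" a call like

D.mergeTriggers(level=2)

consults .persons-notmerged.txt for exclusions if that file exists, writes the enriched spreadsheet .persons-merged.xlsx, and writes a report .persons-merged-report.tsv counting, per level-2 section, the occurrences of every variant that was added.
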
def prepare(self)
def prepare(self):
    """Prepare the data for the search of spelling variants.

    We construct an alphabet and a plain text out of the corpus,
    and we construct a lexicon from the triggers in the current spreadsheet.
    """
    if not self.properlySetup:
        console("This instance is not properly set up", error=True)
        return

    self.makeAlphabet()
    self.makeText()
    self.makeLexicon()
    self.setupAnaliticcl()

Prepare the data for the search of spelling variants.

We construct an alphabet and a plain text out of the corpus, and we construct a lexicon from the triggers in the current spreadsheet.
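
In terms of the other methods of this class: prepare() runs makeAlphabet, makeText, makeLexicon and setupAnaliticcl in that order, leaving alphabet.tsv, text.txt and lexicon.tsv in the work directory and a configured analiticcl model in memory, ready for search().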

def search(self, start=None, end=None, force=0)
def search(self, start=None, end=None, force=0):
    """Search for spelling variants in the corpus.

    We search part of the corpus or the whole corpus for spelling variants.
    The process has two stages:

    1.  a run of analiticcl
    2.  filtering of the results

    The run of analiticcl is expensive, more than 10 minutes on the
    [Suriano corpus](https://gitlab.huc.knaw.nl/suriano/letters).

    The results of stage 1 will be cached. For every choice of search parameters
    and boundary points in the corpus there is a separate cache.

    A number of analiticcl-specific parameters will influence the search.
    They can be tweaked in the config file of the NER module, under the variants
    section.

    Parameters
    ----------
    start: integer, optional None
        The place in the corpus where the search has to start; it is the offset
        in the plain text. If `None`, start from the beginning of the corpus.
    end: integer, optional None
        The place in the corpus where the search must end; it is the offset
        in the plain text. If `None`, the search will be performed till the end
        of the corpus.
    force: integer, optional 0
        Valid values are `0`, `1`, `2`.
        Meaning: `0`: use the cached result, if it is available.
        `1`: use the cached result for stage 1, if available, but perform the
        filtering (again).
        `2`: do not use the cache, but compute everything again.
    """
    if not self.properlySetup:
        console("This instance is not properly set up", error=True)
        return

    SearchParameters = self.SearchParameters
    ansettings = self.settings.analiticcl
    searchParams = ansettings.searchParams
    suoffsets = searchParams.unicodeoffsets
    smaxngram = searchParams.max_ngram
    sfreqweight = searchParams.freq_weight
    scoring = ansettings.scoring
    sthreshold = scoring.threshold

    NE = self.NE
    app = self.app
    model = self.model
    rec = self.rec
    textComplete = self.textComplete
    workDir = self.workDir
    lexiconOccs = self.lexiconOccs

    text = textComplete[start:end]
    nText = len(text)

    offset = 0 if start is None else nText + start if start < 0 else start

    NE.console(f"{nText:>8} text  length")
    NE.console(f"{offset:>8} offset in complete text")

    slug = f"{suoffsets}-{smaxngram}-{sfreqweight}-{sthreshold}"
    matchesFile = f"{workDir}/matches-{start}-{end}-settings={slug}.json"
    matchesPosFile = f"{workDir}/matchespos-{start}-{end}-settings={slug}.json"
    rawMatchesFile = f"{workDir}/rawmatches-{start}-{end}-settings={slug}.json"

    app.indent(reset=True)

    if force == 2 or not fileExists(rawMatchesFile):
        app.info("Compute variants of the lexicon words ...")
        rawMatches = model.find_all_matches(
            text,
            SearchParameters(
                unicodeoffsets=suoffsets,
                max_ngram=smaxngram,
                freq_weight=sfreqweight,
                score_threshold=sthreshold,
            ),
        )
        writeJson(rawMatches, asFile=rawMatchesFile)
    else:
        app.info("Read previously computed variants of the lexicon words ...")
        rawMatches = readJson(asFile=rawMatchesFile, plain=False)

    app.info(f"{len(rawMatches):>8} raw   matches")

    if force == 1 or not fileExists(matchesFile) or not fileExists(matchesPosFile):
        app.info("Filter variants of the lexicon words ...")
        positions = rec.positions(simple=True)

        matches = {}
        matchPositions = collections.defaultdict(list)

        for match in rawMatches:
            text = match["input"].replace("\n", " ")
            textL = text.lower()

            if text in lexiconOccs:
                continue

            candidates = match["variants"]

            if len(candidates) == 0:
                continue

            candidates = {
                cand["text"]: s
                for cand in candidates
                if (s := cand["score"]) >= sthreshold
            }

            if len(candidates) == 0:
                continue

            textRemove = set()

            for cand in candidates:
                candL = cand.lower()
                if candL == textL:
                    textRemove.add(cand)

            for cand in textRemove:
                del candidates[cand]

            if len(candidates) == 0:
                continue

            # if the match ends with 's we remove the part without it from the
            # candidates

            if text.endswith("'s"):
                head = text.removesuffix("'s")
                if head in candidates:
                    del candidates[head]

                    if len(candidates) == 0:
                        continue

            # we have another need to filter: if the text of a match is one short
            # word longer than a candidate, we remove that candidate,
            # provided the extra word has at most 3 letters
            # (the additional requirement that it be lower case is currently
            # switched off, see the commented-out conditions below);
            # this is to prevent cases like
            # «Adam Schivelbergh» versus «Adam Schivelbergh di»
            #
            # We do this also when the extra word is at the start, like
            # «di monsignor Mangot» versus «monsignor Mangot»

            parts = text.split()

            if len(parts) > 0:
                (head, tail) = (parts[0:-1], parts[-1])

                # if len(tail) <= 3 and tail.islower():
                if len(tail) <= 3:
                    head = " ".join(head)

                    if head in candidates:
                        del candidates[head]

                        if len(candidates) == 0:
                            continue

                (head, tail) = (parts[0], parts[1:])

                # if len(head) <= 3 and head.islower():
                if len(head) <= 3:
                    tail = " ".join(tail)

                    if tail in candidates:
                        del candidates[tail]

                        if len(candidates) == 0:
                            continue

            position = match["offset"]
            start = position["begin"]
            end = position["end"]
            nodes = sorted(
                {positions[i] for i in range(offset + start, offset + end)}
            )

            matches[text] = candidates
            matchPositions[text].append(nodes)

        writeJson(matches, asFile=matchesFile)
        writeJson(matchPositions, asFile=matchesPosFile)
    else:
        app.info("Read previously filtered variants of the lexicon words ...")
        matches = readJson(asFile=matchesFile, plain=False)
        matchPositions = readJson(asFile=matchesPosFile, plain=False)

    app.info(f"{len(matches):>8} filtered matches")

    self.matches = matches
    self.matchPositions = matchPositions
    console(f"{len(matches)} variants found")

Search for spelling variants in the corpus.

We search part of the corpus or the whole corpus for spelling variants. The process has two stages:

  1. a run of analiticcl
  2. filtering of the results

The run of analiticcl is expensive, more than 10 minutes on the Suriano corpus.

The results of stage 1 will be cached. For every choice of search parameters and boundary points in the corpus there is a separate cache.

A number of analiticcl-specific parameters will influence the search. They can be tweaked in the config file of the NER module, under the variants section.
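
For orientation, a raw stage-1 match has roughly the following shape (the field names are the ones used by the filtering code above; the concrete values are invented):

```
rawMatch = {
    "input": "Adam Schivelberg",             # text fragment found in the corpus
    "offset": {"begin": 1032, "end": 1048},  # character offsets in the searched text
    "variants": [
        {"text": "Adam Schivelbergh", "score": 0.92},  # lexicon entries it resembles
    ],
}
```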

Parameters

start : integer, optional None
The place in the corpus where the search has to start; it is the offset in the plain text. If None, start from the beginning of the corpus.
end : integer, optional None
The place in the corpus where the search must end; it is the offset in the plain text. If None, the search will be performed till the end of the corpus.
force : integer, optional 0
Valid values are 0, 1, 2. Meaning: 0: use the cached result, if it is available. 1: use the cached result for stage 1, if available, but perform the filtering (again). 2: do not use the cache, but compute everything again.
def setupAnaliticcl(self)
Expand source code Browse git
def setupAnaliticcl(self):
    """Configure analiticcl for the big search.

    We gather the parameters from the variants section of the NER
    config file (see `tf.ner.settings`).

    For the description of the parameters, see the
    [analiticcl tutorial](https://github.com/proycon/analiticcl/blob/master/tutorial.ipynb)
    """
    if not self.properlySetup:
        console("This instance is not properly set up", error=True)
        return

    NE = self.NE
    VariantModel = self.VariantModel
    Weights = self.Weights
    ansettings = self.settings.analiticcl
    weights = ansettings.weights
    wld = weights.ld
    wlcs = weights.lcs
    wprefix = weights.prefix
    wsuffix = weights.suffix
    wcase = weights.case

    alphabetFile = self.alphabetFile
    lexiconFile = self.lexiconFile

    NE.console("Set up analiticcl")

    model = VariantModel(
        alphabetFile,
        Weights(ld=wld, lcs=wlcs, prefix=wprefix, suffix=wsuffix, case=wcase),
    )
    self.model = model
    model.read_lexicon(lexiconFile)
    model.build()

Configure analiticcl for the big search.

We gather the parameters from the variants section of the NER config file (see tf.ner.settings).

For the description of the parameters, see the analiticcl tutorial.
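
Stripped of the corpus machinery, the analiticcl calls above amount to the following sketch; the file names and the weight and parameter values are invented here, whereas in this module they come from the NER config file:

```
from analiticcl import VariantModel, Weights, SearchParameters

# hypothetical alphabet and lexicon files, invented weight values
model = VariantModel(
    "alphabet.tsv",
    Weights(ld=1.0, lcs=1.0, prefix=0.25, suffix=0.25, case=0.0),
)
model.read_lexicon("lexicon.tsv")
model.build()

# search a piece of text for variants of the lexicon entries
results = model.find_all_matches(
    "some corpus text",
    SearchParameters(
        unicodeoffsets=True,
        max_ngram=3,
        freq_weight=0.0,
        score_threshold=0.7,
    ),
)
```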

def showResults(self, start=None, end=None)
Expand source code Browse git
def showResults(self, start=None, end=None):
    """Show the search results to the console.

    Show (part of) the variants found on the console with additional context.

    Parameters
    ----------
    start: integer, optional None
        The sequence number of the first result to show.
        If `None`, start with the first result.
    end: integer, optional None
        The sequence number of the last result to show.
        If `None`, continue to the last result.
    """
    if not self.properlySetup:
        console("This instance is not properly set up", error=True)
        return

    app = self.app
    F = app.api.F
    matches = self.matches
    matchPositions = self.matchPositions

    i = 0

    for text, candidates in sorted(matches.items())[start:end]:
        i += 1
        nCand = len(candidates)
        pl = "" if nCand == 1 else "s"
        console(f"{i:>4} Variant «{text}» of {nCand} candidate{pl}")
        console("  Occurrences:")
        occs = matchPositions[text]

        for nodes in occs:
            sectionStart = app.sectionStrFromNode(nodes[0])
            sectionEnd = app.sectionStrFromNode(nodes[-1])
            section = (
                sectionStart
                if sectionStart == sectionEnd
                else f"{sectionStart} - {sectionEnd}"
            )
            preStart = max((nodes[0] - 10, 1))
            preEnd = nodes[0]
            postStart = nodes[-1] + 1
            postEnd = min((nodes[-1] + 10, F.otype.maxSlot + 1))
            preText = "".join(
                f"{F.str.v(n)}{F.after.v(n)}" for n in range(preStart, preEnd)
            )
            inText = "".join(f"{F.str.v(n)}{F.after.v(n)}" for n in nodes)
            postText = "".join(
                f"{F.str.v(n)}{F.after.v(n)}" for n in range(postStart, postEnd)
            )
            context = f"{section}: {preText}«{inText}»{postText}".replace("\n", " ")
            console(f"    {context}")

        console("  Candidates with score:")

        for cand, score in sorted(candidates.items(), key=lambda x: (-x[1], x[0])):
            console(f"\t{score:4.2f} {cand}")

        console("-----")

Show the search results to the console.

Show (part of) the variants found on the console with additional context.

Parameters

start : integer, optional None
The sequence number of the first result to show. If None, start with the first result.
end : integer, optional None
The sequence number of the last result to show. If None, continue to the last result.
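
Putting it all together, a typical session with this class might look like this (again assuming `D` is a properly set up instance):

```
D.prepare()           # corpus text, alphabet, lexicon, analiticcl model
D.search()            # expensive; the raw results are cached per parameter choice
D.showResults(0, 20)  # inspect the first 20 variants found, with context and scores
```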