Module `tf.ner.sheets`

Management of NER spreadsheets.

NER spreadsheets contain information about named entities and the trigger strings by which they can be found in the corpus.

Global variables

var SHEET_KEYS

Keys by which the data of a NER sheet is organized.

caseSensitive: boolean, whether to look for triggers in the corpus in a case-sensitive way;
metaFields: the list of colums in the NER spreadsheet that are not involved in entity lookup: this is the metadata of an entity;
metaData: the metadata of an entity as dictionary;
logData: the collected messages issues while reading and processing a NER sheet;
nameMap: mapping from entities to names;
scopeMap: mapping from scope strings to scope data structures;
rowMap: mapping from triggers to the rows where they occur;
instructions: compiled search instructions to search for all triggers simultaneously; more precisely, it is a mapping from intervals to tuples of three dictionaries:
- tPos: dictionary that maps positions to tokens that a trigger may have in that position to the set of triggers that have that token in that position;
- tMap: a mapping from triggers to scopes; the idea is that in every interval all triggers that are active in that interval have exactly one scope that includes that interval; the mapping gives that scope;
- idMap: a mapping from triggers to entities; given a trigger that is active in an interval, and given its scope in that interval, there is exactly one entity that is triggered; the eid and kind of that entity identify it and is the value;
inventory: complete result of the search for triggers;
triggerFromMatch: mapping for each match in the corpus which trigger was matched in what scope;
triggerScopes: for each trigger, the set of its scopes where it is used is stored;
allTriggers: the set of all triggers, more precisely, the set of all trigger-entity-scope combinations;
hitData: overview of the inventory (see above).

Classes

class Sheets (sheets=None)

Expand source code Browse git

class Sheets(Scopes, Triggers):
    def __init__(self, sheets=None):
        """Handling of NER spreadsheets.

        A NER spreadsheet contains entity information; in particular it links
        named entities to triggers by which they can be found in the corpus.

        See `readSheetData()` for the description of the shape of the spreadsheet,
        which is expected to be an Excel sheet.

        Parameters
        ----------
        sheets: dict, optional None
            Sheet data to start with. Relevant for when the TF browser uses this class.
            See `tf.ner.ner.NER`
        """
        CI = CheckImport("openpyxl")

        if CI.importOK(hint=True):
            openpyxl = CI.importGet()
            self.loadXls = openpyxl.load_workbook
            self.Workbook = openpyxl.Workbook
        else:
            self.properlySetup = False
            return None

        self.properlySetup = True
        self.scopeI = 3
        self.trigI = 4
        self.commentI = 0

        # browse = self.browse
        self.sheets = sheets
        settings = self.settings
        self.spaceEscaped = settings.spaceEscaped
        self.transform = settings.transform
        keywordFeatures = settings.keywordFeatures
        kindFeature = keywordFeatures[0]
        self.kindFeature = kindFeature
        defaultValues = settings.defaultValues
        self.defaultKind = defaultValues.get(kindFeature, "")

        self.sheetName = None
        """The current NER sheet."""

        self.sheetNames = set()
        """The set of names of NER sheets that are present on the file system."""

        self.readSheets()

    def getToTokensFunc(self):
        """Make a tokenize function.

        Produce a function that can tokenize strings in the same way as the corpus
        has been tokenized.

        Returns
        -------
        function
            This function takes a string and returns an iterable of tokens.
            The tokenization is aware of whether the current sheet works
            case-sensitively, and whether spaces have been escaped as underscores.
        """
        settings = self.settings
        spaceEscaped = settings.spaceEscaped
        sheetData = self.getSheetData()
        caseSensitive = sheetData.caseSensitive

        def myToTokens(trigger):
            return toTokens(
                trigger, spaceEscaped=spaceEscaped, caseSensitive=caseSensitive
            )

        return myToTokens

    def readSheets(self):
        """Read the list current ner sheets (again).

        Use this when you change ner sheets outside the NER browser, e.g.
        by editing the spreadsheets in Excel.
        """
        sheetDir = self.sheetDir
        setNames = self.setNames

        sheetNames = set()
        self.sheetNames = sheetNames

        for sheetFile in dirContents(sheetDir)[0]:
            if extNm(sheetFile) == "xlsx":
                sheetName = sheetFile.removesuffix(".xlsx")
                sheetNames.add(sheetName)
                setNames.add(f".{sheetName}")

    def setSheet(self, newSheet, force=False, caseSensitive=None, forceSilent=False):
        """Switch to a named ner sheet.

        After the switch, the new sheet will be loaded into memory, processed, and
        executed.

        Parameters
        ----------
        newSheet: string
            The name of the new ner sheet to switch to.
        force: boolean, optional False
            If True, do not load from cached data, but do all computations afresh.
        caseSensitive: boolean, optional None
            Whether to work with the spreadsheet in a case-sensitive way.
            If `None`, the value is taken from the current instance of this class.
        forceSilent: boolean, optional False
            If True, all output is suppressed, irrespective of the `silent` member
            in the instance. Only show stopping error messages are let through.

        """
        if not self.properlySetup:
            return

        browse = self.browse

        sheetName = self.sheetName

        sheetNames = self.sheetNames
        sheetDir = self.sheetDir
        sheetFile = f"{sheetDir}/{newSheet}.xlsx"

        if newSheet not in sheetNames and not fileExists(sheetFile):
            if not browse:
                console(f"NER sheet {newSheet} ({sheetFile}) does not exist")
            if newSheet is not None:
                return

        # if not browse:
        #    self.setSet("" if newSheet is None else f".{newSheet}")

        if newSheet != sheetName:
            sheetName = newSheet
            self.sheetName = sheetName

        if caseSensitive is None:
            caseSensitive = self.caseSensitive

        self.forceSilent = True
        self.loadSheetData(force=force, caseSensitive=caseSensitive)

    def loadSheetData(self, force=False, caseSensitive=False):
        """Loads the current ner sheet into memory, if there is one.

        If the current ner sheet is None, nothing has to be done.

        Otherwise, we read the corresponding excel sheet(s) from disk and compile them
        into instructions, if needed.

        When loading a sheet, it will be checked first whether its data is already
        in memory and uptotdate, in that case nothing will be done.

        Otherwise, it will be checked wether an uptodate version of its data exists
        on disk. If so, it will be loaded.

        Otherwise, or if `force=True` is passed, the data will be computed from
        scratch and saved to disk, with a time stamp.

        Parameters
        ----------
        force: boolean, optional False
            If True, do not load from cached data, but do all computations afresh.
        caseSensitive: boolean, optional False
            Whether to work with the spreadsheet in a case-sensitive way.
        """
        if not self.properlySetup:
            return

        sheetName = self.sheetName
        setName = self.setName
        annoDir = self.annoDir
        setDir = f"{annoDir}/{setName}"
        dataFile = f"{setDir}/data.gz"
        timeFile = f"{setDir}/time.txt"
        timeKey = "time"

        if sheetName is None:
            return None

        forceSilent = self.forceSilent
        sheets = self.sheets

        if sheetName not in sheets:
            sheets[sheetName] = AttrDict()

        browse = self.browse
        sheetData = self.getSheetData()
        sheetDir = self.sheetDir
        sheetFile = f"{sheetDir}/{sheetName}.xlsx"
        sheetExists = fileExists(sheetFile)
        sheetUpdated = mTime(sheetFile) if sheetExists else 0
        needReload = (
            force
            or sheetData.caseSensitive != caseSensitive
            or sheetData.get(timeKey, 0) < sheetUpdated
        )

        showLog = True

        if needReload:
            # try to load from zipped data file first
            tm = 0
            dataUptodate = not force and fileExists(timeFile)

            if dataUptodate:
                with fileOpen(timeFile) as fh:
                    info = fh.read().strip()
                try:
                    tm = float(info)
                    dataUptodate = sheetUpdated < tm
                except Exception:
                    dataUptodate = False

            # only load if that data exists and is still up to date

            if dataUptodate:
                if fileExists(dataFile):
                    with gzip.open(dataFile, mode="rb") as f:
                        data = pickle.load(f)

                    if data["caseSensitive"] != caseSensitive:
                        loaded = False
                    else:
                        for k in CLEAR_KEYS:
                            if k in sheetData:
                                sheetData[k].clear()

                        for k in SHEET_KEYS:
                            v = data.get(k, None)

                            if v is None:
                                if k in sheetData:
                                    if type(sheetData[k]) in {list, dict, set}:
                                        sheetData[k].clear()
                                    else:
                                        sheetData[k] = None
                            else:
                                sheetData[k] = v

                        sheetData[timeKey] = tm
                        loaded = True
                else:
                    loaded = False
            else:
                loaded = False

            if loaded:
                # we have loaded valid, up-to-date, previously compiled spreadsheet data
                if not forceSilent:
                    self.console("SHEET data: loaded from disk")
            else:
                # now we really heave to read and compile the spreadsheet
                if not forceSilent:
                    self.console(
                        "SHEET data: computing from scratch ...", newline=False
                    )
                sheetData.logData = []
                sheetData.caseSensitive = caseSensitive

                self._readSheet()
                tm = time.time()
                sheetData[timeKey] = tm

                self._compileSheet()
                self._prepareSheet()
                self._processSheet()

                with gzip.open(dataFile, mode="wb", compresslevel=GZIP_LEVEL) as f:
                    f.write(
                        pickle.dumps(
                            {k: sheetData[k] for k in SHEET_KEYS},
                            protocol=PICKLE_PROTOCOL,
                        )
                    )

                with fileOpen(timeFile, "w") as fh:
                    fh.write(f"{tm}\n")

                if not forceSilent:
                    self.console("done")
                showLog = False
        else:
            # the compiled spreadsheet data we have in memory is still up to date
            if not forceSilent:
                self.console("SHEET data: already in memory and uptodate")

        if showLog and not browse and not forceSilent:
            for x in sheetData.logData:
                self.consoleLine(*x)

        self.forceSilent = False

    def makeSheetOfSingleTokens(self):
        """Make a derived sheet based on the individual tokens in the triggers.

        The current sheet will be used to make a new sheet with a row for every
        token in every trigger on the original sheet.
        In this way all tokens in triggers will be searched individually.
        Since tokens do not overlap, the tokens as triggers do not interfere with
        each other.
        This can be a convenient debugging tool for the entity spreadsheet in the
        case of triggers without hits.

        Note however, that there are also other functions that help with debugging
        the spreadsheet:

        *   `tf.ner.triggers.Triggers.triggerInterference`
        *   `tf.ner.triggers.Triggers.reportHits`
        """
        Workbook = self.Workbook
        sheetName = self.sheetName
        sheetDir = self.sheetDir
        sheetPath = f"{sheetDir}/{sheetName}-single.xlsx"
        sheetData = self.getSheetData()
        raw = sheetData.raw

        # raw.setdefault(eidkind, {})[normScopeStr] = (r + 1, triggers)
        words = {}

        for eidkind, eData in raw.items():
            for scopeStr, (r, triggers) in eData.items():
                for trigger in triggers:
                    for word in trigger.split():
                        if word.isalpha():
                            words.setdefault(word, {}).setdefault(scopeStr, set()).add(
                                r
                            )

        wb = Workbook()
        ws = wb.active
        ws.append(("comment", "name", "kind", "scope", "triggers", "origrow"))
        ws.append(("", "", "", "", "", ""))

        eids = {}

        for word in sorted(words):
            wordData = words[word]
            eid = toSmallId(word).replace(".", " ")
            n = eids.get(eid, 0)

            eidDis = eid if n == 0 else f"{eid}({n})"

            n += 1
            eids[eid] = n
            rows = set()

            for scopeStr in sorted(wordData):
                rows |= set(wordData[scopeStr])

            rowRep = ",".join(str(r) for r in sorted(rows))
            ws.append(("", eidDis, "x", "", word, rowRep))

        wb.save(sheetPath)

    def writeSheetData(self, rows, asFile=None):
        """Write a spreadsheet.

        When the data for a spreadsheet has been gathered, you can write it to Excel
        by means of this function.

        Parameters
        ----------
        rows: iterable of iterables
            The rows, each row an interable of fields
        asFile: string, optional None
            The path of the file to write the Excel data to.
        """
        if not asFile:
            console("Pass the path of a destination file in param asFile")
            return

        Workbook = self.Workbook
        trigI = self.trigI
        commentI = self.commentI
        wb = Workbook()
        ws = wb.active

        for r, row in enumerate(rows):
            if r <= 1:
                ws.append(row)
                continue

            isCommentRow = row[commentI].startswith("#")

            if not isCommentRow:
                row[trigI] = "; ".join(row[trigI])

            ws.append(row)

        wb.save(asFile)

    def readSheetData(self):
        """Read the data of the spreadsheet and deliver it as a list of field lists.

        Concerning the Excel sheets in `ner/sheets`:

        *   they will be read by `tf.ner.ner.NER.setSheet()`;
        *   you might need to `pip install openpyxl` first;

        Each row in the spreadsheet specifies a set of triggers for a named entity
        in a certain scope. The row may have additional metadata, which will be linked
        to the named entity.
        If you need several scopes for the same named entity, make sure the metadata
        in those rows is identical. Otherwise it is unpredictable from which row the
        metadata will be taken.

        The first row is a header row, but the names are only important for the columns
        marked as metadata columns.

        The second row will be ignored, you can use it as an explanation or subtitle
        of the column titles in the header row.

        Here is a specification of the columns. We only specify the first so many
        columns. The remaining columns are treated as metadata.

        1.  **comment**. If this cell starts with a `#` the whole row will be
            skipped and ignored. Otherwise, the value is taken as a comment, that you
            are free to fill in or leave empty.

        1.  **name**. The full name of the entity, as it occurs in articles and
            reference works.

        1.  **kind**. A label that indicates what type of entity it is, e.g. `PER` for
            person and `LOC` for location and `ORG` for organization. But you are free
            to chose the labels.

        1.  **scope**. A string that indicates portions in the corpus where the
            triggers for this entity are valid. The scope specifies zero or more
            intervals of sections in the corpus, where an empty scope denotes the
            whole corpus. See `tf.ner.scopes.Scopes.parseScope()`
            for the syntax of scope specifiers.

        1.  **triggers**. A list of triggers, i.e. textual strings that occur in the
            corpus and trigger the detection of the entity named in the **name**
            column. The individual triggers must be separated by a `;`. Triggers may
            contain multiple words, and even `,`s. It is recommended not to use
            newlines in the trigger cells. When the triggers are read, white-space
            will be trimmed, and some character replacements will take place, which
            are dependent on the corpus. Think of replacing various kinds of quotes
            by ASCII quotes.

        The remaining columns are metadata columns. This metadata can be retrieved
        by means of the function `tf.ner.ner.NER.getMeta()`

        A good, real-world example of such a spreadsheet is *people.xlsx* in the
        [Suriano corpus](https://gitlab.huc.knaw.nl/suriano/letters/-/tree/main/ner/specs?ref_type=heads)

        Returns
        -------
        list of list
            Rows which are lists of fields.
        """
        sheetName = self.sheetName
        sheetDir = self.sheetDir
        loadXls = self.loadXls
        trigI = self.trigI
        commentI = self.commentI
        normalizeChars = self.normalizeChars
        spaceEscaped = self.spaceEscaped

        sheetData = self.getSheetData()
        caseSensitive = sheetData.caseSensitive

        def myNormalize(x):
            return normalize(
                str(x) if normalizeChars is None else normalizeChars(str(x))
            )

        sheetPath = f"{sheetDir}/{sheetName}.xlsx"

        wb = loadXls(sheetPath, data_only=True)
        ws = wb.active

        result = []

        for r, row in enumerate(ws.rows):
            fields = []
            result.append(fields)

            isCommentRow = (row[commentI].value or "").startswith("#")

            for i, field in enumerate(row):
                value = field.value or ""

                if not isCommentRow and r > 1 and i == trigI:
                    value = myNormalize(value)
                    value = {
                        y
                        for x in value.split(";")
                        if (
                            y := tnorm(
                                x,
                                spaceEscaped=spaceEscaped,
                                caseSensitive=caseSensitive,
                            )
                        )
                        != ""
                    }

                fields.append(value)

        return result

    def getMeta(self):
        """Retrieves the metadata of the current sheet.

        The metadata of each entity is stored in the extra fields in its row.
        The writer of the NER sheet is free to chose additional rows.

        Returns
        -------
        tuple
            The first member is a list of the names of the metadata columns,
            taken from the first row of the spreadsheet.

            The second member is a dict with the metadata itself.
            The keys are strings of the form *entity identifier*`-`*entity kind*.
            The values are tuples, where the i-th member is the value for the i-th
            name in the list of metadata fields.
        """
        sheetData = self.getSheetData()
        return (sheetData.metaFields, sheetData.metaData)

    def _readSheet(self):
        """Read all the spreadsheets, the main one and the tweaks.

        Several checks on the sanity of the data will be performed.

        Store the results in a hierarchy that mimicks the way they are organized in the
        file system.
        """
        forceSilent = self.forceSilent
        sheetName = self.sheetName
        sheetDir = self.sheetDir
        loadXls = self.loadXls
        defaultKind = self.defaultKind
        transform = self.transform
        spaceEscaped = self.spaceEscaped
        normalizeChars = self.normalizeChars
        trigI = self.trigI
        scopeI = self.scopeI

        def spec(msg):
            if forceSilent:
                return
            self.log(None, 0, msg)

        def log(msg):
            if forceSilent:
                return
            self.log(False, 0, msg)

        def err(msg):
            if forceSilent:
                return
            self.log(True, 0, msg)

        def err1(msg):
            if forceSilent:
                return
            self.log(True, 1, msg)

        sheetData = self.getSheetData()
        caseSensitive = sheetData.caseSensitive

        nameMap = {}
        sheetData.nameMap = nameMap
        """Will contain a mapping from entities to names.

        The entities are keyed by their (eid, kind) tuple.
        The values are names plus the sheet where they are first defined.
        """

        metaData = {}
        sheetData.metaData = metaData
        """Will contain the data of most columns in the spreadsheet.

        The scope and triggers columns are not stored.

        For each row, a key is computed (the small eid based on the name in the first
        column). Under this key we store the list of values.
        """

        rowMap = {}
        sheetData.rowMap = rowMap
        """Will contain a mapping from triggers to row.

        For each trigger we store the row in the NER sheet where that trigger occurs.
        """

        for k in CLEAR_KEYS:
            if k in sheetData:
                sheetData[k].clear()

        for k in SHEET_KEYS:
            if k in sheetData:
                if type(sheetData[k]) in {list, dict, set}:
                    sheetData[k].clear()

        spec("Reading sheets")

        scopeMap = {}
        sheetData.scopeMap = scopeMap
        """Will contain a mapping from scope specifiers to logical scopes.

        Scope specifiers are strings, logical scopes are the data structures
        that you get when you parse those strings.
        """

        sheetPath = f"{sheetDir}/{sheetName}.xlsx"

        wb = loadXls(sheetPath, data_only=True)
        ws = wb.active
        maxCol = ws.max_column
        maxRow = ws.max_row

        if not forceSilent:
            self.console(f"Sheet with {maxRow} rows and {maxCol} columns")

        raw = {}
        sheetData.raw = raw

        multiNames = {}
        noNames = set()
        noTrigs = set()
        emptyLines = set()
        scopeMistakes = {}

        def myNormalize(x):
            """Normalization function that performs additional replacements.

            The replacements are coded in the function `normalizeChars()`,
            which can be passed to the instance of this class.
            """
            return normalize(
                str(x) if normalizeChars is None else normalizeChars(str(x))
            )

        for r, row in enumerate(ws.rows):
            if r == 0:
                metaFields = [
                    myNormalize(row[i].value or "")
                    for i in range(maxCol)
                    if i not in {0, scopeI, trigI}
                ]
                sheetData.metaFields = metaFields

                continue
            if r == 1:
                continue
            if not any(c.value for c in row):
                continue

            triggerStr = row[trigI].value

            if triggerStr is not None and "\n" in triggerStr:
                triggerRep = triggerStr.replace("\n", "\\n")
                msg = f"r{r + 1:>3}: newline in trigger string: {triggerRep}"
                log(msg)
                continue

            (comment, name, kind, scopeStr, triggerStr) = (
                myNormalize(row[i].value or "") for i in range(trigI + 1)
            )

            if comment.startswith("#"):
                continue

            if not name or not triggerStr:
                if name:
                    noTrigs.add(r + 1)
                elif triggerStr:
                    noNames.add(r + 1)
                else:
                    emptyLines.add(r + 1)
                continue

            triggers = {
                y
                for x in triggerStr.split(";")
                if (
                    y := tnorm(
                        x, spaceEscaped=spaceEscaped, caseSensitive=caseSensitive
                    )
                )
                != ""
            }

            for trigger in triggers:
                rowMap.setdefault(trigger, []).append(r + 1)

            if len(triggers) == 0:
                noTrigs.add(r + 1)
                continue

            if not kind:
                kind = defaultKind
                msg = f"r{r + 1:>3}: " f"no kind name, supplied {defaultKind}"
                log(msg)

            info = self.parseScope(scopeStr, plain=False)
            warnings = info["warning"]

            if len(warnings):
                scopeMistakes[r + 1] = "; ".join(warnings)
                continue

            normScopeStr = info["normal"]

            if normScopeStr != "":
                scopes = info["result"]

                if normScopeStr not in scopeMap:
                    scopeMap[normScopeStr] = scopes

            eid = toSmallId(name, transform=transform)
            eidkind = (eid, kind)

            if normScopeStr in raw.get(eidkind, {}):
                ar = raw[eidkind][normScopeStr][0]
                multiNames[r + 1] = f"({normScopeStr}) {nameMap[eidkind]} also in r{ar}"
                continue
            else:
                nameMap[eidkind] = name

            raw.setdefault(eidkind, {})[normScopeStr] = (r + 1, triggers)

            metaKey = f"{eid}-{kind}"
            metaData[metaKey] = [
                myNormalize(row[i].value or "")
                for i in range(maxCol)
                if i not in {0, scopeI, trigI}
            ]

        if not forceSilent:
            for diags, isdict, label in (
                (emptyLines, False, "without a name and triggers"),
                (multiNames, True, "with a duplicate name"),
                (noNames, False, "without a name"),
                (scopeMistakes, True, "with scope mistakes"),
                (noTrigs, False, "without triggers"),
            ):
                n = len(diags)

                if n > 0:
                    if isdict:
                        if n > 0:
                            plural = "" if n == 1 else "s"
                            err(f"{n} row{plural} {label}:")

                            for r in sorted(diags):
                                msg = diags[r]
                                err1(f"r{r:>3}: {msg}")
                    else:
                        rep = ", ".join(str(x) for x in sorted(diags)[0:10])
                        plural = "" if n == 1 else "s"
                        err(f"{n} row{plural} {label}:\n\te.g.: {rep}")

    def _compileSheet(self):
        """Compiles the info in tweaked sheets according to the scopes in it.

        On the basis of the scopes that are given for the triggers, we partition
        the corpus in maximal intervals in which no trigger goes into or out of scope.
        During these intervals we have a fixed set of triggers that must be looked up,
        and we can check for the consistency of these triggers.

        In every scope we may find two kinds of problems with triggers and scopes:

        *   ambiguous trigger: a trigger is used by multiple entities in that scope.
        *   clashing triggers: a trigger occurs in more than one row for a specific
            entity and a specific scope.

        These problems are detected and reported.
        """
        forceSilent = self.forceSilent
        sheetData = self.getSheetData()
        compiled = AttrDict()
        sheetData.compiled = compiled
        nameMap = sheetData.nameMap

        def spec(msg, cache=None):
            if forceSilent:
                return
            self.log(None, 0, msg, cache=cache)

        def log(msg, cache=None):
            if forceSilent:
                return
            self.log(False, 0, msg, cache=cache)

        def err(msg, cache=None):
            if forceSilent:
                return
            self.log(True, 0, msg, cache=cache)

        def errProblem(problem, cache):
            if forceSilent:
                return
            for indent, msg in problem:
                self.log(True, indent, msg, cache=cache)

        tFullMap = {}

        def getTriggerRep(tr):
            triggerInfo = tFullMap[tr]

            result = []

            for scope in sorted(triggerInfo):
                scopeRep = "" if not scope else f"({scope})"
                scopeInfo = triggerInfo[scope]

                for eidkind in sorted(scopeInfo):
                    name = sheetData.nameMap[eidkind]
                    rowRep = ",".join(str(r) for r in sorted(scopeInfo[eidkind]))
                    result.append(f"'{tr}'{scopeRep} r{rowRep} for {name}")
            return result

        spec("Checking scopes")

        raw = sheetData.raw
        intvMap = sheetData.scopeMap
        intervals = partitionScopes(intvMap)

        problems = set()

        for b, e, scopeStrs in [[None, None, None]] + intervals:
            scopeStrSet = {""} if scopeStrs is None else set(scopeStrs)
            intv = (b, e) if b is not None and e is not None else ()

            headCache = []
            problemCache = []

            msg = self.repInterval(intv)
            spec(msg, cache=headCache)

            thisCompiled = AttrDict()
            compiled[intv] = thisCompiled

            # check which triggers are a substring in which other triggers
            # these triggers can never be found!

            tFullMap.clear()

            for eidkind, info in raw.items():
                if "" in info:
                    (r, triggers) = info[""]

                    for trigger in triggers:
                        tFullMap.setdefault(trigger, {}).setdefault("", {}).setdefault(
                            eidkind, set()
                        ).add(r)

                for scopeStr, (r, triggers) in info.items():
                    if scopeStr == "":
                        continue

                    if scopeStr in scopeStrSet:
                        for trigger in triggers:
                            tFullMap.setdefault(trigger, {}).setdefault(
                                scopeStr, {}
                            ).setdefault(eidkind, set()).add(r)

            # now compile the result information: tMap, newSheet

            newSheet = {}
            thisCompiled.sheet = newSheet

            tMap = {}
            thisCompiled.tMap = tMap

            for eidkind, info in raw.items():
                if "" in info:
                    (r, triggers) = info[""]
                    newSheet[eidkind] = triggers

                    for trigger in triggers:
                        tMap[trigger] = ""

                for scopeStr, (r, triggers) in info.items():
                    if scopeStr == "":
                        continue

                    if scopeStr in scopeStrSet:
                        newSheet[eidkind] = triggers

                        for trigger in triggers:
                            tMap[trigger] = scopeStr

            # the sheet is now compiled, all compiled data has been created
            # the rest only deals with error reporting

            ambi = 0
            clashes = 0

            for trigger in sorted(tFullMap):
                problem = []
                triggerInfo = tFullMap[trigger]

                for scopeStr in sorted(triggerInfo):
                    scopeInfo = triggerInfo[scopeStr]
                    scopeRep = f" in scope {scopeStr}" if scopeStr else ""

                    thisAmbi = False
                    theseClashes = 0

                    if len(scopeInfo) > 1:
                        problem.append((0, f"Ambi: '{trigger}'{scopeRep}: "))
                        thisAmbi = True

                    for eidkind, rs in scopeInfo.items():
                        name = nameMap[eidkind]
                        rowRep = ",".join(str(r) for r in rs)
                        nRs = len(rs)

                        if nRs > 1:
                            theseClashes += 1

                        if thisAmbi:
                            problem.append((1, f"{name}: {rowRep}"))
                        elif nRs > 1:
                            problem.append(
                                (0, f"Clash: '{trigger}' for {name}: {rowRep}")
                            )

                    clashes += theseClashes

                    if len(problem):
                        problem = tuple(problem)

                        if problem not in problems:
                            if thisAmbi:
                                ambi += 1
                            if theseClashes:
                                clashes += theseClashes

                            problems.add(problem)
                            errProblem(problem, problemCache)

            if ambi > 0:
                err(f"Ambiguous triggers: {ambi} x", cache=problemCache)

            if clashes > 0:
                err(f"Reused triggers scope: {clashes} x", cache=problemCache)

            if len(problemCache) > 0:
                self.flush(headCache)
                self.flush(problemCache)

    def _prepareSheet(self):
        """Transform the sheets into instructions.

        Now we have intervals with fixed sets of triggers,
        we can generate instructions out of the sheets: info that the search
        algorithm needs to do its work.

        This info is such that it supports simultaneous searching for multiple triggers.
        """
        spaceEscaped = self.spaceEscaped

        sheetData = self.getSheetData()
        compiled = sheetData.compiled
        nameMap = sheetData.nameMap
        caseSensitive = sheetData.caseSensitive

        instructions = {}
        sheetData.instructions = instructions

        triggerScopes = {}
        sheetData.triggerScopes = triggerScopes

        allTriggers = set()
        sheetData.allTriggers = allTriggers

        for intv, thisCompiled in compiled.items():
            sheet = thisCompiled.sheet
            tMap = thisCompiled.tMap

            triggerSet = set()
            tPos = {}
            idMap = {}

            theseInstructions = dict(tPos=tPos, tMap=tMap)
            instructions[intv] = theseInstructions

            for eidkind, triggers in sheet.items():
                for trigger in triggers:
                    triggerT = toTokens(
                        trigger, spaceEscaped=spaceEscaped, caseSensitive=caseSensitive
                    )
                    triggerSet.add(triggerT)
                    idMap.setdefault(trigger, []).append(eidkind)

            for triggerT in triggerSet:
                for i, token in enumerate(triggerT):
                    tPos.setdefault(i, {}).setdefault(token, set()).add(triggerT)

            theseInstructions["idMap"] = {
                trigger: eidkinds[0] for (trigger, eidkinds) in idMap.items()
            }

            tMap = theseInstructions["tMap"]
            idMap = theseInstructions["idMap"]

            for trigger, scope in tMap.items():
                eidkind = idMap.get(trigger, None)

                if eidkind is None:
                    continue

                name = nameMap[eidkind]
                allTriggers.add((name, eidkind, trigger, scope))
                triggerScopes.setdefault(trigger, set()).add(scope)

    def _processSheet(self):
        """Carries out the search instructions that have been compiled from the sheet.

        We look up the occurrences, organize the hits of the triggers, and store them
        as entities in the current set.
        """
        if not self.properlySetup:
            return

        forceSilent = self.forceSilent
        silent = self.silent
        browse = self.browse

        if not browse and not silent and not forceSilent:
            app = self.app
            app.indent(reset=True)
            app.info("Looking up occurrences of many candidates ...")

        self.findOccs()
        self._collectHits()

        if not browse and not silent and not forceSilent:
            app.info("done")

        self._markEntities()

    def _collectHits(self):
        """Stores the trigger hits.

        The hits will be stored under the key `hitData` in the sheet data.
        """
        if not self.properlySetup:
            return

        forceSilent = self.forceSilent
        sheetData = self.getSheetData()
        inventory = sheetData.inventory
        instructions = sheetData.instructions

        def log(msg):
            if forceSilent:
                return
            self.log(False, 0, msg)

        hitData = {}
        sheetData.hitData = hitData

        for path, data in instructions.items():
            idMap = data["idMap"]
            tMap = data["tMap"]

            for trigger, scope in tMap.items():
                eidkind = idMap.get(trigger, None)

                if eidkind is None:
                    continue

                occs = inventory.get(eidkind, {}).get(trigger, {}).get(scope, [])
                hitData.setdefault(eidkind, {})[(trigger, scope)] = len(occs)

    def _markEntities(self):
        """Marks up the members of the inventory as entities.

        The instructions contain the entity identifier and the entity kind that
        have to be assigned to the surface forms.

        The inventory knows where the occurrences of the surface forms are.
        All these occurrences will be added to the current entity set.
        """
        if not self.properlySetup:
            return

        sheetData = self.getSheetData()
        inventory = sheetData.inventory

        newEntities = []
        triggerFromMatch = {}
        sheetData.triggerFromMatch = triggerFromMatch

        for eidkind, entData in inventory.items():
            for trigger, triggerData in entData.items():
                for scope, matches in triggerData.items():
                    for match in matches:
                        triggerFromMatch[match] = (trigger, scope)

                    newEntities.append((eidkind, matches))

        self._addToSet(newEntities)

    def getSheetData(self):
        """Deliver the data of current sheet.

        Sheet data is `tf.core.generic.AttrDict` with various kinds of data under
        various keys, which are listed in `tf.ner.sheets.SHEET_KEYS`.

        Returns
        -------
        dict
        """
        sheetName = self.sheetName
        sheetsData = self.sheets

        return sheetsData.setdefault(sheetName, AttrDict())

    def log(self, isError, indent, msg, cache=None):
        """Issue a message to the user.

        Depending on the `silent` member of the instance and on whether the message
        is an error message, it can be inhibited.

        Parameters
        ----------
        isError: boolean
            Whether it is an error message or a normal message
        indent: integer
            How far (in tabs) the message should be indented
        msg: string
            The actual message
        cache: list, optional None
            If it is a list, the output will not be sent to the console,
            but appended to the cache.
            You can save it for later.
        """
        silent = self.silent

        if silent and not isError:
            return

        browse = self.browse
        sheetData = self.getSheetData()
        logData = sheetData.logData

        if cache is None:
            logData.append((isError, indent, msg))

        if not browse:
            if cache is None:
                self.consoleLine(isError, indent, msg)
            else:
                cache.append((isError, indent, msg))

    def flush(self, cache):
        """Flushes the cache of log messages.

        Parameters
        ----------
        cache: list
            The items in the cache
        """
        browse = self.browse
        sheetData = self.getSheetData()
        logData = sheetData.logData

        if not browse:
            for item in cache:
                logData.append(item)
                self.consoleLine(*item)

Functions that do scope handling.

These functions will be added as methods to the class that inherits this class.

Handling of NER spreadsheets.

A NER spreadsheet contains entity information; in particular it links named entities to triggers by which they can be found in the corpus.

See readSheetData() for the description of the shape of the spreadsheet, which is expected to be an Excel sheet.

Parameters

sheets : dict, optional None: Sheet data to start with. Relevant for when the TF browser uses this class. See NER

Ancestors

Subclasses

Instance variables

var sheetName: The current NER sheet.
var sheetNames: The set of names of NER sheets that are present on the file system.

Methods

def flush(self, cache)

Expand source code Browse git

def flush(self, cache):
    """Flushes the cache of log messages.

    Parameters
    ----------
    cache: list
        The items in the cache
    """
    browse = self.browse
    sheetData = self.getSheetData()
    logData = sheetData.logData

    if not browse:
        for item in cache:
            logData.append(item)
            self.consoleLine(*item)

Flushes the cache of log messages.

Parameters

cache : list: The items in the cache

def getMeta(self)

Expand source code Browse git

def getMeta(self):
    """Retrieves the metadata of the current sheet.

    The metadata of each entity is stored in the extra fields in its row.
    The writer of the NER sheet is free to chose additional rows.

    Returns
    -------
    tuple
        The first member is a list of the names of the metadata columns,
        taken from the first row of the spreadsheet.

        The second member is a dict with the metadata itself.
        The keys are strings of the form *entity identifier*`-`*entity kind*.
        The values are tuples, where the i-th member is the value for the i-th
        name in the list of metadata fields.
    """
    sheetData = self.getSheetData()
    return (sheetData.metaFields, sheetData.metaData)

Retrieves the metadata of the current sheet.

The metadata of each entity is stored in the extra fields in its row. The writer of the NER sheet is free to chose additional rows.

Returns

tuple

The first member is a list of the names of the metadata columns, taken from the first row of the spreadsheet.

The second member is a dict with the metadata itself. The keys are strings of the form entity identifier-entity kind. The values are tuples, where the i-th member is the value for the i-th name in the list of metadata fields.

def getSheetData(self)

Expand source code Browse git

def getSheetData(self):
    """Deliver the data of current sheet.

    Sheet data is `tf.core.generic.AttrDict` with various kinds of data under
    various keys, which are listed in `tf.ner.sheets.SHEET_KEYS`.

    Returns
    -------
    dict
    """
    sheetName = self.sheetName
    sheetsData = self.sheets

    return sheetsData.setdefault(sheetName, AttrDict())

Deliver the data of current sheet.

Sheet data is AttrDict with various kinds of data under various keys, which are listed in SHEET_KEYS.

Returns

dict

def getToTokensFunc(self)

Expand source code Browse git

def getToTokensFunc(self):
    """Make a tokenize function.

    Produce a function that can tokenize strings in the same way as the corpus
    has been tokenized.

    Returns
    -------
    function
        This function takes a string and returns an iterable of tokens.
        The tokenization is aware of whether the current sheet works
        case-sensitively, and whether spaces have been escaped as underscores.
    """
    settings = self.settings
    spaceEscaped = settings.spaceEscaped
    sheetData = self.getSheetData()
    caseSensitive = sheetData.caseSensitive

    def myToTokens(trigger):
        return toTokens(
            trigger, spaceEscaped=spaceEscaped, caseSensitive=caseSensitive
        )

    return myToTokens

Make a tokenize function.

Produce a function that can tokenize strings in the same way as the corpus has been tokenized.

Returns

function: This function takes a string and returns an iterable of tokens. The tokenization is aware of whether the current sheet works case-sensitively, and whether spaces have been escaped as underscores.

def loadSheetData(self, force=False, caseSensitive=False)

Expand source code Browse git

def loadSheetData(self, force=False, caseSensitive=False):
    """Loads the current ner sheet into memory, if there is one.

    If the current ner sheet is None, nothing has to be done.

    Otherwise, we read the corresponding excel sheet(s) from disk and compile them
    into instructions, if needed.

    When loading a sheet, it will be checked first whether its data is already
    in memory and uptotdate, in that case nothing will be done.

    Otherwise, it will be checked wether an uptodate version of its data exists
    on disk. If so, it will be loaded.

    Otherwise, or if `force=True` is passed, the data will be computed from
    scratch and saved to disk, with a time stamp.

    Parameters
    ----------
    force: boolean, optional False
        If True, do not load from cached data, but do all computations afresh.
    caseSensitive: boolean, optional False
        Whether to work with the spreadsheet in a case-sensitive way.
    """
    if not self.properlySetup:
        return

    sheetName = self.sheetName
    setName = self.setName
    annoDir = self.annoDir
    setDir = f"{annoDir}/{setName}"
    dataFile = f"{setDir}/data.gz"
    timeFile = f"{setDir}/time.txt"
    timeKey = "time"

    if sheetName is None:
        return None

    forceSilent = self.forceSilent
    sheets = self.sheets

    if sheetName not in sheets:
        sheets[sheetName] = AttrDict()

    browse = self.browse
    sheetData = self.getSheetData()
    sheetDir = self.sheetDir
    sheetFile = f"{sheetDir}/{sheetName}.xlsx"
    sheetExists = fileExists(sheetFile)
    sheetUpdated = mTime(sheetFile) if sheetExists else 0
    needReload = (
        force
        or sheetData.caseSensitive != caseSensitive
        or sheetData.get(timeKey, 0) < sheetUpdated
    )

    showLog = True

    if needReload:
        # try to load from zipped data file first
        tm = 0
        dataUptodate = not force and fileExists(timeFile)

        if dataUptodate:
            with fileOpen(timeFile) as fh:
                info = fh.read().strip()
            try:
                tm = float(info)
                dataUptodate = sheetUpdated < tm
            except Exception:
                dataUptodate = False

        # only load if that data exists and is still up to date

        if dataUptodate:
            if fileExists(dataFile):
                with gzip.open(dataFile, mode="rb") as f:
                    data = pickle.load(f)

                if data["caseSensitive"] != caseSensitive:
                    loaded = False
                else:
                    for k in CLEAR_KEYS:
                        if k in sheetData:
                            sheetData[k].clear()

                    for k in SHEET_KEYS:
                        v = data.get(k, None)

                        if v is None:
                            if k in sheetData:
                                if type(sheetData[k]) in {list, dict, set}:
                                    sheetData[k].clear()
                                else:
                                    sheetData[k] = None
                        else:
                            sheetData[k] = v

                    sheetData[timeKey] = tm
                    loaded = True
            else:
                loaded = False
        else:
            loaded = False

        if loaded:
            # we have loaded valid, up-to-date, previously compiled spreadsheet data
            if not forceSilent:
                self.console("SHEET data: loaded from disk")
        else:
            # now we really heave to read and compile the spreadsheet
            if not forceSilent:
                self.console(
                    "SHEET data: computing from scratch ...", newline=False
                )
            sheetData.logData = []
            sheetData.caseSensitive = caseSensitive

            self._readSheet()
            tm = time.time()
            sheetData[timeKey] = tm

            self._compileSheet()
            self._prepareSheet()
            self._processSheet()

            with gzip.open(dataFile, mode="wb", compresslevel=GZIP_LEVEL) as f:
                f.write(
                    pickle.dumps(
                        {k: sheetData[k] for k in SHEET_KEYS},
                        protocol=PICKLE_PROTOCOL,
                    )
                )

            with fileOpen(timeFile, "w") as fh:
                fh.write(f"{tm}\n")

            if not forceSilent:
                self.console("done")
            showLog = False
    else:
        # the compiled spreadsheet data we have in memory is still up to date
        if not forceSilent:
            self.console("SHEET data: already in memory and uptodate")

    if showLog and not browse and not forceSilent:
        for x in sheetData.logData:
            self.consoleLine(*x)

    self.forceSilent = False

Loads the current ner sheet into memory, if there is one.

If the current ner sheet is None, nothing has to be done.

Otherwise, we read the corresponding excel sheet(s) from disk and compile them into instructions, if needed.

When loading a sheet, it will be checked first whether its data is already in memory and uptotdate, in that case nothing will be done.

Otherwise, it will be checked wether an uptodate version of its data exists on disk. If so, it will be loaded.

Otherwise, or if force=True is passed, the data will be computed from scratch and saved to disk, with a time stamp.

Parameters

force : boolean, optional False: If True, do not load from cached data, but do all computations afresh.
caseSensitive : boolean, optional False: Whether to work with the spreadsheet in a case-sensitive way.

def log(self, isError, indent, msg, cache=None)

Expand source code Browse git

def log(self, isError, indent, msg, cache=None):
    """Issue a message to the user.

    Depending on the `silent` member of the instance and on whether the message
    is an error message, it can be inhibited.

    Parameters
    ----------
    isError: boolean
        Whether it is an error message or a normal message
    indent: integer
        How far (in tabs) the message should be indented
    msg: string
        The actual message
    cache: list, optional None
        If it is a list, the output will not be sent to the console,
        but appended to the cache.
        You can save it for later.
    """
    silent = self.silent

    if silent and not isError:
        return

    browse = self.browse
    sheetData = self.getSheetData()
    logData = sheetData.logData

    if cache is None:
        logData.append((isError, indent, msg))

    if not browse:
        if cache is None:
            self.consoleLine(isError, indent, msg)
        else:
            cache.append((isError, indent, msg))

Issue a message to the user.

Depending on the silent member of the instance and on whether the message is an error message, it can be inhibited.

Parameters

isError : boolean: Whether it is an error message or a normal message
indent : integer: How far (in tabs) the message should be indented
msg : string: The actual message
cache : list, optional None: If it is a list, the output will not be sent to the console, but appended to the cache. You can save it for later.

def makeSheetOfSingleTokens(self)

Expand source code Browse git

def makeSheetOfSingleTokens(self):
    """Make a derived sheet based on the individual tokens in the triggers.

    The current sheet will be used to make a new sheet with a row for every
    token in every trigger on the original sheet.
    In this way all tokens in triggers will be searched individually.
    Since tokens do not overlap, the tokens as triggers do not interfere with
    each other.
    This can be a convenient debugging tool for the entity spreadsheet in the
    case of triggers without hits.

    Note however, that there are also other functions that help with debugging
    the spreadsheet:

    *   `tf.ner.triggers.Triggers.triggerInterference`
    *   `tf.ner.triggers.Triggers.reportHits`
    """
    Workbook = self.Workbook
    sheetName = self.sheetName
    sheetDir = self.sheetDir
    sheetPath = f"{sheetDir}/{sheetName}-single.xlsx"
    sheetData = self.getSheetData()
    raw = sheetData.raw

    # raw.setdefault(eidkind, {})[normScopeStr] = (r + 1, triggers)
    words = {}

    for eidkind, eData in raw.items():
        for scopeStr, (r, triggers) in eData.items():
            for trigger in triggers:
                for word in trigger.split():
                    if word.isalpha():
                        words.setdefault(word, {}).setdefault(scopeStr, set()).add(
                            r
                        )

    wb = Workbook()
    ws = wb.active
    ws.append(("comment", "name", "kind", "scope", "triggers", "origrow"))
    ws.append(("", "", "", "", "", ""))

    eids = {}

    for word in sorted(words):
        wordData = words[word]
        eid = toSmallId(word).replace(".", " ")
        n = eids.get(eid, 0)

        eidDis = eid if n == 0 else f"{eid}({n})"

        n += 1
        eids[eid] = n
        rows = set()

        for scopeStr in sorted(wordData):
            rows |= set(wordData[scopeStr])

        rowRep = ",".join(str(r) for r in sorted(rows))
        ws.append(("", eidDis, "x", "", word, rowRep))

    wb.save(sheetPath)

Make a derived sheet based on the individual tokens in the triggers.

The current sheet will be used to make a new sheet with a row for every token in every trigger on the original sheet. In this way all tokens in triggers will be searched individually. Since tokens do not overlap, the tokens as triggers do not interfere with each other. This can be a convenient debugging tool for the entity spreadsheet in the case of triggers without hits.

Note however, that there are also other functions that help with debugging the spreadsheet:

def readSheetData(self)

Expand source code Browse git

def readSheetData(self):
    """Read the data of the spreadsheet and deliver it as a list of field lists.

    Concerning the Excel sheets in `ner/sheets`:

    *   they will be read by `tf.ner.ner.NER.setSheet()`;
    *   you might need to `pip install openpyxl` first;

    Each row in the spreadsheet specifies a set of triggers for a named entity
    in a certain scope. The row may have additional metadata, which will be linked
    to the named entity.
    If you need several scopes for the same named entity, make sure the metadata
    in those rows is identical. Otherwise it is unpredictable from which row the
    metadata will be taken.

    The first row is a header row, but the names are only important for the columns
    marked as metadata columns.

    The second row will be ignored, you can use it as an explanation or subtitle
    of the column titles in the header row.

    Here is a specification of the columns. We only specify the first so many
    columns. The remaining columns are treated as metadata.

    1.  **comment**. If this cell starts with a `#` the whole row will be
        skipped and ignored. Otherwise, the value is taken as a comment, that you
        are free to fill in or leave empty.

    1.  **name**. The full name of the entity, as it occurs in articles and
        reference works.

    1.  **kind**. A label that indicates what type of entity it is, e.g. `PER` for
        person and `LOC` for location and `ORG` for organization. But you are free
        to chose the labels.

    1.  **scope**. A string that indicates portions in the corpus where the
        triggers for this entity are valid. The scope specifies zero or more
        intervals of sections in the corpus, where an empty scope denotes the
        whole corpus. See `tf.ner.scopes.Scopes.parseScope()`
        for the syntax of scope specifiers.

    1.  **triggers**. A list of triggers, i.e. textual strings that occur in the
        corpus and trigger the detection of the entity named in the **name**
        column. The individual triggers must be separated by a `;`. Triggers may
        contain multiple words, and even `,`s. It is recommended not to use
        newlines in the trigger cells. When the triggers are read, white-space
        will be trimmed, and some character replacements will take place, which
        are dependent on the corpus. Think of replacing various kinds of quotes
        by ASCII quotes.

    The remaining columns are metadata columns. This metadata can be retrieved
    by means of the function `tf.ner.ner.NER.getMeta()`

    A good, real-world example of such a spreadsheet is *people.xlsx* in the
    [Suriano corpus](https://gitlab.huc.knaw.nl/suriano/letters/-/tree/main/ner/specs?ref_type=heads)

    Returns
    -------
    list of list
        Rows which are lists of fields.
    """
    sheetName = self.sheetName
    sheetDir = self.sheetDir
    loadXls = self.loadXls
    trigI = self.trigI
    commentI = self.commentI
    normalizeChars = self.normalizeChars
    spaceEscaped = self.spaceEscaped

    sheetData = self.getSheetData()
    caseSensitive = sheetData.caseSensitive

    def myNormalize(x):
        return normalize(
            str(x) if normalizeChars is None else normalizeChars(str(x))
        )

    sheetPath = f"{sheetDir}/{sheetName}.xlsx"

    wb = loadXls(sheetPath, data_only=True)
    ws = wb.active

    result = []

    for r, row in enumerate(ws.rows):
        fields = []
        result.append(fields)

        isCommentRow = (row[commentI].value or "").startswith("#")

        for i, field in enumerate(row):
            value = field.value or ""

            if not isCommentRow and r > 1 and i == trigI:
                value = myNormalize(value)
                value = {
                    y
                    for x in value.split(";")
                    if (
                        y := tnorm(
                            x,
                            spaceEscaped=spaceEscaped,
                            caseSensitive=caseSensitive,
                        )
                    )
                    != ""
                }

            fields.append(value)

    return result

Read the data of the spreadsheet and deliver it as a list of field lists.

Concerning the Excel sheets in ner/sheets:

they will be read by Sheets.setSheet();
you might need to pip install openpyxl first;

Each row in the spreadsheet specifies a set of triggers for a named entity in a certain scope. The row may have additional metadata, which will be linked to the named entity. If you need several scopes for the same named entity, make sure the metadata in those rows is identical. Otherwise it is unpredictable from which row the metadata will be taken.

The first row is a header row, but the names are only important for the columns marked as metadata columns.

The second row will be ignored, you can use it as an explanation or subtitle of the column titles in the header row.

Here is a specification of the columns. We only specify the first so many columns. The remaining columns are treated as metadata.

comment. If this cell starts with a # the whole row will be skipped and ignored. Otherwise, the value is taken as a comment, that you are free to fill in or leave empty.
name. The full name of the entity, as it occurs in articles and reference works.
kind. A label that indicates what type of entity it is, e.g. PER for person and LOC for location and ORG for organization. But you are free to chose the labels.
scope. A string that indicates portions in the corpus where the triggers for this entity are valid. The scope specifies zero or more intervals of sections in the corpus, where an empty scope denotes the whole corpus. See Scopes.parseScope() for the syntax of scope specifiers.
triggers. A list of triggers, i.e. textual strings that occur in the corpus and trigger the detection of the entity named in the name column. The individual triggers must be separated by a ;. Triggers may contain multiple words, and even ,s. It is recommended not to use newlines in the trigger cells. When the triggers are read, white-space will be trimmed, and some character replacements will take place, which are dependent on the corpus. Think of replacing various kinds of quotes by ASCII quotes.

The remaining columns are metadata columns. This metadata can be retrieved by means of the function Sheets.getMeta()

A good, real-world example of such a spreadsheet is people.xlsx in the Suriano corpus

Returns

list of list: Rows which are lists of fields.

def readSheets(self)

Expand source code Browse git

def readSheets(self):
    """Read the list current ner sheets (again).

    Use this when you change ner sheets outside the NER browser, e.g.
    by editing the spreadsheets in Excel.
    """
    sheetDir = self.sheetDir
    setNames = self.setNames

    sheetNames = set()
    self.sheetNames = sheetNames

    for sheetFile in dirContents(sheetDir)[0]:
        if extNm(sheetFile) == "xlsx":
            sheetName = sheetFile.removesuffix(".xlsx")
            sheetNames.add(sheetName)
            setNames.add(f".{sheetName}")

Read the list current ner sheets (again).

Use this when you change ner sheets outside the NER browser, e.g. by editing the spreadsheets in Excel.

def setSheet(self, newSheet, force=False, caseSensitive=None, forceSilent=False)

Expand source code Browse git

def setSheet(self, newSheet, force=False, caseSensitive=None, forceSilent=False):
    """Switch to a named ner sheet.

    After the switch, the new sheet will be loaded into memory, processed, and
    executed.

    Parameters
    ----------
    newSheet: string
        The name of the new ner sheet to switch to.
    force: boolean, optional False
        If True, do not load from cached data, but do all computations afresh.
    caseSensitive: boolean, optional None
        Whether to work with the spreadsheet in a case-sensitive way.
        If `None`, the value is taken from the current instance of this class.
    forceSilent: boolean, optional False
        If True, all output is suppressed, irrespective of the `silent` member
        in the instance. Only show stopping error messages are let through.

    """
    if not self.properlySetup:
        return

    browse = self.browse

    sheetName = self.sheetName

    sheetNames = self.sheetNames
    sheetDir = self.sheetDir
    sheetFile = f"{sheetDir}/{newSheet}.xlsx"

    if newSheet not in sheetNames and not fileExists(sheetFile):
        if not browse:
            console(f"NER sheet {newSheet} ({sheetFile}) does not exist")
        if newSheet is not None:
            return

    # if not browse:
    #    self.setSet("" if newSheet is None else f".{newSheet}")

    if newSheet != sheetName:
        sheetName = newSheet
        self.sheetName = sheetName

    if caseSensitive is None:
        caseSensitive = self.caseSensitive

    self.forceSilent = True
    self.loadSheetData(force=force, caseSensitive=caseSensitive)

Switch to a named ner sheet.

After the switch, the new sheet will be loaded into memory, processed, and executed.

Parameters

newSheet : string: The name of the new ner sheet to switch to.
force : boolean, optional False: If True, do not load from cached data, but do all computations afresh.
caseSensitive : boolean, optional None: Whether to work with the spreadsheet in a case-sensitive way. If None, the value is taken from the current instance of this class.
forceSilent : boolean, optional False: If True, all output is suppressed, irrespective of the silent member in the instance. Only show stopping error messages are let through.

def writeSheetData(self, rows, asFile=None)

Expand source code Browse git

def writeSheetData(self, rows, asFile=None):
    """Write a spreadsheet.

    When the data for a spreadsheet has been gathered, you can write it to Excel
    by means of this function.

    Parameters
    ----------
    rows: iterable of iterables
        The rows, each row an interable of fields
    asFile: string, optional None
        The path of the file to write the Excel data to.
    """
    if not asFile:
        console("Pass the path of a destination file in param asFile")
        return

    Workbook = self.Workbook
    trigI = self.trigI
    commentI = self.commentI
    wb = Workbook()
    ws = wb.active

    for r, row in enumerate(rows):
        if r <= 1:
            ws.append(row)
            continue

        isCommentRow = row[commentI].startswith("#")

        if not isCommentRow:
            row[trigI] = "; ".join(row[trigI])

        ws.append(row)

    wb.save(asFile)

Write a spreadsheet.

When the data for a spreadsheet has been gathered, you can write it to Excel by means of this function.

Parameters

rows : iterable of iterables: The rows, each row an interable of fields
asFile : string, optional None: The path of the file to write the Excel data to.

Inherited members

Scopes:
- intersectScopes
- parseInterval
- parseLoc
- parseScope
- repInterval
- repLoc
- repScope
- testPartitioning
Triggers:
- diagnoseTrigger
- diagnoseTriggers
- findOccs
- findTrigger
- findTriggers
- interference
- partitionTriggers
- reportHits
- triggerInterference