Module tf.convert.mql

MQL

You can exchange data with MQL: TF can read and write MQL dumps. An MQL dump is a text file, like an SQL dump. It contains the instructions to create and fill a complete database.

Correspondence TF and MQL

After exporting a TF dataset to MQL, the resulting MQL database has the following properties with respect to the TF dataset it comes from:

  • the TF slots correspond exactly with the MQL monads and have the same numbers, provided the monad numbers in the MQL dump are consecutive. In MQL this is not obligatory. Even if there are gaps in the monad sequence, we will fill the holes during conversion, so the slots are tightly consecutive;
  • the TF nodes correspond exactly with the MQL objects and have the same numbers

Node features in MQL

The values of TF features are of two types, int and str, and they translate to corresponding MQL types integer and string. The actual values do not undergo any transformation.

That means that in MQL queries, you use quotes if the feature is a string feature. You may omit the quotes only if the feature is a number feature:

[word sp='verb']
[verse chapter=1 and verse=1]

Integers in MQL

We restrict the values of integers to those between -(2 ** 31 - 1) and 2 ** 31 - 1, both inclusive, because Emdros does not deal with arbitrarily small or large integers. If there are TF features with integer values that are out of bounds, this will be reported, and no conversion will be made.
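
The bounds check can be sketched as follows. This is a minimal illustration, not the converter itself: the helper `out_of_bounds` and the sample data are hypothetical, while the two constants mirror the limits stated above.

```python
# Limits as stated above: the allowed range is symmetric around zero.
MAX_INT = 2**31 - 1
MIN_INT = -MAX_INT


def out_of_bounds(values):
    # Hypothetical helper: collect the integers outside [MIN_INT, MAX_INT].
    return {v for v in values if v < MIN_INT or v > MAX_INT}


# Hypothetical feature data: node -> integer value
feature_data = {1: 0, 2: 2**31 - 1, 3: 2**31, 4: -(2**31)}
bad = out_of_bounds(feature_data.values())

if bad:
    print(f"{len(bad)} values out of bounds, no conversion")
```

Note that `-(2**31)` is rejected as well: the lower limit is one higher than the smallest 32-bit signed integer.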

Enumeration types

It is attractive to use enumeration types for the values of a feature wherever possible, because then you can query those features in MQL with IN and without quotes:

[chapter book IN (Genesis, Exodus)]

We will generate enumerations for eligible features.

Integer values can already be queried like this, even if they are not part of an enumeration. So we restrict ourselves to node features with string values. We put the following extra restrictions:

  • the number of distinct values is at most 1000
  • all values must be legal C names, in practice: starting with a letter, followed by letters, digits, or _. The letters can only be plain ASCII letters, uppercase and lowercase.

Features that comply with these restrictions will get an enumeration type. Currently, we provide no ways to configure this in more detail.
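
The eligibility test can be sketched like this. It is an approximation under stated assumptions: `is_c_name` and `enum_eligible` are hypothetical names, and the boundary follows the module source, which skips a feature only when it has more than `ENUM_LIMIT` distinct values.

```python
import re

ENUM_LIMIT = 1000  # features with more distinct values than this are skipped

# A legal name: an ASCII letter followed by ASCII letters, digits or underscores.
C_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_c_name(value):
    # Hypothetical helper: True if value is a string that is a legal name.
    return isinstance(value, str) and C_NAME.fullmatch(value) is not None


def enum_eligible(values, limit=ENUM_LIMIT):
    # Hypothetical helper: decide whether a string feature may become an enum.
    distinct = set(values)
    return len(distinct) <= limit and all(is_c_name(v) for v in distinct)


print(enum_eligible(["Genesis", "Exodus"]))
print(enum_eligible(["1_Samuel"]))
```

A single offending value, such as one that starts with a digit or contains a hyphen, disqualifies the whole feature.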

Instead of creating separate enumeration types for individual features, we collect all enumerated values for all those features into one big enumeration type.

The reason is that MQL considers equal values in different types as distinct values. If we had separate types, we could never compare values for different features.
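
As a rough illustration of pooling the values, the sketch below merges the value lists of several features into one enumeration statement. The type name `all_enum` and the statement layout follow what the module source writes; the function name and the sample features are hypothetical.

```python
def one_enum_statement(enums, name="all_enum"):
    # enums: feature name -> list of enumeration values (hypothetical input).
    # All values are pooled, deduplicated and numbered in sorted order.
    values = sorted(set().union(*enums.values()))
    if not values:
        return ""
    body = ",\n    ".join(f"{val} = {i}" for i, val in enumerate(values))
    return f"CREATE ENUMERATION {name} = {{\n    {body}\n}}\nGO\n"


# Two hypothetical features that share the value "noun"
enums = {"sp": ["verb", "noun"], "pdp": ["noun", "adjv"]}
statement = one_enum_statement(enums)
print(statement)
```

Because `noun` ends up as a single member of one type, a value of `sp` can be compared with a value of `pdp`.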

Edge features in MQL

There is no place for edge values in MQL. There is only one concept of feature in MQL: object features, which are node features. But TF edges without values can be seen as node features: nodes are mapped onto sets of nodes to which the edges go. And that notion is supported by MQL: edge features are translated into MQL features of type LIST OF id_d, i.e. lists of object identifiers.
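
The mapping from edges to lists of identifiers can be sketched as follows. This is an illustration only: the edge feature `mother` and the data layout (node mapped to its targets) are assumptions, and the targets are sorted here merely to make the output predictable.

```python
def edge_targets_literal(targets):
    # Render the target nodes of an edge as an MQL list of object identifiers.
    # If targets is a dict (target -> edge value), only the keys are used,
    # so any edge values are dropped.
    return "({})".format(",".join(str(t) for t in sorted(targets)))


# Hypothetical edge data: node -> set of nodes the edges go to
mother = {10: {3}, 11: {3, 7}}
lines = [f"mother:={edge_targets_literal(ts)};" for n, ts in sorted(mother.items())]
print("\n".join(lines))

# An edge with values: the values have no place in MQL
print(edge_targets_literal({3: "x", 7: "y"}))
```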

TF Edge features become multivalued when translated to MQL

This has an important consequence: a feature in MQL with type id_d translates to an edge in TF. If we translate this edge back to MQL, we get a feature of type LIST OF id_d.

Queries in the original MQL with conditions like

[object_type edge_feature = some_id]

will not work for the edge feature that has made the roundtrip through TF. Instead, when working in the round-tripped MQL, you have to say

[object_type edge_feature HAS some_id]

Naming of features in MQL

Legal names in MQL

MQL names for databases, object types and features must be valid C identifiers (yes, the computer language C).

The requirements for names are:

  • start with a letter (ASCII, upper-case or lower-case)
  • continue with any sequence of ASCII upper-case or lower-case letters, digits, or underscores (_)
  • avoid being a reserved word in the C language

So, we have to change names coming from TF if they are invalid in MQL. We do that by replacing illegal characters by _, and, if the result does not start with a letter, we prepend an x. We do not check whether the name is a reserved C word.
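
The renaming rule can be sketched as below. This is a hypothetical re-implementation for illustration (`clean_name_sketch` is not the library function) and it follows only the two steps described above.

```python
import re


def clean_name_sketch(name):
    # Step 1: replace every character that is not an ASCII letter,
    # digit or underscore by an underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # Step 2: if the result does not start with a letter, prepend an x.
    if not re.match(r"[A-Za-z]", cleaned):
        cleaned = "x" + cleaned
    return cleaned


for name in ("part-of-speech", "1st", "_hidden", "g_word", "while"):
    print(name, "=>", clean_name_sketch(name))
```

The last name shows the gap mentioned above: `while` is a reserved word in C, yet it passes unchanged.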

With these provisos:

  • the given mqlDb corresponds to the MQL database name
  • the TF otypes correspond to the MQL object types
  • the TF features correspond to the MQL features

The MQL export is usually quite massive (500MB for the Hebrew Bible). It can be compressed greatly, especially by the program bzip2.
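
If the `bzip2` program is not at hand, the standard `bz2` module of Python produces the same format. The sketch below is an assumption-laden illustration: the helper `compress_mql` is hypothetical, and the tiny, highly repetitive stand-in file says nothing about the ratio a real export will reach.

```python
import bz2
import os
import shutil
import tempfile


def compress_mql(mql_path):
    # Hypothetical helper: write a bzip2-compressed copy next to the MQL file.
    bz_path = mql_path + ".bz2"
    with open(mql_path, "rb") as src, bz2.open(bz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return bz_path


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a real export
    mql_path = os.path.join(tmp, "dataset.mql")
    content = "CREATE OBJECT\n" * 1000
    with open(mql_path, "w", encoding="utf8") as fh:
        fh.write(content)

    bz_path = compress_mql(mql_path)
    ratio = os.path.getsize(bz_path) / os.path.getsize(mql_path)

    # Read the compressed copy back to check that nothing was lost
    with bz2.open(bz_path, "rt", encoding="utf8") as fh:
        restored = fh.read()

print(f"compressed size is {ratio:.1%} of the original")
```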

Existing database

If you try to import an MQL file in Emdros, and a file or directory with the same name as the MQL database already exists, your import will fail spectacularly. So do not do that.

A good way to prevent clashes:

  • export the MQL to outside your ~/text-fabric-data directory, e.g. to ~/Downloads;
  • before importing the MQL file, delete the previous copy.

Delete the existing copy and import the new one:

cd ~/Downloads
rm dataset ; mql -b 3 < dataset.mql
"""
# MQL

You can exchange data with [MQL](https://emdros.org).
TF can read and write MQL dumps.
An MQL dump is a text file, like an SQL dump.
It contains the instructions to create and fill a complete database.

## Correspondence TF and MQL

After exporting a TF dataset to MQL, the resulting MQL database has the
following properties with respect to the TF dataset it comes from:

*   the TF *slots* correspond exactly with the MQL *monads* and have the same
    numbers, provided the monad numbers in the MQL dump are consecutive. In MQL
    this is not obligatory. Even if there are gaps in the monad sequence, we will
    fill the holes during conversion, so the slots are tightly consecutive;
*   the TF *nodes* correspond exactly with the MQL *objects* and have the same
    numbers

## Node features in MQL

The values of TF features are of two types, `int` and `str`, and they translate
to corresponding MQL types `integer` and `string`. The actual values do not
undergo any transformation.

That means that in MQL queries, you use quotes if the feature is a string feature.
You may omit the quotes only if the feature is a number feature:

```
[word sp='verb']
[verse chapter=1 and verse=1]
```

## Integers in MQL

We restrict the values of integers to those between `-(2 ** 31 - 1)` and
`2 ** 31 - 1`, both inclusive, because Emdros does not deal with arbitrarily
small or large integers.
If there are TF features with integer values that are out of bounds, this
will be reported, and no conversion will be made.

## Enumeration types

It is attractive to use enumeration types for the values of a feature wherever
possible, because then you can query those features in MQL with `IN` and without
quotes:

```
[chapter book IN (Genesis, Exodus)]
```

We will generate enumerations for eligible features.

Integer values can already be queried like this, even if they are not part of an
enumeration. So we restrict ourselves to node features with string values. We
put the following extra restrictions:

*   the number of distinct values is at most 1000
*   all values must be legal C names, in practice: starting with a letter,
    followed by letters, digits, or `_`. The letters can only be plain ASCII
    letters, uppercase and lowercase.

Features that comply with these restrictions will get an enumeration type.
Currently, we provide no ways to configure this in more detail.

Instead of creating separate enumeration types for individual features,
we collect all enumerated values for all those features into one
big enumeration type.

The reason is that MQL considers equal values in different types as
distinct values. If we had separate types, we could never compare
values for different features.

## Edge features in MQL

There is no place for edge values in MQL. There is only one concept of feature
in MQL: object features, which are node features.
But TF edges without values can be seen as node features: nodes are
mapped onto sets of nodes to which the edges go. And that notion is supported by
MQL:
edge features are translated into MQL features of type `LIST OF id_d`,
i.e. lists of object identifiers.

!!! caution "TF Edge features become multivalued when translated to MQL"
    This has an important consequence: a feature in MQL with type `id_d` translates
    to an edge in TF. If we translate this edge back to MQL, we get a feature of type
    `LIST OF id_d`.

    Queries in the original MQL with conditions like

    ```
    [object_type edge_feature = some_id]
    ```

    will not work for the edge feature that has made the roundtrip through TF.
    Instead, when working in the round-tripped MQL, you have to say

    ```
    [object_type edge_feature HAS some_id]
    ```

## Naming of features in MQL

!!! caution "Legal names in MQL"
    MQL names for databases, object types and features must be valid C identifiers
    (yes, the computer language C).

The requirements for names are:

*   start with a letter (ASCII, upper-case or lower-case)
*   continue with any sequence of ASCII upper-case or lower-case letters,
    digits, or underscores (`_`)
*   avoid being a reserved word in the C language

So, we have to change names coming from TF if they are invalid in MQL. We do
that by replacing illegal characters by `_`, and, if the result does not start
with a letter, we prepend an `x`. We do not check whether the name is a reserved
C word.

With these provisos:

*   the given `mqlDb` corresponds to the MQL *database name*
*   the TF `otypes` correspond to the MQL *object types*
*   the TF `features` correspond to the MQL *features*

The MQL export is usually quite massive (500MB for the Hebrew Bible).
It can be compressed greatly, especially by the program `bzip2`.

!!! caution "Existing database"
    If you try to import an MQL file in Emdros, and a file or directory
    with the same name as the MQL database already exists, your import will fail
    spectacularly. So do not do that.

A good way to prevent clashes:

*   export the MQL to outside your `~/text-fabric-data` directory, e.g. to
    `~/Downloads`;
*   before importing the MQL file, delete the previous copy.

Delete the existing copy and import the new one:

``` sh
cd ~/Downloads
rm dataset ; mql -b 3 < dataset.mql
```
"""

import re
from itertools import chain
from ..parameters import WARP, OTYPE, OSLOTS
from ..core.fabric import FabricCore
from ..core.helpers import (
    cleanName,
    isClean,
    specFromRanges,
    rangesFromList,
    setFromSpec,
    nbytes,
    console,
)
from ..core.files import (
    fileOpen,
    expanduser as ex,
    unexpanduser as ux,
    expandDir,
    dirMake,
    DOWNLOADS,
)
from ..core.timestamp import SILENT_D, silentConvert

# If a feature of type string has at most ENUM_LIMIT distinct values,
# an enumeration type will be created for it,
# provided all values of that feature are valid names for MQL.

ENUM_LIMIT = 1000

ONE_ENUM_TYPE = True

MAX_INT = 2**31 - 1
MIN_INT = -MAX_INT


def exportMQL(app, mqlDb, exportDir=None):
    """Exports the complete TF dataset into a single MQL database.

    Parameters
    ----------
    app: object
        A `tf.advanced.app.App` object, which holds the corpus data
        that will be exported to MQL.
    mqlDb: string
        Name of the MQL database
    exportDir: string, optional None
        Directory where the MQL data will be saved.
        If None is given, it will end up in the same repo as the dataset, in a new
        top-level subdirectory called `mql`.
        The exported data will be written to file `exportDir/mqlDb.mql`.
        If `exportDir` starts with `~`, the `~` will be expanded to your
        home directory.
        Likewise, `..` will be expanded to the parent of the current directory,
        and `.` to the current directory, both only at the start of `exportDir`.

    Returns
    -------
    None

    See Also
    --------
    tf.convert.mql
    """
    indent = app.indent
    indent(level=0, reset=True)

    if exportDir is None:
        repoLocation = getattr(app, "repoLocation", None)
        if repoLocation is None:
            locations = getattr(app, "locations", None)
            if locations is None or len(locations) == 0:
                baseDir = DOWNLOADS
            else:
                baseDir = expandDir(app, f"{locations[0]}/..")
        else:
            baseDir = repoLocation
        exportDir = f"{baseDir}/mql"
    else:
        exportDir = expandDir(app, exportDir)

    mqlNameClean = cleanName(mqlDb)
    mql = MQL(app, mqlNameClean, exportDir)
    mql.write()


def importMQL(mqlFile, saveDir, silent=None, slotType=None, otext=None, meta=None):
    """Converts an MQL database dump to a TF dataset.

    Parameters
    ----------
    mqlFile: string
        Path to the file which contains the MQL code.
    saveDir: string
        Path to where the new TF dataset will be saved.
    silent: string
        How silent the newly created TF object must be.

    slotType: string
        You have to tell which object type in the MQL file acts as the slot type,
        because TF cannot see that on its own.

    otext: dict
        You can pass the information about sections and text formats as
        the parameter `otext`. This info will end up in the `otext.tf` feature.
        Pass it as a dictionary of keys and values, like so:

            otext = {
                'fmt:text-trans-plain': '{glyphs}{trailer}',
                'sectionFeatures': 'book,chapter,verse',
            }

    meta: dict
        Likewise, you can add a dictionary keyed by features
        that will be added to the metadata of the corresponding features.

        You may also add metadata for the empty feature `""`,
        this will be added to the metadata of all features.
        Handy to add provenance data there.

        Example:

            meta = {
                "": dict(
                    dataset='DLC',
                    datasetName='Digital Language Corpus',
                    author="That's me",
                ),
                "sp": dict(
                    description="part-of-speech",
                ),
            }

        !!! note "description"
            TF will display all metadata information under the
            key `description` in a more prominent place than the other
            metadata.

        !!! caution "`value type`"
            Do not pass the value types of the features here.

    Returns
    -------
    object
        A `tf.core.fabric.FabricCore` object holding the conversion result of the
        MQL data into TF.
    """

    TF = FabricCore(locations=saveDir, silent=silent)
    tmObj = TF.tmObj
    indent = tmObj.indent

    indent(level=0, reset=True)
    (good, nodeFeatures, edgeFeatures, metaData) = tfFromMql(
        mqlFile, tmObj, slotType=slotType, otext=otext, meta=meta
    )
    if good:
        TF.save(nodeFeatures=nodeFeatures, edgeFeatures=edgeFeatures, metaData=metaData)
    return TF


class MQL:
    def __init__(self, app, mqlDb, exportDir, silent=SILENT_D):
        self.app = app
        self.silent = silentConvert(silent)
        app.setSilent(silent)
        warning = app.warning

        self.mqlNameOrig = mqlDb
        exportDir = ex(exportDir)
        self.exportDir = exportDir

        cleanDb = cleanName(mqlDb)
        if cleanDb != mqlDb:
            warning(f'db name "{mqlDb}" => "{cleanDb}"')
        self.mqlDb = cleanDb

        self.enums = {}
        self._check()

    def write(self):
        silent = self.silent
        app = self.app
        error = app.error
        info = app.info
        indent = app.indent
        exportDir = self.exportDir

        if not self.good:
            return

        dirMake(self.exportDir)

        mqlFile = f"{self.exportDir}/{self.mqlDb}.mql"
        try:
            fm = fileOpen(mqlFile, mode="w")
        except Exception:
            error(f"Could not write to {ux(mqlFile)}")
            self.good = False
            return

        info(f"Loading {len(self.featureList)} features")
        for ft in self.featureList:
            fObj = self.features[ft]
            fObj.load(silent=silent)

        self.fm = fm
        self._writeStartDb()
        self._writeEnums()
        self._writeTypes()
        self._writeDataAll()
        self._writeEndDb()
        indent(level=0)
        info(f"MQL in {ux(exportDir)}")
        info("Done")

    def _check(self):
        silent = self.silent
        app = self.app
        error = app.error
        info = app.info
        indent = app.indent
        tfFeatures = app.api.TF.features

        info(f"Checking features of dataset {self.mqlDb}")

        self.features = {}
        self.featureList = []
        indent(level=1)
        good = True

        for f, fo in sorted(tfFeatures.items()):
            if fo.method is not None or f in WARP:
                continue

            fo.load(metaOnly=True, silent=silent)

            if fo.isConfig:
                continue

            if fo.dataType == "int":
                fMap = fo.data
                outOfBound = {x for x in fMap.values() if x < MIN_INT or x > MAX_INT}
                nOutOfBound = len(outOfBound)

                if nOutOfBound:
                    error(
                        f'integer feature "{f}" has {nOutOfBound} values '
                        f"less than {MIN_INT} or larger than {MAX_INT}"
                    )
                    good = False

            cleanF = cleanName(f)

            if cleanF != f:
                error(f'feature "{f}" => "{cleanF}"')

            self.featureList.append(cleanF)
            self.features[cleanF] = fo

        for feat in (OTYPE, OSLOTS, "__levels__"):
            if feat not in tfFeatures:
                error(
                    "{} feature {} is missing from data set".format(
                        (
                            "Warp"
                            if feat in WARP
                            else "Computed" if feat.startswith("__") else "Data"
                        ),
                        feat,
                    )
                )
                good = False
            else:
                fObj = tfFeatures[feat]
                if not fObj.load(silent=silent):
                    good = False

        indent(level=0)

        if not good:
            error("Export to MQL aborted")
        else:
            info(f"{len(self.featureList)} features to export to MQL ...")

        self.good = good

    def _writeStartDb(self):
        self.fm.write(
            """
CREATE DATABASE '{name}'
GO
USE DATABASE '{name}'
GO
""".format(
                name=self.mqlDb
            )
        )

    def _writeEndDb(self):
        self.fm.write(
            """
VACUUM DATABASE ANALYZE
GO
"""
        )
        self.fm.close()

    def _writeEnums(self):
        app = self.app
        info = app.info
        indent = app.indent

        indent(level=0)
        info("Writing enumerations")
        indent(level=1)
        for ft in self.featureList:
            ftClean = cleanName(ft)
            fObj = self.features[ft]

            if fObj.isEdge or fObj.dataType == "int":
                continue

            fMap = fObj.data
            fValues = sorted(set(fMap.values()))

            if len(fValues) > ENUM_LIMIT:
                continue

            eligible = all(isClean(fVal) for fVal in fValues)

            if not eligible:
                unclean = [fVal for fVal in fValues if not isClean(fVal)]
                console(
                    "\t{:<15}: {:>4} values, {} not a name, e.g. «{}»".format(
                        ftClean,
                        len(fValues),
                        len(unclean),
                        unclean[0],
                    )
                )
                continue
            self.enums[ftClean] = fValues

        if ONE_ENUM_TYPE:
            self._writeEnumsAsOne()
        else:
            for ft in sorted(self.enums):
                self._writeEnum(ft)
            indent(level=0)
            info(f"Written {len(self.enums)} enumerations")

    def _writeEnumsAsOne(self):
        app = self.app
        info = app.info

        fValues = sorted(
            set(chain.from_iterable((set(fV) for fV in self.enums.values())))
        )
        if len(fValues):
            info(f"Writing an all-in-one enum with {len(fValues):>4} values")
            fValuesEnumerated = ",\n\t".join(
                "{} = {}".format(fVal, i) for (i, fVal) in enumerate(fValues)
            )
            self.fm.write(
                f"""
CREATE ENUMERATION all_enum = {{
    {fValuesEnumerated}
}}
GO
"""
            )

    def _writeEnum(self, ft):
        app = self.app
        info = app.info

        fValues = self.enums[ft]
        if len(fValues):
            info(f"enum {ft:<15} with {len(fValues):>4} values")
            fValuesEnumerated = ",\n\t".join(
                f"{fVal} = {i}" for (i, fVal) in enumerate(fValues)
            )
            self.fm.write(
                f"""
CREATE ENUMERATION {ft}_enum = {{
    {fValuesEnumerated}
}}
GO
"""
            )

    def _writeTypes(self):
        def valInt(n):
            return str(n)

        def valStr(s):
            if "'" in s:
                return '"{}"'.format(s.replace('"', '\\"'))
            else:
                return "'{}'".format(s)

        def valIds(ids):
            return "({})".format(",".join(str(i) for i in ids))

        app = self.app
        warning = app.warning
        info = app.info
        indent = app.indent
        tfFeatures = app.api.TF.features

        self.levels = tfFeatures["__levels__"].data[::-1]
        indent(level=0)
        info(
            "Mapping {} features onto {} object types".format(
                len(self.featureList),
                len(self.levels),
            )
        )
        otypeSupport = {}

        for otype, av, start, end in self.levels:
            cleanOtype = cleanName(otype)
            if cleanOtype != otype:
                warning(f'otype "{otype}" => "{cleanOtype}"')
            otypeSupport[cleanOtype] = set(range(start, end + 1))

        self.otypes = {}
        self.featureTypes = {}
        self.featureMethods = {}

        for ft in self.featureList:
            ftClean = cleanName(ft)
            fObj = self.features[ft]
            if fObj.isEdge:
                dataType = "LIST OF id_d"
                method = valIds
            else:
                if fObj.dataType == "str":
                    dataType = 'string DEFAULT ""'
                    method = valInt if ft in self.enums else valStr
                elif fObj.dataType == "int":
                    dataType = "integer DEFAULT 0"
                    method = valInt
                else:
                    dataType = 'string DEFAULT ""'
                    method = valStr
            self.featureTypes[ft] = dataType
            self.featureMethods[ft] = method

            support = set(fObj.data.keys())
            for otype in otypeSupport:
                if len(support & otypeSupport[otype]):
                    self.otypes.setdefault(otype, []).append(ftClean)

        for otype in (cleanName(x[0]) for x in self.levels):
            self._writeType(otype)

    def _writeType(self, otype):
        self.fm.write(
            f"""
CREATE OBJECT TYPE
[{otype}
"""
        )
        for ft in self.otypes.get(otype, []):
            fType = (
                "{}_enum".format("all" if ONE_ENUM_TYPE else ft)
                if ft in self.enums
                else self.featureTypes[ft]
            )
            self.fm.write(f"  {ft}:{fType};\n")
        self.fm.write(
            """
]
GO
"""
        )

    def _writeDataAll(self):
        app = self.app
        info = app.info
        tfFeatures = app.api.TF.features

        info(
            "Writing {} features as data in {} object types".format(
                len(self.featureList),
                len(self.levels),
            )
        )
        oslotsData = tfFeatures[OSLOTS].data
        self.oslots = oslotsData[0]
        self.maxSlot = oslotsData[1]

        for otype, av, start, end in self.levels:
            self._writeData(otype, start, end)

    def _writeData(self, otype, start, end):
        app = self.app
        info = app.info
        indent = app.indent

        fm = self.fm

        indent(level=1, reset=True)
        info(f"{otype} data ...")
        oslots = self.oslots
        maxSlot = self.maxSlot
        oFeats = self.otypes.get(otype, [])
        features = self.features
        featureMethods = self.featureMethods
        fm.write(
            """
DROP INDEXES ON OBJECT TYPE[{o}]
GO
CREATE OBJECTS
WITH OBJECT TYPE[{o}]
""".format(
                o=otype
            )
        )
        curSize = 0
        LIMIT = 50000
        t = 0
        j = 0
        indent(level=2, reset=True)

        for n in range(start, end + 1):
            oMql = """
CREATE OBJECT
FROM MONADS= {{ {m} }}
WITH ID_D={i} [
""".format(
                m=(
                    n
                    if n <= maxSlot
                    else specFromRanges(rangesFromList(oslots[n - maxSlot - 1]))
                ),
                i=n,
            )
            for ft in oFeats:
                method = featureMethods[ft]
                fMap = features[ft].data
                if n in fMap:
                    oMql += f"{ft}:={method(fMap[n])};\n"
            oMql += """
]
"""
            fm.write(oMql)
            curSize += len(bytes(oMql, encoding="utf8"))
            t += 1
            j += 1
            if j == LIMIT:
                fm.write(
                    """
GO
CREATE OBJECTS
WITH OBJECT TYPE[{o}]
""".format(
                        o=otype
                    )
                )
                info(
                    f"batch of size {nbytes(curSize):>20} with {j:>7} of {t:>7} {otype}s"
                )
                j = 0
                curSize = 0

        info(f"batch of size {nbytes(curSize):>20} with {j:>7} of {t:>7} {otype}s")
        fm.write(
            """
GO
CREATE INDEXES ON OBJECT TYPE[{o}]
GO
""".format(
                o=otype
            )
        )

        indent(level=1)
        info("{} data: {} objects".format(otype, t))


# MQL IMPORT

uniscan = re.compile(r"(?:\\x..)+")


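def makeuni(match):
    """Make proper UNICODE of a text that contains byte escape codes
    such as backslash `xb6`
    """
    # Collect the hex digits of the consecutive byte escapes in the match
    # and decode the resulting bytes as UTF-8, without resorting to `eval`.
    hexDigits = match.group(0).replace("\\x", "")
    return bytes.fromhex(hexDigits).decode("utf-8")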


def uni(line):
    return uniscan.sub(makeuni, line)


def tfFromMql(mqlFile, tmObj, slotType=None, otext=None, meta=None):
    """Generate TF from MQL

    Parameters
    ----------
    tmObj: object
        A `tf.core.timestamp.Timestamp` object
    mqlFile, slotType, otext, meta: mixed
        See `tf.convert.mql.importMQL`
    """
    mqlFile = ex(mqlFile)
    error = tmObj.error

    if slotType is None:
        error("ERROR: no slotType specified")
        return (False, {}, {}, {})
    (good, objectTypes, tables, edgeF, nodeF) = parseMql(mqlFile, tmObj)
    if not good:
        return (False, {}, {}, {})
    return tfFromData(tmObj, objectTypes, tables, edgeF, nodeF, slotType, otext, meta)


def parseMql(mqlFile, tmObj):
    info = tmObj.info
    error = tmObj.error

    info("Parsing MQL source ...")
    fh = fileOpen(mqlFile)

    objectTypes = dict()
    tables = dict()

    edgeF = dict()
    nodeF = dict()

    curId = None
    curEnum = None
    curObjectType = None
    curTable = None
    curObject = None
    curValue = None
    curFeature = None
    seeObjects = False

    inObjectTypeFeatures = False

    STRING_TYPES = {"ascii", "string"}

    enums = dict()

    chunkSize = 1000000
    inThisChunk = 0

    good = True

    for ln, line in enumerate(fh):
        inThisChunk += 1
        if inThisChunk == chunkSize:
            info(f"\tline {ln + 1:>9}")
            inThisChunk = 0
        if line.startswith("CREATE OBJECTS WITH OBJECT TYPE") or line.startswith(
            "WITH OBJECT TYPE"
        ):
            comps = line.rstrip().rstrip("]").split("[", 1)
            curTable = comps[1]
            info(f"\t\tobjects in {curTable}")
            curObject = None
            if curTable not in tables:
                tables[curTable] = dict()
            seeObjects = True
        elif line == "CREATE OBJECT\n":
            curObject = dict(feats=dict(), monads=None)
            curId = None
            seeObjects = True
        elif curEnum is not None:
            if line.startswith("}"):
                curEnum = None
                continue
            comps = line.strip().rstrip(",").split("=", 1)
            comp = comps[0].strip()
            words = comp.split()
            if words[0] == "DEFAULT":
                enums[curEnum]["default"] = uni(words[1])
                value = words[1]
            else:
                value = words[0]
            enums[curEnum]["values"].append(value)
        elif curObjectType is not None:
            if line.startswith("]"):
                curObjectType = None
                inObjectTypeFeatures = False
                continue
            if curObjectType is True:
                if line.startswith("["):
                    curObjectType = line.rstrip()[1:]
                    objectTypes[curObjectType] = dict()
                    info(f"\t\totype {curObjectType}")
                    inObjectTypeFeatures = True
                    continue
            if inObjectTypeFeatures:
                comps = line.strip().rstrip(";").split(":", 1)
                feature = comps[0].strip()
                fInfo = comps[1].strip()
                fCleanInfo = fInfo.replace("FROM SET", "")
                fInfoComps = fCleanInfo.split(" ", 1)
                fMQLType = fInfoComps[0]
                if len(fInfoComps) == 2:
                    fDefaultComps = fInfoComps[1].strip().split(" ", 1)
                    fDefault = fDefaultComps[1] if len(fDefaultComps) > 1 else None
                else:
                    fDefault = None
                if fDefault is not None and fMQLType in STRING_TYPES:
                    fDefault = uni(fDefault[1:-1])
                default = enums.get(fMQLType, {}).get("default", fDefault)
                ftype = (
                    "str"
                    if fMQLType in enums
                    else (
                        "int"
                        if fMQLType == "integer"
                        else (
                            "str"
                            if fMQLType in STRING_TYPES
                            else "int" if fInfo == "id_d" else "str"
                        )
                    )
                )
                isEdge = fMQLType == "id_d"
                if isEdge:
                    edgeF.setdefault(curObjectType, set()).add(feature)
                else:
                    nodeF.setdefault(curObjectType, set()).add(feature)

                objectTypes[curObjectType][feature] = (ftype, default)
                info(
                    "\t\t\tfeature {} ({}) =def= {} : {}".format(
                        feature, ftype, default, "edge" if isEdge else "node"
                    )
                )
        elif seeObjects:
            if curObject is not None:
                if line.startswith("]"):
                    objectType = objectTypes[curTable]
                    for feature, (ftype, default) in objectType.items():
                        if feature not in curObject["feats"] and default is not None:
                            curObject["feats"][feature] = default
                    tables[curTable][curId] = curObject
                    curObject = None
                    continue
                elif line.startswith("["):
                    name = line.rstrip()[1:]
                    if len(name):
                        curTable = name
                        if curTable not in tables:
                            tables[curTable] = dict()
                elif line.startswith("FROM MONADS"):
                    monads = (
                        line.split("=", 1)[1]
                        .replace("{", "")
                        .replace("}", "")
                        .replace(" ", "")
                        .strip()
                    )
                    curObject["monads"] = setFromSpec(monads)
                elif line.startswith("WITH ID_D"):
                    comps = line.replace("[", "").rstrip().split("=", 1)
                    curId = int(comps[1])
                elif line.startswith("GO"):
                    pass
                elif line.strip() == "":
                    pass
                else:
                    if curValue is not None:
                        toBeContinued = not line.rstrip().endswith('";')
                        if toBeContinued:
                            curValue += line
                        else:
                            curValue += line.rstrip().rstrip(";").rstrip('"')
                            curObject["feats"][curFeature] = uni(curValue)
                            curValue = None
                            curFeature = None
                        continue
                    if ":=" in line:
                        (featurePart, valuePart) = line.split("=", 1)
                        feature = featurePart[0:-1].strip()
                        valuePart = valuePart.lstrip()
                        isText = ':="' in line
                        toBeContinued = isText and not line.rstrip().endswith('";')
                        if toBeContinued:
                            # this happens if a feature value
                            # contains a new line
                            # we must continue scanning lines
                            # until we meet the end of the value
                            curFeature = feature
                            curValue = valuePart.lstrip('"')
                        else:
                            value = valuePart.rstrip().rstrip(";").strip('"')
                            curObject["feats"][feature] = (
                                uni(value) if isText else value
                            )
                    else:
                        error(f"ERROR: line {ln}: unrecognized line -->{line}<--")
                        good = False
                        break
            else:
                if line.startswith("CREATE OBJECT"):
                    curObject = dict(feats=dict(), monads=None)
                    curId = None
        else:
            if line.startswith("CREATE ENUMERATION"):
                words = line.split()
                curEnum = words[2]
                enums[curEnum] = dict(default=None, values=[])
                info(f"\t\tenum {curEnum}")
            elif line.startswith("CREATE OBJECT TYPE"):
                curObjectType = True
                inObjectTypeFeatures = False
    info(f"{ln + 1} lines parsed")
    fh.close()
    for table in tables:
        info(f"{len(tables[table])} objects of type {table}")

    if len(tables) == 0:
        info("No objects found")
    return (good, objectTypes, tables, nodeF, edgeF)


def tfFromData(tmObj, objectTypes, tables, nodeF, edgeF, slotType, otext, meta):
    info = tmObj.info

    info("Making TF data ...")

    NIL = {"nil", "NIL", "Nil"}

    tableOrder = [slotType] + [t for t in sorted(tables) if t != slotType]

    iddFromMonad = dict()
    slotFromMonad = dict()

    nodeFromIdd = dict()
    iddFromNode = dict()

    nodeFeatures = dict()
    edgeFeatures = dict()
    metaData = dict()

    # metadata that ends up in every feature
    metaData[""] = meta.get("", {})
    distinctFeatures = chain(
        chain.from_iterable(nodeF.values()), chain.from_iterable(edgeF.values())
    )
    for f in distinctFeatures:
        metaInfo = meta.get(f, None)
        if metaInfo is not None:
            metaData[f] = metaInfo

    # the config feature otext
    metaData["otext"] = otext

    good = True

    info("Monad - idd mapping ...")
    for idd in tables.get(slotType, {}):
        monad = list(tables[slotType][idd]["monads"])[0]
        iddFromMonad[monad] = idd

    info("Removing holes in the monad sequence")
    # we set up a monad - slot mapping
    curSlot = 0
    otype = dict()
    for monad in sorted(iddFromMonad):
        curSlot += 1
        slotFromMonad[monad] = curSlot
        idd = iddFromMonad[monad]
        nodeFromIdd[idd] = curSlot
        iddFromNode[curSlot] = idd
        otype[curSlot] = slotType

    maxSlot = curSlot
    info(f"maxSlot={maxSlot}")

    info("Node mapping and otype ...")
    node = maxSlot
    for t in tableOrder[1:]:
        for idd in sorted(tables[t]):
            node += 1
            nodeFromIdd[idd] = node
            iddFromNode[node] = idd
            otype[node] = t

    nodeFeatures["otype"] = otype
    metaData["otype"] = dict(valueType="str")

    info("oslots ...")
    oslots = dict()
    for t in tableOrder[1:]:
        for idd in tables.get(t, {}):
            node = nodeFromIdd[idd]
            monads = tables[t][idd]["monads"]
            oslots[node] = {slotFromMonad[m] for m in monads}
    edgeFeatures["oslots"] = oslots
    metaData["oslots"] = dict(valueType="str")

    info("metadata ...")
    for t in nodeF:
        for f in nodeF[t]:
            ftype = objectTypes[t][f][0]
            metaData.setdefault(f, {})["valueType"] = ftype
    for t in edgeF:
        for f in edgeF[t]:
            metaData.setdefault(f, {})["valueType"] = "str"

    info("features ...")
    chunkSize = 100000
    for t in tableOrder:
        info(f"\tfeatures from {t}s")
        inThisChunk = 0
        thisTable = tables.get(t, {})
        for i, idd in enumerate(thisTable):
            inThisChunk += 1
            if inThisChunk == chunkSize:
                info(f"\t{i + 1:>9} {t}s")
                inThisChunk = 0
            node = nodeFromIdd[idd]
            features = tables[t][idd]["feats"]
            for f, v in features.items():
                isEdge = f in edgeF.get(t, set())
                if isEdge:
                    if v not in NIL:
                        edgeFeatures.setdefault(f, {}).setdefault(node, set()).add(
                            nodeFromIdd[int(v)]
                        )
                else:
                    nodeFeatures.setdefault(f, {})[node] = v
        info(f"\t{len(thisTable):>9} {t}s")

    return (good, nodeFeatures, edgeFeatures, metaData)
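
The hole-removal step in tfFromData above can be looked at in isolation. What follows is a minimal sketch, not the library's code: the function name slots_from_monads and the sample monad numbers are invented for illustration. It shows how a monad sequence with gaps is renumbered into tightly consecutive slots, and how the monad set of a non-slot object is then translated into slots.

```python
def slots_from_monads(monads):
    """Map each monad to a slot number, closing any gaps.

    Hypothetical helper: it mirrors the idea of the slotFromMonad
    mapping built in tfFromData, using plain Python only.
    """
    ordered = sorted(set(monads))
    return {monad: slot for slot, monad in enumerate(ordered, start=1)}


# Sample data (invented): monads 3 and 6 are missing from the sequence.
slot_from_monad = slots_from_monads([1, 2, 4, 5, 7])

# A non-slot object that occupies monads 4, 5 and 7 ...
phrase_monads = {4, 5, 7}
# ... is linked to the renumbered slots of those monads.
phrase_slots = {slot_from_monad[m] for m in phrase_monads}

print(slot_from_monad)
print(sorted(phrase_slots))
```

Because the mapping is built from the sorted monads, the relative order of the slots is preserved; only the gaps disappear.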

Functions

def exportMQL(app, mqlDb, exportDir=None)

Exports the complete TF dataset into a single MQL database.

Parameters

app : object
An App object, which holds the corpus data that will be exported to MQL.
mqlDb : string
Name of the MQL database
exportDir : string, optional None
Directory where the MQL data will be saved. If None is given, it will end up in the same repo as the dataset, in a new top-level subdirectory called mql. The exported data will be written to file exportDir/mqlDb.mql. If exportDir starts with ~, the ~ will be expanded to your home directory. Likewise, .. will be expanded to the parent of the current directory, and . to the current directory, both only at the start of exportDir.
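
The expansion rule for exportDir can be illustrated with a small sketch. The helper expand_dir below is hypothetical and is not the function TF uses internally; it only assumes the behaviour described above, namely that ~, .. and . are expanded when they occur at the start of the path and nowhere else.

```python
import os


def expand_dir(path):
    """Expand a leading ~, .. or . in a directory path.

    Hypothetical helper that illustrates the rule described for
    exportDir; components further on in the path are left alone.
    """
    if path.startswith("~"):
        return os.path.expanduser(path)
    cwd = os.getcwd()
    if path == ".." or path.startswith("../"):
        # parent of the current directory
        return os.path.dirname(cwd) + path[2:]
    if path == "." or path.startswith("./"):
        # the current directory itself
        return cwd + path[1:]
    return path


print(expand_dir("~/mql"))
print(expand_dir("../mql"))
print(expand_dir("./mql"))
```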

Returns

None
 

See Also

tf.convert.mql

def importMQL(mqlFile, saveDir, silent=None, slotType=None, otext=None, meta=None)

Converts an MQL database dump to a TF dataset.

Parameters

mqlFile : string
Path to the file which contains the MQL code.
saveDir : string
Path to where a new TF app will be created.
silent : string
How silent the newly created TF object must be.
slotType : string
You have to tell which object type in the MQL file acts as the slot type, because TF cannot see that on its own.
otext : dict
You can pass the information about sections and text formats as the parameter otext. This info will end up in the otext.tf feature. Pass it as a dictionary of keys and values, like so:
otext = {
    'fmt:text-trans-plain': '{glyphs}{trailer}',
    'sectionFeatures': 'book,chapter,verse',
}
meta : dict

Likewise, you can pass a dictionary keyed by feature name; its values will be added to the metadata of the corresponding features.

You may also add metadata under the empty feature name ""; it will be added to the metadata of all features. This is a handy place for provenance data.

Example:

meta = {
    "": dict(
        dataset='DLC',
        datasetName='Digital Language Corpus',
        author="That 's me",
    ),
    "sp": dict(
        description="part-of-speech",
    ),
}

description

TF will display all metadata information under the key description in a more prominent place than the other metadata.

value type

Do not pass the value types of the features here: the converter derives them from the MQL feature types and overwrites any valueType you supply.
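
The intended effect of the meta parameter can be sketched as follows. The helper feature_metadata is hypothetical and only illustrates how metadata under the empty key "" combines with per-feature metadata; it does not show how TF stores or merges metadata internally. The feature names sp and lex are sample values.

```python
def feature_metadata(meta, features):
    """Combine shared metadata (key "") with per-feature metadata.

    Hypothetical helper: per-feature entries are layered on top of
    the shared entries for every feature that is listed.
    """
    shared = meta.get("", {})
    return {f: {**shared, **meta.get(f, {})} for f in features}


meta = {
    "": dict(dataset="DLC", author="That 's me"),
    "sp": dict(description="part-of-speech"),
}

combined = feature_metadata(meta, ["sp", "lex"])
print(combined["sp"])
print(combined["lex"])
```

In this sketch, a feature without its own entry, such as lex, still receives the shared provenance data.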

Returns

object
A FabricCore object holding the conversion result of the MQL data into TF.
def makeuni(match)

Make proper Unicode of a text that contains byte escape codes such as \xb6.
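
As an illustration of this kind of conversion, here is a minimal sketch. It assumes that the escape codes represent the bytes of UTF-8 encoded text; the names ESCAPE_RUN, decode_escape_run and decode_escapes are made up for this example and this is not the implementation of makeuni or uni.

```python
import re

# A run of one or more \xNN escapes (assumed form of the escape codes).
ESCAPE_RUN = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")


def decode_escape_run(match):
    """Turn one run of \\xNN escapes into the text it encodes."""
    hex_digits = match.group(0).replace("\\x", "")
    # Assumption: the escaped bytes form valid UTF-8.
    return bytes.fromhex(hex_digits).decode("utf-8")


def decode_escapes(text):
    """Replace every run of \\xNN escapes in text by proper Unicode."""
    return ESCAPE_RUN.sub(decode_escape_run, text)


print(decode_escapes(r"a\xc2\xb6b"))
print(decode_escapes(r"caf\xc3\xa9"))
```

Consecutive escapes are decoded as one run, because a single character may be spread over several bytes.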

def parseMql(mqlFile, tmObj)
def tfFromData(tmObj, objectTypes, tables, nodeF, edgeF, slotType, otext, meta)
def tfFromMql(mqlFile, tmObj, slotType=None, otext=None, meta=None)

Generate TF from MQL

Parameters

tmObj : object
A Timestamp object
mqlFile, slotType, otext, meta : mixed
See importMQL()
def uni(line)

Classes

class MQL (app, mqlDb, exportDir, silent='auto')
class MQL:
    def __init__(self, app, mqlDb, exportDir, silent=SILENT_D):
        self.app = app
        self.silent = silentConvert(silent)
        app.setSilent(silent)
        warning = app.warning

        self.mqlNameOrig = mqlDb
        exportDir = ex(exportDir)
        self.exportDir = exportDir

        cleanDb = cleanName(mqlDb)
        if cleanDb != mqlDb:
            warning(f'db name "{mqlDb}" => "{cleanDb}"')
        self.mqlDb = cleanDb

        self.enums = {}
        self._check()

    def write(self):
        silent = self.silent
        app = self.app
        error = app.error
        info = app.info
        indent = app.indent
        exportDir = self.exportDir

        if not self.good:
            return

        dirMake(self.exportDir)

        mqlFile = f"{self.exportDir}/{self.mqlDb}.mql"
        try:
            fm = fileOpen(mqlFile, mode="w")
        except Exception:
            error(f"Could not write to {ux(mqlFile)}")
            self.good = False
            return

        info(f"Loading {len(self.featureList)} features")
        for ft in self.featureList:
            fObj = self.features[ft]
            fObj.load(silent=silent)

        self.fm = fm
        self._writeStartDb()
        self._writeEnums()
        self._writeTypes()
        self._writeDataAll()
        self._writeEndDb()
        indent(level=0)
        info(f"MQL in {ux(exportDir)}")
        info("Done")

    def _check(self):
        silent = self.silent
        app = self.app
        error = app.error
        info = app.info
        indent = app.indent
        tfFeatures = app.api.TF.features

        info(f"Checking features of dataset {self.mqlDb}")

        self.features = {}
        self.featureList = []
        indent(level=1)
        good = True

        for f, fo in sorted(tfFeatures.items()):
            if fo.method is not None or f in WARP:
                continue

            fo.load(metaOnly=True, silent=silent)

            if fo.isConfig:
                continue

            if fo.dataType == "int":
                fMap = fo.data
                outOfBound = {x for x in fMap.values() if x < MIN_INT or x > MAX_INT}
                nOutOfBound = len(outOfBound)

                if nOutOfBound:
                    error(
                        f'integer feature "{f}" has {nOutOfBound} values '
                        f"less than {MIN_INT} or larger than {MAX_INT}"
                    )
                    good = False

            cleanF = cleanName(f)

            if cleanF != f:
                error(f'feature "{f}" => "{cleanF}"')

            self.featureList.append(cleanF)
            self.features[cleanF] = fo

        for feat in (OTYPE, OSLOTS, "__levels__"):
            if feat not in tfFeatures:
                error(
                    "{} feature {} is missing from data set".format(
                        (
                            "Warp"
                            if feat in WARP
                            else "Computed" if feat.startswith("__") else "Data"
                        ),
                        feat,
                    )
                )
                good = False
            else:
                fObj = tfFeatures[feat]
                if not fObj.load(silent=silent):
                    good = False

        indent(level=0)

        if not good:
            error("Export to MQL aborted")
        else:
            info(f"{len(self.featureList)} features to export to MQL ...")

        self.good = good

    def _writeStartDb(self):
        self.fm.write(
            """
CREATE DATABASE '{name}'
GO
USE DATABASE '{name}'
GO
""".format(
                name=self.mqlDb
            )
        )

    def _writeEndDb(self):
        self.fm.write(
            """
VACUUM DATABASE ANALYZE
GO
"""
        )
        self.fm.close()

    def _writeEnums(self):
        app = self.app
        info = app.info
        indent = app.indent

        indent(level=0)
        info("Writing enumerations")
        indent(level=1)
        for ft in self.featureList:
            ftClean = cleanName(ft)
            fObj = self.features[ft]

            if fObj.isEdge or fObj.dataType == "int":
                continue

            fMap = fObj.data
            fValues = sorted(set(fMap.values()))

            if len(fValues) > ENUM_LIMIT:
                continue

            eligible = all(isClean(fVal) for fVal in fValues)

            if not eligible:
                unclean = [fVal for fVal in fValues if not isClean(fVal)]
                console(
                    "\t{:<15}: {:>4} values, {} not a name, e.g. «{}»".format(
                        ftClean,
                        len(fValues),
                        len(unclean),
                        unclean[0],
                    )
                )
                continue
            self.enums[ftClean] = fValues

        if ONE_ENUM_TYPE:
            self._writeEnumsAsOne()
        else:
            for ft in sorted(self.enums):
                self._writeEnum(ft)
            indent(level=0)
            info(f"Written {len(self.enums)} enumerations")

    def _writeEnumsAsOne(self):
        app = self.app
        info = app.info

        fValues = sorted(
            set(chain.from_iterable((set(fV) for fV in self.enums.values())))
        )
        if len(fValues):
            info(f"Writing an all-in-one enum with {len(fValues):>4} values")
            fValuesEnumerated = ",\n\t".join(
                "{} = {}".format(fVal, i) for (i, fVal) in enumerate(fValues)
            )
            self.fm.write(
                f"""
CREATE ENUMERATION all_enum = {{
    {fValuesEnumerated}
}}
GO
"""
            )

    def _writeEnum(self, ft):
        app = self.app
        info = app.info

        fValues = self.enums[ft]
        if len(fValues):
            info(f"enum {ft:<15} with {len(fValues):>4} values")
            fValuesEnumerated = ",\n\t".join(
                f"{fVal} = {i}" for (i, fVal) in enumerate(fValues)
            )
            self.fm.write(
                f"""
CREATE ENUMERATION {ft}_enum = {{
    {fValuesEnumerated}
}}
GO
"""
            )

    def _writeTypes(self):
        def valInt(n):
            return str(n)

        def valStr(s):
            if "'" in s:
                return '"{}"'.format(s.replace('"', '\\"'))
            else:
                return "'{}'".format(s)

        def valIds(ids):
            return "({})".format(",".join(str(i) for i in ids))

        app = self.app
        warning = app.warning
        info = app.info
        indent = app.indent
        tfFeatures = app.api.TF.features

        self.levels = tfFeatures["__levels__"].data[::-1]
        indent(level=0)
        info(
            "Mapping {} features onto {} object types".format(
                len(self.featureList),
                len(self.levels),
            )
        )
        otypeSupport = {}

        for otype, av, start, end in self.levels:
            cleanOtype = cleanName(otype)
            if cleanOtype != otype:
                warning(f'otype "{otype}" => "{cleanOtype}"')
            otypeSupport[cleanOtype] = set(range(start, end + 1))

        self.otypes = {}
        self.featureTypes = {}
        self.featureMethods = {}

        for ft in self.featureList:
            ftClean = cleanName(ft)
            fObj = self.features[ft]
            if fObj.isEdge:
                dataType = "LIST OF id_d"
                method = valIds
            else:
                if fObj.dataType == "str":
                    dataType = 'string DEFAULT ""'
                    method = valInt if ft in self.enums else valStr
                elif fObj.dataType == "int":
                    dataType = "integer DEFAULT 0"
                    method = valInt
                else:
                    dataType = 'string DEFAULT ""'
                    method = valStr
            self.featureTypes[ft] = dataType
            self.featureMethods[ft] = method

            support = set(fObj.data.keys())
            for otype in otypeSupport:
                if len(support & otypeSupport[otype]):
                    self.otypes.setdefault(otype, []).append(ftClean)

        for otype in (cleanName(x[0]) for x in self.levels):
            self._writeType(otype)

    def _writeType(self, otype):
        self.fm.write(
            f"""
CREATE OBJECT TYPE
[{otype}
"""
        )
        for ft in self.otypes.get(otype, []):
            fType = (
                "{}_enum".format("all" if ONE_ENUM_TYPE else ft)
                if ft in self.enums
                else self.featureTypes[ft]
            )
            self.fm.write(f"  {ft}:{fType};\n")
        self.fm.write(
            """
]
GO
"""
        )

    def _writeDataAll(self):
        app = self.app
        info = app.info
        tfFeatures = app.api.TF.features

        info(
            "Writing {} features as data in {} object types".format(
                len(self.featureList),
                len(self.levels),
            )
        )
        oslotsData = tfFeatures[OSLOTS].data
        self.oslots = oslotsData[0]
        self.maxSlot = oslotsData[1]

        for otype, av, start, end in self.levels:
            self._writeData(otype, start, end)

    def _writeData(self, otype, start, end):
        app = self.app
        info = app.info
        indent = app.indent

        fm = self.fm

        indent(level=1, reset=True)
        info(f"{otype} data ...")
        oslots = self.oslots
        maxSlot = self.maxSlot
        oFeats = self.otypes.get(otype, [])
        features = self.features
        featureMethods = self.featureMethods
        fm.write(
            """
DROP INDEXES ON OBJECT TYPE[{o}]
GO
CREATE OBJECTS
WITH OBJECT TYPE[{o}]
""".format(
                o=otype
            )
        )
        curSize = 0
        LIMIT = 50000
        t = 0
        j = 0
        indent(level=2, reset=True)

        for n in range(start, end + 1):
            oMql = """
CREATE OBJECT
FROM MONADS= {{ {m} }}
WITH ID_D={i} [
""".format(
                m=(
                    n
                    if n <= maxSlot
                    else specFromRanges(rangesFromList(oslots[n - maxSlot - 1]))
                ),
                i=n,
            )
            for ft in oFeats:
                method = featureMethods[ft]
                fMap = features[ft].data
                if n in fMap:
                    oMql += f"{ft}:={method(fMap[n])};\n"
            oMql += """
]
"""
            fm.write(oMql)
            curSize += len(bytes(oMql, encoding="utf8"))
            t += 1
            j += 1
            if j == LIMIT:
                fm.write(
                    """
GO
CREATE OBJECTS
WITH OBJECT TYPE[{o}]
""".format(
                        o=otype
                    )
                )
                info(
                    f"batch of size {nbytes(curSize):>20} with {j:>7} of {t:>7} {otype}s"
                )
                j = 0
                curSize = 0

        info(f"batch of size {nbytes(curSize):>20} with {j:>7} of {t:>7} {otype}s")
        fm.write(
            """
GO
CREATE INDEXES ON OBJECT TYPE[{o}]
GO
""".format(
                o=otype
            )
        )

        indent(level=1)
        info("{} data: {} objects".format(otype, t))
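
The enumeration handling in the class above can be summarized in a small stand-alone sketch. It is not the library's code: the names ENUM_LIMIT_SKETCH, C_NAME, enum_values and enum_statement are hypothetical, the limit of 1000 is taken from the module description, and the exact behaviour at the boundary is an assumption.

```python
import re

# Assumed limit: the description speaks of fewer than 1000 distinct values.
ENUM_LIMIT_SKETCH = 1000

# A legal C name here: an ASCII letter, then ASCII letters, digits or _.
C_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def enum_values(values):
    """Return the sorted distinct values if they qualify, else None."""
    distinct = sorted(set(values))
    if len(distinct) >= ENUM_LIMIT_SKETCH:
        return None
    if not all(C_NAME.fullmatch(v) for v in distinct):
        return None
    return distinct


def enum_statement(name, values):
    """Render a CREATE ENUMERATION statement in the style of _writeEnum."""
    body = ",\n\t".join(f"{v} = {i}" for i, v in enumerate(values))
    return f"CREATE ENUMERATION {name}_enum = {{\n    {body}\n}}\nGO\n"


values = enum_values(["verb", "noun", "verb"])
print(enum_statement("sp", values))
print(enum_values(["part-of-speech"]))
```

A value such as part-of-speech contains a hyphen, so in this sketch the whole feature is left as an ordinary string feature.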

Methods

def write(self)