Module tf.core.fabric
FabricCore
The main class that implements the core API is Fabric.
In this module we define FabricCore, which contains most of the functionality of Fabric.
It does not contain the volume support.
Volume support requires tf.volumes.extract and tf.volumes.collect, which need to load and save TF datasets, and loading and saving are Fabric functionalities.
Hence a Fabric with volume support would lead to circular imports.
By leaving out volume support, the volume modules can use FabricCore instead of Fabric.
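The layering that breaks the circular import can be pictured with a hypothetical, much-simplified sketch (none of these bodies are the real implementations; only the import direction matters):

```python
# fabric_core.py: FabricCore knows how to load and save, nothing about volumes.
class FabricCore:
    def load(self, features):
        return f"loaded {features}"

# volumes.py: the volume functions depend only on FabricCore, not on Fabric.
def extract(core):
    return core.load("otype")

# fabric.py: Fabric sits on top of both, so no module imports "upward".
class Fabric(FabricCore):
    def extractVolume(self):
        return extract(self)
```

Because tf.volumes never imports Fabric itself, each module only imports from layers below it.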
Global variables
var PRECOMPUTE
-
Pre-computation steps.
Each step corresponds to a pre-computation task.
A task is specified by a tuple containing:
Parameters
dep: boolean
- Whether the step depends on the presence of additional features. Only relevant for the pre-computation of section structure: that should only happen if there are section features.
name: string
- The name of the result of a pre-computed task. The result is a blob of data that can be loaded and compressed just like ordinary features.
function: function
- The function that performs the pre-computation task. These functions are defined in tf.core.prepare.
dependencies: strings
- The remaining parts of the tuple are the names of pre-computed features that must be computed first and whose results are passed as arguments to the function that executes the pre-computation.
For a description of what the steps are for, see the functions in tf.core.prepare.
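The shape of one such step can be sketched as a plain tuple; the names below are illustrative only, not the actual entries of PRECOMPUTE:

```python
# Hypothetical pre-computation function (real ones live in tf.core.prepare).
def computeLevels(otype, oslots):
    return {"levels": (otype, oslots)}

# A step: (dep, name, function, *dependencies)
step = (
    False,              # dep: not dependent on extra (section) features
    "levels",           # name of the pre-computed result
    computeLevels,      # function performing the task
    "otype", "oslots",  # names of features whose data is passed as arguments
)

(dep, name, function, *dependencies) = step
```

The remaining tuple members after the function are exactly the dependency names, so unpacking with a starred target recovers them as a list.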
Classes
class FabricCore (locations=None, modules=None, silent='auto')
-
Initialize the core API for a corpus.
Top level management of
- locating TF feature files
- loading and saving feature data
- pre-computing auxiliary data
- caching pre-computed and compressed data
TF is initialized for a corpus. It will search a set of directories and catalogue all .tf files it finds there. These are the features you can subsequently load.
Here directories and subdirectories are strings with directory names separated by newlines, or iterables of directories.
Parameters
locations: string | iterable of strings, optional
- The directories specified here are used as base locations in searching for TF feature files. In general, they will not be searched directly, but certain subdirectories of them will be searched, as specified by the modules parameter.
Defaults:
~/Downloads/text-fabric-data ~/text-fabric-data ~/github/text-fabric-data
So if you have stored your main TF dataset in text-fabric-data in one of these directories, you do not have to pass a location to Fabric.
modules: string | iterable of strings
- The directories specified here are used as subdirectories appended to the directories given by the locations parameter.
All .tf files (non-recursively) in any location/module will be added to the feature set to be loaded in this session. The order in modules is important: if a feature occurs in multiple modules, the last one will be chosen. In this way you can easily override certain features in one module by features in another module of your choice.
Default: ['']
So if you leave it out, TF will just search the paths specified in locations.
silent: string, optional SILENT_D
- See Timestamp
_withGc: boolean, optional True
- If False, the Python garbage collector is disabled before loading features. Used to experiment with performance.
otext@ in modules
If modules contain features with a name starting with otext@, then the format definitions in these features will be added to the format definitions in the regular otext feature (which is a WARP feature). In this way, modules that define new features for text representation can also add new formats to the Text-API.
Returns
object
- An object from which you can call up all the methods of the core API.
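The "last module wins" rule can be sketched with a self-contained stand-in for the catalogue step (the function name and directory layout here are assumptions for illustration, not the real internals):

```python
import os
import tempfile

def catalogue(locations, modules):
    """Sketch: scan every location/module for .tf files; later modules override."""
    found = {}
    for loc in locations:
        for mod in modules:
            dirF = os.path.join(loc, mod)
            if not os.path.isdir(dirF):
                continue
            for entry in sorted(os.listdir(dirF)):
                if entry.endswith(".tf"):
                    fName = entry[:-3]
                    # A later module simply overwrites the earlier path.
                    found[fName] = os.path.join(dirF, entry)
    return found

# Two modules; "patch" redefines the lex feature.
base = tempfile.mkdtemp()
for mod, feats in (("core", ["otype", "lex"]), ("patch", ["lex"])):
    os.makedirs(os.path.join(base, mod))
    for f in feats:
        open(os.path.join(base, mod, f + ".tf"), "w").close()

features = catalogue([base], ["core", "patch"])
# features["lex"] now points at the copy in "patch", the last module listed.
```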
Subclasses
Instance variables
var banner
-
The Text-Fabric banner.
Will be shown just after start up, if the silence is not deep.
var features
-
Dictionary of all features that TF has found, whether loaded or not.
Under each feature name is all info about that feature.
The best use of this is to get the metadata of features:
TF.features['fff'].metaData
This works for all features fff that have been found, whether the feature is loaded or not.
If a feature is loaded, you can also use F.fff.meta or E.fff.meta, depending on whether fff is a node feature or an edge feature.
Do not print!
If a feature is loaded, its data is also in the feature info. This can be an enormous amount of information, and you can easily overwhelm your notebook if you print it.
var version
-
The version number of the TF library.
Methods
def clearCache(self)
-
Clears the cache of compiled TF data.
TF pre-computes data for you, so that it can be loaded faster. If the original data is updated, TF detects it, and will recompute that data.
But there are cases, when the algorithms of TF have changed, without any changes in the data, where you might want to clear the cache of pre-computed results.
Calling this function just does it, and it is equivalent to manually removing all .tfx files inside the hidden .tf directory inside your dataset.
No need to load
It is not needed to execute a TF.load() first.
See Also
tf.clean
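The manual equivalent can be sketched with a stand-alone clean-up over .tfx files; the cache layout used below (a version subdirectory inside .tf) is an assumption for illustration:

```python
import pathlib
import tempfile

def clearCacheByHand(dataset):
    """Sketch of TF.clearCache(): delete every .tfx file under dataset/.tf."""
    removed = []
    # Materialize the glob before unlinking, to avoid mutating during iteration.
    for tfx in list(pathlib.Path(dataset, ".tf").glob("**/*.tfx")):
        tfx.unlink()
        removed.append(tfx.name)
    return sorted(removed)

# Fake dataset with two cached (compiled) features.
dataset = tempfile.mkdtemp()
cacheDir = pathlib.Path(dataset, ".tf", "3")   # hypothetical version subdirectory
cacheDir.mkdir(parents=True)
(cacheDir / "otype.tfx").touch()
(cacheDir / "oslots.tfx").touch()

removed = clearCacheByHand(dataset)
```

On the next load, TF would simply recompute and re-cache the deleted data.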
def explore(self, silent='auto', show=True)
-
Makes a categorization of all features in the dataset.
Parameters
silent: string, optional SILENT_D
- See Timestamp
show: boolean, optional True
- If False, the resulting dictionary is delivered in TF.featureSets; if True, the dictionary is returned as function result.
Returns
dict | None
- A dictionary with keys nodes, edges, configs, computeds. Under each key there is the set of feature names in that category. How this dictionary is delivered depends on the parameter show.
Notes
configs
These are configuration features, with metadata only, no data. E.g. otext.
computeds
These are blocks of pre-computed data, available under the C API; see Computeds.
The sets do not indicate whether a feature is loaded or not. There are other functions that give you the loaded features: Api.Fall() for nodes and Api.Eall() for edges.
def load(self, features, add=False, silent='auto')
-
def load(self, features, add=False, silent=SILENT_D): """Loads features from disk into RAM memory. Parameters ---------- features: string | iterable Either a string containing space separated feature names, or an iterable of feature names. The feature names are just the names of `.tf` files without directory information and without extension. add: boolean, optional False The features will be added to the same currently loaded features, managed by the current API. Meant to be able to dynamically load features without reloading lots of features for nothing. silent: string, optional tf.core.timestamp.SILENT_D See `tf.core.timestamp.Timestamp` Returns ------- boolean | object If `add` is `True` a boolean indicating success is returned. Otherwise, the result is a new `tf.core.api.Api` if the feature could be loaded, else `False`. """ silent = silentConvert(silent) tmObj = self.tmObj isSilent = tmObj.isSilent setSilent = tmObj.setSilent indent = tmObj.indent debug = tmObj.debug warning = tmObj.warning error = tmObj.error cache = tmObj.cache reset = tmObj.reset featuresOnly = self.featuresOnly wasSilent = isSilent() setSilent(silent) indent(level=0, reset=True) self.sectionsOK = True self.structureOK = True self.good = True if self.good: featuresRequested = sorted(fitemize(features)) if add: self.featuresRequested += featuresRequested else: self.featuresRequested = featuresRequested for fName in (OTYPE, OSLOTS, OTEXT): self._loadFeature(fName, optional=fName == OTEXT or featuresOnly) self.textFeatures = set() if self.good and not featuresOnly: if OTEXT in self.features: otextMeta = self.features[OTEXT].metaData for otextMod in self.features: if otextMod.startswith(OTEXT + "@"): self._loadFeature(otextMod) otextMeta.update(self.features[otextMod].metaData) self.sectionFeats = itemize(otextMeta.get("sectionFeatures", ""), ",") self.sectionTypes = itemize(otextMeta.get("sectionTypes", ""), ",") self.structureFeats = itemize( otextMeta.get("structureFeatures", ""), "," ) 
self.structureTypes = itemize(otextMeta.get("structureTypes", ""), ",") (self.cformats, self.formatFeats) = collectFormats(otextMeta) if not (0 < len(self.sectionTypes) <= 3) or not ( 0 < len(self.sectionFeats) <= 3 ): if not add: warning( f"Dataset without sections in {OTEXT}:" f"no section functions in the T-API" ) self.sectionsOK = False else: self.textFeatures |= set(self.sectionFeats) self.sectionFeatsWithLanguage = tuple( f for f in self.features if f == self.sectionFeats[0] or f.startswith(f"{self.sectionFeats[0]}@") ) self.textFeatures |= set(self.sectionFeatsWithLanguage) if not self.structureTypes or not self.structureFeats: if not add: debug( f"Dataset without structure sections in {OTEXT}:" f"no structure functions in the T-API" ) self.structureOK = False else: self.textFeatures |= set(self.structureFeats) formatFeats = set(self.formatFeats) self.textFeatures |= formatFeats for fName in self.textFeatures: self._loadFeature(fName, optional=fName in formatFeats) dep1Feats = self.dep1Feats if dep1Feats: cformats = self.cformats tFormats = {} tFeats = set() for (fmt, (otpl, tpl, featData)) in cformats.items(): feats = set(chain.from_iterable(x[0] for x in featData)) tFormats[fmt] = tuple(sorted(feats)) tFeats |= feats tFeats = tuple(sorted(tFeats)) extraDependencies = [tFormats] for tFeat in tFeats: featData = self.features[tFeat].data extraDependencies.append((tFeat, featData)) for cFeat in dep1Feats: self.features[cFeat].dependencies += extraDependencies else: self.sectionsOK = False self.structureOK = False if self.good and not featuresOnly: self._precompute() if self.good: reset() for fName in self.featuresRequested: self._loadFeature(fName) if not self.good: indent(level=0) cache() error("Not all features could be loaded / computed") result = False break reset() if self.good: if add: try: self._updateApi() result = True except MemoryError: console(MEM_MSG) result = False else: try: result = self._makeApi() except MemoryError: console(MEM_MSG) result = 
False else: result = False setSilent(wasSilent) return result
Loads features from disk into RAM.

Parameters

features: string | iterable
- Either a string containing space-separated feature names, or an iterable
  of feature names. The feature names are just the names of `.tf` files,
  without directory information and without extension.

add: boolean, optional (False)
- The features will be added to the currently loaded features, managed by
  the current API. Meant to enable dynamically loading extra features
  without reloading lots of features for nothing.

silent: string, optional (SILENT_D)
- See `tf.core.timestamp.Timestamp`

Returns

boolean | object
- If `add` is `True`, a boolean indicating success is returned.
  Otherwise, the result is a new `tf.core.api.Api` if the features could
  be loaded, else `False`.
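The flexible `features` argument (a space-separated string or an iterable of names) can be normalized with a helper along the lines of TF's internal `fitemize`. The sketch below is an illustration of that behavior, not the actual implementation:

```python
def itemize_features(features):
    """Normalize a features argument: a space-separated string or an
    iterable of names both become a list of feature names."""
    if features is None:
        return []
    if type(features) is str:
        # split() collapses runs of whitespace and drops empty entries
        return [f for f in features.split() if f]
    return [str(f) for f in features]
```

So `TF.load("g_word trailer")` and `TF.load(["g_word", "trailer"])` request the same set of features.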
def loadAll(self, silent='auto')
-
def loadAll(self, silent=SILENT_D): """Load all loadable features. Parameters ---------- silent: string, optional tf.core.timestamp.SILENT_D See `tf.core.timestamp.Timestamp` """ silent = silentConvert(silent) api = self.load("", silent=silent) allFeatures = self.explore(silent=silent, show=True) loadableFeatures = allFeatures["nodes"] + allFeatures["edges"] self.load(loadableFeatures, add=True, silent=silent) return api
def save(self,
nodeFeatures: Dict[str, Dict[int, str | int]] = {},
edgeFeatures: Dict[str, Dict[int, Set[int] | Dict[int, str | int]]] = {},
metaData: Dict[str, Dict[str, str]] = {},
location=None,
module=None,
silent='auto')
-
def save( self, nodeFeatures: nodeFeatureDict = {}, edgeFeatures: edgeFeatureDict = {}, metaData: metaDataDict = {}, location=None, module=None, silent=SILENT_D, ): """Saves newly generated data to disk as TF features, nodes and / or edges. If you have collected feature data in dictionaries, keyed by the names of the features, and valued by their feature data, then you can save that data to `.tf` feature files on disk. It is this easy to export new data as features: collect the data and metadata of the features and feed it in an orderly way to `TF.save()` and there you go. Parameters ---------- nodeFeatures: dict of dict The data of a node feature is a dictionary with nodes as keys (integers!) and strings or numbers as (feature) values. This parameter holds all those dictionaries, keyed by feature name. edgeFeatures: dict of dict The data of an edge feature is a dictionary with nodes as keys, and sets or dictionaries as values. These sets should be sets of nodes (integers!), and these dictionaries should have nodes as keys and strings or numbers as values. This parameter holds all those dictionaries, keyed by feature name. metaData: dict of dict The meta data for every feature to be saved is a key-value dictionary. This parameter holds all those dictionaries, keyed by feature name. !!! explanation "value types" The type of the feature values (`int` or `str`) should be specified under key `valueType`. !!! explanation "edge values" If you save an edge feature, and there are values in that edge feature, you have to say so, by specifying `edgeValues=True` in the metadata for that feature. !!! explanation "generic metadata" This parameter may also contain fields under the empty name. These fields will be added to all features in `nodeFeatures` and `edgeFeatures`. !!! 
explanation "configuration features" If you need to write the *configuration* feature `otext`, which is a metadata-only feature, just add the metadata under key `otext` in this parameter and make sure that `otext` is not a key in `nodeFeatures` nor in `edgeFeatures`. These fields will be written into the separate configuration feature `otext`, with no data associated. location: dict The (meta)data will be written to the very last directory that TF searched when looking for features (this is determined by the `locations` and `modules` parameters in `tf.fabric.Fabric`. If both `locations` and `modules` are empty, writing will take place in the current directory. But you can override it: If you pass `location=something`, TF will save in `something/mod`, where `mod` is the last member of the `modules` parameter of TF. module: dict This is an additional way of overriding the default location where TF saves new features. See the *location* parameter. If you pass `module=something`, TF will save in `loc/something`, where `loc` is the last member of the `locations` parameter of TF. If you pass `location=path1` and `module=path2`, TF will save in `path1/path2`. 
silent: string, optional tf.core.timestamp.SILENT_D See `tf.core.timestamp.Timestamp` """ silent = silentConvert(silent) tmObj = self.tmObj isSilent = tmObj.isSilent setSilent = tmObj.setSilent indent = tmObj.indent info = tmObj.info error = tmObj.error good = True wasSilent = isSilent() setSilent(silent) indent(level=0, reset=True) self._getWriteLoc(location=location, module=module) configFeatures = dict( f for f in metaData.items() if f[0] != "" and f[0] not in nodeFeatures and f[0] not in edgeFeatures ) info( "Exporting {} node and {} edge and {} configuration features to {}:".format( len(nodeFeatures), len(edgeFeatures), len(configFeatures), self.writeDir, ) ) todo = [] for fName in sorted(nodeFeatures): todo.append((fName, nodeFeatures[fName], False, False)) for fName in sorted(edgeFeatures): todo.append((fName, edgeFeatures[fName], True, False)) for fName in sorted(configFeatures): todo.append((fName, configFeatures[fName], None, True)) total = collections.Counter() failed = collections.Counter() maxSlot = None maxNode = None slotType = None if OTYPE in nodeFeatures: info(f"VALIDATING {OSLOTS} feature") otypeData = nodeFeatures[OTYPE] if type(otypeData) is tuple: (otypeData, slotType, maxSlot, maxNode) = otypeData elif 1 in otypeData: slotType = otypeData[1] maxSlot = max(n for n in otypeData if otypeData[n] == slotType) maxNode = max(otypeData) if OSLOTS in edgeFeatures: info(f"VALIDATING {OSLOTS} feature") oslotsData = edgeFeatures[OSLOTS] if type(oslotsData) is tuple: (oslotsData, maxSlot, maxNode) = oslotsData if maxSlot is None or maxNode is None: error(f"ERROR: cannot check validity of {OSLOTS} feature") good = False else: info(f"maxSlot={maxSlot:>11}") info(f"maxNode={maxNode:>11}") maxNodeInData = max(oslotsData) minNodeInData = min(oslotsData) mappedSlotNodes = [] unmappedNodes = [] fakeNodes = [] start = min((maxSlot + 1, minNodeInData)) end = max((maxNode, maxNodeInData)) for n in range(start, end + 1): if n in oslotsData: if n <= maxSlot: 
mappedSlotNodes.append(n) elif n > maxNode: fakeNodes.append(n) else: if maxSlot < n <= maxNode: unmappedNodes.append(n) if mappedSlotNodes: error(f"ERROR: {OSLOTS} maps slot nodes") error(makeExamples(mappedSlotNodes), tm=False) good = False if fakeNodes: error(f"ERROR: {OSLOTS} maps nodes that are not in {OTYPE}") error(makeExamples(fakeNodes), tm=False) good = False if unmappedNodes: error(f"ERROR: {OSLOTS} fails to map nodes:") unmappedByType = {} for n in unmappedNodes: unmappedByType.setdefault( otypeData.get(n, "_UNKNOWN_"), [] ).append(n) for (nType, nodes) in sorted( unmappedByType.items(), key=lambda x: (-len(x[1]), x[0]), ): error(f"--- unmapped {nType:<10} : {makeExamples(nodes)}") good = False if good: info(f"OK: {OSLOTS} is valid") for (fName, data, isEdge, isConfig) in todo: edgeValues = False fMeta = {} fMeta.update(metaData.get("", {})) fMeta.update(metaData.get(fName, {})) if fMeta.get("edgeValues", False): edgeValues = True if "edgeValues" in fMeta: del fMeta["edgeValues"] fObj = Data( f"{self.writeDir}/{fName}.tf", self.tmObj, data=data, metaData=fMeta, isEdge=isEdge, isConfig=isConfig, edgeValues=edgeValues, ) tag = "config" if isConfig else "edge" if isEdge else "node" if fObj.save(nodeRanges=fName == OTYPE, overwrite=True, silent=silent): total[tag] += 1 else: failed[tag] += 1 indent(level=0) info( f"""Exported {total["node"]} node features""" f""" and {total["edge"]} edge features""" f""" and {total["config"]} config features""" f""" to {self.writeDir}""" ) if len(failed): for (tag, nf) in sorted(failed.items()): error(f"Failed to export {nf} {tag} features") good = False setSilent(wasSilent) return good
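The `oslots` validation that `save()` performs boils down to three checks: slot nodes (1 .. maxSlot) must not be mapped, every non-slot node up to maxNode must be mapped, and no node beyond maxNode may be mapped. A standalone sketch of those checks (a hypothetical helper, not the actual code):

```python
def check_oslots(oslotsData, maxSlot, maxNode):
    """Return (mappedSlotNodes, unmappedNodes, fakeNodes): the three
    error conditions that save() reports for the oslots feature."""
    mapped_slots = [n for n in oslotsData if n <= maxSlot]   # slots mapped
    fake = [n for n in oslotsData if n > maxNode]            # beyond otype
    unmapped = [                                             # holes
        n for n in range(maxSlot + 1, maxNode + 1) if n not in oslotsData
    ]
    return (mapped_slots, unmapped, fake)


# Toy data: 4 slots, nodes 5-7 are non-slot nodes; node 6 is unmapped,
# node 8 does not occur in otype at all.
oslots = {5: {1, 2}, 7: {3, 4}, 8: {1}}
```

With this data, `check_oslots(oslots, 4, 7)` flags node 6 as unmapped and node 8 as fake.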
Saves newly generated data to disk as TF features, nodes and / or edges.

If you have collected feature data in dictionaries, keyed by the names of
the features and valued by their feature data, then you can save that data
to `.tf` feature files on disk.

It is this easy to export new data as features: collect the data and
metadata of the features and feed it in an orderly way to `TF.save()` and
there you go.

Parameters

nodeFeatures: dict of dict
- The data of a node feature is a dictionary with nodes as keys (integers!)
  and strings or numbers as (feature) values.
  This parameter holds all those dictionaries, keyed by feature name.

edgeFeatures: dict of dict
- The data of an edge feature is a dictionary with nodes as keys, and sets
  or dictionaries as values. These sets should be sets of nodes (integers!),
  and these dictionaries should have nodes as keys and strings or numbers
  as values.
  This parameter holds all those dictionaries, keyed by feature name.

metaData: dict of dict
- The metadata for every feature to be saved is a key-value dictionary.
  This parameter holds all those dictionaries, keyed by feature name.

  Value types: the type of the feature values (`int` or `str`) should be
  specified under key `valueType`.

  Edge values: if you save an edge feature, and there are values in that
  edge feature, you have to say so, by specifying `edgeValues=True` in the
  metadata for that feature.

  Generic metadata: this parameter may also contain fields under the empty
  name. These fields will be added to all features in `nodeFeatures` and
  `edgeFeatures`.

  Configuration features: if you need to write the *configuration* feature
  `otext`, which is a metadata-only feature, just add the metadata under
  key `otext` in this parameter and make sure that `otext` is not a key in
  `nodeFeatures` nor in `edgeFeatures`. These fields will be written into
  the separate configuration feature `otext`, with no data associated.

location: string
- The (meta)data will be written to the very last directory that TF
  searched when looking for features (this is determined by the `locations`
  and `modules` parameters in `tf.fabric.Fabric`).

  If both `locations` and `modules` are empty, writing will take place in
  the current directory.

  But you can override it: if you pass `location=something`, TF will save
  in `something/mod`, where `mod` is the last member of the `modules`
  parameter of TF.

module: string
- This is an additional way of overriding the default location where TF
  saves new features. See the *location* parameter.

  If you pass `module=something`, TF will save in `loc/something`, where
  `loc` is the last member of the `locations` parameter of TF.

  If you pass `location=path1` and `module=path2`, TF will save in
  `path1/path2`.

silent: string, optional (SILENT_D)
- See `tf.core.timestamp.Timestamp`
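Putting the parameters together, a minimal export might look like this. The feature names and values are made up for illustration, and the final `TF.save(...)` call is shown as a comment because it needs a configured `Fabric` instance:

```python
# Hypothetical new features for a tiny corpus: nodes 1-2 are slots,
# node 3 is a phrase node.
nodeFeatures = {
    "pos": {1: "noun", 2: "verb"},
}
edgeFeatures = {
    # Edge without values: node keys mapped to sets of target nodes.
    "mother": {3: {1, 2}},
}
metaData = {
    # Generic fields under the empty name are merged into every feature.
    "": {"source": "example corpus"},
    "pos": {"valueType": "str"},
    # Add edgeValues=True here only if the edge carries values.
    "mother": {"valueType": "str"},
}

# TF.save(nodeFeatures=nodeFeatures, edgeFeatures=edgeFeatures,
#         metaData=metaData, location="mydata", module="mymodule")
```

The save call would then write `pos.tf` and `mother.tf` into `mydata/mymodule`.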