Module `tf.writing.transcription`

Transcription

TF has support for several writing systems, by means of transcription tables and fonts that will be invoked when displaying the main text.

It also calls functions to use these tables for converting Hebrew and Syriac text material to transliterated representations and back.

There is also a phonetic transcription for Hebrew, designed in phono.ipynb

Character tables and fonts

`hbo` Hebrew

tf.writing.hebrew: full list of characters covered by the ETCBC and phonetic transcriptions

Font Ezra SIL.

`syc` Syriac

tf.writing.syriac: full list of characters covered by the ETCBC transcriptions

Font Estrangelo Edessa.

`ara` Arabic

tf.writing.arabic: full list of characters covered by the transcription used for the Quran

Font AmiriQuran.

`grc` Greek

Font Gentium.

`akk` Akkadian

Font Santakku.

`uga` Ugaritic

Font Santakku.

`cld` Neo Aramaic

Font CharisSIL-R.

Classes

class Transcription

Expand source code Browse git

class Transcription:
    """Conversion between UNICODE and various transcriptions.

    Usage notes:

    Invoke the transcription functionality as follows:

    ```
    from tf.writing.transcription import Transcription
    ```

    Some of the attributes and methods below are *class* attributes,
    others are instance attributes.

    A class attribute `aaa` can be retrieved by saying

    ``` python
    Transcription.aaa
    ```

    To retrieve an instance attribute, you need an instance first, like

    ``` python
    tr = Transcription()
    tr.aaa
    ```
    """

    decomp = {
        "\u05E9\u05C1": "\uFB2A",
        "\u05E9\u05C2": "\uFB2B",
    }
    hebrew_mapping = {
        "_": " ",  # space inside word
        "92": "\u0591",  # etnahta = atnach
        "01": "\u0592",  # segolta
        "65": "\u0593",  # shalshelet
        "80": "\u0594",  # zaqef_qatan
        "85": "\u0595",  # zaqef_gadol
        "73": "\u0596",  # tipeha = tifcha
        "81": "\u0597",  # revia = rebia
        "82": "\u0598",  # zarqa = tsinorit = zinorit = sinnorit
        "03": "\u0599",  # pashta
        "10": "\u059A",  # yetiv = yetib
        "91": "\u059B",  # tevir = tebir
        "61": "\u059C",  # geresh
        "11": "\u059D",  # geresh muqdam = mugrash
        "62": "\u059E",  # gershayim = garshayim
        "84": "\u059F",  # qarney para = pazer_gadol
        "14": "\u05A0",  # telisha_gedola
        "44": "\u05A0",  # telisha_gedola = telisha_gedola_med
        "83": "\u05A1",  # pazer
        "74": "\u05A3",  # munah = munach
        "70": "\u05A4",  # mahapakh = mehuppach
        "71": "\u05A5",  # merkha = merecha
        "72": "\u05A6",  # merkha kefula = merecha_kepula
        "94": "\u05A7",  # darga
        "63": "\u05A8",  # qadma = azla
        "33": "\u05A8",  # pashta_med < qadma
        "04": "\u05A9",  # telisha_qetana
        "24": "\u05A9",  # telisha_qetana = telisha_qetana_med
        "93": "\u05AA",  # yera ben yomo = yerach
        "60": "\u05AB",  # ole = ole_weyored
        "64": "\u05AC",  # iluy = illuy
        "13": "\u05AD",  # dehi = dechi
        "02": "\u05AE",  # zinor = sinnor
        "*": "\u05AF",  # masora circle
        ":": "\u05B0",  # sheva = shewa
        ":E": "\u05B1",  # hataf segol = chataph_segol
        ":A": "\u05B2",  # hataf patah = chataph_patach
        ":@": "\u05B3",  # hataf qamats = chataph_qamats
        "I": "\u05B4",  # hiriq = chiriq
        ";": "\u05B5",  # tsere
        "E": "\u05B6",  # segol
        "A": "\u05B7",  # patach
        "@": "\u05B8",  # qamats
        "O": "\u05B9",  # holam = cholam
        "U": "\u05BB",  # qubuts = qubbuts
        ".": "\u05BC",  # dagesh
        "25": "\u05BD",  # silluq yamin
        "45": "\u05BD",  # meteg
        "35": "\u05BD",  # meteg (tikon)
        "75": "\u05BD",  # siluq = silluq
        "95": "\u05BD",  # meteg = meteg_yamin
        "&": "\u05BE",  # maqaf
        ",": "\u05BF",  # rafe = raphe
        "05": "\u05C0",  # paseq
        ".c": "\u05C1",  # shin dot
        ".f": "\u05C2",  # sin dot
        "00": "\u05C3",  # sof_pasuq
        "52": "\u05C4",  # upper dot = puncta_above
        "53": "\u05C5",  # lower dot = puncta_below
        "ñ": "\u05C6\u0307",  # nun hafukha
        "Ñ": "\u05C6\u0307",  # nun hafukha
        ">": "\u05D0",  # alef
        "B": "\u05D1",  # bet
        "G": "\u05D2",  # gimel
        "D": "\u05D3",  # dalet
        "H": "\u05D4",  # he
        "W": "\u05D5",  # vav
        "Z": "\u05D6",  # zayin
        "X": "\u05D7",  # het
        "V": "\u05D8",  # tet
        "J": "\u05D9",  # yod
        "k": "\u05DA",  # kaf final
        "K": "\u05DB",  # kaf
        "L": "\u05DC",  # lamed
        "m": "\u05DD",  # mem final
        "M": "\u05DE",  # mem
        "n": "\u05DF",  # nun final
        "N": "\u05E0",  # nun
        "S": "\u05E1",  # samekh
        "<": "\u05E2",  # ayin
        "p": "\u05E3",  # pe final
        "P": "\u05E4",  # pe
        "y": "\u05E5",  # tsadi final
        "Y": "\u05E6",  # tsadi
        "Q": "\u05E7",  # qof
        "R": "\u05E8",  # resh
        "#": "\u05E9",  # sin unpointed
        "T": "\u05EA",  # tav
        "C": "\uFB2A",  # shin pointed
        "F": "\uFB2B",  # sin pointed
        "55": "<UNMAPPED 55=large letter>",  # large_letter
        "56": "<UNMAPPED 56=small letter>",  # small_letter
        "57": "<UNMAPPED 57=suspended letter>",  # suspended_letter
        "-": "",  # suppress space afterward
        "'": "\u05f3",  # punctuation geresh
        '"': "\u05f4",  # punctuation gershayim
    }
    """
    Maps all ETCBC transliteration character combinations for Hebrew to UNICODE.

    Example: sof-pasuq:

    ``` python
    Transcription.hebrew_mapping['00']
    ```

    Output:

    ```
    ׃
    ```
    """

    hebrew_cons = ">BGDHWZXVJKLMNS<PYQRFCT"
    trans_final_pat = re.compile(
        r"(["
        + hebrew_cons
        + r"][^_&]*)([KMNPY])([^"
        + hebrew_cons
        + r"_&]*(:?[_&]|\Z))"
    )
    trans_hebrew_pat = re.compile(r"(:[AE@]|.[cf]|:|[0-9][0-9]|.)")
    swap_accent_pat = re.compile(
        r"(\A|[_&])([0-9][0-9])([" + hebrew_cons + r"])([:;,.EAIOU@]*)"
    )
    remove_accent_pat = re.compile(r"((?:[0-9][0-9])|[,*])")
    remove_point_pat = re.compile(
        r"((?:[0-9][0-9])|(?:\.[cf])|(?::[@AE])|[,.:;@AEIOU*])"
    )
    remove_psn_pat = re.compile(r"00[ _SPNÑñ]*")
    remove_psq_pat = re.compile(r"(?:[ _]+05[ _]*)|(?:05[ _]+)")
    shin_pat = re.compile(r"[CF]")
    ph_simple_pat = re.compile(r"([ˈˌᵊᵃᵒᵉāo*])")
    noorigspace = re.compile(
        r"""
          (?: [&-]\Z)           # space, maqef or nospace
        | (?:
               0[05]            # sof pasuq or paseq
               (?:_[SNP])*      # nun hafukha, setumah, petuhah at end of verse
               \Z
          )
        | (?:_[SPN])+           #  nun hafukha, setumah, petuhah between words
    """,
        re.X,
    )

    ugaritic_mappingi = {
        "𐎀": "a",
        "𐎁": "b",
        "𐎂": "g",
        "𐎃": "ḫ",
        "𐎄": "d",
        "𐎅": "h",
        "𐎆": "w",
        "𐎇": "z",
        "𐎈": "ḥ",
        "𐎉": "ṭ",
        "𐎊": "y",
        "𐎋": "k",
        "𐎌": "š",
        "𐎍": "l",
        "𐎎": "m",
        "𐎏": "ḏ",
        "𐎐": "n",
        "𐎑": "ẓ",
        "𐎒": "s",
        "𐎓": "ˤ",
        "𐎔": "p",
        "𐎕": "ṣ",
        "𐎖": "q",
        "𐎗": "r",
        "𐎘": "ṯ",
        "𐎙": "ġ",
        "𐎚": "t",
        "𐎛": "i",
        "𐎜": "u",
        "𐎝": "s2",
        "𐎟": ".",
    }
    """
    Maps Ugaritic unicode characters to their conventional transliteration characters.

    Unidentified characters:

    * `x` (damaged ?)
    * `/` (alternative ?) only twice, in `atypˤtba/r` and `xxxxl/d…`
    * `,` (comma) only once in a very long word starting at 551 ... `km,ad` ...
    * `<` `>` (brackets marking uncertainty ?)
    * `…` (unreadable ?)
    * `00a0` (non-breaking space)
    """

    ugaritic_mapping = {v: k for (k, v) in ugaritic_mappingi.items()}

    syriac_mapping_simple = {
        ">": "\u0710",  # alaph
        "B": "\u0712",  # beth
        "G": "\u0713",  # gamal
        "D": "\u0715",  # dalat
        "H": "\u0717",  # he
        "W": "\u0718",  # waw
        "Z": "\u0719",  # zain
        "X": "\u071A",  # heth
        "V": "\u071B",  # teth
        "J": "\u071D",  # yudh
        "K": "\u071F",  # kaph
        "L": "\u0720",  # lamadh
        "M": "\u0721",  # mim
        "N": "\u0722",  # nun
        "S": "\u0723",  # semkath
        "<": "\u0725",  # e
        "P": "\u0726",  # pe
        "Y": "\u0728",  # sadhe
        "Q": "\u0729",  # qaph
        "R": "\u072A",  # rish
        "C": "\u072B",  # shin
        "T": "\u072C",  # taw
        "s": "\u0724",  # semkath final
        "p": "\u0727",  # pe reversed
    }

    syriac_mapping_pil = {
        # LETTERS
        "'": "\u0710",  # alaph
        "b": "\u0712",  # beth
        "g": "\u0713",  # gamal
        "d": "\u0715",  # dalat
        "h": "\u0717",  # he
        "w": "\u0718",  # waw
        "z": "\u0719",  # zain
        "H": "\u071A",  # heth
        "T": "\u071B",  # teth
        "y": "\u071D",  # yod
        "k": "\u071F",  # kaf
        "l": "\u0720",  # lamad
        "m": "\u0721",  # mim
        "n": "\u0722",  # nun
        "s": "\u0723",  # semkath
        "`": "\u0725",  # 'e
        "p": "\u0726",  # pe
        "S": "\u0728",  # tsade
        "q": "\u0729",  # qof
        "r": "\u072A",  # resh
        "$": "\u072B",  # shin
        "t": "\u072C",  # taw
        # WORD-BOUND DIACRITICS
        '"': "\u0308",  # seyame
        "#": "\u0323",  # diacritical dot below
        "^": "\u0307",  # diacritical dot above
        "~": "\u0307",  # abbreviation mark
        # NON-VOCALIC LETTER-BOUND DIACRITICS
        "#,": "\u0742",  # rukkakha
        '#"': "\u0342",  # unclear (COMBINING DIAERESIS BELOW)
        "#!": "\u0744",  # unclear (SYRIAC TWO VERTICAL DOTS BELOW)
        "#_": "\u0331",  # linea occultans infera
        "^,": "\u0741",  # qushshaya
        "^!": "\u0743",  # unclear (SYRIAC TWO VERTICAL DOTS ABOVE)
        "^_": "\u0304",  # linea occultans supera
        # VOCALIC LETTER-BOUND DIACRITICS
        ":": "",  # shewa
        "A": "\u0733",  # qamets
        "A1": "\u0734",  # zeqapa
        "A2": "\u0735",  # zeqofo
        "E": "\u0739",  # tsere, revasa karya
        "O": "\u073F",  # holem, rewaha
        "a": "\u0730",  # patah
        "a1": "\u0731",  # petaha
        "a2": "\u0732",  # petoho
        "e": "\u0736",  # segol
        "e1": "\u0737",  # revasa arrika
        "e2": "\u0738",  # revoso
        "i": "\u073A",  # hireq
        "i1": "\u073B",  # hevoso
        "y#": "\u071D\u073C",  # hevasa
        "u": "\u073D",  # qubbuts
        "u1": "\u073E",  # esoso
        "w#": "\u0718\u073C",  # esasa allisa
        "w^": "\u0718\u073F",  # esasa rewiha
        # INTERPUNCTION
        "#.": "\u0702",  # menachta, meshalyana (ES), metdamrana, samka
        "#:": "\u0704",  # metkashpana (ES)
        "#\\": "\u0709",  # tahtaya, metkashpana (WS), meshalyana (WS)
        "=.": "\u002E",  # pasuqa
        "=/": "\u0707",  # elaya
        "=:": "\u003A",  # shewaya (WS), zauga (ES)
        "=\\": "\u0706",  # unclear (SYRIAC COLON SKEWED LEFT)
        "^.": "\u0701",  # paquda, metkashpana (ES), meshalyana (ES), etsyana, meshalana?
        '^"': "\u0705",  # rahta
        "^:": "\u0703",  # taksa (WS), zauga elaya (ES)
        "^\\": "\u0708",  # unclear (SYRIAC SUPRALINEAR COLON SKEWED LEFT)
        # PERICOPE MARKERS
        "*": "\u0700",  # rosette
        ".": "\u00B7",  # common dot in caesuras
        "@": "\u2722",  # vignette
        "_": "\u2014",  # dash in caesuras
        "o": "\u2022",  # large dot in caesuras
    }

    syriac_mapping = {  # this is WIT
        # LETTERS
        ">": "\u0710",  # alaph
        "B": "\u0712",  # beth
        "G": "\u0713",  # gamal
        "D": "\u0715",  # dalat
        "H": "\u0717",  # he
        "W": "\u0718",  # waw
        "Z": "\u0719",  # zain
        "X": "\u071A",  # heth
        "V": "\u071B",  # teth
        "J": "\u071D",  # yod
        "K": "\u071F",  # kaf
        "L": "\u0720",  # lamad
        "M": "\u0721",  # mim
        "N": "\u0722",  # nun
        "S": "\u0723",  # semkath
        "<": "\u0725",  # 'e
        "P": "\u0726",  # pe
        "Y": "\u0728",  # tsade
        "Q": "\u0729",  # qof
        "R": "\u072A",  # resh
        "C": "\u072B",  # shin
        "T": "\u072C",  # taw
        # WORD-BOUND DIACRITICS
        '"': "\u0308",  # seyame
        "#": "\u0323",  # diacritical dot below
        "^": "\u0307",  # diacritical dot above
        # NON-VOCALIC LETTER-BOUND DIACRITICS
        "^!": "\u0743",  # unclear (SYRIAC TWO VERTICAL DOTS ABOVE)
        # VOCALIC LETTER-BOUND DIACRITICS
        ":": "",  # shewa
        "A": "\u0733",  # qamets
        "A1": "\u0734",  # zeqapa
        "A2": "\u0735",  # zeqofo
        "O": "\u073F",  # holem, rewaha
        "@": "\u0730",  # patah
        "@1": "\u0731",  # petaha
        "@2": "\u0732",  # petoho
        "E": "\u0736",  # segol
        "E1": "\u0737",  # revasa arrika
        "E2": "\u0738",  # revoso
        "I": "\u073A",  # hireq
        "I1": "\u073B",  # hevoso
        "U": "\u073D",  # qubbuts
        "U1": "\u073E",  # esoso
        # INTERPUNCTION
        "#\\": "\u0709",  # tahtaya, metkashpana (WS), meshalyana (WS)
        "=.": "\u002E",  # pasuqa
        "=#": "\u0707",  # elaya
        "=:": "\u003A",  # shewaya (WS), zauga (ES)
        "=^": "\u0706",  # unclear (SYRIAC COLON SKEWED LEFT)
        "=/": "\u0707",  # elaya
        "=\\": "\u0706",  # unclear (SYRIAC COLON SKEWED LEFT)
        "^:": "\u0703",  # taksa (WS), zauga elaya (ES)
        "^\\": "\u0708",  # unclear (SYRIAC SUPRALINEAR COLON SKEWED LEFT)
        # PERICOPE MARKERS
        "*": "\u0700",  # rosette
        ".": "\u00B7",  # common dot in caesuras
        "_": "\u2014",  # dash in caesuras
        "o": "\u2022",  # large dot in caesuras
    }
    """
    Maps all ETCBC transliteration character combinations for Syriac to UNICODE.

    Example: semkath-final:

    ``` python
    Transcription.syriac_mapping['s']
    ```

    Output:

    ```
    ܤ
    ```
    """

    trans_syriac_pat = re.compile(r"([AE@IU][12]?|=[.#:\^/\\]|[\^#][!:\\]|.)")

    arabic_mapping = {
        " ": "\u0020",  # SPACE
        "'": "\u0621",  # ARABIC LETTER HAMZA
        ">": "\u0623",  # ARABIC LETTER ALEF WITH HAMZA ABOVE
        "&": "\u0624",  # ARABIC LETTER WAW WITH HAMZA ABOVE
        "<": "\u0625",  # ARABIC LETTER ALEF WITH HAMZA BELOW
        "}": "\u0626",  # ARABIC LETTER YEH WITH HAMZA ABOVE
        "A": "\u0627",  # ARABIC LETTER ALEF
        "b": "\u0628",  # ARABIC LETTER BEH
        "p": "\u0629",  # ARABIC LETTER TEH MARBUTA
        "t": "\u062a",  # ARABIC LETTER TEH
        "v": "\u062b",  # ARABIC LETTER THEH
        "j": "\u062c",  # ARABIC LETTER JEEM
        "H": "\u062d",  # ARABIC LETTER HAH
        "x": "\u062e",  # ARABIC LETTER KHAH
        "d": "\u062f",  # ARABIC LETTER DAL
        "*": "\u0630",  # ARABIC LETTER THAL
        "r": "\u0631",  # ARABIC LETTER REH
        "z": "\u0632",  # ARABIC LETTER ZAIN
        "s": "\u0633",  # ARABIC LETTER SEEN
        "$": "\u0634",  # ARABIC LETTER SHEEN
        "S": "\u0635",  # ARABIC LETTER SAD
        "D": "\u0636",  # ARABIC LETTER DAD
        "T": "\u0637",  # ARABIC LETTER TAH
        "Z": "\u0638",  # ARABIC LETTER ZAH
        "E": "\u0639",  # ARABIC LETTER AIN
        "g": "\u063a",  # ARABIC LETTER GHAIN
        "_": "\u0640",  # ARABIC TATWEEL
        "f": "\u0641",  # ARABIC LETTER FEH
        "q": "\u0642",  # ARABIC LETTER QAF
        "k": "\u0643",  # ARABIC LETTER KAF
        "l": "\u0644",  # ARABIC LETTER LAM
        "m": "\u0645",  # ARABIC LETTER MEEM
        "n": "\u0646",  # ARABIC LETTER NOON
        "h": "\u0647",  # ARABIC LETTER HEH
        "w": "\u0648",  # ARABIC LETTER WAW
        "Y": "\u0649",  # ARABIC LETTER ALEF MAKSURA
        "y": "\u064a",  # ARABIC LETTER YEH
        "F": "\u064b",  # ARABIC FATHATAN
        "N": "\u064c",  # ARABIC DAMMATAN
        "K": "\u064d",  # ARABIC KASRATAN
        "a": "\u064e",  # ARABIC FATHA
        "u": "\u064f",  # ARABIC DAMMA
        "i": "\u0650",  # ARABIC KASRA
        "~": "\u0651",  # ARABIC SHADDA
        "o": "\u0652",  # ARABIC SUKUN
        "^": "\u0653",  # ARABIC MADDAH ABOVE
        "#": "\u0654",  # ARABIC HAMZA ABOVE
        "`": "\u0670",  # ARABIC LETTER SUPERSCRIPT ALEF
        "{": "\u0671",  # ARABIC LETTER ALEF WASLA
        ":": "\u06dc",  # ARABIC SMALL HIGH SEEN
        "@": "\u06df",  # ARABIC SMALL HIGH ROUNDED ZERO
        '"': "\u06e0",  # ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
        "[": "\u06e2",  # ARABIC SMALL HIGH MEEM ISOLATED FORM
        ";": "\u06e3",  # ARABIC SMALL LOW SEEN
        ",": "\u06e5",  # ARABIC SMALL WAW
        ".": "\u06e6",  # ARABIC SMALL YEH
        "!": "\u06e8",  # ARABIC SMALL HIGH NOON
        "-": "\u06ea",  # ARABIC EMPTY CENTRE LOW STOP
        "+": "\u06eb",  # ARABIC EMPTY CENTRE HIGH STOP
        "%": "\u06ec",  # ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
        "]": "\u06ed",  # ARABIC SMALL LOW MEEM
    }
    """
    Maps an Arabic transliteration character to UNICODE.
    This is the mapping used in the Quran representation on tanzil.net.

    Example: beh

    ``` python
    Transcription.syriac_mapping['b']
    ```

    Output:

    ```
    ب
    ```

    Maps an Arabic letter in UNICODE to its transliteration

    Example: beh transliteration

    ``` python
    Transcription.syriac_mapping['ب']
    ```

    Output:

    ```
    b
    ```
    """

    arabic_mappingi = {v: k for (k, v) in arabic_mapping.items()}

    arabicTrans = {
        "\u0020": (" ", " ", " ", " "),  # SPACE
        "\u060c": ("‚", ",", ",", ","),  # ARABIC COMMA
        "\u061b": ("„", ";", ";", ";"),  # ARABIC SEMICOLON
        "\u061f": ("?", "?", "?", "?"),  # ARABIC QUESTION MARK
        "\u0621": ("'", "'", "ʾ", "'"),  # ARABIC LETTER HAMZA
        "\u0622": ("»", "_a", "ā", "ā"),  # ARABIC LETTER ALEF WITH MADDA ABOVE
        "\u0623": (">", "_a", "ā", "ā"),  # ARABIC LETTER ALEF WITH HAMZA ABOVE
        "\u0624": ("&", "'", "ʾ", "'"),  # ARABIC LETTER WAW WITH HAMZA ABOVE
        "\u0625": ("<", "_a", "ā", "ā"),  # ARABIC LETTER ALEF WITH HAMZA BELOW
        "\u0626": ("}", "'", "ʾ", "y"),  # ARABIC LETTER YEH WITH HAMZA ABOVE
        "\u0627": ("A", "_a", "ā", "ā"),  # ARABIC LETTER ALEF
        "\u0628": ("b", "b", "b", "b"),  # ARABIC LETTER BEH
        "\u0629": ("p", "=t", "ŧ", "t"),  # ARABIC LETTER TEH MARBUTA
        "\u062a": ("t", "t", "t", "t"),  # ARABIC LETTER TEH
        "\u062b": ("v", "_t", "ṯ", "th"),  # ARABIC LETTER THEH
        "\u062c": ("j", "j", "ǧ", "j"),  # ARABIC LETTER JEEM
        "\u062d": ("H", "*h", "ḥ", "ḥ"),  # ARABIC LETTER HAH
        "\u062e": ("x", "_h", "ḫ", "kh"),  # ARABIC LETTER KHAH
        "\u062f": ("d", "d", "d", "d"),  # ARABIC LETTER DAL
        "\u0630": ("*", "_d", "ḏ", "dh"),  # ARABIC LETTER THAL
        "\u0631": ("r", "r", "r", "r"),  # ARABIC LETTER REH
        "\u0632": ("z", "z", "z", "z"),  # ARABIC LETTER ZAIN
        "\u0633": ("s", "s", "s", "s"),  # ARABIC LETTER SEEN
        "\u0634": ("$", "^s", "š", "sh"),  # ARABIC LETTER SHEEN
        "\u0635": ("S", "*s", "ṣ", "ṣ"),  # ARABIC LETTER SAD
        "\u0636": ("D", "*d", "ḍ", "ḍ"),  # ARABIC LETTER DAD
        "\u0637": ("T", "*t", "ṭ", "ṭ"),  # ARABIC LETTER TAH
        "\u0638": ("Z", "*z", "ẓ", "ẓ"),  # ARABIC LETTER ZAH
        "\u0639": ("E", "`", "ʿ", "`"),  # ARABIC LETTER AIN
        "\u063a": ("g", "*g", "ġ", "gh"),  # ARABIC LETTER GHAIN
        "\u0640": ("_", "", "", ""),  # ARABIC TATWEEL
        "\u0641": ("f", "f", "f", "f"),  # ARABIC LETTER FEH
        "\u0642": ("q", "*k", "ḳ", "q"),  # ARABIC LETTER QAF
        "\u0643": ("k", "k", "k", "k"),  # ARABIC LETTER KAF
        "\u0644": ("l", "l", "l", "l"),  # ARABIC LETTER LAM
        "\u0645": ("m", "m", "m", "m"),  # ARABIC LETTER MEEM
        "\u0646": ("n", "n", "n", "n"),  # ARABIC LETTER NOON
        "\u0647": ("h", "h", "h", "h"),  # ARABIC LETTER HEH
        "\u0648": ("w", "w", "w", "w"),  # ARABIC LETTER WAW
        "\u0649": ("Y", "/a", "á", "ā"),  # ARABIC LETTER ALEF MAKSURA
        "\u064a": ("y", "y", "y", "y"),  # ARABIC LETTER YEH
        "\u064b": ("F", "a*n", "aⁿ", "an"),  # ARABIC FATHATAN
        "\u064c": ("N", "u*n", "uⁿ", "un"),  # ARABIC DAMMATAN
        "\u064d": ("K", "i*n", "iⁿ", "in"),  # ARABIC KASRATAN
        "\u064e": ("a", "a", "a", "a"),  # ARABIC FATHA
        "\u064f": ("u", "u", "u", "u"),  # ARABIC DAMMA
        "\u0650": ("i", "i", "i", "i"),  # ARABIC KASRA
        "\u0651": ("~", "u", "u", "ūw"),  # ARABIC SHADDA
        "\u0652": ("o", "a", "a", "a"),  # ARABIC SUKUN
        "\u0653": ("^", "_a", "ā", "ā"),  # ARABIC MADDAH ABOVE
        "\u0654": ("#", "'", "ʾ", "ā"),  # ARABIC HAMZA ABOVE
        "\u0655": ("=", "'", "ʾ", "ā"),  # ARABIC HAMZA BELOW
        "\u0660": ("0", "0", "0", "0"),  # ARABIC INDIC DIGIT ZERO
        "\u0661": ("1", "1", "1", "1"),  # ARABIC INDIC DIGIT ONE
        "\u0662": ("2", "2", "2", "2"),  # ARABIC INDIC DIGIT TWO
        "\u0663": ("3", "3", "3", "3"),  # ARABIC INDIC DIGIT THREE
        "\u0664": ("4", "4", "4", "4"),  # ARABIC INDIC DIGIT FOUR
        "\u0665": ("5", "5", "5", "5"),  # ARABIC INDIC DIGIT FIVE
        "\u0666": ("6", "6", "6", "6"),  # ARABIC INDIC DIGIT SIX
        "\u0667": ("7", "7", "7", "7"),  # ARABIC INDIC DIGIT SEVEN
        "\u0668": ("8", "8", "8", "8"),  # ARABIC INDIC DIGIT EIGHT
        "\u0669": ("9", "9", "9", "9"),  # ARABIC INDIC DIGIT NINE
        "\u0670": ("`", "~a", "ã", ""),  # ARABIC LETTER SUPERSCRIPT ALEF
        "\u0671": ("{", "a", "a", "a"),  # ARABIC LETTER ALEF WASLA
        "\u06af": ("G", "g", "g", "g"),  # ARABIC LETTER GAF
        "\u06cc": ("J", "#y", "Y", "y"),  # ARABIC LETTER FARSI YEH
        "\u06d6": (
            "SlY",
            "*sl/a",
            "ṣlá",
            "ṣla",
        ),  # ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
        "\u06da": ("M", "#j", "J", "j"),  # ARABIC SMALL HIGH JEEM
        "\u06dc": (":", "#s", "S", "s"),  # ARABIC SMALL HIGH SEEN
        "\u06df": ("@", "0", "0", "0"),  # ARABIC SMALL HIGH ROUNDED ZERO
        "\u06e0": ('"', "0", "0", "0"),  # ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
        "\u06e2": ("[", "#m", "M", "M"),  # ARABIC SMALL HIGH MEEM ISOLATED FORM
        "\u06e3": (";", "#s", "S", "S"),  # ARABIC SMALL LOW SEEN
        "\u06e5": (",", "#w", "W", "W"),  # ARABIC SMALL WAW
        "\u06e6": (".", "#y", "Y", "Y"),  # ARABIC SMALL YEH
        "\u06e8": ("!", "#n", "N", "N"),  # ARABIC SMALL HIGH NOON
        "\u06ea": ("-", ".", ".", "."),  # ARABIC EMPTY CENTRE LOW STOP
        "\u06eb": ("+", ".", ".", "."),  # ARABIC EMPTY CENTRE HIGH STOP
        "\u06ec": ("%", ".", ".", "."),  # ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
        "\u06ed": ("]", "#m", "M", "M"),  # ARABIC SMALL LOW MEEM
    }
    """
    More Arabic transcriptions:

    *   column 1: `custom` [Quran-tanzil](http://tanzil.net/#1:1), slightly extended
    *   column 2: `ascii` resp. `latin` plus diacritics also known as betacode.
        We use a list compiled by
        [Peter Verkinderen](https://pverkind.github.io/betacodeTranscriber/js/betacode.js)
    *   column 4: `standard` (Library of Congress) (to-be filled).
        We use the
        [arabic romanization list of 2012](https://www.loc.gov/catdir/cpso/romanization/arabic.pdf)
        We refrain of from applying rules that cannot be computed without
        lexical/grammatical/dialectical knowledge of the arabic language.
    """

    arabicTransQuran = {}
    arabicTransAscii = {}
    arabicTransLatin = {}
    arabicTransStandard = {}

    for ara, (qur, asc, lat, std) in arabicTrans.items():
        arabicTransQuran[ara] = qur
        arabicTransAscii[ara] = asc
        arabicTransLatin[ara] = lat
        arabicTransStandard[ara] = std

    def quranFromArabic(word):
        return "".join(Transcription.arabicTransQuran.get(x, x) for x in word)

    def asciiFromArabic(word):
        return "".join(Transcription.arabicTransAscii.get(x, x) for x in word)

    def latinFromArabic(word):
        return "".join(Transcription.arabicTransLatin.get(x, x) for x in word)

    def standardFromArabic(word):
        return "".join(Transcription.arabicTransStandard.get(x, x) for x in word)

    def __init__(self):
        self.hebrew_consonants = {
            Transcription.hebrew_mapping[x] for x in Transcription.hebrew_cons
        }
        self.hebrew_consonants.add("\u05E9")
        self.hebrew_mappingi = dict(
            (v, k) for (k, v) in Transcription.hebrew_mapping.items() if k != ""
        )
        self.syriac_mappingi = dict(
            (v, k) for (k, v) in Transcription.syriac_mapping.items() if k != ""
        )
        # special treatment needed for nun hafukha,
        # since it is consists of two characters
        self.hebrew_mappingi["\u05C6"] = "ñ"
        self.hebrew_mappingi["\u0307"] = ""
        self.syriac_punctuation_trans = (
            "#\\",
            "=.",
            "=#",
            "=:",
            "=^",
            "=/",
            "=\\",
            "^:",
            "^\\",
        )
        self.syriac_punctuation_syc = tuple(
            Transcription.syriac_mapping[c] for c in self.syriac_punctuation_trans
        )

        arabicMap = dict(qur={}, asc={}, lat={}, std={})
        arabicMapI = dict(qur={}, asc={}, lat={}, std={})

        for u in sorted(Transcription.arabicTrans):
            (qur, asc, lat, std) = Transcription.arabicTrans[u]
            for k, v in zip(arabicMap.keys(), Transcription.arabicTrans[u]):
                arabicMap[k].setdefault(v, u)
                arabicMapI[k][u] = v

        self.arabicMap = arabicMap
        self.arabicMapI = arabicMapI

    def sycSplitPunc(self):
        pass

    def _comp(s):
        for d, c in Transcription.decomp.items():
            s = s.replace(d, c)
        return s

    def _decomp(s):
        for d, c in Transcription.decomp.items():
            s = s.replace(c, d)
        return s

    def suffix_and_finales(word):
        """
        Given an ETCBC transliteration, split it into the word material
        and the interword material that follows it (space, punctuation).
        Replace the last consonant of the word material by its final form, if applicable.

        Output a tuple with the modified word material and the interword material.

        Example:

        ``` python
        Transcription.suffix_and_finales('71T_H@>@95REY00')
        ```

        Output:

        ```
        ('71T_H@>@95REy', '00\n')
        ```

        Note that the `Y` has been replaced by `y`.
        """

        # first split the word proper from the suffix,
        # and add a space if there is no other suffix
        add_space = ""
        suffix = ""
        new_word = word
        if not word:
            return (new_word, suffix + add_space)
        lastch = new_word[-1]
        if lastch == "-" or lastch == "&":
            new_word = new_word[0:-1]
            suffix = lastch
        else:
            if len(new_word) >= 2:
                lastch = new_word[-1]
                llastch = new_word[-2]
                if llastch == "_" and (lastch == "P" or lastch == "S"):
                    new_word = new_word[0:-2]
                    suffix = " " + lastch + suffix + " "
            if len(new_word) >= 2:
                lastch = new_word[-1]
                llastch = new_word[-2]
                if llastch == "_" and (lastch == "N"):
                    new_word = new_word[0:-2]
                    suffix = " ñ" + suffix + " "
            if len(new_word) >= 2:
                lastch = new_word[-1]
                llastch = new_word[-2]
                if llastch == "0" and (lastch == "0" or lastch == "5"):
                    new_word = new_word[0:-2]
                    suffix = (" " if lastch == "5" else "") + llastch + lastch + suffix
                    add_space = "\n" if lastch == "0" else " "
        if suffix == "":
            add_space = " "
        elif suffix == "-":
            add_space = ""
            suffix = ""
        # second: replace consonants by their final forms when needed
        new_word = Transcription.trans_final_pat.sub(Transcription._map_final, new_word)
        return (new_word, suffix + add_space)

    def _map_final(m):
        return m.group(1) + m.group(2).lower() + m.group(3)

    def _map_hebrew(m):
        return Transcription.hebrew_mapping.get(m.group(1), m.group(1))

    def _map_syriac(m):
        return Transcription.syriac_mapping.get(m.group(1), m.group(1))

    def _swap_accent(m):
        return m.group(1) + m.group(3) + m.group(4) + m.group(2)

    def _remove_accent(m):
        return "00" if m.group(1) == "00" else "05" if m.group(1) == "05" else ""

    def _remove_point(m):
        return "00" if m.group(1) == "00" else "05" if m.group(1) == "05" else ""

    def _ph_simple(m):
        return "å" if m.group(1) in "āo" else ""

    # UNICODE normalization is harmful
    # if there is a combination of dagesh, vowel and accent.

    def suppress_space(word):
        """
        Given an ETCBC transliteration of a word,
        match the end of the word for punctuation and spacing characters
        (`sof pasuq`, `paseq`, `nun hafukha`, `setumah`, `petuhah`, space, no-space)

        Example:

        ``` python
        Transcription.suppress_space('B.:&')
        Transcription.suppress_space('B.@R@74>')
        Transcription.suppress_space('71T_H@>@95REY00')
        ```

        Output:

        ```
        <re.Match object; span=(3, 4), match='&'>
        None
        <re.Match object; span=(13, 15), match='00'>
        ```
        """

        return Transcription.noorigspace.search(word)

    def to_etcbc_v(word):
        """
        Given an ETCBC transliteration of a fully pointed word,
        strip all the non-vowel pointing (i.e. the accents).

        Example:

        ``` python
        Transcription.to_etcbc_v('HAC.@MA73JIm')
        ```

        Output:

        ```
        HAC.@MAJIm
        ```
        """

        return Transcription.remove_accent_pat.sub(Transcription._remove_accent, word)

    def to_etcbc_c(word):
        """
        Given an ETCBC transliteration of a fully pointed word,
        strip everything except the consonants.
        Punctuation will also be stripped.

        Example:

        ``` python
        Transcription.to_etcbc_c('HAC.@MA73JIm')
        ```

        Output:

        ```
        H#MJM
        ```

        Note that the pointed shin (`C`) is replaced by an unpointed one (`#`).
        """

        word = Transcription.remove_point_pat.sub(Transcription._remove_point, word)
        word = Transcription.remove_psn_pat.sub(
            "00", word
        )  # remove nun hafukha, setumah, petuhah at the end of a verse
        word = Transcription.remove_psq_pat.sub(
            " ", word
        )  # replace paseq with attached spaces by single space
        word = word.upper()  # no final forms of consonants
        return Transcription.shin_pat.sub("#", word)

    def to_hebrew(word):
        """
        Given a transliteration of a fully pointed word,
        produce the word in UNICODE Hebrew.
        Care will be taken that vowel pointing will be added to consonants
        before accent pointing.

        Example:

        ``` python
        Transcription.to_hebrew('HAC.@MA73JIm')
        ```

        Output:

        ```
        הַשָּׁמַ֖יִם
        ```
        """

        word = Transcription.swap_accent_pat.sub(Transcription._swap_accent, word)
        return Transcription.trans_hebrew_pat.sub(Transcription._map_hebrew, word)

    def to_hebrew_v(word):
        """
        Given a transliteration of a fully pointed word,
        produce the word in UNICODE Hebrew, but without the accents.

        Example:

        ``` python
        Transcription.to_hebrew_v('HAC.@MA73JIm')
        ```

        Output:

        ```
        הַשָּׁמַיִם
        ```
        """

        return Transcription.trans_hebrew_pat.sub(
            Transcription._map_hebrew, Transcription.to_etcbc_v(word)
        )

    def to_hebrew_c(word):
        """
        Given a transliteration of a fully pointed word,
        produce the word in UNICODE Hebrew, but without the pointing.

        Example:

        ``` python
        Transcription.to_hebrew_c('HAC.@MA73JIm')
        ```

        Output:

        ```
        השמימ
        ```

        Note that final consonant forms are not being used.
        """

        return Transcription.trans_hebrew_pat.sub(
            Transcription._map_hebrew, Transcription.to_etcbc_c(word)
        )

    def to_hebrew_x(word):
        """
        Given a transliteration of a fully pointed word,
        produce the word in UNICODE Hebrew, but without the pointing.
        Vowel pointing and accent pointing will be applied in the order given
        by the input word.

        Example:

        ``` python
        Transcription.to_hebrew_x('HAC.@MA73JIm')
        ```

        Output:

        ```
        הַשָּׁמַ֖יִם
        ```
        """

        return Transcription.trans_hebrew_pat.sub(Transcription._map_hebrew, word)

    def ph_simplify(pword):
        """
        Given a phonological transliteration of a fully pointed word,
        produce a more coarse phonological transliteration.

        Example:

        ``` python
        Transcription.ph_simplify('ʔᵉlōhˈîm')
        Transcription.ph_simplify('māqˈôm')
        Transcription.ph_simplify('kol')
        ```

        Output:

        ```
        ʔlōhîm
        måqôm
        kål
        ```

        Note that the simplified version transliterates the `qamets gadol` and `qatan`
        to the same character.
        """

        return Transcription.ph_simple_pat.sub(Transcription._ph_simple, pword)

    def from_hebrew(self, word):
        """
        Given a fully pointed word in UNICODE Hebrew,
        produce the word in ETCBC transliteration.

        Example:

        ``` python
        tr.from_hebrew('הָאָֽרֶץ׃')
        ```

        Output:

        ```
        H@>@95REy00
        ```
        """

        return "".join(
            self.hebrew_mappingi.get(x, x) for x in Transcription._comp(word)
        )

    def to_syriac(self, word):
        """
        Given a word in ETCBC transliteration,
        produce the word in UNICODE Syriac.

        Example:

        ``` python
        tr.to_syriac('MKSJN')
        ```

        Output:

        ```
        ܡܟܣܝܢ
        ```
        """

        return Transcription.trans_syriac_pat.sub(Transcription._map_syriac, word)

    def from_syriac(self, word):
        """
        Given a word in UNICODE Syriac,
        produce the word in ETCBC transliteration.

        Example:

        ``` python
        tr.from_syriac('ܡܟܣܝܢ')
        ```

        Output:

        ```
        MKSJN
        ```
        """

        return "".join(self.syriac_mappingi.get(x, x) for x in word)

    def can_to_syriac(self, word):
        return all(
            candidate in Transcription.syriac_mapping
            for candidate in Transcription.trans_syriac_pat.findall(word)
            if candidate != " "
        )

    def can_from_syriac(self, word):
        return all(c in self.syriac_mappingi for c in word if c != " ")

    def to_ugaritic(word):
        """
        Given a word in transliteration,
        produce the word in UNICODE Ugaritic.

        kṯbx
        𐎋𐎘𐎁x


        Example:

        ``` python
        Transcription.to_ugaritic('kṯbx')
        ```

        Output:

        ```
        𐎋𐎘𐎁x
        ```
        """

        return "".join(Transcription.ugaritic_mapping.get(x, x) for x in word)

    def from_ugaritic(word):
        """
        Given a word in UNICODE Ugaritic,
        produce the word in transliteration.

        Example:

        ``` python
        Transcription.from_ugaritic('𐎋𐎘𐎁x')
        ```

        Output:

        ```
        kṯbx
        ```
        """

        return "".join(Transcription.ugaritic_mappingi.get(x, x) for x in word)

    def to_arabic(word):
        """
        Given a word in transliteration,
        produce the word in UNICODE Arabic.

        Example:

        ``` python
        Transcription.to_arabic('bisomi')
        ```

        Output:

        ```
        بِسْمِ
        ```
        """

        return "".join(Transcription.arabic_mapping.get(x, x) for x in word)

    def from_arabic(word):
        """
        Given a word in UNICODE Arabic,
        produce the word in transliteration.

        Example:

        ``` python
        Transcription.from_arabic('بِسْمِ')
        ```

        Output:

        ```
        bisomi
        ```
        """

        return "".join(Transcription.arabic_mappingi.get(x, x) for x in word)

Conversion between UNICODE and various transcriptions.

Usage notes:

Invoke the transcription functionality as follows:

from tf.writing.transcription import Transcription

Some of the attributes and methods below are class attributes, others are instance attributes.

A class attribute aaa can be retrieved by saying

Transcription.aaa

To retrieve an instance attribute, you need an instance first, like

tr = Transcription()
tr.aaa

Class variables

var ara

The type of the None singleton.

var arabicTrans

More Arabic transcriptions:

column 1: custom Quran-tanzil, slightly extended
column 2: ascii resp. latin plus diacritics also known as betacode. We use a list compiled by Peter Verkinderen
column 4: standard (Library of Congress) (to-be filled). We use the arabic romanization list of 2012 We refrain of from applying rules that cannot be computed without lexical/grammatical/dialectical knowledge of the arabic language.

var arabicTransAscii

The type of the None singleton.

var arabicTransLatin

The type of the None singleton.

var arabicTransQuran

The type of the None singleton.

var arabicTransStandard

The type of the None singleton.

var arabic_mapping

Maps an Arabic transliteration character to UNICODE. This is the mapping used in the Quran representation on tanzil.net.

Example: beh

Transcription.syriac_mapping['b']

Output:

ب

Maps an Arabic letter in UNICODE to its transliteration

Example: beh transliteration

Transcription.syriac_mapping['ب']

Output:

var arabic_mappingi

The type of the None singleton.

var asc

The type of the None singleton.

var decomp

The type of the None singleton.

var hebrew_cons

The type of the None singleton.

var hebrew_mapping

Maps all ETCBC transliteration character combinations for Hebrew to UNICODE.

Example: sof-pasuq:

Transcription.hebrew_mapping['00']

Output:

׃

var lat

The type of the None singleton.

var noorigspace

The type of the None singleton.

var ph_simple_pat

The type of the None singleton.

var qur

The type of the None singleton.

var remove_accent_pat

The type of the None singleton.

var remove_point_pat

The type of the None singleton.

var remove_psn_pat

The type of the None singleton.

var remove_psq_pat

The type of the None singleton.

var shin_pat

The type of the None singleton.

var std

The type of the None singleton.

var swap_accent_pat

The type of the None singleton.

var syriac_mapping

Maps all ETCBC transliteration character combinations for Syriac to UNICODE.

Example: semkath-final:

Transcription.syriac_mapping['s']

Output:

ܤ

var syriac_mapping_pil

The type of the None singleton.

var syriac_mapping_simple

The type of the None singleton.

var trans_final_pat

The type of the None singleton.

var trans_hebrew_pat

The type of the None singleton.

var trans_syriac_pat

The type of the None singleton.

var ugaritic_mapping

The type of the None singleton.

var ugaritic_mappingi

Maps Ugaritic unicode characters to their conventional transliteration characters.

Unidentified characters:

x (damaged ?)
/ (alternative ?) only twice, in atypˤtba/r and xxxxl/d…
, (comma) only once in a very long word starting at 551 … km,ad …
< > (brackets marking uncertainty ?)
… (unreadable ?)
00a0 (non-breaking space)

Methods

def asciiFromArabic(word)

Expand source code Browse git

def asciiFromArabic(word):
    return "".join(Transcription.arabicTransAscii.get(x, x) for x in word)

def can_from_syriac(self, word)

Expand source code Browse git

def can_from_syriac(self, word):
    return all(c in self.syriac_mappingi for c in word if c != " ")

def can_to_syriac(self, word)

Expand source code Browse git

def can_to_syriac(self, word):
    return all(
        candidate in Transcription.syriac_mapping
        for candidate in Transcription.trans_syriac_pat.findall(word)
        if candidate != " "
    )

def from_arabic(word)

Expand source code Browse git

def from_arabic(word):
    """
    Given a word in UNICODE Arabic,
    produce the word in transliteration.

    Example:

    ``` python
    Transcription.from_arabic('بِسْمِ')
    ```

    Output:

    ```
    bisomi
    ```
    """

    return "".join(Transcription.arabic_mappingi.get(x, x) for x in word)

Given a word in UNICODE Arabic, produce the word in transliteration.

Example:

Transcription.from_arabic('بِسْمِ')

Output:

bisomi

def from_hebrew(self, word)

Expand source code Browse git

def from_hebrew(self, word):
    """
    Given a fully pointed word in UNICODE Hebrew,
    produce the word in ETCBC transliteration.

    Example:

    ``` python
    tr.from_hebrew('הָאָֽרֶץ׃')
    ```

    Output:

    ```
    H@>@95REy00
    ```
    """

    return "".join(
        self.hebrew_mappingi.get(x, x) for x in Transcription._comp(word)
    )

Given a fully pointed word in UNICODE Hebrew, produce the word in ETCBC transliteration.

Example:

tr.from_hebrew('הָאָֽרֶץ׃')

Output:

H@>@95REy00

def from_syriac(self, word)

Expand source code Browse git

def from_syriac(self, word):
    """
    Given a word in UNICODE Syriac,
    produce the word in ETCBC transliteration.

    Example:

    ``` python
    tr.from_syriac('ܡܟܣܝܢ')
    ```

    Output:

    ```
    MKSJN
    ```
    """

    return "".join(self.syriac_mappingi.get(x, x) for x in word)

Given a word in UNICODE Syriac, produce the word in ETCBC transliteration.

Example:

tr.from_syriac('ܡܟܣܝܢ')

Output:

MKSJN

def from_ugaritic(word)

Expand source code Browse git

def from_ugaritic(word):
    """
    Given a word in UNICODE Ugaritic,
    produce the word in transliteration.

    Example:

    ``` python
    Transcription.from_ugaritic('𐎋𐎘𐎁x')
    ```

    Output:

    ```
    kṯbx
    ```
    """

    return "".join(Transcription.ugaritic_mappingi.get(x, x) for x in word)

Given a word in UNICODE Ugaritic, produce the word in transliteration.

Example:

Transcription.from_ugaritic('𐎋𐎘𐎁x')

Output:

kṯbx

def latinFromArabic(word)

Expand source code Browse git

def latinFromArabic(word):
    return "".join(Transcription.arabicTransLatin.get(x, x) for x in word)

def ph_simplify(pword)

Expand source code Browse git

def ph_simplify(pword):
    """
    Given a phonological transliteration of a fully pointed word,
    produce a more coarse phonological transliteration.

    Example:

    ``` python
    Transcription.ph_simplify('ʔᵉlōhˈîm')
    Transcription.ph_simplify('māqˈôm')
    Transcription.ph_simplify('kol')
    ```

    Output:

    ```
    ʔlōhîm
    måqôm
    kål
    ```

    Note that the simplified version transliterates the `qamets gadol` and `qatan`
    to the same character.
    """

    return Transcription.ph_simple_pat.sub(Transcription._ph_simple, pword)

Given a phonological transliteration of a fully pointed word, produce a more coarse phonological transliteration.

Example:

Transcription.ph_simplify('ʔᵉlōhˈîm')
Transcription.ph_simplify('māqˈôm')
Transcription.ph_simplify('kol')

Output:

ʔlōhîm
måqôm
kål

Note that the simplified version transliterates the qamets gadol and qatan to the same character.

def quranFromArabic(word)

Expand source code Browse git

def quranFromArabic(word):
    return "".join(Transcription.arabicTransQuran.get(x, x) for x in word)

def standardFromArabic(word)

Expand source code Browse git

def standardFromArabic(word):
    return "".join(Transcription.arabicTransStandard.get(x, x) for x in word)

def suffix_and_finales(word)

Expand source code Browse git

def suffix_and_finales(word):
    """
    Given an ETCBC transliteration, split it into the word material
    and the interword material that follows it (space, punctuation).
    Replace the last consonant of the word material by its final form, if applicable.

    Output a tuple with the modified word material and the interword material.

    Example:

    ``` python
    Transcription.suffix_and_finales('71T_H@>@95REY00')
    ```

    Output:

    ```
    ('71T_H@>@95REy', '00\n')
    ```

    Note that the `Y` has been replaced by `y`.
    """

    # first split the word proper from the suffix,
    # and add a space if there is no other suffix
    add_space = ""
    suffix = ""
    new_word = word
    if not word:
        return (new_word, suffix + add_space)
    lastch = new_word[-1]
    if lastch == "-" or lastch == "&":
        new_word = new_word[0:-1]
        suffix = lastch
    else:
        if len(new_word) >= 2:
            lastch = new_word[-1]
            llastch = new_word[-2]
            if llastch == "_" and (lastch == "P" or lastch == "S"):
                new_word = new_word[0:-2]
                suffix = " " + lastch + suffix + " "
        if len(new_word) >= 2:
            lastch = new_word[-1]
            llastch = new_word[-2]
            if llastch == "_" and (lastch == "N"):
                new_word = new_word[0:-2]
                suffix = " ñ" + suffix + " "
        if len(new_word) >= 2:
            lastch = new_word[-1]
            llastch = new_word[-2]
            if llastch == "0" and (lastch == "0" or lastch == "5"):
                new_word = new_word[0:-2]
                suffix = (" " if lastch == "5" else "") + llastch + lastch + suffix
                add_space = "\n" if lastch == "0" else " "
    if suffix == "":
        add_space = " "
    elif suffix == "-":
        add_space = ""
        suffix = ""
    # second: replace consonants by their final forms when needed
    new_word = Transcription.trans_final_pat.sub(Transcription._map_final, new_word)
    return (new_word, suffix + add_space)

Given an ETCBC transliteration, split it into the word material and the interword material that follows it (space, punctuation). Replace the last consonant of the word material by its final form, if applicable.

    Output a tuple with the modified word material and the interword material.

    Example:

    ``` python
    Transcription.suffix_and_finales('71T_H@>@95REY00')
    ```

    Output:

    ```
    ('71T_H@>@95REy', '00

') ```

    Note that the <code>Y</code> has been replaced by <code>y</code>.

def suppress_space(word)

Expand source code Browse git

def suppress_space(word):
    """
    Given an ETCBC transliteration of a word,
    match the end of the word for punctuation and spacing characters
    (`sof pasuq`, `paseq`, `nun hafukha`, `setumah`, `petuhah`, space, no-space)

    Example:

    ``` python
    Transcription.suppress_space('B.:&')
    Transcription.suppress_space('B.@R@74>')
    Transcription.suppress_space('71T_H@>@95REY00')
    ```

    Output:

    ```
    <re.Match object; span=(3, 4), match='&'>
    None
    <re.Match object; span=(13, 15), match='00'>
    ```
    """

    return Transcription.noorigspace.search(word)

Given an ETCBC transliteration of a word, match the end of the word for punctuation and spacing characters (sof pasuq, paseq, nun hafukha, setumah, petuhah, space, no-space)

Example:

Transcription.suppress_space('B.:&')
Transcription.suppress_space('B.@R@74>')
Transcription.suppress_space('71T_H@>@95REY00')

Output:

<re.Match object; span=(3, 4), match='&'>
None
<re.Match object; span=(13, 15), match='00'>

def sycSplitPunc(self)

Expand source code Browse git

def sycSplitPunc(self):
    pass

def to_arabic(word)

Expand source code Browse git

def to_arabic(word):
    """
    Given a word in transliteration,
    produce the word in UNICODE Arabic.

    Example:

    ``` python
    Transcription.to_arabic('bisomi')
    ```

    Output:

    ```
    بِسْمِ
    ```
    """

    return "".join(Transcription.arabic_mapping.get(x, x) for x in word)

Given a word in transliteration, produce the word in UNICODE Arabic.

Example:

Transcription.to_arabic('bisomi')

Output:

بِسْمِ

def to_etcbc_c(word)

Expand source code Browse git

def to_etcbc_c(word):
    """
    Given an ETCBC transliteration of a fully pointed word,
    strip everything except the consonants.
    Punctuation will also be stripped.

    Example:

    ``` python
    Transcription.to_etcbc_c('HAC.@MA73JIm')
    ```

    Output:

    ```
    H#MJM
    ```

    Note that the pointed shin (`C`) is replaced by an unpointed one (`#`).
    """

    word = Transcription.remove_point_pat.sub(Transcription._remove_point, word)
    word = Transcription.remove_psn_pat.sub(
        "00", word
    )  # remove nun hafukha, setumah, petuhah at the end of a verse
    word = Transcription.remove_psq_pat.sub(
        " ", word
    )  # replace paseq with attached spaces by single space
    word = word.upper()  # no final forms of consonants
    return Transcription.shin_pat.sub("#", word)

Given an ETCBC transliteration of a fully pointed word, strip everything except the consonants. Punctuation will also be stripped.

Example:

Transcription.to_etcbc_c('HAC.@MA73JIm')

Output:

H#MJM

Note that the pointed shin (C) is replaced by an unpointed one (#).

def to_etcbc_v(word)

Expand source code Browse git

def to_etcbc_v(word):
    """
    Given an ETCBC transliteration of a fully pointed word,
    strip all the non-vowel pointing (i.e. the accents).

    Example:

    ``` python
    Transcription.to_etcbc_v('HAC.@MA73JIm')
    ```

    Output:

    ```
    HAC.@MAJIm
    ```
    """

    return Transcription.remove_accent_pat.sub(Transcription._remove_accent, word)

Given an ETCBC transliteration of a fully pointed word, strip all the non-vowel pointing (i.e. the accents).

Example:

Transcription.to_etcbc_v('HAC.@MA73JIm')

Output:

HAC.@MAJIm

def to_hebrew(word)

Expand source code Browse git

def to_hebrew(word):
    """
    Given a transliteration of a fully pointed word,
    produce the word in UNICODE Hebrew.
    Care will be taken that vowel pointing will be added to consonants
    before accent pointing.

    Example:

    ``` python
    Transcription.to_hebrew('HAC.@MA73JIm')
    ```

    Output:

    ```
    הַשָּׁמַ֖יִם
    ```
    """

    word = Transcription.swap_accent_pat.sub(Transcription._swap_accent, word)
    return Transcription.trans_hebrew_pat.sub(Transcription._map_hebrew, word)

Given a transliteration of a fully pointed word, produce the word in UNICODE Hebrew. Care will be taken that vowel pointing will be added to consonants before accent pointing.

Example:

Transcription.to_hebrew('HAC.@MA73JIm')

Output:

הַשָּׁמַ֖יִם

def to_hebrew_c(word)

Expand source code Browse git

def to_hebrew_c(word):
    """
    Given a transliteration of a fully pointed word,
    produce the word in UNICODE Hebrew, but without the pointing.

    Example:

    ``` python
    Transcription.to_hebrew_c('HAC.@MA73JIm')
    ```

    Output:

    ```
    השמימ
    ```

    Note that final consonant forms are not being used.
    """

    return Transcription.trans_hebrew_pat.sub(
        Transcription._map_hebrew, Transcription.to_etcbc_c(word)
    )

Given a transliteration of a fully pointed word, produce the word in UNICODE Hebrew, but without the pointing.

Example:

Transcription.to_hebrew_c('HAC.@MA73JIm')

Output:

השמימ

Note that final consonant forms are not being used.

def to_hebrew_v(word)

Expand source code Browse git

def to_hebrew_v(word):
    """
    Given a transliteration of a fully pointed word,
    produce the word in UNICODE Hebrew, but without the accents.

    Example:

    ``` python
    Transcription.to_hebrew_v('HAC.@MA73JIm')
    ```

    Output:

    ```
    הַשָּׁמַיִם
    ```
    """

    return Transcription.trans_hebrew_pat.sub(
        Transcription._map_hebrew, Transcription.to_etcbc_v(word)
    )

Given a transliteration of a fully pointed word, produce the word in UNICODE Hebrew, but without the accents.

Example:

Transcription.to_hebrew_v('HAC.@MA73JIm')

Output:

הַשָּׁמַיִם

def to_hebrew_x(word)

Expand source code Browse git

def to_hebrew_x(word):
    """
    Given a transliteration of a fully pointed word,
    produce the word in UNICODE Hebrew, but without the pointing.
    Vowel pointing and accent pointing will be applied in the order given
    by the input word.

    Example:

    ``` python
    Transcription.to_hebrew_x('HAC.@MA73JIm')
    ```

    Output:

    ```
    הַשָּׁמַ֖יִם
    ```
    """

    return Transcription.trans_hebrew_pat.sub(Transcription._map_hebrew, word)

Given a transliteration of a fully pointed word, produce the word in UNICODE Hebrew, but without the pointing. Vowel pointing and accent pointing will be applied in the order given by the input word.

Example:

Transcription.to_hebrew_x('HAC.@MA73JIm')

Output:

הַשָּׁמַ֖יִם

def to_syriac(self, word)

Expand source code Browse git

def to_syriac(self, word):
    """
    Given a word in ETCBC transliteration,
    produce the word in UNICODE Syriac.

    Example:

    ``` python
    tr.to_syriac('MKSJN')
    ```

    Output:

    ```
    ܡܟܣܝܢ
    ```
    """

    return Transcription.trans_syriac_pat.sub(Transcription._map_syriac, word)

Given a word in ETCBC transliteration, produce the word in UNICODE Syriac.

Example:

tr.to_syriac('MKSJN')

Output:

ܡܟܣܝܢ

def to_ugaritic(word)

Expand source code Browse git

def to_ugaritic(word):
    """
    Given a word in transliteration,
    produce the word in UNICODE Ugaritic.

    kṯbx
    𐎋𐎘𐎁x


    Example:

    ``` python
    Transcription.to_ugaritic('kṯbx')
    ```

    Output:

    ```
    𐎋𐎘𐎁x
    ```
    """

    return "".join(Transcription.ugaritic_mapping.get(x, x) for x in word)

Given a word in transliteration, produce the word in UNICODE Ugaritic.

kṯbx 𐎋𐎘𐎁x

Example:

Transcription.to_ugaritic('kṯbx')

Output:

𐎋𐎘𐎁x

Transcription

Character tables and fonts

hbo Hebrew

syc Syriac

ara Arabic

grc Greek

akk Akkadian

uga Ugaritic

cld Neo Aramaic

Classes

Class variables

Methods

`hbo` Hebrew

`syc` Syriac

`ara` Arabic

`grc` Greek

`akk` Akkadian

`uga` Ugaritic

`cld` Neo Aramaic