Words and Lemmatization

2025-02-18

Tagging lexical words in the text and/or linking to lemmata for purposes of indexing or search.

TEI definition: w; EpiDoc-specific customization: w

Explicit markup of words (tokenization) and identification of their dictionary headwords (lemmatization) are both optional. Many projects simply leave these features unmarked, or rely on automated processes in search software to detect word-breaks and link to lemmatizing tools such as Morpheus. (The Papyrological Navigator, Perseus, and the TLG all use methods similar to this.)

To mark up lexical words explicitly in a papyrological or epigraphic text, however, each word in the text should be enclosed in a w element. (For ease of processing, inter-word spacing, punctuation and other features should be left outside this element, and if possible there should be no spaces or carriage returns within the w element.)

<w>maximo</w>
<w>tribunicia</w>
<w>potestate</w>
<num>XXIIII</num>
<w>imperatori</w>

(IRT: 22)

To record the lemma or dictionary headword of the word in question, the simplest solution is to enter the uninflected form in a lemma attribute, which may be used, for example, to generate the entries in a lexical index to the corpus.

<w lemma="ἵστημι">ἕστηκα</w>

A slightly more sophisticated solution is to enter in a lemmaRef attribute a URL or other URI that points to the entry for the word in question in a database or online dictionary. This solution allows for better disambiguation of homonymous words, for example, or linking to morphological and statistical information about the word.

<w lemmaRef="http://www.perseus.tufts.edu/hopper/morph?l=fero&amp;la=la">tulisti</w>
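
TEI defines both the lemma and the lemmaRef attribute on the w element, so the two approaches can in principle be combined, with the plain headword available for index generation and the URI available for disambiguation. The following is only a sketch under that assumption; whether a given project's schema and stylesheets make use of both attributes together is not stated here and should be checked. Note that an ampersand inside an attribute value must be escaped as &amp;amp; for the XML to be well-formed.

<w lemma="fero" lemmaRef="http://www.perseus.tufts.edu/hopper/morph?l=fero&amp;la=la">tulisti</w>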

Multiple lemmata

If a single w element contains more than one indexable word (e.g. "populusque" in Latin, "κἀκ" in Greek or "gudrotæn" and "hlkuþrs" in Old Danish), or if an incomplete word is ambiguous and could correspond to two or more lemmata (e.g. ΕΚΦΕΥΞ), then all lemmata that should be indexed may be listed in the lemma attribute, separated by spaces:

<w lemma="suus que">suisque</w>

<w lemma="ἔκφευξις ἐκφεύγω" part="I">ΕΚΦΕΥΞ</w>
<gap reason="lost" extent="unknown" unit="character"/>

<w lemma="Guð dróttin">gud drotæn</w>

<w lemma="helgu Guðs">hlku kuþrs</w>

Interaction with editorial markup

The w element may contain further editorial tags, such as expansions of abbreviations. Wherever possible the w element should remain intact (cf. Editorial restoration: Segmented or adjacent lacunae).
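
As an illustration of this nesting, the two lines below sketch an abbreviated word and a partly restored word, each kept inside a single w element. The words, lemmata and the division of the letters are hypothetical examples chosen for illustration, not readings taken from a particular inscription; the expan, abbr, ex and supplied elements are used as elsewhere in EpiDoc for abbreviations and restorations.

<w lemma="imperator"><expan><abbr>imp</abbr><ex>eratori</ex></expan></w>

<w lemma="potestas">potes<supplied reason="lost">tate</supplied></w>

In both cases the editorial tags sit inside the w element rather than splitting it, so the whole word still maps to one lemma for indexing.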