Languages and scripts

2025-02-18

TEI/EpiDoc provides mechanisms for encoding languages and scripts (writing systems) as they relate to the contents of an EpiDoc file and to the text(s) described and transcribed therein. In doing so, we make use of Internet standards for the identification of these languages and scripts. This portion of the Guidelines addresses all relevant aspects.

Relevant element documentation (TEI):

TEI definition: langUsage ; EpiDoc-specific customization: langUsage
TEI definition: language ; EpiDoc-specific customization: language
TEI definition: textLang ; EpiDoc-specific customization: textLang
TEI definition: foreign ; EpiDoc-specific customization: foreign

Indicating languages and scripts used in an EpiDoc file

TEI and EpiDoc follow the best current practice outlined in the Network Working Group's RFC 5646: Tags for Identifying Languages, which establishes Internet-wide norms for identifying languages. The RFC and supporting documents define a syntax for creating short strings of characters (‘language tags’) that function as unique identifiers for any desired combination of language and script. These tags are composed of ‘subtags’ for the language itself, the writing system (script), and regional and dialectal variation. The RFC also establishes a process for registration and maintenance of these subtags by the Internet Assigned Numbers Authority (IANA).

A valid EpiDoc file must make use of subtags recorded in the IANA Language Subtag Registry. Many EpiDoc creators will already be familiar with some of these codes from other digital projects, for example:

Grek = Greek script
Latn = Latin script
en = English language (assumed to be in its standard script: Latn)
fr = French language (assumed to be in its standard script: Latn)
el = Modern Greek language (1453-; assumed to be in its standard script: Grek)
grc = Ancient Greek language (to 1453; assumed to be in its standard script: Grek)
grc-Latn = Ancient Greek language (to 1453), rendered in Latin script

When the IANA registry does not provide appropriate codes, an EpiDoc project may devise "private use subtags", so long as they are defined internally in the EpiDoc file (in language elements within langUsage, as in the example below) and conform syntactically to the specifications laid out in RFC 5646, sections 2.1: Syntax and 4.6: Considerations for Private Use Subtags. For example, the Campā Inscriptions team determined that the ancient Cham language and script represented in the inscriptions were substantively different from those identified by the two existing Cham language subtags (cja = Western Cham and cjm = Eastern Cham) and the associated script subtag (Cham). The private use tag "x-oldcam-latn-ci" was therefore invented and given the project-specific meaning "Old Cam language in Old Cam script transliterated in Latin characters." Whenever possible, EpiDoc projects and practitioners should undertake to register new subtags with the IANA for the benefit of others. The procedure for doing so is set out in RFC 5646, section 3.5.

<langUsage>
<language ident="ar">Arabic</language>
<language ident="cop">Coptic</language>
<language ident="egy-Egyd">Egyptian in Demotic script</language>
<language ident="egy-Egyh">Egyptian in Hieratic script</language>
<language ident="egy-Egyp">Egyptian Hieroglyphic</language>
<language ident="etr">Etruscan</language>
<language ident="el">Modern Greek</language>
<language ident="grc">Ancient Greek</language>
<language ident="grc-Latn">Ancient Greek written in Latin script</language>
<language ident="he">Hebrew</language>
<language ident="la">Latin</language>
<language ident="la-Grek">Latin written in Greek script</language>
</langUsage>
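
A private use tag can be declared in the same way as a registered one. The following is a minimal sketch, reusing the Campā Inscriptions tag and description quoted above; listing it beside a registered tag (here "en") is an assumption made for illustration and is not taken from that project's files.

```xml
<langUsage>
  <!-- Registered subtag from the IANA Language Subtag Registry -->
  <language ident="en">English</language>
  <!-- Private use tag: starts with "x", followed by hyphen-separated
       subtags of 1 to 8 alphanumeric characters each (RFC 5646, section 2.1) -->
  <language ident="x-oldcam-latn-ci">Old Cam language in Old Cam script transliterated in Latin characters</language>
</langUsage>
```

Because the tag carries no meaning outside the project, the prose content of the language element is the only place where a reader or processor can learn what it stands for.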

Character Encodings and Fonts

tba

Indicating the modern language and script used throughout the EpiDoc file

tba

Signaling the languages and scripts used in the texts and translations being presented

The primary and secondary language(s) of an inscription should be declared in the textLang element contained in the TEI header: the primary language should be given in the mainLang attribute, and any other languages should be listed, separated by spaces, in the otherLangs attribute. Note that the languages declared here should be only those languages or writing systems found in the text itself, not those used elsewhere in the edition. If the script differs from the default script of the language (e.g., Greek transliterated into Latin characters), this should also be indicated using the appropriate language codes, where available.

<textLang mainLang="grc" otherLangs="grc-Latn la">Inscription in ancient Greek with some words transcribed in Latin characters, and later annotation in Latin.</textLang>
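
For orientation, the sketch below shows one position the textLang element can occupy in the header: inside msContents within msDesc, which is a location the TEI content model allows. The title, repository, identifier, and language values are hypothetical placeholders and do not describe any real edition.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Hypothetical bilingual epitaph</title>
    </titleStmt>
    <publicationStmt>
      <p>Unpublished example.</p>
    </publicationStmt>
    <sourceDesc>
      <msDesc>
        <msIdentifier>
          <repository>Hypothetical Museum</repository>
          <idno>EX-0001</idno>
        </msIdentifier>
        <msContents>
          <!-- mainLang takes a single tag; otherLangs takes a space-separated list -->
          <textLang mainLang="la" otherLangs="grc">Latin, followed by a Greek version of the same text.</textLang>
        </msContents>
      </msDesc>
    </sourceDesc>
  </fileDesc>
</teiHeader>
```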

The language(s) of a given translation of a text should not be included in the textLang element; instead, they should be indicated with an xml:lang attribute on <div type="translation"> or on <div type="textpart"> elements nested within it.

<div type="translation" xml:lang="en">
<p>To the Divine Spirits. Quintus Rutilius Faustio lived for twenty-one years
(and) twenty-six days. (His) parents (made this) for (their) most beloved son.</p>
</div>
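
When an edition offers translations into more than one modern language, one possible arrangement is a separate translation div for each, distinguished by xml:lang. This is only a sketch: the English wording is abbreviated from the example above, and the French rendering is supplied here purely for illustration.

```xml
<div type="translation" xml:lang="en">
  <p>To the Divine Spirits.</p>
</div>
<div type="translation" xml:lang="fr">
  <!-- Illustrative French rendering of the same opening formula -->
  <p>Aux dieux Mânes.</p>
</div>
```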

Marking transitions in language and/or script in the text

A single language should be indicated as the default for a text by including it in the mainLang attribute on the textLang element. The default script for a given language is assumed unless otherwise indicated (e.g. Greek will be written in the Greek alphabet). When transitions between languages and/or scripts are limited in scope (e.g. a single word or a short phrase in a language differing from that declared in the mainLang attribute in the header), the word or words should be contained within a foreign element (see Multi-Language Texts). The language and/or script is identified with the xml:lang attribute. In the following brief example from the US Epigraphy Project, the inscription begins in Latin but transitions to Greek written in the Latin script:

<div type="edition">
<ab>
<lb n="1"/>Cum tu<unclear>i</unclear>s <foreign xml:lang="grc-Latn">pie zeses</foreign>.
</ab>
</div>

(Source)

If longer passages of the text are in different languages and/or scripts, the default language may be further specified with an xml:lang attribute on the <div type="edition">; sections in other languages/scripts should then be declared by adding an xml:lang attribute to the block-level containers of that text (e.g. ab, lg, seg, or <div type="textpart">). (Compare the examples given at Multi-Language Texts under point 2.) Shifts in script may be indicated in the same fashion: a hypothetical inscription that is entirely in Greek but includes Greek transliterated into the Roman alphabet would declare its default language ("grc") in the xml:lang attribute, and the transliterated passages would be marked with an xml:lang attribute with a value of "grc-Latn". It may also be desirable to indicate a change in hand when the script and/or language changes, using the handShift element, if such a shift is discernible; further details about the script can be included in the handNote elements of the manuscript description. Compare the following example from the US Epigraphy Project, which begins in Latin and then contains a translation of the text in Greek:

<div type="edition" xml:lang="la">
<ab>
  <lb n="1"/>Divi <persName>
   <name type="cognomen" key="Augustus">
    <expan>
     <abbr>Aug</abbr>
     <ex>usti</ex>
    </expan>
   </name>
  </persName> liberti
<lb n="2"/>
  <persName>
   <name type="cognomen" key="Calyx">Calycis</name>
  </persName>
  <persName>
   <name type="cognomen" key="Hygeia">Hygeiae</name>
  </persName>.
</ab>
<ab xml:lang="grc">
  <lb n="3"/>
  <handShift n="2"/> Θεοῦ Σεβαστοῦ
<lb n="4"/>ἀπελευθέρου
<lb n="5"/>
  <persName>
   <name type="cognomen" key="Calyx">Καλύκος</name>
  </persName> ἀπελευ
<lb n="6" break="no"/>θέρα <persName>
   <name type="cognomen" key="Hygeia">Ὑγεῖα</name>
  </persName>.
</ab>
</div>

(Source)

In the following example the default language of the edition is Arabic, and the Greek passages are marked with seg elements carrying xml:lang="grc":

<div xml:lang="ar" type="edition" xml:space="preserve">
<ab>
<lb n="1"/><supplied reason="lost">بـ</supplied>ـسم الـ<supplied reason="lost">ـلّه</supplied> <supplied reason="lost">الرّحمن</supplied> <supplied reason="lost">الرّحيم</supplied>
<lb n="2"/><seg xml:lang="grc"><supplied reason="lost">ἐν ὀνόματι τοῦ <expan>θ<ex>εο</ex>ῦ</expan> τοῦ ἐλεήμον</supplied><unclear>ος</unclear> <expan><ex>καὶ</ex></expan> <expan>φιλανθρ<supplied reason="lost">ώ</supplied><unclear>π</unclear><ex>ου</ex></expan>
<lb n="3"/><supplied reason="lost">οὐκ ἔστιν <expan>θ<ex>εὸ</ex>ς</expan> εἰ μὴ ὁ <expan>θ<ex>εὸ</ex>ς</expan> μόν</supplied><unclear>ος</unclear> <unclear>Μααμὲτ</unclear> <unclear>ἀπόστολος</unclear> <expan><unclear>Θ</unclear><ex>εο</ex></expan><unclear>ῦ</unclear></seg>
<lb n="4"/> لا إله إلّا اللّه <supplied reason="lost">وحده</supplied> <supplied reason="lost">لا</supplied> <supplied reason="lost">شريك</supplied> <supplied reason="lost">له</supplied>
<lb n="5"/>  لم يلد ولم يولـ<supplied reason="lost">ـد</supplied> <supplied reason="lost">ولم</supplied> <supplied reason="lost">يكن</supplied> <supplied reason="lost">له</supplied> <supplied reason="lost">كفؤا</supplied> <supplied reason="lost">أحد</supplied>
<lb n="6"/><gap reason="lost" extent="unknown" unit="character"/> <gap reason="illegible" quantity="14" unit="character"/>
<lb n="7"/><seg xml:lang="grc"><gap reason="lost" quantity="20" unit="character"/><supplied reason="lost"> εἰς </supplied><choice><reg><supplied reason="lost">τ</supplied>ὴν</reg><orig><supplied reason="lost">τ</supplied>ὶν</orig></choice> <choice><reg>ὀρθὴν</reg><orig>ὀρθὶν</orig></choice> πίστιν</seg>
<lb n="8"/> محمّد رسول اللّـ<supplied reason="lost">ـه</supplied> <supplied reason="lost">أرسله</supplied> <supplied reason="lost">بالهدى</supplied> <supplied reason="lost">ودين</supplied> <supplied reason="lost">الحقّ</supplied>
<lb n="9"/> عبد اللّه الوليد <supplied reason="lost">أمير</supplied> <supplied reason="lost">المؤمنين</supplied>
<lb n="10"/><seg xml:lang="grc"><supplied reason="lost">Ἀβδελλα Ἀλουλιδ Ἀμιρ Ἀλ</supplied><unclear>μουμνιν</unclear>
<lb n="11"/><supplied reason="lost">Ἀβδελλα υἱὸς Ἀβδελμα</supplied>λικ <expan>σύμτουλ<ex>ος</ex></expan></seg>
<lb n="12"/> هذ<supplied reason="omitted">ا</supplied> ممّا أمر به الأمير <supplied reason="lost">عبد</supplied> <supplied reason="lost">اللّه</supplied> <supplied reason="lost">بن</supplied> <supplied reason="lost">عبد</supplied> <supplied reason="lost">الملك</supplied>
<lb n="13"/>  في سنة <supplied reason="lost">تسع</supplied> <supplied reason="lost">وثمنين</supplied>
</ab>
</div>

(Source)
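
The hypothetical case mentioned earlier in this section, a text entirely in Greek in which one passage is written in Latin letters, might be sketched as follows. The wording of both lines is invented for illustration (it reuses the phrase from the foreign example above) and does not reproduce a real inscription.

```xml
<div type="edition" xml:lang="grc">
  <!-- Default language and script: Ancient Greek in Greek script -->
  <ab>
    <lb n="1"/>πίε ζήσῃς
  </ab>
  <!-- Same language, different script: Greek written in Latin characters -->
  <ab xml:lang="grc-Latn">
    <lb n="2"/>pie zeses
  </ab>
</div>
```

Only the script subtag changes between the two blocks, so a processor can still treat both as Ancient Greek while choosing a different font or search normalization for the second.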

Responsibility for this section

Gabriel Bodard, author
Tom Elliott, author
Scott DiGiulio, author

EpiDoc version: 9.7

Date: 2025-02-18
