TEI Lex-0

— A baseline encoding for lexicographic data

3. Entries

An <entry> is a basic reference unit in a dictionary: it groups together all the information related to a particular lemma. For instance:

    <entry xml:id="OALD.competitor"
     type="mainEntryxml:lang="en">
     <form type="lemma">
      <orth>competitor</orth>
      <hyph>com|peti|tor</hyph>
      <pron>k@m"petit@(r)</pron>
     </form>
     <gramGrp>
      <gram type="pos">n</gram>
     </gramGrp>
     <sense xml:id="OALD.competitor.1">
      <def>person who competes.</def>
     </sense>
    </entry>OALD (1974) 
    <entry xml:id="MM.RSSKJ.круна"
     xml:lang="sr">
     <form type="lemma">
      <orth>кру̏на</orth>
     </form>
     <etym>(<cit type="etymonxml:lang="de">
       <lang norm="dexml:lang="sr">нем.</lang>
       <form>
        <orth>Krone</orth>
       </form>
      </cit>
      <pc>,</pc>
      <cit type="etymonxml:lang="la">
       <lbl xml:lang="sr">из</lbl>
       <lang expand="латинскиnorm="la">лат.</lang>
      </cit>)</etym>
     <sense xml:id="MM.RSSKJ.круна.1">
      <num>1.</num>
      <sense xml:id="MM.RSSKJ.круна.1a">
       <num>а)</num>
       <def>украс на глави као знак владарске власти;</def>
      </sense>
      <sense xml:id="MM.RSSKJ.круна.1b">
       <num>б)</num>
       <usg type="meaningType"
        expand="фигуративноnorm="figurative">фиг.</usg>
       <def>владар.</def>
      </sense>
     </sense>
     <sense xml:id="MM.RSSKJ.круна.2">
      <num>2.</num>
      <def>новчана јединица у неким европским земљама, разне вредности.</def>
     </sense>
     <sense xml:id="MM.RSSKJ.круна.3">
      <num>3.</num>
      <def>део лиснатог дрвета изнад стабле (гране и лшће);</def>
      <xr type="synonymy">
       <lbl>син.</lbl>
       <ref type="sense">крошња</ref>
       <pc>.</pc>
      </xr>
     </sense>
     <sense xml:id="MM.RSSKJ.круна.4">
      <num>4.</num>
      <usg type="meaningType"
       expand="фигуративноnorm="figurative">фиг.</usg>
      <def>врхунац, највиши домет неког рада, забаве.</def>
     </sense>
    </entry>Московљевић (1990) 

3.1. Mandatory attributes

The TEI Lex-0 schema prescribes two mandatory attributes on <entry>:

  • xml:id uniquely identifies the element it is associated with;
  • xml:lang identifies the object language of the element it is associated with.

In XML, xml:lang is inherited from the immediately enclosing element or from its closest ancestor that has this attribute. This means that in XML not every element needs to have the xml:lang attribute.

TEI Lex-0 recommends that xml:lang be attached to so-called container elements (such as <entry> and <cit>) rather than individual <form> elements.

In addition, TEI Lex-0 privileges <entry> as the dictionary’s central textual component by requiring both a unique identifier (xml:id) as well as xml:lang.

    Naming your languages

    xml:lang identifies the object language of the element it is associated with. The language ‘tag’ (i.e. the value of this attribute) must follow IETF BCP 47, the Internet Engineering Task Force's best-practice document outlining standard identifiers for labeling language content. To learn more about what language tag is appropriate for your project, check out W3C's useful resource on choosing language tags.

    If the language or language variety you are working on is not covered by BCP 47, make sure to follow the syntax of Private Use Tags described in BCP 47 Section 2.2.7 when creating one. Do this only if you are absolutely certain that no standard tag exists for your object language.

    If you have created a "private" language tag, you can validate it (in terms of its structural well-formedness and validity) using the BCP 47 validator.

    Language tags containing private-use subtags should be documented in the TEI header, specifically using one or more <language> elements grouped under <langUsage> inside <profileDesc>:

    <profileDesc>
     <langUsage>
      <language ident="mix"
       role="objectLanguage">Mixtepec Mixtec</language>
      <language ident="mix-x-YCNY"
       role="objectLanguage">Yucanany Mixtec</language>
     </langUsage>
    </profileDesc>

3.2. Grammatical properties

Grammatical properties of lexical entries should be specified in entry/gramGrp/gram. This <gram> element will typically specify the part-of-speech of the entry:

    <entry xml:lang="entype="mainEntry"
     xml:id="on">
     <form type="lemma">
      <orth>on</orth>
     </form>
     <gramGrp>
      <gram type="pos">prep</gram>
     </gramGrp>
     <!--...-->
    </entry>

Notes:

  1. Grammatical properties of the entry as a whole should not be specified in entry/form[@type="lemma"]/gramGrp.
  2. entry/form/gramGrp should be used only if a particular form (a dialectal variant, for instance) has different grammatical properties from the lemma; or to indicate the grammatical properties of the inflected form which clearly deviate from the lemma.
  3. For entries which group grammatical homonyms inside single entries (e.g. in English dictionaries which do not have separate entries for conversion pairs of nouns and verbs, such as run or aid see the discussion under Nested entries vs. multiple-senses.

3.2.1. Typology of gram

The TEI Guidelines define:

  • seven specific elements which can be used to mark up particular grammatical properties:<case>, <gen> (for gender), <iType> (for inflection type), <mood>, <number>, <per> (for person) and <tns> (for tense); and
  • one general element (<gram>) which can be used to encode different kinds of grammatical properties.

The Guidelines themselves do not explain the reasoning behind having two different mechanisms for encoding the same kind of information. The two mechanisms are treated as fully interchangeable: see, for instance, the first two examples in Section 9.3.2.

While it is perfectly understandable why marking up grammatical information using a number of specific, granular elements can be considered desirable, the current situation is less than perfect:

  • if both <pos>prep</pos> and <gram type="pos">prep</gram> are possible, and if both mean exactly the same thing, the choice about how to encode grammatical information will always be partially arbitrary;
  • the specific grammatical elements in TEI cover some important grammatical categories, but are certainly not exhaustive: for instance, Slavic dictionaries will, as a rule, indicate aspect (imperfective or perfective) as the defining grammatical property of verbs, yet there is no specific element for: <aspect> in TEI.
  • if there are no specific elements for every possible grammatical category, mixing specific and general elements (for instance <pos>v.</pos> and <gram type="aspect">imperf.</gram> within the same entry and/or dictionary will most likely further complicate data processing and data interoperability.

Considering the goals of TEI Lex-0 to serve as a common baseline and target format for transforming and comparing different lexical resources, we have decided to do away with the specific elements for grammatical properties. Instead, we recommend the use of typed <gram> elements. This is a decision that wasn't taken lightly and one which solicited a great deal of discussion. It goes without saying that TEI itself will continue to support both mechanisms and that an XSLT transformation from <pos>prep</pos> to <gram type="pos">prep</gram> for those who want to convert their dictionaries to TEI Lex-0 would be easily accomplished.

The following table shows a mapping between the specific TEI elements and the typed <gram> elements in TEI Lex-0:

Mapping between specific elements in TEI and the generalized mechanism in TEI Lex-0
TEITEI Lex-0
<pos>n.</pos><gram type="pos">n.</gram>
<case>acc.</case><gram type="case">acc.</gram>
<gen>f.</gen><gram type="gender">f.</gram>
<iType>7</iType><gram type="inflectionType">7</gram>
<mood>indic.</mood><gram type="mood">indic.</gram>
<number>sg.</number><gram type="number">sg.</gram>
<per>3rd</per><gram type="person">3rd</gram>
<tns>aorist</tns><gram type="tense">aorist</gram>
<colloc>de</tns><gram type="colloc">de</gram>
-<gram type="aspect">imperf.</gram>
-<gram type="valency">intr.</gram>
-<gram type="construction">[+conj.]</gram>
-<gram type="degree">comp.</gram>

Note: See also next section on Collocates.

TEI5 is missing a specific element for encoding the grammatical aspect of verbs (for values such as perfective, imperfective) and valency (for values such as transitive, intransitive, reflexive, and impersonal). TEI Lex-0 is therefore introducing two suggested grammatical types: gram[@type="aspect"] and gram[@type="valency"]for encoding such values in dictionaries.

The attribute values for gram[@type] are a semi-closed list: this means that we will discuss and adopt additional values as demonstrated by examples from dictionaries that are encoded by members of our community.

If your dictionary has grammatical labels that do not fit into the above categories, do let us know by filing a ticket on GitHub.

3.2.2. Collocates

The TEI Guidelines define a specific element <colloc> (collocate) for marking up "any sequence of words that co-occur with the headword with significant frequency." The prototypical example from the Guidelines is this:
    <entry>
     <form>
      <orth>médire</orth>
     </form>
     <gramGrp>
      <colloc>de</colloc>
     </gramGrp>
    </entry>
In line with the simplification of the elements used to describe grammatical properties in dictionaries, TEI Lex-0 recommends the use of <gram type="collocate"></gram> to encode these phenomena, i.e.: >
    <entry xml:lang="frxml:id="DDLF.médire">
     <form type="lemma">
      <orth>médire</orth>
     </form>
     <gramGrp>
      <gram type="collocate">de</gram>
     </gramGrp>
    </entry>
In TEI Lex-0, we make a distinction between purely lexical collocates (as in médire de) and various types of grammatical co-occurrences, differently referred to in the literature as rection, government, dependency etc. The suggested value for this type of grammatical co-occurrence in TEI Lex-0 is <gram type="construction"></gram>
    <gramGrp>
     <gram type="construction">[+ conj.]</gram>
    </gramGrp>

3.3. Deprecated entry-like elements

The current TEI Guidelines define five different container elements that may serve as grouping devices for entry-level lexical information:

  • <entry>: contains a single structured entry in any kind of lexical resource, such as a dictionary or lexicon.
  • <entryFree>: contains a single unstructured entry in any kind of lexical resource, such as a dictionary or lexicon.
  • <superEntry>: groups a sequence of entries within any kind of lexical resource, such as a dictionary or lexicon which function as a single unit, for example a set of homographs.
  • <re>: (related entry) contains a dictionary entry for a lexical item related to the headword, such as a compound phrase or derived form, embedded inside a larger entry.
  • <hom>: (homograph) groups information relating to one homograph within an entry

These five elements can be used to distinguish different types of entries along two conceptual axes:

  • Structured vs. unstructured entries, i. e. entries that can readily be represented (in the lexical view) in the spirit of the TEI Guideline’s Dictionary Chapter (<entry>, <re>) vs. entries that for some reason violate the generic content model of <entry> or <re> and thus have to be represented more freely (<entryFree>). A third category in this respect are entries that exhibit a highly reduced amount of lexical content while this content is still of essentially entry-like nature (<superEntry>).
  • Containing vs. contained entries: entries may contain additional lexical information that can be conceived as an additional dictionary entry in its own right. Specifically, <superEntry> may contain <entry>, and <entry> in turn may contain <re> to represent the embedding of lexical entries on three distinct levels. Due to <re> being allowed to be used recursively, the number of levels for representing entry-like lexical information inside other such blocks is effectively unrestricted. At the same time, two different mechanism can be used to create homographic entries: <superEntry> containing multiple <entry> elements; or <entry> containing multiple <hom> elements.

3.3.1. hom

Making a clear difference between a situation where an entry has to be split into two or more homonyms and one where these differences correspond to a semantic alternation is lexicographically difficult. Still, the main danger in keeping both possibilities in the representation of a lexical entry in a digital lexicon is to introduce a systematic structural ambiguity as to where the appropriate information is to be found. We thus deprecate <hom> altogether in the present recommendation and have this element replaced by the nested <entry> construct.

For instance, the following example from the TEI Guidelines:

    <entry>
     <form>
      <orth>bray</orth>
      <pron>breI</pron>
     </form>
     <hom>
      <gramGrp>
       <gram type="pos">n</gram>
      </gramGrp>
      <sense>
       <def>cry of an ass; sound of a trumpet.</def>
      </sense>
     </hom>
     <hom>
      <gramGrp>
       <gram type="pos">vt</gram>
       <subc>VP2A</subc>
      </gramGrp>
      <sense>
       <def>make a cry or sound of this kind.</def>
      </sense>
     </hom>
    </entry>

would in TEI Lex-0 be represented as:

    <entry type="mainEntryxml:id="bray"
     xml:lang="en">
     <form type="lemma">
      <orth>bray</orth>
      <pron>brel</pron>
     </form>
     <entry xml:id="bray_nxml:lang="en"
      type="homonymicEntry">
      <gramGrp>
       <gram type="pos">n</gram>
      </gramGrp>
      <sense xml:id="bray_n.1">
       <def>cry of an ass</def>
      </sense>
      <pc>;</pc>
      <sense xml:id="bray_n.2">
       <def>sound of a trumpet</def>
      </sense>
      <pc>.</pc>
     </entry>
     <entry xml:id="bray_vtxml:lang="en"
      type="homonymicEntry">
      <gramGrp>
       <gram type="pos">vt</gram>
       <gram type="inflectionType">VP2A</gram>
      </gramGrp>
      <sense xml:id="bray_vt.1">
       <def>make a cry or sound of this kind</def>
      </sense>
      <pc>.</pc>
     </entry>
    </entry>

In a similar fashion, consider this entry from the Dictionary of the Portuguese Language by Morais:

    <entry xml:id="MORAIS.1.DLP.JANTAR"
     type="mainEntryxml:lang="pt">
     <entry xml:id="MORAIS.1.DLP.JANTAR-vt"
      type="homonymicEntryxml:lang="pt">
      <form type="lemma">
       <orth>JANTAR</orth>
      </form>
      <metamark function="lemmaDelimiter">,</metamark>
      <gramGrp>
       <gram type="posnorm="VERB">v.</gram>
       <gram type="voice">at.</gram>
      </gramGrp>
      <sense xml:id="MORAIS.1.DLP.JANTAR.s.1">
       <def>comer ao meio dia , ou comer depois de almoçar.</def>
      </sense>
     </entry>
     <entry xml:id="MORAIS.1.DLP.JANTAR-n"
      type="homonymicEntryxml:lang="pt">
      <form type="lemma">
       <orth>JANTAR</orth>
      </form>
      <metamark function="lemmaDelimiter">,</metamark>
      <gramGrp>
       <gram type="posnorm="NOUN">ſ.</gram>
       <gram type="gen">m.</gram>
      </gramGrp>
      <sense xml:id="MORAIS.1.DLP.JANTAR.s.2">
       <def>a ſegunda das tres comidas regulares do dia, entre o almoço , e aceia , ou antes da merenda.</def>
      </sense>
      <pc>.</pc>
      <metamark function="senseDelimiter">§</metamark>
      <sense xml:id="MORAIS.1.DLP.JANTAR.s.3">
       <def>Porção de dinheiro , que as Villas , e Cidades davão aos Reis , quando hião de correição para ſuſtento de ſua comitiva</def>
      </sense>
      <pc>.</pc>
      <bibl type="attestation"
       source="#M._L._Monarchia_Luſitana">
       <title>M. Luſ.</title>
       <citedRange unit="volume">t. 5</citedRange>
       <citedRange unit="folium">f. 53</citedRange>
       <citedRange unit="chapter">cap. 27</citedRange>
      </bibl>
     </entry>
    </entry>Silva (1789) 

3.3.2. superEntry

By making <entry> recursive, TEI Lex-0 has eliminated the need for grouping entries with <superEntry>.

This is especially important for traditional root-based dictionaries, which start with the root as the main headword, followed by full-fledged lexicographic entries of derived headwords.

    <entry type="wordFamilyxml:lang="ar"
     xml:id="syj">
     <form type="root">
      <orth>سيج</orth>
     </form>
     <pc>:</pc>
     <!-- To fence (verb) -->
     <entry type="mainEntryxml:lang="ar"
      xml:id="syj1">
      <form type="lemma">
       <orth>سيّج</orth>
      </form>
      <sense xml:id="syj1_sense1">
       <cit type="example">
        <quote>الكرم</quote>
       </cit>
       <pc>:</pc>
       <def>جعل له سياجا</def>
      </sense>
      <pc>٠</pc>
     </entry>
     <!-- A fence (noun) -->
     <entry type="mainEntryxml:lang="ar"
      xml:id="syj2">
      <form type="lemma">
       <orth>السياج</orth>
      </form>
      <form type="inflected">
       <gramGrp>
        <gram type="numbervalue="plural">ج</gram>
       </gramGrp>
       <form type="variant">
        <orth>سيَاجات</orth>
       </form>
       <lbl>و</lbl>
       <form type="variant">
        <orth>أسْوِجة</orth>
       </form>
       <lbl>و</lbl>
       <form type="variant">
        <orth>أَسْوِجة</orth>
       </form>
       <lbl>و</lbl>
       <form type="variant">
        <orth>سُوج</orth>
       </form>
      </form>
      <pc>:</pc>
      <sense xml:id="syj2_sense1">
       <def>الحائط</def>
      </sense>
      <pc>||</pc>
      <sense xml:id="syj2_sense2">
       <def>ما أُحيط بهِ على شيءٍ كالكرم و النخل</def>
      </sense>
     </entry>
     <pc>٠</pc>
     <!-- A kind of fish -->
     <entry type="mainEntryxml:lang="ar"
      xml:id="syj3">
      <form type="lemma">
       <orth>السيْجان</orth>
      </form>
      <pc>(</pc>
      <usg type="domainvalue="animal">ح</usg>
      <pc>)</pc>
      <pc>:</pc>
      <sense xml:id="syj3_sense1">
       <def>نوع من السمك</def>
      </sense>
     </entry>
    </entry>Almonjid (2014) 
    <entry type="wordFamilyxml:lang="ar"
     xml:id="shahama">
     <form type="root">
      <orth>شهم</orth>
     </form>
     <pc>:</pc>
     <entry type="wordfamilyxml:lang="ar"
      xml:id="shahama1">
      <num>١ــ</num>
      <entry type="mainEntryxml:lang="ar"
       xml:id="shahama1_1">
       <form type="lemma">
        <orth>شَهَمَ</orth>
       </form>
       <form type="scheme">
        <orth>ـَ</orth>
       </form>
       <form type="inflected">
        <form type="variant">
         <orth>شَهْمًا</orth>
        </form>
        <lbl>و</lbl>
        <form type="variant">
         <orth>شُهُمًا</orth>
        </form>
       </form>
       <sense xml:id="shahama1_1_sense1">
        <cit type="example">
         <quote>الفرسَ</quote>
        </cit>
        <pc>:</pc>
        <def>زجره</def>
       </sense>
       <pc>||</pc>
       <lbl>و</lbl>
       <sense xml:id="shahama1_1_sense2">
        <cit type="example">
         <quote>ــ الرجُل</quote>
        </cit>
        <pc>:</pc>
        <def>افزعه</def>
       </sense>
      </entry>
      <pc>٠</pc>
      <entry type="mainEntryxml:lang="ar"
       xml:id="shahama1_2">
       <form type="lemma">
        <orth>اَلمشْهوم</orth>
       </form>
       <pc>٠:</pc>
       <sense xml:id="shahama1_2_sense1">
        <def>المذعور</def>
       </sense>
      </entry>
     </entry>
     <entry type="wordFamilyxml:lang="ar"
      xml:id="shahama2">
      <num>٢٠ ــ</num>
      <entry type="mainEntryxml:lang="ar"
       xml:id="shahama2_1">
       <form type="lemma">
        <orth>شَهُم</orth>
       </form>
       <form type="scheme">
        <orth>ـُـ</orth>
       </form>
       <form type="inflected">
        <form type="variant">
         <orth>شَهَامةً</orth>
        </form>
        <lbl>و</lbl>
        <form type="variant">
         <orth>شُهُومَةُُ</orth>
        </form>
       </form>
       <lbl>:</lbl>
       <sense xml:id="shahama2_1_sense1">
        <def> كان شهْمًا</def>
       </sense>
      </entry>
      <pc>٠</pc>
      <entry type="mainEntryxml:lang="ar"
       xml:id="shahama2_2">
       <form type="lemma">
        <orth>الشَهْم</orth>
       </form>
       <form type="inflected">
        <gramGrp>
         <gram type="numbervalue="plural">ج</gram>
        </gramGrp>
        <orth>شِهام</orth>
       </form>
       <pc>:</pc>
       <sense xml:id="shahama2_2_sense1">
        <def>الذكيّ الفؤاد</def>
       </sense>
       <pc>||</pc>
       <sense xml:id="shahama2_2_sense2">
        <def>السيِّد النافذ الحكم</def>
       </sense>
       <pc>||</pc>
       <sense xml:id="shahama2_2_sense3">
        <lbl>وــ</lbl>
        <form type="inflected">
         <gramGrp>
          <gram type="numbervalue="plural">ج</gram>
         </gramGrp>
         <orth>شُهُم</orth>
        </form>
        <pc>:</pc>
        <def>الفرس النشيط السريع القويّ</def>
       </sense>
      </entry>
      <pc>٠</pc>
      <entry type="mainEntryxml:lang="ar"
       xml:id="shahama2_3">
       <form type="lemma">
        <orth>اَلمَشْهُوم</orth>
       </form>
       <pc>*:</pc>
       <sense xml:id="shahama2_3_sense1">
        <def>الذكيّ الفؤاد</def>
       </sense>
      </entry>
     </entry>
     <entry type="wordFamilyxml:lang="ar"
      xml:id="shahama3">
      <num>٠٣ ــ</num>
      <entry type="mainEntryxml:lang="ar"
       xml:id="shahama3_1">
       <form type="lemma">
        <orth>الشَيْهَم</orth>
       </form>
       <form type="inflected">
        <gramGrp>
         <gram type="numbervalue="plural">ج</gram>
        </gramGrp>
        <orth>شَيَهِم</orth>
       </form>
       <pc>(</pc>
       <usg type="domainvalue="animal">ح</usg>
       <pc>)</pc>
       <sense xml:id="shahama3_1_sense1">
        <def>ذَكَر القنافذ</def>
       </sense>
      </entry>
      <pc>٠</pc>
      <entry type="mainEntryxml:lang="ar"
       xml:id="shahama3_2">
       <form type="lemma">
        <orth>الشَيْهَمَة</orth>
       </form>
       <pc>:</pc>
       <sense xml:id="shahama3_2_sense1">
        <def>العجوز</def>
       </sense>
      </entry>
     </entry>
    </entry>Almonjid (2014) 

See also Section on grammatical properties in senses.