TEI Lex-0

— A baseline encoding for lexicographic data

10. Patterns

10.1. Inheritance of xml:lang

Some elements in TEI Lex-0, such as <entry>, have a required xml:lang attribute; others, such as <form> or <quote>, do not. In general, TEI Lex-0, unlike TEI, recommends attaching xml:lang to so-called container elements (for instance, <entry> and <cit>) rather than to individual word forms or textual segments. An element without its own xml:lang inherits the language of its nearest ancestor that has one.
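
As a minimal sketch of this recommendation, the hypothetical entry below puts xml:lang on the <entry> and on a <cit> that holds a translation, while <form>, <orth> and <quote> carry no language attribute of their own. The identifiers, the type values and the words are illustrative assumptions and have not been validated against the TEI Lex-0 schema.

```xml
<!-- Hypothetical entry for illustration only -->
<entry xml:id="cat" xml:lang="en">
  <form type="lemma">
    <orth>cat</orth>                         <!-- inherits en from entry -->
  </form>
  <sense xml:id="cat.1">
    <cit type="example">
      <quote>The cat sat on the mat.</quote> <!-- inherits en from entry -->
    </cit>
    <cit type="translationEquivalent" xml:lang="fr">
      <form>
        <orth>chat</orth>                    <!-- inherits fr from cit -->
      </form>
    </cit>
  </sense>
</entry>
```

Here the lemma and the example quote take their language from <entry>, whereas the translation takes it from the enclosing <cit>, which overrides the language of the entry.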

TODO: Add some examples

So how can we extract all orthographic forms in a particular language? We can use an XPath expression like this: //orth[ancestor-or-self::*[@xml:lang][1][@xml:lang='en']].

This XPath expression identifies:

  • each orth element, regardless of where it is in the document (//)
  • but only if the element itself or one of its ancestors has the @xml:lang attribute ([ancestor-or-self::*[@xml:lang]])
  • of those candidates, we take only the nearest one, i.e. the element itself or its closest ancestor carrying @xml:lang ([1])
  • finally, we keep the orth element only if the @xml:lang value of that nearest element is 'en' ([@xml:lang='en'])

If your dictionary uses multiple language tags for one language (such as 'en', 'en-GB' and 'en-US') and you want to capture all language varieties with one XPath expression, you can use the XPath lang() function: //orth[ancestor-or-self::*[@xml:lang][1][lang('en')]].

While the predicate [@xml:lang='en'] will match only those elements whose xml:lang is exactly equal to 'en', the predicate with the function [lang('en')] will match all the elements whose language is tagged as either English (i.e. 'en') or one of its 'sublanguages' such as 'en-GB'.
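
To see how the two predicates differ, consider the following hypothetical fragment. The identifiers, the type values and the words are illustrative assumptions; the comments record the nearest xml:lang value that each orth element resolves to.

```xml
<!-- Hypothetical data for illustration only -->
<body>
  <entry xml:id="lift" xml:lang="en">
    <form type="lemma">
      <orth>lift</orth>            <!-- nearest xml:lang: en -->
    </form>
    <sense xml:id="lift.1">
      <cit type="translationEquivalent" xml:lang="fr">
        <form>
          <orth>ascenseur</orth>   <!-- nearest xml:lang: fr -->
        </form>
      </cit>
    </sense>
  </entry>
  <entry xml:id="lorry" xml:lang="en-GB">
    <form type="lemma">
      <orth>lorry</orth>           <!-- nearest xml:lang: en-GB -->
    </form>
  </entry>
</body>
```

Against this fragment, the expression with [@xml:lang='en'] selects only "lift", while the expression with [lang('en')] selects both "lift" and "lorry". Neither selects "ascenseur": although it sits inside an entry tagged 'en', its nearest element with @xml:lang is the <cit> tagged 'fr'. This is why the [1] step matters; without it, //orth[ancestor-or-self::*[@xml:lang='en']] would match "ascenseur" as well.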

If you are new to XPath, see the DARIAH-Campus tutorial "XPath for Dictionary Nerds".