Friday, April 11, 2008

Citation schemes: empty content elements considered harmful

Classicists have, by and large, relied on standard, logical citation schemes to cite works of ancient literature. In the scheme of the Functional Requirements for Bibliographic Records (or FRBR), we could say that classicists have cited notional works using references that could then be applied to any manifestation or expression of that work.

In the print world, this practice has made it possible for scholars to apply a reference to different printed editions or translations of a work. As the internet becomes our library, this practice can turn references into machine-actionable entry points to the library (whether the reference is automatically discovered or manually cited by a scholar). It is therefore a vital prerequisite that digital editions encode standard, logical citation data such as the book/chapter/section divisions of Thucydides, or the book/line divisions of the Iliad.

The TEI Guidelines (as so often) offer more than one way to approach the problem. It is valid TEI to encode citation values as attributes on containing elements that define the logical structure of a document. Book/chapter/section in Thucydides might be represented by a successive hierarchy of TEI div elements, for example, or book/line in the Iliad by div elements containing l elements; the citation values could be placed in the @n attribute of each container.

Alternatively, since the earliest work of the TEI in the 1980s, the Guidelines have included empty elements (such as the milestone) that could be used to mark transitional points in a document. It is easy to find examples of scholarly texts using such empty elements to mark the beginning of a new unit like a chapter or section.

Arguably, there was little difference between these two approaches in SGML. In XML, however, scholars should avoid using empty elements to encode citation data.

A host of supporting and related technologies have developed around XML in its first decade. One of the most important is XPath, a notation for referring to parts of an XML document by the document's structure. Higher-level technologies such as XSLT, and implementations of the DOM in many programming languages, in turn support XPath expressions. The result is that programmers working in many environments can succinctly retrieve a unit like "book 2, chapter 5" of Thucydides with a simple XPath expression such as:

/TEI.2/text/body/div[@type='book' and @n='2']/div[@type='chapter' and @n='5']

Content between empty elements, on the other hand, cannot be addressed directly with XPath expressions.
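
As a rough illustration of retrieval from containing elements, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The sample document is a made-up miniature with placeholder text, and the helper name get_chapter is hypothetical. ElementTree implements only a subset of XPath and does not accept "and" inside a predicate, so the sketch chains two predicates instead, which selects the same elements in this case.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document: placeholder text, not a real edition.
SAMPLE = """<TEI.2><text><body>
<div type="book" n="1">
  <div type="chapter" n="5"><p>book 1, chapter 5</p></div>
</div>
<div type="book" n="2">
  <div type="chapter" n="4"><p>book 2, chapter 4</p></div>
  <div type="chapter" n="5"><p>book 2, chapter 5</p></div>
</div>
</body></text></TEI.2>"""


def get_chapter(root, book, chapter):
    """Return the div for the given book/chapter, or None if absent."""
    # Chained predicates stand in for "@type='book' and @n='2'".
    path = (
        f"./text/body/div[@type='book'][@n='{book}']"
        f"/div[@type='chapter'][@n='{chapter}']"
    )
    return root.find(path)


root = ET.fromstring(SAMPLE)   # root is the TEI.2 element
chapter = get_chapter(root, "2", "5")
passage = "".join(chapter.itertext()).strip()
print(passage)
```

The citation values map one-to-one onto steps in the path, so the whole unit comes back as a single element.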

Placing citation data on empty elements cuts programmers off from a galaxy of technologies they can use when citation data is kept on containing elements. Empty citation elements should never be necessary if the citation scheme is in fact a logical hierarchy: if it is not, consider whether there is a problem either with your choice of citation scheme or with your design of the rest of the document's structure.
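
For contrast, here is a sketch of one way to recover the same kind of unit when the boundaries are marked only by empty milestone elements. It assumes milestones carry unit and n attributes and that a new book resets the chapter; the sample text and the function names are hypothetical. Because ElementTree's XPath subset has no preceding:: axis, the sketch walks the document in order and tracks the most recent milestones itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document: boundaries are empty milestone elements.
SAMPLE = """<TEI.2><text><body><p>
<milestone unit="book" n="2"/><milestone unit="chapter" n="4"/>text of 2.4
<milestone unit="chapter" n="5"/>text of 2.5
<milestone unit="book" n="3"/><milestone unit="chapter" n="1"/>text of 3.1
</p></body></text></TEI.2>"""


def walk(elem):
    """Yield milestones and text chunks in document order."""
    if elem.tag == "milestone":
        yield ("milestone", elem)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from walk(child)
        if child.tail:            # text that follows the child's end tag
            yield ("text", child.tail)


def passage_between_milestones(root, book, chapter):
    """Collect text that falls under the given book and chapter milestones."""
    current = {"book": None, "chapter": None}
    pieces = []
    for kind, value in walk(root):
        if kind == "milestone":
            unit = value.get("unit")
            current[unit] = value.get("n")
            if unit == "book":
                current["chapter"] = None   # assumption: new book resets chapter
        elif current["book"] == book and current["chapter"] == chapter:
            pieces.append(value)
    return "".join(pieces).strip()


root = ET.fromstring(SAMPLE)
print(passage_between_milestones(root, "2", "5"))
```

The result here is a string assembled from fragments rather than an element, so it cannot be handed directly to tools that expect a node.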

5 comments:

Gabriel Bodard said...

Neel, I have three objections to your strongly worded statement that empty elements in citation schemes are "harmful":

(1) Who uses XPath alone without any way of doing more sophisticated things with it? In XSLT (or even better XSLT2) there are many elegant ways to handle this. Even in e.g. Python, an XPath isn't a line of code in a vacuum.

(2) Even just with XPath, wouldn't the following snippet do the job?

node()[preceding::milestone[@type='book'][1][@n='2']][preceding::milestone[@type='section'][1][@n='5']]

(3) Your solution, while one perfectly acceptable statement of priorities, itself doesn't solve all the problems we might have with these texts. Some classical texts have more than one citation scheme, and even if you claim one of these is "canonical", that decision doesn't mean that users won't sometimes want to use the other.

Of course life would be easier if there were not a problem with overlapping hierarchies in XML, but there is, and we have to make choices. In my experience, these choices always lead to a situation that we have been able to handle without too much pain. Once the successor to XML comes along that doesn't have this nasty limitation, perhaps there'll be no pain at all. :)

I would love to take this conversation on to the Markup list, Neel, or even TEI-L if you prefer.

Gabriel Bodard said...

my snippet of code didn't wrap. Retrying:

node()
[preceding::milestone[@type='book'][1][@n='2']]
[preceding::milestone[@type='section'][1][@n='5']]

Dot Porter said...

Neel, this issue was addressed today on TEI-L; several folks sent code to the list, or offered to send code offlist. The thread starts here: http://listserv.brown.edu/archives/cgi-bin/wa?A2=ind0804&L=TEI-L&T=0&F=&S=&P=7985

Dot Porter said...

Crap, that link didn't wrap. Or link. I'll try again:

here
