Friday, November 1, 2013

RDF-guys, please make languages what they should be

Human readable RDF literals can be associated with a language tag in the form of

 <rdfs:label xml:lang="en">You name it</rdfs:label>

The value of xml:lang="en" is defined by RFC 3066 (since superseded by BCP 47), which refers to the ISO 639 standard: lists of 2- and 3-letter language codes.
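To make the tagging concrete, here is a minimal sketch in Python of how such language tags can be read back from RDF/XML using only the standard library. The resource URI `http://example.org/thing` and the Dutch label are made up for the example; a real application would more likely use a proper RDF parser.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
RDFS_LABEL = "{http://www.w3.org/2000/01/rdf-schema#}label"

# Hypothetical document: the resource URI and the Dutch label are invented.
doc = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://example.org/thing">
    <rdfs:label xml:lang="en">You name it</rdfs:label>
    <rdfs:label xml:lang="nl">Noem maar op</rdfs:label>
  </rdf:Description>
</rdf:RDF>"""


def labels_by_language(rdf_xml):
    """Map each xml:lang code to the text of the rdfs:label carrying it."""
    root = ET.fromstring(rdf_xml)
    return {el.get(XML_LANG): el.text for el in root.iter(RDFS_LABEL)}


print(labels_by_language(doc))
```

All the tag can express here is a code from a fixed list, which is exactly the limitation discussed in the rest of this post.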

It all looks fine and well defined, and it is. But this way of looking at languages is a static, boxed view.
And yes, I know that it is possible to extend these lists.

But have a look at what languages really are.

Book cover "Aux sources de la parole", Pierre-Yves Oudeyer


To refer to Pierre-Yves Oudeyer in his recent book:

a language is a communication consensus within a group of people.

As people do not live forever, the group changes over time and so does the language. Leaving aside the discussion of whether we are dealing with a language, a variant or a dialect: is the SMS language of our millennial kids English (for English-speaking kids, of course)?


Linguists have grouped the historical variants of languages into periods. In Dutch, for example, we traditionally distinguish Old Dutch (500-1150), Middle Dutch (until 1500) and Modern Dutch (from 1500 until now).
But imagine how much a language evolves over a period of several centuries.

From a historical point of view I can understand that there was a need for static classification and the creation of fixed lists of language codes. But computers and their software have moved on.
We now have the possibility to record that a Dutch book of 1851 is written in "the Dutch language of 1851", and that a WhatsApp message of 2013-11-01 12:00:56 UTC is written in "the language of the social group" in which it is produced and to which it is sent, at that moment in time.

Having stated this, back to our RDF.
Obviously the defined language tags (and their sub-tags) are outdated and do not correspond anymore to the fine-grained distinctions we would like to be able to make. Poor old spelling-checker!

I think the solution is simple (and that's why I like it).

Using URIs to refer to languages (instead of codes) gives us the flexibility we need. Linked and Open, of course. These language URIs can be created by anyone, and by any service. Classical relationships between them can indicate hierarchy, similarity and so forth. A search for something in English can group (or, better, expand the query to) the URIs of all languages, variants and dialects of English if needed. But if I want a precise search in one specific language form, I can have that too. At this moment I cannot.
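To illustrate the idea, here is a rough sketch in Python of how a search could be expanded over language URIs. Everything in it is an assumption made up for the example: the `http://example.org/lang/...` URIs, the single "broader" relationship, and the three sample documents. A real setup would keep these relationships in RDF and query them there.

```python
# Hypothetical language URIs, each pointing to its broader language.
# None of these identifiers exist; they only illustrate the hierarchy.
LANG = "http://example.org/lang/"
BROADER = {
    LANG + "dutch-1851": LANG + "modern-dutch",
    LANG + "modern-dutch": LANG + "dutch",
    LANG + "middle-dutch": LANG + "dutch",
    LANG + "english-sms-2013": LANG + "english",
}

# Hypothetical documents, each tagged with one language URI.
DOCUMENTS = [
    ("A Dutch book of 1851", LANG + "dutch-1851"),
    ("A Middle Dutch manuscript", LANG + "middle-dutch"),
    ("A text message of 2013", LANG + "english-sms-2013"),
]


def expand(language_uri, broader=BROADER):
    """Return the URI itself plus every directly or indirectly narrower URI."""
    result = {language_uri}
    changed = True
    while changed:
        changed = False
        for narrow, broad in broader.items():
            if broad in result and narrow not in result:
                result.add(narrow)
                changed = True
    return result


def search(language_uri, precise=False):
    """List titles in exactly this language, or in any of its variants."""
    wanted = {language_uri} if precise else expand(language_uri)
    return [title for title, lang in DOCUMENTS if lang in wanted]


print(search(LANG + "dutch"))                      # expanded search
print(search(LANG + "dutch-1851", precise=True))   # precise search
```

The expanded search on the Dutch URI finds both Dutch documents, while the precise search only finds the book of 1851. Other relationships, such as similarity, could be added as further dictionaries in the same way.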

In fact there is one place where I can: in semantic network technologies this idea has been applied for more than two decades now, and it's just fun.