Saturday, November 2, 2013

Information extraction, yes ... but the right way!


To start this post, have a look at the images below (click to enlarge them).
The first two are a Maya stela (the back of Stela 3 in Piedras Negras) and a drawing of its upper part.


The next image is a representation of the information that can be extracted from the story told by the stela.



The last image in particular gives you an idea of my definition of "information extraction".

If you want to know a little more about this particular stela, there is a nice interactive tool including the pronunciation. And if you want to learn the Maya glyphs, I can recommend Reading the Maya Glyphs, which I use myself.

In line with the work to be done to handle all kinds of source documents, I'm currently developing another tool: a smart manuscript editor assistant.

The screenshot above shows a manuscript written between 1137 and 1150 in medieval Latin, held by the Bibliothèque Nationale de France and available online.
I'll talk a bit more about the lack of tooling in a future post.

The rest of this post will be illustrated by another project: the reproduction of a Dutch book from 1851 written by one of my ancestors, Beschrijving van het Hertogdom Limburg (Description of the Duchy of Limburg, a region covering parts of the present-day Netherlands, Belgium and Germany).
The pages are displayed, as much as possible, in the original layout. Hovering over identified terms brings up an information box with additional information, which comes mainly from a semantic network and from the Dutch DBpedia (Linked Open Data).




Let's have a look at some parts of the text.
Even if you don't understand Dutch you should be able to get the idea. Don't Panic! (Recall the 42 reference in the previous post?)
 page 8
 ...
 Het Hertogdom ging na den slag van Woeringen in 1288
 aan Braband over , onder welks heerschappij het langen
  tijd verbleef.
  ...
  
 page 22
 ...
 gevonden,  werden eindelijk de verschillende legers, op 
 den 5. Junij 1288, handgemeen, en er werd toen een ver- 
 maarde slag geleverd , tusschen Nuijs en Keulen op de 
 Fuhlinger heide ‚ bij Woeringen , bekend onder den slag 
 van Woeringen , welke van ‘s morgens 9 ure tot laat in den
 avond had geduurd.
 ...
 
From these two page excerpts we can extract the following facts:
 page 8
 => ”slag van Woeringen” ”in” ”1288”
 
 page 22
 => ”slag van Woeringen” ”op den” ”5. Junij 1288”
 => ”slag van Woeringen” ”van”  ”‘s morgens 9 ure”
 => ”slag van Woeringen” ”tot” ”laat in den avond”
 
Looking at these facts we can say in the first place:
  • there is an entity [1] called ”slag van Woeringen” on page 8
  • there is an entity [2] called ”slag van Woeringen” on page 22
The question that has to be answered is whether these entities are the same.
In this simple case the answer is, with a very high probability, yes.
  • page 8 says that this entity happened in 1288 and has the label "slag van Woeringen" (English: Battle of Woeringen)
  • page 22 says that it happened on 1288-06-05 (from 09:00 till late in the evening) and also has the label "slag van Woeringen"
And
  • Both entities have the same label
  • The statement on page 22 gives a more precise date than the one on page 8
  • There are no incompatible facts.
We can now draw the conclusion that entity [1] and entity [2] are the same with e.g. 0.98 probability. Note that a probability of 1.0 would represent certainty.
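
The reasoning above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: dates are represented as (year, month, day) tuples with None for unknown parts, and the weights in same_entity_probability are invented for illustration, not calibrated values and not the ones used by my tooling.

```python
from typing import Optional, Tuple

# A partial date: (year, month, day); month and day may be unknown (None).
PartialDate = Tuple[int, Optional[int], Optional[int]]

def dates_compatible(a: PartialDate, b: PartialDate) -> bool:
    """Two partial dates are compatible if every part known in both agrees."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))

def same_entity_probability(label_a: str, date_a: PartialDate,
                            label_b: str, date_b: PartialDate) -> float:
    """Toy estimate that two mentions denote the same entity.

    The weights below are illustrative assumptions only.
    """
    probability = 0.5  # neutral starting point
    if label_a.casefold() == label_b.casefold():
        probability += 0.3  # identical labels count as strong evidence
    if dates_compatible(date_a, date_b):
        probability += 0.15  # no contradiction between the dates
    else:
        probability -= 0.4  # an incompatible fact weighs heavily against identity
    return max(0.0, min(1.0, probability))

page_8 = ("slag van Woeringen", (1288, None, None))
page_22 = ("slag van Woeringen", (1288, 6, 5))

p = same_entity_probability(*page_8, *page_22)
print(f"probability that both mentions are the same entity: {p:.2f}")
```

Note that the result never reaches 1.0 here: matching facts raise the probability, but they do not turn the conclusion into a certainty.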

Certainty does exist, but not for this conclusion. What we do know for sure is that entity [1] and entity [2] both have the same label (slag van Woeringen) because that is how they have been defined.

It is also very probable that the language of the labels is Dutch. But not certain. We know for sure that if these labels are in Dutch, then this Dutch was the Dutch language of 1851, the year in which the book was published. (For more information about the ins and outs of languages, see my post RDF-Guys...)

Within the context of this book we are also sure of e.g. the date of the battle, because it is a fact mentioned in the book. But this doesn't mean that other dates can't be given elsewhere, or that the date of the battle is a certainty. It is only a certainty within the context of this book.

The book provides other facts on this battle on pages 23, 24, 26, 145 and 150 about the opponents involved, the outcome and the consequences. These are not mentioned in this short explanation because they don't add anything new to the process of identifying entities. They do, of course, play a role in identifying the other entities mentioned in the book.

These other facts also play a role in identifying the virtually unique entity "slag van Woeringen" of this book with other entities defined elsewhere, something we are going to see next.
What is important to retain from this example is that we have to think about what is known, and with what probability. In other words: ask the right questions.

One of the goals of identifying entities in this book (and other sources) in this way is to enrich world knowledge about them. We expose the extracted facts as Linked Open Data, joining the ever-growing, freely accessible, interconnected world knowledge.
For this we have to establish an identity relationship between the entities in this book and other entities defined elsewhere. Only in that way can we increase world knowledge about them.
Let's have a look at an excerpt of the RDF of the Dutch DBpedia:
 
<rdf:Description rdf:about="http://nl.dbpedia.org/resource/Slag_bij_Woeringen">
 <rdf:type rdf:resource="http://schema.org/Event"/>
 <rdfs:label xml:lang="nl">Slag bij Woeringen</rdfs:label>
 <dbo:startDate rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1288-06-05</dbo:startDate>
 …
</rdf:Description>
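
To compare such a description with facts from the book, the label and date first have to be read out of the RDF/XML. Below is a minimal sketch using only Python's standard library. The rdf:RDF wrapper and the namespace declarations are my assumptions (the excerpt above omits them), and the elided part of the description is simply left out. A real RDF parser would be the more robust choice.

```python
import xml.etree.ElementTree as ET

# Prefix bindings; the excerpt does not show them, so these are assumed.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dbo": "http://dbpedia.org/ontology/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

RDF_XML = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:dbo="http://dbpedia.org/ontology/">
 <rdf:Description rdf:about="http://nl.dbpedia.org/resource/Slag_bij_Woeringen">
  <rdf:type rdf:resource="http://schema.org/Event"/>
  <rdfs:label xml:lang="nl">Slag bij Woeringen</rdfs:label>
  <dbo:startDate rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1288-06-05</dbo:startDate>
 </rdf:Description>
</rdf:RDF>"""

def read_description(xml_text: str) -> dict:
    """Return URI, label, label language and start date of the first description."""
    root = ET.fromstring(xml_text)
    description = root.find("rdf:Description", NS)
    label = description.find("rdfs:label", NS)
    start = description.find("dbo:startDate", NS)
    return {
        "uri": description.get("{%s}about" % NS["rdf"]),
        "label": label.text,
        "lang": label.get(XML_LANG),
        "startDate": start.text,
    }

entity = read_description(RDF_XML)
print(entity)
```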
 
Remember that each time an entity is mentioned in the book it is a separate entity, with only a certain probability of being identical to others in the same source.

For those that are identical beyond a certain threshold we can create a virtual entity merging all the facts of its components. This entity is used to try to establish an identity relationship with external data.
Comparing the facts of this virtual entity, extracted from the book, with the facts of e.g. the Dutch DBpedia works in exactly the same way as comparing the facts known about the different entities within the book itself.
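
The merge into a virtual entity can be sketched as follows. This is a simplified illustration, not my actual implementation: the threshold value is an assumption, the pairwise probabilities are given rather than computed, and mention 3 is invented purely to show a case that stays below the threshold.

```python
THRESHOLD = 0.9  # illustrative cut-off (assumption)

# Mentions 1 and 2 are the ones from pages 8 and 22; mention 3 is invented.
mentions = {
    1: {"page": 8, "facts": {("in", "1288")}},
    2: {"page": 22, "facts": {("op den", "5. Junij 1288"),
                              ("van", "‘s morgens 9 ure"),
                              ("tot", "laat in den avond")}},
    3: {"page": None, "facts": {("in", "1289")}},  # hypothetical, not from the book
}

# Pairwise identity probabilities; only the 0.98 comes from the example above.
pair_probability = {(1, 2): 0.98, (1, 3): 0.40, (2, 3): 0.35}

def merge_mentions(mentions, pair_probability, threshold):
    """Group mentions linked by a probability >= threshold and union their facts."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent links up to the representative of the group.
        while parent[m] != m:
            m = parent[m]
        return m

    for (a, b), probability in pair_probability.items():
        if probability >= threshold:
            parent[find(b)] = find(a)

    virtual = {}
    for m, data in mentions.items():
        entity = virtual.setdefault(find(m), {"members": [], "facts": set()})
        entity["members"].append(m)
        # Keep the page with each fact so the origin is not lost in the merge.
        entity["facts"] |= {fact + (data["page"],) for fact in data["facts"]}
    return list(virtual.values())

virtual_entities = merge_mentions(mentions, pair_probability, THRESHOLD)
for entity in virtual_entities:
    print(entity["members"], len(entity["facts"]), "facts")
```

The individual mentions are kept as they are; the virtual entity only references them, so the merge can be revised when probabilities change later on.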

In this case (and based only on the label and date for this example) the DBpedia entity known as ”http://nl.dbpedia.org/resource/Slag_bij_Woeringen” is, with a high probability (e.g. 0.96), identical to our virtual entity ”slag van Woeringen” from the book.
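
A rough sketch of such a comparison, based only on label and date: the labels differ slightly ("van" versus "bij"), so an exact string match would fail and some similarity measure is needed. Here I assume difflib's SequenceMatcher for the labels and invented weights for combining the two signals. The result is a score for illustration, not the 0.96 probability mentioned above.

```python
from difflib import SequenceMatcher

def label_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two labels, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def match_score(virtual: dict, external: dict) -> float:
    """Toy combination of label similarity and date agreement.

    The weights 0.6 and 0.4 are illustrative assumptions.
    """
    similarity = label_similarity(virtual["label"], external["label"])
    date_bonus = 0.4 if virtual["date"] == external["date"] else 0.0
    return 0.6 * similarity + date_bonus

book_entity = {"label": "slag van Woeringen", "date": "1288-06-05"}
dbpedia_entity = {
    "uri": "http://nl.dbpedia.org/resource/Slag_bij_Woeringen",
    "label": "Slag bij Woeringen",
    "date": "1288-06-05",
}

score = match_score(book_entity, dbpedia_entity)
print(f"match score: {score:.2f}")
```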

Once this probable identity is established, we can apply a feedback loop. We use the data provided by the external source to check the facts of each individual entity in the book: both the ones that were part of the virtual entity initially used and other entities that were not part of it but do have a relatively high probability (below the threshold) of being the same one. This feedback loop might increase or decrease the probabilities of the individual entities (in relation to each other).
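
One pass of this feedback loop could look like the sketch below. It assumes dates are ISO strings of varying precision ("1288", "1288-06", "1288-06-05"), so that a less precise date is compatible when it is a prefix of the more precise one. The step size, the starting probabilities and mention 3 are invented for illustration.

```python
def is_compatible(mention_date: str, external_date: str) -> bool:
    """A less precise ISO date is compatible if it is a prefix of the other."""
    return (external_date.startswith(mention_date)
            or mention_date.startswith(external_date))

def apply_feedback(probabilities: dict, dates: dict,
                   external_date: str, step: float = 0.05) -> dict:
    """Raise the probability of mentions that agree with the external date
    and lower it for mentions that contradict it. The step is an assumption."""
    updated = {}
    for mention, probability in probabilities.items():
        delta = step if is_compatible(dates[mention], external_date) else -step
        updated[mention] = min(1.0, max(0.0, probability + delta))
    return updated

# Mentions 1 and 2 are those from pages 8 and 22; mention 3 is hypothetical.
dates = {1: "1288", 2: "1288-06-05", 3: "1289"}
probabilities = {1: 0.90, 2: 0.90, 3: 0.60}  # illustrative starting values

updated = apply_feedback(probabilities, dates, external_date="1288-06-05")
for mention in sorted(updated):
    print(mention, f"{probabilities[mention]:.2f} -> {updated[mention]:.2f}")
```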

This whole process simulates the way we, the non bits-and-bytes, "recognize" things. We use our previously acquired knowledge (external source) to "identify" or "recognize" what the book is about.
At the same time we extend our knowledge with the new facts that the book provides.
Yeah, we are learning!

But computers are good at crunching data, big data. So let them do it.

And, don't forget, they will have to be able to explain why they established this identity (see my post on trust for more details).
And the extracted facts (triples, quads or quints) have to carry a reference to where they came from (sentence, paragraph, page and book, in whatever machine-exploitable form).
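
As an illustration of a fact that carries its own origin, here is a minimal sketch. The class and field names are my own choice for this example, and the paragraph and sentence numbers are placeholders, not positions checked against the book.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    """Where a statement was found."""
    book: str
    page: int
    paragraph: int  # placeholder index for this example
    sentence: int   # placeholder index for this example

@dataclass(frozen=True)
class Statement:
    """A triple (subject, predicate, object) extended with its source."""
    subject: str
    predicate: str
    obj: str
    source: Source

    def cite(self) -> str:
        """Human-readable reference, usable when explaining a conclusion."""
        s = self.source
        return f"{s.book}, p. {s.page}, par. {s.paragraph}, sentence {s.sentence}"

statement = Statement(
    subject="slag van Woeringen",
    predicate="in",
    obj="1288",
    source=Source("Beschrijving van het Hertogdom Limburg",
                  page=8, paragraph=1, sentence=1),
)
print(statement.cite())
```

Because both classes are frozen, a statement cannot silently lose or change its source once it has been extracted.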


Introspection
You might think that there is some overkill in this way of handling information extraction. Perhaps you're right in some situations, but this book happens to contain a concrete example where this way of processing shows its importance.
One chapter deals with a person called "Jan van Weert" and describes a long list of his activities. A few years after this book was published, a historian established "with certainty" that two persons had been confused. Some of the facts associated with the "Jan van Weert" from the book belong to one, some belong to the other.
Another example goes back to the mid-nineties (see my short history post). In one of the interviews conducted by an ethnobotanist, a woman used a specific name to indicate two different plants. For her it was the same plant. It proved impossible to associate each of her uses of that name with one plant or the other. It's the inverse of the previous example, and no identity mapping is possible. This is not a problem and there are ways to handle this.

A little note about the book's website.
One of the external libraries I use is a bit buggy. If you use the content menu to jump to a section in the book, the book might freeze. I'm working on a fix, so please be patient.
Part of this post is also available on the about page of the book.

Edit 2013-11-04: The Duchy of Limburg doesn't include parts of Luxemburg of course.
Edit 2013-11-13: Typos