Wednesday, November 6, 2013

Natural language learning app

For more than a year now I have tried to get Google involved in some way in a specific app I would like to create. Wrong contacts? Are they already working on something like it? No idea, but from now on it's out in the wild.

When learning a foreign language, the best way is contact with native speakers. Communicating, in other words.
But that's often not possible, certainly not for most people in the world. With +Project Loon on its way, an Internet connection will be.

And what if your partner is an artificial intelligence that patiently starts a conversation with you about cats and dogs and other familiar objects, up to the point (after some time) where you are fluent in a foreign language?

I made a short video on what this app might look like, a sort of video prototype. But don't expect the artificial mentor yet.

Subtitles are available in English, French and Dutch.
Budget for the video? 1.5 € to make the "blue screen" used, and a few hours for me and the other participants. It was really fun to make. Perhaps there will be a "making of" with the most amusing moments; we have enough material.

I would really like this app to come into existence. And I think it is feasible.

If you like the idea of learning foreign languages this way, reshare this post with your contacts and +1 it.

If you would like to participate by writing code, creating the virtual worlds, or sponsoring the realization financially, please contact me (see the end of the video).
If you can help teach the teacher, I'm also interested.

twitter: nllapp

Monday, November 4, 2013

A leap in artificial intelligence ?

For a long time some of us, including myself, have been looking for a "real" artificial intelligence (intelligent program? - you name it).

Able to learn.

So how do we learn?

As mentioned in a previous post we can consider ourselves as bunches of:
- sensors operating since our birth and even before that (eyes, ears, skin, ...)
- input interpretation services (brain)
- restitution services (memory)
- production services (voice, body)
- creation services (brain)

When we receive useful input we can do something with it. Else the only thing we can do is ... well store it and see if it can be made useful at some point in time (or throw it away immediately).

I think sensor input is useful in two situations.
In one situation the input looks like something perceived before. Not necessarily identical, but similar. So we can associate the new input with something in our memory and thus with everything related to it.
Is that learning? I think so, because the variations in the input improve the related "pattern recognizer" and enable generalization.
In a second situation, to make an unrecognized input useful, it should be associated with some context: an input from another kind of sensor. Something you see and the word for it (read or heard) or how to use it. Learning foreign languages works best that way.

Association seems to me the most important word here. If input is not associated with something it is just... isolated (prfest) and not useful (yet).

I know there is more to it but this is enough to put you on the track for the rest of this post.

In our digital world all we have are bits and pieces called bytes.
And I like to look at them as a flow.

A written text is nothing more than a flow of bytes, characters, words, sentences and paragraphs, depending on the scale at which you look at it.
Do you need an explanation for a sound track?
Even a still image can be seen as a flow of pixels one line after another.

The inputs in a stream are associated in time (one after another) and / or in space (one line of an image above the other). A flow of text? Time or space? Hard to say, but does it matter?

In a video with subtitles you have all of it together of course. Except for smell, at least for now.

Now imagine a flexible, adaptive pattern analyzer/recognizer on one of these streams, breaking it into chunks that are similar to previously seen chunks, or breaking it into chunks because the stream itself has patterns that return once in a while. Would you be able to make one for a flow of text?
Adaptive? Absolutely necessary. If your premises are wrong when configuring such a pattern analyzer, it will fail at some point. If its mission is "find large patterns" and its toolbox is good enough to try hard (e.g. change the scale of the scrutinizing window), it might still fail (if there is no pattern), but at least not because it has the wrong premises. (Large means: go beyond the 0's and 1's; that scale is not very useful.)
Oh, one important instruction: never look back.

We will probably need only a small set of instructions to make a pattern analyzer (its toolbox). One that is also able to differentiate the various kinds of streams (text, sound, image, ...), but hints might help of course.

You can call these chunks neurons if you like.
They are associated with other chunks in the stream (positional dependence) and hopefully with "old" chunks seen before. And with chunks from other kinds of parallel streams (e.g. a spoken and a written word, or a written word and an image).

Each time associated chunks in a stream (or across parallel streams) find an echo in the old chunks, the link between these chunks is reinforced. We can imagine a decay function that weakens links over time and eliminates them (forgetting) if they are not reactivated. This will remove the pattern analyzer's childhood chunking errors.
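
To make the reinforce-and-forget idea a bit more tangible, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the class name AssociationMemory, the exponential half-life decay, the step counter used as "time" and the forgetting threshold are just one of many possible choices, not a finished design.

```python
class AssociationMemory:
    """Weighted links between chunks: reinforced on co-occurrence, decayed over time."""

    def __init__(self, half_life=100.0, forget_below=0.05):
        self.half_life = half_life        # assumed: steps after which an unused link halves
        self.forget_below = forget_below  # assumed: links weaker than this are dropped
        self.links = {}                   # (chunk_a, chunk_b) -> (weight, last_step)

    @staticmethod
    def _key(a, b):
        # Links are undirected, so store each pair in a canonical order.
        return (a, b) if a <= b else (b, a)

    def _decayed(self, weight, last_step, now):
        # Exponential decay since the last reinforcement.
        return weight * 0.5 ** ((now - last_step) / self.half_life)

    def reinforce(self, a, b, now, amount=1.0):
        """Strengthen the link between two chunks seen together at step `now`."""
        key = self._key(a, b)
        weight, last_step = self.links.get(key, (0.0, now))
        self.links[key] = (self._decayed(weight, last_step, now) + amount, now)

    def strength(self, a, b, now):
        """Current (decayed) strength of a link, 0.0 if it does not exist."""
        key = self._key(a, b)
        if key not in self.links:
            return 0.0
        weight, last_step = self.links[key]
        return self._decayed(weight, last_step, now)

    def forget(self, now):
        """Remove links that have decayed below the threshold; return their keys."""
        stale = [key for key, (weight, last_step) in self.links.items()
                 if self._decayed(weight, last_step, now) < self.forget_below]
        for key in stale:
            del self.links[key]
        return stale


memory = AssociationMemory(half_life=10.0, forget_below=0.2)
memory.reinforce("cat", "chat", now=0)   # written word and spoken word, seen together
memory.reinforce("cat", "dog", now=0)    # an early, never repeated association
memory.reinforce("cat", "chat", now=10)  # the echo: the same pair returns
forgotten = memory.forget(now=30)        # only the unrepeated link is dropped
```

A link that keeps finding an echo survives, while a one-off association fades away, which is the behaviour described above.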

A pattern analyzer, preferably the same code but with different (learned) settings, can be used to discover patterns in associated chunks, leading to higher level concepts. (Note the shift from chunk to concept.)

What's different compared to current machine learning techniques?

First we could ask if this is supervised or unsupervised learning. Supervised machine learning is applied when you have a labelled data set and the program finds the optimal parameters to match the data with the provided labels (a good example is optical character recognition OCR).
You could look at two parallel streams as one being the data and the other the labels. Only here you do not look for parameters to match both. They are matched (I should say associated) by definition.
So it doesn't really look like supervised learning.
And unsupervised? With this kind of machine learning you have (preferably a large amount of) data of one and the same kind (like in this Google experiment) and try to find the higher level concepts in it. One of the essential differences here is that we look at different kinds of things (written and spoken words).
Doesn't feel like a good match with unsupervised learning either.

Second, this is lifelong learning (and appropriate forgetting) from streams of data, not just training on a fixed set and then applying the result to other data. Google's self-driving car does the same thing, but not all systems do.

I have the feeling (not only a dream) that this is the right track to follow.

And you?

Next chapter

Saturday, November 2, 2013

Information extraction, yes ... but the right way !

To start this post have a look at the images below (click to enlarge them).
The first two are a Maya stela (the back of Stela 3 in Piedras Negras) and a drawing of its upper part.

The next image is a representation of the information that can be extracted from the story told by the stela.

Especially the last image gives you an idea about my definition of "information extraction".

If you want to know a little more about this particular stela there is a nice interactive tool including the pronunciation. And if you want to learn the Maya glyphs I can recommend Reading the Maya Glyphs which I use myself.

In line with the work to be done to handle all kinds of source documents I'm currently developing another tool: a smart manuscript editor assistant.

The screenshot above shows a manuscript written between 1137 and 1150 in medieval Latin, from the Bibliothèque Nationale de France and available online.
I'll talk a bit more about the lack of tooling in a future post.

The rest of this post will be illustrated by another project: the reproduction of a Dutch book from 1851, written by one of my ancestors: Beschrijving van het Hertogdom Limburg (Description of the Duchy of Limburg - a region covering parts of the present-day Netherlands, Belgium and Germany).
The pages are displayed, as much as possible, in the original layout. Hovering over identified terms brings up an information box displaying additional information. This additional information comes essentially from a semantic network and from the Dutch DBpedia (Linked Open Data).

Let's have a look at some parts of the text.
Even if you don't understand Dutch you should be able to get the idea. Don't Panic! (Recall the 42 reference in the previous post?)
 page 8
 Het Hertogdom ging na den slag van Woeringen in 1288
 aan Braband over , onder welks heerschappij het langen
  tijd verbleef.
 page 22
 gevonden,  werden eindelijk de verschillende legers, op 
 den 5. Junij 1288, handgemeen, en er werd toen een ver- 
 maarde slag geleverd , tusschen Nuijs en Keulen op de 
 Fuhlinger heide ‚ bij Woeringen , bekend onder den slag 
 van Woeringen , welke van ‘s morgens 9 ure tot laat in den
 avond had geduurd.
From these two page excerpts we can extract the following facts:
 page 8
 => ”slag van Woeringen” ”in” ”1288”
 page 22
 => ”slag van Woeringen” ”op den” ”5. Junij 1288”
 => ”slag van Woeringen” ”van”  ”‘s morgens 9 ure”
 => ”slag van Woeringen” ”tot” ”laat in den avond”
Looking at these facts we can say in the first place:
  • there is an entity [1] called ”slag van Woeringen” on page 8
  • there is an entity [2] called ”slag van Woeringen” on page 22
The question that has to be answered is whether these entities are the same.
In this simple case the answer is, with a very high probability, yes.
  • page 8 says that this entity happened in 1288 and has a label "slag van Woeringen" (English: Battle of Woeringen)
  • page 22 says that it happened on 1288-06-05 (from 09:00 till late in the evening) and also has a label "slag van Woeringen"
  • Both entities have the same label
  • The statement of page 22 gives a more precise date than the one of page 8
  • There are no incompatible facts.
We can now draw the conclusion that entity [1] and entity [2] are the same with e.g. 0.98 probability. Note that a probability of 1.0 would represent certainty.
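
The reasoning above (same label, compatible dates, no incompatible facts) could be sketched as follows. This is only an illustration under loud assumptions: the Mention class, the noisy-OR combination and the evidence weights 0.9 and 0.8 are placeholders I picked for the example, not values derived from any data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One occurrence of an entity in the book (hypothetical structure)."""
    label: str
    date: str   # partial ISO date, e.g. "1288" or "1288-06-05"
    page: int


def dates_compatible(a, b):
    """True when one partial date refines the other, e.g. '1288' and '1288-06-05'."""
    parts_a, parts_b = a.split("-"), b.split("-")
    shared = min(len(parts_a), len(parts_b))
    return parts_a[:shared] == parts_b[:shared]


def identity_score(m1, m2, label_evidence=0.9, date_evidence=0.8):
    """Naive probability that two mentions denote the same entity.

    An incompatible fact vetoes the identity; otherwise the independent
    pieces of evidence are combined with a noisy-OR. The weights are
    arbitrary placeholders.
    """
    if not dates_compatible(m1.date, m2.date):
        return 0.0
    doubt = 1.0
    if m1.label.casefold() == m2.label.casefold():
        doubt *= 1.0 - label_evidence
    doubt *= 1.0 - date_evidence   # the dates are compatible at this point
    return 1.0 - doubt


page_8 = Mention("slag van Woeringen", "1288", 8)
page_22 = Mention("slag van Woeringen", "1288-06-05", 22)
score = identity_score(page_8, page_22)
```

With these placeholder weights the two mentions score roughly 0.98: high, but never 1.0.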

Certainty does exist but not on this conclusion. What we do know for sure is that entity [1] and entity [2] both have the same label (slag van Woeringen) because that is how they have been defined.

It is also very probable that the language of the labels is Dutch. But not certain. We know for sure that if these labels are in Dutch, then this Dutch was the Dutch language of 1851, the year in which the book was published. (For more information about the ins and outs of languages see my post RDF-Guys...)

Within the context of this book we are also sure of e.g. the date of the battle, because it is a fact mentioned in the book. But this doesn't mean that other dates can't be given elsewhere, or that the date of the battle is a certainty. It's only a certainty within the context of this book.

The book provides other facts on this battle on pages 23, 24, 26, 145 and 150, about the opponents involved, the outcome and the consequences. These are not mentioned in this short explanation because they don't add anything new to the process of identifying entities. They do of course play a role in identifying the other entities mentioned in the book.

These other facts do also play a role in identifying the virtually unique entity "slag van Woeringen" of this book with other entities defined elsewhere. Something we are going to see next.
What is important to retain from this example is that we have to think about what is known with what probability. In other words: ask the right questions.

One of the targets of this way of identifying entities in this book (and other sources) is to enrich world knowledge about them. We expose the extracted facts as Linked Open Data, joining the ever-growing, freely accessible, interconnected world knowledge.
For this we have to establish an identity relationship between the entities in this book and other entities defined elsewhere. Only in that way can we increase world knowledge about them.
Let's have a look at an excerpt of the RDF of the Dutch DBpedia:
<rdf:Description rdf:about="">
 <rdf:type rdf:resource=""/>
 <rdfs:label xml:lang="nl">Slag bij Woeringen</rdfs:label>
 <dbo:startDate rdf:datatype="">1288-06-05</dbo:startDate>
Remember that each time an entity is mentioned in the book it is a separate entity, with only a certain probability of being identical to others in the same source.

For those that are identical beyond a certain threshold we can create a virtual entity merging all the facts of its components. This entity is used to try to establish an identity relationship with external data.
Comparing the facts of this virtual entity, extracted from the book, with the facts of e.g. the Dutch DBpedia is strictly the same process as comparing the facts known about the different entities within the book itself.
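
As a rough sketch of what "merging all the facts of its components" might look like, assuming each mention is a plain dictionary with a page number and a set of facts (a hypothetical structure, chosen only for this example):

```python
def merge_mentions(mentions):
    """Merge mentions judged identical into one virtual entity.

    Every distinct value of a property is kept, together with the pages
    that assert it, so nothing is lost and conflicts stay visible.
    """
    virtual = {}
    for mention in mentions:
        for prop, value in mention["facts"].items():
            virtual.setdefault(prop, {}).setdefault(value, []).append(mention["page"])
    return virtual


mentions = [
    {"page": 8, "facts": {"label": "slag van Woeringen", "date": "1288"}},
    {"page": 22, "facts": {"label": "slag van Woeringen", "date": "1288-06-05",
                           "from": "09:00"}},
]
virtual_entity = merge_mentions(mentions)
```

The virtual entity keeps both date values side by side, each with its page, instead of silently picking one. That leaves room to split the entity again if the identity later turns out to be wrong.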

In this case (and based only on the label and date for this example) the DBpedia entity known as ”” is, with a high probability (e.g. 0.96), identical to our virtual entity ”slag van Woeringen” from the book.

Once this probable identity is established, we can apply a feedback loop. We use the data provided by the external source to check the facts of each individual entity in the book. This applies both to the ones that were part of the virtual entity initially used and to other entities that were not part of it but do have a relatively high probability (below the threshold) of being the same one. This feedback loop might increase or decrease the probabilities of the individual entities (in relation to each other).

This whole process simulates the way we, the non-bits-and-bytes, "recognize" things. We use our previously acquired knowledge (external source) to "identify" or "recognize" what the book is about.
At the same time we extend our knowledge with the new facts that the book provides.
Yeah, we are learning!

But computers are good at crunching data, big data. So let them do it.

And, don't forget, they will have to be able to explain why they established this identity (see my post on trust for more details).
And the extracted facts (triplets, quads or quinds) have to carry a reference to where they came from (sentence, paragraph, page and book, in whatever machine-exploitable form).

You might think that this way of handling information extraction is overkill. Perhaps you're right in some situations, but this book happens to contain a concrete example where this way of processing shows its importance.
One chapter deals with a person called "Jan van Weert" and describes a long list of his activities. A few years after this book was published, a historian established "with certainty" that two persons had been confused. Some of the facts associated with the "Jan van Weert" from the book belong to one, some belong to the other.
Another example goes back to the mid-nineties (see my short history post). In one of the interviews an ethnobotanist conducted, a woman used a specific name to indicate two different plants. For her it was the same plant. It turned out to be impossible to associate each of her uses of that name with one plant or the other. It's the inverse of the previous example, and there is no identity mapping possible. This is not a problem and there are ways to handle this.

A little note about the book's website.
One of the external libraries I use is a bit buggy. If you use the contents menu to jump to a section in the book, the book might freeze. I'm working on fixing this, so please be patient.
Part of this post is also available in the about page of the book.

Edit 2013-11-04: The Duchy of Limburg doesn't include parts of Luxemburg of course.
Edit 2013-11-13: Typos

Trust? ==> 0 or 100 ... or perhaps 42 ?

When I ask the Google voice search:

"What time is it in Mountain View?"

Why do I trust the answer returned?

And when I ask:

"Who invented the world?"

why does the answer bring tears to my eyes from laughing?

You know the answer should have been:


If you didn't know that, I recommend this book.

The first level of trust in these two examples is in speech recognition. The app provides feedback (the recognized text spoken) so I can check whether the transcription has been done correctly. Elementary but undeniably necessary to establish trust in the answer.

Next comes the answer. In the first example, over a short period of time I came to trust the kind of services that provide the current time all around the world. This is based on my knowledge that these services are not difficult to realize and that the only thing that matters is correctly identifying the place asked for.

And the second case? Does it degrade my built-up trust in the quality of Google's search results?
Nope! But why?
Save for the 42, I knew that there was no perfect answer, but I can also explain why +Tim Berners-Lee popped up, and I also know how this service can be improved so that it does not make mistakes like that.
If Google had used its knowledge graph, which obviously it didn't, it would have known that "World" and "World Wide Web" are different things.

When exchanging information with the help of computer systems, three kinds of "things" need to build up their own trustworthiness:
- services (that they do what they are supposed to do)
- persons
- information

The rest of this post will essentially handle the last element of this list, but let's have a quick look at the first two.

Trust in services
How do we evaluate the trustworthiness of services? I imagine that my first trip in a self-driving car will be extremely stressful for me (not for the car of course). During the fifth I'll read and answer my mails, and on the tenth trip I'll take a nap.
Btw, I tend to include organizations here too, but that's of course open to discussion.
Positive feedback from other persons using the same service accelerates the trust building. Negative feedback on the other hand ...
This naturally leads to 

Trust in persons
Complex matter. Trusting that actions will be executed timely and correctly? That the information provided is true? 
For me the key elements here are: evolution over time; doubt can pop up at any moment; other people's opinions do matter (they certainly influence the initial trust you have in someone) but less than your own perceptions; and finally, trust in persons is very fine-grained (on some aspects I can trust someone almost completely, on other aspects less). If physical contact exists (directly or with some communication system in between: video, phone), body language (including voice) is an important factor.
Referring to the negative feedback above: if that feedback was given by a person I trust (on that specific topic, e.g. an authority in the field), it might by far outweigh thousands of positive feedbacks from people I do not know.
The feedback provided, well ... it's information

Trust in information
The level of trust we have in information depends on the trust we have in the original provider, the trust we have in the whole chain of intermediate transmissions and/or transformations up to the trust we have in our own sensory input.

I like to make a distinction between carbon-based and silicon-based producers (although silicon is perhaps a bit too imprecise).

Starting with the latter, we can separate them into two categories: the ones providing raw data (e.g. a surveillance video camera) and the ones providing interpreted data (temperature sensor, smoke detector or a smart collision detector camera in a tunnel). These last ones can be seen as a combination of a raw sensor and one or more chained interpretation services (software in most cases).
As sensors are relatively cheap, multiplying the number of sensors can often validate the produced information. In the case of interpreted data, we should of course prefer different interpretation services, to avoid or detect errors in a specific one.
Taking a helicopter view, what is a search engine? Input: the Internet (text, images, video); services: indexing, filtering; output: references related to a query. Or as in this example: input: YouTube still frames; output: cats and faces.
Thus the frontier between traditional sensors of our physical world and sensors of our digital worlds (a crawl bot or a collision detector in a virtual world) is perhaps not as clean as it might initially appear.
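
The point about multiplying cheap sensors to validate information can be illustrated with a tiny sketch. The function name, the use of the median as consensus and the fixed tolerance are assumptions made for this example; real sensor fusion is of course more subtle.

```python
from statistics import median


def validate_readings(readings, tolerance):
    """Cross-check redundant sensors measuring the same quantity.

    Returns the median as consensus value and the (sorted) names of the
    sensors that deviate from it by more than `tolerance`.
    """
    consensus = median(readings.values())
    suspects = sorted(name for name, value in readings.items()
                      if abs(value - consensus) > tolerance)
    return consensus, suspects


# Three hypothetical temperature sensors in the same room; one is off.
consensus, suspects = validate_readings(
    {"sensor_a": 20.1, "sensor_b": 19.9, "sensor_c": 35.0}, tolerance=1.0)
```

Here the consensus is 20.1 and only sensor_c is flagged: trust in the reading comes from agreement between independent sources, not from any single sensor.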

What remains is the last element, the human producer, before getting to the essence of this post (sorry for the long introduction). Yes, I know that dogs are also producers (of ...).

In some sense we are a bunch of sensors operating since our birth, a bunch of input interpretation services called brain, a bunch of restitution services called memory (recent neurological research shows that different kinds of input are stored in different parts of our memory) and finally a bunch of production services (voice, body language and the fingers used to type this post - dictation is not good enough yet).
All of them are subject to errors.


I forgot one. The only one which doesn't make any errors (by definition): creation services.

Trustworthy? Make up your own mind.

Apparently there is a strong need for some kind of validation (a term silently introduced above) especially for human produced information. At least if we feel the need to think about trusting it or not.

My ideal information world?
OK +Google Glass  (or some other device or app), show me how reliable this information is and why.

- independence of original sources
- trustworthiness of the independent sources
- chain of dependent sources

The toolbox:
- annotate
- annotate
- annotate

Big data? Sure!
Feasible? Sure!
Done? Nope! (only partially for recent productions: retweets and re-shares e.g.).

Now it's time to use the magic word "Semantics".
One thing that is necessary is establishing the identity of the producer. Mind that for humans this is not necessarily an identification of the physical person; an avatar is fine. Nothing wrong with having multiple identities. There is only a need for enough data to establish some meaningful (initial) state of trustworthiness, authority or whatever term you would like to use.
The other is the meaning (read: semantics) of the information produced. Extracting it is getting better and better, but there is still much room for improvement (or perhaps we should not go for the 10% improvement but for the 10x solution, i.e. come up with an entirely new way). This is needed to trace back when and by whom the same information was given. Copy-paste of text is easy to detect, but we are still in the infancy of detecting copy-paste of meaning.

And a facilitating role we all have: cite your sources! This is common in scientific sources. But elsewhere? We will have another facilitating role, but that will be the subject of another post.

Why this post now and here?
How many books have been scanned and/or are available online? How many articles are online? And blog posts? Not to mention all the rest. The source data is there. And so are the (not yet perfect) techniques.
Computer power is also available.

What are we waiting for?


Friday, November 1, 2013

The n-triples, n-quads and .... n-quinds

Today I'm in a writing mood: Go for a third post.

RDF is basically a triple collection: <subject>, <predicate>, <object>
The object might be a literal.

One of the syntax formats is n-triples. Example from the linked page:

<>  <>    "Dave Beckett" .
<>  <>    "Jan Grant" .
<>  <>  _:a .
_:a                                    <>      "World Wide Web Consortium" .
_:a                                    <>     <> .

In triple stores, where lots of triplets are grouped, there is a need to register the origin of the triplet.
To be able to do so the triplets are extended with a fourth element: <context>.
And, as you have probably guessed already, the syntax format is n-quads.
So now we have <subject> <predicate> <object> <context>

Despite the name "context" this element is meant to represent the provenance of the triplet. Why wasn't it called <provenance>? No idea. But "context" is not wrong: it is to be interpreted as the production context (or source context) of the triplet. This is also called the graph (an RDF graph is a set of triplets).

But does this fourth element, the <context / provenance / graph>, fulfill all our needs?

For me it doesn't. It has undeniable value but it's not enough.

Who has never gone through the tough exercise of changing their primary email address? If you haven't, you are either lucky or smart (or both).
Changing your internet provider, changing jobs etc. might be the cause of this painful task (if you happened to tie your primary address to your ISP or company).
Thus the triplet <I> <email address> <> has two characteristics:
- it identifies me (the <I>) because nobody has the same address
- it is more or less temporary (it is valid for a period of time)

Especially this last information should be associated with the triplet.

There are ways to do so in RDF, but they are far more complicated than necessary.
A triplet (or statement) <subject> <predicate> <object> tells you something about the subject.
If I want to tell something about this statement the easiest way is to identify this triplet in some way and use that as a subject in another statement.

That's where the n-quind comes in (no hyperlink here because I invented the term - as far as I know).

The n-quind syntax:
<subject> <predicate> <object> <context> <id>

Note that the <context> element might become unnecessary because we can capture the same information like this:
<subject> <predicate> <object> <id-1>
<id-1> <context-predicate> <context> <id-2>
Here the <context-predicate> represents the implicit meaning of the fourth element of a quad.
But I chose not to confuse existing quad parsers, which would not know whether they are reading a <context> or an <id> in the fourth position.
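
A minimal sketch of how quinds could be held in memory and how a statement about a statement is found. The Quind class, the helper statements_about and all the ex: identifiers (including the predicate ex:validUntil and the example address) are hypothetical names invented for this illustration.

```python
from typing import NamedTuple


class Quind(NamedTuple):
    """<subject> <predicate> <object> <context> <id>"""
    subject: str
    predicate: str
    object: str
    context: str
    id: str


def statements_about(quinds, statement_id):
    """All quinds whose subject is the id of another statement."""
    return [q for q in quinds if q.subject == statement_id]


store = [
    # The statement itself: my email address, with its provenance and its own id.
    Quind("ex:me", "ex:emailAddress", "mailto:me@example.org",
          "ex:source-1", "ex:stmt-1"),
    # A statement about that statement: the address is only valid for a while.
    Quind("ex:stmt-1", "ex:validUntil", "2013-11-01",
          "ex:source-1", "ex:stmt-2"),
]
validity = statements_about(store, "ex:stmt-1")
```

The fifth element gives every statement a handle, so saying something about a statement is just another statement.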

One additional remark on the context. It is meant to represent the provenance, but it could be used to "identify" a graph with a single triplet. But I think that in doing so we move away from the originally intended meaning of provenance. Thus we end up with quinds.

And, yes, long, long ago, in a galaxy (not so far) away, this existed already.

2013-11-02 Edit: Added a forgotten id in the context-less example above.
2013-11-13 Edit: typos

RDF-guys, please make languages what they should be

Human readable RDF literals can be associated with a language tag in the form of

 <rdfs:label xml:lang="en">You name it</rdfs:label>

The xml:lang="en" is defined by RFC 3066, which refers to the standard ISO 639: a list of 2- and 3-letter language codes.

Looks all fine and well defined, and it is. But this way of looking at languages is a static box view.
And yes, I know that it is possible to extend these lists.

But have a look at what languages really are.

Book cover "Aux sources de la parole", Pierre-Yves Oudeyer

Just referring to +Pierre-Yves Oudeyer  in his recent book

a language is a communication consensus within a group of people.

As living people are not everlasting, the group changes over time and so does the language. Without going into the discussion of whether we are facing a language, a variant or a dialect: is the SMS language of our millennium kids English (for English-speaking kids of course)?

Linguists have grouped historical variants of languages in periods. In Dutch e.g. traditionally we distinguish Old Dutch (500 - 1150) , Middle Dutch ( till 1500) and modern Dutch (from 1500 till now).
But imagine the evolution of a language in a period of several centuries.

From a historical point of view I can understand that there was a need for static classification and the creation of fixed lists of language codes. But computers and their software have moved on.
We now have the possibility to record that a Dutch book of 1851 is written in "the Dutch language of 1851", and that a WhatsApp message of 2013-11-01 12:00:56 UTC is written in "the language of the social group" it was produced in and sent to, at that time.

Having stated this, back to our RDF.
Obviously the defined language tags (and their sub-tags) are outdated and do not correspond anymore to the fine-grained distinctions we would like to be able to make. Poor old spelling-checker!

I think the solution is simple (and that's why I like it).

Using URIs to refer to languages (instead of codes) gives us the flexibility we need. Linked and Open of course. These language URIs can be created by anyone, and by any service. Classical relationships between them can indicate hierarchy, similarity and so forth. A search for something in English can be expanded to all the URIs of all languages, variants and dialects of English if needed. But if I want a precise search in a specific language form, I can. At this moment I cannot.
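
To show how little is needed for the query expansion mentioned above, here is a small sketch. The example.org URIs and the "broader" mapping (each language form pointing to its more general parent) are made up for this illustration; real language URIs and relationships would come from Linked Open Data.

```python
def expand_language(uri, broader):
    """Return `uri` plus every language URI that is (transitively) narrower.

    `broader` maps a language URI to its more general parent URI.
    """
    narrower = {}
    for child, parent in broader.items():
        narrower.setdefault(parent, []).append(child)
    found, todo = set(), [uri]
    while todo:
        current = todo.pop()
        if current not in found:
            found.add(current)
            todo.extend(narrower.get(current, []))
    return sorted(found)


# Hypothetical language URIs and their hierarchy.
BROADER = {
    "http://example.org/lang/dutch/modern": "http://example.org/lang/dutch",
    "http://example.org/lang/dutch/1851": "http://example.org/lang/dutch/modern",
    "http://example.org/lang/dutch/middle": "http://example.org/lang/dutch",
}
all_dutch = expand_language("http://example.org/lang/dutch", BROADER)
only_1851 = expand_language("http://example.org/lang/dutch/1851", BROADER)
```

A broad query on Dutch reaches all four forms, while a query on the Dutch of 1851 stays precise.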

In fact there is one place where I can: in semantic network technologies this idea has been applied for more than two decades now, and it's just fun.

A little personal history

I'll start this series of posts with a short history of my personal relationship with semantics in the ICT world.

Back to the late eighties: the upcoming age of relational databases.

A few years before (around 1986) I installed my first computer network: 7 Mac Plus machines with 1 MB of RAM, a shared hard disk of 20 MB, a database server (Omnis) and ... yes, a network laser printer.
That was at the "Réserve Géologique de Haute Provence" where I was working.

I created several databases for this organization, but when I thought about making databases for my personal use, relational databases appeared not to be an acceptable solution. I have a lot of different domains of interest, which would imply several databases for each of these topics, but there was something more.
Already at that time I wasn't looking at the (information) world as a set of independent, isolated boxes. There are links between them. A sort of holistic view, although I was not aware of that term at that moment.

I sat back, hands behind my neck, and let my brain do the hard stuff: come up with a nice and clean solution.
This eventually resulted in 1991 in my first semantic network implementation: Notion System. Made in Hypercard, a fabulous tool at that time.

A few months ago I put some of the original generated HTML output of Notion System and the website online again: Notion System and a specific application for an inventory of caves in southern France.

Amongst other features the system had a data exchange format, Notion System Data Structure (NSDS), tag-based and inspired by SGML. To show you how close this was to XML: when the XML standard was published in 1996-1997 it took me a whole hour or so to adapt the export and import modules to produce and read XML instead of NSDS.

When I told people about this new information architecture, most found it a good and powerful idea. One category of people, though, had difficulty understanding it and was not willing to accept it as an alternative to relational databases: software engineers.

The only similar thing that existed towards the end of the century was Topic Maps (created more than 5 years later: between 1996 and 1998). Prior work on RDF also started around that time, but RDF itself is from 2004.

One version of Notion System was commercially implemented for an ethnobotanist in southern France in 1997, but for economic reasons I had to stop that activity and moved to the Netherlands.

During the first decade of this century I had the opportunity to work further on the concept of semantic network technologies at TNO and made, together with several other researchers, another, web-server-based implementation in Java. Some of the background information about our work at TNO is reproduced on my website.

Here ends this short history and my first post in this blog, but look out for the posts to come if semantics in computer science is among your interests.