Thursday, December 12, 2013

A leap in artificial intelligence ? Continued 2.


Picking up the thread of previous post I was asking myself the question:

Can our future artificial intelligence, the one that learns from associated parallel streams of information, learn everything needed to follow our trail?

In other words: is there any input (or output) of ours that is not easy for our AI to capture, or that cannot easily be associated with other kinds of input?

Scratch, scratch.  (OK AI, FYI: these are the words I use to describe the sound when I move my fingertips in a back and forth movement on the top of my head). (For future use.)

I can imagine 2 of them. You?

The first one that pops up is smell and taste. For me those two are so closely related that I put them on the same level. When preparing my food I never taste it: smelling it tells me the exact taste it will have and lets me adjust the seasoning. Feel free to consider them separate.
We already have a lot of gas sensor networks (and could easily extend them). But they detect only a small part of the molecules in the air.
And even if they detected them all, I'm far from sure there is enough data available to learn the step from a set of detected molecules to our three-way classification of odors (i.e. smells good, stinks, or smells like a hospital).
We do have our fruit esters, single molecules that define what we smell and that are well described, but are there gas detectors for them?

update: thanks +snakeappletree white-lightning-gate : Smell and taste from organic biological sensors connected to a symbiotic computer chip and synapse brain are already in development.



I hope you're all aware that we are one big walking set of touch sensors. Because that's the second one.
Now think of all the words you currently use to describe how something feels (physically).
Next, think of the kinds of sensors our AI must have to detect what might correspond to these words.
Sure, you came up with the thermometer, the high-resolution close-up camera (microscope) and several others.
But what about the movement of air from a butterfly's wing beat (the one that causes a tornado on the other side of the world), which the small hairs on my bare arm detect? Our anemometers? Nope. Indirectly a camera, perhaps.
The itch of a mosquito sting?

Of course there are other things that might be difficult for our AI to learn (like our emotions, although our body language does express a lot about our emotional state), but feel free to comment on important input I didn't mention.

I can hear you ask (OK AI: try to make sense of that): Why is he talking about these subjects?

I like to virtually dive into the deep blue from a helicopter view. Trying to see where creativity is needed to fill the gaps. Getting a reasonable feeling for where more head scratches are needed. Having a holistic view.
Trying to avoid the "Oh S.." reaction at an advanced stage.

In case you noticed that terms like "problem" or "cannot" do not occur very often in my writings: that's because it's not in my nature to think in those terms. Especially when innovating.



Let's move on to the second part of this post. The part where your active participation will be appreciated.


Before thinking about how we could make our


artificial intelligence that learns from associated parallel streams of information


it is useful to think a bit about what you want this AI to do.

You know how to use the comments, so go ahead. But read the rest of this post first.

Just as +carey g. butler asked in a comment on my previous post "What is learning?", you might ask yourselves "What is artificial intelligence?"

I cannot give you an answer. Can you?
+Charles Isbell gives a few examples he likes in a talk from two years ago. +Luciano Floridi has his own view too. And of course there are the Wikipedia pages for AI and strong AI.
Using your favorite search engine, you will find other "definitions" you might like (or not).

Personally I tend to consider this domain a continuum, ranging from Luciano's not-so-smart dishwasher robot, through our spelling checkers, smart cameras for collision detection, voice search and Google's self-driving car, up to the yet-to-come general cognitive AI.
What they have in common is an interaction or relation, direct or not, with human beings.


Almost independently of what you might consider to be AI, I like to separate the "do's" of such an AI into at least four groups of targets:

- for individuals
- for small groups of people that know each other
- for large groups of people that don't know each other
- for itself

The reason for this distinction is that the "how to's" might be different. But it is by no means a strict separation. We have always lived in some sort of society, which implies interactions of an individual with other entities (other individuals, organizations and objects). It is only a handy classification.


You and me of course.
But think also about older and younger generations. Your parents, grand-parents, your kids and the yet to be born. They probably have a view quite different from yours.
Your personal butler answering your questions, doing things for you, informing you. Teaching you.
How much of your personal information are you willing to provide to let the AI help you with your daily things? How much do you give already to services you use every day (with not much in return)? And what would you be willing to give if there were a guarantee that nobody else gets that information? (This is possible; see e.g. Wouter Teepe, Reconciling Information Exchange and Confidentiality: A Formal Approach.)


Your family, friends and professional contacts.
Negotiating the best moments for non-disruptive phone calls / hangouts, the place and time for business meetings, an interesting movie to watch together tomorrow evening.
Your home negotiating the price of electricity (for a particular period of the day) with the electricity company, if the dishwasher can be convinced to do its job at off-peak hours.


Town, country and world (universe ?)
The same electricity company as above negotiating with hundreds of thousands of houses. The chatting traffic lights (with others similar to themselves and with strange objects passing by on their two or four wheels). Sorry, my mistake: there will be no traffic lights anymore in the (near) future.
Ding, dong. Garbage bin 2, get yourself out, I'll be there in 7.46783 minutes.
Dear B.O., ten years from now the flooding risk in this area will have increased by 13%. The associated loss of human lives and economic damage will have increased by 27%. Starting the following actions within a year will reduce the damage to an increase of only 9% and will cost ...


The AI itself and its cousins
Without going through an exhaustive list you can find everywhere in the literature, I feel comfortable with two main categories of things the AI should do for itself:
- staying alive, i.e. up and running
- improving its performance

Now go wild and think about the wide variety of implications these two simple drives might induce.
And, of course: can they be learned?



Here is the game:

If you want to play, write a short comment on what you want such an AI to do.
Try to keep it short and to the point, enough to let me and the other readers get the idea.

I'll do my best to keep this post updated with your participation below this line.


First chapter Previous chapter Next chapter


Update 2013-12-14: Typo.

=============================================

You and me of course.
- Re affirm my own relevance in life and socialise with me in ways that suit my mood, varying from parent to disciple, teacher to student, friend to lover; companionship custom designed in ways that humans of our social indoctrination culture are incapable.  +snakeappletree white-lightning-gate


Your family, friends and professional contacts.



Town, country and world (universe ?)



The AI itself and its cousins
- I would like AI to grow naturally and have a 'good childhood'. When we move too fast, we make mistakes. Our growth should be deliberate and guided by our moral and ethical senses (compasses). Only in an environment that contains them will that which we create avoid being a danger to ourselves. +carey g. butler 
- I would like AI to understand well and be wise. AI should also be aware of the natural tendencies of any particular distribution involving life or choice to cleave itself into an elite of hegemonious control to feed itself upon the rest of the population (even Cancer shares these attributes). +carey g. butler 

Tuesday, December 10, 2013

A leap in artificial intelligence ? Continued.

About a month ago I wrote the first part of my thoughts about how to make the so much wanted leap in artificial intelligence.
Today I feel the need to release the next part.


Ever read a book? You probably did.
Ever spoken the text out loud? Perhaps when you were a kid.
Ever silently spoken it in your head? I did and do (even when writing this). And you?

Why?


Do you remember the time when your mother or father told you a story before going to sleep?
You probably don't remember what happened in your mind at those moments (the recall of our memories is not supposed to go that far back in time). But imagine what might have happened.
I think I made it into a film, losing part of what was told as my internal story followed its own scenario, and merging both stories again a few moments later.
Ever lost attention during a presentation? What happened?

Why?


The "man in the box" jumping out or a suddenly moving living statue induces strong reactions.

Why?


Look at this question: r thr n tp's lft n th txt y wrt?

Why?


Before you attempt to answer these four why questions, a quick tour of our current AI techniques.
Most (if not all) machine learning techniques are based on a simple idea:
- take a set of similar things (text, images, ...), labelled or not
- do some calculations with them (statistics, neural network training, ...)
- apply the results to another set of similar things
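As a toy illustration of these three steps, here is a hypothetical nearest-centroid classifier on 2-D points. All data and names are invented for the example; it is a sketch of the idea, not any particular library.

```python
# A minimal illustration of the three steps above, using a toy
# nearest-centroid classifier on invented 2-D points.

def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

# 1. take a set of similar things, labelled
training = {
    "cat": [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9)],
    "dog": [(4.0, 4.1), (3.8, 4.3), (4.2, 3.9)],
}

# 2. do some calculations with them (here: per-label centroids)
model = {label: centroid(pts) for label, pts in training.items()}

def classify(point):
    # 3. apply the results to another, similar thing
    def dist2(c):
        return (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2
    return min(model, key=lambda label: dist2(model[label]))

print(classify((0.9, 1.1)))  # cat
```

The same skeleton covers statistics and neural-network training alike; only step 2 changes.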


My previous post was centered around the idea of the parallel treatment of different kinds of things, and of associating them:
- written and spoken word
- image of an object and the word for it
etc.

But do we learn from static things?

Our environment is anything but static. Although there are static objects in it, we aren't. Which makes the whole thing dynamic.

When we started learning as a baby, quietly observing our world from our cradle, what did we see? Things that don't move and things that do. Some things move when you touch them. Everything moves when you're lifted out.

I think our capability of very quickly identifying static things comes from our experience in identifying things in a dynamic environment, not the other way around.
Our mind is trained to recognize things in a stream of information, and when provided with static things (the images in our childhood books, for example) we make them dynamic.
When looking at a painting of a still life, do you see it in 2D, or do you make it 3D and sort of feel the objects?

Now return to the why's. Apply the need for a stream of information to them. Expecting things to move or not.
Does it seem to make sense?

The typos example might be the most difficult. But look at a piece of text as a flow of words (we don't read the individual characters): anticipating the next words, filling in the missing ones, replacing the wrong ones, not seeing the misspellings. (By the way: did you pronounce the question?)
A real-life experience from several years ago is worth mentioning. For a knowledge management project we did several interviews, all recorded. During the transcription we discovered that one person, during a 90-minute interview, never finished a single sentence. We hadn't noticed until the transcription: we must have made up the ends of his sentences while he was talking.

Back to our artificial intelligence.

Imagine another kind of machine learning. One that learns from a flow of information, looking at the deltas between two moments in time (video and sound, but also text).
Instead of learning cat (and human) faces from independent still images, it will learn what cats are from sequences of images: their 3D appearance, their degrees of freedom, how they move, etc. Probably even reaction patterns. Once this is acquired it is easy to "create" a front view: the face of the cat.
Objects that do not change position can be viewed with a moving camera (or more than one camera).
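The "delta" idea can be sketched for an image stream. A minimal, hypothetical example, with frames as flat lists of pixel intensities: only the pixels that changed between two moments are passed on to the learner.

```python
def frame_delta(prev, curr, threshold=0):
    """Return (index, old, new) for each pixel that changed between two frames."""
    return [(i, p, c) for i, (p, c) in enumerate(zip(prev, curr))
            if abs(c - p) > threshold]

frame_t0 = [0, 0, 0, 9, 9, 0]   # a bright blob at positions 3-4 ...
frame_t1 = [0, 0, 9, 9, 0, 0]   # ... that moved one pixel to the left

changes = frame_delta(frame_t0, frame_t1)
print(changes)  # [(2, 0, 9), (4, 9, 0)] : the learner sees only the movement
```

A static scene produces an empty delta, which is exactly why a moving camera (or a second one) is needed for objects that don't move themselves.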
For images the base techniques exist; nothing new. We only have to ...

use them in our current artificial intelligence toolbox. This toolbox might be (almost) good enough, but we should apply it to learning different things (and tune a few things).
Summarized, this might give


Artificial intelligence that learns from associated parallel streams of information.


Does your mind start wandering, thinking perhaps about the "how to"?

Fine.

Sit back and relax.


First chapter Next chapter


2013-12-11 Edit: typos

Wednesday, December 4, 2013

What about dates? Solved?

You might think this is solved in our semantic, digitized world.
But as the title suggests, in my opinion it isn't. It could have been, though. And thus it still can be.

Did you start digitizing your old photos from prints or negatives? If so, did you start adding the Exif metadata?
Even if you didn't (yet), think about what you remember of where and when each photo was taken.

The Exif 2.2 specification gives us:
DateTimeOriginal
The date and time when the original image data was generated. For a DSC the date and time the picture was taken are recorded. The format is "YYYY:MM:DD HH:MM:SS" with time shown in 24-hour format, and the date and time separated by one blank character [20.H]. When the date and time are unknown, all the character spaces except colons (":") may be filled with blank characters, or else the Interoperability field may be filled with blank characters. The character string length is 20 bytes including NULL for termination. When the field is left blank, it is treated as unknown.
This seems a correct solution if you know at least the year. If you don't know the year exactly, you're already in trouble.
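The blank-filling flexibility of the spec can at least be honoured in code. A sketch (the fixed-width slicing follows the "YYYY:MM:DD HH:MM:SS" layout quoted above; returning only the known components is my own choice, not something the spec prescribes):

```python
def parse_exif_datetime(value):
    """Parse an Exif DateTimeOriginal string ("YYYY:MM:DD HH:MM:SS"),
    in which unknown components may be blank-filled per the spec.
    Returns only the components that are actually known."""
    slices = {"year": (0, 4), "month": (5, 7), "day": (8, 10),
              "hour": (11, 13), "minute": (14, 16), "second": (17, 19)}
    known = {}
    for name, (a, b) in slices.items():
        part = value[a:b]
        if part.strip():               # blank means "unknown"
            known[name] = int(part)
    return known

# A photo known only to be from 1956 (month, day and time left blank):
only_year = "1956:  :  " + " " + "  :  :  "
print(parse_exif_datetime(only_year))                # {'year': 1956}
print(parse_exif_datetime("2013:12:04 09:30:00"))
```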
But what about our tools? Just a small example from the one I use for most of my graphics work, GraphicConverter:



Even if the specification contains some degree of flexibility, not all our tools take this into account.

And, as most users will, I figured out a personal workaround. Compatibility? None. Yek.

I really would like the people who write standards (and the ones who make software) to think about what our real data world looks like. Not only in our all-digital world, but also at the frontier between the old analog world and the digital one.

The whole day long. How long is that? 24 or 48 hours?
A day takes 48 hours to travel around the world through the time zones.

Now look again at the Exif date definition above. Unless the date tag is also associated with the geolocation tags (which give us the time zone), this date definition doesn't tell us exactly when the photo was taken. Even specified to the second, its precision is 24 hours. Yek again.

Thus a date without timezone information (direct or indirect) is a period of 48 hours.
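That 48-hour window can be sketched with standard datetime arithmetic, using the classic ±12 offsets assumed above (today's real offsets stretch further, to UTC+14, which widens the window even more):

```python
from datetime import datetime, timedelta, timezone

def utc_window(year, month, day, west=-12, east=+12):
    """The UTC interval during which a bare calendar date is 'in effect'
    somewhere on Earth, assuming offsets from `west` to `east` hours."""
    local_start = datetime(year, month, day)
    # the date begins first at the easternmost offset ...
    earliest = local_start.replace(tzinfo=timezone(timedelta(hours=east)))
    # ... and ends last at the westernmost one
    latest = (local_start + timedelta(days=1)).replace(
        tzinfo=timezone(timedelta(hours=west)))
    return earliest, latest

start, end = utc_window(2013, 12, 4)
print(end - start)  # 2 days, 0:00:00 : a bare date is a 48-hour period
```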

Let me take a little sidestep here.
In some ontologies a "point" in time is defined. Although mathematically correct, it is of little practical use. In my perception the only valid concept when talking about time is the period. It is the units you're currently using that define whether 1956 is a point in time or not. When an ontology defines a "point" in time, it assumes there will never be a subdivision of the unit of time used to define that point. Third Yek.

In my post about information extraction I was essentially talking about entity identification.
And information extraction of dates?

As we saw above, to increase the precision we need geographical information. But there is more to it.

The Maya stela mentioned in the post gives us the birth of Lady K'atun-Ajaw Lady Namaan-Ajaw :
Source

tzik haab bolon pik lajchan winikhaab cha' tuun mih winal waklajun k'iin (spelling might vary)
Which is 9 Pih 12 K'atun 2 Tuns 0 Winal 16 K'in (Maya long count)
Using the famsi converter we obtain:
Friday, July 3, 674 CE
or
Sunday, July 5, 674 CE
or
Tuesday, July 7, 674 CE
or
... several other conversions are possible

Scientists do not agree on the correlation between the Maya calendar and our modern calendar (and that's fine; no Yek here). And even if they agreed now, would that still be the case tomorrow?
In this example, as the geographic coordinates are well known, the precision is related to the various possible correlations.
If we want to talk about this day of birth we have to indicate a period and the range of correlations taken into account. It might be 1 day if we take the most accepted correlation (in bold).

For the information-extraction side of this example, we should preserve the fact itself (9 Pih 12 K'atun 2 Tuns 0 Winal 16 K'in) and associate the derived information with it in order to calculate with it.

If you want to play a bit with calendars, the Swiss Fourmilab has a nice tool for it.


To proleptic or not to proleptic.
Writing:
"On October 10, 1582 ..."
Where was the author?

It certainly wasn't a Catholic European country, as this day never existed there: they switched from the Julian to the Gregorian calendar, with October 4 followed directly by October 15, 1582.
Perhaps the author was in the UK or one of its colonies (including the future USA), where the switch only occurred in 1752.
The proleptic Gregorian calendar doesn't take these missing days into account. You can find more on it on the Wikipedia page, including some examples of software using the proleptic calendar.
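Python's datetime module, for instance, uses the proleptic Gregorian calendar, so the day that never existed in Catholic Europe is a perfectly valid date there:

```python
from datetime import date

# Python's datetime uses the proleptic Gregorian calendar:
# October 10, 1582 never existed in Catholic Europe, yet it is valid here.
d = date(1582, 10, 10)
print(d.isoformat())  # 1582-10-10

# In the proleptic calendar the "missing" days are counted as normal days:
print((date(1582, 10, 15) - date(1582, 10, 4)).days)  # 11, not 1
```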


Modern programs/services can convert from one time system to another. So where are the remaining Yeks?

Almost all of the leftover Yeks are related to the context in which a particular date reference was produced. In other words: metadata. And this holds for computer-generated dates as well as for human-crafted material.

Apart from the geographic aspect, working with dates in our current calendar is fine most of the time.
But ...

What do we (= human beings) use in our daily life as dates? Does the Second World War ring a bell?
You probably know that neither its beginning nor its end came at the same moment for all the countries involved.
So what if I ask my future OK-system: "Which volcanoes erupted during WWII?"
I'm convinced you can fill in the Yeks here.

"Yesterday" makes sense if I know when that was written, right? Else tomorrow never will be.

Please fill in your own examples:
....
....
....


Perhaps it is time to reach a helping hand to our upcoming AI's.

Imagine a timeline (not a calendar) with a unit of one second and the possibility of adding a fractional part. It doesn't really matter whether the unit is the second or something smaller, as there will always be a fractional part.
The next bigger unit would be the minute, which is not decimal, so that doesn't seem a good choice.
Although the second was initially defined in relation to the duration of the day, that definition has been superseded by an atomic one.

Why is the choice of a non-earth-related unit important? If we want such a timeline to be applicable at the various time scales we use (astronomical, geological, archeological, historical, ...), we quickly notice that an earth-bound unit is not the best thing we can come up with. You do know that the earth's rotation is slowing down, don't you? And if you didn't, you do now.

It would also be handy (who talks like that?) if the timeline contained only positive numbers. The origin should then lie before the Big Bang. Let's say 100 billion years ago: that gives us a comfortable margin compared to the current estimate of 13.8.

A timeline of 64-bit signed integers gives us room for another 192 billion years or so. We'll have some time to think of a replacement system by then (unsigned integers would give us room for 484 billion years into the future, also fine).
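Such a timeline, with one bidirectional convertor for our own calendar, can be sketched as follows. Two details are my own assumptions: a Julian year of 365.25 days, and an origin anchored exactly 100 billion of those years before the Unix epoch (1970-01-01 UTC).

```python
from datetime import datetime, timezone

YEAR_S = 31_557_600                  # one Julian year (365.25 days) in seconds
ORIGIN = 100_000_000_000 * YEAR_S    # origin: 100 billion years before 1970-01-01 UTC

def to_timeline(dt):
    """Gregorian datetime -> positive integer second on the big timeline."""
    return ORIGIN + int(dt.timestamp())

def from_timeline(t):
    """The inverse convertor, back to a Gregorian datetime (UTC)."""
    return datetime.fromtimestamp(t - ORIGIN, tz=timezone.utc)

stamp = to_timeline(datetime(2013, 12, 4, tzinfo=timezone.utc))
assert stamp > 0 and from_timeline(stamp).year == 2013

# headroom left in a signed 64-bit integer: roughly 192 billion years
print((2**63 - 1 - ORIGIN) // YEAR_S)
```

Every other calendar would get its own pair of functions onto the same integer line, independent of this one.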

Each calendar, or whatever time reference system is used, can be calibrated on such a timeline; a unique bidirectional convertor is needed for each system. Each convertor uses the specific knowledge of its reference system and is independent of all the other convertors.


The image above comes from my post of a year ago in the Linked Open Vocabularies community which gives a little more technical details.

The first and only Woopah in this post.


Some decades ago (two, more precisely) I made such a system, called DatExpert (a HyperCard extension). And oh, what a wonderful and precise world that was.

Doing it again? With pleasure, and the sooner the better.
But not as a lone wolf anymore, and certainly as open source.

If you also feel the need to move our digital world a bit (and you're not the only one), I can set up a community for it.

Just re-share, +, and comment this post.

2013-12-05 : edit corrected some typos

Wednesday, November 6, 2013

Natural language learning app

For more than a year now I have tried to get Google involved in some way in a specific app I would like to create. Wrong contacts? Already working on something like it? No idea, but from now on the idea is out in the wild.

The best way to learn a foreign language is in contact with native speakers. Communicating, in other words.
But that's not often possible, certainly not for most people in the world. With +Project Loon on its way, an Internet connection soon will be.

And what if your partner were an artificial intelligence, one that patiently starts a conversation with you about cats and dogs and other familiar objects, up to the point (after some time) where you are fluent in the foreign language?

I made a short video of what this app might look like, a sort of video prototype. But don't expect the artificial mentor yet.




Subtitles are available in English, French and Dutch.
Budget for the video? €1.50 to make the "blue screen" used, and a few hours for me and the other participants. It was really fun to make. Perhaps there will be a "making of" with the most amusing moments; we have material enough.

I would really like this app to come into existence. And I think it is feasible.

If you like the idea of learning foreign languages this way, reshare this post with your contacts and +1 it.

If you would like to participate by writing code, creating the virtual worlds, or financially sponsoring the realization, please contact me (see the end of the video).
If you can help teaching the teacher, I'm also interested.

#NaturalLanguageLearningApp
twitter: nllapp

Monday, November 4, 2013

A leap in artificial intelligence ?

For a long time some of us, including myself, have been looking for a "real" artificial intelligence (intelligent program? you name it).

Able to learn.

So how do we learn?

As mentioned in a previous post, we can consider ourselves bunches of:
- sensors operating since our birth and even before that (eyes, ears, skin, ...)
- input interpretation services (brain)
- restitution services (memory)
- production services (voice, body)
- creation services (brain)

When we receive useful input we can do something with it. Otherwise the only thing we can do is ... well, store it and see whether it can be made useful at some point (or throw it away immediately).

I think sensor input is useful in two situations.
In the first, the input looks like something perceived before: not necessarily identical, but similar. So we can associate the new input with something in our memory, and thus with everything related to it.
Is that learning? I think so, because the variations in the input improve the related "pattern recognizer" and enable generalization.
In the second situation, to make an unrecognized input useful it should be associated with some context: an input from another kind of sensor. Something you see and the word for it (read or heard), or how to use it. Learning foreign languages works best that way.

Association seems to me the most important word here. If input is not associated with something, it is just ... isolated, and not useful (yet).

I know there is more to it but this is enough to put you on the track for the rest of this post.

In our digital world all we have are bits and pieces called bytes.
And I like to look at them as a flow.

A written text is nothing more than a flow of bytes, characters, words, sentences and paragraphs, depending on the scale at which you look at it.
Do you need an explanation for a sound track?
Even a still image can be seen as a flow of pixels, one line after another.

The inputs in a stream are associated in a temporal way (one after another) and/or in a spatial way (one line of an image above another). A flow of text? Time or space? Hard to say, but does it matter?

In a video with subtitles you have all of it together, of course. Except for the smell; at least not yet.

Now imagine a flexible, adaptive pattern analyzer/recognizer on one of these streams, breaking it into chunks that are similar to previously seen chunks, or breaking it into chunks because the stream itself has patterns that return once in a while. Would you be able to make one for a flow of text?
Adaptive? Absolutely necessary. If your premises are wrong when configuring such a pattern analyzer, it will fail at some point. If its mission is "find large patterns" and its toolbox is good enough to try hard (e.g. change the scale of the scrutinizing window), it might still fail (if there is no pattern), but at least not because it had the wrong premises. (Large means: go beyond the 0's and 1's; that scale is not very useful.)
Oh, one important instruction: never look back.
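The one-pass, never-look-back chunking can be caricatured with an LZ78-style parse: read the stream once, left to right, and cut a chunk as soon as it no longer matches something seen before. A sketch, with a deliberately simplified chunking strategy:

```python
def chunk_stream(stream):
    """One-pass chunker: never looks back, and grows its inventory of
    previously seen chunks as the stream flows by (LZ78-style caricature)."""
    seen = set()
    chunks, current = [], ""
    for symbol in stream:
        current += symbol
        if current not in seen:       # first time we meet this chunk
            seen.add(current)
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)        # whatever is left at the end of the stream
    return chunks

print(chunk_stream("abababab"))  # ['a', 'b', 'ab', 'aba', 'b']
```

Note how the chunks grow as the stream repeats itself: the analyzer discovers ever larger patterns without ever rereading old input.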

We will probably need only a small set of instructions to make a pattern analyzer (its toolbox): one that is also able to differentiate the various kinds of streams (text, sound, image, ...), although hints might help of course.

You can call these chunks neurons if you like.
They are associated with other chunks in the stream (positional dependence) and, hopefully, with "old" chunks seen before. And with chunks from other kinds of parallel streams (e.g. a spoken and a written word, or a written word and an image).

Each time associated chunks in a stream (or across parallel streams) find an echo in the old chunks, the link between these chunks is reinforced. We can imagine a decay function that weakens links over time and eliminates them (forgetting) if they are not reactivated. This will remove childhood errors in the chunking of the pattern analyzer.
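A minimal sketch of such reinforcement and decay (all parameter values are invented, and the chunk names are purely illustrative):

```python
class Associations:
    """Toy link store: co-occurring chunks reinforce a link,
    a decay pass weakens all links and forgets the weakest."""
    def __init__(self, decay=0.9, forget_below=0.1):
        self.weight = {}
        self.decay_factor, self.forget_below = decay, forget_below

    def reinforce(self, a, b, amount=1.0):
        key = frozenset((a, b))
        self.weight[key] = self.weight.get(key, 0.0) + amount

    def decay(self):
        for key in list(self.weight):
            self.weight[key] *= self.decay_factor
            if self.weight[key] < self.forget_below:
                del self.weight[key]     # forgotten

links = Associations()
links.reinforce("spoken 'cat'", "written 'cat'")
links.reinforce("spoken 'cat'", "written 'cat'")  # echo: reinforced
links.reinforce("spoken 'cat'", "image of dog")   # a childhood error
for _ in range(25):
    links.decay()
# the reinforced link survives 25 decay steps; the one-off error is forgotten
print(len(links.weight))  # 1
```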

A pattern analyzer, preferably the same code but with different (learned) settings, can be used to discover patterns in associated chunks, leading to higher-level concepts. (Note the shift from chunk to concept.)


What's different compared to actual machine learning techniques?

First we could ask whether this is supervised or unsupervised learning. Supervised machine learning is applied when you have a labelled data set and the program finds the optimal parameters to match the data with the provided labels (a good example is optical character recognition, OCR).
You could look at two parallel streams as one being the data and the other the labels. Only here you do not look for parameters to match the two: they are matched (I should say associated) by definition.
So it doesn't really look like supervised learning.
And unsupervised? With this kind of machine learning you have (preferably a large amount of) the same kind of data (as in this Google experiment) and try to find the higher-level concepts in it. One of the essential differences here is that we look at different kinds of things (written and spoken words).
It doesn't feel like a good match with unsupervised learning either.

Second, this is lifelong learning (and appropriate forgetting) from streams of data, not just training on a fixed set and then applying the result to other data. Google's self-driving car does the same thing, but not all systems do.



I have the feeling (not just a dream) that this is the right track to follow.

And you?


Next chapter

Saturday, November 2, 2013

Information extraction, yes ... but the right way !


To start this post have a look at the images below (click to enlarge them).
The first two are a Maya stela (the back of Stela 3 from Piedras Negras) and a drawing of its upper part.


The next image is a representation of the information that can be extracted from the story told by the stela.



Especially the last image gives you an idea about my definition of "information extraction".

If you want to know a little more about this particular stela, there is a nice interactive tool, including the pronunciation. And if you want to learn the Maya glyphs I can recommend Reading the Maya Glyphs, which I use myself.

In line with the work to be done to handle all kinds of source documents I'm currently developing another tool: a smart manuscript editor assistant.

The screenshot above shows a manuscript in medieval Latin, written between 1137 and 1150, from the Bibliothèque nationale de France and available online.
I'll talk a bit more about the lacking tooling in a future post.

The rest of this post will be illustrated by another project: the reproduction of a Dutch book from 1851, written by one of my ancestors: Beschrijving van het Hertogdom Limburg (Description of the Duchy of Limburg, a region covering parts of the present-day Netherlands, Belgium and Germany).
The pages are displayed, as much as possible, in the original layout. Hovering over an identified term brings up a box displaying additional information. This additional information comes from a semantic network and essentially from the Dutch DBpedia (Linked Open Data).




Let's have a look at some parts of the text.
Even if you don't understand Dutch you should be able to get the idea. Don't Panic! (Recall the 42 reference of the previous post?)
 page 8
 ...
 Het Hertogdom ging na den slag van Woeringen in 1288
 aan Braband over , onder welks heerschappij het langen
  tijd verbleef.
  ...
 (Translation: After the battle of Woeringen in 1288 the Duchy
 passed to Brabant, under whose rule it remained for a long time.)
  
 page 22
 ...
 gevonden,  werden eindelijk de verschillende legers, op 
 den 5. Junij 1288, handgemeen, en er werd toen een ver- 
 maarde slag geleverd , tusschen Nuijs en Keulen op de 
 Fuhlinger heide ‚ bij Woeringen , bekend onder den slag 
 van Woeringen , welke van ‘s morgens 9 ure tot laat in den
 avond had geduurd.
 ...
 (Translation: ... the various armies finally came to blows on 5 June 1288,
 and a famous battle was then fought between Nuijs and Keulen on the
 Fuhlinger heath, near Woeringen, known as the battle of Woeringen,
 which lasted from 9 in the morning until late in the evening.)
 
From these two page excerpts we can extract the following facts:
 page 8
 => ”slag van Woeringen” ”in” ”1288”
 
 page 22
 => ”slag van Woeringen” ”op den” ”5. Junij 1288”
 => ”slag van Woeringen” ”van”  ”‘s morgens 9 ure”
 => ”slag van Woeringen” ”tot” ”laat in den avond”
 
Looking at these facts we can say in the first place:
  • there is an entity [1] called ”slag van Woeringen” on page 8
  • there is an entity [2] called ”slag van Woeringen” on page 22
The question that has to be answered is whether these entities are the same.
In this simple case the answer is yes, with a very high probability.
  • page 8 says that this entity happened in 1288 and has the label "slag van Woeringen" (English: Battle of Woeringen)
  • page 22 says that it happened on 1288-06-05 (from 09:00 till late in the evening) and also has the label "slag van Woeringen"
And
  • Both entities have the same label
  • The statement on page 22 gives a more precise date than the one on page 8
  • There are no incompatible facts.
We can now draw the conclusion that entity [1] and entity [2] are the same with e.g. 0.98 probability. Note that a probability of 1.0 would represent certainty.
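That reasoning can be sketched as a toy scoring function. All the numbers below are invented for illustration; a real system would learn such weights rather than hard-code them:

```python
def same_entity_probability(e1, e2):
    """Hypothetical scoring of the reasoning above (numbers invented):
    identical labels push the probability up, a compatible but more
    precise date pushes it further, an incompatible fact pushes it down.
    Note that 1.0 (certainty) is never returned."""
    if e1["label"] != e2["label"]:
        return 0.05                    # different labels: probably not the same
    p = 0.80                           # same label
    if e1["date"][:4] != e2["date"][:4]:
        return 0.05                    # different years: an incompatible fact
    if e1["date"] != e2["date"]:
        p += 0.18                      # one date refines the other
    return round(min(p, 0.99), 2)

page8  = {"label": "slag van Woeringen", "date": "1288"}
page22 = {"label": "slag van Woeringen", "date": "1288-06-05"}
print(same_entity_probability(page8, page22))  # 0.98
```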

Certainty does exist, but not for this conclusion. What we do know for sure is that entity [1] and entity [2] both have the same label (slag van Woeringen), because that is how they have been defined.

It is also very probable that the language of the labels is Dutch. But not certain. We do know for sure that if these labels are in Dutch, then that Dutch is the Dutch of 1851, the year in which the book was published. (For more on the ins and outs of languages, see my post RDF-Guys...)

Within the context of this book we are also sure of e.g. the date of the battle, because it is a fact mentioned in the book. But this doesn't mean that other dates can't be given elsewhere, or that the date of the battle is a certainty. It's only a certainty within the context of this book.

The book provides other facts on this battle on pages 23, 24, 26, 145 and 150, about the opponents involved, the outcome and the consequences. These are not mentioned in this short explanation because they don't add anything new to the process of identifying entities. They do, of course, play a role in identifying the other entities mentioned in the book.

These other facts also play a role in identifying the virtually unique entity "slag van Woeringen" of this book with entities defined elsewhere. Something we are going to see next.
What is important to retain from this example is that we have to think about what is known with what probability. In other words: ask the right questions.

One of the targets of this way of identifying entities in this book (and other sources) is to enrich world knowledge about them. We expose the extracted facts as Linked Open Data, joining the ever-growing, freely accessible, interconnected world knowledge.
For this we have to establish an identity relationship between the entities in this book and other entities defined elsewhere. Only in that way can we increase world knowledge about them.
Let's have a look at an excerpt of the RDF of the Dutch DBpedia:
 
<rdf:Description rdf:about="http://nl.dbpedia.org/resource/Slag_bij_Woeringen">
 <rdf:type rdf:resource="http://schema.org/Event"/>
 <rdfs:label xml:lang="nl">Slag bij Woeringen</rdfs:label>
 <dbo:startDate rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1288-06-05</dbo:startDate>
 …
</rdf:Description>
 
Remember that each time an entity is mentioned in the book it is a separate entity, with only a certain probability of being identical to others in the same source.

For those that are identical beyond a certain threshold we can create a virtual entity merging all the facts of its components. This entity is used to try to establish an identity relationship with external data.
The process of comparing the facts of this virtual entity (extracted from the book) with the facts of e.g. the Dutch DBpedia is strictly identical to the comparison of the facts known about the different entities within the book itself.

In this case (and only based on the label and date for this example) the DBpedia entity known as ”http://nl.dbpedia.org/resource/Slag_bij_Woeringen” is, with a high probability (e.g. 0.96), identical to our virtual entity ”slag van Woeringen” from the book.

Once this probable identity is established, we can apply a feedback loop. We use the data provided by the external source to check the facts of each individual entity in the book: both the ones that were part of the virtual entity initially used, and other entities that were not part of it but do have a relatively high probability (below the threshold) of being the same one. This feedback loop might reinforce or decrease the probabilities of the individual entities (in relation to each other).
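The merge-and-compare step can be sketched in the same toy style: mentions above the threshold are pooled into a virtual entity, whose facts are then checked against an external source. The merge rule (keep the most precise, here simply the longest, value) and all names are assumptions for illustration.

```python
def merge(mentions: list) -> dict:
    """Pool the facts of all book mentions into one virtual entity,
    keeping the most precise (here: longest) value per predicate."""
    virtual = {}
    for mention in mentions:
        for predicate, value in mention.items():
            if len(value) > len(virtual.get(predicate, "")):
                virtual[predicate] = value
    return virtual

mentions = [
    {"label": "slag van Woeringen", "date": "1288"},        # page 8
    {"label": "slag van Woeringen", "date": "1288-06-05"},  # page 22
]
virtual = merge(mentions)

# Feedback loop: an external fact (e.g. the dbo:startDate from the
# DBpedia excerpt quoted above) confirming the pooled date would
# reinforce the probabilities of the individual mentions.
external_start_date = "1288-06-05"
print(virtual["date"] == external_start_date)  # True
```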

This whole process simulates the way we, the non-bits-and-bytes, "recognize" things. We use our previously acquired knowledge (the external source) to "identify" or "recognize" what the book is about.
At the same time we extend our knowledge with the new facts that the book provides.
Yeh, we are learning!

But computers are good at crunching data, big data. So let them do it.

And, don't forget, they will have to be able to explain why they established this identity (see my post on trust for more details).
And the extracted facts (triplets, quads or quinds) have to carry a reference to where they came from (sentence, paragraph, page and book, in whatever machine-exploitable form).
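As a minimal illustration of such a reference-carrying fact (the field names and the source notation are made up for this sketch):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    source: str   # machine-exploitable provenance: book, page, sentence

fact = Fact(
    subject="slag van Woeringen",
    predicate="date",
    obj="1288-06-05",
    source="book:example/page:22/sentence:3",  # hypothetical notation
)
print(fact.source)  # book:example/page:22/sentence:3
```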


Introspection
You might think that there is some overkill in this way of handling information extraction. Perhaps you're right in some situations, but this book happens to contain a concrete example where this way of processing shows its importance.
One chapter deals with a person called "Jan van Weert" and describes a long list of his activities. A few years after this book was published, a historian established "with certainty" that there had been a confusion between two persons. Some of the facts associated with the "Jan van Weert" from the book belong to one, some belong to the other.
Another example goes back to the mid-nineties (see my short history post). In one of the interviews an ethnobotanist conducted, a woman used a specific name to indicate two different plants. For her it was the same plant. It appeared impossible to associate each of her uses of that name with one plant or the other. It's the inverse of the previous example, and no identity mapping is possible. This is not a problem and there are ways to handle this.

A little note about the book's website.
One of the external libraries I use is a bit buggy. If you use the content menu to jump to a section in the book, the book might freeze. I'm working on fixing this, so please be patient.
Part of this post is also available in the about page of the book.

Edit 2013-11-04: The Duchy of Limburg doesn't include parts of Luxemburg of course.
Edit 2013-11-13: Typos

Trust? ==> 0 or 100 ... or perhaps 42 ?





When I ask the Google voice search:

"What time is it in Mountain View?"

Why do I trust the answer returned?











And when I ask:

"Who invented the world?"


the answer brings tears to my eyes from laughing?


You know the answer should have been:

42


If you didn't know that, I recommend this book.





The first level of trust in these two examples is in the speech recognition. The app provides feedback (it displays the recognized text), so I can check whether the transcription has been done correctly. Elementary, but undeniably necessary to establish trust in the answer.

Next comes the answer. In the first example, over a short period of time I came to trust the kind of services that provide the current time all around the world, based on my knowledge that these services are not difficult to build and that the only thing that matters is to correctly identify the place asked for.

And the second case? Does it degrade my built-up trust in the quality of Google's search results?
Nope! But why?
Save for the 42, I knew that there was no perfect answer, but I can also explain why +Tim Berners-Lee popped up, and I know how this service can be improved in order not to make mistakes like that.
If Google had used its knowledge graph, which it obviously didn't, it would have known that "World" and "World Wide Web" are different things.

When exchanging information with the help of computer systems, three kinds of "things" need to build up their own trustworthiness:
- services (that they do what they are supposed to do)
- persons
- information

The rest of this post will essentially deal with the last element of this list, but let's have a quick look at the first two.

Trust in services
How do we evaluate the trustworthiness of services? I imagine that my first trip in a self-driving car will be extremely stressful for me (not for the car, of course). During the fifth I'll read and answer my mail, and on the tenth trip I'll take a nap.
Btw, I tend to include organizations here too, but that's of course subject to discussion.
Positive feedback from other persons using the same service accelerates the trust building. Negative feedback, on the other hand ...
This naturally leads to 

Trust in persons
Complex matter. Trusting that actions will be executed timely and correctly? That the information provided is true?
For me the key elements here are: trust evolves over time; at any moment doubt can pop up; other people's opinions do matter (they certainly influence the initial trust state you have in someone), but less than your own perceptions; and finally, trust in persons is very fine-grained (on some aspects I can trust someone almost completely, on others less). If personal contact exists (directly, or with some communication system in between - video, phone), body language (including voice) is an important factor.
Referring to the negative feedback above: if that feedback was given by a person I trust (on that specific topic, e.g. an authority in the field), it might by far outweigh thousands of positive feedbacks from people I do not know.
And the feedback provided, well ... it's information.

Trust in information
The level of trust we have in information depends on the trust we have in the original provider, the trust we have in the whole chain of intermediate transmissions and/or transformations, and ultimately the trust we have in our own sensory input.

I like to make a distinction between carbon-based and silicon-based producers (although silicon is perhaps a bit too imprecise).

Starting with the latter, we can separate them into two categories: the ones providing raw data (e.g. a surveillance video camera) and the ones providing interpreted data (a temperature sensor, a smoke detector or a smart collision-detection camera in a tunnel). These last ones can be seen as a combination of a raw sensor and one or more chained interpretation services (software in most cases).
As sensors are relatively cheap, multiplying the number of sensors can often validate the produced information. In the case of interpreted data, we should of course prefer different interpretation services to avoid or detect errors in any specific one.
Taking a helicopter view, what is a search engine? Input: the Internet (text, images, video); services: indexing, filtering; output: references related to a query. Or like in this example: input: YouTube still frames, output: cats and faces.
Thus the frontier between traditional sensors of our physical world and sensors of our digital worlds (a crawl bot, or a collision detector in a virtual world) is perhaps not so clear-cut as it might appear initially.

That leaves the last element, the human producer, before we get to the essence of this post (sorry for the long introduction). Yes, I know that dogs are also producers (of ...).

In some sense we are a bunch of sensors operating since our birth, a bunch of input interpretation services called the brain, a bunch of restitution services called memory (recent neurological research shows that different kinds of input are stored in different parts of our memory) and finally a bunch of production services (voice, body language and the fingers used to type this post - dictation is not good enough yet).
All of them are subject to errors.

Hmm...

I forgot one. The only one which doesn't make any errors (by definition): creation services.

Trustworthy? Make up your own mind.


Apparently there is a strong need for some kind of validation (a term silently introduced above), especially for human-produced information. At least if we feel the need to think about trusting it or not.

My ideal information world?
OK +Google Glass  (or some other device or app), show me how reliable this information is and why.

Requirements:
- independence of original sources
- trustworthiness of the independent sources
- chain of dependent sources

The toolbox:
- annotate
- annotate
- annotate

Big data? Sure!
Feasible? Sure!
Done? Nope! (only partially for recent productions: retweets and re-shares e.g.).

Now it's time to use the magic word "Semantics".
One thing that is necessary is establishing the identity of the producer. Mind that for humans this is not necessarily an identification of the physical person; an avatar is fine. Nothing wrong with having multiple identities. There is only a need for enough data to establish some meaningful (initial) state of trustworthiness, authority or whatever term you would like to use.
Then there is the meaning (read: semantics) of the information produced. This is getting better and better, but there is still much room left for improvement (or perhaps we should not go for the 10% improvement but for the 10x solution - i.e. come up with an entirely new way). This is needed to trace back when and by whom the same information was given. Copy-paste of text is easy to detect, but we are still in the childhood of detecting copy-paste of meaning.

And there is a facilitating role we all have: cite your sources! This is common in scientific publications. But elsewhere? There is another facilitating role, but that will be the subject of another post.

Why this post now and here?
How many books have been scanned and/or are available online? How many articles are online? And blog posts? Not to mention all the rest. The source data is there. And so are the (not yet perfect) techniques.
Computer power is also available.

What are we waiting for?

42?








Friday, November 1, 2013

The n-triples, n-quads and .... n-quinds

Today I'm in a writing mood: Go for a third post.

RDF is basically a triple collection: <subject>, <predicate>, <object>
The object might be a literal.

One of the syntax formats is n-triples. Example from the linked page:

<http://www.w3.org/2001/08/rdf-test/>  <http://purl.org/dc/elements/1.1/creator>    "Dave Beckett" .
<http://www.w3.org/2001/08/rdf-test/>  <http://purl.org/dc/elements/1.1/creator>    "Jan Grant" .
<http://www.w3.org/2001/08/rdf-test/>  <http://purl.org/dc/elements/1.1/publisher>  _:a .
_:a                                    <http://purl.org/dc/elements/1.1/title>      "World Wide Web Consortium" .
_:a                                    <http://purl.org/dc/elements/1.1/source>     <http://www.w3.org/> .

In triple stores, where lots of triples are grouped, there is a need to register the origin of each triplet.
To be able to do so, the triplets are extended with a fourth element: <context>.
And, you guessed it already I suppose, the syntax format is n-quads.
So now we have <subject> <predicate> <object> <context>

Despite the name "context", this element is meant to represent the provenance of the triplet. Why wasn't it called <provenance>? No idea. But "context" is not wrong: it is to be interpreted as the production context (or source context) of the triplet. This is also called the graph (an RDF graph is a set of triplets).

But does this fourth element, the <context / provenance / graph>, fulfill all our needs?

For me it doesn't. It has undeniable value but it's not enough.

Who has never gone through the tough exercise of changing their primary email address? If you haven't, you are either lucky or smart (or both).
Changing your internet provider, changing jobs etc. might be the cause of this crude task (if you happened to associate your primary address with your ISP or company).
Thus the
 <I> <email address> <xyz@something.com> has two characteristics:
- it identifies me (the <I>) because nobody has the same address
- it is more or less temporal (it is valid for a period of time)

Especially this last information should be associated with the triplet.

There are ways to do so in RDF but they are far more complicated than necessary.
A triplet (or statement) <subject> <predicate> <object> tells you something about the subject.
If I want to tell something about this statement the easiest way is to identify this triplet in some way and use that as a subject in another statement.

That's where the n-quind comes in (no hyperlink here because I invented the term - as far as I know).

The n-quind syntax:
<subject> <predicate> <object> <context> <id>

Note that the <context> element might become unnecessary because we can capture the same information like this:
<subject> <predicate> <object> <id-1>
<id-1> <context-predicate> <context> <id-2>
Here the <context-predicate> represents the implicit meaning of the fourth element of a quad.
But I chose not to confuse existing quad parsers, which would not know whether they are reading a <context> or an <id> in the fourth position.

One additional remark on the context. It is meant to represent the provenance, but it could also be used to "identify" a graph consisting of a single triplet. I think that in doing so we would be moving away from the initially intended meaning of provenance. Thus we end up with quinds.
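A toy sketch of the quind and its context-less alternative, using plain tuples; the ex: names and the context-predicate are invented for illustration:

```python
# <subject> <predicate> <object> <context> <id>
quind = ("ex:I", "ex:emailAddress", "xyz@something.com",
         "ex:graph1", "ex:stmt1")

# Context-less form: two quads, the second one using the id of the
# first to state its context.
quads = [
    ("ex:I", "ex:emailAddress", "xyz@something.com", "ex:stmt1"),
    ("ex:stmt1", "ex:context", "ex:graph1", "ex:stmt2"),
]

# Because the statement now has an id, temporal validity (the email
# address example above) can be attached in exactly the same way:
quads.append(("ex:stmt1", "ex:validUntil", "2013-12-31", "ex:stmt3"))

print(quads[1][0] == quads[0][3])  # True: the id links the statements
```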

And, yes, long, long ago, in a galaxy (not so far) away, this existed already.


2013-11-02 Edit: Added a forgotten id in the context-less example above.
2013-11-13 Edit: typos

RDF-guys, please make languages what they should be

Human readable RDF literals can be associated with a language tag in the form of

 <rdfs:label xml:lang="en">You name it</rdfs:label>

The xml:lang="en" is defined by RFC 3066, which refers to the ISO 639 standard: a list of 2- and 3-letter language codes.

Looks all fine and well defined, and it is. But this way of looking at languages is a static box view.
And, yes, I know that there is a possibility to extend these lists.

But have a look at what languages really are.

Book cover "Aux sources de la parole", Pierre-Yves Oudeyer


Just referring to +Pierre-Yves Oudeyer  in his recent book


a language is a communication consensus within a group of people.








As living people are not everlasting, the group changes over time and so does the language. Without going into the discussion of whether we face a language, a variant or a dialect: is the SMS language of our millennium kids English (for English-speaking kids, of course)?


Linguists have grouped historical variants of languages into periods. In Dutch, e.g., we traditionally distinguish Old Dutch (500 - 1150), Middle Dutch (till 1500) and Modern Dutch (from 1500 till now).
But imagine the evolution of a language in a period of several centuries.

From a historical point of view I can understand that there was a need for static classification and the creation of fixed lists of language codes. But computers and their software have moved on.
We now have the possibility to record that a Dutch book of 1851 is written in "the Dutch language of 1851", and that a WhatsApp message of 2013-11-01 12:00:56 UTC is written in "the language of the social group" it is produced in and sent to, at that moment.

Having stated this, back to our RDF.
Obviously the defined language tags (and their sub-tags) are outdated and no longer correspond to the fine-grained distinctions we would like to be able to make. Poor old spell-checker!

I think the solution is simple (and that's why I like it).

Using URIs to refer to languages (instead of codes) enables the flexibility we need. Linked and Open, of course. These language URIs can be created by anyone, and by any service. Classical relationships between them can indicate hierarchy, similarity and so forth. A search for something in English can group (or better, expand the query to) the URIs of all languages, variants and dialects of English if needed. But if I want a precise search in a specific language form, I can do that too. At this moment I cannot.
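A small sketch of how such language URIs could work; the URIs and the broader/narrower table are invented for illustration:

```python
# Narrower language form -> broader language, as anyone could publish it.
broader = {
    "http://example.org/lang/en-sms-2013": "http://example.org/lang/en",
    "http://example.org/lang/en-GB":       "http://example.org/lang/en",
    "http://example.org/lang/nl-1851":     "http://example.org/lang/nl",
}

def expand(language_uri: str) -> set:
    """Expand a query language to itself plus all its (transitively)
    narrower variants and dialects."""
    result = {language_uri}
    changed = True
    while changed:
        changed = False
        for narrow, broad in broader.items():
            if broad in result and narrow not in result:
                result.add(narrow)
                changed = True
    return result

# A search "in English" covers both variants; the 1851 Dutch stays out.
print(len(expand("http://example.org/lang/en")))  # 3
```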

In fact there is one place where I can: in semantic network technologies this idea has been applied for more than two decades now, and it's just fun.

A little personal history

I'll start this series of posts with a short history of my personal relationship with semantics in the ICT world.

Back to the late eighties: the upcoming age of relational databases.

A few years earlier (around 1986) I installed my first computer network: seven MacPlus machines with 1 MB of RAM, a shared hard disk of 20 MB, a database server (Omnis) and ... yes, a network laser printer.
That was at the "Réserve Géologique de Haute Provence", where I was working.

I created several databases for this organization, but when I thought about making databases for my personal use, relational databases appeared not to be an acceptable solution. I have a lot of different domains of interest, which would mean several databases for each of these topics, but there was something more.
Already at that time I wasn't looking at the (information) world as a set of independent, isolated boxes. There are links between them. A sort of holistic view, although I was not aware of that term at the moment.

So I sat back, hands behind my head, and let my brain do the hard stuff: come up with a nice and clean solution.
This eventually resulted in 1991 in my first semantic network implementation: Notion System. Made in Hypercard, a fabulous tool at that time.

A few months ago I put some of the original generated HTML output of Notion System and the website online again: Notion System and a specific application for an inventory of caves in southern France.

Amongst other features the system had a data exchange format, the Notion System Data Structure (NSDS). Tag-based and inspired by SGML. To show you how close this was to XML: when the XML standard was published in 1996-1997, it took me a whole hour or so to adapt the export and import modules to produce and read XML instead of NSDS.

When telling people about this new information architecture, most found it a good and powerful idea. One category of persons, though, had difficulties understanding it and was not willing to accept it as an alternative to relational databases: software engineers.

The only similar thing that existed towards the end of the century was Topic Maps (created more than 5 years later: between 1996 and 1998). Prior work on RDF also started around that time, but RDF itself is from 2004.

One version of Notion System was commercially implemented for an ethnobotanist in southern France in 1997, but for economic reasons I had to stop that activity and moved to the Netherlands.

During the first decade of this century I had the opportunity to work further on the concept of semantic network technologies at TNO and made, together with several other researchers, another, web-server-based, implementation in Java. Some of the background information about our work at TNO is reproduced on my website.

Here ends this short history and my first post on this blog, but look out for the posts to come if semantics in computer science belongs among your interests.