Wednesday, January 8, 2014

Personal Archive DIgitization Project

By the time you read this, the final #PADIP countdown has reached 0.

When this countdown started at 25, there was a 25 cm pile of documents left to be scanned.

The last 25 cm pile of documents to be scanned.

So what is this Personal Archive DIgitization Project?

Several years ago I decided that it was time to get rid of the numerous piles of papers (and boxes) I had kept since the age of 14: 43 years of archives.

From before that time there is not much, just a few things from my parents' archives that have survived until now, some of which go back to my birth year. Like this one:

The card my mother made to chart my weight since birth.

I'm talking about scanning 14000+ documents (estimated at 23000 pages) and some 6000 photos.
My first photos are from March 10, 1966: several shots of the live television broadcast of the wedding of Princess Beatrix (our former queen).

Most of the photos were scanned from the negatives by an external company, though (and there are still some 400 photos and 300 slides left to do).

These figures are very low compared to Google's millions of book scans, but for a personal project they are considerable. At least, it took me several years of work in my already very busy spare time.

And, out of nostalgia, I kept two folders of paper documents of particular emotional value.

Phase 1 of PADIP: get rid of (most of) the paper (and make it available for recycling). => Done

OK, now I have a hard disk with these files (and backups of course), and an operating system that, thanks to reasonably good OCR, can easily find what I'm looking for in the documents from the typewriter and printer ages.
The scans of the handwritten material will have to be rerun through the next generation of OCR software. For now the only clues are in the meaningful file names.

In parallel to these scanned documents there are of course the digital-age sources, with some digital record management headaches ahead. These digital archives start in 1986 and are continuously growing. Numbers? No precise idea yet, but a rough estimate gives me 20000 emails, 10000 documents and 16000 digital photos.

There is also some structured data available. Just two small examples: since 2008 (my first iPhone) my position has been recorded every 5 seconds when I'm on the road. That makes 1800+ trails. And there is a database of my 1500 books with the date I bought them, price, pages, dimensions etc. (Yet another scan project?)

Perhaps I should ask Google to let me download my browser history, and the telephone companies to let me download my cellular positions (from before 2008). It would be nice to have them too. The NSA has them, so why not me?

So now I have my own personal Big Data. But it is mostly unstructured, so not very useful yet. The next phase in this project will solve that.

In this post I talked about information extraction, and in another series of posts about artificial intelligence. And for the past 20 years I've been working on semantic network technologies (comparable to Google's knowledge graph).

Got the global picture?


The trick will be to build a coherent picture of all the information available: correcting OCR errors, determining the location and time of photos, distinguishing meetings I've been to from the ones I was invited to but didn't attend, etc.
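To make one of these steps concrete, here is a minimal sketch of how a photo could be given a location from the recorded trails. It is only an illustration under assumptions, not the actual PADIP software: it assumes each photo carries a timestamp, that a trail is a time-sorted list of points, and the names `TrackPoint`, `locate` and `max_gap`, as well as the sample coordinates, are made up for the example.

```python
from bisect import bisect_left
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class TrackPoint:
    """One recorded position (hypothetical structure for this sketch)."""
    time: datetime
    lat: float
    lon: float


def locate(photo_time: datetime,
           trail: List[TrackPoint],
           max_gap: timedelta = timedelta(minutes=5)) -> Optional[TrackPoint]:
    """Return the trail point closest in time to photo_time.

    The trail must be sorted by time. Returns None when the trail is
    empty or the closest point is further away than max_gap, because a
    far-away fix says little about where the photo was taken.
    """
    if not trail:
        return None
    times = [p.time for p in trail]
    i = bisect_left(times, photo_time)
    # Only the neighbours just before and just after can be the closest.
    candidates = trail[max(i - 1, 0):i + 1]
    best = min(candidates, key=lambda p: abs(p.time - photo_time))
    if abs(best.time - photo_time) > max_gap:
        return None
    return best


# Made-up sample trail: one point every 5 seconds, as on the road.
start = datetime(2013, 6, 1, 10, 0, 0)
trail = [TrackPoint(start + timedelta(seconds=5 * k), 52.0 + 0.001 * k, 4.3)
         for k in range(4)]

near = locate(start + timedelta(seconds=7), trail)   # photo taken on the road
far = locate(start + timedelta(hours=2), trail)      # no trail around that time
```

With a position every 5 seconds the nearest point is at most a few seconds away from the photo, so the real difficulty is more likely to be in clock differences between camera and phone than in the lookup itself.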

The challenge is not in getting the data, but in creating the software to get the right data, or better, to get the data right. Doing this manually is not an option.

This personal Big Data set will provide "food for thought" for the new AI (along with a lot of other material, of course).
And as three languages (Dutch, French and English), plus snippets of Spanish, Italian and Polish, are scattered throughout this set, the AI will also start learning them, so that in return it can provide a language teaching capability for humans, as illustrated in the short video (a sort of prototype app).

Challenging, isn't it?

#PADIP #ArtificialIntelligence #InformationExtraction #PersonalData

Edit 2014-01-13 Changed digitalization to digitization