[notebooks] Adding notebooks
authorVincent Michel <vincent.michel@logilab.fr>
Tue, 24 Jun 2014 21:10:26 +0000
changeset 449 80c0a2d8624a
parent 444 43f7648c4397
child 450 28ba90f7f947
[notebooks] Adding notebooks
notebooks/Named Entities Matching with Nazca.ipynb
notebooks/Record linkage with Nazca - Example Dbpedia - INSEE.ipynb
notebooks/Record linkage with Nazca - part 1 - Introduction.ipynb
notebooks/Record linkage with Nazca - part 2 - Normalization and blockings.ipynb
notebooks/Record linkage with Nazca - part 3 - Putting it all together.ipynb
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/notebooks/Named Entities Matching with Nazca.ipynb	Tue Jun 24 21:10:26 2014 +0000
@@ -0,0 +1,1369 @@
+{
+ "metadata": {
+  "name": "Named Entities Matching with Nazca"
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h1>Named Entities Matching with Nazca</h1>\n",
+      "\n",
+      "\n",
+      "Named Entities Matching is the process of recognizing elements in a text and matching it\n",
+      "to different types (e.g. Person, Organization, Place). This may be related to Record Linkage (for linking entities from a text corpus and from a reference corpus) and to Named Entities Recognition."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from IPython.display import HTML\n",
+      "HTML('<iframe src=http://en.mobile.wikipedia.org/wiki/Named-entity_recognition?useformat=mobile width=700 height=350></iframe>')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<iframe src=http://en.mobile.wikipedia.org/wiki/Named-entity_recognition?useformat=mobile width=700 height=350></iframe>"
+       ],
+       "output_type": "pyout",
+       "prompt_number": 2,
+       "text": [
+        "<IPython.core.display.HTML at 0x7fd0000698d0>"
+       ]
+      }
+     ],
+     "prompt_number": 2
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "In Nazca, we provide tools to match named entities to existing corpus or reference database.\n",
+      "This may be used for contextualizing your data, e.g. by recognizing in them elements from Dbpedia."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h2>Token, Sentence and Tokenizer - nazca.utils.tokenizer</h2>\n"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "A `Sentence` is:\n",
+      "<ul>\n",
+      "    <li>the `start` indice in the text</li>\n",
+      "    <li>the `end` indice in the text</li>\n",
+      "    <li>the `indice` of the sentence amonst all the other sentences</li>\n",
+      "</ul>    "
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.utils.tokenizer import Sentence\n",
+      "\n",
+      "sentence = Sentence(indice=0, start=0, end=38)\n",
+      "print sentence"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Sentence(indice=0, start=0, end=38)\n"
+       ]
+      }
+     ],
+     "prompt_number": 16
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "A `Token` is:\n",
+      "<ul>\n",
+      "    <li>a `word` or part of a sentence</li>\n",
+      "    <li>the `start` indice in the sentence</li>\n",
+      "    <li>the `end` indice in the sentence</li>\n",
+      "    <li>the `sentence` which contains the Token</li>\n",
+      "</ul>"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.utils.tokenizer import Token\n",
+      "\n",
+      "token = Token(word='Hello everyone this', start=0, end=20, sentence=Sentence(indice=0, start=0, end=38))\n",
+      "print token"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Token(word='Hello everyone this', start=0, end=20, sentence=Sentence(indice=0, start=0, end=38))\n"
+       ]
+      }
+     ],
+     "prompt_number": 17
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Nazca provides a `Tokenizer` that creates tokens from a text:\n",
+      "\n",
+      "\n",
+      "    class RichStringTokenizer(object):\n",
+      "\n",
+      "        def iter_tokens(self, text):\n",
+      "            \"\"\" Iterate tokens over a text \"\"\"\n",
+      "\n",
+      "        def find_sentences(self, text):\n",
+      "            \"\"\" Find the sentences \"\"\"\n",
+      "\n",
+      "        def load_text(self, text):\n",
+      "            \"\"\" Load the text to be tokenized \"\"\"\n",
+      "\n",
+      "        def __iter__(self):\n",
+      "            \"\"\" Iterator over the text given in the object instantiation \"\"\"\n",
+      "\n",
+      "\n",
+      "The tokenizer may be initialized with a minimum size and maximum size for the tokens. It will find all the sentences from a text using `find_sentences` and will create tokens in an iterative way."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.utils.tokenizer import RichStringTokenizer\n",
+      "\n",
+      "text = 'Hello everyone, this is   me speaking. And me !Why not me ? Blup'\n",
+      "tokenizer = RichStringTokenizer(text, token_min_size=1, token_max_size=4)\n",
+      "sentences = tokenizer.find_sentences(text)\n",
+      "\n",
+      "print sentences"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[Sentence(indice=0, start=0, end=38), Sentence(indice=1, start=38, end=47), Sentence(indice=2, start=47, end=59), Sentence(indice=3, start=59, end=64)]\n"
+       ]
+      }
+     ],
+     "prompt_number": 18
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for token in tokenizer:\n",
+      "    print token"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Token(word='Hello everyone this is', start=0, end=23, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='Hello everyone this', start=0, end=20, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='Hello everyone', start=0, end=14, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='Hello', start=0, end=5, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='everyone this is me', start=6, end=28, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='everyone this is', start=6, end=23, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='everyone this', start=6, end=20, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='this is me speaking', start=16, end=37, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='this is me', start=16, end=28, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='this is', start=16, end=23, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='this', start=16, end=20, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='is me speaking', start=21, end=37, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='is me', start=21, end=28, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='is', start=21, end=23, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='me speaking', start=26, end=37, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='speaking', start=29, end=37, sentence=Sentence(indice=0, start=0, end=38))\n",
+        "Token(word='And me Why', start=39, end=50, sentence=Sentence(indice=1, start=38, end=47))\n",
+        "Token(word='And me', start=39, end=45, sentence=Sentence(indice=1, start=38, end=47))\n",
+        "Token(word='And', start=39, end=42, sentence=Sentence(indice=1, start=38, end=47))\n",
+        "Token(word='me Why', start=43, end=50, sentence=Sentence(indice=1, start=38, end=47))\n",
+        "Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=47))\n",
+        "Token(word='Why not me', start=47, end=57, sentence=Sentence(indice=2, start=47, end=59))\n",
+        "Token(word='Why not', start=47, end=54, sentence=Sentence(indice=2, start=47, end=59))\n",
+        "Token(word='Why', start=47, end=50, sentence=Sentence(indice=2, start=47, end=59))\n",
+        "Token(word='not me', start=51, end=57, sentence=Sentence(indice=2, start=47, end=59))\n",
+        "Token(word='not', start=51, end=54, sentence=Sentence(indice=2, start=47, end=59))\n",
+        "Token(word='me', start=55, end=57, sentence=Sentence(indice=2, start=47, end=59))\n",
+        "Token(word='Blup', start=60, end=64, sentence=Sentence(indice=3, start=59, end=64))\n"
+       ]
+      }
+     ],
+     "prompt_number": 19
+    },
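+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The same tokenizer may be reused on another text thanks to `load_text` (a minimal sketch based on the skeleton above):\n",
+      "\n",
+      "    tokenizer.load_text('Goodbye everyone')\n",
+      "    for token in tokenizer:\n",
+      "        print token"
+     ]
+    },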
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h2>Defining a source of entities - nazca.ner.sources</h2>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "First, we should define the source of entities that should be retrieved in the tokens yield by the tokenizer.\n",
+      "\n",
+      "A source herits from `AbstractNerSource`:\n",
+      "\n",
+      "\n",
+      "    class AbstractNerSource(object):\n",
+      "        \"\"\" High-level source for Named Entities Recognition \"\"\"\n",
+      "\n",
+      "        def __init__(self, endpoint, query, name=None, use_cache=True, preprocessors=None):\n",
+      "        \"\"\" Initialise the class.\"\"\"\n",
+      "\n",
+      "        def add_preprocessors(self, preprocessor):\n",
+      "            \"\"\" Add a preprocessor \"\"\"\n",
+      "\n",
+      "        def recognize_token(self, token):\n",
+      "            \"\"\" Recognize a token \"\"\"\n",
+      "\n",
+      "        def query_word(self, word):\n",
+      "            \"\"\" Query a word for a Named Entities Recognition process \"\"\"\n",
+      "\n",
+      "\n",
+      "It requires an `endpoint` and a `query`. And may recognize a word directly using `query_word`, or a token using `recognize_token`. In both cases, it returns a list of URIs/identifiers."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>Sparql source</h4>"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.ner.sources import NerSourceSparql\n",
+      "ner_source = NerSourceSparql('http://dbpedia.org/sparql',\n",
+      "                             '''SELECT distinct ?uri\n",
+      "                                WHERE{?uri rdfs:label \"%(word)s\"@en}''')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 20
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print ner_source.query_word('Victor Hugo')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[u'http://dbpedia.org/resource/Victor_Hugo', u'http://dbpedia.org/resource/Category:Victor_Hugo', u'http://sw.opencyc.org/2008/06/10/concept/en/VictorHugo', u'http://sw.opencyc.org/2008/06/10/concept/Mx4rve1ZXJwpEbGdrcN5Y29ycA', u'http://wikidata.dbpedia.org/resource/Q1459231', u'http://wikidata.dbpedia.org/resource/Q535', u'http://wikidata.dbpedia.org/resource/Q3557368']\n"
+       ]
+      }
+     ],
+     "prompt_number": 21
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Of course, you can use a more complex query reflecting restrictions and business logic"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.ner.sources import NerSourceSparql\n",
+      "ner_source = NerSourceSparql('http://dbpedia.org/sparql',\n",
+      "                              '''SELECT distinct ?uri\n",
+      "                                 WHERE{?uri rdfs:label \"%(word)s\"@en .\n",
+      "                                       ?p foaf:primaryTopic ?uri}''')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 22
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print ner_source.query_word('Victor Hugo')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[u'http://dbpedia.org/resource/Victor_Hugo']\n"
+       ]
+      }
+     ],
+     "prompt_number": 23
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "You may also used `use_cache` to True to keep all the previous results."
+     ]
+    },
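+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "A minimal sketch of a cached source, reusing the endpoint and query from above (the assumption being that, with `use_cache=True`, repeating a query does not hit the endpoint again):\n",
+      "\n",
+      "    ner_source = NerSourceSparql('http://dbpedia.org/sparql',\n",
+      "                                 '''SELECT distinct ?uri\n",
+      "                                    WHERE{?uri rdfs:label \"%(word)s\"@en}''',\n",
+      "                                 use_cache=True)\n",
+      "    print ner_source.query_word('Victor Hugo')  # queries the endpoint\n",
+      "    print ner_source.query_word('Victor Hugo')  # answered from the cache"
+     ]
+    },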
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>Rql source</h4>"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.ner.sources import NerSourceRql\n",
+      "\n",
+      "ner_source = NerSourceRql('http://www.cubicweb.org',\n",
+      "                          'Any U WHERE X cwuri U, X name \"%(word)s\"')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 24
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print ner_source.query_word('apycot')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[u'http://www.cubicweb.org/1310453', u'http://www.cubicweb.org/749162']\n"
+       ]
+      }
+     ],
+     "prompt_number": 25
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>Lexicon source</h4>\n",
+      "\n",
+      "A lexicon source is a source based on a dictionnary"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.ner.sources import NerSourceLexicon\n",
+      "\n",
+      "\n",
+      "lexicon = {'everyone': 'http://example.com/everyone',\n",
+      "           'me': 'http://example.com/me'}\n",
+      "ner_source = NerSourceLexicon(lexicon)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 26
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print ner_source.query_word('me')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "['http://example.com/me']\n"
+       ]
+      }
+     ],
+     "prompt_number": 27
+    },
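+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "A whole `Token` may also be recognized directly with `recognize_token` (a minimal sketch reusing the lexicon source defined above):\n",
+      "\n",
+      "    from nazca.utils.tokenizer import Token, Sentence\n",
+      "\n",
+      "    token = Token(word='me', start=26, end=28,\n",
+      "                  sentence=Sentence(indice=0, start=0, end=38))\n",
+      "    print ner_source.recognize_token(token)"
+     ]
+    },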
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h2>Defining preprocessors - nazca.ner.preprocessors</h2>\n",
+      "\n",
+      "Preprocessors are used to cleanup/filter a token before recognizing it in a source.\n",
+      "All preprocessors herit from `AbstractNerPreprocessor`\n",
+      "\n",
+      "    class AbstractNerPreprocessor(object):\n",
+      "        \"\"\" Preprocessor \"\"\"\n",
+      "\n",
+      "        def __call__(self, token):\n",
+      "            raise NotImplementedError\n",
+      "\n",
+      "\n",
+      "A token which is None after being returned by a preprocessor is not recognized by the source."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import nazca.ner.preprocessors as nnp"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 28
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>NerWordSizeFilterPreprocessor</h4>\n",
+      "\n",
+      "This preprocessor remove token based on the size of the word."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "preprocessor = nnp.NerWordSizeFilterPreprocessor(min_size=2, max_size=4)\n",
+      "token = Token('toto', 0, 4, None)\n",
+      "print token, '-->', preprocessor(token)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Token(word='toto', start=0, end=4, sentence=None) --> Token(word='toto', start=0, end=4, sentence=None)\n"
+       ]
+      }
+     ],
+     "prompt_number": 29
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "token = Token('t', 0, 4, None)\n",
+      "print token, '-->', preprocessor(token)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Token(word='t', start=0, end=4, sentence=None) --> None\n"
+       ]
+      }
+     ],
+     "prompt_number": 30
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "token = Token('tototata', 0, 4, None)\n",
+      "print token, '-->', preprocessor(token)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Token(word='tototata', start=0, end=4, sentence=None) --> None\n"
+       ]
+      }
+     ],
+     "prompt_number": 31
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>NerLowerCaseFilterPreprocessor</h4>\n",
+      "\n",
+      "This preprocessor remove token in lower case"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "preprocessor = nnp.NerLowerCaseFilterPreprocessor()\n",
+      "token = Token('Toto', 0, 4, None)\n",
+      "print token, '-->', preprocessor(token)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Token(word='Toto', start=0, end=4, sentence=None) --> Token(word='Toto', start=0, end=4, sentence=None)\n"
+       ]
+      }
+     ],
+     "prompt_number": 32
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "token = Token('toto', 0, 4, None)\n",
+      "print token, '-->', preprocessor(token)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Token(word='toto', start=0, end=4, sentence=None) --> None\n"
+       ]
+      }
+     ],
+     "prompt_number": 33
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>NerLowerFirstWordPreprocessor</h4>\n",
+      "\n",
+      "This preprocessor lower the first word of each sentence if it is a stopword."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "preprocessor = nnp.NerLowerFirstWordPreprocessor()\n",
+      "sentence = Sentence(0, 0, 20)\n",
+      "token = Token('Toto tata', 0, 4, sentence)\n",
+      "print token.word, '-->', preprocessor(token).word"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Toto tata --> Toto tata\n"
+       ]
+      }
+     ],
+     "prompt_number": 34
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "token = Token('The tata', 0, 4, sentence)\n",
+      "print token.word, '-->', preprocessor(token).word"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "The tata --> the tata\n"
+       ]
+      }
+     ],
+     "prompt_number": 35
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "token = Token('Tata The', 0, 4, sentence)\n",
+      "print token.word, '-->', preprocessor(token).word"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Tata The --> Tata The\n"
+       ]
+      }
+     ],
+     "prompt_number": 36
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>NerStopwordsFilterPreprocessor</h4>\n",
+      "\n",
+      "This preprocessor remove stopwords from the token. If `split_words` is False, it only removes token that are just a stopwords, in the other case, it will remove tokens that are entirely composed of stopwords."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "preprocessor = nnp.NerStopwordsFilterPreprocessor(split_words=True)\n",
+      "token = Token('Toto', 0, 4, None)\n",
+      "print token.word, '-->', preprocessor(token).word"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Toto --> Toto\n"
+       ]
+      }
+     ],
+     "prompt_number": 37
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "token = Token('Us there', 0, 4, None)\n",
+      "print token.word, '-->', preprocessor(token)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Us there --> None\n"
+       ]
+      }
+     ],
+     "prompt_number": 38
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "preprocessor = nnp.NerStopwordsFilterPreprocessor(split_words=False)\n",
+      "token = Token('Us there', 0, 4, None)\n",
+      "print token.word, '-->', preprocessor(token).word"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Us there --> Us there\n"
+       ]
+      }
+     ],
+     "prompt_number": 39
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>NerHashTagPreprocessor</h4>\n",
+      "\n",
+      "This preprocessor cleanup hashtags."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "preprocessor = nnp.NerHashTagPreprocessor()\n",
+      "token = Token('@BarackObama', 0, 4, None)\n",
+      "print token.word, '-->', preprocessor(token).word"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "@BarackObama --> BarackObama\n"
+       ]
+      }
+     ],
+     "prompt_number": 40
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "token = Token('@Barack_Obama', 0, 4, None)\n",
+      "print token.word, '-->', preprocessor(token).word"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "@Barack_Obama --> Barack Obama\n"
+       ]
+      }
+     ],
+     "prompt_number": 41
+    },
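+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Writing your own preprocessor only requires implementing `__call__` (a minimal, hypothetical sketch; `NerVowelFilterPreprocessor` is not part of Nazca):\n",
+      "\n",
+      "    class NerVowelFilterPreprocessor(nnp.AbstractNerPreprocessor):\n",
+      "        \"\"\" Hypothetical preprocessor discarding tokens without any vowel \"\"\"\n",
+      "\n",
+      "        def __call__(self, token):\n",
+      "            # return None to discard the token, or the (possibly modified) token\n",
+      "            if not any(c in 'aeiouy' for c in token.word.lower()):\n",
+      "                return None\n",
+      "            return token"
+     ]
+    },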
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h2>Defining the NER process - nazca.ner</h2>\n",
+      "\n",
+      "The entire NER process may be implemented in a `NerProcess`\n",
+      "\n",
+      "    class NerProcess(object):\n",
+      "\n",
+      "        def add_ner_source(self, process):\n",
+      "            \"\"\" Add a ner process \"\"\"\n",
+      "\n",
+      "        def add_preprocessors(self, preprocessor):\n",
+      "            \"\"\" Add a preprocessor \"\"\"\n",
+      "\n",
+      "        def add_filters(self, filter):\n",
+      "            \"\"\" Add a filter \"\"\"\n",
+      "\n",
+      "        def process_text(self, text):\n",
+      "            \"\"\" High level function for analyzing a text \"\"\"\n",
+      "\n",
+      "        def recognize_tokens(self, tokens):\n",
+      "            \"\"\" Recognize Named Entities from a tokenizer or an iterator yielding tokens. \"\"\"\n",
+      "\n",
+      "        def postprocess(self, named_entities):\n",
+      "            \"\"\" Postprocess the results by applying filters \"\"\"\n",
+      "\n",
+      "\n",
+      "The main entry point of a `NerProcess` is the function `process_text()`, that yield triples (URI, source_name, token).\n",
+      "The source_name may be set during source creation, and may be useful when working with different sources."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.ner import NerProcess\n",
+      "\n",
+      "text = 'Hello everyone, this is   me speaking. And me.'\n",
+      "source = NerSourceLexicon({'everyone': 'http://example.com/everyone',\n",
+      "                           'me': 'http://example.com/me'})\n",
+      "ner = NerProcess((source,))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 42
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for infos in ner.process_text(text):\n",
+      "    print infos"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
+        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
+        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
+       ]
+      }
+     ],
+     "prompt_number": 43
+    },
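+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Sources, preprocessors and filters may also be attached after the process creation, using the methods listed in the skeleton above (a minimal sketch):\n",
+      "\n",
+      "    ner = NerProcess((source,))\n",
+      "    ner.add_ner_source(NerSourceLexicon({'speaking': 'http://example.com/speaking'}))\n",
+      "    ner.add_preprocessors(nnp.NerLowerCaseFilterPreprocessor())"
+     ]
+    },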
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>Use multi sources</h3>\n",
+      "\n",
+      "Different sources may be given to the NER process."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "source1 = NerSourceLexicon({'everyone': 'http://example.com/everyone',\n",
+      "                             'me': 'http://example.com/me'})\n",
+      "source2 = NerSourceLexicon({'me': 'http://example2.com/me'})\n",
+      "ner = NerProcess((source1, source2))\n",
+      "\n",
+      "for infos in ner.process_text(text):\n",
+      "    print infos"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
+        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
+        "('http://example2.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
+        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n",
+        "('http://example2.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
+       ]
+      }
+     ],
+     "prompt_number": 44
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "It is possible to set the `unique` attribute, to keep the first appearance of a token in a source, in the sources order"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "ner = NerProcess((source1, source2), unique=True)\n",
+      "\n",
+      "for infos in ner.process_text(text):\n",
+      "    print infos"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
+        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
+        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
+       ]
+      }
+     ],
+     "prompt_number": 45
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "ner = NerProcess((source2, source1), unique=True)\n",
+      "\n",
+      "for infos in ner.process_text(text):\n",
+      "    print infos"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
+        "('http://example2.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
+        "('http://example2.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
+       ]
+      }
+     ],
+     "prompt_number": 46
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>Pretty printing the output</h3>\n",
+      "\n",
+      "The output can be pretty printed with hyperlinks."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.utils.dataio import HTMLPrettyPrint\n",
+      "\n",
+      "ner = NerProcess((source1, source2))\n",
+      "named_entities = ner.process_text(text)\n",
+      "html = HTMLPrettyPrint().pprint_text(text, named_entities)\n",
+      "print html"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Hello <a href=\"http://example.com/everyone\">everyone</a>, this is   <a href=\"http://example2.com/me\">me</a> speaking. And <a href=\"http://example2.com/me\">me</a>.\n"
+       ]
+      }
+     ],
+     "prompt_number": 47
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h2>Defining the filters - nazca.ner.filters</h2>\n",
+      "\n",
+      "It is sometimes useful to filter the named entities found by a NerProcess before returning them.\n",
+      "This may be done using filters, that herit from `AbstractNerFilter`\n",
+      "\n",
+      "\n",
+      "    class AbstractNerFilter(object):\n",
+      "        \"\"\" A filter used for cleaning named entities results \"\"\"\n",
+      "\n",
+      "        def __call__(self, named_entities):\n",
+      "            raise NotImplementedError"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import nazca.ner.filters as nnf"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 48
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>NerOccurenceFilter</h4>\n",
+      "\n",
+      "This filter is based on the number of occurence of named entities in the results."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "text = 'Hello everyone, this is   me speaking. And me.'\n",
+      "source1 = NerSourceLexicon({'everyone': 'http://example.com/everyone',\n",
+      "                            'me': 'http://example.com/me'})\n",
+      "source2 = NerSourceLexicon({'me': 'http://example2.com/me'})\n",
+      "_filter = nnf.NerOccurenceFilter(min_occ=2)\n",
+      "\n",
+      "ner = NerProcess((source1, source2))\n",
+      "for infos in ner.process_text(text):\n",
+      "    print infos"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
+        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
+        "('http://example2.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
+        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n",
+        "('http://example2.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
+       ]
+      }
+     ],
+     "prompt_number": 49
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "ner = NerProcess((source1, source2), filters=(_filter,))\n",
+      "for infos in ner.process_text(text):\n",
+      "    print infos"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
+        "('http://example2.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
+        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n",
+        "('http://example2.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
+       ]
+      }
+     ],
+     "prompt_number": 50
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>NerDisambiguationWordParts</h4>\n",
+      "\n",
+      "This filter disambiguates named entities based on the words parts, i.e. if a token is included in a larger reocgnized tokens, we replace it by the larger token."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "text = 'Hello Toto Tutu. And Toto.'\n",
+      "source = NerSourceLexicon({'Toto Tutu': 'http://example.com/toto_tutu',\n",
+      "                           'Toto': 'http://example.com/toto'})\n",
+      "_filter = nnf.NerDisambiguationWordParts()\n",
+      "\n",
+      "ner = NerProcess((source,))\n",
+      "for infos in ner.process_text(text):\n",
+      "    print infos"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "('http://example.com/toto_tutu', None, Token(word='Toto Tutu', start=6, end=15, sentence=Sentence(indice=0, start=0, end=16)))\n",
+        "('http://example.com/toto', None, Token(word='Toto', start=21, end=25, sentence=Sentence(indice=1, start=16, end=26)))\n"
+       ]
+      }
+     ],
+     "prompt_number": 51
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "ner = NerProcess((source,), filters=(_filter,))\n",
+      "for infos in ner.process_text(text):\n",
+      "    print infos"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "('http://example.com/toto_tutu', None, Token(word='Toto Tutu', start=6, end=15, sentence=Sentence(indice=0, start=0, end=16)))\n",
+        "('http://example.com/toto_tutu', None, Token(word='Toto', start=21, end=25, sentence=Sentence(indice=1, start=16, end=26)))\n"
+       ]
+      }
+     ],
+     "prompt_number": 52
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>NerReplacementRulesFilter</h4>\n",
+      "\n",
+      "This filter allow to define replacement rules for Named Entities."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "text = 'Hello toto tutu. And toto.'\n",
+      "source = NerSourceLexicon({'toto tutu': 'http://example.com/toto_tutu',\n",
+      "                           'toto': 'http://example.com/toto'})\n",
+      "rules = {'http://example.com/toto': 'http://example.com/tata'}\n",
+      "_filter = nnf.NerReplacementRulesFilter(rules)\n",
+      "\n",
+      "ner = NerProcess((source,))\n",
+      "for infos in ner.process_text(text):\n",
+      "    print infos"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "('http://example.com/toto_tutu', None, Token(word='toto tutu', start=6, end=15, sentence=Sentence(indice=0, start=0, end=16)))\n",
+        "('http://example.com/toto', None, Token(word='toto', start=21, end=25, sentence=Sentence(indice=1, start=16, end=26)))\n"
+       ]
+      }
+     ],
+     "prompt_number": 53
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "ner = NerProcess((source,), filters=(_filter,))\n",
+      "for infos in ner.process_text(text):\n",
+      "    print infos"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "('http://example.com/toto_tutu', None, Token(word='toto tutu', start=6, end=15, sentence=Sentence(indice=0, start=0, end=16)))\n",
+        "('http://example.com/tata', None, Token(word='toto', start=21, end=25, sentence=Sentence(indice=1, start=16, end=26)))\n"
+       ]
+      }
+     ],
+     "prompt_number": 54
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>NerRDFTypeFilter</h4>\n",
+      "\n",
+      "This filter is based on the RDF type of objects"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "text = 'Hello Victor Hugo and Sony'\n",
+      "source = NerSourceSparql('http://dbpedia.org/sparql',\n",
+      "                         '''SELECT distinct ?uri\n",
+      "                            WHERE{?uri rdfs:label \"%(word)s\"@en .\n",
+      "                                  ?p foaf:primaryTopic ?uri}''')\n",
+      "_filter = nnf.NerRDFTypeFilter('http://dbpedia.org/sparql',\n",
+      "                             ('http://schema.org/Place',\n",
+      "                              'http://dbpedia.org/ontology/Agent',\n",
+      "                              'http://dbpedia.org/ontology/Place'))\n",
+      "\n",
+      "ner = NerProcess((source,))\n",
+      "for infos in ner.process_text(text):\n",
+      "    print infos"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "(u'http://dbpedia.org/resource/Hello', None, Token(word='Hello', start=0, end=5, sentence=Sentence(indice=0, start=0, end=26)))\n",
+        "(u'http://dbpedia.org/resource/Victor_Hugo', None, Token(word='Victor Hugo', start=6, end=17, sentence=Sentence(indice=0, start=0, end=26)))\n",
+        "(u'http://dbpedia.org/resource/Sony', None, Token(word='Sony', start=22, end=26, sentence=Sentence(indice=0, start=0, end=26)))\n"
+       ]
+      }
+     ],
+     "prompt_number": 56
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "ner = NerProcess((source,), filters=(_filter,))\n",
+      "for infos in ner.process_text(text):\n",
+      "    print infos"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "(u'http://dbpedia.org/resource/Victor_Hugo', None, Token(word='Victor Hugo', start=6, end=17, sentence=Sentence(indice=0, start=0, end=26)))\n",
+        "(u'http://dbpedia.org/resource/Sony', None, Token(word='Sony', start=22, end=26, sentence=Sentence(indice=0, start=0, end=26)))\n"
+       ]
+      }
+     ],
+     "prompt_number": 57
+    },
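+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "A filter is simply a callable over the list of named entities, so writing your own is straightforward (a minimal, hypothetical sketch; `NerLongWordsFilter` is not part of Nazca):\n",
+      "\n",
+      "    class NerLongWordsFilter(nnf.AbstractNerFilter):\n",
+      "        \"\"\" Hypothetical filter keeping named entities whose token is longer than 3 characters \"\"\"\n",
+      "\n",
+      "        def __call__(self, named_entities):\n",
+      "            # named_entities is a list of (URI, source_name, token) triples\n",
+      "            return [(uri, name, token) for uri, name, token in named_entities\n",
+      "                    if len(token.word) > 3]"
+     ]
+    },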
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h2>Putting it all together</h2>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>Get the data</h4>"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import feedparser\n",
+      "from BeautifulSoup import BeautifulSoup  \n",
+      "\n",
+      "data = feedparser.parse('http://rss.nytimes.com/services/xml/rss/nyt/World.xml')\n",
+      "entries = data.entries"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 1
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>Define the source</h4>"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.ner.sources import NerSourceSparql\n",
+      "\n",
+      "dbpedia_sparql_source = NerSourceSparql('http://dbpedia.org/sparql',\n",
+      "                                         '''SELECT distinct ?uri\n",
+      "                                            WHERE{\n",
+      "                                            ?uri rdfs:label \"%(word)s\"@en .\n",
+      "                                            ?p foaf:primaryTopic ?uri}''',\n",
+      "                                         'http://dbpedia.org/sparql',\n",
+      "                                        use_cache=True)\n",
+      "ner_sources = [dbpedia_sparql_source,]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 2
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>Create the NER process</h4>"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.ner.preprocessors import (NerLowerCaseFilterPreprocessor,\n",
+      "                                    NerStopwordsFilterPreprocessor)\n",
+      "from nazca.ner import NerProcess\n",
+      "\n",
+      "preprocessors = [NerLowerCaseFilterPreprocessor(),\n",
+      "        \t     NerStopwordsFilterPreprocessor()]\n",
+      "process = NerProcess(ner_sources, preprocessors=preprocessors)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 3
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>Process the data</h4>"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from IPython.display import HTML\n",
+      "from nazca.utils.dataio import HTMLPrettyPrint\n",
+      "\n",
+      "pprint_html = HTMLPrettyPrint()\n",
+      "\n",
+      "html = []\n",
+      "for ind, entry in enumerate(entries[:20]):\n",
+      "    text = BeautifulSoup(entry['summary_detail']['value']).text\n",
+      "    named_entities = process.process_text(text)\n",
+      "    html.append(pprint_html.pprint_text(text, named_entities))\n",
+      "HTML('<br/><br/>'.join(html))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<a href=\"http://dbpedia.org/resource/Matteo_Renzi\">Matteo Renzi</a> had the best showing of any <a href=\"http://dbpedia.org/resource/European\">European</a> leader in parliamentary voting, reaching a level no party has seen in an <a href=\"http://dbpedia.org/resource/Italian_election\">Italian election</a> since <a href=\"http://dbpedia.org/resource/1958\">1958</a>.<br/><br/><a href=\"http://dbpedia.org/resource/Satellite\">Satellite</a> data from <a href=\"http://dbpedia.org/resource/Malaysia_Airlines\">Malaysia Airlines</a> <a href=\"http://dbpedia.org/resource/Flight\">Flight</a> <a href=\"http://dbpedia.org/resource/370\">370</a> was released after pressure from relatives of the mostly <a href=\"http://dbpedia.org/resource/Chinese\">Chinese</a> passengers and from the <a href=\"http://dbpedia.org/resource/Chinese_government\">Chinese government</a>.<br/><br/><a href=\"http://dbpedia.org/resource/China\">China</a> and <a href=\"http://dbpedia.org/resource/Vietnam\">Vietnam</a> traded accusations over the sinking of a <a href=\"http://dbpedia.org/resource/Vietnamese\">Vietnamese</a> fishing vessel near a <a href=\"http://dbpedia.org/resource/Chinese\">Chinese</a> oil rig in disputed waters off <a href=\"http://dbpedia.org/resource/Vietnam\">Vietnam</a>\u2019s coast.<br/><br/>As tensions grow between <a href=\"http://dbpedia.org/resource/China\">China</a> and <a href=\"http://dbpedia.org/resource/Vietnam\">Vietnam</a> over the ramming and sinking of a <a href=\"http://dbpedia.org/resource/Vietnamese\">Vietnamese</a> fishing boat by a <a href=\"http://dbpedia.org/resource/Chinese\">Chinese</a> vessel, reaction in <a href=\"http://dbpedia.org/resource/China\">China</a> is overwhelmingly supportive of the incident.<br/><br/>The popular microblog of a professor at <a href=\"http://dbpedia.org/resource/Peking_University\">Peking University</a> has been blocked, perhaps because he posted a comment about the military suppression of the <a href=\"http://dbpedia.org/resource/1989\">1989</a> <a href=\"http://dbpedia.org/resource/Tiananmen\">Tiananmen</a> protest movement.<br/><br/><a href=\"http://dbpedia.org/resource/Dalia_Grybauskaite\">Dalia Grybauskaite</a>, <a href=\"http://dbpedia.org/resource/58\">58</a>, was re-elected president of <a href=\"http://dbpedia.org/resource/Lithuania\">Lithuania</a> on <a href=\"http://dbpedia.org/resource/Sunday\">Sunday</a> after beating the <a href=\"http://dbpedia.org/resource/Social_Democrat\">Social Democrat</a> candidate in a runoff.<br/><br/>Ten fishermen were rescued after their boat was struck by a <a href=\"http://dbpedia.org/resource/Chinese\">Chinese</a> vessel in the disputed waters in the <a href=\"http://dbpedia.org/resource/South_China_Sea\">South China Sea</a>, state news media reported.<br/><br/>The general contractor that helped oversee the construction of the <a href=\"http://dbpedia.org/resource/Abu_Dhabi\">Abu Dhabi</a> campus is run by a trustee of <a href=\"http://dbpedia.org/resource/N_Y\">N.Y</a>.<a href=\"http://dbpedia.org/resource/U\">U</a>.\u2019s board.<br/><br/>The <a href=\"http://dbpedia.org/resource/Malawi\">Malawi</a> <a href=\"http://dbpedia.org/resource/Electoral_Commission\">Electoral Commission</a> must manually count the <a href=\"http://dbpedia.org/resource/20\">20</a> presidential election votes, a senior official said <a href=\"http://dbpedia.org/resource/Monday\">Monday</a>. 
It is likely to take two months before an official result is announced.<br/><br/>An express train plowed into a parked freight train in northern <a href=\"http://dbpedia.org/resource/India\">India</a> killing at least <a href=\"http://dbpedia.org/resource/40\">40</a> people and reducing cars to twisted metal, officials said.<br/><br/><a href=\"http://dbpedia.org/resource/United_Nations\">United Nations</a> officials demanded a halt to the expulsion of thousands of citizens of the <a href=\"http://dbpedia.org/resource/Democratic_Republic\">Democratic Republic</a> of <a href=\"http://dbpedia.org/resource/Congo\">Congo</a> from the <a href=\"http://dbpedia.org/resource/Congo_Republic\">Congo Republic</a>.<br/><br/><a href=\"http://dbpedia.org/resource/Flooding\">Flooding</a> in southern <a href=\"http://dbpedia.org/resource/China\">China</a> forced almost half a million people from their homes, the government said <a href=\"http://dbpedia.org/resource/Monday\">Monday</a>.<br/><br/><a href=\"http://dbpedia.org/resource/Residents\">Residents</a> of <a href=\"http://dbpedia.org/resource/Castrillo_Matajud%C3%ADos\">Castrillo Matajud\u00edos</a>, which loosely translates as Little <a href=\"http://dbpedia.org/resource/Fort\">Fort</a> of <a href=\"http://dbpedia.org/resource/Jew\">Jew</a> <a href=\"http://dbpedia.org/resource/Killers\">Killers</a>, have voted to change the name of their village.<br/><br/><a href=\"http://dbpedia.org/resource/Pfizer\">Pfizer</a> said it \u201cdoes not intend to make an offer for <a href=\"http://dbpedia.org/resource/AstraZeneca\">AstraZeneca</a>\u201d in the wake of <a href=\"http://dbpedia.org/resource/AstraZeneca\">AstraZeneca</a>\u2019s rejection of a $<a href=\"http://dbpedia.org/resource/119\">119</a> billion bid.<br/><br/><a href=\"http://dbpedia.org/resource/United_States\">United States</a> <a href=\"http://dbpedia.org/resource/Special_Operations\">Special Operations</a> troops are forming elite units in <a href=\"http://dbpedia.org/resource/Libya\">Libya</a>, <a href=\"http://dbpedia.org/resource/Niger\">Niger</a>, <a href=\"http://dbpedia.org/resource/Mauritania\">Mauritania</a> and <a href=\"http://dbpedia.org/resource/Mali\">Mali</a> in the war against <a href=\"http://dbpedia.org/resource/Al_Qaeda\">Al Qaeda</a>\u2019s affiliates in <a href=\"http://dbpedia.org/resource/Africa\">Africa</a>.<br/><br/><a href=\"http://dbpedia.org/resource/The_military\">The military</a> commander said he would not disclose further details out of concern for the girls\u2019 safety.<br/><br/>In <a href=\"http://dbpedia.org/resource/France\">France</a>, <a href=\"http://dbpedia.org/resource/Britain\">Britain</a> and elsewhere, voters turned against traditional parties to support anti-immigrant groups opposed to the <a href=\"http://dbpedia.org/resource/European_Union\">European Union</a>.<br/><br/>When the <a href=\"http://dbpedia.org/resource/World_Cup\">World Cup</a> begins next month, many of the $<a href=\"http://dbpedia.org/resource/1\">1</a>.<a href=\"http://dbpedia.org/resource/4\">4</a> billion-worth of projects in <a href=\"http://dbpedia.org/resource/Cuiab%C3%A1\">Cuiab\u00e1</a>, one of the <a href=\"http://dbpedia.org/resource/12\">12</a> <a href=\"http://dbpedia.org/resource/Brazilian\">Brazilian</a> host cities, will be far from completion.<br/><br/><a href=\"http://dbpedia.org/resource/Narendra_Modi\">Narendra Modi</a> was sworn in on <a href=\"http://dbpedia.org/resource/Monday\">Monday</a> as <a href=\"http://dbpedia.org/resource/India\">India</a>\u2019s 
prime minister. Among those attending the ceremony was <a href=\"http://dbpedia.org/resource/Prime_Minister\">Prime Minister</a> <a href=\"http://dbpedia.org/resource/Nawaz_Sharif\">Nawaz Sharif</a> of <a href=\"http://dbpedia.org/resource/Pakistan\">Pakistan</a>, hinting that the two countries may revive a moribund peace process.<br/><br/><a href=\"http://dbpedia.org/resource/Ukrainian\">Ukrainian</a> soldiers used fighter jets and fought a ground battle against the pro-Russian forces, who had seized the airport in <a href=\"http://dbpedia.org/resource/Donetsk\">Donetsk</a> after elections that seemed to marginalize them."
+       ],
+       "output_type": "pyout",
+       "prompt_number": 25,
+       "text": [
+        "<IPython.core.display.HTML at 0x445f450>"
+       ]
+      }
+     ],
+     "prompt_number": 25
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
\ No newline at end of file
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/notebooks/Record linkage with Nazca - Example Dbpedia - INSEE.ipynb	Tue Jun 24 21:10:26 2014 +0000
@@ -0,0 +1,509 @@
+{
+ "metadata": {
+  "name": "Record linkage with Nazca - Example Dbpedia - INSEE"
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h2>Record linkage with Nazca - Example Dbpedia - INSEE</h2>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Imports"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import nazca.utils.dataio as nio\n",
+      "import nazca.utils.distances as ndi\n",
+      "import nazca.utils.normalize as nun\n",
+      "import nazca.rl.blocking as nbl\n",
+      "import nazca.rl.aligner as nal"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 19
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>1 - Datasets creation</h3>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "First, we have to create both reference set and target set."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "We get all the couples (URI, insee code) from Dbpedia data"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "refset = nio.sparqlquery('http://demo.cubicweb.org/sparql',\n",
+      "                         '''PREFIX dbonto:<http://dbpedia.org/ontology/>\n",
+      "                            SELECT ?p ?n ?c WHERE {?p a dbonto:PopulatedPlace.\n",
+      "                                                   ?p dbonto:country dbpedia:France.\n",
+      "                                                   ?p foaf:name ?n.\n",
+      "                                                   ?p dbpprop:insee ?c}''',\n",
+      "                         autocast_data=True)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 4
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print len(refset)\n",
+      "print refset[0]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "3636\n",
+        "[u'http://dbpedia.org/resource/Ajaccio', u'Ajaccio', 2]\n"
+       ]
+      }
+     ],
+     "prompt_number": 5
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "We get all the couples (URI, insee code) from INSEE data"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "targetset = nio.sparqlquery('http://rdf.insee.fr/sparql',\n",
+      "                            '''PREFIX igeo:<http://rdf.insee.fr/def/geo#>\n",
+      "                               SELECT ?commune ?nom ?code WHERE {?commune igeo:codeCommune ?code.\n",
+      "                                                                 ?commune igeo:nom ?nom}''',\n",
+      "                            autocast_data=True)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 6
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print len(targetset)\n",
+      "print targetset[0]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "36700\n",
+        "[u'http://id.insee.fr/geo/commune/64374', u'Mazerolles', 64374]\n"
+       ]
+      }
+     ],
+     "prompt_number": 7
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>Definition of the distance functions and the Processing</h3>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "We use a distance based on difflib, where distance(a, b) == 0 iif a==b"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "processing = ndi.DifflibProcessing(ref_attr_index=1, target_attr_index=1)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 8
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print refset[0], targetset[0]\n",
+      "print processing.distance(refset[0], targetset[0])"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[u'http://dbpedia.org/resource/Ajaccio', u'Ajaccio', 2] [u'http://id.insee.fr/geo/commune/64374', u'Mazerolles', 64374]\n",
+        "0.764705882353\n"
+       ]
+      }
+     ],
+     "prompt_number": 9
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>Preprocessings and normalization</h3>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "We define now a preprocessing step to normalize the string values of each record."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "normalizer = nun.SimplifyNormalizer(attr_index=1)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 11
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The simplify normalizer is based on the simplify() function that:\n",
+      "<ol>\n",
+      "    <li> Remove the stopwords;</li>\n",
+      "    <li> Lemmatized the sentence;</li>\n",
+      "    <li> Set the sentence to lower case;</li>\n",
+      "    <li> Remove punctuation;</li>\n",
+      "</ol>"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print normalizer.normalize(refset[0])"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[u'http://dbpedia.org/resource/Ajaccio', u'ajaccio', 2]\n"
+       ]
+      }
+     ],
+     "prompt_number": 12
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print normalizer.normalize(targetset[0])"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[u'http://id.insee.fr/geo/commune/64374', u'mazerolles', 64374]\n"
+       ]
+      }
+     ],
+     "prompt_number": 13
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>Blockings</h3>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Blockings act similarly to divide-and-conquer approaches. They create subsets of both datasets (blocks) that will be compared, rather than making all the possible comparisons."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "We create a simple NGramBlocking that will create subsets of records by looking at the first N characters of each values.In our case, we choose only the first 2 characters, with a depth of one (i.e. we don't do a recursive blocking)."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "blocking = nbl.NGramBlocking(1, 1, ngram_size=5, depth=1)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 16
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The blocking is 'fit' on both refset and targetset and then applied to iterate blocks"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "blocking.fit(refset, targetset)\n",
+      "blocks = list(blocking._iter_blocks())\n",
+      "print blocks[0]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "([(1722, u'http://dbpedia.org/resource/Redon,_Ille-et-Vilaine')], [(17045, u'http://id.insee.fr/geo/commune/35236')])\n"
+       ]
+      }
+     ],
+     "prompt_number": 17
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Cleanup the blocking for now as it may have stored some internal data"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "blocking.cleanup()"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 18
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>Define Aligner</h3>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Finaly, we can create the Aligner object that will perform the whole alignment processing.\n",
+      "We set the threshold to 0.1 and we only have one processing"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "aligner = nal.BaseAligner(threshold=0.1, processings=(processing,))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 20
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Thus, we register the normalizer and the blocking"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "aligner.register_ref_normalizer(normalizer)\n",
+      "aligner.register_target_normalizer(normalizer)\n",
+      "aligner.register_blocking(blocking)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 21
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The aligner has a get_aligned_pairs() function that will yield the comparisons that are below the threshold:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "pairs = list(aligner.get_aligned_pairs(refset[:1000], targetset[:1000]))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 22
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print len(pairs)\n",
+      "for p in pairs[:5]:\n",
+      "    print p"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "24\n",
+        "((u'http://dbpedia.org/resource/Calvi,_Haute-Corse', 8), (u'http://id.insee.fr/geo/commune/2B050', 268), 1e-10)\n",
+        "((u'http://dbpedia.org/resource/Livry,_Calvados', 910), (u'http://id.insee.fr/geo/commune/14372', 117), 1e-10)\n",
+        "((u'http://dbpedia.org/resource/Saint-Jean-sur-Veyle', 272), (u'http://id.insee.fr/geo/commune/01365', 301), 1e-10)\n",
+        "((u'http://dbpedia.org/resource/Saint-Just,_Ain', 273), (u'http://id.insee.fr/geo/commune/35285', 146), 1e-10)\n",
+        "((u'http://dbpedia.org/resource/Aix-en-Provence', 530), (u'http://id.insee.fr/geo/commune/13001', 669), 1e-10)\n"
+       ]
+      }
+     ],
+     "prompt_number": 24
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Each pair has the following structure: ((id in refset, indice in refset), (id in targset, indice in targetset), distance)."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h2>Introduction to Nazca - Using pipeline</h2>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "It could be interesting to pipeline the blockings, i.e. use a raw blocking technique with a good recall (i.e. not too conservative) and that could have a bad precision (i.e. not too precise), but that is fast.\n",
+      "\n",
+      "Then, we can use a more time consuming but more precise blocking on each block."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "blocking_1 = nbl.NGramBlocking(1, 1, ngram_size=3, depth=1)\n",
+      "blocking_2 = nbl.MinHashingBlocking(1, 1)\n",
+      "blocking = nbl.PipelineBlocking((blocking_1, blocking_2),collect_stats=True)\n",
+      "aligner.register_blocking(blocking)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 25
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "pairs = list(aligner.get_aligned_pairs(refset, targetset))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 26
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print len(pairs)\n",
+      "for p in pairs[:5]:\n",
+      "    print p"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "3408\n",
+        "((u'http://dbpedia.org/resource/Ajaccio', 0), (u'http://id.insee.fr/geo/commune/2A004', 18455), 1e-10)\n",
+        "((u'http://dbpedia.org/resource/Bastia', 2), (u'http://id.insee.fr/geo/commune/2B033', 2742), 1e-10)\n",
+        "((u'http://dbpedia.org/resource/Sart%C3%A8ne', 3), (u'http://id.insee.fr/geo/commune/2A272', 15569), 1e-10)\n",
+        "((u'http://dbpedia.org/resource/Corte', 4), (u'http://id.insee.fr/geo/commune/2B096', 2993), 1e-10)\n",
+        "((u'http://dbpedia.org/resource/Bonifacio,_Corse-du-Sud', 6), (u'http://id.insee.fr/geo/commune/2A041', 20886), 1e-10)\n"
+       ]
+      }
+     ],
+     "prompt_number": 27
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
\ No newline at end of file
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/notebooks/Record linkage with Nazca - part 1 - Introduction.ipynb	Tue Jun 24 21:10:26 2014 +0000
@@ -0,0 +1,742 @@
+{
+ "metadata": {
+  "name": "Record linkage with Nazca - part 1 - Introduction"
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h1>Record linkage with Nazca - part 1</h1>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h2>Introduction on Record Linkage</h2>"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from IPython.display import HTML\n",
+      "HTML('<iframe src=http://en.mobile.wikipedia.org/wiki/Record_linkage?useformat=mobile width=700 height=350></iframe>')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<iframe src=http://en.mobile.wikipedia.org/wiki/Record_linkage?useformat=mobile width=700 height=350></iframe>"
+       ],
+       "output_type": "pyout",
+       "prompt_number": 1,
+       "text": [
+        "<IPython.core.display.HTML at 0x3bef910>"
+       ]
+      }
+     ],
+     "prompt_number": 1
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>Conventions on datasets</h3>\n"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>Record</h4>\n",
+      "List of attributes of an entry/object.\n",
+      "The first attribute is ALWAYS considered as the identifier (e.g. URI) of the record. We note $v$ the number of attributes in a record."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "('http://data.bnf.fr/11907966/victor_hugo/', 'Victor Hugo', 1802, 1885, u'Besan\u00e7on (Doubs)', u'Paris')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "pyout",
+       "prompt_number": 5,
+       "text": [
+        "('http://data.bnf.fr/11907966/victor_hugo/',\n",
+        " 'Victor Hugo',\n",
+        " 1802,\n",
+        " 1885,\n",
+        " u'Besan\\xe7on (Doubs)',\n",
+        " u'Paris')"
+       ]
+      }
+     ],
+     "prompt_number": 5
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>Reference set / Refset</h4>\n",
+      "Set of reference data with size $n \\times v$. We note $r_i$ the ith element of the reference set."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>Target set / Targetset</h4>\n",
+      "Set of target data, with size $m \\times v$. We note $t_j$ the jth element of the targset set.\n",
+      "Note: If we do not ask the record linkage to be surjective, i.e. one record of the target set is only aligned to one record in the reference set, the notions of reference set and target set are commutable."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>Comparison result / Pair</h4>\n",
+      "A pair is the result of a comparison between a record of the refset and a record of the targetset. We note $p_{ij}$ the pair resulting from the comparison of $r_i$ and $t_j$."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>Standard approach in Nazca</h3>\n",
+      "\n",
+      "The standard approach in Nazca is the following:\n",
+      "<ol>\n",
+      "    <li>definition of both reference and target sets</li>\n",
+      "    <li>definition of some functions to compare different attribute</li>\n",
+      "    <li>computation of the global distance by calculating a (possibly averaged) sum of all the distances</li>\n",
+      "    <li>keep only the pairs that with a global distance below a given threshold</li>\n",
+      "</ol>\n",
+      "        \n"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>Problematic and blocking</h3>\n",
+      "\n",
+      "The major issue in record linkage is the size of the sets to be linked.\n",
+      "Lets assume a simple <a href=\"http://en.wikipedia.org/wiki/Levenshtein_distance\">Levenshtein distance </a>:\n"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import timeit\n",
+      "t = timeit.timeit('from nazca.utils.distances import levenshtein\\nlevenshtein(\"abcd\", \"abde\")', number=1000)\n",
+      "print '%s (s)' % t\n"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "0.0332899093628 (s)\n"
+       ]
+      }
+     ],
+     "prompt_number": 14
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Lets now assume both sets of size $n=m=10^4$. Thus the total computation time for all the possible comparisons is:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "total = t*10000*10000\n",
+      "print '%s (s) = %s (h) = %s (d)' % (total, total/3600., total/(3600.*24.))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "3328990.93628 (s) = 924.719704522 (h) = 38.5299876884 (d)\n"
+       ]
+      }
+     ],
+     "prompt_number": 18
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Blocking aims at dividing the global comparisons matrix in small subsets (kind of divide-and-conquer approach).\n",
+      "This may be seen as a block diagonalisation of the comparisons matrix.\n",
+      "The number of comparisons (may) dramatically decrease !"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h2>Structure of the Nazca library</h2>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>IO - nazca.utils.dataio</h3>\n",
+      "This module provides several helpers for input/ouput, and for creating datasets."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import nazca.utils.dataio as nio"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 19
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "refset = nio.sparqlquery('http://demo.cubicweb.org/sparql',\n",
+      "                         '''PREFIX dbonto:<http://dbpedia.org/ontology/>\n",
+      "                            SELECT ?p ?n ?c WHERE {?p a dbonto:PopulatedPlace.\n",
+      "                                                   ?p dbonto:country dbpedia:France.\n",
+      "                                                   ?p foaf:name ?n.\n",
+      "                                                   ?p dbpprop:insee ?c}''',\n",
+      "                         autocast_data=False)\n",
+      "print refset[0]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[u'http://dbpedia.org/resource/Ajaccio', u'Ajaccio', u'2']\n"
+       ]
+      }
+     ],
+     "prompt_number": 23
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The `autocast_data` option may be use to automatically try to cast attributes to some specific types"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "refset = nio.sparqlquery('http://demo.cubicweb.org/sparql',\n",
+      "                         '''PREFIX dbonto:<http://dbpedia.org/ontology/>\n",
+      "                            SELECT ?p ?n ?c WHERE {?p a dbonto:PopulatedPlace.\n",
+      "                                                   ?p dbonto:country dbpedia:France.\n",
+      "                                                   ?p foaf:name ?n.\n",
+      "                                                   ?p dbpprop:insee ?c}''')\n",
+      "print refset[0]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[u'http://dbpedia.org/resource/Ajaccio', u'Ajaccio', 2]\n"
+       ]
+      }
+     ],
+     "prompt_number": 24
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "refset = nio.rqlquery('http://www.cubicweb.org', 'Any U, N WHERE X is Project, X name N, X cwuri U')\n",
+      "print refset[0]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[u'http://www.cubicweb.org/1579176', u'cubicweb-mobile']\n"
+       ]
+      }
+     ],
+     "prompt_number": 25
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>Distances - nazca.utils.distances</h3>\n",
+      "This module provides several distance functions."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import nazca.utils.distances as ndi"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 1
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>Exact match distance</h4>\n",
+      "\n",
+      "The simplest distance, defined as 0 if both values are equal, 1 elsewise."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for sa, sb in (('abcd', 'abcd'), ('abcd', 'abce'), ((1, 2), (1, 2)), ((1, 2, 'abd'), (2, 1, 'abd'))):\n",
+      "    print sa, sb, ndi.exact_match(sa, sb)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "abcd abcd 0\n",
+        "abcd abce 1\n",
+        "(1, 2) (1, 2) 0\n",
+        "(1, 2, 'abd') (2, 1, 'abd') 1\n"
+       ]
+      }
+     ],
+     "prompt_number": 5
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4><a href=\"http://en.wikipedia.org/wiki/Levenshtein_distance\">Levenshtein distance </a></h4>\n",
+      "\n",
+      "The Levenshtein distance is defined as the minimal cost to transform string a into string b, where 3 operators are allowed:\n",
+      "<ul>\n",
+      "    <li>Replace one character of string a into a character of string b</li>\n",
+      "    <li>Add one character of string b into string a</li>\n",
+      "    <li>Remove one character of string b</li>"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for sa, sb in (('abcd', 'abcd'), ('abcd', 'abce'), ('abcd', 'abc'), ('abc', 'abcd'), ('abcd', 'efgh')):\n",
+      "    print sa, sb, ndi.levenshtein(sa, sb)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "abcd abcd 0\n",
+        "abcd abce 1\n",
+        "abcd abc 1\n",
+        "abc abcd 1\n",
+        "abcd efgh 4\n"
+       ]
+      }
+     ],
+     "prompt_number": 30
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4><a href=\"http://en.wikipedia.org/wiki/Soundex\">Soundex distance </a></h4>\n",
+      "\n",
+      "The Soundex distance is return 1 if the soundex codes of the two strings are different, and 0 otherwise:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for sa, sb in (('victor', 'victor'), ('victor', 'viktor'), ('victor', 'victo'), ('victor', 'viktaur')):\n",
+      "    print sa, sb, ndi.soundex(sa, sb)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "victor victor 0\n",
+        "victor viktor 0\n",
+        "victor victo 1\n",
+        "victor viktaur 0\n"
+       ]
+      }
+     ],
+     "prompt_number": 32
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4><a href=\"http://en.wikipedia.org/wiki/Jaccard_index\">Jaccard distance </a></h4>\n",
+      "\n",
+      "The Jaccard distance between two strings is defined as one minus Jaccard distance $J$ between the two sets of tokens built from the strings, with:\n",
+      "$$\n",
+      "J(A, B) = (A \\cap B)/(A \\cup B)\n",
+      "$$\n",
+      "Thus:\n",
+      "$$\n",
+      "d(string A, string B) = 1 - J(tokens A, tokens B)\n",
+      "$$"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for sa, sb in (('victor hugo', 'victor hugo'), ('victor hugo', 'victor'),\n",
+      "               ('victor hugo', 'victor hugo Besancon 1802 Paris 1885')):\n",
+      "    print sa, sb, ndi.jaccard(sa, sb)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "victor hugo victor hugo 0.0\n",
+        "victor hugo victor 0.5\n",
+        "victor hugo victor hugo Besancon 1802 Paris 1885 0.666666666667\n"
+       ]
+      }
+     ],
+     "prompt_number": 35
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4><a href=\"https://docs.python.org/2/library/difflib.html\">Difflib distance </a></h4>\n",
+      "\n",
+      "Distance based on the `SequenceMatcher` class of the `difflib` modulen, which is based on the `gestalt pattern matching` algorithm. The Nazca distance return `1 - difflib.SequenceMatcher(None, stra, strb).ratio()`."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for sa, sb in (('abcd', 'abcd'), ('abcd', 'abce'), ('abcd', 'abc'), ('abc', 'abcd'), ('abcd', 'efgh'),,\n",
+      "               ('victor', 'victor'), ('victor', 'viktor'), ('victor', 'victo'), ('victor', 'viktaur')):\n",
+      "    print sa, sb, ndi.difflib_match(sa, sb)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "abcd abcd 0.0\n",
+        "abcd abce 0.25\n",
+        "abcd abc 0.142857142857\n",
+        "abc abcd 0.142857142857\n",
+        "abcd efgh 1.0\n"
+       ]
+      }
+     ],
+     "prompt_number": 36
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4><a href=\"https://labix.org/python-dateutil\">Temporal distance </a></h4>\n",
+      "\n",
+      "The Temporal distance is based on the `dateutil` Python module and is used to compute the distance between two strings representing dates."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for d1, d2 in (('14 aout 1991', '14/08/1991'), ('14 aout 1991', '08/14/1991'), ('14 aout 1991', '08/15/1992'),\n",
+      "               ('1er mai 2012', '01/05/2012'), ('Jean est n\u00e9 le 1er octobre 1958', 'Le 01-10-1958, Jean est n\u00e9')):\n",
+      "    print d1, d2, ndi.temporal(d1, d2)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "14 aout 1991 14/08/1991 0\n",
+        "14 aout 1991 08/14/1991 0\n",
+        "14 aout 1991 08/15/1992 367\n",
+        "1er mai 2012 01/05/2012 0\n",
+        "Jean est n\u00e9 le 1er octobre 1958 Le 01-10-1958, Jean est n\u00e9 0\n"
+       ]
+      }
+     ],
+     "prompt_number": 39
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Different granularies may be used:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print ndi.temporal('13 mars', '13 mai', 'months')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "2.0\n"
+       ]
+      }
+     ],
+     "prompt_number": 41
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print ndi.temporal('14 aout 1991', '08/15/1992', 'years')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "1.00479123888\n"
+       ]
+      }
+     ],
+     "prompt_number": 42
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Dates are extracted from the text using french informations by default, but an english parser exists, and additional parsers may be built:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from dateutil import parser as dateparser\n",
+      "print ndi.temporal('13 march', '13 may', 'months', parserinfo=dateparser.parserinfo)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "2.0\n"
+       ]
+      }
+     ],
+     "prompt_number": 44
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4><a href=\"http://en.wikipedia.org/wiki/Geographical_distance\">Geographical distance </a></h4>\n",
+      "\n",
+      "The geographical distance between two points on Earth. Results are in meters, but the units can also be in 'km'.\n",
+      "The planet radius can also be changed (if it may be useful...)."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "paris = (48.856578, 2.351828)     \n",
+      "london = (51.504872, -0.07857)      \n",
+      "print ndi.geographical(paris, london, in_radians=False)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "341564.159451\n"
+       ]
+      }
+     ],
+     "prompt_number": 49
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print ndi.geographical(paris, london, in_radians=False, units='km')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "341.564159451\n"
+       ]
+      }
+     ],
+     "prompt_number": 50
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>BaseProcessing - nazca.utils.distances</h3>\n",
+      "\n",
+      "The class BaseProcessing is used to hold all necessary information for a distance between records\n",
+      "<ul>\n",
+      "    <li>indice of the attribute to be used in the reference set records</li>\n",
+      "    <li>indice of the attribute to be used in the target set records</li>\n",
+      "    <li>the distance function to be used</li>\n",
+      "    <li>the weight of the distance in the global distance matrix</li>\n",
+      "    <li>the normalization of the matrix. If True, the distance between two points is arbitrary set to [0, 1], doing:\n",
+      "        $$d = 1 - 1/(1 + d(x, y))$$\n",
+      "    </li>\n",
+      "</ul>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The prototype of the BaseProcessing class is:\n",
+      "\n",
+      "     class BaseProcessing(object):\n",
+      "\n",
+      "        def distance(self, reference_record, target_record):         \n",
+      "            \"\"\" Compute the distance between two records\"\"\"\n",
+      "\n",
+      "        def cdist(self, refset, targetset, ref_indexes=None, target_indexes=None):        \n",
+      "            \"\"\" Compute the metric matrix, given two datasets and a metric\"\"\"\n",
+      "\n",
+      "        def pdist(self, dataset):         \n",
+      "            \"\"\" Compute the upper triangular matrix in a way similar to scipy.spatial.metrical\"\"\"\n",
+      "\n",
+      "\n",
+      "Nazca provides a processing class for all the distances previously described.\n"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Of course, it is possible to define your own processing, based on a specific distance:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "def iamvictorhugo(stra, strb):\n",
+      "    \"\"\"Return 0 if both strings include 'victor hugo', 0 otherwise \"\"\"\n",
+      "    return 0 if ('victor hugo' in stra.lower() and 'victor hugo' in strb.lower()) else 1\n",
+      "\n",
+      "\n",
+      "class VictorHugoProcessing(ndi.BaseProcessing):\n",
+      "    \n",
+      "          def __init__(self, ref_attr_index=None, target_attr_index=None,     \n",
+      "                       weight=1, matrix_normalized=False):         \n",
+      "            super(VictorHugoProcessing, self).__init__(ref_attr_index, target_attr_index,  \n",
+      "                                                       iamvictorhugo, weight, matrix_normalized)\n",
+      "\n",
+      "processing = VictorHugoProcessing(1, 1)\n",
+      "print processing.distance(('http://dbpedia.org/page/Victor_Hugo', 'Victor Hugo'),\n",
+      "                          ('http://data.bnf.fr/11907966/victor_hugo/', 'Victor Hugo'))\n",
+      "print processing.distance(('http://dbpedia.org/page/Victor_Hugo', 'Victor Hugo'),\n",
+      "                          ('http://data.bnf.fr/11907966/victor_hugo/', 'Yu guo'))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "0\n",
+        "1\n"
+       ]
+      }
+     ],
+     "prompt_number": 56
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
\ No newline at end of file
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/notebooks/Record linkage with Nazca - part 2 - Normalization and blockings.ipynb	Tue Jun 24 21:10:26 2014 +0000
@@ -0,0 +1,1267 @@
+{
+ "metadata": {
+  "name": "Record linkage with Nazca - part 2 - Normalization and blockings"
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h1>Record linkage with Nazca - part 2 - Normalization and blockings</h1>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>Distances - nazca.utils.normalize</h3>\n",
+      "This module provides several normalization and preprocessing functions."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import nazca.utils.normalize as nno"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 3
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>unormalize</h4>\n",
+      "\n",
+      "Replace diacritical characters with their corresponding ascii characters."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print nno.unormalize(u'toto')\n",
+      "print nno.unormalize(u't\u00e9t\u00e9')\n",
+      "print nno.unormalize(u'tut\u00e9')\n",
+      "print nno.unormalize(u't\u00c9t\u00e9')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "toto\n",
+        "tete\n",
+        "tute\n",
+        "tEte\n"
+       ]
+      }
+     ],
+     "prompt_number": 12
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "For non-ascii based characters, an error is raised, but you can give a substitute to avoid it."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print nno.unormalize(u'\u0392\u03af\u03ba\u03c4\u03c9\u03c1 \u039f\u1f51\u03b3\u03ba\u03ce', substitute='_')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "______ _____\n"
+       ]
+      }
+     ],
+     "prompt_number": 13
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "There is also a function `lunormalize` that also set the sentence to lower case."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print nno.lunormalize(u't\u00c9t\u00e9')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "tete\n"
+       ]
+      }
+     ],
+     "prompt_number": 14
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>simplify</h4>\n",
+      "\n",
+      "Simply a given sentence by performing the following actions\n",
+      "<ol>\n",
+      "    <li>If remove_stopwords, then remove the stop words (for now only french stopwords exist)</li>\n",
+      "    <li>If lemmas are given, the sentence is lemmatized</li>\n",
+      "    <li>Set the sentence to lower case</li>\n",
+      "    <li>Remove punctuation</li>\n",
+      "</ol>"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print nno.simplify(u\"\"\"\u00c9crivain. - Artiste graphiste, auteur de lavis. - \n",
+      "                       Membre de l'Institut, Acad\u00e9mie fran\u00e7aise (\u00e9lu en 1841)\"\"\")"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "\u00e9crivain  artiste graphiste auteur lavis  \n",
+        "            membre l institut acad\u00e9mie fran\u00e7aise \u00e9lu 1841\n"
+       ]
+      }
+     ],
+     "prompt_number": 15
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.data import FRENCH_LEMMAS\n",
+      "print nno.simplify(u\"\"\" Victor Hugo occupe une place importante dans l'histoire \n",
+      "                        des lettres fran\u00e7aises et celle du XIX si\u00e8cle,\n",
+      "                        dans des genres et des domaines d'une remarquable vari\u00e9t\u00e9.\"\"\",\n",
+      "                   lemmas=FRENCH_LEMMAS)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "victor hugo occuper placer important histoire lettre fran\u00e7aises celui xix si\u00e8cle genre domaine remarquable vari\u00e9t\u00e9\n"
+       ]
+      }
+     ],
+     "prompt_number": 24
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "A `lemmatize` function is also available."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>tokenize</h4>\n",
+      "\n",
+      "Create a list of tokens from a sentence. A tokenizer object may be given, and should have a `tokenize()` method returning a list of tokens from a string."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print nno.tokenize(u\"\"\" Victor Hugo occupe une place importante dans l'histoire \n",
+      "                        des lettres fran\u00e7aises et celle du XIX si\u00e8cle,\n",
+      "                        dans des genres et des domaines d'une remarquable vari\u00e9t\u00e9.\"\"\")"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[u'Victor', u'Hugo', u'occupe', u'une', u'place', u'importante', u'dans', u\"l'\", u'histoire', u'des', u'lettres', u'fran\\xe7aises', u'et', u'celle', u'du', u'XIX', u'si\\xe8cle,', u'dans', u'des', u'genres', u'et', u'des', u'domaines', u\"d'\", u'une', u'remarquable', u'vari\\xe9t\\xe9.']\n"
+       ]
+      }
+     ],
+     "prompt_number": 26
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>roundstr</h4>\n",
+      "\n",
+      "Return an unicode string of a number rounded to a given precision in decimal digits."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print nno.roundstr('3.1415', 2)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "3.14\n"
+       ]
+      }
+     ],
+     "prompt_number": 28
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>rgxformat</h4>\n",
+      "\n",
+      "Apply a regular expression to a string and return a formatted string with information extracted by the regular expression."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print nno.rgxformat(u'[Victor Hugo - 26 fev 1802 / 22 mai 1885]',\n",
+      "                    r'\\[(?P<firstname>\\w+) (?P<lastname>\\w+) - (?P<birthdate>.*) \\/ (?P<deathdate>.*?)\\]',\n",
+      "                    u'%(lastname)s, %(firstname)s (%(birthdate)s - %(deathdate)s)')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Hugo, Victor (26 fev 1802 - 22 mai 1885)\n"
+       ]
+      }
+     ],
+     "prompt_number": 31
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>BaseNormalizer - nazca.utils.normalize</h3>\n",
+      "\n",
+      "The class BaseNormalizer is used to hold all necessary information for a normalization of records.\n",
+      "It requires:\n",
+      "<ul>\n",
+      "    <li>indice of the attribute to be normalized</li>\n",
+      "    <li>the normalization function to be used</li>\n",
+      "</ul>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The prototype of the BaseProcessing class is:\n",
+      "\n",
+      "      class BaseNormalizer(object):\n",
+      "\n",
+      "         def normalize(self, record):\n",
+      "            \"\"\" Normalize a record\"\"\"\n",
+      "     \n",
+      "         def normalize_dataset(self, dataset, inplace=False): \n",
+      "            \"\"\" Normalize a dataset\"\"\"\n",
+      "\n",
+      "Nazca provides a normalization class for all the normalization previously described."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "normalizer = nno.RoundNormalizer(attr_index=2)\n",
+      "print normalizer.normalize(('uri', 'toto', '3.1415'))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "['uri', 'toto', '3']\n"
+       ]
+      }
+     ],
+     "prompt_number": 36
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "If `attr_index` is None, the whole record is passed to the callback. You can also give a list of indexes:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "normalizer = nno.RoundNormalizer(attr_index=[1, 2])\n",
+      "print normalizer.normalize(('uri', '22.36', '3.1415'))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "['uri', '22', '3']\n"
+       ]
+      }
+     ],
+     "prompt_number": 37
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Of course, it is possible to define your own processing, based on a specific normalization:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "def iamvictorhugo(value):\n",
+      "    \"\"\"Return Victor Hugo \"\"\"\n",
+      "    return u'Victor Hugo'\n",
+      "\n",
+      "\n",
+      "class VictorHugoNormalizer(nno.BaseNormalizer):\n",
+      "\n",
+      "    def __init__(self, attr_index=None):\n",
+      "        super(VictorHugoNormalizer, self).__init__(iamvictorhugo, attr_index)\n",
+      "\n",
+      "normalizer = VictorHugoNormalizer(1)\n",
+      "print normalizer.normalize(('http://data.bnf.fr/12249911/michel_houellebecq/', u'Michel Houellebecq'))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "['http://data.bnf.fr/12249911/michel_houellebecq/', u'Victor Hugo']\n"
+       ]
+      }
+     ],
+     "prompt_number": 41
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>JoinNormalizer</h4>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The `JoinNormalizer` is used to join multiple fields in only one.\n",
+      "This new field will be put at the end of the new record."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "normalizer = nno.JoinNormalizer((1,2))     \n",
+      "print normalizer.normalize((1, 'ab', 'cd', 'e', 5))\n"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[1, 'e', 5, 'ab, cd']\n"
+       ]
+      }
+     ],
+     "prompt_number": 42
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>NormalizerPipeline</h4>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Finaly, it is possible to join BaseNormalizer in a pipeline, using the NormalizerPipeline.\n",
+      "This pipeline takes a list of BaseNormalizer that will be applied in the given order."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "First, we define a RegexpNormalizer:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "regexp = r'(?P<id>\\d+);{[\"]?(?P<firstname>.+[^\"])[\"]?};{(?P<surname>.*)};{};{};(?P<date>.*)'\n",
+      "output = u'%(id)s\\t%(firstname)s\\t%(surname)s\\t%(date)s'\n",
+      "n1 = nno.RegexpNormalizer(regexp, u'%(id)s\\t%(firstname)s\\t%(surname)s\\t%(date)s')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 44
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Then, we define a specific BaseNormalizer that split data on a tabulation:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "n2 = nno.BaseNormalizer(callback= lambda x: x.split('\\t'))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 46
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The third normalizer is an UnicodeNormalizer:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "n3 = nno.UnicodeNormalizer(attr_index=(1, 2, 3))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 49
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Thus, we define the pipeline:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "pipeline = nno.NormalizerPipeline((n1, n2, n3))\n",
+      "r1 = u'1111;{\"Toto t\u00e0t\u00e0\"};{Titi};{};{};'\n",
+      "print pipeline.normalize(r1)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[u'1111', u'toto tata', u'titi', u'']\n"
+       ]
+      }
+     ],
+     "prompt_number": 50
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>Blockings - nazca.rl.blocking</h3>\n",
+      "\n",
+      "Blockings are used to reduce the total number of possible comparisons, by only comparing records that are roughly similar.\n",
+      "\n",
+      "In Nazca, the BaseBlocking class is defined as:\n",
+      "\n",
+      "     class BaseBlocking(object):\n",
+      "\n",
+      "         def __init__(self, ref_attr_index, target_attr_index):\n",
+      "             \"\"\" Build the blocking object \"\"\"\n",
+      "\n",
+      "         def _fit(self, refset, targetset): \n",
+      "             \"\"\" Internal fit\"\"\"\n",
+      "             raise NotImplementedError      \n",
+      "\n",
+      "         def _iter_blocks(self):       \n",
+      "             \"\"\" Internal iteration function over blocks\"\"\"\n",
+      "             raise NotImplementedError    \n",
+      "        \n",
+      "         def _cleanup(self):         \n",
+      "             \"\"\" Internal cleanup blocking for further use (e.g. in pipeline)\"\"\"\n",
+      "             raise NotImplementedError\n",
+      "\n",
+      "\n",
+      "The `BaseBlocking` should be build by defining the `ref_attr_index` and `target_attr_index`, which are respectively\n",
+      "the index of the attribute of the reference set and target set that will be used in the blocking.\n",
+      "\n",
+      "\n",
+      "The `fit()` method is called to the blocking technique on the reference and target datasets.\n",
+      "Once the fit is done, it is possible to call the `iter_blocks()` method that will yield two listes (one\n",
+      "for the reference set, and one for the targetset) of pair (index, id) of records. It is also possible to iterate on the ids of the records using `iter_id_blocks()`.\n",
+      "\n",
+      "Other methods provide different iterative access to the blocks.\n",
+      "\n",
+      "The `cleanup()` method is used to clean up the blocking (e.g. in pipeline) by reseting all the possible\n",
+      "internal states."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.rl import blocking as nrb"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 1
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>KeyBlocking</h4>\n",
+      "\n",
+      "The KeyBlocking is based on a blocking criteria (or blocking key), that will be used to divide the datasets.\n",
+      "\n",
+      "The main idea here is:\n",
+      "<ol>\n",
+      "    <li> to create an index of f(x) for each x in the reference set</li>\n",
+      "    <li> to create an index of f(y) for each y in the target set</li>\n",
+      "    <li> to iterate on each distinct value of f(x) and to return the identifiers of the records of the both sets for this value.</li>\n",
+      "</ol>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "In this example, we use the KeyBlocking jointly with the `soundexcode()` to create blocks of records with similar\n",
+      "Soundex code."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from functools import partial\n",
+      "from nazca.utils.distances import soundexcode\n",
+      "\n",
+      "refset = (('a1', 'smith'), ('a2', 'neighan'), ('a3', 'meier'),  ('a4', 'smithers'),\n",
+      "          ('a5', 'nguyen'), ('a6', 'faulkner'),  ('a7', 'sandy'))\n",
+      "targetset =  (('b1', 'meier'), ('b2', 'meier'),  ('b3', 'smith'),\n",
+      "              ('b4', 'nguyen'), ('b5', 'fawkner'), ('b6', 'santi'), ('b7', 'cain'))\n",
+      "\n",
+      "blocking = nrb.KeyBlocking(ref_attr_index=1, target_attr_index=1,  \n",
+      "                           callback=partial(soundexcode, language='english'))  "
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 2
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "blocking.fit(refset, targetset)\n",
+      "for refblock, targetblock in blocking.iter_blocks():\n",
+      "    print refblock\n",
+      "    for i, _id in refblock:\n",
+      "        print refset[i]\n",
+      "    print targetblock\n",
+      "    for i, _id in targetblock:\n",
+      "        print targetset[i]\n",
+      "    print 10*'*'"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[(0, 'a1'), (6, 'a7')]\n",
+        "('a1', 'smith')\n",
+        "('a7', 'sandy')\n",
+        "[(2, 'b3'), (5, 'b6')]\n",
+        "('b3', 'smith')\n",
+        "('b6', 'santi')\n",
+        "**********\n",
+        "[(1, 'a2'), (4, 'a5')]\n",
+        "('a2', 'neighan')\n",
+        "('a5', 'nguyen')\n",
+        "[(3, 'b4')]\n",
+        "('b4', 'nguyen')\n",
+        "**********\n",
+        "[(2, 'a3')]\n",
+        "('a3', 'meier')\n",
+        "[(0, 'b1'), (1, 'b2')]\n",
+        "('b1', 'meier')\n",
+        "('b2', 'meier')\n",
+        "**********\n"
+       ]
+      }
+     ],
+     "prompt_number": 3
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h5>SoundexBlocking</h5>\n",
+      "\n",
+      "This is a specific case of KeyBlocking based on Soundex code."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "blocking = nrb.SoundexBlocking(ref_attr_index=1, target_attr_index=1)\n",
+      "blocking.fit(refset, targetset)\n",
+      "for refblock, targetblock in blocking.iter_blocks():\n",
+      "    print refblock\n",
+      "    for i, _id in refblock:\n",
+      "        print refset[i]\n",
+      "    print targetblock\n",
+      "    for i, _id in targetblock:\n",
+      "        print targetset[i]\n",
+      "    print 10*'*'"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[(0, 'a1'), (6, 'a7')]\n",
+        "('a1', 'smith')\n",
+        "('a7', 'sandy')\n",
+        "[(2, 'b3'), (5, 'b6')]\n",
+        "('b3', 'smith')\n",
+        "('b6', 'santi')\n",
+        "**********\n",
+        "[(1, 'a2'), (4, 'a5')]\n",
+        "('a2', 'neighan')\n",
+        "('a5', 'nguyen')\n",
+        "[(3, 'b4')]\n",
+        "('b4', 'nguyen')\n",
+        "**********\n",
+        "[(2, 'a3')]\n",
+        "('a3', 'meier')\n",
+        "[(0, 'b1'), (1, 'b2')]\n",
+        "('b1', 'meier')\n",
+        "('b2', 'meier')\n",
+        "**********\n"
+       ]
+      }
+     ],
+     "prompt_number": 4
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>NGramBlocking</h4>\n",
+      "\n",
+      "The NGramBlocking is based on the construction of NGrams for a given attribute.\n",
+      "It will create a dictionnary with Ngrams as keys, and records as values. A depth can be given,\n",
+      "to create sub-dictionnaries."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "blocking = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=2, depth=1)\n",
+      "blocking.fit(refset, targetset)\n",
+      "print blocking.reference_index"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "{'me': [(2, 'a3')], 'ne': [(1, 'a2')], 'ng': [(4, 'a5')], 'fa': [(5, 'a6')], 'sm': [(0, 'a1'), (3, 'a4')], 'sa': [(6, 'a7')]}\n"
+       ]
+      }
+     ],
+     "prompt_number": 10
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "blocking = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=4, depth=1)\n",
+      "blocking.fit(refset, targetset)\n",
+      "print blocking.reference_index"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "{'meie': [(2, 'a3')], 'nguy': [(4, 'a5')], 'faul': [(5, 'a6')], 'neig': [(1, 'a2')], 'sand': [(6, 'a7')], 'smit': [(0, 'a1'), (3, 'a4')]}\n"
+       ]
+      }
+     ],
+     "prompt_number": 11
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "blocking = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=2, depth=2)\n",
+      "blocking.fit(refset, targetset)\n",
+      "print blocking.reference_index"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "{'me': {'ie': [(2, 'a3')]}, 'ne': {'ig': [(1, 'a2')]}, 'ng': {'uy': [(4, 'a5')]}, 'fa': {'ul': [(5, 'a6')]}, 'sm': {'it': [(0, 'a1'), (3, 'a4')]}, 'sa': {'nd': [(6, 'a7')]}}\n"
+       ]
+      }
+     ],
+     "prompt_number": 12
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>SortedNeighborhoodBlocking</h4>\n",
+      "\n",
+      "The SortedNeighborhoodBlocking is based on a sliding window of neighbours, given a function callback (aka key).\n",
+      "By default, the key is the identity function."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "blocking = nrb.SortedNeighborhoodBlocking(ref_attr_index=1, target_attr_index=1, window_width=2)\n",
+      "blocking.fit(refset, targetset)\n",
+      "print blocking.sorted_dataset"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[((6, 'b7'), 'cain', 1), ((5, 'a6'), 'faulkner', 0), ((4, 'b5'), 'fawkner', 1), ((2, 'a3'), 'meier', 0), ((0, 'b1'), 'meier', 1), ((1, 'b2'), 'meier', 1), ((1, 'a2'), 'neighan', 0), ((4, 'a5'), 'nguyen', 0), ((3, 'b4'), 'nguyen', 1), ((6, 'a7'), 'sandy', 0), ((5, 'b6'), 'santi', 1), ((0, 'a1'), 'smith', 0), ((2, 'b3'), 'smith', 1), ((3, 'a4'), 'smithers', 0)]\n"
+       ]
+      }
+     ],
+     "prompt_number": 23
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for refblock, targetblock in blocking.iter_blocks():\n",
+      "    print refblock\n",
+      "    for i, _id in refblock:\n",
+      "        print refset[i]\n",
+      "    print targetblock\n",
+      "    for i, _id in targetblock:\n",
+      "        print targetset[i]\n",
+      "    print 10*'*'"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[(5, 'a6')]\n",
+        "('a6', 'faulkner')\n",
+        "[(6, 'b7'), (4, 'b5')]\n",
+        "('b7', 'cain')\n",
+        "('b5', 'fawkner')\n",
+        "**********\n",
+        "[(2, 'a3')]\n",
+        "('a3', 'meier')\n",
+        "[(4, 'b5'), (0, 'b1'), (1, 'b2')]\n",
+        "('b5', 'fawkner')\n",
+        "('b1', 'meier')\n",
+        "('b2', 'meier')\n",
+        "**********\n",
+        "[(1, 'a2')]\n",
+        "('a2', 'neighan')\n",
+        "[(0, 'b1'), (1, 'b2'), (3, 'b4')]\n",
+        "('b1', 'meier')\n",
+        "('b2', 'meier')\n",
+        "('b4', 'nguyen')\n",
+        "**********\n",
+        "[(4, 'a5')]\n",
+        "('a5', 'nguyen')\n",
+        "[(1, 'b2'), (3, 'b4')]\n",
+        "('b2', 'meier')\n",
+        "('b4', 'nguyen')\n",
+        "**********\n",
+        "[(6, 'a7')]\n",
+        "('a7', 'sandy')\n",
+        "[(3, 'b4'), (5, 'b6')]\n",
+        "('b4', 'nguyen')\n",
+        "('b6', 'santi')\n",
+        "**********\n",
+        "[(0, 'a1')]\n",
+        "('a1', 'smith')\n",
+        "[(5, 'b6'), (2, 'b3')]\n",
+        "('b6', 'santi')\n",
+        "('b3', 'smith')\n",
+        "**********\n",
+        "[(3, 'a4')]\n",
+        "('a4', 'smithers')\n",
+        "[(2, 'b3')]\n",
+        "('b3', 'smith')\n",
+        "**********\n"
+       ]
+      }
+     ],
+     "prompt_number": 22
+    },
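+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The key callback can be customized, e.g. to sort the records on the reversed attribute so that shared suffixes drive the neighbourhood.\n",
+      "A minimal sketch (not executed here), assuming the constructor takes the callback as a `key_func` keyword (check the actual signature in `nazca.rl.blocking`):"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# Sketch: sort on the reversed name; `key_func` is an assumed keyword name\n",
+      "blocking = nrb.SortedNeighborhoodBlocking(ref_attr_index=1, target_attr_index=1,\n",
+      "                                          window_width=2, key_func=lambda x: x[::-1])\n",
+      "blocking.fit(refset, targetset)\n",
+      "print blocking.sorted_dataset"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },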
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>MergeBlocking</h4>\n",
+      "\n",
+      "This blocking technique keep only one appearance of one given values, and removes all the other records having this value. The merge is based on a score function, and the record with the higher value is kept.\n",
+      "\n",
+      "This is only done on ONE set (the one with a non null attrindex)."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "refset = [('http://fr.wikipedia.org/wiki/Paris_%28Texas%29', 'Paris', 25898), \n",
+      "          ('http://fr.wikipedia.org/wiki/Paris', 'Paris', 12223100),        \n",
+      "          ('http://fr.wikipedia.org/wiki/Saint-Malo', 'Saint-Malo', 46342)] \n",
+      "targetset = [('Paris (Texas)', 25000),\n",
+      "             ('Paris (France)', 12000000)]\n",
+      "blocking = nrb.MergeBlocking(ref_attr_index=1, target_attr_index=None, score_func=lambda x:x[2])\n",
+      "blocking.fit(refset, targetset)\n",
+      "print blocking.merged_dataset"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[(1, 'http://fr.wikipedia.org/wiki/Paris'), (2, 'http://fr.wikipedia.org/wiki/Saint-Malo')]\n"
+       ]
+      }
+     ],
+     "prompt_number": 28
+    },
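+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The merge can be done on the target set instead, by giving a `target_attr_index` and a null `ref_attr_index`.\n",
+      "A sketch (not executed here, and assuming `merged_dataset` holds the result on the target side as well) with duplicated names in the target set:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# Sketch: keep, for each duplicated name, the target record with the\n",
+      "# highest population (assumption: merged_dataset is filled the same way\n",
+      "# when merging on the target side)\n",
+      "targetset = [('Paris', 25000),\n",
+      "             ('Paris', 12000000),\n",
+      "             ('Saint-Malo', 46342)]\n",
+      "blocking = nrb.MergeBlocking(ref_attr_index=None, target_attr_index=0, score_func=lambda x: x[1])\n",
+      "blocking.fit(refset, targetset)\n",
+      "print blocking.merged_dataset"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },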
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>KmeansBlocking</h4>\n",
+      "\n",
+      "This blocking technique is based on <a href=\"http://en.wikipedia.org/wiki/K-means_clustering\">Kmeans clustering</a>.\n",
+      "It is based on the <a href=\"http://scikit-learn.org/stable/modules/clustering.html\">Scikit-learn</a> implementation.\n",
+      "If the number of clusters is not defined, it will take the size of the reference set divided by 10 or, if the reference set is to small, the size of the reference set divided by 2."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "refset = [['R1', 'ref1', (6.14194444444, 48.67)],\n",
+      "          ['R2', 'ref2', (6.2, 49)],\n",
+      "          ['R3', 'ref3', (5.1, 48)],\n",
+      "          ['R4', 'ref4', (5.2, 48.1)]]         \n",
+      "targetset = [['T1', 'target1', (6.2, 48.9)],        \n",
+      "             ['T2', 'target2', (5.3, 48.2)], \n",
+      "             ['T3', 'target3', (6.25, 48.91)]]\n",
+      "blocking = nrb.KmeansBlocking(ref_attr_index=2, target_attr_index=2, n_clusters=2)\n",
+      "blocking.fit(refset, targetset)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 9
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for refblock, targetblock in blocking.iter_blocks():\n",
+      "    print refblock\n",
+      "    for i, _id in refblock:\n",
+      "        print refset[i]\n",
+      "    print targetblock\n",
+      "    for i, _id in targetblock:\n",
+      "        print targetset[i]\n",
+      "    print 10*'*'"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[(2, 'R3'), (3, 'R4')]\n",
+        "['R3', 'ref3', (5.1, 48)]\n",
+        "['R4', 'ref4', (5.2, 48.1)]\n",
+        "[(1, 'T2')]\n",
+        "['T2', 'target2', (5.3, 48.2)]\n",
+        "**********\n",
+        "[(0, 'R1'), (1, 'R2')]\n",
+        "['R1', 'ref1', (6.14194444444, 48.67)]\n",
+        "['R2', 'ref2', (6.2, 49)]\n",
+        "[(0, 'T1'), (2, 'T3')]\n",
+        "['T1', 'target1', (6.2, 48.9)]\n",
+        "['T3', 'target3', (6.25, 48.91)]\n",
+        "**********\n"
+       ]
+      }
+     ],
+     "prompt_number": 10
+    },
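+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "When `n_clusters` is omitted, the default rule above applies: with only 4 reference records, the fallback should be `len(refset)/2 = 2` clusters.\n",
+      "A quick sanity-check sketch (not executed here; empty blocks may be skipped by `iter_blocks()`):"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# Sketch: let KmeansBlocking pick the number of clusters itself\n",
+      "blocking = nrb.KmeansBlocking(ref_attr_index=2, target_attr_index=2)\n",
+      "blocking.fit(refset, targetset)\n",
+      "print len(list(blocking.iter_blocks()))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },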
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>KdTreeBlocking</h4>\n",
+      "\n",
+      "This blocking technique is based on <a href=\"http://en.wikipedia.org/wiki/K-d_tree\">KdTree</a>.\n",
+      "Kdtrees are used to partition k-dimensional space.\n",
+      "\n",
+      "It is efficient as creating clusters of numerical values (with small k)."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "blocking = nrb.KdTreeBlocking(ref_attr_index=2, target_attr_index=2, threshold=0.3)\n",
+      "blocking.fit(refset, targetset)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 11
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for refblock, targetblock in blocking.iter_blocks():\n",
+      "    print refblock\n",
+      "    for i, _id in refblock:\n",
+      "        print refset[i]\n",
+      "    print targetblock\n",
+      "    for i, _id in targetblock:\n",
+      "        print targetset[i]\n",
+      "    print 10*'*'"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[(0, 'R1')]\n",
+        "['R1', 'ref1', (6.14194444444, 48.67)]\n",
+        "[(0, 'T1'), (2, 'T3')]\n",
+        "['T1', 'target1', (6.2, 48.9)]\n",
+        "['T3', 'target3', (6.25, 48.91)]\n",
+        "**********\n",
+        "[(1, 'R2')]\n",
+        "['R2', 'ref2', (6.2, 49)]\n",
+        "[(0, 'T1'), (2, 'T3')]\n",
+        "['T1', 'target1', (6.2, 48.9)]\n",
+        "['T3', 'target3', (6.25, 48.91)]\n",
+        "**********\n",
+        "[(2, 'R3')]\n",
+        "['R3', 'ref3', (5.1, 48)]\n",
+        "[(1, 'T2')]\n",
+        "['T2', 'target2', (5.3, 48.2)]\n",
+        "**********\n",
+        "[(3, 'R4')]\n",
+        "['R4', 'ref4', (5.2, 48.1)]\n",
+        "[(1, 'T2')]\n",
+        "['T2', 'target2', (5.3, 48.2)]\n",
+        "**********\n"
+       ]
+      }
+     ],
+     "prompt_number": 12
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>MinHashingBlocking</h4>\n",
+      "\n",
+      "This blocking technique is based on <a href=\"http://en.wikipedia.org/wiki/MinHash\">MinHash</a> algorithm.\n",
+      "It is very efficient as creating clusters of textual values."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.data import FRENCH_LEMMAS\n",
+      "from nazca.utils.normalize import SimplifyNormalizer\n",
+      "refset = [['R1', 'ref1', u\"Un nuage flotta dans le grand ciel bleu.\"],  \n",
+      "          ['R2', 'ref2', u\"Pour quelle occasion vous \u00eates-vous appr\u00eat\u00e9e\u202f?\"],  \n",
+      "          ['R3', 'ref3', u\"Je les vis ensemble \u00e0 plusieurs occasions.\"],      \n",
+      "          ['R4', 'ref4', u\"Je n'aime pas ce genre de bandes dessin\u00e9es tristes.\"], \n",
+      "          ['R5', 'ref5', u\"Ensemble et \u00e0 plusieurs occasions, je les vis\"]]        \n",
+      "targetset = [['T1', 'target1', u\"Des grands nuages noirs flottent dans le ciel.\"], \n",
+      "             ['T2', 'target2', u\"Je les ai vus ensemble \u00e0 plusieurs occasions.\"],  \n",
+      "             ['T3', 'target3', u\"J'aime les bandes dessin\u00e9es de genre comiques.\"]]\n",
+      "normalizer = SimplifyNormalizer(attr_index=2, lemmas=FRENCH_LEMMAS)\n",
+      "refset = normalizer.normalize_dataset(refset)\n",
+      "targetset = normalizer.normalize_dataset(targetset)\n",
+      "blocking = nrb.MinHashingBlocking(threshold=0.4, ref_attr_index=2, target_attr_index=2) \n",
+      "blocking.fit(refset, targetset)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 22
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for refblock, targetblock in blocking.iter_blocks():\n",
+      "    print refblock\n",
+      "    for i, _id in refblock:\n",
+      "        print refset[i]\n",
+      "    print targetblock\n",
+      "    for i, _id in targetblock:\n",
+      "        print targetset[i]\n",
+      "    print 10*'*'"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[(0, 'R1')]\n",
+        "['R1', 'ref1', u'nuage flotter grand ciel bleu']\n",
+        "[(0, 'T1')]\n",
+        "['T1', 'target1', u'grand nuage noir flotter ciel']\n",
+        "**********\n",
+        "[(3, 'R4')]\n",
+        "['R4', 'ref4', u'aimer genre bande dessin\\xe9es tristes']\n",
+        "[(2, 'T3')]\n",
+        "['T3', 'target3', u'aimer bande dessin\\xe9es genre comiques']\n",
+        "**********\n",
+        "[(2, 'R3'), (4, 'R5')]\n",
+        "['R3', 'ref3', u'vis ensemble \\xe0 plusieurs occasions']\n",
+        "['R5', 'ref5', u'ensemble \\xe0 plusieurs occasions vis']\n",
+        "[(1, 'T2')]\n",
+        "['T2', 'target2', u'voir ensemble \\xe0 plusieurs occasions']\n",
+        "**********\n"
+       ]
+      }
+     ],
+     "prompt_number": 23
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>PipelineBlocking - nazca.rl.blocking</h3>\n",
+      "\n",
+      "Nazca also provides a PipelineBlocking for pipelining multiple blockings within an alignment."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "refset = [['1', 'aabb', 'ccdd'],                   \n",
+      "          ['2', 'aabb', 'ddcc'],                \n",
+      "          ['3', 'ccdd', 'aabb'],                \n",
+      "          ['4', 'ccdd', 'bbaa']]         \n",
+      "targetset = [['a', 'aabb', 'ccdd'],  \n",
+      "             ['b', 'aabb', 'ddcc'],\n",
+      "             ['c', 'ccdd', 'aabb'],\n",
+      "             ['d', 'ccdd', 'bbaa']]\n",
+      "blocking_1 = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=2, depth=1)  \n",
+      "blocking_2 = nrb.NGramBlocking(ref_attr_index=2, target_attr_index=2, ngram_size=2, depth=1)\n",
+      "blocking = nrb.PipelineBlocking((blocking_1, blocking_2))         \n",
+      "blocking.fit(refset, targetset)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 26
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for refblock, targetblock in blocking.iter_blocks():\n",
+      "    print refblock\n",
+      "    for i, _id in refblock:\n",
+      "        print refset[i]\n",
+      "    print targetblock\n",
+      "    for i, _id in targetblock:\n",
+      "        print targetset[i]\n",
+      "    print 10*'*'"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[(0, '1')]\n",
+        "['1', 'aabb', 'ccdd']\n",
+        "[(0, 'a')]\n",
+        "['a', 'aabb', 'ccdd']\n",
+        "**********\n",
+        "[(1, '2')]\n",
+        "['2', 'aabb', 'ddcc']\n",
+        "[(1, 'b')]\n",
+        "['b', 'aabb', 'ddcc']\n",
+        "**********\n",
+        "[(2, '3')]\n",
+        "['3', 'ccdd', 'aabb']\n",
+        "[(2, 'c')]\n",
+        "['c', 'ccdd', 'aabb']\n",
+        "**********\n",
+        "[(3, '4')]\n",
+        "['4', 'ccdd', 'bbaa']\n",
+        "[(3, 'd')]\n",
+        "['d', 'ccdd', 'bbaa']\n",
+        "**********\n"
+       ]
+      }
+     ],
+     "prompt_number": 27
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "It is also possible to collect stats on the different sub-blockings using the option `collect_stats=True`, which gives the lengthts of the different blocks across the different sub-blockings."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "blocking_1 = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=2, depth=1)  \n",
+      "blocking_2 = nrb.NGramBlocking(ref_attr_index=2, target_attr_index=2, ngram_size=2, depth=1)\n",
+      "blocking = nrb.PipelineBlocking((blocking_1, blocking_2), collect_stats=True)         \n",
+      "blocking.fit(refset, targetset)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 28
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "print blocking.stats"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "{0: [(2, 2), (2, 2)], 1: [(1, 1), (1, 1), (1, 1), (1, 1)]}\n"
+       ]
+      }
+     ],
+     "prompt_number": 29
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
\ No newline at end of file
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/notebooks/Record linkage with Nazca - part 3 - Putting it all together.ipynb	Tue Jun 24 21:10:26 2014 +0000
@@ -0,0 +1,253 @@
+{
+ "metadata": {
+  "name": "Record linkage with Nazca - part 3 - Putting it all together"
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h1>Record linkage with Nazca - part 3 - Putting it all together</h1>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>Aligner - nazca.rl.aligner</h3>\n",
+      "\n",
+      "Once you have created your datasets, and define your preprocessings and blockings, you can use the `BaseAligner` object to perform the alignment.\n",
+      "\n",
+      "The `BaseAligner` is defined as:\n",
+      "\n",
+      "    class BaseAligner(object):\n",
+      "\n",
+      "        def register_ref_normalizer(self, normalizer):\n",
+      "            \"\"\" Register normalizers to be applied before alignment \"\"\"\n",
+      "\n",
+      "        def register_target_normalizer(self, normalizer):\n",
+      "            \"\"\" Register normalizers to be applied before alignment \"\"\"\n",
+      "\n",
+      "        def register_blocking(self, blocking):\n",
+      "            self.blocking = blocking\n",
+      "\n",
+      "        def align(self, refset, targetset, get_matrix=True):\n",
+      "            \"\"\" Perform the alignment on the referenceset and the targetset \"\"\"\n",
+      "\n",
+      "        def get_aligned_pairs(self, refset, targetset, unique=True):\n",
+      "            \"\"\" Get the pairs of aligned elements \"\"\""
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The `align()` function return the global distance matrix and the matched elements as a dictionnary, with key the index of reference records, and values the list of aligned target set records."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.utils.distances import GeographicalProcessing \n",
+      "from nazca.rl.aligner import BaseAligner\n",
+      "\n",
+      "refset = [['R1', 'ref1', (6.14194444444, 48.67)],\n",
+      "          ['R2', 'ref2', (6.2, 49)],\n",
+      "          ['R3', 'ref3', (5.1, 48)],\n",
+      "          ['R4', 'ref4', (5.2, 48.1)]]\n",
+      "targetset = [['T1', 'target1', (6.17, 48.7)],\n",
+      "             ['T2', 'target2', (5.3, 48.2)],\n",
+      "             ['T3', 'target3', (6.25, 48.91)]]\n",
+      "processings = (GeographicalProcessing(2, 2, units='km'),)\n",
+      "aligner = BaseAligner(threshold=30, processings=processings)\n",
+      "mat, matched = aligner.align(refset, targetset)\n",
+      "print mat\n",
+      "print matched"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[[   4.55325174  107.09278107   29.12484169]\n",
+        " [  33.33169937  133.59967041   11.39668941]\n",
+        " [ 141.97203064   31.38606644  162.75946045]\n",
+        " [ 126.65346527   15.69240952  147.18429565]]\n",
+        "{0: [(0, 4.5532517), (2, 29.124842)], 1: [(2, 11.396689)], 3: [(1, 15.69241)]}\n"
+       ]
+      }
+     ],
+     "prompt_number": 2
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The `get_aligned_pairs()` directly yield the found aligned pairs and the distance"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "aligner = BaseAligner(threshold=30, processings=processings)\n",
+      "for pair in aligner.get_aligned_pairs(refset, targetset):\n",
+      "    print pair"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "(('R1', 0), ('T1', 0), 4.5532517)\n",
+        "(('R2', 1), ('T3', 2), 11.396689)\n",
+        "(('R4', 3), ('T2', 1), 15.69241)\n"
+       ]
+      }
+     ],
+     "prompt_number": 5
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h4>Plugging preprocessings and blocking</h4>\n",
+      "\n",
+      "We can plug the preprocessings using `register_ref_normalizer()` and `register_target_normalizer`, and the blocking using `register_blocking()`. Only ONE blocking is allowed, thus you should use PipelineBlocking for multiple blockings."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import nazca.utils.normalize as nno\n",
+      "from nazca.rl import blocking as nrb\n",
+      "\n",
+      "normalizer = nno.SimplifyNormalizer(attr_index=1)\n",
+      "blocking = nrb.KdTreeBlocking(ref_attr_index=2, target_attr_index=2, threshold=0.3)\n",
+      "aligner = BaseAligner(threshold=30, processings=processings)\n",
+      "aligner.register_ref_normalizer(normalizer)\n",
+      "aligner.register_target_normalizer(normalizer)\n",
+      "aligner.register_blocking(blocking)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 8
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for pair in aligner.get_aligned_pairs(refset, targetset):\n",
+      "    print pair"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "(('R1', 0), ('T1', 0), 0)\n",
+        "(('R2', 1), ('T3', 2), 0)\n",
+        "(('R4', 3), ('T2', 1), 0)\n"
+       ]
+      }
+     ],
+     "prompt_number": 9
+    },
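+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Here is a sketch (not executed here) of combining several blockings through a `PipelineBlocking` (introduced in part 2) before registering it, reusing the geographical datasets above:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# Sketch: chain a KdTree blocking on the coordinates with a\n",
+      "# sorted-neighbourhood pass on the names, then register the pipeline\n",
+      "blocking_1 = nrb.KdTreeBlocking(ref_attr_index=2, target_attr_index=2, threshold=0.3)\n",
+      "blocking_2 = nrb.SortedNeighborhoodBlocking(ref_attr_index=1, target_attr_index=1, window_width=4)\n",
+      "aligner = BaseAligner(threshold=30, processings=processings)\n",
+      "aligner.register_blocking(nrb.PipelineBlocking((blocking_1, blocking_2)))\n",
+      "for pair in aligner.get_aligned_pairs(refset, targetset):\n",
+      "    print pair"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },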
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "An `unique` boolean could be set to False to get all the alignments and not just the one unique on the target set."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for pair in aligner.get_aligned_pairs(refset, targetset, unique=False):\n",
+      "    print pair"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "(('R1', 0), ('T3', 2), 0)\n",
+        "(('R1', 0), ('T1', 0), 0)\n",
+        "(('R2', 1), ('T3', 2), 0)\n",
+        "(('R4', 3), ('T2', 1), 0)\n"
+       ]
+      }
+     ],
+     "prompt_number": 14
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<h3>Aligner - nazca.rl.aligner</h3>\n",
+      "\n",
+      "A pipeline of aligners could be created using `PipelineAligner`."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from nazca.utils.distances import LevenshteinProcessing, GeographicalProcessing\n",
+      "from nazca.rl.aligner import PipelineAligner\n",
+      "\n",
+      "processings = (GeographicalProcessing(2, 2, units='km'),)\n",
+      "aligner_1 = BaseAligner(threshold=30, processings=processings)\n",
+      "processings = (LevenshteinProcessing(1, 1),)\n",
+      "aligner_2 = BaseAligner(threshold=1, processings=processings)\n",
+      "\n",
+      "pipeline = PipelineAligner((aligner_1, aligner_2))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 11
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "for pair in pipeline.get_aligned_pairs(refset, targetset):\n",
+      "    print pair"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "(('R1', 0), ('T1', 0))\n",
+        "(('R2', 1), ('T3', 2))\n",
+        "(('R4', 3), ('T2', 1))\n"
+       ]
+      }
+     ],
+     "prompt_number": 13
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
\ No newline at end of file