notebooks/Record linkage with Nazca - part 2 - Normalization and blockings.ipynb
author Vincent Michel <vincent.michel@logilab.fr>
Mon, 30 Jun 2014 14:11:18 +0000
changeset 453 7debe5d58a19
parent 449 80c0a2d8624a
child 456 d93286fdd149
permissions -rw-r--r--
preparing 0.6.0

{
 "metadata": {
  "name": "Record linkage with Nazca - part 2 - Normalization and blockings"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h1>Record linkage with Nazca - part 2 - Normalization and blockings</h1>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>Distances - nazca.utils.normalize</h3>\n",
      "This module provides several normalization and preprocessing functions."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import nazca.utils.normalize as nno"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>unormalize</h4>\n",
      "\n",
      "Replace diacritical characters with their corresponding ascii characters."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print nno.unormalize(u'toto')\n",
      "print nno.unormalize(u't\u00e9t\u00e9')\n",
      "print nno.unormalize(u'tut\u00e9')\n",
      "print nno.unormalize(u't\u00c9t\u00e9')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "toto\n",
        "tete\n",
        "tute\n",
        "tEte\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For non-ascii based characters, an error is raised, but you can give a substitute to avoid it."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print nno.unormalize(u'\u0392\u03af\u03ba\u03c4\u03c9\u03c1 \u039f\u1f51\u03b3\u03ba\u03ce', substitute='_')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "______ _____\n"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "There is also a function `lunormalize` that also set the sentence to lower case."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print nno.lunormalize(u't\u00c9t\u00e9')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "tete\n"
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>simplify</h4>\n",
      "\n",
      "Simply a given sentence by performing the following actions\n",
      "<ol>\n",
      "    <li>If remove_stopwords, then remove the stop words (for now only french stopwords exist)</li>\n",
      "    <li>If lemmas are given, the sentence is lemmatized</li>\n",
      "    <li>Set the sentence to lower case</li>\n",
      "    <li>Remove punctuation</li>\n",
      "</ol>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print nno.simplify(u\"\"\"\u00c9crivain. - Artiste graphiste, auteur de lavis. - \n",
      "                       Membre de l'Institut, Acad\u00e9mie fran\u00e7aise (\u00e9lu en 1841)\"\"\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\u00e9crivain  artiste graphiste auteur lavis  \n",
        "            membre l institut acad\u00e9mie fran\u00e7aise \u00e9lu 1841\n"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.data import FRENCH_LEMMAS\n",
      "print nno.simplify(u\"\"\" Victor Hugo occupe une place importante dans l'histoire \n",
      "                        des lettres fran\u00e7aises et celle du XIX si\u00e8cle,\n",
      "                        dans des genres et des domaines d'une remarquable vari\u00e9t\u00e9.\"\"\",\n",
      "                   lemmas=FRENCH_LEMMAS)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "victor hugo occuper placer important histoire lettre fran\u00e7aises celui xix si\u00e8cle genre domaine remarquable vari\u00e9t\u00e9\n"
       ]
      }
     ],
     "prompt_number": 24
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A `lemmatize` function is also available."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>tokenize</h4>\n",
      "\n",
      "Create a list of tokens from a sentence. A tokenizer object may be given, and should have a `tokenize()` method returning a list of tokens from a string."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print nno.tokenize(u\"\"\" Victor Hugo occupe une place importante dans l'histoire \n",
      "                        des lettres fran\u00e7aises et celle du XIX si\u00e8cle,\n",
      "                        dans des genres et des domaines d'une remarquable vari\u00e9t\u00e9.\"\"\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'Victor', u'Hugo', u'occupe', u'une', u'place', u'importante', u'dans', u\"l'\", u'histoire', u'des', u'lettres', u'fran\\xe7aises', u'et', u'celle', u'du', u'XIX', u'si\\xe8cle,', u'dans', u'des', u'genres', u'et', u'des', u'domaines', u\"d'\", u'une', u'remarquable', u'vari\\xe9t\\xe9.']\n"
       ]
      }
     ],
     "prompt_number": 26
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>roundstr</h4>\n",
      "\n",
      "Return an unicode string of a number rounded to a given precision in decimal digits."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print nno.roundstr('3.1415', 2)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3.14\n"
       ]
      }
     ],
     "prompt_number": 28
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>rgxformat</h4>\n",
      "\n",
      "Apply a regular expression to a string and return a formatted string with information extracted by the regular expression."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print nno.rgxformat(u'[Victor Hugo - 26 fev 1802 / 22 mai 1885]',\n",
      "                    r'\\[(?P<firstname>\\w+) (?P<lastname>\\w+) - (?P<birthdate>.*) \\/ (?P<deathdate>.*?)\\]',\n",
      "                    u'%(lastname)s, %(firstname)s (%(birthdate)s - %(deathdate)s)')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Hugo, Victor (26 fev 1802 - 22 mai 1885)\n"
       ]
      }
     ],
     "prompt_number": 31
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>BaseNormalizer - nazca.utils.normalize</h3>\n",
      "\n",
      "The class BaseNormalizer is used to hold all necessary information for a normalization of records.\n",
      "It requires:\n",
      "<ul>\n",
      "    <li>indice of the attribute to be normalized</li>\n",
      "    <li>the normalization function to be used</li>\n",
      "</ul>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The prototype of the BaseProcessing class is:\n",
      "\n",
      "      class BaseNormalizer(object):\n",
      "\n",
      "         def normalize(self, record):\n",
      "            \"\"\" Normalize a record\"\"\"\n",
      "     \n",
      "         def normalize_dataset(self, dataset, inplace=False): \n",
      "            \"\"\" Normalize a dataset\"\"\"\n",
      "\n",
      "Nazca provides a normalization class for all the normalization previously described."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "normalizer = nno.RoundNormalizer(attr_index=2)\n",
      "print normalizer.normalize(('uri', 'toto', '3.1415'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "['uri', 'toto', '3']\n"
       ]
      }
     ],
     "prompt_number": 36
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If `attr_index` is None, the whole record is passed to the callback. You can also give a list of indexes:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "normalizer = nno.RoundNormalizer(attr_index=[1, 2])\n",
      "print normalizer.normalize(('uri', '22.36', '3.1415'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "['uri', '22', '3']\n"
       ]
      }
     ],
     "prompt_number": 37
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Of course, it is possible to define your own processing, based on a specific normalization:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def iamvictorhugo(value):\n",
      "    \"\"\"Return Victor Hugo \"\"\"\n",
      "    return u'Victor Hugo'\n",
      "\n",
      "\n",
      "class VictorHugoNormalizer(nno.BaseNormalizer):\n",
      "\n",
      "    def __init__(self, attr_index=None):\n",
      "        super(VictorHugoNormalizer, self).__init__(iamvictorhugo, attr_index)\n",
      "\n",
      "normalizer = VictorHugoNormalizer(1)\n",
      "print normalizer.normalize(('http://data.bnf.fr/12249911/michel_houellebecq/', u'Michel Houellebecq'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "['http://data.bnf.fr/12249911/michel_houellebecq/', u'Victor Hugo']\n"
       ]
      }
     ],
     "prompt_number": 41
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>JoinNormalizer</h4>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The `JoinNormalizer` is used to join multiple fields in only one.\n",
      "This new field will be put at the end of the new record."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "normalizer = nno.JoinNormalizer((1,2))     \n",
      "print normalizer.normalize((1, 'ab', 'cd', 'e', 5))\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[1, 'e', 5, 'ab, cd']\n"
       ]
      }
     ],
     "prompt_number": 42
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NormalizerPipeline</h4>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finaly, it is possible to join BaseNormalizer in a pipeline, using the NormalizerPipeline.\n",
      "This pipeline takes a list of BaseNormalizer that will be applied in the given order."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First, we define a RegexpNormalizer:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "regexp = r'(?P<id>\\d+);{[\"]?(?P<firstname>.+[^\"])[\"]?};{(?P<surname>.*)};{};{};(?P<date>.*)'\n",
      "output = u'%(id)s\\t%(firstname)s\\t%(surname)s\\t%(date)s'\n",
      "n1 = nno.RegexpNormalizer(regexp, u'%(id)s\\t%(firstname)s\\t%(surname)s\\t%(date)s')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 44
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Then, we define a specific BaseNormalizer that split data on a tabulation:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "n2 = nno.BaseNormalizer(callback= lambda x: x.split('\\t'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 46
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The third normalizer is an UnicodeNormalizer:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "n3 = nno.UnicodeNormalizer(attr_index=(1, 2, 3))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 49
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Thus, we define the pipeline:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pipeline = nno.NormalizerPipeline((n1, n2, n3))\n",
      "r1 = u'1111;{\"Toto t\u00e0t\u00e0\"};{Titi};{};{};'\n",
      "print pipeline.normalize(r1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'1111', u'toto tata', u'titi', u'']\n"
       ]
      }
     ],
     "prompt_number": 50
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>Blockings - nazca.rl.blocking</h3>\n",
      "\n",
      "Blockings are used to reduce the total number of possible comparisons, by only comparing records that are roughly similar.\n",
      "\n",
      "In Nazca, the BaseBlocking class is defined as:\n",
      "\n",
      "     class BaseBlocking(object):\n",
      "\n",
      "         def __init__(self, ref_attr_index, target_attr_index):\n",
      "             \"\"\" Build the blocking object \"\"\"\n",
      "\n",
      "         def _fit(self, refset, targetset): \n",
      "             \"\"\" Internal fit\"\"\"\n",
      "             raise NotImplementedError      \n",
      "\n",
      "         def _iter_blocks(self):       \n",
      "             \"\"\" Internal iteration function over blocks\"\"\"\n",
      "             raise NotImplementedError    \n",
      "        \n",
      "         def _cleanup(self):         \n",
      "             \"\"\" Internal cleanup blocking for further use (e.g. in pipeline)\"\"\"\n",
      "             raise NotImplementedError\n",
      "\n",
      "\n",
      "The `BaseBlocking` should be build by defining the `ref_attr_index` and `target_attr_index`, which are respectively\n",
      "the index of the attribute of the reference set and target set that will be used in the blocking.\n",
      "\n",
      "\n",
      "The `fit()` method is called to the blocking technique on the reference and target datasets.\n",
      "Once the fit is done, it is possible to call the `iter_blocks()` method that will yield two listes (one\n",
      "for the reference set, and one for the targetset) of pair (index, id) of records. It is also possible to iterate on the ids of the records using `iter_id_blocks()`.\n",
      "\n",
      "Other methods provide different iterative access to the blocks.\n",
      "\n",
      "The `cleanup()` method is used to clean up the blocking (e.g. in pipeline) by reseting all the possible\n",
      "internal states."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.rl import blocking as nrb"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>KeyBlocking</h4>\n",
      "\n",
      "The KeyBlocking is based on a blocking criteria (or blocking key), that will be used to divide the datasets.\n",
      "\n",
      "The main idea here is:\n",
      "<ol>\n",
      "    <li> to create an index of f(x) for each x in the reference set</li>\n",
      "    <li> to create an index of f(y) for each y in the target set</li>\n",
      "    <li> to iterate on each distinct value of f(x) and to return the identifiers of the records of the both sets for this value.</li>\n",
      "</ol>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this example, we use the KeyBlocking jointly with the `soundexcode()` to create blocks of records with similar\n",
      "Soundex code."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from functools import partial\n",
      "from nazca.utils.distances import soundexcode\n",
      "\n",
      "refset = (('a1', 'smith'), ('a2', 'neighan'), ('a3', 'meier'),  ('a4', 'smithers'),\n",
      "          ('a5', 'nguyen'), ('a6', 'faulkner'),  ('a7', 'sandy'))\n",
      "targetset =  (('b1', 'meier'), ('b2', 'meier'),  ('b3', 'smith'),\n",
      "              ('b4', 'nguyen'), ('b5', 'fawkner'), ('b6', 'santi'), ('b7', 'cain'))\n",
      "\n",
      "blocking = nrb.KeyBlocking(ref_attr_index=1, target_attr_index=1,  \n",
      "                           callback=partial(soundexcode, language='english'))  "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "blocking.fit(refset, targetset)\n",
      "for refblock, targetblock in blocking.iter_blocks():\n",
      "    print refblock\n",
      "    for i, _id in refblock:\n",
      "        print refset[i]\n",
      "    print targetblock\n",
      "    for i, _id in targetblock:\n",
      "        print targetset[i]\n",
      "    print 10*'*'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[(0, 'a1'), (6, 'a7')]\n",
        "('a1', 'smith')\n",
        "('a7', 'sandy')\n",
        "[(2, 'b3'), (5, 'b6')]\n",
        "('b3', 'smith')\n",
        "('b6', 'santi')\n",
        "**********\n",
        "[(1, 'a2'), (4, 'a5')]\n",
        "('a2', 'neighan')\n",
        "('a5', 'nguyen')\n",
        "[(3, 'b4')]\n",
        "('b4', 'nguyen')\n",
        "**********\n",
        "[(2, 'a3')]\n",
        "('a3', 'meier')\n",
        "[(0, 'b1'), (1, 'b2')]\n",
        "('b1', 'meier')\n",
        "('b2', 'meier')\n",
        "**********\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h5>SoundexBlocking</h5>\n",
      "\n",
      "This is a specific case of KeyBlocking based on Soundex code."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "blocking = nrb.SoundexBlocking(ref_attr_index=1, target_attr_index=1)\n",
      "blocking.fit(refset, targetset)\n",
      "for refblock, targetblock in blocking.iter_blocks():\n",
      "    print refblock\n",
      "    for i, _id in refblock:\n",
      "        print refset[i]\n",
      "    print targetblock\n",
      "    for i, _id in targetblock:\n",
      "        print targetset[i]\n",
      "    print 10*'*'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[(0, 'a1'), (6, 'a7')]\n",
        "('a1', 'smith')\n",
        "('a7', 'sandy')\n",
        "[(2, 'b3'), (5, 'b6')]\n",
        "('b3', 'smith')\n",
        "('b6', 'santi')\n",
        "**********\n",
        "[(1, 'a2'), (4, 'a5')]\n",
        "('a2', 'neighan')\n",
        "('a5', 'nguyen')\n",
        "[(3, 'b4')]\n",
        "('b4', 'nguyen')\n",
        "**********\n",
        "[(2, 'a3')]\n",
        "('a3', 'meier')\n",
        "[(0, 'b1'), (1, 'b2')]\n",
        "('b1', 'meier')\n",
        "('b2', 'meier')\n",
        "**********\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NGramBlocking</h4>\n",
      "\n",
      "The NGramBlocking is based on the construction of NGrams for a given attribute.\n",
      "It will create a dictionnary with Ngrams as keys, and records as values. A depth can be given,\n",
      "to create sub-dictionnaries."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "blocking = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=2, depth=1)\n",
      "blocking.fit(refset, targetset)\n",
      "print blocking.reference_index"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "{'me': [(2, 'a3')], 'ne': [(1, 'a2')], 'ng': [(4, 'a5')], 'fa': [(5, 'a6')], 'sm': [(0, 'a1'), (3, 'a4')], 'sa': [(6, 'a7')]}\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "blocking = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=4, depth=1)\n",
      "blocking.fit(refset, targetset)\n",
      "print blocking.reference_index"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "{'meie': [(2, 'a3')], 'nguy': [(4, 'a5')], 'faul': [(5, 'a6')], 'neig': [(1, 'a2')], 'sand': [(6, 'a7')], 'smit': [(0, 'a1'), (3, 'a4')]}\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "blocking = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=2, depth=2)\n",
      "blocking.fit(refset, targetset)\n",
      "print blocking.reference_index"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "{'me': {'ie': [(2, 'a3')]}, 'ne': {'ig': [(1, 'a2')]}, 'ng': {'uy': [(4, 'a5')]}, 'fa': {'ul': [(5, 'a6')]}, 'sm': {'it': [(0, 'a1'), (3, 'a4')]}, 'sa': {'nd': [(6, 'a7')]}}\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>SortedNeighborhoodBlocking</h4>\n",
      "\n",
      "The SortedNeighborhoodBlocking is based on a sliding window of neighbours, given a function callback (aka key).\n",
      "By default, the key is the identity function."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "blocking = nrb.SortedNeighborhoodBlocking(ref_attr_index=1, target_attr_index=1, window_width=2)\n",
      "blocking.fit(refset, targetset)\n",
      "print blocking.sorted_dataset"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[((6, 'b7'), 'cain', 1), ((5, 'a6'), 'faulkner', 0), ((4, 'b5'), 'fawkner', 1), ((2, 'a3'), 'meier', 0), ((0, 'b1'), 'meier', 1), ((1, 'b2'), 'meier', 1), ((1, 'a2'), 'neighan', 0), ((4, 'a5'), 'nguyen', 0), ((3, 'b4'), 'nguyen', 1), ((6, 'a7'), 'sandy', 0), ((5, 'b6'), 'santi', 1), ((0, 'a1'), 'smith', 0), ((2, 'b3'), 'smith', 1), ((3, 'a4'), 'smithers', 0)]\n"
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for refblock, targetblock in blocking.iter_blocks():\n",
      "    print refblock\n",
      "    for i, _id in refblock:\n",
      "        print refset[i]\n",
      "    print targetblock\n",
      "    for i, _id in targetblock:\n",
      "        print targetset[i]\n",
      "    print 10*'*'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[(5, 'a6')]\n",
        "('a6', 'faulkner')\n",
        "[(6, 'b7'), (4, 'b5')]\n",
        "('b7', 'cain')\n",
        "('b5', 'fawkner')\n",
        "**********\n",
        "[(2, 'a3')]\n",
        "('a3', 'meier')\n",
        "[(4, 'b5'), (0, 'b1'), (1, 'b2')]\n",
        "('b5', 'fawkner')\n",
        "('b1', 'meier')\n",
        "('b2', 'meier')\n",
        "**********\n",
        "[(1, 'a2')]\n",
        "('a2', 'neighan')\n",
        "[(0, 'b1'), (1, 'b2'), (3, 'b4')]\n",
        "('b1', 'meier')\n",
        "('b2', 'meier')\n",
        "('b4', 'nguyen')\n",
        "**********\n",
        "[(4, 'a5')]\n",
        "('a5', 'nguyen')\n",
        "[(1, 'b2'), (3, 'b4')]\n",
        "('b2', 'meier')\n",
        "('b4', 'nguyen')\n",
        "**********\n",
        "[(6, 'a7')]\n",
        "('a7', 'sandy')\n",
        "[(3, 'b4'), (5, 'b6')]\n",
        "('b4', 'nguyen')\n",
        "('b6', 'santi')\n",
        "**********\n",
        "[(0, 'a1')]\n",
        "('a1', 'smith')\n",
        "[(5, 'b6'), (2, 'b3')]\n",
        "('b6', 'santi')\n",
        "('b3', 'smith')\n",
        "**********\n",
        "[(3, 'a4')]\n",
        "('a4', 'smithers')\n",
        "[(2, 'b3')]\n",
        "('b3', 'smith')\n",
        "**********\n"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>MergeBlocking</h4>\n",
      "\n",
      "This blocking technique keep only one appearance of one given values, and removes all the other records having this value. The merge is based on a score function, and the record with the higher value is kept.\n",
      "\n",
      "This is only done on ONE set (the one with a non null attrindex)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "refset = [('http://fr.wikipedia.org/wiki/Paris_%28Texas%29', 'Paris', 25898), \n",
      "          ('http://fr.wikipedia.org/wiki/Paris', 'Paris', 12223100),        \n",
      "          ('http://fr.wikipedia.org/wiki/Saint-Malo', 'Saint-Malo', 46342)] \n",
      "targetset = [('Paris (Texas)', 25000),\n",
      "             ('Paris (France)', 12000000)]\n",
      "blocking = nrb.MergeBlocking(ref_attr_index=1, target_attr_index=None, score_func=lambda x:x[2])\n",
      "blocking.fit(refset, targetset)\n",
      "print blocking.merged_dataset"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[(1, 'http://fr.wikipedia.org/wiki/Paris'), (2, 'http://fr.wikipedia.org/wiki/Saint-Malo')]\n"
       ]
      }
     ],
     "prompt_number": 28
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>KmeansBlocking</h4>\n",
      "\n",
      "This blocking technique is based on <a href=\"http://en.wikipedia.org/wiki/K-means_clustering\">Kmeans clustering</a>.\n",
      "It is based on the <a href=\"http://scikit-learn.org/stable/modules/clustering.html\">Scikit-learn</a> implementation.\n",
      "If the number of clusters is not defined, it will take the size of the reference set divided by 10 or, if the reference set is to small, the size of the reference set divided by 2."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "refset = [['R1', 'ref1', (6.14194444444, 48.67)],\n",
      "          ['R2', 'ref2', (6.2, 49)],\n",
      "          ['R3', 'ref3', (5.1, 48)],\n",
      "          ['R4', 'ref4', (5.2, 48.1)]]         \n",
      "targetset = [['T1', 'target1', (6.2, 48.9)],        \n",
      "             ['T2', 'target2', (5.3, 48.2)], \n",
      "             ['T3', 'target3', (6.25, 48.91)]]\n",
      "blocking = nrb.KmeansBlocking(ref_attr_index=2, target_attr_index=2, n_clusters=2)\n",
      "blocking.fit(refset, targetset)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for refblock, targetblock in blocking.iter_blocks():\n",
      "    print refblock\n",
      "    for i, _id in refblock:\n",
      "        print refset[i]\n",
      "    print targetblock\n",
      "    for i, _id in targetblock:\n",
      "        print targetset[i]\n",
      "    print 10*'*'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[(2, 'R3'), (3, 'R4')]\n",
        "['R3', 'ref3', (5.1, 48)]\n",
        "['R4', 'ref4', (5.2, 48.1)]\n",
        "[(1, 'T2')]\n",
        "['T2', 'target2', (5.3, 48.2)]\n",
        "**********\n",
        "[(0, 'R1'), (1, 'R2')]\n",
        "['R1', 'ref1', (6.14194444444, 48.67)]\n",
        "['R2', 'ref2', (6.2, 49)]\n",
        "[(0, 'T1'), (2, 'T3')]\n",
        "['T1', 'target1', (6.2, 48.9)]\n",
        "['T3', 'target3', (6.25, 48.91)]\n",
        "**********\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>KdTreeBlocking</h4>\n",
      "\n",
      "This blocking technique is based on <a href=\"http://en.wikipedia.org/wiki/K-d_tree\">KdTree</a>.\n",
      "Kdtrees are used to partition k-dimensional space.\n",
      "\n",
      "It is efficient as creating clusters of numerical values (with small k)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "blocking = nrb.KdTreeBlocking(ref_attr_index=2, target_attr_index=2, threshold=0.3)\n",
      "blocking.fit(refset, targetset)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for refblock, targetblock in blocking.iter_blocks():\n",
      "    print refblock\n",
      "    for i, _id in refblock:\n",
      "        print refset[i]\n",
      "    print targetblock\n",
      "    for i, _id in targetblock:\n",
      "        print targetset[i]\n",
      "    print 10*'*'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[(0, 'R1')]\n",
        "['R1', 'ref1', (6.14194444444, 48.67)]\n",
        "[(0, 'T1'), (2, 'T3')]\n",
        "['T1', 'target1', (6.2, 48.9)]\n",
        "['T3', 'target3', (6.25, 48.91)]\n",
        "**********\n",
        "[(1, 'R2')]\n",
        "['R2', 'ref2', (6.2, 49)]\n",
        "[(0, 'T1'), (2, 'T3')]\n",
        "['T1', 'target1', (6.2, 48.9)]\n",
        "['T3', 'target3', (6.25, 48.91)]\n",
        "**********\n",
        "[(2, 'R3')]\n",
        "['R3', 'ref3', (5.1, 48)]\n",
        "[(1, 'T2')]\n",
        "['T2', 'target2', (5.3, 48.2)]\n",
        "**********\n",
        "[(3, 'R4')]\n",
        "['R4', 'ref4', (5.2, 48.1)]\n",
        "[(1, 'T2')]\n",
        "['T2', 'target2', (5.3, 48.2)]\n",
        "**********\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>MinHashingBlocking</h4>\n",
      "\n",
      "This blocking technique is based on <a href=\"http://en.wikipedia.org/wiki/MinHash\">MinHash</a> algorithm.\n",
      "It is very efficient as creating clusters of textual values."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.data import FRENCH_LEMMAS\n",
      "from nazca.utils.normalize import SimplifyNormalizer\n",
      "refset = [['R1', 'ref1', u\"Un nuage flotta dans le grand ciel bleu.\"],  \n",
      "          ['R2', 'ref2', u\"Pour quelle occasion vous \u00eates-vous appr\u00eat\u00e9e\u202f?\"],  \n",
      "          ['R3', 'ref3', u\"Je les vis ensemble \u00e0 plusieurs occasions.\"],      \n",
      "          ['R4', 'ref4', u\"Je n'aime pas ce genre de bandes dessin\u00e9es tristes.\"], \n",
      "          ['R5', 'ref5', u\"Ensemble et \u00e0 plusieurs occasions, je les vis\"]]        \n",
      "targetset = [['T1', 'target1', u\"Des grands nuages noirs flottent dans le ciel.\"], \n",
      "             ['T2', 'target2', u\"Je les ai vus ensemble \u00e0 plusieurs occasions.\"],  \n",
      "             ['T3', 'target3', u\"J'aime les bandes dessin\u00e9es de genre comiques.\"]]\n",
      "normalizer = SimplifyNormalizer(attr_index=2, lemmas=FRENCH_LEMMAS)\n",
      "refset = normalizer.normalize_dataset(refset)\n",
      "targetset = normalizer.normalize_dataset(targetset)\n",
      "blocking = nrb.MinHashingBlocking(threshold=0.4, ref_attr_index=2, target_attr_index=2) \n",
      "blocking.fit(refset, targetset)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 22
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for refblock, targetblock in blocking.iter_blocks():\n",
      "    print refblock\n",
      "    for i, _id in refblock:\n",
      "        print refset[i]\n",
      "    print targetblock\n",
      "    for i, _id in targetblock:\n",
      "        print targetset[i]\n",
      "    print 10*'*'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[(0, 'R1')]\n",
        "['R1', 'ref1', u'nuage flotter grand ciel bleu']\n",
        "[(0, 'T1')]\n",
        "['T1', 'target1', u'grand nuage noir flotter ciel']\n",
        "**********\n",
        "[(3, 'R4')]\n",
        "['R4', 'ref4', u'aimer genre bande dessin\\xe9es tristes']\n",
        "[(2, 'T3')]\n",
        "['T3', 'target3', u'aimer bande dessin\\xe9es genre comiques']\n",
        "**********\n",
        "[(2, 'R3'), (4, 'R5')]\n",
        "['R3', 'ref3', u'vis ensemble \\xe0 plusieurs occasions']\n",
        "['R5', 'ref5', u'ensemble \\xe0 plusieurs occasions vis']\n",
        "[(1, 'T2')]\n",
        "['T2', 'target2', u'voir ensemble \\xe0 plusieurs occasions']\n",
        "**********\n"
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>PipelineBlocking - nazca.rl.blocking</h3>\n",
      "\n",
      "Nazca also provides a PipelineBlocking for pipelining multiple blockings within an alignment."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "refset = [['1', 'aabb', 'ccdd'],                   \n",
      "          ['2', 'aabb', 'ddcc'],                \n",
      "          ['3', 'ccdd', 'aabb'],                \n",
      "          ['4', 'ccdd', 'bbaa']]         \n",
      "targetset = [['a', 'aabb', 'ccdd'],  \n",
      "             ['b', 'aabb', 'ddcc'],\n",
      "             ['c', 'ccdd', 'aabb'],\n",
      "             ['d', 'ccdd', 'bbaa']]\n",
      "blocking_1 = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=2, depth=1)  \n",
      "blocking_2 = nrb.NGramBlocking(ref_attr_index=2, target_attr_index=2, ngram_size=2, depth=1)\n",
      "blocking = nrb.PipelineBlocking((blocking_1, blocking_2))         \n",
      "blocking.fit(refset, targetset)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 26
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for refblock, targetblock in blocking.iter_blocks():\n",
      "    print refblock\n",
      "    for i, _id in refblock:\n",
      "        print refset[i]\n",
      "    print targetblock\n",
      "    for i, _id in targetblock:\n",
      "        print targetset[i]\n",
      "    print 10*'*'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[(0, '1')]\n",
        "['1', 'aabb', 'ccdd']\n",
        "[(0, 'a')]\n",
        "['a', 'aabb', 'ccdd']\n",
        "**********\n",
        "[(1, '2')]\n",
        "['2', 'aabb', 'ddcc']\n",
        "[(1, 'b')]\n",
        "['b', 'aabb', 'ddcc']\n",
        "**********\n",
        "[(2, '3')]\n",
        "['3', 'ccdd', 'aabb']\n",
        "[(2, 'c')]\n",
        "['c', 'ccdd', 'aabb']\n",
        "**********\n",
        "[(3, '4')]\n",
        "['4', 'ccdd', 'bbaa']\n",
        "[(3, 'd')]\n",
        "['d', 'ccdd', 'bbaa']\n",
        "**********\n"
       ]
      }
     ],
     "prompt_number": 27
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It is also possible to collect stats on the different sub-blockings using the option `collect_stats=True`, which gives the lengthts of the different blocks across the different sub-blockings."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "blocking_1 = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=2, depth=1)  \n",
      "blocking_2 = nrb.NGramBlocking(ref_attr_index=2, target_attr_index=2, ngram_size=2, depth=1)\n",
      "blocking = nrb.PipelineBlocking((blocking_1, blocking_2), collect_stats=True)         \n",
      "blocking.fit(refset, targetset)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 28
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print blocking.stats"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "{0: [(2, 2), (2, 2)], 1: [(1, 1), (1, 1), (1, 1), (1, 1)]}\n"
       ]
      }
     ],
     "prompt_number": 29
    }
   ],
   "metadata": {}
  }
 ]
}