notebooks/Record linkage with Nazca - Example Dbpedia - INSEE.ipynb
author Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
Mon, 20 Jul 2015 16:11:02 +0200
changeset 504 20db1f91ed2e
parent 458 9527d4b3d381
permissions -rw-r--r--
[debian] use dh_python2

{
 "metadata": {
  "name": "Record linkage with Nazca - Example Dbpedia - INSEE"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h1>Record linkage with Nazca - Example Dbpedia - INSEE</h1>\n",
      "\n",
      "\n",
      "This IPython notebook show some features of the Python Nazca library :\n",
      "<ul>\n",
      "    <li> website : <a href=\"http://www.logilab.org/project/nazca\">http://www.logilab.org/project/nazca</a></li>\n",
      "    <li> source : <a href=\"http://hg.logilab.org/review/nazca\">http://hg.logilab.org/review/nazca</a></li>\n",
      "</ul>\n",
      "<ul>\n",
      "    <li> original notebook : <a href=\"http://hg.logilab.org/review/nazca/raw-file/cdc7992b78be/notebooks/Record%20linkage%20with%20Nazca%20-%20Example%20Dbpedia%20-%20INSEE.ipynb\">here !</a></li>\n",
      "    <li> date: 2014-07-01</li>\n",
      "    <li> author: Vincent Michel  (<it>vincent.michel@logilab.fr</it>, \n",
      "                                  <it>vm.michel@gmail.com</it>) @HowIMetYourData</li>\n",
      "<ul>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import nazca.utils.dataio as nio\n",
      "import nazca.utils.distances as ndi\n",
      "import nazca.utils.normalize as nun\n",
      "import nazca.rl.blocking as nbl\n",
      "import nazca.rl.aligner as nal"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>1 - Datasets creation</h3>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First, we have to create both reference set and target set."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We get all the couples (URI, insee code) from Dbpedia data"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "refset = nio.sparqlquery('http://demo.cubicweb.org/sparql',\n",
      "                         '''PREFIX dbonto:<http://dbpedia.org/ontology/>\n",
      "                            SELECT ?p ?n ?c WHERE {?p a dbonto:PopulatedPlace.\n",
      "                                                   ?p dbonto:country dbpedia:France.\n",
      "                                                   ?p foaf:name ?n.\n",
      "                                                   ?p dbpprop:insee ?c}''',\n",
      "                         autocast_data=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print len(refset)\n",
      "print refset[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3636\n",
        "[u'http://dbpedia.org/resource/Ajaccio', u'Ajaccio', 2]\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We get all the couples (URI, insee code) from INSEE data"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "targetset = nio.sparqlquery('http://rdf.insee.fr/sparql',\n",
      "                            '''PREFIX igeo:<http://rdf.insee.fr/def/geo#>\n",
      "                               SELECT ?commune ?nom ?code WHERE {?commune igeo:codeCommune ?code.\n",
      "                                                                 ?commune igeo:nom ?nom}''',\n",
      "                            autocast_data=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print len(targetset)\n",
      "print targetset[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "36700\n",
        "[u'http://id.insee.fr/geo/commune/64374', u'Mazerolles', 64374]\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>Definition of the distance functions and the Processing</h3>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We use a distance based on difflib, where distance(a, b) == 0 iif a==b"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "processing = ndi.DifflibProcessing(ref_attr_index=1, target_attr_index=1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print refset[0], targetset[0]\n",
      "print processing.distance(refset[0], targetset[0])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'http://dbpedia.org/resource/Ajaccio', u'Ajaccio', 2] [u'http://id.insee.fr/geo/commune/64374', u'Mazerolles', 64374]\n",
        "0.764705882353\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>Preprocessings and normalization</h3>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We define now a preprocessing step to normalize the string values of each record."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "normalizer = nun.SimplifyNormalizer(attr_index=1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The simplify normalizer is based on the simplify() function that:\n",
      "<ol>\n",
      "    <li> Remove the stopwords;</li>\n",
      "    <li> Lemmatized the sentence;</li>\n",
      "    <li> Set the sentence to lower case;</li>\n",
      "    <li> Remove punctuation;</li>\n",
      "</ol>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print normalizer.normalize(refset[0])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'http://dbpedia.org/resource/Ajaccio', u'ajaccio', 2]\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print normalizer.normalize(targetset[0])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'http://id.insee.fr/geo/commune/64374', u'mazerolles', 64374]\n"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>Blockings</h3>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Blockings act similarly to divide-and-conquer approaches. They create subsets of both datasets (blocks) that will be compared, rather than making all the possible comparisons."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We create a simple NGramBlocking that will create subsets of records by looking at the first N characters of each values.In our case, we choose only the first 2 characters, with a depth of one (i.e. we don't do a recursive blocking)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "blocking = nbl.NGramBlocking(1, 1, ngram_size=5, depth=1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The blocking is 'fit' on both refset and targetset and then applied to iterate blocks"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "blocking.fit(refset, targetset)\n",
      "blocks = list(blocking._iter_blocks())\n",
      "print blocks[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "([(1722, u'http://dbpedia.org/resource/Redon,_Ille-et-Vilaine')], [(17045, u'http://id.insee.fr/geo/commune/35236')])\n"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Cleanup the blocking for now as it may have stored some internal data"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "blocking.cleanup()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 16
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>Define Aligner</h3>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finaly, we can create the Aligner object that will perform the whole alignment processing.\n",
      "We set the threshold to 0.1 and we only have one processing"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "aligner = nal.BaseAligner(threshold=0.1, processings=(processing,))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 17
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Thus, we register the normalizer and the blocking"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "aligner.register_ref_normalizer(normalizer)\n",
      "aligner.register_target_normalizer(normalizer)\n",
      "aligner.register_blocking(blocking)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The aligner has a get_aligned_pairs() function that will yield the comparisons that are below the threshold:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pairs = list(aligner.get_aligned_pairs(refset[:1000], targetset[:1000]))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 19
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print len(pairs)\n",
      "for p in pairs[:5]:\n",
      "    print p"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "24\n",
        "((u'http://dbpedia.org/resource/Calvi,_Haute-Corse', 8), (u'http://id.insee.fr/geo/commune/2B050', 268), 1e-10)\n",
        "((u'http://dbpedia.org/resource/Livry,_Calvados', 910), (u'http://id.insee.fr/geo/commune/14372', 117), 1e-10)\n",
        "((u'http://dbpedia.org/resource/Saint-Jean-sur-Veyle', 272), (u'http://id.insee.fr/geo/commune/01365', 301), 1e-10)\n",
        "((u'http://dbpedia.org/resource/Saint-Just,_Ain', 273), (u'http://id.insee.fr/geo/commune/35285', 146), 1e-10)\n",
        "((u'http://dbpedia.org/resource/Aix-en-Provence', 530), (u'http://id.insee.fr/geo/commune/13001', 669), 1e-10)\n"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Each pair has the following structure: ((id in refset, indice in refset), (id in targset, indice in targetset), distance)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Introduction to Nazca - Using pipeline</h2>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It could be interesting to pipeline the blockings, i.e. use a raw blocking technique with a good recall (i.e. not too conservative) and that could have a bad precision (i.e. not too precise), but that is fast.\n",
      "\n",
      "Then, we can use a more time consuming but more precise blocking on each block."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "blocking_1 = nbl.NGramBlocking(1, 1, ngram_size=3, depth=1)\n",
      "blocking_2 = nbl.MinHashingBlocking(1, 1)\n",
      "blocking = nbl.PipelineBlocking((blocking_1, blocking_2),collect_stats=True)\n",
      "aligner.register_blocking(blocking)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 21
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pairs = list(aligner.get_aligned_pairs(refset, targetset))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 22
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print len(pairs)\n",
      "for p in pairs[:5]:\n",
      "    print p"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3408\n",
        "((u'http://dbpedia.org/resource/Ajaccio', 0), (u'http://id.insee.fr/geo/commune/2A004', 18455), 1e-10)\n",
        "((u'http://dbpedia.org/resource/Bastia', 2), (u'http://id.insee.fr/geo/commune/2B033', 2742), 1e-10)\n",
        "((u'http://dbpedia.org/resource/Sart%C3%A8ne', 3), (u'http://id.insee.fr/geo/commune/2A272', 15569), 1e-10)\n",
        "((u'http://dbpedia.org/resource/Corte', 4), (u'http://id.insee.fr/geo/commune/2B096', 2993), 1e-10)\n",
        "((u'http://dbpedia.org/resource/Bonifacio,_Corse-du-Sud', 6), (u'http://id.insee.fr/geo/commune/2A041', 20886), 1e-10)\n"
       ]
      }
     ],
     "prompt_number": 23
    }
   ],
   "metadata": {}
  }
 ]
}