{
 "metadata": {
  "name": "Named Entities Matching with Nazca"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h1>Named Entities Matching with Nazca</h1>\n",
      "\n",
      "\n",
      "This IPython notebook show some features of the Python Nazca library :\n",
      "<ul>\n",
      "    <li> website : <a href=\"http://www.logilab.org/project/nazca\">http://www.logilab.org/project/nazca</a></li>\n",
      "    <li> source : <a href=\"http://hg.logilab.org/review/nazca\">http://hg.logilab.org/review/nazca</a></li>\n",
      "</ul>\n",
      "<ul>\n",
      "    <li> original notebook : <a href=\"http://hg.logilab.org/review/nazca/raw-file/cdc7992b78be/notebooks/Named%20Entities%20Matching%20with%20Nazca.ipynb\">here !</a></li>\n",
      "    <li> date: 2014-07-01</li>\n",
      "    <li> author: Vincent Michel  (<it>vincent.michel@logilab.fr</it>, \n",
      "                                  <it>vm.michel@gmail.com</it>) @HowIMetYourData</li>\n",
      "<ul>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Named Entities Matching is the process of recognizing elements in a text and matching it\n",
      "    to different types (e.g. Person, Organization, Place). This may be related to Record Linkage (for linking entities from a text corpus and from a reference corpus) and to Named Entities Recognition."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.display import HTML\n",
      "HTML('<iframe src=http://en.mobile.wikipedia.org/wiki/Named-entity_recognition?useformat=mobile width=700 height=350></iframe>')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<iframe src=http://en.mobile.wikipedia.org/wiki/Named-entity_recognition?useformat=mobile width=700 height=350></iframe>"
       ],
       "output_type": "pyout",
       "prompt_number": 1,
       "text": [
        "<IPython.core.display.HTML at 0x1ceb950>"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In Nazca, we provide tools to match named entities to existing corpus or reference database.\n",
      "This may be used for contextualizing your data, e.g. by recognizing in them elements from Dbpedia."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Token, Sentence and Tokenizer - nazca.utils.tokenizer</h2>\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A `Sentence` is:\n",
      "<ul>\n",
      "    <li>the `start` indice in the text</li>\n",
      "    <li>the `end` indice in the text</li>\n",
      "    <li>the `indice` of the sentence amonst all the other sentences</li>\n",
      "</ul>    "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.utils.tokenizer import Sentence\n",
      "\n",
      "sentence = Sentence(indice=0, start=0, end=38)\n",
      "print sentence"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Sentence(indice=0, start=0, end=38)\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A `Token` is:\n",
      "<ul>\n",
      "    <li>a `word` or part of a sentence</li>\n",
      "    <li>the `start` indice in the sentence</li>\n",
      "    <li>the `end` indice in the sentence</li>\n",
      "    <li>the `sentence` which contains the Token</li>\n",
      "</ul>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.utils.tokenizer import Token\n",
      "\n",
      "token = Token(word='Hello everyone this', start=0, end=20, sentence=Sentence(indice=0, start=0, end=38))\n",
      "print token"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Token(word='Hello everyone this', start=0, end=20, sentence=Sentence(indice=0, start=0, end=38))\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Nazca provides a `Tokenizer` that creates tokens from a text:\n",
      "\n",
      "\n",
      "    class RichStringTokenizer(object):\n",
      "\n",
      "        def iter_tokens(self, text):\n",
      "            \"\"\" Iterate tokens over a text \"\"\"\n",
      "\n",
      "        def find_sentences(self, text):\n",
      "            \"\"\" Find the sentences \"\"\"\n",
      "\n",
      "        def load_text(self, text):\n",
      "            \"\"\" Load the text to be tokenized \"\"\"\n",
      "\n",
      "        def __iter__(self):\n",
      "            \"\"\" Iterator over the text given in the object instantiation \"\"\"\n",
      "\n",
      "\n",
      "The tokenizer may be initialized with a minimum size and maximum size for the tokens. It will find all the sentences from a text using `find_sentences` and will create tokens in an iterative way."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.utils.tokenizer import RichStringTokenizer\n",
      "\n",
      "text = 'Hello everyone, this is   me speaking. And me !Why not me ? Blup'\n",
      "tokenizer = RichStringTokenizer(text, token_min_size=1, token_max_size=4)\n",
      "sentences = tokenizer.find_sentences(text)\n",
      "\n",
      "print sentences"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[Sentence(indice=0, start=0, end=38), Sentence(indice=1, start=38, end=47), Sentence(indice=2, start=47, end=59), Sentence(indice=3, start=59, end=64)]\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for token in tokenizer:\n",
      "    print token"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Token(word='Hello everyone this is', start=0, end=23, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='Hello everyone this', start=0, end=20, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='Hello everyone', start=0, end=14, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='Hello', start=0, end=5, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='everyone this is me', start=6, end=28, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='everyone this is', start=6, end=23, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='everyone this', start=6, end=20, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='this is me speaking', start=16, end=37, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='this is me', start=16, end=28, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='this is', start=16, end=23, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='this', start=16, end=20, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='is me speaking', start=21, end=37, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='is me', start=21, end=28, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='is', start=21, end=23, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='me speaking', start=26, end=37, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='speaking', start=29, end=37, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='And me Why', start=39, end=50, sentence=Sentence(indice=1, start=38, end=47))\n",
        "Token(word='And me', start=39, end=45, sentence=Sentence(indice=1, start=38, end=47))\n",
        "Token(word='And', start=39, end=42, sentence=Sentence(indice=1, start=38, end=47))\n",
        "Token(word='me Why', start=43, end=50, sentence=Sentence(indice=1, start=38, end=47))\n",
        "Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=47))\n",
        "Token(word='Why not me', start=47, end=57, sentence=Sentence(indice=2, start=47, end=59))\n",
        "Token(word='Why not', start=47, end=54, sentence=Sentence(indice=2, start=47, end=59))\n",
        "Token(word='Why', start=47, end=50, sentence=Sentence(indice=2, start=47, end=59))\n",
        "Token(word='not me', start=51, end=57, sentence=Sentence(indice=2, start=47, end=59))\n",
        "Token(word='not', start=51, end=54, sentence=Sentence(indice=2, start=47, end=59))\n",
        "Token(word='me', start=55, end=57, sentence=Sentence(indice=2, start=47, end=59))\n",
        "Token(word='Blup', start=60, end=64, sentence=Sentence(indice=3, start=59, end=64))\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Defining a source of entities - nazca.ner.sources</h2>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First, we should define the source of entities that should be retrieved in the tokens yield by the tokenizer.\n",
      "\n",
      "A source herits from `AbstractNerSource`:\n",
      "\n",
      "\n",
      "    class AbstractNerSource(object):\n",
      "        \"\"\" High-level source for Named Entities Recognition \"\"\"\n",
      "\n",
      "        def __init__(self, endpoint, query, name=None, use_cache=True, preprocessors=None):\n",
      "        \"\"\" Initialise the class.\"\"\"\n",
      "\n",
      "        def add_preprocessors(self, preprocessor):\n",
      "            \"\"\" Add a preprocessor \"\"\"\n",
      "\n",
      "        def recognize_token(self, token):\n",
      "            \"\"\" Recognize a token \"\"\"\n",
      "\n",
      "        def query_word(self, word):\n",
      "            \"\"\" Query a word for a Named Entities Recognition process \"\"\"\n",
      "\n",
      "\n",
      "It requires an `endpoint` and a `query`. And may recognize a word directly using `query_word`, or a token using `recognize_token`. In both cases, it returns a list of URIs/identifiers."
     ]
    },
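    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a sketch of this interface, here is a minimal custom source backed by a plain Python mapping (a hypothetical `NerSourceInMemory`; we assume `AbstractNerSource` is exposed by `nazca.ner.sources`, and the built-in sources below cover the common cases):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.sources import AbstractNerSource\n",
      "\n",
      "class NerSourceInMemory(AbstractNerSource):\n",
      "    \"\"\" Hypothetical source backed by an in-memory mapping \"\"\"\n",
      "\n",
      "    def __init__(self, mapping, name=None):\n",
      "        super(NerSourceInMemory, self).__init__(endpoint=None, query=None, name=name)\n",
      "        self.mapping = mapping\n",
      "\n",
      "    def query_word(self, word):\n",
      "        # return the list of URIs/identifiers known for this word\n",
      "        return self.mapping.get(word, [])\n",
      "\n",
      "source = NerSourceInMemory({'Victor Hugo': ['http://example.com/hugo']})\n",
      "print source.query_word('Victor Hugo')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },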
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>Sparql source</h4>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.sources import NerSourceSparql\n",
      "ner_source = NerSourceSparql('http://dbpedia.org/sparql',\n",
      "                             '''SELECT distinct ?uri\n",
      "                                WHERE{?uri rdfs:label \"%(word)s\"@en}''')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print ner_source.query_word('Victor Hugo')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'http://dbpedia.org/resource/Victor_Hugo', u'http://dbpedia.org/resource/Category:Victor_Hugo', u'http://sw.opencyc.org/2008/06/10/concept/en/VictorHugo', u'http://sw.opencyc.org/2008/06/10/concept/Mx4rve1ZXJwpEbGdrcN5Y29ycA', u'http://wikidata.dbpedia.org/resource/Q1459231', u'http://wikidata.dbpedia.org/resource/Q535', u'http://wikidata.dbpedia.org/resource/Q3557368']\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Of course, you can use a more complex query reflecting restrictions and business logic"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.sources import NerSourceSparql\n",
      "ner_source = NerSourceSparql('http://dbpedia.org/sparql',\n",
      "                              '''SELECT distinct ?uri\n",
      "                                 WHERE{?uri rdfs:label \"%(word)s\"@en .\n",
      "                                       ?p foaf:primaryTopic ?uri}''')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print ner_source.query_word('Victor Hugo')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'http://dbpedia.org/resource/Victor_Hugo']\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You may also used `use_cache` to True to keep all the previous results."
     ]
    },
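    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For instance (a minimal sketch; per the `AbstractNerSource` signature above, `use_cache` defaults to True):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner_source = NerSourceSparql('http://dbpedia.org/sparql',\n",
      "                             '''SELECT distinct ?uri\n",
      "                                WHERE{?uri rdfs:label \"%(word)s\"@en .\n",
      "                                      ?p foaf:primaryTopic ?uri}''',\n",
      "                             use_cache=True)\n",
      "# the second lookup of the same word may be served from the cache\n",
      "print ner_source.query_word('Victor Hugo')\n",
      "print ner_source.query_word('Victor Hugo')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },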
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>Rql source</h4>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.sources import NerSourceRql\n",
      "\n",
      "ner_source = NerSourceRql('http://www.cubicweb.org',\n",
      "                          'Any U WHERE X cwuri U, X name \"%(word)s\"')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print ner_source.query_word('apycot')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'http://www.cubicweb.org/1310453', u'http://www.cubicweb.org/749162']\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>Lexicon source</h4>\n",
      "\n",
      "A lexicon source is a source based on a dictionnary"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.sources import NerSourceLexicon\n",
      "\n",
      "\n",
      "lexicon = {'everyone': 'http://example.com/everyone',\n",
      "           'me': 'http://example.com/me'}\n",
      "ner_source = NerSourceLexicon(lexicon)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print ner_source.query_word('me')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "['http://example.com/me']\n"
       ]
      }
     ],
     "prompt_number": 13
    },
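    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A source may also recognize a full `Token` directly via `recognize_token` (a short illustration with the lexicon source above):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.utils.tokenizer import Token, Sentence\n",
      "\n",
      "token = Token('me', 26, 28, Sentence(indice=0, start=0, end=38))\n",
      "print ner_source.recognize_token(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },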
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Defining preprocessors - nazca.ner.preprocessors</h2>\n",
      "\n",
      "Preprocessors are used to cleanup/filter a token before recognizing it in a source.\n",
      "All preprocessors herit from `AbstractNerPreprocessor`\n",
      "\n",
      "    class AbstractNerPreprocessor(object):\n",
      "        \"\"\" Preprocessor \"\"\"\n",
      "\n",
      "        def __call__(self, token):\n",
      "            raise NotImplementedError\n",
      "\n",
      "\n",
      "A token which is None after being returned by a preprocessor is not recognized by the source."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import nazca.ner.preprocessors as nnp"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 14
    },
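    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a sketch of this interface, here is a hypothetical `NerDigitFilterPreprocessor` (we assume `AbstractNerPreprocessor` is exposed by `nazca.ner.preprocessors`):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.utils.tokenizer import Token\n",
      "\n",
      "class NerDigitFilterPreprocessor(nnp.AbstractNerPreprocessor):\n",
      "    \"\"\" Hypothetical preprocessor: drop tokens containing a digit \"\"\"\n",
      "\n",
      "    def __call__(self, token):\n",
      "        if any(c.isdigit() for c in token.word):\n",
      "            return None\n",
      "        return token\n",
      "\n",
      "preprocessor = NerDigitFilterPreprocessor()\n",
      "token = Token('Area 51', 0, 7, None)\n",
      "print token.word, '-->', preprocessor(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },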
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerWordSizeFilterPreprocessor</h4>\n",
      "\n",
      "This preprocessor remove token based on the size of the word."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preprocessor = nnp.NerWordSizeFilterPreprocessor(min_size=2, max_size=4)\n",
      "token = Token('toto', 0, 4, None)\n",
      "print token, '-->', preprocessor(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Token(word='toto', start=0, end=4, sentence=None) --> Token(word='toto', start=0, end=4, sentence=None)\n"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "token = Token('t', 0, 4, None)\n",
      "print token, '-->', preprocessor(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Token(word='t', start=0, end=4, sentence=None) --> None\n"
       ]
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "token = Token('tototata', 0, 4, None)\n",
      "print token, '-->', preprocessor(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Token(word='tototata', start=0, end=4, sentence=None) --> None\n"
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerLowerCaseFilterPreprocessor</h4>\n",
      "\n",
      "This preprocessor remove token in lower case"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preprocessor = nnp.NerLowerCaseFilterPreprocessor()\n",
      "token = Token('Toto', 0, 4, None)\n",
      "print token, '-->', preprocessor(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Token(word='Toto', start=0, end=4, sentence=None) --> Token(word='Toto', start=0, end=4, sentence=None)\n"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "token = Token('toto', 0, 4, None)\n",
      "print token, '-->', preprocessor(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Token(word='toto', start=0, end=4, sentence=None) --> None\n"
       ]
      }
     ],
     "prompt_number": 19
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerLowerFirstWordPreprocessor</h4>\n",
      "\n",
      "This preprocessor lower the first word of each sentence if it is a stopword."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preprocessor = nnp.NerLowerFirstWordPreprocessor()\n",
      "sentence = Sentence(0, 0, 20)\n",
      "token = Token('Toto tata', 0, 4, sentence)\n",
      "print token.word, '-->', preprocessor(token).word"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Toto tata --> Toto tata\n"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "token = Token('The tata', 0, 4, sentence)\n",
      "print token.word, '-->', preprocessor(token).word"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "The tata --> the tata\n"
       ]
      }
     ],
     "prompt_number": 21
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "token = Token('Tata The', 0, 4, sentence)\n",
      "print token.word, '-->', preprocessor(token).word"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Tata The --> Tata The\n"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerStopwordsFilterPreprocessor</h4>\n",
      "\n",
      "This preprocessor remove stopwords from the token. If `split_words` is False, it only removes token that are just a stopwords, in the other case, it will remove tokens that are entirely composed of stopwords."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preprocessor = nnp.NerStopwordsFilterPreprocessor(split_words=True)\n",
      "token = Token('Toto', 0, 4, None)\n",
      "print token.word, '-->', preprocessor(token).word"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Toto --> Toto\n"
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "token = Token('Us there', 0, 4, None)\n",
      "print token.word, '-->', preprocessor(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Us there --> None\n"
       ]
      }
     ],
     "prompt_number": 24
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preprocessor = nnp.NerStopwordsFilterPreprocessor(split_words=False)\n",
      "token = Token('Us there', 0, 4, None)\n",
      "print token.word, '-->', preprocessor(token).word"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Us there --> Us there\n"
       ]
      }
     ],
     "prompt_number": 25
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerHashTagPreprocessor</h4>\n",
      "\n",
      "This preprocessor cleanup hashtags."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preprocessor = nnp.NerHashTagPreprocessor()\n",
      "token = Token('@BarackObama', 0, 4, None)\n",
      "print token.word, '-->', preprocessor(token).word"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "@BarackObama --> BarackObama\n"
       ]
      }
     ],
     "prompt_number": 26
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "token = Token('@Barack_Obama', 0, 4, None)\n",
      "print token.word, '-->', preprocessor(token).word"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "@Barack_Obama --> Barack Obama\n"
       ]
      }
     ],
     "prompt_number": 27
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Defining the NER process - nazca.ner</h2>\n",
      "\n",
      "The entire NER process may be implemented in a `NerProcess`\n",
      "\n",
      "    class NerProcess(object):\n",
      "\n",
      "        def add_ner_source(self, process):\n",
      "            \"\"\" Add a ner process \"\"\"\n",
      "\n",
      "        def add_preprocessors(self, preprocessor):\n",
      "            \"\"\" Add a preprocessor \"\"\"\n",
      "\n",
      "        def add_filters(self, filter):\n",
      "            \"\"\" Add a filter \"\"\"\n",
      "\n",
      "        def process_text(self, text):\n",
      "            \"\"\" High level function for analyzing a text \"\"\"\n",
      "\n",
      "        def recognize_tokens(self, tokens):\n",
      "            \"\"\" Recognize Named Entities from a tokenizer or an iterator yielding tokens. \"\"\"\n",
      "\n",
      "        def postprocess(self, named_entities):\n",
      "            \"\"\" Postprocess the results by applying filters \"\"\"\n",
      "\n",
      "\n",
      "The main entry point of a `NerProcess` is the function `process_text()`, that yield triples (URI, source_name, token).\n",
      "The source_name may be set during source creation, and may be useful when working with different sources."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner import NerProcess\n",
      "\n",
      "text = 'Hello everyone, this is   me speaking. And me.'\n",
      "source = NerSourceLexicon({'everyone': 'http://example.com/everyone',\n",
      "                           'me': 'http://example.com/me'})\n",
      "ner = NerProcess((source,))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 28
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
       ]
      }
     ],
     "prompt_number": 29
    },
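    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The same process may also be assembled step by step with the methods listed above (a sketch; the constructor arguments are equivalent):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner = NerProcess([])\n",
      "ner.add_ner_source(source)\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },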
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>Use multi sources</h3>\n",
      "\n",
      "Different sources may be given to the NER process."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "source1 = NerSourceLexicon({'everyone': 'http://example.com/everyone',\n",
      "                             'me': 'http://example.com/me'})\n",
      "source2 = NerSourceLexicon({'me': 'http://example2.com/me'})\n",
      "ner = NerProcess((source1, source2))\n",
      "\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example2.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n",
        "('http://example2.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
       ]
      }
     ],
     "prompt_number": 30
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It is possible to set the `unique` attribute, to keep the first appearance of a token in a source, in the sources order"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner = NerProcess((source1, source2), unique=True)\n",
      "\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
       ]
      }
     ],
     "prompt_number": 31
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner = NerProcess((source2, source1), unique=True)\n",
      "\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example2.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example2.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
       ]
      }
     ],
     "prompt_number": 32
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>Pretty printing the output</h3>\n",
      "\n",
      "The output can be pretty printed with hyperlinks."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.utils.dataio import HTMLPrettyPrint\n",
      "\n",
      "ner = NerProcess((source1, source2))\n",
      "named_entities = ner.process_text(text)\n",
      "html = HTMLPrettyPrint().pprint_text(text, named_entities)\n",
      "print html"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Hello <a href=\"http://example.com/everyone\">everyone</a>, this is   <a href=\"http://example2.com/me\">me</a> speaking. And <a href=\"http://example2.com/me\">me</a>.\n"
       ]
      }
     ],
     "prompt_number": 33
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Defining the filters - nazca.ner.filters</h2>\n",
      "\n",
      "It is sometimes useful to filter the named entities found by a NerProcess before returning them.\n",
      "This may be done using filters, that herit from `AbstractNerFilter`\n",
      "\n",
      "\n",
      "    class AbstractNerFilter(object):\n",
      "        \"\"\" A filter used for cleaning named entities results \"\"\"\n",
      "\n",
      "        def __call__(self, named_entities):\n",
      "            raise NotImplementedError"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import nazca.ner.filters as nnf"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 34
    },
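    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a sketch of this interface, here is a hypothetical `NerUriPrefixFilter` (we assume `AbstractNerFilter` is exposed by `nazca.ner.filters`, and that `named_entities` is the list of (URI, source_name, token) triples seen above):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "class NerUriPrefixFilter(nnf.AbstractNerFilter):\n",
      "    \"\"\" Hypothetical filter: keep only entities whose URI starts with a given prefix \"\"\"\n",
      "\n",
      "    def __init__(self, prefix):\n",
      "        self.prefix = prefix\n",
      "\n",
      "    def __call__(self, named_entities):\n",
      "        return [(uri, name, token) for uri, name, token in named_entities\n",
      "                if uri.startswith(self.prefix)]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },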
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerOccurenceFilter</h4>\n",
      "\n",
      "This filter is based on the number of occurence of named entities in the results."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "text = 'Hello everyone, this is   me speaking. And me.'\n",
      "source1 = NerSourceLexicon({'everyone': 'http://example.com/everyone',\n",
      "                            'me': 'http://example.com/me'})\n",
      "source2 = NerSourceLexicon({'me': 'http://example2.com/me'})\n",
      "_filter = nnf.NerOccurenceFilter(min_occ=2)\n",
      "\n",
      "ner = NerProcess((source1, source2))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example2.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n",
        "('http://example2.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
       ]
      }
     ],
     "prompt_number": 35
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner = NerProcess((source1, source2), filters=(_filter,))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example2.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n",
        "('http://example2.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
       ]
      }
     ],
     "prompt_number": 36
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerDisambiguationWordParts</h4>\n",
      "\n",
      "This filter disambiguates named entities based on the words parts, i.e. if a token is included in a larger reocgnized tokens, we replace it by the larger token."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "text = 'Hello Toto Tutu. And Toto.'\n",
      "source = NerSourceLexicon({'Toto Tutu': 'http://example.com/toto_tutu',\n",
      "                           'Toto': 'http://example.com/toto'})\n",
      "_filter = nnf.NerDisambiguationWordParts()\n",
      "\n",
      "ner = NerProcess((source,))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/toto_tutu', None, Token(word='Toto Tutu', start=6, end=15, sentence=Sentence(indice=0, start=0, end=16)))\n",
        "('http://example.com/toto', None, Token(word='Toto', start=21, end=25, sentence=Sentence(indice=1, start=16, end=26)))\n"
       ]
      }
     ],
     "prompt_number": 37
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner = NerProcess((source,), filters=(_filter,))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/toto_tutu', None, Token(word='Toto Tutu', start=6, end=15, sentence=Sentence(indice=0, start=0, end=16)))\n",
        "('http://example.com/toto_tutu', None, Token(word='Toto', start=21, end=25, sentence=Sentence(indice=1, start=16, end=26)))\n"
       ]
      }
     ],
     "prompt_number": 38
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerReplacementRulesFilter</h4>\n",
      "\n",
      "This filter allow to define replacement rules for Named Entities."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "text = 'Hello toto tutu. And toto.'\n",
      "source = NerSourceLexicon({'toto tutu': 'http://example.com/toto_tutu',\n",
      "                           'toto': 'http://example.com/toto'})\n",
      "rules = {'http://example.com/toto': 'http://example.com/tata'}\n",
      "_filter = nnf.NerReplacementRulesFilter(rules)\n",
      "\n",
      "ner = NerProcess((source,))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/toto_tutu', None, Token(word='toto tutu', start=6, end=15, sentence=Sentence(indice=0, start=0, end=16)))\n",
        "('http://example.com/toto', None, Token(word='toto', start=21, end=25, sentence=Sentence(indice=1, start=16, end=26)))\n"
       ]
      }
     ],
     "prompt_number": 39
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner = NerProcess((source,), filters=(_filter,))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/toto_tutu', None, Token(word='toto tutu', start=6, end=15, sentence=Sentence(indice=0, start=0, end=16)))\n",
        "('http://example.com/tata', None, Token(word='toto', start=21, end=25, sentence=Sentence(indice=1, start=16, end=26)))\n"
       ]
      }
     ],
     "prompt_number": 40
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerRDFTypeFilter</h4>\n",
      "\n",
      "This filter is based on the RDF type of objects"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "text = 'Hello Victor Hugo and Sony'\n",
      "source = NerSourceSparql('http://dbpedia.org/sparql',\n",
      "                         '''SELECT distinct ?uri\n",
      "                            WHERE{?uri rdfs:label \"%(word)s\"@en .\n",
      "                                  ?p foaf:primaryTopic ?uri}''')\n",
      "_filter = nnf.NerRDFTypeFilter('http://dbpedia.org/sparql',\n",
      "                             ('http://schema.org/Place',\n",
      "                              'http://dbpedia.org/ontology/Agent',\n",
      "                              'http://dbpedia.org/ontology/Place'))\n",
      "\n",
      "ner = NerProcess((source,))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(u'http://dbpedia.org/resource/Hello', None, Token(word='Hello', start=0, end=5, sentence=Sentence(indice=0, start=0, end=26)))\n",
        "(u'http://dbpedia.org/resource/Victor_Hugo', None, Token(word='Victor Hugo', start=6, end=17, sentence=Sentence(indice=0, start=0, end=26)))\n",
        "(u'http://dbpedia.org/resource/Sony', None, Token(word='Sony', start=22, end=26, sentence=Sentence(indice=0, start=0, end=26)))\n"
       ]
      }
     ],
     "prompt_number": 41
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner = NerProcess((source,), filters=(_filter,))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(u'http://dbpedia.org/resource/Victor_Hugo', None, Token(word='Victor Hugo', start=6, end=17, sentence=Sentence(indice=0, start=0, end=26)))\n",
        "(u'http://dbpedia.org/resource/Sony', None, Token(word='Sony', start=22, end=26, sentence=Sentence(indice=0, start=0, end=26)))\n"
       ]
      }
     ],
     "prompt_number": 42
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Putting it all together</h2>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>Get the data</h4>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import feedparser\n",
      "from BeautifulSoup import BeautifulSoup  \n",
      "\n",
      "data = feedparser.parse('http://rss.nytimes.com/services/xml/rss/nyt/World.xml')\n",
      "entries = data.entries"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 43
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>Define the source</h4>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.sources import NerSourceSparql\n",
      "\n",
      "dbpedia_sparql_source = NerSourceSparql('http://dbpedia.org/sparql',\n",
      "                                         '''SELECT distinct ?uri\n",
      "                                            WHERE{\n",
      "                                            ?uri rdfs:label \"%(word)s\"@en .\n",
      "                                            ?p foaf:primaryTopic ?uri}''',\n",
      "                                         'http://dbpedia.org/sparql',\n",
      "                                        use_cache=True)\n",
      "ner_sources = [dbpedia_sparql_source,]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 44
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>Create the NER process</h4>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.preprocessors import (NerLowerCaseFilterPreprocessor,\n",
      "                                    NerStopwordsFilterPreprocessor)\n",
      "from nazca.ner import NerProcess\n",
      "\n",
      "preprocessors = [NerLowerCaseFilterPreprocessor(),\n",
      "        \t     NerStopwordsFilterPreprocessor()]\n",
      "process = NerProcess(ner_sources, preprocessors=preprocessors)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 45
    },
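    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The filters from the previous section fit in the same way. For instance (a sketch, not used below), keeping only entities that occur at least twice:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.filters import NerOccurenceFilter\n",
      "\n",
      "filtered_process = NerProcess(ner_sources, preprocessors=preprocessors,\n",
      "                              filters=[NerOccurenceFilter(min_occ=2)])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },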
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>Process the data</h4>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.display import HTML\n",
      "from nazca.utils.dataio import HTMLPrettyPrint\n",
      "\n",
      "pprint_html = HTMLPrettyPrint()\n",
      "\n",
      "html = []\n",
      "for ind, entry in enumerate(entries[:20]):\n",
      "    text = BeautifulSoup(entry['summary_detail']['value']).text\n",
      "    named_entities = process.process_text(text)\n",
      "    html.append(pprint_html.pprint_text(text, named_entities))\n",
      "HTML('<br/><br/>'.join(html))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "A huge throng of people, mostly young, took to <a href=\"http://dbpedia.org/resource/Hong_Kong\">Hong Kong</a>\u2019s streets <a href=\"http://dbpedia.org/resource/Tuesday\">Tuesday</a>, defying <a href=\"http://dbpedia.org/resource/Beijing\">Beijing</a>\u2019s dwindling tolerance for challenges to its control.<br/><br/>The post, until now largely ceremonial, could become much more important under <a href=\"http://dbpedia.org/resource/Mr\">Mr</a>. <a href=\"http://dbpedia.org/resource/Erdogan\">Erdogan</a>, who has held power in <a href=\"http://dbpedia.org/resource/Turkey\">Turkey</a> for a decade.<br/><br/>The discovery of the teenagers\u2019 bodies in the <a href=\"http://dbpedia.org/resource/West_Bank\">West Bank</a> prompted vows of retaliation by <a href=\"http://dbpedia.org/resource/Israel\">Israel</a>, which blamed the <a href=\"http://dbpedia.org/resource/Palestinian\">Palestinian</a> group <a href=\"http://dbpedia.org/resource/Hamas\">Hamas</a> for the killings.<br/><br/><a href=\"http://dbpedia.org/resource/The_South\">The South</a> <a href=\"http://dbpedia.org/resource/African\">African</a> track star\u2019s agent and friend testified that the couple\u2019s relationship was strong and that he did not intend to kill her.<br/><br/>The <a href=\"http://dbpedia.org/resource/Japanese_prime_minister\">Japanese prime minister</a> announced that his government would reinterpret the antiwar <a href=\"http://dbpedia.org/resource/Constitution\">Constitution</a> to allow the armed forces to come to the aid of friendly nations.<br/><br/>The first clues that led to the grisly discovery of the bodies came only hours after their abduction in the <a href=\"http://dbpedia.org/resource/West_Bank\">West Bank</a> was reported.<br/><br/>The lawmakers were under pressure to name an inclusive government as insurgents mount a violent challenge north and west of <a href=\"http://dbpedia.org/resource/Baghdad\">Baghdad</a>.<br/><br/>The only viable political future for the country is federation. But <a href=\"http://dbpedia.org/resource/America\">America</a>\u2019s first priority is to see <a href=\"http://dbpedia.org/resource/ISIS\">ISIS</a> crushed.<br/><br/><a href=\"http://dbpedia.org/resource/President\">President</a> <a href=\"http://dbpedia.org/resource/Petro\">Petro</a> <a href=\"http://dbpedia.org/resource/O\">O</a>. 
<a href=\"http://dbpedia.org/resource/Poroshenko\">Poroshenko</a> said he would resume full-scale efforts to quash the pro-Russian uprising in eastern <a href=\"http://dbpedia.org/resource/Ukraine\">Ukraine</a>.<br/><br/><a href=\"http://dbpedia.org/resource/Nicolas_Sarkozy\">Nicolas Sarkozy</a>, the former <a href=\"http://dbpedia.org/resource/French_president\">French president</a>, has been under scrutiny for possible financial irregularities in his <a href=\"http://dbpedia.org/resource/2007\">2007</a> campaign and for other alleged offenses.<br/><br/><a href=\"http://dbpedia.org/resource/Myanmar\">Myanmar</a> is enjoying some new diplomatic clout, leading <a href=\"http://dbpedia.org/resource/China\">China</a> to court the country as <a href=\"http://dbpedia.org/resource/Beijing\">Beijing</a> presses its territorial claims in the <a href=\"http://dbpedia.org/resource/South_China_Sea\">South China Sea</a>.<br/><br/>The last remaining <a href=\"http://dbpedia.org/resource/African\">African</a> teams in the <a href=\"http://dbpedia.org/resource/World_Cup\">World Cup</a>, <a href=\"http://dbpedia.org/resource/Algeria\">Algeria</a> and <a href=\"http://dbpedia.org/resource/Nigeria\">Nigeria</a>, were eliminated on <a href=\"http://dbpedia.org/resource/Monday\">Monday</a>, ensuring that the continent would once again remember the <a href=\"http://dbpedia.org/resource/2014\">2014</a> event for off-the-field squabbles.<br/><br/>As <a href=\"http://dbpedia.org/resource/Hong_Kong\">Hong Kong</a> prepared for its annual pro-democracy march <a href=\"http://dbpedia.org/resource/Tuesday\">Tuesday</a>, a survey of residents found more discontent than ever with the <a href=\"http://dbpedia.org/resource/Chinese_government\">Chinese government</a>\u2019s policies toward the city, especially among the young.<br/><br/><a href=\"http://dbpedia.org/resource/President\">President</a> <a href=\"http://dbpedia.org/resource/Petro\">Petro</a> <a href=\"http://dbpedia.org/resource/O\">O</a>. <a href=\"http://dbpedia.org/resource/Poroshenko\">Poroshenko</a> ended a 10-day cease-fire, saying that rebels had not put down their weapons and had persisted in attacking government troops.<br/><br/>At least <a href=\"http://dbpedia.org/resource/22\">22</a> people were killed in the firefight \u2014 all of them assailants, the military said. 
One soldier was injured.<br/><br/>The giant <a href=\"http://dbpedia.org/resource/French\">French</a> bank admitted to transferring billions of dollars on behalf of <a href=\"http://dbpedia.org/resource/Sudan\">Sudan</a> and other countries the <a href=\"http://dbpedia.org/resource/United_States\">United States</a> has blacklisted.<br/><br/>The former chief justice of the <a href=\"http://dbpedia.org/resource/Constitutional_Court\">Constitutional Court</a> was sentenced to life in prison for corruption, the heaviest sentence ever for graft in one of the most corrupt countries in the world.<br/><br/>A former aide to former <a href=\"http://dbpedia.org/resource/Prime_Minister\">Prime Minister</a> <a href=\"http://dbpedia.org/resource/Petr_Necas\">Petr Necas</a> who later married him was found guilty of abuse of power on <a href=\"http://dbpedia.org/resource/Monday\">Monday</a> in a scandal that exposed their affair and toppled the government a year ago.<br/><br/>The court found that in <a href=\"http://dbpedia.org/resource/1973\">1973</a> an <a href=\"http://dbpedia.org/resource/American\">American</a> naval officer provided <a href=\"http://dbpedia.org/resource/Chilean\">Chilean</a> officials with information on two <a href=\"http://dbpedia.org/resource/Americans\">Americans</a>, which led to their executions as part of a coup that ousted <a href=\"http://dbpedia.org/resource/President\">President</a> <a href=\"http://dbpedia.org/resource/Salvador_Allende\">Salvador Allende</a>.<br/><br/><a href=\"http://dbpedia.org/resource/Mayor\">Mayor</a> <a href=\"http://dbpedia.org/resource/Rob_Ford\">Rob Ford</a> of <a href=\"http://dbpedia.org/resource/Toronto\">Toronto</a> returned to his job after undergoing drug and alcohol treatment, saying, \u201cMy top priority will be rebuilding trust.\u201d"
       ],
       "output_type": "pyout",
       "prompt_number": 46,
       "text": [
        "<IPython.core.display.HTML at 0x2cdd1d0>"
       ]
      }
     ],
     "prompt_number": 46
    }
   ],
   "metadata": {}
  }
 ]
}