{
 "metadata": {
  "name": "Named Entities Matching with Nazca"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h1>Named Entities Matching with Nazca</h1>\n",
      "\n",
      "\n",
      "Named Entities Matching is the process of recognizing elements in a text and matching it\n",
      "to different types (e.g. Person, Organization, Place). This may be related to Record Linkage (for linking entities from a text corpus and from a reference corpus) and to Named Entities Recognition."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.display import HTML\n",
      "HTML('<iframe src=http://en.mobile.wikipedia.org/wiki/Named-entity_recognition?useformat=mobile width=700 height=350></iframe>')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<iframe src=http://en.mobile.wikipedia.org/wiki/Named-entity_recognition?useformat=mobile width=700 height=350></iframe>"
       ],
       "output_type": "pyout",
       "prompt_number": 2,
       "text": [
        "<IPython.core.display.HTML at 0x7fd0000698d0>"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In Nazca, we provide tools to match named entities to existing corpus or reference database.\n",
      "This may be used for contextualizing your data, e.g. by recognizing in them elements from Dbpedia."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Token, Sentence and Tokenizer - nazca.utils.tokenizer</h2>\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A `Sentence` is:\n",
      "<ul>\n",
      "    <li>the `start` indice in the text</li>\n",
      "    <li>the `end` indice in the text</li>\n",
      "    <li>the `indice` of the sentence amonst all the other sentences</li>\n",
      "</ul>    "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.utils.tokenizer import Sentence\n",
      "\n",
      "sentence = Sentence(indice=0, start=0, end=38)\n",
      "print sentence"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Sentence(indice=0, start=0, end=38)\n"
       ]
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A `Token` is:\n",
      "<ul>\n",
      "    <li>a `word` or part of a sentence</li>\n",
      "    <li>the `start` indice in the sentence</li>\n",
      "    <li>the `end` indice in the sentence</li>\n",
      "    <li>the `sentence` which contains the Token</li>\n",
      "</ul>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.utils.tokenizer import Token\n",
      "\n",
      "token = Token(word='Hello everyone this', start=0, end=20, sentence=Sentence(indice=0, start=0, end=38))\n",
      "print token"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Token(word='Hello everyone this', start=0, end=20, sentence=Sentence(indice=0, start=0, end=38))\n"
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Nazca provides a `Tokenizer` that creates tokens from a text:\n",
      "\n",
      "\n",
      "    class RichStringTokenizer(object):\n",
      "\n",
      "        def iter_tokens(self, text):\n",
      "            \"\"\" Iterate tokens over a text \"\"\"\n",
      "\n",
      "        def find_sentences(self, text):\n",
      "            \"\"\" Find the sentences \"\"\"\n",
      "\n",
      "        def load_text(self, text):\n",
      "            \"\"\" Load the text to be tokenized \"\"\"\n",
      "\n",
      "        def __iter__(self):\n",
      "            \"\"\" Iterator over the text given in the object instantiation \"\"\"\n",
      "\n",
      "\n",
      "The tokenizer may be initialized with a minimum size and maximum size for the tokens. It will find all the sentences from a text using `find_sentences` and will create tokens in an iterative way."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.utils.tokenizer import RichStringTokenizer\n",
      "\n",
      "text = 'Hello everyone, this is   me speaking. And me !Why not me ? Blup'\n",
      "tokenizer = RichStringTokenizer(text, token_min_size=1, token_max_size=4)\n",
      "sentences = tokenizer.find_sentences(text)\n",
      "\n",
      "print sentences"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[Sentence(indice=0, start=0, end=38), Sentence(indice=1, start=38, end=47), Sentence(indice=2, start=47, end=59), Sentence(indice=3, start=59, end=64)]\n"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for token in tokenizer:\n",
      "    print token"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Token(word='Hello everyone this is', start=0, end=23, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='Hello everyone this', start=0, end=20, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='Hello everyone', start=0, end=14, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='Hello', start=0, end=5, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='everyone this is me', start=6, end=28, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='everyone this is', start=6, end=23, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='everyone this', start=6, end=20, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='this is me speaking', start=16, end=37, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='this is me', start=16, end=28, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='this is', start=16, end=23, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='this', start=16, end=20, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='is me speaking', start=21, end=37, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='is me', start=21, end=28, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='is', start=21, end=23, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='me speaking', start=26, end=37, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='speaking', start=29, end=37, sentence=Sentence(indice=0, start=0, end=38))\n",
        "Token(word='And me Why', start=39, end=50, sentence=Sentence(indice=1, start=38, end=47))\n",
        "Token(word='And me', start=39, end=45, sentence=Sentence(indice=1, start=38, end=47))\n",
        "Token(word='And', start=39, end=42, sentence=Sentence(indice=1, start=38, end=47))\n",
        "Token(word='me Why', start=43, end=50, sentence=Sentence(indice=1, start=38, end=47))\n",
        "Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=47))\n",
        "Token(word='Why not me', start=47, end=57, sentence=Sentence(indice=2, start=47, end=59))\n",
        "Token(word='Why not', start=47, end=54, sentence=Sentence(indice=2, start=47, end=59))\n",
        "Token(word='Why', start=47, end=50, sentence=Sentence(indice=2, start=47, end=59))\n",
        "Token(word='not me', start=51, end=57, sentence=Sentence(indice=2, start=47, end=59))\n",
        "Token(word='not', start=51, end=54, sentence=Sentence(indice=2, start=47, end=59))\n",
        "Token(word='me', start=55, end=57, sentence=Sentence(indice=2, start=47, end=59))\n",
        "Token(word='Blup', start=60, end=64, sentence=Sentence(indice=3, start=59, end=64))\n"
       ]
      }
     ],
     "prompt_number": 19
    },
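    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The same tokenizer instance may be reused on another text with `load_text`. A quick sketch (the token sizes given at instantiation still apply):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "tokenizer.load_text('A new text to tokenize')\n",
      "print [token.word for token in tokenizer]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },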
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Defining a source of entities - nazca.ner.sources</h2>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First, we should define the source of entities that should be retrieved in the tokens yield by the tokenizer.\n",
      "\n",
      "A source herits from `AbstractNerSource`:\n",
      "\n",
      "\n",
      "    class AbstractNerSource(object):\n",
      "        \"\"\" High-level source for Named Entities Recognition \"\"\"\n",
      "\n",
      "        def __init__(self, endpoint, query, name=None, use_cache=True, preprocessors=None):\n",
      "        \"\"\" Initialise the class.\"\"\"\n",
      "\n",
      "        def add_preprocessors(self, preprocessor):\n",
      "            \"\"\" Add a preprocessor \"\"\"\n",
      "\n",
      "        def recognize_token(self, token):\n",
      "            \"\"\" Recognize a token \"\"\"\n",
      "\n",
      "        def query_word(self, word):\n",
      "            \"\"\" Query a word for a Named Entities Recognition process \"\"\"\n",
      "\n",
      "\n",
      "It requires an `endpoint` and a `query`. And may recognize a word directly using `query_word`, or a token using `recognize_token`. In both cases, it returns a list of URIs/identifiers."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>Sparql source</h4>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.sources import NerSourceSparql\n",
      "ner_source = NerSourceSparql('http://dbpedia.org/sparql',\n",
      "                             '''SELECT distinct ?uri\n",
      "                                WHERE{?uri rdfs:label \"%(word)s\"@en}''')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 20
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print ner_source.query_word('Victor Hugo')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'http://dbpedia.org/resource/Victor_Hugo', u'http://dbpedia.org/resource/Category:Victor_Hugo', u'http://sw.opencyc.org/2008/06/10/concept/en/VictorHugo', u'http://sw.opencyc.org/2008/06/10/concept/Mx4rve1ZXJwpEbGdrcN5Y29ycA', u'http://wikidata.dbpedia.org/resource/Q1459231', u'http://wikidata.dbpedia.org/resource/Q535', u'http://wikidata.dbpedia.org/resource/Q3557368']\n"
       ]
      }
     ],
     "prompt_number": 21
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Of course, you can use a more complex query reflecting restrictions and business logic"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.sources import NerSourceSparql\n",
      "ner_source = NerSourceSparql('http://dbpedia.org/sparql',\n",
      "                              '''SELECT distinct ?uri\n",
      "                                 WHERE{?uri rdfs:label \"%(word)s\"@en .\n",
      "                                       ?p foaf:primaryTopic ?uri}''')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 22
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print ner_source.query_word('Victor Hugo')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'http://dbpedia.org/resource/Victor_Hugo']\n"
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You may also used `use_cache` to True to keep all the previous results."
     ]
    },
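    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A minimal sketch of caching (assuming the DBpedia endpoint above is reachable): the cache is consulted when recognizing tokens, so recognizing the same token twice should only query the endpoint once."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.sources import NerSourceSparql\n",
      "from nazca.utils.tokenizer import Token, Sentence\n",
      "\n",
      "ner_source = NerSourceSparql('http://dbpedia.org/sparql',\n",
      "                             '''SELECT distinct ?uri\n",
      "                                WHERE{?uri rdfs:label \"%(word)s\"@en .\n",
      "                                      ?p foaf:primaryTopic ?uri}''',\n",
      "                             use_cache=True)\n",
      "token = Token(word='Victor Hugo', start=0, end=11,\n",
      "              sentence=Sentence(indice=0, start=0, end=12))\n",
      "print ner_source.recognize_token(token)  # queries the endpoint\n",
      "print ner_source.recognize_token(token)  # answered from the cache"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },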
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>Rql source</h4>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.sources import NerSourceRql\n",
      "\n",
      "ner_source = NerSourceRql('http://www.cubicweb.org',\n",
      "                          'Any U WHERE X cwuri U, X name \"%(word)s\"')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 24
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print ner_source.query_word('apycot')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'http://www.cubicweb.org/1310453', u'http://www.cubicweb.org/749162']\n"
       ]
      }
     ],
     "prompt_number": 25
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>Lexicon source</h4>\n",
      "\n",
      "A lexicon source is a source based on a dictionnary"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.sources import NerSourceLexicon\n",
      "\n",
      "\n",
      "lexicon = {'everyone': 'http://example.com/everyone',\n",
      "           'me': 'http://example.com/me'}\n",
      "ner_source = NerSourceLexicon(lexicon)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 26
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print ner_source.query_word('me')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "['http://example.com/me']\n"
       ]
      }
     ],
     "prompt_number": 27
    },
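    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Besides `query_word`, a source can recognize a `Token` directly through `recognize_token`, which applies the source's preprocessors (and its cache, if enabled) before querying the word. A quick sketch with the lexicon source above:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.utils.tokenizer import Token, Sentence\n",
      "\n",
      "token = Token(word='me', start=26, end=28,\n",
      "              sentence=Sentence(indice=0, start=0, end=38))\n",
      "print ner_source.recognize_token(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },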
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Defining preprocessors - nazca.ner.preprocessors</h2>\n",
      "\n",
      "Preprocessors are used to cleanup/filter a token before recognizing it in a source.\n",
      "All preprocessors herit from `AbstractNerPreprocessor`\n",
      "\n",
      "    class AbstractNerPreprocessor(object):\n",
      "        \"\"\" Preprocessor \"\"\"\n",
      "\n",
      "        def __call__(self, token):\n",
      "            raise NotImplementedError\n",
      "\n",
      "\n",
      "A token which is None after being returned by a preprocessor is not recognized by the source."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import nazca.ner.preprocessors as nnp"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 28
    },
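    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Writing your own preprocessor only requires subclassing `AbstractNerPreprocessor` and implementing `__call__`, which returns either a (possibly modified) token or None to discard it. A minimal sketch (the digit-filtering rule is purely illustrative, not part of Nazca):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "class DigitFilterPreprocessor(nnp.AbstractNerPreprocessor):\n",
      "    \"\"\" Hypothetical preprocessor discarding tokens that contain digits \"\"\"\n",
      "\n",
      "    def __call__(self, token):\n",
      "        if any(c.isdigit() for c in token.word):\n",
      "            return None\n",
      "        return token\n",
      "\n",
      "preprocessor = DigitFilterPreprocessor()\n",
      "token = Token('Flight 370', 0, 10, None)\n",
      "print token.word, '-->', preprocessor(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },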
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerWordSizeFilterPreprocessor</h4>\n",
      "\n",
      "This preprocessor remove token based on the size of the word."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preprocessor = nnp.NerWordSizeFilterPreprocessor(min_size=2, max_size=4)\n",
      "token = Token('toto', 0, 4, None)\n",
      "print token, '-->', preprocessor(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Token(word='toto', start=0, end=4, sentence=None) --> Token(word='toto', start=0, end=4, sentence=None)\n"
       ]
      }
     ],
     "prompt_number": 29
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "token = Token('t', 0, 4, None)\n",
      "print token, '-->', preprocessor(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Token(word='t', start=0, end=4, sentence=None) --> None\n"
       ]
      }
     ],
     "prompt_number": 30
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "token = Token('tototata', 0, 4, None)\n",
      "print token, '-->', preprocessor(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Token(word='tototata', start=0, end=4, sentence=None) --> None\n"
       ]
      }
     ],
     "prompt_number": 31
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerLowerCaseFilterPreprocessor</h4>\n",
      "\n",
      "This preprocessor remove token in lower case"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preprocessor = nnp.NerLowerCaseFilterPreprocessor()\n",
      "token = Token('Toto', 0, 4, None)\n",
      "print token, '-->', preprocessor(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Token(word='Toto', start=0, end=4, sentence=None) --> Token(word='Toto', start=0, end=4, sentence=None)\n"
       ]
      }
     ],
     "prompt_number": 32
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "token = Token('toto', 0, 4, None)\n",
      "print token, '-->', preprocessor(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Token(word='toto', start=0, end=4, sentence=None) --> None\n"
       ]
      }
     ],
     "prompt_number": 33
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerLowerFirstWordPreprocessor</h4>\n",
      "\n",
      "This preprocessor lower the first word of each sentence if it is a stopword."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preprocessor = nnp.NerLowerFirstWordPreprocessor()\n",
      "sentence = Sentence(0, 0, 20)\n",
      "token = Token('Toto tata', 0, 4, sentence)\n",
      "print token.word, '-->', preprocessor(token).word"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Toto tata --> Toto tata\n"
       ]
      }
     ],
     "prompt_number": 34
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "token = Token('The tata', 0, 4, sentence)\n",
      "print token.word, '-->', preprocessor(token).word"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "The tata --> the tata\n"
       ]
      }
     ],
     "prompt_number": 35
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "token = Token('Tata The', 0, 4, sentence)\n",
      "print token.word, '-->', preprocessor(token).word"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Tata The --> Tata The\n"
       ]
      }
     ],
     "prompt_number": 36
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerStopwordsFilterPreprocessor</h4>\n",
      "\n",
      "This preprocessor remove stopwords from the token. If `split_words` is False, it only removes token that are just a stopwords, in the other case, it will remove tokens that are entirely composed of stopwords."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preprocessor = nnp.NerStopwordsFilterPreprocessor(split_words=True)\n",
      "token = Token('Toto', 0, 4, None)\n",
      "print token.word, '-->', preprocessor(token).word"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Toto --> Toto\n"
       ]
      }
     ],
     "prompt_number": 37
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "token = Token('Us there', 0, 4, None)\n",
      "print token.word, '-->', preprocessor(token)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Us there --> None\n"
       ]
      }
     ],
     "prompt_number": 38
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preprocessor = nnp.NerStopwordsFilterPreprocessor(split_words=False)\n",
      "token = Token('Us there', 0, 4, None)\n",
      "print token.word, '-->', preprocessor(token).word"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Us there --> Us there\n"
       ]
      }
     ],
     "prompt_number": 39
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerHashTagPreprocessor</h4>\n",
      "\n",
      "This preprocessor cleanup hashtags."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "preprocessor = nnp.NerHashTagPreprocessor()\n",
      "token = Token('@BarackObama', 0, 4, None)\n",
      "print token.word, '-->', preprocessor(token).word"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "@BarackObama --> BarackObama\n"
       ]
      }
     ],
     "prompt_number": 40
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "token = Token('@Barack_Obama', 0, 4, None)\n",
      "print token.word, '-->', preprocessor(token).word"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "@Barack_Obama --> Barack Obama\n"
       ]
      }
     ],
     "prompt_number": 41
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Defining the NER process - nazca.ner</h2>\n",
      "\n",
      "The entire NER process may be implemented in a `NerProcess`\n",
      "\n",
      "    class NerProcess(object):\n",
      "\n",
      "        def add_ner_source(self, process):\n",
      "            \"\"\" Add a ner process \"\"\"\n",
      "\n",
      "        def add_preprocessors(self, preprocessor):\n",
      "            \"\"\" Add a preprocessor \"\"\"\n",
      "\n",
      "        def add_filters(self, filter):\n",
      "            \"\"\" Add a filter \"\"\"\n",
      "\n",
      "        def process_text(self, text):\n",
      "            \"\"\" High level function for analyzing a text \"\"\"\n",
      "\n",
      "        def recognize_tokens(self, tokens):\n",
      "            \"\"\" Recognize Named Entities from a tokenizer or an iterator yielding tokens. \"\"\"\n",
      "\n",
      "        def postprocess(self, named_entities):\n",
      "            \"\"\" Postprocess the results by applying filters \"\"\"\n",
      "\n",
      "\n",
      "The main entry point of a `NerProcess` is the function `process_text()`, that yield triples (URI, source_name, token).\n",
      "The source_name may be set during source creation, and may be useful when working with different sources."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner import NerProcess\n",
      "\n",
      "text = 'Hello everyone, this is   me speaking. And me.'\n",
      "source = NerSourceLexicon({'everyone': 'http://example.com/everyone',\n",
      "                           'me': 'http://example.com/me'})\n",
      "ner = NerProcess((source,))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 42
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
       ]
      }
     ],
     "prompt_number": 43
    },
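    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Sources, preprocessors and filters may also be attached after creation, using `add_ner_source`, `add_preprocessors` and `add_filters`. A sketch reusing the objects defined above (`nnp` was imported in the preprocessors section):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner = NerProcess((source,))\n",
      "ner.add_preprocessors(nnp.NerStopwordsFilterPreprocessor())\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },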
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>Use multi sources</h3>\n",
      "\n",
      "Different sources may be given to the NER process."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "source1 = NerSourceLexicon({'everyone': 'http://example.com/everyone',\n",
      "                             'me': 'http://example.com/me'})\n",
      "source2 = NerSourceLexicon({'me': 'http://example2.com/me'})\n",
      "ner = NerProcess((source1, source2))\n",
      "\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example2.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n",
        "('http://example2.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
       ]
      }
     ],
     "prompt_number": 44
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It is possible to set the `unique` attribute, to keep the first appearance of a token in a source, in the sources order"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner = NerProcess((source1, source2), unique=True)\n",
      "\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
       ]
      }
     ],
     "prompt_number": 45
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner = NerProcess((source2, source1), unique=True)\n",
      "\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example2.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example2.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
       ]
      }
     ],
     "prompt_number": 46
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h3>Pretty printing the output</h3>\n",
      "\n",
      "The output can be pretty printed with hyperlinks."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.utils.dataio import HTMLPrettyPrint\n",
      "\n",
      "ner = NerProcess((source1, source2))\n",
      "named_entities = ner.process_text(text)\n",
      "html = HTMLPrettyPrint().pprint_text(text, named_entities)\n",
      "print html"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Hello <a href=\"http://example.com/everyone\">everyone</a>, this is   <a href=\"http://example2.com/me\">me</a> speaking. And <a href=\"http://example2.com/me\">me</a>.\n"
       ]
      }
     ],
     "prompt_number": 47
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Defining the filters - nazca.ner.filters</h2>\n",
      "\n",
      "It is sometimes useful to filter the named entities found by a NerProcess before returning them.\n",
      "This may be done using filters, that herit from `AbstractNerFilter`\n",
      "\n",
      "\n",
      "    class AbstractNerFilter(object):\n",
      "        \"\"\" A filter used for cleaning named entities results \"\"\"\n",
      "\n",
      "        def __call__(self, named_entities):\n",
      "            raise NotImplementedError"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import nazca.ner.filters as nnf"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 48
    },
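    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As with preprocessors, a custom filter is simply a callable taking the list of (URI, source_name, token) triples and returning a filtered list. A minimal sketch (the URI-prefix rule is purely illustrative, not part of Nazca):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "class UriPrefixFilter(nnf.AbstractNerFilter):\n",
      "    \"\"\" Hypothetical filter keeping only URIs starting with a given prefix \"\"\"\n",
      "\n",
      "    def __init__(self, prefix):\n",
      "        self.prefix = prefix\n",
      "\n",
      "    def __call__(self, named_entities):\n",
      "        return [(uri, name, token) for uri, name, token in named_entities\n",
      "                if uri.startswith(self.prefix)]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },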
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerOccurenceFilter</h4>\n",
      "\n",
      "This filter is based on the number of occurence of named entities in the results."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "text = 'Hello everyone, this is   me speaking. And me.'\n",
      "source1 = NerSourceLexicon({'everyone': 'http://example.com/everyone',\n",
      "                            'me': 'http://example.com/me'})\n",
      "source2 = NerSourceLexicon({'me': 'http://example2.com/me'})\n",
      "_filter = nnf.NerOccurenceFilter(min_occ=2)\n",
      "\n",
      "ner = NerProcess((source1, source2))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/everyone', None, Token(word='everyone', start=6, end=14, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example2.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n",
        "('http://example2.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
       ]
      }
     ],
     "prompt_number": 49
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner = NerProcess((source1, source2), filters=(_filter,))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example2.com/me', None, Token(word='me', start=26, end=28, sentence=Sentence(indice=0, start=0, end=38)))\n",
        "('http://example.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n",
        "('http://example2.com/me', None, Token(word='me', start=43, end=45, sentence=Sentence(indice=1, start=38, end=46)))\n"
       ]
      }
     ],
     "prompt_number": 50
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerDisambiguationWordParts</h4>\n",
      "\n",
      "This filter disambiguates named entities based on the words parts, i.e. if a token is included in a larger reocgnized tokens, we replace it by the larger token."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "text = 'Hello Toto Tutu. And Toto.'\n",
      "source = NerSourceLexicon({'Toto Tutu': 'http://example.com/toto_tutu',\n",
      "                           'Toto': 'http://example.com/toto'})\n",
      "_filter = nnf.NerDisambiguationWordParts()\n",
      "\n",
      "ner = NerProcess((source,))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/toto_tutu', None, Token(word='Toto Tutu', start=6, end=15, sentence=Sentence(indice=0, start=0, end=16)))\n",
        "('http://example.com/toto', None, Token(word='Toto', start=21, end=25, sentence=Sentence(indice=1, start=16, end=26)))\n"
       ]
      }
     ],
     "prompt_number": 51
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner = NerProcess((source,), filters=(_filter,))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/toto_tutu', None, Token(word='Toto Tutu', start=6, end=15, sentence=Sentence(indice=0, start=0, end=16)))\n",
        "('http://example.com/toto_tutu', None, Token(word='Toto', start=21, end=25, sentence=Sentence(indice=1, start=16, end=26)))\n"
       ]
      }
     ],
     "prompt_number": 52
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerReplacementRulesFilter</h4>\n",
      "\n",
      "This filter allow to define replacement rules for Named Entities."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "text = 'Hello toto tutu. And toto.'\n",
      "source = NerSourceLexicon({'toto tutu': 'http://example.com/toto_tutu',\n",
      "                           'toto': 'http://example.com/toto'})\n",
      "rules = {'http://example.com/toto': 'http://example.com/tata'}\n",
      "_filter = nnf.NerReplacementRulesFilter(rules)\n",
      "\n",
      "ner = NerProcess((source,))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/toto_tutu', None, Token(word='toto tutu', start=6, end=15, sentence=Sentence(indice=0, start=0, end=16)))\n",
        "('http://example.com/toto', None, Token(word='toto', start=21, end=25, sentence=Sentence(indice=1, start=16, end=26)))\n"
       ]
      }
     ],
     "prompt_number": 53
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner = NerProcess((source,), filters=(_filter,))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('http://example.com/toto_tutu', None, Token(word='toto tutu', start=6, end=15, sentence=Sentence(indice=0, start=0, end=16)))\n",
        "('http://example.com/tata', None, Token(word='toto', start=21, end=25, sentence=Sentence(indice=1, start=16, end=26)))\n"
       ]
      }
     ],
     "prompt_number": 54
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>NerRDFTypeFilter</h4>\n",
      "\n",
      "This filter is based on the RDF type of objects"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "text = 'Hello Victor Hugo and Sony'\n",
      "source = NerSourceSparql('http://dbpedia.org/sparql',\n",
      "                         '''SELECT distinct ?uri\n",
      "                            WHERE{?uri rdfs:label \"%(word)s\"@en .\n",
      "                                  ?p foaf:primaryTopic ?uri}''')\n",
      "_filter = nnf.NerRDFTypeFilter('http://dbpedia.org/sparql',\n",
      "                             ('http://schema.org/Place',\n",
      "                              'http://dbpedia.org/ontology/Agent',\n",
      "                              'http://dbpedia.org/ontology/Place'))\n",
      "\n",
      "ner = NerProcess((source,))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(u'http://dbpedia.org/resource/Hello', None, Token(word='Hello', start=0, end=5, sentence=Sentence(indice=0, start=0, end=26)))\n",
        "(u'http://dbpedia.org/resource/Victor_Hugo', None, Token(word='Victor Hugo', start=6, end=17, sentence=Sentence(indice=0, start=0, end=26)))\n",
        "(u'http://dbpedia.org/resource/Sony', None, Token(word='Sony', start=22, end=26, sentence=Sentence(indice=0, start=0, end=26)))\n"
       ]
      }
     ],
     "prompt_number": 56
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ner = NerProcess((source,), filters=(_filter,))\n",
      "for infos in ner.process_text(text):\n",
      "    print infos"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(u'http://dbpedia.org/resource/Victor_Hugo', None, Token(word='Victor Hugo', start=6, end=17, sentence=Sentence(indice=0, start=0, end=26)))\n",
        "(u'http://dbpedia.org/resource/Sony', None, Token(word='Sony', start=22, end=26, sentence=Sentence(indice=0, start=0, end=26)))\n"
       ]
      }
     ],
     "prompt_number": 57
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h2>Putting it all together</h2>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>Get the data</h4>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import feedparser\n",
      "from BeautifulSoup import BeautifulSoup  \n",
      "\n",
      "data = feedparser.parse('http://rss.nytimes.com/services/xml/rss/nyt/World.xml')\n",
      "entries = data.entries"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>Define the source</h4>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.sources import NerSourceSparql\n",
      "\n",
      "dbpedia_sparql_source = NerSourceSparql('http://dbpedia.org/sparql',\n",
      "                                         '''SELECT distinct ?uri\n",
      "                                            WHERE{\n",
      "                                            ?uri rdfs:label \"%(word)s\"@en .\n",
      "                                            ?p foaf:primaryTopic ?uri}''',\n",
      "                                         'http://dbpedia.org/sparql',\n",
      "                                        use_cache=True)\n",
      "ner_sources = [dbpedia_sparql_source,]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>Create the NER process</h4>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nazca.ner.preprocessors import (NerLowerCaseFilterPreprocessor,\n",
      "                                    NerStopwordsFilterPreprocessor)\n",
      "from nazca.ner import NerProcess\n",
      "\n",
      "preprocessors = [NerLowerCaseFilterPreprocessor(),\n",
      "        \t     NerStopwordsFilterPreprocessor()]\n",
      "process = NerProcess(ner_sources, preprocessors=preprocessors)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<h4>Process the data</h4>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.display import HTML\n",
      "from nazca.utils.dataio import HTMLPrettyPrint\n",
      "\n",
      "pprint_html = HTMLPrettyPrint()\n",
      "\n",
      "html = []\n",
      "for ind, entry in enumerate(entries[:20]):\n",
      "    text = BeautifulSoup(entry['summary_detail']['value']).text\n",
      "    named_entities = process.process_text(text)\n",
      "    html.append(pprint_html.pprint_text(text, named_entities))\n",
      "HTML('<br/><br/>'.join(html))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<a href=\"http://dbpedia.org/resource/Matteo_Renzi\">Matteo Renzi</a> had the best showing of any <a href=\"http://dbpedia.org/resource/European\">European</a> leader in parliamentary voting, reaching a level no party has seen in an <a href=\"http://dbpedia.org/resource/Italian_election\">Italian election</a> since <a href=\"http://dbpedia.org/resource/1958\">1958</a>.<br/><br/><a href=\"http://dbpedia.org/resource/Satellite\">Satellite</a> data from <a href=\"http://dbpedia.org/resource/Malaysia_Airlines\">Malaysia Airlines</a> <a href=\"http://dbpedia.org/resource/Flight\">Flight</a> <a href=\"http://dbpedia.org/resource/370\">370</a> was released after pressure from relatives of the mostly <a href=\"http://dbpedia.org/resource/Chinese\">Chinese</a> passengers and from the <a href=\"http://dbpedia.org/resource/Chinese_government\">Chinese government</a>.<br/><br/><a href=\"http://dbpedia.org/resource/China\">China</a> and <a href=\"http://dbpedia.org/resource/Vietnam\">Vietnam</a> traded accusations over the sinking of a <a href=\"http://dbpedia.org/resource/Vietnamese\">Vietnamese</a> fishing vessel near a <a href=\"http://dbpedia.org/resource/Chinese\">Chinese</a> oil rig in disputed waters off <a href=\"http://dbpedia.org/resource/Vietnam\">Vietnam</a>\u2019s coast.<br/><br/>As tensions grow between <a href=\"http://dbpedia.org/resource/China\">China</a> and <a href=\"http://dbpedia.org/resource/Vietnam\">Vietnam</a> over the ramming and sinking of a <a href=\"http://dbpedia.org/resource/Vietnamese\">Vietnamese</a> fishing boat by a <a href=\"http://dbpedia.org/resource/Chinese\">Chinese</a> vessel, reaction in <a href=\"http://dbpedia.org/resource/China\">China</a> is overwhelmingly supportive of the incident.<br/><br/>The popular microblog of a professor at <a href=\"http://dbpedia.org/resource/Peking_University\">Peking University</a> has been blocked, perhaps because he posted a comment about the military suppression of the <a href=\"http://dbpedia.org/resource/1989\">1989</a> <a href=\"http://dbpedia.org/resource/Tiananmen\">Tiananmen</a> protest movement.<br/><br/><a href=\"http://dbpedia.org/resource/Dalia_Grybauskaite\">Dalia Grybauskaite</a>, <a href=\"http://dbpedia.org/resource/58\">58</a>, was re-elected president of <a href=\"http://dbpedia.org/resource/Lithuania\">Lithuania</a> on <a href=\"http://dbpedia.org/resource/Sunday\">Sunday</a> after beating the <a href=\"http://dbpedia.org/resource/Social_Democrat\">Social Democrat</a> candidate in a runoff.<br/><br/>Ten fishermen were rescued after their boat was struck by a <a href=\"http://dbpedia.org/resource/Chinese\">Chinese</a> vessel in the disputed waters in the <a href=\"http://dbpedia.org/resource/South_China_Sea\">South China Sea</a>, state news media reported.<br/><br/>The general contractor that helped oversee the construction of the <a href=\"http://dbpedia.org/resource/Abu_Dhabi\">Abu Dhabi</a> campus is run by a trustee of <a href=\"http://dbpedia.org/resource/N_Y\">N.Y</a>.<a href=\"http://dbpedia.org/resource/U\">U</a>.\u2019s board.<br/><br/>The <a href=\"http://dbpedia.org/resource/Malawi\">Malawi</a> <a href=\"http://dbpedia.org/resource/Electoral_Commission\">Electoral Commission</a> must manually count the <a href=\"http://dbpedia.org/resource/20\">20</a> presidential election votes, a senior official said <a href=\"http://dbpedia.org/resource/Monday\">Monday</a>. 
It is likely to take two months before an official result is announced.<br/><br/>An express train plowed into a parked freight train in northern <a href=\"http://dbpedia.org/resource/India\">India</a> killing at least <a href=\"http://dbpedia.org/resource/40\">40</a> people and reducing cars to twisted metal, officials said.<br/><br/><a href=\"http://dbpedia.org/resource/United_Nations\">United Nations</a> officials demanded a halt to the expulsion of thousands of citizens of the <a href=\"http://dbpedia.org/resource/Democratic_Republic\">Democratic Republic</a> of <a href=\"http://dbpedia.org/resource/Congo\">Congo</a> from the <a href=\"http://dbpedia.org/resource/Congo_Republic\">Congo Republic</a>.<br/><br/><a href=\"http://dbpedia.org/resource/Flooding\">Flooding</a> in southern <a href=\"http://dbpedia.org/resource/China\">China</a> forced almost half a million people from their homes, the government said <a href=\"http://dbpedia.org/resource/Monday\">Monday</a>.<br/><br/><a href=\"http://dbpedia.org/resource/Residents\">Residents</a> of <a href=\"http://dbpedia.org/resource/Castrillo_Matajud%C3%ADos\">Castrillo Matajud\u00edos</a>, which loosely translates as Little <a href=\"http://dbpedia.org/resource/Fort\">Fort</a> of <a href=\"http://dbpedia.org/resource/Jew\">Jew</a> <a href=\"http://dbpedia.org/resource/Killers\">Killers</a>, have voted to change the name of their village.<br/><br/><a href=\"http://dbpedia.org/resource/Pfizer\">Pfizer</a> said it \u201cdoes not intend to make an offer for <a href=\"http://dbpedia.org/resource/AstraZeneca\">AstraZeneca</a>\u201d in the wake of <a href=\"http://dbpedia.org/resource/AstraZeneca\">AstraZeneca</a>\u2019s rejection of a $<a href=\"http://dbpedia.org/resource/119\">119</a> billion bid.<br/><br/><a href=\"http://dbpedia.org/resource/United_States\">United States</a> <a href=\"http://dbpedia.org/resource/Special_Operations\">Special Operations</a> troops are forming elite units in <a href=\"http://dbpedia.org/resource/Libya\">Libya</a>, <a href=\"http://dbpedia.org/resource/Niger\">Niger</a>, <a href=\"http://dbpedia.org/resource/Mauritania\">Mauritania</a> and <a href=\"http://dbpedia.org/resource/Mali\">Mali</a> in the war against <a href=\"http://dbpedia.org/resource/Al_Qaeda\">Al Qaeda</a>\u2019s affiliates in <a href=\"http://dbpedia.org/resource/Africa\">Africa</a>.<br/><br/><a href=\"http://dbpedia.org/resource/The_military\">The military</a> commander said he would not disclose further details out of concern for the girls\u2019 safety.<br/><br/>In <a href=\"http://dbpedia.org/resource/France\">France</a>, <a href=\"http://dbpedia.org/resource/Britain\">Britain</a> and elsewhere, voters turned against traditional parties to support anti-immigrant groups opposed to the <a href=\"http://dbpedia.org/resource/European_Union\">European Union</a>.<br/><br/>When the <a href=\"http://dbpedia.org/resource/World_Cup\">World Cup</a> begins next month, many of the $<a href=\"http://dbpedia.org/resource/1\">1</a>.<a href=\"http://dbpedia.org/resource/4\">4</a> billion-worth of projects in <a href=\"http://dbpedia.org/resource/Cuiab%C3%A1\">Cuiab\u00e1</a>, one of the <a href=\"http://dbpedia.org/resource/12\">12</a> <a href=\"http://dbpedia.org/resource/Brazilian\">Brazilian</a> host cities, will be far from completion.<br/><br/><a href=\"http://dbpedia.org/resource/Narendra_Modi\">Narendra Modi</a> was sworn in on <a href=\"http://dbpedia.org/resource/Monday\">Monday</a> as <a href=\"http://dbpedia.org/resource/India\">India</a>\u2019s 
prime minister. Among those attending the ceremony was <a href=\"http://dbpedia.org/resource/Prime_Minister\">Prime Minister</a> <a href=\"http://dbpedia.org/resource/Nawaz_Sharif\">Nawaz Sharif</a> of <a href=\"http://dbpedia.org/resource/Pakistan\">Pakistan</a>, hinting that the two countries may revive a moribund peace process.<br/><br/><a href=\"http://dbpedia.org/resource/Ukrainian\">Ukrainian</a> soldiers used fighter jets and fought a ground battle against the pro-Russian forces, who had seized the airport in <a href=\"http://dbpedia.org/resource/Donetsk\">Donetsk</a> after elections that seemed to marginalize them."
       ],
       "output_type": "pyout",
       "prompt_number": 25,
       "text": [
        "<IPython.core.display.HTML at 0x445f450>"
       ]
      }
     ],
     "prompt_number": 25
    }
   ],
   "metadata": {}
  }
 ]
}