author Simon Chabot Thu, 04 Apr 2013 18:17:35 +0200 changeset 247 db9a8b3f6f16 parent 246 6d80b4e863f3 child 248 59a15b188628
[doc] Use sphinx roles and update the sample code (closes #119623)
```
--- a/doc.rst	Fri Jan 25 12:41:53 2013 +0100
+++ b/doc.rst	Thu Apr 04 18:17:35 2013 +0200
@@ -19,9 +19,9 @@
The alignment process is divided into three main steps:

1. Gather and format the data we want to align.
-   In this step, we define two sets called the ``alignset`` and the
-   ``targetset``. The ``alignset`` contains our data, and the
-   ``targetset`` contains the data on which we would like to make the links.
+   In this step, we define two sets called the `alignset` and the
+   `targetset`. The `alignset` contains our data, and the
+   `targetset` contains the data on which we would like to make the links.
2. Compute the similarity between the items gathered.  We compute a distance
matrix between the two sets according to a given distance.
3. Find the items having a high similarity thanks to the distance matrix.
@@ -29,9 +29,9 @@
Simple case
-----------

-1. Let's define ``alignset`` and ``targetset`` as simple python lists.
+1. Let's define `alignset` and `targetset` as simple python lists.

-.. code-block:: python
+.. sourcecode:: python

alignset = ['Victor Hugo', 'Albert Camus']
targetset = ['Albert Camus', 'Guillaume Apollinaire', 'Victor Hugo']
@@ -39,7 +39,7 @@
2. Now, we have to compute the similarity between each item. For that purpose, the
`Levenshtein distance <http://en.wikipedia.org/wiki/Levenshtein_distance>`_
[#]_, which is well suited to computing the distance between short strings, is used.
-   Such a function is provided in the ``nazca.distance`` module.
+   Such a function is provided in the `nazca.distance` module.

The next step is to compute the distance matrix according to the Levenshtein
distance. The result is given in the following tables.
@@ -70,7 +70,7 @@
three distance matrices will be computed, and *items* will be represented as
nested lists. See the following example:

-.. code-block:: python
+.. sourcecode:: python

alignset = [['Paul Dupont', '14-08-1991', 'Paris'],
['Jacques Dupuis', '06-01-1999', 'Bressuire'],
@@ -84,13 +84,13 @@
In such a case, two distance functions are used: the Levenshtein one for the
name and the city, and a temporal one for the birth date [#]_.

-.. [#] Provided in the ``nazca.distances`` module.
+.. [#] Provided in the `nazca.distances` module.

-The ``cdist`` function of ``nazca.distances`` enables us to compute those
+The :func:`cdist` function of `nazca.distances` enables us to compute those
matrices:

-.. code-block:: python
+.. sourcecode:: python

>>> nazca.matrix.cdist([a[0] for a in alignset], [t[0] for t in targetset],
>>>                    'levenshtein', matrix_normalized=False)
@@ -108,7 +108,7 @@
| Edouard Michel | 6           | 0              | 6              | 6           |
+----------------+-------------+----------------+----------------+-------------+

-.. code-block:: python
+.. sourcecode:: python

>>> nazca.matrix.cdist([a[1] for a in alignset], [t[1] for t in targetset],
>>>                    'temporal', matrix_normalized=False)
@@ -126,7 +126,7 @@
| 18-04-1881 | 40294      | 0          | 42996      | 48074      |
+------------+------------+------------+------------+------------+

-.. code-block:: python
+.. sourcecode:: python

>>> nazca.matrix.cdist([a[2] for a in alignset], [t[2] for t in targetset],
>>>                    'levenshtein', matrix_normalized=False)
@@ -160,12 +160,12 @@

Allowing for some spelling mistakes (for example *Dupont* and *Dupond* are very
close), the matching threshold can be set to 1 or 2. Thus we can see that the
-item 0 in our ``alignset`` is the same that the item 0 in the ``targetset``, the
-1 in the ``alignset`` and the 2 of the ``targetset`` too : the links can be
+item 0 in our `alignset` is the same as item 0 in the `targetset`, and that
+item 1 in the `alignset` matches item 2 of the `targetset`: the links can be
made!

-It's important to notice that even if the item 0 of the ``alignset`` and the 3
-of the ``targetset`` have the same name and the same birthplace they are
+It's important to notice that even though item 0 of the `alignset` and item 3
+of the `targetset` have the same name and the same birthplace, they are
unlikely to be identical because their birth dates are very different.

@@ -180,7 +180,7 @@

Just before we start, we will assume the following imports have been done:

-.. code-block:: python
+.. sourcecode:: python

from nazca import dataio as aldio #Functions for input and output data
from nazca import distances as ald #Functions to compute the distances
@@ -199,7 +199,7 @@
dbpedia

We simply copy/paste the winners list from Wikipedia into a file and replace all
-the separators (``-`` and ``,``) by ``#``. So, the beginning of our file is :
+the separators (`-` and `,`) with `#`. So the beginning of our file is:

..

@@ -214,13 +214,13 @@
as *identifier* (we don't have a *URI* here as identifier) and *attribute* to align.
This is expressed in Python with the following code:

-.. code-block:: python
+.. sourcecode:: python

alignset = aldio.parsefile('prixgoncourt', indexes=[1, 1], delimiter='#')

-So, the beginning of our ``alignset`` is:
+So, the beginning of our `alignset` is:

-.. code-block:: python
+.. sourcecode:: python

>>> alignset[:3]
[[u'John-Antoine Nau', u'John-Antoine Nau'],
@@ -228,10 +228,10 @@
[u'Claude Farrère', u'Claude Farrère']]

-Now, let's build the ``targetset`` thanks to a *sparql query* and the dbpedia
+Now, let's build the `targetset` using a *SPARQL query* and the dbpedia
end-point:

-.. code-block:: python
+.. sourcecode:: python

query = """
SELECT ?writer, ?name WHERE {
@@ -247,13 +247,13 @@
Python dictionary where the keys are the columns to work on, and the values are
the treatments to apply.

-.. code-block:: python
+.. sourcecode:: python

treatments = {1: {'metric': ald.levenshtein}} #Use a levenshtein on the name

-Finally, the last thing we have to do, is to call the ``align`` function:
+Finally, the last thing we have to do is to call the :func:`alignall` function:

-.. code-block:: python
+.. sourcecode:: python

alignments = ala.alignall(alignset, targetset,
0.4, #This is the matching threshold
@@ -264,7 +264,7 @@

This function returns an iterator over the different alignments carried out.

-.. code-block:: python
+.. sourcecode:: python

for a, t in alignments:
print '%s has been aligned onto %s' % (a, t)
@@ -272,15 +272,15 @@
It may be important to apply some pre-treatment on the data to align. For
instance, names can be written with lower or upper characters, with extra
characters as punctuation or unwanted information in parenthesis and so on. That
-is why we provide some functions to `normalize` your data. The most useful may
-be the `simplify()` function (see the docstring for more information). So the
+is why we provide some functions to ``normalize`` your data. The most useful may
+be the :func:`simplify` function (see the docstring for more information). So the
treatments list can be given as follows:

-.. code-block:: python
+.. sourcecode:: python

def remove_after(string, sub):
-        """ Remove the text after ``sub`` in ``string``
+        """ Remove the text after `sub` in `string`
>>> remove_after('I like cats and dogs', 'and')
'I like cats'
>>> remove_after('I like cats and dogs', '(')
@@ -302,7 +302,7 @@
Cities alignment
----------------

-The previous case with the `Goncourt prize winners` was pretty simply because
+The previous case with the ``Goncourt prize winners`` was pretty simple because
the number of items was small, and the computation fast. But in a more realistic
use case, the number of items to align may be huge (thousands or even millions…). In
such a case it's unthinkable to build the global alignment matrix because it
@@ -315,7 +315,7 @@

This is done by the following python code:

-.. code-block:: python
+.. sourcecode:: python

targetset = aldio.rqlquery('http://demo.cubicweb.org/geonames',
'Any U, N, LONG, LAT WHERE X is Location, X name'
@@ -362,29 +362,29 @@
So, in the next step, we define the treatments to apply.
It is the same as before, but we ask for a non-normalized matrix
(i.e. the raw output of the Levenshtein distance).
-Thus, we call the ``alignall`` function. ``indexes`` is a tuple saying the
-position of the point on which the kdtree_ must be built, ``mode`` is the mode
+Thus, we call the :func:`alignall` function. `indexes` is a tuple giving the
+position of the point on which the kdtree_ must be built, `mode` is the mode
used to find neighbours [#]_.

-Finally, ``uniq`` ask to the function to return the best
+Finally, `uniq` asks the function to return the best
candidate (i.e. the one having the shortest distance below the given threshold).

-.. [#] The available modes are ``kdtree``, ``kmeans`` and ``minibatch`` for
-       numerical data and ``minhashing`` for text one.
+.. [#] The available modes are `kdtree`, `kmeans` and `minibatch` for
+       numerical data and `minhashing` for textual data.

The function outputs a generator yielding tuples where the first element is the
-identifier of the ``alignset`` item and the second is the ``targetset`` one (It
+identifier of the `alignset` item and the second is the `targetset` one (it
may take some time before the first tuples are yielded, because all the
computation must be done first…).

.. _kdtree: http://en.wikipedia.org/wiki/K-d_tree

-The `alignall_iterative` and `cache` usage
+The :func:`alignall_iterative` and `cache` usage
==========================================

-Even when using methods such as ``kdtree`` or ``minhashing`` or ``clustering``,
+Even when using methods such as `kdtree` or `minhashing` or `clustering`,
the alignment process might be long. That’s why we provide a function,
-called `alignall_iterative` which works directly with your files. The idea
+called :func:`alignall_iterative` which works directly with your files. The idea
behind this function is simple: it splits your files (the `alignfile` and the
`targetfile`) into smaller ones and tries to align each item of each subset.
When processing, if an alignment is estimated almost perfect
@@ -395,7 +395,15 @@
stored into the cache, and if a *better* alignment is found later, the
cache is updated. At the end, you get only the best alignment found.

-.. code-block:: python
+.. sourcecode:: python
+
+    from difflib import SequenceMatcher
+
+    from nazca import normalize as aln #Functions to normalize data
+    from nazca import aligner as ala  #Functions to align data
+
+    def approxMatch(x, y):
+        return 1.0 - SequenceMatcher(None, x, y).ratio()

alignformat = {'indexes': [0, 3, 2],
'formatopt': {0: lambda x:x.decode('utf-8'),
@@ -404,10 +412,10 @@
},
}

-    targetformat = {'indexes': [0, 3, 2],
+    targetformat = {'indexes': [0, 1, 3],
'formatopt': {0: lambda x:x.decode('utf-8'),
1: lambda x:x.decode('utf-8'),
-                                 2: lambda x:x.decode('utf-8'),
+                                 3: lambda x:x.decode('utf-8'),
},
}

@@ -438,7 +446,7 @@
fobj.write('%s\t%s\t%s\n' % (aligned, targeted, distance))

Roughly, this function expects the same arguments as the previously shown
-`alignall` function, excepting the `equality_threshold` and the `size`.
+:func:`alignall` function, except for `equality_threshold` and `size`.

- `size` is the number of items to have in each subset
- `equality_threshold` is the threshold above which two items are said as
```
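The three alignment steps the patched document describes (build the two sets, compute a distance matrix, link items below a threshold) can be sketched in plain standard-library Python. This is only an illustration: the `levenshtein` and `temporal` functions below are hand-rolled stand-ins for the metrics nazca ships in `nazca.distances`, not nazca's actual implementations, and the threshold matching is a simplified version of what `alignall` does.

```python
from datetime import datetime

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def temporal(d1, d2, fmt='%d-%m-%Y'):
    """Absolute number of days between two 'DD-MM-YYYY' dates."""
    return abs((datetime.strptime(d1, fmt) - datetime.strptime(d2, fmt)).days)

# Step 1: the two sets (same data as the document's simple case).
alignset = ['Victor Hugo', 'Albert Camus']
targetset = ['Albert Camus', 'Guillaume Apollinaire', 'Victor Hugo']

# Step 2: the distance matrix (what nazca's cdist computes for you).
matrix = [[levenshtein(a, t) for t in targetset] for a in alignset]

# Step 3: for each alignset item, keep the closest target below a threshold.
threshold = 2
links = [(a, targetset[row.index(min(row))])
         for a, row in zip(alignset, matrix)
         if min(row) <= threshold]
print(links)  # [('Victor Hugo', 'Victor Hugo'), ('Albert Camus', 'Albert Camus')]
```

For multi-column items such as `['Paul Dupont', '14-08-1991', 'Paris']`, the same pattern applies per column: one Levenshtein matrix for names, one `temporal` matrix for birth dates, combined before thresholding, which is exactly the role of the `treatments` dictionary in the document.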