Simon Chabot <simon.chabot@logilab.fr> [Mon, 29 Oct 2012 10:17:24 +0100] rev 57
[Minhashing] Trained data can be saved and loaded
Simon Chabot <simon.chabot@logilab.fr> [Mon, 29 Oct 2012 09:52:16 +0100] rev 56
[distances] The planet radius for the geographical distance can be given
Simon Chabot <simon.chabot@logilab.fr> [Mon, 29 Oct 2012 09:46:26 +0100] rev 55
[distances] Don't concatenate two strings to check whether one of them contains a space. Make two tests instead
Simon Chabot <simon.chabot@logilab.fr> [Fri, 26 Oct 2012 11:44:50 +0200] rev 54
[typo] alignement --> alignment
Simon Chabot <simon.chabot@logilab.fr> [Fri, 26 Oct 2012 11:43:57 +0200] rev 53
[Matrix] Maximum distance computation delegated to max() method of numpy.matrix
Simon Chabot <simon.chabot@logilab.fr> [Fri, 26 Oct 2012 11:41:27 +0200] rev 52
[Minhash] Tests written
Simon Chabot <simon.chabot@logilab.fr> [Fri, 26 Oct 2012 09:49:07 +0200] rev 51
[Matrix] Improve the matched() method. Don't iterate over all indices; ask scipy directly for the wanted ones.
Simon Chabot <simon.chabot@logilab.fr> [Fri, 26 Oct 2012 09:54:16 +0200] rev 50
[Matrix] Replace lil_matrix with a dense matrix. Some experiments showed the matrix wasn't sparse at all, so it's better to use the appropriate data structure. *** [Matrix] Why use the todense() method if the matrix is… already dense
Simon Chabot <simon.chabot@logilab.fr> [Fri, 26 Oct 2012 13:43:08 +0200] rev 49
[Distance] Add geographical distance (Equirectangular projection)
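For reference, the equirectangular approximation named in this changeset can be sketched as follows. This is a minimal illustration and not the code from the commit: the function name `geographical`, the `(latitude, longitude)` tuple layout in degrees, the `planet_radius` parameter and its default value are all assumptions, and longitude wrap-around at ±180° is not handled.

```python
from math import cos, radians, sqrt


def geographical(pointa, pointb, planet_radius=6371000.0):
    """Approximate distance between two (latitude, longitude) points.

    Hypothetical signature: points are in degrees, the result is in the
    same unit as planet_radius (default: approximate mean Earth radius
    in metres).
    """
    lat_a, lon_a = map(radians, pointa)
    lat_b, lon_b = map(radians, pointb)
    # Equirectangular projection: scale the longitude difference by the
    # cosine of the mean latitude, then apply Pythagoras.
    x = (lon_b - lon_a) * cos((lat_a + lat_b) / 2)
    y = lat_b - lat_a
    return planet_radius * sqrt(x * x + y * y)
```

The approximation is cheap to compute and reasonable for points that are close to each other; it degrades for long distances and near the poles.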
Simon Chabot <simon.chabot@logilab.fr> [Thu, 25 Oct 2012 16:43:50 +0200] rev 48
[testing] Make the alignment testing a little cleaner. Now use rdflib instead of (bad) homemade code
Simon Chabot <simon.chabot@logilab.fr> [Thu, 25 Oct 2012 16:41:53 +0200] rev 47
[Matrix] Adapt the globalalignmentmatrix computation to the previous changeset (4c2f7553490b)
Simon Chabot <simon.chabot@logilab.fr> [Thu, 25 Oct 2012 16:40:42 +0200] rev 46
[Matrix] Give weighting and normalization at the construction of the matrix. It makes the computation faster
Simon Chabot <simon.chabot@logilab.fr> [Wed, 24 Oct 2012 19:10:54 +0200] rev 45
wip
Simon Chabot <simon.chabot@logilab.fr> [Wed, 24 Oct 2012 15:39:38 +0200] rev 44
[Minhashing] Really, really, really faster training
Simon Chabot <simon.chabot@logilab.fr> [Wed, 24 Oct 2012 14:51:09 +0200] rev 43
[Matrix] Compute the global alignment matrix
Simon Chabot <simon.chabot@logilab.fr> [Wed, 24 Oct 2012 14:50:23 +0200] rev 42
[Matrix] Can pass extra arguments to distance functions
Simon Chabot <simon.chabot@logilab.fr> [Wed, 24 Oct 2012 12:37:56 +0200] rev 41
[Matrix] Computation handles unknown values
Simon Chabot <simon.chabot@logilab.fr> [Wed, 24 Oct 2012 11:51:57 +0200] rev 40
[Test] Updated for the change to simplify()
Simon Chabot <simon.chabot@logilab.fr> [Wed, 24 Oct 2012 11:43:11 +0200] rev 39
Simon Chabot <simon.chabot@logilab.fr> [Wed, 24 Oct 2012 11:52:22 +0200] rev 38
[normalize] Add a stopword-removal option to simplify()
Simon Chabot <simon.chabot@logilab.fr> [Wed, 24 Oct 2012 11:29:08 +0200] rev 37
[Minhashing] Functional API
Simon Chabot <simon.chabot@logilab.fr> [Mon, 22 Oct 2012 18:17:17 +0200] rev 36
[minhashing] First try of minhashing (related to #129000)
Simon Chabot <simon.chabot@logilab.fr> [Fri, 19 Oct 2012 18:22:14 +0200] rev 35
Simon Chabot <simon.chabot@logilab.fr> [Fri, 19 Oct 2012 17:27:40 +0200] rev 34
Simon Chabot <simon.chabot@logilab.fr> [Fri, 19 Oct 2012 16:57:40 +0200] rev 33
[Matrix] Don't store inputs inside the DistanceMatrix object. Storing them made no sense, was useless, and got in the way of future summation of matrices.
Simon Chabot <simon.chabot@logilab.fr> [Fri, 19 Oct 2012 14:44:26 +0200] rev 32
[Distance] Spaces are correctly supported by distance functions
  The problem was that 'Victor Hugo' and 'Hugo Victor' had a large distance, when in this case we would like a small one (even zero!).
  So the approach followed was to construct a distance matrix:

             | Victor | Hugo
      Victor |   0    |  5
      Hugo   |   5    |  0

  and return the minimum of the minimum of each row.
  In fact, we return the maximum of the row minima of the previous matrix and of its transpose, to handle the following case:

             | Victor | Hugo | Jean
      Victor |   0    |  5   |  6
      Hugo   |   5    |  0   |  4
      --> max of the row minima: 0

             | Victor | Hugo
      Victor |   0    |  5
      Hugo   |   5    |  0
      Jean   |   6    |  4
      --> max of the row minima: 4

  Return the max, i.e. 4.
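The max-of-minima rule described in this changeset can be sketched as below. This is an illustrative reading of the commit message, not the committed code: the names `levenshtein` and `token_distance`, the whitespace tokenization, and the fallback for empty inputs are assumptions. Taking the column minima of the matrix is the same as taking the row minima of its transpose.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                      # deletion
                               current[j - 1] + 1,                   # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def token_distance(stra, strb, distance=levenshtein):
    """Max of the row minima and column minima of the token distance matrix."""
    tokens_a, tokens_b = stra.split(), strb.split()
    if not tokens_a or not tokens_b:
        # Assumption: with no tokens on one side, compare the raw strings.
        return distance(stra, strb)
    matrix = [[distance(ta, tb) for tb in tokens_b] for ta in tokens_a]
    row_minima = [min(row) for row in matrix]
    col_minima = [min(col) for col in zip(*matrix)]  # rows of the transpose
    return max(row_minima + col_minima)
```

With these definitions, `token_distance('Victor Hugo', 'Hugo Victor')` is 0, while `token_distance('Victor Hugo', 'Victor Hugo Jean')` is 4, because the unmatched token 'Jean' is at best 4 edits away from 'Hugo'.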
Simon Chabot <simon.chabot@logilab.fr> [Fri, 19 Oct 2012 11:38:54 +0200] rev 31
[Matrix] Cannot use zip if it is not a square matrix!
Simon Chabot <simon.chabot@logilab.fr> [Fri, 19 Oct 2012 11:21:44 +0200] rev 30
[Matrix] Matched() returns indexes and values, as tuples
Simon Chabot <simon.chabot@logilab.fr> [Fri, 19 Oct 2012 10:46:14 +0200] rev 29
[Matrix] Matched() has a lower complexity
  Instead of reading the whole matrix to get the indexes where the value is under the cutoff (O(N²)), the idea is:
  - Get all indexes where the value is not null (i.e. not an exact match): O(N)
  - Append all indexes that are not in the previous list (exact matches): O(N)
  - If cutoff > 0, for each index where the value is not null, test whether value < cutoff, and add it or not: O(N)
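The three steps listed in this changeset can be sketched as follows. This is only an illustration under stated assumptions: a plain dict `{(row, col): value}` holding the non-null entries stands in for the sparse matrix, and the signature `matched(nonzero_values, shape, cutoff)` is hypothetical. It shows the order of the steps and makes no claim about reproducing the complexity figures quoted in the message.

```python
from itertools import product


def matched(nonzero_values, shape, cutoff=0):
    """Return [((row, col), value), ...] for exact and near matches.

    nonzero_values: dict mapping (row, col) to a non-null distance
    (hypothetical stand-in for a sparse matrix); shape: (rows, cols).
    """
    # Step 1: indexes whose value is not null (not exact matches).
    nonzero = set(nonzero_values)
    # Step 2: every index absent from that set is an exact match.
    result = [(index, 0)
              for index in product(range(shape[0]), range(shape[1]))
              if index not in nonzero]
    # Step 3: only when a cutoff is given, keep the non-null values under it.
    if cutoff > 0:
        result.extend((index, value)
                      for index, value in sorted(nonzero_values.items())
                      if value < cutoff)
    return result
```

For a 2x2 matrix whose only non-null entries are 5 at (0, 1) and 3 at (1, 0), a cutoff of 4 keeps the two exact matches on the diagonal plus the entry at (1, 0).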
Simon Chabot <simon.chabot@logilab.fr> [Fri, 19 Oct 2012 16:58:27 +0200] rev 28
[Matrix] Matrices are not symmetric. Correct it
Simon Chabot <simon.chabot@logilab.fr> [Fri, 19 Oct 2012 10:06:50 +0200] rev 27
[Test] Ensure distances are symmetric
Simon Chabot <simon.chabot@logilab.fr> [Thu, 18 Oct 2012 19:01:27 +0200] rev 26
Simon Chabot <simon.chabot@logilab.fr> [Thu, 18 Oct 2012 18:18:40 +0200] rev 25
[Test] Tests no longer inherit from CWTestCase but directly from unittest2 (to be independent)
Simon Chabot <simon.chabot@logilab.fr> [Thu, 18 Oct 2012 18:15:25 +0200] rev 24
[Normalizer] Lemmatizer returns a string (better for comparison)
Simon Chabot <simon.chabot@logilab.fr> [Thu, 18 Oct 2012 17:58:53 +0200] rev 23
[Matrix] Enables normalization
Simon Chabot <simon.chabot@logilab.fr> [Thu, 18 Oct 2012 17:58:28 +0200] rev 22
[Test] Distance, test euclidean distance on strings too
Simon Chabot <simon.chabot@logilab.fr> [Thu, 18 Oct 2012 17:16:44 +0200] rev 21
[normalize] Using a class was a bad idea, I removed it
Simon Chabot <simon.chabot@logilab.fr> [Thu, 18 Oct 2012 16:35:18 +0200] rev 20
[Matrix] Compute a distance matrix
Simon Chabot <simon.chabot@logilab.fr> [Thu, 18 Oct 2012 15:01:47 +0200] rev 19
[Distance] Exchange 1 and 0 in the soundex distance, because we try to minimize the distance
Simon Chabot <simon.chabot@logilab.fr> [Thu, 18 Oct 2012 13:55:25 +0200] rev 18
[Distance] Temporal distance supports ambiguity and fuzziness
  - You can specify whether the day or the year comes first (day/month/year, year/month/day or month/day/year format). By default, it assumes the common French format, i.e. day/month/year.
  - You can give fuzzy sentences and compare dates: temporal('Jean est né le 1er octobre 1958', 'Le 01-10-1958, Jean est né') yields 0!
Simon Chabot <simon.chabot@logilab.fr> [Thu, 18 Oct 2012 12:22:30 +0200] rev 17
[Normalizer] Add the format method (related to #128998)
Simon Chabot <simon.chabot@logilab.fr> [Thu, 18 Oct 2012 11:41:59 +0200] rev 16
Add LGPL to distance.py and normalize.py
Simon Chabot <simon.chabot@logilab.fr> [Thu, 18 Oct 2012 10:19:15 +0200] rev 15
[Normalizer] Add a normalizer (related to #128998)
  - unormalize
  - tokenize
  - lemmatize
  - round
Simon Chabot <simon.chabot@logilab.fr> [Wed, 17 Oct 2012 16:47:07 +0200] rev 14
[distances] Add a Euclidean distance function between two numbers (closes #128982)
Simon Chabot <simon.chabot@logilab.fr> [Wed, 17 Oct 2012 15:31:11 +0200] rev 13
[distance] Add a new distance between dates (related #128982)
Simon Chabot <simon.chabot@logilab.fr> [Wed, 17 Oct 2012 12:34:28 +0200] rev 12
[distances] Add jaccard distance (related #128982)
Simon Chabot <simon.chabot@logilab.fr> [Wed, 17 Oct 2012 12:05:02 +0200] rev 11
[distances] Add the soundex distance (related #128982)
Simon Chabot <simon.chabot@logilab.fr> [Wed, 17 Oct 2012 12:04:41 +0200] rev 10
[distance] Move soundex to soundexcode (related #128982). soundexcode is the function returning the soundex code of a word, and soundex will be the 1/0 distance between two words (1 meaning both have the same code, 0 otherwise).
Simon Chabot <simon.chabot@logilab.fr> [Wed, 17 Oct 2012 11:56:22 +0200] rev 9
[distances] Remove some trailing spaces (related #128982)
Simon Chabot <simon.chabot@logilab.fr> [Wed, 17 Oct 2012 11:56:04 +0200] rev 8
[distance] Start the iteration at 1, not 0 because we don't care about the first letter (related #128982)
Simon Chabot <simon.chabot@logilab.fr> [Wed, 17 Oct 2012 11:55:23 +0200] rev 7
[distances] Correct an IndexError in the soundex code (related #128982). Since we don't know whether word[i + 2] is a consonant, we use get, because it can be a vowel (and crash…)
Simon Chabot <simon.chabot@logilab.fr> [Wed, 17 Oct 2012 11:53:17 +0200] rev 6
[distance] Soundex: if there is a vowel between two identically numbered consonants, count those consonants twice. (related #128982)
Simon Chabot <simon.chabot@logilab.fr> [Wed, 17 Oct 2012 11:51:42 +0200] rev 5
[test] Add some other tests for soundex, and some explanations too (related #128982)
Simon Chabot <simon.chabot@logilab.fr> [Wed, 17 Oct 2012 11:50:55 +0200] rev 4
[test] The test for soundex was wrong; I corrected it (related #128982)
Simon Chabot <simon.chabot@logilab.fr> [Wed, 17 Oct 2012 16:43:42 +0200] rev 3
[tests] Add tests for soundex and levenshtein (related #128982)
Simon Chabot <simon.chabot@logilab.fr> [Wed, 17 Oct 2012 16:42:48 +0200] rev 2
[distances] Add soundex code (related #128982)
Simon Chabot <simon.chabot@logilab.fr> [Wed, 17 Oct 2012 16:40:51 +0200] rev 1
[distance] add Levenshtein distance (related #128982)
Simon Chabot <simon.chabot@logilab.fr> [Fri, 12 Oct 2012 10:23:58 +0200] rev 0
Initial commit