[normalize] In simplify() replace punctuation by a space
author Simon Chabot <simon.chabot@logilab.fr>
Mon, 19 Nov 2012 10:37:47 +0100
changeset 144 71aa735b6e3f
parent 143 e538838ee124
child 145 3ab512c6bc1a
Punctuation used to be removed; it is now replaced by a space. This lets minhashing split the string into words, so that "ivry seine" and "ivry sur seine" land in the same bucket, whereas "ivrysurseine" and "ivryseine" did not.
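The effect described in the message can be illustrated with a small sketch. This is not the code from normalize.py: it assumes `punctuation` is `string.punctuation`, the helper names `strip_punct_old`, `strip_punct_new` and `jaccard` are hypothetical, and a Jaccard similarity over word sets stands in for the quantity that minhashing approximates.

```python
from string import punctuation  # assumption: the same constant normalize.py uses


def strip_punct_old(sentence):
    """Behaviour before the change: punctuation is simply dropped."""
    return ''.join(c for c in sentence.lower() if c not in punctuation)


def strip_punct_new(sentence):
    """Behaviour after the change: punctuation becomes a space."""
    cleaned = ''.join(c if c not in punctuation else ' '
                      for c in sentence.lower()).strip()
    return cleaned.replace('  ', ' ')


def jaccard(a, b):
    """Jaccard similarity of the word sets of two strings (illustrative only)."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


old_a = strip_punct_old("Ivry-sur-Seine")
old_b = strip_punct_old("Ivry-Seine")
new_a = strip_punct_new("Ivry-sur-Seine")
new_b = strip_punct_new("Ivry-Seine")

print(repr(old_a), repr(old_b), jaccard(old_a, old_b))
print(repr(new_a), repr(new_b), jaccard(new_a, new_b))
```

With the old behaviour each name collapses into a single token, so the two word sets share nothing. With the new behaviour they share "ivry" and "seine", which is what gives them a chance of ending up in the same bucket.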
normalize.py
--- a/normalize.py	Thu Nov 15 16:48:36 2012 +0100
+++ b/normalize.py	Mon Nov 19 10:37:47 2012 +0100
@@ -104,7 +104,10 @@
     if lemmas:
         sentence = lemmatized(sentence, lemmas)
     sentence = sentence.lower()
-    cleansent = ''.join([s for s in sentence if s not in punctuation])
+    cleansent = ''.join([s if s not in punctuation
+                           else ' ' for s in sentence]).strip()
+    #comma followed by a space is replaced by two spaces, keep only one
+    cleansent = cleansent.replace('  ', ' ')
 
     if not remove_stopwords:
         return cleansent
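One thing worth checking when reviewing the added lines: `str.replace('  ', ' ')` makes a single pass, so a run of three or more spaces (for example a comma, a space and an opening parenthesis in a row) is shortened but not reduced to one space. The sketch below compares the committed expression with a possible alternative based on `split()` and `join()`. The function names and the sample string are hypothetical, `punctuation` is assumed to be `string.punctuation`, and the alternative is a suggestion rather than anything present in this changeset.

```python
from string import punctuation  # assumption: the same constant normalize.py uses


def clean_as_in_changeset(sentence):
    """Mirrors the two lines added by this changeset."""
    cleansent = ''.join([s if s not in punctuation
                         else ' ' for s in sentence.lower()]).strip()
    return cleansent.replace('  ', ' ')


def clean_collapsing_runs(sentence):
    """Hypothetical alternative: collapse any run of whitespace to one space."""
    spaced = ''.join(s if s not in punctuation else ' '
                     for s in sentence.lower())
    # split() with no argument drops leading/trailing whitespace and
    # treats consecutive whitespace characters as a single separator.
    return ' '.join(spaced.split())


sample = "Ivry, (Seine)"
as_committed = clean_as_in_changeset(sample)
collapsed = clean_collapsing_runs(sample)

print(repr(as_committed))
print(repr(collapsed))
```

Note that `split()` without arguments also treats tabs and newlines as separators, so the alternative is slightly broader than the committed code; whether that matters depends on the inputs simplify() receives.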