How to get rid of punctuation using NLTK tokenizer?

Question

I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also, word_tokenize doesn't work with multiple sentences: dots are added to the last word.

User · Answer

Just adding to the solution by @rmalouf: this will not include any numbers, because \w is equivalent to [a-zA-Z0-9_] (for ASCII text), whereas the pattern below matches letters only.

    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
    tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
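The difference between the two patterns can be checked without NLTK. This is a minimal sketch, assuming RegexpTokenizer in its default (non-gap) mode returns the same matches as re.findall with the same pattern; the sample sentence is made up for illustration.

```python
import re

# Made-up sample containing digits and an underscore
text = "Route_66 has 2 lanes, allegedly."

# \w+ keeps letters, digits and underscores together in one token
word_chars = re.findall(r"\w+", text)

# [a-zA-Z]+ keeps ASCII letters only, so digits and underscores are dropped
letters_only = re.findall(r"[a-zA-Z]+", text)

print(word_chars)    # ['Route_66', 'has', '2', 'lanes', 'allegedly']
print(letters_only)  # ['Route', 'has', 'lanes', 'allegedly']
```

Note that the letters-only pattern also drops accented letters, so it is probably only suitable for plain English text.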

User · Answer

I just used the following code, which removed all the punctuation:

    tokens = nltk.wordpunct_tokenize(raw)
    type(tokens)

    text = nltk.Text(tokens)
    type(text)

    words = [w.lower() for w in text if w.isalpha()]

User · Answer

Below code will remove all punctuation marks as well as non-alphabetic characters. Copied from their book: http://www.nltk.org/book/ch01.html

    import nltk

    s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"

    words = nltk.word_tokenize(s)
    words = [word.lower() for word in words if word.isalpha()]

    print(words)

output:

    ['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']

User · Answer

I use this code to remove punctuation:

    import nltk

    def getTerms(sentences):
        tokens = nltk.word_tokenize(sentences)
        # isalnum() keeps tokens made of letters and/or digits only
        words = [w.lower() for w in tokens if w.isalnum()]
        print(tokens)
        print(words)

    getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")

And if you want to check whether a token is a valid English word or not, you may need PyEnchant. Tutorial:

    import enchant

    d = enchant.Dict("en_US")
    d.check("Hello")
    d.check("Helo")
    d.suggest("Helo")
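The choice between isalpha() and isalnum() decides whether numbers and mixed tokens survive. Here is a small sketch of the difference; the token list is written by hand for illustration and is not actual word_tokenize output.

```python
# Hand-written tokens (an assumption, not real tokenizer output)
tokens = ["hh", ",", "hh3h", ".", "wo", "shi", "2", "4", "A", "&", "B"]

# isalpha(): letters only, so "hh3h", "2" and "4" are dropped
alpha_only = [t.lower() for t in tokens if t.isalpha()]

# isalnum(): letters and digits, so only the punctuation tokens are dropped
alnum = [t.lower() for t in tokens if t.isalnum()]

print(alpha_only)  # ['hh', 'wo', 'shi', 'a', 'b']
print(alnum)       # ['hh', 'hh3h', 'wo', 'shi', '2', '4', 'a', 'b']
```

Either filter also drops tokens that mix letters and punctuation, such as n't.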

User · Answer

You can do it in one line without nltk (python 3.x):

    import string

    string_text = string_text.translate(str.maketrans('', '', string.punctuation))
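A runnable sketch of the same idea, wrapped in a helper so the translation table is built once. The function name strip_punctuation and the sample sentence are chosen here for illustration.

```python
import string

# Build the deletion table once; the third argument lists characters to delete
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)

cleaned = strip_punctuation("I can't do this now, because I'm so tired.")
print(cleaned)          # I cant do this now because Im so tired
print(cleaned.split())  # whitespace split gives a plain word list
```

Two caveats: apostrophes are removed too, so can't becomes cant, and string.punctuation covers ASCII only, so typographic quotes and similar characters are left in place.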

User · Answer

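Remove punctuation (this removes . as well, as part of the punctuation handling) using the code below:

    import sys
    import unicodedata
    from nltk.tokenize import word_tokenize

    # Map every Unicode code point whose category starts with 'P' (punctuation) to None
    tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
    text_string = text_string.translate(tbl)  # text_string no longer has punctuation
    w = word_tokenize(text_string)  # now tokenize the string

Sample Input/Output:

    direct flat in oberoi esquire, 3 bhk 2195 saleable 1330 carpet, rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni

    ['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']

Note that decimal points are removed as well (3.89 becomes 389). The sample output also lacks words such as in, of, with and and, which suggests stop words were filtered in a step not shown here.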
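Building a table over the whole Unicode range takes a noticeable moment. A lighter variant, sketched here as an alternative rather than as the answer's own method, checks the category of each character in the input instead. The function name and the sample string are illustrative.

```python
import unicodedata

def strip_unicode_punctuation(text):
    """Drop every character whose Unicode category starts with 'P'."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

# Curly quotes (\u201c, \u201d) and the ellipsis (\u2026) are punctuation;
# '$' and '+' are symbols (categories Sc and Sm), so they are kept.
sample = "\u201cflat cost 3.89 cr\u201d, 1% rise\u2026 $5+tax"
print(strip_unicode_punctuation(sample))  # flat cost 389 cr 1 rise $5+tax
```

Unlike string.punctuation, this handles non-ASCII punctuation, but it leaves currency and math symbols alone because Unicode does not classify them as punctuation.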

User · Answer

I think you need some sort of regular expression matching (the following code is in Python 3):

    import string
    import re
    import nltk

    s = "I can't do this now, because I'm so tired.  Please give me some time."
    l = nltk.word_tokenize(s)
    ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]
    print(l)
    print(ll)

Output:

    ['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
    ['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']

Should work well in most cases, since it removes punctuation while preserving tokens like n't, which can't be obtained from regex tokenizers such as wordpunct_tokenize.
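The filtering step can be tried on its own without NLTK data. In this sketch the token list is written by hand (an assumption, loosely modelled on the output above), and re.escape is added as a precaution so that characters such as ], \ and ^ cannot change the meaning of the character class.

```python
import re
import string

# Matches tokens that consist of punctuation characters only
PUNCT_ONLY = re.compile("[" + re.escape(string.punctuation) + "]+")

# Hand-written tokens, including multi-character punctuation tokens
tokens = ["I", "ca", "n't", "do", "this", "now", ",", "...", "``", "tired", "."]

kept = [t for t in tokens if not PUNCT_ONLY.fullmatch(t)]
print(kept)  # ['I', 'ca', "n't", 'do', 'this', 'now', 'tired']
```

Because fullmatch requires the whole token to be punctuation, n't is kept while "..." and "``" are dropped.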

User · Answer

Take a look at the other tokenizing options that nltk provides here. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r'\w+')
    tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')

Output:

    ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']

User · Answer

Sincerely asking, what is a word? If your assumption is that a word consists of alphabetic characters only, you are wrong, since words such as can't will be broken into pieces (such as can and t) if you remove punctuation before tokenisation, which is very likely to affect your program negatively.

Hence the solution is to tokenise and then remove punctuation tokens.

    import string

    from nltk.tokenize import word_tokenize

    tokens = word_tokenize("I'm a southern salesman.")
    # ['I', "'m", 'a', 'southern', 'salesman', '.']

    tokens = list(filter(lambda token: token not in string.punctuation, tokens))
    # ['I', "'m", 'a', 'southern', 'salesman']

...and then, if you wish, you can replace certain tokens, such as 'm with am.
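The filter-then-replace idea might look like the sketch below. The CONTRACTIONS mapping and the clean function are hypothetical and deliberately tiny, and the token list is copied from the example above rather than produced by a tokenizer call.

```python
import string

# Hypothetical mapping for illustration; a real one would need many more entries
CONTRACTIONS = {"'m": "am", "n't": "not", "'re": "are"}

def clean(tokens):
    """Drop punctuation-only tokens, then expand known contraction tokens."""
    out = []
    for tok in tokens:
        # A token is dropped only if every character in it is punctuation
        if tok and all(ch in string.punctuation for ch in tok):
            continue
        out.append(CONTRACTIONS.get(tok, tok))
    return out

tokens = ["I", "'m", "a", "southern", "salesman", "."]
print(clean(tokens))  # ['I', 'am', 'a', 'southern', 'salesman']
```

Checking every character, instead of testing whether the token is a substring of string.punctuation, also catches multi-character tokens such as "..." that are not substrings of that constant. Some expansions stay ambiguous: 's can mean is or a possessive, and the ca in ca + n't would need separate handling.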

User · Answer

You do not really need NLTK to remove punctuation. You can remove it with simple python. For strings (this two-argument form works on Python 2 str only):

    import string

    s = '... some string with punctuation ...'
    s = s.translate(None, string.punctuation)

Or for unicode (the dictionary form also works for str in Python 3):

    import string

    translate_table = dict((ord(char), None) for char in string.punctuation)
    s.translate(translate_table)

and then use this string in your tokenizer.

P.S. The string module has some other sets of elements that can be removed (like digits).
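As an illustration of the P.S., here is a Python 3 sketch that removes digits along with punctuation and then collapses the leftover whitespace. The sample string is made up.

```python
import string

# Delete both punctuation and digits in a single pass
table = str.maketrans("", "", string.punctuation + string.digits)

raw = "Call 911, now!"
stripped = raw.translate(table)

# Removing characters can leave doubled spaces; split/join normalizes them
normalized = " ".join(stripped.split())

print(repr(stripped))    # 'Call  now'
print(repr(normalized))  # 'Call now'
```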

User · Answer

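As noticed in comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). And if you have unicode strings (Python 2), make sure that it is a unicode object, not a str encoded with some encoding like 'utf-8'.

    from nltk.tokenize import word_tokenize, sent_tokenize

    text = '''It is a blue, small, and extraordinary ball. Like no other'''
    # Split into sentences first, then tokenize each sentence into words
    tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
    # In Python 3, filter() returns an iterator, so wrap it in list() to print it
    print(list(filter(lambda word: word not in ',-', tokens)))

As written, the filter drops only the , and - tokens; extend the string (or use string.punctuation) to drop others, such as the period.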
