[python] Stopword removal with NLTK

I am trying to process user-entered text by removing stopwords with the NLTK toolkit, but during stopword removal words like 'and', 'or', and 'not' get removed. I want these words to survive the stopword-removal process, since they are operators required for later processing the text as a query. I don't know which words can be operators in a text query, and I also want to remove unnecessary words from my text.

This question is related to: python, nlp, nltk, stop-words

Answers:


I suggest you create your own list of operator words that you take out of the stopword list. Sets can be conveniently subtracted, so:

from nltk.corpus import stopwords

operators = set(('and', 'or', 'not'))
stop = set(stopwords.words('english')) - operators

Then you can simply test whether a word is in the set, without depending on whether your operators happen to be part of the stopword list. You can later switch to another stopword list or add an operator.

if word.lower() not in stop:
    # use word
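For example (the sample query is my own), the operators survive filtering against the stop set defined above:

query = "the cat and not the dog"
print([word for word in query.split() if word.lower() not in stop])
# ['cat', 'and', 'not', 'dog']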

@alvas has a good answer. But again it depends on the nature of the task. For example, if in your application you want to treat all conjunctions (e.g. and, or, but, if, while) and all determiners (e.g. the, a, some, most, every, no) as stop words, while considering all other parts of speech legitimate, then you might want to look into this solution, which uses a part-of-speech tagset to discard words. Check Table 5.1:

import nltk

# universal tagset: 'DET' = determiner, 'CONJ' = conjunction
STOP_TYPES = ['DET', 'CONJ']

text = "some data here "
# pos_tag returns Penn Treebank tags ('DT', 'CC', ...) by default; mapping to
# the universal tagset makes 'DET' and 'CONJ' actually match
tokens = nltk.pos_tag(nltk.word_tokenize(text), tagset='universal')
good_words = [w for w, wtype in tokens if wtype not in STOP_TYPES]
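For example, with text = "the quick brown fox and a lazy dog" (my own sample sentence), the tagger labels 'the' and 'a' as DET and 'and' as CONJ, so good_words comes out as ['quick', 'brown', 'fox', 'lazy', 'dog']. Note that this requires the universal_tagset resource to be downloaded, and exact tags depend on the tagger model.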

There is a built-in stopword list in NLTK, made up of 2,400 stopwords for 11 languages (Porter et al.); see http://nltk.org/book/ch02.html

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
>>> sentence = "this is a foo bar sentence"
>>> print([i for i in sentence.lower().split() if i not in stop])
['foo', 'bar', 'sentence']
>>> [i for i in word_tokenize(sentence.lower()) if i not in stop] 
['foo', 'bar', 'sentence']

I also recommend looking into tf-idf for removing stopwords; see Effects of Stemming on the term frequency?
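As a sketch of that idea (this uses scikit-learn's TfidfVectorizer rather than NLTK, and the max_df threshold is an assumption you would tune):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "this is a foo bar sentence",
    "this is another foo sentence",
    "and this is a third sentence about bar",
]

# terms that appear in more than 85% of the documents behave like
# stopwords and are dropped automatically via max_df
vectorizer = TfidfVectorizer(max_df=0.85)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # surviving vocabulary
print(vectorizer.stop_words_)              # terms removed by max_df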


You can use string.punctuation together with the built-in NLTK stopword list:

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation

def tokenize(text):
    # split into sentences, then flatten the per-sentence word lists
    sents = sent_tokenize(text)
    return [word for sent in sents for word in word_tokenize(sent)]

def removeStopWords(words):
    customStopWords = set(stopwords.words('english') + list(punctuation))
    return [word for word in words if word not in customStopWords]

words = tokenize(text)  # text is your input string
wordsWOStopwords = removeStopWords(words)

NLTK stopwords complete list


@alvas's answer does the job, but it can be done way faster. Assuming that you have documents, a list of strings:

from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}'])  # remove this line if you need punctuation

for doc in documents:
    list_of_words = [i.lower() for i in wordpunct_tokenize(doc) if i.lower() not in stop_words]

Notice that because you are searching in a set (not in a list), lookups take constant time instead of scanning the list; in theory that is roughly len(stop_words)/2 times faster on average, which is significant if you need to operate on many documents.

For 5000 documents of approximately 300 words each, the difference is 1.8 seconds for my example versus 20 seconds for @alvas's.
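If you want to verify the set-vs-list difference yourself, here is a rough micro-benchmark (my own sketch; absolute numbers will vary by machine):

import timeit

setup = """
from nltk.corpus import stopwords
stop_list = stopwords.words('english')
stop_set = set(stop_list)
"""

# 'information' is not a stopword, so the list lookup scans the whole list
print(timeit.timeit("'information' in stop_list", setup=setup, number=100000))
print(timeit.timeit("'information' in stop_set", setup=setup, number=100000))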

P.S. In most cases you need to divide the text into words in order to perform some other task, such as classification, for which tf-idf is used. So most probably it would be better to use a stemmer as well:

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

and to use [porter.stem(i.lower()) for i in wordpunct_tokenize(doc) if i.lower() not in stop_words] inside of a loop.
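Put together, the loop might look like this (documents is still assumed to be your list of strings, as above):

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english'))
porter = PorterStemmer()

processed = []
for doc in documents:  # documents: your list of strings, as defined above
    # lowercase, drop stopwords, then stem the remaining tokens
    stems = [porter.stem(i.lower())
             for i in wordpunct_tokenize(doc)
             if i.lower() not in stop_words]
    processed.append(stems)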

