How to remove stop words using nltk or python

Question

So I have a dataset that I would like to remove stop words from using stopwords.words('english'). I'm struggling with how to use this within my code to simply take out these words. I already have a list of the words from this dataset; the part I'm struggling with is comparing against this list and removing the stop words. Any help is appreciated.

User · Answer

Although the question is a bit old, here is a newer library worth mentioning that can do extra tasks. In some cases you don't only want to remove stop words. Rather, you want to find the stop words in the text data and store them in a list, so that you can find the noise in the data and make it more interactive.

The library is called textfeatures. You can use it as follows:

    pip install textfeatures

    import textfeatures as tf
    import pandas as pd

For example, suppose you have the following set of strings:

    texts = [
        "blue car and blue window",
        "black crow in the window",
        "i see my reflection in the window",
    ]

    df = pd.DataFrame(texts)  # Convert to a dataframe
    df.columns = ['text']     # give a name to the column
    df

Now, call the stopwords() function and pass the parameters you want:

    tf.stopwords(df, "text", "stopwords")  # extract stop words
    df[["text", "stopwords"]].head()       # give names to columns

The result is going to be:

        text                                 stopwords
    0   blue car and blue window             [and]
    1   black crow in the window             [in, the]
    2   i see my reflection in the window    [i, my, in, the]

As you can see, the last column holds the stop words found in that document (record).
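To make the idea of collecting the stop words per record concrete, here is a minimal plain-Python sketch. It is not how textfeatures is implemented. The `STOP_WORDS` set is a tiny hypothetical stand-in for a real list such as NLTK's, and `find_stopwords` is an illustrative helper name, not a library function.

```python
# Hypothetical, deliberately tiny stop word set (an assumption for illustration).
STOP_WORDS = {"i", "my", "in", "the", "and"}


def find_stopwords(text, stop_words=STOP_WORDS):
    """Return the stop words that occur in `text`, in order of appearance."""
    return [word for word in text.lower().split() if word in stop_words]


texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window",
]

# One list of stop words per input string.
found = [find_stopwords(text) for text in texts]

for text, words in zip(texts, found):
    print(f"{text!r}: {words}")
```

Swapping `STOP_WORDS` for `set(stopwords.words('english'))` from NLTK should give the same kind of result on real data.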

User · Answer

I will show you an example. First I extract the text data from the data frame (twitter_df) to process further, as follows:

    tweetText = twitter_df['text']

Then, to tokenize, I use the following method (word_tokenize needs NLTK's Punkt tokenizer data to be downloaded first):

    from nltk.tokenize import word_tokenize
    tweetText = tweetText.apply(word_tokenize)

Then, to remove stop words:

    import nltk
    from nltk.corpus import stopwords
    nltk.download('stopwords')

    stop_words = set(stopwords.words('english'))
    tweetText = tweetText.apply(lambda x: [word for word in x if word not in stop_words])
    tweetText.head()

I think this will help you.

User · Answer

1. Build a second list that keeps only the words that are not stop words:

    print("enter the string from which you want to remove list of stop words")
    userstring = input().split(" ")
    stop_list = ["a", "an", "the", "in"]
    another_list = []
    for x in userstring:
        if x not in stop_list:      # compare against the stop list
            another_list.append(x)  # keep the word (it is also possible to use .remove)
    for x in another_list:
        print(x, end=' ')

2. If you want to use .remove, the code will be like this:

    print("enter the string from which you want to remove list of stop words")
    userstring = input().split(" ")
    stop_list = ["a", "an", "the", "in"]
    for x in userstring[:]:         # loop over a copy, since the list is modified
        if x in stop_list:
            userstring.remove(x)
    for x in userstring:
        print(x, end=' ')
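One thing to watch with the .remove variant: calling .remove on the same list the loop is walking over makes the loop skip the element that follows each removal. The sketch below contrasts the two behaviours. The stop list and both function names are made up for illustration.

```python
# Hypothetical stop list, matching the small hand-written one in the answer.
STOP_LIST = ["a", "an", "the", "in"]


def remove_while_iterating(words):
    """Buggy: mutates `words` while looping over it, so items get skipped."""
    for word in words:
        if word in STOP_LIST:
            words.remove(word)
    return words


def remove_over_copy(words):
    """Safe: loops over a shallow copy and mutates the original list."""
    for word in words[:]:
        if word in STOP_LIST:
            words.remove(word)
    return words


buggy = remove_while_iterating("the a cat in the hat".split())
safe = remove_over_copy("the a cat in the hat".split())

print(buggy)  # some stop words survive
print(safe)   # only the non-stop words remain
```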

User · Answer

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    example_sent = "This is a sample sentence, showing off the stop words filtration."

    stop_words = set(stopwords.words('english'))

    word_tokens = word_tokenize(example_sent)

    # List comprehension version
    filtered_sentence = [w for w in word_tokens if w not in stop_words]

    # Equivalent explicit loop
    filtered_sentence = []

    for w in word_tokens:
        if w not in stop_words:
            filtered_sentence.append(w)

    print(word_tokens)
    print(filtered_sentence)

User · Answer

Use the textcleaner library to remove stop words from your data.

Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow these steps to do so with this library:

    pip install textcleaner

After installing:

    import textcleaner as tc
    data = tc.document(<file_name>)
    # you can also pass a list of sentences to the document class constructor
    data.remove_stpwrds()  # inplace is set to False by default

Use the above code to remove the stop words.

User · Answer

    from nltk.corpus import stopwords

    filtered_words = [word for word in word_list if word not in stopwords.words('english')]
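A note on the one-liner above: the condition is evaluated for every word, so stopwords.words('english') is called on each iteration, and a membership test against a list is a linear scan. A minimal sketch of the usual alternative is to build a set once and reuse it. The stop word list here is a small hypothetical stand-in so the snippet runs without NLTK data.

```python
# Stand-in for stopwords.words('english'); an assumption for illustration only.
stop_word_list = ["i", "my", "in", "the", "and", "a"]

# Build the lookup structure a single time, outside the comprehension.
stop_word_set = set(stop_word_list)

word_list = ["i", "see", "my", "reflection", "in", "the", "window"]

# Each membership test against a set is a hash lookup rather than a scan.
filtered_words = [word for word in word_list if word not in stop_word_set]

print(filtered_words)
```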

User · Answer

In case your data are stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stopwords list by default.

    import pandas as pd
    import texthero as hero

    df['text_without_stopwords'] = hero.remove_stopwords(df['text'])

User · Answer

To exclude all types of stop words, including the NLTK stop words, you could do something like this:

    from stop_words import get_stop_words
    from nltk.corpus import stopwords

    stop_words = list(get_stop_words('en'))         # About 900 stopwords
    nltk_words = list(stopwords.words('english'))   # About 150 stopwords
    stop_words.extend(nltk_words)

    output = [w for w in word_list if w not in stop_words]
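Since the two sources overlap, extending one list with the other leaves duplicate entries. A small sketch of merging them as a set union instead is shown below. Both input lists are short hypothetical stand-ins for get_stop_words('en') and stopwords.words('english').

```python
# Hypothetical stand-ins for the two real stop word lists.
package_words = ["a", "the", "whereas"]
nltk_words = ["a", "the", "ourselves"]

# The union keeps one copy of each word, whichever source it came from.
combined_stop_words = set(package_words) | set(nltk_words)

word_list = ["the", "result", "whereas", "ourselves", "differ"]
output = [w for w in word_list if w not in combined_stop_words]

print(sorted(combined_stop_words))
print(output)
```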

User · Answer

You could also do a set diff, for example:

    list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
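Keep in mind that a set difference drops repeated words and does not preserve the original word order. The sketch below compares it with an order-preserving filter, using a tiny hypothetical stop word set and a plain split() in place of the NLTK tokenizer.

```python
# Hypothetical stop word set (an assumption for illustration).
STOP_WORDS = {"and", "in", "the"}

tokens = "blue car and blue window".split()

# Set difference: unique words only, in no guaranteed order.
unique_remaining = set(tokens) - STOP_WORDS

# Comprehension: keeps duplicates and the original order.
ordered_remaining = [t for t in tokens if t not in STOP_WORDS]

print(sorted(unique_remaining))
print(ordered_remaining)
```

The set version suits vocabulary-style questions (which words occur), while the list version is the one to use when the filtered text has to be reassembled.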

User · Answer

Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):

    STOPWORDS = set(stopwords.words('english'))
    text = ' '.join([word for word in text.split() if word not in STOPWORDS])  # delete stopwords from text

User · Answer

You can use this function. Note that you need to lowercase all the words:

    from nltk.corpus import stopwords

    def remove_stopwords(word_list):
        stop_words = set(stopwords.words('english'))  # load the list once
        processed_word_list = []
        for word in word_list:
            word = word.lower()  # in case they aren't all lowercased
            if word not in stop_words:
                processed_word_list.append(word)
        return processed_word_list
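If the original capitalization matters for later steps, a variant is to lowercase each word only for the comparison and keep the word itself unchanged. This is a sketch under that assumption. The function name and the small stop word set are hypothetical.

```python
# Hypothetical stop word set; NLTK's English list is likewise all lowercase.
STOP_WORDS = {"the", "in", "a", "is"}


def remove_stopwords_keep_case(word_list, stop_words=STOP_WORDS):
    """Drop stop words case-insensitively but return the words as written."""
    return [word for word in word_list if word.lower() not in stop_words]


words = ["The", "Quick", "fox", "IN", "Paris"]
kept = remove_stopwords_keep_case(words)

print(kept)
```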

User · Answer

There's a very simple, lightweight Python package, stop-words, just for this purpose.

First install the package using:

    pip install stop-words

Then you can remove your words in one line using a list comprehension:

    from stop_words import get_stop_words

    filtered_words = [word for word in dataset if word not in get_stop_words('english')]

This package is very lightweight to download (unlike nltk), works for both Python 2 and Python 3, and has stop words for many other languages, like:

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian

User · Answer

Using filter:

    from nltk.corpus import stopwords

    filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))

User · Answer

I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:

    filtered_word_list = word_list[:]  # make a copy of the word_list
    for word in word_list:  # iterate over word_list
        if word in stopwords.words('english'):
            filtered_word_list.remove(word)  # remove word from filtered_word_list if it is a stopword
