How do I do word Stemming or Lemmatization?

Question

I've tried PorterStemmer and Snowball, but both don't work on all words, missing some very common ones.

My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.

See also:

- Stemming algorithm that produces real words
- Stemming - code examples or open source projects

User · Answer

If I may quote my answer to the question StompChicken mentioned:

The core issue here is that stemming algorithms operate on a phonetic basis with no actual understanding of the language they're working with.

As they have no understanding of the language and do not run from a dictionary of terms, they have no way of recognizing and responding appropriately to irregular cases, such as "run"/"ran".

If you need to handle irregular cases, you'll need to either choose a different approach or augment your stemming with your own custom dictionary of corrections to run after the stemmer has done its thing.
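
As a rough illustration of the second option, the sketch below runs a correction table over a stemmer's output. The `toy_stem` function is a deliberately crude stand-in written for this example (it is not Porter or Snowball), and the entries in `CORRECTIONS` are assumptions chosen to fit the question's test words; with a real stemmer you would build the table from the mistakes you actually observe.

```python
# Hypothetical correction table: maps a stem the stemmer produced to the
# form we actually want. The entries are illustrative, not exhaustive.
CORRECTIONS = {
    "runn": "run",
    "ran": "run",
    "cactu": "cactus",
    "cacti": "cactus",
    "communit": "community",
}


def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ies", "es", "s", "ing"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_with_corrections(word, stem=toy_stem, corrections=CORRECTIONS):
    """Stem first, then replace the result if the table has a fix for it."""
    stemmed = stem(word.lower())
    return corrections.get(stemmed, stemmed)


words = "cats running ran cactus cactuses cacti community communities".split()
normalized = [stem_with_corrections(w) for w in words]
print(normalized)
```

Because the table is applied after stemming, one entry can repair every inflected form that collapses to the same bad stem.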

User · Answer

Take a look at LemmaGen - open source library written in C# 3.0.

Results for your test words (http://lemmatise.ijs.si/Services):

    cats -> cat
    running ran -> run
    cactus cactuses -> cactus
    cacti -> cactus
    community communities -> community

User · Answer

The stemmer vs lemmatizer debate goes on. It's a matter of preferring precision over efficiency. You should lemmatize to achieve linguistically meaningful units, and stem to use minimal computing juice and still index a word and its variations under the same key.

See Stemmers vs Lemmatizers.

Here's an example with Python NLTK:

    >>> sent = "cats running ran cactus cactuses cacti community communities"
    >>> from nltk.stem import PorterStemmer, WordNetLemmatizer
    >>>
    >>> port = PorterStemmer()
    >>> " ".join([port.stem(i) for i in sent.split()])
    'cat run ran cactu cactus cacti commun commun'
    >>>
    >>> wnl = WordNetLemmatizer()
    >>> " ".join([wnl.lemmatize(i) for i in sent.split()])
    'cat running ran cactus cactus cactus community community'
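
To make "index a word and its variations under the same key" concrete, here is a minimal sketch of an inverted index keyed by normalized token. The `STEMS` table hard-codes a few of the Porter outputs quoted above so the snippet runs without NLTK; `build_index`, `search`, and `normalize` are names invented for this example, and in practice you would pass something like `port.stem` or a lemmatizer instead.

```python
from collections import defaultdict

# A few Porter stems taken from the output above; anything not listed
# is indexed unchanged. This table is a stand-in for a real stemmer.
STEMS = {"cats": "cat", "community": "commun", "communities": "commun"}


def normalize(token):
    return STEMS.get(token, token)


def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize):
    """Normalize the query the same way as the documents, then look it up."""
    return sorted(index.get(normalize(query.lower()), set()))


docs = ["cats roam the community", "two communities share one cat"]
index = build_index(docs, normalize)
print(search(index, "communities", normalize))
print(search(index, "cat", normalize))
```

The key does not need to be a real word ("commun" is not); it only has to be the same for every variation you want a query to match.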

User · Answer

Look into WordNet, a large lexical database for the English language:

http://wordnet.princeton.edu/

There are APIs for accessing it in several languages.

User · Answer

Do a search for Lucene. I'm not sure if there's a PHP port, but I do know Lucene is available for many platforms. Lucene is an OSS (from Apache) indexing and search library. Naturally, it and its community extras might have something interesting to look at. At the very least, you can learn how it's done in one language so you can translate the "idea" into PHP.

User · Answer

In Java, I use tartarus-snowball to stem words.

Maven:

    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-snowball</artifactId>
        <version>3.0.3</version>
        <scope>test</scope>
    </dependency>

Sample code:

    import org.tartarus.snowball.SnowballProgram;
    import org.tartarus.snowball.ext.EnglishStemmer;

    SnowballProgram stemmer = new EnglishStemmer();
    String[] words = new String[]{
        "testing",
        "skincare",
        "eyecare",
        "eye",
        "worked",
        "read"
    };
    for (String word : words) {
        // Load the word, stem it in place, then read the result back
        stemmer.setCurrent(word);
        stemmer.stem();
        // debug
        logger.info("Origin: " + word + " > " + stemmer.getCurrent());
        // result: test, skincar, eyecar, eye, work, read
    }

User · Answer

An example that lemmatizes plot summaries in a pandas DataFrame with NLTK, using each word's POS tag:

    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    df_plots = pd.read_excel("Plot Summary.xlsx", index_col=0)
    df_plots

    # Printing first sentence of first row and last sentence of last row
    nltk.sent_tokenize(df_plots.loc[1].Plot)[0] + nltk.sent_tokenize(df_plots.loc[len(df_plots)].Plot)[-1]

    # Calculating length of all plots by words
    df_plots["Length"] = df_plots.Plot.apply(lambda x: len(nltk.word_tokenize(x)))

    print("Longest plot is for season", df_plots.Length.idxmax())
    print("Shortest plot is for season", df_plots.Length.idxmin())

    # What is this show about? (What are the top 3 words used,
    # excluding the stop words, in all the seasons combined?)
    wnl = WordNetLemmatizer()

    def wn_pos(tag):
        # Map a Penn Treebank tag to a WordNet POS letter; default to noun
        return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(tag[0], "n")

    word_sample = ["struggled", "died"]
    word_list = nltk.pos_tag(word_sample)
    [wnl.lemmatize(word, pos=wn_pos(tag)) for word, tag in word_list]

    # Figure out the stop words
    stop = set(stopwords.words("english"))

    # Tokenize all the plots
    df_plots["Tokenized"] = df_plots.Plot.apply(lambda x: nltk.word_tokenize(x.lower()))

    # Remove the stop words
    df_plots["Filtered"] = df_plots.Tokenized.apply(lambda x: [word for word in x if word not in stop])

    # Lemmatize each word
    df_plots["POS"] = df_plots.Filtered.apply(nltk.pos_tag)
    df_plots["Lemmatized"] = df_plots.POS.apply(lambda x: [wnl.lemmatize(word, pos=wn_pos(tag)) for word, tag in x])

    # Which season had the highest screenplay of "Jesse" compared to "Walt"?
    # Screenplay of Jesse = (Occurrences of "Jesse") / (Occurrences of "Jesse" + Occurrences of "Walt")
    df_plots.groupby("Season").Tokenized.sum()

    df_plots["Share"] = df_plots.groupby("Season").Tokenized.sum().apply(
        lambda x: float(x.count("jesse") * 100) / float(x.count("jesse") + x.count("walter") + x.count("walt")))

    print("The highest times Jesse was mentioned compared to Walter/Walt was in season", df_plots["Share"].idxmax())
    # float(df_plots.Tokenized.sum().count('jesse')) * 100 / float((df_plots.Tokenized.sum().count('jesse') + df_plots.Tokenized.sum().count('walt') + df_plots.Tokenized.sum().count('walter')))

User · Answer

I highly recommend using spaCy (base text parsing & tagging) and Textacy (higher-level text processing built on top of spaCy).

Lemmatized words are available by default in spaCy as a token's .lemma_ attribute, and text can be lemmatized while doing a lot of other text preprocessing with textacy, for example while creating a bag of terms or words, or generally just before performing some processing that requires it.

I'd encourage you to check out both before writing any code, as this may save you a lot of time!

User · Answer

.Net Lucene has an inbuilt Porter stemmer. You can try that. But note that Porter stemming does not consider word context when deriving the lemma. (Go through the algorithm and its implementation and you will see how it works.)

User · Answer

I tried your list of terms on this snowball demo site and the results look okay:

    cats -> cat
    running -> run
    ran -> ran
    cactus -> cactus
    cactuses -> cactus
    community -> communiti
    communities -> communiti

A stemmer is supposed to turn inflected forms of words down to some common root. It's not really a stemmer's job to make that root a "proper" dictionary word. For that you need to look at morphological/orthographic analysers.

I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.
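
One way to judge a stemmer on those terms is to ask whether related forms collapse to the same stem, not whether the stem is a dictionary word. The sketch below checks that against the demo results listed above; the grouping in `EXPECTED_GROUPS` is an assumption about which words ought to be conflated, and `conflated` is a helper name made up for this example.

```python
# Stemmer output copied from the demo results above.
SNOWBALL_OUTPUT = {
    "cats": "cat",
    "running": "run",
    "ran": "ran",
    "cactus": "cactus",
    "cactuses": "cactus",
    "community": "communiti",
    "communities": "communiti",
}

# Assumed groups of words that should share a single stem.
EXPECTED_GROUPS = [
    {"running", "ran"},
    {"cactus", "cactuses"},
    {"community", "communities"},
]


def conflated(group, stems):
    """True when every word in the group maps to the same stem."""
    return len({stems[word] for word in group}) == 1


report = {
    tuple(sorted(group)): conflated(group, SNOWBALL_OUTPUT)
    for group in EXPECTED_GROUPS
}
for group, ok in report.items():
    print(group, "conflated" if ok else "NOT conflated")
```

By this measure "communiti" is a perfectly good result, while the irregular pair "running"/"ran" is the real miss.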

User · Answer

Martin Porter's official page contains a Porter Stemmer in PHP as well as other languages.

If you're really serious about good stemming, though, you're going to need to start with something like the Porter algorithm, refine it by adding rules to fix incorrect cases common to your dataset, and then finally add a lot of exceptions to the rules. This can be easily implemented with key/value pairs (dbm/hash/dictionaries) where the key is the word to look up and the value is the stemmed word to replace the original. A commercial search engine I worked on once ended up with 800-some exceptions to a modified Porter algorithm.
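
A minimal sketch of such an exception table in Python, assuming the standard-library `dbm.dumb` module as the key/value store. The `fallback` function is a trivial placeholder for the rule-based stemmer, and the entries in `EXCEPTIONS` are made up for illustration; neither comes from the search engine mentioned above.

```python
import dbm.dumb
import os
import tempfile

# Illustrative exceptions only: word to look up -> stem to use instead.
EXCEPTIONS = {"ran": "run", "cacti": "cactus", "feet": "foot"}


def fallback(word):
    """Placeholder for a real rule-based stemmer: strips a final 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def stem(word, db):
    """Check the exception table first, then fall back to the stemmer."""
    key = word.lower().encode("utf-8")
    if key in db:
        return db[key].decode("utf-8")
    return fallback(word.lower())


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stem_exceptions")
    # "c" opens the database for reading and writing, creating it if needed.
    with dbm.dumb.open(path, "c") as db:
        for word, replacement in EXCEPTIONS.items():
            db[word] = replacement  # str keys and values are stored as UTF-8
        results = [stem(w, db) for w in ["ran", "Cacti", "cats", "community"]]

print(results)
```

Looking the word up before stemming keeps the exceptions independent of the rules, so changing a rule never silently invalidates an entry.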

User · Answer

Martin Porter wrote Snowball (a language for stemming algorithms) and rewrote the "English Stemmer" in Snowball. There is an English Stemmer for C and Java.

He explicitly states that the Porter Stemmer has been reimplemented only for historical reasons, so testing stemming correctness against the Porter Stemmer will get you results that you (should) already know.

From http://tartarus.org/~martin/PorterStemmer/index.html (emphasis mine):

    The Porter stemmer should be regarded as "frozen", that is, strictly defined, and not amenable to further modification. As a stemmer, it is slightly inferior to the Snowball English or Porter2 stemmer, which derives from it, and which is subjected to occasional improvements. For practical work, therefore, the new Snowball stemmer is recommended. The Porter stemmer is appropriate to IR research work involving stemming where the experiments need to be exactly repeatable.

Dr. Porter suggests using the English or Porter2 stemmers instead of the Porter stemmer. The English stemmer is what's actually used in the demo site, as @StompChicken has answered earlier.

User · Answer

This looks interesting: MIT Java WordnetStemmer: http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html

User · Answer

You could use the Morpha stemmer. UW has uploaded the Morpha stemmer to Maven Central if you plan to use it from a Java application. There's a wrapper that makes it much easier to use. You just need to add it as a dependency and use the edu.washington.cs.knowitall.morpha.MorphaStemmer class. Instances are threadsafe (the original JFlex had class fields for local variables unnecessarily). Instantiate the class and call morpha with the word you want to stem:

    new MorphaStemmer().morpha("climbed") // goes to "climb"

User · Answer

I use Stanford NLP to perform lemmatization. I was stuck on a similar problem for the last few days. All thanks to Stack Overflow for helping me solve the issue.

    import java.util.*;
    import edu.stanford.nlp.pipeline.*;
    import edu.stanford.nlp.ling.*;
    import edu.stanford.nlp.ling.CoreAnnotations.*;
    import edu.stanford.nlp.util.CoreMap;

    public class Example
    {
        public static void main(String[] args)
        {
            // Lemmatization needs tokenization, sentence splitting and POS tagging first
            Properties props = new Properties();
            props.put("annotators", "tokenize, ssplit, pos, lemma");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
            String text = "the string you want";
            Annotation document = pipeline.process(text);

            for (CoreMap sentence : document.get(SentencesAnnotation.class))
            {
                for (CoreLabel token : sentence.get(TokensAnnotation.class))
                {
                    String word = token.get(TextAnnotation.class);
                    String lemma = token.get(LemmaAnnotation.class);
                    System.out.println("lemmatized version: " + lemma);
                }
            }
        }
    }

It also might be a good idea to use stopwords to minimize the output lemmas if they are used later in a classifier. Please take a look at the coreNlp extension written by John Conwell.
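
The stopword suggestion can be sketched independently of CoreNLP. The Python snippet below filters (word, lemma) pairs against a small hand-written stop list; the list, the sample pairs, and the `content_lemmas` helper are all assumptions for illustration. A real pipeline would take the pairs from the lemmatizer's output and use a fuller stop list.

```python
# Small illustrative stop list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "to", "be", "of", "and", "in"}


def content_lemmas(pairs, stopwords=STOPWORDS):
    """Keep lemmas of alphabetic tokens that are not stop words."""
    return [
        lemma.lower()
        for _word, lemma in pairs
        if lemma.isalpha() and lemma.lower() not in stopwords
    ]


# Hand-written (word, lemma) pairs standing in for lemmatizer output.
pairs = [
    ("The", "the"),
    ("cats", "cat"),
    ("were", "be"),
    ("running", "run"),
    ("to", "to"),
    ("communities", "community"),
    (".", "."),
]
features = content_lemmas(pairs)
print(features)
```

Filtering on the lemma instead of the surface word means "were" is dropped through its lemma "be" without listing every inflected form in the stop list.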

User · Answer

Based on various answers on Stack Overflow and blogs I've come across, this is the method I'm using, and it seems to return real words quite well. The idea is to split the incoming text into an array of words (use whichever method you'd like), and then find the parts of speech (POS) for those words and use that to help stem and lemmatize the words.

Your sample above doesn't work too well, because the POS can't be determined. However, if we use a real sentence, things work much better.

    import nltk
    from nltk.corpus import wordnet

    lmtzr = nltk.WordNetLemmatizer().lemmatize


    def get_wordnet_pos(treebank_tag):
        # Map a Penn Treebank tag to the matching WordNet POS constant
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return wordnet.NOUN


    def normalize_text(text):
        word_pos = nltk.pos_tag(nltk.word_tokenize(text))
        lemm_words = [lmtzr(sw[0], get_wordnet_pos(sw[1])) for sw in word_pos]

        return [x.lower() for x in lemm_words]

    print(normalize_text('cats running ran cactus cactuses cacti community communities'))
    # ['cat', 'run', 'ran', 'cactus', 'cactuses', 'cacti', 'community', 'community']

    print(normalize_text('The cactus ran to the community to see the cats running around cacti between communities'))
    # ['the', 'cactus', 'run', 'to', 'the', 'community', 'to', 'see', 'the', 'cat', 'run', 'around', 'cactus', 'between', 'community']

User · Answer

The most current version of the stemmer in NLTK is Snowball.

You can find examples on how to use it here:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.snowball2-pysrc.html#demo

User · Answer

If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.

Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:

    >>> import nltk
    >>> nltk.download('wordnet')

You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:

    >>> from nltk.stem.wordnet import WordNetLemmatizer
    >>> lmtzr = WordNetLemmatizer()
    >>> lmtzr.lemmatize('cars')
    'car'
    >>> lmtzr.lemmatize('feet')
    'foot'
    >>> lmtzr.lemmatize('people')
    'people'
    >>> lmtzr.lemmatize('fantasized', 'v')
    'fantasize'

There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.

User · Answer

http://wordnet.princeton.edu/man/morph.3WN

For a lot of my projects, I prefer the lexicon-based WordNet lemmatizer over the more aggressive Porter stemming.

http://wordnet.princeton.edu/links#PHP has a link to a PHP interface to the WN APIs.

User · Answer

Try this one here: http://www.twinword.com/lemmatizer.php

I entered your query in the demo, "cats running ran cactus cactuses cacti community communities", and got ["cat", "running", "run", "cactus", "cactus", "cactus", "community", "community"] with the optional flag ALL_TOKENS.

Sample Code

This is an API, so you can connect to it from any environment. Here is what the PHP REST call may look like.

    // These code snippets use an open-source library. http://unirest.io/php
    $response = Unirest\Request::post([ENDPOINT],
      array(
        "X-Mashape-Key" => [API KEY],
        "Content-Type" => "application/x-www-form-urlencoded",
        "Accept" => "application/json"
      ),
      array(
        "text" => "cats running ran cactus cactuses cacti community communities"
      )
    );

User · Answer

The top Python packages (in no specific order) for lemmatization are: spacy, nltk, gensim, pattern, CoreNLP and TextBlob. I prefer spaCy's and gensim's implementations (based on pattern) because they identify the POS tag of the word and assign the appropriate lemma automatically. This gives more relevant lemmas, keeping the meaning intact.

If you plan to use nltk or TextBlob, you need to take care of finding the right POS tag manually and then find the right lemma.

Lemmatization Example with spaCy:

    # Run below statements in terminal once.
    pip install spacy
    spacy download en

    import spacy

    # Initialize spacy 'en' model
    nlp = spacy.load('en', disable=['parser', 'ner'])

    sentence = "The striped bats are hanging on their feet for best"

    # Parse
    doc = nlp(sentence)

    # Extract the lemma
    " ".join([token.lemma_ for token in doc])
    #> 'the strip bat be hang on -PRON- foot for good'

Lemmatization Example With Gensim:

    from gensim.utils import lemmatize
    sentence = "The striped bats were hanging on their feet and ate best fishes"
    lemmatized_out = [wd.decode('utf-8').split('/')[0] for wd in lemmatize(sentence)]
    #> ['striped', 'bat', 'be', 'hang', 'foot', 'eat', 'best', 'fish']

The above examples were borrowed from this lemmatization page.
