n-grams in Python: four, five, six grams?

Question

I'm looking for a way to split a text into n-grams. Normally I would do something like:

    import nltk
    from nltk import bigrams
    string = "I really like python, it's pretty awesome."
    string_bigrams = bigrams(string)
    print(string_bigrams)

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams or even hundred-grams?

Thanks!

User · Answer

I'm surprised that this hasn't shown up yet:

    In [34]: sentence = "I really like python, it's pretty awesome.".split()

    In [35]: N = 4

    In [36]: grams = [sentence[i:i+N] for i in range(len(sentence)-N+1)]

    In [37]: for gram in grams: print(gram)
    ['I', 'really', 'like', 'python,']
    ['really', 'like', 'python,', "it's"]
    ['like', 'python,', "it's", 'pretty']
    ['python,', "it's", 'pretty', 'awesome.']
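The slicing idea above works for any n. Here is a minimal sketch, assuming whitespace tokenization; the helper name `ngrams_by_slicing` is hypothetical and not part of any library:

```python
def ngrams_by_slicing(tokens, n):
    """Return all n-grams of `tokens` as a list of tuples (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # len(tokens) - n + 1 windows fit; range() is empty when the text is shorter than n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I really like python, it's pretty awesome.".split()
four_grams = ngrams_by_slicing(tokens, 4)
hundred_grams = ngrams_by_slicing(tokens, 100)  # empty list: the sentence has only 7 tokens
```

Asking for a hundred-gram of a seven-word sentence simply yields an empty list, which is usually the behaviour you want.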

User · Answer

You can get all 4- to 6-grams using the code below, without any other package:

    from itertools import chain

    def get_m_2_ngrams(input_list, min, max):
        for s in chain(*[get_ngrams(input_list, k) for k in range(min, max+1)]):
            yield ' '.join(s)

    def get_ngrams(input_list, n):
        return zip(*[input_list[i:] for i in range(n)])

    if __name__ == '__main__':
        input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
        for s in get_m_2_ngrams(input_list, 4, 6):
            print(s)

The output is below:

    I am aware that
    am aware that nltk
    aware that nltk only
    that nltk only offers
    nltk only offers bigrams
    only offers bigrams and
    offers bigrams and trigrams
    bigrams and trigrams ,
    and trigrams , but
    trigrams , but is
    , but is there
    but is there a
    is there a way
    there a way to
    a way to split
    way to split my
    to split my text
    split my text in
    my text in four-grams
    text in four-grams ,
    in four-grams , five-grams
    four-grams , five-grams or
    , five-grams or even
    five-grams or even hundred-grams
    I am aware that nltk
    am aware that nltk only
    aware that nltk only offers
    that nltk only offers bigrams
    nltk only offers bigrams and
    only offers bigrams and trigrams
    offers bigrams and trigrams ,
    bigrams and trigrams , but
    and trigrams , but is
    trigrams , but is there
    , but is there a
    but is there a way
    is there a way to
    there a way to split
    a way to split my
    way to split my text
    to split my text in
    split my text in four-grams
    my text in four-grams ,
    text in four-grams , five-grams
    in four-grams , five-grams or
    four-grams , five-grams or even
    , five-grams or even hundred-grams
    I am aware that nltk only
    am aware that nltk only offers
    aware that nltk only offers bigrams
    that nltk only offers bigrams and
    nltk only offers bigrams and trigrams
    only offers bigrams and trigrams ,
    offers bigrams and trigrams , but
    bigrams and trigrams , but is
    and trigrams , but is there
    trigrams , but is there a
    , but is there a way
    but is there a way to
    is there a way to split
    there a way to split my
    a way to split my text
    way to split my text in
    to split my text in four-grams
    split my text in four-grams ,
    my text in four-grams , five-grams
    text in four-grams , five-grams or
    in four-grams , five-grams or even
    four-grams , five-grams or even hundred-grams

You can find more detail on this blog.

User · Answer

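You can use sklearn.feature_extraction.text.CountVectorizer:

    import sklearn.feature_extraction.text  # FYI http://scikit-learn.org/stable/install.html
    ngram_size = 4
    string = ["I really like python, it's pretty awesome."]
    vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size))
    vect.fit(string)
    # On scikit-learn >= 1.0 call vect.get_feature_names_out() instead;
    # get_feature_names() was removed in 1.2.
    print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

outputs:

    4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']

Note that CountVectorizer lowercases the text and, with its default token pattern, drops single-character tokens, which is why "I" and the "s" of "it's" do not appear in the output.

You can set ngram_size to any positive integer, i.e. you can split a text into four-grams, five-grams or even hundred-grams.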

User · Answer

Great native Python answers have been given by other users. But here's the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).

There is an ngram module in nltk that people seldom use. It's not because it's hard to read n-grams, but because training a model based on n-grams where n > 3 will result in a lot of data sparsity.

    from nltk import ngrams

    sentence = 'this is a foo bar sentences and i want to ngramize it'

    n = 6
    sixgrams = ngrams(sentence.split(), n)

    for grams in sixgrams:
        print(grams)
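To make the sparsity remark concrete, here is a toy sketch using only the standard library. The helper `singleton_fraction` and the example sentence are assumptions made for illustration; they are not part of nltk, and a 13-word sentence is of course not a real corpus.

```python
from collections import Counter


def singleton_fraction(tokens, n):
    """Fraction of distinct n-grams that occur exactly once (hypothetical helper)."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


tokens = "the cat sat on the mat and the cat sat on the rug".split()
fractions = {n: singleton_fraction(tokens, n) for n in range(1, 6)}
for n, fraction in fractions.items():
    print(n, round(fraction, 3))
```

On this sentence the share of n-grams seen only once rises from 3/7 for unigrams to 7/8 for 5-grams: the longer the n-gram, the less often any particular one repeats, so a model has fewer observations per n-gram to learn from.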

User · Answer

After about seven years, here's a more elegant answer using collections.deque:

    import collections
    import itertools

    def ngrams(words, n):
        d = collections.deque(maxlen=n)
        d.extend(words[:n])
        words = words[n:]
        for window, word in zip(itertools.cycle((d,)), words):
            print(' '.join(window))
            d.append(word)
        # The loop stops before the final window is printed, so emit it here.
        if len(d) == n:
            print(' '.join(d))

    words = ['I', 'am', 'become', 'death,', 'the', 'destroyer', 'of', 'worlds']

Output:

    In [15]: ngrams(words, 3)
    I am become
    am become death,
    become death, the
    death, the destroyer
    the destroyer of
    destroyer of worlds

    In [16]: ngrams(words, 4)
    I am become death,
    am become death, the
    become death, the destroyer
    death, the destroyer of
    the destroyer of worlds

    In [17]: ngrams(words, 1)
    I
    am
    become
    death,
    the
    destroyer
    of
    worlds

    In [18]: ngrams(words, 2)
    I am
    am become
    become death,
    death, the
    the destroyer
    destroyer of
    of worlds
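A variant of the same deque idea that yields tuples instead of printing, and that accepts any iterable rather than only a list, might look like the sketch below. The name `sliding_ngrams` is hypothetical.

```python
from collections import deque


def sliding_ngrams(tokens, n):
    """Yield n-grams as tuples from any iterable of tokens (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    window = deque(maxlen=n)  # a full deque drops its oldest item on append
    for token in tokens:
        window.append(token)
        if len(window) == n:
            yield tuple(window)


words = ['I', 'am', 'become', 'death,', 'the', 'destroyer', 'of', 'worlds']
trigrams = list(sliding_ngrams(words, 3))
```

Because only the current window is held in memory, this also works on generators, for example tokens streamed from a large file.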

User · Answer

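You can easily whip up your own function to do this using itertools (in Python 3 the built-in zip replaces izip):

    from itertools import islice, tee
    s = 'spam and eggs'
    N = 3
    trigrams = zip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
    list(trigrams)
    # [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
    #  ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
    #  ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
    #  ('e', 'g', 'g'), ('g', 'g', 's')]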

User · Answer

If you want a pure iterator solution for large strings with constant memory usage:

    import itertools
    import re
    from typing import Iterable

    def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
        input_iters = [
            map(lambda m: m.group(0), re.finditer(token_regex, input))
            for n in range(ngram_size)
        ]
        # Skip first words: the k-th iterator is advanced k times
        for n in range(1, ngram_size): list(map(next, input_iters[n:]))

        output_iter = itertools.starmap(
            lambda *args: " ".join(args),
            zip(*input_iters)
        )
        return output_iter

Test:

    input = "If you want a pure iterator solution for large strings with constant memory usage"
    list(ngrams_iter(input, 5))

Output:

    ['If you want a pure',
     'you want a pure iterator',
     'want a pure iterator solution',
     'a pure iterator solution for',
     'pure iterator solution for large',
     'iterator solution for large strings',
     'solution for large strings with',
     'for large strings with constant',
     'large strings with constant memory',
     'strings with constant memory usage']

User · Answer

If efficiency is an issue and you have to build multiple different n-grams (up to a hundred, as you say), but you want to use pure Python, I would do:

    from itertools import chain

    def n_grams(seq, n=1):
        """Returns an iterator over the n-grams given a listTokens"""
        shiftToken = lambda i: (el for j, el in enumerate(seq) if j >= i)
        shiftedTokens = (shiftToken(i) for i in range(n))
        tupleNGrams = zip(*shiftedTokens)
        return tupleNGrams  # if join in generator: (" ".join(i) for i in tupleNGrams)

    def range_ngrams(listTokens, ngramRange=(1, 2)):
        """Returns an iterator over all n-grams for n in range(ngramRange) given a listTokens."""
        return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))

Usage (the upper bound of ngramRange is exclusive, as with range, so (1, 4) gives unigrams up to trigrams):

    >>> input_list = 'test the ngrams generator'.split()
    >>> list(range_ngrams(input_list, ngramRange=(1, 4)))
    [('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

Same speed as NLTK:

    import nltk

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    nltk.ngrams(input_list, n=5)
    # 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    n_grams(input_list, n=5)
    # 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    nltk.ngrams(input_list, n=1)
    nltk.ngrams(input_list, n=2)
    nltk.ngrams(input_list, n=3)
    nltk.ngrams(input_list, n=4)
    nltk.ngrams(input_list, n=5)
    # 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    range_ngrams(input_list, ngramRange=(1, 6))
    # 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

(Keep in mind that each cell builds the input string and only creates lazy iterators, without consuming them, so these numbers mostly reflect setup cost rather than the cost of producing the n-grams.)

Repost from my previous answer.

User · Answer

People have already answered pretty nicely for the scenario where you need bigrams or trigrams, but if you need every n-gram for the sentence, you can use nltk.util.everygrams:

    >>> from nltk.util import everygrams

    >>> message = "who let the dogs out"

    >>> msg_split = message.split()

    >>> list(everygrams(msg_split))
    [('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out'), ('who', 'let', 'the'), ('let', 'the', 'dogs'), ('the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]

In case you have a limit, as with trigrams where the maximum length should be 3, you can use the max_len parameter to specify it:

    >>> list(everygrams(msg_split, max_len=2))
    [('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]

You can just modify the max_len parameter to get n-grams up to whatever size you need, i.e. four-grams, five-grams, six-grams or even hundred-grams.

The previously mentioned solutions can be modified to do the same thing, but this solution is much more straightforward.

For further reading click here.

And when you just need one specific size, like bigrams or trigrams, you can use nltk.util.ngrams as mentioned in M.A.Hassan's answer.
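If NLTK is not available, the same idea can be sketched in plain Python. The function name `all_ngrams` and its parameters are hypothetical, and the output order (grouped by length, shortest first) is a choice made in this sketch; it is not guaranteed to match what `nltk.util.everygrams` returns in your NLTK version.

```python
def all_ngrams(tokens, min_len=1, max_len=None):
    """Yield every n-gram with min_len <= n <= max_len as a tuple (hypothetical helper)."""
    tokens = list(tokens)
    if max_len is None:
        max_len = len(tokens)  # default: go up to the whole sequence
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


msg_split = "who let the dogs out".split()
everything = list(all_ngrams(msg_split))
up_to_bigrams = list(all_ngrams(msg_split, max_len=2))
```

For a five-word sentence this produces 5 + 4 + 3 + 2 + 1 = 15 n-grams in total, and 9 when capped at bigrams.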

User · Answer

A more elegant approach to build bigrams with Python's built-in zip(). Simply convert the original string into a list with split(), then pass the list once normally and once offset by one element.

    string = "I really like python, it's pretty awesome."

    def find_bigrams(s):
        input_list = s.split(" ")
        return zip(input_list, input_list[1:])

    def find_ngrams(s, n):
        input_list = s.split(" ")
        return zip(*[input_list[i:] for i in range(n)])

    list(find_bigrams(string))  # zip is lazy in Python 3, so wrap it in list()

    [('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]

User · Answer

I have never dealt with nltk, but I did n-grams as part of a small class project. If you want to find the frequency of all n-grams occurring in the string, here is a way to do that. D gives you the histogram of your N-word sequences.

    N = 2  # n-gram size (example value)
    D = dict()
    string = 'whatever string...'
    strparts = string.split()
    for i in range(len(strparts) - N + 1):  # N-grams; the +1 includes the last one
        try:
            D[tuple(strparts[i:i+N])] += 1
        except KeyError:
            D[tuple(strparts[i:i+N])] = 1
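The same histogram can be built with `collections.Counter` from the standard library, which removes the need for the try/except. A minimal sketch, assuming whitespace tokenization; the function name `ngram_counts` and the sample sentence are made up for illustration:

```python
from collections import Counter


def ngram_counts(text, n):
    """Count how often each n-gram occurs in `text` (hypothetical helper)."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


counts = ngram_counts("the cat sat on the cat mat", 2)
most_frequent = counts.most_common(1)  # the single most frequent bigram with its count
```

`Counter` also returns 0 for n-grams it has never seen, so lookups do not need a guard.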

User · Answer

Four-grams are already in NLTK. Here is a piece of code that can help you with this:

    from nltk.collocations import *
    import nltk
    # You should tokenize your text
    text = "I do not like green eggs and ham, I do not like them Sam I am!"
    tokens = nltk.wordpunct_tokenize(text)
    fourgrams = nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
    for fourgram, freq in fourgrams.ngram_fd.items():
        print(fourgram, freq)

I hope it helps.

User · Answer

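Here is another simple way to do n-grams:

    >>> import nltk
    >>> from nltk.util import ngrams
    >>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
    >>> tokenize = nltk.word_tokenize(text)
    >>> tokenize
    ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    >>> bigrams = list(ngrams(tokenize, 2))
    >>> bigrams
    [('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'),
     ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'),
     ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'),
     ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'),
     ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'),
     ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
    >>> trigrams = list(ngrams(tokenize, 3))
    >>> trigrams
    [('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'),
     ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'),
     ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','),
     ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'),
     ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'),
     ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'),
     ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'),
     (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
    >>> fourgrams = list(ngrams(tokenize, 4))
    >>> fourgrams
    [('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'),
     ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'),
     ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'),
     ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','),
     ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'),
     (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'),
     ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'),
     ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'),
     ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','),
     ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'),
     (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]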

User · Answer

Using only nltk tools:

    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    def get_ngrams(text, n):
        n_grams = ngrams(word_tokenize(text), n)
        return [' '.join(grams) for grams in n_grams]

Example output:

    get_ngrams('This is the simplest text i could think of', 3)

    ['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

To keep the n-grams in array format, just remove ' '.join.

User · Answer

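NLTK is great, but sometimes it is too much overhead for a project:

    import re

    def tokenize(text, ngrams=1):
        # Replace brackets, quotes, slashes, whitespace and common punctuation with spaces
        text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
        # Collapse runs of whitespace
        text = re.sub(r'\s+', ' ', text)
        tokens = text.split()
        return [tuple(tokens[i:i+ngrams]) for i in range(len(tokens)-ngrams+1)]

Example use:

    >>> text = "This is an example text"
    >>> tokenize(text, 2)
    [('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
    >>> tokenize(text, 3)
    [('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]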
