How can I split a text into sentences?

Question

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

My old regular expression works badly:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

User · Answer

You could make a new tokenizer for Russian (and some other languages) using this function:

def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ') 
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

and then call it in this way:

text = 'вы выполняете поиск, используя Google SSL;'  # Russian: "you are searching using Google SSL;"
tokens = russianTokenizer(text)
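
For comparison, the same idea (words and punctuation marks as separate tokens) can be sketched with a single regular expression instead of a chain of replace() calls. This is only a rough equivalent: the name regex_tokenize is made up for illustration, and the punctuation set is assumed to be the one handled by the replace() calls above.

```python
import re

# Punctuation handled by the replace() chain above (assumed to be the full set).
PUNCT = re.escape(""".,:;!?"'()""")

# Order matters: an ellipsis first, then single marks, then runs of other characters.
TOKEN_PATTERN = re.compile(rf"\.\.\.|[{PUNCT}]|[^\s{PUNCT}]+")


def regex_tokenize(text):
    """Return words and punctuation marks as separate tokens."""
    return TOKEN_PATTERN.findall(text)


print(regex_tokenize("Hello, world... Is this (really) a test?"))
```

Note that this produces word tokens, not sentences, so it only helps with the original question as a first step.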

User · Answer

For simple cases (where sentences are terminated normally), this should work:

import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *\. +', text)

The regex is ' *\. +', which matches a period surrounded by 0 or more spaces to the left and 1 or more to the right (to prevent something like the period in re.split being counted as a change in sentence).

Obviously, not the most robust solution, but it'll do fine in most cases. The only case this won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?).
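
The capital-letter check suggested here could look roughly like the sketch below. The helper names split_simple and merge_lowercase_starts are hypothetical, and the splitter uses a lookbehind so the terminators are kept, which is a small departure from the re.split call in the answer.

```python
import re


def split_simple(text):
    """Split after '.', '!' or '?' when whitespace follows, keeping the mark."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def merge_lowercase_starts(sentences):
    """Glue a fragment onto the previous one when it does not start with an
    uppercase letter, a rough sign that the split was caused by an abbreviation."""
    merged = []
    for fragment in sentences:
        if merged and not fragment[:1].isupper():
            merged[-1] += " " + fragment
        else:
            merged.append(fragment)
    return merged


text = "The meeting is at 5 p.m. sharp. Bring the report, i.e. the final draft. Thanks!"
print(merge_lowercase_starts(split_simple(text)))
```

This only repairs splits where the next fragment starts in lower case (or with a digit or quote); something like "Dr. Adams" is still split, because the fragment after the abbreviation is capitalised.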

User · Answer

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)

User · Answer

I hope this will help you with Latin, Chinese and Arabic text:

import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|！|？|；|…|　|!|؟|؛)+")
lines = []

with open('myData.txt', 'r', encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]
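
A variation on the same idea, sketched without the intermediate <pad> marker: split directly at the position after a terminator. The terminator set below is an assumption (ASCII, CJK full-width and Arabic marks), split_multilingual is a made-up name, and the code relies on re.split accepting zero-width matches, which needs Python 3.7 or later.

```python
import re

# Assumed terminator set: ASCII, CJK full-width and Arabic marks.
TERMINATORS = ".!?;。！？；…؟؛"
_T = re.escape(TERMINATORS)

# A boundary sits right after a terminator, unless another terminator
# (e.g. "?!") or a digit (e.g. "3.5") follows. Trailing whitespace is consumed.
BOUNDARY = re.compile(rf"(?<=[{_T}])(?![{_T}\d])\s*")


def split_multilingual(text):
    """Split text after sentence terminators from several scripts."""
    pieces = (piece.strip() for piece in BOUNDARY.split(text))
    return [piece for piece in pieces if piece]


print(split_multilingual("你好。今天天气很好！Is it 3.5 degrees?! Yes."))
```

Like the answer's regex, this treats ';' as a terminator, which may or may not be what you want for English text.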

User · Answer

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds and handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

# -*- coding: utf-8 -*-
import re
alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n", " ")
    # Protect periods that do not end a sentence by turning them into <prd>
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
    # Move terminators outside closing quotes so the quote stays with its sentence
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    # Every remaining terminator is a sentence boundary
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
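
The core trick in this function is the placeholder: periods that should not end a sentence are temporarily replaced with <prd>, the remaining terminators are tagged with <stop>, and the placeholders are restored at the end. A stripped-down sketch of just that mechanism follows; the abbreviation list is deliberately short and assumed, and split_with_placeholders is a hypothetical name, not part of the answer.

```python
import re

# Assumed, deliberately short abbreviation list for illustration only.
ABBREVIATIONS = ["Mr", "Mrs", "Dr", "Jr", "Inc"]
ABBREV_RE = re.compile(r"\b(" + "|".join(ABBREVIATIONS) + r")\.")


def split_with_placeholders(text):
    # 1. Protect periods that belong to known abbreviations.
    protected = ABBREV_RE.sub(r"\1<prd>", text)
    # 2. Tag every remaining terminator as a sentence boundary.
    marked = re.sub(r"([.!?])", r"\1<stop>", protected)
    # 3. Restore the protected periods, then split on the boundary tag.
    restored = marked.replace("<prd>", ".")
    return [s.strip() for s in restored.split("<stop>") if s.strip()]


print(split_with_placeholders("Dr. Smith met Mr. Jones. They talked! Did it help?"))
```

The sketch leaves out what the full function does with acronyms, websites, quotes and abbreviations that really do end a sentence ("... Nike Inc. He also ..."), so treat it as an explanation of the mechanism rather than a replacement.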

User · Answer

Using spacy:

import spacy

nlp = spacy.load('en_core_web_sm')

text = "How are you today? I hope you have a great day"
tokens = nlp(text)
for sent in tokens.sents:
    # sent.text replaces the older sent.string attribute, which spaCy 3 removed
    print(sent.text.strip())

User · Answer

Here is a middle of the road approach that doesn't rely on any external libraries. I use list comprehension to exclude overlaps between abbreviations and terminators as well as to exclude overlaps between variations on terminations, for example: '.' vs. '."'

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    # Peel sentences off the end of the paragraph until no boundary is left
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry: Find all occurrences of a substring in Python

User · Answer

You can also use the sentence tokenization function in NLTK:

from nltk.tokenize import sent_tokenize

sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespeare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)

User · Answer

I was working on a similar task and came across this question. After following a few links and working through a few NLTK exercises, the code below worked for me like magic.

from nltk.tokenize import sent_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text)

output:

['Hello everyone.', 'Welcome to GeeksforGeeks.', 'You are studying NLP article']

Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/

User · Answer

Instead of using regex for splitting the text into sentences, you can also use the nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

ref: https://stackoverflow.com/a/9474645/2877052

User · Answer

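You can try using Spacy instead of regex. I use it and it does the job.

import spacy
# 'en' is a shortcut name from older spaCy releases; spaCy 3 needs a full
# model name such as 'en_core_web_sm'
nlp = spacy.load('en')

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    # sent.text replaces the older sent.string attribute, which spaCy 3 removed
    print(sent.text.strip())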

User · Answer

No doubt NLTK is the most suitable tool for the purpose. But getting started with NLTK is quite painful (though once you install it, you just reap the rewards).

So here is simple re-based code, available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html

# split up a paragraph into sentences
# using regular expressions


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multiple characters

    #   regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

#output:
#   This is a sentence
#   This is an excited sentence
#   And do you think this is a question
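
One side effect of splitting on the terminators is that they disappear from the result, as the output shows. If the punctuation should stay attached to each sentence, a small variation is to match sentences instead of splitting between them. This is a sketch only; split_keep_enders is a made-up name and it shares the original's blindness to abbreviations.

```python
import re

# A sentence is a run of non-terminators followed by any run of terminators.
SENTENCE_RE = re.compile(r"[^.!?]+[.!?]*")


def split_keep_enders(paragraph):
    """Return sentences with their closing punctuation still attached."""
    return [m.strip() for m in SENTENCE_RE.findall(paragraph) if m.strip()]


p = "This is a sentence.  This is an excited sentence! And do you think this is a question?"
for s in split_keep_enders(p):
    print(s)
```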

User · Answer

Also, be wary of additional top level domains that aren't included in some of the answers above.

For example .info, .biz, .ru, .online will throw some sentence parsers but aren't included above.

Here's some info on frequency of top level domains: https://www.westhost.com/blog/the-most-popular-top-level-domains-in-2017/

That could be addressed by editing the code above to read:

alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov|ai|edu|co.uk|ru|info|biz|online)"
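
Rather than editing the alternation by hand each time, the pattern could be generated from a plain list. The sketch below is one possible way to do that, not part of the original function: the names TLDS, build_website_pattern and protect_websites are made up, the word boundary is an added guard so that ".community" is not mistaken for ".com", and dots inside multi-part suffixes such as "co.uk" are protected as well.

```python
import re

# Assumed TLD list, taken from the websites line above; extend as needed.
TLDS = ["com", "net", "org", "io", "gov", "ai", "edu", "co.uk", "ru", "info", "biz", "online"]


def build_website_pattern(tlds):
    """Compile a pattern matching '.<tld>' that ends at a word boundary."""
    # Longest entries first; re.escape makes the dot in "co.uk" literal.
    alternatives = "|".join(re.escape(t) for t in sorted(tlds, key=len, reverse=True))
    return re.compile(r"[.](?:" + alternatives + r")\b")


WEBSITES = build_website_pattern(TLDS)


def protect_websites(text):
    """Replace every dot inside a matched domain suffix with <prd>."""
    return WEBSITES.sub(lambda m: m.group(0).replace(".", "<prd>"), text)


print(protect_websites("He worked at craigslist.org and example.co.uk. Then he left."))
```

The <prd> marker is the same placeholder the longer function uses, so the output can be fed into the rest of that pipeline in place of its websites substitution.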

User · Answer

I had to read subtitle files and split them into sentences. After pre-processing (like removing time information etc. in the .srt files), the variable fullFile contained the full text of the subtitle file. The crude way below neatly split them into sentences. Probably I was lucky that the sentences always ended (correctly) with a space. Try this first and if it has any exceptions, add more checks and balances.

import re

# Very approximate way to split the text into sentences - Break after ? . and !
fullFile = re.sub(r"(\!|\?|\.) ", "\\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
sentFile = open("./sentences.out", "w+")
for line in sentences:
    sentFile.write(line)
    sentFile.write("\n")
sentFile.close()

Oh, well. I now realize that since my content was Spanish, I did not have the issues of dealing with "Mr. Smith" etc. Still, if someone wants a quick and dirty parser...
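
The same quick-and-dirty approach can be wrapped in a function that writes to any file-like object, which makes it easier to try out without touching the disk. This is a sketch under the same assumption as the answer (sentences end with a terminator followed by whitespace); write_sentences is a hypothetical name.

```python
import io
import re


def write_sentences(full_text, out):
    """Write one sentence per line to a file-like object and return the count."""
    # Split at whitespace that follows '.', '!' or '?', keeping the terminator.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", full_text) if s.strip()]
    for sentence in sentences:
        out.write(sentence + "\n")
    return len(sentences)


buffer = io.StringIO()
count = write_sentences("¿Dónde estás? Estoy en casa. ¡Ven pronto!", buffer)
print(count)
print(buffer.getvalue())
```

For a real file, pass a handle opened in a with block, for example with open("sentences.out", "w", encoding="utf-8") as f: write_sentences(fullFile, f), so the file is closed even if something fails midway.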
