BeautifulSoup Grab Visible Webpage Text

Question

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want to get just the body text (the article) and maybe even a few tab names here and there. I have tried the suggestion in this SO question, but it returns lots of <script> tags and HTML comments, which I don't want. I can't figure out the arguments I need for the function findAll() in order to get just the visible text on a webpage.

So, how should I find all visible text, excluding scripts, comments, CSS, etc.?

User · Answer

    import urllib.request
    from bs4 import BeautifulSoup

    url = "https://www.yahoo.com"
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = "\n".join(chunk for chunk in chunks if chunk)

    print(text)

User · Answer

Try this:

    from bs4 import BeautifulSoup
    from bs4.element import Comment
    import urllib.request


    def tag_visible(element):
        # Text inside these parents is never rendered by a browser.
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        return True


    def text_from_html(body):
        soup = BeautifulSoup(body, 'html.parser')
        texts = soup.findAll(text=True)
        visible_texts = filter(tag_visible, texts)
        return u" ".join(t.strip() for t in visible_texts)


    html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
    print(text_from_html(html))
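The filter can also be checked without a network request. A minimal sketch, assuming bs4 is installed: `SAMPLE_HTML` and `HIDDEN_PARENTS` are names made up for illustration, `string=True` is used as the newer spelling of `text=True`, and whitespace-only nodes are dropped before joining (a small change from the code above).

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, made up for this sketch.
SAMPLE_HTML = """
<html>
  <head><title>Storm report</title><style>p { color: red; }</style></head>
  <body>
    <!-- editor's note: hidden -->
    <h1>Headline</h1>
    <p>First paragraph.</p>
    <script>var tracking = 1;</script>
  </body>
</html>
"""

HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}


def tag_visible(element):
    """Return True for text nodes a browser would normally render."""
    if element.parent.name in HIDDEN_PARENTS:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, "html.parser")
    texts = soup.find_all(string=True)
    visible = (t.strip() for t in texts if tag_visible(t))
    # Skip whitespace-only nodes so words are separated by single spaces.
    return " ".join(t for t in visible if t)


result = text_from_html(SAMPLE_HTML)
print(result)
```

The title, the stylesheet, the comment, and the script body should all be absent from the printed text.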

User · Answer

    from bs4 import BeautifulSoup
    from bs4.element import Comment
    import urllib.request
    import re
    import ssl


    def tag_visible(element):
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        # Skip nodes that start with newlines.
        if re.match(r"[\n]+", str(element)):
            return False
        return True


    def text_from_html(url):
        # Note: certificate verification is disabled here.
        body = urllib.request.urlopen(url, context=ssl._create_unverified_context()).read()
        soup = BeautifulSoup(body, "lxml")
        texts = soup.findAll(text=True)
        visible_texts = filter(tag_visible, texts)
        text = u",".join(t.strip() for t in visible_texts)
        text = text.lstrip().rstrip()
        text = text.split(',')
        clean_text = ''
        for sen in text:
            if sen:
                sen = sen.rstrip().lstrip()
                clean_text += sen + ','
        return clean_text


    url = 'http://www.nytimes.com/2009/12/21/us/21storm.html'
    print(text_from_html(url))

User · Answer

Using BeautifulSoup, the easiest way, with the least code, to get just the strings (without empty lines and other junk):

    tag = <Parent_Tag_that_contains_the_data>  # placeholder: markup of the parent tag
    soup = BeautifulSoup(tag, 'html.parser')

    for i in soup.stripped_strings:
        print(repr(i))
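Keep in mind that `stripped_strings` only trims each string and drops the whitespace-only ones. Depending on the bs4 version and parser, text inside `<script>` or `<style>` may still come through. A minimal sketch, assuming bs4 is installed and using a made-up fragment, that removes those elements first:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the parent tag's markup.
fragment = """
<div id="content">
  <script>var hidden = true;</script>
  <h2>  Tabs  </h2>
  <p>Body text
  </p>
</div>
"""

soup = BeautifulSoup(fragment, "html.parser")

# Drop script and style elements entirely before collecting strings.
for element in soup(["script", "style"]):
    element.decompose()

strings = list(soup.stripped_strings)
print(strings)
```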

User · Answer

The simplest way to handle this case is by using getattr(). You can adapt this example to your needs:

    from bs4 import BeautifulSoup

    source_html = """
    <span class="ratingsDisplay">
        <a class="ratingNumber" href="https://www.youtube.com/watch?v=oHg5SJYRHA0" target="_blank" rel="noopener">
            <span class="ratingsContent">3.7</span>
        </a>
    </span>
    """

    soup = BeautifulSoup(source_html, "lxml")
    my_ratings = getattr(soup.find('span', {"class": "ratingsContent"}), "text", None)
    print(my_ratings)

This will find the text element, "3.7", within the tag object <span class="ratingsContent">3.7</span> when it exists; when it does not, the result defaults to None.

    getattr(object, name[, default])

Return the value of the named attribute of object. name must be a string. If the string is the name of one of the object's attributes, the result is the value of that attribute. For example, getattr(x, 'foobar') is equivalent to x.foobar. If the named attribute does not exist, default is returned if provided, otherwise AttributeError is raised.
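To see both branches side by side, here is a small sketch, assuming bs4 is installed. The class name `noSuchClass` is made up so that the lookup fails, and `html.parser` is used in place of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

html = '<span class="ratingsContent">3.7</span>'
soup = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so getattr supplies the default
# instead of raising AttributeError on None.text.
found = getattr(soup.find("span", class_="ratingsContent"), "text", None)
missing = getattr(soup.find("span", class_="noSuchClass"), "text", None)

print(found)
print(missing)
```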

User · Answer

If you care about performance, here's another, more efficient way:

    import re

    INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
    RE_SPACES = re.compile(r'\s{3,}')

    def visible_texts(soup):
        """ get visible text from a document """
        text = ' '.join([
            s for s in soup.strings
            if s.parent.name not in INVISIBLE_ELEMS
        ])
        # collapse multiple spaces to two spaces.
        return RE_SPACES.sub('  ', text)

soup.strings is an iterator, and it returns NavigableString objects, so you can check the parent's tag name directly without going through multiple loops.
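A usage sketch for that function, repeated here so the block runs on its own. The sample markup is made up, and bs4 is assumed to be installed.

```python
import re
from bs4 import BeautifulSoup

INVISIBLE_ELEMS = ("style", "script", "head", "title")
RE_SPACES = re.compile(r"\s{3,}")


def visible_texts(soup):
    """Join every string whose parent is not an invisible element."""
    text = " ".join(
        s for s in soup.strings
        if s.parent.name not in INVISIBLE_ELEMS
    )
    # Runs of three or more whitespace characters become two spaces.
    return RE_SPACES.sub("  ", text)


# Hypothetical sample document.
sample = (
    "<html><head><title>Hidden title</title></head>"
    "<body><h1>Heading</h1>\n\n\n\n<p>Paragraph</p>"
    "<script>alert('x');</script></body></html>"
)

soup = BeautifulSoup(sample, "html.parser")
result = visible_texts(soup)
print(repr(result))
```

The newlines between the heading and the paragraph end up as the two-space separator, which makes it easy to split the result into blocks afterwards.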

User · Answer

I completely respect using Beautiful Soup to get rendered content, but it may not be the ideal package for acquiring the rendered content on a page.

I had a similar problem getting rendered content, that is, the visible content in a typical browser. In particular, I had many perhaps atypical cases to work with, such as the simple example below. In this case the non-displayable tag is nested in a style tag and is not visible in many browsers that I have checked. Other variations exist, such as defining a class that sets display to none and then using this class for the div.

    <html>
      <title>  Title here</title>

      <body>

        lots of text here <p> <br>
        <h1> even headings </h1>

        <style type="text/css">
            <div > this will not be visible </div>
        </style>

      </body>

    </html>

One solution posted above is:

    html = Utilities.ReadFile('simple.html')
    soup = BeautifulSoup.BeautifulSoup(html)
    texts = soup.findAll(text=True)
    visible_texts = filter(visible, texts)
    print(visible_texts)

    [u'\n', u'\n', u'\n\n        lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']

This solution certainly has applications in many cases and generally does the job quite well, but for the HTML posted above it retains the text that is not rendered. After searching SO, a couple of solutions came up, in "BeautifulSoup get_text does not strip all tags and JavaScript" and in "Rendered HTML to plain text using Python".

I tried both of these solutions, html2text and nltk.clean_html, and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speeds depend heavily on the contents of the data.

One answer here from @Helge was about using nltk, of all things.

    import nltk

    %timeit nltk.clean_html(html)

was returning 153 us per loop.

It worked really well to return a string with rendered HTML. This nltk module was faster than even html2text, though perhaps html2text is more robust.

    betterHTML = html.decode(errors='ignore')
    %timeit html2text.html2text(betterHTML)

    3.09 ms per loop
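For the display-none variation mentioned above, BeautifulSoup does not evaluate CSS, so class-based rules cannot be resolved with it alone. The inline case can be approximated, though. A rough sketch, assuming bs4 is installed; the sample markup and the `DISPLAY_NONE` pattern are made up for illustration, and only inline `style` attributes are handled:

```python
import re
from bs4 import BeautifulSoup

# Matches an inline "display: none" declaration, ignoring spacing and case.
DISPLAY_NONE = re.compile(r"display\s*:\s*none", re.IGNORECASE)

# Hypothetical sample; only the inline-style case is covered here.
sample = """
<body>
  <p>Shown paragraph</p>
  <div style="display: none">Hidden by inline style</div>
  <style type="text/css"><div>this will not be visible</div></style>
</body>
"""

soup = BeautifulSoup(sample, "html.parser")

# Remove style/script elements and anything hidden by an inline style.
for element in soup(["style", "script"]):
    element.decompose()
for element in soup.find_all(style=DISPLAY_NONE):
    element.decompose()

visible = " ".join(soup.stripped_strings)
print(visible)
```

Anything hidden through a stylesheet rule or JavaScript would still slip through; that would need a tool that actually renders the page.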

User · Answer

The approved answer from @jbochi does not work for me. The str() function call raises an exception because it cannot encode the non-ASCII characters in the BeautifulSoup element. Here is a more succinct way to filter the example web page to visible text:

    html = open('21storm.html').read()
    soup = BeautifulSoup(html)
    [s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
    visible_text = soup.getText()

User · Answer

While I would completely suggest using beautiful-soup in general, if anyone is looking to display the visible parts of malformed HTML (e.g. where you have just a segment or line of a web page) for whatever reason, the following will remove content between < and > tags:

    import re

    # only use with malformed html - this is not efficient
    def display_visible_html_using_re(text):
        return re.sub(r"(<.*?>)", "", text)
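Note that the regular expression removes the tags themselves but keeps whatever sat between them, including script bodies. As a hedged alternative for fragments, here is a sketch built on the standard library's `html.parser.HTMLParser`. This is a different technique from the regex above; the class name `VisibleTextParser` and the sample fragment are made up for illustration.

```python
import re
from html.parser import HTMLParser


def strip_tags_with_re(text):
    """Regex approach from the answer: delete anything between < and >."""
    return re.sub(r"<.*?>", "", text)


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


fragment = "<p>Hello <b>world</b></p><script>var x = 1;</script>"

regex_result = strip_tags_with_re(fragment)

parser = VisibleTextParser()
parser.feed(fragment)
parser.close()
parser_result = " ".join(parser.parts)

print(regex_result)
print(parser_result)
```

The regex version leaves the JavaScript source in the output, while the parser version drops it.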

User · Answer

The title is inside an <nyt_headline> tag, which is nested inside an <h1> tag and a <div> tag with id "article".

    soup.findAll('nyt_headline', limit=1)

Should work.

The article body is inside an <nyt_text> tag, which is nested inside a <div> tag with id "articleBody". Inside the <nyt_text> element, the text itself is contained within <p> tags. Images are not within those <p> tags. It's difficult for me to experiment with the syntax, but I expect a working scrape to look something like this:

    text = soup.findAll('nyt_text', limit=1)[0]
    text.findAll('p')
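Since the live page can't be tested here, the following is only a sketch against made-up markup that mirrors the structure described above, assuming bs4 is installed. It uses `find()` with a guard in place of `findAll(..., limit=1)[0]`, so a missing tag gives None or an empty list instead of an IndexError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the structure described above.
sample = """
<div id="article">
  <h1><nyt_headline>Storm headline</nyt_headline></h1>
  <div id="articleBody">
    <nyt_text>
      <p>First paragraph.</p>
      <img src="photo.jpg">
      <p>Second paragraph.</p>
    </nyt_text>
  </div>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

# find() returns the first match or None, so guard before using it.
headline_tag = soup.find("nyt_headline")
headline = headline_tag.get_text(strip=True) if headline_tag else None

body_tag = soup.find("nyt_text")
paragraphs = (
    [p.get_text(strip=True) for p in body_tag.find_all("p")]
    if body_tag else []
)

print(headline)
print(paragraphs)
```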
