BeautifulSoup getText from between <p>, not picking up subsequent paragraphs

Question

Firstly, I am a complete newbie when it comes to Python. However, I have written a piece of code to look at an RSS feed, open the link and extract the text from the article. This is what I have so far:

    from BeautifulSoup import BeautifulSoup
    import feedparser
    import re
    import urllib

    # Dictionaries
    links = {}
    titles = {}

    # Variables
    n = 0

    rss_url = "feed://www.gfsc.gg/_layouts/GFSC/GFSCRSSFeed.aspx?Division=ALL&Article=All&Title=News&Type=doc&List=%7b66fa9b18-776a-4e91-9f80-30195001386c%7d%23%7b679e913e-6301-4bc4-9fd9-a788b926f565%7d%23%7b0e65f37f-1129-4c78-8f59-3db5f96409fd%7d%23%7bdd7c290d-5f17-43b7-b6fd-50089368e090%7d%23%7b4790a972-c55f-46a5-8020-396780eb8506%7d%23%7b6b67c085-7c25-458d-8a98-373e0ac71c52%7d%23%7be3b71b9c-30ce-47c0-8bfb-f3224e98b756%7d%23%7b25853d98-37d7-4ba2-83f9-78685f2070df%7d%23%7b14c41f90-c462-44cf-a773-878521aa007c%7d%23%7b7ceaf3bf-d501-4f60-a3e4-2af84d0e1528%7d%23%7baf17e955-96b7-49e9-ad8a-7ee0ac097f37%7d%23%7b3faca1d0-be40-445c-a577-c742c2d367a8%7d%23%7b6296a8d6-7cab-4609-b7f7-b6b7c3a264d6%7d%23%7b43e2b52d-e4f1-4628-84ad-0042d644deaf%7d"

    # Parse the RSS feed
    feed = feedparser.parse(rss_url)

    # View the entire feed, one entry at a time
    for post in feed.entries:
        # Create variables from posts
        link = post.link
        title = post.title

        # Add the link to the dictionary
        n += 1
        links[n] = link

    for k, v in links.items():
        # Open RSS feed
        page = urllib.urlopen(v).read()
        page = str(page)
        soup = BeautifulSoup(page)

        # Find all of the text between paragraph tags and strip out the html
        page = soup.find('p').getText()

        # Strip ampersand codes and WATCH:
        page = re.sub(r'&\w+;', '', page)
        page = re.sub('WATCH:', '', page)

        # Print Page
        print(page)
        print(" ")

        # To stop after 3rd article, just whilst testing ** to be removed **
        if (k >= 3):
            break

This produces the following output:

    >>> (executing lines 1 to 45 of "RSS_BeautifulSoup.py")
    Total deposits held with Guernsey banks at the end of June 2012 increased 2.1% in sterling terms by £2.1 billion from the end of March 2012 level of £101 billion, up to £103.1 billion. This is 9.4% lower than the same time a year ago. Total assets and liabilities increased by £2.9 billion to £131.2 billion representing a 2.3% increase over the quarter though this was 5.7% lower than the level a year ago. The higher figures reflected the effects both of volume and exchange rate factors.

    The net asset value of total funds under management and administration has increased over the quarter ended 30 June 2012 by £711 million (0.3%) to reach £270.8 billion. For the year since 30 June 2011, total net asset values decreased by £3.6 billion (1.3%).

    The Commission has updated the warranties on the Form REG, Form QIF and Form FTL to take into account the Commission's Guidance Notes on Personal Questionnaires and Personal Declarations. In particular, the following warranty (varies slightly dependent on the application) has been inserted in the aforementioned forms:

    >>>

The problem is that this is only the first paragraph of each article, whereas I need to show the entire article. Any help would be gratefully received.

User · Accepted Answer

You are getting close!

    # Find all of the text between paragraph tags and strip out the html
    page = soup.find('p').getText()

Using find (as you've noticed) stops after finding one result. You need find_all if you want all the paragraphs. If the pages are formatted consistently (I just looked over one), you could also use something like

    soup.find('div', {'id': 'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})

to zero in on the body of the article.
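To make the difference between find and find_all concrete, here is a minimal sketch. It assumes BeautifulSoup 4 (the `bs4` package) rather than the version 3 import used in the question, where the equivalent method is spelled findAll. The HTML snippet and the `article-body` id are made up for illustration and are not taken from the real site.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4, not the version 3 import from the question

# Hypothetical stand-in for a downloaded article page.
html = """
<p>Site navigation blurb.</p>
<div id="article-body">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag in the document.
first_only = soup.find("p").get_text()

# find_all() returns every matching tag.
all_paragraphs = [p.get_text() for p in soup.find_all("p")]

# Narrowing to the article container first skips paragraphs outside it.
body = soup.find("div", {"id": "article-body"})
article_paragraphs = [p.get_text() for p in body.find_all("p")]

# Join the paragraphs back into one block of text.
article = "\n\n".join(article_paragraphs)
print(article)
```

In the question's loop, the same idea would replace the single `soup.find('p').getText()` call with a join over all matching paragraphs.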

User · Answer

This works well for specific articles where the text is all wrapped in <p> tags. Since the web is an ugly place, that's not always the case.

Often, websites will have text scattered all over, wrapped in different types of tags (e.g. maybe in a <span> or a <div>, or an <li>).

To find all text nodes in the DOM, you can use soup.find_all(text=True).

This is going to return some undesired text, like the contents of <script> and <style> tags. You'll need to filter out the text contents of elements you don't want.

    blacklist = [
        'style',
        'script',
        # other elements,
    ]

    text_elements = [t for t in soup.find_all(text=True) if t.parent.name not in blacklist]

If you are working with a known set of tags, you can take the opposite approach:

    whitelist = [
        'p'
    ]

    text_elements = [t for t in soup.find_all(text=True) if t.parent.name in whitelist]
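The two filters can be tried end to end with a small sketch like the one below. It assumes BeautifulSoup 4 and uses a made-up HTML snippet. It also uses `string=True`, the newer spelling of the `text=True` argument shown above, and drops whitespace-only nodes, which is an addition not in the original answer.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4

# Hypothetical page with text spread across several kinds of tags.
html = """
<html><head><style>p { color: red; }</style>
<script>var x = 1;</script></head>
<body><p>Paragraph text.</p><ul><li>List item.</li></ul><span>Span text.</span></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

blacklist = {"style", "script"}
whitelist = {"p"}

# Every text node whose parent tag is not blacklisted, skipping whitespace-only nodes.
blacklisted_out = [
    t.strip()
    for t in soup.find_all(string=True)
    if t.parent.name not in blacklist and t.strip()
]

# Only text nodes whose parent tag is explicitly whitelisted.
whitelisted_in = [
    t.strip()
    for t in soup.find_all(string=True)
    if t.parent.name in whitelist and t.strip()
]

print(blacklisted_out)
print(whitelisted_in)
```

Note that both filters look only at the immediate parent, so text inside a tag nested within a <p> (a link or bold span, for instance) would not pass the whitelist check as written.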
