[python] BeautifulSoup getText from between <p>, not picking up subsequent paragraphs

Firstly, I am a complete newbie when it comes to Python. However, I have written a piece of code to look at an RSS feed, open the link and extract the text from the article. This is what I have so far:

from BeautifulSoup import BeautifulSoup
import feedparser
import urllib

# Dictionaries
links = {}
titles = {}

# Variables
n = 0

rss_url = "feed://www.gfsc.gg/_layouts/GFSC/GFSCRSSFeed.aspx?Division=ALL&Article=All&Title=News&Type=doc&List=%7b66fa9b18-776a-4e91-9f80-    30195001386c%7d%23%7b679e913e-6301-4bc4-9fd9-a788b926f565%7d%23%7b0e65f37f-1129-4c78-8f59-3db5f96409fd%7d%23%7bdd7c290d-5f17-43b7-b6fd-50089368e090%7d%23%7b4790a972-c55f-46a5-8020-396780eb8506%7d%23%7b6b67c085-7c25-458d-8a98-373e0ac71c52%7d%23%7be3b71b9c-30ce-47c0-8bfb-f3224e98b756%7d%23%7b25853d98-37d7-4ba2-83f9-78685f2070df%7d%23%7b14c41f90-c462-44cf-a773-878521aa007c%7d%23%7b7ceaf3bf-d501-4f60-a3e4-2af84d0e1528%7d%23%7baf17e955-96b7-49e9-ad8a-7ee0ac097f37%7d%23%7b3faca1d0-be40-445c-a577-c742c2d367a8%7d%23%7b6296a8d6-7cab-4609-b7f7-b6b7c3a264d6%7d%23%7b43e2b52d-e4f1-4628-84ad-0042d644deaf%7d"

# Parse the RSS feed
feed = feedparser.parse(rss_url)

# view the entire feed, one entry at a time
for post in feed.entries:
    # Create variables from posts
    link = post.link
    title = post.title
    # Add the link to the dictionary
    n += 1
    links[n] = link

for k,v in links.items():
    # Open RSS feed
    page = urllib.urlopen(v).read()
    page = str(page)
    soup = BeautifulSoup(page)

    # Find all of the text between paragraph tags and strip out the html
    page = soup.find('p').getText()

    # Strip ampersand codes and WATCH:
    page = re.sub('&\w+;','',page)
    page = re.sub('WATCH:','',page)

    # Print Page
    print(page)
    print(" ")

    # To stop after 3rd article, just whilst testing ** to be removed **
    if (k >= 3):
        break

This produces the following output:

>>> (executing lines 1 to 45 of "RSS_BeautifulSoup.py")
?Total deposits held with Guernsey banks at the end of June 2012 increased 2.1% in sterling terms by £2.1 billion from the end of March 2012 level of £101 billion, up to £103.1 billion. This is 9.4% lower than the same time a year ago.  Total assets and liabilities increased by £2.9 billion to £131.2 billion representing a 2.3% increase over the quarter though this was 5.7% lower than the level a year ago.  The higher figures reflected the effects both of volume and exchange rate factors.

The net asset value of total funds under management and administration has increased over the quarter ended 30 June 2012 by £711 million (0.3%) to reach £270.8 billion.For the year since 30 June 2011, total net asset values decreased by £3.6 billion (1.3%).

The Commission has updated the warranties on the Form REG, Form QIF and Form FTL to take into account the Commission’s Guidance Notes on Personal Questionnaires and Personal Declarations.  In particular, the following warranty (varies slightly dependent on the application) has been inserted in the aforementioned forms,

>>> 

The problem is that this is the first paragraph of each article, however I need to show the entire article. Any help would be gratefully received.

This question is related to python python-2.7 beautifulsoup

The answer is


This works well for specific articles where the text is all wrapped in <p> tags. Since the web is an ugly place, it's not always the case.

Often, websites will have text scattered all over, wrapped in different types of tags (e.g. maybe in a <span> or a <div>, or an <li>).

To find all text nodes in the DOM, you can use soup.find_all(text=True).

This is going to return some undesired text, like the contents of <script> and <style> tags. You'll need to filter out the text contents of elements you don't want.

blacklist = [
  'style',
  'script',
  # other elements,
]

text_elements = [t for t in soup.find_all(text=True) if t.parent.name not in blacklist]

If you are working with a known set of tags, you can tag the opposite approach:

whitelist = [
  'p'
]

text_elements = [t for t in soup.find_all(text=True) if t.parent.name in whitelist]

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to python-2.7

Numpy, multiply array with scalar Not able to install Python packages [SSL: TLSV1_ALERT_PROTOCOL_VERSION] How to create a new text file using Python Could not find a version that satisfies the requirement tensorflow Python: Pandas pd.read_excel giving ImportError: Install xlrd >= 0.9.0 for Excel support Display/Print one column from a DataFrame of Series in Pandas How to calculate 1st and 3rd quartiles? How can I read pdf in python? How to completely uninstall python 2.7.13 on Ubuntu 16.04 Check key exist in python dict

Examples related to beautifulsoup

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org What should I use to open a url instead of urlopen in urllib3 TypeError: a bytes-like object is required, not 'str' in python and CSV UnicodeEncodeError: 'ascii' codec can't encode character at special name UnicodeEncodeError: 'charmap' codec can't encode characters bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library? Using BeautifulSoup to extract text without tags python BeautifulSoup parsing table install beautiful soup using pip Python BeautifulSoup extract text between element