[python] Python/BeautifulSoup - how to remove all tags from an element?

How can I simply strip all tags from an element I find in BeautifulSoup?

This question is related to python beautifulsoup

The answer is


it looks like this is the way to do! as simple as that

with this line you are joining together the all text parts within the current element

''.join(htmlelement.find(text=True))

You can use the decompose method in bs4:

soup = bs4.BeautifulSoup('<body><a href="http://example.com/">I linked to <i>example.com</i></a></body>')

for a in soup.find('a').children:
    if isinstance(a,bs4.element.Tag):
        a.decompose()

print soup

Out: <html><body><a href="http://example.com/">I linked to </a></body></html>

With BeautifulStoneSoup gone in bs4, it's even simpler in Python3

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
text = soup.get_text()
print(text)

why has no answer I've seen mentioned anything about the unwrap method? Or, even easier, the get_text method

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unwrap http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text


Here is the source code: you can get the text which is exactly in the URL

URL = ''
page = requests.get(URL)
soup = bs4.BeautifulSoup(page.content,'html.parser').get_text()
print(soup)

Use get_text(), it returns all the text in a document or beneath a tag, as a single Unicode string.

For instance, remove all different script tags from the following text:

<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>

The expected result is:

Signal et Communication
Ingénierie Réseaux et Télécommunications

Here is the source code:

#!/usr/bin/env python3
from bs4 import BeautifulSoup

text = '''
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
'''
soup = BeautifulSoup(text)

print(soup.get_text())

Code to simply get the contents as text instead of html:

'html_text' parameter is the string which you will pass in this function to get the text

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'lxml')
text = soup.get_text()
print(text)