How can I simply strip all tags from an element I find in BeautifulSoup?
This question is related to: python, beautifulsoup
It looks like this is the way to do it, as simple as that.
This line joins together all the text parts within the current element:
''.join(htmlelement.find_all(text=True))
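To make the joining approach concrete, here is a minimal sketch. It assumes bs4 4.4.0 or later, where the text filter is spelled string=True (older versions call it text=True). The sample markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the question.
html = '<p>Hello, <b>bold</b> and <i>italic</i> world</p>'
soup = BeautifulSoup(html, 'html.parser')
element = soup.find('p')

# find_all(string=True) returns every text node under the element,
# in document order, including the ones inside nested tags.
parts = element.find_all(string=True)

# Joining the text nodes gives the element's text with all tags stripped.
joined = ''.join(parts)
print(joined)
```

For a single element this should give the same string as element.get_text().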
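You can use the decompose method in bs4. Note that it removes each child tag together with its contents:
import bs4
soup = bs4.BeautifulSoup('<body><a href="http://example.com/">I linked to <i>example.com</i></a></body>')
# Copy the children into a list first, because decompose() changes the tree while we iterate.
for a in list(soup.find('a').children):
    if isinstance(a, bs4.element.Tag):
        a.decompose()
print(soup)
Out: <html><body><a href="http://example.com/">I linked to </a></body></html>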
With BeautifulStoneSoup gone in bs4, it's even simpler in Python 3:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # html is the markup string to strip
text = soup.get_text()
print(text)
Why has no answer I've seen mentioned anything about the unwrap method? Or, even easier, the get_text method?
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unwrap
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
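A short sketch of how the two methods differ, reusing the link markup from the decompose answer. get_text() only reads the text, while unwrap() edits the tree by removing a tag and keeping what was inside it. The variable names are illustrative, and 'html.parser' is an assumption made so the example runs without extra packages.

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')

# get_text() leaves the tree untouched and returns only the text.
text_only = link.get_text()

# unwrap() removes the <i> tag itself but keeps its contents in place.
link.i.unwrap()
unwrapped_html = str(soup)

print(text_only)
print(unwrapped_html)
```

Unlike decompose, neither call throws away the "example.com" text inside the nested tag.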
Here is the source code. It gets the text of the page at the given URL:
import bs4
import requests
URL = ''
page = requests.get(URL)
text = bs4.BeautifulSoup(page.content, 'html.parser').get_text()
print(text)
Use get_text(). It returns all the text in a document or beneath a tag, as a single Unicode string.
For instance, remove all the tags from the following text:
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
The expected result is:
Signal et Communication
Ingénierie Réseaux et Télécommunications
Here is the source code:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
text = '''
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
'''
soup = BeautifulSoup(text, 'html.parser')
print(soup.get_text())
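Plain get_text() also keeps the newlines that sit between the tags, so the printed result carries some extra blank lines. If you want exactly the two lines from the expected result, one option is the separator and strip arguments of get_text(). This is a sketch assuming the 'html.parser' backend; the name clean is just illustrative.

```python
from bs4 import BeautifulSoup

text = '''
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
'''
soup = BeautifulSoup(text, 'html.parser')

# strip=True trims every text fragment and drops the whitespace-only ones;
# separator is the string placed between the fragments that remain.
clean = soup.get_text(separator='\n', strip=True)
print(clean)
```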
Code to simply get the contents as text instead of HTML:
html_text is the HTML string you pass in to get its text.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_text, 'lxml')
text = soup.get_text()
print(text)
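If you want this as a reusable function, a minimal sketch could look like the following. The name html_to_text and the parser argument are hypothetical, not part of bs4. The default here is 'html.parser' so it runs without installing lxml; pass 'lxml' to match the snippet above.

```python
from bs4 import BeautifulSoup


def html_to_text(html_text, parser='html.parser'):
    """Return the text of html_text with all markup removed.

    html_to_text is a hypothetical helper; parser is any parser name
    that BeautifulSoup accepts and that is installed.
    """
    soup = BeautifulSoup(html_text, parser)
    return soup.get_text()


print(html_to_text('<p>Hello <b>world</b></p>'))
```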
Source: Stackoverflow.com