This works well for specific articles where the text is all wrapped in <p> tags. Since the web is an ugly place, it's not always the case.
Often, websites will have text scattered all over, wrapped in different types of tags (e.g. in a <span>, a <div>, or an <li>).
To find all text nodes in the DOM, you can use soup.find_all(text=True).
This is going to return some undesired text, like the contents of <script> and <style> tags. You'll need to filter out the text contents of elements you don't want.
blacklist = [
    'style',
    'script',
    # other elements,
]
text_elements = [t for t in soup.find_all(text=True) if t.parent.name not in blacklist]
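As a minimal end-to-end sketch of the blacklist approach, the snippet below runs the same filter on a small made-up page. The sample HTML, the extra 'title' entry, and the whitespace stripping are assumptions added for illustration. It uses string=True, which is the name newer Beautiful Soup releases give to the text=True argument shown above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, made up for illustration.
html = """
<html>
  <head>
    <title>Demo</title>
    <style>body { color: red; }</style>
    <script>console.log("hi");</script>
  </head>
  <body>
    <div>Intro text</div>
    <ul><li>First item</li></ul>
    <p>A paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Tags whose text we do not want; 'title' is an extra entry for this demo.
blacklist = {"style", "script", "title"}

# Keep text nodes whose direct parent is not blacklisted,
# and drop nodes that contain only whitespace.
text_elements = [
    t.strip()
    for t in soup.find_all(string=True)
    if t.parent.name not in blacklist and t.strip()
]

print(text_elements)  # ['Intro text', 'First item', 'A paragraph.']
```

Without the whitespace check, the list would also contain the newline and indentation strings that sit between the tags.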
If you are working with a known set of tags, you can take the opposite approach:
whitelist = [
    'p',
]
text_elements = [t for t in soup.find_all(text=True) if t.parent.name in whitelist]
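One thing to watch for with the whitelist: t.parent is only the direct parent, so text inside an inline tag nested in a <p> (a link, for example) is dropped. The sketch below compares the direct-parent check with a variant that walks every ancestor through .parents. The sample HTML and the names direct and nested are made up for illustration, and whether you want the nested text depends on your use case.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a paragraph containing a link.
html = '<div>Menu</div><p>Read the <a href="#">docs</a> first.</p>'
soup = BeautifulSoup(html, "html.parser")

whitelist = {"p"}

# Direct-parent check: "docs" is dropped because its parent is <a>.
direct = [t for t in soup.find_all(string=True) if t.parent.name in whitelist]

# Ancestor check: keep a text node if any enclosing tag is whitelisted.
nested = [
    t
    for t in soup.find_all(string=True)
    if any(parent.name in whitelist for parent in t.parents)
]

print([str(t) for t in direct])  # ['Read the ', ' first.']
print("".join(nested))           # Read the docs first.
```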