I am using BeautifulSoup to look for user-entered strings on a specific page. For example, I want to see if the string 'Python' is located on the page: http://python.org
When I used:
find_string = soup.body.findAll(text='Python')
find_string returned [].
But when I used:
find_string = soup.body.findAll(text=re.compile('Python'), limit=1)
find_string returned [u'Python Jobs'], as expected.
What is the difference between these two statements that makes the second one work when there is more than one instance of the word being searched for?
text='Python' matches only text nodes whose entire content is exactly the string you provided, whereas a compiled regex matches any text node that contains the pattern:
import re
from BeautifulSoup import BeautifulSoup
html = """<p>exact text</p>
<p>almost exact text</p>"""
soup = BeautifulSoup(html)
print soup(text='exact text')
print soup(text=re.compile('exact text'))
Output:
[u'exact text']
[u'exact text', u'almost exact text']
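For readers on Python 3 with the newer bs4 package, the same comparison might look roughly like the sketch below. It assumes bs4 is installed and uses the `string=` argument (the newer name for `text=`). The markup is made up to mimic a page where 'Python' only appears inside longer strings.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page (an assumption for illustration).
html = "<body><p>Python Jobs</p><p>About Python</p></body>"
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node, so nothing matches here.
exact = soup.body.find_all(string="Python")

# A compiled regex is applied with search(), so substrings match as well.
partial = soup.body.find_all(string=re.compile("Python"))

# limit=1 stops after the first match.
first = soup.body.find_all(string=re.compile("Python"), limit=1)

print(exact)
print(partial)
print(first)
```

This suggests that `limit=1` is not what makes the second call succeed; the regex is. The limit only trims the result list.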
"To see if the string 'Python' is located on the page http://python.org":
import urllib2
html = urllib2.urlopen('http://python.org').read()
print 'Python' in html # -> True
If you need the position of a substring within a string, you could use html.find('Python'), which returns the index of the first match or -1 if there is none.
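As a small illustration of the membership test and `find` on Python 3, here is a sketch that uses a hard-coded sample string in place of the downloaded page. Note that in Python 3, `urllib.request.urlopen(...).read()` returns bytes, so the real page would need decoding before being compared with a `str`.

```python
# Stand-in for the downloaded page source (assumed sample text, not the real page).
html = "<html><body><h1>Welcome to Python.org</h1></body></html>"

needle = "Python"  # the user-entered string

# Membership test: True if the substring occurs anywhere in the text.
is_present = needle in html

# find() gives the index of the first occurrence, or -1 when absent.
position = html.find(needle)
missing = html.find("Ruby")

if position != -1:
    print("found at index", position)
else:
    print("not found")
```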
In addition to the accepted answer, you can use a lambda instead of a regex:
from bs4 import BeautifulSoup
html = """<p>test python</p>"""
soup = BeautifulSoup(html, "html.parser")
print(soup(text="python"))
print(soup(text=lambda t: "python" in t))
Output:
[]
['test python']
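The same idea extends to a case-insensitive search with a named function. The sketch below assumes bs4 and the `string=` argument. The helper name `contains_python` and the markup are made up for illustration. The `None` check is a precaution: when the filter is combined with a tag name, it can be handed `None` for tags that have no single string child.

```python
from bs4 import BeautifulSoup

# Made-up markup (an assumption for illustration).
html = "<p>test Python</p><p><b>no</b> direct string</p><p>other</p>"
soup = BeautifulSoup(html, "html.parser")

def contains_python(s):
    # Hypothetical helper: guard against None, then compare case-insensitively.
    return s is not None and "python" in s.lower()

# Matching text nodes only.
strings = soup.find_all(string=contains_python)

# Matching <p> tags whose .string passes the filter.
tags = soup.find_all("p", string=contains_python)

print(strings)
print(tags)
```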
I have not used BeautifulSoup, but maybe the following can help in some tiny way.
import re
import urllib2
stuff = urllib2.urlopen(your_url_goes_here).read() # stuff will contain the *entire* page
# Replace the string Python with your desired regex
results = re.findall('(Python)',stuff)
for i in results:
    print i
I'm not suggesting this is a replacement, but maybe you can glean some value from the concept until a direct answer comes along.
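Since the question mentions user-entered strings, one caveat with the regex approach is that characters such as `+` or `.` have special meaning in a pattern. A minimal Python 3 sketch, assuming a hard-coded sample string in place of the page and a hypothetical helper named `find_all_literal`, could escape the input first:

```python
import re

# Stand-in for the page source (assumed sample text).
stuff = "Python Jobs. Learn Python. python.org (C++ is not here)"

def find_all_literal(needle, text, ignore_case=False):
    """Return every occurrence of needle in text, treating needle literally."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape neutralizes regex metacharacters in the user-entered string.
    return re.findall(re.escape(needle), text, flags)

case_sensitive = find_all_literal("Python", stuff)
case_insensitive = find_all_literal("python", stuff, ignore_case=True)
with_metachars = find_all_literal("C++", stuff)

print(case_sensitive)
print(len(case_insensitive))
print(with_metachars)
```

Without `re.escape`, an input like `C++` would be rejected as an invalid pattern rather than searched for literally.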
Source: Stackoverflow.com