How would I, using BeautifulSoup, search for tags containing ONLY the attributes I search for?
For example, I want to find all <td valign="top"> tags.
The following code:
raw_card_data = soup.fetch('td', {'valign':re.compile('top')})
gets all of the data I want, but it also grabs any <td> tag that has valign="top" alongside other attributes.
I also tried:
raw_card_data = soup.findAll(re.compile('<td valign="top">'))
and this returns nothing (probably because of bad regex)
I was wondering if there was a way in BeautifulSoup to say "Find <td> tags whose only attribute is valign:top"
UPDATE
For example, if an HTML document contained the following <td> tags:
<td valign="top">.....</td><br />
<td width="580" valign="top">.......</td><br />
<td>.....</td><br />
I would want only the first <td> tag (<td valign="top">) to be returned.
As explained in the BeautifulSoup documentation, you can use this:
soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : "top"})
EDIT :
To return tags that have only the valign="top" attribute, you can check the length of the tag's attrs property:
from bs4 import BeautifulSoup

html = '<td valign="top">.....</td>\
<td width="580" valign="top">.......</td>\
<td>.....</td>'
soup = BeautifulSoup(html, "html.parser")
# Match every <td> with valign="top", then keep those with no other attributes.
results = soup.findAll("td", {"valign" : "top"})
for result in results:
    if len(result.attrs) == 1:
        print(result)
That returns :
<td valign="top">.....</td>
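The length check above can be generalized by comparing the whole attribute dictionary instead of counting entries. The following is a minimal sketch, assuming bs4 with the built-in "html.parser"; the helper name find_exact is hypothetical and not part of BeautifulSoup. Note that multi-valued attributes such as class are stored as lists, so this simple equality test is only meant for single-valued attributes like valign.

```python
from bs4 import BeautifulSoup

html = (
    '<td valign="top">.....</td>'
    '<td width="580" valign="top">.......</td>'
    '<td>.....</td>'
)


def find_exact(soup, name, attrs):
    """Return tags called `name` whose attributes are exactly `attrs`.

    Hypothetical helper: find_all narrows the candidates, and the
    dictionary comparison rejects tags that carry extra attributes.
    """
    return [tag for tag in soup.find_all(name, attrs=attrs) if tag.attrs == attrs]


soup = BeautifulSoup(html, "html.parser")
matches = find_exact(soup, "td", {"valign": "top"})
print(matches)
```

Comparing dictionaries also covers the case where you want exactly two or three specific attributes, which a bare length check would not distinguish.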
If you only want to match on the attribute name, with any value:
from bs4 import BeautifulSoup
import re
soup= BeautifulSoup(html.text,'lxml')
results = soup.findAll("td", {"valign" : re.compile(r".*")})
As Steve Lorimer points out, it is better to pass True instead of a regex:
results = soup.findAll("td", {"valign" : True})
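To illustrate what passing True does, here is a small sketch with made-up sample markup (the three cells are assumptions for the demo, not from the question). It should match every td that has a valign attribute, whatever its value, and skip cells without one.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this demo.
html = (
    '<td valign="top">a</td>'
    '<td valign="bottom">b</td>'
    '<td width="580">c</td>'
)
soup = BeautifulSoup(html, "html.parser")

# True means "the attribute must be present"; its value is not checked.
with_valign = soup.find_all("td", {"valign": True})
values = [td["valign"] for td in with_valign]
print(values)
```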
Combining Chris Redford's and Amr's answers, you can also match an attribute name with any value using the select method:
from bs4 import BeautifulSoup as Soup
html = '<td valign="top">.....</td>\
<td width="580" valign="top">.......</td>\
<td>.....</td>'
soup = Soup(html, 'lxml')
results = soup.select('td[valign]')
The easiest way to do this is with the new CSS style select method:
soup = BeautifulSoup(html)
results = soup.select('td[valign="top"]')
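Keep in mind that a CSS attribute selector only requires the attribute to be present with that value, so it will also match cells that carry extra attributes. A rough sketch of narrowing the result to the "only this attribute" case, assuming bs4 with its bundled selector support and the sample HTML from the question:

```python
from bs4 import BeautifulSoup

html = (
    '<td valign="top">.....</td>'
    '<td width="580" valign="top">.......</td>'
    '<td>.....</td>'
)
soup = BeautifulSoup(html, "html.parser")

# The selector matches both cells that have valign="top".
selected = soup.select('td[valign="top"]')

# Keep only the cells whose single attribute is valign.
exact = [td for td in selected if len(td.attrs) == 1]
print(exact)
```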
Just pass it as an argument of findAll:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("""
... <html>
... <head><title>My Title!</title></head>
... <body><table>
... <tr><td>First!</td>
... <td valign="top">Second!</td></tr>
... </table></body></html>
... """)
>>>
>>> soup.findAll('td')
[<td>First!</td>, <td valign="top">Second!</td>]
>>>
>>> soup.findAll('td', valign='top')
[<td valign="top">Second!</td>]
You can use lambda functions in findAll, as explained in the documentation. In your case, to search for td tags whose only attribute is valign="top", use the following:
td_tag_list = soup.findAll(
    lambda tag: tag.name == "td" and
    len(tag.attrs) == 1 and
    # .get avoids a KeyError when the single attribute is not valign
    tag.get("valign") == "top")
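The same idea can be written as a named predicate, which is easier to test and reuse. This is a sketch under the assumption that bs4 is used; the function name only_valign_top and the extra width-only cell are made up for the demo. It uses tag.get so that a cell whose single attribute is something other than valign is rejected instead of raising KeyError.

```python
from bs4 import BeautifulSoup

html = (
    '<td valign="top">.....</td>'
    '<td width="580" valign="top">.......</td>'
    '<td width="580">...</td>'
    '<td>.....</td>'
)


def only_valign_top(tag):
    """Hypothetical predicate: a <td> whose sole attribute is valign="top"."""
    return (
        tag.name == "td"
        and len(tag.attrs) == 1
        and tag.get("valign") == "top"  # None when valign is absent
    )


soup = BeautifulSoup(html, "html.parser")
td_tag_list = soup.find_all(only_valign_top)
print(td_tag_list)
```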
To find by an attribute on any tag, whatever the tag name:
<th class="team" data-sort="team">Team</th>
soup.find_all(attrs={"class": "team"})
<th data-sort="team">Team</th>
soup.find_all(attrs={"data-sort": "team"})
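The attrs dictionary is useful here because some attribute names cannot be written as Python keyword arguments: class is a reserved word (bs4 accepts class_ instead), and data-sort contains a hyphen. A short sketch with invented table markup:

```python
from bs4 import BeautifulSoup

# Sample markup invented for this demo.
html = (
    '<table><tr>'
    '<th class="team" data-sort="team">Team</th>'
    '<th data-sort="rank">Rank</th>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# Both forms look up the class attribute on any tag.
by_class = soup.find_all(attrs={"class": "team"})
by_class_kw = soup.find_all(class_="team")

# data-sort=... is not a valid keyword argument, so attrs is required.
by_data = soup.find_all(attrs={"data-sort": "team"})

print(by_class, by_data)
```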
Source: Stackoverflow.com