soup.find("tagName", { "id" : "articlebody" })
Why does this NOT return the <div id="articlebody"> ... </div>
tags and stuff in between? It returns nothing. And I know for a fact it exists because I'm staring right at it from
soup.prettify()
soup.find("div", { "id" : "articlebody" })
also does not work.
(EDIT: I found that BeautifulSoup wasn't correctly parsing my page, which probably meant the page I was trying to parse isn't properly formatted in SGML or whatever)
I used:
soup.findAll('tag', attrs={'attrname':"attrvalue"})
as my syntax for find/findAll. That said, unless there are other optional parameters between the tag and the attribute list, this shouldn't behave any differently.
In the BeautifulSoup source, the following line allows divs to be nested within divs, so the concern you raised in lukas' comment wouldn't apply.
NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
What I think you need to do is to specify the attrs you want such as
source.find('div', attrs={'id':'articlebody'})
I think there is a problem when the 'div' tags are too deeply nested. I am trying to parse some contacts from a Facebook HTML file, and BeautifulSoup is not able to find "div" tags with class "fcontent".
This happens with other classes as well. When I search for divs in general, it returns only those that are not so deeply nested.
The HTML source can be any Facebook page showing the friends list of a friend of yours (not your own friends list). If someone can test it and give some advice, I would really appreciate it.
This is my code, where I just try to print the number of "div" tags with class "fcontent":
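from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path

# Parse the saved page and count the <div class="fcontent"> tags
with open('/Users/myUserName/Desktop/contacts.html') as f:
    soup = BeautifulSoup(f)
divs = soup.findAll('div', attrs={'class': 'fcontent'})  # avoid shadowing the built-in list
print(len(divs))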
soup.find("tagName",attrs={ "id" : "articlebody" })
The id attribute is supposed to be unique within a document, so you can search on it directly without even specifying the element. If your elements have ids, that makes the content much easier to parse.
divEle = soup.find(id = "articlebody")
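A minimal sketch of that id lookup, assuming Beautiful Soup 4 (the bs4 package) and the built-in "html.parser". The sample markup is made up for illustration and is not the asker's page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, only for illustration.
html = """
<html><body>
  <div id="header">Site title</div>
  <div id="articlebody"><p>First paragraph.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Search by id alone, without naming the tag.
by_id = soup.find(id="articlebody")

# Same search, restricted to <div> elements.
by_tag_and_id = soup.find("div", attrs={"id": "articlebody"})

# find() returns None when nothing matches, so check before using the result.
missing = soup.find(id="no-such-id")

print(by_id.p.get_text())
print(by_id is by_tag_and_id)
print(missing)
```

If find() comes back as None on a real page even though the id shows up in the raw source, the markup or the parser is the more likely culprit than the call itself.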
Here is a code fragment
soup = BeautifulSoup(open("index.html"))
titleList = soup.findAll('title')
divList = soup.findAll('div', attrs={ "class" : "article story"})
As you can see, I find all <title> tags and then all <div> tags with class="article story".
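One thing that may matter with a two-word class like this: as far as I know, Beautiful Soup 4 treats a string containing a space as the exact value of the class attribute, so the order of the class names counts. A small sketch, assuming bs4 with "html.parser" and made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the same two classes in different orders.
html = """
<div class="article story">both classes, same order</div>
<div class="story article">both classes, other order</div>
<div class="article">only one class</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches only divs whose class attribute is exactly "article story".
exact = soup.find_all("div", attrs={"class": "article story"})

# Matches any div that has "article" among its classes.
any_article = soup.find_all("div", class_="article")

# CSS selector: matches divs carrying both classes, in any order.
both = soup.select("div.article.story")

print(len(exact), len(any_article), len(both))
```

If the order of class names varies across the pages you scrape, the CSS selector form is probably the safer choice.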
Have you tried soup.findAll("div", {"id": "articlebody"})?
Sounds crazy, but if you're scraping stuff from the wild, you can't rule out multiple divs with the same id...
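To illustrate the point about duplicate ids, here is a small sketch, assuming Beautiful Soup 4 (where findAll is also available as find_all) and a made-up snippet of invalid markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that (incorrectly) reuses the same id twice.
html = """
<div id="articlebody">first</div>
<div id="articlebody">second</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns every match, so duplicates become visible.
matches = soup.find_all("div", {"id": "articlebody"})

# find() silently returns only the first match.
first = soup.find("div", {"id": "articlebody"})

texts = [div.get_text() for div in matches]
print(len(matches), texts)
print(first.get_text())
```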
from bs4 import BeautifulSoup
from requests_html import HTMLSession
url = 'your_url'
session = HTMLSession()
resp = session.get(url)
# Render JavaScript only if the element with id "articlebody" is generated dynamically; otherwise this step can be skipped
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
soup.find("div", {"id": "articlebody"})
To find an element by its id:
div = soup.find(id="articlebody")
Happened to me also while trying to scrape Google.
I ended up using pyquery.
Install:
pip install pyquery
Use:
from pyquery import PyQuery
pq = PyQuery('<html><body><div id="articlebody"> ... </div></body></html>')
tag = pq('div#articlebody')
Beautiful Soup 4 supports most CSS selectors with the .select() method, so you can use an id selector such as:
soup.select('#articlebody')
If you need to specify the element's type, you can add a type selector before the id selector:
soup.select('div#articlebody')
The .select() method returns a collection of elements, which means it returns the same results as the following .find_all() examples:
soup.find_all('div', id="articlebody")
# or
soup.find_all(id="articlebody")
If you only want to select a single element, you can just use the .find() method:
soup.find('div', id="articlebody")
# or
soup.find(id="articlebody")
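A short sketch that puts these calls side by side, assuming Beautiful Soup 4 with the built-in "html.parser" and a made-up snippet of markup. It also uses .select_one(), the single-element counterpart of .select(), which is not mentioned above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, only for illustration.
html = '<div id="articlebody"><p>text</p></div><p id="other">x</p>'
soup = BeautifulSoup(html, "html.parser")

# Both of these return a collection of matching elements.
selected = soup.select("div#articlebody")
found_all = soup.find_all("div", id="articlebody")

# Both of these return a single element, or None when nothing matches.
single = soup.select_one("#articlebody")
found = soup.find(id="articlebody")

print(len(selected), len(found_all))
print(single is found)
print(soup.select_one("#missing"))
```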
This is most probably a problem with the default BeautifulSoup parser. Switch to a different parser, such as 'lxml', and try again.
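One way to check whether the parser is the issue is to run the same lookup under several parsers and compare. This is only a sketch, assuming Beautiful Soup 4; find_with_parsers is a hypothetical helper name, and "lxml" and "html5lib" are optional packages that may not be installed, so they are skipped when missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def find_with_parsers(markup, element_id,
                      parsers=("html.parser", "lxml", "html5lib")):
    """Report, per parser, whether an element with the given id was found.

    The value is True/False, or None when that parser is not installed.
    """
    results = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            results[parser] = None
            continue
        results[parser] = soup.find(id=element_id) is not None
    return results


# Hypothetical sample markup, only for illustration.
sample = '<html><body><div id="articlebody">text</div></body></html>'
report = find_with_parsers(sample, "articlebody")
print(report)
```

If one parser finds the element and another does not, that points at malformed markup being repaired differently by each parser rather than at the find() call.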
Source: Stackoverflow.com