Beautiful Soup and extracting a div and its contents by ID

Question

    soup.find("tagName", { "id" : "articlebody" })

Why does this NOT return the <div id="articlebody"> ... </div> tags and stuff in between? It returns nothing. And I know for a fact it exists because I'm staring right at it from

    soup.prettify()

soup.find("div", { "id" : "articlebody" }) also does not work.

(EDIT: I found that BeautifulSoup wasn't correctly parsing my page, which probably meant the page I was trying to parse isn't properly formatted in SGML or whatever.)

User · Answer

An id is supposed to be unique within a document. That means you can search for it directly without even specifying the element name, so it is a plus if your elements have one when you parse the content.

    divEle = soup.find(id="articlebody")
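A minimal sketch of the id-only lookup with the current bs4 package. The markup and the variable names are made up for illustration, and the built-in "html.parser" is an arbitrary choice:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = '<html><body><div><div id="articlebody"><p>Hello</p></div></div></body></html>'

soup = BeautifulSoup(html, "html.parser")

# Search by id alone; no tag name is needed.
div_ele = soup.find(id="articlebody")

# find() returns None when nothing matches, so check before using the result.
if div_ele is not None:
    print(div_ele.name)        # tag name of the match
    print(div_ele.get_text())  # text inside the div
```

Checking for None first avoids an AttributeError when the id is absent from the parsed tree.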

User · Answer

I think there is a problem when the <div> tags are nested too deeply. I am trying to parse some contacts from a Facebook HTML file, and BeautifulSoup is not able to find <div> tags with class "fcontent".

This happens with other classes as well. When I search for divs in general, it returns only those that are not nested so deeply.

The HTML source can be any Facebook page showing the friends list of one of your friends (not your own friends list). If someone can test it and give some advice I would really appreciate it.

This is my code, where I just try to print the number of <div> tags with class "fcontent":

    from BeautifulSoup import BeautifulSoup
    f = open('/Users/myUserName/Desktop/contacts.html')
    soup = BeautifulSoup(f)
    list = soup.findAll('div', attrs={'class':'fcontent'})
    print len(list)
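For comparison, a sketch of the same kind of count with bs4, where find_all() searches all descendants by default. The markup below is invented and much simpler than a real Facebook page, so it only shows that nesting depth alone does not hide a tag from the search:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the target divs sit several levels deep.
html = """
<div><div><div>
  <div class="fcontent">Alice</div>
  <div><div class="fcontent extra">Bob</div></div>
</div></div></div>
"""

soup = BeautifulSoup(html, "html.parser")

# "class" is a reserved word in Python, so bs4 accepts the keyword class_.
# It matches any tag whose class list contains "fcontent".
matches = soup.find_all("div", class_="fcontent")
print(len(matches))
```

If the count is still lower than expected on a real page, inspecting soup.prettify() shows the tree the parser actually built.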

User · Answer

To find an element by its id:

    div = soup.find(id="articlebody")

User · Answer

    from bs4 import BeautifulSoup
    from requests_html import HTMLSession

    url = 'your url'
    session = HTMLSession()
    resp = session.get(url)

    # Only needed if the element with id "articlebody" is generated dynamically;
    # otherwise there is no need to render.
    resp.html.render()
    soup = BeautifulSoup(resp.html.html, "lxml")

    soup.find("div", {"id": "articlebody"})

User · Answer

In the BeautifulSoup source this line allows divs to be nested within divs, so your concern in lukas' comment wouldn't be valid.

    NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']

What I think you need to do is to specify the attrs you want, such as

    source.find('div', attrs={'id':'articlebody'})

User · Answer

Here is a code fragment:

    soup = BeautifulSoup(open("index.html"))
    titleList = soup.findAll('title')
    divList = soup.findAll('div', attrs={ "class" : "article story"})

As you can see, I find all <title> tags and then I find all <div> tags with class="article story".

User · Answer

Most probably this is because the default BeautifulSoup parser has a problem with the page. Switch to a different parser, like 'lxml', and try again.
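One way to check whether the parser is the culprit is to run the same lookup under each parser that happens to be installed. This is only a sketch: the helper name find_with_parsers and the sample markup are made up, and "lxml" and "html5lib" are skipped when those packages are missing:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, deliberately sloppy markup (the <p> is never closed).
html = '<html><body><p>intro<div id="articlebody">text</div></body></html>'


def find_with_parsers(markup, element_id, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: bool}: whether each installed parser yields a tag with that id."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # That parser library is not installed; skip it.
            continue
        results[name] = soup.find(id=element_id) is not None
    return results


results = find_with_parsers(html, "articlebody")
print(results)
```

A parser that reports False while the others report True is a good hint that the page is being parsed into a different tree than expected.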

User · Answer

I used:

    soup.findAll('tag', attrs={'attrname':"attrvalue"})

as my syntax for find/findAll. That said, unless there are other optional parameters between the tag and attribute list, this shouldn't be different.

User · Answer

Beautiful Soup 4 supports most CSS selectors with the .select() method, therefore you can use an id selector such as:

    soup.select('#articlebody')

If you need to specify the element's type, you can add a type selector before the id selector:

    soup.select('div#articlebody')

The .select() method will return a collection of elements, which means that it would return the same results as the following .find_all() method example:

    soup.find_all('div', id="articlebody")
    # or
    soup.find_all(id="articlebody")

If you only want to select a single element, then you could just use the .find() method:

    soup.find('div', id="articlebody")
    # or
    soup.find(id="articlebody")
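A short sketch comparing the two styles side by side on an invented document. It also uses select_one(), the bs4 counterpart of find() for CSS selectors, which this answer does not mention:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = '<body><div id="articlebody">A</div><p id="other">B</p></body>'
soup = BeautifulSoup(html, "html.parser")

by_select = soup.select("div#articlebody")            # list-like collection of matches
by_find_all = soup.find_all("div", id="articlebody")  # same matches via find_all()
first = soup.select_one("#articlebody")               # single element, or None

print([tag["id"] for tag in by_select])
print([tag["id"] for tag in by_find_all])
print(first.get_text() if first is not None else None)
```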

User · Answer

You should post your example document, because the code works fine:

    >>> import BeautifulSoup
    >>> soup = BeautifulSoup.BeautifulSoup('<html><body><div id="articlebody"> ... </div></body></html>')
    >>> soup.find("div", {"id": "articlebody"})
    <div id="articlebody"> ... </div>

Finding <div>s inside <div>s works as well:

    >>> soup = BeautifulSoup.BeautifulSoup('<html><body><div><div id="articlebody"> ... </div></div></body></html>')
    >>> soup.find("div", {"id": "articlebody"})
    <div id="articlebody"> ... </div>

User · Answer

Have you tried soup.findAll("div", {"id": "articlebody"})?

It sounds crazy, but if you're scraping stuff from the wild, you can't rule out multiple divs with the same id.
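A quick sketch of that check in bs4 (find_all is the newer spelling of findAll). The duplicated id in the markup is contrived on purpose:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the same id twice.
html = '<div id="articlebody">first</div><div id="articlebody">second</div>'
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every match, so duplicates become visible.
all_matches = soup.find_all("div", id="articlebody")
print(len(all_matches))

# find() stops at the first match in document order.
first_match = soup.find("div", id="articlebody")
print(first_match.get_text())
```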

User · Answer

    soup.find("tagName", attrs={ "id" : "articlebody" })

User · Answer

This happened to me also while trying to scrape Google. I ended up using pyquery.

Install:

    pip install pyquery

Use:

    from pyquery import PyQuery
    pq = PyQuery('<html><body><div id="articlebody"> ... </div></body></html>')
    tag = pq('div#articlebody')
