Python BeautifulSoup extract text between element

Question

I am trying to extract "THIS IS MY TEXT" from the following HTML:

    <html>
    <body>
    <table>
       <td class="MYCLASS">
          <!-- a comment -->
          <a hef="xy">Text</a>
          <p>something</p>
          THIS IS MY TEXT
          <p>something else</p>
          </br>
       </td>
    </table>
    </body>
    </html>

I tried it this way:

    soup = BeautifulSoup(html)

    for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
        print hit.text

But I get all the text between all nested tags, plus the comment.

Can anyone help me to get just "THIS IS MY TEXT" out of this?

User · Answer

Short answer:

    soup.findAll('p')[0].next.next

Real answer: You need an invariant reference point from which you can get to your target.

You mention in your comment to Haidro's answer that the text you want is not always in the same place. Find a sense in which it is in the same place relative to some element. Then figure out how to make BeautifulSoup navigate the parse tree following that invariant path.

For example, in the HTML you provide in the original post, the target string appears immediately after the first paragraph element, and that paragraph is not empty. Since findAll('p') will find paragraph elements, soup.findAll('p')[0] will be the first paragraph element.

You could in this case use soup.find('p'), but soup.findAll('p')[n] is more general, since maybe your actual scenario needs the 5th paragraph or something like that.

The next attribute will be the next parsed element in the tree, including children. So soup.findAll('p')[0].next contains the text of the paragraph, and soup.findAll('p')[0].next.next will return your target in the HTML provided.
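A minimal sketch of the same invariant-path idea for Python 3 and bs4, where the parse-order link is spelled next_element. The helper name text_after_paragraph, the trimmed copy of the question's markup, and the use of the standard-library "html.parser" are assumptions for illustration. It also assumes the chosen paragraph holds exactly one text node.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (assumption: html.parser is used).
HTML = """
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a href="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")


def text_after_paragraph(soup, n=0):
    """Hypothetical helper: stripped string parsed right after the n-th <p>."""
    paragraphs = soup.find_all("p")
    if n >= len(paragraphs):
        return None
    inner = paragraphs[n].next_element      # the paragraph's own text node
    if inner is None:
        return None
    target = inner.next_element             # the node parsed right after it
    # Only strings can be stripped; a Tag here means the invariant failed.
    return target.strip() if isinstance(target, str) else None


result = text_after_paragraph(soup)
print(result)
```

If the paragraph were empty or contained nested tags, the two-step walk would land somewhere else, which is exactly why the path has to be chosen to match the real pages.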

User · Answer

Use .children instead:

    from bs4 import NavigableString, Comment
    print ''.join(unicode(child) for child in hit.children
        if isinstance(child, NavigableString) and not isinstance(child, Comment))

Yes, this is a bit of a dance.

Output:

    >>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    ...     print ''.join(unicode(child) for child in hit.children
    ...         if isinstance(child, NavigableString) and not isinstance(child, Comment))
    ...


          THIS IS MY TEXT
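The same filter can be sketched for Python 3, where unicode no longer exists and str takes its place. The helper name direct_text, the trimmed markup, and the "html.parser" choice are assumptions made for this example, not part of the answer.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Trimmed copy of the question's markup (assumption: html.parser is used).
HTML = """
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a href="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")


def direct_text(tag):
    """Hypothetical helper: join the tag's own strings, skipping comments."""
    parts = [
        str(child)
        for child in tag.children
        # Comment subclasses NavigableString, so it must be excluded explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    ]
    return "".join(parts).strip()


cell = soup.find(attrs={"class": "MYCLASS"})
result = direct_text(cell)
print(result)
```

Only direct children are inspected, so the text inside the nested a and p tags is left out.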

User · Answer

You can use .contents:

    >>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    ...     print hit.contents[6].strip()
    ...
    THIS IS MY TEXT
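The index 6 depends on the whitespace nodes and the comment that precede the target, so it can be worth listing the candidates before hard-coding it. A small sketch for Python 3 and bs4, assuming a trimmed copy of the question's markup and the "html.parser" parser; the variable name indexed is made up for the example.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Trimmed copy of the question's markup (assumption: html.parser is used).
HTML = """
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a href="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
cell = soup.find(attrs={"class": "MYCLASS"})

# Pair each non-blank, non-comment string in .contents with its index.
indexed = [
    (i, node.strip())
    for i, node in enumerate(cell.contents)
    if isinstance(node, NavigableString)
    and not isinstance(node, Comment)
    and node.strip()
]
print(indexed)
```

With this markup and parser the only entry should be the target at index 6, matching the answer. A different parser or different whitespace in the source can shift that number.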

User · Answer

    soup = BeautifulSoup(html)
    for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
        hit = hit.text.strip()
        print hit

This will print: THIS IS MY TEXT. Try this.

User · Answer

With your own soup object:

    soup.p.next_sibling.strip()

1. You grab the <p> directly with soup.p (this hinges on it being the first <p> in the parse tree; otherwise just find the element using your choice of filter(s)).
2. Then use next_sibling on the tag object that soup.p returns, since the desired text is nested at the same level of the parse tree as the <p>.
3. .strip() is just a Python str method to remove leading and trailing whitespace.

In the interpreter this looks something like:

    In [4]: soup.p
    Out[4]: <p>something</p>

    In [5]: type(soup.p)
    Out[5]: bs4.element.Tag

    In [6]: soup.p.next_sibling
    Out[6]: u'\n      THIS IS MY TEXT\n      '

    In [7]: type(soup.p.next_sibling)
    Out[7]: bs4.element.NavigableString

    In [8]: soup.p.next_sibling.strip()
    Out[8]: u'THIS IS MY TEXT'

    In [9]: type(soup.p.next_sibling.strip())
    Out[9]: unicode
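A slightly more defensive sketch of the same next_sibling approach for Python 3, scoped to the MYCLASS cell instead of the whole document. The trimmed markup, the "html.parser" choice, and the None fallbacks are assumptions added for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the question's markup (assumption: html.parser is used).
HTML = """
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a href="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

cell = soup.find("td", class_="MYCLASS")
first_p = cell.find("p") if cell is not None else None

# next_sibling stays on the same level of the tree as the <p>.
sibling = first_p.next_sibling if first_p is not None else None

# Guard against the sibling being a Tag or missing entirely.
result = sibling.strip() if isinstance(sibling, NavigableString) else None
print(result)
```

Under Python 3 the stripped value is a plain str rather than unicode.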

User · Answer

Learn more about how to navigate through the parse tree in BeautifulSoup. The parse tree is made of tags and NavigableStrings (such as "THIS IS MY TEXT"). An example:

    from BeautifulSoup import BeautifulSoup
    doc = ['<html><head><title>Page title</title></head>',
           '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
           '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
           '</html>']
    soup = BeautifulSoup(''.join(doc))

    print soup.prettify()
    # <html>
    #  <head>
    #   <title>
    #    Page title
    #   </title>
    #  </head>
    #  <body>
    #   <p id="firstpara" align="center">
    #    This is paragraph
    #    <b>
    #     one
    #    </b>
    #    .
    #   </p>
    #   <p id="secondpara" align="blah">
    #    This is paragraph
    #    <b>
    #     two
    #    </b>
    #    .
    #   </p>
    #  </body>
    # </html>

To move down the parse tree you have contents and string.

- contents is an ordered list of the Tag and NavigableString objects contained within a page element.
- If a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0].

For the above, that is to say you can get:

    soup.b.string
    # u'one'
    soup.b.contents[0]
    # u'one'

For several child nodes, you can have for instance:

    pTag = soup.p
    pTag.contents
    # [u'This is paragraph ', <b>one</b>, u'.']

So here you may play with contents and get the item at the index you want.

You can also iterate over a Tag, as a shortcut. For instance:

    for i in soup.body:
        print i
    # <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
    # <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
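The example above uses the old BeautifulSoup 3 import. A rough equivalent for Python 3 and bs4 is sketched here, assuming the "html.parser" parser and with the p, body, and html tags closed explicitly, since that parser is not assumed to repair unclosed paragraphs the way BeautifulSoup 3 did.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Same document as above, but with closing tags written out (assumption).
doc = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(doc, "html.parser")

# A tag with a single string child exposes it as .string and .contents[0].
b_string = soup.b.string
b_first = soup.b.contents[0]

# A tag with several children has no single .string, so use .contents.
p_contents = soup.p.contents
p_string = soup.p.string

# Iterating over a Tag walks its direct children.
body_ids = [child.get("id") for child in soup.body if isinstance(child, Tag)]

print(b_string, b_first)
print(p_contents)
print(p_string)
print(body_ids)
```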

User · Answer

The BeautifulSoup documentation provides an example about removing objects from a document using the extract method. In the following example the aim is to remove all comments from the document:

"Removing Elements: Once you have a reference to an element, you can rip it out of the tree with the extract method. This code removes all the comments from a document:"

    from BeautifulSoup import BeautifulSoup, Comment
    soup = BeautifulSoup("""1<!--The loneliest number-->
                            <a>2<!--Can be as bad as one--><b>3""")
    comments = soup.findAll(text=lambda text:isinstance(text, Comment))
    [comment.extract() for comment in comments]
    print soup
    # 1
    # <a>2<b>3</b></a>
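For bs4 on Python 3, the same comment removal might look like the sketch below. It assumes the "html.parser" parser, the string= keyword (the newer spelling of text= in find_all), and input written on one line with its tags closed so the output is easy to predict.

```python
from bs4 import BeautifulSoup, Comment

# One-line variant of the documentation's input, with tags closed (assumption).
soup = BeautifulSoup(
    "1<!--The loneliest number--><a>2<!--Can be as bad as one--><b>3</b></a>",
    "html.parser",
)

# Collect every Comment node first, then detach each one from the tree.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
for comment in comments:
    comment.extract()

result = str(soup)
print(result)  # 1<a>2<b>3</b></a>
```

Collecting the comments into a list before calling extract avoids modifying the tree while it is still being searched.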
