Can we use XPath with BeautifulSoup?

Question

I am using BeautifulSoup to scrape a URL and I had the following code:

    import urllib
    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    the_page = response.read()
    soup = BeautifulSoup(the_page)
    soup.findAll('td', attrs={'class': 'empformbody'})

In the above code we can use findAll to get tags and the information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If so, can anyone please provide example code?

User · Answer

    from lxml import etree
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open('path of your localfile.html'), 'html.parser')
    # Serialize the soup and re-parse it with lxml to get XPath support
    dom = etree.HTML(str(soup))
    print(dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()'))

The above combines the Soup object with lxml, so that values can be extracted using XPath.

User · Answer

Nope, BeautifulSoup, by itself, does not support XPath expressions.

An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup-compatible mode where it'll try to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.

Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.

    try:
        # Python 2
        from urllib2 import urlopen
    except ImportError:
        # Python 3
        from urllib.request import urlopen
    from lxml import etree

    url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
    response = urlopen(url)
    htmlparser = etree.HTMLParser()
    tree = etree.parse(response, htmlparser)
    # xpathselector is a placeholder for your own XPath expression string
    tree.xpath(xpathselector)

There is also a dedicated lxml.html module with additional functionality.

Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:

    import lxml.html
    import requests

    url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
    response = requests.get(url, stream=True)
    response.raw.decode_content = True
    tree = lxml.html.parse(response.raw)

Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:

    from lxml.cssselect import CSSSelector

    td_empformbody = CSSSelector('td.empformbody')
    for elem in td_empformbody(tree):
        # Do something with these table cells.
        pass

Coming full circle: BeautifulSoup itself does have very complete CSS selector support:

    for cell in soup.select('table#foobar td.empformbody'):
        # Do something with these table cells.
        pass
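To try the .xpath() method without a network request, here is a minimal sketch that parses an inline HTML string. The PAGE fragment, the foobar table id, and the cell contents are made up for illustration; only the td.empformbody class comes from the question.

```python
from lxml import html

# Hypothetical page fragment standing in for the real response body.
PAGE = """
<html><body>
  <table id="foobar">
    <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
    <tr><td class="empformbody">Bob</td></tr>
  </table>
</body></html>
"""

tree = html.fromstring(PAGE)

# Select every <td> whose class attribute is exactly "empformbody".
cells = tree.xpath('//td[@class="empformbody"]')
cell_texts = [cell.text_content().strip() for cell in cells]

# XPath can also return the text nodes directly.
direct_texts = tree.xpath('//table[@id="foobar"]//td[@class="empformbody"]/text()')

print(cell_texts)
print(direct_texts)
```

Note that @class="empformbody" compares the whole attribute value, so a cell with class="empformbody wide" would not match; the CSS form td.empformbody matches on the class token instead.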

User · Answer

I've searched through their docs and it seems there is no XPath option. Also, as you can see in a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.

User · Answer

Maybe you can try the following without XPath:

    from simplified_scrapy.simplified_doc import SimplifiedDoc

    html = '''
    <html>
     <body>
      <div>
       <h1>Example Domain</h1>
       <p>This domain is for use in illustrative examples in documents. You may use this
       domain in literature without prior coordination or asking for permission.</p>
       <p><a href="https://www.iana.org/domains/example">More information...</a></p>
      </div>
     </body>
    </html>
    '''
    # What XPath can do, so can it
    doc = SimplifiedDoc(html)
    # The result is the same as
    # doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
    print(doc.body.div.h1.text)
    print(doc.div.h1.text)
    print(doc.h1.text)  # Shorter paths will be faster
    print(doc.div.getChildren())
    print(doc.div.getChildren('p'))

User · Answer

I can confirm that there is no XPath support within Beautiful Soup.

User · Answer

As others have said, BeautifulSoup doesn't have XPath support. There are probably a number of ways to get something from an XPath, including using Selenium. However, here's a solution that works in either Python 2 or 3:

    from lxml import html
    import requests

    page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
    tree = html.fromstring(page.content)
    # This will create a list of buyers:
    buyers = tree.xpath('//div[@title="buyer-name"]/text()')
    # This will create a list of prices:
    prices = tree.xpath('//span[@class="item-price"]/text()')

    print('Buyers: ', buyers)
    print('Prices: ', prices)

I used this as a reference.

User · Answer

With lxml it is simple:

    tree = lxml.html.fromstring(html)
    i_need_element = tree.xpath('//a[@class="shared-components"]/@href')

With BeautifulSoup (bs4) it is simple too. First, remove "//" and "@". Second, add a star before "=". Try this magic:

    soup = BeautifulSoup(html, "lxml")
    i_need_element = soup.select('a[class*="shared-components"]')

As you can see, select() cannot return an attribute, so I removed the "/@href" part. Also note that *= is a substring match, so it is looser than the exact comparison in the XPath version.
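To recover the href values that the XPath /@href step would have returned, one option is to read the attribute from each selected tag. This is a minimal sketch on a made-up PAGE fragment; it uses the a.shared-components class selector rather than the substring form, and the built-in html.parser so that only bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
PAGE = """
<div>
  <a class="shared-components" href="/one">One</a>
  <a class="shared-components extra" href="/two">Two</a>
  <a class="other" href="/three">Three</a>
</div>
"""

soup = BeautifulSoup(PAGE, "html.parser")

# "a.shared-components" matches anchors that carry that class token.
links = soup.select("a.shared-components")

# CSS selectors return tags, not attributes, so read href in Python.
hrefs = [a["href"] for a in links]
print(hrefs)
```

The class selector matches both anchors that carry the shared-components token, whereas the XPath test @class="shared-components" would only match the first, because it compares the full attribute string.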

User · Answer

This is a pretty old thread, but there is a work-around solution now, which may not have been in BeautifulSoup at the time.

Here is an example of what I did. I use the "requests" module to read an RSS feed and get its text content in a variable called "rss_text". With that, I run it through BeautifulSoup, search for the path /rss/channel/title, and retrieve its contents. It's not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.

    from bs4 import BeautifulSoup

    rss_obj = BeautifulSoup(rss_text, 'xml')
    cls_title = rss_obj.rss.channel.title.get_text()
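One limitation of the dotted path worth seeing in practice: each step follows only the first matching tag. The sketch below uses a made-up rss_text feed (not the answerer's data) and assumes lxml is installed, since the 'xml' parser depends on it.

```python
from bs4 import BeautifulSoup

# Hypothetical feed content standing in for the downloaded RSS text.
rss_text = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
  </channel>
</rss>
"""

rss_obj = BeautifulSoup(rss_text, "xml")

# Dotted access follows the first matching tag at each step.
feed_title = rss_obj.rss.channel.title.get_text()

# To collect every match, call find_all() on the parent element instead.
item_titles = [
    item.title.get_text()
    for item in rss_obj.rss.channel.find_all("item")
]

print(feed_title)
print(item_titles)
```

So the dotted form behaves like an XPath that takes only the first node at every step; anything that needs all matches has to fall back to find_all() or select().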

User · Answer

BeautifulSoup has a function named findNext (find_next in bs4) that searches forward from the current element, so:

    father.findNext('div', {'class': 'class_value'}).findNext('div', {'id': 'id_value'}).findAll('a')

The above code roughly imitates the following XPath:

    div[@class="class_value"]/div[@id="id_value"]
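Because find_next() walks forward through the document and is not limited to the current element's descendants, a chain of find() calls or a CSS selector may be a closer match for a nested path. A minimal sketch, using made-up markup and the class_value and id_value placeholders from the answer:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
PAGE = """
<div class="class_value">
  <div id="id_value">
    <a href="/inside">inside</a>
  </div>
</div>
<div id="elsewhere"><a href="/outside">outside</a></div>
"""

soup = BeautifulSoup(PAGE, "html.parser")

# find() only searches descendants, so each step narrows the scope.
outer = soup.find("div", {"class": "class_value"})
inner = outer.find("div", {"id": "id_value"})
chained = [a["href"] for a in inner.find_all("a")]

# The same path written as a CSS selector with child combinators.
selected = [
    a["href"]
    for a in soup.select("div.class_value > div#id_value > a")
]

print(chained)
print(selected)
```

Both approaches skip the link in the unrelated div, which a forward search with find_next() would not guarantee.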
