I've been playing around with scraping data from web pages using VBS/VBA.
If it were Javascript I'd be away as its easy, but it doesn't seem to be quite as straight forward in VBS/VBA.
This is an example I made for an answer, it works but I had planned on accessing the child nodes using getElementByTagName
but I could not figure out how to use them! The HTMLElement
object does not have those methods.
Sub Scrape()
Dim Browser As InternetExplorer
Dim Document As HTMLDocument
Dim Elements As IHTMLElementCollection
Dim Element As IHTMLElement
Set Browser = New InternetExplorer
Browser.navigate "http://www.hsbc.com/about-hsbc/leadership"
Do While Browser.Busy And Not Browser.readyState = READYSTATE_COMPLETE
DoEvents
Loop
Set Document = Browser.Document
Set Elements = Document.getElementsByClassName("profile-col1")
For Each Element in Elements
Debug.Print "[ name] " & Trim(Element.Children(1).Children(0).innerText)
Debug.Print "[ title] " & Trim(Element.Children(1).Children(1).innerText)
Next Element
Set Document = Nothing
Set Browser = Nothing
End Sub
I have been looking at the HTMLElement.document
property, seeing if it is like a fragment of the document but its either difficult to work with or just isnt what I think
Dim Fragment As HTMLDocument
Set Element = Document.getElementById("example") ' This works
Set Fragment = Element.document ' This doesn't
This also seems a long winded way to do it (although thats usually the way for vba imo). Anyone know if there is a simpler way to chain functions?
Document.getElementById("target").getElementsByTagName("tr")
would be awesome...
This question is related to
vba
web-scraping
Sub Scrape()
Dim Browser As InternetExplorer
Dim Document As htmlDocument
Dim Elements As IHTMLElementCollection
Dim Element As IHTMLElement
Set Browser = New InternetExplorer
Browser.Visible = True
Browser.navigate "http://www.stackoverflow.com"
Do While Browser.Busy And Not Browser.readyState = READYSTATE_COMPLETE
DoEvents
Loop
Set Document = Browser.Document
Set Elements = Document.getElementById("hmenus").getElementsByTagName("li")
For Each Element In Elements
Debug.Print Element.innerText
'Questions
'Tags
'Users
'Badges
'Unanswered
'Ask Question
Next Element
Set Document = Nothing
Set Browser = Nothing
End Sub
Thanks to dee for the answer above with the Scrape() subroutine. The code worked perfectly as written, and I was able to then convert the code to work with the specific website I am trying to scrape.
I do not have enough reputation to upvote or to comment, but I do actually have some minor improvements to add to dee's answer:
You will need to add the VBA Reference via "Tools\References" to "Microsoft HTML Object Library in order for the code to compile.
I commented out the Browser.Visible line and added the comment as follows
'if you need to debug the browser page, uncomment this line:
'Browser.Visible = True
And I added a line to close the browser before Set Browser = Nothing:
Browser.Quit
Thanks again dee!
ETA: this works on machines with IE9, but not machines with IE8. Anyone have a fix?
Found the fix myself, so came back here to post it. The ClassName function is available in IE9. For this to work in IE8, you use querySelectorAll, with a dot preceding the class name of the object you are looking for:
'Set repList = doc.getElementsByClassName("reportList") 'only works in IE9, not in IE8
Set repList = doc.querySelectorAll(".reportList") 'this works in IE8+
I would use XMLHTTP request to retrieve page content as much faster. Then it is easy enough to use querySelectorAll to apply a CSS class selector to grab by class name. Then you access the child elements by tag name and index.
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, html As HTMLDocument, elements As Object, i As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.hsbc.com/about-hsbc/leadership", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Set html = New HTMLDocument
With html
.body.innerHTML = sResponse
Set elements = .querySelectorAll(".profile-col1")
For i = 0 To elements.Length - 1
Debug.Print String(20, Chr$(61))
Debug.Print elements.item(i).getElementsByTagName("a")(0).innerText
Debug.Print elements.item(i).getElementsByTagName("p")(0).innerText
Debug.Print elements.item(i).getElementsByTagName("p")(1).innerText
Next
End With
End Sub
References:
VBE > Tools > References > Microsoft HTML Object Library
Source: Stackoverflow.com