Advice on how to scrape tables of data or HTML from webpages using VBA
Part three of a three-part series of blogs
You can use VBA to extract data from
web pages, either as whole tables or by parsing the underlying HTML
elements. This blog shows you how to code both methods (the technique is
often called "web-scraping").
- Two ways to get data from websites using Excel VBA
- Extracting a table of data from a website using a VBA query
- Scraping a website's HTML in VBA (this blog)
Posted by
Andy Brown
on 13 January 2014 | no comments
Scraping a website's HTML in VBA
This blog shows how to go through a website, making sense of its HTML within VBA. We'll break the problem down into several chunks - to whit:- Defining what we want to achieve.
- Analysing the target URL (the target website).
- Referencing the required applications.
- Getting at the underlying HTML.
- Parsing the data into Excel.
Don't forget that websites change all the time, so this code may no longer
work when you try it out as the format of the
StackOverflow website may have changed.
The following code also assumes that you have Internet Explorer on your computer
(something which will be true of nearly all Windows computers).
Step 1 - Defining the problem
At the time of writing, here is what the above-mentioned StackOverflow website's home page looks like:
The home page lists out the questions which have been asked most recently.
From this we want to extract the raw questions, with just the votes, views and
author information:
What the answer should look like. The list of
questions changes by the second, so the data is
different!
To scrape websites you need to know a little HTML, and knowing a lot will
help you enormously.
Step 2 - Analysing the target website
In any browser you can right-click and choose to show the underlying HTML for the page:
How to show the HTML for a webpage in FireFox (the Internet Explorer, Chrome, Safari and other browser options will be similar).
Press CTRL +
F to find the given text.
We'll loop over all of the HTML elements within this
div tag.
The question contains all of the data we want - we just have to get at it!
Here's how we'll get at the four bits of data we want:Data | Method |
---|---|
Id | We'll find the div tag with class question-summary narrow, and extract the question number from its id. |
Votes | We'll find the div tag with class name votes, and look at the inner text for this (ie the contents of the div tag, ignoring any HTML tags). By stripping out any blanks and the word vote or votes, we'll end up with the data we want. |
Views | An identical process, but using views instead of votes. |
Author | We'll find the tag with class name started, and look at the inner text of the second tag within this (since there are two hyperlinks, and it's the second one which contains the author's name). |
Step 3 - Referencing the required applications
To get this macro to work, we'll need to:- Create a new copy of Internet Explorer in memory; then
- Work with the elements on the HTML page we find.
Library | Used for |
---|---|
Microsoft Internet Controls | Getting at Internet Explorer in VBA |
Microsoft HTML Object Library | Getting at parts of an HTML page |
You'll need to scroll down quite a way to find each of these libraries to reference.
Step 4 - Getting at the underlying HTML
Let's now show some code for loading up the HTML at a given web page. The main problem is that we have to wait until the web browser has responded, so we keep "doing any events" until it returns the correct state out of the following choices:
Enum READYSTATE
READYSTATE_UNINITIALIZED = 0
READYSTATE_LOADING = 1
READYSTATE_LOADED = 2
READYSTATE_INTERACTIVE = 3
READYSTATE_COMPLETE = 4
End Enum
Sub ImportStackOverflowData()
'to refer to the running copy of Internet Explorer
Dim ie As InternetExplorer
'to refer to the HTML document returned
Dim html As HTMLDocument
'open Internet Explorer in memory, and go to website
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate "http://stackoverflow.com/"
'Wait until IE is done loading page
Do While ie.readyState <> READYSTATE_COMPLETE
Application.StatusBar = "Trying to go to StackOverflow ..."
DoEvents
Loop
'show text of HTML document returned
Set html = ie.document
MsgBox html.DocumentElement.innerHTML
'close down IE and reset status bar
Set ie = Nothing
Application.StatusBar = ""
End Sub
- Creates a new copy of Internet Explorer to run invisibly in memory.
- Navigates to the StackOverflow home page.
- Waits until the home page has loaded.
- Loads up an HTML document, and shows its text.
- Closes Internet Explorer.
Step 5 - Parsing the HTML
Here's the entire subroutine, in parts, with comments for the HTML bits. Start by getting a handle on the HTML document, as above:
Sub ImportStackOverflowData()
'to refer to the running copy of Internet Explorer
Dim ie As InternetExplorer
'to refer to the HTML document returned
Dim html As HTMLDocument
'open Internet Explorer in memory, and go to website
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate "http://stackoverflow.com/"
'Wait until IE is done loading page
Do While ie.readyState <> READYSTATE_COMPLETE
Application.StatusBar = "Trying to go to StackOverflow ..."
DoEvents
Loop
'show text of HTML document returned
Set html = ie.document
'close down IE and reset status bar
Set ie = Nothing
Application.StatusBar = ""
'clear old data out and put titles in
Cells.Clear
'put heading across the top of row 3
Range("A3").Value = "Question id"
Range("B3").Value = "Votes"
Range("C3").Value = "Views"
Range("D3").Value = "Person"
Dim QuestionList As IHTMLElement
Dim Questions As IHTMLElementCollection
Dim Question As IHTMLElement
Dim RowNumber As Long
Dim QuestionId As String
Dim QuestionFields As IHTMLElementCollection
Dim QuestionField As IHTMLElement
Dim votes As String
Dim views As String
Dim QuestionFieldLinks As IHTMLElementCollection
Set QuestionList = html.getElementById("question-mini-list")
Set Questions = QuestionList.Children
RowNumber = 4
For Each Question In Questions
'if this is the tag containing the question details, process it
If Question.className = "question-summary narrow" Then
'first get and store the question id in first column
QuestionId = Replace(Question.ID, "question-summary-", "")
Cells(RowNumber, 1).Value = CLng(QuestionId)
'get a list of all of the parts of this question,
'and loop over them
Set QuestionFields = Question.all
For Each QuestionField In QuestionFields
'if this is the question's votes, store it (get rid of any surrounding text)
If QuestionField.className = "votes" Then
votes = Replace(QuestionField.innerText, "votes", "")
votes = Replace(votes, "vote", "")
Cells(RowNumber, 2).Value = Trim(votes)
End If
'likewise for views (getting rid of any text)
If QuestionField.className = "views" Then
views = QuestionField.innerText
views = Replace(views, "views", "")
views = Replace(views, "view", "")
Cells(RowNumber, 3).Value = Trim(views)
End If
'if this is the bit where author's name is ...
If QuestionField.className = "started" Then
'get a list of all elements within, and store the
'text in the second one
Set QuestionFieldLinks = QuestionField.all
Cells(RowNumber, 4).Value = QuestionFieldLinks(2).innerHTML
End If
Next QuestionField
'go on to next row of worksheet
RowNumber = RowNumber + 1
End If
Next
Set html = Nothing
'do some final formatting
Range("A3").CurrentRegion.WrapText = False
Range("A3").CurrentRegion.EntireColumn.AutoFit
Range("A1:C1").EntireColumn.HorizontalAlignment = xlCenter
Range("A1:D1").Merge
Range("A1").Value = "StackOverflow home page questions"
Range("A1").Font.Bold = True
Application.StatusBar = ""
MsgBox "Done!"
End Sub
As the above shows, website scraping can get quite messy. If you're
going to be doing much of this, I recommend learning about the HTML DOM
(Document Object Model), and taking advantage of this in your code.
- Two ways to get data from websites using Excel VBA
- Extracting a table of data from a website using a VBA query
- Scraping a website's HTML in VBA (this blog)
没有评论:
发表评论