2016年7月7日星期四

vba and html

Advice on how to scrape tables of data or HTML from webpages using VBA

Part three of a three-part series of blogs

You can use VBA to extract data from web pages, either as whole tables or by parsing the underlying HTML elements. This blog shows you how to code both methods (the technique is often called "web-scraping").

Two ways to get data from websites using Excel VBA
Extracting a table of data from a website using a VBA query
Scraping a website's HTML in VBA (this blog)

Posted by Andy Brown on 13 January 2014 | no comments

Scraping a website's HTML in VBA

This blog shows how to go through a website, making sense of its HTML within VBA. We'll break the problem down into several chunks - to whit:

Defining what we want to achieve.
Analysing the target URL (the target website).
Referencing the required applications.
Getting at the underlying HTML.
Parsing the data into Excel.

Don't forget that websites change all the time, so this code may no longer work when you try it out as the format of the StackOverflow website may have changed. The following code also assumes that you have Internet Explorer on your computer (something which will be true of nearly all Windows computers).

Step 1 - Defining the problem

At the time of writing, here is what the above-mentioned StackOverflow website's home page looks like:

The home page lists out the questions which have been asked most recently.

From this we want to extract the raw questions, with just the votes, views and author information:

What the answer should look like. The list of questions changes by the second, so the data is different!

To do this we need to learn the structure of the HTML behind the page.

To scrape websites you need to know a little HTML, and knowing a lot will help you enormously.

Step 2 - Analysing the target website

In any browser you can right-click and choose to show the underlying HTML for the page:

How to show the HTML for a webpage in FireFox (the Internet Explorer, Chrome, Safari and other browser options will be similar).

The web page begins with Top Questions, so let's find that:

Press CTRL + F to find the given text.

Analysing the HTML which follows this shows that the questions are all encased in a div tag called question-mini-list:

We'll loop over all of the HTML elements within this div tag.

Here's the HTML for a single question:

The question contains all of the data we want - we just have to get at it!

Here's how we'll get at the four bits of data we want:

Data	Method
Id	We'll find the div tag with class question-summary narrow, and extract the question number from its id.
Votes	We'll find the div tag with class name votes, and look at the inner text for this (ie the contents of the div tag, ignoring any HTML tags). By stripping out any blanks and the word vote or votes, we'll end up with the data we want.
Views	An identical process, but using views instead of votes.
Author	We'll find the tag with class name started, and look at the inner text of the second tag within this (since there are two hyperlinks, and it's the second one which contains the author's name).

Step 3 - Referencing the required applications

To get this macro to work, we'll need to:

Create a new copy of Internet Explorer in memory; then
Work with the elements on the HTML page we find.

To do this, you'll need to reference two object libraries:

Library	Used for
Microsoft Internet Controls	Getting at Internet Explorer in VBA
Microsoft HTML Object Library	Getting at parts of an HTML page

To do this, in VBA choose from the menu Tools --> References, then tick the two options shown:

You'll need to scroll down quite a way to find each of these libraries to reference.

Now we can begin writing the VBA to get at our data!

Step 4 - Getting at the underlying HTML

Let's now show some code for loading up the HTML at a given web page. The main problem is that we have to wait until the web browser has responded, so we keep "doing any events" until it returns the correct state out of the following choices:

Enum READYSTATE

READYSTATE_UNINITIALIZED = 0

READYSTATE_LOADING = 1

READYSTATE_LOADED = 2

READYSTATE_INTERACTIVE = 3

READYSTATE_COMPLETE = 4

End Enum

Here a subroutine to get at the text behind a web page:

Sub ImportStackOverflowData()

'to refer to the running copy of Internet Explorer

Dim ie As InternetExplorer

'to refer to the HTML document returned

Dim html As HTMLDocument

'open Internet Explorer in memory, and go to website

Set ie = New InternetExplorer

ie.Visible = False

ie.navigate "http://stackoverflow.com/"

'Wait until IE is done loading page

Do While ie.readyState <> READYSTATE_COMPLETE

Application.StatusBar = "Trying to go to StackOverflow ..."

DoEvents

Loop

'show text of HTML document returned

Set html = ie.document

MsgBox html.DocumentElement.innerHTML

'close down IE and reset status bar

Set ie = Nothing

Application.StatusBar = ""

End Sub

What this does is:

Creates a new copy of Internet Explorer to run invisibly in memory.
Navigates to the StackOverflow home page.
Waits until the home page has loaded.
Loads up an HTML document, and shows its text.
Closes Internet Explorer.

You could now parse the HTML using the Document Object Model (for those who know this), but we're going to do it the slightly harder way, by finding tags and then looping over their contents.

Step 5 - Parsing the HTML

Here's the entire subroutine, in parts, with comments for the HTML bits. Start by getting a handle on the HTML document, as above:

Sub ImportStackOverflowData()

'to refer to the running copy of Internet Explorer

Dim ie As InternetExplorer

'to refer to the HTML document returned

Dim html As HTMLDocument

'open Internet Explorer in memory, and go to website

Set ie = New InternetExplorer

ie.Visible = False

ie.navigate "http://stackoverflow.com/"

'Wait until IE is done loading page

Do While ie.readyState <> READYSTATE_COMPLETE

Application.StatusBar = "Trying to go to StackOverflow ..."

DoEvents

Loop

'show text of HTML document returned

Set html = ie.document

'close down IE and reset status bar

Set ie = Nothing

Application.StatusBar = ""

Now put titles in row 3 of the spreadsheet:

'clear old data out and put titles in

Cells.Clear

'put heading across the top of row 3

Range("A3").Value = "Question id"

Range("B3").Value = "Votes"

Range("C3").Value = "Views"

Range("D3").Value = "Person"

We're going to need a fair few variables (I don't guarantee that this is the most efficient solution!):

Dim QuestionList As IHTMLElement

Dim Questions As IHTMLElementCollection

Dim Question As IHTMLElement

Dim RowNumber As Long

Dim QuestionId As String

Dim QuestionFields As IHTMLElementCollection

Dim QuestionField As IHTMLElement

Dim votes As String

Dim views As String

Dim QuestionFieldLinks As IHTMLElementCollection

Start by getting a reference to the HTML element which contains all of the questions (this also initialises the row number in the spreadsheet to 4, the one after the titles):

Set QuestionList = html.getElementById("question-mini-list")

Set Questions = QuestionList.Children

RowNumber = 4

Now we'll loop over all of the child elements within this tag, finding each question in turn:

For Each Question In Questions

'if this is the tag containing the question details, process it

If Question.className = "question-summary narrow" Then

Each question has a tag giving its id, which we can extract:

'first get and store the question id in first column

QuestionId = Replace(Question.ID, "question-summary-", "")

Cells(RowNumber, 1).Value = CLng(QuestionId)

Now we'll loop over all of the child elements within each question's containing div tag:

'get a list of all of the parts of this question,

'and loop over them

Set QuestionFields = Question.all

For Each QuestionField In QuestionFields

For each element, extract its details (either the integer number of votes cast, the integer number of views or the name of the author):

'if this is the question's votes, store it (get rid of any surrounding text)

If QuestionField.className = "votes" Then

votes = Replace(QuestionField.innerText, "votes", "")

votes = Replace(votes, "vote", "")

Cells(RowNumber, 2).Value = Trim(votes)

End If

'likewise for views (getting rid of any text)

If QuestionField.className = "views" Then

views = QuestionField.innerText

views = Replace(views, "views", "")

views = Replace(views, "view", "")

Cells(RowNumber, 3).Value = Trim(views)

End If

'if this is the bit where author's name is ...

If QuestionField.className = "started" Then

'get a list of all elements within, and store the

'text in the second one

Set QuestionFieldLinks = QuestionField.all

Cells(RowNumber, 4).Value = QuestionFieldLinks(2).innerHTML

End If

Next QuestionField

Time now to finish this question, increase the spreadsheet row count by one and go on to the next question:

'go on to next row of worksheet

RowNumber = RowNumber + 1

End If

Set html = Nothing

Finally, we'll tidy up the results and put a title in row one:

'do some final formatting

Range("A3").CurrentRegion.WrapText = False

Range("A3").CurrentRegion.EntireColumn.AutoFit

Range("A1:C1").EntireColumn.HorizontalAlignment = xlCenter

Range("A1:D1").Merge

Range("A1").Value = "StackOverflow home page questions"

Range("A1").Font.Bold = True

Application.StatusBar = ""

MsgBox "Done!"

End Sub

And that's the complete macro!

As the above shows, website scraping can get quite messy. If you're going to be doing much of this, I recommend learning about the HTML DOM (Document Object Model), and taking advantage of this in your code.

Two ways to get data from websites using Excel VBA
Extracting a table of data from a website using a VBA query
Scraping a website's HTML in VBA (this blog)

没有评论:

发表评论

订阅：博文评论 (Atom)