Chapter 1: Introduction
The automated process of procuring data from websites has become an essential component of today’s business marketplace. The skill of writing programs or functions that can pull data from websites in a structured and scalable manner is in high demand. The best part of this trend is that web scraping is not overly complicated. While an experienced programmer can learn web scraping in no time at all, even individuals with very limited programming experience can become super-star scrapers if they have the right tools, information and, of course, a very eager attitude.

Web scraping, also referred to as web data extracting, harvesting, crawling, etc., has received a bad rap of late. In the business world, it has become synonymous with stealing data from competitors’ websites. While this certainly can be the case, web scraping can serve a myriad of other purposes. For example, let’s pretend your company is working with a client whose website has data necessary for your purposes. It’s possible that your client does not have a convenient means of producing the data you need to carry on with day-to-day business. This can leave your company spending hours manually extracting data from their website. In this situation, web scraping is a viable, and ethically acceptable, option for saving your company time and manpower.

The skill of web scraping can even be valuable outside of the workplace. I’ve personally used scrapes to procure Major League Baseball statistics and data. The data is then in Excel for me to analyze, or automatically calculate, the best fantasy team for the day. I’m not a big gambler, so I’ve yet to strike it rich in fantasy baseball, but it’s a perfect example of how one can use scrapes for personal interests or hobbies. One can see how easily this skill can be used for sports, news, stocks and so forth.
The internet is a never-ending supply of new information, and scraping offers an efficient means of obtaining the data you may want for a myriad of reasons.
Legal Issues

Legality regarding the scraping of some sites is a bit of a fuzzy issue at times. Websites frequently have disclaimers explicitly prohibiting the use of data extraction tools. It is your responsibility to make sure you are not breaking any legal rules in performing your scrapes. If you work for a company that intends to use scraped data, you should definitely talk to your supervisors and legal department to ensure that you are not getting your company or yourself in any trouble. This subject can actually get fairly complicated from a legal standpoint, and I am not a lawyer, so I always recommend consulting with your company’s legal representatives before performing any scrapes for business purposes. In general, if a site states that you are not allowed to scrape their data, then you probably shouldn’t.
Ethical Issues

Now that I’ve made it clear that this book does not promote the breaking of any laws, let’s address the ethics of scraping data. While I’m not aware of any company that I’ve worked for using the techniques in this book for nefarious reasons, the truth of the matter is that many, if not most, large companies are performing scrapes of their competitors’ sites regardless of disclaimers. Scrapers also justify using these techniques by pointing out that scrapes only procure data that is readily available to the public. There is no hidden or confidential information being obtained. In theory, you could accomplish the same tasks by simply copying and pasting data from the site into a spreadsheet. The only difference the scrape makes is that it saves time and energy.

This leads into the next advantage of learning the skill of scraping data. By knowing the tools and techniques used by what you might consider to be nefarious scrapers, you can stay one step ahead in protecting data on your own site. As I alluded to before, adding a legal disclaimer to your site will only do so much in terms of preventing scrapers. Let me just say that if I wanted to prevent individuals from scraping data from a site I had, I would not rely on a legal disclaimer to accomplish this goal. Luckily, there are other ways to prevent, or at least inhibit, the scraper from taking data you don’t want them to have. While the prevention of scraping is not the focus of this book, learning the process is the first step in prevention.
For This Book

This book will contain many examples of code in both VBA and HTML. For the sake of showing examples in these languages, brackets will be used to specify the part of the code that would need to be added or changed. For instance, when you’re looking at an example and a line in the code reads: [Type your code here], you do not literally write “[Type your code here]” in the program. The brackets and their content are there to illustrate or describe what piece of information would go at that location. For instance, let’s look at the sentence: My name is [Your name here]. For me, this would translate to: My name is David.
Before we can get into the details pertaining to web scraping, we must consider and review the key components involved in the process. Once this is established, we can get into the basics of writing web scrapes for Excel and then we can address some of the most common issues that can frustrate a new scraper. Ideally, by the end of this book you should have a good idea of what it takes to write simple, yet valuable, web scrapes using Microsoft Excel.
Chapter 2: VBA Basics
Microsoft Excel is ubiquitous in the business world. Every day, millions of Americans get in their car, drive to work, and promptly open up Excel. For most of this population, spreadsheets are just a compilation of rectangles which hold letters and numbers in a very structured and, at times, aesthetically pleasing manner. Their day typically consists of moving or transforming data by typing, clicking, copying and pasting it to where it needs to go, while occasionally utilizing a formula or two to save time. While Excel is certainly useful for this manual work, the truth is that most Excel users are completely unaware of the potential the program has to turn repetitive and boring tasks into a thing of the past. This is where VBA comes into play.

Visual Basic for Applications (VBA) is the scripting language used to automate tasks in Excel. Much like any programming language, the coding for VBA can be as complicated or as simple as it needs to be. If you’ve ever recorded or played a macro in Excel, then VBA is being utilized. When the recording function is used, Excel is essentially writing a VBA program for you. While the language is useful for recording and playing basic macros, this is really just the tip of the iceberg in terms of what it can accomplish. In addition to automating almost any task in Excel, VBA can be used to interact with other programs, which will be especially useful when we get to the actual scraping portion of the book. To go over every task that is possible with VBA would take at least another book in and of itself and would not be practical for our purposes. Since this book is specifically focusing on the use of VBA in scraping and manipulating data from the web, we will only be covering the essential terms and functions that VBA provides for these tasks. However, we will also be covering the more basic elements of programming that are critical for scraping the web.
VBA Programming

This book is intended for beginners as well as experienced programmers. With that being said, this section is probably not necessary for advanced VBA developers or programmers in general, so it may behoove you to skim or completely skip this basic review. For you beginners, though, if you are completely new to programming then please note that this is a very quick review of the most basic components. Again, this book does not intend to give an all-encompassing view of programming with VBA, let alone programming in general. If you’re an absolute beginner
who intends to become a master programmer of various languages, then there is plenty of information readily available on the internet for every language you can imagine. The following material is only meant to give a brief overview of the basic concepts of programming with VBA and how they will be utilized for our purposes.
Variables

There’s a good chance that variables are exactly what you think they are. In programming, just like in algebra, variables represent, or hold, a value. In VBA, assigning a value to a variable is especially easy. One simply has to type the variable name, add an equals sign (=) and, finally, add the value that they want the variable to hold. Examples:

1. X = 3
2. X = 3.0
3. X = “Three”

One might notice that not all variables contain the same type of data. The values assigned in Examples 1, 2, and 3 are 3, 3.0, and “Three” respectively. The three in Example 1 is an integer. The three in Example 2 is a number, but the decimal shows that it is not an integer like it is in Example 1. The X in Example 3 is, of course, not a number at all; it is a collection of characters, in this case letters, that spells the word “three”. Why am I stating the obvious? Because the example demonstrates how even though the pieces of information, in a sense, all mean the same thing to us humans, they are in different forms, or data types, to computers. Typically, in most languages the programmer must specify the type of data that the variable will be holding. In the previous examples, Example 1 would be specified as an integer, Example 2 would be a double, and Example 3 would be a string.

One of the great things about learning programming through VBA is that it is not as crucial to specify the variable type as it is in other languages. However, this is generally only true for writing basic macros. While it is generally good practice to always specify the variables that you will be using, the beginner can take solace in the fact that their program probably won’t break if they don’t specify the data type for every variable. The reason for this is that when a data type is not specified for a variable in VBA, it defaults to a variant data type, which is extremely versatile and can be used as an integer, double, string, etc. This can be good or bad for the beginning VBA programmer.
On one hand it saves time and is a luxury to not worry about what data type a variable is, especially for basic programs or functions. On the other hand, the variant uses much more memory than the other commonly used data types. When you get into more advanced
concepts and coding it will be crucial to correctly specify what type of data a variable is housing. There is a time and a place to take advantage of the variable defaulting to variant, but for our purposes, that being writing macros that scrape data from the web, we’ll operate under the assumption that defining variables correctly is necessary. Some of the most common data types that we will be using:

• Boolean - Binary. Examples: True or False
• Integer - Whole number (no decimals). Examples: 1, 2, 12, etc.
• Double - Whole numbers or fractions (decimals). Examples: 1.2, 1.33333, 50.12, etc.
• String - Text. Examples: “Cat”, “Dog”, “VBA is awesome”, etc.
• Object - Objects will be discussed in more detail later on, as the concept is not as straightforward as the previous data types. For now, just think of an object as a separate entity or thing, such as an occurrence of an outside program (i.e. Internet Explorer). While that’s not technically the entire definition, it’s the most relevant feature of the data type for our purposes.
• Variant - Variants were briefly discussed earlier. They are the data type that VBA defaults a variable to when you don’t specify the type. The variant is very flexible and can serve the purpose of almost any data type. However, variants use up a large amount of memory, which can affect performance. Relying on defaulting to the variant will also limit you when you get into more advanced programming in VBA. I only rely on defaulting to variants on small projects in need of a quick fix, which is definitely not the case when it comes to scraping web sites.

Once you’ve decided what data type your variable is, specifying a variable’s data type is as simple as typing “Dim [variable] as [data type]”. For example, let’s assume that I want the variable x to represent 3, an integer. To specify that x will be an integer, I would type:

Dim x as Integer

I could then assign x my integer value of 3, which would be:

x = 3

You can also declare multiple variables on one line, but note that in VBA each variable needs its own “as [data type]” clause; any variable left without one defaults to variant. For example, if I wanted to use variables x, y and z all as integers I could write:

Dim x as Integer, y as Integer, z as Integer

You can specify a variable’s data type at any point in the program before giving the variable a value. I believe it helps keep the program organized and adds a sense of professionalism to specify all variables at the beginning, or technically before the macro is run, but it can really be done at any point. We’ll illustrate examples of both options after we’ve covered more basics and can write a program. Again, this is almost completely an aesthetic choice and will not affect how the program runs.

There is an unlimited number of ways that programs can use variables. However, there are some aspects of programming that are almost universally used across languages. This section will give a brief review of the most relevant ones for our purposes. Before we do that, however, if you are a complete newbie, let’s get writing your first macro, or program, out of the way.

First, open a blank Excel workbook. To get to the environment you will be writing your code in, right-click on the tab at the bottom of the spreadsheet (this will typically be labeled Sheet1 in a new workbook) and select View Code. You can also access the environment by pressing Alt-F11. A new window labeled Microsoft Visual Basic for Applications - Book1 should appear. At the top of this window, select Insert and click Module. A blank white window labeled Book1 - Module1 (Code) should appear. This is where you will be typing your code.

All macros in this environment will begin and end the same way. At the beginning, type “Sub” followed by whatever you want to name your macro, followed by “()”. To end your macro, type “End Sub”. So, if you wanted to name your macro “FirstMacro” it would look something like this:
Sub FirstMacro()
[Type your code here]
End Sub
Note that unlike many other languages, VBA is not case sensitive, meaning that you need not concern yourself with deciphering uppercase from lowercase. To complete your first macro, we’ll be utilizing VBA’s MsgBox function, which will display a dialog box with a message when it is run. To use this feature, simply type MsgBox followed by the message you want displayed.
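Putting the MsgBox line together with the Sub and End Sub lines from before, the complete macro might look like this (the “Hello World!” message is just an example; any text will do):

```vba
Sub FirstMacro()
    ' Display a dialog box with the message "Hello World!"
    MsgBox "Hello World!"
End Sub
```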
This is your complete first macro. There are several ways to run code once you have it written. For example, you can click the green Run Sub button (button with the green triangle) or you can go back to your spreadsheet, click the View tab, click the Macros section and a dialog should appear with the name of your macro displayed. Select your macro and click the Run button. If you’ve correctly entered the code displayed above, when you run your macro, a dialog displaying “Hello World!” should appear.
Conditional Statements

Conditional statements are often referred to as “if statements” for pretty obvious reasons. They check for the existence of a certain condition. If the condition is true, the program performs a specific action. If it is false, then it performs a different action or no action at all. Here’s how a conditional statement might look in VBA.
If [condition is met] Then
[action]
Else
[different action]
End If
To illustrate this principle, we’ll slightly modify our current code and add a conditional statement. To do this, we’ll utilize a variable which we will call Var1.
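One way to write this, assuming Var1 is declared as an integer and assigned the value 5, might be:

```vba
Sub ConditionalExample()
    Dim Var1 As Integer
    Var1 = 5
    ' Check the variable and display a message depending on its value
    If Var1 = 5 Then
        MsgBox "The variable is five."
    Else
        MsgBox "The variable is not five."
    End If
End Sub
```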
If you’ve entered the code correctly, then a dialog should appear with the message “The variable is five.” To further test the code, change the variable to any integer other than five. Run your macro and a dialog will display showing “The variable is not five.”
Loops

Loops might be looked at as a certain type of conditional statement. The difference between loops and the previously discussed conditional statements is that loops repeat a section of code multiple times until the condition is met.
For Loops

There are several different types of loops. For our purposes we will primarily be working with For Loops. For Loops are perhaps the simplest type of loop, as they repeat a section of code for a predetermined number of times (generally speaking). Let’s look at an example to illustrate the concept. If I had a section of code that I wanted to repeat 5 times, I would write:
For x = 1 To 5
[Action to be repeated]
Next x
We’ll again use the MsgBox function in an example.
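A sketch of such a loop, using “This is my loop example” as the message, might be:

```vba
Sub ForLoopExample()
    Dim x As Integer
    ' Display the same message five times in a row
    For x = 1 To 5
        MsgBox "This is my loop example"
    Next x
End Sub
```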
When you run this code, the dialog displaying the message “This is my loop example” should be displayed 5 consecutive times (you must click the OK button each time). In this example, and in most circumstances that we’ll be using For Loops, each iteration of the loop adds 1 to the variable x. To demonstrate this, we’ll modify our loop example by including our variable.
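One way to include the variable in the message (the exact wording here is only an illustration) is to concatenate x onto the text with the & operator:

```vba
Sub ForLoopCounterExample()
    Dim x As Integer
    ' Each dialog displays the current iteration number of the loop
    For x = 1 To 5
        MsgBox "This is loop number " & x
    Next x
End Sub
```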
When this code is run it should show 5 consecutive dialog boxes displaying the iteration of the loop.
Do While Loops

Another type of loop that we’ll be using regularly is the do loop. The do loop is similar to the For Loop in that it repeats a section of code. There are two types of do loops: Do While and Do Until. Do While Loops repeat the section of code while a specified condition is true. Do Until Loops repeat the section while a condition is not met and stop when it becomes true. Clearly, the two options are very similar. I find it easier to pick one of the two and stick with it. Most of the time, if a task can be accomplished by one, then it can easily be accomplished by the other by slightly altering the inner portion of the code. For this reason, I generally stick with Do While Loops, but the choice is up to you. The general structure of the Do While Loop will look something like this:
Do While [condition]
[Repeat this action]
Loop
Let’s look at an example to illustrate how the Do While Loop works for our purposes. We’ll also introduce the Application.Wait function in this example, as we’ll commonly use this action in while loops. We will be using this function to pause our macro for a specific amount of time, usually for one or a few seconds. This function was created to pause a macro until a specified date and time. Naturally, we do not know the exact date and time of day that we’ll need to use this function, so to get around this we’ll take the current date and time and add the number of seconds that we want our macro to pause for. If this is confusing for you, don’t sweat it. Just understand that when you see Application.Wait (Now + #12:00:01 AM#) it will pause the macro for 1 second. When you see Application.Wait (Now + #12:00:02 AM#) the macro will pause for 2 seconds, and so on. When the following macro is initiated it will pause for 5 seconds and then display a dialog indicating that the macro has finished.
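A version of that macro, with the macro name chosen only for illustration, might look like this:

```vba
Sub DoWhileExample()
    Dim y As Integer
    y = 0
    ' Repeat while y is less than 5: pause one second, then add 1 to y
    Do While y < 5
        Application.Wait (Now + #12:00:01 AM#)  ' pause for one second
        y = y + 1
    Loop
    MsgBox "The macro has finished."
End Sub
```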
This concept tends to trip some people up, so we’ll walk through it. At the beginning of the macro, the variable y is 0. The condition in the loop stipulates that the action in the loop will repeat while y is less than 5. Each time the loop runs, it will use the Application.Wait function to pause Excel for one second. After this pause, it will add 1 to our variable y before starting the next iteration of the loop. Therefore, the loop will run 5 times, pausing for one second each time, before y is no longer less than 5, which means the condition is no longer met and the loop is done. As the macro continues, it will then use the MsgBox function to display a dialog indicating that the macro has finished. Therefore, when you run this macro it will pause for 5 seconds before displaying the dialog.
Basic Spreadsheet Interaction

VBA offers various ways of inputting data on a spreadsheet. For inputting data into individual cells, I personally use and recommend the Cells method. This method simply looks at the spreadsheet as a plane of coordinates, with Cells(1,1) being the first cell in the spreadsheet (top left corner), the first number controlling the vertical placement and the second representing the horizontal placement. It might be best to think of this method as Cells([row],[column]). For example, as previously mentioned, Cells(1,1) is associated with the top left corner of the spreadsheet. To reference the cell on the far left but in the second row, you would use Cells(2,1), and so on. To place the value you want in a cell, simply type Cells([row],[column]).Value = [value you want to enter]. If you want to put in a static value, make sure you use quotation marks around your text. For example, if I wanted the cell in the second row of the second column to say “Dog” I would put:
Cells(2,2).Value = "Dog"
You should probably notice that when using this method in this manner, the sheet in your
workbook that you want to use is not specified. When using the method as we did above, VBA will assume that you want the action performed on whatever sheet is currently active. If you want to specify which sheet the cell is on, put Sheets([Name of sheet in quotations]) followed by your cells statement. For example, if you wanted to put “Cat” in the third row of the first column on a sheet called “Pets” in your workbook, you could put:
Sheets("Pets").Cells(3,1).Value = "Cat"
Clearly, putting a static value in a cell is a pretty simple process. What we will be doing frequently is only slightly more complicated, in that we will be using a loop variable to dictate the cells being used. This is more frequently done in the row coordinate. Here’s a simple example:
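A sketch of such a loop, using the “Pets” worksheet from the earlier example, could be:

```vba
Sub StaticValueLoop()
    Dim x As Integer
    ' x dictates the row; the column stays fixed at 1
    For x = 1 To 10
        Sheets("Pets").Cells(x, 1).Value = "Bird"
    Next x
End Sub
```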
This code should put the word “Bird” in the first 10 rows of the first column of the worksheet titled “Pets”. Of course, it would be very unusual to have to enter a static value in such a manner. Typically, the “Bird” portion of the above loop will be a variable that changes with each iteration of the loop. To better understand this concept, here is an example illustrating the principle. If each cell of the first ten rows of the first column had a unique value, and we wanted to use this method to input each of those values into the same row that they are currently in but in column two, we could write:
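Assuming the same “Pets” worksheet, one way to write this copy loop is:

```vba
Sub CopyColumnLoop()
    Dim x As Integer
    ' Read the value in column 1 of row x and write it to column 2 of the same row
    For x = 1 To 10
        Sheets("Pets").Cells(x, 2).Value = Sheets("Pets").Cells(x, 1).Value
    Next x
End Sub
```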
This leads into the next important concept for our purposes. Unlike the above example, it would be very rare for us to be using a static value for the number of rows to loop through. We typically won’t know if it’ll be ten rows (as in the example above), 100 rows, 8,675,309 rows, etc. Let’s say in the far left column we have a list of URLs that we
will ultimately have to use for our scrape, and we need to loop through them one at a time. It would be highly inconvenient to manually determine the number of rows and then enter that number into our loop. To get around this inconvenience, we need to automatically count how many rows are populated with data in the first column and then use that number in our loop. There are multiple ways to do this in VBA. One easily understandable method is:

Sheets([Sheet name]).Cells(1,1).CurrentRegion.Rows.Count

Clearly there are multiple things happening here to get the number of rows used in column 1. In short, the code looks at Cells(1,1) and the “block” of populated cells that it is part of. It then counts the number of rows in this block, thus producing the number of rows that we want to loop through. We can then use this number as the upper limit in our loop. Here’s the same loop that we just used, but now with a variable dictating how many rows to loop through:
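Using z as the row-count variable (the name is arbitrary) and the “Pets” sheet from earlier, the loop might become:

```vba
Sub RowCountLoop()
Dim x As Integer
Dim z As Integer
' Count how many rows in column 1 are populated with data
z = Sheets("Pets").Cells(1, 1).CurrentRegion.Rows.Count
' Copy each value from column 1 to column 2, row by row
For x = 1 To z
Sheets("Pets").Cells(x, 2).Value = Sheets("Pets").Cells(x, 1).Value
Next x
End Sub
```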
Formatting Code

As mentioned previously, what you name your variables is entirely up to you. I personally like to use a single letter for variables representing the iterations of loops, and more descriptive names for other variables. In the example above, the variable named z could just as easily be named loopvariable1, q, or RonaldReagan. However, regardless of your preference, I recommend not naming all variables randomly. While it can be funny at first, eventually you’ll probably have to look at your code again months or years after it was originally written, and having random names for variables will make it more difficult to follow the logic of your program. Another aspect of the style of your code that is a matter of preference is the use of indentation and spaces between lines. For example, you might find the last code that we used more organized if it was formatted as follows:
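For instance, the row-counting loop with indentation and blank lines applied might look like:

```vba
Sub RowCountLoop()

    Dim x As Integer
    Dim z As Integer

    ' Count how many rows in column 1 are populated with data
    z = Sheets("Pets").Cells(1, 1).CurrentRegion.Rows.Count

    ' Copy each value from column 1 to column 2, row by row
    For x = 1 To z
        Sheets("Pets").Cells(x, 2).Value = Sheets("Pets").Cells(x, 1).Value
    Next x

End Sub
```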
This formatting is completely optional and is a matter of preference. I sometimes won’t format very short codes (like the ones we’ve been using thus far) but always use my preferred formatting for longer programs. My only recommendation is that if you do use formatting to keep your code organized, be consistent with how you use it. Notice in the example above that I indented the beginning and end of the loop so that they are visually in line with one another, and used an additional indentation for the loop’s contents. This may add little value in codes as short as these examples, but when you start writing long programs that may include loops inside of loops inside of loops, this type of formatting makes troubleshooting whatever problems arise much easier.
Chapter 3: Scraping with VBA
Navigating the Web

Before we can get into great detail regarding pulling pieces of data out of websites, we first must become familiar with using VBA to navigate the web. This skill can serve a myriad of purposes even if you don’t intend to scrape data from sites. For example, it’s not uncommon to use VBA to create macros that navigate to and through various sites simply for display purposes. More importantly, these techniques can be used in website testing. VBA can be used to navigate through all pages of a website, testing it for reliability and accuracy.

Just like you and me, VBA requires a web browser to navigate the web. However, there’s a good chance that you, as a human, are quite a bit more versatile when it comes to operating different web browsers. If you’re like me, you might prefer Firefox on a daily basis when it comes to manual internet browsing. At the same time, you can probably switch to Chrome or Internet Explorer whenever you feel like it. When it comes to programming with VBA, however, Internet Explorer is the most common web browser to use, as it is by far the easiest. The reason for this is pretty obvious: VBA was developed by Microsoft to be used with Microsoft programs, so naturally Internet Explorer is the easiest one to utilize in this context. This is not to say that it’s impossible to use Firefox or Chrome with VBA, but it generally requires much more effort. Even if Internet Explorer isn’t your personal preference when it comes to browsing the web, it should be more than sufficient for your scraping purposes.

To start your web browsing with VBA, we must create our Internet Explorer object. We’ll start by defining our variable, which we’ll call IE, as an object. Object variables are not established in the exact same way as other variables. The only difference you need to be aware of is that when assigning an object to a variable, you must add the word “Set” before the name of the variable.
So when we create the Internet Explorer variable it will look something like this:
Set IE = CreateObject("InternetExplorer.Application")
There are other ways of creating Internet Explorer objects in VBA. However, this often involves adding outside references to your VBA project. This task is not difficult and any
seasoned VBA developer will know how to do it, but I find the aforementioned method of creating this object to be just as effective as any other, so it is generally the method that I stick with. After you’re done using your browser in each macro, it’s important to clear the object, essentially setting it to nothing, which can be done with this code:
Set IE = Nothing
This step is often overlooked. People often skip it because they assume that once the browser is closed, it is no longer using the computer’s resources; however, this is often not the case. This situation can lead to what’s referred to as a memory leak. In larger scrapes, there will be times when it’s necessary to open and close a browser object many times. Each browser object you create takes up a certain amount of the computer’s available resources. If the objects are never cleared, even if the browser has been closed, the amount of memory dedicated to the objects you create continues to accumulate, which can ultimately lead to your scrape running very slowly.

This leads into the next point. Creating an object, such as one that is used for web browsing, does not mean the object will automatically become visible. It is very possible to create and use a browser that remains completely invisible to the computer’s user. While there may be an appropriate time to use Internet Explorer in this manner, I generally never do. Being able to see the navigation that the browser performs is of vital importance when testing and monitoring a new scrape. Keep in mind that even if you write your code flawlessly, there are certain aspects of navigating the internet that will remain out of your control. Websites will be updated and changed over time, which can lead to your scrape not working properly, so it’s ideal to be able to see the progress of your navigation when you can. Naturally, you’re not going to want to monitor your scrapes at all times, but it’s important to be able to see the browser when you want to. To make this object visible, we’ll use the code:
IE.Visible = True
Now that your browser is created and visible, it’s time to navigate to whatever site you’ll be displaying or scraping. This code is also short and simple:
IE.Navigate [website URL]
For example, if you wanted to navigate to Google.com, your code would look like this:
IE.Navigate "www.google.com"
Putting these three pieces together will create your first web-browsing macro:
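Assembled into one macro (with a macro name chosen only for illustration), the three pieces might look like this:

```vba
Sub BrowseExample()
    Dim IE As Object
    Set IE = CreateObject("InternetExplorer.Application")  ' create the browser object
    IE.Visible = True                                      ' make the browser visible
    IE.Navigate "www.google.com"                           ' navigate to the site
End Sub
```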
Timing

When interfacing VBA with Internet Explorer, time is of critical importance. This may seem like one of the more trivial matters affecting your scrapes, but timing issues can actually be some of the most troublesome and frustrating obstacles you encounter when navigating and scraping sites on the web. Allow me to elaborate: websites typically require at least a few seconds to load. While we know that it would be a futile effort to operate a website before it is done loading, your VBA program does not know any better unless it is specifically told to wait for the page to load. As far as VBA is concerned, once it processes IE.Navigate “google.com” it is ready to continue on to the next line of your code, whether or not Internet Explorer has finished loading the site. When this happens, your program will most likely be halted with an error message.

To get around this problem, you must tell the program to wait for the website to finish loading. There are multiple ways to do this. While at times it can be useful to use the Application.Wait function that was previously mentioned, this is not the ideal choice when it comes to waiting for web pages to load. The reason for this is the unpredictability in the amount of time that it takes for a page to load. At any given time, this process may take one, ten or twenty seconds. Having a static value in the Application.Wait function leaves room for error in this context, as it may leave too much or not enough time for a page to load. While the idea of leaving an excessive amount of time for a page to load makes sense in theory, if your scrape is navigating through many pages, as will probably be the case, this wasted time adds up. Even an extra second or two on each page will add up to a significant amount of lost time when you’re navigating
through thousands of pages. There are various ways to get around the page-loading timing issue. One of the most commonly used pieces of code for pausing the macro while the page loads utilizes a While loop:
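A common version of that loop, sketched with our IE object, is:

```vba
' Pause until Internet Explorer reports it is no longer busy
Do While IE.Busy
    DoEvents   ' yield so Excel stays responsive while waiting
Loop
```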
Another option for accomplishing the same goal is:
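A sketch of the second option; note that the READYSTATE_COMPLETE constant is only defined when you have a reference to Microsoft Internet Controls, so with late binding you would compare against its literal value, 4.

```vba
' Pause until the browser reports the page is fully loaded
Do While IE.ReadyState <> READYSTATE_COMPLETE   ' constant value 4
    DoEvents
Loop
```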
In theory, either of these methods should accomplish the same goal of pausing the macro until the site has finished loading, but it should be noted that both of these methods have been criticized for being inconsistent. However, many have found that utilizing both of these conditions at the same time adequately accomplishes the task. The combination of the two codes might look something like this:
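A sketch of the combined check:

```vba
' Wait until IE is idle AND the page reports complete
Do While IE.Busy Or IE.ReadyState <> 4   ' 4 = READYSTATE_COMPLETE
    DoEvents
Loop
```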
As previously stated, these methods have been criticized for lacking consistency. In my experience, it is not technically accurate that these methods are inconsistent per se, although they do leave something to be desired. It’s been my experience that when these methods work for a certain site, they will typically always work for that site, but if they don’t work for a site, they never will. There are times when the loop fails to pause the macro at all and there are other times when the macro will get stuck in the loop indefinitely. In general, I believe the above methods work most of the time, so they are a good starting point when trying to pause a macro while a page loads. Just keep in mind that this approach is not perfect and it will sometimes be necessary to utilize more creative programming to accomplish the task. There are occasions when these methods will work most of the time but occasionally get stuck in an infinite or seemingly infinite loop. Fortunately, there are ways around this. One of my favorites is to add a variable which counts every iteration of your loop up
to a maximum limit. In theory, this can be done without a pause, but the limit would typically have to be a very large number for the loop to serve a useful purpose. Therefore, it’s ideal to have a pause in each iteration of the loop. For example:
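A sketch of such a loop, pausing one second per iteration and giving up after five iterations:

```vba
Dim q As Long
q = 0
Do While IE.Busy Or IE.ReadyState <> 4   ' 4 = READYSTATE_COMPLETE
    If q < 5 Then
        Application.Wait Now + TimeValue("0:00:01")   ' one-second pause
        q = q + 1
    Else
        Exit Do   ' stop waiting after five iterations
    End If
Loop
```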
The Application.Wait function is necessary in our loop because, if there were no pause, the variable q would meet the condition (count up to five) in less than a second, which would hardly be useful to us. Since the variable q increases by one with every repetition of the loop, once q reaches 5 the condition q < 5 will no longer be true, thus activating the Exit Do portion of the code and ending the loop. So if we look at the entire code, we can see that the loop will continue until IE.Busy is false or IE.ReadyState <> READYSTATE_COMPLETE is false. However, due to the q < 5 portion of the code, the loop can run through a maximum of five iterations before it exits.

The final method I use to pause macros while a web page loads is less commonly utilized, but I’ve found it to be invaluable at times. As previously mentioned, there will be sites where the IE.Busy or IE.ReadyState <> READYSTATE_COMPLETE checks simply don’t work at all. This method utilizes the LocationURL property. When we add this property to our browsing object, we get whatever the browser’s current URL is. How is this useful? When we direct the browser to a specific URL, it does not navigate there instantly; it takes time for the new URL to load. We can use this to our advantage with the following logic. Use IE.LocationURL to determine the browser’s current URL. We’ll refer to this URL as URL1. At this time, we tell the browser to navigate to the desired URL, which we’ll call URL2. While this page is loading, the IE.LocationURL property will still return URL1 until URL2 has loaded. Therefore, once you’ve instructed the browser to go to URL2, you can start a loop that repeatedly checks IE.LocationURL until it reads URL2, at which time the page has loaded and you can resume your macro. Here is an example of such a loop.
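A sketch of the LocationURL approach; the destination URL here is a made-up placeholder. Be aware that sites often redirect or append parameters, so in practice the exact comparison may need loosening (for instance, checking with InStr instead of equality).

```vba
Dim URL2 As String
URL2 = "http://www.example.com/page2"   ' hypothetical destination
IE.Navigate URL2
' Keep looping until the browser's reported URL becomes URL2
Do While IE.LocationURL <> URL2
    DoEvents
Loop
```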
It’s going to take practice and repetition to know when to use which method to stall your program. As previously mentioned, there are an infinite number of websites that come in an endless combination of styles and operations, so there is not one method that will work for all of them.
Interacting with Websites

Now that we’ve covered navigating to websites, we can begin covering how to interact with them. While there will be scrapes that simply load a page, scrape the data and then carry on to the next page, there are times when it will be of paramount importance to understand how to interact with the site through VBA in the same way that you would interact with it manually. For instance, there will be times when you need to click a button or buttons for the site to produce the information that you want. There will also be times when you need to insert text into a field on the page for it to operate the way you want it to. Furthermore, at times it will be easier to click an element on a page to take you to another page, as opposed to relying on entering a new URL. Each separate component of the site that is of interest to us, such as buttons, text boxes, icons, filters, etc., will be referred to as an element.
Identifying Elements

Regardless of how you intend to utilize an element within a website, whether that means clicking on it, inserting data into it, taking data from it, etc., you must determine how to identify the element within the coding of the website and how you can reference it in your VBA program. In order to do this, you need only a very basic level of knowledge about the structure of websites and how they are organized. As you’re probably aware, the basic structure of websites is created with HTML (Hypertext Markup Language). However, as technology has evolved, HTML has become insufficient for providing the most advanced features used in most modern websites. It has gotten to the point where HTML really only provides the “backbone” of most modern sites. In fact, it’s likely that every site you regularly use incorporates multiple other languages such as CSS, JavaScript, PHP and so on. Lucky for us, you don’t need a wealth of knowledge regarding all of these languages to be a proficient web scraper. The reason for this is that much of the value these additional languages provide focuses on aesthetics or graphics, which is to say that a large portion of the actual content or information that you want from a site will still reside in the HTML portion of the code.
There are a few ways to go about identifying the element of a website that you want to use. The most basic way is to manually scan through the HTML code looking for the element you want. To do this, open a browser, navigate to a desired site and right-click anywhere on the page. A menu should appear with a few options depending on the browser you’re using. Typically, the option you’re looking for will say “View Source”, “View Source Code”, “View Code” or something similar. By clicking on this, a window should appear which displays all of the available coding behind the site. At this point, you can scan through the code and try to follow it to the identifying feature of the field that you want. One effective strategy is to copy whatever text is in your element, or in an element near the element that you want, and do a search for the copied text in the “View Source” window. I could go into more detail about how to do this, but I won’t. The truth is that while it’s important to know how to do this in the event that you need to review the entire page’s code, it is by far the most tedious and inefficient way of finding the identifying feature of the element you want. Many web browsers offer a more efficient means of accomplishing this goal. This is one of the reasons that I generally prefer doing my manual browsing with Mozilla Firefox, as it provides an easy way of finding the code for the exact element you want. To do so, simply move your cursor over the element that you want to use, right-click and select “Inspect Element.” A new section at the bottom of your browsing window will appear with the section of code responsible for the element highlighted. A final option for identifying the elements that you want to use in your scrapes is to run a macro that procures the innertext, value, or other identifying feature for each element on the page and then to search through the results for the element you want.
I have included a code below that accomplishes this goal: it takes the innertext and identifying features of each element that has innertext and places them in the workbook. It’s important to note that there are many more elements on each page that won’t appear on the worksheet because they don’t have innertext. This code could certainly be improved; however, that would call for more complicated programming. Ideally this macro will be simple enough for individuals who do not have much experience with VBA to understand. If you are brand new to VBA, it may look intimidating, but don’t harp on every detail. Hopefully by the end of this book you’ll have a better understanding of how this code works and you can then manipulate it to your liking. If you combine the material in this book with plenty of practice, then this code will begin to look simple, if not elementary. Furthermore, it is easy to find similar macros online that accomplish similar tasks and have extra bells and whistles. Keep in mind that even though these macros can be useful tools in identifying the element you want to use, they are still only tools intended to help the process. It is still ultimately up to you to determine the elements you want to use and the best way to reference them in your scrape.
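A sketch along those lines; the URL is a placeholder and the column layout is my own choice, so adapt both to your needs.

```vba
Sub ListElements()
    Dim IE As Object, Elem As Object
    Dim r As Long
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True
    IE.Navigate "http://www.example.com"   ' placeholder URL
    Do While IE.Busy Or IE.ReadyState <> 4 ' wait for the page to load
        DoEvents
    Loop
    r = 1
    ' Walk every element on the page; keep only those with innertext
    For Each Elem In IE.Document.GetElementsByTagName("*")
        If Len(Elem.innerText) > 0 Then
            Cells(r, 1).Value = Elem.tagName    ' identifying features...
            Cells(r, 2).Value = Elem.ID
            Cells(r, 3).Value = Elem.className
            Cells(r, 4).Value = Left(Elem.innerText, 255)   ' ...and text
            r = r + 1
        End If
    Next Elem
End Sub
```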
While it is usually easy to find the HTML code for the element that you want to use, it is not always easy to properly reference the element. There are several factors that come into play here, but it typically depends on what information the HTML code provides for the desired element. To explore this issue, we’ll observe the various methods that VBA provides for locating an HTML element. For several of the forthcoming examples, we’ll use the following HTML code:

<td id="abc" name="def" class="ghi">This is my example HTML code</td>
We will first discuss the methods used to properly reference an HTML element through VBA and then we will explore how to use these references to accomplish our goals.
GetElementByID

The GetElementByID method is ideal for scraping data when the option is available. The primary reason for this is that an ID will typically refer to only a single element on a site, meaning that there will not be multiple elements with the same ID. When it comes to scraping, finding an element you want that provides an ID is like taking candy from a baby. If we continue using IE as our browser object variable, the code for taking data from these elements is simple.
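In general form, with “ElementID” standing in for the actual ID of the element you want:

```vba
IE.Document.GetElementByID("ElementID")
```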
So, let’s apply this procedure to our example HTML code. Notice the ID portion of the code, id="abc".
<td id="abc" name="def" class="ghi">This is my example HTML code</td>
We can see that the ID for the element we are using is “abc”. Therefore, to reference this element we could use the code: IE.Document.GetElementByID(“abc”).
GetElementsByName

Similar to the GetElementByID procedure is the GetElementsByName procedure. Other than “Name” replacing “ID” in the procedure, there is a major difference between the two. Notice that the “Element” portion of the GetElementsByName procedure has an “s” at the end, which is not true of the GetElementByID procedure. Unlike IDs, there can be multiple elements with the same name. Since multiple elements can have the same name, it’s important to further specify which one you are referring to. When multiple elements have the same defining feature, such as the element’s name, each of these same-named elements is assigned a hidden index number, starting with 0, which you can add at the end of your GetElementsByName reference in parentheses.
This might sound confusing but it’s actually very simple. If we have several elements that have the same name, then the first one has an index of 0, the second has an index of 1, the third has an index of 2, and so on. All you have to do is add this index number between two parentheses to the end of your name reference. Let’s look at our example code to further elaborate.
<td id="abc" name="def" class="ghi">This is my example HTML code</td>
Notice the name portion of the element, name="def". If we wanted to reference the first element with the name of “def” we could write: IE.Document.GetElementsByName(“def”)(0). Even if there were only one element on the page named “def”, it would still be necessary to include the 0 as a reference. The example above is technically accurate but not very realistic. If an element has an ID, it would be simpler to use that in your reference, since it does not require an index number. Let’s look at this section of code to get a better understanding of how using GetElementsByName can be useful.
<td name="def">This is my first HTML code</td>
<td name="def">This is my second HTML code</td>
<td name="def">This is my third HTML code</td>
<td name="def">This is my fourth HTML code</td>
This chunk of HTML code provides a much better illustration of when GetElementsByName can be used. Notice that there are no IDs in these four elements and they all have the same name. However, each of the four is displaying a different phrase in the text portion of the elements. To reference the first one, we would write:
IE.Document.GetElementsByName(“def”)(0)
To reference the second one, we would write:
IE.Document.GetElementsByName(“def”)(1)
To reference the third one, we would write:
IE.Document.GetElementsByName(“def”)(2)
etc.
GetElementsByClassName

The GetElementsByClassName procedure is almost exactly the same as the GetElementsByName procedure. The only difference is that it looks for the element’s class instead of its name. Everything else, including the use of index numbers, is exactly the same.
<td class="ghi">This is my first HTML code</td>
Assuming this element is the first or only one with a class name of “ghi”, we would reference it by writing:
IE.Document.GetElementsByClassName(“ghi”)(0)
GetElementsByTagName

At this point, you can probably guess how to use the GetElementsByTagName procedure. It is, essentially, the same as the previous two procedures, except that it utilizes the name of the tag. Everything else, including the use of index numbers, is exactly the same. For this example, we’ll use “td” as our tag name.
<td>This is my first HTML code</td>
Notice the “td” tag at the start of the element. If this is the first or only element on the page with the tag name “td”, we could reference it by writing:
IE.Document.GetElementsByTagName(“td”)(0)

Working with Elements

Now that you have a basic idea of how to use some of the most common procedures for referencing HTML elements, we’ll explore how these features are useful. Simply adding these procedures to your VBA program will accomplish nothing in and of itself; the code would simply find the element and do nothing with it. Therefore, you must instruct the procedure what to do with each element it references. To explore this topic, we’ll look at some of the most common actions that we typically perform while browsing the web. To do so, we will explore the functionality of the two most commonly used devices that communicate instructions to our computer: the mouse and the keyboard.
The Mouse

We’ll start with the basic functionality that the mouse provides. When browsing the internet, one usually uses the mouse to navigate the cursor to the element one wants to use and, most of the time, clicks the left mouse button to activate whatever function that element performs. For our purposes, the actual navigation of the mouse is irrelevant, because you don’t need to literally move the mouse to an element to click on it. We can, instead, reference the element that we want to use with the previously discussed procedures and then instruct VBA to click on it. In a sense, our reference to the element does the navigating for us, so all we have to do is instruct VBA to click on it. To do so, we can simply put “.click” at the end of the code that refers to the element. For example, suppose there was a button on the website with the following HTML code:

<button id="button1">Click Me</button>
To click this button in VBA, we could write:
IE.Document.GetElementByID(“button1”).click
The Keyboard

In everyday life the keyboard is used to enter text into fields. The difference here is that if you already know how to reference the field that you want to add text to, there is no need to click on it or press “Tab” in order to activate it. Instead, simply state your reference to the element, add “.innertext =”, then add the text that you want to enter in quotation marks. For example, we’ll pretend that you want to enter text into this HTML element:

<td id="abc" name="def" class="ghi">This is my example HTML code</td>
To enter the string “This is my Text” you could write:
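Using our example element’s ID of “abc”, a sketch would be as follows (for genuine form fields, the .Value property is often the better choice; verify which works on your target site):

```vba
IE.Document.GetElementByID("abc").innertext = "This is my Text"
```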
Pulling Data

While entering text will be necessary at times when performing a scrape, it will typically be much more common to do just the opposite: take text from an element and store it somewhere, such as a cell on your spreadsheet or in a variable. To do so, take the code that we used to enter text, but reverse the text on each side of the equal sign. Let’s look at our last example with this adjustment:
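Taken literally, reversing that line would give the following, shown here only as a commented-out illustration because, as explained next, it is not valid code:

```vba
' Illustrative only; you cannot assign to a string literal:
' "This is my Text" = IE.Document.GetElementByID("abc").innertext
```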
Of course, the string “This is my Text” can’t be used to store the contents of the referenced element, so we must change this portion of the code to a variable that is able to store the element’s contents. If we use Var1 as our variable, our code would look something like this: Var1 = IE.Document.GetElementByID(“abc”).innertext. If, instead of storing the element’s contents in a variable, we wanted to store them in our spreadsheet, we can replace the variable in our code with a reference to the cell that we want to use. We’ll use cell A1 for this example.
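A sketch of the cell version:

```vba
Range("A1").Value = IE.Document.GetElementByID("abc").innertext
```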
This technique can be used with any of the aforementioned techniques for referencing an element. Similar lines of code are going to be the “bread and butter” of your scrapes:
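For instance, reusing the hypothetical elements from our earlier examples:

```vba
Range("A1").Value = IE.Document.GetElementByID("abc").innertext
Range("A2").Value = IE.Document.GetElementsByName("def")(0).innertext
Range("A3").Value = IE.Document.GetElementsByClassName("ghi")(0).innertext
```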
Putting the pieces together

Before moving on to the more complicated concepts that may be required, let’s put together the pieces that we’ve covered so far to make a complete macro. In theory, you now have all of the tools required to create a web scrape, though it would have to be a very simple one. By looking at the code for an entire, albeit simple, web scrape, it should be easier to understand the value of the more advanced topics that follow, in addition to providing a recap of the material covered. For this imaginary scrape we’ll navigate to FakeURL, pull text from the element with an ID of “abc” and then pull text from the first element with a name of “def”. I’ve added comments to walk you through every step of the process.
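A sketch of the complete scrape described above; “FakeURL” is the placeholder from the text, and the variable names and destination cells are my own choices.

```vba
Sub SimpleScrape()
    Dim IE As Object
    Dim Var1 As String, Var2 As String

    ' Create the browser object and make it visible
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True

    ' Navigate to the (imaginary) page
    IE.Navigate "http://www.FakeURL.com"

    ' Wait for the page to finish loading
    Do While IE.Busy Or IE.ReadyState <> 4   ' 4 = READYSTATE_COMPLETE
        DoEvents
    Loop

    ' Pull text from the element with an ID of "abc"
    Var1 = IE.Document.GetElementByID("abc").innertext

    ' Pull text from the first element with a name of "def"
    Var2 = IE.Document.GetElementsByName("def")(0).innertext

    ' Store the results in the worksheet
    Range("A1").Value = Var1
    Range("A2").Value = Var2

    ' Close the browser
    IE.Quit
End Sub
```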
Chapter 4: Creative Scraping
As previously mentioned, the tools and techniques covered in the previous section will generally make up most of your scrapes. However, it is rare for a website to be as straightforward as the imaginary one that we created. There are an infinite number of ways to create and structure the code behind a website; therefore, you must be flexible in your ability to apply the aforementioned techniques to create a successful scrape. This section will cover some of the more common problems and solutions that I’ve come across in my scraping career. Even though many of the tools have been provided for you, it will ultimately be up to you to piece them together in the right way to accomplish your goals. The following solutions are just a few of the common ones I’ve had to implement. I’ll illustrate one of the more common challenges that you’ll come across with an anecdote. Say you’re a fantasy baseball guru and you have a list of pitchers that you want to get statistics for. You already know the site that you want to use, but you have to go to a separate page for each pitcher. For each pitcher you want to get their earned run average (ERA), innings pitched (IP), strikeouts (K) and wins (W). If each stat had its own element with an ID, this would be a pretty easy scrape. The HTML code might look something like this:
<td id="era">ERA: 3.12</td>
<td id="ip">IP: 200</td>
<td id="k">K: 100</td>
<td id="w">W: 10</td>
This would be the ideal situation to get the desired stats. You could simply use the GetElementByID method for each stat. We’ll make a variable for each one:
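A sketch, using hypothetical IDs (“era”, “ip”, “k”, “w”); check your target page’s actual HTML for the real ones.

```vba
Dim ERA As String, IP As String, K As String, W As String
ERA = IE.Document.GetElementByID("era").innertext
IP = IE.Document.GetElementByID("ip").innertext
K = IE.Document.GetElementByID("k").innertext
W = IE.Document.GetElementByID("w").innertext
```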
If there is a unique id for each stat on one page of the site, there’s a good chance that the next page will utilize the same id for the same information for the next pitcher. However,
there is a good chance that there will not be a unique id for each stat you want. For instance, the HTML code might look something like this:
<td class="stats1">ERA: 3.12</td>
<td class="stats1">IP: 200</td>
<td class="stats1">K: 100</td>
<td class="stats1">W: 10</td>
This would also be pretty easy to scrape, assuming each pitcher’s page is displaying the same number of stats. You could just use the GetElementsByClassName(“stats1”) with indexes of 0,1,2 and 3.
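In sketch form, assuming string variables for the four stats:

```vba
Dim ERA As String, IP As String, K As String, W As String
ERA = IE.Document.GetElementsByClassName("stats1")(0).innertext
IP = IE.Document.GetElementsByClassName("stats1")(1).innertext
K = IE.Document.GetElementsByClassName("stats1")(2).innertext
W = IE.Document.GetElementsByClassName("stats1")(3).innertext
```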
Using this method, however, will be problematic if each pitcher’s page does not have a consistent number of stats. To illustrate this, let’s look at this section of code on two separate pages:
Pitcher 1 (First Page):
<td class="stats1">ERA: 3.12</td>
<td class="stats1">IP: 200</td>
<td class="stats1">K: 100</td>
<td class="stats1">W: 10</td>
Pitcher 2 (Second Page):
<td class="stats1">ERA: 2.51</td>
<td class="stats1">WHIP: .8</td>
<td class="stats1">IP: 250</td>
<td class="stats1">K: 200</td>
<td class="stats1">W: 18</td>
Notice that on the second pitcher’s page there’s an additional stat after ERA: WHIP. If we were to use the VBA code that we discussed in the last section, the procedure that pulled the IP stat from the first pitcher’s page, IE.Document.GetElementsByClassName(“stats1”)(1).innertext, would now be pulling the WHIP instead. This is because it is looking for the second element with a class name of “stats1” regardless of what information is in the element. By adding the additional stat, the index after the GetElementsByClassName procedure is now off by one for all of the remaining GetElementsByClassName procedures on the page. In other words, IE.Document.GetElementsByClassName(“stats1”)(0).innertext would correctly pull the ERA stat, but IE.Document.GetElementsByClassName(“stats1”)(1).innertext would pull the WHIP instead of IP, IE.Document.GetElementsByClassName(“stats1”)(2).innertext would incorrectly pull IP instead of K and so on.

This is where creative problem solving is of enormous importance. With this limited amount of information, my suggestion would be to loop through all of the elements in the class, looking through the text of each element to determine whether or not it contains the data that we’re after. In this situation, we can utilize the first few characters of each field to determine whether or not it is one of the elements that we want. For instance, only the ERA field will contain the text “ERA:”. As you can see in our previous example with Pitcher 1, the ERA field contains “ERA: 3.12”. If we don’t know what the index number will be for the desired ERA statistic, we can loop through each element checking for the string “ERA:”, and once we find a field that does contain the string, we can infer that it is our ERA statistic and take the data we want from it. To set up this loop, we’ll count how many elements have a class name of “stats1” and then loop through each one, checking for the text “ERA:”. To do this we’ll utilize the InStr procedure.
This procedure searches for a string inside of a larger string and is performed as follows: InStr([starting position], [large string], [small string]). For the [starting position] portion of the code, I almost always use the number 1 so that the search starts at the beginning of the larger string. The rest is pretty self-explanatory. The [large string] is the string that we assume will, at some point, contain the small string that we are searching for. For example, if we use “ABCDEFG” as our large string and “E” as our small string, the procedure would look like this: InStr(1, “ABCDEFG”, “E”). The result of this procedure should be the number 5, as “E” is in the fifth position of our large string.
However, if we try to run this procedure again with a small string that is not in the large string, for example InStr(1,”ABCDEFG”,”Q”), the result will be 0. Thus, if we apply this strategy to determine whether or not each element that we are looping through contains the text “ERA:”, if we get a result of anything that is greater than 0, then the procedure is implying that “ERA:” must exist somewhere in the string that it is testing and, therefore, must be the element that we are looking for. To make the code easier to write and follow, I created an object variable to hold my collection of element references (or multiple IE.Document.GetElementsByClassName(“stats1”) references if you want to think of it that way). In other words, by putting:
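The line in question, assuming PitchStats has been declared as an Object variable:

```vba
Set PitchStats = IE.Document.GetElementsByClassName("stats1")
```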
for the rest of the code I can write PitchStats instead of the entire reference, IE.Document.GetElementsByClassName(“stats1”) every time that I want to use the reference. You might also notice my For Loop runs from 0 to (PitchStats.Length - 1). This is done because, as previously mentioned, the index for elements starts at 0 instead of 1, so we must start the loop from 0 instead of 1. However, if the loop ran from 0 to PitchStats.Length (the number of elements that we’re checking), it would run through one too many iterations, causing an error. It must, therefore, only run to PitchStats.Length - 1 to account for the index starting at 0 instead of 1. You might also notice the Exit For procedure after the variable is assigned in the if statement. This was added because once we’ve found the element that we are looking for and the variable is assigned, there’s no need to run through the remaining iterations of the loop.
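Putting the pieces together, the search loop described above might be sketched like this (the variable names are my own):

```vba
Dim PitchStats As Object, x As Long, ERA As String
Set PitchStats = IE.Document.GetElementsByClassName("stats1")
For x = 0 To PitchStats.Length - 1
    ' Does this element's text contain "ERA:"?
    If InStr(1, PitchStats(x).innertext, "ERA:") > 0 Then
        ERA = PitchStats(x).innertext   ' e.g. "ERA: 3.12"
        Exit For   ' found it; skip the remaining iterations
    End If
Next x
```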
Parsing

The skill of parsing data can be imperative when it comes to giving structure to scraped data. Being proficient at parsing creates more options for the scraper in terms of what elements to scrape and how to use the data. Here’s a common scenario that you’ll come across while scraping. Let’s pretend that I’m going to scrape from our HTML code again, but this time I will pull the parent element, which has a class name of “stats”, instead of the child elements that have a class name of “stats1”. In doing so, I will be pulling data
from all of the child elements at one time instead of individually from each child element. To do so I could write:
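A sketch, with AllStats as an assumed variable name for the combined text:

```vba
Dim AllStats As String
AllStats = IE.Document.GetElementsByClassName("stats")(0).innertext
```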
While this method seems much simpler than the previous one, there’s a major issue in that it will pull all of the text from all of the child elements into a single string. For this example our HTML code will be:
<tr class="stats">
<td class="stats1">ERA: 3.12</td>
<td class="stats1">IP: 200</td>
<td class="stats1">K: 100</td>
<td class="stats1">W: 10</td>
</tr>
If we pulled the innertext from the “stats” element, it would look something like this: “ERA: 3.12IP: 200K: 100W: 10”. While this is clearly an easy way to get ahold of a large chunk of data, without parsing this string into coherent pieces, the chunk is essentially useless. As you can see, in this situation it would be much easier to just pull each of the desired pieces of data from the child elements (“stats1”). However, there will be times when this will not be a practical option and it will be necessary to parse your misshapen wad of data into something useful. There are an infinite number of methods and tools that can be used for parsing. However, for our purposes we’ll stick with some of the basics, which can be very versatile. There’s a good chance you’re familiar with these concepts already, but for the sake of being thorough we’ll review them anyway.

Left() - This function pulls a specified number of characters from the string, starting from the first character on the far left of the string. Here is the structure for using this function: Left([String], [Number]). For example, let’s look at the string “ABCDE”. If we were to write Left(“ABCDE”, 1), it would pull the letter “A” because A is the single letter at the far left of the string. If we were to write Left(“ABCDE”, 2), it would pull “AB”. Left(“ABCDE”, 3) would pull “ABC” and so on.

Right() - This function is exactly the same as the Left() function except that it starts from the opposite side of the string. Right(“ABCDE”, 1) would pull the letter “E”, Right(“ABCDE”, 2) would pull the letters “DE”, etc.
Mid() - The Mid() function is quite possibly the most valuable when it comes to parsing data. It is structured as Mid([String], [Number1], [Number2]). The string is, obviously, the string that you want to pull data from. Number1 is the index number of the character that you want to start pulling from. This sounds confusing, but it isn’t. Just think of each character of your initial string being assigned a number, starting at 1 with the first character, 2 for the second, 3 for the third and so on. So, if your initial string was “ABCDEFG” and you wanted to pull the characters starting with the letter “B”, your Number1 in your Mid function would be 2 because “B” is the second character in the string. If you wanted to start from “C”, then Number1 would be 3, etc. Number2 in your function is just the number of characters that you want to pull. Assume you want to pull 2 characters starting with the letter “C”; you could write Mid(“ABCDEFG”, 3, 2). The result of this function would be “CD” because “C” is the third character in the string, so if we start from “C” and pull two characters, they would be “CD”. If we use the exact same function but change Number2 to 3 instead of 2, Mid(“ABCDEFG”, 3, 3), then the result would be “CDE”, etc.

Len() - This function is pretty simple. It gives the number of characters in the string and is structured as Len([String]). For example, Len(“ABC”) would result in the number 3, Len(“ABCD”) would be 4, etc.

InStr() - We’ve already reviewed this function, so just to briefly recap: the InStr function finds the location of a small string within a larger string. The result of this function is the position of the character at which the smaller string begins in the larger string. For example, InStr(1, “ABCDEFG”, “DEFG”) would return 4, since the smaller string begins at the fourth character of the larger string.

We’ll now take these basic functions and use them to parse our example string: “ERA: 3.12IP: 200K: 100W: 10”.
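All five functions in one place, with their results shown in comments (run these in the VBA Immediate window via Debug.Print):

```vba
Debug.Print Left("ABCDE", 2)             ' AB
Debug.Print Right("ABCDE", 2)            ' DE
Debug.Print Mid("ABCDEFG", 3, 2)         ' CD
Debug.Print Len("ABC")                   ' 3
Debug.Print InStr(1, "ABCDEFG", "DEFG")  ' 4
```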
The first stat that we want to procure from the string is ERA. For the sake of our first demonstration, we’ll pretend that the ERA stat can only be a single digit followed by a decimal followed by two more digits (i.e. X.XX). If this were the case, then this would be a very easy parse with the Mid() function. As you recall, this function requires 3 inputs or variables: the string which we will take our substring from, the starting position of our substring and the length of our substring. We, of course, already know our string variable, “ERA: 3.12IP: 200K: 100W: 10”. The next input we need is the starting position for our substring. If we are confident that this string will always begin with “ERA: ”, which is five characters long (the four characters of “ERA:” plus the space), then we can infer that our ERA stat will always begin at the sixth position of our string; therefore our [Number1] would be 6. Now we just have to determine our [Number2] input, which is the length of our substring. Since for this example we established that the stat will be in the format of “X.XX”, this substring will always be 4 characters long. By putting all of the pieces together, our procedure looks something like this:
Mid(“ERA: 3.12IP: 200K: 100W: 10”, 6, 4)
Now let’s use the same example but with a much more realistic circumstance: that ERA will not always be in the “X.XX” format. Let’s say that some of the pitchers that we’ll be collecting data for haven’t been doing so well and have ERAs that require an additional digit and must be in the format “XX.XX”. In this circumstance, if we continue using the Mid() function, then our first two inputs would be the same, but we would have to adjust our variable for the length of our substring, since it will sometimes be 4 and sometimes 5. For this example, we’ll pretend that the stat IP will always follow the ERA stat. Since the substring “IP” is always after the ERA stat that we are after, we can find the location of “IP” to help determine how long our ERA stat is. If we use the InStr function we can determine where “IP” is in our string and, thus, where the ERA stat ends. We can then take that position and subtract 6 to get the length of our ERA stat (we subtract 5 for the five characters of “ERA: ” that come before the stat, and an additional 1 because the “I” of “IP” sits one character past the last ERA digit), and we can then use the result as [Number2] in our Mid() function.
For an “XX.XX” pitcher, our ERA stat is 5 characters long. We can now use this logic to finish our Mid() function. I’ll create a variable called Num2, which will be our [Number2] input. I’ll also create the variable StatString for the string that we are parsing.
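Putting the pieces together, the macro might look something like this (the variable names Num2 and StatString come from the text above; the Sub name is my own):

```vba
Sub ParseERA()
    Dim StatString As String
    Dim Num2 As Integer

    StatString = "ERA: 3.12IP: 200K: 100W: 10"
    'The stat begins at position 6, so subtracting 6 from the
    'position of "IP" gives the length of the ERA substring
    Num2 = InStr(StatString, "IP") - 6
    MsgBox Mid(StatString, 6, Num2)   'Displays "3.12"
End Sub
```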
Upon running this macro, a dialog should appear displaying the ERA stat in the correct format. Because of the logic that we’ve used, we can run the exact same macro on the exact same string but with a larger ERA stat, and it will still display the correctly formatted ERA.
Let’s now attempt this same parse under a more complicated scenario. For this example, we’ll assume that we don’t know which stat follows our ERA stat. Consequently, one page’s ERA might be followed by wins or walks or any other stat. Regardless of the specifics, the takeaway is that we can’t rely on whatever stat follows ERA since it may not be consistent, so we cannot utilize it as we did in the previous section. There is still, however, one feature of the structure of our ERA stat that is in our favor. Even if you are not familiar with baseball statistics, it would not take long to notice that regardless of the value of one’s ERA, it is almost always displayed with two digits after the decimal. We can use this characteristic to our advantage by determining the location of the decimal. This can be done multiple ways. You could, for instance, use the InStr function again. However, for this example I will use an additional method solely for the sake of demonstrating its use. With this technique, you create a loop that checks each character in a string for the character you’re looking for. Observe the code below. The For Loop is set up to run from 1 to the number of characters in our StatString.
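A sketch of that loop (the Sub name is my own; StatString holds the same stat string as before):

```vba
Sub FindDecimal()
    Dim StatString As String
    Dim x As Integer

    StatString = "ERA: 3.12IP: 200K: 100W: 10"
    'Check each character until the decimal point is found
    For x = 1 To Len(StatString)
        If Mid(StatString, x, 1) = "." Then Exit For
    Next x
    'x now holds the position of the decimal (7 for this string)
    MsgBox Mid(StatString, 6, x - 3)  'Displays "3.12"
End Sub
```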
With the Mid(StatString, x, 1) section of code, each iteration of the loop looks at a substring of StatString starting at x (the iteration of the loop) that is one character long. In a sense, this technique examines each character of the string as its own substring and determines whether or not it is equal to “.”. When the loop finds a character that equals “.”, the loop ends with x still holding the position of the “.” character in the string. With this example, x would be 7. Since we know that the ERA stat ends two digits beyond the decimal, we add 2 to get the position of the last character of ERA, which is the 9th position. Of course, it would not make sense for our substring to be nine characters long. We must, therefore, subtract the number of characters in the larger string that come before our substring. Since we already know that our starting position will always be 6, five characters precede the substring, which means we will always subtract 5 from the end position to get the number of characters for our substring to use as [Number2] in our Mid() function. For the sake of keeping our code as simple as possible and omitting unnecessary steps, instead of adding 2 and then subtracting 5 ([Number2] = x + 2 - 5), we skip the extra arithmetic and simply subtract 3 from x ([Number2] = x - 3). As previously mentioned, there are countless ways to go about parsing strings. These are just a few of the methods that I often employ to get the data that I need. As you do more parsing, you will develop your own techniques and styles that suit your needs best and make the most sense to you. There are times when parsing a particular string might feel like a Rubik’s Cube, where each move has the potential to destroy previously made progress.
As such, it will behoove you to keep the big picture, or overall goal, in sight as you work on the smaller pieces of code. Eventually even the most complex of parses will become second nature to you.
Chapter 5: Finishing Touches
The material in this chapter is not necessary when it comes to performing scrapes. However, these tips and tricks can make the regular use of your scrapes easier. As an added bonus, they also do a good job of impressing managers and colleagues.
Count and Summary

One of the easiest “extras” that I typically add to scrapes is one or more counters which keep track of records as they are scraped. The purpose of this is to provide a summary at the conclusion of the macro. Counters can be used to keep track of almost any aspect of the data you are pulling. To do so, create an integer variable and add 1 to it every time the event you are counting occurs. This can just as easily be done when counting multiple types of events by creating an integer variable for each count. At the end of the macro, add a message box that displays the count and whatever other information you want to include in the summary. For this example, we’ll again pretend that I’m scraping a baseball website for pitcher data, with each pitcher’s statistics displayed on its own page, and that I want to keep two separate counts, one for right-handed pitchers and one for left-handed pitchers, and display a summary of the results at the end of the scrape. We’ll also pretend that the element that displays the pitcher’s handedness has an ID of “Hand” and that the NoOfPages variable holds the number of pages that will be scraped.
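A sketch of what that might look like, assuming (as in earlier chapters) that IE holds the Internet Explorer object and NoOfPages has already been set; the Sub name, counter names, and the “R” value the page reports are my own assumptions:

```vba
Sub ScrapeWithCounts()
    Dim RightCount As Integer, LeftCount As Integer
    Dim i As Integer

    'Initialize the counters outside of the main loop
    RightCount = 0
    LeftCount = 0

    For i = 1 To NoOfPages
        '[Navigate to pitcher page i and scrape its data]
        If IE.document.getElementById("Hand").innerText = "R" Then
            RightCount = RightCount + 1
        Else
            LeftCount = LeftCount + 1
        End If
    Next i

    MsgBox "Scrape complete." & vbNewLine & _
           "Right-handed pitchers: " & RightCount & vbNewLine & _
           "Left-handed pitchers: " & LeftCount
End Sub
```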
As this example demonstrates, it’s important to set the variables to 0 outside of the main loop. If this part is put inside the loop, then each iteration of the loop will reset the variables to 0 which would make them pretty useless at counting.
Error Handling

Error handling is a crucial component of advanced VBA development. While error handling might not be necessary for the simplest of VBA scrapes, when it comes to larger or more complicated scrapes, knowing how to properly dictate procedure when errors occur can make your life much easier. Since so many things can go wrong when working with VBA, it would be impossible to address every potential situation, so we’ll just cover the basic idea of error handling and some of the functionality that I’ve found most useful. The basic idea behind error handling is simple: it is telling the macro how to proceed when an error occurs in the course of your script. The most commonly used method for handling errors is often placed at the beginning of a simple script: “On Error Resume Next”. This statement simply tells the program to proceed to the next line when an error occurs. This technique can be useful for simple scripts but will likely have to be avoided in more complicated macros. When an error occurs in larger, more complex code, it’s unlikely that simply skipping a line will really fix the problem, so your program is likely to be derailed even if you skip the line where the error is first noticed. The technique that I find more useful is to create a section of code, usually placed at the very end of the macro, dedicated to the handling of errors.
Sub ExampleMacro()
On Error GoTo Errhandler

    [Main portion of macro]

Exit Sub
Errhandler:
    'Execute this section of code when an error occurs
End Sub
After your errors are addressed in this section of code, you’ll likely want to either end your macro or specify a location in your script for it to resume. If your error handling section is at the end of your script, the macro will end after the error procedure is performed. If you want your macro to resume after the error code is performed, you must have a specified location in your code for this point. To establish this, simply add the label name you’d like to refer to this point by, followed by a colon, at the location where you’d like your script to resume. To continue running your macro at this point, put “Resume [point name]” at the end of your error code. In this demonstration, I’ll call this point Restart.
Sub ExampleMacro()
Restart:
On Error GoTo Errhandler

    [Main portion of macro]

Exit Sub
Errhandler:
    [Execute this section of code when an error occurs]
    Resume Restart
End Sub
Naturally you won’t be able to create an error handler for every possible problem that can trigger an error in your code. This is especially true of scrapes compared to other macros as you have no control over the content of the websites you are scraping and they can change at any time. There’s no conceivable way that you can write a script that can adjust to every potential change that a website can make, therefore your error handlers will probably have limited value in terms of ameliorating new problems that occur when pulling data from a website. What I’ve found to be a useful technique for long scrapes that are impractical to constantly monitor is to place a section of code in the error handling section that triggers an email to be sent to me which notifies me that an error has occurred in my macro so that I know to address the problem when I have the chance. In these situations, my error handling code would look something like this:
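A sketch of such a handler, using late-bound Outlook objects so no library reference is required (the recipient address is a placeholder, and this assumes Outlook is installed and configured on the machine running the macro):

```vba
Errhandler:
    Dim OutApp As Object, OutMail As Object
    Set OutApp = CreateObject("Outlook.Application")
    Set OutMail = OutApp.CreateItem(0)   '0 = olMailItem
    With OutMail
        .To = "you@example.com"          'Placeholder address
        .Subject = "Macro error - " & Now
        .Body = "Error " & Err.Number & ": " & Err.Description
        .Send
    End With
End Sub
```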
This is a simple yet effective piece of code for addressing macro errors. It uses Microsoft Outlook to create a blank email, adds the relevant information that might be useful when addressing the error, and sends the email. Notice that the subject line has the date and time that the error occurred and the body of the email has the specific error code. Whether or not you want the macro to try to continue running after sending this email will depend on the overall structure of the code and the tasks it is performing. Keep in mind that if you instruct your macro to resume after the error code runs, and errors continue, then you will continue getting emails. In these situations, it may be beneficial to put a counter in your error code which stops the macro when it reaches a certain number of errors so that you don’t end up with an inbox holding thousands of emails informing you of macro errors.
Scrolling

In a typical scenario, your macro will be pulling data and inputting it into your spreadsheet one line at a time. Depending on your zoom, after about 20-30 rows of data are added, you will no longer be able to observe the data being added unless you manually scroll down. This can be a pain when you’re trying to keep track of how many records your scrape has pulled during the course of its run. To get around this nuisance, I’ve found it beneficial to add a piece of code that automatically scrolls down as each row of data is added. This way, the row where your data is being added is always visible when looking at your spreadsheet. This small piece of code will scroll the view of your spreadsheet one row at a time: ActiveWindow.SmallScroll down:=1. Since you probably want several rows to be populated before this scrolling begins (if it started immediately, you’d only see the top row of data being populated), it can be helpful to add a counter which counts each row as it is populated and to instruct this section of code to run only when the counter reaches a certain number. Once the counter reaches that predetermined number, the scrolling will occur one row at a time for each iteration of the scrape.
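Inside the main scrape loop, that might look like this (the counter name and the threshold of 25 rows are my own choices):

```vba
'After each row of data is written to the sheet:
RowCount = RowCount + 1
If RowCount >= 25 Then
    'Keep the newest row in view once the screen fills up
    ActiveWindow.SmallScroll down:=1
End If
```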
Chapter 6: Conclusion
Review

This book has covered various aspects of web scraping with VBA and some of the more common challenges that you may come across as you learn the art. To briefly recap, we started by exploring what web scraping is and how it can be used both professionally and personally, as well as the legal issues that may accompany the use of this skill in certain scenarios. We also explored the utility of VBA and how it can be used to manipulate data in Excel and interact with other programs. From there, we reviewed the basics of programming with VBA while focusing on the features that most benefit the web scraper. We then explored the essential and more advanced components and tools that VBA offers for interacting with Internet Explorer to navigate the web and extract data from websites. Finally, we briefly explored the art of parsing data and why it is essential to the web scraper.
Final Thoughts

You may have noticed that I’ve repeatedly referred to web scraping as an art. This is because, despite the degree of technical know-how required, the more advanced scrapes demand a large amount of individual creativity, which means that each program you write will be unique to your own style of programming, your way of thinking about data, and the best way to mold the data available to you into the structure that suits you best. Web scraping has become more common in the marketplace and is now considered by many to be an essential component of a successful business. This is a skill that is in high demand and can set you apart from others in the workplace. I know this because I have personally experienced it. Sharing this experience, and how the material in this book can benefit others, is one of the primary reasons I decided to write this book. It’s my hope that the information here will help VBA developers, or aspiring VBA developers, acquire the necessary skills in the art of web scraping to further their careers and ultimately achieve their aspirations. This book isn’t meant to instantly turn the reader into an all-star VBA developer or data scraper; achieving those goals requires work and practice that can’t be gained simply by reading a book. However, the information contained here should provide you with some of the most valuable tools of the trade and prepare you for some of the more common challenges you’ll face in your scraping endeavors.