How To Scrape Data From A Webpage With ParseHub: Your Contact List’s Architect

Recently I have fallen in love with a new (to me) tool called ParseHub. It is, among other things, an incredible tool for scraping webpages. If you are unfamiliar with what it means to “scrape” a site, no worries. Put simply, it means using a tool to extract and parse data from one or more webpages. This can be an invaluable way to build lists quickly for your marketing campaigns.

Using ParseHub could not be easier once you learn a few basics. It will not only return specific data from a page, but you can also “teach” it to “click” links and go to other pages from which to return data. So if a page has 50 links of company names, and then company info after you have clicked each link, ParseHub will do that for you. If that makes no sense, it will after we go through a quick demo.

But first, you may be wondering what good this is for inbound marketing. After all, isn’t the point to encourage prospects to come to YOU? Obviously! However, in order to cut through the noise, we often need to first get our message in front of the right eyes. We have had success doing this through Facebook Ads in the past. We were also able to build a fairly extensive list of German horse riding clubs for a campaign we did for our client Litzclip using this very tool.

So, let’s get into our demo! Feel free to follow along with the graphics and explanation, or simply watch the screen capture demo below.

1. First you’ll need to register and download the app here.

2. Upon opening ParseHub you’ll see this

You will see some tutorials here, both beginner and advanced. This is a great way to learn all the different uses of ParseHub. What we will do first is walk through a simple page scrape.

To the left you will see “New Project.” Click here and let’s get started.

You will be asked to enter a webpage from which you would like to extract data. Let’s say you want to create a list of Indiana University basketball players. So, you search “indiana basketball players” in Google and find this page at Wikipedia https://en.wikipedia.org/wiki/List_of_Indiana_Hoosiers_basketball_players with a list of players, a perfect chance to scrape!

Enter that page into the project box and hit the “start project on this URL” button. After doing this, we should see the webpage appear to the right.

On the left you will see the Commands and Settings. Here is where you will teach the program to do what you want it to do. Immediately we see “main_template”, “Select page (1)” and “Empty selection1.” These are the steps that ParseHub will go through to extract data. When a tab is indented, think of it as living within the tab above it.

The “Select page (1)” tab tells ParseHub what page you are extracting from (in our case it is the wikipedia page) and beneath that you can add different commands by clicking the + button. We want to “Select” something from the page to extract, so we will choose “Select.”

In the right hand preview pane, your cursor becomes a choosing tool. Simply click on the data that you would like ParseHub to extract. Hovering over the first name, you will see a blue box put around the element. Clicking on one of the names will put a green box around it. After clicking on the name you will also notice that the other names will have yellow boxes around them. If we stop here and run the program it will ONLY extract that first single element, but the whole idea is that we want multiple elements. So, simply click on another name and you will see that all of the names will now have green boxes around them. This means you will be extracting each one of those elements.

After doing this, you will see that ParseHub has added a couple of steps in our commands section. These new commands are telling us what ParseHub will do, so just make sure they make sense with your goal. By default, ParseHub has added “Extract name” (which is what we wanted) and “Extract url.” This means ParseHub will not only extract the names, but because these names are also links to other URLs, it will also extract and list those URLs. We can leave it this way, no harm done, but if you don’t need the data you can simply delete it as a command by hovering over the command and clicking the “x” button, like below.
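If it helps to picture what “Extract name” and “Extract url” hand back, here is a rough Python analogy. It is only a sketch and not how ParseHub works under the hood: the HTML snippet, the placeholder player names and the `LinkExtractor` class are all made up for illustration, and it uses nothing beyond Python's standard library.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for the list of player links on the page.
SAMPLE_HTML = """
<ul>
  <li><a href="/wiki/Player_A">Player A</a></li>
  <li><a href="/wiki/Player_B">Player B</a></li>
</ul>
"""


class LinkExtractor(HTMLParser):
    """Collects the text and href of every <a> element, one record per link."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record for the <a> we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # "Extract url" analogue: keep the link target.
            self._current = {"name": "", "url": dict(attrs).get("href")}

    def handle_data(self, data):
        if self._current is not None:
            # "Extract name" analogue: keep the visible link text.
            self._current["name"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["name"] = self._current["name"].strip()
            self.records.append(self._current)
            self._current = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.records


records = extract_links(SAMPLE_HTML)
# Deleting the "Extract url" command is like keeping only the names.
names_only = [record["name"] for record in records]
print(records)
print(names_only)
```

Each selected element becomes one record, and each “Extract” command adds one field to that record. Removing a command simply drops that field from every record.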

Ok, so we set out just wanting to create a list of Indiana University basketball players. If we were to run the app now that is exactly what we would get. However, let’s push it a bit further. Let’s also extract the first year that each one of these players played at IU. Simple enough. We will follow similar steps as we did to extract the players’ names.

WAIT! Before we just repeat what we did and go on our merry way, let’s think about this. Each one of these years will pertain to a single name. We want them bound together, not as separate data points. So, we will instead use the “Relative Select” command. We also need to “add” this command not under the “Select page” command heading, but under the “Select name” heading because it will be connected to the names we are already extracting. This tells ParseHub that you are selecting an element that will relate directly to another element you have already chosen to extract.

So we will use the + button on the “Select name” command line and choose “Relative Select.” Make sense? If not, just keep following along with the photos. Below, I have also renamed the steps “name” and “year” so I remind myself what each command line is doing.

We see that it gives us the “Relative” command directly beneath the “Extract” command.

Now, we get a pop up on our preview pane telling us that we first need to click an already selected element (which for us will be the names) and then click the element we want to bind to that (for us this will be the year).

Once we have chosen a couple “relations” we see the preview pane learning what we want to do.
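To see why binding the year to the name matters, here is a small Python sketch of the same idea. Treat it as an analogy rather than ParseHub's actual behavior: the table markup, the placeholder names and years, and the `RowExtractor` class are all assumptions for illustration, and the real Wikipedia page may be laid out differently.

```python
from html.parser import HTMLParser

# Hypothetical markup: one row per player, name in the first cell, year in the second.
# The names and years are placeholders, not real data.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/wiki/Player_A">Player A</a></td><td>1990</td></tr>
  <tr><td><a href="/wiki/Player_B">Player B</a></td><td>2001</td></tr>
</table>
"""


class RowExtractor(HTMLParser):
    """Keeps each year attached to the name that shares its table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None     # text of the cells in the current <tr>
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_cell = True

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            if len(self._cells) >= 2:
                # The year is read relative to the name in the same row.
                self.rows.append({
                    "name": self._cells[0].strip(),
                    "year": self._cells[1].strip(),
                })
            self._cells = None


parser = RowExtractor()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

Two independent selections would give us one list of names and a separate list of years, with nothing guaranteeing they line up if a year is missing somewhere. Reading the year relative to each name keeps every pair together in a single record.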

Now, we see if our hard work is going to pay off! At the bottom we will click “Get Data”, then “Run” and finally “Save and Run.” As long as no mistakes have been made, we will get this dialog box telling us that ParseHub is on the job!

We don’t need to worry too much about all this info. We can now work on something else while ParseHub does its work. When it is finished, we will get an email telling us that our data is available for download. Most jobs don’t take more than a couple minutes, but that obviously depends on the amount of data to be extracted. Ok, let’s check the results!

Simply click on the CSV button to download the data, which you can then open in Excel or Numbers.

Boom! Hopefully your result looks something like this!

If you got a different result, don’t worry. It takes a few times to learn how the program runs. Just keep trying things, read a few of the tutorials and you will quickly learn how to set things up to get the exact data you want!
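Once the CSV is downloaded, a quick sanity check can save you from mailing a half-empty list. The sketch below uses Python's built-in `csv` module. The column names (`name_name`, `name_url`, `name_year`) and the sample rows are assumptions made up for this example, so check the header row of your own export and adjust accordingly.

```python
import csv
import io

# Stand-in for the downloaded file; the column names and rows are hypothetical.
SAMPLE_CSV = """name_name,name_url,name_year
Player A,https://example.com/player_a,1990
Player B,https://example.com/player_b,
"""


def load_rows(handle):
    """Read an export into a list of dicts, one per row."""
    return list(csv.DictReader(handle))


def rows_missing(rows, column):
    """Return the rows where the given column is empty or absent."""
    return [row for row in rows if not (row.get(column) or "").strip()]


rows = load_rows(io.StringIO(SAMPLE_CSV))
incomplete = rows_missing(rows, "name_year")
print(f"{len(rows)} rows, {len(incomplete)} missing a year")
```

For a real export, swap the `io.StringIO(...)` stand-in for something like `open("results.csv", newline="", encoding="utf-8")`, where `results.csv` is whatever you named the download. If a lot of rows come back incomplete, that is usually a sign the selections need another click or two in the preview pane.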
