When building web applications, you sometimes need to fetch data from other sources. Perhaps you’re building a custom RSS feed of news items based on your interests, or you want to aggregate data from several sites. In any case, it’s not always possible to do this elegantly: you may not have direct access to the raw data, and the site may not offer an API. For these situations there’s one general (albeit fragile) solution: manually parse the HTML that a browser receives when it loads the page.
There are many different ways to build a web scraper. A server-side language such as PHP will have a much easier time, as it has fewer limitations and there are existing libraries (such as cURL) to accomplish the task. I’ll be using JavaScript (with the jQuery library), however, as I want a standalone client independent of server technology.
Let’s start with a random snippet of code:
```javascript
function fetchPage(url) {
  $.ajax({
    type: "GET",
    url: url,
    error: function(request, status) {
      alert('Error fetching ' + url);
    },
    success: function(data) {
      parse(data.responseText);
    }
  });
}
```
The above function attempts to make an AJAX call to a specified web page, fetch the HTML of said page and pass it to a parsing function. Simple enough, right?
Unfortunately there’s a little problem with this. Code running in the browser has to deal with something called the same origin policy, which basically restricts scripts from reading data on external domains. From a security standpoint this is obviously a good thing; however, it can be a headache for web app developers (indeed, there are ideas such as the Origin HTTP header to solve this). (The following bit can be ignored if the scraper is on the same domain as the data source – but in such a case why is said scraper even necessary?)
In this case, there are a couple of solutions (that I can think of). Both of them take advantage of the fact that JavaScript can read JSON even if it is located on a different domain (via JSONP, where the response is loaded as a script).
- The first is to write a complementary server-side script that takes a few arguments (such as the target URL), makes the call, converts the result into JSON and passes it back to the calling function. It’s a simple idea, but I never really liked it because it introduces a major dependency (which, in my case, can be a pain to deal with). However, it’s definitely a viable solution that works extremely well.
- The second is to use an existing system such as Yahoo’s YQL to fetch the required data and return it in a structured form. This is the method that I’ll be using.
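To make the JSON trick a bit more concrete, here is a minimal sketch of what the response from such a helper script could look like. The names `toJsonp` and `handlePage` are made up for illustration, and it’s written in JavaScript only to match the rest of this post; the server-side script could be in any language.

```javascript
// Hypothetical helper: wraps a payload as JSONP text, the kind of response a
// helper script could send back so that a page on another domain can load it
// through a <script> tag.
function toJsonp(callbackName, payload) {
  // Only accept identifier-like callback names so the response
  // cannot be turned into arbitrary script.
  if (!/^[A-Za-z_$][\w$]*$/.test(callbackName)) {
    throw new Error("Invalid callback name: " + callbackName);
  }
  return callbackName + "(" + JSON.stringify(payload) + ");";
}

// On the calling page, a global function with the same name receives the
// data when the browser executes the returned script.
function handlePage(result) {
  return result.html.length;
}

// Example payload with placeholder values.
var body = toJsonp("handlePage", { url: "http://example.com/", html: "<h1>Hi</h1>" });
console.log(body);
// handlePage({"url":"http://example.com/","html":"<h1>Hi</h1>"});
```

Because the response is executed as a script rather than read through an AJAX call, the same origin policy doesn’t get in the way. The catch is that the calling page has to trust whatever server produces that script.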
YQL is an interesting little beast. From their site:
The Yahoo! Query Language is an expressive SQL-like language that lets you query, filter, and join data across Web services. With YQL, apps run faster with fewer lines of code and a smaller network footprint.
I haven’t had much time to look into it, so I did what all programmers do: Google the living crap out of the problem until a solution turns up.
In this case, Chris Heilmann has a great post on how to load external content via various methods, including YQL. To simplify things, James Padolsey wrote a plugin that detects an external AJAX call and passes it to YQL automatically.
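The real plugin is more thorough, but the idea of “detect an external call and hand it to YQL” can be sketched roughly as follows. This is not Padolsey’s code: the function names are hypothetical, and the endpoint and the `html` table are as I recall them from YQL’s public documentation, so treat both as assumptions to verify.

```javascript
// Assumed endpoint, recalled from YQL's public documentation; verify before use.
var YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql";

// True when targetUrl is on a different origin (scheme, host, port)
// than the page making the request. Relative URLs resolve against pageUrl.
function isExternal(targetUrl, pageUrl) {
  return new URL(targetUrl, pageUrl).origin !== new URL(pageUrl).origin;
}

// Builds a request asking YQL to fetch targetUrl on our behalf.
// "callback=?" is the placeholder jQuery swaps for a generated JSONP callback.
function buildYqlUrl(targetUrl) {
  // Encode double quotes so the URL cannot break out of the query string literal.
  var safeUrl = targetUrl.replace(/"/g, "%22");
  var query = 'select * from html where url="' + safeUrl + '"';
  return YQL_ENDPOINT + "?q=" + encodeURIComponent(query) + "&format=xml&callback=?";
}

// Leaves same-origin URLs alone and routes everything else through YQL.
function resolveRequestUrl(targetUrl, pageUrl) {
  return isExternal(targetUrl, pageUrl) ? buildYqlUrl(targetUrl) : targetUrl;
}

// Placeholder URLs, for illustration only.
console.log(resolveRequestUrl("/feed", "https://example.com/app"));
console.log(resolveRequestUrl("http://news.example.org/", "https://example.com/app"));
```

The first call is left untouched because it stays on the page’s own origin; the second is rewritten into a YQL request.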
Anyway, that solves the problem of fetching data from an external source. All that’s left is extracting the relevant pieces for whatever application is being built. Consider the following example:
```javascript
function parse(data) {
  alert($(data).find("h1").text());
}
```
The parse() function takes the responseText data from the earlier fetchPage() function and then proceeds to pick at it slowly and painfully. What’s really cool about it is that pretty much all of jQuery’s selectors can be used to select relevant data. In the above case, I’m extracting the text inside the <h1> tags found on the page (.text() combines the text of every match, so a page with a single <h1> yields just that heading) and outputting it as an alert in the browser. Obviously there are more complex uses for this, but they are outside the scope of this post.
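As a small step beyond a single alert, here is a hedged sketch of pulling several pieces out at once. The `extractFields` helper and the selectors in the usage comment are hypothetical; the only jQuery methods it relies on are `.find()`, `.first()` and `.text()`.

```javascript
// Hypothetical helper: given a jQuery object and a map of field names to
// selectors, returns a plain object holding the text of each first match.
function extractFields($root, selectors) {
  var result = {};
  Object.keys(selectors).forEach(function (name) {
    // .first() limits the match to one element; .text() on its own would
    // concatenate the text of every matching element.
    result[name] = $root.find(selectors[name]).first().text().trim();
  });
  return result;
}

// In the browser, with jQuery loaded (selectors are made up for illustration):
// var fields = extractFields($("<div>").html(data.responseText), {
//   title: "h1",
//   author: ".byline"
// });
```

Wrapping the markup in a container `<div>` lets `.find()` see elements that would otherwise sit at the top level of the parsed HTML, since `.find()` only searches descendants.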
And…that’s it!
Those two code blocks combined pretty much handle all of the grunt work in extracting data from external sources. Again, this is a rather fragile solution – it can break if the target page’s HTML changes (one way to reduce the risk is to select on unique identifiers or classes). In addition, some sites may have restrictions on their data – but I’m sure everyone reads those long-winded documents.
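One way to soften the fragility a little, sketched here under my own assumptions rather than as part of the original approach, is to try a list of selectors from most to least specific and to notice when none of them match. `findWithFallback` is a hypothetical name, and the selectors in the usage comment are placeholders.

```javascript
// Hypothetical helper: tries each selector in order and returns the first hit,
// or null when nothing matches (a hint that the page layout has changed).
function findWithFallback($root, selectors) {
  for (var i = 0; i < selectors.length; i++) {
    var $match = $root.find(selectors[i]);
    if ($match.length > 0) {
      return { selector: selectors[i], text: $match.first().text() };
    }
  }
  return null;
}

// In the browser, with jQuery loaded (placeholder selectors):
// var hit = findWithFallback($("<div>").html(data.responseText),
//                            ["#headline", ".article-title", "h1"]);
// if (hit === null) { alert("Page layout changed?"); }
```

Returning the selector that matched makes it easier to spot when the scraper has quietly dropped back to a less specific rule.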
I’ve been scraping the web recently using jQuery and JavaScript. It’s a little clunky with the client-side limitations, but it can be a really fast way to get the data out. I usually get past the same origin policy by using the Chrome JavaScript console while I’m on a page of the site in question. I then usually just use their version of jQuery too!