<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Reflections &#187; Reference</title>
	<atom:link href="http://reflections.irythia.com/category/reference/feed/" rel="self" type="application/rss+xml" />
	<link>http://reflections.irythia.com</link>
	<description>The ramblings and ravings of a ... what?</description>
	<lastBuildDate>Sat, 01 Jan 2011 06:40:53 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Building a Web Scraper Using JavaScript and jQuery</title>
		<link>http://reflections.irythia.com/2010/01/25/building-a-web-scraper-using-javascript-and-jquery/</link>
		<comments>http://reflections.irythia.com/2010/01/25/building-a-web-scraper-using-javascript-and-jquery/#comments</comments>
		<pubDate>Mon, 25 Jan 2010 08:38:43 +0000</pubDate>
		<dc:creator>Illianthe</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Reference]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[jQuery]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[YQL]]></category>

		<guid isPermaLink="false">http://reflections.irythia.com/?p=211</guid>
		<description><![CDATA[When building web applications, sometimes there&#8217;s a need to fetch data from other sources. Perhaps you&#8217;re building a custom RSS feed of news items based on your interests or you want to aggregate data from several sites. In any case, it&#8217;s not always possible to do this elegantly; you may not have direct access to [...]]]></description>
			<content:encoded><![CDATA[<p>When building web applications, sometimes there&#8217;s a need to fetch data from other sources. Perhaps you&#8217;re building a custom RSS feed of news items based on your interests, or you want to aggregate data from several sites. In any case, it&#8217;s not always possible to do this elegantly; you may not have direct access to the raw data, and the site may not offer an API. For these situations there&#8217;s one general (albeit fragile) solution: fetch the page as a browser would and parse the resulting HTML yourself.</p>
<p>There are many different ways to build a web scraper. A server-side language such as PHP has a much easier time, since it faces fewer restrictions and has existing libraries (such as cURL) for the task. I&#8217;ll be using JavaScript (with the jQuery library), however, as I want a standalone client independent of server technology.</p>
<p>Let&#8217;s start with a random snippet of code:</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #003366; font-weight: bold;">function</span> fetchPage<span style="color: #009900;">&#40;</span>url<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    $.<span style="color: #660066;">ajax</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#123;</span>
        type<span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;GET&quot;</span><span style="color: #339933;">,</span>
        url<span style="color: #339933;">:</span> url<span style="color: #339933;">,</span>
        error<span style="color: #339933;">:</span> <span style="color: #003366; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>request<span style="color: #339933;">,</span> <span style="color: #000066;">status</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            <span style="color: #000066;">alert</span><span style="color: #009900;">&#40;</span><span style="color: #3366CC;">'Error fetching '</span> <span style="color: #339933;">+</span> url<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span>
        success<span style="color: #339933;">:</span> <span style="color: #003366; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>data<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            parse<span style="color: #009900;">&#40;</span>data.<span style="color: #660066;">responseText</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>The above function attempts to make an AJAX call to a specified web page, fetch the HTML of said page and pass it to a parsing function. Simple enough, right?</p>
<p>Unfortunately there&#8217;s a little problem with this. Scripts running in a browser have to deal with something called the <a href="http://en.wikipedia.org/wiki/Same_origin_policy">same origin policy</a>, which basically stops a script from reading responses from any origin (scheme, host and port) other than the one its page was loaded from. From a security standpoint this is obviously a good thing; however, it can be a headache for web app developers (indeed, there are proposals such as the Origin HTTP header to relax it in a controlled way). (The following bit can be ignored if the scraper is on the same domain as the data source &#8211; but in such a case why is said scraper even necessary?)</p>
<p>In this case, there are a couple of solutions (that I can think of). Both of them take advantage of the fact that JavaScript can load <a href="http://en.wikipedia.org/wiki/Json">JSON</a> from a different domain when it is served as a script (the technique usually called JSONP).</p>
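<p>To make &#8220;same origin&#8221; concrete, here is a minimal sketch of the comparison involved. The helper name isSameOrigin() is made up for illustration and isn&#8217;t part of jQuery; it assumes an environment that provides the standard URL class (modern browsers, Node.js).</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="javascript" style="font-family:monospace;">// Hypothetical helper: two URLs share an origin only when
// scheme, host and port all match.
function isSameOrigin(a, b) {
    return new URL(a).origin === new URL(b).origin;
}

console.log(isSameOrigin("http://example.com/a", "http://example.com/b"));   // true
console.log(isSameOrigin("http://example.com/", "https://example.com/"));    // false (scheme differs)
console.log(isSameOrigin("http://example.com/", "http://www.example.com/")); // false (host differs)</pre></td></tr></table></div>
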
<ul style="text-align: left;">
<li>The first one is writing a complementary server-side script that takes a few arguments (such as the target URL), makes the call, converts the result to JSON and passes it back to the calling function. It&#8217;s a simple idea, but I&#8217;ve never really liked it because it introduces a major dependency (which, in my case, can be a pain to deal with). However, it&#8217;s definitely a viable solution which works extremely well.</li>
<li>The second is using an existing system such as Yahoo&#8217;s YQL to fetch the required data and return it in a structured form. This is the method that I&#8217;ll be using.</li>
</ul>
<p>YQL is an interesting little beast. From <a href="http://developer.yahoo.com/yql/">their site</a>:</p>
<blockquote><p>The Yahoo! Query Language is an expressive SQL-like language that lets you query, filter, and join data across Web services. With YQL, apps run faster with fewer lines of code and a smaller network footprint.</p></blockquote>
<p>I haven&#8217;t had too much time to look into it so I did what all programmers do: Google the living crap out of a problem to find a solution. <img src='http://www.irythia.com/portal/reflections/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' />  In this case, <a href="http://www.wait-till-i.com/2010/01/10/loading-external-content-with-ajax-using-jquery-and-yql/">Chris Heilmann has a great post</a> on how to load external content via various methods, including YQL. To simplify things, James Padolsey <a href="http://github.com/jamespadolsey/jQuery-Plugins/tree/master/cross-domain-ajax/">wrote a plugin</a> that detects an external AJAX call and passes it to YQL automatically.</p>
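<p>For reference, the kind of request that ends up going to YQL can be sketched roughly as follows. This is only an illustration under a few assumptions: the endpoint and the &#8220;html&#8221; table are the public YQL ones as I understand them from Chris Heilmann&#8217;s post, the helper name buildYqlUrl() is hypothetical, and the standard URLSearchParams class is assumed to be available. It only builds the URL; it doesn&#8217;t send anything.</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="javascript" style="font-family:monospace;">// Hypothetical helper: build a YQL request URL asking for the HTML of targetUrl.
// Assumes targetUrl contains no double quotes (they would break the YQL string).
function buildYqlUrl(targetUrl) {
    var query = 'select * from html where url="' + targetUrl + '"';
    // URLSearchParams takes care of percent-encoding the query.
    var params = new URLSearchParams({ q: query, format: "xml" });
    return "http://query.yahooapis.com/v1/public/yql?" + params.toString();
}

var yqlUrl = buildYqlUrl("http://example.com/");
// Decoding the q parameter again gives back the original query.
console.log(new URL(yqlUrl).searchParams.get("q"));</pre></td></tr></table></div>
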
<p>Anyway, that solves the problem of fetching data from an external source. All that&#8217;s left is extracting the relevant pieces for whatever application is being built. Consider the following example:</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #003366; font-weight: bold;">function</span> parse<span style="color: #009900;">&#40;</span>data<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000066;">alert</span><span style="color: #009900;">&#40;</span>$<span style="color: #009900;">&#40;</span>data<span style="color: #009900;">&#41;</span>.<span style="color: #660066;">find</span><span style="color: #009900;">&#40;</span><span style="color: #3366CC;">&quot;h1&quot;</span><span style="color: #009900;">&#41;</span>.<span style="color: #660066;">first</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #660066;">text</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>The parse() function takes the responseText data from the earlier fetchPage() function and then proceeds to pick at it slowly and painfully. What&#8217;s really cool about it is that pretty much all of jQuery&#8217;s selectors can be used to select relevant data. In the above case, I&#8217;m extracting the text inside the first &lt;h1&gt; tag found on a page and showing it in a browser alert. Obviously there are more complex uses for this but they are outside the scope of this post. <img src='http://www.irythia.com/portal/reflections/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' /> </p>
<p>And&#8230;that&#8217;s it! <img src='http://www.irythia.com/portal/reflections/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' />  Those two code blocks combined pretty much handle all of the grunt-work in extracting data from external sources. Again, this is a rather fragile solution &#8211; it can break if the target page&#8217;s HTML changes (one way to soften this is to select on unique ids or classes rather than on page structure). In addition, some sites may have restrictions on their data &#8211; but I&#8217;m sure everyone reads those <a href="http://en.wikipedia.org/wiki/Terms_of_service">long-winded documents</a>. <img src='http://www.irythia.com/portal/reflections/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://reflections.irythia.com/2010/01/25/building-a-web-scraper-using-javascript-and-jquery/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The &#8220;A&#8221; in &#8220;AJAX&#8221;</title>
		<link>http://reflections.irythia.com/2010/01/16/the-a-in-ajax/</link>
		<comments>http://reflections.irythia.com/2010/01/16/the-a-in-ajax/#comments</comments>
		<pubDate>Sun, 17 Jan 2010 00:49:39 +0000</pubDate>
		<dc:creator>Illianthe</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Reference]]></category>
		<category><![CDATA[AJAX]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[jQuery]]></category>

		<guid isPermaLink="false">http://reflections.irythia.com/?p=201</guid>
		<description><![CDATA[Consider the following code snippet using JavaScript (with the jQuery library): function blah&#40;&#41; &#123; var pagedata; $.ajax&#40;&#123; type: &#34;GET&#34;, url: &#34;test.php&#34;, error: function&#40;request, error&#41; &#123; alert&#40;&#34;Error: &#34; + error&#41;; &#125;, success: function&#40;data&#41; &#123; pagedata = data; &#125; &#125;&#41;; return pagedata; &#125; [...]]]></description>
			<content:encoded><![CDATA[<p>Consider the following code snippet using JavaScript (with the jQuery library):</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #003366; font-weight: bold;">function</span> blah<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #003366; font-weight: bold;">var</span> pagedata<span style="color: #339933;">;</span>
    $.<span style="color: #660066;">ajax</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#123;</span>
        type<span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;GET&quot;</span><span style="color: #339933;">,</span>
        url<span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;test.php&quot;</span><span style="color: #339933;">,</span>
        error<span style="color: #339933;">:</span> <span style="color: #003366; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>request<span style="color: #339933;">,</span> error<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            <span style="color: #000066;">alert</span><span style="color: #009900;">&#40;</span><span style="color: #3366CC;">&quot;Error: &quot;</span> <span style="color: #339933;">+</span> error<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span>
        success<span style="color: #339933;">:</span> <span style="color: #003366; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>data<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            pagedata <span style="color: #339933;">=</span> data<span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000066; font-weight: bold;">return</span> pagedata<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>Looks reasonable enough. The purpose of the function is to make a GET request to some page (&#8220;test.php&#8221; in this case), store the result in a temporary variable and return said variable (of type string).</p>
<p>However, it turns out that this doesn&#8217;t work too well. When the function is called, chances are pretty good that the returned value is undefined rather than the contents of the page. Why? At first I thought it was because of some weird variable scoping rules in JavaScript, but that didn&#8217;t really make too much sense.</p>
<p>It turns out that I forgot about the &#8220;A&#8221; in &#8220;AJAX&#8221; (i.e. <strong>Asynchronous</strong> JavaScript and XML). The function above doesn&#8217;t execute in a top-to-bottom manner: $.ajax() only starts the request and returns immediately, and the success callback runs later, once the response arrives. By then blah() has already returned. Generally this isn&#8217;t too much of a problem (indeed, it&#8217;s a property valued by those making web applications) but in this case it isn&#8217;t what I want.</p>
<p>Anyway, there&#8217;s a simple solution to this. What I want to do is make the call synchronous with the rest of the script. Fortunately, jQuery makes this extremely easy in the $.ajax() method by providing an option (async: false) to turn this off. Thus:</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="javascript" style="font-family:monospace;"><span style="color: #003366; font-weight: bold;">function</span> blah<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #003366; font-weight: bold;">var</span> pagedata<span style="color: #339933;">;</span>
    $.<span style="color: #660066;">ajax</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#123;</span>
        async<span style="color: #339933;">:</span> <span style="color: #003366; font-weight: bold;">false</span><span style="color: #339933;">,</span>
        type<span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;GET&quot;</span><span style="color: #339933;">,</span>
        url<span style="color: #339933;">:</span> <span style="color: #3366CC;">&quot;test.php&quot;</span><span style="color: #339933;">,</span>
        error<span style="color: #339933;">:</span> <span style="color: #003366; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>request<span style="color: #339933;">,</span> error<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            <span style="color: #000066;">alert</span><span style="color: #009900;">&#40;</span><span style="color: #3366CC;">&quot;Error: &quot;</span> <span style="color: #339933;">+</span> error<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span>
        success<span style="color: #339933;">:</span> <span style="color: #003366; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>data<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            pagedata <span style="color: #339933;">=</span> data<span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000066; font-weight: bold;">return</span> pagedata<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>This locks up the browser while the call is being made, so it&#8217;s probably not the best solution. It shouldn&#8217;t matter too much with small requests though (or requests made with the user&#8217;s consent via the UI or something).</p>
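<p>An alternative that keeps the request asynchronous is to hand the data to a callback instead of returning it. The sketch below shows the idea without needing a server: fakeAjax() is a made-up stand-in for $.ajax() that delivers its data on a later tick via setTimeout, and blahBroken() and blahWithCallback() are hypothetical names.</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="javascript" style="font-family:monospace;">// Hypothetical stand-in for $.ajax(): calls success(data) later, never synchronously.
function fakeAjax(options) {
    setTimeout(function () {
        options.success("page contents");
    }, 0);
}

// Same shape as the first blah() above: returns before the callback has run.
function blahBroken() {
    var pagedata;
    fakeAjax({ success: function (data) { pagedata = data; } });
    return pagedata; // still undefined at this point
}

// Callback version: pass the data on once it has actually arrived.
function blahWithCallback(done) {
    fakeAjax({ success: function (data) { done(data); } });
}

console.log(blahBroken()); // undefined

blahWithCallback(function (pagedata) {
    console.log(pagedata); // logs: page contents
});</pre></td></tr></table></div>

<p>With the real $.ajax() the shape is the same: whatever needs the data goes inside (or is called from) the success callback, and the browser stays responsive in the meantime.</p>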
]]></content:encoded>
			<wfw:commentRss>http://reflections.irythia.com/2010/01/16/the-a-in-ajax/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to move the site root to a subdirectory on cPanel-based web hosts</title>
		<link>http://reflections.irythia.com/2009/11/15/how-to-move-the-site-root-to-a-subdirectory-on-cpanel-based-web-hosts/</link>
		<comments>http://reflections.irythia.com/2009/11/15/how-to-move-the-site-root-to-a-subdirectory-on-cpanel-based-web-hosts/#comments</comments>
		<pubDate>Sun, 15 Nov 2009 23:05:09 +0000</pubDate>
		<dc:creator>Illianthe</dc:creator>
				<category><![CDATA[Reference]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[.htaccess]]></category>
		<category><![CDATA[Apache]]></category>
		<category><![CDATA[mod_rewrite]]></category>

		<guid isPermaLink="false">http://reflections.irythia.com/?p=87</guid>
		<description><![CDATA[I&#8217;m going to start off by saying that I&#8217;m rather picky about how files and folders are organized on a computer. For my own site structure, I&#8217;ve placed each subdomain into its own directory. Many hosts using cPanel (like BlueHost and HostMonster) allow you to set the directory where the subdomain&#8217;s files can be placed. [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m going to start off by saying that I&#8217;m rather picky about how files and folders are organized on a computer. For my own site structure, I&#8217;ve placed each subdomain into its own directory. Many hosts using <a href="http://en.wikipedia.org/wiki/CPanel">cPanel</a> (like BlueHost and HostMonster) allow you to set the directory where the subdomain&#8217;s files can be placed. However, it is not immediately obvious how you can accomplish the same with the root domain. By default it is mapped to the /www (/public_html) folder and there are no options in cPanel that allow you to change this. Good thing there&#8217;s an alternative solution. <img src='http://www.irythia.com/portal/reflections/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>For this to work, you must be running Apache as your web server (I suppose IIS should have a similar method, but eh&#8230;) and have mod_rewrite enabled. We&#8217;ll be creating (or adding to) a <a href="http://en.wikipedia.org/wiki/Htaccess">.htaccess</a> file as follows:</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
</pre></td><td class="code"><pre class="text" style="font-family:monospace;">RewriteEngine On
RewriteCond %{HTTP_HOST} ^(www\.)?domain\.com$
RewriteCond %{REQUEST_URI} !^/subdir/
RewriteRule ^(.*)$ /subdir/$1</pre></td></tr></table></div>

<p>The first line turns the runtime rewriting engine on to allow for modifications. The second line determines whether the request came from &#8220;www.domain.com&#8221; or &#8220;domain.com&#8221;; this is to prevent rewriting the URLs of any subdomains you might have. The third line checks whether the request already points into the subdirectory. If not, the rewriting rule on the fourth line is executed to make it so.</p>
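<p>The logic of those rules can be mimicked in a few lines of JavaScript, which makes it easy to check what happens to a given request. This is only a model of the conditions described above, not how Apache is implemented, and rewritePath() is a hypothetical name.</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="javascript" style="font-family:monospace;">// Hypothetical model of the .htaccess rules above (not Apache's implementation).
function rewritePath(host, path) {
    // RewriteCond %{HTTP_HOST}: only touch the root domain, with or without "www."
    if (!/^(www\.)?domain\.com$/.test(host)) {
        return path;
    }
    // RewriteCond %{REQUEST_URI}: skip requests already inside the subdirectory
    if (/^\/subdir\//.test(path)) {
        return path;
    }
    // RewriteRule: prepend the subdirectory
    return "/subdir" + path;
}

console.log(rewritePath("domain.com", "/index.html"));       // /subdir/index.html
console.log(rewritePath("www.domain.com", "/subdir/a.css")); // /subdir/a.css
console.log(rewritePath("blog.domain.com", "/index.html"));  // /index.html</pre></td></tr></table></div>
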
<p>The above code is sufficient if all you want is to move files from &#8220;/www/&#8221; to a subdirectory &#8220;/www/subdir/&#8221;. However, Apache is quirky in that it requires a trailing slash at the end of directory names (to specify that it is indeed a directory). Normally this isn&#8217;t a problem since there is a module (mod_dir) that automagically redirects a path without a trailing slash to one that does. However, it conflicts with the above code in this case: first a request is directed to a subdirectory and then mod_dir issues another redirect, exposing the subdirectory name. For example, a request for &#8220;http://irythia.com/somefolderhere&#8221; will become &#8220;http://irythia.com/subdir/somefolderhere/&#8221;.</p>
<p>Note: mod_dir only executes if there isn&#8217;t a trailing slash on directory names, so if you qualify URLs with one the rules above work as expected. That is, &#8220;http://irythia.com/somefolderhere/&#8221; won&#8217;t change.</p>
<p>To fix this problem, we add another rule to append a trailing slash on all URLs ending in directory names.</p>

<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
3
4
5
6
7
</pre></td><td class="code"><pre class="text" style="font-family:monospace;">RewriteEngine On
RewriteCond %{HTTP_HOST} ^(www\.)?domain\.com$
RewriteCond %{REQUEST_URI} !(/|\.[^/]*)$
RewriteRule (.*) http://www.domain.com/$1/ [L,R=301]
RewriteCond %{HTTP_HOST} ^(www\.)?domain\.com$
RewriteCond %{REQUEST_URI} !^/subdir/
RewriteRule ^(.*)$ /subdir/$1</pre></td></tr></table></div>

<p>Note the additions in lines 2-4. Again, line 2 checks to see whether the request came from &#8220;www.domain.com&#8221; or &#8220;domain.com&#8221;. Line 3 checks to see if the URI ends in a directory by filtering out anything that ends in a file extension (i.e. .php, .html, .js, .css, etc.). It also checks that there isn&#8217;t already a trailing slash. Finally, line 4 rewrites the URL. The [L,R=301] flags tell Apache to issue a permanent redirect and not to execute any more rules.</p>
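<p>To see which paths the condition on line 3 lets through, the same pattern can be tried out in JavaScript (its regular expression syntax agrees with Apache&#8217;s for this particular pattern). needsTrailingSlash() is a hypothetical name used only for illustration.</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="javascript" style="font-family:monospace;">// Hypothetical helper: true when the rule would redirect to add a trailing slash,
// i.e. the path neither ends in "/" nor ends in something that looks like ".ext".
function needsTrailingSlash(path) {
    return !/(\/|\.[^\/]*)$/.test(path);
}

console.log(needsTrailingSlash("/somefolderhere"));  // true
console.log(needsTrailingSlash("/somefolderhere/")); // false (already has the slash)
console.log(needsTrailingSlash("/index.php"));       // false (looks like a file)
console.log(needsTrailingSlash("/archive.2009"));    // false (directory mistaken for a file)</pre></td></tr></table></div>

<p>The last case shows a limitation of the approach: a directory whose name contains a dot is treated as a file, so it never gets the extra slash from this rule.</p>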
<p>And that&#8217;s that. Upload the .htaccess file to the root web directory (/www or /public_html) and you&#8217;re good to go.</p>
<p>Some final notes: this is essentially a hack to work around limitations imposed by many web hosts; there are probably better methods if you have direct access to the server. Since it <em>is</em> a hack, it might not work on all sites without modification; this post is pretty much a reference for myself. As well, there is likely some performance penalty (as minor as it is) because the .htaccess file rewrites all requests that hit the domain. Of course, if you were <em>that</em> worried about performance, you wouldn&#8217;t be running on a shared host anyway. <img src='http://www.irythia.com/portal/reflections/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://reflections.irythia.com/2009/11/15/how-to-move-the-site-root-to-a-subdirectory-on-cpanel-based-web-hosts/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

