Archive for March 31st, 2008
Using Yahoo Pipes - Google Trends Scraper
In the last post, Using Yahoo Pipes - Google Trends Tokenizer, we saw how Yahoo Pipes can be used to prepare a Google Trends RSS feed that carries each search string as an individual item, instead of packing them all into a single post as Google does. It works fine, except that someone asked me: what if I want a similar feed for Google Trends on some particular past date? Tricky, since Google provides an RSS feed only for the current trends and not for older dates.
So in this post we will meet that requirement. But where will we get the raw data from? Don't forget that although Google doesn't provide a Google Trends RSS feed for old dates, it does provide the data on an HTML page! If we can parse this page and extract the data, we are done. Fortunately, Yahoo Pipes provides a “Fetch Page” module that can fetch a raw HTML page and even perform operations such as cutting out the content between two given strings or splitting the page content on a delimiter. Let's get started.
If you reached this page looking for the RSS feed, its link is here. Change the date appropriately to get data for the date of your choice. If you are here to learn how to use Yahoo Pipes to get the work done, then read on.
http://google.com/trends/hottrends?sa=X&date=2008-3-19
Above is the URL for the Google Trends page containing data for the date we use as an example here, March 19, 2008. Add a “Text Input” module from the “User Inputs” section so that the user can provide the date. Fill in “Date” and “Date (yyyy-mm-dd):” in the Name and Prompt fields. We will leave Default blank so that when no input is passed, the current trends data is fetched.

Now add a “URL Builder” module from the “URL” section. Fill in “http://google.com/trends/hottrends” in the “Base” field and leave the path element field blank. Add another Query Parameter by clicking the “+” sign at the side. Name the two parameters “date” and “sa”. Fill in “X” as the value for parameter “sa”. The value of “date” has to come from the user input, so drag the output of the “Text Input” module into the value box for parameter “date”.
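For readers who think better in code, here is a rough Python equivalent of what the “Text Input” and “URL Builder” modules do together. The function name build_trends_url is made up for this illustration, and dropping the date parameter when the input is blank is an assumption of this sketch, not something the Pipe is known to do.

```python
from urllib.parse import urlencode

# Base URL and the fixed "sa" parameter come from the post.
BASE_URL = "http://google.com/trends/hottrends"


def build_trends_url(date=""):
    """Build the Hot Trends page URL; date is a string like '2008-3-19'."""
    params = {"sa": "X"}
    if date:
        # Assumption: a blank date is simply left out of the query string.
        params["date"] = date
    return BASE_URL + "?" + urlencode(params)


print(build_trends_url("2008-3-19"))
# prints http://google.com/trends/hottrends?sa=X&date=2008-3-19
```

Calling build_trends_url() with no argument gives the URL for the current trends, which mirrors leaving the Default field blank in the Pipe.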

Add a “Fetch Page” module from the “Sources” section. For the URL field, take the value from the “URL Builder” module. Fill in <table class="Z2_list"> in “Cut content from” and <script type="text/javascript"><!-- against “to”. Why? Because if you look at the source of the Google Trends page, you will find that all the search strings sit between the first occurrence of <table class="Z2_list"> and the first occurrence of <script type="text/javascript"><!--. Fill in <tr><td class=num> for the field “Split using delimiter”, because each search string in the source is preceded by this string.
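The cut-and-split behaviour of “Fetch Page” can be sketched in a few lines of Python. This is only an approximation: the function name cut_and_split is invented here, the sample HTML is a hand-made stand-in for the real page (the second search term is made up), and whether the real module keeps or drops the start marker is an assumption (this sketch drops it).

```python
def cut_and_split(html, start, end, delimiter):
    """Keep the text between the first `start` and the next `end`, then split it."""
    begin = html.find(start)
    if begin == -1:
        return []  # start marker not found: nothing to extract
    begin += len(start)
    stop = html.find(end, begin)
    if stop == -1:
        stop = len(html)  # no end marker: keep everything up to the end
    return html[begin:stop].split(delimiter)


# Hand-made sample standing in for the real Google Trends page.
sample = (
    '<html><table class="Z2_list">\n'
    '<tr><td class=num>1. <a href="http://google.com/trends/hottrends'
    '?q=kelly+pickler&date=2008-3-19&sa=X">kelly pickler</a></td></tr>'
    '<tr><td class=num>2. <a href="http://google.com/trends/hottrends'
    '?q=example+term&date=2008-3-19&sa=X">example term</a></td></tr>'
    '<script type="text/javascript"><!--\n</script></html>'
)

items = cut_and_split(
    sample,
    '<table class="Z2_list">',
    '<script type="text/javascript"><!--',
    "<tr><td class=num>",
)
print(len(items))  # 3: one junk item followed by two search strings
```

Note that splitting on a delimiter that appears before every search string always leaves one extra piece in front of the first delimiter, which is where the junk first item comes from.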

With this module the page is fetched and the content is split into 100+1 individual items; the first one contains junk characters, and we will filter it out at the end. All the extracted text is contained in the “content” node. To add “title” and “link” nodes we will use a “Rename” module from the “Operators” section. Add the module and give it two mappings: one copying “content” to an additional node “title”, and the other renaming the “content” node to “link”.
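In code terms, the two “Rename” mappings amount to the following small transformation. Representing an item as a Python dict and the name rename_item are choices made for this sketch only.

```python
def rename_item(item):
    """Copy 'content' to 'title' and rename 'content' to 'link'."""
    renamed = dict(item)                    # do not modify the caller's item
    renamed["title"] = renamed["content"]   # mapping 1: copy content to title
    renamed["link"] = renamed.pop("content")  # mapping 2: rename content to link
    return renamed


item = {"content": '1. <a href="http://google.com/trends/hottrends?q=kelly+pickler&date=2008-3-19&sa=X">kelly pickler</a>'}
result = rename_item(item)
print(sorted(result))  # ['link', 'title']
```

After this step both nodes hold the same raw HTML fragment; they only diverge once the regex rules are applied.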

At this stage the content of both “title” and “link” looks something like this:
1. <a rel="nofollow" target="_blank" href="http://google.com/trends/hottrends?q=kelly+pickler&date=2008-3-19&sa=X">kelly pickler</a>
Our next task is to extract the link and the search string from this text. For this we will use a “Regex” module from the “Operators” section. Add a rule with item.link for field “In”, ^.*href="(.*sa=X)".* for field “replace” and $1 for field “with”. Check the “s” checkbox. This rule says that the link value, which currently begins with some text, then has href="a link that ends with sa=X", and is then followed by some more text, is to be replaced with the actual link alone, whose position is marked by the parentheses ( ... ). The “s” checkbox is checked because every 25th search string carries some additional lines of text, and “s” (treat string as a single line) tells the regex to include those newline characters while matching.
Add another rule with item.title for field “In”, ^.*href=".*sa=X">(.*)</a>.* for field “replace” and $1 for field “with”. Check the “s” checkbox. This rule says that the title value, which currently begins with some text, then has an opening anchor tag ending in sa=X">, then the search string, then the closing </a>, possibly followed by some more text (remember the extra lines on every 25th search string!), is to be replaced with the search string alone.
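The same two rules can be tried out in Python, which makes it easy to see what the “s” checkbox changes. This is a sketch, not the Pipes engine itself: Python writes the replacement as \1 instead of $1, and re.DOTALL plays the role of the “s” flag. The trailing "extra line" text below is invented to imitate the additional lines mentioned above.

```python
import re

# re.DOTALL lets "." match newlines, like the "s" checkbox in the Regex module.
LINK_RE = re.compile(r'^.*href="(.*sa=X)".*', re.DOTALL)
TITLE_RE = re.compile(r'^.*href=".*sa=X">(.*)</a>.*', re.DOTALL)

raw = (
    '1. <a rel="nofollow" target="_blank" '
    'href="http://google.com/trends/hottrends?q=kelly+pickler&date=2008-3-19&sa=X">'
    'kelly pickler</a>'
)
# Imitation of an item that carries additional lines of text.
raw_with_extra = raw + "\nsome extra line of text"

link = LINK_RE.sub(r"\1", raw_with_extra, count=1)
title = TITLE_RE.sub(r"\1", raw_with_extra, count=1)

print(link)   # http://google.com/trends/hottrends?q=kelly+pickler&date=2008-3-19&sa=X
print(title)  # kelly pickler

# An item with no href (such as the junk first item) is left unchanged.
junk = LINK_RE.sub(r"\1", "\n", count=1)
```

Without re.DOTALL the final .* would stop at the first newline, so the extra lines would survive in the result instead of being swallowed by the match.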

Now everything is OK except that the first record has to be filtered out. Add a “Filter” module from the “Operators” section with a single rule that says “Permit items that match all of the following: item.link matches regex ^http://.*” (in plain words, the link starts with http://).
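The filter rule is a one-liner in Python. As before, items are represented as dicts and the name keep_real_items is hypothetical; the junk item works because its “link” was never rewritten by the regex rule, so it does not start with http://.

```python
import re

LINK_OK = re.compile(r"^http://.*")


def keep_real_items(items):
    """Permit only items whose link starts with http://."""
    return [item for item in items if LINK_OK.match(item["link"])]


items = [
    {"title": "\n", "link": "\n"},  # junk first item, untouched by the regex rules
    {
        "title": "kelly pickler",
        "link": "http://google.com/trends/hottrends?q=kelly+pickler&date=2008-3-19&sa=X",
    },
]
kept = keep_real_items(items)
print(len(kept))  # 1
```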

Join this module to the “Output” module and our Pipe is ready to use.

If you want to look at the source, the link is here. To run the pipe interactively, click here.
For all current and future pipes developed by me, click here.