Jalaj P. Jha Technical & Miscellaneous Ramblings

3Feb/091




YQL the New Module at Yahoo Pipes

Just a couple of days back, Yahoo Pipes added a new source module named YQL (Yahoo Query Language). It lets experts start with filtered input that would have required many extra steps to achieve with the conventional modules. It uses an SQL-like SELECT query to get data. You can get an idea of YQL in Yahoo Pipes from this post on the Yahoo Pipes Blog.

We are diving straight in with a live example of how powerful YQL is and how it can change the way you think and develop. Check my posts Using Yahoo Pipes - Google Trends Tokenizer and Using Yahoo Pipes - Google Trends Scraper, which show how I took input data from Google Trends (an RSS feed in the first one, an HTML page in the second one) and used various operators to get the desired output at the end. That was all done the conventional way, and now we have the power of YQL.

Let’s check how Google Trends Scraper can be developed using it.

First of all we need to know our source. If you need Google Trends for a particular date, the URL for it would be similar to this one:

http://google.com/trends/hottrends?sa=X&date=2008-3-19

To parameterize the pipe so that it accepts a date, let’s add a Date module.

Now add a String Builder module to build the YQL string from three parts:

  • The first part of the string would be
    select * from html where url="http://google.com/trends/hottrends?sa=X&date=
  • The second part takes its value from the Date module
  • And the third part would be
    " and xpath='//td[@class="hotColumn"]/table/tr/td/a'

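As a rough illustration of what the String Builder assembles, here is a minimal Python sketch. The function name build_yql_query is hypothetical and not part of Yahoo Pipes; the two string fragments are the ones given in the list above.

```python
def build_yql_query(date: str) -> str:
    """Join the three String Builder parts into one YQL query string."""
    # Part 1: the SELECT statement up to the date parameter
    prefix = 'select * from html where url="http://google.com/trends/hottrends?sa=X&date='
    # Part 3: closes the url string and adds the xpath filter
    suffix = '" and xpath=\'//td[@class="hotColumn"]/table/tr/td/a\''
    # Part 2 (the date) sits between them
    return prefix + date + suffix


print(build_yql_query("2008-3-19"))
```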
Add a YQL module, which takes its query from the String Builder.

The pipe is nearly finished, with just one drawback. The link text, which needs to be named ‘title’, is available under the name ‘content’, and the link, which should be named ‘link’, is available as ‘href’ with a relative reference, i.e. http://google.com doesn’t appear at the beginning of the link. To make these changes we will add a Regex module with three entries.

The regex entries do the following:

  • Add a ‘link’ node initialized with ‘http://google.com/’
  • Append to the ‘link’ the entire value available in ‘href’
  • Add a ‘title’ node initialized with the value available in ‘content’
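The effect of these three entries can be sketched in Python. This is not how the Regex module works internally: the sketch swaps in urllib.parse.urljoin to resolve the relative ‘href’ against http://google.com/, and the function name normalize_item and the sample values are assumptions made for illustration.

```python
from urllib.parse import urljoin


def normalize_item(item: dict) -> dict:
    """Turn a YQL result item (content/href) into a feed item (title/link)."""
    return {
        # 'title' is initialized from the link text held in 'content'
        "title": item["content"],
        # 'link' is the relative 'href' resolved against the Google host
        "link": urljoin("http://google.com/", item["href"]),
    }


# Hypothetical sample item, shaped like the YQL output described above
sample = {"content": "kelly pickler",
          "href": "/trends/hottrends?q=kelly+pickler&date=2008-3-19&sa=X"}
print(normalize_item(sample))
```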

Connect the output of the Regex module to the Pipe Output module, and that completes our pipe.

The point to note here is that while the old solution required the page to be scraped and all links and texts extracted using various operators and regular expressions, YQL lets us parse the page using XPath, as we have done here, extracting all links that exist in TDs having the class hotColumn.

We will take up more examples later; meanwhile, you can find the source for the above example here.

31Mar/082




Using Yahoo Pipes – Google Trends Scraper

In the last post, Using Yahoo Pipes - Google Trends Tokenizer, we saw how Yahoo Pipes can be used to prepare a Google Trends RSS feed containing search strings as individual items, instead of all of them sitting in a single post as provided by Google. It works fine, except that someone asked me: what if I want a similar feed of Google Trends for some particular past date? Tricky, since Google provides an RSS feed for just the current trends and not for old dates.

So in this post we will meet this requirement. But where will we get the raw data from? I hope you haven't forgotten that although Google doesn't provide a Google Trends RSS feed for old dates, it does provide the data on an HTML page! If we can parse this page to extract the data, we can do it. Fortunately Yahoo Pipes provides a "Fetch Page" module that can fetch a raw HTML page and even perform operations such as cutting out the content between given strings or splitting the page content using a delimiter. Let's get started.

If you reached this page looking for the RSS feed, its link is here. Change the date appropriately to get data for the date of your choice. If you are here to learn how to use Yahoo Pipes to get the work done, then read ahead.

http://google.com/trends/hottrends?sa=X&date=2008-3-19

Above is the URL for the Google Trends page containing data for the date we take as our example, March 19, 2008. Add a "Text Input" module from the "User Inputs" section so that the user can provide the date. Fill "Date" and "Date (yyyy-mm-dd):" in the fields Name and Prompt. We will leave Default blank so that when no input is passed, the current trends data is fetched.

Now add a "URL Builder" module from the "URL" section. Fill "http://google.com/trends/hottrends" in the field "Base" and leave the path element field blank. Add another Query Parameter by clicking on the "+" sign by the side. Name the two parameters "date" and "sa" respectively. Fill "X" as the value for parameter "sa"; the "date" value must come from the user input, so drag the output of the "Text Input" module into the value space for parameter "date".
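For readers who think in code, the "Text Input" plus "URL Builder" combination behaves roughly like the Python sketch below. The function name build_trends_url is hypothetical, and dropping the date parameter when the input is blank is an assumption about how an empty value is handled, not something confirmed by the Pipes documentation.

```python
from urllib.parse import urlencode

BASE = "http://google.com/trends/hottrends"


def build_trends_url(date: str = "") -> str:
    """Build the Google Trends page URL; a blank date means current trends."""
    params = {"sa": "X"}
    if date:
        # Assumption: an empty date is simply left out of the query string
        params["date"] = date
    return BASE + "?" + urlencode(params)


print(build_trends_url("2008-3-19"))
```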

Add a "Fetch Page" module from the "Sources" section. For the URL field, take the value from the "URL Builder" module. Fill <table class="Z2_list"> in "Cut content from" and <script type="text/javascript"><!-- against "to". Why? Because if you look into the source of the Google Trends page, you will find that all the search strings sit between the first occurrences of <table class="Z2_list"> and <script type="text/javascript"><!--. Fill <tr><td class=num> in the field "Split using delimiter", needless to say because each search string in the source has this string at its beginning.
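The cut-and-split behaviour described above can be approximated in plain Python. This is a sketch under assumptions: the function cut_and_split is hypothetical, it excludes the start marker from the kept region (the module's exact handling is not stated in this post), and the tiny sample page is made up.

```python
def cut_and_split(html: str, cut_from: str, cut_to: str, delimiter: str) -> list:
    """Keep the text between the first cut_from and the next cut_to, then split it."""
    start = html.find(cut_from)
    if start == -1:
        return []                      # start marker not found: nothing to keep
    start += len(cut_from)             # assumption: the marker itself is dropped
    end = html.find(cut_to, start)
    if end == -1:
        end = len(html)                # no end marker: keep the rest of the page
    return html[start:end].split(delimiter)


# Made-up page with two entries, mimicking the structure described above
page = ('head<table class="Z2_list">'
        '<tr><td class=num>1. <a href="x">a</a>'
        '<tr><td class=num>2. <a href="y">b</a>'
        '<script type="text/javascript"><!-- js')
items = cut_and_split(page, '<table class="Z2_list">',
                      '<script type="text/javascript"><!--',
                      '<tr><td class=num>')
print(items)
```

Because the region starts with the delimiter, the first item is a leftover fragment, which corresponds to the extra junk item that gets filtered off later.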

With this module the page is fetched and the content split into 100+1 individual items, the first one being a junk item that we will filter off at the end. All the extracted text is contained in the "content" node. To add "title" and "link" nodes we will use a "Rename" module from the "Operators" section. Add the module and give it two mappings, one copying "content" to an additional node "title" and the other renaming the "content" node to "link".

At this stage the content of each "title" and "link" node looks something like this:

1. <a rel="nofollow" target="_blank" href="http://google.com/trends/hottrends?q=kelly+pickler&date=2008-3-19&sa=X">kelly pickler</a>

Our next task is to extract the link and the search string from within these. For this we will use a "Regex" module from the "Operators" section. Add a rule with item.link for field "In", ^.*href="(.*sa=X)".* for field "replace" and $1 for field "with". Check the checkbox before "s". This rule says that the text, which currently begins with some content, then has href="some link that ends with sa=X", then is followed by some more text, is to be replaced with the actual link, whose position is marked by the brackets (...). The checkbox "s" is checked because every 25th search string carries some additional lines of text, and "s" (treat string as a single line) makes the pattern include those characters too while matching.

Add another rule with item.title for field "In", ^.*href=".*sa=X">(.*)</a>.* for field "replace" and $1 for field "with". Check the checkbox before "s". This rule says that the text begins with some content, then an opening anchor tag ending in sa=X">, followed by the search string, which ends at the closing </a> tag; some more text may follow (remember the extra lines on every 25th search string!).
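The two rules translate almost directly to Python's re module, where the "s" checkbox corresponds to the re.DOTALL flag and $1 is written \1. This is only a sketch of the same patterns outside Yahoo Pipes; the function name apply_rules is hypothetical.

```python
import re

# Same patterns as the two Regex rules; re.DOTALL plays the role of the "s" checkbox
LINK_RULE = re.compile(r'^.*href="(.*sa=X)".*', re.DOTALL)
TITLE_RULE = re.compile(r'^.*href=".*sa=X">(.*)</a>.*', re.DOTALL)


def apply_rules(item: dict) -> dict:
    """Replace the raw markup in 'link' and 'title' with the captured group."""
    return {
        "link": LINK_RULE.sub(r"\1", item["link"]),
        "title": TITLE_RULE.sub(r"\1", item["title"]),
    }


raw = ('1. <a rel="nofollow" target="_blank" '
       'href="http://google.com/trends/hottrends?q=kelly+pickler&date=2008-3-19&sa=X">'
       'kelly pickler</a>')
print(apply_rules({"link": raw, "title": raw}))
```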

Now everything is OK except that the first record has to be filtered off. Add a "Filter" module from the "Operators" section and add a single rule that says "Permit items that match all of the following - item.link matches regex ^http://.*" (in plain words, the link starts with http://).
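In Python terms, the filter rule amounts to keeping only the items whose link matches the pattern. A minimal sketch, with a hypothetical function name and made-up sample items:

```python
import re


def permit_http_links(items: list) -> list:
    """Keep only items whose 'link' starts with http://, as the Filter rule does."""
    return [item for item in items
            if re.search(r"^http://.*", item.get("link", ""))]


# Made-up items: the first stands in for the junk record
items = [{"link": "", "title": ""},
         {"link": "http://google.com/trends/hottrends?q=a&sa=X", "title": "a"}]
print(permit_http_links(items))
```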

Join this module to the "Output" module and our pipe is ready to use.

If you want to access the source, the link is here. To execute the pipe interactively, click here.

For all current and future pipes developed by me, click here.

28Mar/080




Using Yahoo Pipes – Google Trends Tokenizer

Millions of searches are done each day on Google, and it's natural that a large number of people all over the world are searching for the same set of keywords. Google started making these search strings public through "Google Trends", where you can get the most searched-for strings of the day, the list being updated hourly.

Want to see what is hot today? Visit Google Trends.

While the general public looks at these trends casually, bloggers find them quite useful: this way they get to know the topics that are hot, and blogging on those subjects brings in a larger number of visitors. To get the hourly update, most such bloggers subscribe to the hourly RSS feed. This RSS feed contains a single post containing the 100 search strings within the content body. While most bloggers will find this feed useful, many find it difficult to pick out the search strings that have newly entered Trends. A feed that provided each search string as an individual post would serve the purpose, and such a feed is useful for many other tasks too.

Since Google does not make such a feed available, we will create one using Yahoo Pipes.

If you reached this post in search of an RSS feed containing Google Trends data / search strings as individual items, then the required feed is here.

If you are here to learn how to do it using Yahoo Pipes, the rest of the post is for you. This post assumes your familiarity with Yahoo Pipes, Sources, Operators etc. The source is available here for you to refer to.

Create a new pipe. Click on "Fetch Feed" under Sources to add a module to fetch the feed that is available at Google Trends. In the module, put http://www.google.com/trends/hottrends/atom/hourly in the space provided for the URL. This is the URL of the hourly RSS feed provided by Google Trends.

If you go through the content in the debugger, you will find that the top 100 search strings are provided in the content as a list. Under the "String" section there is a module "String Tokenizer" that can break a string into a number of items using a delimiter. If we can break the content taking the <li> tag as the delimiter, we can get our desired feed.

Since this "String Tokenizer" module requires a string to work upon, we will have to embed it within a "Loop" module, available under the "Operators" section. Add the "Loop" module first and then drag the "String Tokenizer" module into it. In the drop-down list against "For each...in input feed" select "item.content.content". Type "<li>" against "Delimiter". Link the module to the "Fetch Feed" module to receive its output.

With this step the content is broken into a number of individual items under the single item available in the original feed. We need to put aside the old feed and create a new RSS feed based on the items made available by "String Tokenizer". This can be done using the "Sub-element" module under the "Operators" section. Create one such module and link it to "Loop" to receive the output from there. Select "item.loop:stringtokenizer" in the dropdown.
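The combined effect of "Loop", "String Tokenizer" and "Sub-element" is roughly a split followed by a flatten. The Python sketch below illustrates the idea only; the function name tokenize_feed, the dict layout and the sample content are assumptions, not the real feed structure.

```python
def tokenize_feed(feed_items: list, delimiter: str = "<li>") -> list:
    """Split each item's content on the delimiter and flatten into new items."""
    tokens = []
    for item in feed_items:                       # the "Loop" over input items
        for piece in item["content"].split(delimiter):   # the "String Tokenizer"
            tokens.append({"content": piece})     # "Sub-element": promote each piece
    return tokens


# Made-up feed with a single item holding a short list
feed = [{"content": "<ol><li>first</li><li>second</li></ol>"}]
print(tokenize_feed(feed))
```

Note that the text before the first delimiter becomes an item of its own, which is why a junk record shows up at the top of the result.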

Now an RSS feed with a number of individual items is available, but it is not yet ready for use, as it contains only a "content" node and no "title" and "link" nodes, which are the foremost requirement. Further, the link and the search string are available in the same node, along with unneeded HTML markup.

Let's move step by step. The content node can be cloned to make another node named "title", and "content" can then be renamed to "link". This facility is available with the "Rename" module under the "Operators" section. Create a "Rename" module and use the "+" button to add one more rule. Select "item.content" in the first drop-down list of both rules, select "Copy As" and "Rename" respectively in the second drop-down lists, and fill the text boxes with "title" and "link" respectively.

Now instead of a "content" node we will have "title" and "link" nodes.
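Expressed in Python, the two Rename rules come down to one copy and one rename on each item. A small sketch with a hypothetical function name:

```python
def rename_content(item: dict) -> dict:
    """Apply the two Rename rules: copy content as title, rename content to link."""
    renamed = dict(item)                       # leave the input item untouched
    renamed["title"] = renamed["content"]      # rule 1: "Copy As" title
    renamed["link"] = renamed.pop("content")   # rule 2: "Rename" to link
    return renamed


print(rename_content({"content": "<span>raw markup</span>"}))
```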

Look at the content under the "link" and "title" nodes. They contain text something like this:

<span class="Volcanic"><a rel="nofollow" target="_blank" href="http://www.google.com/trends/hottrends?q=last+dance+lyrics&date=2008-3-27&sa=X">last dance lyrics</a></span>

Our next task is to remove the junk characters and fill in the relevant text in the respective nodes. For this we will take the help of regular expressions, which are available with the "Regex" module under the "Operators" section.

Create a Regex module. Select "item.link" from the drop-down list. Fill the regex pattern ^<span.*><a.*href="(.*sa=X)">.*</a></span> in the text box against "replace". This regular expression in fact spells out that the string begins with a <span> tag containing attributes, then has an anchor containing the link within its href attribute. The part we are interested in is enclosed in brackets "(...)". Filling in $1 in the text box against "with" instructs the module to replace the complete content with what appears within the brackets. The checkboxes are out of scope here.

Now add an additional rule for extracting the title. Select "item.title" from the drop-down list. Fill in the regular expression ^<span.*><a.*sa=X">(.*)</a></span> and $1 in the appropriate text boxes. The brackets here enclose the search string text.
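To see the two patterns at work outside Yahoo Pipes, here is a short Python sketch that applies them to the sample markup shown earlier. In Python the $1 replacement is written \1; the function name clean_item is hypothetical.

```python
import re

# The two patterns from the Regex module rules, unchanged
LINK_PATTERN = r'^<span.*><a.*href="(.*sa=X)">.*</a></span>'
TITLE_PATTERN = r'^<span.*><a.*sa=X">(.*)</a></span>'


def clean_item(item: dict) -> dict:
    """Replace the markup in 'link' and 'title' with the bracketed group."""
    return {
        "link": re.sub(LINK_PATTERN, r"\1", item["link"]),
        "title": re.sub(TITLE_PATTERN, r"\1", item["title"]),
    }


raw = ('<span class="Volcanic"><a rel="nofollow" target="_blank" '
       'href="http://www.google.com/trends/hottrends?q=last+dance+lyrics&date=2008-3-27&sa=X">'
       'last dance lyrics</a></span>')
print(clean_item({"link": raw, "title": raw}))
```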

This gives us our result, except that the first record is a junk one, containing <ol></ol>. We could have handled this before the "Rename" module but are taking it last (it makes no difference). To filter off this record we will use the "Filter" module under the "Operators" section. Add it and set the rule to say "Permit items that match all of the following - item.link contains http://". This way all records except this odd one pass the filter, giving us an RSS feed with 100 items for the top 100 search strings.

Link it to the "Output" module and the pipe is finished.

This and other upcoming pipes can be accessed using this link.