Archive for March 28th, 2008
Using Yahoo Pipes - Google Trends Tokenizer
Millions of searches are done each day on Google, and it is natural that a large number of people all over the world are searching for the same set of keywords. Google started making these search strings public through “Google Trends”, where you can get the most searched-for strings for the day, with the list being updated hourly.
Want to see what is hot today? Visit Google Trends.
While the general public looks at these trends casually, bloggers find them quite useful: they show which topics are hot, and blogging on those subjects brings in a larger number of visitors. To get hourly updates, most such bloggers subscribe to the hourly RSS feed. This feed contains a single post with the 100 search strings within the content body. While most bloggers will find this feed useful, many find it difficult to pick out the search strings that have newly entered Trends. A feed that provided each search string as an individual post would serve the purpose, and such a feed is useful for many other tasks too.
Since Google does not provide such a feed, we will create one using Yahoo Pipes.
If you reached this post in search of an RSS feed containing Google Trends search strings as individual items, then the required feed is here.
If you are here to learn how to do it using Yahoo Pipes, the rest of the post is for you. This post assumes you are familiar with Yahoo Pipes, Sources, Operators etc. The source is available here for you to refer to.
Create a new pipe. Click on “Fetch Feed” under “Sources” to add a module that fetches the feed available at Google Trends. In the module, put http://www.google.com/trends/hottrends/atom/hourly in the space provided for the URL. This is the URL of the hourly RSS feed provided by Google Trends.

If you go through the content in the debugger, you will find that the Top 100 search strings are provided in the content as a list. Under the “String” section there is a “String Tokenizer” module that can break a string into a number of items using a delimiter. If we break the content taking the <li> tag as the delimiter, we get our desired feed.
Since the “String Tokenizer” module requires a string to work upon, we have to embed it within a “Loop” module, available under the “Operators” section. Add the “Loop” module first and then drag the “String Tokenizer” module into it. In the drop-down list against “For each…in input feed” select “item.content.content”. Type “<li>” against “Delimiter”. Link the “Loop” module to the “Fetch Feed” module to receive its output.
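For readers who think better in code, the Loop plus String Tokenizer step can be approximated in Python. This is only a rough sketch: the sample feed item is made up for illustration (it is not real Google Trends output), the function name tokenize_items is hypothetical, and storing the pieces under a "loop:stringtokenizer" key is an assumption that simply mirrors the field name Pipes shows.

```python
# Hypothetical feed: a single item whose content body holds the whole list.
feed = [
    {"content": {"content": "<ol><li>first search</li><li>second search</li></ol>"}}
]

def tokenize_items(items, delimiter="<li>"):
    """Split item.content.content on the delimiter for every item and
    keep the pieces nested under that item (assumed Loop behaviour)."""
    result = []
    for item in items:
        tokens = item["content"]["content"].split(delimiter)
        new_item = dict(item)  # shallow copy, the input list is left untouched
        new_item["loop:stringtokenizer"] = [{"content": t} for t in tokens]
        result.append(new_item)
    return result

looped = tokenize_items(feed)
```

The first piece is whatever precedes the first <li>, so a split like this always produces one leading fragment that is not a search string.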

With this step the content is broken into a number of individual items, all nested under the single item of the original feed. We need to set aside the old feed item and create a new RSS feed based on the items made available by “String Tokenizer”. This can be done using the “Sub-element” module under the “Operators” section. Create one such module and link it to “Loop” to receive the output from there. Select “item.loop:stringtokenizer” in the drop-down list.
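In plain Python terms, the Sub-element step amounts to promoting a nested list to the top level. The sketch below assumes that behaviour; the function name sub_element and the sample data are hypothetical and exist only for illustration.

```python
# Hypothetical output of the Loop step: one item carrying nested tokens.
looped = [
    {
        "title": "Hot Trends",
        "loop:stringtokenizer": [
            {"content": "<ol>"},
            {"content": "first search</li>"},
            {"content": "second search</li></ol>"},
        ],
    }
]

def sub_element(items, key):
    """Return every entry found under `key` as a top-level item,
    dropping the parent item it was nested in."""
    promoted = []
    for item in items:
        promoted.extend(item.get(key, []))
    return promoted

flat = sub_element(looped, "loop:stringtokenizer")
```

After this call, flat holds three separate items, each with only a "content" field.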

Now an RSS feed with a number of individual items is available, but it is not yet ready for use: each item contains only a “content” node and no “title” or “link” node, which are the foremost requirements. Further, the link and the search string sit in the same node, along with unneeded HTML markup.
Let’s move step by step. The “content” node can be cloned into another node named “title”, and “content” can then be renamed to “link”. This facility is available with the “Rename” module under the “Operators” section. Create a “Rename” module and use the “+” button to add one more rule. In both rules select “item.content” from the first drop-down list; in the second drop-down list select “Copy As” for the first rule and “Rename” for the second; then fill the text boxes with “title” and “link” respectively.
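The two Rename rules can be pictured as two dictionary operations. This is a minimal sketch under that assumption; apply_rename_rules is a hypothetical helper and the sample item is invented.

```python
def apply_rename_rules(item):
    """Rule 1: copy "content" as "title". Rule 2: rename "content" to "link"."""
    renamed = dict(item)                      # work on a copy of the item
    renamed["title"] = renamed["content"]     # Copy As -> title
    renamed["link"] = renamed.pop("content")  # Rename  -> link
    return renamed

# Invented sample item; both new fields start out with the same raw markup.
item = {"content": '<span><a href="http://example.com/?sa=X">example</a></span>'}
renamed = apply_rename_rules(item)
```

The order matters in this sketch: the copy has to happen before the rename, because after the rename there is no "content" field left to copy.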

Now instead of a “content” node we will have “title” and “link” nodes.
Look at the content under the “link” and “title” nodes. Both contain text like the following:
<span class="Volcanic"><a rel="nofollow" target="_blank" href="http://www.google.com/trends/hottrends?q=last+dance+lyrics&date=2008-3-27&sa=X">last dance lyrics</a></span>
Our next task is to remove the junk characters and leave only the relevant text in the respective nodes. For this we will use regular expressions, available through the “Regex” module under the “Operators” section.
Create a “Regex” module. Select “item.link” from the drop-down list. Fill in the regex pattern ^<span.*><a.*href="(.*sa=X)">.*</a></span> in the text box against “replace”. This regular expression in fact spells out that the string begins with a <span> tag containing attributes, followed by an anchor containing the link within its href attribute. The part we are interested in is enclosed in brackets “(…)”. Filling in $1 in the text box against “with” instructs the module to replace the complete content with what appears within the brackets. The checkboxes are out of scope here.
Now add an additional rule for filtering out the title. Select “item.title” from the drop-down list. Fill in the regular expression ^<span.*><a.*sa=X">(.*)</a></span> and $1 in the appropriate text boxes. The brackets here enclose the search string text.
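The two patterns can be tried outside Pipes as well. The sketch below uses Python's re module with the same patterns and the sample markup shown earlier; note that Python writes the back-reference as \1 where Pipes uses $1, and that the variable names are only illustrative.

```python
import re

# The same two patterns as in the Regex module rules.
LINK_PATTERN = r'^<span.*><a.*href="(.*sa=X)">.*</a></span>'
TITLE_PATTERN = r'^<span.*><a.*sa=X">(.*)</a></span>'

# Sample markup as it appears in both the "link" and "title" nodes.
raw = ('<span class="Volcanic"><a rel="nofollow" target="_blank" '
       'href="http://www.google.com/trends/hottrends?q=last+dance+lyrics'
       '&date=2008-3-27&sa=X">last dance lyrics</a></span>')

# Replace the whole match with the bracketed group.
link = re.sub(LINK_PATTERN, r"\1", raw)
title = re.sub(TITLE_PATTERN, r"\1", raw)

print(link)
print(title)
```

Both patterns rely on greedy .* matching, which works here because the sample holds exactly one anchor; markup with several anchors in one token would need a stricter pattern.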

This gives us our result, except that the first record is a junk one containing only <ol></ol>. We could have handled this before the “Rename” module but are taking it last (it makes no difference). To filter out this record we use the “Filter” module under the “Operators” section. Add it and set the rule to say “Permit items that match all of the following - item.link contains http://”. This way every record except the odd one passes the filter, giving us an RSS feed with 100 items for the Top 100 search strings.
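The filter rule boils down to a substring test on the link field. A minimal Python sketch, assuming items are plain dictionaries; permit_links and the two sample records are hypothetical.

```python
def permit_links(items, needle="http://"):
    """Keep only the items whose "link" field contains the needle."""
    return [item for item in items if needle in item["link"]]

# Two invented records: the junk one and one cleaned search string.
items = [
    {"title": "<ol></ol>", "link": "<ol></ol>"},
    {
        "title": "last dance lyrics",
        "link": "http://www.google.com/trends/hottrends"
                "?q=last+dance+lyrics&date=2008-3-27&sa=X",
    },
]

kept = permit_links(items)
```

Only the second record survives, since the junk record never received a real URL in its link field.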

Link it to the “Output” module and the pipe is finished.

This and other upcoming pipes can be accessed using this link.
Adult Search Day - Cinderella.com & www.cdidigital.com
It seems all the younger ones are in the parks now and only adults are searching the web through Google. I saw a new site address, Cinderella.com, among the most searched-for sites today. I thought it was some kids’ site, as I have read and heard the Cinderella story since childhood, but when I opened the site I found it to be an adult site, with the home page showing a disclaimer and “Enter” and “Exit” buttons… the same site that opens when you go to www.cdidigital.com (which I thought had something to do with CDs/DVDs etc.).
While Cinderella.com attracts more than 10,000 visitors each day, cdidigital attracts about half that. And looking at Trends, it seems that an even larger number were searching for it on Google.
Jokes apart, this looks like another spike caused by mails sent to users on MySpace and similar sites. Any information?