Bhomiyo.Com – Indic Transliteration on the Fly
Update: The title of this post has been changed from "A Virus on Loose" to "Indic Transliteration on the Fly" after the developer made the necessary modifications to remove the site's unexpected spreading behavior.
Bhomiyo.com is a site in the service of Indian languages, providing script conversion between various Indian languages. Using it, you can see a web page originally written in Devanagari script (Hindi) rendered in Gujarati script or any other Indian script, and vice versa. It's a useful site indeed, except that it's performing certain tasks that its developers might not have expected.
I was surfing the web when I noticed content from my blog, crawled by Google, showing up as a page on Bhomiyo.com with the URL below.
http://bhomiyo.com/ml.xliterate/jalaj.net
When I opened the page, I found my blog page shown exactly as it exists on my blog, except that its links had also been rewritten to follow the pattern above. And the result... it's turned into a proxy... and what's more, it's a virus on the loose... As of now Google holds about 52,900 pages that it crawled believing them to originate from bhomiyo.com. If not stopped, this will result in a big part of the web showing up as pages under bhomiyo.com. Want to see the current number? Then follow the link below.
http://www.google.co.in/search?q=site%3Abhomiyo.com
What's expected from the bhomiyo.com webmaster?
They should use robots.txt properly to tell search engines, including Google, that no page under a folder xx.xliterate (where xx is the language code) should be crawled. They should also use Google Webmaster Tools to request that the already crawled pages be removed from the index.
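As a rough illustration of the robots.txt suggestion, the Python sketch below builds one Disallow rule per transliteration folder and then checks the rules with the standard library's robots.txt parser. The language codes are only the ones that appear in URLs quoted in this post and its comments (an assumption, since the site's full list is not given here), and `build_robots_txt` is a name made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Assumption: these codes are just the ones seen in URLs quoted in this post
# and its comments; the real site would list every code it serves.
LANGUAGE_CODES = ["ml", "or", "uxh"]


def build_robots_txt(codes):
    """Return robots.txt text that blocks every <code>.xliterate folder."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: /{code}.xliterate/" for code in codes]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(LANGUAGE_CODES)

# Read the rules back the way a well-behaved crawler would.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

blocked = not parser.can_fetch("Googlebot", "http://bhomiyo.com/ml.xliterate/jalaj.net")
home_allowed = parser.can_fetch("Googlebot", "http://bhomiyo.com/")
```

One rule per language code is used here because plain prefix rules are understood by every crawler, while wildcard patterns are an extension that not all of them support.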
Update 20.08.2007: As of now Google has indexed 121,000 pages, an increase of 68,100 in 26 days (about 2,620 per day on average).
Update : A Spark Neglected Burns the House

WebSense classified the site under the category "Phishing and Other Frauds".



July 28th, 2007 - 04:30
Jalaj,
If you have read about Bhomiyo.com on the internet or on bhomiyo.wordpress.com, you would have noticed that it’s a volunteer effort to help publishers of Indian language content reach out to a wider audience.
The transliteration feature was added recently because of some user’s request, and I have noticed the effect on Google, as many English pages are getting indexed under this main domain, which is not desired.
1) The current users of transliteration do want their pages to be indexed in different languages so that they get more traffic. E.g. public users are likely to search for Bhaarat by typing in English, and they should reach the sites that have the word Bhaarat in Hindi or other languages; that way they will come to know that such pages exist in other languages. Today many Indian users don’t even know that so much content exists in Indian languages.
A special example is the http://www.sanskritdocuments.org/all_sa site. I see many hits to this site from Google via Bhomiyo, and that is desirable.
2) Now, because of lack of time I haven’t been able to put in enough checks so that English links are excluded from xliteration. What I mean is this: some site in Hindi (or other Indian language content) has a link to your site Jalaj.net. Your site is in English, so it does not have to go through xliteration, and Bhomiyo should ignore your site during xliteration. But I have to put in some rules or conditions for when to exclude other links.
I am debating this myself too. I asked one other blogger about it, and his idea was to leave it as it is for now.
I know your concern. Please do let me know any feedback that you may have. I don’t intend to spam the internet, but I do want Google to reach Indian language content through xliteration. It helps.
-Piyush
July 30th, 2007 - 07:08
Hi Piyush,
There is no doubt that the site is a useful one, as I acknowledged in the post, and with your explanation it is also evident that you want regional content to be indexed in Google so that people searching for regional content in Roman text or another Indian language can also reach the site content. But the fact remains that in its current avatar the site has turned into a proxy site and will be misused as and when people come to know about it. It is also creating duplicate content in Google (the number of pages has increased by 300 since my post). I wish the site all the best, and here are some of my suggestions for its improvement.
1) A database of excluded URLs should be created (I will come later to how it is to be filled). When a page is requested, the transliteration script should first check the list, and if the URL exists in the list it should simply redirect to the given URL instead of transliterating it. This way those using the site as a proxy will be discouraged. This will require a little code added to your existing script and a corresponding database table.
2) The exclusion list above can be filled using an admin-side page, so that you have control over it.
3) In the transliteration script, after you receive the complete page (and before carrying out the transliteration and modifying URLs), you can use regular expressions to check whether the page contains one or more characters from Indian languages. If it does, you can go ahead with the transliteration; otherwise insert the URL into the exclusion list and simply redirect to the original URL. The next time the same page is requested it will be excluded anyway by point 1 above.
The suggestions above may take time to implement, but they should at least be in the modification plans. They will also reduce bandwidth consumption.
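Suggestions 1 and 3 could be sketched roughly as below. This is only an illustration in Python, not the site's actual code: the in-memory set stands in for the database table, `decide` and `fetch_page` are hypothetical names, and the character range assumes the Unicode blocks from Devanagari through Malayalam (other scripts would need extra ranges).

```python
import re

# Unicode blocks for the major Indian scripts, Devanagari through Malayalam
# (U+0900 to U+0D7F). Assumption: scripts outside this range, such as the
# Arabic script used for Urdu, would need additional ranges.
INDIC_CHARS = re.compile(r"[\u0900-\u0D7F]")

# Hypothetical stand-in for the exclusion table from suggestion 1.
exclusion_list = set()


def decide(url, fetch_page):
    """Return "redirect" or "transliterate" for a requested URL.

    fetch_page is any callable that takes a URL and returns the page text.
    """
    # Suggestion 1: a URL already in the list is redirected without fetching.
    if url in exclusion_list:
        return "redirect"

    page_text = fetch_page(url)

    # Suggestion 3: a page with no Indic characters has nothing to
    # transliterate, so remember it and redirect to the original.
    if not INDIC_CHARS.search(page_text):
        exclusion_list.add(url)
        return "redirect"

    return "transliterate"
```

Because an English-only page is added to the list on its first request, later requests for it skip the fetch entirely, which is where the bandwidth saving would come from.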
September 24th, 2007 - 03:11
Recently I did two things that may help reduce the problem.
1) When a site is accessed via the bhomiyo proxy, it will not rewrite external links with bhomiyo xliteration. So if someone is accessing bhomiyo.com/uxh.xliterate/bbc.co.uk/urdu and that page has a link to google.com, the link will remain intact.
2) Introduced a Forbidden sites table, in which I am manually entering the addresses of sites that should not be accessed via the proxy. This way sites like ebay or paypal are not accessible through this proxy.
Just to keep you updated.
-Piyush
P.S. I would appreciate it if you changed the title of this post and called it something other than ‘virus’, unless you really think so.
September 24th, 2007 - 08:38
Yes… Now with the changes you have made, the site no longer spreads itself but remains confined within the site you are originally visiting…
The second change you made will be instrumental in keeping banking and related sites from being accessed through the proxy (access which, as you can see, has led to the site being marked under an unexpected category by WebSense).
You will see the title changed soon.
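For readers curious how the two changes discussed above might work, here is a minimal Python sketch. It is not the site's actual code: the function names are invented for this example, the forbidden entries are just the two sites named in the comment above, and the rewritten URL format is inferred from the URLs quoted in this post.

```python
from urllib.parse import urljoin, urlparse

# Assumption: prefix format inferred from the URLs quoted in this post.
PROXY_PREFIX = "http://bhomiyo.com"

# Hypothetical entries; the comment above names ebay and paypal as examples.
FORBIDDEN_SITES = {"ebay.com", "paypal.com"}


def is_forbidden(host):
    """True if host is a forbidden site or one of its subdomains."""
    host = host.lower()
    return any(host == site or host.endswith("." + site) for site in FORBIDDEN_SITES)


def rewrite_link(href, page_url, lang_code):
    """Rewrite a link found on page_url, leaving links to other sites intact."""
    target = urljoin(page_url, href)  # resolve relative links first
    parsed = urlparse(target)
    if parsed.hostname != urlparse(page_url).hostname:
        return target  # external link: keep it pointing at the original site
    # Internal link: keep the visitor inside the transliterated view.
    # Query strings are ignored in this sketch.
    return f"{PROXY_PREFIX}/{lang_code}.xliterate/{parsed.netloc}{parsed.path}"
```

With this shape, a link to google.com on a transliterated BBC Urdu page stays a plain google.com link, while a link to another BBC page continues through the uxh.xliterate path.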
December 26th, 2007 - 21:56
Hello Piyush
I’ve just come across bhomiyo.com, while doing a random search for my own blog, World Foodie Guide. I have done a bit of research about it, and commend bhomiyo.com’s volunteer efforts, enabling English text to be translated into various Indian languages. However, what I do not agree with is the fact that my entire blog is showing with links being redirected live to bhomiyo.com. I am happy to share my blog with anyone, but as it is, bhomiyo.com appears to be displaying my entire blog as if it is bhomiyo.com content. I don’t think the disclaimer above my blog makes it clear enough that this is MY blog. And could you explain how traffic is being directed to my blog, as I have not seen any referrer stats from bhomiyo.com?
http://www.bhomiyo.com/or.xliterate/worldfoodieguide.wordpress.com
Helen Yuet Ling Pang
December 27th, 2007 - 04:56
@foodieguide
A correction: bhomiyo.com does not translate from English to other languages but “transliterates” between Indian scripts (the language does not change, only the script the page is written in).
bhomiyo.com earlier showed the entire web as its content, but it has since been corrected a bit (after this blog post, which was originally titled ‘A Virus on Loose’) and now leaves links to other domains as they are.
Since your blog content is entirely in English, it will be displayed as is when transliterated to any language. You can write to bhomiyo.com to have your blog excluded from being accessed through the site.
November 16th, 2008 - 07:58
Why have you stopped Bhomiyo in Gujarati and Hindi?
It is so useful to Gujarati people. Please try to reopen it for us……
October 14th, 2009 - 03:39
Any plugin for enabling Google Indic transliteration in Live Writer?