Creating a Bot App – 1
I wanted one of my wikis (running on MediaWiki) converted to static pages, so that the content could be viewed without an Apache server, PHP and MySQL.
For those who are not aware, MediaWiki is open source software for building a collaboration portal where any user can add, modify or delete content. Wikipedia, one of the largest information sources, is powered by it (in fact, MediaWiki was developed for Wikipedia itself and then released to the open source community).
I downloaded a phpMyAdmin dump of the database and loaded it onto my local system. Now I needed software that could start from a first URL, save every page as a local file, and update all links to point to the local pages. Unfortunately, the software I had access to failed to meet my requirements.
The page content and titles contained multilingual (Hindi and Gujarati) Unicode characters, which produced local pages with long, hard-to-read names. I also wanted a few pages, such as the Edit pages, to keep pointing to the existing site on the net, which the software did not support. It also failed to fetch CSS files included using @import, which spoiled the look of the pages.
I was left with no choice other than shelving this task or developing my own software to make it possible. Naturally, I preferred the latter.
I ended up with a small Visual Basic application which, though not as powerful as the other available applications, let me complete my task on a local system (I changed the hosts file so that my local machine stood in for my own site on the net). The application works for sites on the net too but, being single-threaded, it is slow compared to applications that open a large number of threads to fetch pages. My app saves files as 1.html, 2.gif and so on, and puts them all in a single folder regardless of the folder structure of the original site, because that is how it is programmed.
The main part of this app lives in a separate DLL file that provides two objects, "Spider" and "PagePicker". This approach facilitates reuse, and its object-oriented nature makes later enhancements to the app less painful.
Open Visual Basic and create a new project. When prompted for the application type, select "ActiveX DLL". Change the project name to "MyBot" in the Project Properties dialog. By default, a class named "Class1" is created; change its name to "Spider" in the Properties window. Add references to "Microsoft ActiveX Data Objects 2.6 Library" and "Microsoft VBScript Regular Expressions 5.5".
An instance of the "Spider" class will let us specify:
1) The folder path where the files will be saved.
2) A regular expression for the paths from which pages may be fetched. Example: "http://localhost/" fetches only pages under this path.
3) A regular expression for the path(s) to exclude from those allowed above. Example: "http://localhost/private" excludes pages under this path, and "/edit.php|/delete.php" ensures that these pages are not fetched.
We will implement these as the properties LocalFolder, AllowURL and DenyURL respectively. objURLList and objLocalFiles are private collections that hold the URLs to be fetched and the corresponding local filenames. objPicker is an object of type PagePicker, which is responsible for fetching and parsing the pages.
Private objURLList As New Collection
Private objLocalFiles As New Collection
Private objPicker As New PagePicker
Public LocalFolder As String
Public AllowURL As String
Public DenyURL As String
Spider will also let us:
1) Add a URL for fetching.
2) Start the page fetching.
We will implement these as the methods AddURL and BotStart.
Public Function AddURL(strURL)
    Dim blnURLAllowed As Boolean
    Dim blnURLDenied As Boolean
    Dim objRegEX As New RegExp
    Dim ctr As Long

    objRegEX.Global = False
    objRegEX.IgnoreCase = True

    ' The URL must match the allow pattern...
    objRegEX.Pattern = AllowURL
    blnURLAllowed = objRegEX.Test(strURL)

    ' ...and must not match the deny pattern. An empty DenyURL denies nothing.
    If DenyURL = "" Then
        blnURLDenied = False
    Else
        objRegEX.Pattern = DenyURL
        blnURLDenied = objRegEX.Test(strURL)
    End If

    If blnURLAllowed And Not blnURLDenied Then
        ' Already in the list: return the local filename assigned earlier.
        For ctr = 1 To objURLList.Count
            If objURLList(ctr) = strURL Then
                AddURL = objLocalFiles(ctr)
                Exit Function
            End If
        Next

        ' New URL: add it and name its local file <position>.<extension>.
        objURLList.Add strURL
        objLocalFiles.Add Trim(Str(objURLList.Count) & "." & FileType(strURL))
        AddURL = objLocalFiles(objLocalFiles.Count)
    Else
        ' Not allowed, or denied: hand the URL back unchanged.
        AddURL = strURL
    End If
End Function
Public Sub BotStart()
    Dim ctr As Long
    Dim strPageContent As String

    ctr = 0
    ' The list can grow while pages are processed, so Count is rechecked on every pass.
    Do While ctr < objURLList.Count
        ctr = ctr + 1
        If objURLList(ctr) > "" Then
            strPageContent = GetPage(objURLList(ctr))
            SavePage strPageContent, objLocalFiles(ctr)
        End If
    Loop
End Sub
The AddURL method takes a URL as input and checks that it is allowed by the AllowURL property and not denied by the DenyURL property. If the URL is outside the allowed path or matches the denied list, the method simply returns the URL that was passed to it. Otherwise, it checks whether the URL already exists in the list of URLs to be fetched and, if so, returns the local filename already assigned to it. If not, a local filename for the URL is created, saved in the corresponding collection, and returned.
Why do we need to return filenames? The same method will also be called by the PagePicker instance, which needs to replace each URL with its local filename so that the saved pages can be browsed.
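Because PagePicker calls AddURL while BotStart is still looping, the list of URLs can grow during the crawl. The following is only a minimal, standalone sketch of that pattern; DemoGrowingQueue and the "page-N" names are made up for illustration and are not part of the MyBot project.

```vb
' Paste into a standard module and run DemoGrowingQueue from the Immediate window.
Public Sub DemoGrowingQueue()
    Dim colQueue As New Collection
    Dim ctr As Long

    colQueue.Add "page-1"                  ' the start URL
    ctr = 0
    Do While ctr < colQueue.Count          ' Count is read again on every pass
        ctr = ctr + 1
        Debug.Print "Fetching " & colQueue(ctr)
        ' Pretend that each of the first two pages links to one new page.
        If ctr <= 2 Then colQueue.Add "page-" & (ctr + 1)
    Loop
End Sub
```

This prints "Fetching page-1", "Fetching page-2" and "Fetching page-3". A For ctr = 1 To colQueue.Count loop would stop after the first page, since VB evaluates the upper bound of a For loop only once.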
The AddURL method uses the Test method of RegExp, which returns True or False depending on whether the pattern is found in the string being searched.
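To see how the allow and deny patterns combine, here is a small standalone sketch of the same check. IsFetchable and DemoUrlFilter are hypothetical names, and the object is created with CreateObject so that the snippet runs without adding a reference. The dots in the deny pattern are escaped here so that they match only a literal dot.

```vb
' Paste into a standard module and run DemoUrlFilter from the Immediate window.
Public Function IsFetchable(ByVal strURL As String, _
                            ByVal strAllow As String, _
                            ByVal strDeny As String) As Boolean
    Dim objRegEx As Object
    Set objRegEx = CreateObject("VBScript.RegExp")   ' late bound

    objRegEx.IgnoreCase = True
    objRegEx.Pattern = strAllow
    IsFetchable = objRegEx.Test(strURL)

    ' Only consult the deny pattern when one was supplied.
    If IsFetchable And strDeny <> "" Then
        objRegEx.Pattern = strDeny
        IsFetchable = Not objRegEx.Test(strURL)
    End If
End Function

Public Sub DemoUrlFilter()
    Const ALLOW_PATTERN As String = "http://localhost/"
    Const DENY_PATTERN As String = "/edit\.php|/delete\.php"

    Debug.Print IsFetchable("http://localhost/index.php", ALLOW_PATTERN, DENY_PATTERN)      ' True
    Debug.Print IsFetchable("http://localhost/edit.php?id=3", ALLOW_PATTERN, DENY_PATTERN)  ' False
    Debug.Print IsFetchable("http://example.com/index.php", ALLOW_PATTERN, DENY_PATTERN)    ' False
End Sub
```

Note that Test looks for the pattern anywhere in the string. If the allow pattern should match only at the start of the URL, it can be anchored as "^http://localhost/".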
The BotStart method starts fetching pages and continues until all the pages have been fetched. The AddURL method calls a private function, FileType, which returns the extension to use for the local file that corresponds to a URL.
Private Function FileType(ByVal strURL As String) As String
    ' InStr is case-sensitive by default and matches anywhere in the URL,
    ' including the query string.
    If InStr(strURL, ".css") > 0 Then FileType = "css": Exit Function
    If InStr(strURL, ".js") > 0 Then FileType = "js": Exit Function
    If InStr(strURL, ".vbs") > 0 Then FileType = "vbs": Exit Function
    If InStr(strURL, ".gif") > 0 Then FileType = "gif": Exit Function
    If InStr(strURL, ".jpg") > 0 Then FileType = "jpg": Exit Function
    If InStr(strURL, ".png") > 0 Then FileType = "png": Exit Function
    If InStr(strURL, ".ico") > 0 Then FileType = "ico": Exit Function
    If InStr(strURL, ".hhc") > 0 Then FileType = "hhc": Exit Function
    FileType = "html"
End Function
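FileType looks for the extension anywhere in the URL, so a URL such as "http://localhost/index.php?title=Notes.js" is saved with a .js extension, and "MAIN.CSS" is not recognised as a stylesheet. A stricter variant could look only at the extension of the path itself. The sketch below is one possible alternative, not the code used in this app, and FileTypeFromPath is a hypothetical name.

```vb
' Paste into a standard module and run DemoFileType from the Immediate window.
Public Function FileTypeFromPath(ByVal strURL As String) As String
    Dim strPath As String
    Dim strExt As String
    Dim lngPos As Long

    strPath = LCase$(strURL)

    ' Drop the query string and the fragment, if any.
    lngPos = InStr(strPath, "?")
    If lngPos > 0 Then strPath = Left$(strPath, lngPos - 1)
    lngPos = InStr(strPath, "#")
    If lngPos > 0 Then strPath = Left$(strPath, lngPos - 1)

    ' Keep the last path segment, then whatever follows its last dot.
    strPath = Mid$(strPath, InStrRev(strPath, "/") + 1)
    lngPos = InStrRev(strPath, ".")
    If lngPos > 0 Then strExt = Mid$(strPath, lngPos + 1)

    Select Case strExt
        Case "css", "js", "vbs", "gif", "jpg", "png", "ico", "hhc"
            FileTypeFromPath = strExt
        Case Else
            FileTypeFromPath = "html"
    End Select
End Function

Public Sub DemoFileType()
    Debug.Print FileTypeFromPath("http://localhost/skins/MAIN.CSS?207")        ' css
    Debug.Print FileTypeFromPath("http://localhost/index.php?title=Notes.js")  ' html
End Sub
```

There is a trade-off. A stylesheet that is served through a script, with its real name only in the query string, would be saved as .html by this variant, whereas the simpler InStr test picks it up.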
The BotStart method calls a private function, GetPage, to fetch each page in the list; GetPage in turn uses the Pick method of the PagePicker. BotStart also calls the private Sub SavePage to save the content in the local folder.
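Private Function GetPage(strURL)
    ' Me is passed along so that PagePicker can call AddURL on this Spider.
    GetPage = objPicker.Pick(strURL, Me)
End Function

Private Sub SavePage(strPageContent, strLocalFile)
    Dim objTextStream As New ADODB.Stream

    objTextStream.Type = adTypeText
    objTextStream.Open
    objTextStream.WriteText strPageContent
    ' LocalFolder is joined to the filename as-is, so it must end with a backslash.
    objTextStream.SaveToFile LocalFolder & strLocalFile, adSaveCreateOverWrite
    objTextStream.Close
End Sub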
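To give an idea of how the class is meant to be driven, here is a hypothetical client in a separate Standard EXE project that references the compiled MyBot DLL. It assumes that the PagePicker class (covered in the next part) is already in place, and the folder name and start URL are only examples.

```vb
' Standard module in a Standard EXE project with a reference to MyBot.
Public Sub Main()
    Dim objSpider As MyBot.Spider
    Set objSpider = New MyBot.Spider

    ' SavePage joins LocalFolder and the filename directly,
    ' so the folder is given with its trailing backslash.
    objSpider.LocalFolder = "C:\WikiStatic\"      ' example folder, assumed to exist
    objSpider.AllowURL = "http://localhost/"
    objSpider.DenyURL = "/edit\.php|/delete\.php"

    ' Seed the list with the start page; the first allowed URL is named 1.html.
    Debug.Print objSpider.AddURL("http://localhost/index.php")

    ' Fetch everything reachable from the start page.
    objSpider.BotStart
End Sub
```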
That's all for the Spider class. We will cover the rest of the app in the next part.



