Jalaj

December 19, 2006

Creating a Bot App - 1

Filed under: COM, Regular Expressions, Visual Basic — Jalaj @ 10:22 am

I wanted to convert one of my wikis (running on MediaWiki) to static pages, so that the content could be viewed without needing an Apache server, PHP and MySQL.

For those who are not aware of it, MediaWiki is open source software for creating a collaboration portal, where any user can add, modify or delete content. Wikipedia, one of the largest information sources, is powered by this software (in fact, MediaWiki was developed for Wikipedia itself and then released to the open source community).

I downloaded a phpMyAdmin dump of the wiki and loaded it onto my local system. Now I needed software that could start from the first URL and save all pages as local files, updating every link to point to the local pages. Unfortunately, the software that I had access to failed to meet my requirements.

The page content and titles were multilingual (Hindi and Gujarati) Unicode text, which produced local pages whose names were long and hard to understand. I also wanted a few pages, such as the Edit pages, to keep running on the existing site on the net, which the software did not support. It also failed to fetch CSS files included using @import, which spoiled the look of the pages.

I was left with no choice but to shelve the task or develop my own software to make this possible. Naturally, I preferred the latter.

I ended up with a small Visual Basic application which, though not as powerful as the other available applications, let me complete my task. I ran it against a local system (I changed the hosts file so that it stood in for my own site on the net). The application works for sites on the net too, but being single-threaded it is slow compared to applications that open a large number of threads to fetch pages. My app saves files as 1.html, 2.gif and so on, and puts all of them in a single folder regardless of the folder structure of the original site, because that is how it is programmed.

The main part of this app lives in a separate DLL file which provides two objects, “Spider” and “PagePicker”. This approach facilitates reuse, and at the same time enhancements to the app can be made with less pain thanks to its object-oriented nature.

Open Visual Basic and create a new project. When prompted for the application type, select “ActiveX DLL”. Change the project name to “MyBot” in the Project Properties dialog. By default, a class named “Class1” is created; change its name to “Spider” in the Properties window. Add references to “Microsoft ActiveX Data Objects 2.6 Library” and “Microsoft VBScript Regular Expressions 5.5”.

An instance of the “Spider” class will allow us to specify:
1) The folder path where the files will be saved.
2) A regular expression for the path allowed for fetching pages. For example, giving “http://localhost/” will fetch pages only from this path.
3) A regular expression for the path(s) denied for fetching, out of those allowed above. For example, “http://localhost/private” will prevent pages under this path from being fetched, and giving “/edit.php|/delete.php” will ensure that these pages are not fetched.

We will implement the above as the properties LocalFolder, AllowURL and DenyURL respectively. objURLList and objLocalFiles are private collections which hold the URLs to be fetched and the corresponding local filenames. objPicker is an object of type PagePicker which is responsible for fetching and parsing the pages.

Private objURLList As New Collection
Private objLocalFiles As New Collection

Private objPicker As New PagePicker

Public LocalFolder As String

Public AllowURL As String
Public DenyURL As String

Further, Spider will allow us to:
1) Add a URL for fetching.
2) Start the page fetching.

We will implement the above as the methods AddURL and BotStart.

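Public Function AddURL(strURL)
    Dim blnURLAllowed As Boolean
    Dim blnURLDenied As Boolean
    Dim objRegEX As New RegExp
    Dim ctr As Long

    ' Check the URL against the allow pattern.
    objRegEX.Global = False
    objRegEX.IgnoreCase = True
    objRegEX.Pattern = AllowURL
    blnURLAllowed = objRegEX.Test(strURL)

    ' Check the URL against the deny pattern; an empty pattern denies nothing.
    If DenyURL = "" Then
        blnURLDenied = False
    Else
        objRegEX.Pattern = DenyURL
        blnURLDenied = objRegEX.Test(strURL)
    End If

    If blnURLAllowed And Not blnURLDenied Then
        AddURL = ""
        ' If the URL is already queued, return its existing local filename.
        For ctr = 1 To objURLList.Count
            If objURLList(ctr) = strURL Then
                AddURL = objLocalFiles(ctr)
                Exit Function
            End If
        Next
        ' Otherwise queue it and name the local file after its position in the list.
        ' Str() puts a leading space before positive numbers, hence the Trim.
        objURLList.Add strURL
        objLocalFiles.Add Trim(Str(objURLList.Count) & "." & FileType(strURL))
        AddURL = objLocalFiles(objLocalFiles.Count)
    Else
        ' URLs that will not be fetched are returned unchanged.
        AddURL = strURL
    End If
End Function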

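Public Sub BotStart()

    Dim ctr As Long
    Dim strPageContent As String

    ' objURLList.Count is re-read on every pass, so URLs added
    ' while a page is being fetched are processed as well.
    ctr = 0
    Do While ctr < objURLList.Count
        ctr = ctr + 1
        ' Skip empty entries.
        If objURLList(ctr) > "" Then
            strPageContent = GetPage(objURLList(ctr))
            SavePage strPageContent, objLocalFiles(ctr)
        End If
    Loop

End Sub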
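
BotStart walks the list with Do While instead of For...Next because the list can grow while it is being processed: a For loop evaluates its upper bound only once, while Do While checks the count again on every pass. The following is only a small sketch of that behaviour and is not part of the Spider class. DemoGrowingQueue and the page names are made up for illustration.

```vb
' Hypothetical demo, to be pasted into a standard module.
Public Function DemoGrowingQueue() As Long
    Dim colQueue As Collection
    Dim ctr As Long

    Set colQueue = New Collection
    colQueue.Add "page1"

    ctr = 0
    Do While ctr < colQueue.Count
        ctr = ctr + 1
        Debug.Print "Fetching " & colQueue(ctr)
        ' Pretend the first page links to two more pages.
        If ctr = 1 Then
            colQueue.Add "page2"
            colQueue.Add "page3"
        End If
    Loop

    ' All three entries were visited, including the two added mid-loop.
    DemoGrowingQueue = ctr
End Function
```

Calling DemoGrowingQueue from the Immediate window prints three "Fetching" lines and returns 3.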

The AddURL method takes a URL as input and checks that it is allowed by the AllowURL property and not denied by the DenyURL property. If the URL is not in the allowed path, or is in the denied list, the method simply returns the URL that was passed to it. Otherwise it checks whether the URL already exists in the list of URLs to be fetched. If it does, the local filename already assigned to it is returned. If not, a local filename for the URL is created, saved in the corresponding collection, and returned.

Why do we need to return filenames? Well, the same method will also be called by the instance of PagePicker, which needs to replace each URL in a page with its local filename so that the saved pages can be browsed.
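
To make the return values concrete, here is a sketch of what a client might see. It assumes the MyBot DLL has been compiled (including the PagePicker class from the next part) and is referenced from a separate project. DemoAddURL, the folder and the URLs are made up for illustration.

```vb
' Hypothetical client code in a standard module of a project
' that has a reference to the compiled MyBot DLL.
Public Sub DemoAddURL()
    Dim objSpider As MyBot.Spider
    Set objSpider = New MyBot.Spider

    ' Assumed folder; SavePage concatenates it with the filename,
    ' so it needs the trailing backslash.
    objSpider.LocalFolder = "C:\WikiStatic\"
    objSpider.AllowURL = "http://localhost/"
    objSpider.DenyURL = "/edit.php|/delete.php"

    ' New, allowed URLs get numbered local filenames.
    Debug.Print objSpider.AddURL("http://localhost/index.php")        ' 1.html
    Debug.Print objSpider.AddURL("http://localhost/skins/main.css")   ' 2.css

    ' A URL that is already queued returns the same filename again.
    Debug.Print objSpider.AddURL("http://localhost/index.php")        ' 1.html

    ' Denied or not-allowed URLs come back unchanged.
    Debug.Print objSpider.AddURL("http://localhost/edit.php?title=Home")
    Debug.Print objSpider.AddURL("http://example.com/")

    ' objSpider.BotStart would now fetch and save the two queued URLs.
End Sub
```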

The AddURL method uses the Test method of RegExp, which returns True or False depending on whether or not the pattern is found in the string being searched.
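
Because Test looks for the pattern anywhere in the string, and a dot in a regular expression matches any character, it can be worth anchoring the allow pattern with ^ and escaping the dots. The sketch below shows the effect. IsFetchable is a hypothetical helper that mirrors the allow/deny check in AddURL; it is not part of the Spider class, and the patterns and URLs are only examples.

```vb
' Requires a reference to "Microsoft VBScript Regular Expressions 5.5".
' Hypothetical helper, to be pasted into a standard module.
Public Function IsFetchable(ByVal strURL As String, _
                            ByVal strAllow As String, _
                            ByVal strDeny As String) As Boolean
    Dim objRegEX As RegExp
    Set objRegEX = New RegExp
    objRegEX.IgnoreCase = True

    ' Not allowed: leave the function with its default value, False.
    objRegEX.Pattern = strAllow
    If Not objRegEX.Test(strURL) Then Exit Function

    ' An empty deny pattern means nothing is denied.
    If strDeny = "" Then
        IsFetchable = True
    Else
        objRegEX.Pattern = strDeny
        IsFetchable = Not objRegEX.Test(strURL)
    End If
End Function

Public Sub DemoIsFetchable()
    Const ALLOW_PATTERN As String = "^http://localhost/"
    Const DENY_PATTERN As String = "/edit\.php|/delete\.php"

    ' Allowed and not denied.
    Debug.Print IsFetchable("http://localhost/index.php", ALLOW_PATTERN, DENY_PATTERN)              ' True
    ' Allowed, but matches the deny pattern.
    Debug.Print IsFetchable("http://localhost/edit.php?title=Home", ALLOW_PATTERN, DENY_PATTERN)    ' False
    ' Contains the allowed prefix, but not at the start, so ^ rejects it.
    Debug.Print IsFetchable("http://example.com/?u=http://localhost/", ALLOW_PATTERN, DENY_PATTERN) ' False
End Sub
```

Without the ^, the last URL would pass the allow check, since “http://localhost/” does occur inside it.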

The BotStart method starts fetching pages and continues until all the pages have been fetched. The AddURL method calls a private function, FileType, which returns the extension to use for the local file corresponding to a URL.

Private Function FileType(ByVal strURL As String) As String

    If InStr(strURL, ".css") > 0 Then FileType = "css": Exit Function
    If InStr(strURL, ".js") > 0 Then FileType = "js": Exit Function
    If InStr(strURL, ".vbs") > 0 Then FileType = "vbs": Exit Function
    If InStr(strURL, ".gif") > 0 Then FileType = "gif": Exit Function
    If InStr(strURL, ".jpg") > 0 Then FileType = "jpg": Exit Function
    If InStr(strURL, ".png") > 0 Then FileType = "png": Exit Function
    If InStr(strURL, ".ico") > 0 Then FileType = "ico": Exit Function
    If InStr(strURL, ".hhc") > 0 Then FileType = "hhc": Exit Function
    FileType = "html"

End Function
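
Note that InStr finds the extension anywhere in the URL, so a URL such as “http://localhost/view.jsp” is given the extension js. If that ever becomes a problem, a stricter check could look only at the end of the path. The following is just a sketch of that idea, not what the app uses. FileTypeStrict is a made-up name.

```vb
' Hypothetical stricter variant of FileType (not part of the original class).
Private Function FileTypeStrict(ByVal strURL As String) As String
    Dim strPath As String
    Dim strExt As String
    Dim lngPos As Long

    strPath = LCase$(strURL)

    ' Drop the query string and the fragment, if any.
    lngPos = InStr(strPath, "?")
    If lngPos > 0 Then strPath = Left$(strPath, lngPos - 1)
    lngPos = InStr(strPath, "#")
    If lngPos > 0 Then strPath = Left$(strPath, lngPos - 1)

    ' Default, as in FileType.
    FileTypeStrict = "html"

    ' Use the text after the last dot, provided that dot
    ' comes after the last slash (i.e. it is in the file name).
    lngPos = InStrRev(strPath, ".")
    If lngPos > InStrRev(strPath, "/") Then
        strExt = Mid$(strPath, lngPos + 1)
        Select Case strExt
            Case "css", "js", "vbs", "gif", "jpg", "png", "ico", "hhc"
                FileTypeStrict = strExt
        End Select
    End If
End Function
```

With this variant “http://localhost/view.jsp” gives html and “http://localhost/skins/main.css?v=2” gives css. The trade-off is that a URL which carries the file name only in its query string would be treated as html here, whereas the InStr version above would still pick up the extension.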

The BotStart method calls a private function, GetPage, to fetch each page in the list; GetPage itself uses the Pick method of PagePicker. BotStart also calls the private Sub SavePage to save the content in the local folder.

Private Function GetPage(strURL)

    GetPage = objPicker.Pick(strURL, Me)

End Function

Private Sub SavePage(strPageContent, strLocalFile)

    ' LocalFolder is joined directly to the filename,
    ' so it must end with a backslash.
    Dim objTextStream As New ADODB.Stream
    objTextStream.Type = adTypeText
    objTextStream.Open
    objTextStream.WriteText strPageContent
    objTextStream.SaveToFile LocalFolder & strLocalFile, adSaveCreateOverWrite
    objTextStream.Close

End Sub
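
SavePage leaves the Charset property of the stream at its default, which ADO documents as “Unicode”. If the Hindi and Gujarati pages should be written as UTF-8 files instead, one possible variant sets Charset before opening the stream. This is only a sketch under that assumption, and SavePageUtf8 is a made-up name; whether UTF-8 is the right choice depends on how PagePicker hands over the content.

```vb
' Hypothetical variant of SavePage that writes UTF-8 text.
' Like SavePage, it belongs in the Spider class and uses LocalFolder.
Private Sub SavePageUtf8(ByVal strPageContent As String, ByVal strLocalFile As String)

    Dim objTextStream As ADODB.Stream
    Set objTextStream = New ADODB.Stream

    objTextStream.Type = adTypeText
    ' Assumption: the saved pages are wanted as UTF-8.
    objTextStream.Charset = "utf-8"
    objTextStream.Open
    objTextStream.WriteText strPageContent
    objTextStream.SaveToFile LocalFolder & strLocalFile, adSaveCreateOverWrite
    objTextStream.Close

End Sub
```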

That’s all for the Spider class. We will take up the rest of the app in the next part.

