Jalaj P. Jha Technical & Miscellaneous Ramblings

17 Dec 06




Regular Expressions in VB

I sat down to write a function that takes a string containing the complete HTML source of a page and parses it to return the URLs of all pages, images, and CSS files referenced from that HTML. The conditions were:

1. Pages may be linked using href="<url>".
2. There may or may not be quotes around the URL.
3. The equals sign may have zero or more spaces on either or both sides.
4. CSS/JS files or images may be linked using src="<url>".
5. Conditions 2-3 apply to these files and images too.
6. CSS files may be imported using @import "<url>".
7. Condition 2 applies to the above.
8. Inline CSS declarations may reference images using url("<url>").
9. Condition 2 applies to the above.
That's all.

Thanks to regular expressions, it took me only a few lines of code to implement all of this.

Dim re As New RegExp
Dim m As Match

re.IgnoreCase = True
' The hyphen sits last in the character class so that it is a literal, not a range.
re.Pattern = "((src|href)\s*=\s*|@import\s*|url\s*\()""*\s*[a-z0-9/_%:.&?+=-]+\s*""*\s*\)*"
re.Global = True

For Each m In re.Execute(rtext)

    ' m.Value holds the URL together with its href/src/@import/url prefix.
    ' It can be filtered further to get a clean URL, and the values
    ' returned in an array or in whatever form is required. For example:
    ' MsgBox m
Next

The code assumes that:
1. The project references "Microsoft VBScript Regular Expressions 5.5".
2. The variable rtext contains the HTML code to extract URLs from.
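
The comments in the loop mention filtering each match down to a clean URL. One way to do that, sketched here under the same two assumptions, is to put a capturing group around the URL part and read it from SubMatches. ExtractUrls is a hypothetical helper name, and the pattern is a looser variant of the one above (it also accepts single quotes and takes every character up to whitespace, a quote, > or a closing parenthesis), so treat it as an illustration rather than a drop-in replacement.

```vb
' Requires a reference to "Microsoft VBScript Regular Expressions 5.5".
' ExtractUrls is a hypothetical name used only for this sketch.
Public Function ExtractUrls(ByVal rtext As String) As Collection
    Dim re As New RegExp
    Dim m As Match
    Dim urls As New Collection

    re.IgnoreCase = True
    re.Global = True
    ' The prefix (href=, src=, @import, url( ) and an optional opening quote
    ' sit outside the capturing group, so SubMatches(0) is the bare URL.
    re.Pattern = "(?:(?:src|href)\s*=\s*|@import\s+(?:url\s*\(\s*)?|url\s*\(\s*)" & _
                 "[""']?\s*([^""'\s>)]+)"

    For Each m In re.Execute(rtext)
        urls.Add m.SubMatches(0)
    Next

    Set ExtractUrls = urls
End Function
```

It could then be called like this:

```vb
Dim u As Variant
For Each u In ExtractUrls(rtext)
    Debug.Print u
Next
```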
