Regular Expressions - A Primer
My previous post posed the problem of extracting filenames from an HTML page. The ways in which pages may be referenced were listed there exhaustively; I repeat them here:
1. Pages may be linked using href="<url>"
2. There may or may not be quotes around the URL
3. The equals sign may have one or more spaces on either side, on both sides, or on neither
4. CSS/JS pages or images may be linked using src="<url>"
5. Conditions 2-3 apply to these pages/images too
6. CSS files may be called using @import "<url>"
7. Condition 2 applies to the above
8. Inline CSS declarations may call images using url("<url>")
9. Condition 2 applies to the above
Think of doing it with classical programming… You will need to check for the existence of the words href, src, @import and url, then determine the end of the URL by checking for a quote, a > sign, a space, a newline character and so on… and then extract the URL from the string so obtained… Want to program all these?
Just imagine if you could explain your requirement in minimal words to an existing program and get back the result… This is what Regular Expressions are all about! Ken Thompson brought this concept to reality in the late 1960s in his CTSS version of QED. Since then, Regular Expressions have been implemented in most development platforms and editors. Read more of its history at Wikipedia.
The software that implements Regular Expressions is termed a Regular Expressions (RegEx) Engine. While developing in Visual Basic, Regular Expressions are available through the COM component "Microsoft VBScript Regular Expressions 5.5".
To use Regular Expressions you define your search requirement using a syntax, thus forming a “Pattern”. The RegEx engine uses this pattern to search within the text and returns the “Matches”, i.e. the pieces of text you asked for.
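As a rough illustration of the pattern/matches idea, here is a minimal sketch in Python, whose built-in re module is one such engine (it is not the VBScript component mentioned above, and the sample text is made up):

```python
import re

# The pattern: the literal characters a, b, c in that order.
pattern = re.compile(r"abc")

# The text to search (a made-up sample).
text = "abc is not abd, but abc is abc"

# The engine returns every non-overlapping match of the pattern.
matches = pattern.findall(text)
print(matches)  # ['abc', 'abc', 'abc']
```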
The Syntax
The characters \ | . ^ $ ( ) * + [ ] ? have special meanings in a pattern; all other characters match themselves, i.e. if you specify the pattern as abc, all occurrences of abc will be returned in the matches.
The special characters listed above need to be escaped with a backslash "\" (which is itself one of those characters) when they are to be matched literally. So if you need to match * you have to specify the pattern as \*
| (pipe) matches the pattern on either side of the pipe, so abc|xyz will return all occurrences of abc or xyz in the matches. Similarly abc|pqr|xyz will return all occurrences of abc, pqr or xyz.
. (dot) matches any character except line break characters. ab. matches all occurrences of aba, abb, abc and so on.
^ specifies that the pattern following the caret needs to be at the beginning of the string being searched.
$ specifies that the pattern preceding the dollar sign needs to be at the end of the string being searched.
( ) can be used to group a set of characters. Why? Read on!
* specifies that the character or group (as defined above) preceding the asterisk can occur zero or more times. So abc* will match all occurrences of ab, abc, abcc and so on, whereas a(bc)* will match all occurrences of a, abc, abcbc and so on…
+ specifies that the character or group preceding the plus sign can occur one or more times. So abc+ will match all occurrences of abc, abcc and so on, whereas a(bc)+ will match all occurrences of abc, abcbc and so on…
[ ] can be used to specify a character class. a[bcde] will match all occurrences of ab, ac, ad, ae. The same result can be achieved by using "-" to specify a range, as in a[b-e]. Any number of characters (including escaped ones) and ranges can be combined in one class, for example [abf-j0-35-8\*\+]
? placed after a quantifier such as * or + instructs the engine to be “lazy” when matching the preceding item. (Placed directly after a character or group, ? makes that item optional.) Let me explain with an example.
Say the string being searched is
The search result can be "this" or "this one" or "this too"
If we need to match the strings in quotes (including the quotes), the pattern can be
".*" matching a string that begins and ends with a quote, with zero or more occurrences of any character in between
While you expected to get three matches, the RegEx engine, being greedy by default, tries to match as many characters as it can between two quotes and returns a single result:
"this" or "this one" or "this too"
So it treated even the quotes in between as “any character” and matched everything from the first quote to the last.
By defining the pattern as ".*?" you achieve the required result, as you have now instructed the engine to go lazy when searching for “any character” between the two quotes.
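The greedy versus lazy behaviour can be tried out with a small sketch like the one below. It uses Python's re module as the engine and plain ASCII quotes in the sample string; other engines should behave the same way for this pattern, but check their documentation.

```python
import re

text = 'The search result can be "this" or "this one" or "this too"'

# Greedy: .* runs to the last quote in the string, giving one long match.
greedy = re.findall(r'".*"', text)

# Lazy: .*? stops at the nearest closing quote, giving three short matches.
lazy = re.findall(r'".*?"', text)

print(greedy)  # ['"this" or "this one" or "this too"']
print(lazy)    # ['"this"', '"this one"', '"this too"']
```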
Further, there are a few other pattern definitions that use a backslash as prefix:
\s matches whitespace characters (space, tab and line breaks)
\n matches the Line Feed (LF) character
\r matches the Carriage Return (CR) character
\d matches digits 0-9
\w matches word characters, i.e. letters, digits and the underscore
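The syntax elements described in this section can be checked with a short sketch such as the following. It assumes Python's re module as the engine; full_matches is a hypothetical helper written for this illustration, and the word list and sample strings are made up.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

words = ["a", "ab", "abc", "abcc", "abcbc", "ae", "af", "xyz"]

print(full_matches(r"abc*", words))     # ['ab', 'abc', 'abcc']
print(full_matches(r"a(bc)*", words))   # ['a', 'abc', 'abcbc']
print(full_matches(r"abc+", words))     # ['abc', 'abcc']
print(full_matches(r"a[b-e]", words))   # ['ab', 'ae']
print(full_matches(r"abc|xyz", words))  # ['abc', 'xyz']

# Backslash-prefixed definitions and an escaped special character.
print(re.findall(r"\d+", "update 14.03.2008"))    # ['14', '03', '2008']
print(re.findall(r"\w+", "href = index_2.html"))  # ['href', 'index_2', 'html']
print(re.split(r"\s+", "one two\tthree\nfour"))   # ['one', 'two', 'three', 'four']
print(re.findall(r"\*", "2*3*4"))                 # ['*', '*']
```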
While I have covered here only a part of what is possible with RegEx, more information on Regular Expressions and their syntax can be found at Regular-Expressions.info. Since the implementation of RegEx can vary between engines, it is best to refer to the documentation on the engine's or developer's site while developing.
Now for our problem, let's take the first 3 conditions for href and build the pattern for it, which can be
href\s*=\s*"*[a-z0-9#/_%:\.&\-\?\+]+"*
The pattern above says that href can be followed by zero or more whitespace characters, then an equals sign, then zero or more whitespace characters, then zero or more quotes (of course, in practice there would be either one or none), then one or more of the characters that make up a URL (a-z 0-9 # / _ % : . & - ? +), then zero or more quotes. Inside the character class the hyphen is escaped as \- so that it is read as a literal hyphen; an unescaped &-\? would instead be taken as a range running from & to ?, which also covers characters such as ' ( ) < and >.
This pattern, when passed to the RegEx engine, matches every URL used with href. Let's add support for src (conditions 4-5)… the pattern is the same except that href is replaced with src… so let's group the two using the pipe |
(href|src)\s*=\s*"*[a-z0-9#/_%:\.&\-\?\+]+"*
For conditions 6-7, i.e. @import "<url>", add another level of grouping
((href|src)\s*=|@import)\s*"*[a-z0-9#/_%:\.&\-\?\+]+"*
For conditions 8-9, modify it finally to
((href|src)\s*=|@import|url)\s*\(*"*[a-z0-9#/_%:\.&\-\?\+]+"*\)*
Given this pattern, URLs from HTML, CSS and JS files can be easily extracted.
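To see the final pattern at work, here is a minimal sketch using Python's re module rather than the VBScript engine. Several things are assumptions of this sketch: the pattern is typed with plain ASCII quotes and with the hyphen in the character class escaped as \- so that it is literal, the re.IGNORECASE flag stands in for case-insensitive matching, and the sample markup is made up. Note that each match contains the leading keyword as well as the URL.

```python
import re

# Final pattern from the post, typed with ASCII quotes and an escaped hyphen
# (assumptions of this sketch).
URL_PATTERN = re.compile(
    r'((href|src)\s*=|@import|url)\s*\(*"*[a-z0-9#/_%:\.&\-\?\+]+"*\)*',
    re.IGNORECASE,
)

# Made-up sample covering quoted, unquoted, src, @import and url() cases.
sample = '''<link href="style.css">
<a href = page2.html>Next</a>
<img src="images/logo.png">
<style>@import "print.css"; div { background: url("bg.gif") }</style>'''

# group(0) is the whole match, keyword included.
found = [m.group(0) for m in URL_PATTERN.finditer(sample)]
for item in found:
    print(item)
# href="style.css"
# href = page2.html
# src="images/logo.png"
# @import "print.css"
# url("bg.gif")
```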
Update 14.03.2008: Ryan Gardner suggested two modifications: allowing URLs in upper case too (the pattern was originally developed for VB, hence the miss), and allowing single quotes as well as double quotes around the URL. The modified pattern, with the hyphen in the character class escaped so that it is literal and a bare ' accepted as a closing quote, is as below; the URL itself is captured in group 4.
((href|src)\s*=|@import|url)\s*(\(\s*"|\(\s*'|\(|'|")+([\w#/_%:\.&\-\?\+]+)('\s*\)|"\s*\)|"|'|\))+
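A hedged sketch of pulling out just the URL with the modified pattern follows, again in Python rather than VB or Java. The assumptions here: the pattern is typed with plain ASCII quotes, the hyphen in the character class is escaped, and a bare ' is accepted as a closing delimiter so that single-quoted href and src values match. The sample markup is made up. Group numbering in Python counts opening parentheses from the left, so group 4 is the URL.

```python
import re

# Modified pattern; group 4 holds the URL. ASCII quotes, the escaped hyphen
# and the bare ' in the closing group are assumptions of this sketch.
URL_PATTERN = re.compile(
    r"""((href|src)\s*=|@import|url)\s*(\(\s*"|\(\s*'|\(|'|")+"""
    r"""([\w#/_%:\.&\-\?\+]+)('\s*\)|"\s*\)|"|'|\))+"""
)

# Made-up sample with single quotes, double quotes, upper case and url().
sample = """
<a href='Page.HTML'>Next</a>
<img src="Logo.PNG">
<style>
@import "print.css";
div { background: url( 'bg.gif' ) }
p { background: url(bg2.gif) }
</style>
"""

urls = [m.group(4) for m in URL_PATTERN.finditer(sample)]
print(urls)  # ['Page.HTML', 'Logo.PNG', 'print.css', 'bg.gif', 'bg2.gif']
```

Because this pattern requires at least one opening delimiter, an unquoted href=page.html is not matched by it.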

Your regex is not complete - URLs can contain uppercase letters - so you should either have a-zA-Z0-9 or just use \w.
Comment by Ryan Gardner — March 13, 2008 @ 11:11 pm
Oh… and for href and src the quote can be single or double quote, and for url( the quote is optional.
I’m tweaking that regex and will post my tweaked one when I am finished with it.
Comment by Ryan Gardner — March 14, 2008 @ 12:56 am
Here’s an updated regex that I believe better finds the URLs and will better allow you grab them. (In Java, grouping #4 contains the URL )
((href|src)\s*=|@import|url)\s*(\(\s*"|\(\s*'|\(|'|")+([\w#/_%:\.&-\?\+]+)('\s*\)|"\s*\)|"|\))+
It grabs mostly-valid URLs… href('monkey.gif'); would never be valid, but it won't catch that, obviously… if someone really cared about only finding completely-valid URLs you'd have to retool it more than this.
Comment by Ryan Gardner — March 14, 2008 @ 2:33 am
Thanks Ryan for pointing out the error as well as rectifying that. I will also post that to the main section.
I think I skipped case sensitivity because this post was preceded by the “Regular Expressions in Vb” post, where I first produced this pattern, and in VB setting the “IgnoreCase” property to True by itself makes the pattern case-insensitive.
Comment by Jalaj — March 14, 2008 @ 10:20 am