It’s exciting to be blogging again after quite a few months (actually almost a couple of years, if you consider that my last blog post was short and wasn’t really followed up with more until this one). More exciting still is to be blogging on Imbibe’s official blog (rather than mine). Anyway, the topic of the first technical post on Imbibe’s blog is something that cuts across programming languages: regular expressions (popularly called regex).
Recently I needed to solve an exciting problem: limiting the total length of a string whose format was validated by a regex. One easy solution would have been to validate the string with the intended regex and then check the total length separately using conditional if/else logic. However, I really wanted to validate the string length in the regex too.
Let’s take an example. Suppose we have a requirement to ensure that a string starts with one or more English letters and contains only digits thereafter. The following regex does the job (I am using JavaScript as the reference language here, but you can easily adapt the regex for your programming language):
[code language=”javascript”]
/^[a-z]+[0-9]*$/i
[/code]
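A quick way to check this pattern is to run it against a few strings. The sample inputs below are my own, picked only for illustration:

```javascript
// One or more letters first, then zero or more digits, case-insensitive.
const basePattern = /^[a-z]+[0-9]*$/i;

// Hypothetical sample inputs, not taken from any real data set.
const samples = ["abc123", "ABC", "123abc", "ab12cd", ""];

const results = samples.map((s) => basePattern.test(s));
console.log(results); // [ true, true, false, false, false ]
```

Note that the pattern says nothing about length yet: a single letter passes, and so does a string of a hundred letters.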
Now comes the interesting part. What if you want to limit the total length of the string validated by this regex to between 5 and 10 characters? A quick solution that comes to mind is to use quantifiers, for example:
[code language=”javascript”]
/^([a-z]+[0-9]*){5,10}$/i
[/code]
However, this won’t work. Ideally the string “aa111” should have been matched by this expression (letters followed by digits, with a total length of 5), but it isn’t. What we have actually done is put the quantifier around the whole expression, which means the expression as a unit needs to repeat 5 to 10 times. So “a1a1a1a1a1” matches (here we have a letter followed by a digit, 5 times).
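The behaviour is easy to confirm with a few test strings. This is only a sketch, and the last input is an extra one of my own, added to show that the total length is not capped either:

```javascript
// The quantifier applies to the whole group, so it counts repetitions
// of "letters then digits", not individual characters.
const repeatedGroup = /^([a-z]+[0-9]*){5,10}$/i;

console.log(repeatedGroup.test("aa111"));        // false: at most two repetitions fit
console.log(repeatedGroup.test("a1a1a1a1a1"));   // true: the group repeats 5 times
console.log(repeatedGroup.test("aaaaaaaaaaaa")); // true: 12 letters split into 5 groups
```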
Putting a quantifier separately on the letter part or on the digit part won’t work either, because the requirement does not restrict their counts individually; only the overall string length has to be between 5 and 10. A quantifier acts on one pattern within the regex and controls how many times that pattern repeats. It cannot count characters across different parts of the regex. This means quantifiers alone are not the solution to the problem. We need something else: the positive lookahead assertion.
Essentially, a positive lookahead assertion puts a restriction on the text that follows the position where the assertion appears. The assertion itself is not part of the match; it only restricts the match. For example, the following assertion:
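To see why per-part quantifiers fall short, here is one hypothetical attempt that caps each part at 5 characters. The limits are my own choice for illustration; any other split runs into the same kind of problem:

```javascript
// Hypothetical attempt: 1 to 5 letters, then 0 to 5 digits.
const perPart = /^[a-z]{1,5}[0-9]{0,5}$/i;

console.log(perPart.test("abc123"));     // true: length 6, as wanted
console.log(perPart.test("a1"));         // true, although the total length is only 2
console.log(perPart.test("abcdefg123")); // false, although the total length is 10
```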
[code language=”javascript”]
/(?=ab)[a-z]+/
[/code]
enforces that “ab” must occur right at the position where the assertion is evaluated, so the text matched by the pattern following it has to begin with “ab”. So “acbd” fails but “abcd” is successfully matched by this regex. (?=ab) is the positive lookahead assertion in this case, while the pattern following it ([a-z]+) is what it constrains. More on assertions can be found here.
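A small sketch makes the “not part of the match” point concrete. The third input is my own addition; it shows that, with no anchor in the regex, the engine simply moves forward until the lookahead succeeds:

```javascript
const startsWithAb = /(?=ab)[a-z]+/;

console.log(startsWithAb.test("acbd"));     // false: "ab" never appears
console.log("abcd".match(startsWithAb)[0]); // abcd: the lookahead consumed nothing
console.log("cabd".match(startsWithAb)[0]); // abd: the match starts where "ab" begins
```

Because the lookahead consumes no characters, [a-z]+ still gets to match the “ab” that the assertion has just checked.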
Coming back to our problem of enforcing the total string length using regex, the following regex does the task:
[code language=”javascript”]
/^(?=.{5,10}$)([a-z]+[0-9]*)$/i
[/code]
Here the interesting part is the assertion “(?=.{5,10}$)”. To begin with, we placed parentheses around our complete original regex “[a-z]+[0-9]*” so it becomes a single capturing group (the group is optional here; the regex behaves the same without it). Then we placed the assertion in front of the group, right after “^”, so it is evaluated at the start of the string and looks ahead over the whole string.
As for the assertion itself, “.” matches any character (other than a line break). The quantifier “{5,10}” requires between 5 and 10 such characters, and the “$” in the assertion requires the string to end right after them (i.e. the string must end within a count of 10). Omitting the “$” would allow the total length to exceed 10, because the assertion would then only check that at least 5 characters are present.
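Here is a minimal sketch that exercises the final regex and also shows what goes wrong without the “$”. The function name isValidCode and the sample inputs are hypothetical, chosen only for this illustration:

```javascript
// Final regex: the lookahead checks the total length, the rest checks the format.
const bounded = /^(?=.{5,10}$)([a-z]+[0-9]*)$/i;

// Same regex without "$" inside the lookahead, for comparison only.
const unbounded = /^(?=.{5,10})([a-z]+[0-9]*)$/i;

// Hypothetical helper wrapping the final regex.
function isValidCode(value) {
  return bounded.test(value);
}

console.log(isValidCode("aa111"));       // true: valid format, length 5
console.log(isValidCode("a111"));        // false: valid format, but length 4
console.log(isValidCode("abcde12345"));  // true: valid format, length 10
console.log(isValidCode("abcde123456")); // false: valid format, but length 11
console.log(isValidCode("a1a1a1"));      // false: length 6, but invalid format

// Without "$" the lookahead only demands at least 5 characters,
// so the upper limit is lost.
console.log(unbounded.test("abcde123456")); // true
```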
It was pretty interesting to use the positive lookahead assertion to solve the total length problem. Before actually attempting it, I hadn’t thought assertions could be used for such cases. I now also use assertions to implement more complex validation restrictions on strings via regexes.