I recently had to parse a large dataset: millions of lines. When you have to parse at this scale, regex should be the word buzzing around in your head.
Regex, or regular expressions, are a tool available in most programming languages, in Google Analytics, and (I would guess) in Microsoft Excel. A regex lets you declare general rules for what a string should or should not look like.
How a String Should or Should Not Be
Let’s take an example here:
So here we have a string that has some (not entirely accurate) information about Saied Abbasi. The string includes 5 space-separated ‘words’, or smaller strings within it. Let’s pretend the words correspond to these values:
SA0228 = Saied Abbasi’s initials and his birthday of February 28th
PIS = the zodiac sign Pisces
R1 = Ruby 1 year experience
WP4 = WordPress 4 years experience
Imagine you have millions of records like this for different individuals, and you want each individual’s initials and the number of years of Ruby experience they have.
You would need to write some logic to move through each complete string. Then you would need the internal ‘words’ in an array. So, taking a single line from our example, we split it into an array:
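As a minimal sketch of that step: the exact record is not reproduced here, so the string below (including its JS1 entry) is a hypothetical stand-in, and the variable names line and words are assumptions.

```ruby
# Hypothetical record; the exact string from the example is assumed.
line = "SA0228 PIS R1 JS1 WP4"

# String#split with no arguments splits on runs of whitespace.
words = line.split
# words is now ["SA0228", "PIS", "R1", "JS1", "WP4"]
```

For a file with millions of lines, the same split could run once per line inside whatever loop reads the file.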
If we only care about an individual’s initials, we could combine Ruby’s gsub method and some regex, a powerful combination!
So if our dataset is kind enough to always include initials followed by a four-digit birthday, and we only care about retrieving the initials, we could use some code like this, where the variable words is an array of strings:
Here we take our array of words and remove the first word, ‘SA0228’. On the variable word we use gsub with a regular expression. The expression says: if the string includes any character other than a capital letter from A to Z, replace it with “”, an empty string, which removes it.
This leaves us with just SA which we set equal to a variable initials.
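A sketch of what that might look like, assuming the first word is removed with shift and that the sample array below stands in for real data (both are illustrative choices):

```ruby
# Hypothetical sample data standing in for one parsed record.
words = ["SA0228", "PIS", "R1", "JS1", "WP4"]

# Array#shift removes the first element and returns it.
word = words.shift

# [^A-Z] matches any character that is NOT a capital letter;
# gsub replaces every such character with an empty string.
initials = word.gsub(/[^A-Z]/, "")
# initials is "SA"
```

Note that gsub returns a new string, so word itself still holds "SA0228" afterwards.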
When I write regex I give it a quick test over at Rubular.
Now we want to move through the remaining words and find any that match the letter R followed by one or two digits for years of experience. We could make an explicit little method like this:
Here we do a loop using the each method. This will let us cycle through each word in the words array one at a time. Each time through we take an individual word and we check it against the regular expression R\d.
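One possible shape for such a method is sketched below. The name find_ruby_word and the sample array are hypothetical, and this version only locates the matching word rather than converting it to a number.

```ruby
# Hypothetical helper: returns the first word containing R followed by a digit,
# or nil when no word matches.
def find_ruby_word(words)
  words.each do |word|
    # =~ returns the index of the match (truthy) or nil (falsy).
    return word if word =~ /R\d/
  end
  nil
end

words = ["PIS", "R1", "JS1", "WP4"]
ruby_word = find_ruby_word(words)
# ruby_word is "R1"
```

Because the pattern is not anchored, a word such as "AR1" would also match; that is fine for this dataset but worth keeping in mind for messier input.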
Breaking down =~
When we evaluate the string “R1” against =~ /(R\d)/, the return value is 0. This is the index at which the match begins, in this case the capital R at index 0. If we ran “AR1” =~ /R\d/, the return value would be 1, corresponding to the R at index 1.
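These two return values are easy to confirm in irb. The variable names below are only for illustration:

```ruby
# =~ returns the index where the match starts.
at_start  = "R1"  =~ /(R\d)/
at_offset = "AR1" =~ /R\d/
# at_start is 0, at_offset is 1
```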
For clarification, our regex checks for a capital R followed by any digit from 0 to 9 (\d is shorthand for any digit). One more piece of syntax is worth highlighting: when we use =~ we wrap our regex in / slashes, one at the beginning and one at the end. The parentheses form a capture group and are optional here; it is placing \d directly after R in the pattern that requires the R to be immediately followed by a digit. A string like RP1 would therefore return nil rather than an integer, as the regex would not match.
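To see what the parentheses actually contribute, here is a small sketch. In Ruby they create a capture group, while the requirement that a digit directly follows R comes from the order of the pattern itself. The stricter anchored pattern at the end is a hypothetical variant, not part of the approach described above.

```ruby
# Without parentheses the pattern still requires a digit right after R.
no_match = "RP1" =~ /R\d/
# no_match is nil, because P follows the R

# The parentheses capture the matched text so it can be read back.
captured = "R1"[/(R\d)/, 1]
# captured is "R1"

# Hypothetical stricter variant: the whole word must be R plus one or two digits.
strict = /\AR(\d{1,2})\z/
years    = "R12"[strict, 1].to_i
rejected = "AR1"[strict, 1]
# years is 12, rejected is nil
```

Capturing only the digits, as the stricter pattern does, is one way to get at the number without touching the leading letter.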
Another thing to highlight: if we ran “JS1” =~ /(R\d)/, the return would be nil. In Ruby the only ‘falsy’ values are false and nil, so 0 is a ‘truthy’ value. This is really handy because we can use our =~ regex statements in conditional clauses, like we do above. Pretty cool! For lots of information on true and false values in Ruby and other programming languages check this out.
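A quick sketch of why this matters in practice: a match at index 0 still counts as true in a condition, while a failed match returns nil. The sample words and variable names are illustrative.

```ruby
index   = "R1"  =~ /(R\d)/
missing = "JS1" =~ /(R\d)/
# index is 0, missing is nil

# 0 is truthy in Ruby, so a match at the very start still passes the test.
label_hit  = index   ? "match" : "no match"
label_miss = missing ? "match" : "no match"

# The same truthiness lets =~ drive a filter directly.
words = ["PIS", "R1", "JS1", "WP4"]
ruby_words = words.select { |word| word =~ /R\d/ }
# ruby_words is ["R1"]
```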
I use word[0] = ‘0’ as a benchmark-tested way of quickly replacing the character at index 0 with ‘0’. This then lets me run .to_i on the result to turn our string of digits into an integer.
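The replace-then-convert trick might look like the sketch below. The helper name years_from is hypothetical, and the dup call is an added precaution so the caller's string is left unchanged; the approach described above may well mutate the word in place.

```ruby
# Hypothetical helper: turns a word like "R1" into the integer 1.
def years_from(word)
  digits = word.dup   # copy, so the original string is not modified
  digits[0] = "0"     # overwrite the leading letter: "R1" becomes "01"
  digits.to_i         # String#to_i parses base 10, so "01" gives 1
end

ruby_years = years_from("R1")
# ruby_years is 1
```

The leading zero is harmless because to_i always reads the string as a decimal number.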
From here we have the initials and the years of experience in Ruby. You can imagine how powerful regex and the =~ operator can be when tackling a huge dataset.