Utilizing REGEX in Tableau

 

During the Tableau Conference in Las Vegas a few weeks ago, my fellow coworkers and I have learned some exciting tips and tricks in Tableau and how to answer some common issues we have been trying to solve. While all of the seminars were useful and handy, one particular topic stood out to me as lifesaver for those twitter feed-analyzing enthusiasts. 

ISSUE

You have a field that gives a string of text such as a url or email, and you want to retrieve part of that string to analyze/visualize. For example, you would like to retrieve names from email addresses. 

ANSWER

Regex Expressions! These special text strings work phenomenally towards matching, replacing, and extracting particular phrases from a string. In this article I will  give a basic introduction to regex expressions and an example utilizing the regex functions available in Tableau.

INTRO TO REGEX: THE ALL-POWERFUL PARSING TOOL

To begin, lets start with a simple example. Let us say we are looking at, say, data concerning shootings that have happened in America from 2013 to 2015. There are very interesting fields in this data set, particularly a list of articles that have reported these shootings across America:

The problem is, we can't really use this data because these links are very complicated and really have no use to us in their current form. If we could extract, however, the domain name of the actual website and utilize that for our visualizations, we'd be in business! This is where a regex expression comes in handy. 

To utilize regex, one needs to create a calculated field. There are three functions you can choose from to utilize regular expressions. These functions are shown below:

 

  • REGEXP_EXTRACT: This function lets one extract a particular pattern from a string variable.
  • REGEXP_MATCH: This function is used for finding a repeated pattern in a string and returning a boolean.
  • REGEXP_REPLACE: Replace a pattern with a set of characters. 

 

Before I go through retrieving the website name from the links within this field, let me go over some of the basic syntax of regex. Let us say we have the word cat that we want to extract from a string field.

  • (cat) will extract the word cat and drop everything else in the string
  • ^cat  will extract only cat if its at the beginning of the string
  • cat$ will extract cat only if its at the end of the string
  • ^cat$ will only extract cat if it is alone in a string
  • [cat] will only extract the specific characters c,a,t from a string
  • [c-t] extracts any characters from c to t
  • [^c-t] will extract any character except those between c to t
  • [cat]+ will give back one or more characters that are c, a, or t
  • (cat|dog) will extract the word cat OR the word dog
  • (cat){2,4} extract cat when it is repeated 2 to 4 times
  • c*t extracts any pattern that starts with c and ends with t 
  • ca?t extracts cat but will also extract ct
  • ca\st will extract ca t
  • cat\d{3} will extract cat012, cat111, cat356, or any pattern with cat and 3 digits afterwords.
  • cat\w{3} is the same as above but with character values instead of numbers.
  • cat. will match any character one time after cat. (i.e. catt, cata, cats,..)

NOW BACK TO OUR EXAMPLE

What we want from the urls is the actual website that posts these articles. I am interested in seeing with website posts the most about the shootings in America. What website is most interested in enlightening the public about the immense issues of gun control in our country?

The expression we want to use is the following:

 

 

Within the function REGEXP_EXTRACT, I give the function the field (in this case Article1), and the expression 'http://(\w.*.com)/'. I will break this down step by step.

  • We first need to express we want to leave out the beginning of the url. We type this out WITHOUT any parentheses. This tells regex to drop this part of the string.
  • We want basically what is in-between http:// and the next forward slash. We are not sure what and how many characters we need to retrieve, but we do know the next delimiter and what is at the end of the expression (/ is the next delimiter). Thus we utilize \w (referring to an arbitrary character) and then use the expression .* afterwards, meaning any character 0 or more times. 
  • Putting .com at the end of the expression in parentheses tells regex oh hey, we want what ever is in the middle of the first value after http:// and ends with .com. 
  • Placing a / outside the parentheses lets the function know to drop the forward slash afterwards..just in case. 

AND Voila! We have our websites:

Lets make a dashboard out of this, just for funsies:

Click on a particular location and you will be able to see the number of victims from the shootings over time and also the top websites reporting shootings in that area. One interesting observation  is summer seems like a popular time for shootings to occur. Why would that be? 

Anyways, check back to our blog in December for another visually enlightening post. Hope everyone has a great weekend!

AUTHOR AND VISUALIZATIONS BY ERIN MIDDLEMAS