Dataford - Ace your Data Interview

The regexp_split_to_table() function (yes that's a long one!) breaks a text into separate rows based on a pattern. For example, it can take a string like "apple, banana" and split it into two rows: one with 'apple' and another with 'banana'.

This can be super useful when dealing with text data to analyze the most used words in comments or posts in social media. And with that said, let's go through some examples:

Imagine you work at Twitter (just imagine) because Elon Musk will probably not let you in. And your goal is to find the most frequently used hashtags.

For that, you will have access to the tweets table that stores tweets, each containing multiple hashtags.

tweet_id	content
1	#happy #fun #exciting
2	#fun #adventure #travel
3	#happy #travel

To split the hashtags from the content column into individual rows, you will need to use the regexp_split_to_table() function! And specify the pattern which is basically the delimiter here. As you see the delimiter between these hashtags is space.

SELECT 
    regexp_split_to_table(content, ' ') AS hashtags
FROM tweets

Now, that we have all our hashtags, you can simply use the above query as a subquery and count the number of times each hashtags occured!

WITH my_hashtags AS (
    SELECT 
        regexp_split_to_table(content, ' ') AS hashtags
    FROM tweets)

SELECT
    hashtags,
    COUNT(*) AS frequency
FROM my_hashtags
GROUP BY 1;

The syntax of this function has 2 parameters, the column name and the pattern:

regexp_split_to_table(column_name, pattern)

And trust me, I could do a whole course of 2 weeks regarding patterns, because there are SO MANY! You don't need to know them by heart, but you can always know they exist for more advanced pattern:

Regex Pattern	Explanation
`abc`	Matches the exact sequence 'abc'.
`[abc]`	Matches any one of 'a', 'b', or 'c'.
`[^abc]`	Matches any character except 'a', 'b', or 'c'.
`.`	Matches any single character (except newline).
`^abc`	Matches 'abc' at the start of a string.
`abc$`	Matches 'abc' at the end of a string.
`ab*`	Matches 'a' followed by zero or more 'b's.
`ab+`	Matches 'a' followed by one or more 'b's.

If you look at the last row, it will match any whitespace characters (space, tab), so we could have use it for that specific use case:

SELECT 
    regexp_split_to_table(content, '\s') AS hashtags
FROM tweets

And it can be even be more powerful than that, let's say some users made 1 or more whitespace between their hashtags, you could add a "+" to your pattern which will basically say 1 or more spaces:

SELECT 
    regexp_split_to_table(content, '\s+') AS hashtags
FROM tweets

You don't need to know them all by heart, but some of the one I mentioned above could always be useful!

It was a long ride, but we will practice these concepts so that you feel more prepared!

Also, take some time to read it again if you need to! 💪

Modules