The regexp_split_to_table() function (yes that's a long one!) breaks a text into separate rows based on a pattern. For example, it can take a string like "apple, banana" and split it into two rows: one with 'apple' and another with 'banana'.
This can be super useful when dealing with text data to analyze the most used words in comments or posts in social media. And with that said, let's go through some examples:
Imagine you work at Twitter (just imagine) because Elon Musk will probably not let you in. And your goal is to find the most frequently used hashtags.
For that, you will have access to the tweets table that stores tweets, each containing multiple hashtags.
| tweet_id | content |
|---|---|
| 1 | #happy #fun #exciting |
| 2 | #fun #adventure #travel |
| 3 | #happy #travel |
To split the hashtags from the content column into individual rows, you will need to use the regexp_split_to_table() function! And specify the pattern which is basically the delimiter here. As you see the delimiter between these hashtags is space.
SELECT
regexp_split_to_table(content, ' ') AS hashtags
FROM tweets
Now, that we have all our hashtags, you can simply use the above query as a subquery and count the number of times each hashtags occured!
WITH my_hashtags AS (
SELECT
regexp_split_to_table(content, ' ') AS hashtags
FROM tweets)
SELECT
hashtags,
COUNT(*) AS frequency
FROM my_hashtags
GROUP BY 1;
The syntax of this function has 2 parameters, the column name and the pattern:
regexp_split_to_table(column_name, pattern)
And trust me, I could do a whole course of 2 weeks regarding patterns, because there are SO MANY! You don't need to know them by heart, but you can always know they exist for more advanced pattern:
| Regex Pattern | Explanation |
|---|---|
abc | Matches the exact sequence 'abc'. |
[abc] | Matches any one of 'a', 'b', or 'c'. |
[^abc] | Matches any character except 'a', 'b', or 'c'. |
. | Matches any single character (except newline). |
^abc | Matches 'abc' at the start of a string. |
abc$ | Matches 'abc' at the end of a string. |
ab* | Matches 'a' followed by zero or more 'b's. |
ab+ | Matches 'a' followed by one or more 'b's. |
If you look at the last row, it will match any whitespace characters (space, tab), so we could have use it for that specific use case:
SELECT
regexp_split_to_table(content, '\s') AS hashtags
FROM tweets
And it can be even be more powerful than that, let's say some users made 1 or more whitespace between their hashtags, you could add a "+" to your pattern which will basically say 1 or more spaces:
SELECT
regexp_split_to_table(content, '\s+') AS hashtags
FROM tweets
You don't need to know them all by heart, but some of the one I mentioned above could always be useful!

It was a long ride, but we will practice these concepts so that you feel more prepared!
Also, take some time to read it again if you need to! 💪
ab? | Matches 'a' followed by zero or one 'b'. |
ab{2} | Matches 'a' followed by exactly 2 'b's. |
\. | Matches a literal dot '.'. |
\s | Matches any whitespace character (space, tab...) |