I recently flubbed a code screening for a Data Engineering job. It's been a hot minute since I've done a screening, so I was a bit rusty with my test taking skills and did not effectively manage my time. I'm particularly frustrated with myself, because the question I ended up wasting a lot of time on is a type of question I definitely know how to do. I just allowed myself to get confused.
As an act of penance/self-flagellation, I'll discuss how to solve that type of problem in this post. Obviously I will not divulge the company I was screening for or try to brain dump the question itself. However, a similar question is posed in an Intuit SQL Interview Question that can be found on DataLemur
This is a classic SQL problem, dealing with identifying contiguous sequences of data in a data set. This is also commonly referred to as an island problem. In the case of this interview question from Intuit, we are asked to find users that used TurboTax software to file their taxes for 3+ consecutive years.
The official solution provided by DataLemur uses LAG and LEAD to identify rows that are 1 year apart from the previous filing event year and the following filing event year, thus identifying consecutive years. While this does work for this problem, I think a more intuitive and flexible solution in this case involves using ROW_NUMBER() to find and aggregate groups of consecutive events.
The crux of this analysis is to have the ROW_NUMBER() column for the partition we are trying to identify streaks for, and then a column identifying the event time or overall sequence of events. The latter may already be present, or we may have to calculate/generate our own sequence identifying row. When we then find the difference between these two columns, streaks will become clear due to the difference being the same within a streak. This is easier to visualize in the SQL output, so let's get started.
The Intuit interview question uses one table, filed_taxes, with the following schema
| Column Name | Type |
|---|---|
| filing_id | integer |
| user_id | varchar |
| filing_date | datetime |
| product | varchar |
Here's some sample data as given by the problem statement.
| filing_id | user_id | filing_date | product |
|---|---|---|---|
| 1 | 1 | 4/14/2019 | TurboTax Desktop 2019 |
| 2 | 1 | 4/15/2020 | TurboTax Deluxe |
| 3 | 1 | 4/15/2021 | TurboTax Online |
| 4 | 2 | 4/07/2020 | TurboTax Online |
| 5 | 2 | 4/10/2021 | TurboTax Online |
| 6 | 3 | 4/07/2020 | TurboTax Online |
| 7 | 3 | 4/15/2021 | TurboTax Online |
| 8 | 3 | 3/11/2022 | QuickBooks Desktop Pro |
| 9 | 4 | 4/15/2022 | QuickBooks Online |
And we are asked the following:
Write a query that identifies the user IDs of individuals who have filed their taxes using any version of TurboTax for three or more consecutive years. Each user is allowed to file taxes once a year using a specific product. Display the output in the ascending order of user IDs.
Let's start with the following query.
SELECT
user_id,
EXTRACT(YEAR FROM filing_date) as filing_year,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER by EXTRACT(YEAR FROM filing_date)) as rn,
product
FROM filed_taxes
WHERE product LIKE '%TurboTax%'
Let's break down what this does. We need to find instances of a user using TurboTax software. Making an assumption that TurboTax products will just have "TurboTax" somewhere in the name, we utilize the simple LIKE '%TurboTax%' clause. We're asked to find consecutive years, so we can just extract the year from the filing_date column. I'm using PostgreSQL in this example. If you were using SQL Server, you can use DATEPART. The problem asks us to find consecutive filing years per user, hence we partition our ROW_NUMBER() by user_id in order of the filing year. Our output looks something like this:
| user_id | filing_year | rn | product |
|---|---|---|---|
| 1 | 2019 | 1 | TurboTax Desktop 2019 |
| 1 | 2020 | 2 | TurboTax Deluxe |
| 1 | 2021 | 3 | TurboTax Online |
| 2 | 2020 | 1 | TurboTax Online |
| 2 | 2021 | 2 | TurboTax Online |
| 3 | 2020 | 1 | TurboTax Online |
| 3 | 2021 | 2 | TurboTax Online |
The filing_year column we created will act as our sequential row identifier for each user. This gives us the two elements we need to perform our streak analysis. Let's focus on user_id 1. Through visual inspection, we can see this user used TurboTax software for three consecutive years. Let's see what happens when we take the difference between filing_year and our row number column.
2019 - 1 = 2018
2020 - 2 = 2018
2021 - 3 = 2018
Because the two columns increment at the same rate, the difference is always the same. Say for example, user_id 1 had not used TurboTax again until 2022. The differences would be:
2019 - 1 = 2018
2020 - 2 = 2018
2022 - 3 = 2019
The streak would end, signified by a new difference on the third row. Why does this help us? This allows us to GROUP BY our calculated row, and rows within a streak will be aggregated together. We'll do precisely this in our SQL query.
WITH filing_cte AS
(
SELECT
filing_id,
user_id,
EXTRACT(YEAR FROM filing_date) as filing_year,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER by EXTRACT(YEAR FROM filing_date)) as rn,
product
FROM filed_taxes
WHERE product LIKE '%TurboTax%'
)
SELECT
user_id,
filing_year,
rn,
filing_year - rn as diff
FROM filing_cte
| user_id | filing_year | rn | diff |
|---|---|---|---|
| 1 | 2019 | 1 | 2018 |
| 1 | 2020 | 2 | 2018 |
| 1 | 2021 | 3 | 2018 |
| 2 | 2020 | 1 | 2019 |
| 2 | 2021 | 2 | 2019 |
| 6 | 2016 | 1 | 2015 |
| 6 | 2017 | 2 | 2015 |
| 6 | 2018 | 3 | 2015 |
| 6 | 2020 | 4 | 2016 |
| 6 | 2021 | 5 | 2016 |
| 6 | 2022 | 6 | 2016 |
The islands are clearly identifiable from the matching diffs per user. I've included the output of user_id 6, which shows two distinct streaks (with a gap in 2019). Now, we just have to aggregate this all on the user_id and difference to find the solution (users that have at one point used TurboTax for 3+ consecutive years)
WITH filing_cte AS
(
SELECT
filing_id,
user_id,
EXTRACT(YEAR FROM filing_date) as filing_year,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER by EXTRACT(YEAR FROM filing_date)) as rn,
product
FROM filed_taxes
WHERE product LIKE '%TurboTax%'
)
SELECT
DISTINCT user_id
FROM filing_cte
GROUP BY user_id, (filing_year - rn)
HAVING COUNT(*) >= 3
This will solve the problem as given. With some minor updates, we can find additional stats, such as the number of streaks and longest streak per user.
WITH filing_cte AS
(
SELECT
filing_id,
user_id,
EXTRACT(YEAR FROM filing_date) as filing_year,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER by EXTRACT(YEAR FROM filing_date)) as rn,
product
FROM filed_taxes
WHERE product LIKE '%TurboTax%'
),
diff_cte AS
(
SELECT
user_id,
COUNT(*) as consecutive_years
FROM filing_cte
GROUP BY user_id, (filing_year - rn)
)
SELECT
user_id,
COUNT(*) AS streak_count,
MAX(consecutive_years) AS longest_streak
FROM diff_cte
GROUP BY user_id
ORDER BY user_id
| user_id | streak_count | longest_streak |
|---|---|---|
| 1 | 1 | 3 |
| 2 | 1 | 2 |
| 3 | 1 | 2 |
| 5 | 2 | 3 |
| 6 | 2 | 3 |
Anyhow, this solution is just one approach for this particular type of islands problem. I'm far from an expert on this subject matter. I'll be studying up on this kind of problem, since it has clearly emerged as a weakness of mine for interviews/screenings. Itzik Ben-Gan has some very comprehensive write-ups on this kind of issue, so I'll be checking out some of his work.
Until next time - Div