Div's Blog

I recently flubbed a code screening for a Data Engineering job. It's been a hot minute since I've done a screening, so I was a bit rusty with my test taking skills and did not effectively manage my time. I'm particularly frustrated with myself, because the question I ended up wasting a lot of time on is a type of question I definitely know how to do. I just allowed myself to get confused.

As an act of penance/self-flagellation, I'll discuss how to solve that type of problem in this post. Obviously I will not divulge the company I was screening for or try to brain dump the question itself. However, a similar question is posed in an Intuit SQL Interview Question that can be found on DataLemur

This is a classic SQL problem, dealing with identifying contiguous sequences of data in a data set. This is also commonly referred to as an island problem. In the case of this interview question from Intuit, we are asked to find users that used TurboTax software to file their taxes for 3+ consecutive years.

The official solution provided by DataLemur uses LAG and LEAD to identify rows that are 1 year apart from the previous filing event year and the following filing event year, thus identifying consecutive years. While this does work for this problem, I think a more intuitive and flexible solution in this case involves using ROW_NUMBER() to find and aggregate groups of consecutive events.

The crux of this analysis is to have the ROW_NUMBER() column for the partition we are trying to identify streaks for, and then a column identifying the event time or overall sequence of events. The latter may already be present, or we may have to calculate/generate our own sequence identifying row. When we then find the difference between these two columns, streaks will become clear due to the difference being the same within a streak. This is easier to visualize in the SQL output, so let's get started.

The Intuit interview question uses one table, filed_taxes, with the following schema

Column Name	Type
filing_id	integer
user_id	varchar
filing_date	datetime
product	varchar

Here's some sample data as given by the problem statement.

filing_id	user_id	filing_date	product
1	1	4/14/2019	TurboTax Desktop 2019
2	1	4/15/2020	TurboTax Deluxe
3	1	4/15/2021	TurboTax Online
4	2	4/07/2020	TurboTax Online
5	2	4/10/2021	TurboTax Online
6	3	4/07/2020	TurboTax Online
7	3	4/15/2021	TurboTax Online
8	3	3/11/2022	QuickBooks Desktop Pro
9	4	4/15/2022	QuickBooks Online

And we are asked the following:

Write a query that identifies the user IDs of individuals who have filed their taxes using any version of TurboTax for three or more consecutive years. Each user is allowed to file taxes once a year using a specific product. Display the output in the ascending order of user IDs.

Let's start with the following query.

SELECT 
    user_id,
    EXTRACT(YEAR FROM filing_date) as filing_year,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER by EXTRACT(YEAR FROM filing_date)) as rn,
 	product
FROM filed_taxes
WHERE product LIKE '%TurboTax%'

Let's break down what this does. We need to find instances of a user using TurboTax software. Making an assumption that TurboTax products will just have "TurboTax" somewhere in the name, we utilize the simple LIKE '%TurboTax%' clause. We're asked to find consecutive years, so we can just extract the year from the filing_date column. I'm using PostgreSQL in this example. If you were using SQL Server, you can use DATEPART. The problem asks us to find consecutive filing years per user, hence we partition our ROW_NUMBER() by user_id in order of the filing year. Our output looks something like this:

user_id	filing_year	rn	product
1	2019	1	TurboTax Desktop 2019
1	2020	2	TurboTax Deluxe
1	2021	3	TurboTax Online
2	2020	1	TurboTax Online
2	2021	2	TurboTax Online
3	2020	1	TurboTax Online
3	2021	2	TurboTax Online

The filing_year column we created will act as our sequential row identifier for each user. This gives us the two elements we need to perform our streak analysis. Let's focus on user_id 1. Through visual inspection, we can see this user used TurboTax software for three consecutive years. Let's see what happens when we take the difference between filing_year and our row number column.

2019 - 1 = 2018
2020 - 2 = 2018
2021 - 3 = 2018

Because the two columns increment at the same rate, the difference is always the same. Say for example, user_id 1 had not used TurboTax again until 2022. The differences would be:

2019 - 1 = 2018
2020 - 2 = 2018
2022 - 3 = 2019

The streak would end, signified by a new difference on the third row. Why does this help us? This allows us to GROUP BY our calculated row, and rows within a streak will be aggregated together. We'll do precisely this in our SQL query.

WITH filing_cte AS 
(
  SELECT 
    filing_id,
    user_id,
    EXTRACT(YEAR FROM filing_date) as filing_year,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER by EXTRACT(YEAR FROM filing_date)) as rn,
    product
  FROM filed_taxes
  WHERE product LIKE '%TurboTax%'
)
SELECT
    user_id,
    filing_year,
    rn,
    filing_year - rn as diff
FROM filing_cte

user_id	filing_year	rn	diff
1	2019	1	2018
1	2020	2	2018
1	2021	3	2018
2	2020	1	2019
2	2021	2	2019
6	2016	1	2015
6	2017	2	2015
6	2018	3	2015
6	2020	4	2016
6	2021	5	2016
6	2022	6	2016

The islands are clearly identifiable from the matching diffs per user. I've included the output of user_id 6, which shows two distinct streaks (with a gap in 2019). Now, we just have to aggregate this all on the user_id and difference to find the solution (users that have at one point used TurboTax for 3+ consecutive years)

WITH filing_cte AS 
(
  SELECT 
    filing_id,
    user_id,
    EXTRACT(YEAR FROM filing_date) as filing_year,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER by EXTRACT(YEAR FROM filing_date)) as rn,
    product
  FROM filed_taxes
  WHERE product LIKE '%TurboTax%'
)
SELECT 
  DISTINCT user_id
FROM filing_cte
GROUP BY user_id, (filing_year - rn)
HAVING COUNT(*) >= 3

This will solve the problem as given. With some minor updates, we can find additional stats, such as the number of streaks and longest streak per user.

WITH filing_cte AS 
(
  SELECT 
    filing_id,
    user_id,
    EXTRACT(YEAR FROM filing_date) as filing_year,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER by EXTRACT(YEAR FROM filing_date)) as rn,
    product
  FROM filed_taxes
  WHERE product LIKE '%TurboTax%'
),
diff_cte AS
(
  SELECT 
    user_id,
    COUNT(*) as consecutive_years
  FROM filing_cte
  GROUP BY user_id, (filing_year - rn)
)
SELECT
  user_id,
  COUNT(*) AS streak_count,
  MAX(consecutive_years) AS longest_streak
FROM diff_cte
GROUP BY user_id
ORDER BY user_id

user_id	streak_count	longest_streak
1	1	3
2	1	2
3	1	2
5	2	3
6	2	3

Anyhow, this solution is just one approach for this particular type of islands problem. I'm far from an expert on this subject matter. I'll be studying up on this kind of problem, since it has clearly emerged as a weakness of mine for interviews/screenings. Itzik Ben-Gan has some very comprehensive write-ups on this kind of issue, so I'll be checking out some of his work.

Until next time - Div

Solving a basic SQL islands problem with ROW_NUMBER()