dbt-flow: unit testing ELT transformations

Let’s discuss some techniques to improve ELT jobs built on top of dbt.

Will Fuks
17 min read · Sep 25, 2023

A while ago a former boss of mine invited me to work on a data engineering project: building a set of indicators for the Carrefour company.

The project was to be delivered in two months; I thought to myself that it would be easy! But then he also said: “…they are looking for an implementation that reaches SOURCE OF TRUTH status within the company”.

Little did I know that those simple words would lead to a spectacular challenge that took us almost an entire year to solve.

That’s what we’ll be discussing today: how we leveraged dbt and integrated some approaches to add unit-testing functionality to the tool, which helped us deliver what the company desired.

A Beast Of Its Own Kind

When we learn about dbt and read through its documentation, we study the famous Jaffle-Shop example, which contains some transactions, customers and payment data.

Transformations are simple, direct and easy to digest.

But real life is real life, and it’s usually not as simple as we’d wish it to be. This is certainly the case for Carrefour: the company is huge and complex, orchestrating hundreds of thousands of employees every single day to keep its operations going.

The natural effect is that data transformations tend not to be so simple and direct anymore; we had to implement hundreds of business rules, and some queries were reaching thousands of lines, all of it processing hundreds of gigabytes daily.

As we worked through the challenge, problems emerged. Some of our data was not yielding the expected results, which begged the question “who is right?”. The answer actually is “it depends”: it depends on which business rules were considered, in which context, and which source the data comes from.

So we’d investigate, debug every feasible corner of our queries and eventually find mistakes, errors, missing transformations. But after fixing those we were still left wondering “…have we just created more bugs elsewhere?!”

It was getting quite tough: the more complex the queries got, the harder it became to make changes while still maintaining quality control.

We researched some ideas to mitigate those issues but couldn’t find a tool that would precisely solve our problem. Eventually, we decided to implement a simple and straightforward testing framework to add a new layer of safety and improve our development workflow.

SQL unit-testing

Let’s work through an example to better illustrate these ideas. We’ll use Carrefour as the setting but keep it simple to respect the company’s privacy: only one node (orders) will be used. First we’ll implement the node’s query and then keep adding tests against it to confirm whether it’s working, simulating our own experience on real-life projects.

Here are some business rules and information about data we’ll be working with:

  1. orders is our first entity table. It’ll contain the fields orderID, Date, Value, customerHashID, status and partitionDate.
  2. deliveries represents the different consignments within each order. This happens because a customer may purchase items from different vendors in the same order, which results in different delivery options. Fields will be orderID, consignmentID, region, partitionDate.
  3. An order may have its corresponding rows in deliveries at different partitions.
  4. There are also orders that come from another retail group. We’ll call it group “Big”.
  5. Orders have 2 possible statuses; we’ll call them ‘A’ and ‘B’.
  6. Information about some orders may arrive later than their actual date. This means that incremental nodes should update their rows accordingly.
  7. They also wanted a better understanding of the distribution of customers categorized as new customers. It should consider the complete date range available in the raw dataset and only orders in status ‘B’. Customers should be considered new at the group level, that is, orders from Carrefour should not interfere with Big (and vice-versa).

Initializing the project

Let’s start our dbt project:

python3 -m venv .venv
source .venv/bin/activate
pip install dbt-bigquery==1.5.3
dbt init

And some dbt dependencies to install:

# in dbt project folder
cat > packages.yml << EOF
---
packages:
  - package: dbt-labs/metrics
    version: 1.5.0

  - package: dbt-labs/dbt_utils
    version: 1.0.0

  - git: "https://github.com/WillianFuks/dbt-flow.git"
    revision: 0.0.2
EOF
dbt deps

We also need to add a profile target to set up how the connection to BigQuery will be handled:

cat >> "$HOME/.dbt/profiles.yml" << EOF
name_of_dbt_project:
  outputs:
    unittest:
      dataset: unittest_wfuks
      job_execution_timeout_seconds: 300
      job_retries: 1
      location: US
      method: oauth
      priority: interactive
      project: dbt-flow
      threads: 4
      type: bigquery
  target: unittest
EOF
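
Two details worth checking at this point: the top-level key (name_of_dbt_project here) must match the profile entry in your dbt_project.yml, and the connection itself should resolve. A quick sanity check:

# confirm the profile/target is picked up and that dbt can reach BigQuery
dbt debug --target unittest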

So far so good. Now let’s create our models. To keep it simple, we’ll only create 1 model and 3 sources (we’ll abbreviate “Carrefour” as “C4” from now on):

#running from inside dbt project folder
mkdir -p models/{staging/{c4,big},marts}
cat > models/staging/c4/_c4__sources.yml << EOF
---
version: 2

sources:
  - name: c4_source
    schema: mock_schema
    database: dbt-flow
    tables:
      - name: orders
        identifier: orders
        columns:
          - name: orderID
          - name: orderDate
          - name: value
          - name: status
          - name: customerHashID
          - name: partitionDate

      - name: deliveries
        identifier: deliveries
        columns:
          - name: orderID
          - name: consignmentID
          - name: region
          - name: partitionDate
EOF

cat > models/staging/big/_big__sources.yml << EOF
---
version: 2

sources:
  - name: big_source
    schema: mock_schema
    database: dbt-flow
    tables:
      - name: orders
        identifier: big_orders
        columns:
          - name: orderID
          - name: store
            description: >
              It's either "c4" or "big".
          - name: date
          - name: value
          - name: region
          - name: customerHashID
          - name: partitionDate
EOF

cat > models/staging/_all__sources.yml << EOF
---
version: 2

sources:
  - name: all_source
    schema: mock_schema
    database: dbt-flow
    tables:
      - name: customers
        identifier: all_customers
        columns:
          - name: customerHashID
          - name: ordersData
            columns:
              - name: orderID
              - name: orderDate
              - name: store
              - name: region
EOF

The sources basically describe the orders input for both groups and what customers have purchased over the whole history of data. We’ll create just one model that encompasses all 3 sources: the orders node.

It basically retrieves the data; if it’s an incremental run it updates what’s already available, otherwise it proceeds to process customer types. You can try checking for yourself whether there are any mistakes in the query.
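
The full query is not reproduced here, so below is only a rough, hypothetical outline of its structure; the config options, CTE names and final column list are assumptions pieced together from the tests that follow, not the actual implementation:

-- models/marts/orders.sql -- illustrative outline only, not the real query
{{
  config(
    materialized='incremental',
    unique_key='orderID',
    partition_expiration_days=365
  )
}}

WITH c4_orders AS (
  -- raw Carrefour orders restricted to the run's date window
  SELECT * FROM {{ source('c4_source', 'orders') }}
  WHERE partitionDate BETWEEN DATE('{{ var("init_date") }}') AND DATE('{{ var("end_date") }}')
),

c4_deliveries AS (
  SELECT * FROM {{ source('c4_source', 'deliveries') }}
  WHERE partitionDate BETWEEN DATE('{{ var("init_date") }}') AND DATE('{{ var("end_date") }}')
),

big_orders AS (
  SELECT * FROM {{ source('big_source', 'orders') }}
  WHERE partitionDate BETWEEN DATE('{{ var("init_date") }}') AND DATE('{{ var("end_date") }}')
),

customers AS (
  SELECT * FROM {{ source('all_source', 'customers') }}
)

-- ...then: join Carrefour orders with their deliveries, unify them with the Big orders,
-- derive isNewCustomer / isNewRegionCustomer from the customers history and, on
-- incremental runs, merge the result with what's already in {{ this }} (elided here)...

SELECT
  orderID,
  orderDate,
  customerHashID,
  orderData,  -- ARRAY<STRUCT<consignmentID, value, status, region, isNewCustomer, isNewRegionCustomer>>
  store
FROM final_orders  -- final CTE of the elided part, one row per order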

The query is simple and direct when compared to real-life production environments, but it may already be enough to elicit the question:

“How can we guarantee that there are no mistakes in those SQL lines?”

dbt-flow

dbt-flow is a tool that adds unit-testing functionality for all node types offered by dbt (except snapshots, for now). We mock the input data and test whether the target node produces the expected results.

To use dbt-flow, simply define a new test file and use its macros (notice we already installed it via packages.yml). Here’s an example for testing our orders node:

mkdir -p tests/flow
cat > tests/flow/test_orders.sql << EOF

-- depends_on: {{ ref('orders') }}

{{
config(
tags=['flow-test', 'orders']
)
}}


{%- set test1 = dbt_flow.setup_test(target_model='orders', test_name='orders1', options={"drop_tables": false}, test_description='simple flow test for orders.',
mocks={
"c4_source.orders": """
SELECT
DATE('2023-01-01') AS partitionDate,
1 as orderID,
DATE('2023-01-01') AS orderDate,
100 as value,
'B' AS status,
1234 as customerHashID
""",
"c4_source.deliveries": """
SELECT
DATE('2023-01-01') AS partitionDate,
1 as orderID,
1 as consignmentID,
'regionA' as region
""",
"big_source.orders": """
SELECT
DATE('2023-01-01') AS partitionDate,
1 as orderID,
'c4' AS store,
DATE('2023-01-01') AS orderDate,
100 as value,
'regionB' AS region,
'B' AS status,
1234 as customerHashID UNION ALL
SELECT
DATE('2023-01-01') AS partitionDate,
3 as orderID,
'big' AS store,
DATE('2023-01-01') AS orderDate,
300 as value,
'regionB' AS region,
'B' AS status,
111 as customerHashID
""",
"all_source.customers": """
SELECT
1234 AS customerHashID,
ARRAY(SELECT AS STRUCT 1 AS orderID, DATE('2023-01-01') AS orderDate, 'c4' AS store, 'regionA' AS region) AS ordersData UNION ALL
SELECT
111 AS customerHashID,
ARRAY(SELECT AS STRUCT 3 AS orderID, DATE('2023-01-01') AS orderDate, 'big' AS store, 'regionB' AS region) AS ordersData
"""
},
expected="""
SELECT
1 as orderID,
DATE('2023-01-01') AS orderDate,
1234 as customerHashID,
ARRAY(SELECT STRUCT(1 AS consignmentID, 100 as value, 'B' as status, 'regionA' as region, TRUE AS isNewCustomer, TRUE AS isNewRegionCustomer)) AS orderData,
'c4' AS store UNION ALL
SELECT
3 as orderID,
DATE('2023-01-01') AS orderDate,
111 as customerHashID,
ARRAY(SELECT STRUCT(1 as consignmentID, 300 as value, 'B' as status, 'regionB' as region, TRUE AS isNewCustomer, TRUE AS isNewRegionCustomer)) AS orderData,
'big' AS store
"""
) -%}


{{ dbt_flow.run_tests([test1], global_options={"drop_tables": false, "verbose": true}) }}
EOF

We begin by declaring which nodes are part of the test. This is the

-- depends_on: {{ ref('orders') }}

part. After that, we add the tag flow-test so that we can select these test files, or exclude them, whenever necessary.
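
The tag makes it easy to run every flow test in one go, or to keep them out of a regular test run, using standard dbt node selection (pass whatever init_date / end_date values your mocks need):

# run all flow tests at once
dbt test --select tag:flow-test --vars '{init_date: 2023-01-01, end_date: 2023-01-01}'

# or keep them out of a regular test run
dbt test --exclude tag:flow-test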

Then we set a test with the macro dbt_flow.setup_test() which accepts:

  • target_model: The model whose final results you want to compare.
  • test_name: A string that identifies the test; it’s also used as a suffix when creating the mocked tables.
  • options: General settings, such as drop_tables or verbose.
  • test_description: Just a description for later reference.
  • mocks: A dict whose keys are nodes in our graph. For instance, in our example we have c4_source.orders, which refers to mocking the source “c4_source” with table name “orders”. The expression:
"c4_source.orders": """
SELECT
DATE('2023-01-01') AS partitionDate,
1 as orderID,
DATE('2023-01-01') AS orderDate,
100 as value,
'B' AS status,
1234 as customerHashID
"""

simply means replacing every reference to the orders table of c4_source with the corresponding SQL query.

  • expected: a SQL string asserting what the final results should be.

dbt-flow will create all the mock tables and then use dbt-utils to compare observed against expected.
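
I won’t go into dbt-flow’s exact implementation here, but conceptually the assertion boils down to a symmetric difference between the two result sets: the test passes when something like the sketch below returns zero rows (table names are made up for illustration, and rows are serialized with TO_JSON_STRING so the ARRAY columns become comparable):

-- hypothetical sketch of the comparison idea, not dbt-flow's actual generated SQL
WITH observed AS (
  SELECT TO_JSON_STRING(t) AS row_repr FROM `dbt-flow.unittest_wfuks.orders_orders1` t
),
expected AS (
  SELECT TO_JSON_STRING(t) AS row_repr FROM `dbt-flow.unittest_wfuks.expected_orders1` t
)
SELECT 'only in observed' AS issue, row_repr FROM (
  SELECT row_repr FROM observed EXCEPT DISTINCT SELECT row_repr FROM expected
)
UNION ALL
SELECT 'only in expected' AS issue, row_repr FROM (
  SELECT row_repr FROM expected EXCEPT DISTINCT SELECT row_repr FROM observed
)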

To run the test:

dbt test -s test_orders --vars '{init_date: 2023-01-01, end_date: 2023-01-01}'
Figure 1: first run of dbt-flow succeeds. The node orders was created as expected.

And the respective resulting table:

Figure 2: Results of transformations for node “orders”.

So far so good! Apparently our query is working fine!

Let’s test it a bit more and mock customers making purchases on multiple days now:

# tests/flow/test_orders_multiple_days.sql

(...)

mocks={
"c4_source.orders": """
SELECT
DATE('2023-01-01') AS partitionDate,
1 as orderID,
DATE('2023-01-01') AS orderDate,
100 as value,
'B' AS status,
1234 as customerHashID
""",
"c4_source.deliveries": """
SELECT
DATE('2023-01-01') AS partitionDate,
1 as orderID,
1 as consignmentID,
'regionA' as region
""",
"big_source.orders": """
SELECT
DATE('2023-01-03') AS partitionDate,
3 as orderID,
'big' AS store,
DATE('2023-01-03') AS orderDate,
300 as value,
'regionB' AS region,
'B' AS status,
111 as customerHashID
""",
"all_source.customers": """
SELECT
1234 AS customerHashID,
ARRAY(SELECT AS STRUCT 1 AS orderID, DATE('2023-01-01') AS orderDate, 'c4' AS store, 'regionA' AS region) AS ordersData UNION ALL
SELECT
111 AS customerHashID,
ARRAY(SELECT AS STRUCT 3 AS orderID, DATE('2023-01-03') AS orderDate, 'big' AS store, 'regionB' AS region) AS ordersData
"""
},
expected="""
SELECT
1 as orderID,
DATE('2023-01-01') AS orderDate,
1234 as customerHashID,
ARRAY(SELECT STRUCT(1 AS consignmentID, 100 as value, 'B' as status, 'regionA' as region, TRUE AS isNewCustomer, TRUE AS isNewRegionCustomer)) AS orderData,
'c4' AS store UNION ALL
SELECT
3 as orderID,
DATE('2023-01-03') AS orderDate,
111 as customerHashID,
ARRAY(SELECT STRUCT(1 as consignmentID, 300 as value, 'B' as status, 'regionB' as region, TRUE AS isNewCustomer, TRUE AS isNewRegionCustomer)) AS orderData,
'big' AS store
"""

Now we run it:

dbt test -s test_orders_multiple_days --vars '{init_date: 2023-01-01, end_date: 2023-01-03}'

Which gives:

Figure 3: Tests now are failing!

Oops!

Something is going on here.

If we take a look at the results for orders we see the following:

Figure 4: Customer type is broken.

These customers are definitely new in our system, so why is customer 111 being flagged as not new? More intriguing, it’s only happening for days after January 1st. Weird 🤔…

Well, if we investigate the query for processing customers, here’s the problem:

NOT EXISTS(
SELECT 1 FROM UNNEST(customersOrdersData) customer_orders
WHERE TRUE
AND DATE_TRUNC(customer_orders.orderDate, MONTH) < current_order.orderDate
) AS isNewCustomer,

To determine whether a customer is new at a given order, we compare that order’s date with the dates of all the other orders this same customer has already made. If some order comes before it, the customer cannot be considered new at that particular point. The reference should be at the month level, but since we truncate the order dates on one side only, each order ends up compared against its own history entry as if that entry had happened on day 1 of the month.

That is, an order on 2023-01-02 shows up in the customer’s history truncated to 2023-01-01; since DATE_TRUNC('2023-01-02', MONTH) < '2023-01-02' is true, the NOT EXISTS evaluates to false and the customer is wrongly flagged as not new for any order placed after the first day of the month.
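
You can see the asymmetry in a tiny standalone query (plain BigQuery SQL, independent of the model):

SELECT
  -- buggy comparison: the order's own history entry, truncated to the month,
  -- appears to precede the (untruncated) order date
  DATE_TRUNC(DATE '2023-01-02', MONTH) < DATE '2023-01-02' AS buggy_check,                    -- TRUE
  -- truncating both sides compares month against month
  DATE_TRUNC(DATE '2023-01-02', MONTH) < DATE_TRUNC(DATE '2023-01-02', MONTH) AS fixed_check  -- FALSE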

Let’s fix this bug. We just need to also truncate the orderDate:

WHERE TRUE
AND DATE_TRUNC(customer_orders.orderDate, MONTH) < DATE_TRUNC(current_order.orderDate, MONTH)

And running the test again:

Figure 4: Now it’s working again! Customers are making purchases at multiple distinct dates.

Cool! It’s working again!

Let’s explore our query a bit more. As we have to consider every single order a customer has ever made, let’s simulate customers purchasing across multiple years as well:

# tests/flow/test_orders_multiple_years.sql

mocks={
"c4_source.orders": """
SELECT
DATE('2021-01-01') AS partitionDate,
1 as orderID,
DATE('2021-01-01') AS orderDate,
100 as value,
'B' AS status,
1234 as customerHashID UNION ALL
SELECT
DATE('2023-01-02') AS partitionDate,
2 as orderID,
DATE('2023-01-02') AS orderDate,
200 as value,
'B' AS status,
4321 as customerHashID
""",
"c4_source.deliveries": """
SELECT
DATE('2021-01-01') AS partitionDate,
1 as orderID,
1 as consignmentID,
'regionA' as region UNION ALL
SELECT
DATE('2023-01-02') AS partitionDate,
2 as orderID,
1 as consignmentID,
'regionA' as region
""",
"big_source.orders": """
SELECT
DATE('2021-01-01') AS partitionDate,
1 as orderID,
'c4' AS store,
DATE('2021-01-01') AS orderDate,
100 as value,
'regionB' AS region,
'B' AS status,
1234 as customerHashID
""",
"all_source.customers": """
SELECT
1234 AS customerHashID,
ARRAY(SELECT AS STRUCT 1 AS orderID, DATE('2021-01-01') AS orderDate, 'c4' AS store, 'regionA' AS region) AS ordersData UNION ALL
SELECT
4321 AS customerHashID,
ARRAY(SELECT AS STRUCT 1 AS orderID, DATE('2023-01-02') AS orderDate, 'c4' AS store, 'regionA' AS region) AS ordersData
"""
},
expected="""
SELECT
1 as orderID,
DATE('2021-01-01') AS orderDate,
1234 as customerHashID,
ARRAY(SELECT STRUCT(1 AS consignmentID, 100 as value, 'B' as status, 'regionA' as region, TRUE AS isNewCustomer, TRUE AS isNewRegionCustomer)) AS orderData,
'c4' AS store UNION ALL
SELECT
2 as orderID,
DATE('2023-01-02') AS orderDate,
4321 as customerHashID,
ARRAY(SELECT STRUCT(1 as consignmentID, 200 as value, 'B' as status, 'regionA' as region, TRUE AS isNewCustomer, TRUE AS isNewRegionCustomer)) AS orderData,
'c4' AS store
"""

Now we just run:

dbt test -s test_orders_multiple_years --vars '{init_date: 2021-01-01, end_date: 2023-01-03}'

And we get:

Figure 5: customers purchasing on different years are disappearing!

Oops….wrong again

Weird 🤔

For some reason the order from 2021 disappears (we actually made this mistake when working on the real project; it took us hours to debug 😅).

As it turns out, we realized that the partition expiration had to be removed from the query:

{{ 
config(
materialized='incremental',
...
partition_expiration_days = 365
)
}}

There should be no partition expiration. Let’s remove it to see what happens:
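
The config keeps its other options (elided as ... above) and simply drops that line:

{{
  config(
    materialized='incremental',
    ...
  )
}}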

Figure 6: Without partition expiration days orders re-appear as expected.

It’s working! 🙌

What else could possibly go wrong?!?

Let’s test more of the business rules of our data. Remember from the description that orders and deliveries may arrive in different partitions, so let’s confirm this is handled as expected.

We mock one order arriving on day 2023-01-01 and its delivery data on 2023-01-02:

mocks={
"c4_source.orders": """
SELECT
DATE('2023-01-01') AS partitionDate,
1 as orderID,
DATE('2023-01-01') AS orderDate,
100 as value,
'B' AS status,
1234 as customerHashID
""",
"c4_source.deliveries": """
SELECT
DATE('2023-01-02') AS partitionDate,
1 as orderID,
1 as consignmentID,
'regionA' as region
""",
"big_source.orders": """
SELECT
DATE('2023-01-01') AS partitionDate,
1 as orderID,
'c4' AS store,
DATE('2023-01-01') AS orderDate,
100 as value,
'regionB' AS region,
'B' AS status,
1234 as customerHashID
""",
"all_source.customers": """
SELECT
1234 AS customerHashID,
ARRAY(SELECT AS STRUCT 1 AS orderID, DATE('2023-01-01') AS orderDate, 'c4' AS store, 'regionA' AS region) AS ordersData
"""
},
expected="""
SELECT
1 as orderID,
DATE('2023-01-01') AS orderDate,
1234 as customerHashID,
ARRAY(SELECT STRUCT(1 AS consignmentID, 100 as value, 'B' as status, 'regionA' as region, TRUE AS isNewCustomer, TRUE AS isNewRegionCustomer)) AS orderData,
'c4' AS store
"""

And now we run it:

dbt test -s test_orders_unmatching_partitions --vars '{init_date: 2021-01-01, end_date: 2023-01-03}'

And we get:

Figure 7: There’s some bug hidden in the join logic of orders and deliveries.

Wrong Again!

Again! Something is wrong…

Examining the results we see clearly why:

Figure 8: The left join didn’t find a match.

This indicates there’s an error in our JOIN logic. Investigating it, we identify what’s going on: our ON clause requires partitionDate to be equal on both sides, but they can differ, as per the business rules!

Let’s fix this:

SELECT
o.*,
d.* EXCEPT(orderID, partitionDate)
FROM c4_orders o
LEFT JOIN c4_deliveries d
ON o.orderID = d.orderID
--AND o.partitionDate = d.partitionDate

Which gives:

Figure 9: Now the left join is working!

Ok. That was easy!

Should we look for more mistakes?!

Well, we still haven’t investigated what happens when customers make purchases in distinct groups, so we should confirm that this logic is working.

Let’s mock data this time to simulate the same customer purchasing in both groups:

# tests/flow/test_orders_user_from_both_groups.sql

mocks={
"c4_source.orders": """
SELECT
DATE('2023-01-01') AS partitionDate,
1 as orderID,
DATE('2023-01-01') AS orderDate,
100 as value,
'B' AS status,
1234 as customerHashID
""",
"c4_source.deliveries": """
SELECT
DATE('2023-01-01') AS partitionDate,
1 as orderID,
1 as consignmentID,
'regionA' as region
""",
"big_source.orders": """
SELECT
DATE('2023-02-01') AS partitionDate,
2 as orderID,
'big' AS store,
DATE('2023-02-01') AS orderDate,
300 as value,
'regionB' AS region,
'B' AS status,
1234 as customerHashID
""",
"all_source.customers": """
SELECT
1234 AS customerHashID,
[
STRUCT(1 AS orderID, DATE('2023-01-01') AS orderDate, 'c4' AS store, 'regionA' AS region),
STRUCT(2 AS orderID, DATE('2023-02-01') AS orderDate, 'big' AS store, 'regionB' AS region)
] AS ordersData
"""
},
expected="""
SELECT
1 as orderID,
DATE('2023-01-01') AS orderDate,
1234 as customerHashID,
ARRAY(SELECT STRUCT(1 AS consignmentID, 100 as value, 'B' as status, 'regionA' as region, TRUE AS isNewCustomer, TRUE AS isNewRegionCustomer)) AS orderData,
'c4' AS store UNION ALL
SELECT
2 as orderID,
DATE('2023-02-01') AS orderDate,
1234 as customerHashID,
ARRAY(SELECT STRUCT(1 as consignmentID, 300 as value, 'B' as status, 'regionB' as region, TRUE AS isNewCustomer, TRUE AS isNewRegionCustomer)) AS orderData,
'big' AS store
"""

What possibly could go wrong?!

Let’s run this new test:

dbt test -s test_orders_user_from_both_groups --vars '{init_date: 2021-01-01, end_date: 2023-02-01}'

Which gives:

Figure 10: Unexpected conflict between customers purchasing at both groups.

DEAR LORD 😭

Weird…again…

Examining the results we see:

Figure 11: Customer is no longer new as it should be!

Something is wrong in the logic for customer types. The customer bought first at Carrefour, and when later buying at Big they should, as per the requirements, still be considered new at that group level. That can only mean that orders from both groups are blending together! We fix that by only comparing orders within the same group:

NOT EXISTS(
SELECT 1 FROM UNNEST(customersOrdersData) customer_orders
WHERE TRUE
AND current_order.store = customer_orders.store
AND DATE_TRUNC(customer_orders.orderDate, MONTH) < DATE_TRUNC(current_order.orderDate, MONTH)
) AS isNewCustomer,
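
To build intuition on what changed, here’s a small standalone check over the same mocked history of customer 1234, hardcoding the ‘big’ order of 2023-02-01 as the current order purely for illustration:

WITH customer AS (
  SELECT [
    STRUCT(1 AS orderID, DATE '2023-01-01' AS orderDate, 'c4' AS store, 'regionA' AS region),
    STRUCT(2 AS orderID, DATE '2023-02-01' AS orderDate, 'big' AS store, 'regionB' AS region)
  ] AS customersOrdersData
)
SELECT
  -- without the store filter the earlier 'c4' order makes the customer look "not new" at Big
  NOT EXISTS(
    SELECT 1 FROM UNNEST(customersOrdersData) co
    WHERE DATE_TRUNC(co.orderDate, MONTH) < DATE_TRUNC(DATE '2023-02-01', MONTH)
  ) AS is_new_without_store_filter,  -- FALSE
  -- with the store filter only 'big' orders are considered, so the customer stays new
  NOT EXISTS(
    SELECT 1 FROM UNNEST(customersOrdersData) co
    WHERE co.store = 'big'
      AND DATE_TRUNC(co.orderDate, MONTH) < DATE_TRUNC(DATE '2023-02-01', MONTH)
  ) AS is_new_with_store_filter      -- TRUE
FROM customer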

Now we get:

Figure 12: Customers now are compared only to the group level.

We are good again 🙌

What else could possibly go wrong?!?!

Well, there’s one more thing we haven’t tested yet: incrementality. It’s one thing to process the query for the first time, but once the table is already up and running, the compiled query changes!

Let’s test this scenario as well. The mock input this time will first create the mocked orders table, and then everything will be processed and merged into it. We begin with orders containing a single order entity and then feed it new data:

# tests/flow/test_orders_incremental.sql

-- depends_on: {{ ref('orders') }}

{{
config(
tags=['flow-test', 'orders']
)
}}


{%- set test1 = dbt_flow.setup_test(target_model='orders', test_name='orders6', options={"drop_tables": false, "test_incremental": true}, test_description='simple flow test for orders.',
mocks={
"c4_source.orders": """
SELECT
DATE('2023-01-02') AS partitionDate,
1 as orderID,
DATE('2023-01-01') AS orderDate,
100 as value,
'B' AS status,
1234 as customerHashID
""",
"c4_source.deliveries": """
SELECT
DATE('2023-01-02') AS partitionDate,
1 as orderID,
1 as consignmentID,
'regionA' as region
""",
"big_source.orders": """
SELECT
DATE('2023-01-02') AS partitionDate,
2 as orderID,
'big' AS store,
DATE('2023-01-02') AS orderDate,
200 as value,
'regionB' AS region,
'B' AS status,
111 as customerHashID
""",
"all_source.customers": """
SELECT
1234 AS customerHashID,
ARRAY(SELECT AS STRUCT 1 AS orderID, DATE('2023-01-01') AS orderDate, 'c4' AS store, 'regionA' AS region) AS ordersData UNION ALL
SELECT
111 AS customerHashID,
ARRAY(SELECT AS STRUCT 2 AS orderID, DATE('2023-01-02') AS orderDate, 'big' AS store, 'regionB' AS region) AS ordersData

""",
"orders": """
SELECT
1 as orderID,
DATE('2023-01-01') AS orderDate,
1234 as customerHashID,
ARRAY(SELECT STRUCT(1 AS consignmentID, 100 as value, 'A' as status, 'regionA' as region, TRUE AS isNewCustomer, TRUE AS isNewRegionCustomer)) AS orderData,
'c4' AS store
"""
},
expected="""
SELECT
1 as orderID,
DATE('2023-01-01') AS orderDate,
1234 as customerHashID,
[
STRUCT(1 AS consignmentID, 100 as value, 'A' as status, 'regionA' as region, TRUE AS isNewCustomer, TRUE AS isNewRegionCustomer),
STRUCT(1 AS consignmentID, 100 as value, 'B' as status, 'regionA' as region, TRUE AS isNewCustomer, TRUE AS isNewRegionCustomer)
] AS orderData,
'c4' AS store UNION ALL
SELECT
2 as orderID,
DATE('2023-01-02') AS orderDate,
111 as customerHashID,
ARRAY(SELECT STRUCT(1 AS consignmentID, 200 as value, 'B' as status, 'regionB' as region, TRUE AS isNewCustomer, TRUE AS isNewRegionCustomer)) AS orderData,
'big' AS store
"""
) -%}


{{ dbt_flow.run_tests([test1], global_options={"drop_tables": false, "verbose": true}) }}

Now we run it:

dbt test -s test_orders_incremental --vars '{init_date: 2021-01-01, end_date: 2023-02-01}'

And we get:

Figure 13: Incremental run is not working.

NOOOOOOOOOOOOOOOOOOOOOOoooooooooooo

Well…we still have some bug. At least we know it’s in the incremental part of the query. Let’s see what happened:

Figure 14: One of the orders lost all its data!

For some reason the order from the Big group was “shut down”. The most likely source of the bug is where we concatenate the orders from Carrefour and Big… (we actually made this mistake too; it took us hours to solve).

The problem lies in the properties of the ARRAY_CONCAT operator: if any of its inputs is NULL, the result is also NULL. When we join the orders, if a new order finds no match among the known ones, one of the inputs is NULL and so is the final result! Let’s fix that:

new_and_known_orders AS (

SELECT
new_orders.orderID,
new_orders.orderDate,
new_orders.customerHashID,
ARRAY_CONCAT(
COALESCE(new_orders.orderData, []),
COALESCE(known.orderData, [])
) AS orderData,
COALESCE(new_orders.store, known.store) AS store
FROM orders_joined new_orders
LEFT JOIN known_orders known
ON new_orders.orderID = known.orderID

)
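
If you want to convince yourself of the NULL propagation, a tiny standalone query shows both behaviors:

SELECT
  -- any NULL input makes ARRAY_CONCAT return NULL...
  ARRAY_CONCAT([1, 2], CAST(NULL AS ARRAY<INT64>)) AS concat_is_null,
  -- ...while coalescing to an empty array keeps the other side's elements: [1, 2]
  ARRAY_CONCAT([1, 2], COALESCE(CAST(NULL AS ARRAY<INT64>), [])) AS concat_fixed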

Running again:

Figure 15: The last of the bugs has been eliminated!

We are back…AGAIN! 🙌

Ok. BUT WHAT ELSE COULD STILL GO WRONG?!?

Relax, relax. I’m just kidding. Now it’s all over 😅!

Conclusion

I did try to make this article as succinct as possible. Failure is just part of the game.

Still, I couldn’t really find another way of exploring the idea of testing ELT transformations without working through an example based on real-life experience.

When we first implemented our system we had no idea how many bugs were still lingering around, unknown to us. Just as we saw in this article, the query seemed to be working just fine, and then we found out it actually wasn’t. It’s not that unexpected to have queries that look just fine but actually contain dozens of mistakes affecting production systems.

As those errors accumulate they can affect a company’s decision making in insidious ways, from where to invest in marketing to which costs to cut. Millions of dollars can be directed to the wrong actions simply because management is not actually seeing the real map.

As we wanted to build the source of truth for the company, it became clear that we also had to implement a system of guarantees to help us; otherwise we’d still be hunting for the bugs we had missed.

Now, with this set of protective queries installed, it’s a bit easier to make updates without fearing we’ll unexpectedly break something else. It’s certainly not a bulletproof solution, but it’s another tool we can leverage to help us deliver data quality control for companies.

In fact, the tool ended up being so useful for us that we used it, in a way, as TDD guidance. We’d first mock the data, make the changes to the query and see what the results looked like. As we had full control over the input, debugging became much easier.

Well, this is it for now! No more bugs for today 🙌! I hope this was useful and may help you in your own ELT jobs.

If you’d like to interact with the repository please feel free to do so!

And, as always, see you next mission ;)

PS: dbt-flow logo is the blending between dbt’s original logo and Samus’ screw attack from Super Metroid(1994). Yeap. Couldn’t control myself here 😄
