Sep 29, 2020 / Bill Lay

I came across an interesting problem the other day that required me to rethink my usual approach to expressions.

Setup

My (fictitious) client sells coffee makers. As part of their post-sale customer care, they handle product issues in one of three ways:

  1. warranty claim
    1. for minor, fixable issues within the warranty life (typically 12 months)
    2. the consumer completes an online claim
    3. after claim is received and approved, repair parts and instructions are sent to the consumer
  2. product returns
    1. for major issues with a product
    2. the consumer completes an online return authorization form (RA)
    3. after the RA is received and approved, prepaid shipping label and instructions are sent to the consumer
  3. complaint call
    1. for minor issues that do not require a repair or replacement
    2. example: talking the customer through product setup or cleaning

Each of these processes use a different back-end information system, but the data can be brought together in a Qlik app.

The Data Model

Here’s a grossly simplified data model. Note that it uses the “link table” approach to combining two data subjects: Claims and Returns. A better thought-out model would combine the three processes in a generic, concatenated fact table, but we nonetheless have this pragmatic, albeit clumsy model to work with.

The counter fields _ClaimCounter and _ReturnCounter simply have a one (1) on every row to facilitate easy counting.

IMPORTANT – the data model includes a row in the Serial table for every product sold, regardless of whether or not it has had an issue (claim/return).

The Question

So here’s the business question:

How many coffee makers that were sold last year have a claim OR have been returned?

Let’s translate this question in terms of the data model:

  • “how many coffee makers” = count(distinct SerialNumber)
  • “OR” means we need to include both claims and returns in the result
  • ignore customer complaint facts

Being hasty, I dove into this question and said, “Easy – I’ll simply count the serials with claims and count the serials with returns and add them together.”

count(distinct {<_ClaimCounter = {1}>} SerialNumber) 
    + count(distinct {<_ReturnCounter = {1}>} SerialNumber)

Or if I’m worried that one of these could be null, I can use rangeSum() to guard against that:

rangeSum(
    count(distinct {<_ClaimCounter = {1}>} SerialNumber)
    ,count(distinct {<_ReturnCounter = {1}>} SerialNumber) 
)

This is going to fail however, because any Serial that had both a claim and a return will be double-counted.

Now, if we didn’t have the CustomerComplaint table, and the only two possible issues were Claims and Return then we could do something clever using the association with the Calendar table:

count(distinct {<IssueDate = {"*"}>} SerialNumber)

This is essentially selecting the set of all possible Issue Dates, and then counting the distinct Serials associated through the Link table to those dates.  But that won’t work since we have the CustomerComplaint table in the model. Because it would be over-counting those Serials that had a complaint, even if they didn’t have a return or claim.

Instead, a neat way to solve this is to use the boolean operator on set modifiers, i.e. use the union operator + to combine the two count expressions

count(distinct {<_ClaimCounter = {1}>   +   <_ReturnCounter = {1}>} SerialNumber)

(I’ve exaggerated the whitespace so you can see the union operator here.)

I confess that I rarely use these boolean operators, but whenever you’re faced with OR requirements they can save you.

Want to learn more? Check out Oleg Troyansky’s “Set Analysis & Advanced Aggregation” class at MastersSummit.com.

-Bill