I use this project with permission from the company where I work as a Data Analytics Engineer.
Summary
Health Catalyst is a healthcare consulting firm with 1,000 employees, about a third of whom are data engineers who consult at hospitals and clinics. The business case for my work is that management has to decide how all of these engineers are allocated to clients. I worked closely with management, iterating constantly on their needs and feedback, which resulted in over 20 versions of the project over a 3-month period. I also worked with stakeholders to validate data and get access to new data when needed.
I presented a finished dashboard to senior management, built with advanced SQL, ETL tools, and Power BI. I managed the project myself using agile techniques, including a Kanban board with tasks, daily standups, pointing meetings to assign value to tasks, and retrospectives to discuss project progress.
Project Overview
The project was completed in stages, though these were done iteratively and not sequentially.
- Data Exploration
- Data Transformation
- Data Validation
- Data Visualization
1. Data Exploration
Before doing anything, I needed to understand the business use case, our customers' pain points, and what problems the dashboard could solve.
The process management uses is:
- A client requests engineers to work on their project
- The engineers are allocated to the client by management
- The engineers log billable hours at the client to meet contractual obligations
The existing process involved disparate spreadsheets and ad-hoc reports generated by the time tracking software. It was pretty painful to watch. Decisions have to be made every week about which business units need to hire more people, how many incoming hires are needed to backfill projected attrition, and so on. There was wide disagreement about how effectively the organization was using its time tracking software or allocating engineers, and whether the ad-hoc reports reflected reality on the ground at all. It was a data mess, and management had poor visibility into key metrics (e.g., are employees allocated or billing too much or too little time?). Any improvement on these processes would give leadership a common source of truth from which to make decisions and cut down on the management time and effort consumed by ad-hoc reports and spreadsheets.
Therefore, the data and visualization would need to answer the following:
- What is our billable utilization? (a key metric for consulting firms)
- How efficiently do we allocate billable time to clients?
- Do people actually bill hours at the clients they were allocated to?
- Which individuals are over-allocated or under-allocated to clients?
These are all critical questions to answer to understand the efficiency and financial health of the organization as a whole and to get greater insight into how individual business units are operating.
To answer these questions, I modeled in Excel and explored the data warehouse to find the tables we would need, kind of like grocery shopping with a recipe in hand to pick the right ingredients. It was a big job to work out what data we had access to, how it fit together, and which tables and fields we would need to make it work. I often uncovered problems with the data, including duplicate records, bad calculations, and obsolete or out-of-date data better found elsewhere. When that happened, I would go up a level to pull from the source data and build my own tables in the data warehouse, or go to the stakeholders so they could fix the problems at the source.
2. Data Transformation
First the data had to go through the ELT process (extract, load, transform) in our data warehouse. At this company the corporate analytics team had already extracted and loaded the data, but the tables I needed were in disparate databases, and a lot of cleaning had to be done. All transformation was done using an in-house ETL tool driven by SQL queries. These queries ranged from simple to complex, using WHERE filters, aggregate functions, GROUP BY clauses, CASE statements, CTEs, inner joins, left joins, cross joins, full outer joins, and window functions.
Here is an example of the SQL I wrote to create the 'Daily Allocations' table. Queries like this were written and tested in Azure Data Studio, then pasted into the ETL software, which creates the tables.
The table turns an allocation with a date range into daily allocated hours for a given employee. Say Paul was allocated 3,240 minutes to a client between May 31 and July 31. The allocation is joined to a calendar table to produce one row per day: employee ID, date of work, the billable status of the project, and the minutes for the entire allocation.
The query then splits the allocation into minutes per day based on the date range. So on Aug 31, for example, one employee has 131 minutes allocated to a billable project.
Finally, it turns minutes into hours, uses the billable flag to assign those hours to either 'billable' or 'nonbillable' time, and calculates the total time as well. Any overlapping allocations are added together at this stage by employee and by day.
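Here is a simplified sketch of that logic. The real query runs against our warehouse schema, so the table and column names below are illustrative stand-ins:

```sql
-- Simplified sketch of the Daily Allocations logic. The Allocations and
-- Calendar tables and all column names are illustrative, not the real schema.
WITH SplitByDay AS (
    SELECT
        a.EmployeeId,
        c.CalendarDate,
        a.IsBillable,
        -- Spread the allocation's total minutes evenly across its date range
        a.AllocatedMinutes * 1.0
            / (DATEDIFF(DAY, a.StartDate, a.EndDate) + 1) AS MinutesPerDay
    FROM Allocations AS a
    INNER JOIN Calendar AS c
        ON c.CalendarDate BETWEEN a.StartDate AND a.EndDate
)
SELECT
    EmployeeId,
    CalendarDate,
    -- Convert minutes to hours and split on the billable flag; overlapping
    -- allocations for the same employee and day are summed together here
    SUM(CASE WHEN IsBillable = 1 THEN MinutesPerDay ELSE 0 END) / 60.0 AS BillableHours,
    SUM(CASE WHEN IsBillable = 0 THEN MinutesPerDay ELSE 0 END) / 60.0 AS NonbillableHours,
    SUM(MinutesPerDay) / 60.0 AS TotalHours
FROM SplitByDay
GROUP BY EmployeeId, CalendarDate;
```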
I am sure this is not the most efficient way to do these sorts of transformations - no doubt I could split the allocations by day, add them together, and calculate total, billable, and nonbillable time in a single step. But breaking the work into smaller steps helps me understand what the code is doing, which makes changing and debugging it much easier.
To get my data from A to B, I drew from a series of source tables and combined them in the ways that made the most sense. I think this part could have been done more cleanly, but keep in mind that I knew no SQL in week 1, so I made some choices early on that were likely inefficient given my limited experience and visibility.
The employee and team member source lists are one example of this mistake. I could have combined them into my own employee table (ProServicesEmployees) first and then used that table exclusively. Instead, I used the two source tables and the combined table interchangeably. Consolidating up front would have saved a few joins on the "FTEsOnLeave" and "EventCurrentAllocations" tables and made the code easier to understand and change. A sketch of that consolidation follows.
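Something like the following, with illustrative source-table and column names, would have given every downstream query a single employee table to join against:

```sql
-- Hypothetical consolidation of the two employee source lists into one
-- table; EmployeeList, TeamMemberList, and the columns are illustrative.
SELECT EmployeeId, FullName, CostCenter, ManagerName
INTO ProServicesEmployees
FROM EmployeeList
UNION
SELECT TeamMemberId AS EmployeeId, FullName, CostCenter, ManagerName
FROM TeamMemberList;
```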
3. Data Validation
One major issue we faced at the beginning was that the allocations data did not match reality. The cause was a calculation in our source data that took raw hours, created a percent-allocation metric, and then turned that back into hours. When the method for calculating allocation % changed in April, the numbers diverged, which could have affected company-wide staffing decisions. You can see below how the bad data (red) tracks my data (blue) closely until April.
[Figure: My allocation data (blue) vs. the improperly calculated allocation data (red)]
The rest of the validation process involved getting the ad-hoc reports as Excel sheets, running my own queries to match the data, and then using VLOOKUPs and Boolean operators to do simple comparisons, flagging mismatches with conditional formatting. This was done thoroughly for individual employee time tracking, employee allocations, lists of current employees, and details like termination dates and leave status. Along the way, outliers were identified, and either the source data was fixed or I did my own filtering in SQL (e.g., removing duplicates). The validation was granular enough to make sure all the timesheet data matched exactly; most of the discrepancies were just rounding errors.
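The same kind of check can also be expressed in SQL. Here is a sketch that assumes an ad-hoc report has been imported into a hypothetical AdHocReportImport table and compared against the Daily Allocations output:

```sql
-- Hypothetical mismatch check: flag any employee/day where my warehouse
-- hours and the imported ad-hoc report differ by more than a rounding
-- tolerance. Table and column names are illustrative.
SELECT
    COALESCE(w.EmployeeId, r.EmployeeId)     AS EmployeeId,
    COALESCE(w.CalendarDate, r.CalendarDate) AS CalendarDate,
    w.TotalHours                             AS WarehouseHours,
    r.TotalHours                             AS ReportHours
FROM DailyAllocations AS w
FULL OUTER JOIN AdHocReportImport AS r
    ON  r.EmployeeId   = w.EmployeeId
    AND r.CalendarDate = w.CalendarDate
WHERE ABS(COALESCE(w.TotalHours, 0) - COALESCE(r.TotalHours, 0)) > 0.01;
```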
4. Data Visualization
The end goal was visualizing this data in a way that made it as useful as possible to senior management for making staffing decisions. Because it involved input from so many stakeholders and hadn't been done before, this step took the longest by far. We used Microsoft Power BI to visualize the data and publish reports. This stage is still in progress, because running the data on a daily basis and hosting it at the company requires help from stakeholders who have a big backlog of action items.
The visualization went through 20+ iterations driven by feedback from end users, which usually also required SQL queries to be written or changed in the warehouse, making this a very iterative and agile development process. The daily routine for refining and enhancing the dashboard was: show my work, get input from stakeholders, brainstorm better ways to visualize the data, keep what worked, scrap what didn't, and then get new input from stakeholders again. Here are some of the results of my work.
Below, the Efficiency & Utilization tab shows efficiency metrics that help management understand how the business is trending month to month and how allocations match up with actual hours worked. Billable utilization in particular is an important KPI, representing the percentage of employee time spent working on billable projects. The tab lets you filter by department (CostCenter), manager, team member, and month.
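Behind that KPI is a simple ratio. A sketch of the monthly rollup, again with illustrative names and treating billable utilization as billable hours over total hours:

```sql
-- Hypothetical month-over-month utilization trend built from the
-- Daily Allocations output; table and column names are illustrative.
SELECT
    DATEFROMPARTS(YEAR(CalendarDate), MONTH(CalendarDate), 1) AS ReportMonth,
    SUM(BillableHours)                              AS BillableHours,
    SUM(TotalHours)                                 AS TotalHours,
    SUM(BillableHours) / NULLIF(SUM(TotalHours), 0) AS BillableUtilization
FROM DailyAllocations
GROUP BY DATEFROMPARTS(YEAR(CalendarDate), MONTH(CalendarDate), 1)
ORDER BY ReportMonth;
```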
The Cost Center tab is particularly exciting because it allows the head of the consulting division to compare aggregate department performance, something that they have never been able to do before. These metrics include utilization, efficiency, team member counts, ratios of employees to managers, and a breakdown of how staff are spending their time. It includes the same filters for managers, team members, departments and months.
Finally, here is a table showing the breakdown of how time is spent at the employee level in a given month or range of months. This helps management see who is spending time on the right things, who needs to be re-allocated to better projects, and so on.
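A sketch of the query shape behind that table, with illustrative names and placeholder dates:

```sql
-- Hypothetical employee-level time breakdown for a chosen date range;
-- table names, column names, and dates are illustrative.
DECLARE @StartDate date = '2021-01-01';
DECLARE @EndDate   date = '2021-03-31';

SELECT
    e.FullName,
    SUM(d.BillableHours)                                AS BillableHours,
    SUM(d.NonbillableHours)                             AS NonbillableHours,
    SUM(d.TotalHours)                                   AS TotalHours,
    SUM(d.BillableHours) / NULLIF(SUM(d.TotalHours), 0) AS BillableUtilization
FROM DailyAllocations AS d
INNER JOIN ProServicesEmployees AS e
    ON e.EmployeeId = d.EmployeeId
WHERE d.CalendarDate BETWEEN @StartDate AND @EndDate
GROUP BY e.FullName
ORDER BY BillableUtilization DESC;
```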
Results
I have been honored to apprentice as a Data Analytics Engineer here. Senior leadership created this project because they didn't have any client work to assign me to. I could have treated it as busy work, but I wanted to make it something with a real impact on the company. In less than 3 months it has become an important part of the business unit's strategy for conducting staffing operations and is beginning to function as a single source of truth at the SVP and C-suite levels. Reactions to the project have been overwhelmingly positive. I am grateful for the fantastic opportunity to work and learn at Health Catalyst, and I look forward to taking these skills into the future.
Feedback from users:
"Manually collecting data from 5 different sources is tedious and error-prone. Having it all in one place is a game changer because it's something we've never had access to before." - SVP of Analytics Services
"I'm impressed you were able to do this so quickly. It looks like an experienced person did it." - Analytics Engineer
"The insights from Paul's dashboard have already helped me have more productive conversations with my team about how we can improve our operations. It's one of the best I've seen at Health Catalyst, and I want to use it as a model for the high-quality analytics work we do here." - VP of Operations
And my efforts have been similarly well received.
"You are a great employee because of how quickly you are able to get work done." - VP of Operations"You always ask for feedback about how you can improve and accept it humbly." - SVP of Analytics Services