Automating a high-touch financial process in Python at Angel Studios

I had the pleasure of working on a really fun project at Angel Studios called Disbursements. It cut a month-long quarterly process down to a few hours and saved about $100,000 annually in wages spent on running it.

This was a dreaded financial process: we pay creators on a quarterly cadence to meet our contractual obligations and maintain our relationships with them. We get revenue from different platforms, split that revenue up by creator, and then pay each creator the revenue they earned. Sounds simple enough.

However, the reality was that the process had been plagued by problems for years. It took a data scientist a full month every quarter to process the data and generate reports. The reports were often wrong or incomplete - creators would complain that their historical earnings had gone down, for example. They were also not transparent, so creators had no way to know whether the calculations were accurate. Together, those two things created a bad relationship with our creators.

Identifying the core problems

First, I spent about 7 hours sitting with the guy who had run the process before me, watching him do it the old way. I also listened to the team members who had worked most closely with the creators, and they highlighted all the pain points in the process. Once I had a clear picture of what was going wrong, I was able to outline a plan to redo it. A few key insights enabled the whole thing:

  1. The wrong grain. Granularity matters a lot in data - if you build a daily report and then people want hourly information, you're toast, because that fine-grained detail is gone. The problem with this process was that we were using the wrong data model at the wrong granularity. Previously we mapped a piece of content directly to a creator; instead we needed to map the content to its episode, then to the contract, then to the creator. That let us say how much an individual episode and contract was earning, and let contracts pay off independently instead of being lumped together by creator, which solved at least 90% of the problems we had with calculating payments correctly (a minimal sketch of this mapping follows below). It involved manually remapping thousands of videos to episodes instead of creators, and splitting past earnings and debts up by contract, but it was well worth it.
  2. No basic tables in place. We needed a source of truth for some basics, like a list of creators with correct names and emails, a list of contracts with contract terms and starting balances, and a list of episodes, with primary and foreign key relationships between those three tables. 
  3. The wrong tools. The process was using a Postgres database, Elixir, LuaLaTeX for the report template, some Python notebooks for initial processing of some data, and Google Sheets for storing and processing some key data. The important question here is "what type of process is this, and what tools does this process require?" This is a high-touch financial process - high-touch meaning there are lots of exceptions: every step is different for every platform, some creators are paid differently than others, and you need visibility into every step to ensure you did it right. Usually Excel is the king of high-touch financial processes because you can see all the data as you manipulate it, but Excel isn't great for scalable processes - doing steps the same way each time. Python pandas is great for that: data scientists regularly take input data, clean and transform it in pandas, observe each transformation step in a Jupyter notebook, and produce an output. A custom process with repeatability.
  4. The wrong people. Previously, we were asking data guys to do the job of accounting - keeping track of payments and account balances over time. That was a mistake; the accounting team is much better at this than I am. Saying no to the parts of the process that shouldn't have been on my plate was an important part of ensuring that the next time we ran the process, it would go smoothly. It also helped the data team focus on what we do best: cleaning, processing and manipulating the data.

Getting to these insights required an inquisitive approach to diagnosing and understanding the core problems. The thought process went from symptoms, to core problems, to solutions - that's how we figured out how to simplify and make things better. All I saw initially were symptoms: reports were wrong, the process was slow and error prone, creators were unhappy. Only after picking apart each symptom into its root causes did we arrive at the right solutions to make the process smooth going forward.
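To make the grain change concrete, here is a minimal pandas sketch of the content-to-episode-to-contract-to-creator mapping. The table and column names (content_revenue, episodes, contracts, creators, and the various _id keys) are made up for illustration, not the actual schema we used.

    import pandas as pd

    # The three basic tables acting as a source of truth (columns are illustrative).
    creators = pd.DataFrame({"creator_id": [1], "creator_name": ["Jane Doe"], "email": ["jane@example.com"]})
    contracts = pd.DataFrame({"contract_id": [10], "creator_id": [1], "revenue_share": [0.5], "starting_balance": [-2500.00]})
    episodes = pd.DataFrame({"episode_id": [100], "contract_id": [10], "episode_title": ["S01E01"]})

    # Content revenue now maps to an episode, not directly to a creator.
    content_revenue = pd.DataFrame({"content_id": ["yt_abc123"], "episode_id": [100], "revenue": [1234.56]})

    # content -> episode -> contract -> creator
    fine_grain = (
        content_revenue
        .merge(episodes, on="episode_id", how="left")
        .merge(contracts, on="contract_id", how="left")
        .merge(creators, on="creator_id", how="left")
    )

    # Because the grain is content/episode/contract, each contract can pay off independently.
    print(fine_grain.groupby("contract_id", as_index=False)["revenue"].sum())

With the joins at this grain, rolling up to episode, contract, or creator totals is just a groupby, and nothing finer-grained is ever thrown away.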

The solution structure

Once the basic tables and remapping were done, I went through each platform's report. Each platform sends us an Excel file or CSV over email, or we download the CSV from a portal, and for one platform I had to write a Python script to query their API. For each platform, I then wrote a pandas script to unify the quarter's CSVs into one table and produce an output table at the content-quarter-episode grain, the most granular level of detail that made sense for this project. I joined on the mapping tables; cleaned, aggregated, and calculated as needed for that platform; and wrote unit tests to make sure there were no missing mappings and that aggregations produced the same totals as the input. Finally, I hashed each row, added metadata about the content revenue, and wrote the final table to a Parquet file. Parquet is awesome because it's like a CSV file - a portable file holding the data - but also like a database table: columnar storage, good compression, and it keeps track of each column's datatype. The output of this first step was a per-platform 'staging table'.
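As an illustration only, here is a rough sketch of what one platform's staging script might look like. The paths, column names, metadata fields, and hashing approach are hypothetical stand-ins rather than the actual pipeline.

    import glob
    import pandas as pd

    # Combine the quarter's CSVs from one platform into a single table (paths are illustrative).
    frames = [pd.read_csv(path) for path in glob.glob("platform_x/2024Q4/*.csv")]
    revenue = pd.concat(frames, ignore_index=True)

    # Join to the mapping table so every row lands at the content-quarter-episode grain.
    episodes = pd.read_csv("mappings/episodes.csv")  # content_id, episode_id, contract_id
    staged = revenue.merge(episodes, on="content_id", how="left")

    # Sanity checks: no missing mappings, and totals still match the raw input.
    assert staged["episode_id"].notna().all(), "missing episode mapping"
    assert abs(staged["revenue"].sum() - revenue["revenue"].sum()) < 0.01

    # Add metadata and a row hash, then write the staging table to Parquet (needs pyarrow).
    staged["platform"] = "platform_x"
    staged["quarter"] = "2024Q4"
    staged["row_hash"] = pd.util.hash_pandas_object(staged, index=False)
    staged.to_parquet("staging/platform_x_2024Q4.parquet", index=False)

Each platform got its own version of this script, since the cleaning and aggregation steps differed, but they all wrote the same staging-table shape.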


In the next step, I read in the Parquet files from each platform and combined them into one fine-grain content revenue table for the quarter. From there I aggregated by episode in one table and by contract in another, then calculated the payment for each contract and wrote the output to another Parquet file. I also created a single Excel workbook with a tab for each level of granularity - content, episode, contract and payment calculations - so it was simple to figure out why a contract earned a given amount: you could go to the episode and content tabs and see the full breakdown. This transparency was a game changer for everyone involved - for finance to understand how payments are calculated, for customer service to explain the calculations to creators, and for me to verify that payments were being calculated correctly.
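Here is a minimal sketch of that consolidation step, with the same caveats: the file layout and column names are assumptions, and the real payment logic was contract-specific, so it's reduced here to a simple revenue share netted against a starting balance. The multi-tab workbook comes from pandas' ExcelWriter.

    import glob
    import pandas as pd

    # Combine every platform's staging table into one fine-grain table for the quarter.
    content = pd.concat(
        [pd.read_parquet(path) for path in glob.glob("staging/*_2024Q4.parquet")],
        ignore_index=True,
    )

    # Roll up to episode grain, then to contract grain.
    by_episode = content.groupby(["episode_id", "contract_id"], as_index=False)["revenue"].sum()
    by_contract = by_episode.groupby("contract_id", as_index=False)["revenue"].sum()

    # Toy payment calculation: apply a revenue share, then net against the contract's starting balance.
    contracts = pd.read_csv("mappings/contracts.csv")  # contract_id, creator_id, revenue_share, starting_balance
    payments = by_contract.merge(contracts, on="contract_id", how="left")
    payments["earned"] = payments["revenue"] * payments["revenue_share"]
    payments["payment"] = (payments["earned"] + payments["starting_balance"]).clip(lower=0)
    payments.to_parquet("output/payments_2024Q4.parquet", index=False)

    # One workbook, one tab per level of granularity (needs openpyxl).
    with pd.ExcelWriter("output/disbursements_2024Q4.xlsx") as writer:
        content.to_excel(writer, sheet_name="content", index=False)
        by_episode.to_excel(writer, sheet_name="episode", index=False)
        by_contract.to_excel(writer, sheet_name="contract", index=False)
        payments.to_excel(writer, sheet_name="payments", index=False)

Keeping every level of aggregation in one workbook is what made the numbers explainable: any total on the payments tab can be traced back through the contract, episode and content tabs.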

Next came the reports. I created a design in Canva, used ChatGPT to convert the image to HTML, used Jinja to inject template variables from Python, and wrote a script that looped through each creator and each of their contracts to render all their data into a report. Finally, when I heard that the team member in charge of sending out the reports had spent 5 days emailing 500 of them, I wrote a script to send the emails too.
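A stripped-down version of that rendering loop might look like the following; the template file, field names, and directory layout are assumptions for illustration, and the real reports carry the full contract and episode breakdowns.

    import pandas as pd
    from jinja2 import Environment, FileSystemLoader

    payments = pd.read_parquet("output/payments_2024Q4.parquet")
    creators = pd.read_csv("mappings/creators.csv")  # creator_id, creator_name, email

    env = Environment(loader=FileSystemLoader("templates"))
    template = env.get_template("report.html")  # the HTML converted from the Canva design

    # Render one HTML report per creator, listing each of their contracts.
    for _, creator in creators.iterrows():
        their_contracts = payments[payments["creator_id"] == creator["creator_id"]]
        html = template.render(
            creator_name=creator["creator_name"],
            quarter="2024Q4",
            contracts=their_contracts.to_dict("records"),
        )
        with open(f"reports/{creator['creator_id']}_2024Q4.html", "w") as f:
            f.write(html)

Sending the emails is then just one more loop over the creators table, using smtplib or whichever email service is available, pointing each message at its creator's rendered report.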

When it was time to hit run, we generated the reports with minimal issues and sent out 99% of them in 20 minutes. The handful with issues were resolved over the next few days and then sent out as well. The content creators were very happy with the results:
“Excellent breakdown.”
“Loving the extra details on this quarter's report :) thanks!”
“This looks great! Wow. Thanks so much. :) Look at me finally making money! Yay.”
“Thanks for the info, and the clarity of the new statements. Waaaaaayyy better. It’s good to be back in the black :)”
“Wow that's a much nicer quarterly statement!”
“Great new reports!! Super easy to see everything.”
“I gotta tell you. I like this breakdown much better. You’re doing such an amazing job. Thank you.”
“Thank you for all your hard work on our behalf.”
“The new reporting system is excellent. Much appreciated!”

The process can now be run end to end in 1-3 hours instead of 4 weeks. It dramatically improved relationships with stakeholders and reduced risk for the company. All told, it likely saved the company $100k annually just counting the wages spent on running this process.

Using Excel as a database

Yes, I committed the worst of all data guy offenses and used a Google Sheet as the database for our three basic tables (episodes, contracts and creators). Why not a Postgres DB? We needed to pull all the basic data into existence, and that required lots of copy+pasting and hundreds of edits as we iterated to get it right. We also don't have a dev team dedicated to this project who could build a GUI on top of a database so non-technical staff could manage it and keep it updated. This data should eventually live in a real relational database, but for the sake of speed, and to prove the concept would work, the simplest and most expedient thing was to keep the tables in a spreadsheet. I am proud that we didn't prematurely optimize this, and now that the process works well, we can build more structure around it, including a relational db.
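For what it's worth, pandas makes it painless to treat a Google Sheet like a set of read-only tables. This sketch assumes the sheet is shared so each tab can be exported as CSV; the sheet ID, tab gids, and the integrity checks are illustrative.

    import pandas as pd

    SHEET_ID = "your-google-sheet-id"  # hypothetical
    EXPORT_URL = "https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid}"

    def read_tab(gid: str) -> pd.DataFrame:
        # Each tab of the sheet can be exported as CSV via its gid.
        return pd.read_csv(EXPORT_URL.format(sheet_id=SHEET_ID, gid=gid))

    creators = read_tab("0")        # placeholder gids
    contracts = read_tab("111111")
    episodes = read_tab("222222")

    # Enforce the primary/foreign key relationships a real database would give us for free.
    assert creators["creator_id"].is_unique
    assert contracts["contract_id"].is_unique
    assert contracts["creator_id"].isin(creators["creator_id"]).all()
    assert episodes["contract_id"].isin(contracts["contract_id"]).all()

Checks like these run at the start of the pipeline, so a bad edit in the sheet fails loudly instead of silently corrupting a quarter's payments.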

Why did it succeed?

I had a lot of fun spending late nights and weekends grinding through this project to get it to the finish line. It was rewarding to take ownership of something hard, understand it deeply, and move mountains to make sure it happened on time. What made it fun, and why was I able to do something that had perplexed two data scientists before me?

First, I was able to simplify the problem. Oftentimes, complexity is a symptom that some basics need to be reassessed before the problem can be simplified.

Second, I was able to focus. With permission from my manager and support from my team members, I took 2 1/2 weeks away from my regular work and spent it only on this project. That meant I never lost context, I never made a big mistake, and I kept in my head the vision for what needed to be built, what needed to be done, and by when.

Third, I was able to draw on experience from different domains: my time as a financial analyst running Excel-based processes in Python, and my time as an analytics engineer solving fundamental problems with better data models.

Being able to apply my strengths and dedicate all my time and attention to doing one project really well was a lot of fun. It was especially satisfying because the benefits were so apparent - it was a massive relief to everyone involved when it worked. I hope to do more projects like this in the future.
