Standardize data modeling so work can be reused
Having worked in data for many years, I found that every question we answered started from scratch. We constantly had to think about what we needed to build, which columns to add, and how a table would be used. It always blew my mind that, at this exact moment, there are probably another 10 data engineers thinking about the same table to answer the same set of questions. Why doesn't a place exist to share?
I know what you're thinking: Query Templates. But those always failed me. The assumptions were too basic, and my Salesforce (or any other tool's) implementation was always just different enough to make templates unhelpful. I needed a standard that could work across all companies, answer any question, and capture any data structure.
I needed the Activity Schema.
I went on a tour of many companies looking for a standard that could work. The weird thing was that when talking to these companies, we would discuss their data and the questions they answered in terms of simple journeys, but when we looked at the SQL it was all different. It was complex, nested, and super confusing.
The why really fascinated me, so I kept digging. What I discovered was that all templates assume data can be connected via JOINs on a key. Yet in practice that key never existed, so every data engineer had to write a lot of complex SQL to figure out how to stitch the data together to answer the questions at hand. Different questions and different business logic led to very complicated data transformations. Over time it only got worse: as companies asked more questions and wanted to slice and dice by another feature, data engineers added more SQL to combine more data.
This problem is the worst it has ever been. Every startup leverages multiple SaaS products to operate, each collecting its own data in its own way.
Each also has its own analytics, so no one comes to the data team to ask how many tickets were submitted. They look at their Zendesk dashboard.
The result is that every question coming to the data team bridges multiple data sources. Bridging them means you need a way to connect the data, and there will not be any key to join on.
That is why the Activity Schema became so popular -- it allows data teams to bridge data sources using Customer and Time.
That has been done before, but the Activity Schema standardized how to relate data using 12 temporal joins, allowing full reusability.
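To make "relate data using Customer and Time" concrete, here is a minimal Python sketch of a single temporal relationship. It is an illustration under assumptions, not the Activity Schema specification: the row layout (`customer`, `ts`, `activity`), the activity names, and the helper `first_after` are all hypothetical, and it shows only one relationship rather than the full set of 12.

```python
from datetime import datetime

# Hypothetical activity stream: one row per (customer, time, activity).
# The field names and activity names are assumptions made for this sketch.
stream = [
    {"customer": "a@x.com", "ts": datetime(2021, 1, 1), "activity": "visited_site"},
    {"customer": "a@x.com", "ts": datetime(2021, 1, 3), "activity": "submitted_ticket"},
    {"customer": "a@x.com", "ts": datetime(2021, 1, 9), "activity": "submitted_ticket"},
    {"customer": "b@x.com", "ts": datetime(2021, 1, 2), "activity": "visited_site"},
]


def first_after(stream, primary, secondary):
    """Pair each `primary` row with the first `secondary` row by the same
    customer that happens strictly later, or None if there is none."""
    results = []
    for row in stream:
        if row["activity"] != primary:
            continue
        # No shared key is needed: rows are related by customer and time only.
        later = [
            r for r in stream
            if r["customer"] == row["customer"]
            and r["activity"] == secondary
            and r["ts"] > row["ts"]
        ]
        match = min(later, key=lambda r: r["ts"]) if later else None
        results.append((row, match))
    return results


pairs = first_after(stream, "visited_site", "submitted_ticket")
```

In this sketch the site visits and the tickets could come from two different tools. The only things they need to share are a customer identifier and a timestamp.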
Solving the data modeling problem with a standard structure that has standardized guarantees opened up the world for sharing. Anything built on top of the Activity Schema can be reused. Not only that, but activities can be swapped so the same work can be repurposed further.
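Swapping activities can be sketched the same way. The example below assumes a hypothetical stream of `(customer, day, activity)` tuples and a hypothetical helper, `conversion_rate`. The logic is written once, and a different question is answered just by passing different activity names. It is a rough illustration, not how Narrator implements it.

```python
# Hypothetical activity stream as (customer, day, activity) tuples.
# Customers, days, and activity names are made up for this sketch.
stream = [
    ("c1", 1, "visited_site"),
    ("c1", 2, "started_trial"),
    ("c1", 5, "paid_invoice"),
    ("c2", 1, "visited_site"),
    ("c2", 4, "started_trial"),
    ("c3", 3, "visited_site"),
]


def conversion_rate(stream, first, then):
    """Share of customers who did `first` and did `then` at some later time."""
    # Earliest time each customer did the `first` activity.
    first_seen = {}
    for customer, day, activity in stream:
        if activity == first:
            first_seen[customer] = min(day, first_seen.get(customer, day))
    # Customers who did the `then` activity after that point.
    converted = {
        customer
        for customer, day, activity in stream
        if activity == then
        and customer in first_seen
        and day > first_seen[customer]
    }
    return len(converted) / len(first_seen) if first_seen else 0.0


# Same logic, different activities swapped in.
visit_to_trial = conversion_rate(stream, "visited_site", "started_trial")
trial_to_paid = conversion_rate(stream, "started_trial", "paid_invoice")
```

Because the function depends only on the shape of the stream and not on any one source's tables, it would carry over unchanged to another company whose data follows the same structure.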
I founded Narrator to make this possible. To allow data analysts to use the Activity Schema to answer any question. To swap activities and answer similar questions.
We then added the ability to build entire analyses on top of the Activity Schema. These analyses can be copied and pasted across companies.
They can also be used to answer so many more questions than the initial design.
Analysts can now also go from one company to another and start answering questions.
I see a world where all the data work is shared and everyone is working together to enable better decisions to be made.