The reference architecture that will kill your AI projects
Yes, you can blame the data governance, finance, and legal teams for this :)
Technologists everywhere are falling into a dangerous trap - trying to build the perfect unified data warehouse as the foundation for AI. For some context: I was doing this exercise for Pathway, comparing what large companies are doing in the streaming space with Databricks, Azure, and AWS, and I kept seeing the same pattern of heavy, rules-based data models.
These highly refined datasets are meant to feed reporting, dashboards, and AI. There is a big march (and big spend) towards keeping everything auditable, with a full trail and lineage of what happened before you reached this refined view of the data. (On a side note: for those who know data lakes, do you know why they call it the GOLD standard? Because it takes a LOT of gold to build and maintain.)
The Mythical "Single Source of Truth"
The assumption enterprises are moving forward with is that AI relies on a magical, complete dataset - a "single source of truth" that integrates all business data. But this is a fantasy.
In reality, most companies have hundreds of operational systems built up over decades. A 2023 survey found enterprise companies use over 780 distinct SaaS apps on average. With so many fragmented systems, seamlessly consolidating data is impossible.
Yet 61% of data warehouse projects aim to be that "single source of truth" - which stays forever just out of reach. These projects incur massive costs, with the average data warehouse now costing $20 million, yet the failure rate remains stuck at around 67-70%, with "lack of adoption" cited as the primary reason.
The Elusive Dream of "Golden" Data
Like modern-day alchemists, data teams try heroically to transmute messy operational data into perfect golden datasets for AI through cleansing and ETL wizardry.
But the dirty secret is that after this complex transformation pipeline, business users take the "pristine" data and immediately manipulate it in Excel using tribal knowledge to create reports, because sanitized, backward-looking data doesn't reflect the nuances and exceptions of daily operations.
What AI Really Needs - Reality
Unlike periodic reporting, AI makes millions of decisions in real time based on what is happening right now. The algorithms require massive amounts of comprehensive, fresh, raw data from across all systems to approximate the current state of the business, customers, and operations in the field.
Studies show that AI models relying on broad raw data sources consistently outperform those relying on curated datasets. The algorithms find the signals within the noise using deep learning, without requiring perfect data.
Inverting the Data Paradigm
To enable this, architectures must invert to make data the driver, instead of an afterthought. Data must become the digital reality powering intelligent algorithms making billions of operational decisions each day.
This requires rethinking data flows to funnel events from all sources into easily accessible raw data lakes, providing a fresh 360-degree view of business activity. Combined with the ability to query historical records in-place, this powers algorithms with comprehensive timely context.
Killing the Data Warehouse Dragon
The future is not monolithic, consolidated data warehouses completely separated from operational systems. We need to retire this old dragon 🐲 from our IT landscape and bring the consultant bills down with it. Let's reduce those unwanted Snowflake and Databricks bills too.
Instead, we must create configurable, decision-context-targeted datasets that give each AI application the specific slice of reality it needs, at scale. No more hoarding stale data gold that is useless in the real world.
Accurate Analytics is a Byproduct
This data inversion is daunting but removes the need for massive centralized data stores that become outdated dinosaurs.
It leads to an amazing upside - accurate analytics as a natural byproduct of AI-driven decisions, without endless data wrangling and mountains of ETL code. Reporting taps into the same curated views used to run operations and engage customers.
In Summary
The allure of a unified "golden" dataset is strong, but it leads to failure (at least 67% of the time). To enable AI, reject the false gods of data warehousing. Embrace the difficult path to decision-context, AI-ready architectures.