::DataRoles.ai THE CURRENT DATA STACK IS BROKEN

::The Whole Data Stack Industry Is Focused on the Wrong Problem

The tools designed to help are now the problem. Data engineers currently spend more time maintaining their stack's infrastructure, and learning the many tools meant to tame it, than working with data itself. In reality, most new data stack products exist to fix problems the stack itself created.

The Philosophy That Started It All

The modern data stack was built on one philosophy:

Ingest data sources as fast as possible. Store everything centrally. Figure out what to do with it later.

This is source-first thinking. It seemed obvious in 2012 when cloud data warehouses dropped costs from $100,000/year to $160/month. The race was on to move data faster, store more of it, and worry about value creation downstream.

Source-first. SQL-only. Developer-focused. These assumptions shaped every tool that followed.

"We got distracted by circular problems of our own making. We created pipelines to shuffle data around, and orchestrators to coordinate those pipelines, and observability dashboards to monitor the orchestrators, and incident managers to organize the observability incidents."

— Benn Stancil, Co-founder Mode Analytics

"Most BigQuery customers store less than 1TB."

— Jordan Tigani, BigQuery Founding Engineer

The Result

Data Engineers now spend more time fixing their stack and the issues it creates than actually working with data and delivering results.

Stat | Source
897 apps average per enterprise | MuleSoft 2025
Only 28% of apps integrated | MuleSoft 2025
40% of IT time spent on integration | MuleSoft 2025
70% of data leaders say stack is "too complex" | Modern Data 101
Most BigQuery customers store less than 1TB | Jordan Tigani

The industry built infrastructure for hyperscale when most teams needed simplicity. We optimized for "how fast can we ingest raw data" when the right question was "what data products does the business actually need?"

What Went Wrong

Source-first — Ingest everything, figure out value later. Storage is cheap, so why not? Because cheap storage created expensive complexity. You pay for every row ingested, every table stored, every query run—whether anyone uses the output or not.

SQL & Python only — The entire stack assumes you speak SQL or Python. Every transformation is a query or a script. Every answer requires a technical translator. Business users can't self-serve. Analysts become bottlenecks. The gap between question and answer is measured in days, not seconds.

Developer-focused — Tools built by engineers, for engineers. Configuration in YAML and JSON. Debugging in terminals and logs. Business users locked out entirely. The people closest to the data problems—operations, finance, marketing—can't touch the tools designed to solve them.

Every tool category that followed—ingestion, transformation, orchestration, quality, lineage, observability, reverse ETL—exists to patch a gap left by these original assumptions.

::Data Engineer Burnout

39% considering quitting — 78% wish their job came with a therapist

The Human Cost

The wrong philosophy didn't just create technical debt. It broke the people doing the work.

Data engineers were hired to work with data. Instead, they spend their days debugging Airflow DAGs, managing Kubernetes clusters, writing glue code between tools, and firefighting pipeline failures at 2 AM.

The tools designed to help became the full-time job.

"Data teams spent more time maintaining infrastructure than delivering insights."

— Modern Data 101

"Integration nightmares multiplied—teams became 'glue code developers.'"

— Modern Data 101
Stat | Source
39% of data engineers considering quitting due to burnout | Immuta
78% wish their job came with a therapist | Industry survey
77% report heavier workloads despite AI tools | Upwork 2024
67 monthly data incidents average per org | Wakefield Research 2023
15 hours average to resolve an incident (up from 5.5) | Wakefield Research 2023
40% of time spent on integration, not data work | MuleSoft 2025

The Daily Reality

Hired to build data products. Actually doing: YAML debugging. Credential rotation. Dependency conflicts. Version mismatches. Backfill jobs. Incident triage. Vendor management. Cost optimization. Security audits. Compliance documentation.

The "modern" in modern data stack didn't mean modern work. It meant more work.

::Data Quality

$12.9M/yr avg cost of poor data quality — 67% don't trust their data

The Problem

Data quality tools exist because ingestion doesn't validate. Bad data lands in your warehouse. You detect it after the fact—if you detect it at all. Quality becomes a separate purchase, a separate team, a separate problem.
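The alternative to after-the-fact detection is validating rows before they land. A minimal sketch of validation at ingestion time, assuming a hypothetical "orders" feed (the field names and rules below are illustrative, not any particular tool's API):

```python
# Hypothetical schema rules for an "orders" feed; names are illustrative.
RULES = {
    "order_id": lambda v: isinstance(v, str) and len(v) > 0,
    "amount":   lambda v: isinstance(v, (int, float)) and v >= 0,
    "currency": lambda v: v in {"USD", "EUR", "GBP"},
}

def validate(row: dict) -> list[str]:
    """Return the list of failed checks; an empty list means the row is clean."""
    errors = []
    for field, check in RULES.items():
        if field not in row:
            errors.append(f"missing:{field}")
        elif not check(row[field]):
            errors.append(f"invalid:{field}")
    return errors

def split_batch(rows: list[dict]) -> tuple[list[dict], list[tuple[dict, list[str]]]]:
    """Partition a batch into loadable rows and quarantined rows with reasons."""
    good, bad = [], []
    for row in rows:
        errors = validate(row)
        if errors:
            bad.append((row, errors))
        else:
            good.append(row)
    return good, bad
```

Bad rows go to quarantine with a reason attached instead of silently landing in the warehouse, which is the inversion of the detect-after-load pattern the quality tools sell.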

"Data quality is usually one of the goals of effective data management. Yet too often organizations treat it like an afterthought."

— Gartner

"Most organizations decide to address issues in a piecemeal fashion... No wonder this is only a tactical solution; sooner or later, we need to start working on another tactical project to resolve the issues caused by the previous tactical project."

— Dan Sutherland, Senior Director, Protiviti
Stat | Source
$12.9M annual cost of poor data quality | Gartner
67% don't trust their data for decisions | Precisely/Drexel 2024
64% say quality is top challenge (up from 50%) | Precisely 2024

::Consumption Pricing

62% exceeded cloud budget in 2024 — 86% of CIOs planning repatriation

"We're paying an at times almost absurd premium for the possibility that workloads could spike. It's like paying a quarter of your house's value for earthquake insurance when you don't live anywhere near a fault line."

— DHH, CTO 37signals (Basecamp/HEY)
Stat | Source
62% exceeded cloud budget in 2024 | Wasabi Index 2025
86% of CIOs planning some repatriation | Barclays CIO Q4 2024
$100B+ market cap lost to cloud costs | Andreessen Horowitz

Every major vendor uses consumption-based pricing. Warehouses charge per credit. Compute platforms charge per unit. Ingestion tools charge per row. The more you store and process, the more you pay.

"Close to half of cloud buyers spent more on cloud than they expected in 2023, with 59% anticipating similar overruns in 2024."

— Daniel Saroff, Group VP, IDC

Minimum charges punish small queries. Cloud infrastructure costs typically exceed platform charges by 50-200%. Pricing changes trigger overnight cost increases. CFOs can't forecast. Finance teams treat data as unpredictable expense.

The incentives are misaligned. Vendors profit from volume. Customers benefit from value. Storing everything "just in case" is expensive for you and profitable for them.
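The misalignment shows up in the arithmetic itself. A toy consumption-pricing model makes it concrete; every rate below is hypothetical and vendor-neutral:

```python
# Toy consumption-pricing model; all rates are hypothetical, not any vendor's.
PER_MILLION_ROWS_INGESTED = 100.0   # ingestion tool, per million rows synced
PER_COMPUTE_CREDIT = 3.0            # warehouse compute, per credit consumed
PER_TB_STORED_MONTHLY = 23.0        # warehouse storage, per TB per month

def monthly_cost(rows_millions: float, credits: float, tb_stored: float) -> float:
    """Cost is a function of volume processed, never of value delivered."""
    return (rows_millions * PER_MILLION_ROWS_INGESTED
            + credits * PER_COMPUTE_CREDIT
            + tb_stored * PER_TB_STORED_MONTHLY)
```

Note what is absent from the function signature: nothing about whether anyone uses the output. Doubling the rows you sync "just in case" doubles the ingestion line item whether or not a single dashboard reads them.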

Case Studies

37signals reduced AWS spend from $3.2M to $1.3M annually—projecting $10M+ savings over 5 years by leaving the cloud.

GEICO achieved 50% compute cost reduction and 60% storage cost reduction through cloud repatriation.

::AI / The Chatbot Problem

95% of AI pilots show zero P&L impact — Everyone built the same text-to-SQL chatbot

The Problem

Every data vendor bolted on the same AI feature: a chatbot that writes SQL. They call it a "copilot" or "analyst" or "assistant." They all do roughly the same thing—and they all have the same limitation: the underlying architecture wasn't designed for AI.

A natural language front-end doesn't make a complex system intelligent. It makes it slightly easier to use while masking the complexity underneath.
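One concrete failure mode of these chatbots is hallucinated identifiers: generated SQL that references tables which do not exist. A minimal guard is to resolve referenced names against the known catalog before execution. This is a regex-based sketch under hypothetical schema names; a real implementation would use a proper SQL parser rather than pattern matching:

```python
import re

# Hypothetical warehouse catalog the generated SQL must resolve against.
SCHEMA = {
    "orders":    {"order_id", "amount", "created_at"},
    "customers": {"customer_id", "email", "region"},
}

# Crude: captures the identifier after FROM/JOIN; no CTE or subquery handling.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([a-zA-Z_]\w*)", re.IGNORECASE)

def unknown_tables(sql: str) -> set[str]:
    """Return referenced tables that do not exist in the catalog."""
    return {t.lower() for t in TABLE_RE.findall(sql)} - set(SCHEMA)
```

If `unknown_tables` is non-empty, the query is rejected before it ever hits the warehouse, instead of failing at runtime or, worse, silently matching the wrong object.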

"When AI simply makes the product easier to use, that's AI-washing. A natural language front-end for a workflow—yes, technically it's AI—but it's just masking the complexity of the workflow with a more accessible customer interface."

— CMSWire

"LLMs can write SQL, but they are often prone to making up tables, making up fields, and generally just writing SQL that if executed against your database would not actually be valid."

— LangChain Documentation

"Right now, [agent] is being slapped on everything from simple scripts to sophisticated AI workflows. There's no shared definition, which leaves plenty of room for companies to market basic automation as something much more advanced."

— MIT Technology Review, July 2025

"They just don't work. They don't have enough intelligence, they're not multimodal enough."

— Andrej Karpathy, OpenAI Co-founder
Stat | Source
95% of AI pilot projects delivered no measurable P&L impact | MIT "GenAI Divide" Study, July 2025
30% of GenAI projects abandoned after POC | Gartner 2025
77% of employees say AI tools added to their workload | Upwork, July 2024
40% of European "AI startups" had no real AI | PwC/MMC Ventures
$400K in SEC fines for AI washing claims | SEC, March 2024
GenAI now in "Trough of Disillusionment" | Gartner Hype Cycle 2025
For every 33 AI POCs launched, only 4 reach production | IDC

The AI-Native vs AI-Augmented Distinction

AI-augmented/powered: traditional systems with AI layered on top. These retrofitted solutions depend on third-party APIs or cloud models, and the original architecture limits what the AI can actually do.

AI-native: Built from the ground up with intelligence as foundation—if you remove the AI, the product loses its core value.

Technical analysis shows AI-native architectures demonstrate 2-5x performance improvements in latency and throughput compared to bolted-on systems.

::Data Sovereignty

84% of European orgs using/planning sovereign cloud — GEICO repatriated to cut costs 50-60%

The Problem

Data sovereignty is no longer optional. Regulations require data to stay within borders. Cloud providers operate across jurisdictions. Compliance teams can't guarantee where data lives, who can access it, or which government can subpoena it.

The cloud promised flexibility. For regulated industries, it delivered risk.

"The European cloud market is growing exponentially; however, the share held by European providers is diminishing. This trend poses a significant concern for Europe's technological sovereignty."

— Manuel Mateo Goyet, DG Connect, European Commission

"Ten years into that cloud journey, GEICO still hadn't migrated everything to the cloud, their bills went up 2.5X, and their reliability challenges went up quite a lot too."

— Rebecca Weekly, VP Platform Engineering, GEICO
Stat | Source
84% of European orgs using/planning sovereign cloud | IDC 2024
86% of CIOs planning some repatriation | Barclays CIO Q4 2024
GEICO: 50% compute reduction, 60% storage reduction | The Stack 2024

::DevOps & Developer

SQL & Python required — Business users locked out entirely

The Problem

The entire data stack assumes you're a developer. Every tool requires SQL or Python. Configuration lives in YAML and JSON. Debugging happens in terminals and logs.

If you can't code, you can't participate. The people closest to the data problems—operations, finance, marketing—are locked out of the tools designed to solve them.

"The 'freedom to choose' that once characterized the Modern Data Stack is quietly giving way to a controlled substrate that vendors can both standardize and monetize."

— Modern Data 101
Stat | Source
897 apps average per enterprise; only 28% of apps integrated | MuleSoft 2025
70% of data leaders say stack is "too complex" | Modern Data 101

Who Gets Left Behind

Business users can't self-serve—they file tickets. Analysts become translators instead of analysts. Operations teams closest to data problems can't touch the tools. Finance waits days for reports that should take seconds.

The people who need data most are furthest from it.

The Stack They Expect You To Know

SQL. Python. YAML. JSON. Git. Docker. Kubernetes. Terraform. Airflow. dbt. Spark.

Each tool has its own syntax, its own mental model, its own failure modes. The barrier to entry isn't just high—it's intentionally technical.

::Orchestration

70% rate pipeline management "complex" — Airflow = "PHP of data sector"

The Problem

Orchestration tools exist because pipeline components don't coordinate themselves. You need a central scheduler to manage dependencies, sequence tasks, and handle failures. The scheduler becomes another system to maintain, debug, and keep running.
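Stripped of the operational machinery, the core of what a scheduler does is a topological sort over task dependencies. A minimal sketch using Python's standard library (the task names are hypothetical; real orchestrators add retries, backfills, state, and alerting on top of exactly this ordering problem):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "load": {"extract"},
    "transform": {"load"},
    "quality_checks": {"transform"},
    "publish_dashboard": {"quality_checks"},
}

def run_order(dag: dict[str, set[str]]) -> list[str]:
    """Return a valid execution order; raises CycleError on circular deps."""
    return list(TopologicalSorter(dag).static_order())
```

Everything an orchestrator charges you to run, debug, and keep alive sits on top of these few lines of graph theory.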

"All in all, Airflow is far from perfect, and many of us have merely learned to deal with its limitations."

— Ben Rogojan (Seattle Data Guy), Independent Consultant

"Despite any criticism, Airflow continues to play a pivotal role, much like PHP of the data sector—often criticized but extensively relied upon."

— Ben Rogojan
Stat | Source
70% rate pipeline management "complex" | Matillion 2025
89% of Airflow users expect more revenue-generating or external solutions this year | Astronomer 2026
32% of Airflow users have GenAI or MLOps in production | Astronomer 2026

::Observability

67 monthly data incidents avg — 74% say business finds issues first

The Problem

Observability tools exist because pipelines fail silently. Something breaks at 2 AM. Nobody notices for three days. Data is missing. Downstream systems already consumed what was there and made decisions on it.
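The most basic antidote to silent failure is a freshness contract: each table declares how stale it is allowed to get, and anything past its budget is flagged. A minimal sketch with hypothetical table names and SLAs:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness contracts: table -> maximum tolerated staleness.
SLAS = {
    "orders": timedelta(hours=1),
    "customers": timedelta(hours=24),
}

def stale_tables(last_loaded: dict[str, datetime], now: datetime) -> list[str]:
    """Tables whose last successful load is older than their SLA allows.
    A table with no recorded load at all is treated as stale."""
    never = datetime.min.replace(tzinfo=timezone.utc)
    return [table for table, sla in SLAS.items()
            if now - last_loaded.get(table, never) > sla]
```

Run on a schedule, this turns "nobody notices for three days" into an alert within one SLA window, which is the whole value proposition of the observability category in a dozen lines.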

"Looking past the market fragmentation and maturity, there is significant demand among data and analytics leaders to address their growing data operations complexity."

— Gartner Market Guide for DataOps 2024
Stat | Source
67 monthly data incidents average; 15 hours to resolve; 74% say business finds issues first; 31% of revenue impacted by data issues | Wakefield Research 2023
61 data incidents per month on average; 40% of the workday spent firefighting bad data | Monte Carlo 2022

::Lineage

80% of data governance initiatives will fail by 2027

The Problem

Lineage tools exist because nothing tracks where data came from. They reconstruct history by parsing SQL queries and scanning logs. The reconstruction is incomplete, often wrong, and requires constant maintenance.
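To see why reconstruction falls short, here is the essence of query-parsing lineage as a deliberately crude sketch. Its blind spots are in the comments, and they are the same blind spots the commercial tools fight with far more sophisticated parsers:

```python
import re

# Captures qualified names after FROM/JOIN; illustrative only.
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def upstream_tables(sql: str) -> set[str]:
    """Best-effort lineage from query text alone.
    Misses: CTE names mistaken for tables, dynamic SQL built at runtime,
    stored procedures, and any load that never passes through a parsed
    query -- which is exactly why reconstructed lineage stays incomplete."""
    return {name.lower() for name in SOURCE_RE.findall(sql)}
```

Anything the parser cannot see simply does not exist in the lineage graph, so the "constant maintenance" the text describes is really a permanent game of catch-up with query patterns the parser has never met.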

"By 2027, 80% of data and analytics governance initiatives will fail."

— Saul Judah, VP Analyst, Gartner

"We are extremely adept at generating data, not so much at extracting value from those data, and very challenged to destroy any data at all. Data hoarding, data sprawl, and data decay are all significant problems."

— IDC
Stat | Source
80% of D&A governance to fail by 2027 | Gartner
Only 25% measure data quality metrics | Precisely 2024

::Ingestion

95% report integration barriers — 897 apps avg, only 28% integrated

The Problem

Ingestion tools move data from source systems to warehouses. They promise "connect once, sync forever." Reality: schema changes break pipelines, API rate limits cause data loss, and you're charged per row whether the data is useful or not.
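Schema drift, the most common of those breakages, is easy to state precisely: compare the columns the pipeline was built for against what the source returned today. A minimal sketch (column names and types below are hypothetical):

```python
def schema_drift(expected: dict[str, str], observed: dict[str, str]) -> dict[str, list[str]]:
    """Compare the columns a pipeline was built for (name -> type)
    against what the source API actually returned on this sync."""
    return {
        "added":   sorted(set(observed) - set(expected)),
        "removed": sorted(set(expected) - set(observed)),
        "retyped": sorted(col for col in set(expected) & set(observed)
                          if expected[col] != observed[col]),
    }
```

A removed or retyped column is what breaks the downstream pipeline; detecting it at sync time is the difference between a controlled failure and the silent data loss the paragraph above describes.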

"In 2017, Y Combinator funded 15 analytics, data engineering, and AI/ML companies. In 2021, they funded 100. It's impossible to make sense of this many tools, much less manage even a fraction of them in a single stack."

— Benn Stancil
Stat | Source
95% report integration challenges as barriers to AI | Salesforce 2024
70% of data workers rate pipeline management "complex" | Matillion 2025

::CDC (Change Data Capture)

89% report pipeline scaling issues — Kafka operational overhead

The Problem

CDC tools track changes in source databases. They require database-level access, often involve Kafka clusters for streaming, and create operational overhead that exceeds the original data engineering problem.
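The query-based variant of CDC is simple enough to sketch in full, and its limitations are structural rather than fixable. Assuming source rows carry an `updated_at` column (a common convention, not a given):

```python
from datetime import datetime

def poll_changes(rows: list[dict], watermark: datetime) -> tuple[list[dict], datetime]:
    """Query-based CDC: select rows whose updated_at passed the watermark.
    Limitations baked into the approach itself:
      - latency is bounded below by the poll interval
      - hard deletes never appear (the row is simply gone from the source)
      - multiple updates between polls collapse into one observed change
    """
    changed = [row for row in rows if row["updated_at"] > watermark]
    new_watermark = max((row["updated_at"] for row in changed), default=watermark)
    return changed, new_watermark
```

Log-based CDC exists precisely because of those gaps, and it is what drags in the database-level access and Kafka clusters the paragraph above mentions.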

"The thing to know about merchants of complexity is that they never go away, they merely migrate. From WS-DeathStar to microservices to premature k8s to auth services to GraphQL and beyond."

— DHH, CTO 37signals
Stat | Source
89% report scaling issues with pipelines | Matillion 2025
20-30% write path overhead for trigger-based CDC | System Overflow
10-60 seconds latency for query-based CDC; deletes can be missed | System Overflow

::Reverse ETL

Exists because the stack moves data the wrong direction

The Problem

Reverse ETL tools exist to get data OUT of warehouses—because the whole stack moved data one direction, and now you need it back in Salesforce, HubSpot, and the other tools where work actually happens.
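At its core, a reverse ETL sync is a field mapping from warehouse columns to the destination's object model, applied before anything is pushed over the SaaS API. A minimal sketch; the field names below (including the Salesforce-style `__c` suffixes) are hypothetical:

```python
# Hypothetical mapping from a warehouse model to a CRM object's fields.
FIELD_MAP = {
    "customer_id": "external_id",
    "lifetime_value": "ltv__c",
    "churn_risk": "churn_score__c",
}

def to_crm_payloads(rows: list[dict]) -> list[dict]:
    """Rename warehouse columns to the destination's field names,
    skipping columns absent from a given row."""
    return [{dest: row[src] for src, dest in FIELD_MAP.items() if src in row}
            for row in rows]
```

That this renaming round trip is a product category at all is the point: data was pulled out of the operational tools only to be mapped and pushed back into them.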

"The problem wasn't the market around the products we were building; the problem was the product itself."

— Benn Stancil
Stage | Tools Involved
Source → Warehouse | Fivetran, Stitch, Airbyte
Warehouse → Transform | dbt, Dataform
Warehouse → Destination | Census, Hightouch, Polytomic
Data Warehouses | Databricks, Snowflake

::Final words

The current data stack is fragmented, expensive, and operationally heavy.

The same tools that were meant to simplify data work have created new layers of complexity. Teams spend more time wiring systems together, managing vendor sprawl, and handling operational overhead than they do improving the actual data products people rely on.

The result is more cost, more risk, and less control over data, governance, and compliance. That is the core problem this report has been pointing at throughout.