::Current Data Stack AI: LLMs Are Not What You Think
Every AI product in the data space right now is bolted-on augmentation of existing workflows using generative AI techniques. Not job replacement. Not autonomous systems that work like a real team member: able not only to respond and react, but to work ambiently alongside the team and assign work where needed.
DATA-FOCUSED, LLM-BASED PRODUCTS THAT DIRECTLY TOUCH DATA SOURCES ARE A RISK
The marketing says "AI-powered data pipeline." The reality is a language model that guesses SQL, hallucinates column names, and has no concept of your business rules. When it works, it saves time; when it fails, and it fails often, it corrupts data, exposes credentials, and creates compliance violations that take weeks to unwind.
Letting an LLM have broad access to your data and metadata is not innovation. It is a massive data security risk being sold as a feature.
::Hallucinations Are Structural, Not Fixable
LLM hallucinations are not bugs that will be patched in the next release. They are an architectural property of how language models work: predicting the next likely token, not verifying truth.
Hallucinations arise from mismatches between what is predictable from text patterns and what is essentially arbitrary, low-frequency truth. Some facts lack sufficient signal in training distributions. The "just add more data" executive instinct fails; more data can reduce some errors, but it doesn't eliminate the structural incentive to guess.
For data engineering this is catastrophic. A hallucinated column name silently breaks a pipeline. A fabricated SQL join produces results that look correct but aren't. A confident-but-wrong transformation corrupts downstream data products without triggering any error.
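The defense is mechanical: validate every generated identifier against the real schema before anything executes. A minimal sketch using sqlglot, with a hypothetical schema map that would normally be loaded from information_schema:

```python
# Validate LLM-generated SQL against the real schema before execution.
import sqlglot
from sqlglot import exp

# Hypothetical schema map; in practice, load this from information_schema.
SCHEMA = {
    "orders": {"order_id", "customer_id", "total", "created_at"},
    "customers": {"customer_id", "segment", "region"},
}

def validate_generated_sql(sql: str) -> list[str]:
    """Return problems found; an empty list means every identifier resolves."""
    problems = []
    tree = sqlglot.parse_one(sql, dialect="postgres")
    tables = {t.name for t in tree.find_all(exp.Table)}
    problems += [f"unknown table: {t}" for t in sorted(tables - SCHEMA.keys())]
    known_columns = set().union(*(SCHEMA.get(t, set()) for t in tables), set())
    problems += [
        f"unknown column: {c.name}"
        for c in tree.find_all(exp.Column)
        if c.name not in known_columns
    ]
    return problems

# A hallucinated column is caught before it ever reaches the warehouse:
print(validate_generated_sql("SELECT customer_name, total FROM orders"))
# -> ['unknown column: customer_name']
```

If validation fails, the statement never reaches the warehouse; the hallucination stays a string.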
Today's LLMs are trained to produce the most statistically likely answer, not to assess their own confidence.
Duke University Libraries (Jan 2026)
Since model behaviour can be hard to understand or predict, it is challenging to foresee or confidently rule out specific failures.
International AI Safety Report (2026)
::AI Needs Humans
The most credible enterprise AI lesson of 2025-2026 is not autonomy. It is supervision. AI can accelerate work, but it still needs humans, governance, and operating models around it.
The strongest real-world deployments are not "set the agent loose and hope." They are human-in-the-loop systems with approvals, guardrails, and clear ownership. That matters even more in data engineering, where a bad decision does not stay on screen. It lands in pipelines, models, dashboards, and customer-facing systems.
Enterprise teams are learning that AI success is mostly organizational discipline: who approves outputs, how failures are caught, and where the model is allowed to act. Without that structure, "agents" are just fast ways to create expensive mistakes.
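In code, that structure is unglamorous: the agent proposes, a human disposes. A minimal sketch, assuming actions arrive as plain objects; notify_reviewer and apply_to_production are hypothetical stand-ins for your own hooks:

```python
# The agent can only propose; a named human must approve before anything runs.
from dataclasses import dataclass

@dataclass
class ProposedAction:
    description: str
    payload: dict
    approved_by: str | None = None

def notify_reviewer(action: ProposedAction) -> None:
    # Hypothetical hook: page a human, open a ticket, post to a channel.
    print(f"REVIEW NEEDED: {action.description}")

def apply_to_production(action: ProposedAction) -> None:
    # Hypothetical hook: only reachable through approve(), never by the agent.
    print(f"APPLIED ({action.approved_by}): {action.description}")

class ApprovalQueue:
    def __init__(self) -> None:
        self.pending: list[ProposedAction] = []

    def propose(self, action: ProposedAction) -> None:
        self.pending.append(action)    # the model's output stops here
        notify_reviewer(action)

    def approve(self, action: ProposedAction, reviewer: str) -> None:
        action.approved_by = reviewer  # clear ownership: who said yes
        self.pending.remove(action)
        apply_to_production(action)    # only now does anything take effect

queue = ApprovalQueue()
action = ProposedAction("backfill orders.total for March", {"job": "bf-031"})
queue.propose(action)                    # -> REVIEW NEEDED: ...
queue.approve(action, reviewer="dana")   # -> APPLIED (dana): ...
```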
Successful implementation and scaling of enterprise AI projects is fundamentally a people and operating model challenge, not just a technology problem.
Stack Overflow / IBM (Jan 2026)
The AGI discussion rapidly became kind of passé. The focus shifted to narrowly scoped vertical areas where agents actually do great work.
Stack Overflow AI Agents Retrospective (Mar 2026)
::Don't Let AI Access Data
If an AI system can query databases, modify records, or trigger production workflows, its mistakes stop being suggestions. They become incidents.
Tool-enabled LLMs are one of the riskiest parts of enterprise AI adoption because they collapse the gap between bad output and real-world impact. A hallucinated answer on screen is one problem. A hallucinated SQL statement, package name, deployment step, or update against a live system is much worse.
Data engineering learned long ago that production systems need validation, approvals, rollback, and clear execution boundaries. AI does not remove that need. It raises the cost of getting it wrong.
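One concrete execution boundary: parse every statement the agent submits and refuse anything that is not a plain read. A sketch using sqlglot, with run_readonly as a hypothetical stand-in for a client pointed at a read-only replica:

```python
# Hard execution boundary for a tool-enabled agent: parse, don't pattern-match,
# and refuse any statement that is not a plain SELECT.
import sqlglot
from sqlglot import exp

def run_readonly(sql: str):
    # Hypothetical stand-in: execute against a read-only replica.
    print(f"executing read-only: {sql}")

def guarded_execute(sql: str):
    tree = sqlglot.parse_one(sql)
    if not isinstance(tree, exp.Select):
        # INSERT, UPDATE, DELETE, DROP, MERGE... all land here.
        raise PermissionError(f"agent may only read; got {type(tree).__name__}")
    return run_readonly(sql)

guarded_execute("SELECT count(*) FROM orders")   # runs
try:
    guarded_execute("DROP TABLE orders")
except PermissionError as err:
    print(err)   # agent may only read; got Drop
```

Belt and braces: enforce the same rule at the database layer with a read-only role, so this guard is never the only thing between the model and production.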
When a model can trigger actions such as querying databases, modifying records, sending messages, or deploying resources, the consequences of manipulation increase dramatically.
OWASP Top 10 for LLM Applications
Depending on the prompt, study, or benchmark, 29% to 45% of AI-generated code contains security vulnerabilities.
Veracode (2025)
::LLM Data Exposure Risk
Once sensitive data is pasted into a chatbot or exposed through an LLM-connected workflow, the risk is no longer theoretical. It becomes a data-exposure problem with security, privacy, and compliance consequences.
Employees treat chatbots like private workspaces even when the underlying system may log, retain, or route prompts through external services. That creates a direct exposure path for source code, credentials, internal URLs, personal data, and confidential business information.
The technical risk also extends beyond copy-paste misuse. Prompt injection, system prompt leakage, and insecure tool connections can all expose data that the model should never reveal. In practice, LLM convenience often collapses the boundary between "ask a question" and "expose something sensitive."
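A first line of defense is boring and effective: scan outbound prompts before they leave the network. A sketch with illustrative patterns only; a real deployment pairs this with proper DLP tooling:

```python
# Scan outbound prompts before they reach an external LLM service.
# Patterns are illustrative, not exhaustive.
import re

PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key":    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "email":          re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "internal_host":  re.compile(r"\b[\w.-]+\.internal\b"),
}

def scrub_prompt(prompt: str) -> tuple[str, list[str]]:
    """Redact known-sensitive spans and report what was found."""
    findings = []
    for label, pattern in PATTERNS.items():
        if pattern.search(prompt):
            findings.append(label)
            prompt = pattern.sub(f"[REDACTED:{label}]", prompt)
    return prompt, findings

clean, hits = scrub_prompt(
    "debug: key=AKIAABCDEFGHIJKLMNOP host=warehouse.corp.internal")
print(hits)   # ['aws_access_key', 'internal_host'] -> block or audit first
```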
13% of employee-submitted prompts to GenAI chatbots contain security or compliance risks, potentially exposing businesses to security breaches, regulatory violations, and reputational damage.
Lasso Research (2025)
Prompt injection can lead to unauthorized data access and exfiltration when attackers manipulate an LLM through crafted inputs.
OWASP GenAI Project (2025)
::Most AI Augments
The strongest enterprise AI outcomes still come from augmentation, not replacement. AI helps people move faster, but human judgment, review, and domain context remain the difference between useful output and expensive noise.
For data engineering, augmentation is the realistic model. AI can draft queries, suggest transformations, summarize logs, or surface anomalies faster than a person working alone. But it still lacks business context, accountability, and the judgment required to decide what should actually ship.
The winning pattern is not "remove the human." It is designing workflows where humans review, approve, and redirect AI output before it affects production systems, decisions, or customer-facing data products.
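Concretely, that means AI output is staged and inspected before it ships. A dry-run sketch, assuming pandas; ai_suggested_transform is a hypothetical function the assistant generated and a reviewer is evaluating:

```python
# Dry-run an AI-suggested transformation on a copy and surface what actually
# changes, so a human approves facts rather than vibes.
import pandas as pd

def dry_run(df: pd.DataFrame, transform) -> pd.DataFrame:
    result = transform(df.copy())  # never mutate the production frame
    print(f"rows: {len(df)} -> {len(result)}")
    print(f"columns added:   {sorted(set(result.columns) - set(df.columns))}")
    print(f"columns dropped: {sorted(set(df.columns) - set(result.columns))}")
    return result  # reviewed output; still nothing has shipped

def ai_suggested_transform(d: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical: the assistant proposed dropping "inactive" customers.
    return d[d["status"] == "active"]

df = pd.DataFrame({"customer_id": [1, 2, 3],
                   "status": ["active", "inactive", "active"]})
dry_run(df, ai_suggested_transform)   # rows: 3 -> 2, no column changes
```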
Human and AI agent collaboration delivered up to a 70% boost in work completion compared with agents working alone.
Upwork Human+Agent Productivity Index (2025)
Oversight works when organizations integrate it into product design, instead of tacking it on at launch.
BCG (2025)
::Data Quality at Scale
The quality of the information LLMs train on is often terrible. Poor input data is not new, but when the garbage generator runs at millions of tokens per second and sounds confident, the damage multiplies.
AI code tools do not just generate weak code. They also fabricate dependencies, invent APIs, and produce output that looks plausible enough to be trusted. At scale, that turns ordinary quality issues into supply-chain risk and production instability.
Teams building agents are learning the same lesson quickly: quality is not a side issue. It is the main production barrier. If you cannot trust the output, speed only makes the problem bigger.
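One cheap control: confirm that every recommended dependency actually exists before anyone types pip install. A standard-library sketch against PyPI's public JSON endpoint; note that existence is necessary but not sufficient, since typosquats exist too:

```python
# Check that every dependency an assistant recommends actually exists on PyPI
# before it gets anywhere near `pip install`. Standard library only.
import urllib.error
import urllib.request

def package_exists(name: str) -> bool:
    url = f"https://pypi.org/pypi/{name}/json"   # PyPI's public JSON endpoint
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False   # 404: the package was hallucinated

for pkg in ("requests", "definitely-not-a-real-package-xyz"):
    print(pkg, "->", "exists" if package_exists(pkg) else "DOES NOT EXIST")
```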
Quality is the production killer, with 32% citing it as a top barrier. Meanwhile, 89% of organizations have implemented some form of observability for their agents.
LangChain State of Agent Engineering (2026)
19.7% of all recommended packages in the study didn't exist.
Socket citing "We Have a Package for You!" (2025)
::Metadata Exposure Problem
Giving an LLM access to your database schema, table names, and column names is giving it your entire data model. Metadata is not "just technical information"; it reveals business logic, customer segmentation strategies, pricing models, and competitive intelligence.
When you connect an AI agent to your warehouse so it can answer questions, you are giving it access to metadata that explains how your data is structured, related, classified, and used. That context is exactly what makes the system useful, and exactly what makes the exposure risky.
In practice, metadata can reveal sensitive fields, business rules, ownership, lineage, dependencies, and system relationships. Even when raw data is not directly exposed, the metadata layer can still disclose how the business works.
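The mitigation is minimization: hand the model a curated slice of the catalog, never the live information_schema. A minimal sketch with hypothetical table and column names:

```python
# Hand the model a curated slice of the catalog, never the live
# information_schema. Table and column names here are hypothetical.
EXPOSED_SCHEMA = {
    "orders": ["order_id", "status", "created_at"],
    # Deliberately absent: customers.segment, pricing_rules, churn_scores...
}

def schema_context_for_llm() -> str:
    """Render the minimal schema description that goes into the prompt."""
    return "\n".join(
        f"table {table}: columns {', '.join(columns)}"
        for table, columns in EXPOSED_SCHEMA.items()
    )

print(schema_context_for_llm())
# -> table orders: columns order_id, status, created_at
```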
Technical metadata for a relational database might describe the structure of tables, data types and relationships between tables.
IBM Think (2026)
Metadata answers all these questions. It transforms cryptic database tables into documented, trustworthy business assets.
Atlan (2026)
::Regulatory Implications
Processing personal data through AI chatbots requires a lawful basis under GDPR. Most chatbot deployments don't have one. The EU AI Act adds data governance requirements for high-risk systems. The regulatory walls are closing in.
The Italian Data Protection Authority blocked ChatGPT in 2023 and later fined OpenAI €15 million in December 2024 over personal-data processing, transparency, and legal-basis failures. That is not a theoretical warning. It is a live example of regulators treating AI data handling as an enforceable compliance issue.
The pressure is also structural, not just case-by-case. Under the EU AI Act, high-risk systems must follow documented data-governance and risk-management requirements. So the compliance burden is moving in one direction: toward more documentation, more controls, and less tolerance for vague claims about how AI uses data.
We're crafting easily administrable remedies with bright-line rules on the development, use and management of AI inputs. Firms cannot use claims of innovation as cover for law breaking.
Lina Khan, FTC (Feb 2024)
Meta is basically saying that it can use 'any data from any source for any purpose and make it available to anyone in the world', as long as it's done via 'AI technology'. This is clearly the opposite of GDPR compliance.
Max Schrems, NOYB (Jun 2024)
::Final Words
AI systems are most dangerous when they are given direct access to data, workflows, and production systems without clear limits.
The safest deployments keep humans in the loop, constrain the model’s scope, and treat every generated response as something that still needs review before it can affect data or operations.
The difference between a useful system and a risky one is not the presence of AI itself. It is whether the platform around it enforces boundaries, accountability, and control.