Where AI Fits in the Modern Data Stack

I've watched organisations approach AI as if it were a purchasing decision rather than an engineering challenge. Teams evaluate models, compare vendors, and prototype demos while their data pipelines remain brittle, their governance unclear, and their semantic definitions inconsistent. The result is predictable: technically sophisticated AI systems that produce unreliable outputs because the foundations beneath them were never solid.

The reality I've learned is that AI capability emerges from platform fundamentals, not from model sophistication alone. AI belongs where data is trusted, governed, discoverable, and operationally sound. Without that foundation, AI adds noise faster than value.

Why the question matters now

Organisations face genuine pressure to integrate AI into products, analytics, and internal workflows. Executives want AI-enabled features. Teams want copilots for their daily work. Customers expect intelligent experiences. This creates tension: how do you move fast on AI while ensuring the underlying data architecture can support it?

The teams that struggle treat AI as an isolated experiment, something the data science team does in a corner until they throw a model over the wall. The teams that succeed understand that AI is a layer in the same stack that serves analytics, powers dashboards, and feeds operational systems. The architecture, governance, and operational patterns that make data platforms useful are the same ones that make AI systems reliable.

What people get wrong

The most common mistake is starting with model selection. Teams spend weeks comparing GPT-4 to Claude to Llama when their real problem is that their data pipelines break silently, their feature definitions differ between tools, and nobody knows who owns the customer table.

They also underestimate the operational burden. AI applications require versioning, monitoring, access controls, and feedback loops just like any production system. Treating an LLM application as somehow simpler than a traditional ML pipeline because "it's just an API call" ignores the reality that production reliability requires engineering discipline regardless of what's behind the endpoint.

Finally, they treat AI as separate from data engineering. The embedding pipeline, the vector store, the retrieval logic - these are data engineering concerns. The teams that succeed embed AI capabilities into their existing data platforms rather than building parallel shadow systems.

Where AI actually fits

To understand AI's place in the modern data stack, I've found it helpful to think through each layer and how AI extends or depends on it.

Data ingestion and collection. AI systems often need access to unstructured data - documents, support tickets, code repositories - that traditional analytics might ignore. The ingestion layer must handle this diversity without creating separate, ungoverned copies. Streaming pipelines become more important when AI applications need near-real-time context.

Storage and modeling. This is where one of the key architectural decisions emerges: unified versus split architecture. In a unified approach, you store vectors alongside structured data in the same system - PostgreSQL with pgvector, MongoDB Atlas with native vector search. This gives you consistency, simpler operations, and ACID guarantees. It works well for moderate scale (hundreds of thousands to low millions of vectors) and teams that value operational simplicity. In a split architecture, you use dedicated vector databases like Pinecone or Weaviate alongside your relational store. This gives you better performance at billions of vectors, specialised indexing, and independent scaling. The cost is operational complexity: synchronisation, dual consistency models, and more moving parts. I've seen teams default to dedicated vector stores when pgvector would have sufficed, and I've seen teams try to squeeze massive scale into unified systems that couldn't handle it. The right choice depends on your scale, team capabilities, and consistency requirements - not on what's trending.

Transformation and context preparation. AI-native ETL looks different from traditional ETL. Instead of field mappings and type conversions, you're doing document chunking, embedding generation, and vector storage. The transformations involve external API calls with rate limits, token costs, and different failure modes than database operations. A pipeline that processes a million records in an hour with CPU and memory might take a full day and cost hundreds in API fees when embeddings are involved. This changes how you design error handling, retry logic, and batch sizing. Chunking strategy - how you split documents into retrievable units - is one of the most consequential decisions in a RAG system, yet it's often treated as an implementation detail.

Metadata, catalog, and lineage. AI systems amplify the value of good metadata. When an LLM generates SQL, it needs to know what columns mean, how tables relate, and what filters are valid. The semantic layer - your metrics definitions, dimensional hierarchies, business glossaries - becomes critical context. I've seen AI copilots grounded in well-documented semantic layers produce dramatically more reliable results than those querying raw tables. Lineage matters too: when you need to explain why an AI system produced a particular output, you trace back through retrieved chunks to source documents to upstream systems.

Governance, privacy, and access control. AI introduces new governance requirements. Data contracts - explicit agreements between producers and consumers about schema, semantics, and quality - become essential when multiple AI components interact. Schema validation at boundaries prevents silent failures. Access controls must extend to vector stores and embedding pipelines. Training data provenance, consent management for ML, and the unique privacy challenges of models that can memorise training data all require extending governance patterns, not abandoning them.

Retrieval and serving. Feature stores like Tecton or Feast serve ML features for model training and inference. Vector stores serve embeddings for semantic search. Both require low-latency access, versioning, and lineage tracking. Production RAG systems typically use hybrid search - combining vector similarity with keyword filtering - to get both semantic relevance and exact matching for technical terms or identifiers. The retrieval layer determines what context reaches the model, which means it determines what the model can possibly know.

Evaluation, observability, and feedback. Traditional monitoring - uptime, latency, error rates - is necessary but insufficient for AI systems. You need retrieval accuracy metrics, embedding drift detection, prompt quality evaluation, and human feedback signals. LLM-as-judge patterns use a secondary model to score responses on dimensions like groundedness, correctness, and safety. Training-serving skew remains the silent killer of production ML: the model learned from cleaned, transformed training data but runs on raw production data. These observability requirements extend your data platform's monitoring capabilities rather than replacing them.

Application layer. Agents, copilots, and AI-enabled products sit at the top. They orchestrate calls to retrieval systems, models, and tools. They manage memory - short-term context, working memory for intermediate artifacts, long-term knowledge. They enforce policies about what the AI is allowed to do. This layer depends on everything beneath it being reliable, observable, and well-governed.

The relationship between analytics, ML, and AI

These aren't separate universes. Analytics produces the business metrics and dimensions that ground AI outputs in reality. ML provides the structured prediction patterns that AI systems can build upon. AI adds natural language interfaces, reasoning capabilities, and content generation.

But they share foundations. All three need trusted data, clear definitions, reliable pipelines, and strong governance. AI doesn't eliminate the need for analytics - if anything, it makes semantic consistency more important. When an AI copilot answers "what was our Q3 revenue?" it needs the same metric definition your CFO sees in the board deck. AI doesn't replace the need for ML infrastructure - embedding pipelines, feature stores, and model monitoring are still essential.

What a good modern stack looks like

From what I've observed, stacks that support AI credibly share certain qualities:

They have trusted, well-governed data. Data quality isn't perfect, but it's measured, monitored, and improving. There's clear ownership. Access is controlled and auditable.

They have clear ownership. Someone owns the customer table. Someone owns the revenue metric. Someone is accountable when pipelines break or definitions drift.

They have reusable pipelines. Patterns that work for the tenth use case worked for the first. New data sources don't require heroics. Embedding pipelines follow the same orchestration and error-handling patterns as traditional ETL.

They have discoverable assets. People can find the data they need, understand what it means, and know if they can trust it. Documentation isn't perfect, but it's present and maintained.

They have reliable serving and retrieval patterns. Feature stores serve ML features. Vector stores serve embeddings. Both have appropriate SLAs, monitoring, and fallback behaviors.

They have observability across data and AI workflows. You can trace from an AI output back to the retrieved chunks, to the source documents, to the upstream systems. You can detect when data drift will degrade model performance before users complain.

Practical examples

RAG systems on governed internal data. The most successful implementations I've seen don't just dump documents into a vector store. They use hybrid search - vector similarity plus keyword filtering. They store metadata like tenant IDs, document freshness, and source authority. They apply the same access controls that govern the original documents. They validate retrieval accuracy and track which chunks actually get used.

AI copilots grounded in semantic layers. Natural language interfaces that query metrics defined in dbt's semantic layer or Looker's LookML produce more reliable results than those generating SQL against raw tables. The semantic layer provides business definitions, valid filter values, and join paths that the AI can't hallucinate.

Feature stores supporting real-time inference. Tecton or Feast serve pre-computed features for ML models. The same infrastructure serves context for AI systems. Online stores provide sub-100ms lookups. Offline stores generate training datasets. Feature lineage shows which models depend on which data.

Common mistakes

Treating AI ETL like traditional ETL. Not accounting for embedding API costs, rate limits, and the different failure modes of external services. A pipeline that works fine for 10,000 records might take days and cost hundreds for a million when embeddings are involved.

Choosing architecture without understanding trade-offs. Defaulting to dedicated vector databases when pgvector would suffice, or vice versa. The decision should be driven by scale, consistency requirements, and team capabilities - not by what's trending on Twitter.

Building AI pilots with no production path. Demos that work on cleaned sample data but haven't solved for the messy reality of production data, access controls, or observability.

Weak access controls around sensitive data. Vector stores often contain embedded documents with the same sensitivity as the originals, but they're treated as less critical because "it's just vectors."

Ignoring data contracts and schema evolution. Allowing schema drift between training and serving environments. Not validating LLM outputs against expected schemas. Treating prompt changes as minor updates rather than production code changes.

Bolting AI onto fragmented systems. Trying to add AI capabilities before basic data quality, ownership, and governance are in place. AI magnifies data problems rather than solving them.

My point of view

I've learned that AI capability emerges from platform fundamentals, not from chasing the newest model. The organisations actually succeeding with enterprise AI in production - the ones moving beyond demos to durable, scaled capabilities - are almost uniformly the ones that made serious investments in data infrastructure before or alongside their AI investments. Not smarter about AI. More rigorous about data.

I believe AI belongs in the modern data stack as a capability layer that extends what's already there. It depends on the same foundations: trustworthy data, clear ownership, discoverable assets, governed access, reusable pipelines, and reliable serving. When those foundations are solid, AI becomes a natural extension. When they're weak, AI becomes a liability.

The strongest AI systems I've seen weren't built by AI specialists working in isolation. They were built by platform teams, data engineers, and analytics engineers who understood that AI is another workload on the same stack they've been building all along. The patterns that make data platforms useful - observability, versioning, governance, reliability - are the same patterns that make AI systems trustworthy.

AI fits best where trust, context, governance, and delivery already exist. That's not a limitation. It's a design principle.

In short

AI is not a separate stack or a replacement for data engineering. It's an extension of modern data capabilities that depends on the same foundations: trusted data, clear ownership, strong governance, and reliable operations. The teams that succeed with AI are the ones rigorous about their data platforms, not just excited about their models.