Strategic Data
Every organization depends on data for decisions, and myriad solutions have evolved to manage it — streaming and telemetry, governance frameworks, analytics packages, and modern lakehouse platforms all provide the tools. Yet many organizations still struggle to get reliable, consistent answers from their data. The gap is sometimes technical, but more often it is organizational, driven by persistent points of confusion.
Strategic Data is the Semantic Operations pillar that addresses this directly. Its methods and tools promote clear, neutral understanding of what data systems actually are, what they require, and why getting this right is a prerequisite for any organization hoping to benefit from AI. It represents the most deterministic part of the Semantic Funnel — the transformation from raw Data to structured Information — and provides a playbook for making data a first-class strategic asset.
Where Strategic Data Sits in the Semantic Funnel
The Semantic Funnel is our mental model for framing complexity into simplicity. Strategic Data operates at the foundation — the D→I transformation — where Structural Rules are applied to give raw facts their initial shape and meaning.
This is the most bounded, most deterministic transformation in the funnel: apply the right schema to raw data and the result is repeatable every time. The funnel is also an important reminder that when the term "data" is used, what is often meant is information or knowledge — and knowing the difference is critical, especially when it comes to AI.
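To make that determinism concrete, here is a minimal sketch; the schema, field names, and records are illustrative, not drawn from any particular system:

from datetime import date

# A structural rule: the schema that turns a raw record into typed information.
# Field names and types here are illustrative.
ORDER_SCHEMA = {"order_id": int, "amount": float, "order_date": date.fromisoformat}

def apply_schema(raw: dict) -> dict:
    # Same schema + same raw record -> same typed row, every time.
    return {field: cast(raw[field]) for field, cast in ORDER_SCHEMA.items()}

row = apply_schema({"order_id": "1042", "amount": "99.50", "order_date": "2024-03-01"})
# {'order_id': 1042, 'amount': 99.5, 'order_date': datetime.date(2024, 3, 1)}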
| Funnel Level | Rule Class | Strategic Data Role |
|---|---|---|
| D → I | Structural | Define schemas, dimensional models, type systems — the rules that give raw data its shape |
| I → K | Interpretive | Provide the structured foundation that pattern detection and inference depend on; this is where most decisions are made |
| K → W | Normative | Encode organizational judgment: architecture and principles |
Understanding Data Systems
I think the single most valuable thing an organization can do for its data strategy is develop a clear, neutral understanding of what data systems actually are and what they do, independent of any vendor or platform. Based on my experience, as well as some recent research, "data systems" are among the least well-understood technologies in an organization. A clearer picture starts with three ideas that cut through the noise.
The entire landscape of data systems can be characterized by a small set of neutral concepts. Once these are clear, any data system — regardless of vendor, platform, or industry — can be described, compared, and governed using the same vocabulary.
What kind of system is it? Every "data system" falls into one of four types:
- Application — operational, transactional systems (OLTP)
- Analytics — systems instrumented for analysis (OLAP)
- Enterprise Work — unstructured knowledge artifacts (docs, comms, designs)
- Systems of Record — canonical truth (Finance, ERP, HRIS)
Every organization has a different mix of these four, determined by industry and business model. Each type has its own optimization targets, governance requirements, and integration patterns. Understanding which types dominate tells leadership where to invest and what governance approaches apply.
If it is an Analytics system, how does it function? Three complementary views, adapted from Reis & Housley (2022), describe the internal anatomy of any analytics data system:
- Components — what it is built with: ingestion, storage, table layer, compute, orchestration, catalog, and ML/AI services.
- Lifecycle — how data flows through it: generation, ingestion, storage, transformation, and serving.
- Functions — what needs to be done: security, data management, DataOps, architecture, orchestration, and software engineering.
Components are assembled into a platform. Lifecycle flows through those components. Functions ensure quality at every intersection. Vendor differentiation is execution quality within these categories, not fundamental capability.
What forces shape its architecture? There are only three:
- Volume & Velocity — how much data, and how fast does it grow?
- Latency — how fresh must insights be?
- Structure — how predictable is the data shape?
That is the complete picture. Four types identify what kind of data system an organization has. Three forces characterize what constraints shape its architecture. Components, lifecycle, and functions describe how it works internally. Together, they give technical and non-technical stakeholders a shared vocabulary for reasoning about data systems — one that holds regardless of which products are in the stack.
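To show how compact that shared vocabulary is, here is an illustrative sketch of the profile it yields for a single system; the class and every field value are assumptions made for the example:

from dataclasses import dataclass

# A neutral profile for any data system, using only the concepts above.
# The type names and example values are illustrative.
SYSTEM_TYPES = {"Application", "Analytics", "Enterprise Work", "System of Record"}

@dataclass
class DataSystemProfile:
    name: str
    system_type: str          # one of the four types
    volume_velocity: str      # how much data, and how fast it grows
    latency: str              # how fresh insights must be
    structure: str            # how predictable the data shape is

warehouse = DataSystemProfile(
    name="customer_warehouse",
    system_type="Analytics",
    volume_velocity="~2 TB, growing slowly",
    latency="daily refresh is acceptable",
    structure="highly predictable, dimensional",
)
assert warehouse.system_type in SYSTEM_TYPES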
Structure and Meaning
Understanding data systems is the first step. The second is recognizing that correct answers require structure, because structure is meaning. That claim is concrete:
Aggregating across time requires a proper date dimension. Slicing by multiple attributes requires a dimensional model. Ensuring consistent metrics across teams requires conformed definitions. Enabling drill-down requires hierarchical relationships. These are mathematical requirements for analytical correctness, and when they are in place, dashboards align, queries are concise, and the numbers earn trust across the organization.
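As a sketch of the first of those requirements, a proper date dimension makes time aggregation unambiguous; the tables and column names below are illustrative:

import pandas as pd

# A date dimension: one row per calendar day, with conformed attributes.
dim_date = pd.DataFrame({"date": pd.date_range("2024-01-01", "2024-12-31")})
dim_date["month"] = dim_date["date"].dt.to_period("M")
dim_date["quarter"] = dim_date["date"].dt.to_period("Q")

# An illustrative fact table at order grain.
fact_orders = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-15", "2024-02-03", "2024-02-20"]),
    "revenue": [120.0, 80.0, 200.0],
})

# Joining the fact to the dimension means every team rolls revenue up
# to the same month and quarter boundaries.
monthly = fact_orders.merge(dim_date, on="date").groupby("month")["revenue"].sum()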
While this seems like it should already be well understood, allow me to put forth that often it is not. Each technology wave has shifted attention away from structure: because the relational era enforced structure through schema, the reaction became "schema bad, relational bad." The NoSQL era prioritized flexibility and speed, the data lake era emphasized storage scale over semantic design, and a generation of data scientists and data engineers was never taught how to model data. But here's the thing: schema and relationships have to be in there somewhere, or the data means nothing. Eventually the semantic renaissance arrived — lakehouse formats, semantic layers, dbt — restoring structure once organizations recognized the cost of deferring it. The LLM era is following the same arc: AI initially makes schema feel optional, and then the need for grounded, structured data reasserts itself.
A well-built dimensional model feeds both traditional ML and generative AI directly. The same customer-level features — Recency, Frequency, Monetary value — serve clustering:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# customer_features: a pandas DataFrame produced by the dimensional model,
# one row per customer with clean, conformed RFM columns.
X = customer_features[['Recency', 'Frequency', 'Monetary']].values

# Standardize so no single feature dominates the distance metric.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Cluster customers into four segments on the standardized features.
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
customer_segments = kmeans.fit_predict(X_scaled)
And those same features, enriched with the segments the model produced, become structured input for LLM fine-tuning:
{
  "instruction": "Generate a personalized marketing email for a customer",
  "input": {
    "customer_segment": "High-Value Lapsed",
    "recency_days": 45,
    "total_spend": 2500,
    "open_rate": 0.32,
    "preferred_tone": "friendly",
    "preferred_offer": "discount"
  },
  "output": "Subject: We Miss You! Here's 20% Off Your Next Order\n\nHi [Name],\n\nIt's been a while since your last purchase, and we wanted to reach out..."
}
Why This Matters More for AI
AI makes structure more valuable, not less. RAG systems perform best against queryable, well-structured data. Agents reason more reliably when relationships between entities are explicit. ML models produce better predictions when feature engineering starts from a sound dimensional model.
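As one illustration of why explicit structure helps an agent, a conformed metric can be exposed as a typed tool rather than leaving the agent to improvise its own definition. This is a hypothetical sketch; monthly_revenue and query_semantic_layer are illustrative names, not a real API:

# Hypothetical sketch: a conformed metric exposed to an agent as a typed tool.
# The metric definition lives in the governed semantic layer, not in a prompt.

def query_semantic_layer(metric: str, grain: str, period: str) -> float:
    # Stand-in for a query against the governed dimensional model;
    # returns a fixed value here so the sketch is self-contained.
    return 42_000.0

def monthly_revenue(month: str) -> float:
    """Total revenue for a calendar month (YYYY-MM), per the conformed
    definition in the dimensional model. Every agent call gets the same
    definition, so the metric cannot quietly diverge."""
    return query_semantic_layer(metric="revenue", grain="month", period=month)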
Organizational Challenges
The technology to manage data well exists and has existed for a long time. When data initiatives stall, the root causes are almost always organizational — and I think this is worth stating plainly because it reframes where the leverage actually is. The highest-return investments are in people, literacy, and governance, not in platforms.
Data systems are inherently cross-cutting. Analytics instrumentation is created by engineering, consumed by analytics teams, and governed by data teams — and often owned by none of them. The mix of system types varies by industry — a SaaS company is roughly 60% Application, while a professional services firm is 50% Enterprise Work — which means "best practices" from one domain do not transfer to another. Effective strategy starts from the business domain, not from tool selection.
The persistent challenges are well-known:
- Source ownership is a political orphan. Analytics wants deeper events; engineering owns the codebase; neither owns the boundary between them. Without cross-functional coordination and leadership commitment, instrumentation falls through the cracks.
- Autonomy without coherence. Modern organizations do not have isolated silos — they have pipelines flowing everywhere, reverse ETLs, self-serve tools proliferating, and metrics recalculated hundreds of ways. The problem is too much interconnection without shared semantic definitions.
- Domain knowledge has splintered. Software engineering developed a unified career path — junior to senior to staff — where practitioners own the full stack from schema to deployment. Data roles fragmented the opposite way: Data Engineer, Data Scientist, and Business Analyst each own a task, not a discipline. The industry favored generalist profiles and tool fluency over deep domain expertise, and modern education followed suit.
- Analytics errors are silent. Software bugs crash visibly. Analytics errors are different: a dashboard renders, the numbers look reasonable, and the metric definition quietly diverges from what the stakeholder intended. Detection requires domain expertise, data literacy, and organizational standing to question "official" numbers — a combination that is rare.
This is why organizations cycle through platform investments without resolving the underlying issue. The root cause is semantic, and new tools do not change metric definitions or governance practices.
Finance offers a useful counterpoint. Financial reporting has standardized definitions (GAAP/IFRS), built-in validation (double-entry, reconciliation), external verification (auditors), and clear accountability for errors. These are governance practices, not technology advantages — and they demonstrate that semantic consistency at scale is achievable when leadership commits to it.
Governance as the Solution
Governance is often perceived as compliance overhead — a cost center that slows things down. I think this framing misses what governance actually does when it is designed well. Governance is how an organization solves the cross-cutting ownership problem: it encodes shared understanding of data systems into repeatable, self-validating rules that hold across teams.
But governance only works when it has full coverage. Good governance in the analytics team and none in engineering does not produce consistency — it produces a well-governed island surrounded by chaos. Partial governance is almost worse than no governance, because it creates the illusion of control. And governance has to be backed by incentives. Without them, teams will do whatever solves their immediate problem — build a local copy, define their own metric, spin up a shadow dashboard. That is not dysfunction; it is rational behavior in the absence of structural guardrails. Leadership has to commit to governance as strategy, not delegate it as a compliance task.
When that commitment exists, the mechanisms are well-established. Two complementary practices provide the audit trail:
- Provenance — where the data came from: its origin system, its owner, and the context in which it was captured.
- Lineage — how the data moved and changed: every transformation it passed through between source and consumption.

Together, provenance and lineage answer the two questions that govern trust: where did this come from, and how did it get here? When both are in place, teams can assess impact before making changes, trace errors back to their source, and enforce semantic contracts between systems — versioned definitions of what data means at each boundary, treated like API contracts.
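A semantic contract can be as lightweight as a versioned, machine-checkable record. Here is an illustrative sketch; the field names and the check are assumptions, not a standard:

# A versioned semantic contract for one metric at one system boundary.
# Structure and field names are illustrative.
ACTIVE_CUSTOMER_CONTRACT = {
    "name": "active_customers",
    "version": "2.1.0",
    "definition": "Distinct customers with >= 1 completed order in the trailing 30 days",
    "grain": "daily",
    "source": "fact_orders",       # lineage: where the numbers come from
    "owner": "analytics-core",     # provenance: who is accountable
}

def check_contract(payload: dict, contract: dict) -> None:
    # Downstream consumers validate the boundary the way they would an API.
    if payload.get("contract_version") != contract["version"]:
        raise ValueError(
            f"{contract['name']}: expected version {contract['version']}, "
            f"got {payload.get('contract_version')}"
        )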
And there is a deeper payoff that connects directly to AI. Provenance and lineage are not just governance artifacts — they are the raw material for semantic coherence. When an organization knows where data came from, how it was transformed, and what it means at each stage, it has the foundation for detecting drift before it causes failures and building the kind of traceable, auditable AI operations that regulators and stakeholders increasingly demand. Governance as strategy is a theme that runs through all three pillars of the Semantic Operations framework — Strategic Data establishes the foundation, and the other pillars build directly on the discipline it creates.
Implications for AI
Everything above applies regardless of AI. Sound data management, clear system understanding, structural discipline, and governance as strategy are best practices that have been true for decades. But AI changes the stakes in two specific ways that make Strategic Data more urgent than ever.
Everything Is Data
In the AI era, the boundary between "data" and "everything else" is dissolving. Code, documents, models, patterns, decisions, and conversations all become inputs to agentic systems. This expands the scope of "data management" from managing databases to managing all semantic artifacts. The four system types still apply, but their boundaries expand: Application systems now include API contracts and code patterns, Analytics systems include model outputs and experiment results, Enterprise Work becomes central rather than peripheral, and Systems of Record extend to model registries and decision logs. The 95% of organizational meaning that lived in unstructured "dark data" is now primary input for AI systems.
This is why existing semantic investments pay forward so strongly into AI. Consistent metric definitions, documented assumptions, and decisions with recorded rationale all become reliable context for agents. The scope of data management has expanded, and organizations that have already built semantic discipline are better positioned to operate in it.
AI Needs Structure, AI Makes Structure
AI needs structured data to operate well, but AI is also good at creating structure. This creates a potential virtuous cycle when properly designed. AI can extract entities from unstructured text, classify and tag content, detect schema violations, identify semantic drift, generate schema proposals from examples, and validate consistency across artifacts.
The cycle works like this: human-defined structure enables AI to produce better outputs. Better outputs feed back as structured artifacts. More structure enables better AI. Each turn of the wheel improves semantic coherence across the organization. The cycle starts with humans who understand the business domain — defining what things mean is the initial investment that sets the flywheel in motion. From there, AI accelerates structure creation and compounds the return.
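Here is a sketch of one turn of that wheel, with a stubbed extraction call standing in for whatever model is used; extract_entities and the schema are illustrative assumptions:

import json

# Human-defined structure: the schema AI output must conform to.
DECISION_SCHEMA = {"decision", "owner", "date", "rationale"}

def extract_entities(document: str) -> str:
    # Stand-in for an LLM extraction call; returns structured JSON.
    return json.dumps({"decision": "adopt dbt", "owner": "data-eng",
                       "date": "2024-06-01", "rationale": "conformed metrics"})

def one_turn(document: str) -> dict:
    # AI produces structure; the human-defined schema validates it; the
    # validated artifact feeds back as structured input for the next turn.
    record = json.loads(extract_entities(document))
    missing = DECISION_SCHEMA - record.keys()
    if missing:
        raise ValueError(f"schema violation, missing fields: {missing}")
    return record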
Organizations that have invested in Strategic Data fundamentals — clear system understanding, dimensional models, governance discipline — find that AI integration amplifies their existing strengths. The foundation matters more now than it ever has.
Where to Start
Strategic Data is a discipline — an ongoing commitment to understanding what data an organization has, applying structure deliberately, and governing meaning as a first-class concern. The starting points are straightforward:
- Know your system mix. Which of the four system types dominate your organization? The answer depends on your industry and business model, and it determines where to focus.
- Own the source. Define business events once with complete semantic context. Capture them once at the source. Governance at the point of origin is orders of magnitude cheaper than governance downstream.
- Build dimensional models. Star schemas, conformed dimensions, explicit grain. Building a proper model takes days; the return compounds for years.
- Treat governance as strategy. Provenance, lineage, semantic contracts — the infrastructure that makes everything else trustworthy and fast.
- Prepare for AI scope expansion. Recognize that code, documents, and decisions are now data. Extend governance beyond databases to all semantic artifacts.
The Semantic Funnel makes the case that investing in the conditions for meaning improves both human and machine performance. Strategic Data is where that investment begins — at the foundation, where structure is applied, where rules are most deterministic, and where getting it right has the highest leverage.
The Semantic Funnel — The mental model that the rest of the framework builds on.
Why SemOps? — The full case for why meaning matters and what makes it hard.