Data Modeling Styles

Comprehensive guide to different data modeling styles and when to use each approach

Data modeling styles define how data is structured, stored, and accessed within systems. Different modeling approaches serve different purposes: some are optimized for operational systems requiring fast transactional processing, while others are designed for analytical workloads that need to support complex queries and historical analysis.

Operational vs. Analytical Modeling

Understanding the distinction between operational and analytical data modeling is crucial (a short SQL sketch contrasting the two follows the lists below):

Operational Data Models:

  • Designed for transactional systems (OLTP - Online Transaction Processing)
  • Optimized for fast inserts, updates, and deletes
  • Normalized structures to reduce redundancy
  • Real-time data access with low latency
  • Examples: Customer databases, order systems, inventory management

Analytical Data Models:

  • Designed for reporting and analysis (OLAP - Online Analytical Processing)
  • Optimized for complex queries and aggregations
  • Denormalized structures for query performance
  • Historical data with time-based analysis
  • Examples: Data warehouses, business intelligence systems, analytics platforms
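
Example - Operational vs. Analytical Structures:

As a minimal sketch (illustrative table and column names), the same order data can be modeled both ways: normalized tables for the operational system, and a denormalized summary for analytics.

-- Operational (OLTP): normalized tables optimized for single-row inserts and updates
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_date TIMESTAMP,
    status VARCHAR(20)
);

CREATE TABLE order_lines (
    order_id INT,
    line_number INT,
    product_id INT,
    quantity INT,
    unit_price DECIMAL(10,2),
    PRIMARY KEY (order_id, line_number)
);

-- Analytical (OLAP): denormalized, read-optimized structure for aggregation
CREATE TABLE sales_summary (
    order_date DATE,
    customer_segment VARCHAR(50),
    product_category VARCHAR(50),
    total_quantity INT,
    total_revenue DECIMAL(12,2)
);

The normalized tables support fast, consistent writes per order; the summary table trades write flexibility for fast aggregate reads.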

Slowly Changing Dimensions (SCD)

Slowly Changing Dimensions handle changes to dimension data over time. They are essential for maintaining historical accuracy in analytical systems.

SCD Type 1: Overwrite

SCD Type 1 simply overwrites the old value with the new value, losing historical information.

Use Case: When historical accuracy is not required, and you only need the current state.

Example:

-- Customer dimension table
CREATE TABLE dim_customer (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(100),
    email VARCHAR(100),
    city VARCHAR(50),
    updated_at TIMESTAMP
);

-- When customer moves from "New York" to "Los Angeles"
-- Old value is overwritten, no history maintained
UPDATE dim_customer 
SET city = 'Los Angeles', updated_at = CURRENT_TIMESTAMP
WHERE customer_id = 123;

Characteristics:

  • Simple implementation
  • No history tracking
  • Minimal storage requirements
  • Suitable for correcting errors or when history is irrelevant

SCD Type 2: Historical Tracking

SCD Type 2 maintains a complete history of changes by creating new records for each change, preserving both old and new values.

Use Case: When you need to track all historical changes and analyze data as it existed at any point in time.

Example:

-- Customer dimension table with SCD Type 2
CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id INT,
    customer_name VARCHAR(100),
    email VARCHAR(100),
    city VARCHAR(50),
    effective_date DATE,
    expiry_date DATE,
    is_current BOOLEAN,
    version_number INT
);

-- Initial record
INSERT INTO dim_customer VALUES 
(1, 123, 'John Doe', 'john@example.com', 'New York', 
 '2020-01-01', '9999-12-31', TRUE, 1);

-- When customer moves to Los Angeles
-- Close old record
UPDATE dim_customer 
SET expiry_date = '2023-06-15', is_current = FALSE
WHERE customer_id = 123 AND is_current = TRUE;

-- Insert new record
INSERT INTO dim_customer VALUES 
(2, 123, 'John Doe', 'john@example.com', 'Los Angeles', 
 '2023-06-16', '9999-12-31', TRUE, 2);

Characteristics:

  • Complete historical tracking
  • Enables point-in-time analysis
  • Increased storage requirements
  • More complex queries (need to filter by date or is_current flag)
  • Essential for compliance and audit requirements

Querying SCD Type 2:

-- Get current customer information
SELECT * FROM dim_customer 
WHERE customer_id = 123 AND is_current = TRUE;

-- Get customer information as of a specific date
SELECT * FROM dim_customer 
WHERE customer_id = 123 
  AND '2023-06-10' BETWEEN effective_date AND expiry_date;

Dimensional Modeling (Star Schema)

Dimensional modeling organizes data into fact tables (measurable events) and dimension tables (descriptive attributes). The star schema is the simplest form, with a central fact table surrounded by dimension tables.

Use Case: Business intelligence, reporting, and analytics where query performance is critical.

Example - Star Schema:

-- Fact table (measures/events)
CREATE TABLE fact_sales (
    sale_id INT PRIMARY KEY,
    date_key INT,
    customer_key INT,
    product_key INT,
    store_key INT,
    quantity INT,
    unit_price DECIMAL(10,2),
    total_amount DECIMAL(10,2),
    discount_amount DECIMAL(10,2)
);

-- Dimension tables
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,
    date DATE,
    year INT,
    quarter INT,
    month INT,
    week INT,
    day_of_week VARCHAR(10),
    is_weekend BOOLEAN
);

CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id INT,
    customer_name VARCHAR(100),
    email VARCHAR(100),
    city VARCHAR(50),
    state VARCHAR(50),
    country VARCHAR(50),
    customer_segment VARCHAR(50)
);

CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_id INT,
    product_name VARCHAR(100),
    category VARCHAR(50),
    brand VARCHAR(50),
    price DECIMAL(10,2)
);

CREATE TABLE dim_store (
    store_key INT PRIMARY KEY,
    store_id INT,
    store_name VARCHAR(100),
    city VARCHAR(50),
    region VARCHAR(50),
    store_type VARCHAR(50)
);

Characteristics:

  • Denormalized structure for fast queries
  • Simple to understand and query
  • Optimized for read-heavy analytical workloads
  • Foreign keys link facts to dimensions
  • Supports aggregations and drill-down analysis

Query Example:

-- Sales by product category and month
SELECT 
    d.year,
    d.month,
    p.category,
    SUM(f.total_amount) as total_sales,
    SUM(f.quantity) as total_quantity,
    COUNT(*) as transaction_count
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
WHERE d.year = 2023
GROUP BY d.year, d.month, p.category
ORDER BY d.year, d.month, p.category;

Snowflake Schema

The snowflake schema is a normalized version of the star schema, where dimension tables are further normalized into multiple related tables.

Use Case: When storage optimization matters and you’re willing to accept some additional query complexity in exchange for reduced redundancy.

Example - Snowflake Schema:

-- Normalized dimension tables
CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id INT,
    customer_name VARCHAR(100),
    email VARCHAR(100),
    city_key INT,
    customer_segment_key INT
);

CREATE TABLE dim_city (
    city_key INT PRIMARY KEY,
    city_name VARCHAR(50),
    state_key INT
);

CREATE TABLE dim_state (
    state_key INT PRIMARY KEY,
    state_name VARCHAR(50),
    country_key INT
);

CREATE TABLE dim_country (
    country_key INT PRIMARY KEY,
    country_name VARCHAR(50),
    region VARCHAR(50)
);

CREATE TABLE dim_customer_segment (
    segment_key INT PRIMARY KEY,
    segment_name VARCHAR(50),
    segment_description TEXT
);

Characteristics:

  • Normalized structure reduces storage
  • More complex queries (multiple joins)
  • Better for large dimensions with many attributes
  • Slightly slower queries than star schema
  • Better data integrity and consistency
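
Query Example:

The extra join depth is the main cost of normalization. As a sketch that reuses the fact_sales table from the star schema example above, this query rolls sales up to the state level by walking the customer → city → state chain:

-- Sales by state: each normalized dimension level adds a join
SELECT 
    st.state_name,
    SUM(f.total_amount) as total_sales
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_city ci ON c.city_key = ci.city_key
JOIN dim_state st ON ci.state_key = st.state_key
GROUP BY st.state_name
ORDER BY total_sales DESC;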

Comparison: Star vs. Snowflake:

| Aspect            | Star Schema                | Snowflake Schema                      |
|-------------------|----------------------------|---------------------------------------|
| Normalization     | Denormalized               | Normalized                            |
| Query Complexity  | Simple (fewer joins)       | More complex (more joins)             |
| Storage           | Higher redundancy          | Lower redundancy                      |
| Query Performance | Faster (fewer joins)       | Slightly slower                       |
| Maintenance       | Easier                     | More complex                          |
| Use Case          | Small to medium dimensions | Large dimensions with many attributes |

Data Vault 2.0

Data Vault 2.0 is a hybrid modeling approach designed for agile data warehousing, focusing on scalability, flexibility, and auditability. It separates business keys from structural relationships and provides a framework for handling historical data.

Use Case: Enterprise data warehouses requiring flexibility, scalability, and audit trails. Ideal for environments with changing business requirements.

Core Components:

  1. Hubs: Represent unique business keys
  2. Links: Represent relationships between hubs
  3. Satellites: Store descriptive attributes and their history

Example - Data Vault 2.0:

-- Hub: Unique business keys
CREATE TABLE hub_customer (
    customer_hkey CHAR(32) PRIMARY KEY,  -- Hash of business key
    customer_id VARCHAR(50),              -- Business key
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);

CREATE TABLE hub_product (
    product_hkey CHAR(32) PRIMARY KEY,
    product_id VARCHAR(50),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);

-- Link: Relationships
CREATE TABLE link_customer_product (
    customer_product_lkey CHAR(32) PRIMARY KEY,
    customer_hkey CHAR(32),
    product_hkey CHAR(32),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);

-- Satellite: Descriptive attributes with history
CREATE TABLE sat_customer (
    customer_hkey CHAR(32),
    load_date TIMESTAMP,
    load_end_date TIMESTAMP,
    record_source VARCHAR(100),
    customer_name VARCHAR(100),
    email VARCHAR(100),
    phone VARCHAR(20),
    address TEXT,
    PRIMARY KEY (customer_hkey, load_date)
);

CREATE TABLE sat_customer_product (
    customer_product_lkey CHAR(32),
    load_date TIMESTAMP,
    load_end_date TIMESTAMP,
    record_source VARCHAR(100),
    purchase_date DATE,
    quantity INT,
    price DECIMAL(10,2),
    PRIMARY KEY (customer_product_lkey, load_date)
);

Characteristics:

  • Highly scalable and flexible
  • Complete audit trail
  • Parallel loading capabilities
  • Handles source system changes gracefully
  • More complex than star/snowflake schemas
  • Requires specialized ETL patterns
  • Excellent for enterprise data integration

Querying Data Vault:

-- Get current customer information
SELECT 
    h.customer_id,
    s.customer_name,
    s.email,
    s.phone
FROM hub_customer h
JOIN sat_customer s ON h.customer_hkey = s.customer_hkey
WHERE s.load_end_date = '9999-12-31'  -- Current record
  AND h.customer_id = 'CUST001';
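
Because satellites keep every version of the attributes, point-in-time queries follow the same pattern. The sketch below assumes the load_date/load_end_date convention used above (some implementations use NULL for the open record instead of a high date).

-- Get customer information as it was known on a specific date
SELECT 
    h.customer_id,
    s.customer_name,
    s.email
FROM hub_customer h
JOIN sat_customer s ON h.customer_hkey = s.customer_hkey
WHERE h.customer_id = 'CUST001'
  AND s.load_date <= '2023-06-10'
  AND s.load_end_date > '2023-06-10';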

Graph Data Modeling

Graph modeling represents data as nodes (entities) and edges (relationships), enabling complex relationship analysis and traversal.

Use Case: Social networks, recommendation systems, fraud detection, knowledge graphs, network analysis, and any scenario requiring relationship traversal.

Example - Graph Model:

// Neo4j Cypher example
// Create nodes
CREATE (alice:Person {name: 'Alice', age: 30, city: 'New York'})
CREATE (bob:Person {name: 'Bob', age: 35, city: 'San Francisco'})
CREATE (charlie:Person {name: 'Charlie', age: 28, city: 'New York'})
CREATE (acme:Company {name: 'ACME Corp', industry: 'Technology'})
CREATE (xyz:Company {name: 'XYZ Inc', industry: 'Finance'})

// Create relationships
CREATE (alice)-[:KNOWS {since: 2010}]->(bob)
CREATE (alice)-[:KNOWS {since: 2015}]->(charlie)
CREATE (bob)-[:WORKS_AT {role: 'Engineer', since: 2018}]->(acme)
CREATE (charlie)-[:WORKS_AT {role: 'Analyst', since: 2020}]->(xyz)
CREATE (alice)-[:LIVES_IN]->(:City {name: 'New York'})
CREATE (bob)-[:LIVES_IN]->(:City {name: 'San Francisco'})

// Query: Find friends of Alice who work in Technology
MATCH (alice:Person {name: 'Alice'})-[:KNOWS]->(friend)-[:WORKS_AT]->(company)
WHERE company.industry = 'Technology'
RETURN friend.name, company.name

Characteristics:

  • Excellent for relationship-heavy data
  • Supports complex traversals and path queries
  • Flexible schema (schema-optional)
  • High performance for relationship queries
  • Not ideal for traditional reporting
  • Requires specialized graph databases (Neo4j, Amazon Neptune, etc.)

Use Cases for Graph Modeling:

  • Social Networks: Friend connections, follower relationships
  • Recommendation Systems: “People who bought X also bought Y”
  • Fraud Detection: Identifying suspicious relationship patterns
  • Knowledge Graphs: Entity relationships in AI/ML systems
  • Network Analysis: IT infrastructure, supply chains

Choosing the Right Modeling Style

| Modeling Style        | Best For                         | Operational | Analytical   | Real-time Support |
|-----------------------|----------------------------------|-------------|--------------|-------------------|
| Normalized Relational | Transactional systems, OLTP      | ✅ Excellent | ❌ Poor      | ✅ Excellent       |
| Star Schema           | Business intelligence, reporting | ❌ Poor      | ✅ Excellent | ⚠️ Limited         |
| Snowflake Schema      | Large analytical systems         | ❌ Poor      | ✅ Good      | ⚠️ Limited         |
| Data Vault 2.0        | Enterprise data warehouses       | ❌ Poor      | ✅ Excellent | ⚠️ Limited         |
| SCD Type 2            | Historical tracking, compliance  | ⚠️ Moderate  | ✅ Excellent | ⚠️ Moderate        |
| Graph                 | Relationship analysis, networks  | ⚠️ Moderate  | ✅ Excellent | ✅ Good            |

Modern Data Architecture: Blending Operational and Analytical

With faster real-time data processing and newer technologies (streaming, event-driven architectures, in-memory computing), the traditional boundary between operational and analytical systems is blurring.

Real-Time Analytics

Modern systems increasingly require:

  • Streaming Analytics: Processing data in real-time as it arrives
  • Operational Analytics: Running analytical queries against operational data
  • Hybrid Workloads: Supporting both transactional and analytical queries simultaneously

Example - Real-Time Customer Analytics:

-- Operational table with analytical capabilities
CREATE TABLE customer_events (
    event_id BIGINT,
    customer_id INT,
    event_type VARCHAR(50),
    event_timestamp TIMESTAMP,
    amount DECIMAL(10,2),
    product_id INT,
    -- Operational fields
    status VARCHAR(20),
    -- Analytical fields
    customer_segment VARCHAR(50),
    lifetime_value DECIMAL(10,2),
    churn_probability DECIMAL(5,4)
) WITH (
    -- Illustrative, engine-specific options (not standard SQL):
    -- enable streaming ingestion and cap retention for real-time analytics
    STREAMING = TRUE,
    RETENTION_DAYS = 90
);

-- Real-time query for operational decision
SELECT customer_id, churn_probability 
FROM customer_events 
WHERE customer_id = 123 
  AND event_timestamp > CURRENT_TIMESTAMP - INTERVAL '1 hour'
ORDER BY event_timestamp DESC 
LIMIT 1;

-- Analytical query with longer lookback
SELECT 
    customer_segment,
    AVG(churn_probability) as avg_churn_risk,
    SUM(amount) as total_revenue,
    COUNT(*) as event_count
FROM customer_events
WHERE event_timestamp > CURRENT_TIMESTAMP - INTERVAL '30 days'
GROUP BY customer_segment;

AI/ML Model Requirements

AI and ML models processing real-time data often require:

  • Long Lookback Windows: Historical context for predictions
  • Feature Engineering: Combining operational and analytical data
  • Time-Series Analysis: Temporal patterns and trends
  • Real-Time Inference: Low-latency predictions on streaming data

Example - Merging Operational and Analytical Data for ML:

-- Operational transaction data
CREATE TABLE transactions (
    transaction_id BIGINT,
    customer_id INT,
    amount DECIMAL(10,2),
    timestamp TIMESTAMP,
    merchant_id INT,
    category VARCHAR(50)
);

-- Analytical aggregated features
CREATE TABLE customer_features (
    customer_id INT,
    feature_date DATE,
    -- Real-time features
    last_transaction_amount DECIMAL(10,2),
    last_transaction_time TIMESTAMP,
    -- Historical features (long lookback)
    avg_daily_spend_30d DECIMAL(10,2),
    avg_daily_spend_90d DECIMAL(10,2),
    transaction_count_7d INT,
    transaction_count_30d INT,
    transaction_count_90d INT,
    -- ML model outputs
    fraud_score DECIMAL(5,4),
    churn_probability DECIMAL(5,4),
    next_purchase_prediction TIMESTAMP
);

-- Real-time feature update (operational + analytical merge)
INSERT INTO customer_features
    (customer_id, feature_date, last_transaction_amount, last_transaction_time,
     avg_daily_spend_30d, avg_daily_spend_90d, fraud_score, churn_probability)
SELECT 
    t.customer_id,
    CURRENT_DATE as feature_date,
    -- Operational: latest transaction
    t.amount as last_transaction_amount,
    t.timestamp as last_transaction_time,
    -- Analytical: average daily spend over the last 30 days
    (SELECT SUM(amount) / 30.0
     FROM transactions 
     WHERE customer_id = t.customer_id 
       AND timestamp > CURRENT_TIMESTAMP - INTERVAL '30 days') as avg_daily_spend_30d,
    -- Analytical: average daily spend over the last 90 days
    (SELECT SUM(amount) / 90.0
     FROM transactions 
     WHERE customer_id = t.customer_id 
       AND timestamp > CURRENT_TIMESTAMP - INTERVAL '90 days') as avg_daily_spend_90d,
    -- ML model predictions (illustrative user-defined functions)
    ml_fraud_detection(t.customer_id, t.amount, t.timestamp) as fraud_score,
    ml_churn_prediction(t.customer_id) as churn_probability
FROM transactions t
WHERE t.timestamp > CURRENT_TIMESTAMP - INTERVAL '1 minute';

Best Practices for Hybrid Architectures

  1. Use Appropriate Storage Layers:

    • Hot Data: In-memory or fast storage for real-time operational queries
    • Warm Data: Fast disk storage for recent analytical queries
    • Cold Data: Object storage for long-term historical analysis
  2. Implement Data Contracts:

    • Define schemas for both operational and analytical data
    • Ensure consistency across systems
    • Version schemas to handle evolution
  3. Optimize for Both Workloads:

    • Use columnar storage for analytical queries
    • Maintain indexes for operational queries
    • Implement materialized views for common analytical patterns (see the sketch after this list)
  4. Stream Processing:

    • Use streaming frameworks (Kafka, Flink, Spark Streaming) for real-time processing
    • Maintain state for aggregations and windowing
    • Enable exactly-once processing semantics
  5. Feature Stores:

    • Centralize feature definitions for ML models
    • Support both batch and streaming feature computation
    • Enable feature versioning and lineage tracking
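
As one illustration of the materialized-view point above, the sketch below precomputes a common analytical pattern over the customer_events table from the earlier example; CREATE MATERIALIZED VIEW syntax and refresh behavior vary by engine, so treat it as a starting point rather than a definitive implementation.

-- Precompute daily revenue per customer segment (engine-specific refresh options omitted)
CREATE MATERIALIZED VIEW mv_daily_segment_revenue AS
SELECT 
    CAST(event_timestamp AS DATE) as event_date,
    customer_segment,
    SUM(amount) as total_revenue,
    COUNT(*) as event_count
FROM customer_events
GROUP BY CAST(event_timestamp AS DATE), customer_segment;

-- Analytical queries read the precomputed view instead of scanning raw events
SELECT event_date, customer_segment, total_revenue
FROM mv_daily_segment_revenue
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY event_date;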

Related Topics

  • Data Schemas - Understanding different schema types
  • Data Contracts - How data modeling relates to data contracts
  • Data Flows - Designing data flows for different modeling styles
  • ODCS - Standardized approach to data contract definition