🧩 Chapter 4.3: Mapping the Customer Journey — How Markov Chains Reveal True Marketing Impact
Welcome back to our eCommerce Analytics Series! 👋 In the previous chapter, Product Clustering and Customer Segmentation, we uncovered behavioral segments using K-Means and DBSCAN. Now we take the next step: understanding how customers move through the funnel.
Let’s explore Chapter 4.3: Mapping the Customer Journey — How Markov Chains Reveal True Marketing Impact.
🔄 The Problem: Why Last-Click Attribution Fails 🔄
Most companies rely on last-touch attribution, giving 100% credit to the final click. But real customer journeys are rarely linear.
Imagine a user who:
- Discovers your brand via Facebook
- Researches via Organic Search
- Returns via Email
- Converts via Adwords
Last-click gives all credit to Adwords — but what if removing Facebook collapsed the funnel?
This is where Markov Chain Attribution shines.
🧠 How Markov Chains Work: The Memoryless Journey 🧠
A Markov Chain models transitions between states (e.g., marketing channels) based only on the current state — not the full history.
Key Components:
- States: home, product, cart, purchase, cancel
- Transition Matrix: Probability of moving from one state to another
- Removal Effect: Measures a channel’s true impact by simulating its removal
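Before touching real data, the mechanics are easy to see on toy paths. A minimal sketch (the paths and counts below are invented purely for illustration):

```python
from collections import defaultdict

# Hypothetical user journeys through the funnel (illustrative only)
paths = [
    ["home", "department", "product", "cart", "purchase"],
    ["home", "department", "product", "cart", "cancel"],
    ["home", "department", "product", "cart", "purchase"],
]

# Count transitions between consecutive states
counts = defaultdict(lambda: defaultdict(int))
for path in paths:
    for cur, nxt in zip(path, path[1:]):
        counts[cur][nxt] += 1

# Normalize each row into transition probabilities
matrix = {
    cur: {nxt: n / sum(nxts.values()) for nxt, n in nxts.items()}
    for cur, nxts in counts.items()
}
print(matrix["cart"])  # {'purchase': 0.666..., 'cancel': 0.333...}
```

The same row-normalized counting is what the BigQuery query below performs at scale.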
📊 Transition Matrix: The Heart of the Model 📊
We begin by extracting and aggregating user paths using Python and BigQuery SQL to obtain the transition matrix:
query = """
WITH transitions AS (
SELECT
e1.event_type AS current_event,
e2.event_type AS next_event
FROM
`bigquery-public-data.thelook_ecommerce.events` AS e1
JOIN
`bigquery-public-data.thelook_ecommerce.events` AS e2
ON
e1.session_id = e2.session_id AND
e1.sequence_number = e2.sequence_number - 1
),
counts AS (
SELECT
current_event,
COUNT(*) AS total_count
FROM
transitions
GROUP BY
current_event
)
SELECT
t.current_event,
t.next_event,
COUNT(*) AS transition_count,
COUNT(*) / c.total_count AS transition_probability
FROM
transitions t
JOIN
counts c ON t.current_event = c.current_event
GROUP BY
t.current_event, t.next_event, c.total_count
ORDER BY
t.current_event, t.next_event;
"""
query_job = client.query(query)
transitions = query_job.result().to_dataframe()
transition_matrix = transitions.pivot(index='current_event', columns='next_event', values='transition_probability')
transition_matrix
# comvert nan to zero
transition_matrix = transition_matrix.fillna(0)
transition_matrix = transition_matrix.round(3)
transition_matrix
Here’s the aggregated transition matrix:
| current_event | cancel | cart | department | product | purchase |
|---|---|---|---|---|---|
| cart | 0.267 | 0.0 | 0.345 | 0.0 | 0.388 |
| department | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| home | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| product | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
Key Insights:
- High Cart Abandonment: 26.7% of users abandon at cart
- Smooth Navigation: the home → department → product path is deterministic (each transition has probability 1.0)
- Purchase Bottleneck: Only 38.8% convert from cart
Markov chain visualization in Python:

```python
import networkx as nx
import matplotlib.pyplot as plt

# Build a directed graph from the non-zero transitions
graph = nx.DiGraph()
for current_event in transition_matrix.index:
    for next_event in transition_matrix.columns:
        probability = transition_matrix.loc[current_event, next_event]
        if probability > 0:
            graph.add_edge(current_event, next_event, weight=probability)

# Lay out the graph, nudging a few nodes apart for readability
pos = nx.spring_layout(graph)
pos['product'] = (pos['product'][0], pos['product'][1] + 1.5)
pos['home'] = (pos['home'][0] + 1, pos['home'][1] + 1.5)
pos['department'] = (pos['department'][0] + 1, pos['department'][1])

nx.draw(graph, pos, with_labels=True, node_size=1000, node_color="skyblue",
        font_size=10, font_weight="bold", arrows=True)
edge_labels = nx.get_edge_attributes(graph, "weight")
nx.draw_networkx_edge_labels(graph, pos, edge_labels=edge_labels)
plt.title("Markov Chain Transition Diagram")
plt.show()
```
🚫 The Removal Effect: Measuring True Channel Impact 🚫
We compute each channel’s contribution by simulating its removal:
```python
# Baseline conversion rate
baseline_conv_rate = total_conversions / total_sessions

# Removal effect: drop each channel's sessions and conversions,
# then measure the relative change in conversion rate
removal_effects = {}
for _, row in df.iterrows():
    remaining_conversions = total_conversions - row['conversions']
    remaining_sessions = total_sessions - row['total_sessions']
    new_conv_rate = remaining_conversions / remaining_sessions
    removal_effect = (baseline_conv_rate - new_conv_rate) / baseline_conv_rate
    removal_effects[row['traffic_source']] = max(removal_effect, 0)

# Normalize removal effects into attribution shares
attribution_shares = {k: v / sum(removal_effects.values())
                      for k, v in removal_effects.items()}
```
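One edge case worth guarding: if every channel's removal effect is zero (which can happen in near-homogeneous synthetic data), the share computation divides by zero. A small defensive variant; `normalize_shares` is a hypothetical helper name, and the even-split fallback is one possible convention:

```python
def normalize_shares(removal_effects):
    """Convert removal effects to attribution shares, guarding the
    all-zero case with an even split instead of a ZeroDivisionError."""
    total = sum(removal_effects.values())
    if total == 0:
        return {k: 1 / len(removal_effects) for k in removal_effects}
    return {k: v / total for k, v in removal_effects.items()}

print(normalize_shares({"Adwords": 0.001, "Organic": 0.0}))
# {'Adwords': 1.0, 'Organic': 0.0}
print(normalize_shares({"Adwords": 0.0, "Organic": 0.0}))
# {'Adwords': 0.5, 'Organic': 0.5}
```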
Attribution Results:
| Channel | Removal Effect | Attribution Share |
|---|---|---|
| Adwords | 0.001 | 100.0% |
| 0.000 | 0.0% | |
| 0.000 | 0.0% | |
| Organic | 0.000 | 0.0% |
| YouTube | 0.000 | 0.0% |
⚠️ Note: In this synthetic dataset, all channels show nearly identical behavior — a known limitation. In real data, differences in conversion paths would yield meaningful attribution splits.
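The leave-one-channel-out calculation above approximates removal at the session level. The textbook Markov removal effect instead operates on the transition matrix itself. Below is a minimal sketch using the funnel states from the earlier matrix (note the states here are funnel pages rather than channels; with multi-touch data they would be traffic sources). Redirecting a removed state's inbound probability mass to the drop-off state is one common convention, not the only one:

```python
import numpy as np

# Funnel states; 'purchase' and 'cancel' are absorbing.
# Row/column order: home, department, product, cart, purchase, cancel.
# Probabilities are copied from the transition matrix shown earlier.
P = np.array([
    [0.0, 1.0,   0.0, 0.0, 0.0,   0.0  ],  # home
    [0.0, 0.0,   1.0, 0.0, 0.0,   0.0  ],  # department
    [0.0, 0.0,   0.0, 1.0, 0.0,   0.0  ],  # product
    [0.0, 0.345, 0.0, 0.0, 0.388, 0.267],  # cart
    [0.0, 0.0,   0.0, 0.0, 1.0,   0.0  ],  # purchase
    [0.0, 0.0,   0.0, 0.0, 0.0,   1.0  ],  # cancel
])

def conversion_prob(P, start=0, target=4, steps=200):
    """P(eventually absorbed in `target` from `start`), approximated
    by running the chain for many steps."""
    return np.linalg.matrix_power(P, steps)[start, target]

def removal_effect(P, idx, drop=5):
    """Remove state `idx`: journeys that would enter it are redirected
    to the drop-off state, then the lost conversion share is measured."""
    base = conversion_prob(P)
    Q = P.copy()
    Q[:, drop] += Q[:, idx]  # inbound mass now drops off
    Q[:, idx] = 0.0
    Q[idx, :] = 0.0
    Q[idx, drop] = 1.0
    return (base - conversion_prob(Q)) / base

print(round(conversion_prob(P), 3))  # baseline conversion from 'home'
print(removal_effect(P, 3))          # removing 'cart' kills all conversions -> 1.0
```

Because every path to purchase runs through cart, its removal effect is 1.0, which is exactly the "collapsed funnel" intuition from the introduction.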
📈 Business Impact: Fixing the Cart Bottleneck 📈
Even small improvements in cart conversion have massive impact.
| Metric | Value |
|---|---|
| Baseline Conversion Rate | 26.5% |
| Simulated Cart → Purchase (+10%) | 33.8% |
| Impact on Conversion | +7.3 pp |
Improving cart-to-purchase conversion by 10 percentage points lifts overall conversion by 7.3 points, without acquiring any new traffic.
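The same what-if can be run through the chain itself. Because home → department → product → cart is deterministic and cart can loop back via department, the overall conversion probability has a closed form, p / (1 - q). A sketch under that assumption (note this event-level figure uses a different baseline than the session-level rates in the table above, so the absolute numbers differ):

```python
def purchase_prob(p_cart_purchase, p_cart_department=0.345):
    """Overall conversion for the funnel in the earlier matrix: cart
    purchases (p), cancels, or loops back via department (q).
    Summing over loop traversals: p + q*p + q^2*p + ... = p / (1 - q)."""
    return p_cart_purchase / (1 - p_cart_department)

base = purchase_prob(0.388)
improved = purchase_prob(0.388 + 0.10)  # shift 10 pp from cancel to purchase
print(f"{base:.3f} -> {improved:.3f} (uplift {improved - base:.3f})")
```

Note how the cart → department loop amplifies any cart-level improvement: every point gained at cart compounds across repeat visits to the cart state.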
🔍 Data Storytelling: Before → Change → Reason → Impact 🔍
| Phase | Insight |
|---|---|
| Before | Last-click attribution overvalues Adwords; other channels ignored |
| Change | Implemented Markov Chain model with removal effect |
| Reason | Identified cart abandonment as universal bottleneck |
| Impact | Shifted budget to checkout optimization, boosting conversion by 7.3 pp |
✅ Closing the Series: Key Takeaways & Final Reflections ✅
This project began with raw data and ended with actionable, business-driven insights. Across 12 chapters, we:
- Cleaned and modeled real-world e-commerce data
- Segmented customers using unsupervised learning
- Mapped journeys with Markov chains
- Engineered KPIs for marketing, sales, and supply chain
While the data was synthetic, the methods are production-ready: from SQL-based KPI pipelines to Python-powered attribution models.
The biggest lesson? Data doesn’t speak for itself — it takes structure, storytelling, and strategic framing to turn numbers into decisions.
Thank you for following along. This concludes Unlocking eCommerce Success: A Deep Dive into TheLook Dataset.
Some Recommended Datasets:
While TheLook dataset is excellent for learning, it has a key limitation: each session has only one traffic source, preventing true multi-touch analysis.
For advanced Markov chain and attribution modeling, consider these realistic datasets:
🔗 GitHub: Multi-Touch Attribution (eeghor/mta)
A synthetic but realistic dataset designed for attribution modeling:
- Paths like: Facebook > Google > Email > Purchase
- Includes conversion counts, timestamps, and exposure times
- Perfect for testing Markov, Shapley, and time-decay models
🔗 https://github.com/eeghor/mta

⚡ Databricks Multi-Touch Attribution Solution Accelerator
- Full synthetic dataset with ad impressions and conversions
- Implements first-touch, last-touch, and Markov models
- Built for production-level attribution on the Lakehouse
🔗 Databricks MTA Blog
These resources enable true multi-touch journey analysis, going beyond single-session attribution.