🧩 Chapter 4.3: Mapping the Customer Journey — How Markov Chains Reveal True Marketing Impact
Welcome back to our eCommerce Analytics Series! 👋 In the previous chapter, Product Clustering and Customer Segmentation, we uncovered behavioral segments using K-Means and DBSCAN. Now we take the next step: understanding how customers move through the funnel.
Let’s explore Chapter 4.3: Mapping the Customer Journey — How Markov Chains Reveal True Marketing Impact.
🔄 The Problem: Why Last-Click Attribution Fails 🔄
Most companies rely on last-touch attribution, giving 100% credit to the final click. But real customer journeys are rarely linear.
Imagine a user who:
- Discovers your brand via Facebook
- Researches via Organic Search
- Returns via Email
- Converts via Adwords
Last-click gives all credit to Adwords — but what if removing Facebook collapsed the funnel?
This is where Markov Chain Attribution shines.
🧠 How Markov Chains Work: The Memoryless Journey 🧠
A Markov Chain models transitions between states (e.g., marketing channels) based only on the current state — not the full history.
Key Components:
- States: home, product, cart, purchase, cancel
- Transition Matrix: Probability of moving from one state to another
- Removal Effect: Measures a channel’s true impact by simulating its removal
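Before touching real data, the mechanics are easy to see on toy paths. A minimal sketch (the paths and counts below are invented purely for illustration):

```python
from collections import defaultdict

# Hypothetical user journeys through the funnel (illustrative only)
paths = [
    ["home", "department", "product", "cart", "purchase"],
    ["home", "department", "product", "cart", "cancel"],
    ["home", "department", "product", "cart", "purchase"],
]

# Count transitions between consecutive states
counts = defaultdict(lambda: defaultdict(int))
for path in paths:
    for cur, nxt in zip(path, path[1:]):
        counts[cur][nxt] += 1

# Normalize each row into transition probabilities
matrix = {
    cur: {nxt: n / sum(nxts.values()) for nxt, n in nxts.items()}
    for cur, nxts in counts.items()
}
print(matrix["cart"])  # {'purchase': 0.666..., 'cancel': 0.333...}
```

The same row-normalized counting is what the BigQuery query below performs at scale.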
📊 Transition Matrix: The Heart of the Model 📊
We begin by extracting and aggregating user paths using Python and BigQuery SQL to obtain the transition matrix:
query = """
WITH transitions AS (
SELECT
e1.event_type AS current_event,
e2.event_type AS next_event
FROM
`bigquery-public-data.thelook_ecommerce.events` AS e1
JOIN
`bigquery-public-data.thelook_ecommerce.events` AS e2
ON
e1.session_id = e2.session_id AND
e1.sequence_number = e2.sequence_number - 1
),
counts AS (
SELECT
current_event,
COUNT(*) AS total_count
FROM
transitions
GROUP BY
current_event
)
SELECT
t.current_event,
t.next_event,
COUNT(*) AS transition_count,
COUNT(*) / c.total_count AS transition_probability
FROM
transitions t
JOIN
counts c ON t.current_event = c.current_event
GROUP BY
t.current_event, t.next_event, c.total_count
ORDER BY
t.current_event, t.next_event;
"""
query_job = client.query(query)
transitions = query_job.result().to_dataframe()
transition_matrix = transitions.pivot(index='current_event', columns='next_event', values='transition_probability')
transition_matrix
# comvert nan to zero
transition_matrix = transition_matrix.fillna(0)
transition_matrix = transition_matrix.round(3)
transition_matrix
Here’s the aggregated transition matrix:
| current_event | cancel | cart | department | product | purchase |
|---|---|---|---|---|---|
| cart | 0.267 | 0.0 | 0.345 | 0.0 | 0.388 |
| department | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| home | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| product | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
Key Insights:
- High Cart Abandonment: 26.7% of users abandon at cart
- Smooth Navigation: the home → department → product path is deterministic (each transition has probability 1.0)
- Purchase Bottleneck: Only 38.8% convert from cart
Markov chain visualization in Python:

```python
import networkx as nx
import matplotlib.pyplot as plt

# Build a directed graph from the non-zero transitions
graph = nx.DiGraph()
for current_event in transition_matrix.index:
    for next_event in transition_matrix.columns:
        probability = transition_matrix.loc[current_event, next_event]
        if probability > 0:
            graph.add_edge(current_event, next_event, weight=probability)

# Lay out the graph, nudging a few nodes apart for readability
pos = nx.spring_layout(graph)
pos['product'] = (pos['product'][0], pos['product'][1] + 1.5)
pos['home'] = (pos['home'][0] + 1, pos['home'][1] + 1.5)
pos['department'] = (pos['department'][0] + 1, pos['department'][1])

nx.draw(graph, pos, with_labels=True, node_size=1000, node_color="skyblue",
        font_size=10, font_weight="bold", arrows=True)
edge_labels = nx.get_edge_attributes(graph, "weight")
nx.draw_networkx_edge_labels(graph, pos, edge_labels=edge_labels)
plt.title("Markov Chain Transition Diagram")
plt.show()
```
🚫 The Removal Effect: Measuring True Channel Impact 🚫
We compute each channel’s contribution by simulating its removal:
```python
# Baseline conversion rate
baseline_conv_rate = total_conversions / total_sessions

# Removal effect: drop each channel's sessions and conversions,
# then measure the relative change in conversion rate
removal_effects = {}
for _, row in df.iterrows():
    remaining_conversions = total_conversions - row['conversions']
    remaining_sessions = total_sessions - row['total_sessions']
    new_conv_rate = remaining_conversions / remaining_sessions
    removal_effect = (baseline_conv_rate - new_conv_rate) / baseline_conv_rate
    removal_effects[row['traffic_source']] = max(removal_effect, 0)

# Normalize removal effects into attribution shares
attribution_shares = {k: v / sum(removal_effects.values())
                      for k, v in removal_effects.items()}
```
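One edge case worth guarding: if every channel's removal effect is zero (which can happen in near-homogeneous synthetic data), the share computation divides by zero. A small defensive variant; `normalize_shares` is a hypothetical helper name, and the even-split fallback is one possible convention:

```python
def normalize_shares(removal_effects):
    """Convert removal effects to attribution shares, guarding the
    all-zero case with an even split instead of a ZeroDivisionError."""
    total = sum(removal_effects.values())
    if total == 0:
        return {k: 1 / len(removal_effects) for k in removal_effects}
    return {k: v / total for k, v in removal_effects.items()}

print(normalize_shares({"Adwords": 0.001, "Organic": 0.0}))
# {'Adwords': 1.0, 'Organic': 0.0}
print(normalize_shares({"Adwords": 0.0, "Organic": 0.0}))
# {'Adwords': 0.5, 'Organic': 0.5}
```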
Attribution Results:
| Channel | Removal Effect | Attribution Share |
|---|---|---|
| Adwords | 0.001 | 100.0% |
| 0.000 | 0.0% | |
| 0.000 | 0.0% | |
| Organic | 0.000 | 0.0% |
| YouTube | 0.000 | 0.0% |
⚠️ Note: In this synthetic dataset, all channels show nearly identical behavior — a known limitation. In real data, differences in conversion paths would yield meaningful attribution splits.
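The leave-one-channel-out calculation above approximates removal at the session level. The textbook Markov removal effect instead operates on the transition matrix itself. Below is a minimal sketch using the funnel states from the earlier matrix (note the states here are funnel pages rather than channels; with multi-touch data they would be traffic sources). Redirecting a removed state's inbound probability mass to the drop-off state is one common convention, not the only one:

```python
import numpy as np

# Funnel states; 'purchase' and 'cancel' are absorbing.
# Row/column order: home, department, product, cart, purchase, cancel.
# Probabilities are copied from the transition matrix shown earlier.
P = np.array([
    [0.0, 1.0,   0.0, 0.0, 0.0,   0.0  ],  # home
    [0.0, 0.0,   1.0, 0.0, 0.0,   0.0  ],  # department
    [0.0, 0.0,   0.0, 1.0, 0.0,   0.0  ],  # product
    [0.0, 0.345, 0.0, 0.0, 0.388, 0.267],  # cart
    [0.0, 0.0,   0.0, 0.0, 1.0,   0.0  ],  # purchase
    [0.0, 0.0,   0.0, 0.0, 0.0,   1.0  ],  # cancel
])

def conversion_prob(P, start=0, target=4, steps=200):
    """P(eventually absorbed in `target` from `start`), approximated
    by running the chain for many steps."""
    return np.linalg.matrix_power(P, steps)[start, target]

def removal_effect(P, idx, drop=5):
    """Remove state `idx`: journeys that would enter it are redirected
    to the drop-off state, then the lost conversion share is measured."""
    base = conversion_prob(P)
    Q = P.copy()
    Q[:, drop] += Q[:, idx]  # inbound mass now drops off
    Q[:, idx] = 0.0
    Q[idx, :] = 0.0
    Q[idx, drop] = 1.0
    return (base - conversion_prob(Q)) / base

print(round(conversion_prob(P), 3))  # baseline conversion from 'home'
print(removal_effect(P, 3))          # removing 'cart' kills all conversions -> 1.0
```

Because every path to purchase runs through cart, its removal effect is 1.0, which is exactly the "collapsed funnel" intuition from the introduction.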
📈 Business Impact: Fixing the Cart Bottleneck 📈
Even small improvements in cart conversion have massive impact.
| Metric | Value |
|---|---|
| Baseline Conversion Rate | 26.5% |
| Simulated Cart → Purchase (+10%) | 33.8% |
| Impact on Conversion | +7.3 pp |
Improving cart-to-purchase conversion by 10 percentage points lifts overall conversion by 7.3 points, without acquiring any new traffic.
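The same what-if can be run through the chain itself. Because home → department → product → cart is deterministic and cart can loop back via department, the overall conversion probability has a closed form, p / (1 - q). A sketch under that assumption (note this event-level figure uses a different baseline than the session-level rates in the table above, so the absolute numbers differ):

```python
def purchase_prob(p_cart_purchase, p_cart_department=0.345):
    """Overall conversion for the funnel in the earlier matrix: cart
    purchases (p), cancels, or loops back via department (q).
    Summing over loop traversals: p + q*p + q^2*p + ... = p / (1 - q)."""
    return p_cart_purchase / (1 - p_cart_department)

base = purchase_prob(0.388)
improved = purchase_prob(0.388 + 0.10)  # shift 10 pp from cancel to purchase
print(f"{base:.3f} -> {improved:.3f} (uplift {improved - base:.3f})")
```

Note how the cart → department loop amplifies any cart-level improvement: every point gained at cart compounds across repeat visits to the cart state.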
🔍 Data Storytelling: Before → Change → Reason → Impact 🔍
| Phase | Insight |
|---|---|
| Before | Last-click attribution overvalues Adwords; other channels ignored |
| Change | Implemented Markov Chain model with removal effect |
| Reason | Identified cart abandonment as universal bottleneck |
| Impact | Shifted budget to checkout optimization, boosting conversion by 7.3 pp |
✅ Closing the Series: Key Takeaways & Final Reflections ✅
This project began with raw data and ended with actionable, business-driven insights. Across 12 chapters, we:
- Cleaned and modeled real-world e-commerce data
- Segmented customers using unsupervised learning
- Mapped journeys with Markov chains
- Engineered KPIs for marketing, sales, and supply chain
While the data was synthetic, the methods are production-ready: from SQL-based KPI pipelines to Python-powered attribution models.
The biggest lesson? Data doesn’t speak for itself — it takes structure, storytelling, and strategic framing to turn numbers into decisions.
Thank you for following along. This concludes Unlocking eCommerce Success: A Deep Dive into TheLook Dataset.
Some Recommended Datasets:
While TheLook dataset is excellent for learning, it has a key limitation: each session has only one traffic source, preventing true multi-touch analysis.
For advanced Markov chain and attribution modeling, consider these realistic datasets:
🔗 GitHub: Multi-Touch Attribution (eeghor/mta)
A synthetic but realistic dataset designed for attribution modeling:
- Paths like: Facebook > Google > Email > Purchase
- Includes conversion counts, timestamps, and exposure times
- Perfect for testing Markov, Shapley, and time-decay models
🔗 https://github.com/eeghor/mta

⚡ Databricks Multi-Touch Attribution Solution Accelerator
- Full synthetic dataset with ad impressions and conversions
- Implements first-touch, last-touch, and Markov models
- Built for production-level attribution on the Lakehouse
🔗 Databricks MTA Blog
These resources enable true multi-touch journey analysis, going beyond single-session attribution.