🧩 Chapter 4.3: Mapping the Customer Journey — How Markov Chains Reveal True Marketing Impact

Welcome back to our eCommerce Analytics Series! 👋 In the previous chapter, Product Clustering and Customer Segmentation, we uncovered behavioral segments using K-Means and DBSCAN. Now we take the next step: understanding how customers move through the funnel.

Let’s explore Chapter 4.3: Mapping the Customer Journey — How Markov Chains Reveal True Marketing Impact.

🔄 The Problem: Why Last-Click Attribution Fails 🔄

Most companies rely on last-touch attribution, giving 100% credit to the final click. But real customer journeys are rarely linear.

Imagine a user who:

  • Discovers your brand via Facebook
  • Researches via Organic Search
  • Returns via Email
  • Converts via Adwords

Last-click gives all credit to Adwords — but what if removing Facebook collapsed the funnel?

This is where Markov Chain Attribution shines.

🧠 How Markov Chains Work: The Memoryless Journey 🧠

A Markov Chain models transitions between states (e.g., marketing channels) based only on the current state — not the full history.

Key Components:

  • States: home, product, cart, purchase, cancel
  • Transition Matrix: Probability of moving from one state to another
  • Removal Effect: Measures a channel’s true impact by simulating its removal
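
To make these components concrete, here is a minimal sketch of how a transition matrix is estimated from raw clickstreams. The paths below are invented toy sessions for illustration only:

```python
from collections import Counter, defaultdict

# Hypothetical toy clickstream paths (one list per session)
paths = [
    ["home", "product", "cart", "purchase"],
    ["home", "product", "cart", "cancel"],
    ["home", "product", "purchase"],
]

# Count every state -> state transition across all paths
counts = Counter((a, b) for path in paths for a, b in zip(path, path[1:]))

# Total outgoing transitions per state, used to normalize rows
totals = defaultdict(int)
for (a, _), n in counts.items():
    totals[a] += n

# Transition probability = count / row total
matrix = {(a, b): n / totals[a] for (a, b), n in counts.items()}
print(matrix[("cart", "purchase")])  # → 0.5 (1 of 2 cart exits is a purchase)
```

Each row of the resulting matrix sums to 1, which is exactly the property the BigQuery pipeline below reproduces at scale.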

📊 Transition Matrix: The Heart of the Model 📊

We begin by extracting and aggregating user paths using Python and BigQuery SQL to obtain the transition matrix:

```python
from google.cloud import bigquery

client = bigquery.Client()

query = """
WITH transitions AS (
    SELECT
        e1.event_type AS current_event,
        e2.event_type AS next_event
    FROM
        `bigquery-public-data.thelook_ecommerce.events` AS e1
    JOIN
        `bigquery-public-data.thelook_ecommerce.events` AS e2
    ON
        e1.session_id = e2.session_id AND
        e1.sequence_number = e2.sequence_number - 1
),
counts AS (
    SELECT
        current_event,
        COUNT(*) AS total_count
    FROM
        transitions
    GROUP BY
        current_event
)
SELECT
    t.current_event,
    t.next_event,
    COUNT(*) AS transition_count,
    COUNT(*) / c.total_count AS transition_probability
FROM
    transitions t
JOIN
    counts c ON t.current_event = c.current_event
GROUP BY
    t.current_event, t.next_event, c.total_count
ORDER BY
    t.current_event, t.next_event;
"""
query_job = client.query(query)
transitions = query_job.result().to_dataframe()

# Pivot into a matrix: rows = current event, columns = next event
transition_matrix = transitions.pivot(
    index='current_event', columns='next_event', values='transition_probability'
)

# Convert NaN (unobserved transitions) to zero and round for display
transition_matrix = transition_matrix.fillna(0).round(3)
transition_matrix
```

Here’s the aggregated transition matrix:

| current_event | cancel | cart | department | product | purchase |
|---|---|---|---|---|---|
| cart | 0.267 | 0.0 | 0.345 | 0.0 | 0.388 |
| department | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| home | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| product | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |

Key Insights:

  • High Cart Abandonment: 26.7% of users abandon at cart
  • Smooth Navigation: home → department → product is 100% efficient
  • Purchase Bottleneck: Only 38.8% convert from cart
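
Treating purchase and cancel as absorbing states, the standard absorbing-chain calculation (fundamental matrix N = (I − Q)⁻¹, absorption probabilities B = N·R) turns the matrix above into each page's eventual conversion probability. This is a sketch using the aggregated values from the table:

```python
import numpy as np

# Transient states in row order, absorbing states in column order
transient = ["home", "department", "product", "cart"]
absorbing = ["purchase", "cancel"]

# Q: transient -> transient probabilities, taken from the matrix above
Q = np.array([
    [0.0, 1.0,   0.0, 0.0],   # home -> department
    [0.0, 0.0,   1.0, 0.0],   # department -> product
    [0.0, 0.0,   0.0, 1.0],   # product -> cart
    [0.0, 0.345, 0.0, 0.0],   # cart -> department (re-browse loop)
])
# R: transient -> absorbing probabilities
R = np.array([
    [0.0,   0.0],
    [0.0,   0.0],
    [0.0,   0.0],
    [0.388, 0.267],           # cart -> purchase / cancel
])

# Fundamental matrix and absorption probabilities
N = np.linalg.inv(np.eye(len(transient)) - Q)
B = N @ R

row = B[transient.index("home")]
print({state: round(float(p), 3) for state, p in zip(absorbing, row)})
# → {'purchase': 0.592, 'cancel': 0.408}
```

So despite the 26.7% per-visit abandonment, the re-browse loop through department means roughly 59% of journeys that start at home eventually purchase under this path-level model.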

Markov chain visualization in python:

```python
import networkx as nx
import matplotlib.pyplot as plt

# Create a directed graph
graph = nx.DiGraph()

# Add an edge for every non-zero transition probability
for current_event in transition_matrix.index:
    for next_event in transition_matrix.columns:
        probability = transition_matrix.loc[current_event, next_event]
        if probability > 0:
            graph.add_edge(current_event, next_event, weight=probability)

# Compute a layout, then nudge a few nodes apart for readability
pos = nx.spring_layout(graph)
pos['product'] = (pos['product'][0], pos['product'][1] + 1.5)
pos['home'] = (pos['home'][0] + 1, pos['home'][1] + 1.5)
pos['department'] = (pos['department'][0] + 1, pos['department'][1])

nx.draw(graph, pos, with_labels=True, node_size=1000, node_color="skyblue",
        font_size=10, font_weight="bold", arrows=True)
edge_labels = nx.get_edge_attributes(graph, "weight")
nx.draw_networkx_edge_labels(graph, pos, edge_labels=edge_labels)

plt.title("Markov Chain Transition Diagram")
plt.show()
```

🚫 The Removal Effect: Measuring True Channel Impact 🚫

We compute each channel’s contribution by simulating its removal:

```python
# Assumes a DataFrame `df` with one row per traffic source and columns
# 'traffic_source', 'total_sessions', 'conversions'
total_sessions = df['total_sessions'].sum()
total_conversions = df['conversions'].sum()

# Baseline conversion rate
baseline_conv_rate = total_conversions / total_sessions

# Removal effect: how much does the overall conversion rate drop
# when a channel's sessions (and their conversions) are removed?
removal_effects = {}
for _, row in df.iterrows():
    remaining_conversions = total_conversions - row['conversions']
    remaining_sessions = total_sessions - row['total_sessions']
    new_conv_rate = remaining_conversions / remaining_sessions
    removal_effect = (baseline_conv_rate - new_conv_rate) / baseline_conv_rate
    removal_effects[row['traffic_source']] = max(removal_effect, 0)

# Normalize removal effects into attribution shares
total_effect = sum(removal_effects.values())
attribution_shares = {k: v / total_effect for k, v in removal_effects.items()}
```

Attribution Results:

| Channel | Removal Effect | Attribution Share |
|---|---|---|
| Adwords | 0.001 | 100.0% |
| Facebook | 0.000 | 0.0% |
| Email | 0.000 | 0.0% |
| Organic | 0.000 | 0.0% |
| YouTube | 0.000 | 0.0% |

⚠️ Note: In this synthetic dataset, all channels show nearly identical behavior — a known limitation. In real data, differences in conversion paths would yield meaningful attribution splits.
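To see what a non-degenerate split looks like, here is the same removal-effect formula applied to hypothetical per-channel counts. The channel names are reused from the article, but the numbers are invented purely for illustration:

```python
# Hypothetical per-channel session and conversion counts
channel_stats = {
    "Adwords": {"sessions": 1000, "conversions": 400},
    "Email":   {"sessions": 1000, "conversions": 300},
    "YouTube": {"sessions": 1000, "conversions": 100},
}
total_sessions = sum(s["sessions"] for s in channel_stats.values())
total_conversions = sum(s["conversions"] for s in channel_stats.values())
baseline = total_conversions / total_sessions  # 800 / 3000

# Removal effect per channel, clipped at zero as in the snippet above
removal_effects = {}
for ch, s in channel_stats.items():
    new_rate = (total_conversions - s["conversions"]) / (total_sessions - s["sessions"])
    removal_effects[ch] = max((baseline - new_rate) / baseline, 0)

# Normalize into attribution shares
total_effect = sum(removal_effects.values())
shares = {ch: eff / total_effect for ch, eff in removal_effects.items()}
print(shares)  # → {'Adwords': 0.8, 'Email': 0.2, 'YouTube': 0.0}
```

Because removing YouTube would actually raise the blended conversion rate, its clipped removal effect is zero, while the two stronger channels split the credit 80/20.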

📈 Business Impact: Fixing the Cart Bottleneck 📈

Even small improvements in cart conversion have massive impact.

| Metric | Value |
|---|---|
| Baseline Conversion Rate | 26.5% |
| Simulated Cart → Purchase (+10%) | 33.8% |
| Impact on Conversion | +7.3 pp |

A 10% reduction in cart abandonment boosts conversion by 7.3 percentage points — without new traffic.
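
The headline figures above come from a session-level simulation. As a rough sketch of the same what-if mechanics on the path-level matrix from earlier (so the absolute numbers differ), assume 10% of the cart-abandonment mass shifts to purchase:

```python
# Cart row from the aggregated matrix: cancel 0.267, department 0.345, purchase 0.388
cancel, rebrowse, purchase = 0.267, 0.345, 0.388

def p_purchase(purchase_prob, rebrowse_prob):
    # home -> department -> product -> cart is deterministic in the matrix,
    # so P(purchase | home) solves p = purchase_prob + rebrowse_prob * p
    return purchase_prob / (1 - rebrowse_prob)

baseline = p_purchase(purchase, rebrowse)

# What-if: 10% of cart abandonment is recovered as purchases
shifted = cancel * 0.10
whatif = p_purchase(purchase + shifted, rebrowse)

print(round(baseline, 3), round(whatif, 3))  # → 0.592 0.633
```

Even in this simplified path-level view, a 10% shift out of abandonment lifts eventual conversion by roughly 4 percentage points, reinforcing the same conclusion: the cheapest conversions are the ones already sitting in the cart.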

🔍 Data Storytelling: Before → Change → Reason → Impact 🔍

| Phase | Insight |
|---|---|
| Before | Last-click attribution overvalues Adwords; other channels ignored |
| Change | Implemented Markov Chain model with removal effect |
| Reason | Identified cart abandonment as universal bottleneck |
| Impact | Shifted budget to checkout optimization, boosting conversion by 7.3 pp |

✅ Closing the Series: Key Takeaways & Final Reflections ✅

This project began with raw data and ended with actionable, business-driven insights. Across 12 chapters, we:

  • Cleaned and modeled real-world e-commerce data
  • Segmented customers using unsupervised learning
  • Mapped journeys with Markov chains
  • Engineered KPIs for marketing, sales, and supply chain

While the data was synthetic, the methods are production-ready: from SQL-based KPI pipelines to Python-powered attribution models.

The biggest lesson? Data doesn’t speak for itself — it takes structure, storytelling, and strategic framing to turn numbers into decisions.

Thank you for following along. This concludes Unlocking eCommerce Success: A Deep Dive into TheLook Dataset.

While TheLook dataset is excellent for learning, it has a key limitation: each session has only one traffic source, preventing true multi-touch analysis.

For advanced Markov chain and attribution modeling, consider these realistic datasets:

🔗 GitHub: Multi-Touch Attribution (eeghor/mta)

A synthetic but realistic dataset designed for attribution modeling:

  • Paths like: Facebook > Google > Email > Purchase
  • Includes conversion counts, timestamps, and exposure times
  • Perfect for testing Markov, Shapley, and time-decay models

🔗 https://github.com/eeghor/mta

⚡ Databricks Multi-Touch Attribution Solution Accelerator

  • Full synthetic dataset with ad impressions and conversions
  • Implements first-touch, last-touch, and Markov models
  • Built for production-level attribution on the Lakehouse

🔗 Databricks MTA Blog

These resources enable true multi-touch journey analysis, going beyond single-session attribution.