FLO CLTV Prediction with BG-NBD & Gamma-Gamma

Yasemin Derya Dilli
7 min readNov 19, 2023

Data Understanding ve Preparation

First of all , we import all library

import pandas as pd
import datetime as dt
from lifetimes import BetaGeoFitter
from lifetimes import GammaGammaFitter
from sklearn.preprocessing import MinMaxScaler
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.options.mode.chained_assignment = None

from terminal, we install lifetime library and read data set.

pip install lifetime
df_ = pd.read_csv("flo_data_20k.csv")

df = df_.copy()
df.head()
                              master_id order_channel last_order_channel  \
0 cc294636-19f0-11eb-8d74-000d3a38a36f Android App Offline
1 f431bd5a-ab7b-11e9-a2fc-000d3a38a36f Android App Mobile
2 69b69676-1a40-11ea-941b-000d3a38a36f Android App Android App
3 1854e56c-491f-11eb-806e-000d3a38a36f Android App Android App
4 d6ea1074-f1f5-11e9-9346-000d3a38a36f Desktop Desktop
first_order_date last_order_date last_order_date_online \
0 2020-10-30 2021-02-26 2021-02-21
1 2017-02-08 2021-02-16 2021-02-16
2 2019-11-27 2020-11-27 2020-11-27
3 2021-01-06 2021-01-17 2021-01-17
4 2019-08-03 2021-03-07 2021-03-07
last_order_date_offline order_num_total_ever_online \
0 2021-02-26 4.00
1 2020-01-10 19.00
2 2019-12-01 3.00
3 2021-01-06 1.00
4 2019-08-03 1.00
order_num_total_ever_offline customer_value_total_ever_offline \
0 1.00 139.99
1 2.00 159.97
2 2.00 189.97
3 1.00 39.99
4 1.00 49.99
customer_value_total_ever_online interested_in_categories_12
0 799.38 [KADIN]
1 1853.58 [ERKEK, COCUK, KADIN, AKTIFSPOR]
2 395.35 [ERKEK, KADIN]
3 81.98 [AKTIFCOCUK, COCUK]
4 159.99 [AKTIFSPOR]
df.head()
Out[3]:
master_id order_channel last_order_channel \
0 cc294636-19f0-11eb-8d74-000d3a38a36f Android App Offline
1 f431bd5a-ab7b-11e9-a2fc-000d3a38a36f Android App Mobile
2 69b69676-1a40-11ea-941b-000d3a38a36f Android App Android App
3 1854e56c-491f-11eb-806e-000d3a38a36f Android App Android App
4 d6ea1074-f1f5-11e9-9346-000d3a38a36f Desktop Desktop
first_order_date last_order_date last_order_date_online \
0 2020-10-30 2021-02-26 2021-02-21
1 2017-02-08 2021-02-16 2021-02-16
2 2019-11-27 2020-11-27 2020-11-27
3 2021-01-06 2021-01-17 2021-01-17
4 2019-08-03 2021-03-07 2021-03-07
last_order_date_offline order_num_total_ever_online \
0 2021-02-26 4.00
1 2020-01-10 19.00
2 2019-12-01 3.00
3 2021-01-06 1.00
4 2019-08-03 1.00
order_num_total_ever_offline customer_value_total_ever_offline \
0 1.00 139.99
1 2.00 159.97
2 2.00 189.97
3 1.00 39.99
4 1.00 49.99
customer_value_total_ever_online interested_in_categories_12
0 799.38 [KADIN]
1 1853.58 [ERKEK, COCUK, KADIN, AKTIFSPOR]
2 395.35 [ERKEK, KADIN]
3 81.98 [AKTIFCOCUK, COCUK]
4 159.99 [AKTIFSPOR]

Trimming outliers function

def outlier_thresholds(dataframe, variable):
quartile1 = dataframe[variable].quantile(0.01)
quartile3 = dataframe[variable].quantile(0.99)
interquantile_range = quartile3 - quartile1
up_limit = quartile3 + 1.5 * interquantile_range
low_limit = quartile1 - 1.5 * interquantile_range
return low_limit, up_limit
def replace_with_thresholds(dataframe, variable):
low_limit, up_limit = outlier_thresholds(dataframe, variable)
dataframe.loc[(dataframe[variable] < low_limit), variable] = round(low_limit,0)
dataframe.loc[(dataframe[variable] > up_limit), variable] = round(up_limit,0)

The outliers of the variables ‘order_num_total_ever_online’, ‘order_num_total_ever_offline’, ‘customer_value_total_ever_offline’, ‘customer_value_total_ever_online’ have been suppressed.

columns = ["order_num_total_ever_online", "order_num_total_ever_offline", "customer_value_total_ever_offline","customer_value_total_ever_online"]
for col in columns:
replace_with_thresholds(df, col)

We have created a new variable for the total order number and total customer value because we have two source from online and offline.

df["order_num_total"] = df["order_num_total_ever_online"] + df["order_num_total_ever_offline"]
df["customer_value_total"] = df["customer_value_total_ever_offline"] + df["customer_value_total_ever_online"]

We converted the data type of variables representing dates to ‘date’.

date_columns = df.columns[df.columns.str.contains("date")]
df[date_columns] = df[date_columns].apply(pd.to_datetime)
master_id                                    object
order_channel object
last_order_channel object
first_order_date datetime64[ns]
last_order_date datetime64[ns]
last_order_date_online datetime64[ns]
last_order_date_offline datetime64[ns]
order_num_total_ever_online float64
order_num_total_ever_offline float64
customer_value_total_ever_offline float64
customer_value_total_ever_online float64
interested_in_categories_12 object
order_num_total float64
customer_value_total float64
dtype: object

Creating the CLTV Data Structure

We determined the analysis date as 2 days after the date of the last purchase.


df["last_order_date"].max() # 2021-05-30
analysis_date = dt.datetime(2021,6,1)

We are creating the CLTV (Customer Lifetime Value) data frame.


cltv_df = pd.DataFrame()
cltv_df["customer_id"] = df["master_id"]
cltv_df["recency_cltv_weekly"] = ((df["last_order_date"]- df["first_order_date"]).astype('timedelta64[D]')) / 7
cltv_df["T_weekly"] = ((analysis_date - df["first_order_date"]).astype('timedelta64[D]'))/7
cltv_df["frequency"] = df["order_num_total"]
cltv_df["monetary_cltv_avg"] = df["customer_value_total"] / df["order_num_total"]

cltv_df.head()
                            customer_id  recency_cltv_weekly  T_weekly  \
0 cc294636-19f0-11eb-8d74-000d3a38a36f 17.00 30.57
1 f431bd5a-ab7b-11e9-a2fc-000d3a38a36f 209.86 224.86
2 69b69676-1a40-11ea-941b-000d3a38a36f 52.29 78.86
3 1854e56c-491f-11eb-806e-000d3a38a36f 1.57 20.86
4 d6ea1074-f1f5-11e9-9346-000d3a38a36f 83.14 95.43
frequency monetary_cltv_avg
0 5.00 187.87
1 21.00 95.88
2 5.00 117.06
3 2.00 60.98
4 2.00 104.99

Setting up the BG/NBD and Gamma-Gamma models


bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(cltv_df['frequency'],
cltv_df['recency_cltv_weekly'],
cltv_df['T_weekly'])

We predicted the expected purchases from customers within 3 months and added it to the CLTV dataframe as ‘exp_sales_3_month’.


cltv_df["exp_sales_3_month"] = bgf.predict(4*3,
cltv_df['frequency'],
cltv_df['recency_cltv_weekly'],
cltv_df['T_weekly'])

We predicted the expected purchases from customers within 6months and added it to the CLTV dataframe as ‘exp_sales_6_month’.


cltv_df["exp_sales_6_month"] = bgf.predict(4*6,
cltv_df['frequency'],
cltv_df['recency_cltv_weekly'],
cltv_df['T_weekly'])

We examine the top purchasers in months 3


cltv_df.sort_values("exp_sales_3_month",ascending=False)[:10]

                                customer_id  recency_cltv_weekly  T_weekly
7330 a4d534a2-5b1b-11eb-8dbd-000d3a38a36f 62.71 67.29
15611 4a7e875e-e6ce-11ea-8f44-000d3a38a36f 39.71 40.00
8328 1902bf80-0035-11eb-8341-000d3a38a36f 28.86 33.29
19538 55d54d9e-8ac7-11ea-8ec0-000d3a38a36f 52.57 58.71
14373 f00ad516-c4f4-11ea-98f7-000d3a38a36f 38.00 46.43
10489 7af5cd16-b100-11e9-9757-000d3a38a36f 103.14 111.86
4315 d5ef8058-a5c6-11e9-a2fc-000d3a38a36f 133.14 147.14
6756 27310582-6362-11ea-a6dc-000d3a38a36f 62.71 64.14
6666 53fe00d4-7b7a-11eb-960b-000d3a38a36f 9.71 13.00
10536 e143b6fa-d6f8-11e9-93bc-000d3a38a36f 104.57 113.43
frequency monetary_cltv_avg exp_sales_3_month exp_sales_6_month
7330 52.00 166.22 4.66 9.31
15611 29.00 165.30 3.37 6.75
8328 25.00 97.44 3.14 6.28
19538 31.00 228.53 3.08 6.17
14373 27.00 141.35 3.00 6.00
10489 43.00 157.11 2.98 5.96
4315 49.00 161.85 2.83 5.66
6756 29.00 168.88 2.79 5.59
6666 17.00 259.87 2.78 5.56
10536 40.00 176.20 2.76 5.53

We established the Gamma-Gamma model.

ggf = GammaGammaFitter(penalizer_coef=0.01)
ggf.fit(cltv_df['frequency'], cltv_df['monetary_cltv_avg'])
cltv_df["exp_average_value"] = ggf.conditional_expected_average_profit(cltv_df['frequency'],
cltv_df['monetary_cltv_avg'])
cltv_df.head()

After predicting the average value customers will leave, we added it to the CLTV dataframe as ‘exp_average_value’

                           customer_id  recency_cltv_weekly  T_weekly  \
0 cc294636-19f0-11eb-8d74-000d3a38a36f 17.00 30.57
1 f431bd5a-ab7b-11e9-a2fc-000d3a38a36f 209.86 224.86
2 69b69676-1a40-11ea-941b-000d3a38a36f 52.29 78.86
3 1854e56c-491f-11eb-806e-000d3a38a36f 1.57 20.86
4 d6ea1074-f1f5-11e9-9346-000d3a38a36f 83.14 95.43
frequency monetary_cltv_avg exp_sales_3_month exp_sales_6_month \
0 5.00 187.87 0.97 1.95
1 21.00 95.88 0.98 1.97
2 5.00 117.06 0.67 1.34
3 2.00 60.98 0.70 1.40
4 2.00 104.99 0.40 0.79
exp_average_value
0 193.63
1 96.67
2 120.97
3 67.32
4 114.33

We calculated the 6-month CLTV and added it to the dataframe with the name ‘cltv’

cltv = ggf.customer_lifetime_value(bgf,
cltv_df['frequency'],
cltv_df['recency_cltv_weekly'],
cltv_df['T_weekly'],
cltv_df['monetary_cltv_avg'],
time=6,
freq="W",
discount_rate=0.01)
cltv_df["cltv"] = cltv
                            customer_id  recency_cltv_weekly  T_weekly  \
0 cc294636-19f0-11eb-8d74-000d3a38a36f 17.00 30.57
1 f431bd5a-ab7b-11e9-a2fc-000d3a38a36f 209.86 224.86
2 69b69676-1a40-11ea-941b-000d3a38a36f 52.29 78.86
3 1854e56c-491f-11eb-806e-000d3a38a36f 1.57 20.86
4 d6ea1074-f1f5-11e9-9346-000d3a38a36f 83.14 95.43
frequency monetary_cltv_avg exp_sales_3_month exp_sales_6_month \
0 5.00 187.87 0.97 1.95
1 21.00 95.88 0.98 1.97
2 5.00 117.06 0.67 1.34
3 2.00 60.98 0.70 1.40
4 2.00 104.99 0.40 0.79
exp_average_value cltv
0 193.63 395.73
1 96.67 199.43
2 120.97 170.22
3 67.32 98.95
4 114.33 95.01

We observed the top 20 individuals with the highest CLTV values.

cltv_df.sort_values("cltv",ascending=False)[:20]
                                customer_id  recency_cltv_weekly  T_weekly  \
9055 47a642fe-975b-11eb-8c2a-000d3a38a36f 2.86 7.86
13880 7137a5c0-7aad-11ea-8f20-000d3a38a36f 6.14 13.14
17323 f59053e2-a503-11e9-a2fc-000d3a38a36f 51.71 101.00
12438 625f40a2-5bd2-11ea-98b0-000d3a38a36f 74.29 74.57
7330 a4d534a2-5b1b-11eb-8dbd-000d3a38a36f 62.71 67.29
8868 9ce6e520-89b0-11ea-a6e7-000d3a38a36f 3.43 34.43
6402 851de3b4-8f0c-11eb-8cb8-000d3a38a36f 8.29 9.43
6666 53fe00d4-7b7a-11eb-960b-000d3a38a36f 9.71 13.00
19538 55d54d9e-8ac7-11ea-8ec0-000d3a38a36f 52.57 58.71
14858 031b2954-6d28-11eb-99c4-000d3a38a36f 14.86 15.57
17963 8fd88976-6708-11ea-9d38-000d3a38a36f 50.29 63.29
15516 9083981a-f59e-11e9-841e-000d3a38a36f 63.57 83.86
6717 40b4f318-9dfb-11eb-9c47-000d3a38a36f 27.14 33.86
4157 7eed6468-4540-11ea-acaf-000d3a38a36f 89.14 90.00
4735 dbabb58e-6312-11ea-a6dc-000d3a38a36f 61.29 64.29
11694 90f1b7f2-bbad-11ea-a0c9-000d3a38a36f 47.29 48.00
11179 d2e74a36-3228-11eb-860c-000d3a38a36f 1.14 26.29
1853 f02473b0-43c3-11eb-806e-000d3a38a36f 17.29 23.14
5775 e31293ac-d63a-11e9-93bc-000d3a38a36f 91.71 93.14
7312 90befc98-925a-11eb-b584-000d3a38a36f 4.14 8.86
frequency monetary_cltv_avg exp_sales_3_month exp_sales_6_month \
9055 4.00 1401.80 1.09 2.19
13880 11.00 758.09 1.97 3.94
17323 7.00 1106.47 0.72 1.44
12438 16.00 501.87 1.57 3.13
7330 52.00 166.22 4.66 9.31
8868 8.00 601.23 1.27 2.53
6402 2.00 862.69 0.79 1.59
6666 17.00 259.87 2.78 5.56
19538 31.00 228.53 3.08 6.17
14858 3.00 743.59 0.87 1.74
17963 7.00 694.20 0.92 1.84
15516 4.00 1090.36 0.57 1.15
6717 7.00 544.70 1.16 2.33
4157 27.00 289.76 2.21 4.43
4735 13.00 442.12 1.42 2.85
11694 6.00 647.34 0.93 1.87
11179 3.00 750.57 0.78 1.56
1853 2.00 835.88 0.68 1.37
5775 8.00 727.09 0.83 1.65
7312 6.00 431.33 1.36 2.73
exp_average_value cltv
9055 1449.06 3327.78
13880 767.36 3172.39
17323 1127.61 1708.98
12438 506.17 1662.61
7330 166.71 1628.89
8868 611.49 1623.81
6402 923.68 1538.86
6666 262.07 1529.23
19538 229.61 1485.82
14858 778.05 1423.00
17963 707.69 1362.61
15516 1127.35 1359.44
6717 555.41 1355.44
4157 291.29 1353.53
4735 446.82 1334.83
11694 662.11 1297.52
11179 785.34 1286.14
1853 895.04 1285.23
5775 739.39 1282.58
7312 441.40 1263.19

Creating Segments Based on CLTV

We are dividing all your customers into 4 groups (segments) based on the 6-month standardized CLTV, and we added the group names to the dataset as #cltv_segment.

cltv_df["cltv_segment"] = pd.qcut(cltv_df["cltv"], 4, labels=["D", "C", "B", "A"])
cltv_df.head()
                            customer_id  recency_cltv_weekly  T_weekly  \
0 cc294636-19f0-11eb-8d74-000d3a38a36f 17.00 30.57
1 f431bd5a-ab7b-11e9-a2fc-000d3a38a36f 209.86 224.86
2 69b69676-1a40-11ea-941b-000d3a38a36f 52.29 78.86
3 1854e56c-491f-11eb-806e-000d3a38a36f 1.57 20.86
4 d6ea1074-f1f5-11e9-9346-000d3a38a36f 83.14 95.43
frequency monetary_cltv_avg exp_sales_3_month exp_sales_6_month \
0 5.00 187.87 0.97 1.95
1 21.00 95.88 0.98 1.97
2 5.00 117.06 0.67 1.34
3 2.00 60.98 0.70 1.40
4 2.00 104.99 0.40 0.79
exp_average_value cltv cltv_segment
0 193.63 395.73 A
1 96.67 199.43 B
2 120.97 170.22 B
3 67.32 98.95 D
4 114.33 95.01 D

As a result, we have segmented our customers into four groups. Now, we can create distinct and personalized communication campaigns tailored to each of the four segments.

Big thanks to Vahit Keskin and Miuul

Contact me on Linkedin :) yaseminderyadilli

Sign up to discover human stories that deepen your understanding of the world.

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

Yasemin Derya Dilli
Yasemin Derya Dilli

Written by Yasemin Derya Dilli

Data Analyst | Engineer | Content Writer

No responses yet

Write a response