FLO CLTV Prediction with BG-NBD & Gamma-Gamma

7 min readNov 19, 2023

Data Understanding ve Preparation

First of all , we import all library

import pandas as pd
import datetime as dt
from lifetimes import BetaGeoFitter
from lifetimes import GammaGammaFitter
from sklearn.preprocessing import MinMaxScaler
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.options.mode.chained_assignment = None

from terminal, we install lifetime library and read data set.

pip install lifetime

df_ = pd.read_csv("flo_data_20k.csv")

df = df_.copy()
df.head()

                              master_id order_channel last_order_channel  \
0  cc294636-19f0-11eb-8d74-000d3a38a36f   Android App            Offline   
1  f431bd5a-ab7b-11e9-a2fc-000d3a38a36f   Android App             Mobile   
2  69b69676-1a40-11ea-941b-000d3a38a36f   Android App        Android App   
3  1854e56c-491f-11eb-806e-000d3a38a36f   Android App        Android App   
4  d6ea1074-f1f5-11e9-9346-000d3a38a36f       Desktop            Desktop   
  first_order_date last_order_date last_order_date_online  \
0       2020-10-30      2021-02-26             2021-02-21   
1       2017-02-08      2021-02-16             2021-02-16   
2       2019-11-27      2020-11-27             2020-11-27   
3       2021-01-06      2021-01-17             2021-01-17   
4       2019-08-03      2021-03-07             2021-03-07   
  last_order_date_offline  order_num_total_ever_online  \
0              2021-02-26                         4.00   
1              2020-01-10                        19.00   
2              2019-12-01                         3.00   
3              2021-01-06                         1.00   
4              2019-08-03                         1.00   
   order_num_total_ever_offline  customer_value_total_ever_offline  \
0                          1.00                             139.99   
1                          2.00                             159.97   
2                          2.00                             189.97   
3                          1.00                              39.99   
4                          1.00                              49.99   
   customer_value_total_ever_online       interested_in_categories_12  
0                            799.38                           [KADIN]  
1                           1853.58  [ERKEK, COCUK, KADIN, AKTIFSPOR]  
2                            395.35                    [ERKEK, KADIN]  
3                             81.98               [AKTIFCOCUK, COCUK]  
4                            159.99                       [AKTIFSPOR]  
df.head()
Out[3]: 
                              master_id order_channel last_order_channel  \
0  cc294636-19f0-11eb-8d74-000d3a38a36f   Android App            Offline   
1  f431bd5a-ab7b-11e9-a2fc-000d3a38a36f   Android App             Mobile   
2  69b69676-1a40-11ea-941b-000d3a38a36f   Android App        Android App   
3  1854e56c-491f-11eb-806e-000d3a38a36f   Android App        Android App   
4  d6ea1074-f1f5-11e9-9346-000d3a38a36f       Desktop            Desktop   
  first_order_date last_order_date last_order_date_online  \
0       2020-10-30      2021-02-26             2021-02-21   
1       2017-02-08      2021-02-16             2021-02-16   
2       2019-11-27      2020-11-27             2020-11-27   
3       2021-01-06      2021-01-17             2021-01-17   
4       2019-08-03      2021-03-07             2021-03-07   
  last_order_date_offline  order_num_total_ever_online  \
0              2021-02-26                         4.00   
1              2020-01-10                        19.00   
2              2019-12-01                         3.00   
3              2021-01-06                         1.00   
4              2019-08-03                         1.00   
   order_num_total_ever_offline  customer_value_total_ever_offline  \
0                          1.00                             139.99   
1                          2.00                             159.97   
2                          2.00                             189.97   
3                          1.00                              39.99   
4                          1.00                              49.99   
   customer_value_total_ever_online       interested_in_categories_12  
0                            799.38                           [KADIN]  
1                           1853.58  [ERKEK, COCUK, KADIN, AKTIFSPOR]  
2                            395.35                    [ERKEK, KADIN]  
3                             81.98               [AKTIFCOCUK, COCUK]  
4                            159.99                       [AKTIFSPOR]

Trimming outliers function

def outlier_thresholds(dataframe, variable):
    quartile1 = dataframe[variable].quantile(0.01)
    quartile3 = dataframe[variable].quantile(0.99)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

def replace_with_thresholds(dataframe, variable):
low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = round(low_limit,0)
    dataframe.loc[(dataframe[variable] > up_limit), variable] = round(up_limit,0)

The outliers of the variables ‘order_num_total_ever_online’, ‘order_num_total_ever_offline’, ‘customer_value_total_ever_offline’, ‘customer_value_total_ever_online’ have been suppressed.

columns = ["order_num_total_ever_online", "order_num_total_ever_offline", "customer_value_total_ever_offline","customer_value_total_ever_online"]
for col in columns:
    replace_with_thresholds(df, col)

We have created a new variable for the total order number and total customer value because we have two source from online and offline.

df["order_num_total"] = df["order_num_total_ever_online"] + df["order_num_total_ever_offline"]
df["customer_value_total"] = df["customer_value_total_ever_offline"] + df["customer_value_total_ever_online"]

We converted the data type of variables representing dates to ‘date’.

date_columns = df.columns[df.columns.str.contains("date")]
df[date_columns] = df[date_columns].apply(pd.to_datetime)

master_id                                    object
order_channel                                object
last_order_channel                           object
first_order_date                     datetime64[ns]
last_order_date                      datetime64[ns]
last_order_date_online               datetime64[ns]
last_order_date_offline              datetime64[ns]
order_num_total_ever_online                 float64
order_num_total_ever_offline                float64
customer_value_total_ever_offline           float64
customer_value_total_ever_online            float64
interested_in_categories_12                  object
order_num_total                             float64
customer_value_total                        float64
dtype: object

Creating the CLTV Data Structure

We determined the analysis date as 2 days after the date of the last purchase.


df["last_order_date"].max() # 2021-05-30
analysis_date = dt.datetime(2021,6,1)

We are creating the CLTV (Customer Lifetime Value) data frame.


cltv_df = pd.DataFrame()
cltv_df["customer_id"] = df["master_id"]
cltv_df["recency_cltv_weekly"] = ((df["last_order_date"]- df["first_order_date"]).astype('timedelta64[D]')) / 7
cltv_df["T_weekly"] = ((analysis_date - df["first_order_date"]).astype('timedelta64[D]'))/7
cltv_df["frequency"] = df["order_num_total"]
cltv_df["monetary_cltv_avg"] = df["customer_value_total"] / df["order_num_total"]

cltv_df.head()

                            customer_id  recency_cltv_weekly  T_weekly  \
0  cc294636-19f0-11eb-8d74-000d3a38a36f                17.00     30.57   
1  f431bd5a-ab7b-11e9-a2fc-000d3a38a36f               209.86    224.86   
2  69b69676-1a40-11ea-941b-000d3a38a36f                52.29     78.86   
3  1854e56c-491f-11eb-806e-000d3a38a36f                 1.57     20.86   
4  d6ea1074-f1f5-11e9-9346-000d3a38a36f                83.14     95.43   
   frequency  monetary_cltv_avg  
0       5.00             187.87  
1      21.00              95.88  
2       5.00             117.06  
3       2.00              60.98  
4       2.00             104.99

Setting up the BG/NBD and Gamma-Gamma models


bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(cltv_df['frequency'],
        cltv_df['recency_cltv_weekly'],
        cltv_df['T_weekly'])

We predicted the expected purchases from customers within 3 months and added it to the CLTV dataframe as ‘exp_sales_3_month’.


cltv_df["exp_sales_3_month"] = bgf.predict(4*3,
                                       cltv_df['frequency'],
                                       cltv_df['recency_cltv_weekly'],
                                       cltv_df['T_weekly'])

We predicted the expected purchases from customers within 6months and added it to the CLTV dataframe as ‘exp_sales_6_month’.


cltv_df["exp_sales_6_month"] = bgf.predict(4*6,
                                       cltv_df['frequency'],
                                       cltv_df['recency_cltv_weekly'],
                                       cltv_df['T_weekly'])

We examine the top purchasers in months 3


cltv_df.sort_values("exp_sales_3_month",ascending=False)[:10]

                                customer_id  recency_cltv_weekly  T_weekly
7330   a4d534a2-5b1b-11eb-8dbd-000d3a38a36f                62.71     67.29   
15611  4a7e875e-e6ce-11ea-8f44-000d3a38a36f                39.71     40.00   
8328   1902bf80-0035-11eb-8341-000d3a38a36f                28.86     33.29   
19538  55d54d9e-8ac7-11ea-8ec0-000d3a38a36f                52.57     58.71   
14373  f00ad516-c4f4-11ea-98f7-000d3a38a36f                38.00     46.43   
10489  7af5cd16-b100-11e9-9757-000d3a38a36f               103.14    111.86   
4315   d5ef8058-a5c6-11e9-a2fc-000d3a38a36f               133.14    147.14   
6756   27310582-6362-11ea-a6dc-000d3a38a36f                62.71     64.14   
6666   53fe00d4-7b7a-11eb-960b-000d3a38a36f                 9.71     13.00   
10536  e143b6fa-d6f8-11e9-93bc-000d3a38a36f               104.57    113.43   
       frequency  monetary_cltv_avg  exp_sales_3_month  exp_sales_6_month  
7330       52.00             166.22               4.66               9.31  
15611      29.00             165.30               3.37               6.75  
8328       25.00              97.44               3.14               6.28  
19538      31.00             228.53               3.08               6.17  
14373      27.00             141.35               3.00               6.00  
10489      43.00             157.11               2.98               5.96  
4315       49.00             161.85               2.83               5.66  
6756       29.00             168.88               2.79               5.59  
6666       17.00             259.87               2.78               5.56  
10536      40.00             176.20               2.76               5.53

We established the Gamma-Gamma model.

ggf = GammaGammaFitter(penalizer_coef=0.01)
ggf.fit(cltv_df['frequency'], cltv_df['monetary_cltv_avg'])
cltv_df["exp_average_value"] = ggf.conditional_expected_average_profit(cltv_df['frequency'],
                                                                cltv_df['monetary_cltv_avg'])
cltv_df.head()

After predicting the average value customers will leave, we added it to the CLTV dataframe as ‘exp_average_value’

                           customer_id  recency_cltv_weekly  T_weekly  \
0  cc294636-19f0-11eb-8d74-000d3a38a36f                17.00     30.57   
1  f431bd5a-ab7b-11e9-a2fc-000d3a38a36f               209.86    224.86   
2  69b69676-1a40-11ea-941b-000d3a38a36f                52.29     78.86   
3  1854e56c-491f-11eb-806e-000d3a38a36f                 1.57     20.86   
4  d6ea1074-f1f5-11e9-9346-000d3a38a36f                83.14     95.43   
   frequency  monetary_cltv_avg  exp_sales_3_month  exp_sales_6_month  \
0       5.00             187.87               0.97               1.95   
1      21.00              95.88               0.98               1.97   
2       5.00             117.06               0.67               1.34   
3       2.00              60.98               0.70               1.40   
4       2.00             104.99               0.40               0.79   
   exp_average_value  
0             193.63  
1              96.67  
2             120.97  
3              67.32  
4             114.33

We calculated the 6-month CLTV and added it to the dataframe with the name ‘cltv’

cltv = ggf.customer_lifetime_value(bgf,
                                   cltv_df['frequency'],
                                   cltv_df['recency_cltv_weekly'],
                                   cltv_df['T_weekly'],
                                   cltv_df['monetary_cltv_avg'],
                                   time=6,
                                   freq="W",
                                   discount_rate=0.01)
cltv_df["cltv"] = cltv

                            customer_id  recency_cltv_weekly  T_weekly  \
0  cc294636-19f0-11eb-8d74-000d3a38a36f                17.00     30.57   
1  f431bd5a-ab7b-11e9-a2fc-000d3a38a36f               209.86    224.86   
2  69b69676-1a40-11ea-941b-000d3a38a36f                52.29     78.86   
3  1854e56c-491f-11eb-806e-000d3a38a36f                 1.57     20.86   
4  d6ea1074-f1f5-11e9-9346-000d3a38a36f                83.14     95.43   
   frequency  monetary_cltv_avg  exp_sales_3_month  exp_sales_6_month  \
0       5.00             187.87               0.97               1.95   
1      21.00              95.88               0.98               1.97   
2       5.00             117.06               0.67               1.34   
3       2.00              60.98               0.70               1.40   
4       2.00             104.99               0.40               0.79   
   exp_average_value   cltv  
0             193.63 395.73  
1              96.67 199.43  
2             120.97 170.22  
3              67.32  98.95  
4             114.33  95.01

We observed the top 20 individuals with the highest CLTV values.

cltv_df.sort_values("cltv",ascending=False)[:20]

                                customer_id  recency_cltv_weekly  T_weekly  \
9055   47a642fe-975b-11eb-8c2a-000d3a38a36f                 2.86      7.86   
13880  7137a5c0-7aad-11ea-8f20-000d3a38a36f                 6.14     13.14   
17323  f59053e2-a503-11e9-a2fc-000d3a38a36f                51.71    101.00   
12438  625f40a2-5bd2-11ea-98b0-000d3a38a36f                74.29     74.57   
7330   a4d534a2-5b1b-11eb-8dbd-000d3a38a36f                62.71     67.29   
8868   9ce6e520-89b0-11ea-a6e7-000d3a38a36f                 3.43     34.43   
6402   851de3b4-8f0c-11eb-8cb8-000d3a38a36f                 8.29      9.43   
6666   53fe00d4-7b7a-11eb-960b-000d3a38a36f                 9.71     13.00   
19538  55d54d9e-8ac7-11ea-8ec0-000d3a38a36f                52.57     58.71   
14858  031b2954-6d28-11eb-99c4-000d3a38a36f                14.86     15.57   
17963  8fd88976-6708-11ea-9d38-000d3a38a36f                50.29     63.29   
15516  9083981a-f59e-11e9-841e-000d3a38a36f                63.57     83.86   
6717   40b4f318-9dfb-11eb-9c47-000d3a38a36f                27.14     33.86   
4157   7eed6468-4540-11ea-acaf-000d3a38a36f                89.14     90.00   
4735   dbabb58e-6312-11ea-a6dc-000d3a38a36f                61.29     64.29   
11694  90f1b7f2-bbad-11ea-a0c9-000d3a38a36f                47.29     48.00   
11179  d2e74a36-3228-11eb-860c-000d3a38a36f                 1.14     26.29   
1853   f02473b0-43c3-11eb-806e-000d3a38a36f                17.29     23.14   
5775   e31293ac-d63a-11e9-93bc-000d3a38a36f                91.71     93.14   
7312   90befc98-925a-11eb-b584-000d3a38a36f                 4.14      8.86   
       frequency  monetary_cltv_avg  exp_sales_3_month  exp_sales_6_month  \
9055        4.00            1401.80               1.09               2.19   
13880      11.00             758.09               1.97               3.94   
17323       7.00            1106.47               0.72               1.44   
12438      16.00             501.87               1.57               3.13   
7330       52.00             166.22               4.66               9.31   
8868        8.00             601.23               1.27               2.53   
6402        2.00             862.69               0.79               1.59   
6666       17.00             259.87               2.78               5.56   
19538      31.00             228.53               3.08               6.17   
14858       3.00             743.59               0.87               1.74   
17963       7.00             694.20               0.92               1.84   
15516       4.00            1090.36               0.57               1.15   
6717        7.00             544.70               1.16               2.33   
4157       27.00             289.76               2.21               4.43   
4735       13.00             442.12               1.42               2.85   
11694       6.00             647.34               0.93               1.87   
11179       3.00             750.57               0.78               1.56   
1853        2.00             835.88               0.68               1.37   
5775        8.00             727.09               0.83               1.65   
7312        6.00             431.33               1.36               2.73   
       exp_average_value    cltv  
9055             1449.06 3327.78  
13880             767.36 3172.39  
17323            1127.61 1708.98  
12438             506.17 1662.61  
7330              166.71 1628.89  
8868              611.49 1623.81  
6402              923.68 1538.86  
6666              262.07 1529.23  
19538             229.61 1485.82  
14858             778.05 1423.00  
17963             707.69 1362.61  
15516            1127.35 1359.44  
6717              555.41 1355.44  
4157              291.29 1353.53  
4735              446.82 1334.83  
11694             662.11 1297.52  
11179             785.34 1286.14  
1853              895.04 1285.23  
5775              739.39 1282.58  
7312              441.40 1263.19

Creating Segments Based on CLTV

We are dividing all your customers into 4 groups (segments) based on the 6-month standardized CLTV, and we added the group names to the dataset as #cltv_segment.

cltv_df["cltv_segment"] = pd.qcut(cltv_df["cltv"], 4, labels=["D", "C", "B", "A"])
cltv_df.head()

                            customer_id  recency_cltv_weekly  T_weekly  \
0  cc294636-19f0-11eb-8d74-000d3a38a36f                17.00     30.57   
1  f431bd5a-ab7b-11e9-a2fc-000d3a38a36f               209.86    224.86   
2  69b69676-1a40-11ea-941b-000d3a38a36f                52.29     78.86   
3  1854e56c-491f-11eb-806e-000d3a38a36f                 1.57     20.86   
4  d6ea1074-f1f5-11e9-9346-000d3a38a36f                83.14     95.43   
   frequency  monetary_cltv_avg  exp_sales_3_month  exp_sales_6_month  \
0       5.00             187.87               0.97               1.95   
1      21.00              95.88               0.98               1.97   
2       5.00             117.06               0.67               1.34   
3       2.00              60.98               0.70               1.40   
4       2.00             104.99               0.40               0.79   
   exp_average_value   cltv cltv_segment  
0             193.63 395.73            A  
1              96.67 199.43            B  
2             120.97 170.22            B  
3              67.32  98.95            D  
4             114.33  95.01            D

As a result, we have segmented our customers into four groups. Now, we can create distinct and personalized communication campaigns tailored to each of the four segments.

Big thanks to Vahit Keskin and Miuul

Contact me on Linkedin :) yaseminderyadilli