rfm_train_test_split
- pymc_marketing.clv.utils.rfm_train_test_split(transactions, customer_id_col, datetime_col, train_period_end, test_period_end=None, time_unit='D', time_scaler=1, datetime_format=None, monetary_value_col=None, include_first_transaction=False, sort_transactions=True)
Summarize transaction data and split it into training and test datasets for CLV modeling. This can also be used to evaluate the impact of a time-based intervention like a marketing campaign.
- This transforms a DataFrame of transaction data of the form:
customer_id, datetime [, monetary_value]
- to a DataFrame of the form:
customer_id, frequency, recency, T [, monetary_value], test_frequency [, test_monetary_value], test_T
Note that this function excludes new customers whose first transaction occurred during the test period.
Adapted from the lifetimes package (CamDavidsonPilon/lifetimes).
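The transformation described above can be illustrated with a simplified, pandas-only sketch. This is not the library's implementation: the toy data, the variable names, and the exact counting rules (distinct purchase dates, first purchase excluded from frequency) are assumptions chosen to mirror the documented output columns, and the real function may differ in its details.

```python
import pandas as pd

# Hypothetical toy transactions; column names are illustrative only.
transactions = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 2, 3],
        "datetime": pd.to_datetime(
            [
                "2024-01-01", "2024-01-10", "2024-02-05",  # customer 1
                "2024-01-03", "2024-02-10",                # customer 2
                "2024-02-15",  # customer 3 first buys in the test period
            ]
        ),
    }
)
train_end = pd.Timestamp("2024-01-31")
test_end = pd.Timestamp("2024-02-29")

# Split on the end of the training period; truncate after the test period.
train = transactions[transactions["datetime"] <= train_end]
test = transactions[
    (transactions["datetime"] > train_end) & (transactions["datetime"] <= test_end)
]

# Summarize the training period per customer (assumed RFM conventions).
grouped = train.groupby("customer_id")["datetime"]
summary = pd.DataFrame(
    {
        "frequency": grouped.nunique() - 1,  # repeat purchase dates only
        "recency": (grouped.max() - grouped.min()).dt.days,
        "T": (train_end - grouped.min()).dt.days,
    }
)

# Attach test-period activity; customers absent from training are dropped.
test_frequency = test.groupby("customer_id")["datetime"].nunique()
summary["test_frequency"] = test_frequency.reindex(summary.index, fill_value=0)
summary["test_T"] = (test_end - train_end).days

print(summary)
```

Customer 3 does not appear in the result because the first purchase falls after `train_end`, which matches the exclusion noted above.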
- Parameters:
transactions (DataFrame) – A Pandas DataFrame that contains the customer_id col and the datetime col.
customer_id_col (string) – Column in the transactions DataFrame that denotes the customer_id.
datetime_col (string) – Column in the transactions DataFrame that denotes the datetime the purchase was made.
train_period_end (Union[str, pd.Period, datetime]) – A string or datetime to denote the final time period for the training data. Events after this time period are used for the test data.
test_period_end (Union[str, pd.Period, datetime], optional) – A string or datetime to denote the final time period of the study. Events after this date are truncated. If not given, defaults to the max of ‘datetime_col’.
time_unit (string, optional) – Time granularity for study. Default: ‘D’ for days. Possible values listed here: https://numpy.org/devdocs/reference/arrays.datetime.html#datetime-units
time_scaler (int, optional) – Default: 1. Useful for scaling recency & T to a different time granularity. Example: With time_unit=‘D’ and time_scaler=1, we get recency=591 and T=632. With time_unit=‘h’ and time_scaler=24, we get recency=590.125 and T=631.375. This is useful if predictions in months or years are desired, and can also help with model convergence for study periods of many years.
datetime_format (string, optional) – A string that represents the timestamp format. Useful if Pandas can’t understand the provided format.
monetary_value_col (string, optional) – Column in the transactions DataFrame that denotes the monetary value of the transaction. Optional; only needed for spend estimation models like the Gamma-Gamma model.
include_first_transaction (bool, optional) – Default: False. For predictive CLV modeling, this should be False. Set to True if performing RFM segmentation.
sort_transactions (bool, optional) – Default: True. If raw data is already sorted in chronological order, set to False to improve computational efficiency.
- Returns:
customer_id, frequency, recency, T, test_frequency, test_T [, monetary_value, test_monetary_value]
- Return type:
DataFrame
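The interaction between time_unit and time_scaler described in the parameter list can be sketched with NumPy alone. The helper `scaled_gap` below is hypothetical, and it assumes that timestamps are truncated to time_unit before differencing and that time_scaler acts as a divisor; that reading is consistent with the example values in the time_scaler description, but the source does not state it explicitly.

```python
import numpy as np


def scaled_gap(start, end, time_unit="D", time_scaler=1):
    """Hypothetical helper: gap between two timestamps in scaled time units."""
    unit = f"datetime64[{time_unit}]"
    # Truncate both timestamps to the chosen granularity (assumption).
    start_u = np.datetime64(start).astype(unit)
    end_u = np.datetime64(end).astype(unit)
    # Count whole units between them, then rescale (assumption: division).
    return float((end_u - start_u) / np.timedelta64(1, time_unit)) / time_scaler


first_purchase = "2023-01-01T21:00"
last_purchase = "2023-01-11T09:00"

gap_days = scaled_gap(first_purchase, last_purchase, "D", 1)            # 10.0
gap_hours_as_days = scaled_gap(first_purchase, last_purchase, "h", 24)  # 9.5
gap_weeks = scaled_gap(first_purchase, last_purchase, "D", 7)           # 10 / 7

print(gap_days, gap_hours_as_days, gap_weeks)
```

Measuring in hours and dividing by 24 keeps the time of day, so the result is a fractional number of days, whereas day granularity counts calendar-day boundaries. Dividing a daily count by 7 expresses the same gap in weeks.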