3. Baseline feature transformation

The simulated dataset generated in the previous section is simple. It only contains the essential features that characterize a payment card transaction: a unique identifier for the transaction, the date and time of the transaction, the transaction amount, a unique identifier for the customer, a unique identifier for the terminal, and a binary variable that labels the transaction as legitimate or fraudulent (0 for legitimate, 1 for fraudulent). Fig. 1 provides the first three rows of the simulated dataset:


Fig. 1. The first three transactions in the simulated dataset used in this chapter.

What each row essentially summarizes is that, at 00:00:31 on the 1st of April 2018, the customer with ID 596 made a payment of 57.16 at the terminal with ID 3156, and that the transaction was not fraudulent. Then, at 00:02:10 on the 1st of April 2018, the customer with ID 4961 made a payment of 81.51 at the terminal with ID 3412, and that the transaction was not fraudulent. And so on. The simulated dataset is a long list of such transactions (about 1.75 million in total). The variable TRANSACTION_ID is a unique identifier for each transaction.

While conceptually simple for a human, such a set of features is not directly suitable for a machine learning predictive model. Machine learning algorithms typically require numerical and ordered features. Numerical means that the variable must be an integer or a real number. Ordered means that the ordering of the values of the variable is meaningful.

In this dataset, the only numerical and ordered features are the transaction amount and the fraud label. The date is a pandas timestamp, and therefore not numerical. The identifiers for the transactions, customers, and terminals are numerical but not ordered: it would not make sense to assume, for example, that the terminal with ID 3548 is 'larger' or 'greater' than the terminal with ID 1983. Rather, these identifiers represent distinct 'entities', which are referred to as categorical features.
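As a quick illustration of this distinction, the column types can be inspected with the pandas dtypes attribute. The short sketch below is not part of the original notebook and assumes the transactions_df DataFrame loaded in Section 3.1:

# Sketch: inspect which columns are numerical and which are only identifiers
# (assumes transactions_df has been loaded as in Section 3.1 below)
print(transactions_df.dtypes)
# TX_DATETIME is a datetime64[ns] timestamp, TX_AMOUNT is a float, and
# CUSTOMER_ID/TERMINAL_ID are integers that only encode categories
print(transactions_df[['CUSTOMER_ID', 'TERMINAL_ID']].nunique())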

There is unfortunately no standard procedure to deal with non-numerical or categorical features. The topic is known in the machine learning literature as feature engineering or feature transformation. In essence, the goal of feature engineering is to design new features that are assumed to be relevant for a predictive problem. The design of these features is usually problem-dependent, and involves domain knowledge.

In this section, we will implement three types of feature transformation that are known to be relevant for payment card fraud detection.


The first type of transformation involves the date/time variable, and consists in creating binary features that characterize potentially relevant periods. We will create two such features. The first one will characterize whether a transaction occurs during a weekday or during the weekend. The second will characterize whether a transaction occurs during the day or the night. These features can be useful since it has been observed in real-world datasets that fraudulent patterns differ between weekdays and weekends, and between the day and night.

The second type of transformation involves the customer ID and consists in creating features that characterize the customer's spending behavior. We will follow the RFM (Recency, Frequency, Monetary value) framework proposed in [VVBC+15], and keep track of the average spending amount and of the number of transactions for each customer, for three window sizes. This will lead to the creation of six new features.

The third type of transformation involves the terminal ID and consists in creating new features that characterize the 'risk' associated with the terminal. The risk will be defined as the average number of frauds observed on the terminal over a time window, again for three window sizes. This will lead to the creation of three new features.

The table below summarizes the types of transformation that will be performed and the new features that will be created.

| Original feature name | Original feature type | Transformation | Number of new features | New feature(s) type |
|---|---|---|---|---|
| TX_DATETIME | pandas timestamp | 0 if the transaction occurs during a weekday, 1 if it occurs during the weekend. The new feature is called TX_DURING_WEEKEND. | 1 | Integer (0/1) |
| TX_DATETIME | pandas timestamp | 0 if the transaction occurs between 6am and midnight, 1 if it occurs between midnight and 6am. The new feature is called TX_DURING_NIGHT. | 1 | Integer (0/1) |
| CUSTOMER_ID | Categorical variable | Number of transactions by the customer in the last n day(s), for n in {1,7,30}. The new features are called CUSTOMER_ID_NB_TX_nDAY_WINDOW. | 3 | Integer |
| CUSTOMER_ID | Categorical variable | Average spending amount of the customer in the last n day(s), for n in {1,7,30}. The new features are called CUSTOMER_ID_AVG_AMOUNT_nDAY_WINDOW. | 3 | Real |
| TERMINAL_ID | Categorical variable | Number of transactions on the terminal in a window of n day(s), shifted back by a delay of d=7 days, for n in {1,7,30}. The parameter d is called the delay and will be discussed later in this notebook. The new features are called TERMINAL_ID_NB_TX_nDAY_WINDOW. | 3 | Integer |
| TERMINAL_ID | Categorical variable | Average number of frauds on the terminal in a window of n day(s), shifted back by a delay of d=7 days, for n in {1,7,30}. The parameter d is called the delay and will be discussed later in this notebook. The new features are called TERMINAL_ID_RISK_nDAY_WINDOW. | 3 | Real |

The following sections provide the implementation of each of these three transformations. After the transformations, a set of 14 new features will have been created. Note that some of the features are the result of aggregation functions computed over the values of other features under given conditions (for example, same customer, given time window). These features are often referred to as aggregated features.

# Initialization: Load shared functions and simulated data 

# Load shared functions
!curl -O https://raw.githubusercontent.com/Fraud-Detection-Handbook/fraud-detection-handbook/main/Chapter_References/shared_functions.py
%run shared_functions.py

# Get simulated data from Github repository
if not os.path.exists("simulated-data-raw"):
    !git clone https://github.com/Fraud-Detection-Handbook/simulated-data-raw
        
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31567  100 31567    0     0   135k      0 --:--:-- --:--:-- --:--:--  135k
Cloning into 'simulated-data-raw'...
remote: Enumerating objects: 189, done.
remote: Counting objects: 100% (189/189), done.
remote: Compressing objects: 100% (187/187), done.
remote: Total 189 (delta 0), reused 186 (delta 0), pack-reused 0
Receiving objects: 100% (189/189), 28.04 MiB | 3.13 MiB/s, done.

3.1. Loading of dataset

Let us first load the transaction data simulated in the previous notebook. We will load the transaction files from April to September. Files can be loaded using the read_from_files function in the shared functions notebook. The function was put in that notebook since it will be used frequently throughout this book.

The function takes as input the folder where the data files are located, and the dates that define the period to load (between BEGIN_DATE and END_DATE). It returns a DataFrame of transactions, sorted in chronological order.
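For reference, a minimal sketch of what such a loading function might look like is given below. It assumes one pickle file per day named YYYY-MM-DD.pkl, which is the layout used throughout this book (see Section 3.5); the actual read_from_files implementation in shared_functions.py may differ in its details.

import os
import pandas as pd

def read_from_files_sketch(dir_input, begin_date, end_date):
    # Keep the daily pickle files whose name (YYYY-MM-DD.pkl) falls within the period
    files = [f for f in sorted(os.listdir(dir_input))
             if f.endswith('.pkl') and begin_date <= f[:-4] <= end_date]
    frames = [pd.read_pickle(os.path.join(dir_input, f)) for f in files]
    # Concatenate the daily DataFrames and sort transactions chronologically
    return pd.concat(frames).sort_values('TX_DATETIME').reset_index(drop=True)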

DIR_INPUT='./simulated-data-raw/data/' 

BEGIN_DATE = "2018-04-01"
END_DATE = "2018-09-30"

print("Load  files")
%time transactions_df=read_from_files(DIR_INPUT, BEGIN_DATE, END_DATE)
print("{0} transactions loaded, containing {1} fraudulent transactions".format(len(transactions_df),transactions_df.TX_FRAUD.sum()))
Load  files
CPU times: user 3.1 s, sys: 696 ms, total: 3.79 s
Wall time: 4.13 s
1754155 transactions loaded, containing 14681 fraudulent transactions
transactions_df.head()
TRANSACTION_ID TX_DATETIME CUSTOMER_ID TERMINAL_ID TX_AMOUNT TX_TIME_SECONDS TX_TIME_DAYS TX_FRAUD TX_FRAUD_SCENARIO
0 0 2018-04-01 00:00:31 596 3156 57.16 31 0 0 0
1 1 2018-04-01 00:02:10 4961 3412 81.51 130 0 0 0
2 2 2018-04-01 00:07:56 2 1365 146.00 476 0 0 0
3 3 2018-04-01 00:09:29 4128 8737 64.49 569 0 0 0
4 4 2018-04-01 00:10:34 927 9906 50.99 634 0 0 0

3.2. Date and time transformations

We will create two new binary features from the transaction dates and times:

  • The first will characterize whether a transaction occurs during a weekday (value 0) or a weekend (1), and will be called TX_DURING_WEEKEND

  • The second will characterize whether a transaction occurs during the day (0) or during the night (1). The night is defined as the hours between midnight and 6am. It will be called TX_DURING_NIGHT.

For the TX_DURING_WEEKEND feature, we define a function is_weekend that takes as input a pandas timestamp, and returns 1 if the date falls on a weekend, or 0 otherwise. The timestamp object conveniently provides the weekday method to help compute this value.

def is_weekend(tx_datetime):
    
    # Transform date into weekday (0 is Monday, 6 is Sunday)
    weekday = tx_datetime.weekday()
    # Binary value: 0 if weekday, 1 if weekend
    is_weekend = weekday>=5
    
    return int(is_weekend)

It is then straightforward to compute this feature for all transactions using the pandas apply function.

%time transactions_df['TX_DURING_WEEKEND']=transactions_df.TX_DATETIME.apply(is_weekend)
CPU times: user 7.54 s, sys: 247 ms, total: 7.79 s
Wall time: 7.94 s

We follow the same logic to implement the TX_DURING_NIGHT feature. First, we define a function is_night that takes as input a pandas timestamp, and returns 1 if the time is during the night, or 0 otherwise. The timestamp object conveniently provides the hour property to help compute this value.

def is_night(tx_datetime):
    
    # Get the hour of the transaction
    tx_hour = tx_datetime.hour
    # Binary value: 1 if hour is less than or equal to 6, and 0 otherwise
    is_night = tx_hour<=6
    
    return int(is_night)
%time transactions_df['TX_DURING_NIGHT']=transactions_df.TX_DATETIME.apply(is_night)
CPU times: user 7.09 s, sys: 221 ms, total: 7.31 s
Wall time: 7.47 s

Let us check that these features were correctly computed.

transactions_df[transactions_df.TX_TIME_DAYS>=30]
TRANSACTION_ID TX_DATETIME CUSTOMER_ID TERMINAL_ID TX_AMOUNT TX_TIME_SECONDS TX_TIME_DAYS TX_FRAUD TX_FRAUD_SCENARIO TX_DURING_WEEKEND TX_DURING_NIGHT
288062 288062 2018-05-01 00:01:21 3546 2944 18.71 2592081 30 0 0 0 1
288063 288063 2018-05-01 00:01:48 206 3521 18.60 2592108 30 0 0 0 1
288064 288064 2018-05-01 00:02:22 2610 4470 66.67 2592142 30 0 0 0 1
288065 288065 2018-05-01 00:03:15 4578 1520 79.41 2592195 30 0 0 0 1
288066 288066 2018-05-01 00:03:51 1246 7809 52.08 2592231 30 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ...
1754150 1754150 2018-09-30 23:56:36 161 655 54.24 15810996 182 0 0 1 0
1754151 1754151 2018-09-30 23:57:38 4342 6181 1.23 15811058 182 0 0 1 0
1754152 1754152 2018-09-30 23:58:21 618 1502 6.62 15811101 182 0 0 1 0
1754153 1754153 2018-09-30 23:59:52 4056 3067 55.40 15811192 182 0 0 1 0
1754154 1754154 2018-09-30 23:59:57 3542 9849 23.59 15811197 182 0 0 1 0

1466093 rows × 11 columns

The 2018-05-01 was a Tuesday, and the 2018-09-30 a Sunday. These dates are correctly flagged as a weekday and a weekend day, respectively. The day/night feature is also correctly set: the first transactions shown, which happen shortly after midnight, are flagged as night, while the last transactions, which happen shortly before midnight, are flagged as day.
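As an extra sanity check, not part of the original notebook, the two flags can also be verified against the timestamps for the whole dataset:

# Sketch of a consistency check: recompute the flags from the timestamps
# and compare them with the stored features
assert (transactions_df.TX_DURING_WEEKEND ==
        (transactions_df.TX_DATETIME.dt.weekday >= 5).astype(int)).all()
assert (transactions_df.TX_DURING_NIGHT ==
        (transactions_df.TX_DATETIME.dt.hour <= 6).astype(int)).all()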

3.3. Customer ID transformations

Let us now proceed with customer ID transformations. We will take inspiration from the RFM (Recency, Frequency, Monetary value) framework proposed in [VVBC+15], and compute two of these features over three time windows. The first feature will be the number of transactions that occur within a time window (Frequency). The second will be the average amount spent in these transactions (Monetary value). The time windows will be set to one, seven, and thirty days. This will generate six new features. Note that these time windows could later be optimized along with the models using a model selection procedure (Chapter 5).

Let us implement these transformations by writing a get_customer_spending_behaviour_features function. The function takes as inputs the set of transactions of a customer and a set of window sizes. It returns a DataFrame with the six new features. Our implementation relies on the pandas rolling function, which makes it easy to compute aggregates over a time window.

def get_customer_spending_behaviour_features(customer_transactions, windows_size_in_days=[1,7,30]):
    
    # Let us first order transactions chronologically
    customer_transactions=customer_transactions.sort_values('TX_DATETIME')
    
    # The transaction date and time is set as the index, which will allow the use of the rolling function 
    customer_transactions.index=customer_transactions.TX_DATETIME
    
    # For each window size
    for window_size in windows_size_in_days:
        
        # Compute the sum of the transaction amounts and the number of transactions for the given window size
        SUM_AMOUNT_TX_WINDOW=customer_transactions['TX_AMOUNT'].rolling(str(window_size)+'d').sum()
        NB_TX_WINDOW=customer_transactions['TX_AMOUNT'].rolling(str(window_size)+'d').count()
    
        # Compute the average transaction amount for the given window size
        # NB_TX_WINDOW is always >0 since current transaction is always included
        AVG_AMOUNT_TX_WINDOW=SUM_AMOUNT_TX_WINDOW/NB_TX_WINDOW
    
        # Save feature values
        customer_transactions['CUSTOMER_ID_NB_TX_'+str(window_size)+'DAY_WINDOW']=list(NB_TX_WINDOW)
        customer_transactions['CUSTOMER_ID_AVG_AMOUNT_'+str(window_size)+'DAY_WINDOW']=list(AVG_AMOUNT_TX_WINDOW)
    
    # Reindex according to transaction IDs
    customer_transactions.index=customer_transactions.TRANSACTION_ID
        
    # And return the dataframe with the new features
    return customer_transactions

Let us compute these aggregates for the first customer.

spending_behaviour_customer_0=get_customer_spending_behaviour_features(transactions_df[transactions_df.CUSTOMER_ID==0])
spending_behaviour_customer_0
TRANSACTION_ID TX_DATETIME CUSTOMER_ID TERMINAL_ID TX_AMOUNT TX_TIME_SECONDS TX_TIME_DAYS TX_FRAUD TX_FRAUD_SCENARIO TX_DURING_WEEKEND TX_DURING_NIGHT CUSTOMER_ID_NB_TX_1DAY_WINDOW CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW CUSTOMER_ID_NB_TX_7DAY_WINDOW CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW CUSTOMER_ID_NB_TX_30DAY_WINDOW CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW
TRANSACTION_ID
1758 1758 2018-04-01 07:19:05 0 6076 123.59 26345 0 0 0 1 0 1.0 123.590000 1.0 123.590000 1.0 123.590000
8275 8275 2018-04-01 18:00:16 0 858 77.34 64816 0 0 0 1 0 2.0 100.465000 2.0 100.465000 2.0 100.465000
8640 8640 2018-04-01 19:02:02 0 6698 46.51 68522 0 0 0 1 0 3.0 82.480000 3.0 82.480000 3.0 82.480000
12169 12169 2018-04-02 08:51:06 0 6569 54.72 118266 1 0 0 0 0 3.0 59.523333 4.0 75.540000 4.0 75.540000
15764 15764 2018-04-02 14:05:38 0 7707 63.30 137138 1 0 0 0 0 4.0 60.467500 5.0 73.092000 5.0 73.092000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1750390 1750390 2018-09-30 13:38:41 0 3096 38.23 15773921 182 0 0 1 0 5.0 64.388000 28.0 57.306429 89.0 63.097640
1750758 1750758 2018-09-30 14:10:21 0 9441 43.60 15775821 182 0 0 1 0 6.0 60.923333 29.0 56.833793 89.0 62.433933
1751039 1751039 2018-09-30 14:34:30 0 1138 69.69 15777270 182 0 0 1 0 7.0 62.175714 29.0 57.872414 90.0 62.514556
1751272 1751272 2018-09-30 14:54:59 0 9441 91.26 15778499 182 0 0 1 0 8.0 65.811250 30.0 58.985333 90.0 61.882333
1751455 1751455 2018-09-30 15:11:37 0 2746 27.90 15779497 182 0 0 1 0 9.0 61.598889 31.0 57.982581 91.0 61.508901

384 rows × 17 columns

We can check that the new features are consistent with the customer profile (see the previous notebook). For customer 0, the mean amount was mean_amount=62.26, and the transaction frequency was mean_nb_tx_per_day=2.18. These values are indeed closely matched by the features CUSTOMER_ID_NB_TX_30DAY_WINDOW and CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW, especially after 30 days.
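As a small addition, not in the original notebook, the most recent 30-day aggregates of customer 0 can be printed for a quick visual comparison with the profile values quoted above:

# Print the most recent 30-day aggregates of customer 0 for comparison with
# the customer profile used by the simulator (mean_amount, mean_nb_tx_per_day)
last_tx = spending_behaviour_customer_0.iloc[-1]
print("Average amount over the last 30 days:",
      last_tx['CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW'])
print("Average number of transactions per day over the last 30 days:",
      last_tx['CUSTOMER_ID_NB_TX_30DAY_WINDOW'] / 30)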

Let us now generate these features for all customers. This is straightforward using the pandas groupby and apply methods.

%time transactions_df=transactions_df.groupby('CUSTOMER_ID').apply(lambda x: get_customer_spending_behaviour_features(x, windows_size_in_days=[1,7,30]))
transactions_df=transactions_df.sort_values('TX_DATETIME').reset_index(drop=True)
CPU times: user 1min 2s, sys: 1.21 s, total: 1min 3s
Wall time: 1min 7s
transactions_df
TRANSACTION_ID TX_DATETIME CUSTOMER_ID TERMINAL_ID TX_AMOUNT TX_TIME_SECONDS TX_TIME_DAYS TX_FRAUD TX_FRAUD_SCENARIO TX_DURING_WEEKEND TX_DURING_NIGHT CUSTOMER_ID_NB_TX_1DAY_WINDOW CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW CUSTOMER_ID_NB_TX_7DAY_WINDOW CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW CUSTOMER_ID_NB_TX_30DAY_WINDOW CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW
0 0 2018-04-01 00:00:31 596 3156 57.16 31 0 0 0 1 1 1.0 57.160000 1.0 57.160000 1.0 57.160000
1 1 2018-04-01 00:02:10 4961 3412 81.51 130 0 0 0 1 1 1.0 81.510000 1.0 81.510000 1.0 81.510000
2 2 2018-04-01 00:07:56 2 1365 146.00 476 0 0 0 1 1 1.0 146.000000 1.0 146.000000 1.0 146.000000
3 3 2018-04-01 00:09:29 4128 8737 64.49 569 0 0 0 1 1 1.0 64.490000 1.0 64.490000 1.0 64.490000
4 4 2018-04-01 00:10:34 927 9906 50.99 634 0 0 0 1 1 1.0 50.990000 1.0 50.990000 1.0 50.990000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1754150 1754150 2018-09-30 23:56:36 161 655 54.24 15810996 182 0 0 1 0 2.0 75.280000 12.0 67.047500 72.0 69.521111
1754151 1754151 2018-09-30 23:57:38 4342 6181 1.23 15811058 182 0 0 1 0 1.0 1.230000 21.0 22.173810 93.0 24.780753
1754152 1754152 2018-09-30 23:58:21 618 1502 6.62 15811101 182 0 0 1 0 5.0 7.368000 21.0 7.400476 65.0 7.864462
1754153 1754153 2018-09-30 23:59:52 4056 3067 55.40 15811192 182 0 0 1 0 3.0 100.696667 16.0 107.052500 51.0 102.919608
1754154 1754154 2018-09-30 23:59:57 3542 9849 23.59 15811197 182 0 0 1 0 5.0 41.304000 24.0 35.308333 119.0 37.251513

1754155 rows × 17 columns

3.4. Terminal ID transformations

Finally, let us proceed with the terminal ID transformations. The main goal will be to extract a risk score that assesses the exposure of a given terminal ID to fraudulent transactions. The risk score will be defined as the average number of fraudulent transactions that occurred on a terminal ID over a time window. As for the customer ID transformations, we will use three window sizes: 1, 7, and 30 days.

Contrary to customer ID transformations, the time windows will not directly precede a given transaction. Instead, they will be shifted back by a delay period. The delay period accounts for the fact that, in practice, the fraudulent transactions are only discovered after a fraud investigation or a customer complaint. Hence, the fraudulent labels, which are needed to compute the risk score, are only available after this delay period. To a first approximation, this delay period will be set to one week. The motivations for the delay period will be further argued in Chapter 5, Validation strategies.

Let us perform the computation of the risk scores by defining a get_count_risk_rolling_window function. The function takes as inputs the DataFrame of transactions for a given terminal ID, the delay period, and a list of window sizes. In a first stage, the number of transactions and the number of fraudulent transactions are computed for the delay period (NB_TX_DELAY and NB_FRAUD_DELAY). In a second stage, the number of transactions and the number of fraudulent transactions are computed for each window size plus the delay period (NB_TX_DELAY_WINDOW and NB_FRAUD_DELAY_WINDOW). The number of transactions and the number of fraudulent transactions that occurred in a given window, shifted back by the delay period, are then obtained by taking the difference between the quantities computed over the window size plus delay period and those computed over the delay period alone:

NB_FRAUD_WINDOW=NB_FRAUD_DELAY_WINDOW-NB_FRAUD_DELAY
NB_TX_WINDOW=NB_TX_DELAY_WINDOW-NB_TX_DELAY

The risk score is finally obtained by computing the proportion of fraudulent transactions for each window size (or 0 if no transaction occurred for the given window):

RISK_WINDOW=NB_FRAUD_WINDOW/NB_TX_WINDOW

In addition to the risk score, the function also returns the number of transactions for each window size. This results in the addition of six new features: the number of transactions and the risk, for each of the three window sizes.

def get_count_risk_rolling_window(terminal_transactions, delay_period=7, windows_size_in_days=[1,7,30], feature="TERMINAL_ID"):
    
    terminal_transactions=terminal_transactions.sort_values('TX_DATETIME')
    
    terminal_transactions.index=terminal_transactions.TX_DATETIME
    
    NB_FRAUD_DELAY=terminal_transactions['TX_FRAUD'].rolling(str(delay_period)+'d').sum()
    NB_TX_DELAY=terminal_transactions['TX_FRAUD'].rolling(str(delay_period)+'d').count()
    
    for window_size in windows_size_in_days:
    
        NB_FRAUD_DELAY_WINDOW=terminal_transactions['TX_FRAUD'].rolling(str(delay_period+window_size)+'d').sum()
        NB_TX_DELAY_WINDOW=terminal_transactions['TX_FRAUD'].rolling(str(delay_period+window_size)+'d').count()
    
        NB_FRAUD_WINDOW=NB_FRAUD_DELAY_WINDOW-NB_FRAUD_DELAY
        NB_TX_WINDOW=NB_TX_DELAY_WINDOW-NB_TX_DELAY
    
        RISK_WINDOW=NB_FRAUD_WINDOW/NB_TX_WINDOW
        
        terminal_transactions[feature+'_NB_TX_'+str(window_size)+'DAY_WINDOW']=list(NB_TX_WINDOW)
        terminal_transactions[feature+'_RISK_'+str(window_size)+'DAY_WINDOW']=list(RISK_WINDOW)
        
    terminal_transactions.index=terminal_transactions.TRANSACTION_ID
    
    # Replace NA values with 0 (all undefined risk scores where NB_TX_WINDOW is 0) 
    terminal_transactions.fillna(0,inplace=True)
    
    return terminal_transactions
Let us first have a look at the fraudulent transactions in the dataset:

transactions_df[transactions_df.TX_FRAUD==1]
TRANSACTION_ID TX_DATETIME CUSTOMER_ID TERMINAL_ID TX_AMOUNT TX_TIME_SECONDS TX_TIME_DAYS TX_FRAUD TX_FRAUD_SCENARIO TX_DURING_WEEKEND TX_DURING_NIGHT CUSTOMER_ID_NB_TX_1DAY_WINDOW CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW CUSTOMER_ID_NB_TX_7DAY_WINDOW CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW CUSTOMER_ID_NB_TX_30DAY_WINDOW CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW
3527 3527 2018-04-01 10:17:43 3774 3059 225.41 37063 0 1 1 1 0 3.0 158.073333 3.0 158.073333 3.0 158.073333
5789 5790 2018-04-01 13:31:48 4944 6050 222.26 48708 0 1 1 1 0 2.0 127.605000 2.0 127.605000 2.0 127.605000
6549 6549 2018-04-01 14:42:02 4625 9102 226.40 52922 0 1 1 1 0 4.0 167.165000 4.0 167.165000 4.0 167.165000
9583 9583 2018-04-02 01:01:05 3814 6893 59.15 90065 1 1 3 0 1 6.0 29.138333 6.0 29.138333 6.0 29.138333
10356 10355 2018-04-02 05:03:35 2513 1143 222.04 104615 1 1 1 0 1 5.0 123.740000 5.0 123.740000 5.0 123.740000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1753524 1753524 2018-09-30 19:51:48 1671 3192 128.60 15796308 182 1 3 1 0 6.0 138.358333 25.0 106.957200 82.0 75.621341
1753600 1753600 2018-09-30 20:09:00 4166 632 17.39 15797340 182 1 2 1 0 3.0 19.766667 19.0 15.984737 86.0 15.846512
1753673 1753673 2018-09-30 20:30:52 4097 1558 24.04 15798652 182 1 2 1 0 3.0 23.050000 16.0 40.440625 63.0 41.877460
1754014 1754014 2018-09-30 22:27:04 100 8604 73.85 15805624 182 1 3 1 0 2.0 48.010000 26.0 30.384231 103.0 23.627184
1754017 1754018 2018-09-30 22:28:01 4677 8935 45.85 15805681 182 1 2 1 0 5.0 39.078000 19.0 35.133684 85.0 37.656000

14681 rows × 17 columns

Let us compute these six features for the first terminal ID containing at least one fraud:

# Get the first terminal ID that contains frauds
transactions_df[transactions_df.TX_FRAUD==1].TERMINAL_ID.values[0]
3059
get_count_risk_rolling_window(transactions_df[transactions_df.TERMINAL_ID==3059], delay_period=7, windows_size_in_days=[1,7,30])
TRANSACTION_ID TX_DATETIME CUSTOMER_ID TERMINAL_ID TX_AMOUNT TX_TIME_SECONDS TX_TIME_DAYS TX_FRAUD TX_FRAUD_SCENARIO TX_DURING_WEEKEND ... CUSTOMER_ID_NB_TX_7DAY_WINDOW CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW CUSTOMER_ID_NB_TX_30DAY_WINDOW CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW TERMINAL_ID_NB_TX_1DAY_WINDOW TERMINAL_ID_RISK_1DAY_WINDOW TERMINAL_ID_NB_TX_7DAY_WINDOW TERMINAL_ID_RISK_7DAY_WINDOW TERMINAL_ID_NB_TX_30DAY_WINDOW TERMINAL_ID_RISK_30DAY_WINDOW
TRANSACTION_ID
3527 3527 2018-04-01 10:17:43 3774 3059 225.41 37063 0 1 1 1 ... 3.0 158.073333 3.0 158.073333 0.0 0.0 0.0 0.0 0.0 0.0
4732 4732 2018-04-01 11:59:14 55 3059 36.28 43154 0 0 0 1 ... 2.0 35.670000 2.0 35.670000 0.0 0.0 0.0 0.0 0.0 0.0
16216 16216 2018-04-02 14:47:34 4879 3059 105.00 139654 1 0 0 0 ... 10.0 76.010000 10.0 76.010000 0.0 0.0 0.0 0.0 0.0 0.0
18249 18249 2018-04-02 19:08:10 2263 3059 90.89 155290 1 0 0 0 ... 7.0 50.458571 7.0 50.458571 0.0 0.0 0.0 0.0 0.0 0.0
26512 26512 2018-04-03 15:44:49 4879 3059 58.51 229489 2 0 0 0 ... 14.0 71.070000 14.0 71.070000 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1697944 1697944 2018-09-25 05:32:56 402 3059 57.30 15312776 177 0 0 0 ... 14.0 65.167857 46.0 68.163261 1.0 0.0 9.0 0.0 36.0 0.0
1701971 1701971 2018-09-25 12:30:54 1035 3059 7.56 15337854 177 0 0 0 ... 23.0 7.052174 107.0 6.763738 2.0 0.0 10.0 0.0 36.0 0.0
1704512 1704512 2018-09-25 16:37:41 1519 3059 35.79 15352661 177 0 0 0 ... 7.0 41.404286 30.0 46.780000 1.0 0.0 9.0 0.0 36.0 0.0
1731937 1731937 2018-09-28 14:30:31 1534 3059 81.39 15604231 180 0 0 0 ... 18.0 69.477778 89.0 63.906629 1.0 0.0 8.0 0.0 36.0 0.0
1740901 1740901 2018-09-29 13:35:17 118 3059 90.96 15687317 181 0 0 1 ... 35.0 104.233714 98.0 91.407143 0.0 0.0 7.0 0.0 36.0 0.0

193 rows × 23 columns

We can check that the first fraud on this terminal occurred on the 2018-04-01, and that, due to the delay period, the risk scores only start reflecting it one week later.
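One way to check this more directly, as a small addition to the original notebook, is to list the first transactions of this terminal for which the 7-day risk score becomes positive:

# Sketch: find when the risk scores of terminal 3059 first become positive
terminal_3059 = get_count_risk_rolling_window(
    transactions_df[transactions_df.TERMINAL_ID==3059],
    delay_period=7, windows_size_in_days=[1,7,30])
print(terminal_3059[terminal_3059.TERMINAL_ID_RISK_7DAY_WINDOW>0]
      [['TX_DATETIME','TERMINAL_ID_RISK_7DAY_WINDOW']].head())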

Let us finally generate these features for all terminals. This is straightforward using the pandas groupby and apply methods.

%time transactions_df=transactions_df.groupby('TERMINAL_ID').apply(lambda x: get_count_risk_rolling_window(x, delay_period=7, windows_size_in_days=[1,7,30], feature="TERMINAL_ID"))
transactions_df=transactions_df.sort_values('TX_DATETIME').reset_index(drop=True)
CPU times: user 2min 27s, sys: 2.23 s, total: 2min 29s
Wall time: 2min 41s
transactions_df
TRANSACTION_ID TX_DATETIME CUSTOMER_ID TERMINAL_ID TX_AMOUNT TX_TIME_SECONDS TX_TIME_DAYS TX_FRAUD TX_FRAUD_SCENARIO TX_DURING_WEEKEND ... CUSTOMER_ID_NB_TX_7DAY_WINDOW CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW CUSTOMER_ID_NB_TX_30DAY_WINDOW CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW TERMINAL_ID_NB_TX_1DAY_WINDOW TERMINAL_ID_RISK_1DAY_WINDOW TERMINAL_ID_NB_TX_7DAY_WINDOW TERMINAL_ID_RISK_7DAY_WINDOW TERMINAL_ID_NB_TX_30DAY_WINDOW TERMINAL_ID_RISK_30DAY_WINDOW
0 0 2018-04-01 00:00:31 596 3156 57.16 31 0 0 0 1 ... 1.0 57.160000 1.0 57.160000 0.0 0.0 0.0 0.0 0.0 0.00000
1 1 2018-04-01 00:02:10 4961 3412 81.51 130 0 0 0 1 ... 1.0 81.510000 1.0 81.510000 0.0 0.0 0.0 0.0 0.0 0.00000
2 2 2018-04-01 00:07:56 2 1365 146.00 476 0 0 0 1 ... 1.0 146.000000 1.0 146.000000 0.0 0.0 0.0 0.0 0.0 0.00000
3 3 2018-04-01 00:09:29 4128 8737 64.49 569 0 0 0 1 ... 1.0 64.490000 1.0 64.490000 0.0 0.0 0.0 0.0 0.0 0.00000
4 4 2018-04-01 00:10:34 927 9906 50.99 634 0 0 0 1 ... 1.0 50.990000 1.0 50.990000 0.0 0.0 0.0 0.0 0.0 0.00000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1754150 1754150 2018-09-30 23:56:36 161 655 54.24 15810996 182 0 0 1 ... 12.0 67.047500 72.0 69.521111 1.0 0.0 4.0 0.0 28.0 0.00000
1754151 1754151 2018-09-30 23:57:38 4342 6181 1.23 15811058 182 0 0 1 ... 21.0 22.173810 93.0 24.780753 1.0 0.0 9.0 0.0 39.0 0.00000
1754152 1754152 2018-09-30 23:58:21 618 1502 6.62 15811101 182 0 0 1 ... 21.0 7.400476 65.0 7.864462 1.0 0.0 5.0 0.0 33.0 0.00000
1754153 1754153 2018-09-30 23:59:52 4056 3067 55.40 15811192 182 0 0 1 ... 16.0 107.052500 51.0 102.919608 1.0 0.0 6.0 0.0 28.0 0.00000
1754154 1754154 2018-09-30 23:59:57 3542 9849 23.59 15811197 182 0 0 1 ... 24.0 35.308333 119.0 37.251513 1.0 0.0 12.0 0.0 41.0 0.02439

1754155 rows × 23 columns
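At this point, the DataFrame contains the 9 original columns plus the 14 newly engineered features. A quick check, not part of the original notebook:

# Quick check: 9 original columns + 14 engineered features = 23 columns
print(len(transactions_df.columns))
print([col for col in transactions_df.columns
       if 'WINDOW' in col or 'TX_DURING' in col])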

3.5. Saving of dataset

Let us finally save the dataset, split into daily batches, using the pickle format.

DIR_OUTPUT = "./simulated-data-transformed/"

if not os.path.exists(DIR_OUTPUT):
    os.makedirs(DIR_OUTPUT)

start_date = datetime.datetime.strptime("2018-04-01", "%Y-%m-%d")

for day in range(transactions_df.TX_TIME_DAYS.max()+1):
    
    transactions_day = transactions_df[transactions_df.TX_TIME_DAYS==day].sort_values('TX_TIME_SECONDS')
    
    date = start_date + datetime.timedelta(days=day)
    filename_output = date.strftime("%Y-%m-%d")+'.pkl'
    
    # Protocol=4 required for Google Colab
    transactions_day.to_pickle(DIR_OUTPUT+filename_output, protocol=4)

The generated dataset is also available from Github at https://github.com/Fraud-Detection-Handbook/simulated-data-transformed.
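In later chapters, the transformed daily files can be reloaded in the same way as the raw ones. A minimal sketch, assuming the read_from_files function from the shared functions notebook and the directory created above:

# Sketch: reload the transformed transactions for a given period
DIR_INPUT_TRANSFORMED = './simulated-data-transformed/'
transactions_df = read_from_files(DIR_INPUT_TRANSFORMED, "2018-04-01", "2018-09-30")
print("{0} transactions loaded".format(len(transactions_df)))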