SEGMENTACIÓN DE CLIENTES DE BANCA | DEPARTAMENTO DE MÁRKETING ¶

Este caso requiere desarrollar una segmentación de clientes para definir una estrategia de marketing. El conjunto de datos de muestra resume el comportamiento de uso de aproximadamente 9000 titulares de tarjetas de crédito activas durante los últimos 6 meses.El archivo consta de 18 variables de comportamiento.

A continuación se presenta el Diccionario de Datos para el conjunto de datos de la Tarjeta de Crédito:

Data Source: https://www.kaggle.com/arjunbhasin2013/ccdata

CUST_ID (Identificación del titular): Identificación del titular de la tarjeta de crédito (Categórico)

BALANCE (Saldo): Saldo disponible en su cuenta para realizar compras

BALANCE_FREQUENCY (Frecuencia de Saldo): Frecuencia con la que se actualiza el saldo, puntuación entre 0 y 1 (1 = actualizado frecuentemente, 0 = no actualizado frecuentemente)

PURCHASES (Compras): Monto de compras realizadas desde la cuenta

ONEOFF_PURCHASES (Compras únicas): Monto máximo de compra realizado de una sola vez

INSTALLMENTS_PURCHASES (Compras a plazos): Monto de compra realizado en cuotas

CASH_ADVANCE (Adelanto de efectivo): Adelanto de efectivo proporcionado por el usuario

PURCHASES_FREQUENCY (Frecuencia de Compras): Frecuencia con la que se realizan las compras, puntuación entre 0 y 1 (1 = comprado frecuentemente, 0 = no comprado frecuentemente)

ONEOFFPURCHASESFREQUENCY (Frecuencia de Compras Únicas): Frecuencia con la que se realizan compras de una sola vez (1 = comprado frecuentemente, 0 = no comprado frecuentemente)

PURCHASESINSTALLMENTSFREQUENCY (Frecuencia de Compras a Plazos): Frecuencia con la que se realizan compras en cuotas (1 = hecho frecuentemente, 0 = no hecho frecuentemente)

CASHADVANCEFREQUENCY (Frecuencia de Adelantos): Frecuencia con la que se paga el adelanto en efectivo

CASHADVANCETRX (Transacciones de Adelanto): Número de transacciones realizadas con "Adelanto en Efectivo"

PURCHASES_TRX (Transacciones de Compra): Número de transacciones de compra realizadas

CREDIT_LIMIT (Límite de Crédito): Límite de la tarjeta de crédito para el usuario

PAYMENTS (Pagos): Monto del pago realizado por el usuario

MINIMUM_PAYMENTS (Pagos mínimos): Monto mínimo de pagos realizado por el usuario

PRCFULLPAYMENT (Porcentaje de Pago Completo): Porcentaje del pago total pagado por el usuario

TENURE (Duración): Duración del servicio de tarjeta de crédito para el usuario

Data Source: https://www.kaggle.com/arjunbhasin2013/ccdata

IMPORTAR LAS LIBRERÍAS Y LOS DATASETS¶

In [ ]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
In [ ]:
creditcard_df = pd.read_csv("Marketing_data.csv")
In [ ]:
creditcard_df
Out[ ]:
CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
0 C10001 40.900749 0.818182 95.40 0.00 95.40 0.000000 0.166667 0.000000 0.083333 0.000000 0 2 1000.0 201.802084 139.509787 0.000000 12
1 C10002 3202.467416 0.909091 0.00 0.00 0.00 6442.945483 0.000000 0.000000 0.000000 0.250000 4 0 7000.0 4103.032597 1072.340217 0.222222 12
2 C10003 2495.148862 1.000000 773.17 773.17 0.00 0.000000 1.000000 1.000000 0.000000 0.000000 0 12 7500.0 622.066742 627.284787 0.000000 12
3 C10004 1666.670542 0.636364 1499.00 1499.00 0.00 205.788017 0.083333 0.083333 0.000000 0.083333 1 1 7500.0 0.000000 NaN 0.000000 12
4 C10005 817.714335 1.000000 16.00 16.00 0.00 0.000000 0.083333 0.083333 0.000000 0.000000 0 1 1200.0 678.334763 244.791237 0.000000 12
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8945 C19186 28.493517 1.000000 291.12 0.00 291.12 0.000000 1.000000 0.000000 0.833333 0.000000 0 6 1000.0 325.594462 48.886365 0.500000 6
8946 C19187 19.183215 1.000000 300.00 0.00 300.00 0.000000 1.000000 0.000000 0.833333 0.000000 0 6 1000.0 275.861322 NaN 0.000000 6
8947 C19188 23.398673 0.833333 144.40 0.00 144.40 0.000000 0.833333 0.000000 0.666667 0.000000 0 5 1000.0 81.270775 82.418369 0.250000 6
8948 C19189 13.457564 0.833333 0.00 0.00 0.00 36.558778 0.000000 0.000000 0.000000 0.166667 2 0 500.0 52.549959 55.755628 0.250000 6
8949 C19190 372.708075 0.666667 1093.25 1093.25 0.00 127.040008 0.666667 0.666667 0.000000 0.333333 2 23 1200.0 63.165404 88.288956 0.000000 6

8950 rows × 18 columns

In [ ]:
creditcard_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 18 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   CUST_ID                           8950 non-null   object 
 1   BALANCE                           8950 non-null   float64
 2   BALANCE_FREQUENCY                 8950 non-null   float64
 3   PURCHASES                         8950 non-null   float64
 4   ONEOFF_PURCHASES                  8950 non-null   float64
 5   INSTALLMENTS_PURCHASES            8950 non-null   float64
 6   CASH_ADVANCE                      8950 non-null   float64
 7   PURCHASES_FREQUENCY               8950 non-null   float64
 8   ONEOFF_PURCHASES_FREQUENCY        8950 non-null   float64
 9   PURCHASES_INSTALLMENTS_FREQUENCY  8950 non-null   float64
 10  CASH_ADVANCE_FREQUENCY            8950 non-null   float64
 11  CASH_ADVANCE_TRX                  8950 non-null   int64  
 12  PURCHASES_TRX                     8950 non-null   int64  
 13  CREDIT_LIMIT                      8949 non-null   float64
 14  PAYMENTS                          8950 non-null   float64
 15  MINIMUM_PAYMENTS                  8637 non-null   float64
 16  PRC_FULL_PAYMENT                  8950 non-null   float64
 17  TENURE                            8950 non-null   int64  
dtypes: float64(14), int64(3), object(1)
memory usage: 1.2+ MB
In [ ]:
creditcard_df.describe()

# El balance medio es $1564
# La frecuencia del balance se actualiza bastante a menudo, en promedio ~0.9
# El promedio de las compras es $1000
# El importe máximo de compra no recurrente es en promedio ~$600
# El promedio de la frecuencia de las compras está cerca de 0.5
# El promedio de ONEOFF_PURCHASES_FREQUENCY, PURCHASES_INSTALLMENTS_FREQUENCY, y CASH_ADVANCE_FREQUENCY es en general bajo
# El promedio del límite de crédito es ~ 4500
# El porcentaje de pago completo es 15%
# Los clientes llevan de promedio en el servicio 11 años
Out[ ]:
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
count 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8949.000000 8950.000000 8637.000000 8950.000000 8950.000000
mean 1564.474828 0.877271 1003.204834 592.437371 411.067645 978.871112 0.490351 0.202458 0.364437 0.135144 3.248827 14.709832 4494.449450 1733.143852 864.206542 0.153715 11.517318
std 2081.531879 0.236904 2136.634782 1659.887917 904.338115 2097.163877 0.401371 0.298336 0.397448 0.200121 6.824647 24.857649 3638.815725 2895.063757 2372.446607 0.292499 1.338331
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 50.000000 0.000000 0.019163 0.000000 6.000000
25% 128.281915 0.888889 39.635000 0.000000 0.000000 0.000000 0.083333 0.000000 0.000000 0.000000 0.000000 1.000000 1600.000000 383.276166 169.123707 0.000000 12.000000
50% 873.385231 1.000000 361.280000 38.000000 89.000000 0.000000 0.500000 0.083333 0.166667 0.000000 0.000000 7.000000 3000.000000 856.901546 312.343947 0.000000 12.000000
75% 2054.140036 1.000000 1110.130000 577.405000 468.637500 1113.821139 0.916667 0.300000 0.750000 0.222222 4.000000 17.000000 6500.000000 1901.134317 825.485459 0.142857 12.000000
max 19043.138560 1.000000 49039.570000 40761.250000 22500.000000 47137.211760 1.000000 1.000000 1.000000 1.500000 123.000000 358.000000 30000.000000 50721.483360 76406.207520 1.000000 12.000000
In [ ]:
# Vamos a investigar quien ha hecho una compra de $40761!
creditcard_df[creditcard_df["ONEOFF_PURCHASES"] == 40761.25]
Out[ ]:
CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
550 C10574 11547.52001 1.0 49039.57 40761.25 8278.32 558.166886 1.0 1.0 0.916667 0.083333 1 101 22500.0 46930.59824 2974.069421 0.25 12
In [ ]:
creditcard_df['CASH_ADVANCE'].max()
Out[ ]:
47137.21176
In [ ]:
# Vamos a ver quien pago por anticipado $47137!
# Este cliente hizo un total de 123 transacciones por adelantado!!
# Nunca paga sus compras completamente con la tarjeta

creditcard_df[creditcard_df['CASH_ADVANCE'] == 47137.211760000006]
Out[ ]:
CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE

VISUALIZACIÓN DEL DATASET¶

In [ ]:
# Comprobemos a ver si tenemos datos faltantes.
sns.heatmap(creditcard_df.isnull(), yticklabels=False, cbar = False, cmap = "Blues")
Out[ ]:
<Axes: >
No description has been provided for this image
In [ ]:
creditcard_df.isnull().sum()
Out[ ]:
CUST_ID                               0
BALANCE                               0
BALANCE_FREQUENCY                     0
PURCHASES                             0
ONEOFF_PURCHASES                      0
INSTALLMENTS_PURCHASES                0
CASH_ADVANCE                          0
PURCHASES_FREQUENCY                   0
ONEOFF_PURCHASES_FREQUENCY            0
PURCHASES_INSTALLMENTS_FREQUENCY      0
CASH_ADVANCE_FREQUENCY                0
CASH_ADVANCE_TRX                      0
PURCHASES_TRX                         0
CREDIT_LIMIT                          1
PAYMENTS                              0
MINIMUM_PAYMENTS                    313
PRC_FULL_PAYMENT                      0
TENURE                                0
dtype: int64
In [ ]:
# Vamos a rellenar los datos faltantes con el promedio del campo 'MINIMUM_PAYMENT'
creditcard_df.loc[(creditcard_df['MINIMUM_PAYMENTS'].isnull() == True), 'MINIMUM_PAYMENTS'] = creditcard_df['MINIMUM_PAYMENTS'].mean()
In [ ]:
# Vamos a rellenar los datos faltantes con el promedio del campo 'CREDIT_LIMIT'
creditcard_df.loc[(creditcard_df['CREDIT_LIMIT'].isnull() == True), 'CREDIT_LIMIT'] = creditcard_df['CREDIT_LIMIT'].mean()
In [ ]:
sns.heatmap(creditcard_df.isnull(), yticklabels=False, cbar = False, cmap = "Blues")
Out[ ]:
<Axes: >
No description has been provided for this image
In [ ]:
# Varifiquemos si tenemos entradas duplicadas en nuestros datos
creditcard_df.duplicated().sum()
Out[ ]:
0
In [ ]:
# Podemos deshacernos del campo Customer ID ya que no nos sirve para nada
creditcard_df.drop("CUST_ID", axis = 1, inplace=True)
In [ ]:
creditcard_df.head()
Out[ ]:
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
0 40.900749 0.818182 95.40 0.00 95.4 0.000000 0.166667 0.000000 0.083333 0.000000 0 2 1000.0 201.802084 139.509787 0.000000 12
1 3202.467416 0.909091 0.00 0.00 0.0 6442.945483 0.000000 0.000000 0.000000 0.250000 4 0 7000.0 4103.032597 1072.340217 0.222222 12
2 2495.148862 1.000000 773.17 773.17 0.0 0.000000 1.000000 1.000000 0.000000 0.000000 0 12 7500.0 622.066742 627.284787 0.000000 12
3 1666.670542 0.636364 1499.00 1499.00 0.0 205.788017 0.083333 0.083333 0.000000 0.083333 1 1 7500.0 0.000000 864.206542 0.000000 12
4 817.714335 1.000000 16.00 16.00 0.0 0.000000 0.083333 0.083333 0.000000 0.000000 0 1 1200.0 678.334763 244.791237 0.000000 12
In [ ]:
n = len(creditcard_df.columns)
n
Out[ ]:
17
In [ ]:
creditcard_df.columns
Out[ ]:
Index(['BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES',
       'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY',
       'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY',
       'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX',
       'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT',
       'TENURE'],
      dtype='object')
In [ ]:
# distplot combina la función matplotlib.hist con la de seaborn kdeplot()
# KDE Plot representa la Kernel Density Estimate
# KDE se utiliza para visualizar la densidad de una probabilidad de una variable continua.
# KDE nos muestra la densidad de una probabilidad para diferentes valores de una variable continua.

# El balance promedio es  $1500
# 'Balance_Frequency' para muchos usuarios se actualiza muy frecuentemente ~1
# Para el campo 'PURCHASES_FREQUENCY', hay dos grupos diferentes de clientes
# Para los campos 'ONEOFF_PURCHASES_FREQUENCY' y 'PURCHASES_INSTALLMENT_FREQUENCY' la gran mayoría de usuarios no pagan todo de golpe ni a plazos
# Muy pocos clientes pagan su deuda al completo 'PRC_FULL_PAYMENT'~0
# El promedio del límite del crédito está entorno de los $4500
# La mayoría de clientes llevan ~11 años usando el servicio

plt.figure(figsize = (10, 50))
for i in range(n):
    plt.subplot(n, 1, i+1)
    sns.histplot(creditcard_df[creditcard_df.columns[i]], line_kws = {"color": "b", "lw": 3, "label": "KDE"}, cbar_kws={"color": "g"}, kde=True)
    plt.title(creditcard_df.columns[i])

plt.tight_layout()
No description has been provided for this image
In [ ]:
# sns.pairplot(creditcard_df)
# Hay correlación entre 'PURCHASES' y ONEOFF_PURCHASES & INSTALMENT_PURCHASES
# Se ve una tendencia entre 'PURCHASES' y 'CREDIT_LIMIT' & 'PAYMENTS'
In [ ]:
correlations = creditcard_df.corr()
In [ ]:
f, ax = plt.subplots(figsize = (20,20))
sns.heatmap(correlations, annot = True)
# 'PURCHASES' tienen una alta correlación con one-off purchases, 'installment purchases, purchase transactions, credit limit y payments.
# Correlación positiva muy elevada entre 'PURCHASES_FREQUENCY' y 'PURCHASES_INSTALLMENT_FREQUENCY'
Out[ ]:
<Axes: >
No description has been provided for this image

K-MEANS¶

ENCONTRAR EL NÚMERO ÓPTIMO DE CLISTERS UTILIZANDO EL MÉTODO DEL CODO¶

  • El método del codo es un método heurístico de interpretación y validación de la coherencia dentro del análisis de clustering diseñado para ayudar a encontrar el número apropiado de clusters en un conjunto de datos.
  • Si el gráfico de líneas se parece a un brazo, entonces el "codo" en el brazo es el valor de k que es el mejor.
  • Source:
    • https://en.wikipedia.org/wiki/Elbow_method_(clustering)
    • https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/
In [ ]:
# Empecemos por escalar primero el dataset
scaler = StandardScaler()
creditcard_df_scaled = scaler.fit_transform(creditcard_df)
In [ ]:
creditcard_df_scaled.shape
Out[ ]:
(8950, 17)
In [ ]:
creditcard_df_scaled
Out[ ]:
array([[-0.73198937, -0.24943448, -0.42489974, ..., -0.31096755,
        -0.52555097,  0.36067954],
       [ 0.78696085,  0.13432467, -0.46955188, ...,  0.08931021,
         0.2342269 ,  0.36067954],
       [ 0.44713513,  0.51808382, -0.10766823, ..., -0.10166318,
        -0.52555097,  0.36067954],
       ...,
       [-0.7403981 , -0.18547673, -0.40196519, ..., -0.33546549,
         0.32919999, -4.12276757],
       [-0.74517423, -0.18547673, -0.46955188, ..., -0.34690648,
         0.32919999, -4.12276757],
       [-0.57257511, -0.88903307,  0.04214581, ..., -0.33294642,
        -0.52555097, -4.12276757]])
In [ ]:
scores_1 = []

range_values = range(1, 20)

for i in range_values:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(creditcard_df_scaled)
    scores_1.append(kmeans.inertia_) #WCSS


plt.plot(range_values, scores_1, 'bx-')
plt.title("Encontrar el número óptimo de Clusters")
plt.xlabel("Clusters")
plt.ylabel("WCSS(k)")
plt.show()



# Con el gráfico podemos ver que en 4 clusters es donde se forma el codo de la curva.
# Sin embargo, los valores no se reducen a una forma lineal hasta el 8º cluster.
# Elijamos pues un número de clusters igual a 8.
No description has been provided for this image

APLICAR EL MÉTODO DE K-MEANS¶

In [ ]:
kmeans = KMeans(8)
kmeans.fit(creditcard_df_scaled)
labels = kmeans.labels_
In [ ]:
kmeans.cluster_centers_.shape
Out[ ]:
(8, 17)
In [ ]:
cluster_centers = pd.DataFrame(data = kmeans.cluster_centers_, columns=[creditcard_df.columns])
cluster_centers
Out[ ]:
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
0 -0.135630 0.400131 0.539387 0.669386 0.045410 -0.331722 0.977756 1.902452 0.168801 -0.414087 -0.330446 0.616253 0.429303 0.143434 -0.158654 0.445917 0.268936
1 0.011995 0.404100 -0.357021 -0.241749 -0.399925 -0.094577 -0.852753 -0.394841 -0.754326 0.102807 -0.028741 -0.481657 -0.303464 -0.250302 -0.011802 -0.455361 0.273520
2 -0.361430 0.332243 -0.034806 -0.242518 0.363059 -0.363193 0.993781 -0.384169 1.206526 -0.473349 -0.360135 0.189821 -0.261013 -0.214810 -0.030171 0.313605 0.256155
3 1.684378 0.392729 -0.217494 -0.155556 -0.228546 2.007810 -0.471593 -0.208498 -0.411691 1.912655 1.919139 -0.265987 1.027732 0.820395 0.551555 -0.390256 0.070846
4 1.042239 0.465006 2.500780 1.806350 2.594499 -0.161300 1.164798 1.562962 1.273275 -0.284847 -0.150801 3.128411 1.303209 1.436890 0.559682 0.251719 0.337526
5 1.845310 0.340595 12.297201 12.823670 5.516158 0.272530 1.043177 2.145028 0.896761 -0.380373 -0.109730 4.556136 3.185151 9.047799 1.030898 1.222264 0.298409
6 -0.336893 -0.351916 -0.288199 -0.214424 -0.287144 0.067274 -0.203348 -0.286554 -0.225256 0.306510 -0.000298 -0.387867 -0.563134 -0.393094 -0.208930 0.013378 -3.201889
7 -0.701208 -2.141009 -0.310385 -0.234926 -0.302187 -0.322400 -0.554752 -0.442477 -0.439746 -0.521503 -0.376311 -0.419356 -0.175381 -0.201786 -0.256971 0.282604 0.201234
In [ ]:
# Para entender mejor estos valores, vamos a aplicar la transformación inversa.
cluster_centers = scaler.inverse_transform(cluster_centers)
cluster_centers = pd.DataFrame(data = cluster_centers, columns=[creditcard_df.columns])
cluster_centers

# Primer Cluster de Clientes (Transactors): Esos son los clientes que pagan la menor cantidad de cargos por intereses y tienen cuidado con su dinero, Clúster con el saldo más bajo ($ 104) y anticipo en efectivo ($ 303), Porcentaje de pago completo = 23%
# Segundo Cluster de Clientes (Revolvers) que usan tarjeta de crédito como préstamo (sector más lucrativo): saldo más alto ($ 5000) y anticipo en efectivo (~ $ 5000), baja frecuencia de compra, alta frecuencia de anticipo en efectivo (0.5), transacciones de anticipo en efectivo alto (16) y bajo porcentaje de pago (3%)
# Tercer Cluster de Clientes (VIP/Prime): límite de crédito alto $ 16K y porcentaje más alto de pago completo, objetivo para aumentar el límite de crédito y aumentar los hábitos de gasto
# Cuarto Cluster de Clientes (low tenure): estos son clientes con baja antigüedad (7 años), saldo bajo
Out[ ]:
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
0 1282.172458 0.972058 2155.612989 1703.481335 452.131655 283.234302 0.882771 0.769996 0.431523 0.052281 0.993772 30.027580 6056.429958 2148.372546 494.470259 0.284138 11.877224
1 1589.442414 0.972998 240.424411 191.184026 49.420211 780.538716 0.148099 0.084669 0.064649 0.155717 3.052689 2.737645 3390.322190 1008.544109 836.702288 0.020529 11.883358
2 812.188081 0.955976 928.841434 189.906974 739.377555 217.237799 0.889203 0.087853 0.843942 0.040422 0.791171 19.428075 3544.776886 1111.289062 793.893442 0.245439 11.860119
3 5070.366014 0.970305 538.526613 334.246201 204.396739 5189.341636 0.301078 0.140259 0.200821 0.517886 16.345538 8.098398 8233.758061 4108.105687 2149.581367 0.039571 11.612128
4 3733.806745 0.987426 6346.159662 3590.608789 2757.241014 640.617891 0.957840 0.668720 0.870470 0.078143 2.219718 92.470423 9236.056338 5892.801050 2168.520942 0.227338 11.969014
5 5405.330935 0.957955 27276.363750 21877.102917 5399.260833 1550.378389 0.909028 0.842361 0.720833 0.059028 2.500000 127.958333 16083.333333 27925.634496 3266.671038 0.511206 11.916667
6 863.259693 0.793905 387.463654 236.537740 151.406683 1119.947453 0.408737 0.116973 0.274915 0.196480 3.246795 5.068910 2445.535977 595.176112 377.303465 0.157627 7.232372
7 104.968575 0.370085 340.061778 202.509086 137.804005 302.783168 0.267702 0.070458 0.189671 0.030786 0.680779 4.286198 3856.341613 1148.993000 265.345972 0.236372 11.786622
In [ ]:
labels.shape
Out[ ]:
(8950,)
In [ ]:
labels.min()
Out[ ]:
0
In [ ]:
labels.max()
Out[ ]:
7
In [ ]:
y_kmeans = kmeans.fit_predict(creditcard_df_scaled)
y_kmeans
Out[ ]:
array([7, 1, 6, ..., 0, 0, 0], dtype=int32)
In [ ]:
# Concatenamos las etiquetas de los clusters con el dataset riginal
creditcard_df_cluster = pd.concat([creditcard_df, pd.DataFrame({'cluster': labels})], axis = 1)
creditcard_df_cluster.head()
Out[ ]:
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE cluster
0 40.900749 0.818182 95.40 0.00 95.4 0.000000 0.166667 0.000000 0.083333 0.000000 0 2 1000.0 201.802084 139.509787 0.000000 12 1
1 3202.467416 0.909091 0.00 0.00 0.0 6442.945483 0.000000 0.000000 0.000000 0.250000 4 0 7000.0 4103.032597 1072.340217 0.222222 12 3
2 2495.148862 1.000000 773.17 773.17 0.0 0.000000 1.000000 1.000000 0.000000 0.000000 0 12 7500.0 622.066742 627.284787 0.000000 12 0
3 1666.670542 0.636364 1499.00 1499.00 0.0 205.788017 0.083333 0.083333 0.000000 0.083333 1 1 7500.0 0.000000 864.206542 0.000000 12 1
4 817.714335 1.000000 16.00 16.00 0.0 0.000000 0.083333 0.083333 0.000000 0.000000 0 1 1200.0 678.334763 244.791237 0.000000 12 1
In [ ]:
# Visualizamos histogramas para cada cluster
for i in creditcard_df.columns:
    plt.figure(figsize=(35, 5))
    for j in range(8):
        plt.subplot(1, 8, j+1)
        cluster = creditcard_df_cluster[creditcard_df_cluster['cluster'] == j]
        cluster[i].hist(bins = 20)
        plt.title('{}    \nCluster {}'.format(i, j))
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

APLICAR ANÁLISIS DE LAS COMPONENTES PRINCIPALES Y VISUALIZAR LOS RESULTADOS¶

In [ ]:
# Obtenemos las componentes principales
pca = PCA(n_components = 2)
principal_comp = pca.fit_transform(creditcard_df_scaled)
principal_comp
Out[ ]:
array([[-1.68221999, -1.07645189],
       [-1.13829562,  2.50646484],
       [ 0.96968184, -0.38351944],
       ...,
       [-0.9262032 , -1.81078284],
       [-2.33654944, -0.65796001],
       [-0.55642156, -0.40046214]])
In [ ]:
# Creamos un dataframe con las dos componentes
pca_df = pd.DataFrame(data = principal_comp, columns=["pca1", "pca2"])
pca_df.head()
Out[ ]:
pca1 pca2
0 -1.682220 -1.076452
1 -1.138296 2.506465
2 0.969682 -0.383519
3 -0.873627 0.043163
4 -1.599434 -0.688581
In [ ]:
# Concatenamos las etiquetas de los clusters con el dataframe de las componentes principales

pca_df = pd.concat([pca_df, pd.DataFrame({'cluster':labels})], axis = 1)

pca_df.head()
Out[ ]:
pca1 pca2 cluster
0 -1.682220 -1.076452 1
1 -1.138296 2.506465 3
2 0.969682 -0.383519 0
3 -0.873627 0.043163 1
4 -1.599434 -0.688581 1
In [ ]:
plt.figure(figsize=(10,10))
ax = sns.scatterplot(x = "pca1", y = "pca2", hue = "cluster", data = pca_df,
                     palette = ["red", "green", "blue", "pink", "yellow", "gray", "purple", "black"])
plt.show()
No description has been provided for this image

AUTOENCODERS¶

APLICAR LOS AUTOENCODERS (REDUCIR LA DIMENSIÓN DE LOS DATOS CON LOS AUTOENCODERS)¶

In [ ]:
from tensorflow.keras.layers import Input, Add, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D, AveragePooling2D, MaxPooling2D, Dropout
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.initializers import glorot_uniform
from keras.optimizers import SGD

encoding_dim = 7

input_df = Input(shape = (17, ))

# Glorot normal inicializador (Xavier normal initializer) tomar muestras aleatorias de una distribución normal truncada

x = Dense(encoding_dim, activation = 'relu')(input_df)
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)
x = Dense(2000, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)

encoded = Dense(10, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)

x = Dense(2000, activation = 'relu', kernel_initializer = 'glorot_uniform')(encoded)
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)

decoded = Dense(17, kernel_initializer = 'glorot_uniform')(x)


autoencoder = Model(input_df, decoded)

encoder = Model(input_df, encoded)

autoencoder.compile(optimizer = 'adam', loss = 'mean_squared_error')
In [ ]:
creditcard_df_scaled.shape
Out[ ]:
(8950, 17)
In [ ]:
autoencoder.summary()
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 17)]              0         
                                                                 
 dense (Dense)               (None, 7)                 126       
                                                                 
 dense_1 (Dense)             (None, 500)               4000      
                                                                 
 dense_2 (Dense)             (None, 500)               250500    
                                                                 
 dense_3 (Dense)             (None, 2000)              1002000   
                                                                 
 dense_4 (Dense)             (None, 10)                20010     
                                                                 
 dense_5 (Dense)             (None, 2000)              22000     
                                                                 
 dense_6 (Dense)             (None, 500)               1000500   
                                                                 
 dense_7 (Dense)             (None, 17)                8517      
                                                                 
=================================================================
Total params: 2307653 (8.80 MB)
Trainable params: 2307653 (8.80 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [ ]:
autoencoder.fit(creditcard_df_scaled, creditcard_df_scaled, batch_size=128, epochs = 25, verbose = 1)
Epoch 1/25
70/70 [==============================] - 7s 70ms/step - loss: 0.5762
Epoch 2/25
70/70 [==============================] - 6s 91ms/step - loss: 0.3098
Epoch 3/25
70/70 [==============================] - 6s 80ms/step - loss: 0.2284
Epoch 4/25
70/70 [==============================] - 5s 68ms/step - loss: 0.2010
Epoch 5/25
70/70 [==============================] - 6s 87ms/step - loss: 0.1761
Epoch 6/25
70/70 [==============================] - 5s 73ms/step - loss: 0.1638
Epoch 7/25
70/70 [==============================] - 6s 81ms/step - loss: 0.1600
Epoch 8/25
70/70 [==============================] - 10s 141ms/step - loss: 0.1437
Epoch 9/25
70/70 [==============================] - 9s 135ms/step - loss: 0.1314
Epoch 10/25
70/70 [==============================] - 7s 98ms/step - loss: 0.1196
Epoch 11/25
70/70 [==============================] - 5s 68ms/step - loss: 0.1186
Epoch 12/25
70/70 [==============================] - 6s 92ms/step - loss: 0.1118
Epoch 13/25
70/70 [==============================] - 5s 67ms/step - loss: 0.1013
Epoch 14/25
70/70 [==============================] - 7s 96ms/step - loss: 0.0952
Epoch 15/25
70/70 [==============================] - 6s 81ms/step - loss: 0.0913
Epoch 16/25
70/70 [==============================] - 5s 64ms/step - loss: 0.0861
Epoch 17/25
70/70 [==============================] - 5s 77ms/step - loss: 0.0845
Epoch 18/25
70/70 [==============================] - 6s 80ms/step - loss: 0.0764
Epoch 19/25
70/70 [==============================] - 5s 66ms/step - loss: 0.0731
Epoch 20/25
70/70 [==============================] - 6s 82ms/step - loss: 0.0717
Epoch 21/25
70/70 [==============================] - 5s 75ms/step - loss: 0.0710
Epoch 22/25
70/70 [==============================] - 5s 65ms/step - loss: 0.0666
Epoch 23/25
70/70 [==============================] - 6s 87ms/step - loss: 0.0647
Epoch 24/25
70/70 [==============================] - 5s 72ms/step - loss: 0.0603
Epoch 25/25
70/70 [==============================] - 5s 66ms/step - loss: 0.0603
Out[ ]:
<keras.src.callbacks.History at 0x7ad9e0568ca0>
In [ ]:
autoencoder.save_weights('autoencoder.h5')
In [ ]:
pred = encoder.predict(creditcard_df_scaled)
280/280 [==============================] - 2s 6ms/step
In [ ]:
pred.shape
Out[ ]:
(8950, 10)
In [ ]:
scores_2 = []

range_values = range(1,20)

for i in range_values:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(pred)
    scores_2.append(kmeans.inertia_)


plt.plot(range_values, scores_2, 'bx-')
plt.title("Encontrar el número óptimo de clusters")
plt.xlabel("Número de Clusters")
plt.ylabel("WCSS(k)")
plt.show()
No description has been provided for this image
In [ ]:
plt.plot(range_values, scores_1, 'bx-', color = "r")
plt.plot(range_values, scores_2, 'bx-', color = "g")
Out[ ]:
[<matplotlib.lines.Line2D at 0x7ad9dafa83a0>]
No description has been provided for this image
In [ ]:
kmeans = KMeans(4)
kmeans.fit(pred)
labels = kmeans.labels_
y_kmeans = kmeans.fit_predict(pred)
In [ ]:
df_cluster_dr = pd.concat([creditcard_df, pd.DataFrame({'cluster': labels})], axis = 1)
df_cluster_dr.head()
Out[ ]:
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE cluster
0 40.900749 0.818182 95.40 0.00 95.4 0.000000 0.166667 0.000000 0.083333 0.000000 0 2 1000.0 201.802084 139.509787 0.000000 12 1
1 3202.467416 0.909091 0.00 0.00 0.0 6442.945483 0.000000 0.000000 0.000000 0.250000 4 0 7000.0 4103.032597 1072.340217 0.222222 12 0
2 2495.148862 1.000000 773.17 773.17 0.0 0.000000 1.000000 1.000000 0.000000 0.000000 0 12 7500.0 622.066742 627.284787 0.000000 12 1
3 1666.670542 0.636364 1499.00 1499.00 0.0 205.788017 0.083333 0.083333 0.000000 0.083333 1 1 7500.0 0.000000 864.206542 0.000000 12 1
4 817.714335 1.000000 16.00 16.00 0.0 0.000000 0.083333 0.083333 0.000000 0.000000 0 1 1200.0 678.334763 244.791237 0.000000 12 1
In [ ]:
pca = PCA(n_components=2)
princ_comp = pca.fit_transform(pred)
pca_df = pd.DataFrame(data = princ_comp, columns=["pca1", "pca2"])
pca_df.head()
Out[ ]:
pca1 pca2
0 -1.474160 -0.022473
1 1.799312 -2.021418
2 -1.882853 0.545519
3 0.479843 -0.348112
4 -2.034453 -0.111284
In [ ]:
pca_df = pd.concat([pca_df, pd.DataFrame({"cluster":labels})], axis = 1)
pca_df
Out[ ]:
pca1 pca2 cluster
0 -1.474160 -0.022473 1
1 1.799312 -2.021418 0
2 -1.882853 0.545519 1
3 0.479843 -0.348112 1
4 -2.034453 -0.111284 1
... ... ... ...
8945 1.116068 0.785430 3
8946 0.432141 0.369335 3
8947 0.997263 0.278995 3
8948 0.877870 -1.634350 0
8949 0.368103 -0.924275 0

8950 rows × 3 columns

In [ ]:
plt.figure(figsize=(10,10))
ax = sns.scatterplot(x="pca1", y = "pca2", hue="cluster", data = pca_df, palette=["red", "green", "blue", "yellow"])
plt.show()
No description has been provided for this image