BANK CUSTOMER SEGMENTATION | MARKETING DEPARTMENT¶
This case study requires developing a customer segmentation to define a marketing strategy. The sample dataset summarizes the usage behavior of roughly 9,000 active credit card holders over the last 6 months. The file contains 18 behavioral variables.
Below is the Data Dictionary for the Credit Card dataset:
Data Source: https://www.kaggle.com/arjunbhasin2013/ccdata
CUST_ID: Identification of the credit card holder (categorical)
BALANCE: Balance left in the account to make purchases
BALANCE_FREQUENCY: How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
PURCHASES: Amount of purchases made from the account
ONEOFF_PURCHASES: Maximum purchase amount made in a single transaction
INSTALLMENTS_PURCHASES: Amount of purchases made in installments
CASH_ADVANCE: Cash advance taken by the user
PURCHASES_FREQUENCY: How frequently purchases are made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
ONEOFF_PURCHASES_FREQUENCY: How frequently one-off purchases are made (1 = frequently, 0 = not frequently)
PURCHASES_INSTALLMENTS_FREQUENCY: How frequently installment purchases are made (1 = frequently, 0 = not frequently)
CASH_ADVANCE_FREQUENCY: How frequently cash advances are paid
CASH_ADVANCE_TRX: Number of "cash advance" transactions
PURCHASES_TRX: Number of purchase transactions made
CREDIT_LIMIT: Credit card limit for the user
PAYMENTS: Amount of payments made by the user
MINIMUM_PAYMENTS: Minimum amount of payments made by the user
PRC_FULL_PAYMENT: Percentage of the full payment paid by the user
TENURE: Tenure of the credit card service for the user
IMPORT THE LIBRARIES AND THE DATASET¶
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
creditcard_df = pd.read_csv("Marketing_data.csv")
creditcard_df
CUST_ID | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | C10001 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.40 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12 |
1 | C10002 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.00 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12 |
2 | C10003 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.00 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12 |
3 | C10004 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.00 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1 | 1 | 7500.0 | 0.000000 | NaN | 0.000000 | 12 |
4 | C10005 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.00 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 1 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8945 | C19186 | 28.493517 | 1.000000 | 291.12 | 0.00 | 291.12 | 0.000000 | 1.000000 | 0.000000 | 0.833333 | 0.000000 | 0 | 6 | 1000.0 | 325.594462 | 48.886365 | 0.500000 | 6 |
8946 | C19187 | 19.183215 | 1.000000 | 300.00 | 0.00 | 300.00 | 0.000000 | 1.000000 | 0.000000 | 0.833333 | 0.000000 | 0 | 6 | 1000.0 | 275.861322 | NaN | 0.000000 | 6 |
8947 | C19188 | 23.398673 | 0.833333 | 144.40 | 0.00 | 144.40 | 0.000000 | 0.833333 | 0.000000 | 0.666667 | 0.000000 | 0 | 5 | 1000.0 | 81.270775 | 82.418369 | 0.250000 | 6 |
8948 | C19189 | 13.457564 | 0.833333 | 0.00 | 0.00 | 0.00 | 36.558778 | 0.000000 | 0.000000 | 0.000000 | 0.166667 | 2 | 0 | 500.0 | 52.549959 | 55.755628 | 0.250000 | 6 |
8949 | C19190 | 372.708075 | 0.666667 | 1093.25 | 1093.25 | 0.00 | 127.040008 | 0.666667 | 0.666667 | 0.000000 | 0.333333 | 2 | 23 | 1200.0 | 63.165404 | 88.288956 | 0.000000 | 6 |
8950 rows × 18 columns
creditcard_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 8950 entries, 0 to 8949 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CUST_ID 8950 non-null object 1 BALANCE 8950 non-null float64 2 BALANCE_FREQUENCY 8950 non-null float64 3 PURCHASES 8950 non-null float64 4 ONEOFF_PURCHASES 8950 non-null float64 5 INSTALLMENTS_PURCHASES 8950 non-null float64 6 CASH_ADVANCE 8950 non-null float64 7 PURCHASES_FREQUENCY 8950 non-null float64 8 ONEOFF_PURCHASES_FREQUENCY 8950 non-null float64 9 PURCHASES_INSTALLMENTS_FREQUENCY 8950 non-null float64 10 CASH_ADVANCE_FREQUENCY 8950 non-null float64 11 CASH_ADVANCE_TRX 8950 non-null int64 12 PURCHASES_TRX 8950 non-null int64 13 CREDIT_LIMIT 8949 non-null float64 14 PAYMENTS 8950 non-null float64 15 MINIMUM_PAYMENTS 8637 non-null float64 16 PRC_FULL_PAYMENT 8950 non-null float64 17 TENURE 8950 non-null int64 dtypes: float64(14), int64(3), object(1) memory usage: 1.2+ MB
creditcard_df.describe()
# The average balance is $1,564
# The balance frequency is updated fairly often, ~0.9 on average
# The average purchase amount is ~$1,000
# The average maximum one-off purchase amount is ~$600
# The average purchase frequency is close to 0.5
# The averages of ONEOFF_PURCHASES_FREQUENCY, PURCHASES_INSTALLMENTS_FREQUENCY, and CASH_ADVANCE_FREQUENCY are generally low
# The average credit limit is ~$4,500
# The full-payment percentage is 15%
# Customers have an average tenure of ~11 years with the service
BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8949.000000 | 8950.000000 | 8637.000000 | 8950.000000 | 8950.000000 |
mean | 1564.474828 | 0.877271 | 1003.204834 | 592.437371 | 411.067645 | 978.871112 | 0.490351 | 0.202458 | 0.364437 | 0.135144 | 3.248827 | 14.709832 | 4494.449450 | 1733.143852 | 864.206542 | 0.153715 | 11.517318 |
std | 2081.531879 | 0.236904 | 2136.634782 | 1659.887917 | 904.338115 | 2097.163877 | 0.401371 | 0.298336 | 0.397448 | 0.200121 | 6.824647 | 24.857649 | 3638.815725 | 2895.063757 | 2372.446607 | 0.292499 | 1.338331 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 50.000000 | 0.000000 | 0.019163 | 0.000000 | 6.000000 |
25% | 128.281915 | 0.888889 | 39.635000 | 0.000000 | 0.000000 | 0.000000 | 0.083333 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1600.000000 | 383.276166 | 169.123707 | 0.000000 | 12.000000 |
50% | 873.385231 | 1.000000 | 361.280000 | 38.000000 | 89.000000 | 0.000000 | 0.500000 | 0.083333 | 0.166667 | 0.000000 | 0.000000 | 7.000000 | 3000.000000 | 856.901546 | 312.343947 | 0.000000 | 12.000000 |
75% | 2054.140036 | 1.000000 | 1110.130000 | 577.405000 | 468.637500 | 1113.821139 | 0.916667 | 0.300000 | 0.750000 | 0.222222 | 4.000000 | 17.000000 | 6500.000000 | 1901.134317 | 825.485459 | 0.142857 | 12.000000 |
max | 19043.138560 | 1.000000 | 49039.570000 | 40761.250000 | 22500.000000 | 47137.211760 | 1.000000 | 1.000000 | 1.000000 | 1.500000 | 123.000000 | 358.000000 | 30000.000000 | 50721.483360 | 76406.207520 | 1.000000 | 12.000000 |
# Let's investigate who made a $40,761 purchase!
creditcard_df[creditcard_df["ONEOFF_PURCHASES"] == 40761.25]
CUST_ID | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
550 | C10574 | 11547.52001 | 1.0 | 49039.57 | 40761.25 | 8278.32 | 558.166886 | 1.0 | 1.0 | 0.916667 | 0.083333 | 1 | 101 | 22500.0 | 46930.59824 | 2974.069421 | 0.25 | 12 |
creditcard_df['CASH_ADVANCE'].max()
47137.21176
# Let's see who took a $47,137 cash advance!
# This customer made a total of 123 cash-advance transactions!!
# They never pay off their purchases in full with the card
creditcard_df[creditcard_df['CASH_ADVANCE'] == creditcard_df['CASH_ADVANCE'].max()]
CUST_ID | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE |
---|
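When filtering rows by a floating-point value, an exact `==` comparison against a hard-coded literal is fragile; a tiny representation difference returns an empty frame. Two safer patterns, sketched on a toy frame of my own (the values are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for creditcard_df
df = pd.DataFrame({"CASH_ADVANCE": [0.0, 36.558778, 47137.21176]})

# Pattern 1: filter on the column maximum instead of a literal
top = df[df["CASH_ADVANCE"] == df["CASH_ADVANCE"].max()]

# Pattern 2: np.isclose tolerates tiny floating-point differences
close = df[np.isclose(df["CASH_ADVANCE"], 47137.211760000006)]
```

Both return the same single row here; `np.isclose` is the more general tool when the value comes from an external source.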
DATASET VISUALIZATION¶
# Let's check whether we have any missing data.
sns.heatmap(creditcard_df.isnull(), yticklabels=False, cbar = False, cmap = "Blues")
<Axes: >
creditcard_df.isnull().sum()
CUST_ID 0 BALANCE 0 BALANCE_FREQUENCY 0 PURCHASES 0 ONEOFF_PURCHASES 0 INSTALLMENTS_PURCHASES 0 CASH_ADVANCE 0 PURCHASES_FREQUENCY 0 ONEOFF_PURCHASES_FREQUENCY 0 PURCHASES_INSTALLMENTS_FREQUENCY 0 CASH_ADVANCE_FREQUENCY 0 CASH_ADVANCE_TRX 0 PURCHASES_TRX 0 CREDIT_LIMIT 1 PAYMENTS 0 MINIMUM_PAYMENTS 313 PRC_FULL_PAYMENT 0 TENURE 0 dtype: int64
# Fill the missing values with the mean of 'MINIMUM_PAYMENTS'
creditcard_df.loc[creditcard_df['MINIMUM_PAYMENTS'].isnull(), 'MINIMUM_PAYMENTS'] = creditcard_df['MINIMUM_PAYMENTS'].mean()
# Fill the missing values with the mean of 'CREDIT_LIMIT'
creditcard_df.loc[creditcard_df['CREDIT_LIMIT'].isnull(), 'CREDIT_LIMIT'] = creditcard_df['CREDIT_LIMIT'].mean()
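The same mean imputation can also be written with `fillna`; a minimal equivalent sketch on a toy frame (illustrative values, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame with the same missing-value pattern as the real columns
df = pd.DataFrame({"MINIMUM_PAYMENTS": [100.0, np.nan, 300.0],
                   "CREDIT_LIMIT": [1000.0, 2000.0, np.nan]})

# fillna with the column mean does the same job as the .loc assignment
for col in ["MINIMUM_PAYMENTS", "CREDIT_LIMIT"]:
    df[col] = df[col].fillna(df[col].mean())
```

After this, `df` has no NaNs; each gap holds its column's mean (200.0 and 1500.0 here).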
sns.heatmap(creditcard_df.isnull(), yticklabels=False, cbar = False, cmap = "Blues")
<Axes: >
# Let's check whether our data contains duplicate entries
creditcard_df.duplicated().sum()
0
# We can drop the Customer ID field since it carries no behavioral information
creditcard_df.drop("CUST_ID", axis = 1, inplace=True)
creditcard_df.head()
BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.4 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12 |
1 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.0 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12 |
2 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12 |
3 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.0 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1 | 1 | 7500.0 | 0.000000 | 864.206542 | 0.000000 | 12 |
4 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.0 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 1 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12 |
n = len(creditcard_df.columns)
n
17
creditcard_df.columns
Index(['BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT', 'TENURE'], dtype='object')
# histplot with kde=True combines matplotlib's hist with seaborn's kdeplot()
# KDE stands for Kernel Density Estimate
# KDE visualizes the probability density of a continuous variable across its range of values
# The average balance is ~$1,500
# 'BALANCE_FREQUENCY' for many users is updated very frequently, ~1
# For 'PURCHASES_FREQUENCY' there are two distinct groups of customers
# For 'ONEOFF_PURCHASES_FREQUENCY' and 'PURCHASES_INSTALLMENTS_FREQUENCY' the vast majority of users neither buy one-off nor in installments frequently
# Very few customers pay their debt in full: 'PRC_FULL_PAYMENT' ~ 0
# The average credit limit is around $4,500
# Most customers have been using the service for ~11 years
plt.figure(figsize = (10, 50))
for i in range(n):
plt.subplot(n, 1, i+1)
sns.histplot(creditcard_df[creditcard_df.columns[i]], line_kws = {"color": "b", "lw": 3, "label": "KDE"}, cbar_kws={"color": "g"}, kde=True)
plt.title(creditcard_df.columns[i])
plt.tight_layout()
# sns.pairplot(creditcard_df)
# 'PURCHASES' is correlated with 'ONEOFF_PURCHASES' and 'INSTALLMENTS_PURCHASES'
# There is a visible trend between 'PURCHASES' and 'CREDIT_LIMIT' & 'PAYMENTS'
correlations = creditcard_df.corr()
f, ax = plt.subplots(figsize = (20,20))
sns.heatmap(correlations, annot = True)
# 'PURCHASES' has a high correlation with one-off purchases, installment purchases, purchase transactions, credit limit, and payments
# Very high positive correlation between 'PURCHASES_FREQUENCY' and 'PURCHASES_INSTALLMENTS_FREQUENCY'
<Axes: >
K-MEANS¶
FIND THE OPTIMAL NUMBER OF CLUSTERS USING THE ELBOW METHOD¶
- The elbow method is a heuristic for interpreting and validating the internal consistency of a cluster analysis, designed to help find the appropriate number of clusters in a dataset.
- If the line plot looks like an arm, the "elbow" of the arm is the best value of k.
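Beyond the elbow heuristic, the silhouette score offers a complementary check (my addition, not part of the original analysis): it ranges from -1 to 1, and higher values mean tighter, better-separated clusters. A sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with 4 well-separated groups
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Average silhouette over all samples for this k
    scores[k] = silhouette_score(X, km.labels_)

best_k = max(scores, key=scores.get)
```

Picking the k with the highest silhouette can corroborate (or challenge) the elbow reading on the real scaled data.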
# Let's start by scaling the dataset
scaler = StandardScaler()
creditcard_df_scaled = scaler.fit_transform(creditcard_df)
creditcard_df_scaled.shape
(8950, 17)
creditcard_df_scaled
array([[-0.73198937, -0.24943448, -0.42489974, ..., -0.31096755, -0.52555097, 0.36067954], [ 0.78696085, 0.13432467, -0.46955188, ..., 0.08931021, 0.2342269 , 0.36067954], [ 0.44713513, 0.51808382, -0.10766823, ..., -0.10166318, -0.52555097, 0.36067954], ..., [-0.7403981 , -0.18547673, -0.40196519, ..., -0.33546549, 0.32919999, -4.12276757], [-0.74517423, -0.18547673, -0.46955188, ..., -0.34690648, 0.32919999, -4.12276757], [-0.57257511, -0.88903307, 0.04214581, ..., -0.33294642, -0.52555097, -4.12276757]])
scores_1 = []
range_values = range(1, 20)
for i in range_values:
kmeans = KMeans(n_clusters = i)
kmeans.fit(creditcard_df_scaled)
scores_1.append(kmeans.inertia_) #WCSS
plt.plot(range_values, scores_1, 'bx-')
plt.title("Finding the optimal number of clusters")
plt.xlabel("Clusters")
plt.ylabel("WCSS(k)")
plt.show()
# The plot shows that the elbow of the curve forms at around 4 clusters.
# However, the values keep decreasing noticeably until about the 8th cluster.
# Let's therefore choose 8 clusters.
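One caveat worth noting: K-Means initialization is random, so cluster numbering (and occasionally membership) can change between runs. Passing `random_state` makes results reproducible; a sketch on synthetic data (my addition, not in the original notebook):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in (the real notebook would use creditcard_df_scaled)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Same seed -> identical cluster assignments on every run
labels_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
```

Without a fixed seed, the cluster interpretations below (Transactors, Revolvers, etc.) may map to different label numbers on a re-run.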
APPLY THE K-MEANS METHOD¶
kmeans = KMeans(8)
kmeans.fit(creditcard_df_scaled)
labels = kmeans.labels_
kmeans.cluster_centers_.shape
(8, 17)
cluster_centers = pd.DataFrame(data = kmeans.cluster_centers_, columns = creditcard_df.columns)
cluster_centers
BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.135630 | 0.400131 | 0.539387 | 0.669386 | 0.045410 | -0.331722 | 0.977756 | 1.902452 | 0.168801 | -0.414087 | -0.330446 | 0.616253 | 0.429303 | 0.143434 | -0.158654 | 0.445917 | 0.268936 |
1 | 0.011995 | 0.404100 | -0.357021 | -0.241749 | -0.399925 | -0.094577 | -0.852753 | -0.394841 | -0.754326 | 0.102807 | -0.028741 | -0.481657 | -0.303464 | -0.250302 | -0.011802 | -0.455361 | 0.273520 |
2 | -0.361430 | 0.332243 | -0.034806 | -0.242518 | 0.363059 | -0.363193 | 0.993781 | -0.384169 | 1.206526 | -0.473349 | -0.360135 | 0.189821 | -0.261013 | -0.214810 | -0.030171 | 0.313605 | 0.256155 |
3 | 1.684378 | 0.392729 | -0.217494 | -0.155556 | -0.228546 | 2.007810 | -0.471593 | -0.208498 | -0.411691 | 1.912655 | 1.919139 | -0.265987 | 1.027732 | 0.820395 | 0.551555 | -0.390256 | 0.070846 |
4 | 1.042239 | 0.465006 | 2.500780 | 1.806350 | 2.594499 | -0.161300 | 1.164798 | 1.562962 | 1.273275 | -0.284847 | -0.150801 | 3.128411 | 1.303209 | 1.436890 | 0.559682 | 0.251719 | 0.337526 |
5 | 1.845310 | 0.340595 | 12.297201 | 12.823670 | 5.516158 | 0.272530 | 1.043177 | 2.145028 | 0.896761 | -0.380373 | -0.109730 | 4.556136 | 3.185151 | 9.047799 | 1.030898 | 1.222264 | 0.298409 |
6 | -0.336893 | -0.351916 | -0.288199 | -0.214424 | -0.287144 | 0.067274 | -0.203348 | -0.286554 | -0.225256 | 0.306510 | -0.000298 | -0.387867 | -0.563134 | -0.393094 | -0.208930 | 0.013378 | -3.201889 |
7 | -0.701208 | -2.141009 | -0.310385 | -0.234926 | -0.302187 | -0.322400 | -0.554752 | -0.442477 | -0.439746 | -0.521503 | -0.376311 | -0.419356 | -0.175381 | -0.201786 | -0.256971 | 0.282604 | 0.201234 |
# To better understand these values, let's apply the inverse transformation.
cluster_centers = scaler.inverse_transform(cluster_centers)
cluster_centers = pd.DataFrame(data = cluster_centers, columns = creditcard_df.columns)
cluster_centers
# First customer cluster (Transactors): customers who pay the least in interest charges and are careful with their money; lowest balance ($104) and cash advance ($303); full-payment percentage = 23%
# Second customer cluster (Revolvers): use the credit card as a loan (most lucrative segment); highest balance ($5,000) and cash advance (~$5,000), low purchase frequency, high cash-advance frequency (0.5), many cash-advance transactions (16), and a low full-payment percentage (3%)
# Third customer cluster (VIP/Prime): high credit limit ($16K) and highest full-payment percentage; target them to increase their credit limit and spending habits
# Fourth customer cluster (low tenure): customers with low tenure (7 years) and low balance
BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1282.172458 | 0.972058 | 2155.612989 | 1703.481335 | 452.131655 | 283.234302 | 0.882771 | 0.769996 | 0.431523 | 0.052281 | 0.993772 | 30.027580 | 6056.429958 | 2148.372546 | 494.470259 | 0.284138 | 11.877224 |
1 | 1589.442414 | 0.972998 | 240.424411 | 191.184026 | 49.420211 | 780.538716 | 0.148099 | 0.084669 | 0.064649 | 0.155717 | 3.052689 | 2.737645 | 3390.322190 | 1008.544109 | 836.702288 | 0.020529 | 11.883358 |
2 | 812.188081 | 0.955976 | 928.841434 | 189.906974 | 739.377555 | 217.237799 | 0.889203 | 0.087853 | 0.843942 | 0.040422 | 0.791171 | 19.428075 | 3544.776886 | 1111.289062 | 793.893442 | 0.245439 | 11.860119 |
3 | 5070.366014 | 0.970305 | 538.526613 | 334.246201 | 204.396739 | 5189.341636 | 0.301078 | 0.140259 | 0.200821 | 0.517886 | 16.345538 | 8.098398 | 8233.758061 | 4108.105687 | 2149.581367 | 0.039571 | 11.612128 |
4 | 3733.806745 | 0.987426 | 6346.159662 | 3590.608789 | 2757.241014 | 640.617891 | 0.957840 | 0.668720 | 0.870470 | 0.078143 | 2.219718 | 92.470423 | 9236.056338 | 5892.801050 | 2168.520942 | 0.227338 | 11.969014 |
5 | 5405.330935 | 0.957955 | 27276.363750 | 21877.102917 | 5399.260833 | 1550.378389 | 0.909028 | 0.842361 | 0.720833 | 0.059028 | 2.500000 | 127.958333 | 16083.333333 | 27925.634496 | 3266.671038 | 0.511206 | 11.916667 |
6 | 863.259693 | 0.793905 | 387.463654 | 236.537740 | 151.406683 | 1119.947453 | 0.408737 | 0.116973 | 0.274915 | 0.196480 | 3.246795 | 5.068910 | 2445.535977 | 595.176112 | 377.303465 | 0.157627 | 7.232372 |
7 | 104.968575 | 0.370085 | 340.061778 | 202.509086 | 137.804005 | 302.783168 | 0.267702 | 0.070458 | 0.189671 | 0.030786 | 0.680779 | 4.286198 | 3856.341613 | 1148.993000 | 265.345972 | 0.236372 | 11.786622 |
labels.shape
(8950,)
labels.min()
0
labels.max()
7
y_kmeans = kmeans.fit_predict(creditcard_df_scaled)
y_kmeans
array([7, 1, 6, ..., 0, 0, 0], dtype=int32)
# Concatenate the cluster labels with the original dataset
creditcard_df_cluster = pd.concat([creditcard_df, pd.DataFrame({'cluster': labels})], axis = 1)
creditcard_df_cluster.head()
BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | cluster | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.4 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12 | 1 |
1 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.0 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12 | 3 |
2 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12 | 0 |
3 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.0 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1 | 1 | 7500.0 | 0.000000 | 864.206542 | 0.000000 | 12 | 1 |
4 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.0 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 1 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12 | 1 |
# Plot histograms for each cluster
for i in creditcard_df.columns:
plt.figure(figsize=(35, 5))
for j in range(8):
plt.subplot(1, 8, j+1)
cluster = creditcard_df_cluster[creditcard_df_cluster['cluster'] == j]
cluster[i].hist(bins = 20)
plt.title('{} \nCluster {}'.format(i, j))
plt.show()
APPLY PRINCIPAL COMPONENT ANALYSIS AND VISUALIZE THE RESULTS¶
# Obtain the principal components
pca = PCA(n_components = 2)
principal_comp = pca.fit_transform(creditcard_df_scaled)
principal_comp
array([[-1.68221999, -1.07645189], [-1.13829562, 2.50646484], [ 0.96968184, -0.38351944], ..., [-0.9262032 , -1.81078284], [-2.33654944, -0.65796001], [-0.55642156, -0.40046214]])
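Before trusting a 2-D projection of 17 features, it helps to check how much variance the two components actually retain via `explained_variance_ratio_`. This sanity check is my addition; random stand-in data below, not the real scaled dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 17))  # stand-in for the scaled 17-column dataset

pca = PCA(n_components=2)
pca.fit(X)
# Fraction of total variance captured by each component, in decreasing order
ratios = pca.explained_variance_ratio_
```

If the two ratios sum to a small fraction on the real data, the scatter plot shows cluster structure only loosely, and distances in it should be read with caution.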
# Create a dataframe with the two components
pca_df = pd.DataFrame(data = principal_comp, columns=["pca1", "pca2"])
pca_df.head()
pca1 | pca2 | |
---|---|---|
0 | -1.682220 | -1.076452 |
1 | -1.138296 | 2.506465 |
2 | 0.969682 | -0.383519 |
3 | -0.873627 | 0.043163 |
4 | -1.599434 | -0.688581 |
# Concatenate the cluster labels with the principal-components dataframe
pca_df = pd.concat([pca_df, pd.DataFrame({'cluster':labels})], axis = 1)
pca_df.head()
pca1 | pca2 | cluster | |
---|---|---|---|
0 | -1.682220 | -1.076452 | 1 |
1 | -1.138296 | 2.506465 | 3 |
2 | 0.969682 | -0.383519 | 0 |
3 | -0.873627 | 0.043163 | 1 |
4 | -1.599434 | -0.688581 | 1 |
plt.figure(figsize=(10,10))
ax = sns.scatterplot(x = "pca1", y = "pca2", hue = "cluster", data = pca_df,
palette = ["red", "green", "blue", "pink", "yellow", "gray", "purple", "black"])
plt.show()
AUTOENCODERS¶
APPLY AUTOENCODERS (REDUCE THE DATA DIMENSIONALITY WITH AUTOENCODERS)¶
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
encoding_dim = 7
input_df = Input(shape = (17, ))
# The glorot_uniform (Xavier) initializer draws samples from a uniform distribution scaled by the layer's number of input and output units
x = Dense(encoding_dim, activation = 'relu')(input_df)
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)
x = Dense(2000, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)
encoded = Dense(10, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)
x = Dense(2000, activation = 'relu', kernel_initializer = 'glorot_uniform')(encoded)
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)
decoded = Dense(17, kernel_initializer = 'glorot_uniform')(x)
autoencoder = Model(input_df, decoded)
encoder = Model(input_df, encoded)
autoencoder.compile(optimizer = 'adam', loss = 'mean_squared_error')
creditcard_df_scaled.shape
(8950, 17)
autoencoder.summary()
Model: "model" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_1 (InputLayer) [(None, 17)] 0 dense (Dense) (None, 7) 126 dense_1 (Dense) (None, 500) 4000 dense_2 (Dense) (None, 500) 250500 dense_3 (Dense) (None, 2000) 1002000 dense_4 (Dense) (None, 10) 20010 dense_5 (Dense) (None, 2000) 22000 dense_6 (Dense) (None, 500) 1000500 dense_7 (Dense) (None, 17) 8517 ================================================================= Total params: 2307653 (8.80 MB) Trainable params: 2307653 (8.80 MB) Non-trainable params: 0 (0.00 Byte) _________________________________________________________________
autoencoder.fit(creditcard_df_scaled, creditcard_df_scaled, batch_size=128, epochs = 25, verbose = 1)
Epoch 1/25 70/70 [==============================] - 7s 70ms/step - loss: 0.5762 Epoch 2/25 70/70 [==============================] - 6s 91ms/step - loss: 0.3098 Epoch 3/25 70/70 [==============================] - 6s 80ms/step - loss: 0.2284 Epoch 4/25 70/70 [==============================] - 5s 68ms/step - loss: 0.2010 Epoch 5/25 70/70 [==============================] - 6s 87ms/step - loss: 0.1761 Epoch 6/25 70/70 [==============================] - 5s 73ms/step - loss: 0.1638 Epoch 7/25 70/70 [==============================] - 6s 81ms/step - loss: 0.1600 Epoch 8/25 70/70 [==============================] - 10s 141ms/step - loss: 0.1437 Epoch 9/25 70/70 [==============================] - 9s 135ms/step - loss: 0.1314 Epoch 10/25 70/70 [==============================] - 7s 98ms/step - loss: 0.1196 Epoch 11/25 70/70 [==============================] - 5s 68ms/step - loss: 0.1186 Epoch 12/25 70/70 [==============================] - 6s 92ms/step - loss: 0.1118 Epoch 13/25 70/70 [==============================] - 5s 67ms/step - loss: 0.1013 Epoch 14/25 70/70 [==============================] - 7s 96ms/step - loss: 0.0952 Epoch 15/25 70/70 [==============================] - 6s 81ms/step - loss: 0.0913 Epoch 16/25 70/70 [==============================] - 5s 64ms/step - loss: 0.0861 Epoch 17/25 70/70 [==============================] - 5s 77ms/step - loss: 0.0845 Epoch 18/25 70/70 [==============================] - 6s 80ms/step - loss: 0.0764 Epoch 19/25 70/70 [==============================] - 5s 66ms/step - loss: 0.0731 Epoch 20/25 70/70 [==============================] - 6s 82ms/step - loss: 0.0717 Epoch 21/25 70/70 [==============================] - 5s 75ms/step - loss: 0.0710 Epoch 22/25 70/70 [==============================] - 5s 65ms/step - loss: 0.0666 Epoch 23/25 70/70 [==============================] - 6s 87ms/step - loss: 0.0647 Epoch 24/25 70/70 [==============================] - 5s 72ms/step - loss: 0.0603 Epoch 25/25 70/70 [==============================] - 5s 
66ms/step - loss: 0.0603
<keras.src.callbacks.History at 0x7ad9e0568ca0>
autoencoder.save_weights('autoencoder.h5')
pred = encoder.predict(creditcard_df_scaled)
280/280 [==============================] - 2s 6ms/step
pred.shape
(8950, 10)
scores_2 = []
range_values = range(1,20)
for i in range_values:
kmeans = KMeans(n_clusters = i)
kmeans.fit(pred)
scores_2.append(kmeans.inertia_)
plt.plot(range_values, scores_2, 'bx-')
plt.title("Finding the optimal number of clusters")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS(k)")
plt.show()
plt.plot(range_values, scores_1, 'rx-')
plt.plot(range_values, scores_2, 'gx-')
[<matplotlib.lines.Line2D at 0x7ad9dafa83a0>]
kmeans = KMeans(4)
kmeans.fit(pred)
labels = kmeans.labels_
y_kmeans = kmeans.fit_predict(pred)
df_cluster_dr = pd.concat([creditcard_df, pd.DataFrame({'cluster': labels})], axis = 1)
df_cluster_dr.head()
BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | cluster | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.4 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12 | 1 |
1 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.0 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12 | 0 |
2 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12 | 1 |
3 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.0 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1 | 1 | 7500.0 | 0.000000 | 864.206542 | 0.000000 | 12 | 1 |
4 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.0 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 1 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12 | 1 |
pca = PCA(n_components=2)
princ_comp = pca.fit_transform(pred)
pca_df = pd.DataFrame(data = princ_comp, columns=["pca1", "pca2"])
pca_df.head()
pca1 | pca2 | |
---|---|---|
0 | -1.474160 | -0.022473 |
1 | 1.799312 | -2.021418 |
2 | -1.882853 | 0.545519 |
3 | 0.479843 | -0.348112 |
4 | -2.034453 | -0.111284 |
pca_df = pd.concat([pca_df, pd.DataFrame({"cluster":labels})], axis = 1)
pca_df
pca1 | pca2 | cluster | |
---|---|---|---|
0 | -1.474160 | -0.022473 | 1 |
1 | 1.799312 | -2.021418 | 0 |
2 | -1.882853 | 0.545519 | 1 |
3 | 0.479843 | -0.348112 | 1 |
4 | -2.034453 | -0.111284 | 1 |
... | ... | ... | ... |
8945 | 1.116068 | 0.785430 | 3 |
8946 | 0.432141 | 0.369335 | 3 |
8947 | 0.997263 | 0.278995 | 3 |
8948 | 0.877870 | -1.634350 | 0 |
8949 | 0.368103 | -0.924275 | 0 |
8950 rows × 3 columns
plt.figure(figsize=(10,10))
ax = sns.scatterplot(x="pca1", y = "pca2", hue="cluster", data = pca_df, palette=["red", "green", "blue", "yellow"])
plt.show()