# Practice 8B: Feature Importance Analysis - Credit Card Fraud Detection
## Author: Valentín Rodríguez
**UT3: Feature Engineering | Importance Analysis with Financial Data**

---

## 🎯 Discovery Objectives
- Identify the **most important features** for detecting fraud in credit card transactions
- Compare **Mutual Information vs Random Forest** for detecting fraudulent patterns
- Analyze the **distributions** of transformed variables in financial data
- Explore **correlations** between anonymized features and fraud
- Decide on the **best strategy** for selecting variables for fraud models

---

## 🔍 What you will discover
- Which variables are most critical for detecting fraud in real time?
- How do the anonymized features behave in terms of importance?
- Which methodology is more effective: Mutual Information or Random Forest?
- How do transformations affect the detection of fraudulent patterns?

---

## 📂 Dataset: Credit Card Fraud Detection
**Anonymized financial data with fraud patterns**

### Dataset characteristics:
- **284,807 credit card transactions**
- **31 variables**: 28 anonymized features + Amount + Time + Class
- **Target**: Class (0=Normal, 1=Fraud)
- **Imbalance**: 0.172% of transactions are fraudulent
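
To make the imbalance concrete, the figures listed above imply roughly how many fraud cases exist and how well a trivial classifier would score. A minimal sketch using only those two numbers (the variable names are illustrative):

```python
# Figures taken from the dataset description
n_transactions = 284_807
fraud_rate = 0.00172  # 0.172%

# Approximate number of fraudulent transactions implied by the rate
implied_fraud_count = n_transactions * fraud_rate

# Accuracy of a classifier that always predicts "Normal" (class 0)
always_normal_accuracy = 1.0 - fraud_rate

print(f"Implied fraud cases: ~{implied_fraud_count:.0f}")
print(f"Always-normal accuracy: {always_normal_accuracy:.3%}")
```

Because a model that never flags fraud already reaches about 99.8% accuracy, accuracy alone says little on this dataset; precision, recall, and precision-recall curves are more informative.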

### References:
- [Kaggle Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud)
- [Scikit-learn Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
- *Feature Engineering for ML* - Ch. 5
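
If you download the Kaggle file, it can stand in for the synthetic data used later in this notebook. A minimal sketch, assuming the CSV is saved locally as `creditcard.csv`; the path and the helper name `load_fraud_csv` are assumptions for illustration, not part of the original notebook:

```python
import pandas as pd

# Columns described in the dataset section: Time, V1..V28, Amount, Class
EXPECTED_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]

def load_fraud_csv(path):
    """Read the fraud CSV and check that the 31 expected columns are present."""
    df = pd.read_csv(path)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    X = df.drop(columns=["Class"])   # 30 feature columns
    y = df["Class"].astype(int)      # 0 = Normal, 1 = Fraud
    return X, y

# Example with a hypothetical local path:
# X, y = load_fraud_csv("creditcard.csv")
```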

---

## 🚀 Your exploration mission:
- **Problem to investigate:** Which features are most critical for detecting fraud?
- **Central question:** Is Mutual Information or Random Forest better suited to financial data?
- **Hypothesis to test:** Do the anonymized features show different importance patterns from the known features (Time and Amount)?
- **Final goal:** Build the **most robust importance ranking** for fraud detection


In [None]:
# === ENVIRONMENT SETUP ===

print("💳 SETTING UP THE ENVIRONMENT FOR FEATURE IMPORTANCE ANALYSIS - CREDIT CARD FRAUD")
print("=" * 80)

# Install dependencies
%pip install seaborn scikit-learn matplotlib pandas numpy imbalanced-learn --quiet

# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from scipy.stats import skew, kurtosis
import warnings
warnings.filterwarnings('ignore')

# Plot style configuration
plt.style.use('seaborn-v0_8')
sns.set_palette("Set2")
np.random.seed(42)

print("✅ Environment ready for feature importance analysis on financial data")
print("📊 Libraries imported: pandas, numpy, matplotlib, seaborn, scikit-learn, imbalanced-learn, scipy")


## 📊 Step 1: Load and Explore the Credit Card Fraud Dataset


In [None]:
# === LOAD THE CREDIT CARD FRAUD DATASET ===

print("💳 LOADING DATASET: CREDIT CARD FRAUD DETECTION")
print("=" * 80)

# The real data is available on Kaggle: https://www.kaggle.com/mlg-ulb/creditcardfraud
# This example simulates it with synthetic data that mirrors its structure.
# NOTE: the synthetic target below is drawn independently of every feature,
# so importance rankings computed on it reflect sampling noise, not real signal.

# Build synthetic data with the same structure as the real dataset
np.random.seed(42)
n_samples = 10000  # Subset size for the analysis
n_features = 28    # Anonymized features V1-V28

# Generate the anonymized features (simulating PCA-transformed components)
X_anon = np.random.normal(0, 1, (n_samples, n_features))

# Generate Time (seconds elapsed since the first transaction)
time = np.random.uniform(0, 172792, n_samples)

# Generate Amount (transaction amount)
amount = np.random.lognormal(2, 1.5, n_samples)
amount = np.clip(amount, 0, 25000)  # Cap extreme values

# Create the target with a realistic imbalance (0.172% fraud)
fraud_prob = 0.00172
fraud = np.random.binomial(1, fraud_prob, n_samples)

# Build the DataFrame
df = pd.DataFrame(X_anon, columns=[f'V{i+1}' for i in range(n_features)])
df['Time'] = time
df['Amount'] = amount
df['Class'] = fraud

print(f"📊 Dataset shape: {df.shape}")
print(f"📊 Columns: {list(df.columns)}")

# Class imbalance
print(f"\n📊 Target distribution (fraud):")
print(f"   Normal: {(df['Class']==0).sum():,} ({(df['Class']==0).mean():.3%})")
print(f"   Fraud:  {(df['Class']==1).sum():,} ({(df['Class']==1).mean():.3%})")

print("\n🔍 First 5 rows:")
print(df.head())

print("\n💡 DATASET CONTEXT:")
print("   Credit card fraud detection dataset")
print("   Features V1-V28: anonymized variables (PCA-transformed)")
print("   Time: seconds elapsed between each transaction and the first transaction")
print("   Amount: transaction amount")
print("   Class: target (0=Normal, 1=Fraud)")

# Basic statistics
print(f"\n📈 Amount statistics:")
print(f"   Min: ${df['Amount'].min():.2f}")
print(f"   Max: ${df['Amount'].max():.2f}")
print(f"   Mean: ${df['Amount'].mean():.2f}")
print(f"   Median: ${df['Amount'].median():.2f}")
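
In the cell above, `Class` is drawn independently of every feature, so no importance method can recover a meaningful ranking from it. One way to give the later steps something to find is to let the fraud probability depend on a few features. A minimal sketch, assuming a logistic link on `V1`, `V2` and `Amount`; the coefficients, the intercept and the choice of features are illustrative assumptions, not properties of the real dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_samples, n_features = 10_000, 28

# Same structure as the notebook's synthetic data
X_anon = rng.normal(0, 1, (n_samples, n_features))
amount = np.clip(rng.lognormal(2, 1.5, n_samples), 0, 25_000)
time = rng.uniform(0, 172_792, n_samples)

# Hypothetical signal: fraud is more likely for high V1, low V2 and large amounts.
# The intercept is hand-picked so that fraud stays rare.
logit = -8.5 + 1.5 * X_anon[:, 0] - 1.0 * X_anon[:, 1] + 0.3 * np.log1p(amount)
fraud_prob = 1.0 / (1.0 + np.exp(-logit))
fraud = rng.binomial(1, fraud_prob)

df_signal = pd.DataFrame(X_anon, columns=[f"V{i+1}" for i in range(n_features)])
df_signal["Time"] = time
df_signal["Amount"] = amount
df_signal["Class"] = fraud

print(f"Fraud rate: {df_signal['Class'].mean():.3%}")
print(df_signal.groupby("Class")[["V1", "V2"]].mean().round(2))
```

Running the later steps on `df_signal` instead of `df` gives a known ground truth: a sound importance method should tend to rank `V1`, `V2` and `Amount` near the top.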


## 🔍 Step 2: Distribution Analysis and Descriptive Statistics


In [None]:
# === DISTRIBUTION ANALYSIS AND DESCRIPTIVE STATISTICS ===

print("\n🔍 DISTRIBUTION ANALYSIS AND DESCRIPTIVE STATISTICS")
print("=" * 80)

# Prepare the data for analysis
feature_cols = [f'V{i+1}' for i in range(28)] + ['Time', 'Amount']
X = df[feature_cols]
y = df['Class']

print("📊 DESCRIPTIVE STATISTICS PER VARIABLE:")
print("-" * 60)

# Summary statistics for each variable
stats_analysis = []

for col in feature_cols:
    data = df[col]
    stats_analysis.append({
        'Variable': col,
        'Mean': data.mean(),
        'Std': data.std(),
        'Skewness': skew(data),
        'Kurtosis': kurtosis(data),
        'Min': data.min(),
        'Max': data.max(),
        'Q1': data.quantile(0.25),
        'Q3': data.quantile(0.75)
    })

stats_df = pd.DataFrame(stats_analysis)

print("Top 10 variables by standard deviation:")
top_std = stats_df.nlargest(10, 'Std')[['Variable', 'Std', 'Skewness', 'Kurtosis']]
print(top_std.round(4))

# nlargest ranks by signed skewness, so only positive (right) skew is surfaced here
print("\nTop 10 most positively skewed variables:")
top_skew = stats_df.nlargest(10, 'Skewness')[['Variable', 'Skewness', 'Kurtosis']]
print(top_skew.round(4))

# scipy's kurtosis() returns excess kurtosis (0 for a normal distribution)
print("\nTop 10 variables by excess kurtosis:")
top_kurt = stats_df.nlargest(10, 'Kurtosis')[['Variable', 'Skewness', 'Kurtosis']]
print(top_kurt.round(4))

# Focused analysis of Amount and Time
print(f"\n💰 FOCUSED ANALYSIS OF AMOUNT:")
print(f"   Distribution: log-normal by construction (positive skew)")
print(f"   Range: ${df['Amount'].min():.2f} - ${df['Amount'].max():.2f}")
print(f"   95th percentile: ${df['Amount'].quantile(0.95):.2f}")
print(f"   99th percentile: ${df['Amount'].quantile(0.99):.2f}")

print(f"\n⏰ FOCUSED ANALYSIS OF TIME:")
print(f"   Distribution: uniform by construction")
print(f"   Range: {df['Time'].min():.0f} - {df['Time'].max():.0f} seconds")
print(f"   Equivalent to: {df['Time'].max()/3600:.1f} hours")

# Analysis by class (Normal vs Fraud)
print(f"\n🎯 ANALYSIS BY CLASS (Normal vs Fraud):")
fraud_data = df[df['Class'] == 1]
normal_data = df[df['Class'] == 0]

print(f"   Normal transactions: {len(normal_data):,}")
print(f"   Fraudulent transactions: {len(fraud_data):,}")
print(f"   Fraud ratio: {len(fraud_data)/len(df):.4%}")

if len(fraud_data) > 0:
    print(f"\n   Mean Amount - Normal: ${normal_data['Amount'].mean():.2f}")
    print(f"   Mean Amount - Fraud: ${fraud_data['Amount'].mean():.2f}")
    print(f"   Mean Time - Normal: {normal_data['Time'].mean():.0f} s")
    print(f"   Mean Time - Fraud: {fraud_data['Time'].mean():.0f} s")

print("\n💡 INITIAL INSIGHTS:")
print("   - Highly imbalanced dataset (about 0.17% fraud)")
print("   - Variables V1-V28: standardized distributions (PCA-like components)")
print("   - Amount: log-normal distribution with extreme outliers")
print("   - Time: uniform distribution (no obvious temporal patterns in this synthetic data)")
print("   - Special techniques are needed to handle the imbalance")
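
The log-normal shape of `Amount` is why a log transform is a common first step before modeling it. A minimal sketch, assuming the same synthetic `Amount` distribution as in Step 1, that compares skewness before and after `np.log1p`:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
amount = np.clip(rng.lognormal(2, 1.5, 10_000), 0, 25_000)

# log(1 + x) stays defined when Amount is 0
amount_log = np.log1p(amount)

skew_raw = skew(amount)
skew_log = skew(amount_log)

print(f"Skewness of Amount:        {skew_raw:.2f}")
print(f"Skewness of log1p(Amount): {skew_log:.2f}")
```

The transformed variable is much closer to symmetric, which tends to help scale-sensitive models; tree-based models such as Random Forest are largely unaffected by monotonic transforms like this one.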


## 🧪 Step 3: Importance Analysis with Mutual Information


In [None]:
# === IMPORTANCE ANALYSIS WITH MUTUAL INFORMATION ===

print("\n🧪 IMPORTANCE ANALYSIS WITH MUTUAL INFORMATION")
print("=" * 80)

# Scale the features (RobustScaler uses the median and IQR, so outliers in Amount matter less)
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

print("📊 Computing Mutual Information for all features...")

# Compute Mutual Information between each feature and the target
mi_scores = mutual_info_classif(X_scaled, y, random_state=42)

# Build a DataFrame with the results
mi_results = pd.DataFrame({
    'Feature': feature_cols,
    'MI_Score': mi_scores
}).sort_values('MI_Score', ascending=False)

print("\n🏆 TOP 15 MOST IMPORTANT FEATURES (Mutual Information):")
print("-" * 70)
for i, (idx, row) in enumerate(mi_results.head(15).iterrows(), 1):
    print(f"{i:2d}. {row['Feature']:8s}: {row['MI_Score']:.6f}")

print(f"\n📈 MUTUAL INFORMATION STATISTICS:")
print(f"   Maximum score: {mi_scores.max():.6f}")
print(f"   Minimum score: {mi_scores.min():.6f}")
print(f"   Mean score: {mi_scores.mean():.6f}")
print(f"   Median score: {np.median(mi_scores):.6f}")

# Features with MI above the 75th-percentile threshold
mi_threshold = np.percentile(mi_scores, 75)  # Top 25%
important_features_mi = mi_results[mi_results['MI_Score'] > mi_threshold]

print(f"\n🎯 IMPORTANT FEATURES (MI > {mi_threshold:.6f}):")
print(f"   Total: {len(important_features_mi)} of {len(feature_cols)} features")
print("   Features:", list(important_features_mi['Feature']))

# Look at Amount and Time specifically
amount_mi = mi_results[mi_results['Feature'] == 'Amount']['MI_Score'].iloc[0]
time_mi = mi_results[mi_results['Feature'] == 'Time']['MI_Score'].iloc[0]

print(f"\n💰 IMPORTANCE OF KEY VARIABLES:")
print(f"   Amount: {amount_mi:.6f} (Rank: #{list(mi_results['Feature']).index('Amount') + 1})")
print(f"   Time: {time_mi:.6f} (Rank: #{list(mi_results['Feature']).index('Time') + 1})")

# Anonymized vs known features
anon_features = [f'V{i+1}' for i in range(28)]
known_features = ['Time', 'Amount']

anon_mi = mi_results[mi_results['Feature'].isin(anon_features)]['MI_Score'].mean()
known_mi = mi_results[mi_results['Feature'].isin(known_features)]['MI_Score'].mean()

print(f"\n🔍 ANONYMIZED vs KNOWN FEATURES:")
print(f"   Mean MI - Anonymized features: {anon_mi:.6f}")
print(f"   Mean MI - Known features: {known_mi:.6f}")
print(f"   Ratio: {known_mi/anon_mi:.2f}x")

print("\n💡 MUTUAL INFORMATION INSIGHTS:")
print("   - Mutual Information measures the statistical dependence between each feature and the target")
print("   - Higher values indicate features that carry information relevant to detecting fraud")
print("   - Anonymized features may hide important patterns")
print("   - Amount and Time can provide additional context")
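
Mutual Information can pick up dependencies that a linear correlation misses. A minimal sketch on toy data, assuming a target that depends on the magnitude of a feature rather than its sign (the variables `x_signal` and `x_noise` are made up for illustration and are unrelated to the fraud dataset):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 5_000

x_signal = rng.normal(size=n)               # drives the target through |x|
x_noise = rng.normal(size=n)                # unrelated to the target
y = (np.abs(x_signal) > 1.5).astype(int)    # symmetric, non-linear rule

X = np.column_stack([x_signal, x_noise])
mi = mutual_info_classif(X, y, random_state=0)
pearson = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]

for name, m, r in zip(["x_signal", "x_noise"], mi, pearson):
    print(f"{name:9s} MI={m:.3f}  Pearson r={r:+.3f}")
```

Because the rule is symmetric around zero, the Pearson correlation of `x_signal` with the target stays close to zero even though the target is fully determined by it, while its MI score is clearly above that of the noise feature.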


## 🌲 Step 4: Importance Analysis with Random Forest


In [None]:
# === IMPORTANCE ANALYSIS WITH RANDOM FOREST ===

print("\n🌲 IMPORTANCE ANALYSIS WITH RANDOM FOREST")
print("=" * 80)

# Handle the imbalance with SMOTE before fitting the Random Forest
# (SMOTE's default k_neighbors=5 needs at least 6 fraud cases in the data)
print("📊 Applying SMOTE to balance the classes...")
smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X_scaled, y)

print(f"   Original data: {X_scaled.shape[0]} samples")
print(f"   Balanced data: {X_balanced.shape[0]} samples")
print(f"   Class counts after balancing: {np.bincount(y_balanced)}")

# Train the Random Forest
print("\n🌲 Training Random Forest...")
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42,
    class_weight='balanced'  # Has no effect once SMOTE has equalized the class counts
)

rf.fit(X_balanced, y_balanced)

# Get the feature importances (impurity-based, computed from the training data)
feature_importance = rf.feature_importances_

# Build a DataFrame with the results
rf_results = pd.DataFrame({
    'Feature': feature_cols,
    'RF_Importance': feature_importance
}).sort_values('RF_Importance', ascending=False)

print("\n🏆 TOP 15 MOST IMPORTANT FEATURES (Random Forest):")
print("-" * 70)
for i, (idx, row) in enumerate(rf_results.head(15).iterrows(), 1):
    print(f"{i:2d}. {row['Feature']:8s}: {row['RF_Importance']:.6f}")

print(f"\n📈 RANDOM FOREST IMPORTANCE STATISTICS:")
print(f"   Maximum importance: {feature_importance.max():.6f}")
print(f"   Minimum importance: {feature_importance.min():.6f}")
print(f"   Mean importance: {feature_importance.mean():.6f}")
print(f"   Median importance: {np.median(feature_importance):.6f}")

# Features with importance above the 75th-percentile threshold
rf_threshold = np.percentile(feature_importance, 75)  # Top 25%
important_features_rf = rf_results[rf_results['RF_Importance'] > rf_threshold]

print(f"\n🎯 IMPORTANT FEATURES (RF > {rf_threshold:.6f}):")
print(f"   Total: {len(important_features_rf)} of {len(feature_cols)} features")
print("   Features:", list(important_features_rf['Feature']))

# Look at Amount and Time specifically
amount_rf = rf_results[rf_results['Feature'] == 'Amount']['RF_Importance'].iloc[0]
time_rf = rf_results[rf_results['Feature'] == 'Time']['RF_Importance'].iloc[0]

print(f"\n💰 IMPORTANCE OF KEY VARIABLES:")
print(f"   Amount: {amount_rf:.6f} (Rank: #{list(rf_results['Feature']).index('Amount') + 1})")
print(f"   Time: {time_rf:.6f} (Rank: #{list(rf_results['Feature']).index('Time') + 1})")

# Anonymized vs known features
anon_rf = rf_results[rf_results['Feature'].isin(anon_features)]['RF_Importance'].mean()
known_rf = rf_results[rf_results['Feature'].isin(known_features)]['RF_Importance'].mean()

print(f"\n🔍 ANONYMIZED vs KNOWN FEATURES:")
print(f"   Mean RF importance - Anonymized features: {anon_rf:.6f}")
print(f"   Mean RF importance - Known features: {known_rf:.6f}")
print(f"   Ratio: {known_rf/anon_rf:.2f}x")

# Evaluate model performance
# NOTE: X_scaled holds the same rows that SMOTE resampled for training, so these
# are in-sample metrics and will look optimistic. A fair estimate needs a held-out split.
print(f"\n📊 RANDOM FOREST PERFORMANCE (in-sample):")
y_pred_rf = rf.predict(X_scaled)
y_pred_proba_rf = rf.predict_proba(X_scaled)[:, 1]

# Score only the first rows; with this imbalance the sample may contain very few fraud cases
test_size = min(1000, len(y))
X_test_sample = X_scaled[:test_size]
y_test_sample = y[:test_size]
y_pred_sample = rf.predict(X_test_sample)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_test_sample, y_pred_sample)
precision = precision_score(y_test_sample, y_pred_sample, zero_division=0)
recall = recall_score(y_test_sample, y_pred_sample, zero_division=0)
f1 = f1_score(y_test_sample, y_pred_sample, zero_division=0)

print(f"   Accuracy: {accuracy:.4f}")
print(f"   Precision: {precision:.4f}")
print(f"   Recall: {recall:.4f}")
print(f"   F1-Score: {f1:.4f}")

print("\n💡 RANDOM FOREST INSIGHTS:")
print("   - Random Forest measures importance as the mean reduction in impurity")
print("   - Features with high importance contribute more to the model's decisions")
print("   - Balancing with SMOTE can help the model learn fraudulent patterns")
print("   - The most important features are candidates for real-time detection systems")


## 📊 Step 5: Comparing the Methodologies and Conclusions


In [None]:
# === COMPARING THE METHODOLOGIES AND CONCLUSIONS ===

print("\n📊 COMPARING THE METHODOLOGIES AND CONCLUSIONS")
print("=" * 80)

# Combine the results of both methods (rows keep the MI ordering of mi_results)
comparison_df = pd.merge(mi_results, rf_results, on='Feature')

# Normalize the scores to a common 0-1 scale for side-by-side comparison
comparison_df['MI_Normalized'] = comparison_df['MI_Score'] / comparison_df['MI_Score'].max()
comparison_df['RF_Normalized'] = comparison_df['RF_Importance'] / comparison_df['RF_Importance'].max()

# Pearson correlation between the two sets of scores
# (dividing by the maximum does not change this value)
correlation = comparison_df['MI_Normalized'].corr(comparison_df['RF_Normalized'])

print("🔄 MUTUAL INFORMATION vs RANDOM FOREST:")
print("-" * 60)

print(f"📈 Correlation between methods: {correlation:.4f}")

if correlation > 0.7:
    agreement = "High agreement"
elif correlation > 0.4:
    agreement = "Moderate agreement"
else:
    agreement = "Low agreement"

print(f"   Interpretation: {agreement}")

# Top 10 features side by side (rows are ordered by MI score)
print(f"\n🏆 TOP 10 FEATURES BY MI - DIRECT COMPARISON:")
print("-" * 60)

print("Ranking | Feature  | MI Rank | RF Rank | MI Score  | RF Score")
print("-" * 70)

for i, (idx, row) in enumerate(comparison_df.head(10).iterrows(), 1):
    mi_rank = list(mi_results['Feature']).index(row['Feature']) + 1
    rf_rank = list(rf_results['Feature']).index(row['Feature']) + 1
    print(f"{i:7d} | {row['Feature']:8s} | {mi_rank:7d} | {rf_rank:7d} | {row['MI_Score']:9.6f} | {row['RF_Importance']:8.6f}")

# Analysis of differences
print(f"\n🔍 ANALYSIS OF DIFFERENCES:")
print("-" * 40)

# Features in the top 10 of both methods
mi_top10 = set(mi_results.head(10)['Feature'])
rf_top10 = set(rf_results.head(10)['Feature'])
common_top10 = mi_top10.intersection(rf_top10)

print(f"   Features in the top 10 of BOTH methods: {len(common_top10)}")
print(f"   Common features: {list(common_top10)}")

# Features that appear in only one method's top 10
mi_unique = mi_top10 - rf_top10
rf_unique = rf_top10 - mi_top10

print(f"   Features only in the MI top 10: {list(mi_unique)}")
print(f"   Features only in the RF top 10: {list(rf_unique)}")

# Amount and Time
print(f"\n💰 FOCUSED ANALYSIS OF AMOUNT AND TIME:")
print("-" * 50)

amount_mi_rank = list(mi_results['Feature']).index('Amount') + 1
amount_rf_rank = list(rf_results['Feature']).index('Amount') + 1
time_mi_rank = list(mi_results['Feature']).index('Time') + 1
time_rf_rank = list(rf_results['Feature']).index('Time') + 1

print(f"   Amount - MI Rank: #{amount_mi_rank}, RF Rank: #{amount_rf_rank}")
print(f"   Time   - MI Rank: #{time_mi_rank}, RF Rank: #{time_rf_rank}")

# Recommendations
# NOTE: these are general guidelines; with the synthetic target used in this notebook
# they are not findings derived from the data.
print(f"\n💡 PRACTICAL RECOMMENDATIONS:")
print("-" * 40)
print("   ✅ For real-time fraud detection:")
print("      - Prioritize features in the top 10 of BOTH methods")
print("      - The anonymized features V1-V28 are likely to be critical")
print("      - Amount provides useful financial context")

print("   ✅ For feature selection:")
print("      - MI: better at detecting non-linear dependencies")
print("      - RF: better at judging importance in the context of a model")
print("      - Use both methods to cross-check each other")

print("   ✅ For production systems:")
print("      - Monitor the 15 most important features")
print("      - Consider drift detection on critical features")
print("      - Balance detection accuracy against detection speed")

print(f"\n🎯 CRITICAL FEATURES IDENTIFIED:")
print("-" * 50)
critical_features = list(common_top10)
print(f"   Total: {len(critical_features)} critical features")
print(f"   List: {critical_features}")

print(f"\n📊 FINAL SUMMARY:")
print("-" * 30)
print(f"   Dataset: {len(df)} transactions, {len(feature_cols)} features")
print(f"   Imbalance: {df['Class'].mean():.4%} fraud")
print(f"   Methods: Mutual Information + Random Forest")
print(f"   Correlation: {correlation:.4f} ({agreement})")
print(f"   Critical features: {len(critical_features)} identified")

print(f"\n✅ FEATURE IMPORTANCE ANALYSIS COMPLETED")
print("=" * 80)
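
The comparison above correlates raw scores, which a few large values can dominate. Since the goal is to compare rankings, a rank-based view can complement it. A minimal sketch, assuming two result tables shaped like `mi_results` and `rf_results`; the scores below are made-up placeholders, and the mean-rank aggregation is one simple option rather than the notebook's method:

```python
import pandas as pd

# Placeholder scores shaped like mi_results / rf_results (values are made up)
mi_demo = pd.DataFrame({"Feature": ["V1", "V2", "V3", "Amount", "Time"],
                        "MI_Score": [0.020, 0.015, 0.001, 0.008, 0.000]})
rf_demo = pd.DataFrame({"Feature": ["V1", "V2", "V3", "Amount", "Time"],
                        "RF_Importance": [0.30, 0.10, 0.05, 0.40, 0.15]})

ranks = mi_demo.merge(rf_demo, on="Feature")

# Rank 1 = most important under each method
ranks["MI_Rank"] = ranks["MI_Score"].rank(ascending=False)
ranks["RF_Rank"] = ranks["RF_Importance"].rank(ascending=False)

# Spearman correlation is the Pearson correlation of the ranks
spearman = ranks["MI_Rank"].corr(ranks["RF_Rank"])

# Simple consensus ranking: average the two ranks and sort
ranks["Mean_Rank"] = ranks[["MI_Rank", "RF_Rank"]].mean(axis=1)
consensus = ranks.sort_values("Mean_Rank")[["Feature", "MI_Rank", "RF_Rank", "Mean_Rank"]]

print(f"Spearman correlation: {spearman:.2f}")
print(consensus.to_string(index=False))
```

Averaging ranks rewards features that both methods place high and is insensitive to the different scales of MI scores and impurity importances.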
