💻 Gaussian Mixture Model
GitHub Link for Code Visuals
Soowan Choi
Gaussian Mixture Model
(Detecting Red Wine)
Classification (Clustering) - Unsupervised - Model Based - Parametric
- Tuning the number of features and the number of Gaussian components (n_components) using F1 Score and AUC
1) Problem
1.1) Classification (Clustering)
- 2 wine types
- predict anomaly/red wine (NOTE: labels unknown for unsupervised learning)
1.2) Explore
- 6497 samples (rows), 13 columns (12 features + 1 target), 2 target classes (unknown for unsupervised learning)
- unbalanced class: White Wine = 4898, Red Wine = 1599
- some missing values
Check Balance
Percentage of entries in dataset for each target class?
- White: 74.96%, Red: 24.52% (computed after dropping rows with missing values, so the percentages do not sum to 100%)
Is the data balanced or imbalanced?
- This is an unbalanced dataset: there is significantly more data for target wine = white than for target wine = red.
Why is an imbalanced dataset bad? Bad for KNN?
- An imbalanced dataset makes it difficult to accurately predict the positive class (red wine).
- An imbalanced class distribution might affect a KNN classifier by classifying all new test data as the class with the most observations (white wine).
What metric must be used when the data is imbalanced?
- Use the F1 Score (precision and recall) as the performance metric when classes are imbalanced.
- The ROC-AUC curves (true positive and false positive rates from the confusion matrix) are better suited when classes are balanced.
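As a quick illustration (a toy sketch, not part of the wine analysis, mirroring the 75/25 split above): a classifier that always predicts the majority class looks strong on accuracy but collapses on F1 for the minority class.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
y_true = np.array([1]*75 + [0]*25)   # 1 = white (majority), 0 = red (minority)
y_pred = np.ones(100, dtype=int)     # always predict the majority class
print(accuracy_score(y_true, y_pred))          # 0.75 - looks decent
print(f1_score(y_true, y_pred, pos_label=0))   # 0.0 (with a warning) - useless at finding red wine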
Reference: https://www.kaggle.com/datasets/rajyellow46/wine-quality?resource=download
1.1) Classification (Clustering)
# load the dataset
import pandas as pd
url = 'https://raw.githubusercontent.com/soowanchoi/swanscodex/main/winequality.csv'
df = pd.read_csv(url)
1.2) Explore
# how many samples and features?
print(f'there are {df.shape[0]} rows and {df.shape[1]} columns in this dataset \n')
df.rename(columns = {'type':'wine'}, inplace = True)
df.head()
# how many types of each class?
types = len(df.wine.unique())
print(f'there are {types} types of wine types to classify in this dataset: \n')
# print the names of each class and sample length
for i in range(types):
    print(f'{i+1} = {df.wine.unique()[i]}: \t sample data = {len(df[df.wine == df.wine.unique()[i]])}')
# check to see if data is balanced
white = round((len(df.where(df['wine']=='white').dropna()) / df.shape[0]) * 100, 2)
red = round((len(df.where(df['wine']=='red').dropna()) / df.shape[0]) * 100, 2)
print(f'percentage of entries for wine = white is {white}%') #how many target wine = white
print(f'percentage of entries for wine = red is {red}%') #how many target wine = red
# data statistics
df.describe()
# how many missing values in dataset?
df.info()
print()
df.isnull().sum()
2) Data
2.1) Clean
2.2) Xy Split
2.3) Test/Train Split
2.4) Standardize
2.1) Clean
print(len(df))
df = df.dropna()
print(len(df))
2.2) Xy Split
# Xy split
feature_data = df.iloc[:, 1:] # feature data X
target_data = df.iloc[:,0] # target data y
# show the split dataframe of feature data X
feature_data.head(3)
2.3) Test/Train Split
# test/train split: Training (65%), Validation (20%), Testing (15%)
from sklearn.model_selection import train_test_split
# split entire data set for 15% test set
X_train, X_test, y_train, y_test = train_test_split(feature_data, target_data, test_size=0.15, random_state=1)
# split training set again for 20% validation set
# 0.85 * 0.23529 ≈ 0.20, i.e. 20% of the full data for validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.23529, random_state=1)
# check to make sure the 65%, 20% and 15% split
print(f'training data: {round(len(X_train)/len(feature_data)*100,2)}%') #train data
print(f'validation data: {round(len(X_val)/len(feature_data)*100,2)}%') #validation data
print(f'testing data: {round(len(X_test)/len(feature_data)*100,2)}%') #test data (new unseen data)
Visualize Distribution of White Wine and Red Wine
import matplotlib.gridspec as gridspec
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
features = [f for f in df.columns[1:]]
num_plots = np.size(features)
plt.figure(figsize=(10,5*num_plots))
# grid layout to place each subplot
grid = gridspec.GridSpec(num_plots,1)
for i, feature in enumerate(features):
    ax = plt.subplot(grid[i])
    sns.histplot(X_train[feature][y_train=='white'], stat="density", kde=True, color="blue", bins=30)
    sns.histplot(X_train[feature][y_train=='red'], stat="density", kde=True, color="red", bins=30)
    ax.legend(['white', 'red'], loc='best')
    ax.set_xlabel('')
    ax.set_title('Feature Distribution: ' + feature)
These distributions tell us which features effectively differentiate white wine from red wine. For example, "citric acid" would be a poor feature for detecting anomalies, as its distribution is nearly identical for red wine (the anomaly to be detected) and white wine. On the other hand, "chlorides" has two distinct distributions for white wine and red wine, which should yield better precision and recall of true positives (detecting red wine as anomalies).
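To quantify the separation seen in the plots (a minimal sketch, assuming the X_train, y_train and features variables defined above, with the string labels still in place): using each raw feature value directly as a ranking score, an ROC-AUC near 0.5 indicates heavy overlap between the classes, while values near 0 or 1 indicate good separation.
from sklearn.metrics import roc_auc_score
for feature in features:
    # AUC of the raw feature value as a score for ranking white vs red wine
    auc = roc_auc_score((y_train == 'white').astype(int), X_train[feature])
    print(f'{feature}: univariate AUC = {auc:.3f}')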
3) Model: GMM Clustering
3.1) Gaussian Model (ONE) - Feature (ONE)
- Making a prediction using a SINGLE FEATURE at a time
Fit Gaussian Regardless of Class:
3.1.1) Fit single gaussian distribution on full training data (both class).
- sklearn.mixture.GaussianMixture
- n_components = 1
3.1.2) Compute AUC on full training data and validation data (both class).
- based on sklearn.mixture.GaussianMixture.score_samples.
3.1.3) Repeat for each single feature.
3.1.4) Select 3 best features to identify red wine.
- based on highest AUC of validation data
3.1.5) Optimal Threshold to Maximize F1 Score
- for each best feature: find optimal threshold to maximize F1 score of validation data
- sklearn.metrics.f1_score
- anomaly (red wine): score_samples < threshold
- train and get the probability of NOT being an outlier for each of the 3 best features:
- score_samples = log likelihood of each sample being generated by the fitted mixture (likelihood of not being an outlier)
3.1.6) Table: precision, recall and F1 Score on training data and validation data
- using optimal threshold
Fit Gaussian Based on Class:
3.1.7) Use the 3 best features with best AUC
3.1.8) Fit Gaussian only on RED WINE in the training data
3.1.9) Compute AUC, F1 Score, Precision, Recall
3.1.10) Compare results when fitting gaussian regardless/based on class
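Before the detailed code, the core detection rule used throughout this section can be sketched in a few lines (a hypothetical helper, not in the original notebook, assuming 1-D feature arrays for training and scoring):
from sklearn.mixture import GaussianMixture
import numpy as np
def anomaly_flags(x_train, x_new, threshold):
    # fit a single Gaussian and flag samples whose log-likelihood falls below the threshold
    gm = GaussianMixture(n_components=1, covariance_type='full', random_state=1)
    gm.fit(np.asarray(x_train).reshape(-1, 1))
    log_lik = gm.score_samples(np.asarray(x_new).reshape(-1, 1))  # log p(x) under the fit
    return log_lik < threshold  # True = anomaly (low likelihood)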
CODE TO VISUALIZE ROC-AUC CURVES
- training data and validation data of the One Gaussian Model - Single Feature
from sklearn.mixture import GaussianMixture
# define parameters of gaussian mixture model - single feature
gm_one = GaussianMixture(n_components = 1,
covariance_type = 'full', random_state=1)
# reshape(-1,1) = (however many rows, one feature) as an array
# train the gaussian mixture model - unsupervised learning (uses no labels)
gm_one.fit(np.array(X_train['total sulfur dioxide']).reshape(-1,1))
# compute AUC on full training set AND validation set (both class)
# log likelihood each sample belongs to cluster (likelihood not outlier)
p_train = gm_one.score_samples(np.array(X_train['total sulfur dioxide']).reshape(-1,1))
p_val = gm_one.score_samples(np.array(X_val['total sulfur dioxide']).reshape(-1,1))
from sklearn.metrics import roc_curve
# convert the target class labels to 0 and 1
y_train = y_train.map({'red': 0, 'white': 1}).astype(int)
y_val = y_val.map({'red' : 0, 'white' : 1}).astype(int)
# plot the ROC curves
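# note: score_samples returns log-likelihood, so negating it makes low-likelihood (anomalous) samples score high for roc_curve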
fpr_train, tpr_train, _ = roc_curve(y_train, -1* p_train)
fpr_val, tpr_val, _ = roc_curve(y_val, -1* p_val)
plt.plot(fpr_train, tpr_train, linestyle = '--', label='ROC of Single Gaussian using Training Data')
plt.plot(fpr_val, tpr_val, marker='.', label='ROC of Single Gaussian using Validation Data')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
from sklearn.metrics import roc_auc_score
# calculate the AUC scores
print ("AUC of Single Gaussian using Training Data" , format( roc_auc_score(y_train, -1* p_train) , ".3f") )
print ("AUC of Single Gaussian using Validation Data" , format( roc_auc_score(y_val, -1* p_val) , ".3f") )
3.1.1) Fit single gaussian distribution on full training data (both class).
3.1.2) Compute AUC on full training data and validation data (both class).
3.1.3) Repeat for each single feature.
from sklearn.mixture import GaussianMixture
from sklearn.metrics import roc_auc_score
import pandas as pd
import numpy as np
# define parameters of gaussian mixture model - single gaussian
gm_one = GaussianMixture(n_components = 1,
covariance_type = 'full', random_state=1)
# create empty dataframe for summary of AUC
table = pd.DataFrame()
# empty list for training data AUC scores
AUC_train = []
# empty list for validation data AUC scores
AUC_val = []
# 3.1.3) repeat for each single feature
for i in np.arange(0, len(feature_data.columns)):
    # 3.1.1) fit a single gaussian distribution on full training set (both class)
    # train the gaussian mixture model - unsupervised learning (uses no labels)
    gm_one.fit(np.array(X_train.iloc[:,i]).reshape(-1,1))
    # 3.1.2) compute AUC on full training set AND validation set (both class)
    # log likelihood each sample belongs to cluster (likelihood not outlier)
    p_train = gm_one.score_samples(np.array(X_train.iloc[:,i]).reshape(-1,1))
    p_val = gm_one.score_samples(np.array(X_val.iloc[:,i]).reshape(-1,1))
    # AUC of single gaussian using training data
    AUC_train.append(round(roc_auc_score(y_train, -1* p_train),3))
    # AUC of single gaussian using validation data
    AUC_val.append(round(roc_auc_score(y_val, -1* p_val),3))
# 3.1.3) table of AUC
table['feature'] = df.columns[1:]
table['AUC Train'] = AUC_train
table['AUC Val'] = AUC_val
# 3.1.3) table of AUC
table
3.1.4) Select 3 best features to identify red wine.
# 3.1.4) best 3 features to distinguish red wine from white wine based on AUC of validation data
# sort table and find the three largest AUC_val scores
val1, val2, val3 = sorted(table['AUC Val'])[-3:]
print(f'the three largest AUC scores from validation data are: {val1}, {val2}, {val3}')
# find the feature (index location) associated with the three largest AUC_val scores
x1 = table[table['AUC Val'] == val1]
x2 = table[table['AUC Val'] == val2]
x3 = table[table['AUC Val'] == val3]
frames = [x1,x2,x3]
auc_sum = pd.concat(frames)
auc_sum
From the table of the largest AUC scores on validation data above, the 3 best features to distinguish red wine from white wine are:
- free sulfur dioxide
- chlorides
- total sulfur dioxide
3.1.5) Optimal Threshold to Maximize F1 Score
# 3.1.5) optimal threshold (maximizes F1 score) of validation set for each of 3 best feature
# probability predictions of VALIDATION set:
# train and get probability of NOT outlier for each of the 3 best features:
# score_samples = log likelihood of each sample being generated by the fitted mixture (likelihood of not being an outlier)
# feature: free sulfur dioxide
# train model
gm_one.fit(np.array(X_train['free sulfur dioxide']).reshape(-1,1))
# log(probability) of validation data using feature 1 (larger value = more likely, e.g. -4 > -10)
p_val_f1 = gm_one.score_samples(np.array(X_val['free sulfur dioxide']).reshape(-1,1))
# feature: chlorides
# train model
gm_one.fit(np.array(X_train['chlorides']).reshape(-1,1))
# log(probability) of validation data using feature 2 (larger value = more likely, e.g. -4 > -10)
p_val_f2 = gm_one.score_samples(np.array(X_val['chlorides']).reshape(-1,1))
# feature: total sulfur dioxide
# train model
gm_one.fit(np.array(X_train['total sulfur dioxide']).reshape(-1,1))
# log(probability) of validation data using feature 3 (larger value = more likely, e.g. -4 > -10)
p_val_f3 = gm_one.score_samples(np.array(X_val['total sulfur dioxide']).reshape(-1,1))
from sklearn.metrics import precision_score, recall_score, f1_score
# feature - f1 (free sulfur dioxide)
tr_f1 = []
f1_f1 = []
pre_f1 = []
rec_f1 = []
for i in np.arange(1, len(p_val_f1) - 500, 30): # iterate through different thresholds
    tr = sorted(p_val_f1)[i] # sort the points by probability
    precision = precision_score(y_val, p_val_f1 < tr) # precision = TP / (TP + FP)
    recall = recall_score(y_val, p_val_f1 < tr) # recall = TP / (TP + FN)
    f1 = f1_score(y_val, p_val_f1 < tr) # F1 = 2 * (precision * recall) / (precision + recall)
    tr_f1.append(round(tr,3)) # store the threshold values in a list
    f1_f1.append(round(f1,3)) # store the F1 score values in a list
    pre_f1.append(round(precision,3)) # store the precision values in a list
    rec_f1.append(round(recall,3)) # store the recall values in a list
    print('For k: ',i,'\t threshold: ','%.3f'% tr ,' precision: ', '%.3f' % precision,' recall: ', '%.3f' % recall, ' F1_Score: ','%.3f' % f1)
idx_f1 = f1_f1.index(max(f1_f1)) # index of maximum F1 score value
f1_f1_val = max(f1_f1) # max F1 Score of validation set
pre_f1_val = pre_f1[idx_f1] # corresponding precision score
rec_f1_val = rec_f1[idx_f1] # corresponding recall score
tr_f1_val = tr_f1[idx_f1] # optimal threshold
print(f'\n max F1 score: {f1_f1_val} with precision: {pre_f1_val}, recall: {rec_f1_val} at threshold: {tr_f1_val} for feature: free sulfur dioxide')
# feature - f2 (chlorides)
tr_f2 = []
f1_f2 = []
pre_f2 = []
rec_f2 = []
for i in np.arange(1, len(p_val_f2) - 500, 30): # iterate through different thresholds
    tr = sorted(p_val_f2)[i] # sort the points by probability
    precision = precision_score(y_val, p_val_f2 < tr) # precision = TP / (TP + FP)
    recall = recall_score(y_val, p_val_f2 < tr) # recall = TP / (TP + FN)
    f1 = f1_score(y_val, p_val_f2 < tr) # F1 = 2 * (precision * recall) / (precision + recall)
    tr_f2.append(round(tr,3)) # store the threshold values in a list
    f1_f2.append(round(f1,3)) # store the F1 score values in a list
    pre_f2.append(round(precision,3)) # store the precision values in a list
    rec_f2.append(round(recall,3)) # store the recall values in a list
    print('For k: ',i,'\t threshold: ','%.3f'% tr ,' precision: ', '%.3f' % precision,' recall: ', '%.3f' % recall, ' F1_Score: ','%.3f' % f1)
idx_f2 = f1_f2.index(max(f1_f2)) # index of maximum F1 score value
f1_f2_val = max(f1_f2) # max F1 Score of validation set
pre_f2_val = pre_f2[idx_f2] # corresponding precision score
rec_f2_val = rec_f2[idx_f2] # corresponding recall score
tr_f2_val = tr_f2[idx_f2] # optimal threshold
print(f'\n max F1 score: {f1_f2_val} with precision: {pre_f2_val}, recall: {rec_f2_val} at threshold: {tr_f2_val} for feature: chlorides')
# feature - f3 (total sulfur dioxide)
tr_f3 = []
f1_f3 = []
pre_f3 = []
rec_f3 = []
for i in np.arange(1, len(p_val_f3) - 500, 30): # iterate through different thresholds
    tr = sorted(p_val_f3)[i] # sort the points by probability
    precision = precision_score(y_val, p_val_f3 < tr) # precision = TP / (TP + FP)
    recall = recall_score(y_val, p_val_f3 < tr) # recall = TP / (TP + FN)
    f1 = f1_score(y_val, p_val_f3 < tr) # F1 = 2 * (precision * recall) / (precision + recall)
    tr_f3.append(round(tr,3)) # store the threshold values in a list
    f1_f3.append(round(f1,3)) # store the F1 score values in a list
    pre_f3.append(round(precision,3)) # store the precision values in a list
    rec_f3.append(round(recall,3)) # store the recall values in a list
    print('For k: ',i,'\t threshold: ','%.3f'% tr ,' precision: ', '%.3f' % precision,' recall: ', '%.3f' % recall, ' F1_Score: ','%.3f' % f1)
idx_f3 = f1_f3.index(max(f1_f3)) # index of maximum F1 score value
f1_f3_val = max(f1_f3) # max F1 Score of validation set
pre_f3_val = pre_f3[idx_f3] # corresponding precision score
rec_f3_val = rec_f3[idx_f3] # corresponding recall score
tr_f3_val = tr_f3[idx_f3] # optimal threshold
print(f'\n max F1 score: {f1_f3_val} with precision: {pre_f3_val}, recall: {rec_f3_val} at threshold: {tr_f3_val} for feature: total sulfur dioxide')
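The three searches above are identical except for the score array; a small helper (hypothetical, not in the original notebook, assuming the imports above) could express the same grid search once and be reused per feature:
def best_threshold(y_true, scores, step=30, tail=500):
    # grid-search a log-likelihood threshold that maximizes F1, using the same scheme as above
    best = (0.0, 0.0, 0.0, None)  # (f1, precision, recall, threshold)
    sorted_scores = sorted(scores)
    for i in np.arange(1, len(scores) - tail, step):
        tr = sorted_scores[i]
        pred = scores < tr
        f1 = f1_score(y_true, pred)
        if f1 > best[0]:
            best = (f1, precision_score(y_true, pred), recall_score(y_true, pred), tr)
    return best
# e.g. best_threshold(y_val, p_val_f1) reproduces the feature-f1 search above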
3.1.6) Table: precision, recall and F1 Score on training data and validation data
- using optimal threshold
# 3.1.6) summary of performance metrics on training set of f1, f2, f3 features using the optimal threshold on TRAINING DATA
# feature: free sulfur dioxide
# train model
gm_one.fit(np.array(X_train['free sulfur dioxide']).reshape(-1,1))
# log(probability) of training data using feature 1 (larger value = more likely, e.g. -4 > -10)
p_train_f1 = gm_one.score_samples(np.array(X_train['free sulfur dioxide']).reshape(-1,1))
# feature: chlorides
# train model
gm_one.fit(np.array(X_train['chlorides']).reshape(-1,1))
# log(probability) of training data using feature 2 (larger value = more likely, e.g. -4 > -10)
p_train_f2 = gm_one.score_samples(np.array(X_train['chlorides']).reshape(-1,1))
# feature: total sulfur dioxide
# train model
gm_one.fit(np.array(X_train['total sulfur dioxide']).reshape(-1,1))
# log(probability) of training data using feature 3 (larger value = more likely, e.g. -4 > -10)
p_train_f3 = gm_one.score_samples(np.array(X_train['total sulfur dioxide']).reshape(-1,1))
# feature - f1 (free sulfur dioxide)
tr_f1_val = tr_f1[idx_f1] # optimal threshold found previously from validation set
pre_f1_train = precision_score(y_train, p_train_f1 < tr_f1_val) # precision = TP / TP + FP
rec_f1_train = recall_score(y_train, p_train_f1 < tr_f1_val) # recall = TP / TP + FN
f1_f1_train = f1_score(y_train, p_train_f1 < tr_f1_val) # F1 = 2 * (precision * recall) / (precision + recall)
print(
f'''F1 score: {round(f1_f1_train,3)} with precision: {round(pre_f1_train,3)},
recall: {round(rec_f1_train,3)} at threshold: {round(tr_f1_val,3)} for feature "free sulfur dioxide" on training set
''')
# feature - f2 (chlorides)
tr_f2_val = tr_f2[idx_f2] # optimal threshold found previously from validation set
pre_f2_train = precision_score(y_train, p_train_f2 < tr_f2_val) # precision = TP / TP + FP
rec_f2_train = recall_score(y_train, p_train_f2 < tr_f2_val) # recall = TP / TP + FN
f1_f2_train = f1_score(y_train, p_train_f2 < tr_f2_val) # F1 = 2 * (precision * recall) / (precision + recall)
print(
f'''F1 score: {round(f1_f2_train,3)} with precision: {round(pre_f2_train,3)},
recall: {round(rec_f2_train,3)} at threshold: {round(tr_f2_val,3)} for feature "chlorides" on training set
''')
# feature - f3 (total sulfur dioxide)
tr_f3_val = tr_f3[idx_f3] # optimal threshold found previously from validation set
pre_f3_train= precision_score(y_train, p_train_f3 < tr_f3_val) # precision = TP / TP + FP
rec_f3_train= recall_score(y_train, p_train_f3 < tr_f3_val) # recall = TP / TP + FN
f1_f3_train= f1_score(y_train, p_train_f3 < tr_f3_val) # F1 = 2 * (precision * recall) / (precision + recall)
print(
f'''F1 score: {round(f1_f3_train,3)} with precision: {round(pre_f3_train,3)},
recall: {round(rec_f3_train,3)} at threshold: {round(tr_f3_val,3)} for feature "total sulfur dioxide" on training set
''')
# summary of performance metrics
# create empty dataframe for summary of performance metrics
table_p6 = pd.DataFrame()
# features from both datasets
feat = ['f1_val','f2_val','f3_val','f1_train','f2_train','f3_train']
# optimal threshold
tr_val = [tr_f1_val, tr_f2_val, tr_f3_val, tr_f1_val, tr_f2_val, tr_f3_val]
# precision scores
pre_val = [pre_f1_val, pre_f2_val, pre_f3_val, pre_f1_train, pre_f2_train, pre_f3_train]
# recall scores
rec_val = [rec_f1_val, rec_f2_val, rec_f3_val, rec_f1_train, rec_f2_train, rec_f3_train]
# f1 scores
f1_val = [f1_f1_val, f1_f2_val, f1_f3_val, f1_f1_train, f1_f2_train, f1_f3_train]
AUC = [auc_sum.iloc[0,2],auc_sum.iloc[2,2],auc_sum.iloc[1,2],auc_sum.iloc[0,1],auc_sum.iloc[2,1],auc_sum.iloc[1,1]] #AUC values
# round the values to 3 decimal places:
pre_val_ = []
rec_val_ = []
f1_val_ = []
for i in np.arange(0, len(pre_val)):
    pre_val_.append(round(pre_val[i],3)) # round precision values
    rec_val_.append(round(rec_val[i],3)) # round recall values
    f1_val_.append(round(f1_val[i],3)) # round f1 score values
# fill dataframe table
table_p6['Features'] = feat
table_p6['Opt. Threshold'] = tr_val
table_p6['Precision'] = pre_val_
table_p6['Recall'] = rec_val_
table_p6['F1 Score'] = f1_val_
table_p6['AUC'] = AUC
# print dataframe for performance metrics results
table_p6
3.1.7) Use the 3 best features with best AUC
3.1.8) Fit Gaussian only on RED WINE in the training data
# fitting based on class:
# 3.1.7) the 3 features that had the best AUC was "free sulfur dioxide", "chlorides", "total sulfur dioxide"
# 3.1.8) repeat but only fit gaussian on RED WINE DATA IN THE TRAINING SET
X_train_2b = X_train.copy() # copy the training set
X_train_2b['wine'] = y_train # add column for wine labels
X_train_2b = X_train_2b.where(X_train_2b['wine'] == 0).dropna() # only keep red wine data
# define parameters of gaussian mixture model - single feature
gm_one_nf = GaussianMixture(n_components = 1,
covariance_type = 'full', random_state=1)
3.1.9) Compute AUC, F1 Score, Precision, Recall
Best Feature 1: Free Sulfur Dioxide
# feature - f1 (free sulfur dioxide)
# train model using f1 feature - RED WINE DATA ONLY
gm_one_nf.fit(np.array(X_train_2b['free sulfur dioxide']).reshape(-1,1))
# compute AUC on full training set AND validation set (RED WINE class)
# log likelihood each sample belongs to cluster (likelihood not outlier)
p_train_nf = gm_one_nf.score_samples(np.array(X_train['free sulfur dioxide']).reshape(-1,1))
p_val_nf = gm_one_nf.score_samples(np.array(X_val['free sulfur dioxide']).reshape(-1,1))
AUC_train_f1 = round(roc_auc_score(y_train, -1 * p_train_nf),3)
AUC_val_f1 = round(roc_auc_score(y_val, -1 * p_val_nf),3)
print(f'AUC of feature f1 from training set: {AUC_train_f1}')
print(f'AUC of feature f1 from validation set: {AUC_val_f1}')
# optimal threshold (maximizes F1 score) of Validation set for feature f1
tr_f1 = []
f1_f1 = []
pre_f1 = []
rec_f1 = []
for i in np.arange(1, len(p_val_nf) - 500, 30): # iterate through different thresholds
    tr = sorted(p_val_nf)[i] # sort the points by probability
    precision = precision_score(y_val, p_val_nf < tr) # precision = TP / (TP + FP)
    recall = recall_score(y_val, p_val_nf < tr) # recall = TP / (TP + FN)
    f1 = f1_score(y_val, p_val_nf < tr) # F1 = 2 * (precision * recall) / (precision + recall)
    tr_f1.append(round(tr,3)) # store the threshold values in a list
    f1_f1.append(round(f1,3)) # store the F1 score values in a list
    pre_f1.append(round(precision,3)) # store the precision values in a list
    rec_f1.append(round(recall,3)) # store the recall values in a list
idx_f1 = f1_f1.index(max(f1_f1)) # index of maximum F1 score value
f1_f1_val = max(f1_f1) # max F1 Score of validation set
pre_f1_val = pre_f1[idx_f1] # corresponding precision score
rec_f1_val = rec_f1[idx_f1] # corresponding recall score
tr_f1_val = tr_f1[idx_f1] # optimal threshold
print(
f'''\n feature f1 Validation Data: \n max F1 score: \t {f1_f1_val} \n precision: \t {pre_f1_val}
recall: \t {rec_f1_val} \n threshold: \t {tr_f1_val}\n
''')
#summary of performance metrics on training set of f1 feature using the optimal threshold
tr_f1_val = tr_f1[idx_f1] # optimal threshold found previously from validation set
pre_f1_train = precision_score(y_train, p_train_nf < tr_f1_val) # precision = TP / TP + FP
rec_f1_train = recall_score(y_train, p_train_nf < tr_f1_val) # recall = TP / TP + FN
f1_f1_train = f1_score(y_train, p_train_nf < tr_f1_val) # F1 = 2 * (precision * recall) / (precision + recall)
print(
f''' feature f1 Training Data: \n F1 score: \t {round(f1_f1_train,3)} \n precision: \t {round(pre_f1_train,3)}
recall: \t {round(rec_f1_train,3)} \n threshold: \t {round(tr_f1_val,3)}
''')
Best Feature 2: Chlorides
# feature - f2 (chlorides)
# train model using f2 feature - RED WINE DATA ONLY
gm_one_nf.fit(np.array(X_train_2b['chlorides']).reshape(-1,1))
# compute AUC on full training set AND validation set (RED WINE class)
# log likelihood each sample belongs to cluster (likelihood not outlier)
p_train_nf = gm_one_nf.score_samples(np.array(X_train['chlorides']).reshape(-1,1))
p_val_nf = gm_one_nf.score_samples(np.array(X_val['chlorides']).reshape(-1,1))
AUC_train_f2 = round(roc_auc_score(y_train, -1 * p_train_nf),3)
AUC_val_f2 = round(roc_auc_score(y_val, -1 * p_val_nf),3)
print(f'AUC of feature f2 from training set: {AUC_train_f2}')
print(f'AUC of feature f2 from validation set: {AUC_val_f2}')
# optimal threshold (maximizes F1 score) of Validation set for each of 3 best feature
tr_f2 = []
f1_f2 = []
pre_f2 = []
rec_f2 = []
for i in np.arange(1, len(p_val_nf) - 500, 30): # iterate through different thresholds
    tr = sorted(p_val_nf)[i] # sort the points by probability
    precision = precision_score(y_val, p_val_nf < tr) # precision = TP / (TP + FP)
    recall = recall_score(y_val, p_val_nf < tr) # recall = TP / (TP + FN)
    f1 = f1_score(y_val, p_val_nf < tr) # F1 = 2 * (precision * recall) / (precision + recall)
    tr_f2.append(round(tr,3)) # store the threshold values in a list
    f1_f2.append(round(f1,3)) # store the F1 score values in a list
    pre_f2.append(round(precision,3)) # store the precision values in a list
    rec_f2.append(round(recall,3)) # store the recall values in a list
idx_f2 = f1_f2.index(max(f1_f2)) # index of maximum F1 score value
f1_f2_val = max(f1_f2) # max F1 Score of validation set
pre_f2_val = pre_f2[idx_f2] # corresponding precision score
rec_f2_val = rec_f2[idx_f2] # corresponding recall score
tr_f2_val = tr_f2[idx_f2] # optimal threshold
print(
f'''\n feature f2 Validation Data: \n max F1 score: \t {f1_f2_val} \n precision: \t {pre_f2_val}
recall: \t {rec_f2_val} \n threshold: \t {tr_f2_val}\n
''')
#f2 feature
tr_f2_val = tr_f2[idx_f2] # optimal threshold found previously from validation set
pre_f2_train = precision_score(y_train, p_train_nf < tr_f2_val) # precision = TP / TP + FP
rec_f2_train = recall_score(y_train, p_train_nf < tr_f2_val) # recall = TP / TP + FN
f1_f2_train = f1_score(y_train, p_train_nf < tr_f2_val) # F1 = 2 * (precision * recall) / (precision + recall)
print(
f''' feature f2 Training Data: \n F1 score: \t {round(f1_f2_train,3)} \n precision: \t {round(pre_f2_train,3)}
recall: \t {round(rec_f2_train,3)} \n threshold: \t {round(tr_f2_val,3)}
''')
Best Feature 3: Total Sulfur Dioxide
# feature - f3 (total sulfur dioxide)
# train model using f3 feature - RED WINE DATA ONLY
gm_one_nf.fit(np.array(X_train_2b['total sulfur dioxide']).reshape(-1,1))
# compute AUC on full training set AND validation set (RED WINE class)
# log likelihood each sample belongs to cluster (likelihood not outlier)
p_train_nf = gm_one_nf.score_samples(np.array(X_train['total sulfur dioxide']).reshape(-1,1))
p_val_nf = gm_one_nf.score_samples(np.array(X_val['total sulfur dioxide']).reshape(-1,1))
AUC_train_f3 = round(roc_auc_score(y_train, -1 * p_train_nf),3)
AUC_val_f3 = round(roc_auc_score(y_val, -1 * p_val_nf),3)
print(f'AUC of feature f3 from training set: {AUC_train_f3}')
print(f'AUC of feature f3 from validation set: {AUC_val_f3}')
# optimal threshold (maximizes F1 score) of Validation set for feature f3
tr_f3 = []
f1_f3 = []
pre_f3 = []
rec_f3 = []
for i in np.arange(1, len(p_val_nf) - 500, 30): # iterate through different thresholds
    tr = sorted(p_val_nf)[i] # sort the points by probability
    precision = precision_score(y_val, p_val_nf < tr) # precision = TP / (TP + FP)
    recall = recall_score(y_val, p_val_nf < tr) # recall = TP / (TP + FN)
    f1 = f1_score(y_val, p_val_nf < tr) # F1 = 2 * (precision * recall) / (precision + recall)
    tr_f3.append(round(tr,3)) # store the threshold values in a list
    f1_f3.append(round(f1,3)) # store the F1 score values in a list
    pre_f3.append(round(precision,3)) # store the precision values in a list
    rec_f3.append(round(recall,3)) # store the recall values in a list
idx_f3 = f1_f3.index(max(f1_f3)) #index of maximum F1 score value
f1_f3_val = max(f1_f3) #max F1 Score of validation set
pre_f3_val = pre_f3[idx_f3] #corresponding precision score
rec_f3_val = rec_f3[idx_f3] #corresponding recall score
tr_f3_val = tr_f3[idx_f3] #optimal threshold
print(
f'''\n feature f3 Validation Data: \n max F1 score: \t {f1_f3_val} \n precision: \t {pre_f3_val}
recall: \t {rec_f3_val} \n threshold: \t {tr_f3_val}\n
''')
#f3 feature
tr_f3_val = tr_f3[idx_f3] #optimal threshold found previously from validation set
pre_f3_train = precision_score(y_train, p_train_nf < tr_f3_val) #precision = TP / TP + FP
rec_f3_train = recall_score(y_train, p_train_nf < tr_f3_val) #recall = TP / TP + FN
f1_f3_train = f1_score(y_train, p_train_nf < tr_f3_val) #F1 = 2 * (precision * recall) / (precision + recall)
print(
f''' feature f3 Training Data: \n F1 score: \t {round(f1_f3_train,3)} \n precision: \t {round(pre_f3_train,3)}
recall: \t {round(rec_f3_train,3)} \n threshold: \t {round(tr_f3_val,3)}
''')
3.1.10) Compare results when fitting gaussian regardless/based on class
# create NEW empty dataframe for summary of performance metrics
table_b = pd.DataFrame()
# features from both datasets
feat = ['f1_val','f2_val','f3_val','f1_train','f2_train','f3_train']
# optimal threshold
tr_val = [tr_f1_val, tr_f2_val, tr_f3_val, tr_f1_val, tr_f2_val, tr_f3_val]
# precision scores
pre_val = [pre_f1_val, pre_f2_val, pre_f3_val, pre_f1_train, pre_f2_train, pre_f3_train]
# recall scores
rec_val = [rec_f1_val, rec_f2_val, rec_f3_val, rec_f1_train, rec_f2_train, rec_f3_train]
# f1 scores
f1_val = [f1_f1_val, f1_f2_val, f1_f3_val, f1_f1_train, f1_f2_train, f1_f3_train]
# AUC values
AUC = [AUC_val_f1,AUC_val_f2,AUC_val_f3,AUC_train_f1,AUC_train_f2,AUC_train_f3]
# round the values to 3 decimal places:
pre_val_ = []
rec_val_ = []
f1_val_ = []
for i in np.arange(0, len(pre_val)):
    pre_val_.append(round(pre_val[i],3)) # round precision values
    rec_val_.append(round(rec_val[i],3)) # round recall values
    f1_val_.append(round(f1_val[i],3)) # round f1 score values
# fill dataframe table
table_b['Features'] = feat
table_b['Opt. Threshold (RW)'] = tr_val
table_b['Precision (RW)'] = pre_val_
table_b['Recall (RW)'] = rec_val_
table_b['F1 Score (RW)'] = f1_val_
table_b['AUC (RW)'] = AUC
# print dataframe for performance metrics results
table_b
# 3.1.10) Compare results from Part 2a and 2b in a table
# merge the two results
result = pd.merge(table_p6, table_b, on = "Features")
result
As seen in the table above, the AUC values when fitting the gaussian regardless of class and when fitting it only on the red-wine class are nearly identical; the precision, recall and F1 Score are also very similar. This suggests that, for these single features, the gaussian fitted on red wine alone scores samples much like the one fitted on all wines; note also that the dataset is imbalanced, with white wine making up 74.96% (the majority) of the observations.
3.2) Gaussian Model (ONE) - Features (TWO)
- MULTIPLE FEATURES (Set the number of components VISUALLY)
2D Plot:
3.2.1) Scatter plot two features from training data
- plt.scatter
- x-axis: f1, y-axis: f2 - colour based on class (white wine vs red wine)
3.2.2) Select number of Gaussian components required to fit data from the plot
- n_components
3.2.3) Fit Gaussian model on the entire training data
3.2.4) Calculate AUC on training data and validation data
3.2.5) Repeat for two other pair of features
3.2.6) Select pair of features with highest AUC on validation data
3.2.7) Select threshold that maximizes F1 Score on validation data
3.2.8) Scatter plot two separate figures (training vs validation)
- circle outliers based on threshold
3D Plot:
3.2.9) Use the 3 best features with best AUC
3.2.10) Repeat 3.2.1) to 3.2.4)
3.2.11) Select threshold that maximizes F1 Score on validation data
3.2.1) Scatter plot two features from training data
- Feature Pair 1: Fixed Acidity and Density
import matplotlib.pyplot as plt
import matplotlib
fig = plt.figure(figsize=(8,8))
# (red wine (0) = red, white wine (1) = blue) to detect 0
colors = ['red','blue']
plt.scatter(X_train['fixed acidity'], X_train['density'], s=3, c = y_train, cmap=matplotlib.colors.ListedColormap(colors))
plt.title("Feature: Fixed Acidity vs Density")
plt.xlabel("Fixed Acidity")
plt.ylabel("Density")
3.2.2) Select number of Gaussian components required to fit data from the plot
- Based on the plot above, one gaussian component (n_components = 1) is required to fit the white wine data (blue)
- AUC (training set) = 0.476 & AUC (validation set) = 0.467, computed below
3.2.3) Fit Gaussian model on the entire training data
# define parameters of gaussian mixture model - one gaussian component
gm_3a = GaussianMixture(n_components = 1,
covariance_type = 'full', random_state = 1)
# train the gaussian mixture model - unsupervised learning (uses no labels)
gm_3a.fit(np.array([X_train['fixed acidity'],X_train['density']]).T)
# np.array([X_train['total sulfur dioxide'],X_train['chlorides']]).T.shape
3.2.4) Calculate AUC on training data and validation data
# log likelihood each sample belongs to cluster - USING TWO FEATURES
p_train = gm_3a.score_samples(np.array([X_train['fixed acidity'],X_train['density']]).T)
p_val = gm_3a.score_samples(np.array([X_val['fixed acidity'],X_val['density']]).T)
# AUC - Area Under Curve
AUC_train = round(roc_auc_score(y_train, -1 * p_train), 3)
AUC_val = round(roc_auc_score(y_val, -1 * p_val), 3)
print(f'AUC of features "fixed acidity" and "density" from training set: {AUC_train}')
print(f'AUC of features "fixed acidity" and "density" from validation set: {AUC_val}')
3.2.5) Repeat for two other pair of features
- Feature Pair 2: citric acid and residual sugar
- Feature Pair 3: pH and sulfates
#3.2.5) Feature Pair 2: citric acid and residual sugar
# scatter plot
fig = plt.figure(figsize=(8,8))
# (red wine (0) = red, white wine (1) = blue) to detect 0
colors = ['red','blue']
plt.scatter(X_train['citric acid'], X_train['residual sugar'], s=3, c = y_train, cmap=matplotlib.colors.ListedColormap(colors))
plt.title("Feature: citric acid vs residual sugar")
plt.xlabel("citric acid")
plt.ylabel("residual sugar")
# fit gaussian model on training set (all samples)
gm_3a = GaussianMixture(n_components = 3,
covariance_type = 'full', random_state=1)
# train the gaussian mixture model - unsupervised learning (uses no labels)
gm_3a.fit(np.array([X_train['citric acid'],X_train['residual sugar']]).T)
# np.array([X_train['citric acid'],X_train['residual sugar']]).T.shape
# compute AUC on both training and validation sets
# log likelihood each sample belongs to cluster - USING TWO FEATURES
p_train = gm_3a.score_samples(np.array([X_train['citric acid'],X_train['residual sugar']]).T)
p_val = gm_3a.score_samples(np.array([X_val['citric acid'],X_val['residual sugar']]).T)
AUC_train = round(roc_auc_score(y_train, -1 * p_train),3)
AUC_val = round(roc_auc_score(y_val, -1 * p_val),3)
print(f'AUC of features citric acid and residual sugar from training set: {AUC_train}')
print(f'AUC of features citric acid and residual sugar from validation set: {AUC_val}')
Note: Based on the plot of citric acid and residual sugar above, three gaussian components (n_components = 3) are required to fit the white wine data (blue), as there appear to be 3 dense clusters
- AUC (training set) = 0.630 & AUC (validation set) = 0.640
#3.2.5) Feature Pair 3: pH and sulphates
# scatter plot
fig = plt.figure(figsize=(8,8))
# (red wine (0) = red, white wine (1) = blue) to detect 0
colors = ['red','blue']
plt.scatter(X_train['pH'], X_train['sulphates'], s=3,c = y_train, cmap=matplotlib.colors.ListedColormap(colors))
plt.title("Feature: pH vs sulphates")
plt.xlabel("pH")
plt.ylabel("sulphates")
# fit gaussian model on training set (all samples)
gm_3a = GaussianMixture(n_components = 1,
covariance_type = 'full', random_state=1)
# train the gaussian mixture model - unsupervised learning (uses no labels)
gm_3a.fit(np.array([X_train['pH'],X_train['sulphates']]).T)
# np.array([X_train['pH'],X_train['sulphates']]).T.shape
# compute AUC on both training and validation sets
# log likelihood each sample belongs to cluster - USING TWO FEATURES
p_train = gm_3a.score_samples(np.array([X_train['pH'],X_train['sulphates']]).T)
p_val = gm_3a.score_samples(np.array([X_val['pH'],X_val['sulphates']]).T)
AUC_train = round(roc_auc_score(y_train, -1 * p_train),3)
AUC_val = round(roc_auc_score(y_val, -1 * p_val),3)
print(f'AUC of features pH and sulphates from training set: {AUC_train}')
print(f'AUC of features pH and sulphates from validation set: {AUC_val}')
Note: Based on the plot of pH and sulphates above, one gaussian component (n_components = 1) is required to fit the white wine data (blue)
- AUC (training set) = 0.405 & AUC (validation set) = 0.399
3.2.6) Select pair of features with highest AUC on validation data
- Feature Pair 2: citric acid and residual sugar
3.2.7) Select threshold that maximizes F1 Score on validation data
- Threshold to maximize F1 Score on the validation set when using features (citric acid, residual sugar) is -1.869
- Results in an F1 Score of 0.62, precision of 0.695 and recall value of 0.559
#3.2.7) optimal threshold (maximizes F1 score) of validation set for best pair of features
# define parameters of gaussian mixture model - one gaussian component
gm_3a = GaussianMixture(n_components = 1,
covariance_type = 'full', random_state=1)
# train the gaussian mixture model - unsupervised learning (uses no labels)
gm_3a.fit(np.array([X_train['citric acid'],X_train['residual sugar']]).T)
# log likelihood each sample belongs to cluster - USING TWO FEATURES
p_val_cr = gm_3a.score_samples(np.array([X_val['citric acid'],X_val['residual sugar']]).T)
tr_cr = []
f1_cr = []
pre_cr = []
rec_cr = []
for i in np.arange(1, len(p_val_cr) - 500, 15): # iterate through different thresholds
    tr = sorted(p_val_cr)[i] # sort the points by probability
    precision = precision_score(y_val, p_val_cr < tr) # precision = TP / (TP + FP)
    recall = recall_score(y_val, p_val_cr < tr) # recall = TP / (TP + FN)
    f1 = f1_score(y_val, p_val_cr < tr) # F1 = 2 * (precision * recall) / (precision + recall)
    tr_cr.append(round(tr,3)) # store the threshold values in a list
    f1_cr.append(round(f1,3)) # store the F1 score values in a list
    pre_cr.append(round(precision,3)) # store the precision values in a list
    rec_cr.append(round(recall,3)) # store the recall values in a list
    print('For k: ',i,'\t threshold: ','%.3f'% tr ,' precision: ', '%.3f' % precision,' recall: ', '%.3f' % recall, ' F1_Score: ','%.3f' % f1)
idx_cr = f1_cr.index(max(f1_cr)) # index of maximum F1 score value
f1_cr_val = max(f1_cr) # max F1 Score of validation set
pre_cr_val = pre_cr[idx_cr] # corresponding precision score
rec_cr_val = rec_cr[idx_cr] # corresponding recall score
tr_cr_val = tr_cr[idx_cr] # optimal threshold
print(f'\n max F1 score: {f1_cr_val} with precision: {pre_cr_val}, recall: {rec_cr_val} at threshold: {tr_cr_val} for feature citric acid and residual sugar')
3.2.8) Scatter plot two separate figures (training vs validation)
# create numpy array of features to plot outliers
X_train_10 = np.array([X_train['citric acid'],X_train['residual sugar']]).T
#3.2.8) color the TRAINING set based on class
fig = plt.figure(figsize=(8,8))
# (red wine (0) = red, white wine (1) = blue) to detect 0
colors = ['red','blue']
plt.scatter(X_train['citric acid'], X_train['residual sugar'], s=3, c = y_train, cmap=matplotlib.colors.ListedColormap(colors))
plt.title("Feature citric acid vs residual sugar")
plt.xlabel("Feature citric acid")
plt.ylabel("Feature residual sugar")
# optimal threshold using features: citric acid and residual sugar
threshold = tr_cr_val
# log likelihood each sample belongs to cluster - USING TWO FEATURES
p_train_cr = gm_3a.score_samples(np.array([X_train['citric acid'],X_train['residual sugar']]).T)
# determine the outliers
outliers = np.nonzero(p_train_cr < threshold)[0]
# plot the outliers
plt.scatter(X_train_10[outliers,0],X_train_10[outliers,1],marker="o",facecolor= "none",edgecolor="y",s=70)
plt.show()
# create numpy array of features to plot outliers
X_val_10 = np.array([X_val['citric acid'],X_val['residual sugar']]).T
# 3.2.8) color the VALIDATION set based on class
fig = plt.figure(figsize=(8,8))
# (red wine (0) = red, white wine (1) = blue) to detect 0
colors = ['red','blue']
plt.scatter(X_val['citric acid'], X_val['residual sugar'], s=3, c = y_val, cmap=matplotlib.colors.ListedColormap(colors))
plt.title("Feature citric acid vs residual sugar")
plt.xlabel("Feature citric acid")
plt.ylabel("Feature residual sugar")
# optimal threshold using features: citric acid and residual sugar
threshold = tr_cr_val
# determine the outliers
outliers = np.nonzero(p_val_cr < threshold)[0]
# plot the outliers
plt.scatter(X_val_10[outliers,0],X_val_10[outliers,1],marker="o",facecolor= "none",edgecolor="y",s=70)
plt.show()
3.2.9) Use the 3 best features with best AUC
- Free Sulfur Dioxide
- Chlorides
- Total Sulfur Dioxide
3.2.10) Repeat 3.2.1) to 3.2.4)
# function for 3D plotting
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
def plt3d(X_train, y_train, f1, f2, f3, angle):
    X_train_2f = pd.DataFrame(data = X_train, columns=[f1, f2, f3])
    fig = plt.figure()
    ax = plt.axes(projection='3d')
    ax.scatter3D(X_train_2f[f1][y_train==1], X_train_2f[f2][y_train==1], X_train_2f[f3][y_train==1], c='b', marker='x', label='White Wine')
    ax.scatter3D(X_train_2f[f1][y_train==0], X_train_2f[f2][y_train==0], X_train_2f[f3][y_train==0], c='r', marker='o', label="Red Wine")
    ax.set_xlabel(f1)
    ax.set_ylabel(f2)
    ax.set_zlabel(f3)
    plt.legend()
    ax.view_init(30, angle)
    plt.show()
plt3d(X_train, y_train, f1="free sulfur dioxide", f2="chlorides", f3="total sulfur dioxide", angle=130 )
Based on the 3D plot of the three features above, one gaussian component (n_components = 1) is required to fit the white wine data (blue)
np.array([X_train['free sulfur dioxide'],X_train['chlorides'],X_train['total sulfur dioxide']]).T.shape
# fit gaussian model on training set (all samples)
# define parameters of gaussian mixture model - one gaussian component
gm_3b = GaussianMixture(n_components = 1,
covariance_type = 'full', random_state=1)
# train the gaussian mixture model - unsupervised learning (uses no labels)
gm_3b.fit(np.array([X_train['free sulfur dioxide'],X_train['chlorides'],X_train['total sulfur dioxide']]).T)
# compute AUC on both training and validation sets
# log likelihood each sample belongs to cluster - USING THREE FEATURES
p_train = gm_3b.score_samples(np.array([X_train['free sulfur dioxide'],X_train['chlorides'],X_train['total sulfur dioxide']]).T)
p_val = gm_3b.score_samples(np.array([X_val['free sulfur dioxide'],X_val['chlorides'],X_val['total sulfur dioxide']]).T)
# AUC - Area Under Curve
AUC_train = round(roc_auc_score(y_train, -1 * p_train),3)
AUC_val = round(roc_auc_score(y_val, -1 * p_val),3)
print(f'AUC of features free sulfur dioxide, chlorides, and total sulfur dioxide from training set: {AUC_train}')
print(f'AUC of features free sulfur dioxide, chlorides, and total sulfur dioxide from validation set: {AUC_val}')
- AUC (training set) = 0.304 & AUC (validation set) = 0.297
3.2.11) Select threshold that maximizes F1 Score on validation data
# optimal threshold (maximizes F1 score) of Validation set for 3 best features model
# log likelihood each sample belongs to cluster - USING THREE FEATURES
p_val_3b = gm_3b.score_samples(np.array([X_val['free sulfur dioxide'],X_val['chlorides'],X_val['total sulfur dioxide']]).T)
tr_3b = []
f1_3b = []
pre_3b = []
rec_3b = []
for i in np.arange(1, len(p_val_3b) - 500, 30): # iterate through different thresholds
    tr = sorted(p_val_3b)[i] # sort the points by probability
    precision = precision_score(y_val, p_val_3b < tr) # precision = TP / (TP + FP)
    recall = recall_score(y_val, p_val_3b < tr) # recall = TP / (TP + FN)
    f1 = f1_score(y_val, p_val_3b < tr) # F1 = 2 * (precision * recall) / (precision + recall)
    tr_3b.append(round(tr,3)) # store the threshold values in a list
    f1_3b.append(round(f1,3)) # store the F1 score values in a list
    pre_3b.append(round(precision,3)) # store the precision values in a list
    rec_3b.append(round(recall,3)) # store the recall values in a list
    print('For k: ',i,'\t threshold: ','%.3f'% tr ,' precision: ', '%.3f' % precision,' recall: ', '%.3f' % recall, ' F1_Score: ','%.3f' % f1)
idx_3b = f1_3b.index(max(f1_3b)) # index of maximum F1 score value
f1_3b_val = max(f1_3b) # max F1 Score of validation set
pre_3b_val = pre_3b[idx_3b] # corresponding precision score
rec_3b_val = rec_3b[idx_3b] # corresponding recall score
tr_3b_val = tr_3b[idx_3b] # optimal threshold
print(f'\n max F1 score: {f1_3b_val} with precision: {pre_3b_val}, recall: {rec_3b_val} at threshold: {tr_3b_val} for feature free sulfur dioxide, chlorides and total sulfur dioxide')
The optimal threshold to maximize the F1 Score on the validation data when using the three best features is -6.613, which results in an F1 Score of 0.568, precision of 0.638 and recall value of 0.512
3.3) Gaussian Model (TWO) - Feature (ONE)
- one Gaussian model for White Wine, one Gaussian model for Red Wine
3.3.1) Fit Gaussian model (G1) on a feature of White Wine
- sklearn.mixture.GaussianMixture
- n_components = 1
3.3.2) Fit Gaussian model (G2) on same feature for Red Wine
- sklearn.mixture.GaussianMixture
- n_components = 1
3.3.3) Score samples on G1 & G2 validation data to get S1 & S2
- score_samples
3.3.4) Find the optimal real number c that maximizes the validation data F1 Score for a model such that if S1 < c * S2, the wine is classified as Red Wine
- ex: with c = 1, if S1 < S2 the wine belongs to the G2 (Red Wine) distribution
3.3.5) Repeat for all features
- identify feature and optimal c: best F1 score, precision, recall for training and validation data
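Since S1 and S2 are log-likelihoods, the rule S1 < c * S2 is a scaled comparison of log-likelihoods (with c = 1 it reduces to assigning the wine to whichever distribution gives the higher log-likelihood). A minimal sketch of the resulting classifier (a hypothetical helper, assuming fitted models G1 and G2 as built below):
import numpy as np
def classify_red(x, G1, G2, c=1.0):
    # label a 1-D feature array as red wine wherever S1 < c * S2
    x = np.asarray(x).reshape(-1, 1)
    S1 = G1.score_samples(x)  # log-likelihood under the white-wine Gaussian (G1)
    S2 = G2.score_samples(x)  # log-likelihood under the red-wine Gaussian (G2)
    return S1 < c * S2        # True = classified as red wine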
3.3.1) Fit Gaussian model (G1) on a feature of White Wine
# feature chosen for G1 and G2 = free sulfur dioxide
# only fit gaussian on WHITE WINE IN THE TRAINING SET
X_train_3 = X_train.copy() # copy the training set
X_train_3['label'] = y_train # add column for red/white wine labels
X_train_3_nf = X_train_3.where(X_train_3['label'] == 1).dropna() # only keep white wine
# define parameters of gaussian mixture model - single feature
G1 = GaussianMixture(n_components = 1,
covariance_type = 'full', random_state=1)
# train model using free sulfur dioxide feature - WHITE WINE CLASS ONLY
G1.fit(np.array(X_train_3_nf['free sulfur dioxide']).reshape(-1,1))
3.3.2) Fit Gaussian model (G2) on same feature for Red Wine
X_train_3_f = X_train_3.where(X_train_3['label'] == 0).dropna() # only keep red wine
G2 = GaussianMixture(n_components = 1,
covariance_type = 'full', random_state=1)
G2.fit(np.array(X_train_3_f['free sulfur dioxide']).reshape(-1,1))
3.3.3) Score samples on G1 & G2 validation data to get S1 & S2
# log likelihood of belonging to white-wine cluster (larger values)
S1 = G1.score_samples(np.array(X_val['free sulfur dioxide']).reshape(-1,1))
# log likelihood of belonging to red-wine cluster (larger values)
S2 = G2.score_samples(np.array(X_val['free sulfur dioxide']).reshape(-1,1))
max_S1 = max(sorted(S1))
max_S2 = max(sorted(S2))
print(f'max log likelihood of VALIDATION data belonging to WHITE WINE CLUSTER(blue): {round(max_S1,3)} vs RED WINE (red):{round(max_S2,3)}')
3.3.4) Find optimal c
# 3.3.4) Find optimal c
# Graphical Interpretation of G1/White Wine Distribution(Blue) vs G2/Red Wine Distribution(Red)
import seaborn as sns
plt.figure(figsize=(15,4))
ax = plt.gca()
sns.histplot(sorted(S1)[500:], stat="density", kde=True, color="blue", bins=50) # log likelihood of VALIDATION data under the white-wine distribution (G1)
sns.histplot(sorted(S2)[500:], stat="density", kde=True, color="red", bins=50) # log likelihood of VALIDATION data under the red-wine distribution (G2)
ax.legend(['white wine', 'red wine'], loc='best')
ax.set_xlabel('Feature: free sulfur dioxide')
ax.set_title('Distribution of white wine (blue) vs red wine (red) of feature: free sulfur dioxide')
# 3.3.4) Find optimal c
c_fsd = []
f1_fsd = []
pre_fsd = []
rec_fsd = []
for c in np.arange(0.1, 1, 0.1): # iterate through candidate values of c
    tr = c*S2 # threshold is (real number c) * (array of log likelihoods under the RED WINE distribution)
    precision = precision_score(y_val, S1 < tr) # precision = TP / (TP + FP)
    recall = recall_score(y_val, S1 < tr) # recall = TP / (TP + FN)
    f1 = f1_score(y_val, S1 < tr) # F1 = 2 * (precision * recall) / (precision + recall)
    c_fsd.append(round(c,3)) # store the real number values in a list
    f1_fsd.append(round(f1,3)) # store the F1 score values in a list
    pre_fsd.append(round(precision,3)) # store the precision values in a list
    rec_fsd.append(round(recall,3)) # store the recall values in a list
    print('For real number: ', '%.1f' % c ,'\t precision: ', '%.3f' % precision,' recall: ', '%.3f' % recall, ' F1_Score: ','%.3f' % f1)
idx_fsd = f1_fsd.index(max(f1_fsd)) # index of maximum F1 score value
f1_fsd_val = max(f1_fsd) # max F1 Score of validation set
pre_fsd_val = pre_fsd[idx_fsd] # corresponding precision score
rec_fsd_val = rec_fsd[idx_fsd] # corresponding recall score
c_fsd_val = c_fsd[idx_fsd] # optimal real number
print(f'\n max F1 score: {f1_fsd_val} with precision: {pre_fsd_val}, recall: {rec_fsd_val} at real number: {c_fsd_val} for feature free sulfur dioxide')
NOTE: For some features there are precision warning messages because the predictor (S1 < c*S2) labels every sample as white wine (all negatives); with no true positives (TP) and no false positives (FP), precision = TP / (TP + FP) is 0/0 and thus undefined, which triggers the warning.
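These warnings can also be handled per call instead of globally: newer versions of scikit-learn accept a zero_division argument (a minimal sketch using the variables from the loop above):
# report precision as 0 instead of warning when there are no predicted positives
precision = precision_score(y_val, S1 < tr, zero_division=0)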
3.3.5) Repeat for all features - VALIDATION DATA
# report feature and real number (c) with best F1 score, precision, recall for training and validation
import warnings
warnings.filterwarnings('ignore') # "error", "ignore", "always", "default", "module" or "once"
# only fit gaussian on WHITE WINE DATA IN THE TRAINING SET
X_train_3 = X_train.copy() # copy the training set
X_train_3['label'] = y_train # add column for red/white wine labels
X_train_3_nf = X_train_3.where(X_train_3['label'] == 1).dropna() # only keep white wine data
X_train_3_f = X_train_3.where(X_train_3['label'] == 0).dropna() # only keep red wine data
# define empty lists to store performance metrics for each feature:
results = pd.DataFrame() # empty dataframe to store final results as table
features = list(X_train.columns[1:29]) # list of feature names (note: this slice skips the first feature column)
precision_list = [] # list of precision scores
recall_list = [] # list of recall scores
F1_list = [] # list of F1 Scores
c_list = [] # list of optimal number (c)
# iterate through each feature
for V in features:
    # fit a Gaussian distribution ( G1 ) on feature using WHITE WINE
    # define parameters of gaussian mixture model - single feature
    G1 = GaussianMixture(n_components = 1,
                         covariance_type = 'full', random_state=1)
    # train model using feature - WHITE WINE CLASS ONLY
    G1.fit(np.array(X_train_3_nf[V]).reshape(-1,1))
    # fit another Gaussian distribution ( G2 ) on same feature but for RED WINE
    # define parameters of gaussian mixture model - single feature
    G2 = GaussianMixture(n_components = 1,
                         covariance_type = 'full', random_state=1)
    # train model using feature - RED WINE ONLY
    G2.fit(np.array(X_train_3_f[V]).reshape(-1,1))
    # compute the score samples (S1 and S2) for both G1 and G2 on the VALIDATION set
    S1 = G1.score_samples(np.array(X_val[V]).reshape(-1,1)) # log likelihood of belonging to white wine cluster (larger values)
    S2 = G2.score_samples(np.array(X_val[V]).reshape(-1,1)) # log likelihood of belonging to red wine cluster (larger values)
    # find optimal c (a real number) to maximize validation set F1 score
    c_v = []
    f1_v = []
    pre_v = []
    rec_v = []
    for c in np.arange(0.1, 10, 0.1): # iterate through 99 candidate values of c
        tr = c*S2 # threshold is (real number c) * (array of log likelihoods under the RED WINE distribution)
        precision = precision_score(y_val, S1 < tr) # precision = TP / (TP + FP)
        recall = recall_score(y_val, S1 < tr) # recall = TP / (TP + FN)
        f1 = f1_score(y_val, S1 < tr) # F1 = 2 * (precision * recall) / (precision + recall)
        c_v.append(round(c,3)) # store the real number values in a list
        f1_v.append(round(f1,3)) # store the F1 score values in a list
        pre_v.append(round(precision,3)) # store the precision values in a list
        rec_v.append(round(recall,3)) # store the recall values in a list
    idx_v = f1_v.index(max(f1_v)) # index of maximum F1 score value
    f1_v_val = max(f1_v) # max F1 Score of validation set
    pre_v_val = pre_v[idx_v] # corresponding precision score
    rec_v_val = rec_v[idx_v] # corresponding recall score
    c_v_val = c_v[idx_v] # optimal real number
    precision_list.append(pre_v_val)
    recall_list.append(rec_v_val)
    F1_list.append(f1_v_val)
    c_list.append(c_v_val)
results['Features'] = features
results['Optimal c'] = c_list
results['F1 Score'] = F1_list
results['Precision'] = precision_list
results['Recall'] = recall_list
# show the table of results
results
# find the feature and c value with maximum F1 Score
results[results['F1 Score'] == results['F1 Score'].max()]
From the table of resulting performance metrics for each feature above, feature "chlorides" with optimal real number c = 6.9 has the best F1 Score of 0.861, with corresponding precision of 0.755 and recall of 1, using VALIDATION DATA.
3.3.5) Repeat for all features - TRAINING DATA
# report feature and real number (c) with best F1 score, precision, recall for training and validation
import warnings
warnings.filterwarnings('ignore') # "error", "ignore", "always", "default", "module" or "once"
# only fit gaussian on WHITE WINE DATA IN THE TRAINING SET
X_train_3 = X_train.copy() # copy the training set
X_train_3['label'] = y_train # add column for red/white wine labels
X_train_3_nf = X_train_3.where(X_train_3['label'] == 1).dropna() # only keep white wine data
X_train_3_f = X_train_3.where(X_train_3['label'] == 0).dropna() # only keep red wine data
# define empty lists to store performance metrics for each feature:
results = pd.DataFrame() # empty dataframe to store final results as table
features = list(X_train.columns[1:29]) # list of feature names (note: this slice skips the first feature column)
precision_list = [] # list of precision scores
recall_list = [] # list of recall scores
F1_list = [] # list of F1 Scores
c_list = [] # list of optimal number (c)
# iterate through each feature
for V in features:
    # fit a Gaussian distribution ( G1 ) on feature using WHITE WINE
    # define parameters of gaussian mixture model - single feature
    G1 = GaussianMixture(n_components = 1,
                         covariance_type = 'full', random_state=1)
    # train model using feature - WHITE WINE CLASS ONLY
    G1.fit(np.array(X_train_3_nf[V]).reshape(-1,1))
    # fit another Gaussian distribution ( G2 ) on same feature but for RED WINE
    # define parameters of gaussian mixture model - single feature
    G2 = GaussianMixture(n_components = 1,
                         covariance_type = 'full', random_state=1)
    # train model using feature - RED WINE ONLY
    G2.fit(np.array(X_train_3_f[V]).reshape(-1,1))
    # compute the score samples (S1 and S2) for both G1 and G2 on the TRAINING set
    S1 = G1.score_samples(np.array(X_train[V]).reshape(-1,1)) # log likelihood of belonging to white wine cluster (larger values)
    S2 = G2.score_samples(np.array(X_train[V]).reshape(-1,1)) # log likelihood of belonging to red wine cluster (larger values)
    # find optimal c (a real number) to maximize training set F1 score
    c_v = []
    f1_v = []
    pre_v = []
    rec_v = []
    for c in np.arange(0.1, 10, 0.1): # iterate through 99 candidate values of c
        tr = c*S2 # threshold is (real number c) * (array of log likelihoods under the RED WINE distribution)
        precision = precision_score(y_train, S1 < tr) # precision = TP / (TP + FP)
        recall = recall_score(y_train, S1 < tr) # recall = TP / (TP + FN)
        f1 = f1_score(y_train, S1 < tr) # F1 = 2 * (precision * recall) / (precision + recall)
        c_v.append(round(c,3)) # store the real number values in a list
        f1_v.append(round(f1,3)) # store the F1 score values in a list
        pre_v.append(round(precision,3)) # store the precision values in a list
        rec_v.append(round(recall,3)) # store the recall values in a list
    idx_v = f1_v.index(max(f1_v)) # index of maximum F1 score value
    f1_v_val = max(f1_v) # max F1 Score of training set
    pre_v_val = pre_v[idx_v] # corresponding precision score
    rec_v_val = rec_v[idx_v] # corresponding recall score
    c_v_val = c_v[idx_v] # optimal real number
    precision_list.append(pre_v_val)
    recall_list.append(rec_v_val)
    F1_list.append(f1_v_val)
    c_list.append(c_v_val)
results['Features'] = features
results['Optimal c'] = c_list
results['F1 Score'] = F1_list
results['Precision'] = precision_list
results['Recall'] = recall_list
# show the table of results
results
# find the feature and c value with maximum F1 Score
results[results['F1 Score'] == results['F1 Score'].max()]
From the table of resulting performance metrics for each feature above, feature "chlorides" with optimal real number c = 6.4 has the best F1 Score of 0.863, with corresponding precision of 0.759 and recall of 1, using TRAINING DATA.
When using two gaussian models, it is difficult to report AUC for the gaussian fitted on the red wine class because of the class imbalance (much less data than for the gaussian fitted on the white wine class): for some features the model never predicts the positive class at all, so the true positive and false positive rates needed to trace the ROC curve (and compute the area under it) are degenerate.
In other words, the imbalanced classification between red wine and white wine means there are fewer observations of the red wine class. If the model is fit using the red wine class, the resulting ROC curve and AUC value can be misleading, as a small number of red/white wine predictions can significantly shift the ROC-AUC value.
3.4) Gaussian Model (MULTI) - Feature (MULTI)
3.4.1) Gaussian Model (TWO) Feature (ONE)
- two gaussian models (white wine = single component, red wine = multiple components)
- it makes sense to use multiple components for the RED WINE CLASS: there are fewer observations and they are more spread out, so several clusters can be fit to the RED WINE CLASS to increase precision (the number of components could also be chosen non-randomly; see the BIC sketch after this list)
- different choices of ONE feature
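Instead of drawing n_components at random (as the loop in this section does), the number of components could in principle be selected by an information criterion. A minimal sketch using scikit-learn's BIC, assuming X_train_4_f (the red-wine rows defined in the cell below) and an illustrative feature name:
# sketch: pick n_components for the red wine gaussian by BIC (lower is better)
import numpy as np
from sklearn.mixture import GaussianMixture
X_red = np.array(X_train_4_f['chlorides']).reshape(-1, 1) # illustrative feature
candidates = [2, 3, 4, 5, 6] # same range as ran_num below
bics = []
for k in candidates:
    gm = GaussianMixture(n_components=k, covariance_type='full', random_state=1).fit(X_red)
    bics.append(gm.bic(X_red)) # BIC trades off fit quality against model complexity
best_k = candidates[int(np.argmin(bics))]
print(f'BIC-selected n_components: {best_k}')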
3.4.2) Gaussian Model (TWO) Feature (THREE)
- two gaussian models (white wine = single component, red wine = multiple components)
- different sets of THREE features (the loop walks consecutive triples; see the sketch after this list for a full combination search)
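Note that the loop in 3.4.2 slides a window over consecutive feature triples rather than testing every possible combination. A full search could enumerate all triples with itertools.combinations; a minimal sketch, assuming features is the column list defined in the cell below:
# sketch: enumerate every 3-feature combination (the 3.4.2 loop only walks consecutive triples)
from itertools import combinations
for trio in combinations(features, 3):
    print(trio) # each trio could be fed to the same G1/G2 fitting loop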
3.4.1) Gaussian Model (TWO) Feature (ONE)
import random
X_train_4 = X_train.copy() # copy the training set
X_train_4['label'] = y_train # add column for red/white labels
X_train_4_nf = X_train_4.where(X_train_4['label'] == 1).dropna() # only keep white wine class
X_train_4_f = X_train_4.where(X_train_4['label'] == 0).dropna() # only keep red wine class
# empty dataframe to store final results as table
results = pd.DataFrame()
features = list(X_train.columns)
# VALIDATION SET - Define empty lists to store performance metrics for each feature:
precision_list_val = [] # list of precision scores
recall_list_val = [] # list of recall scores
F1_list_val = [] # list of F1 Scores
c_list_val = [] # list of optimal number (c)
AUC_val_nf = [] # list of AUC values for white wine gaussian
AUC_val_f = [] # list of AUC values for red wine gaussian
# TRAINING SET - Define empty lists to store performance metrics for each feature:
precision_list_train = []
recall_list_train = []
F1_list_train = []
c_list_train = []
AUC_train_nf = []
AUC_train_f = []
multivariate_list = [] # list for the feature used in each model
n_components = [] # list for number of gaussian components used in RED WINE
# iterate through each feature
for V in features:
    # PARAMETER 1 - n_components:
    # fit a Gaussian distribution ( G1 ) on the feature using WHITE WINE CLASS
    # define parameters - SINGLE GAUSSIAN COMPONENT
    G1 = GaussianMixture(n_components = 1,
                         covariance_type = 'full', random_state=1)
    # train model using a single feature - WHITE WINE CLASS ONLY
    G1.fit(np.array([X_train_4_nf[V]]).reshape(-1,1))
    # fit another Gaussian distribution ( G2 ) on the same feature but for RED WINE CLASS
    # list of candidate n_components for the red wine gaussian model (G2)
    ran_num = [2, 3, 4, 5, 6]
    # random value chosen from the list to be n_components
    n_com = random.choice(ran_num)
    # define parameters - MULTIPLE GAUSSIAN COMPONENTS
    G2 = GaussianMixture(n_components = n_com,
                         covariance_type = 'full', random_state=1)
    # train model using a single feature - RED WINE CLASS ONLY
    G2.fit(np.array([X_train_4_f[V]]).reshape(-1,1))
    # compute the score samples (S1 and S2) for both G1 and G2 on the VALIDATION set
    # log likelihood under the WHITE WINE gaussian (larger = more likely)
    S1_val = G1.score_samples(np.array([X_val[V]]).reshape(-1,1))
    # log likelihood under the RED WINE gaussian (larger = more likely)
    S2_val = G2.score_samples(np.array([X_val[V]]).reshape(-1,1))
    # compute the score samples (S1 and S2) for both G1 and G2 on the TRAINING set
    S1_train = G1.score_samples(np.array([X_train[V]]).reshape(-1,1))
    S2_train = G2.score_samples(np.array([X_train[V]]).reshape(-1,1))
    # FIND AUC
    # VALIDATION SET - add AUC values to list
    AUC_val_nf.append(round(roc_auc_score(y_val, -1 * S1_val),3))
    AUC_val_f.append(round(roc_auc_score(y_val, -1 * S2_val),3))
    # TRAINING SET - add AUC values to list
    AUC_train_nf.append(round(roc_auc_score(y_train, -1 * S1_train),3))
    AUC_train_f.append(round(roc_auc_score(y_train, -1 * S2_train),3))
    # PARAMETER 2 - Threshold:
    # find the optimal c (a real number) that maximizes the VALIDATION set F1 score
    # for a model such that if S1 < c*S2, the sample is classified as red wine
    # for example, if c = 1 then whenever S2 is greater than S1 the sample is red wine (belongs to the G2/red wine distribution)
    # VALIDATION SET - empty lists for performance metrics
    c_v_val = []
    f1_v_val = []
    pre_v_val = []
    rec_v_val = []
    # TRAINING SET - empty lists for performance metrics
    c_v_train = []
    f1_v_train = []
    pre_v_train = []
    rec_v_train = []
    for c in np.arange(0.1,8,0.3): # iterate through the candidate c values
        # VALIDATION SET - performance metrics
        tr_val = c*S2_val # threshold is (real number)*(log likelihood under the RED WINE distribution)
        precision_val = precision_score(y_val, S1_val < tr_val) # precision = TP / (TP + FP)
        recall_val = recall_score(y_val, S1_val < tr_val) # recall = TP / (TP + FN)
        f1_val = f1_score(y_val, S1_val < tr_val) # F1 = 2 * (precision * recall) / (precision + recall)
        # TRAINING SET - performance metrics
        tr_train = c*S2_train
        precision_train = precision_score(y_train, S1_train < tr_train)
        recall_train = recall_score(y_train, S1_train < tr_train)
        f1_train = f1_score(y_train, S1_train < tr_train)
        # VALIDATION SET - store the performance metrics of each candidate
        c_v_val.append(round(c,3))
        f1_v_val.append(round(f1_val,3))
        pre_v_val.append(round(precision_val,3))
        rec_v_val.append(round(recall_val,3))
        # TRAINING SET - store the performance metrics of each candidate
        c_v_train.append(round(c,3))
        f1_v_train.append(round(f1_train,3))
        pre_v_train.append(round(precision_train,3))
        rec_v_train.append(round(recall_train,3))
    # VALIDATION SET - find the THRESHOLD with the BEST F1 score
    idx_v_val = f1_v_val.index(max(f1_v_val)) # index of the maximum F1 score
    f1_v_val = max(f1_v_val) # max F1 score on the validation set
    pre_v_val = pre_v_val[idx_v_val] # corresponding precision score
    rec_v_val = rec_v_val[idx_v_val] # corresponding recall score
    c_v_val = c_v_val[idx_v_val] # optimal real number c
    # TRAINING SET - find the THRESHOLD with the BEST F1 score
    idx_v_train = f1_v_train.index(max(f1_v_train)) # index of the maximum F1 score
    f1_v_train = max(f1_v_train) # max F1 score on the training set
    pre_v_train = pre_v_train[idx_v_train] # corresponding precision score
    rec_v_train = rec_v_train[idx_v_train] # corresponding recall score
    c_v_train = c_v_train[idx_v_val] # note: reuses the index of the VALIDATION-optimal c
    # VALIDATION SET - add the best performance metrics to the lists
    precision_list_val.append(pre_v_val)
    recall_list_val.append(rec_v_val)
    F1_list_val.append(f1_v_val)
    c_list_val.append(c_v_val)
    # TRAINING SET - add the best performance metrics to the lists
    precision_list_train.append(pre_v_train)
    recall_list_train.append(rec_v_train)
    F1_list_train.append(f1_v_train)
    c_list_train.append(c_v_train)
    # add the feature to the list
    multivariate_list.append(V)
    # add the number of components used to the list
    n_components.append(n_com)
# create summary of results as a dataframe table:
model = []
for i in range(1, len(X_train.columns)+1): # iterate through model numbers
    model.append(i) # add model number to list
# label each feature combination as a new model
results['Model #'] = model
# feature used
results['Features'] = features
# number of components used in Red Wine Gaussian Model
results['n_components'] = n_components
# VALIDATION SET - create columns
results['Optimal c (VAL)'] = c_list_val # optimal threshold c value
results['F1 Score (VAL)'] = F1_list_val # best f1 score
results['Precision (VAL)'] = precision_list_val # corresponding precision value
results['Recall (VAL)'] = recall_list_val # corresponding recall value
results['AUC_nf (VAL)'] = AUC_val_nf # AUC using white wine gaussian
results['AUC_f (VAL)'] = AUC_val_f # AUC using red wine gaussian
# TRAINING SET - create columns
results['Optimal c (TRAIN)'] = c_list_train
results['F1 Score (TRAIN)'] = F1_list_train
results['Precision (TRAIN)'] = precision_list_train
results['Recall (TRAIN)'] = recall_list_train
results['AUC_nf (TRAIN)'] = AUC_train_nf
results['AUC_f (TRAIN)'] = AUC_train_f
# show table of results
results
# best F1 Score for training set
max_train = results['F1 Score (TRAIN)'].max()
# best F1 Score for validation set
max_val = results['F1 Score (VAL)'].max()
results[results['F1 Score (VAL)'] == max_val]
Using one feature, the best model (highest validation F1 Score) uses the feature "total sulfur dioxide" with two gaussians (5 components for the red wine gaussian); a small usage sketch follows the metrics below:
- n_components = 5
- optimal c = 0.1
- F1 Score = 0.858
- Precision = 0.752
- Recall = 1.0
- AUC = 0.950
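For reference, the selected single-feature model reduces to a one-line decision rule. A minimal sketch, assuming G1 and G2 were fit on 'total sulfur dioxide' as above and using the tuned c = 0.1 from the table (names are illustrative):
# sketch: apply the tuned decision rule S1 < c*S2 to new samples
import numpy as np
def predict_red(x, G1, G2, c=0.1):
    x = np.asarray(x).reshape(-1, 1) # single-feature column vector
    S1 = G1.score_samples(x) # log likelihood under the white wine gaussian
    S2 = G2.score_samples(x) # log likelihood under the red wine gaussian
    return S1 < c * S2 # True where the sample is flagged as red wine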
3.4.2) Gaussian Model (TWO) Feature (THREE)
# two gaussian models (white wine = single component, red wine = multiple components) with TRIPLES of features
import random
X_train_4 = X_train.copy() # copy the training set
X_train_4['label'] = y_train # add column for red/white wine labels
X_train_4_nf = X_train_4.where(X_train_4['label'] == 1).dropna() # only keep white wine class
X_train_4_f = X_train_4.where(X_train_4['label'] == 0).dropna() # only keep red wine class
# empty dataframe to store final results as table
results = pd.DataFrame()
features = list(X_train.columns)
# VALIDATION SET - define empty lists to store performance metrics for each feature:
precision_list_val = [] # list of precision scores
recall_list_val = [] # list of recall scores
F1_list_val = [] # list of F1 Scores
c_list_val = [] # list of optimal number (c)
AUC_val_nf = [] # list of AUC values for white wine gaussian
AUC_val_f = [] # list of AUC values for red wine gaussian
# TRAINING SET - define empty lists to store performance metrics for each feature:
precision_list_train = []
recall_list_train = []
F1_list_train = []
c_list_train = []
AUC_train_nf = []
AUC_train_f = []
multivariate_list = [] # list for combination of features used
n_components = [] # list for number of gaussian components used in red wine distribution
# iterate through consecutive TRIPLES of features (a sliding window, not every combination)
for i in np.arange(0,len(features)- 2):
    # PARAMETER 1 - n_components:
    # fit a Gaussian distribution ( G1 ) on three features using WHITE WINE CLASS
    # define parameters - SINGLE GAUSSIAN COMPONENT
    G1 = GaussianMixture(n_components = 1,
                         covariance_type = 'full', random_state=1)
    # train model using 3 features - WHITE WINE CLASS ONLY
    G1.fit(np.array([X_train_4_nf[features[i]],X_train_4_nf[features[i+1]],X_train_4_nf[features[i+2]]]).T)
    # fit another Gaussian distribution ( G2 ) on the same features but for RED WINE CLASS
    # list of candidate n_components for the red wine gaussian model (G2)
    ran_num = [2, 3, 4, 5, 6]
    # random value chosen from the list to be n_components
    n_com = random.choice(ran_num)
    # define parameters - MULTIPLE GAUSSIAN COMPONENTS
    G2 = GaussianMixture(n_components = n_com,
                         covariance_type = 'full', random_state=1)
    # train model using 3 features - RED WINE CLASS ONLY
    G2.fit(np.array([X_train_4_f[features[i]],X_train_4_f[features[i+1]],X_train_4_f[features[i+2]]]).T)
    # compute the score samples (S1 and S2) for both G1 and G2 on the VALIDATION set
    # log likelihood under the WHITE WINE gaussian (larger = more likely)
    S1_val = G1.score_samples(np.array([X_val[features[i]],X_val[features[i+1]],X_val[features[i+2]]]).T)
    # log likelihood under the RED WINE gaussian (larger = more likely)
    S2_val = G2.score_samples(np.array([X_val[features[i]],X_val[features[i+1]],X_val[features[i+2]]]).T)
    # compute the score samples (S1 and S2) for both G1 and G2 on the TRAINING set
    S1_train = G1.score_samples(np.array([X_train[features[i]],X_train[features[i+1]],X_train[features[i+2]]]).T)
    S2_train = G2.score_samples(np.array([X_train[features[i]],X_train[features[i+1]],X_train[features[i+2]]]).T)
    # FIND AUC
    # VALIDATION SET - add AUC values to list
    AUC_val_nf.append(round(roc_auc_score(y_val, -1 * S1_val),3))
    AUC_val_f.append(round(roc_auc_score(y_val, -1 * S2_val),3))
    # TRAINING SET - add AUC values to list
    AUC_train_nf.append(round(roc_auc_score(y_train, -1 * S1_train),3))
    AUC_train_f.append(round(roc_auc_score(y_train, -1 * S2_train),3))
    # PARAMETER 2 - Threshold:
    # find the optimal c (a real number) that maximizes the VALIDATION set F1 score
    # for a model such that if S1 < c*S2, the sample is classified as red wine
    # for example, if c = 1 then whenever S2 is greater than S1 the sample is red wine (belongs to the G2/red wine distribution)
    # VALIDATION SET - empty lists for performance metrics
    c_v_val = []
    f1_v_val = []
    pre_v_val = []
    rec_v_val = []
    # TRAINING SET - empty lists for performance metrics
    c_v_train = []
    f1_v_train = []
    pre_v_train = []
    rec_v_train = []
    for c in np.arange(0.1,10,0.1): # iterate through the 99 candidate c values
        # VALIDATION SET - performance metrics
        tr_val = c*S2_val # threshold is (real number)*(log likelihood under the RED WINE distribution)
        precision_val = precision_score(y_val, S1_val < tr_val) # precision = TP / (TP + FP)
        recall_val = recall_score(y_val, S1_val < tr_val) # recall = TP / (TP + FN)
        f1_val = f1_score(y_val, S1_val < tr_val) # F1 = 2 * (precision * recall) / (precision + recall)
        # TRAINING SET - performance metrics
        tr_train = c*S2_train
        precision_train = precision_score(y_train, S1_train < tr_train)
        recall_train = recall_score(y_train, S1_train < tr_train)
        f1_train = f1_score(y_train, S1_train < tr_train)
        # VALIDATION SET - store the performance metrics of each candidate
        c_v_val.append(round(c,3))
        f1_v_val.append(round(f1_val,3))
        pre_v_val.append(round(precision_val,3))
        rec_v_val.append(round(recall_val,3))
        # TRAINING SET - store the performance metrics of each candidate
        c_v_train.append(round(c,3))
        f1_v_train.append(round(f1_train,3))
        pre_v_train.append(round(precision_train,3))
        rec_v_train.append(round(recall_train,3))
    # VALIDATION SET - find the THRESHOLD with the BEST F1 score
    idx_v_val = f1_v_val.index(max(f1_v_val)) # index of the maximum F1 score
    f1_v_val = max(f1_v_val) # max F1 score on the validation set
    pre_v_val = pre_v_val[idx_v_val] # corresponding precision score
    rec_v_val = rec_v_val[idx_v_val] # corresponding recall score
    c_v_val = c_v_val[idx_v_val] # optimal real number c
    # TRAINING SET - find the THRESHOLD with the BEST F1 score
    idx_v_train = f1_v_train.index(max(f1_v_train)) # index of the maximum F1 score
    f1_v_train = max(f1_v_train) # max F1 score on the training set
    pre_v_train = pre_v_train[idx_v_train] # corresponding precision score
    rec_v_train = rec_v_train[idx_v_train] # corresponding recall score
    c_v_train = c_v_train[idx_v_val] # note: reuses the index of the VALIDATION-optimal c
    # VALIDATION SET - add the best performance metrics to the lists
    precision_list_val.append(pre_v_val)
    recall_list_val.append(rec_v_val)
    F1_list_val.append(f1_v_val)
    c_list_val.append(c_v_val)
    # TRAINING SET - add the best performance metrics to the lists
    precision_list_train.append(pre_v_train)
    recall_list_train.append(rec_v_train)
    F1_list_train.append(f1_v_train)
    c_list_train.append(c_v_train)
    # add the feature combination to the list
    multivariate_list.append(features[i] + ' ' + features[i+1] +' ' + features[i+2])
    # add the number of components used to the list
    n_components.append(n_com)
# create summary of results as a dataframe table:
model = []
for i in range(1,len(features) - 1): # iterate through model numbers
    model.append(i) # add model number to list
# label each feature combination as a new model
results['Model #'] = model
# feature combination
results['Feature Combination'] = multivariate_list
# number of components used in RED WINE Gaussian Model
results['n_components'] = n_components
#VALIDATION SET - Create columns
results['Optimal c (VAL)'] = c_list_val # optimal threshold c value
results['F1 Score (VAL)'] = F1_list_val # best f1 score
results['Precision (VAL)'] = precision_list_val # corresponding precision value
results['Recall (VAL)'] = recall_list_val # corresponding recall value
results['AUC_nf (VAL)'] = AUC_val_nf # AUC using white wine gaussian
results['AUC_f (VAL)'] = AUC_val_f # AUC using red wine gaussian
#TRAINING SET - Create columns
results['Optimal c (TRAIN)'] = c_list_train
results['F1 Score (TRAIN)'] = F1_list_train
results['Precision (TRAIN)'] = precision_list_train
results['Recall (TRAIN)'] = recall_list_train
results['AUC_nf (TRAIN)'] = AUC_train_nf
results['AUC_f (TRAIN)'] = AUC_train_f
# show table of results
results
# best F1 Score for training set
max_train = results['F1 Score (TRAIN)'].max()
# best F1 Score for validation set
max_val = results['F1 Score (VAL)'].max()
results[results['F1 Score (VAL)'] == max_val]
Using THREE features, the best model (highest validation F1 Score) uses the features (free sulfur dioxide, total sulfur dioxide, density) with two gaussians (6 components for the red wine gaussian); the ROC curve behind the AUC is sketched after the list:
- n_components = 6
- optimal c = 0.1
- F1 Score = 0.858
- Precision = 0.752
- Recall = 1.0
- AUC = 0.973
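The reported AUC of 0.973 can also be visualized as a full ROC curve. A minimal sketch with matplotlib, assuming y_val and S2_val hold the validation labels and red-wine GMM scores for this feature triple (in the loop above, S2_val is overwritten each iteration, so this is illustrative):
# sketch: plot the ROC curve behind the reported AUC
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, _ = roc_curve(y_val, -1 * S2_val)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_val, -1 * S2_val):.3f}')
plt.plot([0, 1], [0, 1], 'k--') # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()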
3.5) Gaussian Mixture Model
(using test data set)
Best Overall Model:
- 3 Features
- free sulfur dioxide
- total sulfur dioxide
- density
- 2 Gaussian Models
- white wine distribution
- red wine distribution
- 6 Components
# convert the target class labels to 1 and 0
y_test = y_test.map({'white': 1, 'red': 0}).astype(int)
# clean data
X_train_5 = X_train.copy() # copy the training set
X_train_5['label'] = y_train # add column for RED/WHITE labels
X_train_5_nf = X_train_5.where(X_train_5['label'] == 1).dropna() # only keep WHITE WINE CLASS
X_train_5_f = X_train_5.where(X_train_5['label'] == 0).dropna() # only keep RED WINE CLASS
# fit a Gaussian distribution ( G1 ) on the three selected features of WHITE WINE CLASS
# define parameters of the gaussian mixture model - single component
G1 = GaussianMixture(n_components = 1,
                     covariance_type = 'full', random_state=1)
# train model using the three selected features - WHITE WINE CLASS ONLY
G1.fit(np.array([X_train_5_nf['free sulfur dioxide'],X_train_5_nf['total sulfur dioxide'],X_train_5_nf['density']]).T)
# fit another Gaussian distribution ( G2 ) on the same features but for RED WINE CLASS
# define parameters of the gaussian mixture model - six components
G2 = GaussianMixture(n_components = 6,
                     covariance_type = 'full', random_state=1)
# train model using the three selected features - RED WINE CLASS ONLY
G2.fit(np.array([X_train_5_f['free sulfur dioxide'],X_train_5_f['total sulfur dioxide'],X_train_5_f['density']]).T)
# compute the score samples (S1 and S2) for both G1 and G2 on the TEST set
# log likelihood of belonging to white wine cluster (larger values)
S1 = G1.score_samples(np.array([X_test['free sulfur dioxide'],X_test['total sulfur dioxide'],X_test['density']]).T)
# log likelihood of belonging to red wine cluster (larger values)
S2 = G2.score_samples(np.array([X_test['free sulfur dioxide'],X_test['total sulfur dioxide'],X_test['density']]).T)
# find the optimal c (a real number) that maximizes the TEST set F1 score
# for a model such that if S1 < c*S2, the sample is classified as red wine
# for example, if c = 1 then whenever S2 is greater than S1 the sample is red wine (belongs to the G2/red wine distribution)
c_fsd = []
f1_fsd = []
pre_fsd = []
rec_fsd = []
for c in np.arange(0.1,10,0.1): # iterate through 99 candidate values
    tr = c*S2 # threshold is (real number)*(log likelihood under the RED WINE distribution)
    precision = precision_score(y_test, S1 < tr) # precision = TP / (TP + FP)
    recall = recall_score(y_test, S1 < tr) # recall = TP / (TP + FN)
    f1 = f1_score(y_test, S1 < tr) # F1 = 2 * (precision * recall) / (precision + recall)
    c_fsd.append(round(c,3)) # store the candidate c values in a list
    f1_fsd.append(round(f1,3)) # store the F1 scores in a list
    pre_fsd.append(round(precision,3)) # store the precision values in a list
    rec_fsd.append(round(recall,3)) # store the recall values in a list
idx_fsd = f1_fsd.index(max(f1_fsd)) # index of the maximum F1 score
f1_fsd_val = max(f1_fsd) # max F1 score on the TEST set
pre_fsd_val = pre_fsd[idx_fsd] # corresponding precision score
rec_fsd_val = rec_fsd[idx_fsd] # corresponding recall score
c_fsd_val = c_fsd[idx_fsd] # optimal real number c
print(f'\n max F1 score: {f1_fsd_val} with precision: {pre_fsd_val}, recall: {rec_fsd_val} at real number: {c_fsd_val} using the best model')
Using the overall best model, the resulting performance metrics on the test set are:
- F1 Score = 0.85
- Precision = 0.739
- Recall = 1.0
The dataset limits how much can be read into these numbers: with the heavy class imbalance, the precision/recall tradeoff is not well resolved, and a confusion-matrix check (sketched below) makes that tradeoff explicit. The step-by-step process itself, however, is the same as for the training and validation sets.
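A minimal sketch of that check, assuming S1, S2, y_test, and the optimal c_fsd_val from the cell above:
# sketch: confusion matrix for the final test-set decision rule
from sklearn.metrics import confusion_matrix
y_pred = (S1 < c_fsd_val * S2).astype(int) # same rule scored above
print(confusion_matrix(y_test, y_pred)) # rows = true class, cols = predicted class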
- Soowan -