Statistical Learning Data Analysis (SLDA)

A Comprehensive Report on Cancer Risk Statistical Analysis

Submitted to: Prof.ssa Roberta Siciliano (University of Naples Federico II)

Report by: Sahaya Gnanadurai (D03000149), Rohan Baidya (D03000192)


1. Data Loading and Description

1.1 Dataset Overview

Field Details
Title Cancer Risk Factors Data
Author Tarek Masryo
Year 2025
Publisher Kaggle
DOI 10.34740/KAGGLE/DSV/13280499
URL https://www.kaggle.com/dsv/13280499
Github RAW URL https://raw.githubusercontent.com/tarekmasryo/cancer-risk-factors-data/main/data/cancer-risk-factors.csv

The dataset contains records for 2,000 cancer patients, covering lifestyle, environmental, and genetic risk factors, together with a composite risk score and a categorical risk-level classification. It is designed to support analysis of the relationships between these factors and the risk of several cancer types.

Dataset Information:

Metric Value
Dataset Shape (2000, 21)
Number of Records 2000
Number of Features 21

Variable Descriptions:

Column Name Description
Patient_ID Unique identifier for each patient
Cancer_Type Type of cancer (e.g., Breast, Colon, Lung, Prostate, Skin)
Age Patient age in years
Gender Patient gender (Male / Female, stored as binary 0 / 1)
Smoking Smoking level (numeric scale, 0–10)
Alcohol_Use Alcohol consumption level (numeric scale)
Obesity Obesity level (numeric scale, 0–10); redundant with BMI, dropped in cleaning
Family_History Family history of cancer (0 = No, 1 = Yes)
Diet_Red_Meat Red meat consumption level (numeric scale)
Diet_Salted_Processed Salted/processed food consumption level (numeric scale)
Fruit_Veg_Intake Fruit and vegetable intake level (numeric scale)
Physical_Activity Physical activity level (numeric scale)
Air_Pollution Exposure to air pollution (numeric scale)
Occupational_Hazards Exposure to occupational hazards (numeric scale)
BRCA_Mutation BRCA gene mutation carrier (0 = No, 1 = Yes)
H_Pylori_Infection Helicobacter pylori infection status (0 = No, 1 = Yes)
Calcium_Intake Calcium intake level (numeric scale)
Overall_Risk_Score Composite risk score (continuous)
BMI Body Mass Index (kg/m²)
Physical_Activity_Level Activity level (stored as an integer on a 0–10 scale)
Risk_Level Cancer risk classification (Low / Medium / High) — target variable

1.2 First 5 Rows

Patient_ID Cancer_Type Age Gender Smoking Alcohol_Use Obesity Family_History Diet_Red_Meat Diet_Salted_Processed Fruit_Veg_Intake Physical_Activity Air_Pollution Occupational_Hazards BRCA_Mutation H_Pylori_Infection Calcium_Intake Overall_Risk_Score BMI Physical_Activity_Level Risk_Level
LU0000 Breast 68 0 7 2 8 0 5 3 7 4 6 3 1 0 0 0.3987 28.0000 5 Medium
LU0001 Prostate 74 1 8 9 8 0 0 3 7 1 3 3 0 0 5 0.4243 25.4000 9 Medium
LU0002 Skin 55 1 7 10 7 0 3 3 4 1 8 10 0 0 6 0.6051 28.6000 2 Medium
LU0003 Colon 61 0 6 2 2 0 6 2 4 6 4 8 0 0 8 0.3184 32.1000 7 Low
LU0004 Lung 67 1 10 7 4 0 6 3 10 9 10 9 0 0 5 0.5244 25.1000 2 Medium

1.3 Dataset Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Patient_ID               2000 non-null   object 
 1   Cancer_Type              2000 non-null   object 
 2   Age                      2000 non-null   int64  
 3   Gender                   2000 non-null   int64  
 4   Smoking                  2000 non-null   int64  
 5   Alcohol_Use              2000 non-null   int64  
 6   Obesity                  2000 non-null   int64  
 7   Family_History           2000 non-null   int64  
 8   Diet_Red_Meat            2000 non-null   int64  
 9   Diet_Salted_Processed    2000 non-null   int64  
 10  Fruit_Veg_Intake         2000 non-null   int64  
 11  Physical_Activity        2000 non-null   int64  
 12  Air_Pollution            2000 non-null   int64  
 13  Occupational_Hazards     2000 non-null   int64  
 14  BRCA_Mutation            2000 non-null   int64  
 15  H_Pylori_Infection       2000 non-null   int64  
 16  Calcium_Intake           2000 non-null   int64  
 17  Overall_Risk_Score       2000 non-null   float64
 18  BMI                      2000 non-null   float64
 19  Physical_Activity_Level  2000 non-null   int64  
 20  Risk_Level               2000 non-null   object 
dtypes: float64(2), int64(16), object(3)
memory usage: 328.2+ KB


2. Feature Separation and Descriptive Statistics

2.1 Numerical Features Statistics

Column Min Max Mean Mode Median Std Skewness Kurtosis Missing Values
Age 25.0000 90.0000 63.2480 64.0000 64.0000 10.4629 -0.1814 -0.0148 0
Smoking 0.0000 10.0000 5.1570 10.0000 5.0000 3.3253 0.0552 -1.2552 0
Alcohol_Use 0.0000 10.0000 5.0350 7.0000 5.0000 3.2610 -0.0573 -1.3211 0
Obesity 0.0000 10.0000 5.9675 10.0000 6.0000 3.0614 -0.3250 -0.9672 0
Diet_Red_Meat 0.0000 10.0000 5.1895 10.0000 5.0000 3.1545 -0.0079 -1.1572 0
Diet_Salted_Processed 0.0000 10.0000 4.5635 4.0000 4.0000 3.0883 0.3009 -1.0429 0
Fruit_Veg_Intake 0.0000 10.0000 4.9275 3.0000 5.0000 3.0453 0.0185 -1.0837 0
Physical_Activity 0.0000 10.0000 4.0150 1.0000 4.0000 2.9785 0.4559 -0.8428 0
Air_Pollution 0.0000 10.0000 5.3230 10.0000 5.0000 3.2075 0.0033 -1.2114 0
Occupational_Hazards 0.0000 10.0000 4.9790 5.0000 5.0000 3.2129 0.0749 -1.1776 0
Calcium_Intake 0.0000 10.0000 3.9405 0.0000 4.0000 3.0489 0.3495 -0.9561 0
Overall_Risk_Score 0.0293 0.8522 0.4544 0.0293 0.4554 0.1231 0.0165 -0.2909 0
BMI 15.0000 41.4000 26.1833 25.9000 26.2000 3.9475 0.0477 0.0122 0
Physical_Activity_Level 0.0000 10.0000 4.9385 0.0000 5.0000 3.1660 -0.0103 -1.2055 0

2.2 Categorical Features Statistics

Column Mode Unique Values #Unique Most Frequent Missing
Cancer_Type Lung Breast, Prostate, Skin, Colon, Lung 5 Lung (527) 0
Risk_Level Medium Medium, Low, High 3 Medium (1574) 0
Gender 0 0, 1 2 0 (1022) 0
Family_History 0 0, 1 2 0 (1611) 0
BRCA_Mutation 0 1, 0 2 0 (1935) 0
H_Pylori_Infection 0 0, 1 2 0 (1607) 0

3. Data Cleaning

3.1 Missing Value Imputation and Removal of Redundant Columns

3.1.1 Missing Value Imputation

No missing values detected

3.1.2 Removal of Redundant Columns

Since BMI is the internationally standardized clinical measure, we drop Obesity to avoid redundancy.

Patient_ID Cancer_Type Age Gender Smoking Alcohol_Use Family_History Diet_Red_Meat Diet_Salted_Processed Fruit_Veg_Intake Physical_Activity Air_Pollution Occupational_Hazards BRCA_Mutation H_Pylori_Infection Calcium_Intake Overall_Risk_Score BMI Physical_Activity_Level Risk_Level
LU0000 Breast 68 0 7 2 0 5 3 7 4 6 3 1 0 0 0.3987 28.0000 5 Medium
LU0001 Prostate 74 1 8 9 0 0 3 7 1 3 3 0 0 5 0.4243 25.4000 9 Medium
LU0002 Skin 55 1 7 10 0 3 3 4 1 8 10 0 0 6 0.6051 28.6000 2 Medium
LU0003 Colon 61 0 6 2 0 6 2 4 6 4 8 0 0 8 0.3184 32.1000 7 Low
LU0004 Lung 67 1 10 7 0 6 3 10 9 10 9 0 0 5 0.5244 25.1000 2 Medium

3.2 Boxplot Visualization Before Outlier Removal

numerical_cols_boxplot

3.3 Outlier Detection and Removal (IQR Method)

Column Q1 Q3 IQR Rows removed
Age 56.00 70.00 14.00 9
Smoking 2.00 8.00 6.00 0
Alcohol_Use 2.00 8.00 6.00 0
Diet_Red_Meat 3.00 8.00 5.00 0
Diet_Salted_Processed 2.00 7.00 5.00 0
Fruit_Veg_Intake 3.00 8.00 5.00 0
Physical_Activity 1.00 6.00 5.00 0
Air_Pollution 3.00 8.00 5.00 0
Occupational_Hazards 2.00 8.00 6.00 0
Calcium_Intake 1.00 6.00 5.00 0
Overall_Risk_Score 0.37 0.54 0.17 6
BMI 23.50 28.70 5.20 17
Physical_Activity_Level 2.00 8.00 6.00 0

Metric Value
Original rows 2000
After outlier removal 1968
Rows removed 32
Percentage removed 1.60%
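The quartiles in the table are consistent with the common 1.5 × IQR fence rule. The multiplier is an assumption here, since the report does not state it, and the helper names `iqr_fences` and `is_outlier` are hypothetical. A minimal sketch of the rule:

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the given quartiles.

    k=1.5 is the conventional multiplier; the report does not state
    which value was used, so it is an assumption.
    """
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True when value falls outside the IQR fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return value < lower or value > upper


# Age quartiles from the table above: Q1 = 56, Q3 = 70
age_lower, age_upper = iqr_fences(56.0, 70.0)
print(age_lower, age_upper)  # 35.0 91.0
```

Under these fences the minimum Age of 25 would be flagged, while the maximum of 90 would be kept.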

numerical_cols_boxplot_no_outliers


4. Data Encoding and Transformation

4.1 Categorical Encoding

4.1.1 Encoding Summary

Cancer_Type is one-hot encoded with drop_first=True: the first category (Breast) is dropped as the reference level to avoid multicollinearity, leaving four indicator columns.

Cancer_Type_Colon Cancer_Type_Lung Cancer_Type_Prostate Cancer_Type_Skin
0 0 0 0
0 0 1 0
0 0 0 1
1 0 0 0
0 1 0 0

Risk_Level is label encoded with an ordinal mapping (Low=0, Medium=1, High=2).

Gender, Family_History, BRCA_Mutation, H_Pylori_Infection: already numeric (binary), kept as-is and no encoding is needed
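The two encodings can be illustrated without any library. This is only a sketch of what drop-first one-hot encoding and ordinal label encoding do; the function name `one_hot_drop_first` is hypothetical, and in pandas the usual route would be `pd.get_dummies(..., drop_first=True)`.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode labels, dropping the first category in sorted order."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category becomes the all-zero reference
    columns = [f"{prefix}_{c}" for c in kept]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return columns, rows


# Ordinal mapping stated in the report
RISK_ORDER = {"Low": 0, "Medium": 1, "High": 2}

# Cancer types of the first five patients, in row order
cancer_types = ["Breast", "Prostate", "Skin", "Colon", "Lung"]
columns, rows = one_hot_drop_first(cancer_types, "Cancer_Type")
print(columns)
for row in rows:
    print(row)

risk_encoded = [RISK_ORDER[r] for r in ["Medium", "Medium", "Medium", "Low", "Medium"]]
print(risk_encoded)  # [1, 1, 1, 0, 1]
```

The Breast row comes out as all zeros, which is how the reference category is represented.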

4.1.2 Modified DataFrame (with encoded columns)

Patient_ID Age Gender Smoking Alcohol_Use Family_History Diet_Red_Meat Diet_Salted_Processed Fruit_Veg_Intake Physical_Activity Air_Pollution Occupational_Hazards BRCA_Mutation H_Pylori_Infection Calcium_Intake Overall_Risk_Score BMI Physical_Activity_Level Cancer_Type_Colon Cancer_Type_Lung Cancer_Type_Prostate Cancer_Type_Skin Risk_Level_Encoded
LU0000 68 0 7 2 0 5 3 7 4 6 3 1 0 0 0.3987 28.0000 5 0 0 0 0 1
LU0001 74 1 8 9 0 0 3 7 1 3 3 0 0 5 0.4243 25.4000 9 0 0 1 0 1
LU0002 55 1 7 10 0 3 3 4 1 8 10 0 0 6 0.6051 28.6000 2 0 0 0 1 1
LU0003 61 0 6 2 0 6 2 4 6 4 8 0 0 8 0.3184 32.1000 7 1 0 0 0 0
LU0004 67 1 10 7 0 6 3 10 9 10 9 0 0 5 0.5244 25.1000 2 0 1 0 0 1

4.2 Numerical Transformations

Skewness

We check the skewness of each numerical variable and apply a log transform only to highly skewed variables (|skew| > 1).

Log Transform

Value Sign Formula
Negative $\log(1 - x)$
Positive $\log(1 + x)$
Zero (no transform applied)

Skewness values (threshold: |skew| > 1.0):

Column Skewness Exceeds Threshold
Age -0.088 No
Smoking 0.059 No
Alcohol_Use -0.058 No
Diet_Red_Meat -0.012 No
Diet_Salted_Processed 0.301 No
Fruit_Veg_Intake 0.019 No
Physical_Activity 0.453 No
Air_Pollution 0.004 No
Occupational_Hazards 0.077 No
Calcium_Intake 0.346 No
Overall_Risk_Score 0.021 No
BMI 0.060 No
Physical_Activity_Level -0.011 No

No highly skewed numerical variables (|skew| > 1) requiring log transform.
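The check can be sketched in plain Python. The skewness estimator below is the simple moment-based one; the report does not say which estimator produced the table (pandas, for example, applies a bias correction), so small differences in the third decimal are to be expected. All function names are hypothetical.

```python
import math


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def sign_dependent_log(x):
    """Apply the transform from the table above according to the sign of x."""
    if x > 0:
        return math.log(1 + x)
    if x < 0:
        return math.log(1 - x)
    return x  # zero: no transform applied


def maybe_log_transform(values, threshold=1.0):
    """Transform a column only when |skew| exceeds the threshold."""
    if abs(sample_skewness(values)) > threshold:
        return [sign_dependent_log(v) for v in values]
    return list(values)
```

A symmetric column is returned unchanged, which matches the outcome reported above where no column crossed the threshold.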

Standardization (z-score)

All continuous numerical features are standardized to zero mean and unit variance:

$$z = \frac{x - \mu}{\sigma}$$

This ensures features are on comparable scales for distance-based and gradient-based ML algorithms.
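As a worked instance of the formula, the sketch below standardizes two raw Age values using the post-cleaning mean and standard deviation reported in Section 4.2.1. The function name `z_score` is hypothetical.

```python
def z_score(x, mean, std):
    """Standardize a single value: (x - mean) / std."""
    return (x - mean) / std


# Age statistics after outlier removal (Section 4.2.1)
AGE_MEAN, AGE_STD = 63.3526, 10.2284

print(round(z_score(68, AGE_MEAN, AGE_STD), 4))  # 0.4544
print(round(z_score(74, AGE_MEAN, AGE_STD), 4))  # 1.041
```

These agree with the standardized Age values shown for the first two patients in Section 4.2.2.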

4.2.1 Standardization Summary

Numerical columns standardized (zero mean, unit variance):
Age, Smoking, Alcohol_Use, Diet_Red_Meat, Diet_Salted_Processed, Fruit_Veg_Intake, Physical_Activity, Air_Pollution, Occupational_Hazards, Calcium_Intake, Overall_Risk_Score, BMI, Physical_Activity_Level

Column Original Mean (μ) Original Std (σ)
Age 63.3526 10.2284
Smoking 5.1570 3.3220
Alcohol_Use 5.0427 3.2573
Diet_Red_Meat 5.2038 3.1468
Diet_Salted_Processed 4.5671 3.0853
Fruit_Veg_Intake 4.9212 3.0475
Physical_Activity 4.0229 2.9765
Air_Pollution 5.3298 3.1986
Occupational_Hazards 4.9731 3.2112
Calcium_Intake 3.9507 3.0497
Overall_Risk_Score 0.4551 0.1211
BMI 26.2014 3.8190
Physical_Activity_Level 4.9416 3.1742

4.2.2 Transformed Data (Standardized Numerical Columns)

First 10 rows of standardized numerical columns only:

Age Smoking Alcohol_Use Diet_Red_Meat Diet_Salted_Processed Fruit_Veg_Intake Physical_Activity Air_Pollution Occupational_Hazards Calcium_Intake Overall_Risk_Score BMI Physical_Activity_Level
0 0.4544 0.5548 -0.9341 -0.0648 -0.5079 0.6821 -0.0077 0.2095 -0.6144 -1.2954 -0.4660 0.4710 0.0184
1 1.0410 0.8558 1.2149 -1.6537 -0.5079 0.6821 -1.0156 -0.7284 -0.6144 0.3441 -0.2546 -0.2098 1.2786
2 -0.8166 0.5548 1.5219 -0.7003 -0.5079 -0.3023 -1.0156 0.8348 1.5654 0.6720 1.2381 0.6281 -0.9267
3 -0.2300 0.2538 -0.9341 0.2530 -0.8320 -0.3023 0.6642 -0.4157 0.9426 1.3277 -1.1286 1.5446 0.6485
4 0.3566 1.4578 0.6009 0.2530 -0.5079 1.6666 1.6721 1.4601 1.2540 0.3441 0.5715 -0.2884 -0.9267
5 1.3343 1.4578 0.9079 0.2530 -1.4803 0.3540 -0.6796 1.4601 0.6312 -1.2954 0.3594 -0.2884 -1.2418
6 -0.4255 1.4578 1.5219 1.2064 -0.1838 -1.6149 -1.0156 1.4601 1.2540 0.3441 1.7109 1.5969 -0.9267
7 1.0410 0.8558 0.2939 -0.7003 -0.5079 -0.9586 1.3362 0.8348 0.6312 -0.9675 0.2000 0.7590 1.2786
8 0.7477 1.1568 -1.5481 1.5242 -0.1838 0.3540 2.0081 0.8348 -0.6144 0.3441 0.3508 -0.5502 0.0184
9 -0.8166 0.5548 -1.2411 -1.6537 -0.1838 -0.9586 0.3283 1.1474 1.2540 0.3441 -0.4153 0.5233 -1.2418

4.2.3 Modified DataFrame (with standardized columns)

First 10 rows of the complete modified dataframe (all columns, numerical ones standardized):

Patient_ID Age Gender Smoking Alcohol_Use Family_History Diet_Red_Meat Diet_Salted_Processed Fruit_Veg_Intake Physical_Activity Air_Pollution Occupational_Hazards BRCA_Mutation H_Pylori_Infection Calcium_Intake Overall_Risk_Score BMI Physical_Activity_Level Cancer_Type_Colon Cancer_Type_Lung Cancer_Type_Prostate Cancer_Type_Skin Risk_Level_Encoded
0 LU0000 0.4544 0 0.5548 -0.9341 0 -0.0648 -0.5079 0.6821 -0.0077 0.2095 -0.6144 1 0 -1.2954 -0.4660 0.4710 0.0184 0 0 0 0 1
1 LU0001 1.0410 1 0.8558 1.2149 0 -1.6537 -0.5079 0.6821 -1.0156 -0.7284 -0.6144 0 0 0.3441 -0.2546 -0.2098 1.2786 0 0 1 0 1
2 LU0002 -0.8166 1 0.5548 1.5219 0 -0.7003 -0.5079 -0.3023 -1.0156 0.8348 1.5654 0 0 0.6720 1.2381 0.6281 -0.9267 0 0 0 1 1
3 LU0003 -0.2300 0 0.2538 -0.9341 0 0.2530 -0.8320 -0.3023 0.6642 -0.4157 0.9426 0 0 1.3277 -1.1286 1.5446 0.6485 1 0 0 0 0
4 LU0004 0.3566 1 1.4578 0.6009 0 0.2530 -0.5079 1.6666 1.6721 1.4601 1.2540 0 0 0.3441 0.5715 -0.2884 -0.9267 0 1 0 0 1
5 LU0005 1.3343 1 1.4578 0.9079 0 0.2530 -1.4803 0.3540 -0.6796 1.4601 0.6312 0 0 -1.2954 0.3594 -0.2884 -1.2418 0 1 0 0 1
6 LU0006 -0.4255 0 1.4578 1.5219 0 1.2064 -0.1838 -1.6149 -1.0156 1.4601 1.2540 0 0 0.3441 1.7109 1.5969 -0.9267 0 1 0 0 2
7 LU0007 1.0410 1 0.8558 0.2939 1 -0.7003 -0.5079 -0.9586 1.3362 0.8348 0.6312 0 0 -0.9675 0.2000 0.7590 1.2786 0 0 1 0 1
8 LU0008 0.7477 1 1.1568 -1.5481 0 1.5242 -0.1838 0.3540 2.0081 0.8348 -0.6144 0 0 0.3441 0.3508 -0.5502 0.0184 1 0 0 0 1
9 LU0009 -0.8166 1 0.5548 -1.2411 0 -1.6537 -0.1838 -0.9586 0.3283 1.1474 1.2540 0 0 0.3441 -0.4153 0.5233 -1.2418 0 0 0 1 1

5. Exploratory Data Analysis (EDA)

5.1 Histogram of Numerical and Categorical Columns

numerical_cols_distribution

categorical_cols_distribution

5.2 Correlation Heatmap of Numerical Columns

correlation_matrix

5.2.1 Correlation Matrix (Tabular Form)

Age Smoking Alcohol_Use Diet_Red_Meat Diet_Salted_Processed Fruit_Veg_Intake Physical_Activity Air_Pollution Occupational_Hazards Calcium_Intake Overall_Risk_Score BMI Physical_Activity_Level
Age 1.000 0.041 -0.033 -0.057 -0.059 0.008 0.065 -0.014 0.050 0.057 -0.036 0.011 -0.049
Smoking 0.041 1.000 0.112 -0.145 -0.060 0.041 0.108 0.449 -0.017 0.084 0.431 0.003 0.022
Alcohol_Use -0.033 0.112 1.000 -0.030 -0.033 0.041 0.040 0.068 -0.007 -0.062 0.386 -0.012 0.018
Diet_Red_Meat -0.057 -0.145 -0.030 1.000 0.178 -0.193 -0.001 -0.077 -0.015 0.100 0.272 0.030 0.032
Diet_Salted_Processed -0.059 -0.060 -0.033 0.178 1.000 -0.216 -0.025 0.036 0.055 0.060 0.363 -0.008 0.001
Fruit_Veg_Intake 0.008 0.041 0.041 -0.193 -0.216 1.000 0.013 -0.048 -0.053 -0.022 -0.152 -0.014 -0.013
Physical_Activity 0.065 0.108 0.040 -0.001 -0.025 0.013 1.000 0.086 0.001 -0.000 0.063 0.000 0.028
Air_Pollution -0.014 0.449 0.068 -0.077 0.036 -0.048 0.086 1.000 0.081 0.062 0.496 0.034 0.005
Occupational_Hazards 0.050 -0.017 -0.007 -0.015 0.055 -0.053 0.001 0.081 1.000 0.076 0.352 0.000 0.044
Calcium_Intake 0.057 0.084 -0.062 0.100 0.060 -0.022 -0.000 0.062 0.076 1.000 0.062 0.024 -0.007
Overall_Risk_Score -0.036 0.431 0.386 0.272 0.363 -0.152 0.063 0.496 0.352 0.062 1.000 0.029 0.049
BMI 0.011 0.003 -0.012 0.030 -0.008 -0.014 0.000 0.034 0.000 0.024 0.029 1.000 -0.002
Physical_Activity_Level -0.049 0.022 0.018 0.032 0.001 -0.013 0.028 0.005 0.044 -0.007 0.049 -0.002 1.000

5.2.2 Feature Relations with Correlation Strength

We label correlations by absolute value (0.19–0.39 weak, 0.40–0.59 moderate, 0.60–0.79 strong) and list every pair that falls in one of these bands, reading only the upper triangle of the matrix to avoid duplicates.

Weak to Strong Positive Correlation (descending):

Feature Correlated With Correlation Strength
Air_Pollution Overall_Risk_Score 0.496 Moderate
Smoking Air_Pollution 0.449 Moderate
Smoking Overall_Risk_Score 0.431 Moderate
Alcohol_Use Overall_Risk_Score 0.386 Weak
Diet_Salted_Processed Overall_Risk_Score 0.363 Weak
Occupational_Hazards Overall_Risk_Score 0.352 Weak
Diet_Red_Meat Overall_Risk_Score 0.272 Weak

Weak to Strong Negative Correlation (descending by strength):

Feature Correlated With Correlation Strength
Diet_Salted_Processed Fruit_Veg_Intake -0.216 Weak
Diet_Red_Meat Fruit_Veg_Intake -0.193 Weak
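The banding used in the two tables above can be written as a small lookup. This is a sketch: the function name is hypothetical, and treating every value of 0.60 or more as "Strong" (including values above 0.79, which the report does not label) is an assumption.

```python
def correlation_strength(r):
    """Label a correlation coefficient by its absolute value.

    Returns None for values below the weakest band (|r| < 0.19).
    """
    magnitude = abs(r)
    if magnitude >= 0.60:
        return "Strong"  # assumption: no separate label above 0.79
    if magnitude >= 0.40:
        return "Moderate"
    if magnitude >= 0.19:
        return "Weak"
    return None


print(correlation_strength(0.496))   # Moderate
print(correlation_strength(-0.216))  # Weak
print(correlation_strength(0.178))   # None
```

The last call shows why the Diet_Red_Meat / Diet_Salted_Processed pair (0.178) does not appear in the tables.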

5.3 Scatter Plot Collage

scatter_collage_standardized

Density (intensity) color code (low -> high): purple->blue->green->yellow->red

5.4 Binned Heatmap: Numerical Columns vs Each Categorical Column

Cancer_Type:

binned_heatmap_num_vs_Cancer_Type

Gender:

binned_heatmap_num_vs_Gender

Family_History:

binned_heatmap_num_vs_Family_History

BRCA_Mutation:

binned_heatmap_num_vs_BRCA_Mutation

H_Pylori_Infection:

binned_heatmap_num_vs_H_Pylori_Infection

Risk_Level:

binned_heatmap_num_vs_Risk_Level

5.5 Violin Plot Diagram: Numerical Columns vs Each Categorical Column

Cancer_Type:

violinplot_num_vs_Cancer_Type

Gender:

violinplot_num_vs_Gender

Family_History:

violinplot_num_vs_Family_History

BRCA_Mutation:

violinplot_num_vs_BRCA_Mutation

H_Pylori_Infection:

violinplot_num_vs_H_Pylori_Infection

Risk_Level:

violinplot_num_vs_Risk_Level

5.6 Heatmap Collage: Each Categorical Column vs All Other Categorical Columns

Cancer_Type:

heatmap_collage_Cancer_Type_vs_all

Gender:

heatmap_collage_Gender_vs_all

Family_History:

heatmap_collage_Family_History_vs_all

BRCA_Mutation:

heatmap_collage_BRCA_Mutation_vs_all

H_Pylori_Infection:

heatmap_collage_H_Pylori_Infection_vs_all

Risk_Level:

heatmap_collage_Risk_Level_vs_all

5.7 Dimensionality Reduction: Principal Component Analysis (PCA)

5.7.1 Explained Variance (Scree Plot)

This plot shows how much information each Principal Component captures.

pca_scree_plot

PC1 explains 13.4% of variance; PC1–PC2 together explain 25.3%. 9 components reach 80% cumulative variance; 11 reach 90%.

5.7.2 PCA Cluster Map

This plot projects all numerical variables into 2D to visualize risk-level clustering.

pca_2d_scatter

With only 25.3% of variance captured in 2D, the three risk classes show substantial overlap, suggesting that risk-level separation requires higher-dimensional features or non-linear decision boundaries — consistent with the use of tree-based ensemble models below.


6. Inferential Statistics: Hypothesis Testing

This section tests three hypotheses on the cleaned data, using the chi-square test of independence, the independent-samples t-test, and one-way ANOVA.

6.1 Chi-Square Test

Aspect Details
Description Tests whether there is an association between two categorical variables by comparing observed counts to expected counts under independence.
Typical use case Assess if a risk factor (e.g., family history) is related to an outcome category (e.g., cancer risk level) across groups (e.g., cancer types).
Key inference metrics Chi-square statistic (χ²), degrees of freedom (df), p-value, and whether p < 0.05 (evidence against independence).

chi-square Test of Independence Workflow Diagram

Chi-square test of independence for Family_History vs (Cancer_Type × Risk_Level).
Null Hypothesis ($H_0$): Family history and risk level are independent (no association) within each cancer type.
Alternative Hypothesis ($H_A$): Family history and risk level are not independent (association exists) within each cancer type.
Test statistic: $\chi^2 = \sum\left(\frac{(O_{ij} - E_{ij})^2}{E_{ij}}\right)$, where $O_{ij}$ are observed counts and $E_{ij}$ are expected counts.

Chi-Square Results: Family_History vs Risk_Level by Cancer_Type

Cancer_Type Chi2 df p_value Cramers_V Significant_at_0.05 N
Breast 4.9802 2 0.0829 0.1044 No 457
Colon 1.4844 2 0.4761 0.0600 No 412
Lung 3.7581 2 0.1527 0.0854 No 515
Prostate 0.3939 2 0.8212 0.0364 No 297
Skin 1.8931 2 0.3881 0.0812 No 287

Verdict:
The association between Family_History and Risk_Level is not statistically significant at the 0.05 level for any of the 5 cancer types (all p-values > 0.05). Based on this test, we find no evidence that family history is associated with risk level within any cancer type.

Note: Non-significance does not prove absence of effect; small per-group sample sizes may limit statistical power.
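A minimal sketch of the two quantities reported in the results table: the chi-square statistic computed from a contingency table, and Cramér's V derived from it. The function names are hypothetical and the contingency table in the example is made-up illustration data, not taken from the dataset; only the Breast figures (χ² = 4.9802, N = 457) come from the report.

```python
import math


def chi_square_statistic(observed):
    """Chi-square statistic and degrees of freedom for a 2-D count table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n  # expected under independence
            chi2 += (o_ij - e_ij) ** 2 / e_ij
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V effect size for an n_rows x n_cols table."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


# Hypothetical 2 x 3 table (Family_History x Risk_Level) with perfectly
# proportional rows, so the statistic is zero.
chi2, dof = chi_square_statistic([[10, 20, 30], [20, 40, 60]])
print(chi2, dof)

# Reported Breast result: chi2 = 4.9802 with N = 457
print(round(cramers_v(4.9802, 457, 2, 3), 4))  # 0.1044
```

With two Family_History levels and three Risk_Level levels, df = (2 − 1)(3 − 1) = 2, which matches the df column above.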

chi_family_history_vs_risk_by_cancer_type

6.2 t-Test: Overall_Risk_Score vs BMI Group

Aspect Details
Description Compares the mean of a numerical variable between two independent groups.
Typical use case Evaluate whether people with high BMI have different average Overall_Risk_Score compared to people with low BMI.
Key inference metrics Group means, t statistic, p-value, and whether p < 0.05 (evidence of a difference in means).

t-test Workflow Diagram

Independent samples t-test for Overall_Risk_Score between Low BMI and High BMI groups.
Null Hypothesis ($H_0$): There is no difference: people in the High BMI and Low BMI groups have the same average Risk Score.
Alternative Hypothesis ($H_1$): People with above-average BMI have a different average Risk Score from people with below-average BMI (the reported p-value is two-sided).

Test statistic formula:

$$t = \frac{\overline{X}_1 - \overline{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

t-Test Results: Overall_Risk_Score by BMI Group

BMI_Threshold_Mean N_High_BMI N_Low_BMI Mean_Risk_High_BMI Mean_Risk_Low_BMI t_stat p_value Cohens_d Significant_at_0.05
26.2014 969 999 0.4579 0.4525 0.9776 0.3284 0.0441 No

Verdict: The p-value for the difference in Overall_Risk_Score between High BMI (n=969) and Low BMI (n=999) groups is 0.3284. There is no statistically significant difference in mean Overall_Risk_Score (0.4579 vs 0.4525) at the 0.05 significance level. Fail to reject the null hypothesis.
Effect size: Cohen's d = 0.044 (negligible).
Non-significance does not prove absence of effect; the mean-based BMI split may dilute real associations.
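The test statistic and effect size can be approximated from summary statistics alone. The report does not list the per-group standard deviations, so the sketch below assumes both groups share the overall post-cleaning standard deviation of Overall_Risk_Score (0.1211); the result is therefore only an approximate cross-check, not a reproduction. Function names are hypothetical.

```python
import math


def welch_t(mean1, mean2, s1, s2, n1, n2):
    """t statistic with unpooled variances, as in the formula above."""
    return (mean1 - mean2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def cohens_d(mean1, mean2, pooled_std):
    """Standardized mean difference."""
    return (mean1 - mean2) / pooled_std


# Reported group means and sizes; ASSUMED_STD is an assumption (see above)
ASSUMED_STD = 0.1211
t_approx = welch_t(0.4579, 0.4525, ASSUMED_STD, ASSUMED_STD, 969, 999)
d_approx = cohens_d(0.4579, 0.4525, ASSUMED_STD)
print(round(t_approx, 2), round(d_approx, 3))
```

Under that assumption the statistic comes out close to 1 and the effect size close to 0.04, in line with the reported t = 0.9776 and d = 0.044; the small gap is expected because the means are rounded to four decimals.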

Visualization: boxplot of Overall_Risk_Score by BMI group

boxplot_overall_risk_vs_bmi_group

6.3 One-Way ANOVA: Overall_Risk_Score vs Cancer_Type

Aspect Details
Description Tests whether the mean Overall_Risk_Score differs across multiple Cancer_Type groups using a one-way ANOVA.
Typical use case Assess if average Overall_Risk_Score is the same for Breast, Colon, Lung, Prostate, and Skin cancer groups.
Key inference metrics F statistic, p-value, degrees of freedom (df1, df2), and whether p < 0.05 (evidence that at least one group mean differs).

One-Way ANOVA Workflow Diagram

Comparison of Overall_Risk_Score across Cancer_Type groups.

Null Hypothesis ($H_0$): All Cancer_Type groups have the same mean Overall_Risk_Score.
Alternative Hypothesis ($H_A$): At least one Cancer_Type group has a different mean Overall_Risk_Score.
Test statistic formula:

$$F = \frac{MSB}{MSW}$$

Term Full Name Formula (Simplified)
SSW Sum of Squares Within groups $\sum(X - \bar{x}_{group})^2$
SSB Sum of Squares Between groups $\sum n_i(\bar{x}_{group} - \bar{x}_{grand})^2$
MSW Mean Square Within groups $SSW / (N - k)$
MSB Mean Square Between groups $SSB / (k - 1)$

ANOVA Results: Overall_Risk_Score by Cancer_Type

k_groups N_total df1_(k-1) df2_(N-k) F_stat p_value Eta_squared Significant_at_0.05
5 1968 4 1963 44.8968 4.03e-36 0.0838 Yes

Verdict: Reject $H_0$.
Hence, there is a statistically significant difference in mean Overall_Risk_Score between Cancer_Type groups.
Effect size: η² = 0.0838 (medium).
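The SSB / SSW decomposition can be sketched directly from its definitions. The toy groups below are made-up numbers for illustration only; the second helper recovers η² from the reported F statistic and degrees of freedom, using the identity η² = F·df1 / (F·df1 + df2). Function names are hypothetical.

```python
def one_way_anova(groups):
    """Return F, degrees of freedom, and eta squared for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df1, df2 = k - 1, n_total - k
    f_stat = (ssb / df1) / (ssw / df2)  # MSB / MSW
    return {"F": f_stat, "df1": df1, "df2": df2, "eta_squared": ssb / (ssb + ssw)}


def eta_squared_from_f(f_stat, df1, df2):
    """Recover eta squared from a reported F statistic."""
    return f_stat * df1 / (f_stat * df1 + df2)


# Toy data (hypothetical): three groups of three observations
toy = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(toy)

# Cross-check of the reported effect size
print(round(eta_squared_from_f(44.8968, 4, 1963), 4))  # 0.0838
```

The cross-check shows that the reported η² follows from the reported F, df1, and df2.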

Per-group mean Overall_Risk_Score (ranked):

Cancer_Type Mean_Risk_Score N
Lung 0.5002 515
Colon 0.4665 412
Skin 0.4556 287
Breast 0.4336 457
Prostate 0.3940 297

Tukey HSD post-hoc pairwise comparisons:

Group A Group B p_value Significant
Breast Colon 0.0003 Yes
Breast Lung 0.0000 Yes
Breast Prostate 0.0000 Yes
Breast Skin 0.0888 No
Colon Lung 0.0001 Yes
Colon Prostate 0.0000 Yes
Colon Skin 0.7381 No
Lung Prostate 0.0000 Yes
Lung Skin 0.0000 Yes
Prostate Skin 0.0000 Yes

Visualization: boxplot of Overall_Risk_Score by Cancer_Type

boxplot_overall_risk_vs_cancer_type

6.4 Key Metrics Summary

Test Comparison Statistic p-value Significant (α=0.05)
Chi-square Family_History vs Risk_Level (Breast) χ²=4.9802, V=0.104 0.0829 No
Chi-square Family_History vs Risk_Level (Colon) χ²=1.4844, V=0.060 0.4761 No
Chi-square Family_History vs Risk_Level (Lung) χ²=3.7581, V=0.085 0.1527 No
Chi-square Family_History vs Risk_Level (Prostate) χ²=0.3939, V=0.036 0.8212 No
Chi-square Family_History vs Risk_Level (Skin) χ²=1.8931, V=0.081 0.3881 No
t-Test Overall_Risk_Score by BMI Group t=0.9776, d=0.044 0.3284 No
ANOVA Overall_Risk_Score by Cancer_Type F=44.8968, η²=0.0838 4.033e-36 Yes

7. Feature Engineering

7.1 Feature Set for Machine Learning

We use df_processed (standardized continuous features + encoded categoricals from Section 4) for ML.
Target: Risk_Level_Encoded (Low=0, Medium=1, High=2). Patient_ID (an identifier) and Overall_Risk_Score are excluded from the features, the latter to prevent target leakage.

Quantity Value
Total samples (rows) 1968
Number of features (X) 20
Target column (y) Risk_Level_Encoded

7.2 Feature Engineering Process

Feature Engineering Pipeline

7.3 Feature Space Snapshot

First 10 rows of the ML feature matrix X (standardized numerical features + encoded categoricals):

Age Gender Smoking Alcohol_Use Family_History Diet_Red_Meat Diet_Salted_Processed Fruit_Veg_Intake Physical_Activity Air_Pollution Occupational_Hazards BRCA_Mutation H_Pylori_Infection Calcium_Intake BMI Physical_Activity_Level Cancer_Type_Colon Cancer_Type_Lung Cancer_Type_Prostate Cancer_Type_Skin
0.4544 0.0000 0.5548 -0.9341 0.0000 -0.0648 -0.5079 0.6821 -0.0077 0.2095 -0.6144 1.0000 0.0000 -1.2954 0.4710 0.0184 0.0000 0.0000 0.0000 0.0000
1.0410 1.0000 0.8558 1.2149 0.0000 -1.6537 -0.5079 0.6821 -1.0156 -0.7284 -0.6144 0.0000 0.0000 0.3441 -0.2098 1.2786 0.0000 0.0000 1.0000 0.0000
-0.8166 1.0000 0.5548 1.5219 0.0000 -0.7003 -0.5079 -0.3023 -1.0156 0.8348 1.5654 0.0000 0.0000 0.6720 0.6281 -0.9267 0.0000 0.0000 0.0000 1.0000
-0.2300 0.0000 0.2538 -0.9341 0.0000 0.2530 -0.8320 -0.3023 0.6642 -0.4157 0.9426 0.0000 0.0000 1.3277 1.5446 0.6485 1.0000 0.0000 0.0000 0.0000
0.3566 1.0000 1.4578 0.6009 0.0000 0.2530 -0.5079 1.6666 1.6721 1.4601 1.2540 0.0000 0.0000 0.3441 -0.2884 -0.9267 0.0000 1.0000 0.0000 0.0000
1.3343 1.0000 1.4578 0.9079 0.0000 0.2530 -1.4803 0.3540 -0.6796 1.4601 0.6312 0.0000 0.0000 -1.2954 -0.2884 -1.2418 0.0000 1.0000 0.0000 0.0000
-0.4255 0.0000 1.4578 1.5219 0.0000 1.2064 -0.1838 -1.6149 -1.0156 1.4601 1.2540 0.0000 0.0000 0.3441 1.5969 -0.9267 0.0000 1.0000 0.0000 0.0000
1.0410 1.0000 0.8558 0.2939 1.0000 -0.7003 -0.5079 -0.9586 1.3362 0.8348 0.6312 0.0000 0.0000 -0.9675 0.7590 1.2786 0.0000 0.0000 1.0000 0.0000
0.7477 1.0000 1.1568 -1.5481 0.0000 1.5242 -0.1838 0.3540 2.0081 0.8348 -0.6144 0.0000 0.0000 0.3441 -0.5502 0.0184 1.0000 0.0000 0.0000 0.0000
-0.8166 1.0000 0.5548 -1.2411 0.0000 -1.6537 -0.1838 -0.9586 0.3283 1.1474 1.2540 0.0000 0.0000 0.3441 0.5233 -1.2418 0.0000 0.0000 0.0000 1.0000

8. Machine Learning Algorithms

8.1 Machine Learning Workflow

Machine Learning Workflow

8.2 Train/Test Split and Evaluation Metrics

We split the data into training and test sets using an 80/20 ratio with stratification on the target.
This preserves the distribution of risk levels in both training and test sets and avoids biased evaluation.
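A library-free sketch of a stratified 80/20 split is given below. It is not the report's code: the function name, the seed, and the per-class rounding rule are assumptions (scikit-learn's `train_test_split` with `stratify=y` is the usual library route and may round differently). The class totals used in the example are those implied by the train and test counts reported later in this section.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)  # seed value is arbitrary
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # assumed rounding rule
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Class totals after cleaning: Low = 314, Medium = 1556, High = 98
labels = [0] * 314 + [1] * 1556 + [2] * 98
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), dict(test_counts))
```

With these totals the sketch yields 1574 training rows and 394 test rows, with 63 Low, 311 Medium, and 20 High in the test set.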

Training set sample (first 10 rows):

Age Gender Smoking Alcohol_Use Family_History Diet_Red_Meat Diet_Salted_Processed Fruit_Veg_Intake Physical_Activity Air_Pollution Occupational_Hazards BRCA_Mutation H_Pylori_Infection Calcium_Intake BMI Physical_Activity_Level Cancer_Type_Colon Cancer_Type_Lung Cancer_Type_Prostate Cancer_Type_Skin Risk_Level_Encoded
-0.6211 0.0000 1.4578 0.6009 1.0000 -0.7003 1.7609 0.3540 -0.3436 1.4601 1.2540 0.0000 0.0000 -0.3117 -1.5453 -0.9267 0.0000 1.0000 0.0000 0.0000 1.0000
0.5521 0.0000 -0.0473 -0.9341 0.0000 0.5708 0.1403 -1.6149 0.3283 1.4601 -0.3030 0.0000 0.0000 0.3441 1.4136 0.6485 0.0000 0.0000 0.0000 0.0000 1.0000
0.4544 0.0000 -0.9503 0.6009 0.0000 -0.3825 -1.4803 1.3384 -0.6796 1.1474 1.2540 0.0000 0.0000 -0.6396 0.8638 1.2786 0.0000 0.0000 0.0000 0.0000 1.0000
1.1387 0.0000 0.2538 1.2149 0.0000 -0.7003 -0.8320 -0.6304 0.3283 0.5222 0.0084 0.0000 0.0000 -1.2954 -0.2884 1.2786 0.0000 0.0000 0.0000 0.0000 1.0000
-0.1322 0.0000 -0.6493 -1.5481 0.0000 0.2530 -0.8320 -0.6304 1.3362 -0.4157 -0.3030 0.0000 0.0000 -0.3117 -0.6550 -0.6117 0.0000 0.0000 0.0000 0.0000 0.0000
0.4544 0.0000 1.1568 1.2149 0.0000 0.8886 1.4368 0.3540 -0.0077 -0.1031 1.5654 0.0000 0.0000 -1.2954 -0.2622 0.3335 1.0000 0.0000 0.0000 0.0000 1.0000
1.2365 0.0000 0.2538 -1.2411 0.0000 0.8886 -0.5079 -1.6149 0.3283 1.4601 -0.3030 0.0000 0.0000 -0.9675 1.7279 -1.5568 0.0000 0.0000 0.0000 1.0000 1.0000
-0.5233 1.0000 -0.0473 -0.9341 0.0000 -0.3825 1.4368 -0.9586 1.6721 1.4601 -0.9258 0.0000 0.0000 -1.2954 1.3089 1.2786 0.0000 0.0000 1.0000 0.0000 0.0000
-0.6211 0.0000 1.1568 -1.5481 1.0000 0.8886 -0.8320 -1.6149 0.6642 0.2095 0.9426 0.0000 0.0000 -1.2954 0.8114 0.9635 0.0000 0.0000 0.0000 0.0000 1.0000
-0.6211 1.0000 1.4578 -0.3201 0.0000 1.5242 1.1127 -0.6304 2.0081 -0.1031 -1.5487 0.0000 1.0000 -0.6396 -0.0789 -0.9267 0.0000 1.0000 0.0000 0.0000 1.0000

Test set sample (first 5 rows):

Age Gender Smoking Alcohol_Use Family_History Diet_Red_Meat Diet_Salted_Processed Fruit_Veg_Intake Physical_Activity Air_Pollution Occupational_Hazards BRCA_Mutation H_Pylori_Infection Calcium_Intake BMI Physical_Activity_Level Cancer_Type_Colon Cancer_Type_Lung Cancer_Type_Prostate Cancer_Type_Skin Risk_Level_Encoded
-0.5233 0.0000 -1.5524 -0.6271 0.0000 -0.3825 -1.4803 -0.3023 -1.0156 -1.0410 -1.5487 0.0000 0.0000 -1.2954 2.6443 -0.2966 0.0000 0.0000 0.0000 0.0000 0.0000
0.0633 1.0000 -0.9503 -0.0131 0.0000 1.2064 -0.8320 -0.6304 0.6642 -1.0410 0.0084 0.0000 0.0000 0.6720 0.7852 0.6485 1.0000 0.0000 0.0000 0.0000 1.0000
-0.6211 0.0000 -1.5524 -0.0131 0.0000 0.2530 0.4644 0.3540 0.6642 -1.6663 -0.3030 0.0000 0.0000 -1.2954 0.1306 0.9635 0.0000 0.0000 0.0000 0.0000 1.0000
-0.9144 1.0000 1.4578 0.2939 0.0000 -1.6537 -0.1838 -0.6304 -0.0077 1.1474 0.6312 0.0000 1.0000 -0.9675 -0.8121 0.0184 0.0000 1.0000 0.0000 0.0000 1.0000
-0.0345 1.0000 1.4578 0.9079 1.0000 1.5242 1.1127 0.3540 -0.0077 1.4601 1.5654 0.0000 0.0000 0.3441 0.6281 1.5936 0.0000 1.0000 0.0000 0.0000 2.0000

8.2.1 SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE is a data augmentation technique that addresses class imbalance by generating synthetic samples
for minority classes. Instead of simply duplicating existing minority samples (which can lead to overfitting),
SMOTE creates new data points by interpolating between existing minority-class neighbours in feature space.

Why it is needed here:
The target variable Risk_Level is heavily imbalanced — the Medium class dominates (~79%), while High (~5%)
has very few examples. Without balancing, classifiers tend to predict the majority class and ignore minorities,
resulting in high accuracy but poor recall and F1 for under-represented classes.

How it works:

  1. For each minority-class sample, find its $k$ nearest neighbours (default $k=5$) in feature space.
  2. Randomly pick one of the $k$ neighbours.
  3. Create a synthetic sample at a random point along the line segment between the original and the neighbour:

$$x_{\text{new}} = x_i + \lambda \cdot (x_{\text{neighbour}} - x_i), \quad \lambda \in [0,1]$$

  4. Repeat until the minority class reaches the desired count (here, equal to the majority class).
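The steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the library implementation the report presumably used: it picks the base sample at random instead of cycling through all minority samples, uses Euclidean distance, and the function name and seed are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x_i = rng.choice(minority)
        # Step 1: k nearest minority neighbours of x_i (excluding x_i itself)
        neighbours = sorted(
            (p for p in minority if p is not x_i),
            key=lambda p: math.dist(x_i, p),
        )[:k]
        # Step 2: pick one neighbour at random
        x_nb = rng.choice(neighbours)
        # Step 3: interpolate with a random lambda in [0, 1)
        lam = rng.random()
        synthetic.append([a + lam * (b - a) for a, b in zip(x_i, x_nb)])
    return synthetic


# Hypothetical minority class: four corners of the unit square
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=6, k=3)
print(len(new_points))  # 6
```

Every synthetic point lies on a segment between two existing minority points, so it stays inside the region the minority class already occupies.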

Important: SMOTE is applied only on the training set — the test set remains untouched to ensure
an honest evaluation on real data.

Class distribution before SMOTE:

Class Count
Low (0) 251
Medium (1) 1245
High (2) 78

Class distribution after SMOTE:

Class Count
Low (0) 1245
Medium (1) 1245
High (2) 1245

Evaluation protocol: All metrics below are computed on the original, unmodified test set (no SMOTE). This ensures honest evaluation on real-world class distribution.

8.3 LightGBM Classifier

LightGBM is a gradient boosting framework based on decision trees. It builds an additive ensemble of trees, fitting each new tree to the gradient of a differentiable loss function (here, multiclass log-loss) with respect to the current predictions.

LightGBM Training Loop

Hyperparameters: n_estimators=500, learning_rate=0.05, max_depth=7, num_leaves=63, min_child_samples=10, subsample=0.8, colsample_bytree=0.8

LightGBM_confusion_matrix

Per-class recall — Low: 34/63 (54%), Medium: 292/311 (94%), High: 6/20 (30%).

8.4 XGBoost Classifier

XGBoost (Extreme Gradient Boosting) is another gradient boosting implementation that uses regularization
and shrinkage to reduce overfitting. It optimizes an objective function of the form

$$\mathcal{L}(\theta) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k),$$

where $l$ is the loss (e.g. multiclass log-loss) and $\Omega$ is a regularization term on each tree $f_k$.

XGBoost Training Loop

Hyperparameters: n_estimators=500, learning_rate=0.05, max_depth=6, min_child_weight=3, subsample=0.8, colsample_bytree=0.8, gamma=0.1

XGBoost_confusion_matrix

Per-class recall — Low: 38/63 (60%), Medium: 289/311 (93%), High: 11/20 (55%).

8.5 RandomForest Classifier

RandomForest is an ensemble of decision trees trained on bootstrapped samples with feature subsampling.
For classification, the final prediction is obtained via majority vote across trees:

$$\hat{y} = \text{mode}\{ h_t(x) \}_{t=1}^T,$$

where each $h_t$ is a decision tree trained on a different bootstrap sample.
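The majority-vote rule can be written in a couple of lines. This is a sketch of the formula only; the function name is hypothetical, and library implementations may combine trees differently (scikit-learn's random forest, for instance, averages per-tree class probabilities instead of counting hard votes).

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class label among the per-tree predictions.

    Ties are resolved in favour of the label that appears first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for one patient (0=Low, 1=Medium, 2=High)
print(majority_vote([1, 1, 2, 0, 1]))  # 1
```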

RandomForest Training Loop

Hyperparameters: n_estimators=500, max_depth=None, min_samples_split=5, min_samples_leaf=2

RandomForest_confusion_matrix

Per-class recall — Low: 40/63 (63%), Medium: 288/311 (93%), High: 9/20 (45%).

8.6 Overall Model Performance Comparison

Algorithm Accuracy Precision_macro Recall_macro F1_macro ROC_AUC_macro
LightGBM 0.8426 0.7085 0.5929 0.6346 0.8914
XGBoost 0.8579 0.7409 0.6941 0.7153 0.9027
RandomForest 0.8553 0.7549 0.6703 0.7037 0.9163
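Accuracy and macro-averaged recall in this table can be re-derived from the per-class recall counts quoted under each confusion matrix. The sketch below does that arithmetic; the helper names are hypothetical, and the counts are the (correct, total) pairs reported for Low, Medium, and High.

```python
def accuracy(per_class):
    """Overall accuracy from (correct, total) pairs per class."""
    correct = sum(c for c, _ in per_class)
    total = sum(n for _, n in per_class)
    return correct / total


def macro_recall(per_class):
    """Unweighted mean of per-class recall."""
    return sum(c / n for c, n in per_class) / len(per_class)


# (correct, total) for Low, Medium, High, as reported per model
REPORTED = {
    "LightGBM": [(34, 63), (292, 311), (6, 20)],
    "XGBoost": [(38, 63), (289, 311), (11, 20)],
    "RandomForest": [(40, 63), (288, 311), (9, 20)],
}

for name, counts in REPORTED.items():
    print(name, round(accuracy(counts), 4), round(macro_recall(counts), 4))
# LightGBM 0.8426 0.5929
# XGBoost 0.8579 0.6941
# RandomForest 0.8553 0.6703
```

Because macro recall weights the 20 High-risk test cases as heavily as the 311 Medium cases, the models differ far more on it than on accuracy.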

Model Metrics Comparison

ROC Curves All Models

Inference: We select the best model by macro-average F1, which balances precision and recall across all classes; this matters when minority-class detection (High risk) is the priority. By this criterion XGBoost is selected (F1_macro = 0.7153).

F1 and AUC disagree: XGBoost leads on F1 (better hard predictions), while RandomForest leads on AUC (better probability ranking). For clinical risk triage, F1_macro is preferred.