Statistical Learning Data Analysis (SLDA)

A Comprehensive Report on Cancer Risk Statistical Analysis

Submitted to: Prof.ssa Roberta Siciliano (University of Naples Federico II)

Report by: Sahaya Gnanadurai (D03000149), Rohan Baidya (D03000192)

1. Data Loading and Description

1.1 Dataset Overview

Field	Details
Title	Cancer Risk Factors Data
Author	Tarek Masryo
Year	2025
Publisher	Kaggle
DOI	10.34740/KAGGLE/DSV/13280499
URL	https://www.kaggle.com/dsv/13280499
GitHub raw URL	https://raw.githubusercontent.com/tarekmasryo/cancer-risk-factors-data/main/data/cancer-risk-factors.csv

The dataset contains records for 2000 cancer patients. Each record covers lifestyle, environmental, and genetic risk factors, together with a composite risk score and a categorical risk level. It is designed to support analysis of how these factors relate to risk across several cancer types.

Dataset Information:

Metric	Value
Dataset Shape	(2000, 21)
Number of Records	2000
Number of Features	21

Variable Descriptions:

Column Name	Description
Patient_ID	Unique identifier for each patient
Cancer_Type	Type of cancer (e.g., Breast, Colon, Lung, Prostate, Skin)
Age	Patient age in years
Gender	Patient gender (binary, coded 0 / 1)
Smoking	Smoking level (numeric scale, 0–10)
Alcohol_Use	Alcohol consumption level (numeric scale)
Obesity	Obesity level (numeric scale, 0–10; redundant with BMI, dropped in cleaning)
Family_History	Family history of cancer (0 = No, 1 = Yes)
Diet_Red_Meat	Red meat consumption level (numeric scale)
Diet_Salted_Processed	Salted/processed food consumption level (numeric scale)
Fruit_Veg_Intake	Fruit and vegetable intake level (numeric scale)
Physical_Activity	Physical activity level (numeric scale)
Air_Pollution	Exposure to air pollution (numeric scale)
Occupational_Hazards	Exposure to occupational hazards (numeric scale)
BRCA_Mutation	BRCA gene mutation carrier (0 = No, 1 = Yes)
H_Pylori_Infection	Helicobacter pylori infection status (0 = No, 1 = Yes)
Calcium_Intake	Calcium intake level (numeric scale)
Overall_Risk_Score	Composite risk score (continuous)
BMI	Body Mass Index (kg/m²)
Physical_Activity_Level	Physical activity rating (numeric scale, 0–10 in this file rather than Low / Medium / High labels)
Risk_Level	Cancer risk classification (Low / Medium / High) — target variable

1.2 First 5 Rows

Patient_ID	Cancer_Type	Age	Gender	Smoking	Alcohol_Use	Obesity	Diet_Red_Meat	Diet_Salted_Processed	Fruit_Veg_Intake	Physical_Activity	Air_Pollution	Occupational_Hazards	BRCA_Mutation	Calcium_Intake	Overall_Risk_Score	BMI	Physical_Activity_Level	Risk_Level
LU0000	Breast	68	0	7	2	8	5	3	7	4	6	3	1	0	0.3987	28.0000	5	Medium
LU0001	Prostate	74	1	8	9	8	0	3	7	1	3	3	0	5	0.4243	25.4000	9	Medium
LU0002	Skin	55	1	7	10	7	3	3	4	1	8	10	0	6	0.6051	28.6000	2	Medium
LU0003	Colon	61	0	6	2	2	6	2	4	6	4	8	0	8	0.3184	32.1000	7	Low
LU0004	Lung	67	1	10	7	4	6	3	10	9	10	9	0	5	0.5244	25.1000	2	Medium

1.3 Dataset Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Patient_ID               2000 non-null   object 
 1   Cancer_Type              2000 non-null   object 
 2   Age                      2000 non-null   int64  
 3   Gender                   2000 non-null   int64  
 4   Smoking                  2000 non-null   int64  
 5   Alcohol_Use              2000 non-null   int64  
 6   Obesity                  2000 non-null   int64  
 7   Family_History           2000 non-null   int64  
 8   Diet_Red_Meat            2000 non-null   int64  
 9   Diet_Salted_Processed    2000 non-null   int64  
 10  Fruit_Veg_Intake         2000 non-null   int64  
 11  Physical_Activity        2000 non-null   int64  
 12  Air_Pollution            2000 non-null   int64  
 13  Occupational_Hazards     2000 non-null   int64  
 14  BRCA_Mutation            2000 non-null   int64  
 15  H_Pylori_Infection       2000 non-null   int64  
 16  Calcium_Intake           2000 non-null   int64  
 17  Overall_Risk_Score       2000 non-null   float64
 18  BMI                      2000 non-null   float64
 19  Physical_Activity_Level  2000 non-null   int64  
 20  Risk_Level               2000 non-null   object 
dtypes: float64(2), int64(16), object(3)
memory usage: 328.2+ KB

2. Feature Separation and Descriptive Statistics

2.1 Numerical Features Statistics

Column	Min	Max	Mean	Mode	Median	Std	Skewness	Kurtosis
Age	25.0000	90.0000	63.2480	64.0000	64.0000	10.4629	-0.1814	-0.0148
Smoking	0.0000	10.0000	5.1570	10.0000	5.0000	3.3253	0.0552	-1.2552
Alcohol_Use	0.0000	10.0000	5.0350	7.0000	5.0000	3.2610	-0.0573	-1.3211
Obesity	0.0000	10.0000	5.9675	10.0000	6.0000	3.0614	-0.3250	-0.9672
Diet_Red_Meat	0.0000	10.0000	5.1895	10.0000	5.0000	3.1545	-0.0079	-1.1572
Diet_Salted_Processed	0.0000	10.0000	4.5635	4.0000	4.0000	3.0883	0.3009	-1.0429
Fruit_Veg_Intake	0.0000	10.0000	4.9275	3.0000	5.0000	3.0453	0.0185	-1.0837
Physical_Activity	0.0000	10.0000	4.0150	1.0000	4.0000	2.9785	0.4559	-0.8428
Air_Pollution	0.0000	10.0000	5.3230	10.0000	5.0000	3.2075	0.0033	-1.2114
Occupational_Hazards	0.0000	10.0000	4.9790	5.0000	5.0000	3.2129	0.0749	-1.1776
Calcium_Intake	0.0000	10.0000	3.9405	0.0000	4.0000	3.0489	0.3495	-0.9561
Overall_Risk_Score	0.0293	0.8522	0.4544	0.0293	0.4554	0.1231	0.0165	-0.2909
BMI	15.0000	41.4000	26.1833	25.9000	26.2000	3.9475	0.0477	0.0122
Physical_Activity_Level	0.0000	10.0000	4.9385	0.0000	5.0000	3.1660	-0.0103	-1.2055
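
A summary table of this shape can be produced with pandas. The snippet below is a minimal sketch, not necessarily the code behind the table: the helper name `describe_numeric` and the toy data are illustrative assumptions, and the standard deviation uses the pandas default (sample, ddof=1), which may differ from the convention used in the report.

```python
import pandas as pd

def describe_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary with the same columns as the table above."""
    num = df.select_dtypes(include="number")
    return pd.DataFrame({
        "Min": num.min(),
        "Max": num.max(),
        "Mean": num.mean(),
        "Mode": num.mode().iloc[0],  # smallest value when several modes tie
        "Median": num.median(),
        "Std": num.std(),            # sample standard deviation (ddof=1)
        "Skewness": num.skew(),
        "Kurtosis": num.kurt(),      # excess kurtosis (normal distribution = 0)
    })

# Toy data for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "Age": [50, 60, 60, 70, 80],
    "BMI": [22.0, 25.0, 25.0, 27.0, 31.0],
})
summary = describe_numeric(toy)
print(summary.round(4))
```

Because `mode()` returns tied modes in ascending order, a continuous column in which every value is unique reports its minimum as the mode, so the Mode entry is only informative for discrete columns.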

2.2 Categorical Features Statistics

Column	Mode	Unique Values	#Unique	Most Frequent
Cancer_Type	Lung	Breast, Prostate, Skin, Colon, Lung	5	Lung (527)
Risk_Level	Medium	Medium, Low, High	3	Medium (1574)
Gender	0	0, 1	2	0 (1022)
Family_History	0	0, 1	2	0 (1611)
BRCA_Mutation	0	1, 0	2	0 (1935)
H_Pylori_Infection	0	0, 1	2	0 (1607)

3. Data Cleaning

3.1 Missing Value Imputation and Removal of Redundant Columns

3.1.1 Missing Value Imputation

No missing values were detected in any of the 21 columns (all show 2000 non-null entries), so no imputation was required.

3.1.2 Removal of Redundant Columns

Since BMI is the internationally standardized clinical measure, we drop Obesity to avoid redundancy.

Patient_ID	Cancer_Type	Age	Gender	Smoking	Alcohol_Use	Diet_Red_Meat	Diet_Salted_Processed	Fruit_Veg_Intake	Physical_Activity	Air_Pollution	Occupational_Hazards	BRCA_Mutation	Calcium_Intake	Overall_Risk_Score	BMI	Physical_Activity_Level	Risk_Level
LU0000	Breast	68	0	7	2	5	3	7	4	6	3	1	0	0.3987	28.0000	5	Medium
LU0001	Prostate	74	1	8	9	0	3	7	1	3	3	0	5	0.4243	25.4000	9	Medium
LU0002	Skin	55	1	7	10	3	3	4	1	8	10	0	6	0.6051	28.6000	2	Medium
LU0003	Colon	61	0	6	2	6	2	4	6	4	8	0	8	0.3184	32.1000	7	Low
LU0004	Lung	67	1	10	7	6	3	10	9	10	9	0	5	0.5244	25.1000	2	Medium

3.2 Boxplot Visualization Before Outlier Removal

numerical_cols_boxplot

3.3 Outlier Detection and Removal (IQR Method)

Column	Q1	Q3	IQR	Rows removed
Age	56.00	70.00	14.00	9
Smoking	2.00	8.00	6.00	0
Alcohol_Use	2.00	8.00	6.00	0
Diet_Red_Meat	3.00	8.00	5.00	0
Diet_Salted_Processed	2.00	7.00	5.00	0
Fruit_Veg_Intake	3.00	8.00	5.00	0
Physical_Activity	1.00	6.00	5.00	0
Air_Pollution	3.00	8.00	5.00	0
Occupational_Hazards	2.00	8.00	6.00	0
Calcium_Intake	1.00	6.00	5.00	0
Overall_Risk_Score	0.37	0.54	0.17	6
BMI	23.50	28.70	5.20	17
Physical_Activity_Level	2.00	8.00	6.00	0

Metric	Value
Original rows	2000
After outlier removal	1968
Rows removed	32
Percentage removed	1.60%
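
The IQR rule can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used for the table: the fence multiplier `k=1.5`, the column-by-column (sequential) filtering, and the helper name `remove_outliers_iqr` are all assumptions, and the toy data is invented.

```python
import pandas as pd

def remove_outliers_iqr(df, cols, k=1.5):
    """Drop rows outside [Q1 - k*IQR, Q3 + k*IQR], one column at a time."""
    out = df.copy()
    report = []
    for col in cols:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        keep = out[col].between(q1 - k * iqr, q3 + k * iqr)
        report.append({"Column": col, "Q1": q1, "Q3": q3, "IQR": iqr,
                       "Rows removed": int((~keep).sum())})
        out = out[keep]  # later columns only see the surviving rows
    return out, pd.DataFrame(report)

# Toy data for illustration only: one implausible Age value.
toy = pd.DataFrame({
    "Age": [55, 58, 60, 62, 65, 120],
    "BMI": [24.0, 25.0, 26.0, 27.0, 28.0, 26.5],
})
cleaned, report = remove_outliers_iqr(toy, ["Age", "BMI"])
print(report)
```

Filtering sequentially means each row is counted against the first column that rejects it, so the per-column counts add up to the total number of removed rows.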

numerical_cols_boxplot_no_outliers

4. Data Encoding and Transformation

4.1 Categorical Encoding

4.1.1 Encoding Summary

Cancer_Type is One-Hot encoded with drop_first=True to avoid multicollinearity among the dummy columns. Breast, the first category alphabetically, becomes the reference level (all four dummies equal to 0).

Cancer_Type_Colon	Cancer_Type_Lung	Cancer_Type_Prostate	Cancer_Type_Skin
0	0	0	0
0	0	1	0
0	0	0	1
1	0	0	0
0	1	0	0

Risk_Level is label encoded in its natural order (Low=0, Medium=1, High=2).

Gender, Family_History, BRCA_Mutation, H_Pylori_Infection: already numeric (binary), kept as-is and no encoding is needed
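
The two encodings described above can be sketched with pandas. The toy frame reuses the Cancer_Type and Risk_Level values of the first five rows; the variable names (`dummies`, `risk_map`, `encoded`) are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Cancer_Type": ["Breast", "Prostate", "Skin", "Colon", "Lung"],
    "Risk_Level": ["Medium", "Medium", "Medium", "Low", "Medium"],
})

# One-hot encode Cancer_Type; drop_first removes the alphabetically first
# level (Breast), which becomes the all-zero reference category.
dummies = pd.get_dummies(toy["Cancer_Type"], prefix="Cancer_Type",
                         drop_first=True, dtype=int)

# Label encode Risk_Level with an explicit mapping so that the order
# Low < Medium < High is preserved.
risk_map = {"Low": 0, "Medium": 1, "High": 2}
risk_encoded = toy["Risk_Level"].map(risk_map).rename("Risk_Level_Encoded")

encoded = pd.concat([dummies, risk_encoded], axis=1)
print(encoded)
```

An explicit dictionary is used instead of an automatic label encoder because automatic encoders typically sort labels alphabetically (High, Low, Medium), which would not match the ordinal coding stated above.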

4.1.2 Modified DataFrame (with encoded columns)

Patient_ID	Age	Gender	Smoking	Alcohol_Use	Diet_Red_Meat	Diet_Salted_Processed	Fruit_Veg_Intake	Physical_Activity	Air_Pollution	Occupational_Hazards	BRCA_Mutation	Calcium_Intake	Overall_Risk_Score	BMI	Physical_Activity_Level	Cancer_Type_Colon	Cancer_Type_Lung	Cancer_Type_Prostate	Cancer_Type_Skin	Risk_Level_Encoded
LU0000	68	0	7	2	5	3	7	4	6	3	1	0	0.3987	28.0000	5	0	0	0	0	1
LU0001	74	1	8	9	0	3	7	1	3	3	0	5	0.4243	25.4000	9	0	0	1	0	1
LU0002	55	1	7	10	3	3	4	1	8	10	0	6	0.6051	28.6000	2	0	0	0	1	1
LU0003	61	0	6	2	6	2	4	6	4	8	0	8	0.3184	32.1000	7	1	0	0	0	0
LU0004	67	1	10	7	6	3	10	9	10	9	0	5	0.5244	25.1000	2	0	1	0	0	1

4.2 Numerical Transformations

Skewness

We check the skewness of each numerical variable and apply a log transform only to highly skewed variables (|skew| > 1).

Log Transform

Value Sign	Formula
Negative	$\log(1 - x)$
Positive	$\log(1 + x)$
Zero	(no transform applied)

Where: $x$ = original value
These transforms help move the data toward a normal distribution for further analysis.

Skewness values (threshold: |skew| > 1.0):

Column	Skewness	Exceeds Threshold
Age	-0.088	No
Smoking	0.059	No
Alcohol_Use	-0.058	No
Diet_Red_Meat	-0.012	No
Diet_Salted_Processed	0.301	No
Fruit_Veg_Intake	0.019	No
Physical_Activity	0.453	No
Air_Pollution	0.004	No
Occupational_Hazards	0.077	No
Calcium_Intake	0.346	No
Overall_Risk_Score	0.021	No
BMI	0.060	No
Physical_Activity_Level	-0.011	No

No highly skewed numerical variables (|skew| > 1) requiring log transform.

Standardization (z-score)

All continuous numerical features are standardized to zero mean and unit variance:

$z = \frac{x - \mu}{\sigma}$

This ensures features are on comparable scales for distance-based and gradient-based ML algorithms.
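
A minimal sketch of the z-score step is given below. The helper name `standardize` is hypothetical, and the choice of the population standard deviation (ddof=0, the convention of scikit-learn's StandardScaler) is an assumption; the report does not state which convention was used. The final line reproduces one standardized value from the reported mean and standard deviation of Age after outlier removal (μ = 63.3526, σ = 10.2284).

```python
import pandas as pd

def standardize(df, cols):
    """Return a z-scored copy plus the mean and std used for each column."""
    out = df.copy()
    mu = df[cols].mean()
    sigma = df[cols].std(ddof=0)  # assumption: population std
    out[cols] = (df[cols] - mu) / sigma
    return out, mu, sigma

# Toy data for illustration only.
toy = pd.DataFrame({"Age": [50.0, 60.0, 70.0], "BMI": [20.0, 25.0, 30.0]})
scaled, mu, sigma = standardize(toy, ["Age", "BMI"])

# Worked check with the reported statistics: a patient aged 68.
z_age = (68 - 63.3526) / 10.2284
print(round(z_age, 4))
```

The worked value rounds to 0.4544, which agrees with the standardized Age of patient LU0000.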

4.2.1 Standardization Summary

Numerical columns standardized (zero mean, unit variance):
Age, Smoking, Alcohol_Use, Diet_Red_Meat, Diet_Salted_Processed, Fruit_Veg_Intake, Physical_Activity, Air_Pollution, Occupational_Hazards, Calcium_Intake, Overall_Risk_Score, BMI, Physical_Activity_Level

Column	Original Mean (μ)	Original Std (σ)
Age	63.3526	10.2284
Smoking	5.1570	3.3220
Alcohol_Use	5.0427	3.2573
Diet_Red_Meat	5.2038	3.1468
Diet_Salted_Processed	4.5671	3.0853
Fruit_Veg_Intake	4.9212	3.0475
Physical_Activity	4.0229	2.9765
Air_Pollution	5.3298	3.1986
Occupational_Hazards	4.9731	3.2112
Calcium_Intake	3.9507	3.0497
Overall_Risk_Score	0.4551	0.1211
BMI	26.2014	3.8190
Physical_Activity_Level	4.9416	3.1742

4.2.2 Transformed Data (Standardized Numerical Columns)

First 10 rows of standardized numerical columns only:

	Age	Smoking	Alcohol_Use	Diet_Red_Meat	Diet_Salted_Processed	Fruit_Veg_Intake	Physical_Activity	Air_Pollution	Occupational_Hazards	Calcium_Intake	Overall_Risk_Score	BMI	Physical_Activity_Level
0	0.4544	0.5548	-0.9341	-0.0648	-0.5079	0.6821	-0.0077	0.2095	-0.6144	-1.2954	-0.4660	0.4710	0.0184
1	1.0410	0.8558	1.2149	-1.6537	-0.5079	0.6821	-1.0156	-0.7284	-0.6144	0.3441	-0.2546	-0.2098	1.2786
2	-0.8166	0.5548	1.5219	-0.7003	-0.5079	-0.3023	-1.0156	0.8348	1.5654	0.6720	1.2381	0.6281	-0.9267
3	-0.2300	0.2538	-0.9341	0.2530	-0.8320	-0.3023	0.6642	-0.4157	0.9426	1.3277	-1.1286	1.5446	0.6485
4	0.3566	1.4578	0.6009	0.2530	-0.5079	1.6666	1.6721	1.4601	1.2540	0.3441	0.5715	-0.2884	-0.9267
5	1.3343	1.4578	0.9079	0.2530	-1.4803	0.3540	-0.6796	1.4601	0.6312	-1.2954	0.3594	-0.2884	-1.2418
6	-0.4255	1.4578	1.5219	1.2064	-0.1838	-1.6149	-1.0156	1.4601	1.2540	0.3441	1.7109	1.5969	-0.9267
7	1.0410	0.8558	0.2939	-0.7003	-0.5079	-0.9586	1.3362	0.8348	0.6312	-0.9675	0.2000	0.7590	1.2786
8	0.7477	1.1568	-1.5481	1.5242	-0.1838	0.3540	2.0081	0.8348	-0.6144	0.3441	0.3508	-0.5502	0.0184
9	-0.8166	0.5548	-1.2411	-1.6537	-0.1838	-0.9586	0.3283	1.1474	1.2540	0.3441	-0.4153	0.5233	-1.2418

4.2.3 Modified DataFrame (with standardized columns)

First 10 rows of the complete modified dataframe (all columns, numerical ones standardized):

	Patient_ID	Age	Gender	Smoking	Alcohol_Use	Family_History	Diet_Red_Meat	Diet_Salted_Processed	Fruit_Veg_Intake	Physical_Activity	Air_Pollution	Occupational_Hazards	BRCA_Mutation	Calcium_Intake	Overall_Risk_Score	BMI	Physical_Activity_Level	Cancer_Type_Colon	Cancer_Type_Lung	Cancer_Type_Prostate	Cancer_Type_Skin	Risk_Level_Encoded
0	LU0000	0.4544	0	0.5548	-0.9341	0	-0.0648	-0.5079	0.6821	-0.0077	0.2095	-0.6144	1	-1.2954	-0.4660	0.4710	0.0184	0	0	0	0	1
1	LU0001	1.0410	1	0.8558	1.2149	0	-1.6537	-0.5079	0.6821	-1.0156	-0.7284	-0.6144	0	0.3441	-0.2546	-0.2098	1.2786	0	0	1	0	1
2	LU0002	-0.8166	1	0.5548	1.5219	0	-0.7003	-0.5079	-0.3023	-1.0156	0.8348	1.5654	0	0.6720	1.2381	0.6281	-0.9267	0	0	0	1	1
3	LU0003	-0.2300	0	0.2538	-0.9341	0	0.2530	-0.8320	-0.3023	0.6642	-0.4157	0.9426	0	1.3277	-1.1286	1.5446	0.6485	1	0	0	0	0
4	LU0004	0.3566	1	1.4578	0.6009	0	0.2530	-0.5079	1.6666	1.6721	1.4601	1.2540	0	0.3441	0.5715	-0.2884	-0.9267	0	1	0	0	1
5	LU0005	1.3343	1	1.4578	0.9079	0	0.2530	-1.4803	0.3540	-0.6796	1.4601	0.6312	0	-1.2954	0.3594	-0.2884	-1.2418	0	1	0	0	1
6	LU0006	-0.4255	0	1.4578	1.5219	0	1.2064	-0.1838	-1.6149	-1.0156	1.4601	1.2540	0	0.3441	1.7109	1.5969	-0.9267	0	1	0	0	2
7	LU0007	1.0410	1	0.8558	0.2939	1	-0.7003	-0.5079	-0.9586	1.3362	0.8348	0.6312	0	-0.9675	0.2000	0.7590	1.2786	0	0	1	0	1
8	LU0008	0.7477	1	1.1568	-1.5481	0	1.5242	-0.1838	0.3540	2.0081	0.8348	-0.6144	0	0.3441	0.3508	-0.5502	0.0184	1	0	0	0	1
9	LU0009	-0.8166	1	0.5548	-1.2411	0	-1.6537	-0.1838	-0.9586	0.3283	1.1474	1.2540	0	0.3441	-0.4153	0.5233	-1.2418	0	0	0	1	1

5. Exploratory Data Analysis (EDA)

5.1 Histogram of Numerical and Categorical Columns

numerical_cols_distribution

categorical_cols_distribution

5.2 Correlation Heatmap of Numerical Columns

correlation_matrix

5.2.1 Correlation Matrix (Tabular Form)

	Age	Smoking	Alcohol_Use	Diet_Red_Meat	Diet_Salted_Processed	Fruit_Veg_Intake	Physical_Activity	Air_Pollution	Occupational_Hazards	Calcium_Intake	Overall_Risk_Score	BMI	Physical_Activity_Level
Age	1.000	0.041	-0.033	-0.057	-0.059	0.008	0.065	-0.014	0.050	0.057	-0.036	0.011	-0.049
Smoking	0.041	1.000	0.112	-0.145	-0.060	0.041	0.108	0.449	-0.017	0.084	0.431	0.003	0.022
Alcohol_Use	-0.033	0.112	1.000	-0.030	-0.033	0.041	0.040	0.068	-0.007	-0.062	0.386	-0.012	0.018
Diet_Red_Meat	-0.057	-0.145	-0.030	1.000	0.178	-0.193	-0.001	-0.077	-0.015	0.100	0.272	0.030	0.032
Diet_Salted_Processed	-0.059	-0.060	-0.033	0.178	1.000	-0.216	-0.025	0.036	0.055	0.060	0.363	-0.008	0.001
Fruit_Veg_Intake	0.008	0.041	0.041	-0.193	-0.216	1.000	0.013	-0.048	-0.053	-0.022	-0.152	-0.014	-0.013
Physical_Activity	0.065	0.108	0.040	-0.001	-0.025	0.013	1.000	0.086	0.001	-0.000	0.063	0.000	0.028
Air_Pollution	-0.014	0.449	0.068	-0.077	0.036	-0.048	0.086	1.000	0.081	0.062	0.496	0.034	0.005
Occupational_Hazards	0.050	-0.017	-0.007	-0.015	0.055	-0.053	0.001	0.081	1.000	0.076	0.352	0.000	0.044
Calcium_Intake	0.057	0.084	-0.062	0.100	0.060	-0.022	-0.000	0.062	0.076	1.000	0.062	0.024	-0.007
Overall_Risk_Score	-0.036	0.431	0.386	0.272	0.363	-0.152	0.063	0.496	0.352	0.062	1.000	0.029	0.049
BMI	0.011	0.003	-0.012	0.030	-0.008	-0.014	0.000	0.034	0.000	0.024	0.029	1.000	-0.002
Physical_Activity_Level	-0.049	0.022	0.018	0.032	0.001	-0.013	0.028	0.005	0.044	-0.007	0.049	-0.002	1.000

5.2.2 Feature Relations with Correlation Strength

We classify correlations by absolute value (0.19–0.39 weak, 0.40–0.59 moderate, 0.60–0.79 strong) and list every feature pair that reaches at least the weak band. Only the upper triangle of the matrix is read, so each pair appears once.

Weak to Strong Positive Correlation (descending):

Feature	Correlated With	Correlation	Strength
Air_Pollution	Overall_Risk_Score	0.496	Moderate
Smoking	Air_Pollution	0.449	Moderate
Smoking	Overall_Risk_Score	0.431	Moderate
Alcohol_Use	Overall_Risk_Score	0.386	Weak
Diet_Salted_Processed	Overall_Risk_Score	0.363	Weak
Occupational_Hazards	Overall_Risk_Score	0.352	Weak
Diet_Red_Meat	Overall_Risk_Score	0.272	Weak

Weak to Strong Negative Correlation (descending by strength):

Feature	Correlated With	Correlation	Strength
Diet_Salted_Processed	Fruit_Veg_Intake	-0.216	Weak
Diet_Red_Meat	Fruit_Veg_Intake	-0.193	Weak
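
The pair extraction can be sketched as below. The helper names `strength` and `correlated_pairs` are hypothetical; the band limits follow the thresholds stated in this section, and the small matrix reuses four rows and columns of the correlation table above.

```python
import numpy as np
import pandas as pd

def strength(r):
    """Map a correlation to the bands used in this section (None if below 0.19)."""
    a = abs(r)
    if a >= 0.60:
        return "Strong"
    if a >= 0.40:
        return "Moderate"
    if a >= 0.19:
        return "Weak"
    return None

def correlated_pairs(corr: pd.DataFrame) -> pd.DataFrame:
    """List upper-triangle pairs whose |r| reaches at least the weak band."""
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # strictly above diagonal
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["Feature", "Correlated With", "Correlation"]
    pairs["Strength"] = pairs["Correlation"].map(strength)
    pairs = pairs.dropna(subset=["Strength"])
    return pairs.sort_values("Correlation", ascending=False, ignore_index=True)

cols = ["Smoking", "Air_Pollution", "Overall_Risk_Score", "BMI"]
corr = pd.DataFrame(
    [[1.000, 0.449, 0.431, 0.003],
     [0.449, 1.000, 0.496, 0.034],
     [0.431, 0.496, 1.000, 0.029],
     [0.003, 0.034, 0.029, 1.000]],
    index=cols, columns=cols,
)
pairs = correlated_pairs(corr)
print(pairs)
```

On the full matrix, `df.corr()` would take the place of the hand-typed `corr` frame.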

5.3 Scatter Plot Collage

scatter_collage_standardized

Density (intensity) color code (low -> high): purple->blue->green->yellow->red

5.4 Binned Heatmap: Numerical Columns vs Each Categorical Column

Cancer_Type:

binned_heatmap_num_vs_Cancer_Type

Gender:

binned_heatmap_num_vs_Gender

Family_History:

binned_heatmap_num_vs_Family_History

BRCA_Mutation:

binned_heatmap_num_vs_BRCA_Mutation

H_Pylori_Infection:

binned_heatmap_num_vs_H_Pylori_Infection

Risk_Level:

binned_heatmap_num_vs_Risk_Level

5.5 Violin Plot Diagram: Numerical Columns vs Each Categorical Column

Cancer_Type:

violinplot_num_vs_Cancer_Type

Gender:

violinplot_num_vs_Gender

Family_History:

violinplot_num_vs_Family_History

BRCA_Mutation:

violinplot_num_vs_BRCA_Mutation

H_Pylori_Infection:

violinplot_num_vs_H_Pylori_Infection

Risk_Level:

violinplot_num_vs_Risk_Level

5.6 Heatmap Collage: Each Categorical Column vs All Other Categorical Columns

Cancer_Type:

heatmap_collage_Cancer_Type_vs_all

Gender:

heatmap_collage_Gender_vs_all

Family_History:

heatmap_collage_Family_History_vs_all

BRCA_Mutation:

heatmap_collage_BRCA_Mutation_vs_all

H_Pylori_Infection:

heatmap_collage_H_Pylori_Infection_vs_all

Risk_Level:

heatmap_collage_Risk_Level_vs_all

5.7 Dimensionality Reduction: Principal Component Analysis (PCA)

5.7.1 Explained Variance (Scree Plot)

This plot shows how much information each Principal Component captures.

pca_scree_plot

PC1 explains 13.4% of variance; PC1–PC2 together explain 25.3%. 9 components reach 80% cumulative variance; 11 reach 90%.
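
The explained-variance figures can be derived as sketched below, using a plain NumPy singular value decomposition instead of a dedicated PCA class. The function names and the random toy matrix are assumptions for illustration; the sketch does not reproduce the percentages reported above.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                    # centre each column
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    var = s ** 2
    return var / var.sum()

def components_needed(ratio, target):
    """Smallest number of components whose cumulative ratio reaches target."""
    return int(np.searchsorted(np.cumsum(ratio), target) + 1)

# Toy data for illustration only: four columns with unequal spread.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
ratio = explained_variance_ratio(X)
print(ratio.round(3), components_needed(ratio, 0.80))
```

When the input columns are already standardized, as in Section 4, each variable contributes equally to the total variance, which is one reason the leading components of weakly correlated features explain only a modest share.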

5.7.2 PCA Cluster Map

This plot projects all numerical variables into 2D to visualize risk-level clustering.

pca_2d_scatter

With only 25.3% of variance captured in 2D, the three risk classes show substantial overlap, suggesting that risk-level separation requires higher-dimensional features or non-linear decision boundaries — consistent with the use of tree-based ensemble models below.

6. Inferential Statistics: Hypothesis Testing

This section tests three hypotheses about the cleaned data using a chi-square test of independence, an independent-samples t-test, and a one-way ANOVA.

6.1 Chi-Square Test

Aspect	Details
Description	Tests whether there is an association between two categorical variables by comparing observed counts to expected counts under independence.
Typical use case	Assess if a risk factor (e.g., family history) is related to an outcome category (e.g., cancer risk level) across groups (e.g., cancer types).
Key inference metrics	Chi-square statistic (χ²), degrees of freedom (df), p-value, and whether p < 0.05 (evidence against independence).

chi-square Test of Independence Workflow Diagram

Chi-square test of independence for Family_History vs Risk_Level, run separately within each Cancer_Type.
Null Hypothesis ( $H_{0}$ ): Family history and risk level are independent (no association) within each cancer type.
Alternative Hypothesis ( $H_{A}$ ): Family history and risk level are not independent (association exists) within each cancer type.
Test statistic: $\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ , where $O_{ij}$ are observed counts and $E_{ij} = \frac{(\text{row total}_i)(\text{column total}_j)}{N}$ are the counts expected under independence.

Chi-Square Results: Family_History vs Risk_Level by Cancer_Type

Cancer_Type	Chi2	df	p_value	Cramers_V	Significant_at_0.05	N
Breast	4.9802	2	0.0829	0.1044	No	457
Colon	1.4844	2	0.4761	0.0600	No	412
Lung	3.7581	2	0.1527	0.0854	No	515
Prostate	0.3939	2	0.8212	0.0364	No	297
Skin	1.8931	2	0.3881	0.0812	No	287

Verdict:
The association between Family_History and Risk_Level is not statistically significant at the 0.05 level for any of the 5 cancer types (all p-values ≥ 0.05). This test therefore provides no evidence that family history is associated with risk level within any cancer type.

Note: Non-significance does not prove absence of effect; small per-group sample sizes may limit statistical power.
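
The test statistic and Cramér's V for a single contingency table can be sketched with SciPy. The helper name `chi2_with_cramers_v` and the toy counts are assumptions; V is computed as $\sqrt{\chi^2 / (N \cdot (\min(r, c) - 1))}$, which for a 2 × 3 table reduces to $\sqrt{\chi^2 / N}$.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_with_cramers_v(table: pd.DataFrame):
    """Chi-square test of independence plus Cramér's V for one table."""
    chi2, p, dof, _expected = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
    return chi2, p, dof, v

# Toy 2 x 3 table for illustration only (Family_History x Risk_Level).
toy = pd.DataFrame(
    [[30, 50, 20],
     [20, 50, 30]],
    index=[0, 1], columns=["Low", "Medium", "High"],
)
chi2, p, dof, v = chi2_with_cramers_v(toy)
print(round(chi2, 4), dof, round(p, 4), round(v, 4))

# Consistency check on the reported Breast row: chi2 = 4.9802, N = 457.
v_breast = np.sqrt(4.9802 / 457)
print(round(v_breast, 4))
```

For the stratified analysis, the same function would be applied to `pd.crosstab(group["Family_History"], group["Risk_Level"])` for each Cancer_Type group.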

chi_family_history_vs_risk_by_cancer_type

6.2 t-Test: Overall_Risk_Score vs BMI Group

Aspect	Details
Description	Compares the mean of a numerical variable between two independent groups.
Typical use case	Evaluate whether people with high BMI have different average Overall_Risk_Score compared to people with low BMI.
Key inference metrics	Group means, t statistic, p-value, and whether p < 0.05 (evidence of a difference in means).

t-test Workflow Diagram

Independent samples t-test for Overall_Risk_Score between Low BMI and High BMI groups.
Null Hypothesis ( $H_{0}$ ): The High BMI and Low BMI groups have the same mean Overall_Risk_Score.
Alternative Hypothesis ( $H_{A}$ ): The High BMI and Low BMI groups have different mean Overall_Risk_Score (two-sided).

Test statistic formula:

$t = \frac{\overline{X}_1 - \overline{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

Where:
- $\overline{X}_1$ , $\overline{X}_2$ = group means (e.g., High BMI, Low BMI)
- $s_1^2$ , $s_2^2$ = sample variances of each group
- $n_1$ , $n_2$ = number of observations in each group

t-Test Results: Overall_Risk_Score by BMI Group

BMI_Threshold_Mean	N_High_BMI	N_Low_BMI	Mean_Risk_High_BMI	Mean_Risk_Low_BMI	t_stat	p_value	Cohens_d	Significant_at_0.05
26.2014	969	999	0.4579	0.4525	0.9776	0.3284	0.0441	No

Verdict: The p-value for the difference in Overall_Risk_Score between High BMI (n=969) and Low BMI (n=999) groups is 0.3284. There is no statistically significant difference in mean Overall_Risk_Score (0.4579 vs 0.4525) at the 0.05 significance level. Fail to reject the null hypothesis.
Effect size: Cohen's d = 0.044 (negligible).
Non-significance does not prove absence of effect; the mean-based BMI split may dilute real associations.
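
The statistic above has the unequal-variance (Welch) form, which SciPy exposes through `equal_var=False`. The sketch below assumes that variant and a pooled-standard-deviation Cohen's d; the helper name `welch_t_and_cohens_d` and the toy samples are illustrative, and the report does not confirm either choice.

```python
import numpy as np
from scipy import stats

def welch_t_and_cohens_d(a, b):
    """Two-sided Welch t-test and Cohen's d (pooled SD) for two samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    t, p = stats.ttest_ind(a, b, equal_var=False)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) \
        / (len(a) + len(b) - 2)
    d = (a.mean() - b.mean()) / np.sqrt(pooled_var)
    return float(t), float(p), float(d)

# Toy samples for illustration only.
high_bmi = [1.0, 2.0, 3.0, 4.0, 5.0]
low_bmi = [2.0, 3.0, 4.0, 5.0, 6.0]
t, p, d = welch_t_and_cohens_d(high_bmi, low_bmi)
print(round(t, 4), round(p, 4), round(d, 4))
```

In the report's setting, the two samples would be the Overall_Risk_Score values of rows with BMI above and below the mean threshold of 26.2014.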

Visualization: boxplot of Overall_Risk_Score by BMI group

boxplot_overall_risk_vs_bmi_group

6.3 One-Way ANOVA: Overall_Risk_Score vs Cancer_Type

Aspect	Details
Description	Tests whether the mean Overall_Risk_Score differs across multiple Cancer_Type groups using a one-way ANOVA.
Typical use case	Assess if average Overall_Risk_Score is the same for Breast, Colon, Lung, Prostate, and Skin cancer groups.
Key inference metrics	F statistic, p-value, degrees of freedom (df1, df2), and whether p < 0.05 (evidence that at least one group mean differs).

One-Way ANOVA Workflow Diagram

Comparison of Overall_Risk_Score across Cancer_Type groups.

Null Hypothesis ( $H_{0}$ ): All Cancer_Type groups have the same mean Overall_Risk_Score.
Alternative Hypothesis ( $H_{A}$ ): At least one Cancer_Type group has a different mean Overall_Risk_Score.
Test statistic formula:

$F = \frac{MSB}{MSW}$

Where:

Term	Full Name	Formula (Simplified)
SSW	Sum of Squares Within groups	$\sum(X - \bar{x}_{group})^2$
SSB	Sum of Squares Between groups	$\sum n_i(\bar{x}_{group} - \bar{x}_{grand})^2$
MSW	Mean Square Within groups	$SSW / (N - k)$
MSB	Mean Square Between groups	$SSB / (k - 1)$

Significant at 0.05: If p-value < 0.05, we reject the null hypothesis.

ANOVA Results: Overall_Risk_Score by Cancer_Type

k_groups	N_total	df1_(k-1)	df2_(N-k)	F_stat	p_value	Eta_squared	Significant_at_0.05
5	1968	4	1963	44.8968	4.03e-36	0.0838	Yes

Verdict: Reject $H_0$ .
Hence, there is a statistically significant difference in mean Overall_Risk_Score between Cancer_Type groups.
Effect size: η² = 0.0838 (medium).
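
The sums of squares defined above translate directly into code. The following sketch computes F by hand, cross-checks it against `scipy.stats.f_oneway`, and derives η² = SSB / (SSB + SSW). The helper name `one_way_anova` and the toy groups are assumptions; the last lines check that the reported F and η² are mutually consistent via η² = F·df1 / (F·df1 + df2).

```python
import numpy as np
from scipy.stats import f_oneway

def one_way_anova(groups):
    """One-way ANOVA from sums of squares, with eta squared as effect size."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    grand_mean = np.concatenate(groups).mean()
    ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    k = len(groups)
    n = sum(len(g) for g in groups)
    f_manual = (ssb / (k - 1)) / (ssw / (n - k))
    f_scipy, p = f_oneway(*groups)
    return {"F": float(f_manual), "F_scipy": float(f_scipy), "p": float(p),
            "eta_squared": float(ssb / (ssb + ssw)), "df1": k - 1, "df2": n - k}

# Toy groups for illustration only.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)

# Consistency check on the reported values (F = 44.8968, df1 = 4, df2 = 1963).
eta_from_f = 44.8968 * 4 / (44.8968 * 4 + 1963)
print(round(eta_from_f, 4))
```

The check reproduces the reported η² of 0.0838 from the reported F statistic and degrees of freedom.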

Per-group mean Overall_Risk_Score (ranked):

Cancer_Type	Mean_Risk_Score	N
Lung	0.5002	515
Colon	0.4665	412
Skin	0.4556	287
Breast	0.4336	457
Prostate	0.3940	297

Tukey HSD post-hoc pairwise comparisons:

Group A	Group B	p_value	Significant
Breast	Colon	0.0003	Yes
Breast	Lung	<0.0001	Yes
Breast	Prostate	<0.0001	Yes
Breast	Skin	0.0888	No
Colon	Lung	0.0001	Yes
Colon	Prostate	<0.0001	Yes
Colon	Skin	0.7381	No
Lung	Prostate	<0.0001	Yes
Lung	Skin	<0.0001	Yes
Prostate	Skin	<0.0001	Yes

Visualization: boxplot of Overall_Risk_Score by Cancer_Type

boxplot_overall_risk_vs_cancer_type

6.4 Key Metrics Summary

Test	Comparison	Statistic	p-value	Significant (α=0.05)
Chi-square	Family_History vs Risk_Level (Breast)	χ²=4.9802, V=0.104	0.0829	No
Chi-square	Family_History vs Risk_Level (Colon)	χ²=1.4844, V=0.060	0.4761	No
Chi-square	Family_History vs Risk_Level (Lung)	χ²=3.7581, V=0.085	0.1527	No
Chi-square	Family_History vs Risk_Level (Prostate)	χ²=0.3939, V=0.036	0.8212	No
Chi-square	Family_History vs Risk_Level (Skin)	χ²=1.8931, V=0.081	0.3881	No
t-Test	Overall_Risk_Score by BMI Group	t=0.9776, d=0.044	0.3284	No
ANOVA	Overall_Risk_Score by Cancer_Type	F=44.8968, η²=0.0838	4.033e-36	Yes

7. Feature Engineering

7.1 Feature Set for Machine Learning

We use df_processed (standardized continuous features plus encoded categoricals from Section 4) for ML.
Target: Risk_Level_Encoded (Low=0, Medium=1, High=2). Patient_ID is dropped as an identifier, and Overall_Risk_Score is excluded to prevent leakage, which leaves 20 features.

Quantity	Value
Total samples (rows)	1968
Number of features (X)	20
Target column (y)	Risk_Level_Encoded

7.2 Feature Engineering Process

Feature Engineering Pipeline

7.3 Feature Space Snapshot

First 10 rows of the ML feature matrix X (standardized numerical features + encoded categoricals):

Age	Gender	Smoking	Alcohol_Use	Family_History	Diet_Red_Meat	Diet_Salted_Processed	Fruit_Veg_Intake	Physical_Activity	Air_Pollution	Occupational_Hazards	BRCA_Mutation	Calcium_Intake	BMI	Physical_Activity_Level	Cancer_Type_Colon	Cancer_Type_Lung	Cancer_Type_Prostate	Cancer_Type_Skin
0.4544	0.0000	0.5548	-0.9341	0.0000	-0.0648	-0.5079	0.6821	-0.0077	0.2095	-0.6144	1.0000	-1.2954	0.4710	0.0184	0.0000	0.0000	0.0000	0.0000
1.0410	1.0000	0.8558	1.2149	0.0000	-1.6537	-0.5079	0.6821	-1.0156	-0.7284	-0.6144	0.0000	0.3441	-0.2098	1.2786	0.0000	0.0000	1.0000	0.0000
-0.8166	1.0000	0.5548	1.5219	0.0000	-0.7003	-0.5079	-0.3023	-1.0156	0.8348	1.5654	0.0000	0.6720	0.6281	-0.9267	0.0000	0.0000	0.0000	1.0000
-0.2300	0.0000	0.2538	-0.9341	0.0000	0.2530	-0.8320	-0.3023	0.6642	-0.4157	0.9426	0.0000	1.3277	1.5446	0.6485	1.0000	0.0000	0.0000	0.0000
0.3566	1.0000	1.4578	0.6009	0.0000	0.2530	-0.5079	1.6666	1.6721	1.4601	1.2540	0.0000	0.3441	-0.2884	-0.9267	0.0000	1.0000	0.0000	0.0000
1.3343	1.0000	1.4578	0.9079	0.0000	0.2530	-1.4803	0.3540	-0.6796	1.4601	0.6312	0.0000	-1.2954	-0.2884	-1.2418	0.0000	1.0000	0.0000	0.0000
-0.4255	0.0000	1.4578	1.5219	0.0000	1.2064	-0.1838	-1.6149	-1.0156	1.4601	1.2540	0.0000	0.3441	1.5969	-0.9267	0.0000	1.0000	0.0000	0.0000
1.0410	1.0000	0.8558	0.2939	1.0000	-0.7003	-0.5079	-0.9586	1.3362	0.8348	0.6312	0.0000	-0.9675	0.7590	1.2786	0.0000	0.0000	1.0000	0.0000
0.7477	1.0000	1.1568	-1.5481	0.0000	1.5242	-0.1838	0.3540	2.0081	0.8348	-0.6144	0.0000	0.3441	-0.5502	0.0184	1.0000	0.0000	0.0000	0.0000
-0.8166	1.0000	0.5548	-1.2411	0.0000	-1.6537	-0.1838	-0.9586	0.3283	1.1474	1.2540	0.0000	0.3441	0.5233	-1.2418	0.0000	0.0000	0.0000	1.0000

8. Machine Learning Algorithms

8.1 Machine Learning Workflow

Machine Learning Workflow

8.2 Train/Test Split and Evaluation Metrics

We split the data into training and test sets using an 80/20 ratio with stratification on the target.
This preserves the distribution of risk levels in both training and test sets and avoids biased evaluation.
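
A stratified 80/20 split can be sketched with scikit-learn. The class totals used here (314 Low, 1556 Medium, 98 High) are obtained by adding the training and test counts reported in this section; the random feature matrix and `random_state=42` are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1968 labels with the report's
# class totals and a random 20-column feature matrix.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], [314, 1556, 98])
X = rng.normal(size=(len(y), 20))

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,    # 80/20 split
    stratify=y,        # keep class proportions in both parts
    random_state=42,   # assumption: any fixed seed makes the split repeatable
)

train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(len(y_train), train_counts)
print(len(y_test), test_counts)
```

With 1968 rows, the test set receives 394 rows and the training set 1574, matching the totals of the class-distribution and recall tables in this section.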

Training set sample (first 10 rows):

Age	Gender	Smoking	Alcohol_Use	Family_History	Diet_Red_Meat	Diet_Salted_Processed	Fruit_Veg_Intake	Physical_Activity	Air_Pollution	Occupational_Hazards	H_Pylori_Infection	Calcium_Intake	BMI	Physical_Activity_Level	Cancer_Type_Colon	Cancer_Type_Lung	Cancer_Type_Prostate	Cancer_Type_Skin	Risk_Level_Encoded
-0.6211	0.0000	1.4578	0.6009	1.0000	-0.7003	1.7609	0.3540	-0.3436	1.4601	1.2540	0.0000	-0.3117	-1.5453	-0.9267	0.0000	1.0000	0.0000	0.0000	1.0000
0.5521	0.0000	-0.0473	-0.9341	0.0000	0.5708	0.1403	-1.6149	0.3283	1.4601	-0.3030	0.0000	0.3441	1.4136	0.6485	0.0000	0.0000	0.0000	0.0000	1.0000
0.4544	0.0000	-0.9503	0.6009	0.0000	-0.3825	-1.4803	1.3384	-0.6796	1.1474	1.2540	0.0000	-0.6396	0.8638	1.2786	0.0000	0.0000	0.0000	0.0000	1.0000
1.1387	0.0000	0.2538	1.2149	0.0000	-0.7003	-0.8320	-0.6304	0.3283	0.5222	0.0084	0.0000	-1.2954	-0.2884	1.2786	0.0000	0.0000	0.0000	0.0000	1.0000
-0.1322	0.0000	-0.6493	-1.5481	0.0000	0.2530	-0.8320	-0.6304	1.3362	-0.4157	-0.3030	0.0000	-0.3117	-0.6550	-0.6117	0.0000	0.0000	0.0000	0.0000	0.0000
0.4544	0.0000	1.1568	1.2149	0.0000	0.8886	1.4368	0.3540	-0.0077	-0.1031	1.5654	0.0000	-1.2954	-0.2622	0.3335	1.0000	0.0000	0.0000	0.0000	1.0000
1.2365	0.0000	0.2538	-1.2411	0.0000	0.8886	-0.5079	-1.6149	0.3283	1.4601	-0.3030	0.0000	-0.9675	1.7279	-1.5568	0.0000	0.0000	0.0000	1.0000	1.0000
-0.5233	1.0000	-0.0473	-0.9341	0.0000	-0.3825	1.4368	-0.9586	1.6721	1.4601	-0.9258	0.0000	-1.2954	1.3089	1.2786	0.0000	0.0000	1.0000	0.0000	0.0000
-0.6211	0.0000	1.1568	-1.5481	1.0000	0.8886	-0.8320	-1.6149	0.6642	0.2095	0.9426	0.0000	-1.2954	0.8114	0.9635	0.0000	0.0000	0.0000	0.0000	1.0000
-0.6211	1.0000	1.4578	-0.3201	0.0000	1.5242	1.1127	-0.6304	2.0081	-0.1031	-1.5487	1.0000	-0.6396	-0.0789	-0.9267	0.0000	1.0000	0.0000	0.0000	1.0000

Test set sample (first 5 rows):

Age	Gender	Smoking	Alcohol_Use	Family_History	Diet_Red_Meat	Diet_Salted_Processed	Fruit_Veg_Intake	Physical_Activity	Air_Pollution	Occupational_Hazards	H_Pylori_Infection	Calcium_Intake	BMI	Physical_Activity_Level	Cancer_Type_Colon	Cancer_Type_Lung	Risk_Level_Encoded
-0.5233	0.0000	-1.5524	-0.6271	0.0000	-0.3825	-1.4803	-0.3023	-1.0156	-1.0410	-1.5487	0.0000	-1.2954	2.6443	-0.2966	0.0000	0.0000	0.0000
0.0633	1.0000	-0.9503	-0.0131	0.0000	1.2064	-0.8320	-0.6304	0.6642	-1.0410	0.0084	0.0000	0.6720	0.7852	0.6485	1.0000	0.0000	1.0000
-0.6211	0.0000	-1.5524	-0.0131	0.0000	0.2530	0.4644	0.3540	0.6642	-1.6663	-0.3030	0.0000	-1.2954	0.1306	0.9635	0.0000	0.0000	1.0000
-0.9144	1.0000	1.4578	0.2939	0.0000	-1.6537	-0.1838	-0.6304	-0.0077	1.1474	0.6312	1.0000	-0.9675	-0.8121	0.0184	0.0000	1.0000	1.0000
-0.0345	1.0000	1.4578	0.9079	1.0000	1.5242	1.1127	0.3540	-0.0077	1.4601	1.5654	0.0000	0.3441	0.6281	1.5936	0.0000	1.0000	2.0000

8.2.1 SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE is a data augmentation technique that addresses class imbalance by generating synthetic samples
for minority classes. Instead of simply duplicating existing minority samples (which can lead to overfitting),
SMOTE creates new data points by interpolating between existing minority-class neighbours in feature space.

Why it is needed here:
The target variable Risk_Level is heavily imbalanced — the Medium class dominates (~79%), while High (~5%)
has very few examples. Without balancing, classifiers tend to predict the majority class and ignore minorities,
resulting in high accuracy but poor recall and F1 for under-represented classes.

How it works:

- For each minority-class sample $x_i$ , find its $k$ nearest neighbours (default $k=5$ ) in feature space.
- Randomly pick one of the $k$ neighbours.
- Create a synthetic sample at a random point along the line segment between the original and the neighbour:

$x_{\text{new}} = x_i + \lambda \cdot (x_{\text{neighbour}} - x_i), \quad \lambda \in [0,1]$

- Repeat until the minority class reaches the desired count (here, equal to the majority class).

Important: SMOTE is applied only on the training set — the test set remains untouched to ensure
an honest evaluation on real data.
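
The interpolation step can be illustrated with a small NumPy-only sketch. This is a simplified stand-in for a library implementation, not the code used in the report: the function name `smote_sample`, the brute-force neighbour search, and the random toy data are all assumptions.

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)             # a point is not its own neighbour
    nn = np.argsort(dist, axis=1)[:, :k]       # k nearest neighbours per row
    base = rng.integers(0, len(X_min), size=n_new)          # original samples
    neigh = nn[base, rng.integers(0, k, size=n_new)]        # one neighbour each
    lam = rng.random((n_new, 1))                            # lambda in [0, 1)
    return X_min[base] + lam * (X_min[neigh] - X_min[base])

# Toy minority class: 78 rows, to be topped up to 1245 as in the High class.
rng = np.random.default_rng(1)
X_min = rng.normal(size=(78, 4))
synthetic = smote_sample(X_min, n_new=1245 - 78)
print(synthetic.shape)
```

Every synthetic row is a convex combination of two real rows, so it stays inside the range spanned by the minority class. On binary or one-hot columns the same interpolation produces fractional values, which is worth keeping in mind when such columns are part of the feature matrix.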

Class distribution before SMOTE:

Class	Count
Low (0)	251
Medium (1)	1245
High (2)	78

Class distribution after SMOTE:

Class	Count
Low (0)	1245
Medium (1)	1245
High (2)	1245

Evaluation protocol: All metrics below are computed on the original, unmodified test set (no SMOTE). This ensures honest evaluation on real-world class distribution.

8.3 LightGBM Classifier

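LightGBM is a gradient boosting framework based on decision trees. It builds the ensemble sequentially: each new tree is fitted to the gradients of a differentiable loss function (here, multiclass log-loss) with respect to the current predictions, and trees are grown leaf-wise up to the num_leaves and max_depth limits listed below.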

LightGBM Training Loop

Hyperparameters: n_estimators=500, learning_rate=0.05, max_depth=7, num_leaves=63, min_child_samples=10, subsample=0.8, colsample_bytree=0.8

LightGBM_confusion_matrix

Per-class recall — Low: 34/63 (54%), Medium: 292/311 (94%), High: 6/20 (30%).

8.4 XGBoost Classifier

XGBoost (Extreme Gradient Boosting) is another gradient boosting implementation that uses regularization
and shrinkage to reduce overfitting. It optimizes an objective function of the form

$\mathcal{L}(\theta) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k),$

where $l$ is the loss (e.g. multiclass log-loss) and $\Omega$ is a regularization term on each tree $f_k$ .

XGBoost Training Loop

Hyperparameters: n_estimators=500, learning_rate=0.05, max_depth=6, min_child_weight=3, subsample=0.8, colsample_bytree=0.8, gamma=0.1

XGBoost_confusion_matrix

Per-class recall — Low: 38/63 (60%), Medium: 289/311 (93%), High: 11/20 (55%).

8.5 RandomForest Classifier

RandomForest is an ensemble of decision trees trained on bootstrapped samples with feature subsampling.
For classification, the final prediction is obtained via majority vote across trees:

$\hat{y} = \text{mode}\{ h_t(x) \}_{t=1}^T,$

where each $h_t$ is a decision tree trained on a different bootstrap sample.

RandomForest Training Loop

Hyperparameters: n_estimators=500, max_depth=None, min_samples_split=5, min_samples_leaf=2

RandomForest_confusion_matrix

Per-class recall — Low: 40/63 (63%), Medium: 288/311 (93%), High: 9/20 (45%).

8.6 Overall Model Performance Comparison

Algorithm	Accuracy	Precision_macro	Recall_macro	F1_macro	ROC_AUC_macro
LightGBM	0.8426	0.7085	0.5929	0.6346	0.8914
XGBoost	0.8579	0.7409	0.6941	0.7153	0.9027
RandomForest	0.8553	0.7549	0.6703	0.7037	0.9163
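
Accuracy and macro recall in this table follow directly from the per-class recall counts reported for each model. The short check below recomputes both; the dictionary layout and the helper name `accuracy_and_macro_recall` are illustrative assumptions, while the counts are copied from Sections 8.3 to 8.5.

```python
# (correctly classified, class size) per class, from the per-class recall lines.
recall_counts = {
    "LightGBM": {"Low": (34, 63), "Medium": (292, 311), "High": (6, 20)},
    "XGBoost": {"Low": (38, 63), "Medium": (289, 311), "High": (11, 20)},
    "RandomForest": {"Low": (40, 63), "Medium": (288, 311), "High": (9, 20)},
}

def accuracy_and_macro_recall(per_class):
    """Accuracy pools all test rows; macro recall averages the class recalls."""
    correct = sum(c for c, _ in per_class.values())
    total = sum(n for _, n in per_class.values())
    macro_recall = sum(c / n for c, n in per_class.values()) / len(per_class)
    return correct / total, macro_recall

derived = {model: accuracy_and_macro_recall(pc) for model, pc in recall_counts.items()}
for model, (acc, rec) in derived.items():
    print(f"{model}: accuracy={acc:.4f}, recall_macro={rec:.4f}")
```

The gap between the two metrics shows why accuracy alone is misleading here: the 311 Medium cases dominate accuracy, whereas macro recall weights the 20 High-risk cases as heavily as the Medium class.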

Model Metrics Comparison

ROC Curves All Models

Inference: We select the best model by macro-average F1, which balances precision and recall across all classes — critical when minority-class detection (High risk) matters.

Best by F1_macro: XGBoost (F1 = 0.7153)
Best by ROC_AUC_macro: RandomForest (AUC = 0.9163)

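F1 and AUC disagree: XGBoost leads on F1 (better hard predictions), while RandomForest leads on AUC (better probability ranking). For clinical risk triage, which acts on hard class assignments, F1_macro is preferred. The F1 margin is small (0.7153 vs 0.7037) and the test set contains only 20 High-risk cases, so the ranking should be read with caution.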