Submitted to: Prof.ssa Roberta Siciliano (University of Naples Federico II)
Report by: Sahaya Gnanadurai (D03000149), Rohan Baidya (D03000192)
| Field | Details |
|---|---|
| Title | Cancer Risk Factors Data |
| Author | Tarek Masryo |
| Year | 2025 |
| Publisher | Kaggle |
| DOI | 10.34740/KAGGLE/DSV/13280499 |
| URL | https://www.kaggle.com/dsv/13280499 |
| Github RAW URL | https://raw.githubusercontent.com/tarekmasryo/cancer-risk-factors-data/main/data/cancer-risk-factors.csv |
The dataset contains records for 2,000 cancer patients, covering lifestyle, environmental, and genetic risk factors, together with a composite risk score and a categorical risk-level classification. It is designed to support analysis of the relationships between these factors and the risk of various cancer types.
Dataset Information:
| Metric | Value |
|---|---|
| Dataset Shape | (2000, 21) |
| Number of Records | 2000 |
| Number of Features | 21 |
Variable Descriptions:
| Column Name | Description |
|---|---|
| Patient_ID | Unique identifier for each patient |
| Cancer_Type | Type of cancer (e.g., Breast, Colon, Lung, Prostate, Skin) |
| Age | Patient age in years |
| Gender | Patient gender (binary, coded 0 / 1) |
| Smoking | Smoking level (numeric scale, 0–10) |
| Alcohol_Use | Alcohol consumption level (numeric scale) |
| Obesity | Obesity level (numeric scale, 0–10); redundant with BMI and dropped in cleaning |
| Family_History | Family history of cancer (0 = No, 1 = Yes) |
| Diet_Red_Meat | Red meat consumption level (numeric scale) |
| Diet_Salted_Processed | Salted/processed food consumption level (numeric scale) |
| Fruit_Veg_Intake | Fruit and vegetable intake level (numeric scale) |
| Physical_Activity | Physical activity level (numeric scale) |
| Air_Pollution | Exposure to air pollution (numeric scale) |
| Occupational_Hazards | Exposure to occupational hazards (numeric scale) |
| BRCA_Mutation | BRCA gene mutation carrier (0 = No, 1 = Yes) |
| H_Pylori_Infection | Helicobacter pylori infection status (0 = No, 1 = Yes) |
| Calcium_Intake | Calcium intake level (numeric scale) |
| Overall_Risk_Score | Composite risk score (continuous) |
| BMI | Body Mass Index (kg/m²) |
| Physical_Activity_Level | Physical activity level (numeric scale, 0–10) |
| Risk_Level | Cancer risk classification (Low / Medium / High) — target variable |
| Patient_ID | Cancer_Type | Age | Gender | Smoking | Alcohol_Use | Obesity | Family_History | Diet_Red_Meat | Diet_Salted_Processed | Fruit_Veg_Intake | Physical_Activity | Air_Pollution | Occupational_Hazards | BRCA_Mutation | H_Pylori_Infection | Calcium_Intake | Overall_Risk_Score | BMI | Physical_Activity_Level | Risk_Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LU0000 | Breast | 68 | 0 | 7 | 2 | 8 | 0 | 5 | 3 | 7 | 4 | 6 | 3 | 1 | 0 | 0 | 0.3987 | 28.0000 | 5 | Medium |
| LU0001 | Prostate | 74 | 1 | 8 | 9 | 8 | 0 | 0 | 3 | 7 | 1 | 3 | 3 | 0 | 0 | 5 | 0.4243 | 25.4000 | 9 | Medium |
| LU0002 | Skin | 55 | 1 | 7 | 10 | 7 | 0 | 3 | 3 | 4 | 1 | 8 | 10 | 0 | 0 | 6 | 0.6051 | 28.6000 | 2 | Medium |
| LU0003 | Colon | 61 | 0 | 6 | 2 | 2 | 0 | 6 | 2 | 4 | 6 | 4 | 8 | 0 | 0 | 8 | 0.3184 | 32.1000 | 7 | Low |
| LU0004 | Lung | 67 | 1 | 10 | 7 | 4 | 0 | 6 | 3 | 10 | 9 | 10 | 9 | 0 | 0 | 5 | 0.5244 | 25.1000 | 2 | Medium |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | Patient_ID | 2000 non-null | object |
| 1 | Cancer_Type | 2000 non-null | object |
| 2 | Age | 2000 non-null | int64 |
| 3 | Gender | 2000 non-null | int64 |
| 4 | Smoking | 2000 non-null | int64 |
| 5 | Alcohol_Use | 2000 non-null | int64 |
| 6 | Obesity | 2000 non-null | int64 |
| 7 | Family_History | 2000 non-null | int64 |
| 8 | Diet_Red_Meat | 2000 non-null | int64 |
| 9 | Diet_Salted_Processed | 2000 non-null | int64 |
| 10 | Fruit_Veg_Intake | 2000 non-null | int64 |
| 11 | Physical_Activity | 2000 non-null | int64 |
| 12 | Air_Pollution | 2000 non-null | int64 |
| 13 | Occupational_Hazards | 2000 non-null | int64 |
| 14 | BRCA_Mutation | 2000 non-null | int64 |
| 15 | H_Pylori_Infection | 2000 non-null | int64 |
| 16 | Calcium_Intake | 2000 non-null | int64 |
| 17 | Overall_Risk_Score | 2000 non-null | float64 |
| 18 | BMI | 2000 non-null | float64 |
| 19 | Physical_Activity_Level | 2000 non-null | int64 |
| 20 | Risk_Level | 2000 non-null | object |

dtypes: float64(2), int64(16), object(3)
memory usage: 328.2+ KB
| Column | Min | Max | Mean | Mode | Median | Std | Skewness | Kurtosis | Missing Values |
|---|---|---|---|---|---|---|---|---|---|
| Age | 25.0000 | 90.0000 | 63.2480 | 64.0000 | 64.0000 | 10.4629 | -0.1814 | -0.0148 | 0 |
| Smoking | 0.0000 | 10.0000 | 5.1570 | 10.0000 | 5.0000 | 3.3253 | 0.0552 | -1.2552 | 0 |
| Alcohol_Use | 0.0000 | 10.0000 | 5.0350 | 7.0000 | 5.0000 | 3.2610 | -0.0573 | -1.3211 | 0 |
| Obesity | 0.0000 | 10.0000 | 5.9675 | 10.0000 | 6.0000 | 3.0614 | -0.3250 | -0.9672 | 0 |
| Diet_Red_Meat | 0.0000 | 10.0000 | 5.1895 | 10.0000 | 5.0000 | 3.1545 | -0.0079 | -1.1572 | 0 |
| Diet_Salted_Processed | 0.0000 | 10.0000 | 4.5635 | 4.0000 | 4.0000 | 3.0883 | 0.3009 | -1.0429 | 0 |
| Fruit_Veg_Intake | 0.0000 | 10.0000 | 4.9275 | 3.0000 | 5.0000 | 3.0453 | 0.0185 | -1.0837 | 0 |
| Physical_Activity | 0.0000 | 10.0000 | 4.0150 | 1.0000 | 4.0000 | 2.9785 | 0.4559 | -0.8428 | 0 |
| Air_Pollution | 0.0000 | 10.0000 | 5.3230 | 10.0000 | 5.0000 | 3.2075 | 0.0033 | -1.2114 | 0 |
| Occupational_Hazards | 0.0000 | 10.0000 | 4.9790 | 5.0000 | 5.0000 | 3.2129 | 0.0749 | -1.1776 | 0 |
| Calcium_Intake | 0.0000 | 10.0000 | 3.9405 | 0.0000 | 4.0000 | 3.0489 | 0.3495 | -0.9561 | 0 |
| Overall_Risk_Score | 0.0293 | 0.8522 | 0.4544 | 0.0293 | 0.4554 | 0.1231 | 0.0165 | -0.2909 | 0 |
| BMI | 15.0000 | 41.4000 | 26.1833 | 25.9000 | 26.2000 | 3.9475 | 0.0477 | 0.0122 | 0 |
| Physical_Activity_Level | 0.0000 | 10.0000 | 4.9385 | 0.0000 | 5.0000 | 3.1660 | -0.0103 | -1.2055 | 0 |
| Column | Mode | Unique Values | #Unique | Most Frequent | Missing |
|---|---|---|---|---|---|
| Cancer_Type | Lung | Breast, Prostate, Skin, Colon, Lung | 5 | Lung (527) | 0 |
| Risk_Level | Medium | Medium, Low, High | 3 | Medium (1574) | 0 |
| Gender | 0 | 0, 1 | 2 | 0 (1022) | 0 |
| Family_History | 0 | 0, 1 | 2 | 0 (1611) | 0 |
| BRCA_Mutation | 0 | 1, 0 | 2 | 0 (1935) | 0 |
| H_Pylori_Infection | 0 | 0, 1 | 2 | 0 (1607) | 0 |
No missing values detected
Since BMI is the internationally standardized clinical measure, we drop Obesity to avoid redundancy.
| Patient_ID | Cancer_Type | Age | Gender | Smoking | Alcohol_Use | Family_History | Diet_Red_Meat | Diet_Salted_Processed | Fruit_Veg_Intake | Physical_Activity | Air_Pollution | Occupational_Hazards | BRCA_Mutation | H_Pylori_Infection | Calcium_Intake | Overall_Risk_Score | BMI | Physical_Activity_Level | Risk_Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LU0000 | Breast | 68 | 0 | 7 | 2 | 0 | 5 | 3 | 7 | 4 | 6 | 3 | 1 | 0 | 0 | 0.3987 | 28.0000 | 5 | Medium |
| LU0001 | Prostate | 74 | 1 | 8 | 9 | 0 | 0 | 3 | 7 | 1 | 3 | 3 | 0 | 0 | 5 | 0.4243 | 25.4000 | 9 | Medium |
| LU0002 | Skin | 55 | 1 | 7 | 10 | 0 | 3 | 3 | 4 | 1 | 8 | 10 | 0 | 0 | 6 | 0.6051 | 28.6000 | 2 | Medium |
| LU0003 | Colon | 61 | 0 | 6 | 2 | 0 | 6 | 2 | 4 | 6 | 4 | 8 | 0 | 0 | 8 | 0.3184 | 32.1000 | 7 | Low |
| LU0004 | Lung | 67 | 1 | 10 | 7 | 0 | 6 | 3 | 10 | 9 | 10 | 9 | 0 | 0 | 5 | 0.5244 | 25.1000 | 2 | Medium |

| Column | Q1 | Q3 | IQR | Rows removed |
|---|---|---|---|---|
| Age | 56.00 | 70.00 | 14.00 | 9 |
| Smoking | 2.00 | 8.00 | 6.00 | 0 |
| Alcohol_Use | 2.00 | 8.00 | 6.00 | 0 |
| Diet_Red_Meat | 3.00 | 8.00 | 5.00 | 0 |
| Diet_Salted_Processed | 2.00 | 7.00 | 5.00 | 0 |
| Fruit_Veg_Intake | 3.00 | 8.00 | 5.00 | 0 |
| Physical_Activity | 1.00 | 6.00 | 5.00 | 0 |
| Air_Pollution | 3.00 | 8.00 | 5.00 | 0 |
| Occupational_Hazards | 2.00 | 8.00 | 6.00 | 0 |
| Calcium_Intake | 1.00 | 6.00 | 5.00 | 0 |
| Overall_Risk_Score | 0.37 | 0.54 | 0.17 | 6 |
| BMI | 23.50 | 28.70 | 5.20 | 17 |
| Physical_Activity_Level | 2.00 | 8.00 | 6.00 | 0 |
| Metric | Value |
|---|---|
| Original rows | 2000 |
| After outlier removal | 1968 |
| Rows removed | 32 |
| Percentage removed | 1.60% |
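
The IQR filtering summarized above can be sketched as follows. This is a minimal illustration, assuming the conventional 1.5 × IQR fences and column-by-column removal (neither is stated in the report); the function name `remove_iqr_outliers` and the toy values are hypothetical.

```python
import pandas as pd

def remove_iqr_outliers(df, columns, k=1.5):
    """Drop rows outside [Q1 - k*IQR, Q3 + k*IQR], one column at a time."""
    removed = {}
    for col in columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        keep = df[col].between(q1 - k * iqr, q3 + k * iqr)
        removed[col] = int((~keep).sum())  # rows dropped because of this column
        df = df[keep]
    return df.reset_index(drop=True), removed

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "Age": [25, 56, 60, 63, 64, 66, 70, 72],
    "BMI": [26.0, 24.1, 25.3, 26.2, 27.0, 28.4, 45.0, 23.9],
})
cleaned, removed = remove_iqr_outliers(toy, ["Age", "BMI"])
print(removed, len(cleaned))
```

Because each column is filtered on the rows that survived the previous column, the per-column counts add up to the total number of rows removed, as they do in the table above (9 + 6 + 17 = 32).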

The categorical variable Cancer_Type is one-hot encoded with drop_first=True: the first category (Breast) is dropped and serves as the reference level, which avoids multicollinearity among the dummy columns.
| Cancer_Type_Colon | Cancer_Type_Lung | Cancer_Type_Prostate | Cancer_Type_Skin |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
The categorical variable Risk_Level is label-encoded in its natural order (Low=0, Medium=1, High=2).
Gender, Family_History, BRCA_Mutation, and H_Pylori_Infection are already binary (0/1), so they are kept as-is and need no encoding.
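
The two encodings can be reproduced on the first five rows of the dataset with a short pandas sketch. It assumes `pandas.get_dummies` for the one-hot step and a plain dictionary mapping for the ordinal target; the report does not state which functions were actually used.

```python
import pandas as pd

# Cancer_Type and Risk_Level of the first five patients (LU0000 to LU0004)
toy = pd.DataFrame({
    "Cancer_Type": ["Breast", "Prostate", "Skin", "Colon", "Lung"],
    "Risk_Level": ["Medium", "Medium", "Medium", "Low", "Medium"],
})

# One-hot encode Cancer_Type; drop_first removes the alphabetically first
# category (Breast), which becomes the all-zeros reference level.
encoded = pd.get_dummies(toy, columns=["Cancer_Type"], drop_first=True, dtype=int)

# Ordinal label encoding for the target
risk_order = {"Low": 0, "Medium": 1, "High": 2}
encoded["Risk_Level_Encoded"] = encoded["Risk_Level"].map(risk_order)
encoded = encoded.drop(columns="Risk_Level")
print(encoded)
```

An explicit mapping keeps the Low < Medium < High order, whereas an alphabetical label encoder would assign High=0, Low=1, Medium=2.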
| Patient_ID | Age | Gender | Smoking | Alcohol_Use | Family_History | Diet_Red_Meat | Diet_Salted_Processed | Fruit_Veg_Intake | Physical_Activity | Air_Pollution | Occupational_Hazards | BRCA_Mutation | H_Pylori_Infection | Calcium_Intake | Overall_Risk_Score | BMI | Physical_Activity_Level | Cancer_Type_Colon | Cancer_Type_Lung | Cancer_Type_Prostate | Cancer_Type_Skin | Risk_Level_Encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LU0000 | 68 | 0 | 7 | 2 | 0 | 5 | 3 | 7 | 4 | 6 | 3 | 1 | 0 | 0 | 0.3987 | 28.0000 | 5 | 0 | 0 | 0 | 0 | 1 |
| LU0001 | 74 | 1 | 8 | 9 | 0 | 0 | 3 | 7 | 1 | 3 | 3 | 0 | 0 | 5 | 0.4243 | 25.4000 | 9 | 0 | 0 | 1 | 0 | 1 |
| LU0002 | 55 | 1 | 7 | 10 | 0 | 3 | 3 | 4 | 1 | 8 | 10 | 0 | 0 | 6 | 0.6051 | 28.6000 | 2 | 0 | 0 | 0 | 1 | 1 |
| LU0003 | 61 | 0 | 6 | 2 | 0 | 6 | 2 | 4 | 6 | 4 | 8 | 0 | 0 | 8 | 0.3184 | 32.1000 | 7 | 1 | 0 | 0 | 0 | 0 |
| LU0004 | 67 | 1 | 10 | 7 | 0 | 6 | 3 | 10 | 9 | 10 | 9 | 0 | 0 | 5 | 0.5244 | 25.1000 | 2 | 0 | 1 | 0 | 0 | 1 |
We check the skewness of each numerical variable and apply a log transform only to highly skewed ones (|skew| > 1).
Log Transform
| Value Sign | Formula |
|---|---|
| Negative | |
| Positive | |
| Zero | (no transform applied) |

Skewness values (threshold: |skew| > 1.0):
| Column | Skewness | Exceeds Threshold |
|---|---|---|
| Age | -0.088 | No |
| Smoking | 0.059 | No |
| Alcohol_Use | -0.058 | No |
| Diet_Red_Meat | -0.012 | No |
| Diet_Salted_Processed | 0.301 | No |
| Fruit_Veg_Intake | 0.019 | No |
| Physical_Activity | 0.453 | No |
| Air_Pollution | 0.004 | No |
| Occupational_Hazards | 0.077 | No |
| Calcium_Intake | 0.346 | No |
| Overall_Risk_Score | 0.021 | No |
| BMI | 0.060 | No |
| Physical_Activity_Level | -0.011 | No |
No highly skewed numerical variables (|skew| > 1) requiring log transform.
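A skewness check of this kind could look like the sketch below. The transform used here, sign(x) · log(1 + |x|), is an assumption chosen because it is defined for negative, zero, and positive values; it is not necessarily the exact formula the report used. The function names and toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def signed_log1p(x):
    """sign(x) * log(1 + |x|): keeps the sign of x and maps 0 to 0."""
    return np.sign(x) * np.log1p(np.abs(x))

def log_transform_skewed(df, columns, threshold=1.0):
    """Transform only the columns whose |skewness| exceeds the threshold."""
    out = df.copy()
    transformed = []
    for col in columns:
        if abs(out[col].skew()) > threshold:
            out[col] = signed_log1p(out[col])
            transformed.append(col)
    return out, transformed

# Toy data (hypothetical): one symmetric column, one strongly right-skewed
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "skewed": [1.0, 1.0, 1.0, 2.0, 50.0],
})
result, transformed = log_transform_skewed(toy, ["symmetric", "skewed"])
print(transformed)
```

On the report's data, every column has |skew| well below 1, so a function like this would return the dataframe unchanged.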
All continuous numerical features are standardized to zero mean and unit variance: $z = \dfrac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the column mean and standard deviation.
This puts features on comparable scales for distance-based and gradient-based ML algorithms.
Numerical columns standardized:
Age, Smoking, Alcohol_Use, Diet_Red_Meat, Diet_Salted_Processed, Fruit_Veg_Intake, Physical_Activity, Air_Pollution, Occupational_Hazards, Calcium_Intake, Overall_Risk_Score, BMI, Physical_Activity_Level
| Column | Original Mean (μ) | Original Std (σ) |
|---|---|---|
| Age | 63.3526 | 10.2284 |
| Smoking | 5.1570 | 3.3220 |
| Alcohol_Use | 5.0427 | 3.2573 |
| Diet_Red_Meat | 5.2038 | 3.1468 |
| Diet_Salted_Processed | 4.5671 | 3.0853 |
| Fruit_Veg_Intake | 4.9212 | 3.0475 |
| Physical_Activity | 4.0229 | 2.9765 |
| Air_Pollution | 5.3298 | 3.1986 |
| Occupational_Hazards | 4.9731 | 3.2112 |
| Calcium_Intake | 3.9507 | 3.0497 |
| Overall_Risk_Score | 0.4551 | 0.1211 |
| BMI | 26.2014 | 3.8190 |
| Physical_Activity_Level | 4.9416 | 3.1742 |
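
As a worked check, the z-score formula $z = (x - \mu)/\sigma$ can be applied to the Age values of the first three patients (68, 74, 55), using the Age mean and standard deviation from the table above. The helper name `z_score` is only illustrative.

```python
# Age mean and standard deviation, taken from the table above
mu_age, sigma_age = 63.3526, 10.2284

def z_score(x, mu, sigma):
    """Standardize a value to zero mean and unit variance."""
    return (x - mu) / sigma

ages = [68, 74, 55]  # Age of patients LU0000, LU0001, LU0002
z_values = [round(z_score(a, mu_age, sigma_age), 4) for a in ages]
print(z_values)  # [0.4544, 1.041, -0.8166]
```

These agree with the standardized Age values reported for the same three patients.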
First 10 rows of standardized numerical columns only:
| | Age | Smoking | Alcohol_Use | Diet_Red_Meat | Diet_Salted_Processed | Fruit_Veg_Intake | Physical_Activity | Air_Pollution | Occupational_Hazards | Calcium_Intake | Overall_Risk_Score | BMI | Physical_Activity_Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.4544 | 0.5548 | -0.9341 | -0.0648 | -0.5079 | 0.6821 | -0.0077 | 0.2095 | -0.6144 | -1.2954 | -0.4660 | 0.4710 | 0.0184 |
| 1 | 1.0410 | 0.8558 | 1.2149 | -1.6537 | -0.5079 | 0.6821 | -1.0156 | -0.7284 | -0.6144 | 0.3441 | -0.2546 | -0.2098 | 1.2786 |
| 2 | -0.8166 | 0.5548 | 1.5219 | -0.7003 | -0.5079 | -0.3023 | -1.0156 | 0.8348 | 1.5654 | 0.6720 | 1.2381 | 0.6281 | -0.9267 |
| 3 | -0.2300 | 0.2538 | -0.9341 | 0.2530 | -0.8320 | -0.3023 | 0.6642 | -0.4157 | 0.9426 | 1.3277 | -1.1286 | 1.5446 | 0.6485 |
| 4 | 0.3566 | 1.4578 | 0.6009 | 0.2530 | -0.5079 | 1.6666 | 1.6721 | 1.4601 | 1.2540 | 0.3441 | 0.5715 | -0.2884 | -0.9267 |
| 5 | 1.3343 | 1.4578 | 0.9079 | 0.2530 | -1.4803 | 0.3540 | -0.6796 | 1.4601 | 0.6312 | -1.2954 | 0.3594 | -0.2884 | -1.2418 |
| 6 | -0.4255 | 1.4578 | 1.5219 | 1.2064 | -0.1838 | -1.6149 | -1.0156 | 1.4601 | 1.2540 | 0.3441 | 1.7109 | 1.5969 | -0.9267 |
| 7 | 1.0410 | 0.8558 | 0.2939 | -0.7003 | -0.5079 | -0.9586 | 1.3362 | 0.8348 | 0.6312 | -0.9675 | 0.2000 | 0.7590 | 1.2786 |
| 8 | 0.7477 | 1.1568 | -1.5481 | 1.5242 | -0.1838 | 0.3540 | 2.0081 | 0.8348 | -0.6144 | 0.3441 | 0.3508 | -0.5502 | 0.0184 |
| 9 | -0.8166 | 0.5548 | -1.2411 | -1.6537 | -0.1838 | -0.9586 | 0.3283 | 1.1474 | 1.2540 | 0.3441 | -0.4153 | 0.5233 | -1.2418 |
First 10 rows of the complete modified dataframe (all columns, numerical ones standardized):
| | Patient_ID | Age | Gender | Smoking | Alcohol_Use | Family_History | Diet_Red_Meat | Diet_Salted_Processed | Fruit_Veg_Intake | Physical_Activity | Air_Pollution | Occupational_Hazards | BRCA_Mutation | H_Pylori_Infection | Calcium_Intake | Overall_Risk_Score | BMI | Physical_Activity_Level | Cancer_Type_Colon | Cancer_Type_Lung | Cancer_Type_Prostate | Cancer_Type_Skin | Risk_Level_Encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LU0000 | 0.4544 | 0 | 0.5548 | -0.9341 | 0 | -0.0648 | -0.5079 | 0.6821 | -0.0077 | 0.2095 | -0.6144 | 1 | 0 | -1.2954 | -0.4660 | 0.4710 | 0.0184 | 0 | 0 | 0 | 0 | 1 |
| 1 | LU0001 | 1.0410 | 1 | 0.8558 | 1.2149 | 0 | -1.6537 | -0.5079 | 0.6821 | -1.0156 | -0.7284 | -0.6144 | 0 | 0 | 0.3441 | -0.2546 | -0.2098 | 1.2786 | 0 | 0 | 1 | 0 | 1 |
| 2 | LU0002 | -0.8166 | 1 | 0.5548 | 1.5219 | 0 | -0.7003 | -0.5079 | -0.3023 | -1.0156 | 0.8348 | 1.5654 | 0 | 0 | 0.6720 | 1.2381 | 0.6281 | -0.9267 | 0 | 0 | 0 | 1 | 1 |
| 3 | LU0003 | -0.2300 | 0 | 0.2538 | -0.9341 | 0 | 0.2530 | -0.8320 | -0.3023 | 0.6642 | -0.4157 | 0.9426 | 0 | 0 | 1.3277 | -1.1286 | 1.5446 | 0.6485 | 1 | 0 | 0 | 0 | 0 |
| 4 | LU0004 | 0.3566 | 1 | 1.4578 | 0.6009 | 0 | 0.2530 | -0.5079 | 1.6666 | 1.6721 | 1.4601 | 1.2540 | 0 | 0 | 0.3441 | 0.5715 | -0.2884 | -0.9267 | 0 | 1 | 0 | 0 | 1 |
| 5 | LU0005 | 1.3343 | 1 | 1.4578 | 0.9079 | 0 | 0.2530 | -1.4803 | 0.3540 | -0.6796 | 1.4601 | 0.6312 | 0 | 0 | -1.2954 | 0.3594 | -0.2884 | -1.2418 | 0 | 1 | 0 | 0 | 1 |
| 6 | LU0006 | -0.4255 | 0 | 1.4578 | 1.5219 | 0 | 1.2064 | -0.1838 | -1.6149 | -1.0156 | 1.4601 | 1.2540 | 0 | 0 | 0.3441 | 1.7109 | 1.5969 | -0.9267 | 0 | 1 | 0 | 0 | 2 |
| 7 | LU0007 | 1.0410 | 1 | 0.8558 | 0.2939 | 1 | -0.7003 | -0.5079 | -0.9586 | 1.3362 | 0.8348 | 0.6312 | 0 | 0 | -0.9675 | 0.2000 | 0.7590 | 1.2786 | 0 | 0 | 1 | 0 | 1 |
| 8 | LU0008 | 0.7477 | 1 | 1.1568 | -1.5481 | 0 | 1.5242 | -0.1838 | 0.3540 | 2.0081 | 0.8348 | -0.6144 | 0 | 0 | 0.3441 | 0.3508 | -0.5502 | 0.0184 | 1 | 0 | 0 | 0 | 1 |
| 9 | LU0009 | -0.8166 | 1 | 0.5548 | -1.2411 | 0 | -1.6537 | -0.1838 | -0.9586 | 0.3283 | 1.1474 | 1.2540 | 0 | 0 | 0.3441 | -0.4153 | 0.5233 | -1.2418 | 0 | 0 | 0 | 1 | 1 |



| | Age | Smoking | Alcohol_Use | Diet_Red_Meat | Diet_Salted_Processed | Fruit_Veg_Intake | Physical_Activity | Air_Pollution | Occupational_Hazards | Calcium_Intake | Overall_Risk_Score | BMI | Physical_Activity_Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1.000 | 0.041 | -0.033 | -0.057 | -0.059 | 0.008 | 0.065 | -0.014 | 0.050 | 0.057 | -0.036 | 0.011 | -0.049 |
| Smoking | 0.041 | 1.000 | 0.112 | -0.145 | -0.060 | 0.041 | 0.108 | 0.449 | -0.017 | 0.084 | 0.431 | 0.003 | 0.022 |
| Alcohol_Use | -0.033 | 0.112 | 1.000 | -0.030 | -0.033 | 0.041 | 0.040 | 0.068 | -0.007 | -0.062 | 0.386 | -0.012 | 0.018 |
| Diet_Red_Meat | -0.057 | -0.145 | -0.030 | 1.000 | 0.178 | -0.193 | -0.001 | -0.077 | -0.015 | 0.100 | 0.272 | 0.030 | 0.032 |
| Diet_Salted_Processed | -0.059 | -0.060 | -0.033 | 0.178 | 1.000 | -0.216 | -0.025 | 0.036 | 0.055 | 0.060 | 0.363 | -0.008 | 0.001 |
| Fruit_Veg_Intake | 0.008 | 0.041 | 0.041 | -0.193 | -0.216 | 1.000 | 0.013 | -0.048 | -0.053 | -0.022 | -0.152 | -0.014 | -0.013 |
| Physical_Activity | 0.065 | 0.108 | 0.040 | -0.001 | -0.025 | 0.013 | 1.000 | 0.086 | 0.001 | -0.000 | 0.063 | 0.000 | 0.028 |
| Air_Pollution | -0.014 | 0.449 | 0.068 | -0.077 | 0.036 | -0.048 | 0.086 | 1.000 | 0.081 | 0.062 | 0.496 | 0.034 | 0.005 |
| Occupational_Hazards | 0.050 | -0.017 | -0.007 | -0.015 | 0.055 | -0.053 | 0.001 | 0.081 | 1.000 | 0.076 | 0.352 | 0.000 | 0.044 |
| Calcium_Intake | 0.057 | 0.084 | -0.062 | 0.100 | 0.060 | -0.022 | -0.000 | 0.062 | 0.076 | 1.000 | 0.062 | 0.024 | -0.007 |
| Overall_Risk_Score | -0.036 | 0.431 | 0.386 | 0.272 | 0.363 | -0.152 | 0.063 | 0.496 | 0.352 | 0.062 | 1.000 | 0.029 | 0.049 |
| BMI | 0.011 | 0.003 | -0.012 | 0.030 | -0.008 | -0.014 | 0.000 | 0.034 | 0.000 | 0.024 | 0.029 | 1.000 | -0.002 |
| Physical_Activity_Level | -0.049 | 0.022 | 0.018 | 0.032 | 0.001 | -0.013 | 0.028 | 0.005 | 0.044 | -0.007 | 0.049 | -0.002 | 1.000 |
We classify correlations by absolute value (0.19–0.39 weak, 0.4–0.59 moderate, 0.6–0.79 strong) and list every pair of variables that falls in one of these bands, reading only the upper triangle of the matrix to avoid duplicates.
Weak to Strong Positive Correlation (descending):
| Feature | Correlated With | Correlation | Strength |
|---|---|---|---|
| Air_Pollution | Overall_Risk_Score | 0.496 | Moderate |
| Smoking | Air_Pollution | 0.449 | Moderate |
| Smoking | Overall_Risk_Score | 0.431 | Moderate |
| Alcohol_Use | Overall_Risk_Score | 0.386 | Weak |
| Diet_Salted_Processed | Overall_Risk_Score | 0.363 | Weak |
| Occupational_Hazards | Overall_Risk_Score | 0.352 | Weak |
| Diet_Red_Meat | Overall_Risk_Score | 0.272 | Weak |
Weak to Strong Negative Correlation (descending by strength):
| Feature | Correlated With | Correlation | Strength |
|---|---|---|---|
| Diet_Salted_Processed | Fruit_Veg_Intake | -0.216 | Weak |
| Diet_Red_Meat | Fruit_Veg_Intake | -0.193 | Weak |
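
One way to collect such pairs is sketched below on a 4 × 4 excerpt of the correlation matrix above. The helper names are hypothetical, and treating everything from 0.6 upward as "Strong" is an assumption, since the report's bands stop at 0.79.

```python
import pandas as pd

def strength(r):
    """Label |r| with the report's bands; returns None below 0.19."""
    a = abs(r)
    if a >= 0.6:
        return "Strong"   # assumption: no separate band above 0.79
    if a >= 0.4:
        return "Moderate"
    if a >= 0.19:
        return "Weak"
    return None

def correlated_pairs(corr):
    """List labelled pairs from the upper triangle, strongest first."""
    cols = list(corr.columns)
    rows = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle, no diagonal
            r = corr.iloc[i, j]
            label = strength(r)
            if label is not None:
                rows.append((cols[i], cols[j], r, label))
    out = pd.DataFrame(rows, columns=["Feature", "Correlated With", "Correlation", "Strength"])
    return out.sort_values("Correlation", key=lambda s: s.abs(), ascending=False).reset_index(drop=True)

# Excerpt of the correlation matrix reported above
names = ["Smoking", "Air_Pollution", "Overall_Risk_Score", "BMI"]
corr = pd.DataFrame(
    [[1.000, 0.449, 0.431, 0.003],
     [0.449, 1.000, 0.496, 0.034],
     [0.431, 0.496, 1.000, 0.029],
     [0.003, 0.034, 0.029, 1.000]],
    index=names, columns=names,
)
pairs = correlated_pairs(corr)
print(pairs)
```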

Density (intensity) color code (low -> high): purple->blue->green->yellow->red
[Figures: three sets of plots, each with one panel per categorical variable: Cancer_Type, Gender, Family_History, BRCA_Mutation, H_Pylori_Infection, Risk_Level.]

This plot shows how much information each Principal Component captures.

PC1 explains 13.4% of variance; PC1–PC2 together explain 25.3%. 9 components reach 80% cumulative variance; 11 reach 90%.
This plot projects all numerical variables into 2D to visualize risk-level clustering.

With only 25.3% of variance captured in 2D, the three risk classes show substantial overlap, suggesting that risk-level separation requires higher-dimensional features or non-linear decision boundaries — consistent with the use of tree-based ensemble models below.
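The explained-variance figures and the 2D projection can be obtained along the following lines. This is a sketch that assumes scikit-learn's `PCA` and uses synthetic data as a stand-in for the 13 standardized numerical columns, so its numbers will not match the ones quoted above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 13 standardized numerical columns
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 13))

# Fit all components and accumulate the explained-variance ratios
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches a target
n_80 = int(np.searchsorted(cumulative, 0.80) + 1)
n_90 = int(np.searchsorted(cumulative, 0.90) + 1)

# 2D projection used for the scatter plot, colored by risk level in the report
X_2d = PCA(n_components=2).fit_transform(X)
print(n_80, n_90, X_2d.shape)
```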
We test three hypotheses about the data, using the chi-square test, the t-test, and one-way ANOVA.
| Aspect | Details |
|---|---|
| Description | Tests whether there is an association between two categorical variables by comparing observed counts to expected counts under independence. |
| Typical use case | Assess if a risk factor (e.g., family history) is related to an outcome category (e.g., cancer risk level) across groups (e.g., cancer types). |
| Key inference metrics | Chi-square statistic (χ²), degrees of freedom (df), p-value, and whether p < 0.05 (evidence against independence). |

Chi-square test of independence for Family_History vs Risk_Level, run separately within each Cancer_Type.
Null Hypothesis ($H_0$): Family history and risk level are independent (no association) within each cancer type.
Alternative Hypothesis ($H_1$): Family history and risk level are not independent (association exists) within each cancer type.
Test statistic: $\chi^2 = \sum_{i,j} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where $O_{ij}$ are observed counts and $E_{ij}$ are expected counts.
Chi-Square Results: Family_History vs Risk_Level by Cancer_Type
| Cancer_Type | Chi2 | df | p_value | Cramers_V | Significant_at_0.05 | N |
|---|---|---|---|---|---|---|
| Breast | 4.9802 | 2 | 0.0829 | 0.1044 | No | 457 |
| Colon | 1.4844 | 2 | 0.4761 | 0.0600 | No | 412 |
| Lung | 3.7581 | 2 | 0.1527 | 0.0854 | No | 515 |
| Prostate | 0.3939 | 2 | 0.8212 | 0.0364 | No | 297 |
| Skin | 1.8931 | 2 | 0.3881 | 0.0812 | No | 287 |
Verdict:
The association between Family_History and Risk_Level is not statistically significant at the 0.05 level for any of the 5 cancer types (all p-values ≥ 0.05). Based on this test, we find no evidence that family history is associated with risk level within any cancer type.
Note: Non-significance does not prove absence of effect; small per-group sample sizes may limit statistical power.
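The per-cancer-type test and its effect size can be sketched with SciPy's `chi2_contingency`. The 2 × 3 table below holds hypothetical counts (Family_History in rows, Risk_Level in columns), and Cramér's V is computed as $\sqrt{\chi^2 / (N \cdot \min(r-1, c-1))}$, which is consistent with the values in the results table.

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi_square_with_cramers_v(table):
    """Chi-square test of independence plus Cramér's V effect size."""
    table = np.asarray(table)
    chi2, p_value, dof, _expected = chi2_contingency(table)
    n = table.sum()
    v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
    return chi2, p_value, dof, v

# Hypothetical counts: rows = Family_History (0, 1), columns = Low, Medium, High
toy_table = [[60, 300, 15],
             [20, 55, 7]]
chi2, p_value, dof, v = chi_square_with_cramers_v(toy_table)

# Consistency check on the reported Breast row: chi2 = 4.9802, N = 457
v_breast = np.sqrt(4.9802 / (457 * 1))
print(dof, round(v_breast, 4))  # 2 0.1044
```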

| Aspect | Details |
|---|---|
| Description | Compares the mean of a numerical variable between two independent groups. |
| Typical use case | Evaluate whether people with high BMI have different average Overall_Risk_Score compared to people with low BMI. |
| Key inference metrics | Group means, t statistic, p-value, and whether p < 0.05 (evidence of a difference in means). |

Independent samples t-test for Overall_Risk_Score between Low BMI and High BMI groups.
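Null Hypothesis ($H_0$): There is no difference. High BMI and Low BMI people have the same average Risk Score.
Alternative Hypothesis ($H_1$): People with above-average BMI have a different average Risk Score from people with below-average BMI (two-sided, matching the two-sided p-value reported below).
Test statistic formula: $t = \dfrac{\bar{x}_{\text{High}} - \bar{x}_{\text{Low}}}{\mathrm{SE}(\bar{x}_{\text{High}} - \bar{x}_{\text{Low}})}$, where SE is the standard error of the difference in group means.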
t-Test Results: Overall_Risk_Score by BMI Group
| BMI_Threshold_Mean | N_High_BMI | N_Low_BMI | Mean_Risk_High_BMI | Mean_Risk_Low_BMI | t_stat | p_value | Cohens_d | Significant_at_0.05 |
|---|---|---|---|---|---|---|---|---|
| 26.2014 | 969 | 999 | 0.4579 | 0.4525 | 0.9776 | 0.3284 | 0.0441 | No |
Verdict: The p-value for the difference in Overall_Risk_Score between High BMI (n=969) and Low BMI (n=999) groups is 0.3284. There is no statistically significant difference in mean Overall_Risk_Score (0.4579 vs 0.4525) at the 0.05 significance level. Fail to reject the null hypothesis.
Effect size: Cohen's d = 0.044 (negligible).
Non-significance does not prove absence of effect; the mean-based BMI split may dilute real associations.
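A comparison of this kind can be sketched with SciPy's `ttest_ind`. The sketch assumes the default pooled-variance (Student) test and a split at the mean BMI; the data are synthetic and the function name is hypothetical. The reported t and Cohen's d satisfy the pooled-variance identity $d = t\sqrt{1/n_1 + 1/n_2}$, which the last lines check.

```python
import numpy as np
from scipy.stats import ttest_ind

def t_test_with_cohens_d(group_a, group_b):
    """Two-sided pooled-variance t-test plus Cohen's d."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    t_stat, p_value = ttest_ind(a, b)  # SciPy default: equal variances, two-sided
    n_a, n_b = len(a), len(b)
    pooled_sd = np.sqrt(((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1))
                        / (n_a + n_b - 2))
    cohens_d = (a.mean() - b.mean()) / pooled_sd
    return t_stat, p_value, cohens_d

# Synthetic stand-ins for Overall_Risk_Score and BMI
rng = np.random.default_rng(0)
scores = rng.normal(0.45, 0.12, size=400)
bmi = rng.normal(26.2, 3.8, size=400)
high, low = scores[bmi > bmi.mean()], scores[bmi <= bmi.mean()]
t_stat, p_value, d = t_test_with_cohens_d(high, low)

# Consistency check on the reported values (t = 0.9776, n = 969 and 999)
d_reported = 0.9776 * np.sqrt(1 / 969 + 1 / 999)
print(round(d_reported, 4))  # 0.0441
```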
Visualization: boxplot of Overall_Risk_Score by BMI group

| Aspect | Details |
|---|---|
| Description | Tests whether the mean Overall_Risk_Score differs across multiple Cancer_Type groups using a one-way ANOVA. |
| Typical use case | Assess if average Overall_Risk_Score is the same for Breast, Colon, Lung, Prostate, and Skin cancer groups. |
| Key inference metrics | F statistic, p-value, degrees of freedom (df1, df2), and whether p < 0.05 (evidence that at least one group mean differs). |

Comparison of Overall_Risk_Score across Cancer_Type groups.
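Null Hypothesis ($H_0$): All Cancer_Type groups have the same mean Overall_Risk_Score.
Alternative Hypothesis ($H_1$): At least one Cancer_Type group has a different mean Overall_Risk_Score.
Test statistic formula: $F = \dfrac{MSB}{MSW}$, with $k$ groups, $N$ observations in total, $n_j$ observations in group $j$, group means $\bar{x}_j$, and grand mean $\bar{x}$.
| Term | Full Name | Formula (Simplified) |
|---|---|---|
| SSW | Sum of Squares Within groups | $\sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2$ |
| SSB | Sum of Squares Between groups | $\sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2$ |
| MSW | Mean Square Within groups | $SSW / (N - k)$ |
| MSB | Mean Square Between groups | $SSB / (k - 1)$ |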
ANOVA Results: Overall_Risk_Score by Cancer_Type
| k_groups | N_total | df1_(k-1) | df2_(N-k) | F_stat | p_value | Eta_squared | Significant_at_0.05 |
|---|---|---|---|---|---|---|---|
| 5 | 1968 | 4 | 1963 | 44.8968 | 4.03e-36 | 0.0838 | Yes |
Verdict: Reject $H_0$.
Hence, there is a statistically significant difference in mean Overall_Risk_Score across Cancer_Type groups.
Effect size: η² = 0.0838 (medium).
Per-group mean Overall_Risk_Score (ranked):
| Cancer_Type | Mean_Risk_Score | N |
|---|---|---|
| Lung | 0.5002 | 515 |
| Colon | 0.4665 | 412 |
| Skin | 0.4556 | 287 |
| Breast | 0.4336 | 457 |
| Prostate | 0.3940 | 297 |
Tukey HSD post-hoc pairwise comparisons:
| Group A | Group B | p_value | Significant |
|---|---|---|---|
| Breast | Colon | 0.0003 | Yes |
| Breast | Lung | 0.0000 | Yes |
| Breast | Prostate | 0.0000 | Yes |
| Breast | Skin | 0.0888 | No |
| Colon | Lung | 0.0001 | Yes |
| Colon | Prostate | 0.0000 | Yes |
| Colon | Skin | 0.7381 | No |
| Lung | Prostate | 0.0000 | Yes |
| Lung | Skin | 0.0000 | Yes |
| Prostate | Skin | 0.0000 | Yes |
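
The F statistic and η² can be computed directly from the sums of squares, as in the sketch below, which cross-checks the manual result against SciPy's `f_oneway`. The groups are synthetic (their means loosely follow the ranking above), and η² is taken as SSB / (SSB + SSW), which is the usual definition and is assumed here.

```python
import numpy as np
from scipy.stats import f_oneway

def one_way_anova(groups):
    """Return F computed by hand, F and p from SciPy, and eta squared."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), len(all_values)
    ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between groups
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within groups
    f_manual = (ssb / (k - 1)) / (ssw / (n - k))
    eta_squared = ssb / (ssb + ssw)
    f_scipy, p_value = f_oneway(*groups)
    return f_manual, f_scipy, p_value, eta_squared

# Synthetic groups, one per cancer type
rng = np.random.default_rng(0)
group_means = [0.50, 0.47, 0.46, 0.43, 0.39]
groups = [rng.normal(m, 0.12, size=100) for m in group_means]
f_manual, f_scipy, p_value, eta_squared = one_way_anova(groups)
print(round(f_manual, 4), round(eta_squared, 4))
```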
Visualization: boxplot of Overall_Risk_Score by Cancer_Type

| Test | Comparison | Statistic | p-value | Significant (α=0.05) |
|---|---|---|---|---|
| Chi-square | Family_History vs Risk_Level (Breast) | χ²=4.9802, V=0.104 | 0.0829 | No |
| Chi-square | Family_History vs Risk_Level (Colon) | χ²=1.4844, V=0.060 | 0.4761 | No |
| Chi-square | Family_History vs Risk_Level (Lung) | χ²=3.7581, V=0.085 | 0.1527 | No |
| Chi-square | Family_History vs Risk_Level (Prostate) | χ²=0.3939, V=0.036 | 0.8212 | No |
| Chi-square | Family_History vs Risk_Level (Skin) | χ²=1.8931, V=0.081 | 0.3881 | No |
| t-Test | Overall_Risk_Score by BMI Group | t=0.9776, d=0.044 | 0.3284 | No |
| ANOVA | Overall_Risk_Score by Cancer_Type | F=44.8968, η²=0.0838 | 4.033e-36 | Yes |
We use df_processed (standardized continuous features + encoded categoricals from Section 4) for ML.
Target: Risk_Level_Encoded (Low=0, Medium=1, High=2). Overall_Risk_Score is excluded to prevent leakage.
| Quantity | Value |
|---|---|
| Total samples (rows) | 1968 |
| Number of features (X) | 20 |
| Target column (y) | Risk_Level_Encoded |

First 10 rows of the ML feature matrix X (standardized numerical features + encoded categoricals):
| Age | Gender | Smoking | Alcohol_Use | Family_History | Diet_Red_Meat | Diet_Salted_Processed | Fruit_Veg_Intake | Physical_Activity | Air_Pollution | Occupational_Hazards | BRCA_Mutation | H_Pylori_Infection | Calcium_Intake | BMI | Physical_Activity_Level | Cancer_Type_Colon | Cancer_Type_Lung | Cancer_Type_Prostate | Cancer_Type_Skin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.4544 | 0.0000 | 0.5548 | -0.9341 | 0.0000 | -0.0648 | -0.5079 | 0.6821 | -0.0077 | 0.2095 | -0.6144 | 1.0000 | 0.0000 | -1.2954 | 0.4710 | 0.0184 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 1.0410 | 1.0000 | 0.8558 | 1.2149 | 0.0000 | -1.6537 | -0.5079 | 0.6821 | -1.0156 | -0.7284 | -0.6144 | 0.0000 | 0.0000 | 0.3441 | -0.2098 | 1.2786 | 0.0000 | 0.0000 | 1.0000 | 0.0000 |
| -0.8166 | 1.0000 | 0.5548 | 1.5219 | 0.0000 | -0.7003 | -0.5079 | -0.3023 | -1.0156 | 0.8348 | 1.5654 | 0.0000 | 0.0000 | 0.6720 | 0.6281 | -0.9267 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |
| -0.2300 | 0.0000 | 0.2538 | -0.9341 | 0.0000 | 0.2530 | -0.8320 | -0.3023 | 0.6642 | -0.4157 | 0.9426 | 0.0000 | 0.0000 | 1.3277 | 1.5446 | 0.6485 | 1.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.3566 | 1.0000 | 1.4578 | 0.6009 | 0.0000 | 0.2530 | -0.5079 | 1.6666 | 1.6721 | 1.4601 | 1.2540 | 0.0000 | 0.0000 | 0.3441 | -0.2884 | -0.9267 | 0.0000 | 1.0000 | 0.0000 | 0.0000 |
| 1.3343 | 1.0000 | 1.4578 | 0.9079 | 0.0000 | 0.2530 | -1.4803 | 0.3540 | -0.6796 | 1.4601 | 0.6312 | 0.0000 | 0.0000 | -1.2954 | -0.2884 | -1.2418 | 0.0000 | 1.0000 | 0.0000 | 0.0000 |
| -0.4255 | 0.0000 | 1.4578 | 1.5219 | 0.0000 | 1.2064 | -0.1838 | -1.6149 | -1.0156 | 1.4601 | 1.2540 | 0.0000 | 0.0000 | 0.3441 | 1.5969 | -0.9267 | 0.0000 | 1.0000 | 0.0000 | 0.0000 |
| 1.0410 | 1.0000 | 0.8558 | 0.2939 | 1.0000 | -0.7003 | -0.5079 | -0.9586 | 1.3362 | 0.8348 | 0.6312 | 0.0000 | 0.0000 | -0.9675 | 0.7590 | 1.2786 | 0.0000 | 0.0000 | 1.0000 | 0.0000 |
| 0.7477 | 1.0000 | 1.1568 | -1.5481 | 0.0000 | 1.5242 | -0.1838 | 0.3540 | 2.0081 | 0.8348 | -0.6144 | 0.0000 | 0.0000 | 0.3441 | -0.5502 | 0.0184 | 1.0000 | 0.0000 | 0.0000 | 0.0000 |
| -0.8166 | 1.0000 | 0.5548 | -1.2411 | 0.0000 | -1.6537 | -0.1838 | -0.9586 | 0.3283 | 1.1474 | 1.2540 | 0.0000 | 0.0000 | 0.3441 | 0.5233 | -1.2418 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |

We split the data into training and test sets using an 80/20 ratio with stratification on the target.
This preserves the distribution of risk levels in both training and test sets and avoids biased evaluation.
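A stratified 80/20 split can be sketched with scikit-learn's `train_test_split`. The label vector below is synthetic but uses the class totals implied by the counts reported later in this report (Low 314, Medium 1556, High 98); the feature matrix is a placeholder and `random_state=42` is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the report's class totals: Low=0, Medium=1, High=2
y = np.repeat([0, 1, 2], [314, 1556, 98])
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Class shares should be nearly identical in the full data and both subsets
print(np.bincount(y) / len(y))
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))
```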
Training set sample (first 10 rows):
| Age | Gender | Smoking | Alcohol_Use | Family_History | Diet_Red_Meat | Diet_Salted_Processed | Fruit_Veg_Intake | Physical_Activity | Air_Pollution | Occupational_Hazards | BRCA_Mutation | H_Pylori_Infection | Calcium_Intake | BMI | Physical_Activity_Level | Cancer_Type_Colon | Cancer_Type_Lung | Cancer_Type_Prostate | Cancer_Type_Skin | Risk_Level_Encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -0.6211 | 0.0000 | 1.4578 | 0.6009 | 1.0000 | -0.7003 | 1.7609 | 0.3540 | -0.3436 | 1.4601 | 1.2540 | 0.0000 | 0.0000 | -0.3117 | -1.5453 | -0.9267 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 1.0000 |
| 0.5521 | 0.0000 | -0.0473 | -0.9341 | 0.0000 | 0.5708 | 0.1403 | -1.6149 | 0.3283 | 1.4601 | -0.3030 | 0.0000 | 0.0000 | 0.3441 | 1.4136 | 0.6485 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |
| 0.4544 | 0.0000 | -0.9503 | 0.6009 | 0.0000 | -0.3825 | -1.4803 | 1.3384 | -0.6796 | 1.1474 | 1.2540 | 0.0000 | 0.0000 | -0.6396 | 0.8638 | 1.2786 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |
| 1.1387 | 0.0000 | 0.2538 | 1.2149 | 0.0000 | -0.7003 | -0.8320 | -0.6304 | 0.3283 | 0.5222 | 0.0084 | 0.0000 | 0.0000 | -1.2954 | -0.2884 | 1.2786 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |
| -0.1322 | 0.0000 | -0.6493 | -1.5481 | 0.0000 | 0.2530 | -0.8320 | -0.6304 | 1.3362 | -0.4157 | -0.3030 | 0.0000 | 0.0000 | -0.3117 | -0.6550 | -0.6117 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.4544 | 0.0000 | 1.1568 | 1.2149 | 0.0000 | 0.8886 | 1.4368 | 0.3540 | -0.0077 | -0.1031 | 1.5654 | 0.0000 | 0.0000 | -1.2954 | -0.2622 | 0.3335 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |
| 1.2365 | 0.0000 | 0.2538 | -1.2411 | 0.0000 | 0.8886 | -0.5079 | -1.6149 | 0.3283 | 1.4601 | -0.3030 | 0.0000 | 0.0000 | -0.9675 | 1.7279 | -1.5568 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 1.0000 |
| -0.5233 | 1.0000 | -0.0473 | -0.9341 | 0.0000 | -0.3825 | 1.4368 | -0.9586 | 1.6721 | 1.4601 | -0.9258 | 0.0000 | 0.0000 | -1.2954 | 1.3089 | 1.2786 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 |
| -0.6211 | 0.0000 | 1.1568 | -1.5481 | 1.0000 | 0.8886 | -0.8320 | -1.6149 | 0.6642 | 0.2095 | 0.9426 | 0.0000 | 0.0000 | -1.2954 | 0.8114 | 0.9635 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |
| -0.6211 | 1.0000 | 1.4578 | -0.3201 | 0.0000 | 1.5242 | 1.1127 | -0.6304 | 2.0081 | -0.1031 | -1.5487 | 0.0000 | 1.0000 | -0.6396 | -0.0789 | -0.9267 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 1.0000 |
Test set sample (first 5 rows):
| Age | Gender | Smoking | Alcohol_Use | Family_History | Diet_Red_Meat | Diet_Salted_Processed | Fruit_Veg_Intake | Physical_Activity | Air_Pollution | Occupational_Hazards | BRCA_Mutation | H_Pylori_Infection | Calcium_Intake | BMI | Physical_Activity_Level | Cancer_Type_Colon | Cancer_Type_Lung | Cancer_Type_Prostate | Cancer_Type_Skin | Risk_Level_Encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -0.5233 | 0.0000 | -1.5524 | -0.6271 | 0.0000 | -0.3825 | -1.4803 | -0.3023 | -1.0156 | -1.0410 | -1.5487 | 0.0000 | 0.0000 | -1.2954 | 2.6443 | -0.2966 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.0633 | 1.0000 | -0.9503 | -0.0131 | 0.0000 | 1.2064 | -0.8320 | -0.6304 | 0.6642 | -1.0410 | 0.0084 | 0.0000 | 0.0000 | 0.6720 | 0.7852 | 0.6485 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |
| -0.6211 | 0.0000 | -1.5524 | -0.0131 | 0.0000 | 0.2530 | 0.4644 | 0.3540 | 0.6642 | -1.6663 | -0.3030 | 0.0000 | 0.0000 | -1.2954 | 0.1306 | 0.9635 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |
| -0.9144 | 1.0000 | 1.4578 | 0.2939 | 0.0000 | -1.6537 | -0.1838 | -0.6304 | -0.0077 | 1.1474 | 0.6312 | 0.0000 | 1.0000 | -0.9675 | -0.8121 | 0.0184 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 1.0000 |
| -0.0345 | 1.0000 | 1.4578 | 0.9079 | 1.0000 | 1.5242 | 1.1127 | 0.3540 | -0.0077 | 1.4601 | 1.5654 | 0.0000 | 0.0000 | 0.3441 | 0.6281 | 1.5936 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 2.0000 |
SMOTE is a data augmentation technique that addresses class imbalance by generating synthetic samples
for minority classes. Instead of simply duplicating existing minority samples (which can lead to overfitting),
SMOTE creates new data points by interpolating between existing minority-class neighbours in feature space.
Why it is needed here:
The target variable Risk_Level is heavily imbalanced — the Medium class dominates (~79%), while High (~5%)
has very few examples. Without balancing, classifiers tend to predict the majority class and ignore minorities,
resulting in high accuracy but poor recall and F1 for under-represented classes.
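How it works: for each synthetic sample, SMOTE picks a minority-class point $x_i$, selects one of its $k$ nearest minority-class neighbours $x_{nn}$ at random, and creates $x_{\text{new}} = x_i + \lambda\,(x_{nn} - x_i)$ with $\lambda$ drawn uniformly from $[0, 1]$. This is repeated until each minority class reaches the majority-class count.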
Important: SMOTE is applied only on the training set — the test set remains untouched to ensure
an honest evaluation on real data.
Class distribution before SMOTE:
| Class | Count |
|---|---|
| Low (0) | 251 |
| Medium (1) | 1245 |
| High (2) | 78 |
Class distribution after SMOTE:
| Class | Count |
|---|---|
| Low (0) | 1245 |
| Medium (1) | 1245 |
| High (2) | 1245 |
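
The interpolation idea behind SMOTE can be illustrated with a small NumPy-only sketch. This is a hand-rolled illustration, not the implementation used in the report (presumably a library such as imbalanced-learn); the function name, `k=5`, and the synthetic data are assumptions.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between each chosen
    minority sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)            # starting points
    partner = neighbours[base, rng.integers(0, k, size=n_new)]  # random neighbour
    lam = rng.random((n_new, 1))                               # interpolation weight
    return X_min[base] + lam * (X_min[partner] - X_min[base])

# Synthetic stand-in for the 78 High-risk training rows with 20 features
rng = np.random.default_rng(1)
X_high = rng.standard_normal((78, 20))
synthetic = smote_like_oversample(X_high, n_new=1245 - 78)
print(len(X_high) + len(synthetic))  # 1245
```

Every synthetic point lies on a line segment between two real minority samples, so it stays inside the region the minority class already occupies.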
Evaluation protocol: All metrics below are computed on the original, unmodified test set (no SMOTE). This ensures honest evaluation on real-world class distribution.
LightGBM is a gradient boosting framework based on decision trees. It builds the ensemble sequentially: each new tree is fitted to the gradient of a differentiable loss function (here, multiclass log-loss) with respect to the current predictions, so that adding it reduces the loss.

Hyperparameters: n_estimators=500, learning_rate=0.05, max_depth=7, num_leaves=63, min_child_samples=10, subsample=0.8, colsample_bytree=0.8

Per-class recall — Low: 34/63 (54%), Medium: 292/311 (94%), High: 6/20 (30%).
XGBoost (Extreme Gradient Boosting) is another gradient boosting implementation that uses regularization
and shrinkage to reduce overfitting. It optimizes an objective function of the form
$\mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)$,
where $l$ is the loss (e.g. multiclass log-loss) and $\Omega(f_k)$ is a regularization term on each tree $f_k$.

Hyperparameters: n_estimators=500, learning_rate=0.05, max_depth=6, min_child_weight=3, subsample=0.8, colsample_bytree=0.8, gamma=0.1

Per-class recall — Low: 38/63 (60%), Medium: 289/311 (93%), High: 11/20 (55%).
RandomForest is an ensemble of decision trees trained on bootstrapped samples with feature subsampling.
For classification, the final prediction is obtained via majority vote across trees:
$\hat{y}(x) = \operatorname{mode}\{h_1(x), \dots, h_B(x)\}$,
where each $h_b$ is a decision tree trained on a different bootstrap sample.

Hyperparameters: n_estimators=500, max_depth=None, min_samples_split=5, min_samples_leaf=2

Per-class recall — Low: 40/63 (63%), Medium: 288/311 (93%), High: 9/20 (45%).
| Algorithm | Accuracy | Precision_macro | Recall_macro | F1_macro | ROC_AUC_macro |
|---|---|---|---|---|---|
| LightGBM | 0.8426 | 0.7085 | 0.5929 | 0.6346 | 0.8914 |
| XGBoost | 0.8579 | 0.7409 | 0.6941 | 0.7153 | 0.9027 |
| RandomForest | 0.8553 | 0.7549 | 0.6703 | 0.7037 | 0.9163 |
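
The accuracy and macro-recall columns can be re-derived from the per-class recall counts reported for each model, which is a useful sanity check on the table. The sketch below uses only those reported counts; the function name is illustrative.

```python
# (correctly classified, class size) per class, from the per-class recall lines
per_class = {
    "LightGBM":     {"Low": (34, 63), "Medium": (292, 311), "High": (6, 20)},
    "XGBoost":      {"Low": (38, 63), "Medium": (289, 311), "High": (11, 20)},
    "RandomForest": {"Low": (40, 63), "Medium": (288, 311), "High": (9, 20)},
}

def accuracy_and_macro_recall(counts):
    """Accuracy pools all test rows; macro recall averages the class recalls."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    recalls = [c / t for c, t in counts.values()]
    return correct / total, sum(recalls) / len(recalls)

summary = {}
for model, counts in per_class.items():
    accuracy, macro_recall = accuracy_and_macro_recall(counts)
    summary[model] = (round(accuracy, 4), round(macro_recall, 4))
    print(model, summary[model])
```

Accuracy is dominated by the 311 Medium cases, while macro recall weights the 20 High-risk cases as heavily as the Medium class, which is why the two metrics rank the models differently.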


Inference: We select the best model by macro-averaged F1, which balances precision and recall across all classes; this is critical when minority-class detection (High risk) matters.
F1 and AUC disagree: XGBoost leads on F1 (0.7153, better hard predictions), while RandomForest leads on AUC (0.9163, better probability ranking). Because F1_macro is preferred for clinical risk triage, XGBoost is the selected model.