Developing a Machine Learning Model Using Gene Expression for Breast Cancer Prediction
Abstract
Recent advances in genomics have produced large gene expression datasets that offer insight into cancer biology. This study investigates an ensemble machine learning model that integrates K-Nearest Neighbors (KNN), a Support Vector Classifier (SVC), and XGBoost to predict and classify breast cancer subtypes from gene expression profiles. The methodology comprised data preprocessing, including one-hot encoding, followed by model training and evaluation using standard metrics. The ensemble achieved an overall accuracy of 90.32% and a precision of 0.9240, indicating few false positives, which is a key consideration for clinical diagnostics. Its F1-score of 0.9015 indicates balanced precision and recall. In a comparative analysis, individual baseline models (Support Vector Machine, SVM, and Random Forest, RF) reported higher raw accuracy of approximately 99%; the proposed ensemble nevertheless provides a robust and interpretable framework optimized for reliable multi-class discrimination.
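As an illustration of how the predictions of three classifiers might be combined and scored, the following minimal sketch uses only the Python standard library. It rests on assumptions the abstract does not confirm: the ensemble is combined by hard majority voting, and precision and F1 are macro-averaged across classes. The function names, the subtype labels "A", "B", "C", and the toy predictions are hypothetical and are not the study's data or results.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists by hard (majority) voting.

    Assumption: ties are broken in favour of the earliest-listed model
    whose vote reaches the top count.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        best = max(counts.values())
        combined.append(next(v for v in votes if counts[v] == best))
    return combined


def macro_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, f1_scores = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        f1_scores.append(f1)

    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / len(labels),
        "f1": sum(f1_scores) / len(labels),
    }


# Hypothetical toy data: true subtypes and the labels predicted by each model.
y_true = ["A", "A", "B", "B", "C", "C"]
knn_pred = ["A", "B", "B", "B", "C", "A"]
svc_pred = ["A", "A", "B", "C", "C", "C"]
xgb_pred = ["A", "A", "B", "B", "A", "A"]

ensemble = majority_vote([knn_pred, svc_pred, xgb_pred])
metrics = macro_metrics(y_true, ensemble)
print(ensemble)
print(metrics)
```

Precision penalizes only false positives, whereas F1 also accounts for missed cases through recall, which is why an ensemble can report a precision above its F1-score.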
Library and Information Science Department UNZA