(Part I) The use of various Machine Learning models for predicting Alzheimer’s disease

  • Writer: Latera Tesfaye
  • Dec 26, 2024
  • 4 min read

Updated: Dec 27, 2024

This was my class project for BIOST-546 (Machine Learning) at the University of Washington with Eardi Lali (2023). In this blog, I assume readers have a basic understanding of machine learning.


This blog demonstrates the use of various machine learning models to distinguish patients with Alzheimer's disease from healthy elderly individuals by analyzing cerebral cortex thickness measurements. I will present a range of models, from simple linear models to more intricate non-linear ones.


The dataset consists of cerebral cortex thickness measurements at 360 brain regions of interest, together with labels indicating whether the subject has Alzheimer's Disease or is a Control (i.e., a healthy elderly subject). Mathematically, let (yi, xi), with i = 1, …, n = 400, denote the ith observation in the dataset. Here xi is a vector of length 360 containing the 360 cortex thickness measurements for the ith subject. We can think of these measurements as 360 'variables', since the 360 regions where the measurements are taken are in correspondence across subjects. The variable yi ∈ {C, AD} is the categorical outcome: Control vs. Alzheimer's Disease.


Figure 1: Brain scans comparing a control subject and a subject with Alzheimer's disease; cortical thickness is shown on a blue-to-red scale.

The data

The data has no missing values. It contains 97 observations with outcome C (hereafter referred to as no disease) and 303 with outcome AD (hereafter referred to as diseased), for 400 observations in total, with 360 predictors (p = 360). For modeling purposes, the data was split into training and test sets of 280 and 120 observations, respectively.
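The original analysis was done in R; as a rough sketch, the 280/120 split above could look like the following in Python/scikit-learn. The data here is a synthetic stand-in (the real cortical thickness measurements are not reproduced), and all names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 400, 360
X = rng.normal(size=(n, p))              # stand-in for the 360 thickness measurements
y = np.array(["AD"] * 303 + ["C"] * 97)  # 303 diseased, 97 no-disease labels

# 280 training / 120 test observations, preserving the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=280, stratify=y, random_state=1
)
print(X_train.shape, X_test.shape)  # (280, 360) (120, 360)
```

Stratifying on y keeps the diseased/no-disease proportions similar in both sets, which matters here because the classes are imbalanced (303 vs. 97).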


Figure 2 below (left) shows scatter plots of the outcome for randomly chosen predictors 10 to 13. There is no clearly defined boundary for these predictors. However, if we reduce all predictors to two (using Principal Component Analysis - PCA), as shown in Figure 2 (right), we can see a clear distinction between the distributions of the two outcome values. This suggests that simpler linear models might describe the data well. For non-linear relationships, it may be more appropriate to use a non-linear model, such as a decision tree, random forest, gradient boosting, or a neural network; these models can capture more complex and nuanced relationships between the predictors and the target variable than linear models can.
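The PCA projection behind Figure 2 (right) can be sketched as follows, again with synthetic stand-in data; with the real measurements, plotting the two components colored by label is what reveals the class separation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 360))  # stand-in for the thickness measurements

# Standardize first: PCA is sensitive to the scale of each predictor
X_scaled = StandardScaler().fit_transform(X)
pcs = PCA(n_components=2).fit_transform(X_scaled)
print(pcs.shape)  # (400, 2) -> plot pcs[:, 0] vs pcs[:, 1], colored by outcome
```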

Figure 2: Checking decision boundaries

An alternative approach is to employ a K-nearest neighbors (KNN) model, which does not impose any predetermined mathematical function for relating the predictors to the target variable. Instead, this technique uses the closeness of data points to predict outcomes.
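A minimal KNN sketch, again in Python with synthetic stand-in data: each subject is classified by the majority label among its k nearest neighbors in the 360-dimensional thickness space. The choice k = 5 here is arbitrary; in practice k would be tuned by cross-validation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 360))          # stand-in predictors
y = np.array(["AD"] * 303 + ["C"] * 97)  # stand-in labels

# No functional form is fit: prediction is a vote among the 5 closest points
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
pred = knn.predict(X[:5])
print(pred)
```

Because KNN relies on distances, standardizing the predictors first is usually advisable when they are on different scales.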


The following plot (Figure 3) shows the correlation between the predictors. Although it is difficult to make out the correlation between individual predictors, we can see that the correlations are generally small.


Figure 3: Correlation plot

Models

Model 1: Using logistic regression is very tempting. However, in this high-dimensional setting, a simple logistic regression (glm) will not converge to a unique solution: with nearly as many predictors as training observations, and predictors that are correlated, the design matrix is ill-conditioned and the coefficients are not identifiable. Nevertheless, since our main goal is to worry less about the dataset, let's see what happens. For this simple logistic regression the test accuracy is 46%, whereas the train accuracy was 100%. As expected, the model overfits badly and the test accuracy is very low.

Model 2: Regularized logistic regression offers a way to limit the size of the estimated coefficients, which can lower variance and decrease test error. In this work I use a generalization of the ridge and lasso penalties called the elastic net. The three parameters of interest are: β, the vector of logistic regression coefficients to be estimated; λ, the regularization parameter controlling the strength of regularization; and α, the mixing parameter controlling the trade-off between the L1 and L2 penalty terms. The grid tuning uses cross-validation (k = 5 folds) with a tune length of 20; the candidate α values are generated with seq(0, 1, by = 0.01), and the candidate λ values with 10^(seq(5, -18, length = 100)). When performing a grid search, it is often useful to cover λ on a log scale, with increasingly larger gaps between values as they grow. For this task I used the caret library.
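For readers not using R, here is a rough scikit-learn analogue of this grid search, with synthetic stand-in data and a much coarser grid to keep runtime small. The mapping is approximate: caret/glmnet's α corresponds to `l1_ratio` here, and its λ roughly to 1/C (up to a scaling convention), so the numbers are not directly comparable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
X = rng.normal(size=(280, 360))         # stand-in training predictors
y = rng.integers(0, 2, size=280)        # stand-in binary outcome

# Coarse analogue of the (alpha, lambda) grid: l1_ratio mixes L1/L2,
# Cs is an inverse-regularization grid on a log scale
model = LogisticRegressionCV(
    Cs=np.logspace(-2, 2, 3),
    l1_ratios=[0.0, 0.5, 1.0],
    penalty="elasticnet",
    solver="saga",                      # the only solver supporting elastic net here
    cv=5,
    max_iter=500,
).fit(X, y)
print(model.l1_ratio_, model.C_)        # selected mixing and regularization values
```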


Table 1: Elastic net (glmnet) cross-validation results for the selected model

  alpha   lambda   Accuracy   Kappa   AccuracySD   KappaSD
  0.01    0.27     0.94       0.83    0.023        0.073

Accordingly, as shown in Table 1, the selected parameters are an α value of 0.01 (close to ridge) and a λ value of 0.27. The following figure shows the misclassification error against log(λ). The misclassification error is calculated from the cross-validation results, and λ controls the strength of regularization applied to the model. The goal is to choose a λ that balances regularization against model performance on new data. Figure 4 shows how the misclassification error changes as λ increases; the dashed lines mark the selected λ of 0.27 (on the log scale).

Figure 4: log λ and mis-classification error

For this model the test accuracy is 97%, whereas the train accuracy was estimated at 96%. Compared to the simple logistic regression model, this regularized elastic net model does much better in terms of accuracy and reducing the risk of overfitting. For such a simple model this is an incredible accuracy level (but again, it is what we expected given Figure 2 (PCA)). In addition, the ROC curve for the elastic net model showed an AUC of 95.29%.
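An AUC like the 95.29% above is computed from the model's predicted scores on the held-out test set. A minimal sketch, using synthetic stand-in labels and scores (with the real model, the scores would come from the fitted elastic net's predicted probabilities):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=120)   # stand-in test labels: 1 = AD, 0 = C

# Stand-in scores: diseased subjects get higher scores on average, plus noise
scores = np.where(y_test == 1, 0.8, 0.2) + rng.normal(scale=0.3, size=120)

auc = roc_auc_score(y_test, scores)
print(round(auc, 4))
```

The AUC is the probability that a randomly chosen diseased subject receives a higher score than a randomly chosen no-disease subject, so it is insensitive to the choice of classification threshold.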


 
 
 
