For my second PhD project, I learned another exciting method: K-means.
Here, you will find an easy-to-follow example of how to get started with K-means, along with helpful resources that guided me throughout my project.
If you are interested in reading more about network analysis in the wild, here is the link to the code for my article “Neurodevelopmental Changes after Adversity in Early Adolescence”.
References are linked throughout the text.
K-means is an unsupervised learning method (i.e., it does not require labeled or categorical outcomes). It groups objects into clusters by minimizing the average squared distance between points in the same cluster (Hartigan & Wong, 1979; Jain, 2010; Lloyd, 1982). K-means partitions observations into k clusters, and each observation is assigned to the cluster with the nearest mean. In psychological research, this technique can be used, for example, to find subgroups in population studies, or in other words, participants with similar characteristics on specific variables.
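To make this concrete, here is a minimal sketch using base R’s kmeans() on simulated data; the two-variable toy data set and the choice of k = 2 are illustrative assumptions, not part of the tutorial.
# Illustrative only: K-means on simulated data with two loosely separated groups
set.seed(123) # for reproducible random numbers
toy <- data.frame(
  x = c(rnorm(50, mean = 0), rnorm(50, mean = 5)),
  y = c(rnorm(50, mean = 0), rnorm(50, mean = 5))
)
fit <- kmeans(toy, centers = 2, nstart = 25) # k = 2 assumed; 25 random starts for stability
fit$centers # cluster means ("centroids")
table(fit$cluster) # number of observations per cluster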
These are the steps we will follow during this short tutorial:
We will determine the optimal number of clusters with the commonly used Elbow and Silhouette methods.
We will conduct a K-means analysis.
We will visualize the resulting clusters using ggplot2.
# Load necessary libraries for data analysis and visualization
## Data Wrangling and Cleaning
library(tidyverse) # collection of R packages for data manipulation and visualization
library(dplyr) # for data wrangling
library(tidyr) # for data tidying
library(readr) # for reading and writing flat files
library(reshape2) # for reshaping data between wide and long formats
## Data Visualization
library(ggplot2) # for data visualization
library(ggeffects) # for plotting marginal effects
library(marginaleffects) # for plotting marginal effects
library(gplots) # for additional plotting tools
## Machine Learning
library(caret) # for machine learning workflows
library(nnet) # for neural network modeling
library(MASS) # for additional statistical modeling functions
library(NbClust) # for determining the optimal number of clusters (Elbow & Silhouette)
library(factoextra) # for clustering diagnostics and visualization (e.g., fviz_nbclust)
# Define custom color palette
my_colors <- c("#F4A261","#E76F51","#264653","#2A9D8F","#E9C46A")
We will use the built-in data set “starwars” from the dplyr package. You can find more information on this data set here.
# Load data
Data <- starwars
# Remove rows with missing values
Data <- na.omit(Data)
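Because K-means relies on Euclidean distances, variables on very different scales (such as height in cm and mass in kg) can dominate the clustering. Below is an optional, minimal sketch of how you could standardize the two variables first; the scale() step is a common preprocessing choice and not part of the original workflow, which continues with the raw values.
# Optional: inspect and standardize the clustering variables (height and mass)
summary(Data[, 2:3]) # height and mass are measured on different scales
Data_scaled <- scale(Data[, 2:3]) # z-standardize so both variables contribute equally
head(Data_scaled)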
Before we conduct the actual K-means analysis, we need to determine the optimal number of clusters. Here you can find more information on different clustering techniques. In this example, we determine the optimal number of clusters based on the height and mass of the Star Wars characters.
# Elbow method
Elb <- fviz_nbclust(Data[, 2:3], kmeans, method = "wss") + # within-cluster sums of squares
  labs(subtitle = "Elbow method")
Elb
# Silhouette method
Sil <- fviz_nbclust(Data[, 2:3], kmeans, method = "silhouette") + # average silhouette width
  labs(subtitle = "Silhouette method")
Sil
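Once the Elbow and Silhouette plots point to a number of clusters, the K-means analysis itself is a single call to kmeans(), followed by a quick ggplot2 plot of the clusters. The sketch below uses k = 2 only as a placeholder (replace it with the number your plots suggest); set.seed() and nstart = 25 are conventional additions for reproducibility and stable starting points, not part of the original code.
# Sketch of the next step: run K-means with the chosen number of clusters
set.seed(123) # for reproducible cluster assignments
k <- 2 # placeholder; use the number of clusters suggested by the plots above
KM <- kmeans(Data[, 2:3], centers = k, nstart = 25)
Data$cluster <- factor(KM$cluster) # attach cluster labels for plotting
ggplot(Data, aes(x = height, y = mass, colour = cluster)) +
  geom_point(size = 2) +
  scale_colour_manual(values = my_colors) +
  theme_minimal()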