
Clustering car_sales Dataset

In this project, I will apply clustering to a car sales dataset, experimenting with two clustering methods: k-means and agglomerative clustering.

Introduction


The following are the Python libraries I used in this project (a sample import block follows the list):

Pandas

Matplotlib

Seaborn

Numpy

Sklearn (AgglomerativeClustering, KMeans, PCA)
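
For reference, a minimal import block along these lines might look like the following (a sketch; the exact imports in the project may differ):

# Core data-handling and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn pieces used for clustering and dimensionality reduction
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA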



What is clustering?

Clustering is an unsupervised learning technique for grouping data. Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled datasets.


Clustering organizes unlabeled data into similarity groups called clusters. A cluster is a collection of data items that are "similar" to each other and "dissimilar" to the data items in other clusters. A good clustering method produces high-quality clusters with high intra-cluster similarity and low inter-cluster similarity.


The quality of a clustering method depends on:

  • the similarity measure used by the method

  • its implementation, and

  • its ability to discover some or all of the hidden patterns


The most popular centroid-based clustering algorithm is called k-means clustering. In k-means the “means” are the centroids, represented by the arithmetic means (averages) of the values along each dimension for the instances in the cluster. The k in k-means is simply the number of clusters that one would like to find in the data.


The most common way of representing a cluster is through its center, called the centroid. In the figure, we have three clusters whose instances are represented by the circles. Each cluster has a centroid, represented by the solid-lined star. The star is not necessarily one of the instances; it is the geometric center of the group of instances.
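
As a toy illustration (not taken from the project code), the centroid of a cluster is simply the per-dimension mean of its members:

import numpy as np

# Three made-up 2-D points assumed to belong to one cluster
cluster_points = np.array([[1.0, 2.0],
                           [2.0, 4.0],
                           [3.0, 6.0]])

# The centroid is the arithmetic mean along each dimension
centroid = cluster_points.mean(axis=0)
print(centroid)  # [2. 4.]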




Introduce the data

The dataset I have chosen to explore is a single CSV file from kaggle.com. It is named "Car sales" and is publicly accessible. The data was taken from AnalytixLabs for the purpose of prediction.


The following are attributes in the dataset:

Manufacturer (Ex: Acura, Honda, etc)

Model (Model of the car)

Sales (Sale value of the car in thousands)

resale (Resale value of the car in thousands)

price (price in thousands)

engine_s (engine size)

horsepow (horsepower)

wheelbas (wheelbase)

width

length

curb_wgt (curb weight)

fuel_cap (fuel capacity)

Mpg (miles per gallon)



Data Understanding/Visualization

The chart below is a scatterplot of price vs. horsepower, and we can see that the two variables are strongly positively correlated.
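
The original plotting code is shown as an image in the post; a minimal sketch, assuming the column names listed above and a CSV named Car_sales.csv, could look like this:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("Car_sales.csv")                  # file name is an assumption
sns.scatterplot(data=df, x="horsepow", y="price")  # price vs. horsepower
plt.title("Price vs. Horsepower")
plt.show()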


The heatmap shows that mpg is moderately to highly negatively correlated with the other features.
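
A sketch of the correlation heatmap, continuing from the dataframe df loaded in the previous sketch (column names remain assumptions):

# Correlation matrix over the numeric columns, drawn as a heatmap
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()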


Pre-processing the data

In this step of the process, I reduced the number of columns. I picked the columns based on how relevant they would be, so that the model gains important information and uses it to cluster the dataset. The code below shows the x dataframe, which I will use in later steps to create visualizations and clustering models.
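
The column selection itself appears as an image in the post; a sketch under the assumption that the kept columns match the numeric attributes listed earlier (names may differ in the actual CSV):

# Keep the numeric features assumed relevant for clustering
x = df[["sales", "resale", "price", "engine_s", "horsepow",
        "wheelbas", "width", "length", "curb_wgt", "fuel_cap", "mpg"]]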

Using the code shown below, I checked for any null values present in the x dataframe and dropped those rows, leaving a clean, complete dataframe so that the model can use every feature equally when learning.
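
A sketch of the null check and row removal:

# Count missing values per column, then drop any rows that contain them
print(x.isnull().sum())
x = x.dropna()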


Modeling

We will start the modeling process by first creating a PCA dataframe. PCA is a dimensionality reduction method; applied to our dataset, it transforms the x dataframe into a new pca_df with two components (columns). The advantages of creating a PCA dataframe are that it removes irrelevant features and multicollinearity and reduces time and storage, which helps with creating faster visualizations and models. The drawbacks of this method are a loss of information and that the resulting dataframe is difficult to interpret. For our purposes, we will only use this dataframe to visualize the clusters.

The following code block creates a PCA model and dataframe.
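
The code itself is shown as an image; a sketch with the same intent (two components, input x) might be:

from sklearn.decomposition import PCA

# Project the selected features onto two principal components for visualization
pca = PCA(n_components=2)
components = pca.fit_transform(x)
pca_df = pd.DataFrame(components, columns=["PC1", "PC2"])

In practice, features are often standardized before PCA so that large-valued columns do not dominate; whether that was done here is not shown in the post.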

The following plot is called the elbow diagram. It can help us determine the optimal number of clusters for our model: we look for the point where the drop in inertia (within-cluster sum of squares) levels off. For our dataset, the largest drop occurs going from one cluster to two, so we will choose to create two clusters in our model.
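
An elbow diagram of this kind is typically produced by fitting KMeans for a range of k values and plotting the inertia; a sketch:

# Fit k-means for k = 1..10 and record the within-cluster sum of squares (inertia)
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(x)
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.show()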


The following code block creates a k-means clustering model. We pass the number of clusters as a parameter, and y_kmeans holds the cluster assignments produced by fitting the model on the x dataframe.
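
A sketch of the k-means fit and the PCA-based cluster plot described next (variable names follow the post; the rest is an assumption):

# Fit k-means with two clusters; y_kmeans holds the label (0 or 1) for each row of x
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(x)

# Plot the two PCA components against each other, colored by cluster label
sns.scatterplot(data=pca_df, x="PC1", y="PC2", hue=y_kmeans)
plt.show()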


The above plot visualizes the two columns of the PCA dataframe plotted against each other, with our k-means labels, y_kmeans, as the hue. We can clearly see the two clusters color-coded differently.

The plot below shows horsepower against wheelbase, with hue mapped to our k-means clustering model.
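
A sketch of that plot, using the assumed column names wheelbas and horsepow:

# Horsepower vs. wheelbase, colored by the k-means cluster labels
sns.scatterplot(x=x["wheelbas"], y=x["horsepow"], hue=y_kmeans)
plt.show()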

The following chart is called a dendrogram; similar to the elbow diagram, it can be used to determine the number of clusters into which to separate the data.
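
The dendrogram is usually drawn with scipy's hierarchical-clustering utilities (an assumption here, since scipy is not listed among the project libraries):

from scipy.cluster import hierarchy

# Build the linkage matrix with Ward's method and draw the dendrogram
linkage_matrix = hierarchy.linkage(x, method="ward")
hierarchy.dendrogram(linkage_matrix)
plt.show()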

The following code block creates the agglomerative model; we then map hue to its labels, y_agglo, to visually see the clusters.
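
A sketch of the agglomerative model and the cluster plot, continuing from the earlier sketches:

from sklearn.cluster import AgglomerativeClustering

# Fit agglomerative (hierarchical) clustering with two clusters
agglo = AgglomerativeClustering(n_clusters=2)
y_agglo = agglo.fit_predict(x)

# Visualize the agglomerative clusters on the PCA components
sns.scatterplot(data=pca_df, x="PC1", y="PC2", hue=y_agglo)
plt.show()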





Storytelling (Clustering Analysis)

Both of my models divided the distribution right down the middle. Based on the elbow diagram and the visualizations, two was the optimal number of clusters for this data. Interpreting the results can be very difficult. I think this modeling technique is best used together with another modeling technique, and for it to be effective, the quality and specificity of the data are very important: one needs to feed the model relevant information and can then use the resulting clusters with, for example, a classification method to classify a subject by its physical or internal traits.



References

https://www.kaggle.com/datasets/gagandeep16/car-sales



You can access the code by clicking here


