One of the simplest and genius ML fashions, for my part, is k-Means clustering. It pertains to the group of unsupervised studying algorithms and is able to find patterns inside an unlabeled dataset. Probably the most nice function is that it lacks difficult math, and mainly any highschool scholar can efficiently implement and use this technique. So on this article I wish to share how one can construct k-Means algorithm from scratch in python utilizing solely numpy and pandas libraries and apply it to an actual world drawback — semantic segmentation of satellite tv for pc imagery.
Firstly, let’s discuss in regards to the knowledge now we have.
In one in every of my earlier articles, I talked about the issue of the Aral Sea shrinkage. In consequence, we managed to get distant sensing imagery from MODIS utilizing Google Earth Engine, which strongly signifies that the ocean is drying. So I puzzled, how can we estimate the change of the water floor between 2000 and 2023 utilizing ML semantic segmentation? The reply is k-Means!
Earlier than diving into coding, let’s take a look on the knowledge we’re going to use on this tutorial. These are two RGB photographs of the identical space with an interval of 23 years, nonetheless it’s clear that the land floor properties and atmospheric situations (clouds, aerosols and so on.) are completely different. That’s why I made a decision to coach two separate k-Means fashions, one for every picture.
To comply with up the tutorial, you possibly can obtain and run the pocket book right here.
To begin with, let’s import the mandatory libraries and add the info to the pocket book:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.picture as mpimgimg = mpimg.imread('MOD_01.jpg')
img2 = mpimg.imread('MOD_24.jpg')
You possibly can see that the realm coated by the photographs is sort of giant, so I recommend to zoom in somewhat:
img = img[140:600,110:500,:]
img2 = img2[140:600,110:500,:]fig, ax = plt.subplots(ncols=2, figsize=(16,9))
ax[0].imshow(img)
ax[1].imshow(img2)
for i in vary(2):
ax[i].set_facecolor('black')
ax[i].set_xticks([])
ax[i].set_yticks([])
ax[0].set_title('2000-08-01', fontsize=26)
ax[1].set_title('2023-08-01', fontsize=26)
plt.present()
And the final step earlier than the ML part, let’s convert our photographs to pandas dataframes (one column for every picture channel). I try this for the sake of visibility of my explanations. If you wish to get it optimized, it’s higher to make use of numpy arrays as a substitute.
df = pd.DataFrame({'R': img[:,:, 0].flatten(), 'G': img[:,:, 1].flatten(), 'B':img[:,:, 2].flatten()})
df2 = pd.DataFrame({'R': img2[:,:, 0].flatten(), 'G': img2[:,:, 1].flatten(), 'B':img2[:,:, 2].flatten()})
So what’s the concept behind the algorithm?
Think about that you just decide in regards to the style of meals utilizing two standards: sweetness and value. Conserving this in thoughts, I’ll provide you with a set of potential choices to eat:
I guess your mind has already break up the choices into three clusters: fruits, drinks and bakery. Principally, you unconsciously clustered the 2-dimensional knowledge, that are outlined by a pair of values — (sweetness; value).
Within the case of k-Means, the aim of the algorithm is sort of related — to discover a pre-set variety of cluster, ok, in n-dimensional area (e.g. apart from sweetness and value you wish to account for vitamin, well being, presence of the meals in your fridge, and on this case, n = 5).
I. Outline the variety of clusters.
As I discussed beforehand, ok in k-Means is the variety of clusters you wish to get ultimately, and also you’re purported to set this worth earlier than coaching the mannequin.
II. Randomly initialize centroids.
Centroid is an integral a part of k-Means. Principally, centroid is a circle with a middle, which is outlined a set of coordinates, and every centroid represents a cluster. For example, in our earlier instance there are 3 centroids.
III. Calculate distances and assign clusters.
Now we have to discover how far every level is from every centroid. Primarily based on this calculations, we assign every level to the least distant centroid (cluster).
IV. Calculate new centroids.
Now every of our clusters comprises a minimum of one factors, so it’s time to re-calculate the centroids just by taking imply coordinates throughout all cluster factors.
And that’s it! We repeat steps 2–4 till centroids should not altering.
Now let’s wrap this actually easy concept behind k-Means into python code.
Reminder: on this job now we have 3D drawback, i.e. our X, Y and Z are Pink, Inexperienced and Blue picture channels!
def kmeans(knowledge, Ok, form):
L = record()
new_centroids = knowledge.pattern(Ok).valuesknowledge = distance(knowledge.copy(), new_centroids, form)
old_centroids = new_centroids.copy()
new_centroids = np.array([data[data.Class == Class][['R', 'G', 'B']].imply().values for Class in knowledge.loc[:,'C1':f'C{K}'].columns])
i = 1
print(f'Iteration: {i}tDistance: {abs(new_centroids.imply()-old_centroids.imply())}')
whereas abs(new_centroids.imply()-old_centroids.imply())>0.001:
L.append(abs(new_centroids.imply()-old_centroids.imply()))
knowledge = distance(knowledge, new_centroids, form)
old_centroids = new_centroids.copy()
new_centroids = np.array([data[data.Class == Class][['R', 'G', 'B']].imply().values for Class in knowledge.loc[:,'C1':f'C{K}'].columns])
i+=1
print(f'Iteration: {i}tDistance: {abs(new_centroids.imply()-old_centroids.imply())}')
print(f"k-Means has ended with {i} iteratinons")
return knowledge, L
On the primary stage we create a listing L to gather all of the distances between clusters to visualise them afterwards and randomly pattern Ok factors from the dataset to make them our centroids (or alternatively, you possibly can assign random values to the centroids).
L = record()
new_centroids = knowledge.pattern(Ok).values
Now we have to calculate the distances between centroids and knowledge factors. There are many completely different distance metrics in Information Science, however let’s deal with the next ones — Euclidean, Manhattan, Chebyshev.
For Euclidean distance:
For Manhattan:
For Chebyshev:
To make use of this formulation, let’s write a flexible operate for any variety of dimensions:
def distance(knowledge, centroids, form):
#form = euclidean, manhattan, chebyshev
#Right here we add to the dataframe as many clusters C-ith as wanted
cols=record()
for i in vary(1,ok+1):
if form=='euclidean':
knowledge[f'C{i}'] = ((centroids[i-1][0]-data.R)**2+(centroids[i-1][1]-data.G)**2+(centroids[i-1][2]-data.B)**2)**0.5
elif form=='manhattan':
knowledge[f'C{i}'] = abs(centroids[i-1][0]-data.R)+abs(centroids[i-1][1]-data.G)+abs(centroids[i-1][2]-data.B)
elif form=='chebyshev':
merged=pd.concat([centroids[i-1][0]-data.R, centroids[i-1][1]-data.G, centroids[i-1][2]-data.B], axis=1)
knowledge[f'C{i}'] = merged.max(axis=1)
cols.append(f'C{i}')
knowledge['Class'] = knowledge[cols].abs().idxmin(axis=1) #assigning clusters to factors
return knowledge #returning the dataframe with ok cluster columns and one Class column with the ultimate cluster
So now we will merely calculate distances and assign a cluster to every knowledge level. Thus, our new centroids grew to become previous, so we retailer them in one other variable and recalculate the brand new ones. To do this we iterate over every cluster and take a imply throughout all of the coordinates (in our case, throughout RGB channels). Due to this fact, the variable new_centroids has a form of (ok,3).
knowledge = distance(knowledge.copy(), new_centroids, form)
old_centroids = new_centroids.copy()
new_centroids = np.array([data[data.Class == Class][['R', 'G', 'B']].imply().values for Class in knowledge.loc[:,'C1':f'C{K}'].columns])
Lastly, we repeat all these steps till centroids’ coordinates don’t change anymore. I expressed this situation as this: the distinction between common cluster coordinates needs to be lower than 0.001. However you possibly can mess around with different numbers right here.
whereas abs(new_centroids.imply()-old_centroids.imply())>0.001:
L.append(abs(new_centroids.imply()-old_centroids.imply()))
knowledge = distance(knowledge, new_centroids, form)
old_centroids = new_centroids.copy()
new_centroids = np.array([data[data.Class == Class][['R', 'G', 'B']].imply().values for Class in knowledge.loc[:,'C1':f'C{K}'].columns])
And that’s it. The algorithm is able to be skilled! So let’s set ok = 3 and retailer the outcomes into dictionaries.
ok = 3
segmented_1, segmented_2, distances_1, distances_2 = {}, {}, {}, {}
segmented_1['euclidean'], distances_1['euclidean'] = kmeans(df, ok, 'euclidean')
segmented_2['euclidean'], distances_2['euclidean'] = kmeans(df2, ok, 'euclidean')
segmented_1['manhattan'], distances_1['manhattan'] = kmeans(df, ok, 'manhattan')
segmented_2['manhattan'], distances_2['manhattan'] = kmeans(df2, ok, 'manhattan')
segmented_1['chebyshev'], distances_1['chebyshev'] = kmeans(df, ok, 'chebyshev')
segmented_2['chebyshev'], distances_2['chebyshev'] = kmeans(df2, ok, 'chebyshev')
I made a decision to match all the space metrics for this specific job as you possibly can see, and it’s evident that right here Manhattan distance was the quickest.
Earlier than visualizing the clusters, let’s convert the clusters names into int kind:
d = {'C1':0, 'C2': 1, 'C3':2}
for key in segmented_1.keys():
segmented_1[key].Class = segmented_1[key].Class.apply(lambda x: d[x])
segmented_2[key].Class = segmented_2[key].Class.apply(lambda x: d[x])
Time make the ultimate plots!
for key in segmented_1.keys():
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(10,10))
ax[0, 0].imshow(img)
ax[0, 1].imshow(segmented_1[key].Class.values.reshape(460,390))
ax[0, 0].set_title('MOD09GA RGB', fontsize=18)
ax[0, 1].set_title(f'kMeansn{key[0].higher()+key[1:]} Distance', fontsize=18)ax[1, 0].imshow(img2)
ax[1, 1].imshow(segmented_2[key].Class.values.reshape(460,390))
ax[1, 0].set_title('MOD09GA RGB', fontsize=18)
ax[1, 1].set_title(f'kMeansn{key[0].higher()+key[1:]} Distance', fontsize=18)
for i in vary(2):
for j in vary(2):
ax[i, j].set_facecolor('black')
ax[i, j].set_xticks([])
ax[i, j].set_yticks([])
plt.savefig(f'{key}.png')
plt.tight_layout()
plt.present()
It’s not onerous to see that Euclidean and Manhattan distance turned out be probably the most appropriate for this specific job. However to guarantee that it’s true, let’s consider the k-Means clustering outcomes utilizing the Silhouette Coefficient. This metric is ideal for coaching outcomes evaluation when there are not any labeled true values for the clustered factors.
To calculate it we are going to use sklearn operate [1].
- a — the imply distance between a pattern and all different factors in the identical class.
- b — the imply distance between a pattern and all different factors within the subsequent nearest cluster.
The vary of values of the Silhouette Coefficient is [-1,1]. And yep, it’s computationally costly, as that you must calculate distances between 1000’s of level a number of occasions, so be prepared to attend.
scores_1, scores_2 = {}, {}
for key in segmented_1.keys(): #secret is a metric for the space estimation
scores_1[key]=spherical(silhouette_score(segmented_1[key].loc[:, :'C3'], segmented_1[key].Class, metric=key),2)
scores_2[key]=spherical(silhouette_score(segmented_2[key].loc[:, :'C3'], segmented_2[key].Class, metric=key),2)
print(f'Distance: {key}t Img 1: {scores_1[key]}t Img 2: {scores_2[key]}')
Now you possibly can see that we proved it: Euclidean and Manhattan distances have equally good efficiency, so let’s estimate the water floor space loss, utilizing each of them.
for metric, Class in zip(['euclidean', 'manhattan'], [2,1]):
img1_water = np.count_nonzero(segmented_1[metric].Class.values == Class)*500*500*1e-6 #pixel measurement is 500, so the realm is 500*500 and to transform to km2 * 1e-6
img2_water = np.count_nonzero(segmented_2[metric].Class.values == Class)*500*500*1e-6print(f'Distance: {metric}tWater Space Earlier than: {spherical(img1_water)}kmu00b2tWater Space After: {spherical(img2_water)}kmu00b2tChange: -{100-round(img2_water/img1_water*100)}%')
— — — —
Distance: euclidean
Water Space Earlier than: 17125 km²
Water Space After: 1960 km²
Change: -89%
— — — — —
Distance: manhattan
Water Space Earlier than: 16244 km²
Water Space After: 2003 km²
Change: -88%
As you possibly can see, in keeping with our clustering outcomes, the change in water floor space is nearly 90% (!!!) water loss over final 23 years, which is actual proof of the truth that the Aral Sea shrinkage is a planetary tragedy…
===========================================
Reference:
===========================================
All my publications on Medium are free and open-access, that’s why I’d actually admire if you happen to adopted me right here!
P.s. I’m extraordinarily captivated with (Geo)Information Science, ML/AI and Local weather Change. So if you wish to work collectively on some challenge pls contact me in LinkedIn.
🛰️Comply with for extra🛰️