K-means clustering and its real use cases in the security domain

Jul 30, 2021

What is k-means clustering?

K-means clustering is an unsupervised learning algorithm used to solve clustering problems in machine learning and data science. Clustering is the task of dividing a population or set of data points into groups such that points in the same group are more similar to each other than to points in other groups. In simple words, the aim is to gather data points with similar traits into clusters.

Its main goal is to find groups in the data, with the number of groups represented by the variable k. It works iteratively to assign each data point to one of the k groups based on the features provided. In the reference image below, k=2, and two clusters are identified from the source dataset.

What is the k-means algorithm?

The k-means clustering algorithm computes centroids and iterates until they stop changing, which gives a locally optimal set of centroids. It assumes that the number of clusters is already known; this number is the 'K' in K-means. It is also called a flat clustering algorithm, because it produces a single partition of the data with no hierarchy between clusters.

It is an iterative algorithm that divides an unlabeled dataset into k clusters in such a way that each data point belongs to exactly one cluster, together with other points that have similar properties.
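One common way to make "similar properties" precise, assuming Euclidean distance between points, is the within-cluster sum of squares that k-means tries to reduce. The symbols below ($x_i$ for a data point, $C_j$ for the $j$-th cluster, $\mu_j$ for its centroid) are notation introduced here for illustration:

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller $J$ means the points sit closer to the centroid of the cluster they were assigned to.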

How does it work?

Step-1: Choose K, the number of clusters.

Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)

Step-3: Assign each data point to its closest centroid, which forms the K clusters.

Step-4: Recompute the centroid of each cluster as the mean of the points assigned to it, which minimizes the variance within that cluster.

Step-5: Repeat Step-3, that is, reassign each data point to the new closest centroid.

Step-6: If any reassignment occurred, go to Step-4; otherwise, go to FINISH.

Step-7: The model is ready.
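The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, picks the initial centroids from the data points themselves, and the function names (`kmeans`, `squared_distance`, `mean_point`) and the sample points are made up for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iters=100, seed=0):
    """Return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 2: pick K data points as the starting centroids (an assumption;
    # the centroids could also be arbitrary points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Steps 3 and 5: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return centroids, labels


# Made-up 2-D points forming two obvious groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(labels)
print(centroids)
```

Which cluster gets label 0 and which gets label 1 depends on the random starting centroids, so only the grouping itself is meaningful, not the label numbers.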

Use cases in the Security Domain

1. Identifying crime localities

When crime data is available for specific localities in a city, clustering on the category of crime, the area where it occurred, and the association between the two can give useful insight into crime-prone areas within the city.

2. Customer Segmentation

Clustering helps marketers improve their customer base, work on target areas, and segment customers based on purchase history, interests, or activity monitoring. For example, telecom providers can cluster pre-paid customers to identify patterns in the money spent on recharging, sending SMS, and browsing the internet. The resulting segments help the company target specific clusters of customers with specific campaigns.

3. Cyber-profiling criminals

Cyber profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea is derived from criminal profiling, which gives the investigation division information for classifying the types of criminals who were at the crime scene.

4. Rideshare data analysis

The publicly available Uber ride information dataset provides a large amount of valuable data on traffic, transit time, peak pickup localities, and more. Analyzing this data is useful not just in the context of Uber but also for gaining insight into urban traffic patterns and planning the cities of the future.

5. Crime document classification

Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a standard document grouping problem, and k-means is a suitable algorithm for it. Each document is first processed to represent it as a vector, using term frequency to identify commonly used terms that help characterize the document. The document vectors are then clustered to identify groups of similar documents.
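As a rough sketch of the preprocessing step described above, the snippet below turns a few documents into term-frequency vectors using only the Python standard library. The function names (`tokenize`, `term_frequency_vectors`) and the sample report snippets are invented for illustration, and the whitespace tokenizer is a simplifying assumption.

```python
from collections import Counter


def tokenize(text):
    """Lowercase a document and keep its purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


def term_frequency_vectors(documents):
    """Return (vocabulary, vectors) with one count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is every distinct term seen in any document.
    vocabulary = sorted(set().union(*counts))
    # Each vector holds the count of every vocabulary term in one document.
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


# Made-up report snippets, used only for illustration.
docs = [
    "burglary reported at warehouse",
    "warehouse burglary suspect arrested",
    "phishing email targets bank customers",
]
vocabulary, vectors = term_frequency_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Vectors of this kind are what a k-means routine takes as input. In practice, raw counts are often replaced by a weighting such as TF-IDF so that very common words carry less weight.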

6. Call detail record analysis

A call detail record (CDR) is the information captured by telecom companies during a customer's calls, SMS, and internet activity. This information provides greater insight into the customer's needs when used together with customer demographics.

These were a few use cases, but the list goes on. Whether in the security domain or any other, K-means is an effective and easy way to do clustering in machine learning.