Customer Segmentation

In today’s digital age, understanding and catering to the needs of each customer is crucial to a business’s success. Techniques such as Kmeans, K-Prototype, and the integrated LLM + Kmeans approach can change how businesses view their customer bases. This article takes a deep dive into customer segmentation with these methods, focusing primarily on the synergy between LLM and Kmeans.

Understanding Customer Segmentation: The Basics

Before delving into the advanced techniques, it’s essential to answer a fundamental question: What is customer segmentation? In simple terms, customer segmentation is the process of dividing a company’s customer base into distinct groups based on shared characteristics. These could be demographics, buying behavior, interests, or any other relevant data points.


├─ data
│  └─ data.rar
├─ img
├─ embedding.ipynb
├─ kmeans.ipynb
├─ kprototypes.ipynb
└─ requirements.txt

The basic Python sketch below illustrates customer segmentation with Kmeans and with the combined LLM + Kmeans technique. It is a high-level overview: a full implementation would require suitable preprocessing and real-world data, and the LLM step is approximated here by a Gaussian mixture model.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture  # Assuming LLM is approximated by Gaussian Mixture Model
from sklearn.preprocessing import StandardScaler

# Sample dataset
data = pd.read_csv("customer_data.csv")

# Pre-processing: let's assume the data has age and income columns to be used
features = data[["age", "income"]]
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Using Kmeans for clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # Example with 3 clusters
data["kmeans_cluster"] = kmeans.fit_predict(scaled_features)

# LLM + Kmeans approach
# First, get latent profiles using a Gaussian Mixture Model as an approximation for LLM
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(scaled_features)

# Use each customer's profile membership probabilities as features.
# Hard profile labels are arbitrary category codes, so Kmeans would only relabel them.
profile_probs = gmm.predict_proba(scaled_features)

# Now run Kmeans on the profiles from LLM
kmeans_on_llm = KMeans(n_clusters=3, n_init=10, random_state=42)
data["llm_kmeans_cluster"] = kmeans_on_llm.fit_predict(profile_probs)

# Results: number of customers per cluster for each approach
print(data["kmeans_cluster"].value_counts())
print(data["llm_kmeans_cluster"].value_counts())


Why is Customer Segmentation Important?

Personalised Marketing: Tailored marketing campaigns can be designed for each segment, increasing engagement and conversion rates.

Resource Optimization: Companies can allocate resources more efficiently based on the potential of each segment.

Better Product Development: Feedback from specific segments can inform product improvements or the development of new products.

Method 1: Kmeans Clustering

What is Kmeans?

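Kmeans is a popular clustering algorithm that partitions a dataset into k clusters by assigning each point to its nearest cluster centroid. It minimizes the within-cluster sum of squared distances to the centroids. It does not explicitly maximize the distance between clusters, although compact clusters usually end up well separated.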
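
To make the quantity that Kmeans minimizes concrete, here is a minimal sketch on synthetic data (the three groups and their positions are invented for illustration). It recomputes the within-cluster sum of squared distances by hand and compares it with the inertia_ attribute that scikit-learn reports after fitting.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic two-feature data with three loose groups (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Recompute the objective by hand: squared distance from each point
# to the centroid of the cluster it was assigned to, summed over all points.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
wcss = float(((X - assigned_centroids) ** 2).sum())

# The two numbers should agree up to floating-point error.
print(wcss, kmeans.inertia_)
```

A lower value means tighter clusters, but it always shrinks as more clusters are added, so it cannot be used on its own to pick the number of clusters.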

How to Set Up Kmeans for Customer Segmentation?
  • Data Collection: Gather all necessary data about your customers.
  • Feature Selection: Decide on the relevant features for segmentation like age, income, buying frequency, etc.
  • Normalization: Standardize the features so that those measured on larger scales (such as income) do not dominate the distance calculation.
  • Choosing the Number of Clusters: Techniques like the Elbow Method can be used to determine the optimal number of clusters.
  • Model Implementation: Run the Kmeans algorithm using a tool such as Python’s scikit-learn.
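
The steps above can be sketched as follows. This is an illustration under stated assumptions rather than a full pipeline: the customers are synthetic, the age and income columns are hypothetical, and the range of k values tried is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic customers: columns are age and income (hypothetical features).
rng = np.random.default_rng(42)
groups = [(25, 30000), (45, 60000), (65, 40000)]
X = np.vstack([
    np.column_stack([rng.normal(age, 3, 100), rng.normal(income, 3000, 100)])
    for age, income in groups
])

# Normalization: put age and income on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

# Elbow Method: fit Kmeans for several k and record the inertia of each fit.
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The "elbow" is the value of k after which the inertia stops dropping sharply. With data generated from three groups, as here, the curve would be expected to flatten after k = 3; on real customer data the bend is often less clear and calls for judgment.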

Method 2: K-Prototype

What is K-Prototype?

K-Prototype is an extension of Kmeans designed specifically for datasets that contain both categorical and numerical attributes.

How is K-Prototype Different from Kmeans?

While Kmeans is best for numerical data, K-Prototype efficiently handles mixed data types by combining Kmeans and K-Modes.
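
The way the two ideas are combined can be shown with the dissimilarity measure K-Prototype uses: squared Euclidean distance on the numerical attributes plus a weight (usually called gamma) times the number of categorical attributes that differ. The sketch below is a hand-written illustration of that measure, not a library call; the function name, the sample values, and the choice of gamma are assumptions.

```python
import numpy as np

def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """K-Prototype style dissimilarity between two records.

    Numerical part: squared Euclidean distance (as in Kmeans).
    Categorical part: count of mismatching attributes (as in K-Modes),
    weighted by gamma.
    """
    diff = np.asarray(num_a, dtype=float) - np.asarray(num_b, dtype=float)
    numeric_part = float(np.sum(diff ** 2))
    categorical_part = sum(a != b for a, b in zip(cat_a, cat_b))
    return numeric_part + gamma * categorical_part

# Two hypothetical customers: scaled (age, income) plus (marital status, area).
d = mixed_dissimilarity(
    [0.5, -1.0], ["married", "urban"],
    [0.0, -1.0], ["single", "urban"],
    gamma=0.5,
)
print(d)  # 0.25 (numeric) + 0.5 * 1 mismatch = 0.75
```

A larger gamma gives categorical mismatches more influence relative to numerical distance, which is why the numerical attributes are usually scaled before clustering.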

Setting Up K-Prototype for Segmentation:

  • Data Preparation: As with Kmeans, gather all relevant customer data.
  • Mixed Data Handling: Recognize the numerical and categorical attributes.
  • Optimal Cluster Determination: Techniques like Silhouette analysis can be helpful.
  • Algorithm Execution: Implement K-Prototype with a library that supports it, such as the kmodes package for Python.

Method 3: Integrating LLM with Kmeans (LLM + Kmeans)

Why Combine LLM with Kmeans?

Latent profile analysis (abbreviated LLM throughout this article, although LPA is the more common abbreviation) is a statistical method used to identify latent profiles, or hidden groups, within a dataset. Combining LLM with Kmeans brings the strength of profile analysis into the clustering algorithm, offering a more nuanced customer segmentation.

Steps to Implement LLM + Kmeans:

Step 1. LLM Implementation: First, run a latent profile analysis to determine hidden profiles in the data.

Step 2. Kmeans on LLM Results: Use the profiles derived from LLM as input features for the Kmeans clustering.

Step 3. Interpretation: Understand the combined results to get a comprehensive picture of customer segments.
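
For the interpretation step, one common starting point is to summarize each cluster by its size and average feature values. The sketch below assumes a table that already carries a cluster label in a column named llm_kmeans_cluster; the six customers and their labels are made up purely for illustration.

```python
import pandas as pd

# Hypothetical, already-clustered customers (values and labels are invented).
data = pd.DataFrame({
    "age": [23, 25, 47, 52, 68, 71],
    "income": [28000, 32000, 61000, 59000, 39000, 41000],
    "llm_kmeans_cluster": [0, 0, 1, 1, 2, 2],
})

# One row per cluster: how many customers it holds and what they look like on average.
profile = data.groupby("llm_kmeans_cluster").agg(
    customers=("age", "size"),
    mean_age=("age", "mean"),
    mean_income=("income", "mean"),
)
print(profile)
```

Reading the table row by row gives a first description of each segment, for example a younger, lower-income group versus a middle-aged, higher-income group, which can then be checked against business knowledge.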

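Before clustering, it can help to flag outliers and compare the results with and without them. The snippet below does this with the ECOD detector from the PyOD library:

from pyod.models.ecod import ECOD

# ECOD expects numeric input; `data` is assumed to contain only numeric columns
clf = ECOD()
clf.fit(data)  # the detector must be fitted before predict() is called
outliers = clf.predict(data)  # 1 = outlier, 0 = inlier
data["outliers"] = outliers

# Data without outliers
data_no_outliers = data[data["outliers"] == 0]
data_no_outliers = data_no_outliers.drop(["outliers"], axis=1)

# Data with outliers
data_with_outliers = data.copy()
data_with_outliers = data_with_outliers.drop(["outliers"], axis=1)

print(data_no_outliers.shape)    # -> (40690, 19)
print(data_with_outliers.shape)  # -> (45211, 19)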

Which Method to Choose?

The choice between Kmeans, K-Prototype, and LLM + Kmeans largely depends on the nature of your dataset and the specific requirements of your business. If:

  • Your data is primarily numerical, Kmeans could suffice.
  • You have a mix of numerical and categorical data, K-Prototype is an excellent choice.
  • You wish for a deeper understanding and integration of hidden profiles with clustering, LLM + Kmeans is the way to go.
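
The first two rules depend only on the column types, so they can be checked mechanically. The helper below is a hypothetical sketch of that check (the function name and the sample columns are assumptions); whether LLM + Kmeans is worth the extra step remains a judgment call that column types cannot settle.

```python
import pandas as pd

def suggest_method(df: pd.DataFrame) -> str:
    """Suggest Kmeans or K-Prototype from the mix of column types."""
    numeric_cols = df.select_dtypes(include="number").columns
    categorical_cols = df.columns.difference(numeric_cols)
    if len(numeric_cols) == 0:
        # Purely categorical data is not covered by the rules above.
        raise ValueError("no numerical columns found")
    if len(categorical_cols) == 0:
        return "Kmeans"
    return "K-Prototype"

numeric_only = pd.DataFrame({"age": [31, 45], "income": [42000.0, 58000.0]})
mixed = numeric_only.assign(marital_status=["single", "married"])

print(suggest_method(numeric_only))  # Kmeans
print(suggest_method(mixed))         # K-Prototype
```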


Customer segmentation has evolved, and businesses need to stay updated with the latest techniques to remain competitive. Whether it’s Kmeans, K-Prototype, or the integrated LLM + Kmeans approach, the primary goal remains the same: to understand customers better and serve them more efficiently. Dive deep into these methods, experiment with your datasets, and unlock the immense potential that advanced customer segmentation holds.

