How to Do Cluster Analysis in Excel

Cody Schneider9 min read

Trying to find the natural groupings hidden in your customer data, survey results, or product inventory? That's precisely what cluster analysis is for, and you can get started right inside Excel. This article will walk you through, step-by-step, how to perform a hierarchical cluster analysis using Excel's built-in formulas and functions.

GraphedGraphed

Still Building Reports Manually?

Watch how growth teams are getting answers in seconds — not days.

Watch Graphed demo video

What Exactly is Cluster Analysis?

Cluster analysis is a technique used to group a set of objects so that objects in the same group (called a cluster) are more similar to each other than to those in other groups. Think of it as automatic sorting for your data.

It's an "unsupervised" method, meaning you don’t know what the groups are before you start. You provide the data, and the analysis reveals the underlying structure. Businesses use it for countless tasks, including:

  • Customer Segmentation: Grouping customers based on purchasing habits (e.g., high-value loyalists, recent one-time buyers, discount shoppers). This allows for much more targeted marketing.
  • Market Research: Segmenting survey respondents to identify different opinion groups or personas.
  • Product Bundling: Identifying products that are frequently purchased together to create attractive bundles or store layouts.
  • Anomaly Detection: Pinpointing data points that don't fit into any group, which can be useful for identifying fraud or errors.

The main goal is to maximize the similarity within a cluster while maximizing the dissimilarity between clusters.

Step 1: Get Your Data Ready for Analysis

The quality of your analysis lives and dies by the quality and preparation of your data. Before you calculate a single thing, you need to properly structure and clean your dataset.

Choose and Organize Your Variables

First, decide which attributes (variables) you will use to measure similarity. These must be numeric. For a customer segmentation analysis, your variables might be:

  • Recency: Days since the last purchase.
  • Frequency: Total number of purchases in a given period.
  • Monetary Value: Total or average amount spent.

Your data should be organized cleanly in a table: each row represents an object you're clustering (e.g., a customer), and each column represents one of your chosen numeric variables. Make sure there are no blank rows or columns in your main data block.

GraphedGraphed

Still Building Reports Manually?

Watch how growth teams are getting answers in seconds — not days.

Watch Graphed demo video

The Absolute Importance of Normalizing Your Data

This is the most critical and often-overlooked step in data preparation. Imagine you are clustering customers using their "Total Spent" (which could be in the hundreds or thousands) and "Number of Purchases" (likely single or double digits). The "Total Spent" variable's large numbers will completely dominate the analysis, and the "Number of Purchases" will barely have an impact.

Normalization scales all of your variables so they have a similar range, ensuring each one contributes fairly to the analysis. A common and effective method is standardization, which converts all values to a Z-score (the number of standard deviations a value is from the mean).

Here’s how to do it in Excel:

1. Calculate the Mean and Standard Deviation for each variable.

Below your data, pick two empty cells. In the first, calculate the average for your first variable column. For data in column B, the formula would be:

=AVERAGE(B2:B51)

In the cell below that, calculate the population standard deviation:

=STDEV.P(B2:B51)

(Use STDEV.P if your data represents the entire population you're analyzing, use STDEV.S if it's just a sample. For most internal data analysis, STDEV.P is appropriate).

Drag these formulas across to calculate them for your other variable columns.

2. Create new columns for your normalized data.

Beside your original data table, add new columns with headers like "Spent_norm" or "Freq_norm".

3. Apply the Standardization Formula.

The formula for a Z-score is (value - mean) / standard_deviation. In the first cell of your new normalized column, you’ll write this formula, referencing your original data and the mean/stdev you calculated. For a value in cell B2, with your mean in B53 and stdev in B54, the formula is:

=(B2-$B$53)/$B$54

The dollar signs $ are crucial - they create an absolute reference, meaning that when you drag the formula down and across, it will always refer to the same mean and standard deviation cells for that column but will update the B2 reference for each row. Apply this formula to your entire dataset.

Your normalized data table is now ready for analysis.

Step 2: Calculating Distances and Building Clusters

With our normalized data, we can now start grouping objects. We will use a method called Hierarchical Agglomerative Clustering (HAC). Let's break that down:

  • Hierarchical: We build a tree-like structure from the bottom up.
  • Agglomerative: We start with every data point in its own cluster and progressively merge (agglomerate) the closest pairs until only one cluster remains.

Calculating the 'Distance' Between Each Data Point

To merge the "closest" data points, we first need to calculate the distance between every possible pair. The most common way to do this is with Euclidean Distance, which is just the straight-line distance between two points.

Imagine you have two customers, each with three normalized variables (Recency, Frequency, Monetary). The formula for the distance between them would be:

Distance = SQRT( (R₂-R₁)² + (F₂-F₁)² + (M₂-M₁)² )

Let's say our normalized data is in columns F, G, and H. The Excel formula to calculate the distance between Customer 1 (in row 2) and Customer 2 (in row 3) is:

=SQRT((F3-F2)^2 + (G3-G2)^2 + (H3-H2)^2)

You will need to create a distance matrix, which is a grid that shows the calculated distance between every single pair of data points. This is the most labor-intensive part of the process in Excel. You create a grid with all your object IDs (e.g., Customer 1, Customer 2) down the first column and across the top row, then painstakingly fill in each cell with the distance formula for that pair of objects.

GraphedGraphed

Still Building Reports Manually?

Watch how growth teams are getting answers in seconds — not days.

Watch Graphed demo video

Building the Hierarchy: A Manual Step-by-Step Process

Once your distance matrix is complete, the clustering algorithm begins. It's a rinse-and-repeat process:

Iteration 1: Scan your entire distance matrix and find the smallest value. This identifies the two most similar objects. Let's say it's Customer 15 and Customer 28. You now merge them into a new cluster: {15, 28}. You can make a note of the distance at which they merged, we'll use this later.

Iteration 2: You now have a new problem. How do you measure the distance from every other object to your new cluster, {15, 28}? You need to update your distance matrix. There are several ways to do this, known as linkage methods:

  • Single Linkage (Nearest Neighbor): The distance is the minimum distance between any member of the cluster and the other object.
  • Complete Linkage (Farthest Neighbor): The distance is the maximum distance.
  • Average Linkage: The distance is the average of all distances from members of the cluster to the other object.

Average Linkage is often a good and stable choice. So, to calculate the distance from {15, 28} to Customer 3, you would calculate =AVERAGE(Distance(15,3), Distance(28,3)). You replace the old rows/columns for 15 and 28 with a single new row/column for your cluster and update the distances.

Repeating the Process: Now, you simply repeat! Scan your newly updated matrix for the smallest distance. Merge those two objects/clusters. Recalculate and update the matrix. You continue this process, step-by-step, until all your objects have been merged into one giant cluster.

GraphedGraphed

Still Building Reports Manually?

Watch how growth teams are getting answers in seconds — not days.

Watch Graphed demo video

Step 3: Deciding How Many Clusters to Have

The hierarchical process doesn't tell you the "right" number of clusters, it gives you the full-blown tree of all possible merges. It's up to you, the analyst, to decide where to "cut" that tree to define your final groups.

One way to guide this decision is by looking at the distance at which each merge occurred. You'll notice that in the early stages, the merge distances are small (you're merging very similar things). As you go on, the distance between the clusters being merged gets larger and larger. You are often looking for an "elbow" point - a spot where there is a sudden and significant jump in the merge distance. This jump suggests that you have started merging two clusters that are actually quite distinct from each other, and you might want to stop the process right before that merge.

Ultimately, the final number of clusters should also make business sense. Can you develop a distinct strategy for 3 customer groups? For 5? Are the groups you've identified actionable and meaningful? The technical analysis should be a guide, but your business context is the final judge.

Is There an Easier Way?

As you've seen, performing a proper hierarchical cluster analysis in Excel is entirely possible but extremely manual, especially the construction of the distance matrix and the iterative merging process. A single mistake with a formula can throw off the entire analysis. The process gets exponentially more tedious as you add more data points.

This is why specialized tools and add-ins exist. For Excel users who frequently need this type of analysis, add-ins like XLSTAT can automate the entire workflow in a few clicks. Another popular method, K-Means Clustering, is also too complex to run manually in Excel without deep knowledge of the Solver add-in or VBA macros.

The manual exercise is valuable because it forces you to understand exactly what’s happening "under the hood" of the algorithm. But for routine, reliable analysis, an automated solution is a must.

Final Thoughts

This walkthrough breaks down how to bring a powerful data science technique into a familiar tool like Excel. By carefully cleaning, organizing, and normalizing your data, you can use built-in formulas to calculate distances and methodically group similar objects, revealing hidden structures in your data that can lead to smarter business decisions.

The complexity of this manual process in Excel highlights why so many marketing and sales teams get stuck in their data. It just takes too long to go from a question to an insight. With Graphed, we get rid of this friction entirely. Instead of wrestling with distance matrices, you just connect your marketing and sales platforms and ask, "Show me my customer segments based on their purchasing behavior." Our AI handles the entire analysis instantly, turning hours of tedious spreadsheet work into an interactive dashboard you can use to make decisions in seconds.

Related Articles