Agglomerative Hierarchical Clustering

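How It Works

AGNES (Agglomerative Nesting) uses a bottom-up strategy: every object starts as its own cluster, and clusters are then merged one pair at a time according to a chosen criterion. Merging repeats until the number of clusters reaches the desired count.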

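Core Idea

1. Treat every sample as its own cluster.

2. Repeatedly merge the two closest clusters until a stopping condition is met. The distance between two clusters can be defined in several ways: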

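Minimum distance: the distance between the two closest samples, one from each cluster, is taken as the distance between the clusters.
$$d_{min}=\min_{x_i \in C_i,\,x_j \in C_j} d(x_i,x_j)$$

Maximum distance: the distance between the two farthest samples, one from each cluster, is taken as the distance between the clusters.
$$d_{max}=\max_{x_i \in C_i,\,x_j \in C_j} d(x_i,x_j)$$

Mean distance: the distance between the mean vectors of the two clusters is taken as the distance between the clusters.
$$d_{mean}=d(\overline{C_i},\overline{C_j}), \quad \text{where } \overline{C_i}=\frac{1}{\lvert C_i \rvert}\sum_{x_i \in C_i} x_i$$

Average distance: the average of all pairwise distances between samples of the two clusters is taken as the distance between the clusters.
$$d_{avg}=\frac{1}{\lvert C_i \rvert \lvert C_j \rvert}\sum_{x_i \in C_i}\sum_{x_j \in C_j} d(x_i,x_j)$$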
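
To make the four definitions concrete, here is a minimal sketch that evaluates each of them for two tiny clusters. The sample values and the variable names (`Ci`, `Cj`, `D`) are made up for illustration, and Euclidean distance is assumed for $d$.

```matlab
% Two small hypothetical clusters, one sample per column
Ci = [0 1; 0 0];        % samples (0,0) and (1,0)
Cj = [3 4; 0 3];        % samples (3,0) and (4,3)

% Pairwise Euclidean distances: D(a,b) is the distance from Ci(:,a) to Cj(:,b)
D = zeros(size(Ci,2), size(Cj,2));
for a = 1:size(Ci,2)
    for b = 1:size(Cj,2)
        D(a,b) = norm(Ci(:,a) - Cj(:,b));
    end
end

d_min  = min(D(:));                       % minimum distance
d_max  = max(D(:));                       % maximum distance
d_mean = norm(mean(Ci,2) - mean(Cj,2));   % distance between the cluster means
d_avg  = mean(D(:));                      % average of all pairwise distances

fprintf('d_min=%.4f d_max=%.4f d_mean=%.4f d_avg=%.4f\n', d_min, d_max, d_mean, d_avg);
```

Note that `d_mean` and `d_avg` generally differ: the first averages the samples and then measures one distance, the second measures every pair and then averages.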

Algorithm Flow

[Figure: AGNES algorithm flow]
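
As a rough illustration of the flow, the sketch below runs the merge loop on five 1-D points and prints each merge. It assumes the mean-distance criterion (distance between cluster centers); the data and the target of two clusters are hypothetical.

```matlab
% Five hypothetical 1-D samples (one per column) and a target of 2 clusters
x = [0 1 5 6 20];
k = 2;

n = numel(x);            % current number of clusters
y = 1:n;                 % every sample starts in its own cluster
centers = x;             % so every cluster center is the sample itself

while n > k
    % Squared distance between every pair of cluster centers
    D = bsxfun(@minus, centers', centers).^2;
    D(1:n+1:end) = inf;              % ignore the distance of a cluster to itself
    [~, idx] = min(D(:));            % closest pair (the first one on ties)
    [r, c] = ind2sub([n n], idx);
    a = min(r, c);
    b = max(r, c);
    fprintf('merge cluster %d into cluster %d\n', b, a);
    y(y == b) = a;                   % merge cluster b into cluster a
    y(y > b) = y(y > b) - 1;         % keep the labels contiguous
    n = n - 1;
    % Recompute the center of every remaining cluster
    centers = zeros(1, n);
    for i = 1:n
        centers(i) = mean(x(y == i));
    end
end
disp(y);
```

Each pass removes exactly one cluster, so reaching `k` clusters from `numel(x)` samples always takes `numel(x) - k` merges.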

Code Walkthrough

The dataset used in the code is described in the related documentation: Data Set.

AGNES_main.m

clear;clc;close all;
% fullfile builds the relative path portably
load(fullfile('..','cluster_gauss.mat'));
% Input matrix x: one feature per row, one sample per column
x=data;
% Shuffle the sample order
randIndex = randperm(size(x,2));
x=x(:,randIndex);
% Desired number of clusters
class_num=3;
% Number of samples
sample_num=size(x,2);
% Number of features
feat_num=size(x,1);
% Scale every feature to the range 0-1
x_scale=zeros(size(x));
for i=1:feat_num
    x_scale(i,:)=(x(i,:)-min(x(i,:)))/(max(x(i,:))-min(x(i,:)));
end
[y,class_center]=AGNES_classify(x_scale,sample_num,class_num);
% Map the cluster centers back to the original scale
for i=1:feat_num
    class_center(i,:)=(max(x(i,:))-min(x(i,:)))*class_center(i,:)+min(x(i,:));
end
% Plot the result when the data has exactly two features
if feat_num==2
    AGNES_display(x,y,class_center,sample_num,class_num);
else
    disp('The Feature Is Not Two-Dimensional');
end
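
The scaling and restoring steps in the script are inverses of each other. The following minimal sketch checks that round trip on made-up data. The names `x_min`, `x_rng` and `x_back` are illustrative, and the guard for constant features is an added assumption (the script above would divide by zero in that case).

```matlab
% Hypothetical data: 2 features (rows) x 4 samples (columns)
x = [1 2 3 5; 10 20 40 30];

x_min = min(x, [], 2);            % per-feature minimum (column vector)
x_rng = max(x, [], 2) - x_min;    % per-feature range
x_rng(x_rng == 0) = 1;            % assumption: map constant features to 0 instead of dividing by zero

% Scale every feature to [0,1], then undo the scaling
x_scale = bsxfun(@rdivide, bsxfun(@minus, x, x_min), x_rng);
x_back  = bsxfun(@plus, bsxfun(@times, x_scale, x_rng), x_min);

disp(x_scale);
disp(max(abs(x_back(:) - x(:))));   % reconstruction error, close to zero
```

Scaling first keeps a feature with a large numeric range from dominating the distance computation.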

AGNES_classify.m

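function [y,class_center]=AGNES_classify(x_scale,sample_num,class_num)
% Give every sample its own initial cluster label
y=1:sample_num;
% Current number of clusters
class_num_temp=sample_num;
% Initially every cluster center is the sample itself
class_center=x_scale;
% Number of features (rows of x_scale)
feat_num=size(x_scale,1);
% Merge until the requested number of clusters is reached
while class_num_temp>class_num
    % Initialize the matrix of distances between cluster centers
    center_distance=zeros(class_num_temp);
    for i=1:class_num_temp
        % Squared Euclidean distance between cluster centers (mean-distance criterion)
        center_distance(i,:)=sum((class_center-repmat(class_center(:,i),1,class_num_temp)).^2,1);
        % Exclude the distance of a cluster to itself
        center_distance(i,i)=inf;
    end
    % Find the closest pair; the first match in column-major order has col<row
    [row,col]=find(center_distance==min(min(center_distance)),1);
    % Merge the two clusters
    y(y==col)=row;
    % Renumber the labels so they stay contiguous from 1
    y(y>col)=y(y>col)-1;
    % One cluster fewer
    class_num_temp=class_num_temp-1;
    % Reset the cluster centers
    class_center=zeros(feat_num,class_num_temp);
    for i=1:class_num_temp
        % Center of cluster i is the mean of its samples
        class_center(:,i)=sum(x_scale(:,y==i),2)/sum(y==i);
    end
end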
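
AGNES_classify.m measures the distance between cluster centers, which corresponds to the mean-distance criterion. For comparison, here is a minimal sketch of the same loop using the minimum-distance criterion instead. It is an illustrative variant rather than part of the original code; the data and variable names are hypothetical, and it is written as a script so it runs on its own.

```matlab
% Hypothetical 1-D data, one sample per column, and a target of 2 clusters
x = [0 1 2 3 10 11];
k = 2;

n = size(x, 2);
y = 1:n;                 % every sample starts in its own cluster

% Squared Euclidean distances between all samples
D = zeros(n);
for i = 1:n
    D(i,:) = sum(bsxfun(@minus, x, x(:,i)).^2, 1);
end
D(1:n+1:end) = inf;      % ignore the distance of a cluster to itself

m = n;                   % current number of clusters
while m > k
    [~, idx] = min(D(:));            % closest pair of clusters
    [r, c] = ind2sub([m m], idx);
    a = min(r, c);
    b = max(r, c);
    % Minimum distance: the merged cluster is as close to any other cluster
    % as the closer of its two parts
    D(a,:) = min(D(a,:), D(b,:));
    D(:,a) = D(a,:)';
    D(a,a) = inf;
    D(b,:) = [];                     % drop the row and column of cluster b
    D(:,b) = [];
    y(y == b) = a;                   % merge cluster b into cluster a
    y(y > b) = y(y > b) - 1;         % keep the labels contiguous
    m = m - 1;
end
disp(y);
```

Because the minimum of squared distances picks the same pair as the minimum of the distances themselves, the square root can be skipped here.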

AGNES_display.m

function AGNES_display(x,y,class_center,sample_num,class_num)
% One random RGB color per cluster
color_bar=zeros(class_num,3);
hold on;
for i=1:class_num
    color_bar(i,:)=[rand(1),rand(1),rand(1)];
    % Draw the cluster centers, marked with *
    plot(class_center(1,i),class_center(2,i),'color',color_bar(i,:),'marker','*')
end
for i=1:sample_num
    % Draw the samples, marked with o, in the color of their cluster
    plot(x(1,i),x(2,i),'color',color_bar(y(i),:),'marker','o');
end
hold off;

Experimental Results

[Figure: AGNES clustering result]

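Performance Comparison

Advantages:
- Relatively insensitive to noisy data
- Simple and easy to understand
- Does not depend on a choice of initial values
- Faster when the target number of clusters is large, since fewer merges are needed

Disadvantages:
- A merge cannot be undone
- The number of clusters must be known before the algorithm is run
- Slower when the target number of clusters is small, since almost every sample requires a merge
- Suited only to datasets whose clusters are convex or roughly spherical
- For high-dimensional data, distance is a poor measure of similarity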