The CLUSTER Procedure

The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on 50 iris specimens from each of three species: Iris setosa, I. versicolor, and I. virginica. Mezzich and Solomon (1980) discuss a variety of cluster analyses of the iris data.

The following step displays the iris SAS data set, which is available in the Sashelp library:

title 'Cluster Analysis of Fisher (1936) Iris Data';
proc print data=sashelp.iris;
run;

The results of this step are not shown.

This example analyzes the iris data by using Ward’s method and two-stage density linkage and then illustrates how the FASTCLUS procedure can be used in combination with PROC CLUSTER to analyze large data sets.

The following macro, SHOW, is used in the subsequent analyses to display cluster results. It invokes the FREQ procedure to crosstabulate clusters and species. It then invokes the CANDISC procedure to compute canonical variables for discriminating among the clusters, and plots the first two canonical variables to show cluster membership. Because none of the procedure calls that read the clustering results specifies a DATA= option, the macro operates on the most recently created data set. See Chapter 29: The CANDISC Procedure, for a canonical discriminant analysis of the iris species.

/*--- Define macro show ---*/
%macro show;
   proc freq;
      tables cluster*species / nopercent norow nocol plot=none;
   run;

   proc candisc noprint out=can;
      class cluster;
      var petal: sepal:;
   run;

   proc sgplot data=can;
      scatter y=can2 x=can1 / group=cluster;
   run;
%mend;
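Because SHOW relies on the most recently created data set, it must be called immediately after the step that writes the cluster assignments. The following is a minimal sketch of a variant that takes the input data set as a parameter instead. The macro name SHOW_DS and its DATA= parameter are hypothetical and are not part of the original example; the procedure statements are otherwise unchanged.

```sas
/*--- Hypothetical variant of SHOW with an explicit input data set ---*/
%macro show_ds(data=out);
   /* Crosstabulate cluster membership against species */
   proc freq data=&data;
      tables cluster*species / nopercent norow nocol plot=none;
   run;

   /* Compute canonical variables that discriminate among the clusters */
   proc candisc data=&data noprint out=can;
      class cluster;
      var petal: sepal:;
   run;

   /* Plot the first two canonical variables, grouped by cluster */
   proc sgplot data=can;
      scatter y=can2 x=can1 / group=cluster;
   run;
%mend show_ds;

/* Assumed usage, once a data set named OUT with CLUSTER and SPECIES exists: */
/* %show_ds(data=out);                                                      */
```

Naming the data set explicitly costs one argument per call but lets other steps run between the clustering step and the display step.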

The first analysis clusters the iris data by using Ward’s method (see Output 31.3.1) and plots the CCC, pseudo F, and pseudo t² statistics (see Output 31.3.2). The CCC has a local peak at three clusters but a higher peak at five clusters. The pseudo F statistic indicates three clusters, while the pseudo t² statistic suggests three or six clusters.
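As a reminder of what the two pseudo statistics measure, their usual definitions can be sketched as follows. The notation is an assumption introduced here and does not come from this example: n is the number of observations, G the number of clusters at a given level of the hierarchy, T the total sum of squares, W_K the within-cluster sum of squares for cluster K, P_G the sum of the W_J over the G clusters, N_K the number of observations in cluster K, and B_KL = W_M - W_K - W_L the increase in within-cluster sum of squares when clusters K and L are merged into cluster M.

```latex
\[
\text{pseudo } F = \frac{(T - P_G)/(G - 1)}{P_G/(n - G)},
\qquad
\text{pseudo } t^2 = \frac{B_{KL}}{(W_K + W_L)/(N_K + N_L - 2)}
\]
```

Under these definitions, a relatively large pseudo F marks a candidate number of clusters, and a large pseudo t² at a merge indicates that the two clusters being joined are well separated.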

The TREE procedure creates an output data set containing the three-cluster partition for use by the SHOW macro. The FREQ procedure reveals 16 misclassifications. The results are shown in Output 31.3.3.

title2 'By Ward''s Method';
ods graphics on;

proc cluster data=sashelp.iris method=ward print=15 ccc pseudo;
   var petal: sepal:;
   copy species;
run;

proc tree noprint ncl=3 out=out;
   copy petal: sepal: species;
run;

%show;

Output 31.3.1: Cluster Analysis of Fisher’s Iris Data: PROC CLUSTER with METHOD=WARD

Cluster Analysis of Fisher (1936) Iris Data
By Ward's Method