Cluster analysis: principles and practice

1. Introduction

The objective of the work described in the report on cluster analysis [link] is to classify individuals in a population on the basis of their exposure to risk factors. The reasons for doing so are to investigate how these risk factors are distributed within the population and to identify those key characteristics which identify groups of individuals. This is analogous to the activity of taxonomists who seek to achieve a classification of organisms which groups them according to some property of interest, such as evolutionary or genetic relationships and, likewise, provides a basis for the ready identification of the groups obtained. Traditional taxonomy tended to use a small number of characteristics thought to be of special importance, such as the floral anatomy of plants, and its practice relied heavily upon experience and judgement. Some practitioners regarded this as unsatisfactory because of the subjectivity of approach, the wide range of characteristics of practical importance which were ignored and the absence of any rational basis for determining which of a number of rival classifications was the best. Such sceptical taxonomists sought an alternative approach which used a much wider range of characteristics and which minimised the weight given to those attributes which were traditionally thought to be of special importance. Cluster analysis was one of the numerical techniques applied in the pursuit of this new taxonomy. It is important to appreciate, however, that while the means of manipulating the data is important, the value of the results obtained depends principally upon the quality of the data to which they are applied and that this, in turn, depends upon the set of characteristics upon which the classification is to be founded. The discussion which follows will thus begin with a consideration of how these characteristics should be selected and then progress to a consideration of how their values are manipulated. It is based upon Sneath and Sokal (1973), a work which, despite its age, remains one of the most widely cited methodological texts on the subject.

2. Selection of characteristics

While I have just explained that greater objectivity was the principal objective of those who sought to break from traditional methods of classification, it will soon become apparent that, as an absolute lack of subjectivity is, in practice, unattainable, the practical goal must be a degree of objectivity. The difficulty is no less in the present case. Assume, as is in fact the case, that the number of characteristics measured in the NHANES data set is greater than the number which we can deal with in practice. How do we select the subset of characteristics to be used? We could simply select an appropriate number at random. As will be seen elsewhere, however, the number of useable measures of smoking behaviour, which is the characteristic of particular interest to us, in the NHANES data set is small, so that if we made a random selection there is a finite possibility that this subset of characteristics would not include smoking. This would be of no use to us. We could, of course, repeat the random selection until we obtained a subset which did contain smoking. We might then find, however, that another interesting attribute, such as alcohol consumption, was excluded, so we would have to try again. As accepting or rejecting the results of randomisation is clearly a subjective process, we have therefore lost the objectivity we sought. We thus have to establish some means of selecting the characters for analysis which guards against excessive subjectivity while at the same time allowing the inclusion of those of interest and the exclusion of those thought, for one reason or another, to be unnecessary for the project.

For the cluster analysis which I performed I used a dataset prepared by others. I thus had no control over which characteristics were available but, as their number was large and included many of potential interest, my choice was not unduly restricted. My position did not, therefore, differ greatly from that of those who collect as well as analyse the data. Such workers are faced with a large number of possibilities and have to choose those characteristics which they set out to measure.

In selecting characteristics for analysis I established a set of criteria for exclusion and inclusion but it was immediately evident that these represent ideals and that their application requires a degree of judgement. It thus appears that the best procedure is to support the selection or rejection of characters with an explicit discussion of the reasons for doing so. This is done in appendix B [link] wherein the characteristics used and the reasons for their inclusion in the analysis are set out. As far as possible, I have also provided justification for the exclusion of those attributes which were not included.

It may be noted that concerns about the selection of characteristics can be addressed to some extent by using sensitivity analysis. This is the rather grandiose term for repeating an exercise several times with major or minor differences and observing the extent to which the changes have affected the results. This does not necessarily distinguish better from worse but it does give some indication of the sensitivity of the technique to the characteristics used.

The following sections of this appendix present the criteria which I have used and the ease or otherwise with which they could be applied.

2.2 Relevance

It is at least arguable that an objective selection of characteristics for analysis should not take their relevance to the task in hand into account. Any judgement on relevance must be to some degree subjective and based on the best knowledge available at the time. Moreover, as an influence on the development of disease by unknown risk factors is a perpetual concern in epidemiology, the inclusion of attributes which are not known to affect risk may increase slightly the likelihood of including such risk factors in the analysis. In addition, the use of characters without any known relevance may, through an increase in the total number used, tend to reduce any bias or weighting in the choice of those which were thought to be relevant although possibly at the cost of some loss of sensitivity.

We are, however, spared such problems in the present case as the great majority of characteristics measured in NHANES represent factors thought to have some relevance to public health. The criterion of relevance has, therefore, some support from sources external to the project. For example, the NHANES database contains estimates of the intake of 42 micronutrients. The amount of information in the literature relevant to the influence of these on health varies from one to another and provides some guidance as to which to include and which to exclude. On the other hand, the database also includes information on the gross intake of macronutrients such as carbohydrate, protein and lipids. While one or more of these might seem to be useful additions to a subset of characteristics concerned with diet, the literature concerning these attributes is sparse. In such instances it is tempting to include such characteristics with the justification that they add to a background of less relevant characteristics. This is certainly not objective, but is it justifiable? In such circumstances, assistance with the decision may be sought from the other criteria discussed here.

2.3 Number of characteristics

There is no absolute rule as to the number of characteristics which should be used in analysis and the total number used is more likely to be determined by practical considerations. Within this constraint, a greater number is preferable to a smaller number for the general reasons that a greater number will reduce the likelihood of undue influence resulting from any single character and increase the comprehensiveness of the exercise. On the other hand, a larger number of characteristics increases the likelihood of weighting the results of clustering by including in the process several attributes related to some broad property such as socioeconomic status. In practice, therefore, no specific number was established as a target and the size of the set finally used was a function of the selection process as a whole.

2.4 Subsets of characteristics

The characteristics included in the NHANES database are not independent of each other but fall into various categories such as dietary intake, exercise or socioeconomic status. The number of these categories is relatively small and if only one characteristic from each were used then the total number in the analysis would be smaller than might be desired. It is therefore preferable that, subject to the restrictions discussed in section 2.5 below, several attributes from each category be used. This has the advantage that different measures of one category will, in general, represent it better and reduce the possibility of bias due to the use of a single characteristic which was unsuitable in some way.

This approach is not as simple as it seems. Should all categories of information be represented by similar numbers of attributes or should some, by virtue of some estimate of greater importance, contribute more characteristics? In addition, the categories of information are not clearly circumscribed. Body mass index may be related to both diet and exercise while socioeconomic characteristics such as employment status will be related to demographic attributes such as age and gender. Moreover, some categories, such as smoking behaviour, present only a small number of characteristics which are suitable for use.

It is apparent, once more, that judgement, tempered by practical concerns, is required. It is possible, for example, to argue that the category of diet is wider than smoking in its implications for disease and that the number of characteristics required to provide adequate coverage of all of its manifestations may therefore be greater. Convincing as such arguments may be, however, they remain general in that they fail to quantify the discussion.

2.5 Correlation of characters

One of the important steps towards avoiding bias due to giving undue weight to particular characteristics is to avoid the inclusion of attributes which are strongly correlated with each other. As, however, in this as in most other applications of cluster analysis, the object of the exercise is to seek differential distribution of characteristics in the population, this rule cannot be applied rigidly as most of the characteristics showing such behaviour will be correlated to some degree.

Some potential problems can be detected and avoided simply. If two or more characteristics are simply different expressions of the same thing, only one should be used. Thus it would be inappropriate to include both age and date of birth of subjects. Likewise, if one potential character, such as body mass index, is a function of some others, such as height and weight, then they should not all be used. On the other hand, to continue the example just given, using two of the three is permissible. Ideally, the two basic characteristics, height and weight, should be used but if it were decided that body mass index and height were each important characteristics in themselves, then both could be employed, given that the correlation between them was not close to complete.

The foregoing implies some knowledge of the properties of the characteristics whose use is being considered. This may not always be adequate and it is necessary in the course of the selection of characteristics to determine their mutual correlations. This should detect any unexpected relationships and provide information on relationships which might be suspected but whose magnitude is unknown. For example, in the part of the NHANES database concerned with dietary intake, daily caloric intake is strongly correlated with daily intake of carbohydrate and total fat while the intake of polyunsaturated fat is strongly correlated with total fat but not carbohydrate. Such information is helpful in deciding which of these characteristics are suitable for inclusion. Such judgements do, of course, imply some judgement about the level of correlation which is deemed excessive. Some guidance in setting the effective threshold for exclusion can, however, be obtained by inspection of the complete correlation matrix of potential characteristics and using the value found for pairs of attributes previously deemed to be too closely related for use as guidance.
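The screening of a correlation matrix described above can be sketched as follows. This is a minimal illustration in Python rather than the procedure actually used: the function names, the threshold of 0.9 and the dietary values are all invented for the example and are not taken from NHANES.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlated_pairs(columns, threshold):
    """Return (name_a, name_b, r) for every pair whose |r| reaches the threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged

# Invented values for illustration only; these are not NHANES data.
diet = {
    "calories": [1800, 2200, 2500, 3000, 2700],
    "total_fat": [60, 80, 95, 120, 100],
    "fibre": [30, 12, 25, 15, 28],
}
# The threshold of 0.9 is an arbitrary assumption for the example.
flagged = correlated_pairs(diet, threshold=0.9)
```

In this invented example only the pair of calories and total fat is flagged, which mirrors the kind of relationship described in the text. In practice the threshold would be set by inspecting the values found for pairs already judged to be too closely related.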

2.6 Nature of the data

The form of the data available in the NHANES database varies from characteristic to characteristic. Some, such as estimates of dietary intake, are continuous quantitative variables for which any value between the extremes recorded could occur. Some, such as the number of alcoholic drinks per day, are discrete quantitative variables, for which only integer values are possible. Other characteristics are represented by categorical data in which an attribute can fall into several states, each of which is represented by a number. In some cases, such as gender, the number serves only to distinguish the states of the character, while in others, such as level of education, the sequence of numbers gives an indication of the intensity of the factor. In others, this is not the case. For example, a variable describing the amount of time spent watching television is coded in six states, with 1 to 5 representing increasing periods of time and 6 representing 'none'.

These differences, if recognised, do not present difficulties in the preparation of data for cluster analysis although they do render the subsequent description of clusters more difficult. Thus, while the mean value for all subjects in a cluster provides an adequate description of the cluster in the case of quantitative variables, this is not necessarily the case for categorical data which require more elaborate comparisons.

Of equal significance to cluster analysis is the magnitude of the numbers representing a variable. Thus, gender is represented simply by 1 or 2, representing men or women respectively, while estimated daily intake of carotene gives numbers in the range 0 to 16,532. The calculation of differences between subjects, which is an essential part of cluster analysis, is sensitive to the range of data and would thus tend to achieve wider differences on the basis of carotene intake than on gender. This problem can be averted by standardising the data for all characteristics so that they fall into the same range. Thus, if:

standardised value = (value – minimum) / (maximum – minimum)

then all values will fall between 0 and 1.
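The standardisation formula above can be written as a short Python function. This is only a sketch: the function name is hypothetical, the handling of a constant variable is my own assumption, and apart from the limits 0 and 16,532 quoted in the text the carotene values are invented.

```python
def standardise(values):
    """Rescale values to the range 0 to 1 using (value - minimum) / (maximum - minimum)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant variable carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

gender = [1, 2, 2, 1]                 # coded 1 or 2, as in the text
carotene = [0, 250, 1000, 16532]      # intermediate values invented for illustration

gender_std = standardise(gender)      # [0.0, 1.0, 1.0, 0.0]
carotene_std = standardise(carotene)  # first value 0.0, last value 1.0
```

After standardisation a difference of one state in gender and a difference across the whole range of carotene intake both contribute at most 1 to any comparison between subjects.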

It might also be necessary to adjust the data in cases where the distribution of values between subjects is highly irregular. Carotene intake again provides a suitable example in that, within the range of values given in the preceding paragraph, 75% of all values were below 1,000. Thus, the data in its original form would give weight to the few individuals with exceptionally high values and underestimate the importance of differences in the typical part of the range. One means of correcting for such extreme distributions is to apply a mathematical transformation which effectively compresses one end of the range and expands the other. A logarithmic transformation is frequently applied in such cases but was inappropriate here in that values of zero were reported for some subjects and the logarithm of zero is undefined. For at least some of the characteristics in question, values of zero were clearly legitimate, while for others they were at least possible. This being the case, a square root transformation was applied. The need for transformation was determined using the arbitrary criteria that the modal category in a frequency distribution was either the lowest or the highest and that transformation led to an increase in the correlation between the value of a variable and the rank order of the subject holding it.
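The second criterion, that transformation should increase the correlation between a value and its rank, can be checked along the following lines. This is a hedged sketch and not the original programme: the function names are hypothetical, tied values are ranked in arbitrary order (an assumption), and the test values are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ordinal_ranks(values):
    """Rank 1 for the smallest value; ties are broken by position (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def sqrt_improves_rank_correlation(values):
    """True if a square root transformation brings values closer to their rank order."""
    ranks = ordinal_ranks(values)
    before = pearson(values, ranks)
    after = pearson([math.sqrt(v) for v in values], ranks)
    return after > before

# Invented values: one strongly skewed variable and one evenly spread variable.
skewed = [0, 10, 40, 90, 160, 10000]
even = [1, 2, 3, 4, 5]
```

For the skewed example the transformation raises the correlation with rank, so it would be applied; for the evenly spread example it lowers it, so the data would be left as they are. Zero values cause no difficulty, which is the reason given in the text for preferring the square root to the logarithm.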

An alternative to a mathematical transformation would be to sort the population by values of the variable in question and code each individual by the percentile of the range into which they fell. While this might have been effective in the case of continuous variables such as carotene, it would be problematic with others requiring transformation, such as alcohol consumption, which were represented as discrete quantitative variables. In such cases, coding by percentile could not be applied without either arbitrary adjustment of the categories or the possibility of individuals with the same value falling into different categories.

In preliminary work I standardised data to fall into the range 0 to 1 and applied a square root transformation to data whose distribution was markedly skewed. Exploratory work with sensitivity analysis suggested, however, that this treatment was not sufficient to avoid an influence of the nature of the data. It appeared that those characteristics which were represented by a few values, such as gender which presented two states or smoking status with four states, had a greater influence on the results of cluster analysis than did characteristics based on continuous data which, in principle, could carry different values for each individual. Because of the difficulty in separating such artifactual effects from the true influence of the differential distribution of characteristics, it is not possible to provide rigorous proof of this conjecture. Nevertheless, it seems reasonable to propose that characteristics existing in a few states are more likely to provide exact matches between subjects than those with many possible values and that this will lead to their having a greater influence on clustering.

In the work described here, therefore, I applied a modified coding in which I partitioned the range of values for each characteristic into two parts, that deemed to carry risk and that considered to be innocuous. This immediately introduced the need to apply some threshold which would discriminate between the two parts of the range. I decided to follow an epidemiological convention of dividing the range into successive subsets containing equal numbers of subjects and choosing one of the two extreme subsets to represent the level bearing risk. I chose to use the top or bottom quintile of the range, depending upon whether high or low values of a characteristic would be considered risky. While this was straightforward for continuous data, characteristics represented by discrete data required arbitrary decisions as to where the threshold would fall and I always selected the value which came closest to placing 20% of the subjects in the risky range. Similar judgements were required with categorical variables.
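The choice of threshold for discrete data can be illustrated with a small Python sketch. It assumes that the risky range includes the threshold value itself; the function names and the drinking data are invented for the example.

```python
def risk_threshold(values, target=0.20, high_is_risky=True):
    """Return the cut-off placing the share of subjects closest to `target` in the risky range."""
    n = len(values)
    best_cut, best_gap = None, None
    for cut in sorted(set(values)):
        if high_is_risky:
            share = sum(v >= cut for v in values) / n
        else:
            share = sum(v <= cut for v in values) / n
        gap = abs(share - target)
        if best_gap is None or gap < best_gap:
            best_cut, best_gap = cut, gap
    return best_cut

def code_risk(values, cut, high_is_risky=True):
    """Recode each value as 1 (risky) or 0 (innocuous) relative to the cut-off."""
    if high_is_risky:
        return [1 if v >= cut else 0 for v in values]
    return [1 if v <= cut else 0 for v in values]

# Invented numbers of drinks per day for ten subjects.
drinks = [0, 0, 0, 1, 1, 1, 2, 2, 3, 5]
cut = risk_threshold(drinks)      # 3: exactly two of ten subjects drink 3 or more
coded = code_risk(drinks, cut)
```

With discrete data the share falling in the risky range will rarely be exactly 20%, which is why the nearest attainable value is taken.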

It is important to appreciate that the procedure used involves an implicit transformation of the data into a binomial distribution and that this may affect the validity of subsequent mathematical operations which are sensitive to, or dependent upon, the form of distribution of data.

2.7 Missing data

NHANES, in common with most large data sets, is incomplete. Data for some characteristics are missing for some subjects.

Some of the missing data are identified as such, with subjects' responses coded as 'refused' or 'don't know'. In other cases, large sections of data are missing. While much of the information was obtained in the subjects' homes, some was collected when they visited Mobile Examination Centres. It was apparent that 436 of the 4880 subjects aged 20 or over did not make such a visit. Moreover, in the cases of 192 who did make the visit, the data on dietary recall were classified by the interviewer as inadequate.

Other appreciable losses of data were observed in the case of family income, with 266 subjects refusing to provide information or claiming ignorance of it. In the case of alcohol consumption, 188 subjects were coded 0 when only 1 or 2 were valid codes.

In addition to these large quantities of missing data there are blanks in the record scattered throughout the data set. There are also some uncertain cases where a value of 0 is presented for quantitative attributes in which this seems unlikely.

Missing data are a nuisance not only because they can lead to computational difficulties but because they reduce the value of the set as a whole. There are a number of ways of dealing with such cases. Missing data can be coded as such. Doing so, however, makes missing data a value of an attribute and means that it may act as a cryptic differentiating characteristic in cluster analysis. Alternatively, the mean value for all subjects providing valid data may be substituted for the missing data. This, however, reduces the value of the characteristic and requires appropriate adjustment if statistical analysis is required. Another strategy is simply to exclude subjects with missing data from the analysis. The practicality of doing so depends on how many subjects are available, how many require such censoring and how many valid subjects are considered to be necessary. As the present study is a preliminary investigation and the number of subjects remaining after removal of those with missing data, 3746, was considered adequate for the purpose, this was the procedure used here. The principal difficulty associated with this drastic approach to missing data is that subjects who failed to provide data may differ in some important respect from those who were fully compliant. The population assembled for NHANES was not, however, considered to be representative of any wider population in the first place so that, given this knowledge, the possibility of a further loss of representativeness was considered acceptable.
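The exclusion of subjects with missing data amounts to keeping only complete cases. A minimal Python sketch follows; the record layout, field names and the set of missing codes are assumptions made for illustration and do not reproduce the NHANES coding.

```python
# Hypothetical markers for missing responses; real NHANES codes differ by variable.
MISSING_CODES = {None, "refused", "don't know"}

def complete_cases(records, missing_codes=MISSING_CODES):
    """Keep only subjects with a valid value for every characteristic."""
    return [
        record for record in records
        if all(value not in missing_codes for value in record.values())
    ]

# Invented records for four subjects.
records = [
    {"id": 1, "income": 3, "alcohol": 1},
    {"id": 2, "income": "refused", "alcohol": 2},
    {"id": 3, "income": 5, "alcohol": None},
    {"id": 4, "income": 2, "alcohol": 2},
]
kept = complete_cases(records)
kept_ids = [record["id"] for record in kept]   # subjects 1 and 4 remain
```

The cost of this approach is visible even in so small an example: half of the subjects are lost because each lacks a single value.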

The effects of censoring for missing data were investigated and are shown in table A.1. This displays for each of the characteristics used in the analysis summary values for the original and censored populations. These values are either the means of the population or the frequency in each category, depending upon the nature of the data. The last column in the table presents the ratio of the original and censored data. Differences in the ratio above an arbitrary threshold of ±0.05 are highlighted. Those highlighted in blue arose in cases where the information was based upon a small proportion of the population and may simply reflect random variation. A green highlight is applied in cases where it is possible that restricted mobility, through age or infirmity, reduced the number of subjects who were able to attend the Mobile Examination Centre at which some data were collected. Some of the remaining differences appear to be related to socioeconomic factors and may suggest that people in the lower categories were less likely to attend the Mobile Examination Centre and those in upper categories more willing to do so. This may be the case in HOU2 and INCO and possibly RACE. The only case in which the remaining differences highlighted are not sporadic is with the category SOFA, erected to reflect a tendency towards a sedentary lifestyle. It is possible that the considerable increase in the censored population in the proportion who watched little television indicates that those concerned were more likely to do other things, such as visiting the Mobile Examination Centre, with compensatory decreases in the remaining categories. This is, of course, purely speculative as an explanation.
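The comparison in table A.1 can be sketched as below. The direction of the ratio (censored divided by original) is my assumption, as the text does not specify it; the function name and the summary values are invented, although the variable name SOFA is taken from the text.

```python
def flag_shifts(original, censored, tolerance=0.05):
    """Return the ratio censored/original for each summary value departing from 1 by more than `tolerance`."""
    flagged = {}
    for name, before in original.items():
        ratio = censored[name] / before   # assumes the original summary value is not zero
        if abs(ratio - 1.0) > tolerance:
            flagged[name] = ratio
    return flagged

# Invented summary values (means or category frequencies) for illustration only.
original = {"age_mean": 50.0, "bmi_mean": 27.0, "SOFA_low": 0.10}
censored = {"age_mean": 46.0, "bmi_mean": 27.2, "SOFA_low": 0.13}

shifts = flag_shifts(original, censored)   # flags age_mean and SOFA_low
```

A ratio-based flag of this kind is sensitive to small denominators, which is consistent with the remark in the text that some highlighted differences rest on a small proportion of the population and may reflect only random variation.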

In conclusion, differences between the censored and uncensored populations can be detected. Some of these may be explicable in terms of the ability or willingness of subjects to visit the Mobile Examination Centre. With the possible exception of the variable SOFA, there is little evidence of large or systematic differences attributable to excluding subjects with incomplete data from the analysis.

2.8 Quality of the data

The data used in analysis should be accurate in terms of collection, measurement and recording. The likelihood that this is the case will vary from characteristic to characteristic. It is unlikely, for example, that many errors will arise in recording gender while the accuracy of direct measurements, such as height, will depend largely upon the method used together with some variability associated with factors such as the time of day at which the measurement was made. On the other hand, data based upon information supplied by the subjects will tend to be more variable. Some inaccuracy will arise simply from misunderstanding of the question asked or uncertainty over the answer. Such variability will tend to be neutral in its overall effect and simply reduce the discriminatory power of the characteristic. In other cases, however, subjects may provide misleading data deliberately as a result of preconceptions of the interviewer's expectations or feeling that the correct answer would place them in a poor light. It is widely believed, for example, that people tend to provide underestimates of alcohol consumption. Unless, however, there is reason to believe that this bias interacts with another variable, it represents a constant error rather than a factor likely to bias the analysis. While it seems quite possible that misreporting of alcohol consumption could vary with age, gender, level of education, smoking status or many other attributes, I am not aware of evidence to confirm or quantify this.

The information provided for some attributes will be inherently less accurate than that available for others. The NHANES data set contains various estimates of dietary behaviour. Some of this was obtained from answers to direct questions such as 'do you ever eat chicken?'. On the other hand, estimates of daily intake of specific nutrients were obtained indirectly. Subjects were asked for a detailed record of all the food eaten during the previous day and this was analysed, on the basis of standard assumptions concerning the nutritional composition of food items, to provide estimates of likely intake. This will be less accurate than the more direct assessment. The information provided is, however, of considerable interest.

Overall, the potential accuracy of data may be of assistance in the selection of variables but it is unlikely to be decisive. If a particular characteristic is thought important, it is likely to be used, although a knowledge of potential inaccuracy will be helpful in interpretation. If, however, several measures are available, the quality of the data may help in choosing which of these to use.

3. Clustering

Sneath and Sokal (1973) provided, in their chapter 5.4, a taxonomy of clustering methods which classified and discussed some of the wide range of methods available at the time of their writing. Others have arisen since then. Notable amongst the latter are techniques used by those attempting to classify organisms on the basis of genetic characteristics derived from molecular analysis. These are, however, designed specifically for such purposes and work according to rules determined by their principal users. This renders them unsuitable for the present purpose and, arguably, also for the purpose for which they are intended.

The method chosen here was selected because it bears no burden of preconceptions of the structure of the population or the nature of the characteristics used to describe it, because it is relatively simple and because I have had some previous experience with it. It falls into the category described by Sneath and Sokal (1973) as sequential, agglomerative, hierarchical and non-overlapping. Sequential means simply that the process works through the population in the order in which its members happen to be presented in the investigation. The alternative is simultaneous clustering which seeks global properties in the distance matrix. Agglomerative means that the process works by adding members to groups which have already been identified. This begins once at least one pair of subjects who resemble each other more than they resemble any others has been found. The description hierarchical follows closely from this in that a structure is produced in which clusters formed earlier have a subordinate relationship to those formed by their fusion. Non-overlapping means that at any given stage in the process, a subject can be a member of only one group. The clusters formed are thus mutually exclusive in their composition.

The cluster analysis described here was performed using a series of programmes written in Microsoft Visual Basic 6.0.

3.1 Comparison of subjects

The first stage in clustering is to calculate some measure of the similarity between every possible pair of subjects based upon their values, transformed and standardised as appropriate, in the data set. Sneath and Sokal (1973) describe a variety of means of estimating similarity, or its reciprocal, distance. On the basis of its simplicity and freedom from prior assumptions about the nature of the relationships between subjects, I chose Euclidian distance as a suitable measure for the present exercise. As the name suggests, this is a geometrical measure. The distance between two subjects J and K for two characteristics x1 and x2 can be envisaged in the following diagram:

[Figure: the distance between subjects J and K plotted for two characteristics, x1 and x2]

The distance between J and K is expressed as:

$$D_{J,K}^2 = (x_{1J} - x_{1K})^2 + (x_{2J} - x_{2K})^2$$

The Euclidian distance for n characteristics is then:

$$D_{J,K} = \left[\sum_{i=1}^{n} (x_{iJ} - x_{iK})^2\right]^{1/2}$$

Sneath and Sokal point out that the magnitude of this measure increases with the number of characteristics and that a more suitable expression is the average distance:

$$d_{J,K} = \sqrt{D_{J,K}^2 / n}$$

This is the measure of distance used in the present paper.
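The average distance and the array of distances between all pairs of subjects can be computed as in the following Python sketch. The original work used programmes written in Visual Basic; this is an independent illustration with hypothetical function names, and it assumes each subject is a list of standardised values of equal length.

```python
import math

def average_distance(subject_j, subject_k):
    """Average Euclidian distance: the square root of the mean squared difference over n characteristics."""
    n = len(subject_j)
    squared = sum((a - b) ** 2 for a, b in zip(subject_j, subject_k))
    return math.sqrt(squared / n)

def distance_matrix(subjects):
    """Symmetric matrix of average distances between every pair of subjects."""
    size = len(subjects)
    d = [[0.0] * size for _ in range(size)]
    for j in range(size):
        for k in range(j + 1, size):
            d[j][k] = d[k][j] = average_distance(subjects[j], subjects[k])
    return d

# Invented subjects described by two standardised characteristics.
subjects = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
d = distance_matrix(subjects)
```

Dividing by n before taking the square root keeps the measure within the range 0 to 1 for standardised data however many characteristics are used, which makes thresholds comparable between analyses using different numbers of characteristics.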

3.2 Identification and consolidation of the most similar subjects

Having calculated an array of the distances between all pairs of subjects, the next step in the clustering process is to scan this array to find those pairs with the smallest mutual distance and which are hence most similar. This is done sequentially starting with a search for that subject which most resembles the first subject in the array. The pair of subjects thus identified is the genesis of the first cluster and, in the method used, must be regarded as a new entity whose distance from all other subjects must be calculated. The way in which the clustering process develops depends critically upon the method used to do so. Methods known as single linkage, using the terminology of Sneath and Sokal (1973), base the calculation on the smallest distance between any member of a newly formed group and the subject or group with which it is being compared. Alternatively, complete linkage uses the largest distance. Derived from this is a flexible method which allows either of these or any position between them to be used. This is the method used here and the formula for recalculating distances is:

$$d_{(J,K),L} = \alpha d_{J,L} + \alpha d_{K,L} + \beta d_{J,K}$$

where J and K are the components of the group which has just been formed, L is the subject or group for which the new distance must be calculated and 2α + β = 1. The value of β determines the extent of linkage, the effect of which can be shown diagrammatically, the diagrams being extracts from figure 5.5 of Sneath and Sokal (1973).

[Figure: diagrams showing the effect of β on linkage, extracted from figure 5.5 of Sneath and Sokal (1973)]

It is evident that when β = -1.00, equivalent to complete linkage, differences between groups are maximised so that low level differences in grouping, equivalent to small clusters which are distinct but closely related, are lost. On the other hand, with single linkage, attained as β approaches +1.00, a chained relationship pertains in which successive groups tend to be formed by adding single individuals to early groups. With β = -0.25 a clearer pattern of clustering emerges, with low level clusters being identifiable and with a residue of disparate individuals added only in the last stages of the process. Previous experience with this method suggested that β = -0.25 tended to give satisfactory results and this was adopted in the present exercise. As the results again appeared satisfactory and there was no need for change, this value was used in all the work described here.
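The recalculation of distances under the flexible method reduces to a single expression once α is derived from β through the constraint 2α + β = 1. The Python function below is a sketch with a hypothetical name; the distances in the example are invented.

```python
def flexible_update(d_jl, d_kl, d_jk, beta=-0.25):
    """Distance from a newly fused group (J,K) to another subject or group L."""
    alpha = (1.0 - beta) / 2.0   # follows from the constraint 2*alpha + beta = 1
    return alpha * d_jl + alpha * d_kl + beta * d_jk

# Invented distances: L is 2.0 from J and 4.0 from K; J and K fused at 1.0.
with_default_beta = flexible_update(2.0, 4.0, 1.0)          # 3.5 with beta = -0.25
with_zero_beta = flexible_update(2.0, 4.0, 1.0, beta=0.0)   # 3.0, the simple mean
```

With β = 0 the new distance is simply the mean of the two old ones. A negative β raises α above one half, so the new group is placed further from L than that mean, which is the sense in which negative values tend to sharpen the separation between groups.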

Once the subject most resembling the first in the sequence has been identified and the distance of this pair from all other subjects has been calculated, the process is repeated with the subject next in the series. The process then returns to the beginning of the sequence and is repeated as many times as are required to bring all subjects together into a single cluster.

The programme which conducts this operation records for each clustering step the order in the process in which it occurred, the groups which fused and the distance at which this occurred. This represents the progressive assembly of clusters and the ultimate aggregation of all subjects into the final single cluster.
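The record of clustering steps can be illustrated with a short agglomerative loop in Python. This is not the sequential programme described above: for simplicity it fuses the globally closest pair at each step, a standard agglomerative scheme substituted here as an assumption. The function names, the labelling of new groups and the example data are likewise invented.

```python
def pair(a, b):
    """Key for an unordered pair of group labels."""
    return (a, b) if a < b else (b, a)

def agglomerate(dist, beta=-0.25):
    """Fuse the closest pair repeatedly; return a list of (step, group_a, group_b, distance)."""
    alpha = (1.0 - beta) / 2.0   # from 2*alpha + beta = 1
    n = len(dist)
    d = {(j, k): dist[j][k] for j in range(n) for k in range(j + 1, n)}
    active = set(range(n))       # labels of the groups that currently exist
    merges = []
    while len(active) > 1:
        (j, k), d_jk = min(d.items(), key=lambda item: item[1])
        new = n + len(merges)    # fused groups are labelled n, n + 1, ...
        for l in active - {j, k}:
            d[pair(l, new)] = (alpha * d[pair(j, l)]
                               + alpha * d[pair(k, l)]
                               + beta * d_jk)
        # Discard distances involving the two groups which have just fused.
        d = {p: v for p, v in d.items() if j not in p and k not in p}
        active = (active - {j, k}) | {new}
        merges.append((len(merges) + 1, j, k, d_jk))
    return merges

# Four invented subjects lying on a line at 0, 1, 5 and 6.
points = [0.0, 1.0, 5.0, 6.0]
dist = [[abs(a - b) for b in points] for a in points]
merges = agglomerate(dist)
```

In this example subjects 0 and 1 fuse first, then 2 and 3, and the two resulting groups fuse last at a recalculated distance of 7.25. Each tuple holds the same three pieces of information that the programme described above records: the order of the step, the groups which fused and the distance at which they did so.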

The next programme receives this information and uses it to identify clusters and the subjects comprising them. Reference to figure 1 in the main text or to the diagrams above shows that the number of clusters resolved decreases from a large number in the early stages to the final single group. This occurs in parallel to the increase in distance between groups as the process proceeds. The programme which assembles the clusters thus requests the distance threshold for which the number of groups is required. It also requests a threshold number for the recognition of a cluster. This was incorporated to allow clearer visualisation of the early stages of clustering at which stage a large number of very small clusters will exist. It soon became apparent in the work reported here, however, that useful information is available at much higher levels of clustering so that this difficulty has not arisen. All the results reported here have been obtained with the minimum size of a cluster set at the default value of four. The programme works by inspecting the output of its predecessor in reverse order to follow the clustering process branch by branch and identify the point at which each individual became involved.
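The resolution of clusters at a chosen distance threshold, with a minimum size for recognition, can be sketched as follows. Unlike the programme described above, which reads the record in reverse order, this illustration works forwards through it. It assumes that fusion distances never decrease from one step to the next and that fused groups are labelled consecutively after the original subjects; both conventions, and all names, are hypothetical.

```python
def clusters_at(merges, n_subjects, threshold, min_size=4):
    """Return (clusters, residual) for the groups existing at the given distance threshold."""
    members = {i: [i] for i in range(n_subjects)}
    current = set(range(n_subjects))   # groups present at the threshold
    label = n_subjects                 # label given to the next fused group
    for _, a, b, distance in merges:
        members[label] = members[a] + members[b]
        if distance <= threshold:
            current -= {a, b}
            current.add(label)
        label += 1
    groups = [sorted(members[g]) for g in current]
    clusters = sorted(g for g in groups if len(g) >= min_size)
    # Subjects in groups below the size threshold form the residual group.
    residual = sorted(s for g in groups if len(g) < min_size for s in g)
    return clusters, residual

# Invented record of three fusions among four subjects: (step, group_a, group_b, distance).
merges = [(1, 0, 1, 1.0), (2, 2, 3, 1.0), (3, 4, 5, 7.25)]

pairs, leftover = clusters_at(merges, 4, threshold=2.0, min_size=2)
strict, strict_leftover = clusters_at(merges, 4, threshold=2.0)
```

With a threshold of 2.0 and a minimum size of two, the example yields two clusters of two subjects each. With the minimum size of four used in the work reported here, the same threshold yields no clusters and all four subjects fall into the residual group.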

The output file from the previous programme is read by the final programme in the series which summarises the results of the process at the distance threshold selected previously. It produces three files. The first is a summary of results showing the number and size of the clusters obtained. The second lists the composition of these clusters, identifying subjects by their serial number in the original data file. Both of these files also describe the residual group of subjects who, at the distance threshold in question, have not been included in the clusters obtained. These comprise single individuals together with small clusters below the size threshold set in the previous programme. To produce the final output file, the programme interrogates the data set used for clustering and the complete set of unstandardised data for the variables from which the subset used in clustering was derived. The file contains the mean values of each of these variables for each cluster together with the numbers of individuals in each smoking category and the percentage of all the subjects in the cluster and in the whole data set which these totals represent.

The dendrogrammes showing the hierarchy of the clustering were prepared manually.