IMPORTANT NOTE FOR THE USER: This page contains a number of tables and figures which are discussed in the text. So that readers can study these in conjunction with the text, links to them open as separate pages so that they can be kept open until the reader chooses to close them. Links to other documents in this page open within it and readers can return to the original page using the BACK button.
1. INTRODUCTION
A risk factor for disease is some property of an individual which appears, by virtue of the work of observational epidemiologists, to influence the likelihood that a person will develop a particular disease. Some risk factors are inherent in an individual as a result of their genetic constitution while others are related to lifestyle or environment. Some of these external factors may be associated, with varying degrees of confidence, with exposure to specific chemicals while others appear to be related to the biochemical or physiological consequences of the presence of a risk factor.
Observational epidemiology suggests that the incidence in populations of most chronic diseases is affected by several risk factors but it is seldom clear whether a given case of disease is the result of the predominant influence of one of these risk factors or is a function of the combined action of several of them.
In "Multifactorial epidemiology in the population" I note that risk factors may not be distributed independently between members of a population but tend to be assorted according to the behavioural and social characteristics of the individuals which comprise it. There is ample evidence in the literature to support the view that a person exposed to one particular risk factor is more, or less, likely to be exposed to a second but the extension of such observations to larger numbers of risk factors is less accessible to observation. In this paper I attempt to clarify the matter by applying cluster analysis to the occurrence of risk factors in a real population. This approach does not require hypotheses or data on how risk factors interact in their prevalence but seeks simply to classify people on the basis of the risk factors to which they are exposed and thus to determine whether or not risk factors tend to aggregate or disperse within the population.
The underlying principle of this work is, therefore, the classification of a population on the basis of the presence or absence of particular risk factors in each of its members. The results obtained can be expressed in a hierarchical manner as exemplified in figure 1.
The entire population can be divided into a few high level units which differ considerably between each other but have internal cohesion in terms of some of the characteristics of the individuals comprising them. Each of these can be partitioned into smaller units whose internal resemblance increases progressively and whose similarity to other units can be traced through the hierarchy. The units can be described in terms of the properties, in this case the status of risk factors, of their members.
The methods used in this work are based upon those developed by taxonomists who wished to achieve more objective classifications of plants and animals than were obtainable from traditional practice which relied heavily upon judgement, experience and opinion. A brief account of the use of numerical methods in classification is provided in the methodological appendix A.
It is important to appreciate that the results of any attempt to classify biological entities, whether achieved by cluster analysis or some other means, can seldom, if ever, be stated definitively to be absolutely right or absolutely wrong. The absence of any clear grouping of individuals, or results which are obviously at odds with the real world, will suggest that the technique used to make the classification was not applied correctly or that the units of the study were, for all practical purposes, identical. When, however, an interpretable classification is obtained, there is no formal method, other than simple statistical criteria, for determining whether it is better or worse than some other arrangement of individuals which could be obtained by changing the information used in the process. This being the case, the investigator is faced with the philosophical problem that any judgment between two or more possible classifications may be influenced by which of them gives a result which he or she prefers.
The approach used here to minimise such subjectivity is to follow an experimental plan established a priori and to avoid deviations from this which represent attempts to improve any classification obtained by modifications in detail. The difficulty here is that third parties, such as the readers of this work, have to accept the author's word that this was indeed the case. I must therefore assure readers that the results presented emerged from the plan which I am about to describe and that they are not a favourable selection from numerous and varied experiments.
The hypothesis underlying this work is that risk factors which can be identified in a population are not distributed at random between members of the population but tend to cluster such that some combinations of risk factors will be more common than would be predicted from their overall prevalence in the population. Likewise, some other combinations may be less common. More specifically, smoking behaviour may be related to other forms of behaviour or environment which, from epidemiological evidence, are associated with an increased risk of disease.
As will be described, the work was conducted using a collection of data from a population assembled, for purposes broadly similar to my own, by third parties and made available to the public. I thus had to work with a set of risk factors assembled by others and choose from within these those best suited to my purposes. My criteria for this selection are described in appendix B. I had to accept that, as I was undertaking this exercise without any prior knowledge of the application of cluster analysis to such a population, a certain amount of exploratory work to determine the best methods for processing the data before analysis would be required. This is described below and was done after the set of factors for use had been established so that these investigations did not affect the selection of factors used in the study. Having done so, the first cluster analysis was conducted on a data set using one measure of smoking behaviour amongst the characteristics examined. This characteristic was whether or not a subject smoked cigarettes every day. Given, however, that this does not distinguish between former smokers, never smokers and a fourth measure included in the survey, occasional smokers, this would not provide a complete description of smoking behaviour. The next phase was therefore to repeat the analysis with all four, mutually exclusive, measures of smoking behaviour. As will be explained below, this will weight the clustering obtained towards contrasts based on smoking behaviour. Given that this was acknowledged a priori and that my principal objective was to examine the distribution of risk factors according to smoking behaviour, such a weighting was considered acceptable.
Finally, to obtain some idea of the dependence of the results obtained upon the choice of risk factors for inclusion in the study, a sensitivity analysis was conducted by the systematic replacement of characteristics within two subsets of factors by a factor appropriate to each which had not hitherto been used.
I should note, to assist readers in following this paper, that all figures and tables are accessible by links to documents which remain open until they are closed by the reader. This not only simplifies the appearance of the paper itself but allows the reader to study the figures and tables in parallel with the text.
2. METHODS
2.1 Basic technique
The data used in the analyses described was that assembled during 1999 and 2000 as part of the National Health and Nutrition Examination Survey (NHANES) conducted in the United States of America by the National Center for Health Statistics. It was a stratified probability sample of the civilian, non-institutionalised population of the United States over-sampled to obtain larger numbers of people with low incomes, people in the age groups 12-19 and 60+, African Americans and Mexican Americans. Appendix C reproduces some of the general description of the NHANES studies provided by the U.S. National Center for Health Statistics.
Information derived from questionnaires on various aspects of lifestyle and from physical and physiological examination was provided for 9,965 anonymous subjects. Data on smoking behaviour had been sought only from those aged 20 or above which reduced the number of subjects relevant to the present analysis to 4,480 while the exclusion of subjects for whom data on the characteristics studied here was incomplete led to a final total of 3,895 suitable for inclusion. The potential for bias arising from the exclusion of subjects for whom full data were not available is discussed and quantified in appendix A (section 2.7) where it is concluded that censoring of subjects with missing data was unlikely to have had a major effect on the distribution of characteristics.
Cluster analysis was conducted as follows. The first step was the calculation of a measure of the similarity, in terms of the characteristics used, between each subject and every other subject. In practice, the reciprocal of similarity, or distance, was estimated as mean Euclidean distance (see Appendix A, section 3.1). This provided a matrix of 3,895 x 3,895 values. This matrix was inspected to identify the pair of subjects which was most similar. This pair was taken as the basis for the first cluster and a new matrix of 3,894 x 3,894 values was prepared by calculating the distance between the combined values of each characteristic in this pair and those for all other subjects using an accepted formula (see Appendix A, section 3.2). This process was repeated until all subjects were fused into a single cluster representing the entire population.
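The fusion procedure just described can be illustrated with a minimal sketch in Python. It is not the implementation used in this work: the function names and the four toy subjects are hypothetical, and the rule for measuring the distance from a fused cluster to the remaining subjects (here the average of all pairwise distances) is an assumption standing in for the formula given in Appendix A, section 3.2.

```python
import math

def mean_euclidean(a, b):
    """Square root of the mean squared difference over all characteristics."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def agglomerate(subjects):
    """Fuse the closest pair of clusters repeatedly until one cluster remains.

    Returns the fusion history as (members_a, members_b, distance) tuples.
    ASSUMPTION: distance between clusters is the average pairwise distance.
    """
    clusters = {i: [i] for i in range(len(subjects))}

    def cluster_distance(members_a, members_b):
        total = sum(mean_euclidean(subjects[i], subjects[j])
                    for i in members_a for j in members_b)
        return total / (len(members_a) * len(members_b))

    history = []
    next_id = len(subjects)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        # Inspect every pair of current clusters and keep the closest.
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return history

# Hypothetical subjects, each described by three binary characteristics.
toy_subjects = [[0, 0, 0], [0, 0, 1], [1, 1, 1], [1, 1, 0]]
toy_history = agglomerate(toy_subjects)
```

This naive version recomputes every distance at every step and is suitable only for very small examples; a population of several thousand subjects requires an optimised routine.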
A hierarchical view of the series of clusters which comprise the population can be obtained by using information on each fusion between pairs of subjects or clusters to determine the number and composition of the clusters existing at increasing values of the distances between them. Distance values are particular to the population in question and have no absolute significance. Accordingly, the thresholds of distance used in this analysis were chosen arbitrarily as those which partitioned the population into three, six, 12, 24 and 36 clusters or as close to these numbers as could be obtained. The methods used to compare the composition of clusters are described below.
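As an illustration of how a record of fusions can be turned into a partition of a chosen size, the following sketch replays fusions in order of increasing distance until the target number of clusters remains. The function name, the input format (pairs of subject indices with the distance at which their clusters fused) and the example values are all hypothetical, and tied distances, which may prevent an exact target from being reached, are ignored.

```python
def clusters_at(n_subjects, fusions, target):
    """Replay fusions (closest first) until `target` clusters remain.

    `fusions` is a list of (subject_a, subject_b, distance) tuples in which
    each subject index stands for the cluster it currently belongs to.
    Returns the clusters and the distance of the last fusion applied.
    """
    parent = list(range(n_subjects))

    def find(i):
        # Follow parent links to the representative of i's cluster.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    remaining = n_subjects
    threshold = 0.0
    for a, b, distance in fusions:
        if remaining <= target:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
            remaining -= 1
            threshold = distance

    groups = {}
    for i in range(n_subjects):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values()), threshold

# Hypothetical fusion record for six subjects.
toy_fusions = [(0, 1, 0.20), (2, 3, 0.30), (4, 5, 0.35), (0, 2, 0.60), (0, 4, 0.90)]
three_clusters, cut = clusters_at(6, toy_fusions, 3)
```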
2.2 Selection of characteristics
The selection of the characteristics, or risk factors, is the most important and most critical stage in cluster analysis. Why this is so is explained in Appendix A and how it is done is described in Appendix B.
This process led to the choice of 19 characteristics, grouped, as shown in the second column of the table below, into broad categories. The set used was:
name | category | description
AGEY | demographic | age in years
GEND | demographic | gender
BMIX | diet | body mass index
CARO | diet | daily intake of carotene
CHOL | diet | daily intake of cholesterol
FIBE | diet | daily intake of dietary fibre
KCAL | diet | total daily caloric intake
ACTI | exercise | physical activity during day
MODE | exercise | moderate physical exercise
SOFA | exercise | time spent watching television
ALCO | lifestyle | average number of alcoholic drinks per day
CAFF | lifestyle | daily intake of caffeine
PREP | lifestyle | use of food prepared outside the home
SDAI | lifestyle | smokes daily
EDUC | socioeconomic | level of education
HOU2 | socioeconomic | number of rooms in residence
INCO | socioeconomic | family income
IND2 | socioeconomic | industry of employment
JOB2 | socioeconomic | occupation
It is important to note that smoking status was represented as a single binary variable, SDAI, which recorded whether or not the subject was a daily smoker. In the analysis which introduced a deliberate weighting of smoking status, this was supplemented by three further characteristics, SOCC, SFOR and SNEV, which recorded whether or not the subject was an occasional, former or never smoker respectively.
For the sensitivity analysis, intended to indicate the effect of individual characteristics on the results obtained, the dietary and socioeconomic variables were replaced in turn by a further variable representing these classes. Dietary variables were replaced by VITC, the daily intake of vitamin C. Socioeconomic variables were replaced by RACE, based upon the ethnic origin of the subject. Note that while this was classified by NHANES as a demographic variable, I regard its effect as a risk factor as acting primarily through consequences on socioeconomic status.
Details of the coding of characteristics and other information about them are given in table 1.
A fundamental objective of the work described here is to determine whether or not relationships exist between the distributions of smoking behaviour and of other risk factors for disease within the population. This being so, it is important to consider the status of the characteristics used as risk factors for disease. Relationships derived from observational epidemiology are summarised in table 2. Note that this table does not contain relevant references to the literature as this would add considerably to its size. It may be summarised as showing that the characteristics used here are generally regarded as risk factors for disease with the following three exceptions. Firstly, although high intakes of caffeine may be related to cancer at certain sites and, perhaps, to cardiovascular disease, it is not often discussed as a risk factor of wide ranging significance here. In the present study, as variable CAFF, it is regarded as an indicator of a lifestyle rather than as a potentially toxic influence. Secondly, the characteristic PREP, which refers to the frequency of consumption of food cooked outside the home, has not been studied frequently and, although it might have relationships to dietary quality, it too is used as an indicator of lifestyle. Finally, gross daily caloric intake, KCAL, is not often reported as a risk factor. A case, supported by the results of animal studies can be made, however, for high food intake representing a form of malnutrition which carries a risk for disease. It may be noted (table 3, below) that, in the population studied, it does not bear a close relationship to body mass index.
As discussed in Appendix A, it is important that the characteristics used in cluster analysis are not highly correlated with each other. A relationship sufficiently strong to mean that one characteristic was effectively a substitute for another would lead to an excessive influence of both in cluster analysis. On the other hand, as the object of the exercise is to examine relationships between individuals on the basis of their characteristics, some degree of association between them must be expected. To determine what relationships existed, correlation coefficients were calculated between each pair of characteristics used in the main analyses. These are shown in table 3. It is evident that the majority of correlation coefficients fell in the range -0.2 to +0.2. The two highest values, found between KCAL and CHOL and between KCAL and FIBE, were 0.46 and 0.51 respectively. Even the highest values fall short of those which would suggest interchangeability. Table 4 shows the correlation between the characteristics introduced in the sensitivity analysis and those which they replaced in this exercise. Once more, the values observed do not indicate strong inter-relationships.
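A pairwise screen of this kind can be sketched as follows. The text does not state which correlation coefficient was calculated, so the use of the Pearson coefficient here is an assumption, as are the function names and the invented values attached to the variable names.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pairwise_correlations(columns):
    """Correlation for every pair of characteristics, keyed by name pair."""
    return {(a, b): pearson(columns[a], columns[b])
            for a, b in combinations(sorted(columns), 2)}

# Invented values for five hypothetical subjects (not NHANES data).
toy_columns = {
    "KCAL": [1, 2, 3, 4, 5],
    "FIBE": [2, 1, 4, 3, 5],
    "BMIX": [5, 4, 3, 2, 1],
}
toy_table = pairwise_correlations(toy_columns)
```

Pairs whose coefficient approaches +1 or -1 would be candidates for the interchangeability discussed above.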
2.3 Preparation of data
Data for use in cluster analysis generally require treatment before use so that differences between characteristics in the absolute magnitudes of the values which they present do not influence the results obtained. Such an effect may be reduced or removed by standardising the data between the minima and maxima recorded for each so that the values for all attributes fall within the same range. In cases where the data are not distributed linearly, it may be necessary to apply an appropriate transformation prior to standardisation.
Preliminary work suggested, however, that when this procedure was applied characteristics described by categorical or coded data appeared to carry more weight in cluster analysis than did those with continuous data. Moreover, there seemed to be an inverse relationship between the number of categories or codes used and the influence of a characteristic in clustering. For this reason, it was decided that in the work described here, all characteristics would be represented by two states only, 0 or 1. In each case, 0 corresponds to values deemed not to offer a risk to their carrier and 1 represents the risky state. In principle, this offers the further advantage that judgment on risk is made prior to cluster analysis and not at the time of assessment of results. This judgement was made as systematically as possible. When it was possible to do so, the quintile of values at the risky end of the range of those available was deemed to be at risk. This was easily applied to continuous data by inspection of the range. For example, setting the threshold of risk for daily cholesterol intake as a value of 441.78 mg or more placed 20.00% of subjects in the risk range (table 1, column 6) while selecting a value of 61.49 mg per day or less for carotene included 19.97% of subjects in the risky range. It will be noted that this treatment allows coding to be applied in the same way regardless of whether high or low values of a variable are regarded as risky. A threshold giving a 20% : 80% partition was less easily attainable for discrete variables. Thus, setting a threshold for daily alcohol consumption of three or more drinks per day placed 22.41% of subjects in the risk range. In the case of categorical data it is not possible to select a threshold. Rather, the most extreme category is chosen to represent the risky condition. Thus, level of education was presented in three categories, effectively corresponding to low, medium and high. Deciding that the lowest of these was the risk state included 37.36% of subjects. Likewise, the decision that an unskilled or manual occupation was risky selected 12.25% of subjects (JOB2) while the arbitrary choice of certain industries as being hazardous put 9.63% of subjects in the risky range (IND2).
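The binary coding of a continuous characteristic described above can be sketched as follows. The function names and the ten invented intake values are hypothetical, and the selection of the threshold by sorting is only one way of approximating "inspection of the range".

```python
def risk_threshold(values, fraction=0.20, high_is_risky=True):
    """Cut-off placing roughly `fraction` of subjects at the risky end."""
    ordered = sorted(values, reverse=high_is_risky)
    k = max(1, round(fraction * len(ordered)))
    return ordered[k - 1]

def code_binary(values, threshold, high_is_risky=True):
    """Code each value as 1 (risky) or 0 (not risky) against the threshold."""
    if high_is_risky:
        return [1 if v >= threshold else 0 for v in values]
    return [1 if v <= threshold else 0 for v in values]

# Invented daily intakes for ten hypothetical subjects.
intakes = [100, 150, 200, 250, 300, 350, 400, 450, 500, 550]

# High values risky (as for cholesterol): the top fifth is coded 1.
high_cut = risk_threshold(intakes, high_is_risky=True)
coded_high = code_binary(intakes, high_cut, high_is_risky=True)

# Low values risky (as for carotene): the bottom fifth is coded 1.
low_cut = risk_threshold(intakes, high_is_risky=False)
coded_low = code_binary(intakes, low_cut, high_is_risky=False)
```

Where several subjects share the threshold value, as with discrete variables, the proportion coded as risky will exceed the nominal 20%.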
Two exceptions should be noted. Firstly, in the case of moderate physical exercise (MODE) the majority of subjects (62.39%) fell into the risky category of not undertaking such exercise. Secondly, although age and gender present a risk in themselves, a part of this is likely to be the result of confounding with other characteristics used in the analysis. Thus gender will be confounded with, for example, smoking status and occupation while age will be confounded with employment status, duration of exposure to environmental risk factors and other variables. These demographic variables were thus excluded from risk analysis and used only for descriptive purposes. Gender was coded arbitrarily as 0 for women and 1 for men and age by allocating 0 to the younger half of the population and 1 to the older.
The coding used for all variables is given in table 1.
2.4 Analysis of results
2.4.1 Description of clusters
The results of cluster analysis were assessed by comparison of the composition of clusters formed at distance thresholds which divided the population into three, six, 12, 24 and 36 clusters or the numbers nearest thereto. These numbers bear no particular significance other than that they form a series of doublings modified at the final stage to yield clusters containing around 100 subjects.
The number of subjects in a cluster bearing risky values of each characteristic was determined. This was compared with the number which would have been present if the proportion bearing risky values was the same as that in the whole population. A simple measure of the statistical significance of the difference was then calculated as the χ² statistic with a threshold of p = 0.005 used to infer the significance of differences. Note that with 19 characteristics under examination in each cluster, the number of comparisons is approaching the value of 20 at which one significant difference would be expected from chance alone.
The effects of clustering are illustrated in tables, prepared for each level of clustering, showing the mean value of each characteristic in those cases where the difference was statistically significant. A blank cell in the table denotes that the difference between the cluster and the population as a whole was not statistically significant. Although the comparisons were based on standardised data, the tables show, for ease of interpretation, the mean values of unstandardised values for each characteristic. It is frequent in cluster analysis that at any level of observation below that of the whole population there exists a small residue of subjects which have in common only that they resemble each other more than they do any of the clusters formed. These residual groups are shown in the tabular presentation of results but seldom, if ever, differ statistically from the whole population.
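The comparison of observed and expected numbers can be sketched as below. The exact form of the test is not specified in the text, so this assumes a two-cell goodness-of-fit χ² (risky and not risky) with one degree of freedom; the function name and example figures are hypothetical.

```python
import math

def chi_square_vs_population(risky_in_cluster, cluster_size, population_proportion):
    """Chi-square (1 df) of a cluster's risky count against the count expected
    if the cluster had the same risky proportion as the whole population."""
    expected_risky = cluster_size * population_proportion
    expected_safe = cluster_size - expected_risky
    observed_safe = cluster_size - risky_in_cluster
    chi2 = ((risky_in_cluster - expected_risky) ** 2 / expected_risky
            + (observed_safe - expected_safe) ** 2 / expected_safe)
    # Upper tail of the chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical cluster: 100 subjects, 40 risky, against 20% in the population.
chi2, p = chi_square_vs_population(40, 100, 0.20)
is_significant = p < 0.005
```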
The assessment of results is thus intended to show the progressive divergence of clusters from the whole population. Statistical comparisons between clusters have not been attempted.
The hierarchical nature of the series of clusters generated by the analysis is shown diagrammatically as dendrogrammes or tree diagrams. The amount of descriptive data which can be included in these is restricted by considerations of space.
2.4.2 Risk scores
To provide a preliminary assessment of the risk associated with the characteristics defining a cluster, a crude risk score was calculated. This was done in three steps. Firstly, only characteristics for which the mean value of standardised data was significantly greater in the cluster in question than in the whole population were included. Secondly, the increment of risk presented by each such characteristic was represented by the ratio of its mean value, using uncoded data, to the corresponding mean in the whole population. This corresponds to the relative proportion of subjects bearing risky states of the characteristic. This was introduced to provide some measure of the scale of any difference. Finally, the values obtained for each characteristic are summed to provide the risk score. As noted above the two basic demographic characteristics of gender and age were excluded from this sum.
It could be argued that the risk score should take into account the possibility that part of the range of a characteristic could be protective against disease. The score would then represent the balance between potentially damaging and protective effects. This is indeed so but cannot be applied in the present case. This is, in part, because it would require a different form of coding of the variables which would allow both positive and negative effects to be included and this has not been done here. More importantly, however, it would require even more assumptions about the properties of the putative risk factors than have already been made. If high levels of a risk factor are risky, are lower values simply non-risky or are very low levels protective? This will almost certainly differ from case to case. It might be, for example, that as carotene intake is related to the anti-oxidant capacity of the body, and hence its ability to deal with toxic challenges, not only are low levels deleterious but high levels may be protective. This assumption would require some assurance that very high levels of carotene are not toxic. On the other hand, if a high body mass index is risky because of the physiological consequences of obesity then lower values will simply be neutral. This too would require qualification, however, as very low levels of body mass index may be an indicator of poor health, in either general or specific terms, or could be associated with a different type of physiological stress. Given these considerations, any attempt to include beneficial as well as deleterious parts of the range of a risk factor in the risk score seems unjustified in that it would require further assumptions beyond those already made.
It is important to note that as one of the objects of the exercise was to assess the levels of risk related to smoking by virtue of its association with other risk factors, the presence or absence of daily smoking was not included in the crude risk score. This does not imply any assumptions on my part concerning any risk associated with smoking itself and was done simply to facilitate the comparison required. I may add that to have included smoking in the risk score would have required that in the second phase of the work described in which I include all four possible states of smoking behaviour, some arbitrary decisions would have been required to allocate units of risk to occasional or former smoking.
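The three steps of the crude risk score can be sketched as follows. The function name, the dictionary layout and the example proportions are hypothetical; the means are taken to be the proportions of subjects in the risky state, and the list of characteristics significantly raised in the cluster is assumed to have been established beforehand.

```python
# Characteristics left out of the sum: the two demographic variables and
# daily smoking, as stated in the text.
EXCLUDED = {"AGEY", "GEND", "SDAI"}

def risk_score(cluster_means, population_means, significantly_higher):
    """Sum, over characteristics significantly raised in the cluster, of the
    ratio of the cluster mean to the whole-population mean."""
    score = 0.0
    for name in significantly_higher:
        if name in EXCLUDED:
            continue
        score += cluster_means[name] / population_means[name]
    return score

# Hypothetical proportions of subjects in the risky state.
population = {"ALCO": 0.2, "CAFF": 0.2, "BMIX": 0.2, "SDAI": 0.25, "GEND": 0.5}
cluster = {"ALCO": 0.6, "CAFF": 0.4, "BMIX": 0.1, "SDAI": 0.75, "GEND": 0.7}
raised = ["ALCO", "CAFF", "SDAI", "GEND"]  # assumed significantly higher
score = risk_score(cluster, population, raised)
```

In this invented example only ALCO and CAFF contribute, with ratios of about 3 and 2.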
3. RESULTS AND DISCUSSION
3.1 Basic relationships between smoking status and risk factors
As one criterion of the value of cluster analysis is that it should provide more information than can be obtained from a consideration of the population as a whole, it is both useful and necessary to examine the distribution of these factors in the whole population to which it was applied. Figure 2 shows the distribution of values of each characteristic between the four categories of smoking behaviour available in the data. For categorical characteristics, the figures show the proportions of the total in each smoking category which are present in each level of the characteristic. In the case of quantitative data, the figures show the mean values of the characteristics by smoking status.
It is immediately apparent that differences in distribution according to smoking status can be seen for nearly all of the characteristics and that their direction is in accord with expectations from the literature.
Regarding the basic demographic characteristics of gender and age (figure 2), it can be seen that men are more likely than women to be daily, occasional or former smokers and less likely to be never smokers. The highest proportion of daily smokers lies in the age group 40 – 49, and falls with age. There is a corresponding increase in the proportion of former smokers, with the highest proportion at ages 70 – 79. The frequency of never smoking declines with age, with a maximum at ages 30 – 39 while occasional smoking also appears more common in younger age groups.
The socioeconomic characteristics confirm the expectation that daily smokers tend to be less well educated and to have lower family incomes than never smokers (figure 2 c,d). There are also tendencies for daily smokers to live in houses with fewer rooms and to be more likely to work in potentially hazardous industries (figure 2 o,p) and to hold manual or unskilled jobs (figure 2 q). With regard to ethnic origin, a characteristic used only in the sensitivity analysis, daily smoking seems to be the most frequent form of smoking behaviour in black or white people while this is not the case for Mexican American or other Hispanic people (figure 2 s).
The first of the dietary characteristics included in the study, daily caloric intake, shows a progressive decline from daily smokers, through occasional and former smokers, to never smokers (figure 2 f). Body mass index shows the opposite trend, tending to rise from daily smokers to never smokers (figure 2 n). These observations are seemingly paradoxical, as, other things being equal, a higher caloric intake would tend to lead to a higher body mass index. Other things cannot therefore be equal. It is beyond the scope of this investigation to pursue this point but it is nevertheless of potential interest. The other nutritional characteristics vary in accord with findings in the literature that smokers tend to have a diet which is considered to be less healthy than that of never smokers. They tend to have higher daily intakes of cholesterol and lower values for dietary fibre (figure 2 h), carotene (figure 2 i) and vitamin C (figure 2 t).
The amount of physical exercise undertaken in a typical day (figure 2 k) appears to suggest inconsistent relationships with smoking behaviour in that daily smokers seem more likely to be both sedentary and to have more active days. It is likely that this reflects confounding with age, gender and job status. On the other hand, expectations are fulfilled in the case of voluntary physical exercise (figure 2 l) which is less common in daily smokers and a sedentary existence in leisure time (figure 2 m) which is more common.
As regards the characteristics chosen to reflect lifestyle, smokers clearly have higher daily intakes of alcohol (figure 2 e) and caffeine (figure 2 j), although differences in these characteristics between occasional and former smokers suggest that these two variables are not reflecting closely similar forms of lifestyle. The frequency of consumption of food prepared outside the home (figure 2 r) is the characteristic showing the least differentiation according to smoking category. Such trends as are apparent suggest, however, that daily smokers may be more likely to indulge themselves in this way.
It is of some interest in the interpretation of epidemiological data to know whether the exposure to risk factors of former smokers tends to resemble that of daily smokers or never smokers. In other words, when people stop smoking do they tend to retain other habits associated with smoking or to modify other behavioural attributes in the direction commonly perceived to be more healthy? Space does not permit a detailed discussion of the risk factors present in the whole population in this context. My general appreciation of the data is, however, that the spectrum of risk factors of former smokers is more like that of never smokers than daily smokers. In some cases (FIBE, figure 2 h; CARO, figure 2 i) they may even surpass never smokers.
The category of occasional smoker is less commonly used in conventional epidemiology. It has some wider relevance, however, to studies which investigate smoking behaviour with questions such as "in your lifetime, have you smoked more than [a specified number] of cigarettes?" In such cases, a positive answer may place an occasional smoker in the categories of current or former smoker and hence contribute to the risk in that category. The information assembled here tends to suggest that occasional smokers have a risk spectrum more similar to that of never smokers than daily smokers, although they do resemble the latter in some respects.
The foregoing discussion has been essentially qualitative. A better idea of the magnitude of relationships can be obtained by calculating the odds ratio between two categories for the exposure to levels of a characteristic exceeding the threshold for risk established for this exercise. This is simply:
(proportion exceeding threshold in smoking category A) / (proportion exceeding threshold in smoking category B)
Results of these calculations for the various possible combinations of smoking behaviour are shown in table 5. It can be seen that daily smokers are more likely to show risky levels of characteristics than the remainder of the population taken together and that the differences tend to be greater when daily smokers are compared with never smokers. The only exception to this observation is in the case of body mass index for which risky levels are less frequent in daily smokers. Smokers are more than twice as likely as other people to have risky levels of intake of carotene, dietary fibre, caffeine and alcohol. In the case of the last, the difference is fourfold. These contrasts tend to be greater when smokers are compared with never smokers.
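The calculation defined above can be sketched as follows, using 0/1 coded values in which 1 denotes a value beyond the risk threshold. The function names and the two invented groups are hypothetical. The sketch follows the formula exactly as written, that is, as a ratio of two proportions.

```python
def proportion_risky(coded):
    """Proportion of subjects coded 1 (beyond the risk threshold)."""
    return sum(coded) / len(coded)

def exceedance_ratio(coded_a, coded_b):
    """Proportion exceeding threshold in category A divided by that in B."""
    return proportion_risky(coded_a) / proportion_risky(coded_b)

# Invented coded values for one characteristic in two smoking categories.
daily_smokers = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]   # 40% risky
never_smokers = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # 10% risky
ratio = exceedance_ratio(daily_smokers, never_smokers)
```

A ratio above 1 indicates that the risky state is more frequent in category A, a ratio below 1 that it is less frequent.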
These results also tend to confirm the observations above that the risk profile of former smokers is more similar to that of never smokers than that of daily smokers but that occasional smokers exhibit some of the risky characteristics of daily smokers.
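The ratio defined above can be written out as a short calculation. The following is a minimal sketch only: the function names and the counts are hypothetical and are not taken from table 5. As defined in the text, the quantity is a ratio of proportions. The conventional odds ratio, which divides odds rather than proportions, is included for comparison because the two diverge when the risky state is common.

```python
def proportion_ratio(risky_a, total_a, risky_b, total_b):
    """Ratio of the proportions exceeding the risk threshold in categories A and B."""
    p_a = risky_a / total_a
    p_b = risky_b / total_b
    return p_a / p_b


def odds_ratio(risky_a, total_a, risky_b, total_b):
    """Conventional odds ratio for the same 2 x 2 table, shown for comparison."""
    odds_a = risky_a / (total_a - risky_a)
    odds_b = risky_b / (total_b - risky_b)
    return odds_a / odds_b


# Hypothetical counts: 40 of 100 subjects in category A and 20 of 100 in
# category B exceed the threshold for some characteristic.
ratio = proportion_ratio(40, 100, 20, 100)
conventional = odds_ratio(40, 100, 20, 100)
print(round(ratio, 2), round(conventional, 2))
```

With these invented counts the ratio of proportions is 2 while the conventional odds ratio is larger, so the two measures should not be read interchangeably.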
Table 6 shows, in part a, the odds ratios for the possession of risky states of pairs of characteristics when daily smokers are compared with never smokers. These results provide preliminary evidence of the joint occurrence of risky states of characteristics in smokers. In some cases, the size of the odds ratios is greater for pairs of characteristics than when these were considered alone. This is particularly evident for pairs including risky levels of intake of caffeine or alcohol although the odds ratio for risky levels of KCAL + FIBE also exceeds 7.0. Some combinations of risky states are, however, less common than expected. Most of these include BMIX in the pairs in question although it can also be seen that the combination of risky levels of ALCO and EDUC is particularly uncommon in daily smokers.
Systematic combination of three or more characteristics generates more results than can be shown here. Table 6b, however, shows a few selected examples. It is evident that the combination of risky levels of ALCO + CAFF + CARO is around one hundred times more likely in daily smokers than never smokers. Adding FIBE to the combination, however, reduces the odds ratio to around 20.
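The extension of the ratio to combinations of characteristics can be sketched in the same way. The example below is illustrative only: the subjects are invented, the function name is hypothetical and only two characteristics are shown. It counts a subject as risky for a combination only when every characteristic in that combination is in its risky state.

```python
from itertools import combinations


def joint_ratio(group_a, group_b, names):
    """Ratio of the proportions of subjects who are risky for every name in `names`."""
    def joint_proportion(group):
        hits = sum(all(subject[name] for name in names) for subject in group)
        return hits / len(group)

    # Undefined (division by zero) if no subject in group_b shows the combination.
    return joint_proportion(group_a) / joint_proportion(group_b)


# Hypothetical subjects: True marks a risky level of the characteristic.
daily = [
    {"ALCO": True, "CAFF": True},
    {"ALCO": True, "CAFF": True},
    {"ALCO": True, "CAFF": False},
    {"ALCO": False, "CAFF": True},
]
never = [
    {"ALCO": True, "CAFF": True},
    {"ALCO": False, "CAFF": True},
    {"ALCO": True, "CAFF": False},
    {"ALCO": False, "CAFF": False},
]

names = ("ALCO", "CAFF")
single = {name: joint_ratio(daily, never, (name,)) for name in names}
pairs = {pair: joint_ratio(daily, never, pair) for pair in combinations(names, 2)}
print(single, pairs)
```

In this invented example the ratio for the pair is larger than that for either characteristic alone, which is the pattern described for some pairs in table 6. Replacing the 2 in the call to combinations with 3 or more would enumerate larger combinations, the number of which grows rapidly.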
The findings discussed in this section demonstrate that risky levels of the characteristics under examination tend to be more frequent in daily smokers than in never smokers in the population as a whole. There is also some evidence that when characteristics are considered in combination, differences between daily smokers and never smokers are sustained or increased. The following sections will show whether or not the partition of the population into subsets by cluster analysis reveals further aggregation or dispersal of risky characteristics.
3.2 Clustering: smoking represented by daily smoking
It is unfortunate for the reader that cluster analysis provides a large amount of information, the economical presentation of which is difficult. I suggest, therefore, that he or she keeps the table and figure introduced below open in order to follow my attempts to convey the findings as clearly as possible. Both presentations use a common colour coding for clusters.
Table 7 describes the clusters formed at the five distance thresholds used in the exercise, the results for each threshold being given in a separate part of the table. The distance thresholds are identified in the table. The properties of the clusters are shown in columns containing the mean values of each characteristic which differed significantly from the whole population according to a χ² test. These can be compared with the overall values shown in the second column of the tables. These means refer to unstandardised data which are more easily interpretable than the coded data. Note, however, that coded data are used for IND2 and JOB2 as the reclassification of these characteristics renders the unstandardised values unhelpful. In these cases, a higher value indicates a higher proportion of subjects with risky values of the characteristics.
The first rows in the table provide an identifier for each cluster by showing its relationship to clusters discriminated at higher distance thresholds. Thus, the first cluster in table 7b is labelled 1.1, identifying it as the first cluster formed by division of the first cluster found at the highest level. A dash, or in the case of the highest level of clustering, parentheses, indicate that a cluster persisted without significant change from one level of clustering to the next. These labels are used to refer to clusters described here but the hierarchical arrangement of clusters can be seen more easily in the dendrogram shown in figure 3. Space in this diagram does not permit a full description of the clusters and the information presented is restricted to the percentage of each category of smoking behaviour in the cluster and the risk score associated with it. Note too that space necessitates that the clusters found at the lowest distance level, shown at the bottom of the figure, are presented in two staggered rows.
The following narrative is concerned primarily with the partitioning of smoking behaviour and crude risk. Other characteristics will be discussed only briefly and as required. A more complete qualitative description will be attempted in section 3.5.
The first cluster seen at the highest level of clustering, obtained using a distance threshold of 3.5, was comprised largely of women and contained 7% of daily smokers compared with 17% in the whole population together with a small deficit of former smokers and an excess of never smokers. The risk score was modest at 3.8. It will be recalled that the risk score is calculated by comparison with the overall population which, by this definition, has a risk score of 0. At a distance threshold of 2.35, this cluster split into two. These appear to be distinguished by the age of the subjects, with the clusters representing younger and older people respectively. Smoking behaviour differed only slightly, with the younger cluster containing slightly more daily smokers and fewer former smokers. The risk scores were similar, being 5.5 and 6.5 in the younger and older groups respectively. While it may seem paradoxical that both of these scores are higher than in the parent cluster, this is not the case. As the population is progressively divided on the basis of internal resemblances in the clusters formed, their composition, and hence their risk score, will diverge progressively from the value of zero in the whole population.
With further subdivision, the distinction in age persists while distinctions in gender tend to become stronger, with successive divisions of clusters leading either to groups comprised almost exclusively of women (e.g. 1.1.1.1 and derivatives) or to groups of mixed gender (e.g. 1.2.2 and descendants). The frequency of daily smoking tends to remain below the overall average, although correction for gender would probably render the contrast less marked. At the final level of clustering (table 7e), the frequency of daily smoking ranges from 2% (1.2.1.2.-) to 9% (1.1.1.1.1). Some differentiation in the other categories of smoking behaviour, which were not used in clustering, can be observed. One group of younger subjects (1.1.2.1.2) contains significantly more occasional smokers than average while two (1.1.1.1.1; 1.1.2.1.1) contain fewer former smokers than expected. It was also found that two related groups (1.2.2.1.1; 1.2.2.1.2) no longer showed any differentiation by either gender or smoking behaviour although they differed twofold in risk score.
Risk scores at the final level of clustering cover a range from 0 (1.2.1.2.-) to 16.5 (1.2.2.1.2). The former group comprises middle aged women with a low frequency of smoking and the latter is one of the few clusters in which distinctions in gender and age are absent and in which the frequency of daily smoking was close to the overall average. Within this set of clusters there is little apparent association between risk score and smoking frequency (text figure 1). For example, in the pair of clusters 1.1.2.1.1 and 1.1.2.1.2 the smoking frequency is over four times higher in the former but the risk score is about half that in its counterpart. Similarly, 1.2.2.1.1 and 1.2.2.1.2 have similar smoking frequencies but a twofold difference in risk scores. Within this set of clusters, the most frequent correlate of high risk appears to be obesity (body mass index ≥30).
The second main group identified at the first level of clustering (table 7a) had twice the average frequency of daily smokers, fewer never smokers, represented slightly older subjects but was undifferentiated by gender. The risk score was, at 14.9, the highest of the three principal groups. This group persisted almost unchanged at the second level of clustering but at the third split into three subgroups (table 7c). The first was predominantly female, of middle age, had twice the average frequency of daily smoking and the lowest risk score of the three subgroups. The second seemed to represent older men, with fewer daily smokers and more former smokers than average and a slightly higher risk score. The third group was undifferentiated in gender, contained somewhat younger people and 40% daily smokers. It had the highest risk score of the three subgroups. At higher levels of clustering, daily smoking became differentiated in the descendants of each of these subgroups.
Thus, group (2).2.1, the first of the three described above, ultimately divided into three, one with 98% daily smokers, one with 8% and one showing no differences in smoking behaviour when compared with the whole population (table 7e). Risk scores were generally high, ranging from 7.4 to 19.6. In contrast to the first principal cluster and its descendants, there is some indication of a relationship between the frequency of daily smoking and risk score (text figure 2) at lower smoking frequencies but not at higher. Inspection of the data in table 7 suggests that, unlike the case in the first major group, body mass index was not a determinant of risk, with values near or below the average. Rather, risk appeared to be associated with socioeconomic characteristics.
The third major group identified at the first level of clustering appears to contain a predominance of men of slightly below average age, a higher frequency of former smokers, fewer never smokers and slightly above the average percentage of daily smokers. At 10.0 the crude estimate of risk was between that in the other two main groups, although closer to the second (table 7a). The next level of clustering divided this group into three (table 7b). Men, slightly older than average, comprised about three quarters of the first subgroup. Daily smoking was around the average frequency, former smoking more frequent and never smoking less frequent. The crude risk score was low at 5.6. The second subgroup had a slightly higher risk score, 7.9, less than half the frequency of daily smokers, slightly more never smokers than average, contained more women and did not differ significantly from the whole population in age. Amongst other characteristics, its subjects had a higher daily intake of total calories and cholesterol. The third subgroup appeared to be represented largely by younger men who, in addition to higher intakes of total calories and cholesterol, were heavier drinkers and more likely to have a manual job in a potentially hazardous industry. Their frequency of daily and occasional smoking was above average and never smoking below average. Further subdivision of these groups was accompanied by some further differentiation of gender, giving subgroups with more or fewer men, and partitioning by smoking behaviour. For example, cluster 3.1.1 (table 7c) with 39% women and 29% current smokers ultimately evolved into four clusters with 90%, 7%, 29% and 14% women respectively and frequencies of daily smoking of 13%, 7% and 84% in the first three while that of the fourth was no different from the overall average (table 7e). Also worthy of note is cluster 3.2.2.-.- which was largely male and showed the highest concentration of occasional smokers, with a frequency of 10% compared with 4% in the whole population. Text figure 3 suggests a similar relationship between the frequency of daily smoking and crude risk to that in the second major group, with some indication that risk increases with smoking frequency at lower levels of the latter and then remains constant or declines.
This discussion of the results of cluster analysis suggests that the procedure was successful in achieving a partitioning of the population into a hierarchy of clusters which could be described in terms of age, gender, daily smoking and various other characteristics. It was also apparent that, although the presence or absence of daily smoking was the only information on smoking behaviour used in the analysis, some differentiation by other categories of smoking behaviour could be observed. This, it can be inferred, must result from associations between occasional, former and never smoking and other characteristics used in analysis.
Differentiation in the crude risk score was also apparent. Perhaps unsurprisingly, in the clusters comprised largely of people who were not daily smokers, this showed little relationship to daily smoking, but in clusters with more daily smokers, there was some evidence of an increase in risk score with the frequency of daily smoking up to a threshold of somewhere between 25% and 50% daily smoking. This provides evidence beyond that obtained by analyses of the whole population of an association between total risk and smoking behaviour. The potential significance of this is obscured, however, by the fact that the clusters compared above differed in size so that the frequency within them of a given category of smoking behaviour may bear little relationship to the frequency of that category in the whole population. A better indication of the relationship between risk and smoking behaviour can be obtained by assembling clusters into groups according to the risk associated with them and calculating the percentage of the total number of subjects in each smoking category within these groups.
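The regrouping just described can be sketched as follows. The example is a hedged illustration, not the procedure actually used: the cluster data and function names are hypothetical, and the rule for forming groups of roughly equal size (placing each cluster according to the midpoint of its subjects when clusters are ordered by risk score) is an assumption, since the text does not specify one.

```python
def group_by_risk(clusters, n_groups):
    """Assign clusters, ordered by risk score, to groups of roughly equal size.

    Assumed rule: a cluster joins the group in which the midpoint of its
    subjects falls when all subjects are laid out in order of cluster risk.
    """
    ordered = sorted(clusters, key=lambda c: c["risk"])
    total = sum(sum(c["counts"].values()) for c in ordered)
    groups = [[] for _ in range(n_groups)]
    seen = 0
    for cluster in ordered:
        size = sum(cluster["counts"].values())
        midpoint = seen + size / 2
        index = min(int(midpoint * n_groups / total), n_groups - 1)
        groups[index].append(cluster)
        seen += size
    return groups


def category_percentages(groups, categories):
    """Percentage of each smoking category's total found in each risk group."""
    totals = {cat: sum(c["counts"][cat] for g in groups for c in g) for cat in categories}
    return [
        {cat: 100 * sum(c["counts"][cat] for c in g) / totals[cat] for cat in categories}
        for g in groups
    ]


# Hypothetical clusters: a crude risk score and subject counts by smoking category.
clusters = [
    {"risk": 12.0, "counts": {"daily": 30, "never": 70}},
    {"risk": 2.0, "counts": {"daily": 10, "never": 90}},
    {"risk": 18.0, "counts": {"daily": 40, "never": 60}},
    {"risk": 6.0, "counts": {"daily": 20, "never": 80}},
]
groups = group_by_risk(clusters, 2)
shares = category_percentages(groups, ("daily", "never"))
print(shares)
```

Because the percentages are expressed relative to the total number of subjects in each smoking category, they sum to 100 across the risk groups for every category, whatever the sizes of the individual clusters.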
Figure 4 shows, in part (a), how smoking behaviour is distributed across six categories of crude risk using the 37 clusters resolved at the final level of clustering. Each category contains approximately equal numbers of subjects. It is immediately apparent that while the risk associated with never smokers shows a unimodal distribution, skewed towards the lower end of the range, this is not the case for daily smokers. The distribution of daily smokers is bimodal, with a substantial proportion of them, 37%, in the highest category of risk. It is equally clear that some smokers carry a low risk by virtue of the other factors associated with them, with nearly 20% of the total in the lowest two categories of risk. The distributions of risk in occasional and former smokers appear less regular and, to clarify these, figures 4(b) and 4(c) show the distribution of risk between three and two categories respectively. Figure 4(b) suggests that the distribution of risk in former smokers is more similar to that in never smokers than daily smokers while that in occasional smokers resembles that in daily smokers but with a skew towards lower values. Figure 4(c) provides a convenient summary by dividing the risk simply into low and high categories. Around 66% of daily smokers are at higher risk by virtue of their associated characteristics while only slightly more than half of this proportion of never smokers are at higher risk. That is, daily smokers are about twice as likely as never smokers to carry a risky load of other characteristics. Former smokers are neatly divided with 50% in each of the high and low risk categories while a majority of occasional smokers, 58%, are in the high risk group.
These results provide a clear demonstration that the risk associated with factors other than smoking is not distributed uniformly across categories of smoking behaviour. In comparison with never smokers, a substantial minority of daily smokers carry a high level of risk. Nevertheless, not all daily smokers appear to be at high risk, with appreciable numbers carrying little risk derived from other factors. It is also apparent that some occasional smokers and, to a smaller extent, some former smokers, carry some additional risk by virtue of an association between their current or previous smoking behaviour and other risk factors.
3.3 Clustering: smoking represented by four categories
It was seen in the previous section that when daily smoking provided the only information on smoking behaviour, cluster analysis resolved clusters which could be differentiated not only on the basis of the frequency of daily smoking but also by other categories of smoking behaviour. This section describes the results obtained when these other categories were added to the characteristics used in cluster analysis.
Doing so introduces a deliberate weighting towards the influence of smoking behaviour on the partitioning of characteristics. Because the possible states of smoking behaviour are mutually exclusive, when distances between subjects are calculated, two subjects of the same smoking status will necessarily also have identical scores, opposite to that for their shared status, for the remaining three states. Thus, smoking behaviour will add four units to the distance score whereas a truly independent characteristic will add one. It can thus be predicted that smoking behaviour will become the primary basis for the partitioning of subjects into clusters. Given this, however, further division of clusters will then be driven by the relationships between the remaining characteristics of the subjects. By forcing the clustering process to differentiate by smoking status at an early stage, we hope to obtain a clearer indication of how other characteristics are distributed between and within the categories of smoking status.
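The effect of coding smoking behaviour as four mutually exclusive characteristics can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the 0/1 coding, the city-block distance and the function names are not taken from the analysis itself.

```python
CATEGORIES = ("daily", "occasional", "former", "never")


def indicators(status):
    """Code one smoking status as four mutually exclusive 0/1 characteristics."""
    return [1 if status == category else 0 for category in CATEGORIES]


def city_block(a, b):
    """Sum of absolute differences between two coded subjects."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Contribution of smoking behaviour to the distance between two subjects.
same = city_block(indicators("daily"), indicators("daily"))
different = city_block(indicators("daily"), indicators("never"))
# Largest possible contribution of one ordinary binary characteristic.
single = city_block([1], [0])
print(same, different, single)
```

Under these assumptions, subjects of the same status contribute nothing to the distance across all four columns, while subjects of different status differ on more than one column and so contribute more than a single binary characteristic can. The exact multiple depends on the coding and distance measure actually used, so the value obtained here need not match the figure quoted in the text.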
Such deliberate and explicit weighting is a recognition of our particular interest in smoking behaviour. The use of mutually exclusive categories of smoking behaviour means that it is, in principle, equivalent to conducting separate cluster analyses on four populations each comprised of only one of the four categories of smoking behaviour. This should be distinguished from the form of weighting which would arise if smoking behaviour were represented by several characteristics which were not mutually exclusive. If, for example, we had described smoking behaviour by presence or absence of daily smoking, daily cigarette consumption, type of cigarette smoked and duration of smoking we would, in effect, be including daily smoking four times in that a positive score for all of these characteristics would be equivalent to daily smoking and a null score equivalent to not smoking daily. Had we a particular interest in the relationships of daily cigarette consumption, type of cigarette smoked and duration of smoking to other characteristics of the subjects, then the analysis would be better conducted using a population containing only daily smokers.
Having acknowledged our deliberate weighting towards category of smoking behaviour, we can now examine the results of the cluster analysis obtained. These are shown in table 8 and, diagrammatically, in figure 5.
The first stage of clustering, division of the population into the three clusters shown in table 8(a), confirms our prediction that the primary distinction would be on the basis of smoking behaviour. Of the subjects comprising the first cluster, 98% were never smokers. The second cluster was almost exclusively a mixture of daily and occasional smokers while 94% of the third cluster were former smokers. Differentiation in other characteristics can also be observed but the information which this provides is essentially the same as would be obtained by a simple comparison of the whole population across the appropriate categories of smoking status and the conclusions obtained would be similar to those in section 3.1 above, except that the earlier discussion did not include the crude risk score. We may therefore note that the never smokers in cluster 1 had a risk score of 0, the daily plus occasional smokers in cluster 2 a score of 14.5 and the former smokers in cluster 3 a score of 4.8. To obtain further information, we need to examine higher levels of cluster analysis.
At the next distance threshold (table 8(b)), cluster 1, comprising never smokers, splits into four clusters, three of which appear to contain younger subjects and the fourth, older people. Of the first three, two are predominantly female, containing 82% and 87% women while the third has a more uniform distribution of the genders (46% women). The fourth cluster has some excess of women (65% versus 53% in the whole population). It is apparent that the assortment of characteristics between these clusters has disturbed the balance which gave the parent cluster a crude risk score of 0. The new clusters have risk scores of 3.8, 4.5, 8.6 and 13.8. The highest score is in the smallest cluster which contains 201 subjects, 87% of whom were women with a mean age of 40. Characteristics associated with these subjects include lower socioeconomic status, a riskier nutritional profile and, despite a low caloric intake, a body mass index just above the threshold for obesity (BMI = 30). This cluster appears to be strongly coherent and divides just once more as clustering proceeds to the level generating a total of 36 clusters (table 8(e.1)). The risk scores of the final two clusters are 11.7 and 16.3. In the other clusters derived from the original group of never smokers at the final level of clustering, risk scores vary from 0 to 19.2. The former cluster contains 131 older subjects, undifferentiated by gender with a generally favourable socioeconomic and nutritional profile. The most risky group of never smokers is represented by 66 younger subjects, 84% of whom are men. They appear to be heavy drinkers, with a high intake of calories and cholesterol, a low intake of carotene and a body mass index on the threshold of obesity.
The second principal cluster, that containing the daily and occasional smokers, survives the next level of clustering unchanged apart from the transfer of three subjects to the residual group (table 8(b)). At the next level (table 8(c)), it divides into two clusters, one comprised almost exclusively of daily smokers and the other consisting of 75% daily smokers, 24% occasional smokers with the remaining 1% being never smokers. The mixed cluster has a risk score of 10.4 compared with 16.8 for the daily smokers. At the next level of clustering, the mixed group divides again to give one cluster with 49% daily smokers and 51% occasional smokers, with risk scores of 4.6 and 15.6 respectively. The final level of clustering (table 8(e.1)) generates a cluster containing only occasional smokers and with a risk score of 3.5. At this level, the daily smokers are divided between six clusters with risk scores ranging from 3.5 to 21.3.
The principal cluster containing former smokers, like that of daily smokers, was largely intact after the second level of clustering (table 8(b)), apart from the loss of two subjects to the residual group. At the next level, it divided into three clusters. The first had a slight preponderance of men, was above average age and had a risk score of 1.7. A second cluster, undifferentiated by gender, was older than the first, contained a small proportion of daily smokers and never smokers and carried a risk score of 14.7. The third cluster was younger and predominantly female and more heterogeneous in smoking behaviour, containing 81% former smokers, 11% never smokers, 3% daily smokers and the remainder occasional smokers. With a value of 15.7, its risk score was the highest of the three. It was interesting to observe that while the mixed cluster of daily and occasional smokers described above ultimately resolved into clusters containing only one or the other, further partition of the mixed cluster containing former smokers led to one cluster at the final level being more rather than less heterogeneous (table 8(e.2)). It contained 5% daily smokers, 11% occasional smokers, 58% former smokers and 26% never smokers. It was also the riskiest of all the clusters derived from the original group of former smokers, with a risk score of 19.8. It contained younger subjects, nearly all of whom were men, with low socioeconomic status including, in particular, a high proportion engaged in manual jobs in potentially hazardous industries. It is of further interest that the second highest risk score amongst the former smokers, 16.9, was found in the other cluster which became more heterogeneous with progressive division. This cluster derived from the third initial partition described above and comprised 79% former smokers, 9% occasional smokers with the remainder being current smokers. 
It also appeared to owe its high risk score to low socioeconomic status together, in this case, with a poor nutritional profile. The remaining eight clusters resolved at the final level of clustering were more homogeneous in smoking status, containing from 95% to 100% former smokers. Their risk scores varied between 1.3 and 13.9.
It can thus be seen that the weighted clustering described in this section did indeed lead, with the interesting exceptions just described, to a much clearer separation of groups by smoking status. This allows a clearer understanding of how the crude estimate of risk varies with smoking status. It appears from the discussion so far that risk varied with smoking status, although there was a considerable overlap between categories. Risk in clusters of never smokers varied from 0 to 19.2, in former smokers from 1.3 to 19.2 and in daily smokers from 3.5 to 21.3. The single cluster of occasional smokers had a risk score of 4.1. As discussed in section 3.2, however, a better indication of the distribution of risk by smoking category can be obtained by examining categories of risk in terms of the overall proportion of each smoking category which they contain.
Figure 6(a) shows the distribution of risk across six categories of risk erected such that each category represents approximately equal numbers of subjects. The better definition of smoking status achieved in the weighted analysis could be predicted to produce a different distribution of risk from that obtained from the unweighted analysis described in the previous section and shown in figure 4. This is indeed the case. Whereas the risk in never smokers was distributed unimodally in the previous analysis, in the present case it appears to be bimodal, with the proportion in the highest category of risk higher than that in the next highest category. The distribution of risk in daily smokers is now strongly bimodal with peaks in the second and fifth categories. In addition, both occasional and former smokers seem to have bimodal distributions. Collapsing the categories into three (figure 6(b)) suggests that the proportion of never and former smokers is lowest in the highest category of crude risk while the proportion of daily smokers increases as crude risk increases. The proportion of occasional smokers tends to fall with increasing risk but with a slight increase in the highest category. Figure 6(c) summarises the distribution of risk by reducing the number of categories to two. This shows a wider differentiation by risk than seen in the previous analysis. The proportions in the higher risk category are 83%, 24%, 48% and 32% for daily, occasional, former and never smokers respectively. To assist in comparison, figures 4(b) and 6(b) are reproduced below as text figure 4.
3.4 Sensitivity analysis
I have stated previously that the characteristics used in cluster analysis are likely to have a strong influence on the results obtained. This section describes a series of analyses intended to investigate the extent to which the results described in sections 3.2 and 3.3 were dependent upon the set of characteristics used in corresponding cluster analyses.
The term sensitivity analysis may appear to imply some specific statistical procedure but most often it refers simply to a comparison of the results obtained by a procedure when some condition is changed. It is useful in cases, such as this, in which the interpretation of results has a marked qualitative element which makes an overall numerical or statistical summary of results difficult.
It is obvious that the present requirement will be met by changing the set of characteristics used in cluster analysis. It is less obvious how this should be done. One possibility would be to remove one or more characteristics. This would change the size of the set, however, and any differences observed might be due in part to this. Alternatively, characteristics could be replaced by dummy characteristics bearing little or no information. This could be done either by constructing a set of random data or by randomising the data within an existing characteristic. It is arguable that a characteristic bearing no information is equivalent to no characteristic at all, so that this might entail an effective reduction in the size of the character set, although it might provide some indication of the extent to which purely random associations can influence clustering behaviour. The approach adopted was to replace characteristics with other real characteristics not hitherto used in clustering. Any differences observed will indicate the sensitivity of clustering to the composition of the character set. This is not without limitations: it will not necessarily be clear whether any change is due to the properties of the characteristic replaced, those of the replacement or a combination of the two.
In a data set with 18 characteristics, excluding those concerning smoking behaviour, an exhaustive series of replacements would generate a large volume of data and cause difficulties in interpretation. I decided, therefore, to conduct a limited, but systematic, set of comparisons in which the five characteristics which I characterise as socioeconomic were replaced successively by a sixth of the same nature and in which the four related to nutrition were likewise replaced by a fifth. For the former I selected RACE as the replacement characteristic. As shown in table 1, I chose the category 'non-Hispanic black' to represent a risky part of the population, both because there is a general preconception that such people are at socioeconomic disadvantage in the United States of America and because it identified a proportion of the population close to my target of 20%. As shown in table 4(a), RACE was not highly correlated with any of the characteristics which it was to replace, the highest correlation being a value of 0.28 for its relationship to EDUC. The characteristic selected to replace the nutritional descriptors was VITC. Table 4(b) shows that it had slightly higher correlations with those to be replaced, the highest being 0.39 with FIBE and the lowest 0.10 for CHOL.
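The two strategies discussed above, replacement by a real characteristic and randomisation within an existing one, can be sketched briefly. This is an illustration under stated assumptions: the function names are hypothetical, the list of characteristics is a partial subset chosen for the example rather than the full character set, and the data values are invented.

```python
import random


def replace_characteristic(names, old, new):
    """Swap one characteristic for another so the set keeps the same size."""
    if old not in names:
        raise ValueError(f"{old} is not in the character set")
    return [new if name == old else name for name in names]


def randomise_within(values, seed=0):
    """Return a shuffled copy of one characteristic's values.

    The overall distribution of values is preserved but any association
    with the other characteristics of each subject is broken.
    """
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Illustrative subset of characteristics only, not the full set of 18.
base = ["EDUC", "IND2", "JOB2", "KCAL", "FIBE"]
variant = replace_characteristic(base, "EDUC", "RACE")

# Hypothetical coded values for one characteristic across six subjects.
dummy = randomise_within([0, 0, 1, 1, 0, 1])
print(variant, dummy)
```

Replacement leaves the size of the character set unchanged, which is the property that motivated its choice over simple removal. The randomised characteristic keeps the same proportion of risky values, so any change in clustering it produced would reflect chance association alone.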
Given the volume of data involved in my limited descriptions of the results of clustering in the preceding two sections, it is clear that for the nine further analyses involved here an abbreviated form of description is required. I chose to restrict consideration to clusters formed at distance thresholds producing 12 clusters and to limit description and discussion to their composition in regard to smoking behaviour and to their crude risk scores. Such qualitative findings as were of interest are discussed in the next section.
The distribution of smoking behaviour across clusters is shown in table 9 in which the first series of rows provides, for comparison, the results obtained without the replacement of characteristics. It can be seen that in nine of the 10 cases, six clusters representing never smokers were resolved. The exception was with replacement of IND2 by RACE which generated only five such clusters. In every case at least one cluster represented predominantly by daily smokers was seen while in three cases, EDUC replaced by RACE and the replacement of KCAL and FIBE by VITC, a second such cluster was observed. Scrutiny of the detailed results, not reproduced here, reveals, unsurprisingly, that the mean values of the replacement characteristic differed significantly between the two clusters of daily smokers.
Eight of the 10 analyses yielded mixed clusters containing both daily and occasional smokers. In two cases, KCAL replaced by VITC and FIBE replaced by VITC, this mixed cluster was replaced by another mixed cluster comprising principally former and occasional smokers. It may be noted in passing that KCAL and FIBE were the nutritional characteristics most strongly correlated with VITC, suggesting that simple correlations do not necessarily predict the influence of characteristics in clustering.
Eight of the 10 analyses also partitioned the majority of former smokers between three clusters. Of the two discrepancies in which only two such clusters were formed, one (EDUC replaced by RACE) was exceptionally large (data not shown). The second (KCAL replaced by VITC) was one of the analyses generating a mixed cluster of occasional and former smokers.
Finally, one substitution, RACE for IND2, produced a mixed cluster in which daily and former smokers comprised the majority of subjects, a combination not seen in any other case.
This comparison of the distribution of subjects by smoking status must be considered against the fact that all the character sets were strongly weighted to discriminate between subjects on this basis. Any major difference in clustering would thus require a change of sufficient magnitude to overcome this weighting. We can, however, conclude from the foregoing that, within this constraint in interpretation, the approach to cluster analysis used was generally robust, with only relatively minor differences in clustering following the selective replacement of characteristics applied.
The second form of assessment, using the crude risk score for each cluster, should be less strongly influenced by the weighting for smoking behaviour as this was not included in the calculation of the estimate in question. This aspect of sensitivity analysis should thus provide an impression of the influence of other characteristics within a weighted analysis. The four parts of figure 7 show the distribution of risk scores between clusters for each of the sensitivity analyses performed together with that for the original weighted analysis for comparison. Note that, as previously, the categories of risk were chosen to provide as uniform a distribution of subjects per category as possible. Because of the smaller number of clusters, five rather than six categories of risk score were chosen. Despite this, the uniformity of subject numbers across categories was less than in the previous cases. To provide some continuity, corresponding data from the original cluster analysis based on all four categories of smoking behaviour are included in the figures.
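The difficulty of obtaining uniform categories from a small number of clusters can be illustrated with a short sketch. It shows one possible way of assigning clusters to categories of roughly equal size and is not necessarily the procedure used here. The function name `equal_frequency_categories` and the cluster scores and sizes in the example are hypothetical.

```python
from collections import defaultdict


def equal_frequency_categories(clusters, n_categories):
    """Assign each distinct risk score to a category (0 = lowest risk) so
    that subject numbers per category are as even as whole clusters allow.

    clusters: list of (risk_score, n_subjects) pairs, one per cluster.
    """
    # Clusters sharing a risk score must fall in the same category.
    subjects_by_score = defaultdict(int)
    for score, n_subjects in clusters:
        subjects_by_score[score] += n_subjects

    total = sum(subjects_by_score.values())
    categories = {}
    seen = 0
    for score in sorted(subjects_by_score):
        n = subjects_by_score[score]
        # Place the group by the midpoint of its cumulative range.
        midpoint = seen + n / 2
        categories[score] = min(n_categories - 1,
                                int(midpoint * n_categories / total))
        seen += n
    return categories


# Hypothetical clusters: (crude risk score, number of subjects).
example = [(3, 400), (5, 300), (8, 350), (9, 250),
           (12, 300), (15, 200), (18, 200)]
assignment = equal_frequency_categories(example, 5)

# Subject numbers per category for the hypothetical data.
counts = [0] * 5
for score, n_subjects in example:
    counts[assignment[score]] += n_subjects
print(assignment)
print(counts)
```

Because clusters cannot be split between categories, the subject numbers per category in this example are uneven even though the target is 400 in each, which is the kind of non-uniformity noted above.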
Figure 18(a) shows that in eight of the ten cases, the greatest proportion of daily smokers lay in the highest category of risk and that in the two exceptions (INCO replaced by RACE and IND2 replaced by RACE), the greatest proportion was found in the second highest category. In all but two cases (INCO replaced by RACE and CARO replaced by VITC) only a negligible proportion of daily smokers was seen in the lowest three categories. Some consistency in the distribution of never smokers was also observed (figure 18(b)), with a more uniform distribution over the whole range, but some suggestion of a higher proportion in the lower to central part of the range. The findings were more variable in the case of former smokers (figure 18(c)), in part because in any given series the majority of subjects fell into fewer risk categories than was the case with never smokers. While the overall tendency is for the greatest proportion of subjects to lie in the central categories of risk, it appears that the distribution of risk in former smokers is more sensitive to the composition of the character set than is the case for daily and never smokers. The situation is more extreme in the case of occasional smokers. It can be seen in figure 18(d) that for each character set around 60% or more of subjects fall into a single risk category with the remainder distributed across the remaining categories. While a similar situation was seen with daily smokers, in that case the majority of subjects fell into high risk groups. With occasional smokers, however, the majority groups were scattered across the risk categories, albeit with some preponderance in the second highest. We should note, however, the small number of occasional smokers in the population and that the total number of 282 was less than the mean cluster size (322 for the analysis without replacement of characteristics).
It is thus possible that statistical variation contributed to the apparent sensitivity of the distribution of occasional smokers to the character set used.
We may summarise these findings by suggesting that the distribution of daily and never smokers between risk categories showed some sensitivity to the character set used in clustering but that a common pattern was apparent such that the majority of daily smokers were in high risk categories while never smokers were distributed more evenly across the range. The distribution of former smokers appeared rather more sensitive to the characters used while that of occasional smokers appeared to depend strongly on the particular set of information used in clustering.
What do these findings tell us about the technique of cluster analysis and what do they tell us about the distribution of smoking behaviour? We cannot reach definitive conclusions but we can make some inferences. In doing so, we must first consider our expectations. If the replacement series all gave identical results then we would be forced to conclude either that the clustering was independent of the characters used and that the results obtained derived from some other property of the method or that the characters involved in the replacement series were effectively identical. Either conclusion would force us to question the value of the technique and the results which it generated. Therefore, if the method has utility, some sensitivity to the character set is to be expected. Given this, we can now consider the tentative observation that the distribution of risk in certain categories of smoking behaviour was more consistent across the replacement series than in others. While it is possible that this is the result of some unknown property of the technique, it is more likely that it tells us something about the smoking categories themselves. We could postulate, for example, that daily and never smokers are more coherent categories whose risk status is determined by the majority of the characteristics used and hence less sensitive to changes in the set. On the other hand, former smokers, and particularly occasional smokers, may be less coherent groups which are more variable in themselves. We could speculate, for example, that former smokers vary according to the number of other attributes of behaviour which they changed when they stopped smoking. Likewise, we could suggest that occasional smokers vary according to the circumstances in which they smoke and hence in the other factors which vary according to this form of smoking behaviour.
Finally, to provide a numerical rather than a narrative summary of the results obtained from the sensitivity analysis, figure 8 shows the mean distribution of risk by smoking categories over the whole replacement series. The differences in variability within the smoking categories described above should be kept in mind while examining this plot. It is evident that the great majority of current smokers are at higher risk by virtue of the characteristics associated with them. Risk is more uniformly distributed in never smokers, with a slight bias towards lower values. Former smokers show a similar distribution of risk to that in never smokers. The trend in the means of the highly variable data on occasional smokers shows some resemblance to that in daily smokers.
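The averaging which underlies such a summary can be sketched as follows. The function name `mean_distribution` is hypothetical and the three distributions in the example are invented for illustration; they are not results from this study. The range is returned alongside the mean as a simple reminder of the variability which the mean conceals.

```python
def mean_distribution(series):
    """Average a list of distributions category by category.

    series: list of distributions, one per analysis, each a list giving
    the proportion of subjects in each risk category (lowest risk first).
    Returns (means, ranges), each a list with one value per category.
    """
    n_categories = len(series[0])
    means, ranges = [], []
    for k in range(n_categories):
        column = [distribution[k] for distribution in series]
        means.append(sum(column) / len(column))
        ranges.append(max(column) - min(column))
    return means, ranges


# Invented distributions for one smoking category in three analyses.
example_series = [
    [0.00, 0.00, 0.10, 0.30, 0.60],
    [0.00, 0.10, 0.10, 0.20, 0.60],
    [0.00, 0.05, 0.10, 0.55, 0.30],
]
means, ranges = mean_distribution(example_series)
print([round(m, 3) for m in means])
print([round(r, 3) for r in ranges])
```

In this invented example the two highest categories have similar means but wide ranges, so the mean alone would understate how much the individual analyses differ.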
3.5 Qualitative properties of clusters
While the main objective of the work reported here was to examine the distribution of risk factors between subjects of different smoking status, it is of interest to give brief attention to the overall properties, in terms of how they could be identified in the population, of the clusters identified in the process. As this is not intended to be a comprehensive examination, I restrict my attention to the clusters derived at a distance threshold producing 36 clusters in the analysis in which smoking behaviour was represented by all four categories.
Table 10 contains the same information as given in table 8 but re-ordered to present it in categories of low, intermediate or high risk. These simply represent risk scores of less than seven, greater than seven but less than 14 and 14 or more. The data shown are the mean values of unstandardised data for clusters in which the standardised data differed significantly from the population average.
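The three categories can be expressed as a simple rule, sketched below. The function name `risk_category` is hypothetical. The text does not say how a score of exactly seven is treated; the sketch assumes that it is placed in the intermediate category, and this is an assumption rather than a statement of the rule actually applied.

```python
def risk_category(score):
    """Classify a crude risk score as low, intermediate or high."""
    if score < 7:
        return "low"
    if score < 14:
        # Assumption: a score of exactly seven counts as intermediate.
        return "intermediate"
    return "high"


for score in (3, 7, 13, 14, 20):
    print(score, risk_category(score))
```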
The clusters in the low range of risk, in table 10(a), can be considered by smoking status. The clusters of never smokers show a trend towards representing older women, although two clusters are of younger women and one, of indeterminate age, is predominantly male. Those risk factors contributing to such risk as is present appear to be scattered across the range of characteristics. The single cluster of daily smokers who are at low risk contained 85 subjects. It comprised younger people of either sex who were at risk by virtue of a low intake of carotene and a higher than average intake of caffeine. The single cluster of occasional smokers contained 91 subjects, which is almost a third of all in this category. It represented younger people of either gender with a considerably higher intake of alcohol than average and a higher likelihood of working in low grade jobs. Of the three clusters of former smokers with low risk scores, two contained a high proportion of older women but differed in being above or below average in measures of socioeconomic status. One derived risk from a high intake of cholesterol and the other from a low intake of carotene. The third was also predominantly female and of higher socioeconomic status, but of indeterminate age and with a high intake of caffeine.
Eight of the clusters of intermediate risk (table 10(b)) represented never smokers, generally younger and with a lower preponderance of women than in the low risk group. Two clusters were of lower socioeconomic status. Risk factors occurring in two or more clusters included high caloric intake, low intakes of fibre or carotene, obesity, a high usage of prepared food and a tendency to take less exercise. Of the two clusters of daily smokers, one contained an excess of women, was slightly above average age, had low intakes of fibre and carotene and consumed more caffeine than average. The other cluster was of young people of either sex with exceptionally high intakes of alcohol and prepared food and a high daily caloric intake. They were twice as likely as average to have a manual job. Two of the three clusters of former smokers were above average age with one cluster predominantly female and the other almost exclusively male. The women were sedentary and obese while the men had a high intake of total calories, cholesterol and caffeine. The third group was young, around two thirds male, heavy drinkers with a high caloric intake and likely to have a manual job in a risk industry. They tended to live in dwellings with fewer rooms, although this may be related to their age.
The clusters of people with high risk scores (table 10(c)) comprised four containing never smokers and three each of daily smokers and former smokers. The never smokers included a pair of clusters (1,2,-,-,1; 1,4,3,2,-) which were predominantly female, of low socioeconomic status, with low intakes of dietary fibre and carotene and tending to be less active physically. They differed in age, with the cluster of older women being less likely to have manual jobs and spending more of their leisure time watching television. Both of these differences might be related to age. A third cluster (1,1,-,2,-), comprising younger people of either gender, was broadly similar except that they had high intakes of total calories and cholesterol and were obese. The fourth cluster of never smokers appeared to contain younger men of average or higher socioeconomic status with high intakes of total calories, cholesterol, alcohol, caffeine and prepared food. They were more likely to have manual jobs and were verging on obesity.
One of the clusters of daily smokers was closely similar to the last cluster of never smokers described above (2,-,1,2,1 cf. 1,3,1,2,2). The principal differences were that the daily smokers were less likely to undertake voluntary physical exercise, had an average intake of carotene and were of normal stature. The remaining two clusters of daily smokers contained subjects of lower socioeconomic status, with an average caloric intake but tending to have risky levels of dietary constituents and alcohol. One of these clusters had an excess of women and was undifferentiated by age while the other had no bias in gender but was composed of younger people. The latter cluster was distinguished by a marked tendency for its constituents to hold manual jobs in risky industries.
Two of the clusters of former smokers with high risk scores contained older people and had a predominance of women. They represented subjects of lower socioeconomic status with a more sedentary existence. In one of these, the subjects had a higher body mass index which, although short of the threshold for obesity, was higher than average. The other was differentiated by a lower intake of carotene and fibre. The final cluster of former smokers also contained people of lower socioeconomic status but was largely composed of younger men. Its profile of risk included higher intakes of total calories and cholesterol and a particularly high likelihood of holding manual jobs in risky industries. It bore some relationship to the clusters of young male daily and never smokers who had high risk scores, differing principally in its lower socioeconomic status.
This section has attempted simply to describe the clusters formed in terms of the presence or absence of risk factors and total risk score. I have not attempted to interpret the findings with regard to reasons why particular combinations arise although it is apparent that such information could be used to develop a classification of lifestyles and behaviours in relationship to smoking behaviour and other characteristics. It is generally apparent, however, that clusters are frequently differentiated by age and gender, indicating some broad coherence in the classification. This also suggests, however, that a more detailed examination should correct data for differences in the mean values of characteristics in men and women and the young and the old.
4. SUMMARY AND CONCLUSIONS
It is evident from the literature that smokers tend to differ from non-smokers in a number of aspects of their lifestyle and exposure to environmental and socioeconomic risk factors. This was apparent from a survey of the whole population studied here (figure 2; tables 5 and 6). This suggests that in addition to any risk imparted by their practice, smokers may be at a higher risk of disease by virtue of other factors associated with it. Such information is, however, generally derived from pairwise comparisons of risk factors and tells us little about how these other factors are distributed within a population of smokers. Are the distributions of the other factors independent of each other so that smokers are a relatively homogeneous group with any given individual being more likely to have higher values of some of them? Alternatively, do these other factors tend to aggregate or disperse in individuals so that the population of smokers is heterogeneous, containing several sub-populations with different associations of risk factors? The objective of the work described here was to test the hypothesis that the latter is the case. Cluster analysis was chosen as the vehicle for this work as it is a technique designed to classify members of a population on the basis of a number of characteristics considered simultaneously.
In the preparation of data for cluster analysis, information on smoking was provided in two ways. In the first, smoking behaviour was given the same status as all other characteristics, being represented by a single variable which told whether or not a subject was a daily smoker. Thus, information on occasional, former or never smokers was not provided directly and any discrimination on this basis emerging from the analysis would be the result of other associations within the data set. The second approach was to describe smoking behaviour by four mutually exclusive variables, each representing the presence or absence of one of the four possible categories. This represents a weighting of the analysis towards a classification by smoking status, a process considered justifiable in that it reflected the principal interest in the work.
Both approaches were successful in providing classifications of smokers on the basis of exposure to risk factors. Inspection of the results at different levels of the distance between clusters produced a hierarchy of clusters which showed progressive differentiation not only in smoking behaviour but in the basic demographic characteristics of age and gender, in socioeconomic attributes and in aspects of lifestyle such as diet or extent of physical exercise. We could thus identify, for example, clusters of young male smokers and observe the risk factors which tended to be associated with them.
The two methodological approaches differed in the degree of differentiation of smoking behaviour and the level in the hierarchy of clustering in which it became apparent. When information on smoking status was provided by four variables recording the presence or absence of each category of smoking behaviour, clusters represented largely or exclusively by one category were identified at an early stage in the hierarchy. On the other hand, when the presence or absence of daily smoking was the only relevant information provided, differentiation of clusters by smoking behaviour was less marked, although significant differences in the proportions of smokers in clusters were readily apparent, and tended to occur later in the hierarchy. It is important to note, moreover, that clusters containing higher proportions of occasional, former or never smokers could be identified. Thus, associations between other factors were of sufficient strength to distinguish these types of smoking behaviour.
To summarise the overall exposure of clusters to risk factors a crude risk score was devised. This is based upon whether or not the level of a risk factor in a cluster differed significantly from the mean value in the whole population and upon the magnitude of any difference. The limitations and assumptions in this approach are discussed in section 2.4.2. It is intended simply as a broad indicator of risk without specificity to any disease or category thereof. Its use demonstrated, however, heterogeneity within any category of smoking behaviour such that clusters with higher or lower levels of risk could be identified. It was observed, for example, that while clusters with high proportions of daily smokers did tend to carry a higher level of risk than those comprised predominantly of never smokers, a substantial minority of smokers fell into clusters of lower overall risk.
These are important findings. They suggest that, in addition to any risk which is imparted by smoking, smokers generally bear a higher load of risk factors than never smokers, confirming and amplifying observations based on pairwise comparisons of risk factors. This implies that a proportion of the risk which appears from observational epidemiology to be attached to smoking may be the result of associations with other risk factors rather than any direct effect of smoking. The additional observation that, despite the generally higher exposure of smokers to other risk factors, there exist groups of smokers with lower overall exposure may assist in explaining why not all smokers suffer from the diseases which appear to be associated with their practice.
A similar heterogeneity in crude risk scores was observed within the other categories of smoking behaviour. Thus, while the majority of never smokers tended to bear a lower load of risk factors than did daily smokers, a substantial minority carried a burden of similar magnitude. Former smokers showed a distribution of risk which was more similar to never than to daily smokers, although tending to be slightly higher than the former. Occasional smokers bore more resemblance to daily smokers than did former smokers.
These observations could be derived from both the cluster analysis based on daily smoking only and that taking all categories into account. Nevertheless, given their significance, it is important to make some assessment of the dependence of the results of cluster analysis upon the particular set of characteristics upon which it operated. As discussed in the preliminary report, the choice and treatment of characteristics is a critical aspect of the technique. Moreover, when several analyses using different sets are conducted, there is no rational basis for deciding upon which provided the most appropriate results and any preference must be largely subjective. To provide an assessment of the dependence of results upon the character set used, a sensitivity analysis, involving a systematic replacement of socioeconomic and dietary characteristics by a variable in these categories not hitherto used, was conducted. Comparisons on the basis of the distribution of smoking behaviour and crude risk scores between clusters suggested that different data sets did indeed produce different results but that the majority of these provided the same basic observations as are described above. This was the most satisfactory outcome that could be expected. To have found little difference between data sets would imply that the sensitivity of the technique was limited. On the other hand, wider differences than those observed would rob the findings of generality.
While this discussion has been concerned with the distribution of smoking behaviour and risk scores, the results of the cluster analyses also allow qualitative descriptions of the clusters formed in terms of all the characteristics used. This is amenable to the description and classification of clusters in relation to particular lifestyles and patterns of exposure and hence how smoking behaviour can be related to other aspects of human existence. This has been explored in the present report only to the extent of demonstrating the potential for such interpretation which exists. It remains, therefore, as an area of investigation to be developed as and when a need exists.
In conclusion, the work described here has shown that characteristics related to risk factors for disease are not distributed uniformly between different categories of smoking behaviour but aggregate in various ways into clusters which differ according to the number of risk factors present. In this way it has provided more information than can be obtained from a study of the distribution of the factors in the population as a whole. Thus, while examination at the level of the population suggests that daily smokers tend to be exposed to higher levels of risk factors than never smokers, cluster analysis has demonstrated that clusters of daily smokers at relatively low risk and of never smokers at relatively high risk can be identified within the population. Nevertheless, the majority of daily smokers aggregate into groups with a high exposure to risk factors while most never smokers are at low to medium risk in this sense.
It must be appreciated that these findings apply to a single sample from one country at one point in time. In addition, although the sample was assembled in a statistically robust manner, it was biased towards components of the population thought to be relatively deprived in social and economic terms and the prevalence of smoking was, by historical standards at least, low. It is likely, therefore, that another sample, different in location, time and bias, would provide different results. It might be predicted, however, that while, by analogy with the sensitivity analysis, the results would differ in detail, the general observation of a differential assortment of characteristics by smoking status might still apply.