Characteristics used in clustering and description of clusters
1. INTRODUCTION
In my discussion of the principles underlying cluster analysis in Appendix A, I suggested that the selection of the characteristics upon which cluster analysis is based is the most subjective part of the exercise. I set out various criteria to assist in selection but concluded that while these represented aids to judgement they did not increase the objectivity of the process. My final conclusion was that the best way to justify the inclusion or rejection of characteristics was by a narrative description of the reasons underlying selection. This is presented here.
The number of potential characteristics offered by the NHANES data set is large and the process of selection proceeded of necessity in stages. The first step was the decision that the characteristics to be used in this exercise should be risk factors which are external to the subject and determined by their behaviour and environment although, as will be seen below, two exceptions to this rule were allowed. This led to the early rejection of physiological and biochemical information and concentration on data obtained by questionnaire and related material recorded in the Mobile Examination Centres. From this, the following categories were examined in further detail:
| alcohol |
| demographic characteristics |
| diet * |
| dietary intake * |
| housing |
| occupation |
| physical activity |
| sexual behaviour |
| smoking |
| weight examination |
After further inspection, information on sexual behaviour was dropped from consideration because of the large amount of missing data. On the other hand, characteristics included in the category of 'weight examination' were considered for use because, despite their being internal rather than external factors, obesity is currently considered to be a risk factor for several chronic diseases.
2. CHARACTERISTICS
2.1 Demographic
2.1.1 Characteristics used in cluster analysis
Age (AGEY): A fundamental property of an individual and also a risk factor for chronic disease. Tends to be correlated with socioeconomic and particularly occupational characteristics. Truncating the range to exclude retired people could be considered for further work.
Gender (GEND): A fundamental property of an individual and also a risk factor for chronic disease. Some correlation with dietary variables and a negative correlation with current smoking.
Race (RACE): The ethnic origin of people has been found to be a risk factor for disease in many epidemiological investigations. While some of this apparent risk may be attributable to genetic differences, it seems likely that part of the relationship with disease may be associated with socioeconomic differences in a particular population.
This characteristic was included as a substitute for socioeconomic factors in sensitivity analysis.
2.1.2 Characteristics rejected for use in cluster analysis
Marital status has been considered to be a risk factor for a number of diseases. Its use in this exercise was rejected because of the high frequency of missing data.
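Rejection on the grounds of missing data, as applied here to marital status, could be screened along the following lines. This is a minimal sketch and not the procedure actually followed: the 10% cut-off (max_missing), the function names and the toy values are all assumptions made for illustration, since no numerical threshold is stated in this appendix.

```python
def missing_fraction(values):
    """Return the share of entries that are missing (None)."""
    if not values:
        raise ValueError("no observations supplied")
    return sum(v is None for v in values) / len(values)


def screen_by_missingness(columns, max_missing=0.10):
    """Split variable names into (kept, rejected) by missing-data share.

    max_missing is a hypothetical cut-off chosen for illustration only.
    """
    kept, rejected = [], []
    for name, values in columns.items():
        target = rejected if missing_fraction(values) > max_missing else kept
        target.append(name)
    return kept, rejected


# Invented toy data: None marks a missing response.
toy = {
    "AGEY": [34, 51, 67, 45, 29],
    "marital_status": [1, None, None, 2, None],
}
kept, rejected = screen_by_missingness(toy)
```

With these invented values AGEY is kept and marital_status is rejected. In practice the decision described above rested on judgement rather than on a fixed threshold.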
2.2 Diet and dietary intake
2.2.1 Used in cluster analysis
Carotene intake (CARO): the intake inferred for the subjects in NHANES was for total carotene and did not distinguish between the α and β forms.
Results from observational epidemiology suggest a negative association between intake of β-carotene and incidence of lung cancer, although intervention trials gave contradictory results. Carotene intake may be a surrogate for dietary intake of vegetables, for which no direct data were available for all subjects. This conjecture is supported to some extent by correlations with intakes of fibre and vitamin C. Of the many micronutrients whose intake by subjects was inferred, carotene was selected for use because of its relatively direct associations with specific diseases.
Correlated strongly with intake of vitamin C, less strongly with fibre intake and weakly with job status and, inversely, with smoking behaviour.
Cholesterol intake (CHOL) is thought to be a risk factor for atherosclerosis and associated cardiovascular disease, although the evidence is not consistent. Current opinion is that the intake of fatty acids of different degrees of saturation and body levels of different fractions of lipoproteins may be more appropriate as markers of cardiovascular risk. I chose to use cholesterol, however, because of the breadth of the related literature, because inferred levels of other measures may have been less accurate and because the risk associated with such other measures may be properly expressed by their relative rather than their absolute levels. To have used such ratios as characteristics might have amplified the inaccuracies already implicit in estimated intakes.
In that public health activity recommends a low cholesterol intake, low values in the population could be confounded to some extent with aspirations to a healthy lifestyle.
Weak correlations with age, gender, employment status, alcohol intake and smoking.
Dietary fibre intake (FIBE): Low dietary fibre is considered to be a risk factor for a number of chronic diseases of the digestive and circulatory systems.
This characteristic may be correlated strongly with measures of gross food intake, less strongly with carotene and vitamin C intakes and weakly with some measures of exercise and BMI.
Total caloric intake (KCAL): Not generally regarded as a risk factor for disease in itself, this characteristic was included as an indicator of gross food intake. A body of results from animal experiments and some limited data from humans suggested that a restricted caloric intake is associated with an increase in life span and a reduced risk of certain cancers.
Total caloric intake was correlated, generally strongly, with other dietary factors and also with age, gender, some socioeconomic factors, exercise and smoking.
Vitamin C intake (VITC): as a component of antioxidant defence systems, a low intake might be a risk factor for chronic disease. It was also chosen as an indirect surrogate for the consumption of fruit and vegetables, as direct data for such dietary behaviour were not available for all subjects. This conjecture is supported to some extent by correlations with intakes of fibre and carotene.
Highly correlated with carotene intake, less strongly with fibre intake and weakly with gross nutritional factors, age, some socioeconomic factors, caffeine intake and, inversely, with smoking.
This characteristic was included as a substitute for other dietary characteristics in sensitivity analysis.
2.2.2 Rejected for use in cluster analysis
Total daily intakes of protein, carbohydrate and fat were considered for use but rejected because of their very strong correlation with total caloric intake. Various measures of saturated and unsaturated fat are available but were not used for the reasons given above. Information on frequency of eating meat and poultry, the distinction presumably lying between mammalian and avian flesh, was available and considered for use in that it might identify vegetarian subjects. The data are, however, almost invariant, with very few subjects not eating one or both, so that either vegetarians were rare in the population or these characteristics were not effective indicators. Information on consumption of milk was considered, partly because of published correlations with the incidence of lung cancer and tuberculosis, but this characteristic was ultimately rejected because of lack of relevance and possible confounding with rural:urban residence.
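The rejection of protein, carbohydrate and fat for their very strong correlation with total caloric intake can be illustrated by a simple redundancy check. The sketch below assumes that the Pearson coefficient is an adequate measure of correlation and uses a hypothetical threshold of 0.8. Neither assumption is taken from this appendix, and the intakes shown are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def redundant_with(reference, candidates, threshold=0.8):
    """Names of candidates whose |r| with the reference exceeds threshold.

    threshold is a hypothetical value used for illustration only.
    """
    return [name for name, values in candidates.items()
            if abs(pearson(reference, values)) > threshold]


# Invented toy intakes for six subjects.
kcal = [1500, 1800, 2100, 2400, 2700, 3000]
candidates = {
    "fat_g": [50, 62, 75, 83, 98, 110],     # rises steadily with KCAL
    "vitc_mg": [90, 40, 120, 60, 30, 100],  # no clear relation to KCAL
}
flagged = redundant_with(kcal, candidates)
```

Here only the invented fat intake is flagged as redundant with total caloric intake, which mirrors the reasoning given above without reproducing the actual correlations.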
The database contains remarkably little information on consumption of fish or fruit and vegetables. Direct assessments were, for unknown reasons, made only for those over 65. Intakes of carotene, fibre and vitamin C were considered to be partial substitutes for fruit and vegetable consumption.
The dietary information obtained by questionnaire included subjects' responses to a question asking how frequently they ate meals prepared in a restaurant. This was included as a descriptive variable under the category of lifestyle (q.v.).
2.3 Exercise
2.3.1 Used in clustering
Moderate physical exercise (MODE): physical exercise is considered to be a protective factor against a number of chronic diseases.
The variable used for clustering was that described in NHANES as moderate physical exercise and was scored as present if subjects had undertaken activity which led to 'light sweating or a slight to moderate increase in breathing or heart rate' in the 30 days prior to interview. I chose this in preference to the category of 'vigorous' physical exercise because the numbers of subjects reporting moderate levels of exercise provided a closer approximation to a 20%:80% partition within the population.
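The choice between the two exercise variables on the basis of their closeness to a 20%:80% partition can be sketched as follows. This is an illustration under stated assumptions only: the text does not say which side of the partition the positive responses fell on, so the sketch compares the share of the less common category with 20%, and the responses shown are invented.

```python
def minority_share(flags):
    """Share of the less common category in a 0/1 coded variable."""
    if not flags:
        raise ValueError("no observations supplied")
    share = sum(flags) / len(flags)
    return min(share, 1 - share)


def closest_to_partition(candidates, target=0.20):
    """Name of the candidate whose minority share is nearest to target."""
    return min(candidates,
               key=lambda name: abs(minority_share(candidates[name]) - target))


# Invented responses for ten subjects (1 = exercise reported).
candidates = {
    "moderate": [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # 20% yes, 80% no
    "vigorous": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # 10% yes, 90% no
}
chosen = closest_to_partition(candidates)
```

With these invented responses the moderate category is selected, since its split lies closer to 20%:80% than that of the vigorous category.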
A moderate negative correlation with age is unsurprising. Modest correlations with education and income and, perhaps by association, with various occupational factors were observed. There was a modest negative correlation with alcohol consumption but no correlation with smoking.
It is possible that an important, but hidden, correlation could arise if less healthy subjects took less exercise because they were physically incapable of doing so. The results provided by NHANES included the category 'unable to do activity'. Having determined that the frequency of such subjects did not vary by smoking status (main report, figure 2) I did not differentiate between subjects responding thus and those who simply stated that they did not undertake this form of exercise. Note too that NHANES also recorded a higher level of voluntary physical exercise than that discussed here. Thus some of the subjects who reported that they did not indulge in moderate physical exercise may in fact have fallen into this category. The effect of this would be to weaken the influence of this variable in the cluster analysis.
Daily activity (ACTI): a measure of the extent of activity in daily life, ranging from sedentary to climbing or heavy lifting, and presumably including activity both at work and at leisure. Included to identify inactive rather than active subjects and in an attempt, albeit indirect, to include some estimate of physical exercise other than that taken, to a greater or lesser extent, for its own sake.
The observed correlation with job status was predictable. Also modest correlations with age and gender. Weak positive correlations with smoking and alcohol consumption observed. These might result from correlation between these factors and more active jobs.
Time spent watching television (SOFA): The definition also included recreational use of computers. This characteristic was included as an indicator of a sedentary lifestyle.
Weak negative correlations with occupational factors probably reflected the likelihood that retired or unemployed subjects watched more television.
2.3.2 Not used in clustering
Walking or cycling to work or on errands: a measure of the frequency of such activity, included in an attempt to select subjects who chose to walk or cycle rather than use motorised transport.
Rejected because of possibility of confounding with socioeconomic factors, type of employment and urban:rural location in terms of distance from home to workplace.
Weakly correlated with daily activity and extent of vigorous physical exercise.
2.4 Lifestyle
2.4.1 Used in clustering
Alcohol consumption (ALCO): is a risk factor for certain forms of cancer and, in a biphasic manner, for cardiovascular disease.
This attribute recorded subjects' estimates of the average number of alcoholic drinks which they had taken daily over the previous year. This was used in preference to an estimate of daily consumption, calculated as grams of alcohol and derived from the details of the subjects' intake of food and drink on the previous day, as the latter was regarded as less direct and potentially atypical. Information on the forms of alcoholic beverage preferred by the subject was available but not used.
Correlated weakly with age, gender and some dietary and occupational variables. Also moderately correlated with smoking behaviour.
Smoking is reported to be a risk factor for certain cancers and for some forms of cardiovascular and respiratory disease.
It was recorded by NHANES in four categories, smoked cigarettes daily (SDAI), smoked cigarettes but less frequently than daily (SOCC), previously smoked cigarettes but no longer did so (SFOR) and had never smoked cigarettes (SNEV).
As described in the main text, one cluster analysis was performed using SDAI as the only measure of smoking and a second using all four, the latter conducted to introduce a deliberate weighting in clustering towards smoking behaviour by allowing the direct identification of each category.
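The difference between the two analyses can be pictured as a difference in how a subject's smoking category is turned into clustering variables. The sketch below assumes one 0/1 indicator per category, which is consistent with the binary coding used elsewhere but is not spelled out here. The function name and the daily_only flag are hypothetical.

```python
SMOKING_CODES = ("SDAI", "SOCC", "SFOR", "SNEV")


def smoking_indicators(status, daily_only=False):
    """Return 0/1 indicators for one subject's smoking category.

    With daily_only=True only SDAI is returned, mirroring the first
    analysis; otherwise all four categories get their own indicator.
    """
    if status not in SMOKING_CODES:
        raise ValueError(f"unknown smoking status: {status!r}")
    codes = ("SDAI",) if daily_only else SMOKING_CODES
    return {code: int(status == code) for code in codes}


# A former smoker under each of the two schemes.
former = smoking_indicators("SFOR")
former_daily_only = smoking_indicators("SFOR", daily_only=True)
```

Under the first scheme a former smoker is indistinguishable from an occasional or never smoker (SDAI = 0 in each case), whereas the second scheme gives each category its own variable and so adds weight to smoking behaviour in the clustering.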
Information on daily cigarette consumption for current and former smokers is available in NHANES but was not used. This was because to have done so would, in effect, have introduced a second, indirect, measure of daily or former smoking. Information was also available on duration of smoking and, for former smokers, time since cessation but was not used for similar reasons.
Modest or weak correlations found with gender, alcohol and caffeine consumption and, inversely, with body mass index.
Caffeine consumption (CAFF): derived by NHANES from reported consumption of tea, coffee and specific carbonated beverages, this has been reported as a risk factor for certain forms of cancer. It is included here, however, primarily as an indicator of a particular form of lifestyle. This is supported by correlations with a high frequency of eating food prepared outside the home, a high caloric intake and smoking.
Prepared food (PREP): included in the dietary questionnaire, this attribute recorded how often subjects ate food prepared in a restaurant, be it in the restaurant, at home or elsewhere. This has not, to my knowledge, been recorded as a risk factor for disease although it could, if the majority of cases referred to fast food restaurants and those providing meals for consumption off the premises, be an indicator of a particular quality of diet. It is included here, however, as an indicator of a lifestyle favouring such food either in main meals or as additional intake between these.
Shows a weak negative correlation with age and weak positive correlations with education, income, total caloric intake, caffeine consumption and occupational variables.
2.5 Occupation
Note that in my attempts to achieve a balance between broad categories of risk factor, occupational characteristics were included under the socioeconomic heading.
Occupational characteristics tended, as was predictable, to be inversely correlated with age in that older subjects who had retired from work were allocated a null score in such variables. Means of correcting for this were considered but rejected as necessitating assumptions about retirement age and being incompatible with the binary coding used in the analysis.
Similarly, unemployed subjects would carry a null score. Employment status was recorded by NHANES as a separate variable and I considered corrections on this basis but rejected this too on grounds similar to those above.
2.5.1 Used in clustering
Job category (JOB2): refers to the status of the subjects' jobs, ranging from manual to professional, which has been reported to be a risk factor for a number of chronic diseases.
This was recorded by NHANES by allotting each employed subject a code from a list available to those administering the questionnaires used in the study. I selected from this list those types of job which I considered to be essentially manual, as opposed to skilled, managerial or professional and coded subjects reporting these as 1 and all others as 0. These are shown in table 1 of the main report. Thus a binary coding was imposed directly for this variable rather than by reference to the distribution of values over the whole population as done for the majority of other variables.
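The direct recoding described above amounts to testing each subject's job code for membership of a chosen list. The following sketch illustrates this. The numeric codes are hypothetical placeholders and not those of table 1 of the main report, and the treatment of a missing code as 0 is an assumption.

```python
# Hypothetical occupation codes treated as manual; the real list is
# the one given in table 1 of the main report.
MANUAL_JOB_CODES = frozenset({31, 32, 35, 38})


def recode_job(code, manual_codes=MANUAL_JOB_CODES):
    """Return 1 for a manual job and 0 for any other code (or none)."""
    return int(code in manual_codes)


# Invented codes for five subjects; None stands for no recorded job.
job_codes = [31, 7, None, 38, 12]
job2 = [recode_job(code) for code in job_codes]
```

The cut between 1 and 0 is thus fixed by the list itself and does not shift with the distribution of job codes in the population.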
Analysis of clusters suggests a possibly anomalous classification of farmers and other self-employed subjects who may have been allotted a high job status regardless of the scale or prosperity of their enterprise. This anomaly, if it exists, was not introduced by the recoding of job status described in table 1.
Job category was strongly correlated with the other occupational characteristics described here. Moderate correlation with age and weaker correlations with intake of calories and protein, frequency of eating prepared food, alcohol consumption and some measures of physical activity.
Industry (IND2): Certain categories of industry may, on the basis of exposure to a hazardous environment, present a risk factor for disease.
The original data from NHANES was coded according to a list available to interviewers. I recoded the data according to my opinion on whether an industry was or was not likely to involve some exposure to hazardous materials or practices. The basis for recoding is shown in table 1 of the main report. As in the case of JOB2, this process imposed a binary coding on the data.
This attribute was strongly correlated with other occupational factors and moderately related to age. Weaker correlations with education, income, frequency of eating prepared food and measures of physical activity.
2.5.2 Rejected for use in clustering
A number of variables recorded by NHANES provided, at least in principle, more direct measures of potential exposure to hazardous substances or conditions. These included the use and nature of protective equipment and exposure to noise. These were considered for use but not selected on the grounds of some equivalence to manual occupations and the small numbers of subjects providing positive responses.
2.6 Socioeconomic
2.6.1 Used in clustering
Education (EDUC): The amount of education received by a person is considered to be a risk factor for chronic disease. This was coded by NHANES in terms specific to the educational system in the United States but was readily converted into low, medium or high levels of educational attainment.
Moderately correlated with income, intake of calories and protein, certain occupational factors and the likelihood of taking voluntary physical exercise.
Income (INCO): low income, and other aspects of poverty, are regarded as risk factors for many forms of chronic disease.
Moderately correlated with education, intake of calories and protein, housing, certain occupational factors and vigorous exercise.
Housing (HOU2): poor quality or overcrowded housing is a risk factor for certain respiratory diseases and, more generally, for chronic disease. Ideally some measure of density of occupation of a dwelling, in terms of the number of people per room or number of rooms per person would have been used but this was not available in the NHANES data and attempts to estimate this were thwarted by seemingly obscure distinctions between the number of people and the number of households occupying a dwelling. The best indicator available was the number of rooms in a dwelling, a small number being deemed riskier than a large number. The correlations observed with education and income were predictable while a correlation with the level of voluntary physical exercise was also observed.
2.6.2 Rejected for use in clustering
NHANES contains a variable labelled poverty which expresses income relative to needs as assessed by a standard formula. While potentially useful, a high frequency of missing data argued against its use.
The age of the subject's dwelling was available as a variable but was rejected because of the possibility that older dwellings could range from poor to high quality according to location and type. The type of dwelling, be it a detached house, a terraced house, a tenement, a static caravan or a hostel, was not used because of the likelihood of confounding with income and urban or rural location.
A number of variables referred to the state of interior decoration. These might have provided a good measure of housing quality but, as they were intended to assess exposure of children to lead derived from paint, were collected only for subjects under the age of five. Information on the type of water supply available to a household could also have been of value but was rejected because of the possibility of confounding with urban or rural location.
2.7 Stature
2.7.1 Used in clustering
Body mass index (BMIX): is regarded as a risk factor for many forms of chronic disease.
This variable was found to be weakly correlated with smoking behaviour but, perhaps surprisingly, not with any of the nutritional variables.
2.7.2 Rejected for use in clustering
The basic measures of height and weight were not used because they were components of BMIX and because of confounding with gender.
Other variables of potential value in assessing extremes in bodily stature, such as waist circumference and various measures of skinfold thickness, were rejected because of high frequencies of missing data.
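The dependence of BMIX on height and weight follows from the conventional definition of body mass index as weight in kilograms divided by the square of height in metres. The sketch below states that definition. Whether BMIX was taken as supplied by NHANES or computed from the basic measures is not stated here, and the function name is hypothetical.

```python
def body_mass_index(weight_kg, height_m):
    """Body mass index: weight in kilograms over height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


# Invented example: a subject weighing 70 kg with a height of 1.75 m.
example = body_mass_index(70.0, 1.75)
```

Because both measures enter the index directly, including either alongside BMIX would have counted the same information twice.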