1. Introduction

In the introductory section of this site concerning 'Multifactorial Epidemiology in the Population' I suggested that, while it is accepted that the incidence of most chronic diseases is affected by several aetiological agents, the manner of their interaction is seldom, if ever, addressed by epidemiologists.  Of particular importance is the possibility that if the frequency in a population of one risk factor affects that of others it becomes inappropriate to treat the influence of any one of them in isolation when attempting to understand how rates of disease in a population are determined.

Although I believe that the case for multifactorial interactions as described in my paper is a strong one, it is difficult to make the transition between advancing a theoretical case and demonstrating that this has practical significance.  While it is possible to extract some supporting data from the literature, and this will be done in a later edition of the site, a formal test of the hypothesis would require either a study designed for the purpose or, at the very least, access to the raw data from work which included sufficient breadth of information.

As an intermediate step in the investigation of how multifactorial considerations may have a practical effect upon the outcome of epidemiological investigations, I have constructed a numerical model which simulates the interaction of risk factors in a population generated within the memory of a computer.

Before describing the model and its application, it is necessary to consider the limitations of modelling as an interface between the theoretical and the practical.  In much the same way as practicing epidemiologists concentrate their attention on factors which are of interest to them, and hence create the difficulties which we are seeking here to resolve, modellers base their approaches on a limited and narrow set of attributes which they consider, correctly or otherwise, to be the essence of the problem.  In addition, any model contains approximations and assumptions which are necessary for its operation but which may represent some degree of departure from reality.  It is thus possible that some of the results obtained are properties of the operation of the model and not of the situation which it seeks to simulate.  Distinguishing between meaningful and spurious findings is difficult.  Like so many activities in science, it requires judgment and hence a degree of subjectivity.  Thus, it is tempting to assume that results which satisfy the predictions which the model set out to demonstrate are real, that findings which were unexpected but explicable in terms of the real world support the validity of the model and that unexpected and inexplicable results represent a particular aspect of the functioning of the model which does not compromise its broader utility.  Such difficulties can be minimised only by avoiding over-interpretation and the temptation to attribute definitive significance to the findings of a model.

There are at least two categories of model.  Mathematical, or analytical, models propose a defined mathematical relationship between variables and explore the implications of this.  The relationship may be derived from theoretical considerations or fitting experimental data or both and such models may be used to test mechanistic hypotheses or to extrapolate beyond the range of data obtained in experimental work.  The relationships between rates of lung cancer in the United States and various aspects of smoking behaviour described by Lee, Forey and Gori (2006) provide an example of this.

The model described here is not of this type.  It is purely empirical and simulates a population of individuals described in terms of their exposure to a number of risk factors, makes simple assumptions about how these affect the presence or absence of disease and analyses the data as in a conventional epidemiological study.  It does not require mathematical assumptions or equations and taxes the host computer with no more than simple arithmetic operations.  Its most important attribute is, however, that it contains a stochastic element, whereas most analytical models are deterministic in that when the values for the parameters in an equation are set, only one outcome is possible.  This is in marked contrast to the real world where seemingly chance events intervene between predictions and actual events.  The stochastic component of the model thus satisfies one of the conditions of the real world and the variability in results demonstrated by the model provides useful information in itself, although the danger of over-interpretation must be kept in mind.  In this respect, the model permits much wider replication of studies than is possible in practice; the results presented here are based on several thousand simulations.

2. The Model

2.1 Basis

A full description of the nature and function of the model is given in an Appendix and it is sufficient here to describe the underlying principles and how they are put into effect.

Following from my essay on 'Multifactorial epidemiology in the population' the model explores two propositions concerning the distribution and action of risk factors.  Firstly, exposure to certain risk factors may increase the likelihood that the individuals concerned are exposed to other risk factors.  Secondly, at least part of the biological effect of some risk factors may be common to others so that all of these to which an individual is exposed may act in concert by making incremental contributions to this effect.

The model approaches the first of these propositions by acknowledging that risk factors capable of quantitative expression are not binary entities which are simply present or absent but ranges of values.  Risk applies when the value in an individual exceeds or, depending upon the factor in question, falls below some critical value.  The precise value of this threshold is seldom, if ever, known and the model follows conventional practice in observational epidemiology by partitioning the range into risky and non-risky parts.  This corresponds to postulating, for example, that individuals in the highest 10% of the range of daily alcohol consumption or the lowest 20% of daily carotene intake are, in some way, at risk of disease.  This being the case, the term risk factor refers strictly to that part of the range of variation in an attribute which presents a risk for disease.

As this website uses smoking as the central risk factor in its discussion, I should point out here that smoking is frequently regarded as a binary rather than a quantitative variable, although expression through, for example, daily or cumulative exposure can provide a quantitative expression.  Moreover, the observation that only a proportion, usually small, of smokers develop any given disease, implies that an appropriate means of quantification is lacking, that the appropriate measure varies according to the disease in question, that the magnitude of effect is conditional upon other factors, or all of these.

The second proposition upon which the model is based is that risk factors can contribute to common modes of biological effect in the development of disease.  I will discuss the evidence supporting this view in a later edition of this site.  It would, however, be a gross oversimplification to assume that the fulfilment by assorted risk factors of one category of biological effect would be a sufficient cause for disease.  On the other hand, satisfaction of a single biological criterion could well be equivalent to the satisfaction of a limiting factor in the process and as such represent an important and discrete component of the risk for that disease.  The results discussed below are presented on this basis: they describe the risk of a necessary, but not inevitably a sufficient, cause for disease.  I should add that the model can accommodate a second type of risk, representing a different kind of biological effect.  Introduction of this does not affect the qualitative findings and, for the sake of simplicity, results are not presented here, but this aspect of the model is discussed in the Appendix.

2.2 Operation

The model simulates a population of 10,000 people each of whom is represented by a value for six discrete risk factors.  On initiation of the model, the variables corresponding to these risk factors are populated by a random number ranging within defined limits. The use of random numbers introduces a stochastic element into the model while the question of the statistical distribution of these values is discussed in the Appendix.  One of the simulated risk factors, termed the Master Factor, may influence the risk associated with the remainder, the Other Factors.  The presence or absence of the Master Factor in an individual is determined by the value of the random number which represents it.  If this value exceeds a threshold set while initialising the model, the Master Factor is deemed to be present. Otherwise, it is absent.  The proportion of the population thus identified corresponds to the prevalence of the Master Factor.

If an individual is exposed to the Master Factor, the values representing the Other Factors are multiplied by a loading factor.  This is also set on initiation of a particular application of the model and can be set independently for each Other Factor so that the extent of loading applied, if any, can be varied at will.  If, on the other hand, an individual is not exposed to the Master Factor, the values for the Other Factors remain unchanged.  As a result of this process, the values of Other Factors in individuals exposed to the Master Factor are increased to an extent corresponding to the loading applied, thus increasing the likelihood that any given value will exceed the threshold for risk described below.  This is equivalent to increasing the prevalence of each of the Other Factors when the Master Factor is present.
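The population-generation and loading steps can be sketched as follows.  This is an illustrative sketch only, not the code of the model itself: the function names, the use of uniform random numbers on [0, 1), the prevalence of 0.3 and the risk threshold of 0.8 are all assumptions made for the example (the actual statistical distribution is discussed in the Appendix).

```python
import random

def generate_population(n_people=10_000, n_other=5, master_prevalence=0.3,
                        loadings=None, seed=None):
    """Return a list of (exposed, other_values) tuples, one per individual.

    Assumption: every factor is drawn uniformly from [0, 1).
    """
    rng = random.Random(seed)
    if loadings is None:
        loadings = [1.0] * n_other           # a loading of 1.0 has no effect
    master_threshold = 1.0 - master_prevalence
    population = []
    for _person in range(n_people):
        # The Master Factor is present when its random value exceeds the threshold.
        exposed = rng.random() > master_threshold
        others = [rng.random() for _factor in range(n_other)]
        if exposed:
            # Each Other Factor may carry its own loading.
            others = [v * load for v, load in zip(others, loadings)]
        population.append((exposed, others))
    return population

def risky_fraction(population, exposed, threshold=0.8):
    """Fraction of Other Factor values above the risk threshold in one exposure group."""
    values = [v for e, others in population if e == exposed for v in others]
    return sum(v > threshold for v in values) / len(values)

pop = generate_population(loadings=[1.2] * 5, seed=1)
print("risky fraction, exposed:  ", risky_fraction(pop, exposed=True))
print("risky fraction, unexposed:", risky_fraction(pop, exposed=False))
```

Under these assumptions a loading of 1.2 should raise the risky fraction of each Other Factor from about one fifth in unexposed individuals to about one third in exposed individuals, which is the sense in which loading is equivalent to an increase in prevalence.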

It should be noted that in the real world interactions in prevalence between risk factors may be reciprocal and differ between any given pair of risk factors.  For the sake of simplicity, and to avoid technical difficulties with recursive loops, the model described here assumes that the Master Factor can affect the Other Factors but not vice versa. 

In the next stage of operation, the model inspects the values for each factor, including the Master Factor and compares them with a threshold value.  If a particular value exceeds the threshold that factor is deemed to contribute a unit of risk to the individual.  The value of this threshold, which establishes the proportion of the range of possible values considered to be risky, and the size of the unit of risk can each be set independently for each factor.  The units of risk present in each individual are then summed and compared with a value set on initiation of the model which determines the total risk considered necessary for a case of disease to be present.
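The thresholding and summation described above amount to very simple arithmetic, as the following sketch shows.  The function names and the numerical values are hypothetical, and the assumption that a case arises when the summed risk equals or exceeds the required total (rather than strictly exceeding it) is mine for the purposes of illustration.

```python
def total_risk(values, thresholds, units):
    """Sum the units of risk of every factor whose value exceeds its threshold."""
    return sum(unit
               for value, threshold, unit in zip(values, thresholds, units)
               if value > threshold)

def is_case(values, thresholds, units, risk_needed):
    """Assumption: disease is present when the total risk reaches risk_needed."""
    return total_risk(values, thresholds, units) >= risk_needed

# One hypothetical individual: Master Factor first, then five Other Factors.
values = [0.95, 0.50, 0.85, 0.20, 0.90, 0.10]
thresholds = [0.8] * 6            # top 20% of each range treated as risky

units_sm1 = [1, 1, 1, 1, 1, 1]    # Master Factor carries a unit of risk
units_sm0 = [0, 1, 1, 1, 1, 1]    # Master Factor carries no risk of its own

print(total_risk(values, thresholds, units_sm1))            # 3
print(is_case(values, thresholds, units_sm1, risk_needed=3))  # True
print(is_case(values, thresholds, units_sm0, risk_needed=3))  # False
```

Because both the threshold and the unit of risk are held per factor, either can be varied for one factor without touching the others.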

At this stage, the number of cases in subjects exposed or not exposed to each of the risk factors is computed and estimates of risk calculated in the conventional epidemiological manner.  The model assumes an ideal population in which age, duration of exposure to risk factors and other characteristics which would require standardisation in the real world are the same for all individuals.  Thus rates of disease in the exposed and unexposed populations can be calculated and their ratio, representing the relative risk for each factor, derived.  It is also possible to calculate the relative prevalence of each risk factor in cases and non-cases and hence estimate odds ratios in the manner of a case control study.  This can be done on either the entire population or random subsets of predetermined size.  As, however, the results of case control analyses do not, in the present context, provide any information additional to that obtained from the analogue of a cohort study, they will not be considered here.
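For readers unfamiliar with these calculations, the standard definitions of relative risk and odds ratio can be written out as below.  The function names and the counts in the example are invented for illustration; returning None when either group contains no cases is an assumption about how an incalculable estimate might be flagged.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Ratio of the disease rate in exposed to that in unexposed subjects.

    Returns None when either group has no cases, since no usable
    estimate can then be formed.
    """
    if cases_exposed == 0 or cases_unexposed == 0:
        return None
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)

def odds_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Odds of exposure in cases divided by odds of exposure in non-cases."""
    noncases_exposed = n_exposed - cases_exposed
    noncases_unexposed = n_unexposed - cases_unexposed
    if 0 in (cases_exposed, cases_unexposed, noncases_exposed, noncases_unexposed):
        return None
    return (cases_exposed * noncases_unexposed) / (noncases_exposed * cases_unexposed)

# Invented counts: 30 cases among 300 exposed, 20 cases among 700 unexposed.
print(relative_risk(30, 300, 20, 700))   # approximately 3.5
print(odds_ratio(30, 300, 20, 700))      # approximately 3.78
```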

The results shown below are the means of 30 replicates on each set of conditions.  This level of replication was found sufficient to achieve stable values of the means (see Appendix).

In summary, therefore, the model provides estimates of the risk of disease associated with each of six factors which contribute units to some biological effect which represents a limiting factor in the development of disease.  More importantly, it allows the situation in which the presence of one risk factor affects the prevalence of others to be simulated.
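The whole sequence, from generation of the population to the mean relative risk for the Master Factor, can be sketched in a few dozen lines.  This is a simplified illustration and not the model itself: uniform random values, a single loading shared by all Other Factors, the treatment of exposure to the Master Factor as directly contributing its unit of risk, and every default value shown (prevalence 0.3, risky proportion 0.2, two units of risk needed for a case) are assumptions chosen for the example.

```python
import random

def simulate_once(rng, n_people=10_000, n_other=5, master_prevalence=0.3,
                  loading=1.1, risky_proportion=0.2, master_unit=1,
                  risk_needed=2):
    """Simulate one population; return the relative risk for the Master Factor.

    Returns None if either exposure group contains no cases.
    """
    risk_threshold = 1.0 - risky_proportion
    cases = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for _person in range(n_people):
        exposed = rng.random() < master_prevalence
        multiplier = loading if exposed else 1.0
        risk = master_unit if exposed else 0
        for _factor in range(n_other):
            # Each Other Factor above its threshold adds one unit of risk.
            if rng.random() * multiplier > risk_threshold:
                risk += 1
        totals[exposed] += 1
        if risk >= risk_needed:
            cases[exposed] += 1
    if cases[True] == 0 or cases[False] == 0:
        return None
    return (cases[True] / totals[True]) / (cases[False] / totals[False])

def mean_relative_risk(n_replicates=30, seed=None, **settings):
    """Mean over the replicates that gave an estimate, and how many did so."""
    rng = random.Random(seed)
    estimates = [simulate_once(rng, **settings) for _rep in range(n_replicates)]
    valid = [e for e in estimates if e is not None]
    mean = sum(valid) / len(valid) if valid else None
    return mean, len(valid)

# Reduced population and replication, purely to keep the example quick to run.
for loading in (1.0, 1.1, 1.2):
    for master_unit in (0, 1):
        rr, used = mean_relative_risk(n_replicates=10, seed=42, n_people=2000,
                                      loading=loading, master_unit=master_unit)
        print(f"L {loading}  SM {master_unit}  mean RR {rr:.2f}  ({used} replicates)")
```

With a loading of 1.0 and no unit of risk attached to the Master Factor, a sketch of this kind should return a relative risk close to unity; the particular numbers it prints otherwise have no significance beyond the assumptions built into it.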

3. Results

Before discussing the results obtained from the model, it is necessary to reinforce my warnings about the limits in interpretation of this or any other model.  That described here was designed to provide a preliminary evaluation of a hypothesis concerning the interaction of risk factors in the development of disease.  A demonstration that some interaction may occur is all that it seeks.  While the results are necessarily expressed quantitatively, the actual values of the estimates of risk which the model generates are of no significance outside the model itself and it is only relative changes in these values which can be taken to have any wider significance.  The results presented are based on systematic changes in the default values of parameters which were established at an early stage in the development of the model.  No attempt has been made to adjust these to obtain values resembling any situation in the real world nor would any such effort be justifiable.

3.1 Effects of loading the likelihood of risk of Other Factors

As described above, the Other Factors in the model are loaded, by an increase in the value of the number representing their potential risk, in individuals exposed to the Master Factor.  It is thus appropriate to begin our examination of the results of the model with a description of the effects of changing this load.

Figure 1 shows the effects of increasing the loading from a value of 1.0, which has no effect upon the Other Factors, progressively to a value of 1.5, at which stage estimates of risk associated with the Master Factor are showing signs of an exponential increase.  The simulations were conducted under two conditions.  In the first, designated SM 0, the size of the unit of risk associated with the Master Factor was set at zero.  In this condition, the Master Factor contributes no risk to the model in its own right.  In simulations designated SM 1 the unit risk of the Master Factor was unity, the default value and the same as applied to the Other Factors.

It can be seen that even when the Master Factor carries no risk of its own, in conditions where it influences the prevalence of risky states of other factors, it appears to carry a risk and that this increases with the magnitude of loading applied.  Inspection of the y-intercept of the appropriate, red, curve in figure 1b confirms that when the Master Factor neither bears risk nor influences the other factors, the model generates a relative risk of unity for this factor.  The model includes a stochastic element, as a result of which, the actual value returned in this particular simulation was 1.01.

When the Master Factor bears risk in itself, as in condition SM 1, the increase in relative risk associated with the Master Factor is similar to that with SM 0 but displaced towards higher values.

The effects of loading by the Master Factor on the estimates of risk associated with the Other Factors were minor, showing only a marginal rise as loading increased. The results for the Other Factors varied little between themselves and the values shown are the means of all five. Whether or not the Master Factor carried risk had a slightly greater effect, higher values in the latter condition suggesting simply that the amount of risk in the model was partitioned between fewer factors.

It may be noted that the unit risk associated with the Master Factor differs from that of the Other Factors in the basal condition when the former applies no loading to the latter.  This is because, at any given combination of settings, the number of potential cases in the Other Factors is limited only by the proportion of their range set to be risky, whereas in the Master Factor, the susceptible proportion is determined firstly by its prevalence and then by the proportion within the subset thus identified which is potentially at risk.  It would be possible to adjust the default settings to allow for this but, as my interest lies in the relative changes associated with changes in a parameter of interest, I see no reason for doing so.

3.2 Prevalence of Master Factor

It is axiomatic in observational epidemiology that risk associated with a factor is independent of its prevalence in the population given that, at the extremes of prevalence, sufficient cases to allow stable estimates to be made can be found in both exposed and unexposed subjects.  If, however, we propose that the prevalence of one factor affects that of others, this situation may not pertain.  The results of simulations intended to test this hypothesis are shown in figure 2.

It can be seen from the upper two curves in part (a) of the figure that when the Master Factor influences the prevalence of the Other Factors, as indicated by a loading (L) of 1.1, the risk associated with it increases.  This occurs whether the Master Factor bears risk in itself (SM1) or not (SM0) although in the latter case the increase with prevalence is smaller.  When, however, the Master Factor does not affect the prevalence of the Others (L1.0), the estimates of its risk remain constant as its prevalence increases.  As would be expected, when the Master Factor does not carry any risk itself (SM0) its apparent risk hovers around unity.

Figure 2(b) suggests that under conditions where the presence of the Master Factor influences the prevalence of the Other Factors, the risk associated with the latter falls as the prevalence of the Master Factor increases.  Whether or not the Master Factor carries risk in itself has little effect on the risk associated with the Other Factors.

3.3 Influence of proportion of range of Other Factors bearing risk

As described above, in the model the proportion of the range of the risk factors which bears risk has a direct counterpart in observational epidemiology.  It corresponds to the part of the range of values recorded in a study which the worker chooses to represent the risky state of a factor.  Figure 3 shows the results of varying this between values in the model of 0.025 and 0.5, or the top 2.5% and the top 50% of values respectively.

It is apparent that as the proportion of the range selected to be risky increases, the relative risk associated with the Master Factor falls rapidly from high values with a trend resembling an asymptotic approach to unity.  As in previous examples, the highest values are recorded when the Master Factor affects the likelihood of risk in the Other Factors and is higher when it bears risk in its own right than when it does not.  A reduction in apparent risk with an increase in the proportion of Other Factors bearing risk is predictable from general principles and is likely to apply in the real world.  As a threshold for risk decreases, numbers at risk increase and this rise is greater in the unexposed than in the exposed population so that the number of cases in each, and hence estimates of risk, tend towards convergence.
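The general principle can be illustrated with a small calculation of the relative risk that would be expected, on average, for a Master Factor bearing no risk of its own.  This is a sketch under stated assumptions rather than a description of the model: factor values are taken to be uniform on [0, 1) and independent, there are five Other Factors with a common loading, and two units of risk are taken to be needed for a case.  The function names are hypothetical.

```python
from math import comb

def prob_case(p_risky, n_factors=5, units_needed=2):
    """Probability that at least units_needed of n_factors independent factors are risky."""
    return sum(comb(n_factors, k) * p_risky**k * (1 - p_risky)**(n_factors - k)
               for k in range(units_needed, n_factors + 1))

def expected_relative_risk(proportion, loading):
    """Expected relative risk of a Master Factor with no unit of risk of its own.

    With uniform values, a loaded value exceeds the threshold (1 - proportion)
    when the unloaded value exceeds (1 - proportion) / loading.
    """
    p_unexposed = proportion
    p_exposed = min(1.0, 1.0 - (1.0 - proportion) / loading)
    return prob_case(p_exposed) / prob_case(p_unexposed)

for proportion in (0.025, 0.05, 0.1, 0.2, 0.3, 0.5):
    rr = expected_relative_risk(proportion, loading=1.1)
    print(f"proportion {proportion:<5}  expected relative risk {rr:.2f}")
```

Under these assumptions the expected relative risk is large when only a small part of the range is treated as risky and declines towards unity as that part widens, because the probability of a case rises proportionately faster in the unexposed than in the exposed.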

It will be observed that the trends are less regular and that some data points are missing at the lowest values of proportion.  This is a consequence of variation in the data and will be discussed below in section 3.5.

Figure 4 shows that a similar situation pertains with the risk associated with the Other Factors and that again as seen previously, their risk is higher when their prevalence is independent of that of the Master Factor.












3.4 Interaction of variables

We have seen that each of the variables investigated affects the risk associated with the Master Factor.  It is reasonable, therefore, to expect some interaction between them in determining risk.  That this is the case can be seen in figure 5.  I do not propose to discuss this complicated plot in any detail and simply point out that the size and prevalence of the Master Factor, its loading upon the prevalence of the Other Factors and the proportion of the range of those set to be risky all combine to increase the apparent risk of the Master Factor.







For those who wish to study the plots in more detail, the codes on the x-axis are as follows:

S: size of risk of Master Factor (0 or 1)
L: loading applied in presence of Master Factor (1.0, 1.1 or 1.2)
P: proportion of other factors bearing risk (0.225, 0.2 or 0.175)

Note that the left-hand set of data refers to the condition with S = 0 and the right to S = 1.

3.5 Variability in results

In section 3.3 I referred to signs of variation in the data in simulations examining the effect of increasing the proportion of the range of Other Factors bearing risk.  While particularly evident in this case, variability in results is also evident when extreme values of other variables are examined.  While variability is to be expected in the results of a model with a stochastic component in its function, it may be instructive to examine its origins.  I do so not to explain characteristics particular to the model but because they may have some relevance to the real world.

Figure 6 seeks to quantify the variation in the model by expressing the standard deviation associated with the means of the replicates in each simulation as a percentage of the mean estimate of risk.  Given that the simulations in question were not based on normally distributed data, these calculations are of limited validity but I include them here as the best estimates available.  Figure 6(a) shows these coefficients of variation for the Master Factor and 6(b) those for one of the other factors.  In the latter case, to avoid excessive statistical impropriety, I have not averaged the values across the other factors.

It can be seen that the measure of variability is not only high but also erratic at low values of the parameter in the x-axis.  Part of the explanation for this can be seen in figure 7.
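The coefficient of variation is a one-line calculation, sketched here for clarity.  The function name and the replicate values are invented, and the use of the sample standard deviation (divisor n - 1) is an assumption.

```python
from statistics import mean, stdev

def coefficient_of_variation(estimates):
    """Sample standard deviation expressed as a percentage of the mean."""
    if len(estimates) < 2:
        raise ValueError("at least two replicate estimates are needed")
    return 100.0 * stdev(estimates) / mean(estimates)

# Invented relative risks from three replicates of one simulation.
replicate_risks = [2.0, 2.5, 3.0]
print(coefficient_of_variation(replicate_risks))   # about 20 (per cent)
```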

As already explained, the estimates of risk discussed here are the means of 30 replicate simulations.  If, however, in any one of these no cases are generated in either the exposed or unexposed populations, risk cannot be calculated for that replicate and the model reports the number of replicates actually used in calculating the mean.  This occurs only in a small proportion of 'experiments', usually when the variable under examination approaches extreme values.  Figure 7 shows this situation when the proportion of Other Factors bearing risk is set at low values.  Under these circumstances, the mean estimates of risk are based on smaller numbers of replicates and are hence potentially less stable and representative.
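One way in which such bookkeeping might be arranged is sketched below.  The function name is hypothetical and the use of None to mark a replicate in which risk could not be calculated is an assumption made for the example.

```python
def summarise_replicates(estimates):
    """Return (mean of usable estimates, number of replicates used).

    Replicates marked None, where no cases arose in one exposure group,
    are left out of the mean.
    """
    usable = [e for e in estimates if e is not None]
    if not usable:
        return None, 0
    return sum(usable) / len(usable), len(usable)

# Four invented replicates, one of which produced no estimate.
print(summarise_replicates([2.0, None, 3.0, 4.0]))   # (3.0, 3)
```

Reporting the count alongside the mean makes it plain when an estimate rests on fewer than the intended number of replicates.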

The origin of this can be seen in figure 8 which shows the number of cases generated in exposed and unexposed subjects.

Here it can be seen that when the proportion of Other Factors bearing risk is low, the number of cases generated is small.  In particular, the number of cases in the unexposed population, represented as 'ux' in the figure, is close to zero and stochastic factors will have an influence on whether or not any such cases occur and, if they do so, on the number arising.  Thus, variability will be high.  More generally, as the number of cases in either exposure category approaches zero, the relative influence of stochastic, as opposed to deterministic, factors will increase.

While these phenomena are, of course, particular to the model, they have a direct analogy in the real world.  When epidemiologists study a disease with a relatively low prevalence in a population, such as lung cancer, and when its statistical relationship with one risk factor, such as smoking, shows a high magnitude of association with the disease, attempts to conduct studies may be thwarted, or rendered difficult, by the scarcity of cases in unexposed individuals.  When this is the case, stochastic factors related to the sampling efficiency of the study may, as in the model, tend to equal or exceed deterministic factors when estimates of risk are computed.  Unlike the situation with the model, however, studies in epidemiology are seldom replicated in the strict sense so that investigators are denied direct measures of variability.  Instead, confidence limits purporting to indicate the potential effect of variation, be it stochastic or biological, are inferred from sample size and assumptions concerning the distribution of data.  Thus, observational epidemiologists are denied the opportunity which experimental scientists enjoy of examining and partitioning variation between their units of study and hence of obtaining an effective appreciation of the influence of experimental variables.

4. Conclusions

In summarising what we can learn from the model described and used here, it is necessary to reiterate that the use of models imposes strict limits on the interpretations which can be derived from their findings and that over-interpretation is the greatest hazard attending their use.  Their utility depends upon the extent to which the assumptions on which a model is founded are judged to be representative of the real world.  The model described here is founded on the two principal assumptions described in section 2 and readers are free to accept or reject these.  If this hurdle is surmounted, then the credibility of a model is best sustained by limiting the demands put upon it to the examination of the hypothesis which it was designed to test.  In this case, the hypothesis was that when risk factors interact both in prevalence and in the biological contributions which they make to disease, then estimates of risk associated with them will depend upon these interactions.

The results obtained do indeed support this hypothesis.  When a factor bears no risk in itself and does not affect the prevalence of other factors the relative risks associated with it are, allowing for stochastic variation, unity.  On the other hand, if a factor does influence the prevalence of other factors, then it will seem to carry a relative risk greater than unity and the size of the risk is proportional to its influence on the other factors.  This occurs whether or not the factor in question bears risk in its own right, although if it does so then the risk associated with it is larger.

The results obtained also suggest that the estimates of risk contain both stochastic and deterministic components.  This is, of course, obvious, although the model does suggest that the balance between these two influences will depend upon the conditions applying.  While this is of interest, especially in highlighting the limitations in current observational epidemiology, it goes beyond the original hypothesis and is thus a highly speculative conclusion.  The matter may be worthy of further investigation, but that is a separate concern.

Although readers must take my word for it, the results discussed were obtained at an early stage in the development of the model and subsequent refinements were added only to improve its representativeness.  The corollary of this is that different versions of the model give similar results.  In other words, the results obtained are not a function of long processes of adjustment and modification necessary to obtain any particular behaviour which I required.

The numerical values of estimates of relative risk obtained may or may not bear some resemblance to instances drawn from the real world.  I do not consider, however, that such quantitative interpretation is legitimate.  While the variables in the model all operate within ranges relevant to the real world, the conception of the model is too simple to permit direct comparisons.  Thus, while it might be tempting to fit real risk factors and associated values of parameters, this would, with the current version, represent over-interpretation and would distract from and discredit its simple purpose of demonstrating the possibility of interactions between risk factors such as I have proposed.