1. Technical description of the model
The model was written in Microsoft Visual Basic 6.0. For much of the work described here it was run in an uncompiled form to facilitate any changes required and to allow details of its function to be followed. It has been translated into Visual Basic 2005 and a compiled form which runs as a conventional application in Windows has been produced.
1.1 Settings
The settings available are displayed on the opening screen of the model and on initiation contain default values for each. These may be changed as required.
OPTIONS
Initial distribution refers to the values representing the potential risk for each factor in each subject. This may be either uniform, in which any value is as likely to occur as any other, or normal, in which the frequency of values decreases on either side of the mean according to the normal distribution. The latter is simulated as described below while considerations surrounding the choice of distribution are discussed in a separate section.
Risk threshold, which determines whether or not any value representing a subject is potentially risky or not, can act in two ways.
With the threshold set by percentile, the threshold is based upon the number of subjects whose values lie on either side of the threshold. If, for example, the value of the threshold is 0.2, then the highest 20% of values are deemed to be risky. This corresponds to the frequent practice by epidemiologists of setting such thresholds by percentile. Thus a threshold of 0.2 represents the highest quintile of subjects.
With a threshold set by value, the discrimination between risky and non-risky subject is based upon the range of values available. For example, if the range of values is 0 – 1 and the threshold is 0.2, the value determining the boundary between the risky and non-risky levels is 1 x (1-0.2) = 0.8.
As shown below [link], both types of threshold gives closely similar results with uniformly distributed data but with a normal distribution, a threshold by percentile will include more subjects above a given threshold than does a threshold by value.
The facility to use a threshold by value is provided to simulate situations where the threshold is selected by a statistical property, such as the standard deviation of a normal distribution.
Types of risk: All the simulations described here assume that a case of disease occurs when a certain number of units of equivalent risk accumulate. As described elsewhere, this situation is an oversimplification of the real world in which disease is most likely to be the result of the joint operation of several different types of risk. To take some account of this, the model can be operated with two different types of risk, to which each factor may or may not contribute, and satisfaction of both types is required before a case develops.
SETTINGS
Replicates: The model can be set to provide to repeat operation with a given set of settings once or any greater number of times. In all the work described here 30 replicates were used. This value was selected, as shown below by inspection of the running mean of results from increasing numbers of replicates.
Prevalence of the master factor: as is implicit in the term, this determines the proportion of the population which is exposed to the master factor
Threshold for case establishes the number of units of risk accumulated across all factors which represents a case. This is set in relation to the value of size selected. If the threshold exceeds the maximum possible number of units which would arise if all factors were risky in a given subject then no cases will arise. This threshold can be set independently for each of the two types of risk allowed by the model
Load determines the extent to which the likelihood of risk in the other factors is increased in the presence of the master factor.
Proportion sets the threshold above which the value representing any of the factors is deemed to be risky. In the case of the master factor this applies within the range previously set to represent those exposed to it. For the other factors, this threshold applies to the entire range of values.
Size refers to the unit of risk presented by a factor exceeding the risk threshold set by the proportion described above. Values may be integers or fractions. A value of zero means that the factor in question does not present risk in any circumstances. In principle, negative values representing a protective factor, could be used but the model has not been evaluated fully in such cases.
Size can be set independently for each of the two types of risk which the model supports.
1.2. Operation
1. The model establishes an array containing, for 10,000 people, values for the states of six risk factors. One of these, the master factor (MF) is a risk factor, such as smoking, whose presence or absence may determine the status of the Other Risk Factors (F2 to F6).
While it is likely that in the real world such relationships may be reciprocal rather than unidirectional, no attempt is made to represent this in the current version. To do so would lead to complications in computation.
2. To initialise the model, the elements of the arrays representing each risk factor are populated by random numbers in the range between 0 and 1. These values determine an individual's position in the range of possible values of the risk factors and establish the stochastic characteristics of the model.
There is no particular significance in choosing a range of 0 to 1. Zero to 100, 100 to 1000 or 59 to 68 would be equally effective.
When the random numbers are taken directly as generated they provide a uniform series, with any value equally likely to be present as any other. They thus do not conform to any of the forms of distribution usually encountered in biological data. The model can operate using such a uniform distribution but also allows a simulated normal distribution to be used. This is prepared by screening random numbers as they are generated and rejecting them if the subset of the distribution into which they would fall already contains a full quota of numbers.
The question of the distribution of data is addressed below.
It should be appreciated that the numbers used are not truly random, but the products of a pseudo-random number generator in the underlying software. Bearing in mind the large numbers used, I was initially concerned that systematic variation in their generation could affect the model, but careful scrutiny of the raw output has not provided any evidence that this is the case.
3. The random values representing the master factor are scrutinised to detect the actual values corresponding to the nominal thresholds for its prevalence and risk set on initiation of the run. Two forms of threshold are available, for use in conjunction with the form of distribution set. Thresholds set by percentile separate the population into parts which either exceed or, alternatively, equal or fall below that value in the actual range available which separates the higher (100-x)% from the lower x% of the total number where x is the threshold set for the run. Thresholds set by value detect that number whose value is, or is closest to, (100-x%) of the highest number in the range established for the model.
Two such pairs of thresholds are set for the master factor. One pair sets the threshold for the prevalence of the master factor and it used to determine whether or not a particular subject is exposed to the master factor. The second pair operates within those deemed to be exposed to determine those which do or which do not present a risk of disease to the subject.
Note that this procedure separates the presence or absence of the master factor from the possibility of its presence representing a risk to the bearer. That is, the simple presence of the master factor does not in itself imply risk with risk applying only in a specified higher part of the range in which it is present. If we consider that the master factor represents smoking, this corresponds to two well established observations from epidemiology. Firstly, only a proportion of smokers appear to be liable to the diseases associated with smoking and, secondly, the risk associated with smoking appears to be related to measures of intensity of smoking. In the model, therefore, the value representing the master factor in any given subject is a combined measure of presence and intensity.
4. The values representing the master factor in each individual are compared with the thresholds set as above and according to whether the model has been set to apply thresholds by percentile or by value. Separate arrays record whether the subject is exposed to the master factor or whether it bears risk according to the value of the master factor. The presence or absence of the master factor is recorded as 1 or 0 respectively. If risk is absent, this is coded as 0 while if risk is present this is coded by the size of the risk set on initiation of the model.
Although not used in the experiments described here, the model allows two different forms of risk to be considered. These correspond notionally to different forms of biological effect. Each factor, including the master, thus has two registers to record the size of the risk in exposed subjects, with this size selected independently for each on initiation of the model.
5. Having established the status of each subject with regard to the master factor the array of values for all factors is inspected for each subject. If a subject is exposed to the master factor, then the values representing the other factors are multiplied by the load set for each factor.
This has the effect of increasing the likelihood that the other factors bear risk when the master factor is present.
6. The range of values as modified in this way are now examined to establish the values for the thresholds for risk as described factor in (3) above.
It will have been appreciated that the loading process will lead to some of the values representing the potential for risk in subjects will exceed the maximum which applied to the initial values. That is, if the range initially fell between 0 and 1, some values will now exceed 1. In initial versions of the model, the values for each of the other factors were thus standardised back to the original range. It was found, however, that not only was this unnecessary but that it also led to differential changes in the distribution of values in exposed and non-exposed subjects as described below . The current version of the model does not, therefore, apply any standardisation after loading.
7. The values for each of the other factors are compared, subject by subject, with the appropriate threshold for risk and if they exceed it allot in a register a risk of the size set for each factor. If appropriate, the size for the second form of risk is allocated to a separate register.
8. The total units of risk allocated to each subject for each factor, including the master, are summed. If a total exceeds the threshold set to determine the number of units corresponding to a case, the subject in question is registered as a case.
9. The register of cases is inspected by the exposure status of the subject for each factor and the number of cases in exposed and unexposed individuals is calculated separately for each factor.
10. From the data accumulated, the rate of disease in exposed and unexposed individuals is calculated and the relative risk associated with each factor is calculated as the ratio of these two rates. If either the numerator or denominator in this calculation is zero the relative risk cannot be calculated and a code to indicate this is generated.
11. These estimates of risk, with the data used in their calculation, are recorded.
12. Steps (1) to (11) are repeated according to the number of replicates set on initiation of the model.
13. Mean values of the risk estimates, the numbers exposed and unexposed to each risk factor together with the number of cases in exposed and unexposed cases are calculated. In addition, the standard deviation of each of these is calculated and expressed as a percentage of the mean to provide a coefficient of variation.
14. These results are presented to the operator on the computer display and, if the operator chooses to do so, saved as a text file which records the values of all the settings used in the model together with the data described in (13) above. The file name contains and alphabetical and a numeric component both of which can be set by the operator. Otherwise, the numeric component is automatically increased by one from the value last used.
15. The text file can be opened by any suitable application, such as Microsoft Excel, and the results analysed at the operator's discretion.
2. Number of replicates
The number of replicates most suitable for routine application in the work described here was selected empirically by assessing the stability of the running mean of the relative risk of the master factor as the number of replicates increased. As described in the main text, the degree of the variability in the model depends upon the settings used. The example chosen here as figure A1 shows the values obtained for the risk associated with the master factor as the proportion of the range of the other factors presenting risk increases.
At the lowest setting of proportion the estimates of risk were highly variable and only nine of the fifty replicates generated valid estimates of risk. As proportion increased, variability reduced. With a proportion of 0.075, 48 of the 50 replicates provided valid estimates or risk while at all higher values a full 50 estimates of risk were obtained. It can be seen that in all cases except the first the red curve representing the running mean has achieved some stability by around 30 replicates or less and this value was chosen for the studies described here.
It should be noted that most of the work described here was done with the model running in an uncompiled form and that the number of replicates used had an appreciable influence on running time. The compiled version operates much more quickly and run time ceases to be a consideration in choosing the number of replicates.
3. Form of distribution
During the development of the model the arrays representing the risk values of the factors were populated by random numbers as they were generated in the software.
This was done for the sake of simplicity with the justification, already referred to, that if risk thresholds based on percentiles were applied, as is frequently the case in epidemiology, then the form of distribution would have little effect upon the point in the series of numbers where this threshold fell. A typical example of such data is shown in figure A2(a). It was appreciated, however, that such uniform distributions are probably rare in the real world and that the model could be considered unrepresentative for this reason. Although some other form of distribution was a desirable addition to the model, the choice of what this should be was less obvious, especially as I was not aware of any detailed information on the forms of distribution presented by risk factors in the real world nor had any reason to believe that they would be the same for all risk factors. Thus, substituting one unrepresentative distribution for another would not improve matters. Nevertheless, to assess the effects in the model of the form of distribution used, I decided to apply one which occurs frequently in nature, the normal distribution.
Having chosen this as an alternative to a uniform distribution to represent the basic risk data in the model it was necessary to modify the model to achieve this. This was done empirically using the function NORMDIST in Microsoft Excel to calculate the necessary properties of a normal distribution.
Firstly, it was necessary to determine the properties of the distribution necessary to generate data which matched the characteristic of those provided by the uniform distribution. It was found that setting a mean of 0.5 and a standard deviation of 0.2 provided a series within the required range of 0 to 0.99999. Secondly, the proportion of values occurring in each of ten size categories within this range was calculated. Applying this proportion to a total population of 10000 values then established the number of values within each of these size categories. Thirdly, a routine was added to the programme which constrains the addition of random numbers to the arrays representing the risk values for each factor in such a way that when the required number for each size category is satisfied no further values within this range are added. In doing so it was necessary to randomise the order of size categories to which successive random numbers were offered, as biases in the mean values within size categories arose if each was offered only numbers rejected by those lower in the series.
Figure A2(b) shows the distribution generated in this manner with a an example of the alternative uniform distribution shown in figure A2(a) for comparison.
It should be noted that while the numbers in each size category of the uniform distribution vary slightly as the result of the use of random numbers, the size categories of in the normal distribution are fixed by the method used to generate the distribution. This will reduce the variability inherent in the model to some extent but it has little discernable effect on the distribution of individual numbers within the model as can be seen in figure A2 (c). This shows the cumulative distribution of data, that is the total number of observations obtained as successively larger values are added to the total. Had the method of construction led to marked constraints in the distribution of values these would have been evident at the boundaries of the size categories indicated by the vertical gridlines in the figure.
Figure A3 shows how loading the risk data in subjects for whom the MASTER FACTOR is present changes the distribution of data. The examples shown were based on a LOAD of 1.1 and a prevalence of the MASTER FACTOR of 0.5.
It can be seen that loading has added a number of values to the previously empty size category for values exceeding 1.0 and that in the case of the normal distribution, the data are no longer distributed symmetrically around the central values. This is reflected by the increase in the mean and standard deviation shown in table A1, below.
Figure A4 shows the loaded data plotted separately for subjects who were exposed to the MASTER FACTOR and those who were not.
It is evident that, as expected, the unexposed subjects retain the original form of distribution, the visual impression being supported by the parameters of the distribution shown in the table above. Within the exposed subjects, however, it is apparent that, as is obvious, the addition of a new size category has reduced the size of all the other categories to an extent varying by category. Table A1, below, shows for the normal distribution that the subjects exposed to the master factor are now represented by a distribution with a mean of 0.550 and a standard deviation of 0.221.
| condition | mean | standard deviation | |
| unmodified | 0.500 | 0.200 | |
| loaded | all | 0.525 | 0.212 |
| master absent | 0.500 | 0.199 | |
| master present | 0.550 | 0.221 |
I discussed in section 2 the availability in the model of two methods for setting the threshold which discriminates between risky and non-risky values of risk factors. A threshold set by percentile corresponds to common epidemiological practice and will be indifferent to the form of distribution unless some particularly extreme form were used. The possibility of setting a threshold by numerical value was introduced as it will act differently from the use of percentiles in distinguishing between risky and innocuous values in a distribution such as the normal. Two comparisons of the various combinations of distribution and threshold are provided here to show their effect.
The first comparison is based upon the number of subjects exposed to one of the other factors as this measure represents the immediate effect of applying a threshold to the data. These results were obtained using the default settings for all parameters except for the proportion of the range of other factors bearing risk. As described in the main text small changes in the low ranges of the this variable have marked effects on estimates of risk. This is shown in figure A5.
It can be seen that in all four combinations of distribution and threshold the results are regular and show little evidence of the stochastic variation which affects estimates of risk. More importantly, it is clear that, as predicted, when the threshold for risk is set by percentile the form of distribution has little effect on the number of subjects as risk. On the other hand, with a threshold set by value, slightly higher numbers are at risk with a uniform distribution, simply because loading increases the upper limit of the range of values so that any given value used as a threshold will occur at a lower point in the range. With a normal distribution, however, the number of subjects at risk is lower when the threshold is set by value reflecting the lower density of observations towards the extreme of the distribution.
While these observations reflect, as explained, the immediate effects of the form of distribution and threshold used, the purpose of the model is to generate estimates of risk and it is thus useful to see how these are affected. This is shown in figure A6.
It is again clear that with thresholds set by percentile, the two forms of threshold give closely similar results over most of the range of the parameter varied. Differences are apparent only in that part of the curve where stochastic variation appears to have a detectable influence on estimates of risk [see report]. Differences between the two forms of distribution reflect the differences in the number of subjects exposed which I have just described.
In conclusion, the form of distribution used appears to have little effect on the results obtained from the model when the threshold of risk is set by percentile. The combination of a normal distribution and a threshold set by value offers the opportunity to examine situations where this form of threshold is thought to have relevance to the real world.