Fitting the Humphreys Distribution to a Dataset

by P Barber ( January 2010)


 

INTRODUCTION

1. A significant challenge with simulation modelling is to be able to determine the distribution of any resultant data set.

 

2. Packages such as Crystal Ball (http://www.oracle.com/appserver/business-intelligence/crystalball/index.html) or EasyFit (www.mathwave.com) provide a means of determining the best fit from many of the common distributions, but unfortunately the Humphreys distribution (see section 32) is not included within the libraries of these packages.

 

3. This paper sets out an iterative process for determining the NMX parameters and then goes on to outline a method of estimating these parameters directly from measurements of the dataset Mean, Standard Deviation, Skew and Kurtosis (see section 16).

 

ITERATIVE METHOD OF FITTING A DISTRIBUTION

4. The Iterative Method utilises the Anderson Darling test (see section 51).

 

5. At the beginning of the iterative process an estimate of the NMX parameters of the data-set is made. In this example parameters of N=126.479, M=170.162, X=198.597 have been assumed, and this results in an Anderson Darling statistic of AD = 108.08.

 

N =    126.479
M =    170.162
X =    198.597
AD =   108.082605

 

6.  The value of the first parameter (N) is then increased by 0.001. The results of the first and second trials are shown below.

 

N =       126.479      126.48
M =       170.162      170.162
X =       198.597      198.597
AD =      108.082605   108.113858
Mvt AD =               0.031253

 

7.  From the above it can be seen that the effect has been to increase the value of AD by 0.031253. Since it is desired to decrease the value of AD, it is deduced that the value of the first parameter (N) should have been reduced, rather than increased.

 

N =       126.479      126.478
M =       170.162      170.162
X =       198.597      198.597
AD =      108.082605   108.051358
Mvt AD =               -0.031247

 

8.  The process of reducing the value of (N) is then repeated until three iterations, providing four data elements, have been completed.

 

N =       126.479      126.478      126.477      126.476
M =       170.162      170.162      170.162      170.162
X =       198.597      198.597      198.597      198.597
AD =      108.082605   108.051358   108.020117   107.988883
Mvt AD =               -0.031247    -0.031241    -0.031235

 

9.  The next stage is to plot the values of AD against the moving parameter (N), fitting a ‘Polynomial Order 2’ trendline and using the fitted equation to calculate the turning point.

 

The data we have is:

 

N =    126.479      126.478      126.477      126.476
AD =   108.082605   108.051358   108.020117   107.988883

 

 

 

10.  The equation of the trendline is of the form:

 

Y = aX² + bX + c

 

And the turning point occurs when dY/dX = 0

 

Differentiating the equation above we have:

 

dY/dX = 2aX +b = 0

 

Hence the turning point occurs when:

 

X = -b / 2a

 

The values found in the fitted trendline were:

 

a = 3.034654

 

b = - 736.390052

 

and if we substitute these into the equation:

 

X = -b / 2a

 

We get X = 736.390052/(2 x 3.034654) = 121.330
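The turning-point arithmetic can be checked with a few lines of Python, using the trendline coefficients quoted above:

```python
def quad_turning_point(a, b):
    """Turning point of Y = a*X**2 + b*X + c, where dY/dX = 2*a*X + b = 0."""
    return -b / (2.0 * a)

# coefficients found in the fitted 'Polynomial Order 2' trendline
a, b = 3.034654, -736.390052
print(round(quad_turning_point(a, b), 3))  # -> 121.33
```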

 

11.  Setting N = 121.330 we get a result which reduces the AD statistic from 107.988883 to 17.566721.

 

N =       126.479      126.478      126.477      126.476      121.330
M =       170.162      170.162      170.162      170.162      170.162
X =       198.597      198.597      198.597      198.597      198.597
AD =      108.082605   108.051358   108.020117   107.988883   17.566721
Mvt AD =               -0.031247    -0.031241    -0.031235    -90.422161

 

12.  Holding the parameters N and X constant the process is then repeated with M.

 

N =       121.330     121.330     121.330     121.330
M =       170.162     170.161     170.160     170.159
X =       198.597     198.597     198.597     198.597
AD =      17.566721   17.546337   17.525969   17.505617
Mvt AD =              -0.020384   -0.020368   -0.020352

 

13.  With the four data elements being:

 

M =    170.162     170.161     170.160     170.159
AD =   17.566721   17.546337   17.525969   17.505617

 

Which, when plotted and fitted with a 'Polynomial Order 2' trendline, yields the coefficients a = 7.884675 and b = -2662.952274. Substituting these into the turning point equation:

 

X = -b / 2a

 

We get X = 2662.952274/(2 x 7.884675) = 168.869

 

Insertion of this value results in a reduction in AD from 17.505617 to 4.460888.

 

N =       121.330     121.330     121.330     121.330     121.330
M =       170.162     170.161     170.160     170.159     168.869
X =       198.597     198.597     198.597     198.597     198.597
AD =      17.566721   17.546337   17.525969   17.505617   4.460888
Mvt AD =              -0.020384   -0.020368   -0.020352   -13.044729

 

14.  The process is then repeated holding parameters N and M constant while adjusting X. Once four iterations have been completed the data is used to construct a trendline, calculate the turning point and reduce the value of AD further. The process is repeated until no further reduction can be achieved.

 

X =    198.597    198.596    198.595    198.594
AD =   4.460888   4.459631   4.458380   4.457134

 

 

Fitting the trendline and substituting the equation parameters (a = 2.630468, b = -1043.546912) into the turning point equation:

 

X = -b / 2a

 

We get X = 1043.546912/(2 x 2.630468) = 198.358

 

Which when substituted yields a further reduction in the value of AD.

 

N =       121.330     121.330     121.330     121.330     121.330
M =       168.869     168.869     168.869     168.869     168.869
X =       198.597     198.596     198.595     198.594     198.358
AD =      4.460888    4.459631    4.458380    4.457134    4.311477
Mvt AD =              -0.001257   -0.001251   -0.001246   -0.145657

 

15.  While it can be seen that the above method provides a means of fitting the NMX parameters of the Humphreys Distribution to a data-set, it involves a significant number of iterations to achieve final convergence, particularly since the rate of convergence declines with each iteration, as can be seen above. An alternative method, utilising the measurable parameters of the data-set (Mean, Standard Deviation, Skew and Kurtosis), is set out below.

 

ALTERNATIVE MEANS OF FITTING A DISTRIBUTION

16.  The parameters Mean, Standard Deviation, Skew and Kurtosis of a data-set can be measured using standard Excel functions. It should be noted however that the values of Skew and Kurtosis vary slightly with variations in the size of the data sample. Skew is a measure of the third standardised moment of the data-set while Kurtosis is a measure of the fourth standardised moment. As the size of the data-set is increased more values are included 'in the tails' of the distribution, and these affect the calculation of these two variables. To overcome this problem a standard data-set with N = 10,000 elements was employed.
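For reference, the four measured parameters can also be reproduced outside Excel. A minimal sketch, assuming the standard sample (bias-corrected) definitions that Excel's SKEW and KURT functions use:

```python
import math

def skew_kurt(data):
    """Sample skew and excess kurtosis, per the standard (Excel SKEW/KURT) definitions."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))  # sample standard deviation
    z3 = sum(((v - mean) / s) ** 3 for v in data)
    z4 = sum(((v - mean) / s) ** 4 for v in data)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * z4 \
           - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return skew, kurt

skew, kurt = skew_kurt([1, 2, 3, 4, 5])
print(round(skew, 6), round(kurt, 6))  # -> 0.0 -1.2
```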

 

17.  Investigation of fitted data sets reveals that a number of relationships exist, namely:

 

a) Pm and Skew

b) (Mean – M)/standard deviation and Skew

c) (X – N)/standard deviation and Kurtosis

d) (X – N)/standard deviation and Skew

Pm and Skew

18. The relationship between Pm and Skew was found to be:

 

 

 


(Mean – M)/standard deviation and Skew

19. The relationship between (Mean – M)/standard deviation and Skew is shown below.

 

 

 

 

(X – N)/standard deviation and Kurtosis

20. The relationship between (X – N)/standard deviation and Kurtosis is shown below.

 


(X – N)/standard deviation and Skew

21. The relationship between (X – N) / standard deviation and Skew is shown below.

 

 

 

22. Using the above equations it was possible to develop a set of lookup tables which would predict the NMX parameters for data-sets which possessed a Humphreys Distribution. Three lookup tables were established based upon Skew+1 (Skew+1 to ensure that all the values in the table were positive) and these provided the following factors:

 

a) Pm

b) G = (Ave – M) / Std

c) H = (X – N) / Std

 

Where:

Ave = the Average or Mean of the data-set

Std = the Standard Deviation of the data-set

 

Estimates of the NMX parameters of the Humphreys Distribution could then be obtained from the above equations as follows:

 

a) M = Ave – (G x Std)

 

b) N = M – (Pm x H x Std)

 

c) X = (H x Std) + N

 

 

 

 

23. Although the above relationships provide a means of predicting the NMX parameters for data-sets that possess a Humphreys Distribution, they were not sufficient to provide the best fit when the underlying distribution was different. A number of tests were carried out using data created with a Beta Distribution and the following additional relationships, involving changes in Kurt, were established.

 

24.  Firstly for data which can be fitted closely to a Humphreys Distribution, it was found that there was a relationship between Skew and Kurt, as shown below.

 

 

25.  However, it was found that the relationship between Skew and Kurt changed when the data-set differed from the Humphreys Distribution. The plots below show how differences in the value of Kurt (dKurt) could be related to changes in the lookup table factors:

 

Pm,

G= (Ave – M) / Std and

H = (X – N) / Std

 

 

 

 

 

26.  These relationships were used to create a lookup table which would predict the NMX parameters over a wider range of data-sets. However, it should be noted that in creating the lookup tables it was assumed that the relationship between data points was linear. With the limited number of data points employed in the trial it is therefore inevitable that some inaccuracy will occur. However, it is considered that the method provides a useful means of predicting the NMX parameters, which can then be checked using the Anderson Darling statistic; if necessary, further iteration will yield a result close to the best fit.

 

APPLICATION OF LOOKUP TABLES

27.  The section below illustrates the process for estimating NMX parameters using the developed Lookup Tables (download table).

 

The parameters for a data set have been measured as follows:

 

Mean (Ave)           = 123.0769
Standard Deviation   = 15.38512
Skew                 = 0.823478
Kurtosis             = 0.278257

 

Then (Skew+1) =   1 + 0.823478 = 1.823478

 

And (Kurt+0.5) =  0.5 + 0.278257 = 0.778257

 

From this value two lookup reference values are calculated, given by:

 

a) Look1 = Round((Skew+1) – 0.0005,3)  = 1.823

 

b) Look2 = Round((Skew+1) + 0.0005,3) = 1.824
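The two rounding rules can be sketched as:

```python
def lookup_refs(skew):
    """Bracketing lookup keys for (Skew+1), per the rounding rules above."""
    look1 = round(skew + 1 - 0.0005, 3)
    look2 = round(skew + 1 + 0.0005, 3)
    return look1, look2

print(lookup_refs(0.823478))  # -> (1.823, 1.824)
```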

 

A segment of the table is shown below:

 

 

 

 

 

 

 

 

 

 

 


Skew+1   Kurt+0.5   H=(X-N)/Std   H/KurtAdj   G Slope   G=(Ave-M)/Std   G/KurtAdj   Pm Slope   Pm       Pm/KurtAdj
1.822    1.0265     5.5092        0.3519      1.6212    0.9384          -0.6387     -0.6610    0.1798   0.2465
1.823    1.0279     5.5074        0.3545      1.6212    0.9401          -0.6397     -0.6610    0.1791   0.2470
1.824    1.0294     5.5056        0.3536      1.6212    0.9417          -0.6410     -0.6610    0.1784   0.2500
1.825    1.0309     5.5038        0.3497      1.6212    0.9433          -0.6425     -0.6610    0.1778   0.2500

 

 

 

 

 

 

 

 

Reading from the second column in the table it can be seen that for a value of  (Skew+1) = 1.823 we would expect (Kurt+0.5) to be equal to 1.0279.

 

However the value of (Skew+1) we have is 1.823478 and we must therefore adjust for the difference between the measured value and the value in the table.

 

If we read the value of (Kurt+0.5) at (Skew+1)=1.824, then we find an expected value of (Kurt+0.5) of  1.0294.

 

Interpolating between these two values we have an expected value of (Kurt+0.5) of:

 

Expected (Kurt+0.5) = 1.0279 + (1.0294 – 1.0279)(1.823478 – 1.823)/(1.824 - 1.823)

 

                                  =  1.028617

 

Now the value of (Kurt+0.5) that has been measured is 0.778257

 

Hence the difference between the measured and expected values of (Kurt+0.5) is

(0.778257 – 1.028617) = -0.25036

This difference of (Kurt+0.5)= -0.25036 will be used to adjust the values of Pm, H and G obtained from the table.

 

28.  The basic value of H=(X-N)/Std can be read from the third column in the table.

 

for Look1 = 1.823 we have H = 5.5074 and

for Look2 = 1.824 we have H = 5.5056

 

Given that the actual value of (Skew+1) was 1.823478

 

We need to interpolate to determine the basic value of H=(X-N)/Std as follows:

 

Basic value of H = 5.5074 + (5.5056 – 5.5074)(1.823478 – 1.823)/(1.824 - 1.823)

 

Giving a basic value of H=(X-N)/Std = 5.5065396
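The same linear interpolation between the Look1 and Look2 rows is applied to every column read from the table; a small helper makes the pattern explicit, using the H values just computed:

```python
def table_interp(v1, v2, key, look1, look2):
    """Linear interpolation between the values in two adjacent lookup-table rows."""
    return v1 + (v2 - v1) * (key - look1) / (look2 - look1)

# basic value of H = (X - N)/Std for (Skew+1) = 1.823478
h = table_interp(5.5074, 5.5056, 1.823478, 1.823, 1.824)
print(round(h, 7))  # -> 5.5065396
```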

 

This basic value of H now needs to be adjusted to take account of the difference between the expected and the measured value of Kurt+0.5, as determined above.

 

The factor for adjusting the value of H with respect to the difference in (Kurt+0.5) is provided in the fourth column of the table; we have, for:

 

Look1 H/KurtAdj = 0.3545 and for

Look2 H/KurtAdj = 0.3536

 

Once again it is necessary to interpolate to determine the correct factor:

 

H/KurtAdj = 0.3534 + (0.3536 – 0.3545)(1.823478 – 1.823)/(1.824 – 1.823)

 

Producing an H/KurtAdj factor = 0.3529698

 

The estimated value of H is therefore = 5.5065396 + (0.3529698 x -0.25036)

 

Giving an estimate for H = 5.41817

 

29.  The basic value for Pm is found in the ninth column, reading at:

 

Look1= 1.823   Pm = 0.1791

 

A correction factor is found in the eighth column (slope) and is equal to -0.6610

 

The calculation for adjusting, to determine the basic value of Pm is:

 

Basic Pm = 0.1791 + (1.823478 – 1.823) x -0.6610 = 0.17878

 

This basic value of Pm must now be adjusted to take account of the difference in (Kurt+0.5).

 

The adjustment factor Pm/KurtAdj is found in the tenth column of the table, selecting values at:

 

Look1 = 1.823 , we have Pm/KurtAdj = 0.2470 and

Look2 = 1.824   we have Pm/KurtAdj = 0.2500

 

Interpolating to find the correction factor, we have:

 

Pm/KurtAdj factor = 0.2470 + (0.2500 – 0.2470)(1.823478 – 1.823)/(1.824 – 1.823)

 

Hence Pm/KurtAdj factor = 0.24843

 

Applying this factor to correct the value of Pm we have:

 

Pm Corrected = Basic Pm + (Pm/KurtAdj factor x diff Kurt+0.5)

 

Giving Pm corrected = 0.17878 + (0.24843 x -0.25036) = 0.11658

 

29.  The final factor to be determined from the table is G=(Ave-M)/Std, the Look1 value of G is extracted from the sixth column, and the adjustment slope from the fifth column, while the G/KurtAdj is read from the seventh column. Reading at Look1 = 1.823, we have:

 

Look1 G = 0.9401

Slope = 1.6212

G/KurtAdj1 = -0.6397

G/KurtAdj2 = -0.6410

 

Correcting to determine the basic value of G we have:

 

Basic G=(Ave-M)/Std = 0.9401 + (1.823478 – 1.823) x 1.6212 = 0.94087

 

The KurtAdj for G has to be obtained by interpolating between the two adjustment values, as follows:

 

G/KurtAdj  factor = -0.6397 + (-0.6410 + 0.6397)(1.823478 – 1.823)/(1.824-1.823)

 

Hence the G/KurtAdj factor = -0.64032

 

Adjusting the corrected value of G we have:

 

Estimate of G = Basic G + (G/KurtAdj factor x diff Kurt+0.5)

 

Estimate of G = 0.94087 + (-0.64032 x -0.25036) = 1.10118

 

30.  From the measured parameters of the data-set Skew and Kurtosis we have now determined the following estimated factors:

 

Pm(est) = 0.11658

 

H = (X-N)/Std = 5.41817

 

G = (Ave – M)/Std = 1.10118

 

While the following factors were measured:

 

Mean = Ave = 123.0769

 

Std = Standard Deviation =  15.38512

 

The NMX parameters of the Humphreys Distribution can now be estimated as follows:

 

M = Ave – (G x Std) = 123.0769 - (1.10118 x 15.38512) = 106.135

 

N = M - (Pm x H x Std) = 106.135 – (0.11658 x 5.41817 x 15.38512) = 96.417

 

X =  N + (H x Std) = 96.417 + (5.41817 x 15.38512) = 179.776
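The final estimation step can be sketched as a function of the measured values and the looked-up factors:

```python
def estimate_nmx(ave, std, pm, g, h):
    """Estimate the Humphreys NMX parameters from the measured Mean/Std and table factors."""
    m = ave - g * std       # M = Ave - (G x Std)
    n = m - pm * h * std    # N = M - (Pm x H x Std)
    x = n + h * std         # X = N + (H x Std)
    return n, m, x

n, m, x = estimate_nmx(123.0769, 15.38512, 0.11658, 1.10118, 5.41817)
print(round(n, 3), round(m, 3), round(x, 3))  # -> 96.417 106.135 179.776
```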

 

31.  The data was fitted using the iterative method and a comparison of the results from the two methods is shown below:

 

 

Parameter   Lookup Method   Iterative Method
N =         96.417          96.395
M =         106.135         106.136
X =         179.776         179.744

 

 

THE HUMPHREYS DISTRIBUTION

32. Humphreys (1979) described a method of constructing a cumulative distribution curve, based on the three parameters: Minimum, Most Likely, Maximum (NMX). Unlike the PERT Distribution, which is based on the incomplete Beta Function, the Humphreys Distribution is based upon a Normal Distribution which is pulled and stretched about the Most Likely (M) point.

 

33.  The cumulative probability corresponding to the Most Likely (M) point is given by:

 

            Pm = (M – N) / (X – N)

 

34.  The illustration below shows Cumulative Humphreys Distribution curves for a range of Pm’s.

 

 

35.  The example below illustrates the construction of the Humphreys Distribution with the parameters:

 

           

N =   100
M =   130
X =   200

 

36.  The first stage in the process is to calculate Pm.

 

Pm = (M - N) / (X - N)

 

37.  Inserting the parameters this gives:

 

Pm = (130 - 100) / (200 - 100) = 0.3

 

 

38.  The next stage is to establish a reference probability scale. The intervals of this scale will be governed by the total number of data values, and will be generated by the sequence:

 

1/2N, 1/N + 1/2N, 2/N + 1/2N, 3/N + 1/2N, …..  (N-1)/N + 1/2N

 

 

39.  If the number of data values N = 100 then the reference probability sequence would be:

 

0.005, 0.015, 0.025, 0.035, … , 0.995

 

 

40.  In order to calculate the data points it is necessary to remap the reference probability sequence. The chart below illustrates the process.

 

 

41.  The reference probability scale is shown on the horizontal axis, while the re-mapped scale is shown on the vertical axis. The Humphreys Distribution assumes that the distribution either side of the Most Likely point (M) is in the form of a normal distribution. The first part of the curve is covered by the range from zero to Pm. To construct the curve it is necessary to re-map this range from zero to 0.5 (the probability at which the Mode occurs on a Normal Distribution). Re-mapping on this section of the curve is achieved by the formula:

 

For P < Pm           Pr = 0.5P/Pm

 

Where:

Pr = the re-mapped probabilities

P = the values of the reference probability sequence

Pm = the cumulative probability corresponding to the Most Likely point (M), calculated as shown above.

 

42.  The second part of the curve is a little more difficult, but the process can be followed by reference to the illustration above. The re-mapped probabilities in the second part of the curve begin at a value of  Pr = 0.5, so this is the first term in the equation. In order to re-map the values of P, which lie above Pm, it is necessary to deduct the value of Pm and then to adjust the proportions by multiplying by 0.5 and dividing by (1 – Pm). The equation for the upper part of the curve is shown below:

 

For P > Pm          Pr = 0.5   +  0.5 (P – Pm)/(1 – Pm)
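Both re-mapping branches can be sketched as a single function:

```python
def remap(p, pm):
    """Re-map a reference cumulative probability P onto the two half-normal sections."""
    if p < pm:
        return 0.5 * p / pm
    return 0.5 + 0.5 * (p - pm) / (1.0 - pm)

pm = 0.3
print(round(remap(0.005, pm), 4))  # -> 0.0083
print(round(remap(0.305, pm), 4))  # -> 0.5036
```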

 

43.  A sequence of data is shown below. Although the value P = Pm = 0.3 does not appear in the reference probability sequence, it can be seen that such a value would correspond to a Pr value of 0.5.

 

P       Pr
0.005   0.0083
0.015   0.0250
0.025   0.0417
0.035   0.0583
...     ...
0.285   0.4750
0.295   0.4917
0.305   0.5036
0.315   0.5107
0.325   0.5179
...     ...
0.965   0.9750
0.975   0.9821
0.985   0.9893
0.995   0.9964

 

 

44.  The next stage in the process is to calculate the Normal Standard Deviates corresponding to the values of Pr. In Excel this can be achieved using the function =NORMSINV(B37), where B37 refers to the cell containing the value of Pr. A segment of the table is shown below:

 

  

P       Pr       SD
0.005   0.0083   -2.39398
0.015   0.0250   -1.95996
0.025   0.0417   -1.73166
0.035   0.0583   -1.56892
...     ...      ...
0.285   0.4750   -0.06271
0.295   0.4917   -0.02089
0.305   0.5036   0.008952
0.315   0.5107   0.02686
0.325   0.5179   0.044776
...     ...      ...
0.965   0.9750   1.959963
0.975   0.9821   2.100164
0.985   0.9893   2.300346
0.995   0.9964   2.690114

 

 

45.  The Humphreys Distribution assumes that the distance (M – N) = -3SD and the distance (X – M) = +3SD. The conversion of the SD values into values of 'Y' is undertaken in two parts.

 

For P<Pm            Y  =  M  +  (M – N) SD/3

 

And for P>Pm     Y  =  M  +  (X – M) SD/3
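The whole construction described in paragraphs 36 to 45 can be sketched end to end in Python, with statistics.NormalDist().inv_cdf standing in for =NORMSINV:

```python
from statistics import NormalDist

def humphreys_points(n_param, m, x, n_points=100):
    """Generate Humphreys Distribution data values via the re-mapping procedure."""
    pm = (m - n_param) / (x - n_param)
    ys = []
    for i in range(n_points):
        p = (i + 0.5) / n_points                     # reference probability sequence
        pr = 0.5 * p / pm if p < pm else 0.5 + 0.5 * (p - pm) / (1.0 - pm)
        sd = NormalDist().inv_cdf(pr)                # equivalent of =NORMSINV(Pr)
        half = (m - n_param) if p < pm else (x - m)  # which half-normal applies
        ys.append(m + half * sd / 3.0)
    return ys

ys = humphreys_points(100, 130, 200)
print(round(ys[0], 3), round(ys[-1], 3))  # close to 106.060 and 192.769 in the table below
```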

 

 

P       Pr       SD         Y
0.005   0.0083   -2.39398   106.060
0.015   0.0250   -1.95996   110.400
0.025   0.0417   -1.73166   112.683
0.035   0.0583   -1.56892   114.311
...     ...      ...        ...
0.285   0.4750   -0.06271   129.373
0.295   0.4917   -0.02089   129.791
0.305   0.5036   0.008952   130.209
0.315   0.5107   0.02686    130.627
0.325   0.5179   0.044776   131.045
...     ...      ...        ...
0.965   0.9750   1.959963   175.732
0.975   0.9821   2.100164   179.004
0.985   0.9893   2.300346   183.675
0.995   0.9964   2.690114   192.769

 

46.  It can be seen that the values of Y in the above table do not extend to N = 100 and X = 200; this is because the reference probability sequence contains only one hundred values and hence the probabilities of 0.00135 and 0.99865 are not covered. However, if a greater number of data points had been covered by the reference probability sequence, say 10,000, then it is likely that values below N = 100 and above X = 200 would have been created.

 

47.  The Cumulative Frequency Distribution for the Humphreys Distribution, with N = 100, M = 130 and X = 200 obtained by plotting the values of P and Y from the above data, is shown below:

 

 

48.  A bell curve, useful only for illustration purposes, can be produced for this data by utilising the Excel Function:

 

B = NORMDIST(M33,0,1,FALSE)

 

Where:

 

B = the ordinates of the Bell Curve

M33 = Column and rows containing the values of SD

0 = zero = the value of the mean

1 = standard deviation = 1

False = to obtain the bell curve rather than a cumulative frequency curve

 

The Bell Curve is then obtained by plotting values of ‘Y’ against ‘B’

 

Y          B
106.0602   0.02272
110.4004   0.058445
112.6834   0.089076
...        ...
129.7911   0.398855
130.2089   0.398926
130.6267   0.398798
131.0448   0.398543
...        ...
179.0038   0.043968
183.6747   0.028305
192.7693   0.010702
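Outside Excel, the same ordinate is simply the standard normal density; for example, with Python's statistics module:

```python
from statistics import NormalDist

# equivalent of =NORMDIST(SD, 0, 1, FALSE): the standard normal density at SD
print(round(NormalDist(0, 1).pdf(-2.39398), 5))  # -> 0.02272
```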

 

 

 

 

ALTERNATIVE TO USING EXCEL FUNCTIONS

49. The example above demonstrated the use of the Excel function =NORMSINV(Pr) to calculate a value of SD from Pr. It should also be noted that Excel provides other useful functions such as =GAMMAINV(Po, Alpha, Beta), which returns the inverse of the Gamma cumulative distribution, =BETAINV(Po, Alpha, Beta, Lower bound, Upper bound), which returns the inverse of the Beta cumulative distribution, and =LOGINV(Po, Mean, Standard deviation), which returns the inverse of the Lognormal cumulative distribution, should these be required. The results of these Excel functions are produced by an iterative process rather than by direct calculation.

 

50. Although the Excel library functions can be used in an Access database, it may be found more convenient, and certainly faster, to employ the equation set out below, which is shown in the "Handbook of Mathematical Functions With Formulas, Graphs, and Mathematical Tables", edited by Milton Abramowitz and Irene A. Stegun, published by the National Bureau of Standards, United States Department of Commerce, Applied Mathematics Series 55, June 1964. This equation only covers values of P up to 0.5, so a little manipulation is necessary. (P in the equation below is equal to Pr above.)
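The handbook's rational approximation can be sketched in code. A sketch, assuming the constants of formula 26.2.23 from the cited edition, with the symmetry manipulation handling P above 0.5:

```python
import math

def inv_normal(p):
    """Approximate NORMSINV via the rational approximation in Abramowitz & Stegun
    (formula 26.2.23, |error| < 4.5e-4); symmetry extends it to p > 0.5."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    tail = p if p <= 0.5 else 1.0 - p                 # work in the lower tail
    t = math.sqrt(math.log(1.0 / (tail * tail)))
    sd = t - ((0.010328 * t + 0.802853) * t + 2.515517) / \
             (((0.001308 * t + 0.189269) * t + 1.432788) * t + 1.0)
    return -sd if p <= 0.5 else sd                    # mirror for the upper half

print(round(inv_normal(0.975), 2))  # close to NORMSINV(0.975) = 1.96
```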

 

        

 

 

 

THE ANDERSON DARLING TEST

51. The Anderson Darling Test is an empirical test which is used in commercial applications such as Crystal Ball and is useful where it is necessary to test data against a number of alternative distributions.

 

52. The graphs below illustrate the discrimination that this test provides when fitting the NMX parameters of the Humphreys Distribution.

 

 

           

 

53.  The graph above shows how the AD statistic increases significantly if a small movement is made in the value of one of the parameters, NMX, of the Humphreys distribution. The figures in the graph are based on the fitting of a distribution where N = 18,800, M = 20,000 and X = 22,800; hence a discrimination of +/- 5 is equivalent to an accuracy of 2.5 x 10^-4.

 

54. The Anderson Darling Statistic is calculated as follows:

 

a) The data set comprising 'N' data points (created using an Excel Beta Function with parameters Alpha=9, Beta=5, Min=100, Max=200) is ranked in ascending order as shown below. (The Rank column is used later so that the data can be accessed as a lookup table.)

 

 

 

Rank    Yi
1       117.2852
2       119.7510
3       121.0388
4       121.9360
...     ...
4998    164.9755
4999    164.9788
5000    164.9820
5001    164.9853
5002    164.9885
...     ...
9997    194.7800
9998    195.1447
9999    195.6451
10000   196.5454

 

b) An estimate of the Humphreys Distribution parameters is then made, in this case N=126.479, M=170.162, X=198.597 (which gives a Pm of (170.162 – 126.479)/(198.597 – 126.479) = 0.60572). This is not a particularly good fit, as shown in the illustration below, but it provides a starting point. The data is shown in blue and the Humphreys Distribution estimate is shown in red.

 

 

 

 

c) The next stage is to calculate the Standard Deviates from the values in the data set, based on the assumed distribution. The following equations are used for this operation:

 

 

For Yi < M

 

SD = 3(Yi – M)/(M – N)

 

And for Yi > M

 

SD = 3(Yi – M)/(X – M)

 

The results are as shown below:

 

 

 

Rank    Yi         Std Dev based on Yi
1       117.2852   -3.6314
2       119.7510   -3.4621
3       121.0388   -3.3736
4       121.9360   -3.3120
...     ...        ...
4998    164.9755   -0.3562
4999    164.9788   -0.3560
5000    164.9820   -0.3557
5001    164.9853   -0.3555
5002    164.9885   -0.3553
...     ...        ...
9997    194.7800   2.5973
9998    195.1447   2.6358
9999    195.6451   2.6886
10000   196.5454   2.7835
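The two-branch standard-deviate calculation can be sketched as:

```python
def humphreys_sd(yi, n, m, x):
    """Standard deviate of a data value under an assumed Humphreys NMX fit."""
    if yi < m:
        return 3.0 * (yi - m) / (m - n)
    return 3.0 * (yi - m) / (x - m)

n, m, x = 126.479, 170.162, 198.597
print(round(humphreys_sd(117.2852, n, m, x), 4))  # -> -3.6314
```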

 

 

d) The next stage is to calculate the Cumulative Probabilities associated with each of these Standard Deviate values. Because the two parts of the Humphreys Distribution are based on Normal Distributions with different parameters, it is necessary to divide the calculations into two parts. The formulae are as follows:

 

For Yi < M       Pi = NORMSDIST(SD) x Pm / 0.5

 

For Yi > M       Pi =  Pm + (NORMSDIST(SD) – 0.5) x (1 – Pm) / 0.5

 

 

Values are produced as follows:

 

 

 

Rank    Yi         Std Dev based on Yi   Pi for NMX Fit
1       117.2852   -3.6314               0.000171
2       119.7510   -3.4621               0.000325
3       121.0388   -3.3736               0.000449
4       121.9360   -3.3120               0.000561
...     ...        ...                   ...
4998    164.9755   -0.3562               0.437145
4999    164.9788   -0.3560               0.437246
5000    164.9820   -0.3557               0.437347
5001    164.9853   -0.3555               0.437448
5002    164.9885   -0.3553               0.437550
...     ...        ...                   ...
9997    194.7800   2.5973                0.996295
9998    195.1447   2.6358                0.996690
9999    195.6451   2.6886                0.997171
10000   196.5454   2.7835                0.997880
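This step can be sketched with statistics.NormalDist standing in for =NORMSDIST (a negative standard deviate corresponds to the Yi < M branch):

```python
from statistics import NormalDist

def humphreys_cdf(sd, pm):
    """Cumulative probability under the Humphreys fit (SD < 0 means Yi < M)."""
    phi = NormalDist().cdf(sd)
    if sd < 0:
        return phi * pm / 0.5
    return pm + (phi - 0.5) * (1.0 - pm) / 0.5

pm = 0.60572
print(round(humphreys_cdf(-3.6314, pm), 6))  # -> 0.000171
```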

 

 

55.  Note that the logic behind the Anderson Darling Test is not entirely clear (personal comment – meaning that the author does not understand why it works), but it appears to be related to the fact that if the cumulative probabilities of the data and the fitted distribution are plotted against each other, and the assumed distribution fits the data exactly, then the resultant plot will be a straight line. The relevant plot is shown below:

 

 

56.  The plot of Pi against PO is shown in red, while a straight line has been added in black to demonstrate how the linear requirement is not met by the parameters selected in the example. Note that the values of PO have been calculated from the Rank sequence and are in the form:

 

1/2N, 1/N + 1/2N, 2/N + 1/2N, 3/N + 1/2N, …..  (N-1)/N + 1/2N

 

Where N = the number of data elements (this process is described in section 38)

 

 

e) The next stages in the process are to calculate the natural logarithms of Pi and (1 – Pi); these are shown below:

 

 

 

 

Rank    Yi         Std Dev based on Yi   Pi for NMX Fit   Ln(Pi)      Ln(1-Pi)
1       117.2852   -3.6314               0.000171         -8.675105   -0.000171
2       119.7510   -3.4621               0.000325         -8.032430   -0.000325
3       121.0388   -3.3736               0.000449         -7.707516   -0.000450
4       121.9360   -3.3120               0.000561         -7.485499   -0.000561
...     ...        ...                   ...              ...         ...
4998    164.9755   -0.3562               0.437145         -0.827491   -0.574732
4999    164.9788   -0.3560               0.437246         -0.827259   -0.574913
5000    164.9820   -0.3557               0.437347         -0.827028   -0.575092
5001    164.9853   -0.3555               0.437448         -0.826797   -0.575272
5002    164.9885   -0.3553               0.437550         -0.826565   -0.575453
...     ...        ...                   ...              ...         ...
9997    194.7800   2.5973                0.996295         -0.003712   -5.598116
9998    195.1447   2.6358                0.996690         -0.003315   -5.710816
9999    195.6451   2.6886                0.997171         -0.002833   -5.867693
10000   196.5454   2.7835                0.997880         -0.002122   -6.156329

 

 

f) The values of Ln(1 – Pi) are then reversed, such that associated with the first element of data (the one which has the lowest value in the set) there is the value of Ln(Pi), set alongside the value of Ln(1 – Pi) taken from the last data element, as shown below.

 

 

 

Rank    Yi         Pi for NMX Fit   Ln(Pi)      Ln(1-Pi)    Reverse Rank   Ln(1-Pi) in reverse Order
1       117.2852   0.000171         -8.675105   -0.000171   10000          -6.156329
2       119.7510   0.000325         -8.032430   -0.000325   9999           -5.867693
3       121.0388   0.000449         -7.707516   -0.000450   9998           -5.710816
4       121.9360   0.000561         -7.485499   -0.000561   9997           -5.598116
...     ...        ...              ...         ...         ...            ...
4998    164.9755   0.437145         -0.827491   -0.574732   5003           -0.575632
4999    164.9788   0.437246         -0.827259   -0.574913   5002           -0.575453
5000    164.9820   0.437347         -0.827028   -0.575092   5001           -0.575272
5001    164.9853   0.437448         -0.826797   -0.575272   5000           -0.575092
5002    164.9885   0.437550         -0.826565   -0.575453   4999           -0.574913
...     ...        ...              ...         ...         ...            ...
9997    194.7800   0.996295         -0.003712   -5.598116   4              -0.000561
9998    195.1447   0.996690         -0.003315   -5.710816   3              -0.000450
9999    195.6451   0.997171         -0.002833   -5.867693   2              -0.000325
10000   196.5454   0.997880         -0.002122   -6.156329   1              -0.000171

 

In the last column it can be seen that the value of -6.156329, which appears as the Ln(1-Pi) for item 10000, is now associated with the first item.

 

g) An elemental statistic is then calculated, for each row, by adding the values Ln(Pi) and 'Ln(1 – Pi) in reverse Order' and multiplying this line sub-total by (1 – 2i)/N (where i = Rank).

 

 

 

 

 

Rank    Ln(Pi)      Ln(1-Pi) in reverse Order   AD Stati
1       -8.675105   -6.156329                   0.001483
2       -8.032430   -5.867693                   0.004170
3       -7.707516   -5.710816                   0.006709
4       -7.485499   -5.598116                   0.009159
...     ...         ...                         ...
4998    -0.827491   -0.575632                   1.402422
4999    -0.827259   -0.575453                   1.402291
5000    -0.827028   -0.575272                   1.402161
5001    -0.826797   -0.575092                   1.402029
5002    -0.826565   -0.574913                   1.401898
...     ...         ...                         ...
9997    -0.003712   -0.000561                   0.008543
9998    -0.003315   -0.000450                   0.007528
9999    -0.002833   -0.000325                   0.006315
10000   -0.002122   -0.000171                   0.004586

 

The calculation for the fourth element is as follows:

 

AD Stat(Rank=4) = (Ln(Pi) + 'Ln(1-Pi) in reverse Order') x (1 – 2 x Rank)/N

AD Stat(Rank=4) = (-7.485499 - 5.598116)(1 – (2 x 4))/10000

AD Stat(Rank=4) = (-13.083615)(-7)/10000 = 0.009159
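The elemental-statistic summation, together with the final deduction of 'N', condenses to a short function; a sketch that takes the ranked fitted probabilities Pi as input:

```python
import math

def anderson_darling(p):
    """AD statistic: sum over rows of
    (Ln(Pi) + 'Ln(1-Pi) in reverse Order') * (1 - 2*Rank)/N, minus N."""
    n = len(p)
    total = 0.0
    for i, pi in enumerate(p, start=1):      # i = Rank
        term = math.log(pi) + math.log(1.0 - p[n - i])
        total += term * (1 - 2 * i) / n
    return total - n

print(round(anderson_darling([0.2, 0.5, 0.8]), 4))  # -> 0.2031 (small toy example)
```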

 

 

h) The elemental statistics are then summed, and 'N' deducted from the total, to produce the Anderson Darling Statistic (AD).

 

 

 

Sum of AD Stati = 10,108.08

 

 

AD = Sum of elements – N = 10,108.08 – 10,000 = 108.08

 

An AD of 108.08 indicates that the distribution selected, of N=126.479, M=170.162, X=198.597 does not fit the data.

 

 

 

 

 

 


 

57. It should be noted that with a normal distribution the points P = 0 and P = 1 correspond to minus and plus infinity respectively. In creating the data it has been assumed that values of Pi will be selected from the equivalent mid-class points; as such the values of Pi selected correspond to the series: 0.5/n, 0.5/n + 1/n, 0.5/n + 2/n, etc., where n equals the number of data points to be selected. For n = 10,000 the series would be of the form 0.00005, 0.00015, 0.00025, etc. up to 0.99995.

 

58. It should also be noted that it is assumed that the values of N and X occur at -3 and +3 standard deviations, corresponding to probabilities of 0.00135 and 0.99865. From the above it will be apparent that a number of the values in the data set will lie outside the N and X limits.

 

 

 

REFERENCES

Humphreys, G. (1979). Turning Uncertainty to Advantage. McGraw-Hill Book Company (UK) Limited, pp. 22-30.