Exploration of Excess Homozygosity in relation to DNA profile match probability
Exploration of Excess Homozygosity in relation to DNA profile match probability
Visual basic macros and Word macros are
between the horizontal rules.
Before going into Visual Basic Editor go into ordinary
Word and call up anything in the directory you want
the VB files to go into as this is not designated in the
following code.
Using plain text handling Notepad with no line wrap
copy to between Sub / End Sub ,Reset and Run.
Some long conditional statements may break
and would need reconcattenating before running.
If using straight VB6 then designate the directory
for files by "replace all" occurances of dec14 to
c:\vb\dec14 or whatever, also add a sound progress
indicator before the [ next x ] line
If x/1000 = Int (x/1000 ) Then Beep
before highlighting and copying.
In VB6 open New Project
In Form1 open up a Command1 button
Double click this button to open command
code window and copy and paste the 'DNA' VB code
between the Private Sub Command1_click ()
and the End Sub
Then Run/Start
Press command1
Wait until Beep/ clicks cease
' Generating profiles for 6 loci only and adding HZ at generation
' directing pairs and first divider
Dim ph(12)
Dim pb(12)
' change xxxx for different total number of profiles
xxxx = 50000
xxxx = xxxx - 1
' initialising Random Number Generator - RNG
count9 = 0
count8 = 0
Randomize
a = 214013
c = 2531011
x0 = Timer
z = 2 ^ 24
' 1 file 'jun10a-g' for original, un-directed pairs, source data.
' This file is necessary to check on the performance of the RNG
' when a matched pair is found then it is highly unlikely that
' both sequences as generated, before pair directing, would
' be the same - more likely a manifest of repeat within the RNG
' (reason for adopting the 214013 / 2531011 RNG )
' Use 'Word' find function on part of the sequences, including pair reversals,
' with luck would include a 'homozygotic' pair eg (3,3) say ,so no reversal
' on that pair
Open "jun10a-g" For Output As #1
Open "jun10a-Shz.txt" For Output As #2
' outputs directed and divided by first digit
Open "jun10a-0" For Output As #10
Open "jun10a-1" For Output As #11
Open "jun10a-2" For Output As #12
Open "jun10a-3" For Output As #13
Open "jun10a-4" For Output As #14
Open "jun10a-5" For Output As #15
Open "jun10a-6" For Output As #16
Open "jun10a-7" For Output As #17
Open "jun10a-8" For Output As #18
Open "jun10a-9" For Output As #19
For x = 0 To xxxx
For j = 0 To 1
' D3 ,first locus
' RNG random number generator
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.0037 Then ph(j) = 11
If ph(j) < 0.0111 Then ph(j) = 1
If ph(j) < 0.1318 Then ph(j) = 2
If ph(j) < 0.3966 Then ph(j) = 3
If ph(j) < 0.6466 Then ph(j) = 4
If ph(j) < 0.8313 Then ph(j) = 5
If ph(j) < 0.9865 Then ph(j) = 6
If ph(j) < 0.9976 Then ph(j) = 7
If ph(j) < 1 Then ph(j) = 8
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 2 To 3
' vWA
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.1268 Then ph(j) = 11
If ph(j) < 0.2263 Then ph(j) = 1
If ph(j) < 0.4618 Then ph(j) = 2
If ph(j) < 0.7192 Then ph(j) = 3
If ph(j) < 0.9138 Then ph(j) = 4
If ph(j) < 0.984 Then ph(j) = 5
If ph(j) < 0.9982 Then ph(j) = 6
If ph(j) < 1 Then ph(j) = 7
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 4 To 5
' D5
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.0049 Then ph(j) = 11
If ph(j) < 0.0357 Then ph(j) = 1
If ph(j) < 0.096 Then ph(j) = 2
If ph(j) < 0.4876 Then ph(j) = 3
If ph(j) < 0.8312 Then ph(j) = 4
If ph(j) < 0.9876 Then ph(j) = 5
If ph(j) < 0.9987 Then ph(j) = 6
If ph(j) < 1 Then ph(j) = 7
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 6 To 7
' D8
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
pb(j) = "Z"
If ph(j) < 0.0198 Then ph(j) = 11
If ph(j) < 0.026 Then ph(j) = 1
If ph(j) < 0.105 Then ph(j) = 2
If ph(j) < 0.1927 Then ph(j) = 3
If ph(j) < 0.3384 Then ph(j) = 4
If ph(j) < 0.6631 Then ph(j) = 5
If ph(j) < 0.8804 Then ph(j) = 6
If ph(j) < 0.9705 Then ph(j) = 7
If ph(j) < 0.9977 Then ph(j) = 8
If ph(j) < 1 Then ph(j) = 9
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 8 To 9
' D18
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
pb(j) = "Z"
If ph(j) < 0.01 Then ph(j) = 11
If ph(j) < 0.0238 Then ph(j) = 1
If ph(j) < 0.1742 Then ph(j) = 2
If ph(j) < 0.3033 Then ph(j) = 3
If ph(j) < 0.4462 Then ph(j) = 4
If ph(j) < 0.584 Then ph(j) = 5
If ph(j) < 0.7168 Then ph(j) = 6
If ph(j) < 0.8421 Then ph(j) = 7
If ph(j) < 0.916 Then ph(j) = 8
If ph(j) < 0.9626 Then ph(j) = 9
If ph(j) < 0.9824 And ph(j) >= 0.9626 Then pb(j) = "A"
If ph(j) < 0.9924 And ph(j) >= 0.9824 Then pb(j) = "B"
If ph(j) < 0.9962 And ph(j) >= 0.9924 Then pb(j) = "C"
If ph(j) < 0.9977 And ph(j) >= 0.9962 Then pb(j) = "D"
If ph(j) < 1 And ph(j) >= 0.9977 Then pb(j) = "E"
If ph(j) > 10 Then ph(j) = 0
If pb(j) <> "Z" Then ph(j) = pb(j)
Next j
For j = 10 To 11
' FGA
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
pb(j) = "Z"
If ph(j) < 0.0173 Then ph(j) = 11
If ph(j) < 0.0767 Then ph(j) = 1
If ph(j) < 0.2227 Then ph(j) = 2
If ph(j) < 0.2239 Then ph(j) = 3
If ph(j) < 0.4021 Then ph(j) = 4
If ph(j) < 0.4046 Then ph(j) = 5
If ph(j) < 0.5927 Then ph(j) = 6
If ph(j) < 0.5989 Then ph(j) = 7
If ph(j) < 0.7437 Then ph(j) = 8
If ph(j) < 0.7487 Then ph(j) = 9
If ph(j) < 0.8985 And ph(j) >= 0.7487 Then pb(j) = "A"
If ph(j) < 0.974 And ph(j) >= 0.8985 Then pb(j) = "B"
If ph(j) < 0.995 And ph(j) >= 0.974 Then pb(j) = "C"
If ph(j) < 1 And ph(j) >= 0.995 Then pb(j) = "D"
If ph(j) > 10 Then ph(j) = 0
If pb(j) <> "Z" Then ph(j) = pb(j)
Next j
' output the original generated file
Write #1, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
' Because in real DNA profiles without further info ,no one
' knows which allele in each pair came from the mother or father
' by convention they are written smaller ,larger (or equal).
' The following directs each pair
For j = 0 To 10 Step 2
If ph(j + 1) < ph(j) Then
jjj = ph(j)
ph(j) = ph(j + 1)
ph(j + 1) = jjj
End If
Next j
rng = 6 * Rnd
If 0 <= rng And rng < 1 Then
ph(5) = ph(4)
End If
If 1 <= rng And rng < 2 Then
ph(7) = ph(6)
End If
' put extra conditional statements here to reduce
' the number of files or just delete some of the following
'
' dividing on first column, file by file
If ph(0) = 0 Then
Write #10, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count0 = count0 + 1
End If
If ph(0) = 1 Then
Write #11, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count1 = count1 + 1
End If
If ph(0) = 2 Then
Write #12, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count2 = count2 + 1
End If
If ph(0) = 3 Then
Write #13, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count3 = count3 + 1
End If
If ph(0) = 4 Then
Write #14, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count4 = count4 + 1
End If
If ph(0) = 5 Then
Write #15, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count5 = count5 + 1
End If
If ph(0) = 6 Then
Write #16, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count6 = count6 + 1
End If
If ph(0) = 7 Then
Write #17, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count7 = count7 + 1
End If
If ph(0) = 8 Then
Write #18, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count8 = count8 + 1
End If
If ph(0) = 9 Then
Write #19, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count9 = count9 + 1
End If
Next x
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
Close #1
Close #2
' count file for data to fix for - next loops in sucessive dividings
Open "jun10a-c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
' Add homozygosity on loci 3 & 4
' xxxx is count = ????
xxxx = 50000
Dim ph(12)
Dim ps As String
Open "jun07-S.txt" For Input As #1
Open "jun07-Shz.txt" For Output As #2
xxxx = xxxx - 1
For x = 0 To xxxx
Input #1, ps
a1$ = Mid(ps, 1, 1)
a2$ = Mid(ps, 2, 1)
a3$ = Mid(ps, 3, 1)
a4$ = Mid(ps, 4, 1)
a5$ = Mid(ps, 5, 1)
a6$ = Mid(ps, 6, 1)
a7$ = Mid(ps, 7, 1)
a8$ = Mid(ps, 8, 1)
a9$ = Mid(ps, 9, 1)
a10$ = Mid(ps, 10, 1)
a11$ = Mid(ps, 11, 1)
a12$ = Mid(ps, 12, 1)
ph(0) = a1$
ph(1) = a2$
ph(2) = a3$
ph(3) = a4$
ph(4) = a5$
ph(5) = a6$
ph(6) = a7$
ph(7) = a8$
ph(8) = a9$
ph(9) = a10$
ph(10) = a11$
ph(11) = a12$
rng = 6 * Rnd
If 0 <= rng And rng < 1 Then
Write #2, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(4) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
End If
If 1 <= rng And rng < 2 Then
Write #2, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(6) & ph(8) & ph(9) & ph(10) & ph(11)
End If
If 2 <= rng And rng < 6 Then
Write #2, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
End If
Next x
Close #1
Close #2
' Find homozygosity
' xxxx is count = ????
xxxx = 50000
Dim ph(12)
Dim ps As String
count1 = 0
count2 = 0
count3 = 0
count4 = 0
count5 = 0
count6 = 0
count7 = 0
count8 = 0
count9 = 0
Open "jun01b-g.txt" For Input As #1
Open "jun01b-hz.txt" For Output As #2
xxxx = xxxx - 1
For x = 0 To xxxx
Input #1, ps
a1$ = Mid(ps, 1, 1)
a2$ = Mid(ps, 2, 1)
ph(1) = Val(a1$)
ph(2) = Val(a2$)
If ph(1) = ph(2) Then
count1 = count1 + 1
End If
a3$ = Mid(ps, 3, 1)
a4$ = Mid(ps, 4, 1)
ph(3) = Val(a3$)
ph(4) = Val(a4$)
If ph(3) = ph(4) Then
count2 = count2 + 1
End If
a5$ = Mid(ps, 5, 1)
a6$ = Mid(ps, 6, 1)
ph(5) = Val(a5$)
ph(6) = Val(a6$)
If ph(5) = ph(6) Then
count3 = count3 + 1
End If
a7$ = Mid(ps, 7, 1)
a8$ = Mid(ps, 8, 1)
ph(7) = Val(a7$)
ph(8) = Val(a8$)
If a7$ = "A" Then ph(7) = 10
If a8$ = "A" Then ph(8) = 10
If ph(7) = ph(8) Then
count4 = count4 + 1
End If
a9$ = Mid(ps, 9, 1)
a10$ = Mid(ps, 10, 1)
ph(9) = Val(a9$)
ph(10) = Val(a10$)
If a9$ = "A" Then ph(9) = 10
If a10$ = "A" Then ph(10) = 10
If a9$ = "B" Then ph(9) = 11
If a10$ = "B" Then ph(10) = 11
If a9$ = "C" Then ph(9) = 12
If a10$ = "C" Then ph(10) = 12
If a9$ = "D" Then ph(9) = 13
If a10$ = "D" Then ph(10) = 13
If ph(9) = ph(10) Then
count5 = count5 + 1
End If
a11$ = Mid(ps, 11, 1)
a12$ = Mid(ps, 12, 1)
ph(11) = Val(a11$)
ph(12) = Val(a12$)
If ph(11) = ph(12) Then
count6 = count6 + 1
End If
Next x
countsum = count1 + count2 + count3 + count4 + count5 + count6
Write #2, "Locus 1 Homozygosity Count = ", count1, " locus 2 = ", count2, " locus 3 = ", count3, " locus 4 = ", count4, " locus 5 = ", count5, " locus 6 = ", count6, " TOTAL count = ", countsum
Close #1
Close #2
The nice thing about the world's forensic statisticians having
hoodwinked virtually everyone by making reference to
billions,trillions .... even heptillions is I have a whole area of
research almost to myself.
Recently I was finding that excess homozygosity does not
necessarily lead to a greater probability of unrelated false
matches of DNA profiles.
I have now found another rule that comes into play in
unrelated DNA profile matches.
I did 4 off 50,000 runs of 6- loci profiles based on Australian
data.
First run produced 64 off 12 digit matches.
Adding some homozygosity, resorting and checking again
there was only 50 matches.
Another run , without excess HZ, produced 55 matches
Repeating but this time adding the homozygosity at generation,
sorting and match checking, again resulted in only 50 pairs.
Australian caucasian data just for locus D5
Code here / Allele / Allele frequency %
0 8 0.49
1 9 3.08
2 10 6.03
3 11 39.16
4 12 34.36
5 13 15.64
6 14 1.11
7 15 0.12
For unmodified D5 subset of 50,000 profiles
bi-allele totals, firstly homoZ then largest heteroZ counts
00 - 2
11 - 59
22 - 166
33 - 7,647
44 - 5,848
55 - 1,274
66 - 4
34 - 13,364
35 - 6,165
45 - 5,367
23 - 2,357
24 - 2,132
Then adding HZ, seriously over-inflating the
rarest allele occurances
00 - 86
11 - 562
22 - 1,104
33 - 10,946
44 - 6,797
55 - 1,300
66 - 4
34 - 11,184
35 - 5,130
45 - 4,482
23 - 1,937
24 - 1,781
Overall excess D5 HZ = 38.7 %
33 excess HZ 43%
44 excess HZ 21.8%
So new 33 count is now nearly the same as the now reduced
34 count and as a consequence is equally likely to influence the
6 loci match situation but only if the excess HZ on other loci
is similar in behaviour.
But if 33 excess HZ was only 20%
then 33 count goes up from 7,647 to about 9,200 but
still less than the 34 count now about 12,400. The square law has
it that the larger counts predominate disproportionately.
Consider 44 situation ( less likely to occur in matches)
but now 6,797 >> 4,482 so critical in comparison.
Now consider the general case.
h is excess homozygosity expressed as scaling factor ie 1.28 rather than 28 per cent.
a is the allele frequency of an alelle that has excess HZ factor of h
and b is the highest (no excess HZ ) allele frequency of the remaining alleles.
Normal HZ is a^2 and b^2
Highest normal hetero-bi-allele frequency = 2*a*b
Excess HZ increases a to (a^2)*h
Highest hetero-bi-allele frequency becomes approximately = 2*b *{a - 0.5*a(h-1)}
= a* b*(3 - h)
So criticality is if a*a*h <> a*b *(3-h)
or a*h <> b* (3-h)
So even if h is high then the b frequency if high
enough in comparison it can over-ride.
Then this effect has to be present in most loci to
affect the overall match probability.
Analysis of D8 results for 50,000 run
Code / Allele f. & / unenhanced HZ count
0 1.98 23
1 0.62 3
2 7.9 301
3 8.77 344
4 14.57 1,074
5 32.47 5,423
6 21.73 2,423
7 9.0 396
8 2.72 39
9 0.25 1
Principle HeteroZ
56 7,009
45 4,700
46 3,279
57 2,984
47 1,299
Added Homozygosity
00 323
11 104
22 1522
33 1,550
44 2,719
55 7,095
66 2,834
77 438
88 40
99 1
principal HeteroZ
56 5,820
45 3,924
46 2,704
57 2,478
67 1,670
47 1,084
Then results for the principle D5 and D8 alleles in 2 off 50,000 runs
First has 55 matches on 6 loci then added HZ giving 50 matches
D5 counts per bi-allele
33 12
44 4
34 29
35 2
45 5
Total 52
Enhanced HZ on D5
33 20
44 8
34 16
35 3
45 1
Total 48
So total contribution of D5 drops from 52
out of 55 to 48 out of 50
D8 counts per bi-allele
44 0
55 7
66 5
56 17
45 10
46 5
57 0
Total 44
Principal D8 enhanzed HZ
44 4
55 16
66 2
56 9
45 4
46 4
57 2
Total 41
So total contribution of D8 drops from 44
out of 55 to 41 out of 50
Second has 64 matches on 6 loci then added HZ giving 50 matches
D5 counts per bi-allele
33 12
44 6
34 33
35 7
45 3
Total 61
Enhanced HZ on principal D5
33 16
44 8
34 16
35 5
45 4
Total 49
So total contribution of D5 drops from 61
out of 64 to 49 out of 50
D8 counts per bi-allele
44 1
55 14
66 2
56 21
45 4
46 6
57 2
Total 50
D8 enhanzed HZ
44 1
55 9
66 6
56 14
45 7
46 3
57 1
Total 41
So total contribution of D8 drops from 50
out of 64 to 41 out of 50
Excess HZ of less than 10 % is generally more likely to reduce the number
of unrelated profile matches in populations of the same type and size.
The first rule is that unrelated matches involve primarily the more common
alleles and the frequency of these common alleles is increased in match
profiles and the rarer alleles decrease in the matches and the rarest
do not occur, at least in sub- 10 million populations.