Exploration of Excess Homozygosity in relation to DNA profile match probability

	Sign In Sign-Up


Visual basic macros and Word macros are 
between the horizontal rules. 
Before going into Visual Basic Editor go into ordinary 
Word and call up anything in the directory you want 
the VB files to go into as this is not designated in the 
following code.
Using plain text handling Notepad with no line wrap 
copy to between Sub / End Sub ,Reset and Run.
Some long conditional statements may break 
and would need reconcattenating before running.

If using straight VB6 then designate the directory 
for files by "replace all" occurances of dec14 to 
c:\vb\dec14 or whatever, also add a sound progress 
indicator before the [ next x ] line 
 
If x/1000 = Int (x/1000 ) Then Beep
 
before highlighting and copying.

In VB6 open New Project
In Form1 open up a Command1 button
Double click this button to open command 
code window and copy and paste the 'DNA' VB code 
between the Private Sub Command1_click ()
and the End Sub
Then Run/Start
Press command1
Wait until Beep/ clicks cease



' Generating  profiles for 6 loci only and adding HZ at generation
' directing pairs and first divider
Dim ph(12)
Dim pb(12)
' change xxxx for different total number of profiles
xxxx = 50000
xxxx = xxxx - 1

' initialising Random Number Generator - RNG
count9 = 0
count8 = 0

Randomize
a = 214013
c = 2531011
x0 = Timer
z = 2 ^ 24
'  1 file 'jun10a-g' for original, un-directed pairs, source data.
' This file is necessary to check on the performance of the RNG
' when a matched pair is found then it is highly unlikely that
' both sequences as generated, before pair directing, would
' be the same - more likely a manifest of repeat within the RNG
' (reason for adopting the 214013 / 2531011 RNG )
' Use 'Word' find function on part of the sequences, including pair reversals,
' with luck would include a  'homozygotic' pair eg (3,3) say ,so no reversal
' on that pair

Open "jun10a-g" For Output As #1
Open "jun10a-Shz.txt" For Output As #2
' outputs directed and divided by first digit
Open "jun10a-0" For Output As #10
Open "jun10a-1" For Output As #11
Open "jun10a-2" For Output As #12
Open "jun10a-3" For Output As #13
Open "jun10a-4" For Output As #14
Open "jun10a-5" For Output As #15
Open "jun10a-6" For Output As #16
Open "jun10a-7" For Output As #17
Open "jun10a-8" For Output As #18
Open "jun10a-9" For Output As #19

For x = 0 To xxxx

For j = 0 To 1
' D3 ,first locus
' RNG random number generator
  temp = x0 * a + c
  temp = temp / z
  x1 = (temp - Fix(temp)) * z
  x0 = x1
  phj = x1 / z

ph(j) = phj
If ph(j) < 0.0037 Then ph(j) = 11
If ph(j) < 0.0111 Then ph(j) = 1
If ph(j) < 0.1318 Then ph(j) = 2
If ph(j) < 0.3966 Then ph(j) = 3
If ph(j) < 0.6466 Then ph(j) = 4
If ph(j) < 0.8313 Then ph(j) = 5
If ph(j) < 0.9865 Then ph(j) = 6
If ph(j) < 0.9976 Then ph(j) = 7
If ph(j) < 1 Then ph(j) = 8


If ph(j) > 10 Then ph(j) = 0
Next j

For j = 2 To 3
' vWA
' RNG
  temp = x0 * a + c
  temp = temp / z
  x1 = (temp - Fix(temp)) * z
  x0 = x1
  phj = x1 / z
ph(j) = phj

If ph(j) < 0.1268 Then ph(j) = 11
If ph(j) < 0.2263 Then ph(j) = 1
If ph(j) < 0.4618 Then ph(j) = 2
If ph(j) < 0.7192 Then ph(j) = 3
If ph(j) < 0.9138 Then ph(j) = 4
If ph(j) < 0.984 Then ph(j) = 5
If ph(j) < 0.9982 Then ph(j) = 6

If ph(j) < 1 Then ph(j) = 7
If ph(j) > 10 Then ph(j) = 0
Next j

For j = 4 To 5
' D5
' RNG
  temp = x0 * a + c
  temp = temp / z
  x1 = (temp - Fix(temp)) * z
  x0 = x1
  phj = x1 / z
ph(j) = phj

If ph(j) < 0.0049 Then ph(j) = 11
If ph(j) < 0.0357 Then ph(j) = 1
If ph(j) < 0.096 Then ph(j) = 2
If ph(j) < 0.4876 Then ph(j) = 3
If ph(j) < 0.8312 Then ph(j) = 4
If ph(j) < 0.9876 Then ph(j) = 5
If ph(j) < 0.9987 Then ph(j) = 6

If ph(j) < 1 Then ph(j) = 7

If ph(j) > 10 Then ph(j) = 0
Next j

For j = 6 To 7
' D8
' RNG
  temp = x0 * a + c
  temp = temp / z
  x1 = (temp - Fix(temp)) * z
  x0 = x1
  phj = x1 / z
ph(j) = phj
pb(j) = "Z"
If ph(j) < 0.0198 Then ph(j) = 11
If ph(j) < 0.026 Then ph(j) = 1
If ph(j) < 0.105 Then ph(j) = 2
If ph(j) < 0.1927 Then ph(j) = 3
If ph(j) < 0.3384 Then ph(j) = 4
If ph(j) < 0.6631 Then ph(j) = 5
If ph(j) < 0.8804 Then ph(j) = 6
If ph(j) < 0.9705 Then ph(j) = 7
If ph(j) < 0.9977 Then ph(j) = 8

If ph(j) < 1 Then ph(j) = 9

If ph(j) > 10 Then ph(j) = 0

Next j

For j = 8 To 9
' D18
' RNG
  temp = x0 * a + c
  temp = temp / z
  x1 = (temp - Fix(temp)) * z
  x0 = x1
  phj = x1 / z
ph(j) = phj
pb(j) = "Z"

If ph(j) < 0.01 Then ph(j) = 11
If ph(j) < 0.0238 Then ph(j) = 1
If ph(j) < 0.1742 Then ph(j) = 2
If ph(j) < 0.3033 Then ph(j) = 3
If ph(j) < 0.4462 Then ph(j) = 4
If ph(j) < 0.584 Then ph(j) = 5
If ph(j) < 0.7168 Then ph(j) = 6
If ph(j) < 0.8421 Then ph(j) = 7
If ph(j) < 0.916 Then ph(j) = 8
If ph(j) < 0.9626 Then ph(j) = 9
If ph(j) < 0.9824 And ph(j) >= 0.9626 Then pb(j) = "A"
If ph(j) < 0.9924 And ph(j) >= 0.9824 Then pb(j) = "B"
If ph(j) < 0.9962 And ph(j) >= 0.9924 Then pb(j) = "C"
If ph(j) < 0.9977 And ph(j) >= 0.9962 Then pb(j) = "D"
If ph(j) < 1 And ph(j) >= 0.9977 Then pb(j) = "E"

If ph(j) > 10 Then ph(j) = 0
If pb(j) <> "Z" Then ph(j) = pb(j)
Next j

For j = 10 To 11
' FGA
' RNG
  temp = x0 * a + c
  temp = temp / z
  x1 = (temp - Fix(temp)) * z
  x0 = x1
  phj = x1 / z
ph(j) = phj

pb(j) = "Z"

If ph(j) < 0.0173 Then ph(j) = 11
If ph(j) < 0.0767 Then ph(j) = 1
If ph(j) < 0.2227 Then ph(j) = 2
If ph(j) < 0.2239 Then ph(j) = 3
If ph(j) < 0.4021 Then ph(j) = 4
If ph(j) < 0.4046 Then ph(j) = 5
If ph(j) < 0.5927 Then ph(j) = 6
If ph(j) < 0.5989 Then ph(j) = 7
If ph(j) < 0.7437 Then ph(j) = 8
If ph(j) < 0.7487 Then ph(j) = 9
If ph(j) < 0.8985 And ph(j) >= 0.7487 Then pb(j) = "A"
If ph(j) < 0.974 And ph(j) >= 0.8985 Then pb(j) = "B"
If ph(j) < 0.995 And ph(j) >= 0.974 Then pb(j) = "C"
If ph(j) < 1 And ph(j) >= 0.995 Then pb(j) = "D"

If ph(j) > 10 Then ph(j) = 0
If pb(j) <> "Z" Then ph(j) = pb(j)

Next j

' output the original generated file

Write #1, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
' Because in real DNA profiles without further info ,no one
' knows which allele in each pair came from the mother or father
' by convention they are written smaller ,larger (or equal).
' The following directs each pair

For j = 0 To 10 Step 2
If ph(j + 1) < ph(j) Then
jjj = ph(j)
ph(j) = ph(j + 1)
ph(j + 1) = jjj
End If
Next j


rng = 6 * Rnd


If 0 <= rng And rng < 1 Then
ph(5) = ph(4)
End If

If 1 <= rng And rng < 2 Then
ph(7) = ph(6)
End If


' put extra conditional statements here to reduce
' the number of files or just delete some of the following
'
' dividing on first column, file by file
If ph(0) = 0 Then
Write #10, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count0 = count0 + 1

End If
If ph(0) = 1 Then
Write #11, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count1 = count1 + 1

End If
If ph(0) = 2 Then
Write #12, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count2 = count2 + 1

End If
If ph(0) = 3 Then
Write #13, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count3 = count3 + 1

End If
If ph(0) = 4 Then
Write #14, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count4 = count4 + 1

End If
If ph(0) = 5 Then
Write #15, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count5 = count5 + 1

End If
If ph(0) = 6 Then
Write #16, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count6 = count6 + 1

End If
If ph(0) = 7 Then
Write #17, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count7 = count7 + 1

End If
If ph(0) = 8 Then
Write #18, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count8 = count8 + 1

End If
If ph(0) = 9 Then
Write #19, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11)
count9 = count9 + 1
End If

Next x
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
Close #1


Close #2
' count file for data to fix for - next loops in sucessive dividings
Open "jun10a-c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20




' Add homozygosity on loci  3 & 4
' xxxx is count  = ????
xxxx = 50000

Dim ph(12)
Dim ps As String

Open "jun07-S.txt" For Input As #1
Open "jun07-Shz.txt" For Output As #2

xxxx = xxxx - 1
For x = 0 To xxxx

Input #1, ps
a1$ = Mid(ps, 1, 1)
a2$ = Mid(ps, 2, 1)
a3$ = Mid(ps, 3, 1)
a4$ = Mid(ps, 4, 1)
a5$ = Mid(ps, 5, 1)
a6$ = Mid(ps, 6, 1)
a7$ = Mid(ps, 7, 1)
a8$ = Mid(ps, 8, 1)
a9$ = Mid(ps, 9, 1)
a10$ = Mid(ps, 10, 1)
a11$ = Mid(ps, 11, 1)
a12$ = Mid(ps, 12, 1)



ph(0) = a1$
ph(1) = a2$
ph(2) = a3$
ph(3) = a4$
ph(4) = a5$
ph(5) = a6$
ph(6) = a7$
ph(7) = a8$
ph(8) = a9$
ph(9) = a10$
ph(10) = a11$
ph(11) = a12$


rng = 6 * Rnd


If 0 <= rng And rng < 1 Then
Write #2, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(4) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) 
End If

If 1 <= rng And rng < 2 Then
Write #2, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(6) & ph(8) & ph(9) & ph(10) & ph(11) 
End If


If 2 <= rng And rng < 6 Then
Write #2, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) 
End If

Next x
Close #1

Close #2




' Find homozygosity
' xxxx is count  = ????
xxxx = 50000

Dim ph(12)
Dim ps As String
count1 = 0
count2 = 0
count3 = 0
count4 = 0
count5 = 0
count6 = 0
count7 = 0
count8 = 0
count9 = 0
Open "jun01b-g.txt" For Input As #1
Open "jun01b-hz.txt" For Output As #2

xxxx = xxxx - 1
For x = 0 To xxxx

Input #1, ps

a1$ = Mid(ps, 1, 1)
a2$ = Mid(ps, 2, 1)
ph(1) = Val(a1$)
ph(2) = Val(a2$)
If ph(1) = ph(2) Then
count1 = count1 + 1
End If

a3$ = Mid(ps, 3, 1)
a4$ = Mid(ps, 4, 1)
ph(3) = Val(a3$)
ph(4) = Val(a4$)
If ph(3) = ph(4) Then
count2 = count2 + 1
End If

a5$ = Mid(ps, 5, 1)
a6$ = Mid(ps, 6, 1)
ph(5) = Val(a5$)
ph(6) = Val(a6$)
If ph(5) = ph(6) Then
count3 = count3 + 1
End If


a7$ = Mid(ps, 7, 1)
a8$ = Mid(ps, 8, 1)
ph(7) = Val(a7$)
ph(8) = Val(a8$)
If a7$ = "A" Then ph(7) = 10
If a8$ = "A" Then ph(8) = 10
If ph(7) = ph(8) Then
count4 = count4 + 1
End If

a9$ = Mid(ps, 9, 1)
a10$ = Mid(ps, 10, 1)
ph(9) = Val(a9$)
ph(10) = Val(a10$)
If a9$ = "A" Then ph(9) = 10
If a10$ = "A" Then ph(10) = 10
If a9$ = "B" Then ph(9) = 11
If a10$ = "B" Then ph(10) = 11
If a9$ = "C" Then ph(9) = 12
If a10$ = "C" Then ph(10) = 12
If a9$ = "D" Then ph(9) = 13
If a10$ = "D" Then ph(10) = 13

If ph(9) = ph(10) Then
count5 = count5 + 1
End If

a11$ = Mid(ps, 11, 1)
a12$ = Mid(ps, 12, 1)
ph(11) = Val(a11$)
ph(12) = Val(a12$)
If ph(11) = ph(12) Then
count6 = count6 + 1
End If



Next x
countsum = count1 + count2 + count3 + count4 + count5 + count6 
Write #2, "Locus 1 Homozygosity Count = ", count1, " locus 2 = ", count2, " locus 3 = ", count3, " locus 4 = ", count4, " locus 5 = ", count5, " locus 6 = ", count6, " TOTAL count = ", countsum

Close #1
Close #2




The nice thing about the world's forensic statisticians having 
hoodwinked virtually everyone by making reference to 
billions,trillions .... even heptillions is I have a whole area of 
research almost to myself.
Recently I was finding that excess homozygosity does not 
necessarily lead to a greater probability of unrelated false 
matches of DNA profiles.
I have now found another rule that comes into play in 
unrelated DNA profile matches.
I did 4 off 50,000 runs of 6- loci profiles based on Australian 
data.
First run produced 64 off 12 digit matches.
Adding some homozygosity, resorting and checking again 
there was only 50 matches.
Another run , without excess HZ, produced 55 matches
Repeating but this time adding the homozygosity at generation, 
sorting and match checking, again resulted in only 50 pairs.

Australian caucasian data just for locus D5 
Code here / Allele / Allele frequency %
0	8	0.49
1	9	3.08
2	10	6.03
3	11	39.16
4	12	34.36
5	13	15.64
6	14	1.11
7	15	0.12

For unmodified D5 subset of 50,000 profiles
bi-allele totals, firstly homoZ then largest heteroZ counts
00	- 2
11	- 59
22	- 166
33	- 7,647
44	- 5,848
55 	- 1,274
66	- 4

34	- 13,364
35	- 6,165
45	- 5,367
23	- 2,357
24	- 2,132

Then adding HZ, seriously over-inflating the 
rarest allele occurances
00	- 86
11	- 562
22	- 1,104
33	- 10,946
44	- 6,797
55	- 1,300
66	 - 4

34	- 11,184
35	- 5,130
45	- 4,482
23	- 1,937
24	- 1,781

Overall excess D5 HZ = 38.7 %
33 excess HZ 43%
44 excess HZ 21.8%
So new 33 count is now nearly the same as the now reduced  
34 count and as a consequence is equally likely to influence the 
6 loci match situation but only if the excess HZ on other loci 
is similar in behaviour. 
But if 33 excess HZ was only 20% 
then 33 count goes up from 7,647 to about 9,200 but  
still  less than the 34 count now about 12,400. The square law has 
it that the larger counts predominate disproportionately.

Consider 44 situation ( less likely to occur in matches)
but now 6,797 >> 4,482 so critical in comparison.

Now consider the general case.
h is excess homozygosity expressed as scaling factor ie 1.28 rather than 28 per cent.
a is the allele frequency of an alelle that has excess HZ factor of h 
and b is the highest (no excess HZ ) allele frequency of the remaining alleles.
Normal HZ is a^2 and b^2
Highest normal hetero-bi-allele frequency = 2*a*b
Excess HZ increases a to (a^2)*h
Highest hetero-bi-allele frequency becomes approximately = 2*b *{a - 0.5*a(h-1)}
= a* b*(3 - h)
So criticality is if a*a*h <> a*b *(3-h)
or a*h <> b* (3-h)

So even if h is high then the b frequency if high 
enough in comparison it can over-ride.
Then this effect has to be present in most loci to 
affect the overall match probability.

Analysis of D8 results for 50,000 run
Code / Allele f. & / unenhanced HZ count
0	1.98	23
1	0.62	3
2	7.9	301
3	8.77	344
4	14.57	1,074
5	32.47	5,423
6	21.73	2,423
7	9.0	396
8	2.72	39
9	0.25	1
Principle HeteroZ
56	7,009
45	4,700
46	3,279
57	2,984
47	1,299

Added Homozygosity
00	323
11	104
22	1522
33	1,550
44	2,719
55	7,095
66	2,834
77	438
88	40
99	1
principal HeteroZ
56	5,820
45	3,924
46	2,704
57	2,478
67	1,670
47	1,084

Then results for the principle D5 and D8 alleles in 2 off 50,000 runs
First has 55 matches on 6 loci then added HZ giving 50 matches
D5 counts per bi-allele
33	12
44	4
34	29
35	2
45	5
Total 	52

Enhanced HZ on D5
33	20
44	8
34	16
35	3
45	1
Total 	48
So total contribution of D5 drops from 52 
out of 55 to 48 out of 50

D8 counts per bi-allele
44	0
55	7
66	5
56	17
45	10
46	5
57	0
Total	 44

Principal D8 enhanzed HZ
44	4
55	16
66	2
56	9
45	4
46	4
57	2
Total	41
So total contribution of D8 drops from 44 
out of 55 to 41 out of 50

Second has 64 matches on 6 loci then added HZ giving 50 matches
D5 counts per bi-allele
33	12
44	6
34	33
35	7
45	3
Total 	61

Enhanced HZ on principal D5
33	16
44	8
34	16
35	5
45	4
Total 	49
So total contribution of D5 drops from  61
out of 64 to 49 out of 50

D8 counts per bi-allele
44	1
55	14
66	2
56	21
45	4
46	6
57	2
Total	 50

D8 enhanzed HZ
44	1
55	9
66	6
56	14
45	7
46	3
57	1
Total	41
So total contribution of D8 drops from  50
out of 64 to  41 out of 50

Excess HZ of less than 10 % is generally more likely to reduce the number 
of unrelated profile matches in populations of the same type and size.

The first rule is that unrelated matches involve primarily the more common 
alleles and the frequency of these common alleles is increased in match 
profiles and the rarer alleles decrease in the matches and the rarest 
do not occur, at least in sub- 10 million populations.
Email Paul Nutteing by removing 4 of the 5 dots
or email Paul Nutteing ,remove all but one dot
Or a message on usenet group uk.legal has got to me recenently a couple of times.
A lot of the contents of this file plus other material 'peer reviewed' on the main forensic science usergroup

Background
A simulation of a large DNA profile database
A simulation of DNA profile 'families'
A simulation of DNA profile families with consanguinity
A simulation of DNA profile 'families' for 6 generations
dnas.htm revisited with all alleles represented
dnas.htm revisited for >8 percent allele frequency subset (similar ancestry )
Simulation of Taiwanese Tao and Rukai populations to explore the effect of within and without ancestral clusters
Basques autochthonous DNA profiles simulation, 9 loci
Australian Capital Caucasian 9 loci simulation
Australian Capital Caucasian 9 loci simulation, >= 5% allele frequency
CODIS, 13 Loci Caucasian Simulation