Exploration of Excess Homozygosity in relation to DNA profile match probability

Exploration of Excess Homozygosity in relation to DNA profile match probability




Visual basic macros and Word macros are 
between the horizontal rules. 
Before going into Visual Basic Editor go into ordinary 
Word and call up anything in the directory you want 
the VB files to go into as this is not designated in the 
following code.
Using plain text handling Notepad with no line wrap 
copy to between Sub / End Sub ,Reset and Run.
Some long conditional statements may break 
and would need reconcattenating before running.

If using straight VB6 then designate the directory 
for files by "replace all" occurances of dec14 to 
c:\vb\dec14 or whatever, also add a sound progress 
indicator before the [ next x ] line 
 
If x/1000 = Int (x/1000 ) Then Beep
 
before highlighting and copying.

In VB6 open New Project
In Form1 open up a Command1 button
Double click this button to open command 
code window and copy and paste the 'DNA' VB code 
between the Private Sub Command1_click ()
and the End Sub
Then Run/Start
Press command1
Wait until Beep/ clicks cease


' Generating profiles for 6 loci only and adding HZ at generation ' directing pairs and first divider Dim ph(12) Dim pb(12) ' change xxxx for different total number of profiles xxxx = 50000 xxxx = xxxx - 1 ' initialising Random Number Generator - RNG count9 = 0 count8 = 0 Randomize a = 214013 c = 2531011 x0 = Timer z = 2 ^ 24 ' 1 file 'jun10a-g' for original, un-directed pairs, source data. ' This file is necessary to check on the performance of the RNG ' when a matched pair is found then it is highly unlikely that ' both sequences as generated, before pair directing, would ' be the same - more likely a manifest of repeat within the RNG ' (reason for adopting the 214013 / 2531011 RNG ) ' Use 'Word' find function on part of the sequences, including pair reversals, ' with luck would include a 'homozygotic' pair eg (3,3) say ,so no reversal ' on that pair Open "jun10a-g" For Output As #1 Open "jun10a-Shz.txt" For Output As #2 ' outputs directed and divided by first digit Open "jun10a-0" For Output As #10 Open "jun10a-1" For Output As #11 Open "jun10a-2" For Output As #12 Open "jun10a-3" For Output As #13 Open "jun10a-4" For Output As #14 Open "jun10a-5" For Output As #15 Open "jun10a-6" For Output As #16 Open "jun10a-7" For Output As #17 Open "jun10a-8" For Output As #18 Open "jun10a-9" For Output As #19 For x = 0 To xxxx For j = 0 To 1 ' D3 ,first locus ' RNG random number generator temp = x0 * a + c temp = temp / z x1 = (temp - Fix(temp)) * z x0 = x1 phj = x1 / z ph(j) = phj If ph(j) < 0.0037 Then ph(j) = 11 If ph(j) < 0.0111 Then ph(j) = 1 If ph(j) < 0.1318 Then ph(j) = 2 If ph(j) < 0.3966 Then ph(j) = 3 If ph(j) < 0.6466 Then ph(j) = 4 If ph(j) < 0.8313 Then ph(j) = 5 If ph(j) < 0.9865 Then ph(j) = 6 If ph(j) < 0.9976 Then ph(j) = 7 If ph(j) < 1 Then ph(j) = 8 If ph(j) > 10 Then ph(j) = 0 Next j For j = 2 To 3 ' vWA ' RNG temp = x0 * a + c temp = temp / z x1 = (temp - Fix(temp)) * z x0 = x1 phj = x1 / z ph(j) = phj If ph(j) < 0.1268 Then ph(j) = 11 If ph(j) < 0.2263 Then ph(j) = 1 If ph(j) < 0.4618 Then ph(j) = 2 If ph(j) < 0.7192 Then ph(j) = 3 If ph(j) < 0.9138 Then ph(j) = 4 If ph(j) < 0.984 Then ph(j) = 5 If ph(j) < 0.9982 Then ph(j) = 6 If ph(j) < 1 Then ph(j) = 7 If ph(j) > 10 Then ph(j) = 0 Next j For j = 4 To 5 ' D5 ' RNG temp = x0 * a + c temp = temp / z x1 = (temp - Fix(temp)) * z x0 = x1 phj = x1 / z ph(j) = phj If ph(j) < 0.0049 Then ph(j) = 11 If ph(j) < 0.0357 Then ph(j) = 1 If ph(j) < 0.096 Then ph(j) = 2 If ph(j) < 0.4876 Then ph(j) = 3 If ph(j) < 0.8312 Then ph(j) = 4 If ph(j) < 0.9876 Then ph(j) = 5 If ph(j) < 0.9987 Then ph(j) = 6 If ph(j) < 1 Then ph(j) = 7 If ph(j) > 10 Then ph(j) = 0 Next j For j = 6 To 7 ' D8 ' RNG temp = x0 * a + c temp = temp / z x1 = (temp - Fix(temp)) * z x0 = x1 phj = x1 / z ph(j) = phj pb(j) = "Z" If ph(j) < 0.0198 Then ph(j) = 11 If ph(j) < 0.026 Then ph(j) = 1 If ph(j) < 0.105 Then ph(j) = 2 If ph(j) < 0.1927 Then ph(j) = 3 If ph(j) < 0.3384 Then ph(j) = 4 If ph(j) < 0.6631 Then ph(j) = 5 If ph(j) < 0.8804 Then ph(j) = 6 If ph(j) < 0.9705 Then ph(j) = 7 If ph(j) < 0.9977 Then ph(j) = 8 If ph(j) < 1 Then ph(j) = 9 If ph(j) > 10 Then ph(j) = 0 Next j For j = 8 To 9 ' D18 ' RNG temp = x0 * a + c temp = temp / z x1 = (temp - Fix(temp)) * z x0 = x1 phj = x1 / z ph(j) = phj pb(j) = "Z" If ph(j) < 0.01 Then ph(j) = 11 If ph(j) < 0.0238 Then ph(j) = 1 If ph(j) < 0.1742 Then ph(j) = 2 If ph(j) < 0.3033 Then ph(j) = 3 If ph(j) < 0.4462 Then ph(j) = 4 If ph(j) < 0.584 Then ph(j) = 5 If ph(j) < 0.7168 Then ph(j) = 6 If ph(j) < 0.8421 Then ph(j) = 7 If ph(j) < 0.916 Then ph(j) = 8 If ph(j) < 0.9626 Then ph(j) = 9 If ph(j) < 0.9824 And ph(j) >= 0.9626 Then pb(j) = "A" If ph(j) < 0.9924 And ph(j) >= 0.9824 Then pb(j) = "B" If ph(j) < 0.9962 And ph(j) >= 0.9924 Then pb(j) = "C" If ph(j) < 0.9977 And ph(j) >= 0.9962 Then pb(j) = "D" If ph(j) < 1 And ph(j) >= 0.9977 Then pb(j) = "E" If ph(j) > 10 Then ph(j) = 0 If pb(j) <> "Z" Then ph(j) = pb(j) Next j For j = 10 To 11 ' FGA ' RNG temp = x0 * a + c temp = temp / z x1 = (temp - Fix(temp)) * z x0 = x1 phj = x1 / z ph(j) = phj pb(j) = "Z" If ph(j) < 0.0173 Then ph(j) = 11 If ph(j) < 0.0767 Then ph(j) = 1 If ph(j) < 0.2227 Then ph(j) = 2 If ph(j) < 0.2239 Then ph(j) = 3 If ph(j) < 0.4021 Then ph(j) = 4 If ph(j) < 0.4046 Then ph(j) = 5 If ph(j) < 0.5927 Then ph(j) = 6 If ph(j) < 0.5989 Then ph(j) = 7 If ph(j) < 0.7437 Then ph(j) = 8 If ph(j) < 0.7487 Then ph(j) = 9 If ph(j) < 0.8985 And ph(j) >= 0.7487 Then pb(j) = "A" If ph(j) < 0.974 And ph(j) >= 0.8985 Then pb(j) = "B" If ph(j) < 0.995 And ph(j) >= 0.974 Then pb(j) = "C" If ph(j) < 1 And ph(j) >= 0.995 Then pb(j) = "D" If ph(j) > 10 Then ph(j) = 0 If pb(j) <> "Z" Then ph(j) = pb(j) Next j ' output the original generated file Write #1, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) ' Because in real DNA profiles without further info ,no one ' knows which allele in each pair came from the mother or father ' by convention they are written smaller ,larger (or equal). ' The following directs each pair For j = 0 To 10 Step 2 If ph(j + 1) < ph(j) Then jjj = ph(j) ph(j) = ph(j + 1) ph(j + 1) = jjj End If Next j rng = 6 * Rnd If 0 <= rng And rng < 1 Then ph(5) = ph(4) End If If 1 <= rng And rng < 2 Then ph(7) = ph(6) End If ' put extra conditional statements here to reduce ' the number of files or just delete some of the following ' ' dividing on first column, file by file If ph(0) = 0 Then Write #10, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) count0 = count0 + 1 End If If ph(0) = 1 Then Write #11, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) count1 = count1 + 1 End If If ph(0) = 2 Then Write #12, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) count2 = count2 + 1 End If If ph(0) = 3 Then Write #13, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) count3 = count3 + 1 End If If ph(0) = 4 Then Write #14, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) count4 = count4 + 1 End If If ph(0) = 5 Then Write #15, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) count5 = count5 + 1 End If If ph(0) = 6 Then Write #16, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) count6 = count6 + 1 End If If ph(0) = 7 Then Write #17, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) count7 = count7 + 1 End If If ph(0) = 8 Then Write #18, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) count8 = count8 + 1 End If If ph(0) = 9 Then Write #19, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) count9 = count9 + 1 End If Next x Close #10 Close #11 Close #12 Close #13 Close #14 Close #15 Close #16 Close #17 Close #18 Close #19 Close #1 Close #2 ' count file for data to fix for - next loops in sucessive dividings Open "jun10a-c" For Output As #20 Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9 Close #20
' Add homozygosity on loci 3 & 4 ' xxxx is count = ???? xxxx = 50000 Dim ph(12) Dim ps As String Open "jun07-S.txt" For Input As #1 Open "jun07-Shz.txt" For Output As #2 xxxx = xxxx - 1 For x = 0 To xxxx Input #1, ps a1$ = Mid(ps, 1, 1) a2$ = Mid(ps, 2, 1) a3$ = Mid(ps, 3, 1) a4$ = Mid(ps, 4, 1) a5$ = Mid(ps, 5, 1) a6$ = Mid(ps, 6, 1) a7$ = Mid(ps, 7, 1) a8$ = Mid(ps, 8, 1) a9$ = Mid(ps, 9, 1) a10$ = Mid(ps, 10, 1) a11$ = Mid(ps, 11, 1) a12$ = Mid(ps, 12, 1) ph(0) = a1$ ph(1) = a2$ ph(2) = a3$ ph(3) = a4$ ph(4) = a5$ ph(5) = a6$ ph(6) = a7$ ph(7) = a8$ ph(8) = a9$ ph(9) = a10$ ph(10) = a11$ ph(11) = a12$ rng = 6 * Rnd If 0 <= rng And rng < 1 Then Write #2, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(4) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) End If If 1 <= rng And rng < 2 Then Write #2, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(6) & ph(8) & ph(9) & ph(10) & ph(11) End If If 2 <= rng And rng < 6 Then Write #2, ph(0) & ph(1) & ph(2) & ph(3) & ph(4) & ph(5) & ph(6) & ph(7) & ph(8) & ph(9) & ph(10) & ph(11) End If Next x Close #1 Close #2
' Find homozygosity ' xxxx is count = ???? xxxx = 50000 Dim ph(12) Dim ps As String count1 = 0 count2 = 0 count3 = 0 count4 = 0 count5 = 0 count6 = 0 count7 = 0 count8 = 0 count9 = 0 Open "jun01b-g.txt" For Input As #1 Open "jun01b-hz.txt" For Output As #2 xxxx = xxxx - 1 For x = 0 To xxxx Input #1, ps a1$ = Mid(ps, 1, 1) a2$ = Mid(ps, 2, 1) ph(1) = Val(a1$) ph(2) = Val(a2$) If ph(1) = ph(2) Then count1 = count1 + 1 End If a3$ = Mid(ps, 3, 1) a4$ = Mid(ps, 4, 1) ph(3) = Val(a3$) ph(4) = Val(a4$) If ph(3) = ph(4) Then count2 = count2 + 1 End If a5$ = Mid(ps, 5, 1) a6$ = Mid(ps, 6, 1) ph(5) = Val(a5$) ph(6) = Val(a6$) If ph(5) = ph(6) Then count3 = count3 + 1 End If a7$ = Mid(ps, 7, 1) a8$ = Mid(ps, 8, 1) ph(7) = Val(a7$) ph(8) = Val(a8$) If a7$ = "A" Then ph(7) = 10 If a8$ = "A" Then ph(8) = 10 If ph(7) = ph(8) Then count4 = count4 + 1 End If a9$ = Mid(ps, 9, 1) a10$ = Mid(ps, 10, 1) ph(9) = Val(a9$) ph(10) = Val(a10$) If a9$ = "A" Then ph(9) = 10 If a10$ = "A" Then ph(10) = 10 If a9$ = "B" Then ph(9) = 11 If a10$ = "B" Then ph(10) = 11 If a9$ = "C" Then ph(9) = 12 If a10$ = "C" Then ph(10) = 12 If a9$ = "D" Then ph(9) = 13 If a10$ = "D" Then ph(10) = 13 If ph(9) = ph(10) Then count5 = count5 + 1 End If a11$ = Mid(ps, 11, 1) a12$ = Mid(ps, 12, 1) ph(11) = Val(a11$) ph(12) = Val(a12$) If ph(11) = ph(12) Then count6 = count6 + 1 End If Next x countsum = count1 + count2 + count3 + count4 + count5 + count6 Write #2, "Locus 1 Homozygosity Count = ", count1, " locus 2 = ", count2, " locus 3 = ", count3, " locus 4 = ", count4, " locus 5 = ", count5, " locus 6 = ", count6, " TOTAL count = ", countsum Close #1 Close #2
The nice thing about the world's forensic statisticians having hoodwinked virtually everyone by making reference to billions,trillions .... even heptillions is I have a whole area of research almost to myself. Recently I was finding that excess homozygosity does not necessarily lead to a greater probability of unrelated false matches of DNA profiles. I have now found another rule that comes into play in unrelated DNA profile matches. I did 4 off 50,000 runs of 6- loci profiles based on Australian data. First run produced 64 off 12 digit matches. Adding some homozygosity, resorting and checking again there was only 50 matches. Another run , without excess HZ, produced 55 matches Repeating but this time adding the homozygosity at generation, sorting and match checking, again resulted in only 50 pairs. Australian caucasian data just for locus D5 Code here / Allele / Allele frequency % 0 8 0.49 1 9 3.08 2 10 6.03 3 11 39.16 4 12 34.36 5 13 15.64 6 14 1.11 7 15 0.12 For unmodified D5 subset of 50,000 profiles bi-allele totals, firstly homoZ then largest heteroZ counts 00 - 2 11 - 59 22 - 166 33 - 7,647 44 - 5,848 55 - 1,274 66 - 4 34 - 13,364 35 - 6,165 45 - 5,367 23 - 2,357 24 - 2,132 Then adding HZ, seriously over-inflating the rarest allele occurances 00 - 86 11 - 562 22 - 1,104 33 - 10,946 44 - 6,797 55 - 1,300 66 - 4 34 - 11,184 35 - 5,130 45 - 4,482 23 - 1,937 24 - 1,781 Overall excess D5 HZ = 38.7 % 33 excess HZ 43% 44 excess HZ 21.8% So new 33 count is now nearly the same as the now reduced 34 count and as a consequence is equally likely to influence the 6 loci match situation but only if the excess HZ on other loci is similar in behaviour. But if 33 excess HZ was only 20% then 33 count goes up from 7,647 to about 9,200 but still less than the 34 count now about 12,400. The square law has it that the larger counts predominate disproportionately. Consider 44 situation ( less likely to occur in matches) but now 6,797 >> 4,482 so critical in comparison. Now consider the general case. h is excess homozygosity expressed as scaling factor ie 1.28 rather than 28 per cent. a is the allele frequency of an alelle that has excess HZ factor of h and b is the highest (no excess HZ ) allele frequency of the remaining alleles. Normal HZ is a^2 and b^2 Highest normal hetero-bi-allele frequency = 2*a*b Excess HZ increases a to (a^2)*h Highest hetero-bi-allele frequency becomes approximately = 2*b *{a - 0.5*a(h-1)} = a* b*(3 - h) So criticality is if a*a*h <> a*b *(3-h) or a*h <> b* (3-h) So even if h is high then the b frequency if high enough in comparison it can over-ride. Then this effect has to be present in most loci to affect the overall match probability. Analysis of D8 results for 50,000 run Code / Allele f. & / unenhanced HZ count 0 1.98 23 1 0.62 3 2 7.9 301 3 8.77 344 4 14.57 1,074 5 32.47 5,423 6 21.73 2,423 7 9.0 396 8 2.72 39 9 0.25 1 Principle HeteroZ 56 7,009 45 4,700 46 3,279 57 2,984 47 1,299 Added Homozygosity 00 323 11 104 22 1522 33 1,550 44 2,719 55 7,095 66 2,834 77 438 88 40 99 1 principal HeteroZ 56 5,820 45 3,924 46 2,704 57 2,478 67 1,670 47 1,084 Then results for the principle D5 and D8 alleles in 2 off 50,000 runs First has 55 matches on 6 loci then added HZ giving 50 matches D5 counts per bi-allele 33 12 44 4 34 29 35 2 45 5 Total 52 Enhanced HZ on D5 33 20 44 8 34 16 35 3 45 1 Total 48 So total contribution of D5 drops from 52 out of 55 to 48 out of 50 D8 counts per bi-allele 44 0 55 7 66 5 56 17 45 10 46 5 57 0 Total 44 Principal D8 enhanzed HZ 44 4 55 16 66 2 56 9 45 4 46 4 57 2 Total 41 So total contribution of D8 drops from 44 out of 55 to 41 out of 50 Second has 64 matches on 6 loci then added HZ giving 50 matches D5 counts per bi-allele 33 12 44 6 34 33 35 7 45 3 Total 61 Enhanced HZ on principal D5 33 16 44 8 34 16 35 5 45 4 Total 49 So total contribution of D5 drops from 61 out of 64 to 49 out of 50 D8 counts per bi-allele 44 1 55 14 66 2 56 21 45 4 46 6 57 2 Total 50 D8 enhanzed HZ 44 1 55 9 66 6 56 14 45 7 46 3 57 1 Total 41 So total contribution of D8 drops from 50 out of 64 to 41 out of 50 Excess HZ of less than 10 % is generally more likely to reduce the number of unrelated profile matches in populations of the same type and size. The first rule is that unrelated matches involve primarily the more common alleles and the frequency of these common alleles is increased in match profiles and the rarer alleles decrease in the matches and the rarest do not occur, at least in sub- 10 million populations.

Email Paul Nutteing by removing 4 of the 5 dots
or email Paul Nutteing ,remove all but one dot
Or a message on usenet group uk.legal has got to me recenently a couple of times.
A lot of the contents of this file plus other material 'peer reviewed' on the main forensic science usergroup

Background
A simulation of a large DNA profile database
A simulation of DNA profile 'families'
A simulation of DNA profile families with consanguinity
A simulation of DNA profile 'families' for 6 generations
dnas.htm revisited with all alleles represented
dnas.htm revisited for >8 percent allele frequency subset (similar ancestry )
Simulation of Taiwanese Tao and Rukai populations to explore the effect of within and without ancestral clusters
Basques autochthonous DNA profiles simulation, 9 loci
Australian Capital Caucasian 9 loci simulation
Australian Capital Caucasian 9 loci simulation, >= 5% allele frequency
CODIS, 13 Loci Caucasian Simulation


Powered by counter.bloke.com