A purely mathematical database consisting of
2 million unrelated 'DNA profiles' will, on average, contain one match.
The generation is totally random: it would be possible for
one 2 million run to produce no matches, and equally possible for another
to produce 2 matches in 2 million.
The UK NDNAD contains 2 million profiles with this one match
plus many more, due to the inescapable fact that most people
in the UK have ancestors in common, so more chance of shared
alleles and a consequent match.
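The 'one match on average' figure is an expected-collision count: with n profiles there are n(n-1)/2 unordered pairs, each matching with some small probability p. A Python sketch of that arithmetic (the per-pair probability used here is only an illustrative placeholder, not a figure taken from this file):

```python
from math import comb

def expected_matches(n_profiles, p_pair):
    """Expected number of matching pairs among n_profiles independent
    profiles when each unordered pair matches with probability p_pair."""
    return comb(n_profiles, 2) * p_pair

# Hypothetical per-pair match probability, chosen only to illustrate
# how roughly 1 expected match can arise from 2 million random profiles.
print(expected_matches(2_000_000, 5e-13))  # roughly 1
```

A particular run that happens to produce 0 or 2 matches is entirely consistent with an expectation of 1.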
Despite official government sites linking to these files, there are
still corrupt persons knocking out my sites, so for the
purposes of search engines cross-linking them, the files no longer
available on the original web hosting sites were on
http://www.nutteing.50megs.com/dnas.htm , http://www.nutteing.freeisp.co.uk/dnas.htm , http://www.nutteing.batcave.net/dnas.htm , http://home.graffiti.net/nutteing/dnas.htm ,
http://nutteing.no-frills.net/dnas.htm and http://nutteing3.no-frills.net/dnas.htm (the last 2 now down due to host failure).
Details of that match are at the end of this file.
If you found this file in an archive then use keyword "nutteingd" in a
search engine to find an updated version or related pages.
Updated file August 2006
Please contact me if you notice any error that would lead to an error
in the results.
( 'allele' 4 on 'locus' D19 was previously given, slightly erroneously, as 0.715;
now corrected to 0.719, so generated 'profiles' will be slightly
different at D19 / 4 from the results displayed in this file )
I am not a programmer, so don't bother communicating
about my lack of structure etc. I know the flags are poorly chosen,
the external random-number calls should be function calls, and so on.
Before going into the Visual Basic Editor, go into ordinary
Word and open anything in the directory you want
the VB files to go into, as no directory is designated in the
following code.
Using Notepad (plain text handling, no line wrap),
copy and paste
from this file as displayed in a browser, or from its source / text file, into
a Visual Basic / macro handler between Sub and End Sub,
reset, and Run. I am not familiar with VB and so get tied up in
knots concerning procedures, modules, functions etc.
My choice of file names, datewise (sept25- etc.), is for
ease of deleting because of disk space constraints.
If using straight VB6, then designate the directory
for files by "replace all" occurrences of sept25 to
c:\vb\sept25 or whatever; also add a sound progress
indicator before the [ Next x ] line
If x / 1000 = Int(x / 1000) Then Beep
before highlighting and copying.
In VB6, open a New Project.
In Form1, add a Command1 button.
Double-click this button to open the command
code window, and copy and paste the 'DNA' VB code
between the Private Sub Command1_Click()
and the End Sub.
Then Run/Start.
Press Command1.
Wait until the beeps / clicks cease.
I had to ditch 3 random number generators, as
they were producing repeats too often, considering that
at times I was making 200 million calls to the RNG
for 10 million profiles.
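For reference, the generator embedded in the code below is a linear congruential generator with multiplier 214013 and increment 2531011, working modulo 2^24. One step of the same recurrence can be sketched in Python (illustrative only; the VB code carries the state in x0/x1):

```python
A, C, Z = 214013, 2531011, 2 ** 24  # constants from the VB code

def lcg(x0):
    """One step of the LCG: keep the fractional part of (x0*A + C)/Z,
    rescaled back to 0..Z - equivalent to (x0*A + C) mod Z for integer x0."""
    temp = (x0 * A + C) / Z
    return (temp - int(temp)) * Z  # Fix() in VB truncates like int() here

x = 12345.0
for _ in range(3):
    x = lcg(x)
    print(x / Z)  # the uniform value in [0, 1) fed to the threshold tests
```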
The results and background are after the VB code.
The first task is to generate a file simulating '10 loci',
that is, an array of 10 pairs of numbers. These number pairs
are constrained to represent the published allele frequencies of the
UK Caucasian population. The SGM Plus system in the UK NDNAD (Caucasian)
averages 11.3 alleles per locus, with 2 alleles at each of the 10 chosen loci.
But, as derived from biochemistry, the inheritance of these alleles
is not equally likely. If occurrence really were equal across the 11.3
possibilities at each of the 2 x 10 positions,
then false matches would be very much rarer than in real life.
To simplify, I have standardised on a choice of 10 values (0 to 9), with the
rarer alleles lumped together in the '0' subset.
For purists it is an easy matter to extend 0 to 9 to
include "A", "B" etc., as now string data, for complete
modelling of all alleles on loci FGA, D21, D18, D2 and D19.
------------
In the generator section at the start of each j loop ,have
pb(j) = "Z"
then amend generator characteristics,
If ph(j) < 0.337 Then pb(j) = "A"
If ph(j) < 0.437 Then ph(j) = 2
If ph(j) < 0.444 Then pb(j) = "B"
etc instead of just 0 to 9
then before end of each j loop,have
If pb(j) <> "Z" Then ph(j) = pb(j)
--------------
I would suggest using the letters for only the
rare alleles rather than going 0 to 9, A, B, C, D etc.
The first 3 loci (6 digits) will not contain
alphanumerics, but the 7th or later would, so beware
if subdividing on the 7th digit or more.
In principle I tried adapting it, and it processes
through to the final match checking, but I've
not done a full run fully enlarged.
The final macro for converting back to
standard notation would need altering, or at least
the A, B, Cs etc. manually converting back to alleles.
One general result along the way was that rare alleles become
very much rarer, proportionally, in any matches.
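The cascaded If statements in the generator below implement sampling from a cumulative frequency table: a uniform value is compared against ascending thresholds, and the band it falls in decides the allele. A Python sketch of the same step, using the vWA thresholds from the code (the rare-allele band maps to 0, just as the 11-then-0 trick does in the VB):

```python
import random

# Cumulative thresholds for vWA, copied from the VB code below;
# the first band (u < 0.001) is the lumped rare-allele '0' subset.
VWA = [(0.001, 0), (0.106, 1), (0.186, 2), (0.402, 3), (0.672, 4),
       (0.891, 5), (0.984, 6), (0.998, 7), (1.0, 8)]

def draw_allele(table, u=None):
    """Return the allele of the first band whose threshold exceeds u."""
    if u is None:
        u = random.random()
    for threshold, allele in table:
        if u < threshold:
            return allele
    return 0  # unreachable for u < 1.0

print(draw_allele(VWA, 0.5))  # 0.402 <= 0.5 < 0.672, so allele 4
```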
Because of the
large numbers involved, and my PC being of 1997 vintage, there
is a lot of saving to disk, and only partial sets are processed
rather than trying to process the full 10 million 'profiles'; my
sensible limit is about 2 million processed in their entirety.
Others with more powerful computers should be able
to tackle the full 10 million.
If the long conditional statements break in this HTML file
then you will have to re-concatenate them before use.
The order of each locus is the most commonly portrayed
order for the UK NDNAD profiles. My PC is a 200 MHz, 64 MB
machine with about 200 MB of hard disk space free, so the
requirements are not daunting. To generate and process all 2 million profiles, expect
about 5 hours to complete, and that is once you are familiar with the routines.
For faster PCs reduce this time, as most of it relates to the sort routines.
Putting a conditional If / End If statement in the generator file, where the output
write is, to restrict output to profiles in areas where matches are known to have occurred,
will reduce process time. Anyway, I suggest starting by generating
only 20,000 profiles, then 200,000, and eventually 2 million, to get the hang
of things.
The macro has been modified for data input and output
as strings, rather than the earlier version's numeric data.
The Visual Basic / macro code for the separate macros is between horizontal rules.
FGA, vWA etc. are the 10 loci, and the associated generating
tables are from the allele frequency tables in the forensic
science literature cited in file dnapr.htm .
' Generating 10 loci x2 profiles
' directing pairs and first divider
Dim ph(20)
' initialising Random Number Generator - RNG
count9 = 0
count8 = 0
Randomize
a = 214013
c = 2531011
x0 = Timer
z = 2 ^ 24
' 1 file 'sept25g' for original, un-directed pairs, source data.
' This file is necessary to check on the performance of the RNG
' when a matched pair is found, it is highly unlikely that
' both sequences as generated, before pair directing, would
' be the same - more likely a manifestation of a repeat within the RNG
' (the reason for adopting the 214013 / 2531011 RNG )
' Use 'Word' find function on part of the sequences, including pair reversals,
' with luck would include a 'homozygotic' pair eg (3,3) say ,so no reversal
' on that pair
Open "sept25g" For Output As #1
' outputs directed and divided by first digit
Open "sept25-0" For Output As #10
Open "sept25-1" For Output As #11
Open "sept25-2" For Output As #12
Open "sept25-3" For Output As #13
Open "sept25-4" For Output As #14
Open "sept25-5" For Output As #15
Open "sept25-6" For Output As #16
Open "sept25-7" For Output As #17
Open "sept25-8" For Output As #18
Open "sept25-9" For Output As #19
' change for different total size eg 199999 for 200,000
For x = 0 To 1999999
For j = 0 To 1
' vWA ,first locus
' RNG random number generator
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.001 Then ph(j) = 11
If ph(j) < 0.106 Then ph(j) = 1
If ph(j) < 0.186 Then ph(j) = 2
If ph(j) < 0.402 Then ph(j) = 3
If ph(j) < 0.672 Then ph(j) = 4
If ph(j) < 0.891 Then ph(j) = 5
If ph(j) < 0.984 Then ph(j) = 6
If ph(j) < 0.998 Then ph(j) = 7
If ph(j) < 1 Then ph(j) = 8
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 2 To 3
' THO1
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.002 Then ph(j) = 11
If ph(j) < 0.243 Then ph(j) = 1
If ph(j) < 0.437 Then ph(j) = 2
If ph(j) < 0.545 Then ph(j) = 3
If ph(j) < 0.546 Then ph(j) = 4
If ph(j) < 0.686 Then ph(j) = 5
If ph(j) < 0.99 Then ph(j) = 6
If ph(j) < 1 Then ph(j) = 7
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 4 To 5
' D8
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.018 Then ph(j) = 11
If ph(j) < 0.031 Then ph(j) = 1
If ph(j) < 0.125 Then ph(j) = 2
If ph(j) < 0.191 Then ph(j) = 3
If ph(j) < 0.334 Then ph(j) = 4
If ph(j) < 0.667 Then ph(j) = 5
If ph(j) < 0.876 Then ph(j) = 6
If ph(j) < 0.964 Then ph(j) = 7
If ph(j) < 0.995 Then ph(j) = 8
If ph(j) < 1 Then ph(j) = 9
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 6 To 7
' FGA
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.025 Then ph(j) = 11
If ph(j) < 0.081 Then ph(j) = 1
If ph(j) < 0.224 Then ph(j) = 2
If ph(j) < 0.411 Then ph(j) = 3
If ph(j) < 0.576 Then ph(j) = 4
If ph(j) < 0.587 Then ph(j) = 5
If ph(j) < 0.726 Then ph(j) = 6
If ph(j) < 0.872 Then ph(j) = 7
If ph(j) < 0.947 Then ph(j) = 8
If ph(j) < 0.982 Then ph(j) = 9
If ph(j) < 1 Then ph(j) = 0
' 1.8% not generated
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 8 To 9
' D21
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.031 Then ph(j) = 11
If ph(j) < 0.191 Then ph(j) = 1
If ph(j) < 0.417 Then ph(j) = 2
If ph(j) < 0.675 Then ph(j) = 3
If ph(j) < 0.702 Then ph(j) = 4
If ph(j) < 0.771 Then ph(j) = 5
If ph(j) < 0.864 Then ph(j) = 6
If ph(j) < 0.882 Then ph(j) = 7
If ph(j) < 0.972 Then ph(j) = 8
If ph(j) < 0.994 Then ph(j) = 9
If ph(j) < 1 Then ph(j) = 0
' 0.5% not generated
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 10 To 11
' D18
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.012 Then ph(j) = 11
If ph(j) < 0.151 Then ph(j) = 1
If ph(j) < 0.276 Then ph(j) = 2
If ph(j) < 0.44 Then ph(j) = 3
If ph(j) < 0.585 Then ph(j) = 4
If ph(j) < 0.722 Then ph(j) = 5
If ph(j) < 0.837 Then ph(j) = 6
If ph(j) < 0.917 Then ph(j) = 7
If ph(j) < 0.958 Then ph(j) = 8
If ph(j) < 0.975 Then ph(j) = 9
If ph(j) < 1 Then ph(j) = 0
' 2.5% not generated
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 12 To 13
' D2S1338
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.037 Then ph(j) = 11
If ph(j) < 0.222 Then ph(j) = 1
If ph(j) < 0.309 Then ph(j) = 2
If ph(j) < 0.419 Then ph(j) = 3
If ph(j) < 0.557 Then ph(j) = 4
If ph(j) < 0.589 Then ph(j) = 5
If ph(j) < 0.613 Then ph(j) = 6
If ph(j) < 0.725 Then ph(j) = 7
If ph(j) < 0.867 Then ph(j) = 8
If ph(j) < 0.978 Then ph(j) = 9
If ph(j) < 1 Then ph(j) = 0
' 2.2% not generated
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 14 To 15
' D16
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.019 Then ph(j) = 11
If ph(j) < 0.148 Then ph(j) = 1
If ph(j) < 0.202 Then ph(j) = 2
If ph(j) < 0.491 Then ph(j) = 3
If ph(j) < 0.779 Then ph(j) = 4
If ph(j) < 0.965 Then ph(j) = 5
If ph(j) < 0.994 Then ph(j) = 6
If ph(j) < 1 Then ph(j) = 7
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 16 To 17
' D19
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.087 Then ph(j) = 11
If ph(j) < 0.309 Then ph(j) = 1
If ph(j) < 0.322 Then ph(j) = 2
If ph(j) < 0.704 Then ph(j) = 3
If ph(j) < 0.719 Then ph(j) = 4
If ph(j) < 0.896 Then ph(j) = 5
If ph(j) < 0.934 Then ph(j) = 6
If ph(j) < 0.975 Then ph(j) = 7
If ph(j) < 0.992 Then ph(j) = 8
If ph(j) < 0.997 Then ph(j) = 9
If ph(j) < 1 Then ph(j) = 0
If ph(j) > 10 Then ph(j) = 0
' 0.3% not generated
Next j
For j = 18 To 19
' D3
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.001 Then ph(j) = 11
If ph(j) < 0.007 Then ph(j) = 1
If ph(j) < 0.139 Then ph(j) = 2
If ph(j) < 0.404 Then ph(j) = 3
If ph(j) < 0.651 Then ph(j) = 4
If ph(j) < 0.846 Then ph(j) = 5
If ph(j) < 0.987 Then ph(j) = 6
If ph(j) < 1 Then ph(j) = 7
If ph(j) > 10 Then ph(j) = 0
Next j
' output the original generated file
Write #1, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
' Because, in real DNA profiles, without further info no one
' knows which allele in each pair came from the mother or the father,
' by convention they are written smaller, larger (or equal).
' The following directs each pair accordingly
For j = 0 To 18 Step 2
If ph(j + 1) < ph(j) Then
jjj = ph(j)
ph(j) = ph(j + 1)
ph(j + 1) = jjj
End If
Next j
' put extra conditional statements here to reduce
' the number of files or just delete some of the following
'
' dividing on first column, file by file
If ph(0) = 0 Then
Write #10 , ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count0 = count0 + 1
End If
If ph(0) = 1 Then
Write #11, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count1 = count1 + 1
End If
If ph(0) = 2 Then
Write #12, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count2 = count2 + 1
End If
If ph(0) = 3 Then
Write #13, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count3 = count3 + 1
End If
If ph(0) = 4 Then
Write #14, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count4 = count4 + 1
End If
If ph(0) = 5 Then
Write #15, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count5 = count5 + 1
End If
If ph(0) = 6 Then
Write #16, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count6 = count6 + 1
End If
If ph(0) = 7 Then
Write #17, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count7 = count7 + 1
End If
If ph(0) = 8 Then
Write #18, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count8 = count8 + 1
End If
If ph(0) = 9 Then
Write #19, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count9 = count9 + 1
End If
Next x
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
Close #1
' count file with the data to fix the For - Next loops in successive divisions
Open "sept25-c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
To reduce the file sizes so they can be sorted, it is necessary
to subdivide by the various leading digits.
If a 5th or 6th column divider is required, make the appropriate changes.
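Each dividing macro below does the same job: read lines, look at one digit position, and route each line to one of ten files while counting. The logic can be sketched in Python (in-memory lists stand in for the ten output files; names and positions are illustrative):

```python
def divide_by_digit(lines, position):
    """Split profile strings into ten buckets keyed by the digit at the
    given 1-based position, returning the buckets and per-bucket counts."""
    buckets = {str(d): [] for d in range(10)}
    for line in lines:
        buckets[line[position - 1]].append(line)
    counts = {d: len(v) for d, v in buckets.items()}
    return buckets, counts

profiles = ["13435465", "10435465", "13235465"]
buckets, counts = divide_by_digit(profiles, 2)  # divide on second digit
print(counts["3"], counts["0"])  # 2 1
```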
' Dividing file into 10 by second digit
Dim ph(20)
dim ps as string
' xxxx = count size from count file
xxxx =
' input file
Open "sept25-1" For Input As #1
' 10 divided files
Open "sept25-10" For Output As #10
Open "sept25-11" For Output As #11
Open "sept25-12" For Output As #12
Open "sept25-13" For Output As #13
Open "sept25-14" For Output As #14
Open "sept25-15" For Output As #15
Open "sept25-16" For Output As #16
Open "sept25-17" For Output As #17
Open "sept25-18" For Output As #18
Open "sept25-19" For Output As #19
count9 = 0
count8 = 0
xxxx = xxxx - 1
For x = 0 To xxxx
Input #1, ps
a2$ = Mid(ps, 2, 1)
ph(1) = Val(a2$)
If ph(1) = 0 Then
Write #10, ps
count0 = count0 + 1
End If
If ph(1) = 1 Then
Write #11, ps
count1 = count1 + 1
End If
If ph(1) = 2 Then
Write #12, ps
count2 = count2 + 1
End If
If ph(1) = 3 Then
Write #13, ps
count3 = count3 + 1
End If
If ph(1) = 4 Then
Write #14, ps
count4 = count4 + 1
End If
If ph(1) = 5 Then
Write #15, ps
count5 = count5 + 1
End If
If ph(1) = 6 Then
Write #16, ps
count6 = count6 + 1
End If
If ph(1) = 7 Then
Write #17, ps
count7 = count7 + 1
End If
If ph(1) = 8 Then
Write #18, ps
count8 = count8 + 1
End If
If ph(1) = 9 Then
Write #19, ps
count9 = count9 + 1
End If
Next x
Close #1
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
' output counts
Open "sept25-1c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
' Dividing file into 10 by third digit
Dim ph(20)
dim ps as string
' enter count in xxxx
xxxx =
Open "sept25-11" For Input As #1
Open "sept25-110" For Output As #10
Open "sept25-111" For Output As #11
Open "sept25-112" For Output As #12
Open "sept25-113" For Output As #13
Open "sept25-114" For Output As #14
Open "sept25-115" For Output As #15
Open "sept25-116" For Output As #16
Open "sept25-117" For Output As #17
Open "sept25-118" For Output As #18
Open "sept25-119" For Output As #19
count9 = 0
count8 = 0
xxxx=xxxx - 1
For x = 0 To xxxx
Input #1, ps
a3$ = Mid(ps, 3, 1)
ph(2) = Val(a3$)
If ph(2) = 0 Then
Write #10, ps
count0 = count0 + 1
End If
If ph(2) = 1 Then
Write #11, ps
count1 = count1 + 1
End If
If ph(2) = 2 Then
Write #12, ps
count2 = count2 + 1
End If
If ph(2) = 3 Then
Write #13, ps
count3 = count3 + 1
End If
If ph(2) = 4 Then
Write #14, ps
count4 = count4 + 1
End If
If ph(2) = 5 Then
Write #15, ps
count5 = count5 + 1
End If
If ph(2) = 6 Then
Write #16, ps
count6 = count6 + 1
End If
If ph(2) = 7 Then
Write #17, ps
count7 = count7 + 1
End If
If ph(2) = 8 Then
Write #18, ps
count8 = count8 + 1
End If
If ph(2) = 9 Then
Write #19, ps
count9 = count9 + 1
End If
Next x
Close #1
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
Open "sept25-11c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
' Dividing file into 10 by fourth digit
Dim ph(20)
dim ps as string
' enter count in xxxx
xxxx =
Open "sept25-131" For Input As #1
Open "sept25-1310" For Output As #10
Open "sept25-1311" For Output As #11
Open "sept25-1312" For Output As #12
Open "sept25-1313" For Output As #13
Open "sept25-1314" For Output As #14
Open "sept25-1315" For Output As #15
Open "sept25-1316" For Output As #16
Open "sept25-1317" For Output As #17
Open "sept25-1318" For Output As #18
Open "sept25-1319" For Output As #19
count9 = 0
count8 = 0
xxxx=xxxx - 1
For x = 0 To xxxx
Input #1, ps
a4$ = Mid(ps, 4, 1)
ph(3) = Val(a4$)
If ph(3) = 0 Then
Write #10, ps
count0 = count0 + 1
End If
If ph(3) = 1 Then
Write #11, ps
count1 = count1 + 1
End If
If ph(3) = 2 Then
Write #12, ps
count2 = count2 + 1
End If
If ph(3) = 3 Then
Write #13, ps
count3 = count3 + 1
End If
If ph(3) = 4 Then
Write #14, ps
count4 = count4 + 1
End If
If ph(3) = 5 Then
Write #15, ps
count5 = count5 + 1
End If
If ph(3) = 6 Then
Write #16, ps
count6 = count6 + 1
End If
If ph(3) = 7 Then
Write #17, ps
count7 = count7 + 1
End If
If ph(3) = 8 Then
Write #18, ps
count8 = count8 + 1
End If
If ph(3) = 9 Then
Write #19, ps
count9 = count9 + 1
End If
Next x
Close #1
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
Open "sept25-131c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
' Dividing file into 10 by fifth digit
Dim ph(20)
Dim ps As String
' enter count in xxxx
xxxx =
Open "dec14-3412" For Input As #1
Open "dec14-34120" For Output As #10
Open "dec14-34121" For Output As #11
Open "dec14-34122" For Output As #12
Open "dec14-34123" For Output As #13
Open "dec14-34124" For Output As #14
Open "dec14-34125" For Output As #15
Open "dec14-34126" For Output As #16
Open "dec14-34127" For Output As #17
Open "dec14-34128" For Output As #18
Open "dec14-34129" For Output As #19
count9 = 0
count8 = 0
xxxx = xxxx - 1
For x = 0 To xxxx
Input #1, ps
a5$ = Mid(ps, 5, 1)
ph(4) = Val(a5$)
If ph(4) = 0 Then
Write #10, ps
count0 = count0 + 1
End If
If ph(4) = 1 Then
Write #11, ps
count1 = count1 + 1
End If
If ph(4) = 2 Then
Write #12, ps
count2 = count2 + 1
End If
If ph(4) = 3 Then
Write #13, ps
count3 = count3 + 1
End If
If ph(4) = 4 Then
Write #14, ps
count4 = count4 + 1
End If
If ph(4) = 5 Then
Write #15, ps
count5 = count5 + 1
End If
If ph(4) = 6 Then
Write #16, ps
count6 = count6 + 1
End If
If ph(4) = 7 Then
Write #17, ps
count7 = count7 + 1
End If
If ph(4) = 8 Then
Write #18, ps
count8 = count8 + 1
End If
If ph(4) = 9 Then
Write #19, ps
count9 = count9 + 1
End If
Next x
Close #1
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
Open "dec14-3412c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
The next stage is sorting, using Word's Table / Sort.
Before using it, make a test batch of numbers, as there are various
Sort outcomes. Now that I'm using string data, a Text sort gave the
right form on my machine. Use Ctrl+Shift+Home (or End) to
highlight text up or down.
After the sort, and before saving to disk, press the up or down
arrow to select which way the text is returned to you.
My set-up was limited to no more than 15,000 lines. To sort,
say, 28,000: sort the upper half, then the lower half, then cut and
paste, say, the 0 to 2 section of the lower half into the top of the
top half. Re-sort the expanded 0 to 2 section, then
re-sort the remainder. If selecting, say, the 2 to 3 section, then
cut and paste at the junction of 2 and 3 in the other block
to save some repeated sorting. Other times it is quicker
to over-sort then backtrack / overlap on the next sort.
Many of the subdivision files are empty because
of the pair directing: they consist of e.g. 4,4.. 4,5.. etc.,
never 4,0.., 4,1.. etc., and a number of 8 and 9 sections
are absent, tracing back to the generator characteristics, e.g.
only the first 8 of 10 values are used. When you know all files are less than
15,000 lines, or whatever the Sort limit is, use the next macro (simply a recorded macro)
to sort 10 related files. An empty file will stop the macro, so edit
out empty files before running.
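The manual split / sort / merge procedure above is an external merge sort done by hand. The same operation can be sketched in Python (the chunk limit matches the 15,000-line Word constraint; the data is illustrative):

```python
import heapq

def chunked_sort(lines, limit=15000):
    """Sort more lines than the per-sort limit allows by sorting
    fixed-size chunks, then k-way merging the sorted chunks."""
    chunks = [sorted(lines[i:i + limit])
              for i in range(0, len(lines), limit)]
    return list(heapq.merge(*chunks))

data = ["31", "12", "20", "05", "12"]
print(chunked_sort(data, limit=2))  # ['05', '12', '12', '20', '31']
```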
'Sort 10 related files in one go
'
Documents.Open FileName:="sept25-130", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-131", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-132", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-133", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-134", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-135", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-136", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-137", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-138", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-139", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
' empty files will append spurious carriage returns at the
' head or tail of files, so check for this before the final match routine
' otherwise use Insert / File to merge files
' merge 10 related files back into one
' for convenience I named these re-concatenated
' files .txt so they were obvious in listings
' compared to the no-suffix ones
'
Documents.Add Template:="", NewTemplate:=False
Selection.InsertFile FileName:="sept25-130", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-131", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-132", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-133", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-134", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-135", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-136", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-137", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-138", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-139", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
ActiveDocument.SaveAs FileName:="sept25-13.txt", FileFormat:=wdFormatText, _
LockComments:=False, Password:="", AddToRecentFiles:=True, WritePassword _
:="", ReadOnlyRecommended:=False, EmbedTrueTypeFonts:=False, _
SaveNativePictureFormat:=False, SaveFormsData:=False, SaveAsAOCELetter:= _
False
End Sub
Copy and paste all these subfiles together to
submit to the next section. The final match finding is done
initially for 12 digits, then changed to 14, 16, 18,
and finally 20 if 18 shows something. This routine,
after hours of dividing/sorting/re-merging, takes only seconds to complete.
' Find matching pairs in 12 digits
' xxxx is count = ????
xxxx =
b$ = "0"
Count = 0
Dim ps As String
Open "sept25-24.txt" For Input As #1
Open "sept25-24m12.txt" For Output As #2
' change the 12 in the #2 file name above and
' the Left function below to suit the number of matched digits
xxxx = xxxx - 1
For x = 0 To xxxx
Input #1, ps
a$ = Left(ps, 12)
If a$ = b$ Then
Write #2, ps
Count = Count + 1
End If
b$ = a$
Next x
Write #2, "Count ", Count
Close #1
Close #2
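For anyone repeating this outside Word, the adjacent-line scan above can be restated in a few lines of Python. This is a sketch of the same algorithm, not the author's macro; the sample profiles are hypothetical.

```python
def count_matching_pairs(sorted_profiles, width=12):
    """Collect entries in an already-sorted list whose first `width`
    characters equal the previous entry's -- the same adjacent-line
    scan the VBA routine above performs."""
    matches = []
    prev = None
    for ps in sorted_profiles:
        key = ps[:width]
        if key == prev:
            matches.append(ps)
        prev = key
    return matches

# Hypothetical 12-digit 'profiles', one duplicated:
data = sorted(["451655233313", "451655233313", "120456789012"])
print(len(count_matching_pairs(data)))  # -> 1
```

Because the list is sorted first, duplicates are always adjacent, which is why a single pass with one remembered key is enough.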
' Find matching triples in 12 digits
' xxxx is count from the count files
xxxx =
b$ = "0"
c$ = "0"
Count = 0
Dim ps As String
xxxx = xxxx - 1
Open "sept25-1.txt" For Input As #1
Open "sept25-1trip.txt" For Output As #2
' change the 12 in the Left function below
' to suit the number of matched digits
For x = 0 To xxxx
Input #1, ps
a$ = Left(ps, 12)
a2$ = ps
If a$ = c$ Then
Write #2, a2$, b2$, c2$
Count = Count + 1
End If
If a$ = b$ Then
c$ = b$
c2$ = b2$
End If
b$ = a$
b2$ = a2$
Next x
Write #2, "Count ", Count
Close #1
Close #2
' Find matching quadruples in 12 digits
' xxxx is from the count files
xxxx =
b$ = "0"
c$ = "0"
d$ = "0"
Count = 0
Dim ps As String
xxxx = xxxx - 1
Open "sept25-3.txt" For Input As #1
Open "sept25-3quad.txt" For Output As #2
' change the 12 in the Left function below
' to suit the number of matched digits
For x = 0 To xxxx
Input #1, ps
a$ = Left(ps, 12)
a2$ = ps
If a$ = d$ Then
Write #2, a2$, b2$, c2$, d2$
Count = Count + 1
End If
If a$ = c$ Then
d$ = c$
d2$ = c2$
End If
If a$ = b$ Then
c$ = b$
c2$ = b2$
End If
b$ = a$
b2$ = a2$
Next x
Write #2, "Count ", Count
Close #1
Close #2
' Find matching quintuples in 12 digits
' xxxx is from the count files
xxxx =
b$ = "0"
c$ = "0"
d$ = "0"
e$ = "0"
Count = 0
Dim ps As String
xxxx = xxxx - 1
Open "sept25-4.txt" For Input As #1
Open "sept25-4quin.txt" For Output As #2
' change the 12 in the Left function below
' to suit the number of matched digits
For x = 0 To xxxx
Input #1, ps
a$ = Left(ps, 12)
a2$ = ps
If a$ = e$ Then
Write #2, a2$, b2$, c2$, d2$, e2$
Count = Count + 1
End If
If a$ = d$ Then
e$ = d$
e2$ = d2$
End If
If a$ = c$ Then
d$ = c$
d2$ = c2$
End If
If a$ = b$ Then
c$ = b$
c2$ = b2$
End If
b$ = a$
b2$ = a2$
Next x
Write #2, "Count ", Count
Close #1
Close #2
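The cascading b$/c$/d$/e$ variables in the four routines above are all doing one job: finding runs of identical leading digits in a sorted list. A Python sketch (my restatement) that generalises pairs, triples, quadruples and quintuples into one function:

```python
from itertools import groupby

def runs_of_at_least(sorted_profiles, k, width=12):
    """Group an already-sorted list by its leading `width` digits and
    keep every group of k or more entries -- pairs (k=2), triples
    (k=3), quadruples, quintuples: one routine instead of four."""
    runs = []
    for _, grp in groupby(sorted_profiles, key=lambda ps: ps[:width]):
        grp = list(grp)
        if len(grp) >= k:
            runs.append(grp)
    return runs

data = sorted(["111122223333"] * 3 + ["444455556666"])
print(len(runs_of_at_least(data, 3)))  # -> 1, the tripled profile
```

Note that `groupby` only merges adjacent equal keys, so, exactly as with the VBA routines, the input must be sorted first.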
' converting integer values back to DNA loci/alleles
xxxx =
' xxxx is the number of profiles to be converted
Dim ph(20)
Dim pj(20)
Dim ps As String
Open "sept25-m12.txt" For Input As #1
Open "sept25-mr12.txt" For Output As #2
For x = 1 To xxxx
Input #1, ps
' read the 20 digits of the profile into ph() in one loop
For j = 0 To 19
ph(j) = Val(Mid(ps, j + 1, 1))
Next j
For j = 0 To 1
' vWA
If ph(j) = 0 Then pj(j) = 13
If ph(j) = 1 Then pj(j) = 14
If ph(j) = 2 Then pj(j) = 15
If ph(j) = 3 Then pj(j) = 16
If ph(j) = 4 Then pj(j) = 17
If ph(j) = 5 Then pj(j) = 18
If ph(j) = 6 Then pj(j) = 19
If ph(j) = 7 Then pj(j) = 20
If ph(j) = 8 Then pj(j) = 21
If ph(j) = 9 Then pj(j) = 0
Next j
For j = 2 To 3
' THO1
If ph(j) = 0 Then pj(j) = 5
If ph(j) = 1 Then pj(j) = 6
If ph(j) = 2 Then pj(j) = 7
If ph(j) = 3 Then pj(j) = 8
If ph(j) = 4 Then pj(j) = 8.3
If ph(j) = 5 Then pj(j) = 9
If ph(j) = 6 Then pj(j) = 9.3
If ph(j) = 7 Then pj(j) = 10
If ph(j) = 8 Then pj(j) = 0
If ph(j) = 9 Then pj(j) = 0
Next j
For j = 4 To 5
' D8
If ph(j) = 0 Then pj(j) = 8
If ph(j) = 1 Then pj(j) = 9
If ph(j) = 2 Then pj(j) = 10
If ph(j) = 3 Then pj(j) = 11
If ph(j) = 4 Then pj(j) = 12
If ph(j) = 5 Then pj(j) = 13
If ph(j) = 6 Then pj(j) = 14
If ph(j) = 7 Then pj(j) = 15
If ph(j) = 8 Then pj(j) = 16
If ph(j) = 9 Then pj(j) = 17
Next j
For j = 6 To 7
' FGA
If ph(j) = 0 Then pj(j) = 18
If ph(j) = 1 Then pj(j) = 19
If ph(j) = 2 Then pj(j) = 20
If ph(j) = 3 Then pj(j) = 21
If ph(j) = 4 Then pj(j) = 22
If ph(j) = 5 Then pj(j) = 22.2
If ph(j) = 6 Then pj(j) = 23
If ph(j) = 7 Then pj(j) = 24
If ph(j) = 8 Then pj(j) = 25
If ph(j) = 9 Then pj(j) = 26
Next j
For j = 8 To 9
' D21
If ph(j) = 0 Then pj(j) = 27
If ph(j) = 1 Then pj(j) = 28
If ph(j) = 2 Then pj(j) = 29
If ph(j) = 3 Then pj(j) = 30
If ph(j) = 4 Then pj(j) = 30.2
If ph(j) = 5 Then pj(j) = 31
If ph(j) = 6 Then pj(j) = 31.2
If ph(j) = 7 Then pj(j) = 32
If ph(j) = 8 Then pj(j) = 32.2
If ph(j) = 9 Then pj(j) = 33.2
Next j
For j = 10 To 11
' D18
If ph(j) = 0 Then pj(j) = 11
If ph(j) = 1 Then pj(j) = 12
If ph(j) = 2 Then pj(j) = 13
If ph(j) = 3 Then pj(j) = 14
If ph(j) = 4 Then pj(j) = 15
If ph(j) = 5 Then pj(j) = 16
If ph(j) = 6 Then pj(j) = 17
If ph(j) = 7 Then pj(j) = 18
If ph(j) = 8 Then pj(j) = 19
If ph(j) = 9 Then pj(j) = 20
Next j
For j = 12 To 13
' D2S1338
If ph(j) = 0 Then pj(j) = 16
If ph(j) = 1 Then pj(j) = 17
If ph(j) = 2 Then pj(j) = 18
If ph(j) = 3 Then pj(j) = 19
If ph(j) = 4 Then pj(j) = 20
If ph(j) = 5 Then pj(j) = 21
If ph(j) = 6 Then pj(j) = 22
If ph(j) = 7 Then pj(j) = 23
If ph(j) = 8 Then pj(j) = 24
If ph(j) = 9 Then pj(j) = 25
Next j
For j = 14 To 15
' D16
If ph(j) = 0 Then pj(j) = 8
If ph(j) = 1 Then pj(j) = 9
If ph(j) = 2 Then pj(j) = 10
If ph(j) = 3 Then pj(j) = 11
If ph(j) = 4 Then pj(j) = 12
If ph(j) = 5 Then pj(j) = 13
If ph(j) = 6 Then pj(j) = 14
If ph(j) = 7 Then pj(j) = 15
If ph(j) = 8 Then pj(j) = 0
If ph(j) = 9 Then pj(j) = 0
Next j
For j = 16 To 17
' D19
If ph(j) = 0 Then pj(j) = 12
If ph(j) = 1 Then pj(j) = 13
If ph(j) = 2 Then pj(j) = 13.2
If ph(j) = 3 Then pj(j) = 14
If ph(j) = 4 Then pj(j) = 14.2
If ph(j) = 5 Then pj(j) = 15
If ph(j) = 6 Then pj(j) = 15.2
If ph(j) = 7 Then pj(j) = 16
If ph(j) = 8 Then pj(j) = 16.2
If ph(j) = 9 Then pj(j) = 17
Next j
For j = 18 To 19
' D3
If ph(j) = 0 Then pj(j) = 12
If ph(j) = 1 Then pj(j) = 13
If ph(j) = 2 Then pj(j) = 14
If ph(j) = 3 Then pj(j) = 15
If ph(j) = 4 Then pj(j) = 16
If ph(j) = 5 Then pj(j) = 17
If ph(j) = 6 Then pj(j) = 18
If ph(j) = 7 Then pj(j) = 19
If ph(j) = 8 Then pj(j) = 0
If ph(j) = 9 Then pj(j) = 0
Next j
Write #2, ""; pj(0), pj(1); ""; pj(2), pj(3); ""; pj(4), pj(5); ""; pj(6), pj(7); ""; pj(8), pj(9); ""; pj(10), pj(11); ""; pj(12), pj(13); ""; pj(14), pj(15); ""; pj(16), pj(17); ""; pj(18), pj(19); ""
Next x
Close #1
Close #2
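The long If-chains above are lookup tables. A Python restatement with the same digit-to-allele mappings (values copied from the code; 0 marks null/unused digit slots):

```python
# Digit -> allele lookup tables copied from the VBA If-chains above.
TABLES = {
    "vWA":  (13, 14, 15, 16, 17, 18, 19, 20, 21, 0),
    "THO1": (5, 6, 7, 8, 8.3, 9, 9.3, 10, 0, 0),
    "D8":   (8, 9, 10, 11, 12, 13, 14, 15, 16, 17),
    "FGA":  (18, 19, 20, 21, 22, 22.2, 23, 24, 25, 26),
    "D21":  (27, 28, 29, 30, 30.2, 31, 31.2, 32, 32.2, 33.2),
    "D18":  (11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
    "D2":   (16, 17, 18, 19, 20, 21, 22, 23, 24, 25),
    "D16":  (8, 9, 10, 11, 12, 13, 14, 15, 0, 0),
    "D19":  (12, 13, 13.2, 14, 14.2, 15, 15.2, 16, 16.2, 17),
    "D3":   (12, 13, 14, 15, 16, 17, 18, 19, 0, 0),
}
LOCI = ("vWA", "THO1", "D8", "FGA", "D21", "D18", "D2", "D16", "D19", "D3")

def digits_to_profile(ps):
    """Convert a 20-digit string back to 10 (allele, allele) pairs."""
    return [(TABLES[loc][int(ps[2 * i])], TABLES[loc][int(ps[2 * i + 1])])
            for i, loc in enumerate(LOCI)]

print(digits_to_profile("45" + "0" * 18)[0])  # vWA digits 4,5 -> (17, 18)
```

Each locus's table is just the column of pj() values from the corresponding For j loop, indexed by the digit ph(j).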
2 million profile sub-division counts.
Anyone repeating the exercise will have very similar numbers.
For 1 million divide the numbers by 2, for 200,000 divide by 10, etc.
For my setup, any profile count over 15,000 could not be sorted
by Word.
all / first dividing
0,4019,1,398036,2,273611,3,609940,4,499104,5,191390,6,23392,7,501,8,7,9,0
1...................
0,,1,22058,2,33588,3,91034,4,113493,5,92135,6,38947,7,5927,8,854,9,0
11..................
0,88,1,9262,2,5621,3,2518,4,19,5,2354,6,2194,7,2,8,0,9,0
12..................
0,121,1,14174,2,8609,3,3670,4,31,5,3589,6,3394,7,,8,0,9,0
13..................
0,371,1,38677,2,22983,3,10158,4,78,5,9802,6,8956,7,9,8,0,9,0
131.................
0,,1,5305,2,8541,3,4746,4,46,5,6233,6,13372,7,434,8,0,9,0
132.................
0,,1,,2,3291,3,3769,4,31,5,4887,6,10666,7,339,8,0,9,0
14..................
0,456,1,47943,2,29091,3,12592,4,94,5,12133,6,11173,7,11,8,0,9,0
141.................
0,,1,6488,2,10745,3,5863,4,48,5,7669,6,16576,7,554,8,0,9,0
142.................
0,,1,5305,2,8541,3,4746,4,46,5,6233,6,13372,7,434,8,0,9,0
15..................
0,376,1,38889,2,23703,3,10039,4,88,5,9828,6,9200,7,12,8,0,9,0
151.................
0,,1,5350,2,8576,3,4743,4,40,5,6204,6,13535,7,441,8,0,9,0
152.................
0,,1,,2,3415,3,3929,4,36,5,4940,6,10984,7,399,8,0,9,0
16..................
0,162,1,16607,2,9981,3,4262,4,36,5,4113,6,3784,7,2,8,0,9,0
2...................
0,,1,,2,12917,3,68968,4,86770,5,70116,6,29621,7,4535,8,684,9,0
23..................
0,260,1,29219,2,17652,3,7616,4,63,5,7440,6,6709,7,9,8,0,9,0
24..................
0,337,1,36823,2,22145,3,9418,4,75,5,9414,6,8549,7,9,8,0,9,0
241.................
0,,1,5015,2,8205,3,4468,4,46,5,5986,6,12698,7,405,8,0,9,0
242.................
0,,1,,2,3307,3,3664,4,38,5,4563,6,10213,7,360,8,0,9,0
25..................
0,280,1,29665,2,17938,3,7755,4,67,5,7519,6,6882,7,10,8,0,9,0
251.................
0,,1,4115,2,6585,3,3575,4,34,5,4802,6,10203,7,351,8,0,9,0
252.................
0,,1,,2,2570,3,2966,4,30,5,3803,6,8281,7,288,8,0,9,0
26..................
0,109,1,12495,2,7572,3,3238,4,22,5,3161,6,3017,7,7,8,0,9,0
3...................
0,,1,,2,,3,93386,4,233176,5,188883,6,80487,7,12264,8,1744,9,0
33..................
0,366,1,39623,2,23922,3,10316,4,79,5,9959,6,9112,7,9,8,0,9,0
331.................
0,,1,5321,2,8850,3,4920,4,43,5,6275,6,13774,7,440,8,0,9,0
332.................
0,,1,,2,3543,3,3904,4,37,5,5170,6,10888,7,380,8,0,9,0
34..................
0,922,1,98703,2,59774,3,25677,4,233,5,24954,6,22893,7,20,8,0,9,0
341.................
0,,1,13568,2,21954,3,12210,4,114,5,15676,6,34066,7,1115,8,0,9,0
342.................
0,,1,,2,8847,3,9825,4,93,5,12693,6,27433,7,883,8,0,9,0
343.................
0,,1,,2,,3,2694,4,50,5,7065,6,15350,7,518,8,0,9,0
345.................
0,,1,,2,,3,,4,,5,4449,6,19839,7,666,8,0,9,0
346.................
0,,1,,2,,3,,4,,5,,6,21486,7,1407,8,0,9,0
35..................
0,736,1,80070,2,48356,3,20742,4,188,5,20292,6,18483,7,16,8,0,9,0
351.................
0,,1,10979,2,17776,3,9753,4,102,5,12661,6,27911,7,888,8,0,9,0
352.................
0,,1,,2,7096,3,8007,4,59,5,10369,6,22127,7,698,8,0,9,0
353.................
0,,1,,2,,3,2158,4,33,5,5739,6,12399,7,413,8,0,9,0
355.................
0,,1,,2,,3,,4,,5,3714,6,16060,7,518,8,0,9,0
356.................
0,,1,,2,,3,,4,,5,,6,17319,7,1164,8,0,9,0
36..................
0,321,1,34027,2,20677,3,8876,4,92,5,8566,6,7921,7,7,8,0,9,0
361.................
0,,1,4752,2,7491,3,4162,4,34,5,5425,6,11758,7,405,8,0,9,0
362.................
0,,1,,2,3021,3,3428,4,31,5,4284,6,9608,7,305,8,0,9,0
4...................
0,,1,,2,,3,,4,145639,5,236123,6,100130,7,15027,8,2185,9,0
44..................
0,544,1,61636,2,37389,3,15876,4,141,5,15655,6,14386,7,12,8,0,9,0
441.................
0,,1,8497,2,13443,3,7579,4,77,5,9872,6,21480,7,688,8,0,9,0
4416................
0,810,1,536,2,3761,3,2412,4,4533,5,7152,6,1963,7,289,8,24,9,0
442.................
0,,1,,2,5491,3,6096,4,59,5,7952,6,17248,7,543,8,0,9,0
4426................
0,614,1,444,2,2979,3,1928,4,3701,5,5628,6,1683,7,249,8,21,9,1
443.................
0,,1,,2,,3,1699,4,32,5,4355,6,9480,7,310,8,0,9,0
45..................
0,960,1,99982,2,60165,3,25896,4,218,5,25443,6,23434,7,25,8,0,9,0
451.................
0,,1,13926,2,22066,3,12193,4,110,5,16034,6,34516,7,1137,8,0,9,0
4512................
0,780,1,610,2,3810,3,2399,4,4661,5,7412,6,2058,7,311,8,23,9,2
4516................
0,1207,1,858,2,5971,3,3889,4,7295,5,11576,6,3206,7,474,8,40,9,0
452.................
0,,1,,2,8919,3,9771,4,86,5,12634,6,27853,7,902,8,0,9,0
4526................
0,994,1,698,2,4849,3,2986,4,5844,5,9369,6,2656,7,417,8,40,9,0
453.................
0,,1,,2,,3,2744,4,55,5,7136,6,15444,7,517,8,0,9,0
455.................
0,,1,,2,,3,,4,,5,4601,6,20190,7,652,8,0,9,0
456.................
0,,1,,2,,3,,4,,5,,6,21951,7,1483,8,0,9,0
46..................
0,399,1,42396,2,25654,3,11094,4,102,5,10657,6,9822,7,6,8,0,9,0
461.................
0,,1,5900,2,9361,3,5178,4,51,5,6755,6,14631,7,520,8,0,9,0
462.................
0,,1,,2,3800,3,4153,4,44,5,5403,6,11654,7,400,8,0,9,0
47..................
0,60,1,6295,2,3908,3,1694,4,12,5,1620,6,1437,7,1,8,0,9,0
5...................
0,,1,,2,,3,,4,,5,95938,6,81661,7,12100,8,1691,9,0
55..................
0,386,1,40439,2,24634,3,10444,4,87,5,10438,6,9497,7,13,8,0,9,0
56..................
0,309,1,34369,2,20890,3,8991,4,80,5,8921,6,8097,7,4,8,0,9,0
6...................
0,,1,,2,,3,,4,,5,,6,17375,7,5293,8,724,9,0
Background and Results
Background and results were reported to the usenet group uk.legal
over a number of weeks in 2003,
in the thread titled
R v. Watters - Court of Appeal judgement 2000/2001
My first idea was to download UK football results
data and analyse them, because they come in pairs of numbers
with a bias towards the lower numbers.
Which brings me to the statistical research I would
like to do concerning such multi-modal sets
and the conjectured increase in matches in close-to-modal
sets, like the 'biblical' analysis on this same thread.
I am for the moment trying to get some weighted numerical
data. At the moment I am trying to find a complete historical
record of football results going back over 100 years or so.
The theory being that, restricting to divisional football to avoid
mismatched teams like Man U. v. Barnstoneworth United,
score lines containing 0s, 1s and 2s should be more
common than 3s, 4s and 5s etc., retaining order, e.g. 2,1 and 1,2,
to equate to my negative normalised elements.
Then for each 10 games (non-void) over all the decades,
find how many 20-figure numerical matches there are.
I suspect many more among those with only 0s to 2s than among those
containing some 3s, 4s etc.
From
http://www.rsssf.com/engpaul/FLA/league.html
up to about 1950 there are "cross-table" (a good keyword) results, so perhaps
60 blocks of paired data like the block below.
Concatenating 60 seasons, breaking into 5 pairs, then repeating on
6 pairs etc., unlikely (my guess) up to 10 pairs, and testing for matches
would be interesting.
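The proposed chunk-and-match test can be sketched in Python. Toy data is shown; the real input would be the concatenated season scores from the cross-tables:

```python
from collections import Counter

def window_matches(scores, window=5):
    """Chop an ordered list of score strings into non-overlapping
    blocks of `window` pairs and report blocks occurring more than
    once -- the 5-pair / 6-pair match test proposed above."""
    blocks = [tuple(scores[i:i + window])
              for i in range(0, len(scores) - window + 1, window)]
    return {b: n for b, n in Counter(blocks).items() if n > 1}

# Toy data; real runs would use the rsssf.com cross-tables:
season = ["1-1", "0-0", "2-1", "1-0", "3-1"] * 2
print(window_matches(season))  # the repeated 5-pair block, count 2
```

Counting duplicate blocks with a hash table does in one pass what the sort-then-scan approach does over several, though sorting scales better once the data no longer fits in memory.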
The example below is for England 1936/37, for no particular reason;
one score of 10 was changed to * and
the original xxx was deleted, as well
as the text.
digit count
0 197
1 292
2 218
3 114
4 55
5 31
6 12
7 4
8 nil
9 nil
10 1
which appears to have about the right sort of weighting to equate
with normalised DNA profiles
1-1 0-0 1-1 1-1 4-1 2-2 3-2 0-0 1-1 4-1 1-0 1-3 1-1 5-3 4-0 4-1 1-1 0-0 4-1 2-0 3-0
1-3 1-1 4-0 1-2 0-0 0-1 2-0 2-3 4-2 2-1 5-0 2-2 2-2 0-0 2-1 1-0 1-1 2-4 2-0 1-1 1-0
0-5 0-0 2-2 2-1 2-1 1-3 1-2 1-2 2-2 2-1 0-1 0-2 0-4 1-3 1-0 0-0 1-0 0-0 1-1 4-1 1-2
2-0 2-1 2-2 4-2 1-0 6-2 2-2 2-3 1-1 4-1 5-2 2-6 4-0 4-1 4-0 1-1 2-1 2-1 3-3 2-1 3-2
0-2 2-2 1-0 2-1 1-0 2-0 2-0 1-0 1-0 1-0 1-1 1-1 3-0 2-2 0-0 3-1 1-0 2-0 3-1 4-2 4-0
2-0 1-3 0-1 2-1 3-0 1-1 4-0 3-2 0-0 2-1 2-0 4-4 4-2 1-0 1-1 0-0 1-1 1-0 1-3 3-0 0-1
5-4 3-1 3-0 2-3 5-0 1-1 3-1 3-1 3-3 5-3 4-1 0-5 5-4 0-2 1-3 1-2 3-2 2-2 3-0 1-0 5-1
1-1 3-3 3-2 3-0 2-2 0-0 7-0 3-0 2-1 7-1 2-0 1-1 2-3 2-3 4-0 2-2 3-1 1-1 3-0 4-2 1-0
1-3 1-1 3-1 2-0 0-1 3-0 3-4 1-0 2-2 4-1 2-1 5-3 6-2 5-1 1-0 6-4 5-1 1-3 6-0 2-3 1-1
0-0 1-1 2-0 1-1 1-2 4-2 2-0 0-3 0-3 3-0 4-0 1-1 3-1 2-0 1-2 4-2 1-0 2-1 2-1 1-1 4-0
3-4 0-2 2-2 3-1 2-0 2-3 2-0 3-0 2-0 2-1 2-0 1-1 2-1 5-0 3-1 1-0 1-1 2-1 3-0 3-1 0-1
2-1 2-0 0-0 2-2 1-2 1-1 3-3 3-2 7-1 1-1 3-0 0-5 2-0 0-2 0-0 1-1 2-2 2-1 4-0 1-2 1-0
2-0 1-1 2-2 2-1 1-1 0-0 3-2 4-1 1-1 3-0 4-0 5-1 1-0 2-1 3-1 4-1 4-1 2-1 2-4 6-2 4-1
2-0 1-2 1-0 1-3 0-0 0-0 2-2 2-1 1-1 3-1 0-0 2-5 3-2 2-1 0-1 1-1 1-1 2-1 2-1 2-2 1-1
1-1 3-1 2-0 3-0 1-1 2-0 1-3 2-0 0-0 5-0 4-2 3-3 2-0 3-2 2-2 2-1 2-0 1-0 5-5 4-1 1-0
1-5 2-1 1-1 1-3 0-1 4-1 1-2 2-2 2-1 1-0 3-0 6-2 2-1 2-1 2-1 0-1 1-0 1-0 3-2 5-3 1-1
1-3 2-2 1-2 1-1 0-0 1-0 5-2 1-0 3-2 1-1 1-0 3-1 2-5 3-1 2-0 1-1 1-1 0-1 2-0 3-2 1-3
0-0 0-3 2-0 0-2 3-1 1-1 2-3 6-4 2-1 2-2 1-2 1-2 5-1 1-0 1-0 0-0 0-1 0-0 2-0 2-3 1-3
0-0 2-0 2-2 5-1 1-1 2-0 1-2 2-1 2-0 1-1 2-1 1-1 2-2 3-0 6-2 2-4 0-2 1-0 5-3 *-3 2-1
1-1 4-0 3-0 4-1 1-0 2-3 3-2 3-1 5-1 3-2 2-1 4-2 1-3 1-1 4-1 3-2 3-0 2-1 3-0 1-0 6-2
2-4 3-2 0-2 1-0 1-2 2-0 1-3 2-1 4-2 2-1 3-0 3-1 2-2 1-0 3-1 3-1 0-0 2-3 2-2 6-4 2-1
2-0 2-1 2-3 4-0 6-1 1-2 3-1 7-2 5-2 3-1 3-0 2-0 2-1 3-1 0-1 1-1 5-0 4-3 2-1 1-1 5-2
Each row is one team in turn, its results playing each of the others in the
season.
For my normalising purposes I would perhaps leave 0=0,
1=1, 2=>-1, 3=>2, 4=>-2, 5=>3, 6=>-3, 7=>4, 8=>-4, 9 or 10=>5,
11 or 12=>-5,
or some such transformation. Then add a transformation normalisation
profile element by element and the results would have very much the look
and feel of DNA profiles.
Point to note - there is a greater likelihood of the home side having the
larger score, so perhaps convert all pairs to right number equal to or greater
than the left number before match processing, then a part-negative
transformation.
After all, in the real world of DNA profiles they never know
which parent contributes which, so the right number is always
larger than or the same as the left number of each pair.
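That 'directing' convention is a one-line normalisation. A minimal Python sketch (my restatement, not the author's macro):

```python
def direct_pair(pair):
    """Order an allele pair smaller-first, mirroring the NDNAD
    convention that profiles hold (14,16) and never (16,14)."""
    a, b = pair
    return (a, b) if a <= b else (b, a)

print([direct_pair(p) for p in [(16, 14), (2, 1), (3, 3)]])
# -> [(14, 16), (1, 2), (3, 3)]
```

Applying this before sorting and matching ensures that two people with the same genotype always produce byte-identical records.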
This evening I tested my technique just on that 1936/37
block posted earlier. Broke it into triplets, so 147 triples
of pairs. Only one match, 3-0,2-2,0-0, so already
finding evidence against my theory. I expected any
matches most likely to consist of 0s, 1s and 2s,
as weighted like DNA profiles,
but my first match includes a 3.
Giving it 'chapter and verse', these are
Charlton v Man U, Middlesb, Pompey
and Everton v Brentford, Charlton, Chelsea.
Now I know my analytical technique works, I
will download the other 60-odd blocks and break
them into '5 loci' and '6 loci' pairs and analyse.
Football results analysis
I broke down the 1888 to 1938 data into 5 pairs, giving 3038 sets of 5 pairs,
discarding surplus columns after splitting each year's data
into 5-pair-wide chunks.
Within them only 6 pairs of matches - no triples:
0-1,2-2,2-0,2-1,3-1 for Arsenal 1911 and Shef Wed 1933
0-2,1-1,1-0,2-2,4-0 for Man C 1899 and Blackpool 1937
1-1,1-0,1-1,3-0,0-0 for 1921 Middlesb and 1923 Huddersfield
1-1,2-1,1-0,1-0,1-0 for 1900 Wolves, 1911 Bradford
2-2,1-0,0-0,3-0,3-0 for 1905 Notts C and 1906 Derby
3-0,0-1,1-0,2-2,3-1 for 1899 Burnley and 1903 Bury
so only 1 involving a 4.
(Approximate, for 0 to 5) digit occurrence counts in total / matched
pairs
0 4900 / 20
1 8400 / 21
2 7300 / 12
3 4840 / 6
4 2650 / 1
5 1400
6 358
7 146
8 51
9 22
10 9
12 2
11 & 13-19 nil
There is no point in doing a 6-pair analysis for this data,
but I may try a 4-pair analysis to see if there is something
like a correlation between the overall number count
and a roughly similar distribution within the matched pairs.
I repeated the football result analysis on
sets of 4 pairs.
This gave 54 matches and a single triple match
on 2-0 2-1 1-0 1-1.
The digit distribution on the matches was
0 298
1 334
2 158
3 72
4 20
5 6
nothing higher
For all scores the digit counts were
0 6771
1 9331
2 5840
3 3538
4 1746
5 759
6 304
7 118
8 39
9 17
>=10 9
Again a rough correlation, with proportionally
more of the higher-frequency digits in the matches.
There must have been someone here before
with some weighted but otherwise random process, not
necessarily DNA inheritance.
Is there a rule relating a known weighted generator (
approximately multi-modal 'normal' distribution )
predetermining the weighting of any matches occurring?
Well, that was an interesting exercise; I've not
tried composing Visual Basic macros before.
Tailored the pseudo-random generator to
the desired characteristic, checked the output
against the desired characteristic,
determined matches and plotted the digit
distribution of the match cases.
32000 digits divided into 8 columns.
Not quite as I predicted - one match of
44,43,14,44, so a single 1 crept in.
10 matched sequences in total, no triples.
Desired weighting of generator to roughly equate to vWA
0 _ 0.002
1 _ 0.015
2 _ 0.100
3 _ 0.133
4 _ 0.25
5 _ 0.25
6 _ 0.133
7 _ 0.100
8 _ 0.015
9 _ 0.002
Actual weighting of output
0 _ 0.00203
1 _ 0.01569
2 _ 0.0996
3 _ 0.13103
4 _ 0.25147
5 _ 0.25044
6 _ 0.13412
7 _ 0.09775
8 _ 0.016
9 _ 0.00187
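A cumulative-threshold generator reproduces this kind of desired-versus-actual agreement. The sketch below is a Python restatement under the assumption that the macro works this way (the generator's VBA is not shown in this section); the weights are the desired table above:

```python
import random

# Desired weighting from the table above (sums to 1.0):
WEIGHTS = [0.002, 0.015, 0.100, 0.133, 0.25, 0.25, 0.133, 0.100, 0.015, 0.002]

def weighted_digit(rng=random):
    """Return a digit 0-9 with the vWA-like weighting by walking
    cumulative thresholds over one uniform draw."""
    r = rng.random()
    cum = 0.0
    for digit, weight in enumerate(WEIGHTS):
        cum += weight
        if r < cum:
            return digit
    return 9  # guard against floating-point round-off

random.seed(0)
sample = [weighted_digit() for _ in range(50000)]
print(round(sample.count(4) / len(sample), 3))  # close to the 0.25 target
```

With 50,000 draws the observed frequencies sit within a fraction of a percent of the targets, matching the agreement shown in the two tables.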
Weighting within matched pairs,80 digits,no triples
0 _ 0
1 _ 0.012
2 _ 0.05
3 _ 0.137
4 _ 0.4
5 _ 0.225
6 _ 0.1
7 _ 0.075
8 _ 0
9 _ 0
Matched sequences were
24425554
27565445
33457434
44431444
44535467
46544564
54634754
55244574
57643443
73563434
So my prediction was not quite right for this one-off, as that
single 1 intruded, and there is an interesting skew in the centre
which hopefully would clarify with repeated processing.
I suspect it relates to the piecewise 'quantisation', as 3, say, is something
like between 2.5 and 3.5.
But I can sum up as attenuation at the tails and an enlarged
modal group - more of an inverted U or V characteristic.
So in the original 'population' 50% have 4 or 5,
increasing to 62.5% within any matches, and 76.7%
have 3, 4, 5 or 6, increasing to 86.2% within matches.
Tentative evidence that any unrelated matches in the NDNAD
are going to be concentrated around the multi-modal groups.
So if anyone does get around to resolving those unresolved
matches in the NDNAD, then any matches
involving rareish ( < 2% allele frequency, say )
alleles can be ignored in the first instance, as they are
probably repeats, either due to clerical error or use of aliases.
Concentrate investigation / cross-correlation with the dermal
fingerprint database, or whatever, on those matches
nearest the 'average Joe'.
All the above concerns undirected numbers - the data in
the NDNAD is of course directed pairs, e.g. (14,16), never (16,14).
Also some loci are more distributed than vWA but others
are less distributed / more skewed. In theory one could model
each locus/allele frequency distribution and simulate
a large DNA database, given a big enough number cruncher.
So in case I've discovered some previously unknown
mathematical law, I should repeat with a weighting of a genuine
'normal distribution' f(x) of form EXP [-(x-mu)^2], repeat many times
to try and put some sort of an f(x) to the match characteristic, and
also see how far I can push the number crunching on my pc
to 10 or more digit-sets and 100,000 or more digits.
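The proposed normal-curve weighting can be quantised digit by digit. A Python sketch, where the centre mu and the spread divisor are illustrative choices of mine, not figures from the text:

```python
import math

def quantised_normal_weights(mu=4.5, spread=1.5, n=10):
    """Piecewise quantisation of the proposed f(x) = EXP[-(x-mu)^2]
    onto digits 0..n-1, renormalised to sum to 1. The `spread`
    divisor is my addition so the tails are not vanishingly small."""
    dens = [math.exp(-((d - mu) / spread) ** 2) for d in range(n)]
    total = sum(dens)
    return [w / total for w in dens]

w = quantised_normal_weights()
print([round(x, 3) for x in w])  # symmetric, peaked at digits 4 and 5
```

With mu midway between 4 and 5 the weighting is symmetric, like the vWA-style table used earlier, and the tails fall off smoothly rather than piecewise.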
I've just done an 8-digit times 10,000 run
yielding 63 matches. Unfortunately my method does
not, as it stands, pick up triples. I have to check the
source file one by one, which is alright for 10 but 63 is a bit much.
Weighting within matched sequences, 504 digits, no triples checked in the
central 45...... to 54...... region; a single 1 again, in 1,5,5,4,4,5,5,5.
0 _ 0
1 _ 0.002
2 _ 0.0536
3 _ 0.1151
4 _ 0.3353
5 _ 0.3571
6 _ 0.1012
7 _ 0.0357
8 _ 0
9 _ 0
so for 4,5 69.2% and for 3,4,5,6 90.9%
I tailored my generator for Afro-Caribbean vWA,
which has no nulls for 10 adjacent alleles and is
more symmetric than the Caucasian.
Projected allele frequency characteristic of the generator
0_ 0.005
1_ 0.016
2_ 0.079
3_ 0.218
4_ 0.208
5_ 0.211
6_ 0.161
7_ 0.068
8_ 0.029
9_ 0.005
Actual characteristic of 500,000 digits
0_ 0.0051
1_ 0.0157
2_ 0.0794
3_ 0.217
4_ 0.2085
5_ 0.212
6_ 0.1606
7_ 0.0679
8_ 0.0288
9_ 0.005
And characteristic of the 28 matches (no triples)
0_ 0
1_ 0.0036
2_ 0.0464
3_ 0.2071
4_ 0.2679
5_ 0.2714
6_ 0.1821
7_ 0.0179
8_ 0.0036
9_ 0
So again serious attenuation of the normal/binomial
distribution tails and an increase in the take of
the modal group.
3,4,5 originally 63.7% increasing to 74.6%
0,1,2 originally 10% decreasing to 5%
and 6,7,8 originally 26% decreasing to 22%
or 3,4,5,6 79% up to 92.8%
and 0,1,2,7,8,9 21% down to 7.2%
These are the 28 matches for 50,000 spins
of 'vWA - Afro-Caribbean allele frequencies'.
No 0 or 9, and one each of 1 and 8
3,3,6,5,3,6,6,5,4,4
3,4,3,5,6,6,5,3,3,3
3,4,4,5,4,5,5,4,6,6
3,4,5,3,5,5,7,6,5,3
3,4,6,4,4,3,4,5,6,4
3,4,6,7,4,5,4,4,4,4
3,5,3,4,3,3,3,3,3,5
3,5,3,4,4,4,7,4,4,5
3,5,3,5,3,3,5,4,4,5
3,5,5,5,4,5,5,6,2,3
3,5,6,3,6,2,6,4,1,4
3,6,5,4,3,3,3,5,5,3
3,6,6,6,5,4,5,5,3,5
4,3,3,6,5,3,3,5,5,4
4,3,6,4,4,5,3,3,3,2
4,4,3,4,5,2,6,2,5,6
4,4,4,5,5,5,6,7,4,5
4,4,7,3,4,3,4,5,5,3
4,5,5,3,6,5,3,6,5,4
4,5,5,4,6,6,6,5,5,4
4,5,6,3,2,5,4,5,5,4
4,6,3,4,5,3,6,6,6,2
4,6,4,6,5,5,2,6,2,6
5,4,3,4,4,4,5,2,5,5
5,4,6,5,5,6,5,5,4,5
6,4,6,3,4,4,6,6,6,6
6,5,5,4,5,4,4,2,6,4
8,2,6,6,4,4,5,6,2,5
For my next run I think I will
model each of the 10 UK loci/alleles in my
generator and spin 50,000 times to
simulate a 10-loci / single-allele database
of 50,000 profiles. I will lose the significance
between null and 0, but 0s have not appeared in
any match so far. THO1 would only use 7 of the
possible 10 values in the array; others like D21,
with about 16 possible alleles, I will truncate to
the modal 10 / most frequent (undecided yet). The triple-peaked D2
(equal peaks at 17, 20 and 24) I will
truncate to the 10 around the 'Anglo-Saxon'
group of 17 - 20, leaving out 2% alleles 26 and 27
at the 'Celtic 24 end'.
For any mathematical runs I cannot decide whether
to use a binomial quantised/piecewise distribution
for the generator,
closer to this use, or the normal function f(x), with
more chance of a numerically derived
f '(x) for the match distribution.
A 100,000 x 10 run would be possible I think,
but a bit of a work-up. I may also try 6 loci,
paired alleles, so 50,000 x 12, with simulation
of the earlier 6 NDNAD loci characteristics, which should
give an idea of how many 'Raymond Eaton' cases there
would be in the earlier NDNAD form. But I would have
to build another macro to direct the pairs before
match checking.
I have converted my generator to 6 loci and pairs, so 12-digit
'profiles'. So far I've only done one run of 12000 x 12
spins to check the characteristics. Continuing on,
directing pairs and checking for matches
produced no matches with 12000 '6 loci profiles'.
I deliberately added 2 matches to the data and
it found those 2 (4) as a check of functioning.
Anyone care to predict how many matches for runs
of 20,000 / 50,000 / 100,000 and 200,000?
Nulls are either due to no FSS data for that allele
or to keep my selection down to a maximum of 10 digits.
UK Caucasian
Tabulated as FSS data, e.g. vWA allele 14 corresponds
to digit 1 in my modelling.
Allele / desired frequency / modelled frequency
for vWA
11 0.000 NULL
13 0.001 0.0012
14 0.105 0.1065
15 0.080 0.0794
15.2 0.000 NULL
16 0.216 0.2146
17 0.270 0.2717
18 0.219 0.2183
19 0.093 0.0926
20 0.014 0.0137
21 0.002 0.0022
^
9 only modelled
THO1
5 0.002 0.0012
6 0.241 0.2439
7 0.194 0.1972
8 0.108 0.1027
8.3 0.001 0.0011
9 0.140 0.1385
9.3 0.304 0.3051
10 0.012 0.0103
10.3 0.000 NULL
^
8 only modelled
D8 D8S1179 / D6
8 0.018 0.018
9 0.013 0.0143
10 0.094 0.0953
11 0.066 0.0656
12 0.143 0.1442
13 0.333 0.330
14 0.209 0.2081
15 0.088 0.0886
16 0.031 0.030
17 0.004 0.0057
18 0.000 NULL
FGA
18 0.025 0.0426
18.2 0.000 null
19 0.056 0.0577
19.2 0.000 null
20 0.143 0.1432
20.2 0.002 null
21 0.187 0.1838
21.2 0.002 null
22 0.165 0.1631
22.2 0.011 0.0116
23 0.139 0.1411
23.2 0.004 null
24 0.146 0.1462
24.2 0.002 null
25 0.075 0.0758
25.2 0.000 null
26 0.035 0.0348
27 0.007 null
28 0.000 null
29 0.000 null
30 0.001 null
30.2 0.000 null
31 0.000 null
45.2 0.000 null
46.2 0.000 null
^ 0 (allele 18 ) is inflated by 1.8% nulls
D21 D21S11
53 (24) 0.000 null
54 0.001 null
57 (26) 0.001 null
59 (27) 0.031 0.0368
61 (28) 0.160 0.1559
63 (29) 0.226 0.2289
64.1 0.000 null
64 0.000 null
65 (30) 0.258 0.2571
66 0.027 0.0264
67 (31) 0.069 0.0666
68 0.093 0.0965
69 (32) 0.018 0.0179
70 0.090 0.0922
71 (33) 0.001 null
72 0.022 0.0217
73 (34) 0.000 null
74 0.002 null
75 (35) 0.000 null
77 0.000 null
^ 0 (allele 27) is inflated by 0.5% nulls
D18 D18S51
8 0.000 null
9.2 0.001 null
10 0.008 null
11 0.012 0.0335
12 0.139 0.1405
13 0.125 0.1254
14 0.164 0.1686
14.2 0.000 null
15 0.145 0.1447
16 0.137 0.1342
17 0.115 0.1167
18 0.080 0.0767
19 0.041 0.0419
19.2 0.000 null
20 0.017 0.0177
21 0.010 null
22 0.005 null
23 0.001 null
24 0.002 null
^ 0 (allele 11) is inflated by 2.5% nulls
Remainder
7 to 10 loci yet to be modelled
D2 D2S1338
16 0.037
17 0.185
18 0.087
19 0.110
20 0.138
21 0.032
22 0.024
23 0.112
24 0.142
25 0.111
26 0.019
27 0.002
28 0.000
D16 D16S539
5 0.000
8 0.019
9 0.129
10 0.054
11 0.289
12 0.288
13 0.186
14 0.029
15 0.005
D19 D19S433
10 0.000
10.2 0.000
11 0.000
12 0.087
12.2 0.000
13 0.222
13.2 0.013
14 0.382
14.2 0.015
15 0.177
15.2 0.038
16 0.041
16.2 0.017
17 0.005
17.2 0.000
18 0.000
18.2 0.002
19.2 0.001
D3 D3S1358
12 0.001
13 0.006
14 0.132
15 0.265
16 0.247
17 0.195
18 0.141
19 0.014
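The frequency tables above define a categorical distribution per locus. As an illustration, here is a Python sketch drawing one directed vWA pair from the UK Caucasian figures (assuming the near-zero null alleles are simply dropped, since the remaining mass already sums to 1.000):

```python
import random

# UK Caucasian vWA frequencies from the table above (nulls dropped):
VWA = {13: 0.001, 14: 0.105, 15: 0.080, 16: 0.216, 17: 0.270,
       18: 0.219, 19: 0.093, 20: 0.014, 21: 0.002}

def sample_directed_pair(freqs, rng):
    """Draw two alleles independently from a locus's frequency table
    and record them smaller-first, as the NDNAD does."""
    a, b = rng.choices(list(freqs), weights=list(freqs.values()), k=2)
    return (a, b) if a <= b else (b, a)

rng = random.Random(42)
print(sample_directed_pair(VWA, rng))  # e.g. a directed pair like (16, 18)
```

Repeating this over all 10 loci gives one 20-digit simulated profile; independence between the two draws and between loci is the same modelling assumption the macros make.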
Trumpet these results for a simulated DNA database
for UK Caucasians.
For a 20,000 6-loci / 12-allele 'profile' run
First run: 5 matched pairs, no triples
1,6,1,6,2,6,1,3,2,3,2,6
3,4,2,6,5,5,2,3,1,3,4,5
3,5,2,6,5,6,2,3,1,3,1,6
5,6,1,6,5,5,2,3,1,1,1,6
5,6,2,6,5,6,7,7,2,8,1,8
Second run one pair only
4,5,6,6,6,6,3,6,1,3,5,5
The real NDNAD had 45,000 6-loci profiles back
in 1991, which just shows what dangerous
nonsense these databases are for nabbing
false suspects, and hints at the number of 'unresolved'
pairs in the real NDNAD.
This is the multi-modal 'average Joe' 6-loci profile
for UK Caucasians:
vWA,THO1,D8,FGA,D21,D18
(17,17)(6&9.3)(13,14)(21,21)(29,30)(14,14)
corresponding to
4,4,1,6,5,6,3,3,2,3,3,3
in this representation, so little agreement
with my other hypothesis, although individual
pairs seem to tally in 4 of the 6 loci.
To convert one representation to the other,
use the tables in my previous posting.
Then a single 50,000 x 12 run with 27 matched pairs, no triples,
processed down from 600,000 data points
1,3,1,6,4,7,4,8,2,3,2,2
1,5,1,6,4,5,3,6,2,3,4,6
1,5,1,6,5,5,2,3,2,2,1,2
1,5,6,6,5,5,4,8,1,3,1,3
1,5,6,6,5,6,2,3,2,3,2,3
2,6,1,5,5,5,3,6,1,3,5,5
3,3,2,5,5,6,3,6,2,3,1,3
3,4,1,6,5,5,3,4,3,8,1,3
3,4,2,6,3,5,3,4,2,8,4,5
3,4,2,6,5,5,4,8,5,6,4,6
3,5,1,2,5,6,3,4,2,3,0,7
3,5,1,5,5,5,2,8,2,3,1,6
3,5,1,5,5,6,4,4,3,8,5,5
3,5,5,6,4,5,3,6,2,8,0,2
4,4,1,5,4,6,7,7,2,3,1,3
4,4,1,6,5,6,3,4,1,3,3,7
4,4,5,6,0,2,2,3,1,3,1,5
4,4,5,6,4,6,2,4,2,3,2,4
4,5,1,1,5,5,2,9,3,3,5,5
4,5,1,2,5,6,2,3,2,3,1,3
4,5,1,6,5,5,2,7,3,3,1,6
4,5,2,6,4,5,3,9,1,3,3,7
4,5,2,6,5,6,1,7,2,3,4,7
4,5,2,6,5,6,3,7,2,3,1,3
4,6,1,6,4,6,2,3,3,3,1,2
4,6,5,6,5,5,2,3,2,8,5,6
5,7,2,5,5,6,4,6,2,5,3,5
There is one further bit of analysis which could probably do with another macro.
For each pair of columns in the above 27 match-rows, do a frequency
plot of each digit and compare to the generating characteristic for each 'locus', for the
attenuated-tails and enlarged-modal-group effect.
And which are the most commonly occurring paired alleles in each locus?
e.g. (4,5) for vWA and (1,6) for THO1, or whatever.
To go any further I must make a
software restructure, basically swapping
disk space for memory and re-concatenating,
to go to 7 loci, to 8, to 9
and then 10 loci and more than 50,000 'profiles'.
Would anyone care to speculate on the
number of matches in 50,000, 100,000 and 200,000
profiles in 6-loci, 7-loci, 8-loci, 9-loci and 10-loci data-sets?
Or even the general case, extending to 2 million or even 60 million.
At the moment, for 6 loci, it looks as though the number
of matches follows a square law, about [N/(10^4)]^2 where N = number of 'profiles'.
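That square law is what the birthday-problem approximation predicts: with N profiles and a per-pair match probability p, the expected number of matched pairs is about N(N-1)/2 * p, which is quadratic in N. A Python sketch; the value p ~ 2e-8 is inferred here from the [N/(10^4)]^2 fit, not a figure from the text:

```python
def expected_matches(n, p):
    """Birthday-problem approximation: among n profiles there are
    n*(n-1)/2 unordered pairs, each matching with probability p,
    so the expected match count grows as the square of n."""
    return n * (n - 1) / 2 * p

# p ~ 2e-8 inferred from the [N/10^4]^2 fit for 6-loci profiles:
for n in (20000, 50000, 100000):
    print(n, round(expected_matches(n, 2e-8), 1))
```

At N = 50,000 this gives about 25 expected pairs, consistent with the 27 observed in the 50,000 x 12 run above.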
I have to gauge when to post this stuff to the Yahoo
forensic group, Prof Sir Alec Jeffreys etc.
I've now converted all macros to 7 loci / 14 data-points.
This is the result for a 50,000-profile run simulating
vWA,THO1,D8,FGA,D21,D18,D2
1,4,1,6,4,6,4,7,1,3,4,7,3,4
4,5,5,6,5,5,3,6,1,3,3,5,1,4
5,6,1,6,4,6,6,8,1,3,3,3,2,9
3 pairs, no triples.
This 7th locus, D2, is the most removed from a normal distribution,
having 3 distinct, separated peaks for UK Caucasians.
The final 3 loci are closer to a normal distribution,
but I will certainly have to increase to 100,000
profiles and more.
At the moment, just the pc processing time on a
1997-vintage AMD K6, 64M RAM pc,
for 50,000 x 14 is:
1/ generating profiles constrained to allele frequencies - 32 seconds
2/ redirecting pairs - 20 s
3/ splitting into 10 files by first digit (0 to 9 ) - 17 seconds
4/ sorting the biggest file (3........... in this case, but no pairs in this file) - 85 seconds
5/ pair matching - 3 seconds
6/ visual check of the sorted file to confirm presence of matches and also see if there is a triple
7/ for files that don't reveal a match, repeat with
a seeded match in the data to check the macro does pick it up
repeat processes 4,5,6,7 on each / a bunch of the remaining 9 files
The sort is alphanumeric rather than numeric.
If the files become too big to sort (process 4)
then I will just subdivide
on the second digit and proceed as before.
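Steps 1-5 amount to an external sort with a leading-digit partition. A compact Python restatement of that pipeline (mine, not the macros):

```python
from collections import defaultdict

def bucket_sort_and_match(profiles, width=12):
    """The divide/sort/match pipeline above as one function: split
    by leading digit so each chunk stays small enough to sort, sort
    each bucket, then scan adjacent entries for matching keys."""
    buckets = defaultdict(list)
    for ps in profiles:
        buckets[ps[0]].append(ps)
    matches = []
    for head in sorted(buckets):
        chunk = sorted(buckets[head])
        matches += [b for a, b in zip(chunk, chunk[1:])
                    if a[:width] == b[:width]]
    return matches

data = ["444455556666", "120456789012", "444455556666"]
print(len(bucket_sort_and_match(data)))  # -> 1
```

Partitioning is safe because two profiles can only match if they share the same first digit, so no match is lost by sorting the buckets separately. Note the sketch, like Word, compares strings, so the sort is alphanumeric.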
So, anyone care to predict the number of matches in 100,000
and 200,000 runs for 8 loci, 9 loci and 10 loci?
Results so far
4,000 profiles, 8 digits, undirected - 10 pairs
10,000, 8 digits, undirected - 63 pairs
10,000, 10 digits, undirected - 28 pairs
12,000, 6 loci, directed - no pairs
20,000, 6 loci, directed - 1 to 5 matches
50,000, 6 loci, directed - 27 pairs
50,000, 7 loci, directed - 3 pairs
I've now converted all macros to 8 loci 16 data-points.
This is the result for one 100,000 profile run simulating
vWA,THO1,D8,FGA,D21,D18,D2,D16
Just one match
4,4,2,6,6,6,1,4,2,3,3,3,1,8,3,3
Results so far
4,000 profiles, 8 digits, undirected - 10 pairs
10,000, 8 digits, undirected - 63 pairs
10,000, 10 digits, undirected - 28 pairs
12,000, 6 loci, directed - no pairs
20,000, 6 loci, directed - 1 to 5 matches
50,000, 6 loci, directed - 27 pairs
50,000, 7 loci, directed - 3 pairs
100,000, 8 loci, directed - 1 pair
I've now converted all macros to 9 loci 18 data-points.
No matches for one 200,000 profile run simulating
vWA,THO1,D8,FGA,D21,D18,D2,D16,D19
Results so far
4,000 profiles, 8 digits, undirected - 10 pairs
10,000, 8 digits, undirected - 63 pairs
10,000, 10 digits, undirected - 28 pairs
12,000, 6 loci, directed - no pairs
20,000, 6 loci, directed - 1 to 5 matches
50,000, 6 loci, directed - 27 pairs
50,000, 7 loci, directed - 3 pairs
100,000, 8 loci, directed - 1 pair
200,000, 9 loci, directed - no pairs
For anyone coming after me, this is a breakdown by the leading 'vWA'
digits, as it is quite bunched and matches are presumably more likely
in the bigger groups (eg 1,4... ; 3,4... ; 3,5... ; 4,4... ; 4,5... ; 4,6... ),
and probably in much the same proportions
for the 10-loci case.
0,0... to 0,9... - 400 'profiles'
1,1... - 2100
1,2... - 3400
1,3... - 9000
1,4... - 11300
1,5... - 9300
1,6... - 3900
1,7... to 1,9... - 700
2,2... - 1300
2,3... - 6800
2,4... - 8900
2,5... - 7100
2,6... - 2900
2,7... to 2,9... - 500
3,3... - 9500
3,4... - 23400
3,5... - 18900
3,6... - 8000
3,7... to 3,9... - 1400
4,4... - 14700
4,5... - 23400
4,6... - 10100
4,7... to 4,9... - 1800
5,5... - 9500
5,6... - 8000
5,7... to 5,9... - 1400
6,6... to 6,9... - 2200
7,0... to 7,9... - 50
8,0... to 8,9... - 1
Now converted all macros to 10 loci x2 and
also added a macro for converting back to the usual representation.
For a run of 600,000
Single 10 loci match of
VWA,THO1,D8,FGA,D21,D18,D2,D16,D19,D3
(17,18);(8,9);(13,14);(20,22);(30,30);(14,15);(20,20);(12,13);(13,14);(16,18)
Then cutting back on the same output array:
the same single match on 9 loci and no other,
9 (18) matches on 8 loci, including the 9- and 10-loci one,
102 (204) matches on 7 loci, including the first 7 pairs of the 8-, 9- and 10-loci ones,
and 2907 (x2) matches on 6 loci, including the first 6 pairs of the 7-, 8-, 9- and 10-loci ones.
No triples on the 8-loci set, and I've not checked the 7- and 6-loci sets.
If there are 6-loci records on the NDNAD they must be next to useless.
About 3000 matches in 600,000, so if it were still 6 loci and a square law,
there would have been about 3000 x 3 squared, or 27,000 (x2), matches.
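The cutting back can be sketched simply: a k-loci match is a repeated prefix of 2k codes in the directed profiles, so the lower-loci counts come off the same sorted array by truncating each line. A Python sketch with toy data (not my macro, which works on the sorted files):

```python
from collections import Counter

# Count matching pairs at k loci by truncating each directed profile to
# its first 2*k comma-separated codes; n identical prefixes give
# n*(n-1)/2 matching pairs.
def match_pairs_at(profiles, k_loci):
    counts = Counter(",".join(p.split(",")[:2 * k_loci]) for p in profiles)
    return sum(n * (n - 1) // 2 for n in counts.values())

# toy 2-locus profiles: all three share the first locus, none share both
toy = ["1,2,3,4", "1,2,3,5", "1,2,9,9"]
```

This is why the lower-loci match counts always include the higher-loci ones: a 10-loci match is automatically a match on any shorter prefix.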
My 'average Joe' is
(17,17);(6,9.3);(13,14);(21,21);(29,30);(14,14);(17,20);(11,12);(14,14);(15,16)
and my (slightly altered) profile is
(17,19);(8,9.3);(13,13);(20,22);(29,29);(13,15);(18,19);(12,12);(12,14);(16,18)
4,6;3,6;5,5;2,4;2,2;2,4;2,3;4,4;0,3;4,6
even closer to the numerically derived first match, normalised to
(0,1);(0,1);(0,-1);(0,0);(-1,-1);(-1,0);(-2,-1);(0,-1);(-1,0);(0,0)
There were 70,764 'profiles' with the first, vWA, pair of (17,18); that set contained
the 10-loci match and was the largest sub-set.
The next largest was 70,131 for vWA (16,17).
I may do a one million run for the sheer hell of it, but maybe
only fully sort the (17,18) subset.
I think there is a problem with the Rnd function
despite using the Randomize adjunct.
Did a (4,5) subset of 2 million profiles, which
took 25 minutes, giving 236,345 20-digit 'profiles'
(4,5);..........
Much processing later .........
The same matched pair as before which all looks highly
suspicious. And again same single match for 9 loci.
22 matches for 8 loci subset
214 for 7 loci
and 6113 for 6 loci .
Inspecting the 22 matches compared to the full 10 loci,
there were 3 near misses in adjacent sorted
sequences, where the first 16 digits matched.
Final 4 digits: 1,3,3,6 and 1,8,3,6
1,3,3,6 and 3,5,3,6
1,3,4,6 and 3,3,4,6
so 3 separate 9-loci matches, and 2 separate 9.5-loci if I had chosen
loci 1,2,3,4,5,6,7,8,10 instead of the straight sequence.
I will have to research the Rnd pseudo random number generator
as my macros seem to check out ok.
I did some further checking back to the
original generated undirected '2million profile' file and
what becomes a match started as these
two sequences
5,4,3,5,6,5,2,4,3,3,3,4,4,4,5,4,1,3,4,6
5,4,3,5,6,5,2,4,3,3,4,3,4,4,5,4,3,1,6,4
which when directed both become
4,5,3,5,5,6,2,4,3,3,3,4,4,4,4,5,1,3,4,6
which is not in the original at all,
so not a manifestation of the Rnd function
repeating itself. There is no way the Rnd
function would 'know' what I was going
to do with the output. In other words, what
looked highly suspicious, the 1 10-loci
and only 1 9-loci match, would seem
genuine after all. Fascinating stuff.
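The directing step itself is easy to state: sort each adjacent pair of allele codes into ascending order. A sketch in Python (not my VB macro), using the two undirected sequences above to show how they collapse to the same directed profile:

```python
# 'Direct' a profile: put each adjacent pair of allele codes into
# ascending order, so within-pair order no longer distinguishes
# two otherwise identical profiles.
def direct(profile):
    vals = profile.split(",")
    out = []
    for i in range(0, len(vals), 2):
        out.extend(sorted(vals[i:i + 2], key=int))
    return ",".join(out)

a = "5,4,3,5,6,5,2,4,3,3,3,4,4,4,5,4,1,3,4,6"
b = "5,4,3,5,6,5,2,4,3,3,4,3,4,4,5,4,3,1,6,4"
# a and b differ as generated, but direct to the identical profile
```

So two differently generated sequences becoming one match after directing is expected behaviour, not an RNG fault.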
There is no repeated sequence turning up
in the generator file, as that would carry through
and be picked up by the matching macro.
Unfortunately, due to constraints of disk space and
enforced deleting of files, I don't have the
original undirected source file (23.4MB) for the
600,000 profiles where the same sequence
later emerged, only the directed file, but
probably the same effect.
Generating a new (5,4) + (4,5) 2 million subset, the sequences
differed from the previous run, so Randomize was working.
BUT - I checked
the original undirected/unsorted file for the
central 3x2 group 3,3,3,4,4,4 and
5,4,3,5,6,5,2,4,3,3,3,4,4,4,5,4,1,3,4,6 emerged again
but in a different place in the file. The following
sequences also matched. So Rnd seems OK within
one run, but repeat a run and the same result is likely
to emerge somewhere in a long run, despite using Randomize so that the
Rnd function starts at a different point. Bear in mind that
although I'm only selecting the 5,4 / 4,5 subset, the
Rnd function is being called 2 million x 20 times;
the period of the inbuilt Rnd function, 2^24, is only about 16 million, and
Rnd produces an exact figure based on the previous call.
I've buried a superfluous Rnd call in the subroutine that
writes the (4,5)..... file, very approximately on average every 20
profiles, which should disrupt the sequence as far as the numbers
used in the loci generator are concerned. This write call would
not be the same for each run.
I've so far done another 230,000-odd (4,5).... profiles
and that sequence does not reappear; I will
fully process and see what emerges.
Some right fun and games with Linear Congruential
Generators for random numbers,
from sources
http://www.geocities.com/SiliconValley/Campus/7071/rnd.html
and
www.kaner.com/pdfs/random.pdf
I am now using the Microsoft form for the Rnd,
but in this form it has 15-digit precision
rather than being truncated to 7, and I seem to be getting
more convincing results.
I tried the Kaner/Vokey form with z = 2^40,
trying each of
a = 27182819621, c = 3
and a = 8413453205, c = 99991
in exactly the same Visual Basic code as below,
but there was horrendous repeating of 'random' numbers.
I've no idea what the problem is; perhaps someone else
would like to fabricate a fairly simple RNG
or check the following code in a VB procedure.
Dimensioning the variables as Double made no difference.
---------------
' initialising
a = 214013
c = 2531011
x0 = Timer
' Timer sets the start seed to the number of seconds after midnight
z = 2 ^ 24
' RNG step: x1 = (a * x0 + c) Mod z, computed via Fix()
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
result = x1 / z
' 0 <= result < 1
-----------------------
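The same generator in Python, for anyone wanting to check its behaviour outside Word (the Fix() line above is just (a*x0 + c) Mod z computed through the fractional part; the seed 12345 below is an arbitrary stand-in for the Timer value):

```python
# Linear congruential generator with the same constants as the VB code:
# x1 = (a*x0 + c) mod z, with the result scaled into [0, 1).
A, C, Z = 214013, 2531011, 2 ** 24

def lcg_next(x0):
    x1 = (A * x0 + C) % Z
    return x1, x1 / Z

x1, result = lcg_next(12345)   # 12345 stands in for the Timer seed
```

Being a pure LCG with modulus 2^24, the state sequence necessarily repeats within about 16 million calls, which matches the repeat behaviour seen in the long runs.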
So far I have just processed one run of 1 million from the generator
using the above form of RNG, outputting to disk just the
4,5 and 5,4 subsets, directed, giving 118,193 4,5......... profiles.
Then sorting just 12,878 of the divided 4,5,3............. profiles:
no 16-, 18- or 20-digit matches, 3 '14-digit matches'
and 88 '12-digit matches', lopping back.
For a 1 million run each time: one run extracting 4,5.. and another
run extracting 4,4..
For 4,5.... (including the 4,5,1.. I mentioned yesterday),
of 105,315 profiles: 2x 8-loci matches only, no 9 or 10.
4,5,1,6,6,7,3,4,0,3,4,5,3,8,3,4
4,5,2,6,6,6,3,4,1,2,3,7,1,8,4,5
The 0 above represents an allele frequency of only 3.1%
sequences convert to
17,18 ; 6,9.3 ; 14,15 ; 21,22 ; 27,30 ; 15,16 ; 19,24 ; 11,12
17,18 ; 7,9.3 ; 14,14 ; 21,22 ; 28,29 ; 14,18 ; 17,24 ; 12,13
36x 7 loci matches
1345x 6 loci matches
And another 1 million run for 4,4.. only:
one 8-loci match among 73,259 4,4,……... profiles
4,4,3,6,5,6,4,6,3,3,3,7,0,1,4,5
The 0 here represents 3.7%
18x 7-loci matches
647x 6-loci matches
I now have confidence in the RNG and have ramped up
to 10 million profiles.
It took 2 hours 12 minutes to generate and save to disk a subset of
174,017 profiles: 4,5,1,6... 4,5,6,1... 5,4,1,6... and 5,4,6,1.....,
which when directed gave 4,5,1,6..... profiles only.
In them were 2 matches on 10 loci
4,5,1,6,2,5,0,4,1,5,1,7,1,4,4,4,1,3,5,6
4,5,1,6,5,6,2,6,1,2,2,3,1,4,3,4,3,3,3,5
which converts to
vWa;THO1;D8;FGA;D21;D18;D2;D16;D19;D3
17,18 ; 6,9.3 ; 10,13 ; 18,22 ; 28,31 ; 12,18 ; 17,20 ; 12,12 ; 13,14 ; 17,18
17,18 ; 6,9.3 ; 13,14 ; 20,23 ; 28,29 ; 13,14 ; 17,20 ; 11,12 ; 14,14 ; 15,17
The remaining processing, because of only 174,017 profiles,
took much the same time as the previous processing
but with a narrower 'catch'.
Other results in the usual sequence, ie the ordered first 9 loci (excluding D3),
not perm any 9 from 10, which would give higher numbers, but
as I rely on a sort routine I cannot do that determination.
9 loci - 7 matches
8 loci - 103 matches
7 loci - 1078 matches
6 loci - 21,113 matches
The 7x 9-loci and 2x 10-loci result is not too surprising
because the 10th locus is D3, very biased in the 3/4/5 area.
9 loci match analysis 4 pairs were 4,5,1,6,2,5,.... including the 10 loci one
8 loci analysis 12 were 4,5,1,6,2,5 ..... 17 were 4,5,1,6,4,5......
35 were 4,5,1,6,5,5...... 17 were 4,5,1,6,5,6......
So ramping up from 2 million to 10 million,
a factor of 5, these results agree with the square law
from the 2 million results if those are also restricted to 4,5,1,6...
Remember someone has decided to halt the NDNAD
when it reaches 3 million. It looks suspiciously
like he has done the same processing as me.
3m is likely the figure [<10/(2^-.5) and > 2 million (square law assumed)]
where you are likely to get one match
in the most frequently occurring (first) loci.
Returning to the 10 million result, I still have no idea
whether there would be more 10-loci matches in the
remaining (10m minus 174,017) = 9,825,983 profiles I
did not save and test. From the 2m runs and the 8-loci results for subsets
4,5,2,6... and 4,4..., I would suggest there are, but I
cannot put a likely figure on it. What 8-, 9- and 10-loci matches
do emerge are not being found totally in the multi-modal
areas where I intuitively expected them to be, so they could
appear anywhere, it seems, perhaps with a majority
of modal matches.
I will try another 174,017 subset of 10m in a block away from
4,5,1,6.... ; perhaps 4,4........ and see what emerges.
I will also write up, with the macros, for anyone else
to have a go; independent replication of
such analysis is fundamental. I used Visual Basic
macros with Word 97 on a 6-year-old PC.
The next area of exploration is the common ancestor, ie
parent and at least 10 alleles in common,
grandparent and at least 5 alleles in common, on average.
What is the probability of someone related, having these 5 to 10
as a starting point, then also matching on the remaining 15 to 10
just by a chance process, and the probability of that person
also being in the NDNAD? Remember we are talking
real ancestry here, not the nice comfy (sham) ancestry of the
genealogy community. The milkman factor, lovers, one-night
stands etc mean that up to 30% of people have a genetic
father different to their accredited father.
The nearest to 174,017 I could find for a convenient rarer subset
of 10 million profiles was 2,6.... & 6,2....,
giving 150,105 'profiles'
Results for
10 loci - 0 matches
9 loci - 1 match
8 loci - 3 matches
7 loci - 39 matches
6 loci - 1262 matches
The 9 loci match was on
2,6,2,6,5,5,3,7,2,3,3,5,7,8,4,5,3,5
which started as
2,6,2,6,5,5,3,7,2,3,3,5,8,7,5,4,3,5,6,2
and
2,6,6,2,5,5,7,3,3,2,3,5,8,7,4,5,3,5,3,4
so confidence in the RNG.
Previously I did a similar 2,6 & 6,2 run but included a
variation: adding in calls, 1 in 20, to the built-in Rnd
function, added to the external Rnd,
on the assumption that adding a poor rand
to a reasonable rand would make it better.
Not so.
Processed through and checked for matches:
apparently 3 10-loci matches.
Going back to the generator array, except for
the pair-directing, the sequence appeared twice exactly
the same, in different places,
in that array. I repeated for the second 'match'
and again found a pair of identical sequences in the original. I did
not bother checking the third result and scrapped the lot.
I downloaded the Sunny-beach RRnd but haven't
got anywhere with it. The help file doesn't come up
and it doesn't like my sound-card. Knowing what
(regular rather than random) hash appears on
radio reception close to a computer, I would
have thought any analogue noise derived from a sound
card would be heavily contaminated with all
the repetitive digital noise.
171,122 subset 3,4,1,6.......... of a 10m run
results
10 loci - 0 matches
9 loci - 5 matches
8 loci - 91 matches
7 loci - 1079 matches
6 loci - 22,113 matches
The 5x 9 loci results were all 3,4,1,6,5,6......
For anyone wishing to replicate these processes
I've put the macros and some background on
http://www.nutteing2.freeservers.com/dnas.htm
Over the next week I will write up the rest of
this simulation experiment and add to that file
(and mirror sites).
Next run will probably be subset 3,5......
which should be about 946,000 processing
3,5,1,6....... first and then the remainder.
I am trying to think my way around the co-ancestry conundrum.
Should it help anyone else, I did some processing
on the final sorted arrays for 15 alleles and 10 alleles.
In the first instance, assuming a match on the 10
digits 1,2,...,10 of 20 is for this purpose much
the same as on digits 1,3,5,...,19 of 20, and for the moment ignoring
the perm 1 from 2.
For the rarer 2,6...... profiles (150,105 out of 10m):
15-allele matches - 9
10-allele matches - 16,939, and I would guess about 1 in 50 were quadruples,
ie repeated pairs.
For the common 3,4,1,6.... profiles (171,122 out of 10m):
15 alleles - 271 matches
10 alleles - 71,876 matches, including I would guess 1 in 10 quadruples
What is the probability of a related person (parent-wise), so 10
alleles in common already, also having by chance a match on the other 10?
What is the probability of a related person (grandparent-wise), so 5
alleles in common already, on average, also having by chance a match on the other 15?
This week I've started reading the Spencer Wells book
The Journey of Man : A Genetic Odyssey,
the Y-chromosome derivation of human migration
since an African 'Adam', like the mitochondrial 'Eve'.
A quote from it, relevant here (coincidentally, Kidd's paper on
the Amerindian study I should have received this week from the British Library):
" The geneticist Kenneth Kidd, of Yale University , has pointed
out that if we double the number of ancestors in each generation
(around 25 years) ,when we go back in time about 500 years
each of us must have had over a million living ancestors.
If we go back a thousand years ,our calculation tells us that
we must have had one trillion (1,000,000,000,000 ) ancestors -
far more than the total number of people that have existed in
the whole of human history. ...................
....... The error in our ancestor tally is not from a malfunctioning
calculator,but from the assumption that each of the people
in our genealogy is completely unrelated to the others"
Good news for the anti-FSS brigade.
Found another 10-loci match in a different area.
I thought THO1 had the maximum possibility
of pairs of alleles, with maximum frequencies of .241 and .304,
but it is actually loci 8 and 9 in the standard FSS order:
D19 at .382 and .222 and
D16 at .289 and .288
So I rejigged things generating 10 million profiles
but only saving to disk those directed to become
..............3,4,1,3..
Giving 283,201 'profiles'
Then divided for THO1 (1,6)
so 41,551 profiles of form
..1,6..........3,4,1,3..
Then divided, sorted, reconcatenated and match-checked,
giving one match of
4,5,1,6,6,7,3,7,2,3,3,3,1,8,3,4,1,3,3,4
converted back as
17,18 ; 6,9.3 ; 14,15 ; 21,24 ; 29,30 ; 14,14 ; 17,24 ; 11,12 ; 13,14 ;15,16
This match started as
54,61,67,37,32,33,18,43,31,43 and
54,61,67,73,32,33,18,34,13,34
so no obvious problem with the RNG
Cutting back on the final array for lower matches,
perhaps not too relevant as the 3,4,1,3 columns are all the same:
9 loci - 4 matches
7 and 8 loci - same as 9
6 loci - 162 matches
9 loci result
4,5,1,6,4,4,2,7,1,3,3,3,3,7,3,4,1,3
3,5,1,6,5,6,2,7,1,2,1,3,1,8,3,4,1,3
4,5,1,6,5,6,6,7,4,8,1,1,1,2,3,4,1,3
4,5,1,6,6,7,3,7,2,3,3,3,1,8,3,4,1,3
By reconfiguring the columns and resorting:
10 loci - 1 match, as before
9 loci - 4 matches, as before in effect
8 loci - 162 matches
7 loci - 3,682 matches
6 loci - 15,172 matches
I will probably process the next biggest batch of
the generated 283,201 profiles before ditching them,
ie ..2,6..............3,4,1,3..
It really requires someone with a bigger number-crunching
computer to structure the multiple sort processes into
one macro, or a different process altogether, and crunch all 10 million
in one go.
I have now found the first 10 loci match
in an area where I was not expecting one.
Then processed the remaining ..2,*...........3,4,1,3..
For 72,578 profiles
10 loci - 0 matches
9 loci - 3
8 loci - 117
7 loci - 3,855
6 loci - 21,646 matches
Then processed the remaining ..1*..............3,4,1,3..
but * not = 6 for 78,434 profiles
10 loci - 0
9 loci 4
8 loci - 154
7 loci - 4,028
6 loci - 23,401
Then remaining ..a*..............3,4,1,3..
a not= 1 or 2 for 90,634 profiles
10 loci 1 match
9 loci 5 matches
8 loci - 159
7 loci - 4,327
6 loci - 25,318
The match was for
4,5,3,6,5,6,2,6,1,3,4,6,1,8,3,4,1,3,3,4
converted back to
17,18 ; 8,9.3 ; 13,14 ; 20,23 ; 28,30 ; 15,17 ; 17,24 ; 11,12 ; 13,14 ; 15,16
generated originally as
5,4,3,6,5,6,2,6,3,1,4,6,8,1,4,3,3,1,4,3 and
4,5,6,3,6,5,6,2,1,3,6,4,8,1,4,3,3,1,4,3
so good RNG
This has (4,5), one of the main modal groups, so
to properly test for matches outside the multi-modal areas I will probably
derive 300,000 profiles selected to be of form
(other than 4 or 5)(other than 1 or 6) .............. (other than 3 or 4)(other than 1 or 3) ..
and process through.
At the moment, for all matches found:
2 matches in an expected batch of 174,017
1 match in an expected batch of 41,551
1 match in the unexpected 90,634 plus 72,578 plus 78,434
So for the moment the best guess for 10-loci matches
in 10 million totally unrelated profiles is >4 and less than 40.
I tried 300,000 profiles selected to be of form
(not a 5)(not a 6) .............. (not a 3 )(not a 3) ..
and processed through.
I hadn't realised this gives only 6.6% of profiles.
Of 300,000 such profiles, no 10-loci matches
or 9-loci matches.
8 loci - 2
7 loci - 40
6 loci - 1387
Next I will probably try something like the opposite
in a 2 million run.
2 million run, with 137,190 processed profiles containing
at least one of the four most common alleles on each of loci 0,1,....7,8:
10 loci matches - 0
9 - 0
8 - 0
7 - 22
6 - 987
I was playing around with kinship (coin-tossing) statistics,
simulating with the RNG, so approximate only, as I used
variously 100,000 and 10,000 x 10 and 20 'tosses'.
As far as I can see, there is a 50% chance per allele.
For two people with the same mother and father,
the chances of inheriting the same N alleles, unspecified
ie in no particular order, are:
N - probability (%)
20 low
19 .003%
18 .02
17 .12
16 .5
15 1.5
14 3.6
13 7.4
12 12.0
11 16.3
10 17.3
9 16.3
8 12.0
7 7.4
6 3.6
5 1.5
4 .5
3 .12
2 .02
1 .003
0 low
For one parent concerning inheritance of
matching 1 allele in each pair of 10 loci
N %
10 .11
9 1.0
8 4.2
7 11.6
6 20.6
5 24.6
4 20.6
3 11.6
2 4.2
1 1.0
0 .11
For a common grandparent ,one allele on each of 10 loci,
25% chance for each allele
N %
10 low
9 .001
8 .04
7 .36
6 1.64
5 5.43
4 14.3
3 25.4
2 28.4
1 18.8
0 5.7
For common great-grandparent ,12.5% chance each,
N %
7 .004
6 .06
5 .4
4 2.3
3 9.3
2 24.1
1 37.8
0 25.6
For common gg-grandparent ,6.25%
N %
7 .001
6 .001
5 .015
4 .13
3 2.0
2 10.4
1 35.2
0 52.3
For common ggg-grandparent, 3.125%
N %
5 .01
4 .28
3 2.0
2 10.5
1 34.2
0 53.0
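These simulated tables can be checked against the exact binomial figures, since under the 50%/25%/12.5%-per-allele model the count of shared alleles is binomial. A Python check (exact values, so slightly different from my tossing runs, eg 17.6% rather than 17.3% for siblings at N = 10):

```python
from math import comb

# P(exactly N of n alleles shared) when each allele is shared
# independently with probability p: the binomial probability mass.
def p_shared(n, N, p):
    return comb(n, N) * p ** N * (1 - p) ** (n - N)

sib_10 = p_shared(20, 10, 0.5)      # full siblings, 10 of 20 alleles
parent_5 = p_shared(10, 5, 0.5)     # one parent, 5 of 10 loci
gp_2 = p_shared(10, 2, 0.25)        # common grandparent, 2 of 10
```

These agree with the tabulated values to within the simulation noise of the coin-tossing runs.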
Is there a mistake here in the residual 30-odd
percent chance of inheriting one allele over 5 generations?
Does anyone know the figures for the number of people alive today,
legitimate (real and assumed) and illegitimate,
having the same 2 parents, one parent, one grandparent,
one great-grandparent etc?
How to meld this sort of data with multi-allele
matching probability in the NDNAD?
Is there a numerical/simulation way to determine
how much co-ancestry will increase the number of matches
within a database?
I've now joined the redirection macro and the
first divider macro to the generator macro,
and added a save of the original undirected
array of profiles as number strings to
halve the disk-space requirement.
This is probably the proper way to do all
the processing: repeated application
of the dividing routine on successive columns
until there is nothing left to divide. Doing this
automatically would be alright were it not for
the very variable divided file sizes/counts, from a few to tens of thousands,
within the same dividing.
I suspected I was wasting my time but I decided to
do a million run and save everything to disk for
later processing.
So far just processed profiles of form 4...................
numbering about a quarter of a million (250,942)
Results
10 loci matches - 0
9 - 0
8 - 2 matches both starting 45,
7 - 60
6 loci - 2465 matches
That jump from 1 million to 10 million makes all
the difference.
Now I've started, I will have to carry on to
the remaining profiles before trying a 2 or 3 million run, disk
space permitting.
For the next large run I will probably change the
order in the generator array from the clumpy vWA,THO1,...
to D2,D18,FGA,... to partially even out some
of this clumpiness.
Perhaps a starting simulation could be:
5 males and 5 females of totally random,
unconnected, but otherwise generic UK Caucasian profiles.
Generate 4 or 5 'children' for each pairing, who in turn are only
allowed to mate with random-profile outsiders.
Add in a bit of second-cousin/cousin/incest
matings/pairings and repeat
for perhaps 5 or 10 generations and see what
emerges. Then repeat with outsiders constrained
to come only from, say, 5 similarly generated 'communities', etc.
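The inheritance rule for that simulation is plain Mendelian sampling: a child takes one allele at random from each parent's pair at every locus. A minimal Python sketch (the allele values below are illustrative placeholders, not the real frequency-weighted generator):

```python
import random

# Build a child's profile: at each locus, one allele chosen at random
# from the mother's pair and one from the father's pair.
def child_profile(mother, father, rng=random):
    return [(rng.choice(m), rng.choice(f)) for m, f in zip(mother, father)]

mother = [(17, 18), (6, 9.3)]       # two example loci, eg vWA and THO1
father = [(15, 16), (8, 9.3)]
kid = child_profile(mother, father)
```

Run over several generations with the mating constraints described above, this would let allele sharing through co-ancestry accumulate and be measured.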
I worked out how to do VB random-access files, Get
and Put, and made a macro to detect matches
in datafiles in string form. But it would take
forever and a day for the macro to process through like that.
It looks as though it will have to be a quick-sort
macro, or Word/Sort for the subdivided files, then my match macro
after re-uniting the sub-files.
Now, using the data stored as strings not only reduces
the file size, but the standard Word/Sort (un-highlighted
columns or text, default Text type of sort) now works.
I thought the smaller files would increase the handling
size of Word/Sort from 15,000, but it's still the same limit.
I may make a macro within Word that inputs in
turn each of the subdivided (<15,000 profile) files, sorts each file,
saves each file, then some sort of macro to copy and
paste all these sorted subfiles into one file to match-check.
I've accessed a number of VB sort code procedures but
will try the repeated Word/Sort macro first, as I suspect going
down that route will be quicker in actual processing time.
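Whatever sort is used, the match check after re-uniting the sub-files reduces to one linear scan, since identical profile strings end up adjacent. A Python sketch of that scan with toy data (not my macro); it also picks out triples and quadruples as longer runs:

```python
# Scan a sorted list of profile strings and report every run of
# identical profiles as (profile, run length):
# 2 = pair, 3 = triple, 4 = quadruple.
def runs_of_matches(sorted_profiles):
    runs, i = [], 0
    while i < len(sorted_profiles):
        j = i
        while (j + 1 < len(sorted_profiles)
               and sorted_profiles[j + 1] == sorted_profiles[i]):
            j += 1
        if j > i:
            runs.append((sorted_profiles[i], j - i + 1))
        i = j + 1
    return runs

runs = runs_of_matches(sorted(["b", "a", "b", "c", "a", "a"]))
```

The scan is linear in the number of profiles, so the sort dominates the total processing time.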
I thought I was wasting my time processing the remaining
3/4 million profiles, but no. I was going to leap to 3 million
now I have changed to 'string' data blocks and handling.
Now I know what I'm doing, have all the macros to hand and
know how the profiles sub-divide into various amounts.
If I repeated a whole 1 million run again, I reckon
it would take only about 3 hours in total to generate, run through
dividing, batch sorting, batch file merging and final match
checking. So, given a very near miss (below) for a 10-loci/
20-digit match in 1 million, the next run will be for 2 million.
File size for 1 million profiles as strings 22.8 MB.
Results for 1 million profiles all saved and processed through
1.............. (198,191 profiles )
6 loci/12 - 938 matches
7 - 29
8 - 1
2............... (135,851 )
6 - 546
7 - 19
8 - 3
9 - 1
3............ (305,269 )
6 - 2972
7 - 71
8 - 5
9 - 0
4................... (previously reported) - 250,942
6 loci - 2465 matches
7 - 60
8 - 2 matches
9 - 0
5................ (95,969 )
6 - 474
7 - 13
8 - 2
9 - 0
Remainder 0...,6....,7......,8...... (13,778 profiles)
6 - 11 matches
7 - 1
8 - 0
So match totals in 1 million profiles
6 loci - 7,406
7 loci - 193
8 loci - 13
9 loci - 1
Can now also easily check for triples
so far only emerged on 6 loci matches
( reconfigured macro for quadruples also)
1..... -18 triples (1 quadruple)
2........ - 7 triples (0 quad)
3....... - 84 triples (3 quadruple)
4........ - data no longer retained
5... - 11 triples ( 0 quad)
remainder - 0
So >=120 triples and >=4 quadruples on 6 loci
Needle in a Haystack
The near miss on 2........ profiles was actually
also a match for 19 digits
The 2 profiles were
"24162378233401331125" and
"24162378233401331122"
Conversion to standard notation
(15,17)(6,9.3)(10,11)(24,25)(29,30)(14,15)(16,17)(11,11)(13,13)(14,17)
(15,17)(6,9.3)(10,11)(24,25)(29,30)(14,15)(16,17)(11,11)(13,13)(14,14)
again most, but not all, are common alleles
vWA / 15 - allele frequency .08
D8 /10,11 - af .094 ,.066
FGA/ 25 - .075
and D2 /16 is only af 0.037
These started life as
"42613287323401331152" and
"24162378323410331122"
so nothing suspect about the Rand function.
Anyone care to lay bets on a match/ matches being contained
within 2 million profiles ?
For anyone not aware of all the previous research: this
simulation is for the artificial situation where all profiles
are generated absolutely randomly within the constraint
of the distributions found in UK Caucasians. It does not
assume any co-ancestry, ie all profiles are totally independent
of one another, with no common ancestors bequeathing any
allele/alleles down the generations. That is the next research/simulation.
The final reckoning
A single 10 loci match on 2 million profiles
Breakdown of results in standard loci order
for 1.............. (398,036 profiles )
6 loci/12 - 3,644 matches,105 triples,5 quad
7 - 91, 0 triples
8 - 5
9 - 0
for 2............... (273,611 )
6 - 2,118 , 48 triples, 1 quad
7 - 69, 0 triples
8 - 4
9 - 0
for 3............ (609,940 )
6 - 9,950, 597 triples , 52 quadruples, 7 quintuples
7 - 255, 0 triples
8 - 28
9 - 2
10 LOCI - 1 match
for 4................... (499,104 )
6 loci - 9,865, 540 triples , 49 quad , 7 quin
7 - 268, 0 triples
8 - 28
9 - 0
for 5................ (191,390 )
6 - 1,564, 40 triples , 3 quad
7 - 28 , 0 triples
8 - 2
9 - 0
Remainder 0...,6....,7......,8...... (27,921 profiles)
6 - 27 matches , 1 triple
7 - 1
8 - 0
So match totals in 2 million profiles
6 loci - 27,168
7 loci - 712
8 loci - 67
9 loci - 2
10 loci - 1
for 6 loci
1231 triples
110 quadruples
14 quintuples
The 3... subset's 9- and 10-loci numbers look suspicious, but that is just the
way things have panned out, including somewhat similarly before.
If I wanted to fiddle these results,
the first thing I would do is make the 9-loci match number
larger. Hopefully anyone repeating this experiment will
find similar numbers. For anyone doing so, I will
add the count breakdown of the sub-divisions to the dnas.htm file
tomorrow. You need a plan to work to because of the serious
disparity of numbers in the sub-divisions.
*****************************************
THE 10 LOCI MATCH in 2 MILLION is
"34,66,56,24,33,13,17,45,13,45"
when converted back , in standard form
(16,17)(9.3,9.3)(13,14)(20,22)(30,30)(12,14)(17,23)(12,13)(13,14)(16,17)
all in the more common allele frequencies.
*****************************************
The lowest being D2 / 23 of 11.2 % allele frequency
This match started life as
"34,66,65,24,33,13,71,54,13,45" and
"34,66,65,42,33,13,71,45,31,54"
so nothing suspect about the Rand function.
Previous results suggested the number of 10-loci matches in 10
million to be between 4 and 40. Assuming the square law, then
5x 2 million leads to 5^2 = 25 approx matches in 10 million profiles
and an implied 625 in 50 million. More repeats of this experiment, or
perhaps even 3m or 4m runs, will show whether 1 in 2 million is
average, below or above average. My hunch from the near miss
in 1m is that it is below average, ie implying between 25 and 40 matches
in 10 million.
I now have population data for the UK from 1700 to the present day to work on
for the next simulation. No data yet for interbreeding factors: father/daughter,
brother/sister, uncle/niece, first-cousin marriages, second-cousin marriages etc.
I will place the modified macros, other 'tools' and results on the ftp'd dnas.htm file Sunday,
and notify the forensic science lot Sunday or Monday.
Is all the above and preceding a first?
I've not come across even a hint of anyone publishing
this sort of simulation.
Up to Sept 28, 2003 - f207