A purely mathematical database consisting of
2 million unrelated 'DNA profiles' will, on average, contain one match.
The generation is totally random: it would be possible for
one 2 million run to produce no matches, and equally possible for another
to produce 2 matches in 2 million.
The UK NDNAD contains 2 million profiles with this one match
plus many more, due to the inescapable fact that most people
in the UK have ancestors in common, so more chance of shared
alleles and a consequent match.
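The 'one match on average' figure is an expected-collision count: with n profiles there are n(n-1)/2 unordered pairs, each matching with some small probability p. A Python sketch of that arithmetic (the per-pair probability used here is only an illustrative placeholder, not a figure taken from this file):

```python
from math import comb

def expected_matches(n_profiles, p_pair):
    """Expected number of matching pairs among n_profiles independent
    profiles when each unordered pair matches with probability p_pair."""
    return comb(n_profiles, 2) * p_pair

# Hypothetical per-pair match probability, chosen only to illustrate
# how roughly 1 expected match can arise from 2 million random profiles.
print(expected_matches(2_000_000, 5e-13))  # roughly 1
```

A particular run that happens to produce 0 or 2 matches is entirely consistent with an expectation of 1.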
Despite official government sites linking to these files, there are
still corrupt persons knocking out my sites, so for the
purposes of search engines cross-linking them, the files no longer
available on the original web hosting sites were on
http://www.nutteing.50megs.com/dnas.htm , http://www.nutteing.freeisp.co.uk/dnas.htm , http://www.nutteing.batcave.net/dnas.htm , http://home.graffiti.net/nutteing/dnas.htm ,
http://nutteing.no-frills.net/dnas.htm and http://nutteing3.no-frills.net/dnas.htm (the last 2 now down due to host failure).
Details of that match are at the end of this file.
If you found this file in an archive then use keyword "nutteingd" in a
search engine to find an updated version or related pages.
Updated file August 2006
Please contact me if you notice any error that would lead to an error
in the results.
( 'allele' 4 on 'locus' D19 was previously given, slightly erroneously, as 0.715;
now corrected to 0.719, so generated 'profiles' will be slightly
different at D19 / 4 from the results displayed in this file )
I am not a programmer, so don't bother communicating
about my lack of structure etc. I know the flags are poorly chosen,
the external random-number calls should be function calls, and so on.
Before going into the Visual Basic Editor, go into ordinary
Word and open anything in the directory you want
the VB files to go into, as no directory is designated in the
following code.
Using Notepad (plain text handling, no line wrap),
copy and paste
from this file as displayed in a browser, or from its source / text file, into
a Visual Basic / macro handler between Sub and End Sub,
reset, and Run. I am not familiar with VB and so get tied up in
knots concerning procedures, modules, functions etc.
My choice of file names, datewise (sept25- etc.), is for
ease of deleting because of disk space constraints.
If using straight VB6, then designate the directory
for files by "replace all" occurrences of sept25 to
c:\vb\sept25 or whatever; also add a sound progress
indicator before the [ Next x ] line
If x / 1000 = Int(x / 1000) Then Beep
before highlighting and copying.
In VB6, open a New Project.
In Form1, add a Command1 button.
Double-click this button to open the command
code window, and copy and paste the 'DNA' VB code
between the Private Sub Command1_Click()
and the End Sub.
Then Run/Start.
Press Command1.
Wait until the beeps / clicks cease.
I had to ditch 3 random number generators, as
they were producing repeats too often, considering that
at times I was making 200 million calls to the RNG
for 10 million profiles.
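For reference, the generator embedded in the code below is a linear congruential generator with multiplier 214013 and increment 2531011, working modulo 2^24. One step of the same recurrence can be sketched in Python (illustrative only; the VB code carries the state in x0/x1):

```python
A, C, Z = 214013, 2531011, 2 ** 24  # constants from the VB code

def lcg(x0):
    """One step of the LCG: keep the fractional part of (x0*A + C)/Z,
    rescaled back to 0..Z - equivalent to (x0*A + C) mod Z for integer x0."""
    temp = (x0 * A + C) / Z
    return (temp - int(temp)) * Z  # Fix() in VB truncates like int() here

x = 12345.0
for _ in range(3):
    x = lcg(x)
    print(x / Z)  # the uniform value in [0, 1) fed to the threshold tests
```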
The results and background are after the VB code.
The first task is to generate a file simulating '10 loci',
that is, an array of 10 pairs of numbers. These number pairs
are constrained to represent the published allele frequencies of the
UK Caucasian population. The SGM Plus system in the UK NDNAD (Caucasian)
averages 11.3 alleles per locus, with 2 alleles at each of the 10 chosen loci.
But, as derived from biochemistry, the inheritance of these alleles
is not equally likely. If occurrence really were equal across the 11.3
possibilities at each of the 2 x 10 positions,
then false matches would be very much rarer than in real life.
To simplify, I have standardised on a choice of 10 values (0 to 9), with the
rarer alleles lumped together in the '0' subset.
For purists it is an easy matter to extend 0 to 9 to
include "A", "B" etc., as now string data, for complete
modelling of all alleles on loci FGA, D21, D18, D2 and D19.
------------
In the generator section at the start of each j loop ,have
pb(j) = "Z"
then amend generator characteristics,
If ph(j) < 0.337 Then pb(j) = "A"
If ph(j) < 0.437 Then ph(j) = 2
If ph(j) < 0.444 Then pb(j) = "B"
etc instead of just 0 to 9
then before end of each j loop,have
If pb(j) <> "Z" Then ph(j) = pb(j)
--------------
I would suggest using the letters for only the
rare alleles rather than going 0 to 9, A, B, C, D etc.
The first 3 loci (6 digits) will not contain
alphanumerics, but the 7th or later would, so beware
if subdividing on the 7th digit or more.
In principle I tried adapting it, and it processes
through to the final match checking, but I've
not done a full run fully enlarged.
The final macro for converting back to
standard notation would need altering, or at least
the A, B, Cs etc. manually converting back to alleles.
One general result along the way was that rare alleles become
very much rarer, proportionally, in any matches.
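The cascaded If statements in the generator below implement sampling from a cumulative frequency table: a uniform value is compared against ascending thresholds, and the band it falls in decides the allele. A Python sketch of the same step, using the vWA thresholds from the code (the rare-allele band maps to 0, just as the 11-then-0 trick does in the VB):

```python
import random

# Cumulative thresholds for vWA, copied from the VB code below;
# the first band (u < 0.001) is the lumped rare-allele '0' subset.
VWA = [(0.001, 0), (0.106, 1), (0.186, 2), (0.402, 3), (0.672, 4),
       (0.891, 5), (0.984, 6), (0.998, 7), (1.0, 8)]

def draw_allele(table, u=None):
    """Return the allele of the first band whose threshold exceeds u."""
    if u is None:
        u = random.random()
    for threshold, allele in table:
        if u < threshold:
            return allele
    return 0  # unreachable for u < 1.0

print(draw_allele(VWA, 0.5))  # 0.402 <= 0.5 < 0.672, so allele 4
```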
Because of the
large numbers involved, and my PC being of 1997 vintage, there
is a lot of saving to disk, and only partial sets are processed
rather than trying to process the full 10 million 'profiles'; my
sensible limit is about 2 million processed in their entirety.
Others with more powerful computers should be able
to tackle the full 10 million.
If the long conditional statements break in this HTML file
then you will have to re-concatenate them before use.
The order of each locus is the most commonly portrayed
order for the UK NDNAD profiles. My PC is a 200 MHz, 64 MB
machine with about 200 MB of hard disk space free, so the
requirements are not daunting. To generate and process all 2 million profiles, expect
about 5 hours to complete, and that is once you are familiar with the routines.
For faster PCs reduce this time, as most of it relates to the sort routines.
Putting a conditional If / End If statement in the generator file, where the output
write is, to restrict output to profiles in areas where matches are known to have occurred,
will reduce process time. Anyway, I suggest starting by generating
only 20,000 profiles, then 200,000, and eventually 2 million, to get the hang
of things.
The macro has been modified for data input and output
as strings, rather than the earlier version's numeric data.
The Visual Basic / macro code for the separate macros is between horizontal rules.
FGA, vWA etc. are the 10 loci, and the associated generating
tables are from the allele frequency tables in the forensic
science literature cited in file dnapr.htm .
' Generating 10 loci x2 profiles
' directing pairs and first divider
Dim ph(20)
' initialising Random Number Generator - RNG
count9 = 0
count8 = 0
Randomize
a = 214013
c = 2531011
x0 = Timer
z = 2 ^ 24
' 1 file 'sept25g' for original, un-directed pairs, source data.
' This file is necessary to check on the performance of the RNG
' when a matched pair is found, it is highly unlikely that
' both sequences as generated, before pair directing, would
' be the same - more likely a manifestation of a repeat within the RNG
' (the reason for adopting the 214013 / 2531011 RNG )
' Use 'Word' find function on part of the sequences, including pair reversals,
' with luck would include a 'homozygotic' pair eg (3,3) say ,so no reversal
' on that pair
Open "sept25g" For Output As #1
' outputs directed and divided by first digit
Open "sept25-0" For Output As #10
Open "sept25-1" For Output As #11
Open "sept25-2" For Output As #12
Open "sept25-3" For Output As #13
Open "sept25-4" For Output As #14
Open "sept25-5" For Output As #15
Open "sept25-6" For Output As #16
Open "sept25-7" For Output As #17
Open "sept25-8" For Output As #18
Open "sept25-9" For Output As #19
' change for different total size eg 199999 for 200,000
For x = 0 To 1999999
For j = 0 To 1
' vWA ,first locus
' RNG random number generator
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.001 Then ph(j) = 11
If ph(j) < 0.106 Then ph(j) = 1
If ph(j) < 0.186 Then ph(j) = 2
If ph(j) < 0.402 Then ph(j) = 3
If ph(j) < 0.672 Then ph(j) = 4
If ph(j) < 0.891 Then ph(j) = 5
If ph(j) < 0.984 Then ph(j) = 6
If ph(j) < 0.998 Then ph(j) = 7
If ph(j) < 1 Then ph(j) = 8
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 2 To 3
' THO1
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.002 Then ph(j) = 11
If ph(j) < 0.243 Then ph(j) = 1
If ph(j) < 0.437 Then ph(j) = 2
If ph(j) < 0.545 Then ph(j) = 3
If ph(j) < 0.546 Then ph(j) = 4
If ph(j) < 0.686 Then ph(j) = 5
If ph(j) < 0.99 Then ph(j) = 6
If ph(j) < 1 Then ph(j) = 7
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 4 To 5
' D8
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.018 Then ph(j) = 11
If ph(j) < 0.031 Then ph(j) = 1
If ph(j) < 0.125 Then ph(j) = 2
If ph(j) < 0.191 Then ph(j) = 3
If ph(j) < 0.334 Then ph(j) = 4
If ph(j) < 0.667 Then ph(j) = 5
If ph(j) < 0.876 Then ph(j) = 6
If ph(j) < 0.964 Then ph(j) = 7
If ph(j) < 0.995 Then ph(j) = 8
If ph(j) < 1 Then ph(j) = 9
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 6 To 7
' FGA
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.025 Then ph(j) = 11
If ph(j) < 0.081 Then ph(j) = 1
If ph(j) < 0.224 Then ph(j) = 2
If ph(j) < 0.411 Then ph(j) = 3
If ph(j) < 0.576 Then ph(j) = 4
If ph(j) < 0.587 Then ph(j) = 5
If ph(j) < 0.726 Then ph(j) = 6
If ph(j) < 0.872 Then ph(j) = 7
If ph(j) < 0.947 Then ph(j) = 8
If ph(j) < 0.982 Then ph(j) = 9
If ph(j) < 1 Then ph(j) = 0
' 1.8% not generated
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 8 To 9
' D21
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.031 Then ph(j) = 11
If ph(j) < 0.191 Then ph(j) = 1
If ph(j) < 0.417 Then ph(j) = 2
If ph(j) < 0.675 Then ph(j) = 3
If ph(j) < 0.702 Then ph(j) = 4
If ph(j) < 0.771 Then ph(j) = 5
If ph(j) < 0.864 Then ph(j) = 6
If ph(j) < 0.882 Then ph(j) = 7
If ph(j) < 0.972 Then ph(j) = 8
If ph(j) < 0.994 Then ph(j) = 9
If ph(j) < 1 Then ph(j) = 0
' 0.5% not generated
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 10 To 11
' D18
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.012 Then ph(j) = 11
If ph(j) < 0.151 Then ph(j) = 1
If ph(j) < 0.276 Then ph(j) = 2
If ph(j) < 0.44 Then ph(j) = 3
If ph(j) < 0.585 Then ph(j) = 4
If ph(j) < 0.722 Then ph(j) = 5
If ph(j) < 0.837 Then ph(j) = 6
If ph(j) < 0.917 Then ph(j) = 7
If ph(j) < 0.958 Then ph(j) = 8
If ph(j) < 0.975 Then ph(j) = 9
If ph(j) < 1 Then ph(j) = 0
' 2.5% not generated
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 12 To 13
' D2S1338
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.037 Then ph(j) = 11
If ph(j) < 0.222 Then ph(j) = 1
If ph(j) < 0.309 Then ph(j) = 2
If ph(j) < 0.419 Then ph(j) = 3
If ph(j) < 0.557 Then ph(j) = 4
If ph(j) < 0.589 Then ph(j) = 5
If ph(j) < 0.613 Then ph(j) = 6
If ph(j) < 0.725 Then ph(j) = 7
If ph(j) < 0.867 Then ph(j) = 8
If ph(j) < 0.978 Then ph(j) = 9
If ph(j) < 1 Then ph(j) = 0
' 2.2% not generated
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 14 To 15
' D16
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.019 Then ph(j) = 11
If ph(j) < 0.148 Then ph(j) = 1
If ph(j) < 0.202 Then ph(j) = 2
If ph(j) < 0.491 Then ph(j) = 3
If ph(j) < 0.779 Then ph(j) = 4
If ph(j) < 0.965 Then ph(j) = 5
If ph(j) < 0.994 Then ph(j) = 6
If ph(j) < 1 Then ph(j) = 7
If ph(j) > 10 Then ph(j) = 0
Next j
For j = 16 To 17
' D19
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.087 Then ph(j) = 11
If ph(j) < 0.309 Then ph(j) = 1
If ph(j) < 0.322 Then ph(j) = 2
If ph(j) < 0.704 Then ph(j) = 3
If ph(j) < 0.719 Then ph(j) = 4
If ph(j) < 0.896 Then ph(j) = 5
If ph(j) < 0.934 Then ph(j) = 6
If ph(j) < 0.975 Then ph(j) = 7
If ph(j) < 0.992 Then ph(j) = 8
If ph(j) < 0.997 Then ph(j) = 9
If ph(j) < 1 Then ph(j) = 0
If ph(j) > 10 Then ph(j) = 0
' 0.3% not generated
Next j
For j = 18 To 19
' D3
' RNG
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
phj = x1 / z
ph(j) = phj
If ph(j) < 0.001 Then ph(j) = 11
If ph(j) < 0.007 Then ph(j) = 1
If ph(j) < 0.139 Then ph(j) = 2
If ph(j) < 0.404 Then ph(j) = 3
If ph(j) < 0.651 Then ph(j) = 4
If ph(j) < 0.846 Then ph(j) = 5
If ph(j) < 0.987 Then ph(j) = 6
If ph(j) < 1 Then ph(j) = 7
If ph(j) > 10 Then ph(j) = 0
Next j
' output the original generated file
Write #1, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
' Because, in real DNA profiles, without further info no one
' knows which allele in each pair came from the mother or the father,
' by convention they are written smaller, larger (or equal).
' The following directs each pair accordingly
For j = 0 To 18 Step 2
If ph(j + 1) < ph(j) Then
jjj = ph(j)
ph(j) = ph(j + 1)
ph(j + 1) = jjj
End If
Next j
' put extra conditional statements here to reduce
' the number of files or just delete some of the following
'
' dividing on first column, file by file
If ph(0) = 0 Then
Write #10 , ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count0 = count0 + 1
End If
If ph(0) = 1 Then
Write #11, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count1 = count1 + 1
End If
If ph(0) = 2 Then
Write #12, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count2 = count2 + 1
End If
If ph(0) = 3 Then
Write #13, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count3 = count3 + 1
End If
If ph(0) = 4 Then
Write #14, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count4 = count4 + 1
End If
If ph(0) = 5 Then
Write #15, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count5 = count5 + 1
End If
If ph(0) = 6 Then
Write #16, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count6 = count6 + 1
End If
If ph(0) = 7 Then
Write #17, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count7 = count7 + 1
End If
If ph(0) = 8 Then
Write #18, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count8 = count8 + 1
End If
If ph(0) = 9 Then
Write #19, ph(0) & ph(1) & ph(2) & ph(3)& ph(4)& ph(5)& ph(6)& ph(7)& ph(8)& ph(9)& ph(10)& ph(11)& ph(12)& ph(13)& ph(14)& ph(15)& ph(16)& ph(17)& ph(18)& ph(19)
count9 = count9 + 1
End If
Next x
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
Close #1
' count file with the data to fix the For - Next loops in successive divisions
Open "sept25-c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
To reduce the file sizes so they can be sorted, it is necessary
to subdivide by the various leading digits.
If a 5th or 6th column divider is required, make the appropriate changes.
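Each dividing macro below does the same job: read lines, look at one digit position, and route each line to one of ten files while counting. The logic can be sketched in Python (in-memory lists stand in for the ten output files; names and positions are illustrative):

```python
def divide_by_digit(lines, position):
    """Split profile strings into ten buckets keyed by the digit at the
    given 1-based position, returning the buckets and per-bucket counts."""
    buckets = {str(d): [] for d in range(10)}
    for line in lines:
        buckets[line[position - 1]].append(line)
    counts = {d: len(v) for d, v in buckets.items()}
    return buckets, counts

profiles = ["13435465", "10435465", "13235465"]
buckets, counts = divide_by_digit(profiles, 2)  # divide on second digit
print(counts["3"], counts["0"])  # 2 1
```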
' Dividing file into 10 by second digit
Dim ph(20)
dim ps as string
' xxxx = count size from count file
xxxx =
' input file
Open "sept25-1" For Input As #1
' 10 divided files
Open "sept25-10" For Output As #10
Open "sept25-11" For Output As #11
Open "sept25-12" For Output As #12
Open "sept25-13" For Output As #13
Open "sept25-14" For Output As #14
Open "sept25-15" For Output As #15
Open "sept25-16" For Output As #16
Open "sept25-17" For Output As #17
Open "sept25-18" For Output As #18
Open "sept25-19" For Output As #19
count9 = 0
count8 = 0
xxxx = xxxx - 1
For x = 0 To xxxx
Input #1, ps
a2$ = Mid(ps, 2, 1)
ph(1) = Val(a2$)
If ph(1) = 0 Then
Write #10, ps
count0 = count0 + 1
End If
If ph(1) = 1 Then
Write #11, ps
count1 = count1 + 1
End If
If ph(1) = 2 Then
Write #12, ps
count2 = count2 + 1
End If
If ph(1) = 3 Then
Write #13, ps
count3 = count3 + 1
End If
If ph(1) = 4 Then
Write #14, ps
count4 = count4 + 1
End If
If ph(1) = 5 Then
Write #15, ps
count5 = count5 + 1
End If
If ph(1) = 6 Then
Write #16, ps
count6 = count6 + 1
End If
If ph(1) = 7 Then
Write #17, ps
count7 = count7 + 1
End If
If ph(1) = 8 Then
Write #18, ps
count8 = count8 + 1
End If
If ph(1) = 9 Then
Write #19, ps
count9 = count9 + 1
End If
Next x
Close #1
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
' output counts
Open "sept25-1c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
' Dividing file into 10 by third digit
Dim ph(20)
dim ps as string
' enter count in xxxx
xxxx =
Open "sept25-11" For Input As #1
Open "sept25-110" For Output As #10
Open "sept25-111" For Output As #11
Open "sept25-112" For Output As #12
Open "sept25-113" For Output As #13
Open "sept25-114" For Output As #14
Open "sept25-115" For Output As #15
Open "sept25-116" For Output As #16
Open "sept25-117" For Output As #17
Open "sept25-118" For Output As #18
Open "sept25-119" For Output As #19
count9 = 0
count8 = 0
xxxx=xxxx - 1
For x = 0 To xxxx
Input #1, ps
a3$ = Mid(ps, 3, 1)
ph(2) = Val(a3$)
If ph(2) = 0 Then
Write #10, ps
count0 = count0 + 1
End If
If ph(2) = 1 Then
Write #11, ps
count1 = count1 + 1
End If
If ph(2) = 2 Then
Write #12, ps
count2 = count2 + 1
End If
If ph(2) = 3 Then
Write #13, ps
count3 = count3 + 1
End If
If ph(2) = 4 Then
Write #14, ps
count4 = count4 + 1
End If
If ph(2) = 5 Then
Write #15, ps
count5 = count5 + 1
End If
If ph(2) = 6 Then
Write #16, ps
count6 = count6 + 1
End If
If ph(2) = 7 Then
Write #17, ps
count7 = count7 + 1
End If
If ph(2) = 8 Then
Write #18, ps
count8 = count8 + 1
End If
If ph(2) = 9 Then
Write #19, ps
count9 = count9 + 1
End If
Next x
Close #1
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
Open "sept25-11c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
' Dividing file into 10 by fourth digit
Dim ph(20)
dim ps as string
' enter count in xxxx
xxxx =
Open "sept25-131" For Input As #1
Open "sept25-1310" For Output As #10
Open "sept25-1311" For Output As #11
Open "sept25-1312" For Output As #12
Open "sept25-1313" For Output As #13
Open "sept25-1314" For Output As #14
Open "sept25-1315" For Output As #15
Open "sept25-1316" For Output As #16
Open "sept25-1317" For Output As #17
Open "sept25-1318" For Output As #18
Open "sept25-1319" For Output As #19
count9 = 0
count8 = 0
xxxx=xxxx - 1
For x = 0 To xxxx
Input #1, ps
a4$ = Mid(ps, 4, 1)
ph(3) = Val(a4$)
If ph(3) = 0 Then
Write #10, ps
count0 = count0 + 1
End If
If ph(3) = 1 Then
Write #11, ps
count1 = count1 + 1
End If
If ph(3) = 2 Then
Write #12, ps
count2 = count2 + 1
End If
If ph(3) = 3 Then
Write #13, ps
count3 = count3 + 1
End If
If ph(3) = 4 Then
Write #14, ps
count4 = count4 + 1
End If
If ph(3) = 5 Then
Write #15, ps
count5 = count5 + 1
End If
If ph(3) = 6 Then
Write #16, ps
count6 = count6 + 1
End If
If ph(3) = 7 Then
Write #17, ps
count7 = count7 + 1
End If
If ph(3) = 8 Then
Write #18, ps
count8 = count8 + 1
End If
If ph(3) = 9 Then
Write #19, ps
count9 = count9 + 1
End If
Next x
Close #1
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
Open "sept25-131c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
' Dividing file into 10 by fifth digit
Dim ph(20)
Dim ps As String
' enter count in xxxx
xxxx =
Open "dec14-3412" For Input As #1
Open "dec14-34120" For Output As #10
Open "dec14-34121" For Output As #11
Open "dec14-34122" For Output As #12
Open "dec14-34123" For Output As #13
Open "dec14-34124" For Output As #14
Open "dec14-34125" For Output As #15
Open "dec14-34126" For Output As #16
Open "dec14-34127" For Output As #17
Open "dec14-34128" For Output As #18
Open "dec14-34129" For Output As #19
count9 = 0
count8 = 0
xxxx = xxxx - 1
For x = 0 To xxxx
Input #1, ps
a5$ = Mid(ps, 5, 1)
ph(4) = Val(a5$)
If ph(4) = 0 Then
Write #10, ps
count0 = count0 + 1
End If
If ph(4) = 1 Then
Write #11, ps
count1 = count1 + 1
End If
If ph(4) = 2 Then
Write #12, ps
count2 = count2 + 1
End If
If ph(4) = 3 Then
Write #13, ps
count3 = count3 + 1
End If
If ph(4) = 4 Then
Write #14, ps
count4 = count4 + 1
End If
If ph(4) = 5 Then
Write #15, ps
count5 = count5 + 1
End If
If ph(4) = 6 Then
Write #16, ps
count6 = count6 + 1
End If
If ph(4) = 7 Then
Write #17, ps
count7 = count7 + 1
End If
If ph(4) = 8 Then
Write #18, ps
count8 = count8 + 1
End If
If ph(4) = 9 Then
Write #19, ps
count9 = count9 + 1
End If
Next x
Close #1
Close #10
Close #11
Close #12
Close #13
Close #14
Close #15
Close #16
Close #17
Close #18
Close #19
Open "dec14-3412c" For Output As #20
Write #20, 0, count0, 1, count1, 2, count2, 3, count3, 4, count4, 5, count5, 6, count6, 7, count7, 8, count8, 9, count9
Close #20
The next stage is sorting, using Word's Table / Sort.
Before using it, make a test batch of numbers, as there are various
Sort outcomes. Now that I'm using string data, a Text sort gave the
right form on my machine. Use Ctrl+Shift+Home (or End) to
highlight text up or down.
After the sort, and before saving to disk, press the up or down
arrow to select which way the text is returned to you.
My set-up was limited to no more than 15,000 lines. To sort,
say, 28,000: sort the upper half, then the lower half, then cut and
paste, say, the 0 to 2 section of the lower half into the top of the
top half. Re-sort the expanded 0 to 2 section, then
re-sort the remainder. If selecting, say, the 2 to 3 section, then
cut and paste at the junction of 2 and 3 in the other block
to save some repeated sorting. Other times it is quicker
to over-sort then backtrack / overlap on the next sort.
Many of the subdivision files are empty because
of the pair directing: they consist of e.g. 4,4.. 4,5.. etc.,
never 4,0.., 4,1.. etc., and a number of 8 and 9 sections
are absent, tracing back to the generator characteristics, e.g.
only the first 8 of 10 values are used. When you know all files are less than
15,000 lines, or whatever the Sort limit is, use the next macro (simply a recorded macro)
to sort 10 related files. An empty file will stop the macro, so edit
out empty files before running.
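The manual split / sort / merge procedure above is an external merge sort done by hand. The same operation can be sketched in Python (the chunk limit matches the 15,000-line Word constraint; the data is illustrative):

```python
import heapq

def chunked_sort(lines, limit=15000):
    """Sort more lines than the per-sort limit allows by sorting
    fixed-size chunks, then k-way merging the sorted chunks."""
    chunks = [sorted(lines[i:i + limit])
              for i in range(0, len(lines), limit)]
    return list(heapq.merge(*chunks))

data = ["31", "12", "20", "05", "12"]
print(chunked_sort(data, limit=2))  # ['05', '12', '12', '20', '31']
```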
'Sort 10 related files in one go
'
Documents.Open FileName:="sept25-130", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-131", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-132", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-133", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-134", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-135", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-136", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-137", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-138", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
'
Documents.Open FileName:="sept25-139", ConfirmConversions:=False, ReadOnly _
:=False, AddToRecentFiles:=False, PasswordDocument:="", PasswordTemplate _
:="", Revert:=False, WritePasswordDocument:="", WritePasswordTemplate:="" _
, Format:=wdOpenFormatAuto
Selection.Sort ExcludeHeader:=False, FieldNumber:="Paragraphs", _
SortFieldType:=wdSortFieldAlphanumeric, SortOrder:=wdSortOrderAscending, _
FieldNumber2:="", SortFieldType2:=wdSortFieldAlphanumeric, SortOrder2:= _
wdSortOrderAscending, FieldNumber3:="", SortFieldType3:= _
wdSortFieldAlphanumeric, SortOrder3:=wdSortOrderAscending, Separator:= _
wdSortSeparateByTabs, SortColumn:=False, CaseSensitive:=False, LanguageID _
:=wdLanguageNone
ActiveDocument.Save
' empty files will append spurious carriage returns at the
' head or tail of files, so check for this before the final match routine
' otherwise use Insert / File to merge files
' merge 10 related files back into one
' for convenience I named these re-concatenated
' files .txt so they were obvious in listings
' compared to the no-suffix ones
'
Documents.Add Template:="", NewTemplate:=False
Selection.InsertFile FileName:="sept25-130", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-131", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-132", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-133", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-134", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-135", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-136", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-137", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-138", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
Selection.InsertFile FileName:="sept25-139", Range:="", ConfirmConversions _
:=False, Link:=False, Attachment:=False
ActiveDocument.SaveAs FileName:="sept25-13.txt", FileFormat:=wdFormatText, _
LockComments:=False, Password:="", AddToRecentFiles:=True, WritePassword _
:="", ReadOnlyRecommended:=False, EmbedTrueTypeFonts:=False, _
SaveNativePictureFormat:=False, SaveFormsData:=False, SaveAsAOCELetter:= _
False
End Sub
Copy and paste all these subfiles together to
submit to the next section. The final match finding is done
initially for 12 digits, then changed to 14, 16, 18,
and finally 20 if 18 shows something. This routine,
after hours of dividing/sorting/re-merging, takes only seconds to complete.
' Find matching pairs in 12 digits
' xxxx is count = ????
xxxx =
b$ = "0"
Count = 0
Dim ps As String
Open "sept25-24.txt" For Input As #1
Open "sept25-24m12.txt" For Output As #2
' change the 12 in the #2 file name above and
' the Left function below to suit the number of matched digits
xxxx = xxxx - 1
For x = 0 To xxxx
Input #1, ps
a$ = Left(ps, 12)
If a$ = b$ Then
Write #2, ps
Count = Count + 1
End If
b$ = a$
Next x
Write #2, "Count ", Count
Close #1
Close #2
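For anyone repeating this outside Word, the adjacent-line scan above can be restated in a few lines of Python. This is a sketch of the same algorithm, not the author's macro; the sample profiles are hypothetical.

```python
def count_matching_pairs(sorted_profiles, width=12):
    """Collect entries in an already-sorted list whose first `width`
    characters equal the previous entry's -- the same adjacent-line
    scan the VBA routine above performs."""
    matches = []
    prev = None
    for ps in sorted_profiles:
        key = ps[:width]
        if key == prev:
            matches.append(ps)
        prev = key
    return matches

# Hypothetical 12-digit 'profiles', one duplicated:
data = sorted(["451655233313", "451655233313", "120456789012"])
print(len(count_matching_pairs(data)))  # -> 1
```

Because the list is sorted first, duplicates are always adjacent, which is why a single pass with one remembered key is enough.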
' Find matching triples in 12 digits
' xxxx is count from the count files
xxxx =
b$ = "0"
c$ = "0"
Count = 0
Dim ps As String
xxxx = xxxx - 1
Open "sept25-1.txt" For Input As #1
Open "sept25-1trip.txt" For Output As #2
' change the 12 in the Left function below
' to suit the number of matched digits
For x = 0 To xxxx
Input #1, ps
a$ = Left(ps, 12)
a2$ = ps
If a$ = c$ Then
Write #2, a2$, b2$, c2$
Count = Count + 1
End If
If a$ = b$ Then
c$ = b$
c2$ = b2$
End If
b$ = a$
b2$ = a2$
Next x
Write #2, "Count ", Count
Close #1
Close #2
' Find matching quadruples in 12 digits
' xxxx is from the count files
xxxx =
b$ = "0"
c$ = "0"
d$ = "0"
Count = 0
Dim ps As String
xxxx = xxxx - 1
Open "sept25-3.txt" For Input As #1
Open "sept25-3quad.txt" For Output As #2
' change the 12 in the Left function below
' to suit the number of matched digits
For x = 0 To xxxx
Input #1, ps
a$ = Left(ps, 12)
a2$ = ps
If a$ = d$ Then
Write #2, a2$, b2$, c2$, d2$
Count = Count + 1
End If
If a$ = c$ Then
d$ = c$
d2$ = c2$
End If
If a$ = b$ Then
c$ = b$
c2$ = b2$
End If
b$ = a$
b2$ = a2$
Next x
Write #2, "Count ", Count
Close #1
Close #2
' Find matching quintuples in 12 digits
' xxxx is from the count files
xxxx =
b$ = "0"
c$ = "0"
d$ = "0"
e$ = "0"
Count = 0
Dim ps As String
xxxx = xxxx - 1
Open "sept25-4.txt" For Input As #1
Open "sept25-4quin.txt" For Output As #2
' change the 12 in the Left function below
' to suit the number of matched digits
For x = 0 To xxxx
Input #1, ps
a$ = Left(ps, 12)
a2$ = ps
If a$ = e$ Then
Write #2, a2$, b2$, c2$, d2$, e2$
Count = Count + 1
End If
If a$ = d$ Then
e$ = d$
e2$ = d2$
End If
If a$ = c$ Then
d$ = c$
d2$ = c2$
End If
If a$ = b$ Then
c$ = b$
c2$ = b2$
End If
b$ = a$
b2$ = a2$
Next x
Write #2, "Count ", Count
Close #1
Close #2
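The cascading b$/c$/d$/e$ variables in the four routines above are all doing one job: finding runs of identical leading digits in a sorted list. A Python sketch (my restatement) that generalises pairs, triples, quadruples and quintuples into one function:

```python
from itertools import groupby

def runs_of_at_least(sorted_profiles, k, width=12):
    """Group an already-sorted list by its leading `width` digits and
    keep every group of k or more entries -- pairs (k=2), triples
    (k=3), quadruples, quintuples: one routine instead of four."""
    runs = []
    for _, grp in groupby(sorted_profiles, key=lambda ps: ps[:width]):
        grp = list(grp)
        if len(grp) >= k:
            runs.append(grp)
    return runs

data = sorted(["111122223333"] * 3 + ["444455556666"])
print(len(runs_of_at_least(data, 3)))  # -> 1, the tripled profile
```

Note that `groupby` only merges adjacent equal keys, so, exactly as with the VBA routines, the input must be sorted first.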
' converting integer values back to DNA loci/alleles
xxxx =
' xxxx is the number of profiles to be converted
Dim ph(20)
Dim pj(20)
Dim ps As String
Open "sept25-m12.txt" For Input As #1
Open "sept25-mr12.txt" For Output As #2
For x = 1 To xxxx
Input #1, ps
' read the 20 digits of the profile into ph() in one loop
For j = 0 To 19
ph(j) = Val(Mid(ps, j + 1, 1))
Next j
For j = 0 To 1
' vWA
If ph(j) = 0 Then pj(j) = 13
If ph(j) = 1 Then pj(j) = 14
If ph(j) = 2 Then pj(j) = 15
If ph(j) = 3 Then pj(j) = 16
If ph(j) = 4 Then pj(j) = 17
If ph(j) = 5 Then pj(j) = 18
If ph(j) = 6 Then pj(j) = 19
If ph(j) = 7 Then pj(j) = 20
If ph(j) = 8 Then pj(j) = 21
If ph(j) = 9 Then pj(j) = 0
Next j
For j = 2 To 3
' THO1
If ph(j) = 0 Then pj(j) = 5
If ph(j) = 1 Then pj(j) = 6
If ph(j) = 2 Then pj(j) = 7
If ph(j) = 3 Then pj(j) = 8
If ph(j) = 4 Then pj(j) = 8.3
If ph(j) = 5 Then pj(j) = 9
If ph(j) = 6 Then pj(j) = 9.3
If ph(j) = 7 Then pj(j) = 10
If ph(j) = 8 Then pj(j) = 0
If ph(j) = 9 Then pj(j) = 0
Next j
For j = 4 To 5
' D8
If ph(j) = 0 Then pj(j) = 8
If ph(j) = 1 Then pj(j) = 9
If ph(j) = 2 Then pj(j) = 10
If ph(j) = 3 Then pj(j) = 11
If ph(j) = 4 Then pj(j) = 12
If ph(j) = 5 Then pj(j) = 13
If ph(j) = 6 Then pj(j) = 14
If ph(j) = 7 Then pj(j) = 15
If ph(j) = 8 Then pj(j) = 16
If ph(j) = 9 Then pj(j) = 17
Next j
For j = 6 To 7
' FGA
If ph(j) = 0 Then pj(j) = 18
If ph(j) = 1 Then pj(j) = 19
If ph(j) = 2 Then pj(j) = 20
If ph(j) = 3 Then pj(j) = 21
If ph(j) = 4 Then pj(j) = 22
If ph(j) = 5 Then pj(j) = 22.2
If ph(j) = 6 Then pj(j) = 23
If ph(j) = 7 Then pj(j) = 24
If ph(j) = 8 Then pj(j) = 25
If ph(j) = 9 Then pj(j) = 26
Next j
For j = 8 To 9
' D21
If ph(j) = 0 Then pj(j) = 27
If ph(j) = 1 Then pj(j) = 28
If ph(j) = 2 Then pj(j) = 29
If ph(j) = 3 Then pj(j) = 30
If ph(j) = 4 Then pj(j) = 30.2
If ph(j) = 5 Then pj(j) = 31
If ph(j) = 6 Then pj(j) = 31.2
If ph(j) = 7 Then pj(j) = 32
If ph(j) = 8 Then pj(j) = 32.2
If ph(j) = 9 Then pj(j) = 33.2
Next j
For j = 10 To 11
' D18
If ph(j) = 0 Then pj(j) = 11
If ph(j) = 1 Then pj(j) = 12
If ph(j) = 2 Then pj(j) = 13
If ph(j) = 3 Then pj(j) = 14
If ph(j) = 4 Then pj(j) = 15
If ph(j) = 5 Then pj(j) = 16
If ph(j) = 6 Then pj(j) = 17
If ph(j) = 7 Then pj(j) = 18
If ph(j) = 8 Then pj(j) = 19
If ph(j) = 9 Then pj(j) = 20
Next j
For j = 12 To 13
' D2S1338
If ph(j) = 0 Then pj(j) = 16
If ph(j) = 1 Then pj(j) = 17
If ph(j) = 2 Then pj(j) = 18
If ph(j) = 3 Then pj(j) = 19
If ph(j) = 4 Then pj(j) = 20
If ph(j) = 5 Then pj(j) = 21
If ph(j) = 6 Then pj(j) = 22
If ph(j) = 7 Then pj(j) = 23
If ph(j) = 8 Then pj(j) = 24
If ph(j) = 9 Then pj(j) = 25
Next j
For j = 14 To 15
' D16
If ph(j) = 0 Then pj(j) = 8
If ph(j) = 1 Then pj(j) = 9
If ph(j) = 2 Then pj(j) = 10
If ph(j) = 3 Then pj(j) = 11
If ph(j) = 4 Then pj(j) = 12
If ph(j) = 5 Then pj(j) = 13
If ph(j) = 6 Then pj(j) = 14
If ph(j) = 7 Then pj(j) = 15
If ph(j) = 8 Then pj(j) = 0
If ph(j) = 9 Then pj(j) = 0
Next j
For j = 16 To 17
' D19
If ph(j) = 0 Then pj(j) = 12
If ph(j) = 1 Then pj(j) = 13
If ph(j) = 2 Then pj(j) = 13.2
If ph(j) = 3 Then pj(j) = 14
If ph(j) = 4 Then pj(j) = 14.2
If ph(j) = 5 Then pj(j) = 15
If ph(j) = 6 Then pj(j) = 15.2
If ph(j) = 7 Then pj(j) = 16
If ph(j) = 8 Then pj(j) = 16.2
If ph(j) = 9 Then pj(j) = 17
Next j
For j = 18 To 19
' D3
If ph(j) = 0 Then pj(j) = 12
If ph(j) = 1 Then pj(j) = 13
If ph(j) = 2 Then pj(j) = 14
If ph(j) = 3 Then pj(j) = 15
If ph(j) = 4 Then pj(j) = 16
If ph(j) = 5 Then pj(j) = 17
If ph(j) = 6 Then pj(j) = 18
If ph(j) = 7 Then pj(j) = 19
If ph(j) = 8 Then pj(j) = 0
If ph(j) = 9 Then pj(j) = 0
Next j
Write #2, ""; pj(0), pj(1); ""; pj(2), pj(3); ""; pj(4), pj(5); ""; pj(6), pj(7); ""; pj(8), pj(9); ""; pj(10), pj(11); ""; pj(12), pj(13); ""; pj(14), pj(15); ""; pj(16), pj(17); ""; pj(18), pj(19); ""
Next x
Close #1
Close #2
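The long If-chains above are lookup tables. A Python restatement with the same digit-to-allele mappings (values copied from the code; 0 marks null/unused digit slots):

```python
# Digit -> allele lookup tables copied from the VBA If-chains above.
TABLES = {
    "vWA":  (13, 14, 15, 16, 17, 18, 19, 20, 21, 0),
    "THO1": (5, 6, 7, 8, 8.3, 9, 9.3, 10, 0, 0),
    "D8":   (8, 9, 10, 11, 12, 13, 14, 15, 16, 17),
    "FGA":  (18, 19, 20, 21, 22, 22.2, 23, 24, 25, 26),
    "D21":  (27, 28, 29, 30, 30.2, 31, 31.2, 32, 32.2, 33.2),
    "D18":  (11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
    "D2":   (16, 17, 18, 19, 20, 21, 22, 23, 24, 25),
    "D16":  (8, 9, 10, 11, 12, 13, 14, 15, 0, 0),
    "D19":  (12, 13, 13.2, 14, 14.2, 15, 15.2, 16, 16.2, 17),
    "D3":   (12, 13, 14, 15, 16, 17, 18, 19, 0, 0),
}
LOCI = ("vWA", "THO1", "D8", "FGA", "D21", "D18", "D2", "D16", "D19", "D3")

def digits_to_profile(ps):
    """Convert a 20-digit string back to 10 (allele, allele) pairs."""
    return [(TABLES[loc][int(ps[2 * i])], TABLES[loc][int(ps[2 * i + 1])])
            for i, loc in enumerate(LOCI)]

print(digits_to_profile("45" + "0" * 18)[0])  # vWA digits 4,5 -> (17, 18)
```

Each locus's table is just the column of pj() values from the corresponding For j loop, indexed by the digit ph(j).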
2 million profile sub-division counts.
Anyone repeating the exercise will have very similar numbers.
For 1 million divide the numbers by 2, for 200,000 divide by 10, etc.
For my setup, any profile count over 15,000 could not be sorted
by Word.
all / first dividing
0,4019,1,398036,2,273611,3,609940,4,499104,5,191390,6,23392,7,501,8,7,9,0
1...................
0,,1,22058,2,33588,3,91034,4,113493,5,92135,6,38947,7,5927,8,854,9,0
11..................
0,88,1,9262,2,5621,3,2518,4,19,5,2354,6,2194,7,2,8,0,9,0
12..................
0,121,1,14174,2,8609,3,3670,4,31,5,3589,6,3394,7,,8,0,9,0
13..................
0,371,1,38677,2,22983,3,10158,4,78,5,9802,6,8956,7,9,8,0,9,0
131.................
0,,1,5305,2,8541,3,4746,4,46,5,6233,6,13372,7,434,8,0,9,0
132.................
0,,1,,2,3291,3,3769,4,31,5,4887,6,10666,7,339,8,0,9,0
14..................
0,456,1,47943,2,29091,3,12592,4,94,5,12133,6,11173,7,11,8,0,9,0
141.................
0,,1,6488,2,10745,3,5863,4,48,5,7669,6,16576,7,554,8,0,9,0
142.................
0,,1,5305,2,8541,3,4746,4,46,5,6233,6,13372,7,434,8,0,9,0
15..................
0,376,1,38889,2,23703,3,10039,4,88,5,9828,6,9200,7,12,8,0,9,0
151.................
0,,1,5350,2,8576,3,4743,4,40,5,6204,6,13535,7,441,8,0,9,0
152.................
0,,1,,2,3415,3,3929,4,36,5,4940,6,10984,7,399,8,0,9,0
16..................
0,162,1,16607,2,9981,3,4262,4,36,5,4113,6,3784,7,2,8,0,9,0
2...................
0,,1,,2,12917,3,68968,4,86770,5,70116,6,29621,7,4535,8,684,9,0
23..................
0,260,1,29219,2,17652,3,7616,4,63,5,7440,6,6709,7,9,8,0,9,0
24..................
0,337,1,36823,2,22145,3,9418,4,75,5,9414,6,8549,7,9,8,0,9,0
241.................
0,,1,5015,2,8205,3,4468,4,46,5,5986,6,12698,7,405,8,0,9,0
242.................
0,,1,,2,3307,3,3664,4,38,5,4563,6,10213,7,360,8,0,9,0
25..................
0,280,1,29665,2,17938,3,7755,4,67,5,7519,6,6882,7,10,8,0,9,0
251.................
0,,1,4115,2,6585,3,3575,4,34,5,4802,6,10203,7,351,8,0,9,0
252.................
0,,1,,2,2570,3,2966,4,30,5,3803,6,8281,7,288,8,0,9,0
26..................
0,109,1,12495,2,7572,3,3238,4,22,5,3161,6,3017,7,7,8,0,9,0
3...................
0,,1,,2,,3,93386,4,233176,5,188883,6,80487,7,12264,8,1744,9,0
33..................
0,366,1,39623,2,23922,3,10316,4,79,5,9959,6,9112,7,9,8,0,9,0
331.................
0,,1,5321,2,8850,3,4920,4,43,5,6275,6,13774,7,440,8,0,9,0
332.................
0,,1,,2,3543,3,3904,4,37,5,5170,6,10888,7,380,8,0,9,0
34..................
0,922,1,98703,2,59774,3,25677,4,233,5,24954,6,22893,7,20,8,0,9,0
341.................
0,,1,13568,2,21954,3,12210,4,114,5,15676,6,34066,7,1115,8,0,9,0
342.................
0,,1,,2,8847,3,9825,4,93,5,12693,6,27433,7,883,8,0,9,0
343.................
0,,1,,2,,3,2694,4,50,5,7065,6,15350,7,518,8,0,9,0
345.................
0,,1,,2,,3,,4,,5,4449,6,19839,7,666,8,0,9,0
346.................
0,,1,,2,,3,,4,,5,,6,21486,7,1407,8,0,9,0
35..................
0,736,1,80070,2,48356,3,20742,4,188,5,20292,6,18483,7,16,8,0,9,0
351.................
0,,1,10979,2,17776,3,9753,4,102,5,12661,6,27911,7,888,8,0,9,0
352.................
0,,1,,2,7096,3,8007,4,59,5,10369,6,22127,7,698,8,0,9,0
353.................
0,,1,,2,,3,2158,4,33,5,5739,6,12399,7,413,8,0,9,0
355.................
0,,1,,2,,3,,4,,5,3714,6,16060,7,518,8,0,9,0
356.................
0,,1,,2,,3,,4,,5,,6,17319,7,1164,8,0,9,0
36..................
0,321,1,34027,2,20677,3,8876,4,92,5,8566,6,7921,7,7,8,0,9,0
361.................
0,,1,4752,2,7491,3,4162,4,34,5,5425,6,11758,7,405,8,0,9,0
362.................
0,,1,,2,3021,3,3428,4,31,5,4284,6,9608,7,305,8,0,9,0
4...................
0,,1,,2,,3,,4,145639,5,236123,6,100130,7,15027,8,2185,9,0
44..................
0,544,1,61636,2,37389,3,15876,4,141,5,15655,6,14386,7,12,8,0,9,0
441.................
0,,1,8497,2,13443,3,7579,4,77,5,9872,6,21480,7,688,8,0,9,0
4416................
0,810,1,536,2,3761,3,2412,4,4533,5,7152,6,1963,7,289,8,24,9,0
442.................
0,,1,,2,5491,3,6096,4,59,5,7952,6,17248,7,543,8,0,9,0
4426................
0,614,1,444,2,2979,3,1928,4,3701,5,5628,6,1683,7,249,8,21,9,1
443.................
0,,1,,2,,3,1699,4,32,5,4355,6,9480,7,310,8,0,9,0
45..................
0,960,1,99982,2,60165,3,25896,4,218,5,25443,6,23434,7,25,8,0,9,0
451.................
0,,1,13926,2,22066,3,12193,4,110,5,16034,6,34516,7,1137,8,0,9,0
4512................
0,780,1,610,2,3810,3,2399,4,4661,5,7412,6,2058,7,311,8,23,9,2
4516................
0,1207,1,858,2,5971,3,3889,4,7295,5,11576,6,3206,7,474,8,40,9,0
452.................
0,,1,,2,8919,3,9771,4,86,5,12634,6,27853,7,902,8,0,9,0
4526................
0,994,1,698,2,4849,3,2986,4,5844,5,9369,6,2656,7,417,8,40,9,0
453.................
0,,1,,2,,3,2744,4,55,5,7136,6,15444,7,517,8,0,9,0
455.................
0,,1,,2,,3,,4,,5,4601,6,20190,7,652,8,0,9,0
456.................
0,,1,,2,,3,,4,,5,,6,21951,7,1483,8,0,9,0
46..................
0,399,1,42396,2,25654,3,11094,4,102,5,10657,6,9822,7,6,8,0,9,0
461.................
0,,1,5900,2,9361,3,5178,4,51,5,6755,6,14631,7,520,8,0,9,0
462.................
0,,1,,2,3800,3,4153,4,44,5,5403,6,11654,7,400,8,0,9,0
47..................
0,60,1,6295,2,3908,3,1694,4,12,5,1620,6,1437,7,1,8,0,9,0
5...................
0,,1,,2,,3,,4,,5,95938,6,81661,7,12100,8,1691,9,0
55..................
0,386,1,40439,2,24634,3,10444,4,87,5,10438,6,9497,7,13,8,0,9,0
56..................
0,309,1,34369,2,20890,3,8991,4,80,5,8921,6,8097,7,4,8,0,9,0
6...................
0,,1,,2,,3,,4,,5,,6,17375,7,5293,8,724,9,0
Background and Results
Background and results were reported to the usenet group uk.legal
over a number of weeks in 2003,
in the thread titled
R v. Watters - Court of Appeal judgement 2000/2001
My first idea was to download UK football results
data and analyse them, because they come in pairs of numbers
with a bias towards the lower numbers.
Which brings me to the statistical research I would
like to do concerning such multi-modal sets
and the conjectured increase in matches in close-to-modal
sets, like the 'biblical' analysis on this same thread.
I am for the moment trying to get some weighted numerical
data. At the moment I am trying to find a complete historical
record of football results going back over 100 years or so.
The theory being that, restricting to divisional football to avoid
mismatched teams like Man U. v. Barnstoneworth United,
score lines containing 0s, 1s and 2s should be more
common than 3s, 4s and 5s etc., retaining order, e.g. 2,1 and 1,2,
to equate to my negative normalised elements.
Then for each 10 games (non-void) over all the decades,
find how many 20-figure numerical matches there are.
I suspect many more among those with only 0s to 2s than among those
containing some 3s, 4s etc.
From
http://www.rsssf.com/engpaul/FLA/league.html
up to about 1950 there are "cross-table" (a good keyword) results, so perhaps
60 blocks of paired data like the block below.
Concatenating 60 seasons, breaking into 5 pairs, then repeating on
6 pairs etc., unlikely (my guess) up to 10 pairs, and testing for matches
would be interesting.
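The proposed chunk-and-match test can be sketched in Python. Toy data is shown; the real input would be the concatenated season scores from the cross-tables:

```python
from collections import Counter

def window_matches(scores, window=5):
    """Chop an ordered list of score strings into non-overlapping
    blocks of `window` pairs and report blocks occurring more than
    once -- the 5-pair / 6-pair match test proposed above."""
    blocks = [tuple(scores[i:i + window])
              for i in range(0, len(scores) - window + 1, window)]
    return {b: n for b, n in Counter(blocks).items() if n > 1}

# Toy data; real runs would use the rsssf.com cross-tables:
season = ["1-1", "0-0", "2-1", "1-0", "3-1"] * 2
print(window_matches(season))  # the repeated 5-pair block, count 2
```

Counting duplicate blocks with a hash table does in one pass what the sort-then-scan approach does over several, though sorting scales better once the data no longer fits in memory.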
The example below is for England 1936/37, for no particular reason;
one score of 10 was changed to * and
the original xxx was deleted, as well
as the text.
digit count
0 197
1 292
2 218
3 114
4 55
5 31
6 12
7 4
8 nil
9 nil
10 1
which appears to have about the right sort of weighting to equate
with normalised DNA profiles
1-1 0-0 1-1 1-1 4-1 2-2 3-2 0-0 1-1 4-1 1-0 1-3 1-1 5-3 4-0 4-1 1-1 0-0 4-1 2-0 3-0
1-3 1-1 4-0 1-2 0-0 0-1 2-0 2-3 4-2 2-1 5-0 2-2 2-2 0-0 2-1 1-0 1-1 2-4 2-0 1-1 1-0
0-5 0-0 2-2 2-1 2-1 1-3 1-2 1-2 2-2 2-1 0-1 0-2 0-4 1-3 1-0 0-0 1-0 0-0 1-1 4-1 1-2
2-0 2-1 2-2 4-2 1-0 6-2 2-2 2-3 1-1 4-1 5-2 2-6 4-0 4-1 4-0 1-1 2-1 2-1 3-3 2-1 3-2
0-2 2-2 1-0 2-1 1-0 2-0 2-0 1-0 1-0 1-0 1-1 1-1 3-0 2-2 0-0 3-1 1-0 2-0 3-1 4-2 4-0
2-0 1-3 0-1 2-1 3-0 1-1 4-0 3-2 0-0 2-1 2-0 4-4 4-2 1-0 1-1 0-0 1-1 1-0 1-3 3-0 0-1
5-4 3-1 3-0 2-3 5-0 1-1 3-1 3-1 3-3 5-3 4-1 0-5 5-4 0-2 1-3 1-2 3-2 2-2 3-0 1-0 5-1
1-1 3-3 3-2 3-0 2-2 0-0 7-0 3-0 2-1 7-1 2-0 1-1 2-3 2-3 4-0 2-2 3-1 1-1 3-0 4-2 1-0
1-3 1-1 3-1 2-0 0-1 3-0 3-4 1-0 2-2 4-1 2-1 5-3 6-2 5-1 1-0 6-4 5-1 1-3 6-0 2-3 1-1
0-0 1-1 2-0 1-1 1-2 4-2 2-0 0-3 0-3 3-0 4-0 1-1 3-1 2-0 1-2 4-2 1-0 2-1 2-1 1-1 4-0
3-4 0-2 2-2 3-1 2-0 2-3 2-0 3-0 2-0 2-1 2-0 1-1 2-1 5-0 3-1 1-0 1-1 2-1 3-0 3-1 0-1
2-1 2-0 0-0 2-2 1-2 1-1 3-3 3-2 7-1 1-1 3-0 0-5 2-0 0-2 0-0 1-1 2-2 2-1 4-0 1-2 1-0
2-0 1-1 2-2 2-1 1-1 0-0 3-2 4-1 1-1 3-0 4-0 5-1 1-0 2-1 3-1 4-1 4-1 2-1 2-4 6-2 4-1
2-0 1-2 1-0 1-3 0-0 0-0 2-2 2-1 1-1 3-1 0-0 2-5 3-2 2-1 0-1 1-1 1-1 2-1 2-1 2-2 1-1
1-1 3-1 2-0 3-0 1-1 2-0 1-3 2-0 0-0 5-0 4-2 3-3 2-0 3-2 2-2 2-1 2-0 1-0 5-5 4-1 1-0
1-5 2-1 1-1 1-3 0-1 4-1 1-2 2-2 2-1 1-0 3-0 6-2 2-1 2-1 2-1 0-1 1-0 1-0 3-2 5-3 1-1
1-3 2-2 1-2 1-1 0-0 1-0 5-2 1-0 3-2 1-1 1-0 3-1 2-5 3-1 2-0 1-1 1-1 0-1 2-0 3-2 1-3
0-0 0-3 2-0 0-2 3-1 1-1 2-3 6-4 2-1 2-2 1-2 1-2 5-1 1-0 1-0 0-0 0-1 0-0 2-0 2-3 1-3
0-0 2-0 2-2 5-1 1-1 2-0 1-2 2-1 2-0 1-1 2-1 1-1 2-2 3-0 6-2 2-4 0-2 1-0 5-3 *-3 2-1
1-1 4-0 3-0 4-1 1-0 2-3 3-2 3-1 5-1 3-2 2-1 4-2 1-3 1-1 4-1 3-2 3-0 2-1 3-0 1-0 6-2
2-4 3-2 0-2 1-0 1-2 2-0 1-3 2-1 4-2 2-1 3-0 3-1 2-2 1-0 3-1 3-1 0-0 2-3 2-2 6-4 2-1
2-0 2-1 2-3 4-0 6-1 1-2 3-1 7-2 5-2 3-1 3-0 2-0 2-1 3-1 0-1 1-1 5-0 4-3 2-1 1-1 5-2
Each row is one team in turn, its results playing each of the others in the
season.
For my normalising purposes I would perhaps leave 0=0,
1=1, 2=>-1, 3=>2, 4=>-2, 5=>3, 6=>-3, 7=>4, 8=>-4, 9 or 10=>5,
11 or 12=>-5,
or some such transformation. Then add a transformation normalisation
profile element by element and the results would have very much the look
and feel of DNA profiles.
Point to note - there is a greater likelihood of the home side having the
larger score, so perhaps convert all pairs to right number equal to or greater
than the left number before match processing, then a part-negative
transformation.
After all, in the real world of DNA profiles they never know
which parent contributes which, so the right number is always
larger than or the same as the left number of each pair.
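That 'directing' convention is a one-line normalisation. A minimal Python sketch (my restatement, not the author's macro):

```python
def direct_pair(pair):
    """Order an allele pair smaller-first, mirroring the NDNAD
    convention that profiles hold (14,16) and never (16,14)."""
    a, b = pair
    return (a, b) if a <= b else (b, a)

print([direct_pair(p) for p in [(16, 14), (2, 1), (3, 3)]])
# -> [(14, 16), (1, 2), (3, 3)]
```

Applying this before sorting and matching ensures that two people with the same genotype always produce byte-identical records.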
This evening I tested my technique just on that 1936/37
block posted earlier. Broke it into triplets, so 147 triples
of pairs. Only one match, 3-0,2-2,0-0, so already
finding evidence against my theory. I expected any
matches most likely to consist of 0s, 1s and 2s,
as weighted like DNA profiles,
but my first match includes a 3.
Giving it 'chapter and verse', these are
Charlton v Man U, Middlesb, Pompey
and Everton v Brentford, Charlton, Chelsea.
Now I know my analytical technique works, I
will download the other 60-odd blocks and break
them into '5 loci' and '6 loci' pairs and analyse.
Football results analysis
I broke down the 1888 to 1938 data into 5 pairs, giving 3038 sets of 5 pairs,
discarding surplus columns after splitting each year's data
into 5-pair-wide chunks.
Within them only 6 pairs of matches - no triples:
0-1,2-2,2-0,2-1,3-1 for Arsenal 1911 and Shef Wed 1933
0-2,1-1,1-0,2-2,4-0 for Man C 1899 and Blackpool 1937
1-1,1-0,1-1,3-0,0-0 for 1921 Middlesb and 1923 Huddersfield
1-1,2-1,1-0,1-0,1-0 for 1900 Wolves, 1911 Bradford
2-2,1-0,0-0,3-0,3-0 for 1905 Notts C and 1906 Derby
3-0,0-1,1-0,2-2,3-1 for 1899 Burnley and 1903 Bury
so only 1 involving a 4.
(Approximate, for 0 to 5) digit occurrence counts in total / matched
pairs
0 4900 / 20
1 8400 / 21
2 7300 / 12
3 4840 / 6
4 2650 / 1
5 1400
6 358
7 146
8 51
9 22
10 9
12 2
11 & 13-19 nil
There is no point in doing a 6-pair analysis for this data,
but I may try a 4-pair analysis to see if there is something
like a correlation between the overall number count
and a roughly similar distribution within the matched pairs.
I repeated the football result analysis on
sets of 4 pairs.
This gave 54 matches and a single triple match
on 2-0 2-1 1-0 1-1.
The digit distribution on the matches was
0 298
1 334
2 158
3 72
4 20
5 6
nothing higher
For all scores the digit counts were
0 6771
1 9331
2 5840
3 3538
4 1746
5 759
6 304
7 118
8 39
9 17
>=10 9
Again a rough correlation, with proportionally
more of the higher-frequency digits in the matches.
There must have been someone here before
with some weighted but otherwise random process, not
necessarily DNA inheritance.
Is there a rule relating a known weighted generator (
approximately multi-modal 'normal' distribution )
predetermining the weighting of any matches occurring?
Well, that was an interesting exercise; I've not
tried composing Visual Basic macros before.
Tailored the pseudo-random generator to
the desired characteristic, checked the output
against the desired characteristic,
determined matches and plotted the digit
distribution of the match cases.
32000 digits divided into 8 columns.
Not quite as I predicted - one match of
44,43,14,44, so a single 1 crept in.
10 matched sequences in total, no triples.
Desired weighting of generator to roughly equate to vWA
0 _ 0.002
1 _ 0.015
2 _ 0.100
3 _ 0.133
4 _ 0.25
5 _ 0.25
6 _ 0.133
7 _ 0.100
8 _ 0.015
9 _ 0.002
Actual weighting of output
0 _ 0.00203
1 _ 0.01569
2 _ 0.0996
3 _ 0.13103
4 _ 0.25147
5 _ 0.25044
6 _ 0.13412
7 _ 0.09775
8 _ 0.016
9 _ 0.00187
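A cumulative-threshold generator reproduces this kind of desired-versus-actual agreement. The sketch below is a Python restatement under the assumption that the macro works this way (the generator's VBA is not shown in this section); the weights are the desired table above:

```python
import random

# Desired weighting from the table above (sums to 1.0):
WEIGHTS = [0.002, 0.015, 0.100, 0.133, 0.25, 0.25, 0.133, 0.100, 0.015, 0.002]

def weighted_digit(rng=random):
    """Return a digit 0-9 with the vWA-like weighting by walking
    cumulative thresholds over one uniform draw."""
    r = rng.random()
    cum = 0.0
    for digit, weight in enumerate(WEIGHTS):
        cum += weight
        if r < cum:
            return digit
    return 9  # guard against floating-point round-off

random.seed(0)
sample = [weighted_digit() for _ in range(50000)]
print(round(sample.count(4) / len(sample), 3))  # close to the 0.25 target
```

With 50,000 draws the observed frequencies sit within a fraction of a percent of the targets, matching the agreement shown in the two tables.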
Weighting within matched pairs,80 digits,no triples
0 _ 0
1 _ 0.012
2 _ 0.05
3 _ 0.137
4 _ 0.4
5 _ 0.225
6 _ 0.1
7 _ 0.075
8 _ 0
9 _ 0
Matched sequences were
24425554
27565445
33457434
44431444
44535467
46544564
54634754
55244574
57643443
73563434
So my prediction was not quite right for this one-off, as that
single 1 intruded, and there is an interesting skew in the centre
which hopefully would clarify with repeated processing.
I suspect it relates to the piecewise 'quantisation', as 3, say, is something
like between 2.5 and 3.5.
But I can sum up as attenuation at the tails and an enlarged
modal group - more of an inverted U or V characteristic.
So in the original 'population' 50% have 4 or 5,
increasing to 62.5% within any matches, and 76.7%
have 3, 4, 5 or 6, increasing to 86.2% within matches.
Tentative evidence that any unrelated matches in the NDNAD
are going to be concentrated around the multi-modal groups.
So if anyone does get around to resolving those unresolved
matches in the NDNAD, then any matches
involving rareish ( < 2% allele frequency, say )
alleles can be ignored in the first instance, as they are
probably repeats, either due to clerical error or use of aliases.
Concentrate investigation / cross-correlation with the dermal
fingerprint database, or whatever, on those matches
nearest the 'average Joe'.
All the above concerns undirected numbers - the data in
the NDNAD is of course directed pairs, e.g. (14,16), never (16,14).
Also some loci are more distributed than vWA but others
are less distributed / more skewed. In theory one could model
each locus/allele frequency distribution and simulate
a large DNA database, given a big enough number cruncher.
So in case I've discovered some previously unknown
mathematical law, I should repeat with a weighting of a genuine
'normal distribution' f(x) of form EXP [-(x-mu)^2], repeat many times
to try and put some sort of an f(x) to the match characteristic, and
also see how far I can push the number crunching on my pc
to 10 or more digit-sets and 100,000 or more digits.
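The proposed normal-curve weighting can be quantised digit by digit. A Python sketch, where the centre mu and the spread divisor are illustrative choices of mine, not figures from the text:

```python
import math

def quantised_normal_weights(mu=4.5, spread=1.5, n=10):
    """Piecewise quantisation of the proposed f(x) = EXP[-(x-mu)^2]
    onto digits 0..n-1, renormalised to sum to 1. The `spread`
    divisor is my addition so the tails are not vanishingly small."""
    dens = [math.exp(-((d - mu) / spread) ** 2) for d in range(n)]
    total = sum(dens)
    return [w / total for w in dens]

w = quantised_normal_weights()
print([round(x, 3) for x in w])  # symmetric, peaked at digits 4 and 5
```

With mu midway between 4 and 5 the weighting is symmetric, like the vWA-style table used earlier, and the tails fall off smoothly rather than piecewise.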
I've just done an 8-digit times 10,000 run
yielding 63 matches. Unfortunately my method does
not, as it stands, pick up triples. I have to check the
source file one by one, which is alright for 10 but 63 is a bit much.
Weighting within matched sequences, 504 digits, no triples checked in the
central 45...... to 54...... region; a single 1 again, in 1,5,5,4,4,5,5,5.
0 _ 0
1 _ 0.002
2 _ 0.0536
3 _ 0.1151
4 _ 0.3353
5 _ 0.3571
6 _ 0.1012
7 _ 0.0357
8 _ 0
9 _ 0
so for 4,5 69.2% and for 3,4,5,6 90.9%
I tailored my generator for Afro-Caribbean vWA,
which has no nulls for 10 adjacent alleles and is
more symmetric than the Caucasian.
Projected allele frequency characteristic of the generator
0_ 0.005
1_ 0.016
2_ 0.079
3_ 0.218
4_ 0.208
5_ 0.211
6_ 0.161
7_ 0.068
8_ 0.029
9_ 0.005
Actual characteristic of 500,000 digits
0_ 0.0051
1_ 0.0157
2_ 0.0794
3_ 0.217
4_ 0.2085
5_ 0.212
6_ 0.1606
7_ 0.0679
8_ 0.0288
9_ 0.005
And characteristic of the 28 matches (no triples)
0_ 0
1_ 0.0036
2_ 0.0464
3_ 0.2071
4_ 0.2679
5_ 0.2714
6_ 0.1821
7_ 0.0179
8_ 0.0036
9_ 0
So again serious attenuation of the normal/binomial
distribution tails and an increase in the take of
the modal group.
3,4,5 originally 63.7% increasing to 74.6%
0,1,2 originally 10% decreasing to 5%
and 6,7,8 originally 26% decreasing to 22%
or 3,4,5,6 79% up to 92.8%
and 0,1,2,7,8,9 21% down to 7.2%
These are the 28 matches for 50,000 spins
of 'vWA - Afro-Caribbean allele frequencies'.
No 0 or 9, and one each of 1 and 8
3,3,6,5,3,6,6,5,4,4
3,4,3,5,6,6,5,3,3,3
3,4,4,5,4,5,5,4,6,6
3,4,5,3,5,5,7,6,5,3
3,4,6,4,4,3,4,5,6,4
3,4,6,7,4,5,4,4,4,4
3,5,3,4,3,3,3,3,3,5
3,5,3,4,4,4,7,4,4,5
3,5,3,5,3,3,5,4,4,5
3,5,5,5,4,5,5,6,2,3
3,5,6,3,6,2,6,4,1,4
3,6,5,4,3,3,3,5,5,3
3,6,6,6,5,4,5,5,3,5
4,3,3,6,5,3,3,5,5,4
4,3,6,4,4,5,3,3,3,2
4,4,3,4,5,2,6,2,5,6
4,4,4,5,5,5,6,7,4,5
4,4,7,3,4,3,4,5,5,3
4,5,5,3,6,5,3,6,5,4
4,5,5,4,6,6,6,5,5,4
4,5,6,3,2,5,4,5,5,4
4,6,3,4,5,3,6,6,6,2
4,6,4,6,5,5,2,6,2,6
5,4,3,4,4,4,5,2,5,5
5,4,6,5,5,6,5,5,4,5
6,4,6,3,4,4,6,6,6,6
6,5,5,4,5,4,4,2,6,4
8,2,6,6,4,4,5,6,2,5
For my next run I think I will
model each of the 10 UK loci/alleles in my
generator and spin 50,000 times to
simulate a 10-loci / single-allele database
of 50,000 profiles. I will lose the significance
between null and 0, but 0s have not appeared in
any match so far. THO1 would only use 7 of the
possible 10 values in the array; others like D21,
with about 16 possible alleles, I will truncate to
the modal 10 / most frequent (undecided yet). The triple-peaked D2
(equal peaks at 17, 20 and 24) I will
truncate to the 10 around the 'Anglo-Saxon'
group of 17 - 20, leaving out 2% alleles 26 and 27
at the 'Celtic 24 end'.
For any mathematical runs I cannot decide whether
to use a binomial quantised/piecewise distribution
for the generator,
closer to this use, or the normal function f(x), with
more chance of a numerically derived
f '(x) for the match distribution.
A 100,000 x 10 run would be possible I think,
but a bit of a work-up. I may also try 6 loci,
paired alleles, so 50,000 x 12, with simulation
of the earlier 6 NDNAD loci characteristics, which should
give an idea of how many 'Raymond Eaton' cases there
would be in the earlier NDNAD form. But I would have
to build another macro to direct the pairs before
match checking.
I have converted my generator to 6 loci and pairs, so 12-digit
'profiles'. So far I've only done one run of 12000 x 12
spins to check the characteristics. Continuing on,
directing pairs and checking for matches
produced no matches with 12000 '6 loci profiles'.
I deliberately added 2 matches to the data and
it found those 2 (4) as a check of functioning.
Anyone care to predict how many matches for runs
of 20,000 / 50,000 / 100,000 and 200,000?
Nulls are either due to no FSS data for that allele
or to keep my selection down to a maximum of 10 digits.
UK Caucasian
Tabulated as FSS data, e.g. vWA allele 14 corresponds
to digit 1 in my modelling.
Allele / desired frequency / modelled frequency
for vWA
11 0.000 NULL
13 0.001 0.0012
14 0.105 0.1065
15 0.080 0.0794
15.2 0.000 NULL
16 0.216 0.2146
17 0.270 0.2717
18 0.219 0.2183
19 0.093 0.0926
20 0.014 0.0137
21 0.002 0.0022
^
9 only modelled
THO1
5 0.002 0.0012
6 0.241 0.2439
7 0.194 0.1972
8 0.108 0.1027
8.3 0.001 0.0011
9 0.140 0.1385
9.3 0.304 0.3051
10 0.012 0.0103
10.3 0.000 NULL
^
8 only modelled
D8 D8S1179 / D6
8 0.018 0.018
9 0.013 0.0143
10 0.094 0.0953
11 0.066 0.0656
12 0.143 0.1442
13 0.333 0.330
14 0.209 0.2081
15 0.088 0.0886
16 0.031 0.030
17 0.004 0.0057
18 0.000 NULL
FGA
18 0.025 0.0426
18.2 0.000 null
19 0.056 0.0577
19.2 0.000 null
20 0.143 0.1432
20.2 0.002 null
21 0.187 0.1838
21.2 0.002 null
22 0.165 0.1631
22.2 0.011 0.0116
23 0.139 0.1411
23.2 0.004 null
24 0.146 0.1462
24.2 0.002 null
25 0.075 0.0758
25.2 0.000 null
26 0.035 0.0348
27 0.007 null
28 0.000 null
29 0.000 null
30 0.001 null
30.2 0.000 null
31 0.000 null
45.2 0.000 null
46.2 0.000 null
^ 0 (allele 18 ) is inflated by 1.8% nulls
D21 D21S11
53 (24) 0.000 null
54 0.001 null
57 (26) 0.001 null
59 (27) 0.031 0.0368
61 (28) 0.160 0.1559
63 (29) 0.226 0.2289
64.1 0.000 null
64 0.000 null
65 (30) 0.258 0.2571
66 0.027 0.0264
67 (31) 0.069 0.0666
68 0.093 0.0965
69 (32) 0.018 0.0179
70 0.090 0.0922
71 (33) 0.001 null
72 0.022 0.0217
73 (34) 0.000 null
74 0.002 null
75 (35) 0.000 null
77 0.000 null
^ 0 (allele 27) is inflated by 0.5% nulls
D18 D18S51
8 0.000 null
9.2 0.001 null
10 0.008 null
11 0.012 0.0335
12 0.139 0.1405
13 0.125 0.1254
14 0.164 0.1686
14.2 0.000 null
15 0.145 0.1447
16 0.137 0.1342
17 0.115 0.1167
18 0.080 0.0767
19 0.041 0.0419
19.2 0.000 null
20 0.017 0.0177
21 0.010 null
22 0.005 null
23 0.001 null
24 0.002 null
^ 0 (allele 11) is inflated by 2.5% nulls
Remainder
7 to 10 loci yet to be modelled
D2 D2S1338
16 0.037
17 0.185
18 0.087
19 0.110
20 0.138
21 0.032
22 0.024
23 0.112
24 0.142
25 0.111
26 0.019
27 0.002
28 0.000
D16 D16S539
5 0.000
8 0.019
9 0.129
10 0.054
11 0.289
12 0.288
13 0.186
14 0.029
15 0.005
D19 D19S433
10 0.000
10.2 0.000
11 0.000
12 0.087
12.2 0.000
13 0.222
13.2 0.013
14 0.382
14.2 0.015
15 0.177
15.2 0.038
16 0.041
16.2 0.017
17 0.005
17.2 0.000
18 0.000
18.2 0.002
19.2 0.001
D3 D3S1358
12 0.001
13 0.006
14 0.132
15 0.265
16 0.247
17 0.195
18 0.141
19 0.014
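The frequency tables above define a categorical distribution per locus. As an illustration, here is a Python sketch drawing one directed vWA pair from the UK Caucasian figures (assuming the near-zero null alleles are simply dropped, since the remaining mass already sums to 1.000):

```python
import random

# UK Caucasian vWA frequencies from the table above (nulls dropped):
VWA = {13: 0.001, 14: 0.105, 15: 0.080, 16: 0.216, 17: 0.270,
       18: 0.219, 19: 0.093, 20: 0.014, 21: 0.002}

def sample_directed_pair(freqs, rng):
    """Draw two alleles independently from a locus's frequency table
    and record them smaller-first, as the NDNAD does."""
    a, b = rng.choices(list(freqs), weights=list(freqs.values()), k=2)
    return (a, b) if a <= b else (b, a)

rng = random.Random(42)
print(sample_directed_pair(VWA, rng))  # e.g. a directed pair like (16, 18)
```

Repeating this over all 10 loci gives one 20-digit simulated profile; independence between the two draws and between loci is the same modelling assumption the macros make.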
Trumpet these results for a simulated DNA database
for UK Caucasians.
For a 20,000 6-loci / 12-allele 'profile' run
First run: 5 matched pairs, no triples
1,6,1,6,2,6,1,3,2,3,2,6
3,4,2,6,5,5,2,3,1,3,4,5
3,5,2,6,5,6,2,3,1,3,1,6
5,6,1,6,5,5,2,3,1,1,1,6
5,6,2,6,5,6,7,7,2,8,1,8
Second run one pair only
4,5,6,6,6,6,3,6,1,3,5,5
The real NDNAD had 45,000 6-loci profiles back
in 1991, which just shows what dangerous
nonsense these databases are for nabbing
false suspects, and hints at the number of 'unresolved'
pairs in the real NDNAD.
This is the multi-modal 'average Joe' 6-loci profile
for UK Caucasians:
vWA,THO1,D8,FGA,D21,D18
(17,17)(6&9.3)(13,14)(21,21)(29,30)(14,14)
corresponding to
4,4,1,6,5,6,3,3,2,3,3,3
in this representation, so little agreement
with my other hypothesis, although individual
pairs seem to tally in 4 of the 6 loci.
To convert one representation to the other,
use the tables in my previous posting.
Then a single 50,000 x 12 run with 27 matched pairs, no triples,
processed down from 600,000 data points
1,3,1,6,4,7,4,8,2,3,2,2
1,5,1,6,4,5,3,6,2,3,4,6
1,5,1,6,5,5,2,3,2,2,1,2
1,5,6,6,5,5,4,8,1,3,1,3
1,5,6,6,5,6,2,3,2,3,2,3
2,6,1,5,5,5,3,6,1,3,5,5
3,3,2,5,5,6,3,6,2,3,1,3
3,4,1,6,5,5,3,4,3,8,1,3
3,4,2,6,3,5,3,4,2,8,4,5
3,4,2,6,5,5,4,8,5,6,4,6
3,5,1,2,5,6,3,4,2,3,0,7
3,5,1,5,5,5,2,8,2,3,1,6
3,5,1,5,5,6,4,4,3,8,5,5
3,5,5,6,4,5,3,6,2,8,0,2
4,4,1,5,4,6,7,7,2,3,1,3
4,4,1,6,5,6,3,4,1,3,3,7
4,4,5,6,0,2,2,3,1,3,1,5
4,4,5,6,4,6,2,4,2,3,2,4
4,5,1,1,5,5,2,9,3,3,5,5
4,5,1,2,5,6,2,3,2,3,1,3
4,5,1,6,5,5,2,7,3,3,1,6
4,5,2,6,4,5,3,9,1,3,3,7
4,5,2,6,5,6,1,7,2,3,4,7
4,5,2,6,5,6,3,7,2,3,1,3
4,6,1,6,4,6,2,3,3,3,1,2
4,6,5,6,5,5,2,3,2,8,5,6
5,7,2,5,5,6,4,6,2,5,3,5
There is one further bit of analysis which could probably do with another macro.
For each pair of columns in the above 27 match-rows, do a frequency
plot of each digit and compare to the generating characteristic for each 'locus', for the
attenuated-tails and enlarged-modal-group effect.
And which are the most commonly occurring paired alleles in each locus?
e.g. (4,5) for vWA and (1,6) for THO1, or whatever.
To go any further I must make a
software restructure, basically swapping
disk space for memory and re-concatenating,
to go to 7 loci, to 8, to 9
and then 10 loci and more than 50,000 'profiles'.
Would anyone care to speculate on the
number of matches in 50,000, 100,000 and 200,000
profiles in 6-loci, 7-loci, 8-loci, 9-loci and 10-loci data-sets?
Or even the general case, extending to 2 million or even 60 million.
At the moment, for 6 loci, it looks as though the number
of matches follows a square law, about [N/(10^4)]^2 where N = number of 'profiles'.
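That square law is what the birthday-problem approximation predicts: with N profiles and a per-pair match probability p, the expected number of matched pairs is about N(N-1)/2 * p, which is quadratic in N. A Python sketch; the value p ~ 2e-8 is inferred here from the [N/(10^4)]^2 fit, not a figure from the text:

```python
def expected_matches(n, p):
    """Birthday-problem approximation: among n profiles there are
    n*(n-1)/2 unordered pairs, each matching with probability p,
    so the expected match count grows as the square of n."""
    return n * (n - 1) / 2 * p

# p ~ 2e-8 inferred from the [N/10^4]^2 fit for 6-loci profiles:
for n in (20000, 50000, 100000):
    print(n, round(expected_matches(n, 2e-8), 1))
```

At N = 50,000 this gives about 25 expected pairs, consistent with the 27 observed in the 50,000 x 12 run above.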
I have to gauge when to post this stuff to the Yahoo
forensic group, Prof Sir Alec Jeffreys etc.
I've now converted all macros to 7 loci / 14 data-points.
This is the result for a 50,000-profile run simulating
vWA,THO1,D8,FGA,D21,D18,D2
1,4,1,6,4,6,4,7,1,3,4,7,3,4
4,5,5,6,5,5,3,6,1,3,3,5,1,4
5,6,1,6,4,6,6,8,1,3,3,3,2,9
3 pairs, no triples.
This 7th locus, D2, is the most removed from a normal distribution,
having 3 distinct, separated peaks for UK Caucasians.
The final 3 loci are closer to a normal distribution,
but I will certainly have to increase to 100,000
profiles and more.
At the moment, just the pc processing time on a
1997-vintage AMD K6, 64M RAM pc,
for 50,000 x 14 is:
1/ generating profiles constrained to allele frequencies - 32 seconds
2/ redirecting pairs - 20 s
3/ splitting into 10 files by first digit (0 to 9 ) - 17 seconds
4/ sorting the biggest file (3........... in this case, but no pairs in this file) - 85 seconds
5/ pair matching - 3 seconds
6/ visual check of the sorted file to confirm presence of matches and also see if there is a triple
7/ for files that don't reveal a match, repeat with
a seeded match in the data to check the macro does pick it up
repeat processes 4,5,6,7 on each / a bunch of the remaining 9 files
The sort is alphanumeric rather than numeric.
If the files become too big to sort (process 4)
then I will just subdivide
on the second digit and proceed as before.
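Steps 1-5 amount to an external sort with a leading-digit partition. A compact Python restatement of that pipeline (mine, not the macros):

```python
from collections import defaultdict

def bucket_sort_and_match(profiles, width=12):
    """The divide/sort/match pipeline above as one function: split
    by leading digit so each chunk stays small enough to sort, sort
    each bucket, then scan adjacent entries for matching keys."""
    buckets = defaultdict(list)
    for ps in profiles:
        buckets[ps[0]].append(ps)
    matches = []
    for head in sorted(buckets):
        chunk = sorted(buckets[head])
        matches += [b for a, b in zip(chunk, chunk[1:])
                    if a[:width] == b[:width]]
    return matches

data = ["444455556666", "120456789012", "444455556666"]
print(len(bucket_sort_and_match(data)))  # -> 1
```

Partitioning is safe because two profiles can only match if they share the same first digit, so no match is lost by sorting the buckets separately. Note the sketch, like Word, compares strings, so the sort is alphanumeric.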
So, anyone care to predict the number of matches in 100,000
and 200,000 runs for 8 loci, 9 loci and 10 loci?
Results so far
4,000 profiles, 8 digits, undirected - 10 pairs
10,000, 8 digits, undirected - 63 pairs
10,000, 10 digits, undirected - 28 pairs
12,000, 6 loci, directed - no pairs
20,000, 6 loci, directed - 1 to 5 matches
50,000, 6 loci, directed - 27 pairs
50,000, 7 loci, directed - 3 pairs
I've now converted all macros to 8 loci 16 data-points.
This is the result for one 100,000 profile run simulating
vWA,THO1,D8,FGA,D21,D18,D2,D16
Just one match
4,4,2,6,6,6,1,4,2,3,3,3,1,8,3,3
Results so far
4,000 profiles, 8 digits, undirected - 10 pairs
10,000, 8 digits, undirected - 63 pairs
10,000, 10 digits, undirected - 28 pairs
12,000, 6 loci, directed - no pairs
20,000, 6 loci, directed - 1 to 5 matches
50,000, 6 loci, directed - 27 pairs
50,000, 7 loci, directed - 3 pairs
100,000, 8 loci, directed - 1 pair
I've now converted all macros to 9 loci 18 data-points.
No matches for one 200,000 profile run simulating
vWA,THO1,D8,FGA,D21,D18,D2,D16,D19
Results so far
4,000 profiles, 8 digits, undirected - 10 pairs
10,000, 8 digits, undirected - 63 pairs
10,000, 10 digits, undirected - 28 pairs
12,000, 6 loci, directed - no pairs
20,000, 6 loci, directed - 1 to 5 matches
50,000, 6 loci, directed - 27 pairs
50,000, 7 loci, directed - 3 pairs
100,000, 8 loci, directed - 1 pair
200,000, 9 loci, directed - no pairs
For anyone coming after me, this is a breakdown by the leading 'vWA'
digits, as it is quite bunched and matches are presumably more likely
in the bigger groups (eg 1,4... ; 3,4... ; 3,5... ; 4,4... ; 4,5... ; 4,6... ),
and probably in much the same proportions
for the 10-loci case.
0,0... to 0,9... - 400 'profiles'
1,1... - 2100
1,2... - 3400
1,3... - 9000
1,4... - 11300
1,5... - 9300
1,6... - 3900
1,7... to 1,9... - 700
2,2... - 1300
2,3... - 6800
2,4... - 8900
2,5... - 7100
2,6... - 2900
2,7... to 2,9... - 500
3,3... - 9500
3,4... - 23400
3,5... - 18900
3,6... - 8000
3,7... to 3,9... - 1400
4,4... - 14700
4,5... - 23400
4,6... - 10100
4,7... to 4,9... - 1800
5,5... - 9500
5,6... - 8000
5,7... to 5,9... - 1400
6,6... to 6,9... - 2200
7,0... to 7,9... - 50
8,0... to 8,9... - 1
Now converted all macros to 10 loci x2 and
also added a macro for converting back to the usual representation.
For a run of 600,000
Single 10 loci match of
VWA,THO1,D8,FGA,D21,D18,D2,D16,D19,D3
(17,18);(8,9);(13,14);(20,22);(30,30);(14,15);(20,20);(12,13);(13,14);(16,18)
Then cutting back on the same output array:
the same single match on 9 loci and no other,
9 (18) matches on 8 loci, including the 9- and 10-loci one,
102 (204) matches on 7 loci, including the first 7 pairs of the 8-, 9- and 10-loci ones,
and 2907 (x2) matches on 6 loci, including the first 6 pairs of the 7-, 8-, 9- and 10-loci ones.
No triples on the 8-loci set, and I've not checked the 7- and 6-loci sets.
If there are 6-loci records on the NDNAD they must be next to useless.
About 3000 matches in 600,000, so if it were still 6 loci and a square law,
there would have been about 3000 x 3 squared, or 27,000 (x2), matches.
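The cutting back can be sketched simply: a k-loci match is a repeated prefix of 2k codes in the directed profiles, so the lower-loci counts come off the same sorted array by truncating each line. A Python sketch with toy data (not my macro, which works on the sorted files):

```python
from collections import Counter

# Count matching pairs at k loci by truncating each directed profile to
# its first 2*k comma-separated codes; n identical prefixes give
# n*(n-1)/2 matching pairs.
def match_pairs_at(profiles, k_loci):
    counts = Counter(",".join(p.split(",")[:2 * k_loci]) for p in profiles)
    return sum(n * (n - 1) // 2 for n in counts.values())

# toy 2-locus profiles: all three share the first locus, none share both
toy = ["1,2,3,4", "1,2,3,5", "1,2,9,9"]
```

This is why the lower-loci match counts always include the higher-loci ones: a 10-loci match is automatically a match on any shorter prefix.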
My 'average Joe' is
(17,17);(6,9.3);(13,14);(21,21);(29,30);(14,14);(17,20);(11,12);(14,14);(15,16)
and my (slightly altered) profile is
(17,19);(8,9.3);(13,13);(20,22);(29,29);(13,15);(18,19);(12,12);(12,14);(16,18)
4,6;3,6;5,5;2,4;2,2;2,4;2,3;4,4;0,3;4,6
even closer to the numerically derived first match, normalised to
(0,1);(0,1);(0,-1);(0,0);(-1,-1);(-1,0);(-2,-1);(0,-1);(-1,0);(0,0)
There were 70,764 'profiles' with the first, vWA, pair of (17,18); that set contained
the 10-loci match and was the largest sub-set.
The next largest was 70,131 for vWA (16,17).
I may do a one million run for the sheer hell of it, but maybe
only fully sort the (17,18) subset.
I think there is a problem with the Rnd function
despite using the Randomize adjunct.
Did a (4,5) subset of 2 million profiles, which
took 25 minutes, giving 236,345 20-digit 'profiles'
(4,5);..........
Much processing later .........
The same matched pair as before which all looks highly
suspicious. And again same single match for 9 loci.
22 matches for 8 loci subset
214 for 7 loci
and 6113 for 6 loci .
Inspecting the 22 matches compared to the full 10 loci,
there were 3 near misses in adjacent sorted
sequences, where the first 16 digits matched.
Final 4 digits: 1,3,3,6 and 1,8,3,6
1,3,3,6 and 3,5,3,6
1,3,4,6 and 3,3,4,6
so 3 separate 9-loci matches, and 2 separate 9.5-loci if I had chosen
loci 1,2,3,4,5,6,7,8,10 instead of the straight sequence.
I will have to research the Rnd pseudo random number generator
as my macros seem to check out ok.
I did some further checking back to the
original generated undirected '2million profile' file and
what becomes a match started as these
two sequences
5,4,3,5,6,5,2,4,3,3,3,4,4,4,5,4,1,3,4,6
5,4,3,5,6,5,2,4,3,3,4,3,4,4,5,4,3,1,6,4
which when directed both become
4,5,3,5,5,6,2,4,3,3,3,4,4,4,4,5,1,3,4,6
which is not in the original at all,
so not a manifestation of the Rnd function
repeating itself. There is no way the Rnd
function would 'know' what I was going
to do with the output. In other words, what
looked highly suspicious, the 1 10-loci
and only 1 9-loci match, would seem
genuine after all. Fascinating stuff.
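The directing step itself is easy to state: sort each adjacent pair of allele codes into ascending order. A sketch in Python (not my VB macro), using the two undirected sequences above to show how they collapse to the same directed profile:

```python
# 'Direct' a profile: put each adjacent pair of allele codes into
# ascending order, so within-pair order no longer distinguishes
# two otherwise identical profiles.
def direct(profile):
    vals = profile.split(",")
    out = []
    for i in range(0, len(vals), 2):
        out.extend(sorted(vals[i:i + 2], key=int))
    return ",".join(out)

a = "5,4,3,5,6,5,2,4,3,3,3,4,4,4,5,4,1,3,4,6"
b = "5,4,3,5,6,5,2,4,3,3,4,3,4,4,5,4,3,1,6,4"
# a and b differ as generated, but direct to the identical profile
```

So two differently generated sequences becoming one match after directing is expected behaviour, not an RNG fault.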
There is no repeated sequence turning up
in the generator file, as that would carry through
and be picked up by the matching macro.
Unfortunately, due to constraints of disk space and
enforced deleting of files, I don't have the
original undirected source file (23.4MB) for the
600,000 profiles where the same sequence
later emerged, only the directed file, but
probably the same effect.
Generating a new (5,4) + (4,5) 2 million subset, the sequences
differed from the previous run, so Randomize was working.
BUT - I checked
the original undirected/unsorted file for the
central 3x2 group 3,3,3,4,4,4 and
5,4,3,5,6,5,2,4,3,3,3,4,4,4,5,4,1,3,4,6 emerged again
but in a different place in the file. The following
sequences also matched. So Rnd seems OK within
one run, but repeat a run and the same result is likely
to emerge somewhere in a long run, despite using Randomize so that the
Rnd function starts at a different point. Bear in mind that
although I'm only selecting the 5,4 / 4,5 subset, the
Rnd function is being called 2 million x 20 times;
the period of the inbuilt Rnd function, 2^24, is only about 16 million, and
Rnd produces an exact figure based on the previous call.
I've buried a superfluous Rnd call in the subroutine that
writes the (4,5)..... file, very approximately on average every 20
profiles, which should disrupt the sequence as far as the numbers
used in the loci generator are concerned. This write call would
not be the same for each run.
I've so far done another 230,000-odd (4,5).... profiles
and that sequence does not reappear; I will
fully process and see what emerges.
Some right fun and games with Linear Congruential
Generators for random numbers,
from sources
http://www.geocities.com/SiliconValley/Campus/7071/rnd.html
and
www.kaner.com/pdfs/random.pdf
I am now using the Microsoft form for the Rnd,
but in this form it has 15-digit precision
rather than being truncated to 7, and I seem to be getting
more convincing results.
I tried the Kaner/Vokey form with z = 2^40,
trying each of
a = 27182819621, c = 3
and a = 8413453205, c = 99991
in exactly the same Visual Basic code as below,
but there was horrendous repeating of 'random' numbers.
I've no idea what the problem is; perhaps someone else
would like to fabricate a fairly simple RNG
or check the following code in a VB procedure.
Dimensioning the variables as Double made no difference.
---------------
' initialising
a = 214013
c = 2531011
x0 = Timer
' Timer sets the start seed to the number of seconds after midnight
z = 2 ^ 24
' RNG step: x1 = (a * x0 + c) Mod z, computed via Fix()
temp = x0 * a + c
temp = temp / z
x1 = (temp - Fix(temp)) * z
x0 = x1
result = x1 / z
' 0 <= result < 1
-----------------------
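The same generator in Python, for anyone wanting to check its behaviour outside Word (the Fix() line above is just (a*x0 + c) Mod z computed through the fractional part; the seed 12345 below is an arbitrary stand-in for the Timer value):

```python
# Linear congruential generator with the same constants as the VB code:
# x1 = (a*x0 + c) mod z, with the result scaled into [0, 1).
A, C, Z = 214013, 2531011, 2 ** 24

def lcg_next(x0):
    x1 = (A * x0 + C) % Z
    return x1, x1 / Z

x1, result = lcg_next(12345)   # 12345 stands in for the Timer seed
```

Being a pure LCG with modulus 2^24, the state sequence necessarily repeats within about 16 million calls, which matches the repeat behaviour seen in the long runs.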
So far I have just processed one run of 1 million from the generator
using the above form of RNG, outputting to disk just the
4,5 and 5,4 subsets, directed, giving 118,193 4,5......... profiles.
Then sorting just 12,878 of the divided 4,5,3............. profiles:
no 16-, 18- or 20-digit matches, 3 '14-digit matches'
and 88 '12-digit matches', lopping back.
For a 1 million run each time: one run extracting 4,5.. and another
run extracting 4,4..
For 4,5.... (including the 4,5,1.. I mentioned yesterday),
of 105,315 profiles: 2x 8-loci matches only, no 9 or 10.
4,5,1,6,6,7,3,4,0,3,4,5,3,8,3,4
4,5,2,6,6,6,3,4,1,2,3,7,1,8,4,5
The 0 above represents an allele frequency of only 3.1%
sequences convert to
17,18 ; 6,9.3 ; 14,15 ; 21,22 ; 27,30 ; 15,16 ; 19,24 ; 11,12
17,18 ; 7,9.3 ; 14,14 ; 21,22 ; 28,29 ; 14,18 ; 17,24 ; 12,13
36x 7 loci matches
1345x 6 loci matches
And another 1 million run for 4,4.. only:
one 8-loci match among 73,259 4,4,……... profiles
4,4,3,6,5,6,4,6,3,3,3,7,0,1,4,5
The 0 here represents 3.7%
18x 7-loci matches
647x 6-loci matches
I now have confidence in the RNG and have ramped up
to 10 million profiles.
It took 2 hours 12 minutes to generate and save to disk a subset of
174,017 profiles: 4,5,1,6... 4,5,6,1... 5,4,1,6... and 5,4,6,1.....,
which when directed gave 4,5,1,6..... profiles only.
In them were 2 matches on 10 loci
4,5,1,6,2,5,0,4,1,5,1,7,1,4,4,4,1,3,5,6
4,5,1,6,5,6,2,6,1,2,2,3,1,4,3,4,3,3,3,5
which converts to
vWa;THO1;D8;FGA;D21;D18;D2;D16;D19;D3
17,18 ; 6,9.3 ; 10,13 ; 18,22 ; 28,31 ; 12,18 ; 17,20 ; 12,12 ; 13,14 ; 17,18
17,18 ; 6,9.3 ; 13,14 ; 20,23 ; 28,29 ; 13,14 ; 17,20 ; 11,12 ; 14,14 ; 15,17
The remaining processing, because of only 174,017 profiles,
took much the same time as the previous processing
but with a narrower 'catch'.
Other results in the usual sequence, ie the ordered first 9 loci (excluding D3),
not perm any 9 from 10, which would give higher numbers, but
as I rely on a sort routine I cannot do that determination.
9 loci - 7 matches
8 loci - 103 matches
7 loci - 1078 matches
6 loci - 21,113 matches
The 7x 9-loci and 2x 10-loci result is not too surprising
because the 10th locus is D3, very biased in the 3/4/5 area.
9 loci match analysis 4 pairs were 4,5,1,6,2,5,.... including the 10 loci one
8 loci analysis 12 were 4,5,1,6,2,5 ..... 17 were 4,5,1,6,4,5......
35 were 4,5,1,6,5,5...... 17 were 4,5,1,6,5,6......
So ramping up from 2 million to 10 million,
a factor of 5, these results agree with the square law
from the 2 million results if those are also restricted to 4,5,1,6...
Remember someone has decided to halt the NDNAD
when it reaches 3 million. It looks suspiciously
like he has done the same processing as me.
3m is likely the figure [<10/(2^-.5) and > 2 million (square law assumed)]
where you are likely to get one match
in the most frequently occurring (first) loci.
Returning to the 10 million result, I still have no idea
whether there would be more 10-loci matches in the
remaining (10m minus 174,017) = 9,825,983 profiles I
did not save and test. From the 2m runs and the 8-loci results for subsets
4,5,2,6... and 4,4..., I would suggest there are, but I
cannot put a likely figure on it. What 8-, 9- and 10-loci matches
do emerge are not being found totally in the multi-modal
areas where I intuitively expected them to be, so they could
appear anywhere, it seems, perhaps with a majority
of modal matches.
I will try another 174,017 subset of 10m in a block away from
4,5,1,6.... ; perhaps 4,4........ and see what emerges.
I will also write up, with the macros, for anyone else
to have a go; independent replication of
such analysis is fundamental. I used Visual Basic
macros with Word 97 on a 6-year-old PC.
The next area of exploration is the common ancestor, ie
parent and at least 10 alleles in common,
grandparent and at least 5 alleles in common, on average.
What is the probability of someone related, having these 5 to 10
as a starting point, then also matching on the remaining 15 to 10
just by a chance process, and the probability of that person
also being in the NDNAD? Remember we are talking
real ancestry here, not the nice comfy (sham) ancestry of the
genealogy community. The milkman factor, lovers, one-night
stands etc mean that up to 30% of people have a genetic
father different to their accredited father.
The nearest to 174,017 I could find for a convenient rarer subset
of 10 million profiles was 2,6.... & 6,2....,
giving 150,105 'profiles'
Results for
10 loci - 0 matches
9 loci - 1 match
8 loci - 3 matches
7 loci - 39 matches
6 loci - 1262 matches
The 9 loci match was on
2,6,2,6,5,5,3,7,2,3,3,5,7,8,4,5,3,5
which started as
2,6,2,6,5,5,3,7,2,3,3,5,8,7,5,4,3,5,6,2
and
2,6,6,2,5,5,7,3,3,2,3,5,8,7,4,5,3,5,3,4
so confidence in the RNG.
Previously I did a similar 2,6 & 6,2 run but included a
variation: adding in calls, 1 in 20, to the built-in Rnd
function, added to the external Rnd,
on the assumption that adding a poor rand
to a reasonable rand would make it better.
Not so.
Processed through and checked for matches:
apparently 3 10-loci matches.
Going back to the generator array, except for
the pair-directing, the sequence appeared twice exactly
the same, in different places,
in that array. I repeated for the second 'match'
and again found a pair of identical sequences in the original. I did
not bother checking the third result and scrapped the lot.
I downloaded the Sunny-beach RRnd but haven't
got anywhere with it. The help file doesn't come up
and it doesn't like my sound-card. Knowing what
(regular rather than random) hash appears on
radio reception close to a computer, I would
have thought any analogue noise derived from a sound
card would be heavily contaminated with all
the repetitive digital noise.
171,122 subset 3,4,1,6.......... of a 10m run
results
10 loci - 0 matches
9 loci - 5 matches
8 loci - 91 matches
7 loci - 1079 matches
6 loci - 22,113 matches
The 5x 9 loci results were all 3,4,1,6,5,6......
For anyone wishing to replicate these processes
I've put the macros and some background on
http://www.nutteing2.freeservers.com/dnas.htm
Over the next week I will write up the rest of
this simulation experiment and add to that file
(and mirror sites).
Next run will probably be subset 3,5......
which should be about 946,000 processing
3,5,1,6....... first and then the remainder.
I am trying to think my way around the co-ancestry conundrum.
Should it help anyone else, I did some processing
on the final sorted arrays for 15 alleles and 10 alleles.
In the first instance, assuming a match on the 10
digits 1,2,...,10 of 20 is for this purpose much
the same as on digits 1,3,5,...,19 of 20, and for the moment ignoring
the perm 1 from 2.
For the rarer 2,6...... profiles (150,105 out of 10m):
15-allele matches - 9
10-allele matches - 16,939, and I would guess about 1 in 50 were quadruples,
ie repeated pairs.
For the common 3,4,1,6.... profiles (171,122 out of 10m):
15 alleles - 271 matches
10 alleles - 71,876 matches, including I would guess 1 in 10 quadruples
What is the probability of a related person (parent-wise), so 10
alleles in common already, also having by chance a match on the other 10?
What is the probability of a related person (grandparent-wise), so 5
alleles in common already, on average, also having by chance a match on the other 15?
This week I've started reading the Spencer Wells book
The Journey of Man : A Genetic Odyssey,
the Y-chromosome derivation of human migration
since an African 'Adam', like the mitochondrial 'Eve'.
A quote from it, relevant here (coincidentally, Kidd's paper on
the Amerindian study I should have received this week from the British Library):
" The geneticist Kenneth Kidd, of Yale University , has pointed
out that if we double the number of ancestors in each generation
(around 25 years) ,when we go back in time about 500 years
each of us must have had over a million living ancestors.
If we go back a thousand years ,our calculation tells us that
we must have had one trillion (1,000,000,000,000 ) ancestors -
far more than the total number of people that have existed in
the whole of human history. ...................
....... The error in our ancestor tally is not from a malfunctioning
calculator,but from the assumption that each of the people
in our genealogy is completely unrelated to the others"
Good news for the anti-FSS brigade.
Found another 10-loci match in a different area.
I thought THO1 had the maximum possibility
of pairs of alleles, with maximum frequencies of .241 and .304,
but it is actually loci 8 and 9 in the standard FSS order:
D19 at .382 and .222 and
D16 at .289 and .288
So I rejigged things generating 10 million profiles
but only saving to disk those directed to become
..............3,4,1,3..
Giving 283,201 'profiles'
Then divided for THO1 (1,6)
so 41,551 profiles of form
..1,6..........3,4,1,3..
Then divided, sorted, reconcatenated and match-checked,
giving one match of
4,5,1,6,6,7,3,7,2,3,3,3,1,8,3,4,1,3,3,4
converted back as
17,18 ; 6,9.3 ; 14,15 ; 21,24 ; 29,30 ; 14,14 ; 17,24 ; 11,12 ; 13,14 ;15,16
This match started as
54,61,67,37,32,33,18,43,31,43 and
54,61,67,73,32,33,18,34,13,34
so no obvious problem with the RNG
Cutting back on the final array for lower matches,
perhaps not too relevant as the 3,4,1,3 columns are all the same:
9 loci - 4 matches
7 and 8 loci - same as 9
6 loci - 162 matches
9 loci result
4,5,1,6,4,4,2,7,1,3,3,3,3,7,3,4,1,3
3,5,1,6,5,6,2,7,1,2,1,3,1,8,3,4,1,3
4,5,1,6,5,6,6,7,4,8,1,1,1,2,3,4,1,3
4,5,1,6,6,7,3,7,2,3,3,3,1,8,3,4,1,3
By reconfiguring the columns and resorting:
10 loci - 1 match, as before
9 loci - 4 matches, as before in effect
8 loci - 162 matches
7 loci - 3,682 matches
6 loci - 15,172 matches
I will probably process the next biggest batch of
the generated 283,201 profiles before ditching them,
ie ..2,6..............3,4,1,3..
It really requires someone with a bigger number-crunching
computer to structure the multiple sort processes into
one macro, or a different process altogether, and crunch all 10 million
in one go.
I have now found the first 10 loci match
in an area where I was not expecting one.
Then processed the remaining ..2,*...........3,4,1,3..
For 72,578 profiles
10 loci - 0 matches
9 loci - 3
8 loci - 117
7 loci - 3,855
6 loci - 21,646 matches
Then processed the remaining ..1*..............3,4,1,3..
but * not = 6 for 78,434 profiles
10 loci - 0
9 loci 4
8 loci - 154
7 loci - 4,028
6 loci - 23,401
Then remaining ..a*..............3,4,1,3..
a not= 1 or 2 for 90,634 profiles
10 loci 1 match
9 loci 5 matches
8 loci - 159
7 loci - 4,327
6 loci - 25,318
The match was for
4,5,3,6,5,6,2,6,1,3,4,6,1,8,3,4,1,3,3,4
converted back to
17,18 ; 8,9.3 ; 13,14 ; 20,23 ; 28,30 ; 15,17 ; 17,24 ; 11,12 ; 13,14 ; 15,16
generated originally as
5,4,3,6,5,6,2,6,3,1,4,6,8,1,4,3,3,1,4,3 and
4,5,6,3,6,5,6,2,1,3,6,4,8,1,4,3,3,1,4,3
so good RNG
This has (4,5), one of the main modal groups, so
to properly test for matches outside the multi-modal areas I will probably
derive 300,000 profiles selected to be of form
(other than 4 or 5)(other than 1 or 6) .............. (other than 3 or 4)(other than 1 or 3) ..
and process through.
At the moment, for all matches found:
2 matches in an expected batch of 174,017
1 match in an expected batch of 41,551
1 match in the unexpected 90,634 plus 72,578 plus 78,434
So for the moment the best guess for 10-loci matches
in 10 million totally unrelated profiles is >4 and less than 40.
I tried 300,000 profiles selected to be of form
(not a 5)(not a 6) .............. (not a 3 )(not a 3) ..
and processed through.
I hadn't realised this gives only 6.6% of profiles.
Of 300,000 such profiles, no 10-loci matches
or 9-loci matches.
8 loci - 2
7 loci - 40
6 loci - 1387
Next I will probably try something like the opposite
in a 2 million run.
2 million run, with 137,190 processed profiles containing
at least one of the four most common alleles on each of loci 0,1,....7,8:
10 loci matches - 0
9 - 0
8 - 0
7 - 22
6 - 987
I was playing around with kinship (coin-tossing) statistics,
simulating with the RNG, so approximate only, as I used
variously 100,000 and 10,000 x 10 and 20 'tosses'.
As far as I can see, there is a 50% chance per allele.
For two people with the same mother and father,
the chances of inheriting the same N alleles, unspecified
ie in no particular order, are:
N - probability (%)
20 low
19 .003%
18 .02
17 .12
16 .5
15 1.5
14 3.6
13 7.4
12 12.0
11 16.3
10 17.3
9 16.3
8 12.0
7 7.4
6 3.6
5 1.5
4 .5
3 .12
2 .02
1 .003
0 low
For one parent concerning inheritance of
matching 1 allele in each pair of 10 loci
N %
10 .11
9 1.0
8 4.2
7 11.6
6 20.6
5 24.6
4 20.6
3 11.6
2 4.2
1 1.0
0 .11
For a common grandparent ,one allele on each of 10 loci,
25% chance for each allele
N %
10 low
9 .001
8 .04
7 .36
6 1.64
5 5.43
4 14.3
3 25.4
2 28.4
1 18.8
0 5.7
For common great-grandparent ,12.5% chance each,
N %
7 .004
6 .06
5 .4
4 2.3
3 9.3
2 24.1
1 37.8
0 25.6
For common gg-grandparent ,6.25%
N %
7 .001
6 .001
5 .015
4 .13
3 2.0
2 10.4
1 35.2
0 52.3
For common ggg-grandparent, 3.125%
N %
5 .01
4 .28
3 2.0
2 10.5
1 34.2
0 53.0
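These simulated tables can be checked against the exact binomial figures, since under the 50%/25%/12.5%-per-allele model the count of shared alleles is binomial. A Python check (exact values, so slightly different from my tossing runs, eg 17.6% rather than 17.3% for siblings at N = 10):

```python
from math import comb

# P(exactly N of n alleles shared) when each allele is shared
# independently with probability p: the binomial probability mass.
def p_shared(n, N, p):
    return comb(n, N) * p ** N * (1 - p) ** (n - N)

sib_10 = p_shared(20, 10, 0.5)      # full siblings, 10 of 20 alleles
parent_5 = p_shared(10, 5, 0.5)     # one parent, 5 of 10 loci
gp_2 = p_shared(10, 2, 0.25)        # common grandparent, 2 of 10
```

These agree with the tabulated values to within the simulation noise of the coin-tossing runs.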
Is there a mistake here in the residual 30-odd
percent chance of inheriting one allele over 5 generations?
Does anyone know the figures for the number of people alive today,
legitimate (real and assumed) and illegitimate,
having the same 2 parents, one parent, one grandparent,
one great-grandparent etc?
How to meld this sort of data with multi-allele
matching probability in the NDNAD?
Is there a numerical/simulation way to determine
how much co-ancestry will increase the number of matches
within a database?
I've now joined the redirection macro and the
first divider macro to the generator macro,
and added a save of the original undirected
array of profiles as number strings to
halve the disk-space requirement.
This is probably the proper way to do all
the processing: repeated application
of the dividing routine on successive columns
until there is nothing left to divide. Doing this
automatically would be alright were it not for
the very variable divided file sizes/counts, from a few to tens of thousands,
within the same dividing.
I suspected I was wasting my time but I decided to
do a million run and save everything to disk for
later processing.
So far just processed profiles of form 4...................
numbering about a quarter of a million (250,942)
Results
10 loci matches - 0
9 - 0
8 - 2 matches both starting 45,
7 - 60
6 loci - 2465 matches
That jump from 1 million to 10 million makes all
the difference.
Now I've started, I will have to carry on to
the remaining profiles before trying a 2 or 3 million run, disk
space permitting.
For the next large run I will probably change the
order in the generator array from the clumpy vWA,THO1,...
to D2,D18,FGA,... to partially even out some
of this clumpiness.
Perhaps a starting simulation could be:
5 males and 5 females of totally random,
unconnected, but otherwise generic UK Caucasian profiles.
Generate 4 or 5 'children' for each pairing, who in turn are only
allowed to mate with random-profile outsiders.
Add in a bit of second-cousin/cousin/incest
matings/pairings and repeat
for perhaps 5 or 10 generations and see what
emerges. Then repeat with outsiders constrained
to come only from, say, 5 similarly generated 'communities', etc.
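The inheritance rule for that simulation is plain Mendelian sampling: a child takes one allele at random from each parent's pair at every locus. A minimal Python sketch (the allele values below are illustrative placeholders, not the real frequency-weighted generator):

```python
import random

# Build a child's profile: at each locus, one allele chosen at random
# from the mother's pair and one from the father's pair.
def child_profile(mother, father, rng=random):
    return [(rng.choice(m), rng.choice(f)) for m, f in zip(mother, father)]

mother = [(17, 18), (6, 9.3)]       # two example loci, eg vWA and THO1
father = [(15, 16), (8, 9.3)]
kid = child_profile(mother, father)
```

Run over several generations with the mating constraints described above, this would let allele sharing through co-ancestry accumulate and be measured.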
I worked out how to do VB random-access files, Get
and Put, and made a macro to detect matches
in datafiles in string form. But it would take
forever and a day for the macro to process through like that.
It looks as though it will have to be a quick-sort
macro, or Word/Sort for the subdivided files, then my match macro
after re-uniting the sub-files.
Now, using the data stored as strings not only reduces
the file size, but the standard Word/Sort (un-highlighted
columns or text, default Text type of sort) now works.
I thought the smaller files would increase the handling
size of Word/Sort from 15,000, but it's still the same limit.
I may make a macro within Word that inputs in
turn each of the subdivided (<15,000 profile) files, sorts each file,
saves each file, then some sort of macro to copy and
paste all these sorted subfiles into one file to match-check.
I've accessed a number of VB sort code procedures but
will try the repeated Word/Sort macro first, as I suspect going
down that route will be quicker in actual processing time.
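Whatever sort is used, the match check after re-uniting the sub-files reduces to one linear scan, since identical profile strings end up adjacent. A Python sketch of that scan with toy data (not my macro); it also picks out triples and quadruples as longer runs:

```python
# Scan a sorted list of profile strings and report every run of
# identical profiles as (profile, run length):
# 2 = pair, 3 = triple, 4 = quadruple.
def runs_of_matches(sorted_profiles):
    runs, i = [], 0
    while i < len(sorted_profiles):
        j = i
        while (j + 1 < len(sorted_profiles)
               and sorted_profiles[j + 1] == sorted_profiles[i]):
            j += 1
        if j > i:
            runs.append((sorted_profiles[i], j - i + 1))
        i = j + 1
    return runs

runs = runs_of_matches(sorted(["b", "a", "b", "c", "a", "a"]))
```

The scan is linear in the number of profiles, so the sort dominates the total processing time.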
I thought I was wasting my time processing the remaining
3/4 million profiles, but no. I was going to leap to 3 million
now I have changed to 'string' data blocks and handling.
Now I know what I'm doing, have all the macros to hand and
know how the profiles sub-divide into various amounts.
If I repeated a whole 1 million run again, I reckon
it would take only about 3 hours in total to generate, run through
dividing, batch sorting, batch file merging and final match
checking. So, given a very near miss (below) for a 10-loci/
20-digit match in 1 million, the next run will be for 2 million.
File size for 1 million profiles as strings 22.8 MB.
Results for 1 million profiles all saved and processed through
1.............. (198,191 profiles )
6 loci/12 - 938 matches
7 - 29
8 - 1
2............... (135,851 )
6 - 546
7 - 19
8 - 3
9 - 1
3............ (305,269 )
6 - 2972
7 - 71
8 - 5
9 - 0
4................... (previously reported) - 250,942
6 loci - 2465 matches
7 - 60
8 - 2 matches
9 - 0
5................ (95,969 )
6 - 474
7 - 13
8 - 2
9 - 0
Remainder 0...,6....,7......,8...... (13,778 profiles)
6 - 11 matches
7 - 1
8 - 0
So match totals in 1 million profiles
6 loci - 7,406
7 loci - 193
8 loci - 13
9 loci - 1
Can now also easily check for triples
so far only emerged on 6 loci matches
( reconfigured macro for quadruples also)
1..... -18 triples (1 quadruple)
2........ - 7 triples (0 quad)
3....... - 84 triples (3 quadruple)
4........ - data no longer retained
5... - 11 triples ( 0 quad)
remainder - 0
So >=120 triples and >=4 quadruples on 6 loci
Needle in a Haystack
The near miss on 2........ profiles was actually
also a match for 19 digits
The 2 profiles were
"24162378233401331125" and
"24162378233401331122"
Conversion to standard notation
(15,17)(6,9.3)(10,11)(24,25)(29,30)(14,15)(16,17)(11,11)(13,13)(14,17)
(15,17)(6,9.3)(10,11)(24,25)(29,30)(14,15)(16,17)(11,11)(13,13)(14,14)
again most, but not all, are common alleles
vWA / 15 - allele frequency .08
D8 /10,11 - af .094 ,.066
FGA/ 25 - .075
and D2 /16 is only af 0.037
These started life as
"42613287323401331152" and
"24162378323410331122"
so nothing suspect about the Rand function.
Anyone care to lay bets on a match/ matches being contained
within 2 million profiles ?
For anyone not aware of all the previous research: this
simulation is for the artificial situation where all profiles
are generated absolutely randomly within the constraint
of the distributions found in UK Caucasians. It does not
assume any co-ancestry, ie all profiles are totally independent
of one another, with no common ancestors bequeathing any
allele/alleles down the generations. That is the next research/simulation.
The final reckoning
A single 10 loci match on 2 million profiles
Breakdown of results in standard loci order
for 1.............. (398,036 profiles )
6 loci/12 - 3,644 matches,105 triples,5 quad
7 - 91, 0 triples
8 - 5
9 - 0
for 2............... (273,611 )
6 - 2,118 , 48 triples, 1 quad
7 - 69, 0 triples
8 - 4
9 - 0
for 3............ (609,940 )
6 - 9,950, 597 triples , 52 quadruples, 7 quintuples
7 - 255, 0 triples
8 - 28
9 - 2
10 LOCI - 1 match
for 4................... (499,104 )
6 loci - 9,865, 540 triples , 49 quad , 7 quin
7 - 268, 0 triples
8 - 28
9 - 0
for 5................ (191,390 )
6 - 1,564, 40 triples , 3 quad
7 - 28 , 0 triples
8 - 2
9 - 0
Remainder 0...,6....,7......,8...... (27,921 profiles)
6 - 27 matches , 1 triple
7 - 1
8 - 0
So match totals in 2 million profiles
6 loci - 27,168
7 loci - 712
8 loci - 67
9 loci - 2
10 loci - 1
for 6 loci
1231 triples
110 quadruples
14 quintuples
The 3... subset's 9- and 10-loci numbers look suspicious, but that is just the
way things have panned out, including somewhat similarly before.
If I wanted to fiddle these results,
the first thing I would do is make the 9-loci match number
larger. Hopefully anyone repeating this experiment will
find similar numbers. For anyone doing so, I will
add the count breakdown of the sub-divisions to the dnas.htm file
tomorrow. You need a plan to work to because of the serious
disparity of numbers in the sub-divisions.
*****************************************
THE 10 LOCI MATCH in 2 MILLION is
"34,66,56,24,33,13,17,45,13,45"
when converted back , in standard form
(16,17)(9.3,9.3)(13,14)(20,22)(30,30)(12,14)(17,23)(12,13)(13,14)(16,17)
all in the more common allele frequencies.
*****************************************
The lowest being D2 / 23 of 11.2 % allele frequency
This match started life as
"34,66,65,24,33,13,71,54,13,45" and
"34,66,65,42,33,13,71,45,31,54"
so nothing suspect about the Rand function.
Previous results suggested the number of 10-loci matches in 10
million to be between 4 and 40. Assuming the square law, then
5x 2 million leads to 5^2 = 25 approx matches in 10 million profiles
and an implied 625 in 50 million. More repeats of this experiment, or
perhaps even 3m or 4m runs, will show whether 1 in 2 million is
average, below or above average. My hunch from the near miss
in 1m is that it is below average, ie implying between 25 and 40 matches
in 10 million.
I now have population data for the UK from 1700 to the present day to work on
for the next simulation. No data yet for interbreeding factors: father/daughter,
brother/sister, uncle/niece, first-cousin marriages, second-cousin marriages etc.
I will place the modified macros, other 'tools' and results on the ftp'd dnas.htm file Sunday,
and notify the forensic science lot Sunday or Monday.
Is all the above and preceding a first?
I've not come across even a hint of anyone publishing
this sort of simulation.
Up to Sept 28, 2003 - f207