This is used for identifying duplicate individuals/genotypes in large data sets. I've specified the threshold (max_miss) in terms of the max number of mismatching loci because I think everyone should already have tossed out individuals with a lot of missing data. It also makes it easy to toss out a pair without even looking at all the loci, which makes all the comparisons faster.

pairwise_geno_id(S, max_miss)

Arguments

S

"source", an integer matrix with NumInd-source rows and NumLoci columns, with each entry being a base-0 representation of the genotype at the c-th locus of the r-th individual. These are the individuals you can think of as parents if there is directionality to the comparisons. Missing data is denoted by -1 (or any integer < 0).

max_miss

maximum allowable number of mismatching genotypes between the two individuals of a pair.

Value

a data frame with columns:

ind1

the base-1 index in S of the first individual of the pair

ind2

the base-1 index in S of the second individual of the pair

num_mismatch

the number of loci at which the pair have mismatching genotypes

num_loc

the total number of loci missing in neither individual
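
Examples

To make the inputs and the returned columns concrete, here is a minimal pure-R sketch of the comparison described above. It is not the package's implementation: pairwise_geno_id_sketch is a hypothetical helper written only for illustration, and it rests on three assumptions that the text does not state outright: each pair is reported once with ind1 < ind2, a locus missing in either individual is skipped rather than counted as a mismatch, and a pair is kept when its mismatch count is less than or equal to max_miss.

```r
# Hypothetical reference sketch, NOT the package's own code.
# S: integer matrix, individuals in rows, loci in columns; any value < 0 is missing.
pairwise_geno_id_sketch <- function(S, max_miss) {
  n <- nrow(S)
  out <- list()
  if (n >= 2) {
    for (i in 1:(n - 1)) {
      for (j in (i + 1):n) {
        # Loci that are non-missing in both individuals (assumed basis for num_loc)
        both <- S[i, ] >= 0 & S[j, ] >= 0
        # Count mismatches only over those loci (assumption)
        mism <- sum(S[i, both] != S[j, both])
        # Assumed inclusive threshold: keep pairs with at most max_miss mismatches
        if (mism <= max_miss) {
          out[[length(out) + 1]] <- data.frame(
            ind1 = i, ind2 = j, num_mismatch = mism, num_loc = sum(both)
          )
        }
      }
    }
  }
  if (length(out) == 0) {
    # No pair passed the threshold: return an empty data frame with the same columns
    return(data.frame(ind1 = integer(0), ind2 = integer(0),
                      num_mismatch = integer(0), num_loc = integer(0)))
  }
  do.call(rbind, out)
}

# Toy data: 4 individuals, 5 loci, genotypes coded 0/1/2, -1 = missing
S <- matrix(
  c(0L, 1L, 2L,  1L, 0L,
    0L, 1L, 2L, -1L, 0L,   # same as individual 1 apart from one missing locus
    2L, 0L, 0L,  1L, 1L,
    0L, 1L, 2L,  1L, 1L),  # differs from individual 1 at the last locus
  nrow = 4, byrow = TRUE
)

res <- pairwise_geno_id_sketch(S, max_miss = 1L)
print(res)
```

With max_miss = 1 the sketch keeps the pairs (1, 2), (1, 4) and (2, 4). The double loop visits every pair and every locus, so it is only suitable for checking small inputs; it does not attempt the early bail-out that makes the real comparison fast.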