UNPHASED: association analysis of haplotypes from unphased genotype data
UNPHASED is a suite of programs for association analysis of multilocus
haplotypes from unphased genotype data. UNPHASED currently includes the
following programs:
UNPHASED: graphical front end to the analysis programs
TDTPHASE: TDT and HHRR analysis for nuclear families
COCAPHASE: case/control data
QTPHASE: quantitative trait in unrelateds
PDTPHASE: pedigree disequilibrium test
QPDTPHASE: quantitative trait PDT
Input files
===========
All programs take a linkage pre-makeped pedigree file as input. This is a
plain text file with lines of the following whitespace-delimited format:
PedID SubID PID MID Sex Status Allele1a Allele1b Allele2a Allele2b ....
The pedigree name may be alphanumeric; all other fields must be numeric.
Missing data is coded as zero. Chromosome X data for males may be coded
either as homozygous or as single allele followed by zero.
Quantitative data may be put in column 6 instead of the affection
status; missing data is coded as "-".
A linkage datafile can be used to define the marker names (following
"#" as in Genehunter) and to allow different pedigree file formats to
the above.
PDTPHASE and QPDTPHASE can optionally use an input file containing
haplotype frequencies. Each line of the file specifies one
haplotype. The alleles are contained in the first n columns (n being
the length of the haplotype), followed by the frequency. For example,
1 1 0.25
1 2 0.25
2 1 0.2
2 2 0.3
If a haplotype is omitted from the frequency file, it is assumed to have
zero frequency.
Extended pedigrees are automatically broken up into the appropriate units:
case-parent trios for tdtphase, singletons for cocaphase and qtphase, and
whole pedigrees for pdtphase and qpdtphase.
Important note: as far as possible, UNPHASED is "robust" to errors in
the pedigree structure. This means that if a mendelian inconsistency
is encountered, a warning is printed and the pedigree is discarded,
but the program continues. If a parent is missing, including when the
parental ID has a typo, a dummy parent will be created and no warning
will be given. Sometimes this is an advantage, sometimes not.
Therefore I strongly recommend that you run your file through PEDCHECK
before running UNPHASED. The error checking in UNPHASED is reasonable
but not exhaustive.
If you get "variable number of columns", use Excel or AWK to see which
lines have missing or extra fields.
The programs run from a shell and output goes to stdout unless
redirected with the -output option.
Disclaimer
==========
This program is development software and is provided "as is", free of charge
under the GPL license.
Reference
=========
Dudbridge F (2003) Pedigree disequilibrium tests for multilocus haplotypes.
Genet Epidemiol 25:115-221
Author
======
Frank Dudbridge
MRC Human Genome Mapping Project Resource Centre
Hinxton, Cambridge
CB10 1SB UK
f.dudbridge@hgmp.mrc.ac.uk
http://www.hgmp.mrc.ac.uk/~fdudbrid/software/unphased/
Acknowledgments
===============
Bobby Koeleman and Iain Eaves helped greatly with the original
development of tdtphase. Thanks to David Clayton and Heather Cordell
for many valuable discussions regarding tdtphase, Mathias Chiano and
Lars Berglund for qpdtphase, and Adrian Mander for qtphase. Benny Lie,
Andrew Louka, Mark Iles and Thomas Schulze did a lot of beta testing.
Summary of theory
=================
TDTPHASE and COCAPHASE
----------------------
All methods are based on likelihood ratio tests in a log-linear model:
log(p/(1-p)) = mu + sum_i { beta_i . x_i }
where p is the probability of a chromosome being a "case" rather than a
"control", x_i are variables which represent the haplotypes in some way
depending upon the particular test, and mu and beta_i are coefficients to be
estimated.
A good reference for this application of log-linear models is Cordell &
Clayton, AJHG 70:124 (2002)
TDT - certain haplotypes
If analysis is restricted to phase-certain haplotypes only, the conditional
logistic regression model is used, corresponding to the probability of the
offspring, conditional upon the parents. In this case mu=0 and
p=exp(beta_t)/(exp(beta_t)+exp(beta_u))
where beta_t is the relative risk for the transmitted haplotype and beta_u
is the relative risk for the untransmitted haplotype. This is the ETDT of
Sham & Curtis (Ann Hum Genet 59:323).
In the situation described by Dudbridge et al, AJHG 66:2009, the correct
likelihood contribution is used - see Cordell & Clayton 2002 for description
of this.
TDT - uncertain haplotypes
--------------------------
When data are uncertain, there are theoretical and practical problems with
the conditional likelihood and instead we use unconditional logistic
regression on the full likelihood of parents and offspring (see Clayton AJHG
65:1170 for formulations of these likelihoods). This treats all the
transmitted haplotypes as "cases" and all untransmitted haplotypes as
"controls", equivalent to the haplotype-based haplotype relative risk of
Terwilliger & Ott (Hum Hered 42:337). The EM algorithm is used to obtain
maximium-likelihood estimates of case and control parental haplotype
frequencies under both null and alternative. Note that this is not robust to
population stratification although the permutation option can get around
this. Note that this is not robust to population stratification
although the permutation option can get around this. Note also that
this is only a test for association and is not a test of linkage.
Case-control data
-----------------
The method for case-control data is a standard unconditional logistic
regression identical to the model-free method T5 of EHPLUS (Zhao et al Hum
Hered 50:133) and the log-linear modelling of Mander (Stata J 1:58). The
beta_i are log odds ratios for the haplotypes. The EM algorithm is used to
obtain ML frequency estimates.
Convergence of EM algorithm
---------------------------
It is often noted that the EM algorithm may converge to a local
minimum. This is less of a problem than it may seem, at least for
hypothesis testing. If the algorithm starts from equal haplotype
frequencies in both cases and controls, this corresponds to the
"global" null that all odds ratios are equal. The likelihood will
then be maximised local to the null, which will give a similar result
to the score test even if the global maximum is not found. This does
not apply to other null hypotheses such as in conditional testing (see
below). Nevertheless, UNPHASED provides an option to restart the EM
algorithm from multiple random starting points.
Missing data
------------
There is an option to deal with missing genotypes by looping over all allele
values. The EM algorithm is used in the same way as for uncertain
haplotypes. This applies only to parental genotypes as missing
offspring genotypes would be uninformative.
Conditional tests
-----------------
UNPHASED provides association tests conditioning on additional loci which
may already be associated and in linkage disequilibrium with the test loci.
For TDT of certain data, this is the Conditional ETDT (Koeleman et al, Ann
Hum Genet 64:207); for uncertain-data TDT and unrelated subjects, the test
is of equality of odds ratios for haplotypes identical at conditioning loci
(Mander, Stata J 1:58).
Treating parental haplotypes independently is equivalent to assuming a
multiplicative penetrance model for the conditioning loci; as this is
unlikely to be the case, there is an option to condition on the genotype:
the data are stratified according to genotype at the conditioning loci.
The "main effects" test of Cordell & Clayton (AJHG 70:124) assumes that the
haplotype risk depends only on the constituent allelic risks and is a more
parsimonious test than the full conditional test. In this implementation,
all interaction terms are included within the conditioning loci and within
the test loci, but no interactions between conditioning and test loci are
allowed. Thus the "main effects" apply only to the conditioning and test
"superloci". The degrees of freedom are the number of test
haplotypes minus one.
An option is now available to specify no interaction terms at all.
There is no specific test of gene-gene interaction, but the
likelihoods under various options can be compared to test certain
hypotheses. For example, comparing the alternative likelihoods for the
conditional test with and without the -maineffects option gives a test
for interactions between test and conditioning loci. I have not
implemented these as specific options because the interpretation of
these tests is rather subjective and they are probably best left to
the specialists.
Haplotype grouping
------------------
Haplotypes are grouped by constraining their odds ratios to be equal. Rare
haplotypes may be grouped together to retain asymptotic validity and
increase power. Individual haplotypes are tested for association by grouping
all others together. Pairwise comparisons of risk may be made by using other
grouping configurations (Koeleman et al, Ann Hum Genet 64:207).
Permutation test
----------------
In TDTPHASE, the "transmitted" and "untransmitted" labels are
reassigned, with the same assignment used for all possible haplotype
assignments. This procedure gives the correct significance for TDT
even when the -EM option is used and the parents are not in HWE. A
"robust" permutation option uses the same reassignment for all sibs in
a family, which gives a valid test in the presence of linkage.
In COCAPHASE, the "case" and "control" labels are reassigned. The labels
apply to the whole subject, not its haplotypes, so this procedure is
robust to deviations from HWE even though the test assumes independent
chromosomes (this feature has not been noted in the literature
although some other programs, eg EHPLUS, have the same property).
In each replicate, all the selected markers are analysed and the most
significant p-value stored, so that the permutation procedure gives a
significance level corrected for the multiple haplotypes and markers
tested.
The permutation test may not give the actual probability under
the null, depending on the exact conditions of the test; but under
exchangeability arguments the reported P-values do give a valid test
of the null hypothesis in all situations.
QTPHASE
-------
This is similar to the COCAPHASE program except that it uses a generalized
linear model for a quantitative trait, of the form
Pr(Trait|Haplotype) = Normal ( mu=sum_i { beta_i . x_i } , sigma )
Here the trait is assumed to have a normal distribution whose mean depends
only on additive effects of haplotypes: no dominance is assumed,
although cis-phase interactions are allowed. The variance is not of
interest and is a nuisance parameter in the likelihood. A pooled
variance estimate is used, assuming the same variance for all
haplotypes. An intuitive description of the model is in Abecasis et
al, AJHG 66:279.
An E-M algorithm is used to deal with uncertain haplotypes. In the E-step,
we require Pr(Haplotype|Trait) for each possible haplotype solution. This is
obtained from Bayes' theorem since the model defines Pr(Trait|Haplotype). In
the M-step, we can estimate the trait mean for each haplotype by the
counting algorithm, assuming normality of the trait. The population
frequencies of the haplotypes are also estimated by counting.
This model allows conditional tests similar to the CETDT and the
main-effects test. The program gives an accessible subset of the more
general modelling implemented by Mander (Stata J 2:65).
The permutation test reassigns trait values among the subjects and the
test is robust to deviations from HWE even though the likelihood
assumes independent chromosomes (see comments for COCAPHASE above).
PDTPHASE
--------
This is an implementation of the pedigree disequilibrium test (Martin et al,
AJHG 67:146) with extensions to deal with haplotypes and missing data.
The PDT calculates a measure of association D within each family which has
expectation zero. The central limit theorem combines these across the whole
dataset into a standard normal. In PDTPHASE, D becomes the expected
association measure, over a prior distribution of haplotype frequencies.
This also has zero expectation, for any prior frequencies, and the same
theory applies.
A reasonable choice for the prior frequencies is the maximum-likelihood
gametic frequencies under the null, which are calculated using an E-M
algorithm. This approach extends readily to include missing genotypes,
though this will only result in a gain in power if the proportion of missing
data is quite small. Alternatively, the prior frequencies can be specified
in an input file.
QPDTPHASE
---------
This is an implementation of the quantitative trait PDT described in Monks &
Kaplan, AJHG 66:576, with extensions to deal with haplotypes and missing
data.
The null hypothesis is no linkage or no association, under which the trait
and genotypes are uncorrelated. The covariance is estimated within each
family and the estimates combined across the dataset by the central limit
theorem.
The covariance is an expectation which can be further be taken over all
possible haplotype configurations in all subjects in a family. The expected
value of the covariance under the null is still zero, for any prior
haplotype frequencies. Similarly to PDTPHASE, prior frequencies can be
estimated by the E-M algorithm, or specified in an input file.
The permutation procedure for both QPDTPHASE and PDTPHASE changes the
sign of the per-pedigree statistic, with probability 0.5. Because
this assumes complete linkage and non-random union of gametes in
offspring, the procedure does not give the true probability under the
null, but under exchangeability arguments the P-values do give a
valid test of the null hypothesis.
Usage
=====
All programs run from a command line of the form
tdtphase [options]
The input pedigree file is the only required argument. By default, the
programs analyse each marker in the file individually. Markers are
named 1,2,3... unless a data file is specified.
Here are all the available options: they are not available in all
programs, if an program does not implement an option, it will print a
list of the options it does implement.
-EM EM estimation of uncertain haplotype frequencies; if
omitted, only phase-certain haplotypes are used
-LD displays D', r^2 and Cramer's V when two-locus haplotypes
are selected (ie. -window 2)
-bygenotype conditions on genotypes rather than on haplotypes at the
conditioning loci
-cellcounts determines rare haplotypes on the basis of expected counts
rather than frequencies
-chrX identifies data from chromosome X; male alleles in the
first column for each marker
-compare specify first haplotype in a pairwise comparison; wild
cards may be specified by "0"
-condition select the conditioning markers
-conditiontype specify a conditioning haplotype; the null is that all
haplotypes containing this sub-haplotype have same risk
-datafile specify a linkage datafile in Genehunter format
-droprare rare haplotypes are grouped together, but under the null
are not grouped with other haplotypes, so the global test
has one fewer degree of freedom. The test has greater power
when the effect is in common haplotypes.
-epsilon specify a convergence threshold for the likelihood in the
EM algorithm; default is 1e-8 which is quite strict
-freqfile specify a file containing fixed haplotype frequencies
-hpt homozygous parent test; uses only the parents which are
homozygous are the conditioning loci. Optionally specify
which haplotype to be homozygous for, using the
-conditiontype option
-individual perform tests of individual haplotypes by grouping all
others together
-maineffects test main effects only, in conditional testing
-marker select markers to analyse from the pedigree file
-missing estimate missing parental genotypes by looping over allele
values which are consistent with pedigree data
-nointeraction no interactions terms at all; test and conditioning
haplotypes are both modelled with main effects only
-observed analyse only the haplotypes which are observed with
certainty in at least one subject
-onesib use only the first informative sibling from each family
-output write most of the output to a file instead of the screen
-paraff select parental affection status; 1=unaffected, 2=affected
-parsex select parental sex; 1=male, 2=female
-pedfile specify the pedigree file (not required as an option)
-permutation specify number of replicates for permutation test.
Estimates the significance of the best result correcting
for multiple testing of haplotypes and loci.
-phenotype select the phenotype for analysis, requires a
datafile. If omitted, the first phenotype in the
pedigree file is used.
-rare specify frequency below which haplotypes are grouped as
rare. Maximum-likelihood frequencies are re-computed after
identification of rare haplotypes.
-rarein followed by "both" (default), "either", "case" or
"control", selects which frequency is used to determine
whether a haplotype is rare
-reference specify reference haplotype which has odds ratio 1
-restarts restart the EM algorithm from multiple random starting points
-robustperm use the same permuted transmission status for all
sibs, therefore robust to prior linkage
-sibaff select sibling affection status
-sibsex select sibling sex
-sum compute the "PDT-sum" statistic (see Martin et
al. AJHG 68:1065). By default, "PDT-avg" is computed.
-window specify the length of haplotypes; the selected markers will
be analysed in groups of this size - see examples below
-with specify second haplotype in a pairwise comparison; no
wildcards allowed
-zero specify frequency below which haplotype frequencies are
trimmed to zero. Such haplotypes are assumed to be absent
in the population when calculating degrees of
freedom. The default is 1e-8.
Examples
========
tdtphase chr1.ped
analyse each marker in chr1.ped individually
tdtphase chr1.ped -window 2
phase-certain haplotypes from markers 1-2, 2-3, 3-4, etc
tdtphase chr1.ped -window 3 -EM
all haplotypes from markers 1-2-3, 2-3-4, 3-4-5, etc
tdtphase chr1.ped -marker 3 6
analyse marker 3, then marker 6
tdtphase chr1.ped -marker 3 6 -window 2
analyse haplotypes from markers 3 and 6
tdtphase chr1.ped -datafile chr1.dat -marker SNP1 SNP2
analyse the marker called SNP1 in chr1.dat, then the marker called SNP2
tdtphase chr1.ped -marker 3 6 8 -window 2
analyse haplotypes from markers 3 and 6, then from markers 6 and 8
tdtphase chr1.ped -marker 2 -condition 1
null: haplotypes identical at marker 1 have same relative risk
tdtphase chr1.ped -marker 2 3 4 -condition 1
test marker 2 conditioning on marker 1, then marker 3 conditioning on marker
1, then marker 4 conditioning on marker 1
tdtphase chr1.ped -marker 3 4 -window 2 -condition 1 2
test haplotypes of markers 3-4 conditioning on haplotypes of markers 1-2
tdtphase chr1.ped -marker 2 -condition 1 -conditiontype 3
haplotypes with allele 3 at marker 1 have same relative risk
tdtphase chr1.ped -marker 2 -condition 1 -bygenotype
odds ratio for test locus depends on genotype at conditioning loci
tdtphase chr1.ped -marker 2 -condition 1 -conditiontype 3 4 -bygenotype
relative risks for locus 2 are equal when genotype at locus 1 is 3/4
tdtphase chr1.ped -marker 2 -condition 1 -maineffects
test main effect of marker 2; does not detect interaction-only model
tdtphase chr1.ped -marker 3 4 -condition 1 2 -window 2 -maineffects
interaction allowed between markers 1 & 2 and 3 & 4, not between 1 &
3, 1 & 4, 2 & 3, 2 & 4
tdtphase chr1.ped -marker 3 4 -condition 1 2 -window 2 -nointeraction
no interaction allowed between any markers
tdtphase chr1.ped -marker 1 2 -window 2 -compare 1 2 -with 2 2
null: haplotype 1-2 has same relative risk as 2-2
tdtphase chr1.ped -marker 1 2 -window 2 -compare 1 0 -with 2 2
all haplotypes with allele 1 at marker 1 have same relative risk as
2-2
tdtphase chr1.ped -marker 1 2 -window 2 -compare 0 0 -with 2 2
individual test of the 2-2 haplotype
SIMULATION3G.C and TEST.PED
===========================
simulation3g.C is a program which simulates a QTL with additive effects on a
normal distribution. A liability threshold is used to determine affection
status. One marker with eight alleles is simulated with linkage/LD to the QTL
defined in the program. You need to specify the name of the file
containing simulated data, and a script which outputs one real number
(typically a P-value).
test.ped is a file of 500 three-generation pedigrees with a simple sib-pair
structure. There is one affection status and one quantitative trait,
followed by data for six markers. The first two markers are the simulated
marker recoded as a two-marker haplotype of diallelic markers. The next two
markers are the same haplotype, with parental data missing. The fifth marker
is the simulated marker itself, and the sixth is the same with parental data
missing.
test.dat should be specified when testing the programs, since test.ped has a
non-standard format (marker data begins in column 8).
test.ped serves only to verify that the programs are working and is
meaningless otherwise. simulation3g could be used to test type-1 error
rates and assess power.
Version history
===============
2.40 Added the GUI. Added -restarts and -output options. Fixed bug
in assigning default locus names. Fixed bug in calculating D' with
the -zero option. Fixed bug in QTPHASE with extremely rare
haplotypes. Improved robustness to erroneous pedigree structures.
2.371 More portable source code
2.37 Removed untyped subjects from all programs; this speeds them up and gives
very similar results. Added -nointeraction and -observed options.
Permutation significance as recommended North et al. (AJHG 72:498)
2.36 Changed the implementation of -rare in pdtphase and qpdtphase.
It should now give valid results with the -missing option.
2.35 Fixed chrX implementation in cocaphase and qtphase. Set default value of
1e-8 for the -zero option. Fixed bug with -zero option in tdtphase.
Fixed bug with -missing option for single marker analysis in (q)pdtphase.
2.34 Fixed bug in tdtphase, cocaphase and qtphase affecting situations
where the -zero flag is used and many haplotypes then have zero
frequency. Started all EM procedures from equal haplotype
frequencies. Added -cellcounts and -rarein options.
2.33 Fixed bug in permutation test over multiple window positions when
the global test is used. Affected tdtphase, cocaphase and qtphase.
Did not affect test of one window position nor when -individual option
was used.
2.32 Fixed bug in -missing option in cocaphase and qtphase. Output haplotype
frequencies and counts in all programs
2.31 Allowed one missing parent in tdtphase with EM option. May lead to
conservative test but that is users responsibility
2.30 Fixed bugs in cocaphase: main effects test, and omit subjects with
unknown affection status. Speeded up all the programs using a lookup table
for marker names. Added pdtphase.
2.20 Added qpdtphase and qtphase.
2.14 Added support for extended pedigrees in tdtphase. Fixed bug in
selecting subjects with unknown sex or affection status
2.13 Added -hpt option for homozygous parent test
2.12 Fixed bug in calculating LD stats without EM option
2.11 Fixed bug in reading files with no newline on the last line
2.1 Added support for general linkage format files. Fixed bug in calculating
null in cocaphase. More elegant calculation of permutation test in tdtphase.
2.02 Changed "+" indicator from rare haplotypes to common haplotypes.
Corrected some of the examples above
2.01 Changed exploratory pass to have same options as main pass, output
expected counts instead of frequencies and "+" for rare haplotypes, default
-sibaff 2
2.0 First public release; rewrite in C++
1.x Development versions in C