NAME
AnalyseSeqs - Analyse a set of sequences of common length
SYNOPSIS
AnalyseSeqs [-X[bswn]] [-Q] [-M{mask}[+|!]] [-D{H|A|G}]
[-d{S|H|D|B}]
DESCRIPTION
AnalyseSeqs reads a set of sequences from stdin and applies a
variety of sequence analysis methods to them. Currently
available are statistical geometry for quadruples of
sequences, split decomposition, and neighbour joining and
Ward's variance method for reconstructing phylogenies using
various distance measures. STATISTICAL GEOMETRY IS
PRELIMINARY AND NOT YET WELL TESTED. For statistical geometry
and the cluster methods PostScript output is available.
The program continues reading until it encounters one of the
separator characters '@' or '%'. Only sequences of
alphabetical characters or of a specified alphabet are
processed; all other lines are ignored. The program stops
reading when it either encounters an EOF condition or finds
no valid sequence data between two lines beginning with
separator characters.
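The reading rules above can be illustrated with a minimal Python sketch.
It is not the program's actual implementation: the function names
read_blocks and is_valid are hypothetical, and the conversion to upper
case assumes the default mask behaviour described under -M.

```python
import io

SEPARATORS = ("@", "%")

def is_valid(line, alphabet=None):
    """Accept purely alphabetical lines, or lines drawn from a given alphabet."""
    if alphabet is None:
        return line.isascii() and line.isalpha()
    return all(ch in alphabet for ch in line)

def read_blocks(stream, alphabet=None):
    """Yield one list of sequences per separator-delimited block.

    Assumption: reading stops at EOF, or when two separator lines
    follow each other with no valid sequence in between.
    """
    block = []
    after_separator = False
    for raw in stream:
        line = raw.strip()
        if line.startswith(SEPARATORS):
            if not block and after_separator:
                return  # no valid data between two separators
            if block:
                yield block
            block = []
            after_separator = True
        elif line and is_valid(line, alphabet):
            block.append(line.upper())  # assumed default: upper case
        # every other line is silently ignored
    if block:
        yield block

data = "ACGU\nAGGU\n12 junk\n%\nGGCC\n@\n@\nAAAA\n"
blocks = list(read_blocks(io.StringIO(data)))
```

In this example the line "12 junk" is skipped, and the final "AAAA" is
never read because two separator lines occur with nothing between them.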
A list of taxa names can be specified in the input stream.
The list begins with a line beginning with '*'. Optionally,
a file name prefix [fn] for the PostScript output can be
specified on this line. The entries have the form 'x :
Taxon', where x is the number of the taxon, i.e., of the
corresponding entry in the list of input sequences. The taxa
list need not be complete. It must end, however, with a line
beginning with '*' or any of the separator characters. The
taxa list is printed at the top of the output. The specified
taxa names are used as labels in the PostScript output.
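As an illustration of the taxa list layout, the following Python sketch
parses such a list into a file name prefix and a mapping from taxon
number to name. The function name parse_taxa_list is hypothetical, and
the handling of malformed entries (they are skipped) is an assumption,
not documented behaviour.

```python
def parse_taxa_list(lines):
    """Return (prefix, {number: name}) for a taxa list.

    The list starts at the first line beginning with '*', which may
    carry an optional file name prefix, and ends at the next line
    beginning with '*', '@' or '%'.
    """
    it = iter(lines)
    prefix = None
    for line in it:
        if line.startswith("*"):
            prefix = line[1:].strip() or None
            break
    else:
        return None, {}  # no taxa list present

    taxa = {}
    for line in it:
        if line.startswith(("*", "@", "%")):
            break
        number, sep, name = line.partition(":")
        # Assumption: entries that do not match 'x : Taxon' are skipped.
        if sep and number.strip().isdigit():
            taxa[int(number)] = name.strip()
    return prefix, taxa

prefix, taxa = parse_taxa_list(
    ["* myrun", "1 : Human", "3 : Yeast", "*", "ACGU"]
)
```

Here taxon 2 is left unnamed, which the format permits because the list
need not be complete.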
OPTIONS
-X[bswn]
specifies the analysis methods to be used.
[b] Statistical Geometry. A PostScript file named
'[fn_]box.ps' giving a graphical representation of the
statistical geometry is created. The resulting box is a
good measure of 'tree likeness' of the data set. This
is the default.
[s] Split decomposition.
[w] Cluster analysis using Ward's method. A PostScript file
named '[fn_]wards.ps' is created containing a drawing
of the tree.
[n] Cluster analysis using Saitou's neighbour joining
method. A PostScript file named '[fn_]nj.ps' is created
containing a drawing of the tree.
-Q indicates that a statistical geometry analysis is to be
performed comparing four data sets, for instance to
confirm the significance of a proposed phylogeny. This
option is only useful for statistical geometry analysis
and hence the -X option is ignored. Each of the four
data sets must be of the form
* [filename_prefix]
# number
[list of taxa names]
*
list of sequences
%
where number is 1, 2, 3 or 4 for the four groups to be
compared.
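A small Python helper can assemble input in this layout. This is only a
sketch: the function name format_quadruple_input is hypothetical, and it
assumes that taxa are numbered from 1 within each data set and that the
same file name prefix is repeated for every group.

```python
def format_quadruple_input(groups, prefix=""):
    """Build -Q input text from four (taxa_names, sequences) pairs."""
    if len(groups) != 4:
        raise ValueError("exactly four data sets are required")
    out = []
    for number, (names, seqs) in enumerate(groups, start=1):
        out.append(f"* {prefix}".rstrip())   # taxa list header, optional prefix
        out.append(f"# {number}")            # group number 1..4
        out.extend(f"{i} : {name}" for i, name in enumerate(names, start=1))
        out.append("*")                      # end of taxa list
        out.extend(seqs)
        out.append("%")                      # end of data set
    return "\n".join(out) + "\n"

text = format_quadruple_input(
    [(["A"], ["ACGU"]), ([], ["AGGU"]), ([], ["ACCU"]), ([], ["ACGA"])],
    prefix="run",
)
```

The resulting text could then be piped to the program's stdin together
with the -Q option.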
-M{mask}[+|!]
specifies a mask for the input file. '{mask}' is either
one of the following letters indicating a predefined
alphabet, or the %-sign followed by all characters
to be accepted. A + sign at the very end of the mask
indicates that the input is to be handled
case-sensitively. The default is conversion of the input
to upper case.
A ! sign can be used to convert the input data to RY
code: GgAaXx -> R, UuCcKkTt -> Y; all other letters are
converted to *.
-Ma all letters A-Z and a-z.
-Mu uppercase letters.
-Ml lowercase letters.
-Mc digits [0-9].
-Mn all alphanumeric characters.
-MR RNA alphabet (GCAUgcau).
-MD DNA alphabet (GCATgcat).
-MA Amino acids in one-letter code.
-MS Secondary structures coded as '^.()'
-M%alphabet
use the specified alphabet.
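The RY conversion selected by the ! sign can be written down directly
from the mapping given above. The Python sketch below uses a
hypothetical function name, to_ry, and assumes that every character
outside the two listed sets, not only letters, becomes '*'.

```python
PURINES = set("GgAaXx")          # mapped to R
PYRIMIDINES = set("UuCcKkTt")    # mapped to Y

def to_ry(seq):
    """Convert a sequence to RY code; anything unlisted becomes '*'."""
    out = []
    for ch in seq:
        if ch in PURINES:
            out.append("R")
        elif ch in PYRIMIDINES:
            out.append("Y")
        else:
            out.append("*")
    return "".join(out)

encoded = to_ry("GACUgxkN")
```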
-D specifies the algorithm to be used for calculating the
distance matrix of the input data set. Available are
-DH Hamming Distance
-DA[,cost]
Simple alignment distance according to Needleman and
Wunsch. A gap cost other than the default of 1. can be
specified after the comma.
-DG[,cost1,cost2]
Gotoh's distance with gap cost function g(k) =
cost2+cost1*(k-1). The condition cost2<=cost1 must hold.
Default values are cost1=1., cost2=1., yielding the
same distance as option A.
ONLY THE HAMMING DISTANCE IS WELL TESTED SO FAR!
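To make the distance options more concrete, here is a Python sketch of
the Hamming distance matrix (-DH) and of the gap cost function g(k)
used by -DG. The function names are hypothetical and the code is not
taken from the program. It assumes unit substitution cost, as in the
default cost matrix -dS.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have common length")
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(seqs):
    """Symmetric matrix of pairwise Hamming distances."""
    n = len(seqs)
    d = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = hamming(seqs[i], seqs[j])
    return d

def gotoh_gap_cost(k, cost1=1.0, cost2=1.0):
    """Gap cost g(k) = cost2 + cost1*(k-1) for a gap of length k >= 1."""
    return cost2 + cost1 * (k - 1)

matrix = distance_matrix(["ACGU", "ACGA", "UCGA"])
```

With the default values cost1 = cost2 = 1 the gap cost reduces to
g(k) = k, a cost of 1 per gap position, which is why the defaults give
the same distance as option A.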
-d specifies the edit cost matrix to be used. Available
are
-dS simple distance. Indel and substitution of different
characters all have cost 1. The indel cost can be set
by specifying the gap costs with the algorithm options
-DA and -DG. This is the default.
-dH A distance matrix for RNA secondary structures,
inspired by Hogeweg's similarity measure (J.Mol.Biol
1988). The gap function is set automatically.
-dD Dayhoff's matrix for amino acid distances.
-dB Distinguish purines and pyrimidines only. CAUTION: this
option influences only the calculation of distances. It
does NOT affect computation of the statistical geometry,
which is done directly on the sequences. To do
statistical geometry on RY sequences, use the ! sign
with the -M option, for instance -MR!.
REFERENCES
The method of statistical geometry was introduced by M.
Eigen, R. Winkler-Oswatitsch and A.W.M. Dress (Proc Natl
Acad Sci, 85:1988,5912). The method of split decomposition
was proposed by H.J. Bandelt and A.W.M. Dress (Adv Math,
92:1992,47). The variance method for cluster analysis is
due to J.H. Ward (J Amer Stat Ass, 58:1963,236). The
neighbour joining method was published by Saitou and Nei
(Mol Biol Evol, 4:1987,406).
This program is part of the Vienna RNA Package.
WARNING
This is the beta test version. Some options or combinations
of options may still produce nonsense. Please send bug
reports to ivo@tbi.univie.ac.at.
VERSION
This man page is part of the Vienna RNA Package version 1.2.
AUTHOR
Peter F Stadler, Ivo L. Hofacker.
BUGS
Comments should be sent to ivo@itc.univie.ac.at.