IMgc README FILE:
WHAT THE SOFTWARE DOES
IMgc reads in an aligned fasta file, finds the largest non-recombining
block of DNA sequence, and prints that block to file. This is necessary
for downstream analyses that require datasets with no evidence of
recombination. IMgc maximizes the amount of information (as defined by
the user) in the final dataset, and the user can favour retention of
segregating sites over individuals, or vice versa. By default,
individuals and segregating sites are weighted equally.
The following manuscript fully describes the algorithm implemented by
IMgc:
Woerner, A.E., M.P. Cox and M.F. Hammer. (2007) Recombination-Filtered
Genomic Datasets by Information Maximization. Bioinformatics
23:1851-1853.
INSTALLATION
IMgc requires working installations of:
Perl
BioPerl
Bio::SeqIO module (part of BioPerl)
SOFTWARE USAGE
Usage information can be found by running the command:
IMgc --help
When run, IMgc first identifies segregating sites; indels are currently
included in this definition. Sequence data must be
haplotypic and fully phased (i.e. no ambiguity codes are allowed). The
following characters are permitted:
GATCN-, where N signifies missing data and - indicates an indel.
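The input rules above can be checked before running IMgc. The following
Python sketch is not part of IMgc; the function names are hypothetical,
and it assumes that matching is case-sensitive and that N (missing data)
is ignored when deciding whether a column is segregating, while - counts
as a character state. IMgc's own handling may differ.

```python
from typing import List, Tuple

# Characters the README lists as permitted (upper case assumed).
ALLOWED = set("GATCN-")


def invalid_characters(seqs: List[str]) -> List[Tuple[int, int, str]]:
    """Return (sequence index, column index, character) for every
    character that is not in the permitted set."""
    return [
        (i, j, ch)
        for i, seq in enumerate(seqs)
        for j, ch in enumerate(seq)
        if ch not in ALLOWED
    ]


def segregating_sites(seqs: List[str]) -> List[int]:
    """Return 0-based columns holding more than one character state.

    Assumption: N is missing data and is ignored; '-' is a state,
    so indel columns count as segregating.
    """
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for j in range(length):
        states = {seq[j] for seq in seqs} - {"N"}
        if len(states) > 1:
            sites.append(j)
    return sites
```

An ambiguity code such as R would be reported by invalid_characters,
signalling that the data are not fully phased.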
Complex multi-base indels are sometimes observed, e.g.:
ATT
--T
---
ATT
--T
IMgc treats these three bases as a single unit. Because this unit has
three character states (ATT, --T and ---), it violates the infinite
sites model. At any site that violates the infinite sites model, IMgc
currently changes all but the two highest-frequency character states to
N. For instance, the above example becomes:
ATT
--T
NNN
ATT
--T
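The recoding rule illustrated above can be sketched in Python as
follows. This is an illustration under stated assumptions, not IMgc's
own code: the function name is hypothetical, each list element is the
state of one sequence at the multi-base unit, entries that are already
all N are treated as missing data and left alone, and ties in frequency
are broken by first appearance.

```python
from collections import Counter
from typing import List


def collapse_to_two_states(states: List[str]) -> List[str]:
    """Keep the two most frequent states; recode the rest as N."""

    def is_missing(state: str) -> bool:
        # Assumption: a state made only of N is missing data.
        return set(state) == {"N"}

    counts = Counter(s for s in states if not is_missing(s))
    # most_common(2) returns the two highest-frequency states;
    # ties keep their first-seen order (an assumption of this sketch).
    keep = {state for state, _ in counts.most_common(2)}
    return [
        s if s in keep or is_missing(s) else "N" * len(s)
        for s in states
    ]


unit = ["ATT", "--T", "---", "ATT", "--T"]
print(collapse_to_two_states(unit))
```

Run on the five sequences from the example, the sketch keeps ATT and
--T and recodes --- as NNN.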
EXAMPLE FILES
Example input and output files are included with this package.
'3.0cMperMb_example.fasta' represents a 10-kb aligned DNA dataset of 96
chromosomal copies simulated under a standard n-coalescent with a
recombination rate of 3.0 cM/Mb. In this example, non-segregating sites
have been changed to N for clarity, but IMgc does not require this.
'3.0cMperMb_example.fasta.out' shows the corresponding output from
IMgc. This is the largest non-recombining block obtainable from
'3.0cMperMb_example.fasta', in which individuals and segregating sites
are jointly optimized. The output format represents a file body for Jody
Hey's IM programme, but other output formats (e.g. fasta) are also
possible.
FINAL COMMENTS
IMgc is still under development. If you identify bugs, please contact
the software author (details below). Feel free to drop me a line if you
have questions or comments too.
-August Woerner (February 2007) augustw AT email DOT arizona DOT edu