IMgc README FILE

WHAT THE SOFTWARE DOES

IMgc reads in an aligned fasta file, finds the largest non-recombining block of DNA sequence, and prints that block to file. This is necessary for downstream analyses that require datasets with no evidence of recombination. IMgc maximizes the amount of information (as defined by the user) in the final dataset, and the user can favour retention of segregating sites over individuals, or vice versa. The default is equal weighting of individuals and segregating sites.

The following manuscript fully describes the algorithm implemented by IMgc:

Woerner, A.E., M.P. Cox and M.F. Hammer. (2007) Recombination-Filtered Genomic Datasets by Information Maximization. Bioinformatics 23:1851-1853.

INSTALLATION

IMgc requires working installations of:

    Perl
    BioPerl
    Bio::SeqIO module (part of BioPerl)

SOFTWARE USAGE

Usage information can be found by running the command:

    IMgc --help

When running IMgc, the software first identifies segregating sites; we currently include indels in this definition. Sequence data must be haplotypic and fully phased (i.e. no ambiguity codes are allowed). The following characters are permitted: GATCN-, where N signifies missing data and - indicates an indel.

Complex multi-base indels are sometimes observed, e.g.:

    ATT
    --T
    ---
    ATT
    --T

IMgc treats these three bases as a single unit. Because that unit has three character states (ATT, --T and ---), it violates the infinite sites model. At any site that violates the infinite sites model, IMgc currently changes all but the two highest-frequency character states to N. For instance, the above example becomes:

    ATT
    --T
    NNN
    ATT
    --T

EXAMPLE FILES

Example input and output files are included with this package.

'3.0cMperMb_example.fasta' represents a 10-kb aligned DNA dataset of 96 chromosomal copies simulated under a standard n-coalescent with a recombination rate of 3.0 cM/Mb. Non-segregating sites have been changed to N in this file for clarity, but this is not a requirement of IMgc.
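
To make the terms used above concrete, here is a minimal Python sketch of what "segregating site" and the two-state masking rule from SOFTWARE USAGE could look like on a toy alignment. It is not taken from IMgc (which is a Perl program) and is not its implementation. The function names are hypothetical, and two behaviours are assumptions: N is ignored when counting states, and ties in frequency are broken by order of first appearance.

```python
from collections import Counter

MISSING = "N"  # per the README, N signifies missing data


def segregating_sites(seqs):
    """Return 0-based column indices that carry more than one state.

    Indels ('-') count as a state, as the README describes.
    Assumption: N (missing data) is ignored when counting states.
    """
    sites = []
    for i, column in enumerate(zip(*seqs)):
        states = set(column) - {MISSING}
        if len(states) > 1:
            sites.append(i)
    return sites


def mask_extra_states(states, keep=2):
    """Replace all but the `keep` most frequent states with N.

    `states` holds one string per sequence for a single site, or for a
    multi-base unit such as the three-base indel shown in the README.
    Assumption: ties are broken by first appearance (Counter ordering).
    """
    def is_missing(state):
        return set(state) == {MISSING}

    counts = Counter(s for s in states if not is_missing(s))
    if len(counts) <= keep:
        return list(states)  # no infinite sites violation at this site
    kept = {state for state, _ in counts.most_common(keep)}
    return [s if s in kept or is_missing(s) else MISSING * len(s)
            for s in states]


if __name__ == "__main__":
    alignment = ["ACGN", "ACTN", "ACGA"]
    print(segregating_sites(alignment))   # [2]

    unit = ["ATT", "--T", "---", "ATT", "--T"]
    print(mask_extra_states(unit))        # ['ATT', '--T', 'NNN', 'ATT', '--T']
```

The second call reproduces the README's own indel example: ATT and --T each occur twice, so the single --- is rewritten as NNN.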

'3.0cMperMb_example.fasta.out' shows the corresponding output from IMgc. This is the largest non-recombining block obtainable from '3.0cMperMb_example.fasta' when individuals and segregating sites are jointly optimized. The output is formatted as a file body for Jody Hey's IM programme, but other output formats (e.g. fasta) are also possible.

FINAL COMMENTS

IMgc is still under development. If you identify bugs, please contact the software author (details below). Feel free to drop me a line if you have questions or comments too.

-August Woerner (February 2007)
augustw AT email DOT arizona DOT edu