Dr Richard J. Edwards


SLiMSuite software package

This webpage contains the latest download and documentation for the SLiMSuite package. Please see the tabs below for more details. Release notes can be found in the ReadMe tab.

The current SLiMSuite release is 2015-01-07. Click on the tabs for details.

Availability

The tools in SLiMSuite are freely available for local installation under a GNU General Public License as part of the SLiMSuite package (2015-01-07 release). Please see the Manuals and ReadMe for more details.

To contact the author, e-mail: seqsuite@gmail.com

Citing SLiMSuite

When publishing analyses performed with this software, please cite the individual papers listed for the relevant program/module. If no paper or URL is listed, please cite this website.

Introduction

Short linear motifs (SLiMs) in proteins are functional microdomains of fundamental importance in many biological systems. SLiMs typically consist of a 3 to 10 amino acid stretch of the primary protein sequence, of which as few as two sites may be important for activity, making identification of novel SLiMs extremely difficult. In particular, it can be very difficult to distinguish a randomly recurring "motif" from a truly over-represented one. Incorporating ambiguous amino acid positions and/or variable-length wildcard spacers between defined residues further complicates the matter.

The SLiMSuite collection contains a number of open-source bioinformatics tools to analyse these important protein features. Note that SLiMSuite now includes all of the programs from the previous SeqSuite package, of which SLiMSuite was formerly a part. See also the SLiMSuite blog for more information.

SLiMFinder

SLiMFinder is an integrated SLiM discovery program building on the principles of the SLiMDisc software for accounting for evolutionary relationships [Davey NE, Shields DC & Edwards RJ (2006): Nucleic Acids Res. 34(12):3546-54]. SLiMFinder is comprised of two algorithms:

SLiMBuild identifies convergently evolved short motifs in a dataset. Motifs with fixed amino acid positions are identified and then combined to incorporate amino acid ambiguity and variable-length wildcard spacers. Unlike programs such as TEIRESIAS, which return all shared patterns, SLiMBuild accelerates the process and reduces returned motifs by explicitly screening out motifs that do not occur in enough unrelated proteins. For this, SLiMBuild uses the "Unrelated Proteins" (UP) algorithm of SLiMDisc, in which BLAST is used to identify pairwise relationships. Proteins are then clustered according to these relationships into "Unrelated Protein Clusters" (UPCs), which are defined such that no protein in a UPC has a BLAST-detectable relationship with a protein in another UPC. If desired, SLiMBuild can be used as a replacement for TEIRESIAS in other software (teiresias=T slimchance=F).
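The UPC construction described above is essentially single-linkage clustering over pairwise BLAST relationships. A minimal sketch in Python (a toy illustration, not the SLiMSuite code; the protein names and BLAST pairs are hypothetical):

```python
def unrelated_protein_clusters(proteins, blast_pairs):
    """Single-linkage clustering: any two proteins joined by a chain of
    BLAST-detectable relationships end up in the same UPC."""
    parent = {p: p for p in proteins}

    def find(p):
        # Find the cluster representative, compressing the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in blast_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    return list(clusters.values())

# Hypothetical example: A-B and B-C are BLAST hits, so A, B and C share a UPC;
# D has no detectable relationship and forms its own UPC.
upcs = unrelated_protein_clusters(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```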

SLiMChance estimates the probability of these motifs arising by chance, correcting for the size and composition of the dataset, and assigns a significance value to each motif. Motif occurrence probabilities are calculated independently for each UPC, adjusted for the size of the UPC using the Minimum Spanning Tree algorithm from SLiMDisc. These individual occurrence probabilities are then converted into the total probability of seeing the observed motif the observed number of (unrelated) times. These probabilities assume that the motif is known before the search. In reality, only over-represented motifs from the dataset are examined, so these probabilities are adjusted for the size of motif-space searched to give a significance value: an estimate of the probability of seeing that motif, or another one like it. These values are calculated separately for each length of motif. Where pre-known motifs are also of interest, these can be given with the slimcheck=MOTIFS option and will be added to the output. SLiMFinder version 4.0 introduced a more precise (but more computationally intensive) statistical model, which can be switched on using sigprime=T. Likewise, the more precise (but more computationally intensive) correction to the mean UPC probability heuristic can be switched on using sigv=T. (Note that the other SLiMChance options may not work with either of these options.) The allsig=T option will output all four scores. In this case, SigPrimeV will be used for ranking etc. unless probscore=X is used.
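The combination step can be illustrated with a deliberately simplified binomial model: assume (unlike the real SLiMChance, which adjusts each UPC's probability via the minimum spanning tree) that every UPC contains the motif with the same fixed probability:

```python
from math import comb

def prob_at_least_k(p_upc, n_upc, k):
    """P(motif occurs in >= k of n_upc unrelated protein clusters), assuming
    each UPC independently contains the motif with probability p_upc."""
    return sum(comb(n_upc, j) * p_upc ** j * (1 - p_upc) ** (n_upc - j)
               for j in range(k, n_upc + 1))

# e.g. a motif with a 5% per-UPC chance seen in 4 of 20 UPCs is much less
# surprising than the same motif seen in 8 of 20
p4 = prob_at_least_k(0.05, 20, 4)
p8 = prob_at_least_k(0.05, 20, 8)
```

The significance step then corrects such probabilities for the size of the motif-space searched, which this sketch does not attempt.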

Where significant motifs are returned, SLiMFinder will group them into Motif "Clouds", which consist of physically overlapping motifs (2+ non-wildcard positions are the same in the same sequence). This provides an easy indication of which motifs may actually be variants of a larger SLiM and should therefore be considered together.

Additional Motif Occurrence Statistics, such as motif conservation, are handled by the rje_slimlist module. Please see the documentation for this module for a full list of commandline options. These options are currently under development for SLiMFinder and are not fully supported. See the SLiMFinder Manual for further details. Note that the OccFilter *does* affect the motifs returned by SLiMBuild and thus the TEIRESIAS output (as does min. IC and min. Support) but the overall Motif StatFilter *only* affects SLiMFinder output following SLiMChance calculations.

QSLiMFinder

QSLiMFinder (Query SLiMFinder) is a variant of SLiMFinder for explicitly returning motifs present in a given query sequence specified by query=X. More details can be found in the SLiMFinder Manuals or contact the author.

SLiMCore

The rje_slimcore module forms the basis for SLiMFinder & SLiMSearch and can also be used in its own right for additional special functions. The "MotifSeq" option will output fasta files for a list of X:Y pairs, where X is a motif pattern and Y is the output file.

The "Randomise" function will take a set of input datasets (as in Batch Mode) and regenerate a set of new datasets by shuffling the UPC among datasets. Note that, at this stage, this is quite crude and may result in the final datasets having fewer UPC due to common sequences and/or relationships between UPC clusters in different datasets.

SLiMProb

SLiMProb, formerly called SLiMSearch, is a tool for finding pre-defined SLiMs (Short Linear Motifs) in a protein sequence database. SLiMProb can make use of corrections for evolutionary relationships and a variation of the SLiMChance algorithm from SLiMFinder to assess motifs for statistical over- and under-representation. SLiMProb is a replacement for PRESTO and uses many of the same underlying modules.

Benefits of SLiMProb over many existing tools include:

  • searching with mismatches rather than restricting hits to perfect matches.
  • optional equivalency files for searching with specific allowed mismatches (e.g. charge conservation)
  • generation or reading of alignment files from which to calculate conservation statistics for motif occurrences.
  • additional statistics, including protein disorder, surface accessibility and hydrophobicity predictions
  • recognition of "n of m" motif elements in the form <X:n:m>, where X is one or more amino acids that must occur n+ times across m positions, e.g. <IL:3:5> must have 3+ Is and/or Ls in a 5aa stretch.

The main output for SLiMProb is a delimited file of motif/peptide occurrences, but the motifaln=T and proteinaln=T options also allow output of alignments of motifs and their occurrences.
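The "n of m" elements mentioned above can be sketched as a sliding-window check (a toy reimplementation for illustration, not the SLiMProb code; <IL:3:5> is used as the example element):

```python
def find_n_of_m(sequence, aas, n, m):
    """Return start positions of every length-m window containing at least n
    residues from aas, i.e. occurrences of an <aas:n:m> motif element."""
    aa_set = set(aas)
    return [i for i in range(len(sequence) - m + 1)
            if sum(res in aa_set for res in sequence[i:i + m]) >= n]

# <IL:3:5>: windows of 5 residues containing 3+ Is and/or Ls
hits = find_n_of_m("MAILLIKDES", "IL", 3, 5)  # matches around the ILLI stretch
```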

CompariMotif

CompariMotif is a piece of software with a single objective: to take two lists of protein motifs and compare them to each other, identifying which motifs have some degree of overlap, and identifying the relationships between those motifs. It can be used to compare a list of motifs with themselves, their reversed selves, or a list of previously published motifs, for example (e.g. ELM (http://elm.eu.org/)). CompariMotif outputs a table of all pairs of matching motifs, along with their degree of similarity (information content) and their relationship to each other.

The best match is used to define the relationship between the two motifs. These relationships are comprised of the following keywords:

Match type keywords identify the type of relationship seen:

  • Exact = all the matches in the two motifs are precise
  • Variant = focal motif contains only exact matches and subvariants of degenerate positions in the other motif
  • Degenerate = the focal motif contains only exact matches and degenerate versions of positions in the other motif
  • Complex = some positions in the focal motif are degenerate versions of positions in the compared motif, while others are subvariants of degenerate positions.

Match length keywords identify the length relationships of the two motifs:

  • Match = both motifs are the same length and match across their entire length
  • Parent = the focal motif is longer and entirely contains the compared motif
  • Subsequence = the focal motif is shorter and entirely contained within the compared motif
  • Overlap = neither motif is entirely contained within the other

This gives sixteen possible classifications for each motif's relationship to the compared motif.
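For motifs without degenerate positions, the length keywords reduce to simple containment checks, as in this illustrative sketch (the real CompariMotif also aligns ambiguous positions; "Overlap" here assumes the pair already shares matched positions):

```python
def length_relationship(focal, compared):
    """Classify the length relationship of two fixed (non-degenerate) motifs
    from the focal motif's point of view."""
    if focal == compared:
        return "Match"        # same length, matching across their entire length
    if compared in focal:
        return "Parent"       # focal motif entirely contains the compared motif
    if focal in compared:
        return "Subsequence"  # focal motif entirely contained within the other
    return "Overlap"          # neither motif entirely contains the other
```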

SLiMMaker

SLiMMaker is a simple tool for generating SLiM patterns from a set of peptides. This method assumes that the peptides are already aligned - no alignment is performed by the tool.
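The idea can be sketched as a per-column vote over the pre-aligned peptides (the thresholds below are illustrative assumptions, not SLiMMaker's actual defaults):

```python
def make_slim(peptides, min_freq=0.75, max_amb=3):
    """Build a simple SLiM pattern from equal-length, pre-aligned peptides.
    A column becomes a fixed position if one residue dominates, an ambiguous
    [XY] position if only a few residues occur, otherwise a wildcard."""
    pattern = ""
    for column in zip(*peptides):
        counts = {}
        for aa in column:
            counts[aa] = counts.get(aa, 0) + 1
        top_aa, top_n = max(counts.items(), key=lambda kv: kv[1])
        if top_n / len(column) >= min_freq:
            pattern += top_aa                          # fixed position
        elif len(counts) <= max_amb:
            pattern += "[%s]" % "".join(sorted(counts))  # ambiguous position
        else:
            pattern += "."                             # too variable: wildcard
    return pattern

# e.g. three aligned peptides sharing a conserved core
slim = make_slim(["LIGAE", "LIGSD", "LVGTE"])
```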

SLiMBench

SLiMBench is a tool under development for generating robust SLiM prediction benchmarking datasets and assessing the subsequent predictions for consistent benchmarking.

SLiMFarmer

SLiMFarmer is a tool under development for running SLiMSuite programs on parallel processors, either using rsh to spawn jobs across different nodes on a supercomputer, or using Python forking to run on multiple CPUs on a single machine. Currently, SLiMFinder, QSLiMFinder and SLiMProb have been tested.

  • ReadMe documentation for release of SLiMSuite software

    Distribution compiled: Wed Jan 7 16:56:17 2015

    Questions/Comments?: please contact seqsuite@gmail.com

    Python Modules in tools/:

    Python Modules in extras/:

    Python Modules in libraries/:

    Python Modules in legacy/:


    Installation Instructions

    1. Place the slimsuite.2015-01-07.tar.gz file in a chosen directory (e.g. c:\bioware\) and unpack.
    2. A subdirectory slimsuite will be created containing all the files necessary to run.

    The software should run on any system that has Python installed. Additional software may be necessary for full functionality. Further details can be found in the manuals supplied.

    Updates since last release:

    • fiesta: Updated from Version 1.8.
    → Version 1.8.1: Replaced type with stype throughout to try and avoid TypeError crashes.
    → Version 1.9.0: Altered HAQDB to be a list of files rather than just one.

    • gablam: Updated from Version 2.14.
    → Version 2.15.0: Added seqnr function. Add run() method.
    → Version 2.16.0: Added fullblast=T/F : Whether to perform full BLAST followed by blastres analysis [False]
    → Version 2.16.1: Fixed a bug where the fullblast option was failing to return scores and evalues.

    • multihaq: Updated from Version 1.1.
    → Version 1.2: Changed defaults to autoskip=F.

    • pingu_V4: Updated from Version 4.2.
    → Version 4.3: Modified to use Pfam as hub field for DomPPI generation. Modified naming of PPI output after ppisource.

    • seqsuite: Created/Renamed.
    → Version 0.0: Initial Compilation.
    → Version 0.1: Added rje_seq and FIESTA. Added Uniprot.
    → Version 1.0: Moved to tools/ for general release. Added HAQESAC and MultiHAQ. Moved mod to enable easy external access.
    → Version 1.1: Added XRef = rje_xref.XRef. Identifier cross-referencing module.
    → Version 1.2: Added taxonomy.
    → Version 1.3.0: Added rje_zen.Zen. Modified code to work with REST services.
    → Version 1.4.0: Added rje_tree.Tree, GABLAM and GOPHER.

    • slimbench: Updated from Version 2.5.
    → Version 2.6: Added ELM domain interactions table: http://www.elm.eu.org/infos/browse_elm_interactiondomains.tsv.
    → Version 2.6: Fixed issues introduced with new SLiMCore V2.0 SLiMSuite code.
    → Version 2.7: Reinstate filtering. (Not sure why disabled.) Add genspec=LIST to filter by species. Added domlink=T/F.
    → Version 2.8.0: Implemented PPIBench benchmarking for datasets without Motifs in name.

    • slimfarmer: Updated from Version 1.3.
    → Version 1.4: Added modules=LIST : List of modules to add in job file [clustalo,mafft]
    → Version 1.4.1: Fixed farm=batch mode for qsub=T.

    • slimmaker: Updated from Version 1.1.
    → Version 1.2.0: Modified to work with REST servers

    • slimmutant: Updated from Version 1.0.
    → Version 1.1: Minor tweaks to generate method to increase speed. (Make index in method.) Added splitfield=X.
    → Version 1.2: Added a batch mode for mutfiles - all other options will be kept fixed. Added maxmutant and minmutant.
    → Version 1.3: Added SLiMPPI analysis (will set analyse=T). Started basing on SLiMCore

    • slimprob: Updated from Version 2.1.
    → Version 2.2.0: Added basic REST functionality.

    • slimsuite: Created/Renamed.
    → Version 0.0: Initial Compilation based on SeqSuite.
    → Version 1.0: Moved to tools/ for general release. Added reading and using of SeqSuite programs.
    → Version 1.1: Added slimlist.
    → Version 1.2: Added SLiMBench.
    → Version 1.3.0: Added SLiMMaker and modified code to work with REST services.

    • rje_dbase: Created/Renamed.
    → Version 1.0: Initial working version for use with rje_uniprot V2.0
    → Version 1.1: Added automated downloading of databases from file
    → Version 1.2: Incorporated RJE_ENSEMBL for handling EnsEMBL genomes and making a one-protein-per-gene proteome.
    → Version 1.3: Tidied code a little and improved comments/docstrings.
    → Version 2.0: Heavily reorganised and modified module.
    → Version 2.1: Added seqfilter=T/F to speed up TaxaDB manufacture.
    → Version 2.2: Added use of rje_seqlist for TaxaDB manufacture.
    → Version 2.3: Added construction of EnsEMBL TaxaDB sets during TaxaDB construction.

    • rje_pydocs: Created/Renamed.
    → Version 2.0: Initial Compilation based on rje_pydocs 1.7 (now rje_pydocs_V1.py), converted to new module framework.
    → Version 2.1: Added R code and INI directory.
    → Version 2.2: Minor bug fixes for distribution making code.
    → Version 2.3: Added Release text (default YYMMDD) to be inserted into downloads.
    → Version 2.4: Implemented reading of version numbers and outputting updates to log.
    → Version 2.5: Converted to make use of settings/ directory in place of libraries/ini/
    → Version 2.6: Minor bug fixes for readme etc. Changed to read ini for packages from ./defaults/.
    → Version 2.7: Added rje_ppi output for module links.
    → Version 2.8: Added parsing of commandline options from docstring and cmdRead calls.
    → Version 2.8: Added docsource=PATH : Input path for Python Module documentation (manuals etc.) ['../docs/']
    → Version 2.9: Attempts to fix some broken links and sort out manuals confusion.
    → Version 2.10: Added legacy/ to default drives for processing.
    → Version 2.11: Added development option for multifile HTML readme output.
    → Version 2.12: Continued modification of the HTML output for SLiMSuite package generation.
    → Version 2.13: Centred logo and altered restricted width (300) to height (120).
    → Version 2.14.0: Added handling of X.Y.Z versions.

    • rje: Updated from Version 4.12.
    → Version 4.13.0: Added new built-in attributes/options for REST services.
    → Version 4.13.1: Fixed MemSaver typo in WarnLog output. Modified mkDir() to avoid clashes raising errors.

    • rje_db: Updated from Version 1.5.
    → Version 1.6: Added option to save a subset of entries using saveToFile(savekeys=LIST).
    → Version 1.7.0: Added splitchar to table splitting.
    → Version 1.7.1: Reinstated raise error if expected table missing.

    • rje_dismatrix_V3: Created/Renamed.
    → Version 0.0: Initial Compilation.
    → Version 1.0: Working version used in several programs.
    → Version 2.0: Updated module inline with newer modules and layout. Incorporation of Minimum Spanning Tree.
    → Version 2.1: Attempted to incorporate pylab heatmap generation from distance matrix.
    → Version 2.2: Added loading of matrix from GABLAM-style database file.
    → Version 2.3: Added UPGMA method.
    → Version 2.4: Added UPGMA branch length return.
    → Version 2.5: Added PNG output based on rje_slimhtml.
    → Version 2.6: Added nsf2nwk=T/F - Whether to convert extension for Newick Standard Format from nsf to nwk (for MEGA)
    → Version 2.7: Added reading of BLOSUM matrix and standalone functionality. Removed pylab heatmap from docstring.
    → Version 2.8: Slightly modified clustering output.
    → Version 2.9: Modified PNG output to use rje_tree code instead. Added Graph output using rje_ppi.
    → Version 2.10: Minor modifications for SLiMCore UPC.
    → Version 3.0: Updated to new rje_obj.RJE_Object class.

    • rje_ensembl: Updated from Version 2.13.
    → Version 2.14: Add enspep=T/F : Create full gnspacc EnsEMBL peptide datasets [False]

    • rje_genbank: Created/Renamed.
    → Version 0.0: Initial Compilation.
    → Version 0.1: Modified and Tidied output a little.
    → Version 0.2: Added details to skip and option to use different detail for protein accession number.
    → Version 0.3: Added reloading of features.
    → Version 1.0: Basic functioning version. Added fetchuid=LIST Genbank retrieval to generate seqin=FILE.
    → Version 1.1: Added use of rje_taxonomy for getting Species Code from TaxID.
    → Version 1.2: Modified to deal with genbank protein entries.
    → Version 1.2.1: Fixed feature bug that was breaking parser and removing trailing '*' from protein sequences.
    → Version 1.2.2: Fixed more features that were breaking parser.

    • rje_obj: Updated from Version 2.0.
    → Version 2.1.0: Added new built-in attributes/options for REST services.

    • rje_ppi: Updated from Version 2.8.
    → Version 2.8.1: Fixed bug with Spring Layout interruption message.

    • rje_qsub: Updated from Version 1.5.
    → Version 1.6: Added modules=LIST : List of modules to add in job file [clustalo,mafft]
    → Version 1.6.1: Added R/3.1.1 to modules.

    • rje_seq: Updated from Version 3.20.
    → Version 3.21.0: Added extraction of uniprot IDs for seqin.

    • rje_seqlist: Updated from Version 1.7.
    → Version 1.8: Added sortseq=X : Whether to sort sequences prior to output (size/invsize/accnum/name/seq/species/desc) [None]
    → Version 1.9.0: Added extra functions for returning sequence AccNum, ID or Species code.
    → Version 1.10.0: Added extraction of uniprot IDs for seqin. Added more dna2prot reformatting options.

    • rje_sequence: Updated from Version 2.3.
    → Version 2.4: Added recognition of modified IPI format. Added standalone low complexity masking.
    → Version 2.4.1: Moved the gnspacc fragment recognition to reduce issues. Should perhaps remove completely?

    • rje_slim: Updated from Version 1.8.
    → Version 1.9: Reinstated ambcut for slimToPattern()

    • rje_slimcalc: Updated from Version 0.8.
    → Version 0.9: Improvements to use of GOPHER.

    • rje_slimcore: Updated from Version 2.2.
    → Version 2.3: Docstring edits. Minor tweak to walltime() to close open files.
    → Version 2.4: Added megaslimfix=T/F : Whether to run megaslim in "fix" mode to tidy/repair existing files [False]
    → Version 2.5: Added (hidden) slimmutant=T/F : Whether to ignore '.p.\D\d+\D' at end of accnum. Made default append=True.
    → Version 2.6.0: Added uniprotid=LIST : Extract IDs/AccNums in list from Uniprot into BASEFILE.dat and use as seqin=FILE. []
    → Version 2.6.1: Removed the maxseq default setting.

    • rje_slimlist: Updated from Version 1.4.
    → Version 1.5: Added run() method for slimsuite.py compatibility. Improved split motif handling.
    → Version 1.6: Modified to read in new ELM class download file with extra header information. Added varlength=T/F filter.
    → Version 1.6: Modified so that filtering one element of a split motif removes all.

    • rje_tree: Updated from Version 2.10.
    → Version 2.11.0: Modified for standalone running as part of SeqSuite.

    • rje_uniprot: Updated from Version 3.19.
    → Version 3.20: Updated dbsplit=T output and checked function with Pfam. Probably needs work for other databases.
    → Version 3.20.1: Added uniprotid=LIST as an alias to acclist=LIST and extract=LIST.
    → Version 3.20.2: Added extra sequence return methods to UniprotEntry. Added fasta REST output.
    → Version 3.20.3: Fixed bug if new uniprot extraction method fails.

    • rje_xml: Created/Renamed.
    → Version 0.0: Initial Compilation.
    → Version 0.1: Added xml.sax functions.
    → Version 0.2: Added parsing from URL.

    • rje_xref: Updated from Version 1.1.
    → Version 1.2: Added join=LIST Run in join mode for list of FILE:key1|...|keyN:JoinField [] and naturaljoin=T/F.
    → Version 1.3.0: Added compress=LIST to handle 1:many input data. []

    • rje_zen: Updated from Version 1.2.
    → Version 1.3.0: Modified output to work with new REST service calls.


    GNU License

    Copyright (C) 2013 RJ Edwards <seqsuite@gmail.com>

    This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA

    Author contact: <seqsuite@gmail.com> / Centre for Biological Sciences, University of Southampton, UK.

    To incorporate this module into your own programs, please see GNU Lesser General Public License disclaimer in rje.py

    PDF Manuals

    Please note that not all programs have manuals and some manuals are out of date. If in doubt, check the readme information for the latest options and default settings. Please report any anomalous behaviour. Suggestions for improvements to programs and documentation are also appreciated.

    Main SLiMSuite programs/modules

    More information can also be found at the SLiMSuite blog.
    The main SLiMSuite programs/modules are: CompariMotif, QSLiMFinder, SLiMBench, SLiMFarmer, SLiMFinder, SLiMMaker, SLiMPred, SLiMPrints, SLiMProb, SLiMSuite. Click on the tabs, read the manuals, or see the ReadMe for more details.

    Program: CompariMotif
    Description: Motif vs Motif Comparison Software
    Version: 3.11
    Last Edit: 12/09/13
    Citation: Edwards, Davey & Shields (2008), Bioinformatics 24(10):1307-9.
    Webserver: http://bioware.ucd.ie/

    Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

    Function:
    CompariMotif is a piece of software with a single objective: to take two lists of regular expression protein motifs
    (typically SLiMs) and compare them to each other, identifying which motifs have some degree of overlap, and
    identifying the relationships between those motifs. It can be used to compare a list of motifs with themselves, their
    reversed selves, or a list of previously published motifs, for example (e.g. ELM (http://elm.eu.org/)). CompariMotif
    outputs a table of all pairs of matching motifs, along with their degree of similarity (information content) and
    their relationship to each other.

    The best match is used to define the relationship between the two motifs. These relationships are comprised of the
    following keywords:

    Match type keywords identify the type of relationship seen:
    * Exact = all the matches in the two motifs are precise
    * Variant = focal motif contains only exact matches and subvariants of degenerate positions in the other motif
    * Degenerate = the focal motif contains only exact matches and degenerate versions of positions in the other motif
    * Complex = some positions in the focal motif are degenerate versions of positions in the compared motif, while
    others are subvariants of degenerate positions.
    * Ugly = the two motifs match at partially (but not wholly) overlapping ambiguous positions (e.g. [AGS] vs [ST]). Such
    matches can be excluded using overlaps=F. (Version 3.8 onwards only.)

    Match length keywords identify the length relationships of the two motifs:
    * Match = both motifs are the same length and match across their entire length
    * Parent = the focal motif is longer and entirely contains the compared motif
    * Subsequence = the focal motif is shorter and entirely contained within the compared motif
    * Overlap = neither motif is entirely contained within the other

    This gives twenty possible classifications for each motif's relationship to the compared motif.

    Input:
    CompariMotif can take input in a number of formats. The preferred format is SLiMSearch format, which is a single line
    motif format: 'Name Sequence #Comments' (Comments are optional and ignored). Alternative inputs include SLiMFinder and
    SLiMDisc output, ELM downloads, raw lists of motifs, and fasta format. Any delimited file with 'Name' and 'Pattern'
    fields should be recognised.

    Complex motifs containing either/or (REGEX1|REGEX2) portions will be split into multiple motifs (marked a, b etc.).
    Similarly, variable numbers of non-wildcard positions will be split, e.g. RK{0,1}R would become RR and RKR. "3of5"
    motif patterns, formatted >R:m:n< where at least m of a stretch of n residues must match R, are also split prior to a
    search being performed. Currently, wildcard spacers are limited to a maximum length of 9.
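The variable-count splitting described above (e.g. RK{0,1}R becoming RR and RKR) can be sketched as follows (a toy version for illustration; it handles only single defined residues with {m,n} counts, not full regex motifs):

```python
import itertools
import re

def split_variable_counts(motif):
    """Expand X{m,n} repeats of a defined residue into separate fixed motifs,
    e.g. 'RK{0,1}R' -> ['RR', 'RKR']."""
    options = []
    # Tokenise into either 'X{m,n}' repeats or single defined positions.
    for aa_rep, lo, hi, aa in re.findall(r"(\w)\{(\d+),(\d+)\}|(\w)", motif):
        if aa:
            options.append([aa])
        else:
            options.append([aa_rep * n for n in range(int(lo), int(hi) + 1)])
    # One motif variant per combination of repeat counts.
    return ["".join(choice) for choice in itertools.product(*options)]

variants = split_variable_counts("RK{0,1}R")  # -> ['RR', 'RKR']
```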

    Output:
    The main output for CompariMotif is a delimited text file containing the following fields:
    * File1 = Name of motifs file (if outstyle=multi)
    * File2 = Name of searchdb file (if outstyle=multi)
    * Name1 = Name of motif from motif file 1
    * Name2 = Name of motif from motif file 2
    * Motif1 = Motif (pattern) from motif file 1
    * Motif2 = Motif (pattern) from motif file 2
    * Sim1 = Description of motif1's relationship to motif2
    * Sim2 = Description of motif2's relationship to motif1
    * Match = Text summary of matched region
    * MatchPos = Number of matched positions between motif1 and motif2 (>= minshare=X)
    * MatchIC = Information content of matched positions
    * NormIC = MatchIC as a proportion of the maximum possible MatchIC (e.g. the lowest IC motif)
    * CoreIC = MatchIC as a proportion of the maximum possible IC in the matched region only.
    * Score = Heuristic score (MatchPos x NormIC) for ranking motif matches
    * Info1 = Ambiguity score of motif1
    * Info2 = Ambiguity score of motif2
    * Desc1 = Description of motif1 (if motdesc = 1 or 3)
    * Desc2 = Description of motif2 (if motdesc = 2 or 3)

    With the exception of the file names, which are only output if outstyle=multi, the above is the output for the
    default "normal" output style. If outstyle=single then only statistics for motif2 (the searchdb motif) are output
    as this is designed for searches using a single motif against a motif database. If outstyle=normalsplit or
    outstyle=multisplit then motif1 information is grouped together, followed by motif2 information, followed by the
    match statistics. More information can be found in the CompariMotif manual.

    Webserver:
    CompariMotif can be run online at http://bioware.ucd.ie.

    Commandline:
    ## Basic Input Parameters ##

        motifs=FILE     : File of input motifs/peptides [None]
        searchdb=FILE   : (Optional) second motif file to compare. Will compare to self if none given. [None]
        dna=T/F         : Whether motifs should be considered as DNA motifs [False]
    

    ## Basic Output Parameters ##

        resfile=FILE    : Name of results file, FILE.compare.tdt. [motifsFILE-searchdbFILE.compare.tdt]
        motinfo=FILE    : Filename for output of motif summary table (if desired) [None]
        motific=T/F     : Output Information Content for motifs [False]
        coreic=T/F      : Whether to output normalised Core IC [True]
        unmatched=T/F   : Whether to output lists of unmatched motifs (not from searchdb) into *.unmatched.txt [False]
    

    ## Motif Comparison Parameters ##

        minshare=X      : Min. number of non-wildcard positions for motifs to share [2]
        normcut=X       : Min. normalised MatchIC for motif match [0.5]
        matchfix=X      : If >0 must exactly match *all* fixed positions in the motifs from:  [0]
                            - 1: input (motifs=FILE) motifs
                            - 2: searchdb motifs
                            - 3: *both* input and searchdb motifs
        ambcut=X        : Max number of choices in ambiguous position before replaced with wildcard (0=use all) [10]
        overlaps=T/F    : Whether to include overlapping ambiguities (e.g. [KR] vs [HK]) as match [True]
        memsaver=T/F    : Run in more efficient memory saver mode. XGMML output not available. [False]

    ## Advanced Motif Input Parameters ##

        minic=X         : Min information content for a motif (1 fixed position = 1.0) [2.0]
        minfix=X        : Min number of fixed positions for a motif to contain [0]
        minpep=X        : Min number of defined positions in a motif [2]
        trimx=T/F       : Trims Xs from the ends of a motif [False]
        nrmotif=T/F     : Whether to remove redundancy in input motifs [False]
        reverse=T/F     : Reverse the input motifs. [False]
                            - If no searchdb given, these will be searched against the "forward" motifs.
        mismatches=X    : <= X mismatches of positions can be tolerated [0]
        aafreq=FILE     : Use FILE to replace uniform AAFreqs (FILE can be sequences or aafreq) [None]    
    

    ## Advanced Motif Output Parameters ##

        xgmml=T/F       : Whether to output XGMML format results [True]
        xgformat=T/F    : Whether to use default CompariMotif formatting or leave blank for e.g. Cytoscape [True]
        pickle=T/F      : Whether to load/save pickle following motif loading/filtering [False]
        motdesc=X       : Sets which motifs have description outputs (0-3 as matchfix option) [3]
        outstyle=X      : Sets the output style for the resfile [normal]
                            - normal = all standard stats are output
                            - multi = designed for multiple appended runs. File names are also output
                            - single = designed for searches of a single motif vs a database. Only motif2 stats are output
                            - reduced = motifs do not have names or descriptions
                            - normalsplit/multisplit = as normal/multi but stats are grouped by motif rather than by type
    

    Uses general modules: os, string, sys, time
    Uses RJE modules: rje, rje_menu, rje_slim, rje_slimlist, rje_xgmml, rje_zen
    Other modules needed: rje_aaprop, rje_dismatrix, rje_seq, rje_blast, rje_pam, rje_sequence, rje_uniprot

    Program: QSLiMFinder
    Description: Query Short Linear Motif Finder
    Version: 2.0
    Last Edit: 11/07/14
    Citation: Edwards, Davey & Shields (2007), PLoS ONE 2(10): e967.

    Copyright © 2008 Richard J. Edwards - See source code for GNU License Notice

    Function:
    QSLiMFinder is a modification of the basic SLiMFinder tool that specifically looks for SLiMs shared by a query
    sequence and one or more additional sequences. To do this, SLiMBuild first identifies all motifs that are present
    in the query sequence before removing it (and its UPC) from the dataset. The rest of the search and statistics
    uses the remainder of the dataset, but considers only motifs found in the query. The final correction for multiple
    testing is made using a motif space defined by the original query sequence, rather than the full potential motif
    space used by the original SLiMFinder. This is offset against the increased probability of the observed motif
    support values due to the reduction of support that results from removing the query sequence, but could
    potentially still identify SLiMs with increased significance.

    Note that minocc and ambocc values *include* the query sequence, e.g. minocc=2 specifies the query and ONE other UPC.
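The query-first logic described above can be illustrated with a toy sketch (hypothetical helper names; the real SLiMBuild also handles amino acid ambiguity, variable-length wildcards, termini and UPC weighting rather than raw sequence counts):

```python
from itertools import combinations

def fixed_motifs(seq, positions=3, maxwild=2):
    """Enumerate fixed-position motifs as tuples of (offset, residue) pairs."""
    found = set()
    span = 1 + (positions - 1) * (maxwild + 1)  # max window a motif can cover
    for i in range(len(seq)):
        window = seq[i:i + span]
        for combo in combinations(range(len(window)), positions):
            if combo[0] != 0:   # anchor the first defined position at i
                continue
            # enforce the maximum wildcard gap between defined positions
            if any(b - a - 1 > maxwild for a, b in zip(combo, combo[1:])):
                continue
            found.add(tuple((j, window[j]) for j in combo))
    return found

def qslim_candidates(query, others, minocc=2):
    """Keep motifs found in the query, then require support in the rest."""
    qmotifs = fixed_motifs(query)
    support = {m: sum(m in fixed_motifs(s) for s in others) for m in qmotifs}
    # minocc *includes* the query, so require minocc-1 other sequences
    return {m for m, n in support.items() if n >= minocc - 1}
```

With minocc=3, a motif must occur in the query plus two other (here, unweighted) sequences to be returned.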

    Commandline: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### Basic Input/Output Options ###

        seqin=FILE      : Sequence file to search [None]
        batch=LIST      : List of files to search, wildcards allowed. (Over-ruled by seqin=FILE.) [*.dat,*.fas]
        query=LIST      : Return only SLiMs that occur in 1+ Query sequences (Name/AccNum/Seq Number) [1]
        addquery=FILE   : Adds query sequence(s) to batch jobs from FILE [None]
        maxseq=X        : Maximum number of sequences to process [500]
        maxupc=X        : Maximum UPC size of dataset to process [0]
        sizesort=X      : Sorts batch files by size prior to running (+1 small->big; -1 big->small; 0 none) [0]
        walltime=X      : Time in hours before program will abort search and exit [1.0]
        resfile=FILE    : Main QSLiMFinder results table [qslimfinder.csv]
        resdir=PATH     : Redirect individual output files to specified directory (and look for intermediates) [QSLiMFinder/]
        buildpath=PATH  : Alternative path to look for existing intermediate files [SLiMFinder/]
        force=T/F       : Force re-running of BLAST, UPC generation and SLiMBuild [False]
        pickup=T/F      : Pick-up from aborted batch run by identifying datasets in resfile using RunID [False]
        dna=T/F         : Whether the sequence files are DNA rather than protein [False]
        alphabet=LIST   : List of characters to include in search (e.g. AAs or NTs) [default AA or NT codes]
        megaslim=FILE   : Make/use precomputed results for a proteome (FILE) in fasta format [None]
        megablam=T/F    : Whether to create and use all-by-all GABLAM results for (gablamdis) UPC generation [False]
    
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### SLiMBuild Options I: Evolutionary Filtering ###

        efilter=T/F     : Whether to use evolutionary filter [True]
        blastf=T/F      : Use BLAST Complexity filter when determining relationships [True]
        blaste=X        : BLAST e-value threshold for determining relationships [1e-4]
        altdis=FILE     : Alternative all by all distance matrix for relationships [None]
        gablamdis=FILE  : Alternative GABLAM results file [None] (!!!Experimental feature!!!)
        homcut=X        : Max number of homologues to allow (to reduce large multi-domain families) [0]
    

    ### SLiMBuild Options II: Input Masking ###

        masking=T/F     : Master control switch to turn off all masking if False [True]
        dismask=T/F     : Whether to mask ordered regions (see rje_disorder for options) [False]
        consmask=T/F    : Whether to use relative conservation masking [False]
        ftmask=LIST     : UniProt features to mask out [EM]
        imask=LIST      : UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) []
        compmask=X,Y    : Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
        casemask=X      : Mask Upper or Lower case [None]
        motifmask=X     : List (or file) of motifs to mask from input sequences []
        metmask=T/F     : Masks the N-terminal M (can be useful if termini=T) [True]
        posmask=LIST    : Masks list of position-specific aas, where list = pos1:aas,pos2:aas  [2:A]
        aamask=LIST     : Masks list of AAs from all sequences (reduces alphabet) []
        qregion=X,Y     : Mask all but the region of the query from (and including) residue X to residue Y [0,-1]
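For example, compmask=5,8 (the default) masks any residue type occurring in five or more of eight consecutive positions. A minimal sketch of this rule (hypothetical function name; the actual rje_slimcore masking may differ in edge-case handling):

```python
def comp_mask(seq, x=5, y=8, maskchar="X"):
    """Mask positions of any amino acid occurring x+ times in a y-residue window."""
    masked = list(seq)
    for i in range(len(seq) - y + 1):
        window = seq[i:i + y]
        for aa in set(window):
            if aa != maskchar and window.count(aa) >= x:
                # mask every occurrence of that residue within the window
                for j, w in enumerate(window):
                    if w == aa:
                        masked[i + j] = maskchar
    return "".join(masked)
```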
    

    ### SLiMBuild Options III: Basic Motif Construction ###

        termini=T/F     : Whether to add termini characters (^ & $) to search sequences [True]
        minwild=X       : Minimum number of consecutive wildcard positions to allow [0]
        maxwild=X       : Maximum number of consecutive wildcard positions to allow [2]
        slimlen=X       : Maximum length of SLiMs to return (no. non-wildcard positions) [5]
        minocc=X        : Minimum number of unrelated occurrences for returned SLiMs. (Proportion of UP if > 1) [0.05]
        absmin=X        : Used if minocc>1 to define absolute min. UP occ [3]
        alphahelix=T/F  : Special i, i+3/4, i+7 motif discovery [False]
    

    ### SLiMBuild Options IV: Ambiguity ###

        ambiguity=T/F   : (preamb=T/F) Whether to search for ambiguous motifs during motif discovery [True]
        ambocc=X        : Min. UP occurrence for subvariants of ambiguous motifs (minocc if 0 or < minocc) [0.05]
        absminamb=X     : Used if ambocc>1 to define absolute min. UP occ [2]
        equiv=LIST      : List (or file) of TEIRESIAS-style ambiguities to use [AGS,ILMVF,FYW,FYH,KRH,DE,ST]
        wildvar=T/F     : Whether to allow variable length wildcards [True]
        combamb=T/F     : Whether to search for combined amino acid degeneracy and variable wildcards [False]
    

    ### SLiMBuild Options V: Advanced Motif Filtering ###

        musthave=LIST   : Returned motifs must contain one or more of the AAs in LIST (reduces search space) []
        focus=FILE      : FILE containing focal groups for SLiM return (see Manual for details) [None]
        focusocc=X      : Motif must appear in X+ focus groups (0 = all) [0]
    
    * See also rje_slimcalc options for occurrence-based calculations and filtering *

    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### SLiMChance Options ###

        cloudfix=T/F    : Restrict output to clouds with 1+ fixed motif (recommended) [False]
        slimchance=T/F  : Execute main QSLiMFinder probability method and outputs [True]
        sigprime=T/F    : Calculate more precise (but more computationally intensive) statistical model [False]
        sigv=T/F        : Use the more precise (but more computationally intensive) fix to mean UPC probability [False]
        qexact=T/F      : Calculate exact Query motif space (True) or over-estimate from dimers (False) (quicker) [True]
        probcut=X       : Probability cut-off for returned motifs [0.1]
        maskfreq=T/F    : Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [False]
        aafreq=FILE     : Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
        aadimerfreq=FILE: Use empirical dimer frequencies from FILE (fasta or *.aadimer.tdt) (!!!Experimental!!!) [None]
        negatives=FILE  : Multiply raw probabilities by under-representation in FILE (!!!Experimental!!!) [None]
        smearfreq=T/F   : Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
        seqocc=T/F      : Whether to upweight for multiple occurrences in same sequence (heuristic) [False]
        probscore=X     : Score to be used for probability cut-off and ranking (Prob/Sig) [Sig]
    
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### Advanced Output Options I: Output data ###

        clouds=X        : Identifies motif "clouds" which overlap at 2+ positions in X+ sequences (0=minocc / -1=off) [2]
        runid=X         : Run ID for resfile (allows multiple runs on same data) [DATE:TIME]
        logmask=T/F     : Whether to log the masking of individual sequences [True]
        slimcheck=FILE  : Motif file/list to add to resfile output [] 
    

    ### Advanced Output Options II: Output formats ###

        teiresias=T/F   : Replace TEIRESIAS, making *.out and *.mask.fasta files [False]
        slimdisc=T/F    : Emulate SLiMDisc output format (*.rank & *.dat.rank + TEIRESIAS *.out & *.fasta) [False]
        extras=X        : Whether to generate additional output files (alignments etc.) [1]
                            - -1 = No output beyond main results file
                            - 0 = Generate occurrence file and cloud file
                            - 1 = Generate occurrence file, alignments and cloud file
                            - 2 = Generate all additional QSLiMFinder outputs
                            - 3 = Generate SLiMDisc emulation too (equiv extras=2 slimdisc=T)
        targz=T/F       : Whether to tar and zip dataset result files (UNIX only) [False]
        savespace=X     : Delete "unnecessary" files following run (best used with targz): [0]
                            - 0 = Delete no files
                            - 1 = Delete all bar *.upc and *.pickle
                            - 2 = Delete all bar *.upc (pickle added to tar)
                            - 3 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)
    

    ### Advanced Output Options III: Additional Motif Filtering ###

        topranks=X      : Will only output top X motifs meeting probcut [1000]
        minic=X         : Minimum information content for returned motifs [2.1]
        allsig=T/F      : Whether to also output all SLiMChance combinations (Sig/SigV/SigPrime/SigPrimeV) [False]
    
    * See also rje_slimcalc options for occurrence-based calculations and filtering *


    Module: SLiMBench
    Description: Short Linear Motif prediction Benchmarking
    Version: 2.8.0
    Last Edit: 02/12/14
    Copyright © 2012 Richard J. Edwards - See source code for GNU License Notice

    Function:
    SLiMBench has two primary functions:

    1. Generating SLiM prediction benchmarking datasets from ELM (or other data in a similar format). This includes
    options for generating random and/or simulated datasets for ROC analysis etc.

    2. Assessing the results of SLiM predictions against a Benchmark. This program is designed to work with SLiMFinder
    and QSLiMFinder output, so some prior results parsing may be needed for other methods.

    If `generate=F benchmark=F`, SLiMBench will check and optionally download the input files but perform no additional
    processing or analysis.

    Please see the SLiMBench manual for more details.

    Commandline:
    ### ~ SOURCE DATA OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        sourcepath=PATH/    : Will look in this directory for input files if not found ['SourceData/']
        sourcedate=DATE     : Source file date (YYYY-MM-DD) to preferentially use [None]
        elmclass=FILE       : Download from ELM website of ELM classes ['elm_classes.tsv']
        elminstance=FILE    : Download from ELM website of ELM instances ['elm_instances.tsv']
        elminteractors=FILE : Download from ELM website of ELM interactors ['elm_interactions.tsv']
        elmdomains=FILE     : Download from ELM website of ELM Pfam domain interactors ['elm_interaction_domains.tsv']
        elmdat=FILE         : File of downloaded UniProt entries (See rje_uniprot for more details) ['ELM.dat']
        ppisource=X         : Source of PPI data. (See documentation for details.) (HINT/FILE) ['HINT']
        ppispec=LIST        : List of PPI files/species/databases to generate PPI datasets from [HUMAN,MOUSE,DROME,YEAST]
        ppid=X              : PPI source protein identifier type (gene/uni/none; will work out from headers if None) [None]
        randsource=FILE     : Source for random/simulated dataset sequences. If species, will extract from UniProt [HUMAN]
        download=T/F        : Whether to download files directly from websites where possible if missing [True]
        integrity=T/F       : Whether to quit by default if source data integrity is breached [True]
        unipath=PATH        : Path to UniProt download. Will query website if "URL" [URL]
    

    ### ~ GENERAL/ELM BENCHMARK GENERATION OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        genpath=PATH        : Output path for datasets generated with SLiMBench file generator [./SLiMBenchDatasets/]
        generate=T/F        : Whether to generate SLiMBench datasets from ELM input. [False]
        genspec=LIST        : Restrict ELM/OccBench datasets to listed species (restricts ELM instances) []
        slimmaker=T/F       : Whether to use SLiMMaker to "reduce" ELMs to more findable SLiMs [True]
        minupc=X            : Minimum number of UPC for benchmark dataset [3]
        maxseq=X            : Maximum number of sequences for benchmark datasets [0]
        minic=X             : Min information content for a motif (1 fixed position = 1.0) [2.0; 1.1 for OccBench]
        filterdir=X         : Directory suffix for filtered benchmarking datasets [_Filtered/]
        queries=T/F         : Whether to generate datasets with specific Query proteins [False]
        flankmask=LIST      : List of flanking mask options (used with queries and simbench) [none,win100,flank5,site]
        elmbench=T/F        : Whether to generate ELM datasets [True]
        ppibench=T/F        : Whether to generate ELM PPI datasets [True]
        domlink=T/F         : Link ELMs to PPI via Pfam domains (True) or (False) just use direct protein links [True]
        itype=X             : Interaction identifier for PPI datasets [first element of ppisource]
        dombench=T/F        : Whether to generate Pfam domain ELM PPI datasets [True]
        occbench=T/F        : Whether to generate ELM OccBench datasets [True]
    

    ### ~ RANDOM/SIMULATION BENCHMARK GENERATION OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        simbench=T/F        : Whether to generate simulated datasets using reduced ELMs (if found) [False]
        ranbench=T/F        : Whether to generate randomised datasets (part of simulation if simbench=T) [False]
        randreps=X          : Number of replicates for each random (or simulated) dataset [8]
        simcount=LIST       : Number of "TPs" to have in dataset [4,8,16]
        simratios=LIST      : List of simulated ELM:Random ratios [0,1,3,7,15,31]
        randir=PATH         : Output path for creation of randomised datasets [./SLiMBenchDatasets/Random/]
        randbase=X          : Base for random dataset name if simbench=F [ran]
        masking=T/F         : Whether to use SLiMCore masking for query selection [True]
        searchini=FILE      : INI file containing SLiMProb search options that restrict returned positives []
        maxseq=X            : Maximum number of randsource sequences for SLiM to hit (also maxaa and maxupc limits) [1000]
    

    ### ~ BENCHMARK ASSESSMENT OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        benchmark=T/F       : Whether to perform SLiMBench benchmarking assessment against motif file [False]
        datatype=X          : Type of data to be generated and/or benchmarked (occ/elm/ppi/sim/simonly) [elm]
        queries=T/F         : Whether the datasets have specific Query proteins [False]
        resfiles=LIST       : List of (Q)SLiMFinder results files to use for benchmarking [*.csv]
        compdb=FILE         : Motif file to be used for benchmarking [elmclass file] (reduced unless occ/ppi)
        occbenchpos=FILE    : File of all positive occurrences for OccBench [genpath/ELM_OccBench/ELM.full.ratings.csv]
        benchbase=X         : Basefile for SLiMBench benchmarking output [slimbench]
        runid=LIST          : List of factors to split RunID column into (on '.') ['Program','Analysis']
        bycloud=X           : Whether to compress results into clouds prior to assessment (True/False/Both) [Both]
        sigcut=LIST         : Significance thresholds to use for assessment [0.1,0.05,0.01,0.001,0.0001]
        iccut=LIST          : Minimum IC for (Q)SLiMFinder results for elm/sim/ppi benchmark assessment [2.0,2.1,3.0]
        slimlencut=LIST     : List of individual SLiM lengths to return results for (0=All) [0,3,4,5]
        noamb=T/F           : Filter out ambiguous patterns [False]
    
    # Add CompariMotif settings here for OT/TP etc.

    ### ~ GENERAL OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        force=T/F           : Whether to force regeneration of outputs (True) or assume existing outputs are right [False]
        backups=T/F         : Whether to (prompt if interactive and) generate backups before overwriting files [True]
    

    See also rje.py generic commandline options.

    Module: SLiMFarmer
    Description: SLiMSuite HPC job farming control program
    Version: 1.4.1
    Last Edit: 17/11/14
    Copyright © 2014 Richard J. Edwards - See source code for GNU License Notice

    Function:
    This module is designed to control and execute parallel processing jobs on an HPC cluster using PBS and QSUB. If
    qsub=T it will generate a job file and use qsub to place that job in the queue using the appropriate parameter
    settings. If `slimsuite=T` and `farm=X` gives a recognised program (below) or `hpcmode` is not `fork` then the qsub
    job will call SLiMFarmer with the same commandline options, plus `qsub=F i=-1 v=-1`. If `seqbyseq=T`, this will be
    run in a special way. (See SeqBySeq mode.) Otherwise `slimsuite=T` indicates that `farm=X` is a SLiMSuite program,
    for which the python call and `pypath` will be added. If this program uses forking then it should parallelise over a
    single multi-processor node. If `farm=X` contains a `/` path separator, this will be added to `pypath`, otherwise it
    will be assumed that `farm` is in `tools/`. If `slimsuite=F` then farm should be a program call to be queued in the
    PBS job file instead.
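The generated job file will look broadly like the following (an illustrative sketch only; the exact directives and paths depend on the hpc=X system and the nodes, ppn, walltime, vmem, modules and qpath settings):

```shell
#!/bin/sh
#PBS -N slimfarmer                 # job=X (job name)
#PBS -l nodes=1:ppn=16             # nodes=X, ppn=X
#PBS -l walltime=12:00:00          # walltime=X (hours)
#PBS -l vmem=127gb                 # vmem=X (GB)
cd $PBS_O_WORKDIR                  # run from the submission directory (qpath=PATH)
module add clustalo mafft R/3.1.1  # modules=LIST
# SLiMFarmer re-invokes itself on the node with qsub=F i=-1 v=-1
python /path/to/slimsuite/tools/slimfarmer.py ini=slimfarmer.ini qsub=F i=-1 v=-1
```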

    Currently recognised SLiMSuite programs for farming: SLiMFinder, QSLiMFinder, SLiMProb, SLiMCore.

    Currently recognised SLiMSuite programs for rsh mode only qsub farming: GOPHER, SLiMSearch, UniFake.

    NOTE: Any commandline options that need bracketing quotes will need to be placed into an ini file. This can either
    be the ini file used by SLiMFarmer, or a `jobini=FILE` that will only be used by the farmed programs. Note that
    commands in `slimfarmer.ini` will not be passed on to other SLiMSuite programs unless `ini=slimfarmer.ini` is given
    as a commandline argument.

    The runid=X setting is important for SLiMSuite job farming as this is what separates different parameter setting
    combinations run on the same data and is also used for identifying which datasets have already been run. Running
    several jobs on the same data with the same SLiMSuite program but different parameter settings will therefore
    cause problems unless each run is given a distinct runid. If runid is not set, it will default to the job=X setting.

    The hpcmode=X setting determines the method used for farming out jobs across the nodes. hpcmode=rsh uses rsh to spawn
    the additional processes out to other nodes, based on a script written for the IRIDIS HPC by Ivan Wolton.
    hpcmode=fork will restrict analysis to a single node and use Python forking to distribute jobs. This can be used even
    on a single multi-processor machine to fork out SLiMSuite jobs. basefile=X will set the log, RunID, ResFile, ResDir
    and Job: RunID and Job will have path stripped; ResFile will have .csv appended.

    Initially, it will call other programs but, in time, it is envisaged that other programs will make use of SLiMFarmer
    and have parallelisation built-in.

    SeqBySeq Mode:
    In SeqBySeq mode, the program assumes that seqin=FILE and basefile=X are given and farm=X states the Python program
    to be run, which should be a SLiMSuite program. (The SLiMSuite subdirectory will also need to be given unless
    slimsuite=F, in which case the whole path to the program should be given. pypath=PATH can set an alternative path.)

    Seqin will then be worked through in turn and each sequence farmed out to the farm program. Outputs given by OutList
    are then compiled, as is the Log, into the correct basefile=X given. In the case of *.csv and *.tdt files, the header
    row is copied for the first file and then excluded for all subsequent files. For all other files extensions, the
    whole output is copied.
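The compilation rule for *.csv/*.tdt outputs can be sketched as follows (hypothetical function operating on in-memory rows; the real SLiMFarmer reads and writes the files themselves and also compiles the log):

```python
def compile_tables(outputs):
    """Concatenate per-sequence table outputs, keeping the header row only once.

    outputs: list of row lists, one per farmed run, each starting with a header.
    """
    compiled = []
    for n, rows in enumerate(outputs):
        if n > 0:
            rows = rows[1:]   # drop the repeated header row of later files
        compiled.extend(rows)
    return compiled
```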

    Commandline:
    ### ~ Basic QSub Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        qsub=T/F        : Whether to execute QSub PBS job creation and queuing [False]
        jobini=FILE     : Ini file to pass to the farmed HPC jobs with SLiMFarmer options. Overrides commandline. [None]
        slimsuite=T/F   : Whether program is an RJE *.py script (adds log processing) [True]
        nodes=X         : Number of nodes to run on [1]
        ppn=X           : Processors per node [16]
        walltime=X      : Walltime for qsub job (hours) [12]
        vmem=X          : Virtual Memory limit for run (GB) [127]
        job=X           : Name of job file (.job added) [slimfarmer]
    

    ### ~ Advanced QSub Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        hpc=X           : Name of HPC system ['katana']
        pypath=PATH     : Path to python modules [slimsuite home directory]
        qpath=PATH      : Path to change directory to [current path]
        pause=X         : Wait X seconds before attempting showstart [5]
        email=X         : Email address to email job stats to at end ['']
        depend=LIST     : List of job ids to wait for before starting job (dependhpc=X added) []
        dependhpc=X     : Name of HPC system for depend ['blue30.iridis.soton.ac.uk']
        report=T/F      : Pull out running job IDs and run showstart [False]
        modules=LIST    : List of modules to add in job file [clustalo,mafft,R/3.1.1]
    

    ### ~ Main SLiMFarmer Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        farm=X          : Execute a special SLiMFarm analysis on HPC [batch]
                            - batch will farm out a batch list of commands read in from subjobs=LIST
                            - gopher/slimfinder/qslimfinder/slimprob/slimcore/slimsearch/unifake = special SLiMSuite HPC run.
                            - if seqbyseq=T, farm=X will specify the program to be run (see docs)
                            - otherwise, farm=X will be executed as a system call in place of SLiMFarmer
        hpcmode=X       : Mode to be used for farming jobs between nodes (rsh/fork) [fork]
        forks=X         : Number of forks to be used when hpcmode=fork and qsub=F. [1]
        jobini=FILE     : Ini file to pass to the farmed SLiMSuite run. (Also used for SLiMFarmer options if qsub=T.) [None]
    

    ### ~ Standard HPC Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        subsleep=X      : Sleep time (seconds) between cycles of subbing out jobs to hosts [1]
        subjobs=LIST    : List of subjobs to farm out to HPC cluster []
        iolimit=X       : Limit of number of IOErrors before termination [50]
        memfree=X       : Min. proportion of node memory to be free before spawning job [0.1]
        test=T/F        : Whether to produce extra output in "test" mode [False]
        keepfree=X      : Number of processors to keep free on head node [1]
    

    ### ~ SeqBySeq Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        seqbyseq=T/F    : Activate seqbyseq mode - assumes basefile=X option used for output [False]
        seqin=FILE      : Input sequence file to farm out [None]
        basefile=X      : Base for output files - compiled from individual run results [None]
        outlist=LIST    : List of extensions of outputs to add to basefile for output (basefile.*) []
        pickhead=X      : Header to extract from OutList file, used to populate the list of AccNum to skip []
    

    ### ~ SLiMSuite Farming Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        runid=X         : Text identifier for SLiMSuite job farming [`job`]
        resfile=FILE    : Main output file for SLiMSuite run [`farm`.csv]
        pickup=T/F      : Whether to pickup previous run based on existing results and RunID [True]
        sortrun=T/F     : Whether to sort input files by size and run big->small to avoid hang at end [True]
        loadbalance=T/F : Whether to split SortRun jobs equally between large & small to avoid memory issues [True]
        basefile=X      : Set the log, RunID, ResFile, ResDir and Job to X [None].
    

    ### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
    See also rje.py generic commandline options.

    Program: SLiMFinder
    Description: Short Linear Motif Finder
    Version: 5.1
    Last Edit: 19/08/14
    Citation: Edwards, Davey & Shields (2007), PLoS ONE 2(10): e967.
    ConsMask Citation: Davey NE, Shields DC & Edwards RJ (2009), Bioinformatics 25(4): 443-50.
    SigV/SigPrime Citation: Davey NE, Edwards RJ & Shields DC (2010), BMC Bioinformatics 11: 14.

    Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

    Function:
    Short linear motifs (SLiMs) in proteins are functional microdomains of fundamental importance in many biological
    systems. SLiMs typically consist of a 3 to 10 amino acid stretch of the primary protein sequence, of which as few
    as two sites may be important for activity, making identification of novel SLiMs extremely difficult. In particular,
    it can be very difficult to distinguish a randomly recurring "motif" from a truly over-represented one. Incorporating
    ambiguous amino acid positions and/or variable-length wildcard spacers between defined residues further complicates
    the matter.

    SLiMFinder is an integrated SLiM discovery program building on the principles of the SLiMDisc software for accounting
    for evolutionary relationships [Davey NE, Shields DC & Edwards RJ (2006): Nucleic Acids Res. 34(12):3546-54].
    SLiMFinder is comprised of two algorithms:

    SLiMBuild identifies convergently evolved, short motifs in a dataset. Motifs with fixed amino acid positions are
    identified and then combined to incorporate amino acid ambiguity and variable-length wildcard spacers. Unlike
    programs such as TEIRESIAS, which return all shared patterns, SLiMBuild accelerates the process and reduces returned
    motifs by explicitly screening out motifs that do not occur in enough unrelated proteins. For this, SLiMBuild uses
    the "Unrelated Proteins" (UP) algorithm of SLiMDisc in which BLAST is used to identify pairwise relationships.
    Proteins are then clustered according to these relationships into "Unrelated Protein Clusters" (UPCs), which are
    defined such that no protein in a UPC has a BLAST-detectable relationship with a protein in another UPC. If desired,
    SLiMBuild can be used as a replacement for TEIRESIAS in other software (teiresias=T slimchance=F).
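The UPC construction described above can be sketched as single-linkage clustering over pairwise BLAST relationships (a toy illustration with hypothetical names; the real UP algorithm works from actual BLAST results and also computes MST-corrected cluster sizes):

```python
def upc_clusters(proteins, blast_pairs):
    """Cluster proteins so no BLAST-detectable pair spans two clusters."""
    parent = {p: p for p in proteins}

    def find(p):                      # union-find with path compression
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in blast_pairs:          # union every related pair
        parent[find(a)] = find(b)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(clusters.values(), key=lambda c: sorted(c))
```

A-B and B-C hits place A, B and C in one UPC even though A and C share no direct hit; an unrelated protein D forms its own UPC.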

    SLiMChance estimates the probability of these motifs arising by chance, correcting for the size and composition of
    the dataset, and assigns a significance value to each motif. Motif occurrence probabilities are calculated
    independently for each UPC, adjusted for the size of the UPC using the Minimum Spanning Tree algorithm from
    SLiMDisc. These individual occurrence probabilities are then converted into the total probability of seeing the
    observed motif the observed number of (unrelated) times. These probabilities assume that the motif is known before
    the search. In reality, only over-represented motifs from the dataset are examined, so these probabilities are
    adjusted for the size of motif-space searched to give a significance value: an estimate of the probability of
    seeing that motif, or another one like it. These values are calculated separately for each length of motif. Where pre-known
    motifs are also of interest, these can be given with the slimcheck=MOTIFS option and will be added to the output.
    SLiMFinder version 4.0 introduced a more precise (but more computationally intensive) statistical model, which can
    be switched on using sigprime=T. Likewise, the more precise (but more computationally intensive) correction to the
    mean UPC probability heuristic can be switched on using sigv=T. (Note that the other SLiMChance options may not
    work with either of these options.) The allsig=T option will output all four scores. In this case, SigPrimeV will be
    used for ranking etc. unless probscore=X is used.
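A simplified sketch of the SLiMChance logic (illustrative only: it uses the mean-UPC binomial heuristic mentioned above and an assumed motif-space correction of the form 1-(1-p)^space; the real calculation derives per-UPC probabilities from amino acid frequencies and MST-adjusted UPC sizes):

```python
from math import comb

def support_probability(p_upc, n_upc, k):
    """P(motif occurs in k+ of n_upc unrelated clusters), binomial on mean p."""
    p = sum(p_upc) / len(p_upc)           # mean per-UPC occurrence probability
    return sum(comb(n_upc, i) * p**i * (1 - p)**(n_upc - i)
               for i in range(k, n_upc + 1))

def significance(prob, motif_space):
    """Adjust for the number of motifs of this length that could have been seen."""
    return 1.0 - (1.0 - prob) ** motif_space
```

Note how a seemingly impressive raw probability is deflated once the size of the motif space searched is taken into account.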

    Where significant motifs are returned, SLiMFinder will group them into Motif "Clouds", which consist of physically
    overlapping motifs (2+ non-wildcard positions are the same in the same sequence). This provides an easy indication
    of which motifs may actually be variants of a larger SLiM and should therefore be considered together.
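Cloud grouping can be sketched as single-linkage clustering over shared occurrence positions (a toy version with hypothetical data structures; the real clouds=X option additionally requires the overlap to occur in X+ sequences):

```python
def motif_clouds(occurrences, minshare=2):
    """Group motifs whose occurrences share minshare+ defined positions.

    occurrences: dict mapping motif name -> set of (sequence, position) pairs
    covered by its non-wildcard positions.
    """
    motifs = list(occurrences)
    clouds = [{m} for m in motifs]
    for i, a in enumerate(motifs):
        for b in motifs[i + 1:]:
            if len(occurrences[a] & occurrences[b]) >= minshare:
                # merge the clouds containing a and b
                ca = next(c for c in clouds if a in c)
                cb = next(c for c in clouds if b in c)
                if ca is not cb:
                    clouds.remove(cb)
                    ca |= cb
    return clouds
```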

    Additional Motif Occurrence Statistics, such as motif conservation, are handled by the rje_slimlist module. Please
    see the documentation for this module for a full list of commandline options. These options are currently under
    development for SLiMFinder and are not fully supported. See the SLiMFinder Manual for further details. Note that the
    OccFilter *does* affect the motifs returned by SLiMBuild and thus the TEIRESIAS output (as does min. IC and min.
    Support) but the overall Motif StatFilter *only* affects SLiMFinder output following SLiMChance calculations.

    Secondary Functions:
    The "MotifSeq" option will output fasta files for a list of X:Y, where X is a motif pattern and Y is the output file.

    The "Randomise" function will take a set of input datasets (as in Batch Mode) and regenerate a set of new datasets
    by shuffling the UPC among datasets. Note that, at this stage, this is quite crude and may result in the final
    datasets having fewer UPC due to common sequences and/or relationships between UPC clusters in different datasets.

    Commandline: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### Basic Input/Output Options ###

        seqin=FILE      : Sequence file to search [None]
        batch=LIST      : List of files to search, wildcards allowed. (Over-ruled by seqin=FILE.) [*.dat,*.fas]
        maxseq=X        : Maximum number of sequences to process [500]
        maxupc=X        : Maximum UPC size of dataset to process [0]
        sizesort=X      : Sorts batch files by size prior to running (+1 small->big; -1 big->small; 0 none) [0]
        walltime=X      : Time in hours before program will abort search and exit [1.0]
        resfile=FILE    : Main SLiMFinder results table [slimfinder.csv]
        resdir=PATH     : Redirect individual output files to specified directory (and look for intermediates) [SLiMFinder/]
        buildpath=PATH  : Alternative path to look for existing intermediate files [SLiMFinder/]
        force=T/F       : Force re-running of BLAST, UPC generation and SLiMBuild [False]
        pickup=T/F      : Pick-up from aborted batch run by identifying datasets in resfile [False]
        pickid=T/F      : Whether to use RunID to identify run datasets when using pickup [True]
        pickall=T/F     : Whether to skip aborted runs (True) or only those datasets that ran to completion (False) [True]
        dna=T/F         : Whether the sequences files are DNA rather than protein [False]
        alphabet=LIST   : List of characters to include in search (e.g. AAs or NTs) [default AA or NT codes]
        megaslim=FILE   : Make/use precomputed results for a proteome (FILE) in fasta format [None]
        megablam=T/F    : Whether to create and use all-by-all GABLAM results for (gablamdis) UPC generation [False]
    

    SLiMBuild Options: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### SLiMBuild Options I: Evolutionary Filtering ###

        efilter=T/F     : Whether to use evolutionary filter [True]
        blastf=T/F      : Use BLAST Complexity filter when determining relationships [True]
        blaste=X        : BLAST e-value threshold for determining relationships [1e-4]
        altdis=FILE     : Alternative all by all distance matrix for relationships [None]
        gablamdis=FILE  : Alternative GABLAM results file [None] (!!!Experimental feature!!!)
        homcut=X        : Max number of homologues to allow (to reduce large multi-domain families) [0]
        newupc=PATH     : Look for alternative UPC file and calculate Significance using new clusters [None]
    

    ### SLiMBuild Options II: Input Masking ###

        masking=T/F     : Master control switch to turn off all masking if False [True]
        dismask=T/F     : Whether to mask ordered regions (see rje_disorder for options) [False]
        consmask=T/F    : Whether to use relative conservation masking [False]
        ftmask=LIST     : UniProt features to mask out [EM]
        imask=LIST      : UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) []
        compmask=X,Y    : Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
        casemask=X      : Mask Upper or Lower case [None]
        motifmask=X     : List (or file) of motifs to mask from input sequences []
        metmask=T/F     : Masks the N-terminal M (can be useful if termini=T) [True]
        posmask=LIST    : Masks list of position-specific aas, where list = pos1:aas,pos2:aas  [2:A]
        aamask=LIST     : Masks list of AAs from all sequences (reduces alphabet) []
        qregion=X,Y     : Mask all but the region of the query from (and including) residue X to residue Y [0,-1]
    

    ### SLiMBuild Options III: Basic Motif Construction ###

        termini=T/F     : Whether to add termini characters (^ & $) to search sequences [True]
        minwild=X       : Minimum number of consecutive wildcard positions to allow [0]
        maxwild=X       : Maximum number of consecutive wildcard positions to allow [2]
        slimlen=X       : Maximum length of SLiMs to return (no. non-wildcard positions) [5]
        minocc=X        : Minimum number of unrelated occurrences for returned SLiMs. (Proportion of UP if > 1) [0.05]
        absmin=X        : Used if minocc>1 to define absolute min. UP occ [3]
        alphahelix=T/F  : Special i, i+3/4, i+7 motif discovery [False]
        fixlen=T/F      : If true, will use maxwild and slimlen to define a fixed total motif length [False]
        palindrome=T/F  : Special DNA mode that will search for palindromic sequences only [False]
    

    ### SLiMBuild Options IV: Ambiguity ###

        ambiguity=T/F   : (preamb=T/F) Whether to search for ambiguous motifs during motif discovery [True]
        ambocc=X        : Min. UP occurrence for subvariants of ambiguous motifs (minocc if 0 or < minocc) [0.05]
        absminamb=X     : Used if ambocc>1 to define absolute min. UP occ [2]
        equiv=LIST      : List (or file) of TEIRESIAS-style ambiguities to use [AGS,ILMVF,FYW,FYH,KRH,DE,ST]
        wildvar=T/F     : Whether to allow variable length wildcards [True]
        combamb=T/F     : Whether to search for combined amino acid degeneracy and variable wildcards [False]
    

    ### SLiMBuild Options V: Advanced Motif Filtering ###

        altupc=PATH     : Look for alternative UPC file and filter based on minocc [None]
        musthave=LIST   : Returned motifs must contain one or more of the AAs in LIST (reduces search space) []
        query=LIST      : Return only SLiMs that occur in 1+ Query sequences (Name/AccNum) []
        focus=FILE      : FILE containing focal groups for SLiM return (see Manual for details) [None]
        focusocc=X      : Motif must appear in X+ focus groups (0 = all) [0]
    
    * See also rje_slimcalc options for occurrence-based calculations and filtering *

    SLiMChance Options: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

        cloudfix=T/F    : Restrict output to clouds with 1+ fixed motif (recommended) [False]
        slimchance=T/F  : Execute main SLiMFinder probability method and outputs [True]
        sigprime=T/F    : Calculate more precise (but more computationally intensive) statistical model [False]
        sigv=T/F        : Use the more precise (but more computationally intensive) fix to mean UPC probability [False]
        dimfreq=T/F     : Whether to use dimer masking pattern to adjust number of possible sites for motif [True]
        probcut=X       : Probability cut-off for returned motifs (sigcut=X also recognised) [0.1]
        maskfreq=T/F    : Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [True]
        aafreq=FILE     : Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
        aadimerfreq=FILE: Use empirical dimer frequencies from FILE (fasta or *.aadimer.tdt) (!!!Experimental!!!) [None]
        negatives=FILE  : Multiply raw probabilities by under-representation in FILE (!!!Experimental!!!) [None]
        smearfreq=T/F   : Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
        seqocc=T/F      : Whether to upweight for multiple occurrences in same sequence (heuristic) [False]
        probscore=X     : Score to be used for probability cut-off and ranking (Prob/Sig/S/R) [Sig]
    

    Advanced Output Options: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### Advanced Output Options I: Output data ###

        clouds=X        : Identifies motif "clouds" which overlap at 2+ positions in X+ sequences (0=minocc / -1=off) [2]
        runid=X         : Run ID for resfile (allows multiple runs on same data) [DATE]
        logmask=T/F     : Whether to log the masking of individual sequences [True]
        slimcheck=FILE  : Motif file/list to add to resfile output [] 
    

    ### Advanced Output Options II: Output formats ###

        teiresias=T/F   : Replace TEIRESIAS, making *.out and *.mask.fasta files [False]
        slimdisc=T/F    : Emulate SLiMDisc output format (*.rank & *.dat.rank + TEIRESIAS *.out & *.fasta) [False]
        extras=X        : Whether to generate additional output files (alignments etc.) [1]
                            - -1 = No output beyond main results file
                            - 0 = Generate occurrence file
                            - 1 = Generate occurrence file, alignments and cloud file
                            - 2 = Generate all additional SLiMFinder outputs
                            - 3 = Generate SLiMDisc emulation too (equiv extras=2 slimdisc=T)
        targz=T/F       : Whether to tar and zip dataset result files (UNIX only) [False]
        savespace=X     : Delete "unnecessary" files following run (best used with targz): [0]
                            - 0 = Delete no files
                            - 1 = Delete all bar *.upc and *.pickle
                            - 2 = Delete all bar *.upc (pickle added to tar)
                            - 3 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)
    

    ### Advanced Output Options III: Additional Motif Filtering ###

        topranks=X      : Will only output top X motifs meeting probcut [1000]
        oldscores=T/F   : Whether to also output old SLiMDisc score (S) and SLiMPickings score (R) [False]
        allsig=T/F      : Whether to also output all SLiMChance combinations (Sig/SigV/SigPrime/SigPrimeV) [False]
        minic=X         : Minimum information content for returned motifs [2.1]
    
    * See also rje_slimcalc options for occurrence-based calculations and filtering *

    Additional Functions: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### Additional Functions I: MotifSeq ###

        motifseq=LIST   : Outputs fasta files for a list of X:Y, where X is the pattern and Y is the output file []
        slimbuild=T/F   : Whether to build motifs with SLiMBuild. (For combination with motifseq only.) [True]
    

    ### Additional Functions II: Randomised datasets ###

        randomise=T/F   : Randomise UPC within batch files and output new datasets [False]
        randir=PATH     : Output path for creation of randomised datasets [Random/]
        randbase=X      : Base for random dataset name [rand]
    
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

    Uses general modules: copy, glob, math, os, string, sys, time
    Uses RJE modules: rje, rje_blast, rje_slim, rje_slimlist, rje_slimcalc, rje_slimcore, rje_dismatrix_V2, rje_seq, rje_scoring
    Other modules needed: None

    Module: SLiMMaker
    Description: SLiM generator from aligned peptide sequences
    Version: 1.2.0
    Last Edit: 03/11/14
    Copyright © 2011 Richard J. Edwards - See source code for GNU License Notice

    Function:
    This program has a fairly simple function of reading in a set of sequences and generating a regular expression motif
    from them. It is designed with protein sequences in mind but should work for DNA sequences too. Input sequences can
    be in fasta format or just plain text (with no sequence headers) and should be aligned already. Gapped positions will
    be ignored (treated as Xs) and variable length wildcards are not returned.

    SLiMMaker considers each column of the input in turn and compresses it into a regular expression element according to
    some simple rules, screening out rare amino acids and converting particularly degenerate positions into wildcards.
    Each amino acid in the column that occurs at least X times (as defined by minseq=X) is considered for the regular
    expression definition for that position. The full set of amino acids meeting this criterion is then assessed for
    whether to keep it as a defined position, or convert into a wildcard. First, if the number of different amino acids
    meeting this criterion is zero or above a second threshold (maxaa=X), the position is defined as a wildcard. Second,
    the proportion of input sequences matching the amino acid set is compared to a minimum frequency criterion
    (minfreq=X). Failing to meet this minimum frequency will again result in a wildcard. Otherwise, the amino acid set is
    added to the SLiM definition as either a fixed position (if only one amino acid met the minseq criterion) or as a
    degenerate position. Finally, leading and trailing wildcards are removed.

    By default, each defined position in a motif will contain amino acids that (a) occur in at least three sequences
    each, (b) have a combined frequency of <=75%, and (c) have 5 or fewer different amino acids (that occur in 3+
    sequences).
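    As an illustration only, the per-column rules above can be sketched in Python. This is a hypothetical reimplementation of the described thresholds, not SLiMMaker's actual code; the parameter names mirror the commandline options listed below.

```python
from collections import Counter

def slimmaker(peptides, minseq=3, minfreq=0.75, maxaa=5, ignore="X-"):
    """Sketch of the SLiMMaker column rules described above (illustrative only)."""
    pattern = []
    for column in zip(*peptides):  # assumes pre-aligned, equal-length peptides
        counts = Counter(aa for aa in column if aa not in ignore)
        kept = sorted(aa for aa, n in counts.items() if n >= minseq)
        # Rule 1: zero qualifying amino acids, or more than maxaa -> wildcard
        if not kept or len(kept) > maxaa:
            pattern.append(".")
        # Rule 2: combined frequency of the kept set must reach minfreq
        elif sum(counts[aa] for aa in kept) / len(peptides) < minfreq:
            pattern.append(".")
        elif len(kept) == 1:
            pattern.append(kept[0])                 # fixed position
        else:
            pattern.append("[%s]" % "".join(kept))  # degenerate position
    # Finally, leading and trailing wildcards are removed
    return "".join(pattern).strip(".")
```

    For example, four aligned hexapeptides sharing a conserved LIGDE core would return "LIGDE", with the variable final position stripped as a trailing wildcard.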

    Note. Unless the "iterate" function is used, the final motif only contains defined positions that match a given
    frequency of the input (75% by default). Because positions are considered independently, however, the final motif
    might occur in fewer than 75% of the input sequences. SLiMSearch can be used to check the occurrence stats.

    Commandline:
    ### ~ SLiMMaker Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        peptides=LIST   : These can be entered as a list or a file. If a file, lines following '#' or '<' are ignored
        minseq=X        : Min. no. of sequences for an aa to be in [3]
        minfreq=X       : Min. combined freq of accepted aa to avoid wildcard [0.75]
        maxaa=X         : Max. no. different amino acids for one position [5]
        ignore=X        : Amino acid(s) to ignore. (If nucleotide, would be N-) ['X-']
        dna=T/F         : Whether "peptides" are actually DNA fragments [False]
        iterate=T/F     : Whether to perform iterative SLiMMaker, re-running on matched peptides with each iteration [False]
    

    See also rje.py generic commandline options.

    Uses general modules: copy, glob, os, string, sys, time
    Uses RJE modules: rje, rje_obj, rje_slim, rje_zen
    Other modules needed: None

    Program: SLiMProb
    Description: Short Linear Motif Probability tool
    Version: 2.2.0
    Last Edit: 15/11/14
    Citation: Davey, Haslam, Shields & Edwards (2010), Lecture Notes in Bioinformatics 6282: 50-61.
    Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

    Function:
    SLiMProb is a tool for finding pre-defined SLiMs (Short Linear Motifs) in a protein sequence database. SLiMProb
    can make use of corrections for evolutionary relationships and a variation of the SLiMChance algorithm from
    SLiMFinder to assess motifs for statistical over- and under-representation. SLiMProb is a replacement for the
    original SLiMSearch, which itself was a replacement for PRESTO. The basic architecture is the same but it was felt
    that having two different "SLiMSearch" servers was confusing.

    Benefits of SLiMProb that make it more useful than a lot of existing tools include:
    * searching with mismatches rather than restricting hits to perfect matches.
    * optional equivalency files for searching with specific allowed mismatches (e.g. charge conservation)
    * generation or reading of alignment files from which to calculate conservation statistics for motif occurrences.
    * additional statistics, including protein disorder, surface accessibility and hydrophobicity predictions
    * recognition of "n of m" motif elements in the form <X:n:m>, where X is one or more amino acids that must occur n+
    times across m positions. E.g. <IL:3:5> must have 3+ Is and/or Ls in a 5aa stretch.

    Main output for SLiMProb is a delimited file of motif/peptide occurrences, but the motifaln=T and proteinaln=T
    options also allow output of alignments of motifs and their occurrences. The primary outputs are named *.occ.csv for
    the occurrence data and *.csv for the summary data for each motif/dataset pair. (This is a change since SLiMSearch.)

    Commandline: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### Basic Input/Output Options ###

        motifs=FILE     : File of input motifs/peptides [None]
    
    Single line per motif format = 'Name Sequence #Comments' (Comments are optional and ignored)
    Alternative formats include fasta, SLiMDisc output and raw motif lists.

        seqin=FILE      : Sequence file to search [None]
        batch=LIST      : List of sequence files for batch input (wildcard * permitted) []
        maxseq=X        : Maximum number of sequences to process [0]
        maxsize=X       : Maximum dataset size to process in AA (or NT) [100,000]
        maxocc=X        : Filter out Motifs with more than maximum number of occurrences [0]
        walltime=X      : Time in hours before program will abort search and exit [1.0]
        resfile=FILE    : Main SLiMProb results table (*.csv and *.occ.csv) [slimprob.csv]
        resdir=PATH     : Redirect individual output files to specified directory (and look for intermediates) [SLiMProb/]
        buildpath=PATH  : Alternative path to look for existing intermediate files [SLiMProb/]
        force=T/F       : Force re-running of BLAST, UPC generation and search [False]
        dna=T/F         : Whether the sequences files are DNA rather than protein [False]
        alphabet=LIST   : List of characters to include in search (e.g. AAs or NTs) [default AA or NT codes]
        megaslim=FILE   : Make/use precomputed results for a proteome (FILE) in fasta format [None]
        megablam=T/F    : Whether to create and use all-by-all GABLAM results for (gablamdis) UPC generation [False]
    
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### SearchDB Options I: Input Protein Sequence Masking ###

        masking=T/F     : Master control switch to turn off all masking if False [False]
        dismask=T/F     : Whether to mask ordered regions (see rje_disorder for options) [False]
        consmask=T/F    : Whether to use relative conservation masking [False]
        ftmask=LIST     : UniProt features to mask out [EM,DOMAIN,TRANSMEM]
        imask=LIST      : UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) []
        compmask=X,Y    : Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
        casemask=X      : Mask Upper or Lower case [None]
        motifmask=X     : List (or file) of motifs to mask from input sequences []
        metmask=T/F     : Masks the N-terminal M [False]
        posmask=LIST    : Masks list of position-specific aas, where list = pos1:aas,pos2:aas  [2:A]
        aamask=LIST     : Masks list of AAs from all sequences (reduces alphabet) []
    

    ### SearchDB Options II: Evolutionary Filtering ###

        efilter=T/F     : Whether to use evolutionary filter [False]
        blastf=T/F      : Use BLAST Complexity filter when determining relationships [True]
        blaste=X        : BLAST e-value threshold for determining relationships [1e-4]
        altdis=FILE     : Alternative all by all distance matrix for relationships [None]
        gablamdis=FILE  : Alternative GABLAM results file [None] (!!!Experimental feature!!!)
        occupc=T/F      : Whether to output the UPC ID number in the occurrence output file [False]
    
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### SLiMChance Options ###

        maskfreq=T/F    : Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [True]
        aafreq=FILE     : Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
        aadimerfreq=FILE: Use empirical dimer frequencies from FILE (fasta or *.aadimer.tdt) [None]
        negatives=FILE  : Multiply raw probabilities by under-representation in FILE [None]
        background=FILE : Use observed support in background file for over-representation calculations [None]
        smearfreq=T/F   : Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
        seqocc=X        : Restrict to sequences with X+ occurrences (adjust for high frequency SLiMs) [1]
    
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
    ### Output Options ###

        extras=X        : Whether to generate additional output files (alignments etc.) [2]
                            - 0 = No output beyond main results file
                            - 1 = Saved masked input sequences [*.masked.fas]
                            - 2 = Generate additional outputs (alignments etc.)
                            - 3 = Additional distance matrices for input sequences
        pickle=T/F      : Whether to save/use pickles [True]
        targz=T/F       : Whether to tar and zip dataset result files (UNIX only) [False]
        savespace=X     : Delete "unnecessary" files following run (best used with targz): [0]
                            - 0 = Delete no files
                            - 1 = Delete all bar *.upc and *.pickle files
                            - 2 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)
    
    * See also rje_slimcalc options for occurrence-based calculations and filtering *
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

    Uses general modules: copy, glob, os, string, sys, time
    Uses RJE modules: rje
    Other modules needed: None

    Introduction

    Short linear motifs (SLiMs) in proteins are functional microdomains of fundamental importance in many biological systems. SLiMs typically consist of a 3 to 10 amino acid stretch of the primary protein sequence, of which as few as two sites may be important for activity, making identification of novel SLiMs extremely difficult. In particular, it can be very difficult to distinguish a randomly recurring "motif" from a truly over-represented one. Incorporating ambiguous amino acid positions and/or variable-length wildcard spacers between defined residues further complicates the matter.

    The SLiMSuite collection contains a number of open-source bioinformatics tools to analyse these important protein features. Note that SLiMSuite now includes all of the programs from the previous SeqSuite package. (SLiMSuite was formerly also available as part of a larger SeqSuite package.) See also the SLiMSuite blog for more information.

    SLiMFinder

    SLiMFinder is an integrated SLiM discovery program building on the principles of the SLiMDisc software for accounting for evolutionary relationships [Davey NE, Shields DC & Edwards RJ (2006): Nucleic Acids Res. 34(12):3546-54]. SLiMFinder is comprised of two algorithms:

    SLiMBuild identifies convergently evolved, short motifs in a dataset. Motifs with fixed amino acid positions are identified and then combined to incorporate amino acid ambiguity and variable-length wildcard spacers. Unlike programs such as TEIRESIAS, which return all shared patterns, SLiMBuild accelerates the process and reduces returned motifs by explicitly screening out motifs that do not occur in enough unrelated proteins. For this, SLiMBuild uses the "Unrelated Proteins" (UP) algorithm of SLiMDisc in which BLAST is used to identify pairwise relationships. Proteins are then clustered according to these relationships into "Unrelated Protein Clusters" (UPCs), which are defined such that no protein in a UPC has a BLAST-detectable relationship with a protein in another UPC. If desired, SLiMBuild can be used as a replacement for TEIRESIAS in other software (teiresias=T slimchance=F).
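    The UPC construction described above amounts to single-linkage clustering over the graph of pairwise BLAST relationships, which can be sketched with a union-find structure. This is an illustrative Python sketch, not the rje_slimcore implementation; the protein names and relationship pairs are hypothetical inputs.

```python
def upc_clusters(proteins, related_pairs):
    """Group proteins into Unrelated Protein Clusters: no protein in one
    cluster has a (BLAST-detectable) relationship with any other cluster."""
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    for a, b in related_pairs:  # each pairwise BLAST relationship
        parent[find(a)] = find(b)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(clusters.values(), key=lambda c: sorted(c))
```

    With proteins A-D and relationships (A,B) and (B,C), this yields two UPCs, {A,B,C} and {D}; motif support is then counted per cluster rather than per sequence.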

    SLiMChance estimates the probability of these motifs arising by chance, correcting for the size and composition of the dataset, and assigns a significance value to each motif. Motif occurrence probabilities are calculated independently for each UPC, adjusted for the size of each UPC using the Minimum Spanning Tree algorithm from SLiMDisc. These individual occurrence probabilities are then converted into the total probability of seeing the observed motif the observed number of (unrelated) times. These probabilities assume that the motif is known before the search. In reality, only over-represented motifs from the dataset are examined, so these probabilities are adjusted for the size of motif-space searched to give a significance value. This is an estimate of the probability of seeing that motif, or another one like it. These values are calculated separately for each length of motif. Where pre-known motifs are also of interest, these can be given with the slimcheck=MOTIFS option and will be added to the output. SLiMFinder version 4.0 introduced a more precise (but more computationally intensive) statistical model, which can be switched on using sigprime=T. Likewise, the more precise (but more computationally intensive) correction to the mean UPC probability heuristic can be switched on using sigv=T. (Note that the other SLiMChance options may not work with either of these options.) The allsig=T option will output all four scores. In this case, SigPrimeV will be used for ranking etc. unless probscore=X is used.
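    The core of this calculation can be illustrated with the mean-UPC-probability heuristic mentioned above: treat each UPC as an independent trial with the mean occurrence probability, and ask how likely it is to see the motif in k or more clusters. This is a simplified sketch under that heuristic only; SLiMChance itself additionally applies the motif-space, masking and dimer-frequency corrections, and the function name is hypothetical.

```python
import math

def overrep_prob(upc_probs, k):
    """P(motif occurs in k+ of the UPCs by chance), approximating the
    per-UPC probabilities by their mean and using a binomial tail."""
    n = len(upc_probs)
    pbar = sum(upc_probs) / n  # mean UPC probability heuristic
    # P(fewer than k UPCs contain the motif)
    cdf = sum(math.comb(n, i) * pbar**i * (1 - pbar)**(n - i)
              for i in range(k))
    return 1.0 - cdf  # P(k or more unrelated occurrences)
```

    For two UPCs each with occurrence probability 0.5, the chance of one or more occurrences is 0.75; as the observed support k rises relative to these expectations, the probability (and hence the adjusted significance) falls.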

    Where significant motifs are returned, SLiMFinder will group them into Motif "Clouds", which consist of physically overlapping motifs (2+ non-wildcard positions are the same in the same sequence). This provides an easy indication of which motifs may actually be variants of a larger SLiM and should therefore be considered together.

    Additional Motif Occurrence Statistics, such as motif conservation, are handled by the rje_slimlist module. Please see the documentation for this module for a full list of commandline options. These options are currently under development for SLiMFinder and are not fully supported. See the SLiMFinder Manual for further details. Note that the OccFilter *does* affect the motifs returned by SLiMBuild and thus the TEIRESIAS output (as does min. IC and min. Support) but the overall Motif StatFilter *only* affects SLiMFinder output following SLiMChance calculations.

    QSLiMFinder

    QSLiMFinder (Query SLiMFinder) is a variant of SLiMFinder for explicitly returning motifs present in a given query sequence specified by query=X. More details can be found in the SLiMFinder Manuals or contact the author.

    SLiMCore

    The rje_slimcore module forms the basis for SLiMFinder & SLiMSearch and can also be used in its own right for additional special functions: The "MotifSeq" option will output fasta files for a list of X:Y, where X is a motif pattern and Y is the output file.

    The "Randomise" function will take a set of input datasets (as in Batch Mode) and regenerate a set of new datasets by shuffling the UPC among datasets. Note that, at this stage, this is quite crude and may result in the final datasets having fewer UPC due to common sequences and/or relationships between UPC clusters in different datasets.

    SLiMProb

    SLiMProb, formerly called SLiMSearch, is a tool for finding pre-defined SLiMs (Short Linear Motifs) in a protein sequence database. SLiMProb can make use of corrections for evolutionary relationships and a variation of the SLiMChance algorithm from SLiMFinder to assess motifs for statistical over- and under-representation. SLiMProb is a replacement for PRESTO and uses many of the same underlying modules.

    Benefits of SLiMProb that make it more useful than a lot of existing tools include:

    • searching with mismatches rather than restricting hits to perfect matches.
    • optional equivalency files for searching with specific allowed mismatches (e.g. charge conservation)
    • generation or reading of alignment files from which to calculate conservation statistics for motif occurrences.
    • additional statistics, including protein disorder, surface accessibility and hydrophobicity predictions
    • recognition of "n of m" motif elements in the form <X:n:m>, where X is one or more amino acids that must occur n+ times across m positions. E.g. <IL:3:5> must have 3+ Is and/or Ls in a 5aa stretch.
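    An "n of m" element can be read as a sliding-window count over the sequence. The helper functions below are an illustrative sketch of that reading, not SLiMProb's own matcher.

```python
def n_of_m_match(window, aas, n):
    """Do n+ of the residues in this m-residue window come from the set aas?"""
    return sum(res in aas for res in window) >= n

def find_n_of_m(sequence, aas, n, m):
    """Return start positions of every m-residue window satisfying the element."""
    return [i for i in range(len(sequence) - m + 1)
            if n_of_m_match(sequence[i:i + m], aas, n)]
```

    For instance, an element requiring 3+ I/L residues in a 5aa stretch matches the first three windows of "AILALIKAA" but none thereafter.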

    Main output for SLiMProb is a delimited file of motif/peptide occurrences, but the motifaln=T and proteinaln=T options also allow output of alignments of motifs and their occurrences.

    CompariMotif

    CompariMotif is a piece of software with a single objective: to take two lists of protein motifs and compare them to each other, identifying which motifs have some degree of overlap, and identifying the relationships between those motifs. It can be used to compare a list of motifs with themselves, their reversed selves, or a list of previously published motifs (e.g. ELM (http://elm.eu.org/)). CompariMotif outputs a table of all pairs of matching motifs, along with their degree of similarity (information content) and their relationship to each other.

    The best match is used to define the relationship between the two motifs. These relationships are comprised of the following keywords:

    Match type keywords identify the type of relationship seen:

  • Exact = all the matches in the two motifs are precise
  • Variant = focal motif contains only exact matches and subvariants of degenerate positions in the other motif
  • Degenerate = the focal motif contains only exact matches and degenerate versions of positions in the other motif
  • Complex = some positions in the focal motif are degenerate versions of positions in the compared motif, while others are subvariants of degenerate positions.

    Match length keywords identify the length relationships of the two motifs:

  • Match = both motifs are the same length and match across their entire length
  • Parent = the focal motif is longer and entirely contains the compared motif
  • Subsequence = the focal motif is shorter and entirely contained within the compared motif
  • Overlap = neither motif is entirely contained within the other

    This gives sixteen possible classifications for each motif's relationship to the compared motif.
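    For fixed (non-degenerate) patterns, the four length keywords reduce to simple containment tests. This is a toy sketch of that reduction only; CompariMotif itself compares degenerate regular-expression motifs position by position, and this helper is not part of its code.

```python
def length_keyword(focal, other):
    """Classify the length relationship of two fixed motif patterns,
    assuming they already share an overlapping match."""
    if focal == other:
        return "Match"        # same length, match across entire length
    if other in focal:
        return "Parent"       # focal entirely contains the compared motif
    if focal in other:
        return "Subsequence"  # focal entirely contained in the compared motif
    return "Overlap"          # neither contains the other
```

    Crossing these four length keywords with the four match-type keywords above gives the sixteen possible classifications.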

    SLiMMaker

    SLiMMaker is a simple tool for generating SLiM patterns from a set of peptides. This method assumes that the peptides are already aligned - no alignment is performed by the tool.

    SLiMBench

    SLiMBench is a tool under development for generating robust SLiM prediction benchmarking datasets and assessing the subsequent predictions for consistent benchmarking.

    SLiMFarmer

    SLiMFarmer is a tool under development for running SLiMSuite programs on parallel processors, either using rsh to spawn processors across different nodes on a supercomputer, or using Python forking for running on multiple CPUs on a single machine. Currently, SLiMFinder, QSLiMFinder and SLiMProb have been tested.

    © 2014 RJ Edwards. Contact: richard.edwards@unsw.edu.au.


    Module: SLiMSuite
    Description: Short Linear Motif analysis Suite
    Version: 1.3.0
    Last Edit: 03/11/14
    Copyright © 2014 Richard J. Edwards - See source code for GNU License Notice

    Function:
    SLiMSuite is designed to be a front end for the SLiMSuite set of sequence analysis tools. The relevant tool is given
    by the first system command, or selected using `prog=X` (or `program=X`). As much as possible, SLiMSuite will emulate
    running that tool from the commandline, adding any matching `X.ini` file to the default commandline options read in
    (*before* settings read from slimsuite.ini itself). By default, the SLiMCore tool will be called
    (libraries/rje_slimcore.py) and read in commands from slimcore.ini.

    Help for the selected tool can be accessed using the `help=T` option. Note that `-h`, `-help` or `help` alone will
    trigger the SLiMSuite help (this!). As `-help` or `help` will also set `help=T`, these commands will trigger both the
    SLiMSuite help and the selected program help (unless overruled by `help=F`). An explicit `help=T` command will only
    trigger the selected program help.
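    The precedence rules above can be sketched in Python (an illustrative reconstruction of the documented behaviour, not the actual SLiMSuite source; `help_screens` is a hypothetical helper name):

    ```python
    # Sketch of the documented help-trigger logic: which help screens a given
    # command line should produce, per the rules described above.
    def help_screens(argv):
        """Return the set of help screens triggered by a command line."""
        screens = set()
        if set(argv) & {'-h', '-help', 'help'}:
            screens.add('slimsuite')       # -h/-help/help trigger SLiMSuite help
        if {'-help', 'help'} & set(argv) and 'help=F' not in argv:
            screens.add('program')         # -help/help also set help=T
        if 'help=T' in argv:
            screens.add('program')         # explicit help=T: program help only
        return screens
    ```

    For example, `help_screens(['help=T'])` yields the program help alone, while `help_screens(['help'])` yields both.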

    Running SLiMSuite will also attempt to import all of the main SLiMSuite modules, testing for download errors etc.

    SLiMSuite tools:
    The list of tools recognised by `prog=X` will be added here as the relevant code is added:
    - SLiMCore = rje_slimcore.SLiMCore. SLiMSuite core module with MegaSLiM and UPC functions.
    - SLiMFarmer = slimfarmer.SLiMFarmer. SLiMSuite job forking/HPC controller.
    - QSLiMFinder = qslimfinder.QSLiMFinder. Query-based Short Linear Motif Finder - de novo SLiM prediction.
    - SLiMFinder = slimfinder.SLiMFinder. Short Linear Motif Finder - de novo SLiM prediction.
    - SLiMList = rje_slimlist.SLiMList. Short Linear Motif manipulation/filtering module.
    - SLiMProb = slimprob.SLiMProb. Short Linear Motif Probability - known SLiM prediction.
    - SLiMBench = slimbench.SLiMBench. SLiM discovery benchmarking module.
    - SLiMMaker = slimmaker.SLiMMaker. Simple SLiM generation from aligned peptide sequences.

    Example use to run SLiMFinder:
    python SLiMSuitePATH/tools/slimsuite.py slimfinder

    Please also see the SeqSuite documentation for additional utilities, which can be run from SLiMSuite or SeqSuite.

    Commandline:
    ### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
    prog=X # Identifies the tool to be used. Will load defaults from X.ini (before slimsuite.ini) [slimcore]
    help=T/F # Return the help documentation for program X. [False]
    ### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
    See also rje.py generic commandline options.

    SLiMSuite Legacy/Development programs/modules

    More information can also be found at the SLiMSuite blog.
    The SLiMSuite Legacy/Development programs/modules in SLiMSuite are: PRESTO, SLiMDisc, SLiMMutant. Click on the tabs, read the manuals, or see the ReadMe for more details.

    Program: PRESTO
    Description: Protein Regular Expression Search Tool
    Version: 5.0
    Last Edit: 22/01/07
    Note: This program has been superseded in most of its functions by SLiMSearch.

    Copyright © 2006 Richard J. Edwards - See source code for GNU License Notice

    Function:
    PRESTO is what the acronym suggests: a search tool for searching proteins with peptide sequences or motifs using an
    algorithm based on Regular Expressions. The simple input and output formats and ease of use on local databases make
    PRESTO a useful alternative to web resources for high throughput studies.

    The additional benefits of PRESTO that make it more useful than a lot of existing tools include:
    * PRESTO can be given alignment files from which to calculate conservation statistics for motif occurrences.
    * searching with mismatches rather than restricting hits to perfect matches.
    * additional statistics, including protein disorder, surface accessibility and hydrophobicity predictions
    * production of separate fasta files containing the proteins hit by each motif.
    * production of both UniProt format results and delimited text results for easy incorporation into other applications.
    * inbuilt tandem Mass Spec ambiguities.

    PRESTO recognises "n of m" motif elements in the form >X:n:m<, where X is one or more amino acids that must occur n+
    times across m positions. E.g. >IL:3:5< must have 3+ Is and/or Ls in a 5aa stretch.
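    A minimal Python sketch of this matching rule (an illustration only, not the PRESTO implementation; function names are hypothetical):

    ```python
    # Check "n of m" motif elements like >IL:3:5<: at least n residues from the
    # allowed set within an m-aa window of the sequence.
    def n_of_m_match(window, aas, n):
        """True if >= n residues of the window are in the allowed set."""
        return sum(1 for aa in window if aa in aas) >= n

    def find_n_of_m(seq, aas='IL', n=3, m=5):
        """Return 0-based start positions of m-aa windows satisfying >aas:n:m<."""
        return [i for i in range(len(seq) - m + 1)
                if n_of_m_match(seq[i:i + m], set(aas), n)]
    ```

    For instance, `find_n_of_m('AILAILAA')` reports the windows with 3+ Is and/or Ls in a 5aa stretch.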

    Main output for PRESTO is a delimited file of motif/peptide occurrences, but the motifaln=T and proteinaln=T options also
    allow output of alignments of motifs and their occurrences. PRESTO has an additional motinfo=FILE output, which produces a
    summary table of the input motifs, including Expected values if searchdb is given and information content if motifIC=T.
    Hit proteins can also be output in fasta format (fasout=T) or UniProt format with occurrences as features (uniprot=T).

    Release Notes:
    Expectation scores have now been modified since PRESTO Version 1.x. In addition to the expectation score for the no.
    of occurrences of a given motif (given the number of mismatches) in the entire dataset ("EXPECT"), there is now an
    estimation of the probability of the observed number of occurrences, derived from a Poisson distribution, which is
    output in the log file ("#PROB"). Furthermore, these values are now also calculated per sequence individually
    ("SEQ_EXP" and "SEQ_PROB").

    Note on MS-MS mode: The old Perl version of Presto had a handy MS-MS mode for searching peptides sequenced from tandem
    mass-spec data. (In this mode [msms=T], amino acids of equal mass (Leu-Ile [LI], Gln-Lys [QK], MetO-Phe [MF]) are
    automatically placed as possible variants and additional output columns give information of predicted tryptic fragment
    masses etc.) Implementation of MS-MS mode has been started in this version but discontinued due to lack of demand. As a
    result, extra tryptic fragment data is not produced. If you would like to use it, contact me at richard.edwards@ucd.ie
    and I will finish implementing it.

    Note for compare=T mode: This is still fully functional but main documentation has been moved to comparimotif.py.

    !!!NEW!!! for version 3.7, PRESTO has an additional domfilter=FILE option. This is quite crude and will read in domains
    to be filtered from the FILE given. This file MUST be tab-delimited and must have at least three columns, with headers
    'Name', 'Start' and 'Stop', where Name matches the short name of the Hit and Start and Stop are the positions of the
    domain 1-N. This will output two additional columns, plus a further two if iupred=T:
    * DOM_MASK = Gives the motif a score of the length of the domain if it would be masked out by masking domains or 0 if not
    * DOM_PROP = Gives the proportion of motif positions in a domain
    * DOM_DIS = Gives the motif the mean disorder score for the *domain* if in the domain, else 1.0 if not
    * DOM_COMB = Gives positions in the domain the mean disorder score for the domain, else they keep their own scores
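    The tab-delimited domain file described above might look like this (a hypothetical example; the protein names and positions are illustrative, and columns are tab-separated):

    ```
    Name	Start	Stop
    SRC_HUMAN	84	145
    SRC_HUMAN	151	248
    P53_HUMAN	94	312
    ```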

    !!!NEW!!! for version 4.0, PRESTO has a Peptide design mode (peptides=T), using winsize=X to set size of peptides around
    occurrences. This will output peptide sequences into a fasta file and additional columns to the main PRESTO output file:
    * PEP_SEQ = Sequence of peptide
    * PEP_DESIGN = Peptide design comments. "OK" if all looking good, else warnings of bad AA combos (DP, DC, DG, NG, NS or PP)
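    The documented check could be sketched like this (illustrative only, not the PRESTO source; the dipeptide list is taken from the text above and `pep_design` is a hypothetical helper name):

    ```python
    # Flag the bad amino acid combinations listed above in a designed peptide.
    BAD_COMBOS = ('DP', 'DC', 'DG', 'NG', 'NS', 'PP')

    def pep_design(peptide):
        """Return 'OK' or a warning listing the bad dipeptides found."""
        found = [c for c in BAD_COMBOS if c in peptide]
        return 'OK' if not found else 'Bad AA combos: %s' % ','.join(found)
    ```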

    Development Notes: (To be assimilated with release notes etc. when version is fully functional.)
    Main output is now determined by outfile=X and/or basefile=X, which will set the self.info['Basefile'] attribute,
    using standard rje module commands. If it is not set (i.e. is '' or 'None'), it will be generated using the motif and
    searchdb files as with the old PRESTO. Main search output will use this file leader and add the appropriate extension based
    on the output type and delimiter:
    * resfile = Main PRESTO search = *.presto.tdt
    * motifaln = Produce fasta files of local motif alignments *.motifaln.fas
    * uniprot = Output of hits as a uniprot format file = *.uniprot.presto
    * motinfo = Motif summary table = *.motinfo.tdt
    * ftout = Make a file of UniProt features for extracted parent proteins, where possible, incorporating SLiMs [*.features.tdt]
    * peptides = Peptides designed around motifs = *.peptides.fas

    Other special output will generate their names using protein and/or motif names using the root PATH of basefile
    (e.g. the PATH will be stripped and ProteinAln/ or HitFas/ directories made for output):

        * proteinaln=T/F  : Search for alignments of proteins containing motifs and produce new file containing motifs in [False]
        * fasout=T/F      : Whether to output hit sequences as a fasta format file motif.fas [False]
    

    Reformatting and outputting motifs require a file name to be given:

        * motifout=FILE   : Filename for output of reformatted (and filtered?) motifs in PRESTO format [None]
    



    PRESTO Commands:
    ## Basic Input Parameters ##

        motifs=FILE     : File of input motifs/peptides [None]
    
    Single line per motif format = 'Name Sequence #Comments' (Comments are optional and ignored)
    Alternative formats include fasta, SLiMDisc output and raw motif lists.

        minpep=X        : Min length of motif/peptide X aa [2]
        minfix=X        : Min number of fixed positions for a motif to contain [0]
        minic=X         : Min information content for a motif (1 fixed position = 1.0) [2.0]
        trimx=T/F       : Trims Xs from the ends of a motif [False]
        nrmotif=T/F     : Whether to remove redundancy in input motifs [False]
        searchdb=FILE   : Protein Fasta file to search (or second motif file to compare) [None]
        xpad=X          : Adds X additional Xs to the flanks of the motif (after trimx if trimx=T) [0]
        xpaddb=X        : Adds X additional Xs to the flanks of the search database sequences (will mess up alignments) [0]
        minimotif=T/F   : Input file is in minimotif format and will be reformatted (PRESTO File format only) [False]
        goodmotif=LIST  : List of text to match in Motif names to keep (can have wildcards) []
    

    ## Basic Output Parameters ##

        outfile=X       : Base name of results files, e.g. X.presto.tdt. [motifsFILE-searchdbFILE.presto.tdt]
        expect=T/F      : Whether to give crude expect values based on AA frequencies [True]
        nohits=T/F      : Save list of sequence IDs without motif hits to *.nohits.txt. [False]
        useres=T/F      : Whether to append existing results to *.presto.txt and *.nohits.txt (continuing after last sequence)
                          and/or use existing results to search for conservation in alignments if usealn=T. [False]

        mysql=T/F       : Output results in mySQL format - lower case headers and no spaces [False]
        hitname=X       : Format for Hit Name: full/short/accnum [short]
        fasout=T/F      : Whether to output hit sequences as a fasta format file motif.fas [False]
        datout=T/F      : Whether to output hits as a uniprot format file *.uniprot.presto [False]
        motinfo=T/F     : Whether to output motif summary table *.motinfo.tdt [None]
        motifout=FILE   : Filename for output of reformatted (and filtered?) motifs in PRESTO format [None]
    

    ## Advanced Output Options ##

        winsa=X         : Number of aa to extend Surface Accessibility calculation either side of motif [0]
        winhyd=X        : Number of aa to extend Eisenberg Hydrophobicity calculation either side of motif [0]
        windis=X        : Extend disorder statistic X aa either side of motif (use flanks *only* if negative) [0]
        winchg=X        : Extend charge calculations (if any) to X aa either side of motif [0]
        winsize=X       : Sets all of the above window sizes (use flanks *only* if negative) [0]
        slimchg=T/F     : Calculate Absolute, Net and Balance charge statistics (above) for occurrences [False]
        iupred=T/F      : Run IUPred disorder prediction [False]
        foldindex=T/F   : Run FoldIndex disorder prediction [False]
        iucut=X         : Cut-off for IUPred results (0.0 will report mean IUPred score) [0.0]
        iumethod=X      : IUPred method to use (long/short) [short]
        iupath=PATH     : The full path to the IUPred executable [c:/bioware/iupred/iupred.exe]
        domfilter=FILE  : Use the DomFilter options, reading domains from FILE [None]
        runid=X         : Adds an additional Run_ID column identifying the run (for multiple appended runs) [None]
        restrict=LIST   : List of files containing instances (hit,start,end) to output (only) []
        exclude=LIST    : List of files containing instances (hit,start,end) to exclude []
        peptides=T/F    : Peptide design mode, using winsize=X to set size of peptides around motif [False]
        newscore=LIST   : Lists of X:Y, create a new statistic X, where Y is the formula of the score. []
    

    ## Basic Search Parameters ##

        mismatch=X,Y    : Peptide must be <= Y aa for X mismatches
        ambcut=X        : Cut-off for max number of choices in ambiguous position to be shown as variant [10]
        expcut=X        : The maximum number of expected occurrences allowed to still search with motif [0] (if -ve, per seq)
        alphabet=X,Y,.. : List of letters in alphabet of interest [AAs]
        reverse=T/F     : Reverse the motifs - good for generating a test comparison data set [False]
    
    *** No longer outputs *.rev.txt - use motifout=X instead! ***

        msms=T/F        : Whether searching Tandem Mass Spec peptides [False]
        ranking=T/F     : Whether to rank hits by their rating in MSMS mode [False]
        memsaver=T/F    : Whether to store all results in Objects (False) or clear as search proceeds (True) [True]    
        startfrom=X     : Accession Number / ID to start from. (Enables restart after crash.) [None]
    

    ## Conservation Parameters ##

        usealn=T/F      : Whether to search for and use alignments where present. [False]
        gopher=T/F      : Use GOPHER to generate missing orthologue alignments in alndir - see gopher.py options [False]
        fullforce=T/F   : Force GOPHER to re-run even if alignment exists [False]
        alndir=PATH     : Path to alignments of proteins containing motifs [./] * Use forward slashes (/)
        alnext=X        : File extension of alignment files, accnum.X [aln.fas]
        alngap=T/F      : Whether to count proteins in alignments that have 100% gaps over the motif (True) or ignore them (False)
                          as putative sequence fragments [False] (NB. All X regions are ignored as sequence errors.)

        conspec=LIST    : List of species codes for conservation analysis. Can be name of file containing list. [None]
        conscore=X      : Type of conservation score used:  [pos]
                            - abs = absolute conservation of motif using RegExp over matched region
                            - pos = positional conservation: each position treated independently 
    
                            - prop = conservation of amino acid properties
                            - all = all three methods for comparison purposes

        consamb=T/F     : Whether to calculate conservation allowing for degeneracy of motif (True) or of fixed variant (False) [True]
        consinfo=T/F    : Weight positions by information content (does nothing for conscore=abs) [True]
        consweight=X    : Weight given to global percentage identity for conservation, giving more weight to closer sequences [0]
                            - 0 gives equal weighting to all. Negative values will upweight distant sequences.
        posmatrix=FILE  : Score matrix for amino acid combinations used in pos weighting. (conscore=pos builds from propmatrix) [None]
        aaprop=FILE     : Amino Acid property matrix file. [aaprop.txt]
        consout=T/F     : Outputs an additional result field containing information on the conservation score used [False]

    ## Additional Output for Extracted Motifs ##

        motific=T/F     : Output Information Content for motifs [False]
        motifaln=T/F    : Produce fasta files of local motif alignments [False]  
        proteinaln=T/F  : Search for alignments of proteins containing motifs and produce new file containing motifs [False]
        protalndir=PATH : Directory name for output of protein aligments [ProteinAln/]
        flanksize=X     : Size of sequence flanks for motifs [30]
        xdivide=X       : Size of dividing Xs between motifs [10]
        ftout=T/F       : Make a file of UniProt features for extracted parent proteins, where possible, incorporating SLiMs [*.features.tdt]
        unipaths=LIST   : List of additional paths containing uniprot.index files from which to look for and extract features ['']
        statfilter=LIST : List of stats to filter (*discard* occurrences) on, consisting of X*Y where:
                          - X is an output stat (the column header),
                          - * is an operator in the list <, <=, !=, =, >=, >    !!! Remember to enclose in "quotes" for >< !!!
                          - Y is a value that X must have, assessed using *.
    
    This filtering is crude and may behave strangely if X is not a numerical stat!
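    The statfilter logic might be sketched as follows (an illustration under assumptions, not SLiMSuite code; `parse_filter` and `discard` are hypothetical helper names, and the stat names are examples from the outputs above):

    ```python
    import operator
    import re

    # Map the documented operators onto Python comparison functions.
    OPS = {'<': operator.lt, '<=': operator.le, '>=': operator.ge,
           '>': operator.gt, '=': operator.eq, '!=': operator.ne}

    def parse_filter(text):
        """Split an X*Y filter like 'EXPECT<0.5' into (stat, op, value)."""
        stat, op, val = re.match(r'(\w+)(<=|>=|!=|<|>|=)(.+)', text).groups()
        return stat, OPS[op], float(val)

    def discard(occ, filters):
        """True if the occurrence (a dict of stats) matches any filter and
        should therefore be discarded."""
        return any(op(occ[stat], val)
                   for stat, op, val in map(parse_filter, filters))
    ```

    Note the longer operators (`<=`, `>=`, `!=`) must be tried before their one-character prefixes, which is why they come first in the regular expression alternation.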

    ## Motif Comparison Parameters ##

        compare=T/F     : Compare the motifs from the motifs FILE with the searchdb FILE (or self if None) [False]
        minshare=X      : Min. number of non-wildcard positions for motifs to share [2]
        matchfix=X      : If >0, must exactly match *all* fixed positions in the motifs from:  [0]
                            - 1: input (motifs=FILE) motifs
                            - 2: searchdb motifs
                            - 3: *both* input and searchdb motifs
        matchic=T/F     : Use (and output) information content of matched regions to assess motif matches [True]
        motdesc=X       : Sets which motifs have description outputs (0-3 as matchfix option) [3]
        outstyle=X      : Sets the output style for the resfile [normal]
                            - normal = all standard stats are output
                            - multi = designed for multiple appended runs. File names are also output
                            - single = designed for searches of a single motif vs a database. Only motif2 stats are output
                            - normalsplit/multisplit = as normal/multi but stats are grouped by motif rather than by type

    Uses general modules: copy, glob, os, string, sys, time
    Uses RJE modules: rje, rje_aaprop, rje_disorder, rje_motif_V3, rje_motif_cons, rje_scoring, rje_seq, rje_sequence,
    rje_blast, rje_pam, rje_uniprot
    Other modules needed: rje_dismatrix

    This module was not recognised by rje_pydocs.

    Module: SLiMMutant
    Description: Short Linear Motif Mutation Analysis Tool
    Version: 1.3
    Last Edit: 16/09/14
    Copyright © 2014 Richard J. Edwards - See source code for GNU License Notice

    Function:
    SLiMMutant is a Short Linear Motif Mutation Analysis Tool, designed to identify and assess mutations that create
    and/or destroy SLiMs. There are three main run modes:

    - Mode 1. Generating mutant datasets for analysis [`generate=T`]

    The main input is: (1) a file of protein sequence mutations [`mutfile=FILE`] in a delimited text format with aa
    substitution [`mutfield=X`] and protein accession number [`protfield=X`] data; (2) a corresponding sequence file
    [`seqin=FILE`]; (3) a file of SLiMs to analyse [`motifs=FILE`]. This will process the data and generate two sequence
    files: `*.wildtype.fas` and `*.mutant.fas`. These files will be named after the input `mutfile` unless `basefile=X`
    is used.

    - Mode 2. Run SLiMProb on datasets [`slimprob=T`]

    This will run SLiMProb on the two datasets, once per `*.ini` file given by `slimini=LIST`. These runs should have
    distinct `runid=X` settings. If no `*.ini` files are given, a single run will be made using commandline settings.

    - Mode 3. Compile results of SLiMProb runs. [`analyse=T`]

    This will compare the `*.wildtype.fas` and `*.mutant.fas` results from the `*.occ.csv` file produced by SLiMProb.
    All mutations analysed will be identified from `*.mutant.fas`. SLiM occurrences are then matched up between wildtype
    and mutant versions of the same sequence. If none of the mutations have affected the SLiM prediction, then the
    wildtype and all mutant sequences will return the motif. If, on the other hand, mutations have created/destroyed
    motifs, occurrences will be missing from the wildtype and/or 1+ mutant sequences. All unaffected SLiM instances are
    first removed and altered SLiM instances output to `*.MutOcc.csv`. Differences between mutants and wildtypes are
    calculated for each `RunID`-`Motif` combination and summary results output to `*.Mut_vs_WT.csv`. If `motlist=LIST` is
    given, analysis is restricted to a subset of motifs.
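    The occurrence matching described above can be sketched as a set comparison (illustrative only, not the SLiMMutant source; the function and motif names are hypothetical):

    ```python
    # Occurrences present in only the wildtype or only a mutant indicate a SLiM
    # destroyed or created by mutation; shared occurrences are unaffected.
    def mutation_effects(wt_occ, mut_occ):
        """wt_occ/mut_occ: sets of (motif, position) occurrences for one
        wildtype/mutant sequence pair. Returns (destroyed, created) sets."""
        destroyed = wt_occ - mut_occ   # in the wildtype but lost in the mutant
        created = mut_occ - wt_occ     # gained by the mutant
        return destroyed, created
    ```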

    Unless `basefile=FILE` is given, output files will be named after `mutfile=FILE` but output into the current run
    directory. If running in batch mode, basefile cannot be used.

    NOTE: SLiMMutant is still in development and has not been thoroughly tested or benchmarked.

    Commandline:
    ### ~ SEQUENCE GENERATION METHODS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        generate=T/F	: Whether to run sequence generation pipeline [False]
        mutfile=FILE	: Delimited text file with sequence mutation info. Sets basefile. []
        mutfield=X      : Field in mutfile corresponding to AA substitution data ['AAChange']
        protfield=X		: Field in mutfile corresponding to protein accession number ['Uniprot']
        splitfield=X    : Field in mutfile to split data on (saved as basefile.X.tdt) []
        seqin=FILE		: Input file with protein sequences []
        motifs=FILE		: Input file of SLiMs []
        motlist=LIST	: List of input SLiMs to restrict analysis to []
        mutflanks=X		: Generate for casemask=Upper of X aa flanking mutation (None if > 1) [0]
        minmutant=X     : Minimum number of mutants for output [100]
        maxmutant=X     : Maximum number of mutants for output [100000]
    

    ### ~ SLiMProb Run Methods ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        slimprob=T/F	: Whether to run SLiMProb on *.wildtype.fas and *.mutant.fas (*=basefile) [False]
        slimini=LIST    : List of INI files with settings for SLiMProb runs. Should include runid=X and resdir=PATH. []
        resdir=PATH   	: Location of output files. SLiMProb resdir should be in slimini [SLiMMutant/ (and SLiMProb/)]
    

    ### ~ SLiMProb Results Analysis ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        analyse=T/F		: Whether to analyse the results of a SLiMProb run [False]
        resfile=FILE   	: Main SLiMProb results table (*.csv and *.occ.csv) [slimprob.csv]
        runid=X			: Limit analysis to SLiMProb RunID (blank = analyse all) []
        buildpath=PATH 	: Alternative path to look for existing intermediate files (e.g. *.upc) [SLiMProb/]
    

    ### ~ SLiMProb PPI Analysis ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        slimppi=T/F         : Whether to perform SLiMPPI analysis (will set analyse=T) [False]
        sourcepath=PATH/    : Will look in this directory for input files if not found ['SourceData/']
        sourcedate=DATE		: Source file date (YYYY-MM-DD) to preferentially use [None]
        ppisource=X			: Source of PPI data. (HINT/FILE) FILE needs 'Hub' and 'SpokeUni' fields. ['HINT']
        dmifile=FILE        : Delimited text file containing domain-motif interaction data ['elm_interaction_domains.tsv']
    

    ### ~ Batch running ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        batch=FILELIST  : List of mutfiles to run in batch mode. Wildcards allowed. []
    

    ### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
    See also rje.py generic commandline options.

    Main SeqSuite programs/modules

    More information can also be found at the SLiMSuite blog.
    The Main SeqSuite programs/modules in SLiMSuite are: GABLAM, GOPHER, HAQESAC, MultiHAQ, PeptCluster, PINGU. Click on the tabs, read the manuals, or see the ReadMe for more details.

    Program: GABLAM
    Description: Global Analysis of BLAST Local AlignMents
    Version: 2.16.1
    Last Edit: 07/01/15
    Citation: Davey, Shields & Edwards (2006), Nucleic Acids Res. 34(12):3546-54.

    Copyright © 2006 Richard J. Edwards - See source code for GNU License Notice

    Function:
    This module is for taking one or two sequence datasets, performing an intensive All by All BLAST and then tabulating
    the results as a series of pairwise comparisons (*.gablam.*):
    * Qry = Query Short Name (or AccNum)
    * Hit = Hit Short Name
    * Rank = Rank of that Hit vs Query (based on Score)
    * Score = BLAST Score (one-line)
    * E-Value = BLAST E-value
    * QryLen = Length of Query Sequence
    * HitLen = Length of Hit Sequence
    * Qry_AlnLen = Total length of local BLAST alignment fragments in Query (Unordered)
    * Qry_AlnID = Number of Identical residues of Query aligned against Hit in local BLAST alignments (Unordered)
    * Qry_AlnSim = Number of Similar residues of Query aligned against Hit in local BLAST alignments (Unordered)
    * Qry_OrderedAlnLen = Total length of local BLAST alignment fragments in Query (Ordered)
    * Qry_OrderedAlnID = Number of Identical residues of Query aligned against Hit in local BLAST alignments (Ordered)
    * Qry_OrderedAlnSim = Number of Similar residues of Query aligned against Hit in local BLAST alignments (Ordered)
    * Hit_AlnLen = Total length of local BLAST alignment fragments in Hit (Unordered)
    * Hit_AlnID = Number of Identical residues of Hit aligned against Query in local BLAST alignments (Unordered)
    * Hit_AlnSim = Number of Similar residues of Hit aligned against Query in local BLAST alignments (Unordered)
    * Hit_OrderedAlnLen = Total length of local BLAST alignment fragments in Hit (Ordered)
    * Hit_OrderedAlnID = Number of Identical residues of Hit aligned against Query in local BLAST alignments (Ordered)
    * Hit_OrderedAlnSim = Number of Similar residues of Hit aligned against Query in local BLAST alignments (Ordered)
    * ALIGN_ID = Number of Identical residues as determined by pairwise ALIGN
    * ALIGN_Len = Length of pairwise ALIGN
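    For example, with percres=T the absolute counts above are converted into percentages of the relevant sequence length (2 d.p.). A sketch of that conversion (illustrative only; field names follow the table above and `perc_res` is a hypothetical helper):

    ```python
    # Convert GABLAM's absolute query-side counts into percentages of the
    # query length, as output when percres=T.
    def perc_res(row):
        """Express Qry_* alignment stats as percentages of QryLen (2 d.p.)."""
        qlen = float(row['QryLen'])
        return {k: round(100.0 * row[k] / qlen, 2)
                for k in row if k.startswith('Qry_')}
    ```

    As the note under percres=T says, enough data is output to convert percentages back into absolute numbers in other packages.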

    By default, all BLAST hits will return alignments. (blastv=N blastb=N, where N is the size of searchdb.) This can be
    overridden by the blastv=X and blastb=X options to limit results to the top X hits.

    GABLAM will also produce a single table of summary statistics for all non-self hits (*.hitsum.*) (self hits included
    if selfhits=T selfsum=T):
    * Qry = Query Short Name (or AccNum)
    * Hits = Number of Hits
    * MaxScore = Max non-self BLAST Score (one-line)
    * E-Value = BLAST E-value for max score

    Version 2.8 onwards features explicit extra functionality for all-by-all searches, where the QueryDB (seqin=FILE) and
    SearchDB (searchdb=FILE) are the same. (Failing to give a searchdb will run in this mode.)

    Version 2.16 introduces a new "fullblast" mode, which performs a full BLAST search (using forks=X to set the number
    of processors for the BLAST search) followed by the blastres=FILE multiGABLAM processing. This should be faster for
    large datasets but precludes any appending of results files. This is incompatible with the missing=LIST advanced
    update option. (missing=LIST should only be required for aborted fullblast=F runs.)

    Commandline:
    ### Input/Search Options ###

        seqin=FILE      : Query dataset file [infile.fas]
        searchdb=FILE   : Database to search. [By default, same as seqin]
        blastres=FILE   : BLAST results file for input (over-rides seqin and searchdb) [None]
        fullblast=T/F   : Whether to perform full BLAST followed by blastres analysis [False]
        blastp=X        : Type of BLAST search to perform (blastx for DNA vs prot; tblastn for Prot vs DNA) [blastp]
        gablamcut=X     : Min. percentage value for a GABLAM stat to report hit [0.0]  (GABLAM from FASTA only)
        cutstat=X       : Stat for gablamcut (eg. AlnLen or OrderedAlnSim. See above for full list) [OrderedAlnID]
        cutfocus=X      : Focus for gablamcut. Can be Query/Hit/Either/Both. [Either]
        localcut=X      : Cut-off length for local alignments contributing to global GABLAM stats [0]
    

    ### General Output Options ###

        append=T/F      : Whether to append to output file or not. (Not available for blastres=FILE or fullblast=F) [False]
        fullres=T/F     : Whether to output full GABLAM results table [True]
        hitsum=T/F      : Whether to output the BLAST Hit Summary table [True]
        local=T/F       : Whether to output local alignment summary stats table [True]
        localmin=X      : Minimum length of local alignment to output to local stats table [0]
        selfhit=T/F     : Whether to include self hits in the fullres output [True] * See also selfsum=T/F *
        selfsum=T/F     : Whether to also include self hits in hitsum output [False] * selfhit must also be T *
        qryacc=T/F      : Whether to use the Accession Number rather than the short name for the Query [True]
        keepblast=T/F   : Whether to keep the blast results files rather than delete them [False]
        blastdir=PATH   : Path for blast results file (best used with keepblast=T) [./]
        percres=T/F     : Whether output is percentage figures (2 d.p.) or absolute numbers [True]
                          - Note that enough data is output to convert one into the other in other packages
        reduced=LIST    : List of terms that must be included in reduced output headers (e.g. Hit or Qry_Ordered) []
    

    ### All-by-all Output Options ###

        dismat=T/F      : Whether to output compiled distance matrix [True]
        diskey=X        : GABLAM Output Key to be used for distance matrix ['Qry_AlnID']
        distrees=T/F    : Whether to generate UPGMA tree summaries of all-by-all distances [True]
        treeformats=LIST: List of output formats for generated trees (see rje_tree.py) [nwk,text,png]
        disgraph=T/F    : Whether to output a graph representation of the distance matrix (edges = homology) [True]
        graphtypes=LIST : Formats for graph outputs (svg/xgmml/png/html) [xgmml,png]
        clusters=T/F    : Whether to output a list of clusters based on shared BLAST homology [True]
        bycluster=X     : Generate separate trees and distance matrix for clusters of X+ sequences [0]
        clustersplit=X  : Threshold at which clusters will be split (e.g. must be > distance to cluster) [1.0]
        singletons=T/F  : Whether to include singletons in main tree and distance matrix [False]
        saveupc=T/F     : Whether to output a UPC file for SLiMSuite compatibility [False]
    

    ### Sequence output options ###

        fasout=T/F      : Output a fasta file per input sequence "ACCNUM.DBASE.fas" [False]   (GABLAM from FASTA only)
        fasdir=PATH     : Directory in which to save fasta files [BLASTFAS/]
        fragfas=T/F     : Whether to output fragmented Hits based on local alignments [False]
        gablamfrag=X    : Length of gaps between mapped residues for fragmenting local hits [100]
        addflanks=X     : Add flanking regions of length X to fragmented hits [0]
        combinedfas=T/F : Whether to generate a combined fasta file [False]
    

    ### Advanced/Obsolete Search/Output Options ###

        dotplots=T/F    : Whether to use gablam.r to output dotplots. (Needs R installed and set up) [False]
        dotlocalmin=X   : Minimum length of local alignment to output to local hit dot plots [1000]
        mysql=T/F       : Whether to output column headers for mysql table build [False]
        missing=LIST    : Add missing results for AccNums in FILE (or list of AccNums X,Y,..) [None]
        startfrom=X     : Accession number to start from [None]
        alnstats=T/F    : Whether to output GABLAM stats or limit to one-line stats (blastb=0) [True]
        posinfo=T/F     : Output the Start/End limits of the BLAST Hits [True]
        outstats=X      : Whether to output just GABLAM, GABLAMO or All [All]
    

    ### GABLAM Non-redundancy options. NOTE: These are different to rje_seq NR options. ###

        nrseq=T/F       : Make sequences Non-Redundant following all-by-all. [False]
        nrcut=X         : Cut-off for non-redundancy filter, uses nrstat=X for either query or hit [100.0]
        nrstat=X        : Stat for nrcut (e.g. AlnLen or OrderedAlnSim; see above for full list) [OrderedAlnID]
        nrchoice=LIST   : Order of decisions for choosing NR sequence to keep. Otherwise keeps first sequence. (swiss/nonx/length/spec/name/acc/manual) [swiss,nonx,length]
        nrsamespec=T/F  : Non-Redundancy within same species only. [False]
        nrspec=LIST     : List of species codes in order of preference (good to bad) []
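
    The effect of nrcut/nrstat can be sketched as a greedy filter over the all-by-all results. This is an illustrative sketch only (function and data-structure names are hypothetical, not the actual GABLAM code):

```python
# Illustrative sketch of GABLAM-style non-redundancy filtering (not the
# actual GABLAM code): a sequence is dropped if its pairwise stat versus
# an already-kept sequence meets nrcut; otherwise it is kept.

def nr_filter(seqs, pairstat, nrcut=100.0, prefer=None):
    """seqs: list of sequence IDs; pairstat: dict mapping
    (candidate, kept) pairs to the %stat (e.g. OrderedAlnID);
    prefer: optional ranking key deciding which sequence to keep first.
    Returns the non-redundant subset."""
    prefer = prefer or (lambda s: 0)      # default: keep first seen
    kept = []
    for seq in sorted(seqs, key=prefer):
        # Drop seq if it is redundant with any already-kept sequence.
        if any(pairstat.get((seq, k), 0.0) >= nrcut for k in kept):
            continue
        kept.append(seq)
    return kept

seqs = ["A", "B", "C"]
pairstat = {("B", "A"): 100.0, ("C", "A"): 52.0, ("C", "B"): 50.0}
print(nr_filter(seqs, pairstat, nrcut=100.0))   # ['A', 'C']
```

    With nrchoice options, the `prefer` key would rank SwissProt entries, fewer X residues, longer sequences and so on, as described above.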
    

    ### BLAST Options ###

        blastpath+=PATH : path for blast+ files [c:/bioware/blast+/] *Use fwd slashes
        blastpath=PATH  : path for blast files [c:/bioware/blast/] *Use fwd slashes
        blaste=X        : E-Value cut-off for BLAST searches (BLAST -e X) [1e-4]
        blastv=X        : Number of one-line hits per query (BLAST -v X)
        blastb=X        : Number of hit alignments per query (BLAST -b X)
        blastf=T/F      : Complexity Filter (BLAST -F X) [False]
        checktype=T/F   : Whether to check sequence types and BLAST program selection [True]
    

    ### Additional ALIGN Global Identity ###

        globid=T/F  : Whether to output Global %ID using ALIGN [False]
        rankaln=X   : Perform ALIGN pairwise global alignment for top X hits [0]
        evalaln=X   : Perform ALIGN pairwise global alignment for all hits with e <= X [1000]
        alncut=X    : Perform ALIGN pairwise global alignment until > X %ID reached [0]
    

    ### Forking Options ###

        noforks=T/F     : Whether to avoid forks [False]
        forks=X         : Number of parallel sequences to process at once [0]
        killforks=X     : Number of seconds of no activity before killing all remaining forks. [36000]
    

    Program: GOPHER
    Description: Generation of Orthologous Proteins from Homology-based Estimation of Relationships
    Version: 3.4
    Last Edit: 15/05/14
    Citation: Davey, Edwards & Shields (2007), Nucleic Acids Res. 35(Web Server issue):W455-9.

    Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice

    Function:
    This script is designed to take in two sequences files and generate datasets of orthologous sequence alignments.
    The first [seqin] sequence set is the 'queries' around which orthologous datasets are to be assembled. This is now
    optimised for a dataset consisting of one protein per protein-coding gene, although splice variants should be dealt
    with OK and treated as paralogues. This will only cause problems if the postdup=T option is used, which restricts
    orthologues returned to be within the last post-duplication clade for the sequence.

    The second [orthdb] is the list of proteins from which the orthologues will be extracted. The seqin sequences are
    then BLASTed against the orthdb and processed (see below) to retain putative orthologues using an estimation of the
    phylogenetic relationships based on pairwise sequence similarities.

    NB. As of version 2.0, gopher=FILE has been replaced with seqin=FILE for greater rje python consistency. The allqry
    option has been removed. Please clean up the input data into a desired non-redundant dataset before running GOPHER.
    (In many ways, GOPHER's strength is its capacity to be run for a single sequence of interest rather than a whole
    genome, and it is this functionality that has been concentrated on for use with PRESTO and SLiM Pickings etc.) The
    output of statistics for each GOPHER run has also been discontinued for now but may be reintroduced with future
    versions. The phosalign command (to produce a table of potential phosphorylation sites (e.g. S,T,Y) across
    orthologues for special conservation of phosphorylation prediction analyses) has also been discontinued for now.

    Version 2.1 has tightened up on the use of rje_seq parameters that were causing trouble otherwise. It is now the
    responsibility of the user to make sure that the orthologue database meets the desired criteria. Duplicate accession
    numbers will not be tolerated by GOPHER and (arbitrary) duplicates will be deleted if the sequences are the same, or
    renamed otherwise. Renaming may cause problems later. It is highly desirable not to have two proteins with the same
    accession number but different amino acid sequences. The following commands are added to the rje_seq object when input
    is read: accnr=T unkspec=F specnr=F gnspacc=T. Note that unknown species are also not permitted.

    Version 3.0 has improved directory organisation for multi-species and multi-orthdb GOPHER runs on the same system, in
    line with the bioware.ucd.ie webserver. Additional data cleanup has been added too. (NB. Experimental Sticky GOPHER
    runs will not work with Organise=T.) The output directory is now set at the highest level by gopherdir=PATH. If
    organise=T then a subdirectory will be created within this directory named after the orthdb (path and extension
    stripped) and each species will have its own subdirectory within this in turn. By default, these are named using the
    UniProt species codes, which are read from the input sequences. To use TaxIDs, the unispec=FILE option must be used.
    This should point to a file that has one line per UniProt species code, starting with the species code and ending
    with :TaxID: where TaxID is a number. The file should be sorted by species code.
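
    A minimal sketch of reading that unispec format (illustrative parser, not the actual SLiMSuite code):

```python
# Sketch of parsing the unispec=FILE format described above: one line per
# UniProt species code, starting with the code and ending ":TaxID:".
# Illustrative only; the example lines are hypothetical.

def parse_unispec(lines):
    """Return {species_code: taxid} from unispec-style lines."""
    taxa = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        code = line.split()[0]
        # TaxID sits between the final pair of colons: "...:9606:"
        taxid = line.rstrip(":").rsplit(":", 1)[-1]
        taxa[code] = int(taxid)
    return taxa

print(parse_unispec(["HUMAN Homo sapiens :9606:",
                     "YEAST Saccharomyces cerevisiae :559292:"]))
# {'HUMAN': 9606, 'YEAST': 559292}
```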

    The process for dataset assembly is as follows for each protein:

    1. BLAST against orthdb [orthblast]
    < BLASTs saved in BLAST/AccNum.blast
    2. Work through BLAST hits, identifying paralogues (query species duplicates) and the closest homologue from each
    other species. This involves a second BLAST of the query versus original BLAST hits (e-value=10, no complexity
    filter). The best sequence from each species is kept, i.e. the one with the best similarity to the query and not part
    of a clade with any paralogue that excludes the query. (If postdup=T, the hit must be in the query's post duplication
    clade.) In addition, hits: [orthfas]
    * Must have minimum identity level with Query
    * Must be one of the 'good species' [goodspec=LIST]
    < Save reduced sequences as ORTH/AccNum.orth.fas
    < Save paralogues identified (and meeting minsim settings) in PARA/AccNum.para.fas
    3. Align sequences with MUSCLE [orthalign]
    < ALN/AccNum.orthaln.fas
    4. Generate an unrooted tree with (ClustalW or PHYLIP) [orthtree]
    < TREE/AccNum.orth.nsf

    Optional paralogue/subfamily output: (These are best not used with Force=T or FullForce=T)
    2a. Alignment of query protein and any paralogues >minsim threshold (paralign=T/F). The parasplice=T/F controls
    whether splice variants are in these paralogue alignments (where identified using AccNum-X notation).
    < PARALN/AccNum.paraln.fas
    2b. Pairwise combinations of paralogues and their orthologues aligned, with "common" orthologues removed from the
    dataset, with a rooted tree and group data for BADASP analysis etc. (parafam=T)
    < PARAFAM/AccNum+ParaAccNum.parafam.fas
    < PARAFAM/AccNum+ParaAccNum.parafam.nsf
    < PARAFAM/AccNum+ParaAccNum.parafam.grp
    2c. Combined protein families consisting of a protein, all the paralogues >minsim and all orthologues for each in a
    single dataset. Unaligned. (gopherfam=T)
    < SUBFAM/AccNum.subfam.fas
    *NB.* The subfamily outputs involve Gopher calling itself to ensure the paralogues have gone through the Gopher
    process themselves. This could potentially cause conflict if forking is used.
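
    The four stages and their outputs can be sketched as follows. This is a hypothetical placeholder, not the real gopher.py API; the real program runs BLAST, orthologue filtering, alignment and tree-building at each step, whereas this sketch only tracks the output files named above:

```python
# High-level sketch of the per-query GOPHER flow described above
# (illustrative only; no files are created by this sketch).

STAGES = ["orthblast", "orthfas", "orthalign", "orthtree"]

def gopher_outputs(acc, run_to="orthtree"):
    """Return the output files produced for one query accession when
    running stages 1-4 up to run_to (cf. the run control parameters)."""
    paths = {
        "orthblast": "BLAST/%s.blast" % acc,
        "orthfas":   "ORTH/%s.orth.fas" % acc,
        "orthalign": "ALN/%s.orthaln.fas" % acc,
        "orthtree":  "TREE/%s.orth.nsf" % acc,
    }
    made = []
    for stage in STAGES:
        made.append(paths[stage])   # the real program writes each file
        if stage == run_to:
            break
    return made

print(gopher_outputs("P12345", run_to="orthfas"))
# ['BLAST/P12345.blast', 'ORTH/P12345.orth.fas']
```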

    Commandline:
    ### Basic Input/Output ###

        seqin=FILE      : Fasta file of 'query' sequences for orthology discovery []
        orthdb=FILE     : Fasta file with pool of sequences for orthology discovery []. Should contain query sequences.
        startfrom=X     : Accession Number / ID to start from. (Enables restart after crash.) [None]
        dna=T/F         : Whether to analyse DNA sequences (not optimised) [False]
    

    ### GOPHER run control parameters ###
    orthblast : Run to BLAST versus orthdb (Stage 1).
    orthfas : Run to output of orthologues (Stage 2).
    orthalign : Run to alignment of orthologues (Stage 3).
    orthtree : Run to tree-generation (Stage 4). [default!]

    ### GOPHER Orthologue identification Parameters ###

        postdup=T/F     : Whether to align only post-duplication sequences [False]
        reciprocal=T/F  : Use Reciprocal Best Hit method instead of standard GOPHER approach [False]
        fullrbh=T/F     : Whether RBH method should run BLAST searches for all potential RBH orthologues [False]
        compfilter=T/F  : Whether to use complexity filter and composition statistics for *initial* BLAST [True]
        minsim=X        : Minimum %similarity of Query for each "orthologue" [40.0]
        simfocus=X      : Style of similarity comparison used for MinSim and "Best" sequence identification [query]
            - query = %query must be > minsim (Best if query is ultimate focus and maximises closeness of returned orthologues)
            - hit = %hit must be > minsim (Best if lots of sequence fragments are in searchdb and should be retained)
            - either = %query > minsim OR %hit > minsim (Best if both above conditions are true)
            - both = %query > minsim AND %hit > minsim (Most stringent setting)
        gablamo=X       : GABLAMO measure to use for similarity measures [Sim]
            - ID = %Identity (from BLAST)
            - Sim = %Similarity (from BLAST)
            - Len = %Coverage (from BLAST)
            - Score = BLAST BitScore (Reciprocal Match only)
        goodX=LIST      : Filters where only sequences meeting the requirement of LIST are kept.
                          - LIST may be a list X,Y,..,Z or a FILE which contains a list [None]
                            - goodacc  = list of accession numbers
                            - goodseq  = list of sequence names
                            - goodspec = list of species codes
                            - gooddb   = list of source databases
                            - gooddesc = list of terms, at least one of which must be in description line
        badX=LIST       : As goodX but excludes rather than retains filtered sequences
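
    Since minsim is the minimum %similarity required of each hit, the four simfocus styles reduce to simple comparisons. A minimal sketch (illustrative function, not GOPHER's code):

```python
# Sketch of the simfocus decision: given the chosen GABLAMO stat as a
# percentage of the query and of the hit, decide whether a hit passes
# minsim under each simfocus style. Illustrative only.

def passes_minsim(qry_pc, hit_pc, minsim=40.0, simfocus="query"):
    checks = {
        "query":  qry_pc >= minsim,
        "hit":    hit_pc >= minsim,
        "either": qry_pc >= minsim or hit_pc >= minsim,
        "both":   qry_pc >= minsim and hit_pc >= minsim,
    }
    return checks[simfocus]

print(passes_minsim(55.0, 30.0, simfocus="query"))   # True
print(passes_minsim(55.0, 30.0, simfocus="both"))    # False
```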
    

    ### Additional run control options ###

        repair=T/F      : Repair mode - replace previous files if date mismatches or files missing (skip missing files if False) [True]

        force=T/F       : Whether to force execution at current level even if results are new enough [False]
        fullforce=T/F   : Whether to force current and previous execution even if results are new enough [False]
        dropout=T/F     : Whether to "drop out" at earlier phases, or continue with single sequence [False]
        ignoredate=T/F  : Ignores the age of files and only replaces if missing [False]
        savespace=T/F   : Save space by deleting intermediate blast files during orthfas [True]
        maxpara=X       : Maximum number of paralogues to consider (large gene families can cause problems) [50]
        oldblast=T/F    : Run with old BLAST rather than BLAST+ [False]
    

    ### Additional Output Options ###

        runpath=PATH    : Directory from which to run GOPHER. (NB. Will look for input here unless full paths given) [./]
        gopherdir=PATH  : Parent directory for output of files [./]
        organise=T/F    : Output files according to orthdb and species (code or TaxID - needs conversion) [True]
        unispec=FILE    : UniProt species file for TaxID conversion. Will use TaxID instead of Species Code if given [None]
        alnprog=X       : Choice of alignment program to use (clustalw/clustalo/muscle/mafft/fsa) [clustalo]
        orthid=X        : File identifier (Lower case) for orthology files [orth]
        paralign=T/F    : Whether to produce paralogue alignments (>minsim) in PARALN/ (assuming run to orthfas+) [False]
        parasplice=T/F  : Whether splice variants (where identified) are counted as paralogues [False]
        parafam=T/F     : Whether to produce paralogue paired subfamily alignments (>minsim) (assuming run to orthfas+) [False]
        gopherfam=T/F   : Whether to combine paralogous gopher orthologues into protein families (>minsim) (assuming run to orthfas+) [False]
        sticky=T/F      : Switch on "Sticky Orthologous Group generation" [False] *** Experimental ***
        stiggid=X       : Base for Stigg ID numbers [STIGG]
    

    Uses general modules: copy, gc, glob, os, string, sys, threading, time
    Uses RJE modules: rje, rje_blast, rje_dismatrix, rje_seq, rje_tree
    Other modules needed: rje_ancseq, rje_pam, rje_sequence, rje_tree_group, rje_uniprot

    Program: HAQESAC
    Description: Homologue Alignment Quality, Establishment of Subfamilies and Ancestor Construction
    Version: 1.10
    Last Edit: 02/04/14
    Citation: Edwards et al. (2007), Nature Chem. Biol. 3(2):108-112.

    Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

    Function:
    HAQESAC is a tool designed for processing a dataset of potential homologues into a trustworthy global alignment and
    well bootstrap-supported phylogeny. By default, an initial dataset consisting of homologues (detected by BLAST, for example)
    is processed into a dataset consisting of the query protein and its orthologues in other species plus paralogous
    subfamilies in the same gene family.

    Individual sequences are therefore screened to fulfil two criteria:
    1. They must be homologous to (and alignable with) the query sequence of interest.
    2. They must be a member of a subfamily within the gene family to which the query sequence belongs.

    HAQ. The first stage of data cleanup is therefore to remove rogue sequences that either do not fit in the gene family at
    all or are too distantly related to the query protein for a decent alignment that can be used for useful further
    analysis. This is achieved firstly by a simple identity cut-off, determined by pairwise alignments of sequences, and
    then by a more complex procedure of removing sequences for which the overall alignability is poor. During this procedure,
    sequences that have too many gaps are also removed as too many gapped residues can cause problems for downstream
    evolutionary analyses. Further screening is achieved based on phylogenetic information.

    ES. Once the dataset has been 'cleaned up' (and, indeed, during processing), HAQESAC can be used to assign sequences to
    subgroups or subfamilies, if such information is needed for downstream analyses.

    AC. The final step that HAQESAC is able to perform is ancestral sequence prediction using the GASP (Gapped Ancestral
    Sequence Prediction) algorithm (Edwards & Shields 2005).

    By default, HAQESAC will perform all these operations. However, it is possible to turn one or more off and only, for
    example, reject individually badly aligned sequences, if desired. Details can be found in the Manual and at the website:
    http://www.southampton.ac.uk/~re1u06/software/packages/seqsuite/.
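
    The identity and gap screens of the HAQ stage can be sketched as follows. These are illustrative functions mirroring the qryid and maxgap parameters listed below, not the real haqesac.py code (which works from BLAST/GABLAMO statistics rather than a naive column count):

```python
# Sketch of HAQ-style screening: keep sequences alignable with the query
# (>= qryid %identity) and without too high a gap proportion (<= maxgap).
# Illustrative only; sequences and thresholds are toy examples.

def pc_identity(a, b):
    """Naive %identity over mutually ungapped columns of two aligned,
    equal-length sequences."""
    aligned = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not aligned:
        return 0.0
    return 100.0 * sum(x == y for x, y in aligned) / len(aligned)

def haq_screen(query, seqs, qryid=40.0, maxgap=0.5):
    """Keep sequences alignable with the query and not too gappy."""
    kept = {}
    for name, seq in seqs.items():
        gap_frac = seq.count("-") / len(seq)
        if pc_identity(query, seq) >= qryid and gap_frac <= maxgap:
            kept[name] = seq
    return kept

query = "MKT-LLILAV"
seqs = {"orth1": "MKTA-LILAV",   # close homologue: kept
        "rogue": "GGG-GGGGGG"}   # unalignable rogue: rejected
print(sorted(haq_screen(query, seqs)))   # ['orth1']
```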

    Commandline Options:
    # General Dataset Input/Output Options #

        seqin=FILE  : Loads sequences from FILE (fasta, phylip, aln, uniprot format, or list of fastacmd names for fasdb option)
        query=X     : Selects query sequence by name (or part of name, e.g. Accession Number)
        acclist=X   : Extract only AccNums in list. X can be FILE or list of AccNums X,Y,.. [None]
        fasdb=FILE  : Fasta format database to extract sequences from [None]
        basefile=X  : Basic 'root' for all files X.* [By default will use 'root' of seqin=FILE if given or haq_AccNum if qblast]
        v=X         : Sets verbosity (-1 for silent) [0]
        i=X         : Sets interactivity (-1 for full auto) [0]
        d=X         : Data output level (0-3) [1]
        log=FILE    : Redirect log to FILE [Default = calling_program.log or basefile.log]
        newlog=T/F  : Create new log file. [Default = False: append log file]
        multihaq=T/F: If pickle present, will load and continue, else will part run then pickle and stop [False]
    

    # Pre-HAQESAC Data selection #

        qseqfile=FILE   : Sequence file from which to extract query (query=X) and perform BLAST [None]
        qblast=FILE     : BLAST database against which to BLAST query (see rje_blast.py options) [None] ['blaste=1e-7','blastv=1000','blastb=0']
        qfastacmd=X     : Extract query X from qblast database using fastacmd (may also need query=X) [None]
    

    # Pre-HAQESAC Sequence Filtering #

        gnspacc=T/F     : Convert sequences into gene_SPECIES__AccNum format wherever possible. [True] 
        backup=T/F      : Whether to backup initial fasta file. Will overwrite existing *.fas.bak. [True]
        accnr=T/F       : Check for redundant Accession Numbers/Names on loading sequences. [False]
        #!# filterseq=FILE  : Filters out sequences in given file (Log, Fasta or list of names)
        #!# filterspec=FILE : Filters out sequences according to species (codes) listed in given file
        unkspec=T/F     : Whether sequences of unknown species are allowed [True]
        dblist=X,Y,..,Z : List of databases in order of preference (good to bad)
        dbonly=T/F      : Whether to only allow sequences from listed databases [False]
        #!# keepdesc=FILE   : Only keeps sequences with 1+ of text listed in given file in sequence description
        minlen=X        : Minimum length of sequences [0]
        maxlen=X        : Maximum length of sequences (<=0 = No maximum) [0]
        goodX=LIST      : Filters where only sequences meeting the requirement of LIST are kept.
                          - LIST may be a list X,Y,..,Z or a FILE which contains a list [None]
                            - goodacc  = list of accession numbers
                            - goodseq  = list of sequence names
                            - goodspec = list of species codes
                            - gooddb   = list of source databases
                            - gooddesc = list of terms, at least one of which must be in description line
        badX=LIST       : As goodX but excludes rather than retains filtered sequences
    

    # Pre-HAQ Data Cleanup #

        cleanup=T/F : Initial data cleanup [True]
        seqnr=T/F   : Make sequence Non-Redundant [True]
        specnr=T/F  : Non-Redundancy within same species only [True]
        nrid=X      : %Identity cut-off for Non-Redundancy [99.0]
        blastcut=X  : Maximum number of sequences to have in dataset (BLAST query against NR dataset.) [0]
        blaste=X    : E-Value cut-off for BLAST searches (BLAST -e X) [1e-4]
        blastv=X    : Number of one-line hits per query (BLAST -v X) [500]
        blastf=T/F  : Complexity Filter (BLAST -F X) [True]
        qcover=X    : Min. % (ordered) BLAST coverage of Query vs Hit or Hit vs Query [60.0]
        gablam=T/F  : Whether to use GABLAMO Global Alignment from BLAST Local Alignment Matrix (Ordered) rather than ALIGN [True]
        gabsim=T/F  : Whether to use %Similarity for GABLAMO comparisons, rather than %Identity [True]
        pairid=X    : Fasta Alignment ID cut-off for any pair of sequences [0.0]
        qryid=X     : Fasta Alignment ID with Query cut-off [40.0]
        maxgap=X    : Maximum proportion of sequence that may be gaps compared to nearest neighbour (<=0 = No maximum) [0.5]
    

    # General HAQ Options #

        qregion=X,Y     : Concentrate on the region of the query from (and including) residue X to residue Y [0,-1]
        haq=T/F         : Homologue Alignment Quality [True]
        noquery=T/F     : No Query for SAQ, Random Query for PAQ (else query=X or first sequence) [False]
        keep=T/F        : Keep all sequences (saqkl=0, paqkl=0) [False]
        usealn=T/F      : Use current alignment (do not realign, degap=F) [False]
        cwcut=X         : Total number of residues above which to use ClustalW for alignment in place of alnprog=X [0]
        haqmatrix=FILE  : File of AA vs AA scores used in SAQ and PAQ. [None]
    

    # Single Sequence Alignment Quality (SAQ) Options #

        saq=T/F     : Single Sequence AQ [True]
        saqc=X      : Min score for a residue in SAQ (Default matrix: no. seqs to share residue). [2]
        saqb=X      : SAQ Block length. [10]
        saqm=X      : No. residues to match in SAQ Block. [7]
        saqks=X     : Relative Weighting of keeping Sequences in SAQ. [3]
        saqkl=X     : Relative Weighting of keeping Length in SAQ. [1]
        mansaq=T/F  : Manual over-ride of sequence rejection decisions in SAQ [False]
    

    # Pairwise Alignment Quality (PAQ) Options #

        prepaq=T/F  : PrePAQ tree grouping [True]
        paq=T/F     : Pairwise AQ [True]
        paqb=X      : PAQ Block length. [7]
        paqm=X      : Min score in PAQ Block (Default matrix: No. residues to match). [3]
        paqks=X     : Relative Weighting of keeping Sequences in PAQ. [3]
        paqkl=X     : Relative Weighting of keeping Length in PAQ. [1]
        manpaq=T/F  : Manual over-ride of sequence rejection decisions in PAQ [False]
    

    # Establishment of Subfamilies (and PrePAQ) Options #

        es=T/F      : Establishment of Subfamilies [True]
        root=X      : Rooting of tree (rje_tree.py), where X is:
            - mid = midpoint root tree. [Default]
            - ran = random branch.
            - ranwt = random branch, weighted by branch lengths.
            - man = always ask for rooting options (unless i>0).
            - FILE = with seqs in FILE as outgroup. (Any option other than above)
        bootcut=X   : cut-off percentage of tree bootstraps for grouping.
        mfs=X       : minimum family size [3]
        fam=X       : minimum number of families (If 0, no subfam grouping) [0]
        orphan=T/F  : Whether orphan sequences (not in subfam) are allowed. [True]
        allowvar=T/F: Allow variants of same species within a group. [False]
        qryvar=T/F  : Allow variants of query species within a group (over-rides allowvar=F). [False]
        groupspec=X : Species for duplication grouping [None]
        qspec=T/F   : Whether to highlight query species in PNG tree files [True]
        specdup=X   : Minimum number of different species in clade to be identified as a duplication [2]
        group=X     : Grouping of tree
            - man = manual grouping (unless i>0).
            - dup = duplication (all species unless groupspec specified).
            - qry = duplication with species of Query sequence (or Sequence 1) of treeseq
            - one = all sequences in one group
            - None = no group (case sensitive)
            - FILE = load groups from file
        phyoptions=FILE : File containing extra Phylip tree-making options ('batch running') to use [None]
        protdist=FILE   : File containing extra Phylip PROTDIST options ('batch running') to use [None]
        maketree=X      : Program for making tree [None]
            - None = Do not make tree from sequences 
            - clustalw = ClustalW NJ method
            - neighbor = PHYLIP NJ method (NB. Bootstraps not yet supported)
            - upgma    = PHYLIP UPGMA (neighbor) method (NB. Bootstraps not yet supported)
            - fitch    = PHYLIP Fitch method (NB. Bootstraps not yet supported)
            - kitsch   = PHYLIP Kitsch (clock) method (NB. Bootstraps not yet supported)
            - protpars = PHYLIP MP method (NB. Bootstraps not yet supported)
            - proml    = PHYLIP ML method (NB. Bootstraps not yet supported)
            - PATH     = Alternatively, a path to a different tree program/script can be given. This should accept ClustalW parameters.
    

    # Ancestor Construction (GASP) Options

        ac=T/F      : Ancestor Construction (GASP) [True]
        pamfile=FILE: Sets PAM1 input file [jones.pam]
        pammax=X    : Initial maximum PAM matrix to generate [100]
        pamcut=X    : Absolute maximum PAM matrix [1000]
        fixpam=X    : PAM distance fixed to X [0].
        rarecut=X   : Rare aa cut-off [0.05].
        fixup=T/F   : Fix AAs on way up (keep probabilities) [True].
        fixdown=T/F : Fix AAs on initial pass down tree [False].
        ordered=T/F : Order ancestral sequence output by node number [False].
        pamtree=T/F : Calculate and output ancestral tree with PAM distances [True].
        desconly=T/F: Limits ancestral AAs to those found in descendants [True].
        xpass=X     : How many extra passes to make down & up tree after initial GASP [1].
    

    # System Info Options #

        blastpath=PATH  : Path to BLAST programs ['c:/bioware/blast/'] * Use forward slashes (/)
        fastapath=PATH  : Path to FASTA programs ['c:/bioware/fasta/'] * Use forward slashes (/)
        clustalw=PATH   : Path to CLUSTALW program ['c:/bioware/clustalw.exe'] * Use forward slashes (/)
        phylip=PATH     : Path to PHYLIP programs ['c:/bioware/phylip3.65/exe/'] * Use forward slashes (/)
        muscle=PATH     : Path to MUSCLE ['c:/bioware/muscle.exe'] * Use forward slashes (/)
        win32=T/F       : Run in Win32 Mode [False]
    


    Program: MultiHAQ
    Description: Multi-Query HAQESAC controller
    Version: 1.2
    Last Edit: 30/09/14
    Citation: Jones, Edwards et al. (2011), Marine Biotechnology 13(3): 496-504.
    Copyright © 2009 Richard J. Edwards - See source code for GNU License Notice

    Function:
    This module is a wrapper for multiple HAQESAC runs where different query proteins are to be BLASTed against the same
    search database(s) and run through HAQESAC with the same settings. The default expectation is that some queries will
    be returned by the HAQESAC runs of other queries and may therefore be skipped as a result, although this can be
    switched off using screenqry=F. For large runs, the first phase of MultiHAQ will take a long time to run. In these
    cases, it may be desirable to set the second, interactive, phase running before it has finished. This is achieved
    using the "chaser" option, which will set the second phase in motion, "chasing" the progress of the first. To avoid
    jumbled log output, this should be given a different log file using log=FILE.

    Note that all options will be output into a haqesac.ini file in the haqdir path, for both HAQESAC runs within the
    framework of MultiHAQ itself and also for later runs using the batch file produced. Any generic HAQESAC options
    should therefore be placed into a multihaq.ini file, not a haqesac.ini file, and multiple runs with different settings
    using the same haqdir should be avoided.

    Note: Because HAQESAC makes use of RJE_SEQ filtering options, they will NOT be applied to the MultiHAQ query input
    file prior to analysis. To filter this input, run it through rje_seq.py separately in advance of running multihaq.
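
    The query-screening step described above can be sketched as follows. This is illustrative only: prior results are passed in directly rather than read from haqdir, and the autoskip decision is shown as a flag (the real program can prompt interactively per query):

```python
# Sketch of the screenqry/autoskip logic: queries already returned by
# earlier HAQESAC runs of other queries can be skipped. Illustrative
# only, not the real MultiHAQ code.

def queries_to_run(queries, prior_hits, screenqry=True, autoskip=True):
    """queries: list of query accessions; prior_hits: accessions already
    returned by earlier runs. Returns the queries still needing a run."""
    if not screenqry or not autoskip:
        # With screenqry=F nothing is skipped; with autoskip=F the real
        # program would prompt per query, so this sketch keeps them all.
        return list(queries)
    return [q for q in queries if q not in set(prior_hits)]

print(queries_to_run(["Q1", "Q2", "Q3"], prior_hits={"Q2"}))
# ['Q1', 'Q3']
```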

    Commandline:
    ### ~~~ INPUT OPTIONS ~~~ ###

        seqin=FILE      : Input query sequences [None]
        blast2fas=LIST  : List of databases to BLAST queries against prior to HAQESAC []
        addqueries=T/F  : Whether to add query database to blast2fas list [True]
    

    ### ~~~ MULTIHAQ OPTIONS ~~~ ###

        haqesac=T/F     : Run HAQESAC (True) or limit to batch file output (False) [True]
        multihaq=T/F    : Whether to run HAQESAC in two-phase multihaq mode [True]
        screenqry=T/F   : Whether to look for queries in previous runs and give option to skip [True]
        autoskip=T/F    : Whether to automatically skip queries found in previous runs [False]
        chaser=T/F      : Option for running second phase of multihaq as second run whilst first run in progress [False]
        force=T/F       : Whether to force re-running of stages (True) or pick-up pre-existing runs (False) [False]
    

    ### ~~~ OUTPUT OPTIONS ~~~~ ###

        haqdir=PATH     : Directory in which to output HAQESAC files and perform run [seqin_HAQESAC]
    

    See also haqesac.py commands and rje.py generic commandline options.

    Uses general modules: copy, glob, os, string, sys, time
    Uses RJE modules: rje, rje_zen
    Other modules needed: None

    Module: PeptCluster
    Description: Peptide Clustering Module
    Version: 1.4
    Last Edit: 20/08/13
    Copyright © 2012 Richard J. Edwards - See source code for GNU License Notice

    Function:
    This program is for simple sequence-based clustering of short (aligned) peptide sequences. First, a pairwise distance
    matrix is generated from the peptides. This distance matrix is then used to generate a tree using a distance method
    such as Neighbour-Joining or UPGMA.

    Default distances are amino acid property differences loaded from an amino acid property matrix file.
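
    The property-difference distance can be sketched as below. The property table here is a toy stand-in for an aaprop.txt file, and the function is illustrative, not the real PeptCluster code:

```python
# Sketch of peptdis=prop: the distance between two aligned peptides is
# the summed per-position count of differing amino acid properties.
# AAPROP is a toy table of (hydrophobic, charged, small) flags.

AAPROP = {
    "A": (1, 0, 1), "K": (0, 1, 0), "L": (1, 0, 0), "S": (0, 0, 1),
}

def pept_dist(p1, p2):
    """Sum of per-position property differences between aligned peptides."""
    return sum(sum(a != b for a, b in zip(AAPROP[x], AAPROP[y]))
               for x, y in zip(p1, p2))

peptides = ["ALK", "ALS", "KLS"]
# Pairwise distance matrix, which the real program then feeds to a
# distance method such as UPGMA or Neighbour-Joining.
matrix = {(a, b): pept_dist(a, b) for a in peptides for b in peptides}
print(matrix[("ALK", "ALS")], matrix[("ALK", "KLS")])   # 2 5
```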

    Commandline:
    ### ~ INPUT OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        peptides=LIST   : These can be entered as a list or file. Lines following '#' or '<' ignored [peptides.txt]
        aaprop=FILE     : File of amino acid properties [aaprop.txt]
        aadis=FILE      : Alternative amino acid distance matrix [None]
    

    ### ~ CLUSTER OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        peptdis=X       : Method for generating pairwise peptide distances (id/prop/pam) [prop]
        peptcluster=X   : Clustering mode (upgma/wpgma/neighbor) [upgma]
    

    ### ~ OUTPUT OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        savedis=FILE    : Output distance matrix to file [peptides.*]
        outmatrix=X     : Type for output matrix - tdt / csv / mysql / phylip / png [tdt]
        savetree=FILE   : Save generated tree as FILE [peptides.peptdis.peptcluster.*]
        treeformats=LIST: List of output formats for generated trees (nsf/nwk/text/r/png/bud/qspec/cairo/te/svg/html) [nwk,text]
    

    See also rje.py generic commandline options.

    Uses general modules: copy, glob, os, string, sys, time
    Uses RJE modules: rje, rje_db, rje_obj, rje_zen
    Other modules needed: None

    Program: PINGU
    Description: Protein Interaction Network & GO Utility
    Version: 3.9
    Last Edit: 16/07/13
    Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

    Function:
    This utility was originally created for handling proteomics data with EnsEMBL peptide IDs. The data needed to be
    mapped onto Genes, overlaps and redundancies identified, gene lists output for GO analysis with FatiGO, and PPI data
    from HPRD and BioGRID to identify potential complexes.

    See rje_ensembl documentation for details of what to download for EnsGO files and how to make EnsLoci files etc. The
    ens_SPECIES.GO.tdt file used for go mapping should be suitable for the ensmap=FILE file.

    Giving a ppioutdir=PATH will produce combined PPI sequence files for all genes. A pingu.combinedppi.tdt summary file
    will be placed in resdir.

    QSLiMFinder=FILE will perform an analysis for shared motifs between primary interactors of those genes identified
    in a given sample and the original sequence used for the pulldown. FILE should be a fasta file where names of the
    sequences match Sample names. Datasets will then be formed that contain that sequence plus the primary PPI of each
    gene in that sample as a dataset named SAMPLE_GENE.fas in a directory RESDIR/SLiMFinder.
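    The dataset construction described above can be sketched as follows. This is a minimal illustration, not PINGU's
    actual implementation; the names make_qslimfinder_datasets, sample_seqs, sample_genes and primary_ppi are
    hypothetical, and real PPI sequences would come from the EnsLoci files.

    ```python
    import os

    def make_qslimfinder_datasets(sample_seqs, sample_genes, primary_ppi, resdir="./"):
        """Sketch of QSLiMFinder dataset assembly: for each gene identified in a
        sample, write a fasta dataset containing the sample's query sequence plus
        the primary interactors of that gene, named SAMPLE_GENE.fas in RESDIR/SLiMFinder.
        sample_seqs: {sample name: query sequence}; sample_genes: {sample: [genes]};
        primary_ppi: {gene: [(interactor name, sequence), ...]}."""
        outdir = os.path.join(resdir, "SLiMFinder")
        os.makedirs(outdir, exist_ok=True)
        written = []
        for sample, genes in sample_genes.items():
            query = sample_seqs.get(sample)
            if query is None:  # sequence names must match Sample names
                continue
            for gene in genes:
                path = os.path.join(outdir, "%s_%s.fas" % (sample, gene))
                with open(path, "w") as fas:
                    fas.write(">%s\n%s\n" % (sample, query))
                    for name, seq in primary_ppi.get(gene, []):
                        fas.write(">%s\n%s\n" % (name, seq))
                written.append(path)
        return written
    ```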

    Commandline:
    ### Main Input Options ###

        data=LIST       : List of files of results containing "Sample" and "Identifier" columns
        ensmap=FILE     : Mappings from EnsEMBL - peptides, genes and HGNC IDs (from BioMart)
        ipilinks=FILE   : IPI Links file with 'IPI', 'Symbol' and 'EnsG' fields []
        ensloci=FILE    : File of EnsEMBL genome EnsLoci treatment []
        baits=LIST      : List of genes of interest for overlap analysis []
        addbaits=T/F    : Whether to add primary interactors of baits as additional samples [False]
        combaits=X      : Combine bait PPIs into a single sample named X (if addbaits=T) []
        controls=LIST   : List of sample names that correspond to controls [Control]
        experiments=LIST: List of sample names that correspond to key samples of interest []
        exponly=T/F     : Limit analysis to samples listed as experiments (before baits added etc.) [False]
        addalias=FILE   : Extra (manual?) aliases to add to GeneMap object following loading of pickles etc. [None]
    

    ### Processing Options ###

        hgnconly=T/F    : Whether to restrict PPI data to only those proteins with Gene Symbol links [False]
        pickle=T/F      : Whether to save/load pickle of parsed/combined data rather than regenerating each time [True]
        pingupickle=FILE: Full path to Pingu pickle file to look for/use/save [pingu.pickle]
        nocontrols=T/F  : Whether to remove genes found in designated controls from designated experiments [False]
        gablam=T/F      : Whether to run all-by-all GABLAM on EnsLoci and add homology to networks [True]
        ppitype=LIST    : List of acceptable interaction types to parse out []
        badtype=LIST    : List of bad interaction types, to exclude [indirect_complex,neighbouring_reaction]
        makefam=X       : GABLAM Percentage identity threshold for grouping sequences into families [0.0]
        gofilter=LIST   : List of GO IDs to filter out of gene lists []
        goexcept=LIST   : List of GO ID exceptions to filtering []
        remsticky=X     : Remove "sticky" hubs as defined by <X known PPI [0]
        stickyhubs=T/F  : Only remove "sticky" spokes but keep sticky hubs [False]
        stickyppi=T/F   : Only remove "sticky" hubs from samples, not from total PPI [False]
        addlinks=T/F    : Add linking proteins (linking two Sample proteins) [False]
    

    ### Main Output Options ###

        resdir=PATH     : Redirect output files to specified directory [./]
        basefile=X      : Results file prefix if no data file given with data=FILE [pingu]
        fulloutput=T/F  : Generate all possible outputs from one input [False]
        genelists=T/F   : Generate lists of genes for each sample (e.g. for FatiGO upload) [False]
        gosummary=T/F   : Make a GO summary table [False]
        summaryhgnc=T/F : Generate a summary table of genes in dataset, including peptide lists for each sample [False]
        mapout=T/F      : Generate a summary table of full peptide mapping [False]
        dbcomp=T/F      : Comparison of PPI databases [False]
        dbsizes=T/F     : Outputs a file of PPI dataset sizes (histogram) [False]
        allbyall=X      : Generates an all-by-all table of PPI links up to X degrees of separation (sample only) [0]
        pathfinder=X    : Perform (lengthy) PathFinder analysis to link genes up to X degrees of separation (-1 = no limit) [0]
        pathqry=LIST    : Limit PathFinder analysis to start with given queries []
        overlap=T/F     : Produce a table of the overlap (mapped through HGNC) between samples (and bait 1y PPI) [False]
        cytoscape=T/F   : Produce old cytoscape input files from allbyall table (reads back in) [False]
        xgmml=T/F       : Produce an XGMML file with all Cytoscape data and more [False]
        xgformat=T/F    : Whether to add colour/shape formatting to XGMML output [False]
        xgexpand=X      : Expand XGMML network with additional levels of interactors [0]
        xgcomplex=T/F   : Restrict XGMML output (and expansion) to protein complex edges [False]
        compresspp=T/F  : Whether to compress multiple samples of interest into ShareX for cytoscape [False]
        seqfiles=T/F    : Whether to generate protein sequence fasta files using EnsLoci [False]
        goseqdir=PATH   : Path to output full GO fasta files (No output if blank/none) []
        ppioutdir=PATH  : Path to output combined PPI files (No output if blank/none) []
        acconly=T/F     : Whether to output lists of Accession numbers only, rather than full fasta files [False]
        ensdat=PATH     : Path to EnsDAT files to use for making combined PPI datasets [None]
        qslimfinder=FILE: File containing sequences matching Sample names for Query SLiMFinder runs [None]
        screenddi=FILE  : Whether to screen out probable domain-domain interactions from file [None]
        domppidir=PATH  : Produce domain-based PPI files and output into PATH (No output if blank/none) []
        nocomplex=T/F   : Perform crude screening of complexes (PPI triplets w/o homodimers) [False]
        fasid=X         : Text ID for PPI fasta files ['ppi']
        association=T/F : Perform experiment association analysis [False]
        asscombo=T/F    : Whether to subdivide genes further based on combinations of experiments containing them [False]
        noshare=T/F     : Whether to exclude those genes that are shared between samples when comparing those samples [True]
        selfonly=T/F    : Whether to only look at associations within experiments, not between [False]
        randseed=X      : Seed for randomiser [0]
        randnum=X       : Number of randomisations [1000]
    

    ### Database/Path options ###

        enspath=PATH    : Path to EnsEMBL downloads
        ensgopath=PATH  : Path to EnsGO files   (!!! Restricted to Humans Currently !!!)
        unipath=PATH    : Path to UniProt files [UniProt/]
        hprd=PATH       : Path to HPRD flat files [None]
        biogrid=FILE    : BioGRID flat file [None]
        intact=FILE     : IntAct flat file [None]
        mint=FILE       : MINT flat file [None]
        reactome=FILE   : Reactome interactions flat file [None]
        dip=FILE        : DIP interactions flat file [None]
        domino=FILE     : Domino interactions flat file [None]
        pairwise=FILE   : Load interaction data from existing Pingu Pairwise file [None]
        addppi=FILE     : Add additional PPI from a simple delimited file IDA,IDB,Evidence [None]
        genepickle=FILE : Pickled GeneMap object. Alternatively, use below commands to make GeneMap object [None]
        - hgncdata/sourcedata/pickledata/aliases : See rje_genemap docstring.
    
        pfamdata=FILE   : Delimited files containing domain organisation of sequences [None]
        evidence=FILE   : Mapping file for evidence terms [None]

    Uses general modules: copy, glob, os, string, sys, time
    Uses RJE modules: rje
    Other modules needed: None

    SeqSuite Extra programs/modules

    More information can also be found at the SLiMSuite blog.
    The SeqSuite Extra programs/modules in SLiMSuite are: BADASP, BUDAPEST, FIESTA, GASP, GFESSA, PICSI, UniFake, SeqMapper, SeqForker, rje_seqgen. Click on the tabs, read the manuals, or see the ReadMe for more details.

    Program: BADASP
    Description: Burst After Duplication with Ancestral Sequence Prediction
    Version: 1.3
    Last Edit: 24/05/06
    Citation: Edwards & Shields (2005), Bioinformatics 21(22):4190-1.

    Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice

    Function:
    BADASP implements the previously published Burst After Duplication (BAD) algorithm, plus two variants that have been used
    successfully in identifying functionally interesting sites in platelet signalling proteins and can identify Type I and
    Type II divergence. In addition, several other measures of functional specificity and conservation are calculated and
    output in plain text format for easy import into other applications. See Manual for details.

    Commandline:
    # General Dataset Input/Output Options #

        seqin=FILE  : Loads sequences from FILE
        query=X     : Selects query sequence by name (or part of name, e.g. Accession Number)
        basefile=X  : Basic 'root' for all files X.* [By default will use 'root' of seqin=FILE if given or haq_AccNum if qblast]
        v=X         : Sets verbosity (-1 for silent) [0]
        i=X         : Sets interactivity (-1 for full auto) [0]
        log=FILE    : Redirect log to FILE [Default = calling_program.log or basefile.log]
        newlog=T/F  : Create new log file. [Default = False: append log file]
        rank=T/F    : Whether to output ranks as well as scores [True]
        append=FILE : Append results to FILE instead of the standard *.badasp output
        trimtrunc=T/F   : Whether to trim leading and trailing gaps (within groups) [False]
        winsize=X       : Window size for window scores
    

    # BADASP Statistics #

        funcspec=X1[,X2,..] : List of functional specificity methods to apply X1,X2,..,XN
            - BAD   = Burst After Duplication (2 Subfamilies)
            - BADX  = Burst After Duplication Extra (Query Subfam versus X subfams)
            - BADN  = Burst After Duplication vs N Subfams (2+ Subfams)
            - SSC   = Livingstone and Barton Score
            - PDAD  = Variant of Livingstone and Barton
            - ETA   = Evolutionary Trace Analysis (Basic)
            - ETAQ  = Evolutionary Trace Analysis (Quantitative)
            - all   = All of the above!
        seqcon=X1[,X2,..] : List of sequence conservation measures to apply X1,X2,..,XN
            - Info  = Information content
            - PCon  = Property Conservation (Absolute)
            - MPCon = Mean Property Conservation
            - QPCon = Mean Property Conservation with Query
            - all   = All of the above
    

    # Tree and Grouping Options #

        nsfin=FILE  : File name for Newick Standard Format tree
        root=X      : Rooting of tree (rje_tree.py), where X is:
            - mid = midpoint root tree. [Default]
            - ran = random branch.
            - ranwt = random branch, weighted by branch lengths.
            - man = always ask for rooting options (unless i>0).
            - FILE = with seqs in FILE as outgroup. (Any option other than above)
        bootcut=X   : cut-off percentage of tree bootstraps for grouping.
        mfs=X       : minimum family size [3]
        fam=X       : minimum number of families (If 0, no subfam grouping) [0]
        orphan=T/F  : Whether orphan sequences (not in a subfam) are allowed. [True]
        allowvar=T/F: Allow variants of same species within a group. [False]
        gnspacc=T/F : Convert sequences into gene_SPECIES__AccNum format wherever possible. [True] 
        groupspec=X : Species for duplication grouping [None]
        group=X     : Grouping of tree
            - man = manual grouping (unless i>0).
            - dup = duplication (all species unless groupspec specified).
            - qry = duplication with species of Query sequence (or Sequence 1) of treeseq
            - one = all sequences in one group
            - None = no group (case sensitive)
            - FILE = load groups from file
    

    # GASP ancestral sequence prediction options #

        useanc=FILE : Gives file of predicted ancestral sequences
        pamfile=FILE: Sets PAM1 input file [jones.pam]
        pammax=X    : Initial maximum PAM matrix to generate [100]
        pamcut=X    : Absolute maximum PAM matrix [1000]
        fixpam=X    : PAM distance fixed to X [100].
        rarecut=X   : Rare aa cut-off [0.05].
        fixup=T/F   : Fix AAs on way up (keep probabilities) [True].
        fixdown=T/F : Fix AAs on initial pass down tree [False].
        ordered=T/F : Order ancestral sequence output by node number [False].
        pamtree=T/F : Calculate and output ancestral tree with PAM distances [True].
        desconly=T/F: Limits ancestral AAs to those found in descendants [True].
        xpass=X     : How many extra passes to make down & up tree after initial GASP [1].
    

    # System Info Options #

        win32=T/F       : Run in Win32 Mode [False]
    

    Please see help for rje_tree.py and rje_seq.py for additional options not covered here.

    Uses general modules: copy, string, sys, time
    Uses RJE modules: rje, rje_aaprop, rje_ancseq, rje_conseq, rje_seq, rje_specificity, rje_tree
    Additional modules required: rje_blast, rje_dismatrix, rje_pam, rje_sequence, rje_tree_group, rje_uniprot

    Program: BUDAPEST
    Description: Bioinformatics Utility for Data Analysis of Proteomics on ESTs
    Version: 2.3
    Last Edit: 31/07/13
    Citation: Jones, Edwards et al. (2011), Marine Biotechnology 13(3): 496-504.
    Copyright © 2008 Richard J. Edwards - See source code for GNU License Notice

    Function:
    Proteomic analysis of EST data presents a bioinformatics challenge that is absent from standard protein-sequence
    based identification. EST sequences are translated in all six Reading Frames (RF), most of which will not be
    biologically relevant. In addition to increasing the search space for the MS search engines, there is also the added
    challenge of removing redundancy from results (due to the inherent redundancy of the EST database), removing spurious
    identifications (due to the translation of incorrect reading frames), and identifying the true protein hits through
    homology to known proteins.

    BUDAPEST (Bioinformatics Utility for Data Analysis of Proteomics on ESTs) aims to overcome some of these problems by
    post-processing results to remove redundancy and assign putative homology-based identifications to translated RFs
    that have been "hit" during a MASCOT search of MS data against an EST database. Peptides assigned to "incorrect" RFs
    are eliminated and EST translations combined into consensus sequences using FIESTA (Fasta Input EST Analysis). These
    consensus hits are optionally filtered on the number of MASCOT peptides they contain before being re-annotated using
    BLAST searches against a reference database. Finally, HAQESAC can be used for automated or semi-automated phylogenetic
    analysis for improved sequence annotation.

    Input:
    BUDAPEST takes three main files as input:
    * A MASCOT results file, specified by mascot=FILENAME.
    * The EST sequences used in (or, at least, hit by) the MASCOT search, in fasta format, specified by seqin=FILENAME.
    * A protein database for BLAST-based annotation in fasta format, specified by searchdb=FILENAME.

    Output:
    BUDAPEST produces the following main output files, where X is set by basefile=X:
    * X.budapest.tdt = main output table of results
    * X.budapest.fas = BLAST-annotated clustered consensus EST translations using FIESTA
    * X.summary.txt = summary of results from BUDAPEST pipeline
    * X.details.txt = full details of processing for each original MASCOT hit.

    Additional information can also be obtained from the additional sequence files:
    * X.est.fas = subset of EST sequences from EST database that have 1+ hits in MASCOT results.
    * X.translations.fas = fasta format of translated RF Hits that are retained after BUDAPEST cleanup.
    * X.fiesta.fas = BLAST-annotated consensus EST translations using FIESTA (pre min. peptide filtering)
    * X_HAQESAC/X.* = HAQESAC results files for annotating translated ESTs (haqesac=T only)
    * X_seqfiles/X.cluster*.fas = fasta files of translations and BLAST hits in NR clusters (clusterfas=T only)

    Lastly, reformatted MASCOT files are produced, named after the original input file (Y):
    * Y.mascot.txt = header information from the MASCOT file.
    * Y.mascot.csv = the delimited data portion of the MASCOT file.

    Commandline:
    ### ~ INPUT OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        mascot=FILE     : Name of MASCOT csv file [None]
        seqin=FILE      : Name of EST fasta file used for search [None]
        searchdb=FILE   : Fasta file for GABLAM search of EST translations [None]
        partial=T/F     : Whether partial EST data is acceptable (True) or all MASCOT hits must be found (False) [True]
        itraq=T/F       : Whether data is from an iTRAQ experiment [False]
        empai=T/F       : Whether emPAI data is present in MASCOT file [True]
        samples=LIST    : List of X:Y, where X is an iTRAQ isotag and Y is a sample []
    
    ### ~ PROCESSING OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        minpolyat=X     : Min length of poly-A/T to be counted. (-1 = ignore all) [10]
        fwdonly=T/F     : Whether to treat EST/cDNA sequences as coding strands (False = search all 6RF) [False]
        minorf=X        : Min length of ORFs to be considered [10]
        topblast=X      : Report the top X BLAST results [10]
        minaln=X        : Min length of shared region for FIESTA consensus assembly [20]
        minid=X         : Min identity of shared region for FIESTA consensus assembly [95.0]
        minpep=X        : Minimum number of different peptides mapped to final translation/consensus [2]
    
    ### ~ SEQUENCE FORMATTING ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        newacc=X        : New base for sequence accession numbers ['BUD']
        gnspacc=T/F     : Convert sequences into gene_SPECIES__AccNum format wherever possible. [True]
        spcode=X        : Species code for EST sequences [None]
    
    ### ~ OUTPUT OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        basefile=X      : "Base" name for all results files, e.g. X.budapest.tdt [MASCOT file basename]
        hitdata=LIST    : List of hit data to add to main budapest table [prot_mass,prot_pi]
        seqcluster=T/F  : Perform additional sequence (BLAST/GABLAM) clustering [True]
        clusterfas=T/F  : Generate fasta files of translations and BLAST hits in NR clusters [False]
        clustertree=LIST: List of formats for cluster tree output (3+ seqs only) [text,nsf,png]
        fiestacons=T/F  : Use FIESTA to auto-construct consensi from BUDAPEST RF translations [True]
        haqesac=T/F     : HAQESAC analysis of identified EST translations [True]
        blastcut=X      : Reduce the number of sequences in HAQESAC runs to X (0 = no reduction) [50]
        multihaq=T/F    : Whether to run HAQESAC in two phases, with a second, manual phase [False]
        cleanhaq=T/F    : Delete excessive HAQESAC results files [True]
        haqdb=FILE      : Optional additional search database for MultiHAQ analysis [None]
    


    Program: FIESTA
    Description: Fasta Input EST Analysis
    Version: 1.9.0
    Last Edit: 26/11/14
    Citation: Jones, Edwards et al. (2011), Marine Biotechnology 13(3): 496-504.
    Copyright © 2008 Richard J. Edwards - See source code for GNU License Notice

    Function:
    FIESTA has three primary functions: (1) discovery, assembly and evolutionary analysis of candidate genes in an EST
    library; (2) assembly of an EST library for proteomics analysis; (3) translation/annotation of an EST library for
    proteomics analysis. These functions are outlined below.

    Candidate Gene Discovery:
    After optional pre-assembly of the EST library using the DNA FIESTA pipeline (see below), the candidate protein
    dataset (the QueryDB) is used for translation and annotation (see below) of those ESTs with BLAST homology to 1+
    candidates. These translations are then assembled into consensus sequences where appropriate and alignments and
    trees made for the hits to each candidate protein. Finally, if desired, HAQESAC is run for each candidate protein.

    EST Assembly:
    All EST assembly experiences two trade-offs: one between speed and accuracy, and a second between redundancy and
    accuracy. In particular, distinguishing sequencing errors from sequence variants (alleles) from different gene family
    members is not trivial.

    FIESTA is designed to provide straightforward assembly and BLAST-based annotation of EST sequences in Fasta format.
    The rationale behind its design was to try to balance quality and redundancy against comprehensive coverage for
    Proteomics identifications from Mass Spec data. FIESTA is designed to function in a relatively standalone capacity,
    with BLAST being the only other tool necessary. Due to this simplicity, FIESTA has some limitations; the main one
    being its inability to identify and deal with frameshift (indel) sequencing errors.

    FIESTA has two assembly and annotation pipelines: a protein pipeline based loosely on BUDAPEST and a DNA pipeline for
    "true" EST assembly. Details can be found in the Manual.

    EST library assembly/annotation:
    In addition to the main functions, parts of the main FIESTA assembly/annotation pipeline can be run as standalone
    functions.

    ESTs can be converted to Reading Frames (RF) with est2rf=T:
    1. Identify orientation using a 5' poly-T or 3' poly-A tail.
    - 1a. Where a poly-AT tail exists, remove it, translate in the 3 forward RFs and truncate at the terminal stop codon.
    - 1b. Where no poly-AT tail exists, translate in all six RFs.
    2. BLAST translations vs. search database with complexity filter on.
    - 2a. If EST has BLAST hits, retain RFs with desired e-value or better.
    - 2b. If no BLAST hits, retain all RFs.
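    The orientation and translation logic of steps 1a/1b can be sketched as below. This is an illustrative reimplementation
    under assumptions, not FIESTA's actual code (which also applies the BLAST filtering of step 2); the function names
    est2rf, translate and revcomp are hypothetical, and "truncate at the terminal stop codon" is interpreted here as
    keeping the reading frame segment after the last stop.

    ```python
    # Standard genetic code, built from the canonical TCAG-ordered amino acid string.
    BASES = "TCAG"
    AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
    CODON = {a + b + c: AMINO[16 * i + 4 * j + k]
             for i, a in enumerate(BASES)
             for j, b in enumerate(BASES)
             for k, c in enumerate(BASES)}

    def revcomp(dna):
        """Reverse complement (unknown bases become N)."""
        comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
        return "".join(comp.get(b, "N") for b in reversed(dna))

    def translate(dna):
        """Translate one reading frame; unknown codons become X."""
        return "".join(CODON.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

    def est2rf(est, minpolyat=10):
        """Steps 1a/1b above: orient by poly-A/T tail where possible, else all 6 RFs."""
        est = est.upper()
        if est.endswith("A" * minpolyat):        # 3' poly-A tail: sense strand
            core = est.rstrip("A")
        elif est.startswith("T" * minpolyat):    # 5' poly-T: reverse complement
            core = revcomp(est).rstrip("A")
        else:                                    # 1b: no tail, translate all six RFs
            return [translate(seq[f:]) for seq in (est, revcomp(est)) for f in range(3)]
        # 1a: tail removed, 3 forward RFs, truncated at the terminal stop codon
        return [translate(core[f:]).split("*")[-1] for f in range(3)]
    ```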

    Alternatively, translated RFs or other unannotated protein sequences can be given crude BLAST-based annotations using
    searchdb=FILE sequences with blastann=T. Note that these are simply the top BLAST hit and better annotation would be
    achieved using HAQESAC (or MultiHAQ for many sequences).

    Commandline:
    ### ~ GENERAL INPUT ~ ###

        seqin=FILE      : EST file to be processed [None]
        fwdonly=T/F     : Whether to treat EST/cDNA sequences as coding strands (False = search all 6RF) [False]
        minpolyat=X     : Min length of poly-AT to be considered a poly-AT tail [10]
        minorf=X        : Min length of ORFs to be considered [20]   
        blastopt=FILE   : File containing additional BLAST options for run, e.g. -B F [None]
        ntrim=X         : Trims off regions <= X proportion N bases [0.5]
    

    ### ~ SEQUENCE FORMATTING ~ ###

        gnspacc=T/F     : Convert sequences into gene_SPECIES__AccNum format wherever possible. [False]
        spcode=X        : Species code for EST sequences [None]
        species=X       : Species for EST sequences [None]
        newacc=X        : New base for sequence accession numbers ['' or spcode]
    

    ### ~ EST ASSEMBLY ~ ###

        minaln=X        : Min length of shared region for consensus assembly [40]
        minid=X         : Min identity of shared region for consensus assembly [95.0]
        bestorf=T/F     : Whether to use the "Best" ORF only for ESTs without BLAST Hits [True]
        pickup=T/F      : Whether to read in partial results and skip those sequences [True]
        annotate=T/F    : Annotate consensus sequences using BLAST-based approach [False]
        dna=T/F         : Implement DNA-based GABLAM assembly [True]
        resave=X        : Number of ESTs to remove before each resave of GABLAM searchdb [200]
        gapblast=T/F    : Whether to allow gaps during BLAST identification of GABLAM homologues [False]
        assmode=X       : Mode to use for EST assembly (nogab,gablam,oneqry) [oneqry]
        gabrev=T/F      : Whether to use GABLAM-based reverse complementation [True]
    

    ### ~ ANNOTATION ~ ###

        est2rf=T/F      : Execute BLAST-based EST to RF translation/annotation only, on seqin [False]
        est2haq=T/F     : Execute BLAST-based EST to RF translation/annotation on seqin followed by HAQESAC analysis [False]
        blastann=T/F    : Execute BLAST-based annotation of consensus translations only, on seqin [False]
        truncnt=T/F     : Whether to truncate N-terminal to Met in final BLAST annotation (if hit) [False]
        searchdb=FILE   : Fasta file for GABLAM search of EST translations [None]
    

    ### ~ QUERY SEARCH ~ ###

        batch=LIST      : List of EST libraries to search (will use seqin if none given) []
        querydb=FILE    : File of query sequences to search for in EST library [None]
        qtype=X         : Sequence "Type" to be used with NewAcc for annotation of translations [hit]
        assembly=T/F    : Assemble EST sequences prior to search [False]
        consensi=T/F    : Assemble hit ORF into consensus sequences [False]
    

    ### ~ HAQESAC OPTIONS ~ ###

        haqesac=T/F     : HAQESAC analysis of identified EST translations [True]
        multihaq=T/F    : Whether to run HAQESAC in two phases [True]
        blastcut=X      : Reduce the number of sequences in HAQESAC runs to X (0 = no reduction) [50]
        cleanhaq=T/F    : Delete excessive HAQESAC results files [True]
        haqdb=FILELIST  : Optional extra databases to search for HAQESAC analysis []
        haqbatch=T/F    : Whether to only generate HAQESAC batch file (True) or perform whole run (False) [False]
    

    Program: GASP
    Description: Gapped Ancestral Sequence Prediction
    Version: 1.4
    Last Edit: 16/07/13
    Citation: Edwards & Shields (2004), BMC Bioinformatics 5(1):123.

    Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice

    Function:
    This module is purely for running the GASP algorithm as a standalone program. All the main functionality is encoded in
    the modules listed below. The GASP algorithm itself is now encoded in the rje_ancseq module. GASP may also be run
    interactively from the rje_tree module.

    Commandline:

        seqin=FILE  : Fasta/ClustalW/Phylip Format sequence file [infile.fas].
        nsfin=FILE  : Newick Standard Format treefile [infile.nsf].
    

    help : Triggers Help = list of commandline arguments.

        indeltree=FILE  : Prints a text tree describing indel patterns to FILE.
    

    Uses general modules: os, re, sys, time
    Uses RJE modules: rje, rje_ancseq, rje_pam, rje_seq, rje_tree
    Additional modules needed: rje_blast, rje_dismatrix, rje_sequence, rje_tree_group, rje_uniprot

    Program: GFESSA
    Description: Genome-Free EST SuperSAGE Analysis
    Version: 1.4
    Last Edit: 20/08/13
    Copyright © 2011 Richard J. Edwards - See source code for GNU License Notice

    Function:
    This program is for the automated processing, mapping and identification-by-homology for SuperSAGE tag data for
    organisms without genome sequences, relying predominantly on EST libraries etc. Although designed for genome-free
    analysis, there is no reason why transcriptome data from genome projects cannot be used in the pipeline.

    GFESSA aims to take care of the following main issues:
    1. Removal of unreliable tag identification/quantification based on limited count numbers.
    2. Converting raw count values into enrichment in one condition versus another.
    3. Calculating mean quantification for genes based on all the tags mapping to the same sequence.
    4. Handling the redundancy of EST libraries, by mapping tags to multiple sequences where necessary and clustering
    sequences on shared tags.

    The final output is a list of the sequences identified by the SAGE experiment along with enrichment data and
    clustering based on shared tags.
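    Points 1-2 above (normalisation and enrichment) can be sketched as follows. This is an illustrative reimplementation,
    not GFESSA's exact formulae; the function names and the pseudocount guard against zero counts are assumptions.

    ```python
    def ppm_normalise(counts):
        """Normalise raw tag counts within one replicate to parts per million
        (the normalise=ppm option). counts: {tag: raw count}."""
        total = float(sum(counts.values()))
        return {tag: 1e6 * n / total for tag, n in counts.items()}

    def mean_fold_change(cond_a, cond_b, pseudo=1.0):
        """Per-tag enrichment of condition A over condition B from normalised
        counts, with a pseudocount to avoid division by zero (tags below the
        minenrtag threshold would be filtered beforehand)."""
        tags = set(cond_a) | set(cond_b)
        return {tag: (cond_a.get(tag, 0.0) + pseudo) / (cond_b.get(tag, 0.0) + pseudo)
                for tag in tags}
    ```

    A tag would then pass the enrichment filter if its mean fold change meets the enrcut=X threshold in every pairwise
    replicate comparison at the pwenr=X level.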

    Commandline:
    ### ~ INPUT OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        tagfile=FILE    : File containing SuperSAGE tags and counts [None]
        tagfield=X      : Field in tagfile containing tag sequence. (All others should be counts) ['Tag sequence']
        expconvert=FILE : File containing 'Header', 'Experiment' conversion data [None]
        experiments=LIST: List of (converted) experiment names to use []
        seqin=FILE      : File containing EST/cDNA data to search for tags within [None]
        tagindex=FILE   : File containing possible tags and sequence names from seqin file [*.tag.index]
        tagmap=FILE     : Tag to sequence mapping file to over-ride auto-generated file based on Seqin and Mismatch [None]
        tagstart=X      : Sequence starting tags ['CATG']
        taglen=X        : Length of sequence tags [26]
    
    ### ~ PROCESS OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        mintag=X        : Minimum total number of counts for a tag to be included (summed replicates) [0]
        minabstag=X     : Minimum individual number of counts for a tag to be included (ANY one replicate) [5]
        minexptag=X     : Minimum number of experiments for a tag to be included (no. replicates) [3]
        allreptag=X     : Filter out any Tags that are not returned by ALL replicates of X experiments [0]
        minenrtag=X     : Minimum number of counts for a tag to be retained for enrichment etc. (summed replicates) [15]
        enrcut=X        : Minimum mean fold change between experiments [2.5]
        pwenr=X         : Minimum fold change between pairwise experiment comparisons [1.0]
        expand=T/F      : Whether to expand from enriched TAGs to unenriched TAGs through shared sequence hits [True]
        mismatch=X      : No. mismatches to allow. -1 = Exact matching w/o BLAST [-1]
        bestmatch=T/F   : Whether to stop looking for more mismatches once hits of a given stringency found [True]
        normalise=X     : Method for normalising tag counts within replicate (None/ppm) [ppm]
    
    ### ~ OUTPUT OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        basefile=X      : "Base" name for all results files, e.g. X.gfessa.tdt [TAG file basename]
        longtdt=T/F     : Whether to output "Long" format file needed for R analysis [True]
    

    See also rje.py generic commandline options.

    Program: PICSI
    Description: Proteomics Identification from Cross-Species Inference
    Version: 1.2
    Last Edit: 26/03/14
    Citation: Jones, Edwards et al. (2011), Marine Biotechnology 13(3): 496-504.
    Copyright © 2010 Richard J. Edwards - See source code for GNU License Notice

    Function:
    This module is for cross-species protein identifications using searches against NCBInr, for example. MASCOT
    processing uses BUDAPEST. Hits are then converted into peptides for redundancy removal. Hits from a given (known)
    query species are preferentially kept and any peptides belonging to those hits are purged from hits returned by
    other species. All hits are then classified:
    - UNIQUE : Contains 2+ peptides, including 1+ unique peptides
    - NR : Contains 2+ peptides; None unique but 1+ peptides only found in other "NR" proteins
    - REDUNDANT : Contains 2+ peptides but all found in proteins also containing UNIQUE peptides
    - REJECT : Identified by <2 peptides once query-species peptides subtracted
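    The classification above can be sketched as follows, assuming query-species peptides have already been subtracted.
    This is an illustrative reading of the four classes, not PICSI's actual code; classify_hits and its simple
    set-based NR test are assumptions.

    ```python
    def classify_hits(hit_peptides):
        """Classify protein hits from their peptide sets.
        hit_peptides: {hit id: iterable of peptide sequences}."""
        valid = {h: set(p) for h, p in hit_peptides.items() if len(set(p)) >= 2}
        # REJECT: fewer than 2 peptides after query-species subtraction
        classes = {h: "REJECT" for h in hit_peptides if h not in valid}
        # UNIQUE: contains 1+ peptide found in no other hit
        for hit, peps in valid.items():
            others = set()
            for h, p in valid.items():
                if h != hit:
                    others |= p
            if peps - others:
                classes[hit] = "UNIQUE"
        # Peptides already explained by UNIQUE hits
        covered = set()
        for hit in valid:
            if classes.get(hit) == "UNIQUE":
                covered |= valid[hit]
        # REDUNDANT: all peptides found in UNIQUE-bearing proteins; else NR
        for hit, peps in valid.items():
            if hit not in classes:
                classes[hit] = "REDUNDANT" if peps <= covered else "NR"
        return classes
    ```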

    Commandline:
    ### ~~~ INPUT OPTIONS ~~~ ###

        seqin=FILE      : File containing sequences hit during searches [None]
        resfiles=FILES  : List of CSV MASCOT output files []
        sumfile=FILE    : Delimited summary file containing search,prot_hit_num,prot_acc,prot_desc,pep_seq
        qryspec=X       : Species code for Query species [EMIHU]
        spectdt=FILE    : File of UniProt species translations ['/scratch/RJE_Filestore/SBSBINF/Databases/DBase_090505/UniProt/uniprot.spec.tdt']
    
    ### ~~~ OUTPUT OPTIONS ~~~ ###

        basefile=FILE   : Base for output files [picsi]
        delimit=X       : Delimiter for output files [tab]
    

    See also rje.py generic commandline options.

    Program: UniFake
    Description: Fake UniProt DAT File Generator
    Version: 1.3
    Last Edit: 17/04/12
    Copyright © 2008 Richard J. Edwards - See source code for GNU License Notice

    Function:
    This program runs a number of in silico prediction programs and converts protein sequences into a fake UniProt DAT
    flat file. Additional features may be given as one or more tables, using the features=LIST option. Please see the
    UniFake Manual for more details.

    Commandline:
    ### ~ INPUT OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        seqin=FILE      : Input sequence file. See rje_seq documentation for filtering options. [None]
        spcode=X        : Species code to use if it cannot be established from sequence name [None]
        features=LIST   : List of files of additional features in delimited form []
        aliases=FILE    : File of aliases to be added to Accession number list (for indexing) [None]
        pfam=FILE       : PFam HMM download [None]
        unipath=PATH    : Path to real UniProt Datafile (will look here for DB Index file made with rje_dbase)
        unireal=LIST    : Real UniProt data to add to UniFake output ['AC','GN','RC','RX','CC','DR','PE','KW']
        fudgeft=T/F     : Fudge the real features left/right until a sequence match is found [True]
    

    ### ~ PROCESSING/OUTPUT OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

        unifake=LIST    : List of predictions to add to entries [tmhmm,signalp,disorder,pfam,uniprot]
        datout=FILE     : Name of output DAT file [Default input FILE.dat]
        disdom=X        : Disorder threshold below which to annotate PFam domain as "DOMAIN" [0.0]
        makeindex=T/F   : Whether to make a uniprot index file following run [False]
        ensdat=T/F      : Look for acc/pep/gene in sequence name [False]
        tmhmm=FILE      : Path to TMHMM program [None]
        cleanup=T/F     : Remove TMHMM files after run [True]
        signalp=FILE    : Path to SignalP program [None]
    

    Uses general modules: copy, glob, os, string, sys, time
    Uses RJE modules: rje
    Other modules needed: None

    Module: SeqMapper
    Description: Sequence Mapping Program
    Version: 2.1
    Last Edit: 16/04/14
    Copyright © 2006 Richard J. Edwards - See source code for GNU License Notice

    Function:
    This module is for mapping one set of protein sequences onto a different sequence database, using Accession Numbers
    etc where possible and then using GABLAM when no direct match is possible. The program gives the following outputs:
    - *.*.mapped.fas = Fasta file of successfully mapped sequences
    - *.*.missing.fas = Fasta file of sequences that could not be mapped
    - *.*.mapping.tdt = Delimited file giving details of mapping (Seq, MapSeq, Method)
    If combine=T then the *.missing.fas file will not be created and unmapped sequences will be output in *.mapped.fas.
    Note that the possible mappings are all identified through BLAST and so a protein with matching IDs etc. but not
    hitting with BLAST will NOT be mapped. Currently only mapping of protein or nucleotides onto a protein database is
    supported.

    Unless the interactivity setting is set to 2 or more (i=2), sequences that are mapped using Name, AccNum, Sequence
    (100% identical sequences), ID or DescAcc will be mapped onto the first appropriate sequence. If automap < 0, then
    the best sequence according to the mapstat will be mapped automatically. If two sequences tie, the other two possible
    stats will also be used to rank the hits. If still tied and mapfocus is not "both" then the sequences will be ranked
    using both query and hit stats. If still tied, the first sequence will be selected.
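    The ranking and tie-breaking described above might be sketched like this (an illustrative helper with a hypothetical data structure, not the SeqMapper internals): candidates are ordered by the chosen mapstat, with the other two GABLAM stats breaking ties in turn.

```python
def rank_hits(hits, mapstat="ID", stats=("ID", "Sim", "Len")):
    """Rank candidate hits: primary key is mapstat, ties broken by the
    other two stats in turn (illustrative sketch, not SeqMapper itself).

    hits: list of dicts, e.g. {"name": "hitA", "ID": 99.5, "Sim": 99.8, "Len": 100.0}
    """
    tiebreak = [s for s in stats if s != mapstat]
    keys = [mapstat] + tiebreak
    # Sort descending on (mapstat, other stat 1, other stat 2)
    return sorted(hits, key=lambda h: tuple(h[s] for s in keys), reverse=True)

ranked = rank_hits([
    {"name": "hitA", "ID": 99.5, "Sim": 99.5, "Len": 98.0},
    {"name": "hitB", "ID": 99.5, "Sim": 99.9, "Len": 97.0},
])
# hitB ranks first: ID ties at 99.5, so Sim breaks the tie
```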

    Any sequences that fall below automap (or i<1) but meet the minmap criteria will be ranked according to their BLAST
    rankings and then presented for a user decision. Presentation will be in reverse order, so that in the case of many
    possible mappings, the best options remain clear and on screen. The default choice (selected by hitting ENTER) will
    be the best ranked according to GABLAM stats, which will have been moved to position 1 if not already there. (BLAST
    rankings and GABLAM rankings will not always agree.)

    SeqMapper will enter a user menu if i<1 or seqin and/or mapdb are missing. If i=0 and one of these is missing, a
    simple prompt will ask for the missing files. If i>0 and one of these is missing, the program will exit.

    Commandline:
    ### Input Options ###

        seqin=FILE      : File of sequences to be mapped [None]
        mapdb=FILE      : File of sequences to map sequences onto [None]
        startfrom=X     : Shortname or AccNum of seqin file to startfrom (will append results) (memsaver=T only) [None]
    

    ### Output Options ###

        resfile=FILE    : Base of output filenames (*.mapped.fas, *.missing.fas & *.mapping.tdt)  [seqin.mapdb]
        combine=T/F     : Combine both fasta files in one (e.g. include unmapped sequences in *.mapping.fas) [False]
        gablamout=T/F   : Output GABLAM statistics for mapped sequences, including "straight" matches [True]
        append=T/F      : Append rather than overwrite results files [False]
        delimit=X       : Delimiter for *.mapping.* file (will set extension) [tab]
    

    ### Mapping Options ###

        mapspec=X       : Maps sequences onto given species code. "Self" = same species as query. "None" = any. [None]
        mapping=LIST    : Possible ways of mapping sequences (in pref order) [Name,AccNum,Sequence,ID,DescAcc,GABLAM,grep]
            - Name = First word of sequence name
            - Sequence = Identical sequence
            - grep = grep-based searching of sequence if no hits
            - ID = SwissProt style ID of GENE_SPECIES (note that the species may be changed according to mapspec)
            - AccNum = Primary Accession Number
            - DescAcc = Accession Number featured in description line in form "\WAccNum\W", where \W is a non-word character
        skipgene=LIST   : List of "genes" in protein IDs to ignore [ens,nvl,ref,p,hyp,frag]
        mapstat=X       : GABLAM Stat to use for mapping assessment (if GABLAM in mapping list) (ID/Sim/Len) [ID]
        minmap=X        : Minimum value of mapstat for any mapping to occur [90.0]
        automap=X       : Minimum value of mapstat for automatic mapping to occur (if i>1) [99.5]
        ordered=T/F     : Whether to use GABLAMO rather than GABLAM stat [True]
        mapfocus=X      : Focus for mapping statistic, i.e. which sequence must meet requirements [query]
            - query = Best if query is ultimate focus and maximises closeness of mapped sequence
            - hit = Best if lots of sequence fragments are in mapdb and should be allowed as mappings
            - either = Best if both above conditions are true
            - both = Gets most similar sequences in terms of length but can be quite strict where length errors exist
    

    ### Advanced BLAST Options ###

        blaste=X    : E-Value cut-off for BLAST searches (BLAST -e X) [1e-4]
        blastv=X    : Number of BLAST hits to return per query (BLAST -v X) [20]
        blastf=T/F  : Complexity Filter (BLAST -F X) [False]
    


    Program: SeqForker
    Description: Generic Sequence Analysis Forking Script
    Version: 1.0
    Last Edit: 23/09/05
    Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice

    Function:
    This module is designed to take a large input sequence dataset, split it into chunks, fork out the chunks to some
    process and then stick all the results back together at the end. It is designed to be flexible and work for any
    analyses where each sequence is looked at independently and filenames can be given to the program performing the
    analysis.

    Commandline:

        seqin=FILE          : Input sequence file [None]
        split=X             : Number of sequences per split file [0]
        startfrom=X         : Will pick up program at this sequence, where X is the name or accession number [None]
    
    * Remember to set append=T if picking up a crashed run *

        append=T/F          : Will append to results files rather than overwrite [False]
        forkprog=PATH       : Common program system call for all forked splits of sequence file (e.g. program) [None]
    
        forkcmd="blah blah" : Common system commands for all forked splits of sequence file (e.g. program options) [None]

        outcmd=X            : Command line option for giving output file name. Will be altered to match forked splits (*.*) [None]
        seqincmd=X          : Command given to program for input file name [seqin=]
    
    * SeqForker will stitch together "forkprog seqincmd=FILE forkcmd outcmd" *
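    How that system call might be stitched together can be sketched in Python. This is an assumption-laden illustration of the "forkprog seqincmd=FILE forkcmd outcmd" pattern described above, not the actual SeqForker assembly code; the function name is hypothetical:

```python
def build_fork_call(forkprog, seqincmd, splitfile, forkcmd, outcmd, outfile):
    """Assemble the system call for one split file, following the
    "forkprog seqincmd=FILE forkcmd outcmd" pattern (illustrative sketch).
    """
    parts = [forkprog, seqincmd + splitfile]
    if forkcmd:
        parts.append(forkcmd)  # common options shared by all forks
    if outcmd:
        parts.append(outcmd + outfile)  # per-split output file name
    return " ".join(parts)

call = build_fork_call("python program.py", "seqin=", "input.1.fas",
                       "v=-1 i=-1", "resfile=", "input.1.tdt")
# "python program.py seqin=input.1.fas v=-1 i=-1 resfile=input.1.tdt"
```

    Each fork then runs its own call on its own split file, and the per-split output files are concatenated at the end.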

    General Commandline:

        v=X         : Sets verbosity (-1 for silent) [0]
        i=X         : Sets interactivity (-1 for full auto) [0]
        log=FILE    : Redirect log to FILE [Default = calling_program.log]
        newlog=T/F  : Create new log file. [Default = False: append log file]
    

    Forking Commandline:

        noforks=T/F : Whether to avoid forks [False]
        forks=X     : Number of parallel sequences to process at once [0]
        killforks=X : Number of seconds of no activity before killing all remaining forks. [3600]
    

    Uses general modules: copy, glob, os, re, string, sys, time
    Uses RJE modules: rje, rje_seq
    Other modules needed: rje_blast, rje_dismatrix, rje_pam, rje_sequence, rje_uniprot

    Module: rje_seqgen
    Description: Random Sequence Generator Module
    Version: 1.7
    Last Edit: 17/01/13
    Copyright © 2006 Richard J. Edwards - See source code for GNU License Notice

    Function:
    This module is designed to generate a number of random sequences based on input AA or Xmer frequencies and the desired
    order of markov chain from which to draw the amino acid (or nucleotide) probabilities.

    If poolgen=T, then the amino acid frequencies will be used to make a finite pool of amino acids from which sequences
    will be built. This will ensure that the total dataset has the correct amino acid frequencies. Because of the
    potential of this to get 'stuck' in impossible sequence space - especially if screenx > 0 - an additional parameter
    poolcyc=X determines how many times to retry the generation of sequences. If poolgen=F, then generation of sequences
    will be faster but the resulting dataset may have Xmer frequencies that differ greatly from the input frequencies,
    depending on how many (and which) redundant and/or screened Xmer-containing sequences are removed. (If seqin contains
    only one peptide then each random peptide will be a scramble of that peptide.)
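    The basic probability-only case (markovx=1, poolgen=F) amounts to drawing each residue independently from the input amino acid frequencies. A minimal sketch, assuming a simple frequency dict rather than the module's own data structures:

```python
import random

def random_seq(aafreq, length, rng=random):
    """Generate one random sequence by drawing residues from amino acid
    frequencies (the markovx=1, poolgen=F case; illustrative sketch only).

    aafreq: dict mapping single-letter amino acid -> frequency (sums ~1).
    """
    residues = list(aafreq)
    weights = [aafreq[aa] for aa in residues]
    # Draw with replacement: frequencies are probabilities, not a pool
    return "".join(rng.choices(residues, weights=weights, k=length))

seq = random_seq({"A": 0.5, "G": 0.3, "L": 0.2}, 10)
```

    With poolgen=T the draw would instead deplete a finite residue pool, which is what guarantees exact dataset-wide frequencies but can get stuck.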

    !!!NEW!!! Version 1.3 has a new scramble function, which takes in a list of peptides and tries to construct scrambled
    versions of them. In this case, screenx=X sets the length of common Xmers between the scrambled peptide and the
    original peptide at which a scrambled peptide will be rejected. This should be set > 1, else all peptides will be
    rejected. (If left at the default of zero, no peptides will be rejected.) In this mode, outfile=FILE will set the
    name of a delimited output file containing two columns: peptide & scramble. (Default filename = scramble.tdt)
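    The scramble-and-screen idea can be sketched as follows (an illustration of the approach, not the rje_seqgen code; the function names are hypothetical). Note how screenx=1 rejects every scramble, since any shuffle shares its single residues with the original:

```python
import random

def scramble(peptide, screenx, maxtry=10000, rng=random):
    """Shuffle a peptide, rejecting scrambles that share an Xmer of
    length >= screenx with the original (sketch of the idea only).
    Returns None if no acceptable scramble is found within maxtry tries.
    """
    def shares_xmer(a, b, x):
        if x < 1:
            return False  # screening disabled (the default of zero)
        xmers = {a[i:i + x] for i in range(len(a) - x + 1)}
        return any(b[i:i + x] in xmers for i in range(len(b) - x + 1))

    for _ in range(maxtry):
        shuffled = list(peptide)
        rng.shuffle(shuffled)
        candidate = "".join(shuffled)
        if candidate != peptide and not shares_xmer(peptide, candidate, screenx):
            return candidate
    return None
```

    For a 9-mer with screenx=5, most shuffles pass the screen, so an acceptable scramble is usually found quickly.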

    !!!NEW!!! Version 1.5 has a new BLAST-centred method for making a random dataset from an input dataset, retaining the
    approximate evolutionary relationships as defined by BLAST homology, which should result in similar GABLAM statistics
    for the randomised dataset. For this, a random sequence is created first. Any BLAST hits between this and other
    sequences are then mapped, keeping the required percentage identity (and using different amino acids drawn from the
    frequency pool for the rest). The next sequence is taken, completed and then the same process followed, until all
    sequences have been made. Improvements to make: (a) incorporate similarity too; (b) adjust aa frequencies after BLAST
    mapping. This method is activated by the blastgen=T option and has limited options as yet.
    NB. The input dataset will *not* be subject to rje_seq filtering.

    !!!NEW!!! Version 1.6 has an EST randomiser. This will go through each sequence in turn and generate a new sequence of
    the same length using the NT frequencies (or markov chain frequencies) of just that sequence. Updated in V1.7 to make
    this work for proteins too.
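    For the markovx=1 case, the EST randomiser amounts to replacing each sequence with a same-length random draw from that sequence's own residue frequencies. A minimal sketch under that assumption (not the rje_seqgen implementation):

```python
import random
from collections import Counter

def est_randomise(seq, rng=random):
    """Return a new sequence of the same length, drawn from the residue
    frequencies of seq itself (markovx=1 case; illustrative sketch only).
    """
    freq = Counter(seq)  # per-sequence NT (or AA) frequencies
    residues = list(freq)
    weights = [freq[r] for r in residues]
    return "".join(rng.choices(residues, weights=weights, k=len(seq)))
```

    Because frequencies are taken from each sequence individually, composition biases of individual ESTs (or proteins, as of V1.7) are preserved in the randomised dataset.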

    Commandline:
    ## Generation options ##

        seqnum=X        : Number of random sequences to generate [24]
        seqlen=X,Y      : Range of lengths for random sequences [10]
        markovx=X       : Order of markov chain to use for sequence construction [1]
        aafreq=FILE     : File from which to read AA Freqs [None]
        xmerfile=FILE   : File from which to read Xmer frequencies for sequence generation [None]
        xmerseq=FILE    : Sequence file from which to calculate Xmer frequencies [None]
    
    * xmerseq is overridden by xmerfile and aafreq. aafreq only works if markovx=1 and is overridden by xmerfile *

        nrgen=T/F       : Whether to generate a non-redundant sequence list (whole-sequence redundancies only) [True]
        poolgen=T/F     : Whether to build sequences using a fixed AA pool (exact freqs) or probabilities only [False]
        poolcyc=X       : Number of times to retry making sequences if rules are broken [1]
        maxhyd=X        : Maximum mean hydrophobicity score [10]
    

    ## Output & Naming ##

        outfile=FILE: Output file name [randseq.fas]
        randname=X  : Name 'leader' for output fasta file [randseq]
        randdesc=T/F: Whether to include construction details in description line of output file [True]
        idmin=X     : Starting numerical ID for randseq (allows appending) [1]
        idmax=X     : Max number for randseq ID. If > seqnum, will use seqnum. If >0, no zero-prefixing of IDs. [0]
        append=T/F  : Whether to append to outfile [False]
    

    ## Other Xmers of Interest ##

        screenfile=FILE : File of Xmers to screen in generated sequences [None]
        xmerocc=T/F     : Whether to output occurrences of screened Xmers [True]
        screenx=X       : Reject generated sequences containing screened Xmers <= X [0]
        screenrev=T/F   : Whether to screen reverse Xmers too [False]
    

    ## Peptide Scrambling Parameters ##
    # Uses seqnum=X, randdesc, idmin/max=X and randname=X but for each input peptide (oldname_randnameID)

        scramble=T/F    : Run peptide scrambler [False]
        fullscramble=T/F: Generate all possible scrambles for each peptide in TDT [False]
        scramblecyc=X   : Number of attempts to try each scramble before giving up [10000]
        seqin=FILE      : Sequence file containing peptides to scramble [None]
        peptides=LIST   : Alternative peptide sequence input for scrambling []
        outfile=FILE    : Output delimited file of scrambled peptides or peptide and scrambled sequence. [scramble.tdt]
        teiresias=X         : Length of patterns to be screened by additional TEIRESIAS search on scrambled vs original [0]
        teiresiaspath=PATH  : Path to TEIRESIAS ['c:/bioware/Teiresias/teiresias_char.exe'] * Use forward slashes (/)
    

    ### BLAST-based dataset randomiser (uses some of the Output options listed) ###

        blastgen=T/F    : Activate the BLASTGen method [False]
        seqin=FILE      : Input sequence file to randomise [None]
        keepnames=T/F   : Whether to keep same input names in outfile [False]
    

    ### EST Randomiser ###

        estgen=T/F      : Whether to run EST randomiser method [False]
    

    Uses general modules: copy, os, re, string, sys, time
    Uses RJE modules: rje, rje_markov, rje_seq, rje_blast
    Other modules needed: rje_dismatrix, rje_pam, rje_sequence, rje_uniprot


    © RJ Edwards 2015. Last modified 7 Jan 2015.