CrocoPhylogeny

From CrocoWiki
Jump to: navigation, search

Below you can find the CrocoPhylogeny user manual, which contains all the information you need in order to make efficient use of CrocoPhylogeny.

The link for accessing the CrocoPhylogeny web server you can find here.

Note that you need not install any external tool to visualize CrocoPhylogeny output, as all your results are automatically linked to iTOL tree visualization web server.

Happy CrocoPhylogeny!

I) What can I do with CrocoPhylogeny?

Phylogeny is a beautiful science. We love to study and understand how different pieces of an evolutionary puzzle come together. However, it is not always that easy to infer phylogenetic relationships with ever growing databases including data sets that grow more and more complex every day. CrocoPhylogeny was developed to attend the needs of scientists from fields related to biology that wish to understand whether there are, and what are the, phylogenetic relationship of large and/or complex datasets that could not be properly handled with currently available tools. For instance, an evolutionary biologist knows how challenging it is to consider a whole genome phylogeny, with all its complications. You could also consider how a microbiologist colleague would find him/herself in trouble if looking for a way to infer whether there are functional similarities involving uncharacterized microbial gene clusters that at first look seem to have independent origin. With CrocoPhylogeny, the solutions for both problems are easily accessible.

Using CrocoPhylogeny, within 1 hour you could do a phylogenetic analysis including about 50 whole genomes of microbes, hundreds of plastid genomes, or even Y chromosome phylogenetic analyses including organisms of different taxonomic classes. If you are still reading at this point, and it all seems too good to be true, I invite you to browse through CrocoPhylogeny sample results, where you can see all such test cases for yourself.

CrocoPhylogeny offers an online platform for running, monitoring, and accessing your phylogenetic results. With CrocoPhylogeny, you don't need to worry about asking for a free computer to run your analyses, installing complicated software, or facing the so feared command line terminal. We take care of the ugly part of the setup so that you can focus on the part of your research that matters the most for you.

As CrocoPhylogeny allows for relatively large data sets, with settings varying from fast to sensitive, it is possible that some jobs may take longer than others. For that purpose, CrocoPhylogeny features a queuing system so that you can know whether your job have already begun and, if so, how long it will take to finish. Like this you can plan the subsequent steps of your analysis accordingly.


II) Input and Output format

1. Input file

CrocoPhylogeny aligns a set of nucleotide sequence in a pairwise fashion and reports distances between each pair of sequences. From those distances, a phylogenetic tree is calculated. Therefore, in order to run CrocoPhylogeny, you will need to specify input files containing the sequences from which you wish to build the phylogenetic trees. In more detail, you should select as input files one file for each desired leaf of your tree, meaning that all the contents of each individual file will be considered a single entity of your analysis. For example, to perform a phylogeny involving 200 16SrRNA sequences, you will have to select 200 files as input, each of which should contain a single 16SrRNA entry in FASTA format. The resulting trees will be generated using the filenames (without the file extension) as leaf names.

If all your sequences are within a single FASTA input file, you can easily extract each sequence to an individual file using the following command on a Linux terminal:

cat YOUR_INPUT_FILE.fa | awk '{if (substr($0, 1, 1)==">") {dest=(substr($0,2,NUMBER_OF_CHARACTERS_OF_HEADER_TO_BE_USED_AS_FILENAME) ".fa")} print $0 > dest }'

This command will parse 'YOUR_INPUT_FILE.fa' and create new files with one sequence each, naming each newly created file based on its corresponding FASTA sequence header from position 2 (the one right after the '>') until position 'NUMBER_OF_CHARACTERS_OF_HEADER_TO_BE_USED_AS_FILENAME', which you can set as 10, 20, or whichever value suits your data set well. This command will not change the original 'YOUR_INPUT_FILE.fa' file in any way.

Please note that currently (version 1.0.18.11.29), in order to prevent long queues, the maximum size of the input data set is 200 MB, which is about 200 Mbp in FASTA format, and that the maximum number of individual files that you can submit is 4000. Those however, are not physical limits of the server. Therefore, if you have a specific data set that is larger than 200 MB or contains more than 4000 entries, and you wish to try it with CrocoPhylogeny, you can contact the developer directly (Ravi José Tristão Ramos, mail: 455306@mail.muni.cz) for the possibility of running it during the offload periods of the server.

Your input files, as well as your results, will be kept for 1 month after submission. Please make sure you download all your results before they are deleted.

2. Output files

Once your CrocoPhylogeny run is finished, you will have access to a number of result files, which are explained in detail in this section.

There 2 file types (file formats) of output files:

  • Distance matrices (in PHYLIP format, they are named Matrix_******.dst)
  • Tree files (in Newick format, they are named NJ_tree_******.dst)

Each tree file is derived from its corresponding tree file using the Neighbor Joining method.

Five distance matrices, and their corresponding trees, are generated by applying 5 different 'minimum score' filters to their sub-alignment components (i.e. only sub-alignments that score above that threshold in a BLAST-like alignment are considered when creating that distance matrix). The score filters are such that a higher value to a score filter can be interpreted as those results taking into account a smaller number of sub-alignments, but of higher average homology. In the current version of CrocoPhylogeny (1.0.18.11.29), the 5 possible minimum score filters are:

  • 16 - Roughly corresponding to 8 sequentially matched nucleotides
  • 22 - Roughly corresponding to 11 sequentially matched nucleotides, 16 matched nucleotides with 2 mismatches, or 15 matched nucleotides with a single gap of length 1*
  • 28 - Roughly corresponding to 14 sequentially matched nucleotides, 19 matched nucleotides with 2 mismatches, or 18 matched nucleotides with a single gap of length 1*
  • 34 - Roughly corresponding to 17 sequentially matched nucleotides, 22 matched nucleotides with 2 mismatches, or 21 matched nucleotides with a single gap of length 1*
  • 40 - Roughly corresponding to 20 sequentially matched nucleotides, 25 matched nucleotides with 2 mismatches, or 24 matched nucleotides with a single gap of length 1*

*the examples given above list a few possibilities to illustrate the meaning of the scores, and do not represent all possibilities for obtaining each of the mentioned BLAST-like alignment scores

Additionally, in case you run CrocoPhylogeny in sensitive mode, there are 2 variations of each distance matrix (and of their corresponding trees), namely MAX and MIN. The MAX matrix is composed by comparing all values of similarity between all pairs of sequence A and B and between B and A (as the whole alignment is not reciprocal), and always selecting the greater of the two values. The MIN matrix is constructed in a similar fashion, but always selecting the lesser of the two values. The final understanding is that, depending on the level of divergence among your sequences, the MAX or MIN matrices will provide you a more fitting result. Alternatively, the MAX and MIN scores can be used to cluster non-phylogenetic sequence data according to, respectively, a "maximal possible difference" or "minimal possible difference" criterion between each pair of sequences.

Here is a screenshot of how a results page currently looks like (version 1.0.18.11.29).


III) Settings, how to use them, and their effect on the results

1. Only Forward / Forward and Reverse

This setting lets the user control whether CrocoPhylogeny should also consider that subsequences within the data set may be found in reverse character order among different entries or if the algorithm should only compare sequences the sense they are in the file. This could happen for instance in whole genomes or in sets of genes that are carried together by transposons, plasmids, and other forms of gene relocation/transfer. In such cases, CrocoPhylogeny will consider that two data set entries are close if part of its sequence is in reverse order (e.g. ABCDEFGHIJKLMN would be closely related to ABCDEFGNMLKJIH if the 'Forward and Reverse' option is set, and more distantly related if the 'Only Forward' option is set).
The process of searching for reversed sequences may introduce noise/artifacts to the analysis in case the data set is not complex, but known to be ordered (e.g. 16SrRNA, handpicked genes, etc), therefore this setting is marked as 'Only Forward' by default.

2. Speed / Sensitivity

This setting lets the user control whether CrocoPhylogeny should be run on a 'Fast', 'Normal', or 'Sensitive' mode.
The Fast mode reduces the chance of CrocoPhylogeny finding an exact match seed for the BLAST-like alignment. In practice, given a sequence "ABCDEFGHIJKL", and another sequence "abcdefghijkl", CrocoPhylogeny will attempt to find exact matches between the pairs "ABCDEFGH/abcdefgh", "EFGHIJKL/abcdefgh", "ABCDEFGH/efghijkl", and "EFGHIJKL/efghijkl", always skipping 4 letters on both sequences to find the initial exact match seed for the alignment. Once any identical match is found, it will extend this exact match seed into a BLAST-like alignment normally (with mismatches, gaps, etc).
In contrast to the Fast mode, the Normal mode attempts all possibilities when looking for an exact match seed for the alignment. In practice, given a sequence "ABCDEFGHIJKL", and another sequence "abcdefghijkl", CrocoPhylogeny will attempt to find exact matches between the pairs "ABCDEFGH/abcdefgh", "BCDEFGHI/abcdefgh", ..., "EFGHIJKL/abcdefgh", then, "ABCDEFGH/bcdefghi", "BCDEFGHI/bcdefghi", ..., "EFGHIJKL/bcdefghi" ..., all cases until "EFGHIJKL/efghijkl". Once any identical match is found, it will extend this exact match seed into a BLAST-like alignment normally (with mismatches, gaps, etc).

3. Length of subsequences

This setting lets the user select the length of the subsequences that will be used for the similarity calculation for the phylogeny tree. This value corresponds to the length for which CrocoPhylogeny will attempt to extend an exact match seed alignment; it also corresponds to the the maximum length of the subsequences for the purpose of percent similarity calculation. This is better explained with examples:
a) Length = 320. Suppose a exact 8-mer match is found between a pair of regions of two sequences. That 8-mer will be extended into a blast-like alignment. Lets suppose that it reaches a maximum score of 200 (which could correspond, for example, to a 100 sequential matches). In this case, out of the possible length of 320, it matched 100. We say that the rough score for this subsequence pair is 100 divided by 320, or 31.25%.
b) Length = 240. Now consider the same example as above, but this time you divide 100 by 240 to find the rough score for that subsequence pair. Therefore, similarity score would be 41.67%.

This setting of course raises a question: how can CrocoPhylogeny yield equally valid results with different 'Length settings'? The answer is that this setting will actually change both the branch lengths of your tree, and the the interpretation of the results. Firstly, with smaller lengths such as 80 nucleotides, you are targeting similarities up to 80 nucleotides long (which could represent the encoding for a conserved protein domain, or a small RNA motif). With length 80, you could for instance match two pieces of 80 length between two subsequences, then have a 100 nucleotide portion with high divergence, and back to another 80 nucleotide similarity island, and this would yield an average similarity equal to that of the two homologous portions length 80 nucleotides (not computing the 100 nucleotide long divergent sequence into the score).
On the other hand, larger values of Length, such as 320 nucleotides, you are targeting larger portions of a genome, possibly less conserved, or even small protein coding genes such as those coding for histones. Following the previous example, you would have 80 matches + 100 mismatches + 80 matches within the 320 nucleotide region. However, the 100 mismatches would in practice interrupt the alignment in that segment (for lowering the score too much), and only the first 80 matches would be found within the 320 nucleotide Length setting.

Although the examples set here are a bit extreme, they help illustrate how different settings may affect your results. As a rule of thumb, the least divergent you expect your sequences to be, you can use larger values for Length (320), while very divergent data sets are likely to perform better with smaller values for Length (80). However, one should notice the interpretation is different, and will most likely NOT correspond to a an average similarity between both sequences, but an average similarity of the 'islands of similarity' found between two sequences.


IV) How does it work?

CrocoPhylogeny was was developed to allow for the phylogeny of complex / large data sets. In practice, it clusters input sequences using an alignment based similarity score. In this section we explain in detail the different steps of the algorithm.

1. What happens when you upload a file

CrocoPhylogeny derives its sequence alignment from nucleotide FASTA files (please refer to section II.1. if you need more details about input file selection). Before the upload even begins, CrocoPhylogeny web server will check if the total length of the input sequence files correspond to the current (version... ) limit of maximum input files size. Once the upload button is clicked, a progress bar for the upload will show, and the page will start uploading your input files to the CrocoPhylogeny web server. Once they are uploaded, the files will be checked for file format compliance (fasta NUCLEOTIDE), and a possibility of errors (data corruption during network transfer, lack of disk space on the server, etc). In case any error happens, the user is redirected to an error page with a brief description about the most likely cause for that error. In case all checks succeed, a new task is created corresponding to the input file and selected settings, the task is assigned to a queue of jobs in CrocoPhylogeny, and the user is redirected to his task page, which is refreshed every few seconds, informing the user about the queue position (if queued), percent progress for the task and estimated time to finish (if running), or the results for this task (if finished).

2. How the alignment is performed

Although we do make an effort to be as user friendly and clear as possible about every detail in this manual, this section is the most technical part of this manual, as it describes the core algorithm of CrocoPhylogeny.

In the original version of CrocoPhylogeny, the alignment part was handled by CrocoBLAST, a wrapper that helped speed up BLAST calculations. While this original version of CrocoPhylogeny was very apt at aligning complex sequences, it was not efficient enough to conduct web-based whole-genome alignments. To address this limitation, the current version of CrocoPhylogeny implements its own BLAST-like sequence alignment module, which follows the BLAST score system for extending local alignments but uses a highly efficient indexing system that permits whole-genome comparisons as a web service. The algorithm runs as a CPU-supervised queue and file management system, while the alignment core is implemented in CUDA C and runs on a high-performance GPU.

2.1 Indexing of the sequences and 8-mer alignment seed

Each input file contains a single sequence. Each sequence is parsed into an indexed while being loaded into memory. This index, in practice, is much larger than the sequence itself, but has a data structure that allows for the data to be found much more quickly than in a regular sequence.
Additionally, to reduce the memory usage footprint, every index and sequenced are loaded in memory such that a block of four nucleotides occupy a single byte. That is possible because a block of 4 nucleotides, with 4 possibilities each yields a total of 44 possibilities, while a single byte consists of 8 bits, each with 2 possibilities (0 or 1), yielding a total of 28 possibilities. As 44 equals 28, there can be a direct correlation between the value of a single byte and the sequence of four nucleotides.

Below we describe the index data structure in detail but, for clarity, we refer to 4-mers and sequence indices as their corresponding sequence instead of a single byte. Additional clarifications were given whenever deemed useful.

Notation:
Generically, we use V[k] to denote the element at position k of a vector V. We use quat as an enumerator from 1 to 256 (corresponding to the 256 possibilities of nucleotide 4-mer, AAAA, AAAC, AAAG, etc)
We use i as an enumerator from 1 to the number of input sequences


Let Si be a vector containing the whole nucleotide sequence from one of the input files
Let L(Si) be the length of Si
Let Nquat_i be the number of occurrences of a certain nucleotide 4-kmer in Si Let Indexi be a vector of length L(Si) sorted such the first N1_i elements contain as values all the positions in which AAAA can be found in the original sequence; and that the next N2_i elements contain as values all the positions in which AAAC can be found; and so on for all 256 4-mers
Let QuatIndexi be a vector of length 256. Each of the 256 elements (referred to as quat) contain as value their corresponding Nquat_i

Within this indexed sequence data structure, it becomes very easy to find any 4-mer in Si.
For example, lets suppose that we are looking for the sequence AAACGACTAGA within the whole human genome (S1), that was indexed as described. CrocoPhylogeny looks for an exact match with the 8-mer marked in bold, and then tries to extended the alignment in a blast-like fashion. The index plays a key role into finding the 8-mer. First, we look for the first 4-mer, AAAC (which corresponds to the 3rd of the 256 4-mers). The process of looking for it, is already intrinsically solved without needing to perform a search: any occurrence of AAAC within the human genome can directly be referenced into the original sequence by

S1[Index1[QuatIndex3 + j]] , with j varying from 1 to N3_1

Note that no sequence comparison was made at the alignment stage to find the first 4-mer, AAAC. You simply access the sequence at a certain index and you know that it will correspond to AAAC based on how the index is built.
The next step, is to find the second 4-mer, GACT. This step basically requires the next block of four nucleotides, after finding AAAC, in Si to match GACT. In mathematical notation, considering S1 as a vector for which each position contains one nucleotide, you would say that, if P is the position for which AAAC is found (P = Index1[QuatIndex3 + j]), then S1[P+4] must be equal to G, S1[P+5] must be equal to A, S1[P+6] must be equal to C, and S1[P+7] must be equal to T.
In reality, S1 is a vector for which each position contain 4 nucleotides (within 1 byte, as explained above). Therefore, sole condition for an 8-mer to be found is a 1 byte comparison. i.e
The condition is that GACT byte value (which is a number from 1 to 256, being 138 in this case) must be found at position right after the first 4-mer (byte value 3 in our example). Mathematically, this is expressed as:

S1[Index1[QuatIndex3 + j]+1] = 138

This satisfies the finding of the 8-mer AAAC(3)GACT(138) within sequence S1. If this condition is true, the algorithm proceeds to the "extension" stage of the alignment.

p.s. one may notice that in this manual we only cover the simple case in which a given 4-mers matches entirely a S1[] vector position. For the sake of clarifying that all alignment possibilities are considered, I add this comment to clarify that a given 4-mer can for instance be encoded by the last two characters of S1[P] combined with the first two characters of S1[P+1], and that CrocoPhylogeny index contains all such possibilities, and is able to assess alignments by applying bit shifts and bitwise comparisons in the data structure.

2.2 Extending an initial alignment seed

As described in the previous item (IV.2.2), the alignment algorithm of CrocoPhylogeny starts with finding an exact 8-mer match. Once this 8-mer match is found, the subsequent nucleotides within Length (setting described in section III.3 of this manual, which can carry a value of 80, 160, 240, 360) undergo a BLAST-like alignment stage described here. We call it BLAST-like because a) BLAST is a well known alignment tool to which bioinformaticians can relate and b) we use the same local alignment score system (rewards, penalties, etc), with the same values as the default BLASTN program (found here). As the current implementation of the alignment extension follows a slightly different workflow than that of BLAST, it will be described in this section.

Notation:
Let Len be the length of a subsequence, which is set in the beginning of the run (Len = 80, 160, 240, or 320).

For this example, we name SubA and SubB a pair of subsequences from sequences A and B, respectively, for which an exact 8-mer match was found. SubA and SubB will have the initial 8-mer seed extended.

The 9th nucleotide of SubA (the first after the 8-mer) is compared to the 9th nucleotide of SubB. In case there is a match, the 10th nucleotide of each sequence is compared, and so on until the nucleotides do not match. At that point, the algorithm branches into the mismatch, insertion, or deletion possibilities, and follow one of them until another match is found. The algorithm them follows matching sequential nucleotides until, at another point, the sequences do not match again, which will again trigger the branching into mismatch, insertion, or deletion possibility. In the end of a branching line, when the position Len of either SubA or SubB is reached, the algorithm records the score and traces back to the last branching and considers the other possibilities. In this fashion, all relevant branches (which can mathematically achieve a greater score than the current highest score registered for that pair SubA / SubB) are considered. Every time a new highest score is found for the pair pair SubA / SubB, the possibilities of branching are narrowed (as it is harder for a yet non evaluated branch to mathematically be able to surpass the current score). By the end of the extension process, a large number of alignment possibilities was assessed and the best score found is registered as the final score for the pair SubA / SubB.

2.3 Computing the final score between two sequences X and Y

The previous items of this manual (IV.2.1 and IV.2.2) described how the score between two subsequences of length Len (setting described in section III.3 of this manual, which can carry a value of 80, 160, 240, 360) is calculated. Here we described how the score between two sequences (each containing up to millions of subsequences of length Len). In simple terms, all the found subsequence pairs (found based on what is described on IV.2.1 of this manual) carry a best score (calculated based on what is described on section IV.2.2 of this manual). All such scores, for all subsequence pairs found between the two sequences SeqX and SeqY are then added up, and the total sum is divided by the number of subsequences contributing to the score. This effectively calculates a mean score, which is then divided by the maximum possible score for that Len, resulting in a percent scale in terms of how close the average of the score from the subsequences is to the maximum possible alignment score.

This process is performed for all quality filters described in section II.2 of this manual. For each higher quality filter, less sequences will contribute to the average similarity, but with higher average similarity. Which of the quality filters yields a better result to each dataset requires user interpretation, but it can be understood that a lower quality filter accepts contributions from islands of similarity between the 2 sequences with very loose sequence homology, while higher quality filters narrow it down to more homologous islands of similarity between the 2 sequences.

Finally, this process is repeated for all pairs of sequences within your input data set. For a data set with N sequences, there will be N*(N-1)/2 possible pairs of sequences to be computed. The final score for all pairs is finally stored as a matrix, in PHYLIP format, as described in section II.2 of this manual.


V) Technical details and limitations

CrocoPhylogeny is free to use within the conditions of its license, and has been available as a web server since October 2018 at https://crocodile.ncbr.muni.cz. There is no registration or login requirement for running CrocoPhylogeny.

1. Hardware architecture

CrocoPhylogeny is hosted on its own dedicated PC with a 6-core/12-thread Intel® Core™ i7-8700 CPU, 32 GB of DDR4 memory, a NVIDIA GeForce GTX 1080 Ti GPU, a 256 GB SSD (for the operating system and calculations) and additional disk space over the network for storage.

2. Software architecture

CrocoPhylogeny runs on a Ubuntu 18.04 Linux operating system. CrocoPhylogen webpage is written in HTML/CSS, with additional Javascript and PHP scripts, and it is hosted using Apache HTTP Server software. CrocoPhylogeny software was written in C (CPU - base program, parsing and indexing of sequences) and CUDA C (GPU - alignment core) using Nsight Eclipse Edition IDE (gcc and nvcc compilers). CrocoPhylogeny primary output is a distance matrix; it uses the software fneighbor for generating a neighbor joining tree from the distance matrix, as secondary output.

3. Limitations

The current version (1.0.18.11.29) of CrocoPhylogeny only accepts as input sequences in FASTA format.
We intend to expand it by adding support to compressed sequence files (zip, tar.gz, etc), and support to other sequence containing file formats.

The current version (1.0.18.11.29) of CrocoPhylogeny accepts 100 MB of sequences as total size of input, and up to 4000 files.
This is not a hardware limitation, but a matter of preventing accumulation of jobs on queue. While the figure is relatively high for this type of software, the number will be periodically reevaluated based on server load, and on every performance-related update.
If you have a larger data set that you wish to try with CrocoPhylogeny, do not hesitate to contact us (455306@mail.muni.cz) for the possibility of running your data set during the offload periods of CrocoPhylogeny. Depending on its sequence characteristics, CrocoPhylogeny can complete the run for a data set of several gigabytes in a few days.

The current version (1.0.18.11.29) was designed to work well primarily with data sets that are hard to treat with common approaches for phylogeny. Such data sets include whole genomes, complex non-ribosomal polyketide synthase gene clusters with inversions, data sets including partially reversed sequences, or even data sets consisting of partial sequences without a common region of sequence overlap. While CrocoPyhlogeny performs well on such cases, as far as our tests go, it does not perform well for sequences with very high similarity (>99%) preventing its accurate use in human mitochondria haplotyping studies for example.
This aspect is unlikely to be changed in the near future, as it is not the primary designed use of CrocoPhylogeny. However, if it happens that, after an update that changes CrocoPhylogeny core algorithm, it starts perform well for such data sets, this will be informed on the web page, version history, and user manual.

4. Tested browsers

Below there is a table containing all the combinations of operating systems (OS) and browsers for which CrocoPhylogeny was tested and found to work.

OS OS version Chrome Firefox Edge Safari
Linux Ubuntu 18.04 70.0.3538.110 61.0.1 n/a n/a
macOS High Sierra not tested 61.0 n/a 12.0
Windows 10 71.0.3578.98 61.0 42.17134.1.0* n/a

*On Windows, the browser Edge had trouble displaying the progress bar during upload. Despite this issue, all files were uploaded and the task was run normally.

5. Troubleshooting

CrocoPhylogeny undergo certain error checks before each run, and informs the users of what happened with clear messages whenever possible. However, as it is a new software, it is possible that some error cases were not completely accounted for. If you are experiencing unexpected/unwanted behavior from CrocoPhylogeny, do not hesitate in contacting us at 455306@mail.muni.cz, informing with as much detail as you can the steps you take until the problem happens so that we can try to reproduce/identify the issue.