MIToS.jl: Mutual Information Tools for protein Sequence analysis in the Julia language

Background

Coevolution methods are attracting growing interest in the scientific community because they are useful for predicting three-dimensional protein structures, protein-protein binding interfaces and functionally important sites [1,2]. Mutual Information (MI) is useful for detecting covariation, or coevolution, between positions in a Multiple Sequence Alignment (MSA). Many tools exist for estimating MI and derived scores from an MSA [3,4]. However, none of them is written in a high-level programming language that is easy to use and modify while still performing well. Because Julia is a high-level programming language for scientific computing with close-to-C performance [5], MIToS implements MI analysis, together with several useful tools for MSA and protein structure management, in Julia.

Materials and methods

The starting point for MIToS was an improvement of the algorithm published by Buslje et al. [1]. Other MI-derived scores described in Brown & Brown [3] and Dickson & Gloor [4] are also included in MIToS. MIToS implements all the tools needed to develop and test new scores based on amino acid frequencies: functions and types for handling MSAs, parsing protein structures and determining inter-residue contacts, mapping information between sequence and structure using SIFTS, and testing predictive performance using ROC curves. MIToS also allows these analyses to be run from the command line, without requiring programming knowledge. We successfully used it to integrate structural and sequence information from a large dataset of protein families.

Results

MIToS modules make it possible to write and run an entire protein sequence and structure analysis pipeline in a single programming language. Julia's performance and easy-to-use parallelism allowed us to run these analyses on a large dataset of protein sequences and structures and to test multiple hypotheses and parameter combinations.
As a result, we were able to gain new knowledge about the relationship between evolutionary signals and the change of protein structures during evolution [6].

Conclusions

MIToS gives users access to the power of the Julia language for analysing and managing protein multiple sequence alignments. Several useful scripts for command-line execution make this new MI implementation and its derived scores accessible to non-programmers. MIToS tools make MSA editing and MI calculation easy and facilitate the integration of sequence and structure information.

References

1. Buslje, C. M., Santos, J., Delfino, J. M., & Nielsen, M. Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information. Bioinformatics 2009, 25(9), 1125-1131.
2. Buslje, C. M., Teppa, E., Di Doménico, T., Delfino, J. M., & Nielsen, M. Networks of high mutual information define the structural proximity of catalytic sites: implications for catalytic residue identification. PLoS Comput Biol 2010, 6(11), e1000978.
3. Brown, C. A., & Brown, K. S. Validation of coevolving residue algorithms via pipeline sensitivity analysis: ELSC and OMES and ZNMI, oh my. PLoS ONE 2010, 5(6), e10779.
4. Dickson, R. J., & Gloor, G. B. The MIp Toolset: an efficient algorithm for calculating Mutual Information in protein alignments. arXiv preprint 2013, arXiv:1304.4573.
5. Bezanson, J., Edelman, A., Karpinski, S., & Shah, V. B. Julia: A fresh approach to numerical computing. arXiv preprint 2014, arXiv:1411.1607.
6. Zea, D. J., Monzon, A. M., Parisi, G., & Marino-Buslje, C. How is structural divergence related to evolutionary information? bioRxiv 2017, 196782.

Speaker's bio

I learned C at university, then used Bash, Perl, Python and R during my PhD. I am now a postdoc in bioinformatics and have moved my entire research pipeline to Julia. I think that Julia and its ecosystem are mature enough to do research with. MIToS also shows how Julia can be used to solve the two-language (or multiple-language) problem in bioinformatics. By using the high-level abstractions available in Julia, MIToS is faster than its C predecessor, something achievable even by someone without a strong programming background.