GENE ONTOLOGY PREDICTION ON NON-EUCLIDEAN DOMAIN
FREDES FRANCO, NICOLÁS IGNACIO
Since the development of massive sequencing methods, there has been a vast gap between the available protein sequence data and the corresponding experimentally annotated protein functions. Bioinformatics has traditionally approached this asymmetry mainly with BLAST-based algorithms. Recently, deep learning architectures have been developed to predict a protein's Gene Ontology (GO) annotations solely from its amino acid sequence, or from the sequence complemented with additional information such as Protein-Protein Interaction (PPI) data. The former approach performs worse than the latter, which uses the extra information. However, features such as PPI must be determined through in vitro or in vivo procedures, which limits their applicability. Furthermore, these deep learning approaches have ignored the possibility of leveraging the hierarchical structure of GO with a hyperbolic neural network, a framework well suited to this kind of data. This thesis proposes a novel hyperbolic deep learning architecture called HyperGO, which predicts a protein's GO terms from its amino acid sequence alone. An algorithm based on AlphaFold preprocessing is applied to the protein sequences to enrich the protein representation. We hypothesize that this preprocessing provides context information to the amino acid sequence data used as HyperGO input, while preserving applicability to completely unknown proteins. A transformer encoder captures the global patterns of the preprocessed protein representation. The transformer output is then reshaped and processed by a hyperbolic network that, working in the Poincaré ball, exploits the hierarchical nature of GO to predict the protein functions. HyperGO performance is evaluated on a subset of SwissProt 2019 using the CAFA scores (Fmax and Smin) and AUPR. The results are compared with some traditional bioinformatics methods and DeepGOPlus; HyperGO achieves better Smin and AUPR scores for each sub-ontology, and a better Fmax for molecular function (MFO).
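To illustrate why the Poincaré ball suits hierarchical data such as GO, the following minimal sketch computes the standard geodesic distance in the unit Poincaré ball, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). It is not taken from the thesis: the function name `poincare_distance` and the example points are assumptions chosen for illustration, and HyperGO's actual layers are not described here.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # Standard closed form for the Poincaré ball with curvature -1.
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# Hypothetical embeddings: a general term near the origin and two
# specific terms near the boundary, in different directions.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

d_root_a = poincare_distance(root, leaf_a)
d_a_b = poincare_distance(leaf_a, leaf_b)
euclid_a_b = math.dist(leaf_a, leaf_b)
```

In this toy setting, the hyperbolic distance between the two boundary points is several times their Euclidean distance, while each stays comparatively close to the origin. Distances grow rapidly toward the boundary, which is the property that lets tree-like structures embed with low distortion: general terms can sit near the center and specific terms can spread out near the edge.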