GENE ONTOLOGY PREDICTION ON NON-EUCLIDEAN DOMAIN
Abstract
Since the development of massive sequencing methods, a vast gap has opened between the available protein
sequence data and the corresponding experimentally annotated protein functions. Bioinformatics has
traditionally approached this asymmetry mainly with BLAST-based algorithms. Recently, deep learning
architectures have been developed to predict a protein's GO annotations either solely from its amino acid
sequence or from the sequence complemented with additional information such as Protein-Protein Interaction
(PPI) data. The former approach performs worse than the latter, which uses extra information. However,
features such as PPI must be determined using in vitro or in vivo procedures, which limits their applicability.
Furthermore, existing deep learning approaches have ignored the possibility of leveraging the hierarchical
structure of GO using a hyperbolic neural network, a framework well suited to this kind of data. This
thesis proposes a novel hyperbolic deep learning architecture called HyperGO, which predicts a protein's GO
terms from its amino acid sequence alone. An algorithm based on AlphaFold preprocessing is applied to the
protein sequences to enrich the protein representation. We hypothesize that this preprocessing adds
contextual information to the amino acid sequence data used as HyperGO input while preserving applicability
to completely unknown proteins. A transformer encoder captures the global patterns of the preprocessed protein
representation. The transformer output is then reshaped and processed by a hyperbolic network that exploits
the hierarchical nature of GO to predict the protein functions, operating in the Poincaré ball. HyperGO
performance is evaluated on a subset of SwissProt 2019 using the CAFA metrics (Fmax and Smin) and AUPR.
The results are compared with those of some traditional bioinformatics methods and DeepGOPlus. HyperGO
achieves better Smin and AUPR scores for each sub-ontology and a better Fmax for molecular function (MFO).
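
To illustrate why the Poincaré ball suits hierarchical data such as GO, the following is a minimal sketch of
the standard geodesic distance in the unit Poincaré ball with curvature -1. It is not HyperGO's implementation:
the abstract does not specify the layers used, and the function name poincare_distance is a hypothetical
choice made for this example.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points of the open unit ball (curvature -1).

    Standard Poincare ball formula:
        d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    `poincare_distance` is a hypothetical name used only for this illustration.
    """
    if len(u) != len(v):
        raise ValueError("points must have the same dimension")
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


if __name__ == "__main__":
    root = (0.0, 0.0)    # a general term would sit near the origin
    leaf_a = (0.9, 0.0)  # specific terms would sit near the boundary
    leaf_b = (0.0, 0.9)
    print("root to leaf_a  :", poincare_distance(root, leaf_a))
    print("leaf_a to leaf_b:", poincare_distance(leaf_a, leaf_b))
```

Distances grow rapidly toward the boundary of the ball, so two points near the boundary are far apart even
when their Euclidean distance is small. This gives tree-like data room to spread out, with general terms
near the origin and specific terms near the boundary.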