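Thesis: Caracterización de microorganismos de interés biotecnológico mediante el uso de datos genómicos y reconocimiento de patrones (Characterization of microorganisms of biotechnological interest using genomic data and pattern recognition)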
Date
2021-12
Program
Ingeniería Civil Telemática
Campus
Campus Casa Central Valparaíso
Abstract
The characterization of microorganisms and enzymes with bioinformatics tools has narrowed the gap between science and technological development. However, the quality of the datasets used with machine learning tools remains a problem. This thesis describes the development of a binary classifier that identifies enzymes that degrade aromatic contaminants in genomic sequences. The machine learning techniques employed are SVM with an RBF kernel, KNeighbors, and Random Forest. Six datasets with different balance and sequence length characteristics are built, and the models are trained and tested on them. Despite good initial results in terms of ROC curves, accuracy, F1-score, and AUC, applying the models to the genomes of Escherichia coli and Paraburkholderia xenovorans LB400 reveals a high incidence of false positives, which indicates that both the representativeness of the datasets and the classification methodology need to improve. The thesis highlights the importance of robust validation and the potential of deep learning for future research.
Keywords
Binary Classifier, Homology, Oxygenase
