Bioinfo Lab - Projects

Our Projects

Here are some ongoing and past projects.

Bioinformatics for combatting antimicrobial resistance

Antimicrobial resistance (AMR) occurs when pathogens evolve to resist the medicines used to treat the infections they cause. The WHO has declared it one of the top health threats to humanity. Our lab, along with our collaborators in SComB and the e-Asia JRP ATTACK-AMR project, is using metagenomics to understand the abundance and diversity of AMR-related genes in hospital wastewaters. Another angle of attack on AMR is phage therapy, an alternative to conventional antibiotics, whose heavy and loosely regulated use has been exacerbating the AMR crisis. Because lab experiments to study phages can be costly and tedious, our lab is investigating state-of-the-art machine learning techniques to aid in phage characterization.

Exploring applications of biological language models

Representing biological sequences as numerical vectors is the first step in building machine learning tools for any bioinformatics task. However, since sequences can be characterized using hundreds of properties, it is difficult to select which features are most likely to be informative. This challenge has motivated the development of biological language models. By learning patterns from large-scale protein databases, these models are able to automatically transform sequences into vectors that have been shown to capture aspects of the “grammar of life.” Our lab has been working on applying these biological language models to important downstream tasks such as predicting phage-host interaction. Check out this paper, this preprint, and some code here and here.
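To make the idea of a hand-picked feature vector concrete, here is a minimal sketch of one classic representation, amino acid composition. It is only an illustration of the kind of manual feature that language-model embeddings aim to replace, not the lab's method; the function name `composition_vector` and the example sequence are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector has the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(protein: str) -> list[float]:
    """Return the fraction of each standard amino acid in a protein sequence.

    Hypothetical helper for illustration only.
    """
    counts = Counter(protein.upper())
    # Only standard residues count toward the total; others (e.g. X) are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Made-up 10-residue sequence, used only to show the shape of the output.
vector = composition_vector("MKTAYIAKQR")
print(len(vector))  # one entry per standard amino acid
```

A composition vector discards residue order entirely, which is one reason a fixed, hand-chosen feature like this may miss patterns that a learned embedding can pick up.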

Computational interpretation of genomic regions implicated by genome-wide association studies in rice

Rice feeds half of humanity. Rice production needs to keep pace with human population growth while remaining environmentally sustainable and resilient to climate change. These challenges have motivated the identification of genetic factors behind agronomically important traits, often using genome-scale techniques such as QTL analysis or genome-wide association studies (GWAS). These studies report regions in the genome that are statistically significant, but they fall short of explaining the biological significance of those regions. Our lab has been working on software solutions to gain biological insights into statistically significant genomic sites. Check out this paper, this web app, and its source code.

Differential gene expression analysis for non-model organisms

RNA-seq is being increasingly adopted for gene expression studies in a panoply of non-model organisms, with applications spanning agriculture, aquaculture, ecology, and environmental science. For organisms that lack a well-annotated reference genome or transcriptome, a conventional RNA-seq data analysis workflow requires constructing a de novo transcriptome assembly and annotating it against a high-confidence protein database. We propose a shortcut that avoids the computationally demanding assembly process and instead obtains counts for differential expression analysis by directly aligning RNA-seq reads to the high-confidence proteome that would otherwise have been used for annotation. Check out these papers here, here, and here, and the source code here and here.
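The counting step of such a shortcut can be sketched as follows. This is a minimal illustration under stated assumptions, not the lab's pipeline: it assumes the read-to-protein alignments have already been parsed into `(read_id, protein_id, bit_score)` tuples, and it assigns each read to its single best-scoring protein. Both the tuple layout and the best-hit rule are assumptions made for the example, as are all names and values.

```python
from collections import Counter


def count_reads_per_protein(hits):
    """Tally reads per protein, keeping only each read's best-scoring hit.

    hits: iterable of (read_id, protein_id, bit_score) tuples
          (hypothetical layout, assumed for this sketch).
    """
    best = {}  # read_id -> (bit_score, protein_id)
    for read_id, protein_id, score in hits:
        # Keep the first hit seen for a read unless a strictly better one appears.
        if read_id not in best or score > best[read_id][0]:
            best[read_id] = (score, protein_id)
    return Counter(protein_id for _, protein_id in best.values())


# Made-up alignment hits for three reads against two proteins.
example_hits = [
    ("read1", "protA", 50.0),
    ("read1", "protB", 62.5),
    ("read2", "protA", 48.0),
    ("read3", "protA", 71.0),
]
counts = count_reads_per_protein(example_hits)
print(dict(counts))
```

Running the same tally for each sample would give one column of a protein-by-sample count matrix, which is the usual input to differential expression tools.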

Bioinformatics for HIV surveillance

Molecular surveillance for HIV is undergoing a significant transformation, shifting from conventional Sanger sequencing to modern high-throughput sequencers. This shift has necessitated the construction of bioinformatics pipelines for analyzing the large volumes of data these sequencers produce. Our lab is working towards a pipeline that improves accuracy by taking into account the high mutation and recombination rates seen in HIV.

Bioinformatics for cancer genomics

In collaboration with DLSU’s Translational Research and Medicine Unit and St. Luke’s Medical Center, our lab is applying modern multi-omics technologies to commercial cell lines as well as real patient tissues to investigate regulated cell death in the context of colorectal cancer.

Implementing parallel computing for sequence alignment algorithms

Algorithms that align DNA/RNA to DNA/RNA (i.e., sequence alignment) or DNA/RNA to protein (i.e., frame alignment) can be sped up using various parallel computing techniques. Past projects include alignment algorithms that use single instruction, multiple data (SIMD) parallelism based on the x86 AVX instruction sets; single instruction, multiple threads (SIMT) parallelism based on the NVIDIA CUDA platform; and a customized hardware configuration based on the PYNQ Z2 field-programmable gate array (FPGA).
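One common reason alignment lends itself to parallel hardware is that, in a dynamic-programming alignment matrix, every cell on the same anti-diagonal depends only on cells from earlier anti-diagonals. The sketch below illustrates that property with a plain Needleman-Wunsch global alignment score, filled once row by row and once anti-diagonal by anti-diagonal. It is written in sequential Python purely to show the dependency pattern; it is not the lab's SIMD, CUDA, or FPGA code, and the function names and scoring parameters are assumptions chosen for the example.

```python
def _init_matrix(rows, cols, gap):
    """Create the score matrix with gap penalties along the first row and column."""
    H = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        H[i][0] = i * gap
    for j in range(cols):
        H[0][j] = j * gap
    return H


def _cell(H, a, b, i, j, match, mismatch, gap):
    """Score of cell (i, j) from its diagonal, upper, and left neighbours."""
    s = match if a[i - 1] == b[j - 1] else mismatch
    return max(H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)


def nw_score_rowwise(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score, filling the matrix in the usual row-major order."""
    rows, cols = len(a) + 1, len(b) + 1
    H = _init_matrix(rows, cols, gap)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = _cell(H, a, b, i, j, match, mismatch, gap)
    return H[-1][-1]


def nw_score_wavefront(a, b, match=1, mismatch=-1, gap=-2):
    """Same score, filling the matrix one anti-diagonal (i + j = d) at a time."""
    rows, cols = len(a) + 1, len(b) + 1
    H = _init_matrix(rows, cols, gap)
    for d in range(2, rows + cols - 1):
        i_lo = max(1, d - (cols - 1))
        i_hi = min(rows - 1, d - 1)
        # The cells in this inner loop do not depend on one another, so a
        # SIMD lane or GPU thread could compute each of them independently.
        for i in range(i_lo, i_hi + 1):
            H[i][d - i] = _cell(H, a, b, i, d - i, match, mismatch, gap)
    return H[-1][-1]


# Both fill orders give the same score for this made-up pair of sequences.
print(nw_score_rowwise("GATTACA", "GCATGCT") == nw_score_wavefront("GATTACA", "GCATGCT"))
```

In a real SIMD or SIMT implementation, the inner loop over one anti-diagonal is the part that would be mapped onto vector lanes or threads, with memory layout chosen to suit the target hardware.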