Fall Research Expo 2021

An Automated Deep Learning Analysis Pipeline for Classification in Tabular Biomedical Data

In this work, we expand upon AutoMLPipe-BC, a recent pre-print from the Unbounded Research in Biomedical Systems (URBS) Laboratory at the Perelman School of Medicine. We examine deep learning (DL) approaches in order to compare these newer techniques against traditional machine learning (ML) methods. Specifically, we focus on tabular (table-format) data and develop our pipeline with generalized classification tasks in mind (adaptable to multiclass and multi-label classification), while evaluating on a binary classification task.

Following a thorough literature search of viable deep learning techniques, we identified several promising models to implement in the AutoMLPipe-DL pipeline. We take inspiration from the recent rise of attention models (including self-attention Transformers), which learn to focus on relationships within specific parts of the feature set and have proven successful for text (natural language processing) and, more recently, image (computer vision) modalities. The model we chose, TabNet, offers the distinct advantage of supporting both supervised and semi-supervised training; this added degree of model complexity can prove effective when scaling to larger data (a sketch of this workflow appears below). We also turn to models designed for other modalities, using a transformation procedure to convert tabular data into a series of images so that we can draw on the vast computer vision literature, including Convolutional Neural Network architectures. Finally, by exploring graphical probabilistic models such as Restricted Boltzmann Machines and Deep Belief Networks, our work is one of the few instances of applying these models to tabular data. Our approach uses the Restricted Boltzmann Machine as an unsupervised feature extractor, followed by a downstream classification model drawn from several of the state-of-the-art models in AutoMLPipe-BC (also sketched below); this opens an opportunity for future studies that use a deep learning model or another complex ML algorithm as the downstream classifier.
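
As a concrete illustration of TabNet's two training modes, here is a minimal, hedged sketch using the open-source pytorch-tabnet package; this is an assumed implementation chosen for demonstration, not necessarily the exact integration in AutoMLPipe-DL, and synthetic data stands in for a real biomedical dataset.

```python
# Hedged sketch of TabNet's supervised + semi-supervised workflow, assuming
# the open-source pytorch-tabnet package (an illustrative choice, not
# necessarily the implementation used in AutoMLPipe-DL).
import numpy as np
from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular biomedical dataset
X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X = X.astype(np.float32)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Semi-supervised option: pre-train by reconstructing randomly masked
# inputs, which requires no labels.
pretrainer = TabNetPretrainer(verbose=0)
pretrainer.fit(X_train, eval_set=[X_valid], pretraining_ratio=0.8,
               max_epochs=50)

# Supervised fine-tuning, warm-started from the pre-trained encoder
clf = TabNetClassifier(verbose=0)
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
        from_unsupervised=pretrainer, max_epochs=50, patience=10)
print((clf.predict(X_valid) == y_valid).mean())  # validation accuracy
```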

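The RBM-based approach can be sketched with scikit-learn alone. Below, BernoulliRBM learns unsupervised features and logistic regression serves as an illustrative stand-in for the downstream classifier; in practice, the downstream model is one of several state-of-the-art AutoMLPipe-BC classifiers.

```python
# Hedged sketch: a Restricted Boltzmann Machine as an unsupervised feature
# extractor feeding a downstream classifier. LogisticRegression is an
# illustrative stand-in for the AutoMLPipe-BC models used in practice.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rbm_clf = Pipeline([
    ("scale", MinMaxScaler()),  # BernoulliRBM expects features in [0, 1]
    ("rbm", BernoulliRBM(n_components=64, learning_rate=0.05,
                         n_iter=20, random_state=0)),  # unsupervised step
    ("clf", LogisticRegression(max_iter=1000)),  # downstream classifier
])
rbm_clf.fit(X, y)
print(rbm_clf.score(X, y))  # training accuracy, for illustration only
```
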
Integrating complex models into the framework of the existing pipeline required modifying the structure of its extensive codebase, as we migrate the models to PyTorch, the preferred library for deep learning. This effort also necessitated a careful model selection process, since the models we implement must satisfy the constraints of the existing code structure, the original pipeline being largely built on the scikit-learn library. We address this problem by using SKORCH, a scikit-learn wrapper for PyTorch models, and by identifying scikit-learn-compatible implementations of the more complex models. A minimal sketch of the wrapping approach follows.
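
To show why SKORCH resolves this compatibility constraint, the sketch below wraps a hypothetical PyTorch network so it behaves as a scikit-learn estimator; the architecture and data are illustrative stand-ins, not the pipeline's actual models.

```python
# Hedged sketch: wrapping a hypothetical PyTorch network with SKORCH so it
# exposes the scikit-learn estimator interface (fit/predict/get_params).
import numpy as np
import torch.nn as nn
from skorch import NeuralNetClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a tabular biomedical dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X, y = X.astype(np.float32), y.astype(np.int64)  # dtypes skorch expects

class MLP(nn.Module):
    """Simple feed-forward network; a stand-in for the pipeline's models."""
    def __init__(self, hidden=32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(20, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),
            nn.Softmax(dim=-1),  # NeuralNetClassifier expects probabilities
        )

    def forward(self, X):
        return self.layers(X)

# Because the wrapper is a valid scikit-learn estimator, it drops directly
# into existing scikit-learn utilities such as cross-validation, pipelines,
# and grid search without further changes.
net = NeuralNetClassifier(MLP, max_epochs=20, lr=0.05, verbose=0)
print(cross_val_score(net, X, y, cv=3).mean())
```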

After implementing 12 deep learning algorithms, we were able to successfully execute 8 of them in the new AutoMLPipe-DL pipeline on a biomedical dataset, achieving results comparable to the ML models in AutoMLPipe-BC. We are also excited by the potential to further improve these models through an efficient hyperparameter optimization procedure; one possible approach is sketched below. Furthermore, as the benchmark dataset used for experimentation was relatively small, and deep learning models improve with more data, we have reason to believe that experimentation with a larger dataset would yield better results.
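
As one possible (assumed) form such an optimization procedure could take, the sketch below tunes a few TabNet hyperparameters with Optuna against a validation split; the search space, library choice, and data are illustrative, not the pipeline's actual configuration.

```python
# Hedged sketch: one possible hyperparameter optimization procedure, using
# Optuna to tune a few TabNet hyperparameters (search space, library, and
# data are illustrative assumptions, not the pipeline's actual setup).
import numpy as np
import optuna
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X = X.astype(np.float32)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

def objective(trial):
    # Sample a candidate architecture, then score it on held-out data
    clf = TabNetClassifier(
        n_d=trial.suggest_int("n_d", 8, 64),
        n_steps=trial.suggest_int("n_steps", 3, 10),
        gamma=trial.suggest_float("gamma", 1.0, 2.0),
        verbose=0,
    )
    clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
            max_epochs=50, patience=10)
    return (clf.predict(X_valid) == y_valid).mean()  # validation accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25)
print(study.best_params)
```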

PRESENTED BY
PURM - Penn Undergraduate Research Mentoring Program
Wharton, Engineering & Applied Sciences 2024
Advised By
Ryan J. Urbanowicz
Assistant Professor of Informatics

Comments

As you mentioned, one of the tricky things about selecting a particular model is its suitability for a particular use case (in addition to how fast the field of ML moves), so I'm curious how you settled on TabNet. Please bear in mind that I'm not a computer scientist when you answer! But since I am a scientist who often deals (or at least used to) with large, complex biomedical data, what applications do you think your findings might most immediately impact or improve?