Federated Learning Algorithms
This summer I had the opportunity to explore privacy-preserving distributed algorithms (PDA). One use case is training machine learning models on data that is distributed across different sites. This matters because traditional machine learning workflows require all training data to be centralized in one location, which presents a security and privacy risk. PDA aims to train effective models without exposing individual, patient-level data.
To first gain a solid grasp of machine learning concepts, I obtained the Stanford Online Machine Learning certificate while conducting a literature review of existing federated learning methods. I initially focused on distributed Support Vector Machines (SVMs) and how they are applied in a federated setting. However, after reviewing several papers and exploring the applications of SVMs, I realized that this algorithm had been largely abandoned by industry and academia due to its limited use cases. I then turned my focus to XGBoost, a gradient boosting library with great promise due to its speed and effectiveness. I looked into the implementation of Federated XGBoost, an open-source project pioneered by Berkeley’s RISE Lab, and sought to understand the optimizations and approximations it uses. From this project and several other papers, I learned that data can be aggregated using histograms, which drastically speeds up training and reduces privacy risk, with little loss of model performance in practice.
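To make the histogram idea concrete, here is a minimal sketch in plain Python. It is not the Federated XGBoost implementation: the function names (`local_histogram`, `merge_histograms`, `best_split`), the fixed bin edges, and the regularization value `lam` are all assumptions chosen for illustration. Each site sums its gradients and Hessians per feature bin, only those per-bin sums are shared, and the aggregator picks a split from the merged histogram using the standard gradient-boosting gain formula.

```python
import bisect


def local_histogram(values, grads, hess, bin_edges):
    """Run at each site: sum gradients and Hessians per feature bin.

    Only these per-bin sums leave the site, never the raw rows.
    """
    n_bins = len(bin_edges) + 1
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for x, g, h in zip(values, grads, hess):
        b = bisect.bisect_right(bin_edges, x)  # index of the bin holding x
        g_hist[b] += g
        h_hist[b] += h
    return g_hist, h_hist


def merge_histograms(histograms):
    """Run at the aggregator: add the per-site histograms bin by bin."""
    n_bins = len(histograms[0][0])
    g_total = [0.0] * n_bins
    h_total = [0.0] * n_bins
    for g_hist, h_hist in histograms:
        for b in range(n_bins):
            g_total[b] += g_hist[b]
            h_total[b] += h_hist[b]
    return g_total, h_total


def best_split(g_hist, h_hist, lam=1.0):
    """Scan bin boundaries and return (bin_index, gain) of the best split.

    A split after bin b sends bins 0..b left and the rest right.
    lam is an assumed L2 regularization constant.
    """
    G, H = sum(g_hist), sum(h_hist)
    best_bin, best_gain = None, 0.0
    g_left = h_left = 0.0
    for b in range(len(g_hist) - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = G - g_left, H - h_left
        gain = (g_left ** 2 / (h_left + lam)
                + g_right ** 2 / (h_right + lam)
                - G ** 2 / (H + lam))
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


# Toy data for two hypothetical sites sharing one feature and one bin edge.
edges = [0.5]
site_a = local_histogram([0.1, 0.4], [-1.0, -1.0], [1.0, 1.0], edges)
site_b = local_histogram([0.6, 0.9], [1.0, 1.0], [1.0, 1.0], edges)
merged = merge_histograms([site_a, site_b])
split_bin, split_gain = best_split(*merged)
```

The merged histogram is identical to the one a single machine would build from the pooled rows, so the split decision matches the centralized histogram-based one while each site reveals only per-bin sums.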
Additionally, I experimented with ensemble learning, which means training multiple models and combining their predictions to create a larger, more complex model. I used libraries including scikit-learn, TensorFlow, XGBoost, and LightGBM, and tested them on classification tasks such as the Higgs Boson dataset.
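One simple way to combine models is majority voting. The sketch below is an illustration only and does not use the libraries named above: the three threshold rules in `models` are hypothetical stand-ins for trained classifiers, and `majority_vote` is a helper name I made up for this example.

```python
from collections import Counter


def majority_vote(models, x):
    """Return the class label predicted by most of the models for input x."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for trained binary classifiers.
# Each takes a feature tuple and returns 0 or 1.
models = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.0),
]

label_a = majority_vote(models, (0.9, 0.2))  # votes: 1, 0, 1
label_b = majority_vote(models, (0.2, 0.6))  # votes: 0, 1, 0
```

An odd number of binary voters avoids ties. Other combination schemes, such as averaging predicted probabilities or stacking, follow the same pattern of replacing the voting step.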
Finally, I helped develop a web application that facilitates the PDA training process. Since raw data cannot be transmitted, collaborators share only summary statistics. I learned MongoDB, set up the database, and created the REST API for the application.
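As a rough illustration of why summary statistics can be enough, the sketch below pools a mean and sample variance from per-site counts, sums, and sums of squares. Which statistics the application actually exchanges is not described here, so the choice of statistics, the dictionary keys, and the function names are all assumptions.

```python
def site_summary(values):
    """Run at each site: reduce raw values to three aggregate numbers."""
    return {
        "n": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
    }


def pooled_mean_variance(summaries):
    """Run at the coordinator: combine site summaries without raw data.

    Returns the mean and the sample variance (n - 1 denominator) of the
    pooled data, so it needs at least two observations in total.
    """
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


# Two hypothetical sites; only the summaries are shared.
summary_a = site_summary([1.0, 2.0, 3.0])
summary_b = site_summary([4.0, 5.0])
mean, variance = pooled_mean_variance([summary_a, summary_b])
```

The result matches what a centralized computation over all five values would give. The sum-of-squares formula can lose precision when values are large relative to their spread, so a production system would likely use a more numerically stable merge.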
Comments
SVMs
You're right, Andrew. After learning about SVMs in CIS 520, I can honestly say that I rarely run into them nowadays. Most of their classification tasks have been taken over by neural networks. If you learned TensorFlow, I also recommend learning PyTorch!
Great presentation!
Andrew, great presentations and great summary of what you have done over the summer.
After we complete the current project, we will move on to few-shot federated algorithms. I look forward to working with you.
Best,
Yong