Fall Research Expo 2023

A Machine Learning-Based Approach to Real-Time Anomaly Detection in pp Collisions at the LHC

The Large Hadron Collider (LHC) is a particle collider that accelerates and collides beams of protons. While hundreds of millions of proton-proton collisions occur each time two proton beams cross, the vast majority of these collisions (referred to as background events) produce only jets of low-energy hadrons. In some theories of new physics, however, a vanishingly small fraction of collisions (referred to as signal events) produce anomalous phenomena. Detecting these signal events in real time requires a model that filters out most events while retaining the anomalous ones, and that is simple enough to satisfy hardware limitations.

The efficacy of each model developed over the course of this project was evaluated on four signal-event datasets, each composed of instances of a distinct signal event.

Four models were ultimately tested: a supervised model, an autoencoder model, and two variational autoencoder models.

For the supervised model, the provided data was pre-labeled, with signal events labeled 1 and background events labeled 0, and binary cross-entropy was used as the loss function. A separate supervised model was trained for each signal event considered.
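The sketch below illustrates the kind of supervised setup described above: a small feed-forward classifier trained with binary cross-entropy on 0/1 labels. The input dimension, layer widths, and learning rate are illustrative assumptions, not the architecture actually used in this project.

# Minimal sketch of a supervised signal-vs-background classifier with a BCE loss.
# Dimensions and hyperparameters are illustrative, not the project's actual choices.
import torch
import torch.nn as nn

class SupervisedClassifier(nn.Module):
    def __init__(self, input_dim=57):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 1),  # single logit: signal (1) vs. background (0)
        )

    def forward(self, x):
        return self.net(x)

model = SupervisedClassifier()
loss_fn = nn.BCEWithLogitsLoss()  # binary cross-entropy applied to the logit
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(features, labels):
    """features: (batch, input_dim) float tensor; labels: (batch, 1) float tensor of 0s and 1s."""
    optimizer.zero_grad()
    logits = model(features)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()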

An autoencoder is an unsupervised neural network that attempts to reconstruct its input and is trained solely on non-anomalous (background) inputs. The loss function for an autoencoder consists solely of a reconstruction loss, though this was slightly modified for the autoencoder used in this project.
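As a rough illustration of the autoencoder approach, the sketch below compresses each event into a small latent vector, reconstructs it, and uses the per-event reconstruction error as an anomaly score. The layer sizes are assumptions, and the project's modified reconstruction loss is not reproduced here; plain MSE stands in for it.

# Minimal autoencoder sketch: encoder compresses the event features, decoder reconstructs
# them, and the reconstruction error serves as the anomaly score.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=57, latent_dim=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 16), nn.ReLU(),
            nn.Linear(16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16), nn.ReLU(),
            nn.Linear(16, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
recon_loss = nn.MSELoss()  # stand-in for the (modified) reconstruction loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def anomaly_score(x):
    """Per-event reconstruction error; larger values indicate a more anomalous event."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)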

A variational autoencoder (VAE) is similar to an autoencoder, except for a modification that ensures the latent space is a normalized distribution from which points are randomly sampled. The loss function for a VAE consists of a reconstruction loss and a latent-space loss. Two different metrics were used for the reconstruction loss: mean squared error (MSE) and earth mover's distance (EMD), with one VAE model constructed using each metric. Since calculating the EMD is computationally expensive, a separate model was trained and used to estimate the EMD when training the EMD-based VAE model.
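The two-term VAE loss described above can be sketched as follows. Here `reconstruction_term` stands in for either MSE or the separately trained EMD-estimator network, and `beta` is an assumed weighting hyperparameter, not a value taken from the project.

# Sketch of the VAE loss: a reconstruction term plus a KL-divergence term that keeps
# the latent distribution close to a standard normal.
import torch

def vae_loss(x, x_reconstructed, mu, log_var, reconstruction_term, beta=1.0):
    """
    x, x_reconstructed : (batch, features) tensors
    mu, log_var        : (batch, latent_dim) outputs of the encoder
    reconstruction_term: callable returning a batch-averaged reconstruction loss
                         (e.g. MSE or a learned EMD approximation)
    """
    recon = reconstruction_term(x_reconstructed, x)
    # KL divergence between N(mu, sigma^2) and N(0, 1), averaged over the batch
    kl = -0.5 * torch.mean(torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1))
    return recon + beta * kl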

The relative utility of each model was quantified primarily by the receiver operating characteristic (ROC) curves produced from each model's predictions on the datasets for each signal event, along with the corresponding efficiency curves.

The goal of this project was to determine the model that maximizes the true positive rate (TPR) at a false positive rate (FPR) of 10^-4 (this restriction arises from the relative abundances of signal and background events), so the analysis of the ROC curves focused primarily on the TPR at this operating point.
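The figure of merit above can be read off an ROC curve as sketched below, assuming per-event anomaly scores and 0/1 truth labels; this uses scikit-learn's roc_curve and linear interpolation, which is one reasonable way to evaluate it, not necessarily the exact procedure used in the project.

# Sketch: true positive rate at a fixed false positive rate of 1e-4, read off an ROC curve.
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(labels, scores, target_fpr=1e-4):
    """labels: 1-D array of 0/1 truth labels; scores: 1-D array of anomaly scores."""
    fpr, tpr, _ = roc_curve(labels, scores)
    # roc_curve returns fpr in increasing order; interpolate the TPR at the target FPR
    return float(np.interp(target_fpr, fpr, tpr))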

The efficiency plots produced for each model were binned by the missing transverse energy (MET) in each event.
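A MET-binned efficiency curve of this kind can be computed as in the sketch below: the fraction of events passing a fixed anomaly-score threshold in each MET bin. The bin edges and threshold are illustrative assumptions.

# Sketch: selection efficiency per MET bin at a fixed score threshold.
import numpy as np

def binned_efficiency(met, scores, threshold, bin_edges):
    """met, scores: 1-D arrays of equal length; returns the per-bin selection efficiency."""
    passed = scores > threshold
    efficiencies = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (met >= lo) & (met < hi)
        n = in_bin.sum()
        efficiencies.append(passed[in_bin].sum() / n if n > 0 else np.nan)
    return np.array(efficiencies)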

As expected, the supervised models appear to provide the best tradeoff between a high TPR and a flat efficiency curve. Among the unsupervised models, the autoencoder provides a better tradeoff than either variational autoencoder, suggesting that it is better suited for this task.

Future directions for this project include exploring additional autoencoder architectures and evaluating model efficiency with respect to alternative metrics (e.g., the total energy in each event).

 

PRESENTED BY
PURM - Penn Undergraduate Research Mentoring Program
Engineering & Applied Sciences 2026
Advised By
Dylan Rankin
Assistant Professor, University of Pennsylvania
