Sifting the GOTO transient stream with transient-optimised Bayesian source classification

Abstract

Modern synoptic sky surveys have proven transformative for time-domain astrophysics, leveraging high-performance machine learning classification to efficiently sift and categorise the vast quantities of data these projects generate. With upcoming step changes in survey capability (e.g. Vera Rubin Observatory) further increasing this data rate, continual improvements are required to efficiently exploit the unprecedented discovery stream these projects provide.

In this talk we present recent progress on source classification in the context of the 'real-bogus' problem (Killestein et al., 2021), applying Bayesian deep learning to the incoming candidate stream of the Gravitational-wave Optical Transient Observer (GOTO) survey. Our classifier uniquely provides uncertainty-aware predictions to assist human interpretation and model development, and can be trained to state-of-the-art (1% FPR, 1.5% FNR) performance with minimal human labelling via a fully-automated dataset generation process. This high performance is driven by a novel data-driven augmentation scheme that generates realistic synthetic supernovae to train on, yielding a CNN classifier uniquely optimised for locating extragalactic transient sources. This provides marked improvements in the recovery of faint and nuclear transients of key interest, and the kilonovae that form the central goal of GOTO's science case.

We showcase early results from the next-generation GOTO source classifier, leveraging a high-performance database of contextual information to provide rich, multi-label, hierarchical classification in real-time -- reducing human vetting effort in the short-term, leading towards fully autonomous, real-time triggering of spectrophotometric follow-up.

Time-domain astrophysics with GOTO

Image of the GOTO-1 node on La Palma.

(c) Krzysztof Ulaczyk (2022)

GOTO web page

GOTO prototype paper

The Gravitational-wave Optical Transient Observer (GOTO) is a fully autonomous wide-field telescope array, designed specifically to hunt the afterglows of gravitational wave events. It is a multi-node, multi-site project, with each node made up of 16x40cm f/2 astrographs, capable of reaching magnitude 19.5 in 60s exposures, with a combined field of view of 47 square degrees. This combination of automation, depth, and wide field makes it uniquely suited for following up a wide range of fast transients, as well as providing a ~nightly cadence all-sky survey for large swathes of the sky.

This datastream is significant, with vast scientific discovery potential. GOTO is already accessing regions of transient parameter space that are difficult to probe with other facilities: the GW follow-up strategy is sensitive to a range of fast transients typically inaccessible with the slower cadence of other surveys. However, this is only possible with high-performance pipelines and machine learning filtering, and the burden placed on these grows with the size of the facilities. It is clear that the techniques used for the last generation of surveys cannot keep up with current sky surveys like GOTO and ZTF, let alone the upcoming LSST era. It is therefore crucial to keep improving our classifiers, with even marginal percentage gains in accuracy adding up to tens of thousands fewer candidates for humans to vet over a typical week. These algorithms have slowly climbed in accuracy over the past years with the introduction of deep learning.

Training set: transient optimisation

Fully-automated training set generation

We generate our training set in a fully automated way, to avoid human biases and enable large (~400k examples) datasets to be generated rapidly.

Positive class (i.e. real transients): cross-match difference image detections with known asteroids in the field using the SkyBoT cone search HTTP API. Minor planets are a common contaminant in time-domain astrophysics, yet are useful here because they provide 'real' PSFs, an advantage over modelling them.
Negative class: remove all known minor planets and variable stars, and randomly sample the remaining difference image detections to generate a diverse set of bogus examples. Eyeballing indicates that this approach has <0.5% contamination, and is many orders of magnitude faster than human labelling.

With this approach, we generate a 400,000 example training set in under 24h, significantly larger than the largest human-labelled dataset. This set also accurately represents the intrinsic data distribution (magnitude, FWHM, etc.) and is free of human bias.
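The labelling logic above can be sketched in a few lines. This is a minimal illustration only, assuming the minor planet and variable star positions have already been retrieved (e.g. from SkyBoT) as lists of (RA, Dec) pairs in degrees; the function names, the 2 arcsec match radius, and the example coordinates are all hypothetical and not taken from the GOTO pipeline.

```python
import math
import random


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcsec (inputs in degrees), haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0


def has_match(det, catalogue, radius_arcsec):
    """True if any catalogue position lies within the match radius of `det`."""
    return any(angular_sep_arcsec(det[0], det[1], ra, dec) <= radius_arcsec
               for ra, dec in catalogue)


def build_training_set(detections, minor_planets, variable_stars,
                       n_bogus, radius_arcsec=2.0, seed=0):
    """Split detections into 'real' (minor planet matches) and sampled 'bogus'."""
    real, remainder = [], []
    for det in detections:
        if has_match(det, minor_planets, radius_arcsec):
            real.append(det)          # positive class: known minor planet
        elif not has_match(det, variable_stars, radius_arcsec):
            remainder.append(det)     # neither minor planet nor variable star
    rng = random.Random(seed)
    bogus = rng.sample(remainder, min(n_bogus, len(remainder)))
    return real, bogus


# Hypothetical example positions (degrees)
detections = [(10.0, 20.0), (10.1, 20.1), (10.2, 20.2), (10.3, 20.3)]
minor_planets = [(10.0, 20.0 + 1.0 / 3600.0)]  # 1 arcsec from the first detection
variable_stars = [(10.1, 20.1)]
real, bogus = build_training_set(detections, minor_planets, variable_stars, n_bogus=2)
```

A brute-force loop like this is only suitable for illustration; at survey scale the cross-match would need a spatial index.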

Synthetic transient augmentation

GOTO's specific science goals are focused around finding extragalactic transients, associated with galaxies. To optimise our classifier for the recovery of these types of objects, we extend our method above to generate realistic synthetic transients typical of those we are trying to detect. We do this as follows:

Select a known minor planet from the image
Search in the GLADE galaxy catalog for nearby hosts
Inject the minor planet at some random offset from the host by summing the stamps.

This generates very convincing supernovae that we can use to augment our positive classes.

To avoid the classifier 'cheating', we also inject blank galaxies as negative examples. In testing, without doing this, the classifier would classify anything near a galaxy as positive, which is counterproductive: it is a trivial way to maximise recovery, at the cost of an overwhelming false positive rate!
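The injection step, together with the blank-galaxy negatives, can be sketched as follows. This is a simplified single-channel illustration: stamps are plain nested lists, the offset is drawn uniformly in pixels, and the names `inject_stamp` and `make_synthetic_pair` are hypothetical.

```python
import random


def inject_stamp(host, source, dy, dx):
    """Return a copy of `host` with `source` summed in, offset (dy, dx) px from centre."""
    size, half = len(host), len(source) // 2
    out = [row[:] for row in host]
    cy, cx = size // 2 + dy, size // 2 + dx
    for j, src_row in enumerate(source):
        for i, value in enumerate(src_row):
            y, x = cy - half + j, cx - half + i
            if 0 <= y < size and 0 <= x < size:   # ignore pixels falling off the stamp
                out[y][x] += value
    return out


def make_synthetic_pair(host, source, max_offset, rng):
    """Build one synthetic transient (label 1) and one blank-galaxy negative (label 0)."""
    dy = rng.randint(-max_offset, max_offset)
    dx = rng.randint(-max_offset, max_offset)
    positive = (inject_stamp(host, source, dy, dx), 1)
    negative = ([row[:] for row in host], 0)      # same host, nothing injected
    return positive, negative


# Hypothetical toy stamps: a flat 5x5 'host' and a 3x3 'minor planet' cutout
rng = random.Random(42)
host = [[1.0] * 5 for _ in range(5)]
source = [[0.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 0.0]]
(positive_img, positive_label), (negative_img, negative_label) = make_synthetic_pair(
    host, source, max_offset=1, rng=rng)
```

Pairing each injected host with its blank counterpart means the presence of a galaxy alone carries no information about the label.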

Figure illustrating example outputs from our synthetic transient generation algorithm. From left to right: science image, template image, difference image as generated via HOTPANTS, and a peak-to-peak image designed to capture spurious hot pixels/rapid variability.

Applying Bayesian neural networks

Dropout as an approximate Bayesian neural network

Uncertainty-aware predictions are generated using Monte Carlo dropout to perform approximate Bayesian inference. Gal and Ghahramani (2016) revealed a link between dropout as a regularisation method and approximate Bayesian inference. By sampling from realisations of the model with weights 'dropped out' (set to zero), we can approximately sample from the predictive posterior of the network, gaining valuable information about the confidence of a given prediction. One such posterior distribution is plotted below.

Example of posterior samples for a transient in a faint host. The classifier correctly predicts this as a real detection, but with some uncertainty (likely due to the sharp PSF in the science image).
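To make the sampling procedure concrete, here is a minimal sketch of Monte Carlo dropout on a toy one-hidden-layer network. It is not the GOTO CNN: the weights, dropout probability, and function names are hypothetical, and the only point is that dropout stays switched on at prediction time so that repeated forward passes give a spread of scores.

```python
import math
import random
import statistics


def forward_with_dropout(x, w_hidden, w_out, p_drop, rng):
    """One stochastic forward pass with dropout active on the hidden layer."""
    hidden = []
    for weights in w_hidden:
        act = max(0.0, sum(w * xi for w, xi in zip(weights, x)))  # ReLU unit
        # Dropping a unit is equivalent to zeroing all of its outgoing weights.
        keep = rng.random() >= p_drop
        hidden.append(act / (1.0 - p_drop) if keep else 0.0)      # inverted dropout
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))                         # sigmoid score


def mc_dropout_predict(x, w_hidden, w_out, p_drop=0.25, n_samples=200, seed=0):
    """Draw `n_samples` scores; return their mean, standard deviation, and the draws."""
    rng = random.Random(seed)
    samples = [forward_with_dropout(x, w_hidden, w_out, p_drop, rng)
               for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples), samples


# Hypothetical toy network and input
w_hidden = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4], [0.7, 0.2]]
w_out = [1.0, -0.5, 0.8, 0.6]
x = [1.0, 2.0]
score_mean, score_std, samples = mc_dropout_predict(x, w_hidden, w_out)
```

The list `samples` plays the role of the posterior samples in the figure: a narrow distribution indicates a confident prediction, a broad one an uncertain prediction.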

Active learning

We briefly experimented with active learning, and saw no gains over a random, naive selection. We think this is possibly because the active learning over-prioritised unusual artifacts, and learning these does not translate into better overall performance.

Training set visualisations with t-SNE

We can combine classifier confidence as obtained via BNNs with other techniques to look deeply at the way our classifier has learned from the training set. In the plot below, we can identify specific areas of latent space the classifier finds harder to learn (i.e. has lower confidence) -- these lie largely where we expect (on class boundaries), but also indicate that specific morphologies of bogus detection (e.g. cosmic rays) are often confused with minor planets.

t-SNE maps of training set, coloured by class on the left panel, and confidence on the right panel.

Future directions: context-aware classification

Challenges of contextual information

Incomplete: we know that our completeness fractions for galaxy catalogs very rapidly fall off beyond ~100 Mpc. Therefore we might see the transient host in the image, but not in our meta-catalog.
Incorrect: Misclassifications are a natural part of large-scale catalogs, for example white dwarfs being misclassified as AGN in some photometric catalogs.
Inconclusive: Multiple catalogs may report multiple different object types for the same astronomical source, and it is unclear how to choose the correct one.

It is clear that different approaches are needed to perform nuanced classifications, whilst also taking into account the above issues.

Preliminary results

To provide real-time contextual information to the classifier, we build an optimised database containing subsets of 12 catalogs of galaxies, AGN, and variable stars.

We classify under an 'apparent' scheme, using the labels:

VS: variable star // OR: orphaned source // NT: nuclear transient // SN: supernova // BS: bogus source

We then train a CNN on images only, and feed this score into a random forest using the catalog information. We censor data randomly when training the random forest to promote balance between the CNN score and context.
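The random censoring step can be illustrated as below. The exact censoring scheme is not specified here, so this sketch assumes that, for each training example, all contextual features are hidden together with some probability and replaced by a sentinel value; the sentinel, the feature columns, and the function name are hypothetical.

```python
import random

CENSORED = -1.0  # hypothetical sentinel meaning "no contextual information"


def censor_context(features, p_censor, rng):
    """Randomly hide the context columns of each row, leaving the CNN score intact.

    `features` is a list of rows of the form [cnn_score, context_1, ..., context_n].
    """
    censored = []
    for cnn_score, *context in features:
        if rng.random() < p_censor:
            context = [CENSORED] * len(context)
        censored.append([cnn_score, *context])
    return censored


# Hypothetical columns: CNN score, host offset (arcsec), catalogued-AGN flag
features = [[0.97, 3.2, 1.0], [0.12, 45.0, 0.0], [0.88, 0.4, 1.0]]
training_rows = censor_context(features, p_censor=0.5, rng=random.Random(1))
image_only = censor_context(features, p_censor=1.0, rng=random.Random(1))
```

Rows censored in this way would then be passed to the random forest, which has to fall back on the CNN score whenever the context is hidden. Setting `p_censor=1.0` corresponds to a 100% censored-context evaluation.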

On images alone, we obtain 90.5% accuracy, climbing to 98.5% accuracy when all contextual information is included. Sample confusion matrices are shown below.

Left: fully trained meta-classifier performance on the test set using only image-level information (100% censored context) - note the confusion between NT/SN/OR classes.

Right: fully trained meta-classifier performance on the test set using all available contextual information.

With the now relatively-pure stream provided by our real-bogus classifier, effort has shifted to more general source classification. This can be used in multiple ways (which we are currently developing):

Automated prioritisation of candidates: minimises human vetting effort
Automated real-time reporting to e.g. TNS
Fully autonomous follow-up campaigns with other connected facilities

The real-bogus classification problem

Difference imaging: an imperfect technique

Most modern transient surveys detect variability via difference imaging, subtracting a template observation to reveal objects that have changed in brightness since the template was taken. This is a complex process, as PSFs must be matched between images, and differential backgrounds corrected for. For various approaches to this, see Alard and Lupton (1999), Becker (2015), and Zackay (2016).

Example cutout of a GOTO difference image - centred on the comet C/2017 K2 which appears as a positive flux residual. Note the subtraction residuals of the bright stars present.

The vast majority of detections in this 'difference' image are merely artifacts of the process, arising from imperfect PSF matching or bad subtractions. There are also clear artifacts from bright stars, and cosmic rays that have been broadened into PSF-like shapes by convolution with the PSF-matching kernel.
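As a toy illustration of the subtraction step, the sketch below convolves a template with a PSF-matching kernel, subtracts it from the science image, and removes a constant differential background. In practice the kernel and background are fitted to the data (the difference images here are generated via HOTPANTS); this example simply assumes both are already known, and all names and pixel values are hypothetical.

```python
def convolve2d_same(image, kernel):
    """'Same'-size 2D convolution with zero padding (odd-sized kernel assumed)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oy, ox = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for j in range(kh):
                for i in range(kw):
                    yy, xx = y + oy - j, x + ox - i
                    if 0 <= yy < h and 0 <= xx < w:
                        total += kernel[j][i] * image[yy][xx]
            out[y][x] = total
    return out


def difference_image(science, template, kernel, background=0.0):
    """Science minus PSF-matched template minus a constant differential background."""
    matched = convolve2d_same(template, kernel)
    return [[s - m - background for s, m in zip(s_row, m_row)]
            for s_row, m_row in zip(science, matched)]


# Toy data: a single sharp star in the template, broadened in the science image
template = [[0.0] * 5 for _ in range(5)]
template[2][2] = 9.0
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]     # assumed-known matching kernel
science = [[m + 2.0 for m in row] for row in convolve2d_same(template, kernel)]
science[0][0] += 5.0                             # a new source absent from the template
diff = difference_image(science, template, kernel, background=2.0)
```

With a perfectly known kernel only the new source survives in `diff`. With a real, imperfectly fitted kernel, the bright star would leave exactly the kind of residual described above.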

The real-bogus problem

With the high rates of false positives via the difference imaging method, how can we detect transients in real-time, with minimal human effort, and maximal recovery of real objects? This is where machine learning can help! By training a classifier on metadata of difference image detections with labels of positive (real) or negative (bogus), we can create a fully automated classifier to remove the majority of spurious detections.

Deep learning-based techniques currently lead the field: they can represent far more complex decision boundaries than simpler machine learning methods, and learn their features automatically during training. Our new classifier is based on a convolutional neural network architecture which, in contrast to other works, we optimise via hyperparameter tuning.

Performance

To verify real-world performance, we built a test set of over 700 spectroscopically confirmed transients seen in GOTO commissioning data via the TNS. We obtain a balanced accuracy of 97.2%, broadly in line with the scores we obtain on the test set (see below). We use subsets of this to check the consistency of performance with changing magnitude and host offset, plotted below:

Of particular interest: >30% improvement in recovery of faint (>19.5) transients, ~14% improvement in recovery of nuclear transients over a classifier trained purely on minor planets.

Recovery rate (TPR) as a function of GOTO discovery magnitude at a fixed real-bogus threshold of 0.5. The dashed line indicates the performance of a classifier with a similarly sized training set, but with only minor planet detections.

Error bars are derived directly from the classifier score posteriors. The number of detections per bin is written below each bar. The sharp drop in the number of detections beyond L ~ 19.5 is associated with the median 5-sigma limiting magnitude of the GOTO prototype, and is thus expected.

Recovery rate of transients that can be reliably associated with a host galaxy (as cross-matched with WISExSuperCosmos) as a function of host offset. As above, error bars are derived from the classifier score posteriors, and a similarly-sized minor planet-based classifier is plotted for comparison. There is a marked improvement in the recovery rate for very small host offsets, particularly for nuclear transients.
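One possible way to turn score posteriors into binned recovery rates with error bars is sketched below. The exact procedure behind the plots is not described here, so this is an assumption: the recovery rate is recomputed for every posterior draw, and the mean and scatter across draws are reported. The function name and example values are hypothetical.

```python
import statistics


def recovery_by_bin(values, posterior_samples, bin_edges, threshold=0.5):
    """Recovery rate (TPR) per bin, with scatter taken across posterior draws.

    `values` holds one magnitude (or host offset) per confirmed transient, and
    `posterior_samples` holds the list of MC dropout scores for each transient.
    """
    results = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = [s for v, s in zip(values, posterior_samples) if lo <= v < hi]
        if not in_bin:
            results.append((lo, hi, 0, None, None))
            continue
        n_draws = len(in_bin[0])
        # Fraction of transients recovered in each individual posterior draw
        rates = [sum(s[k] >= threshold for s in in_bin) / len(in_bin)
                 for k in range(n_draws)]
        results.append((lo, hi, len(in_bin),
                        statistics.mean(rates), statistics.pstdev(rates)))
    return results


# Hypothetical example: three transients, two posterior draws each
magnitudes = [18.0, 18.5, 19.6]
posteriors = [[0.9, 0.8], [0.6, 0.4], [0.3, 0.7]]
bins = recovery_by_bin(magnitudes, posteriors, bin_edges=[18.0, 19.0, 20.0])
```

Each entry of `bins` is (lower edge, upper edge, number of detections, mean recovery rate, scatter). The same function applies unchanged to host offsets.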

Test set performance

As is normal with ML approaches, we reserve 10% of our training dataset as a held-out test set, to confirm accuracy on data that has not yet been 'seen' by the classifier.

This shows excellent performance, attaining balanced accuracies of 99.49% and 99.19% (F1 score: 0.9925) on the minor planet and synthetic transient test datasets respectively after extensive hyperparameter tuning.
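For reference, the headline metrics quoted throughout (FPR, FNR, balanced accuracy, F1 score) follow from the confusion counts at a fixed real-bogus threshold. The helper below is a generic sketch using the standard definitions; the function name and example scores are hypothetical.

```python
def real_bogus_metrics(labels, scores, threshold=0.5):
    """Standard binary metrics; label 1 = real, 0 = bogus, score >= threshold = 'real'."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fpr = fp / (fp + tn)                      # bogus detections passed as real
    fnr = fn / (fn + tp)                      # real detections rejected as bogus
    return {
        "fpr": fpr,
        "fnr": fnr,
        "balanced_accuracy": 0.5 * ((1.0 - fnr) + (1.0 - fpr)),  # mean of TPR and TNR
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical example: four real and four bogus detections
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.6, 0.4]
metrics = real_bogus_metrics(labels, scores)
```

Balanced accuracy is the natural choice here because it weights the real and bogus classes equally, regardless of how many examples each contains.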