MapLC: unsupervised Machine Learning for discovering anomalous variability

Oleksandra Razim 1

  • 1 University of Nova Gorica, Nova Gorica

Abstract

Existing large-scale surveys, such as Gaia and ZTF, and especially upcoming surveys, such as LSST and Roman, promise to provide us with hundreds of thousands (currently) and tens of millions or even billions (in the future) of high-quality, well-sampled stellar light curves. With these data, we are entering the age in which we can observe not only common, slowly-changing and periodic variable objects but also highly sporadic, transient, rapidly changing and simply intrinsically rare objects, the ones we would call anomalies. Characterization of these objects promises to shed light on stellar evolution, especially at the earliest and latest stages of the stellar life cycle, improve the distance ladder and promote the development of new techniques, e.g. those based on asteroseismology. However, to date, we do not have a reliable method for discovering rare and novel variable objects. Most of the new types of variable stars discovered in the last twenty years were serendipitous discoveries, found either during investigations of samples of already known objects (e.g. Blue Large-Amplitude Pulsators (Pietrukowicz et al. 2017) and Fast Yellow Pulsating Supergiants (Dorn-Wallenstein et al. 2020)) or through citizen science efforts (e.g. Boyajian's Star (Boyajian et al. 2016)), which amounts to thousands of human-hours of sifting through massive volumes of information. With billions of light curves becoming available in the next ten years, relying on manual searches for previously unseen types of objects is far from effective and will likely delay their discovery by years, if not decades.
This talk presents MapLC, a project developing a pipeline for the semi-automatic detection of rare, novel and anomalous variable objects, whether periodic, stochastic, or transient. The project relies on unsupervised Machine Learning and dataset topology techniques, and investigates which feature sets and methods are best suited for differentiating true anomalies from 'bogus' objects. As test cases, we use a number of already known rare classes of variable objects discovered within the last few decades, and analyse whether it is possible to re-discover them in a semi-automatic manner. The secondary goal of the project is to analyse which areas of the multi-dimensional parameter spaces used for variable-source characterization remain understudied and may therefore hide previously unknown objects.

Anomalous variable sources

Astronomy has entered the petabyte era, rendering traditional methods of discovery, where experts manually examine images, spectra, or light curves, unfeasible. Machine Learning is now routinely used for classification and characterization of celestial objects, but anomaly detection is still a challenge. The absence of ground truth makes it difficult to distinguish between physically interesting outliers and spurious data artifacts. While image-based anomaly detection is relatively well-developed, identifying novel variability in light curves remains a less-charted frontier.

Space-based missions like TESS and Kepler revealed that nearly 60% of stars exhibit variability when observed at millimagnitude precision. With the Legacy Survey of Space and Time (LSST) about to start monitoring billions of stars at similar or slightly worse precision, albeit with a sparser cadence, and the Roman Space Telescope soon to follow, we need to learn how to detect meaningful novelties within vast, noisy, and heterogeneous datasets. Failing to do so may mean overlooking key insights into stellar evolution, or even entirely new physical phenomena.

We introduce MapLC, a project aimed at developing a semi-automated, model-agnostic pipeline for anomaly discovery in variable sources. Our approach combines three complementary strategies: a blind search for unanticipated behaviors, a reconstruction of known rare classes through unsupervised methods, and a targeted probe into regions of parameter space deemed promising by theoretical and statistical reasoning.

Three-pronged approach

Prong 1: blind search

✓ Take community-developed feature sets;

✓ feed them to common anomaly detection methods;

✓ see what we get out;

✓ analyse with additional data (spectra);

✓ ask the community for help.
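The first three steps of this blind-search loop can be sketched with a standard toolkit. This is a minimal illustration, not the MapLC implementation: the feature columns (period, amplitude, skewness) and the mock data are purely illustrative stand-ins for community-developed feature sets.

```python
# Minimal sketch of the blind-search loop (Prong 1), assuming light curves
# have already been summarized into a feature table. Feature choices and
# thresholds here are illustrative, not MapLC's actual configuration.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Mock feature table: 500 "ordinary" variables plus 5 injected outliers.
# Columns stand in for e.g. log period, amplitude, and light-curve skewness.
ordinary = rng.normal(loc=[1.0, 0.1, 0.0], scale=[0.2, 0.02, 0.5], size=(500, 3))
outliers = rng.normal(loc=[5.0, 1.0, 4.0], scale=[0.3, 0.1, 0.5], size=(5, 3))
features = np.vstack([ordinary, outliers])

# Fit an ensemble-of-trees anomaly detector; `contamination` sets the
# expected outlier fraction and hence the decision threshold.
clf = IsolationForest(n_estimators=200, contamination=0.02, random_state=0)
labels = clf.fit_predict(features)  # -1 = anomaly candidate, +1 = inlier

candidates = np.where(labels == -1)[0]
print(f"{len(candidates)} anomaly candidates flagged for follow-up")
```

The flagged candidates would then be vetted with additional data (e.g. spectra) and, where needed, shared with the community, as in the remaining steps above.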

Prong 2: test cases

✓ Take known rare cases and anomalies (BLAPs, dippers, etc);

✓ take the datasets in which they were discovered;

✓ see if they can be discovered non-serendipitously, using unsupervised and semi-supervised ML.

Prong 3: parameter space analysis

✓ Take already existing datasets;

✓ run ‘discovery metrics’ for these surveys in different areas of parameter space;

✓ identify previously understudied areas which will be probed with LSST and subsequent surveys;

✓ upon the arrival of the LSST data, search for anomalies in these areas.
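One simple form such a 'discovery metric' could take is bin occupancy: count catalogued sources per cell of parameter space and flag sparsely populated cells as understudied. The sketch below assumes a two-parameter space (log period vs. amplitude) with mock data; the actual MapLC metrics and parameter choices may differ.

```python
# Toy density-based "discovery metric" in the spirit of Prong 3. Bins of
# parameter space with far fewer sources than the uniform expectation are
# flagged as understudied. The 2D space and the 10% threshold are
# illustrative assumptions, not the project's actual choices.
import numpy as np

rng = np.random.default_rng(1)
log_period = rng.normal(0.0, 0.5, size=2000)     # mock catalogue, dex
amplitude = rng.lognormal(-1.0, 0.5, size=2000)  # mock catalogue, mag

counts, p_edges, a_edges = np.histogram2d(log_period, amplitude, bins=10)
occupancy = counts / counts.sum()  # fraction of sources per bin

# Flag bins holding less than 10% of the uniform-occupancy expectation.
threshold = 0.1 / counts.size
understudied = np.argwhere(occupancy < threshold)
print(f"{len(understudied)} of {counts.size} bins are understudied")
```

A per-survey version of this map, computed over many more parameters, would indicate where LSST data should be searched first.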

Test science cases

To develop an anomaly detection package that would be applicable to a large variety of problems, the 'Prong 2' strategy incorporates test science cases selected to (a) span diverse modes of astronomical variability, namely, transient, stochastic, transitional, and periodic, and (b) focus on phenomena recently discovered but poorly sampled, where larger datasets could bring significant progress. Our current set of test cases includes:

Transients: Tidal Disruption Events (TDEs), particularly double-peaked or recurring events, which occur when a supermassive black hole disrupts a binary system (double TDEs) or partially disrupts a star over multiple passages (recurring TDEs). The LSST is expected to detect numerous TDEs per night, including these variants.

Stochastic variability: Dipping stars (e.g., Tabby’s Star analogs) and Young Stellar Objects (YSOs). KIC 8462852 (Tabby’s Star, Boyajian 2016) was first identified through citizen science. Later, archival searches revealed similar cases, demonstrating how such anomalies can go unnoticed for years or even decades without dedicated novelty detection efforts. YSOs, known for their erratic variability, offer an ideal testbed for clustering-based methods and are crucial to understanding early stellar evolution.

Transitional variability: Changing-state AGNs are prime examples of sources shifting between variability regimes. LSST’s decade-long observations are likely to capture a substantial sample of these objects.

Periodic anomalies: Blue Large Amplitude Pulsators (BLAPs, Pietrukowicz 2017) and Fast Yellow Pulsating Supergiants (FYPS, Dorn-Wallenstein 2022) provide compelling, though challenging, examples. Since LSST’s cadence and precision may not be ideal for such short-period variables, archival datasets from Kepler, TESS, and OGLE must be used to assess whether these classes could have been detected via unsupervised methods.

Complex periodic systems: Binaries with pulsating components represent a broad category, often difficult to characterize yet highly relevant for understanding stellar evolution and supernova progenitors. Their diverse variability signatures make them well suited to dimensionality reduction and clustering algorithms.

Methods

One of the goals of the MapLC package is to provide the community with a model-agnostic anomaly detection tool suitable for identifying a wide range of anomaly types. With this in mind, MapLC must include algorithms based on diverse principles. As a starting point, we plan to implement the following:

Isolation Forest – A decision tree-based anomaly detection algorithm particularly effective for identifying cluster outliers. Its low computational cost and interpretability make it a popular choice for astronomical datasets (e.g., Villar 2021, Webb 2020, Sánchez-Sáez 2021, Etsebeth 2024).

Self-Organizing Maps (SOMs) – A dimensionality reduction method that projects high-dimensional data onto a typically 2D space while preserving topological structure. In practice, this means that data points that are neighbors in the original parameter space remain neighbors on the map, enabling interpretation of the results via partial post-labelling (e.g. Sanders 2023). In this post-labelling regime, SOMs can also be used for anomaly detection (Razim 2021).
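The mechanics of a SOM can be shown in a few lines of NumPy: each training step finds the best-matching grid node for a sample and pulls that node and its neighbors toward it, which is what preserves the topology. Grid size, learning rate, and iteration count below are illustrative; a production pipeline would use a dedicated SOM package rather than this toy implementation.

```python
# Minimal NumPy Self-Organizing Map, illustrating how neighboring points in
# feature space end up on nearby map nodes. All hyperparameters are
# illustrative assumptions.
import numpy as np

def train_som(data, grid=(8, 8), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.normal(size=(h, w, data.shape[1]))
    yy, xx = np.mgrid[0:h, 0:w]  # grid coordinates for the neighborhood term
    for t in range(n_iter):
        x = data[rng.integers(len(data))]
        # Best-matching unit: node whose weight vector is closest to x.
        dists = np.linalg.norm(weights - x, axis=2)
        by, bx = np.unravel_index(np.argmin(dists), (h, w))
        # Decay learning rate and neighborhood radius over time.
        frac = t / n_iter
        lr = lr0 * (1 - frac)
        sigma = sigma0 * (1 - frac) + 0.5
        # Gaussian neighborhood pulls the BMU and nearby nodes toward x.
        g = np.exp(-((yy - by) ** 2 + (xx - bx) ** 2) / (2 * sigma ** 2))
        weights += lr * g[:, :, None] * (x - weights)
    return weights

def bmu(weights, x):
    d = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(d), d.shape[:2])

# Two well-separated mock clusters should map to different grid regions,
# which is what makes partial post-labelling of map cells possible.
rng = np.random.default_rng(1)
a = rng.normal(0.0, 0.1, size=(200, 4))
b = rng.normal(3.0, 0.1, size=(200, 4))
som = train_som(np.vstack([a, b]))
print(bmu(som, a[0]), bmu(som, b[0]))
```

Cells of the trained map that attract few sources, or none of the post-labelled classes, are natural places to look for anomaly candidates.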

DBSCAN and HDBSCAN – Density-based clustering algorithms that perform well on datasets with complex topologies. Their main drawback is a high computational cost, which hinders their application to large, high-dimensional datasets. However, HDBSCAN has been successfully applied to lower-dimensional datasets (Webb 2020, Hunt 2021), or to feature spaces obtained via dimensionality reduction techniques (Queiroz 2023).
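A useful side effect of density-based clustering for anomaly work is that points in low-density regions are labelled as noise (label -1), making them natural anomaly candidates. The sketch below shows this with DBSCAN on mock 2D features; `eps` and `min_samples` are illustrative and strongly data-dependent.

```python
# Sketch of density-based anomaly detection: DBSCAN assigns low-density
# points the noise label -1, which can serve as an anomaly-candidate flag.
# Cluster positions and DBSCAN parameters are illustrative assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
cluster_a = rng.normal(0.0, 0.1, size=(150, 2))
cluster_b = rng.normal(2.0, 0.1, size=(150, 2))
stragglers = rng.uniform(4.0, 6.0, size=(5, 2))  # injected sparse outliers
X = np.vstack([cluster_a, cluster_b, stragglers])

labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
noise = np.where(labels == -1)[0]  # anomaly candidates
print(f"clusters: {labels.max() + 1}, noise points: {len(noise)}")
```

HDBSCAN works analogously but removes the need to fix `eps`, at the price of a higher computational cost on large feature spaces.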

Acknowledgements

This project has received funding from the European Union's Horizon Europe research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 101081355.