Applying machine-learning methods to build a photometric classifier for massive stars in nearby galaxies

Grigoris Maravelias ^1,2 , Alceste Z. Bonanos ¹ , Frank Tramper ¹ , Stephan de Wit ¹ , Ming Yang ¹ , Paolo Bonfini ^2,3

Abstract

Mass loss is a key parameter in the evolution of massive stars. Despite the recent progress in the theoretical understanding of how stars lose mass, discrepancies between theory and observations still hold. Even worse, episodic mass loss in evolved massive stars is not included in the models while the importance of its role in the evolution os massive stars is currently undetermined. A major hindrance to determining the role of episodic mass loss is the lack of large samples of classified stars. Given the recent availability of extensive photometric catalogs from various surveys spanning a range of metallicity environments, we aim to remedy the situation by applying machine learning techniques to these catalogs.We compiled a large catalog of known massive stars in M31 and M33, using IR (Spitzer) and optical (Pan-STARRS) photometry, as well as Gaia astrometric information. We grouped them in 7 classes (Blue, Red, Yellow, B[e] supergiants, Luminous Blue Variables, Wolf-Rayet, and outliers, e.g. QSO's and background galaxies). Using this catalog as a training set, we built an ensemble classifier utilizing color indices as features. The probabilities from three machine-learning algorithms (Support Vector Classification, Random Forests, Neural Networks) are combined to obtain the final classifications.The overall performance of the classifier is ~87%. Highly populated (Red/Blue/Yellow Supergiants) and well-defined classes (B[e] Supergiants) have a high recovery rate between ~98-74%. On the contrary, Wolf-Rayet sources are detected at ~20% while Luminous Blue Variables are almost non-existent. The is mainly due to the small sample sizes of these classes, although M31 and M33 have spectral classifications for several massive stars (about 2500). In addition, the mixing of spectral types, as there are no strict boundaries in the features space (color indexes) between those classes, complicates the classification. In an independent application of the classifier to other galaxies (IC 1613, WLM, Sextans A) we obtained an overall accuracy of ~71% despite the missing values on their features (which we replace with averaged values from the training sample). This approach results only in a few percent difference, with the remaining discrepancy attributed to the different metallicity environments of their host galaxies. The classifier's prediction capability is only limited by the available number of sources per class, reflecting the rarity of these objects and the possible physical links between these massive star phases. Our methodology is also efficient in correctly classifying sources with missing data and at lower metallicities, making it an excellent tool for spotting interesting objects and prioritizing targets for observations. Future spectroscopic observations will offer a test-bed of its actual performance along with opportunities for improvement.

Why bother ?

Although rare, massive stars (M∗ > 8−10 Msolar ) play a crucial role in the Universe, as they

display strong stellar winds
deposit energy and momentum to the interstellar medium
produce a series of elements
shed chemically processed material as they evolve
enhance the galactic environment of their host galaxies when exploding as supernovae
die as neutron stars or black holes, offering insights to extreme physics (gravity, temperature)
are progenitors of gamma-ray bursts and gravitational wave sources
are very luminous and can be observed in more distant galaxies > ideal tool to understand stellar evolution across cosmological time, especially for first galaxies (such as those to be observed with James Webb Space Telescope).

However, we still lack knowledge, especially for the evolution of massive stars beyond the main-sequence.

Key factors that determine their evolution:

initial mass
metallicity
stellar rotation
mass-loss
binary interactions.

In particular, mass loss

affects strongly the stellar evolution
enriches the immediate circumstellar environment
forms complex structures (shells, disks, etc.).

Mass loss is not homogeneous, even at the simplest cases of stellar winds from single stars.

A number of massive stars experience episodic activity and outbursts, e.g. Wolf-Rayet stars (WR), Luminous Blue Variables (LBV), Blue Supergiants (BSG), B[e] Supergiants, Red Supergiants (RSGs), Yellow Supergiants (YSGs).

But ...

How important the episodic mass loss is?

How it depends on the metallicity (in different galaxies)?

What links between the different evolutionary phases exists?

ASSESSing the importance of episodic mass loss

The presence of circumstellar material in various phases of massive stars and super luminous supernovae implies a central role of episodic mass loss in the evolution of massive stars. IR photometry is an ideal tool for distinguishing massive stars and in particular those with dusty environments (Bonanos et al. 2009, 2010, Britavskiy et al. 2014, 2015).

However, a major hindrance for this study is the lack of large samples of evolved massive stars with known spectral types in various metallicity environments .

The ERC-funded project ASSESS ("Episodic Mass Loss in Evolved Massive stars: Key to Understanding the Explosive Early Universe") aims to determine the role of episodic mass by:

assembling a large sample of evolved massive stars in a selected number of nearby galaxies at a range of metallicities through multi-wavelength photometry,
performing follow-up spectroscopic observations on candidates to validate their nature and extract stellar parameters,
testing the observations against the assumptions and predictions of the stellar evolution models.

For the first step, we applied machine-learning techniques to mid-IR point-source catalogs from Spitzer Space Telescope that became publicly available only recently (Khan et al. 2015; Khan 2017; Williams & Bonanos 2016).

The ASSESS group (from left): Frank Tramper, Grigoris Maravelias, Alceste Bonanos, Ming Yang, Stephan de Wit (photo by Dimitra Abartzi).

Performance

Performance on M31 and M33 galaxies

Support Vector Classifier

Random Forest

Neural Network

Upper panels: The confusion matrices for the SVC, RF, and NN methods, respectively.

Middle panels: The precision, recall, and f1-score metrics for SVC, RF, and NN methods, respectively. These results originate from single runs, i.e. by training each method with 70% of our sample and applying the classifier to the validation sample (30%).

Lower panels: The Receiver Operating Characteristic curves for the three methods, along with the values of the Area Under Curve for each class (in parenthesis).

In general, the algorithms perform equally well (~79%, ~86%, ~85%) with the best results obtained for RSG (~98%), YSG (~80%), BeBR (~82%), and BSG (~83%). The success rate for WR is low (~20-40%) while LBV are not recovered (due to their very small sample and photometric/spectroscopic variability).

The last panels visualize the performance of the methods. In this case the classifier works in a binary mode, i.e. it compares each class against all others. The resuls show a consistent behavior and a good performance, well above the random classifier (indicated with the dashed line)

Combining predictions

The methods used are based on different concepts: SVC identifies the best hyperplane that separates the classes, RF creates random decision trees with various nodes and thresholds, while NN finds the differences in the feature space that best separate the classes. Given the similar accuracy across the methods we combined their probability distributions:

P_final= P_SVC / 3 + P_RF / 3 + P_NN / 3

A repeated 5-fold CV test was performed to estimate the overall accuracy (shown in the Table below, along with the corresponding results per method).

Overall we achieved an accuracy of 0.86+/-0.02 %.

Performance of test galaxies

After training our three classifiers to the whole sample of M31 and M33 sources, we applied their combined model to a number of (lower) metallicity galaxies: IC 1613, WLM, Sextans A.

The confusion matrix for the three test galaxies, with obvious issues arising in the classification of BSG and YSG. The overall accuracy is ~75% (slightly improved with respect to the number quoted in the abstract), less than that achieved for M31 and M33 (~86%).

Why different perfomance?

a. missing data

Some sources do not have measurements in the selected bands. For this approach to work though, we require some values. Additionaly, the majority of the sources (in the full catalogs compiled from the whole Spitzer source catalogs) do not have measurements either.

To assess the importance of missing data we split the M31 and M33 source sample into training (70%) and validation (30%). Using the fraction of sources with missing values (~14%) in the test galaxies and the statistics of missing measurements per band (~1-13%), we modified the validation sample by replacing randomly values with the mean values (which have been determined by the statistics of the M31 and M33 sample). Running this experiment for 30 iterations shows a small difference (~1-2%) with respect to the original sample.

The difference between the accuracy of the original validation set and the modified one (randomly altered values based on the statistics of missing values in the test galaxies). The impact of missing data imputation is very small at the order of 1-2% (dashed line at ~1%).

b. metallicity

M31 and M33 have metallicities above, and solar-subsolar, respectively (Peña & Flores-Durán 2019), while the three test galaxies are of lower metallicity (Boyer et al. 2015).

Lower metallicity may affect both extinction and evolution that could lead to shifts in the intrinsic color distributions.

However, given the lack of photometric data and source numbers for lower metallicity galaxies, it is currently impossible to examine the effect of metallicity thoroughly.

Building the classifier

Sample collection and processing

1. Spectral types collected from literature:
training sample: M31 (1142 sources) and M33 (1388 sources)
test sample: WLM (36 sources), IC 1613 (20 sources), Sextans A (16 sources)

2. Identifying foreground sources using Gaia DR2 astrometric information (corresponding to ~8% of the original sample).

3. Collecting photometric data:

Spitzer mid-IR photometry (from catalogs that recently became public, which ensures the use of positions derived from a single instrument that helps with the crossmatching with other various works).
Pan-STARRS optical photometry.

After the above screening our training sample consists of 527 M31 and 563 M33 sources.

Class selection

From the 1089 sources we removed 12 with ambiguous classification. All these sources correspond to 35 spectral classifications (with a handfull of objects respresenting some of these). To improve the efficiency of the classifier we grouped sources into wider classes, excluding mainly sources with uncertain spectral types. Our classification scheme consists of 7 classes that correspond to :

BSG : main-sequence and evolved OBA stars, including those with emission (eg Be).
WR : all types of (classical) Wolf-Rayet stars
LBV : only secure Luminous Blue Variables
BeBR : secure and candidate B[e] supergiants
YSG : F and G stars, including those identified as Yellow Supergiants
RSG : K and M stars, including Red Supergiants and those with a B companion (Neugent et al. 2019)
GAL : AGN, QSO, and background galaxies - this class collects all outliers that would fit in the rest (as these will be the contaminant sources for our galaxies)

Feature selection

For supervised approaches we need to train in a well characterized sample. For this we need a complete set of photometry. The available bands in the optical and mid-IR are g, r, i, z, y and [3.6], [4.5], [5.8], [8.0], [24], respectively.

However, most sources have upper limits in [5.8], [8.0], and [24] which means that their values are not certain. Using those bands and rejecting sources with upper limits would decrease our training sample by half. We opted to disregard those bands. g band displays some issues for fainter objects and it is the one mostly affected by extinction. As we opted not to correct for extinction (due to lack of known extinction laws for most of our galaxies) we disregard it also.

Gaia photometry is also excluded based on the lack of enough sources measured in more distant galaxies.

To remove any distance dependence we used the color terms of the bands used, i.e. : r-i, i-z, z-y, y-[3.6], [3.6]-[4.8], which is the simplest and most intuitive representation of the data^1,2.

¹ the above consequtive color indices are the ones that are affected less from extinction, less correlated, and help distinguish more some classes.
²various experiments with different transformations of the data (e.g. normalized fluxes, standardizing data) did not result to any significant improvement.

Feature (color indices) vs. wavelength per class. At each panel the black dashed lines correspond to the individual sources while the colored solid lines corresponds to their average (the number of available sources is also displayed). The last panel contains only the averaged lines to highlight the differences between the classes. The vertical dashed lines correspond to the average wavelength per color index, as shown at the top of the figure.

ML algorithms implementaion and optimization

We used the scikit-learn package to implement our algorithms of choice: Support Vector Classifier, Random Forest, and Multi-layer Perceptron (fully connected Neural Network).

Our imbalanced sample is taken care by using the class_weight='balanced' option in all classifiers. We also took into account the multi-label nature of our problem by setting the appropriate parameters to their according values. We performed K-fold Cross Validation in search of the optimal hyper-parameters. For the NN we experimented with different combinations of layers and nodes:

Accuracy for the different Neural Network combinations of fully connected layers (number of hidden layers with the number of nodes) per solver. We achieved the best value for ’adam’, whcih works systematically better among the solvers. The best result is achieved for a network with two layers with 64 nodes each.

Lessons learned

1. What is the impact of sample volume and class mixing?

A major concern in ml applications in the representativeness of the samples used. We explored this by performing iterative runs for each algorithm by adjusting the size of the training sample used (from 1% to 100%). We show the recall (the recovery rate) per class, where we notice an overall improvement with increasing sample size. This means that the larger the initial sample, the more representative it is of the parent population and, therefore, a better representation of their features is seen by the classifier (leading to more accurate predictions).

BSG and RSG : Given the higher initial numbers available for these populations, they achieve a higher accuracy much faster.
BeBR : although a small sample, it is well characterized (e.g. Lamers et al. 1998, Kraus 2019). That helps our classifier to distinguish this class from the others.
LBV : they share similar observables with BSG, WR, and BeBR (e.g. Smith 2014). During outbursts their spectral appearance changes significantly from O/B to A/F type. Additionaly, their very small number does not provide enough representativeness for this class, leading to poor recovery.
WR : a rare class of objects, that originates from LBV and BSG (a class that includes O stars). Given that their color indices (features) can be similar, we expect a confusion and a low recovery. They benefit though from increasing samples.
YSG : lying in between the BSG and the RSG in the Hertzsprung–Russell diagram the contamination from both of these classes is expected.

2. What if the labels are not correct?

The uncertainty in the labels, i.e. the spectral types, can be either due to classification errors or due to the lack of strict boundaries (in the feature space) for our classes¹. However, in supervised ml applications they are considered absolute truth, which leads to innacurate predictions.

Dorn-Wallenstein et al. 2021 used SVM to classify similar classes of massive stars in our Galaxy. They achieved ~53% while we obtained 79%. Part of this difference is attributed to the use of a more homogenously classified sample in our case, in contrast to their sample which is derived from a variety of works with different instruments and strategies, since 1970s.

To address that we need algorithms that take into account label uncertainties during the training process (which is not trivial).

¹ We still lack robust knowledge with respect to the transition of these sources through the various phases.

3. How sensitive is the classifier to the features used, and which are more important per class?

We used two approaches to infer feature sensitivity:

Left panel : feature permutation, where the values of each feature are shuffled and the final accuracy difference is checked (the larger the difference the more important the feature is). This method is subject though to correlation between the features. r-i, i-z, and y-[3.6] are the features that display the larger influence.

Right panel: drop-a-feature and retrain. This approach is totally agnostic with respect to the data and the method (but expensive in terms of computation time). In our case, the sample size is small and it easy to retrain. r-i and i-z seem more important (consistent with the previous approach) in this approach.

Using the second approach we can investigate the impact of each feature per class¹.

BeBR : r−i and z−y seem to be the most important, given the overall redder colors because of the dusty environment around these objects.
WR : r−i is the most important feature as bright hot sources.
YSG : more sensitive to i−z, indicative on their position in between the BSG and RSG classes
BSG/RSG : do not show any significant dependence - most probably because the include a wider range of objects that possibly mask significant differences between the bands (e.g. Bonanos et al. 2009, 2010)
GAL : sensitive to both z−y and y−[3.6], partly due to the PAH component.

If we were to exclude any of these features, we would get poorer results for some of the classes.

¹for LBV the sample is so small and our prediction capability so poor that no conclusion can be deriven.

Why bother ?

ASSESSing the importance of episodic mass loss

Performance

Building the classifier

Lessons learned

What's next ?

Questions ?