Stellar Spectral Type Classification Machine Learning API with LightGBM, FastAPI, Uvicorn, and Docker
Astrophysical Introduction to Spectral Types
Stellar spectra are a gold mine for astronomers. From a spectrum, much valuable information about a star can be gathered, such as its surface temperature, surface gravity, and metallicity. Another relevant parameter is the luminosity, which is derived from how intrinsically bright a star is: luminosity is distance-independent, while apparent magnitude is a distance-dependent measurement.
Temperature determines the star’s color (the peak wavelength of its Planck function) and its surface brightness. The atmospheric pressure of a star depends on the surface gravity, which is related to its size. Size and surface brightness together yield the luminosity: luminosity scales with surface area and with temperature to the fourth power, and, for main-sequence stars, roughly with mass to the power of 3.5.
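In equation form, these two relations are the Stefan–Boltzmann law and the approximate main-sequence mass–luminosity relation:

L = 4\pi R^{2} \sigma T_{\mathrm{eff}}^{4}, \qquad L \propto M^{3.5}

where R is the stellar radius, T_eff the effective (surface) temperature, \sigma the Stefan–Boltzmann constant, and M the stellar mass.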

The Morgan–Keenan (MK) system classifies stars into letter classes (O, B, A, F, G, K, M) from the hottest (O type) to the coolest (M type). Each letter class is subdivided with numeric digits, 0 being the warmest and 9 the coolest (e.g., A8, A9, F0, and F1 form a sequence from hotter to cooler). The classification is extended with luminosity classes: class 0 or Ia+ for hypergiants, class I for supergiants, class II for bright giants, class III for regular giants, class IV for subgiants, class V for main-sequence stars, class sd (or VI) for subdwarfs, and class D (or VII) for white dwarfs. The Sun has the spectral class G2V: a main-sequence star with a surface temperature of around 5,800 K.
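To make the letter sequence concrete, here is a small Python sketch mapping each MK letter class to its approximate effective-temperature range. The rounded boundary values come from standard references, not from the dataset used below:

# Approximate effective-temperature ranges (kelvin) for MK letter classes,
# rounded values from standard references
MK_TEMPERATURE_RANGES_K = {
    "O": (30_000, 50_000),
    "B": (10_000, 30_000),
    "A": (7_500, 10_000),
    "F": (6_000, 7_500),
    "G": (5_200, 6_000),
    "K": (3_700, 5_200),
    "M": (2_400, 3_700),
}

def letter_class(temperature_k: float) -> str:
    """Return the MK letter class whose range contains the given temperature."""
    for letter, (low, high) in MK_TEMPERATURE_RANGES_K.items():
        if low <= temperature_k < high:
            return letter
    # Fall back to the extremes for values outside the tabulated ranges
    return "O" if temperature_k >= 50_000 else "M"

print(letter_class(5800))  # the Sun -> 'G'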
Dataset
The dataset I used is from https://www.kaggle.com/deepu1109/star-dataset, but since the G class has only one member, I decided to augment the data manually. The augmented file can be accessed in the repository under data/6 class csv.csv.
Data Preprocessing and Training
First, we need a model to serve. I use LightGBM, which is based on additive decision trees. This suits the nature of spectral types, which occupy distinct regimes in parameter space; in fact, a plain decision tree could solve the problem (see the sketch after the preprocessing code below), but I wanted to bring something new to my fellows.
LightGBM is an open-source package from Microsoft, distributed separately from scikit-learn. You can install it from PyPI with pip install lightgbm. Scikit-learn is used for the scaler and for evaluation.
Preprocessing
In this step, we need to encode the categorical features as numbers. After that, a MinMaxScaler is applied to transform the features.
import os
import sys
import numpy as np
import pandas as pd
import joblib
import sklearn
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score
from category_encoders import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler

SRC_FILE = './data/6 class csv.csv'
df = pd.read_csv(SRC_FILE)
print(df.head())
excluded_columns = ['star_type']
# Label classes: Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence, Super Giants, Hyper Giants
label_column = df['star_type']
numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
numeric_features = [col for col in numeric_columns if col not in excluded_columns]
categorical_features = [col for col in categorical_columns if col not in excluded_columns]
features = numeric_features + categorical_features
df = df[features]
df = df.fillna(0)
# Hold out 20% for validation, stratified so every class appears in both splits
X_train, X_valid, y_train, y_valid = train_test_split(df, label_column,
                                                      test_size=0.2, random_state=42,
                                                      stratify=label_column)

# Encode categorical features as integers (cols= names the columns to encode)
le = OrdinalEncoder(cols=categorical_features)
le.fit(X_train[categorical_features])
X_train[categorical_features] = le.transform(X_train[categorical_features])
X_valid[categorical_features] = le.transform(X_valid[categorical_features])

# Scale all features to [0, 1]; fit on the training set only
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_valid = scaler.transform(X_valid)
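As an aside to the earlier remark that spectral types are regime-based and a single decision tree could already handle them, here is a minimal sketch on the freshly prepared splits. This is purely an illustration, not part of the original pipeline:

from sklearn.tree import DecisionTreeClassifier

# Each split is a threshold on one feature, so a shallow tree carves the
# parameter space into the rectangular regimes the spectral classes occupy
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print(tree.score(X_valid, y_valid))  # mean accuracy on the validation set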
Hyperparameter Tuning and Training
To find the most suitable LightGBM hyperparameters, we run several training jobs and save all the models; later, we can decide which one to use. For the full list of parameters, see the LightGBM documentation. Joblib is used instead of pickle to store the label encoder, scaler, and models. The weighted F1-score across classes is computed on the validation set to gauge model performance.
num_leaves = [10, 20, 30, 40, 50]
max_depth = [2, 3, 4, 5, 6, 7, 8]
learning_rate = [0.05, 0.01, 0.005, 0.001]
result = []

# Persist the preprocessing artifacts so the serving app can reuse them
joblib.dump(le, './model/label_encoder.joblib')
joblib.dump(scaler, './model/minmax_scaler.joblib')
joblib.dump(features, './model/features.joblib')
joblib.dump(categorical_features, './model/categorical_features.joblib')

for num in num_leaves:
    for depth in max_depth:
        for lr in learning_rate:
            clf = LGBMClassifier(random_state=420,
                                 num_leaves=num,
                                 max_depth=depth,
                                 learning_rate=lr)
            clf.fit(X_train, y_train)
            valid_prediction = clf.predict(X_valid)
            # scikit-learn metrics expect (y_true, y_pred)
            accuracy = accuracy_score(y_valid, valid_prediction)
            f1 = f1_score(y_valid, valid_prediction, average='weighted')
            metadata = {"num_leaves": num,
                        "max_depth": depth,
                        "learning_rate": lr,
                        "accuracy": accuracy,
                        "f1_score": f1}
            print(metadata)
            print(classification_report(y_valid, valid_prediction))
            result.append(metadata)
            # The filename embeds the validation F1-score
            joblib.dump(clf, f'./model/lgb_model_{f1}.joblib')
print(result)
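Once the sweep finishes, picking the best configuration is a one-liner over the collected metadata. A small sketch, relying on the fact that the saved filename embeds the F1-score (matching the joblib.dump call in the loop above):

best = max(result, key=lambda r: r["f1_score"])
print(best)
best_model = joblib.load(f"./model/lgb_model_{best['f1_score']}.joblib")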
Deploying
For deployment, I use FastAPI served by Uvicorn instead of Flask. Pydantic is used for data parsing and raises validation errors when a request does not match the schema. To wrap everything up, I also write a Dockerfile.
Here is main.py:
import joblib
import uvicorn
import numpy as np
import pandas as pd
from pydantic import BaseModel
from fastapi import FastAPI
app = FastAPI(title='spectral class prediction', version='1.0',
description='spectral class prediction using machine learning')
le = joblib.load('../model/label_encoder.joblib')
clf = joblib.load('../model/lgb_model_1.0.joblib')
features = joblib.load('../model/features.joblib')
categorical_features = joblib.load('../model/categorical_features.joblib')
scaler = joblib.load('../model/minmax_scaler.joblib')
class schema(BaseModel):
    temperature: float
    luminosity: float
    radius: float
    absolute_magnitude: float
    star_color: str
    spectral_class: str
@app.get('/')
@app.get('/home')
def read_home():
    """
    Home endpoint which can be used to test the availability of the application
    """
    return {'message': 'system up'}
@app.post("/predict")
def predict(data: schema):
data_dict = data.dict()
data_df = pd.DataFrame.from_dict([data_dict])
data_df = data_df[features]
data_df[categorical_features] = le.transform(data_df[categorical_features])
data_df = scaler.transform(data_df)
print(data_df, flush = True)
prediction = clf.predict(data_df)
print(prediction, flush=True)
##Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence , SuperGiants, HyperGiants
if prediction == 0:
prediction_label = "Red Dwarf"
if prediction == 1:
prediction_label = "Brown Dwarf"
if prediction == 2:
prediction_label = "White Dwarf"
if prediction == 3:
prediction_label = "Main Sequence"
if prediction == 4:
prediction_label = "Super Giants"
if prediction == 5:
prediction_label = "Hyper Giants"
return {"prediction": prediction_label}
if __name__ == '__main__':
    uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)
The prediction endpoint is served on port 8000. You can run the app on its own with python main.py, but I advise building the Docker image with
docker build -t fastapi-startype-lgbm .
and, once the build finishes, running the container with
docker run -dp 8000:8000 fastapi-startype-lgbm
which binds the container’s port 8000 to port 8000 on the host.
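The Dockerfile itself lives in the repository; a minimal sketch of what such a Dockerfile could look like is below. The base image, file layout, and requirements.txt are illustrative assumptions, not the repository’s actual contents:

FROM python:3.8-slim

WORKDIR /app

# LightGBM's compiled library needs libgomp at runtime on slim images
RUN apt-get update && apt-get install -y --no-install-recommends libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# requirements.txt is assumed to list fastapi, uvicorn, lightgbm,
# scikit-learn, category_encoders, pandas, and joblib
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]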
A sample request script is provided below. You can also use the Postman application for this purpose.
import requests

data = {
    "temperature": 3068,
    "luminosity": 0.0024,
    "radius": 0.17,
    "absolute_magnitude": 16.12,
    "star_color": "Red",
    "spectral_class": "M"
}
response = requests.post("http://localhost:8000/predict", json=data)
print(response.text)
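If you prefer the command line over a script or Postman, the same request can be sent with curl (assuming the server is running locally on port 8000):

curl -X POST http://localhost:8000/predict \
     -H "Content-Type: application/json" \
     -d '{"temperature": 3068, "luminosity": 0.0024, "radius": 0.17, "absolute_magnitude": 16.12, "star_color": "Red", "spectral_class": "M"}'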
That’s it. Now try it with different types of stars!
Here is the GitHub repository.
References
This tutorial is heavily influenced by a tutorial from Vivek Kumar, who also has a YouTube video.
The astrophysics notes were taken from Sky & Telescope and Wikipedia.
About The Author
Salman is the Chief Data Officer of Allure AI, an emerging Indonesian AI startup that recommends personalized skincare products and routines. He graduated from the Department of Astronomy, Institut Teknologi Bandung, with a thesis on using deep learning to determine stellar parameters from spectra. Previously, he worked as a research assistant for the department and completed an AI residency at Konvergen AI. He was also involved in the SETI@home project, his college humanoid robotics team, the Princeton University Physics of Life Summer School 2020, and the Machine Learning Summer School Indonesia 2020.