Stellar Spectral Type Classification Machine Learning API with LightGBM, FastAPI, Uvicorn, and Docker

Salman Chen
5 min read · Jul 18, 2021


Photo by Nathan Anderson on Unsplash

Astrophysical Introduction to Spectral Types

Stellar spectra are a gold mine for astronomers. From a spectrum, much valuable information about a star can be extracted, such as its surface temperature, surface gravity, and metallicity. Another relevant parameter is the luminosity, which is derived from how bright a star is: luminosity is an intrinsic, distance-independent quantity, while apparent magnitude is a distance-dependent measurement.

Temperature determines a star’s color (the peak wavelength regime of its Planck function) and its surface brightness. The atmospheric pressure of a star depends on its surface gravity, which is related to its size. Size and surface brightness together yield the luminosity: luminosity is proportional to the surface area and to temperature to the fourth power, and for main-sequence stars it further relates to mass to the power of roughly 3.5.
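In equation form, these are the standard textbook relations (quoted here for reference, not derived from this dataset):

L = 4πR²σT⁴ (Stefan–Boltzmann law)
L ∝ M^3.5 (approximate main-sequence mass–luminosity relation; the exponent varies between roughly 3 and 4 depending on the mass range)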

MK classification. Image from Wikipedia

The Morgan–Keenan (MK) system classifies stars into the letters O, B, A, F, G, K, and M, from the hottest (O type) to the coolest (M type). Each letter class is subdivided with a numeric digit, with 0 the warmest and 9 the coolest (e.g., A8, A9, F0, and F1 form a sequence from hotter to cooler). The classification is extended with luminosity classes: class 0 or Ia+ for hypergiants, class I for supergiants, class II for bright giants, class III for regular giants, class IV for subgiants, class V for main-sequence stars, class sd (or VI) for subdwarfs, and class D (or VII) for white dwarfs. The Sun has the spectral class G2V: a main-sequence star with a surface temperature of around 5,800 K.
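For quick reference, the letter sequence can be written down as a small lookup table. The temperature ranges below are approximate textbook values (in kelvin) and should be read as rough boundaries, not exact cutoffs:

# Approximate effective-temperature ranges (K) for the MK letter classes
mk_classes = {
    'O': (30000, 50000),
    'B': (10000, 30000),
    'A': (7500, 10000),
    'F': (6000, 7500),
    'G': (5200, 6000),
    'K': (3700, 5200),
    'M': (2400, 3700),
}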

Dataset

The dataset I used is from https://www.kaggle.com/deepu1109/star-dataset, but since the G class has only one member, I decided to augment the data manually; the augmented version can be accessed in the repository under data/6 class csv.csv.
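To see why the augmentation was needed, a quick look at the class balance helps. This is a minimal sketch, assuming the target column is named star_type as in the training script below:

import pandas as pd

df = pd.read_csv('./data/6 class csv.csv')
# Count how many rows belong to each target class
print(df['star_type'].value_counts())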

Data Preprocessing and Training

First, we need a model to serve. I use LightGBM, which is based on additive decision trees. This suits the nature of spectral types, which occupy distinct regimes in a space of several parameters; a single decision tree would actually solve it in a simpler way, but I wanted to bring something new to my fellows.

LightGBM is an open-source gradient-boosting library from Microsoft, shipped as a separate package. You can install it from PyPI with pip install lightgbm. Scikit-learn is used for the scaler and the evaluation metrics.

Preprocessing

In this step, we need to encode the categorical features into numerical features. After that, MinMaxScaler is applied to transform the features.

import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score
from category_encoders import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler

SRC_FILE = './data/6 class csv.csv'
df = pd.read_csv(SRC_FILE)
print(df.head())

# The target: 0 = Red Dwarf, 1 = Brown Dwarf, 2 = White Dwarf,
# 3 = Main Sequence, 4 = Super Giants, 5 = Hyper Giants
excluded_columns = ['star_type']
label_column = df['star_type']

numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
numeric_features = [col for col in numeric_columns if col not in excluded_columns]
categorical_features = [col for col in categorical_columns if col not in excluded_columns]
features = numeric_features + categorical_features

df = df[features]
df = df.fillna(0)

X_train, X_valid, y_train, y_valid = train_test_split(
    df, label_column, test_size=0.2, random_state=42, stratify=label_column)

# Encode the categorical features as integers, fitting on the training set only
le = OrdinalEncoder(cols=categorical_features)
le.fit(X_train[categorical_features])
X_train[categorical_features] = le.transform(X_train[categorical_features])
X_valid[categorical_features] = le.transform(X_valid[categorical_features])

# Scale every feature to [0, 1]; the scaler is fit on the training set only
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)

Hyperparameter Tuning and Training

To find the most suitable LightGBM hyperparameters, we run several training jobs and save the resulting models; later, we can decide which model to use. For the detailed parameters, see the LightGBM documentation. Joblib is used instead of pickle to store the label encoder and the models. Model performance on the validation set is measured with the weighted F1-score across classes.

num_leaves = [10, 20, 30, 40, 50]
max_depth = [2, 3, 4, 5, 6, 7, 8]
learning_rate = [0.05, 0.01, 0.005, 0.001]
result = []

# Persist the preprocessing artifacts so the API can reuse them at serving time
joblib.dump(le, './model/label_encoder.joblib')
joblib.dump(scaler, './model/minmax_scaler.joblib')
joblib.dump(features, './model/features.joblib')
joblib.dump(categorical_features, './model/categorical_features.joblib')

for num in num_leaves:
    for depth in max_depth:
        for lr in learning_rate:
            clf = LGBMClassifier(random_state=420,
                                 num_leaves=num,
                                 max_depth=depth,
                                 learning_rate=lr)
            clf.fit(X_train, y_train)

            valid_prediction = clf.predict(X_valid)
            # scikit-learn metrics expect (y_true, y_pred)
            accuracy = accuracy_score(y_valid, valid_prediction)
            f1 = f1_score(y_valid, valid_prediction, average='weighted')

            metadata = {"num_leaves": num,
                        "max_depth": depth,
                        "learning_rate": lr,
                        "accuracy": accuracy,
                        "f1_score": f1}
            print(metadata)
            print(classification_report(y_valid, valid_prediction))
            result.append(metadata)

            # Save each candidate model, tagged with its validation F1-score
            joblib.dump(clf, f'./model/lgb_model_{f1}.joblib')

print(result)
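Once the grid finishes, result holds one metadata dict per configuration, so picking the winner is a one-liner. A sketch, using the key names from the metadata dict above:

# Select the configuration with the highest weighted F1-score
best = max(result, key=lambda m: m['f1_score'])
print(best)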

Deploying

For deployment, I use Uvicorn instead of Flask. The app itself runs on FastAPI. Pydantic is used for data parsing and validation: if a request does not match the base model schema, an error is returned automatically. To wrap everything up, I also write a Dockerfile.

Here is main.py:

import joblib
import uvicorn
import pandas as pd
from pydantic import BaseModel
from fastapi import FastAPI

app = FastAPI(title='spectral class prediction', version='1.0',
              description='spectral class prediction using machine learning')

# Load the artifacts saved during training
le = joblib.load('../model/label_encoder.joblib')
clf = joblib.load('../model/lgb_model_1.0.joblib')
features = joblib.load('../model/features.joblib')
categorical_features = joblib.load('../model/categorical_features.joblib')
scaler = joblib.load('../model/minmax_scaler.joblib')

class Schema(BaseModel):
    temperature: float
    luminosity: float
    radius: float
    absolute_magnitude: float
    star_color: str
    spectral_class: str

@app.get('/')
@app.get('/home')
def read_home():
    """
    Home endpoint which can be used to test the availability of the application
    """
    return {'message': 'system up'}

@app.post("/predict")
def predict(data: Schema):
    data_dict = data.dict()
    data_df = pd.DataFrame([data_dict])
    data_df = data_df[features]

    # Apply the same encoding and scaling used during training
    data_df[categorical_features] = le.transform(data_df[categorical_features])
    data_df = scaler.transform(data_df)
    print(data_df, flush=True)

    # predict() returns an array; take the single class index as an int
    prediction = int(clf.predict(data_df)[0])
    print(prediction, flush=True)

    labels = {0: "Red Dwarf", 1: "Brown Dwarf", 2: "White Dwarf",
              3: "Main Sequence", 4: "Super Giants", 5: "Hyper Giants"}
    return {"prediction": labels[prediction]}

if __name__ == '__main__':
    uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)

The prediction endpoint is served at port 8000. You can run the app directly with python main.py (or with uvicorn main:app --host 0.0.0.0 --port 8000), but I advise running it through Docker instead. Build the image with

docker build -t fastapi-startype-lgbm .

Once the build finishes, run the container with

docker run -dp 8000:8000 fastapi-startype-lgbm

which binds the container’s port 8000 to port 8000 on the local machine and runs it in detached mode.
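For reference, here is a minimal sketch of what such a Dockerfile could look like. The requirements.txt file, the app/ and model/ directory layout, and the base image are assumptions on my part, so adapt them to the actual repository:

# Sketch only -- paths and base image are assumptions
FROM python:3.8-slim

WORKDIR /srv

# Install dependencies first to take advantage of Docker layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the saved model artifacts and the application code
COPY model/ ./model/
COPY app/ ./app/

# main.py loads artifacts from ../model/, so run from inside app/
WORKDIR /srv/app
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]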

A sample request script is provided below. You can also use the Postman application for this purpose.

import requests

data = {
    "temperature": 3068,
    "luminosity": 0.0024,
    "radius": 0.17,
    "absolute_magnitude": 16.12,
    "star_color": "Red",
    "spectral_class": "M"
}

response = requests.post("http://localhost:8000/predict", json=data)
print(response.text)
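Equivalently, you can hit the endpoint from the command line with curl:

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"temperature": 3068, "luminosity": 0.0024, "radius": 0.17, "absolute_magnitude": 16.12, "star_color": "Red", "spectral_class": "M"}'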

That’s it. Now, try it with different types of stars!

Here is the GitHub repository.

References

This tutorial is heavily influenced by a tutorial from Vivek Kumar, which also has a YouTube video.

The astrophysics notes were taken from Sky & Telescope and Wikipedia.

About The Author

Salman is the Chief Data Officer of Allure AI, an emerging Indonesia-based AI startup that recommends personalized skincare products and routines. He graduated from the Department of Astronomy, Institut Teknologi Bandung, with a thesis on using deep learning to determine stellar parameters from spectra. Previously, he worked as a research assistant for the department and completed an AI residency at Konvergen AI. He was also involved in the SETI@home project, his college humanoid robotics team, the Princeton University Physics of Life Summer School 2020, and the Machine Learning Summer School Indonesia 2020.

