
Data Science Projects

This section showcases end-to-end projects in Data Science (DS), Data Analysis (DA), and Computer Science, spanning aviation safety, finance, and full-stack development. The projects are hosted on GitHub, Hugging Face (HF), and Kaggle, where the notebooks, datasets, and live demos are available in full.

Project: Aviation Human Factor Incidents and Preventive Measures

Narrative: NASA ASRS (Aviation Safety Reporting System) dataset, filtered to ~40,000 human-factor incidents (2005–2025). An end-to-end machine learning pipeline was developed to identify human-factor risks and generate data-driven preventive insights.

Dataset: 111K ASRS reports | 40K after filtering (human factors)

Models: MiniLM embeddings | XGBoost (Binary Relevance, 69 labels)

Features: Text embeddings + 39 structured variables

Deployment: Hugging Face Spaces | Gradio app | REST API inference

Links:

GitHub

Hugging Face
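The Models and Features lines of the human-factor project describe a Binary Relevance setup: the multi-label problem is split into one independent binary problem per label, and every per-label model sees the same feature rows. The sketch below illustrates only that decomposition, using the standard library. The label names, the toy vectors, and the `positive_rate` stand-in trainer are hypothetical; in the project itself each per-label model would presumably be an XGBoost classifier over the MiniLM embeddings plus the 39 structured variables.

```python
from typing import Callable, Dict, List, Sequence, Set

Vector = List[float]


def combine_features(embedding: Sequence[float], structured: Sequence[float]) -> Vector:
    """Concatenate a text embedding with structured variables into one feature row."""
    return list(embedding) + list(structured)


def binary_relevance_targets(
    label_sets: Sequence[Set[str]], labels: Sequence[str]
) -> Dict[str, List[int]]:
    """Turn one multi-label target into an independent 0/1 target per label."""
    return {label: [int(label in s) for s in label_sets] for label in labels}


def fit_binary_relevance(
    rows: Sequence[Vector],
    label_sets: Sequence[Set[str]],
    labels: Sequence[str],
    fit_one: Callable[[Sequence[Vector], List[int]], object],
) -> Dict[str, object]:
    """Train one binary model per label; fit_one can be any binary trainer."""
    targets = binary_relevance_targets(label_sets, labels)
    return {label: fit_one(rows, y) for label, y in targets.items()}


# Toy data (hypothetical): 2-dimensional "embeddings" plus 1 structured variable.
rows = [combine_features([0.1, 0.2], [1.0]), combine_features([0.3, 0.4], [0.0])]
label_sets = [{"fatigue"}, {"fatigue", "distraction"}]
labels = ["fatigue", "distraction"]


def positive_rate(_rows: Sequence[Vector], y: List[int]) -> float:
    """Stand-in trainer (assumption): the 'model' is just the share of positives."""
    return sum(y) / len(y)


models = fit_binary_relevance(rows, label_sets, labels, positive_rate)
print(models)  # {'fatigue': 1.0, 'distraction': 0.5}
```

With 69 labels, the same loop would yield 69 independent models, one per label.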

Project: Multi-Task NLP for Aviation Incident Risk Estimation

Narrative: NASA ASRS dataset (~38,000 incident reports, 2012–2022) combining structured metadata and free-text narratives. A multi-task NLP model was developed to classify key incident dimensions and support data-driven risk analysis.

Dataset: 38K ASRS reports (2012–2022)

Models: DistilBERT (shared encoder + 3 heads)

Pipeline: Text consolidation | Tokenization | 512-token constraint (~21.5% truncation)

Training: AdamW | Stratified split | Class imbalance handling

Tools: Python | PyTorch | Hugging Face Transformers | scikit-learn

Links:

GitHub

Hugging Face
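DistilBERT accepts at most 512 tokens per input, so longer narratives lose their tail. Assuming the ~21.5% figure in the Pipeline line is the share of reports whose token count exceeds that limit (the page does not define it), it could be measured roughly as below. The function name `truncation_rate` and the toy counts are hypothetical, and the counts are assumed to already include any special tokens.

```python
from typing import Sequence


def truncation_rate(token_counts: Sequence[int], max_length: int = 512) -> float:
    """Return the fraction of documents longer than max_length tokens."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    truncated = sum(1 for n in token_counts if n > max_length)
    return truncated / len(token_counts)


# Toy token counts (hypothetical): two of the four reports exceed 512 tokens.
counts = [120, 512, 513, 900]
print(truncation_rate(counts))  # 0.5
```

A report of exactly 512 tokens fits and is not counted as truncated, which is why the comparison is strict.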

Project: Finance Prediction

Narrative: Yahoo Finance dataset (~13,800 records, 2015–2025) covering aviation-sector companies. A machine learning model was developed to predict 30-day stock performance and support data-driven financial analysis.

Dataset: 13.8K rows (Yahoo Finance, 2015–2025)

Features: Rolling stats | Volatility | Momentum | Daily returns

Target: stock_growth (binary, ~54/46 balance)

Models: Logistic Regression | Random Forest

Results: Evaluated via ROC-AUC across 5 aviation-sector stocks

Tools: Python | Pandas | scikit-learn | yfinance

Links:

GitHub

Kaggle
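The binary `stock_growth` target and the daily-returns feature listed for the finance project can be illustrated with a small standard-library sketch. It assumes, since the page does not say, that a row is labeled 1 when the closing price 30 rows (trading days) later is higher than the current close, and that rows without a full look-ahead window get no label. The function names and the toy prices are hypothetical.

```python
from typing import List, Optional, Sequence


def daily_returns(closes: Sequence[float]) -> List[float]:
    """Simple day-over-day returns: close[t] / close[t-1] - 1."""
    return [closes[i] / closes[i - 1] - 1.0 for i in range(1, len(closes))]


def forward_growth_labels(closes: Sequence[float], horizon: int = 30) -> List[Optional[int]]:
    """Label each row 1 if the close `horizon` rows ahead is higher, else 0.

    The last `horizon` rows have no future price, so they are labeled None
    and would be dropped before training.
    """
    labels: List[Optional[int]] = []
    for i, price in enumerate(closes):
        j = i + horizon
        labels.append(None if j >= len(closes) else int(closes[j] > price))
    return labels


# Toy closing prices (hypothetical), with a 2-row horizon to keep the example short.
closes = [100.0, 102.0, 101.0, 99.0, 104.0]
print(forward_growth_labels(closes, horizon=2))  # [1, 0, 1, None, None]
```

Because each label looks ahead in time, any train/test split would also need to respect chronology so that future prices do not leak into training rows.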

Project: Netflix Movies Rating Prediction

Narrative: Netflix movies dataset (~16,000 titles, 2010–2025) enriched with engagement-based features. A regression model was developed to predict ratings and uncover patterns influencing audience perception.

Dataset: 16K Netflix movies (2010–2025)

Features: Director/Cast engagement (LOO) | Log transformations

Results: vote_count_log correlation ~0.59 (vs ~0.19 raw)

Models: XGBoost Regressor

Tools: Python | Pandas | NumPy | scikit-learn

Links:

GitHub

Hugging Face

Kaggle
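The Features line of the Netflix project mentions LOO engagement features and log transformations. Assuming LOO stands for leave-one-out encoding, each title would receive the average engagement of the same director's (or cast member's) other titles, so a title's own value never leaks into its own feature. The sketch below shows that idea together with a `log1p` transform; the engagement values, the director names, the fallback choice, and the use of `log1p` for `vote_count_log` are all assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def leave_one_out_mean(
    groups: Sequence[str], values: Sequence[float], fallback: float
) -> List[float]:
    """For each row, average the values of the other rows in the same group.

    Rows whose group has no other member receive `fallback`.
    """
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for g, v in zip(groups, values):
        totals[g] += v
        counts[g] += 1
    encoded = []
    for g, v in zip(groups, values):
        if counts[g] > 1:
            encoded.append((totals[g] - v) / (counts[g] - 1))
        else:
            encoded.append(fallback)
    return encoded


# Toy data (hypothetical): vote counts for four titles by two directors.
directors = ["a", "a", "b", "a"]
votes = [10.0, 20.0, 5.0, 30.0]
global_mean = sum(votes) / len(votes)  # fallback for directors with a single title

print(leave_one_out_mean(directors, votes, global_mean))  # [25.0, 20.0, 16.25, 15.0]

# Log transform of a skewed count; log1p keeps zero counts at zero.
vote_count_log = [math.log1p(v) for v in votes]
```

Compressing the long tail of a heavily skewed count in this way is one plausible reason a log-transformed feature would correlate with ratings more strongly than the raw count.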

Project: Safety in Aviation Industry — EDA (Exploratory Data Analysis)

Narrative: Aviation accident dataset (1908–2023) combining historical records for safety analysis. Exploratory data analysis was conducted to identify long-term trends and fatality patterns and to support data-driven insights into how aviation safety has evolved.

Links:

GitHub

Kaggle

Project: Movies & Shows Suggester Website (Harvard CS50 Final Project)

Narrative: Movie and TV datasets integrated into a web-based recommendation system. A Flask application was developed to suggest content using similarity logic and natural language queries.

Links:

GitHub

Website

Project: ASRS Dataset Creation

Narrative: NASA ASRS dataset (~111,000 aviation incident reports, 2005–2025) collected and processed from raw sources. A data pipeline was developed to clean, standardize, and structure the dataset for scalable analysis and machine learning applications.

Links:

Kaggle (Notebook)

Kaggle (Dataset)