Data Science Projects

This section showcases end-to-end projects in Data Science (DS), Data Analysis (DA), and Computer Science, spanning domains including aviation safety, finance, and full-stack development. Projects are hosted on their respective platforms — GitHub, Hugging Face (HF), and Kaggle — where notebooks, datasets, and live demos are available in full.

Project: Aviation Human Factor Incidents and Preventive Measures

Narrative: NASA ASRS (Aviation Safety Reporting System) dataset, filtered to ~40,000 human-factor incidents (2005–2025). An end-to-end machine learning pipeline was developed to identify human-factor risks and generate data-driven preventive insights.

Dataset: 111K ASRS | 40K filtered (human factors) 

Models: MiniLM embeddings | XGBoost (Binary Relevance, 69 labels) 

Features: Text embeddings + 39 structured variables 

Deployment: Hugging Face Spaces | Gradio app | REST API inference 

Tools: Python | Pandas | scikit-learn | SentenceTransformers | XGBoost | Gradio
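A minimal sketch of the Binary Relevance setup named above: one independent binary classifier is trained per label, and the per-label predictions are stacked back into a multi-label vector. The `BinaryRelevance` wrapper and the `NearestCentroid` stand-in are hypothetical names used for illustration and written in plain Python so the example runs without dependencies. In the project the per-label model is XGBoost on MiniLM embeddings plus structured variables, and the repository code may differ.

```python
class NearestCentroid:
    """Tiny stand-in binary classifier (assumption: replaces XGBoost here)."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, target in zip(X, y) if target == label]
            # Column-wise mean of all rows belonging to this class.
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def squared_distance(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids, key=lambda c: squared_distance(x, self.centroids[c]))
            for x in X
        ]


class BinaryRelevance:
    """Trains one independent binary model per label column."""

    def __init__(self, make_model):
        self.make_model = make_model  # zero-argument factory for a fresh model
        self.models = []

    def fit(self, X, Y):
        n_labels = len(Y[0])
        self.models = []
        for j in range(n_labels):
            column = [row[j] for row in Y]  # 0/1 targets for label j only
            self.models.append(self.make_model().fit(X, column))
        return self

    def predict(self, X):
        per_label = [model.predict(X) for model in self.models]
        # Transpose from (labels x samples) to (samples x labels).
        return [list(row) for row in zip(*per_label)]


# Toy data: label 0 follows the first feature, label 1 the second.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
clf = BinaryRelevance(NearestCentroid).fit(X_train, Y_train)
print(clf.predict([[0.9, 0.1]]))
```

With 69 labels this yields 69 separate models. Labels are treated as independent, which keeps training simple but ignores label co-occurrence.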

Links:

GitHub

Hugging Face

Project: Multi-Task NLP for Aviation Incident Risk Estimation

Narrative: NASA ASRS dataset (~38,000 incident reports, 2012–2022) combining structured metadata and free-text narratives. A multi-task NLP model was developed to classify key incident dimensions and support data-driven risk analysis.

Dataset: 38K ASRS reports (2012–2022) 

Models: DistilBERT (shared encoder + 3 heads) 

Pipeline: Text consolidation | Tokenization | 512-token constraint (~21.5% truncation) 

Training: AdamW | Stratified split | Class imbalance handling 

Tools: Python | PyTorch | HuggingFace Transformers | scikit-learn
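A rough sketch of how a truncation figure like the ~21.5% above could be measured, assuming it refers to the share of reports whose tokenized text exceeds the 512-token limit. The function name `truncation_rate` is hypothetical. The token counts are assumed to come from the model's tokenizer without special tokens, and two positions are reserved for the [CLS] and [SEP] tokens that DistilBERT adds to each sequence.

```python
def truncation_rate(token_counts, max_length=512, special_tokens=2):
    """Fraction of documents that would lose text at the given length limit.

    token_counts: content-token count per document (no special tokens).
    """
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    budget = max_length - special_tokens  # room left for actual content
    truncated = sum(1 for count in token_counts if count > budget)
    return truncated / len(token_counts)


# Toy example: only the last two documents exceed the 510-token budget.
print(truncation_rate([100, 510, 511, 800]))
```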

Links:

GitHub

Hugging Face

Project: Finance Prediction

Narrative: Yahoo Finance dataset (~13,800 records, 2015–2025) covering aviation-sector companies. A machine learning model was developed to predict 30-day stock performance and support data-driven financial analysis.

Dataset: 13.8K rows (Yahoo Finance, 2015–2025)

Features: Rolling stats | Volatility | Momentum | Daily returns

Target: stock_growth (binary, ~54/46 balance)

Models: Logistic Regression | Random Forest

Results: Evaluated via ROC-AUC across 5 aviation-sector stocks

Tools: Python | Pandas | scikit-learn | yfinance
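The feature families and the binary target listed above can be sketched as follows. This is an illustration in plain Python rather than the project's Pandas code. The helper names are hypothetical, and the target definition (1 if the closing price `horizon` trading days ahead is higher than today's) is an assumption about how `stock_growth` is built.

```python
from statistics import stdev


def daily_returns(prices):
    """Simple day-over-day returns; one element shorter than prices."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


def rolling(values, window, func):
    """Apply func to each trailing window; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(func(values[i + 1 - window : i + 1]))
    return out


def momentum(prices, window):
    """Relative price change over the last `window` days."""
    return [
        None if i < window else prices[i] / prices[i - window] - 1.0
        for i in range(len(prices))
    ]


def growth_label(prices, horizon=30):
    """Assumed target: 1 if the price `horizon` days ahead is higher."""
    labels = []
    for i in range(len(prices)):
        if i + horizon < len(prices):
            labels.append(1 if prices[i + horizon] > prices[i] else 0)
        else:
            labels.append(None)  # future price unknown, row is dropped later
    return labels


prices = [10.0, 11.0, 12.1, 11.0, 12.0]
returns = daily_returns(prices)
volatility = rolling(returns, 2, stdev)           # rolling std of returns
rolling_mean = rolling(prices, 2, lambda w: sum(w) / len(w))
labels = growth_label(prices, horizon=2)
```

Every feature uses only trailing data while the label looks forward, so rows whose label is None have to be removed before training to avoid leaking future prices.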

Links:

GitHub

Kaggle

Project: Netflix Movies Rating Prediction

Narrative: Netflix movies dataset (~16,000 titles, 2010–2025) enriched with engagement-based features. A regression model was developed to predict ratings and uncover patterns influencing audience perception.

Dataset: 16K Netflix movies (2010–2025) 

Features: Director/Cast engagement (LOO) | Log transformations 

Results: vote_count_log correlation with rating ~0.59 (vs ~0.19 for raw vote_count)

Models: XGBoost Regressor

Tools: Python | Pandas | NumPy | scikit-learn
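A minimal sketch of the two feature ideas above, assuming "LOO" stands for leave-one-out target encoding: each title's director (or cast) feature is the mean rating of that person's other titles, so a row never sees its own target. The function names, the global-mean fallback for people with a single title, and the use of `log1p` for the log transformation are assumptions for illustration.

```python
import math
from collections import defaultdict


def leave_one_out_encode(keys, targets):
    """Mean target of all other rows sharing the same key."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, target in zip(keys, targets):
        sums[key] += target
        counts[key] += 1
    global_mean = sum(targets) / len(targets)

    encoded = []
    for key, target in zip(keys, targets):
        if counts[key] > 1:
            # Remove this row's own target before averaging.
            encoded.append((sums[key] - target) / (counts[key] - 1))
        else:
            # Only one title for this key: fall back to the global mean.
            encoded.append(global_mean)
    return encoded


def log_feature(values):
    """log(1 + x), which compresses heavy-tailed counts and handles zeros."""
    return [math.log1p(v) for v in values]


directors = ["a", "a", "b", "a"]
ratings = [6.0, 8.0, 5.0, 7.0]
print(leave_one_out_encode(directors, ratings))
```

Excluding the row's own rating is what keeps the encoded feature from leaking the target into training.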

Links:

GitHub

Hugging Face

Kaggle

Project: Safety in Aviation Industry — EDA (Exploratory Data Analysis)

Narrative: Aviation accident dataset (1908–2023) combining historical records for safety analysis. Exploratory data analysis was conducted to identify long-term trends and fatality patterns, and to support data-driven insights into how aviation safety has evolved.

Dataset: 100K+ aviation accident records (1908–2023)

Features: Temporal trends | Fatality rates | Aircraft & operation attributes

Results: Clear long-term safety improvement trends | Risk pattern identification | Power BI dashboard

Tools: Python | Pandas | NumPy | Matplotlib | Seaborn | Power BI

Links:

GitHub

Kaggle

Project: Movies & Shows Suggester Website (Harvard CS50 Final Project)

Narrative: Movie and TV datasets integrated into a web-based recommendation system. A Flask application was developed to suggest content using similarity logic and natural language queries.

Dataset: Movies + Shows databases (Netflix-focused)

Features: Metadata filtering | Similarity-based recommendations | NLP query interface

Results: Interactive recommendation system | Natural language search (OpenAI API)

Tools: Python | Flask | SQL | OpenAI API | HTML/CSS

Links:

GitHub

Website

Project: ASRS Dataset Creation

Narrative: NASA ASRS dataset (~111,000 aviation incident reports, 2005–2025) collected and processed from raw sources. A data pipeline was developed to clean, standardize, and structure the dataset for scalable analysis and machine learning applications.

Dataset: 111K ASRS reports (2005–2025)

Features: Data cleaning | Deduplication (ACN) | Schema standardization | Data validation

Results: Public dataset published (Hugging Face & Kaggle) | Foundation for NLP projects

Tools: Python | Pandas | NumPy | glob | os
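The collection and deduplication steps above can be sketched as follows, using the standard `csv` module in place of Pandas so the example stays dependency-free. The function names, the file pattern, and the `source_file` column are hypothetical. Keeping the first occurrence of each ACN and dropping rows without one are assumptions, and the published notebook may handle these cases differently.

```python
import csv
import glob
import os


def load_rows(pattern):
    """Read every CSV matching the glob pattern into a list of dict rows."""
    rows = []
    for path in sorted(glob.glob(pattern)):  # sorted for a stable order
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = os.path.basename(path)  # provenance
                rows.append(row)
    return rows


def deduplicate_by_acn(rows, key="ACN"):
    """Keep the first row per ACN; skip rows with a missing or blank ACN."""
    seen = set()
    unique = []
    for row in rows:
        acn = str(row.get(key) or "").strip()
        if not acn or acn in seen:
            continue
        seen.add(acn)
        unique.append(row)
    return unique


# Hypothetical usage: "raw/*.csv" is a placeholder path, not the real layout.
# reports = deduplicate_by_acn(load_rows("raw/*.csv"))
```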

Links:

Kaggle (Notebook)

Kaggle (Dataset)

© 2026 - Matheus Hagemann | Machinery Safety Expert · CMSE® · TÜV Nord | B.Eng Control & Automation | Specialization in Workplace Safety | MBA Financial Management