Data Scientist · Researcher · Liverpool, UK

Shweta Debjit Sarkar

Building machine learning models that are not just accurate, but interpretable and actionable, across environmental exposure data, clinical time-series, and large-scale behavioural cohorts.

10+ Tools Mastered

📍 Liverpool, United Kingdom

About

From data to decisions

I am a data scientist with practical experience building machine learning models on complex, multi-source datasets, including environmental exposure data, clinical time-series, and large-scale behavioural cohorts.

Across four independent research preprints, I have developed strong instincts for making models not just accurate but interpretable and actionable. My work spans spatial epidemiology, clinical risk prediction, and ICU patient monitoring.

I am drawn to doctoral research that applies these skills to sustainability and systems-level problems: building tools that help researchers and policymakers see complex interactions in real time, not just in reports.

Publications & Preprints

Research preprints

Machine Learning-Based Risk Factor Analysis for Cervical Cancer Prediction: A Comparative Study of Class Imbalance Handling Strategies with SHAP-Driven Clinical Risk Stratification

Sarkar, S.D. & Roy, S. (2026)

Preprint · Zenodo

DOI: 10.5281/zenodo.19368694 ↗

A Mathematical Index for Quantifying and Reducing Student Depression Risk Through Personalised Behavioural Guidance

Sarkar, S.D. & Roy, S. (2026)

Preprint · Zenodo

DOI: 10.5281/zenodo.19457072 ↗

Explainable Early Prediction of Acute Hypotensive Episodes in ICU Patients Using Routine Vital Signs: An eICU Study

Roy, S. & Sarkar, S.D. (2026)

Preprint · Zenodo

DOI: 10.5281/zenodo.19161469 ↗

Local Bounded Stochastic Imputation for Ordered Numerical Sequences

Roy, S. & Sarkar, S.D. (2026)

Preprint · Zenodo

DOI: 10.5281/zenodo.19429162 ↗

Technical Skills

Tools & methods

Programming & Databases

Python (Advanced) R SQL

ML & Statistical Methods

XGBoost Random Forest SVM LSTM Autoencoders SHAP SMOTE/ADASYN Survival Analysis MICE/KNN Imputation

Spatial & Data Analysis

GeoPandas Spatial Data Linkage Epidemiological Methods Exposure Analysis

ML Frameworks

TensorFlow PyTorch Scikit-learn Hugging Face NumPy Pandas SciPy

Visualisation & Dashboards

Matplotlib Seaborn Power BI Amplitude

Tools & Platforms

Jupyter Git / GitHub Google Colab Docker Excel (Advanced)

Selected Projects

Applied research

2026

Satellite Wildfire Classifier

Two paired wildfire detection models exposing a critical ML lesson: a satellite classifier achieved 99.5% accuracy, but Grad-CAM revealed it had learned land-use patterns rather than fire damage, a textbook shortcut-learning failure. A second model trained on real fire/smoke imagery achieved 100% accuracy, which interpretability analysis confirmed as genuine.

EfficientNetB0 Grad-CAM PyTorch Shortcut Learning
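
The Grad-CAM check described above can be illustrated with a minimal sketch of the heatmap step. It assumes the activations and gradients of one convolutional layer have already been extracted (in practice via framework hooks); the function name and toy numbers are hypothetical and this is not the project's actual code.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap for one conv layer.

    activations, gradients: nested lists shaped [channels][height][width],
    where gradients holds d(class score) / d(activation).
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = gradient averaged over all spatial positions.
    weights = [
        sum(sum(row) for row in grad) / (height * width)
        for grad in gradients
    ]
    # Weighted sum of channels, then ReLU to keep positive evidence only.
    cam = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    # Scale to [0, 1] so the map can be overlaid on the input image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy input: 2 channels on a 2x2 feature map (made-up numbers).
toy_activations = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
toy_gradients = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(toy_activations, toy_gradients)
```

If the bright regions of such a map sit on farmland or urban boundaries rather than burn scars, high accuracy is likely coming from a shortcut.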

2026

Coral Reef Health Classification

Binary classifier (healthy vs bleached coral) comparing VGG16 and EfficientNetB0 transfer learning on 923 images. Demonstrates the bias-variance tradeoff in small-dataset deep learning: VGG16 overfit severely with 119M trainable parameters, while fine-tuned EfficientNetB0 achieved 79.5% accuracy with only a 2.1% train/val gap. Grad-CAM confirms ecologically valid feature attention.

EfficientNetB0 VGG16 Grad-CAM Transfer Learning

2026

Skin Lesion Classifier - CNN from Scratch

7-class skin lesion classification on HAM10000, training two custom CNNs from scratch (CNN V1: 39.5% balanced accuracy; CNN V2: 42.1%) and benchmarking against ResNet50 transfer learning (83.8%). Demonstrates empirically why transfer learning is essential for medical imaging with limited data: dermatofibroma recall was 0% for both scratch models versus 96% for ResNet50.

PyTorch ResNet50 HAM10000 Medical Imaging
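
Balanced accuracy matters here because a model can score well on plain accuracy while missing a rare class entirely. The sketch below assumes the conventional definition (mean of per-class recall); the labels are hypothetical and only borrow two HAM10000 class codes for illustration.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[label] / support[label] for label in support]
    return sum(recalls) / len(recalls)


# Hypothetical labels: 8 common "nv" lesions and 2 rare "df" lesions.
y_true = ["nv"] * 8 + ["df"] * 2
y_pred = ["nv"] * 10  # a model that always predicts the majority class

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_accuracy = balanced_accuracy(y_true, y_pred)
```

On this toy data plain accuracy is 0.8 while balanced accuracy is 0.5, because recall on the rare class is zero.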

2026

Job Posting Bias Analyser

NLP pipeline detecting gender-coded, age-biased, and exclusionary language in 123,842 real LinkedIn job postings, grounded in the Gaucher et al. (2011) lexicon. Found 46% of postings lean masculine, with Tech and Venture Capital among the most biased industries: a measurable gap between stated diversity commitments and actual hiring language.

NLP Fairness in AI Python Text Analysis
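
A lexicon-based check of this kind can be sketched as simple stem matching. The stems below are a small illustrative subset in the style of the published lexicon, not the full list, and the labelling rule (whichever count is larger) is an assumption rather than the pipeline's actual scoring.

```python
import re

# Illustrative stems only; the real lexicon is much longer.
MASCULINE_STEMS = ("compet", "domin", "lead", "assert")
FEMININE_STEMS = ("support", "nurtur", "cooperat", "understand")


def gender_coding(text):
    """Count stem matches and label the text by the larger count."""
    words = re.findall(r"[a-z]+", text.lower())
    masculine = sum(word.startswith(MASCULINE_STEMS) for word in words)
    feminine = sum(word.startswith(FEMININE_STEMS) for word in words)
    if masculine > feminine:
        label = "masculine-leaning"
    elif feminine > masculine:
        label = "feminine-leaning"
    else:
        label = "neutral"
    return masculine, feminine, label


# Made-up posting text for demonstration.
posting = "We want a competitive, dominant leader who supports the team."
result = gender_coding(posting)
```

Matching on stems rather than whole words lets one entry cover variants such as "compete", "competitive", and "competition".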

2025

Wildlife Camera Trap Detector

Binary classifier (blank vs animal-present) for conservation camera trap images using fine-tuned ResNet18 on the Serengeti2 dataset. Data augmentation reduced the generalisation gap from 6.93% to 1.03%, achieving 85.11% test accuracy. Grad-CAM confirms the model attends to animal regions rather than background landscape, which is essential for trustworthy deployment in wildlife monitoring.

ResNet18 Grad-CAM PyTorch Conservation AI

2026

AI Agency in Student Learning

Pilot classroom observation study (N=73, Years 7–11) across two Liverpool secondary schools examining whether students who actively interrogate AI output show higher conceptual understanding than those who passively copy it. Key finding: 32% of students claimed ownership of an answer they could not explain — a gap invisible to current assessment methods, consistent across both SEN and non-SEN populations.

Education Research AI Ethics Mixed Methods SEN

2026

Clinical RAG - Parkinson's & Alzheimer's

Retrieval-Augmented Generation pipeline for question-answering over Parkinson's and Alzheimer's research literature. Built a semantic search and retrieval layer over a corpus of clinical abstracts, with retrieval evaluation metrics and query similarity analysis to assess answer grounding quality.

RAG NLP Clinical AI LLM
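
The retrieval layer and its evaluation can be sketched as cosine-similarity ranking plus recall@k. The three-dimensional vectors and document IDs below are hypothetical stand-ins for real abstract embeddings, and recall@k is assumed as one plausible retrieval metric rather than the one the project necessarily used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus, k=2):
    """Return the k document IDs most similar to the query."""
    ranked = sorted(
        corpus, key=lambda doc_id: cosine(query_vec, corpus[doc_id]), reverse=True
    )
    return ranked[:k]


def recall_at_k(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)


# Toy embeddings (made up): two Parkinson's abstracts, one Alzheimer's.
corpus = {
    "pd_tremor": [0.9, 0.1, 0.0],
    "ad_memory": [0.1, 0.9, 0.0],
    "pd_dopamine": [0.8, 0.0, 0.2],
}
query = [1.0, 0.0, 0.1]
hits = top_k(query, corpus, k=2)
score = recall_at_k(hits, relevant=["pd_tremor", "pd_dopamine"])
```

Scoring retrieval separately from generation makes it possible to tell whether a poorly grounded answer came from bad retrieval or from the language model.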

2026

T2DM Urinary Metabolomics Analysis

End-to-end metabolomics pipeline identifying urinary biomarkers of Type 2 Diabetes Mellitus from NMR spectroscopy data (MetaboLights MTBLS1, N=132). Combined ExWAS with Benjamini-Hochberg FDR correction and Random Forest classification (cross-validated AUC = 0.985) to surface 13 metabolites confirmed by both methods, including hippurate and branched-chain amino acids, consistent with published T2DM literature.

Metabolomics ExWAS Random Forest Biomarker Discovery
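
The Benjamini-Hochberg step can be sketched in a few lines of plain Python. This is the standard step-up procedure, shown on made-up p-values; the 0.05 level is an assumption, and a real pipeline would more likely call a statistics library.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one reject flag per hypothesis under BH FDR control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Made-up p-values for five metabolite tests.
p_values = [0.001, 0.035, 0.025, 0.2, 0.022]
flags = benjamini_hochberg(p_values)
```

The procedure is step-up: 0.022 is above its own rank threshold (0.02), but it is still rejected because a larger p-value further down the ranking meets its threshold.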

2026

Air Pollution & Breast Cancer Incidence in England

Spatial epidemiological analysis linking environmental exposure data to cancer outcomes across English regions using ML and geospatial data linkage.

GeoPandas Spatial Analysis Epidemiology Python

2025

Environmental & Atmospheric Prediction: Rainfall Classification

Binary classification of daily rainfall occurrence from multi-variable atmospheric observations: humidity, pressure, wind direction, and temperature.

Classification Atmospheric Data Scikit-learn

2026

Breast Cancer Wisconsin Prediction

Clinical ML classification of malignant vs benign tumours using Logistic Regression, XGBoost, and Random Forest, with SHAP explainability for diagnostic feature identification.

XGBoost SHAP AUC-ROC Clinical ML
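
AUC-ROC has a direct interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up scores; it is quadratic in the number of samples and meant for intuition, not for large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = malignant, 0 = benign) and model scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(y_true, scores)
```

Here eight of the nine positive/negative pairs are ordered correctly, so the AUC is 8/9.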

2025

MoovBuddy Cohort Analysis - MSc Thesis

Longitudinal cohort analysis on 100,000+ anonymised user records across 10+ countries. Identified 88% install-to-subscription drop-off and modelled country-level variation.

RFM Segmentation Funnel Modelling SQL Power BI
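
A drop-off figure like the one above comes from comparing user counts at successive funnel stages. The sketch below uses hypothetical stage names and counts, chosen only so that the final drop-off works out to 88%; they are not the study's data.

```python
def funnel_report(stage_counts):
    """stage_counts: ordered list of (stage_name, users_reaching_stage)."""
    top = stage_counts[0][1]
    report = []
    for (_, previous), (name, count) in zip(stage_counts, stage_counts[1:]):
        report.append(
            {
                "stage": name,
                # Share of the previous stage that continued.
                "step_conversion": count / previous,
                # Share of the first stage lost by this point.
                "overall_drop_off": 1 - count / top,
            }
        )
    return report


# Hypothetical funnel: 1000 installs, 400 signups, 120 subscriptions.
stages = [("install", 1000), ("signup", 400), ("subscription", 120)]
report = funnel_report(stages)
```

Running the same report per country is one simple way to expose the kind of country-level variation the analysis modelled.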

Experience

Where I've worked

Teaching Assistant — Maths, Science & Computing

Jan 2026 – Jun 2026

Dixons Fazakerley Academy · Liverpool, UK

Supported classroom delivery across Maths, Science, Computing, and Business for GCSE and A-Level students.
Delivered targeted interventions for students with autism, ADHD, and dyslexia, adapting materials to individual learning profiles.
Collaborated with senior teachers to develop personalised study and development plans.

Research Intern — User Behaviour Analysis (MSc Thesis)

May 2025 – Aug 2025

MoovBuddy Ltd · London, UK

Conducted longitudinal cohort analysis on anonymised behavioural data from 100,000+ users across 10+ countries using Python, Scikit-learn, SQL, and Power BI.
Identified 88% install-to-subscription drop-off and modelled country-level variation in user outcomes.
Developed a four-phase evidence-based recommendation framework shared with the marketing team.

Teaching Assistant — Maths, Science & Computing

Sep 2025 – Dec 2025

Childwall Abbey School · Liverpool, UK

Supported students with mild to severe neurological disorders in a specialist school environment.
Assisted with coursework delivery and exam preparation, adapting approaches to individual learning profiles.

Business Analyst

Dec 2021 – Mar 2023

Indothai Group · India

Managed and analysed operational data from 300+ daily transactions using SQL and Excel.
Built performance dashboards identifying trends in service quality and revenue metrics.

Education

Academic background

MSc Business Analytics & International Business

University of Dundee

September 2024 – September 2025 · Dundee, UK

Machine Learning · Statistical Methods for Business Intelligence · Big Data Analytics & Predictive Modelling · Database Management · Deep Learning

🏆 Postgraduate Scholarship — awarded for academic merit
🥈 Second Place — First Semester Group Projects

Bachelor of Engineering — Electrical Engineering

S.B. Jain Institute of Technology, Management & Research

August 2015 – August 2019 · Nagpur, India

Signal Processing · Applied Mathematics · Probability and Optimisation · Network Analysis · Control Systems

🎖️ General Secretary — Inter-Departmental Forum (2017–2019)

Let's collaborate

Open to doctoral research opportunities, collaborations, and data science roles. Always happy to connect.

Email LinkedIn GitHub
