Data Scientist · Researcher · Liverpool, UK

Shweta Debjit Sarkar

Building machine learning models that are not just accurate, but interpretable and actionable across environmental exposure data, clinical time-series, and large-scale behavioural cohorts.

7+ Years Experience
2 Degrees
4 Preprints
10+ Tools Mastered

📍 Liverpool, United Kingdom

From data to decisions

I am a data scientist with practical experience building machine learning models on complex, multi-source datasets, including environmental exposure data, clinical time-series, and large-scale behavioural cohorts.

Across four independent research preprints, I have developed strong instincts for making models not just accurate but interpretable and actionable. My work spans spatial epidemiology, clinical risk prediction, and ICU patient monitoring.

I am drawn to doctoral research that applies these skills to sustainability and systems-level problems: building tools that help researchers and policymakers see complex interactions in real time, not just in reports.

Research preprints

P1
Machine Learning-Based Risk Factor Analysis for Cervical Cancer Prediction: A Comparative Study of Class Imbalance Handling Strategies with SHAP-Driven Clinical Risk Stratification
Sarkar, S.D. & Roy, S. (2026)
Preprint · Zenodo
DOI: 10.5281/zenodo.19368694

P2
A Mathematical Index for Quantifying and Reducing Student Depression Risk Through Personalised Behavioural Guidance
Sarkar, S.D. & Roy, S. (2026)
Preprint · Zenodo
DOI: 10.5281/zenodo.19457072

P3
Explainable Early Prediction of Acute Hypotensive Episodes in ICU Patients Using Routine Vital Signs: An eICU Study
Roy, S. & Sarkar, S.D. (2026)
Preprint · Zenodo
DOI: 10.5281/zenodo.19161469

P4
Local Bounded Stochastic Imputation for Ordered Numerical Sequences
Roy, S. & Sarkar, S.D. (2026)
Preprint · Zenodo
DOI: 10.5281/zenodo.19429162

Tools & methods

Programming & Databases

Python (Advanced) · R · SQL

ML & Statistical Methods

XGBoost · Random Forest · SVM · LSTM · Autoencoders · SHAP · SMOTE/ADASYN · Survival Analysis · MICE/KNN Imputation

Spatial & Data Analysis

GeoPandas · Spatial Data Linkage · Epidemiological Methods · Exposure Analysis

ML Frameworks

TensorFlow · PyTorch · Scikit-learn · Hugging Face · NumPy · Pandas · SciPy

Visualisation & Dashboards

Matplotlib · Seaborn · Power BI · Amplitude

Tools & Platforms

Jupyter · Git / GitHub · Google Colab · Docker · Excel (Advanced)

Selected Projects

Applied research

2026
Satellite Wildfire Classifier

Two paired wildfire detection models that expose a critical ML lesson. A satellite classifier achieved 99.5% accuracy, but Grad-CAM revealed that it had learned land-use patterns rather than fire damage, a textbook shortcut-learning failure. A second model on real fire/smoke imagery reached 100% accuracy, which interpretability analysis confirmed was genuine.

EfficientNetB0 · Grad-CAM · PyTorch · Shortcut Learning

2026
Coral Reef Health Classification

Binary classifier (healthy vs bleached coral) comparing VGG16 and EfficientNetB0 transfer learning on 923 images. Demonstrates the bias-variance tradeoff in small-dataset deep learning: VGG16 overfit severely with 119M trainable parameters, while fine-tuned EfficientNetB0 achieved 79.5% accuracy with only a 2.1% train/val gap. Grad-CAM confirms ecologically valid feature attention.

EfficientNetB0 · VGG16 · Grad-CAM · Transfer Learning

2026
Skin Lesion Classifier - CNN from Scratch

7-class skin lesion classification on HAM10000, training two custom CNNs from scratch (CNN V1: 39.5% balanced accuracy; CNN V2: 42.1%) and benchmarking against ResNet50 transfer learning (83.8%). Demonstrates empirically why transfer learning is essential for medical imaging with limited data: dermatofibroma recall was 0% for both scratch models vs 96% for ResNet50.

PyTorch · ResNet50 · HAM10000 · Medical Imaging

2026
Job Posting Bias Analyser

NLP pipeline detecting gender-coded, age-biased, and exclusionary language in 123,842 real LinkedIn job postings, grounded in the Gaucher et al. (2011) lexicon. Found that 46% of postings lean masculine, with Tech and Venture Capital among the most biased industries, a measurable gap between stated diversity commitments and actual hiring language.

NLP · Fairness in AI · Python · Text Analysis

2025
Wildlife Camera Trap Detector

Binary classifier (blank vs animal-present) for conservation camera trap images using fine-tuned ResNet18 on the Serengeti2 dataset. Data augmentation reduced the generalisation gap from 6.93% to 1.03%, achieving 85.11% test accuracy. Grad-CAM confirms the model attends to animal regions rather than background landscape, which is essential for trustworthy deployment in wildlife monitoring.

ResNet18 · Grad-CAM · PyTorch · Conservation AI

2026
AI Agency in Student Learning

Pilot classroom observation study (N=73, Years 7–11) across two Liverpool secondary schools examining whether students who actively interrogate AI output show higher conceptual understanding than those who passively copy it. Key finding: 32% of students claimed ownership of an answer they could not explain — a gap invisible to current assessment methods, consistent across both SEN and non-SEN populations.

Education Research · AI Ethics · Mixed Methods · SEN

2026
Clinical RAG - Parkinson's & Alzheimer's

Retrieval-Augmented Generation pipeline for question-answering over Parkinson's and Alzheimer's research literature. Built a semantic search and retrieval layer over a corpus of clinical abstracts, with retrieval evaluation metrics and query similarity analysis to assess answer grounding quality.

RAG · NLP · Clinical AI · LLM

2026
T2DM Urinary Metabolomics Analysis

End-to-end metabolomics pipeline identifying urinary biomarkers of Type 2 Diabetes Mellitus from NMR spectroscopy data (MetaboLights MTBLS1, N=132). Combined ExWAS with Benjamini-Hochberg FDR correction and Random Forest classification (cross-validated AUC = 0.985) to surface 13 metabolites confirmed by both methods, including hippurate and branched-chain amino acids, consistent with published T2DM literature.

Metabolomics · ExWAS · Random Forest · Biomarker Discovery

2026
Air Pollution & Breast Cancer Incidence in England

Spatial epidemiological analysis linking environmental exposure data to cancer outcomes across English regions using ML and geospatial data linkage.

GeoPandas · Spatial Analysis · Epidemiology · Python

2025
Environmental & Atmospheric Prediction: Rainfall Classification

Binary classification of daily rainfall occurrence from multi-variable atmospheric observations: humidity, pressure, wind direction, and temperature.

Classification · Atmospheric Data · Scikit-learn

2026
Breast Cancer Wisconsin Prediction

Clinical ML classification of malignant vs benign tumours using Logistic Regression, XGBoost, and Random Forest, with SHAP explainability for diagnostic feature identification.

XGBoost · SHAP · AUC-ROC · Clinical ML

2025
MoovBuddy Cohort Analysis - MSc Thesis

Longitudinal cohort analysis on 100,000+ anonymised user records across 10+ countries. Identified an 88% install-to-subscription drop-off and modelled country-level variation.

RFM Segmentation · Funnel Modelling · SQL · Power BI

Where I've worked

Teaching Assistant — Maths, Science & Computing
Jan 2026 – Jun 2026
Dixons Fazakerley Academy · Liverpool, UK
  • Supported classroom delivery across Maths, Science, Computing, and Business for GCSE and A-Level students.
  • Delivered targeted interventions for students with autism, ADHD, and dyslexia, adapting materials to individual learning profiles.
  • Collaborated with senior teachers to develop personalised study and development plans.

Research Intern — User Behaviour Analysis (MSc Thesis)
May 2025 – Aug 2025
MoovBuddy Ltd · London, UK
  • Conducted longitudinal cohort analysis on anonymised behavioural data from 100,000+ users across 10+ countries using Python, Scikit-learn, SQL, and Power BI.
  • Identified an 88% install-to-subscription drop-off and modelled country-level variation in user outcomes.
  • Developed a four-phase evidence-based recommendation framework shared with the marketing team.

Teaching Assistant — Maths, Science & Computing
Sep 2025 – Dec 2025
Childwall Abbey School · Liverpool, UK
  • Supported students with mild to severe neurological disorders in a specialist school environment.
  • Assisted with coursework delivery and exam preparation, adapting approaches to individual learning profiles.

Business Analyst
Dec 2021 – Mar 2023
Indothai Group · India
  • Managed and analysed operational data from 300+ daily transactions using SQL and Excel.
  • Built performance dashboards identifying trends in service quality and revenue metrics.

Education

Academic background

MSc Business Analytics & International Business
University of Dundee
September 2024 – September 2025 · Dundee, UK
Machine Learning · Statistical Methods for Business Intelligence · Big Data Analytics & Predictive Modelling · Database Management · Deep Learning

🏆 Postgraduate Scholarship — awarded for academic merit
🥈 Second Place — First Semester Group Projects

Bachelor of Engineering — Electrical Engineering
S.B. Jain Institute of Technology, Management & Research
August 2015 – August 2019 · Nagpur, India
Signal Processing · Applied Mathematics · Probability and Optimisation · Network Analysis · Control Systems

🎖️ General Secretary — Inter-Departmental Forum (2017–2019)

Let's collaborate

Open to doctoral research opportunities, collaborations, and data science roles. Always happy to connect.