
Course
Applied Machine Learning for scientific research
Scope
Recent advances in deep neural networks have led to increased interest among researchers in both cutting-edge "deep learning" models and classical "machine learning" techniques, which have historically been underused in scientific research. In this course, we will start with the basics of machine learning and, using scientific examples throughout, use code and visualizations to explore all the components that go into making practical use of machine learning (ML) techniques for research.
Learning goals
We will discuss how the various algorithms work at a high level rather than going into mathematical detail, and will focus on the practicalities of ML work:
- how to gather, clean, and pre-process datasets;
- how to choose between different ML tools;
- how to score and evaluate models; and
- how to take advantage of pre-trained models for deep learning tasks.
At the end of the course, participants will have built and trained several different types of model from scratch and be confident applying ML techniques to their own research problems.
Assumed knowledge
The course is aimed at researchers at any career stage who have little or no experience with machine learning and want to learn how to apply it to research problems in the natural sciences. The course is *not* suitable for novice programmers; we will assume a solid knowledge of Python programming and problem-solving. We will also make use of several packages from the scientific Python stack, so familiarity with pandas, numpy, seaborn, and matplotlib is a plus. We will make use of example datasets that should be easily understood by anyone with a background in science. As always, if anyone is unsure about the suitability of the course, please contact Martin Jones (martin@pythonforbiologists.com).
Programme
Session 1 - Intro, history, background and environment
In the first session, we introduce key background concepts, including the relationship between machine learning (ML) and AI, the distinction between classical ML and deep learning, and the core idea of learning from data. We explore ways to categorize ML approaches—such as supervised vs. unsupervised learning and regression vs. classification—and begin to examine the trade-off between simple and complex methods. We highlight how factors like parameter count, computational and data requirements, and interpretability often scale together. The session also covers practical aspects of applying ML in research, including dataset acquisition, the difference between training and inference, and the roles of pre-training and fine-tuning. We discuss essential practices like model evaluation, comparison, visualization, and feature engineering. Finally, we ensure everyone is set up with the necessary datasets, tools, and programming environment to smoothly run code and exercises throughout the course.
Session 2 - Core concepts of classification
In this session, we dive into a simple one-feature classification problem by first creating a manual classifier to introduce core concepts like features and classes. This example helps us explore two key ML questions: how to score models and how to visualize their behavior, including the use of a confusion matrix. We then examine why manual classifiers don’t scale and introduce the K-Nearest Neighbours (KNN) algorithm with an intuitive explanation and simple implementation using pandas. This leads to a discussion on parameter tuning and the importance of dividing data into training and test sets, along with an introduction to cross-validation. The session concludes with an exercise to build, optimize, visualize, and evaluate a KNN classifier on a new dataset, reinforcing the foundational elements common to all classification tasks.
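To give a concrete flavour of what this looks like in code, below is a minimal sketch of a one-feature nearest-neighbours classifier written with pandas, in the spirit of this session; the toy data and column names (`feature`, `label`) are invented for illustration and are not the course dataset.

```python
import pandas as pd

# Toy training data with a single feature and a class label
# (values invented purely for illustration).
train = pd.DataFrame({
    "feature": [1.0, 1.2, 3.5, 3.9, 4.1, 0.8],
    "label":   ["A", "A", "B", "B", "B", "A"],
})

def knn_predict(train, new_value, k=3):
    """Classify new_value by majority vote of its k nearest neighbours."""
    distances = (train["feature"] - new_value).abs()
    nearest = train.loc[distances.nsmallest(k).index, "label"]
    return nearest.mode().iloc[0]

print(knn_predict(train, 2.0))   # nearest points are mostly class "A"
print(knn_predict(train, 3.8))   # nearest points are mostly class "B"
```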
Session 3 - Sklearn and adding features
This session focuses on two main goals: understanding the sklearn package architecture and introducing feature engineering. After a brief recap of the Python ML ecosystem, we explore how sklearn handles data and models, streamlining previous workflows. Using built-in implementations lets us shift to higher-level topics, such as stratified splitting, handling unbalanced datasets, and avoiding sorted data pitfalls. We then add features to our classification problem, use visualization to observe their impact, and highlight the importance of scaling. This leads into key aspects of feature engineering and how parameters affect model behavior, particularly over- and underfitting. We end with an introduction to feature selection methods, like sequential and univariate approaches. The exercise involves applying these concepts to optimize a classifier using a multi-feature dataset.
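As a rough illustration of the sklearn workflow this session describes (stratified splitting, scaling, pipelines, and cross-validation), here is a minimal sketch that uses one of sklearn's built-in toy datasets in place of the course data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# A small multi-feature dataset standing in for the course dataset.
X, y = load_iris(return_X_y=True)

# A stratified split keeps class proportions the same in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Pipeline: scale the features, then fit a KNN classifier.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
print("cross-val scores:", cross_val_score(model, X_train, y_train, cv=5))
```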
Session 4 - Binary classification and new models
This session focuses on binary classification and introduces new algorithms. We begin by contrasting binary with multiclass classification, noting that while some aspects become simpler, scoring can be more complex. Using confusion matrices, we revisit concepts like recall, sensitivity, and specificity, and explore trade-offs between true positives and negatives. We also cover representing categorical and multidimensional data through encoding methods (e.g., ordinal, one-hot), highlighting their role in avoiding bias. The session then introduces support vector machines (SVM) and decision trees. With three distinct algorithms, we use visualization, benchmarking, and scoring to compare performance, interpretability, and computational cost—raising the key question of model selection. The exercise involves preprocessing a complex dataset, applying feature engineering, and comparing KNN, decision tree, and SVM based on metrics like recall and precision.
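The kind of side-by-side comparison described above might look roughly like the sketch below, which fits KNN, a decision tree, and an SVM to a built-in binary dataset (standing in for the course data) and reports precision, recall, and the confusion matrix for each.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, confusion_matrix

# Binary-labelled toy dataset standing in for the course dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name,
          "precision:", round(precision_score(y_test, pred), 3),
          "recall:", round(recall_score(y_test, pred), 3))
    print(confusion_matrix(y_test, pred))
```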
Session 5 - Regression
In this session, we shift from classification to the other major ML task: regression. Starting with a simple example, we compare regression and classification in terms of workflow, noting that while visualization and scoring differ, concepts like feature engineering and parameter tuning remain relevant. We show that many problems can be framed as either regression or classification, and that many algorithms apply to both with minor adjustments. Visualizing regression models with categorical features offers insights into model behavior and interpretability, especially when considering feature count and whether models produce linear or stepwise predictions. We then introduce a large, unstructured dataset to explore feature extraction, highlighting how quickly feature sets can grow and revisiting scalable feature selection methods. The exercise involves building features from unstructured data, identifying those with strong predictive value, and using ML to uncover patterns beyond simple visual analysis, including interpreting complex confusion matrices.
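To illustrate how small the switch from classification to regression can be in code, the sketch below fits two regressors to a built-in toy dataset (not the course data) and scores them with R² and mean absolute error; the models and settings are illustrative choices only.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Toy regression dataset standing in for the course data.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear model gives smooth predictions; a shallow tree gives stepwise ones.
for model in (LinearRegression(),
              DecisionTreeRegressor(max_depth=3, random_state=0)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(type(model).__name__,
          "R^2:", round(model.score(X_test, y_test), 3),
          "MAE:", round(mean_absolute_error(y_test, pred), 2))
```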
Session 6 - Unsupervised learning
This session covers clustering and dimensionality reduction—tools widely used in science but not always seen as machine learning. By building on earlier topics, we draw parallels between clustering and classification, including overfitting, scoring, and parameter tuning, while also highlighting key differences with unlabelled data. We explore clustering algorithms, using visualization and metrics like homogeneity and completeness, and discuss the challenge of choosing the number of clusters. We then introduce dimensionality reduction, focusing on its use in visualization and as a feature extraction alternative, while addressing common PCA misconceptions. The session ends with a look at tuning dimensionality reduction for model performance. Exercises include clustering an unlabelled dataset and applying PCA to reduce dimensionality in a high-dimensional dataset, balancing accuracy and efficiency.
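A minimal sketch of this unsupervised workflow, assuming sklearn and using a built-in labelled dataset so that the clustering metrics mentioned above can be demonstrated; with genuinely unlabelled data, those true labels would of course not be available.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import homogeneity_score, completeness_score

X, y = load_iris(return_X_y=True)

# Cluster without using the labels, then compare the clusters to the
# true labels to illustrate homogeneity and completeness scoring.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("homogeneity:", round(homogeneity_score(y, clusters), 3))
print("completeness:", round(completeness_score(y, clusters), 3))

# Reduce four features to two principal components, e.g. for plotting.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)
```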
Session 7 - Artificial neural networks
This session begins with a simple introduction to artificial neurons (perceptrons), followed by building single-layer neural networks using sklearn. We explore different ways of assigning weights and running inference to build intuition. With this foundation, we overview the backpropagation algorithm and key training concepts like epochs, batches, learning rate, convergence, and training loss. Even simple networks reveal how parameter-rich ANNs are, making parameter tuning essential. Using visualization, we examine how architecture affects behavior and enables a level of customization unique to neural networks. We then outline two main use cases: building and training networks from scratch (requiring large datasets and specialised tools) versus fine-tuning powerful pre-trained models with smaller datasets. The exercise involves training and evaluating a neural network on a complex scientific dataset, highlighting the trade-off between predictive power and interpretability.
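For orientation, the sketch below trains a small single-hidden-layer network with sklearn's MLPClassifier on a built-in digits dataset; the architecture and hyperparameters are illustrative choices, not the course's.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# Small image-like dataset (8x8 digit images) standing in for the course data.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# One hidden layer of 50 neurons; max_iter caps the number of training
# epochs for the default stochastic solver.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), learning_rate_init=0.001,
                  max_iter=300, random_state=0),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```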
Session 8 – Pre-trained models and fine tuning
This session covers key concepts behind large, deep neural networks, including specialized architectures like convolutional networks, recurrent neural networks, and transformers, which build on simpler networks for specific tasks. We introduce embeddings as a form of feature engineering representing complex inputs in high-dimensional space, and highlight the scale of data and hardware required. The focus is practical: since researchers won’t build these models from scratch, we emphasize how to access and use existing models via APIs, especially large language and multimodal models popular in natural language processing. We then explore deep learning models relevant to scientific work, reviewing code and dependencies while linking back to earlier simpler models. We also outline common scientific ML tasks: data preparation, feature engineering, model comparison, parameter tuning, and evaluation. The exercise involves downloading a pre-trained image model, fine-tuning it on a scientific dataset, and experimenting with synthetic data to assess performance.
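The fine-tuning pattern might look roughly like the following sketch, which assumes PyTorch and torchvision (0.13 or later) and a hypothetical folder of labelled images; the framework, paths, class count, and hyperparameters are all assumptions for illustration rather than the course's actual setup.

```python
import torch
from torch import nn
from torchvision import models, datasets, transforms

# Load a pre-trained ResNet-18 and replace its final layer so it predicts
# our own classes (two classes assumed here purely for illustration).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                    # freeze pre-trained layers
model.fc = nn.Linear(model.fc.in_features, 2)      # new, trainable output layer

# Hypothetical image folder laid out as data/train/<class_name>/*.jpg
preprocess = models.ResNet18_Weights.DEFAULT.transforms()
train_data = datasets.ImageFolder("data/train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)

optimiser = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):                             # a few epochs for illustration
    for images, labels in loader:
        optimiser.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimiser.step()
```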
Session 9 & 10
The final two sessions are dedicated to students completing full ML workflows, covering data gathering, merging, cleaning, feature extraction and selection, model selection, parameter tuning, and evaluation. Working with real-world datasets may require custom coding for tasks like web scraping, merging multiple sources, and cleaning human-curated data. Students are encouraged to use their own datasets, though suitable examples will be provided if needed. The course will conclude with case study presentations by students or the trainer, demonstrating the application of course concepts to real scientific problems.
Course activities
The activities of the course will be a mixture of demonstrations, discussions, practical exercises, and presentations of working code. Each session (i.e. morning or afternoon) will begin with a demonstration of code for approaching a particular part of the machine learning workflow, with discussion based on the material and questions from students. To make it easier to follow along in a hands-on way, the students and trainer will work from copies of the same notebook, meaning that code can be edited and re-run easily to explore its behaviour and answer questions. Each session will also include practical time with set exercises and challenges for the students to attempt. Each session will end with a wrap-up where we look at exercise solutions, discover the range of different approaches that students have used, and discuss the trade-offs involved in different solution designs.
General information
Registration
Early bird registration deadline: 1 October 2025
Regular registration deadline: 15 October 2025
Group size
Minimum: 10
Maximum: 15
Course duration
5 full days (3 days in the week of 3 November, and 2 days in the week of 10 November)
Credit points
1.7 EC
Language
English
Fee
| Category | Early bird | Regular |
| --- | --- | --- |
| WIMEK, PE&RC, WIAS, WASS, ESP, and VLAG PhDs with TSP | €330 | €380 |
| SENSE PhDs with TSP | €660 | €710 |
| Other PhDs | €700 | €750 |
| Staff of WUR graduate schools | €700 | €750 |
| Other academic participants | €740 | €790 |
| Non-academic participants | €990 | €1040 |
The fee does not include accommodation, breakfast, or dinner. There are several accommodation options in Wageningen: for information on B&Bs and hotels, please visit proefwageningen.nl/overnachten. Another option is Short Stay Wageningen, and Airbnb offers several rooms in the area. Note that besides the restaurants in Wageningen, there are also options to have dinner at Wageningen Campus.
Cancellation conditions
- Up to 4 (four) weeks prior to the start of the course (i.e., up to October 6, 2025 at 10:00), cancellation is free of charge.
- Up to 2 (two) weeks prior to the start of the course (i.e., up to October 20, 2025 at 10:00), a fee of €150 will be charged.
- In case of cancellation within less than two weeks prior to the start of the course, or if you do not show at all, a fee of €710 will be charged.
Note: If you would like to cancel your registration, ALWAYS inform us. Not paying the participation fee does NOT automatically cancel your registration (and do note that you will still be held to the cancellation conditions).
Also note that we may cancel the course if there are not enough participants. If this is the case, we will inform you one week after the early bird registration deadline. Please take this into account when arranging your trip to the course (i.e., check the reimbursement policies).