Leveraging open data from PaN facilities for machine learning

Europe/Paris
Synchrotron SOLEIL (CNRS - CEA Paris-Saclay), L'Orme des Merisiers, Départementale 128, 91190 Saint-Aubin
Description

Exploiting open data for machine learning training: can the Photon and Neutron community do it?

Introduction

During the last decade, most European Photon and Neutron (PaN) facilities have adopted open data policies, making data available for the benefit of the entire scientific community. At the same time, machine learning (ML) is seen as an essential tool to address the exponential growth of data volumes from PaN facilities.

Exploitation of experimental training datasets is a key component of machine learning. The combination of ML algorithms and open data can therefore be seen as an ideal marriage that would ultimately help the entire community to tackle ‘big data’ challenges with more automation.

However, finding the right data to train machine learning algorithms is a challenge, and one of the motivations for making data FAIR is exactly that: to provide scientists working on AI applications with quality training datasets.

But what does 'quality' mean to PaN science communities? What metadata fields are needed to find the data, to understand whether it is suitable for our research, and ultimately to be able to ingest it into our training models? How can we provide sufficiently rich metadata? What would be the enablers for more machine learning applications? How can we improve the collaboration between data producers (domain scientists) and data consumers (ML experts)?

Objectives

With this workshop, we aim to discuss these questions, among staff and users of the LEAPS and LENS facilities, across disciplines and across Europe.

We will present projects and teams that have successfully used open datasets from PaN facilities to train their specific ML application (data consumers), as well as domain scientists (data producers) who have published curated data specifically for ML applications.  

We will also look at cases where it hasn't worked so well, to identify what needs to be better curated on the FAIR data management side or to understand the challenges in finding ML experts to effectively utilise the available data. A significant part of the workshop will be dedicated to discussion.

Call for abstracts

During the workshop, we have slots available for 20-minute presentations. We are particularly interested in contributions that address the following points:

  • Have you used open data for machine learning? What has been your experience?
  • Do you feel your research could benefit from more and better curated open data? Tell us why and how.
  • Have you already made open datasets available to your community for machine learning training purposes? What has been your experience?
  • Have you managed to enrich the metadata of open datasets for better training? Share with us how you accomplished it.

Whatever your discipline, as long as the data originally came from experiment(s) at a photon or neutron source, please submit an abstract in the "Call for Abstracts" section.

Venue

The workshop will be held at the SOLEIL synchrotron, Saint-Aubin (France), as a satellite event of the LEAPS General Assembly. Please note there is a €50 registration fee for on-site participants.

Remote participation will be possible too and is free of charge.

  • Tuesday, 17 October
    • 12:00 14:00
      Lunch / arrival / registration 2h
    • 14:00 14:15
      Introduction and objectives day 1 15m Principal/0-0 - Salle Amphitheatre (Batiment Principal)

      Speaker: Paul Millar (DESY)
    • 14:15 14:35
      Machine learning in the PaN software landscape and the role of open data 20m Principal/0-0 - Salle Amphitheatre (Batiment Principal)

      Please note this talk will be recorded.

      Speaker: Robert McGreevy (STFC - Ada Lovelace Centre (ALC))
    • 14:35 15:20
      Next-generation materials-data curation and processing methods for machine-learning applications 45m Principal/0-0 - Salle Amphitheatre (Batiment Principal)

      Please note this talk will be recorded.

      Speaker: Jacqui Cole (University of Cambridge)
    • 15:20 17:30
      Community talks Principal/0-0 - Salle Amphitheatre (Batiment Principal)

      Chairs: Majid Ounsy and Zdenek Matej

      • 15:20
        Machine-learning-enhanced analysis of surface scattering data – Fundamentals and applications 20m

        Surface and interface scattering is an indispensable tool in modern thin-film research and materials science. X-ray and neutron reflectivity measurements offer a diverse array of experimental insights, presenting an opportunity to combine this technique with machine learning (ML) / neural-network-based methods. Within the framework of DAPHNE4NFDI (Data for Photon and Neutron Science), collaborative efforts between the groups at the University of Kiel (Dr. Bridget Murphy) and Tübingen (Prof. Frank Schreiber) are focused on enhancing machine learning models for the analysis of X-ray and neutron reflectivity datasets with the Python package "mlreflect". [1]

        After introducing general concepts of ML, we will discuss the opportunities and challenges related to ML for the analysis of surface scattering data. In particular, we will address the specific difficulties of real-world surface-scattering experiments, including background scattering, small but finite miscalibration of the incident angle, as well as the specifics of surface scattering using neutrons, where peculiar scattering length density (SLD) profiles can lead to the absence of a critical angle. [1,2]

        We will discuss an open experimental dataset of raw X-ray reflectivity measurements together with corresponding fit parameters, published explicitly for use as training or test data for machine learning models [3,4] (a minimal illustrative training sketch follows this entry).

        Finally, we will briefly comment on the opportunities of closed-loop experiments [5] using ML for quasi-real-time data analysis and direct feedback to the experiment.
        We acknowledge financial support by the BMBF and the DFG (ErUM-data, ErUM-pro, and DAPHNE4NFDI).

        [1] Neural network analysis of neutron and X-ray reflectivity data: automated analysis using mlreflect, experimental errors and feature engineering, A. Greco et al., Journal of Applied Crystallography 55, 362 (2022)
        [2] Neural network analysis of neutron and X-ray reflectivity data: pathological cases, performance and perspectives, A. Greco, V. Starostin, A. Hinderhofer, A. Gerlach, M. W. A. Skoda, S. Kowarik, and F. Schreiber, Mach. Learn.: Sci. Technol. 2, 045003 (2021)
        [3] Reflectometry curves (XRR and NR) and corresponding fits for machine learning, L. Pithan, A. Greco, A. Hinderhofer, A. Gerlach, S. Kowarik, N. Rußegger, I. Dax, and F. Schreiber, Zenodo (2022). https://doi.org/10.5281/zenodo.6497438
        [4] Machine learning for scattering data: strategies, perspectives, and applications to surface scattering, A. Hinderhofer, A. Greco, V. Starostin, V. Munteanu, L. Pithan, A. Gerlach, and F. Schreiber, J. Appl. Cryst. 56, 3 (2023)
        [5] Closing the loop: autonomous experiments enabled by machine-learning-based online data analysis in synchrotron beamline environments, L. Pithan et al., J. Synchrotron Rad., in press (2023)

        Please note that this talk will be recorded.

        Speaker: Alexander Hinderhofer
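
        The following minimal sketch illustrates, for orientation, the simulation-based training idea described above: single-layer reflectivity curves are generated with the Parratt formalism and a small neural network is trained to map them back to the film parameters. It is not the mlreflect API; the one-layer sample model, parameter ranges and network size are assumptions made purely for illustration.

        ```python
        import numpy as np
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(0)
        q = np.linspace(0.01, 0.25, 128)  # momentum-transfer grid in 1/Angstrom

        def reflectivity(thickness, roughness, rho_film, rho_sub=2.07e-5):
            """Specular reflectivity of one film on a substrate
            (Parratt recursion with Nevot-Croce roughness factors)."""
            kz0 = q / 2.0 + 0j                                  # ambient (vacuum)
            kz1 = np.sqrt(kz0**2 - 4 * np.pi * rho_film + 0j)   # film
            kz2 = np.sqrt(kz0**2 - 4 * np.pi * rho_sub + 0j)    # substrate
            r01 = (kz0 - kz1) / (kz0 + kz1) * np.exp(-2 * kz0 * kz1 * roughness**2)
            r12 = (kz1 - kz2) / (kz1 + kz2) * np.exp(-2 * kz1 * kz2 * roughness**2)
            phase = np.exp(2j * kz1 * thickness)
            r = (r01 + r12 * phase) / (1 + r01 * r12 * phase)
            return np.abs(r) ** 2

        # Simulated training set: thickness (Angstrom), roughness (Angstrom), film SLD (1/Angstrom^2)
        n = 5000
        labels = np.column_stack([rng.uniform(20, 300, n),
                                  rng.uniform(1, 10, n),
                                  rng.uniform(1e-6, 1e-5, n)])
        curves = np.log10([reflectivity(*p) for p in labels])

        scale = labels.max(axis=0)                              # crude target normalisation
        model = MLPRegressor(hidden_layer_sizes=(256, 128), max_iter=200)
        model.fit(curves, labels / scale)

        # A measured, footprint- and background-corrected curve on the same q grid could
        # then be passed to model.predict() to obtain a starting guess for a conventional
        # least-squares refinement.
        ```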
      • 15:40
        Enhancing open data for Neural Network based reflectometry analysis – A future perspective for valuable test data sets 20m

        Researchers from Kiel and Tübingen University, as part of the DAPHNE4NFDI initiative, are collaborating to enhance machine learning models for analyzing X-ray and neutron reflectivity datasets using the Python package "mlreflect" [1], developed at Tübingen University in the group of Frank Schreiber. The collaborative effort has already achieved success during beamtime at ID10 at the European Synchrotron Radiation Facility (ESRF), employing a closed-loop system for autonomous experiments guided by machine learning data analysis [2]. The University of Kiel's expertise in X-ray reflectivity analysis of liquid samples [3] complements the mlreflect package through expanded and improved training data. Based on the mlreflect implementation on VISA, the successful machine learning data analysis and prediction at ID10 could be continued on lipid bilayer systems. Machine-learning-driven algorithms planned for the P08 beamline at DESY underscore the growing significance of machine learning in reflectivity measurements.
        One of the biggest challenges for machine learning remains the lack of sufficient experimental data, which forces reliance on simulated data for training. Moreover, test data to validate the simulation-based reflectivity predictions are often missing. To ensure the reproducible use of open data in reflectivity measurements, metadata must also be formulated accurately.
        DAPHNE4NFDI is developing a minimalist reflectivity metadata schema based on measurements at the DESY beamline P08. The Open Reflectometry Standards Organization (ORSO) is also actively engaged in developing a file format for reduced reflectivity data. Ideally, a reflectivity metadata schema will encompass all metadata needed for subsequent basic analysis (an illustrative sketch of reduced data with a minimal metadata header follows this entry). Open data with a comprehensive metadata schema promise benefits for the training and validation of machine learning models.
        The provision of open reflectivity data samples, as demonstrated by Linus Pithan on Zenodo, simplifies the evaluation of mlreflect prediction algorithms [4]. In this process, SciCat emerges as a promising platform for aggregating reduced data for machine learning, offering cross-referencing with other repositories to enhance the accessibility of current and future open reflectivity data sets.
        We acknowledge financial support by the BMBF through ErUM-pro, and DAPHNE4NFDI through the NFDI.
        [1] A. Greco et al., Journal of Applied Crystallography 55, 362 (2022)
        [2] L. Pithan et al., J. Synchrotron Rad., in press (2023)
        [3] B. M. Murphy, M. Greve, B. Runge, C. T. Koops, A. Elsen, J. Stettner, O. H. Seeck, and O. M. Magnussen, J. Synchrotron Rad. 21, 45 (2014)
        [4] L. Pithan, A. Greco, A. Hinderhofer, A. Gerlach, S. Kowarik, N. Rußegger, I. Dax, and F. Schreiber, Reflectometry curves (XRR and NR) and corresponding fits for machine learning, Zenodo (2022). https://doi.org/10.5281/zenodo.6497438

        Speaker: Lukas Petersdorf (CAU Kiel - DAPHNE4NFDI)
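
        To make the idea of reduced reflectivity data accompanied by machine-actionable metadata more concrete, here is a deliberately simple sketch (referenced in the abstract above). It is neither the ORSO file format nor the DAPHNE4NFDI schema; every field name and value is an assumption for illustration only.

        ```python
        import numpy as np

        q = np.linspace(0.01, 0.25, 200)   # momentum transfer in 1/Angstrom
        R = np.exp(-40 * q)                # placeholder curve, not real data
        dR = 0.05 * R                      # placeholder uncertainties

        # Minimal, hypothetical metadata header; np.savetxt prefixes each line with "# "
        header = "\n".join([
            "sample: supported lipid bilayer on Si    (hypothetical entry)",
            "probe: X-ray, 25 keV",
            "instrument: example beamline",
            "reduction: footprint- and background-corrected, normalised to the direct beam",
            "columns: q (1/Angstrom), R, dR",
        ])
        np.savetxt("reflectivity_reduced.dat", np.column_stack([q, R, dR]), header=header)
        ```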
      • 16:00
        Visual Diagnostics for Macromolecular X-Ray Diffraction: AUSPEX 20m

        Structures of biological macromolecules are the key to understanding the processes of life and form the basis for developing new drugs, e.g. against COVID-19. Traditionally, the initial quality of an X-ray data set was evaluated by looking at detector images as they were recorded. An expert user used to be able to recognize problems, and after collection the data would be integrated, scaled and merged with software that required considerable manual intervention and expertise. Data quality indicators were mostly designed so that they could be calculated rapidly with the limited computing power available, and were developed to provide information about overall data consistency, completeness and resolution, often in the form of mean derivatives and R-values.
        Today, data collection is many orders of magnitude faster, in particular due to the brightness of the X-rays obtained from modern sources. The high X-ray flux, coupled with fast-readout pixel detectors, means that manual inspection of the raw data is no longer practical. Unfortunately, there is a severe mismatch between the robustness of our current diagnostics and our reliance on automatic processing, as many of the quality indicators in use by the automatic algorithms are not reliable enough for correct decision-making. The lack of visual inspection of detector images by expert users has created a gap in the quality control of experiments.
        New algorithms, which play to the strengths of modern computing power and robust statistical analyses, need to be developed and implemented. In addition, much may be gained from taking the whole statistical distribution of the data into account, or even visualising the entirety of the data set instead of mean values. To address this need, we have started a software package for exploratory analyses of crystallographic data. AUSPEX [1] provides a visual and intuitive way of revealing problems in diffraction data that may require a specific processing approach. AUSPEX is available as part of CCP4 and as a web service at auspex.de. The software was developed using open data; however, the lack of unprocessed raw images from beamlines is often a roadblock for new method development. Since 2021, we have been utilizing convolutional neural networks [2] where statistical indicators fail us, and we have made first steps towards "explainable AI" in our developments...

        [1] Thorn et al. (2017). Acta Cryst. D73, 729.
        [2] Nolte et al. (2022). Acta Cryst. D78, 187.

        Speaker: Andrea Thorn (Universität Hamburg)
      • 16:20
        Break 20m
      • 16:40
        Microcrystal segmentation for SSX (project) 20m

        Serial Synchrotron Crystallography (SSX) experiments conducted at microfocus beamlines involve the collection of diffraction data from multiple microcrystals contained within one or more experimental supports until a complete dataset is obtained (Diederichs & Wang, 2017). An experimental sample for SSX typically consists of a set of 10 to 10,000 crystals with sizes around 5 × 5 × 5 µm³. A widely adopted strategy in the field to locate the positions of these crystals is to scan an area of interest with the X-ray beam and measure the amount of diffraction at each point within that region, thereby discovering the crystal positions within it (Coquelle et al., 2015). The drawback of this approach is that it results in what is referred to in the field as radiation damage, which is particularly severe for microcrystals (De la Mora et al., 2020).

        Microcrystal segmentation for SSX, prior to the final data collection, is presented as a non-invasive and rapid alternative to crystal localization via X-ray beam scanning. Currently, there are some deep learning-based models for crystal segmentation, but these models are often focused on very specific scenarios (Kardoost et al. 2023, Tran et al. 2020) or on the segmentation of a small number of crystals with dimensions much larger than the typical scenario of an SSX experiment (Bischoff et al. 2022). The main challenge in crystal segmentation for SSX is the annotation, which, given the wide range of scenarios in these experiments, is particularly complex. This challenge is exacerbated by the lack of open, organized, and annotated image data on crystals, except for the MARCO dataset (https://marco.ccr.buffalo.edu), which is primarily focused on image classification. Bischoff et al. have addressed the annotation problem for suspended microcrystals synthetically by generating a simulated training dataset with fully defined bounding boxes.

        Our goal is to extend the work of Bischoff et al. in order to generate crystal images that allow for the creation of more generalized models for SSX, including various lighting patterns and different types of crystals (see the illustrative sketch after this entry). These generalized models can then be fine-tuned with a small number of real images for more specific tasks. Additionally, we anticipate that this model will assist in annotating real datasets, which can later be refined with human intervention, thereby contributing to the creation of open, organized, and annotated crystal datasets.

        Please note that this talk will be recorded.

        Speaker: Dr Nicolas Soler (ALBA-CELLS)
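
        The sketch referenced in the abstract above illustrates the general idea of synthetic training data with fully defined annotations: random "crystals" are placed on a noisy background and their bounding boxes are recorded alongside the image. The shapes, sizes and noise model are assumptions for illustration and not the generator used in the project.

        ```python
        import numpy as np

        rng = np.random.default_rng(42)

        def synthetic_sample(size=512, n_crystals=40):
            """Return a noisy greyscale image and one (x0, y0, x1, y1) box per crystal."""
            image = rng.normal(0.1, 0.02, (size, size))      # background plus camera noise
            boxes = []
            for _ in range(n_crystals):
                w, h = rng.integers(4, 12, size=2)           # crystal extent in pixels
                x0 = int(rng.integers(0, size - w))
                y0 = int(rng.integers(0, size - h))
                image[y0:y0 + h, x0:x0 + w] += rng.uniform(0.3, 0.8)   # crude bright crystal
                boxes.append((x0, y0, x0 + int(w), y0 + int(h)))
            return np.clip(image, 0.0, 1.0), np.asarray(boxes)

        image, boxes = synthetic_sample()
        # Image/box pairs like these can be fed to any off-the-shelf object-detection or
        # instance-segmentation network and later fine-tuned on a small number of
        # annotated real images, as described above.
        ```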
      • 17:00
        Suggestions for human accessibility of photon science datasets 20m

        As the field of machine learning continues to advance in photon science and related fields, the availability of high-quality datasets plays a pivotal role in the development, use, and evaluation of models. However, the landscape of datasets remains quite inaccessible and diverse, with many being unstructured, poorly organized, or lacking essential documentation. In this presentation, we share our experiences and insights gained from working with datasets for machine learning applications in tomography, holography, and generally, phase retrieval.

        We begin by emphasizing the importance of data sharing, collaboration, and transparency of datasets in photon science research and other interdisciplinary fields. An essential aspect of scientific research is reproducibility, as it allows us to independently verify the methods developed by other scientists. Testing and refining existing algorithms will produce more robust and accurate machine learning models. Furthermore, it maximizes the potential of machine learning to accelerate scientific discoveries and drive technological innovation, but this is only possible when data are readily available to the photon science community.

        We then address the common challenges faced when dealing with unstructured and unorganized datasets. These challenges encompass issues related to data cleanliness, consistency, and comprehensibility. Drawing upon real-world experiences, we illustrate the impact of such challenges on machine learning experiments, including reduced model performance, wasted research time, and potential biases.

        To mitigate these challenges, beyond sharing data in a FAIR fashion, we propose a set of comprehensive guidelines for dataset publication, i.e. a mini data sheet template (an example sketch follows this entry). These guidelines emphasize the importance of structuring and documenting datasets effectively to ensure their usability by the wider research community. We discuss best practices for data cleaning, organization, and annotation, highlighting the critical role that clear metadata and data sheets play in enhancing dataset quality.

        Speaker: Peter Steinbach (Helmholtz-Zentrum Dresden-Rossendorf)
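
        As an example of the mini data sheet mentioned in the abstract, the record below shows one possible shape such a template could take. All field names and values are illustrative assumptions, not the template proposed in the talk.

        ```python
        # Hypothetical mini data sheet for a published training dataset; every field
        # name and value here is an illustrative assumption, not the proposed template.
        datasheet = {
            "title": "Example near-field holography scans for phase retrieval",
            "technique": "X-ray holography / tomography",
            "facility": "example photon source",                 # hypothetical
            "doi": "10.xxxx/example",                            # placeholder, not a real DOI
            "license": "CC-BY-4.0",
            "files": {"format": "HDF5", "layout": "one scan per file under /entry/data"},
            "units": {"pixel_size": "um", "photon_energy": "keV"},
            "preprocessing": "flat- and dark-field corrected; no phase retrieval applied",
            "intended_use": "training and benchmarking of phase-retrieval models",
            "known_limitations": "single detector distance; limited dose range",
            "contact": "dataset.author@example.org",
        }
        ```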
    • 17:30 17:45
      Wrap-up and working points for day 2 15m Principal/0-0 - Salle Amphitheatre (Batiment Principal)

    • 19:30 21:30
      Networking dinner 2h
  • Wednesday, 18 October
    • 09:00 09:15
      Introduction and objectives day 2 15m
      Speaker: Sophie Servan (DESY)
    • 09:15 09:45
      Metadata requirements for single crystal raw diffraction data re-use. Experience of IUCrData - Raw Data Letters. 30m

      Please note this talk will be recorded.

      Speaker: Loes Kroon-Batenburg (Utrecht University)
    • 09:45 11:45
      Parallel working session: Working group 1 Principal/1-57 - Salle A1.1.57/Orion (Batiment Principal)

      Working group 1: Harnessing open data for machine learning: curated repositories, pre-processing, reproducibility
      Chair: Paul Millar
      Room: ORION // breakout room 1


    • 09:45 11:45
      Parallel working session: Working group 2 Principal/1-22 - Salle A2.1.22/Pyxis (Batiment Principal)

      Working group 2: Collaboration between data producers and consumers: challenges and opportunities
      Chair: Zdenek Matej
      Room: PYXIS // breakout room 2

      • 09:45
        Working group 2 part I 1h
        Speakers: Markus Janousch (Paul Scherrer Institut), Zdenek Matej (MAX IV Laboratory, Lund University)
      • 10:45
        Break 15m
      • 11:00
        Working group 2 part II 45m
        Speakers: Markus Janousch (Paul Scherrer Institut), Zdenek Matej (MAX IV Laboratory, Lund University)
    • 09:45 11:45
      Parallel working session: Working group 3 Principal/1-48 - Salle A1.1.48/Virgo (Batiment Principal)

      Working group 3: Publishing high-quality datasets: metadata, citation, licence, quality-checks
      Chair: Majid Ounsy
      Room: VIRGO // breakout room 3

    • 11:45 12:15
      Report to the group 30m
      Speakers: Majid OUNSY (SOLEIL), Markus Janousch (Paul Scherrer Institut), Paul Millar (DESY), Zdenek Matej (MAX IV Laboratory, Lund University)
    • 12:15 12:30
      Workshop conclusion 15m
      Speaker: Sophie Servan (DESY)