Exploiting open data for machine learning training: can the Photon and Neutron community do it?
During the last decade, most European Photon and Neutron (PaN) facilities have adopted open data policies, making data available for the benefit of the entire scientific community. At the same time, machine learning (ML) is seen as an essential tool to address the exponential growth of data volumes from PaN facilities.
Exploitation of experimental training datasets is a key component of machine learning. The combination of ML algorithms and open data can therefore be seen as an ideal marriage that would ultimately help the entire community to tackle ‘big data’ challenges with more automation.
However, finding the right data to train machine learning algorithms is a challenge. One of the motivations for making data FAIR (Findable, Accessible, Interoperable, Reusable) is exactly that: to provide scientists working on AI applications with quality training datasets.
But what does 'quality' mean to PaN science communities? What metadata fields are needed to find the data, to understand if it is suitable for our research, and ultimately to be able to ingest it in our training models? How can we provide sufficiently rich metadata? What would be the enablers for more machine learning applications? How can we improve the collaboration between data producers (domain scientists) and data consumers (ML experts)?
We will present projects and teams that have successfully used open datasets from PaN facilities to train their specific ML applications (data consumers), as well as domain scientists (data producers) who have published curated data specifically for ML applications.
We will also look at cases where it hasn't worked so well, to identify what needs to be better curated on the FAIR data management side and to understand the challenges in finding ML experts who can make effective use of the available data. A significant part of the workshop will be dedicated to discussion.
Call for abstracts
During the workshop, we have slots available for 20-minute presentations. We are particularly interested in contributions that address the following points:
- Have you used open data for machine learning? What has been your experience?
- Do you feel your research could benefit from more and better curated open data? Tell us why and how.
- Have you already made open datasets available to your community for machine learning training purposes? What has been your experience?
- Have you managed to enrich the metadata of open datasets for better training? Share with us how you accomplished it.
Whatever your discipline, as long as the data originally came from experiment(s) at a photon or neutron source, please submit an abstract in the "Call for Abstracts" section.
The workshop will be held at the SOLEIL synchrotron, Saint-Aubin (France), as a satellite event of the LEAPS General Assembly. Please note there is a €50 registration fee for on-site participants.
Remote participation will also be possible and is free of charge.