Suggestions for human accessibility of photon science dataset

17 Oct 2023, 17:00
20m
Principal/0-0 - Salle Amphitheatre (Batiment Principal)

Talk | Open data for machine learning | Community talks

Speaker

Peter Steinbach (Helmholtz-Zentrum Dresden-Rossendorf)

Description

As the field of machine learning continues to advance in photon science and related fields, the availability of high-quality datasets plays a pivotal role in the development, use, and evaluation of models. However, the landscape of datasets remains diverse and hard to access, with many datasets being unstructured, poorly organized, or lacking essential documentation. In this presentation, we share our experiences and insights gained from working with datasets for machine learning applications in tomography, holography, and, more generally, phase retrieval.

We begin by emphasizing the importance of data sharing, collaboration, and transparency of datasets in photon science research and other interdisciplinary fields. Reproducibility is an essential aspect of scientific research, as it allows us to independently verify the methods developed by other scientists. Testing and refining existing algorithms produces more robust and accurate machine learning models. Doing so also maximizes the potential of machine learning to accelerate scientific discoveries and drive technological innovation, but this is only possible when data is readily available to the photon science community.

We then address the common challenges faced when dealing with unstructured and unorganized datasets. These challenges encompass issues related to data cleanliness, consistency, and comprehensibility. Drawing upon real-world experiences, we illustrate the impact of such challenges on machine learning experiments, including reduced model performance, wasted research time, and potential biases.

To mitigate these challenges, in addition to sharing data in a FAIR (findable, accessible, interoperable, reusable) fashion, we propose a set of comprehensive guidelines for dataset publication in the form of a mini data sheet template. These guidelines emphasize the importance of structuring and documenting datasets effectively to ensure their usability by the wider research community. We discuss best practices for data cleaning, organization, and annotation, highlighting the critical role that clear metadata and data sheets play in enhancing dataset quality.

What topics do you think we should discuss in the working sessions?

Components for a good dataset

Which point of view is your contribution addressing? I've used open data to train my models
What best describes your position? machine learning expert

Primary authors

Erik Thiessenhusen (Helmholtz-Zentrum Dresden-Rossendorf)
Peter Steinbach (Helmholtz-Zentrum Dresden-Rossendorf)
Ritz Ann Aguilar (Helmholtz-Zentrum Dresden-Rossendorf)
