Adla Lagström Jebara, Fabian Sundholm:
Management of Training Data for Deep Learning Applications: Requirements and Solutions,
summary,
report, March 2024.
Abstract:
In the realm of software development, extensive research has been conducted on source code management, but little to no attention has been given to managing associated data, such as the large volume of training data needed for the development of deep learning applications.
This thesis aims to investigate whether there is a scalable solution for storing and managing training data used in different variants of machine learning models. The research includes identifying and formulating requirements for a training data management system, proposing design solutions to address these requirements, and finally implementing a proof of concept.
The requirement specification was formulated through literature reviews and developer interviews. Two design solutions were then developed in alignment with the identified requirements and by exploring available tools, and one of them was chosen for implementation in a proof of concept.
The research findings include a comprehensive list of requirements, with versioning, scalability, traceability, and data lifecycle management among the key ones. The proof of concept demonstrated that the proposed design solution did not fully meet the requirements, indicating that the problem is more complex than initially expected.
Due to time and resource constraints, a satisfactory full implementation of the proof of concept was not achieved. Moreover, an existing solution that meets all the requirements to a satisfactory degree likely does not exist. Nevertheless, our research indicates that, given additional time and resources, it is feasible to address the problem. Consequently, an interesting direction for future work would be the development and implementation of such a solution.