Analysis of software engineering data with self-organizing maps
Software developers today rely on tools that assist them in developing and managing software products. These tools store large amounts of software engineering data generated by these processes. Analyzing this data can reveal valuable information about project performance and uncover useful business patterns applied during development. This information can be used to identify projects whose teams face management problems and to help improve their situation. Existing methods in the field have limited applicability because they require expert knowledge for evaluation and cannot handle a large number of projects. In this thesis, we explore the possibility of applying Machine Learning methods to the analysis of software engineering data. Machine Learning methods can build a model from sample inputs and produce data-driven predictions. They have been gaining popularity over recent decades and show promising results in applications where human expertise was traditionally used. In this work, we attempt to extract and analyze software project management patterns from software engineering data stored in GitHub repositories. For this purpose, we developed a system capable of collecting project data, extracting features from it, and comparing the properties of a large number of projects with one another. To collect projects, we used the Unified Data Model, which can store software engineering data from various sources; we also identified a few limitations of this model and improved it to meet the requirements of our work. The collected data was used to train Self-Organizing Maps. The resulting map demonstrated clear grouping of the projects according to the chosen feature set. We estimated efficiencies for distinct areas of the map. The effects of different events occurring during an issue's lifetime, such as user assignment and labeling, were also investigated.
Based on the data analysis, we showed that labeling and user assignment are beneficial and can potentially decrease issue resolution time. The main result of our work is an evaluation system capable of data collection, storage, cleaning, and evaluation. The evaluation part of our system was based on an analysis of 230 individual projects, which were the result of the cooperation of 100 000 unique users from the GitHub community. Further research directions could include verification of the estimation subsystem by GitHub users who participated in project development.
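
As an illustration of the Self-Organizing Map training step mentioned above, the following is a minimal sketch in Python with NumPy. It is not the system developed in the thesis: the grid size, the decay schedules, the function names `train_som` and `best_matching_unit`, and the three-column synthetic feature matrix are all assumptions chosen only to show how projects with similar feature vectors end up on nearby map units.

```python
import numpy as np


def best_matching_unit(weights, x):
    """Return the (row, col) index of the unit whose weight vector is closest to x."""
    dist_sq = np.sum((weights - x) ** 2, axis=-1)
    return np.unravel_index(np.argmin(dist_sq), dist_sq.shape)


def train_som(data, rows=5, cols=5, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM; returns weights of shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = data.shape
    weights = rng.random((rows, cols, n_features))
    # Grid coordinates of every unit, used to measure distance on the map itself.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    total_steps = epochs * n_samples
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(n_samples):
            x = data[idx]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood radius
            bmu = best_matching_unit(weights, x)
            grid_dist_sq = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-grid_dist_sq / (2.0 * sigma ** 2))  # Gaussian neighbourhood
            # Pull the winning unit and its map neighbours towards the sample.
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights


# Hypothetical, synthetic "project features" scaled to [0, 1]; not data from the thesis.
demo_rng = np.random.default_rng(1)
group_a = np.clip(0.1 + 0.02 * demo_rng.standard_normal((20, 3)), 0.0, 1.0)
group_b = np.clip(0.9 + 0.02 * demo_rng.standard_normal((20, 3)), 0.0, 1.0)
features = np.vstack([group_a, group_b])

som = train_som(features)
unit_a = best_matching_unit(som, np.full(3, 0.1))
unit_b = best_matching_unit(som, np.full(3, 0.9))
```

In this sketch, `unit_a` and `unit_b` are the map positions assigned to the centres of the two synthetic groups; inspecting which projects share or neighbour a unit is one simple way to read the grouping a trained map produces.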