Decision tree learning with hierarchical features
Hierarkinen data päätöspuissa (Hierarchical data in decision trees)
The performance of machine learning methods depends on the data they are given. Real-life data sets can be incomplete and can consist of various types of data. However, many methods can handle only nominal and numerical features without any missing data, which causes the loss of potentially useful information. This thesis therefore addresses two research questions: can the information in hierarchical and interval values be utilized, and can the handling of missing values be integrated into a model that handles nontraditional data types? In this work we developed a decision tree algorithm (DTHF) that uses hierarchical data to construct a model. The method requires that all data be preprocessed into a hierarchical form. In addition to nominal and numerical features, the method can handle missing data, interval and hierarchical features, and several values for a single feature. DTHF was tested on twelve data sets, and the results were compared with those of the CART and C4.5 decision tree algorithms. Tests were conducted on data sets without missing values as well as on sets with various rates of missing data. When no data are missing, there is no significant difference between DTHF, CART, and C4.5. When data are missing, however, CART performs remarkably better than C4.5 and DTHF, which perform similarly to each other. More tests are needed to give a sufficient understanding of the method's performance. In particular, tests are needed that make use of DTHF's capabilities. Topics for further research include extending the method into a random forest and studying how data should be transformed into a hierarchical form. Other topics could be adding pruning and studying how the data set affects performance when data are missing.
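To make the idea of a hierarchical feature more concrete, the following is a minimal Python sketch of how a value hierarchy and a split test on it could look. It is an illustration only and not the DTHF algorithm itself: the `Node` class, the `matches` helper, the colour hierarchy, and the choice to encode a missing value as the root of the hierarchy are all assumptions made for this example.

```python
class Node:
    """One value in a feature hierarchy (hypothetical structure, not from the thesis)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def is_under(self, other):
        """Return True if this node is `other` or one of its descendants."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


# Hypothetical hierarchy for a "colour" feature.
any_colour = Node("any")
warm = Node("warm", any_colour)
cool = Node("cool", any_colour)
red = Node("red", warm)
orange = Node("orange", warm)
blue = Node("blue", cool)


def matches(values, split_node):
    """A sample may carry several values; it matches if any value lies under split_node."""
    return any(v.is_under(split_node) for v in values)


samples = {
    "a": [red],
    "b": [blue],
    "c": [orange, blue],  # several values for a single feature
    "d": [any_colour],    # assumption: a missing value is encoded as the root
}

# Candidate split: "is the colour a warm colour?"
left = sorted(key for key, values in samples.items() if matches(values, warm))
print(left)
```

In this sketch a sample whose value is the root is more general than any split node, so it never matches a specific branch. How DTHF actually routes such samples is not described in the abstract. An interval feature could be encoded in the same way by treating nested numeric ranges as parent and child nodes, which is again an assumption of the sketch.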