Natural language processing techniques for measuring word similarity
Luonnollisen kielen käsittelyn menetelmät sanojen samankaltaisuuden mittaamisessa
The artificial intelligence application considered in this thesis was harnessed to extract competencies from job descriptions and higher education curricula written in natural language. Using these extracted competencies, the application is able to visualize the skills supply of schools and the skills demand of the labor market. However, to understand natural language, a computer must learn to evaluate the relatedness between words. The aim of this thesis is to propose the best methods for open text data mining and for measuring the semantic similarity and relatedness between words.

Different words can have similar meanings in natural language, and a computer can learn the relatedness between words mainly by two different methods. First, we can construct an ontology of the studied domain, which models the concepts of the domain as well as the relations between them. The ontology can be considered a directed graph: the nodes are the concepts of the domain, and the edges between the nodes describe their relations. The semantic similarity between concepts can then be computed from the distance and the strength of the relations between them.

The other way to measure word relatedness is based on statistical language models. Such a model learns the similarity between words from their probability distributions in large corpora: words appearing in similar contexts, i.e., surrounded by similar words, tend to have similar meanings. Words are often represented as continuous distributed word vectors, each dimension representing some feature of the word. A feature can be semantic, syntactic, or morphological; however, the features are latent and usually not interpretable by a human. If the angle between two word vectors in the feature space is small, the words share the same features and are hence similar.

The study was conducted by reviewing the available literature and by implementing a web scraper for retrieving open text data from the web.
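The vector-space measure described above can be sketched as follows. This is a minimal illustration with toy three-dimensional vectors and made-up values, not the thesis's implementation; real distributed representations use hundreds of latent dimensions learned from corpora.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two word vectors:
    # close to 1 when the angle is small, i.e. the words are similar.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values for illustration only).
king = [0.8, 0.6, 0.1]
queen = [0.75, 0.65, 0.15]
apple = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))  # small angle, high similarity
print(cosine_similarity(king, apple))  # large angle, low similarity
```

The measure depends only on the angle between the vectors, not their lengths, which is why it is the standard choice for comparing distributed word representations.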
The scraped data was fed into the AI application, which extracted the skills from the data and visualized the results as semantic maps.
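The ontology-based measure summarized in the abstract can likewise be sketched in a few lines. The taxonomy below is a hypothetical toy example, and the similarity formula (the reciprocal of one plus the shortest-path distance) is one common path-based measure, not necessarily the one used in the thesis.

```python
from collections import deque

# A toy concept graph (hypothetical skill taxonomy; edges are
# treated as undirected when computing distances).
ontology = {
    "skill": ["programming", "communication"],
    "programming": ["skill", "python", "java"],
    "python": ["programming"],
    "java": ["programming"],
    "communication": ["skill"],
}

def shortest_path_length(graph, start, goal):
    # Breadth-first search for the number of edges between two concepts.
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # no path between the concepts

def path_similarity(graph, a, b):
    # Path-based similarity: 1 / (1 + distance); closer concepts score higher.
    dist = shortest_path_length(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)

print(path_similarity(ontology, "python", "java"))           # siblings, distance 2
print(path_similarity(ontology, "python", "communication"))  # distance 3, lower score
```

A weighted ontology would replace the unit edge count with the strength of each relation, as the abstract notes, but the principle of scoring by graph distance is the same.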