
Natural language processing techniques for measuring word similarity

Hjelt, Teppo

Permanent address of the item is http://URN.fi/URN:NBN:fi:tty-201811142578
hjelt.pdf (1.709Mb)
Date: 2018-12-05
Alternative title: Luonnollisen kielen käsittelyn menetelmät sanojen samankaltaisuuden mittaamisessa
Abstract
The artificial intelligence application considered in this thesis extracts competencies from job descriptions and higher education curricula written in natural language. Using the extracted competencies, the application visualizes the skills supply of schools and the skills demand of the labor market. To understand natural language, however, a computer must learn to evaluate the relatedness between words. The aim of the thesis is to propose the best methods for mining open text data and for measuring the semantic similarity and relatedness between words.

Different words can have similar meanings in natural language. A computer can learn the relatedness between words in two main ways. The first is to construct an ontology of the studied domain, which models the concepts of the domain and the relations between them. The ontology can be viewed as a directed graph: the nodes are the concepts of the domain, and the edges between the nodes describe their relations. The semantic similarity between concepts can then be computed from the distance and the strength of the relations between them.
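As a minimal sketch of the graph-based idea, the following Python example computes a similarity score from path distance and relation strength. The toy ontology, the strength values, the choice of 1/strength as edge cost, and the formula 1 / (1 + cost) are all illustrative assumptions; the abstract does not state which measure the thesis uses.

```python
import heapq

# Hypothetical toy ontology: concept -> {related concept: relation strength in (0, 1]}.
ONTOLOGY = {
    "programming": {"python": 0.9, "java": 0.9, "software testing": 0.5},
    "python": {"data analysis": 0.8},
    "java": {},
    "software testing": {},
    "data analysis": {},
}


def undirected(graph):
    """Return a copy in which every relation can be traversed in both directions."""
    result = {node: dict(edges) for node, edges in graph.items()}
    for node, edges in graph.items():
        for neighbor, strength in edges.items():
            result.setdefault(neighbor, {})
            result[neighbor].setdefault(node, strength)
    return result


def path_cost(graph, source, target):
    """Dijkstra's algorithm; an edge costs 1/strength, so strong relations are 'short'."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, strength in graph.get(node, {}).items():
            new_cost = cost + 1.0 / strength
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return float("inf")  # no path between the concepts


def similarity(graph, a, b):
    """Map path cost to a score in [0, 1]: 1 for identical concepts, 0 if unconnected."""
    return 1.0 / (1.0 + path_cost(graph, a, b))


graph = undirected(ONTOLOGY)
print(similarity(graph, "python", "data analysis"))
print(similarity(graph, "python", "java"))
```

In this sketch a direct, strong relation yields a higher score than a path through an intermediate concept, and concepts with no connecting path score zero.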

The other way to measure word relatedness is based on statistical language models. The model learns the similarity between words from their probability distribution in large corpora. Words that appear in similar contexts, i.e., surrounded by similar words, tend to have similar meanings. Words are often represented as continuous distributed word vectors, in which each dimension represents some feature of the word. The feature can be semantic, syntactic, or morphological. However, the feature is latent and usually not understandable to a human. If the angle between two word vectors in the feature space is small, the words share the same features and are hence similar.
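The angle criterion is commonly expressed as cosine similarity. The sketch below illustrates it in plain Python; the three-dimensional vectors are made-up values for illustration only, whereas vectors learned from a corpus typically have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical word vectors; each dimension stands for some latent feature.
VECTORS = {
    "programming": [0.9, 0.1, 0.3],
    "coding": [0.8, 0.2, 0.3],
    "cooking": [0.1, 0.9, 0.2],
}

print(cosine_similarity(VECTORS["programming"], VECTORS["coding"]))
print(cosine_similarity(VECTORS["programming"], VECTORS["cooking"]))
```

With these assumed vectors, "programming" and "coding" point in nearly the same direction and therefore score higher than "programming" and "cooking".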

The study was conducted by reviewing available literature and implementing a web scraper for retrieving open text data from the web. The scraped data was fed into the AI application, which extracted the skills from the data and visualized the result in semantic maps.
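The abstract does not describe how the scraper was built. As one possible sketch of the text-extraction step, the following example uses Python's standard `html.parser` module to pull visible text out of an HTML page. The class name `TextExtractor`, the helper `extract_text`, and the sample page are hypothetical; fetching a live page (for example with `urllib.request.urlopen`) is left out so that the example runs offline.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML document as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical job-description page used only for illustration.
SAMPLE_PAGE = """<html><head><style>p {color: red}</style></head>
<body><h1>Data Engineer</h1><script>track()</script>
<p>Requires Python and SQL.</p></body></html>"""

print(extract_text(SAMPLE_PAGE))
```

Text extracted this way could then be passed on to whatever component performs the skill extraction.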

University: Tampereen teknillinen yliopisto - Tampere University of Technology
Faculty: Talouden ja rakentamisen tiedekunta – Faculty of Business and Built Environment
Department: Tietotekniikka – Pervasive Computing
Degree Programme: Johtaminen ja tietotekniikka (Pori)
Pages: 74
Publication type: Diplomityö - Master's thesis
Collections:
  • Master's theses - Open access
