Algorithmic extraction of data in tables in PDF documents
Taulukkomuotoisen datan talteenotto PDF dokumenteista
Tables are an intuitive and universally used way of presenting large sets of experimental results and research findings, and as such they are the main source of significant data in scientific publications. Because no universal standard exists for the format of the reported data or for table layouts, two highly flexible algorithms were created to (i) detect tables within documents and (ii) recognize the column and row structure of each table. Together, these algorithms enable fully automated extraction of tabular data from PDF documents. PDF was chosen as the target format because of its popularity and because research publications are available as natively digital PDF documents almost without exception. The extracted data is made available in HTML and XML formats, which were chosen for their flexibility and ease of use in further processing. The software application created as part of this thesis work enables future research to take full advantage of existing research and results by gathering large volumes of data from various sources for a more profound statistical analysis.
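As an illustration of the final output step only, the following minimal sketch shows how a table whose rows and columns have already been recognized might be written out as HTML and XML. It is not the implementation described in this thesis: the function names `table_to_html` and `table_to_xml`, the XML element names `row` and `cell`, the `index` and `column` attributes, and the sample data are all assumptions made for this example, and the table is assumed to be available as a plain list of rows of strings.

```python
import xml.etree.ElementTree as ET


def table_to_html(rows, header_rows=1):
    """Serialize a grid of cell strings as an HTML <table>.

    The first `header_rows` rows are emitted as <th> cells (an assumption;
    a real extractor would have to detect header rows itself).
    """
    table = ET.Element("table")
    for i, row in enumerate(rows):
        tr = ET.SubElement(table, "tr")
        tag = "th" if i < header_rows else "td"
        for cell in row:
            ET.SubElement(tr, tag).text = cell
    # method="html" escapes special characters in cell text
    return ET.tostring(table, encoding="unicode", method="html")


def table_to_xml(rows):
    """Serialize the same grid as XML with explicit row/column indices.

    Element and attribute names here are hypothetical, not a fixed schema.
    """
    table = ET.Element("table")
    for r, row in enumerate(rows):
        row_el = ET.SubElement(table, "row", index=str(r))
        for c, cell in enumerate(row):
            cell_el = ET.SubElement(row_el, "cell", column=str(c))
            cell_el.text = cell
    return ET.tostring(table, encoding="unicode")


# Hypothetical extracted table: one header row and two data rows.
rows = [["Sample", "Yield (%)"], ["A", "12.5"], ["B", "<0.1"]]
html_out = table_to_html(rows)
xml_out = table_to_xml(rows)
print(html_out)
print(xml_out)
```

Keeping explicit row and column indices in the XML form is one possible design choice: it lets downstream statistical tools address individual cells without re-parsing the layout, while the HTML form remains convenient for visual inspection.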