Full Paper

Automatic Metadata Extraction From Museum Specimen Labels

P. Bryan Heidorn, Qin Wei

DOI: 10.23106/dcmi.952109189

Abstract

This paper describes the information properties of museum specimen labels and the machine learning tools used to automatically extract Darwin Core (DwC) and other metadata from these labels after they have been processed through Optical Character Recognition (OCR). DwC is a metadata profile describing the core set of access points for search and retrieval of natural history collections and observation databases. Using the HERBIS Learning System (HLS), we extract 74 independent elements from these labels. The automated text extraction tools are provided as a web service, so that users can reference digital images of specimens and receive an extended Darwin Core XML representation of the label content. This automated extraction task is made more difficult by the high variability of museum label formats, OCR errors, and the open-class nature of some elements. We introduce our overall system architecture and variability-robust solutions, including the application of Hidden Markov and Naïve Bayes machine learning models, data cleaning, the use of field element identifiers, and specialist learning models. The techniques developed here could be adapted to any metadata extraction situation with noisy text and weakly ordered elements.

Author information

P. Bryan Heidorn

University of Illinois at Urbana-Champaign

Qin Wei

University of Illinois at Urbana-Champaign

Cite this article

Heidorn, P. B., & Wei, Q. (2008). Automatic Metadata Extraction From Museum Specimen Labels. Proceedings of the International Conference on Dublin Core and Metadata Applications, 2008. https://doi.org/10.23106/dcmi.952109189

Issue

DC-2008--Berlin Proceedings
Location: Berlin, Germany
Dates: September 22-26, 2008
The metadata and citations of this article are published under the Creative Commons Zero Universal Public Domain Dedication (CC0), allowing unrestricted reuse. Anyone can freely use the metadata from DCPapers articles for any purpose without limitations.
The full text of this article is published under the Creative Commons Attribution 4.0 International License (CC BY 4.0). This license allows use, sharing, adaptation, distribution, and reproduction in any medium or format, provided that appropriate credit is given to the original author(s) and the source is cited.