KNB Data

KNB Informatics Research

Introduction
Ecological data are extremely variable in their syntax and semantics. This section explains the approaches we use to manage complex ecological data and metadata from our various collaborators. It describes our goals and some of the reasoning behind the architecture that we are developing.
Goals
Our collaborators and colleagues include ecological and environmental scientists spread around the nation (and, for that matter, the world). The data that they generate are therefore also dispersed: they are collected in widely scattered locations and housed at a variety of institutions. This is appropriate because it keeps data close to their primary user, the data owner or collector. However, the formation of NCEAS, the LTER Network Office, and similar institutions aimed at cross-site, interdisciplinary, synthetic research has demonstrated that this distributed set of valuable data is largely inaccessible to anyone except the original investigators. What is good for the individual investigator (local data) is not always best for the ecological community (lack of access to national data resources).
Thus, the KNB. We conceived of the KNB as a mechanism for scientists to discover, access, interpret, analyze, and synthesize the wealth of data collected by ecological and environmental scientists nationally (and eventually internationally). The infrastructure for this network must deal with the major impediments to synthesizing data on ecology and the environment:
Data are widely dispersed
Data are heterogeneous
Synthetic analysis tools are needed
KNB Architecture
To address these issues, we have taken a layered approach to infrastructure development. The three principal layers are data access, information management, and knowledge management.
Data Access: The base layer, data access, addresses the dispersed nature of data. It consists of a national network of federated institutions that have agreed to share data and metadata using a common framework. That framework revolves around two components: the Ecological Metadata Language, a common language for describing ecological data, and the Metacat metadata server, a flexible XML-based database built for storing a wide variety of metadata documents. In addition, we plan to use the Storage Resource Broker, a distributed data system developed at SDSC, to link the highly distributed set of ecological field stations and universities that house ecological data. Finally, we are developing a user-friendly data management tool called Morpho that allows ecologists and environmental scientists to manage their data on their own computers and to access data that are part of this national network, the KNB.
Information Management: The middle layer, information management, addresses the heterogeneous nature of ecological data. It consists of a set of tools that help convert the raw data available from the various contributors into information relevant to a particular issue of interest to a scientist. This information management infrastructure has two major components. First, the Data Integration Engine will provide an intelligent software environment that helps scientists determine which data sets are appropriate for particular uses and helps them create synthesized data sets. Second, the Quality Assurance Engine will provide a set of common quality assurance analyses that can be run automatically using information gathered from the metadata provided for a data set.
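To make the idea of metadata-driven quality assurance concrete, the following is a minimal sketch in Python. It is not the Quality Assurance Engine itself, whose design is not described here. The AttributeMetadata class, its fields, and the check_rows function are hypothetical names chosen for illustration, and the attribute descriptions are a simplified stand-in for the kind of information a metadata document might record, not Ecological Metadata Language syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttributeMetadata:
    """Hypothetical, simplified description of one data column."""
    name: str
    minimum: Optional[float] = None     # smallest plausible value, if documented
    maximum: Optional[float] = None     # largest plausible value, if documented
    missing_code: Optional[str] = None  # marker the data owner uses for missing values


def check_rows(rows, attributes):
    """Return a list of (row_index, column_name, problem) tuples.

    Each row is a dict mapping column names to raw string values.
    Every check is driven by the metadata, not hard-coded per data set.
    """
    problems = []
    for i, row in enumerate(rows):
        for attr in attributes:
            raw = row.get(attr.name)
            if raw is None:
                problems.append((i, attr.name, "column absent"))
                continue
            if attr.missing_code is not None and raw == attr.missing_code:
                continue  # a documented missing value is not an error
            try:
                value = float(raw)
            except ValueError:
                problems.append((i, attr.name, "not numeric"))
                continue
            if attr.minimum is not None and value < attr.minimum:
                problems.append((i, attr.name, "below documented minimum"))
            if attr.maximum is not None and value > attr.maximum:
                problems.append((i, attr.name, "above documented maximum"))
    return problems


# Illustrative data only: an invented water temperature column.
attributes = [AttributeMetadata("water_temp_c", -2.0, 40.0, "-999")]
rows = [
    {"water_temp_c": "12.5"},
    {"water_temp_c": "-999"},
    {"water_temp_c": "85"},
    {"water_temp_c": "n/a"},
    {},
]
for problem in check_rows(rows, attributes):
    print(problem)
```

Because the limits and the missing-value code come from the metadata, the same function could in principle be applied to any data set whose columns are described this way, which is the property that lets such analyses run automatically.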
Knowledge Management: The top layer, knowledge management, addresses the need for high-quality analytical tools that allow scientists to explore and use the wealth of data available from the data and information layers. It consists of a suite of software applications that allow scientists to analyze and summarize the data in the KNB. The Hypothesis Modeling Engine is a data exploration tool that uses Bayesian techniques to evaluate the wide variety of hypotheses that can be addressed by a particular set of data. We also plan to provide visualization tools that allow scientists to graphically depict combinations of data from the data and information layers in appropriate ways.
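As an illustration of what evaluating competing hypotheses with Bayesian techniques can mean in the simplest case, the sketch below applies Bayes' rule to a small set of candidate hypotheses. It does not describe how the Hypothesis Modeling Engine works, which is not specified here. The function name posterior_probabilities, the binomial model, and the example numbers are all assumptions made for illustration.

```python
from math import comb


def posterior_probabilities(hypotheses, priors, successes, trials):
    """Apply Bayes' rule to a discrete set of candidate success rates.

    hypotheses maps a hypothesis name to its assumed success rate;
    priors maps the same names to prior probabilities.
    The data are `successes` out of `trials` under a binomial model.
    """
    weighted = {}
    for name, rate in hypotheses.items():
        likelihood = (
            comb(trials, successes)
            * rate ** successes
            * (1 - rate) ** (trials - successes)
        )
        weighted[name] = likelihood * priors[name]
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("the data are impossible under every hypothesis")
    # Normalize so the posterior probabilities sum to one.
    return {name: w / total for name, w in weighted.items()}


# Invented example: a species was found in 5 of 10 surveyed plots.
hypotheses = {"rare": 0.1, "common": 0.5}
priors = {"rare": 0.5, "common": 0.5}
posterior = posterior_probabilities(hypotheses, priors, successes=5, trials=10)
for name, probability in posterior.items():
    print(name, round(probability, 4))
```

With equal priors, the hypothesis whose assumed rate better matches the observations receives most of the posterior probability. A real tool would need richer models and many more hypotheses, but the same rule underlies the comparison.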
KNB Sites
A wide variety of organizations and sites have agreed to participate in the development and testing of the KNB. The LTER Network of over 24 research stations has agreed to fully participate in the network, along with a variety of sites from the Organization of Biological Field Stations (OBFS) and the UC Natural Reserve System. In addition, we have a variety of individual and site collaborators from the Multi-Agency Rocky Intertidal Network. As the technology for the KNB matures, we expect to add many new sites and sources of data to the network. Sites interested in participating in the prototype network or in the final deployed network should contact jones@nceas.ucsb.edu.