DATA NEEDS OF ACADEMIC RESEARCH IN THE INTERNET ENVIRONMENT

Summary of a talk at the US National Institute of Standards and Technology, December 4, 1996

Gary Wiggins

Indiana University Chemistry Library

wiggins@indiana.edu

I. Introduction

Although there are many free data sources on the Internet, some of the free compilations lack the kind of rigorous quality control found in data sets available in the commercial sector. Various types of scientific and technical data available on the Internet were surveyed and demonstrated during the talk at NIST. Without standards, questions of accuracy and reliability of Internet data invariably arise.

In a comment made on CHMINF-L on October 30, 1996, David Lide said, "All in all, the chemical data now available on the web is in a different class from the data found in refereed journals, critical reviews and books from reputable publishers." The comment was not intended to flatter the compilers of Web data sources.

A survey was undertaken in October 1996 to determine whether steps are needed to improve the quality of data on the Web. Questions sent to CHMINF-L and CHEMWEB were intended to:

II. Accuracy of Internet Data and Suggestions for Improvement.

The accuracy of the data on the Web was roundly criticized by some respondents. They pointed out that units are frequently omitted and that transcription errors are often encountered. Very few sources on the Web have quality assurance statements, and few give the source of the data; those that do often indicate that the data were copied from outdated sources. Respondents therefore saw a need for a minimal level of auxiliary information (metadata) covering at least authorship, units, conditions of measurement, and references to primary and secondary sources of data. Furthermore, standard symbols and terminology ought to be used in the compilations, some of which suffer from a lack of guidelines on how to handle special characters.

In light of the above criticism, the following steps were suggested to improve data on the Web:

At a minimum, compilers ought to provide descriptions of physical theories on which data are based, full references to literature, and descriptions of the format of the database and its search capabilities.

There are some standardization efforts underway on the net, particularly on the publishing side, where the CLIC project, Chemical MIME, and CML were noted as hopeful signs. Although one person questioned whether standardization was worthwhile, in general, there was seen to be a role for IUPAC, CODATA, or other bodies in the area of data certification.
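As an illustration of the minimal metadata respondents asked for in this section (authorship, units, conditions of measurement, and references), the following Python sketch shows one way a compiler might check a record before publishing it. The class name, field names, and placeholder strings are hypothetical and chosen only for illustration; they do not come from any standard or from the survey itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataRecord:
    """One data value plus auxiliary information (hypothetical layout)."""
    property_name: str
    value: float
    units: str = ""        # omitted units were a common complaint
    conditions: str = ""   # conditions of measurement
    compiler: str = ""     # authorship of the compilation
    references: List[str] = field(default_factory=list)


# Metadata fields treated as required in this sketch (an assumption).
REQUIRED = ("units", "conditions", "compiler", "references")


def missing_metadata(record: DataRecord) -> List[str]:
    """Return the names of required metadata fields that are empty."""
    return [name for name in REQUIRED if not getattr(record, name)]


# A bare value with no metadata at all.
bare = DataRecord("boiling point of water", 100.0)

# The same value with placeholder metadata filled in.
documented = DataRecord(
    "boiling point of water",
    100.0,
    units="degC",
    conditions="101.325 kPa",
    compiler="(compiler name)",
    references=["(citation of primary source)"],
)

print(missing_metadata(bare))        # ['units', 'conditions', 'compiler', 'references']
print(missing_metadata(documented))  # []
```

A check of this kind only confirms that the metadata are present. It says nothing about whether the value itself is accurate, which is where certification by a body such as IUPAC or CODATA would come in.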

III. Finding Data on the Web.

A quote from a recent computer journal points out one of the fundamental problems of using the Web to obtain data: "While some might argue that the Internet is designed to make information in a single location accessible to users around the world, the large number of mirrored sites already in existence points out the Net's inadequacy." (Byte, December 1996, p. 116)

One respondent to the survey noted the relevance of Lebedev's study of Internet search engines (http://www.chem.msu.su/eng/comparison.html). Lebedev searched for data using words on 11 Web search engines. He concluded that Excite retrieves about as many documents as Altavista and that Metacrawler is the most powerful search engine for scientific and technical information. He also compared his Internet searches to INSPEC results for the same information covering 1994 and 1995, and found that only 5-10% of the relevant information is on the net. However, Lebedev considers the Web to be good for supplemental information on authors, on their work and research projects, and on the foundations supporting them.

It is possible to find data on the Internet by following some generally accepted procedures, such as:

Lists of Sources (Guides)


http://www.indiana.edu/~cheminfo/ca_accc.html
http://www.indiana.edu/~cheminfo/ca_ppi.html


http://plasma-gate.weizmann.ac.il/DBfAPP.html


http://www.iop.org/Physics/Resources/phsoft.html

Known Sources


http://physics.nist.gov/PhysRefData/contents.html

http://www.shef.ac.uk/~chem/chemputer/

http://dragon.labmed.umn.edu/~lynda/index.html

Comprehensive Chemistry Guides

http://chemfinder.camsoft.com

http://schiele.organik.uni-erlangen.de/services/webmol.html

http://www.tripos.com/spacecrunch/

Other Examples

http://www.lib.utexas.edu/Libs/Chem/info/thermodex/

http://funnelweb.utcc.utk.edu/~athas/databank/intro.html

Internet Demos

http://www.indiana.edu/~cheminfo/ca_accc.html

Go to the Analytical Chemistry page, then to MS Links at SIS, then Dave's Math Tables

http://www.sisweb.com/math/tables.htm

http://micro.ifas.ufl.edu/

Plays "Happy Birthday to You" on an NMR Spectrometer!

http://xray.uu.se/hypertext/corexdb.html

SEARCH naphthalene

http://alfred.niehs.nih.gov/LMB/stdb

ENTER THE DATABASE doesn't work, but HIPPO does

http://emrs.chm.bris.ac.uk/

Beautiful background!

In "About the Database" in the Introduction, under Spectra examples, show the example Cu(II) (nothing else works!)

http://www.cica.indiana.edu/~recip/

http://www.indiana.edu/ReciprocalNet.html

http://molbio.info.nih.gov/cgi-bin/pdb

Search dehalogenase (E.C.3.8.1.5)

http://webbook.nist.gov/chemistry

Look for 91-56-5

http://ozone.sph.unc.edu

Has "Environmental Data", but it's "under construction"

http://www.lib.utexas.edu/Libs/Chem/info/thermodex/

Search Gibbs Free Energy and organic

http://chemfinder.camsoft.com

Search MEK

http://schiele.organik.uni-erlangen.de/services/webmol.html

Search MEK, then 2-butanone

http://www.tripos.com/spacecrunch/


http://www.bris.ac.uk/Depts/Chemistry/MOTM/motm.htm