Sloan Digital Sky Survey
Review of Data Processing
Operations
Science Archive
Ani Thakar
July 7, 2000
Science Archive Overview
The SDSS Science Archive (SX) consists of the following
components:
- A commercial object-oriented DBMS - Objectivity - that serves as the
data repository or warehouse. In addition to providing persistence and data
integrity for an object data model, Objectivity also provides binary
compatibility across the major hardware platforms, transparent and
configurable data organization, superior data-mining performance for complex
data models, distributed database capability and automated replication of
the archive.
- A client-server interface to the data warehouse layer
that consists of:
- A Tcl/Tk GUI client - the SX Query Tool or
sxQT - that enables users to formulate queries in the SQL-like SX Query
Language (SXQL) and submit them to the SX server of their choice (see
below). The sxQT provides a multi-window environment that enables the user
to initiate multiple sessions with different SX servers and submit multiple
queries in parallel.
- A multi-threaded and distributed server (sxServer) written in C++
that creates and manages user sessions and executes user queries. It
performs user authentication and routes query output in the appropriate
format to the specified output targets. The sxServer has the ability to
handle multiple parallel user sessions as well as multiple parallel user
queries per session. Queries that need to access distributed data are
executed in distributed mode by sending the required parts of the queries to
remote hosts and executing them with remote slave servers.
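The distributed execution described above can be illustrated with a minimal Python sketch. This is not the sxServer code (which is written in C++); the host names, the in-memory partitions and the functions run_on_partition and run_distributed are hypothetical stand-ins, and a thread pool is assumed here only to mimic sending parts of a query to remote servers in parallel and merging their results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for data partitions held by different hosts.
PARTITIONS = {
    "host_a": [{"id": 1, "r": 17.2}, {"id": 2, "r": 21.5}],
    "host_b": [{"id": 3, "r": 16.8}, {"id": 4, "r": 19.9}],
}


def run_on_partition(host, predicate):
    """Simulate a remote server evaluating one part of a query."""
    return [obj for obj in PARTITIONS[host] if predicate(obj)]


def run_distributed(predicate):
    """Send the predicate to every partition in parallel and merge results."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        futures = [pool.submit(run_on_partition, host, predicate)
                   for host in PARTITIONS]
        results = []
        for future in futures:
            results.extend(future.result())
    # Sort so the merged output does not depend on which host answered first.
    return sorted(results, key=lambda obj: obj["id"])


bright = run_distributed(lambda obj: obj["r"] < 18.0)
print([obj["id"] for obj in bright])
```

In this toy setup each partition is filtered independently, so the merge step is a simple concatenation; a real distributed query would also need to handle failures and partial results.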
Status of SX Requirements
The original requirements for the SX (circa 1995) are listed below along with
an indication beside each requirement as to whether it has been met (Yes), not
met (No) or not applicable any more (NA). Other comments are added as necessary.
The science archive shall consist of:
- A science database that shall:
- Retain calibrated object catalogs (photometric CCD output)
Yes
- Retain parameters from spectroscopic pipeline Yes
- Enhanced goal: provide ability to recalibrate object catalogs
No
- Retain references to atlas images and extracted spectra Yes
- Provide ability to carry out manual target selection for certain
target categories Yes
- Provide ability to carry out offline QA activities Yes
- Provide ability for SDSS scientists to extract subsets of retained
data Yes
- Enhanced goal: Provide ability to retain scientist-derived data sets
No
- Enhanced goal: Provide smooth transition to public distribution
system No
- A set of files tracked by the science database Yes (tsObjc,
tsField, fpAtlas, fpBin, psSpec, fpMask)
- A set of files not tracked by the science database Yes (all other
files)
Summary:
All primary science archive requirements have been met, with the exception
of the enhanced goals.
I. Input to Science Archive
Survey Definition
- A description of the North Imaging survey area
Yes
- Survey progress: A description of sky inserted into database to date
Yes
Final Astrometric Calibration
- List of r' band calibration coefficients on a frame-by-frame basis
Yes
[TBD: Are position errors stored on an
object-by-object basis?]
Final Photometric Calibration
- List of photometric calibration coefficients on a frame-by-frame basis
Yes
Merged Object Lists
- A list of calibrated objects and parameters from the Frames pipeline of
photo Yes
- A list of object masks from the Frames pipeline of photo No
(photo not providing masks, only bit flags), needs to be extracted (who,
when?)
[TBD: Do 1 and 2 provide all
information about masked areas of sky?]
- Run and Field information. Might be needed for recalibrations
Yes
- Star/Galaxy/QSO classifications Yes
- Enhanced goal: Cross-identifications to other catalogs No (wait
till real mandate for this; FIRST, USNO, ROSAT incl.)
Target Selection
- A list of all targetable objects with target selection categories
No (target does not store this - to be changed)
- A list of all objects from the list above selected as targets, with selection category
Yes
- Tiling flags for all objects in 2 Yes (coming with schema
for spectra)
- Reports for all targets selected manually No (target does
not store this - to be changed)
Spectroscopic Pipeline
- Redshifts and parameters of all targeted objects Yes
- Enhanced goal: Tile and plate information Yes
[TBD: If a target has multiple spectra obtained, is
there a need to assign one as a primary measurement?]
NA
- Enhanced goal: Scientist derived catalogs No
- Enhanced goal: Other input catalogs No
Separate files tracked from Science Database
- Atlas Images Yes (under construction)
- 1-D spectra Yes (under construction)
Separate files not tracked or accessible from Science Archive
- Compressed pixel map
- Full corrected pixel map
- Corrected spectroscopic frames
- Unused data
[TBD: Southern Survey Yes (same as Northern)]
Summary:
All the basic requirements for Inputs have been met - the SX meets these
requirements to the extent possible with the current inputs provided by the
photo and spectro pipelines. The enhanced goals have not been
met.
II. Functional Goals
User will be able to carry out efficient queries to locate objects over one
or more ranges of following attributes:
- Longitude or latitude in several spherical coordinates
- J2000 Ra and Dec Yes
- J1950 Ra and Dec Yes
- Enhanced goal: Ra and Dec of arbitrary epoch No
- Galactic coordinates Yes
- Survey Coordinates Yes
- Any linear combination of the two coordinates Yes
- Radius around a given point on the sky Yes
- u' g' r' i' z' (One set of magnitudes per object) Yes
- Any linear combination of 3. Yes
- Object radius (one per object) Yes
- Surface brightness formed by c and d Yes
- Star/Galaxy/QSO Classification flag Yes
- Object class (small/medium/big/mask) Yes
- Target Selection Category Yes (target flags)
- Spectrum available flag Yes (link exists?)
User will be able to carry out queries on any retained object parameter
(subject to implementation constraints) Yes
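As an illustration of two of the query types listed above (a radius around a point on the sky and a linear combination of magnitudes), here is a minimal Python sketch. It is not SXQL and does not reflect how the SX evaluates queries; the catalog rows, the field names ra, dec, g and r, and the helper functions are all assumptions made for the example. The angular separation uses the standard haversine formula.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def in_cone(obj, ra0, dec0, radius_deg):
    """True if the object lies within radius_deg of (ra0, dec0)."""
    return angular_separation_deg(obj["ra"], obj["dec"], ra0, dec0) <= radius_deg


def linear_combination(obj, coeffs):
    """Weighted sum of magnitudes, e.g. {"g": 1, "r": -1} for a g - r color."""
    return sum(c * obj[band] for band, c in coeffs.items())


# Hypothetical catalog rows (J2000 coordinates in degrees, two magnitudes).
catalog = [
    {"id": 1, "ra": 180.00, "dec": 0.00, "g": 18.4, "r": 17.9},
    {"id": 2, "ra": 180.05, "dec": 0.02, "g": 20.1, "r": 18.6},
    {"id": 3, "ra": 185.00, "dec": 2.00, "g": 19.0, "r": 18.8},
]

g_minus_r = {"g": 1.0, "r": -1.0}
selected = [o["id"] for o in catalog
            if in_cone(o, 180.0, 0.0, 0.1)
            and linear_combination(o, g_minus_r) < 1.0]
print(selected)
```

Object 2 falls inside the cone but fails the color cut, and object 3 lies outside the cone, so only object 1 is selected.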
Enhanced Goal:
Summary:
All the basic functional goals, with the exception of efficient repeated
queries and the ability to make plots in the SX, have been met. One of the two
enhanced goals has been met.
----------------
III. Technical Goals
User interface
- User interface shall be developed in a TCL/Tk/TclDp
environment Yes
- User interface shall communicate with a query support
layer via ASCII interface protocol Yes
- Data shall be returnable to files, sockets, or pipes.
Returned data shall use binary machine independent format
(FITS binary, ASCII if appropriate, TBD: FITS ASCII, other?) Yes
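A machine-independent binary format fixes the byte order and field widths regardless of the host that writes or reads the data; FITS binary data, for instance, is stored big-endian. The following Python sketch shows the idea with the standard struct module. The three-field record layout (id, ra, dec) is a hypothetical example, not the SX output schema, and the sketch does not write a valid FITS file.

```python
import struct

# Big-endian, no padding: int64 id, float64 ra, float64 dec.
# This layout is an assumption for illustration only.
RECORD = struct.Struct(">qdd")


def pack_row(obj_id, ra, dec):
    """Encode one row in a byte order that does not depend on the host."""
    return RECORD.pack(obj_id, ra, dec)


def unpack_row(buf):
    """Decode a row written by pack_row on any platform."""
    return RECORD.unpack(buf)


row = pack_row(42, 180.0, -1.25)
print(len(row), unpack_row(row))
```

Because the byte order is stated explicitly in the format string, a row packed on one architecture decodes to the same values on another.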
Enhanced goal:
Data shall be stored in a system providing industry-standard OSQL-like
interface to enable use of commercial products to provide alternative view of
the database Yes (Objy has SQL++)
Distributability
- A master copy of all data shall be maintained (the
Master Science Archive) Yes
- Capability shall be present to replicate all or part
of the Master Science Archive as local databases at SDSS institutions.
Replication may consist of:
- Science Database in its entirety Yes
- All or part of separate files tracked by Science
Database Yes
- No capability shall be present to replicate an
arbitrarily selected subset of the science database beyond that described by
section 1.3 of USER INTERFACE Yes
- Replication of databases shall be possible on all SDSS supported
platforms Yes
- No capability is required to be present to replicate all or part of
separate files not tracked by Science Database Yes
Security
- Master Science Archive shall be protected against corruption by SDSS
participant user Yes
- Master Science Archive shall be protected against unauthorized access Yes
Summary: All the technical goals have been met.
SX Usage Stats
Usage statistics for the SX for the last 6 months are shown below.
Total number of SX logins
Total number of queries submitted
Number of queries completed
Number of queries aborted by user
Number of queries not completed due to system errors
Average query execution time (predicted : actual)
SX usage by institution (# of logins for each institution)
Usage with time chart (# of logins per day)
Stability of SX server:
# of server crashes per day (plot since Mar/00)
SX bug reporting stats
SX Performance and Objectivity Issues
Our goal is to have a scalable, primarily I/O-limited archival system. The
current release of Objectivity has prevented us from reaching that goal due to
performance problems that we have identified with it. Benchmarks conducted by
the SX development team that compared Objectivity performance with Microsoft's
SQL Server indicate that scan speeds with Objectivity were more than an order of
magnitude slower than with SQL Server. Even after allowing for the additional
complexity of an object database, this was a serious shortfall in performance,
and quite unacceptable. Most of this discrepancy was attributable to the time it
took to open individual objects in Objectivity. This operation was using up
several thousand CPU cycles instead of the expected few hundred cycles.
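To see why the per-object open cost matters for a system that is meant to be I/O-limited, a rough upper bound on scan rate is the CPU clock rate divided by the cycles spent per object. The Python sketch below works through that arithmetic. The 500 MHz clock and the specific cycle counts (3000 for "several thousand", 300 for "a few hundred") are assumptions chosen for illustration and are not taken from the benchmarks.

```python
def max_objects_per_second(clock_hz, cycles_per_object):
    """Upper bound on scan rate if opening objects is the only CPU cost."""
    return clock_hz / cycles_per_object


ASSUMED_CLOCK_HZ = 500e6  # hypothetical 500 MHz CPU, not a measured value

slow = max_objects_per_second(ASSUMED_CLOCK_HZ, 3000)  # "several thousand" cycles
fast = max_objects_per_second(ASSUMED_CLOCK_HZ, 300)   # "a few hundred" cycles
print(f"{slow:,.0f} vs {fast:,.0f} objects/s, ratio {fast / slow:.0f}x")
```

Under these assumed numbers the CPU-bound ceiling differs by a factor of ten, which is the kind of gap that keeps a scan from ever reaching the disk's raw I/O rate.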
This problem was reported to Objectivity in the last quarter of 1999 and in
December 1999 Objectivity agreed to put it on their "warmlist" and fix it for
v6.0 due to be released in Summer 2000. However, our ongoing discussions with
Objectivity over the last few months revealed that the problem had not in fact
made it onto their warmlist. So, beginning in May 2000, we started to press
Objectivity on this issue and initiated discussions with Mats Persson at
Objectivity to resolve the problem. Alex Szalay went to Objectivity HQ for
several days to impress the urgency of this issue upon them. Peter Kunszt spent
several weeks in June conducting intensive profiling and benchmarking of our
Objectivity code and was able to finally convince Objectivity of the fundamental
nature and severity of the problem. We are happy to report that Mats has been
able to speed up the Objectivity code by a factor of 15 as a result of this
effort, and the performance is now comparable to SQL Server. However, these
changes are still limited to the Windows/NT version of Objectivity, and we
intend to work with Objectivity to ensure that most or all of them will make it
into the UNIX versions by the time Objectivity releases v6.0 (now due September
2000).
Results of benchmarking:
Straight (no predicate) Scan Speed (Mb/s) Tests between Jun 5 and Jun 27,
2000: SQL-Server vs Objectivity
2M objects
8M objects
Predicate Scan Speed (Mb/s) Tests between Jun 5 and Jun 27, 2000:
SQL-Server vs Objectivity
2M objects
8M objects
Hardware Issues:
We have been able to obtain sequential I/O speeds of 1.3M objects/s with SQL
Server (running 2 threads) and 450k objects/s with Objectivity (running as a
single thread) on an Intel/Linux box. The maximum raw I/O speed possible with
this hardware is in excess of 200Mb/s, and with the latest Objectivity
performance improvements, we expect to be able to achieve 100 Mb/s with
Objectivity on a Linux box in the near future. The speeds we have been able to
achieve on the SGI are quite disappointing by comparison (3 - 17 Mb/s) and
considering that SGI's implementation of UNIX and POSIX threads has a number of
peculiarities that have caused serious problems for us in the past (costing us
over 6 man-months of effort to resolve them), it seems that the Intel/Linux
clusters are the preferred hardware environment to deploy SX on in the future.