Sloan Digital Sky Survey

Review of Data Processing Operations

Science Archive

Ani Thakar
July 7, 2000

Science Archive Overview

The SDSS Science Archive (SX) consists of the following components:

- A commercial object-oriented DBMS - Objectivity - that serves as the data repository or warehouse. In addition to providing persistence and data integrity for an object data model, Objectivity also provides binary compatibility across the major hardware platforms, transparent and configurable data organization, superior data-mining performance for complex data models, distributed database capability and automated replication of the archive.
- A client-server interface to the data warehouse layer that consists of:
  - A Tcl/Tk GUI client - the SX Query Tool or sxQT - that enables users to formulate queries in the SQL-like SX Query Language (SXQL) and submit them to the SX server of their choice (see below). The sxQT provides a multi-window environment that enables the user to initiate multiple sessions with different SX servers and submit multiple queries in parallel.
  - A multi-threaded and distributed server (sxServer) written in C++ that creates and manages user sessions and executes user queries. It performs user authentication and routes query output in the appropriate format to the specified output targets. The sxServer has the ability to handle multiple parallel user sessions as well as multiple parallel user queries per session. Queries that need to access distributed data are executed in distributed mode by sending the required parts of the queries to remote hosts and executing them with remote slave servers.
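The distributed execution mode described above (parts of a query sent to remote hosts, results routed back to the user) can be illustrated with a small sketch. This is not the sxServer implementation, which is written in C++; the host names, the `PARTITIONS` table, and the functions `run_subquery` and `run_distributed` are all hypothetical, and threads on one machine stand in for remote servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitioning: each "remote host" holds part of the catalogue.
PARTITIONS = {
    "hostA": [{"id": 1, "r": 17.5}, {"id": 2, "r": 21.3}],
    "hostB": [{"id": 3, "r": 18.9}],
    "hostC": [{"id": 4, "r": 22.0}, {"id": 5, "r": 16.1}],
}


def run_subquery(host, predicate):
    """Stand-in for a remote server evaluating its part of the query."""
    return [obj for obj in PARTITIONS[host] if predicate(obj)]


def run_distributed(predicate):
    """Send the predicate to every host in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        futures = [pool.submit(run_subquery, host, predicate)
                   for host in PARTITIONS]
        merged = []
        for future in futures:
            merged.extend(future.result())
    # Sort so the merged output does not depend on which host answered first.
    return sorted(merged, key=lambda obj: obj["id"])


# Example: all objects brighter than r = 19.0 across the three partitions.
bright = run_distributed(lambda obj: obj["r"] < 19.0)
```

The merge step is kept deliberately simple; a real server would also have to handle authentication, output formatting, and failed hosts.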

Status of SX Requirements

The original requirements for the SX (circa 1995) are listed below, with an indication beside each requirement as to whether it has been met (Yes), not met (No) or is no longer applicable (NA). Other comments are added as necessary.

The science archive shall consist of:

A science database that shall:

Retain calibrated object catalogs (photometric CCD output) Yes
Retain parameters from spectroscopic pipeline Yes
Enhanced goal: provide ability to recalibrate object catalogs No
Retain references to atlas images and extracted spectra Yes
Provide ability to carry out manual target selection for certain target categories Yes
Provide ability to carry out offline QA activities Yes
Provide ability for SDSS scientists to extract subsets of retained data Yes
Enhanced goal: Provide ability to retain scientist-derived data sets No
Enhanced goal: Provide smooth transition to public distribution system No

A set of files tracked by the science database Yes (tsObjc, tsField, fpAtlas, fpBin, psSpec, fpMask)
A set of files not tracked by the science database Yes (all other files)

Summary:

All primary science archive requirements have been met, with the exception of the enhanced goals.

I. Input to Science Archive

Survey Definition

A description of the North Imaging survey area Yes
Survey progress: A description of sky inserted into database to date Yes

Final Astrometric Calibration

List of r' band calibration coefficients on a frame-by-frame basis Yes

[TBD: Are position errors stored on an object-by-object basis?]

Final Photometric Calibration

List of photometric calibration coefficients on a frame-by-frame basis Yes

Merged Object Lists

A list of calibrated objects and parameters from the Frames pipeline of photo Yes
A list of object masks from the Frames pipeline of photo No (photo not providing masks, only bit flags), needs to be extracted (who, when?)

[TBD: Do 1 and 2 provide all information about masked areas of sky?]
Run and Field information. Might be needed for recalibrations Yes
Star/Galaxy/QSO classifications Yes
Enhanced goal: Cross-identifications to other catalogs No (wait till real mandate for this; FIRST, USNO, ROSAT incl.)

Target Selection

A list of all targetable objects with target selection categories No (target does not store this - to be changed)
A list of all objects from the list above that were selected as targets, with selection category Yes
Tiling flags for all objects selected as targets Yes (coming with schema for spectra)
Reports for all targets selected manually No (target does not store this - to be changed)

Spectroscopic Pipeline

Redshifts and parameters of all targeted objects Yes
Enhanced goal: Tile and plate information Yes

[TBD: If a target has multiple spectra obtained, is there a need to assign one as a primary measurement?] NA
Enhanced goal: Scientist derived catalogs No
Enhanced goal: Other input catalogs No

Separate files tracked from Science Database

Atlas Images Yes (under construction)
1-D spectra Yes (under construction)

Separate files not tracked or accessible from Science Archive

Compressed pixel map
Full corrected pixel map
Corrected spectroscopic frames
Unused data

[TBD: Southern Survey Yes (same as Northern)]

Summary:

All the basic requirements for Inputs have been met - the SX meets these requirements to the extent possible with the current inputs provided by the photo and spectro pipelines. The enhanced goals have not been met.

II. Functional Goals

User will be able to carry out efficient queries to locate objects over one or more ranges of following attributes:

Longitude or latitude in several spherical coordinates
1. J2000 Ra and Dec Yes
2. J1950 Ra and Dec Yes
3. Enhanced goal: Ra and Dec of arbitrary epoch No
4. Galactic coordinates Yes
5. Survey Coordinates Yes
6. Any linear combination of the two coordinates Yes
Radius within a given point of the sky Yes
u' g' r' i' z' (One set of magnitudes per object) Yes
Any linear combination of the magnitudes above Yes
Object radius (one per object) Yes
Surface brightness formed by c and d Yes
Star/Galaxy/QSO Classification flag Yes
Object class (small/medium/big/mask) Yes
Target Selection Category Yes (target flags)
Spectrum available flag Yes (link exists?)
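Two of the query attributes listed above, a radius around a point on the sky and a linear combination of magnitudes, can be sketched in a few lines. This is an illustration only, not SXQL and not the SX implementation: the field names (`ra`, `dec`, `g`, `r`), the sample rows, and the functions `angular_separation_deg` and `select` are assumptions made for the example.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable at small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))


def select(objects, ra0, dec0, radius_deg, color_min):
    """Keep objects within radius_deg of (ra0, dec0) whose g - r colour is >= color_min."""
    return [
        obj for obj in objects
        if angular_separation_deg(obj["ra"], obj["dec"], ra0, dec0) <= radius_deg
        and (obj["g"] - obj["r"]) >= color_min
    ]


# Hypothetical catalogue rows (coordinates in degrees, magnitudes made up).
catalog = [
    {"id": 1, "ra": 180.00, "dec": 0.00, "g": 18.2, "r": 17.5},
    {"id": 2, "ra": 180.05, "dec": 0.02, "g": 19.0, "r": 18.9},
    {"id": 3, "ra": 185.00, "dec": 2.00, "g": 20.1, "r": 19.0},
]
hits = select(catalog, ra0=180.0, dec0=0.0, radius_deg=0.1, color_min=0.5)
```

A linear scan like this is the opposite of the "efficient queries" the requirement asks for; an archive would use a spatial index so that only the relevant part of the sky is read.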

User will be able to carry out queries on any retained object parameter (subject to implementation constraints) Yes

Enhanced Goal:

All calibrated quantities can be recomputed using improved astrometric and photometric calibrations. Queries can be performed on the recalibrated quantities Yes (in principle, with methods defined on objects)
For all efficient queries, return an estimated number of objects to be located Yes
For all located objects, users shall be able to specify an arbitrary subset of stored parameters to be returned (subject to implementation constraints) plus the following derived quantities: Yes
1. Number of located objects Yes
2. TBD: Extra parameters Yes (e.g. AVERAGE, MIN, MAX, STDDEV etc.)
Users shall be able to perform the following functions:
1. Efficient repeated queries (e.g., get all objects within each of 10,000 QSOs in my favorite catalog) No
2. Make simple plots, etc. of returned parameters (e.g., SMONGO) No, but can do with AEs
3. Formulate new queries based on results of previous queries NA
4. TBD: What else?
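The derived quantities mentioned above (the number of located objects plus aggregates such as AVERAGE, MIN, MAX and STDDEV) can be sketched as follows. The function name `aggregate` and the sample magnitudes are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption; the report does not say which definition the SX uses.

```python
import statistics


def aggregate(values):
    """Summarise one returned parameter over all located objects."""
    return {
        "count": len(values),
        "average": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        # Assumption: sample standard deviation (divides by n - 1).
        "stddev": statistics.stdev(values),
    }


# Hypothetical r' magnitudes of three located objects.
summary = aggregate([17.0, 18.0, 19.0])
```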

Summary:

All the basic functional goals, with the exception of efficient repeated queries and the ability to make plots in the SX, have been met. One of the two enhanced goals has been met.

----------------

III. Technical Goals

User interface

User interface shall be developed in a TCL/Tk/TclDp environment Yes
User interface shall communicate with a query support layer via ASCII interface protocol Yes
Data shall be returnable to files, sockets, or pipes.

Returned data shall use binary machine independent format (FITS binary, ASCII if appropriate, TBD: FITS ASCII, other?) Yes

Enhanced goal:

Data shall be stored in a system providing industry-standard OSQL-like interface to enable use of commercial products to provide alternative view of the database Yes (Objy has SQL++)

Distributability

A master copy of all data shall be maintained (the Master Science Archive) Yes
Capability shall be present to replicate all or part of the Master Science Archive as local databases at SDSS institutions. Replication may consist of:
1. Science Database in its entirety Yes
2. All or part of separate files tracked by Science Database Yes
3. No capability shall be present to replicate an arbitrarily selected subset of the science database beyond that described by section 1.3 of USER INTERFACE Yes
4. Replication of databases shall be possible on all SDSS supported platforms Yes
No capability is required to be present to replicate all or part of separate files not tracked by Science Database Yes

Security

Master Science Archive shall be protected against corruption by SDSS participant user Yes
Master Science Archive shall be protected against unauthorized access Yes

Summary:

All the technical goals have been met.

SX Usage Stats

Usage statistics for the SX for the last 6 months are shown below.

Total number of SX logins

Total number of queries submitted

Number of queries completed

Number of queries aborted by user

Number of queries not completed due to system errors

Average query execution time (predicted : actual)

SX usage by institution (# of logins for each institution)

Usage with time chart (# of logins per day)

Stability of SX server:

# of server crashes per day (plot since Mar/00)

SX bug reporting stats

SX Performance and Objectivity Issues

Our goal is to have a scalable, primarily I/O-limited archival system. The current release of Objectivity has prevented us from reaching that goal due to performance problems that we have identified with it. Benchmarks conducted by the SX development team that compared Objectivity performance with Microsoft's SQL Server indicate that scan speeds with Objectivity were more than an order of magnitude slower than with SQL Server. Even after allowing for the additional complexity of an object database, this was a serious shortfall in performance, and quite unacceptable. Most of this discrepancy was attributable to the time it took to open individual objects in Objectivity. This operation was using up several thousand CPU cycles instead of the expected few hundred cycles.
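To see why the per-object open cost matters, note that when a scan is CPU-bound its rate is simply the clock rate divided by the cycles spent per object. The sketch below works through that arithmetic. The clock rate and the two cycle counts are illustrative assumptions chosen to match "several thousand" and "a few hundred" cycles; they are not measurements from the benchmarks.

```python
def cpu_bound_scan_rate(clock_hz, cycles_per_object):
    """Objects per second when per-object CPU cost is the only limit."""
    return clock_hz / cycles_per_object


CLOCK_HZ = 500e6      # assumed CPU clock rate, not taken from the report
SLOW_CYCLES = 3000    # illustrative stand-in for "several thousand" cycles
FAST_CYCLES = 300     # illustrative stand-in for "a few hundred" cycles

slow_rate = cpu_bound_scan_rate(CLOCK_HZ, SLOW_CYCLES)
fast_rate = cpu_bound_scan_rate(CLOCK_HZ, FAST_CYCLES)

# The speed-up is the ratio of the two cycle counts, independent of the clock.
speedup = fast_rate / slow_rate
```

With these assumed numbers the ratio is one order of magnitude, the same scale as the gap between Objectivity and SQL Server described above.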

This problem was reported to Objectivity in the last quarter of 1999 and in December 1999 Objectivity agreed to put it on their "warmlist" and fix it for v6.0 due to be released in Summer 2000. However, our ongoing discussions with Objectivity over the last few months revealed that the problem had not in fact made it on their warmlist. So beginning in May 2000 we started to press Objectivity on this issue and initiated discussions with Mats Persson at Objectivity to resolve the problem. Alex Szalay went to Objectivity HQ for several days to impress the urgency of this issue upon them. Peter Kunszt spent several weeks in June conducting intensive profiling and benchmarking of our Objectivity code and was able to finally convince Objectivity of the fundamental nature and severity of the problem. We are happy to report that Mats has been able to speed up the Objectivity code by a factor of 15 as a result of this effort, and the performance is now comparable to SQL Server. However, these changes are still limited to the Windows/NT version of Objectivity, and we intend to work with Objectivity to ensure that most or all of them will make it into the UNIX versions by the time Objectivity releases v6.0 (now due September 2000).

Results of benchmarking:

Straight (no predicate) Scan Speed (Mb/s) Tests between Jun 5 and Jun 27, 2000: SQL-Server vs Objectivity

2M objects

8M objects

Predicate Scan Speed (Mb/s) Tests between Jun 5 and Jun 27, 2000: SQL-Server vs Objectivity

2M objects

8M objects

Hardware Issues:

We have been able to obtain sequential I/O speeds of 1.3M objects/s with SQL Server (running 2 threads) and 450k objects/s with Objectivity (running as a single thread) on an Intel/Linux box. The maximum raw I/O speed possible with this hardware is in excess of 200 Mb/s, and with the latest Objectivity performance improvements, we expect to be able to achieve 100 Mb/s with Objectivity on a Linux box in the near future. The speeds we have been able to achieve on the SGI are quite disappointing by comparison (3 - 17 Mb/s). In addition, SGI's implementation of UNIX and POSIX threads has a number of peculiarities that have caused serious problems for us in the past, costing us over 6 man-months of effort to resolve. Intel/Linux clusters therefore appear to be the preferred hardware environment on which to deploy the SX in the future.