Sloan Digital Sky Survey
Review of Data Processing
Operations
Overview and Status of the Spectroscopic Pipeline
as of July 2000
Josh Frieman
July 7, 2000
Outline
I. Overview of the Spectroscopic Pipeline
- Purposes of the pipeline
- Structure of the pipeline
II. Status of the Pipeline
- A Brief History
- Pipeline Requirements and Current Performance with Respect to Requirements
III. Future of the Pipeline
- Tasks remaining to satisfy requirements and estimated time to completion
- Enhanced goals and plans to achieve them
- Staffing needed for long-term code maintenance and operations
Personnel: As far as I am aware, the following people have
contributed to the development of the Spectroscopic pipeline (I would like
to be informed of others): M. Bernardi, S. Burles, F. Castander, A.
Connolly, D. Finkbeiner, L. Hui, D. Johnston, J. Loveday, R. Lupton, A.
Meiksin, A. Merrelli, J. Munn, R. Nichol, A. Pope, D. Schlegel, M. Strauss,
M. Subbarao, B. Wilhite, B. Yanny.
I. Overview of the Spectroscopic Pipeline
1. Purposes of the pipeline
The spectroscopic pipeline is designed to be a multi-purpose, highly
automated pipeline for processing the 10^6 galaxy, QSO, and stellar spectra
from the SDSS spectrographs. This is an unprecedented challenge, mitigated
by the very high quality spectra these spectrographs produce.
The pipeline is designed to extract and process all spectra taken in the
course of the Survey, and specifically to:
- archive the reduced, red/blue combined, co-added 1d spectra
- classify objects (Star, Galaxy, QSO) independently of the target selection
pipeline
- estimate redshifts and provide spectral information for all objects
possible, with required redshift accuracy and success rates depending on
object type (see Sect. II)
In addition to the above goals, which enable science to be carried
out with the spectra, the pipeline has the following roles in survey
operations:
- provides near real-time diagnostic S/N outputs for APO observers so that
exposure time for each plate can be determined under varying conditions;
this is a critical element in determining Survey time to completion
- provides diagnostic outputs for spectroscopic data processors at Fermilab
so that data quality on each processed plate can be rapidly assessed as
satisfying Survey scientific requirements or not (i.e., a check on the
previous point); these outputs also provide QA monitoring of the
spectrographic system over time
- provides feedback to target selection on object classification and on
redshift success rates for different object classes as a function of
magnitude, surface brightness, etc; this allows the working groups to
optimize TS parameters for survey efficiency and completeness
Aside from these operational functions, the Spectroscopic
pipeline is not a critical-path item for Survey operations.
2. Structure and Functionality of the Pipeline
N.B.: The available documentation on the design of the spectroscopic
pipeline is unfortunately out of date (the last major update to the printed
documentation was at the time of the Spectroscopic pipeline review in April
'99). This will be rectified in the next few months, when technical papers
describing the 1d and 2d pipelines will be prepared for publication.
The Spectroscopic pipeline is split operationally into two parts, 2d and
1d. The 2d pipeline reduces the raw data and calibration images from the red
and blue CCD cameras from each spectrograph and outputs merged, combined,
flux-calibrated spectra and noise for analysis by the 1d pipeline. The 1d
pipeline determines emission and absorption redshifts, classifies spectra by
object type, and outputs spectral information about each object. During
normal data processing operations at Fermilab, a batch of spectroscopic
plates is first run through the 2d pipeline and then through 1d.
Spectro2d
In October of 1999, it was determined that the then-current version
of 2d, which was built around IRAF commands, was not flexible or robust
enough to satisfy the Survey requirements. The Spectro2d pipeline was
rewritten from scratch by S. Burles and D. Schlegel in IDL (idlspec2d).
The current version of the code used for data processing (3c) was tagged
in early May of this year and was subsequently determined to meet the
Survey requirements.
Key Tasks of Spectro2d
- Overscan
- Bias and dark subtraction
- Spatial tracing of fiber images from flat-field image
- Optimal extraction of flats and arcs
- Wavelength calibration from arc lamps and sky lines
- Flat-fielding: remove spectral response using flat exposures; remove
pixel-to-pixel response using uniformly illuminated flats
- Optimal object and sky extraction, including scattered light removal
- Sky subtraction using supersky built from sky fibers
- Telluric absorption removal using spectrophotometric standard spectra
- Flux calibration from spectra of spectrophotometric standard (F) stars
- Output spectrum, variance, and pixel mask information
- Rebin spectra and merge red and blue halves; heliocentric correction
- Combine multiple 15-minute exposures as needed to reach required S/N per
plate (multiple-night exposures are combined if necessary, but not if a
plate has been replugged)
- Output diagnostic information on cumulative S/N per plate: this is used by
the spectroscopic data processors to decide if a given plate satisfies
Survey requirements and can be declared "good, done" (and unplugged). The
current requirement is that the mean (S/N)^2 exceed 100 for objects with
fiber magnitudes of g'=19.2, r'=19.25, i'=18.9 in each spectrograph. This
limit will be reviewed and refined in the next month (see below)
- Output other diagnostic information that provides QA monitoring on the
spectroscopic system (e.g., errors in wavelength solution, QA for
scattered light levels, throughput compared to PHOTO g,r,i magnitudes,
etc)
For a list of the routines used in idlspec2d, see the IDL Help for
idlspec2d. A schematic sketch of this processing flow is given below.
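
To make the order of operations above concrete, here is a minimal Python
sketch of a per-frame reduction in the same sequence. All function and
variable names are hypothetical, chosen only for illustration: the actual
pipeline is written in IDL (idlspec2d), and each step below stands in for a
far more careful algorithm (profile-weighted extraction, full wavelength
solutions, scattered-light modeling, etc).

    import numpy as np

    def reduce_frame(raw, bias, flat, wave_coeffs):
        # Schematic per-frame reduction (all names hypothetical).
        # Overscan subtraction: last 32 columns taken as the overscan region.
        frame = raw - np.median(raw[:, -32:], axis=1, keepdims=True)
        # Bias subtraction and flat-fielding (pixel-to-pixel response).
        frame = (frame - bias) / np.where(flat > 0, flat, 1.0)
        # "Extraction": a plain sum over the spatial direction stands in for
        # the optimal (profile-weighted) extraction used by the real code.
        spectrum = frame.sum(axis=0)
        variance = np.clip(frame, 0, None).sum(axis=0)  # crude Poisson proxy
        # Wavelength calibration: pixel -> Angstrom via the arc-lamp fit.
        wavelength = np.polyval(wave_coeffs, np.arange(spectrum.size))
        return wavelength, spectrum, variance

    # Toy inputs standing in for a raw exposure, bias, flat, and arc solution
    rng = np.random.default_rng(0)
    raw = rng.poisson(100.0, size=(64, 2048)).astype(float)
    bias = np.zeros_like(raw)
    flat = np.ones_like(raw)
    wave, spec, var = reduce_frame(raw, bias, flat, [1.0, 3800.0])
    print(wave[0], wave[-1], round(spec.mean(), 1))

Sky subtraction, telluric correction, flux calibration, and the red/blue
combination would follow the same pattern, each consuming the spectrum and
variance arrays produced here.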
In addition, a faster, stripped-down version of the 2d pipeline is run on
the mountain during spectroscopic observations. This mountain-top version of
the 2d pipeline provides "real-time" information to the APO observers on the
approximate S/N achieved per 15-minute exposure.
Based on this information, the APO observers carry out as many
spectroscopic exposures on a plate as are necessary under the given
conditions to achieve the cumulative (S/N)^2 deemed necessary to meet the
Survey requirements. Currently, this is set at (S/N)^2 > 80 at the same
magnitudes as above; this translates approximately to (S/N)^2 > 100 in
idlspec2d.
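
The reason the squared quantity is the one tracked is that, for
background-limited exposures, (S/N)^2 accumulates approximately linearly
with exposure time, so the contributions of successive 15-minute exposures
simply add. A minimal sketch of the resulting observing loop (the threshold
value is the one quoted above; everything else is made up):

    SN2_THRESHOLD = 80.0   # mountain-top criterion quoted above

    def plate_done(sn2_per_exposure):
        # Cumulative (S/N)^2 is just the sum over exposures, since (S/N)^2
        # adds approximately linearly for background-limited observations.
        return sum(sn2_per_exposure) >= SN2_THRESHOLD

    exposures = []
    for sn2 in [22.0, 25.0, 18.0, 21.0]:   # made-up per-exposure values
        exposures.append(sn2)
        if plate_done(exposures):
            print("done after", len(exposures), "exposures;",
                  "cumulative (S/N)^2 =", sum(exposures))
            break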
Spectro1d
The 1d pipeline, which analyzes the combined spectra output by
spectro2d, is written in C and TCL. A velocity dispersion module written in
IDL has recently been added. The version of the code to be used in data
processing, 1d_4, is scheduled to be cut on July 10 of this year. The code outputs a
FITS image for each fiber: it includes the 1d spectrum, noise, and mask
array from 2d, basic information about the target from PHOTO and TS, as well
as line measurements and redshift determinations. These outputs are listed
in the (unofficial) Spectro Data Model; the
schema of the pipeline outputs as they will appear in the SX database can
be found at this Web site.
Although out of date, general information about the structure of the 1d
pipeline can be found at the Spectro
Pipeline Home Page.
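
As an illustration of how such a per-fiber file might be consumed
downstream, the sketch below reads flux, noise, and mask arrays from a FITS
file using astropy. The single-HDU row layout assumed here is purely
illustrative; the authoritative layout is whatever the Spectro Data Model
specifies.

    import numpy as np
    from astropy.io import fits

    def read_fiber(path):
        # Assumed layout (illustrative only): one image HDU whose three rows
        # are the combined spectrum, its noise, and the pixel mask.
        with fits.open(path) as hdul:
            flux, noise, mask = hdul[0].data
        good = (mask == 0) & (noise > 0)
        return flux, noise, good

    # Round-trip demo with a fake fiber so the sketch runs stand-alone
    demo = np.vstack([np.ones(2048), np.full(2048, 0.1), np.zeros(2048)])
    fits.PrimaryHDU(demo).writeto("fiber_demo.fits", overwrite=True)
    flux, noise, good = read_fiber("fiber_demo.fits")
    print("median S/N per pixel:", np.median(flux[good] / noise[good]))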
Key Tasks of Spectro1d
- Fit and subtract continuum
- Find and fit emission lines (using a wavelet filter)
- Identify emission lines and determine an emission-line redshift and error
where possible
- Cross-correlate the spectra with stellar, galaxy, and QSO templates to
determine a cross-correlation redshift and error (emission lines are
removed in this procedure, except for the QSO template); a bare-bones
sketch of the cross-correlation idea appears after this list
- Classify all object spectra as Galaxy, QSO, high-redshift QSO, Star, or
unknown (currently a discrete choice)
- Output a final redshift by comparing emission and absorption redshift
confidence levels; final redshift determination is denoted high
confidence, low confidence, inconsistent, or failed
- Output information on identified emission and absorption lines (equivalent
widths, etc)
- Output warning flags for spectra (e.g., blue or red half of spectrum
missing; conflicting high-CL cross-correlation redshifts; outlier redshift
for object class; classification differs from TS)
- Flag low-confidence level redshift determinations for subsequent
inspection and interactive z-determination
- Provide an interactive redshift determination environment for inconsistent,
failed, and other spectra, to be used by Spectro pipeline developers to
manually determine redshifts. It is expected that this will be necessary
for a few percent of all spectra; it will also be used on a spot-check
basis for QA monitoring.
- Estimate velocity dispersion and error for galaxies
- Provide diagnostic output for Spectroscopic data processors to monitor QA
(e.g., compare emission and absorption redshifts for all objects where
both are available and check dispersion and outliers; plot redshift vs.
magnitude and redshift histogram for each plate, etc)
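
The cross-correlation step referenced in the list above rests on a simple
idea: rebinned onto a uniform log-wavelength grid, a redshift becomes a
rigid shift, so the lag of the correlation peak against a rest-frame
template yields z. The sketch below is a bare-bones version of that idea
only; the actual spectro1d code adds continuum subtraction, emission-line
masking, error estimation, and confidence levels.

    import numpy as np

    def xcorr_redshift(loglam, flux, tmpl_loglam, tmpl_flux):
        # On a uniform log10(lambda) grid a redshift is a rigid shift:
        #   log10(lam_obs) = log10(lam_rest) + log10(1 + z),
        # so the peak of the cross-correlation gives the shift, hence z.
        # Both grids are assumed to share the same spacing.
        dloglam = loglam[1] - loglam[0]
        f = flux - flux.mean()
        t = tmpl_flux - tmpl_flux.mean()
        cc = np.correlate(f, t, mode="full")
        lag = cc.argmax() - (len(t) - 1)          # best shift in pixels
        shift = (loglam[0] - tmpl_loglam[0]) + lag * dloglam
        return 10.0 ** shift - 1.0

    # Toy test: a template with one absorption dip, "observed" at z = 0.1
    dll = 1e-4                                    # log10 spacing per pixel
    tmpl_loglam = 3.58 + dll * np.arange(3000)
    tmpl_flux = 1.0 - 0.5 * np.exp(-0.5 * ((tmpl_loglam - 3.72) / 3e-4) ** 2)
    z_true = 0.10
    loglam = 3.60 + dll * np.arange(3000)         # observed-frame grid
    flux = np.interp(loglam - np.log10(1 + z_true), tmpl_loglam, tmpl_flux)
    print(round(xcorr_redshift(loglam, flux, tmpl_loglam, tmpl_flux), 3))

A production version would also fit the correlation peak for a sub-pixel
centroid and derive an error and confidence level from the peak's shape
and height relative to secondary peaks.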
II. Status of the Spectroscopic Pipeline
1. Recent History of Pipeline Development
- Prior to 3/99: Pipeline development. Pipeline testing by M. Strauss using
simulations provided by D. Schneider.
- 3/99: first 3 spectroscopic plates taken with SDSS 2.5m telescope
- 4/99: Review of Spectroscopic pipeline at Chicago.
- 9-10/99: First calibrated spectra (despite shutter problems and light
leaks).
- 10/99: Spectroscopic pipeline meeting at Fermilab. Decision taken shortly
thereafter to rewrite spectro2d.
- 12/99: Spectroscopic pipeline meeting at Princeton. Task lists and
schedule to pipeline completion updated. As of May 2000, the 2d and 1d
pipelines were within 1-2 months of this schedule.
- 3/00-6/00: Spectra of improved quality and in larger numbers obtained.
Processing of the data through the pipeline by spectroscopic data
processors at Fermilab gradually becoming routine.
- 5/00: Spectroscopic pipeline meeting at Fermilab, shortly after cuts of 2d
and 1d. 2d determined to satisfy Survey requirements.
- 7/00: New version of 1d cut and used to reprocess plates at FNAL. Data
Processing Review at FNAL.
2. Spectroscopic Requirements and Current Performance
The requirements on the performance of the SDSS Spectroscopic
pipeline are described in two documents: Section 7 of the Scientific
Requirements and Scientific Commissioning for the SDSS and the Offline
software processing requirements. Note also that some of the
requirements on the spectroscopic system are implicitly linked to
requirements on Target Selection enumerated in Section 6 of the Scientific
Requirements and Scientific Commissioning for the SDSS.
The current status of the Spectroscopic pipeline in terms of the Survey
requirements is described in varying levels of detail in three documents:
- Michael Strauss' document, Status of Requirements for the Survey,
provides a general overview.
- Mark Subbarao's webpage, Status of 1d with Respect to SDSS Science
Requirements, describes the
current performance of the 1d pipeline with respect to the requirements
set out in Sec. 7 of the Scientific
Requirements and Scientific Commissioning for the SDSS as well as
those set out in the Offline
software processing requirements. Parts of the information on this
page have been incorporated into Requirements
and Status of the SDSS Spectroscopic Systems. The numbers reported on
this page are based on comparison of spectro1d outputs (using the checked-out
version of the code as of 6/23/00) with manual determination of
redshifts on several plates of varying S/N taken in the Spring of this
year. Over the next month, all reduced spectroscopic plates taken so far
(over 40) will be manually checked; this will allow more accurate
statistics on pipeline and spectroscopic system performance as a function
of plate S/N to be compiled.
Summary on Pipeline Status and Requirements
Based on tests carried out and summarized in the documents above,
- idlspec2d is deemed to satisfy the Survey science requirements
- spectro1d satisfies most of the Survey science requirements at the current
minimum acceptable S/N level per plate. This includes the requirements on
redshift success rates for galaxies, BRGs, and QSOs, the redshift accuracy
requirements on galaxies and QSOs (though these have only been tested
internally; external comparison is difficult and remains to be carried
out), and the successful classification of galaxy and QSO spectra as such.
The spectro1d requirements which are not yet satisfied or verified are
the following:
- Although 95% of QSO spectra are successfully classified as such, the
requirements also state that 99% of the QSO spectra not correctly
classified should be classified as unknown. Currently the majority of
misclassified QSO spectra (typically 0 or 1 per plate) are instead
identified as stars or galaxies by the pipeline. The goal is that
misclassified QSO spectra will instead be flagged for manual classification
and redshift determination.
- Currently 96% of stellar spectra are correctly classified by the pipeline
(based on 5 plates), while the requirement is 99%. This will be addressed
by the addition of hot and cool stellar templates for the
cross-correlation analysis.
- The status of the requirement on the accuracy of BAL QSO redshifts is not
yet known; internal reproducibility is within the required accuracy
limit, but comparison with manual redshift determinations needs to be done.
This will occur in the next few weeks.
Note: internal and external verification of redshift accuracy
for galaxies at the 30 km/sec level is a difficult problem. The pipeline
developers will compare 1d redshifts with high-resolution redshifts from the
literature where available. Aside from this, manual determination of
redshifts of all the spectra taken so far (~42 plates) will be used as a
"truth table"; statistical comparison with these manual redshifts will be
used to determine whether the pipeline meets the science requirements of the
survey.
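
The statistical comparison can be reduced to velocity residuals: for each
object, dv = c (z_1d - z_manual)/(1 + z), with the core width of the
distribution checked against the 30 km/sec requirement and its tails
counted as outliers. A minimal sketch with made-up inputs:

    import numpy as np

    C_KMS = 299792.458                      # speed of light in km/s

    def velocity_residuals(z_pipeline, z_manual):
        # dv = c * (z1 - z2) / (1 + z), in km/s
        z1, z2 = np.asarray(z_pipeline), np.asarray(z_manual)
        return C_KMS * (z1 - z2) / (1.0 + z2)

    # Hypothetical inputs: pipeline vs. manual redshifts for one plate
    rng = np.random.default_rng(1)
    z_man = rng.uniform(0.02, 0.25, size=500)
    z_pipe = z_man + rng.normal(0.0, 1e-4, size=500)   # ~30 km/s scatter

    dv = velocity_residuals(z_pipe, z_man)
    # Robust scatter estimate (68th percentile of |dv|) plus outlier count
    sigma = np.percentile(np.abs(dv), 68.0)
    outliers = int((np.abs(dv) > 5 * sigma).sum())
    print(f"sigma_v = {sigma:.1f} km/s, {outliers} outliers beyond 5 sigma")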
III. Future Development of the Spectroscopic Pipeline
1. Remaining Tasks and Estimated Time to Completion
The near-term prioritized tasks for the 1d pipeline listed here are aimed
at satisfying the remaining Survey science requirements. It is expected that
these will be completed by 9/1/00:
- Add stellar templates to improve stellar classification success rate
- Improve QSO classification scheme so that unclassified QSO spectra are
flagged for visual inspection
- Manually determine redshifts and classification for all objects on all
plates observed so far; this will be used to check classification and
redshift success rates as functions of plate S/N and thus to determine
required plate S/N for survey operations. It will also be used to determine
whether additional parameters beyond plate S/N need to be checked in
declaring plates "done".
- Recalibrate emission and absorption redshift Confidence Levels using
manual redshifts above.
- Implement redshift warning flags (e.g., missing half-spectra, discordant
high-CL redshift determinations, etc)
- Check external accuracy of redshifts where possible from the literature
- Check BAL QSO redshifts against manual determination from the QSO working
group to verify required accuracy
- Improve continuum fitting algorithm (expected to improve QSO redshift
success and classification)
In addition to these tasks, both the 2d and 1d pipelines need to
implement regression tests to ensure code robustness in the course of future
development.
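
A regression test in this context can be as simple as rerunning the
pipeline on a fixed reference exposure after every code change and
comparing key outputs against stored values from a trusted run. A minimal
sketch (all data values made up):

    import numpy as np

    def check_against_reference(new, reference, rtol=1e-5):
        # Compare key pipeline outputs to a stored trusted run.
        ok = np.allclose(new["z"], reference["z"], rtol=rtol)
        ok &= np.allclose(new["flux"], reference["flux"], rtol=rtol)
        return ok

    reference = {"z": np.array([0.1021, 0.0874]), "flux": np.ones((2, 2048))}
    new = {"z": np.array([0.1021, 0.0874]), "flux": np.ones((2, 2048))}
    print("regression test passed:", check_against_reference(new, reference))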
2. Enhanced Goals and Estimated Time to Achieve Them
Spectro2d:
It is expected that the following enhanced goals will be achieved by
approximately 10/1/00:
- output estimated error covariance in addition to variance
- implement code to reduce binned diamond-pattern exposures
- improve flux calibration based on spectrophotometric standards (tentative)
Spectro1d:
The following enhanced goals of Spectro1d are expected to be achieved at
various stages, but in all cases by the end of 2000; they are listed in
expected order of completion:
- Test and finalize galaxy velocity dispersion code
- Output line indices
- Implement galaxy and QSO spectral classification using eigenspectrum (PCA)
analysis; a toy sketch of the PCA construction appears after this list
- Implement stellar spectral classification using non-continuum-subtracted
spectra/templates
- Determine feasibility of galaxy spectrophotometry using diamond-pattern
exposures; implement to the extent possible
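
For the eigenspectrum goal, the standard construction is to build a PCA
basis from a training set of spectra via SVD and then expand each new
spectrum in that basis; classification then works on the expansion
coefficients. A toy sketch (training set and spectra fabricated purely for
illustration):

    import numpy as np

    def eigenspectra(training, n_components=5):
        # PCA basis from a (n_spectra, n_pixels) training set via SVD.
        mean = training.mean(axis=0)
        _, _, vt = np.linalg.svd(training - mean, full_matrices=False)
        return mean, vt[:n_components]      # mean spectrum + eigenspectra

    def pca_coeffs(spectrum, mean, basis):
        # Expansion coefficients of a spectrum in the eigenspectrum basis.
        return basis @ (spectrum - mean)

    # Toy demo: 20 fake "galaxy" spectra with varying break strengths
    rng = np.random.default_rng(2)
    pix = np.linspace(0.0, 1.0, 200)
    train = np.array(
        [1.0 + a * np.tanh(10 * (pix - 0.4)) + rng.normal(0, 0.01, 200)
         for a in rng.uniform(0.1, 0.5, 20)])
    mean, basis = eigenspectra(train, n_components=3)
    coeffs = pca_coeffs(train[0], mean, basis)
    print("leading PCA coefficients:", np.round(coeffs, 2))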
3. Long-term Staffing Needs
The long-term staffing needs for the spectroscopic pipeline are based on
the two roles that will be needed during Survey Operations:
- Long-term code maintenance: QA monitoring; possible code development if
the hardware changes significantly (e.g., some component breaks or degrades
over time); possible additional enhancements desired by the collaboration;
assistance as needed to the Spectroscopic data processors at Fermilab
- Interactive Redshift Determination: this will be carried out by the
pipeline developers over the course of the Survey for up to a few percent
of all spectra. This will provide z determinations and object
classifications for objects where the automated pipeline cannot make a
determination, as well as spot-checking of the automated pipeline for QA.
The anticipated long-term staffing level (2001 and beyond) needed to meet
these requirements is 1 FTE supported by ARC.