PURPOSE: To compare the psychometric performance of two rating instruments used to assess trainee performance in three clinical scenarios. METHODS: This study was part of a two-phase, randomized trial with a wait-list control condition assessing the effectiveness of a pediatric emergency medicine curriculum targeting general emergency medicine residents. Residents received 6 hours of instruction either before or after the first assessment. Separate pairs of raters completed either a dichotomous checklist for each of three cases or the Global Performance Assessment Tool (GPAT), an anchored multidimensional scale. A fully crossed person × rater × case generalizability study was conducted. The effect of training year on performance was assessed using multivariate analysis of variance. RESULTS: The person and person × case components accounted for most of the score variance for both instruments. With either instrument, scores showed a small but significant increase as training level increased. The inter-rater reliability coefficient was >0.9 for both instruments. CONCLUSIONS: Our checklist and anchored global rating instrument performed in a psychometrically similar fashion, with high reliability. Provided that proper attention is given to instrument design and testing and to rater training, checklists and anchored assessment scales can produce reproducible data for a given population of subjects. The validity of the data arising from either instrument type must be assessed rigorously and with a focus, when practicable, on patient care outcomes.
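
As an illustration of the fully crossed person × rater × case generalizability analysis named in METHODS, the following is a minimal sketch of how variance components might be estimated from ANOVA mean squares under a random-effects model with one score per cell. It is not the authors' analysis code: the function names `g_study_prc` and `relative_g`, the array layout, and the choice to truncate negative estimates at zero are all assumptions made for this example.

```python
import numpy as np


def g_study_prc(x):
    """Variance components for a fully crossed p x r x c design.

    x: array of shape (n_persons, n_raters, n_cases), one score per cell.
    Hypothetical helper; assumes all facets are random and data are complete.
    """
    x = np.asarray(x, dtype=float)
    n_p, n_r, n_c = x.shape

    # Grand, marginal, and two-way cell means.
    m = x.mean()
    m_p, m_r, m_c = x.mean(axis=(1, 2)), x.mean(axis=(0, 2)), x.mean(axis=(0, 1))
    m_pr, m_pc, m_rc = x.mean(axis=2), x.mean(axis=1), x.mean(axis=0)

    # Mean squares for main effects and two-way interactions.
    ms_p = n_r * n_c * np.sum((m_p - m) ** 2) / (n_p - 1)
    ms_r = n_p * n_c * np.sum((m_r - m) ** 2) / (n_r - 1)
    ms_c = n_p * n_r * np.sum((m_c - m) ** 2) / (n_c - 1)
    ms_pr = (n_c * np.sum((m_pr - m_p[:, None] - m_r[None, :] + m) ** 2)
             / ((n_p - 1) * (n_r - 1)))
    ms_pc = (n_r * np.sum((m_pc - m_p[:, None] - m_c[None, :] + m) ** 2)
             / ((n_p - 1) * (n_c - 1)))
    ms_rc = (n_p * np.sum((m_rc - m_r[:, None] - m_c[None, :] + m) ** 2)
             / ((n_r - 1) * (n_c - 1)))

    # Residual: three-way interaction confounded with error.
    resid = (x - m_pr[:, :, None] - m_pc[:, None, :] - m_rc[None, :, :]
             + m_p[:, None, None] + m_r[None, :, None] + m_c[None, None, :] - m)
    ms_prc = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1) * (n_c - 1))

    # Solve the expected-mean-square equations for each component.
    vc = {
        "p": (ms_p - ms_pr - ms_pc + ms_prc) / (n_r * n_c),
        "r": (ms_r - ms_pr - ms_rc + ms_prc) / (n_p * n_c),
        "c": (ms_c - ms_pc - ms_rc + ms_prc) / (n_p * n_r),
        "pr": (ms_pr - ms_prc) / n_c,
        "pc": (ms_pc - ms_prc) / n_r,
        "rc": (ms_rc - ms_prc) / n_p,
        "prc,e": ms_prc,
    }
    # Assumption: negative estimates are set to zero, a common convention.
    return {k: max(float(v), 0.0) for k, v in vc.items()}


def relative_g(vc, n_r, n_c):
    """Relative generalizability coefficient for n_r raters and n_c cases."""
    rel_error = vc["pr"] / n_r + vc["pc"] / n_c + vc["prc,e"] / (n_r * n_c)
    return vc["p"] / (vc["p"] + rel_error)


if __name__ == "__main__":
    # Synthetic scores only: 20 persons, 2 raters, 3 cases (not study data).
    rng = np.random.default_rng(0)
    scores = (rng.normal(0, 1.0, (20, 1, 1))      # person effect
              + rng.normal(0, 0.7, (20, 1, 3))    # person x case effect
              + rng.normal(0, 0.3, (20, 2, 3)))   # residual
    components = g_study_prc(scores)
    total = sum(components.values())
    for name, value in components.items():
        print(f"{name:6s} {value:.3f} ({100 * value / total:.1f}% of total)")
    print("relative G:", round(relative_g(components, n_r=2, n_c=3), 3))
```

In this sketch, a large share of variance in the `p` and `pc` components would correspond to the pattern the abstract reports, in which the person and person × case components account for most of the score variance. The synthetic data are generated only to make the example runnable and say nothing about the study's actual values.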