When a test cannot be scored automatically by computer, the alternative is to scan the answer sheets and have evaluators read the image files over a network. To understand the evaluators’ eye movements and scoring stability, their eye movements were monitored while they scored paper-and-pencil items and computer-based items, with or without an eye tracker. The results showed that evaluators with less computer experience and those in older age groups blinked more frequently, especially during shifts between item categories. The oldest age group had the highest eye movement frequency. All three types of scoring showed very short gaze durations on small items and longer gaze durations during shifts between item categories. The evaluation criteria and the answer format played a critical role in determining scoring stability: stability was highest for coding items, medium for specialized terms, and lowest for mapping items.