Written by Dennis Maynes, Chief Scientist, Caveon Data Forensics
Reliability and validity are the two pillars that support the psychometric standards of testing. Measurement specialists take these two foundational principles extremely seriously. In fact, nearly every aspect of their work is focused on administering reliable and valid assessments. As long as test fraud has been seen as a behavioral problem and not a validity threat, psychometricians have been content to ignore security risks and cheating on exams. The times are changing, however. I recently attended the Conference on Statistical Detection of Potential Test Fraud in Madison, Wisconsin. Over forty presentations covered many important issues in test security. The single topic that received more attention than any other was the detection of item compromise and item pre-knowledge.
Greg Cizek, former president of NCME, remarked that we need to treat test security breaches as measurement issues, not behavioral ones. This is a position that I have advocated for some time. We should question the validity of the test score, not the presence or absence of ethical behavior. It is refreshing to me that measurement professionals are adopting this point of view. When you have seen as much data as I have seen, it doesn't take long to realize that cheating damages the integrity of the process and the sanctity of the assessment more than anything else. I believe that measurement professionals are finally realizing that cheating strikes at the very heart of reliability and validity. Assessments are simply not reliable or valid when the items are compromised. We are engaged in a serious campaign to stem the rising tide of cheating, or at least to detect and invalidate the scores of those who cheat.
At this year’s conference, researchers used live and simulated data to discuss issues concerning item compromise. A wide variety of methods were presented, with varying degrees of success. Some of the presented research used:
- Similarity statistics to detect potential item compromise,
- Kullback-Leibler divergence to model operational and pretest items to detect compromised operational items,
- Simulated Annealing and combinatorial search to detect unknown groups of test takers who have had pre-knowledge of unknown subsets of the items,
- Deterministic Gated models to detect performance differences on secure versus compromised items,
- Jackknife residuals to indicate potential contamination of test result data by individuals with pre-knowledge of items, and
- Artificial Neural Networks to learn by example the characteristics of test takers who used compromised test items.
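To make the second of these ideas concrete, here is a minimal sketch of a Kullback-Leibler divergence computation. The response-option proportions below are invented for illustration, and the comparison (an item's operational response distribution against its pretest distribution) is only one plausible way such a statistic might be framed, not the method any presenter actually used. A large divergence would suggest the item is behaving differently than when it was calibrated, which is one symptom of compromise.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) between two discrete
    probability distributions given as lists of proportions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical proportions of examinees choosing options A-D on one item.
pretest_dist = [0.25, 0.40, 0.20, 0.15]      # distribution at calibration
operational_dist = [0.05, 0.85, 0.05, 0.05]  # suspiciously peaked on the key

print(round(kl_divergence(operational_dist, pretest_dist), 3))  # → 0.436
```

In practice one would compute this per item, conditioning on examinee ability, and flag items whose divergence is extreme relative to the rest of the pool.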
I was privileged to organize a session in which four research teams accepted the challenge to analyze a live data set where the scored items were compromised and the non-scored items were not compromised. The teams were tasked with finding the scored items and then with removing individuals who used the compromised content from the data set so that reliable item statistics could be obtained. Every team used a different approach. They were diligent and thorough. From their work, I learned that structure is present in the data that can be exploited to detect compromised items and those who use them.
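The structure the teams exploited can be illustrated with a deliberately simple screen (this is my own toy example, not any team's actual method, and the threshold is an arbitrary assumption): an examinee with pre-knowledge tends to perform much better on the compromised items than on the secure items, so a large gap between the two accuracies is a red flag.

```python
def flag_preknowledge(responses, compromised, secure, threshold=0.4):
    """Flag examinees whose proportion correct on compromised items exceeds
    their proportion correct on secure items by more than `threshold`
    (a hypothetical cutoff chosen for illustration only)."""
    flagged = []
    for examinee, answers in responses.items():
        comp_acc = sum(answers[i] for i in compromised) / len(compromised)
        sec_acc = sum(answers[i] for i in secure) / len(secure)
        if comp_acc - sec_acc > threshold:
            flagged.append(examinee)
    return flagged

# Invented 0/1 item scores on six items; items 0-2 are the compromised ones.
responses = {
    "A": [1, 1, 1, 0, 0, 1],  # perfect on compromised, weak on secure
    "B": [1, 0, 1, 1, 0, 1],  # consistent performance across both sets
}
print(flag_preknowledge(responses, compromised=[0, 1, 2], secure=[3, 4, 5]))
# → ['A']
```

Real methods such as the deterministic gated model make this comparison within a formal item response model rather than with raw proportions, but the underlying signal is the same.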
Our research in this area is hampered by the lack of benchmark data sets that can be analyzed by everyone. This impediment was discussed at the conference. Such data sets would allow methods to be readily compared. I hope to contribute in some way to gathering and documenting benchmark data sets. By so doing, I am very optimistic that reliable, scientifically valid, and powerful methods will be developed that will help us "decontaminate" the data so that item statistics may be trusted. All measurement professionals should be vitally interested in the quest to ensure that assessments are reliable and valid by detecting and removing from the test result data those individuals who have pre-knowledge of the item content.