When Contaminated Item Statistics Undermine the Quality of Assessments
How would you reply if you were asked, “Does it really matter if some people cheat on tests?” I have heard a wide range of opinions in response to this question. On one end of the spectrum, you will hear opinions such as those held by Jolie Fitch, ringleader of the Steinmetz Academic Decathlon cheaters, who told Dr. Phil McGraw that “life is a gray area,” that it was “not morally wrong,” and that “anything you can get away with is ok.” At the other end, you will hear those who agree with Kathleen M. Rice, the Nassau County District Attorney, who remarked when SAT cheating was uncovered on Long Island, “If we can’t teach 16-, 17- and 18-year-olds that cheating is wrong, shame on us.” But, setting aside the question of moral or immoral behavior, does it matter?
Psychologists appear to be split on the question of cheating. Some view cheating on tests as an assessment issue; others view it as a behavioral issue. Recently, measurement professionals have framed cheating on tests as a validity issue. I would like to offer a somewhat different opinion: cheating on tests is an issue of data integrity. Data integrity certainly bears on the validity of test scores, but I view it as transcending score validity. When test takers cheat, the data are contaminated. Consequently, decisions about the quality of the assessment instrument that are made using contaminated test data are flawed. The computer science aphorism “garbage in, garbage out” comes to mind. This, in my mind, is a critical point that practitioners tend to ignore.
Accurate statistical data are necessary for ensuring that test scores are reliable and valid. Analysts use these data to eliminate poorly performing items, remove biased items, and assemble the test to meet desired measurement objectives. These statistical decisions may be profoundly affected when the data are contaminated through cheating. For example, poorly performing items are usually identified as those with low correlations between the item scores and total scores (point-biserial correlations). However, I have seen data where the exam was known to be compromised and the newly published items (which were not compromised) appeared to be poorly performing. In that scenario, the compromised items would be retained and the uncompromised items would be discarded.
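To make the mechanics concrete, here is a minimal sketch of the item analysis described above: computing a corrected item-total (point-biserial) correlation for each item in a 0/1 response matrix and flagging items below a cutoff. The data, the function names, and the 0.2 threshold are all illustrative assumptions, not values from any real program.

```python
# Sketch: corrected item-total (point-biserial) correlations for a 0/1
# response matrix. All data and the 0.2 cutoff are illustrative.

def point_biserial(responses):
    """responses: list of examinee rows, each a list of 0/1 item scores.
    Returns one corrected item-total correlation per item (the item is
    removed from the total so it does not correlate with itself)."""
    n_items = len(responses[0])
    results = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]  # total minus item j
        results.append(_pearson(item, rest))
    return results

def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

# Toy data: 8 examinees x 4 items. Items 0-2 follow ability; item 3 is
# answered essentially at random, so its correlation is near zero.
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
flagged = [j for j, r in enumerate(point_biserial(data)) if r < 0.2]
```

In uncontaminated data this flags the genuinely noisy item. The article's point is that contamination inverts the picture: braindump users score highly on compromised items but near chance on new ones, which drags the new items' correlations down and makes this routine screen discard the wrong items.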
In 2013, several researchers reviewed such a data set and reported their findings at the Second Conference on Statistical Detection of Potential Test Fraud. The conference abstract read, in part: “A few years ago a certification program republished an exam which was known to be compromised with new, non-scored items. The data suggested that the new items were psychometrically unsound. But, this was not true. Contamination in the data by braindump users prevented psychometric analysis of the new items.” The potential disparity between the old (compromised) and new (uncompromised) items is illustrated in Figure 1.
Figure 1: Scatter Plot of Item Point-Biserial Correlations and P-Values
In Figure 1, the uncompromised items have low correlations and would be discarded, while the compromised items would be retained. Contaminated data would thus lead to exactly the wrong decisions about which items to keep.
When a measurement professional is dealing with a situation such as that shown in Figure 1, it is imperative to understand how the data are contaminated by cheaters and how to make good decisions despite the contamination. Otherwise, the investment in the assessment may be jeopardized and statements about reliability and validity will be meaningless. So, the measurement professional needs to perform at least two tasks: (1) detect and determine whether the assessment has been compromised, and (2) identify and remove from the data set the test results for individuals who may be cheating. Only after performing both of these tasks should the measurement professional proceed with the standard analyses and processes that establish the reliability of the assessment.
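The second task above amounts to a filtering step that runs before any item analysis. The sketch below assumes some detection method has already produced a per-examinee suspicion score (the scores, cutoff, and function name are hypothetical placeholders); it simply drops flagged examinees and recomputes item p-values on the cleaned matrix.

```python
# Sketch of the clean-then-analyze workflow: screen out suspected
# braindump users first, then run item analysis on what remains.
# The suspicion scores and 0.5 cutoff are hypothetical placeholders
# for the output of a real detection method.

def clean_then_analyze(responses, suspicion, cutoff):
    """Drop examinees whose suspicion score exceeds cutoff, then return
    per-item p-values (proportion correct) on the cleaned data."""
    kept = [row for row, s in zip(responses, suspicion) if s <= cutoff]
    n = len(kept)
    n_items = len(responses[0])
    return [sum(row[j] for row in kept) / n for j in range(n_items)]

responses = [
    [1, 1, 1],  # honest examinee
    [1, 0, 1],  # honest examinee
    [1, 1, 1],  # suspected braindump user
    [0, 1, 0],  # honest examinee
]
suspicion = [0.1, 0.2, 0.9, 0.1]  # hypothetical detection output
p_values = clean_then_analyze(responses, suspicion, cutoff=0.5)
```

The order matters: computing item statistics before filtering would let the suspected examinee's responses inflate the p-values, which is precisely the contamination problem the article describes.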
In general, when the set of compromised items and the group of individuals using the compromised content are unknown, these tasks are easier to state than to accomplish. In fact, this is a very difficult computational problem, as Dmitri Belov reported at the annual NCME conference in 2014: the general solution requires a combinatorial search. For this reason, if you suspect that some of your test questions might be compromised, it is best to insert security questions or embedded verification questions into the exam, which can be used to detect users of compromised content and expunge them from the data.
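One simple way such verification questions could work, sketched under stated assumptions: suppose the braindump circulates a known (and deliberately or accidentally wrong) answer key for a handful of embedded items. An examinee who matches that leaked key on most of those items is then a candidate for removal. The function, item indices, leaked key, and 0.8 match threshold below are all hypothetical.

```python
# Sketch: screening with embedded verification items. Assumption: a
# leaked (incorrect) key for these items circulates on a braindump, so
# matching that key on most verification items is suspicious. All
# names, indices, keys, and the 0.8 threshold are hypothetical.

def flag_braindump_users(responses, verif_items, leaked_key, min_match=0.8):
    """responses: per-examinee lists of chosen options (e.g. 'A'..'D').
    verif_items: column indices of the embedded verification items.
    leaked_key: the leaked option for each verification item.
    Returns indices of examinees who match the leaked key on at least
    min_match of the verification items."""
    flagged = []
    for i, row in enumerate(responses):
        matches = sum(row[j] == k for j, k in zip(verif_items, leaked_key))
        if matches / len(verif_items) >= min_match:
            flagged.append(i)
    return flagged

responses = [
    ["A", "C", "B", "D", "A"],  # matches leaked key on 2 of 3 items
    ["B", "C", "D", "D", "C"],  # matches leaked key on all 3 items
    ["A", "B", "B", "A", "A"],  # matches on none
]
flagged = flag_braindump_users(responses, verif_items=[1, 3, 4],
                               leaked_key=["C", "D", "C"])
```

A real program would need to weigh the false-positive risk (an honest examinee can match a leaked key by chance), which is why a statistical threshold rather than a single match is used here.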
If you would like to discuss dealing with situations when test questions have been compromised, please join us at the Conference on Test Security which will be held October 1 and 2, 2014 in Iowa City, Iowa, hosted by ACT.