The case of the befuddled answer copier

About a year ago, a university dean asked for our help. A professor in the college had used two versions of the final exam (each version had the same questions, but in a different order). While grading the exams, the professor noticed that one student had a very low score (22%), so the professor rescored the exam using the answer key for the other form of the test, which produced a much higher score (63%). Thinking that the exam had been mislabeled, the professor rechecked the labeling of the answer sheet and the test booklet, and then asked the student to verify both, confirming that the test form had not been mislabeled. At that point, the professor suspected the student had cheated. After considering the case carefully, the department faculty concluded that the student had cheated and should be expelled. The dean asked us to provide statistical evidence for or against this allegation in order to support the faculty’s decision.

This was an interesting problem for me. It was the first time I had analyzed such a small data set (fewer than one hundred tests). It was also the first time I had applied our probability and deduction methods to the analysis of cross-form answer copying. Most test administrators (i.e., teachers and instructors) ignore cross-form answer copying because the answer copier is naturally punished with a failing score, and the copier will not dispute the low score because doing so would require admitting the fraud. Almost no academic research has been published on cross-form answer copying. I suspect there is so little research because cross-form analysis has no count of identical incorrect responses (the staple of nearly all answer-copying statistics), even though the probability derivations are fairly straightforward.
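To make that concrete, here is a minimal sketch of what such a screen can look like under very simple assumptions: the two answer sheets have already been put into a common question order, each question matches by chance with a flat probability, and questions are independent. The function names and the 0.25 chance-match rate are placeholders for illustration; the actual analysis uses calibrated per-question probabilities rather than this toy model.

```python
# Illustrative sketch only: not the production procedure. Assumes a flat
# chance-match probability per question and independence across questions;
# a real analysis would estimate per-question, per-option probabilities
# from the class's responses.
from math import comb

def match_tail_probability(resp_a, resp_b, p_match=0.25):
    """P(at least the observed number of matching answers) between two
    equal-length answer strings, under the simple binomial chance model."""
    n = len(resp_a)
    matches = sum(a == b for a, b in zip(resp_a, resp_b))
    tail = sum(comb(n, k) * p_match**k * (1 - p_match)**(n - k)
               for k in range(matches, n + 1))
    return tail, matches

def screen_cross_form_pairs(form1_sheets, form2_sheets, p_match=0.25):
    """Compare every form-1 sheet with every form-2 sheet (dicts mapping a
    test ID to its answer string) and return the pairs ordered from most
    to least surprising."""
    results = []
    for id1, r1 in form1_sheets.items():
        for id2, r2 in form2_sheets.items():
            p, m = match_tail_probability(r1, r2, p_match)
            results.append((p, m, id1, id2))
    return sorted(results)  # smallest (most improbable) probabilities first
```

Even this crude model returns a vanishingly small probability for a pair as similar as the one described below.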

After building the computational procedure, I found one extremely similar pair of tests (with a probability of less than one in one trillion squared). I will denote this pair of tests as “#32” and “#121”. The score for test #32 was 22% (approximately equal to the guessing proportion); graded with the alternate answer key, the same test scored 63%. For illustrative purposes, I have aligned the two sets of responses below.

Table 1: Aligned responses for extremely similar tests


You will notice that, beginning with question #26, all of the responses are identical (that’s 49 questions in a row!). A response is shown in bold if it is correct. It is highlighted in gold or tan if it is identical on both tests, one color marking identical incorrect responses and the other marking identical correct responses. The statistical evidence confirmed what the faculty suspected. We reported the result. The university decided to let the test score stand and not to expel the student. However, it is almost certain that student #32 failed the course and, from that point on, would have to be very careful not to be caught again.

It is interesting to consider what the result would have been if test #32 had indeed been mislabeled. In that scenario, we have 65 identical answers, 20 of them incorrect and 45 correct. The probability of that much similarity is less than one in one hundred billion. So we reach the same conclusion (the probability is simply not quite as extreme).
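To give a feel for the arithmetic, the toy binomial model sketched earlier can be applied to this scenario. Taking the 65 matching plus seven mismatching questions to imply roughly a 72-question exam, and assuming two honest classmates agree on about half the questions by chance, yields a similarly tiny tail probability; both figures are assumptions for illustration, so the result will not match the reported number exactly.

```python
# Hypothetical numbers: a 72-question exam is inferred from 65 matches plus
# 7 mismatches, and the 0.5 chance-agreement rate is assumed. The real
# analysis uses calibrated per-question probabilities, so figures differ.
from math import comb

n, m, p = 72, 65, 0.5
tail = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))
print(f"P(at least {m} matches out of {n}) = {tail:.2e}")
```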

The seven mismatched questions above provide two very important clues. First, if test #32 were mislabeled, those seven non-matching questions would have been answered incorrectly on test #32. Set against the additional observation that test #121 answered all of the same questions correctly, this leaves us with the inference that student #32 is indeed the answer copier. Our source-copier analysis gives odds of 4,486 to 1 that student #32 is the answer copier.

The second clue leads us to believe that the answer sheet was in fact labeled correctly. A class of statistics known as person-fit statistics assesses whether a test response pattern is consistent with expected test-taking behavior. We have developed one such statistic, derived from item response theory, which measures score consistency. When we computed this statistic for test #32, we found the test to be aberrant, with an extreme probability of 0.0001. To understand the nature of this extremeness, the statistical contribution to aberrance for each test question was computed and plotted (shown in Figure 1).

Figure 1: Illustration of aberrance for test #32 (in question order)


The values in the plot are approximate z-scores for the aberrance statistic. The responses that are least consistent with the score awarded on the test (assuming the test was mislabeled) correspond to the points with the largest z-scores. These turn out to be exactly the questions where test #32 did not match test #121 (shown as orange squares). We conclude that test #32 does not conform to the expected test-taking model, and that the non-conformance is concentrated in the seven questions where the two tests had mismatching answers, the very questions that were answered correctly on form 1 by test #121 but missed by test #32. We conclude that student #32 did indeed have access to test form #1 while taking the test. Student #32 would have been better off doing his or her own work rather than blithely copying from a neighbor.
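I have not described the person-fit statistic itself here. As a point of reference, below is a minimal sketch of a well-known IRT-based index of the same flavor, the standardized log-likelihood lz, along with per-question standardized residuals that produce the kind of z-score breakdown plotted in Figure 1. The item probabilities are assumed to come from an already-fitted IRT model; this is an illustration of the general idea, not the statistic used in the analysis above.

```python
# Sketch of an IRT-style person-fit check. p[i] is the model-implied
# probability (strictly between 0 and 1) that this student answers
# question i correctly, taken from an already-fitted IRT model at the
# student's estimated ability; u[i] is 1 if the observed answer was
# correct and 0 otherwise. Shown only as an illustration.
from math import log, sqrt

def lz_person_fit(u, p):
    """Standardized log-likelihood person-fit index; large negative values
    flag response patterns inconsistent with the model (aberrance)."""
    l0 = sum(ui * log(pi) + (1 - ui) * log(1 - pi) for ui, pi in zip(u, p))
    expected = sum(pi * log(pi) + (1 - pi) * log(1 - pi) for pi in p)
    variance = sum(pi * (1 - pi) * log(pi / (1 - pi)) ** 2 for pi in p)
    return (l0 - expected) / sqrt(variance)

def question_z_scores(u, p):
    """Approximate per-question z-scores (standardized residuals), the kind
    of question-by-question breakdown plotted in Figure 1."""
    return [(ui - pi) / sqrt(pi * (1 - pi)) for ui, pi in zip(u, p)]
```

Plotting those per-question values in question order, and marking the questions where the two tests disagree, gives a picture like the one described above.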

Dennis Maynes

Chief Scientist, Caveon Test Security
