The Impact of Braindump Sites on Item Exposure and Item Parameter Drift
By Russell W. Smith Ph.D., Psychometrician, Thomson Prometric
Email: russell.smith@thomson.com

Paper presented at the Annual Meeting of the American Education Research Association
April 2-6, 2004, San Diego, California

Abstract
Given the number of so-called “Braindump” sites emerging on the internet, it is necessary to understand their effects on the security and validity of exam scores. This study examines the impact of braindump sites on an exam using two approaches, one experimental and one non-experimental. Research items were posted verbatim to a popular braindump site immediately upon the exam’s release. A drift analysis compared the item parameter drift of the intentionally exposed unscored items to the drift of the live items; there was little difference in the drift in item difficulty between the live and the experimental items. Popular braindump sites were searched after the live publication of the exam to identify the rate and quality of the exposure of the exam items. Most of the bank was available after about 8 months, and the items were surprisingly accurate. A p-value trend analysis was used to explore the relationship between exposure, impact, and parameter drift.

Introduction 
The advent of computerized exams and ever-increasing access to the World Wide Web has redefined the way items and item banks are maliciously shared. At least three particular types of websites have been problematic for test security. Some “Exam Preparation” sites charge potential candidates a fee to gain access to item banks. Many of these websites are legitimate and provide legitimate practice items; other, illegitimate exam preparation sites attempt to recover the actual live item bank. The second type of website is internet auction sites such as eBay. These websites provide a forum in which people can sell items to potential candidates. Many of these auctions may be selling legitimate practice items. However, personal experience has shown that many of these auctions sell partially recovered live item banks. The third type of website allows candidates to post their own practice item banks and advice regarding exams. These are often referred to as “Braindump” sites. Some of these sites charge an access fee while others provide free access and earn money through advertising. Most advertisers appear to be exam preparation sites that charge fees. Many of these sites warn candidates not to post actual item banks, but there are no real controls in place to prevent this from happening.

A case-in-point of the problems that such websites can cause is the 2002 security breach of a portion of the Graduate Record Examination (GRE) in China. Educational Testing Service responded to the security breach by temporarily discontinuing the computerized version of the Computer Science Test and opting to return to a paper-based administration (ETS Press Release, 2002).

Information Technology (IT) certification exams may be at greater risk of security breaches due to these websites because the candidates that take these exams are generally very skilled in computers and the use of the internet. This is not to imply that such candidates would be more apt to utilize such sites, only that if they chose to, they would have the knowledge and access. Jones, Smith, Jenson, and Peterson (2004) found “a discernable upward trend in proportion of items drifted as a function of mean exposures per item across (15 IT certification) exams.”

This study investigates the impact that braindump sites can have on the performance of IT certification exams and items. Six research items were posted to a braindump site the day the live version of the exam was released. It was expected that those items would become noticeably easier compared to the live scored items.

Item parameter drift measures the change in item parameters over time (Goldstein, 1983). Drift is different from impact, which measures the difference between the abilities of distinct candidate groups on the same items. The candidates may be grouped based on time. It is possible that one group of candidates has a higher ability than another group. This is particularly important to consider in IT certification because it is often the case that exams are developed in conjunction with the technology and training materials. Therefore, candidate abilities may dramatically increase over time.

In this study, it was expected that the research items would get noticeably easier over time, while controlling for candidate ability. At the same time, it was expected that the probability of success on any item would increase over time. If the probability of success were to increase based on increased awareness of new technology, general exposure, or due to greater availability of training then it would be expected that the probability of success would increase across the entire bank of items. If, on the other hand, the probability of success increases largely due to exposure on a braindump site, then it would be expected that the exposed items would drift further than items that have not been exposed.

Methods 
The first part of this study consisted of embedding six research items on the live version of a particular IT certification exam. The beta version of the exam consisted of two forms. From these forms, six pre-equated live forms were assembled, and two of the research items were added to each of the six live forms. The scores on these items did not factor into the final score or pass-fail decision of the candidates. A disclaimer was added to the exam explaining that the client often added unscored items to live exams in order to collect information on the performance of those items. The day the live version of the exam was released, all six of the questions were posted to a popular and free braindump site in two separate postings of three items apiece. The items were posted verbatim and with the correct option or options keyed.

An item drift analysis was conducted for all items using the differential item functioning (DIF) procedure in Winsteps (Linacre, 2003). Winsteps measures drift by calibrating all items and candidates, then recalibrating the items within each candidate group while holding the ability estimates constant. The DIF (drift) contrast for item $i$ is then expressed as

$d_i = b_{i0} - b_{i1}$,

where $b_{i0}$ is the Rasch difficulty parameter for item $i$ calculated over group 0 and $b_{i1}$ is the Rasch difficulty parameter calculated over group 1. Drift contrasts are tested for significance with a t-test,

$t = \dfrac{b_{i0} - b_{i1}}{\sqrt{SE_{i0}^{2} + SE_{i1}^{2}}}$,

with df approximated by $n_0 + n_1 - 2$ and alpha = .05, where $SE_{i0}$ and $SE_{i1}$ are the standard errors of the two calibrations and $n_0$ and $n_1$ are the group sample sizes. The Rasch-based DIF procedure implemented in Winsteps is based on the same theoretical properties as the Mantel-Haenszel method (Linacre & Wright, 1987).
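To make the computation concrete, the sketch below reproduces the contrast and t-test described above for a single item, assuming the Rasch measures, standard errors, and group sizes are already available from two separate calibrations. The function name and numeric values are illustrative; this is not the Winsteps implementation itself.

```python
# Minimal sketch of the drift contrast and t-test described above.
import numpy as np
from scipy import stats

def drift_contrast(b0, se0, n0, b1, se1, n1):
    """Drift contrast, joint SE, t, df, and two-tailed p for one item
    calibrated separately in group 0 (beta) and group 1 (live)."""
    contrast = b0 - b1                        # difference in Rasch difficulty (logits)
    joint_se = np.sqrt(se0**2 + se1**2)       # joint standard error of the contrast
    t = contrast / joint_se                   # t-statistic for the contrast
    df = n0 + n1 - 2                          # approximate degrees of freedom
    p = 2 * stats.t.sf(abs(t), df)            # probability under H0: no drift
    return contrast, joint_se, t, df, p

# Hypothetical values for one item: measures (logits), SEs, and group sizes.
print(drift_contrast(b0=0.42, se0=0.11, n0=350, b1=-0.05, se1=0.08, n1=900))
```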

Item measures calibrated using data from the beta candidates were compared to item measures calibrated using data from the live candidates. Additionally, beta item p-values were plotted against live item p-values.

The second part of this study was an investigation of popular braindump sites to identify the rate and quality of the exposure of the exam items. Braindump sites were located using internet searches and word of mouth from testing candidates. The identified sites were periodically searched for updates. Only free sites were investigated. The results report approximate item exposure rates and accuracy.

Results 
Table 1 shows the results of the drift analysis for the research items. It includes the beta item measure, the live item measure, the contrast between the two, the joint standard error, the Mantel-Haenszel statistic, its degrees of freedom, and the probability of obtaining that statistic if there were no difference between the groups. The same information is provided in the Appendix for the scored items. Figure 1 shows a scatter plot of the beta and live item measures for the scored and experimental items. Figure 2 shows a scatter plot of the beta and live item p-values.

Table 1. Research Item Drift Analysis Results

Figure 1. Scatter plot of the beta vs. live item measures.

Figure 2. Scatter plot of the beta vs. live item p-values.

An internet search of free braindump sites found about 25 percent of the item bank was exposed within 3 weeks of the exam being published live and with a fair amount of accuracy. After 8 months nearly the entire exam bank, over 200 items, was posted with nearly perfect accuracy including the answer key.

Conclusions 
It was expected that there would be significant impact showing candidate abilities increasing from the beta version of the exam to the live version. This expectation held true, as evidenced in Figure 2. The beta pass rate for this exam was 34.6%, while the live pass rate, calculated just 10 months after the beta, was 65.5%. It was also expected that the item parameter drift for the experimental items would show them becoming easier at a faster rate than the other items. It was initially surprising that this pattern did not hold true. In fact, 3 of the 6 research items appeared significantly more difficult based on the results of the drift analysis, while only one appeared easier. Looking at the graphs, the experimental items did not show a distinct pattern any different from the scored items.

The search of braindump sites was limited to those that were free. It did not take into consideration pay sites, auction sites, email listservs, personal communication between potential candidates, or any other means of exposing items in the bank. Still, at least part of the item bank was found to be compromised as early as 3 weeks after the live release and almost entirely exposed, with great accuracy, after 8 months. It is no wonder the results of the drift analysis showed no difference between the research and the scored items: the scored items were being exposed as well!

The results of this analysis are a snapshot a year and a half after this exam went live. A major limitation of this study is that item parameter drift is something that happens as a trend over time; interpreting it from a single snapshot, especially after this much time, is difficult. The mathematical constraints of the calibration also confound the directionality of parameter drift, because drift can only be measured relative to the other items in the calibration. In this study, the drift observed in the research items is confounded by item parameter drift in the entire bank, to the point that 3 of the research items actually appeared to become more difficult, at least in relation to the other items.

In order to better understand the influence of the item exposures on their difficulties, a p-value trend analysis was conducted on the research items. A p-value trend analysis plots the p-value, or in this case the residual p-value, over a moving window of candidates of a fixed size, here 200. For example, the first residual p-value in the trend represents the difference between an item's p-value calculated over candidates 1 through 200 and the average p-value across all items for the same set of candidates. The second residual p-value in the trend represents the difference between the item's p-value for candidates 2 through 201 and the average p-value across all items for candidates 2 through 201, et cetera. Figure 3 graphically displays the results of the residual p-value trend analysis for the research items during the beta version of the exam. The interpretation of the trend in Figure 3 is ambiguous because it covers only a three-day period of time; it has been included to show contrast with the live trends. Figure 4 graphically displays the results of the residual p-value trend analysis for the research items during the live version of the exam.
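As a rough illustration of the windowing described above, the sketch below computes a residual p-value trend from a 0/1 scored-response matrix ordered by test date. The array names, the simulated data, and the default window size are assumptions made for the example; the actual analysis was run on the operational response data.

```python
# Minimal sketch of a residual p-value trend over a moving candidate window.
import numpy as np

def residual_p_value_trend(responses: np.ndarray, item: int, window: int = 200) -> np.ndarray:
    """For each window of `window` consecutive candidates (rows, in testing order),
    return the item's p-value minus the average p-value across all items."""
    n_candidates = responses.shape[0]
    trend = []
    for start in range(n_candidates - window + 1):
        block = responses[start:start + window]   # candidates start .. start+window-1
        item_p = block[:, item].mean()            # p-value for the target item
        avg_p = block.mean()                      # average p-value across all items
        trend.append(item_p - avg_p)              # residual p-value for this window
    return np.array(trend)

# Example on simulated 0/1 responses (1000 candidates x 60 items), hypothetical item 4.
rng = np.random.default_rng(0)
simulated = rng.integers(0, 2, size=(1000, 60))
print(residual_p_value_trend(simulated, item=4)[:5])
```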

It is interesting to see the increase in the item p-values between the end of the beta and the first 200 live candidates, especially knowing that about 2 months had gone by and at least part of the bank was exposed during that time. The increase is particularly dramatic for Item 4, whose p-value tended to be about .45 below the average p-value during the beta and about .1 to .2 below the average p-value during the live administration, even though its Rasch item measure showed it becoming more difficult.

There is a large difference between the average beta p-value and the average live p-value, even though the trend is relatively flat for both groups. If there were substantial impact, the average p-value would be expected to trend gradually upward. Instead, it jumps greatly between the beta and live versions of the exam and then remains relatively flat. The same kind of differences between performance on beta and live exams were observed in 14 other IT certification exams in the Jones et al. (2004) study. The outstanding question remains: Is the increase from beta to live a change in the population, perhaps due to increased knowledge, exposure, and training, or is the increase due to inappropriate exposure?

Figure 3. A residual p-value trend analysis for the research items in the beta version of the exam. 

Figure 4. A residual p-value trend analysis for the research items in the live version of the exam.

Given the unfortunate reality of the number of posts and the amount of website traffic, there are likely individual candidates who view these posts, get an unfair advantage, and pass the test when their ability is less than the passing standard. However, this appears to be only a subset of candidates. It is interesting to note that the pass rate stabilized at around 65%. Knowing that nearly the entire bank was exposed with surprising accuracy, this may imply that not all candidates are utilizing the sites. However, the few that do utilize these sites fraudulently jeopardize the validity of the exam and the meaning of the test scores.

This study is not meant to generalize to other testing programs; it would likely not be generalizable at all outside the IT certification realm, and it may not generalize to exams from other sponsors. However, it does show that braindump sites may be having a negative impact on exam programs. Statistical analyses, such as the p-value trend analysis and other similar methods that are presently being developed (see, for example, Lu and Hambleton, 2003), may prove useful in detecting certain types of fraud. However, we as test developers need to gain a better understanding of the impact of braindump sites, and of means of dealing with them, before problems are detected.

References

ETS Press Release. (August 26, 2002). Security Breaches Force GRE Board to Cancel Computer Science Test Administrations. Website: www.ets.org/news/02082602.html.

Goldstein, H. (1983). Measuring changes in educational attainment over time: Problems and possibilities. Journal of Educational Measurement, 33, 315-332.

Jones, P. E., Smith, R. W., Jenson, E., & Peterson, G. (2004). Item Parameter Drift in Small-Volume Continuously Available Non-Adaptive Computerized Certification Tests in the Information Technology Industry. Paper presented at the Annual Meeting of the National Council on Measurement in Education, April 3-5, 2004, San Diego, California.

Linacre, J. M. (2003). WINSTEPS [Computer Program]. Chicago: MESA Press.

Linacre, J. M., & Wright, B. D. (1987). Item bias: Mantel-Haenszel and the Rasch model. Memorandum No. 39, MESA Psychometric Laboratory, University of Chicago, Department of Education.

Lu, Y., & Hambleton, R. K. (2003). Statistics for detecting disclosed items in a CAT environment. Center for Education Assessment Research Report No. 498. Amherst, MA: University of Massachusetts, School of Education.

Posted with permission: © Russell Smith, Thomson Prometric
