Formula One Assessments

Recently at Caveon, we have been discussing how to improve testing. Kelli Foster, our Vice President of Test Development Services, stated that we really need to use data to make sure that tests and test questions perform well. This comment resonated with me. It made me stop and wonder, “How can data help us build high-performing assessments? Is it possible to elevate the performance of assessments to Formula One standards?” I hope you catch my meaning. Formula One racing is where you find the highest-performing cars, and it has been responsible for some of the greatest innovations in automobile engineering during the past century. So, I ask, “Can we use data to accomplish the same goals for assessments?”

Let me offer three specific suggestions for using data to maximize performance of assessments:

  1. Remove or revise under-performing items,
  2. Establish and achieve desired performance standards, and
  3. Maintain and manage assessment performance.

Remove or revise under-performing items. To do this, you need to identify assessment components (typically items and/or objectives) that negatively affect assessment performance. The primary data analysis for this purpose is known as “item analysis.” Using item analysis, you can identify questions that are too hard, too easy, misaligned with the test content, poorly constructed, and so forth. Other data analyses should also be performed, depending upon your program’s situation. For example, DIF (Differential Item Functioning) analysis identifies items that perform inconsistently across sub-populations (e.g., gender or ethnic groups). Analysis of response times can identify questions that have a “back-door” solution. For example, one psychometrician told me that a question intended to assess the candidate’s knowledge of decay processes was easily answered using a “plug-and-guess” strategy, instead of by working the problem.
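To make the idea concrete, here is a minimal sketch of a classical item analysis over a scored response matrix. It computes the two statistics most often used to flag under-performing items: the p-value (proportion correct, which flags items that are too easy or too hard) and a point-biserial discrimination index (how well the item separates high scorers from low scorers). The data layout and function name are illustrative, not taken from any particular tool.

```python
# Minimal item analysis sketch (illustrative, not from a specific product).
from statistics import mean, pstdev

def item_analysis(responses):
    """responses: one row per test taker, each row a list of 0/1 item scores.
    Returns a (difficulty, discrimination) pair per item:
      difficulty     = p-value, the proportion answering correctly
      discrimination = point-biserial correlation of the item score with
                       the rest-of-test score (item excluded to avoid inflation)
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    results = []
    for j in range(n_items):
        scores = [row[j] for row in responses]
        p = mean(scores)  # flags too-easy (near 1.0) or too-hard (near 0.0) items
        rest = [t - s for t, s in zip(totals, scores)]  # total minus this item
        sd_item, sd_rest = pstdev(scores), pstdev(rest)
        if sd_item == 0 or sd_rest == 0:
            r = 0.0  # no variance: discrimination is undefined, report 0
        else:
            cov = mean(s * t for s, t in zip(scores, rest)) - p * mean(rest)
            r = cov / (sd_item * sd_rest)
        results.append((p, r))
    return results
```

Items with very extreme p-values or low (especially negative) discrimination are the usual candidates for revision or removal.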

Establish and achieve desired performance standards. High-quality, high-performance assessments are created by design, not by accident. The design of the assessment should specify the desired reliability, precision, and range of measurement. The assessment needs to be suitable for the target test-taking population. All of these things can be verified using statistical and psychometric analysis of test results. Using Item Response Theory (IRT), psychometricians can determine whether items and tests meet or exceed target information functions. They can also determine whether the tests provide results reliable enough for the intended inferences and decisions (e.g., pass/fail). If the desired precision has not been attained at the critical measurement points, the test designer can change the mix of items with respect to difficulty and discrimination. Alternatively, test assembly software can be used to ensure that the items, when administered as a group, provide the required measurement properties.

Maintain and manage assessment performance. Like most everything, test questions and assessments are subject to the effects of time. Generally, the questions don’t wear out like an article of clothing, but they may be disclosed. Or, they may no longer be relevant due to changes in technology and education. The phenomenon known as “item drift” is of particular concern among psychometricians. If the items become easier over time, but the item parameters are not updated, the test scores may be artificially inflated. The primary concern with item drift arises from test security risks, such as item theft and content disclosure. Thus, test security analyses are essential for maintaining high-performing assessments. The security analyses need to identify compromised items and, insofar as possible, determine which items should be replaced or refreshed. When performance degradations are detected, new items readied for this purpose should replace the compromised items.
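One simple way to screen for the kind of drift described above is to compare an item's proportion-correct between an early and a recent administration window using a two-proportion z-test; a large positive statistic suggests the item has become easier, possibly through disclosure. This is a hedged sketch: the function name, counts, window choice, and flagging threshold are all illustrative, and a real program would apply more sophisticated parameter-drift and security analyses.

```python
# Illustrative drift screen: two-proportion z-test on an item's p-value
# across two administration windows (all numbers below are invented).
import math

def drift_z(correct_old, n_old, correct_new, n_new):
    """Two-proportion z statistic comparing proportion correct across
    windows. Large positive z means the item appears to have gotten easier."""
    p_old = correct_old / n_old
    p_new = correct_new / n_new
    p_pool = (correct_old + correct_new) / (n_old + n_new)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))
    return (p_new - p_old) / se

# Hypothetical item: 520/1000 correct in the early window, 640/1000 recently.
z = drift_z(520, 1000, 640, 1000)
flagged = abs(z) > 3.0  # conservative cutoff before human review
```

Flagged items would then go to review, where the program decides whether to recalibrate, refresh, or replace them with pretested replacements.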

Just as in Formula One racing, great care is needed to ensure that assessments are of high quality and perform to “Formula One” standards. Appropriate data analyses are essential for doing this. Test developers should not believe their work is finished after the test items have been produced and the standards have been set. To the contrary, the construction of high-performance tests and test questions requires much work after the items have been written. Statistical analyses of the data to remove under-performing items, achieve desired performance standards, and maintain assessment performance are critical tasks that should not be neglected.

Please let us know whether these ideas have been helpful. Thank you.

Dennis Maynes

Chief Scientist, Caveon Test Security
