We are delighted to have Professor Greg Cizek's permission to post in this blog his discussant comments from the CCSSO National Conference on Student Assessment two weeks ago. Professor Cizek is basing his observations on his careful review of the new CCSSO publication: TILSA Test Security Guidebook: Preventing, Detecting, and Investigating Test Security Irregularities by Drs. John F. Olson and John Fremer and on NCSA presentations about the Guidebook. In the article, he mentions several of those presenters by name. We have added the organizations for which they work in parentheses.
TILSA Test Security Guidebook: Next Steps
Presented and written by: Gregory J. Cizek, Professor of Educational Measurement and Evaluation, University of North Carolina at Chapel Hill
Thank you for the invitation to be here today. It is an honor to address this meeting and it has been a privilege to be associated with the Guidebook project. I want to first commend the outstanding job on the Guidebook done by John Olson (OEMAS) and John Fremer (Caveon); I would also like to recognize the great leadership Charlene Tucker (CCSSO) has provided for the project.
Before I get into the main focus of my remarks, I should first reiterate what John Olson said: the Guidebook is really the culmination of a lot of recent and needed attention to concerns about test data integrity, including the good work done by the USDOE and the National Council on Measurement in Education. The Guidebook represents a truly outstanding, comprehensive, and needed resource. Having said that, I certainly don't want my remarks to be taken as diminishing that effort at all, but my presentation will not dwell on all the positives; instead, I would like to focus on all the work that I think remains to be done. I will address seven points.
1) First, in my opinion, most conversations about cheating lack clarity regarding the kind of cheating we are talking about: student cheating or educator cheating. The problems are different; the methods for preventing, detecting, and responding are different; and the consequences are different. We should not continue to be sloppy in our usage on this point and we should begin to more carefully differentiate between the two.
2) Like Roger Ervin (KY DOE), I also am not 100% confident regarding the amount of cheating that is occurring—and here I am talking about educator cheating. I don't know if it's 5% or 10%, but I doubt it is as low as 1/10 of 1%. What I do know for certain is that there is uniformly more cheating than we think there is. Because we work in educational systems, I suspect we are inclined to underestimate the amount of cheating, but we do so at our peril. Any state assessment director, state chief, or credentialing assessment director who believes his or her program doesn't have a serious cheating problem has simply not investigated the concern deeply enough. Regarding the specific context of large-scale K-12 accountability testing, I am also confident that the concern about educator cheating is going to get worse. With states being compared to each other on common core assessments and with educator evaluations more explicitly tied to test results, the incentives for cheating are only increasing. Accountability in education is here to stay. I'm certainly not arguing that accountability should be exclusively based on test scores, or that current accountability systems have reached a state of perfection; I am only saying that tests with consequences are likely here to stay and that those concerned about the integrity of test data must come to grips with that reality.
3) I believe that those of us in the field of assessment must now take even greater leadership on the issue of test data integrity. It would be wrong to wait for the next article in the AJC, the Herald-Leader, or the Gazette to press forward. Unlike perhaps some of my colleagues, I actually applaud the newspaper reporters and others in the media who have pushed the issue of test data integrity to the forefront, gathering data, doing analyses, and so on. It is a shame, though, that they took the lead and not educators and assessment specialists. We must now take leadership on this issue and not be in the same old position where we respond to media inquiries or analyses by saying "We'll look into that." Instead, we should be ahead of the curve so that, when a reporter calls with an inquiry, we are able to say, "We've already looked into that. Here are the analyses we've done and the follow-up actions we've taken."
4) And speaking of analyses… I would first strongly caution everyone NOT to do any analyses at all until first developing and adopting policies and procedures about what to DO with the results of any analyses. Perhaps the worst situation one could be in would be a situation where analyses have been conducted and it must be admitted publicly that nothing has been done with the results. It is important to treat each similarly situated case the same way, and a coherent, comprehensive set of policies and procedures, uniformly applied, is essential. I realize that many of my colleagues are critical of the media for the analyses they have done, and they have articulated concerns that the analyses are rudimentary, fail to take into account this or that, and so on. To that allegation, I respond that at least the media publish their methods and findings, and in perhaps the most transparent of ways. One of our next needed steps is to publish our analytical methods in scholarly journals, peer-reviewed professional publications, and other outlets where they can be subjected to broader scrutiny and dissemination. And, although I recognize that today we are celebrating the release within the last hour of the Guidebook, I think that one of the next necessary projects is a Volume II of the Guidebook devoted to best practices in quantitative methods. There is remarkably little guidance, and there are few accepted best practices, regarding methodological tools for detecting cheating and methods for investigating cheating. We must now turn to addressing that void.
5) We need to realize our strengths. By and large we are statisticians and psychometricians, assessment specialists and educators. Our expertise is generally not legal. Like Juan deBrot (WVA DOE), I would urge us to reframe our concerns about test data integrity not as cheating concerns, but as a validity issue. As a professional testing specialist, it is simply not incumbent upon me to prove that cheating has occurred for me to express my psychometric concerns or cautions about score validity. Additionally, I think that by locating concerns about test data integrity under the psychometric concern about score validity, we are in a much stronger position to advocate for the resources needed to provide reliable and valid test results than if the concern is cast merely as an obsession for catching cheaters.
6) In addition to realizing our strengths, we must also recognize our weaknesses. We generally have strong quantitative skills, but we generally have little to no relevant training or expertise in investigation or follow-up. I believe, for many reasons, that those activities should be outsourced and the help of specialists should be enlisted. I am convinced that one of the reasons those in Georgia were able to get to the bottom of cheating allegations there was the highly skilled investigators who knew what questions to ask, whom to ask, what questions to re-ask, and so on.
7) Finally, I want to confess that I, at least, have had some ambivalence about two issues: whether a single source of evidence is sufficient to pursue a concern of cheating, and what statistical threshold should be used as a criterion for quantitative analyses.
I think I am becoming less ambivalent about these two issues. Whereas I used to think that it was nearly always important to have more than one source of evidence, I now lean more toward accepting a single strong piece of evidence as reasonable for pursuing further investigation of suspected testing impropriety. Sometimes, a highly improbable finding simply demands further scrutiny.
And, I don't think that we can—or should—establish a common criterion for suspecting testing impropriety. I don't think the debate should be settled at 2, 3, 4, 5, or however many standard deviations or standard errors. It seems to me that, with all the demands on assessment budgets, it would be unrealistic to establish a single criterion. Instead, it seems to make the most sense to prioritize the allocation of resources. If an assessment budget only permits investigating the "worst of the worst," then those resources should be allocated to digging deeper into possible test data invalidity, whether that means instances that exceed a 5 SD criterion, the 20 most outlying test centers, the top 1% of classrooms, or whatever the resources will allow.
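The resource-based triage just described can be sketched in a few lines of code. The sketch below is purely illustrative, not a recommended method: the unit of analysis (score gains by classroom), the 5 SD criterion, and the cap of 20 investigations are hypothetical placeholders standing in for whatever a program's own adopted policies specify.

```python
import statistics


def triage_flags(gains, sd_criterion=5.0, max_investigations=20):
    """Rank units (e.g., classrooms) by how anomalous their score gains
    are, then keep only as many flags as resources allow.

    `gains` maps a unit identifier to its year-over-year score gain.
    Both default thresholds are illustrative placeholders, not
    recommended values.
    """
    mean = statistics.mean(gains.values())
    sd = statistics.stdev(gains.values())
    # z-score each unit's gain against the overall distribution
    z = {unit: (g - mean) / sd for unit, g in gains.items()}
    # flag units beyond the criterion, most extreme first,
    # and truncate the list to what the budget permits
    flagged = sorted((u for u, v in z.items() if v > sd_criterion),
                     key=lambda u: z[u], reverse=True)
    return flagged[:max_investigations]
```

The point of the cap is exactly the prioritization argued for above: however the criterion is set, the most extreme cases are investigated first, and the list stops where the budget stops.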
In closing, I want to reiterate two main points. First, the Guidebook is a tremendous product and it will be an important resource. Second, the effort should not stop with the Guidebook; there is much more work to be done, and we can all hope that the next steps will be accomplished as well as the first ones.