There is a fundamental flaw in the scaling of the exam that any B student who took a statistics class in high school should be able to identify. I will use plainspeak and not terms like ANOVA so that this all makes sense.

DCAS is attempting to compare 2 populations (original and makeup) and find the mean score of each population, then scale up the scores of the population that had the lower mean score to match that of the population that had the higher mean score. The theory is that the mean score directly indicates the difficulty of the test. The first problem here is that DCAS assumes that both populations are capable of achieving the exact same mean score if they sat for the exact same test on the exact same day.

Case in point: on the original test day, what was the mean score of those who took their test in Queens compared to those who took their test in Brooklyn? For DCAS's theory to work, both locations would have to have the exact same mean score with similar standard deviations based on how many took the test at each site. If Queens scores 2 points higher on average than Brooklyn, was the Brooklyn test harder? Of course not. The variation comes from the random capabilities of the individuals who made up each group, since they took the exact same test under similar conditions.

Switch gears to comparing the original test takers to the makeups. How can DCAS compare a sample of, say, 800 original test day takers to a smaller group of, say, 25 test takers? That smaller group CANNOT be used as a reference, because with so few people, random variation among the individuals has a much larger effect on the group's mean score and will skew the results. Do you think it is a coincidence that all 3 exams that were scaled (Sgt, Lt, and Captain) ended with the makeup having the higher mean score, which scaled up the original test scores?
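To put rough numbers on the 25 vs 800 point, here is a minimal simulation sketch. It is not DCAS's method, and the score distribution is an assumption made purely for illustration (a true mean of 70 and a standard deviation of 10, both hypothetical). Every simulated group takes the exact same "test"; the only thing that changes is how many people are in the group.

```python
import random
import statistics

def spread_of_group_means(group_size, trials=1000, true_mean=70.0,
                          true_sd=10.0, seed=0):
    """Standard deviation of the group mean across many simulated groups.

    true_mean and true_sd are hypothetical values chosen for illustration,
    not figures from any actual exam.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Everyone is drawn from the same ability distribution (same test).
        scores = [rng.gauss(true_mean, true_sd) for _ in range(group_size)]
        means.append(statistics.fmean(scores))
    return statistics.stdev(means)

small = spread_of_group_means(25)
large = spread_of_group_means(800)
print(f"typical swing in the mean, group of 25:  {small:.2f} points")
print(f"typical swing in the mean, group of 800: {large:.2f} points")
```

Under these assumptions, theory says the swing is roughly the standard deviation divided by the square root of the group size, so about 2 points for a group of 25 and about 0.35 points for a group of 800. How much that matters for the real exams depends on the actual score spread, which is not known here.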

The idea behind what DCAS is doing is nice, but it is impossible to do accurately with groups this uneven and should therefore not be done that way.

I get what you are saying. However, the scaling methodology hasn't been explained in enough detail for you to make those assumptions. I don't know if we will ever find out the actual scaling methodology used to compare the original exam and the makeup exams. I am sure that with your statistics 101 background, you will be able to figure out the fraction or whole number used to scale the two exams. I do believe that they hired an outside firm for this purpose. I am hoping the outside firm is a "professional company" that will take these variables into consideration.

On a different note, why don't you suggest a better method to compare the original exam and the makeup exam? I think offering two different exams solves more problems than it creates.

I totally agree with the original post and have been saying this all along. But nutjobs like centurion don't want to face reality... so sad...

The technique of scaling is used on all standardized tests. It's not just DCAS making something up.

I think that KingOfTheNorth still has a point. I read the link that TransitJoe posted, and it talks about how they compare difficulty by having "anchor" questions on both tests. But King is asking how you can get an accurate idea of how hard the makeup test is if only 25 people took it. 25 people is so small compared to how many took the original test that it can warp the results of even these anchor questions. I think if 500 people took the makeup and 800 took the original test, then it would give a good cross section of people to test these anchor questions on. But with only 25 people, they could by chance be 25 smarter people who did better on the anchor questions. Like how the heck does the sergeant test get scaled by what seems to be 9 full points? Was the makeup test that much easier? That's insane and hard to believe! Also, the makeup people had what this RD company calls "exposure," so they probably already had an idea where the test writers were going. Even though they may not have known the actual questions, since the makeup used different questions except for the anchor questions, they may have had a feel for which areas to focus on during the extra time they had to study.
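For a rough sense of how group size affects a single anchor question, here is a small sketch. It is only an illustration and not the outside firm's actual method, which hasn't been published in this thread. It assumes a hypothetical anchor question that 70% of all candidates would get right, and uses the standard binomial formula for how far the observed percentage in one group typically lands from that true figure.

```python
import math

def anchor_standard_error(p_correct, group_size):
    """Standard error of the observed fraction answering one question correctly.

    p_correct is a hypothetical true fraction, chosen only for illustration.
    """
    return math.sqrt(p_correct * (1 - p_correct) / group_size)

# Group sizes mentioned in the thread: 25 makeups, 500 (suggested), 800 originals.
for n in (25, 500, 800):
    se = anchor_standard_error(0.70, n)
    print(f"group of {n}: observed % correct typically off by about {se * 100:.1f} points")
```

Real equating uses several anchor questions together, which reduces the noise compared with a single question, so this shows the direction of the effect rather than its actual size.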

The solution is to evenly break up all the people who signed up for the exam into 3 different test dates so each test has the same number of people taking it and will include dates that work for makeups. You can't fricken compare 25 to 800. King is right!