Taylorized Academic Labor


Mr. Kramer is an Assistant Professor of history at Borough of Manhattan Community College.

            By now the casualization of the professoriate is well known.  Larger and larger proportions of courses are being taught by adjuncts rather than full-time faculty.  At the City University of New York, for example, despite recent hiring efforts, the number of full-time faculty has fallen from about 11,300 to about 6,300 since the fiscal crisis of the 1970s, the balance being made up by adjuncts, a fourth of them graduate students.  Faced with these changes in the job market, we are accustomed to consoling ourselves with the thought that we are professionals.  No matter how poor the pay or the job conditions, we will always have skills that set us apart from our most exploited brethren.  But will we?  Or are historians really practitioners of a form of highly skilled labor, one that has so far escaped replacement by the closely regimented wage labor that has befallen so many other forms of work?

            For a week in each of the summers of 2004 to 2006, I worked as a reader for the Advanced Placement World History exam in Lincoln, Nebraska.  World History is among the newest of the AP exams; the AP program itself began in 1956 with 1,229 students.  According to the College Board, by last year that number had grown to nearly 1.5 million students, who took more than 2.5 million exams.  These courses, as is well known, do in fact substitute for traditional college courses at a growing number of schools.  According to the AP Program, 90 percent of all American schools accept AP courses for credit or placement.  In 2006 there were 437 World History readers, who read 84,000 exams of three essays each in seven days.  The standard fee for a reader was $1,450.

            This high level of efficiency was accomplished through a thorough Taylorization of the grading process.  As the labor historian David Montgomery has explained, in the original form of Taylorism, a manager would observe a skilled laborer, say a bricklayer, and divide his motions into separate movements of the arms, back and legs.  These movements were then reconfigured in the most efficient way and taught to workers who were closely supervised so that they would no longer work in their own way and at their own pace.  At the AP reading, a subgroup of readers called “table leaders” was flown to Lincoln several days in advance to read sample exams.  Based on the scores they gave to a sampling of essays, a standard rubric was developed.  When the rest of the readers arrived, we were divided into three groups, one for each question, and spent the first day being trained by the table leaders to score each exam according to the rubric.

            The grading method was called “core scoring”: it gave students points for the “assets” included in the essay but did not subtract points for incorrect information.  The question I was reading in 2006, for example, asked students to compare the goals and outcomes of the Mexican, Russian, and Chinese revolutions.  Like the specific “therbligs” that made up reconfigured bricklaying, the traditional holistic grading process was divided into specific assets.  The students received one point for a thesis, one to two points for addressing all parts of the question (one point for two countries, one goal/outcome, one similarity/difference; two points for all six), one to two points for evidence (one point for three pieces of evidence on two countries, two points for five pieces on two countries), one point for a direct comparison, and one point for an analysis of the direct comparison.  Students could then receive one to two points for the “expanded core,” which was a more holistic judgment by the reader based on extra evidence, comparisons, context, or chronology.  However, no matter how good an essay was, unless it received all the points in the core, it could receive no further points in the expanded core.  This meant that some quite good essays received lower scores than some short, weak ones.  All this was laid out in a score sheet with appropriate check boxes.  Our exams were “backread” by the table leaders for at least one day and then sampled thereafter to ensure that we were “on standard,” that is, following the rubric.
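The rubric just described behaves like a small scoring function, and writing it out makes the gating rule plain. The Python sketch below is a hypothetical reconstruction, not the College Board's actual procedure: the names (`CoreChecks`, `evidence_points`, `total_score`) are invented for illustration, and the evidence thresholds encode one reading of the rules as stated (three pieces on two countries for one point, five pieces for two). It shows how a short essay that ticks every core box can outscore a stronger one that misses a single core point.

```python
from dataclasses import dataclass

# 1 thesis + 2 question + 2 evidence + 1 comparison + 1 analysis (per the rubric above)
CORE_MAX = 7


@dataclass
class CoreChecks:
    """Hypothetical score-sheet check boxes for one comparative essay."""
    thesis: bool
    question_points: int          # 0-2, as judged by the reader
    evidence_pieces: int
    countries_with_evidence: int
    direct_comparison: bool
    analysis_of_comparison: bool


def evidence_points(pieces: int, countries: int) -> int:
    # Assumed reading: evidence must cover at least two countries to count at all.
    if countries < 2:
        return 0
    if pieces >= 5:
        return 2
    if pieces >= 3:
        return 1
    return 0


def core_score(c: CoreChecks) -> int:
    # Points are only added for assets present; errors are never subtracted.
    return (int(c.thesis)
            + max(0, min(c.question_points, 2))
            + evidence_points(c.evidence_pieces, c.countries_with_evidence)
            + int(c.direct_comparison)
            + int(c.analysis_of_comparison))


def total_score(c: CoreChecks, expanded: int) -> int:
    core = core_score(c)
    # The expanded core (0-2) is reachable only once every core point is earned.
    if core < CORE_MAX:
        return core
    return core + max(0, min(expanded, 2))


# A rich essay with no identifiable thesis vs. a short checklist-complete one.
strong_but_no_thesis = CoreChecks(False, 2, 9, 3, True, True)
short_checklist = CoreChecks(True, 2, 5, 2, True, True)
print(total_score(strong_but_no_thesis, expanded=2))  # 6
print(total_score(short_checklist, expanded=0))       # 7
```

The early return in `total_score` is the whole point of the illustration: the reader's holistic judgment of the stronger essay never enters the total.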

            Like Frank and Lillian Gilbreth (subjects of Cheaper by the Dozen), after whom “therbligs” were named, the advance readers reconfigured the assets to produce the optimal scoring method.  Standards were adjusted based on the difficulty of the question.  For example, in 2004, my first year reading the exam, the thesis for the comparative question could be split into two sentences separated by erroneous information, and could be located anywhere in the essay.  At my last reading, because the students did a better job answering the question, the thesis could not be split, could not contain any errors, and had to be at the beginning or the end of the essay.  Each question was subject to different standards, as were assets within the rubric for the same question.  Unlike the thesis statement, for example, evidence sentences could contain a false phrase, so long as the rest of the sentence was accurate.

            The head readers contended that the purpose of core scoring was consistency, so that all essays, no matter when or by whom they are read, are subject to the same standard.  But another purpose was clearly efficiency.  A professor sitting at the table behind me was grading hundreds of essays each day.  I managed about 500 over the course of the week, which I was told was about average.  Core scoring also encourages reliance on textbook answers to questions that one might think open to interpretation.  For instance, we would credit a direct comparison stating that the Mexican Revolution was trying to achieve democracy while the Russian Revolution was trying to achieve communism.  One might argue that the Russian Revolution also had something to do with democracy, but a student would have needed a firm grasp of the material to make a valid comparison on such a basis.

            Scoring was carried out in a large barn called the Exposition Hall, which was divided into “yurts,” curtained partitions each enclosing two groups of four pushed-together tables, each table seating four pairs of experienced and new readers (“acorns”).  Each table was supervised by a table leader, and each bank of tables was assigned to a question by a “question leader.”  The question leaders were supervised by the “chief reader,” who had responsibility for completing the reading.  We began at 8 in the morning and finished at 4:30, with an hour and 15 minutes for lunch and two mandatory 15-minute breaks.  We never scored for more than two hours and 15 minutes at a stretch.  Meals and housing were provided at a Nebraska University dormitory.  At each break we were escorted to an adjoining shed with various snacks and beverages.  Tongue-in-cheek, each session ended with a factory steam whistle.

            To prevent the readers from feeling alienated, the chief reader made motivational announcements over a PA system at the beginning of each morning and afternoon, and the AP Program arranged activities each night, including a welcome party, a “Professional Night” for people to make contacts, and an “Open Forum” in which readers could ask questions of the test development committee.  Many of the high school teachers came to the reading because it gave them an advantage in teaching their own AP students, but at least one tenure-track professor told me he was there for the money, as was I, a graduate student at the time.

            There seems to be an assumption that there is something specific to intellectual work that prevents it from becoming proletarianized to the same degree as skilled physical labor such as making shoes or furniture.  But commodification of labor, as Eugene Genovese has told us, is not an all-or-nothing, overnight phenomenon.  It is an ongoing process that, I would add, requires constant innovation to extend it to new areas of economic life.  With the exponential growth of standardized testing, and the continuing erosion of the tenured professoriate, I see little reason to doubt that the kind of labor I was performing at the AP reading represents a growing and increasingly important part of the academic labor market.


More Comments:

Sandor John - 1/14/2008

Jacob Kramer has done a real service with his welcome -- and remarkably restrained -- description of this surreal experience. Particularly important is highlighting connections between the ever-growing reliance on "casual" (vulnerable and malleable) academic labor, standardized testing, and the unending push for privatization.

I worked the same AP World grading gig that Jacob did, also for two successive years. By way of comparison, for almost a decade I worked in a "semi-skilled" job at Ma Bell, one of America's most notoriously regimented employers. "We have no sick days" was a phone company mantra, and computers timed job (and biological) functions in many departments. Though time and motion studies were regularly attempted, we usually succeeded in chasing away the clipboard-toting Taylorites. But then again, we had a union.

So let me tell you: even Ma Bell could pick up a thing or two from the "Grading Factory." While I met many nice people at the AP World readings and established some good professional relationships, Rubric-Land, where knowledge is chipped into bitty widgets to be weighed and graded "scientifically," is definitely a trip and a half.

In case you were wondering: questioning some of the more dubious aspects of rubric-logic was not, repeat not, a way to win brownie points.

Far be it from me to appear soft on old Ma Bell, but I have to say, at the phone company I never had a supervisor, livid with anger, follow me to the bathroom to tell me I was being "unfair" to the team by (supposedly) reducing productivity through excessive "going."

Some of the productivity-control techniques were picturesque. Test readers were advised to bring sweaters (some wore blankets) because the temperature was kept Arctic...so nobody would fall asleep. A similar function was played by piles of candy in front of each work station, plus lots of caffeine.

Lest anyone think the break whistle was just a gag, supervisory staff would sometimes stand frowningly tapping their watches as readers rushed in from grabbing coffee during break. Productivity units were definitely the name of the game, despite official claims that quality counted more than quantity of tests graded.

The Taylorized, mechanical nature of what graders were supposed to do must reflect a conception that "measurable" learning means jumping through the prescribed number of hoops in the prescribed way (as calibrated by private companies).

Thanks again to Jacob for his article.

Andrew D. Todd - 1/14/2008

It gets worse. Exams can be scored by computers. The AP exams are of course the first string of college placement/course equivalence exams. The second string are the CLEP exams, which, while they are not accepted in the best places, are still in widespread use. The CLEP exams rely overwhelmingly on machine-scored multiple-choice questions.


At a higher level, the New York Regents used to grant a B.A./B.S. for a 40th-percentile score on the applicable GRE subject exam. It is an interesting question what would happen if a high-school student took a GRE subject test in one of the more demanding subjects, such as mathematics or a foreign language, and applied for credit on it against sophomore courses.

The defect of multiple-choice exams is, first and foremost, that they do not require what a linguist would call production: they do not actually require the student to write anything. This reminds me of a conversation I had some years ago in a common room. Someone (no names, no pack drill) was complaining that he couldn't get a GTF grader and would therefore have to read all those papers himself. I shrugged and said, "Oh well, you can always convince yourself that multiple-choice exams are equally valid, and give those instead." The complainer replied, nettled, that he did consider multiple-choice exams equally valid. There was nothing much to say in response to that.

For administrative reasons, the ETS/Princeton people are moving from paper machine-scored tests to interactive tests with a computer. At present, they are still up in the GRE/MCAT/LSAT/GMAT range, but it is only a matter of time before they expand the computer systems downwards to tests taken by high school students. The interactive tests are adaptive-- they feed in harder or easier questions according to how the student is doing, and they are therefore believed to be more reliable.

However, once you have the student sitting in front of a computer, with a keyboard and a mouse, and all, it is no great trick to throw in other modes of evaluation besides multiple-choice questions. In some areas, such as mathematics and chemistry, computer scoring can be valid, in the sense that one can construct the subject as a formal game, analogous to chess or solitaire, and require the student to play against the computer. There is an ongoing philosophical discussion about how much a chess-playing computer actually _knows_ about chess, but if you are not a ranked tournament-level player, the distinction is probably somewhat moot. Similarly, for the last fifty years, mathematicians have been debating the significance of Newell, Simon, and Shaw's theorem-proving program. However, to get to a level where the distinction is meaningful, a student would have to do well enough to start in a course more advanced than about 95% of the course sections that a typical college offers, far higher than the AP exams go.

Presumably one could do something approximately similar for languages with Babelfish. People are inevitably going to "study towards the test." If you give multiple-choice exams, an industry will spring up to provide books of sample tests for ambitious candidates to work their way through. However, if the test is an interactive computer game, people will get that computer game and spend a lot of time playing it.

There are people developing computerized essay scoring systems. The idea is that the student types his essay into a computer, where it can be crudely compared, Google fashion, to a reference essay. The system rewards "name dropping." For example, someone who writes an essay about the fifth century referring to Justinian, Belisarius, Theodoric, Pepin, etc. probably knows more than someone who only refers to King Arthur. A crude system, but it does seem to work after a fashion, at least to the extent that it makes students work at writing out summaries of the textbook.
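A toy version of the "name dropping" comparison makes its crudeness concrete. The sketch below is hypothetical and does not reproduce any actual commercial essay scorer: it simply reports what fraction of a reference essay's key names appear anywhere in the student's text, using a bag-of-words match. The reference list and the function names are invented for illustration.

```python
import re


def words(text: str) -> set[str]:
    """Lower-cased set of alphabetic words; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z]+", text.lower()))


def name_drop_score(essay: str, reference_names: list[str]) -> float:
    """Fraction of the reference essay's key names that the essay mentions."""
    if not reference_names:
        return 0.0
    found = words(essay)
    hits = sum(1 for name in reference_names if name.lower() in found)
    return hits / len(reference_names)


reference = ["Justinian", "Belisarius", "Theodoric", "Pepin"]
informed = "Justinian sent Belisarius west while Theodoric ruled Italy; later Pepin rose."
vague = "King Arthur fought many battles."
bare_list = "Pepin Theodoric Belisarius Justinian"

print(name_drop_score(informed, reference))   # 1.0
print(name_drop_score(vague, reference))      # 0.0
print(name_drop_score(bare_list, reference))  # 1.0
```

The last line shows the weakness: a bare list of the right names scores exactly as well as a connected account, because word order and argument are invisible to the measure.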