Overcoming Turing: Rethinking Evaluation in the Era of Large Language Models

The success of OpenAI’s GPT-4 in passing various professional exams has raised questions about the extent of artificial intelligence and how it compares to human intelligence. Some have suggested that it has passed the Turing test, since its conversational abilities have become so proficient that distinguishing its responses from a human’s is increasingly difficult.

The Turing test, proposed by the renowned computer scientist Alan Turing, is commonly described as evaluating a machine’s ability to exhibit human-like intelligence. Fundamentally, however, it was never intended to be a validating assessment. It is more accurately understood as a philosophical thought experiment, akin to another well-known and related “test,” the Chinese Room Argument.

Imagine a native English speaker who knows no Chinese locked in a room full of boxes of Chinese symbols (a database) together with a book of instructions for manipulating the symbols (the program). Imagine that people outside the room send in other Chinese symbols which, unknown to the person in the room, are questions in Chinese (the input). And imagine that by following the instructions in the program the man in the room is able to pass out Chinese symbols which are correct answers to the questions (the output). The program enables the person in the room to pass the Turing Test for understanding Chinese but he does not understand a word of Chinese.

Cole, David, “The Chinese Room Argument”, The Stanford Encyclopedia of Philosophy (Summer 2023 Edition), Edward N. Zalta & Uri Nodelman (eds.).

Moreover, the question behind the Turing test was not really whether machines could think; it was whether a machine could mimic a person. Indeed, Turing begins his article by describing the test as the “imitation game.”

‘The [imitation] game may perhaps be criticised on the ground that the odds are weighted too heavily against the machine. If the man were to try and pretend to be the machine he would clearly make a very poor showing. He would be given away at once by slowness and inaccuracy in arithmetic.’

Alan Turing, ‘Computing Machinery and Intelligence’, Mind: A Quarterly Review of Psychology and Philosophy, Vol. LIX, N.S. No. 236, Oct. 1950

Moravec’s Paradox highlights a related disparity: tasks that humans complete easily are often difficult for machines to replicate, while tasks that humans find demanding can be comparatively easy for machines.

“it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility”

Erik Brynjolfsson, “The Turing Trap: The Promise & Peril of Human-Like Artificial Intelligence”, Daedalus, Spring 2022

Perhaps what is needed is a broader, more holistic suite of assessments that considers technical performance, human capabilities, and the rationale for assigning specific tasks to one or the other. Such a suite could pave the way for more meaningful human-machine collaboration.

Source: Stanford Law School


