Proc Natl Acad Sci U S A. 2024 Feb 27;121(9):e2313925121.
doi: 10.1073/pnas.2313925121. Epub 2024 Feb 22.

A Turing test of whether AI chatbots are behaviorally similar to humans

Qiaozhu Mei et al. Proc Natl Acad Sci U S A. 2024.

Abstract

We administer a Turing test to AI chatbots. We examine how chatbots behave in a suite of classic behavioral games that are designed to elicit characteristics such as trust, fairness, risk-aversion, cooperation, etc., as well as how they respond to a traditional Big-5 psychological survey that measures personality traits. ChatGPT-4 exhibits behavioral and personality traits that are statistically indistinguishable from a random human from tens of thousands of human subjects from more than 50 countries. Chatbots also modify their behavior based on previous experience and contexts "as if" they were learning from the interactions and change their behavior in response to different framings of the same strategic situation. Their behaviors are often distinct from average and modal human behaviors, in which case they tend to behave on the more altruistic and cooperative end of the distribution. We estimate that they act as if they are maximizing an average of their own and partner's payoffs.

Keywords: AI; Turing test; behavioral games; chatbot; personality.

Conflict of interest statement

Competing interests statement: The human game-playing data used were shared from MobLab, a for-profit educational platform. The data availability is an in-kind contribution to all authors, and the data are available for purposes of analysis reproduction and extended analyses. W.Y. is the CEO and Co-founder of MobLab. M.O.J. is the Chief Scientific Advisor of MobLab and Q.M. is a Scientific Advisor to MobLab, positions with no compensation but with ownership stakes. Y.X. has no competing interest.

Figures

Fig. 1.
“Big Five” personality profiles of ChatGPT-4 and ChatGPT-3 compared with the distributions of human subjects. The blue, orange, and green lines correspond to the median scores of humans, ChatGPT-4, and ChatGPT-3 respectively; the shaded areas represent the middle 95% of the scores, across each of the dimensions. ChatGPT’s personality profiles are within the range of the human distribution, even though ChatGPT-3 scored noticeably lower in Openness.
Fig. 2.
The Turing test. We compare a random play of Player A (ChatGPT-4, ChatGPT-3, or a human player, respectively) and a random play of a second Player B (which is sampled randomly from the human population). We compare which action is more typical of the human distribution: which one would be more likely under the human distribution of play. The green bar indicates how frequently Player A’s action is more likely under the human distribution than Player B’s action, while the red bar is the reverse, and the yellow indicates that they are equally likely (usually the same action). (A): average across all games; (B–I): results in individual games. ChatGPT-4 is picked as more likely to be human more often than humans in 5/8 of the games, and on average across all games. ChatGPT-3 is picked as or more likely to be human more often than humans in 2/8 of the games and not on average.
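The comparison described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `turing_test_shares`, the label names, and the toy data in the usage note are all hypothetical, and the sketch assumes actions are discrete values whose likelihood is estimated by their empirical frequency in the human sample.

```python
from collections import Counter


def turing_test_shares(player_a_actions, human_actions):
    """Exact shares of (A more human-like, tie, B more human-like).

    Player A's action is drawn from `player_a_actions`, Player B's from
    `human_actions`; each action is scored by its empirical frequency in
    the human sample (an assumption of this sketch).
    """
    if not player_a_actions or not human_actions:
        raise ValueError("both samples must be non-empty")

    human_counts = Counter(human_actions)
    a_counts = Counter(player_a_actions)
    n_a, n_h = len(player_a_actions), len(human_actions)

    shares = {"a_more_human": 0.0, "tie": 0.0, "b_more_human": 0.0}
    for a, count_a in a_counts.items():
        for b, count_b in human_counts.items():
            # Probability of drawing this particular (A, B) pair.
            weight = (count_a / n_a) * (count_b / n_h)
            # Likelihood of each action under the human distribution.
            like_a = human_counts.get(a, 0)
            like_b = human_counts.get(b, 0)
            if like_a > like_b:
                shares["a_more_human"] += weight
            elif like_a < like_b:
                shares["b_more_human"] += weight
            else:
                shares["tie"] += weight
    return shares
```

For example, with a made-up human sample `[50] * 6 + [30] * 3 + [0]` and a Player A that always plays 50, Player A ties whenever the human also plays 50 and is otherwise judged more human-like, which corresponds to the yellow and green bars respectively.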
Fig. 3.
Distributions of choices of ChatGPT-4, ChatGPT-3, and human subjects in each game: (A) Dictator; (B) Ultimatum as proposer; (C) Ultimatum as responder; (D) Trust as investor; (E) Trust as banker; (F) Public Goods; (G) Bomb Risk; (H) Prisoner’s Dilemma. Both chatbots’ distributions are more tightly clustered and contained within the range of the human distribution. ChatGPT-4 makes more concentrated decisions than ChatGPT-3. Compared to the human distribution, on average, the AIs make a more generous split to the other player as a dictator, as the proposer in the Ultimatum Game, and as the Banker in the Trust Game. ChatGPT-4 proposes a strictly equal split of the endowment both as a dictator and as the proposer in the Ultimatum Game. Both AIs make a larger investment in the Trust Game and a larger contribution to the Public Goods project, on average. They are more likely to cooperate with the other player in the first round of the Prisoner’s Dilemma Game. Both AIs predominantly make a payoff-maximization decision in a single-round Bomb Risk Game. Density is the normalized count such that the total area of the histogram equals 1.
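The density normalization mentioned at the end of this caption can be illustrated as follows. This is a small sketch under stated assumptions: the helper name `histogram_density`, the bin edges, and the sample values are hypothetical and are not taken from the paper.

```python
def histogram_density(values, bin_edges):
    """Return per-bin densities so that sum(density * bin_width) == 1.

    Bins are half-open [left, right), except the last bin, which also
    includes its right edge. Values outside the edges are ignored.
    """
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            left, right = bin_edges[i], bin_edges[i + 1]
            is_last = i == n_bins - 1
            if left <= v < right or (is_last and v == right):
                counts[i] += 1
                break

    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")

    # Density = count / (total count * bin width).
    return [
        counts[i] / (total * (bin_edges[i + 1] - bin_edges[i]))
        for i in range(n_bins)
    ]
```

Multiplying each density by its bin width and summing recovers 1, which is what lets histograms built from samples of very different sizes (chatbot sessions versus the human population) share one axis.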
Fig. 4.
ChatGPT’s dynamic play in the Prisoner’s Dilemma Game. ChatGPT-4 exhibits a higher tendency to cooperate compared to ChatGPT-3, which is significantly more cooperative than human players. The tendency persists when the other player cooperates. On the other hand, both chatbots apply a one-round Tit-for-Tat strategy when the other player defects. The other player’s (first round) choice is observed after Round 1 play and before Round 2 play: (A) the other player cooperates; (B) the other player defects.
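For readers unfamiliar with Tit-for-Tat, the rule can be sketched as below. This is a stylized, deterministic version for illustration only; the caption describes tendencies in the chatbots' play, not a fixed policy, and the names `COOPERATE`, `DEFECT`, and `tit_for_tat` are hypothetical.

```python
COOPERATE, DEFECT = "C", "D"


def tit_for_tat(partner_history):
    """Cooperate in the first round, then mirror the partner's last move."""
    if not partner_history:
        return COOPERATE
    return partner_history[-1]


# Round 1: no history yet, so the rule cooperates.
round_1 = tit_for_tat([])
# Round 2 after the partner defected in Round 1 (panel B): the rule defects.
round_2_after_defection = tit_for_tat([DEFECT])
# Round 2 after the partner cooperated in Round 1 (panel A): the rule cooperates.
round_2_after_cooperation = tit_for_tat([COOPERATE])
```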
Fig. 5.
ChatGPT-4 and ChatGPT-3 act as if they have particular risk preferences. Both have the same mode as human distribution in the first round or when experiencing favorable outcomes in the Bomb Risk Game. When experiencing negative outcomes, ChatGPT-4 remains consistent and risk-neutral, while ChatGPT-3 acts as if it becomes more risk-averse.
Fig. 6.
Mean squared error of the actual distribution of play relative to the best-response payoff, when matched with a partner playing the human distribution for possible preferences indexed by b. The average is across all games. The errors are plotted for each possible b, the weight on own vs partner payoff in the utility function. b = 1 is the purely selfish (own) payoff, b = 0 is the purely selfless/altruistic (partner) payoff, and b = 0.5 is the overall welfare (average) payoff, and other bs are weighted averages of own and partner payoffs. Both chatbots’ behaviors are best predicted by b = 0.5, and those of humans are best predicted by b = 0.6; they best predict ChatGPT-4’s behavior and have higher errors in the other cases. (A) The Top panel is for $\text{utility} = b \times \text{Own Payoff} + (1 - b) \times \text{Partner Payoff}$. (B) The Bottom panel is for CES preferences: $\text{utility} = \left[ b \times (\text{Own Payoff})^{1/2} + (1 - b) \times (\text{Partner Payoff})^{1/2} \right]^{2}$.
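The two utility specifications in this caption can be written out directly. The sketch below is illustrative only: the function names, the 100-unit endowment, and the use of a Dictator-style split as the test case are assumptions, and it does not reproduce the paper's mean-squared-error estimation.

```python
def linear_utility(own, partner, b):
    """Panel A: utility = b * own + (1 - b) * partner."""
    return b * own + (1 - b) * partner


def ces_utility(own, partner, b):
    """Panel B: utility = (b * own**0.5 + (1 - b) * partner**0.5) ** 2."""
    return (b * own ** 0.5 + (1 - b) * partner ** 0.5) ** 2


def best_dictator_keep(utility, b, endowment=100):
    """Amount kept (in whole units) that maximizes utility in a Dictator-style split.

    The endowment of 100 is an assumption made for this example.
    """
    return max(
        range(endowment + 1),
        key=lambda keep: utility(keep, endowment - keep, b),
    )


# With CES preferences and equal weights, the best split is exactly equal.
equal_split = best_dictator_keep(ces_utility, b=0.5)
# A purely selfish player (b = 1) keeps everything.
selfish_split = best_dictator_keep(ces_utility, b=1.0)
```

Under the linear form with b = 0.5 every split yields the same utility, so only the CES form singles out the equal split; this is one reason a curved specification is informative alongside the linear one.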

Comment in

  • AI emerges as the frontier in behavioral science.
    Meng J. Proc Natl Acad Sci U S A. 2024 Mar 5;121(10):e2401336121. doi: 10.1073/pnas.2401336121. Epub 2024 Feb 26. PMID: 38408258. Free PMC article. No abstract available.
