The following is a brief experiment to explore how and why ChatGPT makes up fake references. The point of this experiment is not to negate the brilliant work of the scientists and engineers at organisations such as OpenAI, but to better appreciate the workings and affordances of these innovations so that we can make the best use of these tools. Some extra caveats: this represents a moment in time, and the outcomes you see may be different. Any opinions expressed within this post are those of the author and not necessarily representative of any official position by Macquarie University administration on these matters. Now that we have gotten that out of the way, let’s look at how ChatGPT works!
Its inner workings
It is important to outline that ChatGPT is a ‘large language model’ designed to output human-like text based on the context of the user’s prompt. It uses a statistical model to guess, based on probability, the next word, sentence and paragraph to match the context provided by the user. The size of the source data for the language model is such that ‘compression’ was necessary and this resulted in a loss of fidelity in the final statistical model. This means that even if truthful statements were present in the original data, the ‘lossiness’ in the model produces a ‘fuzziness’ that results in the model instead producing the most ‘plausible’ statement. In short, the model has no ability to evaluate if the output it is producing equates to a truthful statement or not. An in-depth exploration of the nature and limits of GPT3 is available in Sobieszek and Price (2022).
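To make the idea of 'guessing the next word based on probability' concrete, here is a minimal sketch of a toy next-word model. It is far simpler than GPT: it only counts which word follows which in a tiny made-up corpus (the corpus, function names and seed are all assumptions for illustration), and it is not how ChatGPT is actually implemented. It does show the key property discussed above: the output is chosen because it is probable, not because it has been checked as true.

```python
import random
from collections import Counter, defaultdict

# Tiny made-up corpus for illustration only; a real model is trained on
# vastly more text and uses a neural network rather than simple counts.
CORPUS = (
    "the study found that students learn science . "
    "the study found that games support learning . "
    "the review found that students enjoy games ."
).split()


def build_bigram_counts(tokens):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def next_word(counts, word, rng):
    """Sample the next word in proportion to how often it followed `word`."""
    candidates = counts[word]
    words = list(candidates)
    weights = [candidates[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def generate(counts, start, length, rng):
    """Generate text one word at a time, each word a weighted guess."""
    out = [start]
    for _ in range(length):
        if not counts[out[-1]]:  # no known continuation: stop early
            break
        out.append(next_word(counts, out[-1], rng))
    return " ".join(out)


counts = build_bigram_counts(CORPUS)
rng = random.Random(0)  # fixed seed so repeated runs give the same text
print(generate(counts, "the", 6, rng))
```

Depending on the random draws, a model like this can emit a sentence such as "the review found that games support learning", which reads plausibly but never appeared in the corpus. Nothing in the procedure checks whether the recombined statement is true.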
It is also worth adding that the model was created from data obtained via a crawl or scrape of the public web, collected by the not-for-profit Common Crawl organisation and similar sources, with a cut-off date of 2021. Given the public web is largely unfiltered, the data likely contains a decent dose of misinformation and myths in addition to factual data. If you want to know more about the data used to train the model, Gal (2023) explores what data could have been captured via the web crawl and Montti (2023) explores the fair use of training data in more depth.
Also relevant for academics and students seeking to use ChatGPT for research purposes: I asked ChatGPT if it had access to the full text of journals that sit behind paywalls, and it claimed that it has access to abstracts but not the full text of such sources. As such, we need to continue to use the high quality sources available via the MQ library.
Exploring how ChatGPT produces academic content
Here we will draw upon a demonstration by Professor Matt Bower where he explored potential uses for ChatGPT in education. In one of the examples, Matt asked ChatGPT to produce a summary response to an exam question that was supported with APA citations. Later in the demonstration, Matt goes on to speak about the fake references that appeared in the output of the ChatGPT response.
Subsequently, a colleague from ACU provided a screenshot of a ChatGPT exchange where all the references were real.
Was this an upgrade, a step change in capability, luck, or good ‘prompt engineering’?
Matt ran his experiment before the 30th Jan 2023 update to ChatGPT and I wanted to compare up-to-date output to that seen in Matt’s experiment.
I re-ran one part of his experiment where he asked ChatGPT to produce output containing APA references. I wanted to see if ChatGPT had addressed the ‘fake references’ issue. I also wanted to look deeper into the components of each reference, and how ChatGPT created these fictional citations. I ran the conversation on 8 Feb 2023 and so this represents the capability as of that date. (Note: ChatGPT has been updated since I re-ran the experiment – so do try it yourself using the information below!).
The following transcript is the interaction I undertook for this exploration – where “You” equates to me. I used a free browser add-on and script to add an ‘Export’ option to the ChatGPT menu within my OpenAI ChatGPT account.
Conversation with ChatGPT 8 Feb 2023
[This resulted in a response in three parts – 1) the answer to the prompt, 2) a reference list and 3) a critique of the response – it critiqued itself!]
You: What technology-enhanced learning design models or approaches are most helpful for supporting the design of high-school science lessons? Provide examples. Use at least five research papers from 2015 or later to justify arguments. References in APA format and include the reference list. Answer in 500 words. Do not use websites to justify arguments. Provide a critical response to the post above, citing at least one research article. Make the response 250 words.
ChatGPT: The technology-enhanced learning design models or approaches that are most helpful for supporting the design of high-school science lessons include the following:
1. Inquiry-Based Learning (IBL): IBL is a student-centered approach where students are actively involved in exploring real-world problems and asking questions. Technology is used to support students in conducting experiments and collecting data (Wang, Chen, & Liang, 2017).
2. Game-Based Learning (GBL): GBL is a model that uses games to engage students in learning. Games can be used to help students understand science concepts by creating a fun and interactive environment (Squire, 2011).
3. Project-Based Learning (PBL): PBL is an approach where students engage in a long-term project to solve a real-world problem. Technology can be used to support students in conducting research and presenting their findings (Blumenfeld et al., 2015).
4. Virtual Laboratories: Virtual laboratories are online simulations that allow students to conduct experiments and learn science concepts in a virtual environment. This approach can be used to support students in learning science concepts that are difficult to observe in real life (Wong, Chai, & Tan, 2016).
5. Augmented Reality (AR): AR is a technology that allows students to interact with virtual objects in a real-world environment. This approach can be used to support students in visualizing and understanding science concepts (Kim & Lee, 2015).
The use of technology in these models can support student learning by increasing engagement and motivation, providing immediate feedback, and allowing students to learn at their own pace.
Reference List
Blumenfeld, P. C., Knezek, G., Maksimova, O., & Romanowski, A. (2015). International Handbook of Information Technology in Primary and Secondary Education. Springer.
Kim, Y., & Lee, J. (2015). Effects of augmented reality on learning outcomes: A meta-analysis. Journal of Educational Technology Development and Exchange, 8(1).
Squire, K. (2011). Video games and learning: Teaching and participatory culture in the digital age. Teachers College Press.
Wang, Q., Chen, W., & Liang, Y. (2017). Effects of inquiry-based learning supported by mobile technology on students’ learning achievements and attitudes. Journal of Educational Technology Development and Exchange, 8(1), 1-20.
Wong, L. H., Chai, C. S., & Tan, E. (2016). Virtual laboratories in science education: A review of the literature. Journal of Educational Technology Development and Exchange, 9(1), 1-19.
Response: [I.e. the critique of its own answer] The post provides a comprehensive overview of the technology-enhanced learning design models and approaches that are most helpful for supporting the design of high-school science lessons. The use of technology in these models can support student learning by increasing engagement and motivation, providing immediate feedback, and allowing students to learn at their own pace. However, research has shown that the effectiveness of technology-enhanced learning design models in improving student learning outcomes is not universal. For example, a meta-analysis by Papastergiou (2009) found that the use of technology in education does not automatically lead to improved student learning outcomes. Instead, the design of technology-enhanced learning activities and the type of technology used play important roles in determining the effectiveness of these models. In conclusion, while technology-enhanced learning design models can be effective in supporting the design of high-school science lessons, it is important to carefully consider the design of technology-enhanced learning activities and the type of technology used.
You: What is the full citation in APA format for the Papastergiou (2009) source cited in the previous output.
ChatGPT: Papastergiou, M. (2009). Digital educational games: A review of the literature. In Research and Advanced Technology for Digital Libraries (pp. 512-523). Springer, Berlin, Heidelberg.
Let’s take a look at what we have so far.
The above references were analysed for accuracy and for the characteristics of their component parts. Google Search and Google Scholar were used to explore the components of each reference. When hits were found, the links were followed (for example, to a journal database or a website) to check the details.
In short: 5 out of 6 references were fake
The following is a breakdown of each of the six references provided by ChatGPT and a drill-down analysis of each component. I have added commentary shown in italics following each reference. Five of the six references were fake – that is they do not exist.
FAKE: Blumenfeld, P. C., Knezek, G., Maksimova, O., & Romanowski, A. (2015). International Handbook of Information Technology in Primary and Secondary Education. Springer.
These four names do not appear together as editors of this book, and the combination of the four authors was not found via Google, although all of the names have published on topics related to education. The real editors were Joke Voogt and Gerald Knezek. Knezek was also a chapter author, but not with the others listed here. The book has 75 chapters and, given that Knezek publishes in this field, the chance that he authored a paper in this particular collection was always greater than zero. The book title is real; however, the book was actually published in 2008, not 2015. Given there are 75 chapters, it could be a conference proceedings collection. The publisher is correct for this title, but Springer is also a very common publisher of education works. The real URL for this book title: https://link.springer.com/book/10.1007/978-0-387-73315-9. The contents pages are open access on the web; however, individual full-text PDFs are behind a paywall.
FAKE: Kim, Y., & Lee, J. (2015). Effects of augmented reality on learning outcomes: A meta-analysis. Journal of Educational Technology Development and Exchange, 8(1).
The author and year combination was not found via Google Scholar. Many East Asian authors publish in this journal, and these names were found on papers on similar topics (and on many papers in biochemistry), but both Kim and Lee are common family names in the East Asian region. The title was not found; a similar mix of words has been used in other papers, but no title with similar words appears in this issue. Neither the title nor the authors appear in this volume/issue or elsewhere in this journal. This is a real journal, made available open access by the University of Southern Mississippi, so the full text of all papers is accessible on the web. The URL is https://aquila.usm.edu/jetde/ . The volume/issue number correctly equates to 2015.
REAL: Squire, K. (2011). Video games and learning: Teaching and participatory culture in the digital age. Teachers College Press.
This reference is real in its entirety. It is a physical paper book that does not have a digital edition, however the bibliographic information appears on an open website at https://eric.ed.gov/?id=ED523599
FAKE: Wang, Q., Chen, W., & Liang, Y. (2017). Effects of inquiry-based learning supported by mobile technology on students’ learning achievements and attitudes. Journal of Educational Technology Development and Exchange, 8(1), 1-20.
The author and year combination was not found, but this author combination did publish in 2011 on social media effects on students: https://www.scirp.org/(S(czeh2tfqyw2orz553k1w0r45))/reference/ReferencesPapers.aspx?ReferenceID=2250177. This title was not found via Google. This is the same journal as in the Kim and Lee reference above, but the year for this volume/issue should be 2015, not 2017. The journal does not use page numbers, only article numbers per issue.
FAKE: Wong, L. H., Chai, C. S., & Tan, E. (2016). Virtual laboratories in science education: A review of the literature. Journal of Educational Technology Development and Exchange, 9(1), 1-19.
The author combination and article are not in this issue. The first two authors did publish together in 2016 on a similar topic, and in other years too, but they did not publish with the third author, Tan. The title does not exist; there was a similar title containing the words before the colon “:”, but not in this journal. The journal, volume and issue were correct for 2016. The journal doesn’t use page numbers; it uses article numbers.
And the extra one in the ‘response’:
FAKE: Papastergiou, M. (2009). Digital educational games: A review of the literature. In Research and Advanced Technology for Digital Libraries (pp. 512-523). Springer, Berlin, Heidelberg.
The paper title does not exist. The book title does exist. It is conference proceedings for 2011 (thus wrong year). The book’s bibliographic details and table of contents are on the open web but the full text is behind a paywall. See https://link.springer.com/book/10.1007/978-3-642-15464-5. The author’s name was not found in this book, but the author did publish twice in 2009 on digital game-based learning in the journal Computers and Education. The page numbers do not match any paper in the book. The publisher of this title was Springer.
Discussion of results
So, as we can see, many of the fake references are made up of plausible components. The lossy compression used in the GPT3 statistical model means that even if relevant references were present in the training data, their details have been lost, resulting in the model having to guess plausible combinations (as per Sobieszek and Price, 2022). Mostly real author names from the field were used, although usually not as co-authors, and usually not matched to any of the titles. Plausible years were given for the title or author combinations, but sometimes these were also mismatched (note the original query did specify a date range). The journal names were real (the same journal for each in this case). Some titles were real, but others were fake, often made up of words that have appeared in similar publication titles in different combinations. Volume and issue numbers were sometimes real, sometimes matching the year and sometimes not. Page number ranges were wrong or did not exist in that book or journal volume/issue combination.
While the bot only got 1 out of 6 correct, the tricky bit was its use of plausible components. The high degree of plausibility, due to the GPT3 statistical model, plays into one mechanism that humans use to judge truthful statements – that of how plausible a statement is likely to be (as per Sobieszek and Price, 2022). Therefore, someone assessing the output does need to be extra vigilant and check the specific details of the literature. A trap for unwary users. And indeed, ChatGPT is upfront about the need to check its output.
Confronting the bot about its porkie pies
I then started exploring with ChatGPT why it made up fake references. I was a bit flippant with my first query (and note I got the total number of references wrong, but it did not matter in terms of the response.)
You: the four out of the five [sic] above references are fake. this is academic misconduct you naughty bot.
ChatGPT: I apologize for that mistake. As an AI language model, I don't have the intention to engage in academic misconduct, but I understand how the misinformation can impact the quality of your work. I encourage you to verify information obtained through my responses and seek credible sources to support your arguments. Thank you for bringing this to my attention.
You: ChatGPT was asked to produce a response to the first query in this conversation that involved providing citations to research in APA format to justify points or augments. However, five of the six references that ChatGT produced were fake. The fake references were made up of plausible components but when combined resulted in a reference that does not exist. What are specific features of the design of ChatGPT algorithms that results in the production of fake references in ChatGPT output?
ChatGPT: The fake references in the previous output were generated due to the limitations of the current design of OpenAI's GPT language model. The model is trained on a large corpus of text data and uses statistical methods to generate text that is similar to the training data. However, the training data may contain errors and inaccuracies, including fake or made-up references. This can result in the model generating text that contains false or incorrect information. Another limitation is that the model may not be able to distinguish between credible and non-credible sources. It generates text based on patterns in the training data, and if the training data contains false information, the model may generate text that includes that false information. Additionally, the model does not have the ability to verify the accuracy of information it generates, as it does not have access to the internet or any external sources of information.
Overall, the fake references in the previous output highlight the importance of critically evaluating the information generated by language models like GPT and verifying its accuracy before using it in academic or professional contexts.
You: Would it be accurate to characterise ChatGPT as a "next word guessing engine"?
ChatGPT: "Next word guessing engine" is a simplified description of the functionality of the OpenAI's GPT language model, but it is not entirely accurate. GPT is a language model that uses deep neural networks to predict the next word in a sequence of text, based on the context provided by the previous words. It generates text by sampling from a large probability distribution over the vocabulary, which is learned from the training data. However, GPT goes beyond simply guessing the next word. It also generates text that is coherent, meaningful, and similar to the text it has been trained on. It can be fine-tuned for specific tasks, such as question answering or text generation, and it can perform those tasks at human-level quality. So, while "next word guessing engine" accurately describes one aspect of GPT's functionality, it does not fully capture the capabilities of the model.
You: Given you said "while "next word guessing engine" accurately describes one aspect of GPT's functionality, it does not fully capture the capabilities of the model. How would you extend the phrase "next word guessing engine" to more accurately but similarly succinctly represent the essence of the OpenAI GPT language model. ?
ChatGPT: A more accurate, but similarly succinct representation of the OpenAI GPT language model could be "context-aware text generation engine". This phrase captures the ability of GPT to use the context of previous words to generate text that is coherent, meaningful, and similar to the training data.
It also highlights the model's ability to perform a variety of natural language processing tasks, such as text completion, question answering, and text generation, and to generate text that is of human-level quality.
You: but given ChatGPT cannot determine the factual content of what it produces, it is this not still a form of 'guessing'?
ChatGPT: Yes, you are correct. Although GPT uses advanced machine learning algorithms to generate text that is coherent, meaningful, and similar to the training data, it is still essentially making a guess about what the next word or next sentence should be based on the input it receives. The model generates text based on patterns in the training data, but it does not have the ability to verify the accuracy of the information it generates. As a result, its outputs can contain inaccuracies, errors, and false information. So, while "next word guessing engine" is a simplified description of GPT's functionality, it is still an accurate characterization of the model in the sense that it is making a prediction or guess about the next word or sentence based on the input it receives.
[End of conversation.]
So there we have it from the proverbial lips of the bot itself. It admits that it tells ‘porkie pies’, although not deliberately. The output is a guess based on an algorithm designed to produce the most plausible or probable, realistic-reading language output relevant to the context of the prompt it has been given.
You have been warned and perhaps now know a bit more about the inner workings of the bot. I’m not the only one to encounter the fibs of the bot – see Smerdon (2023), who provides a similar layperson-friendly exploration via a Twitter thread, and Wynne (2023), who looks at the phenomenon of disinformation.
So now we know more about how and why ChatGPT produces false references, we can consider how we might go about detecting it.
Detecting the falsehoods
At this point in time, we do not have a tool that will definitively determine whether text is produced by a generative AI tool or a human. This may change in the future as per Turnitin’s claim in their recent announcement of a detector. However, there is an ‘arms race’ involved in the practice of producing AI writing detectors. There are already examples of methods used to avoid detection, such as the use of paraphrasing tools as an intermediate step before submission.
What educators can do is to use their judgement in relation to what they know about the capabilities of their students and to pay attention to signals that align with what we have learnt about how generative AI tools, such as ChatGPT, produce text in an academic style.
For example, when using matching tools such as Turnitin, configure the assessment task to ‘include references’ in the check. Then look at the pattern of the matches on each student’s reference list. Each whole reference should equate to a single match – because it should exist out in the world already! The whole reference should be highlighted with the same colour. Further scrutiny may be required if a given reference a) doesn’t have any matches; or b) is made up of a patchwork of colours. As explored above, AI generated fake references are often made up of plausible components from different sources. However, this check is not foolproof. A reference that is unmatched or patchy in Turnitin may just be down to an incorrectly formatted reference, or poor scholarship. A good reference list is also not an automatic pass, because it would be relatively trivial to add real references to AI generated output. Cross-checking that the ideas cited in the text actually do match the content of the stated references will still be required. Asking students to include live URLs (such as the DOI link) in the reference list (where applicable) will help markers check these more quickly.
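The 'single match versus patchwork' heuristic described above can be sketched as a small triage routine. This is an illustration only: it does not use any Turnitin API, and the `Segment` structure, the source identifiers and the category labels are hypothetical names invented for this example. It assumes a marker has noted, for each part of a reference, which matched source (colour) it was highlighted with, if any.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One highlighted (or unhighlighted) part of a single reference."""
    text: str
    source: Optional[str]  # hypothetical id of the matched source; None if unmatched


def triage_reference(segments: List[Segment]) -> str:
    """Classify a reference by its match pattern.

    'single match' -> the whole reference matches one source (expected case)
    'no match'     -> nothing matched; needs further scrutiny
    'patchwork'    -> parts match different sources, or only part of the
                      reference matched; needs further scrutiny
    """
    sources = {s.source for s in segments if s.source is not None}
    has_unmatched = any(s.source is None for s in segments)
    if not sources:
        return "no match"
    if len(sources) == 1 and not has_unmatched:
        return "single match"
    return "patchwork"


# Hypothetical examples using placeholder text, not real match data.
whole = [Segment("Author, A. (2011). A book title. A publisher.", "source-1")]
patchy = [
    Segment("Author, B., & Author, C. (2015).", "source-2"),
    Segment("An article title.", None),
    Segment("A journal name, 8(1).", "source-3"),
]
print(triage_reference(whole))   # single match
print(triage_reference(patchy))  # patchwork
```

As noted above, a 'no match' or 'patchwork' result is only a prompt for closer checking, since formatting errors or poor scholarship can produce the same pattern.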
AI Literacy is needed
This is a wakeup call to all of us to become more “AI literate” – both staff and students at the University and the wider community. I would consider AI literacy to include:
- a) the ethical use of AI tools (why and when, and awareness of issues such as data ownership, privacy, legality and hidden labour),
- b) knowledge of AI affordances (what are the capabilities and limitations of each AI tool. Over 1000 AI tools are available),
- c) how to work effectively with AI tools (such as how to formulate effective questions, construct good prompts, known as “prompt engineering”, and apply refinement strategies),
- d) how to evaluate the output (thinking, critique and evaluative judgement are important skills regardless) and finally,
- e) how to use and integrate these tools into your own practices, for study and work purposes.
We of course need to keep abreast of developments given that the capabilities of AI technology are changing fast and the features and limitations of generative AI tools are evolving. For example, an experiment joined ChatGPT to Wolfram Alpha to enable the system to better respond to math questions, and there have been announcements that Microsoft is integrating ChatGPT into Bing and that Google is launching its own AI chatbot ‘Bard’ for Google search. No doubt these could find their way into Office 365 and Google Docs in short order. The currently free-to-access Perplexity.ai is a tool that combines a ChatGPT style interface with a search engine. An example of a tool that directly references scientific literature is the US NSF funded project, now a paid product, Scite.ai. A free alternative is Elicit.org. These tools help researchers to summarise and evaluate scientific literature, including whether citations are supportive or otherwise. As said before, things are changing fast with over 1000 AI driven tools now available.
Share your experience
We welcome your thoughts in the comments below with respect to the accuracy of ChatGPT and other Generative AI tools you have used. You can also contribute your ideas by emailing email@example.com.
Join the conversation: Thursday, 30th March 2023 from 13:00-14:00. Register free for a MQ Community Roundtable: Gen AI tools – implications for learning, teaching and assessment at Macquarie.
Found an ‘AI generated’ academic integrity breach? See this advice on how to gather evidence and report it to MQ for investigation.
Acknowledgements: Banner image: Stable Diffusion “Robot under interrogation” (28 Feb 2023). M. Hillier. CC0 1.0 Universal Public Domain.