Much has been said about the idea that ChatGPT and other generative AI tools are 'just another calculator' or 'just another search engine'. There is some sense in the analogy insofar as generative AI is a disruptive technology: education has been here before, and we have adapted.

Knowing the differences in the affordances of technology tools better equips people to understand the problems and situations for which each tool is suited. This is pertinent to higher education because we have a role in providing advice to students about the appropriate use of these tools in their studies. Building AI literacy is likely to be an important ingredient in being productive in the future.

A summary table is presented below. Read on for the details.  

| | Calculator | Search engine | Generative AI (text) |
| --- | --- | --- | --- |
| Example | Casio FX-82 | Google search (no AI) | ChatGPT (v3.5), no plugins |
| Output is… | Numbers according to the buttons pressed by the user. | The best match according to the input from the user and mediating variables pertaining to the user (e.g. location, past searches). | The most plausible, human-like prose based on the context given by the user. |
| Predictable output | Yes | Yes by Google, but not by the end user | No |
| Reproducible output immediately (e.g. re-run the same query 2 seconds later) | Yes | Yes, by the same user | No |
| Reproducible output over time (e.g. re-run the query a month later) | Yes | No | No |
| Unique or new output is possible (i.e. has never been seen before) | No | No | Yes |
| The system can know truth from falsehood | No | No | No |
| The system can output data that equates to false information | No | Yes (search results may point to false information) | Yes (the system can create an arrangement of words that equates to false information) |

Comparing the affordances of ChatGPT, a search engine and a calculator.

If we examine the affordances of these different tools, we can appreciate how generative AI is indeed different in important ways that have implications for information quality.

Calculators

e.g. the Casio FX-82 and similar devices. The device does exactly what the user asks of it. You ask a question of the machine by pressing buttons, and it does not matter who uses the device: the result is the same for everyone, every time, provided the user presses the same buttons in the same sequence and does not make any mistakes. The device does not know truth from falsehood. The output follows directly from the input according to predefined rules. Scientific calculators do have a very complex set of rules defined by mathematics, and as such the output is predictable and reproducible.

Significantly, a properly functioning device cannot produce unique or new information. However, given the potential for human error, the user needs to check that the output matches what they *expected* according to the *intended* sequence of button presses.
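
As a rough illustration (a sketch only, not the firmware of any real calculator), this deterministic, rule-bound behaviour can be expressed in a few lines of Python: the same 'button presses' always yield the same output.

```python
# A sketch of deterministic, rule-bound calculation (illustrative only;
# not the firmware of any real calculator). The output follows directly
# from the input according to predefined rules.
import math

def calculate(expression: str) -> float:
    """Evaluate an expression using a fixed whitelist of mathematical rules."""
    allowed = {"sin": math.sin, "cos": math.cos, "sqrt": math.sqrt, "log": math.log}
    return eval(expression, {"__builtins__": {}}, allowed)

# The same "button presses" give the same answer for every user, every time.
print(calculate("sqrt(2) * 10"))  # 14.142135623730951
print(calculate("sqrt(2) * 10"))  # identical on every run
```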

Search engines

Search engines such as Google or Bing (prior to any integration with an AI chatbot!) retrieve extracts of, or links to, existing information based on what they are asked to do. The user asks a question of the machine by typing a few words, numbers or operators (up to 32 words in length in the case of Google). The output can vary over time because information on the Internet changes, and thus the information in the search engine's database changes over time.

In modern search engines driven by targeted advertising, the output is mediated by variables related to the user, such as their location and their history of activity across the web. As such, the output may not be the same for everyone, and it may change over time. Like a scientific calculator, the process of retrieval and output follows from the user's input according to a complex set of defined rules, which usually includes many mediating variables (e.g. the user's prior web use, location, etc.). If all of these inputs are known, the output is *in theory* reproducible, although not in practice for the user.
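
A toy retrieval function (purely illustrative; real search engines are vastly more complex, and the pages and field names below are invented for the example) shows why results are deterministic in principle yet hard for an end user to reproduce: the ranking depends on mediating variables as well as the query itself.

```python
# Toy illustration of search-engine retrieval (not any real engine's code):
# ranking depends on the query plus mediating variables about the user,
# so two users can get different results for the same query.
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    text: str
    region: str

# A fixed snapshot of "the web" at one point in time (the index changes as the web changes).
INDEX = [
    Page("example.org/exam-policy-au", "university exam policy australia", "AU"),
    Page("example.org/exam-policy-uk", "university exam policy united kingdom", "UK"),
]

def search(query: str, user_location: str) -> list[str]:
    """Score pages on word overlap with the query, boosted by the user's location."""
    query_words = set(query.lower().split())
    def score(page: Page) -> float:
        overlap = len(query_words & set(page.text.split()))
        boost = 0.5 if page.region == user_location else 0.0  # mediating variable
        return overlap + boost
    return [page.url for page in sorted(INDEX, key=score, reverse=True)]

print(search("university exam policy", "AU"))  # AU page ranked first
print(search("university exam policy", "UK"))  # UK page ranked first
```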

Just like a calculator, the system does not know truth from falsehood. Significantly, a search engine cannot produce unique or new information; it can only retrieve what is in its database, be it true or false. It is up to the human to use their judgement as to the veracity of the information presented. In part, humans do this by evaluating the plausibility of the information and comparing the output to what they already know or believe about the world.

Generative AI for text

e.g. ChatGPT (using GPT v3.5, without plugins). The system is designed to generate plausible, human-like text output (words, sentences, paragraphs, computer code, etc.) in response to the context (prompt) provided by the user. In lay terms, generative artificial intelligence tools based on Large Language Models (LLMs) essentially work as a next-word guessing engine, with the output contextualised to the user's input. The context can also include information from recent prior prompts received in the current conversation. The user can use natural language and any operators to formulate a question or statement (a 'prompt') for the machine. In the case of ChatGPT, the prompt can be quite long and contain a lot more detail than is possible with a search engine request.
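
In caricature (a toy sketch only; real LLMs use large neural networks over token vectors rather than a lookup table, and the tiny 'model' below is invented for illustration), the next-word guessing loop looks something like this:

```python
# A caricature of the "next word guessing" loop (illustrative only).
# Hypothetical model: for the current word, the estimated probability of each next word.
TINY_MODEL = {
    "the":     {"student": 0.6, "exam": 0.4},
    "student": {"writes": 0.7, "sleeps": 0.3},
    "exam":    {"starts": 0.8, "ends": 0.2},
    "writes":  {"quickly.": 1.0},
    "sleeps":  {"soundly.": 1.0},
    "starts":  {"now.": 1.0},
    "ends":    {"soon.": 1.0},
}

def continue_text(prompt: str, max_words: int = 8) -> str:
    """Repeatedly append the most probable next word, given the word just produced."""
    words = prompt.lower().split()
    for _ in range(max_words):
        options = TINY_MODEL.get(words[-1])
        if not options:
            break
        words.append(max(options, key=options.get))  # pick the most plausible next word
        if words[-1].endswith("."):
            break
    return " ".join(words)

# Plausible-sounding output, produced with no notion of whether it is true.
print(continue_text("the"))  # "the student writes quickly."
```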

The P in ChatGPT stands for 'Pre-trained', meaning that the underlying model is fixed for a period of time (until the next update or supplemental guided training). The 'database' for the GPT-3 and 3.5 series of models is not a database of facts but a large statistical model made up of 175 billion parameters that capture the relationships between words and word fragments (tokens). The model was trained on over 500 gigabytes of text data scraped from the public web, plus some other sources. The exact source data is undisclosed, but it likely includes facts from trustworthy sources as well as a whole host of poor quality, factually incorrect information, conspiracy theories and phobic material that is present on the public web. In the case of ChatGPT, OpenAI have made attempts to place additional filters on output to limit socially unacceptable information (such as racism) being presented to users; however, this has only been partly successful to date.

In compiling such a large dataset, some compromises had to be made. The sheer size of the model meant that some compression was required to make it workable. The compression is 'lossy', in that some information was lost in the process of compressing the data (in a similar way to how JPG image compression averages the values of adjacent pixels). This means that the relationships between elements in the statistical model become 'fuzzy', and so any relationships between bits of information that may indicate a fact are also made less certain. In the end, the model can only estimate these relationships, so the best it can do is produce the most 'plausible' output. This is at the heart of why large language models (LLMs) can output text that constitutes false information, a behaviour referred to as AI 'hallucination'. An example is the generation of fake references. Further, the design of the system includes an element of randomisation in order to produce variation in the output each time a prompt is submitted.
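
That element of randomisation can be sketched as follows (a toy example, not OpenAI's actual sampling code; the word probabilities are invented): rather than always taking the single most likely next word, the system samples from the probability distribution, so the same prompt can come out differently on each run.

```python
# Toy sketch of randomised, temperature-style sampling (illustrative only).
# Sampling from the distribution, rather than always taking the top word,
# is why repeated runs of the same prompt produce different output.
import random

def sample_next_word(options: dict[str, float], temperature: float = 1.0) -> str:
    """Sample one next word; a higher temperature flattens the distribution."""
    words = list(options)
    weights = [p ** (1.0 / temperature) for p in options.values()]
    return random.choices(words, weights=weights, k=1)[0]

next_word_probabilities = {"references": 0.5, "citations": 0.3, "sources": 0.2}
for run in range(3):
    print(run, sample_next_word(next_word_probabilities, temperature=1.2))
# Output typically varies from run to run, e.g. 'references', 'sources', 'citations'.
```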

Stephen Wolfram (of Wolfram Alpha fame) recently provided a relatively accessible, though lengthy, exploration of how LLMs such as ChatGPT work, and an in-depth exploration of the nature and limits of GPT-3 is available in Sobieszek and Price (2022).

Key features of AI

Just like a calculator or a search engine, a generative AI text system does not know truth from falsehood, and in this case it cannot know, because of the 'fuzzy' nature of the model. The process is a prediction or estimation, which means that the output does not necessarily follow directly from the input. There are no 'hard-wired' rules or sets of variables that could result in predictable or reproducible output (the exception being selected post-hoc rules added to the model). This is not to say that any instructions in the user's input are ignored, but rather that they are used as inputs for the prediction. The system attempts to guess a suitable, human-like set of words, sentences and paragraphs based on the context provided by the user's input and any recent exchanges it has had in that session (conversation). While a complex set of rules governs how the process works within the statistical model that produces the output, the prediction is not precise.

As such, the output is not reproducible: even if the rules governing the process are known and the same set of input variables is stored and run again on the same model, the output can be different. Significantly, this means that a query can produce unique or new information each time it is run. It can output combinations of words (sentences) that have never existed before. The combination of words may equate to information that is true, or it could be false. The latter behaviour is referred to as 'hallucination'.

Over to the humans

It is up to the humans to use their judgement as to the truthfulness of the information. Again, in part, humans do this by evaluating the plausibility of the information and comparing this to what they already know or believe about the world. The very fact that the model attempts to produce plausible, human-like output means that such output interferes with one of the methods humans use to evaluate information. This is a potential trap for the unwary or those not trained in the critical evaluation of information. 

The nature and the current limitations of LLMs have implications for how such tools can be deployed in education settings. Because the model is fallible, it cannot be relied upon to produce accurate information. This makes LLMs unsuited to taking on roles such as that of a trusted tutor or information source. However, there are many possible uses for imprecise language models, such as brainstorming, summarising information provided by the user, or acting as a teacher's helper.

What the future holds

Developments in generative AI are rapid, with new announcements and new tools appearing on a regular basis. ChatGPT version 4 has been released; it provides a step up in capabilities, but limitations such as generating fake references still persist, although OpenAI have claimed a 40% lower rate of falsehoods. The recently announced plugins for ChatGPT, which include Wolfram Alpha, live web search, a Python code interpreter and links to a range of online services, mean the capabilities of ChatGPT will evolve over time. It is possible that the accuracy of the output on specific tasks, such as factual and computational problems, will improve. Live web search and GPT-4 have been combined in Microsoft's Bing chatbot. Google are also testing their own AI chatbot, 'Bard'. However, both have been seen to produce hallucinated or false information, so it is still a case of 'user be aware'.


Read other posts in the generative AI series. Recent posts include: A proposed AI literacy framework and the launch of Turnitin’s new AI writing detection feature.

Acknowledgements: Banner image: Mash up of images generated using Scribble Diffusion (28 March 2023) by M. Hillier. CC0 1.0 Universal Public Domain.

Posted by Mathew Hillier

Mathew has been engaged by Macquarie University as an e-Assessment Academic in Residence and is available to answer questions from MQ staff. Mathew specialises in digital assessment (e-assessment) in higher education. He has held positions as an advisor and academic developer at the University of New South Wales, University of Queensland, Monash University and University of Adelaide. He has also held academic teaching roles in areas such as business information systems, multimedia arts and engineering project management. Mathew recently led a half-million-dollar Federal government funded grant on e-Exams across ten university partners and is co-chair of the international 'Transforming Assessment' webinar series, run as the e-Assessment special interest group under the Australasian Society for Computers in Learning in Tertiary Education. He is also an honorary academic at the University of Canberra.
