... a token of mAI appreciation ...
How do Generative AI (GenAI) or Large Language Models (LLM) _really_ work? Basically ... predictive text on a grand scale. Every letter crafted by human hand.
In late 2023, I was a bit desperate - for work, that is.
The contracts had dried up and I wasn’t getting bites.
I had a whine on my Facebook page and a friend said “try AI”.
“NEVER!” said I. “It’s taking my jobs; buggered if I’ll let it near my CV!!”.
My friend, very sensibly, said something along the lines of “it’s here to stay, and everyone else is using it, so why not?”.
I hate it when people are sensible at me.
I hate it even more when their advice works.
Spoiler: I got a really really good job. Which is encouraging me to learn about GenAI/LLMs … and is why I’m writing this article.
At the time, and insofar as I’d really thought about how AI worked at all, I'd sort of assumed that things like ChatGPT were essentially Google with a fancy front-end that used programming cleverness to produce the nearly-indistinguishable-from-human-writing result.
Dear reader … I was wrong.
This article is my attempt at explaining how it works, so that the everyday person knows when and how to trust it, when to be skeptical, when to run an absolute mile … and when to know it simply can’t do what’s claimed.
Table of Contents
The short(er) version
So … what is AI, really?
A note about terminology
Pattern-matching?
The cat sat on the …
A token bit of language
So the whole internet is tokenised?
The process, step-by-step
Create the first dataset for training the model
Create and fine-tune the assistant model
Point the model at an information dataset
Augment the search results
Why do we want to know all this?
Finally - what does this all mean for us?
The short(er) version
AIs sort of operate in two parts - the model itself, and the database from which it gets its information.
The model is the bit that you talk to, and which provides the natural-language-like response.
The model is trained on a collection of information that is a static, unchanging set of data (a “dataset”), which can be derived from anywhere but most commonly comes from the general Internet, including social media, websites, blogs, news sites, and may even include email.
The dataset is broken up into “tokens”, and the model “learns” which tokens appear in which patterns. (Broadly. A lot of other things go on as well).
Once the model has learnt how language works, it can apply those learnings to any curated dataset that can also be tokenised (thus allowing pattern-matching).
This dataset can be the same one from which it learnt to generate language, or it can be any other dataset. This is actually sort of critical to realise.
The dataset still needs to be curated, however; it needs to be an offline system that can have tokens applied to allow the pattern-matching process.
So the information provided by many AIs - particularly the free public ones - can be extremely out-of-date, as it doesn’t come from the live internet.
It can also have some extremely unpleasant biasses, or be completely invented - “hallucinated”.
Further, it doesn’t do calculations. GenAI doesn’t know maths. It can’t translate. And the image-generating AIs have no idea what words look like; they can’t generate them at all. You can’t make text look prettier with AI, because it doesn’t understand the concept of words. It’s just pixels on a grid to it.
It all comes down to the fact that AIs are generating content (thus “generative AI”, or GenAI) based on the likelihood of words (or parts of words) appearing in proximity to each other, which is itself based on the language tendencies of the original dataset.
You can thus well imagine what “facts” or opinions will emerge from a dataset derived from (say) social media.
So it pays to be wary of “facts” and “opinions” supplied by AI. Never trust it implicitly for that purpose. Always make it double-check itself (there’s now an entire related field called “prompt engineering” which focusses on crafting the perfect question, or set of questions/statements, to get very specific results out of any AI system).
But it does explain why it’s remarkably good in the creative content generation area.
I’m working on a few separate articles around this topic, all relying heavily on my understanding of the offline, tokenised nature of AI systems, so if that should change, I’ll have to rewrite everything :)
One where I try to create what I call a “trust model” for AI. It’s working backwards from what I believe might be the most trustworthy setup for an AI system (and which I’m not entirely sure is attainable at this point in time), and explaining that every time current AI deviates from that ideal, it loses a level of trustworthiness.
An article on prompt engineering once I can get my head around that. Short version: it pays to be sneaky and to work around the systems. Not my strong point - I tend to be blunt, and that actually doesn’t work well with GenAI.
I also have an article that outlines the ways that I, at least, can currently still identify AI-generated content; the places a professional writer can apply cynicism to be aware.
So … what is AI, really?
Firstly, I’ve learnt the correct names are (maybe) “Generative AI”, “Large Language Models”, and/or “Machine Learning”. You’ll see these summarised as “GenAI/LLM/ML”.
The “generative” part is critical.
AI generates language by matching patterns.
Or to put it another way, it’s a really really precise predictive text system, generating content by drawing from gigantic datasets.
A note about terminology
Since I originally drafted this article, the names have changed again, and experts are now differentiating between “generative AI” and “normal AI”. I’m not going to try and explain all the names, particularly as I’m not entirely convinced that these differences are at all useful.
Hopefully, once we understand HOW these systems do their things, we can understand the shifting landscape of the names.
However, it’s important to note that it’s not Artificial Intelligence as science fiction would paint it. The systems aren’t self-aware. They don’t understand or know anything, and they’re merely analysing language patterns and tendencies in order to generate human-intelligible content.
(Is it intelligence as we understand it? My gut answer is “no”, but unpicking my logic behind that is maybe a topic for another article, where philosophy, biology, and computing form a terrifying intersection … ).
So I try not to use “AI”, as we might still need that term in the future.
Pattern-matching?
Yup. Like predictive text. You know that exercise that goes around occasionally that tells you to write a phrase and then complete it by pressing the centre button of your phone?
Eg: “I’m going to write a book on” (press centre text bit like a mad person until a sentence forms that amuses you - or, at least, makes coherent sense).
The following screenshot shows a sentence formed by predictive text that says: I'm going to write a book on the history and the future and how to do that in a few weeks time to see what the next generation will look at. (Pretty boring, really).
Although your device only draws from what you’ve written in the past, it’s still machine learning. (And if you’re now horrified by the idea that your phone remembers what you write, type “remove predictive text dictionary” plus your phone model into a web search, and follow the instructions to turn it off).
The cat sat on the …
Generative AI models basically do the same thing, but using really big databases of content.
This is how they find out that if (for example) an English-speaking human reads the words “the cat sat on the … “, the majority of us will expect the word “mat” to appear at the end. (Unless you’re my partner, who apparently expected “hat”. But I digress).
Computers don’t actually know this. GenAI doesn’t “know” this; it’s not intelligent. But it can learn what words, or parts of words, are frequently found in proximity to each other.
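If you’d like to see how simple the core idea is, here’s a toy sketch in Python - entirely my own illustration, nowhere near the scale or sophistication of a real model - that “learns” which word tends to follow which by counting, then “predicts” by picking the most frequent follower:

```python
# A toy "predictive text" model: count which word follows each word
# in a scrap of sample text, then predict by picking the most
# frequent follower. Real models are vastly more sophisticated, but
# the core move - predict from observed patterns - is the same.
from collections import Counter, defaultdict

sample = ("the cat sat on the mat "
          "the cat sat on the sofa "
          "the dog sat on the mat").split()

followers = defaultdict(Counter)
for word, next_word in zip(sample, sample[1:]):
    followers[word][next_word] += 1

def predict(word):
    """Return the word most often seen after `word` in the sample."""
    return followers[word].most_common(1)[0][0]

print(predict("sat"))  # -> 'on' (every "sat" in the sample was followed by "on")
print(predict("on"))   # -> 'the'
```

Scale that counting up from three sentences to a sizeable chunk of the Internet, and you have the essence of how a model picks up language patterns.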
This learning process is where the tokens come into the exercise.
A token bit of language
The way these systems learn is to break everything they’re exposed to into the smallest part they can. Software programs break up text (and sound to an extent) into tokens. Images are broken up into pixel grids (and I’m not covering that here because I haven’t learnt about it yet!).
A GenAI system called Poe gave me a nice example of tokenisation.
Question: "What is the capital of France?"
Tokenization:
Tokens: ["What", "is", "the", "capital", "of", "France", "?"]
Each word is treated as a separate token. Punctuation marks like "?" are also treated as tokens.
Subword tokens (if subword tokenization is used):
Subword tokenization is a technique that further breaks down words into smaller units, called subword tokens, to handle out-of-vocabulary words and improve language modeling. This example assumes the use of subword tokenization using the BPE (Byte Pair Encoding) algorithm.
For eg: ["What", "is", "the", "cap", "ital", "of", "France", "?"]
The word "capital" is split into "cap" and "ital" as subword tokens.It's important to note that the specific tokenization method and the resulting tokens can vary depending on the language model and the tokenization algorithm used. The examples above are just a simplified representation to illustrate the concept.
The smaller the token (words vs subwords), the more accurate the predictive text process will become, but the more work is required to perform the tokenising and subsequent analysis.
Furthermore, and as I understand it, when you ask a GenAI interface a question, it also breaks the question up into tokens, so it can go into its database and find tokens that match each other.
Then it performs magic to put the tokens back together again and generate - there’s the generative AI bit of things - human-legible content.
Want to have a play?
This is an excellent Tokeniser Playground (give it some time to open). Pop in some text and see how the system breaks it up. Use the drop-down menu to see how different systems break the text up in different ways. I thoroughly recommend finding colloquial terms like “chooks” and “drizzling” to see how less-familiar words are broken up.
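If you’d rather poke at tokenisation from code, here’s a small example using OpenAI’s open-source tiktoken library - just one tokeniser among many, so the splits you see will differ from model to model (my choice of sentences to tokenise is entirely arbitrary):

```python
# Requires: pip install tiktoken
# Splits text into tokens using one real tokeniser (the "cl100k_base"
# encoding from OpenAI's open-source tiktoken library). Other models
# use other tokenisers, so the exact splits will vary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["What is the capital of France?", "The chooks are drizzling."]:
    ids = enc.encode(text)
    # Show the text fragment behind each token ID
    pieces = [enc.decode_single_token_bytes(i).decode("utf-8", errors="replace")
              for i in ids]
    print(text, "->", pieces)
    assert enc.decode(ids) == text  # the tokens reassemble into the original
```

Familiar words usually come out as one token apiece; less-familiar words like “chooks” tend to get chopped into subwords.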
So the whole internet is tokenised?
Ah … no.
As you can imagine, the process to tokenise text is quite time-consuming.
This is where I had to change my entire assumptions of how AI worked. You cannot tokenise the entire live Internet - or any other rapidly-changing set of data - on the fly. Not at the moment, anyway. Maybe in the future, the underlying setup of the internet’s data will come pre-tokenised, for use with AI systems; but it isn’t right now.
Tokenisation has to be performed on a static dataset - one with boundaries and a finite size. One that can be downloaded to a computer (or a LOT of computers) that can do this enormous bit of work in the background. Depending on the size of the dataset you’re starting with, this can take days, or weeks, or even months, and it can use a stupendous amount of power to do so.
Furthermore, you need to ensure the data you’re putting the effort into tokenising is worthwhile; that it’s “clean”, and accurate, and duplications have been removed, and a whole lot of other things.
So when you’ve put that effort into selecting, cleaning, and tokenising a dataset for a model to learn how language works, you’re going to re-use it a lot. You might even release your dataset/s for others to use, and create their own generative AI frontend that will draw different conclusions, or read the data a little differently, to someone else’s.
But there’s a downside to this. When an AI is accessing a dataset in response to a request for factual or opinionated information, that data might be old. Very old in internet terms in some cases - late 2021, even.
Probably not an issue if you’re asking about the capital of France. Going to be unhelpful if you want a review of the latest advances in gene engineering.
The process, step-by-step
So. Most generative AI systems don’t go out to the general Internet and collect lots of information and refine it into an answer when you ask them a question.
They go into their own pretrained and tokenised dataset/s and refine them into an answer.
How do they do this?
I found the process of creating an AI system to be fascinating. When I asked some resident experts, they pointed me to this video from a very accessible expert named Andrej Karpathy, who has a gorgeous accent and an amazingly clear way of explaining things.
After having watched it three times, I’ve tried to translate the software-engineering concepts I learnt from him into somewhat more human-accessible ones.
I think I’ve got it correct, and have run my understanding past other real experts who didn’t laugh at me, but if you have feedback, I’d love to hear it and will adjust my content accordingly!
Watch Andrej’s video here.
Create the first dataset for training the model
This is where the software that will become the generative AI learns how language actually works. A set of code is pointed at a carefully-curated dataset - for eg, a cleaned-up segment of the Internet in general, or just social media, or scientific papers - and it learns language patterns. Not just English - any language can be done this way.
So, first of all, a very, very large offline database of content is created. More is better - more variables means more flexibility - but it also takes more time and computing power. One base dataset is, for example, 10 terabytes (that’s a lot) of random content from across the Internet.
Some databases are cleaned up first – duplicates removed, outliers removed or fixed, outright errors removed. The cleaning process can be time-consuming and may not necessarily be required at this stage. Some cleaning can be done by computer programs, but error-checking needs to be done by humans who actually know what to look for; and the moment you add humans into the mix, things become slower and more expensive.
Other than that, however, the database is static. Once created, no more information is added to it.
This huge database is tokenised. Using the tokens, the system “learns” what words, image patterns, or soundwaves appear near, next to, or never around others. It learns patterns.
After about 1-2 weeks of fairly intense computing work (there’s a whole area of AI discussions that look at the environmental impact of model creation, because of the computing power required), you get two files – a parameters file (a relatively small set of data, at only a few gigabytes), and a set of code.
The parameters file is the one that holds the patterns; it’s like a compressed version of that huge dataset. The set of code provides the interface to the patterns - the assistant model.
These are, basically, the GenAI system. You can download both to your own computer. You can take your computer offline. When you run the code, you’ll get a GenAI interface. You can ask questions, and you’ll get responses just like the ones you get on any GenAI system.
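If you’d like to prove that to yourself, here’s a hedged little sketch using the freely-available transformers library and the small, older GPT-2 model (my choice purely for illustration - it’s tiny by modern standards and downloads quickly):

```python
# Requires: pip install transformers torch
# Downloads a small open model (GPT-2, a 2019-era base model) once,
# then generates text entirely on your own machine - no live internet
# connection needed after the initial download.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

result = generator("The cat sat on the", max_new_tokens=10)
print(result[0]["generated_text"])
```

Because GPT-2 is a raw base model that was never fine-tuned into an assistant, its completions tend to ramble - a nice demonstration of the “blunt instrument” point below.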
The system is only drawing from that base dataset, however. You can now do more with it, like point the code at cleaner, more precise, more accurate, or more focussed datasets. You can play with the code to create prettier interfaces, or even combine datasets from a range of places.
You refine - or fine-tune - the model, basically.
As an aside …
Dataset development and management is one of the bits about AI that I find absolutely fascinating, and one I think is going to be critical in developing trustworthy and truly useful GenAI systems in the future.
I was directed to a truly helpful document at this point: The Foundation Model Development Cheatsheet. It provides links and information about the datasets, systems, and other considerations that a programmer might require to develop their own AI. I had no idea, for example, that there might be environmental considerations in developing AI - but it turns out they can be incredibly energy-heavy, particularly if you’re trying to clean up a new dataset for use.
Create and fine-tune the assistant model
The assistant model is the little program we ask questions of. ChatGPT is an assistant model, as is DALL-E, and Adobe’s Firefly, and Poe.com, and all the others.
You can use the raw code and dataset from the pretraining process, but that’s a very blunt instrument.
There’s a resource called “Hugging Face” at huggingface.co. It’s a repository of AI information. It holds freely-available datasets, models, and applications. Anyone can download these datasets to create, pretrain, and/or fine-tune their own models. You can even download pre-created models to create your own GenAI system.
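As a hedged illustration (using the well-known, freely-available “imdb” movie-review dataset purely as an example), grabbing one of those datasets takes only a few lines with the datasets library:

```python
# Requires: pip install datasets
# Pulls a ready-made dataset down from huggingface.co - here "imdb",
# a well-known set of movie reviews - which could then be used to
# fine-tune a model of your own.
from datasets import load_dataset

reviews = load_dataset("imdb", split="train")
print(reviews)                    # row count and column names
print(reviews[0]["text"][:200])   # first 200 characters of one review
```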
So you take your assistant model, and find other datasets and tokeniser systems, and refine your system for about a day, until you’re happy with how it works.
You can give this model a name and pop it on the internet.
Congratulations! You have your very own GenAI!
Point the model at an information dataset
This model is still a closed garden. It doesn’t use the live Internet. ChatGPT-3’s “knowledge” ends in September 2021 or thereabouts (last I looked; it might have updated recently). It’s fine for information that doesn’t change much – the cat always sits on the mat, Paris is likely to remain the capital of France for a while – but it won’t give you links to websites or opinions on recent world events.
GPT-4 has access to more recent information, which is why it costs to access.
There are hundreds, if not thousands, of potential datasets out there. They cover a huge range of information. There’s scientific ones, and medical ones, and translation/language ones, and arts ones, and literature ones. So it really comes down to you deciding what you want your model to do, and then finding (or creating) a dataset that gives you the best chance of providing that information.
It particularly helps if your model was originally trained on the right sort of dataset. A model trained on social media and aimed at a general knowledge audience will be pretty good at generating everyday language, and should produce very good results if it then draws from a dataset that’s been curated for accurate general knowledge.
Medical researchers may find it frustrating, however, even if it’s drawing from a well-curated scientific and medical dataset; it may lack precision or, worse, hallucinate results. Everyday users will appreciate having medical information provided in everyday English, but won’t be aware that some of the information will be untrustworthy.
On the other hand, a model trained on a scientific dataset and then using even a well-curated health/medical dataset may produce solid scientific content, but be incomprehensible to the everyday user.
Augment the search results
Some GenAI systems can use the live internet (or other live databases) to fine-tune their results on the fly. After you ask your question, the system will create its own search and apply it not only to its static dataset, but also to a more regularly-updated dataset, a single document, or (in some cases, such as Bing’s Copilot) the live Internet (which slows things down a lot).
It takes the results of the search, possibly tokenises them, and Does Computer Cleverness to incorporate them into the results it provides back to you.
This is called Retrieval Augmented Generation – RAG. It’s not used all the time, because computers are still not fast enough to incorporate live results into the ordinary AIs, where people expect an answer within a few seconds. But it will significantly increase the accuracy of the results.
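Here’s a deliberately toy sketch of the RAG shape - real systems use “embeddings” and vector databases rather than my crude word-overlap matching, but the retrieve-then-augment flow is the same:

```python
# A toy sketch of Retrieval Augmented Generation (RAG): retrieve the
# most relevant document, then bolt it onto the question as extra
# context for the model. Real systems use embeddings and vector
# databases; plain word overlap stands in for retrieval here.
documents = [
    "Paris is the capital of France.",
    "The 2024 Olympics were held in Paris.",
    "Polar bears live in the Arctic, not Antarctica.",
]

def retrieve(question, docs):
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

question = "Where do polar bears live?"
context = retrieve(question, documents)

# In a real system this augmented prompt goes to the model, which
# generates its answer using the retrieved (and hopefully fresher) facts.
prompt = f"Using this context: '{context}'\nAnswer this question: {question}"
print(prompt)
```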
Why do we want to know all this?
I started learning all of this because I wanted to know how much I could trust what AI systems were giving me.
I’ve spent an entire career teaching people how to assess information. I’ve been a medical librarian, a general reference librarian, a technical writer, a specialised content creator, a knowledge manager, a trainer and educator, a first-level support officer. All things that handle information in some way.
Determining the trustworthiness of an information source is second nature by now.
So naturally I was going to ask the same of GenAI systems. I didn’t expect the answer to be quite so complex, but in the end, it all comes down to the base data, the core content. Who created it, and how, and why? What work have they done on it? For what purpose?
For GenAI, that means knowing what data the system was trained on, and what data it’s accessing now.
Right now, this information is not always supplied, so we everyday users have no way of assessing the likely accuracy of the content we’re being given.
Personally, I think that’s a problem. We’re being asked to trust systems that are already well-known for generating completely wrong, misleading, or downright dangerous content, purely because one of them learnt how to pattern-match from (eg) Facebook conversations. (I’ll put links to these examples in other articles, or this one will never get published … ).
We’re told that some datasets are “cleaned up” and “checked for accuracy”, but by whom? How are their biasses removed? How much cross-checking is done?
Finally - what does this all mean for us?
How should we be using GenAI, knowing all this?
GenAI is pretty good at analysing and re-writing existing content. If you feed something in, it can pattern-match what it’s learnt is “good” language in a particular context and create a new version of what you put in. With strict enough input, it should generally just use what you’ve supplied and not invent – “hallucinate” – too much new stuff. For eg, CVs and cover letters and articles.
If you’re asking for a precise answer to a specific question, do not trust the answer straight off. The results will be skewed by the underlying datasets and models, and you never know if it’s ingested an entire database full of conspiracy theories and children’s books that means it’ll tell you the earth is flat, the moon is made of green cheese, and polar bears live in Antarctica.
Same with images – if you’re looking for an accurate rendition of something, prepare to be disappointed at best. If you’re just after prettiness/snaps, they’re fine (ethical considerations notwithstanding).
Any information you put into a public GenAI system will most likely become part of its active dataset and can be used to inform someone else’s answers. Confidential and other sensitive information has been discovered this way, which is why Apple and Samsung no longer permit their staff to use any public AI – code containing private and confidential information became part of a system’s corpus and was given to others.
Try to find what dataset/s the system you’re using was trained on, or is accessing now.
Try to find systems that match the purpose you have in mind. Don’t ask scientific questions of ChatGPT; don’t ask a scientific system to re-write your CV.
Thank you for reading this far, if you made it! Hopefully this helps keep the other articles a tad shorter …
Got a comment? I promise I read and respond to them, and will either update this article or create new ones in response to helpful information! (Which I will cross-check six ways from Sunday before use, I should note …).
Issues
· Can’t do maths, or conversions, or translations; not like Google can. It will pattern-match your query, not perform calculations on it.
· The same pattern-matching also produces a newer problem dubbed “LLM laziness”: when asked to work on code, a model may alter your code or truncate it with a placeholder comment like “// rest of method here” instead of producing the whole thing.