The first Big Bang in 2012
The Big Bang in artificial intelligence (AI) refers to the breakthrough in 2012, when a team of researchers led by Geoff Hinton managed to train an artificial neural network (known as a deep learning system) to win an image classification competition by a surprising margin. Prior to that, AI had performed some remarkable feats, but it had never made much money. Since 2012, AI has helped the big technology companies to generate enormous wealth, not least from advertising.
A second Big Bang in 2017?
Has there been a new Big Bang in AI since the arrival of Transformers in 2017? In episodes 5 and 6 of the London Futurist podcast, Aleksa Gordic explored this question, and explained how today’s cutting-edge AI systems work. Aleksa is an AI researcher at DeepMind, and previously worked in Microsoft’s HoloLens team. Remarkably, his AI expertise is self-taught – so there is hope for all of us yet!
Transformers are deep learning models which process inputs expressed in natural language and produce outputs like translations, or summaries of texts. Their arrival was announced in 2017 with the publication by Google researchers of a paper titled “Attention Is All You Need”. This title referred to the fact that Transformers can “pay attention” simultaneously to a large body of text, whereas their predecessors, Recurrent Neural Networks, processed text one symbol at a time and in practice could attend mainly to the symbols close to the segment of text being processed.
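To make “paying attention to everything at once” a little more concrete, here is a minimal sketch of scaled dot-product attention, the core operation described in the 2017 paper, written in plain Python. The toy vectors and the function names are assumptions chosen for illustration; real models use learned query, key and value projections and optimised tensor libraries.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key in the sequence in one pass.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of all the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy 2-dimensional vectors standing in for three tokens (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = attention(tokens, tokens, tokens)
```

Each row of `attended` blends information from every token in the sequence, which is the sense in which the model attends to the whole input simultaneously.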
Transformers work by splitting text into small units, called tokens, and mapping them into high-dimensional spaces – often with thousands of dimensions. We humans cannot envisage this. The space we inhabit is defined by three numbers (or four, if you include time), and we simply cannot imagine a space with thousands of dimensions. Researchers suggest that we shouldn’t even try.
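The sketch below shows the general idea of turning text into tokens and then into vectors. It rests on simplifying assumptions: it splits on whitespace (real models use subword tokenisers), it uses only 8 dimensions, and the vectors are random rather than learned. All the function names are hypothetical.

```python
import random

def tokenize(text):
    """Naive whitespace tokeniser; real models use subword schemes."""
    return text.lower().split()

def build_vocab(tokens):
    """Give each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def make_embeddings(vocab_size, dim, seed=0):
    """One random vector per token id; a trained model would learn these."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(vocab_size)]

text = "the princess lost the slipper"
tokens = tokenize(text)
vocab = build_vocab(tokens)
embeddings = make_embeddings(len(vocab), dim=8)

# Look up the vector for each token; both uses of "the" share one vector.
vectors = [embeddings[vocab[tok]] for tok in tokens]
```

Here five tokens map to four distinct vectors of eight numbers each. A production model does the same lookup, only with a far larger vocabulary and thousands of dimensions.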
Dimensions and vectors
For Transformer models, words and tokens have dimensions. We might think of them as properties, or relationships. For instance, “man” is to “king” as “woman” is to “queen”. These concepts can be expressed as vectors, like arrows in three-dimensional space. The model will attribute a probability to a particular token being associated with a particular vector. For instance, a princess is more likely to be associated with the vector which denotes “wearing a slipper” than with the vector that denotes “wearing a dog”.
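The “man is to king as woman is to queen” relationship can be illustrated with simple vector arithmetic. This is a toy sketch: the two-dimensional vectors below are hand-made assumptions (one axis for royalty, one for gender), not values taken from any real model, and the `analogy` helper is hypothetical.

```python
import math

# Hand-made vectors: [royalty, gender], with gender -1 = male, +1 = female.
vectors = {
    "man":   [0.0, -1.0],
    "woman": [0.0,  1.0],
    "king":  [1.0, -1.0],
    "queen": [1.0,  1.0],
    "dog":   [-1.0, 0.0],  # an unrelated word, included as a distractor
}

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by finding the word nearest b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words so the answer is a new word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

result = analogy("man", "king", "woman", vectors)  # "queen"
```

Subtracting “man” from “king” leaves the royalty direction, and adding “woman” lands on “queen”. In a trained model the dimensions are not labelled this neatly, which is part of why the learned relationships are hard to interpret.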
There are various ways in which machines can discover the relationships, or vectors, between tokens. In supervised learning, they are given enough labelled data to indicate all the relevant vectors. In self-supervised learning, they are not given labelled data, and they have to find the relationships on their own. This means the relationships they discover are not necessarily discoverable by humans. In this sense, the models are black boxes. Researchers are investigating how machines handle these dimensions, but it is not certain that the most powerful systems will ever be truly transparent.
Parameters and synapses
The size of a Transformer model is normally measured by the number of parameters it has. A parameter is analogous to a synapse in a human brain, which is the point where the tendrils (axons and dendrites) of our neurons meet. The first Transformer models had a hundred million or so parameters, and now the largest ones have trillions. This is still smaller than the number of synapses in the human brain, and human neurons are far more complex and powerful creatures than artificial ones.
Not by text alone
A surprising discovery made a couple of years after the arrival of Transformers was that they are able to tokenise not just text, but images too. Google released the first vision Transformer in late 2020, and since then people around the world have marvelled at the output of DALL-E, Midjourney, and others.
The first of these image-generation models were Generative Adversarial Networks, or GANs. These were pairs of models, with one (the generator) creating imagery designed to fool the other into accepting it as original, and the second system (the discriminator) rejecting attempts which were not good enough. GANs have now been surpassed by Diffusion models, whose approach is to peel noise away from the desired signal. The first Diffusion model was actually described as long ago as 2015, but the paper was almost completely ignored. They were re-discovered in 2020.
Transformers are gluttons for compute power and for energy, and this has led to concerns that they might represent a dead end for AI research. It is already hard for academic institutions to fund research into the latest models, and there are fears that even the tech giants might soon find them unaffordable. The human brain points to a way forward. It is not only larger than the latest Transformer models (at around 80 billion neurons, each with around 10,000 synapses, it is roughly 1,000 times larger). It is also a far more efficient consumer of energy – mainly because we only need to activate a small portion of our synapses to make a given calculation, whereas AI systems activate all of their artificial neurons all of the time. Neuromorphic chips, which mimic the brain more closely than classic chips, may help.
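The size comparison above is simple arithmetic, sketched here using the figures quoted in the text. The model size of one trillion parameters is an assumption standing in for “trillions”; a different choice changes the ratio accordingly.

```python
NEURONS = 80_000_000_000              # around 80 billion neurons
SYNAPSES_PER_NEURON = 10_000          # around 10,000 synapses each
MODEL_PARAMETERS = 1_000_000_000_000  # assumption: a one-trillion-parameter model

brain_synapses = NEURONS * SYNAPSES_PER_NEURON
ratio = brain_synapses / MODEL_PARAMETERS

print(f"Synapses in the brain: {brain_synapses:,}")
print(f"Brain-to-model ratio: {ratio:,.0f}x")
```

With these inputs the brain has 800 trillion synapses, about 800 times the assumed model size, which is the same order of magnitude as the figure quoted above.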
Aleksa is frequently surprised by what the latest models are able to do, but this is not itself surprising. “If I wasn’t surprised, it would mean I could predict the future, which I can’t.” He derives pleasure from the fact that the research community is like a hive mind: you never know where the next idea will come from. The next big thing could come from a couple of students at a university, and a researcher called Ian Goodfellow famously created the first GAN by playing around at home after a brainstorming session over a couple of beers.