Understanding the capabilities and limitations of neural networks in the pursuit of artificial intelligence.
Artificial intelligence has been a longstanding pursuit in the field of computer science, driven by the ultimate goal of creating machines capable of human-like intelligence. At its core, this endeavor aims to empower computers with the ability to solve a wide range of problems, from answering general questions to proving theorems, writing code, creating art, and playing games. This ability to solve a large variety of arbitrary problems may be referred to as general intelligence.
The field of artificial intelligence has come a long way. Today, it is spearheaded by neural networks and machine learning techniques, which have been responsible for realizing many of the goals envisioned over the last century. We now have models, such as AlphaGo, capable of unprecedented, superhuman performance in complex game environments.
Indeed, large language models (LLMs) like ChatGPT are prime examples of what modern neural networks are capable of. Through extensive self-supervised pretraining and instruction-following fine-tuning, these models are able to achieve remarkable feats of intelligence. These include not only holding conversations in natural language, but also generating code, solving mathematical problems, composing and editing text, and providing translations between languages. Arguably, this is the closest we have ever been to having a system capable of general intelligence.
But can these models ever reason like a human? Despite their impressive abilities, these models are not infallible; they fall short of consistent generalization in many tasks. In fact, they occasionally make simple reasoning mistakes that a human would easily avoid. This raises some questions about what these models are ultimately capable of doing. Let’s disregard the question of learning for the moment and focus strictly on their theoretical capabilities. Could these models solve any problem that a human could solve? Exactly how powerful are they? And, perhaps most importantly, is there a systematic way to go about such questions?
In this article, we will attempt to answer these questions, unraveling the capabilities and limitations of contemporary neural network models. In doing so, we will hopefully shed light on some of the challenges and potential solutions in the pursuit of general intelligence using neural networks.
The theory of computation is a branch of theoretical computer science that explores the fundamental principles of computation, encompassing the study of algorithms, computational problems, and formal computational models. Using this theory, we are able to analyze the capabilities of different models of computation, establish the computational complexity of different problems in terms of space and time, and even determine whether a problem is computable at all. To enable this, the theory of computation offers a wide range of formal tools, including automata, formal languages, grammars, logics, and circuits.
There are many different models of computation within this theory, each with its own set of capabilities, some stronger than others. A particularly important one is the Turing machine, which is capable of performing any algorithmic computation. According to the Church-Turing thesis, any function that is computable can be computed by a Turing machine. Thus, any model that can simulate a Turing machine is equally capable of general computation. Such models are said to be Turing-complete.
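To make this concrete, here is a minimal sketch of a Turing machine simulator. The function name and the transition table at the end (a unary increment machine) are made up purely for illustration; they are not part of any formalism discussed above.

```python
# Minimal Turing machine simulator: a transition table maps
# (state, symbol) -> (new_state, new_symbol, head_move).
def run_turing_machine(transitions, tape, state="q0", blank="_", max_steps=1000):
    tape = dict(enumerate(tape))  # sparse tape: position -> symbol
    head = 0
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = tape.get(head, blank)
        state, new_symbol, move = transitions[(state, symbol)]
        tape[head] = new_symbol
        head += 1 if move == "R" else -1
    # Read back the non-blank portion of the tape in order.
    cells = [tape[i] for i in sorted(tape) if tape[i] != blank]
    return state, "".join(cells)

# Illustrative machine: append a '1' to a string of 1s (unary increment).
increment = {
    ("q0", "1"): ("q0", "1", "R"),    # scan right over the 1s
    ("q0", "_"): ("halt", "1", "R"),  # write a 1 on the first blank and halt
}
print(run_turing_machine(increment, "111"))  # ('halt', '1111')
```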
Knowing this, it would be natural to expect that a computational model that can perform arbitrary reasoning and solve arbitrary problems must be capable of performing arbitrary computation. In other words, for a model to be capable of general intelligence, it must be capable of general computation, that is, it must be Turing-complete.
But what does all of this have to do with neural networks? Much like any computational model, neural networks transform an input into an output, and are therefore a subject of the theory of computation. Accordingly, we are able to determine the computational power of different neural network architectures based on this theory. In fact, this treatment of neural networks is not a new idea. Let’s explore this further.
The study of neural networks has a perhaps surprising relationship with the field of theoretical computer science, with each discipline influencing the other over the course of their development.
In 1943, McCulloch and Pitts introduced the first computational model of biological neural networks. In this model, neurons are represented by threshold logic units (TLUs). Each unit exhibits an “all-or-nothing” behavior, where it can be considered either active or inactive based on the number of excitatory inputs and the presence of inhibitory inputs. Due to the binary nature of this model, McCulloch and Pitts argued that neural activity can be treated by means of propositional logic. Essentially, the output of each neuron in a network can be interpreted as the truth value of a logical proposition. By combining multiple neurons, we can therefore construct propositions of increasing complexity (see image above). Based on this, McCulloch and Pitts characterized the set of propositions that can be expressed by neural networks.
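To illustrate the idea, here is a small sketch of binary threshold units acting as logic gates; the specific weights and thresholds are simply chosen for the example.

```python
# McCulloch-Pitts style threshold units: output 1 iff the weighted sum of
# binary inputs reaches the threshold (inhibitory inputs get negative weight).
def tlu(inputs, weights, threshold):
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def AND(a, b):  return tlu([a, b], [1, 1], threshold=2)
def OR(a, b):   return tlu([a, b], [1, 1], threshold=1)
def NOT(a):     return tlu([a],    [-1],   threshold=0)

# Composing units yields more complex propositions,
# e.g. XOR(a, b) = (a OR b) AND NOT(a AND b).
def XOR(a, b):  return AND(OR(a, b), NOT(AND(a, b)))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, XOR(a, b))  # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```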
Later, in 1956, Kleene published a mathematical reworking of McCulloch and Pitts’s paper which also included significant developments in the theory of finite automata. In his paper, titled Representation of Events in Nerve Nets and Finite Automata, Kleene characterized the set of “regular events” (what we now describe using regular expressions) and showed that this is exactly what finite automata and McCulloch-Pitts networks with circles (i.e., feedback loops) can represent.
Since then, neural networks have evolved significantly, incorporating continuous values, multiple layers, differentiable activations, and learning through backpropagation. This process has led to the development of the modern neural networks that we know today. Concurrently, the theory of computation has also advanced, influencing the study of the computational power of these neural networks.
With the development of modern RNNs, neural networks became able to process sequences of arbitrary length, naturally inviting comparisons with classical models of computation.
Around this time, an important line of research emerged, focusing on the theoretical investigation of the Turing completeness of neural networks. In their classical paper, Siegelmann and Sontag proved that recurrent neural networks are Turing-complete, showing that an RNN with rational weights and unbounded precision can simulate any Turing machine.
However, there are limitations associated with the assumptions in Siegelmann’s work. Several authors have argued that these are unrealistic, leaving a significant gap between theory and practice. A notable example is the empirical study by Delétang et al., who trained a range of architectures on tasks from different levels of the Chomsky hierarchy and measured how well they generalize to longer inputs, with the results shown below.
Scores (%) per task, grouped by Chomsky-hierarchy level (R = regular, DCF = deterministic context-free, CS = context-sensitive):

Level | Task | RNN | LSTM | Stack-RNN | Tape-RNN | Transformer (encoder) | Transformer (autoregressive) |
---|---|---|---|---|---|---|---|
R | Even Pairs | 100.0 | 100.0 | 100.0 | 100.0 | 96.4 | - |
R | Modular Arithmetic (Simple) | 100.0 | 100.0 | 100.0 | 100.0 | 24.2 | - |
R | Parity Check | 100.0 | 100.0 | 100.0 | 100.0 | 52.0 | - |
R | Cycle Navigation | 100.0 | 100.0 | 100.0 | 100.0 | 61.9 | - |
DCF | Stack Manipulation | 56.0 | 59.1 | 100.0 | 100.0 | 57.5 | 53.2 |
DCF | Reverse String | 62.0 | 60.9 | 100.0 | 100.0 | 62.3 | 53.5 |
DCF | Modular Arithmetic | 41.3 | 59.2 | 96.1 | 95.4 | 32.5 | - |
DCF | Solve Equation | 51.0 | 67.8 | 56.2 | 64.4 | 25.7 | - |
CS | Duplicate String | 50.3 | 57.6 | 52.8 | 100.0 | 52.8 | 100.0 |
CS | Missing Duplicate | 52.3 | 54.3 | 55.2 | 100.0 | 56.4 | - |
CS | Odds First | 51.0 | 55.6 | 51.9 | 100.0 | 52.8 | 54.7 |
CS | Binary Addition | 50.3 | 55.5 | 52.7 | 100.0 | 54.3 | 69.0 |
CS | Binary Multiplication | 50.0 | 53.1 | 52.7 | 58.5 | 52.2 | 52.2 |
CS | Compute Sqrt | 54.3 | 57.5 | 56.5 | 57.8 | 52.4 | 52.4 |
CS | Bucket Sort | 27.9 | 99.3 | 78.1 | 70.7 | 91.9 | 40.4 |
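To give a concrete sense of what these benchmarks measure, here is a rough sketch of a Parity Check style task and the length-generalization protocol behind the table above. The data format and evaluation details of the actual benchmark may differ; this is only an illustration.

```python
import random

# Sketch of a "Parity Check"-style task: the label is 1 iff the number of 1s
# in a binary string is odd.
def make_example(length):
    bits = [random.randint(0, 1) for _ in range(length)]
    return bits, sum(bits) % 2

def make_split(num_examples, min_len, max_len):
    return [make_example(random.randint(min_len, max_len)) for _ in range(num_examples)]

# Length-generalization protocol: train on short strings, evaluate on longer ones.
train_set = make_split(10_000, min_len=1,  max_len=40)
test_set  = make_split(1_000,  min_len=41, max_len=500)

# A model trained on train_set is then scored on test_set, e.g.:
# accuracy = mean(model(bits) == label for bits, label in test_set)
```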
This trend has been followed by many recent works, both empirical and theoretical, which investigate the computational power of sequence models in practical settings. The table below summarizes some of these results, listing the classes of formal languages that different architectures have been shown to recognize.
Neural Network Architecture | Recognizable Languages |
---|---|
Sigmoid / Tanh RNN | Regular languages |
LSTM | Regular languages + Counter languages |
GRU | Regular languages |
CNN (finite receptive field) | Strictly local languages |
Within this context, Merrill and colleagues formalized a hierarchy of RNN architectures, using the notion of saturated networks to characterize the classes of formal languages that different architectures can recognize, as summarized in the table above.
So far, most of the results we have seen have focused on recurrent neural networks. But the most prominent architecture in today’s sequence processing landscape is the Transformer, which is the basis of models like ChatGPT and other LLMs. Due to the relatively recent emergence of this architecture, the study of its computational power is still comparatively young, though it has already produced a number of important results.
Inspired by Siegelmann’s work, Pérez et al. proved that the Transformer is Turing-complete: assuming arbitrary precision, an autoregressive Transformer decoder that is allowed to run for an unbounded number of steps can simulate any Turing machine.
Contrasting with Pérez’s approach, another line of research analyzed Transformers from a different perspective. In particular, this line of research focuses on the ability of Transformer encoders to process strings in a single step, as opposed to solving tasks using multiple decoder steps. In this context, the first theoretical results on the limitations of Transformers were shown by Hahn, who proved that self-attention cannot robustly model simple formal languages such as Parity and Dyck-2, prototypical examples of periodic and hierarchical structure.
Hahn’s work laid the ground for subsequent studies. Chiang and Cholak revisited the case of Parity, constructing Transformers that recognize it exactly and showing that layer normalization can be used to overcome the limitation identified by Hahn.
In parallel, Weiss et al. proposed RASP, a programming language designed to capture the computations of a Transformer encoder, providing an intuitive way to reason about which tasks the architecture can and cannot express.
A more robust understanding of the computational model of Transformers emerged after Hao et al. connected the architecture to circuit complexity, showing that Transformers with hard attention can only recognize languages within the circuit class AC⁰, a class of shallow, highly parallel computations.
In summary, research suggests that the Transformer does not align with any of the major classes of the Chomsky hierarchy. Rather, results indicate that Transformers correspond to a limited parallel computational model, echoing the early idea that the attention mechanism has limited sequential processing abilities.
Several approaches have been adopted to overcome the limitations of conventional architectures and create more powerful models. One of them, which we have already encountered in the form of the Stack-RNN and Tape-RNN, is memory-augmented neural networks: architectures that couple a neural controller with an external memory structure, such as a stack or a tape, which the network learns to read from and write to.
One interesting line of research investigates the use of differentiable optimizers as neural network layers. The main idea is to use a parameterized optimization procedure as a part of the forward pass of a neural network and then differentiate through it so that its parameters can be learned. Examples of this kind of model include OptNet, which embeds a quadratic programming solver as a differentiable layer.
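To illustrate the general idea (not OptNet's actual QP-based construction, which differentiates through the solver's optimality conditions), here is a sketch of an inner gradient-descent loop unrolled inside the forward pass, so that gradients flow through the optimizer into its parameters. The module name and inner objective are made up for the example.

```python
import torch

# Illustrative "optimization as a layer": the forward pass runs a few steps of
# gradient descent on an inner objective f(z; x) = ||A z - x||^2, and autograd
# differentiates through the unrolled loop so that A can be learned.
class UnrolledArgmin(torch.nn.Module):
    def __init__(self, dim, inner_steps=10, inner_lr=0.1):
        super().__init__()
        self.A = torch.nn.Parameter(torch.eye(dim) + 0.01 * torch.randn(dim, dim))
        self.inner_steps = inner_steps
        self.inner_lr = inner_lr

    def forward(self, x):
        z = torch.zeros_like(x)
        for _ in range(self.inner_steps):
            residual = z @ self.A.T - x
            grad = 2 * residual @ self.A       # d/dz ||A z - x||^2
            z = z - self.inner_lr * grad       # keep the graph: no .detach()
        return z                               # approximate argmin_z f(z; x)

layer = UnrolledArgmin(dim=4)
x = torch.randn(8, 4)
loss = layer(x).sum()
loss.backward()             # gradients flow into layer.A through the inner loop
print(layer.A.grad.shape)   # torch.Size([4, 4])
```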
In this context, a different yet related class of models captures our attention: the class of deep equilibrium models (DEQs). Instead of stacking a fixed number of layers, a DEQ repeatedly applies a single layer until its output converges to a fixed point, which corresponds to a network of effectively infinite depth; the equilibrium can then be differentiated implicitly, without storing the intermediate iterations.
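A minimal sketch of this forward pass, assuming a simple contractive layer and naive fixed-point iteration (real DEQs use faster root-finding solvers and implicit differentiation):

```python
import torch

# Deep equilibrium forward pass: instead of stacking L layers, apply one layer f
# repeatedly until its output stops changing, i.e. find z* such that z* = f(z*, x).
def deq_forward(f, x, max_iters=100, tol=1e-4):
    z = torch.zeros_like(x)
    for _ in range(max_iters):
        z_next = f(z, x)
        if torch.norm(z_next - z) < tol:
            break
        z = z_next
    return z  # approximate equilibrium, playing the role of an "infinitely deep" output

# Illustrative layer: a single weight-tied transformation with the input injected each step.
W = torch.randn(16, 16) * 0.1   # small weights so the iteration converges
f = lambda z, x: torch.tanh(z @ W + x)

x = torch.randn(4, 16)
z_star = deq_forward(f, x)
print(torch.norm(f(z_star, x) - z_star))  # ~0: z_star is (approximately) a fixed point
```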
However, it is crucial to understand the abilities of these models from a theoretical perspective. Are DEQs any more expressive than regular CNNs and Transformers? Does their infinite depth lend them more computational power? The short answer is yes. I will not go into much detail, but we can show that DEQs can simulate formalisms ranging from finite automata to Turing machines, depending on the assumptions made. For the Turing machine simulation, the key is to imagine that the vector \(\mathbf z\) represents the tape, where each position encodes a symbol, a state, and the presence of the head. Then, at each step, we only need to update the current symbol and state and move the head to one of the neighboring positions, which can easily be done with convolution operations. Unfortunately, these results do not necessarily translate well to practice. For instance, reproducing the experiments from Delétang’s work with models of this kind quickly runs into the issue of computational cost.
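The core intuition is that one machine step only changes cells near the head, and such local updates over a one-dimensional grid are exactly what convolutions compute. As a simpler stand-in for the tape construction (not the actual DEQ simulation), here is one step of an elementary cellular automaton, Rule 110, itself known to be computationally universal, implemented with a 1D convolution followed by a table lookup.

```python
import numpy as np

# One local update step implemented as convolution + lookup. Each cell's next
# value depends only on its left/center/right neighbors, just as a Turing-machine
# step depends only on the cells around the head.
RULE_110 = np.array([0, 1, 1, 1, 0, 1, 1, 0], dtype=np.uint8)  # output for codes 0..7

def ca_step(cells):
    padded = np.pad(cells, 1)                    # zero boundary
    kernel = np.array([4, 2, 1])                 # encodes (left, center, right) as 0..7
    codes = np.convolve(padded, kernel[::-1], mode="valid")
    return RULE_110[codes]

cells = np.zeros(32, dtype=np.uint8)
cells[-2] = 1                                    # a single "on" cell near the right edge
for _ in range(8):
    print("".join(".#"[c] for c in cells))
    cells = ca_step(cells)
```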
The bottom line is that computational efficiency seems to be a major issue among all of these approaches. There appears to be a genuine parallelism tradeoff: the limited, parallel nature of the Transformer’s computation is key to its efficiency and success, and any attempt at a more powerful neural network architecture seems to inevitably lead to complications.
We started this article on a quest to understand the reasoning and general computation abilities of neural networks. Particularly, we have focused on the computational power of Transformers, the architecture that powers most current LLMs. Research on the subject, however, reveals that the computations that these models can perform in a single step are quite limited, and that there are many problems that they cannot solve. Unfortunately, alternative architectures that might offer greater computational capabilities often come with a variety of drawbacks such as elevated computational costs, making them impractical choices for creating models capable of general intelligence.
But what if we give up on the idea of having a model with an “all-powerful” step and fully embrace the idea of solving problems through multiple explicit “weak” steps? After all, this is the approach used in Pérez’s paper in order to prove the Turing completeness of Transformers. This is also what the “step-by-step thinking” strategy popularized by chain-of-thought prompting relies on: spending additional decoding steps on intermediate computations before committing to an answer.
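As a toy illustration of the “many weak steps” idea, consider long addition: it is awkward to do in one shot, but trivial as a sequence of local steps, each of which looks only at two digits and a carry, much like a model appending intermediate results to its own output. The helper below is made up for this example.

```python
# Long addition as a sequence of "weak" local steps: each step produces one
# digit and a carry, appended to a growing scratchpad (the analogue of tokens
# generated step by step).
def add_step_by_step(a: str, b: str):
    a, b = a.zfill(len(b)), b.zfill(len(a))
    scratchpad, carry = [], 0
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digit, carry = total % 10, total // 10
        scratchpad.append(f"{da}+{db}+carry -> digit {digit}, carry {carry}")
    if carry:
        scratchpad.append(f"final carry -> digit {carry}")
    return scratchpad

for line in add_step_by_step("978", "466"):   # 978 + 466 = 1444
    print(line)
```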
Let’s start by going back to Pérez’s work. In order to see if his Turing completeness proof holds in practice, we must go over its assumptions. The two main ones are unbounded computation and arbitrary precision. Let’s ignore the assumption of an unlimited number of computation steps, since that is precisely what we’re aiming for: taking as many steps as necessary in order to solve a problem. Let’s also ignore the arbitrary precision assumption, this time on the basis that the floating-point numbers on a computer are precise enough to carry out Turing-complete computation for all practical problems (in fact, some believe that log-precision Transformers have enough capacity to compute a Turing machine step). Given this context, does Pérez’s proof apply in practice? Not necessarily. The construction proposed by Pérez relies on a non-standard attention score function and an unconventional positional encoding scheme, whereas practical Transformers use dot-product attention and standard positional encodings, which would likely render the attention patterns needed for the Turing machine simulation impossible. In the end, I believe that the practical computational power of autoregressive Transformer decoders is still not entirely clear.
However, even if such Transformers were proven to be Turing-complete in practice, a critical problem would remain: these models have linear space complexity and quadratic time complexity. Because of this, processing excessively long sequences becomes impractical. Certainly, such models are not able to naively reason for an indefinite amount of time, nor are they able to solve problems with very long descriptions. As an example, fixing a bug in a program with 10K lines of code can be infeasible even for the largest commercial LLMs of today. Furthermore, while the attention operation is “dense” (i.e., the computation is performed over all input elements), the memory access pattern required to simulate a Turing machine should be mostly sparse. In Pérez’s simulation, after computing the current head position, we should only need to retrieve the current and the next element under the head. Similarly, in a practical setting, we can imagine that the sparsity of an LLM’s attention increases as its context grows longer through interaction, a simple consequence of the increasing amount of irrelevant information accumulating in its history over time. From this, we can conclude that there is an inherent inefficiency in the way attention works, although this is the price we pay for the parallelism of Transformers.
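A rough back-of-the-envelope sketch of this scaling, comparing how the per-token cost of attention grows with context length versus the constant per-step cost of a fixed-state recurrent model; the constants and the model dimension below are purely illustrative, and only the asymptotic trend matters.

```python
# Per-step cost comparison (asymptotics only, illustrative constants):
# attending over a context of n tokens costs on the order of n * d per generated
# token (and n^2 * d to process a whole sequence), while a recurrent model with
# a fixed state does on the order of d^2 work per token, regardless of history.
def attention_cost_per_token(n, d):
    return n * d            # score + mix over all n cached keys/values

def recurrent_cost_per_token(d):
    return d * d            # one state update, independent of context length

d = 4096                    # hypothetical model dimension
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"context {n:>9,}: attention ~{attention_cost_per_token(n, d):,} "
          f"vs recurrent ~{recurrent_cost_per_token(d):,} (per token)")
```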
In contrast, an ideal model should have a constant computation cost per step and sparse-access memory at inference time. This is akin to the construction proposed in a recent paper by Chung and Siegelmann, who showed that a recurrent network with bounded precision can be Turing-complete when augmented with a simple external memory that grows as needed and is accessed through stack-like operations.
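The following sketch conveys the general shape of such a model: a fixed-size recurrent controller that reads and writes an external stack, so each step does a constant amount of work even though the memory can grow without bound. It is not the actual construction from the paper; the class name, weights, and hard memory actions are all illustrative.

```python
import numpy as np

# A fixed-size controller plus an external stack. Each step only touches the
# top of the stack, so the work per step is constant regardless of memory size.
class StackAugmentedController:
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0, 0.1, (hidden_dim, input_dim))
        self.W_h  = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
        self.W_m  = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
        self.W_op = rng.normal(0, 0.1, (3, hidden_dim))   # scores for push / pop / no-op
        self.h = np.zeros(hidden_dim)
        self.stack = []

    def step(self, x):
        top = self.stack[-1] if self.stack else np.zeros_like(self.h)
        self.h = np.tanh(self.W_in @ x + self.W_h @ self.h + self.W_m @ top)
        op = int(np.argmax(self.W_op @ self.h))           # hard memory action
        if op == 0:
            self.stack.append(self.h.copy())              # push
        elif op == 1 and self.stack:
            self.stack.pop()                              # pop
        return self.h                                     # constant work per step

controller = StackAugmentedController(input_dim=8, hidden_dim=32)
for x in np.random.default_rng(1).normal(size=(100, 8)):
    h = controller.step(x)
print(len(controller.stack), h.shape)  # memory may grow, but each step was O(1) accesses
```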
Yet, efficiency during training is just as important as efficiency during inference. After all, large-scale self-supervised pretraining is the cornerstone of building LLMs. Fortunately, a recent yet rapidly evolving line of research aims to develop efficient alternatives to Transformers while matching their performance levels, including linear attention variants and state space models, which can be trained in parallel yet run recurrently at inference time with a constant cost per generated token.
Using these recent architectures, it might be possible to create a model that is both Turing-complete and capable of efficient parallel and sequential computation. The key is to equip the model with an external memory module and use the right training setup. Specifically, this training setup would consist of two distinct phases. In the first phase, the model goes through parallel self-supervised pretraining, during which it learns from a substantial amount of text. In the second, the model is equipped with the external memory module and subjected to sequential reinforcement learning fine-tuning, in which it is tasked with solving various reasoning problems. The idea is that, in the self-supervised pretraining phase, the model would learn to compress and approximate language information. Then, in the reinforcement learning fine-tuning phase, the model would learn to use language to carry out reasoning and computation in a precise and generalizable manner. In other words, it is as though the first phase consisted in learning the “machine” of intelligence, whereas the second phase consisted in operating it by learning the “program” of intelligence. In a way, this aligns with the notion presented by Mahowald et al. that formal linguistic competence and functional competence (using language to reason about the world) are distinct abilities, and that current LLMs are far stronger in the former than in the latter.
But this is just a rough idea and, naturally, it has its own set of challenges. What is the best way to implement the external memory module? Would it ever be practical to sequentially fine-tune something like a large language model with reinforcement learning, even with constant generation cost? What datasets and training objectives should be used in the fine-tuning phase? Would learning from text data alone using current machine learning methods be enough? And how could a model like this be used within a broader text-based agent framework, being able to use external tools in order to accomplish its goals (e.g., LangChain)?
At the end of the day, it only seems certain that efficiency and Turing completeness are important properties of models capable of general intelligence. This is a very complex and broad subject and no single article would be able to address all the related problems and possibilities. The approach presented here is thus mere speculation; only time will tell what such models will look like.