Preface
At a conference this past weekend I had the luck of meeting & having a good discussion with Stephen Wolfram. This mathematician / physicist / scientist has been a personal hero for many years – indeed, after A New Kind of Science came out, as part of a college course I made a VLSI chip for rapidly running cellular automata. This chip, which was taped out courtesy of the university MOSIS system, drove a VGA display (remember those?) with fun ever-changing patterns.
Anyway, our discussion at the conference touched on a number of things, including SAT solvers, which led Stephen to mention his recent blog post on Boolean derivatives and analysis of “What’s really going on in Machine Learning”. Below is my response to that blog post, shared here, in the interest of open discourse, as another blog post.
— Tim
Response
Very stimulating read, and (as always) the quality and volume of the figures speak to the systems you (all) have created. Discrete Boolean derivatives are very clever; I had not seen that before!
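For readers who have not seen the construction, here is a minimal sketch of what I understand a discrete Boolean derivative to be (my own toy code, not from the original post; the helper name boolean_derivative and the example functions are made up): the derivative of f with respect to bit i at a point x is f evaluated with bit i flipped, XORed with f at x.

```python
# Minimal sketch (my own toy code) of a discrete Boolean derivative:
#   df/dx_i at x  :=  f(x with bit i flipped)  XOR  f(x)
# A nonzero derivative means the output actually depends on bit i at that point.

def boolean_derivative(f, x, i):
    """Boolean derivative of f (tuple of bits -> bit) with respect to bit i at point x."""
    x_flipped = list(x)
    x_flipped[i] ^= 1
    return f(tuple(x_flipped)) ^ f(tuple(x))

xor = lambda x: x[0] ^ x[1]   # depends on both bits everywhere
and_ = lambda x: x[0] & x[1]  # depends on bit 0 only where bit 1 == 1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, boolean_derivative(xor, x, 0), boolean_derivative(and_, x, 0))
```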
My general thought is that the examples are mostly memorizing / compressing given information (the piecewise constant function, an integer) onto minimal bases (ReLU MLPs, cellular automata). This works well; the more you exploit computational irreducibility, the better compression you can get, but the harder it is to do that compression (which I think has to do with the probability volume & navigability of the solution space).
However, most people use neural networks not for compression, but rather for generalization, such that the models work on new data. This requires the learned representation to be both computational and compositional – it should (either exactly or approximately) reproduce the computations that led to the data. If there are no regularities in the data, then it is incompressible, but it can be memorized. If the data is compressible = comprehensible, then you ought to learn an algorithm to reproduce it, preferably something within the rough equivalence class of the original generating algorithm.
Much real-world data is only weakly compressible, and hence must be memorized; the MLP layers in an LLM seem to do this, very likely by exploiting ‘computational irreducibility’ or the blessing of dimensionality, etc. Yet once those segments of the data are memorized, they can be composed through the conditional message-passing algorithm evinced by the attentional layers. As I think you’ve alluded to in other blog posts, language is uniquely amenable as a substrate for these algorithms and hence for transformers.
Though conditional message passing is not the most general algorithm, it can certainly emulate cellular automata and perform approximate Bayesian inference, and it does seem to model the structure and nature of interactions in the natural world. Furthermore, I think that George & Symbolica are onto something in realizing that there are many levels of conditional message passing: hypergraph rewriting rules, where the dependency map depends not only on token content (= first-level attention) but also on structure in upstream tokens (~= function calls, or what Anthropic calls higher-level heads).
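To make ‘conditional message passing’ concrete, here is a toy single-head self-attention step in plain NumPy (a sketch of my own with made-up random weights, not any particular model’s code): the dependency map is computed from token content itself, and messages are then routed along it.

```python
import numpy as np

# Toy single-head self-attention as conditional message passing (illustrative sketch).
# The attention matrix A is the content-dependent "dependency map": row t says which
# upstream tokens token t conditions on; the values V are the messages routed along it.

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # project token content
    scores = Q @ K.T / np.sqrt(K.shape[1])        # pairwise content-dependent affinities
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)             # softmax over who each token listens to
    return A @ V, A                               # routed messages + the dependency map

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 tokens, 8-dim embeddings (made up)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, A = attention(X, Wq, Wk, Wv)
print(A.round(2))                                 # the content-dependent dependency graph
```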
Yet this sort of composition is not nearly as general as mathematics! In the function-approximation examples, where the model must learn a Heaviside-style step function, an appropriate (learned) basis would let it generalize over future instances of this class, each of which is specified by the positions and signs of its steps. This also tends to make learning faster; creating new bases is a form of meta-learning. Likewise, for the integer lifetime task, a suitable basis would be a counter, whence the task would be trivially learnable and would generalize perfectly to new data. Of course, it would also be significantly less compressed than the minimal or near-minimal representation (in terms of bits of rules + bits of the dependency graph specifying how those rules are applied) evolved in the cellular-automata solution.
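To illustrate the ‘appropriate basis’ point with a toy of my own (the grid, step positions, and amplitudes below are all invented): a piecewise-constant target is a weighted sum of shifted Heaviside steps, so once the step basis exists, learning reduces to reading off positions and signed amplitudes, and the same basis covers every new function in the class.

```python
import numpy as np

# Toy illustration (my own example): any piecewise-constant function on a grid is a
# weighted sum of shifted Heaviside steps.  Given the step basis, "learning" the
# function reduces to recovering step positions and signed amplitudes.

def heaviside_basis(grid, positions):
    return np.stack([(grid >= p).astype(float) for p in positions], axis=1)  # one column per step

grid = np.linspace(0.0, 1.0, 200)
# Hypothetical target: steps at 0.2, 0.5, 0.8 with amplitudes +1, -2, +1.5
target = 1.0 * (grid >= 0.2) - 2.0 * (grid >= 0.5) + 1.5 * (grid >= 0.8)

B = heaviside_basis(grid, positions=np.linspace(0, 1, 21))   # overcomplete step basis
coeffs, *_ = np.linalg.lstsq(B, target, rcond=None)          # read off the amplitudes
active = np.abs(coeffs) > 1e-6
print(np.flatnonzero(active), coeffs[active].round(2))       # steps at 0.2 / 0.5 / 0.8 recovered
```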
Three questions
This raises several outstanding questions, one of which is whether transformers (or their descendants) are adept at re-using bases and learned functions outside of pretrained linguistic contexts. If an algorithm for generating Heaviside functions, counters, search, integration, or (of recent focus) dynamic programming can be learned, can it be re-used and re-composed within the transformer, without human supervision? The answer seems to be a tentative ‘not yet’. [1]
Regarding search over such solutions, the post makes very cogent arguments for why the chain rule of calculus allows optimization of continuous-valued variables; not mentioned is the idea from the optimization literature that, in high dimensions, nearly all critical points are saddles. With discrete parameters, optimization frequently runs into points in solution space where no incremental change improves the objective function; these are not saddles, so there is no direction left along which to escape. (But perhaps overparameterization can help? Or, as mentioned in the post, noise? Certainly both do in biological evolution...)
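A toy illustration of that discrete dead end (my own example, not from the post): under single-bit-flip moves, the ‘trap’ objective below has a strict local optimum at the all-zeros string, where every neighbour is worse, even though the all-ones string scores higher.

```python
# Toy "trap" landscape (my own example): hill-climbing by single-bit flips gets stuck
# at the all-zeros string, where every neighbour is strictly worse, even though the
# all-ones string is the global optimum.  No saddle, no escape direction.

def objective(bits):
    k, n = sum(bits), len(bits)
    return n + 1 if k == n else n - k     # all-ones wins; otherwise fewer ones is better

def is_local_max(bits):
    base = objective(bits)
    flips = (bits[:i] + (1 - bits[i],) + bits[i + 1:] for i in range(len(bits)))
    return all(objective(nb) <= base for nb in flips)

n = 6
zeros, ones = (0,) * n, (1,) * n
print(objective(zeros), is_local_max(zeros))   # 6, True  -> stuck: no single flip improves
print(objective(ones), is_local_max(ones))     # 7, True  -> the global optimum
```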
The second outstanding question concerns the overlapping or contrasting ideas of ‘computationally irreducible’ functions versus invertible functions. The latter can be exactly inverted (as is the case for continuous-valued deep learning) or approximately inverted (as is the case for sensory perception in the wild). Given the surfeit of details, particulars, and one-off non-generalizing facts in the world, I wholly agree with the utility of irreducible representations for memorizing information. As mentioned, this is likely exactly what is happening when the MLP layers in a transformer store facts. Yet those MLP layers are locally exactly invertible via the chain rule of calculus, and hence trainable, which does not seem to hold for the cellular-automata example.
Most useful functions are not ‘computationally irreducible’ and can easily be inverted in closed form; much of mathematics (at least that used by a practicing engineer) is exactly this! If you cannot exactly invert a function, you can frequently invert it approximately; with the addition of search, the inverse can (probably but not provably) be improved. This is basically what visual perception is: approximate and rapid inversion of the computational, compositional real-world generating process. Likewise, approximate and iterative inversion is exactly what programming is. Both are directly of interest to what Springtail is working on.[2]
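As a sketch of ‘approximate inversion plus search’ (entirely my own toy setup; the forward map, the pseudo-inverse guess, and the random-search refinement are illustrative assumptions, not anyone’s actual perception or programming pipeline): a cheap amortized guess at the inverse is refined by local search against the forward model.

```python
import numpy as np

# Toy sketch (my own): approximately invert a lossy, nonlinear forward process with a
# cheap amortized guess, then improve the guess by local search against the forward model.

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 8))                   # compressive map: 8 latents -> 4 observations
forward = lambda x: A @ np.tanh(x)            # nonlinear + lossy, so no exact closed-form inverse

x_true = rng.normal(size=8)
y = forward(x_true)                           # the observation to invert

x_hat = np.linalg.pinv(A) @ y                 # fast amortized guess: pretend tanh(x) ~= x
err0 = np.linalg.norm(forward(x_hat) - y)

best, best_err = x_hat, err0
for _ in range(2000):                         # crude local search against the forward model
    cand = best + 0.05 * rng.normal(size=8)
    cand_err = np.linalg.norm(forward(cand) - y)
    if cand_err < best_err:
        best, best_err = cand, cand_err

print(err0, best_err)                         # the search typically shrinks the residual
```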
Regarding biological evolution, purely computationally irreducible representations are certainly exploited by biology & evolution, but I think they are not dominant, because:
1) They are overly brittle; real organisms need to be robust to internal noise and highly variable environments.
2) They are hard to search over, and hence hard to learn. Arguably, mammals & birds are evolution’s answer to the problem of meta-learning: we are a set of metabolic, cognitive, cell-signaling, immune, and social primitives that allow for rapid adaptation. These systems are not per se invertible like engineering systems or neural networks, but because they are modular (= compositional), the number of bits required to select and change behavior is small (= rapid learning). You have to factorize a solution space to make searching for solutions effective (= navigable).
In practice, biological systems are like engineered systems! The central dogma of biology is beautifully elegant; the various systems of the body all seem to have a specified domain of expertise (= a restricted Markov blanket or set of dependencies); hierarchy is sloppy but omnipresent. Biology absolutely leans into the blessing of dimensionality when evolving the primitives that give rise to all the functions within a hierarchy, but not too hard: it breaks dependencies and factorizes solutions, which makes those solutions adapt well (= makes evolution work more quickly).
The third question is why, if there are no inverses to given biological primitives, we are still able to assign approximate teleological roles to what they do.
I think this is an extremely important duality, with repercussions for neural-network learning: irreducible functions can still have approximate inverses, which appears to be a consequence of functional factorization & composition. [3]
Footnotes:
[1] I fear that this may require elements of discrete search.
[2] Uncertainty remains about the role of chunking and splitting in algorithmic approximation. Transformers get their data already split, and hence only have to chunk. Does this scale? Does it compose?
[3] This post only scratches the surface of even deeper questions about learning dependency (hyper-)graphs in systems with finite SNR on finite communication channels. There must be an extension of Shannon information to algorithmic domains, with measures of the probability volume of solution space and of its navigability; I just don’t know it… If there were such a formulation, cellular automata ought to sit at one extreme (infinite SNR on the communication channels, absolutely minimal dependency graphs), while biological neural networks sit at the other (very limited SNR at any given node, but extremely large, if stochastic, dependency graphs).