Are Neuroscience and Machine Learning Swapping Their Worst Habits? | 10,000-Word Interview

Author

Samuel Gershman

Professor of Psychology, Harvard University

A professor in Harvard's Department of Psychology and Center for Brain Science. His lab studies the computational mechanisms of learning, memory, decision-making, and perception. He is also a member of Harvard's Kempner Institute for the Study of Natural and Artificial Intelligence. He is the author of What Makes Us Smart: The Computational Logic of Human Cognition. The Gershman lab aims to understand how individuals acquire complex, structured knowledge about their environment and how this knowledge supports adaptive behavior. The lab explores these questions using a combination of behavioral, neuroimaging, and computational techniques.

Machine Learning and Neuroscience Are Converging

Like most scientific fields, neuroscience has historically been dedicated to finding causal explanations for empirical phenomena. Machine learning, by contrast, has historically focused on building systems that make predictions. Recently, however, the boundary between the two has blurred: neuroscience is increasingly concerned with prediction and is adopting more and more machine learning methods, while machine learning is becoming more interested in causal explanation and is increasingly borrowing research methods from neuroscience.

Before discussing the impact of this role reversal, let's look at a few examples. Brain-Score, a project that evaluates models based on their ability to predict neural responses, exemplifies neuroscience's evolution into a predictive discipline. The platform includes a set of quantitative benchmarks (like neural recording data) and a model leaderboard. A parallel, machine-learning-inspired effort involves developing "foundation models" for neuroscience, which are trained on massive amounts of neural data and benchmarked by their predictive power.

https://www.brain-score.org/

In machine learning, the shift toward explanation has given rise to mechanistic interpretability research, whose ambition is to identify the internal operating mechanisms of systems trained for prediction tasks. Earlier interpretability research focused on relationships between inputs and outputs (for example, explaining why a system denied a loan to one person but approved it for another). Mechanistic interpretability instead explores the relationships among a system's internal computational components. The field openly flaunts its lineage from neuroscience and even aspires to replicate "connectomics" within artificial systems. As Anthropic co-founder Chris Olah and colleagues wrote in a 2020 online essay:

https://distill.pub/2020/circuits/zoom-in/

What would it look like if we took individual neurons (in artificial neural networks), or even individual weights, as worthy objects of serious study? What if we were willing to spend thousands of hours tracing every neuron and its connections? What picture of a neural network would emerge?

Neuroscientists enthusiastically answered this call^[1], bringing their tools, concepts, and explanatory frameworks. This includes analyses of single-neuron tuning and population-level representational similarity, as well as methods for studying nonlinear dynamics and circuit ablations. Even when machine learning researchers don't explicitly borrow tools from neuroscience, they often end up reinventing similar methods.

However, I believe that machine learning's turn toward explanation may not bring us closer to understanding the essence of neural systems; if neuroscience replaces explanation entirely with prediction, we will have to sacrifice precious scientific insight. At the same time, explanation in machine learning is destined to encounter the same difficulties as explanation in neuroscience—namely, that these intricate, giant systems will not simply yield to neuroscience's conventional scalpel. Ironically, this was recognized long ago by machine learning researchers (and a handful of philosophers) but has yet to permeate the academic discourse of neuroscience.

Obstacles to Replacing Explanation with Prediction in Neuroscience

The tension between prediction and explanation is a recurring theme in philosophy, statistics, and social science. Historically, science has aimed to find mechanistic, causal explanations for natural phenomena: for example, explaining that L-DOPA improves Parkinson's disease symptoms because it increases dopamine levels. In systems neuroscience, causal-mechanistic explanations usually take the form of "circuit mechanisms," which explain a specific neural function through excitatory and inhibitory interactions among neurons. For instance, the causal mechanism believed to maintain stable eye position is a network of recurrently connected neurons^[2] that implements a line attractor. Circuit mechanisms of this kind are also the inspiration for mechanistic interpretability research in machine learning.

Mechanistic causal explanations in neuroscience, as in other scientific fields, attempt to discard factors that might be useful for prediction but reflect only "spurious correlations." For example, L-DOPA can produce side effects like involuntary movements and headaches, which correlate with its therapeutic effects on Parkinson's symptoms. A machine learning algorithm might be able to "predict" the therapeutic effect from the side effects, but no one in their right mind would claim that the side effects are the "cause" of the therapeutic effect. If you treat only the side effects (e.g., taking Tylenol for the headache) without touching the hypothesized causal mechanism (dopamine), the Parkinson's symptoms will not change.

Although the above example seems to illustrate a significant difference between prediction and causal-mechanistic explanation, current viewpoints in machine learning and statistics link the two together. A mechanistic causal explanation is essentially an "invariant prediction." A predictive algorithm might exploit spurious correlations in observational data, but under certain intervention conditions (like the Tylenol example above), this prediction is bound to fail spectacularly. Causal mechanisms are those predictive relationships that remain robust even after stripping away spurious correlations.
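The L-DOPA example can be made concrete with a small simulation. The sketch below is purely illustrative: the toy causal model, the effect size of 2.0, the noise levels, and all function names are assumptions introduced here, not clinical facts or anything taken from the cited literature. It fits one predictor of symptom improvement from the headache (a side effect) and another from dopamine (the hypothesized cause), then compares their errors on observational data and on data where the headache has been treated away.

```python
import random


def simulate(n, treat_headache, rng):
    """Draw (dopamine, headache, improvement) triples from a toy causal model."""
    rows = []
    for _ in range(n):
        dopamine = rng.uniform(0.0, 1.0)                     # hypothesized cause
        improvement = 2.0 * dopamine + rng.gauss(0.0, 0.1)   # driven by dopamine only
        headache = dopamine + rng.gauss(0.0, 0.1)            # side effect, not a cause
        if treat_headache:
            headache = 0.0                                   # intervention: painkiller removes it
        rows.append((dopamine, headache, improvement))
    return rows


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - slope * mean_x, slope


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given data."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
d_obs, h_obs, s_obs = zip(*simulate(2000, False, rng))   # observational data
d_int, h_int, s_int = zip(*simulate(2000, True, rng))    # headache treated

from_headache = fit_line(h_obs, s_obs)   # exploits the spurious correlation
from_dopamine = fit_line(d_obs, s_obs)   # uses the hypothesized mechanism

errors = {
    "headache, observational": mse(from_headache, h_obs, s_obs),
    "headache, intervention": mse(from_headache, h_int, s_int),
    "dopamine, observational": mse(from_dopamine, d_obs, s_obs),
    "dopamine, intervention": mse(from_dopamine, d_int, s_int),
}
for name, value in errors.items():
    print(f"{name}: {value:.3f}")
```

In this toy setting the headache-based predictor looks reasonable on observational data and degrades sharply once the headache is treated, while the dopamine-based predictor is essentially unaffected. That stability under intervention is what "invariant prediction" refers to here.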

Invariant prediction may be a necessary condition for causality, but it does not itself reveal the causal mechanism. To understand a causal mechanism, one needs to measure and manipulate the system's components to determine which predictive relationships persist under which intervention conditions. Therefore, if neuroscientists still hold explanation as a goal, a pure focus on prediction (as seen with Brain-Score and neural foundation models) will not replace explanatory work.

Driven by concerns about system alignment, safety, and troubleshooting, machine learning researchers have recognized the importance of more interventional approaches to mechanistic causal explanation. One of the most influential is based on the "circuit hypothesis"^[3], which posits that specific sub-networks within an artificial network drive specific behaviors. Neuroscience seems to offer the perfect toolkit for identifying such circuits, including single-neuron and population-level tuning analyses, brain stimulation, and ablation/knockout techniques. However, some pessimistic research^[4] suggests that in trying to reduce a system to circuits, we are doomed to hit insurmountable "complexity barriers." In the worst case, the number of interventions required to fully understand a neural system at the circuit level (silencing subsets of neurons, for example) grows exponentially with the number of neurons^[5]. This computational intractability^[6] persists even when the goal is only an approximate understanding of the circuits in a neural network.
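To get a feel for the exponential growth, consider the most naive experimental design: silencing every possible non-empty subset of neurons. The sketch below simply counts those subsets. It is a back-of-the-envelope illustration, not the formal argument of the cited analyses, and the function names and circuit sizes are hypothetical.

```python
from itertools import combinations


def silencing_experiments(neurons):
    """Yield every non-empty subset of neurons that could be silenced together."""
    for size in range(1, len(neurons) + 1):
        yield from combinations(neurons, size)


def count_silencing_experiments(n_neurons):
    """Closed form for the number of non-empty subsets of n_neurons neurons."""
    return 2 ** n_neurons - 1


# Enumerating is feasible only for tiny circuits; check it against the closed form.
tiny_circuit = ["n1", "n2", "n3"]
experiments = list(silencing_experiments(tiny_circuit))
assert len(experiments) == count_silencing_experiments(3)

for n in (3, 10, 100):
    print(n, "neurons:", count_silencing_experiments(n), "possible silencing experiments")
```

Each additional neuron roughly doubles the count, which is why exhaustive intervention stops being an option long before a circuit reaches biologically realistic sizes.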

Another cherished assumption in neuroscience is that interventions can be used to establish functional localization. For example, if stimulating or silencing specific neurons changes a system's behavior in a specific way, researchers often infer that those very neurons functionally underlie that change. But evidence from machine learning^[7] suggests that such crude operations can create "localization illusions," where an intervention falsely associates a sub-network with a specific function. Furthermore, you can alter a system's output in specific ways by modifying synaptic weights^[8] outside the sub-network identified by functional localization. Another pessimistic finding shows that dimensionality reduction techniques widely used in neuroscience can fabricate "interpretability illusions"^[9]: even if a low-dimensional representation adequately summarizes a model's behavior on training data, these representations can fail when the model is tested on new data distributions.

These observations should send a chill down the spine of anyone trying to save machine learning with neuroscience tools. Likewise, they should sober up those who believe these tools can save neuroscience itself. In fact, the community has known for nearly a decade that neuroscience tools might be helpless in the face of even moderately complex computational circuits^[10]. Yet, these tools continue to be used in neuroscience, mainly because we haven't yet devised better alternatives.

Finally, let's be more positive. We must acknowledge that the dialogue between machine learning and neuroscience is extremely valuable, if only because it reveals the limitations of our tools and the fragility of our assumptions. The ongoing conversation between machine learning and neuroscience holds promise as a starting point for new methods.

To get a broader view of how the neuroscience community views the relationship between prediction and explanation, I invited eight neuroscientists to share their insights on the following questions: In neuroscience, can we replace explanation with prediction? Is circuit mapping sufficient as an explanatory framework for deep learning? Is it also sufficient as an explanatory framework for neuroscience itself?

Expert Opinions

Trenton Bricken (Anthropic)

Trenton Bricken is a member of the technical staff on the Alignment Science team at Anthropic. He currently works with Claude on automated auditing and on detecting misalignment.

For a neuroscientist, being able to record data from tens of thousands of neurons over a few days is a godsend. Such data are often noisy and mostly obtainable only from small mammals performing simple tasks. Meanwhile, large language models (LLMs) like Claude and GPT can perform a multitude of tasks at or beyond human level, possess rich representations of the world, and can be studied deterministically: we can access every one of their neurons and connections. This incredibly rich data source, combined with the growing capabilities of LLMs, is driving the "mechanistic interpretability" research discussed in this article.

Although neuroscientists have good reasons to emphasize the chasm between LLMs and biological brains, I believe the two share some core computational principles. One of them is how information is represented and stored. Both biological brains and large models learn far more "things" than they have neurons or connections. To store this information (facts, memories, associations, etc.), they must find some way to compress it efficiently into low-dimensional representations. Research shows that LLMs encode information in "superposition": each piece of information is not stored in a single neuron but appears as a pattern of activation across multiple neurons (in neuroscience, this is known as population coding). To reverse-engineer this compression, an algorithm called the sparse autoencoder projects the compressed low-dimensional representation back into a high-dimensional space. For example, it can decompose a single layer of Claude 3 Sonnet into 30 million unique directions, each corresponding to an interpretable concept such as the Golden Gate Bridge. Representing information this way is a core computational problem that large models must solve and that brains also face; cracking it in the AI domain could lead to algorithms that help us understand biological intelligence. As neuroscience recording technologies scale up, these tools might become equally powerful for decoding biological neural representations.
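The idea of superposition can be illustrated with a toy example. The sketch below is not a sparse autoencoder: the feature directions are drawn at random and handed to the decoder rather than learned from data, and the layer width, feature count, and sparsity level are hypothetical. It shows only the underlying point that a layer can carry more features than it has neurons, as long as few features are active at once.

```python
import math
import random


def random_unit_vector(dim, rng):
    """A random direction in activation space, normalized to length 1."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)
N_NEURONS = 256      # hypothetical layer width
N_FEATURES = 1024    # four times as many features as neurons
N_ACTIVE = 3         # sparsity: only a few features are "on" at once

# Each feature is a direction across all neurons, not a single neuron.
directions = [random_unit_vector(N_NEURONS, rng) for _ in range(N_FEATURES)]

# The layer's activation is the sum of the directions of the active features.
active = sorted(rng.sample(range(N_FEATURES), N_ACTIVE))
activation = [sum(directions[f][i] for f in active) for i in range(N_NEURONS)]

# Project back into the high-dimensional feature space and keep the strongest
# readouts (assuming, for simplicity, that we know how many features are active).
readout = [dot(activation, d) for d in directions]
ranked = sorted(range(N_FEATURES), key=lambda f: readout[f], reverse=True)
recovered = sorted(ranked[:N_ACTIVE])

print("active features:   ", active)
print("recovered features:", recovered)
```

Random directions in a high-dimensional space are nearly orthogonal, so the interference between features stays small when only a few are active. A sparse autoencoder has the harder job of discovering such directions from activations alone.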

Jenelle Feather (Carnegie Mellon University)

Assistant Professor in the Neuroscience Institute and Department of Psychology at Carnegie Mellon University, where she leads the Computational Perception Lab. Her lab sits at the intersection of neuroscience, cognitive science, and artificial intelligence, studying the complex neural patterns underlying perception. By comparing computational models with biological systems, her research aims to uncover fundamental principles of perception, identify differences between current AI and human experience, and improve our models of the biological brain.

Neuroscience and machine learning have a deeply intertwined history. In recent years, parts of the boundary between these two fields have become even more blurred. In this column, Sam Gershman explores what happens when neuroscience turns to machine learning and questions the growing emphasis on predictive models for neural activity. While I share a cautious stance against blindly relying on these "digital twins," I am more optimistic about how a new era of high-fidelity predictive models can advance our understanding of neural processing.

A predictive model forces an abstract hypothesis about "how a computation is implemented" or "how a representation emerges" into concrete form. Models can be built at different levels of analysis, for example, by abstracting away biological implementation details or by explicitly incorporating them. If a model fails to predict observed data, the hypothesis instantiated in the model is falsified. But what if a model succeeds? The article's point about "spurious correlations" (or "shortcut learning") is well-taken: a model might predict the correct answer, but for the wrong reasons. However, this is not a reason to abandon predictive models altogether. Rather, it demands that we, as scientists, rigorously design experiments that attempt to "break" the spurious correlations in our predictive models.

Brain-Score and foundation models in neuroscience are already moving in this direction, for instance, by testing on "out-of-distribution" samples. The key point is that while a machine learning model might be large and complex, it is not a black box. In silico experiments offer efficiency and controllability. We can run massive numbers of simulations, perform precise ablations, derive target stimuli from the model itself, or alter the training data to conduct computationally controlled "rearing experiments." In this way, we can guide more efficient biological data collection and reveal potential confounds in existing hypotheses about neural representations.

The aforementioned "digital twins" hold tremendous potential for engineering applications. For example, predictive models could be used to develop new, personalized algorithms for neural prosthetics, such as cochlear implants or cortical stimulation. But we can also use models directly to examine neural representations. We can synthesize stimuli that drive specific neuronal populations or dissect the necessity of different biological motifs. Although this might require developing new tools and analytical techniques that perform better in these complex systems (as the "mechanistic interpretability" field is attempting), computational models provide the theoretical grounding for testing new analytical methods on real biological data.

Konrad Körding (University of Pennsylvania)

Penn Integrates Knowledge (PIK) Professor in neuroscience at the University of Pennsylvania and co-founder of Neuromatch and the Community for Rigor. He is known for his contributions to motor control, methods for neural data analysis, and computational neuroscience, as well as his advocacy for and contributions to open science and scientific rigor. His research combines experimental methods with computational principles and leans heavily on normative models, particularly Bayesian statistics. He also developed an app that predicts a scientist's h-index 10 years into the future. His experimental work addresses motor learning and motor control, linking these phenomena to Bayesian ideas. Lately, he has focused on analyzing neural data and obtaining large-scale neural datasets. He is a frequent advocate for paradigm shifts in neuroscience and has published multiple papers on the application of deep learning in neuroscience.

We are witnessing two disciplines swap their worst habits: neuroscience mistakes benchmark predictions for understanding, while machine learning mistakes the language used to describe mechanisms for the mechanisms themselves. I believe the warning that neuroscience and machine learning might be getting confused is valid, and the cleanest way to address this is to draw a distinction between prediction (even invariant prediction of a certain kind) and causal inference.

Prediction (the forward problem) asks us to find a function mapping measurements x to an outcome y. Causal inference (the inverse problem) asks: which parts of the measured system actually influence the outcome, and how might we change them to produce a better outcome? Both problems are written as y=f(x), which is somewhat unfortunate because they are fundamentally different questions. It's not just the goal that differs; their geometry is different too.

Prediction does not require a one-to-one mapping, because correlated variables can substitute for one another. If two neurons (or two genes) are highly correlated, many models can make equally good predictions, but their attribution of "contribution" will diverge dramatically. Data usually concentrate on a few dimensions and are highly correlated across the dimensions of x. These correlations make prediction easier—we just need to make good predictions on the "manifold" where data typically lie.

Causal inference is difficult for exactly the same reason. Solving the inverse problem means disentangling direct effects from indirect ones in the presence of correlations, which implicitly or explicitly means inverting the correlation structure. When this structure is ill-conditioned, tiny estimation errors can lead to huge fluctuations in the inferred causal factors. Good prediction is often the very hallmark of the conditions that make causal inference difficult: strongly correlated variables that can freely substitute for one another.
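The contrast between the forward and inverse problems can be sketched numerically. In the toy setup below, two measured variables are almost copies of each other and only the first one actually drives the outcome. The data-generating model, noise levels, and function names are assumptions made for illustration. The same regression is refit on 20 independent datasets and evaluated on one held-out set.

```python
import random


def make_data(n, rng):
    """Two nearly identical predictors; only x1 actually drives y."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = x1 + rng.gauss(0.0, 0.01)   # almost a copy of x1
        y = x1 + rng.gauss(0.0, 0.1)     # true weights are (1, 0)
        rows.append((x1, x2, y))
    return rows


def fit(rows):
    """Least squares for y = w1*x1 + w2*x2 via the 2x2 normal equations."""
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    t1 = sum(x1 * y for x1, _, y in rows)
    t2 = sum(x2 * y for _, x2, y in rows)
    det = s11 * s22 - s12 * s12          # small relative to its terms: ill-conditioned
    return (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det


def mse(weights, rows):
    """Mean squared prediction error of the fitted weights on the given rows."""
    w1, w2 = weights
    return sum((w1 * x1 + w2 * x2 - y) ** 2 for x1, x2, y in rows) / len(rows)


rng = random.Random(0)
held_out = make_data(200, rng)
fits = [fit(make_data(200, rng)) for _ in range(20)]
test_errors = [mse(w, held_out) for w in fits]
w1_values = [w1 for w1, _ in fits]

print("spread of w1 across refits:", max(w1_values) - min(w1_values))
print("worst held-out error:      ", max(test_errors))
```

In this sketch every refit predicts the held-out data about equally well, because the sum of the two weights is pinned down tightly. How that sum is split between the two variables, which is the attribution a causal account needs, swings widely from one dataset to the next.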

There's another point that reinforces the earlier discussion on "causality as invariant prediction." In practice, invariance is almost always local: we usually verify stability across similar datasets with slight distribution shifts, rather than performing true causal interventions. After all, such interventions are costly. This local invariance can be useful, but it primarily verifies the similarity of different contexts. In contrast, causality enjoys its reputation because it aims for a much broader generalization: relationships that remain stable under a wide class of interventions because they reflect the mechanisms by which the system produces effects.

The practical lesson is this: whenever we speak of invariance, we must define its domain, that is, which interventions are involved, to what extent, and under what assumptions. The complexity barriers mentioned earlier reinforce this point. If a full circuit-level understanding of a neural system requires a number of interventions exponential in the number of neurons, then a practical "invariant prediction" remains invariant only within the tiny region of the intervention space we have actually probed. It is a statement about local stability, not about a true causal structure that would hold under interventions we haven't done (and might not be able to do). Given how we conduct experiments in neuroscience, where we typically perturb the brain only mildly, we might know very little about how the brain would respond to truly novel stimuli.

John Pearson (Duke University)

Associate Professor of Neurobiology at Duke University, where his lab focuses on theoretical and computational neuroscience, applying it to vision, motor control, and natural behavior.

The brain does not owe us an explanation. Nothing suggests that a recurrent, nonlinear dynamical system like the brain must be describable in a way we can reason about. Yet, in a piecemeal, unpredictable fashion, the impossible happens: we do occasionally come to some understanding of things. In the primate oculomotor system, the fruit fly central complex, the songbird learning circuit, and the retinas of many species, we have gained at least a first draft of an understanding of brain function. All of this goes to show that if we judge solely by the interpretability of artificial neural networks, the world might appear much more unknowable than it actually is.

But why would that be? Let me suggest two answers. First, we have had relatively greater success in explaining systems that operate under significant constraints. These constraints can be information-related (e.g., early sensory systems need to selectively compress the world around them) or structural (e.g., the fruit fly navigation system requires highly specific inputs), but in all cases, neuroscience is handed a model far simpler than a general-purpose neural network, and it is this simplicity that allows experimentalists and theorists to elucidate the organizational principles of its function.

The second answer, of course, is evolution. More specifically, although mutations are random, the landscape evolution explores is highly structured. The fact that nervous systems must develop via genetically specified programs ensures that the resulting connectome types are subject to multiple constraints—organizational biophysics, locality, sparsity, and cell type. These networks are not randomly initialized; they are tuned by developmental processes to execute basic, often quite sophisticated, behaviors right at birth.

Therefore, neuroscientists find themselves in a more favorable position than perhaps expected. Yes, brain function is extraordinarily complex, and much of it will remain opaque to us for a long time. But this complexity has been accumulated incrementally, through fine-tuning and modification, and must be built according to developmental logic—this fact should be a source of optimism. Perhaps, in this case, the Gordian knot of the brain's complexity need not be cut with a single blow; we can peel it away layer by layer, like an onion.

Xaq Pitkow (Carnegie Mellon University)

Associate Professor of Computational Neuroscience at Carnegie Mellon University. He is a computational neuroscientist working to develop mathematical theories of the brain and general principles of intelligent systems. He predominantly studies how distributed nonlinear neural computation can leverage statistical algorithms to guide behavior in natural contexts. He develops novel analysis methods validated on synthetic agents and works closely with experimentalists to test theories with real data.

This article raises important points about the limitations of applying neuroscience methods to machine learning, and vice versa. Its two main arguments are that prediction cannot replace explanation, and that explanation is intractable for complex systems. I want to offer a more optimistic counter-argument: once we recognize what explanation truly provides—generalization—the problems raised by the two arguments dissolve.

The fundamental value of a mechanistic causal explanation is not that it decomposes a system into its parts, but that it enables us to make predictions under new conditions—across interventions, distribution shifts, and different task classes. This includes the invariant prediction mentioned earlier. But if explanation's value lies in its generalization capability, then there is no tension at all between prediction and explanation: explanation is exactly what allows predictions to generalize. The localization illusions and interpretability illusions discussed earlier are real, but they reflect a failure to test under sufficiently stringent generalization conditions that would expose the wrong structure.

Reframing causal explanation through the lens of generalization also answers the complexity-barrier concerns about circuit explanations. The article cites worst-case analyses, but the complexity bounds in those analyses assume that any neural circuit is possible. Real neural systems, both biological and artificial, possess rich structure (such as sparse connectivity and low-order interactions), and adopting these constraints as priors can make circuit-level explanations more tractable than the worst case suggests. In any case, whether or not a precise circuit reduction of a neural system is feasible, it is not the only level at which explanation can have impact in neuroscience. The right level of explanation is the one that provides sufficient generalization within our domain of interest.

Foundation models provide an interesting example. Do they explain anything? Many mechanistically distinct networks can produce the same input-output behavior on natural tasks, potentially even sharing underlying dynamics, making an exact circuit reduction unnecessary for some forms of generalization (except, of course, for interventions involving circuit elements not present in the model). Many explanatory constraints can hold without detailed mechanism, particularly at the level of representation or normative constraints on resources and behavior. These constraints can still be causal, at least in the sense of Aristotle's "final cause" (telos, or purpose). Foundation models provide real explanations: they generalize, they are falsifiable, and they tell us why a system works. They just aren't circuit diagrams. Functional equivalence on domain-relevant tasks is a weaker standard than full causal mechanism, but it turns out that for many questions about complex systems, this is precisely the right level of analysis.

Thus, the challenge is not in choosing between prediction and explanation, but in identifying the level of description that achieves generalization in a scientifically relevant domain and designing sufficiently powerful tests to demonstrate that generalization capability. This is where the interaction between neuroscience and machine learning is most valuable.

Gemma Roig (Goethe University Frankfurt)

Professor in the Department of Computer Science at Goethe University Frankfurt. She is a member of hessian.AI and is affiliated with the Center for Brains, Minds and Machines at MIT.

The growing convergence between neuroscience and artificial intelligence has pushed neuroscience toward becoming a prediction-heavy discipline, raising questions about explanation and causality. Modern deep learning models are now widely used to predict brain activity and to compare representations between artificial and biological systems, especially in sensory and language domains. Constraining models with biological data was expected to systematically improve task performance and model robustness, but this has not yet fully materialized. Instead, the AI field has largely shifted its explainability efforts toward developing post-hoc analysis tools (many inspired by neuroscience) to probe the inner workings of these otherwise opaque models.

Despite their high complexity, AI models remain computational abstractions that omit many structural and dynamic properties of biological nervous systems. Representational alignment and predictive accuracy, while informative, are insufficient to establish mechanistic or causal explanations. For example, the representational alignment revealed by interpretability tools can lead to impressive improvements, but the source could be indirect training dynamics or model architecture, rather than the mechanisms these tools are believed to uncover.

Despite these limitations, the simplicity and controllability of AI models constitute a methodological advantage. Unlike biological systems, AI models can be directly intervened upon: components can be removed, modified, or retrained, and learning dynamics can be systematically altered. Such interventions enable controlled causal tests and the systematic identification of confounding factors, allowing for the evaluation of alternative explanations for observed behaviors or representations. Although these manipulations may not map directly onto biological systems, they can inform the construction of causal hypotheses that are often difficult to test directly in neuroscience. The current emphasis on prediction in neuroscience is justified, as it provides necessary empirical constraints. Strong predictive performance provides a minimal empirical foundation for an explanation. While prediction alone does not establish a mechanism, claims about mechanisms lack solid grounding without it.

Future progress requires integrating interpretability methods with explicit mechanistic analysis, rather than treating alignment or prediction as the end goal. Research should not focus solely on prediction and representational alignment but should target specific cognitive functions and deeply probe the model's internal circuits, transformation processes, and learned structures that implement that function.

Naomi Saphra (Harvard University)

Research Fellow at Harvard University's Kempner Institute, joining Boston University as faculty in 2026. She works on understanding the training process of language models through empirical research: When does the model learn to encode a linguistic pattern or other structure? What can that tell us about how and why the model works? Can we encode useful inductive biases into the training process? Recently, she has begun collaborating with natural and social scientists to use interpretability to understand the world around us.

Prediction can demonstrate our understanding, but only if we truly understand the system being used to make those predictions. If we train a black-box model on observational data and find that it successfully predicts behavior, then all we possess is a second black box, which is barely an improvement over knowing nothing at all. However, if we build an intuitive simulation of the computational agent and use that simulation to make successful predictions, then the simulation is (to some degree) correct, even if it does not reflect the agent's causal mechanisms. It describes the agent holistically, at a computational level, even if not in terms of its component implementation.

On the other hand, even if we successfully identify a causal mechanism, we may still be just as stuck as before, as this article points out by highlighting interpretability illusions. If the structure producing the mechanism is incomprehensible to humans, or if our explanation of how an intervention produces its effect is flawed, then the newly added explanation is merely a second black box added on, not an advance in our understanding of the computational agent.

Whether it's a brain, a large language model, or any other process, what counts as understanding a system? The key is not whether our description is causal or predictive, but whether the description itself is understood.

The bad news is that this property is intrinsically subjective. Some people might intuitively grasp a precise mathematical description of a system, while others can only accept the existence of such intuition on faith. Consequently, one person cannot know for sure if a new description advances human understanding unless it first advances their own personal understanding.

The good news is that a description can still be useful even if humans cannot comprehend it, as with a simulation that has a billion parameters. A black-box description may not directly advance our understanding, but it may allow us to use new tools that the original agent lacks. Under this assumption, any predictive description has the potential to advance our understanding. The question remains: what kind of description advances our understanding?

James Whittington (University of Oxford)

Principal Investigator at the University of Oxford, leading a team that researches the foundations of artificial intelligence and neuroscience. He holds degrees in physics, medicine, and neuroscience from Oxford. He has worked in AI startups and large tech companies and currently consults for several AI technology firms. He is a co-founder of the Thinking About Thinking non-profit, organizing its scientific agenda and the program for the multiple summits and conferences it hosts annually.

Artificial neural networks are immensely powerful but difficult to interpret, much like their biological counterparts (brains). However, because they are so effective at predicting an output y from an input x, we are entering a "shut up and just train" paradigm for much of the data in neuroscience (echoing the "shut up and calculate" mentality in quantum physics). This article rightly questions the trading of comprehensibility for predictive power.

Uninterpretable models stand in stark contrast to the traditional neuroscience models of past decades, which were mostly hand-crafted and causal. Bayesian models are a prime example of this approach: they infer the distribution of a variable z from data y, based on a causal model y = f(z). Causal thinking is not only more interpretable but also naturally handles "out-of-distribution" data, which is the hallmark of genuine understanding.
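As a rough illustration of the Bayesian approach described above, the sketch below infers a distribution over z from a single observation y by grid approximation. Everything specific here is an assumption made for illustration and not taken from the commentary: the linear form of f, the Gaussian prior on z, the Gaussian observation noise, and the function names.

```python
import math

def f(z):
    # Hypothetical causal model chosen for illustration: z produces y.
    return 2.0 * z + 1.0

def posterior_over_z(y_obs, z_grid, noise_sd=0.5, prior_sd=1.0):
    """Grid approximation of p(z | y), assuming y = f(z) + Gaussian noise
    and a zero-mean Gaussian prior on z (both are assumptions)."""
    log_post = []
    for z in z_grid:
        log_prior = -0.5 * (z / prior_sd) ** 2
        log_lik = -0.5 * ((y_obs - f(z)) / noise_sd) ** 2
        log_post.append(log_prior + log_lik)
    peak = max(log_post)  # subtract the peak before exponentiating, for stability
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Candidate values of z from -5 to 5 in steps of 0.005.
z_grid = [-5.0 + 0.005 * i for i in range(2001)]
posterior = posterior_over_z(3.0, z_grid)
z_mean = sum(z * p for z, p in zip(z_grid, posterior))
print(round(z_mean, 3))
```

Because the model is written as "z causes y", every part of the inference can be read off directly: the prior says what we believed about z beforehand, and the likelihood says how well each candidate z explains the observed y. With these particular assumptions the posterior mean lands near 0.94, between the prior mean of 0 and the value z = 1 that would reproduce y = 3 exactly.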

This is precisely the crux of the issue with predictive models. Without a causal model, a successful prediction might rely on variables correlated with the true causal variable, thus hindering generalization. Reading Agatha Christie novels might make you (or a large language model) adept at predicting the murderer in another of her novels, because you've understood her writing style, but it won't make you (or the LLM) a detective, because Agatha Christie most likely didn't orchestrate any real-life murders.

Invariant prediction attempts to mitigate this by identifying predictive relationships that persist across contexts (the causal logic of murder) and ignoring those that vary (the author's writing style). However, collecting data from enough contexts to determine whether a correlation is spurious is no easy feat, and even if enough data could be collected, the causal model learned by a neural network may well not be amenable to interpretability techniques.
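The idea of keeping only relationships that persist across contexts can be sketched with simulated data. This is a deliberately simplified stability check, not any published invariant-prediction algorithm, and the data-generating process, the tolerance, and all function names are hypothetical. In the simulation, one feature causes y with the same strength in every environment, while a second feature is correlated with y in a way that flips sign between environments.

```python
import random

def make_environment(spurious_sign, n=2000, seed=0):
    """Simulate one context. 'cause' drives y identically everywhere;
    'spurious' tracks y with an environment-specific sign."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        cause = rng.gauss(0.0, 1.0)
        y = 1.5 * cause + rng.gauss(0.0, 0.1)
        spurious = spurious_sign * y + rng.gauss(0.0, 0.1)
        rows.append((cause, spurious, y))
    return rows

def slope(xs, ys):
    """Least-squares slope of ys regressed on xs (single predictor)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

def slopes_by_environment(envs, column):
    """Fit the slope of y on one feature separately in each environment."""
    return [slope([r[column] for r in env], [r[2] for r in env]) for env in envs]

def is_invariant(slopes, tol=0.2):
    """Treat a relationship as invariant if its slope barely moves across contexts."""
    return max(slopes) - min(slopes) < tol

envs = [make_environment(+1.0, seed=1), make_environment(-1.0, seed=2)]
causal_slopes = slopes_by_environment(envs, 0)
spurious_slopes = slopes_by_environment(envs, 1)
print("causal feature invariant:", is_invariant(causal_slopes))
print("spurious feature invariant:", is_invariant(spurious_slopes))
```

Within either environment alone, the spurious feature predicts y very well, so a purely predictive model has no reason to discard it. Only the comparison across environments exposes it, which is why the difficulty of collecting data from enough contexts matters so much in practice.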

Meta-learning simply moves the "out-of-distribution" problem up one level: remaining flexible across different task structures requires a training set that contains diverse structures. This is still prediction, but at a level of abstraction that helps us understand the causality of the next level up. The price of not relying on post-hoc explanations of data we don't understand is that we must build understanding block by block.

Ultimately, prediction and causal models sit at two ends of a spectrum, and both are crucial for progress. Although mechanistic interpretability on large models or the predictions of benchmark suites like Brain-Score often lack causal depth, they tackle problems beyond the limits of our current causal understanding, which can yield valuable fruit for those working on more granular causal explanations.

Translator's Note

In the current wave of AI4Science, it's become commonplace to use AI to build predictive models for a given field, with performance surpassing state-of-the-art models built by human scientists. However, as this article argues, at least in neuroscience, prediction is not understanding. Understanding is not just making generalizable predictions; it also involves giving a clear structural description of the operating mechanisms at an appropriate level of abstraction—and what counts as "appropriate" and "clear" is defined by humans. In this sense, even if AI4Science can complete a year's worth of a PhD student's work in a day in terms of building predictive models, it still cannot fully replace scientists.

This is not to say that AI-built models are without value. Science needs to constantly push beyond the limits of current causal understanding, and the way to achieve this is to build causal mechanistic models level by level, block by block. The predictive models that AI tirelessly constructs, along with the corresponding process visualizations, will give scientists richer material for building causal models. AI's role is like that of a microscope or telescope, enabling scientists to see finer detail or farther.

