Mining Activation Functions Like Crypto? DeepMind Builds a 'Compute Farm' to Brute-Force Search for the Next-Gen ReLU


Editor | Panda

For a long time, neural network activation functions have been like the spark plugs in an AI engine. From the early Sigmoid and Tanh, to the later ReLU that ruled the industry, and then to GELU and Swish in recent years, every evolution of activation functions has been accompanied by improvements in model performance. However, for a long time, finding the best activation function often relied on human intuition or limited search spaces.

Now, Google DeepMind is changing the rules.

In a newly released heavyweight paper, "Mining Generalizable Activation Functions," the DeepMind team demonstrated how it used AlphaEvolve to "mine" brand-new activation functions in an effectively infinite space of Python functions.


Paper Title: Mining Generalizable Activation Functions

Paper Link: https://arxiv.org/abs/2602.05688

This is a victory for Neural Architecture Search (NAS), and even more a methodological innovation. Rather than searching directly on the massive ImageNet, DeepMind built a "micro-lab" that uses synthetic data to optimize specifically for out-of-distribution (OOD) generalization.

The results are striking: the machine not only rediscovered GELU but also mined a series of unusual functions with periodic perturbation terms, such as GELUSine and GELU-Sinc-Perturbation. On algorithmic reasoning tasks (like CLRS-30), these functions generalized better than ReLU and GELU, while remaining strongly competitive on standard vision tasks.

Let's take a detailed look.

Saying Goodbye to Manual Tuning: AlphaEvolve and Infinite Search Space

Traditional Neural Architecture Search (NAS) is often limited by predefined search spaces, for example, only looking within combinations of "add, subtract, multiply, divide, unary functions." Although this method previously discovered Swish, it limited the boundaries of exploration.

DeepMind's core weapon this time is AlphaEvolve. This is an LLM-driven evolutionary coding system. Its workflow is not simple parameter adjustment but directly writing and modifying code.


LLM-based Mutation Operator

AlphaEvolve uses frontier LLMs like Gemini as "mutation operators." This means the search space is no longer a combination of discrete mathematical symbols but all possible Python functions. As long as it can run within a certain computational budget (FLOPs) and the input/output tensor shapes are consistent, any Python code is a potential activation function.
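To make this concrete, here is a minimal sketch of what a candidate and its admission check might look like. Both the GELU-plus-sine candidate and the finite-output check are our own illustrations; the paper's actual validation harness and FLOPs accounting are not shown.

```python
import math

def candidate_activation(x):
    # A hypothetical candidate of the kind the LLM might propose:
    # exact GELU plus a small sine perturbation.
    gelu = 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
    return gelu + 0.05 * math.sin(x)

def is_valid_candidate(fn, probe=(-2.0, -0.5, 0.0, 0.5, 2.0)):
    # Stand-in for the admission check: the function must run without error
    # and map each input element to one finite output (shape-preserving).
    # A real harness would also meter its compute against the FLOPs budget.
    try:
        outputs = [fn(x) for x in probe]
    except Exception:
        return False
    return len(outputs) == len(probe) and all(math.isfinite(y) for y in outputs)
```

Any Python function that passes such a filter can enter the evolutionary loop; anything that crashes or produces non-finite values is discarded.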

Evolutionary Loop

The operation flow of the entire system is as follows:

1. Initialization: Start with the standard ReLU function.

2. LLM Proposal: The LLM writes new function variants based on the current best function code. Notably, the LLM also writes down the "theoretical basis" for designing that function in code comments like a human programmer.

3. Micro Evaluation: The new function is implanted into a small Multi-Layer Perceptron (MLP) and trained on specific synthetic datasets.

4. Fitness Calculation: This is the key step. The fitness function is the validation loss on Out-of-Distribution (OOD) test data, so a candidate cannot merely fit the training set; it must generalize beyond it.

5. Iteration: The best-performing functions are retained in the database as seeds for the next round of evolution.
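The five steps above can be sketched as a toy loop. Everything here is a simplification for illustration: a real run uses an LLM as the mutation operator rather than a random sine perturbation, trains a small MLP per candidate, and evaluates genuine OOD validation losses.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def fitness(fn):
    # Toy stand-in for step 4: negative squared error against a target
    # sampled on the OOD interval (0.5, 1), so higher fitness = lower OOD loss.
    xs = [0.5 + 0.05 * i for i in range(1, 11)]
    return -sum((fn(x) - math.sin(3.0 * x)) ** 2 for x in xs)

def mutate(fn):
    # Toy stand-in for step 2's LLM proposal: add a random sine perturbation.
    a, w = random.uniform(-0.3, 0.3), random.uniform(1.0, 5.0)
    return lambda x: fn(x) + a * math.sin(w * x)

random.seed(0)
population = [relu]                        # step 1: start from ReLU
for _ in range(30):                        # step 5: iterate
    parent = max(population, key=fitness)  # best function seeds the next round
    population.append(mutate(parent))      # steps 2-3: propose and evaluate
best = max(population, key=fitness)
```

Because the best candidate is always retained, fitness never decreases across generations, mirroring the database-of-seeds behavior described above.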

This approach lets AlphaEvolve leverage the programming knowledge and mathematical intuition embedded in the LLM, so the proposals tend to be meaningful functions, which greatly improves search efficiency.

Micro-Lab: Tackling Generalization with Synthetic Data

To avoid expensive searches on large datasets (like ImageNet), DeepMind adopted a "Small-Scale Lab" strategy.


They designed a series of simple synthetic regression tasks specifically to test the model's ability to capture data structures rather than rote memorization. The datasets include:

• Random Polynomials: Testing extrapolation ability.

• Spherical Harmonics: Testing the ability to encode periodic structures.

• Feynman Symbolic Regression Dataset: Testing the ability to fit physical equations.

The key design choice is the distribution shift between the training set and the test set. For example, a model might train on the interval (0, 0.5) but be tested on the interval (0.5, 1).
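A minimal sketch of one such task with a built-in train/test shift. The coefficient ranges and sample counts are illustrative assumptions; the paper's exact generators are not reproduced here.

```python
import random

def make_polynomial_task(seed=0, n_train=64, n_test=64):
    # One synthetic regression task with a distribution shift:
    # a random cubic sampled on (0, 0.5) for training and (0.5, 1) for testing.
    rng = random.Random(seed)
    coeffs = [rng.uniform(-1.0, 1.0) for _ in range(4)]

    def poly(x):
        return sum(c * x ** i for i, c in enumerate(coeffs))

    def sample(lo, hi, n):
        return [(x, poly(x)) for x in (rng.uniform(lo, hi) for _ in range(n))]

    return sample(0.0, 0.5, n_train), sample(0.5, 1.0, n_test)
```

A candidate activation is scored by how well an MLP using it predicts the test half, which the model never sees during training.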

The researchers found that if an activation function can survive this harsh "micro-lab," it tends to capture more essential inductive biases and therefore also performs well on complex real-world tasks.

Mined Treasures: From GELUSine to "Turbulent" Functions

After many iterations, AlphaEvolve "mined" several activation functions with distinctive characteristics. Some are refinements of existing functions, while others look bizarre yet carry a certain "physical intuition."


Star Players: GELUSine and GELU-Sinc-Perturbation


The most exciting discovery is that the best-performing functions often follow a general formula:

f(x) = g(x) + ε · p(x)

That is, a standard activation function (like GELU) plus a periodic perturbation term.

• GELUSine: roughly GELU(x) + α · sin(x) for a small amplitude α. The LLM explained in the generated code comments that the sine term introduces periodic "oscillations," helping the optimization process explore the loss landscape and escape local minima.

• GELU-Sinc-Perturbation: GELU(x) · (1 + 0.5 · sinc(x)). This function retains the asymptotic behavior of GELU while introducing controlled non-linear complexity near the origin through the sinc function.
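Following those descriptions, the two functions can be sketched in plain Python. The sine amplitude in GELUSine is a guessed placeholder, and the GELU-Sinc form follows the expression quoted in this article's conclusion; neither is guaranteed to match the paper's exact constants.

```python
import math

def gelu(x):
    # exact GELU via the Gaussian CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sinc(x):
    # unnormalized sinc, with the removable singularity handled at 0
    return 1.0 if x == 0.0 else math.sin(x) / x

def gelu_sine(x, alpha=0.1):
    # GELUSine sketch: GELU plus a small sine perturbation (alpha is a guess)
    return gelu(x) + alpha * math.sin(x)

def gelu_sinc_perturbation(x):
    # GELU-Sinc-Perturbation in the form quoted at the end of this article
    return gelu(x) * (1.0 + 0.5 * sinc(x))
```

Both keep GELU's large-|x| behavior (sinc decays like 1/x, sin stays bounded) while adding oscillatory structure near the origin.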

Complex Attempt: GMTU

AlphaEvolve also discovered a function named GMTU (Gaussian-Modulated Tangent Unit). It combines Tanh, Gaussian decay, and a linear leak term, looking like a modulated signal wave. Although it performs well on synthetic data, the formula is relatively complex, and the computational cost is relatively high.
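A rough sketch of the GMTU idea only, with guessed constants (the paper's exact coefficients are not reproduced here): a tanh wave damped by a Gaussian envelope near the origin, with a small linear leak that takes over far from zero.

```python
import math

def gmtu(x, alpha=0.1):
    # GMTU sketch: tanh modulated by a Gaussian envelope near the origin,
    # plus a small linear leak that dominates far from zero.
    return math.tanh(x) * math.exp(-x * x) + alpha * x
```

The Gaussian factor makes the tanh "wave packet" vanish for large |x|, which is why the function looks like a modulated signal, and also why it needs two transcendental evaluations per element, hence the higher compute cost noted above.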

Failed Lesson: Turbulent Activation

During the search, AlphaEvolve at one point discovered a seemingly high-performing function called the Turbulent Activation.

This function was very "smart," using the Batch statistics (like mean and variance) of the input tensor to dynamically adjust the activation shape. In the synthetic data of the micro-lab, its performance crushed all opponents, with extremely low test loss.

However, this cleverness turned out to be overfitting. When transferred to real tasks such as ImageNet or CIFAR-10, the Turbulent function's performance plummeted: it relied so heavily on the batch statistics of specific datasets that it lost the generality of a point-wise activation function. It is a classic case of "high scores in the lab, low ability in the field," and it indirectly confirms the robustness of point-wise activation functions.
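The failure mode is easy to demonstrate with a hypothetical batch-statistics "activation" (not the paper's exact function): the same input value maps to different outputs depending on the rest of the batch, so the function is no longer point-wise and cannot transfer across datasets with different statistics.

```python
import math

def turbulent_like(batch):
    # An "activation" that standardizes each element by its batch's mean and
    # standard deviation before a ReLU. It is NOT point-wise: the output for
    # a given input value depends on the other elements in the batch.
    n = len(batch)
    mean = sum(batch) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in batch) / n) or 1.0
    return [max(0.0, (v - mean) / std) for v in batch]

# The same input value 3.0 produces different outputs in different batches.
a = turbulent_like([3.0, 1.0, 2.0])[0]
b = turbulent_like([3.0, 0.0, 0.0])[0]
```

A true point-wise activation like ReLU or GELU would return identical values for 3.0 in both cases; here the outputs differ, which is exactly the dataset dependence that broke the Turbulent function outside the micro-lab.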

Real World Exam: The Victory of OOD Generalization

To verify if the functions mined in the "micro-lab" are truly useful, DeepMind implanted them into standard ResNet-50, VGG, and Graph Neural Networks (GCN), testing them on CIFAR-10, ImageNet, CLRS-30, and ogbg-molhiv datasets.

The test results revealed several key facts:

1. King of OOD Tasks: On CLRS-30 (an algorithmic reasoning benchmark emphasizing training with small-scale data and generalizing to larger-scale problems), the newly discovered GELU-Sinc-Perturbation achieved a high score of 0.887, significantly better than ReLU (0.862) and GELU (0.874). This validates DeepMind's core hypothesis: functions optimized on synthetic OOD data can indeed transfer to algorithmic tasks requiring strong generalization.

2. Vision Tasks Keep Up: On ImageNet, although these new functions were optimized for small-scale data, GELUSine and GELU-Sinc-Perturbation still achieved accuracy on par with or slightly better than GELU (Top-1 Accuracy approx 74.5%), far exceeding ReLU (73.5%).

3. The Magic of Periodicity: Why is adding a periodic term like sin(x) or sinc(x) to an activation function effective? DeepMind researchers believe that standard activation functions (like ReLU) are often linear outside the training domain, making it hard to capture complex data structures. Periodic functions allow the model to "store" certain frequency information within the training domain and "retrieve" this information through periodic structures during extrapolation. As the LLM said in the code comments, this is an "implicit frequency analysis."
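A quick numeric illustration of that intuition, using a guessed GELU-plus-sine form: outside a typical training range, GELU's fixed-size increments are essentially constant (linear behavior), while the sine-perturbed version keeps a non-constant periodic component it can "retrieve" during extrapolation.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sine(x):
    return gelu(x) + 0.1 * math.sin(x)  # guessed amplitude

def step_spread(fn, points=(5.0, 6.0, 7.0), h=0.5):
    # How much the increment fn(x + h) - fn(x) varies across points:
    # near zero for asymptotically linear functions, larger for periodic ones.
    steps = [fn(x + h) - fn(x) for x in points]
    return max(steps) - min(steps)
```

For GELU the spread is vanishingly small far from the origin, whereas the sine term keeps it visibly non-zero; this is the "implicit frequency analysis" the LLM's comments described, in miniature.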

The table below restates the key numbers quoted above (dashes mark values not given in the text):

Activation               CLRS-30 score   ImageNet Top-1
ReLU                     0.862           73.5%
GELU                     0.874           -
GELUSine                 -               ~74.5%
GELU-Sinc-Perturbation   0.887           ~74.5%

Deep Thinking: The Future of AI Designing AI

DeepMind's research not only contributes a few useful activation functions but also triggers deep thinking about AI-assisted scientific research.

Code as Search Space

AlphaEvolve shows that letting LLMs write Python code directly, as the search space, is more flexible and powerful than predefined mathematical operators. The coding conventions and logical capabilities inherent in LLMs make the generated functions mostly readable and executable, and the LLM can even explain its "design concepts."

From Fitting to Generalization

For a long time, activation function design was mostly about optimizing gradient flow (e.g., ReLU mitigating vanishing gradients). This study shows that the shape of the activation function directly affects the model's inductive bias. By introducing periodic structure, we are in effect telling the neural network: "Many of the world's laws are cyclical, not just linear."

Big Wisdom from "Small Data"

In an era that pursues trillion-parameter models trained on petabyte-scale data, DeepMind went against the trend, mining universal architectural components in a "micro-lab" with only a few hundred synthetic samples. This suggests that if we can precisely define the essence of "generalization" (for example, via OOD splits), small data can still unlock big insights.

Conclusion

It has to be said that the results of this paper are quite astonishing.

DeepMind's work tells us that at the most basic component level of neural networks, there still exists a vast uncharted territory.

Future AI models may have every line of code and every operator written by the AI itself. And what we need to do might just be to build a suitable "evolutionary lab" for them, like AlphaEvolve.

If you are training a model that deals with complex graph structures or requires strong logical reasoning, you might try swapping nn.ReLU for an activation of the form GELU(x) * (1 + 0.5 * sinc(x)); there may be pleasant surprises.


AINews · AI News Aggregation Platform
© 2026 AINews. All rights reserved.