A deep dive into how the FANformer architecture works and what makes it so powerful compared to the Transformer
LLMs have always surprised us with their capabilities, with many speculating that scaling them would lead to AGI.
But such expectations have recently led to disappointment, with GPT-4.5, OpenAI's largest and best chat model to date, performing worse than many smaller models on multiple benchmarks.
While DeepSeek-V3 scores 39.2% accuracy on AIME 2024 and 42% accuracy on SWE-bench Verified, GPT-4.5 scores 36.7% and 38% on these benchmarks, respectively.


This raises the question: Do we need a better LLM architecture to scale further?
Luckily, researchers have recently put forward a strong candidate.
Called FANformer, this architecture integrates the Fourier Analysis Network (FAN) into the attention mechanism of the Transformer.
The experimental results are very promising, with FANformer consistently outperforming the Transformer as model size and training tokens scale up.
As seen in the plot below, a FANformer with 1 billion parameters performs better than other open-source LLMs of comparable size and training tokens.

In this story, we take a deep dive into how a FANformer works and discuss the architectural modifications that make it so powerful.
We Begin With ‘Fourier Analysis Networks’
Standard deep neural networks/MLPs do very well at capturing and modelling ("learning") most patterns from training data, but there is one domain where they largely fall short.
This is modelling periodicity in data.
Since most real-world data contains hidden periodic patterns, this shortcoming hinders the learning efficiency of traditional neural networks.
Check out the following example, where a Transformer struggles to model a simple mod function even when given sufficient training resources.
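To make this concrete, here is a minimal sketch of the underlying issue (an illustration, not code from the paper), using a small MLP instead of a full Transformer for brevity. It fits y = x mod 5 on one interval and then evaluates on a later, unseen interval; the network sizes and training settings are arbitrary choices for demonstration.

```python
import torch
import torch.nn as nn

# Illustrative sketch: fit a small MLP to the periodic target y = x mod 5
# on one interval, then evaluate on a later interval to show how poorly
# the learned function extrapolates the periodic pattern.
torch.manual_seed(0)

def make_data(lo, hi, n=2000):
    x = torch.rand(n, 1) * (hi - lo) + lo
    y = x % 5.0  # simple periodic target
    return x, y

x_train, y_train = make_data(0.0, 20.0)   # seen range
x_test, y_test = make_data(20.0, 40.0)    # unseen range (extrapolation)

mlp = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)

for step in range(5000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(mlp(x_train), y_train)
    loss.backward()
    opt.step()

with torch.no_grad():
    in_mse = nn.functional.mse_loss(mlp(x_train), y_train).item()
    out_mse = nn.functional.mse_loss(mlp(x_test), y_test).item()

# The error on the unseen range is typically far larger: the MLP fits the
# shape inside the training interval instead of learning the period itself.
print(f"train-range MSE: {in_mse:.4f} | unseen-range MSE: {out_mse:.4f}")
```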

This is fixed by Fourier Analysis Networks (FANs), which use the principles of Fourier Analysis to encode periodic patterns directly within the neural network.
This can be seen in the following example, where a FAN models a periodic sin function better than an MLP, a KAN, and a Transformer.

A FAN Layer is described using the following equation:

$$\text{FANLayer}(X) = \big[\, \cos(W_p X) \;\|\; \sin(W_p X) \;\|\; \sigma(B_{\bar{p}} + W_{\bar{p}} X) \,\big]$$

where:

- X is the input
- W_p and W_p̄ are learnable projection matrices
- B_p̄ is the bias term
- σ represents a non-linear activation function
- || denotes concatenation
Compared to an MLP layer that applies a simple linear transformation followed by a non-linear activation, the FAN layer explicitly integrates periodic transformations (sine and cosine) with the linear transformation and non-linear activation.
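Putting the equation into code can make this structure clearer. Below is a minimal PyTorch sketch of a FAN layer based on the equation above (not the official implementation); the split ratio between the periodic and non-periodic output dimensions and the choice of GELU for σ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FANLayer(nn.Module):
    """Minimal sketch of a FAN layer following the equation above:
    output = [cos(W_p X) || sin(W_p X) || sigma(B_pbar + W_pbar X)].
    The periodic/non-periodic split ratio and the GELU activation are
    illustrative assumptions, not prescribed values."""

    def __init__(self, d_in: int, d_out: int, p_ratio: float = 0.25):
        super().__init__()
        d_p = int(d_out * p_ratio)        # dims for each of cos(.) and sin(.)
        d_pbar = d_out - 2 * d_p          # dims left for the non-periodic part
        self.W_p = nn.Linear(d_in, d_p, bias=False)       # W_p (no bias)
        self.W_pbar = nn.Linear(d_in, d_pbar, bias=True)  # W_pbar with bias B_pbar
        self.act = nn.GELU()                              # sigma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = self.W_p(x)
        # '||' in the equation is concatenation along the feature dimension
        return torch.cat([torch.cos(p), torch.sin(p), self.act(self.W_pbar(x))], dim=-1)

# Quick shape check: the layer maps d_in features to d_out features
layer = FANLayer(d_in=128, d_out=256)
print(layer(torch.randn(4, 10, 128)).shape)  # torch.Size([4, 10, 256])
```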