FANformer Is The New Game-Changing Architecture For LLMs

A deep dive into how the FANformer architecture works and what makes it so powerful compared to the Transformer

LLMs have always surprised us with their capabilities, with many speculating that scaling them would lead to AGI.

But such expectations have recently led to disappointment, with GPT-4.5, OpenAI’s largest and best model for chat, performing worse than many smaller models on multiple benchmarks.

While DeepSeek-V3 scores 39.2% accuracy on AIME 2024 and 42% accuracy on SWE-bench Verified, GPT-4.5 scores 36.7% and 38% on these benchmarks, respectively.

This raises the question: Do we need a better LLM architecture to scale further?

Luckily, we have a strong candidate that has been put forward by researchers recently.

Called FANformer, this architecture integrates the Fourier Analysis Network (FAN) into the attention mechanism of the Transformer.

The experimental results are very promising, with FANformer consistently outperforming the Transformer as model size and training tokens are scaled up.

In the scaling experiments reported by the researchers, a FANformer with 1 billion parameters performs better than other open-source LLMs of comparable size and training tokens.

In this story, we deep dive into how FANformer works and discuss the architectural modifications that make it so powerful.

We Begin With ‘Fourier Analysis Networks’

Standard deep neural networks / MLPs do very well at capturing and modelling (“learning”) most patterns in training data, but there’s one domain where they largely fall short.

This is modelling periodicity in data.

Since many real-world datasets contain hidden periodic patterns, this shortfall hinders the learning efficiency of traditional neural networks.

Check out the following example, where a Transformer struggles to model a simple mod function even when given sufficient training resources.

Comparison of the performance of Transformer and FANformer on periodicity modeling
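To make the task concrete, here is a small sketch (hypothetical data, not the paper’s exact experimental setup) of what such a periodicity task looks like: a mod function learned on one input range and evaluated outside it.

```python
import numpy as np

# Illustrative sketch only (hypothetical data, not the paper's exact setup):
# a periodic target y = x mod 5 that a model must learn to extrapolate.
rng = np.random.default_rng(0)

# Train on a limited range of inputs...
x_train = rng.uniform(0, 20, size=(10_000, 1))
y_train = x_train % 5

# ...then evaluate outside that range, where only the periodic pattern helps.
x_test = rng.uniform(20, 40, size=(1_000, 1))
y_test = x_test % 5

# An MLP or Transformer regressor fits the training range well but typically
# breaks down on x_test, because nothing in its layers encodes the period of 5.
```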

This is fixed by Fourier Analysis Networks (FANs), which use the principles of Fourier Analysis to encode periodic patterns directly within the neural network.

This can be seen in the following example, where a FAN models a periodic sine function better than an MLP, a KAN, and a Transformer.

Performance comparison between different neural network architectures within and outside the domain of their training data for a sine function (Image from ArXiv research paper titled ‘FAN: Fourier Analysis Networks’)

A FAN layer is described using the following equation:

FANLayer(X) = [cos(X W(p)) || sin(X W(p)) || σ(B(p̄) + X W(p̄))]

where:

  • X is the input
  • W(p) and W(p̄) are learnable projection matrices
  • B(p̄) is the bias term
  • σ represents a non-linear activation function
  • || denotes concatenation

Compared to an MLP layer that applies a simple linear transformation followed by a non-linear activation, the FAN layer explicitly integrates periodic transformations (sine and cosine) with the linear transformation and non-linear activation.
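To make this concrete, here is a minimal PyTorch sketch of a FAN layer that follows the equation above. The GELU activation and the fraction of output dimensions given to the periodic (cosine/sine) part are illustrative assumptions, not values prescribed by the FAN paper.

```python
import torch
import torch.nn as nn


class FANLayer(nn.Module):
    """Minimal sketch of a FAN layer following the equation above:
    output = [cos(X W(p)) || sin(X W(p)) || σ(B(p̄) + X W(p̄))].
    """

    def __init__(self, d_in: int, d_out: int, d_periodic: int):
        super().__init__()
        d_rest = d_out - 2 * d_periodic                       # cos and sin each take d_periodic dims
        self.w_p = nn.Linear(d_in, d_periodic, bias=False)    # W(p), no bias on the periodic part
        self.w_p_bar = nn.Linear(d_in, d_rest, bias=True)     # W(p̄) with bias B(p̄)
        self.act = nn.GELU()                                  # σ (assumed activation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = self.w_p(x)
        # Concatenate the periodic projections with the MLP-style projection.
        return torch.cat([torch.cos(p), torch.sin(p), self.act(self.w_p_bar(x))], dim=-1)


# A FAN layer keeps the same input/output shape as an MLP layer,
# so it can act as a drop-in replacement.
layer = FANLayer(d_in=256, d_out=256, d_periodic=64)
out = layer(torch.randn(8, 16, 256))
print(out.shape)  # torch.Size([8, 16, 256])
```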
