Mixture-of-Mamba Improves Multimodal Pretraining of State Space Models

Multimodal AI Modeling: Mixture-of-Mamba Optimizes State Space Models

State Space Models (SSMs) are gaining importance as an efficient alternative to transformers for sequence modeling. Until now, however, they have applied a single shared set of parameters to every token regardless of its modality, so they cannot exploit modality-specific features. This limits their performance in multimodal pretraining, i.e., pretraining on data from several modalities such as text, images, and speech. A new architectural concept called "Mixture-of-Mamba" promises a remedy.

Mixture-of-Mamba introduces modality-aware sparsity through modality-specific parameterization of the Mamba block: each token is processed by projection parameters dedicated to its modality, while the rest of the block is shared across modalities. Inspired by Mixture-of-Transformers (W. Liang et al., arXiv:2411.04996, 2024), Mixture-of-Mamba extends the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. In practice, the model exploits modality-specific structure directly and reaches comparable quality with substantially less training compute.
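
To make the idea concrete, here is a minimal PyTorch sketch, not the authors' code, of a modality-aware projection as it could be used inside a Mamba block: one set of linear weights per modality, with tokens routed by a modality id. The class name ModalityAwareLinear, the routing convention (0 = text, 1 = image), and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalityAwareLinear(nn.Module):
    """One linear projection per modality; each token is routed by its modality id."""
    def __init__(self, d_in: int, d_out: int, num_modalities: int = 2):
        super().__init__()
        self.projs = nn.ModuleList(
            [nn.Linear(d_in, d_out, bias=False) for _ in range(num_modalities)]
        )

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); modality_ids: (batch, seq) integer label per token
        out = x.new_zeros(*x.shape[:-1], self.projs[0].out_features)
        for m, proj in enumerate(self.projs):
            mask = modality_ids == m          # boolean mask selecting modality-m tokens
            out[mask] = proj(x[mask])         # apply that modality's own weights
        return out

# Usage on an interleaved text/image sequence (0 = text token, 1 = image token).
x = torch.randn(2, 16, 512)
modality_ids = torch.randint(0, 2, (2, 16))
proj = ModalityAwareLinear(512, 1024)
y = proj(x, modality_ids)                     # shape: (2, 16, 1024)
```

Because the routing is determined by the token's modality rather than learned, it adds no gating overhead; the sparsity comes from each token touching only its own modality's projection weights.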

Evaluation in Various Pretraining Scenarios

The effectiveness of Mixture-of-Mamba was evaluated in three different multimodal pretraining scenarios:

Transfusion: Here, text and continuous image tokens are interleaved, with the image tokens trained under a diffusion loss. In this setting, Mixture-of-Mamba reached comparable image loss with significantly less training compute: at a model size of 1.4 billion parameters, only 34.76% of the training FLOPs (floating-point operations) were required.

Chameleon: This scenario uses interleaved text and discrete image tokens. Here too, Mixture-of-Mamba reached a similar image loss with only 42.50% of the FLOPs at 1.4 billion parameters; the corresponding text loss was reached with 65.40% of the FLOPs.

Three-Modality Scenario: This extended framework additionally integrates speech data alongside text and images. Mixture-of-Mamba reached a comparable speech loss with only 24.80% of the FLOPs at 1.4 billion parameters.
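
To put these percentages into perspective, the short Python snippet below converts the reported FLOP fractions into approximate compute-reduction factors. The numbers are taken directly from the results above; reading a fraction f as roughly 1/f less training compute to reach the same loss is a simplification.

```python
# FLOP fractions reported for matching the dense baseline's loss at the 1.4B scale.
reported_flop_fractions = {
    "Transfusion, image loss":     0.3476,
    "Chameleon, image loss":       0.4250,
    "Chameleon, text loss":        0.6540,
    "Three-modality, speech loss": 0.2480,
}

for setting, f in reported_flop_fractions.items():
    # A fraction f corresponds to roughly a 1/f reduction in training compute.
    print(f"{setting}: about {1 / f:.1f}x less training compute")
```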

Synergistic Effects through Decoupling of Projection Components

An ablation study highlights the synergistic effects of decoupling the projection components within the Mamba block: decoupling several projections jointly led to larger improvements than modifying any single component on its own.
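
As a rough illustration of these ablation variants, the sketch below (reusing the hypothetical ModalityAwareLinear from the earlier example) selects which projections of a Mamba-style block are decoupled per modality and which remain shared. The projection names are assumptions based on common Mamba implementations, not taken from the paper.

```python
import torch.nn as nn

def build_projection(name: str, d_in: int, d_out: int, decoupled: set,
                     num_modalities: int = 2) -> nn.Module:
    """Modality-specific ("decoupled") projection if requested, otherwise a shared one."""
    if name in decoupled:
        # ModalityAwareLinear is the routed projection from the sketch above.
        return ModalityAwareLinear(d_in, d_out, num_modalities)
    return nn.Linear(d_in, d_out, bias=False)

# Two ablation-style configurations: decoupling a single projection vs. all of them.
single_component = {"in_proj"}
all_components   = {"in_proj", "x_proj", "dt_proj", "out_proj"}
```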

Conclusion: Modality-Aware Sparsity as an Effective Design Principle

The results show that modality-aware sparsity is a versatile and effective design principle. Extending it from transformers to SSMs sets a new benchmark in multimodal pretraining: Mixture-of-Mamba combines the computational efficiency of SSMs with the ability to exploit modality-specific information effectively, opening up new possibilities for powerful and efficient AI models that process multimodal data.