A new study from MIT CSAIL reveals how text-only large language models such as GPT-J can learn to integrate vision without any visual training data. The researchers found that individual units in the model act as “multimodal neurons”, dynamically translating visual concepts into related text.
The study analyzed GPT-J, a popular open-source text-only transformer, after augmenting it with a frozen image encoder called BEIT connected through a linear adapter. Although GPT-J itself had never been trained on images, the combined model generated surprisingly good captions for photos.
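This is the familiar "frozen language model plus lightweight adapter" recipe. The sketch below shows the general wiring with dummy modules and placeholder dimensions standing in for the real BEIT and GPT-J weights; the class name, sizes, and tensors are illustrative assumptions, not the study's code.

```python
import torch
import torch.nn as nn

# Placeholder sizes, not the actual BEIT/GPT-J dimensions.
IMG_FEAT_DIM = 768
TXT_EMBED_DIM = 512
VOCAB_SIZE = 1000


class LinearAdapterCaptioner(nn.Module):
    """Frozen image encoder -> trainable linear projection -> frozen language model."""

    def __init__(self, image_encoder: nn.Module, language_model: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder
        self.language_model = language_model
        # Only this projection is trained; both backbones stay frozen.
        self.proj = nn.Linear(IMG_FEAT_DIM, TXT_EMBED_DIM)
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        for p in self.language_model.parameters():
            p.requires_grad = False

    def forward(self, image_feats, text_embeds):
        # Image features become "soft prompt" vectors in the LM's embedding space.
        img_prompt = self.proj(self.image_encoder(image_feats))
        # Prepend the projected image tokens to the caption's token embeddings.
        return self.language_model(torch.cat([img_prompt, text_embeds], dim=1))


# Dummy stand-ins with plausible shapes, just to show the wiring.
encoder = nn.Linear(IMG_FEAT_DIM, IMG_FEAT_DIM)   # pretend frozen image encoder
lm = nn.Linear(TXT_EMBED_DIM, VOCAB_SIZE)         # pretend frozen language model
model = LinearAdapterCaptioner(encoder, lm)

image_feats = torch.randn(2, 196, IMG_FEAT_DIM)   # fake patch features for 2 images
text_embeds = torch.randn(2, 10, TXT_EMBED_DIM)   # fake caption token embeddings
logits = model(image_feats, text_embeds)          # shape (2, 206, VOCAB_SIZE)
```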
By attributing model outputs back to individual neurons, the researchers identified units that activate in response to specific visual concepts such as “horse” or “car”. These neurons then inject corresponding words like “gallop” or “drive” into the model’s predictions.
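One simple way to probe for such units (a sketch of the general idea, not necessarily the authors' exact attribution method) is to score each MLP unit by how strongly it fires on the image portion of the prompt, then read that unit's output weights through the unembedding matrix to see which tokens it pushes toward. All tensors and sizes below are random placeholders.

```python
import torch

torch.manual_seed(0)

HIDDEN, D_MODEL, VOCAB = 2048, 512, 1000       # placeholder sizes
mlp_out = torch.randn(HIDDEN, D_MODEL)         # stand-in for one MLP layer's output weights
unembed = torch.randn(D_MODEL, VOCAB)          # stand-in for the LM's unembedding matrix
activations = torch.randn(196, HIDDEN).relu()  # unit activations on the image "tokens"

# Score each unit by how strongly it fires on the image portion of the prompt.
unit_scores = activations.mean(dim=0)          # (HIDDEN,)
top_units = unit_scores.topk(5).indices

for u in top_units.tolist():
    # Project the unit's output direction into vocabulary space and read off the
    # tokens it most increases; with real weights these would be words like
    # "horse" or "gallop" for a horse-selective unit.
    token_logits = mlp_out[u] @ unembed        # (VOCAB,)
    top_tokens = token_logits.topk(5).indices.tolist()
    print(f"unit {u}: top token ids {top_tokens}")
```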
The findings suggest that GPT-J learns to align vision and language representations inside its transformer layers. The linear adapter between modalities projects image features into the text embedding space, but those projected features do not correspond to discrete tokens for the visual concepts they carry. That translation happens deeper in the network, via the multimodal neurons.
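That distinction can be made concrete with a nearest-token lookup on the adapter's outputs. The snippet below is a toy illustration with random placeholder tensors; consistent with the finding above, running the same check on the real model yields nearest tokens that are mostly unrelated to the image content, which is why the concept translation must happen later in the network.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D_MODEL, VOCAB, N_PATCHES = 512, 1000, 196       # placeholder sizes

token_embeddings = torch.randn(VOCAB, D_MODEL)   # stand-in for the LM's input embedding table
projected_image = torch.randn(N_PATCHES, D_MODEL)  # adapter output for one image

# Cosine similarity between each projected patch and every token embedding.
sims = F.normalize(projected_image, dim=-1) @ F.normalize(token_embeddings, dim=-1).T
nearest = sims.argmax(dim=-1)                    # nearest token id per image patch

# With the real model, decoding these ids does not recover the visual concepts
# in the image, i.e. the adapter alone does not produce concept tokens.
print(nearest[:10].tolist())
```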
According to the authors, the presence of these dynamic concept encoders could explain why language models generalize so well to new modalities with simple adapter modules. The study also demonstrates that these models contain neurons selective for high-level abstractions beyond raw sensory input.
By characterizing how AI systems integrate new data types, the researchers hope to better understand the strong cross-task abilities of large language models. The discovery of modality translation at the neuron level also opens up possibilities for controlling model behavior by tweaking individual units.
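As a concrete example of what "tweaking individual units" could look like in practice, the sketch below zeroes one hidden unit in a toy MLP using a PyTorch forward hook; the module, unit index, and effect are hypothetical, not taken from the study.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy MLP standing in for one transformer feed-forward block.
mlp = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
x = torch.randn(1, 512)
baseline = mlp(x)

UNIT = 1234  # hypothetical index of a concept-selective unit


def ablate_unit(module, inputs, output):
    # Returning a modified tensor from a forward hook replaces the module output.
    output = output.clone()
    output[..., UNIT] = 0.0  # silence the chosen unit's activation
    return output


handle = mlp[1].register_forward_hook(ablate_unit)  # hook the hidden activations
ablated = mlp(x)
handle.remove()

# With the real captioner, ablating a concept-selective unit would weaken or
# remove the related words (e.g. "horse", "gallop") from generated captions.
print((baseline - ablated).abs().max())
```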