I'm Tomotake Sasaki, a senior researcher in the Autonomous Learning Project at Fujitsu Research. Fujitsu Research aims to create "AI that can learn autonomously" and has been running a joint research program toward this goal with researchers at the Massachusetts Institute of Technology (MIT) and the Center for Brains, Minds and Machines (CBMM) since 2019. One result of this collaboration, conducted with Dr. Vanessa D'Amario and Dr. Xavier Boix, has been accepted at NeurIPS 2021. In this blog post, I would like to give an outline of it.
The paper introduced in this blog post
- Title: How Modular Should Neural Module Networks Be for Systematic Generalization?
- Authors: Vanessa D'Amario, Tomotake Sasaki, Xavier Boix
- Conference: Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021)
- Link to Paper, Link to Presentation1, Link to Presentation2, Link to GitHub
The content of the accepted paper
Background: Systematic Generalization
Advanced information processing for images is one of the tasks most expected to be carried out by machine learning technologies, in particular deep learning. However, such tasks pose a major challenge for machine learning approaches: the number of possible combinations of object categories, attributes such as color and size, positions in the image, environmental conditions like illumination, and spatial relations between objects can be enormous, and in real-world problems the training data can hardly be exhaustive of all of them. Despite the tremendous success of deep learning methods on many benchmarks, recent studies have shown that their performance degrades dramatically when the networks are tested on out-of-distribution combinations, i.e., new combinations that are not included in the training data.
On the other hand, humans can deal even with such new combinations as long as they have seen each element individually in the past. This ability is called systematic generalization, and achieving high systematic generalization performance with deep learning methods is now regarded as an important research topic.
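To make this concrete, here is a minimal Python sketch of such a train/test split; the attribute names and held-out pairs are hypothetical and only illustrate the idea, not the actual protocol of the datasets used in the paper.

```python
# A toy systematic-generalization split: every attribute value appears during
# training, but some attribute *combinations* are held out for testing.
from itertools import product

shapes = ["cube", "sphere", "cylinder"]   # hypothetical attribute values
colors = ["red", "green", "blue"]

all_combinations = set(product(shapes, colors))
ood_test = {("cube", "blue"), ("sphere", "red")}   # held-out combinations
train = all_combinations - ood_test

# Every individual shape and color still occurs in training ...
assert {s for s, _ in train} == set(shapes)
assert {c for _, c in train} == set(colors)
# ... but the held-out pairs never do, so answering a question about a
# "blue cube" at test time requires systematic generalization.
assert train.isdisjoint(ood_test)
```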
Visual Question Answering and Neural Module Networks
Visual Question Answering (VQA) is one such advanced image processing task: an algorithm is required to answer a question associated with an image, as shown in Figure 1.
The left image is an example from the VQA-MNIST datasets we created for this study. The right image is from the CLEVR dataset [1], which was created by Stanford University and Facebook AI Research.
Recent studies on Visual Question Answering [2,3] have shown that a particular type of deep neural network, the Neural Module Network (NMN) [4,5], can achieve higher systematic generalization performance than other types.
Neural Module Networks are modular in the sense that they divide the information processing into the following three stages and use a different neural network for each: 1) feature extraction from the input image, 2) question-dependent processing, 3) output of the answer. Regarding the network(s) used in the second stage, there is a variant of NMN that uses only one network [4] and another that uses many networks [5]. This is a different kind of modularity from the three-stage structure itself, but its effect on systematic generalization has received little attention so far.
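As a rough illustration of this three-stage structure, the following PyTorch-style sketch uses one stem, a handful of question-dependent modules, and a classifier. The layer sizes, module names, and the way a question is turned into a program are illustrative assumptions, not the configurations used in the paper.

```python
import torch
import torch.nn as nn

class TinyNMN(nn.Module):
    def __init__(self, num_answers=10):
        super().__init__()
        # Stage 1: feature extraction from the input image ("stem").
        self.stem = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Stage 2: question-dependent processing, here one small module per
        # elementary operation; the question is assumed to be parsed into a
        # program that selects which modules to apply.
        self.modules_by_name = nn.ModuleDict({
            "find_color": nn.Conv2d(16, 16, kernel_size=1),
            "find_shape": nn.Conv2d(16, 16, kernel_size=1),
        })
        # Stage 3: output of the answer ("classifier").
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_answers)
        )

    def forward(self, image, program):
        h = self.stem(image)
        for op in program:                      # e.g. ["find_shape", "find_color"]
            h = torch.relu(self.modules_by_name[op](h))
        return self.classifier(h)

logits = TinyNMN()(torch.randn(1, 3, 64, 64), ["find_shape", "find_color"])
```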
Investigating the effect of different degrees of modularity at different stages
In this study, we investigated through large-scale computational experiments whether the degree of modularity and the stage at which it is imposed affect systematic generalization performance, and if so, which combinations are beneficial. The five main combinations investigated in our study are shown in Figure 2a. (We report results for more combinations in the appendix of the paper.)
We consider the three degrees of modularity shown in Figure 2b, using VQA-MNIST as an example; the sketch below also illustrates them. The degree shown in the middle, in which one neural network is assigned to each group of sub-tasks such as shape, color or size, is introduced for the first time in this study. To our knowledge, this is also the first attempt to use more than two neural networks in the feature extraction stage and the output stage.
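The following snippet sketches how the three degrees differ in terms of how sub-tasks are assigned to networks; the sub-task names and groups are hypothetical stand-ins for the VQA-MNIST attributes, not the exact list used in the paper.

```python
# Three degrees of modularity for one stage, expressed as a mapping from
# sub-tasks to the network that handles them (all names are hypothetical).
sub_tasks = ["zero", "one", "red", "blue", "small", "large"]
groups = {"category": ["zero", "one"],
          "color": ["red", "blue"],
          "size": ["small", "large"]}

# Lowest degree: a single network shared by every sub-task.
single_network = {t: "net_shared" for t in sub_tasks}

# Intermediate degree (introduced in this study): one network per group,
# e.g. all category sub-tasks share one network, all color sub-tasks another.
group_level = {t: f"net_{g}" for g, members in groups.items() for t in members}

# Highest degree: one network per individual sub-task.
sub_task_level = {t: f"net_{t}" for t in sub_tasks}
```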
Analysis using VQA-MNIST and SQOOP datasets
We first ran experiments with the VQA-MNIST datasets newly created for this study and the SQOOP dataset proposed in a previous work [2].
Figure 3 shows examples from VQA-MNIST. As you can see, VQA-MNIST consists of four datasets with different types of task.
Figure 4 shows examples from the SQOOP dataset. SQOOP consists only of questions that ask about the spatial relation between two objects (letters and digits). In the original SQOOP dataset, each image contains five objects; in this study we also created a version in which each image contains only two objects.
Figure 5 shows the results for the VQA-MNIST datasets. The horizontal axes show the number of combinations contained in the training data and the vertical axes show the systematic generalization performance (the test accuracy on out-of-distribution combinations).
Table 1 shows the systematic generalization performance (in percent) on the SQOOP dataset.
The findings based on the results shown above and further results reported in the appendix of the paper can be summarized as follows.
- Tuning the degree of modularity and the stage at which it is imposed has a clear impact on systematic generalization performance.
- An intermediate degree of modularity (group level), especially at the feature extraction stage, is beneficial for achieving higher systematic generalization performance.
Application to Vector-NMN and CLEVR-CoGenT
We then applied the findings obtained with VQA-MNIST and SQOOP to Vector-NMN [3], the state-of-the-art variant of NMN, and tested its performance on the Compositional Generalization Test split of the CLEVR dataset (CLEVR-CoGenT).
We introduced group-level modularity in the feature extraction stage of Vector-NMN and compared it with the original Vector-NMN and with Tensor-NMN, a baseline used in the paper that proposed Vector-NMN.
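As a rough sketch of what this change looks like in code, the single stem of Vector-NMN is replaced by one small stem per attribute group; the grouping, channel counts, and the concatenation of group features are illustrative assumptions, not the exact implementation in our repository.

```python
import torch
import torch.nn as nn

class GroupLevelStem(nn.Module):
    """Group-level feature extraction: one small stem per attribute group."""

    def __init__(self, groups=("shape", "color", "size", "material"), channels=32):
        super().__init__()
        self.stems = nn.ModuleDict({
            g: nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
            for g in groups
        })

    def forward(self, image):
        # One feature map per attribute group, concatenated along the channel
        # dimension and handed to the question-dependent modules downstream.
        return torch.cat([stem(image) for stem in self.stems.values()], dim=1)

features = GroupLevelStem()(torch.randn(1, 3, 128, 128))  # -> (1, 128, 128, 128)
```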
Table 2 shows the systematic generalization performance (in percent) for each of the thirteen question types. In almost all cases, the Vector-NMN with group-level modularity at the feature extraction stage (rightmost column) shows the highest systematic generalization performance.
Closing remarks
This blog post gives just an outline of the result; please see the paper for further details. If you are a participant of NeurIPS 2021, please also come to our poster session. Other results of the collaboration with MIT and CBMM can be found on this page.
Fujitsu Research is looking for new employees and interns. If you are interested in these opportunities, please contact Hiro Kobashi, Project Director of the Autonomous Learning Project, for a casual meeting (the linked webpage is written in Japanese, but he is happy to chat in English).
References
[1] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2017), pages 2901–2910, 2017.
[2] Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, and Aaron Courville. Systematic generalization: What is required and can it be learned? In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), 2019.
[3] Dzmitry Bahdanau, Harm de Vries, Timothy J O’Donnell, Shikhar Murty, Philippe Beaudoin, Yoshua Bengio, and Aaron Courville. CLOSURE: Assessing systematic generalization of CLEVR models. arXiv preprint arXiv:1912.05783v2, 2020.
[4] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV 2017), pages 804–813, 2017.
[5] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Judy Hoffman, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Inferring and executing programs for visual reasoning. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV 2017), pages 2989–2998, 2017.