Words are always spoken and understood in context. Linguists, psychologists, and computer scientists have long recognized this, defining word meaning through a word's aggregate linguistic contexts across large samples of text. Early examples include Latent Semantic Analysis (Landauer & Dumais, 1997); more recently, computational models built on this idea have become vastly successful in NLP, notably Word2Vec (Mikolov et al., 2013) and BERT (Devlin et al., 2018).
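The distributional idea behind these models can be sketched in a few lines: represent each word by counts of the words that co-occur near it, then compare words by the cosine similarity of those count vectors. The toy corpus, function names, and window size below are all invented for illustration; real systems like LSA or Word2Vec learn dense vectors from far larger corpora.

```python
from collections import Counter
import math

# Toy corpus: word meaning approximated by aggregate co-occurrence contexts.
corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the mouse ate the cheese",
    "the dog ate the bone",
]

def context_vector(target, sentences, window=2):
    """Count words appearing within `window` positions of `target`."""
    counts = Counter()
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t == target:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        counts[toks[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

cat, dog, cheese = (context_vector(w, corpus) for w in ("cat", "dog", "cheese"))
# Words used in similar contexts end up with similar vectors:
print(cosine(cat, dog) > cosine(cat, cheese))  # True
```

Even on this tiny sample, "cat" and "dog" share contexts ("the ... chased") and so score as more similar to each other than to "cheese", which is the core intuition the talk's multimodal models extend beyond text.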
However, when young children learn to speak, they draw on a much richer context than language alone: their parents expose them to a rich visual world, and vision remains a common context for adult language use as well. In this talk, I argue that modern models for commonsense reasoning and natural language processing should therefore learn from both visual and linguistic data.
I will present two examples of multimodal neural models that bring together visual and linguistic context. The first is a language model that predicts next words better when trained on image embeddings in addition to the corresponding captions (Ororbia et al., ACL 2019). The second is a BERT-based model that set a new state of the art in visual commonsense reasoning: it chooses answers to questions about movie stills and also gives a rationale for each answer (Alberti et al., EMNLP 2019). It is pre-trained on a web corpus of images and their ALT texts (descriptions added for accessibility purposes) before learning to answer questions about the movie stills.
David Reitter is a senior research scientist at Google Research, New York City, where he works on modeling conversational and multimodal interaction using very large-scale data. Until recently, he was an associate professor of information sciences at Penn State, where his research group carried out NSF-funded research on computational models of human cognition. David did his postdoc in psychology at Carnegie Mellon University and holds a PhD in informatics from the University of Edinburgh (2008).