Microsoft Research has posted this interesting video:

An Artificial Intelligence (AI) system that understands the world around us must be able to interpret and reason about both what we see and the language we speak. In recent years, research at the intersection of vision, temporal reasoning, and language has attracted considerable attention.

One of the major challenges is ensuring proper grounding and performing reasoning across multiple modalities, given the heterogeneity of the data, when supervision is weak or absent.

Talk slides:
