Keynotes from previous days will be archived below. To watch the keynote sessions live, please go to the livestream.

Prof. Andrew Zisserman

University of Oxford

How can we learn sign language by watching sign-interpreted TV?

(Monday 22nd November: 11:05 - 12:05 GMT)

Sign languages are visual languages that have evolved in deaf communities. For many years, computer vision researchers have worked on systems to recognize individual signs and towards translating sign languages to spoken languages. Now, with the progress in human action and pose recognition and in machine translation due to deep learning, many of the tools are in place. However, the lack of large-scale datasets for training and evaluation is holding back research in this area. This talk will describe recent work in developing automatic and scalable methods to annotate continuous signing videos and build a large-scale dataset. The key idea is to build on sign-interpreted TV broadcasts that have weakly-aligned subtitles. These enable sign spotting and high-quality annotation of the video data. Three methods will be covered: using mouthing cues from signers to spot signs; using visual sign language dictionaries to spot signs; and using the subtitle content to align the subtitles with the signing. Taken together, these methods are used to produce the new, large-scale BBC-Oxford British Sign Language Dataset, with over a thousand hours of sign-interpreted broadcast footage and millions of sign instance annotations. The dataset is available for download for non-commercial research.
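As a rough illustration of the dictionary-based sign spotting step described above, the sketch below scores sliding-window embeddings of continuous signing video against a dictionary exemplar by cosine similarity and keeps windows above a threshold. The embedding source, function name, and threshold are illustrative assumptions, not the authors' pipeline.

    import numpy as np

    def spot_sign(exemplar_embedding, window_embeddings, threshold=0.8):
        """Return indices of video windows whose embedding is close to the
        dictionary exemplar (cosine similarity above `threshold`). Hypothetical helper."""
        d = exemplar_embedding / np.linalg.norm(exemplar_embedding)
        w = window_embeddings / np.linalg.norm(window_embeddings, axis=1, keepdims=True)
        similarities = w @ d                          # cosine similarity per window
        return np.flatnonzero(similarities > threshold), similarities

    # Toy usage: 512-d embeddings for one dictionary sign and 100 video windows.
    rng = np.random.default_rng(0)
    exemplar = rng.standard_normal(512)
    windows = rng.standard_normal((100, 512))
    windows[42] = exemplar + 0.1 * rng.standard_normal(512)   # plant one true occurrence
    hits, scores = spot_sign(exemplar, windows)
    print("candidate sign locations:", hits)                  # -> [42]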

Prof. Daphne Koller

insitro

Transforming Drug Discovery using Digital Biology

(Tuesday 23rd November: 16:00 - 17:00 GMT)

Modern medicine has given us effective tools to treat some of the most significant and burdensome diseases. At the same time, it is becoming consistently more challenging and more expensive to develop new therapeutics. A key factor in this trend is that the drug development process involves multiple steps, each of which involves a complex and protracted experiment that often fails. We believe that, for many of these phases, it is possible to develop machine learning models to help predict the outcome of these experiments, and that those models, while inevitably imperfect, can outperform predictions based on traditional heuristics. To achieve this goal, we are bringing together high-quality data from human cohorts, while also developing cutting-edge methods in high-throughput biology and chemistry that can produce massive amounts of in vitro data relevant to human disease and therapeutic interventions. These data are then used to train machine learning models that make predictions about novel targets, coherent patient segments, and the clinical effect of molecules. Our ultimate goal is to develop a new approach to drug development that uses high-quality data and ML models to design novel, safe, and effective therapies that help more people, faster, and at a lower cost.
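As a loose, toy illustration of training a model to predict the outcome of an experiment, the sketch below fits a standard classifier to synthetic stand-in "in vitro" features; the data, features, and model choice are assumptions for illustration only and do not reflect any particular drug-discovery pipeline.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    # Synthetic stand-in data: rows are compounds, columns are assay readouts.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 50))
    # Toy "experiment outcome" driven by a few of the features plus noise.
    y = (X[:, :5].sum(axis=1) + 0.5 * rng.standard_normal(1000) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))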

Prof. Katerina Fragkiadaki

Carnegie Mellon University

Modular 3D neural scene representations for visuomotor control and language grounding

(Wednesday 24th November: 14:30 - 15:30 GMT)

Current state-of-the-art perception models localize rare object categories in images, yet often miss basic facts that a two-year-old has mastered: that objects have 3D extent, that they persist over time despite changes in the camera view, that they do not intersect in 3D, and so on. We will discuss models that learn to map 2D and 2.5D images and videos into amodally completed 3D feature maps of the scene and the objects in it by predicting views. We will show that the proposed models learn object permanence, have objects emerge in 3D without human annotations, can ground language in 3D visual simulations, and learn intuitive physics and controllers that generalize across scene arrangements and camera configurations. In this way, the proposed world-centric scene representations overcome many limitations of image-centric representations for video understanding, model learning, and language grounding.
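A minimal sketch of one ingredient mentioned above, lifting a 2.5D observation into a world-centric 3D feature grid: per-pixel features are backprojected using the depth map and camera intrinsics, then averaged into voxels. The function, grid extent, and averaging scheme are illustrative assumptions, not the speaker's learned models.

    import numpy as np

    def unproject_to_voxels(feat, depth, K, grid_size=32, extent=4.0):
        """Scatter per-pixel features into a 3D voxel grid (illustrative helper).

        feat: (H, W, C) per-pixel features; depth: (H, W) metric depth;
        K: 3x3 camera intrinsics. The grid spans [-extent/2, extent/2]^3.
        """
        H, W, C = feat.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        x = (u - K[0, 2]) * depth / K[0, 0]      # backproject pixels to 3D points
        y = (v - K[1, 2]) * depth / K[1, 1]
        pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

        idx = np.floor((pts / extent + 0.5) * grid_size).astype(int)   # voxel index per point
        valid = np.all((idx >= 0) & (idx < grid_size), axis=1)
        grid = np.zeros((grid_size, grid_size, grid_size, C))
        count = np.zeros((grid_size, grid_size, grid_size, 1))
        np.add.at(grid, tuple(idx[valid].T), feat.reshape(-1, C)[valid])
        np.add.at(count, tuple(idx[valid].T), 1.0)
        return grid / np.maximum(count, 1.0)      # average features per voxel

    # Toy usage: 8x8 feature map, constant depth of 1 m, simple pinhole intrinsics.
    feat = np.ones((8, 8, 16))
    depth = np.full((8, 8), 1.0)
    K = np.array([[10.0, 0.0, 4.0], [0.0, 10.0, 4.0], [0.0, 0.0, 1.0]])
    print(unproject_to_voxels(feat, depth, K).shape)   # (32, 32, 32, 16)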

Prof. Davide Scaramuzza

University of Zürich

Vision-based Agile Robotics, from Frames to Events

(Thursday 25th November: 14:30 - 15:30 GMT)

Autonomous mobile robots will soon play a major role in search-and-rescue, delivery, and inspection missions, where a fast response is crucial. However, their speed and maneuverability are still far from those of birds and human pilots. Agile flight is particularly important: since drone battery life is usually limited to 20-30 minutes, drones need to fly faster to cover longer distances. However, to do so, they need faster sensors and algorithms. Human pilots take years to learn the skills to navigate drones. What does it take to make drones navigate as well as, or even better than, human pilots? Autonomous, agile navigation through unknown, GPS-denied environments poses several challenges for robotics research in terms of perception, planning, learning, and control. In this talk, I will show how combining model-based and machine learning methods with new, low-latency sensors, such as event cameras, can allow drones to achieve unprecedented speed and robustness by relying solely on onboard computing.
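As a small illustration of working with the low-latency event cameras mentioned above, the sketch below accumulates an event stream into a voxel-grid tensor, one common input representation for learning-based pipelines; the representation choice, bin count, and sensor resolution are assumptions for illustration, not the speaker's system.

    import numpy as np

    def events_to_voxel_grid(events, num_bins, height, width):
        """Accumulate events into a (num_bins, H, W) grid of signed polarity counts.

        events: (N, 4) array of (x, y, timestamp, polarity in {-1, +1}).
        """
        grid = np.zeros((num_bins, height, width), dtype=np.float32)
        x = events[:, 0].astype(int)
        y = events[:, 1].astype(int)
        t = events[:, 2]
        p = events[:, 3]
        t = (t - t.min()) / max(t.max() - t.min(), 1e-9)          # normalize timestamps to [0, 1]
        b = np.minimum((t * num_bins).astype(int), num_bins - 1)  # temporal bin per event
        np.add.at(grid, (b, y, x), p)                             # signed accumulation of polarities
        return grid

    # Toy usage: 10,000 random events on a 260x346 sensor, 5 temporal bins.
    rng = np.random.default_rng(0)
    n, h, w = 10_000, 260, 346
    events = np.column_stack([rng.integers(0, w, n), rng.integers(0, h, n),
                              np.sort(rng.random(n)), rng.choice([-1.0, 1.0], n)])
    print(events_to_voxel_grid(events, num_bins=5, height=h, width=w).shape)   # (5, 260, 346)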