• NeurIPS paper accepted!

    Do you want to boost image classification, long-tailed recognition, out-of-distribution detection, and open-set recognition with a single line of code?

    Read our paper titled: Maximum Class Separation as Inductive Bias in One Matrix. For each class, it creates a vector that is maximally apart from all other classes, in a nice mathematical closed-form solution.

    Wonderful collaboration with Tejaswi Kasarla, Pascal Mettes, Max van Spengler, Elise van der Pol and Rita Cucchiara!

  • Neuro-Symbolic Reasoning

    Deep learning is becoming more and more capable of detecting objects, classifying them, and describing their attributes. This enables reasoning about such objects, their specifics, and their relations!

    For instance, see the photo of my research group at TNO. Can you find a woman standing next to a man, both wearing glasses?

    First-order logic is able to describe this in a few lines of symbolic predicates. We recognize the symbols in the predicates via a language-vision model (CLIP). A neuro-symbolic program reasons about the logic and the recognized symbols with their probabilities.

    This reasoning is probabilistic and multi-hypothesis, with pruning. That is a crucial asset if you have many objects that may sometimes be uncertain.
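
    The query about the photo can be sketched as a small probabilistic search. This is only a minimal illustration, not our actual neuro-symbolic program: the detections, their probabilities, the distance threshold in `next_to`, and the pruning threshold are all made-up assumptions. Each hypothesis is a (woman, man) pair, its score is the product of the predicate probabilities, and a hypothesis is dropped as soon as its partial score falls below the threshold.

```python
from itertools import permutations

# Hypothetical detections: a position plus per-symbol probabilities
# (in practice these could come from a language-vision model such as CLIP).
detections = [
    {"id": "A", "x": 0.0, "woman": 0.9, "man": 0.1, "glasses": 0.8},
    {"id": "B", "x": 1.0, "woman": 0.2, "man": 0.9, "glasses": 0.7},
    {"id": "C", "x": 5.0, "woman": 0.1, "man": 0.8, "glasses": 0.9},
]


def next_to(a, b, max_dist=1.5):
    """Crisp spatial predicate (assumed): 1.0 if the detections are close, else 0.0."""
    return 1.0 if abs(a["x"] - b["x"]) <= max_dist else 0.0


def find_pairs(dets, prune_below=0.2):
    """Score woman(w) & man(m) & glasses(w) & glasses(m) & next_to(w, m)."""
    results = []
    for w, m in permutations(dets, 2):
        score = 1.0
        for p in (w["woman"], m["man"], w["glasses"], m["glasses"], next_to(w, m)):
            score *= p
            if score < prune_below:  # prune unlikely hypotheses early
                break
        else:
            # Only reached when no predicate pushed the score below the threshold.
            results.append((w["id"], m["id"], score))
    return sorted(results, key=lambda r: -r[2])


print(find_pairs(detections))
```

    With these toy numbers, only the pair (A, B) survives the pruning.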

    The result is highlighted. Pretty cool, right?

  • Maximum Separation between Classes

    Maximizing the separation between classes constitutes a well-known inductive bias in machine learning and a pillar of many traditional algorithms. By default, deep networks are not equipped with this inductive bias and therefore many alternative solutions have been proposed through differential optimization. We propose a simple alternative: encoding maximum separation as an inductive bias in the network by adding one fixed matrix multiplication before computing the softmax activations.

    Despite its simple nature, this one matrix multiplication provides real impact. We show that our proposal directly boosts classification, long-tailed recognition, out-of-distribution detection, and open-set recognition, from CIFAR to ImageNet.
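
    To make the idea concrete, here is a minimal NumPy sketch. It assumes a recursive simplex construction, which is one closed-form way to obtain class vectors that are maximally apart; the function names `max_separation_matrix` and `separated_softmax` are hypothetical, and the snippet is an illustration rather than the code released with the paper. For C classes, the matrix has C unit-length columns in C-1 dimensions, and every pair of columns has cosine similarity -1/(C-1).

```python
import numpy as np


def max_separation_matrix(num_classes):
    """Return a (C-1, C) matrix whose columns are unit vectors that are
    pairwise equally and maximally separated (cosine -1/(C-1))."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    P = np.array([[1.0, -1.0]])  # base case: two classes on a line
    for k in range(2, num_classes):
        # New first row: one class at the pole, all others at -1/k.
        top = np.concatenate(([1.0], -np.ones(k) / k))[None, :]
        # Remaining rows: the previous solution, shrunk to keep unit norm.
        bottom = np.concatenate(
            (np.zeros((k - 1, 1)), np.sqrt(1.0 - 1.0 / k**2) * P), axis=1
        )
        P = np.concatenate((top, bottom), axis=0)
    return P


def separated_softmax(features, P):
    """One fixed matrix multiplication, followed by a softmax over classes."""
    logits = features @ P  # (batch, C-1) @ (C-1, C) -> (batch, C)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)


P = max_separation_matrix(10)
probs = separated_softmax(np.ones((2, 9)), P)
```

    In this sketch the network outputs C-1 features instead of C logits, and the matrix `P` stays fixed during training.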

    Work with Tejaswi Kasarla, Pascal Mettes, Max van Spengler, Elise van der Pol, Rita Cucchiara. Paper:

  • Using Language to Analyze Images

    Recent models connect language and vision, which is very powerful for analyzing images. Language is also useful for generalizing to new labels. We can now search images using words. The search term doesn’t have to be exactly the right word, which makes this widely applicable. CLIP (OpenAI), LiT and Flamingo are recent models that show great performance and promise for several use cases:

    • Which word describes a picture? Example: Domainnet-Real consists of 345 classes (airplane, car, bus, etc.). For each image, CLIP can select the most likely of the 345 labels. This is done via text-image matching. The accuracy is 80%, which is quite impressive when you consider that random guessing gives 0.3% (1 in 345).

    • Clustering a large set of images. CLIP can convert an image to a vector (embedding). This vector is very distinctive. Example: when clustering the tens of thousands of images in Domainnet-Real into 345 clusters with k-means, 68% is correct.

    • Ranking of images. CLIP can score how well a text search term matches an image. You can use that score to sort the images. In practice, we see that images higher in the list actually match the search term, even if the search term is quite specific, such as ‘drone’ or ‘airport’.
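
    The label selection and ranking use cases above both reduce to cosine similarity between embeddings. The sketch below assumes the image and text embeddings have already been computed by a model such as CLIP. It does not call any CLIP library, and the function names and toy vectors are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length so that dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_labels(image_emb, text_emb):
    """For each image, pick the index of the best-matching label text."""
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T  # (images, labels)
    return sims.argmax(axis=1)


def rank_images(image_emb, query_emb):
    """Return image indices sorted from best to worst match with one text query."""
    scores = l2_normalize(image_emb) @ l2_normalize(query_emb)
    return np.argsort(-scores)


# Toy 2-D embeddings (assumed); real embeddings have hundreds of dimensions.
text = np.array([[1.0, 0.0], [0.0, 1.0]])                 # e.g., 'drone', 'airport'
images = np.array([[0.9, 0.1], [0.2, 0.8], [2.0, 0.1]])

print(zero_shot_labels(images, text))   # label index per image
print(rank_images(images, text[0]))     # images sorted for the first query
```

    For clustering, the same normalized image embeddings could be passed to a standard k-means implementation.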

  • Explainability of Predictions

    I’m curious about deep learning and why it makes particular decisions.

    Q1. How little of a picture does such a model need for a particular classification? A1: For an ambulance, you only need to see a piece of wheel and a red stripe (see attached video).

    Q2. Which parts of a picture lead the model to a different classification? A2.1: An ambulance becomes more like a fire truck if the red cross, the letters ‘ambulance’ and a few letters of ‘ambulance’ are not visible. A2.2: If the red stripes, the door, and the blue health symbol are not there, then an ambulance becomes a police car.

    Check out the demo:

    You see three types of vehicles: an ambulance, fire truck, or police car. For each picture, a part of the picture is first removed, as long as the classification remains the same (Q1). Later on, patches are removed to arrive at a different classification (Q2).

    Key observations: (1) All these pictures were convertible to any other class! (2) Logos are important. (3) Specific details are important too, such as the grille at the front, a red lamp, or the blue color. (4) Regularly, the predictions are based on very limited visual evidence.

    For this experiment, an image was modeled as a bag of patches, encoded by a set prediction model. Modeling it as a set enables us to remove patches (i.e., set elements) and still make predictions. The model can deal with a variable number of inputs.
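
    The first step (Q1) can be sketched as a greedy loop over the set of patches. This is a simplified illustration under assumptions, not the exact procedure behind the demo: `predict` stands for any model that accepts a variable-sized collection of patches, and the toy classifier below is invented for the example.

```python
def minimal_patch_set(patches, predict):
    """Greedily remove patches while the predicted class stays the same."""
    target = predict(patches)
    kept = list(patches)
    changed = True
    while changed:
        changed = False
        for i in range(len(kept)):
            candidate = kept[:i] + kept[i + 1:]
            # Keep the removal only if the classification is unchanged.
            if candidate and predict(candidate) == target:
                kept = candidate
                changed = True
                break
    return kept


def toy_predict(patches):
    """Hypothetical classifier: 'ambulance' needs a wheel and a red stripe."""
    if "wheel" in patches and "red stripe" in patches:
        return "ambulance"
    return "other"


print(minimal_patch_set(["sky", "wheel", "red stripe", "road"], toy_predict))
```

    The result is a set from which no single patch can be removed without changing the prediction. It is not guaranteed to be the smallest possible set.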

  • How to select first labels for object detection?

    Learning object detection models with very few labels is possible thanks to ingenious few-shot techniques and clever selection of the images to be labeled. Few-shot techniques work with as few as 1 to 10 randomly selected labels per object class.

    We are curious whether the performance of randomized label selection can be improved by selecting the 1 to 10 labels per object class in a non-random manner. Our method selects the images to be labeled first. It improves performance by 25% compared to random image selection!

    Below you see the first images, selected by our method, to be labeled. These are representative and the object is well visible yet not too large.

    The paper is accepted for the International Conference on Image Analysis and Processing and will be presented next week. This research is part of DARPA Learning with Less Labels. More information when clicking on the post.

  • Multiset Prediction

    Sets are everywhere: the objects in an image {car, person, traffic light}, the atoms in a molecule, etc. We research set prediction: how can a neural network output a set? This is not trivial, because sets can differ in size. The neural network needs to deal with variable-sized outputs. See our ICLR 2021 paper for an energy-based approach.

    In our ICLR 2022 paper, we extend the method to deal with multisets. Contrary to sets, multisets can have multiple occurrences of a single element. For instance, {car, car, person, person, person, traffic light}. In multisets, elements may be identical, which poses a problem if we would like to get a different output for them. For instance, if we have {car, car} and we want to output the label and ordering {car-1, car-2}. Previous approaches are based on sets and cannot treat car-1 and car-2 differently.

    We propose a method that can treat car-1 and car-2 differently, to output {car-1, car-2}. This essential ingredient is called multiset-equivariance. Published at ICLR 2022 and presented on May 25 2022.
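
    To see why identical elements are a problem, consider a permutation-equivariant layer: it applies the same function to every element, so two identical inputs always get identical outputs. The sketch below uses a DeepSets-style layer as an assumed stand-in for such set-based approaches. It is not our model, and the weights and feature vectors are made up. The helper `number_duplicates` is hypothetical and only shows the kind of output we are after.

```python
import numpy as np


def equivariant_layer(x, w_self=1.0, w_mean=0.5):
    """DeepSets-style layer: each element is combined with the mean of the set."""
    return w_self * x + w_mean * x.mean(axis=0, keepdims=True)


def number_duplicates(labels):
    """Desired target: identical labels receive distinct running indices."""
    counts = {}
    numbered = []
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
        numbered.append(f"{label}-{counts[label]}")
    return numbered


# Two identical 'car' feature vectors and one 'person' vector (toy numbers).
elements = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
out = equivariant_layer(elements)

print(np.array_equal(out[0], out[1]))            # the two cars are indistinguishable
print(number_duplicates(["car", "car", "person"]))
```

    No choice of weights in such a layer can separate the two cars, which is why a different notion of equivariance is needed.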

    This proves to be very helpful, not only to separate equal elements, but also similar elements. This property boosts the performance for both set and multiset prediction. Performance is favorable over Transformer and Slot Attention. In the example below, you see the prediction of objects and their properties, and how our method (last column) improves by refining iteratively (from left to right).

    Part of NWO Efficient Deep Learning, co-sponsored by TNO Appl.AI.

  • Domain-specific Pretraining

    We often retrain a deep learning model for a new domain and new classes. For instance, object detection in images that were recorded from above (overhead). These overhead images look very different from common datasets with (mostly) frontal images. Sometimes the classes are also different, such as swimming pools that become visible in overhead images.

    For that purpose, we need to retrain the model. Preferably with as few labels as possible. Selecting a proximal dataset (labeled overhead images) is very helpful. We used DOTA 1.5, a big overhead dataset, as pretraining. We were able to improve the Pool-Car dataset accuracy from mAP=0.14 to mAP=0.33 with just 1 label per class.

    Part of our research in DARPA Learning with Less Labels.

  • Visual Question Answering

    Visual Question Answering (VQA) is a deep learning model that makes it possible to ask questions about an image in natural language. However, this often goes wrong in practice, especially with compound questions (“Is there a woman and a boy?”) and with questions that require understanding of the world (“a boy is a child”). Our method (Guided-VQA) extends VQA with such knowledge.

    We have shown on the Visual Genome dataset that it leads to better answers. This technique allows us to search for objects for which we do not have an explicit detector, and combinations and relationships (“Is there a boy to the right of the woman?”).

    Our idea is to decompose the compound queries and to add common sense to resolving such queries. We do so through iterative, conditional refinement and contradiction removal. The refinement enables coarse-to-fine questioning (leveraging taxonomic knowledge), whereas contradictions in the questioning are removed by logic. We coin our method ‘Guided-VQA’, because it incorporates guidance from external knowledge sources.

    Published at International Conference on Image Analysis and Processing 2022.

    (click on post to see more details)

  • Set Prediction by an Energy-based Model

    Which celebrities have something in common? Please have a look at the top row in this image.

    Which attributes do they share? It depends on how you look!

    We developed a new deep learning model that is able to predict multiple answers. This is key to questions that are inherently ambiguous.

    Each row in the attached image is one answer. In total, the model produces four answers. Blue indicates the celebrities who do not share the two attributes. These four answers by the model are correct. Row 1: man and no beard. Row 2: man and glasses. Row 3: man and bald. Row 4: bald and glasses.

    Paper at ICLR ‘21 with David Zhang and Cees Snoek (University of Amsterdam). More information when clicking on the post.

  • Out of distribution

    Deep Learning is a great tool. But one of its problems is that it’s overconfident: its confidences are too high for unknown classes. That’s a problem in an open world where new classes may be encountered.

    See the unknown fruit below. Standard deep learning (left) is shockingly certain that it is a pear (which is not the case). We extended a standard model with a notion of uncertainty: it says the unknown does not look like the known classes (right).

    At the Videos section, you can find more examples.

    Our model is an Equidistant Hyperspherical Prototype Network. The paper was accepted at NeurIPS, at a workshop about uncertainty in decision making. It is a collaboration with Pascal Mettes from the University of Amsterdam.

    The prototypes are almost for free. This model has the same backbone network, while requiring fewer weights for the classifier head. It just requires 5 additional lines of Python code for the prototypes and the cosine loss.
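
    As a rough impression of how little code is involved, here is a NumPy sketch. It is not the code from the paper. It assumes that the equidistant prototypes are the vertices of a regular simplex (centered one-hot vectors, normalized) and that the cosine loss is the squared difference between 1 and the cosine similarity to the prototype of the true class. Both function names are hypothetical.

```python
import numpy as np


def simplex_prototypes(num_classes):
    """One unit vector per class; every pair has cosine -1/(num_classes - 1)."""
    P = np.eye(num_classes) - 1.0 / num_classes  # center the one-hot vectors
    return P / np.linalg.norm(P, axis=1, keepdims=True)


def cosine_loss(embeddings, labels, prototypes):
    """Mean of (1 - cos(embedding, prototype of its class))^2 (assumed form)."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = np.sum(z * prototypes[labels], axis=1)
    return np.mean((1.0 - cos) ** 2)


protos = simplex_prototypes(4)
labels = np.arange(4)
print(cosine_loss(3.0 * protos, labels, protos))  # embeddings aligned with prototypes
```

    The prototypes are fixed, so the classifier head needs no learned weights of its own in this sketch.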

  • Deployable Decision Making in Embodied Systems

    In December, I was at the NeurIPS Workshop on Deployable Decision Making in Embodied Systems, to present our paper about modeling uncertainty in deep learning.

    This was our poster:

    Great talks by (and panel discussions with) experts in robotics, vision, machine learning (DeepMind, MIT, Columbia, EPFL, Boeing, etc.).

    The focus was robustness of systems that have machine learning components. Key questions were: how to safely learn models & how to safely deal with model predictions?

    Boeing mentioned that they only consider a machine learning component for the bigger system, if besides its output (e.g., a label with a confidence) it also produces a second output indicating its competence for the current image or object. Interesting thought!

    The workshop website is here.

  • Uncertainty of Predictions

    Knowing the uncertainty of a prediction by a deep learning model is key, if you want to base decisions on it. Especially in an open world, where not everything can be controlled or known beforehand.

    We have modeled prototypes that are maximally apart, so the unknown samples have maximum likelihood of falling in the in-between spaces. Indeed, out-of-distribution samples have a larger distance to the prototypes. The distance is a good proxy of uncertainty.

    This work was published at the NeurIPS workshop on Deployable Decision Making (DDM). The illustration shows image segmentation and the uncertainty of the predictions per pixel.

    This research is part of TNO’s Appl.AI program. It is a collaboration of Intelligent Imaging and Automotive department (Helmond).

  • Hypergraph Prediction

    Predicting graphs is becoming popular, as many problems are structural and can be expressed as a graph. For instance, relations between objects in a scene. Often, the relations are multi-way, beyond standard graphs with edges that connect two nodes. Connecting multiple nodes requires hypergraph prediction. We propose to do this in a recurrent way, at each step refining the predicted hypergraph.

  • Using domain knowledge to improve artificial intelligence

    On Friday, I gave a talk at the European Big Data Forum, for 135 people. My talk was about using domain knowledge to improve AI:

    One of the challenges in applying artificial intelligence, and deep learning in particular, is a lack of representative data and/or labels. In some extreme cases, no labeled data samples are available at all. Besides the problem of data, applied models often have a lack of understanding the application’s context. Our research focuses on bridging that gap, such that deep learning can be applied with few labels, and that learned models have more context awareness.

    In this talk, I will show why this is important, and some approaches that may solve some of these issues. A key topic is to leverage domain or expert knowledge in machine learning. This offers a step towards more robust and aware applications of artificial intelligence.

  • Using knowledge and selecting labels in DL: 2 papers accepted

    Our two papers got accepted at the International Conference on Image Analysis and Processing.

    One paper is about Visual Question Answering (VQA). We consider image interpretation by asking textual questions. We extended VQA with knowledge (taxonomy), called Guided-VQA, to enable coarse-to-fine questions. This research is part of the Appl.AI SNOW project.

    The other paper is about DARPA Learning with Less Labels. We propose to select particular images for labeling objects. We show that such selection is better for object detection when having very few labels. We consider only 1-10 labels per class, while standard is to have hundreds per class. For many practical applications few labels are available.

    Soon more information follows.

  • Uncertainty Quantification

    Quantifying uncertainty is a key capability for (semi)autonomous systems, in order to understand when the model is uncertain, e.g., when it encounters something unknown (out of distribution, OOD). For instance, robot SPOT is in a place that it does not know, so it can invoke assistance or go in safe mode.

    Our experiment is illustrated below: three categories of places are known (Home, Work, Public places), and one category is unknown: Shops (seen only at test time).

    We have researched prototypes for this purpose. The paper is published at a NeurIPS workshop about uncertainty.

    Prototypes are a set of vector representations, one for each class. We propose equidistant hyperspherical prototypes that are maximally separated, such that OOD samples fall in between. We show that in practice this indeed happens. The distance to such a prototype is a good quantification of uncertainty.
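
    The distance-based uncertainty can be sketched in a few lines of NumPy. The score below, 1 minus the cosine similarity to the nearest prototype, is an assumed choice for illustration. The function name `uncertainty` and the three 2-D prototypes are hypothetical.

```python
import numpy as np


def uncertainty(embeddings, prototypes):
    """Cosine distance to the nearest prototype: 0 means 'on a prototype'."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return 1.0 - (z @ p.T).max(axis=1)


# Three equidistant prototypes on the unit circle (toy example).
s = np.sqrt(3.0) / 2.0
prototypes = np.array([[0.0, 1.0], [-s, -0.5], [s, -0.5]])

known = np.array([[0.0, 2.0]])      # points exactly at the first prototype
unknown = np.array([[0.0, -1.0]])   # falls in between the other two prototypes

print(uncertainty(known, prototypes), uncertainty(unknown, prototypes))
```

    A sample that lands in between the prototypes gets a clearly higher score than one that lands on a prototype, so a threshold on this score can flag OOD samples.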

    The figure shows an example of 4 classes and their prototype in 3D, with learned projections of samples (dots) and the samples that are considered to be outliers (squares).

  • Our paper is accepted at NeurIPS!

    We want Deep Learning models that generalize beyond previous observations. Someone who has never seen a zebra before, could nevertheless recognize one when we tell them it looks like a horse with black and white stripes. Standard Deep Learning cannot do that.

    In DARPA’s Learning with Less Labels, together with University of Twente, we have developed the ProtoProp model. Our model is compositional: it recombines earlier knowledge.

    The paper is accepted at NeurIPS.

    (read more by clicking the post)

  • Scenes as Sets of Objects

    In one of our projects, a robot is searching rooms for persons, in a search and rescue setting. The person may have fallen behind a couch, and therefore be invisible. Can we infer whether it is likely that a person is in the current room, based on the observed objects? Quite well, see the illustrations below!

    One challenge is that the number of objects varies from scene to scene. That is why we need to model sets. Read more by clicking on this post.

  • Appl.AI @ TNO

    At TNO, there are many research projects centered around Artificial Intelligence and Deep Learning, from Vision and NLP to Knowledge Graphs. The large research program for Applied AI is called Appl.AI. Appl.AI has a website portal where you can find more information. Hopefully it’s a useful overview and resource!

  • Active Vision for Robotics

    The cool thing about a robot is that it can act to improve its performance. For instance, by getting closer to the object, more details can be perceived, thereby resolving uncertainty. This demo video shows that the robot successfully approaches and confirms the objects of interest, and quickly abandons the other irrelevant objects.

    The robot’s goal is to find human dolls. Sometimes the object is not a doll, but a transformer toy. At first, it is always uncertain, because the object is very small. It decides to get closer to resolve the uncertainty. More detail is helpful to resolve what the object is.

  • Robotics and Deep Learning with SPOT

    Our research on robotics with SPOT was in the newspaper (Telegraaf). The goal is to provide SPOT with more autonomy via context awareness. As part of this awareness, we look into visual question answering (find me the room with a boy and a girl) and objects with part attributes (find the girl with the green sweater).

  • Compositionality of objects and attributes

    Humans are good at compositional zero-shot reasoning: someone who has never seen a zebra before could nevertheless recognize one when we tell them it looks like a horse with black and white stripes.

    Machine learning systems have difficulty with compositionality. We propose a method that uses a graph that captures knowledge about compositions of objects and their attributes, and learns to propagate through the graph, such that unseen classes can be recognized based on their composition of known attributes.

  • Zero-shot Object Localization

    Detecting objects is important and works very well. But to learn a model, many images are required. We research techniques to learn with fewer labeled images. In the extreme: zero labels.

    Our approach is to learn the object by its known parts. For instance, a bicycle is composed of a frame, saddle, handlebar and wheels.

    Because there can be multiple bicycles in the scene, the algorithm needs to reason about the parts and their possible compositions. To that end, we developed a multiple hypotheses method that takes into account multiple criteria about parts, such as their overlaps and relative size.

  • Interactive Video Exploration

    Looking for particular concepts in a large set of videos or images is very relevant with today’s amounts of data. The right way to visualize the data, by plotting similar instances close together, really depends on what you’re looking for. When searching for a particular person, the similarity should be based on people’s appearances, whereas for activities it should be based on motion patterns.

  • NATO Award

    For my contribution to a NATO research group (see post below), I received this beautiful award!

  • Content-based Multimedia Analytics

    For the last years, I was part of a NATO research group about multimedia analytics. The goal was to invent new ways of making use of multiple modalities.

    Our focus was on social media (text, images) and videos. For this work, we received an Excellence Award from NATO.

  • Learning from Simulation and Games

    Sometimes images of objects-of-interest are not available. For instance, rare objects such as a tank. Can we train an object detection model on simulated images? Yes, we can! A model trained on the game GTA-5 detected these objects in YouTube footage.

  • Competence in (un)known environments

    A video of our paper at ACM/IEEE Human Robot Interaction is now available on YouTube. The topic of the paper is: ‘Is my AI competent here?’ - for robots that move around and may get into unexpected situations where learned models may fail.

  • Learn objects with just one label?

    Can we learn how to detect an object if we just have one label? This is the key question that we are trying to solve in the DARPA Learning with Less Labels program. During the first year of the program, we performed well on an international benchmark.

    Other teams are key players in AI in USA (e.g., Berkeley) and Australia. It is very nice to collaborate with and learn from these other teams.

  • Localizing Aggression in Videos

    Interpretation of human behavior from video is essential for human-machine interaction, public safety and more. Together with the national police and the city of Arnhem, we researched the feasibility of aggression detection during nightlife.

  • Periodic Motion in Video

    I was honored to be part of the PhD committee of Tom Runia, about periodic motion in video. At the University of Amsterdam, with promotors prof. Cees Snoek and Arnold Smeulders. Periodic motion is important as it appears in many activities such as sports, working and cooking.

  • Transformers for Vision

    Transformers are becoming popular for vision tasks, including image classification, object detection and video classification. In this presentation, I explain Transformers and how they are applied to Vision tasks.

  • Detecting small objects by their motion

    When objects appear very small in images, standard object detection has difficulties. This is because it relies on the object’s texture, which becomes invisible at such small scales. Detecting small objects based on their motion is a better alternative. See the image below, where a standard detector fails (top row) and a motion-based detector (bottom row) detects the small objects.
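
    The simplest form of motion-based detection is frame differencing, sketched below. This is a generic baseline under assumptions, not the detector shown in the image. The function name `moving_pixels`, the threshold, and the toy frames are hypothetical, and a static camera is assumed.

```python
import numpy as np


def moving_pixels(prev_frame, frame, threshold=0.1):
    """Boolean mask of pixels whose intensity changed by more than the threshold."""
    diff = np.abs(frame.astype(float) - prev_frame.astype(float))
    return diff > threshold


# Toy grayscale frames: a single-pixel object moves one position to the right.
prev_frame = np.zeros((8, 8))
frame = np.zeros((8, 8))
prev_frame[2, 2] = 1.0
frame[2, 3] = 1.0

mask = moving_pixels(prev_frame, frame)
print(np.argwhere(mask))  # old and new location of the object
```

    Even a one-pixel object, which has no usable texture, shows up in the mask because it moved.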

  • Patent granted: Sequence Finder

    The innovation is called Sequence Finder. This algorithm finds temporal patterns (e.g., making a coffee) by propagating confidences (not binary!) about shorter-term activities (e.g., taking a cup, pouring milk, turning on the coffee machine).

  • AI says: am I competent here?

    AI agents such as robots are often not aware of their own competence under varying situations. Yet it is critically important to know whether the AI can be trusted for the current situation. We propose a method that enables a robot to assess the competence of its AI model.

  • Weighting evidence by trustworthiness

    Our aim is to increase the autonomy of the SPOT robot. Our research covers perception, planning and self management. For perception, one of the challenges is to understand objects, even when they are novel or not seen before. Novel objects are detected by recognizing and combining their parts.

    But not all parts are equally discriminative or robust. This post is about combining evidence in a more principled way, by taking the evidence for each part into account.

    The coloring of the nodes in the part-hierarchy shows that some parts are more trustworthy (e.g., head) than others (e.g., lower arm).

  • Novel object detection by a hierarchy of known parts

    Imagine a robot that can help with search and rescue tasks, to find victims, in situations that are dangerous to humans. In the SNOW project, we aim to endow the SPOT robot dog with such a capability. Key is to recognize victims, even if they are only partially visible. For instance, a person may have fallen behind a bed. This perception is done by a SNOW algorithm, called HITCH (Hierarchical Task-specific Characterization).

  • Novelties and Anomalies as Outliers

    Spotting the statistical outliers is relevant for detection of novelties (e.g., a yet unknown object) and anomalies (e.g., a production error).

    Suppose that you have only seen images of cars. Cars are the inliers. Now you want to know when objects are different from cars. These are the outliers.
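
    A minimal sketch of this idea: describe the inliers by simple statistics of their feature vectors, and score a new sample by how far it deviates from them. The per-dimension z-score used here is an assumed, basic choice for illustration. The function names and the random "car" features are hypothetical.

```python
import numpy as np


def fit_inliers(features):
    """Summarize the inliers (e.g., car features) by per-dimension mean and std."""
    return features.mean(axis=0), features.std(axis=0) + 1e-8


def outlier_score(x, mean, std):
    """Root-mean-square z-score: larger means less similar to the inliers."""
    return np.sqrt(np.mean(((x - mean) / std) ** 2, axis=-1))


rng = np.random.default_rng(0)
car_features = rng.normal(size=(500, 4))   # stand-in for features of car images
mean, std = fit_inliers(car_features)

typical = mean                              # a sample right at the inlier center
different = np.full(4, 10.0)                # a sample far away from all cars

print(outlier_score(typical, mean, std), outlier_score(different, mean, std))
```

    A threshold on this score then separates inliers from outliers.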

  • Zero-shot object recognition

    Suppose that you want to recognize an object, but you don’t have any images of that object. Standard deep learning will fail without training samples. Now suppose that you have knowledge about its parts. Often, images of (everyday) parts are available. We have developed a technique, ZERO, to recognize unseen objects by analyzing their parts.

  • Spot What Matters

    Our paper ‘Spot What Matters: Learning Context Using Graph Convolutional Networks for Weakly-Supervised Action Detection’, is accepted for the International Workshop on Deep Learning for Human-Centric Activity Understanding (part of International Conference on Pattern Recognition).

  • Paper accepted at ICLR 2021

    Our paper on Set Prediction was accepted at the International Conference on Learning Representations (ICLR 2021). This research is about predicting sets, such as reconstructing point clouds or identifying the relevant subset from a large set of samples.
