Recent progress and breakthroughs in computer vision are becoming increasingly visible thanks to the constant evolution of ICT applications in our daily lives. Concepts like Industry 4.0 and the Internet of Things, where integration of the virtual and physical worlds is an objective in itself, together with the services we habitually use on the internet (banking, e-commerce, social media, eLearning, etc.), are driving the ongoing application of related technologies. In all these fields computer vision has not only a promising future but an exponentially expanding range of applications, as new services crop up in which to apply it.
Examples such as driverless cars, drone-based quality control, classification and recognition, traffic cameras that recognize diverse driver-behavior patterns, virtual reality, or news of Amazon's intention to patent pay-by-selfie technology all show the wide range of potential applications of computer vision.
The different approaches to this challenge include detection of objects, extraction of certain visual characteristics and their subsequent association, the use of classifiers together with machine learning techniques for decision-making, and automatic image annotation and tag recommendation.
These ongoing computer vision developments center not only on image acquisition, processing and analysis (recognition of objects and decision-making) but also on a better understanding of images, representing a quantum leap in the approach to the problem. Understanding of images depends to some extent on cognition, which encompasses tasks involving not only recognition but also reasoning about the image or set of images at hand; to achieve this, the models need to comprehend the interactions and relations between the objects making up those images.
To make headway in this “understanding” of images it is necessary to build complex systems that tackle the problem in different phases. Use here is being made of Deep Learning, a branch of machine learning and AI based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple sigmoid (non-linear) functions. In general, unsupervised training is used in a first phase, followed by supervised training in a second. The overall aim of this approach is to seek learning at various levels of abstraction that, taken together, produce understanding. The use of sigmoid functions enables complex, non-linear problems to be solved. Deep Learning techniques have been under study for some time; they grew from concepts such as the Neocognitron, a multi-layer, hierarchical neural network proposed by Kunihiko Fukushima in the 1980s. It is a model that has adapted very well to pattern recognition, and research is still underway on its use in cognition, driven by improvements in computation and the exponential increase in network-training use cases.
There are various architectures for implementing a Deep Learning solution. The commonest, and one of the most important, involves neural networks. As a first approach, multi-layer neural networks are used; these model the learning process by means of diverse sigmoid functions in various layers, where layer k computes an output vector h^k using the output h^(k-1) of the previous layer, starting with the input x = h^0. In all layers except the input layer, a certain operation called the activation function is carried out, one of whose aims is to keep the processing result of that layer within the estimated range. An alternative to multi-layer neural networks is the so-called autoencoder, a neural network, or more strictly an unsupervised learning algorithm, possessing very few layers, normally two or three. An autoencoder can be trained (typically by backpropagation) to encode the input x into some representation c(x) so that the input can be reconstructed from that representation.
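The layer-by-layer computation described above can be sketched in a few lines. This is a minimal, illustrative forward pass (the network sizes and random weights are arbitrary choices for the example, not taken from any particular system):

```python
import numpy as np

def sigmoid(z):
    # Squashes each component into (0, 1), keeping each layer's output bounded
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute h^k = sigmoid(W^k h^(k-1) + b^k), starting from h^0 = x."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)
    return h

rng = np.random.default_rng(0)
# A toy 4-3-2 network: input of size 4, one hidden layer of 3, output of 2
weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
biases = [np.zeros(3), np.zeros(2)]

x = rng.standard_normal(4)       # h^0, the input
y = forward(x, weights, biases)  # h^2, the final layer's output
```

Each call to `sigmoid` plays the role of the activation function mentioned above, keeping every layer's output within (0, 1).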
Sequential use of autoencoders generates the architecture known as “Stacked Autoencoders”, where another self-learning autoencoder is applied to the hidden-layer result, and so on; although the architecture gets more complicated, the result is learning with more complex characteristics. For self-learning purposes, greedy algorithms are normally used. These heuristically choose the best option in each layer with the aim of obtaining an optimal overall solution (such algorithms have also produced very good results in genome sequencing, where graph theory is applied to reassemble and sequence DNA chains from the small fragments generated by chemical processes).
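A minimal sketch of this greedy layer-wise scheme follows: each autoencoder (sigmoid encoder, linear decoder) is trained on the hidden codes produced by the previous one. The training loop, layer sizes and learning rate here are illustrative assumptions, not a production recipe:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.1, epochs=200, seed=0):
    """Train one autoencoder by gradient descent (backpropagation)
    on the squared reconstruction error 0.5 * ||x_hat - x||^2."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = 0.1 * rng.standard_normal((n_hidden, n_in))  # encoder weights
    b1 = np.zeros(n_hidden)
    W2 = 0.1 * rng.standard_normal((n_in, n_hidden))  # decoder weights
    b2 = np.zeros(n_in)
    for _ in range(epochs):
        for x in X:
            h = sigmoid(W1 @ x + b1)            # encode: c(x)
            x_hat = W2 @ h + b2                 # decode: reconstruction
            d_out = x_hat - x                   # output-layer error
            d_h = (W2.T @ d_out) * h * (1 - h)  # backpropagated error
            W2 -= lr * np.outer(d_out, h); b2 -= lr * d_out
            W1 -= lr * np.outer(d_h, x);   b1 -= lr * d_h
    return lambda X: sigmoid(X @ W1.T + b1)     # the learned encoder

def greedy_pretrain(X, layer_sizes):
    """Greedy layer-wise pretraining: each autoencoder learns
    from the codes produced by the one before it."""
    encoders, H = [], X
    for n_hidden in layer_sizes:
        enc = train_autoencoder(H, n_hidden)
        encoders.append(enc)
        H = enc(H)  # hidden codes become the next layer's input
    return encoders, H

rng = np.random.default_rng(1)
X = rng.random((20, 6))  # toy dataset: 20 samples, 6 features
encoders, codes = greedy_pretrain(X, [4, 2])
```

After pretraining, `codes` holds the deepest (here 2-dimensional) representation of each sample, and the stack of encoders could be fine-tuned end to end with a supervised objective.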
Other architectures are those known as “Restricted Boltzmann Machines”, which are generative neural networks whose learning includes probability distributions over a set of inputs. Closely related to these are Deep Belief Networks, neural networks that also use autoencoders; these are probabilistic generative models comprising multiple layers in which the weights of the hidden-layer neurons are initialized randomly with binary patterns.
But testing all of these architectures, and building image-based cognitive learning systems from them, increasingly calls for robust, tried-and-tested large datasets for supervised learning. One example of a system of this type is Visual Genome.
Visual Genome is a dataset for modelling relationships between the diverse objects in an image. It stores annotations of the objects, attributes and relations existing in an image in order to learn these models. Each image has on average around 21 objects, 18 attributes and 18 pairwise relationships between objects.
Understanding of scenes will facilitate the development of applications such as image search, question answering and robotic interactions.
Visual Genome consists of a dataset, a knowledge base and a continual effort to connect structured image concepts with language. It allows a multi-perspective study of an image, from pixel-level information such as objects, to relations requiring inference, and even deeper cognitive tasks such as question answering. It is a broad dataset for training and benchmarking the new generation of computer vision models. With Visual Genome it is hoped that these models will develop a broader understanding of the visual world, complementing computers' object-detecting capacities with the ability to describe those objects and explain their interactions and relations. Visual Genome is a major formal representation of knowledge for visual understanding and a set of descriptors for translating visual concepts into language.
Visual Genome’s dataset is made up of seven main components: region descriptions, objects, attributes, relationships, region graphs, scene graphs and question-answer pairs. To make further headway in research into the comprehensive understanding of images, the first step is the collection of descriptions and question-answer pairs as raw text, without any restrictions on length or vocabulary. The next step is to extract objects, attributes and relationships to build scene graphs representing a formal description of each image. This work is being done by means of a crowdsourcing platform designed to feed the database. The chosen platform for this task is Amazon Mechanical Turk (AMT), a commercial marketplace for simple, per-unit-priced tasks requiring human intervention, which connects workers with requesters, in this case Visual Genome. This system ensures the database is continually fed, filling in the information for each image.
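To make the components above concrete, here is a hypothetical annotation record for one image. The field names and layout are illustrative only, not Visual Genome's actual schema, but they show how regions, objects, attributes, relationships and question-answer pairs fit together into a queryable scene graph:

```python
# Hypothetical annotation for a single image (illustrative schema)
annotation = {
    "image_id": 1,
    "regions": [
        {"description": "a man riding a horse", "bbox": [40, 60, 200, 180]},
    ],
    "objects": [
        {"id": "o1", "name": "man", "attributes": ["smiling"]},
        {"id": "o2", "name": "horse", "attributes": ["brown"]},
    ],
    "relationships": [
        # Pairwise relation: subject --predicate--> object
        {"subject": "o1", "predicate": "riding", "object": "o2"},
    ],
    "qa_pairs": [
        {"question": "What is the man doing?", "answer": "Riding a horse."},
    ],
}

# A simple scene-graph query: which relations does object o1 participate
# in as the subject? This is the kind of inference the graph enables.
rels = [r for r in annotation["relationships"] if r["subject"] == "o1"]
```

Because relations are stored as explicit subject-predicate-object triples rather than free text, a model (or a simple query like the one above) can reason about interactions between objects, which is what distinguishes scene graphs from plain captions.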
Visual Genome is an evolution of an earlier project called ImageNet, a large hierarchical image database (over a million images to date); the images are content-tagged so that each node of the hierarchy is represented by hundreds or thousands of images. The ImageNet Large Scale Visual Recognition Challenge is now held every year, and the competitors normally use, as might be expected from all the above, Deep Learning neural networks.
Systems similar to Visual Genome include Microsoft Common Objects in Context (COCO), an image object recognition database. Google, Facebook and others run similar initiatives, such as MetaMind, which works on automatic image recognition using machine learning and Big Data to produce natural-language descriptions of the physical world. Another, possibly closer-to-hand example is Snapchat, which uses Deep Learning with more of a fun-based approach. All this gives an idea of the importance of these techniques in terms of their capacity to generate future products and services.
Cognition: the mental action or process of acquiring knowledge and understanding through thought, experience, and the senses. It encompasses processes such as knowledge, attention, memory and working memory, judgment and evaluation, reasoning, problem solving and decision making, comprehension and production of language, etc. Cognitive processes use existing knowledge and generate new knowledge.
Author: Miguel Hormigo Ruiz
The author’s views are entirely his own and may not reflect the views of GMV