
Ten years in: Deep learning changed computer vision, but the classical elements still stand

 

Computer Vision (CV) has evolved rapidly in recent years and now permeates many areas of our daily life. To the average person, it might seem like a new and exciting innovation, but this isn’t the case. 

CV has actually been evolving for decades, with studies in the 1970s forming the early foundations for many of the algorithms in use today. Then, around 10 years ago, a technique that was still largely theoretical appeared on the scene: deep learning, a form of AI that uses neural networks to solve incredibly complex problems, provided you have the data and computational power for it.

As deep learning continued to develop, it became clear that it could solve certain CV problems extremely well. Challenges like object detection and classification were especially ripe for the deep learning treatment. At this point, a distinction began to form between “classical” CV, which relied on engineers’ ability to formulate and solve mathematical problems, and deep learning-based CV.

Deep learning didn’t render classical CV obsolete; both continued to evolve, shedding new light on which challenges are best solved through big data and which should continue to be solved with mathematical and geometric algorithms.

 

Limitations of classical computer vision

Deep learning can transform CV, but this magic happens only when appropriate training data is available, or when known logical or geometric constraints allow the network to enforce the learning process on its own.

In the past, classical CV was used to detect objects, identify features such as edges, corners and textures (feature extraction) and even label each pixel within an image (semantic segmentation). However, these processes were extremely difficult and tedious.

Detecting objects demanded proficiency in sliding windows, template matching and exhaustive search. Extracting and classifying features required engineers to develop custom methodologies. Separating different classes of objects at a pixel level entailed an immense amount of work to tease out different regions, and even experienced CV engineers weren’t always able to classify every pixel in the image correctly.
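To make the sliding-window idea concrete, here is a minimal sketch of exhaustive template matching in plain Python. It is an illustration under stated assumptions, not a production detector: the function name `match_template`, the tiny grayscale grids and the sum-of-squared-differences score are all choices made for this example.

```python
def match_template(image, template):
    """Slide `template` over every position in `image` (exhaustive search).

    Both arguments are 2D lists of grayscale values (an assumption made
    for this sketch). Returns (row, col, score) of the best window, where
    score is the sum of squared differences and lower means more similar.
    """
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])
    best = None
    for r in range(img_h - tpl_h + 1):
        for c in range(img_w - tpl_w + 1):
            # Compare the template with the window whose top-left corner is (r, c).
            ssd = sum(
                (image[r + i][c + j] - template[i][j]) ** 2
                for i in range(tpl_h)
                for j in range(tpl_w)
            )
            if best is None or ssd < best[2]:
                best = (r, c, ssd)
    return best


# Hypothetical 4x5 image containing one bright 2x2 patch.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 9, 0],
    [0, 0, 0, 0, 0],
]
template = [[9, 8], [7, 9]]

best_match = match_template(image, template)
print(best_match)  # (row, col, score) of the best-matching window
```

Even this toy version hints at the cost of the classical approach: every window is scored against a handcrafted template, and a change in scale, rotation or lighting would call for another template or another rule.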

 

 

Deep learning transforming object detection

In contrast, deep learning, specifically convolutional neural networks (CNNs) and region-based CNNs (R-CNNs), has made object detection fairly mundane, especially when paired with the massive labeled image databases of behemoths such as Google and Amazon. With a well-trained network, there is no need for explicit, handcrafted rules, and the algorithms are able to detect objects under many different circumstances regardless of angle.

In feature extraction, too, deep learning requires only a competent algorithm and diverse training data, both to prevent overfitting and to reach high enough accuracy on new data once the model is released to production. CNNs are especially good at this task. In addition, when deep learning is applied to semantic segmentation, the U-Net architecture has shown exceptional performance, eliminating the need for complex manual processes.

 

Going back to the classics

While deep learning has doubtless revolutionized the field, when it comes to particular challenges addressed by simultaneous localization and mapping (SLAM) and structure from motion (SFM) algorithms, classical CV solutions still outperform newer approaches. These concepts both involve using images to understand and map out the dimensions of physical areas.

SLAM is focused on building and then updating a map of an area, all while keeping track of the agent (typically some type of robot) and its place within the map. It is a key enabler of autonomous driving, as well as of robotic vacuums.

SFM similarly relies on advanced mathematics and geometry, but its goal is to create a 3D reconstruction of an object using multiple views that can be taken from an unordered set of images. It is appropriate when there is no need for real-time, immediate responses. 

Initially, it was thought that massive computational power would be needed for SLAM to be carried out properly. However, by using close approximations, CV forefathers were able to make the computational requirements much more manageable.

SFM is even simpler: Unlike SLAM, which usually involves sensor fusion, the method uses only the camera’s intrinsic properties and the features of the image. This is a cost-effective method compared to laser scanning, which in many situations is not even possible due to range and resolution limitations. The result is a reliable and accurate representation of an object.
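The geometric core of this idea can be sketched in a few lines. The example below is a simplified illustration, not a full SFM pipeline: it assumes a pinhole camera whose intrinsics (focal length `f`, principal point `cx`, `cy`) are known, two views with identical orientation, and camera positions that are already known, whereas a real pipeline also has to estimate the camera poses from matched features. The function names and all numeric values are hypothetical.

```python
def pixel_to_ray(u, v, f, cx, cy):
    """Turn a pixel into a viewing-ray direction using pinhole intrinsics."""
    return ((u - cx) / f, (v - cy) / f, 1.0)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(o1, d1, o2, d2):
    """Return the midpoint of the shortest segment between two rays.

    Each ray is given by an origin `o` (camera center) and a direction `d`.
    """
    w = tuple(p - q for p, q in zip(o1, o2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth cannot be recovered")
    s = (b * e - c * d) / denom  # distance parameter along ray 1
    t = (a * e - b * d) / denom  # distance parameter along ray 2
    p1 = tuple(o + s * k for o, k in zip(o1, d1))
    p2 = tuple(o + t * k for o, k in zip(o2, d2))
    return tuple((x + y) / 2 for x, y in zip(p1, p2))


# Assumed intrinsics shared by both views (illustrative values only).
f, cx, cy = 800.0, 320.0, 240.0

# The same feature observed in two images taken one unit apart along x.
ray1 = pixel_to_ray(420.0, 280.0, f, cx, cy)
ray2 = pixel_to_ray(220.0, 280.0, f, cx, cy)
point = triangulate_midpoint((0.0, 0.0, 0.0), ray1, (1.0, 0.0, 0.0), ray2)
print(point)  # estimated 3D position of the feature
```

With these inputs the two rays intersect, so the estimate lands at roughly (0.5, 0.2, 4.0) in the first camera's coordinate frame. No training data is involved: the result follows directly from the camera model and the geometry of the two views.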

 

 

 

The road ahead

There are still problems that deep learning cannot solve as well as classical CV, and engineers should continue to use traditional techniques to solve them. When complex math and direct observation are involved and a proper training data set is difficult to obtain, deep learning is too powerful and unwieldy to generate an elegant solution. The analogy of the bull in a china shop comes to mind here: In the same way that ChatGPT is certainly not the most efficient (or accurate) tool for basic arithmetic, deep learning is not the right tool for every CV problem, and classical CV will continue to dominate specific challenges.

This partial transition from classical to deep learning-based CV leaves us with two main takeaways. First, we must acknowledge that wholesale replacement of the old with the new, although simpler, is wrong. When a field is disrupted by new technologies, we must take care to pay attention to detail and identify, case by case, which problems will benefit from the new techniques and which are still better suited to older approaches.

Second, although the transition opens up scalability, there is an element of bittersweetness. The classical methods were indeed more manual, but this meant they were also equal parts art and science. The creativity and innovation needed to tease out features, objects, edges and key elements were not powered by deep learning but generated by deep thinking.

With the move away from classical CV techniques, engineers such as myself have, at times, become more like CV tool integrators. While this is “good for the industry,” it’s nonetheless sad to abandon the more artistic and creative elements of the role. A challenge going forward will be to try to incorporate this artistry in other ways.

 

Understanding replacing learning

Over the next decade, I predict that “understanding” will eventually replace “learning” as the main focus in network development. The emphasis will no longer be on how much the network can learn but rather on how deeply it can comprehend information and how we can facilitate this comprehension without overwhelming it with excessive data. Our goal should be to enable the network to reach deeper conclusions with minimal intervention. 

The next ten years are sure to hold some surprises in the CV space. Perhaps classical CV will eventually be made obsolete. Perhaps deep learning, too, will be unseated by an as-yet-unheard-of technique. However, for now at least, these tools are the best options for approaching specific tasks and will form the foundation of the progression of CV throughout the next decade. In any case, it should be quite the journey.

LeackStat 2023