Using very high-resolution images to train deep neural networks
The human bottleneck
At Earthcube, we automatically monitor dozens of sites around the globe to give our clients a continuous live feed that is relevant to their business. Thanks to computer vision, we can track constructions, vehicles, changes and most of the relevant features you can find on satellite or UAV images.
Unsurprisingly, one of the key technologies we use for object detection is deep learning. But with every deep learning application comes… a dataset. You can have the best team you want, but a data scientist is only as good as the data they can access.
Most AI-based imaging services can use structured or unstructured data surrounding their images to label them, at least weakly. For an application based on an Instagram image feed, it could be text comments or user tags. For a medical imaging application, it could be annotations from a physician or diagnostic information. But with satellite images, if the object you are looking for is not on a map, you have nothing.
You must rely on manual annotation of massive quantities of images. This is painful, slow and costly. In fact, it is the main obstacle to machine learning adoption in the remote sensing community. Images used to be rare and expensive, and labeling required skilled photo analysts and expensive tools, both available almost exclusively in the armed forces.
Thanks to new constellations such as those operated by Planet or DigitalGlobe and to new user-friendly GIS tools, remote sensing image interpretation is more accessible than ever, and it is now possible for a startup to build relevant datasets at scale. However, the labeling process is still painful and costly, and it is limited by the human effort you can put in. And when you are building a large-scale AI-based service, you definitely don’t want to have a human bottleneck!
Several strategies can reduce the amount of human labeling needed. A simple and very efficient one is to leverage the information content of very high-resolution (VHR) images through simple pre-training, a process also known as transfer learning.
For readers unfamiliar with AI jargon, transfer learning is the process of improving learning on one task by reusing knowledge already acquired on a different task. It is a basic but very effective tool.
For any data scientist working with remote sensing, the question of transfer is critical. Some providers such as Planet or DigitalGlobe are opening partnerships that give access to their archives for R&D or product development purposes. Several big public datasets are also accessible, such as SpaceNet, fMoW and soon xView. However, these archives and datasets may not match the resolution, observables or locations needed for a given business case, and some images may still lack labels or be labeled for a different type of task.
As a company aiming to scale as fast as possible, we need to be able to leverage all accessible data to improve performance and speed up product development.
The power of very high resolution
Transfer learning works, and this has been known for years. The big question for us was how resolution influences the pre-training process. Our hypothesis was that transfer learning should work better when the pre-training is done on higher-resolution images.
Why? Because deep neural networks perform hierarchical feature extraction from raw images. Feature maps in the deep hidden layers contain more global, higher-level information, whereas layers close to the input contain lower-level feature maps. A neural network trained on 30 cm images could therefore already contain the layers needed for lower-resolution images, while the opposite may not be true.
The test consisted of training a classification network on VHR images, then extracting some relevant network layers to serve as the base for a second training, this time on segmentation using HR images (<1m).
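The layer-extraction step can be pictured with a minimal Python sketch. It is not the code used for this experiment: the layer names (encoder.conv1, head.mask), the tiny weight matrices and the helper transfer_layers are all hypothetical, and real networks would be handled by a deep learning framework. The sketch only shows the principle, assuming that layers sharing a name and a shape can be copied from the pre-trained classifier into the new segmentation network while the task-specific head is left untouched.

```python
import copy

def shape(weights):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(weights, list):
        dims.append(len(weights))
        if not weights:
            break
        weights = weights[0]
    return tuple(dims)

def transfer_layers(pretrained, target, prefix="encoder."):
    """Copy layers whose name starts with `prefix` and whose shape matches.

    Returns the names of the layers that were copied into `target`.
    """
    transferred = []
    for name, weights in pretrained.items():
        same_layer = name.startswith(prefix) and name in target
        if same_layer and shape(weights) == shape(target[name]):
            target[name] = copy.deepcopy(weights)
            transferred.append(name)
    return transferred

# Hypothetical classifier, assumed to be already trained on VHR images.
classifier = {
    "encoder.conv1": [[0.2, -0.1], [0.4, 0.3]],
    "encoder.conv2": [[0.5, 0.1], [-0.2, 0.7]],
    "head.classes": [[0.9, -0.9]],
}

# Hypothetical segmentation network, freshly initialised.
segmenter = {
    "encoder.conv1": [[0.0, 0.0], [0.0, 0.0]],
    "encoder.conv2": [[0.0, 0.0], [0.0, 0.0]],
    "head.mask": [[0.0, 0.0], [0.0, 0.0]],
}

moved = transfer_layers(classifier, segmenter)
print(moved)
```

Only the shared feature-extraction layers are reused here. The segmentation head keeps its fresh initialisation and is learned during the second training.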
We then compared training time and performance between learning segmentation from scratch and transferring from classification to segmentation. The training process is summed up below, with the segmentation training from scratch in blue and with transfer learning in red.
Not only does pre-training allow the network to converge much faster (more than 10 times fewer epochs), it also gives much better results in terms of loss and accuracy! However, if we try to train a segmentation network on 30 cm images using transfer learning from a lower resolution, the results are less conclusive. Overall performance does not improve, and convergence time is reduced “only” by a factor of 3.
This process is very useful for speeding up training and increasing performance, and we discovered that the same conclusion holds for the size of the training set: compared to training “from scratch”, using VHR images to pre-train a network reduces the volume of training data needed to achieve the same performance by up to an order of magnitude, even on lower-resolution images. Once again, the opposite does not seem to be as efficient. Our assumption appears to be confirmed: transfer from a higher resolution may be more efficient.
Looking at image generation
At Earthcube we like to use image generators such as autoencoders or GANs for feature mining and image similarity measurement, both useful for tracking changes or activity and for comparing images. They are also wonderful ways to get intuitive results on how a network architecture can “understand” an image. We trained two autoencoders, one on DigitalGlobe 30 cm images and one on Planet 4 m Dove images, with the same amount of “effort” (similar training set size and training time) to get comparable results and examine a different kind of transfer learning. The two training sets were deliberately chosen to be very different (a harbor vs an airport). The training process was pretty simple, with a limited number of epochs and limited fine-tuning.
If you are not familiar with machine learning, autoencoders are simply neural-network-based compression-decompression algorithms. They learn a constrained set of latent variables describing an image (the encoding process), then learn to use these variables to reconstruct the original input (the decoding process). They very often use convolutional layers and build sets of feature maps to find the “best” (or most salient) description of an image given the constraint. The user limits the number of latent variables to force the algorithm to compress the data and thus prevent it from learning a basic identity function.
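To make the compression idea concrete, here is a deliberately tiny sketch in plain Python. It is an illustration under strong assumptions, not the architecture we trained: the "image" has only two values, there is a single latent variable, both encoder and decoder are linear, and the names train and reconstruct are hypothetical. Because the two input values are correlated (the second is always twice the first), one latent variable is enough to reconstruct both, which is the kind of structure the bottleneck forces the network to discover.

```python
def reconstruct(x, enc, dec):
    """Encode a 2-value input into one latent variable, then decode it."""
    z = enc[0] * x[0] + enc[1] * x[1]   # encoding: 2 values -> 1 latent
    return [dec[0] * z, dec[1] * z]     # decoding: 1 latent -> 2 values

def mean_squared_error(data, enc, dec):
    """Average squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        x_hat = reconstruct(x, enc, dec)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)

def train(data, steps=2000, lr=0.01):
    """Fit encoder and decoder weights by plain gradient descent."""
    enc, dec = [0.1, 0.1], [0.1, 0.1]   # small fixed initial weights
    for _ in range(steps):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z = enc[0] * x[0] + enc[1] * x[1]
            err = [dec[i] * z - x[i] for i in range(2)]
            dz = err[0] * dec[0] + err[1] * dec[1]   # error sent back to z
            for i in range(2):
                g_dec[i] += 2 * err[i] * z / len(data)
                g_enc[i] += 2 * dz * x[i] / len(data)
        enc = [enc[i] - lr * g_enc[i] for i in range(2)]
        dec = [dec[i] - lr * g_dec[i] for i in range(2)]
    return enc, dec

# Toy dataset: every sample lies on the line (t, 2t).
data = [[t, 2 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

loss_before = mean_squared_error(data, [0.1, 0.1], [0.1, 0.1])
enc, dec = train(data)
loss_after = mean_squared_error(data, enc, dec)
print(loss_before, loss_after)
```

A real image autoencoder replaces the two weight vectors with stacks of convolutional layers, but the constraint is the same: fewer latent variables than input values.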
The original idea here was pretty simple: if a neural network trained on VHR images contains the feature maps needed for lower-resolution images, an autoencoder trained on DigitalGlobe images should be able to regenerate Dove images with fair performance.
We immediately realized that the autoencoder trained on VHR images converged much more easily, with very good reconstruction results after only a few epochs.
But the final result was unexpected. The DigitalGlobe-trained autoencoder was, in fact, better at regenerating Dove images than the Dove-trained autoencoder!
It seemed so weird that we retrained the Dove autoencoder three times to make sure the results were consistent.
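One simple way to put a number on this kind of comparison is a pixel-wise reconstruction error. The sketch below is only an assumption about how such a comparison could be scored, since this article does not state which metric was used: it computes the mean squared error and the peak signal-to-noise ratio (PSNR) between an original tile and two reconstructions. The tile values and the names recon_a and recon_b are made up for illustration.

```python
import math

def mse(original, reconstructed):
    """Mean squared error between two images given as lists of rows."""
    total, count = 0.0, 0
    for row_o, row_r in zip(original, reconstructed):
        for o, r in zip(row_o, row_r):
            total += (o - r) ** 2
            count += 1
    return total / count

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means a closer match."""
    error = mse(original, reconstructed)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)

# Made-up 2x2 tile and two hypothetical reconstructions of it.
tile = [[10, 20], [30, 40]]
recon_a = [[11, 19], [30, 42]]   # small pixel errors
recon_b = [[14, 16], [34, 36]]   # larger pixel errors

print(psnr(tile, recon_a), psnr(tile, recon_b))
```

Scoring both autoencoders on the same held-out tiles with the same metric is what makes the comparison meaningful.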
Of course, this did not mean that we could not get very good results with Dove images if we put more effort into it, especially in terms of training time and fine-tuning. But our assumption seemed confirmed. Training on very high-resolution images allowed the network to learn the necessary salient feature maps and produced a satisfactory compression-decompression algorithm for almost any type of high-resolution satellite image.
Moreover, neural networks trained on higher-resolution images seemed to generalize better than networks trained on lower-resolution images, and seemed to converge much faster to better performance. All this could be explained by the spatial hierarchy of features that deep neural networks build.
In the end, the similarity with the human photo-interpretation process is quite stunning. A beginner will always learn faster if trained on higher-resolution images and will then be able to generalize their learning more efficiently. When working on small targets, such as cars or small buildings, it is always very useful to start with an image whose resolution is higher, to “learn” the observables or decipher false positives before switching to a lower resolution.
For a commercial use case, you will want to use the resolution that best suits your application, taking advantage of higher revisit rates or even lower costs if you do not need VHR images. However, if you want to build an automatic pipeline, adding higher-resolution images to your training set (for example, mixing Dove and SkySat images) can drastically reduce the amount of labeling you need while enabling you to deliver a higher-quality product.
Renaud Allioux, CTO and Co-Founder at Earthcube.
Get the full (satellite) picture!