By Thomas Sitbon for Preligens, 20/06/22
A guide to where to source data and how to describe images for training
In our 2 previous amazing articles: How we built an AI factory — Part 1 and How we built an AI factory — Parts 2&3, we wrote about how we built a proprietary framework used by all AI teams across the company.
To understand the full cycle of creating an algorithm, this article explains two main prerequisites:
What are the data requirements for building AI detectors? — The case of geographic data
Global data diversity
Developing a machine learning algorithm specialized in geographic data requires highly varied data in terms of:
Another requirement is a wide variety of objects in different geographical contexts (for example aircraft in parking areas, on runways or in maintenance areas).
An aircraft can have different shapes, colors or positions, but it still belongs to the same category and must therefore be assigned to the same class.
Complexity of geographical areas and observables of various shapes (Source: 2022 Copyright Maxar Technologies)
Our challenge: detect specific objects on all geographical areas around the globe
The objective is to quickly set up an algorithm that detects a given object in any geographical area of the globe.
To do so, a data scientist needs access to:
Without those tools and this enriched database, it would be very difficult, if not impossible, to find the right images to build an algorithm.
How else would data scientists find the images they need to improve the algorithms they are working on?
In this article, we take the example of building a database of satellite images that covers the most heterogeneous set of aircraft models across the globe.
This training database will be used by ML teams to build best-in-class algorithms.
The tools we use:
First step: Building a geographic database
Where can aircraft be found?
FINDING: Finding the location of observables can be a tedious and time-consuming task. We have chosen to retrieve a large number of images that contain a wide variety of observables.
SELECTING: To select the images containing aircraft, we rely on open source data, for example OurAirports, OSM (OpenStreetMap), etc.
EXTRACTING: The extraction of OSM data is handled by ImpOSM, an importer of OpenStreetMap data that reads the files and loads the data into a PostgreSQL/PostGIS database.
MERGING: We have developed a process for merging OSINT (Open Source Intelligence) data, followed by a data quality control step in which we remove duplicate sites, data entry errors, etc.
As a result, we have created a database of tens of thousands of airports spread across the planet. Each site is located by its latitude and longitude.
How we select sites of interest (Source: Preligens)
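The merging and quality control step can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual Preligens process: the record layout (`name`, `lat`, `lon`), the `merge_sites` helper, the 1 km duplicate threshold and the sample rows are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def merge_sites(sources, max_km=1.0):
    """Merge several site lists, dropping bad coordinates and near-duplicates."""
    merged = []
    for site in (s for source in sources for s in source):
        lat, lon = site["lat"], site["lon"]
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue  # data entry error: coordinates outside valid ranges
        if any(haversine_km(lat, lon, m["lat"], m["lon"]) <= max_km
               for m in merged):
            continue  # duplicate: another source already lists this site
        merged.append(site)
    return merged


# Hypothetical rows standing in for two open data sources.
source_1 = [
    {"name": "Airport A", "lat": 49.0097, "lon": 2.5479},
    {"name": "Airport B", "lat": 490.0, "lon": 2.5},  # latitude typo
]
source_2 = [
    {"name": "Aerodrome A", "lat": 49.0128, "lon": 2.5500},  # same site as Airport A
    {"name": "Airport C", "lat": 43.6291, "lon": 1.3638},
]

sites = merge_sites([source_1, source_2])
print([s["name"] for s in sites])
```

The pairwise comparison is quadratic in the number of sites. With tens of thousands of airports, a spatial index in the PostGIS database mentioned above would likely scale better.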
Building a site database
However, a database of locations alone is not enough. With the help of open source data, we added descriptive metadata for each airport: name, civilian or military, operational, size, function, country, number of runways, available equipment, etc.
With this information, site selection is more precise and of better quality.
Using some OSINT databases, we can get an accurate overall footprint of the site, as well as the specific sub-areas in which our observables can be located: aircraft storage areas and aircraft movement areas.
We have the Area of Interest (AOI) of each image! (Source: 2022 Copyright Maxar Technologies)
The last task is to build a descriptive database of all the airports on earth. We can then run SQL queries on the type of site (military or civilian), its function, its country, etc.
For example, searching for the Antonov An-124 can be very tedious: today, only 28 of these aircraft are in flying condition. With the database selection country=ukraine and russia; military=yes; material=an124,
4 sites are selected. This kind of selection greatly improves the detection of uncommon observables.
Selection of a rare observable, illustrated with the An-124 (Source: 2022 Copyright Maxar Technologies / detections by Preligens)
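The selection above could be expressed as a SQL query along these lines. The article's database runs on PostgreSQL/PostGIS; to stay self-contained, this sketch uses an in-memory SQLite database instead. The table names, column names and rows are hypothetical placeholders, not the real schema or real sites.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sites (
    id INTEGER PRIMARY KEY,
    name TEXT,
    country TEXT,
    military INTEGER  -- 1 = military, 0 = civilian
);
CREATE TABLE equipment (
    site_id INTEGER REFERENCES sites(id),
    material TEXT
);
""")

# Placeholder rows for illustration only.
conn.executemany(
    "INSERT INTO sites VALUES (?, ?, ?, ?)",
    [
        (1, "site-a", "ukraine", 1),
        (2, "site-b", "russia", 1),
        (3, "site-c", "russia", 0),
        (4, "site-d", "france", 1),
    ],
)
conn.executemany(
    "INSERT INTO equipment VALUES (?, ?)",
    [(1, "an124"), (2, "an124"), (2, "il76"), (3, "an124"), (4, "a400m")],
)

# country = ukraine and russia; military = yes; material = an124
rows = conn.execute(
    """
    SELECT DISTINCT s.name, s.country
    FROM sites AS s
    JOIN equipment AS e ON e.site_id = s.id
    WHERE s.country IN (?, ?)
      AND s.military = 1
      AND e.material = ?
    ORDER BY s.name
    """,
    ("ukraine", "russia", "an124"),
).fetchall()

print(rows)
```

Keeping equipment in its own table lets one site carry several materials, so a rare observable can be searched without scanning free-text fields.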
Second step: Acquiring satellite imagery
Obtaining a quality dataset in a minimum of time is a real challenge, and selecting images can become a real puzzle. The challenges are numerous: diversity of images, duplicate data, homogeneous separation between the test database and the training database, storage of information in the database, data normalization, etc.
Preligens has developed an automated image retrieval process with major image providers.
Image retrieval and normalization pipeline (Source: Preligens)
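One possible way to handle the separation between the test and training databases is to assign whole sites, rather than individual images, to one side or the other, so that two images of the same airport never end up on both sides. The sketch below illustrates that idea only; the article does not say how Preligens performs the split, and the `split_for_site` helper, the 20% test ratio and the identifiers are assumptions.

```python
import hashlib


def split_for_site(site_id: str, test_ratio: float = 0.2) -> str:
    """Map a site to 'train' or 'test' using a stable hash of its identifier."""
    digest = hashlib.sha256(site_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_ratio else "train"


# Hypothetical image records: two images share the same site.
images = [
    {"image_id": "img-001", "site_id": "site-a"},
    {"image_id": "img-002", "site_id": "site-a"},
    {"image_id": "img-003", "site_id": "site-b"},
]

assignment = {img["image_id"]: split_for_site(img["site_id"]) for img in images}
print(assignment)
```

Because the assignment depends only on the site identifier, it stays the same when new images of a known site are added later.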
Access to images
With the growing number of earth observation satellites, it is possible to obtain very high spatial resolution images several times a day. At Preligens, we have chosen to partner with the best earth observation companies: Airbus, Maxar and Planet. Access to these images has been automated through several APIs developed by the image providers:
Using APIs simplifies the selection of images based on very fine criteria:
In a few hours, it is possible to retrieve several thousand images, each covering a few km², spanning the last 15 years and the whole earth.
The data normalization step
Each image provider uses a different image format, metadata and file structure. The normalization step is therefore key. Its sole objective is to bring the data into a uniform form before it is used by machine learning algorithms.
This covers, for example, the file format, the pansharpening, the number of spectral bands and the encoding of the image.
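Two of these items, the number of spectral bands and the encoding, can be illustrated with a small sketch. It keeps a fixed set of bands and rescales each one to 8 bits with a percentile stretch. The band names, the 2nd/98th percentile choice and the tiny sample image are assumptions made for the example; file formats and pansharpening are not covered, and a real pipeline would more likely rely on an array or raster library than on nested lists.

```python
def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list (q in [0, 100])."""
    idx = round(q / 100 * (len(sorted_values) - 1))
    return sorted_values[idx]


def to_uint8(band, low_q=2, high_q=98):
    """Linearly stretch one band to 0..255, clipping outside the percentiles."""
    flat = sorted(v for row in band for v in row)
    lo, hi = percentile(flat, low_q), percentile(flat, high_q)
    if hi == lo:
        return [[0 for _ in row] for row in band]  # constant band
    scale = 255 / (hi - lo)
    return [[min(255, max(0, round((v - lo) * scale))) for v in row]
            for row in band]


def normalize(image, band_order=("red", "green", "blue")):
    """Keep a fixed set of bands, in a fixed order, each encoded on 8 bits."""
    return {name: to_uint8(image[name]) for name in band_order}


# Hypothetical 2x2 image with 11-bit values and an extra near-infrared band.
raw = {
    "red": [[0, 1023], [2047, 512]],
    "green": [[100, 100], [100, 100]],
    "blue": [[5, 10], [15, 20]],
    "nir": [[1, 2], [3, 4]],
}

normalized = normalize(raw)
print(normalized)
```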
Third step: description of the images
Introduction: Why describe a satellite image?
Choosing the right training and test datasets is a complex process. To help AI engineers choose their datasets efficiently, we provide two categories of information:
Each satellite image comes with metadata from the image provider: view angles, sun elevation, sun azimuth, pixel resolution, uuid, capture date, etc. This metadata is available through an API call or in a file delivered with the download, in a non-standardized format (XML, JSON, TXT, etc.). The retrieved information is stored in a database.
Image metadata (Source: Preligens)
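Bringing provider metadata into a common record before storing it could look like the sketch below. The JSON and XML layouts, the field names and the `image_metadata` table are invented for illustration and do not correspond to any real provider format; SQLite is used only to keep the example self-contained.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET


def from_json(text):
    """Parse a (hypothetical) JSON metadata document into the common record."""
    raw = json.loads(text)
    return {
        "uuid": raw["id"],
        "captured_at": raw["acquired"],
        "sun_elevation": float(raw["sun"]["elevation"]),
        "sun_azimuth": float(raw["sun"]["azimuth"]),
        "pixel_resolution": float(raw["gsd_m"]),
    }


def from_xml(text):
    """Parse a (hypothetical) XML metadata document into the common record."""
    root = ET.fromstring(text)
    return {
        "uuid": root.findtext("Id"),
        "captured_at": root.findtext("AcquisitionTime"),
        "sun_elevation": float(root.findtext("SunElevation")),
        "sun_azimuth": float(root.findtext("SunAzimuth")),
        "pixel_resolution": float(root.findtext("Resolution")),
    }


json_doc = (
    '{"id": "img-001", "acquired": "2022-01-15T10:30:00Z", '
    '"sun": {"elevation": 35.2, "azimuth": 160.1}, "gsd_m": 0.5}'
)
xml_doc = (
    "<Metadata><Id>img-002</Id>"
    "<AcquisitionTime>2022-02-01T09:00:00Z</AcquisitionTime>"
    "<SunElevation>41.0</SunElevation><SunAzimuth>152.3</SunAzimuth>"
    "<Resolution>0.3</Resolution></Metadata>"
)

records = [from_json(json_doc), from_xml(xml_doc)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_metadata (uuid TEXT PRIMARY KEY, captured_at TEXT, "
    "sun_elevation REAL, sun_azimuth REAL, pixel_resolution REAL)"
)
conn.executemany(
    "INSERT INTO image_metadata VALUES "
    "(:uuid, :captured_at, :sun_elevation, :sun_azimuth, :pixel_resolution)",
    records,
)
stored = conn.execute(
    "SELECT uuid, pixel_resolution FROM image_metadata ORDER BY uuid"
).fetchall()
print(stored)
```

One parser per source format, all returning the same keys, keeps the database schema independent of any single provider.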
Descriptive metadata of images
Automated description of an area of interest is a simple exercise if you know where to search for additional data.
ML teams need 4 important details:
Descriptive metadata to select specific datasets, for example desert images (2022 Copyright Maxar Technologies)
We solved this challenge by using open source data on a global scale:
For more details, the retrieval of this information is fully covered in a previous article: Improving the description of satellite images using GIS data.
Fourth Step: Data labeling — Without labeling, there is no machine learning
Construction of an ontology and a classification of geographic objects
To keep observables homogeneous within each class, we built a descriptive ontology of observables, developed with intelligence experts. For example, we described a hundred homogeneous classes of aircraft. Each observable is described in detail (length, width, type, function, etc.).
These descriptions help the annotation teams understand which class each observable belongs to.
From ontology to labeling (Source: Preligens)
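An ontology entry with the attributes listed above (length, width, type, function) could be represented as in the sketch below. The `ObservableClass` structure, the `is_plausible` helper, the 15% tolerance and the class identifier are assumptions for illustration; the An-124 dimensions are approximate published figures, rounded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservableClass:
    name: str
    kind: str        # e.g. "aircraft"
    function: str
    length_m: float
    width_m: float   # wingspan for an aircraft

    def is_plausible(self, length_m, width_m, tolerance=0.15):
        """True if measured dimensions are within +/- tolerance of the reference."""
        return (
            abs(length_m - self.length_m) <= tolerance * self.length_m
            and abs(width_m - self.width_m) <= tolerance * self.width_m
        )


# Hypothetical ontology with a single entry (approximate dimensions).
ONTOLOGY = {
    "an124": ObservableClass(
        name="Antonov An-124",
        kind="aircraft",
        function="strategic airlift",
        length_m=69.0,
        width_m=73.3,
    ),
}

an124 = ONTOLOGY["an124"]
print(an124.is_plausible(68.0, 74.0))  # dimensions close to the reference
print(an124.is_plausible(37.0, 34.0))  # much smaller aircraft
```

Storing reference dimensions next to each class gives annotators, and later automated checks, a concrete basis for deciding whether a label fits its class.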
Flexible and internal labeling application
There are several dozen companies that can label observables for machine learning. However, very few have implemented geographic data entry tools. At Preligens, we are fortunate to work with intelligence experts. We have developed an ontology of several hundred classes of observables. Each entity is clearly described in an annotation guide covering the observable, the characteristics of specific observations and the environment surrounding the observable.
For more than 4 years, we have run a flexible and robust internal application that lets us label data, classify it and perform quality checks.
The quality check is mandatory in the development of a machine learning dataset.
We chose long ago to check all labeled data several times, at random. We also added an automatic error detection phase using geospatial tools and scripts, which flags registration errors, incomplete polygons, impossible classifications, etc.
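Two of these automatic checks can be sketched in a few lines. The label layout, the `check_label` helper and the class identifiers are hypothetical. "Impossible classification" is interpreted here simply as a class that does not exist in the ontology, which is an assumption, and registration errors are not covered.

```python
KNOWN_CLASSES = {"an124", "il76"}  # hypothetical class identifiers


def check_label(label, known_classes=KNOWN_CLASSES):
    """Return a list of error messages for one labeled observable."""
    errors = []
    ring = label.get("polygon") or []
    if len(ring) < 4:
        # A closed ring needs at least 3 distinct points plus the closing one.
        errors.append("incomplete polygon: fewer than 4 points")
    elif ring[0] != ring[-1]:
        errors.append("incomplete polygon: ring is not closed")
    if label.get("class") not in known_classes:
        errors.append("impossible classification: class not in ontology")
    return errors


# Hypothetical labels: one valid, one truncated, one unclosed with a bad class.
labels = [
    {"id": 1, "class": "an124",
     "polygon": [(0, 0), (70, 0), (70, 73), (0, 73), (0, 0)]},
    {"id": 2, "class": "an124",
     "polygon": [(0, 0), (70, 0), (70, 73)]},
    {"id": 3, "class": "submarine",
     "polygon": [(0, 0), (5, 0), (5, 5), (0, 5)]},
]

report = {label["id"]: check_label(label) for label in labels}
for label_id, errors in report.items():
    print(label_id, errors)
```

Checks like these are cheap to run on every label, which leaves the random human review free to focus on errors a script cannot see.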
At a time when machine learning is becoming central to earth observation, Geographic Information Systems are one of the many elements that make it possible to build these algorithms.
To go further on building an AI factory, I strongly advise you to read this enlightening article : How we built an AI factory — Part 1 here