Example n° 7: Converting Lossless Aerial RGB GeoTIFF to Compressed JPEG + binary mask

This example can be helpful to deal with an use case like this:

  • Thousands of relatively small aerial GeoTIFFs (For example, 2048x2048 pixels each) covering a geographic region. Pixels at the edges with value RGB=255,255,255 are considered as filling background
  • There are too many files to be configured as a simple ImageMosaic and they should be reorganized. Otherwise, a preview of the whole region would need to access thousands of files at once, which is a no-go.
  • Due to the irregularity of the region edges, the whole collection isn’t covering a simple rectangular area.
  • Disk usage should be optimized

For this example we will use some data built on top of the aerial datasets.

Note

On a Windows machine you can set-up a shell with all GDAL Utilities, running directly the file OSGeo4W.bat under the %TRAINING_ROOT% folder.

  1. Navigate to the workshop directory $TRAINING_ROOT/data/user_data/gdal_processing_data (on Windows %TRAINING_ROOT%\data\user_data\gdal_processing_data) and find the rgb_data directory.

In fact, it doesn’t contain thousands of files in order to keep the training package small. However, it provides an idea of a set of image covering a custom region with wide void areas.

../../_images/aerial_mosaic.png

We can ideally regroup all the files into bigger chunks (marked in yellow) to be recompressed using JPEG. Moreover, due to the nature of irregular edges of the region, we also need to setup a binary mask to exclude pixels not belonging the original data (the white areas).

../../_images/aerial_mosaic_chunks.png

First, let’s set an environment variable to enable usage of internal masks:

export GDAL_TIFF_INTERNAL_MASK=YES (linux)

  or

SET GDAL_TIFF_INTERNAL_MASK=YES (windows)

Now, let’s start by creating a VRT on top of all the available files, and let’s assume that none of the valid pixels will be fully white (RGB=255,255,255). From the rgb_data directory let’s create a list of all the tif files available:

ls *.tif > list.txt (Linux)

        or

dir /b *.tif > list.txt (Windows)

Then, build a VRT on top of them, by also specifying the nodata:

gdalbuildvrt bigset.vrt -input_file_list list.txt -vrtnodata "255 255 255"

Now, create 4 bigger chunks using gdal_translate:

gdal_translate -srcwin 0 0 8192 8192 -CO TILED=YES -a_nodata none bigset.vrt chunk_r1c1.tif
gdal_translate -srcwin 8192 8192 4096 8192 -CO TILED=YES -a_nodata none bigset.vrt chunk_r2c2.tif
gdal_translate -srcwin 0 8192 8192 8192 -CO TILED=YES -a_nodata none bigset.vrt chunk_r2c1.tif
gdal_translate -srcwin 8192 0 4096 8192 -CO TILED=YES -a_nodata none bigset.vrt chunk_r1c2.tif

The size of the chunk depends on how many original files one wants to incorporate.

The next step is computing the mask for each chunk, and adding it back to the chunk itself. Follow these steps:

  1. Extract the 3 bands of each chunk (using gdalbuildvrt since we won’t actually write the 3 bands).
  2. Compute the binary mask by checking pixel values with gdal_calc. Pixels with value RGB=255,255,255 will be masked out.
  3. Recompose the image with the 3 bands and the newly computed mask (using again gdalbuildvrt to delay the actual writing at the very end).
  4. Write the result to a GeoTIFF using internal JPEG compression.
  5. Add the overviews

These are the relevant GDAL utilities commands:

Windows (you can put this on a reprocess.bat file in the directory containing the files):

echo off
for %%F in (chunk_????.tif) do  (
  echo Processing %%F
  echo Computing binary mask
  gdalbuildvrt -b 1 %%~nF_b1.vrt %%F
  gdalbuildvrt -b 2 %%~nF_b2.vrt %%F
  gdalbuildvrt -b 3 %%~nF_b3.vrt %%F
  gdal_calc --type=Byte --creation-option=PHOTOMETRIC=MINISBLACK ^
  --creation-option=NBITS=1 --creation-option=TILED=YES ^
  --creation-option=COMPRESS=DEFLATE ^
  -A %%~nF_b1.vrt -B %%~nF_b2.vrt -C %%~nF_b3.vrt ^
  --calc="logical_not(logical_and(logical_and(A==255,B==255),C==255))" ^
  --outfile %%~nF.msk.tif

  echo Combining files
  gdalbuildvrt -separate %%~nF_final.vrt %%~nF_b1.vrt %%~nF_b2.vrt %%~nF_b3.vrt %%~nF.msk.tif

  echo Writing composed files
  gdal_translate -CO "COMPRESS=JPEG" -CO "PHOTOMETRIC=YCBCR" ^
  -CO "JPEG_QUALITY=85" -CO "TILED=YES" -b 1 -b 2 -b 3 ^
  -mask 4 %%~nF_final.vrt final_%%~nF.tif
)

Windows (you can put this on a addo.bat file in the directory containing the files):

echo Adding overviews
for %%F in (final*.tif) do  (
  gdaladdo -r average %%F 2 4 8 16 32
)

Linux:

#!/bin/bash
export GDAL_TIFF_INTERNAL_MASK=YES

FILES="chunk_????.tif"
echo Computing binary mask
for F in $FILES
do
  echo Processing $f
  A=${F%%.*}_b1.vrt
  B=${F%%.*}_b2.vrt
  C=${F%%.*}_b3.vrt
  gdalbuildvrt -b 1 $A $F
  gdalbuildvrt -b 2 $B $F
  gdalbuildvrt -b 3 $C $F
  gdal_calc.py --type=Byte --creation-option=PHOTOMETRIC=MINISBLACK \
  --creation-option=NBITS=1 --creation-option=TILED=YES \
  --creation-option=COMPRESS=DEFLATE -A $A -B $B -C $C \
  --calc="logical_not(logical_and(logical_and(A==255,B==255),C==255))" \
  --outfile ${F%%.*}.msk.tif
  gdalbuildvrt -separate ${F%%.*}_final.vrt $A $B $C ${F%%.*}.msk.tif
done

echo Writing composed files
for F in $FILES
do
  gdal_translate -CO "COMPRESS=JPEG" -CO "PHOTOMETRIC=YCBCR" -CO "JPEG_QUALITY=85" \
  -CO "TILED=YES" -b 1 -b 2 -b 3 -mask 4 ${F%%.*}_final.vrt final_${F%%.*}.tif
done

echo Adding overviews
FILESOV="final*.tif"
for F in $FILESOV
do
  gdaladdo -r average $F 2 4 8 16 32
done

As you may notice, the disk usage went down from around 220MB of the original dataset to 22MB of the reprocessed chunks. Don’t delete any intermediate file for the moment, we are going to check a further optimization.

Further optimization

When dealing with very big datasets with wide void areas and big reprocessed chunks, we can do a further step to reduce some more disk usage on computed chunks, by using the SPARSE files. Any empty tile composing the image, can be marked with a special value of the TIFF tags so that no pixel bytes get allocated on disk to represent that fully empty tile. Think about a big chunk of 65536x65536 where there are only a couple of data tiles at the opposite corners. The amount of disk space that can be saved can be remarkable.

In reference to the images depicted at the very beginning (the one with datasets with red edges and yellow chunks), the yellow chunks will have several area not covered by datasets. The SPARSE_OK flag tells GDAL not to write pixels on those uncovered areas.

So let’s revisit a couple of steps. Basically, we are going to recreate the data chunks, this time using the sparse flag. However we will re-use the masks computed so far. Let’s recreate the vrt, this time without setting the nodata:

gdalbuildvrt bigsetsparse.vrt -input_file_list list.txt -vrtnodata none

Then, recompose the chunks using the Sparse files flag:

gdal_translate -srcwin 0 0 8192 8192 -CO TILED=YES -a_nodata none -CO SPARSE_OK=TRUE bigsetsparse.vrt chunk_r1c1.tif
gdal_translate -srcwin 8192 8192 4096 8192 -CO TILED=YES -a_nodata none -CO SPARSE_OK=TRUE bigsetsparse.vrt chunk_r2c2.tif
gdal_translate -srcwin 0 8192 8192 8192 -CO TILED=YES -a_nodata none -CO SPARSE_OK=TRUE bigsetsparse.vrt chunk_r2c1.tif
gdal_translate -srcwin 8192 0 4096 8192 -CO TILED=YES -a_nodata none -CO SPARSE_OK=TRUE bigsetsparse.vrt chunk_r1c2.tif

And re-compose the final chunks using the new chunk tifs together with the mask files already computed in the previous cases.

Windows (you can put this on a composesparse.bat file in the directory containing the files):

echo off
echo Compose files
for %%F in (chunk_????.tif) do  (
  gdal_translate -CO "COMPRESS=JPEG" -CO "PHOTOMETRIC=YCBCR" -CO "JPEG_QUALITY=85" ^
  -CO "SPARSE_OK=TRUE" -CO "TILED=YES" -b 1 -b 2 -b 3 -mask 4 %%~nF_final.vrt final_%%~nF.tif
)

echo Add overviews
for %%F in (final*.tif) do  (
  gdaladdo -r average %%F 2 4 8 16 32
)

Linux:

#!/bin/bash
export GDAL_TIFF_INTERNAL_MASK=YES

FILES="chunk_????.tif"
for F in $FILES
do
  echo Processing $f
  gdal_translate -CO "COMPRESS=JPEG" -CO "PHOTOMETRIC=YCBCR" -CO "JPEG_QUALITY=85" \
  -CO "SPARSE_OK=TRUE" -CO "TILED=YES" -b 1 -b 2 -b 3 -mask 4 ${F%%.*}_final.vrt final_${F%%.*}.tif
done

echo Adding overviews
FILESOV="final*.tif"
for F in $FILESOV
do
  gdaladdo -r average $F 2 4 8 16 32
done

Now, let’s delete all intermediate files and uncompressed chunks:

rm chunk* (linux)

  or

del chunk* (windows)