Joyent Manta Storage Service: Image Manipulation and Publishing
Part 1: The Getty Open Content Image Set.
On August 12, 2013, the J. Paul Getty Trust announced a new commitment to sharing the Getty’s digital resources freely with all, and launched the Getty Open Content Program. The initial set of 4,596 high-quality digital art images was made available on the Getty’s website for use without restriction.
At Joyent we recognized the enormous cultural value of this open content art collection, and saw in it an opportunity to show off the Joyent Manta Storage Service platform. So I have written a series of blog posts that show how Manta and its compute capability can host, validate, resize, reformat, archive, checksum and serve this unique collection of digital images.
For the first in this series, let me cover the basics about the content and structure of the Getty Open image collection. I will show the wide distribution of image sizes found in the set, spanning 3 MB to over 300 MB. I will also explain how to get the associated metadata, and provide links to download a copy of the image set in both JPEG and WebP image formats. The compact WebP format images can be viewed with Chrome.
Getty Open Content Image Distribution
After some effort in downloading and validating the image set, I can start by summarizing the Getty Open Content in terms of the total size of the original image files and the distribution of image sizes. They are fairly large JPEG files comprising a total of 101.6 GB of historically important pieces of digital art, sculpture, artefacts and photographs.
The distribution of all 4,596 image sizes is shown in the histogram above. The peak on the left shows most of the images are around 20 MB, but 14 are larger than 100 MB. You can see the relative pixel size of the two smallest images in the figure, which are photographs of a ring and a small carved female figure. These have file sizes of 3 and 3.7 MB respectively. Five of the largest images in the collection are shown, ranging in size from 188.5 to 327.5 MB, which are:
188.5MB: John, Fourteenth Lord Willoughby de Broke, and his Family
- by Johann Zoffany, about 1766.
- Reduced to 4, 1, or 0.25 Megapixel JPEG.
- Reduced to 4, 1, or 0.25 Megapixel WebP.
- XMP Metadata.
Getty Open Content Image Data, Museum Object Identifiers and Metadata
The Getty Open Content image set is available under the terms and conditions from the J. Paul Getty Museum, and I have reproduced these here. Oddly, the image data set is not circumscribed by a single download, or a file list, but rather by a specific URL-based query:
This could, of course, change at any time by updates to their database system.
A way to download all the image originals has not been provided, nor are there any checksums for download integrity.
You can go page by page through the thumbnails a few at a time and click download. I did that for one image before working out how to download the set, and I was directed to a form requesting information about my planned use of the image. If you decide to work with the Getty Open Content image resource in any form, please download one image from the Getty Museum directly and fill out their form.
To get the filenames of all the images, I altered the above query URL to provide a page of 5000 results by increasing the query default value of &rows=10. That trick gave me the whole image set displayed as thumbnails on a single HTML page, which I saved to my notebook. This step captured 4,599 thumbnails, with names such as 33681201-T.jpg. I mentioned there are 4,596 images because three of the images are broken at the download source, namely the images
I kept the list of image filenames (minus the -T thumbnail part) for downloading. The image numbers are non-sequential, so the list of image names is important for operating on the set.
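A minimal sketch of these two steps in Python, under stated assumptions: the query URL in the usage note is a placeholder (the real Getty query is not reproduced), the helper names `with_rows` and `image_names_from_html` are hypothetical, and thumbnails are assumed to appear in the saved page as eight digits followed by `-T.jpg` or `-T.JPG`, with the full-size files using a lowercase `.jpg` extension.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed thumbnail pattern: eight digits, "-T", then a .jpg/.JPG extension.
THUMB_RE = re.compile(r"\b(\d{8})-T\.jpe?g\b", re.IGNORECASE)


def with_rows(query_url, rows):
    """Return query_url with its 'rows' parameter set to the given value."""
    parts = urlsplit(query_url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "rows"
    ]
    params.append(("rows", str(rows)))
    return urlunsplit(parts._replace(query=urlencode(params)))


def image_names_from_html(html):
    """Collect unique full-size image names from thumbnail references, in page order."""
    seen = set()
    names = []
    for match in THUMB_RE.finditer(html):
        name = match.group(1) + ".jpg"  # drop the "-T" thumbnail part
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names
```

For example, `with_rows("http://example.org/search?q=open&rows=10", 5000)` rewrites only the `rows` value of that placeholder URL, and `image_names_from_html` can then be run over the saved HTML page to produce the download list.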
The last two digits in each file name before the .jpg extension are 01, which may be digits reserved for version numbering. The leading zeros in the filenames pad the numeric portions to 8 characters wide.
The plain image files can be linked to metadata about where they came from, the artist, and location. There is curated metadata associated with the artwork on the Getty web site that you can retrieve by a web query, and there is embedded metadata in XMP format that can be retrieved from inside each image. You can retrieve information about the art from the source, using the base query shown here together with a museum object ID.
Construct the museum object number from the image file name by removing any leading zeroes and the last two digits.
Example: Metadata Recovery from Image File
I recognized the image
Here is a zoom-in showing the detail of Sadie Pfeiffer's focused gaze, and her tattered apron, as reproduced from the high quality Getty Open original file image:
This image is well described in downloadable educational material made for teachers by the Getty Museum. Other web resources have lower-resolution versions of this image. For example, it is one in the set of over 5000 Lewis Hine images at lewishinephotographs.com (see here). The high resolution Getty Open image version shows remarkable clarity and detail. I noticed that the version of this photograph served by Art Institute Chicago (see here) is not cropped, whereas this high-resolution Getty Open version is cropped to remove the tattered edges of the photograph. This is a bit disappointing. While the number of Lewis Hine photos in the Getty Open image collection is a mere 6 at the moment, I hope this will increase as more of the collection is moved into the open content set. And I would plead, leave the cropping choice to the end-user.
So for this amazing photograph, the Getty Museum's object identifier is extracted as 68415 from image file identifier
The museum object identifier is used with the base query to retrieve the contextual metadata in HTML: 'http://search.getty.edu/museum/records/musobject?objectid=68415'
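The rule above (drop the last two digits, then any leading zeros) can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes every image name is an all-digit stem followed by a file extension; the base query string is the one quoted above.

```python
BASE_QUERY = "http://search.getty.edu/museum/records/musobject?objectid="


def museum_object_id(image_filename):
    """Derive the museum object number from a name such as '06841501.jpg'."""
    stem = image_filename.rsplit(".", 1)[0]
    if not stem.isdigit() or len(stem) < 3:
        raise ValueError(f"unexpected image name: {image_filename!r}")
    # Drop the trailing two digits (the '01' suffix); int() drops leading zeros.
    return int(stem[:-2])


def metadata_url(image_filename):
    """Build the contextual metadata URL for one image file name."""
    return BASE_QUERY + str(museum_object_id(image_filename))
```

Calling `metadata_url("06841501.jpg")` yields the same URL shown above for object 68415.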
To retrieve the XMP-formatted metadata carried within this image, I use the ImageMagick convert command line tool, and show the summary line about the artwork with
$ convert.jpg 06841501.xmp
$ cat 06841501.xmp | grep -A4 "
" Sadie Pfeiffer, Spinner in Cotton Mill, North Carolina; Lewis W. Hine, American, 1874 - 1940; North Carolina, United States, North America; negative 1910; print about 1920s - 1930s; Gelatin silver print; Sheet: 28 x 35.7 cm (11 x 14 1/16 in.); 84.XM.967.15
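As an alternative to ImageMagick, and not the method used in this post, an XMP packet can often be pulled out of a JPEG by scanning the raw bytes for the `<x:xmpmeta>` element. The sketch below assumes the packet is stored contiguously as UTF-8 text (standard XMP, not extended XMP split across several segments); the function name is hypothetical.

```python
def extract_xmp_packet(jpeg_bytes):
    """Return the first embedded XMP packet as text, or None if none is found."""
    start = jpeg_bytes.find(b"<x:xmpmeta")
    if start == -1:
        return None
    end_tag = b"</x:xmpmeta>"
    end = jpeg_bytes.find(end_tag, start)
    if end == -1:
        return None
    # Slice from the opening tag through the closing tag, inclusive.
    packet = jpeg_bytes[start:end + len(end_tag)]
    return packet.decode("utf-8", errors="replace")
```

A typical use would be to read an image with `open(path, "rb").read()`, pass the bytes to `extract_xmp_packet`, and then search the returned text for the description of the artwork.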
I have put a simple text file with the complete list of image file names and this summary line in the Downloads section at the bottom of this post.
Simplifying Content Delivery with Hierarchical Directories
Serving large numbers of high-resolution graphics for web and mobile content is a demanding task. The variety of customer display platforms ranges from handheld smartphones to small and large tablets. New very high-resolution browsers on notebook and desktop computers can take advantage of much larger images as well. Web and mobile content providers should be concerned with matching the best resolution images to the end-user display. Also, the increases in digital camera resolution are making originals bigger and bigger. Rather than just a thumbnail and a single resized image, content providers need a range of resized images.
So after a bit more digging on the Getty site, I found image versions served up in various sizes from two locations. For the Lewis Hine photograph I found:
http://www.getty.edu/art/collections/images/thumb/06841501-T.JPG - Thumbnail
http://www.getty.edu/art/collections/images/l.jpg - Large
http://www.getty.edu/art/collections/images/m.jpg - Medium
http://d2hiq5kf5j4p5h.cloudfront.net.jpg - Original
This illustrates the span of strategies for image hosting for web content delivery:
- Variation in filename for thumbnails.
- Variation in directory paths /m/ for large and medium sizes of the same filename image.
- Object hosting for large data transfers, in this case an Amazon CloudFront URL: d2hiq5kf5j4p5h.cloudfront.net, where the first part of the URL is a hash indicating the customer, bucket, and server location.
Here is a set of reduced images I made using a Manta compute job, along with the original version, hosted at:
/mantademo/public/images/getty-open/500.jpg.jpg - 0.25 Megapixel
/mantademo/public/images/getty-open/1000.jpg.jpg - 1 Megapixel
/mantademo/public/images/getty-open/2000.jpg.jpg - 4 Megapixel
/mantademo/public/images/getty-open/originals.jpg - Original
One of the most intuitive and powerful features of the Joyent Manta Storage Service is its hierarchical directories.
This may seem obvious, but cloud object storage services have not previously delivered this standard filesystem feature.
Here the Manta account I am using is mantademo, and the public subdirectory is automatically set up for publishing content for open downloading. The private subdirectory tree is under /mantademo/stor. Publishing content is as easy as moving it into the
The image resize jobs I used were run directly on the compute capacity of Manta using the Manta command line utilities.
In Part 3 of this series I will go over - in detail - the shell script I built that creates this entire set of resized images using ImageMagick, as well as details on color preservation, and lots of Manta tips so that you can reproduce the ideas developed in this example in other Manta compute scenarios.
Downloading and Validation Strategy
From the thumbnail directory listing and a bit of work with sed, I made a simple script that uses cURL to download the full-sized images off CloudFront, and I used mput to push them up to my Manta account. That script is here. I downloaded the originals twice, once to my notebook and once to Manta. I compared the two image sets to isolate any incomplete transfers and verify that the collection is intact.
In fact, one of the images on my notebook was only partially loaded by cURL, probably at the moment I switched my wifi source for faster bandwidth. The image set was not available as a single downloadable file with a standard checksum, so this double download and compare validation is an important step to ensure I have captured the image set correctly.
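The comparison step can be sketched in Python as below. This is not the script used for this post: it assumes both copies are available as local directories (the Manta copy would first have to be fetched, or its checksums obtained some other way), and the function names are hypothetical. Files are flagged when they are missing from one copy, differ in size, or differ in SHA-256 digest.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(dir_a, dir_b):
    """Return sorted file names that are missing from one copy or differ."""
    files_a = {p.name: p for p in Path(dir_a).iterdir() if p.is_file()}
    files_b = {p.name: p for p in Path(dir_b).iterdir() if p.is_file()}
    suspects = []
    for name in sorted(set(files_a) | set(files_b)):
        a, b = files_a.get(name), files_b.get(name)
        if a is None or b is None:
            suspects.append(name)  # present in only one copy
        elif a.stat().st_size != b.stat().st_size:
            suspects.append(name)  # a truncated transfer shows up here
        elif sha256_of(a) != sha256_of(b):
            suspects.append(name)  # same size, different content
    return suspects
```

Checking sizes before hashing means a truncated download is caught without reading either file. Agreement between two downloads shows only that the transfers were consistent with each other; any file that is flagged should be downloaded again.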
The broken download produced a truncated image on my notebook, file
Download the Getty Open Set from Joyent Manta Storage Service
The complete validated set of images can be downloaded from these links, which are hosted on the Joyent Manta Storage Service:
0.25 Megapixel JPEG 292MB
0.25 Megapixel WebP 172MB
1 Megapixel JPEG 1.01GB
1 Megapixel WebP 545MB
4 Megapixel JPEG 3.94GB
4 Megapixel WebP 2.01GB
XMP Metadata 28.3MB
Remarkably, you can see how JPEG and WebP compare in the aggregate sizes of these equivalent tar archives, which are not further compressed by other protocols. This makes for a robust size comparison of the two formats across this large and diverse image set.
At 0.25 Megapixel, the WebP images are 41% smaller than the JPEG, at 1 Megapixel they are 46% smaller, and at 4 Megapixel they are 49% smaller. You can scroll back to the top and find links to the five largest images in the set in both formats at these resolutions and compare the image quality on a WebP compatible browser.
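These percentages follow directly from the archive sizes listed above. A short Python check, assuming 1 GB = 1000 MB when converting the gigabyte figures (the listing does not say which convention it uses):

```python
# Archive sizes in MB as (JPEG, WebP), taken from the download list above.
# The GB figures are converted assuming 1 GB = 1000 MB (an assumption).
ARCHIVE_SIZES_MB = {
    "0.25 Megapixel": (292, 172),
    "1 Megapixel": (1010, 545),
    "4 Megapixel": (3940, 2010),
}


def percent_smaller(jpeg_mb, webp_mb):
    """Size reduction of the WebP archive relative to JPEG, as a whole percent."""
    return round(100 * (1 - webp_mb / jpeg_mb))


savings = {
    label: percent_smaller(jpeg_mb, webp_mb)
    for label, (jpeg_mb, webp_mb) in ARCHIVE_SIZES_MB.items()
}

for label, percent in savings.items():
    print(f"{label}: WebP archive is {percent}% smaller than JPEG")
```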
To wrap up, this blog post introduces the Getty Open Image set hosted on Manta and provides the downloads that get you the image set in resolutions suitable for further experimentation.
In follow-on parts of this series, I will dive deeper into the code and methodology I used to compute everything you see in this post, including validation, image conversion to WebP and resizing, color preservation, creating archives, computing checksums, and extracting metadata one-liners, all on the Joyent Manta Storage Service.
Post written by Christopher Hogue, Ph.D.