Extracting Google Cardboard photos

Recently, Google announced a free app for Android that lets you take 360-degree 3D panoramas that also include ambient audio. The app works really well and lets you take photos much faster than their Photosphere app (which was renamed to Google Street View for some reason).
Unfortunately, the app is pretty limited and requires a Cardboard to actually view the images. I was curious whether the 3D images could be extracted for use on other devices, e.g. an Oculus Rift. Here, I'll show you how I extracted the images and the corresponding audio files.
If you want to follow along, here's the raw image.
To understand any image, I always go to the EXIF data first. Jeffrey's EXIF Viewer is the easiest tool for finding interesting bits in a photo. We're in luck: in addition to the GPS location and camera information (UC San Diego Geisel Library, Nexus 6, btw), we also get some interesting XMP information:
XMP Toolkit                      Adobe XMP Core 5.1.0-jc003
Cropped Area Left Pixels         0
Cropped Area Top Pixels          1,538
Cropped Area Image Width Pixels  9,840
Cropped Area Image Height Pixels 1,872
Full Pano Width Pixels           9,840
Full Pano Height Pixels          4,920
Initial View Heading Degrees     180
Mime                             audio/mp4a-latm
Mime                             image/jpeg
Has Extended XMP                 E55A6FF153CBFB8DBE5E4B22C1ADDF5F
Data                             (4,299,744 bytes binary data)
Data                             (1,047,496 bytes binary data)
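As an aside, the cropped-area numbers describe where this strip sits inside the full 9,840 x 4,920 panorama. Here is a small sketch of that arithmetic, assuming the panorama uses an equirectangular projection (the 2:1 aspect ratio hints at that, but the listing above doesn't say so); the function name is my own invention.

```python
def vertical_coverage(full_height, crop_top, crop_height):
    """Return (top, bottom) latitudes in degrees covered by the cropped strip.

    Assumes an equirectangular panorama: the full height spans 180 degrees,
    from +90 (straight up) at row 0 to -90 (straight down) at the last row.
    """
    degrees_per_pixel = 180.0 / full_height
    top = 90.0 - crop_top * degrees_per_pixel
    bottom = 90.0 - (crop_top + crop_height) * degrees_per_pixel
    return top, bottom


# Values taken from the XMP listing above
top, bottom = vertical_coverage(4920, 1538, 1872)
print("covers %.1f to %.1f degrees (%.1f total)" % (top, bottom, top - bottom))
```

Under that assumption, the photo covers a band of roughly 68 degrees around the horizon, and a viewer is expected to fill in the rest above and below.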
I didn't know what XMP was before playing around with the photo, but Wikipedia to the rescue:
XMP standardizes a data model, a serialization format and core properties for the definition and processing of extensible metadata. It also provides guidelines for embedding XMP information into popular image, video and document file formats, such as JPEG and PDF, without breaking their readability by applications that do not support XMP.
Great, pretty straightforward. This explains why the raw image is still viewable, despite having extra information encoded. Let's try extracting out that data.
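If you're curious why viewers aren't bothered by the extra data: in a JPEG, XMP lives in APP1 segments that come before the actual image data, and a decoder simply skips segments it doesn't recognize. Because a single segment tops out at about 64 KB, big payloads are split across several "extended XMP" segments, tied together by the GUID shown as Has Extended XMP above. Here's a rough sketch that lists those segments using only the standard library. The function name is mine, and it assumes a well-formed file (no padding bytes between segments).

```python
import struct

XMP_HEADER = b"http://ns.adobe.com/xap/1.0/\x00"
EXTENDED_HEADER = b"http://ns.adobe.com/xmp/extension/\x00"


def list_xmp_segments(jpeg_bytes):
    """Return a list of (kind, body_size) for APP1 segments that carry XMP."""
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("missing JPEG start-of-image marker")
    found = []
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        # Each segment: 2-byte marker, then a 2-byte length that counts itself
        marker, length = struct.unpack(">HH", jpeg_bytes[pos:pos + 4])
        if marker == 0xFFDA:  # start of scan: compressed pixels follow
            break
        body = jpeg_bytes[pos + 4:pos + 2 + length]
        if marker == 0xFFE1:  # APP1 (also used by EXIF, which we ignore)
            if body.startswith(XMP_HEADER):
                found.append(("xmp", len(body)))
            elif body.startswith(EXTENDED_HEADER):
                found.append(("extended", len(body)))
        pos += 2 + length
    return found
```

Calling list_xmp_segments(open(image_file, 'rb').read()) on the raw photo should show one main packet and a long run of extended segments, which is where the second image and the audio are hiding.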
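Python has a nice library that decodes the XMP format, so let's use that: pip install python-xmp-toolkit (the package is imported as libxmp and needs the Exempi library installed on your system).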
Let's open the file now:
from libxmp.utils import file_to_dict
image_file = 'IMG_20151204_143434.vr.raw.jpg'
xmp = file_to_dict(image_file)
Now, the variable xmp has a dictionary of the neatly decoded data:
print(xmp.keys())
>>> [u'http://ns.google.com/photos/1.0/image/',
 u'http://ns.google.com/photos/1.0/panorama/',
 u'http://ns.adobe.com/tiff/1.0/',
 u'http://ns.adobe.com/xap/1.0/',
 u'http://ns.google.com/photos/1.0/audio/']
We're looking for an image and an audio file. The dictionary key http://ns.google.com/photos/1.0/image/ has the image that we're looking for. Each key maps to a list of (property, value, options) tuples, and the second entry holds the data. To extract it, we need to base64-decode it:
import base64
right_image = base64.b64decode(
    xmp['http://ns.google.com/photos/1.0/image/'][1][1])
# Write in binary mode ('wb'): the decoded data is raw JPEG bytes
with open('right_image.jpg', 'wb') as f:
    f.write(right_image)
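Before trusting the output, it's cheap to check that the decoded bytes really look like a JPEG. Here's a minimal sketch of that idea: the helper names are hypothetical, and the only thing it checks is the two-byte marker that every JPEG file starts with.

```python
import base64

JPEG_MAGIC = b"\xff\xd8"  # start-of-image marker at the top of every JPEG


def decode_jpeg_payload(b64_text):
    """Base64-decode a payload and make sure it starts like a JPEG."""
    data = base64.b64decode(b64_text)
    if not data.startswith(JPEG_MAGIC):
        raise ValueError("decoded payload does not start with a JPEG marker")
    return data


def save_bytes(path, data):
    """Write raw bytes to disk; binary mode keeps them exactly as decoded."""
    with open(path, "wb") as f:
        f.write(data)
```

With these, the extraction step would become save_bytes('right_image.jpg', decode_jpeg_payload(...)) with the same dictionary lookup as above, and a wrong index fails loudly instead of producing a broken file.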
Now let's take a look at what the image looks like:

Looks pretty similar to the original image that we got. Let's do a diff to see what changed:
Here, I'm comparing the original image (left eye) and the XMP-extracted image (right eye). The gray areas create the 3D effect, due to the parallax between the two eyes.
Finally, let's get the audio out. Same deal as before, base64 decoding:
audio = base64.b64decode(
    xmp['http://ns.google.com/photos/1.0/audio/'][1][1])
# Binary mode again: the decoded audio is raw bytes, not text
with open('audio.aac', 'wb') as f:
    f.write(audio)
I figured out the filetype by exploring the dictionary a little more: xmp['http://ns.google.com/photos/1.0/audio/'][0] shows GAudio:Mime => audio/mp4a-latm, which points to AAC. Here's the audio (alas, it's not very interesting).
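Indexing with [0] and [1] works, but it depends on the order of the entries. A slightly more robust sketch looks properties up by name and picks the file extension from the MIME type. Both helpers are hypothetical, and they assume each entry is a (property, value, options) tuple and that the data property is called GAudio:Data, by analogy with the GAudio:Mime entry above. The extension table just mirrors the choices made in this post.

```python
def find_property(properties, name):
    """Return the value of the first property whose path ends with `name`."""
    for prop in properties:
        path, value = prop[0], prop[1]
        if path.endswith(name):
            return value
    raise KeyError(name)


# Extensions follow this post's choices; extend the table as needed
EXTENSIONS = {"image/jpeg": ".jpg", "audio/mp4a-latm": ".aac"}


def extension_for(mime):
    """Map a MIME type to a file extension, falling back to .bin."""
    return EXTENSIONS.get(mime, ".bin")


# Stand-in for xmp['http://ns.google.com/photos/1.0/audio/']
sample = [
    ("GAudio:Mime", "audio/mp4a-latm", {}),
    ("GAudio:Data", "AAAA", {}),
]
print("audio" + extension_for(find_property(sample, "GAudio:Mime")))
```

The same two calls should work for the image namespace, assuming its properties follow the same naming pattern.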
Here's the script in its entirety:
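import base64
from libxmp.utils import file_to_dict
image_file = 'IMG_20151204_143434.vr.raw.jpg'
xmp = file_to_dict(image_file)
# The second property in each namespace holds the base64-encoded payload
right_image = base64.b64decode(
    xmp['http://ns.google.com/photos/1.0/image/'][1][1])
audio = base64.b64decode(
    xmp['http://ns.google.com/photos/1.0/audio/'][1][1])
# Binary mode ('wb') keeps the decoded bytes intact
with open('right_image.jpg', 'wb') as f:
    f.write(right_image)
with open('audio.aac', 'wb') as f:
    f.write(audio)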
Now you can view these images in your favorite VR goggles, woohoo! Thanks, Google, for using an open, easy-to-use format.