Types for Machine Learning




Julian Dolby

IBM Thomas J. Watson Research Center





PLDI PC Meeting Workshop, February 2018

Outline

  • What is the problem?
  • Why does the problem arise?
  • How can types and analysis help?
  • Status

MNIST data

Machine learning with MNIST

  • Sets of 28x28 images used as training, test data
# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=False)
  • Conceptually, mnist.train.images is an image array
    • grayscale, 1 value per pixel
    • read as arrays of 784 (28*28) numbers
    • result is a tensor of (number of images) x 784
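The shapes above can be sketched with NumPy standing in for the real MNIST tensors (the data here is synthetic, just to illustrate the layout):

```python
import numpy as np

# Synthetic stand-in for mnist.train.images: 100 grayscale
# images, each flattened to a 784-element (28*28) vector.
num_images = 100
images = np.zeros((num_images, 28 * 28))

# The resulting tensor is (number of images) x 784.
print(images.shape)
```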

Machine learning with MNIST

  • Recognize digits from images

  • Neural network from original image “features”
    • images can be smoothed, convolved, etc.
    • TensorFlow provides many manipulations
  • Focus on a single 2-D convolution example (line 45)
    • 784-element vector is the wrong shape for a 2-D operation
    • need to reshape it (line 42)
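The reshape step can be sketched with NumPy, whose reshape semantics match tf.reshape for this case (the batch size of 100 is an illustrative assumption):

```python
import numpy as np

# A batch of flattened 784-element vectors is the wrong shape for a
# 2-D convolution, which expects [batch, height, width, channels].
batch = np.zeros((100, 784))

# As in the tutorial: -1 infers the batch dimension, 28x28 restores
# the image grid, and 1 is the single grayscale channel.
x = batch.reshape(-1, 28, 28, 1)
print(x.shape)
```

Note that the reshape only succeeds because 784 really is 28*28 — exactly the non-local fact the analysis must track.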

Example with MNIST data

The Problem

  • Non-local information and complex semantics
    • reshape (line 42) needs original shape
    • reshape (line 42) for conv2d (line 45)
  • Information never explicit in the code

  • Tutorial has information in comments
    • real code has such comments too
    • examples at IBM, other places

How types and analysis help

  • Comments have severe limitations
    • tedious to write
    • error prone
    • no semantics
  • Tensor dataflow information is vital
    • tensors are complex
    • API requirements are complex
    • data flow is complex
  • Analysis tracks tensor “types”
    • “types” capture element meanings, dimensions
    • only vital declarations; preserve flexible, dynamic Python

Express semantics of data

  • Only MNIST arrays declared explicitly
    mnist : {
     training : {images : [channel; x(28)*y(28); batch]}, 
     test : {images : [channel; x(28)*y(28); batch]}
    }
    mnist = input_data.read_data_sets(...)
    
  • Information must be tracked in the code
    [channel; x(28)*y(28); batch] ->
    [channel; 1; x(28); y(28); batch]
    x = tf.reshape(x, shape=[-1, 28, 28, 1])
    
  • Types so far for structs and tensors
    • struct: var : { ("field" : type)* }
    • tensor: [ type ; label?(size)* ]
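The struct and tensor "types" on this slide could be modeled in Python roughly as follows; the class names and field layout are my own illustration, not the syntax proposed in the talk:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Dim:
    label: Optional[str]   # e.g. "x", "y", "batch"; None if unlabeled
    size: Optional[int]    # None when the size is symbolic (e.g. batch)

@dataclass
class TensorType:
    element: str           # meaning of each element, e.g. "channel"
    dims: List[Dim]

# [channel; x(28)*y(28); batch] -- the flattened MNIST images
flat = TensorType("channel", [Dim("x*y", 28 * 28), Dim("batch", None)])

# [channel; 1; x(28); y(28); batch] -- after tf.reshape for conv2d
reshaped = TensorType("channel",
                      [Dim(None, 1), Dim("x", 28),
                       Dim("y", 28), Dim("batch", None)])
```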

Analysis challenges

def conv_net(x_dict, n_classes, dropout, reuse, is_training):
    with tf.variable_scope('ConvNet', reuse=reuse):
        x = x_dict['images']
  • Python data structures like dictionaries, objects
def model_fn(features, labels, mode):
    logits_train = conv_net(features, num_classes, dropout,
                            reuse=False, is_training=True)
  • Interprocedural control- and data-flow
model = tf.estimator.Estimator(model_fn)
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'images': mnist.train.images}, y=mnist.train.labels,
    batch_size=batch_size, num_epochs=None, shuffle=True)
model.train(input_fn, steps=num_steps)
  • model.train ultimately calls model_fn

Status

  • Type system under development
    • examples on slides not recommended syntax
  • Analysis prototype being built with WALA
    • front end built with Jython for ASTs
    • use common front end support in WALA
    • some trivial call graphs work, much to be done
    • code being developed on GitHub
  • TensorFlow APIs will be modeled with WALA
    • reuse support that models the DOM, J2EE