Fully Training a Core ML Convolutional Neural Network On-Device

In a previous article, I focused on transfer learning scenarios with Core ML, and in particular we saw how to create a new model on an iOS device, import embedding weights from a previously-trained model, and train the rest of the layers on-device, using private and local data:

Moving forward in my long journey towards developing a Swift federated learning infrastructure, this time I’ve investigated how to train a slightly more complex model architecture, a CNN, from scratch on iOS devices.

Here, I’ll also create this model locally, directly in Swift, using the SwiftCoreMLTools library I’ve previously introduced, which is available on this GitHub repo:

The MNIST dataset

In this article, we’re covering how to implement an image classification model using the standard MNIST dataset. Except in this case, we’ll be training the Core ML model directly on iOS devices, without previous training on other ML frameworks.

The MNIST database of handwritten digits is such a well-known dataset that I don’t really think it needs any introduction for the reader of this story.

Just to very quickly recap, this dataset is commonly used for introducing a specific neural network architecture, the convolutional neural network (CNN), which is frequently used in image recognition and object detection domains. It provides 60,000 training and 10,000 testing samples: grayscale images, of dimension 28×28, of handwritten digits from 0 to 9.

In the sample project linked at the end of this story, I’ve created an iOS/macOS SwiftUI application written in pure Swift that prepares this dataset, creates a CNN model directly in the app (using the SwiftCoreMLTools library mentioned above), and trains this model with Core ML, feeding it the local batch of prepared data.

LeNet CNN Architecture

The LeNet architecture is an excellent starting point for understanding the details and seeing the benefits of CNNs. The combination of LeNet CNN with the MNIST dataset is such a standard in machine learning “training” that it’s usually considered the “Hello, World” equivalent of deep learning for image classification.

It basically consists of two sets of convolution, ReLU activation, and max pooling layers, followed by a fully-connected hidden dense layer, again usually with ReLU activation, and finally another fully-connected dense layer with softmax activation for the classification result.

In this post, we’ll focus on how to directly build and train a LeNet CNN model for the MNIST dataset in Swift on an iOS device. We’ll then compare it with a classic “Python” approach based on a well-known ML framework such as TensorFlow.

Preparing data in Swift for Core ML training

Before discussing how to create and train the LeNet CNN network in Core ML, let’s see how to first prepare the MNIST training data in order to correctly batch it into the Core ML runtime.

In previous articles of this series on on-device Core ML training, I’ve already covered how to use Core ML’s MLBatchProvider and other APIs to create batches of data.

In the following Swift snippet, the batch of training data is prepared specifically for the MNIST dataset by simply normalizing the “pixel” values of each image from an original range of 0 to 255 to a more “understandable” range between 0 and 1.

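import CoreML
import Foundation

// Reads mnist_train.csv (one label followed by 784 pixel values per row) and
// builds a batch provider of normalized images with their labels.
func prepareBatchProvider() -> MLBatchProvider {
    var featureProviders = [MLFeatureProvider]()

    var count = 0
    errno = 0
    let trainFilePath = Bundle.main.url(forResource: "mnist_train", withExtension: "csv")!
    // Redirect stdin to the CSV file so readLine() can stream it line by line.
    if freopen(trainFilePath.path, "r", stdin) == nil {
        print("error opening file")
        // Nothing can be read without the file, so return an empty batch.
        return MLArrayBatchProvider(array: featureProviders)
    }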
    while let line = readLine()?.split(separator: ",") {
        count += 1

        let imageMultiArr = try! MLMultiArray(shape: [1, 28, 28], dataType: .float32)
        let outputMultiArr = try! MLMultiArray(shape: [1], dataType: .int32)

        for r in 0..<28 {
            for c in 0..<28 {
                let i = (r*28)+c
                imageMultiArr[i] = NSNumber(value: Float(String(line[i + 1]))! / Float(255.0))
            }
        }

        outputMultiArr[0] = NSNumber(value: Int(String(line[0]))!)

        let imageValue = MLFeatureValue(multiArray: imageMultiArr)
        let outputValue = MLFeatureValue(multiArray: outputMultiArr)

        let dataPointFeatures: [String: MLFeatureValue] = ["image": imageValue,
                                                           "output_true": outputValue]

        if let provider = try? MLDictionaryFeatureProvider(dictionary: dataPointFeatures) {
            featureProviders.append(provider)
        }
    }

    return MLArrayBatchProvider(array: featureProviders)
}
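
To make the normalization step easier to follow in isolation, here is a minimal sketch of parsing a single CSV row outside of Core ML. The helper name `parseMNISTRow` and its `pixelCount` parameter are hypothetical and not part of the sample project; the sketch only assumes the row layout used above (label first, then one value per pixel).

```swift
// Hypothetical helper, not part of the sample project: parses one CSV row
// of the form "label,p0,p1,...,p783" into a label and normalized pixels.
func parseMNISTRow(_ row: String, pixelCount: Int = 28 * 28) -> (label: Int, pixels: [Float])? {
    let fields = row.split(separator: ",")
    // Expect one label column plus one column per pixel.
    guard fields.count == pixelCount + 1, let label = Int(String(fields[0])) else {
        return nil
    }
    var pixels = [Float]()
    pixels.reserveCapacity(pixelCount)
    for field in fields.dropFirst() {
        guard let raw = Float(String(field)) else { return nil }
        // Scale from the original 0...255 range to 0...1.
        pixels.append(raw / 255.0)
    }
    return (label: label, pixels: pixels)
}

// Tiny three-pixel row, just to show the scaling.
if let sample = parseMNISTRow("7,0,255,51", pixelCount: 3) {
    print(sample.label, sample.pixels)
}
```

Returning nil for malformed rows, instead of force-unwrapping, would let a caller skip a header line or a corrupted row without crashing.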

Preparing the Core ML Model (CNN) for Training

Once we’ve prepared and normalized the batch for our training data, we can now prepare locally, in Swift, the CNN Core ML model using the SwiftCoreMLTools library.

In the following snippet, you can see the architecture of the LeNet CNN and how the layers (convolution, max pooling, flatten, and the hidden and final dense layers) are sequentially called using the SwiftCoreMLTools DSL builder.

In the SwiftCoreMLTools DSL function builder code below, you can also see how to pass essential training information and hyperparameters to the Core ML model in the same context: loss function, optimizer, learning rate, number of epochs, batch size, and others.

public func prepareModel() {
    let coremlModel = Model(version: 4,
                            shortDescription: "MNIST-Trainable",
                            author: "Jacopo Mangiavacchi",
                            license: "MIT",
                            userDefined: ["SwiftCoremltoolsVersion" : "0.0.12"]) {
        Input(name: "image", shape: [1, 28, 28])
        Output(name: "output", shape: [10], featureType: .float)
        TrainingInput(name: "image", shape: [1, 28, 28])
        TrainingInput(name: "output_true", shape: [1], featureType: .int)
        NeuralNetwork(losses: [CategoricalCrossEntropy(name: "lossLayer",
                                   input: "output",
                                   target: "output_true")],
                      optimizer: Adam(learningRateDefault: 0.0001,
                                     learningRateMax: 0.3,
                                     miniBatchSizeDefault: 128,
                                     miniBatchSizeRange: [128],
                                     beta1Default: 0.9,
                                     beta1Max: 1.0,
                                     beta2Default: 0.999,
                                     beta2Max: 1.0,
                                     epsDefault: 0.00000001,
                                     epsMax: 0.00000001),
                      epochDefault: UInt(self.epoch),
                      epochSet: [UInt(self.epoch)],
                      shuffle: true) {
            Convolution(name: "conv1",
                         input: ["image"],
                         output: ["outConv1"],
                         outputChannels: 32,
                         kernelChannels: 1,
                         nGroups: 1,
                         kernelSize: [3, 3],
                         stride: [1, 1],
                         dilationFactor: [1, 1],
                         paddingType: .valid(borderAmounts: [EdgeSizes(startEdgeSize: 0, endEdgeSize: 0),
                                                             EdgeSizes(startEdgeSize: 0, endEdgeSize: 0)]),
                         outputShape: [],
                         deconvolution: false,
                         updatable: true)
            ReLu(name: "relu1",
                 input: ["outConv1"],
                 output: ["outRelu1"])
            Pooling(name: "pooling1",
                         input: ["outRelu1"],
                         output: ["outPooling1"],
                         poolingType: .max,
                         kernelSize: [2, 2],
                         stride: [2, 2],
                         paddingType: .valid(borderAmounts: [EdgeSizes(startEdgeSize: 0, endEdgeSize: 0),
                                                             EdgeSizes(startEdgeSize: 0, endEdgeSize: 0)]),
                         avgPoolExcludePadding: true,
                         globalPooling: false)
            Convolution(name: "conv2",
                         input: ["outPooling1"],
                         output: ["outConv2"],
                         outputChannels: 32,
                         kernelChannels: 32,
                         nGroups: 1,
                         kernelSize: [2, 2],
                         stride: [1, 1],
                         dilationFactor: [1, 1],
                         paddingType: .valid(borderAmounts: [EdgeSizes(startEdgeSize: 0, endEdgeSize: 0),
                                                             EdgeSizes(startEdgeSize: 0, endEdgeSize: 0)]),
                         outputShape: [],
                         deconvolution: false,
                         updatable: true)
            ReLu(name: "relu2",
                 input: ["outConv2"],
                 output: ["outRelu2"])
            Pooling(name: "pooling2",
                         input: ["outRelu2"],
                         output: ["outPooling2"],
                         poolingType: .max,
                         kernelSize: [2, 2],
                         stride: [2, 2],
                         paddingType: .valid(borderAmounts: [EdgeSizes(startEdgeSize: 0, endEdgeSize: 0),
                                                             EdgeSizes(startEdgeSize: 0, endEdgeSize: 0)]),
                         avgPoolExcludePadding: true,
                         globalPooling: false)
            Flatten(name: "flatten1",
                         input: ["outPooling2"],
                         output: ["outFlatten1"],
                         mode: .last)
            InnerProduct(name: "hidden1",
                         input: ["outFlatten1"],
                         output: ["outHidden1"],
                         inputChannels: 1152,
                         outputChannels: 500,
                         updatable: true)
            ReLu(name: "relu3",
                 input: ["outHidden1"],
                 output: ["outRelu3"])
            InnerProduct(name: "hidden2",
                         input: ["outRelu3"],
                         output: ["outHidden2"],
                         inputChannels: 500,
                         outputChannels: 10,
                         updatable: true)
            Softmax(name: "softmax",
                    input: ["outHidden2"],
                    output: ["output"])
        }
    }

    let coreMLData = coremlModel.coreMLData
    try! coreMLData!.write(to: coreMLModelUrl)
}

Resulting CNN model

As you might have noticed in the SwiftCoreMLTools DSL builder code above, the Core ML model we’ve just built has two stacked blocks of convolution plus max pooling layers, and then, after flattening everything, it has a hidden layer and a final dense layer with softmax activation for the final classification.
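
As a quick sanity check on the builder code, the following sketch traces the spatial size of the feature maps through the layers to show where the value 1152 used for the first dense layer's input comes from. The helper `validOutputSize` is a hypothetical name; the kernel sizes and strides are the ones from the model definition above, and all padding is "valid" (none).

```swift
// Output size of a "valid" (unpadded) convolution or pooling along one dimension.
func validOutputSize(input: Int, kernel: Int, stride: Int) -> Int {
    return (input - kernel) / stride + 1
}

var side = 28                                              // input image: 28 x 28
side = validOutputSize(input: side, kernel: 3, stride: 1)  // conv1    -> 26
side = validOutputSize(input: side, kernel: 2, stride: 2)  // pooling1 -> 13
side = validOutputSize(input: side, kernel: 2, stride: 1)  // conv2    -> 12
side = validOutputSize(input: side, kernel: 2, stride: 2)  // pooling2 -> 6

let channels = 32                                          // outputChannels of conv2
let flattenedSize = side * side * channels                 // 6 * 6 * 32
print(flattenedSize)                                       // prints 1152
```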

Below is a visualization of the generated Core ML model (from Netron):

Compile and Train the Model

In another previous article, I discussed how to use the Core ML API to retrain/personalize an existing model on-device, whether that model is downloaded from the cloud or, as in this case, generated locally on the device with the SwiftCoreMLTools library.

I suggest looking at this previous article to see snippets of Swift code for compiling the model and starting a training task using Core ML:
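
For orientation, here is a minimal sketch of what that compile-and-train step can look like with the standard Core ML update API (`MLModel.compileModel(at:)` and `MLUpdateTask`). It is not the code from the sample project: the function names `startTraining` and `epochLogLine`, and the `updatedModelUrl` parameter, are assumptions made for illustration.

```swift
import CoreML
import Foundation

// Hypothetical helper: formats one progress line per epoch.
func epochLogLine(epochIndex: Int, loss: Double) -> String {
    return "epoch \(epochIndex): loss \(loss)"
}

// Sketch: compiles the .mlmodel file written by prepareModel() and starts an
// on-device update task with the batch from prepareBatchProvider().
func startTraining(modelUrl: URL,
                   trainingData: MLBatchProvider,
                   updatedModelUrl: URL) throws -> MLUpdateTask {
    // Core ML can only load and update compiled (.mlmodelc) models.
    let compiledUrl = try MLModel.compileModel(at: modelUrl)

    let handlers = MLUpdateProgressHandlers(
        forEvents: [.epochEnd],
        progressHandler: { context in
            // Metrics are reported as untyped values keyed by MLMetricKey.
            let epochIndex = context.metrics[.epochIndex] as? Int ?? -1
            let loss = context.metrics[.lossValue] as? Double ?? Double.nan
            print(epochLogLine(epochIndex: epochIndex, loss: loss))
        },
        completionHandler: { context in
            if let error = context.task.error {
                print("training failed: \(error)")
                return
            }
            // Persist the updated model so it can be reloaded for inference.
            do {
                try context.model.write(to: updatedModelUrl)
            } catch {
                print("could not save updated model: \(error)")
            }
        })

    let task = try MLUpdateTask(forModelAt: compiledUrl,
                                trainingData: trainingData,
                                configuration: MLModelConfiguration(),
                                progressHandlers: handlers)
    task.resume()
    return task
}
```

The number of epochs, the optimizer, and the batch size are not passed here because, in this setup, they are already embedded in the model through the training parameters declared in the builder code.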

Baseline TensorFlow 2.0 model

In order to benchmark results—and in particular the training performance in terms of execution time—I’ve also recreated an exact replica of the same CNN model using TensorFlow 2.0.

The Python code snippet below illustrates the same model architecture in TF and the summary of the output shapes of each layer.

You can notice here that the layers, layer shapes, and convolution filters and pooling sizes are exactly the same as in the Core ML model, which we created on-device with the SwiftCoreMLTools library.

Comparing results

Before looking at the training execution time, it’s important to note that both the Core ML and the TensorFlow model trained for the same number of epochs (10), with the same hyperparameters, obtaining very similar accuracy on the same 10,000 test images.

In particular, you can see from the Python code snippet below that the TensorFlow model, trained with the same Adam optimizer and categorical cross-entropy loss function, reached a final accuracy on the test set greater than 0.98.

For the Core ML model, you can see from the iPhone app screenshot below that, training and testing with the same optimizer, loss function and of course the same train and test datasets, it also obtains a final accuracy result greater than 0.98.
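
For reference, the accuracy figure quoted for both models is simply the fraction of test images whose most probable class matches the true label. A minimal sketch of that computation, assuming each prediction is the 10-value softmax output of the model (the function names `argmax` and `accuracy` are hypothetical, not taken from the sample app):

```swift
// Index of the largest value, i.e. the predicted class for one softmax output.
func argmax(_ values: [Float]) -> Int? {
    guard var bestIndex = values.indices.first else { return nil }
    for index in values.indices where values[index] > values[bestIndex] {
        bestIndex = index
    }
    return bestIndex
}

// Fraction of samples whose predicted class equals the true label.
func accuracy(predictions: [[Float]], labels: [Int]) -> Double {
    precondition(predictions.count == labels.count, "one label per prediction")
    guard !labels.isEmpty else { return 0 }
    var correct = 0
    for (probabilities, label) in zip(predictions, labels) {
        if argmax(probabilities) == label {
            correct += 1
        }
    }
    return Double(correct) / Double(labels.count)
}

// Toy example with three classes instead of ten.
let toyPredictions: [[Float]] = [[0.1, 0.8, 0.1],
                                 [0.7, 0.2, 0.1],
                                 [0.2, 0.3, 0.5],
                                 [0.6, 0.3, 0.1]]
let toyLabels = [1, 0, 2, 1]
print(accuracy(predictions: toyPredictions, labels: toyLabels))  // prints 0.75
```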

Training performance benchmark: Core ML vs TensorFlow

For on-device Core ML model training, I’ve executed tests on macOS and on both the iOS Simulator and real Apple devices. Doing this, I’ve noticed once again that training Core ML models on modern iPhone/iPad devices is really much better optimized than on a MacBook Pro with an i7 CPU, a Radeon GPU, and lots of memory.

To provide some real numbers about how good and promising on-device training is on current-gen iPhones, I can say I was able to train the 60,000 MNIST samples for 10 epochs in about:

248 seconds on an iPhone 11 with the Core ML model
158 seconds using TensorFlow 2.0 on an i7 MacBook Pro (using CPU only, of course).

Of course, there is a significant gap between 248 seconds and 158 seconds: the iPhone run took roughly 57% longer than TensorFlow on the laptop CPU (248 / 158 ≈ 1.57), even without considering a GPU. But the real point here is not to compare apples with oranges; it is to get a glimpse of what mobile and wearable devices can do when training locally, on-device, on very sensitive and personal data.

In particular, I think it’s important to reflect that training a single epoch on a mobile device with 585,958 parameters and 60,000 data points required something around 25 seconds (248 seconds over 10 epochs).
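
The parameter count quoted above can be reproduced from the layer sizes in the model definition. The sketch below assumes every convolution and dense layer has one bias per output channel, which is what makes the total come out to 585,958; the helper names are hypothetical. It also derives the per-epoch time implied by the 248-second total.

```swift
// Weights plus biases of a convolution layer.
func convolutionParameters(kernelHeight: Int, kernelWidth: Int,
                           inputChannels: Int, outputChannels: Int) -> Int {
    return kernelHeight * kernelWidth * inputChannels * outputChannels + outputChannels
}

// Weights plus biases of a fully-connected (inner product) layer.
func denseParameters(inputs: Int, outputs: Int) -> Int {
    return inputs * outputs + outputs
}

let conv1 = convolutionParameters(kernelHeight: 3, kernelWidth: 3,
                                  inputChannels: 1, outputChannels: 32)   // 320
let conv2 = convolutionParameters(kernelHeight: 2, kernelWidth: 2,
                                  inputChannels: 32, outputChannels: 32)  // 4,128
let hidden1 = denseParameters(inputs: 1152, outputs: 500)                 // 576,500
let hidden2 = denseParameters(inputs: 500, outputs: 10)                   // 5,010

let totalParameters = conv1 + conv2 + hidden1 + hidden2
print(totalParameters)                                                    // prints 585958

// Average time per epoch for the iPhone 11 run reported above.
let secondsPerEpoch = 248.0 / 10.0
print(secondsPerEpoch)                                                    // prints 24.8
```

Almost all of the parameters sit in the first dense layer, so the flattened size of 1152 is the main driver of the model size.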

Considering scenarios such as distributed training, and in particular federated learning, I really think these are very promising numbers. I’ll continue testing more on my long journey towards this federated learning platform.

By the way, if you want to contribute in any way—for example, by testing or implementing missing functionalities on the SwiftCoreMLTools library—please be my guest.

Final Touch: Core ML + SwiftUI

One final note here about how easy it is to integrate Core ML training with a powerful user interface tool such as SwiftUI + Combine.

Jupyter Notebooks and even other tools like TensorFlow.js are very good for building real-time experiments, but I have to say that the opportunity that Core ML + SwiftUI offers for real on-device experimentation is really amazing.

In my very simple use case for this article about training the MNIST dataset on an iPhone, it was very easy for me to add a minimal touch interface to directly let users draw new digits on the screen and test them live.

The SwiftCoreMLTools library, with a Swift DSL implemented with the same Swift function builder used by SwiftUI, offers a really coherent and similar approach for building the model and experimenting with the interaction between the UI and the model in real-time scenarios.

The code

As always, the code for this story is completely open source and available on my personal GitHub account:

Special thanks

Finally, I want to thank the Apple Core ML team, on GitHub and on the Apple Feedback developer tool, for their very quick and detailed help in providing me with suggestions and insights on both the Core ML protobuf file format and the Core ML runtime.