Core ML On-Device Training, with Transfer Learning from Swift for TensorFlow Models

As a first small step towards a federated learning platform that supports mobile and wearable devices (in particular, devices within the Apple ecosystem), I’ve been developing a Swift library called SwiftCoreMLTools that mimics in Swift a subset of the functionality of Apple’s CoreMLTools Python library.

To give a brief overview: SwiftCoreMLTools exposes a DSL (based on function builders), as well as a classic API, to declare Core ML models built from scratch, optionally mark them as re-trainable on device, and import weights from models trained with other frameworks.

I’ve been developing this library (which, by the way, is still at a very early stage) for two reasons. First, I want a simple way to interoperate with Swift for TensorFlow on mobile devices and to quickly export models and weights. Second, and very importantly, I want to be able to create and train Core ML models on demand, directly on devices.

You can find more specific info directly on this GitHub repo:

Continuation from a Previous Article

The SwiftCoreMLTools library basically attempts to simplify most of the tedious mechanisms described in my previous article “Swift loves TensorFlow and Core ML”, where I covered how it’s possible to directly use Core ML Protocol Buffer data structures to export a trained Swift for TensorFlow model in order to create and personalize a Core ML model directly on Apple devices:

In this article, I’ll show how this new library can simplify Core ML model creation in a slightly different scenario. I’m focusing on how to apply transfer learning, and specifically on how to reuse embedding representations trained for a previous model in a brand new model trained from scratch on Apple devices.

Transfer Learning

This machine learning technique has been very popular and successful in recent years in domains such as image classification and natural language modeling, but I’m using it here in a more general and simpler tabular data scenario.

It’s absolutely not my intention to venture into an introduction of ML concepts, such as transfer learning and categorical embedding. Just in case you need some background, I’d strongly suggest taking a look at this amazing article from fast.ai:

The tabular data we’re using here is a classical example of a multiple input scenario with some numerical features (a vector of floating point data) and some other categorical features.

In this context, the categorical features will be better learned by the model using an embedding layer for each of these features. Put very simply, an embedding layer is a learnable lookup table that maps each value of a categorical variable to a point in a multi-dimensional Euclidean space.
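To make the lookup-table idea concrete, here is a minimal sketch in plain Swift. The `EmbeddingTable` type and the weight values are hypothetical and only for illustration; they are not part of SwiftCoreMLTools or Core ML.

```swift
// A minimal embedding table: one learnable row of weights per category.
struct EmbeddingTable {
    // weights[i] is the vector for the category with ordinal index i.
    var weights: [[Float]]

    // Number of distinct categories (rows) and size of each vector (columns).
    var vocabularySize: Int { weights.count }
    var embeddingSize: Int { weights.first?.count ?? 0 }

    // Looking up a category is plain row indexing.
    func lookup(_ index: Int) -> [Float] {
        precondition(weights.indices.contains(index), "unknown category index")
        return weights[index]
    }
}

// Hypothetical weights for a two-valued feature such as CHAS (0 or 1),
// embedded into a 2-dimensional space.
let chasEmbedding = EmbeddingTable(weights: [[0.1, -0.3],
                                             [0.7, 0.2]])
let vector = chasEmbedding.lookup(1)   // [0.7, 0.2]
```

During training, the rows of the table are adjusted like any other weights, so categories that behave similarly tend to end up close to each other in the embedding space.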

Regarding transfer learning, this technique is usually used for fine-tuning big pre-trained models to work for specialized use cases with limited training data. The goal of using transfer learning here is simply to train the model centrally once, to obtain this embedding representation, and then to reuse the weights of these embedding layers when re-training local models directly on devices.

For this specific use case (remember the long-running goal of implementing a complete federated learning infrastructure), the real reason we use transfer learning is that we want to maximize the privacy of the user, enabling us to train models on user-owned devices using only personal data that will never leave these devices.

When thinking about user privacy implications and the need to train models locally, it’s easy to understand that it’s impossible for the device model to successfully train a generalized embedding representation of categorical inputs using the small amounts of personal data generated by a single user. Some category values, for example, may be completely missing from a particular user’s data.

As illustrated in the diagram below, showing just the head of the two models, we’ll train this embedding representation once using Swift for TensorFlow (on Colab or any desktop or cloud environment you want) and reuse the weights from these embedding layers in the models we’re going to train directly on-device.

Data

In reality, the sample project I’m showing here (see at the end for the GitHub repo link) is an emulation of the general privacy-driven transfer learning approach described above. All the concepts are still valid, but I’ll use a simplified version of the approach using a small and very common dataset.

The data I’m using in this scenario is a very tiny tabular dataset, the Boston housing price dataset (creators: Harrison, D. and Rubinfeld, D.L.), and in particular, I’m using the same data to train both a “central” model and a “device” model. Because the data is the same, the two models have identical architectures and layer shapes in this case. The only difference is that in the “device” model, the embedding layer weights are frozen during training.

We’ll talk a little more about this in the conclusion, but note already that forcing this simulation to use the same data for both models allows us to run direct inference comparisons between the two Core ML models: the “central” one, trained with S4TF, and the “device” one, trained directly with Core ML.

Dataset Details

The Boston housing price dataset has 11 numeric and 2 categorical predictive variables. Median value (attribute 14) is usually the target. Here’s the info for the various attributes (in order):

CRIM (Numerical): per capita crime rate by town
ZN (Numerical): proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS (Numerical): proportion of non-retail business acres per town
CHAS (Categorical): Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX (Numerical): nitric oxides concentration (parts per 10 million)
RM (Numerical): average number of rooms per dwelling
AGE (Numerical): proportion of owner-occupied units built prior to 1940
DIS (Numerical): weighted distances to five Boston employment centers
RAD (Categorical): index of accessibility to radial highways
TAX (Numerical): full-value property-tax rate per ten thousand dollars
PTRATIO (Numerical): pupil-teacher ratio by town
B (Numerical): 1000(Bk - 0.63)² where Bk is the proportion of blacks by town
LSTAT (Numerical): % lower status of the population
MEDV (Numerical): Median value of owner-occupied homes in thousands of dollars

Data Transformations in Swift

These are basically the transformations that have to be applied to this dataset in order to correctly use the categorical embedding approach and end up with a good, normalized representation of the numerical features.

Categorical features must be encoded as ordinal indices (each value’s position in the sorted list of that feature’s distinct values) and fed into embedding layers.
Numerical features have to be normalized using z-scores (x’ = (x - MEAN(X)) / STD(X)), with the mean and standard deviation fitted on the training data only.
Finally, the dataset has to be split into train and test sets.
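As a compact illustration of these steps, here is a sketch on toy values. Categories are mapped to ordinal indices, as in the Swift code later in this section. The helper names `ordinalEncode` and `fitZScore` and all the numbers are made up for the example, and the sample standard deviation (dividing by n - 1) is an assumption.

```swift
// Map each raw category value to its position in the sorted list of distinct values.
func ordinalEncode(_ values: [Int32]) -> (codes: [Int32], categories: [Int32]) {
    let categories = Set(values).sorted()
    let codes = values.map { Int32(categories.firstIndex(of: $0)!) }
    return (codes, categories)
}

// Mean and sample standard deviation of one column, fitted on training rows only.
func fitZScore(_ column: [Float]) -> (mean: Float, std: Float) {
    let mean = column.reduce(0, +) / Float(column.count)
    let squaredError = column.map { ($0 - mean) * ($0 - mean) }.reduce(0, +)
    return (mean, (squaredError / Float(column.count - 1)).squareRoot())
}

// Toy categorical column (RAD-like values).
let rad: [Int32] = [24, 1, 5, 24, 3]
let encoded = ordinalEncode(rad)
// encoded.categories == [1, 3, 5, 24], encoded.codes == [3, 0, 2, 3, 1]

// Toy numerical column, already split into train and test.
let trainColumn: [Float] = [2, 4, 6]
let testColumn: [Float] = [8]
let stats = fitZScore(trainColumn)                                     // mean 4, std 2
let trainScaled = trainColumn.map { ($0 - stats.mean) / stats.std }   // [-1, 0, 1]
let testScaled = testColumn.map { ($0 - stats.mean) / stats.std }     // [2]
```

The test column is scaled with the statistics of the training column, so no information from the test set leaks into the normalization.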

Below, I’ve pasted the cells from the S4TF Notebook I used to prepare the data and get tensor representations for the numerical and categorical inputs and for the label values. The full source code of this Notebook is in the GitHub repo linked at the end.

Of course, S4TF does not run (yet?) on iOS devices, so in order to prepare the data for Core ML training, I’ve re-implemented the same transformations in “pure Swift” with no tensors, no TF functions, no ops for normalization, etc.

import Foundation

public struct HousingData {
    public let trainPercentage: Float
    public let numRecords: Int
    public let numColumns: Int
    public let numCategoricalFeatures: Int
    public let numNumericalFeatures: Int
    public let numLabels: Int
    public let numTrainRecords: Int
    public let numTestRecords: Int
    
    public let allCategoriesValues: [[Int32]]
    public let mean: [Float]
    public let std: [Float]
    
    public let xNumericalTrain: [[Float]]
    public let xCategoricalTrain: [[Int32]]
    public let yTrain: [[Float]]

    public let xNumericalTest: [[Float]]
    public let xCategoricalTest: [[Int32]]
    public let yTest: [[Float]]

    static func matrixTranspose<T>(_ matrix: [[T]]) -> [[T]] {
        if matrix.isEmpty {return matrix}
        var result = [[T]]()
        for index in 0..<matrix.first!.count {
            result.append(matrix.map{$0[index]})
        }
        return result
    }
    
    public init(trainPercentage:Float = 0.8) {
        // Load Data
        let filePath = Bundle.main.url(forResource: "housing", withExtension: "csv")
        let data = try! String(contentsOf: filePath!, encoding: String.Encoding.utf8)

        // Convert Space Separated CSV with no Header
        let dataRecords: [[Float]] = data.split(separator: "\n").map{ String($0).split(separator: " ").compactMap{ Float(String($0)) } }

        // Data Ingestion
        let numRecords = dataRecords.count
        let numColumns = dataRecords[0].count

        // Randomize Data
        let randomDataRecords = dataRecords.shuffled()
        let dataFeatures = randomDataRecords.map{ Array($0[0..<numColumns-1]) }
        let dataLabels = randomDataRecords.map{ Array($0[(numColumns-1)...]) }
        
        // Split Numerical Categorical Features
        let categoricalColumns = [3, 8]
        let numericalColumns = [0, 1, 2, 4, 5, 6, 7, 9, 10, 11, 12]
        let numCategoricalFeatures = categoricalColumns.count
        let numNumericalFeatures = numericalColumns.count
        let numLabels = 1
        assert(numColumns == numCategoricalFeatures + numNumericalFeatures + numLabels)
        
        // Get Categorical Features
        let allCategoriesValues = dataFeatures.map{ row in categoricalColumns.map{ Int32(row[$0]) } }
                                        .reduce(into: Array(repeating: [Int32](), count: 2)){ total, value in
                                            total[0].append(value[0])
                                            total[1].append(value[1]) }
                                        .map{ Set($0).sorted() }
        let categoricalFeatures = dataFeatures.map{ row in categoricalColumns.map{ Int32(row[$0]) } }

        // Get Numerical Features
        let numericalFeatures = dataFeatures.map{ row in numericalColumns.map{ row[$0] } }
        
        // Categorize Categorical Features with Ordinal values
        var categoricalValues = Array(repeating: Set<Int32>(), count: 2)

        for record in categoricalFeatures {
            categoricalValues[0].insert(record[0])
            categoricalValues[1].insert(record[1])
        }

        let sortedCategoricalValues = [categoricalValues[0].sorted(), categoricalValues[1].sorted()]

        let ordinalCategoricalFeatures = categoricalFeatures.map{ [Int32(sortedCategoricalValues[0].firstIndex(of:$0[0])!),
                                                                   Int32(sortedCategoricalValues[1].firstIndex(of:$0[1])!)] }
        
        // Split Train and Test
        let numTrainRecords = Int(ceil(Float(numRecords) * trainPercentage))
        let numTestRecords = numRecords - numTrainRecords
        let xCategoricalAllTrain = HousingData.matrixTranspose(Array(ordinalCategoricalFeatures[0..<numTrainRecords]))
        let xCategoricalAllTest = HousingData.matrixTranspose(Array(ordinalCategoricalFeatures[numTrainRecords...]))
        let xNumericalAllTrain = Array(numericalFeatures[0..<numTrainRecords])
        let xNumericalAllTest = Array(numericalFeatures[numTrainRecords...])
        let yAllTrain = Array(dataLabels[0..<numTrainRecords])
        let yAllTest = Array(dataLabels[numTrainRecords...])
        
        // Normalize Numerical Features
        var xTrainNormalized = xNumericalAllTrain
        var xTestNormalized = xNumericalAllTest

        var mean = Array(repeating: Float(0), count: numNumericalFeatures)
        for r in xTrainNormalized {
            for c in 0..<mean.count {
                mean[c] = mean[c] + r[c]
            }
        }
        for c in 0..<mean.count {
            mean[c] = mean[c] / Float(numTrainRecords)
        }

        var std = Array(repeating: Float(0), count: numNumericalFeatures)
        for r in xTrainNormalized {
            for c in 0..<mean.count {
                std[c] = std[c] + pow(r[c] - mean[c], 2.0)
            }
        }
        for c in 0..<mean.count {
            // Sample standard deviation: square root of the unbiased variance
            std[c] = sqrt(std[c] / Float(numTrainRecords - 1))
        }

        for r in 0..<xTrainNormalized.count {
            for c in 0..<numNumericalFeatures {
                xTrainNormalized[r][c] = (xTrainNormalized[r][c] - mean[c]) / std[c]
            }
        }

        for r in 0..<xTestNormalized.count {
            for c in 0..<numNumericalFeatures {
                xTestNormalized[r][c] = (xTestNormalized[r][c] - mean[c]) / std[c]
            }
        }
        
        // Initialize struct properties
        self.trainPercentage = trainPercentage
        self.numRecords = numRecords
        self.numColumns = numColumns
        self.numCategoricalFeatures = numCategoricalFeatures
        self.numNumericalFeatures = numNumericalFeatures
        self.numLabels = numLabels
        self.numTrainRecords = numTrainRecords
        self.numTestRecords = numTestRecords
        self.allCategoriesValues = allCategoriesValues
        self.mean = mean
        self.std = std

        self.xNumericalTrain = xTrainNormalized
        self.xCategoricalTrain = xCategoricalAllTrain
        self.yTrain = yAllTrain

        self.xNumericalTest = xTestNormalized
        self.xCategoricalTest = xCategoricalAllTest
        self.yTest = yAllTest
    }
}

The Model

The model is a very simple deep neural network (DNN) with a bunch of hidden layers with ReLU activations and, as this is a regression problem, a final dense layer with no activation function.

The head of the model, as mentioned, has to manage multiple input sources. Specifically, it uses an embedding layer for each categorical input, then concatenates the output of all the embedding layers with the numerical input and passes this concatenation to the real DNN.

Usually, when using multiple inputs with categorical embeddings, a good practice is to feed the numerical inputs and the outputs of the embedding layers into separate dense layers before concatenating them. Only then is the concatenated output passed to the rest of the model.

Unfortunately, one of the current limitations of Core ML training support is that it isn’t capable of performing backpropagation across a concatenation layer.

As such, I had to simplify the model by removing all those dense input layers and directly concatenating the numerical input with all the output from the embedding layers.
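The simplified head can be sketched in plain Swift for a single record. The function name `modelHead`, the embedding sizes, and the weight values below are hypothetical; only the count of 11 numerical features comes from the dataset.

```swift
// Hypothetical embedding tables: rows are categories, columns are embedding dimensions.
let chasTable: [[Float]] = [[0.1, -0.3], [0.7, 0.2]]            // 2 categories -> 2 dims
let radTable: [[Float]] = [[0.5, 0.0, -0.5], [0.2, 0.4, 0.6]]   // truncated, illustrative table

// Builds the input of the first dense layer for a single record.
func modelHead(numerical: [Float], categorical: [Int], tables: [[[Float]]]) -> [Float] {
    precondition(categorical.count == tables.count, "one table per categorical feature")
    // Look up one embedding vector per categorical feature...
    let embedded: [Float] = zip(categorical, tables).flatMap { index, table in table[index] }
    // ...and append them directly to the normalized numerical features.
    return numerical + embedded
}

let numerical = [Float](repeating: 0, count: 11)   // 11 normalized numerical features
let headOutput = modelHead(numerical: numerical,
                           categorical: [1, 0],
                           tables: [chasTable, radTable])
// headOutput.count == 11 + 2 + 3 == 16
```

With this layout, the first dense layer simply sees one flat vector whose length is the number of numerical features plus the sum of the embedding sizes.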

Training

For both Swift for TensorFlow and Core ML models, the training loop is managed with exactly the same configuration in terms of batch logic, epochs, loss functions, optimizers, and hyperparameters.

Below are the cells from the S4TF Notebook for training the model (with the embedding layers!).

Core ML Model Creation with SwiftCoreMLTools

Looking at the DSL code below, you might notice that the only layers that are given weight values are the embedding layers. Moreover, you might also notice that only the real dense layers (called InnerProduct in Core ML) are declared with the updatable parameter set to true, so they are the only ones trained, while the embedding layers are frozen.
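Independently of the library’s DSL, the effect of freezing a layer can be sketched in plain Swift. This is not SwiftCoreMLTools or Core ML code: the `Parameter` type, the `sgdStep` function, the plain SGD update rule, and all the values are assumptions made for illustration.

```swift
// A parameter tensor flattened to an array, plus a flag saying whether training may change it.
struct Parameter {
    var values: [Float]
    let isUpdatable: Bool
}

// One plain SGD step: frozen parameters are returned unchanged.
func sgdStep(_ parameter: Parameter, gradient: [Float], learningRate: Float) -> Parameter {
    guard parameter.isUpdatable else { return parameter }
    precondition(gradient.count == parameter.values.count, "gradient shape mismatch")
    var updated = parameter
    for i in updated.values.indices {
        updated.values[i] -= learningRate * gradient[i]
    }
    return updated
}

let embeddingWeights = Parameter(values: [0.7, 0.2], isUpdatable: false)  // imported, frozen
let denseWeights = Parameter(values: [1.0, -1.0], isUpdatable: true)      // trained on device
let gradient: [Float] = [0.5, 0.5]

let newEmbedding = sgdStep(embeddingWeights, gradient: gradient, learningRate: 0.5)
let newDense = sgdStep(denseWeights, gradient: gradient, learningRate: 0.5)
// newEmbedding.values == [0.7, 0.2], newDense.values == [0.75, -1.25]
```

The imported embedding weights come out of the step untouched, while the dense weights move against the gradient, which is the split between frozen and updatable layers described above.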

Below is a graph representation of the exported Core ML model (from Netron):

SwiftUI Sample Application and Conclusion

In order to test the training and the inference of the Core ML model on macOS, iOS, and iPadOS devices, I quickly developed a cross-platform demo app using SwiftUI, taking advantage of Combine to publish feedback from the background training task to the UI.

I’m omitting the Swift Core ML code for inference and training here, as I’ve already shown something similar in my previous article, and of course the full source code is available at the GitHub repo linked below. By the way, the repo also includes a Swift Playground to quickly run inference tests directly inside Xcode.

I tested the app on an iPhone 11, an iPad Pro, and a MacBook Pro, and I’m happy to report that both training and inference ran very smoothly.

Of course, the simple dataset I’ve used in this test is not a good one for a serious benchmark comparison, but I have to say I’ve been really impressed by how fast it is to train Core ML models on iOS and iPadOS devices.

Swift for TensorFlow is of course in its early days: it currently uses the TensorFlow runtime in eager mode only and, even on Colab, I trained using only the CPU. That’s no real problem at all here, even when training with Core ML on a Mac. But again, I’ve been very positively surprised that training with Core ML on real iOS and iPadOS devices (on exactly the same data, with the same model, for the same number of epochs) is super fast and promising. Core ML’s deep integration with GPUs and the Apple Neural Engine really makes the difference when compared to other approaches.

I’ll be testing more on my long journey towards this federated learning platform. If you want to contribute in any way, by testing or implementing missing functionalities on the SwiftCoreMLTools library, please be my guest.