It All Boils Down to Training Data

Is your model not performing well? Try digging into your data. Instead of getting marginal improvements in performance by searching for state-of-the-art models, you can drastically improve your model’s accuracy by improving the quality of your data.

Since most data scientists are adapting off-the-shelf algorithms to specific business applications, one of the most difficult challenges that data scientists face today is creating a continuous workflow that consistently feeds high-quality training data into their algorithms.

At the same time, your model is learning, and you want to be able to leverage this intelligent model to label the rest of your dataset.

Building annotation infrastructure that integrates with your model, and managing that workflow, is the most challenging part of machine learning.

Iteration => Accuracy & Consistency

The garbage in, garbage out axiom can be masked during training. Even when fed random noise, such as random labels or unstructured pixels, certain models are capable of overfitting to the point of attaining 0% training error (Understanding Deep Learning Requires Rethinking Generalization).

This is because recent high-capacity models like deep neural networks can memorize massive datasets. While these models make no errors during training, when tested they perform no better than random guessing.
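To see this failure mode for yourself, here's a minimal sketch; scikit-learn and synthetic data are illustrative choices rather than the paper's setup. An over-parameterized network memorizes completely random labels during training, then scores around chance on a held-out set.

```python
# A minimal sketch (scikit-learn and synthetic data are illustrative choices):
# a high-capacity model can drive training error toward zero on pure noise
# while generalizing no better than chance.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Random features and completely random binary labels -- there is no signal to learn.
X_train, y_train = rng.normal(size=(300, 20)), rng.integers(0, 2, size=300)
X_test, y_test = rng.normal(size=(300, 20)), rng.integers(0, 2, size=300)

# An over-parameterized network trained long enough to memorize the training set.
model = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=3000, random_state=0)
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # should approach 1.0 (memorized)
print("test accuracy:", model.score(X_test, y_test))     # should hover near 0.5 (chance)
```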

Therefore, iteration and rigorous QA/QC processes are essential to a proper data labeling workflow.

Broadly speaking, label quality can be evaluated (i) automatically without human input, (ii) directly by the job provider or data scientist, or (iii) by consensus among the labelers themselves. Since, in most cases, automated evaluation is either impossible or guarantees only minimal quality, we'll discuss how to implement QA/QC methods from the latter two categories to help improve confidence in the quality of your training data:

  1. Test questions
  2. Direct inspection
  3. Consensus

Test questions and direct inspection are QA/QC methods that fit into category (ii) where the job provider, or data scientist, is directly responsible for evaluating quality.

Test questions are a standard technique: a set of data is correctly labeled by the data scientist and then distributed randomly among labelers to test their accuracy. Direct inspection is the process of visually inspecting your labeled data to gauge accuracy.
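As a rough sketch of how scoring against test questions might work (the labeler names, items, and labels below are purely illustrative), you compare each labeler's answers on the gold-labeled items against the known answers and compute a per-labeler accuracy:

```python
# A hedged sketch of the test-question idea: gold labels prepared by the data
# scientist are mixed into the labeling queue, and each labeler's accuracy is
# measured on just those items. All names and values here are illustrative.
gold = {"img_001": "cat", "img_007": "dog", "img_042": "cat"}

labeler_answers = {
    "labeler_a": {"img_001": "cat", "img_007": "dog", "img_042": "dog"},
    "labeler_b": {"img_001": "cat", "img_007": "dog", "img_042": "cat"},
}

for labeler, answers in labeler_answers.items():
    scored = [item for item in answers if item in gold]
    correct = sum(answers[item] == gold[item] for item in scored)
    accuracy = correct / len(scored) if scored else float("nan")
    print(f"{labeler}: {correct}/{len(scored)} test questions correct ({accuracy:.0%})")
```

Labelers whose accuracy falls below a threshold you choose can then be retrained, or have their labels reviewed more closely or excluded.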

Visual screening is a basic capability that everyone should have, both for preprocessing data and for reviewing labels for accuracy after the fact. In his article, Why You Need To Improve Your Training Data, And How To Do It, Pete Warden recommends randomly browsing through your data.

This basic practice can reveal valuable information about your dataset, such as an “unbalanced number of examples in different categories, corrupted data (for example PNGs labeled with JPG file extensions), incorrect labels, or just surprising combinations.”
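One way to run that kind of spot check is sketched below, assuming a folder-per-class layout; the data/train path and the 20-file sample size are illustrative. It reports the class balance and flags files whose bytes don't match their extension.

```python
# A minimal sketch of the random browsing described above: report the class
# balance and flag files whose magic bytes disagree with their extension
# (e.g. a PNG stored with a .jpg extension). The directory layout is assumed.
import random
from collections import Counter
from pathlib import Path

DATA_DIR = Path("data/train")  # assumed layout: data/train/<class_name>/<image files>

files = [p for p in DATA_DIR.rglob("*") if p.is_file()]
class_counts = Counter(p.parent.name for p in files)
print("examples per class:", class_counts)  # reveals unbalanced categories

for path in random.sample(files, min(20, len(files))):
    header = path.read_bytes()[:8]
    is_png = header.startswith(b"\x89PNG")
    is_jpg = header.startswith(b"\xff\xd8")
    if is_png and path.suffix.lower() in {".jpg", ".jpeg"}:
        print("PNG stored with a JPG extension:", path)
    elif is_jpg and path.suffix.lower() == ".png":
        print("JPG stored with a PNG extension:", path)
```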

While most open source tools do not provide this essential feature, Labelbox is a repository of labeled data that allows you to visually browse and manage your data in one place.

While the QA/QC methods of category (ii) are extremely useful, they have two implicit drawbacks. First, they are inherently unscalable since the resources of the job provider, or data scientist, to evaluate the accuracy of crowdsourced labels is finite. Second, in order to perform these methods, the correct answers must already be known.

Consensus, on the other hand, is both inherently scalable and useful when the correct answers are unknown.

Consensus requires multiple annotators to provide labels for the same piece of data. With that information, consensus computes Intersection over Union (IoU) to average out idiosyncrasies across labelers and better isolate the underlying signal.

In other words, the answers to the same question are compared to determine the rate of agreement. High agreement is indicative of a high-quality dataset, while low agreement typically points to poor label quality, though it can also indicate ambiguous examples.
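For bounding boxes, a consensus check might look something like the sketch below: compute pairwise IoU between annotators' boxes for the same object and average it into an agreement score. The (x_min, y_min, x_max, y_max) box format and the example coordinates are assumptions for illustration.

```python
# A hedged sketch of consensus via IoU: several annotators box the same object,
# and the mean pairwise IoU serves as an agreement score for that example.
from itertools import combinations

def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Three annotators labeling the same object on the same image (illustrative values).
boxes = [(10, 10, 110, 110), (12, 8, 108, 112), (11, 11, 109, 108)]

pairwise = [iou(a, b) for a, b in combinations(boxes, 2)]
agreement = sum(pairwise) / len(pairwise)
print(f"mean pairwise IoU: {agreement:.2f}")  # high -> agreement; low -> bad labels or an ambiguous example
```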

Labelbox offers a built-in consensus tool so you can monitor your quality metrics in real-time. Read more about how the Labelbox Consensus tool works here.

Diminishing Marginal Returns

Google published a study showing that even when you think you have enough data, adding more can make your model perform even better (The Unreasonable Effectiveness of Data). And yet, the answer is more complicated than "more is always better."

The core question to ask is not whether you have enough data, but whether you have hit the efficient frontier where the marginal costs of labeling exceed the marginal gains in model performance.

To visualize this, plot the model's performance on held-out evaluation data as you grow the training set. For example, start with 1000 samples to train your model and evaluate it on 200 held-out samples to measure your starting accuracy.

Then collect another 1000 samples and repeat the experiment with the combined set. The model is expected to do better with 2000 examples because it is learning to see natural variations in the data and filter out idiosyncrasies, attuning more closely to the signal.
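A simple way to run this experiment is sketched below, with synthetic data and scikit-learn standing in purely for illustration: train on progressively larger slices of your labeled pool, evaluate each model on the same held-out set, and watch where the accuracy curve starts to flatten. That's where the marginal gain from more labels stops justifying the marginal cost.

```python
# A minimal sketch of the experiment described above (synthetic data for illustration):
# train on growing slices of the labeled pool and always evaluate on the same held-out set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5200, n_features=20, random_state=0)
X_eval, y_eval = X[:200], y[:200]    # fixed held-out evaluation set
X_pool, y_pool = X[200:], y[200:]    # pool of labeled training data

for n in (1000, 2000, 3000, 4000, 5000):
    model = LogisticRegression(max_iter=1000).fit(X_pool[:n], y_pool[:n])
    print(f"{n:>5} training samples -> eval accuracy {model.score(X_eval, y_eval):.3f}")
```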

Workflow Transparency

It’s commonplace to use a labeling service in which you outsource raw data and get labeled data in return. However, if you’re outsourcing your data labeling—but have no way of measuring the quality of the labeling service—you’re essentially gambling with your investment.

Outsourced labeling services can be a good go-to for basic object classification tasks, like labeling cars or dresses. If you need to assemble a large labeling workforce around a specific subject matter, there are different Business Process Outsourcing (BPO) firms that can accommodate particular specialized knowledge categories.

Through Labelbox, you can connect with our partner BPOs, monitor the quality of your outsourced data labeling services, and create and manage your own workflow all on a single unified platform.

To Sum it Up, Clean it Up

Your model is only as good as your training data. Now that you know how to ensure that your training data is consistently labeled, accurate, and sufficient in size, go clean it up!

Discuss this post on Hacker News

Originally published at medium.com on November 12, 2018. Follow Labelbox to stay at the forefront of training data management.
