How to Scale Training Data – Complete Guide

An inside look at how to outsource without compromising data quality

In order for data science teams to outsource annotation to a managed workforce provider — also known as a Business Process Outsourcer (BPO) — they must first have the tools and infrastructure to store and manage their training data.

Data management tools and infrastructure should support R&D product management teams, outsourced labeling teams, and internal labeling and review teams, working together in a single centralized place with fully transparent oversight.

Scaling with Subject Matter Expertise

There’s a direct relationship between the volume of your training data and the size of your annotation team. The alternative to scaling your annotation workforce through outsourcing is hiring an internal team of labelers. While this is expensive, it is sometimes the only option.

For example, scaling sensitive training data, such as medical data with HIPAA protections, might require a solely internal labeling workforce. Medical data such as CT scans, for instance, would need to be labeled by radiologists with the medical expertise to properly interpret it.

The concern with outsourcing labeling work requiring subject matter expertise is that a BPO won’t be able to provide specialized labelers. While there is good reason to be skeptical about outsourcing complex or niche datasets, BPOs cover a surprisingly wide spectrum of subject matter expertise, and with a little bit of research, you might find one that offers a specialized annotation service capable of labeling your dataset at a fraction of the cost it would take to hire an internal team.

Grant Osborne, Chief Technology Officer at Gamurs, a comprehensive esports community platform powered by AI, describes his decision-making process behind using Labelbox’s outsourcing feature to scale annotations within the competitive gaming industry.

Gamurs is developing an AI coach for professional video game players. The AI coach will learn from examples in which players are underperforming and suggest ways to enhance the gamer’s performance.

Grant originally considered crowdsourcing gamers from their large social media following to label their favorite games. At first, he looked into a number of popular crowdsourcing tools but quickly rejected this option because those tools generate their revenue from annotation volume: since they charge based on the number of bounding boxes, the pricing structure is impractical at scale.

He then considered building a cheap in-house tool and hiring an internal team of labelers, until he spoke with Brian Rieger, Co-founder and Chief Operating Officer at Labelbox. Gamurs needed a platform for uploading and managing images of multiple games with object detection.

In contrast to other commercial labeling tools, Labelbox’s pricing structure is based on a three-tiered system: Free, Business, and Enterprise. The subscription tiers are categorized by number of ML projects and dataset size. These tiers vary in price and access to certain platform features. Gamurs entrusted Labelbox to support the productionization of their AI applications not only because of its compatible pricing structure but also because of its developer-focused API.

Unsurprisingly, Grant was initially dubious about outsourcing specialized gaming actions on Dota2 or League of Legends to a BPO.

“We wanted to have an internal labeling team because the computer actions are complicated. How are we going to have an external company used to labeling simple objects, like stop signs and trees, label our games? However, Labelbox’s BPO partners told us to just send over a manual and they’d handle getting a dedicated annotation team up to speed.”

“Labelbox recommended two BPOs that would best fit our needs and said there were more if we were interested. The BPOs estimated that it would take ~3–4 weeks to get everyone completely trained. While this estimation was a bit optimistic for how complicated the material is, they were able to finish the training cycle in ~4–5 weeks.”

Despite the drastically different cost quotes from the two BPOs (with one at 1.5–2 cents per bounding box and the other at 10–12 cents per bounding box), Gamurs still decided to use a mixture of both BPOs, with a 20-person labeling team from the first and a 10-person labeling team from the second.
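To put the gap between the two quotes in perspective, here is a quick cost comparison using the per-box rates above. The dataset size (one million bounding boxes) is a hypothetical figure chosen for illustration, not a number from Gamurs:

```python
# Hypothetical cost comparison using the per-bounding-box rates quoted above.
# The dataset size of 1,000,000 boxes is an assumption for illustration only.
def labeling_cost(num_boxes, cents_per_box):
    """Total labeling cost in dollars for a flat per-bounding-box rate."""
    return num_boxes * cents_per_box / 100

num_boxes = 1_000_000
low_quote = labeling_cost(num_boxes, 1.5)    # cheaper BPO, low end of its range
high_quote = labeling_cost(num_boxes, 12.0)  # pricier BPO, high end of its range

print(f"Cheaper BPO at 1.5 cents/box: ${low_quote:,.0f}")   # $15,000
print(f"Pricier BPO at 12 cents/box:  ${high_quote:,.0f}")  # $120,000
```

At this scale, the spread between the low and high quotes is roughly an order of magnitude, which explains why Gamurs weighed each BPO’s strengths rather than simply picking the cheaper one.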

“We will probably do a combination of BPOs based on their strengths per game. We will get them to do consensus and if one BPO is better at quality assurance but slower at labeling, we will use them to cross review the other team’s work.”
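The cross-review workflow Grant describes — having one team check the other’s boxes — can be sketched in code by comparing the two teams’ bounding boxes and flagging pairs whose overlap (intersection-over-union) falls below an agreement threshold. This is a minimal illustration under assumed conventions (corner-coordinate box format, a 0.5 threshold), not Labelbox’s actual consensus implementation:

```python
# Minimal consensus check between two teams' bounding boxes.
# Boxes are (x_min, y_min, x_max, y_max); the 0.5 IoU threshold is an
# arbitrary choice for illustration, not a Labelbox default.
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def flag_disagreements(team_a, team_b, threshold=0.5):
    """Return indices of box pairs where the two teams disagree."""
    return [i for i, (a, b) in enumerate(zip(team_a, team_b))
            if iou(a, b) < threshold]

# Example: the second pair barely overlaps, so it is flagged for review.
team_a = [(0, 0, 10, 10), (20, 20, 30, 30)]
team_b = [(1, 1, 10, 10), (28, 28, 40, 40)]
print(flag_disagreements(team_a, team_b))  # [1]
```

Flagged pairs would then be routed to the team that is stronger at quality assurance, as Grant suggests.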

Scaling with Data Quality

The inverse misconception of outsourcing subject matter expertise is believing that, when it comes to annotating an extremely simple dataset, all human labelers are equal.

This perspective often downplays the importance of data quality in labeling. Check out our What’s a Pumpkin? section to learn how training a deep convolutional object detection model to identify something as simple as a pumpkin is actually much more complex than you might guess. Even with simple labeling tasks, to ensure data quality, you must be able to oversee the consistency and accuracy of labels across annotators and over time.

Labeling at scale without compromising data quality requires transparency throughout your labeling pipeline. Data science teams that outsource using locally run, in-house tools often send data to several different annotation services where labeling happens locally, sometimes across a variety of countries, and must rely on these labelers to send the files back via email or to do uploading acrobatics via Dropbox.

Consequently, the data becomes fragmented, disorganized, and difficult to manage, leaving it vulnerable to problems in security, quality, and data management. In order to monitor the labeling accuracy and consistency of outsourcing services in real time, companies like SomaDetect have switched from managing their annotation workforce on a homegrown tool to managing it through Labelbox.

Labelbox is best-in-class for integrating your in-house labeling and review teams with your outsourcing team in one centralized place.

Not all Labelers are Equal

The factors that differentiate outsourcing providers go far beyond the subject matter expertise they service. Labelbox has hand-selected the top BPO firms based on the following criteria:

  • Pricing transparency
  • Quality customer service
  • Diversity in company size, regions of service, range of skills, and styles of engagement

We spoke with Michael Wang, Computer Vision Engineer at Companion Labs. He talked about his experience outsourcing on Labelbox with one of our recommended BPO partners. He explained why outsourcing with a dedicated team of labelers, as opposed to crowdsourcing random human labelers, produces higher quality training data:

Before choosing Labelbox, Companion Labs compared it to a leading competitor, trying out both labeling service APIs and measuring the quality, time, and effort required to label their project. Michael said that Labelbox has a higher-quality outsourcing pool than the well-known competitor, which crowdsources.

When asked how he chose whom to work with among Labelbox’s partner BPOs, he explained that Labelbox provided two recommendations, which he evaluated on both quality and cost metrics. “Both providers were pretty amazing in terms of quality so choosing came down to the cost.”

Outsource on Labelbox

Managed workforce services are often an instrumental part of making an AI project successful. Therefore, we at Labelbox want to enable managed workforce providers to render their services in as frictionless a way as possible.

With Labelbox, teams of data scientists, annotators, and product managers can transparently manage everything from small projects and experiments to very large projects, all on a single platform. Our focus is on making our customers as successful as they can be with their AI projects. Our customers are businesses of all sizes building and operating AI.

We have worked with a lot of managed workforce providers, and it’s clear to us that the best providers stand out from the rest in the services they provide and the customer-centric nature of their businesses. We have hand-selected BPO partners so our customers can have high-quality labeling services rendered directly inside their Labelbox projects.

An Effortless Two-step Process

On Labelbox, your internal and outsourced labelers can seamlessly work together on a labeling project. It’s so cohesive that there’s literally no seam between the two!

  1. Contact one of our Workforce Partners listed here.
  2. Share your Project with them by adding their “Firm Provider ID” (Provided by the Workforce Partner).

That’s really it! Your Project will show up as a shared Project in the Workforce Partner’s Labelbox account where they will be able to add and manage their own labelers in your project. They will have access to annotate, review annotations, and manage their labelers. The best part is that your internal team will be able to monitor their performance with complete transparency. For more information, check out our docs.

Discuss this post on Hacker News and Reddit

Originally published on December 13, 2018. To stay at the forefront of training data management, follow Labelbox for more relevant content.

Our team has been at the forefront of Artificial Intelligence and Machine Learning research for more than 15 years and we're using our collective intelligence to help others learn, understand and grow using these new technologies in ethical and sustainable ways.
