As someone who has designed neural networks for real-time use on mobile devices, I have always considered state-of-the-art NLP models like transformers off limits.
Standard configurations of a model like BERT can easily tip the scales at over 1 GB, with larger variants 10 to 100 times bigger still. Recent advances in distillation and quantization, however, motivated me to reconsider my assumption that state-of-the-art NLP was accessible only to those with multiple high-end GPUs.
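Those two techniques also compose nicely. As a minimal sketch (not this post's own pipeline), here is how one might load a distilled BERT variant from HuggingFace and shrink it further with PyTorch dynamic quantization; the `distilbert-base-uncased` checkpoint and the size-measuring helper are illustrative choices, not anything prescribed by the post.

```python
# Sketch: combining distillation (a pre-distilled checkpoint) with
# post-training dynamic quantization. Names and sizes are illustrative.
import os

import torch
from transformers import AutoModelForSequenceClassification

# DistilBERT is a distilled BERT with roughly 40% fewer parameters.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased"
)

# Dynamic quantization stores the Linear layers' weights as int8,
# roughly quartering their footprint with little accuracy loss.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_on_disk_mb(m: torch.nn.Module) -> float:
    """Serialize the state dict and report its size in megabytes."""
    torch.save(m.state_dict(), "tmp.pt")
    mb = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return mb

print(f"fp32: {size_on_disk_mb(model):.0f} MB")
print(f"int8: {size_on_disk_mb(quantized):.0f} MB")
```

Stacked like this, the distilled-then-quantized model can land an order of magnitude below the 1 GB figure above, which is what puts on-device inference back on the table.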