Industry data often includes non-numeric data with many possible values, for example zip codes, medical diagnosis codes, or preferred footwear brand. These high-cardinality categorical features contain useful information, but incorporating them into machine learning models is a bit of an art form.
I’ve been writing a series of blog posts on methods for these features. Last episode, I showed how perturbed training data (stochastic regularization) in neural network models can dramatically reduce overfitting and improve performance on unseen categorical codes [1].
In fact, model performance for unseen codes can approach that of known codes when hierarchical information is used with stochastic regularization!
Here, I use visualizations and SHAP values to “look under the hood” and gain some insights into how entity embeddings respond to stochastic regularization. The pictures are pretty, and it’s cool to see plots shift as data is changed. Plus, the visualizations suggest model improvements and can identify groups that might be of interest to analysts.