Backorder Prediction
Prediction of Backorder by Machine Learning Models
Introduction
An Inventory Management System was built to manage and track product records. As demand increased, some products went out of stock, yet stocking larger quantities was risky: if orders fell short, the business would suffer losses. Customers were therefore offered the option to purchase a product immediately and have it delivered later, on an estimated date. These kinds of orders are called backorders.
Table of contents
- Business problem
- Machine Learning formulation of Business Problem
- Existing solutions
- Dataset overview
- Performance Metric
- Exploratory Data Analysis
- Pre-Processing and Feature Engineering
- Data-Modelling
- Model Evaluation
- Model comparison
- Further Improvements
- References
1. Business problem:
When demand rises and maintaining a proper demand-supply ratio becomes difficult, predicting backorders becomes an important component for supply chain companies and product-based businesses in general. Moreover, frequent backorders drive customers to look for alternatives and buy the product from competitors.
Data source : https://www.kaggle.com/c/untadta/data
2. Machine Learning formulation of business problem:
Backorders are unpopular with customers, and they also strain the transportation system, the production system, and the overall management of the product. A proper prediction mechanism is therefore needed to forecast how many such orders may arrive. We can use the product's historical data from the Inventory Management System to build a model that predicts backorders.
Business constraints
● Random demand may arise, which has no past history in the dataset.
● A prediction for a product on a given day may tolerate a delay of at most 1 or 2 days.
3.Existing Solutions:
- Predicting Material Backorders in Inventory Management using Machine Learning- A Research paper by Rodrigo Barbosa de Santis, Eduardo Pestana de Aguiar and Leonardo Goliatt
The authors elaborate the problem statement, its root cause, and a solution to it. They handle the imbalanced dataset with two kinds of approaches: external approaches, i.e. undersampling and oversampling techniques such as random undersampling and SMOTE (Synthetic Minority Oversampling Technique), and internal approaches using models such as Logistic Regression, Random Forest, GBoost, and Blagging. Among the models experimented with, GBoost and Blagging performed well thanks to their higher generalisation capacity, with GBoost achieving the highest scores.
- Prediction of probable backorder scenarios in the supply chain using Distributed Random Forest and Gradient Boosting machine learning techniques - a research paper by Samiul Islam & Saman Hassanzadeh Amin
URL:https://journalofbigdata.springeropen.com/articles/10.1186/s40537-020-00345-2#Tab4
The authors of this research apply a somewhat different approach: a ranged method on the features to deal with changing patterns in the data caused by human or machine error, which yielded about 20% better results than the raw dataset. Hypothesis testing was also used at the initial step to choose the machine learning model appropriately. To deal with the imbalanced dataset, they applied both SMOTE and random oversampling. For prediction, they divided the data into five ranges: very-low, low, moderate, high, and very-high, converting numerical features into categorical ones. When the inventory level is converted into range groups, it can be treated as an isolated node, since the actual value of the feature is not changed.
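A minimal sketch of the ranged idea described above: a numeric inventory level can be binned into ordered categories with pandas. The bin edges used here are made up purely for illustration; the paper chooses its own ranges.

```python
import pandas as pd

# Hypothetical inventory levels; the bin edges below are illustrative only.
inventory = pd.Series([0, 3, 12, 45, 160, 900])
ranges = pd.cut(inventory,
                bins=[-1, 5, 20, 100, 500, float("inf")],
                labels=["very-low", "low", "moderate", "high", "very-high"])
print(ranges.tolist())
```

Once binned, the feature behaves like any other categorical column, and small fluctuations from human or machine error no longer change its value.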
4.Dataset Overview:
We have 23 columns in both the Train and Test datasets:
Sku(Stock Keeping unit) : Primary key
National_inv : The present inventory level of the product
Lead_time : Transit time of the product
In_transit_qty : The amount of product in transit
Forecast_3_month : The forecast of sales for coming 3 months
Forecast_6_month : The forecast of sales for coming 6 months
Forecast_9_month : The forecast of sales for coming 9 months
Sales_1_month : Sales of a particular product for the last 1 month
Sales_3_month : Sales of a particular product for last 3 months
Sales_6_month : Sales of a particular product for last 6 months
sales_9_month : Sales of a particular product for last 9 months
Min_bank : Minimum amount of stock recommended
Potential_issue : Any problem identified in the product/part
Pieces_past_due: Amount of parts of the product overdue if any
Perf_6_month_avg : Product performance over past 6 months
Perf_12_month_avg : Product performance over past 12 months
Local_bo_qty : Amount of stock overdue
Deck_risk : Yes or No
Oe_constraint: Yes or No
Ppap_risk: Yes or No
Stop_auto_buy: Yes or No
rev_stop : Yes or No
Went_on_backorder : Target or dependent variable
5.Performance metric:
It is important to choose an appropriate performance metric. To evaluate models on this imbalanced dataset, we will use the Area Under the Curve (AUC) and the F1 score.
Area Under the Curve: AUC measures the entire two-dimensional area underneath the ROC curve (think integral calculus) from (0,0) to (1,1). The ROC curve is a probability curve that plots the True Positive Rate on the y-axis against the False Positive Rate on the x-axis.
F1 score: the harmonic mean of precision and recall.
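Both metrics can be computed with scikit-learn. The labels and scores below are a toy example, not from the actual dataset: AUC is computed from predicted probabilities directly, while F1 first needs the scores thresholded into hard labels.

```python
from sklearn.metrics import roc_auc_score, f1_score

# Toy data, illustrative only: y_true is the ground truth
# (1 = went on backorder), y_score a model's predicted probability.
y_true  = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.10, 0.20, 0.15, 0.30, 0.80, 0.70, 0.75, 0.90, 0.05, 0.20]

# AUC is computed from the scores directly.
auc = roc_auc_score(y_true, y_score)

# F1 needs hard labels, so threshold the scores at 0.5 first.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
f1 = f1_score(y_true, y_pred)

print(f"AUC={auc:.3f}  F1={f1:.3f}")
```

On an imbalanced dataset, both metrics are far more informative than plain accuracy, which a majority-class predictor can game.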
6.Exploratory Data Analysis:
With the help of EDA, we summarize and visualize information from the dataset. It helps us understand correlations and patterns in the data through better graphical representations.
There are 15 numerical features and 8 categorical features, one of which is the categorical target. This dataset is highly imbalanced.
- The SKU is a primary key, which is of no use in backorder prediction, so we need to drop this column.
- There are null values in lead_time.
- perf_6_month_avg and perf_12_month_avg contain the negative placeholder value -99, which needs to be fixed.
- The last row consists of NaN in every column, so it is of no use and should be removed.
- Categorical features have two categories, 'Yes' and 'No', which need to be represented as numbers (preferably 1 and 0) before fitting the model.
- The difference between the 75th-percentile value and the maximum value is huge, which indicates the presence of outliers.
- In the perf_6_month_avg and perf_12_month_avg columns, the lowest observed value is -99, which needs to be replaced.
- Here, we can observe that in_transit_qty is highly correlated with sales, min_bank, and forecast; earlier, we saw that sales and forecast are also highly correlated.
- When sales go up, the transport cost also goes up.
- Min_bank is highly correlated with sales and forecast, as stock levels are closely tied to sales and forecast.
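The data-quality findings above can be checked programmatically. The snippet below runs the same checks on a tiny mock frame (the real dataset has ~23 columns; the column names and values here are chosen only to reproduce the issues found during EDA).

```python
import numpy as np
import pandas as pd

# Tiny mock of the inventory data, illustrative only.
df = pd.DataFrame({
    "sku":               ["A1", "A2", "A3", None],
    "national_inv":      [12.0, 0.0, 55.0, np.nan],
    "lead_time":         [8.0, np.nan, 2.0, np.nan],
    "perf_6_month_avg":  [0.92, -99.0, 0.80, np.nan],
    "went_on_backorder": ["No", "Yes", "No", None],
})

# Null counts per column flag lead_time as incomplete.
null_counts = df.isnull().sum()

# The -99 placeholder shows up as the minimum of the perf column.
perf_min = df["perf_6_month_avg"].min()

# The all-NaN trailing row is visible when every value in it is null.
last_row_all_nan = df.iloc[-1].isnull().all()

print(null_counts, perf_min, last_row_all_nan)
```

For correlations, `df.corr()` (optionally rendered as a heatmap) surfaces the sales/forecast/min_bank relationships described above.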
Summarizing the numerical features
Lead Time
- lead_time is plotted on the x-axis and its frequency on the y-axis.
- Most lead times fall between 0 and 20, so the missing values in lead_time can be imputed with the feature's median or removed entirely.
Sales and went_on_backorder Plots
- The boxplot clearly shows that backorders happened when sales were low. In fact, among all the features, sales is one of the most important for predicting backorders.
Forecast and went_on_backorder plot
- Backorders only happened when the forecast was below 0.5 for all three forecast features.
Sales and Forecast plot
- In this plot, the red points denoting backorders cluster where sales are below 5,000 and the forecast is below 25,000, which is overall very low.
Bar plot to showcase categorical features with respect to backorder
- If the potential_issue feature is 'Yes', there is a higher chance of the product going on backorder.
- If the oe_constraint feature is 'Yes', there is a higher chance of the product going on backorder.
- If the rev_stop feature is 'Yes', there is no chance of the product going on backorder.
- The rest of the categorical features did not show much effect.
7. Pre-Processing and Feature Engineering
Removing SKU, as it is of no use in prediction.
Removing NaN values from lead_time.
Removing the all-NaN last row.
Replacing the negative -99 values in perf_6_month_avg and perf_12_month_avg with 0.
Encoding the categorical features.
Replacing the NaN values present in the perf_6_month_avg and perf_12_month_avg columns with the median value, i.e. 0.8; we cannot use the mean, as it is affected by outliers.
Normalisation of the numerical features
By calling the Minmaxer() helper, we normalise the numerical features. We keep only the perf_6_month_avg feature, since both perf columns have a similar effect, and we keep the 1-month and 9-month sales features, since 3-month sales is highly correlated with the other features.
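The steps above can be sketched as a single preprocessing function. This is an approximation, not the author's exact code: the Minmaxer() helper is stood in for by scikit-learn's MinMaxScaler, and the column names are assumed to match the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess(df):
    """Sketch of the preprocessing steps described above (assumed columns)."""
    df = df.dropna(how="all")                 # drop the all-NaN last row
    df = df.drop(columns=["sku"])             # primary key, no signal
    df["lead_time"] = df["lead_time"].fillna(df["lead_time"].median())
    for col in ["perf_6_month_avg", "perf_12_month_avg"]:
        df[col] = df[col].replace(-99.0, 0.0)  # fix the -99 placeholder
    # Encode Yes/No categoricals as 1/0.
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].map({"Yes": 1, "No": 0})
    # Normalise all numeric columns into [0, 1].
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
    return df

# Demo on a tiny mock frame (illustrative values only).
demo = pd.DataFrame({
    "sku":               ["A", "B", "C", None],
    "lead_time":         [8.0, np.nan, 2.0, np.nan],
    "perf_6_month_avg":  [0.9, -99.0, 0.8, np.nan],
    "perf_12_month_avg": [0.9, -99.0, 0.7, np.nan],
    "deck_risk":         ["Yes", "No", "No", None],
})
clean = preprocess(demo)
print(clean)
```

In a real pipeline, the scaler should be fitted on the training split only and reused on the test split, to avoid leaking test statistics into training.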
8.Data Modeling
Here, the dataset is fed to different binary classification models, and their scores are later compared to decide which one is better.
- Feature normalisation is done only for the linear models; tree-based models do not require feature scaling or normalisation.
- Categorical features are encoded from 'Yes' and 'No' to 1 and 0.
- Hyperparameter tuning is done with cross-validation, so that we can find the best parameters and evaluate the model.
- A calibrated classifier is used for the non-linear, tree-based models, whose raw probability estimates need adjustment; the calibrated classifier performs this adjustment using cross-validation internally.
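A minimal sketch of the calibration step, assuming synthetic data and a Random Forest as the tree-based base model: CalibratedClassifierCV wraps the base estimator and adjusts its probability outputs via cross-validated sigmoid (Platt) scaling.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data, illustrative only (not the real dataset).
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

base = RandomForestClassifier(n_estimators=50, random_state=42)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3)
calibrated.fit(X, y)

# Calibrated probability of the positive (backorder) class.
proba = calibrated.predict_proba(X)[:, 1]
```

`method="isotonic"` is the non-parametric alternative when enough data is available.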
Baseline Model
For the baseline model we use a Dummy Classifier, which is especially useful on datasets with class imbalance.
We got an AUC score of 0.5009; the stratified strategy performed best for the baseline. Every other model should beat this baseline. Let us look at the rest of the models.
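The baseline can be sketched as follows, on synthetic imbalanced data rather than the real dataset: a stratified dummy draws labels at random in the training-class proportions, so its AUC hovers near 0.5, which is the floor every real model must beat.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

# Synthetic imbalanced data, illustrative only.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

dummy = DummyClassifier(strategy="stratified", random_state=0)
dummy.fit(X, y)
auc = roc_auc_score(y, dummy.predict_proba(X)[:, 1])
print(f"baseline AUC: {auc:.4f}")
```

Other strategies ("most_frequent", "prior", "uniform") are worth checking too, but for AUC they all land close to 0.5.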
9.Model Evaluation:
Naive Bayes Classifier
Logistic Regression
K-Nearest Neighbour
SVM
Decision Tree
Random Forest
XGBoost Classifier
CATBoost Classifier
LGBM Classifier
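The evaluation of each of these models follows the same fit/score pattern. The loop below is an illustrative sketch on synthetic data using a subset of the models above; the boosted models (XGBoost, CatBoost, LightGBM) plug into the same loop via their scikit-learn-compatible wrappers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data, illustrative only.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

models = {
    "Naive Bayes":         GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree":       DecisionTreeClassifier(random_state=1),
    "Random Forest":       RandomForestClassifier(n_estimators=100, random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    f1 = f1_score(y_te, model.predict(X_te))
    scores[name] = (auc, f1)
    print(f"{name:20s} AUC={auc:.3f}  F1={f1:.3f}")
```

Scoring every model on the same held-out split keeps the comparison in the next section fair.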
10. Model Comparison
Let us compare all the models and decide which one to use for deployment.
AUC Table
- By AUC score, Random Forest, the XGBoost classifier, and the CatBoost classifier performed best, each with a test AUC of about 95%.
F1 score Table
- By F1 score, Random Forest and the CatBoost classifier performed best, with scores of 99%, though CatBoost took less time.
As we can see, CatBoost performed better than the others and also took less time, so we select the CatBoost classifier for deployment and save the model with joblib.
Feature Importance by CatBoost Classifier
- In CatBoost, national inventory is the most important feature for predicting backorders.
- The oe_constraint, potential_issue, and rev_stop features showed no importance in CatBoost, nor in any of the other tree-based models.
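Ranking features by importance can be sketched as below. A Random Forest is used here so the example stays self-contained; CatBoost exposes the same idea through `model.get_feature_importance()`. The feature names and data are illustrative, not the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names matching a few dataset columns.
feature_names = ["national_inv", "lead_time", "in_transit_qty", "forecast_3_month"]
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=7)

model = RandomForestClassifier(n_estimators=100, random_state=7).fit(X, y)

# Sort features by their importance score, highest first.
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:20s} {score:.3f}")
```

Features with near-zero importance, like oe_constraint and rev_stop in the article's run, are candidates for removal.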
Model Deployment
from joblib import load

def final(X):
    # X arrives as a single row (a pandas Series) from the input form.
    X = X.to_frame().transpose()
    # Load the trained CatBoost classifier saved earlier with joblib.
    model = load('/content/model.joblib')
    # Apply the same preprocessing used during training.
    preprocessed_data = feature_extract(X)
    y_predictions = model.predict(preprocessed_data)
    return y_predictions
This function takes input from a form and predicts using the CatBoost classifier loaded from model.joblib.
Here is a video on model Deployment.
11. Further Improvement:
- To balance the dataset, we can use class_weight="balanced".
- We can use a different sampling technique, such as random undersampling or SMOTE.
- We can apply the ranged technique to some of the features (e.g. low, high, very high); the ranged method deals with changing patterns in the data caused by human or machine error and should improve efficiency.
- If possible, we can train the model on more data to improve the predictions.
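The class-weight idea can be sketched as follows, on synthetic imbalanced data: `class_weight="balanced"` re-weights the minority class during training, typically trading some precision for better minority-class recall. (SMOTE, the oversampling alternative mentioned above, comes from the separate imbalanced-learn package.)

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data with heavy class overlap, illustrative only.
X, y = make_classification(n_samples=3000, weights=[0.95, 0.05],
                           class_sep=0.5, random_state=3)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# Recall on the minority (backorder) class for both variants.
recall_plain = recall_score(y, plain.predict(X))
recall_balanced = recall_score(y, balanced.predict(X))
print(f"recall plain={recall_plain:.3f}  balanced={recall_balanced:.3f}")
```

CatBoost offers the analogous knobs via its class-weighting parameters, so this improvement fits the deployed model as well.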
12. References:
- https://journalofbigdata.springeropen.com/articles/10.1186/s40537-020-00345-2
- https://www.researchgate.net/publication/319553365_Predicting_Material_Backorders_in_Inventory_Management_using_Machine_Learning
- https://www.researchgate.net/publication/228084510_Combining_Bagging_and_Boosting
- https://sci-hub.se/10.1016/j.eswa.2015.12.032
- https://catboost.ai/docs/concepts/python-reference_catboost_get_feature_importance.html
- https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
Github Repository:
Linkedin Profile: