Methods to Increase Model Performance
With data being generated at an explosive rate every day, many companies use analytics to turn that data into insights and solve business problems. In 2020, NewVantage Partners conducted an annual survey of 85 large corporations. The results show that 99% of respondents had already gained measurable value from big data. Our practicum organization, Angel Flight West, is also using data to drive actionable business insights that will help the organization grow.
Last quarter, our team used machine learning techniques to build models and predict outcomes that benefit the organization. One of the main challenges I overcame was increasing model accuracy. Here, I want to share techniques that can be used to increase model performance in general, and how the techniques I chose affected our model's performance.
In this blog post, I will cover:
- How to increase model performance?
- How our chosen solution affected model performance
How to increase model performance?
The model development process goes through several stages, from data cleaning to model building, and many factors along the way can affect model accuracy. Below are some commonly used methods for improving the accuracy of a model:
1. Deal with missing data and outlier values
Missing and outlier values in the training data can reduce model accuracy and lead to incorrect predictions. A common remedy is to replace missing values and outliers with the median (for numeric columns) or the mode (for categorical columns). If more than 30% of a column's values are missing, it is usually advisable to drop the whole column. Treating missing and outlier values correctly is essential.
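As a rough illustration of the steps above, here is a minimal pandas sketch. The 30% threshold comes from the text; the 1.5 x IQR rule for detecting outliers, the helper name clean_numeric, and the toy data are my own assumptions for illustration, not the exact procedure we used.

```python
import numpy as np
import pandas as pd


def clean_numeric(df: pd.DataFrame, max_missing: float = 0.30) -> pd.DataFrame:
    """Drop sparse columns, then replace outliers and missing values with the median."""
    # Drop columns whose share of missing values exceeds the threshold.
    keep = df.columns[df.isna().mean() <= max_missing]
    out = df[keep].copy()

    for col in out.select_dtypes(include="number").columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed outlier rule
        median = out[col].median()
        # Mask values outside the fences as missing, then fill every gap with the median.
        out[col] = out[col].where(out[col].between(low, high))
        out[col] = out[col].fillna(median)
    return out


# Hypothetical toy data: one outlier, one missing value, one mostly empty column.
toy = pd.DataFrame({
    "passengers": [1, 2, 2, 3, 2, 40],
    "distance": [100.0, np.nan, 120.0, 110.0, 130.0, 115.0],
    "lead_time": [np.nan, np.nan, np.nan, 5.0, np.nan, np.nan],
})
cleaned = clean_numeric(toy)
print(cleaned)
```

Categorical columns would need the mode instead of the median; the sketch only handles numeric columns to stay short.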
2. Feature selection
Feature selection is the step of finding the set of variables that best explains the relationship between the independent and dependent variables. We can select features based on our domain knowledge or on visualizations produced during the exploratory data analysis phase. Choosing the most relevant variables helps the model perform better.
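Domain knowledge and visualizations can be complemented by a quick numeric check. The sketch below ranks features by their absolute correlation with the target. This is a swapped-in, simple technique chosen for illustration, and the function name rank_features and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd


def rank_features(df: pd.DataFrame, target: str) -> pd.Series:
    """Rank numeric features by absolute Pearson correlation with the target."""
    corr = df.corr()[target].drop(target).abs()
    return corr.sort_values(ascending=False)


# Synthetic data: "signal" drives the target, "noise" is unrelated to it.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "signal": signal,
    "noise": rng.normal(size=n),
    "target": (signal + 0.1 * rng.normal(size=n) > 0).astype(int),
})

ranking = rank_features(toy, "target")
top_features = ranking.head(1).index.tolist()
print(ranking)
```

Correlation only captures linear relationships, so a low score is a prompt to look closer at a variable, not proof that it is useless.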
3. Algorithm tuning
Parameter tuning (often called hyperparameter tuning) is the process of finding the best value for each parameter to improve model accuracy. Different machine learning methods have different parameters. By setting a range for each parameter and trying out the values in it, we can find the combination that gives the highest model accuracy.
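The "set a range and try each value" idea can be sketched with scikit-learn's GridSearchCV. The decision tree, the parameter grid, and the synthetic dataset here are assumptions chosen to keep the example small; they are not the model or settings from our project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for a real training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate values to try for each parameter (assumed ranges).
param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_leaf": [1, 5, 10],
}

# Evaluate every combination with 5-fold cross-validation, scored by accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Cross-validated accuracy is used instead of accuracy on the training data so that the chosen values are less likely to simply reflect overfitting.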
How our chosen solution affected model performance
This quarter, I mainly worked on the Easy to Fill model. The objective of this model is to identify whether a mission is Easy to Fill. With this prediction, coordinators can spend more of their time on missions that are more likely to be canceled because no pilot is available, or that are otherwise difficult to fill.
I first updated the dataset to cover October 1st, 2019 to the present and dealt with missing and outlier values. The data originally contained 34 variables, including mission type, passengers, and mission date. After examining each variable and its data type, I removed all the ID columns (mission_leg_id, passenger_id, etc.) along with year, to_city, and from_city to avoid overfitting. I also converted NAs in the columns to 0 and created dummy variables for the age column to account for its missing values. I dropped the lead time column, in which more than 80% of the values were null. I ultimately kept 25 variables for further analysis. Based on MIP and our team’s domain knowledge, we removed six more unnecessary variables, including from_airport_distance and to_airport_distance. After feature selection, I tuned the parameter and chose its best value. In the end, I increased our model's accuracy by 2%.
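To make these preparation steps concrete, here is a hedged pandas sketch on made-up data. The column names are taken from the description above, but the values, the helper name prepare, the lead_time spelling, and the exact order of operations are assumptions and not our actual pipeline.

```python
import numpy as np
import pandas as pd


def prepare(df: pd.DataFrame, drop_cols: list, max_null: float = 0.80) -> pd.DataFrame:
    """Drop ID-like columns and mostly-null columns, flag missing ages, fill NAs with 0."""
    out = df.drop(columns=[c for c in drop_cols if c in df.columns])
    # Drop columns in which more than max_null of the values are null.
    out = out.loc[:, out.isna().mean() <= max_null].copy()
    # Record which ages were missing before they are overwritten with 0.
    if "age" in out.columns:
        out["age_missing"] = out["age"].isna().astype(int)
    return out.fillna(0)


# Hypothetical rows; real data would come from the mission database.
toy = pd.DataFrame({
    "mission_leg_id": [101, 102, 103, 104, 105, 106],
    "passenger_id": [1, 2, 3, 4, 5, 6],
    "from_city": ["A", "B", "A", "C", "B", "A"],
    "age": [34.0, np.nan, 61.0, np.nan, 8.0, 45.0],
    "lead_time": [np.nan, np.nan, np.nan, np.nan, np.nan, 3.0],
    "passengers": [1, 2, 1, 3, 2, 1],
})
id_like = ["mission_leg_id", "passenger_id", "year", "to_city", "from_city"]
prepared = prepare(toy, id_like)
print(prepared)
```

Creating the age_missing flag before filling matters: once the NAs become 0, the model can no longer tell a missing age from a real value without the flag.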
A 2% gain may seem negligible. However, in a period of uncertainty due to COVID-19, improving model accuracy at all is remarkable.
I hope this blog post helps you increase your own model performance. I will keep improving my technical skills and posting updates. I am looking forward to the winter quarter's work on our practicum project!