Using Machine Learning to Predict Bus Ticket Sales

Introduction

Sales are the lifeblood of a business. Sales forecasting is, therefore, a crucial part of any business’s financial planning.

In this tutorial, we look at one specific sales forecasting example: public bus ticket sales. Accurately predicting demand for bus tickets enables bus operators to make informed decisions about their operations. They can gauge customer demand for particular rides, manage resources and vehicles more efficiently, promote and sell additional services such as micro-insurance more effectively, and improve customer service by communicating alerts and other useful information efficiently.

The Dataset

The data used in this article comes from Zindi, a data science competition platform with the mission of building the data science ecosystem in Africa. The dataset comprises historic ticket purchasing details for 14 bus routes into Nairobi from October 2017 to April 2018, and includes information such as the place of origin, the scheduled time of departure, the channel used for the purchase, the type of vehicle, the capacity of the vehicle, and the assigned seat number.

Data Pre-processing

Data pre-processing is an important step that covers any processing performed on raw data before modeling. The quality of the data, and of the information that can be derived from it, determines how well a model can learn.

Data pre-processing includes cleaning, normalization, transformation, feature extraction, and feature selection, among others. The end result is the final training data, which forms the experience that the algorithm learns from.

We begin by importing the required libraries and suppressing warnings:

import numpy as np
import pandas as pd
from catboost import CatBoostRegressor
from sklearn import metrics

# suppress all Python warnings, including those raised by pandas
import warnings
warnings.filterwarnings('ignore')

The raw data is read using pandas in order to derive the target variable:

df_raw = pd.read_csv('train_revised.csv')

The dataset consists of a list of tickets each identified by the ride_id. To derive the target variable, we need to aggregate the list to know how many passengers there are for a given ride_id. The aggregate function defined below removes any duplicate data, and groups and counts the number of tickets per ride.

def aggregate():
    # count how many tickets were sold for each ride_id
    ride_id_dict = {}
    for ride_id in df_raw["ride_id"]:
        if ride_id not in ride_id_dict:
            ride_id_dict[ride_id] = 1
        else:
            ride_id_dict[ride_id] += 1

    # drop the ticket-level columns so that rows for the same ride become duplicates
    df_processed = df_raw.drop(['seat_number', 'payment_method', 'payment_receipt', 'travel_to'], axis=1)
    df_processed.drop_duplicates(inplace=True)
    df_processed.reset_index(drop=True, inplace=True)
    df_processed["number_of_tickets"] = np.zeros(len(df_processed))

    # attach the ticket count (the target variable) to each ride
    for i in range(len(df_processed)):
        ride_id = df_processed.loc[i, "ride_id"]
        df_processed.at[i, "number_of_tickets"] = ride_id_dict[ride_id]

    return df_processed
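
The same per-ride count can also be obtained with pandas' groupby, which avoids the explicit Python loops. The following is a minimal sketch on a made-up three-ticket frame; the column values are illustrative and not taken from the Zindi data.

```python
import pandas as pd

# hypothetical ticket-level data: two tickets for ride 1, one for ride 2
tickets = pd.DataFrame({
    "ride_id": [1, 1, 2],
    "seat_number": ["1A", "2B", "1A"],
    "travel_from": ["Kisii", "Kisii", "Migori"],
})

# count the tickets sold per ride
counts = tickets.groupby("ride_id").size().rename("number_of_tickets").reset_index()

# keep one row of ride-level attributes per ride and attach the count
rides = tickets.drop(columns=["seat_number"]).drop_duplicates()
rides = rides.merge(counts, on="ride_id")
print(rides)
```

On large files this vectorized form is usually much faster than looping over rows, and it should give the same counts as the loop-based function.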

The aggregated data is then saved in a comma-separated values (CSV) file format.

aggregate().to_csv('data/train_aggregate.csv', index=False)

To proceed with model training, we read the aggregated training data and list the first five records using pandas' read_csv() and head() functions, respectively.

# read the aggregated CSV file into a pandas DataFrame
df_train_set = pd.read_csv('data/train_aggregate.csv')

# list the first five records of the DataFrame
df_train_set.head()

Feature Engineering

Feature engineering involves defining new features from the existing features in a dataset. The quality of the results produced by a machine learning model depends heavily on the quality of the features used to train it. New features give the model more information about the target variable, which can improve its performance.

We create additional features, namely travel_month, isEndWeek, isMorning, and isEndMonth. These features matter because people tend to travel more in certain months, at certain times of day, and over the weekend.

df_train_set["travel_date"] = pd.to_datetime(df_train_set["travel_date"],
                                             infer_datetime_format=True)
df_train_set["travel_dow"] = df_train_set["travel_date"].dt.dayofweek  # day of week, Monday=0
df_train_set["travel_month"] = df_train_set["travel_date"].dt.month  # extract month
# the leading digits of travel_time give the hour
df_train_set['hour_booked'] = pd.to_numeric(df_train_set['travel_time'].str.extract(r'(^\d*)').loc[:,0])

df_train_set['ToM'] = df_train_set.travel_date.dt.day  # day of the month
# 1 for Saturday (5) and Sunday (6)
df_train_set["isEndWeek"] = df_train_set["travel_dow"].apply(lambda x: 1 if x >= 5 else 0)
# 1 when the hour is 21 or earlier
df_train_set["isMorning"] = df_train_set["hour_booked"].apply(lambda x: 1
                                                              if x <= 21 else 0)
# 1 for days 5 through 27 of the month
df_train_set["isEndMonth"] = df_train_set["ToM"].apply(lambda x: 1 if (x >= 5)
                                                                  and (x < 28) else 0)
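
Because the same derivations have to be applied to more than one DataFrame, one option is to wrap them in a helper function. The sketch below is an assumption about how this could be organized, not the code used for the competition entry: the function name add_time_features and the two-row toy frame are hypothetical, and the thresholds are copied from the code above.

```python
import pandas as pd

def add_time_features(df):
    """Return a copy of df with the calendar features used in this tutorial."""
    out = df.copy()
    travel_date = pd.to_datetime(out["travel_date"])
    # leading digits of a time such as "7:15" give the departure hour
    hour = pd.to_numeric(out["travel_time"].str.extract(r"^(\d+)", expand=False))
    day = travel_date.dt.day

    out["travel_month"] = travel_date.dt.month
    # dayofweek counts Monday as 0, so 5 and 6 are Saturday and Sunday
    out["isEndWeek"] = (travel_date.dt.dayofweek >= 5).astype(int)
    # thresholds below mirror the tutorial code
    out["isMorning"] = (hour <= 21).astype(int)
    out["isEndMonth"] = ((day >= 5) & (day < 28)).astype(int)
    return out

# hypothetical rides: a Saturday morning and a Monday late evening
toy = pd.DataFrame({
    "travel_date": ["2017-10-21", "2017-10-30"],
    "travel_time": ["7:15", "23:10"],
})
features = add_time_features(toy)
print(features[["travel_month", "isEndWeek", "isMorning", "isEndMonth"]])
```

Calling one function on every DataFrame keeps the derived columns consistent and avoids copy-and-paste mistakes between the training and test code.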

The same feature extraction and selection steps are also performed on the test data. The test set is there to check that the model has not overfit, that is, memorized the data it has seen.

We drop travel_date, travel_dow (day of the week), hour_booked, and ToM (time of the month), as these columns were only needed to derive the new features. The axis=1 argument tells pandas to drop columns rather than rows, and inplace=True modifies df_train_set directly instead of returning a copy.

df_train_set.drop(['travel_date','travel_dow', 'hour_booked', 'ToM'], axis=1, inplace=True)
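
The test-set code later in this tutorial encodes car_type and travel_from using category lists named car_type_categories and travel_from_categories. A minimal sketch of how such a list could be captured on the training side is shown here for car_type; the toy frames and their values are hypothetical, and the exact encoding step used for the competition entry is an assumption.

```python
import pandas as pd

# hypothetical training and test frames; the values are illustrative
train = pd.DataFrame({"car_type": ["Bus", "shuttle", "Bus"]})
test = pd.DataFrame({"car_type": ["shuttle", "Bus", "minibus"]})

# learn the category list from the training data and keep it for later
train["car_type"] = pd.Categorical(train["car_type"])
car_type_categories = train["car_type"].cat.categories
train["car_type"] = train["car_type"].cat.codes

# reuse the same list on the test data so the integer codes line up;
# a value never seen in training ("minibus") is encoded as -1
test["car_type"] = pd.Categorical(test["car_type"], categories=car_type_categories)
test["car_type"] = test["car_type"].cat.codes

print(train["car_type"].tolist())
print(test["car_type"].tolist())
```

Reusing the training categories matters because letting pandas infer categories separately for each file could assign different integers to the same label.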

The features and target variables of our training set need to be separated:

# ride_id is an identifier rather than a feature; it is also dropped from the test set
X = df_train_set.drop(["ride_id", "number_of_tickets"], axis=1)
y = df_train_set.number_of_tickets

Model Training

With our data ready, it’s time to train our model. We settled on CatBoost, a machine learning algorithm based on gradient boosting over decision trees. Its main advantages include out-of-the-box support for both numerical and categorical features and, according to its developers, strong prediction quality compared with other gradient boosting decision tree (GBDT) libraries.

Predicting the number of tickets sold is a regression problem. Regression is a technique used to predict continuous (dependent) variables given a set of independent variables.
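
The model in this tutorial is trained and scored with mean absolute error (MAE), the average of the absolute differences between actual and predicted values. A small worked example with made-up numbers shows the arithmetic:

```python
import numpy as np

# made-up ticket counts and predictions, for illustration only
y_true = np.array([10, 4, 7, 1])
y_pred = np.array([8.0, 5.0, 7.0, 4.0])

# absolute errors are 2, 1, 0 and 3, so the mean is 6 / 4
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 1.5
```

An MAE of 1.5 means the predictions are off by one and a half tickets per ride on average.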

params = {'depth': 12, 'iterations': 17000, 'l2_leaf_reg': 10,
          'learning_rate': 0.03, 'random_seed': 1111, 'logging_level': 'Silent',
          'loss_function': 'MAE'}

regressor = CatBoostRegressor(**params)

regressor.fit(X, y)
y_predict = regressor.predict(X)

# error measured on the training data itself
print('Mean Absolute Error - ', metrics.mean_absolute_error(y, y_predict))
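
An error computed on the rows the model was trained on tends to be optimistic. One common way to get a more realistic estimate before submitting is to hold out part of the training data for validation. The sketch below illustrates the idea on synthetic data, with a constant median prediction standing in for the regressor; the data, the 20% split, and the helper name mae are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the aggregated training data (values are made up)
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "ride_id": np.arange(200),
    "number_of_tickets": rng.integers(1, 50, size=200),
})

# hold out 20% of the rides for validation
valid = data.sample(frac=0.2, random_state=0)
train = data.drop(valid.index)

# stand-in "model": always predict the median of the training target,
# which is the constant that minimizes MAE on the training rows
baseline = train["number_of_tickets"].median()

def mae(y_true, y_pred):
    """Mean absolute error between a Series and a prediction."""
    return float(np.mean(np.abs(y_true - y_pred)))

train_mae = mae(train["number_of_tickets"], baseline)
valid_mae = mae(valid["number_of_tickets"], baseline)
print(f"train MAE: {train_mae:.2f}, validation MAE: {valid_mae:.2f}")
```

Any regressor with fit and predict methods, such as the CatBoostRegressor used here, can take the place of the baseline. A validation error far above the training error is a sign of overfitting.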

Prediction on Test Data

As with the training set, we need to prepare the test data before the model can make predictions on it. Performance on unseen data indicates whether the model generalizes.

We read the data using pandas' read_csv() function:

#read test data
df_test_set = pd.read_csv('test_questions.csv')
#get a snapshot of the data
df_test_set.head()

Just like with training data, we subject the test data to the feature engineering process. New features are created, categorical values encoded, and insignificant features dropped.

# create additional features from the test data's own columns
df_test_set["travel_date"] = pd.to_datetime(df_test_set["travel_date"],
                                            infer_datetime_format=True)
df_test_set["travel_dow"] = df_test_set["travel_date"].dt.dayofweek  # day of week, Monday=0
df_test_set["travel_month"] = df_test_set["travel_date"].dt.month  # extract month
df_test_set['hour_booked'] = pd.to_numeric(df_test_set['travel_time'].str.extract(r'(^\d*)').loc[:,0])

df_test_set['ToM'] = df_test_set.travel_date.dt.day  # day of the month
df_test_set["isEndWeek"] = df_test_set["travel_dow"].apply(lambda x: 1 if x >= 5 else 0)
df_test_set["isMorning"] = df_test_set["hour_booked"].apply(lambda x: 1
                                                            if x <= 21 else 0)
df_test_set["isEndMonth"] = df_test_set["ToM"].apply(lambda x: 1 if (x >= 5)
                                                                and (x < 28) else 0)

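# encode categorical features; car_type_categories and travel_from_categories
# are assumed to hold the category lists taken from the training data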
df_test_set["car_type"] = pd.Categorical(df_test_set["car_type"], categories=car_type_categories)
df_test_set["car_type"] = df_test_set.car_type.cat.codes

df_test_set["travel_from"] = pd.Categorical(df_test_set["travel_from"], categories=travel_from_categories)
df_test_set["travel_from"] = df_test_set.travel_from.cat.codes

#drop features with negligible impact
df_test_set.drop(['travel_date','travel_dow', 'hour_booked', 'ToM'], axis=1, inplace=True)

After the processing phase, the prediction is then performed on the test data.

X_test = df_test_set.drop(['ride_id'], axis=1)
test_set_predictions = regressor.predict(X_test)

Submission File

The final step is to prepare the file for submission to zindi.africa.

#create a dictionary of ride_id and predictions
d = {'ride_id': df_test_set["ride_id"], 'number_of_ticket': test_set_predictions}
df_predictions = pd.DataFrame(data=d)
df_predictions = df_predictions[['ride_id','number_of_ticket']]

#create csv file
df_predictions.to_csv('zindi_submission.csv', index=False)

With a mean absolute error of 3.7352, our solution ended up in the top 15% on the leaderboard.

Conclusion

The resulting model can be used by bus operators to anticipate customer demand for certain rides, to manage resources and vehicles more efficiently, to promote and sell other services such as micro-insurance more effectively, and to improve customer service by sending alerts and other useful information to customers.

In addition, the statistics related to ticket sales can be leveraged by local municipal governments to assess the number of vehicles and passengers visiting cities at any given time for infrastructure resource planning.

And all of these processes can be improved with the power of machine learning!
