Name: Bhishan Poudel
Note: Please click on the triangle button to see the project (gif)
Section A: Regression
King County House Price: This is a well-known regression problem for predicting house prices. The dataset is taken from the Kaggle project House Sales in King County and contains house prices for King County, Washington, which includes the city of Seattle. There is one index column `id`, one target column `price`, and 19 house features such as `date, bedrooms, bathrooms, sqft_living, lat, long`, etc.
I did detailed feature analysis and created new features such as `age`, `age_after_renovation`, and so on. I also log-transformed large numeric features and created dummy variables from categorical features. After feature selection and trying various machine learning methods, I found that Random Forest works best, giving an RMSE of about 26k and R-squared and adjusted R-squared values of about 0.916. I tried various models such as Random Forest, LightGBM, XGBoost, RidgeCV, SVR, HistGradientBoostingRegressor, and a Stacking Regressor; a video of all the notebooks is shown below.
King County House Price
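The feature-engineering steps above (age features, log transforms, then a Random Forest) can be sketched as below. This is a minimal illustration, not the project notebook: the synthetic frame stands in for the real Kaggle data, and only the column names `yr_built`, `yr_renovated`, `sqft_living`, `bedrooms`, and `price` are taken from the King County schema.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the King County data (same column names as the Kaggle set).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "yr_built": rng.integers(1900, 2015, n),
    "yr_renovated": rng.integers(0, 2, n) * rng.integers(1990, 2015, n),
    "sqft_living": rng.integers(500, 5000, n),
    "bedrooms": rng.integers(1, 6, n),
})
df["price"] = df["sqft_living"] * 200 + rng.normal(0, 20_000, n)

# Engineered features: age and age_after_renovation, as in the write-up.
sale_year = 2015
df["age"] = sale_year - df["yr_built"]
df["age_after_renovation"] = np.where(
    df["yr_renovated"] > 0, sale_year - df["yr_renovated"], df["age"])

# Log-transform the large numeric feature and the skewed target.
df["log_sqft_living"] = np.log1p(df["sqft_living"])
y = np.log1p(df["price"])
X = df[["age", "age_after_renovation", "log_sqft_living", "bedrooms"]]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Evaluate on the original dollar scale for RMSE, log scale for R-squared.
pred_dollars = np.expm1(model.predict(X_te))
rmse = np.sqrt(mean_squared_error(np.expm1(y_te), pred_dollars))
r2 = r2_score(y_te, model.predict(X_te))
print(f"RMSE: {rmse:,.0f}  R^2: {r2:.3f}")
```

The same skeleton extends to the other models tried (LightGBM, XGBoost, RidgeCV, etc.) by swapping the estimator.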
Section B: Classification
Fraud Detection: In this project I used the Kaggle Credit Card Fraud data to determine whether a transaction is fraudulent or not. The dataset features are already scaled, dimension-reduced using PCA, and anonymized with the names `v1, v2, ..., v28`. The dataset is heavily imbalanced, so I used Recall to evaluate model performance. I tried various sampling methods such as undersampling and oversampling (SMOTE) and tested various machine learning algorithms as well as deep learning architectures. In the end, CatBoost performed best, with a Recall of 0.785, 0 Untrue-Frauds (false positives), and 21 Missed Frauds (false negatives). A detailed video of the Jupyter notebooks is shown below.
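The imbalance handling and recall-focused evaluation can be sketched as follows. This is an assumption-laden stand-in: a class-weighted LogisticRegression replaces the CatBoost model actually used, and `make_classification` replaces the Kaggle data, but the evaluation (recall plus the Untrue-Fraud/Missed-Fraud counts read off the confusion matrix) mirrors the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic heavily imbalanced data (~2% positives) standing in for the
# Kaggle credit-card set, whose fraud rate is far smaller still.
X, y = make_classification(n_samples=5000, weights=[0.98], flip_y=0,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights the minority class; resampling schemes
# such as SMOTE are an alternative way to address the same imbalance.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print(f"Recall: {recall_score(y_te, pred):.3f}  "
      f"Untrue-Frauds (FP): {fp}  Missed Frauds (FN): {fn}")
```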
Porto Seguro Auto Insurance: In this project I took the data from the Kaggle project Porto Seguro. The aim of the project is to determine whether an insurance client will file a claim next year or not. This is a binary classification problem, and the chosen evaluation metric is the normalized Gini index (2*AUC - 1). The dataset has 59 features, including binary features (suffix `_bin`), categorical features (suffix `_cat`), and other feature groups such as `ind`, `reg`, `car`, and `calc`. An example may have many missing features, so I dealt with missing values carefully, created dummy variables as needed, and discarded unhelpful features. After trying various machine learning algorithms, I found LightGBM performed best, with a Normalized Gini of 0.2 and an AUC of 0.6.
Porto Seguro Auto Insurance (Binary Classification)
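The normalized Gini index mentioned above is just a rescaling of ROC AUC, so it can be computed in a couple of lines. A minimal sketch (the function name `normalized_gini` is my own; the formula 2*AUC - 1 is from the project description):

```python
from sklearn.metrics import roc_auc_score

def normalized_gini(y_true, y_score):
    """Normalized Gini coefficient: 1 = perfect ranking, 0 = random."""
    return 2 * roc_auc_score(y_true, y_score) - 1

# AUC of this toy ranking is 0.75, so the normalized Gini is 0.5.
print(normalized_gini([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.5
```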
Telecom Customer Churn: In this project I took the data from the Kaggle project Telco Customer Churn. The aim of the project is to determine whether a telecom client will churn next month or not, based on their history up to now. This is a binary classification problem, and I created a new custom metric for the project evaluation: in real life, the cost of misclassifying a leaving customer and a staying customer is different. For this project I defined a metric called "PROFIT" as follows:
- profit = +$400 for TP (true positive)
- profit = $0 for TN (true negative)
- profit = -$100 for FP (false positive)
- profit = -$200 for FN (false negative)

The dataset has 19 features such as gender, age, partner, etc., and the target feature is "Churn". The dataset is relatively clean, with few missing values. However, I did elaborate feature creation and feature selection to improve performance. In this project I made extensive use of the semi-automatic machine learning module "pycaret" and looked at various models. Among the pycaret models, LDA gave the best performance, with a profit of $80,200. I then tested the boosting module "xgboost" with extensive hyperparameter tuning using hyperband and Bayesian optimization, which gave a profit of $87,200. Plain logistic regression with cross-validation came second, with a profit of $82,500 (after hyperband HPO). Highlights:
- XGBoost slightly beat LogisticRegressionCV after hyperband parameter tuning.
- Pycaret gives very high performance with little effort and a short run time.
- Despite expectations, SMOTE oversampling gave much worse results than no upsampling.
- Used the xgboost parameter `scale_pos_weight` to handle class imbalance.
- Used the custom scoring function `get_profit` instead of popular metrics such as AUC or AUPRC.
- XGBoost has a large number of hyperparameters to tune; among the hyperparameter tuning libraries, hyperband was the best. bayes_opt and scikit-learn (GridSearchCV, RandomizedSearchCV) were inferior and/or more time-consuming in comparison.
- Feature selection plays a great role in metric performance; simply using one-hot encoding gives much worse results, even after extensive hyperparameter search.
Telco Customer Churn (Binary Classification)
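The PROFIT metric defined above can be wrapped as a scikit-learn scorer so it drives model selection directly. A minimal sketch: the function name `get_profit` comes from the highlights, and the dollar values come from the write-up; everything else (the toy labels, the scorer wiring) is illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer

def get_profit(y_true, y_pred):
    """Custom PROFIT metric: +$400 per TP, $0 per TN, -$100 per FP, -$200 per FN."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return 400 * tp + 0 * tn - 100 * fp - 200 * fn

# Wrap it so GridSearchCV / cross_val_score can maximize profit directly.
profit_scorer = make_scorer(get_profit, greater_is_better=True)

y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
print(get_profit(y_true, y_pred))  # 2 TP, 1 FP, 1 FN → 800 - 100 - 200 = 500
```

Passing `scoring=profit_scorer` to a scikit-learn search object then tunes hyperparameters against dollars rather than AUC.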
Prudential Life Insurance: In this project I took the data from the Kaggle project Prudential Life Insurance. This is a multi-class classification problem: the task is to classify a given insurance policy into one of 8 response classes (1 to 8). The evaluation metric is Quadratic Weighted Kappa. The dataset has 127 features and 60k records, with feature groups such as `Product_Info, Employment_Info, InsuredInfo`, and so on. After assessing the performance of various models, I found XGBoost with a Poisson objective gives the best results (Test Kappa = 0.6556).
Prudential Insurance (Multiclass Classification)
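Quadratic Weighted Kappa, the metric above, is available directly in scikit-learn as `cohen_kappa_score` with quadratic weights, which penalizes predictions more the further they land from the true ordinal class. A small sketch with made-up labels on the 1-8 response scale:

```python
from sklearn.metrics import cohen_kappa_score

# Toy ordinal labels on the 1..8 response scale (not the real test set):
# six exact matches and two off-by-one errors.
y_true = [1, 2, 3, 4, 5, 6, 7, 8]
y_pred = [1, 2, 3, 4, 5, 6, 8, 7]

kappa = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"Quadratic Weighted Kappa: {kappa:.4f}")
```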
Section C: Timeseries Analysis
Timeseries Analysis for Web Traffic Data: In this project I took the data from the Kaggle project Web Traffic Time Series Forecasting. The dataset contains 145k samples; the first column is the Wikipedia page name and the other 550 columns are the dates the page was visited. In this project I used the data from the year 2016 for the American musician "Prince". The evaluation metric is sMAPE (symmetric Mean Absolute Percentage Error), and after various experiments I found that xgboost with features extracted from tsfresh gave the best results (sMAPE = 0.6356).
Timeseries Analysis for Web Traffic Data
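sMAPE is not built into scikit-learn, so a small helper is typically defined. A sketch in the common form used for this competition, with the denominator `(|y| + |yhat|) / 2`; the zero-denominator handling (treat points where both series are zero as zero error) is one conventional choice, not necessarily the exact variant used in the notebooks.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error in [0, 2]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Where both actual and predicted are zero, count the error as zero.
    safe_denom = np.where(denom == 0, 1.0, denom)
    ratio = np.where(denom == 0, 0.0, np.abs(y_true - y_pred) / safe_denom)
    return float(np.mean(ratio))

print(smape([100, 200, 300], [110, 190, 310]))  # ≈ 0.0598
```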
Section D: Natural Language Processing (NLP)
Consumer Complaints: In this project I took the data from the government data catalog's Consumer Complaint Database. The aim of the project is to classify each complaint into one of 11 categories such as `Debt Collection, Mortgage, Student loan`, etc. The dataset has millions of rows; for this project I used 200,000 samples. I did various text processing steps such as lowercasing, removing punctuation, digits, and stopwords, lemmatization, and so on. This is a text modelling project, and ML algorithms cannot work with raw text directly, so the data must be vectorized: here I used TfidfVectorizer with an ngram range of (1, 2) and removed English stopwords. I tried various machine learning models such as Logistic Regression, Linear SVC, Random Forest, and so on; in the end, the tuned Linear SVC gave the best accuracy of 0.7222.
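The vectorize-then-classify pipeline above can be sketched in a few lines. The `TfidfVectorizer` settings (ngram range of (1, 2), English stop words) and the LinearSVC model are from the write-up; the four toy complaints and their two categories are stand-ins for the real 200,000-sample, 11-category dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy complaints standing in for the Consumer Complaint Database.
texts = [
    "debt collector keeps calling about an old debt",
    "collection agency harassing me over a debt I do not owe",
    "my mortgage payment was misapplied by the bank",
    "bank raised my mortgage escrow without notice",
]
labels = ["Debt collection", "Debt collection", "Mortgage", "Mortgage"]

# TF-IDF on unigrams and bigrams, English stop words removed, feeding
# a linear support vector classifier.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    LinearSVC(),
)
clf.fit(texts, labels)
print(clf.predict(["unfair debt collection calls"]))
```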