
Predicting Outbreaks of Dengue Fever Using Linear Count Models: DrivenData Competition

by Joe Ganser


Abstract

Dengue fever is a mosquito-borne viral infection that is particularly hazardous in tropical environments. Drivendata.co is a website hosting an open data science competition to find the best model to predict the number of dengue fever cases per week. Using historical infection records in conjunction with environmental data, the competition's goal is to build a count regression model that minimizes the mean absolute error metric. The overall goal of this analysis is to break apart the problem and build a basic model. Building on Peter Bull's work on the Dengue Fever competition, we also explore non-linear models. Further research can elicit more accurate models.

I. Problem Statement Goal & Strategy

Predicting the number of weekly infections of a disease is a count regression problem with a time series property. This means that the numbers we're predicting can ONLY be non-negative integers, and it's possible that some of the variables may evolve over time. To solve this problem, the investigation works through several key questions:

  • Is the time series property relevant?
  • How do the predictor variables relate to each other and to the target variable?
  • How would a simple model that predicts the mean/median/zero every week score on our target metric, mean absolute error?
  • Are there non-linear relationships between the predictor variables and the target?
  • What features are important?
  • How do linear models perform?
  • How do NON-linear models perform?

Answering these questions will guide us in building a relevant model. The general strategy is to test several models, find the one with the best MAE (mean absolute error) score, and use it to predict on our test set. To experiment with different models, we split the data into train and validation sets.
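To make that evaluation loop concrete, here is a minimal sketch (illustrative only; df_train and the target column total_cases are loaded in the next section) that scores any candidate regressor on a held-out validation split using mean absolute error:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.dummy import DummyRegressor

def score_candidate(model, df, target='total_cases'):
    # Split into train/validation, fit the candidate, and report validation MAE
    X = df.drop(columns=[target]).select_dtypes('number')
    y = df[target]
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    model.fit(X_tr, y_tr)
    preds = model.predict(X_val).clip(min=0)  # counts can never be negative
    return mean_absolute_error(y_val, preds)

# e.g. score_candidate(DummyRegressor(strategy='median'), df_train)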

II. ETL: Extract transform and load the data

The dataset consists of a training dataset and a testing one, which will later be used for forecasting with the best performing model. These are downloaded from AWS S3 buckets and are small enough to be saved locally.

import pandas as pd
import warnings
warnings.filterwarnings('ignore')
test_data_features = 'https://drivendata-prod.s3.amazonaws.com/data/44/public/dengue_features_test.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYQTZTLQOS%2F20220729%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220729T181653Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=c2cabb56ebd41bae9aa5ed99d2ad1b04c587029a3da26b150fedf274b2ec661e'
training_data_features = 'https://drivendata-prod.s3.amazonaws.com/data/44/public/dengue_features_train.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYQTZTLQOS%2F20220729%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220729T181653Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=86638e7bbf2f4d0e35e91314a47ae3d94270b000a7255fa48a6eac87ffb3ecfb'
training_data_labels = 'https://drivendata-prod.s3.amazonaws.com/data/44/public/dengue_labels_train.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYQTZTLQOS%2F20220729%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220729T181653Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=01891140cdf2983b4155e439da14527ccb65d4a311009564bec5e0b01d23cf9c'
try:
    df_test = pd.read_csv(test_data_features)
    df_train_features = pd.read_csv(training_data_features)
    df_train_labels = pd.read_csv(training_data_labels)
    df_train = df_train_features.merge(df_train_labels,on=['city','year','weekofyear'])
except Exception:
    # Fall back to local copies if the pre-signed S3 URLs have expired
    df_train = pd.read_csv('train.csv',index_col=0)
    df_test = pd.read_csv('test.csv',index_col=0)
target = 'total_cases'
df_train.head()
city year weekofyear week_start_date ndvi_ne ndvi_nw ndvi_se ndvi_sw precipitation_amt_mm reanalysis_air_temp_k ... reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c station_diur_temp_rng_c station_max_temp_c station_min_temp_c station_precip_mm total_cases
0 sj 1990 18 1990-04-30 0.122600 0.103725 0.198483 0.177617 12.42 297.572857 ... 73.365714 12.42 14.012857 2.628571 25.442857 6.900000 29.4 20.0 16.0
1 sj 1990 19 1990-05-07 0.169900 0.142175 0.162357 0.155486 22.82 298.211429 ... 77.368571 22.82 15.372857 2.371429 26.714286 6.371429 31.7 22.2 8.6
2 sj 1990 20 1990-05-14 0.032250 0.172967 0.157200 0.170843 34.54 298.781429 ... 82.052857 34.54 16.848571 2.300000 26.714286 6.485714 32.2 22.8 41.4
3 sj 1990 21 1990-05-21 0.128633 0.245067 0.227557 0.235886 15.36 298.987143 ... 80.337143 15.36 16.672857 2.428571 27.471429 6.771429 33.3 23.3 4.0
4 sj 1990 22 1990-05-28 0.196200 0.262200 0.251200 0.247340 7.52 299.518571 ... 80.460000 7.52 17.210000 3.014286 28.942857 9.371429 35.0 23.9 5.8

III. Exploratory data analysis (EDA)

It's important that we explore the structure of the data set to build a strategy for solving this problem. Let's begin by exploring the time series of total cases and its histogram. The goal here is to examine relationships between predictors and the target, as well as to determine how the time series evolved.

III.A Examining the target variable; total_cases

Let's evaluate the time series and histogram of our target variable.

import matplotlib.pyplot as plt
def plot_feature_timeseries(data,feature):
    time_series = data[['city','week_start_date',feature]]
    time_series['week_start_date'] = pd.to_datetime(time_series['week_start_date'])
    time_series.set_index('week_start_date',inplace=True)
    plt.figure(figsize=(18,8))
    plt.title('both cities',fontsize=20)
    plt.plot(time_series[time_series['city']=='sj'][feature],label='SJ',color='blue')
    plt.plot(time_series[time_series['city']=='iq'][feature],label='IQ',color='red')
    plt.legend(fontsize=20)
    plt.ylabel(feature,fontsize=20)
    plt.xticks(fontsize=20)
    plt.show()

plot_feature_timeseries(df_train,'total_cases')

[figure: weekly total_cases time series for cities SJ and IQ]

We can see from the start that this is a cyclic time series count regression problem. It's possible that seasonal trends are at play, and it's important to determine whether the values drift over time, i.e. whether the series is stationary.

Our target variable is total_cases, and in any regression problem it's always good to evaluate the histogram of our target.

plt.figure(figsize=(18,9))
plt.subplot(1,2,1)
iq_mean = round(df_train[df_train['city']=='iq']['total_cases'].mean(),2)
iq_var = round(df_train[df_train['city']=='iq']['total_cases'].var(),2)
plt.title('Histogram of Total cases in city: iq\n mean: {}, variance: {}'.format(iq_mean,iq_var),fontsize=20)
df_train[df_train['city']=='iq']['total_cases'].hist(bins=30)
plt.subplot(1,2,2)
sj_mean = round(df_train[df_train['city']=='sj']['total_cases'].mean(),2)
sj_var = round(df_train[df_train['city']=='sj']['total_cases'].var(),2)
plt.title('Histogram of Total cases in city: sj\n mean: {}, variance: {}'.format(sj_mean,sj_var),fontsize=20)
df_train[df_train['city']=='sj']['total_cases'].hist(bins=30)
plt.subplots_adjust(wspace=0.3)
plt.show()

[figure: histograms of total_cases for cities IQ and SJ, with mean and variance in the titles]

III.B Data cleaning

The data has some missing values, and because this is a time series across different cities, a sensible approach to filling these data points is the forward fill method, applied separately within each city.

iq = df_train[df_train['city']=='iq'].copy()
sj = df_train[df_train['city']=='sj'].copy()
sj.fillna(method='ffill', inplace=True)
iq.fillna(method='ffill', inplace=True)
df_train = pd.concat([sj,iq],axis=0)
df_train.head()

5 rows × 25 columns

city year weekofyear week_start_date ndvi_ne ndvi_nw ndvi_se ndvi_sw precipitation_amt_mm reanalysis_air_temp_k ... reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c station_diur_temp_rng_c station_max_temp_c station_min_temp_c station_precip_mm total_cases
0 sj 1990 18 1990-04-30 0.122600 0.103725 0.198483 0.177617 12.42 297.572857 ... 73.365714 12.42 14.012857 2.628571 25.442857 6.900000 29.4 20.0 16.0
1 sj 1990 19 1990-05-07 0.169900 0.142175 0.162357 0.155486 22.82 298.211429 ... 77.368571 22.82 15.372857 2.371429 26.714286 6.371429 31.7 22.2 8.6
2 sj 1990 20 1990-05-14 0.032250 0.172967 0.157200 0.170843 34.54 298.781429 ... 82.052857 34.54 16.848571 2.300000 26.714286 6.485714 32.2 22.8 41.4
3 sj 1990 21 1990-05-21 0.128633 0.245067 0.227557 0.235886 15.36 298.987143 ... 80.337143 15.36 16.672857 2.428571 27.471429 6.771429 33.3 23.3 4.0
4 sj 1990 22 1990-05-28 0.196200 0.262200 0.251200 0.247340 7.52 299.518571 ... 80.460000 7.52 17.210000 3.014286 28.942857 9.371429 35.0 23.9 5.8

III.C Time series and stationarity

In any time series problem it's important to determine whether there is a trend in the series, i.e. whether it is stationary. A stationary series has statistical properties (such as the mean) that do not change over time. To check this we use the augmented Dickey-Fuller (ADF) test, applied to the time series for both cities.

If it's determined that all columns are stationary, we can treat this as a typical count-regression problem.

from statsmodels.tsa.stattools import adfuller
def dickey_fuller_test(data):
    # ADF null hypothesis: the series has a unit root (i.e. it is non-stationary)
    test_statistic,p_value,lags_used,observations,critical_values,icbest = adfuller(data)
    metric = 'test statistic: {}, p_value: {}'.format(test_statistic,p_value)
    for key in sorted(critical_values.keys()):
        alpha = float(key.replace('%',''))/100
        critical = float(critical_values[key])
        if test_statistic<=critical and p_value<=alpha:
            metric2 = '{}:{}'.format(key,critical_values[key])
            return 'Stationary Series, Reject Null Hypothesis;\n '+metric+'\n '+metric2
    return 'Fail to reject null hypothesis, non-stationary series: '+metric

The data comes in different contexts: one time series for each city. Hence we run all the evaluations in the following contexts:

  • City IQ
  • City SJ
  • Both cities, combined

Now let's determine whether the target, total_cases, is stationary.

time_series = df_train[['city','week_start_date','total_cases']]
time_series['week_start_date'] = pd.to_datetime(time_series['week_start_date'])
for city in ['sj','iq','both']:
    print('City: {}'.format(city))
    if city=='both':
        print(dickey_fuller_test(time_series['total_cases']))
    else:
        print(dickey_fuller_test(time_series[time_series['city']==city]['total_cases']))
    print('\n')
City: sj
Stationary Series, Reject Null Hypothesis;
 test statistic: -6.650077901931189, p_value: 5.1473186737592894e-09
 1%:-3.4374315551464734


City: iq
Stationary Series, Reject Null Hypothesis;
 test statistic: -6.085428681900057, p_value: 1.0672522948401663e-07
 1%:-3.4431115411022146


City: both
Stationary Series, Reject Null Hypothesis;
 test statistic: -6.6232582356851655, p_value: 5.963285375798725e-09
 1%:-3.434889827343955

Now we should run the test for stationarity across all the other features, in each city context.

features = [j for j in df_train.columns if j not in ['total_cases','week_start_date','city','year','weekofyear']]
for col in features:
    _ = df_train[['city','week_start_date',col]]
    _['week_start_date'] = pd.to_datetime(_['week_start_date'])
    for city in ['sj','iq','both']:
        if city=='both':
            test = dickey_fuller_test(_[col])
        else:
            test = dickey_fuller_test(_[_['city']==city][col])
        if 'Fail' in test:
            print('Feature: {} in cities: {} is non-stationary'.format(col,city))
Feature: station_diur_temp_rng_c in cities: both is non-stationary

We see that the only feature regarded as non-stationary is station_diur_temp_rng_c. Let's plot it to see what may cause this.

plot_feature_timeseries(df_train,'station_diur_temp_rng_c')

[figure: station_diur_temp_rng_c time series for both cities]

Clearly, this isn't due to a trend evolving over time, but only due to the feature's average value differing between the two cities. Indeed, the same test found the series stationary when evaluated on each city individually. Thus, for all intents and purposes, we can regard this feature as in fact stationary.

III.D Correlations between features and target

Even though this is not a Gaussian regression problem, where the errors are assumed to be normally distributed, we can still look at the Pearson correlation between each feature and our target variable total_cases. A bar plot of each feature's correlation with the target, in the context of each city, can help us.

import numpy as np
import seaborn as sns
sj_corrs = {}
iq_corrs = {}
for col in df_train.columns:
    if col not in ['year','week_start_date','city','total_cases']:
        _ = df_train[df_train['city']=='sj'][[col,'total_cases']]
        sj = np.corrcoef(_[col],_['total_cases'])[0][1]
        sj_corrs[col]=[sj]
        __ = df_train[df_train['city']=='iq'][[col,'total_cases']]
        iq = np.corrcoef(__[col],__['total_cases'])[0][1]
        iq_corrs[col]=[iq]
plt.figure(figsize=(12,12))
plt.title('Pearson correlations for each feature with total_cases',fontsize=20)
sns.barplot(x=0, y='index', data=pd.DataFrame(sj_corrs).transpose().reset_index(), color="b",label='sj')
sns.barplot(x=0, y='index', data=pd.DataFrame(iq_corrs).transpose().reset_index(), color="r",label='iq')
plt.legend(fontsize=20)
plt.show()

[figure: bar plot of Pearson correlations of each feature with total_cases, per city]

From this plot we can see that most features are weakly correlated with the target. Some features have different correlations for different cities. For example, weekofyear has almost no correlation in city iq, but a much stronger correlation in sj. Perhaps this is due to the geographical positioning of the cities, with seasonal effects being more pronounced in some locations than others.

III.E Correlations between features (heat map)

iq_features = df_train[df_train['city']=='iq'].drop(['week_start_date','year'],axis=1).corr()
sj_features = df_train[df_train['city']=='sj'].drop(['week_start_date','year'],axis=1).corr()
plt.figure(figsize=(10,20))
plt.subplot(2,1,1)
plt.title('Heat map for features in city IQ',fontsize=20)
sns.heatmap(iq_features)
plt.subplot(2,1,2)
plt.title('Heat map for features in city SJ',fontsize=20)
sns.heatmap(sj_features)
plt.subplots_adjust(hspace=0.7)
plt.show()

[figure: correlation heat maps for the features in city IQ and city SJ]

A few examples:

Correlations between different features range widely, from 0.997 (max) down to -0.896 (min). This may indicate collinearity between features, which could be problematic for linear models.

For city SJ:

  • weekofyear has a positive correlation with reanalysis_air_temp_k of 0.904
  • reanalysis_dew_point_temp_k has a negative correlation with reanalysis_tdtr_k of -0.374

For city IQ:

  • weekofyear has a positive correlation with reanalysis_specific_humidity_g_per_kg of 0.997
  • reanalysis_dew_point_temp_k has a negative correlation with reanalysis_tdtr_k of -0.896
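As an optional check that was not part of the original competition code, collinearity can also be quantified with variance inflation factors from statsmodels; a minimal sketch for city SJ:

from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF of each numeric feature; values well above ~10 (or inf for duplicated
# columns such as the two precipitation measures) suggest problematic collinearity
sj_numeric = df_train[df_train['city']=='sj'].drop(['city','year','week_start_date','total_cases'],axis=1).dropna()
vif = pd.Series([variance_inflation_factor(sj_numeric.values,i) for i in range(sj_numeric.shape[1])],index=sj_numeric.columns).sort_values(ascending=False)
print(vif.head())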

III.F Establishing a baseline: performance of a constant model

In a regression problem, one strategy to establish a baseline model is to ask the following questions:

  • How accurate would our model be if we only predicted the mean?
    • Or if we only predicted median?
    • Or if we only predicted zero?

What would the performance metric be in each case? Any model that performs worse than these baselines is, by definition, no better than simply guessing the mean/median/zero for every row.

We should answer these questions in the context of each city, and the cities combined.

from sklearn.metrics import mean_absolute_error
def scores_of_bad_models(df):
    predict_median = [np.median(df['total_cases']) for i in range(len(df))]
    predict_mean = [df['total_cases'].mean() for i in range(len(df))]
    predict_zero = [0 for i in range(len(df))]
    predict_mean_error = round(mean_absolute_error(df['total_cases'],predict_mean),2)
    predict_zero_error = round(mean_absolute_error(df['total_cases'],predict_zero),2)
    predict_median_error = round(mean_absolute_error(df['total_cases'],predict_median),2)
    return predict_mean_error,predict_zero_error,predict_median_error

datasets = [df_train,df_train[df_train['city']=='sj'],df_train[df_train['city']=='iq']]
scores = {}
for i in range(3):
    if i==0:
        name = 'both'
    elif i==1:
        name = 'sj'
    elif i==2:
        name = 'iq'
    scores[name] = [*scores_of_bad_models(datasets[i])]

scores = pd.DataFrame(scores).transpose()
scores.columns = ['predict_mean','predict_zero','predict_median']
scores['min'] = scores.apply(lambda x: x.min(),axis=1)
scores.index.name='cities'
scores
cities  predict_mean  predict_zero  predict_median  min
both    23.00         24.68         19.88           19.88
sj      28.35         34.18         24.71           24.71
iq      6.68          7.57          6.05            6.05

Thus for both cities combined, predicting the median every week would yield a mean absolute error of 19.88. For city SJ it's 24.71 and for IQ it's 6.05. Any useful model should therefore have a mean absolute error lower than these values.

III.G Selecting the right linear model

This is a count regression problem, which means that at no point should we expect our model or measurements to be less than zero. Some other examples of similar problems are:

  • Number of hurricanes per year
  • Number of traffic accidents per month
  • Customer visits to a website per day

Hence a prediction of -5 cases of dengue fever in a given week has no meaning. There are two linear modelling approaches to problems like this, each with different assumptions:

  • Poisson regression
  • Negative binomial regression

Poisson regression assumes that the mean of the counts equals their variance (equidispersion). If our data violate this assumption, a negative binomial model is the more appropriate choice.

From the section above "examining the target variable", we know that:

  • City IQ:
    • mean: 7.57 cases/week
    • variance: 115.9
  • City SJ:
    • mean: 34.18 cases/week
    • variance: 2640

In both cities the variance far exceeds the mean, so the negative binomial model is the more relevant choice. This situation is known as overdispersion.
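This overdispersion check is quick to reproduce directly from the cleaned training data; a minimal sketch:

for city in ['iq','sj']:
    cases = df_train[df_train['city']==city]['total_cases']
    # For a Poisson model we would expect the variance/mean ratio to be close to 1
    print(city,'mean:',round(cases.mean(),2),'variance:',round(cases.var(),2),'variance/mean:',round(cases.var()/cases.mean(),2))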

III.G1 Linear modeling: Predictability of each feature versus the target

One technique in feature selection is to examine how well the target can be predicted from each feature on its own, using a linear model (a negative binomial model with its hyperparameter tuned). This gives us an estimate of the usefulness of each feature in predicting the target, and we can eliminate features with poor predictability. This technique is limited to cases where the relationships between variables are linear, but it still gives some useful insight. This script is inspired by source [1].

from statsmodels.tools import eval_measures
import statsmodels.formula.api as smf
import statsmodels.api as sm
def negative_binomial_one_feature(train,test,city,param_grid,feature):
    model_formula = "total_cases ~ {}".format(feature)
    best_alpha = 0
    best_score = 1000
    for alpha in param_grid:
        model = smf.glm(formula=model_formula,data=train,family=sm.families.NegativeBinomial(alpha=alpha))
        results = model.fit()
        predictions = results.predict(test).astype(int)
        score = eval_measures.meanabs(predictions, test.total_cases)
        if score < best_score:
            best_alpha = alpha
            best_score = score
    metrics = {'city':city,'best_MAE':best_score,'alpha':best_alpha,'feature':feature}
    return metrics


def feature_tests(data,features,param_grid):
    mask = np.random.rand(len(data))<0.8
    train = data[mask]
    test = data[~mask]
    results = []
    cities = ['sj','iq']
    for feature in features:
        for element in range(3):
            tr,te = train.copy(),test.copy()
            if element in [0,1]:
                city = cities[element]
                tr = tr[tr['city']==city]
                te = te[te['city']==city]
            else:
                city = 'both'
            tr.drop('city',axis=1,inplace=True)
            te.drop('city',axis=1,inplace=True)
            result = negative_binomial_one_feature(tr,te,city,param_grid,feature)
            results.append(result)
    return pd.DataFrame(results)

features = [j for j in df_train.columns if j not in ['city','year','week_start_date',target]]
param_grid = np.arange(0.001,20,0.05)
feature_results = feature_tests(df_train,features,param_grid)
feature_results.head()
feature city best_MAE alpha
weekofyear sj 23.384181 5.701
weekofyear iq 6.741935 0.001
weekofyear both 19.374074 1.601
ndvi_ne sj 25.005650 0.001
ndvi_ne iq 6.741935 0.001

Now we have the mean absolute error of each single-feature model versus the target, where each linear model is fitted with its optimal hyperparameter. From this we can look at the histogram of scores and see how many features achieved a mean absolute error lower than that of a constant model. Features that performed better than a constant model are probably useful; ones that didn't may not be.

plt.figure(figsize=(18,6))
plt.subplot(1,3,1)
plt.title('Feature scores for city: sj \n min MAE for constant model: {}'.format(scores.loc['sj']['min']))
feature_results[feature_results['city']=='sj']['best_MAE'].hist(bins=len(features))
plt.axvline(scores.loc['sj']['min'],label='median prediction error',color='red')
plt.xlabel('MAE',fontsize=20)
plt.ylabel('count',fontsize=20)
plt.legend()
plt.subplot(1,3,2)
plt.title('Feature scores for city: iq \n min MAE for constant model: {}'.format(scores.loc['iq']['min']))
feature_results[feature_results['city']=='iq']['best_MAE'].hist(bins=len(features))
plt.axvline(scores.loc['iq']['min'],label='median prediction error',color='red')
plt.subplot(1,3,3)
plt.title('Feature scores for both cities combined \n min MAE for constant model: {}'.format(scores.loc['both']['min']))
feature_results[feature_results['city']=='both']['best_MAE'].hist(bins=len(features))
plt.axvline(scores.loc['both']['min'],label='median prediction error',color='red')
plt.show()

[figure: histograms of single-feature MAE scores for sj, iq, and both cities combined, with the constant-model MAE marked in red]

From the graphs above we can see that a negative binomial model fitted on a single feature usually doesn't perform any better than simply guessing the mean/median/zero every week.

Thus we can hypothesize: linear models probably have poor predictive power on this data set.

III.G2 Feature selection

Still, there are some features whose MAE was better than the minimum MAE provided by a constant model. Another hypothesis can be formed: these features may be useful, and a linear model using only them may perform reasonably well. Let's see what these features are.

min_scores = pd.DataFrame(scores['min']).reset_index()
feature_results_joined = feature_results.merge(min_scores,left_on='city',right_on='cities')[['best_MAE','cities','alpha','min','feature']]
key_features = feature_results_joined[feature_results_joined['best_MAE']<=feature_results_joined['min']]
key_features_list = list(set(key_features['feature']))
key_features
best_MAE cities alpha min feature
23.384181 sj 5.701 24.71 weekofyear
24.502825 sj 0.151 24.71 precipitation_amt_mm
24.531073 sj 0.001 24.71 reanalysis_air_temp_k
24.587571 sj 0.001 24.71 reanalysis_avg_temp_k
24.186441 sj 0.201 24.71 reanalysis_dew_point_temp_k
24.406780 sj 0.801 24.71 reanalysis_max_air_temp_k
24.451977 sj 0.001 24.71 reanalysis_relative_humidity_percent
24.502825 sj 0.151 24.71 reanalysis_sat_precip_amt_mm
24.141243 sj 0.001 24.71 reanalysis_specific_humidity_g_per_kg
24.525424 sj 0.001 24.71 station_min_temp_c
24.581921 sj 0.901 24.71 station_precip_mm
19.374074 both 1.601 19.88 weekofyear
18.822222 both 0.001 19.88 reanalysis_air_temp_k
19.803704 both 0.051 19.88 reanalysis_avg_temp_k
18.511111 both 9.701 19.88 reanalysis_min_air_temp_k
18.522222 both 1.101 19.88 reanalysis_tdtr_k
19.359259 both 0.001 19.88 station_diur_temp_rng_c
19.085185 both 1.151 19.88 station_min_temp_c

The data frame above lists the cases where the single-feature negative binomial model of total_cases performed better than guessing the median. These features were:

key_features_list
['reanalysis_max_air_temp_k',
 'reanalysis_min_air_temp_k',
 'reanalysis_avg_temp_k',
 'reanalysis_air_temp_k',
 'reanalysis_specific_humidity_g_per_kg',
 'station_diur_temp_rng_c',
 'reanalysis_dew_point_temp_k',
 'station_min_temp_c',
 'reanalysis_tdtr_k',
 'weekofyear',
 'station_precip_mm',
 'precipitation_amt_mm',
 'reanalysis_relative_humidity_percent',
 'reanalysis_sat_precip_amt_mm']

IV Modelling

The strategy for modeling is to begin with simple linear models, evaluate their performance and then attempt non-linear ones such as random forest regression.

IV.A Modelling: Negative Binomial distribution

We previously established the baseline performance of guessing the median value every week, and we also identified a set of candidate features. Now let's run experiments fitting a negative binomial model on:

  • the top features from the last step
  • all the features

def negative_binomial_models(data,param_grid,*train_features,return_predictions=False):
    if 'city' in data.columns:
        data['city']=data['city'].apply(lambda x: 1 if x=='sj' else 0)
    mask = np.random.rand(len(data))<0.8
    train = data[mask]
    test = data[~mask]

    model_formula = "total_cases ~"
    for index,f in enumerate([j for j in train_features if j!='total_cases']):
        if index==0:
            model_formula = model_formula+" {}".format(f)
        else:
            model_formula = model_formula+" + {}".format(f)
    best_alpha = None
    best_score = 1000

    for alpha in param_grid:
        model = smf.glm(formula=model_formula,data=train,family=sm.families.NegativeBinomial(alpha=alpha))
        results = model.fit()
        predictions = results.predict(test).astype(int)
        score = eval_measures.meanabs(predictions, test.total_cases)
        if score < best_score:
            best_alpha = alpha
            best_score = score
    # Refit with the best alpha on the training split and score both splits
    best_model = smf.glm(formula=model_formula,data=train,family=sm.families.NegativeBinomial(alpha=best_alpha))
    best_results = best_model.fit()
    predictions_train = best_results.predict(train).astype(int)
    predictions_test = best_results.predict(test).astype(int)
    score_train = eval_measures.meanabs(predictions_train, train.total_cases)
    score_test = eval_measures.meanabs(predictions_test, test.total_cases)
    metrics = {'MAE_train':score_train,'MAE_test':score_test,'alpha':best_alpha}
    return metrics

Tune and test negative binomial distribution on the key features observed before.

negative_binomial_models(df_train,param_grid,*key_features_list)
{'MAE_train': 19.004198152812762,
 'MAE_test': 18.724528301886792,
 'alpha': 0.051000000000000004}

Tune and test negative binomial distribution on all the features.

all_features = [j for j in df_train.columns if j not in ['year','week_start_date',target]]
negative_binomial_models(df_train,param_grid,*all_features)
{'MAE_train': 17.706837606837606,
 'MAE_test': 21.377622377622377,
 'alpha': 0.101}

From these results we can see that although the train and test errors indicate a reasonably stable fit, the MAE is approximately the same as guessing the median. Thus it's worth investigating a better performing model than the negative binomial.

IV.B Modelling: Random Forest Regression

Random forest regression is a good model for data sets that contain non-linear relationships between the features and the target. As a first experiment, we run a small grid search over its hyperparameters. We also apply a standard scaler to the data before feeding it into the random forest (scaling isn't strictly required for tree models, but it does no harm).

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import random

def preprocess_data(data):
    for c in ['week_start_date','year']:
        if c in data.columns:
            data.drop(c,axis=1,inplace=True)
    data['city']=data['city'].apply(lambda x: 1 if x=='sj' else 0)
    if target in data.columns:
        X = data.drop(target,axis=1)
        y = data[target]
    else:
        X = data
        y = None
    return X,y


def split_and_preprocess_data(data):
    X,y = preprocess_data(data)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=43)
    return X_train, X_val, y_train, y_val

criterion = ["poisson","mse","mae"]
max_depth = [None,20,25,30,35]
max_features = ["sqrt", "log2", None]
num_estimators = [10,50,100,150]

best_err = 1000
best_params = []

X_train, X_val, y_train, y_val = split_and_preprocess_data(df_train.copy())
for c in criterion:
    for d in max_depth:
        for f in max_features:
            for n in num_estimators:
                rf = RandomForestRegressor(n_estimators=n,criterion=c,max_depth=d,max_features=f)
                pipe_rf = Pipeline([('scaler', StandardScaler()), ('rf',rf)])
                pipe_rf.fit(X_train,y_train)
                y_pred = pipe_rf.predict(X_val)
                y_pred[y_pred==np.inf]=0
                y_pred[y_pred<0]=0
                err = mean_absolute_error(y_val,y_pred)
                if err<best_err:
                    best_err = err
                    if len(best_params)!=0:
                        best_params.pop()
                    best_params.append([c,d,f,n])

print(best_err)
print(best_params)
16.54417808219178
[['mse', 35, None, 10]]

IV.B1 Random Forest: Feature selection

Our MAE (for this split) is 16.54. This is reasonably good and better than the negative binomial model. Using hyperparameters close to these, let's fit the model and take advantage of random forest's feature importances.

c='mse'
d=30
f=None
n=50
rf = RandomForestRegressor(n_estimators=n,criterion=c,max_depth=d,max_features=f)
pipe_rf = Pipeline([('scaler', StandardScaler()), ('rf',rf)])
pipe_rf.fit(X_train,y_train)

for feature in zip(X_train.columns, rf.feature_importances_):
    print(feature)
('city', 0.00012123375843353865)
('weekofyear', 0.0884120168882518)
('ndvi_ne', 0.018704338094528178)
('ndvi_nw', 0.02330064085826595)
('ndvi_se', 0.3065804669674904)
('ndvi_sw', 0.08197494560635799)
('precipitation_amt_mm', 0.008394869826651807)
('reanalysis_air_temp_k', 0.01718239611866463)
('reanalysis_avg_temp_k', 0.019715048212381053)
('reanalysis_dew_point_temp_k', 0.04598412742970319)
('reanalysis_max_air_temp_k', 0.023837285355508442)
('reanalysis_min_air_temp_k', 0.10103776364819778)
('reanalysis_precip_amt_kg_per_m2', 0.032496353106185906)
('reanalysis_relative_humidity_percent', 0.015229479597685027)
('reanalysis_sat_precip_amt_mm', 0.005335078003527012)
('reanalysis_specific_humidity_g_per_kg', 0.035410600175340604)
('reanalysis_tdtr_k', 0.04086107229416256)
('station_avg_temp_c', 0.02208406820316571)
('station_diur_temp_rng_c', 0.016308194900887744)
('station_max_temp_c', 0.04946073934079644)
('station_min_temp_c', 0.014044667116736333)
('station_precip_mm', 0.033524614497077845)
X_train, X_val, y_train, y_val = split_and_preprocess_data(df_train.copy())
rf = RandomForestRegressor(n_estimators=50,criterion='mse',max_depth=None,max_features=None)
pipe = Pipeline([('scaler', StandardScaler()), ('rf',rf)])
pipe.fit(X_train,y_train)


y_pred_val = pipe.predict(X_val)
y_pred_train = pipe.predict(X_train)
y_pred_val[y_pred_val == np.inf] = 0
y_pred_val[y_pred_val<0]=0
y_pred_train[y_pred_train == np.inf] = 0
y_pred_train[y_pred_train<0]=0
err_val = mean_absolute_error(y_val,y_pred_val)
err_train = mean_absolute_error(y_train,y_pred_train)
print(err_val)
print(err_train)
17.56821917808219
5.864226804123711

IV.B2 Random forest results

  • MAE validation set: 17.56
  • MAE train set: 5.86

Despite the model being overfit, let's plot the predictions and see how they compare with the actual counts on the train and validation sets.

val = pd.concat([X_val,y_val],axis=1)
train = pd.concat([X_train,y_train],axis=1)
val['predictions'] = y_pred_val
train['predictions'] = y_pred_train
indices = df_train[['week_start_date','total_cases','weekofyear','year','city']]
val_result = val.join(indices,on=None,lsuffix='Left')[['total_cases','predictions','week_start_date','city']]#.set_index('week_start_date')
val_result.sort_values(by='week_start_date',inplace=True)
val_result['data_set'] = 'validation'
train_result = train.join(indices,on=None,lsuffix='Left')[['total_cases','predictions','week_start_date','city']]
train_result.sort_values(by='week_start_date',inplace=True)
train_result['data_set'] = 'train'
predictions = pd.concat([val_result,train_result],axis=0)
predictions['week_start_date'] = pd.to_datetime(predictions['week_start_date'])
predictions = predictions.sort_values(by='week_start_date').set_index('week_start_date')

def plot_results(df,city,label):
    plt.title('city: {}, {} set'.format(city,label))
    plt.plot(df[(df['city']==city)&(df['data_set']==label)]['predictions'],label='predictions',color='blue')
    plt.plot(df[(df['city']==city)&(df['data_set']==label)]['total_cases'],label='total_cases',color='red')
    plt.xticks(rotation=45)
    plt.legend()

plt.figure(figsize=(20,20))
plt.subplot(2,2,1)
plot_results(predictions,'sj','train')
plt.subplot(2,2,2)
plot_results(predictions,'iq','train')
plt.subplot(2,2,3)
plot_results(predictions,'sj','validation')
plt.subplot(2,2,4)
plot_results(predictions,'iq','validation')
plt.subplots_adjust(wspace=0.5,hspace=0.15)
print("Random Forest results, train and validation set")
plt.show()
Random Forest results, train and validation set

[figure: predicted vs actual total_cases for each city, train and validation sets]

V: Forecasting on the test data

Finally, we use the hyperparameters from the best performing random forest regressor to make forecasts on the test set.

X,y= preprocess_data(df_train.copy())
iq_test = df_test[df_test['city']=='iq'].copy()
sj_test = df_test[df_test['city']=='sj'].copy()
sj_test.fillna(method='ffill', inplace=True)
iq_test.fillna(method='ffill', inplace=True)
df_test = pd.concat([sj_test,iq_test],axis=0)

X_test = df_test.copy()
X_test.drop(['week_start_date','year'],axis=1,inplace=True)
X_test['city'] = X_test['city'].apply(lambda x: 1 if x=='sj' else 0)

rf = RandomForestRegressor(n_estimators=50,criterion='mse',max_depth=None,max_features=None)
pipe_final = Pipeline([('scaler', StandardScaler()), ('rf',rf)])
pipe_final.fit(X,y)
y_pred_test = pipe_final.predict(X_test).astype(int)
submission = pd.concat([df_test[['city','year','weekofyear']],pd.Series(y_pred_test)],axis=1)
submission.columns = ['city','year','weekofyear','total_cases']
submission.set_index(['city','year','weekofyear'],inplace=True)
submission.to_csv('submission.csv')

VI: Discussion & Further steps

We can draw a few conclusions about the data from the investigation:

  • The time series of the features and the target are stationary.
  • There are non-linear relationships between the predictors, and between the predictors and the target.
  • Collinearity is present between predictors.
  • Linear models have poor performance.
  • Ensemble tree techniques have issues with overfitting the data.

Further Steps

With more time and computational power (this was done on a rather old MacBook), more advanced modelling techniques could be integrated. The next step I would try is a neural network.
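As a rough sketch of that direction (not part of this analysis, and the architecture here is an arbitrary starting point), a scikit-learn MLPRegressor could be dropped into the same pipeline used for the random forest:

from sklearn.neural_network import MLPRegressor

nn = MLPRegressor(hidden_layer_sizes=(64,32), max_iter=2000, random_state=0)
pipe_nn = Pipeline([('scaler', StandardScaler()), ('nn', nn)])
pipe_nn.fit(X_train, y_train)
y_pred_nn = pipe_nn.predict(X_val)
y_pred_nn[y_pred_nn<0]=0  # clip negative predictions, as with the random forest
print('MLP validation MAE:', round(mean_absolute_error(y_val, y_pred_nn),2))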

VII: Sources