power-outages

Predicting the cause of power outages from U.S. power outage data

Name(s): Miles Fichtner
Website Link: https://milesfichtner.github.io/power-outages/

Introduction

My question asks how we can predict the cause of a power outage from a set of parameters in a public dataset from the Laboratory for Advancing Sustainable Critical Infrastructure at Purdue University. The dataset spans 2000-2016 and provides substantial regional and sales data, as well as outage characteristics such as duration, date, severity level, and, most importantly, cause.

This dataset was interesting to me because I wanted to learn more about logistic regression on multinomial classification tasks. I attempt to determine the cause of each outage from a few of the major parameters available. Specifically:

| Feature | Description |
| --- | --- |
| OUTAGE.DURATION | How long the outage lasted |
| CUSTOMERS.AFFECTED | Total number of customers affected by the outage |
| NERC.REGION | North American Electric Reliability Corporation region |
| ANOMALY.LEVEL | Severity level of the outage |
| TOTAL.CUSTOMERS | Total customers within the grid or provider |

The dataset has 1,534 rows, where each row represents a separate outage event.

Data Cleaning and EDA (Exploratory Data Analysis)

At the beginning of my EDA, I looked at TOTAL.CUSTOMERS and ANOMALY.LEVEL.

I wanted to focus on regional data specific to residential, commercial, or industrial customers as a proportion of total customers. I thought this data would help distinguish causes such as equipment failure, system operability disruptions, or intentional attacks, since equipment and system outages might be more common in industrial areas and intentional attacks more common in urban or residential areas.
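A minimal sketch of how these proportions could be derived, assuming the raw dataset exposes per-sector customer counts under the Purdue dataset's column conventions (the file name and exact column names here are assumptions):

```python
import pandas as pd

# Hypothetical load; the Purdue data may ship as Excel rather than CSV.
outages = pd.read_csv("outage.csv")

# Derive each sector's share of total customers as a percentage.
for sector, col in [("RES", "RES.CUSTOMERS"),
                    ("COM", "COM.CUSTOMERS"),
                    ("IND", "IND.CUSTOMERS")]:
    outages[f"{sector}_PERCENT"] = outages[col] / outages["TOTAL.CUSTOMERS"] * 100
```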

Univariate Analysis:

Interesting aggregates:

| RES_PERCENT | COM_PERCENT | IND_PERCENT |
| --- | --- | --- |
| 87.2075 | 12.1341 | 0.606341 |
| 87.3293 | 12.0819 | 0.588194 |
| 87.6765 | 11.6891 | 0.630401 |
| 87.5195 | 11.7683 | 0.711574 |
| 87.0738 | 12.1158 | 0.769811 |

I found only a minimal difference in residential, commercial, and industrial proportions across the classes I hoped to predict, so I began to look elsewhere.

So instead I looked at TOTAL.CUSTOMERS more holistically against ANOMALY.LEVEL, and found that total customers followed U.S._STATE quite closely. This meant that if I wanted to capture regional data and still gain extra explanatory power from customer data, it might be better to look at CUSTOMERS.AFFECTED instead of total customers.

Bivariate Analysis

As you can see above, TOTAL.CUSTOMERS and U.S._STATE are grouped quite closely, which motivated me to look at CUSTOMERS.AFFECTED as a replacement. It also suggests there could be a separate relationship between some of the regional customer totals and ANOMALY.LEVEL, as some natural clusters form.

However, CUSTOMERS.AFFECTED required more thought, as many of its values had to be imputed.

| CAUSE.CATEGORY | Count with Values |
| --- | --- |
| equipment failure | 60 |
| fuel supply emergency | 51 |
| intentional attack | 418 |
| islanding | 46 |
| public appeal | 69 |
| severe weather | 763 |
| system operability disruption | 127 |

| CAUSE.CATEGORY | Count Missing Values |
| --- | --- |
| equipment failure | 30 |
| fuel supply emergency | 44 |
| intentional attack | 219 |
| islanding | 12 |
| public appeal | 48 |
| severe weather | 46 |
| system operability disruption | 44 |

Imputation

Beyond CUSTOMERS.AFFECTED, I imputed OUTAGE.DURATION conditional on U.S._STATE and ANOMALY.LEVEL conditional on CAUSE.CATEGORY.

While I understand that imputing based on cause would again create an artificial bias, very few outage events needed their anomaly level imputed, so this likely will not impact model performance.

I chose to impute OUTAGE.DURATION based on U.S. state, as state seemed like a reasonable basis that lacks correlation with our response variable, assuming random sampling.
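A minimal sketch of this conditional imputation, assuming the cleaned data lives in a DataFrame named `outages`; the writeup does not say which group statistic was used, so the group median here is an assumption:

```python
import pandas as pd

def impute_by_group(df: pd.DataFrame, target: str, group: str) -> pd.Series:
    """Fill missing values of `target` with the median of its `group`."""
    return df[target].fillna(df.groupby(group)[target].transform("median"))

# OUTAGE.DURATION conditional on U.S._STATE
outages["OUTAGE.DURATION"] = impute_by_group(outages, "OUTAGE.DURATION", "U.S._STATE")

# ANOMALY.LEVEL conditional on CAUSE.CATEGORY
outages["ANOMALY.LEVEL"] = impute_by_group(outages, "ANOMALY.LEVEL", "CAUSE.CATEGORY")

# CUSTOMERS.AFFECTED: the Final Model discussion suggests missing values
# were filled with 0, a simple but imperfect choice.
outages["CUSTOMERS.AFFECTED"] = outages["CUSTOMERS.AFFECTED"].fillna(0)
```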

Other Interesting Aggregates

| CAUSE.CATEGORY | Missing-to-Observed Ratio |
| --- | --- |
| equipment failure | 0.500000 |
| fuel supply emergency | 0.862745 |
| intentional attack | 0.523923 |
| islanding | 0.260870 |
| public appeal | 0.695652 |
| severe weather | 0.060288 |
| system operability disruption | 0.346457 |
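These values match the ratio of missing to observed CUSTOMERS.AFFECTED counts from the tables above (e.g. 30 / 60 = 0.5 for equipment failure, 44 / 51 ≈ 0.8627 for fuel supply emergency), so the column is labeled as a ratio rather than a percentage. A sketch of the computation, assuming it runs before the imputation step:

```python
# Missing vs. observed CUSTOMERS.AFFECTED counts per cause category,
# computed on the pre-imputation data (count() excludes NaN).
counts = outages.groupby("CAUSE.CATEGORY")["CUSTOMERS.AFFECTED"].agg(
    missing=lambda s: s.isna().sum(),
    observed="count",
)
missing_ratio = counts["missing"] / counts["observed"]
```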

Framing a Prediction Problem

I am performing a multiclass classification task: predicting CAUSE.CATEGORY.

I chose to score my models on accuracy before anything else; given the multiclass nature of the question, the proportion of correct predictions over all predictions provided the most utility. I also considered a class-weighted view of accuracy to account for imbalance across classes, since some causes, such as severe weather and intentional attack, have many more outages than others.
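A sketch of how these scores could be computed with scikit-learn; using balanced_accuracy_score for the class-weighted view is my reading of the paragraph above, and `y_test` / `y_pred` stand in for held-out labels and a fitted model's predictions:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Plain accuracy: proportion of correct predictions over all predictions.
acc = accuracy_score(y_test, y_pred)

# Balanced accuracy: recall averaged over classes, which down-weights
# dominant classes like severe weather and intentional attack.
bal_acc = balanced_accuracy_score(y_test, y_pred)

print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
```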

Baseline Model

I chose to go with Logistic Regression for my baseline model.

The process I used to determine my baseline model was relatively unconventional as well: rather than sticking with a base few features, I came back and cycled through them. Originally I set up a base model with a basic pipeline scaling the ANOMALY.LEVEL and TOTAL.CUSTOMERS parameters. This model had an accuracy of 54% and a precision of 44%. These values were very low, which was part of the motivation for using CUSTOMERS.AFFECTED instead.

In my second iteration, I used CUSTOMERS.AFFECTED and ANOMALY.LEVEL, which performed better at 66% accuracy. I then included OUTAGE.DURATION to get my current base model, with an accuracy of 71%. This iteration process was meshed into my EDA more than anything else, but it served as an important learning experience.

All of these parameters are quantitative, and I used a Pipeline object with standard scaling as a preprocessing step.
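A minimal sketch of the baseline pipeline, assuming the imputed `outages` DataFrame from above; the train/test split settings and `max_iter` are assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

features = ["OUTAGE.DURATION", "CUSTOMERS.AFFECTED", "ANOMALY.LEVEL"]
X = outages[features]
y = outages["CAUSE.CATEGORY"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale the quantitative features, then fit a multinomial logistic regression.
baseline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # accuracy on the held-out set
```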

Final Model

For the final model, I one-hot encoded NERC.REGION to attempt to capture some of the major regional differences. This encapsulates some of the water and land percentage differences, in an effort to give more explanatory power for classes such as severe weather and islanding.

I engineered new features using a PolynomialFeatures transformer and tuned hyperparameters with GridSearchCV over a RandomForestClassifier. I found that the optimal polynomial degree was 2, applied across all of my quantitative features: OUTAGE.DURATION, CUSTOMERS.AFFECTED, and ANOMALY.LEVEL.
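A sketch of the final pipeline and grid search; the parameter grid beyond the reported best values, and details like `cv=5`, are assumptions:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

num_features = ["OUTAGE.DURATION", "CUSTOMERS.AFFECTED", "ANOMALY.LEVEL"]
cat_features = ["NERC.REGION"]

# Scale and expand the quantitative features; one-hot encode the region.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("scale", StandardScaler()),
        ("poly", PolynomialFeatures()),
    ]), num_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", RandomForestClassifier(random_state=42)),
])

param_grid = {
    "prep__num__poly__degree": [1, 2, 3],
    "clf__max_depth": [10, 50, None],
    "clf__min_samples_leaf": [1, 2, 4],
    "clf__min_samples_split": [2, 5, 10],
}
search = GridSearchCV(model, param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)
print(search.best_params_)  # reported best: degree=2, max_depth=50,
                            # min_samples_leaf=1, min_samples_split=5
```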

I kept the standard scaling and one-hot encoding steps. The generalization and hyperparameter tuning improved my accuracy by 8 percentage points, bringing the total model accuracy to 79%. While this was less than I was expecting, I think some of the accuracy could have been recovered with better imputation decisions. From the confusion matrix of the final model, I can see that the errors were much more spread out across all groups, while a large share of the false positives remained in the intentional attack column. This likely shows how imputing missing values with 0 is not a perfect method.
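A sketch of how the confusion matrix referenced above could be produced from the fitted grid search; plot styling is an assumption:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Rows are true CAUSE.CATEGORY labels, columns are predictions; off-diagonal
# mass in the intentional attack column corresponds to the false positives
# discussed above.
ConfusionMatrixDisplay.from_estimator(
    search.best_estimator_, X_test, y_test, xticks_rotation="vertical"
)
plt.show()
```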

Other optimal parameters from my GridSearchCV were a maximum depth of 50, a minimum of 1 sample per leaf, and a minimum samples split of 5. The best model's precision was 77%.

I believe the RandomForestClassifier was able to outperform LogisticRegression given the medium dataset size, the larger number of quantitative variables to be tuned, and the likely non-linear relationships in the quantitative data. Also, the intentional attack class has 19 large outliers despite its CUSTOMERS.AFFECTED values being largely 0, which calls for the more robust handling of outliers that random forest classification provides.