Name(s): Miles Fichtner
Website Link: https://milesfichtner.github.io/power-outages/
My question asks how well we can predict the cause of a power outage from a set of parameters in a public dataset from the Laboratory for Advancing Sustainable Critical Infrastructure at Purdue University. The dataset covers 2000-2016 and provides substantial regional and sales data, as well as outage characteristics such as duration, date, severity level, and, most importantly, cause.
This dataset was interesting to me because I wanted to learn more about logistic regression on multinomial classification tasks. I attempt to determine the cause of each outage from a few of the major parameters available. Specifically:
Feature | Description |
---|---|
OUTAGE.DURATION | How long the outage lasted |
CUSTOMERS.AFFECTED | Total number of customers affected by the outage |
NERC.REGION | North American Electric Reliability Corporation region |
ANOMALY.LEVEL | Severity level of the outage |
TOTAL.CUSTOMERS | Total customers within the grid or provider |
The dataset has 1,534 rows, where each row represents a separate outage event.
At the beginning of my EDA, I looked at TOTAL.CUSTOMERS and ANOMALY.LEVEL.
I wanted to focus on regional data specific to residential, commercial, or industrial customers as a proportion of the total customers (computed as sketched after the table below). I thought this data would help distinguish causes such as equipment failure, system operability disruption, or intentional attack, since equipment and systems outages might be more common in industrial areas, and intentional attacks more common in urban or residential areas.
RES_PERCENT | COM_PERCENT | IND_PERCENT |
---|---|---|
87.2075 | 12.1341 | 0.606341 |
87.3293 | 12.0819 | 0.588194 |
87.6765 | 11.6891 | 0.630401 |
87.5195 | 11.7683 | 0.711574 |
87.0738 | 12.1158 | 0.769811 |
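These proportions can be derived from the raw customer counts. Below is a minimal sketch, assuming the dataset's RES.CUSTOMERS, COM.CUSTOMERS, and IND.CUSTOMERS columns and a DataFrame loaded as `outages` (the file path and variable name are mine, for illustration):

```python
import pandas as pd

# Load the outage data; "outages.csv" is a placeholder path.
outages = pd.read_csv("outages.csv")

# Express each customer class as a share of TOTAL.CUSTOMERS.
for cls in ["RES", "COM", "IND"]:
    outages[f"{cls}_PERCENT"] = (
        outages[f"{cls}.CUSTOMERS"] / outages["TOTAL.CUSTOMERS"] * 100
    )

print(outages[["RES_PERCENT", "COM_PERCENT", "IND_PERCENT"]].head())
```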
I found only a minimal relationship between the residential, commercial, and industrial proportions and the classes I hoped to predict, so I began to look elsewhere.
So instead I looked at TOTAL.CUSTOMERS more holistically against ANOMALY.LEVEL, and found that total customers followed U.S._STATE quite closely. This meant that if I wanted to capture regional data and still gain extra explanatory power from customer data, it might be better to look at CUSTOMERS.AFFECTED instead of total customers.
As you can see above, TOTAL.CUSTOMERS and U.S._STATE are grouped quite closely, which motivated me to look at CUSTOMERS.AFFECTED as a replacement. It also suggests there could be a separate relationship between some of the regional customer totals and ANOMALY.LEVEL, as some natural clusters form.
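The plot described above could be reproduced along these lines (a sketch using plotly express; `outages` is the DataFrame from earlier):

```python
import plotly.express as px

# Scatter ANOMALY.LEVEL against TOTAL.CUSTOMERS, colored by state, to see
# how tightly total customers tracks U.S._STATE.
fig = px.scatter(
    outages,
    x="ANOMALY.LEVEL",
    y="TOTAL.CUSTOMERS",
    color="U.S._STATE",
    hover_data=["CAUSE.CATEGORY"],
)
fig.show()
```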
However, CUSTOMERS.AFFECTED required more thoughtfulness, as many of its values had to be imputed.
CAUSE.CATEGORY | Non-Missing Count |
---|---|
equipment failure | 60 |
fuel supply emergency | 51 |
intentional attack | 418 |
islanding | 46 |
public appeal | 69 |
severe weather | 763 |
system operability disruption | 127 |
CAUSE.CATEGORY | Missing Count |
---|---|
equipment failure | 30 |
fuel supply emergency | 44 |
intentional attack | 219 |
islanding | 12 |
public appeal | 48 |
severe weather | 46 |
system operability disruption | 44 |
For CUSTOMERS.AFFECTED, I filled all missing values with 0.
I originally wanted to impute either from a random distribution or with the mean conditional on CAUSE.CATEGORY, since, as the tables above show, severe weather has almost all of its values (it needs barely any imputation) while almost every other CAUSE.CATEGORY class needs lots of imputation.
However, imputing conditional on CAUSE.CATEGORY would introduce a level of 'optimistic bias' into the full dataset, given that CAUSE.CATEGORY is my response variable: if I imputed the variable before splitting off my test data, it would artificially boost my test and training accuracy.
Imputing with zero will still skew values downward; however, many of the NaN values may genuinely have been zero, as the dataset may simply not have recorded customers affected when the number was not considered relevant for a given outage.
Beyond CUSTOMERS.AFFECTED, I imputed OUTAGE.DURATION conditional on U.S._STATE, and ANOMALY.LEVEL conditional on CAUSE.CATEGORY.
While I understand that imputing based on cause would again create an artificial bias, very few anomaly levels needed imputation, so this likely will not impact model performance.
I chose to impute duration based on U.S._STATE, as it seemed like a reasonable basis with little correlation to the response variable, assuming random sampling.
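Put together, the imputation steps look roughly like this (a sketch; I use group means for the conditional fills, which is one reasonable choice of statistic but is my assumption here):

```python
# CUSTOMERS.AFFECTED: fill missing values with 0, per the discussion above.
outages["CUSTOMERS.AFFECTED"] = outages["CUSTOMERS.AFFECTED"].fillna(0)

# OUTAGE.DURATION: fill with the mean duration for the outage's state.
outages["OUTAGE.DURATION"] = (
    outages.groupby("U.S._STATE")["OUTAGE.DURATION"]
    .transform(lambda s: s.fillna(s.mean()))
)

# ANOMALY.LEVEL: fill with the mean level for the outage's cause category.
outages["ANOMALY.LEVEL"] = (
    outages.groupby("CAUSE.CATEGORY")["ANOMALY.LEVEL"]
    .transform(lambda s: s.fillna(s.mean()))
)
```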
CAUSE.CATEGORY | Missing-to-Present Ratio |
---|---|
equipment failure | 0.500000 |
fuel supply emergency | 0.862745 |
intentional attack | 0.523923 |
islanding | 0.260870 |
public appeal | 0.695652 |
severe weather | 0.060288 |
system operability disruption | 0.346457 |
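For reference, the missing counts and this ratio come straight out of a groupby on CUSTOMERS.AFFECTED (run before the zero-fill above, while the NaNs are still present):

```python
# Count present vs. missing CUSTOMERS.AFFECTED values within each cause.
grouped = outages.groupby("CAUSE.CATEGORY")["CUSTOMERS.AFFECTED"]
present = grouped.count()           # rows with a recorded value
missing = grouped.size() - present  # rows that were NaN
print(missing / present)            # reproduces the ratio table above
```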
I am performing a multiclass classification task.
I chose to score my models on accuracy before anything else: given the multiclass nature of the question, the proportion of correct predictions over all predictions provided the most utility. I also kept an eye on class-weighted accuracy to account for bias from classes with many more outages, such as severe weather and intentional attack.
I chose Logistic Regression for my baseline model.
The process I used to determine my baseline model was relatively unconventional as well: rather than sticking with a base few features, I cycled back through them repeatedly. Originally I set up a base model with a basic pipeline scaling the ANOMALY.LEVEL and TOTAL.CUSTOMERS parameters. This model had an accuracy of 54% and a precision of 44%. These values were very low, and were part of the motivation for using CUSTOMERS.AFFECTED instead.
In my second iteration, I used CUSTOMERS.AFFECTED and ANOMALY.LEVEL, receiving better performance of 66% accuracy. I then included OUTAGE.DURATION to get my current baseline model with an accuracy of 71%. This iteration process was meshed into my EDA more than anything else, but it served as an important learning process.
All these parameters are quantitative, and I used a Pipeline object with standard scaling as a preprocessing step.
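A minimal sketch of this baseline, assuming a standard sklearn setup (the split parameters and `max_iter` are my own choices, not stated above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = ["OUTAGE.DURATION", "CUSTOMERS.AFFECTED", "ANOMALY.LEVEL"]
X_train, X_test, y_train, y_test = train_test_split(
    outages[features], outages["CAUSE.CATEGORY"], random_state=42
)

# Scale the quantitative features, then fit a multinomial logistic regression.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

y_pred = baseline.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
# A class-weighted score guards against severe weather and intentional
# attack, the two dominant classes, inflating plain accuracy.
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
```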
For the final model, I one-hot encoded NERC.REGION to attempt to capture some of the major regional differences; this encapsulates some of the water and land percentage differences, in an effort to give more explanatory power for groups such as severe weather and islanding.
I engineered new features using a PolynomialFeatures transformer and GridSearchCV with a RandomForestClassifier. I found that the optimal polynomial degree was 2, applied across all my quantitative features: OUTAGE.DURATION, CUSTOMERS.AFFECTED, and ANOMALY.LEVEL.
I kept the standard scaling and one-hot encoding steps. The generalization and hyperparameter tuning improved my accuracy by 8%, bringing the total model accuracy to 79%. While this was less than I was expecting, I think some of the accuracy could have been improved by different imputation decisions. From the confusion matrix of the final model, I can see that the predictions were much more evenly spread across all groups, while a large share of the false positives remained in the intentional attack column. This likely shows that imputing with a value of 0 is not a perfect method.
Other optimal parameters coming from my GridSearchCV were a maximum depth of 50, a minimum of 1 sample per leaf, and a minimum split size of 5. The best model's precision was 77%.
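The final pipeline might be wired up as below. This is a sketch: only degree 2, max depth 50, min samples per leaf 1, and min samples split 5 are reported above; the other grid values are illustrative.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

numeric = ["OUTAGE.DURATION", "CUSTOMERS.AFFECTED", "ANOMALY.LEVEL"]
X = outages[numeric + ["NERC.REGION"]]
y = outages["CAUSE.CATEGORY"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

preprocess = ColumnTransformer([
    # Polynomial terms (degree tuned below) plus scaling for the
    # quantitative features.
    ("num", Pipeline([
        ("poly", PolynomialFeatures()),
        ("scale", StandardScaler()),
    ]), numeric),
    # One-hot encode the region to capture coarse geographic differences.
    ("region", OneHotEncoder(handle_unknown="ignore"), ["NERC.REGION"]),
])

model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Tune the polynomial degree alongside the forest hyperparameters.
grid = GridSearchCV(
    model,
    param_grid={
        "prep__num__poly__degree": [1, 2, 3],
        "rf__max_depth": [10, 50, None],
        "rf__min_samples_leaf": [1, 2, 4],
        "rf__min_samples_split": [2, 5, 10],
    },
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```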
I believe that the RandomForestClassifier was able to outperform the LogisticRegression given the medium dataset size, the larger number of features to tune, and the likely non-linear relationships in the quantitative data. Also, the intentional attack class had 19 large outliers despite its CUSTOMERS.AFFECTED values being largely 0, which calls for the more robust handling of outliers that random forest classification provides.