Name(s): Miles Fichtner
Website Link: https://milesfichtner.github.io/power-outages/
My question asks how well we can predict the cause of a power outage from a set of parameters in a public dataset from the Laboratory for Advancing Sustainable Critical Infrastructure at Purdue University. The dataset covers 2000-2016 and provides substantial regional and sales data, as well as outage characteristics such as duration, date, severity level, and, most importantly, cause.
This dataset was interesting to me because I wanted to learn more about logistic regression on multinomial classification tasks. I attempt to determine the cause of each outage from a few of the major parameters available. Specifically:
Feature | Description |
---|---|
OUTAGE.DURATION | How long the outage lasted |
CUSTOMERS.AFFECTED | Total number of customers affected by the outage |
NERC.REGION | North American Electric Reliability Corporation region |
ANOMALY.LEVEL | Severity level of the outage |
TOTAL.CUSTOMERS | Total customers within the grid or provider |
The dataset has 1,534 rows, where each row represents a separate outage event.
At the beginning of my EDA, I looked at TOTAL.CUSTOMERS and ANOMALY.LEVEL.
I wanted to focus on regional data specific to residential, commercial, or industrial customers as a proportion of the total customers (computed as sketched after the table below). I thought this data would help distinguish causes such as equipment failure, system operability disruption, or intentional attack, since equipment and systems outages might be more common in industrial areas, and intentional attacks more common in urban or residential areas.
RES_PERCENT | COM_PERCENT | IND_PERCENT |
---|---|---|
87.2075 | 12.1341 | 0.606341 |
87.3293 | 12.0819 | 0.588194 |
87.6765 | 11.6891 | 0.630401 |
87.5195 | 11.7683 | 0.711574 |
87.0738 | 12.1158 | 0.769811 |
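These proportions can be derived from the raw customer counts. Below is a minimal sketch, assuming the dataset's RES.CUSTOMERS, COM.CUSTOMERS, and IND.CUSTOMERS columns and a DataFrame loaded as `outages` (the file path and variable name are mine, for illustration):

```python
import pandas as pd

# Load the outage data; "outages.csv" is a placeholder path.
outages = pd.read_csv("outages.csv")

# Express each customer class as a share of TOTAL.CUSTOMERS.
for cls in ["RES", "COM", "IND"]:
    outages[f"{cls}_PERCENT"] = (
        outages[f"{cls}.CUSTOMERS"] / outages["TOTAL.CUSTOMERS"] * 100
    )

print(outages[["RES_PERCENT", "COM_PERCENT", "IND_PERCENT"]].head())
```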
I found only a minimal relationship between the residential, commercial, and industrial proportions and the classes I hoped to predict, so I began to look elsewhere.
So instead I looked at TOTAL.CUSTOMERS more holistically against ANOMALY.LEVEL, and found that total customers followed U.S._STATE quite closely. This meant that if I wanted to capture regional data and still gain extra explanatory power from customer data, it might be better to look at CUSTOMERS.AFFECTED instead of total customers.
As you can see above, TOTAL.CUSTOMERS and U.S._STATE are grouped quite closely, which motivated me to look at CUSTOMERS.AFFECTED as a replacement. It also suggests there could be a separate relationship between some of the regional customer totals and ANOMALY.LEVEL, as some natural clusters form.
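The plot described above could be reproduced along these lines (a sketch using plotly express; `outages` is the DataFrame from earlier):

```python
import plotly.express as px

# Scatter ANOMALY.LEVEL against TOTAL.CUSTOMERS, colored by state, to see
# how tightly total customers tracks U.S._STATE.
fig = px.scatter(
    outages,
    x="ANOMALY.LEVEL",
    y="TOTAL.CUSTOMERS",
    color="U.S._STATE",
    hover_data=["CAUSE.CATEGORY"],
)
fig.show()
```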
However, CUSTOMERS.AFFECTED required more thoughtfulness, as many of its values had to be imputed.
CAUSE.CATEGORY | Non-Missing Count |
---|---|
equipment failure | 60 |
fuel supply emergency | 51 |
intentional attack | 418 |
islanding | 46 |
public appeal | 69 |
severe weather | 763 |
system operability disruption | 127 |
CAUSE.CATEGORY | Missing Count |
---|---|
equipment failure | 30 |
fuel supply emergency | 44 |
intentional attack | 219 |
islanding | 12 |
public appeal | 48 |
severe weather | 46 |
system operability disruption | 44 |
For CUSTOMERS.AFFECTED, I filled all missing values with 0.
I originally wanted to impute either from a random distribution or with the mean conditional on CAUSE.CATEGORY, since, as the tables above show, severe weather has almost all of its values (it needs barely any imputation) while almost every other CAUSE.CATEGORY class needs lots of imputation.
However, imputing conditional on CAUSE.CATEGORY would introduce a level of 'optimistic bias' into the full dataset, given that CAUSE.CATEGORY is my response variable: if I imputed the variable before splitting off my test data, it would artificially boost my test and training accuracy.
Imputing with zero will still skew values downward; however, many of the NaN values may genuinely have been zero, as the dataset may simply not have recorded customers affected when the number was not considered relevant for a given outage.
Beyond CUSTOMERS.AFFECTED, I imputed OUTAGE.DURATION conditional on U.S._STATE, and ANOMALY.LEVEL conditional on CAUSE.CATEGORY.
While I understand that imputing based on cause would again create an artificial bias, very few anomaly levels needed imputation, so this likely will not impact model performance.
I chose to impute duration based on U.S._STATE, as it seemed like a reasonable basis with little correlation to the response variable, assuming random sampling.
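Put together, the imputation steps look roughly like this (a sketch; I use group means for the conditional fills, which is one reasonable choice of statistic but is my assumption here):

```python
# CUSTOMERS.AFFECTED: fill missing values with 0, per the discussion above.
outages["CUSTOMERS.AFFECTED"] = outages["CUSTOMERS.AFFECTED"].fillna(0)

# OUTAGE.DURATION: fill with the mean duration for the outage's state.
outages["OUTAGE.DURATION"] = (
    outages.groupby("U.S._STATE")["OUTAGE.DURATION"]
    .transform(lambda s: s.fillna(s.mean()))
)

# ANOMALY.LEVEL: fill with the mean level for the outage's cause category.
outages["ANOMALY.LEVEL"] = (
    outages.groupby("CAUSE.CATEGORY")["ANOMALY.LEVEL"]
    .transform(lambda s: s.fillna(s.mean()))
)
```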
CAUSE.CATEGORY | Missing-to-Present Ratio |
---|---|
equipment failure | 0.500000 |
fuel supply emergency | 0.862745 |
intentional attack | 0.523923 |
islanding | 0.260870 |
public appeal | 0.695652 |
severe weather | 0.060288 |
system operability disruption | 0.346457 |
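For reference, the missing counts and this ratio come straight out of a groupby on CUSTOMERS.AFFECTED (run before the zero-fill above, while the NaNs are still present):

```python
# Count present vs. missing CUSTOMERS.AFFECTED values within each cause.
grouped = outages.groupby("CAUSE.CATEGORY")["CUSTOMERS.AFFECTED"]
present = grouped.count()           # rows with a recorded value
missing = grouped.size() - present  # rows that were NaN
print(missing / present)            # reproduces the ratio table above
```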
I am performing a multiclass classification task.
I chose to score my models on accuracy before anything else: given the multiclass nature of the question, the proportion of correct predictions over all predictions provided the most utility. I also kept an eye on class-weighted accuracy to account for bias from classes with many more outages, such as severe weather and intentional attack.
I chose Logistic Regression for my baseline model.
The process I used to determine my baseline model was relatively unconventional as well: rather than sticking with a base few features, I cycled back through them repeatedly. Originally I set up a base model with a basic pipeline scaling the ANOMALY.LEVEL and TOTAL.CUSTOMERS parameters. This model had an accuracy of 54% and a precision of 44%. These values were very low, and were part of the motivation for using CUSTOMERS.AFFECTED instead.
In my second iteration, I used CUSTOMERS.AFFECTED and ANOMALY.LEVEL, receiving better performance of 66% accuracy. I then included OUTAGE.DURATION to get my current baseline model with an accuracy of 71%. This iteration process was meshed into my EDA more than anything else, but it served as an important learning process.
All these parameters are quantitative, and I used a Pipeline object with standard scaling as a preprocessing step.
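A minimal sketch of this baseline, assuming a standard sklearn setup (the split parameters and `max_iter` are my own choices, not stated above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = ["OUTAGE.DURATION", "CUSTOMERS.AFFECTED", "ANOMALY.LEVEL"]
X_train, X_test, y_train, y_test = train_test_split(
    outages[features], outages["CAUSE.CATEGORY"], random_state=42
)

# Scale the quantitative features, then fit a multinomial logistic regression.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

y_pred = baseline.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
# A class-weighted score guards against severe weather and intentional
# attack, the two dominant classes, inflating plain accuracy.
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
```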
For the final model, I one-hot encoded NERC.REGION to attempt to capture some of the major regional differences; this encapsulates some of the water and land percentage differences, in an effort to give more explanatory power for groups such as severe weather and islanding.
I engineered new features using a PolynomialFeatures transformer and GridSearchCV with a RandomForestClassifier. I found that the optimal polynomial degree was 2, applied across all my quantitative features: OUTAGE.DURATION, CUSTOMERS.AFFECTED, and ANOMALY.LEVEL.
I kept the standard scaling and one-hot encoding steps. The generalization and hyperparameter tuning improved my accuracy by 8%, bringing the total model accuracy to 79%. While this was less than I was expecting, I think some of the accuracy could have been improved by different imputation decisions. From the confusion matrix of the final model, I can see that the predictions were much more evenly spread across all groups, while a large share of the false positives remained in the intentional attack column. This likely shows that imputing with a value of 0 is not a perfect method.
Other optimal parameters coming from my GridSearchCV were a maximum depth of 50, a minimum of 1 sample per leaf, and a minimum split size of 5. The best model's precision was 77%.
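The final pipeline might be wired up as below. This is a sketch: only degree 2, max depth 50, min samples per leaf 1, and min samples split 5 are reported above; the other grid values are illustrative.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

numeric = ["OUTAGE.DURATION", "CUSTOMERS.AFFECTED", "ANOMALY.LEVEL"]
X = outages[numeric + ["NERC.REGION"]]
y = outages["CAUSE.CATEGORY"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

preprocess = ColumnTransformer([
    # Polynomial terms (degree tuned below) plus scaling for the
    # quantitative features.
    ("num", Pipeline([
        ("poly", PolynomialFeatures()),
        ("scale", StandardScaler()),
    ]), numeric),
    # One-hot encode the region to capture coarse geographic differences.
    ("region", OneHotEncoder(handle_unknown="ignore"), ["NERC.REGION"]),
])

model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Tune the polynomial degree alongside the forest hyperparameters.
grid = GridSearchCV(
    model,
    param_grid={
        "prep__num__poly__degree": [1, 2, 3],
        "rf__max_depth": [10, 50, None],
        "rf__min_samples_leaf": [1, 2, 4],
        "rf__min_samples_split": [2, 5, 10],
    },
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```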
I believe that the RandomForestClassifier was able to outperform the LogisticRegression given the medium dataset size, the larger number of features to tune, and the likely non-linear relationships in the quantitative data. Also, the intentional attack class had 19 large outliers despite its CUSTOMERS.AFFECTED values being largely 0, which calls for the more robust handling of outliers that random forest classification provides.