Prediction of Wildfire Causes

Ashutosh Bhayde (AI For Earth)
10 min read · Jun 15, 2021
The Bobcat Fire burning in the Angeles National Forest in California. (Kyle Grillot/AFP via Getty Images)

Are we talking about and taking this issue seriously?

As the years go by, we are facing more and more environment-related issues: droughts, volcanic eruptions, wildfires, and many more. Here we focus our attention on wildfires, which have caused enormous destruction on our planet: birds and animals burned, global warming, respiratory disease, soil erosion, billions of dollars in property damage, and innocent people killed and their communities destroyed. We want to build a model that tries to predict the cause of a fire based on data collected over the years from various sources.

Business problem

We are already aware of the dangerous wildfires that have occurred in the USA, Australia, India, and many other countries. In 2019 we saw the destructive Amazon forest wildfires, whose effects were felt in many countries as clouds of smoke and breathing problems. Although fire detection systems are in place, they do not go far enough to predict wildfires before they ever start, which could prevent destruction and losses or at least provide an early alarm to mitigate the effects. We know that a combination of factors leads to a wildfire, and we will use those factors (not all of them, because the data is not available in one place) to predict as much as possible and help prevent wildfires.

To achieve this we will use the Kaggle 1.88 Million US Wildfires dataset, covering fires that occurred in the United States from 1992 to 2015. The data was collected by federal, state, and local organizations, and the fires it records burned around 140 million acres of land over the 24-year span. We will use this data to build models and predict the cause of each fire, choosing machine learning algorithms and tuning them to predict as closely as possible on held-out data (avoiding overfitting). Once trained, the model can be used by local forest departments to better prepare for wildfires and to stop people from using combustible substances near high-risk areas. Once we achieve a good level of predictive performance, we will try to apply the same technique to other wildfire data.

ML Formulation:

The available data has lots of features/columns, but we have to pick the features that will help us predict the causes accurately. There are thirteen different classes, or causes, of wildfire => a multi-class classification problem.

Implementation

Exploratory Data Analysis: Insights

  • We are dealing with an imbalanced dataset.
  • Wildfires occur on almost every day of the week; however, March, April, July, and August are the months with the largest numbers of wildfires.
  • The most common causes of wildfire are debris burning, miscellaneous, arson (the criminal act of deliberately setting fire to property), and natural lightning.
  • The EDA shows that states like CA (California), GA (Georgia), and TX (Texas) are the most affected and need more manpower and technology to deal with wildfires.
  • Wildfires caused by lightning cover the largest area of any cause other than miscellaneous.
  • There is no clear trend in the number of wildfires across years, so we cannot really forecast the number of wildfires for specific future years.
  • We performed correlation tests (Pearson and chi-square) and found very little correlation between pairs of features and between the features and the target variable. We are dealing with both categorical and numerical data.
  • We extracted 9 features to classify the class labels.
  • We also removed outliers from the latitude and longitude columns using the box-plot (IQR) technique; a minimal sketch follows this list.
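
As a concrete illustration of that last step, here is a minimal sketch of box-plot (IQR) outlier removal. It assumes the Kaggle data has been exported to a CSV with LATITUDE and LONGITUDE columns; the file name is a placeholder, not the exact code used in the project.

```python
import pandas as pd

# Assumed CSV export of the Kaggle wildfire data, with LATITUDE / LONGITUDE columns.
fires = pd.read_csv("wildfires.csv")

def remove_outliers_iqr(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Drop rows outside the box-plot whiskers (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[column].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

for col in ["LATITUDE", "LONGITUDE"]:
    fires = remove_outliers_iqr(fires, col)
```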

What is feature engineering and why do we need it?

Feature Engineering: This is, I think, the most interesting part: your creativity, domain knowledge, and understanding of the existing features and the business problem come into play and help you design new features. To build a well-performing model, or to improve a model's performance, we need good features (in simple words, the quality of the insight each feature carries is what helps the model perform well). This can be achieved by engineering new features from the existing ones (while avoiding multicollinearity) or by designing new features based on domain knowledge.

Fires per state and month: a feature that counts the number of wildfires for each STATE and MONTH combination.
Number of fires per ~10,000 sq. km of area: using latitude and longitude, bucket each fire into an area cell and count the wildfires in that cell, which captures how many wildfires occurred in the surrounding location. A sketch of both features follows.
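
Here is a rough sketch of how both features could be built with pandas. The CSV export, the STATE and DISCOVERY_DATE column names, and the 1° × 1° bucketing (roughly 10,000 sq. km at mid-latitudes) are assumptions for illustration, not the exact code used in the project.

```python
import numpy as np
import pandas as pd

fires = pd.read_csv("wildfires.csv")  # assumed export with STATE, DISCOVERY_DATE, LATITUDE, LONGITUDE

# Feature 1: fires per state and month -- count of wildfires for every (STATE, MONTH) pair.
# Assumes DISCOVERY_DATE has already been converted to a calendar date.
fires["MONTH"] = pd.to_datetime(fires["DISCOVERY_DATE"]).dt.month
fires["FIRES_PER_STATE_MONTH"] = fires.groupby(["STATE", "MONTH"])["MONTH"].transform("count")

# Feature 2: fires per ~10,000 sq. km -- a 1-degree x 1-degree cell is roughly that size
# at mid-latitudes, so bucket the coordinates to whole degrees and count fires per cell.
fires["LAT_BIN"] = np.floor(fires["LATITUDE"])
fires["LON_BIN"] = np.floor(fires["LONGITUDE"])
fires["FIRES_PER_CELL"] = fires.groupby(["LAT_BIN", "LON_BIN"])["LAT_BIN"].transform("count")
```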

Before going forward, a few points to note:

  • The data consists of unpredictable human- and naturally-caused wildfires.
  • We are dealing with a heavily imbalanced dataset.
  • We have few features, and temperature and environmental-condition information is missing.
  • After performing feature analysis (Pearson and chi-square tests), we found that the features are not closely correlated with the target variable/class labels.
  • We are using multi-class log loss and the confusion matrix as our performance metrics. Why? Explained below.

Model building:

Note: There are certain hyper-parameters in the shared code that can be fine-tuned according to one's requirements. I have tuned them to find a sweet spot between speed, memory, and accuracy.

Algorithms

K-NN (K Nearest Neighbors)

Logistic Regression

SVM (Support Vector Machine)

Random Forest Classifier

XGBClassifier

AdaBoostClassifier

Neural Networks

K-NN: the K vectors/points in the dataset that are nearest to the query point. In K-NN we try to find the nearest neighbors of the query point and predict the label most of them share. To find the value of K we performed hyperparameter tuning and plotted the loss curve, from which we found that K=3 is the sweet spot for predicting the causes.

The value of p (power) in the scikit-learn KNN implementation decides the metric used to calculate the distance between two vectors (the default is 2, which means Euclidean distance).
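
Here is a minimal sketch of that tuning loop with scikit-learn. Synthetic stand-in data (9 features, 13 cause labels) replaces the real vectorized wildfire features, and the candidate K values are illustrative.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the vectorized wildfire features (9 features, 13 cause labels).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5000, 9)), rng.integers(0, 13, size=5000)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2, random_state=0)

cv_losses = {}
for k in [3, 5, 7, 11, 15, 21]:
    knn = KNeighborsClassifier(n_neighbors=k, p=2)  # p=2 -> Euclidean distance
    knn.fit(X_train, y_train)
    cv_losses[k] = log_loss(y_cv, knn.predict_proba(X_cv), labels=np.unique(y))

print(min(cv_losses, key=cv_losses.get), cv_losses)  # pick the K with the lowest CV log loss
```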

Logistic Regression: it is actually a classification technique and can be interpreted in terms of geometry, probability, and a loss function.

Probabilistic Interpretation:

W* = argmin_W Σ_i [ -y_i log(p_i) - (1 - y_i) log(1 - p_i) ]

where p_i = sigmoid(Wᵀ x_i) and y_i ∈ {0, 1}

Geometric Interpretation:

W* = argmin_W Σ_i log(1 + exp(-y_i Wᵀ x_i)) + regularization term

where y_i ∈ {-1, +1}

Here, we have to find the optimal weight vector W, normal to the hyperplane that separates the data points into classes, with the positive data points lying in the direction of W (binary classification).

Why do we use the sigmoid function?

W (the weight vector) is heavily impacted by outliers; to deal with this problem we use a technique of squashing the points that lie very far from the decision surface.

Idea of squashing: if the signed distance (y_i Wᵀ x_i) is small we use it as it is, but if the signed distance is large we shrink it, and the sigmoid function does exactly that. Distances are squashed from (-∞, +∞) into (0, 1).

The sigmoid function is easily differentiable and has a probabilistic interpretation, which helps when solving the optimization problem.
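
A tiny numeric illustration of the squashing idea: a far-away outlier barely moves the sigmoid output, while points near the decision surface stay informative. The example distances are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Signed distances from the decision surface: the outliers at +/-100 saturate near 0 and 1,
# so they no longer dominate the weight vector, while nearby points keep useful gradations.
signed_distances = np.array([-100.0, -2.0, -0.5, 0.0, 0.5, 2.0, 100.0])
print(np.round(sigmoid(signed_distances), 4))
# -> [0.     0.1192 0.3775 0.5    0.6225 0.8808 1.    ]
```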

Loss minimization interpretation:

Try to find the best W (weight vector) that gives the minimal number of misclassifications:

W* = argmin_W (number of misclassifications)

For optimization we want the function to be differentiable; to overcome the problem of the non-differentiable 0-1 loss, we use the logistic loss, which is continuous, differentiable, and a good approximation of it.

SVM minimizes the hinge loss, AdaBoost minimizes the exponential loss, and linear regression minimizes the squared loss.

SVM (Support Vector Machine):

[Figure: SVM illustration, from my notes]

Here the main goal of classification is to find the optimal hyperplane that separates the positive points from the negative points as widely as possible. In SVM we try to maximize the margin while correctly classifying the data points.

W*, b* = argmin_{W,b} ||W|| / 2 (regularization term) + C · average distance of the misclassified points (loss term: hinge)

where C = 1/λ
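
Here is a small sketch of fitting a linear SVM for a few values of C with scikit-learn. The stand-in data and the candidate C values are placeholders for the real vectorized features and the tuned grid.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in data; swap in the real vectorized wildfire features and cause labels.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5000, 9)), rng.integers(0, 13, size=5000)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2, random_state=0)

# C plays the role of 1/lambda: larger C puts more weight on the hinge-loss term.
for C in [0.01, 0.1, 1.0]:
    svm = make_pipeline(StandardScaler(), LinearSVC(C=C, max_iter=5000))
    svm.fit(X_train, y_train)
    print(C, round(svm.score(X_cv, y_cv), 3))
```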

Random Forest Classifier: before going to random forests, here is a quick overview of the decision tree classifier, because a random forest uses decision trees as its estimators (base models). In simple words, a decision tree is a "nested if-else condition" based classifier. As the tree grows deeper and deeper there is a chance of overfitting, i.e. an increase in model variance; to overcome this problem we use a technique called Bootstrap Aggregation (bagging). Bagging is a concept that can reduce the variance of the model without impacting its bias.

Random Forest = decision trees of reasonable depth + row sampling with replacement + column sampling + aggregation (majority vote or mean/median).

Boosting: in this method we use the complete dataset sequentially. The data points that were misclassified by the previous model are given more weight so that the next learner can classify them properly. In the test phase each model is evaluated and, based on the test error of each model, its prediction is weighted for voting. Boosting methods decrease the bias of the prediction. A short sketch of both ensembles follows.
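
Here is a minimal sketch contrasting the two ensembles with scikit-learn. The stand-in data and the hyper-parameter values are illustrative rather than the tuned settings from the project.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Stand-in data; swap in the real engineered wildfire features and cause labels.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5000, 9)), rng.integers(0, 13, size=5000)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2, random_state=0)

# Bagging: many trees on bootstrapped rows + sampled columns, aggregated by majority vote.
rf = RandomForestClassifier(n_estimators=200, max_depth=12, max_features="sqrt",
                            class_weight="balanced", random_state=0).fit(X_train, y_train)

# Boosting: shallow trees fit sequentially, re-weighting previously misclassified points.
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0).fit(X_train, y_train)

for name, model in [("Random Forest", rf), ("AdaBoost", ada)]:
    print(name, round(log_loss(y_cv, model.predict_proba(X_cv), labels=np.unique(y)), 3))
```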

Performance Metrics:

Multi-class log loss: why?

We need a metric that gives us a probability for each class label. We are working on wildfire data, which is critical and potentially life-saving, so we want the probability of each cause of fire; these probabilities let us weigh the likely causes and act accordingly before a wildfire occurs.

To obtain correct probabilities for the causes we need a metric that optimizes for them. Fortunately we have log loss, which is computed from the predicted probabilities and penalizes the model even for small deviations. Thus minimizing the log loss helps us drive the model toward the correct probabilities for the causes.

Mathematically, it is defined as:

F = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{M} y_ij · log(p_ij)

Where,

F is the loss

p_ij is the probability, output by the classifier, that sample i belongs to class j

y_ij is a binary indicator (1 if class j is the true label of sample i, 0 otherwise)

N is the number of samples

M is the number of classes
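
A toy example of the formula, computed directly and cross-checked against scikit-learn's log_loss; the class count and the probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy example with M = 3 classes and N = 2 samples.
y_true = np.array([0, 2])
proba = np.array([[0.7, 0.2, 0.1],    # confident and correct -> small penalty
                  [0.3, 0.5, 0.2]])   # true class given only 0.2 -> larger penalty

# Direct translation of the formula: F = -(1/N) * sum_i sum_j y_ij * log(p_ij)
y_onehot = np.eye(3)[y_true]
manual = -np.mean(np.sum(y_onehot * np.log(proba), axis=1))

print(round(manual, 4), round(log_loss(y_true, proba, labels=[0, 1, 2]), 4))  # both ~0.9831
```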

Confusion matrix, precision and recall: why?

Working with wildfire data is not easy because of its imbalanced nature, and we know the confusion matrix works very well with imbalanced binary data; we extend it to our multi-class problem to check whether the model is able to retrieve each class label. We also inspect precision and recall by reading the matrix along its diagonal (highlighted in the heat map), which tells us more about the model's ability to classify/retrieve the class labels correctly.
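
Here is a minimal sketch of producing that heat map and the per-class precision/recall report. The stand-in data and the quickly fitted random forest are placeholders for the real features and the tuned model.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# Stand-in data with 13 cause labels; swap in the real vectorized features and the tuned model.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5000, 9)), rng.integers(0, 13, size=5000)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_cv)

cm = confusion_matrix(y_cv, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")  # a strong diagonal = good per-class recall
plt.xlabel("Predicted cause")
plt.ylabel("True cause")
plt.show()

print(classification_report(y_cv, y_pred, zero_division=0))  # per-class precision and recall
```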

Comparison of models and results:

After training the models on such a large dataset we faced many challenges and memory issues. Nevertheless, we trained many models; below is a summary of them with their log loss values (on the cross-validation data). We could also have trained/tuned the bagging model (random forest) with more estimators, and even tried a stacking classifier with hyper-parameter tuning. Among all the models, random forest performed best at classifying this imbalanced data.

KNN worked very well because of its nearest-neighbor technique. We also checked models like logistic regression and SVM on both balanced and imbalanced data, but they were only slightly better than a random model. For upsampling we used the imbalanced-learn (imblearn) SMOTE technique, which enlarged the data, but we still did not see much improvement in either model. A sketch of the SMOTE step follows.
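
A sketch of the SMOTE step with imbalanced-learn, on made-up skewed labels standing in for the real training split.

```python
from collections import Counter
import numpy as np
from imblearn.over_sampling import SMOTE

# Stand-in imbalanced labels; swap in the real training split.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 9))
class_weights = np.linspace(1, 13, 13)
y_train = rng.choice(13, size=5000, p=class_weights / class_weights.sum())

# Oversample the minority cause classes on the training split only (never on the CV/test split).
X_train_bal, y_train_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)
print(Counter(y_train))      # skewed class counts
print(Counter(y_train_bal))  # every class upsampled to the majority count
```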

From the above we can see that random forest performed quite well on the wildfire data. We hyper-parameter tuned the model and saved the best model in a pickle file, which we deployed onto a server for prediction. We built a web page (HTML) to receive input from the user (this project is aimed at forest departments) and a Flask API to create an app around the model. Once the user enters the input on the web page, we normalize and vectorize it (using transformers stored in pickle files) and pass it to the trained model; the model predicts not only the causes but also the probability of each cause, and the top five causes are printed back on the web page (DM me for a demo of the project). A minimal sketch of the prediction endpoint follows.
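
A minimal sketch of what such a Flask prediction endpoint could look like. The pickle file names, the /predict route, and the JSON input format are hypothetical, not the exact artifacts from the deployed app.

```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
# Hypothetical artifact names: the tuned model and the fitted scaler saved as pickle files.
model = pickle.load(open("best_random_forest.pkl", "rb"))
scaler = pickle.load(open("scaler.pkl", "rb"))

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]              # raw numeric inputs from the web form
    proba = model.predict_proba(scaler.transform([features]))[0]
    top5 = sorted(zip(model.classes_, proba), key=lambda t: -t[1])[:5]
    return jsonify({str(cause): round(float(p), 3) for cause, p in top5})

if __name__ == "__main__":
    app.run(debug=True)
```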

Code:

You can find the code in my GitHub repository (link attached).

Further improvements:

  • To improve model performance, we can try adding new features such as humidity and temperature.
  • We could also use a deep learning model such as a CNN on satellite images of burnt locations to predict fire size and cause.
  • We could extend the project to predict other natural disasters, such as floods and earthquakes, for different countries.
