Using machine learning to help analyze an education problem
"Education is the most powerful weapon which you can use to change the world" - Nelson Mandela
Nelson Mandela was right. Education is a powerful weapon as well as one of life's greatest gifts. In the United States, there are various educational program options for children such as public schools, charter schools and private schools. Although there are several education pathways to choose from, do they all aim at the same goal of achieving student success?
I came across a challenge on Kaggle called PASSNYC: Data Science for Good. The challenge was created by PASSNYC, a not-for-profit organization that aims to help public school students in New York City. Its goal was to use publicly available data to assess the needs of students, quantify the challenges they face at school, and find ways to encourage them to take the Specialized High School Admissions Test (SHSAT). PASSNYC wants to find which schools are in need of help so that it can increase the number of students who take the test. Although this challenge was interesting in its own right, I wanted to solve a different problem using a machine learning approach: how can we help increase the student success of public schools?
Objective: Can we predict the student achievement rating of public schools?
In the data, there was a field that rates the achievement of students for each school, which I used to determine whether the school was reaching the target of student success. Student achievement rating had 4 different levels: Not Meeting Target, Approaching Target, Meeting Target, and Exceeding Target. This makes our analysis a supervised multi-class classification machine learning task:
Supervised: the target values were provided along with the training data
Multi-class Classification: the target data has 4 different categorical values (which I later turned into 3 to increase the performance of the models)
Exploratory Data Analysis
The first step, as with any data science project, was to understand the data. There were 160 features/columns in the data set, consisting of both numerical and categorical values. Based on the features, I identified 4 groups: economic, race, education ratings, and tests. For example, the economic group consisted of School Income Estimate (numerical) and Economic Need Index (numerical), which are used to identify the economic status of the public school.
The group of features that I ended up removing the most from the final models was tests. These features included the total number of students who took the Specialized High School Admissions Test (SHSAT). During the first analysis, I included all of these features, but I later discovered that they neither helped solve the problem nor improved the performance of the models. In the end, I kept only the average test scores (both ELA and Math), not the number of students who took the tests.
What did the distribution of student achievement rating originally look like?
By looking at the distribution of student achievement rating, I realized that there would be some cleaning to do. There were only 6 schools in the data that were Not Meeting Target of student achievement, compared to over 600 schools that were Meeting Target. With so few examples of Not Meeting Target, it would be difficult for the models to learn to predict that class. Therefore, I decided to relabel Not Meeting Target as Approaching Target, since in reality, they are both: A) not meeting the target of student achievement, and B) trying to meet the target (well, at least I hope they are). There were other features that had the same 4 rating levels (Not Meeting Target, Approaching Target, Meeting Target, and Exceeding Target), so the same change was made for those features. After making the changes, the distribution of student achievement looked like this:
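As a rough sketch of the relabeling step described above, here is how the rare level could be folded into its neighbor with pandas. The column name and the six toy rows are assumptions for illustration only, not the actual Kaggle data.

```python
import pandas as pd

# Hypothetical column name and toy values, for illustration only.
schools = pd.DataFrame({
    "Student Achievement Rating": [
        "Not Meeting Target", "Approaching Target", "Meeting Target",
        "Meeting Target", "Exceeding Target", "Not Meeting Target",
    ]
})


def merge_rare_rating(series: pd.Series) -> pd.Series:
    """Fold the rare 'Not Meeting Target' level into 'Approaching Target'."""
    return series.replace({"Not Meeting Target": "Approaching Target"})


schools["Student Achievement Rating"] = merge_rare_rating(
    schools["Student Achievement Rating"]
)

# Count how many schools fall into each of the remaining three levels.
rating_counts = schools["Student Achievement Rating"].value_counts()
print(rating_counts)
```

The same helper could be applied to every other column that uses the four rating levels, so that all of them end up on the same three-level scale.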
Relationships between Target and Features
The next step in the analysis was to look for any relationships between the target and the features. To do this, I created a few density plots. Coloring the density curves by the target (student achievement rating) showed how each distribution changes with the rating.
The relationship between economic need index and student achievement was, in my honest opinion, as expected. The economic need index takes values between 0 and 1, with values closer to 1 meaning the school is in more economic need. Here, many of the public schools that had a high economic need index (more than 0.8) had a student achievement rating of Approaching Target. In other words, schools in need of economic support tended not to have a good student achievement rating.
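For readers who want to reproduce this kind of plot, here is a minimal sketch of computing one density curve per rating level. It uses `scipy.stats.gaussian_kde`, which is my assumption about a reasonable tool rather than what was necessarily used for the original plots, and the values are synthetic stand-ins, not the PASSNYC data.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Synthetic stand-in values (NOT the real data): one array of
# economic need index values between 0 and 1 per rating level.
eni_by_rating = {
    "Approaching Target": rng.beta(8, 2, size=200),
    "Meeting Target": rng.beta(5, 3, size=200),
    "Exceeding Target": rng.beta(2, 5, size=200),
}

# Shared x-axis so the curves can be compared directly.
grid = np.linspace(0.0, 1.0, 201)

# One kernel density estimate per rating, evaluated on the shared grid.
density_curves = {
    rating: gaussian_kde(values)(grid)
    for rating, values in eni_by_rating.items()
}

# Location of each curve's peak: a crude summary of where the mass sits.
peaks = {
    rating: float(grid[np.argmax(curve)])
    for rating, curve in density_curves.items()
}
print(peaks)
```

Each curve can then be drawn in its own color, for example with matplotlib's `plt.plot(grid, curve, label=rating)`, which gives one colored density per rating level.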
The relationship between school income estimate and student achievement told the same story as economic need index. Many of the public schools that had a lower estimated income (less than $50,000) had a student achievement rating of Approaching Target.
After looking into it a bit more, I discovered that public schools with more than $100,000 in estimated school income were all either Meeting Target or Exceeding Target. Based on these observations, I speculated that the income/economic status of the school would play an important role in the analysis.
Feature Engineering / Selection
The next step in the process was to make changes to the features to optimize the performance of the models. The columns that had percentage values were changed from numerical to ordinal values, from 0 to 3, by binning them into ordered ranges. For example, an observation that had a value of 23% (0.23) for the feature Percent Hispanic was updated to an ordinal value of 1, while an observation that had a value of 88% (0.88) for the same feature was updated to a value of 3. Doing this for all the features that had a percentage (numerical) value improved the accuracy of the models!
For the features that had the rating values, I also used ordinal encoding: Approaching Target: 1, Meeting Target: 2, and Exceeding Target: 3. This also applied to the target, student achievement rating.
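Here is a minimal sketch of the two encodings described above, using pandas. The bin edges are hypothetical: I picked them only so that 0.23 maps to 1 and 0.88 maps to 3 as in the example, and the actual cut points may well have been different. The toy rows are also made up.

```python
import pandas as pd

# Hypothetical bin edges, chosen only so that 0.23 -> 1 and 0.88 -> 3;
# the actual cut points are not stated in this article.
PERCENT_BIN_EDGES = [0.0, 0.10, 0.40, 0.70, 1.0]

# Ordered integer codes for the three remaining rating levels.
RATING_ORDER = {
    "Approaching Target": 1,
    "Meeting Target": 2,
    "Exceeding Target": 3,
}


def percent_to_ordinal(series: pd.Series) -> pd.Series:
    """Bin fractions in [0, 1] into ordinal codes 0..3."""
    codes = pd.cut(
        series, bins=PERCENT_BIN_EDGES, labels=False, include_lowest=True
    )
    return codes.astype(int)


def rating_to_ordinal(series: pd.Series) -> pd.Series:
    """Map rating labels to their ordered integer codes."""
    return series.map(RATING_ORDER)


# Toy rows for illustration only.
df = pd.DataFrame({
    "Percent Hispanic": [0.23, 0.88, 0.05, 0.55],
    "Student Achievement Rating": [
        "Approaching Target", "Exceeding Target",
        "Meeting Target", "Meeting Target",
    ],
})
df["Percent Hispanic"] = percent_to_ordinal(df["Percent Hispanic"])
df["Student Achievement Rating"] = rating_to_ordinal(
    df["Student Achievement Rating"]
)
print(df)
```

Note that each column stays a single column of ordered integers, which keeps the number of features the same and preserves the ordering between levels.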
Another part of the analysis was looking at relationships between the features. The relationship between economic need index and race showed a significant pattern:
Here, I noticed that the economic need index varies with the racial makeup of a school. For example, public schools with a high Hispanic demographic (value of 3) were more in need of financial assistance. On the other hand, public schools with a high White demographic (value of 3) did not seem to be in significant need of financial assistance.
Model Selection and Testing
Since this was a supervised multi-class classification analysis, I created four classification models:
Random Forest Classifier (RF)
Linear Support Vector Classification (LSVC)
Gaussian Naive Bayes (GNB)
Linear Discriminant Analysis (LDA)
To evaluate the predictions of the models, I used the micro F1 score. Since this was a multi-class problem, I had to average the F1 scores across classes while also taking label imbalances into account. Micro averaging computes the total number of false positives, false negatives, and true positives over all classes, and then computes precision, recall, and F-score using these counts.
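To make the micro averaging concrete, here is a small sketch that counts true positives, false positives, and false negatives over all classes by hand and compares the result with scikit-learn's `f1_score`. The labels are toy values I made up, not predictions from the actual models.

```python
from sklearn.metrics import f1_score

# Toy labels using the 1/2/3 rating codes; not real model predictions.
y_true = [1, 2, 2, 3, 2, 1, 3, 2]
y_pred = [1, 2, 3, 3, 2, 2, 3, 2]


def micro_f1_by_hand(y_true, y_pred):
    """Pool TP, FP, and FN over all classes, then compute F1 once."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            tp += int(truth == label and pred == label)
            fp += int(truth != label and pred == label)
            fn += int(truth == label and pred != label)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


manual_score = micro_f1_by_hand(y_true, y_pred)
sklearn_score = f1_score(y_true, y_pred, average="micro")
print(manual_score, sklearn_score)  # both 0.75 for these toy labels
```

One thing worth knowing: when every sample has exactly one label, as here, the micro F1 score works out to the same number as plain accuracy (6 correct out of 8 in this toy example).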
I used 10-fold cross validation, which trained and tested each model 10 times on different splits of the data. Here were the results:
The Linear Discriminant Analysis (LDA) model had the highest micro F1 score with 0.66 (higher is better) using 10-fold cross validation. The Random Forest (RF) model came in second with a score of 0.62. The model that (surprisingly) did not perform well was the Gaussian Naive Bayes (GNB), which scored drastically lower than the other models.
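The comparison above can be sketched roughly as follows with scikit-learn. The data here is synthetic (generated with `make_classification`), and the hyperparameters are my assumptions for illustration, so the scores it prints will not match the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Synthetic 3-class data standing in for the school features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# The four model families; hyperparameters are illustrative assumptions.
models = {
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "LSVC": LinearSVC(max_iter=5000),
    "GNB": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
}

# Mean micro F1 over 10 folds for each model.
mean_scores = {
    name: cross_val_score(model, X, y, cv=10, scoring="f1_micro").mean()
    for name, model in models.items()
}

# Print models from best to worst.
for name, score in sorted(
    mean_scores.items(), key=lambda item: item[1], reverse=True
):
    print(f"{name}: {score:.2f}")
```

Passing `cv=10` with a classifier makes scikit-learn use stratified folds, which keeps the class proportions roughly the same in every fold.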
Using the random forest model, I ranked the feature importances to see how helpful the top features were in the analysis:
The average test scores of both ELA and Math tests were the most important features. This makes sense because a school with a higher average test score will most likely have a higher student achievement rating. One takeaway from the feature importances was the high ranking of the two economic features: economic need index and school income estimate. The race features were important in the analysis, but not as important as I thought they would be.
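For anyone curious how such a ranking is produced, here is a minimal sketch using the `feature_importances_` attribute of a fitted random forest. The data is synthetic and the feature names are hypothetical placeholders, so the ranking it prints says nothing about the real school features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical placeholder names; the real columns are school features.
feature_names = [f"feature_{i}" for i in range(8)]

# Synthetic 3-class data standing in for the school data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

These importances are normalized to sum to 1, so each value can be read as a feature's share of the total importance within the forest.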
Here are some of the things I learned from the analysis:
Schools that have lower income levels seem to be in need of improvement. If PASSNYC wants to provide funds/aid to schools, they should look at schools in economic need and schools with a higher percentage of minorities (Black / Hispanic)
The economic status of a school plays an important part in the achievement of its students: the more economic need a school has, the lower its achievement rating tends to be
The economic status of a school varies with the predominant race of its students: the more minority (Black / Hispanic) students a school has, the more economic need it tends to have
The Linear Discriminant Analysis (LDA) scored well in this multi-class machine learning analysis compared to the other models.
Although the best model did a great job at predicting the student achievement rating of NYC public schools, there were some lessons I learned along the way that could help me out in future projects:
It's always worth going back to the feature selection / engineering step! Although this advice appears in almost every data science resource, I only fully understood its value during this project. The data provided by Kaggle had over 100 features that I used in my first couple of models, but after going back to the selection / engineering process, I eliminated most of them. Just because features are available does not mean they will be useful.
More (non-redundant) features. The data that was provided could have had more useful features like size of school or student-to-teacher ratio. This type of data could have potentially increased the accuracy of our models.