NBA Shot Logs: Every Shot Taken in the 2014-2015 NBA Season
https://www.kaggle.com/dansbecker/nba-shot-logs?rvi=1
There are a total of 128,070 rows and 21 columns in the dataset. The NBA's stats API used to be publicly accessible, but the league has since made it private; this dataset comes from the 2014-2015 season, when the API was still public. In the 21st century, statistics have heavily influenced the sports industry, and the NBA has perhaps seen the biggest change in approach to gameplay thanks to the emergence of statisticians in the league. The NBA tracks not only basic statistics such as points, field goal percentage, steals, blocks, and assists, but also details like where on the court a shot was taken and how many seconds were left on the shot clock when it was released. Traditionally, a small team of statisticians on specialized keypads recorded in-game statistics, but in recent years the NBA has adopted camera-based technology that tracks each player's movement throughout the game.
Looking at the dataset, there are many parts to clean. In an NBA game, a team has 24 seconds on the shot clock to attempt a shot. There will be instances where a player receives the ball from a teammate with only 0.9 seconds left on the shot clock; he should not be penalized in a shot-efficiency analysis, because his teammates left him no choice but to take a poorly selected shot. Shots from beyond the halfcourt line will be cleaned out as well, since desperation heaves as a period ends unfairly drag down a player's field goal percentage. Using the dataset, I plan to compute the best shooter at each distance from the basket and the best shooter in contested situations. I will also use a player's touch time to determine which players are the best off-the-dribble shooters and which are the best catch-and-shoot shooters. Using the closest-defender data and defender distances, I can compute the toughest defenders to score against in the 2014-2015 NBA season. A challenge I foresee is cleaning the data: it will be very difficult to notice a random typo in a player's name among 128,070 entries.
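The cleaning steps above can be sketched in pandas. This is a minimal sketch, assuming the Kaggle listing's column names SHOT_CLOCK (seconds remaining) and SHOT_DIST (feet from the basket); the thresholds (2 seconds left, 47 feet) are illustrative choices, not values from the assignment.

```python
import pandas as pd

# Minimal cleaning sketch: drop bail-out shots taken with almost no
# shot clock left, and desperation heaves from beyond roughly halfcourt.
# Column names (SHOT_CLOCK, SHOT_DIST) are assumed from the Kaggle listing.
def clean_shots(df, min_shot_clock=2.0, max_dist=47.0):
    mask = (df["SHOT_CLOCK"] >= min_shot_clock) & (df["SHOT_DIST"] <= max_dist)
    # NaN shot clocks (untracked possessions) fail the comparison and are dropped too.
    return df[mask].reset_index(drop=True)

# Hypothetical rows illustrating the two situations we want removed.
shots = pd.DataFrame({
    "player_name": ["stephen curry", "james harden", "stephen curry"],
    "SHOT_CLOCK": [14.2, 0.9, 10.5],   # 0.9 s left: forced bail-out shot
    "SHOT_DIST": [25.1, 22.3, 55.0],   # 55 ft: end-of-period heave
    "FGM": [1, 0, 0],
})
cleaned = clean_shots(shots)
print(len(cleaned))  # only the first row survives
```

The same boolean-mask pattern extends naturally to the later questions, e.g. grouping the cleaned frame by player and a binned SHOT_DIST to compare field goal percentage by distance.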
Every MLB Pitch Thrown in the 2015-2018 Seasons
http://inalitic.com/datasets/mlb%20pitch%20data.html
There are 1,048,576 rows and 40 columns in the dataset, which was scraped from MLB's database and records every pitch thrown in a Major League Baseball game from 2015 to 2018. Although the amount of baseball statistics and data has grown with better accessibility and analytics, the sport has been recording stats since the 19th century: the league appoints an official scorer to record the events happening on the field. Like the NBA, MLB has also come to rely on technology for statistics; a prime example is Statcast, which can track a player's reaction time, ball speed, and the trajectory of a pitch.
To judge a pitcher's true ability, we must clean out game situations where the game plan forces a pitcher to intentionally walk a batter; those pitches would work against us when we compute true pitching accuracy later on. We will also remove rows with missing values. We hope to answer which types of pitches (fastball, changeup, curveball, etc.) are used in different ball-count scenarios, and which pitch type has the highest spin rate and ball speed in those scenarios. Finally, using the data on how many runners are on base, we will try to determine the best pitchers in pressure situations.
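The per-count questions above reduce to two groupbys. A sketch under assumed column names (pitch_type, balls, strikes, spin_rate, start_speed); the real dataset's columns may be named differently, and the rows below are hypothetical:

```python
import pandas as pd

# Hypothetical pitch rows; the real dataset has ~1M rows and 40 columns.
pitches = pd.DataFrame({
    "pitch_type":  ["FF", "CU", "FF", "CH", "FF", "CU"],
    "balls":       [0, 0, 3, 1, 3, 2],
    "strikes":     [0, 2, 2, 1, 2, 2],
    "spin_rate":   [2250.0, 2780.0, 2310.0, 1650.0, 2290.0, 2815.0],
    "start_speed": [94.5, 78.2, 95.1, 84.0, 94.8, 77.9],
})

# Which pitch types are thrown in each ball-strike count, as shares.
count_mix = (pitches.groupby(["balls", "strikes"])["pitch_type"]
                    .value_counts(normalize=True))

# Average spin rate and velocity per pitch type.
per_type = pitches.groupby("pitch_type")[["spin_rate", "start_speed"]].mean()
print(per_type)
```

Filtering intentional walks first is just another boolean mask on whatever flag or event column encodes them, applied before these aggregations.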
Storm Events in the U.S.
https://www.ncdc.noaa.gov/stormevents/
As the fourth-largest country in the world, the United States has always suffered a variety of natural disasters. In 2020 alone, there were 22 natural disasters that cost a combined 95 billion dollars in damage, 13 of which were severe storms. Among natural disasters, the most prominent have been hurricanes, tornadoes, blizzards, and floods. Severe storms cost an average of 2.3 billion dollars each, and unfortunately they are also the most frequent natural disaster in the US. Natural disasters are not preventable even with extensive data; however, we can use the data to infer patterns that minimize lives lost and financial loss. The wealth gap is a major issue in the United States, so we decided to analyze the socio-economic status and racial breakdown of each state. Our group plans to analyze the factors influenced by these storm events within the U.S. We will build models and analyze the severity of storms in each state with respect to storm frequency, average magnitude, state GDP, and state infrastructure expenditure due to climate change. We will also determine the socio-economic status and racial breakdown of the regions most affected by these storms in each state.
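The per-state frequency/magnitude/damage comparison could start from an aggregation like the one below. This is a sketch assuming NOAA-style columns STATE, MAGNITUDE, and DAMAGE_PROPERTY, with property damage encoded as strings such as "10.00K" or "2.5M" as in the bulk Storm Events CSV files; the sample rows are made up.

```python
import pandas as pd

# Suffix multipliers used in NOAA damage strings (assumption: K/M/B).
MULT = {"K": 1e3, "M": 1e6, "B": 1e9}

def parse_damage(s):
    """Convert a damage string like '10.00K' to dollars; blanks become 0."""
    if not isinstance(s, str) or not s.strip():
        return 0.0
    s = s.strip()
    if s[-1].upper() in MULT:
        return float(s[:-1]) * MULT[s[-1].upper()]
    return float(s)

# Hypothetical events for illustration.
events = pd.DataFrame({
    "STATE": ["TEXAS", "TEXAS", "KANSAS"],
    "MAGNITUDE": [2.0, 3.5, 1.0],
    "DAMAGE_PROPERTY": ["10.00K", "2.5M", ""],
})
events["damage_usd"] = events["DAMAGE_PROPERTY"].map(parse_damage)

# Per-state severity summary: event count, mean magnitude, total damage.
by_state = events.groupby("STATE").agg(
    n_events=("STATE", "size"),
    avg_magnitude=("MAGNITUDE", "mean"),
    total_damage=("damage_usd", "sum"),
)
print(by_state)
```

State GDP, infrastructure spending, and demographic tables could then be joined onto `by_state` by state name for the planned socio-economic comparison.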