# Regression Analysis

Instructions
For the following questions, refer to this multi-site Beijing air pollution dataset from the UCI Machine Learning Repository. Each of these files contains hourly air quality data from 2013 to 2017 at a specific site in Beijing outlined at the link.
The input variables are year, month, day, hour, temperature, pressure, dew point, rain, wind direction, wind speed, and station. The possible output variables are concentrations of different pollutants in the air: PM2.5, PM10, SO2, NO2, CO, O3.
To earn full credit, you must submit a spreadsheet file (Excel or CSV format) or a code file (R, Python, MATLAB, C++, or another language) containing your work, together with a typed report (Word or PDF format). The report must answer the questions in complete sentences and include excerpts from your code or spreadsheet, along with their outputs, to support your conclusions.
Grading will be based entirely on the report, so all conclusions must be justified within it. The code or spreadsheet file will be consulted only to verify that the work was done independently. Notes and other references may be used, but you may not work with other people.
Data Preprocessing
Merge the data from Dongsi, Shunyi, and Wanliu districts into one spreadsheet or data structure.
Delete rows with any missing data and delete the wind direction column.
Convert the station information into three binary columns differentiating the stations.
Convert the day, month, year into a single numerical variable.
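The four preprocessing steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the column names (`TEMP`, `wd`, `CO`, `station`), the `NA` missing-value marker, and the tiny inline sample are made up for the example and should be checked against the actual files. The date is turned into a single number with the day ordinal, which is one reasonable choice among several.

```python
import csv
import io
from datetime import date

# Made-up sample standing in for the merged station files.
# Column names and the "NA" marker are assumptions, not taken from the dataset.
RAW = """year,month,day,hour,TEMP,wd,CO,station
2013,3,1,0,-0.7,NNW,300,Dongsi
2013,3,1,1,NA,N,400,Shunyi
2013,3,2,0,1.5,E,500,Wanliu
"""

STATIONS = ["Dongsi", "Shunyi", "Wanliu"]
MISSING = ("", "NA")
NON_NUMERIC = ("year", "month", "day", "hour", "wd", "station")


def preprocess(rows):
    """Drop incomplete rows, drop wind direction, encode station and date."""
    cleaned = []
    for row in rows:
        # Missing values are checked before wind direction is removed,
        # following the order in which the steps are listed.
        if any(value in MISSING for value in row.values()):
            continue
        d = date(int(row["year"]), int(row["month"]), int(row["day"]))
        out = {"date_num": d.toordinal(), "hour": int(row["hour"])}
        for name, value in row.items():
            if name not in NON_NUMERIC:
                out[name] = float(value)
        # One binary indicator column per station.
        for station in STATIONS:
            out["is_" + station] = int(row["station"] == station)
        cleaned.append(out)
    return cleaned


cleaned = preprocess(csv.DictReader(io.StringIO(RAW)))
for record in cleaned:
    print(record)
```

With real data, the rows from the three station files would be read and concatenated before calling `preprocess`.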
Describing the Data
Look up information about the three districts of Beijing online. How do they differ? Hypothesize which districts are more likely to have more pollution.
Find the min, 1st quartile, median, 3rd quartile, and max of each variable (column).
Standardize the non-binary input variables and use the standardized data for future problems.
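As a rough illustration of the five-number summary and of standardization, the sketch below uses only Python's `statistics` module. The `temps` values are made up. Quartile conventions differ between tools, so the `method="inclusive"` choice here is an assumption and may not match the numbers a spreadsheet or another package reports.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


temps = [-2.0, 0.0, 1.0, 3.0, 8.0]  # made-up values for illustration
print(five_number_summary(temps))
z = standardize(temps)
print(z)
```

After standardization each column has mean 0 and sample standard deviation 1, so regression coefficients fitted on these columns are on a comparable scale.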
Test each of the following hypotheses for fine particulate matter in the air (PM2.5) at the stations:
µ_Dongsi = µ_Shunyi, µ_Dongsi = µ_Wanliu, µ_Shunyi = µ_Wanliu, each at the α = 0.04 level.
Interpret your results practically.
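One possible way to compare two station means is a two-sample test with unequal variances. The sketch below computes the Welch statistic and approximates the two-sided p-value with the standard normal distribution. That approximation is an assumption that is reasonable only for large samples such as the hourly data; the eight made-up values per station here are just to exercise the code, and a t distribution from a statistics package would be the more careful choice.

```python
import math
import statistics

ALPHA = 0.04


def two_sample_test(a, b):
    """Welch statistic with a normal approximation to the two-sided p-value."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    stat = (mean_a - mean_b) / se
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(stat)))
    return stat, p_value


# Made-up PM2.5 readings, not taken from the dataset.
station_a = [80, 95, 70, 110, 85, 90, 100, 75]
station_b = [60, 72, 55, 80, 65, 70, 68, 58]

stat, p_value = two_sample_test(station_a, station_b)
print(f"statistic = {stat:.2f}, p-value = {p_value:.5f}")
print("reject H0" if p_value < ALPHA else "fail to reject H0")
```

The same function would be called once for each of the three station pairs.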
Find the correlation between every pair of variables. [Hint: do not compute them by hand.]
Which pair of variables has the strongest positive correlation, and which pair has the strongest negative correlation? Hypothesize a practical explanation for each of these two correlations.
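The pairwise correlations can be generated in a loop rather than one at a time. The sketch below computes the Pearson correlation for every pair of columns and picks out the largest and smallest values. The column names and numbers are made up for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up columns; the real data would have one list per variable.
columns = {
    "TEMP": [1.0, 2.0, 3.0, 4.0],
    "DEWP": [2.0, 4.1, 5.9, 8.0],
    "PRES": [4.0, 3.0, 2.0, 1.0],
}

pairs = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}
strongest_pos = max(pairs, key=pairs.get)
strongest_neg = min(pairs, key=pairs.get)
print("strongest positive:", strongest_pos, round(pairs[strongest_pos], 4))
print("strongest negative:", strongest_neg, round(pairs[strongest_neg], 4))
```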
Find the least squares model predicting CO pollution using the standardized input variables.
Which single variable has the largest impact on CO pollution in the model?
Find the p-value for a test of the hypothesis H0 : β0 = β1 = ⋯ = βk, where k is the number of predictors in the model. Does the model produce significant predictions at the 2% significance level?
What is the adjusted R² value? What does this percentage represent?
Find the p-values of tests of the hypotheses H0 : βi = 0 for each i = 1, …, k. What do the results mean practically speaking? (Assume a 2% significance level.)
How does a linear model for predicting PM10 pollution differ from a model for predicting CO pollution? Does one model fit better? Are the same variables significant predictors?
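For the regression questions above, the following is a minimal sketch of a least squares fit using only the Python standard library. It solves the normal equations by Gaussian elimination and reports R², adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), and the overall F statistic. The data are made up, the function names are hypothetical, and converting the F statistic or the individual coefficient statistics to p-values requires F and t distributions that the standard library does not provide.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_ols(X, y):
    """Least squares with an intercept; X is a list of predictor rows."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    n, k = len(y), p - 1
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    f_stat = ((sst - sse) / k) / (sse / (n - k - 1))
    return beta, r2, adj_r2, f_stat


# Made-up predictors and response: y is roughly 1 + 3*x1 + 0.5*x2 plus noise.
X = [[-1.0, 0.5], [-0.5, -1.0], [0.0, 1.0], [0.5, -0.5], [1.0, 0.0], [1.5, 1.5]]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1]
y = [1 + 3 * x1 + 0.5 * x2 + e for (x1, x2), e in zip(X, noise)]

beta, r2, adj_r2, f_stat = fit_ols(X, y)
slopes = beta[1:]
largest = max(range(len(slopes)), key=lambda i: abs(slopes[i]))
print("coefficients:", [round(b, 3) for b in beta])
print("adjusted R^2:", round(adj_r2, 4), "F statistic:", round(f_stat, 1))
print("predictor with largest |coefficient|: index", largest)
```

Because the predictors are standardized, comparing the absolute values of the slope coefficients is one simple way to judge which variable has the largest impact. Statistics packages report the remaining quantities directly; for example, a fitted statsmodels OLS result exposes `f_pvalue`, `pvalues`, and `rsquared_adj`. The overall F-test those packages report is the test that all slope coefficients equal zero. Fitting the same predictors against PM10 instead of CO only changes the `y` column, which makes the two models straightforward to compare.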