Statistics Assignment #3
This assignment must be completed individually. This means that you may not ask any person other than the professor for help. You are permitted to use the internet to look up how to perform certain operations in R or to look up certain statistical concepts. If you have any questions, please ask the professor directly.
Instructions for this assignment assume that you are using RStudio. Remember that you must install “base” R in order to use RStudio. You may use a different statistical software (e.g., SPSS, Stata) if you like, but it is your responsibility to make sure that the results, graphs, etc. from this software match the correct R output. You may not use Excel for this assignment.
In this document, I will tell you what code to run in RStudio. The specific code will be written in Courier font (i.e., the font you are reading now). Other instructions and notes will be written in Times New Roman (i.e., the font you are reading now). If the font is not Courier, you should not type that into RStudio. It will not work.
For early assignments, I will give you relatively detailed code and notes to go with them. However, once we have covered a certain command, the instructions will be less detailed. You should go back and look at old assignments and questions to see the more detailed notes on a certain command.
Finally, if I ask you to copy and paste your output, it is your responsibility to make sure that the output is readable. You may lose points if the output is not readable (e.g., the columns are not lined up properly). Therefore, I recommend taking a screenshot rather than copying and pasting the text.
For this assignment, please use the “Airbnb Data.csv” file posted on Blackboard.
This data has information about Airbnb listings in the Seattle area.
This data file has the following variables:
room_id – Identifier for the specific listing
host_id – Identifier for the host
room_type – The type of listing (entire home/apartment, private room, or shared room)
neighborhood – The specific city (Seattle, Kirkland, Bellevue, or Redmond)
reviews – How many reviews guests have left
overall_satisfaction – The average guest review (out of 5, rounded to nearest 0.5)
accommodates – The number of guests that can fit in the room/apartment
bedrooms – The number of bedrooms in the home (0 = studio apartment)
bathrooms – The number of bathrooms in the home
price – The average nightly price to rent the home
name – The descriptive name that owners gave to their listing
“BACKWARD LOOKING” QUESTIONS (20 points total)
GRADED FOR ACCURACY
Question 1 (5 points):
Load the “Airbnb Data.csv” file into RStudio. Call the dataframe “airbnb.”
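As a reminder of how loading a CSV file works, here is a minimal sketch. It assumes the read.csv() function and uses a tiny made-up file written to a temporary location (the names tmp and toy are hypothetical), so it runs without the assignment data.

```r
# Write a tiny made-up CSV to a temporary file so the example runs anywhere
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10", "2,20"), tmp)

# read.csv() returns a data frame; the first line of the file becomes the column names
toy <- read.csv(tmp)

# str() shows the column names and types, which is a quick check that the load worked
str(toy)
```

For the assignment, you would replace tmp with the location of the data file on your own computer and store the result in a dataframe called airbnb.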
Create a table that shows the means of Airbnb price broken down by Neighborhood and by Room Type. Copy the code (1 point) and output (1 point) below.
Hint: See Assignment 1, Question 1 for similar code. Unlike in Assignment 1, Question 1, you are splitting the data up by two factors (Neighborhood and Room Type). So in the part of the code where you are supposed to write the grouping factors, you should write neighborhood+room_type.
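The idea of splitting by two factors can be sketched on a small made-up data frame. This sketch assumes the aggregate() function; the code in Assignment 1 may use a different function, in which case follow that code instead. The names toy, city, type, and value are hypothetical and are not part of the Airbnb data.

```r
# Toy data (made up for illustration; not the Airbnb data)
toy <- data.frame(
  city  = c("A", "A", "B", "B", "A", "B"),
  type  = c("x", "y", "x", "y", "x", "y"),
  value = c(10, 20, 30, 40, 50, 60)
)

# Mean of value for every combination of city and type.
# The grouping factors go on the right of the ~ and are joined with +.
means <- aggregate(value ~ city + type, data = toy, FUN = mean)
print(means)
```

The result has one row for each combination of the two grouping factors that appears in the data.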
In which neighborhood is it most expensive (on average) to rent a shared room for the night? (1 point)
Create a binary variable that has a value of 1 if the rental is a shared room and a 0 if it is not (i.e., if it is an “entire home/apt” or a “private room”). Name that variable “shared.” You do not need to copy/paste the code here.
Hint: See “Week 4 Slides – FINAL” Slide #52 for example code.
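One common way to build a 0/1 variable is sketched below on made-up data. It assumes the ifelse() function, which may or may not be the approach shown on the slide; the names toy, type, and is_z are hypothetical.

```r
# Toy data (made up for illustration; not the Airbnb data)
toy <- data.frame(type = c("x", "y", "z", "x"))

# ifelse(test, value_if_true, value_if_false) is applied row by row:
# is_z is 1 when type equals "z" and 0 for every other value
toy$is_z <- ifelse(toy$type == "z", 1, 0)
print(toy)
```

Note that the comparison must use the text exactly as it appears in the data, including capitalization and spaces.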
Then create a table that shows a breakdown of the number of listings in each neighborhood that have a shared room vs. the number of listings that do not have a shared room. Use the following code: xtabs(~neighborhood+shared, data=airbnb)
Copy and paste your output below. (1 point)
Run a t-test comparing the average price of a shared room in Seattle to the average price of a shared room in Kirkland. Use the following code:
t.test(price ~ neighborhood, data = airbnb[airbnb$shared == 1 & (airbnb$neighborhood == "Kirkland, WA, United States" | airbnb$neighborhood == "Seattle, WA, United States"), ])
This code looks far more complicated than it actually is. The first part of the function shows you that you are comparing price across neighborhoods. The (long) second part just tells R to subset the data so that you are looking only at shared rooms in either Kirkland or Seattle. It looks long because you have to use the full text that the data uses (e.g., “Kirkland, WA, United States”) in order to subset just “Kirkland.” Make sure the quotation marks in the code are straight quotes ("), not curly quotes; R will give an error if you paste curly quotes.
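To see how a subset like this is put together, the sketch below builds the same kind of condition in separate steps on made-up data. The names toy, city, flag, value, keep, and subset_toy are hypothetical. It also uses %in%, which (for data with no missing values) selects the same rows as two == comparisons joined by |.

```r
# Toy data (made up for illustration; not the Airbnb data)
toy <- data.frame(
  city  = c("A", "A", "A", "A", "B", "B", "B", "C"),
  flag  = c(1, 1, 1, 0, 1, 1, 1, 1),
  value = c(10, 12, 14, 99, 20, 22, 27, 50)
)

# Step 1: a TRUE/FALSE value per row.
# TRUE only when flag is 1 AND city is either "A" or "B".
keep <- toy$flag == 1 & toy$city %in% c("A", "B")

# Step 2: keep the TRUE rows; the empty slot after the comma means "all columns"
subset_toy <- toy[keep, ]

# Step 3: compare the mean of value between the two remaining cities
result <- t.test(value ~ city, data = subset_toy)
print(result)
```

Printing subset_toy before running the test is a quick way to confirm that only the rows you intended were kept.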
If your hypothesis is that the average price of shared rooms in Kirkland equals the average price of shared rooms in Seattle, would you reject or fail to reject your hypothesis? (1 point)
REJECT FAIL TO REJECT
Question 2 (15 points):
Imagine that you wanted to know the effect of neighborhood and the effect of sharing a room on the price of a rental.
Run a regression that uses price as the dependent variable and neighborhood and “shared” (the variable you created in Question 1c) as your independent variables. Do not include any interaction term. Hint: See Stats Assignment #2, Question 2 and “Week 5 Slides – FINAL” Slide #51 for similar code. Copy and paste your output below. (2 points)
Remember that “neighborhood” is a string variable (i.e., it has values that are not numbers). This means that when you put neighborhood in your regression, you need to use as.factor(neighborhood) to make it something that R can understand.
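A minimal sketch of a regression with a text variable wrapped in as.factor(), run on made-up data. The names toy, city, flag, and value are hypothetical, and the numbers are invented purely so that the code runs.

```r
# Toy data (made up for illustration; not the Airbnb data)
toy <- data.frame(
  city  = c("A", "A", "B", "B", "C", "C"),
  flag  = c(0, 1, 0, 1, 0, 1),
  value = c(10, 6, 20, 15, 30, 27)
)

# as.factor() tells R to treat city as a set of categories;
# the + adds a second independent variable without any interaction
fit <- lm(value ~ as.factor(city) + flag, data = toy)

# summary() prints one row per estimated coefficient, with its p-value
summary(fit)
```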
What is the coefficient associated with the “shared” variable? (1 point) What does it mean? (1 point)
The coefficient associated with being in the Kirkland neighborhood is statistically significant. This means that prices in Kirkland are significantly different from prices in another neighborhood. Which neighborhood? (2 points)
Rerun the regression from part (a) so that it allows you to test whether there are differences in prices between the Seattle and Redmond neighborhoods. Copy and paste your output below. Hint: Use the “relevel” function in R. See “Week 4 Slides – FINAL” Slide 54. (2 points)
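The sketch below shows how relevel() changes which category the other categories are compared against, using made-up data. The names toy, city, flag, and value are hypothetical; the slide may present the same step slightly differently.

```r
# Toy data (made up for illustration; not the Airbnb data)
toy <- data.frame(
  city  = c("A", "A", "B", "B", "C", "C"),
  flag  = c(0, 1, 0, 1, 0, 1),
  value = c(10, 6, 20, 15, 30, 27)
)

# relevel() needs a factor, so convert the text column first
toy$city <- as.factor(toy$city)

# Move "C" to the front of the level order so it becomes the comparison category
toy$city <- relevel(toy$city, ref = "C")
levels(toy$city)

# Rerunning the regression now reports the other cities relative to "C"
fit <- lm(value ~ city + flag, data = toy)
summary(fit)
```

The value passed to ref must match the text in the data exactly.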
Assuming an alpha value of .05, is there a significant difference between prices in Seattle and prices in Redmond? (1 point)
Imagine that you wanted to know whether or not there is an interaction between having a shared room and whether or not a rental is in the Seattle neighborhood. Create a new binary variable that equals 1 if a rental is in Seattle and 0 if a rental is in any of the other neighborhoods. Call this new variable “seattle.”
Run a regression with “price” as your dependent variable and “shared,” “seattle,” and the interaction between “shared” and “seattle” as your independent variables. Hint: See “Week 5 Slides – FINAL” Slide #52 for code that shows you how to include an interaction in a regression.
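A minimal sketch of a regression with two binary variables and their interaction, on made-up data. It assumes the * shorthand in the model formula, which may or may not be the exact notation on the slide; the names toy, g1, g2, and value are hypothetical.

```r
# Toy data (made up for illustration; not the Airbnb data)
toy <- data.frame(
  g1    = c(0, 0, 1, 1, 0, 0, 1, 1),
  g2    = c(0, 1, 0, 1, 0, 1, 0, 1),
  value = c(10, 14, 20, 31, 12, 16, 22, 33)
)

# g1 * g2 is shorthand for g1 + g2 + g1:g2,
# i.e. both variables on their own plus their interaction
fit <- lm(value ~ g1 * g2, data = toy)

# The interaction coefficient is the row labelled g1:g2
coef(fit)
```

Writing value ~ g1 + g2 + g1:g2 fits the same model; the * form is just shorter.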
What is the coefficient associated with the interaction term? (2 points)
Describe in words what the interaction term represents in this particular regression (2 points).
Using your regression from part (f), predict the price of a shared room in Seattle. Hint: See “Week 5 Slides – FINAL” Slide #53. (1 point)
Using your regression from part (f), predict the price of a private room in a neighborhood outside of Seattle. (1 point)
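Predictions from a regression with an interaction can be sketched as follows on made-up data. The sketch assumes the predict() function, which may differ from the method on the slide; the names toy, g1, g2, value, new_case, pred, and by_hand are hypothetical.

```r
# Toy data (made up for illustration; not the Airbnb data)
toy <- data.frame(
  g1    = c(0, 0, 1, 1, 0, 0, 1, 1),
  g2    = c(0, 1, 0, 1, 0, 1, 0, 1),
  value = c(10, 14, 20, 31, 12, 16, 22, 33)
)
fit <- lm(value ~ g1 * g2, data = toy)

# Describe the case to predict as a one-row data frame
# whose column names match the variables in the regression
new_case <- data.frame(g1 = 1, g2 = 1)
pred <- predict(fit, newdata = new_case)
print(pred)

# The same number by hand: when g1 and g2 are both 1, every coefficient
# is multiplied by 1, so the prediction is the sum of all four coefficients
by_hand <- sum(coef(fit))
print(by_hand)
```

When one of the variables is 0, its coefficient and the interaction coefficient drop out of the by-hand sum.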