1. Dataset Description
This dataset contains house prices in New York and offers insight into the local real estate market. It includes variables such as house type, price, number of bedrooms and bathrooms, property square footage, administrative and local areas, street names, and geographical coordinates.
2. Motivation and model
To analyze and predict house prices in the New York real estate market, we will use linear regression and decision tree regression models. These models will be trained on selected variables to identify the factors that affect price the most. We will also cluster the selected variables with a K-means model and compare the resulting groups with the geographical coordinates to discover trends in the real estate market. For evaluation, we will use standard performance metrics: accuracy, precision, recall, and F1-score. These metrics will give a comprehensive view of model performance, covering both the efficiency and the quality of our regression and clustering models.
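As a rough illustration of the two regression models named above, the sketch below fits both on synthetic stand-in data rather than the New York dataset. The feature names, the coefficients used to generate prices, the tree depth, and the use of R² (scikit-learn's built-in `score`) are all assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the New York dataset): price depends on beds and sqft.
rng = np.random.default_rng(0)
n = 200
beds = rng.integers(1, 6, size=n)
sqft = rng.uniform(400, 4000, size=n)
price = 50_000 * beds + 300 * sqft + rng.normal(0, 20_000, size=n)

X = np.column_stack([beds, sqft])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

# Hypothetical model settings chosen only for illustration.
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For regressors, .score() returns R^2 on the held-out data.
    scores[name] = model.score(X_test, y_test)

print(scores)
```

On the real data, `X` would hold the selected columns (for example `BEDS`, `BATH`, and `PROPERTYSQFT`) and the target would be `PRICE`.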
3. Data loading
Load the data and display the first few rows.
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
file_path = '/Users/liyuang/Downloads/NY-House-Dataset.csv'
ny_house_data = pd.read_csv(file_path)
# Display the first few rows of the dataset
ny_house_data.head()
| | BROKERTITLE | TYPE | PRICE | BEDS | BATH | PROPERTYSQFT | ADDRESS | STATE | MAIN_ADDRESS | ADMINISTRATIVE_AREA_LEVEL_2 | LOCALITY | SUBLOCALITY | STREET_NAME | LONG_NAME | FORMATTED_ADDRESS | LATITUDE | LONGITUDE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Brokered by Douglas Elliman -111 Fifth Ave | Condo for sale | 315000 | 2 | 2.000000 | 1400.0 | 2 E 55th St Unit 803 | New York, NY 10022 | 2 E 55th St Unit 803New York, NY 10022 | New York County | New York | Manhattan | East 55th Street | Regis Residence | Regis Residence, 2 E 55th St #803, New York, N... | 40.761255 | -73.974483 |
1 | Brokered by Serhant | Condo for sale | 195000000 | 7 | 10.000000 | 17545.0 | Central Park Tower Penthouse-217 W 57th New Yo... | New York, NY 10019 | Central Park Tower Penthouse-217 W 57th New Yo... | United States | New York | New York County | New York | West 57th Street | 217 W 57th St, New York, NY 10019, USA | 40.766393 | -73.980991 |
2 | Brokered by Sowae Corp | House for sale | 260000 | 4 | 2.000000 | 2015.0 | 620 Sinclair Ave | Staten Island, NY 10312 | 620 Sinclair AveStaten Island, NY 10312 | United States | New York | Richmond County | Staten Island | Sinclair Avenue | 620 Sinclair Ave, Staten Island, NY 10312, USA | 40.541805 | -74.196109 |
3 | Brokered by COMPASS | Condo for sale | 69000 | 3 | 1.000000 | 445.0 | 2 E 55th St Unit 908W33 | Manhattan, NY 10022 | 2 E 55th St Unit 908W33Manhattan, NY 10022 | United States | New York | New York County | New York | East 55th Street | 2 E 55th St, New York, NY 10022, USA | 40.761398 | -73.974613 |
4 | Brokered by Sotheby's International Realty - E... | Townhouse for sale | 55000000 | 7 | 2.373861 | 14175.0 | 5 E 64th St | New York, NY 10065 | 5 E 64th StNew York, NY 10065 | United States | New York | New York County | New York | East 64th Street | 5 E 64th St, New York, NY 10065, USA | 40.767224 | -73.969856 |
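Before modeling, it usually helps to check column types, missing values, and value ranges. The sketch below does this on a small frame built from the five preview rows above; the variable names `sample`, `missing_per_column`, and `numeric_summary` are illustrative, and on the full dataset the same calls would be applied to `ny_house_data`.

```python
import pandas as pd

# Five rows copied from the preview above (numeric columns plus TYPE only).
sample = pd.DataFrame({
    "TYPE": [
        "Condo for sale", "Condo for sale", "House for sale",
        "Condo for sale", "Townhouse for sale",
    ],
    "PRICE": [315000, 195000000, 260000, 69000, 55000000],
    "BEDS": [2, 7, 4, 3, 7],
    "BATH": [2.0, 10.0, 2.0, 1.0, 2.373861],
    "PROPERTYSQFT": [1400.0, 17545.0, 2015.0, 445.0, 14175.0],
})

# Column types and missing-value counts per column.
print(sample.dtypes)
missing_per_column = sample.isna().sum()
print(missing_per_column)

# describe() summarizes numeric columns only by default.
numeric_summary = sample.describe()
print(numeric_summary.loc[["min", "max"]])
```

Even in these five rows, `PRICE` spans several orders of magnitude, which is one reason to look at its distribution before fitting a model.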
Distribution of Property Prices
The following histogram shows the distribution of property prices after excluding listings priced above the 95th percentile, which are treated here as outliers.
# Use the 95th percentile of PRICE as the upper limit for outlier removal
price_limit = ny_house_data['PRICE'].quantile(0.95)
# Keep only listings priced at or below that limit
filtered_data = ny_house_data[ny_house_data['PRICE'] <= price_limit]
# Plot distribution of filtered prices
plt.figure(figsize=(10, 6))
plt.hist(filtered_data['PRICE'].dropna(), bins=50, edgecolor='black')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.title('Distribution of Property Prices (Excluding Outliers)')
plt.show()
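To see how much data a percentile cutoff like the one above discards, the filter can be wrapped in a small helper that also returns the cutoff. This is a minimal sketch on made-up prices: `filter_below_quantile` and the `toy` frame are hypothetical names for illustration and are not part of the original notebook.

```python
import pandas as pd

def filter_below_quantile(df, column, q=0.95):
    """Return the rows whose `column` value is at or below the q-th quantile,
    together with the cutoff value that was used."""
    cutoff = df[column].quantile(q)
    return df[df[column] <= cutoff], cutoff

# Made-up prices: twenty ordinary listings plus one extreme listing.
toy = pd.DataFrame(
    {"PRICE": list(range(100_000, 2_100_000, 100_000)) + [195_000_000]}
)

kept, cutoff = filter_below_quantile(toy, "PRICE")
removed = len(toy) - len(kept)

print(f"cutoff: {cutoff:,.0f}")
print(f"rows kept: {len(kept)}, rows removed: {removed}")
```

Reporting the cutoff and the number of removed rows alongside the histogram makes it clear which listings the plot leaves out.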