Reading Data
In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
In [2]:
# Tax roll columns: assessed value, mill rate, estimated fair market value, and tax amount (FullAmt)
property_vals = pd.read_csv("Property_Tax_Roll.csv")[["Parcel", "TaxYear", "TotalAssessedValue", "MillRate", "EstFairMkt", "FullAmt"]]
# Assessor columns: location, physical characteristics, zoning, schools, noise levels, and geometry
property_info = pd.read_csv("Assessor_Property_Information.csv")[["Parcel", "Address", "PropertyClass", "PropertyUse", "AreaName", "HomeStyle", "YearBuilt", "LotSize", "TotalLivingArea",
    "Bedrooms", "FullBaths", "HalfBaths", "Basement", "ExteriorWall1", "ExteriorWall2", "Fireplaces", "CentralAir", "CurrentLand", "Zoning1", "Zoning2", "Zoning3",
    "Zoning4", "NationalHistoricalDist", "ElementarySchool", "MiddleSchool", "HighSchool", "NoiseAirport", "NoiseRailroad", "NoiseStreet", "XCoord", "YCoord", "SHAPE_Length", "SHAPE_Area"]]
# Inner join keeps only parcels present in both files; Parcel becomes the index
property_data = pd.merge(property_vals, property_info, on="Parcel", how="inner").set_index("Parcel")
property_data.head(2)
Out[2]:
Parcel | TaxYear | TotalAssessedValue | MillRate | EstFairMkt | FullAmt | Address | PropertyClass | PropertyUse | AreaName | HomeStyle | ... | ElementarySchool | MiddleSchool | HighSchool | NoiseAirport | NoiseRailroad | NoiseStreet | XCoord | YCoord | SHAPE_Length | SHAPE_Area |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
60801101019 | 2022 | 273900 | 0.019815 | 286028 | 5064.51 | 2001 Rae Ln | Residential | Single family | Meadowood | Ranch | ... | Huegel | Toki | Memorial | 0 | 0 | 61 | 794441.3453 | 467139.3384 | 0.001641 | 1.460000e-07 |
60801101027 | 2022 | 274000 | 0.019815 | 286132 | 5066.49 | 2005 Rae Ln | Residential | Single family | Meadowood | Ranch | ... | Huegel | Toki | Memorial | 0 | 0 | 61 | 794440.8089 | 467045.1156 | 0.001745 | 1.510000e-07 |
2 rows × 37 columns
Questions to Answer
As we approach graduation, some of us are considering settling down here in Madison. Possible plans include living, working, starting a business, or even farming in the capital of Wisconsin. Our questions of interest for this analysis are therefore as follows:
- Exploring how taxes vary amongst properties in Madison:
  - Is there a correlation between property values and tax amounts?
  - How do tax rates and tax amounts vary across different property types?
  - What is the distribution of tax payments within each property class (Residential, Commercial, Industrial, Agricultural)?
- Analyzing how various property characteristics impact a property's value:
  - Is there a clear relationship between the age of properties and their average market value?
  - What is the relationship between property type and property value?
  - Does each environmental factor impact a property's assessed value?
  - Relating the two independent variables above, is it possible to classify properties based on their environmental aspects?
  - Can high-value properties be identified based on physical features such as location, size, and amenities?
  - How are the residential properties around UW-Madison valued compared to areas outside the university?
- Bringing it all together, can we construct an accurate model to predict a residential property's assessed value?
Important Variables
Our initial datasets contain a total of 222 variables. After going through them, we think that `TotalAssessedValue`, `MillRate`, `EstFairMkt`, `FullAmt` (amount of tax), `PropertyClass`, `PropertyUse`, `AreaName`, `HomeStyle`, `YearBuilt`, the physical characteristics of a property, the zoning data, the environmental features, and schools are the most useful features for our analysis.
Methods
Question 1:
- We plan to use a combination of a simple regression model (tax amounts vs. property values) and data visualization techniques such as a bar chart and a boxplot to understand how property taxes relate to property values and vary across property classes.
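The simple regression step for Question 1 could look roughly like the sketch below. It runs on synthetic data, not the real tax roll: the column names `TotalAssessedValue` and `FullAmt` come from the merged data, while the 2% rate, the fixed offset, and the noise level are made-up values used only to generate toy numbers.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the tax roll (assumption: NOT the Madison data).
rng = np.random.default_rng(0)
assessed = rng.uniform(100_000, 600_000, size=200)
toy = pd.DataFrame({
    "TotalAssessedValue": assessed,
    # Made-up relationship: 2% of assessed value, minus a flat 300, plus noise.
    "FullAmt": 0.02 * assessed - 300 + rng.normal(0, 150, size=200),
})

# Least-squares line: FullAmt ~ slope * TotalAssessedValue + intercept
slope, intercept = np.polyfit(toy["TotalAssessedValue"], toy["FullAmt"], deg=1)

# Pearson correlation between assessed value and tax amount
r = toy["TotalAssessedValue"].corr(toy["FullAmt"])

print(f"slope={slope:.5f}, intercept={intercept:.1f}, r={r:.4f}")
```

On the real data, the same two calls could be applied to `property_data["TotalAssessedValue"]` and `property_data["FullAmt"]` after dropping rows with missing values.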
Question 2:
- We will create an additional column called `Age`, which is calculated as `2022 - YearBuilt`; we then aim to draw a scatter plot of the average market value vs. binned `Age`.
- For subquestions 2 and 3, we aim to plot bar charts.
- We view subquestions 4 and 5 as multi-class classification problems. Thus, we plan to utilize and compare models such as decision tree, kNN, SVM, and logistic regression. To increase accuracy, we will tune the hyperparameters, perform rescaling and one-hot encoding, and use stacking methods.
- In regard to subquestion 6, we intend to graph side-by-side boxplots to compare the two distributions.
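The `Age` column and the binned averages planned for Question 2 can be sketched as follows. The rows are invented for illustration, and the 20-year bin width is an assumption, since the plan does not fix a bin size; `YearBuilt` and `EstFairMkt` are the column names from the merged data.

```python
import pandas as pd

# Toy rows (made up for illustration, not taken from the assessor data).
toy = pd.DataFrame({
    "YearBuilt": [1925, 1958, 1961, 1987, 1999, 2004, 2015, 2020],
    "EstFairMkt": [310_000, 250_000, 270_000, 290_000,
                   330_000, 350_000, 420_000, 450_000],
})

# Age relative to the 2022 tax year.
toy["Age"] = 2022 - toy["YearBuilt"]

# Assumed 20-year bins covering ages 0 to 120; include_lowest keeps Age == 0.
bins = list(range(0, 121, 20))
toy["AgeBin"] = pd.cut(toy["Age"], bins=bins, include_lowest=True)

# Average estimated fair market value per age bin (empty bins are dropped).
avg_by_age = toy.groupby("AgeBin", observed=True)["EstFairMkt"].mean()
print(avg_by_age)
```

The resulting series has one value per non-empty age bin, which is the quantity the planned scatter plot would show against binned `Age`.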
Question 3:
- Since a residential property's assessed value is a continuous variable, we frame this question as a regression problem. We will perform feature selection with LASSO before comparing various regression models, including linear regression, SGD-based regression, and other methods. Here, we hope to improve the $R^2$ score with ensemble learning methods.
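One possible shape for the Question 3 workflow is sketched below with scikit-learn, using synthetic data from `make_regression` in place of the property features. The helper name `lasso_then`, the choice of a random forest as the ensemble method, and all hyperparameters are assumptions made for illustration; the plan above does not specify them.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression, SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for numeric property features (NOT the Madison data).
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)


def lasso_then(model):
    """Hypothetical helper: scale, keep LASSO-selected features, then fit `model`."""
    return make_pipeline(
        StandardScaler(),
        SelectFromModel(LassoCV(cv=5, random_state=0)),
        model,
    )


candidates = {
    "linear": LinearRegression(),
    "sgd": SGDRegressor(max_iter=2000, random_state=0),
    # Assumed ensemble method; the plan only says "ensemble learning methods".
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated R^2 for each candidate model.
scores = {
    name: cross_val_score(lasso_then(model), X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

# How many of the 20 features LASSO keeps when fitted on the full toy set.
selector = lasso_then(LinearRegression()).fit(X, y)
n_kept = int(selector.named_steps["selectfrommodel"].get_support().sum())

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
print(f"features kept by LASSO: {n_kept} of {X.shape[1]}")
```

Putting the selection step inside the pipeline means it is refitted within each cross-validation fold, so the reported $R^2$ is not inflated by selecting features on the held-out rows.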