STAT 451 Machine Learning Project: Understanding How Properties in Madison are Assessed

Group: Project 16

Ahsan Fawwaz (asyraf@wisc.edu), Faris Hazim (mohamedzaimi@wisc.edu), Imran Iskander (biniskanderg@wisc.edu), Nick Elias (nelias@wisc.edu), Tyler Kelly (tpkelly@wisc.edu)

Reading Data

In [1]:

import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [2]:

# Tax-roll columns: assessed value, mill rate, estimated fair market value, and tax amount.
property_vals = pd.read_csv("Property_Tax_Roll.csv")[["Parcel", "TaxYear", "TotalAssessedValue", "MillRate", "EstFairMkt", "FullAmt"]]
# Assessor columns. Note that "TotalLivingArea" is listed twice, so it appears twice in the merged frame.
property_info = pd.read_csv("Assessor_Property_Information.csv")[["Parcel", "Address", "PropertyClass", "PropertyUse", "AreaName", "HomeStyle", "YearBuilt", "LotSize", "TotalLivingArea",
"Bedrooms", "FullBaths", "HalfBaths", "TotalLivingArea", "Basement", "ExteriorWall1", "ExteriorWall2", "Fireplaces", "CentralAir", "CurrentLand", "Zoning1", "Zoning2", "Zoning3",
"Zoning4", "NationalHistoricalDist", "ElementarySchool", "MiddleSchool", "HighSchool", "NoiseAirport", "NoiseRailroad", "NoiseStreet", "XCoord", "YCoord", "SHAPE_Length", "SHAPE_Area"]]
# Inner join keeps only parcels present in both files; Parcel then becomes the index.
property_data = pd.merge(property_vals, property_info, on="Parcel", how="inner").set_index("Parcel")
property_data.head(2)

Out[2]:

	TaxYear	TotalAssessedValue	MillRate	EstFairMkt	FullAmt	Address	PropertyClass	PropertyUse	AreaName	HomeStyle	...	ElementarySchool	MiddleSchool	HighSchool	NoiseAirport	NoiseRailroad	NoiseStreet	XCoord	YCoord	SHAPE_Length	SHAPE_Area
Parcel
60801101019	2022	273900	0.019815	286028	5064.51	2001 Rae Ln	Residential	Single family	Meadowood	Ranch	...	Huegel	Toki	Memorial	0	0	61	794441.3453	467139.3384	0.001641	1.460000e-07
60801101027	2022	274000	0.019815	286132	5066.49	2005 Rae Ln	Residential	Single family	Meadowood	Ranch	...	Huegel	Toki	Memorial	0	0	61	794440.8089	467045.1156	0.001745	1.510000e-07

2 rows × 38 columns

Questions to Answer

As we approach our graduation date, some of us are considering settling down here in Madison. Possible plans include living, working, starting a business, or even farming here in the capital of Wisconsin. Our questions of interest for this analysis are therefore as follows:

Question 1: Exploring how taxes vary among properties in Madison:

1. Is there a correlation between property values and tax amounts?
2. How do tax rates and tax amounts vary across different property types?
3. What is the distribution of tax payment amounts per property class (Residential, Commercial, Industrial, Agricultural)?

Question 2: Analyzing how various property characteristics impact a property's value:

1. Is there a clear relationship between the age of properties and their average market value?
2. What is the relationship between property type and property value?
3. Does each environmental factor impact a property's assessed value?
4. Combining the two variables above (property type and environmental factors), is it possible to classify properties based on their environmental factors?
5. Can high-value properties be identified based on physical features such as location, size, and amenities?
6. How are the residential properties around UW-Madison valued compared to those in areas outside the university?

Question 3: Bringing it all together, can we construct an accurate model to predict a residential property's assessed value?

Important Variables

Our initial datasets contain a total of 222 variables. After reviewing them, we think that TotalAssessedValue, MillRate, EstFairMkt, FullAmt (amount of tax), PropertyClass, PropertyUse, AreaName, HomeStyle, YearBuilt, the physical characteristics of a property, the zoning data, the environmental features, and schools are the most useful features for our analysis.

Methods

Question 1:

We plan to use a combination of a simple regression model (tax amounts vs. property values) and data visualization techniques such as a bar chart and a boxplot to understand how property taxes relate to property values and property types.
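As a rough illustration of the simple regression step, the sketch below fits a least-squares line of tax amount on assessed value and reports the correlation. It runs on synthetic numbers rather than the Madison tax roll; the arrays `assessed_value` and `tax_amount` and the rate `assumed_mill_rate` are stand-ins invented for this example, loosely mirroring TotalAssessedValue, FullAmt, and MillRate.

```python
import numpy as np

# Synthetic stand-in data (NOT the real tax roll); names are illustrative only.
rng = np.random.default_rng(0)
assessed_value = rng.uniform(100_000, 600_000, size=200)
assumed_mill_rate = 0.02  # illustrative rate chosen for this sketch
tax_amount = assumed_mill_rate * assessed_value + rng.normal(0, 150, size=200)

# Least-squares line: tax_amount is approximated by slope * assessed_value + intercept.
slope, intercept = np.polyfit(assessed_value, tax_amount, deg=1)

# Pearson correlation between the two variables.
r = np.corrcoef(assessed_value, tax_amount)[0, 1]

print(f"slope={slope:.5f}, intercept={intercept:.1f}, r={r:.3f}")
```

On the real data, the same two calls could be applied to the TotalAssessedValue and FullAmt columns; a fitted slope close to the mill rate would suggest that the tax amount is essentially a fixed proportion of assessed value.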

Question 2:

We will create an additional column called Age, calculated as 2022 - YearBuilt. We then aim to draw a scatter plot of the average market value vs. binned Age.
For subquestions 2 and 3, we aim to plot bar charts.
We view subquestions 4 and 5 as multi-class classification problems. Thus, we plan to utilize and compare models such as decision tree, kNN, SVM, and logistic regression. To increase accuracy, we will tune the hyperparameters, perform rescaling and one-hot encoding, and use stacking methods.
In regard to subquestion 6, we intend to graph side-by-side boxplots to compare the two distributions.
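For the classification subquestions, one possible arrangement of the steps named above (rescaling, one-hot encoding, hyperparameter tuning, and stacking) is sketched below with scikit-learn. Everything in it is an assumption made for illustration: the data are synthetic, the target `ValueTier` is a hypothetical three-class label, the home styles other than Ranch are made up, and SVM is left out of the stack to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Madison data); column names mirror the notebook.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "TotalLivingArea": rng.normal(1800, 400, n),
    "YearBuilt": rng.integers(1900, 2022, n),
    "HomeStyle": rng.choice(["Ranch", "Colonial", "Cape Cod"], n),
})
toy["Age"] = 2022 - toy["YearBuilt"]  # Age column as described in the text

# Hypothetical target: three value tiers driven mostly by living area.
score = toy["TotalLivingArea"] + rng.normal(0, 100, n)
toy["ValueTier"] = pd.qcut(score, 3, labels=["low", "mid", "high"]).astype(str)

X = toy[["TotalLivingArea", "Age", "HomeStyle"]]
y = toy["ValueTier"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Rescale numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["TotalLivingArea", "Age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["HomeStyle"]),
])

# Stack a decision tree and kNN, with logistic regression as the final estimator.
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
model = Pipeline([("prep", preprocess), ("clf", stack)])

# Tune one hyperparameter per base model with cross-validated grid search.
param_grid = {
    "clf__tree__max_depth": [3, 5],
    "clf__knn__n_neighbors": [5, 15],
}
search = GridSearchCV(model, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are refit on each cross-validation fold, so no information from the held-out fold leaks into tuning.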

Question 3:

Since a residential property's assessed value is a continuous variable, we frame this question as a regression problem. We will perform feature selection with LASSO before comparing various regression models, including linear regression, the SGD algorithm, and other methods. Here, we hope to improve the $R^2$ score with ensemble learning methods.
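A minimal sketch of this plan, assuming scikit-learn and synthetic data from `make_regression` in place of the property features: LASSO picks the features, then linear regression and SGD regression are compared by cross-validated $R^2$. The helper `with_lasso_selection` is a name introduced here for illustration, and the ensemble step is omitted.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression, SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: 20 features, of which only 5 carry signal.
X, y = make_regression(
    n_samples=400, n_features=20, n_informative=5, noise=10.0, random_state=0
)

def with_lasso_selection(regressor):
    """Hypothetical helper: scale, keep features with non-zero LASSO weights, then fit."""
    return make_pipeline(
        StandardScaler(),
        SelectFromModel(LassoCV(cv=5, random_state=0)),
        regressor,
    )

candidates = {
    "linear": with_lasso_selection(LinearRegression()),
    "sgd": with_lasso_selection(SGDRegressor(max_iter=2000, random_state=0)),
}

# Mean 5-fold cross-validated R^2 for each candidate model.
r2_scores = {
    name: cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()
    for name, pipe in candidates.items()
}
for name, r2 in r2_scores.items():
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```

Because the selection step sits inside each pipeline, the LASSO fit is repeated within every fold, which keeps the reported $R^2$ from being inflated by selecting features on the full dataset.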