Group 1 Project Proposal

Yuval Lerman, John Oh, Jedidiah Schloesser, Brian Slupecki, Kasey White

Dataset

The dataset we will be exploring, TUANDROMD, describes Android applications that may be malware. For each application, it lists device permissions and API-based features along with a malware/goodware label.

Questions

Out of the many features in the dataset, which are most important for detecting malware?
How accurately can we detect malware?
What type of model is best suited to detecting malware?
Can a model trained for one-class classification achieve accuracy similar to that of a model trained for binary classification?
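
To make the last question concrete, here is a minimal sketch of how a one-class model and a binary model could be compared on the same held-out split. Everything in it is an assumption made for illustration: the data is synthetic, the one-class model is a single diagonal Gaussian (scikit-learn's `GaussianMixture` with one component) fitted to malware samples only, and the 5% likelihood cutoff is arbitrary. It is not the method the project will necessarily use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 20 binary features whose "on" rate
# differs by class. These numbers are invented for illustration only.
n_per_class, n_features = 300, 20
X_mal = (rng.random((n_per_class, n_features)) < 0.7).astype(float)
X_good = (rng.random((n_per_class, n_features)) < 0.3).astype(float)
X = np.vstack([X_mal, X_good])
y = np.array([1] * n_per_class + [0] * n_per_class)  # 1 = malware, 0 = goodware

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Binary baseline: sees both classes during training.
binary_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
binary_acc = accuracy_score(y_test, binary_model.predict(X_test))

# One-class model: a single Gaussian fitted to malware samples only.
X_train_mal = X_train[y_train == 1]
one_class = GaussianMixture(n_components=1, covariance_type="diag", random_state=0)
one_class.fit(X_train_mal)

# Flag a sample as malware when its log-likelihood is at least the 5th
# percentile of the training malware scores (the 5% cutoff is arbitrary).
threshold = np.percentile(one_class.score_samples(X_train_mal), 5)
one_class_pred = (one_class.score_samples(X_test) >= threshold).astype(int)
one_class_acc = accuracy_score(y_test, one_class_pred)

print(f"binary accuracy:    {binary_acc:.3f}")
print(f"one-class accuracy: {one_class_acc:.3f}")
```

Scoring both models on the same test split keeps the comparison fair, even though the one-class model never sees a goodware sample during training.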

Variables

Label (Malware vs. Goodware)
214 permission-based features (e.g., SEND_SMS, READ_CONTACTS)
27 API-based features (e.g., take picture, get last location)
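
The two feature groups can be separated by column name. The sketch below assumes that API-based columns are exactly those whose names contain `->` (as in `Landroid/telephony/TelephonyManager;->getSimOperator`); that rule is inferred from the column names, not documented, and the helper `split_feature_columns` and the tiny `demo` frame are hypothetical.

```python
import pandas as pd

# Tiny stand-in frame; the column names are copied from the dataset,
# but the values are made up for illustration.
demo = pd.DataFrame({
    "SEND_SMS": [1.0, 0.0],
    "READ_CONTACTS": [0.0, 1.0],
    "Landroid/telephony/TelephonyManager;->getSimOperator": [0.0, 1.0],
    "Label": ["malware", "goodware"],
})

def split_feature_columns(df, label_col="Label"):
    """Return (permission_cols, api_cols).

    Any column whose name contains '->' is treated as an API-based feature.
    This is an assumption about the naming scheme, not a documented rule.
    """
    feature_cols = [c for c in df.columns if c != label_col]
    api_cols = [c for c in feature_cols if "->" in c]
    permission_cols = [c for c in feature_cols if "->" not in c]
    return permission_cols, api_cols

permission_cols, api_cols = split_feature_columns(demo)
print(len(permission_cols), "permission columns;", len(api_cols), "API columns")
```

On the full dataset, the two counts should be checked against the 214 and 27 listed above before the rule is relied on.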

Methods

Various Binary Classification Methods

Logistic Regression
SVM
KNN
Decision Tree / Random Forest

One-Class Gaussian Models
Techniques for feature engineering
Techniques for handling an imbalanced dataset
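
As a rough sketch of how the binary classifiers listed above might be compared, the block below cross-validates each one on synthetic, imbalanced data. The data, the hyperparameters, the use of `class_weight="balanced"` as the imbalance technique, and the choice of balanced accuracy as the metric are all assumptions made for illustration, not decisions the proposal has made.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic, imbalanced stand-in data (invented for illustration):
# 400 "malware" rows and 100 "goodware" rows over 10 binary features.
# Only the first three features carry signal; the rest are noise.
n_mal, n_good, n_features = 400, 100, 10
y = np.array([1] * n_mal + [0] * n_good)
X = (rng.random((n_mal + n_good, n_features)) < 0.5).astype(float)
signal_rate = np.where(y == 1, 0.8, 0.2)[:, None]
X[:, :3] = (rng.random((n_mal + n_good, 3)) < signal_rate).astype(float)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "SVM": SVC(class_weight="balanced"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

# Balanced accuracy averages per-class recall, so a model cannot score well
# simply by always predicting the majority class.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name:20s} {score:.3f}")

# Impurity-based importances from the forest give a first, rough feature ranking.
forest = models["Random Forest"].fit(X, y)
top_features = np.argsort(forest.feature_importances_)[::-1][:3]
print("top feature indices:", sorted(top_features.tolist()))
```

The random forest's `feature_importances_` is only one possible way to rank features; it is shown here because the forest is already among the listed methods.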

In [1]:

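import pandas as pd

# Load the TUANDROMD dataset; the CSV is expected in the working directory.
malware = pd.read_csv('TUANDROMD.csv')
# Preview the first five rows.
malware.head()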

Out[1]:

	ACCESS_NETWORK_STATE	...	Landroid/telephony/TelephonyManager;->getLine1Number	Landroid/telephony/TelephonyManager;->getNetworkOperator	Landroid/telephony/TelephonyManager;->getNetworkOperatorName	Landroid/telephony/TelephonyManager;->getNetworkCountryIso	Landroid/telephony/TelephonyManager;->getSimOperator	Landroid/telephony/TelephonyManager;->getSimCountryIso	Lorg/apache/http/impl/client/DefaultHttpClient;->execute	Label
0	1.0	...	1.0	1.0	1.0	0.0	0.0	0.0	1.0	malware
1	1.0	...	0.0	0.0	0.0	1.0	0.0	1.0	0.0	malware
2	1.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	malware
3	0.0	...	0.0	1.0	1.0	1.0	1.0	1.0	0.0	malware
4	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	malware

5 rows × 242 columns
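
A natural next step is to check the class balance and separate the features from the label. The sketch below uses a small invented frame in place of the full `malware` DataFrame so that it runs on its own. The 0/1 columns appear as floats in the preview, which is what pandas produces when an otherwise integer column contains missing values, so the sketch assumes some rows may be incomplete; the preview alone does not confirm this.

```python
import pandas as pd

# Small stand-in for the full DataFrame; the values are invented.
demo = pd.DataFrame({
    "ACCESS_NETWORK_STATE": [1.0, 1.0, 0.0, None],
    "SEND_SMS": [1.0, 0.0, 0.0, None],
    "Label": ["malware", "malware", "goodware", None],
})

# Drop rows with a missing label, then separate features from the target.
clean = demo.dropna(subset=["Label"])
X = clean.drop(columns="Label")
y = (clean["Label"] == "malware").astype(int)  # 1 = malware, 0 = goodware

# Class counts show how imbalanced the labels are.
class_counts = clean["Label"].value_counts()
print(class_counts)
print(X.shape, y.shape)
```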