Dataset
The dataset we will be exploring describes Android applications that are potentially malware. For each application, the device permissions and API-based features are recorded, along with a malware/goodware label.
Questions
- Out of the many features in the dataset, which are most important for detecting malware?
- How accurately can we detect malware?
- What type of model is best suited to detecting malware?
- Can a model trained for one-class classification achieve similar accuracy to a model trained for binary classification?
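The last question contrasts one-class and binary training. As a rough illustration of the one-class side, the sketch below fits a single multivariate Gaussian to samples from one class only and flags points whose squared Mahalanobis distance exceeds a quantile threshold. The class name `OneClassGaussian`, the regularization value, the 95% quantile, and the synthetic data are all assumptions made for this example. Which class the analysis trains on is not stated here, and a Gaussian is only a crude fit for 0/1 features.

```python
import numpy as np

class OneClassGaussian:
    """Hypothetical one-class model: Gaussian fit plus a distance threshold."""

    def __init__(self, reg=1e-3, quantile=0.95):
        self.reg = reg            # ridge term keeps the covariance invertible
        self.quantile = quantile  # fraction of training points to accept

    def _distance(self, X):
        # Squared Mahalanobis distance of each row to the fitted mean.
        diff = X - self.mean_
        return np.einsum("ij,jk,ik->i", diff, self.precision_, diff)

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + self.reg * np.eye(X.shape[1])
        self.precision_ = np.linalg.inv(cov)
        self.threshold_ = np.quantile(self._distance(X), self.quantile)
        return self

    def predict(self, X):
        # True means "looks like the training class".
        return self._distance(X) <= self.threshold_

# Made-up binary features: the training class rarely sets a feature,
# the other class sets features often.
rng = np.random.default_rng(0)
train_in = (rng.random((500, 10)) < 0.1).astype(float)
test_in = (rng.random((200, 10)) < 0.1).astype(float)
test_out = (rng.random((200, 10)) < 0.7).astype(float)

model = OneClassGaussian().fit(train_in)
accepted_in = model.predict(test_in).mean()
rejected_out = 1.0 - model.predict(test_out).mean()
print(f"in-class accepted: {accepted_in:.2f}, other class rejected: {rejected_out:.2f}")
```

The model never sees the second class during fitting, which is the key difference from the binary classifiers: the threshold is set only from how spread out the training class is.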
Variables
- Label (Malware vs. Goodware)
- 214 permission-based features (e.g., SEND_SMS, READ_CONTACTS)
- 27 API-based features (e.g., take picture, get last location)
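The two feature groups can be separated by column name once the data is loaded. The sketch below is a heuristic, assuming that API-based columns are written as method signatures containing `->` (for example `Landroid/telephony/TelephonyManager;->getLine1Number`) and that every other non-label column is a permission. The helper name `split_feature_columns` and the toy frame are hypothetical, and the resulting counts should be checked against the 214 and 27 listed above.

```python
import pandas as pd

def split_feature_columns(df, label_col="Label"):
    """Group feature columns into permission-style and API-style names."""
    feature_cols = [c for c in df.columns if c != label_col]
    # Assumption: API-call features are named like "Lpackage/Class;->method".
    api_cols = [c for c in feature_cols if "->" in c]
    permission_cols = [c for c in feature_cols if "->" not in c]
    return permission_cols, api_cols

# Toy frame with a handful of column names, standing in for the real CSV.
toy = pd.DataFrame(columns=[
    "SEND_SMS",
    "READ_CONTACTS",
    "Landroid/telephony/TelephonyManager;->getLine1Number",
    "Label",
])
permission_cols, api_cols = split_feature_columns(toy)
print(len(permission_cols), "permission columns,", len(api_cols), "API columns")
```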
Methods
- Various Binary Classification Methods
  - Logistic Regression
  - SVM
  - KNN
  - Decision Tree / Random Forest
- One-Class Gaussian Models
- Techniques for feature engineering
- Techniques for handling an imbalanced dataset
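To make the binary-classification part of this plan concrete, here is a minimal sketch that compares the listed model families with stratified cross-validation. It runs on synthetic binary features, not on the real dataset. The hyperparameters, the use of `class_weight="balanced"`, and balanced accuracy as the metric are assumptions chosen for illustration; they are one common way to account for class imbalance, not necessarily the approach taken in the rest of the analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (made up): 400 samples, 20 binary features,
# with roughly 80% of samples in the positive class.
rng = np.random.default_rng(0)
n_samples, n_features = 400, 20
y = (rng.random(n_samples) < 0.8).astype(int)
X = (rng.random((n_samples, n_features)) < 0.3).astype(float)
# Make the first three features agree with the label about 85% of the time.
agrees = rng.random((n_samples, 3)) < 0.85
X[:, :3] = np.where(agrees, y[:, None], 1 - y[:, None])

models = {
    "logistic_regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "svm": SVC(kernel="rbf", class_weight="balanced"),
    "knn": KNeighborsClassifier(n_neighbors=5),  # KNN has no class_weight option
    "decision_tree": DecisionTreeClassifier(class_weight="balanced", random_state=0),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy").mean()
    for name, model in models.items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {score:.3f}")
```

Balanced accuracy averages the recall of each class, so a model that always predicts the majority class scores 0.5 instead of looking deceptively accurate.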
In [1]:
import pandas as pd
malware = pd.read_csv('TUANDROMD.csv')
malware.head()
Out[1]:
| | ACCESS_ALL_DOWNLOADS | ACCESS_CACHE_FILESYSTEM | ACCESS_CHECKIN_PROPERTIES | ACCESS_COARSE_LOCATION | ACCESS_COARSE_UPDATES | ACCESS_FINE_LOCATION | ACCESS_LOCATION_EXTRA_COMMANDS | ACCESS_MOCK_LOCATION | ACCESS_MTK_MMHW | ACCESS_NETWORK_STATE | ... | Landroid/telephony/TelephonyManager;->getLine1Number | Landroid/telephony/TelephonyManager;->getNetworkOperator | Landroid/telephony/TelephonyManager;->getNetworkOperatorName | Landroid/telephony/TelephonyManager;->getNetworkCountryIso | Landroid/telephony/TelephonyManager;->getSimOperator | Landroid/telephony/TelephonyManager;->getSimOperatorName | Landroid/telephony/TelephonyManager;->getSimCountryIso | Landroid/telephony/TelephonyManager;->getSimSerialNumber | Lorg/apache/http/impl/client/DefaultHttpClient;->execute | Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | malware |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | malware |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | malware |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | malware |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | malware |
5 rows × 242 columns
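The first five rows all carry the `malware` label, so the preview says nothing about class balance. A quick per-class count is a natural next check. The sketch below assumes the label column is named `Label`, as in the output above; the helper `summarize_labels` and the toy frame with made-up values are hypothetical, and the real frame would be passed in the same way.

```python
import pandas as pd

def summarize_labels(df, label_col="Label"):
    """Return per-class counts and fractions, ignoring rows with a missing label."""
    counts = df[label_col].value_counts(dropna=True)
    return pd.DataFrame({"count": counts, "fraction": counts / counts.sum()})

# Tiny stand-in frame (made-up values) so the sketch runs without the CSV file.
toy = pd.DataFrame({
    "SEND_SMS": [1.0, 0.0, 1.0, 0.0],
    "Label": ["malware", "malware", "malware", "goodware"],
})
summary = summarize_labels(toy)
print(summary)
```

With the loaded data this would be called as `summarize_labels(malware)`.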