1 Motivation
In the latter part of the twentieth century, the internet revolutionized how people share information, often without stringent editorial standards. More recently, social media has become a significant news source for many people. According to Statista (https://www.statista.com/statistics/947869/facebook-product-mau/), approximately 3.96 billion people worldwide are monthly active social media users. Social media offers evident advantages for news dissemination, including immediate access to information, free distribution, no time constraints, and diverse content. However, these platforms lack substantial regulation and oversight, which makes it easier for fake news to spread.
Fake news detection falls under the umbrella of text classification. It can be framed either as binary classification (distinguishing real from fake news) or as multi-class classification for finer granularity. We are interested in applying several approaches to fake news detection on the FNC-1 (Fake News Challenge) benchmark dataset.
2 Goals
This project pursues the following goals:
- To use a one-hot encoder to extract textual features from the text data.
- To apply different machine learning approaches to classify the labels, e.g., Linear Regression, Decision Tree, Random Forest, and ensemble learning.
3 Methodology
The experiments start with data preprocessing and feature extraction. We then apply different machine learning algorithms to perform the classification.
- Data Preprocessing - The datasets need to be preprocessed. The preprocessing techniques include text cleaning, punctuation removal, lowercasing, etc.
- Word embedding - Feature extraction is another key factor in model performance. The proposed method is statistical, i.e., one-hot encoding.
- Machine Learning approaches - The FNC-1 dataset contains 49,972 labelled news items (only 10% of the data will be used in the experiment). We first convert the textual labels into numerical labels. Linear Regression, Decision Tree, Random Forest, and ensemble learning are then used to classify the labels, and their performance is compared.
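The three steps above can be sketched end to end on a handful of made-up headlines. This is a minimal sketch rather than the project's actual pipeline: it assumes scikit-learn, uses `CountVectorizer(binary=True)` as a one-hot style encoder, and swaps in `LogisticRegression` as the linear model, since plain linear regression does not output class labels. The helper `clean_text`, the toy texts, and their labels are all hypothetical.

```python
import string

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Map every punctuation character to a space so "15/Month" becomes "15 month".
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse repeated whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TO_SPACE).split())


# Made-up toy headlines with hypothetical numeric class ids; not FNC-1 data.
texts = [
    "Floods hit the coastal town!",
    "Coastal town flooded, officials confirm",
    "New phone released this week",
    "Phone maker announces new release",
    "Heavy floods reported in town",
    "Company unveils a new phone",
]
labels = [1, 1, 0, 0, 1, 0]

cleaned = [clean_text(t) for t in texts]

# binary=True records only the presence or absence of each vocabulary word,
# giving one 0/1 vector per document.
vectorizer = CountVectorizer(binary=True)
features = vectorizer.fit_transform(cleaned)

models = {
    # Assumption: logistic regression stands in for the linear model.
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
# Hard-voting ensemble over the three base models (it fits its own clones).
models["voting_ensemble"] = VotingClassifier(
    estimators=list(models.items()), voting="hard"
)

predictions = {}
for name, model in models.items():
    model.fit(features, labels)
    predictions[name] = model.predict(features)
    score = model.score(features, labels)
    print(f"{name}: training accuracy = {score:.2f}")
```

The printed scores are training accuracy on the toy data only, so they say nothing about performance on FNC-1; a held-out split would be needed for a real comparison.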
4 Code Example
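import pandas as pd
# Load the FNC-1 training headlines with their stances, and the article bodies.
stances = pd.read_csv('./fnc-1/train_stances.csv')
bodies = pd.read_csv('./fnc-1/train_bodies.csv')
# Attach each article body to its headline/stance rows via the shared "Body ID" column.
data_merged = pd.merge(bodies, stances, on="Body ID")
# Display the stances table.
stances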
| | Headline | Body ID | Stance |
|---|---|---|---|
| 0 | Police find mass graves with at least '15 bodi... | 712 | unrelated |
| 1 | Hundreds of Palestinians flee floods in Gaza a... | 158 | agree |
| 2 | Christian Bale passes on role of Steve Jobs, a... | 137 | unrelated |
| 3 | HBO and Apple in Talks for $15/Month Apple TV ... | 1034 | unrelated |
| 4 | Spider burrowed through tourist's stomach and ... | 1923 | disagree |
| ... | ... | ... | ... |
| 49967 | Urgent: The Leader of ISIL 'Abu Bakr al-Baghda... | 1681 | unrelated |
| 49968 | Brian Williams slams social media for speculat... | 2419 | unrelated |
| 49969 | Mexico Says Missing Students Not Found In Firs... | 1156 | agree |
| 49970 | US Lawmaker: Ten ISIS Fighters Have Been Appre... | 1012 | discuss |
| 49971 | Shots Heard In Alleged Brown Shooting Recordin... | 2044 | unrelated |
49972 rows × 3 columns
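The methodology's 10% subsample and its conversion of textual labels to numerical labels could look roughly like the sketch below. It runs on a small made-up frame with the same columns as the table above, so that it does not depend on the CSV files. The names `stances_toy`, `subset`, and `STANCE_TO_ID`, the random seed, and the particular integer ids are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows with the same columns as train_stances.csv; not real FNC-1 data.
stances_toy = pd.DataFrame({
    "Headline": [f"headline {i}" for i in range(20)],
    "Body ID": list(range(20)),
    "Stance": ["unrelated", "agree", "disagree", "discuss"] * 5,
})

# Keep a reproducible 10% random subset of the rows.
subset = stances_toy.sample(frac=0.1, random_state=42)

# Hypothetical integer ids for the four stances; any consistent mapping would do.
STANCE_TO_ID = {"agree": 0, "disagree": 1, "discuss": 2, "unrelated": 3}

# Add a numeric label column next to the original textual stance.
subset = subset.assign(label=subset["Stance"].map(STANCE_TO_ID))
print(subset[["Stance", "label"]])
```

Fixing `random_state` keeps the subset identical across runs, which matters when several models are compared on the same sample.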