Proposal

1 Motivation

In the latter part of the twentieth century, the internet revolutionized how people share information, often without stringent editorial standards. More recently, social media has become a significant news source for many people. According to Statista (https://www.statista.com/statistics/947869/facebook-product-mau/), approximately 3.96 billion people worldwide are monthly active social media users. Social media offers clear advantages for news dissemination, including immediate access to information, free distribution, no time constraints, and diverse content. However, these platforms lack substantial regulation and oversight, which leaves room for fake news to spread.

Fake news detection falls under the umbrella of text classification. It can be framed as binary classification (distinguishing real from fake news) or as multi-class classification for finer granularity. We are interested in applying several approaches to fake news detection on the FNC-1 dataset.

2 Goals

This project pursues the following goals:

Use one-hot encoding to extract features from the text data.
Apply different machine learning approaches to classify the labels, e.g., linear regression, decision tree, random forest, and ensemble learning.

3 Methodology

The experiments start with data preprocessing and feature extraction. We then apply different machine learning algorithms to perform the classification.

Data Preprocessing - The datasets need to be preprocessed. The preprocessing techniques include text cleaning, punctuation removal, lowercasing, etc.
Word embedding - Feature extraction is another key factor in model performance. The proposed method is a statistical one, namely one-hot encoding.
Machine Learning approaches - The FNC-1 dataset contains 49972 labelled news items (only 10% of the data will be used in the experiment). We first convert the textual labels into numerical labels. On the machine learning side, linear regression, decision tree, random forest, and ensemble learning are chosen to classify the labels in this project, and their performance will be compared.
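
The first two steps, together with the label conversion, can be illustrated with a minimal sketch that uses only the Python standard library. The helper names (`preprocess`, `build_vocabulary`, `one_hot_encode`), the toy headlines, and the integer codes in `STANCE_TO_ID` are assumptions made for illustration; the proposal does not fix any of them. "One-hot encoding" is read here as a binary vector per document with one column per vocabulary word.

```python
import re
import string


def preprocess(text):
    """Lowercase, remove ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def one_hot_encode(document, vocabulary):
    """Return a binary vector with 1 where a vocabulary token occurs."""
    vector = [0] * len(vocabulary)
    for tok in document.split():
        idx = vocabulary.get(tok)
        if idx is not None:  # tokens outside the vocabulary are ignored
            vector[idx] = 1
    return vector


# Assumed integer codes for the four stance labels (ordering is arbitrary).
STANCE_TO_ID = {"agree": 0, "disagree": 1, "discuss": 2, "unrelated": 3}

# Toy inputs, not taken from the dataset.
headlines = ["Police find mass graves!", "Hundreds flee floods, police say."]
stance_names = ["unrelated", "agree"]

cleaned = [preprocess(h) for h in headlines]
vocab = build_vocabulary(cleaned)
features = [one_hot_encode(doc, vocab) for doc in cleaned]
labels = [STANCE_TO_ID[s] for s in stance_names]
```

The resulting `features` and `labels` lists have the shape a classifier expects: one fixed-length numeric row per text and one integer per label. On the full dataset the vocabulary becomes large, so a sparse representation would likely be preferable to plain lists.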

4 Code Example

In [1]:

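import pandas as pd
# Load the FNC-1 training stances and the article bodies.
stances = pd.read_csv('./fnc-1/train_stances.csv')
bodies = pd.read_csv('./fnc-1/train_bodies.csv')
# Attach each article body to its stance rows via the shared "Body ID" key.
data_merged = pd.merge(bodies, stances, on="Body ID")
# Display the stance table.
stances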

Out[1]:

	Headline	Body ID	Stance
0	Police find mass graves with at least '15 bodi...	712	unrelated
1	Hundreds of Palestinians flee floods in Gaza a...	158	agree
2	Christian Bale passes on role of Steve Jobs, a...	137	unrelated
3	HBO and Apple in Talks for $15/Month Apple TV ...	1034	unrelated
4	Spider burrowed through tourist's stomach and ...	1923	disagree
...	...	...	...
49967	Urgent: The Leader of ISIL 'Abu Bakr al-Baghda...	1681	unrelated
49968	Brian Williams slams social media for speculat...	2419	unrelated
49969	Mexico Says Missing Students Not Found In Firs...	1156	agree
49970	US Lawmaker: Ten ISIS Fighters Have Been Appre...	1012	discuss
49971	Shots Heard In Alleged Brown Shooting Recordin...	2044	unrelated

49972 rows × 3 columns
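
The 10% subset and the numeric labels mentioned in the methodology could be derived from a table like the one above roughly as follows. This is a sketch under assumptions: the proposal does not say how the subset is drawn, so a simple random sample is used, and the seed and the integer codes in `stance_to_id` are arbitrary choices. A small synthetic frame stands in for the real stance table so that the snippet runs without the data files.

```python
import pandas as pd

# Synthetic stand-in with the same columns as the stance table.
toy = pd.DataFrame({
    "Headline": [f"headline {i}" for i in range(40)],
    "Body ID": list(range(40)),
    "Stance": ["unrelated"] * 20 + ["discuss"] * 10 + ["agree"] * 10,
})

# Keep a random 10% of the rows; random_state makes the draw repeatable.
subset = toy.sample(frac=0.1, random_state=0)

# Convert the textual stance labels to integers (assumed coding).
stance_to_id = {"agree": 0, "disagree": 1, "discuss": 2, "unrelated": 3}
subset = subset.assign(label=subset["Stance"].map(stance_to_id))
```

With the real data, `toy` would be replaced by `stances` or `data_merged`. Because the sample table shows far more `unrelated` rows than other stances, a simple random sample may leave the rarer stances with few examples, so sampling within each stance could be worth considering.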