In [82]:
import pandas as pd
df = pd.read_csv("fake_news_dataset.csv") # read the dataset
df = df[['id', 'state', 'date_published', 'source',
         'category', 'sentiment_score', 'word_count', 'has_images',
         'has_videos', 'readability_score', 'num_shares', 'num_comments',
         'political_bias', 'fact_check_rating', 'is_satirical', 'trust_score',
         'source_reputation', 'clickbait_score', 'plagiarism_score', 'label']] # we don't need author, title,
# character count, or text, because that information is irrelevant here, so we drop those columns
df = df.dropna() # drop all rows with NA values (dropna must be called and the result reassigned to take effect)
df['True'] = df['label'].map({'Fake': 0, 'Real': 1}) # new column "True": map the string "Fake" to 0 and "Real" to 1
# (map avoids the FutureWarning that replace raises about silent downcasting)
df = df.drop(columns=['label']) # 'label' is now redundant, so we remove it
df.head(4) # print out the first four samples
Out[82]:
| | id | state | date_published | source | category | sentiment_score | word_count | has_images | has_videos | readability_score | num_shares | num_comments | political_bias | fact_check_rating | is_satirical | trust_score | source_reputation | clickbait_score | plagiarism_score | True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Tennessee | 30-11-2021 | The Onion | Entertainment | -0.22 | 1302 | 0 | 0 | 66.18 | 47305 | 450 | Center | FALSE | 1 | 76 | 6 | 0.84 | 53.35 | 0 |
| 1 | 2 | Wisconsin | 02-09-2021 | The Guardian | Technology | 0.92 | 322 | 1 | 0 | 41.10 | 39804 | 530 | Left | Mixed | 1 | 1 | 5 | 0.85 | 28.28 | 0 |
| 2 | 3 | Missouri | 13-04-2021 | New York Times | Sports | 0.25 | 228 | 0 | 1 | 30.04 | 45860 | 763 | Center | Mixed | 0 | 57 | 1 | 0.72 | 0.38 | 0 |
| 3 | 4 | North Carolina | 08-03-2020 | CNN | Sports | 0.94 | 155 | 1 | 0 | 75.16 | 34222 | 945 | Center | TRUE | 1 | 18 | 10 | 0.92 | 32.20 | 0 |
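Before moving on, a quick sanity check can confirm the class balance and that the dropna call actually removed all missing values. This is a minimal sketch; the column names follow the frame built above:

In [ ]:
print(df['True'].value_counts()) # class balance: count of real (1) vs. fake (0) articles
print(df.isna().sum().sum())     # total remaining missing values; should be 0 after dropna()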
Questions we want to answer using this data
- Which state is associated with the most fake news?
- What distinguishes real news from fake news, and how can a model detect the difference?
- Are certain topics more commonly fake or real in the data?
- Does fake news increase when a political event is occurring?
Variables we will be using
If you look at the dataset above, we will be using all of these variables. Here they are in a list:
'id' : unique identifier for the article
'state' : where the article was published
'date_published' : date when the article was published
'source' : outlet that published the article
'category' : topic of the article
'sentiment_score' : sentiment of the article's text (negative to positive)
'word_count' : word count of the article
'has_images' : whether the article has images
'has_videos' : whether the article has videos
'readability_score' : how easy the text is to read
'num_shares' : how many times the article was shared
'num_comments' : how many comments were left
'political_bias' : the article's political leaning (e.g., Left, Center)
'fact_check_rating' : the fact-check verdict for the article (e.g., TRUE, Mixed, FALSE)
'is_satirical' : whether the article is satire
'trust_score' : trustworthiness of the article
'source_reputation' : reputability of the source
'clickbait_score' : how clickbait-like the article is (on a 0-1 scale)
'plagiarism_score' : percentage of the text that matches sources in a plagiarism database
'True' : 1 if the article is real, 0 if it is fake
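Several of these variables (state, source, category, political_bias, fact_check_rating) are strings, so they need a numeric encoding before the models below can use them. A minimal sketch using one-hot encoding with pd.get_dummies; the choice of encoding here is an assumption, not part of the original pipeline:

In [ ]:
# One-hot encode the string-valued columns so every feature is numeric
categorical_cols = ['state', 'source', 'category', 'political_bias', 'fact_check_rating']
df_encoded = pd.get_dummies(df, columns=categorical_cols)
# date_published is still a string; it could be parsed with pd.to_datetime if used as a feature
df_encoded.head(4)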
Methods we will use to answer our questions
We will try several modeling methods to see which gives the most precise answers, such as Logistic Regression, Decision Trees, SVMs, and kNN. We will also use train_test_split so each model is trained on one portion of the data and evaluated on held-out "new" data, ensuring the performance estimate is unbiased and the model generalizes well.
We will also record each model's accuracy, precision, and recall, and inspect the learned w coefficients and w intercepts to judge accuracy and how well the model fits.
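As a concrete sketch of that workflow, using logistic regression as one of the candidate models and the one-hot-encoded frame from above (the feature selection and split parameters here are illustrative assumptions):

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Features and target; 'id' and the unparsed 'date_published' string are dropped
X = df_encoded.drop(columns=['True', 'id', 'date_published'])
y = df_encoded['True']

# Hold out 20% of the data so the model is evaluated on samples it never saw during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("w coefficients:", model.coef_)   # learned weight per feature
print("w intercept   :", model.intercept_)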