In [82]:
import pandas as pd
df = pd.read_csv("fake_news_dataset.csv") #reading the dataset 
df = df[['id', 'state', 'date_published', 'source',                       
       'category', 'sentiment_score', 'word_count', 'has_images',
       'has_videos', 'readability_score', 'num_shares', 'num_comments',
       'political_bias', 'fact_check_rating', 'is_satirical', 'trust_score',
       'source_reputation', 'clickbait_score', 'plagiarism_score', 'label']]   #we don't want author, title, character count, or text,
                                                                               #because that information is irrelevant, so we remove it

df = df.dropna() #we drop all rows with NA values; dropna() must be called and its result assigned back

df['True'] = df['label'].map({'Fake': 0, 'Real': 1}) #we make a new column called "True", mapping the string "Fake" to 0 and "Real" to 1 (map avoids replace's deprecated downcasting)

df = df[['id', 'state', 'date_published', 'source',
       'category', 'sentiment_score', 'word_count', 'has_images',
       'has_videos', 'readability_score', 'num_shares', 'num_comments',
       'political_bias', 'fact_check_rating', 'is_satirical', 'trust_score',
       'source_reputation', 'clickbait_score', 'plagiarism_score', 'True']]   #we do the same thing, we just remove  'label'
  
df.head(4)  #print out the first four samples
Out[82]:
id state date_published source category sentiment_score word_count has_images has_videos readability_score num_shares num_comments political_bias fact_check_rating is_satirical trust_score source_reputation clickbait_score plagiarism_score True
0 1 Tennessee 30-11-2021 The Onion Entertainment -0.22 1302 0 0 66.18 47305 450 Center FALSE 1 76 6 0.84 53.35 0
1 2 Wisconsin 02-09-2021 The Guardian Technology 0.92 322 1 0 41.10 39804 530 Left Mixed 1 1 5 0.85 28.28 0
2 3 Missouri 13-04-2021 New York Times Sports 0.25 228 0 1 30.04 45860 763 Center Mixed 0 57 1 0.72 0.38 0
3 4 North Carolina 08-03-2020 CNN Sports 0.94 155 1 0 75.16 34222 945 Center TRUE 1 18 10 0.92 32.20 0

Questions we want to answer using this data

  1. Which state might be populated with the most fake news?
  2. What is the difference between real and fake news, and how can a model detect it?
  3. Are there common topics among the fake or real articles in the data?
  4. Is there an increase in fake news when a political event is occurring?
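As a sketch of how question 1 could be answered once the data is cleaned, we can count fake articles per state with a groupby. The frame below is a toy stand-in with invented values, since the real CSV is not assumed here:

```python
import pandas as pd

# Toy stand-in for the cleaned dataset (hypothetical rows)
toy = pd.DataFrame({
    'state': ['Tennessee', 'Tennessee', 'Wisconsin', 'Missouri'],
    'True':  [0, 0, 1, 0],   # 0 = fake, 1 = real, as encoded above
})

# Count fake articles (True == 0) per state and rank states by that count
fake_per_state = (toy[toy['True'] == 0]
                  .groupby('state')
                  .size()
                  .sort_values(ascending=False))
print(fake_per_state)
```

On the full dataset the same three-line chain would directly rank states by fake-news volume.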

Variables we will be using

If you look at the dataset above, we will be using all of these variables. Here they are in a list:


'id' : numeric label of the article
'state' : where the article was published
'date_published' : date when the article was published
'source' : outlet that published the article
'category' : type of article
'sentiment_score' : emotion score of the text
'word_count' : word count of the article
'has_images' : whether the article has images
'has_videos' : whether the article has videos
'readability_score' : how easy the text is to read
'num_shares' : how many times the article was shared
'num_comments' : how many comments were left
'political_bias' : political leaning of the article (Left, Center, Right)
'fact_check_rating' : the article's fact-check verdict (TRUE, Mixed, FALSE)
'is_satirical' : whether the article is satire
'trust_score' : trustworthiness of the article
'source_reputation' : reputability of the source
'clickbait_score' : how clickbait-like the article is
'plagiarism_score' : percentage of the text that matches sources in a plagiarism database
'True' : whether the article is real (1) or fake (0)
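Several of these variables ('state', 'source', 'category', 'political_bias', 'fact_check_rating') are categorical strings, so before modeling they would need to be encoded as numbers. A minimal sketch with pd.get_dummies, using a tiny invented frame (the column names are the dataset's, the values are made up):

```python
import pandas as pd

# Tiny stand-in frame with one categorical and one numeric column
toy = pd.DataFrame({
    'political_bias': ['Center', 'Left', 'Center'],
    'word_count': [1302, 322, 228],
})

# One-hot encode the categorical column; drop_first avoids a redundant column
encoded = pd.get_dummies(toy, columns=['political_bias'], drop_first=True)
print(encoded.columns.tolist())
```

The same call applied to the full feature frame would leave every model-ready column numeric.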

Methods we will use to answer our questions


We will compare several modeling methods, such as Logistic Regression, Decision Trees, SVMs, and kNN, to find out which one gives the most accurate predictions. We will also use train_test_split so the model is evaluated on held-out, "new" data, ensuring there is no bias from testing on the training set and that the model generalizes well.
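A minimal sketch of that train/test workflow with scikit-learn, using synthetic features in place of the cleaned dataset (the toy labeling rule is invented purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic features/labels standing in for the cleaned dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))             # e.g. sentiment, readability, clickbait
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # toy rule: 0 = fake, 1 = real

# Hold out 25% of the rows so the model is scored on data it never saw
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print(round(model.score(X_test, y_test), 2))  # accuracy on held-out data
```

Swapping in the real feature columns and the 'True' label would follow the exact same shape.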

We will also evaluate each model's accuracy, precision, and recall, and inspect its w coefficients and intercepts, to judge predictive performance and whether the fit is as desired.