Fake News Detection
The problem at hand is fake news detection. In the context of information technology and AI, “fake news” is misleading or false information presented as genuine news. With the rise of social media and online platforms, fake news now spreads quickly and widely. It can be harmful in many ways, for example by shaping public opinion on the basis of false information or by causing unnecessary panic and confusion.
The AI technique used to solve this problem falls under Natural Language Processing (NLP), a subfield of artificial intelligence that focuses on the interaction between computers and humans through natural language. The goal of NLP is to read, interpret, and make sense of human language in a useful way.
Specifically, the machine learning model used here is scikit-learn's PassiveAggressiveClassifier, a type of online learning algorithm. Online learning models update on one example (or a small batch) at a time, which makes them well suited to large-scale problems and to streams of incoming data where training over the entire dataset at once is not feasible.
The PassiveAggressiveClassifier is part of a family of algorithms for large-scale learning. It's very similar to the Perceptron in that it does not require a learning rate. Unlike the Perceptron, however, it includes a regularization parameter (C in scikit-learn).
In layman's terms, the algorithm stays ‘passive’ when an example is classified correctly but turns ‘aggressive’ on a misclassification, updating its weights to correct the mistake.
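The passive and aggressive behaviour can be sketched in a few lines of plain Python. This is a minimal illustration of the PA-I update rule for a binary linear classifier with labels in {-1, +1}, not scikit-learn's implementation; the function name `pa_update`, the toy vectors, and the value of `C` are assumptions made for the example.

```python
def pa_update(w, x, y, C=1.0):
    """Apply one PA-I update to weights w for example x with label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    if loss == 0.0:
        return w  # passive: the example is already classified with enough margin
    norm_sq = sum(xi * xi for xi in x)
    if norm_sq == 0.0:
        return w  # an all-zero vector carries no information to update on
    tau = min(C, loss / norm_sq)  # aggressive step size, capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


# Toy stream of (token-count vector, label) pairs, processed one at a time
stream = [
    ([2.0, 0.0, 1.0], +1),
    ([0.0, 3.0, 1.0], -1),
    ([1.0, 0.0, 0.0], +1),
]

w = [0.0, 0.0, 0.0]
for x, y in stream:
    w = pa_update(w, x, y)
print(w)
```

Because each update touches a single example, the same loop works on a stream of data that never has to fit in memory at once.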
The specific tasks it’s used for here include:
Text Feature Extraction: Before we feed the text into a machine learning model, we have to convert it into some kind of numeric representation that the model can understand. This is where CountVectorizer comes in. It’s a method used to convert the text data into a matrix of token counts.
Text Classification: This is the task of predicting the class (i.e., category) of a given piece of text. Here, we use it to predict whether a given piece of news is “real” or “fake”. The PassiveAggressiveClassifier is particularly well-suited to this task because it can efficiently handle large amounts of data and provide accurate predictions.
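To make the token-count matrix concrete, here is a simplified pure-Python sketch of what a count vectorizer produces. It lowercases the text and keeps tokens of two or more word characters, which mirrors CountVectorizer's default tokenization, but it omits stop-word removal and sparse storage. The function name and the sample sentences are made up for illustration.

```python
import re
from collections import Counter


def count_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


docs = ["Breaking news: aliens land", "Aliens deny the news, news at ten"]
vocabulary, rows = count_matrix(docs)

# Print one line per vocabulary word with its count in each document
for word, column in zip(vocabulary, zip(*rows)):
    print(f"{word:10s} {column}")
```

Each document becomes one row and each distinct word one column, so the classifier sees only how often each word occurs, not the order of the words.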
Step 1: Import Necessary Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Load the Data
# Read the data
df = pd.read_csv('train.csv')
Step 3: Inspect the Data
# Display the first few records
print(df.head())
# Summary of the dataset (df.info() prints directly and returns None)
df.info()
id title author \
0 0 House Dem Aide: We Didn’t Even See Comey’s Let... Darrell Lucus
1 1 FLYNN: Hillary Clinton, Big Woman on Campus - ... Daniel J. Flynn
2 2 Why the Truth Might Get You Fired Consortiumnews.com
3 3 15 Civilians Killed In Single US Airstrike Hav... Jessica Purkiss
4 4 Iranian woman jailed for fictional unpublished... Howard Portnoy
text label
0 House Dem Aide: We Didn’t Even See Comey’s Let... 1
1 Ever get the feeling your life circles the rou... 0
2 Why the Truth Might Get You Fired October 29, ... 1
3 Videos 15 Civilians Killed In Single US Airstr... 1
4 Print \nAn Iranian woman has been sentenced to... 1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 20800 non-null int64
1 title 20242 non-null object
2 author 18843 non-null object
3 text 20761 non-null object
4 label 20800 non-null int64
dtypes: int64(2), object(3)
memory usage: 812.6+ KB
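Before we begin pre-processing, we inspect the data. This gives us a rough idea of the dataset's structure and of potential issues. Here, the non-null counts show that the title, author, and text columns all have missing entries. Only text is used as a feature in this walkthrough, so that is the column we need to clean up later.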
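The number of missing values per column follows directly from the df.info() output above: subtract each non-null count from the 20,800 rows. The short sketch below does that arithmetic with the numbers copied from the output; the variable names are illustrative.

```python
# Non-null counts copied from the df.info() output
total_rows = 20800
non_null = {"id": 20800, "title": 20242, "author": 18843, "text": 20761, "label": 20800}

# Missing entries per column = total rows minus non-null entries
missing = {column: total_rows - count for column, count in non_null.items()}
for column, n_missing in missing.items():
    print(f"{column:7s} {n_missing:5d} missing ({n_missing / total_rows:.2%})")
```

On the loaded DataFrame, `df.isnull().sum()` reports the same counts directly.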
Step 4: Prepare the Labels
# Get the labels
labels = df.label
Step 5: Split the Data
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(df['text'], labels, test_size=0.2, random_state=7)
We hold out 20% of the rows as a test set and train on the remaining 80%. Evaluating on data the model has never seen gives a fair estimate of its performance, and random_state=7 makes the split reproducible.
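As a rough picture of what the split does, the sketch below shuffles the row indices with a fixed seed and holds out 20% of them. It is a simplified stand-in, not train_test_split itself: the helper name `simple_split` is hypothetical, and the shuffle will not reproduce scikit-learn's row assignment for random_state=7.

```python
import math
import random


def simple_split(n_rows, test_size=0.2, seed=7):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed, same order on every run
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = simple_split(20800)
print(len(train_idx), len(test_idx))
```

With 20,800 rows this holds out 4,160 rows for testing, which matches the total of the four entries in the confusion matrix printed in Step 9.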
Step 6: Handle Missing Values
# Fill NaN values with empty string
x_train = x_train.fillna('')
x_test = x_test.fillna('')
The text column has a few missing entries, and CountVectorizer cannot process NaN values. Since our feature is text, we can fill the missing values with an empty string.
Step 7: Initialize and Apply Count Vectorizer
# Initialize a CountVectorizer
count_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the training data
count_train = count_vectorizer.fit_transform(x_train.values)
# Transform the test data
count_test = count_vectorizer.transform(x_test.values)
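We initialize our CountVectorizer, dropping common English stop words, and fit it on the training texts only. The test texts are then transformed with the same vocabulary, so nothing from the test set leaks into the features. The result is a matrix of token counts that the model can work with.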
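One consequence of learning the vocabulary from the training texts alone is that words appearing only in the test set have no column and are dropped at transform time. The sketch below illustrates this with a hand-rolled vocabulary; the helper names and sentences are invented for the example, and real CountVectorizer output is a sparse matrix rather than nested lists.

```python
import re


def fit_vocabulary(train_docs):
    """Map each token seen in the training documents to a column index."""
    tokens = {tok for doc in train_docs for tok in re.findall(r"\b\w\w+\b", doc.lower())}
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}


def transform(docs, vocabulary):
    """Count known tokens per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in re.findall(r"\b\w\w+\b", doc.lower()):
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows


vocabulary = fit_vocabulary(["senate passes budget", "budget vote delayed"])
test_rows = transform(["senate delays budget vote again"], vocabulary)
print(vocabulary)
print(test_rows)
```

Here "delays" and "again" never occurred in the training sentences, so they contribute nothing to the test row.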
Step 8: Train the Model
# Initialize a PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(count_train, y_train)
PassiveAggressiveClassifier(max_iter=50)
Here we initialize our PassiveAggressiveClassifier, allowing at most 50 passes over the training data (max_iter=50), and fit it to the training counts.
Step 9: Make Predictions and Evaluate the Model
# Predict on the test set and calculate accuracy
y_pred = pac.predict(count_test)
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score*100,2)}%')
# Confusion matrix
confusion_mat = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', confusion_mat)
Accuracy: 94.18%
Confusion Matrix:
[[1930 130]
[ 112 1988]]
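We make predictions on the test set and evaluate the model with accuracy, the share of test articles classified correctly. The confusion matrix breaks that result down by class: the diagonal entries (1,930 and 1,988) are correct predictions, and the off-diagonal entries (130 and 112) are errors.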
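Accuracy alone hides how the errors are distributed between the two classes. As a sketch, precision and recall can be derived from the confusion matrix printed above, using scikit-learn's convention that rows are true labels and columns are predicted labels. Treating label 1 as the positive class is a choice made for this example.

```python
# Confusion matrix copied from the output above:
# rows = true label, columns = predicted label
cm = [[1930, 130],
      [112, 1988]]

tn, fp = cm[0]  # true label 0: predicted 0, predicted 1
fn, tp = cm[1]  # true label 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the items predicted as 1, how many are truly 1
recall = tp / (tp + fn)     # of the items that are truly 1, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

On the real predictions, `sklearn.metrics.classification_report(y_test, y_pred)` reports these figures for both classes at once.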
Step 10: Visualize Results
# Plot the confusion matrix computed in Step 9 as a heat map
plt.figure(figsize=(7,7))
sns.heatmap(confusion_mat, annot=True, fmt="d")
plt.title('Confusion matrix of the classifier')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
We can also plot the most frequent words in the training texts.
# Reuse the CountVectorizer fitted in Step 7 to get the word frequencies.
# Summing the columns of the sparse matrix avoids .toarray(), which would
# build a dense documents-by-vocabulary array that may not fit in memory.
word_counts = np.asarray(count_train.sum(axis=0)).ravel()
word_freq = pd.Series(word_counts, index=count_vectorizer.get_feature_names_out())
# Get the 20 most common words
top_words = word_freq.sort_values(ascending=False).head(20)
plt.figure(figsize=(10, 8))
sns.barplot(x=top_words.values, y=top_words.index)
# x_train contains both classes, so the plot covers all training texts
plt.title('Top 20 words in the training texts')
plt.xlabel('Frequency')
plt.ylabel('Word')
plt.show()