Fraudulent News Detection
Definition
The term “Fraudulent News” was largely unheard of a couple of decades ago, but in the digital era of social media it has surfaced as a serious threat.
Problem Statement
•The authenticity of information has become a longstanding issue affecting businesses and society, for both printed and digital media.
•On social networks, information spreads so quickly and is so amplified that distorted, inaccurate, or false information acquires tremendous potential to cause real-world impact, within minutes, for millions of users.
Objective
The main objective of this project is to detect whether a news article is fake or not using a classifier.
Data Description
There are 6 columns in the dataset:
•“id”: unique id of each news article
•“headline”: the title of the news article
•“news”: the full text of the news article
•“Unnamed:0”: a serial number
•“written_by”: the author of the news article
•“label”: whether the news is fake (1) or not fake (0)
EDA Steps
1. Understanding the data
2. Checking for missing values
3. Detecting outliers
4. Data cleaning
Importing the Libraries and dataset
Importing the required Libraries
Importing the dataset for Data Analysis
Reading the Data
Knowing the Information of the Dataset
Checking the shape of the dataset
Observation:-
df.shape describes the total number of rows and columns in the dataset (note: shape is an attribute, not a method, so it is written without parentheses)
There are 20800 rows and 6 columns in the dataset
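The loading, shape, and missing-value checks above can be sketched as follows. The toy DataFrame below is an invented stand-in (the real dataset, with 20800 rows, would be loaded via pd.read_csv):

```python
import pandas as pd

# Toy stand-in for the real dataset; in the project the data is loaded with
# pd.read_csv (the file path is not shown in the report).
df = pd.DataFrame({
    "id": [0, 1, 2],
    "headline": ["Headline A", "Headline B", "Headline C"],
    "news": ["full text one", "full text two", "full text three"],
    "written_by": ["author x", "author y", "author z"],
    "label": [1, 0, 1],
})

# shape is an attribute, not a method: (number of rows, number of columns)
print(df.shape)           # (3, 5)
# per-column count of missing values
print(df.isnull().sum())
```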
Checking the missing values
•df.isnull().sum() shows the count of missing values in each column of the dataset.
•There were no missing values present in the dataset.
Checking for numerical columns and their distribution.
Observation
•There are more than 2 unique values in all the columns.
•Hence the columns contain continuous data.
•The data is not ordinal categorical data.
•Hence outliers may be present in the dataset.
Data cleaning and importing NLP libraries
The above are the libraries required for text cleaning.
The code above extracts only the text from the dataset by removing all special characters and numbers: it creates an empty list and appends each cleaned entry to it.
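A minimal sketch of such a cleaning loop; the regex and the example strings are illustrative, not the project's exact code:

```python
import re

# Keep only letters, lowercase everything, and collapse whitespace,
# appending each cleaned string to an initially empty list.
corpus = []
for text in ["Breaking: 100% FAKE!!!", "Quarterly profits rose 5%."]:
    cleaned = re.sub(r"[^a-zA-Z]", " ", text)  # drop digits and special characters
    corpus.append(" ".join(cleaned.lower().split()))

print(corpus)  # ['breaking fake', 'quarterly profits rose']
```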
Lemmatization, like stemming, reduces a word to its base form, but unlike stemming it takes the actual meaning of the word into account, so the base form is a valid word of the language. This base word is known as the ‘lemma’.
The column that results after cleaning.
Dataset View
Dataset with new column
Distplot of Original Length
Distplot of Cleaned Length
Word cloud of label=1
Word cloud of label=0
Prediction using Multinomial NB
The accuracy score of Multinomial NB was found to be 84 percent
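A minimal sketch of such a classifier with scikit-learn. The four-document corpus and its labels are invented stand-ins; the 84 percent figure comes from the real dataset and is not reproduced here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the cleaned news text (1 = fake, 0 = real).
texts = [
    "shocking miracle cure exposed",
    "senate passes budget bill",
    "aliens secretly run government",
    "quarterly earnings report released",
]
labels = [1, 0, 1, 0]

# Vectorize the text with TF-IDF, then fit a Multinomial Naive Bayes model.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["miracle aliens exposed"]))  # words seen only in fake docs
```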
Confusion Matrix
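A confusion matrix can be computed with scikit-learn; the labels below are illustrative, not the project's actual predictions:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0]  # actual labels
y_pred = [1, 0, 0, 1, 0]  # classifier output

# Rows = actual class (0 then 1), columns = predicted class (0 then 1).
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[2 0]
           #  [1 2]]
```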
Prediction Using Logistic Regression
Confusion Matrix
Prediction using Decision Tree Classifier
Prediction using Ada Boost Classifier
ROC Curve for Naive Bayes Classifier
ROC Curve for Random Forest Classifier
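An ROC curve plots the true-positive rate against the false-positive rate at every score threshold; a sketch with toy scores (taken as an illustration, not the project's outputs):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Toy ground truth and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, scores)  # points on the ROC curve
auc = roc_auc_score(y_true, scores)               # area under that curve
print(auc)  # 0.75
```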
Importing TensorFlow
Adding the Required Libraries for dataset
One Hot Representation
Importing NLTK
Data Preprocessing using Porter Stemmer
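A sketch of stemmer-based preprocessing: clean, lowercase, drop stopwords, stem. The tiny stopword set here is a stand-in (the project likely uses nltk.corpus.stopwords):

```python
import re
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Tiny stand-in stopword list, for illustration only.
stop_words = {"the", "is", "a", "of", "and"}

def preprocess(text):
    # keep letters only, lowercase, drop stopwords, then stem each word
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return " ".join(ps.stem(w) for w in words if w not in stop_words)

result = preprocess("The spreading of fake news!")
print(result)  # e.g. "spreading" is stemmed to "spread"; "the"/"of" are dropped
```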
Embedding Representation
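Conceptually, an embedding layer is a lookup table mapping word indices to dense vectors; a NumPy sketch (sizes are illustrative, not the project's settings):

```python
import numpy as np

# Embedding = lookup table of shape (vocab_size, embedding_dim).
vocab_size, dim = 5000, 40
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, dim))

sentence = [12, 7, 431]                  # word indices after one-hot encoding
vectors = embedding_matrix[sentence]     # one dense vector per word
print(vectors.shape)  # (3, 40)
```

In Keras this lookup is the trainable Embedding layer, learned jointly with the rest of the network rather than fixed at random.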
Train-Test Split
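The split is typically done with scikit-learn's train_test_split; the data and 80/20 ratio below are illustrative:

```python
from sklearn.model_selection import train_test_split

X = list(range(10))   # stand-in features
y = [0, 1] * 5        # stand-in labels

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))  # 8 2
```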
Adding Dropout
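Dropout randomly zeroes a fraction of activations during training and rescales the rest, which discourages co-adaptation and overfitting. A NumPy sketch of inverted dropout (the mechanism Keras' Dropout layer uses; the rate here is illustrative):

```python
import numpy as np

# Inverted dropout: zero a fraction `rate` of activations and scale the
# survivors by 1/(1 - rate) so the expected activation is unchanged.
def dropout(x, rate, rng):
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(42)
x = np.ones((4, 5))
y = dropout(x, 0.3, rng)
print(y)  # each entry is either 0 or 1/0.7
```

At inference time dropout is disabled and activations pass through unchanged, which Keras handles automatically.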