Fraudulent News Detection

Harshitha indumathi
5 min read · Aug 7, 2021

Definition

The term “Fraudulent News” was unheard of and not prevalent a couple of decades ago, but in this digital era of social media it has surfaced as a huge monster.

Is the Earth flat? Checking whether it is fake or not

Problem Statement

• The authenticity of information has become a longstanding issue affecting businesses and society, for both printed and digital media.

• On social networks, information spreads so fast and is so amplified that distorted, inaccurate, or false information acquires a tremendous potential to cause real-world impacts, within minutes, for millions of users.

Objective

The main objective of this project is to detect whether a news article is fake or not by using classifiers.

Data Description

There are 6 columns in the dataset

• “id”: Unique id of each news article.

• “headline”: The title of the news article.

• “news”: The full text of the news article.

• “Unnamed:0”: A serial number.

• “written_by”: The author of the news article.

• “label”: Whether the news is fake (1) or not fake (0).

EDA Steps

1. Understanding the data

2. Checking for missing values

3. Detecting outliers

4. Data cleaning

Importing the Libraries and dataset

Importing the required libraries

Importing the dataset for Data Analysis

Reading the Data

Knowing the Information of the Dataset

Checking the shape of the dataset

Observation:-

df.shape describes the total number of rows and columns available in the dataset (note that shape is an attribute, not a method).

There are 20800 rows and 6 columns in the dataset

Checking the missing values

• df.isnull().sum() shows the presence of missing values in the dataset.

• There were no missing values present in the dataset.
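Since the original code appears only as screenshots, here is a minimal sketch of the loading and inspection steps. The tiny DataFrame below is an invented stand-in for the real 20,800-row dataset, which would normally be loaded with pd.read_csv (the file name in the comment is an assumption).

```python
import pandas as pd

# Stand-in for the real dataset; the original notebook presumably does
# something like: df = pd.read_csv("train.csv")  (file name is an assumption)
df = pd.DataFrame({
    "id": [1, 2],
    "headline": ["Earth is flat", "Local team wins"],
    "news": ["Scientists claim the Earth is flat.", "The local team won the match."],
    "Unnamed: 0": [0, 1],
    "written_by": ["A. Author", "B. Writer"],
    "label": [1, 0],                  # 1 = fake, 0 = not fake
})

df.info()                  # column names, dtypes, non-null counts
print(df.shape)            # (rows, columns); (20800, 6) on the full data
print(df.isnull().sum())   # count of missing values per column
```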

Checking for numerical columns and their distribution.

Observation

• There are more than two unique values in all the columns.

• Hence the columns contain continuous data.

• The data is not an ordinal type of categorical data.

• Hence outliers can be found in the dataset.

Data cleaning and importing NLP libraries

The above are the necessary libraries required for text cleaning.

The above is the code to extract only text from the dataset by eliminating all special characters and numbers, creating an empty list, and appending the cleaned data to it.

Lemmatization does the same thing as stemming and tries to bring a word to its base form, but unlike stemming it takes into account the actual meaning of the base word, i.e. the base word belongs to an actual language. The base word is known as the ‘lemma’.

The column that results after cleaning.

Dataset View

Dataset with new column

Distplot of Original Length

Distplot of Cleaned Length

Word cloud of label=1

Word cloud of label=0

Prediction using Multinomial NB

The accuracy score of Multinomial NB was found to be 84 percent.

Confusion Matrix
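A hedged sketch of the multinomial naive Bayes step. Scikit-learn's MultinomialNB over a TF-IDF representation is assumed, and the toy corpus is invented, so its accuracy will not match the 84 percent reported above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy cleaned corpus standing in for the real one (1 = fake, 0 = not fake).
texts = ["earth flat hoax", "alien built pyramid", "team win match",
         "earth flat proof", "government hide alien", "local market open",
         "flat earth society", "city open new park"]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

vec = TfidfVectorizer()
nb = MultinomialNB()
nb.fit(vec.fit_transform(X_train), y_train)
pred = nb.predict(vec.transform(X_test))

print("accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))   # rows: true class, columns: predicted
```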

Prediction Using Logistic Regression

Confusion Matrix
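A minimal sketch of the logistic regression prediction step over a toy TF-IDF pipeline; the corpus, vectorizer choice, and max_iter setting are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Toy cleaned corpus standing in for the real one (1 = fake, 0 = not fake).
texts = ["earth flat hoax", "alien built pyramid", "team win match",
         "earth flat proof", "government hide alien", "local market open",
         "flat earth society", "city open new park"]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)
pred = clf.predict(vec.transform(X_test))

print("accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```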

Prediction using Decision Tree Classifier

Prediction using Ada Boost Classifier
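A minimal sketch covering both tree-based prediction steps over a toy TF-IDF pipeline; the corpus and the default hyperparameters are illustrative assumptions:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy cleaned corpus standing in for the real one (1 = fake, 0 = not fake).
texts = ["earth flat hoax", "alien built pyramid", "team win match",
         "earth flat proof", "government hide alien", "local market open",
         "flat earth society", "city open new park"]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

vec = TfidfVectorizer()
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

for clf in (DecisionTreeClassifier(random_state=42),
            AdaBoostClassifier(random_state=42)):
    clf.fit(Xtr, y_train)
    print(type(clf).__name__, "accuracy:", accuracy_score(y_test, clf.predict(Xte)))
```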

ROC Curve for Naive Bayes Classifier

ROC Curve for Random Forest Classifier
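An ROC curve plots the true positive rate against the false positive rate across classification thresholds. A minimal sketch with invented labels and scores standing in for a fitted model's predicted probabilities:

```python
import matplotlib
matplotlib.use("Agg")                      # render off-screen
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Invented labels and scores standing in for a model's probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.3]

fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.savefig("roc_curve.png")
```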

Importing TensorFlow

Adding the Required Libraries for dataset

One Hot Representation
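“One hot representation” here refers to Keras-style index encoding, where each word is hashed to an integer index below a chosen vocabulary size (classically done with tensorflow.keras.preprocessing.text.one_hot). A dependency-free sketch of the same hashing trick; voc_size, the function name, and the sentences are assumptions:

```python
voc_size = 5000                          # assumed vocabulary size

def hash_one_hot(sentence, n):
    # Map each word to an index in [1, n), like Keras' one_hot hashing trick.
    return [abs(hash(w.lower())) % (n - 1) + 1 for w in sentence.split()]

sentences = ["earth flat hoax", "team win match"]
onehot_repr = [hash_one_hot(s, voc_size) for s in sentences]
print(onehot_repr)                       # one integer per word, per sentence
```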

Importing NLTK

Data Pre Processing using Porter Stemmer
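A hedged sketch of the Porter-stemmer preprocessing; the tiny inline stopword set (NLTK's full stopwords.words('english') would normally be used) and the sample sentences are assumptions:

```python
import re

from nltk.stem import PorterStemmer

ps = PorterStemmer()
stop_words = {"the", "is", "are", "in"}   # tiny stand-in stopword set

corpus = []
for doc in ["The Earth is flat", "Dogs are running in parks"]:
    # Keep letters only, lowercase, drop stopwords, stem each word.
    words = re.sub("[^a-zA-Z]", " ", doc).lower().split()
    corpus.append(" ".join(ps.stem(w) for w in words if w not in stop_words))
print(corpus)   # → ['earth flat', 'dog run park']
```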

Embedding Representation

Train Test Split

Adding Dropout
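The embedding representation, train/test split, and dropout steps can be sketched as one small Keras model. The layer sizes (embedding dimension 40, LSTM 100, dropout 0.3), the sequence length, and the toy data are assumptions; an LSTM classifier is the common pattern in this kind of tutorial but may differ from the author's exact architecture:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences

voc_size, sent_length, embedding_dim = 5000, 20, 40   # assumed sizes

sentences = ["earth flat hoax", "team win match",
             "flat earth proof", "city open park"]
labels = np.array([1, 0, 1, 0])          # 1 = fake, 0 = not fake

# Hash words to indices (standing in for the one-hot representation step),
# then pre-pad every sequence to the same length.
seqs = [[abs(hash(w)) % (voc_size - 1) + 1 for w in s.split()] for s in sentences]
embedded_docs = pad_sequences(seqs, padding="pre", maxlen=sent_length)

X_train, X_test, y_train, y_test = train_test_split(
    embedded_docs, labels, test_size=0.25, random_state=42)

model = Sequential([
    Embedding(voc_size, embedding_dim),   # learned dense word vectors
    LSTM(100),
    Dropout(0.3),                         # dropout to reduce overfitting
    Dense(1, activation="sigmoid"),       # fake (1) vs. not fake (0)
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

pred = model.predict(X_test, verbose=0)   # untrained forward pass, shape (n, 1)
print(pred.shape)
```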


Harshitha indumathi

Harshitha completed her graduation in B.E. This is her first article on Medium.