Fraudulent News Detection
Definition
The term “Fraudulent News” was largely unheard of a couple of decades ago, but in the digital era of social media it has surfaced as a serious threat.
Problem Statement
•The authenticity of information has become a longstanding issue affecting businesses and society, for both printed and digital media.
•On social networks, information spreads so quickly and is so amplified that distorted, inaccurate, or false information acquires tremendous potential to cause real-world impact, within minutes, for millions of users.
Objective
The main objective of this project is to detect whether a news article is fake or not using a classifier.
Data Description
There are 6 columns in the dataset:
•“id”: unique id of each news article
•“headline”: the title of the news article
•“news”: the full text of the news article
•“Unnamed:0”: a serial number
•“written_by”: the author of the news article
•“label”: whether the news is fake (1) or not fake (0)
EDA Steps
1. Understanding the data
2. Checking for missing values
3. Detecting outliers
4. Data cleaning
Importing the Libraries and dataset
Importing the required Libraries
Importing the dataset for Data Analysis
Reading the Data
Knowing the Information of the Dataset
Checking the shape of the dataset
Observation:-
df.shape describes the total number of rows and columns in the dataset (note: shape is an attribute, not a method, so it is written without parentheses)
There are 20800 rows and 6 columns in the dataset
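The loading, shape, and missing-value checks above can be sketched as follows. The toy DataFrame below is an invented stand-in (the real dataset, with 20800 rows, would be loaded via pd.read_csv):

```python
import pandas as pd

# Toy stand-in for the real dataset; in the project the data is loaded with
# pd.read_csv (the file path is not shown in the report).
df = pd.DataFrame({
    "id": [0, 1, 2],
    "headline": ["Headline A", "Headline B", "Headline C"],
    "news": ["full text one", "full text two", "full text three"],
    "written_by": ["author x", "author y", "author z"],
    "label": [1, 0, 1],
})

# shape is an attribute, not a method: (number of rows, number of columns)
print(df.shape)           # (3, 5)
# per-column count of missing values
print(df.isnull().sum())
```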
Checking the missing values
•df.isnull().sum() shows the count of missing values in each column of the dataset.
•There were no missing values present in the dataset.
Checking for numerical columns and their distribution.
Observation
•There are more than 2 unique values in all the columns.
•Hence the columns contain continuous data.
•The data is not ordinal categorical data.
•Hence outliers may be present in the dataset.
Data cleaning and importing NLP libraries
The above are the libraries required for text cleaning.
The code above extracts only the text from the dataset by removing all special characters and numbers: it creates an empty list and appends each cleaned entry to it.
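A minimal sketch of such a cleaning loop; the regex and the example strings are illustrative, not the project's exact code:

```python
import re

# Keep only letters, lowercase everything, and collapse whitespace,
# appending each cleaned string to an initially empty list.
corpus = []
for text in ["Breaking: 100% FAKE!!!", "Quarterly profits rose 5%."]:
    cleaned = re.sub(r"[^a-zA-Z]", " ", text)  # drop digits and special characters
    corpus.append(" ".join(cleaned.lower().split()))

print(corpus)  # ['breaking fake', 'quarterly profits rose']
```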
Lemmatization, like stemming, reduces a word to its base form, but unlike stemming it takes the actual meaning of the word into account, so the base form is a valid word of the language. This base word is known as the ‘lemma’.
The column that results after cleaning.
Dataset View
Dataset with new column
Distplot of Original Length
Distplot of Cleaned Length
Word cloud of label=1
Word cloud of label=0
Prediction using Multinomial NB
The accuracy score of Multinomial NB was found to be 84 percent
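A minimal sketch of such a classifier with scikit-learn. The four-document corpus and its labels are invented stand-ins; the 84 percent figure comes from the real dataset and is not reproduced here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the cleaned news text (1 = fake, 0 = real).
texts = [
    "shocking miracle cure exposed",
    "senate passes budget bill",
    "aliens secretly run government",
    "quarterly earnings report released",
]
labels = [1, 0, 1, 0]

# Vectorize the text with TF-IDF, then fit a Multinomial Naive Bayes model.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["miracle aliens exposed"]))  # words seen only in fake docs
```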
Confusion Matrix
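A confusion matrix can be computed with scikit-learn; the labels below are illustrative, not the project's actual predictions:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0]  # actual labels
y_pred = [1, 0, 0, 1, 0]  # classifier output

# Rows = actual class (0 then 1), columns = predicted class (0 then 1).
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[2 0]
           #  [1 2]]
```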
Prediction Using Logistic Regression
Confusion Matrix
Prediction using Decision Tree Classifier
Prediction using Ada Boost Classifier
ROC Curve for Naive Bayes Classifier
ROC Curve for Random Forest Classifier
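An ROC curve plots the true-positive rate against the false-positive rate at every score threshold; a sketch with toy scores (taken as an illustration, not the project's outputs):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Toy ground truth and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, scores)  # points on the ROC curve
auc = roc_auc_score(y_true, scores)               # area under that curve
print(auc)  # 0.75
```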
Importing TensorFlow
Adding the Required Libraries for dataset
One Hot Representation
Importing NLTK
Data Preprocessing using Porter Stemmer
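A sketch of stemmer-based preprocessing: clean, lowercase, drop stopwords, stem. The tiny stopword set here is a stand-in (the project likely uses nltk.corpus.stopwords):

```python
import re
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Tiny stand-in stopword list, for illustration only.
stop_words = {"the", "is", "a", "of", "and"}

def preprocess(text):
    # keep letters only, lowercase, drop stopwords, then stem each word
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return " ".join(ps.stem(w) for w in words if w not in stop_words)

result = preprocess("The spreading of fake news!")
print(result)  # e.g. "spreading" is stemmed to "spread"; "the"/"of" are dropped
```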
Embedding Representation
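Conceptually, an embedding layer is a lookup table mapping word indices to dense vectors; a NumPy sketch (sizes are illustrative, not the project's settings):

```python
import numpy as np

# Embedding = lookup table of shape (vocab_size, embedding_dim).
vocab_size, dim = 5000, 40
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, dim))

sentence = [12, 7, 431]                  # word indices after one-hot encoding
vectors = embedding_matrix[sentence]     # one dense vector per word
print(vectors.shape)  # (3, 40)
```

In Keras this lookup is the trainable Embedding layer, learned jointly with the rest of the network rather than fixed at random.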
Train-Test Split
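The split is typically done with scikit-learn's train_test_split; the data and 80/20 ratio below are illustrative:

```python
from sklearn.model_selection import train_test_split

X = list(range(10))   # stand-in features
y = [0, 1] * 5        # stand-in labels

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))  # 8 2
```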
Adding Dropout
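Dropout randomly zeroes a fraction of activations during training and rescales the rest, which discourages co-adaptation and overfitting. A NumPy sketch of inverted dropout (the mechanism Keras' Dropout layer uses; the rate here is illustrative):

```python
import numpy as np

# Inverted dropout: zero a fraction `rate` of activations and scale the
# survivors by 1/(1 - rate) so the expected activation is unchanged.
def dropout(x, rate, rng):
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(42)
x = np.ones((4, 5))
y = dropout(x, 0.3, rng)
print(y)  # each entry is either 0 or 1/0.7
```

At inference time dropout is disabled and activations pass through unchanged, which Keras handles automatically.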