Welcome to the News Sorting project using Natural Language Processing (NLP) techniques applied to the BBC News Dataset. This project aims to classify news articles into predefined categories such as business, entertainment, politics, sport, and tech. By leveraging NLP, we'll extract features from the text data to build a machine learning model capable of accurately categorizing news articles.
Clone the repository:
git clone https://github.com/Rahul-404/bbc-news-sorting.git
cd bbc-news-sorting
Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
To run the project, you'll need Python 3.x and the following libraries:
- numpy
- pandas
- scikit-learn
- matplotlib
- seaborn
- nltk
- wordcloud
- tensorflow
- mlflow
Install the required dependencies:
pip install -r requirements.txt
Project workflow:
- Update config.yaml
- Update secrets.yaml [Optional]
- Update params.yaml
- Update the entity -> config_entity.py
- Update the configuration manager in src/config (see the configuration sketch after this list)
- Update the components
- Update the pipeline
- Update the main.py
- Update the dvc.yaml
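For reference, here is a minimal, hypothetical sketch of the entity/configuration-manager pattern these steps follow. The class name, YAML keys, and fields (DataCleaningConfig, data_cleaning, the path entries) are illustrative assumptions rather than the repository's actual definitions, and the snippet assumes PyYAML is installed.

```python
from dataclasses import dataclass
from pathlib import Path

import yaml

# Hypothetical entity: a frozen dataclass holding values read from config.yaml.
@dataclass(frozen=True)
class DataCleaningConfig:
    root_dir: Path
    raw_data_path: Path
    cleaned_data_path: Path

class ConfigurationManager:
    """Reads config.yaml and exposes typed config objects to the components."""

    def __init__(self, config_path: Path = Path("config/config.yaml")):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)

    def get_data_cleaning_config(self) -> DataCleaningConfig:
        cfg = self.config["data_cleaning"]  # assumed key in config.yaml
        return DataCleaningConfig(
            root_dir=Path(cfg["root_dir"]),
            raw_data_path=Path(cfg["raw_data_path"]),
            cleaned_data_path=Path(cfg["cleaned_data_path"]),
        )
```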
The BBC News Dataset consists of news articles published by the BBC, categorized into five predefined classes: business, entertainment, politics, sport, and tech. Each article contains textual content along with its corresponding category label. The dataset is available on Kaggle.
The data is split by category, checking the distribution of tokens per example across categories to avoid bias from token length and word distribution (a minimal stratified-split sketch follows the counts below).
The training data contains around 10K words, of which only ~3.2K are out-of-vocabulary; this may introduce some test error, but it helps us obtain a well-generalized model.
- Train: ~10K
  - ~5.5K English + ~4.5K non-English
- Validation: ~3.2K
  - ~1.1K English + ~2.1K non-English
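The split itself can be reproduced with a short scikit-learn snippet. This is a minimal sketch, assuming the dataset is a CSV with `text` and `category` columns; the file path and column names are assumptions, not the project's actual configuration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed schema: one row per article with 'text' and 'category' columns.
df = pd.read_csv("data/bbc-news-data.csv")

# Stratify on the category label so each split keeps the same class proportions.
train_df, val_df = train_test_split(
    df, test_size=0.2, stratify=df["category"], random_state=42
)

# Sanity check: class proportions should match between the two splits.
print(train_df["category"].value_counts(normalize=True))
print(val_df["category"].value_counts(normalize=True))
```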
Data Preprocessing (a cleaning and vectorization sketch follows this list):
- Text cleaning
  - Normalizing
  - Remove currencies
  - Remove distances
  - Remove countries
  - Remove numbers
  - Remove special characters
  - Remove punctuation
  - Remove multiple spaces
  - Remove stopwords
  - Lemmatization
- Tokenization
- Vectorization
  - One-Hot Encoding
  - TF-IDF Encoding
  - Word2Vec Embeddings
  - GloVe Embeddings
  - FastText Embeddings
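As a concrete illustration of the cleaning and TF-IDF steps, here is a minimal sketch using nltk and scikit-learn. The regular expressions and the `max_features` value are illustrative assumptions and intentionally simpler than the full pipeline described above.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    text = text.lower()                       # normalize case
    text = re.sub(r"[^a-z\s]", " ", text)     # drop numbers, punctuation, special characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse multiple spaces
    tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in stop_words]
    return " ".join(tokens)

docs = [
    "Shares in the tech sector rose 5% on Monday.",
    "The film won three awards at the festival.",
]
cleaned = [clean_text(d) for d in docs]

# TF-IDF vectorization of the cleaned documents.
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(cleaned)
print(X.shape)
```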
Machine Learning Models (a training sketch follows this list):
- Logistic Regression
- Support Vector Machine (SVM)
- Naive Bayes
- Random Forest
- Gradient Boost
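A minimal sketch of how these baselines might be trained and compared with scikit-learn, assuming `X_train`, `X_val`, `y_train`, and `y_val` come from the preprocessing and split steps above (Naive Bayes additionally assumes non-negative features such as TF-IDF):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "Naive Bayes": MultinomialNB(),  # requires non-negative features (e.g. TF-IDF)
    "Random Forest": RandomForestClassifier(),
    "Gradient Boost": GradientBoostingClassifier(),
}

# X_train/X_val: feature matrices; y_train/y_val: encoded category labels.
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    print(f"{name}: weighted F1 = {f1_score(y_val, preds, average='weighted'):.2f}")
```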
Deep Learning Models (a Keras sketch follows this list):
- Multi-Layer Perceptron (MLP)
- LSTM with 2 Dense layers
- LSTM
- Bidirectional LSTM
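For the deep learning side, here is a minimal sketch of a bidirectional-LSTM classifier in Keras. The vocabulary size, sequence length, embedding dimension, and other hyperparameters are illustrative assumptions, not the tuned values used in the experiments:

```python
import tensorflow as tf

VOCAB_SIZE = 10000   # assumed vocabulary size
MAX_LEN = 300        # assumed padded sequence length
NUM_CLASSES = 5      # business, entertainment, politics, sport, tech

# Integer token sequences in, softmax over the five categories out.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 100),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```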
Model Evaluation:
- Precision, Recall, F1-Score
- Confusion matrix and ROC curve for performance analysis.
The models were evaluated using precision, recall, and F1-score, along with confusion matrices and ROC curves. Below are the results:
| Baseline Model (F1 Score) | Word2Vec | GloVe | FastText |
|---------------------------|----------|-------|----------|
| Logistic Regression       | 0.96     | 0.96  | 0.96     |
| Naive Bayes               | 0.84     | 0.92  | 0.80     |
| SVC                       | 0.96     | 0.96  | 0.96     |
| Random Forest             | 0.96     | 0.95  | 0.95     |
| Gradient Boost            | 0.97     | 0.96  | 0.93     |
| Model                        | Embedding          | Validation F1 Score |
|------------------------------|--------------------|---------------------|
| Baseline Logistic Regression | GloVe              | 0.97                |
| MLP                          | No Embeddings      | 0.92                |
| MLP                          | GloVe              | 0.14                |
| MLP                          | GloVe (Trained)    | 0.95                |
| MLP                          | FastText           | 0.92                |
| MLP                          | FastText (Trained) | 0.96                |
| LSTM with 2 Dense layers     | FastText           | 0.18                |
| LSTM                         | FastText           | 0.13                |
| Bidirectional LSTM           | GloVe              | 0.95                |
Sample classification report (per-class precision, recall, and F1-score):
precision recall f1-score support
0 0.98 0.96 0.97 101
1 0.97 0.97 0.97 78
2 0.96 0.96 0.96 82
3 1.00 1.00 1.00 104
4 0.98 1.00 0.99 82
accuracy 0.98 447
macro avg 0.98 0.98 0.98 447
weighted avg 0.98 0.98 0.98 447
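A report like the one above, together with a confusion matrix plot, can be produced with a few lines of scikit-learn. This is a minimal sketch, assuming `y_val` and `preds` come from the training step earlier:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, confusion_matrix

labels = ["business", "entertainment", "politics", "sport", "tech"]

# Per-class precision, recall, and F1-score.
print(classification_report(y_val, preds))

# Confusion matrix plot for error analysis.
cm = confusion_matrix(y_val, preds)
ConfusionMatrixDisplay(cm, display_labels=labels).plot()
plt.show()
```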
To make predictions on new news articles, you can use the following function:
import numpy as np

from src.news_sorting_project.components.predictor import PredictionMaker
from src.news_sorting_project.config.configuration import ConfigurationManager

# Category labels with display emojis (restored from garbled encoding; the sport emoji is a best guess).
categories = {
    'Business': '💼',
    'Entertainment': '🎬',
    'Politics': '🗳️',
    'Sports': '🏅',
    'Technology': '💻',
}
config = ConfigurationManager()
model_predict_config = config.get_model_predict_config()
model_clean_config = config.get_data_cleaning_config()
model_transform_config = config.get_data_transform_config()
make_prediction = PredictionMaker(
    model_predict_config,
    model_clean_config,
    model_transform_config,
)
article_text = "Your news article text here."
probabilities = make_prediction.predict(article_text)
classes = list(categories.keys())
probabilities = [prob / sum(probabilities) for prob in probabilities] # Normalize
predicted_class = classes[np.argmax(probabilities)]
print(predicted_class)
You can also run the training script to retrain the models:
python train.py
This project is licensed under the MIT License - see the LICENSE file for details.


