Welcome to the News Sorting project using Natural Language Processing (NLP) techniques applied to the BBC News Dataset. This project aims to classify news articles into predefined categories such as business, entertainment, politics, sport, and tech. By leveraging NLP, we'll extract features from the text data to build a machine learning model capable of accurately categorizing news articles.
Clone the repository:
git clone https://github.com/Rahul-404/bbc-news-sorting.git
cd bbc-news-sorting
Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
To run the project, you'll need Python 3.x and the following libraries:
- numpy
- pandas
- scikit-learn
- matplotlib
- seaborn
- nltk
- wordcloud
- tensorflow
- mlflow
Install the required dependencies:
pip install -r requirements.txt
Project workflow:
- Update config.yaml
- Update secrets.yaml [Optional]
- Update params.yaml
- Update the entity -> config_entity.py
- Update the configuration manager in src/config (see the configuration sketch after this list)
- Update the components
- Update the pipeline
- Update the main.py
- Update the dvc.yaml
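For reference, here is a minimal, hypothetical sketch of the entity/configuration-manager pattern these steps follow. The class name, YAML keys, and fields (DataCleaningConfig, data_cleaning, the path entries) are illustrative assumptions rather than the repository's actual definitions, and the snippet assumes PyYAML is installed.

```python
from dataclasses import dataclass
from pathlib import Path

import yaml

# Hypothetical entity: a frozen dataclass holding values read from config.yaml.
@dataclass(frozen=True)
class DataCleaningConfig:
    root_dir: Path
    raw_data_path: Path
    cleaned_data_path: Path

class ConfigurationManager:
    """Reads config.yaml and exposes typed config objects to the components."""

    def __init__(self, config_path: Path = Path("config/config.yaml")):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)

    def get_data_cleaning_config(self) -> DataCleaningConfig:
        cfg = self.config["data_cleaning"]  # assumed key in config.yaml
        return DataCleaningConfig(
            root_dir=Path(cfg["root_dir"]),
            raw_data_path=Path(cfg["raw_data_path"]),
            cleaned_data_path=Path(cfg["cleaned_data_path"]),
        )
```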
The BBC News Dataset consists of news articles published by the BBC, categorized into five predefined classes: business, entertainment, politics, sport, and tech. Each article contains textual content along with its corresponding category label. The dataset is available on Kaggle.
The data is split by category, checking the distribution of tokens per example across categories to avoid bias from token length and word distribution (a minimal stratified-split sketch follows the counts below).
The training data contains around 10K words, of which only ~3.2K are out-of-vocabulary; this may introduce some test error, but it helps us obtain a well-generalized model.
- Train: ~10K
  - ~5.5K English + ~4.5K non-English
- Validation: ~3.2K
  - ~1.1K English + ~2.1K non-English
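The split itself can be reproduced with a short scikit-learn snippet. This is a minimal sketch, assuming the dataset is a CSV with `text` and `category` columns; the file path and column names are assumptions, not the project's actual configuration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed schema: one row per article with 'text' and 'category' columns.
df = pd.read_csv("data/bbc-news-data.csv")

# Stratify on the category label so each split keeps the same class proportions.
train_df, val_df = train_test_split(
    df, test_size=0.2, stratify=df["category"], random_state=42
)

# Sanity check: class proportions should match between the two splits.
print(train_df["category"].value_counts(normalize=True))
print(val_df["category"].value_counts(normalize=True))
```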
Data Preprocessing (a cleaning and vectorization sketch follows this list):
- Text cleaning
  - Normalizing
  - Remove currencies
  - Remove distances
  - Remove countries
  - Remove numbers
  - Remove special characters
  - Remove punctuation
  - Remove multiple spaces
  - Remove stopwords
  - Lemmatization
- Tokenization
- Vectorization
  - One-Hot Encoding
  - TF-IDF Encoding
  - Word2Vec Embeddings
  - GloVe Embeddings
  - FastText Embeddings
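As a concrete illustration of the cleaning and TF-IDF steps, here is a minimal sketch using nltk and scikit-learn. The regular expressions and the `max_features` value are illustrative assumptions and intentionally simpler than the full pipeline described above.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    text = text.lower()                       # normalize case
    text = re.sub(r"[^a-z\s]", " ", text)     # drop numbers, punctuation, special characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse multiple spaces
    tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in stop_words]
    return " ".join(tokens)

docs = [
    "Shares in the tech sector rose 5% on Monday.",
    "The film won three awards at the festival.",
]
cleaned = [clean_text(d) for d in docs]

# TF-IDF vectorization of the cleaned documents.
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(cleaned)
print(X.shape)
```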
Machine Learning Models (a training sketch follows this list):
- Logistic Regression
- Support Vector Machine (SVM)
- Naive Bayes
- Random Forest
- Gradient Boost
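A minimal sketch of how these baselines might be trained and compared with scikit-learn, assuming `X_train`, `X_val`, `y_train`, and `y_val` come from the preprocessing and split steps above (Naive Bayes additionally assumes non-negative features such as TF-IDF):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "Naive Bayes": MultinomialNB(),  # requires non-negative features (e.g. TF-IDF)
    "Random Forest": RandomForestClassifier(),
    "Gradient Boost": GradientBoostingClassifier(),
}

# X_train/X_val: feature matrices; y_train/y_val: encoded category labels.
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    print(f"{name}: weighted F1 = {f1_score(y_val, preds, average='weighted'):.2f}")
```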
Deep Learning Models (a Keras sketch follows this list):
- Multi-Layer Perceptron (MLP)
- LSTM with 2 Dense layers
- LSTM
- Bidirectional LSTM
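For the deep learning side, here is a minimal sketch of a bidirectional-LSTM classifier in Keras. The vocabulary size, sequence length, embedding dimension, and other hyperparameters are illustrative assumptions, not the tuned values used in the experiments:

```python
import tensorflow as tf

VOCAB_SIZE = 10000   # assumed vocabulary size
MAX_LEN = 300        # assumed padded sequence length
NUM_CLASSES = 5      # business, entertainment, politics, sport, tech

# Integer token sequences in, softmax over the five categories out.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 100),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```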
Model Evaluation:
- Precision, Recall, F1-Score
- Confusion matrix and ROC curve for performance analysis.
The models were evaluated using precision, recall, and F1-score, along with confusion matrices and ROC curves. Below are the results:
| Baseline Model (F1 Score) | Word2Vec | GloVe | FastText |
|---------------------------|----------|-------|----------|
| Logistic Regression       | 0.96     | 0.96  | 0.96     |
| Naive Bayes               | 0.84     | 0.92  | 0.80     |
| SVC                       | 0.96     | 0.96  | 0.96     |
| Random Forest             | 0.96     | 0.95  | 0.95     |
| Gradient Boost            | 0.97     | 0.96  | 0.93     |
| Model                        | Embedding          | Validation F1 Score |
|------------------------------|--------------------|---------------------|
| Baseline Logistic Regression | GloVe              | 0.97                |
| MLP                          | No Embeddings      | 0.92                |
| MLP                          | GloVe              | 0.14                |
| MLP                          | GloVe (Trained)    | 0.95                |
| MLP                          | FastText           | 0.92                |
| MLP                          | FastText (Trained) | 0.96                |
| LSTM with 2 Dense layers     | FastText           | 0.18                |
| LSTM                         | FastText           | 0.13                |
| Bidirectional LSTM           | GloVe              | 0.95                |
Sample classification report (per-class precision, recall, and F1-score):
precision recall f1-score support
0 0.98 0.96 0.97 101
1 0.97 0.97 0.97 78
2 0.96 0.96 0.96 82
3 1.00 1.00 1.00 104
4 0.98 1.00 0.99 82
accuracy 0.98 447
macro avg 0.98 0.98 0.98 447
weighted avg 0.98 0.98 0.98 447
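A report like the one above, together with a confusion matrix plot, can be produced with a few lines of scikit-learn. This is a minimal sketch, assuming `y_val` and `preds` come from the training step earlier:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, confusion_matrix

labels = ["business", "entertainment", "politics", "sport", "tech"]

# Per-class precision, recall, and F1-score.
print(classification_report(y_val, preds))

# Confusion matrix plot for error analysis.
cm = confusion_matrix(y_val, preds)
ConfusionMatrixDisplay(cm, display_labels=labels).plot()
plt.show()
```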
To make predictions on new news articles, you can use the following function:
import numpy as np

from src.news_sorting_project.components.predictor import PredictionMaker
from src.news_sorting_project.config.configuration import ConfigurationManager

# Category labels with display emojis (restored from garbled encoding; the sport emoji is a best guess).
categories = {
    'Business': '💼',
    'Entertainment': '🎬',
    'Politics': '🗳️',
    'Sports': '🏅',
    'Technology': '💻',
}
config = ConfigurationManager()
model_predict_config = config.get_model_predict_config()
model_clean_config = config.get_data_cleaning_config()
model_transform_config = config.get_data_transform_config()
make_prediction = PredictionMaker(
    model_predict_config,
    model_clean_config,
    model_transform_config,
)
article_text = "Your news article text here."
probabilities = make_prediction.predict(article_text)
classes = list(categories.keys())
probabilities = [prob / sum(probabilities) for prob in probabilities] # Normalize
predicted_class = classes[np.argmax(probabilities)]
print(predicted_class)
You can also run the training script to retrain the models:
python train.py
This project is licensed under the MIT License - see the LICENSE file for details.


