side-effect-NLP

NLP Algorithms for detecting adverse events in patient narratives.

Data download

cd side-effect-NLP
mkdir data
cd data
wget http://evexdb.org/pmresources/vec-space-models/PMC-w2v.bin

Strategy

Download the tweets.
Connect the annotations to the tweets.
(1) Predict whether a tweet reports an ADR (adverse drug reaction) or not, using an LSTM.
(2) Predict the ADR classification from the tweet.

Word Vector and other resources
BioLab NLP homepage: http://bio.nlplab.org/#word-vectors
Pretrained Vectors: http://evexdb.org/pmresources/vec-space-models/
Side Effect Database: http://sideeffects.embl.de/
Pubmed (biomedical lit database): https://www.ncbi.nlm.nih.gov/pubmed/
Gensim Tutorial: https://radimrehurek.com/gensim/models/word2vec.html

Pseudo Code

Step 0. Pull the word vectors for the set of MedDRA terms Y = (y_1, y_2, …, y_n). If a MedDRA term consists of 2 words, throw away the second word.
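A minimal sketch of Step 0, assuming the pretrained embeddings are available as a dict-like mapping from word to vector. The function name `first_word_vectors`, the lowercasing, the skipping of out-of-vocabulary terms, and the toy vectors are illustrative assumptions, not part of the original notes. Terms with more than two words are also reduced to their first word here.

```python
def first_word_vectors(meddra_terms, word_vectors):
    """Map each MedDRA term to the vector of its first word.

    `word_vectors` is any dict-like word -> vector mapping (a stand-in
    for the pretrained embeddings). Terms whose first word is out of
    vocabulary are skipped (an assumption).
    """
    term_vectors = {}
    for term in meddra_terms:
        words = term.lower().split()
        if not words:
            continue
        head = words[0]  # keep the first word, throw away the rest
        if head in word_vectors:
            term_vectors[term] = word_vectors[head]
    return term_vectors


# Toy data for illustration only.
toy_vectors = {"nausea": [0.9, 0.1], "abdominal": [0.2, 0.8]}
terms = ["Nausea", "Abdominal pain", "Xerostomia"]
term_vectors = first_word_vectors(terms, toy_vectors)
```

With the toy data, "Xerostomia" is dropped because its first word has no vector.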

Step 1. Input: a sentence. Output: an array of symptom words. Extract potential symptom words/terms.

  • Normalize words/terms (we will need to find a tool for this - NIH, Emily M.)
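The notes do not say how symptom words are extracted. One possible sketch, assuming a lexicon lookup (for example against an ADR lexicon loaded into a set), is shown below. The function name `extract_symptom_words`, the regex tokenizer, and the toy lexicon are hypothetical. Lowercasing stands in for the normalization tool that still needs to be found.

```python
import re


def extract_symptom_words(sentence, lexicon):
    """Return the tokens of `sentence` that appear in `lexicon`.

    Normalization here is only lowercasing (a placeholder assumption).
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t in lexicon]


# Toy lexicon for illustration only.
toy_lexicon = {"nausea", "headache", "dizzy"}
symptoms = extract_symptom_words(
    "Took it last night, woke up DIZZY with a headache.", toy_lexicon
)
```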

Step 2. Input: symptom words (w_1, w_2, …, w_n). Output: an array of symptom word vectors (x_1, x_2, …, x_n). Get the vectors for the words (w_1, w_2, …, w_n) and return them as an array.

Step 3. Input: the array of word vectors X. Output: the sum, average, or max of the word vectors, x_agg = agg_fxn(x_1, x_2, …, x_n).
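Steps 2 and 3 could be sketched as follows, using plain Python lists as vectors. The names `lookup_vectors` and `aggregate`, the choice to skip out-of-vocabulary words, and the element-wise reading of "max" are assumptions made for illustration.

```python
def lookup_vectors(words, word_vectors):
    """Step 2: return the vectors of the words found in `word_vectors`."""
    return [word_vectors[w] for w in words if w in word_vectors]


def aggregate(vectors, how="sum"):
    """Step 3: combine equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("no vectors to aggregate")
    columns = list(zip(*vectors))  # one tuple per dimension
    if how == "sum":
        return [sum(c) for c in columns]
    if how == "average":
        return [sum(c) / len(c) for c in columns]
    if how == "max":
        return [max(c) for c in columns]
    raise ValueError(f"unknown aggregation: {how}")


# Toy vectors for illustration only.
toy = {"dizzy": [1.0, 2.0], "headache": [3.0, 0.0]}
xs = lookup_vectors(["dizzy", "headache", "unknownword"], toy)
x_sum = aggregate(xs, "sum")
x_avg = aggregate(xs, "average")
x_max = aggregate(xs, "max")
```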

Step 4. Input: the aggregate word vector (x_agg) and the array of MedDRA word vectors (y_1, y_2, …, y_n). Output: a MedDRA term.

  • For each MedDRA word vector y_i in the array, compute z_i = dist(x_agg, y_i).
  • Apply softmax(Z).
  • Return argmin(Z).
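A minimal sketch of Step 4, assuming Euclidean distance for `dist` (the notes do not name a metric). The helper names and toy vectors are hypothetical. Softmax is monotonic, so taking the argmin after softmax picks the same term as taking the argmin over the raw distances.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def closest_term(x_agg, term_vectors, dist=euclidean):
    """Return (term, distance) for the MedDRA vector nearest to x_agg."""
    if not term_vectors:
        raise ValueError("no MedDRA vectors given")
    terms = list(term_vectors)
    z = [dist(x_agg, term_vectors[t]) for t in terms]
    scores = softmax(z)
    best = min(range(len(terms)), key=lambda i: scores[i])  # argmin
    return terms[best], z[best]


# Toy vectors for illustration only.
meddra = {"Nausea": [0.9, 0.1], "Headache": [0.1, 0.9]}
term, distance = closest_term([0.8, 0.2], meddra)
```

If a probability over terms is wanted, one option is to apply softmax to the negated distances and take the argmax instead.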

Side Effects File. Note that columns 5 and 6 (indexing from 1) are the most important.
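Pulling out those two columns might look like the sketch below. It assumes the file is tab-separated, which the notes do not state. The function name `read_columns_5_and_6` and the sample rows are placeholders.

```python
import csv
import io


def read_columns_5_and_6(handle, delimiter="\t"):
    """Return (column 5, column 6) pairs, counting columns from 1."""
    pairs = []
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 6:
            continue  # skip short or blank lines
        pairs.append((row[4], row[5]))  # 0-based indices 4 and 5
    return pairs


# Placeholder rows for illustration only.
sample = io.StringIO(
    "c1\tc2\tc3\tc4\tfive-a\tsix-a\n"
    "c1\tc2\tc3\tc4\tfive-b\tsix-b\n"
    "too\tshort\n"
)
pairs = read_columns_5_and_6(sample)
```

For a real file, pass an open file handle, for example `open(path, newline="")`, in place of the `StringIO` object.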

More Data Resources

Azadeh's dataset

Adverse drug reaction (ADR) lexicon: https://healthlanguageprocessing.files.wordpress.com/2018/03/adr_lexicon.zip
Annotated (labeled) tweets: https://healthlanguageprocessing.files.wordpress.com/2018/03/download_tweets1.zip
Unlabeled tweets: https://healthlanguageprocessing.files.wordpress.com/2018/03/download_tweets.zip

Binary ADR Classifier for Tweets

Twitter dataset for binary classification: https://healthlanguageprocessing.files.wordpress.com/2018/03/adr_classify_twitter_data.zip
Script for downloading tweets: https://healthlanguageprocessing.files.wordpress.com/2018/03/download_binary_twitter_data.zip
Polarity cues (?): https://healthlanguageprocessing.files.wordpress.com/2018/03/polaritycues.zip
Binary Classifier Repo: https://bitbucket.org/asarker/adrbinaryclassifier/get/bce087f4cc5d.zip

Binary LSTM

LSTM for Sequence Classification: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

NLTK for cleaning text data

https://machinelearningmastery.com/clean-text-machine-learning-python/
