side-effect-NLP

NLP Algorithms for detecting adverse events in patient narratives.

Data download

cd side-effect-NLP
mkdir data
cd data
wget http://evexdb.org/pmresources/vec-space-models/PMC-w2v.bin

Strategy

Download Tweets.
Connect annotations to tweets.
(1) Predict ADR or not ADR from tweet, using LSTM.
(2) Predict ADR classification from tweet

Word Vector and other resources
BioLab NLP homepage: http://bio.nlplab.org/#word-vectors
Pretrained Vectors: http://evexdb.org/pmresources/vec-space-models/
Side Effect Database: http://sideeffects.embl.de/
Pubmed (biomedical lit database): https://www.ncbi.nlm.nih.gov/pubmed/
Gensim Tutorial: https://radimrehurek.com/gensim/models/word2vec.html

Pseudo Code

Step 0. Pull all word vectors for set of MedDRA terms Y = (y_1, y_2, …, y_n) If meddra term consist of 2 words, throw away second word.

Step 1. Input sentence. Output array of symptom words Extract symptom potential words/terms

Normalize words/terms (we will need to find tool for this - NIH, Emily M.)

Step 2. Input symptom words (w_1, w_2, …, w_n). Output sum of symptom word vectors (x_sum). Get vectors for words (w_1, w_2, …, w_n) Return array of word vectors (x_1, x_2, …, x_n)

Step 3. Input array of word vectors X. Return sum, average, or max of word vectors x_agg = agg_fxn(x_1, x_2, …, x_n).

Step 4. Input aggregate word vector (x_agg) and array of MedDRA word vectors (y_1, y_2, …, y_n). Output medDRA term. For each medDRA word vector y_i in array compute dist(x_agg, y_i) = z_i softmax(Z) return argmin(Z)

Side Effects File. Note columns 5 and 6 are most important. (indexing from 1)

More Data Resources

Azadeh's dataset

Adverse drug reaction (ADR) lexicon: https://healthlanguageprocessing.files.wordpress.com/2018/03/adr_lexicon.zip
Annotated (labeled) tweets: https://healthlanguageprocessing.files.wordpress.com/2018/03/download_tweets1.zip
Unlabled tweets: https://healthlanguageprocessing.files.wordpress.com/2018/03/download_tweets.zip

Binary ADR Classifier for Tweets

Twiiter dataset for binary classification: https://healthlanguageprocessing.files.wordpress.com/2018/03/adr_classify_twitter_data.zip
Script for downloading tweets: https://healthlanguageprocessing.files.wordpress.com/2018/03/download_binary_twitter_data.zip
Polarity cues (?): https://healthlanguageprocessing.files.wordpress.com/2018/03/polaritycues.zip
Binary Classifier Repo: https://bitbucket.org/asarker/adrbinaryclassifier/get/bce087f4cc5d.zip

Binary LSTM

LSTM for Sequence Classification: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

NLTK For cleaning text data.

https://machinelearningmastery.com/clean-text-machine-learning-python/

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
code		code
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
lasix_meddra.tsv		lasix_meddra.tsv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

side-effect-NLP

Data download

Strategy

Pseudo Code

More Data Resources

Azadeh's dataset

Binary ADR Classifier for Tweets

Binary LSTM

NLTK For cleaning text data.

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

side-effect-NLP

Data download

Strategy

Pseudo Code

More Data Resources

Azadeh's dataset

Binary ADR Classifier for Tweets

Binary LSTM

NLTK For cleaning text data.

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages