[Clojure] Add fastText example #15340
Conversation
Great! I look forward to reviewing this 😸

Pinging @Chouffe to take a look if he has time as well, since he helped shape the issue ticket 😄

I will take a look shortly @adc17
I could run fastText:

```
cnn-text-classification.classifier=> (train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :fastText})
Loading all the movie reviews from data/mr-data
WARN org.apache.mxnet.WarnIfNotDisposed: LEAK: [one-time warning] An instance of org.apache.mxnet.Symbol was not disposed. Set property mxnet.traceLeakedObjects to true to enable tracing
Loading the fastText pre-trained word embeddings from data/fastText/wiki.simple.vec
Shuffling the data and splitting into training and test sets
{:sentence-count 2000, :sentence-size 62, :vocab-size 8078, :embedding-size 300, :pretrained-embedding :fastText}
Getting ready to train for 10 epochs
===========
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
[18:54:04] src/operator/tensor/./matrix_op-inl.h:200: Using target_shape will be deprecated.
[18:54:04] src/operator/tensor/./matrix_op-inl.h:200: Using target_shape will be deprecated.
INFO org.apache.mxnet.module.BaseModule: Epoch[0] Train-accuracy=0.5326316
INFO org.apache.mxnet.module.BaseModule: Epoch[0] Time cost=4463
INFO org.apache.mxnet.module.BaseModule: Epoch[0] Validation-accuracy=0.59
...
INFO org.apache.mxnet.module.BaseModule: Epoch[8] Train-accuracy=0.9836842
INFO org.apache.mxnet.module.BaseModule: Epoch[8] Time cost=4093
INFO org.apache.mxnet.module.BaseModule: Epoch[8] Validation-accuracy=0.73
INFO org.apache.mxnet.module.BaseModule: Epoch[9] Train-accuracy=0.9878947
INFO org.apache.mxnet.module.BaseModule: Epoch[9] Time cost=3861
INFO org.apache.mxnet.module.BaseModule: Epoch[9] Validation-accuracy=0.75
```

Thanks a lot for adding this @adc17! It seems to work really well :-)
Chouffe left a comment
Overall looks really good!
I left some minor comments @adc17.
```diff
 (defn load-glove [glove-file-path]
   (println "Loading the glove pre-trained word embeddings from " glove-file-path)
-  (into {} (read-text-embedding-pairs (io/reader glove-file-path))))
+  (into {} (read-text-embedding-pairs (line-seq (io/reader glove-file-path)))))
```
It is becoming hard to read. Can we use a threading macro here?
```
(->> (io/reader path)
     line-seq
     read-text-embedding-pairs
     (into {}))
```
Yep, I considered this myself!
```diff
 (defn load-fasttext [fasttext-file-path]
   (println "Loading the fastText pre-trained word embeddings from " fasttext-file-path)
   (into {} (read-text-embedding-pairs (remove-fasttext-metadata (line-seq (io/reader fasttext-file-path))))))
```
Can we also use a threading macro here for readability?
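A threaded version might look like the runnable sketch below. Note that `read-text-embedding-pairs` is given a minimal stand-in implementation here purely so the sketch runs standalone; its parsing details are assumptions for illustration, not the example's real code.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

;; Assumed minimal stand-in for the real helper, for illustration only:
;; parses "token v1 v2 ..." lines into [token [v1 v2 ...]] pairs.
(defn read-text-embedding-pairs [lines]
  (for [line lines
        :let [[token & vec-strs] (string/split line #" ")]]
    [token (mapv #(Double/parseDouble %) vec-strs)]))

;; fastText .vec files start with a "<count> <dim>" header line.
(def remove-fasttext-metadata rest)

;; The suggested threading-macro shape applied to load-fasttext.
(defn load-fasttext [fasttext-file-path]
  (println "Loading the fastText pre-trained word embeddings from " fasttext-file-path)
  (->> (io/reader fasttext-file-path)
       line-seq
       remove-fasttext-metadata
       read-text-embedding-pairs
       (into {})))
```

Each step reads top-to-bottom instead of inside-out, which is the readability win the comment is after.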
```diff
 vocab-embeddings (case pretrained-embedding
                    :glove (->> (load-glove (glove-file-path embedding-size))
                                (build-vocab-embeddings vocab embedding-size))
                    :fastText (->> (load-fasttext fasttext-file-path)
```
I would prefer a keyword like :fast-text or :fasttext instead of having an uppercase character.
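For reference, the rename would only touch the `case` dispatch. A tiny standalone sketch of the suggested `:fast-text` spelling (the return keywords here are placeholders, not the real loader calls):

```clojure
;; Sketch of the case dispatch with the lower-case keyword the reviewer
;; suggests; the dispatch values are placeholders for illustration.
(defn embedding-loader [pretrained-embedding]
  (case pretrained-embedding
    :glove     :would-call-load-glove
    :fast-text :would-call-load-fasttext))
```

Callers in the README snippets would then pass `:pretrained-embedding :fast-text` instead of `:fastText`.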
```diff
 (def remove-fasttext-metadata rest)

 (defn load-fasttext [fasttext-file-path]
```
Should we mark the IO functions with ! to convey it?

```
(defn load-fasttext! ...)
```

````diff
 In order to run training with word2vec on the complete data set, you will need to run:
 ```
-(train-convnet {:embedding-size 300 :batch-size 100 :test-size 1000 :num-epoch 10 :pretrained-embedding :word2vec})
+(train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 1000 :num-epoch 10 :pretrained-embedding :word2vec})
 ```
````
Thanks for fixing the README!
````diff
 Then you can run training on a subset of examples through the repl using:
 ```
-(train-convnet {:embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :word2vec})
+(train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :word2vec})
 ```
````
Thanks for fixing the README! Should we use (context/default-context) instead though?
````diff
 Then you can run training on a subset of examples through the repl using:
 ```
 (train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :fastText})
 ```
````
Should we use (context/default-context) instead?
```diff
 You can run through the repl with
-`(train-convnet {:embedding-size 50 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove})`
+`(train-convnet {:devs [(context/cpu 0)] :embedding-size 50 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove})`
```
Thanks for fixing the README! Should we use (context/default-context) instead?
Thanks for the thorough review @Chouffe; I agree with all your suggestions.
I did this for fastText; the word2vec link is a Google Drive address that requires confirmation prior to download, so I didn't script it for now. We'd need a workaround like this to automate the download: https://stackoverflow.com/a/32742700/7028216
Force-pushed: 52601e4 to f454f3c
gigasquid left a comment
Thanks again for your contribution 💯
Description
Resolves #14118