[Clojure] Add fastText example #15340
Conversation
Great! I look forward to reviewing this 😸

Pinging @Chouffe to take a look if he has time as well, since he helped shape the issue ticket 😄

I will take a look shortly @adc17
I could run fastText:

```
cnn-text-classification.classifier=> (train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :fastText})
Loading all the movie reviews from data/mr-data
WARN org.apache.mxnet.WarnIfNotDisposed: LEAK: [one-time warning] An instance of org.apache.mxnet.Symbol was not disposed. Set property mxnet.traceLeakedObjects to true to enable tracing
Loading the fastText pre-trained word embeddings from data/fastText/wiki.simple.vec
Shuffling the data and splitting into training and test sets
{:sentence-count 2000, :sentence-size 62, :vocab-size 8078, :embedding-size 300, :pretrained-embedding :fastText}
Getting ready to train for 10 epochs
===========
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
[18:54:04] src/operator/tensor/./matrix_op-inl.h:200: Using target_shape will be deprecated.
[18:54:04] src/operator/tensor/./matrix_op-inl.h:200: Using target_shape will be deprecated.
INFO org.apache.mxnet.module.BaseModule: Epoch[0] Train-accuracy=0.5326316
INFO org.apache.mxnet.module.BaseModule: Epoch[0] Time cost=4463
INFO org.apache.mxnet.module.BaseModule: Epoch[0] Validation-accuracy=0.59
...
INFO org.apache.mxnet.module.BaseModule: Epoch[8] Train-accuracy=0.9836842
INFO org.apache.mxnet.module.BaseModule: Epoch[8] Time cost=4093
INFO org.apache.mxnet.module.BaseModule: Epoch[8] Validation-accuracy=0.73
INFO org.apache.mxnet.module.BaseModule: Epoch[9] Train-accuracy=0.9878947
INFO org.apache.mxnet.module.BaseModule: Epoch[9] Time cost=3861
INFO org.apache.mxnet.module.BaseModule: Epoch[9] Validation-accuracy=0.75
```

Thanks a lot for adding this @adc17! It seems to work really well :-)
Chouffe left a comment
Overall looks really good!
I left some minor comments @adc17.
```diff
 (defn load-glove [glove-file-path]
   (println "Loading the glove pre-trained word embeddings from " glove-file-path)
-  (into {} (read-text-embedding-pairs (io/reader glove-file-path))))
+  (into {} (read-text-embedding-pairs (line-seq (io/reader glove-file-path)))))
```
It is becoming hard to read. Can we use a threading macro here?
```
(->> (io/reader path)
     line-seq
     read-text-embedding-pairs
     (into {}))
```
Yep, I considered this myself!
```diff
 (defn load-fasttext [fasttext-file-path]
   (println "Loading the fastText pre-trained word embeddings from " fasttext-file-path)
   (into {} (read-text-embedding-pairs (remove-fasttext-metadata (line-seq (io/reader fasttext-file-path))))))
```
Can we also use a threading macro here for readability?
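A threaded version might look like the runnable sketch below. Note that `read-text-embedding-pairs` is given a minimal stand-in implementation here purely so the sketch runs standalone; its parsing details are assumptions for illustration, not the example's real code.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

;; Assumed minimal stand-in for the real helper, for illustration only:
;; parses "token v1 v2 ..." lines into [token [v1 v2 ...]] pairs.
(defn read-text-embedding-pairs [lines]
  (for [line lines
        :let [[token & vec-strs] (string/split line #" ")]]
    [token (mapv #(Double/parseDouble %) vec-strs)]))

;; fastText .vec files start with a "<count> <dim>" header line.
(def remove-fasttext-metadata rest)

;; The suggested threading-macro shape applied to load-fasttext.
(defn load-fasttext [fasttext-file-path]
  (println "Loading the fastText pre-trained word embeddings from " fasttext-file-path)
  (->> (io/reader fasttext-file-path)
       line-seq
       remove-fasttext-metadata
       read-text-embedding-pairs
       (into {})))
```

Each step reads top-to-bottom instead of inside-out, which is the readability win the comment is after.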
```diff
 vocab-embeddings (case pretrained-embedding
                    :glove (->> (load-glove (glove-file-path embedding-size))
                                (build-vocab-embeddings vocab embedding-size))
                    :fastText (->> (load-fasttext fasttext-file-path)
```
I would prefer a keyword like :fast-text or :fasttext instead of having an uppercase character.
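For reference, the rename would only touch the `case` dispatch. A tiny standalone sketch of the suggested `:fast-text` spelling (the return keywords here are placeholders, not the real loader calls):

```clojure
;; Sketch of the case dispatch with the lower-case keyword the reviewer
;; suggests; the dispatch values are placeholders for illustration.
(defn embedding-loader [pretrained-embedding]
  (case pretrained-embedding
    :glove     :would-call-load-glove
    :fast-text :would-call-load-fasttext))
```

Callers in the README snippets would then pass `:pretrained-embedding :fast-text` instead of `:fastText`.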
```diff
 (def remove-fasttext-metadata rest)

 (defn load-fasttext [fasttext-file-path]
```
Should we mark the IO functions with ! to convey it?

```
(defn load-fasttext! ...)
```

````diff
 In order to run training with word2vec on the complete data set, you will need to run:
 ```
-(train-convnet {:embedding-size 300 :batch-size 100 :test-size 1000 :num-epoch 10 :pretrained-embedding :word2vec})
+(train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 1000 :num-epoch 10 :pretrained-embedding :word2vec})
 ```
````
Thanks for fixing the README!
````diff
 Then you can run training on a subset of examples through the repl using:
 ```
-(train-convnet {:embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :word2vec})
+(train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :word2vec})
 ```
````
Thanks for fixing the README! Should we use (context/default-context) instead though?
````diff
 Then you can run training on a subset of examples through the repl using:
 ```
 (train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :fastText})
 ```
````
Should we use (context/default-context) instead?
```diff
 You can run through the repl with
-`(train-convnet {:embedding-size 50 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove})`
+`(train-convnet {:devs [(context/cpu 0)] :embedding-size 50 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove})`
```
Thanks for fixing the README! Should we use (context/default-context) instead?
Thanks for the thorough review @Chouffe; I agree with all your suggestions.
I did this for fastText; the word2vec link is a Google Drive address that requires confirmation prior to download, so I didn't script it for now. We'd need a workaround like this to automate the download: https://stackoverflow.com/a/32742700/7028216
Force-pushed: 52601e4 to f454f3c
gigasquid left a comment
Thanks again for your contribution 💯
Description
Resolves #14118