Don't Merge Yet - Add Caffe2, PyTorch and GPU support (using Accelerators) #24
animeshsingh merged 9 commits into master
The new commit allows LCM to use different learner images (including GPU images). Users can now use the Framework version section to select the framework version of their choice. Please note that the example jobs are configured to pass for every framework on the list, so they might not reflect real-world workloads (e.g., the Caffe2 example does not run any epochs because the latest CPU image cannot do that yet). To run workloads on GPUs, users must satisfy the following prerequisites. Currently, using feature gate
animeshsingh left a comment
@Tomcli please look at some comments
* [Caffe](http://caffe.berkeleyvision.org/) version "1.0-py2"
* [Caffe2](https://caffe2.ai/) version "0.8.1"
* [PyTorch](http://pytorch.org/) version "0.2"
Should we also test TF 1.5? Anything holding that?
etc/examples/caffe2-model/LICENSE
Outdated
@@ -0,0 +1,21 @@
The MIT License (MIT)
Are there other MIT License files in the code?
etc/examples/caffe2-model/LICENSE
Outdated
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) Microsoft Corporation
This says "Copyright (c) Microsoft Corporation"?
@@ -0,0 +1,25 @@
# The train/test net protocol buffer definition
Do we need to duplicate this whole file for a minimal change? Could we instead ask users to change one or two parameters manually, or provide a script?
https://github.com/IBM/FfDL/blob/master/etc/examples/caffe-model/lenet_solver.prototxt
@@ -0,0 +1,168 @@
name: "LeNet"
Same as above regarding duplication.
@@ -0,0 +1,25 @@
name: mnist-caffe-gpu-model
Same as above - can't these be instructions, e.g. change these params in the manifest and proto files, and then run? This way users also learn.
@@ -1,4 +1,3 @@
#!/usr/bin/env python
Same duplication issue...?
@animeshsingh For the first three comments, I think you are looking at an older commit. Regarding the duplication, I can merge the CPU and GPU examples together and give more detailed instructions in gpu-guide.md.
I put the CPU and GPU jobs in a single example, so for TensorFlow and Caffe there is an extra gpu-manifest.yml that shows users how to deploy with GPU resources. Detailed instructions on converting CPU jobs to GPU jobs are available in the gpu-guide.
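To illustrate the idea of deriving a GPU job from a CPU one, here is a rough Python sketch. It is not the procedure from the gpu-guide: the manifest field names (`gpus`, `cpus`, `framework`, `version`), the sample values, and the helper `to_gpu_manifest` are all assumptions made for this example, and the real gpu-manifest.yml may differ.

```python
import copy


def to_gpu_manifest(cpu_manifest, gpus=1, framework_version=None):
    """Return a copy of a CPU job manifest that requests GPU resources.

    Hypothetical helper: the keys "gpus" and "framework"/"version" are
    assumed field names, not confirmed against the actual manifest schema.
    """
    if gpus < 1:
        raise ValueError("a GPU manifest must request at least one GPU")

    # Deep-copy so the caller's CPU manifest is left untouched.
    gpu_manifest = copy.deepcopy(cpu_manifest)
    gpu_manifest["gpus"] = gpus

    # Optionally switch to a GPU-enabled learner image version.
    if framework_version is not None:
        gpu_manifest.setdefault("framework", {})["version"] = framework_version

    return gpu_manifest


# Illustrative CPU manifest; every value here is a placeholder.
cpu_manifest = {
    "name": "example-cpu-job",
    "gpus": 0,
    "cpus": 0.5,
    "framework": {"name": "caffe", "version": "cpu-version-placeholder"},
}

gpu_manifest = to_gpu_manifest(
    cpu_manifest, gpus=1, framework_version="gpu-version-placeholder"
)
print(gpu_manifest["gpus"], gpu_manifest["framework"]["version"])
```

Keeping the change to one or two fields in a copy of the CPU manifest matches the review request to avoid duplicating whole example files.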
Add new community CPU images for Caffe2 and PyTorch
Add sample CPU jobs for Caffe2 and PyTorch
Add other community learner images (including GPU images)
Add sample GPU jobs for TensorFlow and Caffe
We need to rebuild ffdl-lcm in order to reflect the new changes for the FfDL DockerHub images.
For GPU usage, a temporary solution will be available in the gpu-guide.
This PR fixes #16, #17, and #20.