This repository was archived by the owner on Jan 29, 2026. It is now read-only.

Don`t Merge Yet - Add Caffe2, PyTorch and GPU support(using Accelerators) #24

Merged: animeshsingh merged 9 commits into master from new-frameworks on Feb 28, 2018.

@Tomcli (Contributor) commented Feb 23, 2018

  • Add new community CPU images for Caffe2 and PyTorch

  • Add sample CPU jobs for Caffe2 and PyTorch

  • Add other community learner images (including GPU images)

  • Add sample GPU jobs for TensorFlow and Caffe

  • We need to rebuild ffdl-lcm in order to reflect the new changes for the FfDL DockerHub images.

For GPU usage, a temporary solution will be available in the gpu-guide.

This PR fixes #16, #17, and #20.

@Tomcli Tomcli added the enhancement New feature or request label Feb 23, 2018
@Tomcli (Contributor, Author) commented Feb 27, 2018

The new commit allows the LCM to use different learner images (including GPU images). Users can now use the Framework version section to select the framework version of their choice.

Please note that the example jobs are configured to pass for every framework on the list, so they might not reflect real-world workloads. (For example, the Caffe2 example does not run any epochs, since the latest CPU image is not yet capable of doing so.)

To run workloads on GPUs, users must satisfy certain prerequisites. Currently, using the Accelerators feature gate is the only option we have.
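For reference, here is a minimal sketch of what a GPU request looked like under the Accelerators feature gate. It assumes a Kubernetes release from that period (roughly 1.6 to 1.9), where GPUs were exposed as the alpha resource `alpha.kubernetes.io/nvidia-gpu` and the kubelet on each GPU node was started with `--feature-gates=Accelerators=true`. The helper name `gpu_resources` and its CPU/memory defaults are hypothetical. The sketch also leaves out the NVIDIA driver host-path volume mounts that this approach typically required.

```python
# Sketch (assumption): GPU requests under the alpha Accelerators feature gate,
# before the device-plugin API replaced it.

# Flag that had to be passed to the kubelet on every GPU node.
KUBELET_FLAG = "--feature-gates=Accelerators=true"

# Alpha resource name for NVIDIA GPUs under the Accelerators feature gate.
ALPHA_GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"


def gpu_resources(gpus: int, cpu: str = "500m", memory: str = "1Gi") -> dict:
    """Build a container `resources` block requesting `gpus` NVIDIA GPUs.

    GPUs are set under `limits`; the CPU/memory defaults are placeholders.
    """
    if gpus < 0:
        raise ValueError("gpus must be non-negative")
    limits = {"cpu": cpu, "memory": memory}
    if gpus > 0:
        # Only add the GPU resource when at least one GPU is requested.
        limits[ALPHA_GPU_RESOURCE] = gpus
    return {"limits": limits}


if __name__ == "__main__":
    # Print the resources block for a single-GPU learner container.
    print(gpu_resources(1))
```

A CPU-only job simply omits the GPU key, so the same helper serves both variants.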

  • Please do not merge this PR until we build and upload the new version of the ffdl-lcm image to DockerHub, since the new example jobs are not backward compatible with the old ffdl-lcm image.

@Tomcli Tomcli changed the title Add Caffe2 and PyTorch CPU support Add Caffe2 and PyTorch and GPU support(using Accelerators) Feb 27, 2018
@Tomcli Tomcli changed the title Add Caffe2 and PyTorch and GPU support(using Accelerators) Add Caffe2, PyTorch and GPU support(using Accelerators) Feb 27, 2018
@animeshsingh animeshsingh changed the title Add Caffe2, PyTorch and GPU support(using Accelerators) Don`t Merge Yet - Add Caffe2, PyTorch and GPU support(using Accelerators) Feb 27, 2018
@animeshsingh (Contributor) left a comment

@Tomcli please look at some comments

* [Caffe](http://caffe.berkeleyvision.org/) version "1.0-py2"
* [Caffe2](https://caffe2.ai/) version "0.8.1"
* [PyTorch](http://pytorch.org/) version "0.2"

Should we also test TF 1.5? Anything holding that?

@@ -0,0 +1,21 @@
The MIT License (MIT)

Are there other MIT License files in the code?

@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) Microsoft Corporation

This says "Copyright (c) Microsoft Corporation"?

@@ -0,0 +1,25 @@
# The train/test net protocol buffer definition

Do we need to duplicate this whole file for a minimal change? Can we ask users to change one or two params manually, or give them a script?
https://github.com/IBM/FfDL/blob/master/etc/examples/caffe-model/lenet_solver.prototxt
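A script along the lines suggested here could look like the sketch below. It assumes that the CPU and GPU variants of the Caffe solver differ only in the `solver_mode` field (a real Caffe `SolverParameter` field that accepts `CPU` or `GPU`). The function name `set_solver_mode` and the command-line usage are hypothetical.

```python
import re
import sys

# Matches an existing "solver_mode: CPU" or "solver_mode: GPU" line.
_SOLVER_MODE = re.compile(r"^([ \t]*solver_mode[ \t]*:[ \t]*)(CPU|GPU)\b", re.MULTILINE)


def set_solver_mode(prototxt: str, mode: str) -> str:
    """Return the solver text with solver_mode set to `mode` ("CPU" or "GPU")."""
    mode = mode.upper()
    if mode not in ("CPU", "GPU"):
        raise ValueError(f"unsupported solver_mode: {mode}")
    new_text, count = _SOLVER_MODE.subn(lambda m: m.group(1) + mode, prototxt)
    if count == 0:
        # No solver_mode line present; Caffe would default to GPU, so append
        # the requested mode explicitly.
        new_text = prototxt.rstrip("\n") + f"\nsolver_mode: {mode}\n"
    return new_text


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python set_solver_mode.py <solver.prototxt> <CPU|GPU>
    path, requested = sys.argv[1], sys.argv[2]
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(set_solver_mode(text, requested))
```

Rewriting one field in place keeps a single copy of the solver in the repository, and users can see exactly which parameter changes between the CPU and GPU runs.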

@@ -0,0 +1,168 @@
name: "LeNet"

Same as above vis-à-vis duplication.

@@ -0,0 +1,25 @@
name: mnist-caffe-gpu-model

Same as above: can't these be instructions, like "change these params in the manifest and proto files, then run"? That way users also learn.

@@ -1,4 +1,3 @@

#!/usr/bin/env python

Same duplication issue...?

@Tomcli (Contributor, Author) commented Feb 28, 2018

@animeshsingh For the first three comments, I think you are looking at an older commit. Regarding the duplication, I can merge the CPU and GPU examples together and give more detailed instructions in gpu-guide.md.

@Tomcli (Contributor, Author) commented Feb 28, 2018

I put the CPU and GPU versions into a single example, so TensorFlow and Caffe each have an extra gpu-manifest.yml that shows users how to deploy with GPU resources. Detailed instructions on converting CPU jobs to GPU jobs are available in the gpu-guide.
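The CPU-to-GPU conversion can be sketched as a small transformation over a parsed manifest. This is a hypothetical sketch: it assumes the manifest has a top-level `gpus` count and a `framework.version` field that selects the learner image, and that the caller supplies the GPU image tag (for example a TensorFlow-style `1.5.0-gpu-py3`). Parsing the YAML file itself is left out to keep the sketch dependency-free.

```python
import copy


def to_gpu_manifest(manifest: dict, gpus: int, gpu_version: str) -> dict:
    """Return a GPU variant of a CPU job manifest, leaving the input untouched.

    Hypothetical layout: top-level `gpus` and a `framework` mapping whose
    `version` selects the learner image.
    """
    if gpus < 1:
        raise ValueError("a GPU manifest needs at least one GPU")
    gpu_manifest = copy.deepcopy(manifest)  # do not mutate the CPU manifest
    gpu_manifest["gpus"] = gpus
    gpu_manifest["framework"]["version"] = gpu_version
    return gpu_manifest


if __name__ == "__main__":
    cpu_manifest = {
        "name": "tf-example",  # hypothetical job name
        "gpus": 0,
        "learners": 1,
        "framework": {"name": "tensorflow", "version": "1.5.0-py3"},
    }
    print(to_gpu_manifest(cpu_manifest, gpus=1, gpu_version="1.5.0-gpu-py3"))
```

Copying rather than editing in place keeps the CPU manifest usable as-is, which mirrors shipping a separate gpu-manifest.yml next to the CPU one.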

@animeshsingh animeshsingh merged commit 80769d0 into master Feb 28, 2018
@Tomcli Tomcli deleted the new-frameworks branch February 28, 2018 22:27

Development: successfully merging this pull request may close the issue "Prepare TensorFlow and Caffe sample GPU jobs."

2 participants