Skip to content

[tune] Tweaks to Trainable and Verbosity#2889

Merged
richardliaw merged 23 commits intoray-project:masterfrom
richardliaw:tune_other_fixes
Oct 12, 2018
Merged

[tune] Tweaks to Trainable and Verbosity#2889
richardliaw merged 23 commits intoray-project:masterfrom
richardliaw:tune_other_fixes

Conversation

@richardliaw
Copy link
Copy Markdown
Contributor

@richardliaw richardliaw commented Sep 15, 2018

Fix HyperOpt verbosity (hopefully), and tweak Trainable.

TODOs:

  • when I had a dict of state, I still had to write file saving/restoring code
  • the semantics of checkpoint_dir vs checkpoint_path was unclear, perhaps it could use an example in the interface class
  • it wasn't super clear that self.config was automatically set, perhaps it should be an argument for _setup()
  • the error message for when you return None / the wrong thing from _train() is hard to understand

From Ray-dev discussion: We should increase the clarity on the docs.
https://groups.google.com/forum/#!topic/ray-dev/wx5zdzbgMfs

  • "Suggestion Algorithms like HyperOpt don't extend the VariantGenerator so lambda spec: ..."

  • Clean up HyperOptSearch (don't expose add_configurations)

  • Clean up HyperBand documentation

  • Provide a better error when a config is mis-specified.

  File "<PROJECT_HOME>/env_27/lib/python2.7/site-packages/ray/tune/trial_runner.py", line 199, in has_resources
    return self.trial_executor.has_resources(resources)
  File "<PROJECT_HOME>/env_27/lib/python2.7/site-packages/ray/tune/ray_trial_executor.py", line 206, in has_resources
    have_space = (resources.cpu_total() <= cpu_avail
  File "<PROJECT_HOME>/env_27/lib/python2.7/site-packages/ray/tune/trial.py", line 55, in cpu_total
    return self.cpu + self.extra_cpu
TypeError: unsupported operand type(s) for +: 'function' and 'int'

#2870 #2888.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8237/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8238/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8243/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8244/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8379/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8381/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8382/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8386/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8387/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8392/
Test FAILed.

@richardliaw richardliaw requested a review from ericl September 28, 2018 01:48
@richardliaw
Copy link
Copy Markdown
Contributor Author

This is ready for review - not totally sure what we want for

when I had a dict of state, I still had to write file saving/restoring code

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8418/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8419/
Test PASSed.

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8421/
Test PASSed.

Comment thread python/ray/tune/trainable.py Outdated
def _save(self, checkpoint_dir):
"""Subclasses should override this to implement save().

See also: ray.tune.Trainable.save_dict.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super sure that's useful -- how about we say you can also return a dict, and if so it will be auto saved?

Comment thread python/ray/tune/trainable.py Outdated
would default to `checkpoint_dir`.
checkpoint_path: The checkpoint path that will be
passed to restore(). This can be different from
checkpoint_dir.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually have a use case for it being different? It would be nice to add a check that checkpoint_path is a child of checkpoint_dir.

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8495/
Test PASSed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8496/
Test FAILed.

ericl
ericl previously requested changes Oct 2, 2018
Copy link
Copy Markdown
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main question is whether the dict is correct since it needs to include the iteration id.

Comment thread python/ray/tune/result.py
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TUNE_RESULTS_DIR?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checkpoint path needs to include the current iteration right? Otherwise they will collide?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8535/
Test FAILed.

Comment thread python/ray/tune/trainable.py Outdated
"episodes_total": self._episodes_total,
"saved_as_dict": saved_as_dict
}, open(checkpoint_path + ".tune_metadata", "wb"))
self._checkpoint_num += 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use self.iteration

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if that makes sense when you do checkpoint_at_end... you get something like experiment/checkpoint_433.pkl

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Oct 4, 2018 via email

@richardliaw
Copy link
Copy Markdown
Contributor Author

richardliaw commented Oct 4, 2018

Yep, all Trial ids increase monotonically now as of a previous PR (#2874)... how strongly do you feel about using self.iteration as checkpoint index?

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Oct 4, 2018 via email

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8551/
Test FAILed.


test_trainable = TestTrain()
checkpoint_1 = test_trainable.save()
import ipdb; ipdb.set_trace(context=5)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be removed?

Copy link
Copy Markdown
Contributor Author

@richardliaw richardliaw Oct 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(you reviewed an older version of the code), see most recent 2 commits

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8558/
Test PASSed.

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8598/
Test PASSed.

@richardliaw richardliaw merged commit f9b58d7 into ray-project:master Oct 12, 2018
@richardliaw richardliaw deleted the tune_other_fixes branch October 12, 2018 06:42
@richardliaw
Copy link
Copy Markdown
Contributor Author

Errors unrelated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants