[tune] Tweaks to Trainable and Verbosity by richardliaw · Pull Request #2889 · ray-project/ray

richardliaw · 2018-09-15T09:09:46Z

Fix HyperOpt verbosity (hopefully), and tweak Trainable.

TODOs:

When I had a dict of state, I still had to write file saving/restoring code.
The semantics of checkpoint_dir vs. checkpoint_path were unclear; perhaps the interface class could use an example.
It wasn't clear that self.config is set automatically; perhaps it should be an argument for _setup().
The error message shown when you return None or the wrong thing from _train() is hard to understand.
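
One way to address the last item is to validate the value returned from `_train()` before it is used. The following is a minimal sketch, not Ray's implementation: `check_train_result` is a hypothetical helper, and it assumes a valid result is a dict of metrics.

```python
def check_train_result(result):
    """Hypothetical helper: raise a readable error for bad _train() output."""
    if result is None:
        # The most common mistake: the method fell off the end without returning.
        raise ValueError(
            "_train() returned None. It should return a dict of metrics; "
            "did you forget a `return` statement?")
    if not isinstance(result, dict):
        # Name the offending type so the user can find it quickly.
        raise ValueError(
            "_train() should return a dict of metrics, got {} instead.".format(
                type(result).__name__))
    return result
```

Failing early with the offending type in the message points at the user's `_train()` rather than at an unrelated line deeper in the runner.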

From the Ray-dev discussion: we should improve the clarity of the docs.
https://groups.google.com/forum/#!topic/ray-dev/wx5zdzbgMfs

"Suggestion Algorithms like HyperOpt don't extend the VariantGenerator so lambda spec: ..."
Clean up HyperOptSearch (don't expose add_configurations)
Clean up HyperBand documentation
Provide a better error when a config is mis-specified.

  File "<PROJECT_HOME>/env_27/lib/python2.7/site-packages/ray/tune/trial_runner.py", line 199, in has_resources
    return self.trial_executor.has_resources(resources)
  File "<PROJECT_HOME>/env_27/lib/python2.7/site-packages/ray/tune/ray_trial_executor.py", line 206, in has_resources
    have_space = (resources.cpu_total() <= cpu_avail
  File "<PROJECT_HOME>/env_27/lib/python2.7/site-packages/ray/tune/trial.py", line 55, in cpu_total
    return self.cpu + self.extra_cpu
TypeError: unsupported operand type(s) for +: 'function' and 'int'
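
The traceback above shows a function reaching `cpu_total()` where a number was expected. A clearer error could be raised earlier by checking the resource values up front. This is only a sketch under assumptions: `validate_resources` is a hypothetical helper, and it assumes resources arrive as a plain dict with keys such as `cpu` and `extra_cpu`.

```python
from numbers import Number


def validate_resources(resources):
    """Hypothetical pre-flight check for a trial's resource request."""
    for name, value in resources.items():
        if callable(value):
            # An unresolved lambda is what produced the TypeError above.
            raise TypeError(
                "Resource '{}' must be a number, got a function. "
                "Was a `lambda` left unresolved in the config?".format(name))
        if not isinstance(value, Number):
            raise TypeError(
                "Resource '{}' must be a number, got {}.".format(
                    name, type(value).__name__))
    return resources
```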

#2870 #2888.

AmplabJenkins · 2018-09-15T09:31:49Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8237/

AmplabJenkins · 2018-09-15T09:35:18Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8238/

AmplabJenkins · 2018-09-15T18:35:31Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8243/

AmplabJenkins · 2018-09-15T18:55:55Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8244/

AmplabJenkins · 2018-09-26T21:36:15Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8379/

AmplabJenkins · 2018-09-26T21:57:02Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8381/

AmplabJenkins · 2018-09-26T22:06:28Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8382/

AmplabJenkins · 2018-09-27T01:11:04Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8386/

AmplabJenkins · 2018-09-27T03:43:14Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8387/

AmplabJenkins · 2018-09-27T06:31:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8392/

richardliaw · 2018-09-28T01:48:48Z

This is ready for review. I'm not totally sure what we want for this TODO item:

when I had a dict of state, I still had to write file saving/restoring code

AmplabJenkins · 2018-09-28T02:47:03Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8418/

AmplabJenkins · 2018-09-28T03:00:37Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8419/

AmplabJenkins · 2018-09-28T03:20:19Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8421/

ericl · 2018-09-29T07:14:17Z

    def _save(self, checkpoint_dir):
        """Subclasses should override this to implement save().

+        See also: ray.tune.Trainable.save_dict.


I'm not super sure that's useful -- how about we say you can also return a dict, and if so it will be auto saved?
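
The suggestion above (return a dict and have it saved automatically) could look roughly like the sketch below. It is an illustration, not the code in this PR: `save_with_auto_dict`, the `save_fn` argument, and the file name `state.pkl` are all assumptions; only the error message text is taken from the diff under review.

```python
import os
import pickle


def save_with_auto_dict(save_fn, checkpoint_dir):
    """Run a `_save`-style callable and persist a dict result for the user.

    `save_fn` and the file name "state.pkl" are illustrative assumptions.
    """
    result = save_fn(checkpoint_dir)
    if isinstance(result, dict):
        # The user returned in-memory state: pickle it on their behalf.
        checkpoint_path = os.path.join(checkpoint_dir, "state.pkl")
        with open(checkpoint_path, "wb") as f:
            pickle.dump(result, f)
        return checkpoint_path
    if isinstance(result, str):
        # The user wrote their own files and returned the path.
        return result
    raise ValueError("Return value from `_save` must be dict or str.")
```

With this shape, a user who only has a dict of state never writes file handling code, while users with custom formats keep returning a path.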

ericl · 2018-09-29T07:14:56Z

-                would default to `checkpoint_dir`.
+            checkpoint_path: The checkpoint path that will be
+                passed to restore(). This can be different from
+                checkpoint_dir.


Do we actually have a use case for it being different? It would be nice to add a check that checkpoint_path is a child of checkpoint_dir.
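
The check proposed above could be sketched as follows. `is_child_path` is a hypothetical helper, and the sketch assumes Python 3 (`os.path.commonpath` is not available on Python 2.7). It treats a path equal to `checkpoint_dir` as valid.

```python
import os


def is_child_path(checkpoint_path, checkpoint_dir):
    """Return True if checkpoint_path is checkpoint_dir or lies beneath it."""
    # Resolve symlinks and relative segments so the comparison is reliable.
    parent = os.path.realpath(checkpoint_dir)
    child = os.path.realpath(checkpoint_path)
    # commonpath compares whole path components, so "/a/exp2" is not
    # mistaken for a child of "/a/exp" the way a string prefix test would be.
    return os.path.commonpath([parent, child]) == parent
```

On Windows, `os.path.commonpath` raises `ValueError` for paths on different drives, so a caller there may want to catch that and treat it as "not a child".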

AmplabJenkins · 2018-10-01T21:49:50Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8495/

AmplabJenkins · 2018-10-01T21:52:17Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8496/

ericl

Main question is whether the dict is correct since it needs to include the iteration id.

ericl · 2018-10-02T18:46:52Z

TUNE_RESULTS_DIR?

ericl · 2018-10-02T18:48:30Z

The checkpoint path needs to include the current iteration right? Otherwise they will collide?

Addressed
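
To illustrate the collision concern raised above: if the checkpoint location is derived from the iteration, two saves at different iterations cannot overwrite each other. A minimal sketch, where `checkpoint_dir_for` and the `checkpoint_<iteration>` naming pattern are assumptions for illustration:

```python
import os


def checkpoint_dir_for(logdir, iteration):
    """Build a per-iteration checkpoint directory name (hypothetical helper)."""
    # Embedding the iteration keeps checkpoints from different steps apart.
    return os.path.join(logdir, "checkpoint_{}".format(iteration))
```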

AmplabJenkins · 2018-10-04T00:25:14Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8535/

ericl · 2018-10-04T00:57:19Z

+            "episodes_total": self._episodes_total,
+            "saved_as_dict": saved_as_dict
+        }, open(checkpoint_path + ".tune_metadata", "wb"))
+        self._checkpoint_num += 1


Use self.iteration

I'm not sure if that makes sense when you do checkpoint_at_end... you get something like experiment/checkpoint_433.pkl
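
The hunk above writes the checkpoint metadata as a dict with named keys next to the checkpoint, using the `.tune_metadata` suffix. A small round-trip sketch of that idea follows. The helper names `write_metadata` and `read_metadata` are hypothetical, and the `with` blocks are a choice made here so the file handles are closed promptly.

```python
import pickle


def write_metadata(checkpoint_path, metadata):
    """Pickle a metadata dict alongside the checkpoint (hypothetical helper)."""
    with open(checkpoint_path + ".tune_metadata", "wb") as f:
        pickle.dump(metadata, f)


def read_metadata(checkpoint_path):
    """Load the metadata dict written by write_metadata."""
    with open(checkpoint_path + ".tune_metadata", "rb") as f:
        return pickle.load(f)
```

Named keys let a reader look up `metadata["iteration"]` directly and tolerate fields being added later, which a positional layout does not.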

ericl · 2018-10-04T02:14:20Z

The convention has always been iter in rllib though... Also did you guarantee that id increases monotonically like iter does?

On Wed, Oct 3, 2018, 6:01 PM Richard Liaw wrote:

In python/ray/tune/trainable.py:

@@ -225,10 +228,16 @@ def save(self, checkpoint_dir=None):
                 "wb"))
         else:
             raise ValueError("Return value from `_save` must be dict or str.")
-        pickle.dump([
-            self._experiment_id, self._iteration, self._timesteps_total,
-            self._time_total, self._episodes_total, saved_as_dict
-        ], open(checkpoint_path + ".tune_metadata", "wb"))
+        pickle.dump({
+            "experiment_id": self._experiment_id,
+            "iteration": self._iteration,
+            "checkpoint_num": self._checkpoint_num,
+            "timesteps_total": self._timesteps_total,
+            "time_total": self._time_total,
+            "episodes_total": self._episodes_total,
+            "saved_as_dict": saved_as_dict
+        }, open(checkpoint_path + ".tune_metadata", "wb"))
+        self._checkpoint_num += 1

I'm not sure if that makes sense when you do checkpoint_at_end... you get something like experiment/checkpoint_433.pkl

richardliaw · 2018-10-04T05:42:39Z

Yep, all Trial ids increase monotonically now as of a previous PR (#2874)... how strongly do you feel about using self.iteration as checkpoint index?

ericl · 2018-10-04T16:03:24Z

Using iteration is a must.


AmplabJenkins · 2018-10-05T02:42:45Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8551/

ericl · 2018-10-05T20:28:15Z

+
+        test_trainable = TestTrain()
+        checkpoint_1 = test_trainable.save()
+        import ipdb; ipdb.set_trace(context=5)


Should this be removed?

(You reviewed an older version of the code; see the most recent 2 commits.)

AmplabJenkins · 2018-10-05T22:04:24Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8558/


AmplabJenkins · 2018-10-12T01:33:02Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8598/

richardliaw · 2018-10-12T06:42:22Z

Errors unrelated

richardliaw added 3 commits September 15, 2018 02:02

reduce hyperopt verbosity

23016ad

Some trainable tweaks

5c9ed75

fix examples

e75e1c8

fix

53ce264

get rid of extra output

5027b2c

richardliaw added 2 commits September 26, 2018 14:14

Merge branch 'master' into tune_other_fixes

2c5531c

small fixes

08a1167

better error message

76fb1ee

more trainable tweaks

f8a2f52

richardliaw force-pushed the tune_other_fixes branch from 5aa5dca to f8a2f52 Compare September 27, 2018 06:09

richardliaw added 2 commits September 27, 2018 18:33

fix rllib setup

394681a

retweak wording for variant generation

9a9facf

richardliaw requested a review from ericl September 28, 2018 01:48

fix wording

493208d

ericl reviewed Sep 29, 2018


richardliaw added 4 commits October 1, 2018 13:01

add UT and saving functionality

012f111

flexible result directory

84e24a3

lint

627b114

Merge branch 'master' into tune_other_fixes

4809a1e

richardliaw assigned ericl Oct 2, 2018

ericl previously requested changes Oct 2, 2018


richardliaw added 2 commits October 3, 2018 16:39

add unique checkpointing

44a8968

nit

475653f

ericl reviewed Oct 4, 2018


fix

891be10

remove ipdb

8804b2a

ericl approved these changes Oct 5, 2018


fix directory creation

85df363

richardliaw added 2 commits October 11, 2018 17:13

lint

0d9e733

Merge branch 'tune_other_fixes' of github.com:richardliaw/ray into tune_other_fixes

d56b653

richardliaw merged commit f9b58d7 into ray-project:master Oct 12, 2018

richardliaw deleted the tune_other_fixes branch October 12, 2018 06:42

This was referenced Oct 26, 2018

[tune] Make it easier to implement a correct trainable class #2870

Closed

[tune] Mute dill warnings #2888

Closed
