Price-reward: strengthen price signal and add negative-price overdrive #27

rbx merged 2 commits into FairRootGroup:master from
Conversation
enlorenz
commented
Mar 2, 2026
- keep legacy price reward path for ablation/debugging
- rework active price reward to use saturating price-advantage and job-volume terms
- normalize active price reward against a fixed [0, 1] baseline range
- add a dedicated negative-price overdrive branch for work done at < 0 prices
- decouple overdrive job scaling via NEGATIVE_PRICE_JOB_TAU so negative-price reward ramps early even when PRICE_JOB_TAU is large
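As a rough illustration of the shape described above, a saturating price-advantage reward with a dedicated negative-price overdrive branch could look like the following sketch. All names and constant values here are hypothetical stand-ins, not the repository's actual code:

```python
import math

# Hypothetical constants mirroring the PR description; real values live in the repo.
PRICE_JOB_TAU = 8.0            # jobs needed before the low-price reward saturates
NEGATIVE_PRICE_JOB_TAU = 2.0   # smaller tau => overdrive ramps up after only a few jobs

def price_reward(current_price: float, avg_future_price: float, num_jobs: int) -> float:
    """Sketch: saturating price-advantage reward with negative-price overdrive."""
    if num_jobs == 0:
        return 0.0
    if current_price < 0.0:
        # Overdrive branch: work done at negative prices ramps early,
        # because NEGATIVE_PRICE_JOB_TAU is decoupled from PRICE_JOB_TAU.
        job_term = 1.0 - math.exp(-num_jobs / NEGATIVE_PRICE_JOB_TAU)
        return 1.0 + job_term  # can exceed the [0, 1] baseline range
    # Saturating price-advantage term, scaled by a saturating job-volume term.
    advantage = (avg_future_price - current_price) / max(abs(avg_future_price), 1e-9)
    job_term = 1.0 - math.exp(-num_jobs / PRICE_JOB_TAU)
    return math.tanh(advantage) * job_term
```

Decoupling the overdrive tau is what lets the negative-price bonus saturate after a handful of jobs even when `PRICE_JOB_TAU` is tuned large for the ordinary low-price path.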
📝 Walkthrough

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/reward_calculation.py (1)
125-125: Remove the stray no-op string literal in `_reward_price`.

Line 125 is an extra standalone string literal, which is not used and reads like a second docstring.

🧹 Proposed cleanup

```diff
-        """Calculates rewards with pre-computed normalization bounds."""
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `src/reward_calculation.py` at line 125: remove the stray standalone string literal inside the `_reward_price` function (it acts like an accidental second docstring and does nothing); locate the `_reward_price` definition and delete that extra quoted string so the function contains only its intended docstring (if any) and functional code.

src/prices.py (1)
27-28: Replace the always-false gate with an explicit feature toggle.

Line 28 (`min_price < 1 and 0`) permanently disables this path in a non-obvious way. Use a named flag so intent is explicit and safe to revert after testing.

♻️ Proposed refactor

```diff
-        # hard block this for now during testing to keep things simple, but we could consider allowing it in the future with proper handling.
-        if min_price < 1 and 0:
+        # Temporary test toggle: keep raw prices (including negatives) unless explicitly enabled.
+        APPLY_PRICE_SHIFT = False
+        if min_price < 1 and APPLY_PRICE_SHIFT:
             self.price_shift = 1 - min_price
             prices = prices + self.price_shift
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `src/prices.py` around lines 27-28: the condition `if min_price < 1 and 0` is an always-false gate and should be replaced with a clear feature toggle. Introduce a named boolean flag (e.g. `ALLOW_MIN_PRICE_BELOW_ONE` or `ENABLE_MIN_PRICE_OVERRIDE`) and use it in the conditional as `if min_price < 1 and <FLAG>`. Read the flag from a module-level constant or config/env so the path can be safely enabled/disabled and is obvious to future readers; ensure any tests or callers referencing this behavior use the new flag.
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:

In `src/reward_calculation.py`:
- Around lines 55-60: The legacy normalization uses a fixed 0.0 minimum, which is wrong because `_reward_price_legacy` can be negative (`context_avg - current_price`). Update initialization so that `_min_price_reward_legacy = -_max_price_reward_legacy` (compute `_max_price_reward_legacy` via `_reward_price_legacy(self.prices.MIN_PRICE, self.prices.MAX_PRICE, ...)`) so the min/max reflect symmetric extremes, and change `_reward_price_normalized_legacy` to normalize using `(raw - _min_price_reward_legacy) / (_max_price_reward_legacy - _min_price_reward_legacy)` and clamp the result to [0, 1]. Reference symbols: `_min_price_reward_legacy`, `_max_price_reward_legacy`, `_reward_price_legacy`, `_reward_price_normalized_legacy`, `context_avg`, `current_price`.
♻️ Duplicate comments (1)
src/reward_calculation.py (1)
65-70: ⚠️ Potential issue | 🟠 Major

Fix legacy price normalization bounds/clamping to avoid out-of-range values.

At line 65, `_min_price_reward_legacy` is fixed at `0.0`, but line 117 can produce a negative raw legacy reward. This makes line 125 return values below 0 for valid inputs.

🐛 Proposed fix

```diff
-        self._min_price_reward_legacy = 0.0
+        self._min_price_reward_legacy = self._reward_price_legacy(
+            self.prices.MAX_PRICE,
+            self.prices.MIN_PRICE,
+            MAX_NEW_JOBS_PER_HOUR,
+        )
         self._max_price_reward_legacy = self._reward_price_legacy(
             self.prices.MIN_PRICE,
             self.prices.MAX_PRICE,
             MAX_NEW_JOBS_PER_HOUR,
         )
@@
-        return self._normalize(current_reward, self._min_price_reward_legacy, self._max_price_reward_legacy)
+        normalized = self._normalize(current_reward, self._min_price_reward_legacy, self._max_price_reward_legacy)
+        return float(np.clip(normalized, 0.0, 1.0))
```

Also applies to: 120-126

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `src/reward_calculation.py` around lines 65-70: compute true legacy reward bounds by evaluating `_reward_price_legacy` at the price endpoints and set `_min_price_reward_legacy`/`_max_price_reward_legacy` to the min/max of those evaluations (instead of hardcoding 0.0), then ensure the normalization logic that scales raw legacy rewards clamps the final normalized value into the allowed range (e.g., [0.0, 1.0]) so negative or out-of-range normalized rewards cannot be returned.
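A self-contained sketch of the symmetric-bounds normalization the comment asks for, using toy stand-ins for the class internals (none of these names or formulas are the repository's actual implementation):

```python
# Toy constants standing in for the Prices attributes referenced in the review.
MIN_PRICE, MAX_PRICE, MAX_NEW_JOBS_PER_HOUR = 0.0, 100.0, 10

def reward_price_legacy(current_price: float, context_avg: float, num_jobs: int) -> float:
    # Toy legacy reward: positive below the context average, negative above it.
    return (context_avg - current_price) * min(num_jobs, MAX_NEW_JOBS_PER_HOUR)

# Evaluate at the price extremes instead of hardcoding the minimum to 0.0,
# so the bounds are symmetric around zero.
_max_reward = reward_price_legacy(MIN_PRICE, MAX_PRICE, MAX_NEW_JOBS_PER_HOUR)
_min_reward = -_max_reward

def reward_price_normalized_legacy(current_price: float, context_avg: float, num_jobs: int) -> float:
    raw = reward_price_legacy(current_price, context_avg, num_jobs)
    norm = (raw - _min_reward) / (_max_reward - _min_reward)
    return max(0.0, min(1.0, norm))  # clamp guards edge cases
```

With symmetric bounds, a zero raw reward maps to 0.5 and the worst-case negative raw reward maps to 0.0 instead of escaping below the range.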
🧹 Nitpick comments (2)
src/reward_calculation.py (2)
252-256: Variable naming may cause confusion.

The variable `price_reward_norm` is no longer truly normalized to [0, 1], since it can exceed 1.0 with negative-price overdrive. Consider renaming it to `price_reward` or `price_reward_raw` to avoid implying normalization, or update the return tuple documentation accordingly.

♻️ Optional: Rename for clarity

```diff
-        # 2. increase reward if jobs were scheduled in this step and the current price is below average, with fast saturation and extra bonus for negative-price overdrive.
-        price_reward_norm = self._reward_price(
+        # 2. increase reward if jobs were scheduled in this step and the current price is below average, with fast saturation and extra bonus for negative-price overdrive.
+        price_reward = self._reward_price(
             current_price, average_future_price, num_processed_jobs
         )
-        price_reward_weighted = weights.price_weight * price_reward_norm
+        price_reward_weighted = weights.price_weight * price_reward
```

Note: This would require updating the return statement and docstring as well.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `src/reward_calculation.py` around lines 252-256: the name `price_reward_norm` is misleading because `_reward_price` can return values > 1.0; rename `price_reward_norm` to `price_reward` (or `price_reward_raw`) and update all usages (e.g., the multiplication into `price_reward_weighted` and any return values or docstrings) so the symbol change is consistent, and update the docstring/return documentation for `_reward_price` if it currently claims a normalized [0, 1] value.
120-125: Consider clipping legacy normalized output for consistency.

The legacy normalization relies on symmetric bounds but doesn't clip the final result. While the bounds should keep values in [0, 1] for typical inputs, edge cases could produce values outside this range. Since this path is for ablation/debugging, you may want to add a clip for consistent comparison with the active path.

♻️ Optional: Add clipping for edge-case safety

```diff
     def _reward_price_normalized_legacy(self, current_price: float, average_future_price: float, num_processed_jobs: int) -> float:
         """Legacy normalized price reward [0, 1] in typical operating range."""
         if num_processed_jobs == 0:
             return 0.0
         current_reward = self._reward_price_legacy(current_price, average_future_price, num_processed_jobs)
-        return self._normalize(current_reward, self._min_price_reward_legacy, self._max_price_reward_legacy)
+        normalized = self._normalize(current_reward, self._min_price_reward_legacy, self._max_price_reward_legacy)
+        return float(np.clip(normalized, 0.0, 1.0))
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `src/reward_calculation.py` around lines 120-125: the legacy normalized reward in `_reward_price_normalized_legacy` may produce out-of-range values in edge cases; after computing `current_reward` and calling `self._normalize(current_reward, self._min_price_reward_legacy, self._max_price_reward_legacy)`, clamp the resulting float to [0.0, 1.0] (e.g., using min/max or a clamp helper) before returning, to ensure consistency with the active path and safe comparisons during ablation/debugging.
Force-pushed from f187ff1 to a83a843
…ling

- make the active price signal signed so expensive execution can produce negative reward
- add asymmetric job saturation constants:
  - PRICE_JOB_TAU_POS for low-price reward growth
  - PRICE_JOB_TAU_NEG for faster high-price penalty growth
- keep negative-price overdrive decoupled via NEGATIVE_PRICE_JOB_TAU
- preserve negative-price overdrive floor/bonus behavior
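Illustratively, a signed signal with asymmetric job saturation like the commit message describes could be sketched as follows (constants and names are hypothetical, not the repository's code):

```python
import math

# Hypothetical constants named after the commit message; actual values live in the repo.
PRICE_JOB_TAU_POS = 8.0  # slow saturation for low-price reward growth
PRICE_JOB_TAU_NEG = 3.0  # faster saturation for high-price penalty growth

def signed_price_signal(price_advantage: float, num_jobs: int) -> float:
    """Signed price signal: positive below the average price, negative above it,
    with asymmetric job saturation so penalties ramp faster than rewards."""
    if num_jobs == 0:
        return 0.0
    tau = PRICE_JOB_TAU_POS if price_advantage >= 0 else PRICE_JOB_TAU_NEG
    job_term = 1.0 - math.exp(-num_jobs / tau)
    return math.tanh(price_advantage) * job_term
```

Because the negative branch uses the smaller tau, the same number of jobs produces a larger-magnitude penalty for expensive execution than reward for cheap execution.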
🧹 Nitpick comments (2)
src/environment.py (1)
478-485: Clarify that `price_reward` is not normalized anymore.

Line 478 still labels these as normalized components, but `price_reward` is now signed and can exceed 1.0. Updating the label/comment will prevent misreading plot diagnostics.

♻️ Suggested clarification
```diff
-        # Store normalized reward components for plotting
+        # Store reward components for plotting.
+        # NOTE: price_reward is signed and not strictly normalized.
         self.metrics.eff_rewards.append(eff_reward_norm * 100)
         self.metrics.price_rewards.append(price_reward * 100)
         self.metrics.job_age_penalties.append(job_age_penalty_norm * 100)
         self.metrics.idle_penalties.append(idle_penalty_norm * 100)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `src/environment.py` around lines 478-485: the comment above the reward-appending block is misleading because `price_reward` is no longer normalized; update it from "Store normalized reward components for plotting" to something like "Store reward components for plotting (price_reward is signed/unnormalized and may exceed 1.0)", and add a short inline note next to the `self.metrics.price_rewards` and `self.metrics.episode_price_rewards` appends to clarify that `price_reward` is unnormalized so downstream plotting/analysis treats its scale appropriately.

src/prices.py (1)
20-20: Make `ENABLE_PRICE_SHIFT` configurable at construction if this is meant to be a real flag.

Right now it is always reset to `False` inside `__init__`, so callers cannot enable shifting without editing source. Consider wiring it as an init parameter.
-class Prices: +class Prices: @@ - def __init__(self, external_prices: np.ndarray | list[float] | None = None) -> None: + def __init__( + self, + external_prices: np.ndarray | list[float] | None = None, + enable_price_shift: bool = False, + ) -> None: @@ - self.ENABLE_PRICE_SHIFT: bool = False + self.ENABLE_PRICE_SHIFT: bool = enable_price_shiftAlso applies to: 28-29
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/prices.py` at line 20, ENABLE_PRICE_SHIFT is hardcoded to False inside __init__; make it configurable by adding an __init__ parameter (e.g., enable_price_shift: bool = False) and assign it to self.ENABLE_PRICE_SHIFT (self.ENABLE_PRICE_SHIFT = enable_price_shift). Do the same pattern for the other boolean flags initialized nearby in __init__ (the flags set around lines 28-29) so callers can enable them via constructor parameters and defaults remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/environment.py`:
- Around line 478-485: The comment above the reward-appending block is
misleading because price_reward is no longer normalized; update the comment from
"Store normalized reward components for plotting" to something like "Store
reward components for plotting (price_reward is signed/unnormalized and may
exceed 1.0)" and add a short inline note next to the self.metrics.price_rewards
and self.metrics.episode_price_rewards appends to clarify that price_reward is
unnormalized so downstream plotting/analysis treats its scale appropriately.
In `@src/prices.py`:
- Line 20: ENABLE_PRICE_SHIFT is hardcoded to False inside __init__; make it
configurable by adding an __init__ parameter (e.g., enable_price_shift: bool =
False) and assign it to self.ENABLE_PRICE_SHIFT (self.ENABLE_PRICE_SHIFT =
enable_price_shift). Do the same pattern for the other boolean flags initialized
nearby in __init__ (the flags set around lines 28-29) so callers can enable them
via constructor parameters and defaults remain unchanged.