LCORE-1062: reconnection logic in quota runner#921
LCORE-1062: reconnection logic in quota runner#921tisnik merged 4 commits intolightspeed-core:mainfrom
Conversation
WalkthroughAdds DB connection resiliency to the quota scheduler: introduces Changes
Sequence Diagram(s)sequenceDiagram
participant Scheduler
participant DB
participant Limiter
Scheduler->>DB: initial connect (retry loop up to N times)
alt connect successful
DB-->>Scheduler: connection
Scheduler->>Scheduler: for each limiter
Scheduler->>DB: connected(connection)? (SELECT 1)
alt connected
DB-->>Scheduler: OK
Scheduler->>Limiter: process limiter using DB connection
Limiter-->>Scheduler: result
else disconnected
Scheduler->>DB: close old connection
Scheduler->>DB: attempt reconnect (retry loop)
alt reconnect successful
DB-->>Scheduler: new connection
Scheduler->>Limiter: process limiter using DB connection
Limiter-->>Scheduler: result
else reconnect failed
Scheduler-->>Limiter: skip limiter (log warning)
end
end
else initial connect failed after N
DB-->>Scheduler: failure (log & abort start)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
src/runners/quota_scheduler.py (4)
30-32: Consider adding type annotations and documenting units.The constants lack type annotations, and
RECONNECTION_DELAYdoesn't indicate its unit (seconds). This aids readability and aligns with the project's type annotation guidelines.-DATABASE_RECONNECTION_COUNT = 10 -RECONNECTION_DELAY = 1 +DATABASE_RECONNECTION_COUNT: int = 10 +RECONNECTION_DELAY: int = 1 # seconds
61-71: Broad exception catch flagged by linter.The pipeline reports
W0718: Catching too general exception Exception. While catching broadly is reasonable for connection resilience, consider catching more specific exceptions (e.g.,psycopg2.OperationalError,sqlite3.Error) or adding a# pylint: disable=broad-exception-caughtcomment if this is intentional, consistent with line 104.- except Exception as e: + except Exception as e: # pylint: disable=broad-exception-caught
96-100: Consider closing the old connection and applying retry logic.Two concerns:
- The old connection isn't closed before reconnecting, which could leak resources.
- Unlike the initial connection (which retries 10 times), runtime reconnection only attempts once per limiter iteration, which may cause repeated failures if the database is temporarily unavailable.
Consider closing the stale connection and potentially reusing the retry logic:
if not connected(connection): + try: + connection.close() + except Exception: # pylint: disable=broad-exception-caught + pass # Connection already dead connection = connect(config)
34-110: Function has too many branches (15/12).The pipeline reports
R0912: Too many branches. The reconnection logic has increased complexity. Consider extracting the initial connection establishment (lines 61-71) into a helper function likeestablish_connection_with_retry(config) -> Optional[Any]to reduce branch count and improve testability.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/runners/quota_scheduler.py(4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.py: Use absolute imports for internal modules in LCS project (e.g.,from auth import get_auth_dependency)
All modules must start with descriptive docstrings explaining their purpose
Uselogger = logging.getLogger(__name__)pattern for module logging
All functions must include complete type annotations for parameters and return types, using modern syntax (str | int) andOptional[Type]orType | None
All functions must have docstrings with brief descriptions following Google Python docstring conventions
Function names must use snake_case with descriptive, action-oriented names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying input parameters
Useasync deffor I/O operations and external API calls
All classes must include descriptive docstrings explaining their purpose following Google Python docstring conventions
Class names must use PascalCase with descriptive names and standard suffixes:Configurationfor config classes,Error/Exceptionfor exceptions,Resolverfor strategy patterns,Interfacefor abstract base classes
Abstract classes must use ABC with@abstractmethoddecorators
Include complete type annotations for all class attributes in Python classes
Useimport loggingand module logger pattern with standard log levels: debug, info, warning, error
Files:
src/runners/quota_scheduler.py
🧬 Code graph analysis (1)
src/runners/quota_scheduler.py (2)
src/utils/connection_decorator.py (1)
connection(6-48)src/models/config.py (1)
config(324-341)
🪛 GitHub Actions: Python linter
src/runners/quota_scheduler.py
[error] 66-66: pylint: W0718 Catching too general exception Exception (broad-exception-caught).
[error] 34-34: pylint: R0912 Too many branches (15/12) (too-many-branches).
[error] 123-123: pylint: W0718 Catching too general exception Exception (broad-exception-caught).
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-pr
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
| def connected(connection: Any) -> bool: | ||
| """Check if DB is still connected.""" | ||
| if connection is None: | ||
| logger.warning("Not connected, need to reconnect later") | ||
| return False | ||
| try: | ||
| with connection.cursor() as cursor: | ||
| cursor.execute("SELECT 1") | ||
| logger.info("Connection to storage is ok") | ||
| return True | ||
| except Exception as e: | ||
| logger.error("Disconnected from storage: %s", e) | ||
| return False | ||
|
|
There was a problem hiding this comment.
SQLite incompatibility: avoid cursor context manager.
The function uses with connection.cursor() as cursor: but comments elsewhere in this file (lines 241-242, 282-283) explicitly warn that context managers cannot be used with SQLite cursors for compatibility reasons. This will likely cause failures when using SQLite storage.
Additionally, the docstring is incomplete per Google conventions (missing Parameters/Returns), and the broad exception should have a pylint disable comment to match the pipeline requirement.
def connected(connection: Any) -> bool:
- """Check if DB is still connected."""
+ """Check if DB is still connected.
+
+ Parameters:
+ connection: Database connection object to verify.
+
+ Returns:
+ bool: True if connection is active, False otherwise.
+ """
if connection is None:
logger.warning("Not connected, need to reconnect later")
return False
try:
- with connection.cursor() as cursor:
- cursor.execute("SELECT 1")
+ # for compatibility with SQLite it is not possible to use context manager
+ cursor = connection.cursor()
+ cursor.execute("SELECT 1")
+ cursor.close()
logger.info("Connection to storage is ok")
return True
- except Exception as e:
+ except Exception as e: # pylint: disable=broad-exception-caught
logger.error("Disconnected from storage: %s", e)
return False🧰 Tools
🪛 GitHub Actions: Python linter
[error] 123-123: pylint: W0718 Catching too general exception Exception (broad-exception-caught).
🤖 Prompt for AI Agents
In src/runners/quota_scheduler.py around lines 113 to 126, the connected()
function must be changed to avoid using a cursor context manager (which breaks
compatibility with SQLite), so replace the "with connection.cursor() as cursor:"
pattern with an explicit cursor = connection.cursor() and ensure cursor.close()
in a finally block or after use; update the docstring to Google style including
Parameters (connection: Any) and Returns (bool) descriptions; and add a pylint
disable comment for the broad exception on the except line (e.g., "# pylint:
disable=broad-except") while keeping the exception handling otherwise the same.
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/runners/quota_scheduler.py (1)
114-128: Complete the docstring per Google conventions.The cursor management and pylint comment have been addressed from the previous review, but the docstring still lacks the required Parameters and Returns sections per Google Python docstring conventions.
Apply this diff to complete the docstring:
def connected(connection: Any) -> bool: - """Check if DB is still connected.""" + """Check if DB is still connected. + + Parameters: + connection: Database connection object to verify. + + Returns: + bool: True if connection is active, False otherwise. + """ if connection is None:
🧹 Nitpick comments (1)
src/runners/quota_scheduler.py (1)
30-31: Add docstring comments for module-level constants.These constants control critical reconnection behavior but lack documentation explaining their purpose and units.
Consider adding comments:
+# Maximum number of connection attempts before giving up DATABASE_RECONNECTION_COUNT: int = 10 +# Delay in seconds between reconnection attempts RECONNECTION_DELAY: int = 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/runners/quota_scheduler.py(4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.py: Use absolute imports for internal modules in LCS project (e.g.,from auth import get_auth_dependency)
All modules must start with descriptive docstrings explaining their purpose
Uselogger = logging.getLogger(__name__)pattern for module logging
All functions must include complete type annotations for parameters and return types, using modern syntax (str | int) andOptional[Type]orType | None
All functions must have docstrings with brief descriptions following Google Python docstring conventions
Function names must use snake_case with descriptive, action-oriented names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying input parameters
Useasync deffor I/O operations and external API calls
All classes must include descriptive docstrings explaining their purpose following Google Python docstring conventions
Class names must use PascalCase with descriptive names and standard suffixes:Configurationfor config classes,Error/Exceptionfor exceptions,Resolverfor strategy patterns,Interfacefor abstract base classes
Abstract classes must use ABC with@abstractmethoddecorators
Include complete type annotations for all class attributes in Python classes
Useimport loggingand module logger pattern with standard log levels: debug, info, warning, error
Files:
src/runners/quota_scheduler.py
🧬 Code graph analysis (1)
src/runners/quota_scheduler.py (2)
src/utils/connection_decorator.py (1)
connection(6-48)src/models/config.py (1)
config(324-341)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: build-pr
🔇 Additional comments (2)
src/runners/quota_scheduler.py (2)
62-72: LGTM! Solid retry logic with appropriate error handling.The for-else pattern correctly implements the retry mechanism with proper delays and logging. The broad exception handling is necessary here and correctly disabled with pylint.
97-101: LGTM! Appropriate runtime reconnection handling.The connection check before each limiter with a single reconnect attempt is a good design choice—it prevents blocking the scheduler loop while still attempting recovery. The continue ensures other limiters are processed even if one fails.
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/runners/quota_scheduler.py (1)
119-134: Complete the docstring per coding guidelines.The SQLite cursor context manager issue has been correctly fixed. However, the docstring is still incomplete according to Google Python docstring conventions.
As per coding guidelines, all functions must include complete docstrings with Parameters and Returns sections.
Apply this diff to complete the docstring:
def connected(connection: Any) -> bool: - """Check if DB is still connected.""" + """ + Check if DB is still connected. + + Parameters: + connection (Any): Database connection object to verify. + + Returns: + bool: True if connection is active, False otherwise. + """ if connection is None:
🧹 Nitpick comments (1)
src/runners/quota_scheduler.py (1)
30-31: Consider making reconnection parameters configurable.The hardcoded retry count and delay work for typical scenarios but may need adjustment in production environments with varying network conditions or database load patterns. Consider adding these to
QuotaHandlersConfigurationor reading them from environment variables.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/runners/quota_scheduler.py(4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.py: Use absolute imports for internal modules in LCS project (e.g.,from auth import get_auth_dependency)
All modules must start with descriptive docstrings explaining their purpose
Uselogger = logging.getLogger(__name__)pattern for module logging
All functions must include complete type annotations for parameters and return types, using modern syntax (str | int) andOptional[Type]orType | None
All functions must have docstrings with brief descriptions following Google Python docstring conventions
Function names must use snake_case with descriptive, action-oriented names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying input parameters
Useasync deffor I/O operations and external API calls
All classes must include descriptive docstrings explaining their purpose following Google Python docstring conventions
Class names must use PascalCase with descriptive names and standard suffixes:Configurationfor config classes,Error/Exceptionfor exceptions,Resolverfor strategy patterns,Interfacefor abstract base classes
Abstract classes must use ABC with@abstractmethoddecorators
Include complete type annotations for all class attributes in Python classes
Useimport loggingand module logger pattern with standard log levels: debug, info, warning, error
Files:
src/runners/quota_scheduler.py
🧬 Code graph analysis (1)
src/runners/quota_scheduler.py (2)
src/utils/connection_decorator.py (1)
connection(6-48)src/models/config.py (1)
config(324-341)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-pr
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🔇 Additional comments (2)
src/runners/quota_scheduler.py (2)
62-72: LGTM: Robust initial connection retry logic.The retry loop correctly handles transient connection failures during startup. The for-else pattern ensures the function returns False only after exhausting all retry attempts.
97-106: LGTM: Runtime reconnection enhances resilience.The connection health check before processing each limiter correctly handles connection drops during operation. The safe cleanup and continue-on-failure pattern ensures the scheduler remains operational even if individual limiter processing fails.
6e1aa1c to
7e4744f
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/runners/quota_scheduler.py (2)
30-31: Consider documenting the reconnection constants.While the constant names are descriptive, adding brief comments would clarify their purpose and make future tuning easier.
Example:
+# Maximum number of connection attempts before giving up DATABASE_RECONNECTION_COUNT: int = 10 +# Delay in seconds between reconnection attempts RECONNECTION_DELAY: int = 1
119-141: Ensure cursor is closed even when execute fails.If
cursor.execute()raises an exception,cursor.close()won't be called, potentially leaking cursor resources in this long-running process.Apply this diff to guarantee cursor cleanup:
def connected(connection: Any) -> bool: """Check if DB is still connected. Parameters: connection: Database connection object to verify. Returns: bool: True if connection is active, False otherwise. """ if connection is None: logger.warning("Not connected, need to reconnect later") return False try: # for compatibility with SQLite it is not possible to use context manager there cursor = connection.cursor() - cursor.execute("SELECT 1") - cursor.close() + try: + cursor.execute("SELECT 1") + finally: + cursor.close() logger.info("Connection to storage is ok") return True except Exception as e: # pylint: disable=broad-exception-caught logger.error("Disconnected from storage: %s", e) return False
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/runners/quota_scheduler.py(4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.py: Use absolute imports for internal modules in LCS project (e.g.,from auth import get_auth_dependency)
All modules must start with descriptive docstrings explaining their purpose
Uselogger = logging.getLogger(__name__)pattern for module logging
All functions must include complete type annotations for parameters and return types, using modern syntax (str | int) andOptional[Type]orType | None
All functions must have docstrings with brief descriptions following Google Python docstring conventions
Function names must use snake_case with descriptive, action-oriented names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying input parameters
Useasync deffor I/O operations and external API calls
All classes must include descriptive docstrings explaining their purpose following Google Python docstring conventions
Class names must use PascalCase with descriptive names and standard suffixes:Configurationfor config classes,Error/Exceptionfor exceptions,Resolverfor strategy patterns,Interfacefor abstract base classes
Abstract classes must use ABC with@abstractmethoddecorators
Include complete type annotations for all class attributes in Python classes
Useimport loggingand module logger pattern with standard log levels: debug, info, warning, error
Files:
src/runners/quota_scheduler.py
🧬 Code graph analysis (1)
src/runners/quota_scheduler.py (2)
src/utils/connection_decorator.py (1)
connection(6-48)src/models/config.py (1)
config(324-341)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-pr
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
| for _ in range(DATABASE_RECONNECTION_COUNT): | ||
| try: | ||
| connection = connect(config) | ||
| if connection is not None: | ||
| break | ||
| except Exception as e: # pylint: disable=broad-exception-caught | ||
| logger.warning("Can not connect to database, will try later: %s", e) | ||
| sleep(RECONNECTION_DELAY) | ||
| else: | ||
| logger.warning("Can not connect to database, skipping") | ||
| return False |
There was a problem hiding this comment.
Optimize retry loop to avoid unnecessary final delay.
The current loop sleeps after each attempt, including the final failed one. This adds an unnecessary 1-second delay before returning.
Consider moving the sleep to the beginning of the loop after the first iteration:
+ retry_count = 0
- for _ in range(DATABASE_RECONNECTION_COUNT):
+ while retry_count < DATABASE_RECONNECTION_COUNT:
+ if retry_count > 0:
+ sleep(RECONNECTION_DELAY)
+ retry_count += 1
try:
connection = connect(config)
if connection is not None:
break
except Exception as e: # pylint: disable=broad-exception-caught
logger.warning("Can not connect to database, will try later: %s", e)
- sleep(RECONNECTION_DELAY)
else:
logger.warning("Can not connect to database, skipping")
return False🤖 Prompt for AI Agents
In src/runners/quota_scheduler.py around lines 62 to 72, the retry loop
currently calls sleep(RECONNECTION_DELAY) after every attempt, causing an
unnecessary delay after the final failed attempt; change the logic so the delay
happens only between attempts (either move the sleep to the start of the loop
after the first iteration or conditionally call sleep only when the current
attempt is not the last), keeping the same number of retries and preserving
exception handling and logging.
| if not connected(connection): | ||
| # the old connection might be closed to avoid resource leaks | ||
| try: | ||
| connection.close() | ||
| except Exception: # pylint: disable=broad-exception-caught | ||
| pass # Connection already dead | ||
| connection = connect(config) | ||
| if connection is None: | ||
| logger.warning("Can not connect to database, skipping") | ||
| continue |
There was a problem hiding this comment.
Reinitialize tables after reconnecting to database.
When the connection is lost and re-established, the quota table may not exist (e.g., if the database was recreated or wiped). Without calling init_tables() after reconnection, subsequent quota operations will fail.
Apply this diff to ensure tables are initialized after reconnection:
if not connected(connection):
# the old connection might be closed to avoid resource leaks
try:
connection.close()
except Exception: # pylint: disable=broad-exception-caught
pass # Connection already dead
connection = connect(config)
if connection is None:
logger.warning("Can not connect to database, skipping")
continue
+ # Reinitialize tables after reconnection
+ if create_quota_table is not None:
+ init_tables(connection, create_quota_table)
quota_revocation(
connection, limiter, increase_quota_statement, reset_quota_statement
)🤖 Prompt for AI Agents
In src/runners/quota_scheduler.py around lines 97 to 106, when reconnecting to
the DB the code does not reinitialize schema/tables; after assigning connection
= connect(config) call the existing init_tables(...) helper using the same
arguments/signature used elsewhere (e.g., init_tables(connection) or
init_tables(config, connection)), catch and log any exceptions from init_tables,
and if initialization fails log a warning and continue the loop so subsequent
quota operations don't run against a missing schema.
Description
LCORE-1062: reconnection logic in quota runner
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.