feat(server): observability improvements and startup/retry fixes #159
Open
anderslindho wants to merge 6 commits intomasterfrom
Open
feat(server): observability improvements and startup/retry fixes #159anderslindho wants to merge 6 commits intomasterfrom
anderslindho wants to merge 6 commits intomasterfrom
Conversation
…ssion Defaulting to True changed the failure mode from "exhaust retries and release the global lock" to "hold the lock indefinitely", which freezes all commit processing when CF is unreachable. False restores sensible behaviour: the commit drops and the IOC retries on its next reconnect.
…thread clean_service() blocks the reactor for the full duration of the sweep, holding the global lock and preventing all IOC commits. cleanOnStart now schedules the sweep as a background thread after startup, so commits are accepted immediately. cleanOnStop uses deferToThread so the reactor stays live during shutdown while the lock is still held.
A numeric iocName that matches the source port range means the iocid changes on every reconnect, silently accumulating stale channels in CF. Log a warning when this is detected so misconfigured reccasters can be found. Distinguishes disconnect-before-upload from update-without-initial to avoid the same warning firing twice for the same event.
There is currently no runtime visibility into whether maxActive is throttling connections or how many channels the CF processor is tracking. A LoopingCall in RecService logs active/queued connections against the limit every 60 s; CFProcessor logs known_iocs and tracked_channels on the same interval. Both are configurable via statusInterval (0 disables).
Without timing, slow CF commits are invisible until they cause a backlog. Per-attempt duration is now logged so latency regressions show up in the log stream. push_to_cf also checks processor.running on each iteration so a service stop during a retry loop drains immediately rather than waiting up to 60 s per attempt.
Exposes connection state, CF processor state, and CF push performance as Prometheus metrics on a configurable HTTP port (metricsPort, disabled by default). Requires the optional prometheus-client dependency (pip install recceiver[metrics]). Gracefully degrades to no-ops if prometheus-client is not installed so the dependency is truly optional.
|
jacomago
reviewed
May 8, 2026
| """Stop the CFProcessor service with lock held. | ||
|
|
||
| If clean_on_stop is enabled, mark all channels as inactive. | ||
| The sweep runs in a thread so the reactor stays live during shutdown. |
Contributor
There was a problem hiding this comment.
We don't want reactor live during any type of clean do we?
| record_info_by_name = CFProcessor.record_info_by_name(record_infos, ioc_info) | ||
| self.update_ioc_infos(transaction, ioc_info, records_to_delete, record_info_by_name) | ||
| if not transaction.connected and ioc_info.ioc_id not in self.iocs: | ||
| _log.warning( |
| # Time interval for sending recceiver advertisments | ||
| #announceInterval = 15.0 | ||
|
|
||
| # Interval in seconds between periodic status log lines (0 to disable) |
Contributor
There was a problem hiding this comment.
not 100% sure about this, but ok.
And also update recceiver-full.conf.
| count = 0 | ||
| sleep = 1.0 | ||
| while processor.cf_config.push_always_retry or count < processor.cf_config.push_max_retries: | ||
| if not processor.running: |
Contributor
There was a problem hiding this comment.
Is there a way to have a test for this?
simon-ess
reviewed
May 8, 2026
| ioc_name = str(port) | ||
| _log.debug("IOC at %s:%d did not send IOCNAME; using port as iocName", host, port) | ||
| _log.debug("IOC at %s:%d has no iocName; using source port as iocName", host, port) | ||
| if ioc_name.isdigit() and 1024 <= int(ioc_name) <= 65535: |
Contributor
There was a problem hiding this comment.
This condition seems odd to me. Does this ever occur outside of the conditional right above it?
| def stopService(self): | ||
| _log.info("Stopping RecService") | ||
|
|
||
| if hasattr(self, "_statusLoop") and self._statusLoop.running: |
Contributor
There was a problem hiding this comment.
Ugh, I don't love using hasattr. Why not just initialise _statusLoop to None?
| self.lock.release() | ||
|
|
||
| self._statusLoop = task.LoopingCall(self._logStatus) | ||
| self._statusLoop.start(60.0, now=False) |
Contributor
There was a problem hiding this comment.
This doesn't look very configurable to me...
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Addresses several operational pain points found in production at ESS.
Fixes
push_always_retrydefaulted toTrue(a regression), which could freeze all commit processing indefinitely when CF was unreachable. Default is nowFalse.cleanOnStart/cleanOnStopswept channels synchronously on the reactor thread, blocking IOC commits for the full duration of the sweep. Both now run in a background thread.Observability
maxActiveand the CF processor's IOC and channel counts - the key indicators that were invisible during the production incident.pip install recceiver[metrics]). Exposes connection state, CF processor state, and a push duration histogram. Degrades gracefully to no-ops if prometheus-client is not installed.