📝 Walkthrough

A comprehensive metrics instrumentation system is introduced across API handlers, RPC routing, and blockchain components. A new metrics package provides runtime and store metric exporters with Prometheus integration. Handler constructors are updated to accept metrics exporters; quorum verification is enhanced with application context; RPC routing is refactored with observability middleware and connection origin tracking; and the database store is extended with metric aggregation methods.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant RPCRouter
    participant ObservabilityMiddleware
    participant Handler
    participant MetricsExporter
    participant Database
    Client->>RPCRouter: RPC Request
    RPCRouter->>ObservabilityMiddleware: Process Request
    Note over ObservabilityMiddleware: Extract method & path<br/>Record RPC message
    ObservabilityMiddleware->>Handler: Invoke Handler
    alt Handler Success
        Handler->>Database: Store Operation (metricStore)
        Database-->>Handler: Result
        Handler->>MetricsExporter: IncRPCRequest(method, path, true)
        Handler->>MetricsExporter: ObserveRPCDuration(method, path, true, duration)
    else Handler Error
        Handler->>MetricsExporter: IncRPCRequest(method, path, false)
        Handler->>MetricsExporter: ObserveRPCDuration(method, path, false, duration)
    end
    Handler-->>RPCRouter: Response
    RPCRouter-->>Client: RPC Response
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~55 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Summary of Changes

Hello @dimast-x, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the observability of the Clearnode system by integrating a new Prometheus-based metrics framework. It introduces granular tracking for various operational aspects, including RPC interactions, blockchain activities, and application state changes, alongside aggregate metrics derived from the database. This refactoring provides more robust and detailed monitoring, allowing better insight into system performance and health.
Code Review
This pull request introduces a comprehensive metrics system using Prometheus, which is a great addition for observability. The changes are well-structured, with a new metrics package, injection of the metrics exporter into relevant components, and a clever metricStore wrapper to ensure metrics are only recorded for successful database transactions. The new middleware for RPC metrics and the background worker for store-based metrics are also well-implemented.
I have a couple of suggestions for improvement:

- Using descriptive string labels for `action_type` metrics in the blockchain worker.
- Ensuring consistency in the metric types (Counter vs. Gauge) for session key metrics.
Overall, this is a solid implementation of a crucial feature.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
pkg/rpc/node.go (1)
188-201: ⚠️ Potential issue | 🟡 Minor: Stale doc comment references removed callbacks.

Line 194 still mentions "Invokes lifecycle callbacks (OnConnect, OnDisconnect, etc.)" but these callbacks no longer exist. Update the godoc to reflect the new `ObserveConnections`-based mechanism.

📝 Suggested doc update

```diff
-// 4. Invokes lifecycle callbacks (OnConnect, OnDisconnect, etc.)
+// 4. Tracks connection lifecycle via ObserveConnections
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/rpc/node.go` around lines 188 - 201, Update the ServeHTTP godoc to remove references to the removed lifecycle callbacks (OnConnect, OnDisconnect) and instead describe the new ObserveConnections-based mechanism: explain that ServeHTTP upgrades HTTP to WebSocket, creates/manages connection IDs and state, spawns per-connection goroutines for read/process/write/health, and notifies connection lifecycle via ObserveConnections; reference the ServeHTTP method and ObserveConnections so readers can locate the new flow.

clearnode/api/channel_v1/submit_state.go (1)
100-102: ⚠️ Potential issue | 🟡 Minor: Pre-existing: `err` is stale in the "missing user signature" error message.

Not introduced by this PR, but worth noting: on line 101, `err` is the result of the prior `PackState` call (line 94), which succeeded if execution reached here. The `%v` will format `<nil>`, producing a misleading message like `"missing incoming state user signature: <nil>"`.

Suggested fix

```diff
-	return rpc.Errorf("missing incoming state user signature: %v", err)
+	return rpc.Errorf("missing incoming state user signature")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/api/channel_v1/submit_state.go` around lines 100 - 102, The error message uses a stale variable `err` (from the earlier PackState call) when checking `incomingState.UserSig == nil`, causing misleading "<nil>" output; update the `rpc.Errorf` in the check for `incomingState.UserSig == nil` to not reference `err` (either remove the "%v" and err entirely or include a relevant value like an identifier from `incomingState`) so the message clearly states "missing incoming state user signature" (locate the check around the `incomingState.UserSig == nil` branch in submit_state.go and update the rpc.Errorf call accordingly).

clearnode/api/app_session_v1/handler.go (1)
61-73: ⚠️ Potential issue | 🔴 Critical: Add length guard before accessing signature byte at index 0.

Line 67 accesses `sigBytes[0]` without checking the slice length. When `hexutil.Decode("0x")` returns an empty slice (without error), this causes an index-out-of-bounds panic. Although `Recover()` includes a length check, it's called after line 67, so the panic occurs before it's reached.

🐛 Proposed fix

```diff
 sigBytes, err := hexutil.Decode(sigHex)
 if err != nil {
 	return rpc.Errorf("failed to decode signature: %v", err)
 }
+if len(sigBytes) == 0 {
+	return rpc.Errorf("empty signature")
+}
 sigType := app.AppSessionSignerTypeV1(sigBytes[0])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/api/app_session_v1/handler.go` around lines 61 - 73, The loop over signatures accesses sigBytes[0] before ensuring sigBytes is non-empty, which can panic when hexutil.Decode returns an empty slice; add a length guard immediately after decoding in the handler (the loop that decodes sigHex) to check len(sigBytes) > 0 and return a descriptive rpc.Errorf (or mark validation failed) if it's empty, before constructing app.AppSessionSignerTypeV1 or calling appSessionSignerValidator.Recover; keep the existing metrics calls to h.metrics.IncAppSessionUpdateSigValidation around Recover as-is but ensure the early-return on empty sigBytes happens first.
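As a standalone illustration of the guard, here is a minimal sketch; it substitutes the standard library's `encoding/hex` for `hexutil.Decode`, and `decodeSigType` is a hypothetical helper rather than code from this PR:

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// decodeSigType is a hypothetical helper illustrating the guard:
// it rejects an empty payload before indexing into the slice.
func decodeSigType(sigHex string) (byte, error) {
	raw := strings.TrimPrefix(sigHex, "0x")
	sigBytes, err := hex.DecodeString(raw)
	if err != nil {
		return 0, fmt.Errorf("failed to decode signature: %v", err)
	}
	if len(sigBytes) == 0 { // guard: "0x" decodes to an empty slice without error
		return 0, fmt.Errorf("empty signature")
	}
	return sigBytes[0], nil // safe: length checked above
}

func main() {
	if _, err := decodeSigType("0x"); err != nil {
		fmt.Println("rejected:", err) // guard fires instead of panicking
	}
	b, _ := decodeSigType("0x02ab")
	fmt.Println("sig type:", b)
}
```

The key point is that the guard must run before the first index expression; a later length check inside `Recover()` cannot prevent the earlier panic.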
🧹 Nitpick comments (12)
pkg/blockchain/evm/reactor.go (1)
58-123: Callback not invoked on the early-return path for truly unknown events (lines 63–65).

When `eventMapping` doesn't contain the event ID, `HandleEvent` returns at line 65 without reaching the callback at line 121. This means events that aren't part of the contract ABI won't be reflected in metrics at all — no success or failure is recorded.

If this is intentional (only track ABI-known events), a brief comment at the early return would clarify intent. If you do want full observability, move the callback (or add a second invocation) before that early return.

Option: track unknown-ABI events as failures

```diff
 eventName, ok := eventMapping[eventID]
 if !ok {
 	logger.Warn("unknown event ID",
 		"eventID", eventID.Hex(),
 		"blockNumber", l.BlockNumber,
 		"txHash", l.TxHash.String(),
 		"logIndex", l.Index)
+	if r.onEventProcessed != nil {
+		r.onEventProcessed(r.blockchainID, false)
+	}
 	return
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/blockchain/evm/reactor.go` around lines 58 - 123, HandleEvent currently returns early when eventMapping lacks the eventID, skipping the metrics callback; update HandleEvent to invoke the onEventProcessed callback before the early return so unknown-ABI events are recorded as failures (call r.onEventProcessed(r.blockchainID, false) immediately before returning when !ok), or if the omission was intentional, add a clarifying comment at that early-return explaining why unknown events must not be tracked; reference symbols: HandleEvent, eventMapping, eventID, onEventProcessed, and r.blockchainID.

clearnode/api/metric_store.go (1)
64-68: Consider clearing callbacks after flush to prevent accidental double-invocation.

If `flush()` is ever called more than once (e.g., due to a code path change), all metric callbacks would fire again. Clearing the slice after iteration is a low-cost defensive measure.

♻️ Proposed fix

```diff
 func (s *metricStore) flush() {
 	for _, cb := range s.callbacks {
 		cb()
 	}
+	s.callbacks = nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/api/metric_store.go` around lines 64 - 68, The flush() method on metricStore currently iterates and invokes each callback (s.callbacks) but leaves the slice intact; change flush (the metricStore.flush method) to clear the callbacks slice after calling them (e.g., set s.callbacks = nil or assign an empty slice) so subsequent calls won't re-invoke the same callbacks; keep the invocation loop as-is but add the slice-clearing step immediately after iteration to defensively prevent double-invocation.

clearnode/api/utils.go (1)
5-28: Payload is deserialized solely to extract one field for metric labeling.

This function re-deserializes the entire request payload (marshal → unmarshal round-trip via `Translate`) just to extract a single sub-field. On the hot path, this adds unnecessary allocation. Consider whether the handler could pass the extracted path downstream instead, or cache the deserialized payload on the context.

That said, the current approach is pragmatic and functionally correct. The `break` on error correctly falls through to `return "default"`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/api/utils.go` around lines 5 - 28, getMethodPath currently re-deserializes the entire request payload just to read a single field, causing extra allocations on the hot path; update call sites or context handling so the needed metric label is provided directly instead of round-tripping Translate: either have the handlers that already decode the payload pass the extracted path string into the rpc.Context (or into a new field like MethodPath) before metrics are recorded, or cache the parsed request struct on the context so getMethodPath can read the specific fields (AppStateUpdate.Intent, State.Transition.Type) without calling Translate; change getMethodPath to read the pre-extracted/cached value and fall back to "default" only if missing.

pkg/rpc/connection_hub.go (1)
72-74: `observeConnections` callback invoked while holding `hub.mu` lock.

If the callback (or anything it triggers) ever calls back into the hub (e.g., `Get`, `Add`, `Remove`), it will deadlock. Consider invoking the callback after releasing the lock, or document this constraint clearly.

♻️ Proposed fix for Add (similar pattern for Remove)

```diff
 func (hub *ConnectionHub) Add(conn Connection) error {
 	if conn == nil {
 		return fmt.Errorf("connection cannot be nil")
 	}
 	connID := conn.ConnectionID()
+	var totalConns uint32
+	var origin string
 	hub.mu.Lock()
-	defer hub.mu.Unlock()
 	if _, exists := hub.connections[connID]; exists {
+		hub.mu.Unlock()
 		return fmt.Errorf("connection with ID %s already exists", connID)
 	}
 	hub.connections[connID] = conn
 	sourceID := getSourceID(conn.Origin())
 	hub.sourceMap[sourceID]++
-	hub.observeConnections(defaultConnectionRegion, conn.Origin(), uint32(len(hub.connections)))
+	totalConns = uint32(len(hub.connections))
+	origin = conn.Origin()
+	hub.mu.Unlock()
+
+	hub.observeConnections(defaultConnectionRegion, origin, totalConns)
 	return nil
 }
```

Also applies to: 114-118
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/rpc/connection_hub.go` around lines 72 - 74, observeConnections is being called while holding hub.mu which can deadlock if the callback calls back into the hub (Get/Add/Remove); to fix, avoid calling hub.observeConnections under the lock: in methods that currently call observeConnections (e.g., Add and Remove where you compute sourceID, update hub.sourceMap and then call hub.observeConnections), capture any necessary values (sourceID, connection origin, connection count/region) while holding hub.mu, release the lock, then invoke hub.observeConnections with those captured values; ensure no hub state is mutated after unlocking before the callback runs and apply the same pattern wherever observeConnections is invoked (including the other occurrence similar to 114-118).

clearnode/blockchain_worker.go (1)
187-187: Add String() method to BlockchainActionType for readable metric labels.

Metric labels currently use raw numeric values (`fmt.Sprintf("%d", action.Type)`) like `"1"`, `"10"`, etc. Add a `String()` method to `BlockchainActionType` to emit human-readable labels like `"checkpoint"` and `"initiateEscrowDeposit"`, similar to how `ActionResult` implements it in `clearnode/metrics/exporter.go`. This improves dashboard interpretability.

Applies to: Lines 187, 195
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/blockchain_worker.go` at line 187, Add a String() method on the BlockchainActionType type that returns human-readable labels (e.g., "checkpoint", "initiateEscrowDeposit") similar to ActionResult's implementation in clearnode/metrics/exporter.go, then replace uses of fmt.Sprintf("%d", action.Type) in blockchain_worker.go (calls to w.metrics.IncBlockchainAction) with action.Type.String() so metrics use readable labels; ensure the String() covers all enum values and has a sensible default for unknown values.

clearnode/main.go (2)
160-195: Typo in parameter name and clean implementation.

`metricExported` on line 167 appears to be a typo for `metricExporter`. Otherwise, the store metrics exporter implementation is clean — proper ticker lifecycle, error handling, and graceful shutdown path.

Fix typo

```diff
-	metricExported metrics.StoreMetricExporter, logger log.Logger) {
+	metricExporter metrics.StoreMetricExporter, logger log.Logger) {
```

And update the usage on lines 179 and 187 accordingly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/main.go` around lines 160 - 195, Rename the parameter metricExported to metricExporter in runStoreMetricsExporter and update all its usages (calls to SetAppSessions and SetChannels) to use metricExporter instead of metricExported; ensure the function signature and every reference inside the function body (the assignments inside the ticker case) are consistently renamed to avoid the typo.
106-106: Store metrics exporter goroutine won't receive cancellation on shutdown.

The `ctx` passed here is `context.Background()` (line 27), which is never cancelled. The goroutine's `<-ctx.Done()` path (line 190) will never trigger, so it leaks until process exit. This is consistent with the existing pattern for listeners/workers (line 156 TODO), but worth noting for a future cleanup where all background goroutines are wired to a cancellable context.

Suggested fix: use a cancellable context for all background goroutines

```diff
-	ctx := context.Background()
+	ctx, cancelCtx := context.WithCancel(context.Background())
+	defer cancelCtx()
```

Then the existing shutdown signal handling would call `cancelCtx()` before shutting down the servers, ensuring all background goroutines (listeners, workers, store metrics exporter) receive the cancellation signal.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/main.go` at line 106, The store metrics exporter is started with a non-cancellable root context (ctx from context.Background()), so its <-ctx.Done() branch in runStoreMetricsExporter never triggers; change initialization to derive a cancellable context (e.g., ctx, cancel := context.WithCancel(context.Background())) and pass that cancellable ctx into go runStoreMetricsExporter(...), then ensure the existing shutdown signal handler calls cancel() (or the shared cancelCtx()) before shutting down servers so runStoreMetricsExporter and other background goroutines receive cancellation and exit cleanly.

clearnode/metrics/interface.go (1)
24-24: Minor: `durationSecs` parameter name is misleading for a `time.Duration` type.

`time.Duration` is natively in nanoseconds. The parameter name `durationSecs` implies the caller should pass a value in seconds, which could lead to confusion. Consider renaming to just `duration`.

Rename suggestion

```diff
-	ObserveRPCDuration(method, path string, success bool, durationSecs time.Duration)
+	ObserveRPCDuration(method, path string, success bool, duration time.Duration)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/metrics/interface.go` at line 24, The ObserveRPCDuration interface method uses a misleading parameter name durationSecs for a time.Duration; rename the parameter to duration in the ObserveRPCDuration(method, path string, success bool, duration time.Duration) signature and update all implementations and callers (methods/functions that implement ObserveRPCDuration and any calls to ObserveRPCDuration) to use the new parameter name to avoid implying the value should be seconds while preserving the time.Duration type.

clearnode/metrics/exporter.go (4)
278-310: Remove or relocate planning comments.

These trailing comments (lines 278–310) appear to be task-tracking or design notes rather than code documentation. They'll become stale as the implementation evolves and add noise. Consider moving them to the PR description, a design doc, or removing them entirely now that the implementation is in place.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/metrics/exporter.go` around lines 278 - 310, Remove the trailing planning/comment block (the lines listing metrics like `channel_state_validations_total{sig_type, result}`, `rpc_messages_total{msg_type, method}`, `app_state_updates{application}`, `blockchain_actions_total{asset, blockchain_id, action_type, result}`, and the "metric worker" list) from clearnode/metrics/exporter.go and either delete it or relocate it to the PR description or a separate design doc; ensure the file contains only implementation-level comments (or convert to a short TODO with a link to the external design doc if you must keep a pointer).
267-276: Consider adding a helper constructor for `ActionResult` from `bool`.

The pattern `res := ActionResultFailed; if b { res = ActionResultSuccess }` is repeated six times. A small helper would reduce boilerplate:

Suggested helper

```diff
+func toActionResult(ok bool) ActionResult {
+	if ok {
+		return ActionResultSuccess
+	}
+	return ActionResultFailed
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/metrics/exporter.go` around lines 267 - 276, The code repeats the pattern `res := ActionResultFailed; if b { res = ActionResultSuccess }` across the codebase; add a small helper function ActionResultFromBool(b bool) ActionResult (or NewActionResultFromBool) that returns ActionResultSuccess when b is true and ActionResultFailed otherwise, keep the existing ActionResult type and String() method unchanged, and replace the six call sites with calls to this new helper (referencing ActionResult and ActionResultFromBool) to remove boilerplate and centralize the logic.
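A runnable sketch of the suggested helper; the concrete `ActionResult` values here ("success"/"failed") are assumptions standing in for the constants in `clearnode/metrics/exporter.go`:

```go
package main

import "fmt"

// ActionResult mirrors the type in clearnode/metrics/exporter.go;
// the concrete string values are assumptions for illustration.
type ActionResult string

const (
	ActionResultSuccess ActionResult = "success"
	ActionResultFailed  ActionResult = "failed"
)

// ActionResultFromBool centralizes the repeated
// "res := ActionResultFailed; if b { res = ActionResultSuccess }" pattern.
func ActionResultFromBool(ok bool) ActionResult {
	if ok {
		return ActionResultSuccess
	}
	return ActionResultFailed
}

func main() {
	fmt.Println(ActionResultFromBool(true))
	fmt.Println(ActionResultFromBool(false))
}
```

Call sites then collapse to a single expression, e.g. `m.IncBlockchainAction(asset, id, actionType, ActionResultFromBool(err == nil))`.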
27-34: Watch for label cardinality on `application`, `origin`, and similar dynamic labels.

Labels like `applicationID` (lines 30, 138, 142) and `origin` (line 132) are user/client-controlled and could grow unboundedly. High-cardinality labels are a well-known Prometheus anti-pattern that can cause memory pressure and slow queries. Ensure these dimensions are bounded in practice, or consider using an allowlist/bucketing strategy.

Also applies to: 129-132, 135-142
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/metrics/exporter.go` around lines 27 - 34, The prometheus metrics appSessionsTotal and channelsTotal (and other metrics using labels like "application", "origin", "asset") use user-controlled labels which can create unbounded cardinality; update the metric design to bound those dimensions by replacing raw IDs with a safe representation (e.g., use an allowlist of known application names, bucketized/hashed categories, or map IDs to a limited set of enums) before calling prometheus.NewGaugeVec or when observing values, and add validation or mapping helpers (e.g., sanitizeApplicationLabel, bucketOriginLabel) referenced from the metric update sites so only bounded label values are emitted.
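One possible allowlist/bucketing strategy, sketched with hypothetical names (`knownApplications` and `sanitizeApplicationLabel` are illustrative, not existing code in this PR):

```go
package main

import "fmt"

// knownApplications is a hypothetical allowlist; real entries would
// come from configuration rather than being hard-coded.
var knownApplications = map[string]bool{
	"clearnode": true,
	"snake":     true,
}

// sanitizeApplicationLabel maps unknown, client-controlled values to a
// single "other" bucket so Prometheus label cardinality stays bounded.
func sanitizeApplicationLabel(app string) string {
	if knownApplications[app] {
		return app
	}
	return "other"
}

func main() {
	fmt.Println(sanitizeApplicationLabel("snake"))
	fmt.Println(sanitizeApplicationLabel("rand-client-42"))
}
```

Applying such a helper at every metric update site caps the label value set at len(allowlist)+1, regardless of what clients send.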
37-44: `MustRegister` panics on failure, but the constructor returns `error`; consider using `Register` instead.

Both constructors return `(T, error)`, yet they use `MustRegister`, which panics on duplicate or conflicting metric registration. This is inconsistent: callers expect errors to be surfaced through the return value, not via panics.

Suggested approach: use non-panicking Register

```diff
-	if reg != nil {
-		reg.MustRegister(
-			m.userStatesTotal,
-			// ...
-		)
-	} else {
-		return nil, fmt.Errorf("prometheus registerer not provided")
-	}
+	if reg == nil {
+		return nil, fmt.Errorf("prometheus registerer not provided")
+	}
+	collectors := []prometheus.Collector{
+		m.userStatesTotal,
+		// ...all collectors...
+	}
+	for _, c := range collectors {
+		if err := reg.Register(c); err != nil {
+			return nil, fmt.Errorf("registering metric: %w", err)
+		}
+	}
```

Also applies to: 157-176
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/metrics/exporter.go` around lines 37 - 44, The constructors currently call reg.MustRegister which panics on registration failures; change to use the non-panicking reg.Register and handle the returned error(s). Replace calls to MustRegister for metrics like m.appSessionsTotal and m.channelsTotal (and the other registration block later in the file) with Register calls, check the error returned, and return a wrapped/formatted error from the constructor if registration fails so failures surface via the constructor's error return instead of panicking.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/v1-push.yml:
- Line 5: The push workflow's branches trigger currently contains the feature
branch entry "feat-v1/metrics" alongside "release/v1.0.0"; remove
"feat-v1/metrics" from the branches array in the v1-push workflow so only
release/v1.0.0 remains (i.e., change the branches list from [ release/v1.0.0,
feat-v1/metrics ] to just [ release/v1.0.0 ]) and commit that change before
merging this PR to avoid accidental RC tagging and overwriting latest-rc images.
In `@clearnode/blockchain_worker.go`:
- Around line 191-195: The code currently returns early when
w.store.Complete(action.ID, txHash) fails, so the successful on-chain action
isn't recorded by w.metrics.IncBlockchainAction; update the logic so
IncBlockchainAction(state.Asset, w.blockchainID, fmt.Sprintf("%d", action.Type),
true) is called regardless of Complete() failing (i.e., call the metric before
the early return or call it in both the success and the error branch), and still
log the Complete() error via logger.Error (use w.store.Complete, logger.Error,
w.metrics.IncBlockchainAction, action.ID, txHash, action.Type, state.Asset,
w.blockchainID to locate the spots to change).
In `@clearnode/metrics/exporter.go`:
- Around line 82-83: The two metric fields channelSessionKeysTotal and
appSessionKeysTotal have a type/semantic mismatch: change both field types from
prometheus.Counter to prometheus.Gauge and ensure their initializers use
prometheus.NewGauge with prometheus.GaugeOpts (and keep the "_active" names) so
the types match the semantics; alternatively, if you decide they must be
monotonic counters, rename the metrics to use "_total" and use
prometheus.NewCounter with prometheus.CounterOpts for both—update the field
declarations and their constructors for channelSessionKeysTotal and
appSessionKeysTotal to be consistent.
In `@pkg/rpc/connection_hub.go`:
- Around line 72-74: observeConnections is being called with the global total
(uint32(len(hub.connections))) which causes SetRPCConnections' per-origin gauge
(labels region/origin) to show the global count; instead pass the per-origin
count from hub.sourceMap. Change calls to observeConnections in the Add (where
sourceID is computed and hub.sourceMap[sourceID]++) and Remove methods to use
uint32(hub.sourceMap[sourceID]) (after increment/decrement) rather than
uint32(len(hub.connections)) so the metric reflects per-origin connection
counts.
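A pared-down sketch of the per-origin bookkeeping; the `hub` type and its methods are illustrative stand-ins for `ConnectionHub`, not the real `pkg/rpc` API:

```go
package main

import "fmt"

// hub keeps a per-origin connection count, mirroring sourceMap.
type hub struct {
	sourceMap map[string]int
}

// add increments the per-origin count and returns the value that should
// feed the gauge: the origin's own count, not the global total.
func (h *hub) add(origin string) uint32 {
	h.sourceMap[origin]++
	return uint32(h.sourceMap[origin])
}

// remove decrements (never below zero) and returns the updated count.
func (h *hub) remove(origin string) uint32 {
	if h.sourceMap[origin] > 0 {
		h.sourceMap[origin]--
	}
	return uint32(h.sourceMap[origin])
}

func main() {
	h := &hub{sourceMap: map[string]int{}}
	h.add("app.example.com")
	h.add("app.example.com")
	h.add("other.example.com")
	// The gauge labeled origin=app.example.com should read 2 here,
	// even though the hub holds 3 connections globally.
	fmt.Println(h.sourceMap["app.example.com"])
}
```

Because the gauge carries region/origin labels, reporting `len(hub.connections)` would write the global total into every origin's series; the per-origin map value is the correct input.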
and handle the returned error(s). Replace calls to MustRegister for metrics like
m.appSessionsTotal and m.channelsTotal (and the other registration block later
in the file) with Register calls, check the error returned, and return a
wrapped/formatted error from the constructor if registration fails so failures
surface via the constructor's error return instead of panicking.
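The error-returning registration pattern can be sketched without the Prometheus dependency; `registry` below is a minimal stand-in for `prometheus.Registerer` (whose real `Register` also returns an error), and `newExporter` mirrors the constructor shape:

```go
package main

import "fmt"

// registry is a stand-in for prometheus.Registerer: Register returns an
// error instead of panicking the way MustRegister does.
type registry struct{ names map[string]bool }

func (r *registry) Register(name string) error {
	if r.names[name] {
		return fmt.Errorf("collector %q already registered", name)
	}
	r.names[name] = true
	return nil
}

// newExporter shows the constructor pattern: try each registration and wrap
// failures into the constructor's error return rather than panicking.
func newExporter(reg *registry) error {
	for _, name := range []string{"app_sessions_total", "channels_total"} {
		if err := reg.Register(name); err != nil {
			return fmt.Errorf("register %s: %w", name, err)
		}
	}
	return nil
}

func main() {
	reg := &registry{names: map[string]bool{}}
	fmt.Println(newExporter(reg))        // <nil>
	fmt.Println(newExporter(reg) != nil) // true: duplicate surfaces as error
}
```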
In `@clearnode/metrics/interface.go`:
- Line 24: The ObserveRPCDuration interface method uses a misleading parameter
name durationSecs for a time.Duration; rename the parameter to duration in the
ObserveRPCDuration(method, path string, success bool, duration time.Duration)
signature and update all implementations and callers (methods/functions that
implement ObserveRPCDuration and any calls to ObserveRPCDuration) to use the new
parameter name to avoid implying the value should be seconds while preserving
the time.Duration type.
In `@pkg/blockchain/evm/reactor.go`:
- Around line 58-123: HandleEvent currently returns early when eventMapping
lacks the eventID, skipping the metrics callback; update HandleEvent to invoke
the onEventProcessed callback before the early return so unknown-ABI events are
recorded as failures (call r.onEventProcessed(r.blockchainID, false) immediately
before returning when !ok), or if the omission was intentional, add a clarifying
comment at that early-return explaining why unknown events must not be tracked;
reference symbols: HandleEvent, eventMapping, eventID, onEventProcessed, and
r.blockchainID.
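A sketch of the first option, with a minimal stand-in for the reactor (field and callback shapes are assumptions based on the names cited above):

```go
package main

import "fmt"

// reactor mirrors the fields referenced in the review comment.
type reactor struct {
	blockchainID     uint32
	eventMapping     map[string]func()
	onEventProcessed func(blockchainID uint32, success bool)
}

// HandleEvent invokes the metrics callback before the early return, so
// unknown-ABI events are recorded as failures instead of being skipped.
func (r *reactor) HandleEvent(eventID string) {
	handler, ok := r.eventMapping[eventID]
	if !ok {
		r.onEventProcessed(r.blockchainID, false) // unknown event: count as failure
		return
	}
	handler()
	r.onEventProcessed(r.blockchainID, true)
}

func main() {
	failures := 0
	r := &reactor{
		blockchainID: 1,
		eventMapping: map[string]func(){"Transfer": func() {}},
		onEventProcessed: func(_ uint32, success bool) {
			if !success {
				failures++
			}
		},
	}
	r.HandleEvent("Transfer")
	r.HandleEvent("Unknown")
	fmt.Println(failures) // 1
}
```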
In `@pkg/rpc/connection_hub.go`:
- Around line 72-74: observeConnections is being called while holding hub.mu
which can deadlock if the callback calls back into the hub (Get/Add/Remove); to
fix, avoid calling hub.observeConnections under the lock: in methods that
currently call observeConnections (e.g., Add and Remove where you compute
sourceID, update hub.sourceMap and then call hub.observeConnections), capture
any necessary values (sourceID, connection origin, connection count/region)
while holding hub.mu, release the lock, then invoke hub.observeConnections with
those captured values; ensure no hub state is mutated after unlocking before the
callback runs and apply the same pattern wherever observeConnections is invoked
(including the other occurrence similar to 114-118).
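The capture-under-lock, call-after-unlock pattern can be sketched as follows; `hub` is a simplified stand-in for `ConnectionHub`, and it also shows the per-origin decrement with zero-count cleanup mentioned elsewhere in this review:

```go
package main

import (
	"fmt"
	"sync"
)

// hub mirrors ConnectionHub's relevant shape: a mutex-guarded source map
// and a metrics callback that must not run while the lock is held.
type hub struct {
	mu                 sync.Mutex
	sourceMap          map[string]int
	observeConnections func(origin string, count uint32)
}

// Remove captures the values it needs under the lock, releases the lock,
// and only then invokes the callback, so a callback that re-enters the hub
// cannot deadlock.
func (h *hub) Remove(origin string) {
	h.mu.Lock()
	h.sourceMap[origin]--
	count := h.sourceMap[origin]
	if count == 0 {
		delete(h.sourceMap, origin) // avoid leaking zero-count entries
	}
	h.mu.Unlock() // release before the callback runs

	h.observeConnections(origin, uint32(count))
}

func main() {
	h := &hub{sourceMap: map[string]int{"web": 2}}
	// A callback that re-enters the hub would deadlock if invoked under h.mu.
	h.observeConnections = func(origin string, count uint32) {
		h.mu.Lock() // safe: Remove released the lock first
		defer h.mu.Unlock()
		fmt.Println(origin, count)
	}
	h.Remove("web")
}
```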
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
clearnode/api/app_session_v1/handler.go (1)
62-67: ⚠️ Potential issue | 🟡 Minor — Potential panic if `sigBytes` is empty.
`hexutil.Decode("0x")` returns an empty byte slice without error. Accessing `sigBytes[0]` on line 67 would panic with an index-out-of-range error. Add a length check after decode.
🛡️ Proposed fix
```diff
 sigBytes, err := hexutil.Decode(sigHex)
 if err != nil {
 	return rpc.Errorf("failed to decode signature: %v", err)
 }
+if len(sigBytes) == 0 {
+	return rpc.Errorf("empty signature bytes")
+}
 sigType := app.AppSessionSignerTypeV1(sigBytes[0])
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/api/app_session_v1/handler.go` around lines 62 - 67, After decoding sigHex into sigBytes in the handler (where sigBytes is created and later used to form app.AppSessionSignerTypeV1), add a guard that verifies len(sigBytes) > 0 and return a descriptive rpc.Errorf if it's empty to avoid an index-out-of-range panic; locate the decode block that sets sigBytes and the subsequent use sigType := app.AppSessionSignerTypeV1(sigBytes[0]) and perform the length check between those lines, returning an appropriate error when sigBytes is empty.
pkg/rpc/node.go (1)
188-201: ⚠️ Potential issue | 🟡 Minor — Stale doc comment references removed callbacks.
Line 194 still mentions "Invokes lifecycle callbacks (OnConnect, OnDisconnect, etc.)" but these callbacks have been replaced by `ObserveConnections`. Update the doc comment to reflect the current design.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/rpc/node.go` around lines 188 - 201, Update the ServeHTTP doc comment to remove references to the old lifecycle callbacks and describe the current connection observation model: mention that ServeHTTP upgrades requests to WebSocket, creates a unique connection ID and manages state, spawns goroutines for concurrent message processing, and uses ObserveConnections to monitor connection lifecycle and events (instead of OnConnect/OnDisconnect). Ensure the comment references ServeHTTP and ObserveConnections so future readers can locate the implementation and understand the design change.
🧹 Nitpick comments (10)
pkg/rpc/connection_hub.go (1)
152-155: `getSourceID` is a trivial pass-through. Currently it just returns its input. If there's no planned transformation, consider inlining the call to reduce indirection.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/rpc/connection_hub.go` around lines 152 - 155, getSourceID is a no-op that simply returns its input; remove this indirection by deleting the getSourceID function and replacing all calls to getSourceID(origin) with the origin string directly (or, if you intend to preserve a hook for future transformations, keep the function but add a clear TODO comment and unit test). Update any callers referencing getSourceID in connection_hub.go (and elsewhere) to use the raw origin value to eliminate the trivial wrapper.
clearnode/api/utils.go (1)
5-28: Double deserialization of the request payload for metrics labeling.
`Translate` re-marshals/unmarshals the payload just to extract a sub-label. This runs on every request matching these three methods, even though the actual handler will deserialize the same payload again downstream. For a metrics utility this is acceptable overhead, but if request volume grows, consider caching the result or extracting the label post-handler.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/api/utils.go` around lines 5 - 28, getMethodPath currently re-deserializes the request payload for the three RPC methods (rpc.AppSessionsV1SubmitAppStateMethod, rpc.ChannelsV1RequestCreationMethod, rpc.ChannelsV1SubmitStateMethod) which duplicates work the handler will do; instead, persist the extracted label once when the handler first deserializes the payload and read that cached value in getMethodPath (e.g., add a parsed-label field on rpc.Context or set a context key like "methodPathLabel" during handler deserialization), or use a concurrent request-scoped cache keyed by request ID to store results of translating into rpc.AppSessionsV1SubmitAppStateRequest / rpc.ChannelsV1RequestCreationRequest / rpc.ChannelsV1SubmitStateRequest so getMethodPath returns the cached label rather than calling Translate again.
clearnode/metrics/exporter.go (4)
294-326: Remove leftover planning/spec comments. These trailing comments appear to be design notes or a checklist. They should be removed or moved to a design doc / issue tracker before merging.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/metrics/exporter.go` around lines 294 - 326, Remove the leftover planning/spec comments block in exporter.go (the multi-line commented checklist referencing metrics like `user_states_total`, `transactions_total`, `channel_state_validations_total`, `rpc_messages_total`, `app_state_updates`, `blockchain_actions_total`, etc.). Delete that entire commented section from the file or move it into a design doc/issue tracker, and ensure no stray TODO/spec comments remain in the `metrics/exporter.go` file before merging.
25-49: Consider checking `reg != nil` before allocating metric objects. Both constructors create all metric objects before checking if the registerer is nil. Moving the nil check upfront avoids unnecessary allocations on the error path.
Early nil check
```diff
 func NewStoreMetricExporter(reg prometheus.Registerer) (StoreMetricExporter, error) {
+	if reg == nil {
+		return nil, fmt.Errorf("prometheus registerer not provided")
+	}
+
 	m := &storeMetricExporter{
 		// ...
 	}
-	if reg != nil {
-		reg.MustRegister(
-			m.appSessionsTotal,
-			m.channelsTotal,
-		)
-	} else {
-		return nil, fmt.Errorf("prometheus registerer not provided")
-	}
+	reg.MustRegister(
+		m.appSessionsTotal,
+		m.channelsTotal,
+	)
 	return m, nil
 }
```
Same pattern applies to `NewRuntimeMetricExporter`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/metrics/exporter.go` around lines 25 - 49, Move the prometheus registerer nil-check to the top of NewStoreMetricExporter (and similarly NewRuntimeMetricExporter) so you return an error immediately if reg == nil, before constructing any prometheus.GaugeVecs; specifically, in NewStoreMetricExporter check reg != nil first, return fmt.Errorf("prometheus registerer not provided") if nil, and only then create m.appSessionsTotal and m.channelsTotal and call reg.MustRegister on them.
102-116: Metric naming: `_active` suffix on monotonic counters is misleading.
`channel_session_keys_active` and `app_session_keys_active` use the `_active` suffix, which by Prometheus naming conventions implies a gauge (current snapshot). Since the interface only supports `Inc()`, these are monotonically increasing counters and should use the `_total` suffix per Prometheus naming best practices.
Verify each finding against the current code and only fix it if needed. In `@clearnode/metrics/exporter.go` around lines 102 - 116, The metric names use an `_active` suffix which implies a gauge but the code treats them as monotonic counters; update channelSessionKeysTotal (prometheus.NewCounter) to use the metric name "channel_session_keys_total" and adjust its Help text to "Total number of channel session keys issued"; change appSessionKeysTotal (currently prometheus.NewGauge) to a prometheus.NewCounter, rename its metric to "app_session_keys_total" and update its Help text to "Total number of app session keys issued" so both follow Prometheus `_total` naming for counters.
203-206: `InexactFloat64()` for transaction amounts — acceptable precision trade-off?
`amount.InexactFloat64()` on Line 205 will lose precision for very large decimal values. This is inherent to Prometheus counters (which use `float64`), so it's likely acceptable for metrics purposes, but worth noting if exact financial tracking via metrics is expected.
Verify each finding against the current code and only fix it if needed. In `@clearnode/metrics/exporter.go` around lines 203 - 206, In runtimeMetricExporter.RecordTransaction the call to amount.InexactFloat64() can lose precision for very large decimal amounts; update the method to explicitly handle overflow/precision by (1) converting with a guarded conversion: compute the maximum representable decimal for float64 and if amount > max then set value to math.MaxFloat64 (or math.Inf(+1)) or clamp to a sensible cap, (2) optionally round or scale the decimal (e.g., amount.Round(scale)) before conversion if you want a defined precision, and (3) add a short comment on RecordTransaction explaining that Prometheus counters use float64 so exact financial precision is not preserved; reference the function name runtimeMetricExporter.RecordTransaction and the fields transactionsAmountTotal to locate where to apply this change.clearnode/metrics/interface.go (2)
12-36: Well-structured metrics interface with clean subsystem grouping. The interface is well-organized by subsystem, and the noop implementation makes it test-friendly.
One minor naming nit: the `durationSecs` parameter on Line 24 is typed `time.Duration` (not raw seconds). The concrete implementation in `exporter.go` calls `.Seconds()` internally, so the caller passes a `time.Duration` — consider renaming to just `duration` to avoid confusion.
Proposed rename
```diff
-	ObserveRPCDuration(method, path string, success bool, durationSecs time.Duration)
+	ObserveRPCDuration(method, path string, success bool, duration time.Duration)
```
Verify each finding against the current code and only fix it if needed. In `@clearnode/metrics/interface.go` around lines 12 - 36, Rename the ObserveRPCDuration parameter to avoid implying it's raw seconds: change the parameter name durationSecs in the RuntimeMetricExporter interface to duration (or simply dur) so it correctly reflects its type time.Duration; update the interface method ObserveRPCDuration(method, path string, success bool, duration time.Duration) and adjust any implementations (e.g., ObserveRPCDuration in exporter.go) and all callers to use the new parameter name (ObserveRPCDuration) to keep the semantics (callers pass a time.Duration and implementations may call .Seconds()).
57-61: No noop provided for `StoreMetricExporter`. If tests or non-production code paths ever need a `StoreMetricExporter`, there's no noop equivalent. If it's only used in the metric worker (not in test-wired handlers), this is fine for now — but consider adding one for symmetry if needed later.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/metrics/interface.go` around lines 57 - 61, Add a no-op implementation of the StoreMetricExporter interface so tests and non-production paths can use it: create an exported type (e.g., NoopStoreMetricExporter) that implements SetAppSessions(applicationID string, status app.AppSessionStatus, count uint64) and SetChannels(asset string, status core.ChannelStatus, count uint64) with empty method bodies; use this noop where a metric exporter is optional to avoid nil checks and to satisfy the interface in test wiring.
clearnode/api/metric_store.go (1)
64-68: `flush()` does not clear callbacks after execution.
If `flush()` is ever called more than once on the same `metricStore` instance, callbacks would fire again. This is likely safe if the store is created per-transaction, but a defensive clear would prevent accidental double-emission.
Defensive clear after flush
```diff
 func (s *metricStore) flush() {
 	for _, cb := range s.callbacks {
 		cb()
 	}
+	s.callbacks = nil
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/api/metric_store.go` around lines 64 - 68, The flush method on metricStore currently invokes each callback but leaves s.callbacks intact, allowing callbacks to fire again; modify metricStore.flush to clear the callbacks after iterating (e.g., set s.callbacks = nil or s.callbacks = s.callbacks[:0]) so subsequent calls do nothing, referencing the metricStore.flush method and the s.callbacks field when making the change.
clearnode/api/rpc_router.go (1)
113-139: Metrics are recorded for ping requests but logging is skipped — verify this is intentional.
Lines 125-128 emit metrics for all requests (including `ping`), then lines 130-133 short-circuit logging for ping. If ping requests are high-frequency, the metrics cardinality is fine (same method label each time), but confirm this is the desired behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clearnode/api/rpc_router.go` around lines 113 - 139, ObservabilityMiddleware currently records metrics for all requests then returns early to skip logging for ping; decide whether ping should be excluded from metrics or just from logging and implement accordingly: either move the ping check (c.Request.Method == rpc.NodeV1PingMethod.String()) to the top of ObservabilityMiddleware before calling runtimeMetrics.* to skip both metrics and logging, or keep metrics but explicitly guard each metric call (runtimeMetrics.IncRPCMessage, runtimeMetrics.IncRPCRequest, runtimeMetrics.ObserveRPCDuration) with a condition to skip them for ping while leaving the existing early return for logging; update only the ObservabilityMiddleware function to reflect the chosen behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@clearnode/main.go`:
- Line 26: The current ctx := context.Background() is not cancellable so the
goroutine started by runStoreMetricsExporter never observes shutdown; change
main to create a cancellable context via ctx, ctxCancel :=
context.WithCancel(context.Background()), defer ctxCancel(), and pass that ctx
into runStoreMetricsExporter; optionally call ctxCancel() explicitly just before
initiating the shutdown sequence to ensure the store metrics goroutine sees
ctx.Done() and exits cleanly (verify runStoreMetricsExporter is selecting on
ctx.Done()).
- Line 156: Rename the misspelled parameter metricExported to metricExporter in
the function signature and update all references to that parameter (e.g., where
metricExported is used to register/export metrics) so they now use
metricExporter; ensure you update the symbol in the same function body and any
calls that pass this parameter name (search for metricExported in
clearnode/main.go and replace with metricExporter) and run a build to verify no
remaining references.
In `@clearnode/runtime.go`:
- Around line 139-142: The call to rpc.NewWebsocketNode in runtime.go assigns
(rpcNode, err) but never checks err; add an error check immediately after that
call (checking err and handling it appropriately — e.g., log via logger or
processLogger and return the error or exit) to avoid proceeding with a
nil/invalid rpcNode; locate the rpc.NewWebsocketNode invocation and ensure you
validate err before using rpcNode anywhere in the surrounding function.
In `@pkg/rpc/connection_hub.go`:
- Around line 104-118: ConnectionHub.Remove currently reports the global
connection total and leaves zero-count entries in hub.sourceMap; update Remove
to (1) decrement the per-origin count via
hub.sourceMap[getSourceID(conn.Origin())], and if the resulting count == 0
delete that key from hub.sourceMap to avoid leaks, and (2) call
hub.observeConnections with the per-origin region (use the sourceID as the
region identifier, not defaultConnectionRegion) and pass the updated per-origin
count (uint32 of the remaining count) rather than uint32(len(hub.connections));
reference ConnectionHub.Remove, hub.sourceMap, getSourceID, and
observeConnections when making the change.
---
Duplicate comments:
In `@clearnode/metrics/exporter.go`:
- Around line 83-86: The two metrics have inconsistent types/names:
appSessionKeysTotal is declared as prometheus.Counter but initialized with
prometheus.NewGauge, and channelSessionKeysTotal uses a name with "_active" but
is declared a Counter; make both consistent counters with _total naming. Change
the initialization of appSessionKeysTotal to use prometheus.NewCounter with
prometheus.CounterOpts and the name "app_session_keys_total", and update
channelSessionKeysTotal initialization to prometheus.NewCounter with the name
"channel_session_keys_total"; ensure the field declarations
(channelSessionKeysTotal, appSessionKeysTotal) remain prometheus.Counter so
types match the NewCounter returns and keep using Inc() where used.
In `@pkg/rpc/connection_hub.go`:
- Around line 72-74: The metric call to observeConnections is passing the global
connection count (uint32(len(hub.connections))) while labeling by origin,
causing each origin gauge to show the global total; update the two sites (the
creation path that uses getSourceID(conn.Origin()) and the Remove path) to pass
the per-origin count hub.sourceMap[sourceID] instead so
observeConnections(origin, ...) receives the per-origin
uint32(hub.sourceMap[sourceID]) value.
Co-authored-by: Anton Filonenko <philanton@users.noreply.github.com>
Summary by CodeRabbit
New Features
Bug Fixes & Improvements
Deprecations