[BUG] dlg_replicated_delete() does not run DLGCB_TERMINATED callbacks, breaking pua_dialoginfo (BLF) in anycast clusters #3880

@gostkov

OpenSIPS version you are running

version: opensips 3.6.3 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, F_PARALLEL_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: d5222226a
main.c compiled on  with gcc 12

Description
In an anycast cluster with two OpenSIPS nodes using dialog_replication_cluster and tm_replication_cluster, the dlg_replicated_delete() function does not call run_dlg_callbacks(DLGCB_TERMINATED, ...). This means that when a dialog is terminated on one node and the deletion is replicated to the node that originally created the dialog, modules like pua_dialoginfo never learn about the termination. As a result, PUBLISH with <state>terminated</state> is never sent, leaving stale presentity records in the database and causing BLF (Busy Lamp Field) indicators to remain lit indefinitely after a call ends.

Setup

  • Two OpenSIPS 3.6 nodes behind an anycast IP.
  • tm_replication_cluster = 1 (anycast TM replication)
  • dialog_replication_cluster = 1
  • Modules: dialog, presence, pua, pua_dialoginfo, presence_dialoginfo
  • pua_dialoginfo with presence_server pointing to the anycast IP

Scenario

  1. INVITE arrives at Node 1 (via anycast). create_dialog() and dialoginfo_set("A") are called. The pua_dialoginfo module registers a DLGCB_TERMINATED callback on this dialog. A PUBLISH with <state>confirmed</state> is sent to the local presence module. The pua record (including the ETag) is stored in Node 1's pua hashtable. The dialog is replicated to Node 2.
  2. Due to asymmetric anycast routing, BYE from the callee (Asterisk) arrives at Node 2. loose_route() succeeds (dialog was replicated). Node 2 forwards BYE to the caller. The 200 OK from the caller arrives at Node 1 (anycast), which replicates it to Node 2 via tm_replication_cluster.
  3. Node 2 completes the BYE transaction. The dialog module fires DLGCB_TERMINATED. pua_dialoginfo calls dialog_publish("terminated", ...) with expires=0. However, the pua module on Node 2 has no matching pua record (it was created on Node 1), so send_publish_int() returns ERR_PUBLISH_NO_RECORD and the PUBLISH is silently dropped:
    // modules/pua/send_publish.c, send_publish_int()
    if(presentity== NULL)
    {
        if(publ->expires== 0)
        {
            LM_DBG("request for a publish with expires 0 and"
                    " no record found\n");
            ret = ERR_PUBLISH_NO_RECORD;
            goto error;
        }
  4. Node 2 replicates the dialog deletion to Node 1. dlg_replicated_delete() runs on Node 1 — it transitions the dialog state and frees resources, but never calls run_dlg_callbacks(DLGCB_TERMINATED, ...):
    // modules/dialog/dlg_replication.c, dlg_replicated_delete()
    destroy_linkers(dlg);
    remove_dlg_prof_table(dlg, 0);
    next_state_dlg(dlg, DLG_EVENT_REQBYE, ...);
    // ... remove timer ...
    unref_dlg(dlg, 1 + unref);  // dialog freed without DLGCB_TERMINATED!
  5. Result: Node 1 (which has the pua record with the correct ETag) never fires DLGCB_TERMINATED → pua_dialoginfo never sends PUBLISH terminated → the presentity record remains confirmed → BLF stays lit.
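
The five steps above can be condensed into a small toy model. This is only a sketch for reasoning about the scenario: every type and function name in it (`struct cluster`, `publish_terminated`, `simulate_call`, and so on) is hypothetical and not part of OpenSIPS. It assumes exactly two nodes, a pua record that lives only on the node that handled the INVITE, and a terminating PUBLISH that is dropped when no local record exists.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these types or functions are OpenSIPS APIs. */
enum blf_state { BLF_IDLE, BLF_CONFIRMED, BLF_TERMINATED };

struct node {
    bool has_dialog;      /* dialog known here (created locally or replicated) */
    bool has_pua_record;  /* pua record holding the ETag; not replicated */
};

struct cluster {
    struct node n[2];
    enum blf_state presentity;  /* state stored by the presence server */
};

/* Models the expires=0 branch of send_publish_int(): without a local
 * pua record the terminating PUBLISH is dropped. */
static bool publish_terminated(struct cluster *c, int node)
{
    if (!c->n[node].has_pua_record)
        return false;
    c->n[node].has_pua_record = false;
    c->presentity = BLF_TERMINATED;
    return true;
}

/* Step 1: INVITE creates the dialog and the pua record on one node;
 * only the dialog is replicated to the peer. */
static void invite_on(struct cluster *c, int node)
{
    c->n[node].has_dialog = true;
    c->n[node].has_pua_record = true;
    c->presentity = BLF_CONFIRMED;
    c->n[1 - node].has_dialog = true;
}

/* Steps 2-4: BYE terminates the dialog locally (callback runs there),
 * then the deletion is replicated to the peer. */
static void bye_on(struct cluster *c, int node, bool cb_on_replicated_delete)
{
    int peer = 1 - node;

    publish_terminated(c, node);      /* local termination callback */
    c->n[node].has_dialog = false;

    if (cb_on_replicated_delete)      /* behavior with the proposed fix */
        publish_terminated(c, peer);
    c->n[peer].has_dialog = false;
}

/* Runs one call through the model and returns the final presentity state. */
enum blf_state simulate_call(int invite_node, int bye_node,
        bool cb_on_replicated_delete)
{
    struct cluster c = { .presentity = BLF_IDLE };

    assert(invite_node == 0 || invite_node == 1);
    assert(bye_node == 0 || bye_node == 1);
    invite_on(&c, invite_node);
    bye_on(&c, bye_node, cb_on_replicated_delete);
    return c.presentity;
}
```

In this model, `simulate_call(0, 1, false)` leaves the state at `BLF_CONFIRMED` (the stuck BLF), while `simulate_call(0, 1, true)` and the symmetric-routing case `simulate_call(0, 0, false)` both reach `BLF_TERMINATED`.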

Comparison with local BYE processing

When a dialog ends locally, a termination callback is always run (DLGCB_TERMINATED for BYE, DLGCB_EXPIRED for timeout):

  • dlg_handlers.c line 2150: run_dlg_callbacks(DLGCB_TERMINATED, dlg, req, ...)
  • dlg_req_within.c line 261: run_dlg_callbacks(DLGCB_TERMINATED, dlg, fake_msg, ...)
  • dlg_handlers.c (timeout) line 2555: run_dlg_callbacks(DLGCB_EXPIRED, dlg, fake_msg, ...)

Only dlg_replicated_delete() is missing this callback.

Proposed fix
Add run_dlg_callbacks(DLGCB_TERMINATED, ...) to dlg_replicated_delete(), using a fake message and processing context (same pattern as dual_bye_event() in dlg_req_within.c:257-270):

// After remove_dlg_timer() and before unref_dlg():
{
    struct sip_msg *fake_msg;
    context_p old_ctx;
    context_p *new_ctx;

    // No SIP message exists on the replication path, so build a dummy
    // message and a temporary processing context for the callbacks.
    if (push_new_processing_context(dlg, &old_ctx, &new_ctx,
            &fake_msg) == 0) {
        run_dlg_callbacks(DLGCB_TERMINATED, dlg, fake_msg,
            DLG_DIR_NONE, -1, NULL, 0, 1);

        // Tear down the temporary context and restore the previous one.
        if (current_processing_ctx == NULL)
            *new_ctx = NULL;
        else
            context_destroy(CONTEXT_GLOBAL, *new_ctx);
        set_global_context(old_ctx);
        release_dummy_sip_msg(fake_msg);
    }
}

This requires adding #include "dlg_req_within.h" to dlg_replication.c.
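
The save, swap, and restore shape of the proposed fix can be illustrated in isolation. The sketch below is a simplified stand-in, not the OpenSIPS implementation: `struct ctx`, `struct msg`, `struct dialog`, `current_ctx`, `count_terminated`, and `run_terminated_with_fake_ctx` are all hypothetical names. It only shows the idea that a callback expecting a message and an active context can be run from a path that has neither, provided the caller's context is put back afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the OpenSIPS definitions. */
struct ctx { int id; };
struct msg { int placeholder; };

/* Models a global "current processing context" pointer. */
struct ctx *current_ctx = NULL;

typedef void (*term_cb)(struct msg *m, void *param);

struct dialog {
    term_cb cb;       /* termination callback registered by a module */
    void *cb_param;   /* opaque parameter handed back to the callback */
};

/* Example callback: counts invocations that received both a message
 * and an active context. */
void count_terminated(struct msg *m, void *param)
{
    int *count = param;

    if (m != NULL && current_ctx != NULL)
        (*count)++;
}

/* Runs the termination callback with a temporary context and a
 * placeholder message, then restores the caller's context.
 * Returns 0 on success, -1 if there is nothing to run. */
int run_terminated_with_fake_ctx(struct dialog *dlg)
{
    struct ctx tmp = { .id = 1 };
    struct msg fake = { .placeholder = 0 };
    struct ctx *old;

    if (dlg == NULL || dlg->cb == NULL)
        return -1;

    old = current_ctx;        /* save the caller's context */
    current_ctx = &tmp;       /* swap in the temporary one */
    dlg->cb(&fake, dlg->cb_param);
    current_ctx = old;        /* restore before returning */
    return 0;
}
```

Restoring the saved pointer on every successful path matters because the replication handler may itself be running inside another context; leaving the temporary one installed would leak a dangling pointer to a stack object.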

Versions affected

  • 3.6 branch (confirmed)
  • master (confirmed — same code)
