This issue is a follow-up to #346.

@brandur gave me this recommendation:
> In your case, an alternative: drop the uniqueness checks and then implement your job such that it checks on start up the last time its data was updated. If the update was very recent, it falls through with a no op. So you'd still be inserting lots of jobs, but most of them wouldn't be doing any work, and you wouldn't suffer the unique performance penalty.
However, this solution currently schedules hundreds of inserts per second across our clusters, which causes a lot of extra load across all the job logic plus the notifier.
More importantly, we have roughly 200–400k unique units of work every hour, and we would really like them to run every 15 minutes. Without a uniqueness filter, that schedules millions of units of work per hour. They do end up getting deduplicated at work time, but at the cost of a large amount of database work that slows down other calculations and routines, which creates a vicious cycle: more jobs fail to complete, and more jobs pile up.
A side effect is that the few places where we do schedule unique jobs become very slow, so we basically can't use the unique feature in any job without fear of those scheduling operations taking multiple seconds because of all the other activity on the jobs table.
We could move River to a separate Postgres cluster, but at that point we would migrate away from River entirely, because the advantage of it running in the same database as our data would be gone.
For now we are likely going to implement our own hooks on top of the existing River client, using `InsertTx` to skip scheduling tasks when we don't need them. But it really feels like a weakness of River's unique insert feature. I'm still not really sure who it's for, since it can't scale to any reasonable throughput, and it's also missing a good number of features that come standard in other work queues (the most obvious that comes to mind is uniqueness on a subset of args).
It would be really nice if there were some sort of uniqueness mechanism that didn't use advisory locks. For instance, a nullable unique column on the jobs table with a user-definable ID supplied at insert time immediately comes to mind. That would let me deduplicate tasks by a subset of arguments plus a time interval/sequence ID, which is more than enough for me.