Streaming KMeans [MLLIB][SPARK-3254] #2942
- Used trainOn and predictOn pattern, similar to StreamingLinearAlgorithm
- Decay factor can be set explicitly, or via fractional decay parameters expressed in units of number of batches, or number of points
- Unit tests for basic functionality and decay settings
Test build #22209 has started for PR 2942 at commit
Test build #22209 has finished for PR 2942 at commit
Test PASSed.
Should we create another PR for the Python bindings/example?
@anantasty This PR is still in review. If you are interested in Python bindings for streaming algorithms, could you help add one for StreamingLinearRegression? Thanks!
I would certainly be interested in doing that. I just wasn't sure if it was
It should be in a separate JIRA (and hence a separate PR). Please create a JIRA for
@anantasty Agreed, should be separate, but would be very cool to have! Ping me as well, happy to provide feedback.
- line too wide
KMeans->k-means
Test build #22426 has started for PR 2942 at commit
Test build #22426 has finished for PR 2942 at commit
Test FAILed.
Test build #22428 has started for PR 2942 at commit
Test build #22428 has finished for PR 2942 at commit
Test PASSed.
- Use a single halfLife parameter that now determines the decay factor directly
- Allow specification of timeUnit for the halfLife as “batches” or “points”
- Documentation adjusted accordingly
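
For readers following the new parameterization: a half-life of h time units means the weight given to old data should drop to one half after h units, so the decay factor a satisfies a^h = 0.5. Below is a minimal Python sketch under that assumption. The helper name `decay_factor_from_half_life` is hypothetical and not part of this PR.

```python
import math


def decay_factor_from_half_life(half_life: float) -> float:
    """Return a such that a ** half_life == 0.5 (hypothetical helper)."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return math.exp(math.log(0.5) / half_life)


# Example: a half-life of 5 time units (batches or points, per timeUnit).
a = decay_factor_from_half_life(5.0)

# After 5 time units the contribution of old data has been halved.
remaining = a ** 5
```

A longer half-life gives a decay factor closer to 1 (slow forgetting), while a half-life of 1 gives a decay factor of exactly 0.5.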
@mengxr I implemented the new parameterization (and tried to make the docs on it more intuitive), see what you think!
Test build #22607 has started for PR 2942 at commit
Test build #22607 has finished for PR 2942 at commit
Test PASSed.
@freeman-lab I made some changes: freeman-lab#1 , which includes the following:
If the update looks good to you, could you merge that PR? Thanks!
Update Streaming K-Means
Test build #22673 has started for PR 2942 at commit
Test build #22673 has finished for PR 2942 at commit
Test PASSed.
@mengxr great updates! LGTM. Just need to update the doc/examples in a couple places I think.
Test build #22677 has started for PR 2942 at commit
Test build #22677 has finished for PR 2942 at commit
Test PASSed.
LGTM. Merged into master. Thanks for adding streaming k-means!
This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches.
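
To illustrate the update rule described above, here is a minimal Python sketch for a single cluster center. It assumes one common form of the decayed mini-batch update: the old center is weighted by its accumulated point count multiplied by the decay factor, and the batch mean is weighted by the number of new points. The function name `update_center` and its arguments are hypothetical, and the merged implementation may differ in details such as how the point-based time unit is applied.

```python
def update_center(center, weight, batch_points, decay):
    """Apply one decayed update to a single cluster center (illustrative sketch).

    center:       current center coordinates (list of floats)
    weight:       effective number of points absorbed so far
    batch_points: points from the current batch assigned to this center
    decay:        1.0 keeps all history; 0.0 uses only the newest batch
    """
    m = len(batch_points)
    discounted = weight * decay  # old data counts for less after decay
    if m == 0:
        # No new points: the center stays put, only its weight decays.
        return list(center), discounted

    dim = len(center)
    batch_mean = [sum(p[i] for p in batch_points) / m for i in range(dim)]

    total = discounted + m
    new_center = [
        (c * discounted + x * m) / total
        for c, x in zip(center, batch_mean)
    ]
    return new_center, total


# Old center at the origin backed by 10 points; two new points with mean (2, 2).
batch = [[1.0, 1.0], [3.0, 3.0]]
full_memory, _ = update_center([0.0, 0.0], 10.0, batch, decay=1.0)
half_memory, w = update_center([0.0, 0.0], 10.0, batch, decay=0.5)
no_memory, _ = update_center([0.0, 0.0], 10.0, batch, decay=0.0)
```

With `decay=1.0` the sketch reduces to a running mean over all points seen, and with `decay=0.0` the center jumps to the mean of the newest batch; values in between trade stability for responsiveness to drift.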
The PR includes:
@tdas @mengxr @rezazadeh