I fit some relatively large GLMs in statsmodels and have been experimenting with using patsy instead of a home-rolled alternative. My home-rolled method isn't very good (I tend to underestimate challenges...). I've gotten better hardware, so some models that used to fail with patsy because of memory constraints now work. I've run across a few things that would make it easier for me to use patsy more. Happy to work on PRs for them if there's interest.
Categorical NA logic: Currently, it appears that when a categorical is fed through patsy, every individual value is checked against a rather detailed set of rules for handling NaNs, missing values, and empty values. I ran cProfile on this and it was quite slow. I think the bottleneck is here: https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L341
I know that NaN/NA/None/empty handling is a mess in general, but it's a fact of life in my line of work (insurance modeling). I'm wondering whether we could scope out exactly which scenarios we need to handle and then use something (pandas, maybe?) to do this more elegantly. I'm not sure of the full scope, as there are far more players here than just me.
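To illustrate the kind of thing I mean (a sketch only, not how patsy handles this today): pandas can flag missing values for a whole column in one vectorized call, and `pd.Categorical` encodes `None`/`NaN` with the code `-1`. The sample data is made up, and whether this covers every case patsy's NA handling has to support is exactly the open question.

```python
import numpy as np
import pandas as pd

# Made-up sample column mixing real levels with None and NaN.
values = pd.Series(["a", None, "b", np.nan, "a"], dtype=object)

# One vectorized pass instead of a Python-level check per value.
missing = pd.isna(values).to_numpy()

# pd.Categorical treats None/NaN as missing and gives them code -1.
cat = pd.Categorical(values)
codes = cat.codes

print(missing)
print(list(cat.categories), codes)
```

The boolean mask could then drive whatever NA action is configured (drop the rows, raise, and so on) without touching values one at a time.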
Reading from on-disk data stores: Memory is something of a problem for me. A typical model that I run might eat up 10+ GB of RAM. It works, but it's obviously not ideal. As far as (relatively) mature tools go, I've found bcolz's ctables to be pretty good (and fast). HDFStore/dask support would be nice too. I'm not sure whether "xarray support for categorical data" (#91) relates to this (I don't know if xarray works well as an on-disk storage/data tool).
Partial predictions: This is partly a statsmodels idea and partly a patsy idea... I'd like the ability to do a so-called partial predict. Essentially, I have a model like y ~ a + b + a:c, and I want predictions of y assuming that only a changes or only b changes. Taking a as the example, I think the process would be: build a new dataset with one row per unique value of a, hold b and c constant in every row at their most frequent (or some other innocuous) value, create the design matrix from it, and feed it through the statsmodels predict routine. This is very helpful for GLMs with the log link, which is the bulk of what I work with.
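A rough sketch of the data-preparation half of that, using pandas only. The helper name `partial_grid` and the sample data are made up for illustration, and holding the other columns at their mode is just one choice of "innocuous" value.

```python
import pandas as pd

def partial_grid(df, vary, hold):
    """One row per unique value of `vary`; each column in `hold`
    is fixed at its most frequent value (hypothetical helper)."""
    grid = pd.DataFrame({vary: sorted(df[vary].dropna().unique())})
    for col in hold:
        # mode() returns the most frequent value(s); take the first on ties.
        grid[col] = df[col].mode().iloc[0]
    return grid

# Made-up data with the same column names as the formula y ~ a + b + a:c.
df = pd.DataFrame({
    "a": ["x", "y", "x", "z"],
    "b": ["p", "p", "q", "p"],
    "c": [1.0, 2.0, 2.0, 2.0],
})

grid = partial_grid(df, vary="a", hold=["b", "c"])
print(grid)

# With a model fitted through the statsmodels formula interface, the grid
# would then be passed to predict, e.g. result.predict(grid), where
# `result` is a hypothetical fitted results object not defined here.
```

With a log link the resulting predictions differ across rows only through the terms involving a, so their ratios give the multiplicative effect of each level of a at the chosen b and c.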
Categorical grouping: Suppose I have categories A, B, C, D, and E. The categories aren't really sortable in any logical way, but some could be grouped. Has there been any thought on how (or if) to allow this?
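For reference, the workaround I have in mind today is remapping the column before it reaches the formula. A minimal sketch with pandas, where the grouping itself (A with B, D with E) is an arbitrary example:

```python
import pandas as pd

# Hypothetical grouping: A and B share a level, as do D and E.
groups = {"A": "AB", "B": "AB", "C": "C", "D": "DE", "E": "DE"}

raw = pd.Series(["A", "B", "C", "D", "E", "A"])

# Series.map looks each value up in the dict; values missing from the
# dict would become NaN, so the mapping needs to cover every level.
grouped = raw.map(groups)

print(grouped.tolist())
```

This works, but the grouping lives outside the formula, which is why I'm asking whether it could be expressed there directly.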
Weights: For methods like standardize, it may make sense to weight the observations. (This is really only applicable if you have heavily skewed data where certain values are more prevalent on higher-weighted records.)
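Roughly what I mean by a weighted standardize, as a NumPy sketch. The function name `weighted_standardize` is made up, and it uses the weighted mean and the weighted population variance (no degrees-of-freedom correction), which is an assumption on my part about what the right definition would be.

```python
import numpy as np

def weighted_standardize(x, w):
    """Center by the weighted mean and scale by the weighted
    standard deviation (hypothetical helper, no ddof correction)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    mean = np.average(x, weights=w)
    var = np.average((x - mean) ** 2, weights=w)
    return (x - mean) / np.sqrt(var)

# Made-up values and weights; the last record carries double weight.
x = [1.0, 2.0, 3.0]
w = [1.0, 1.0, 2.0]
z = weighted_standardize(x, w)

# After the transform, the weighted mean is ~0 and the weighted variance ~1.
print(np.average(z, weights=w), np.average(z ** 2, weights=w))
```

In a real implementation the mean and scale would need to be remembered from the training data so that predictions on new data use the same transform.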