patsy questions/wishlist

Hello,

I fit some (relatively) large-ish GLMs in statsmodels and have been experimenting with using `patsy` instead of a home-rolled approach. My home-rolled method isn't very good (I tend to underestimate challenges...). I've gotten better hardware, so some models that used to fail with patsy because of memory constraints now work. I've run across a few things that would make it easier for me to use patsy more. Happy to work on PRs for them if there's interest.
1. Categorical NA logic: Currently, it appears that when a categorical is fed through `patsy`, every individual value is checked against a rather detailed list of rules for handling NaNs/missing values/empty whatever. I ran cProfile on this and it was quite slow. I think the bottleneck is here:
   https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L341
   I know that NaN/NA/None/empty is a mess in general, but it's a fact of life in my line of work (insurance modeling). I'm wondering if we could scope out exactly which scenarios we need to handle and use something (pandas, maybe?) to do this more elegantly. I'm not sure of the full scope, as there are far more players here than me.
2. Reading from on-disk data stores: Memory is something of a problem for me. A typical model that I run might eat up 10+ GB of RAM. It works, but obviously is not ideal. As far as (relatively) mature tools go, I've found bcolz's ctables to be pretty good (and fast). HDFStore/dask support would be nice too. I'm not sure if #91 relates to this (I don't know if xarray works well as an on-disk storage/data tool).
3. Partial predictions: This is partly a statsmodels idea and partly a patsy idea... I'd like the ability to do a so-called partial predict. Say I have a model like `y ~ a + b + a:c`. I want predictions of `y` assuming that only `a` changes or only `b` changes. I think the process (taking the case of changing only `a`) would be to create a new design matrix with every unique value of `a` as a separate row, with `b` and `c` held constant at their most frequent value (or some other innocuous value) for these rows, and then feed this dataset through the statsmodels `predict` routine. This is very helpful for GLMs with the log link, which is the bulk of what I work with.
4. Categorical grouping: Suppose I have categories A, B, C, D, and E. The categories aren't really sortable in any logical way, but some could be grouped. Has there been any thought on how (or if) to allow this? 
5. Weights: For methods like `standardize`, it may make sense to weight the observations. (This is really only applicable if you have really skewed data where certain values are more prevalent on higher-weighted records.)
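
To make the procedure in item 3 concrete, here is a minimal sketch of building the "vary one column, hold the rest fixed" dataset with pandas. `partial_grid` and the toy frame are hypothetical names for illustration, not anything that exists in patsy or statsmodels. It assumes `data` holds only the predictor columns and that the most frequent value is an acceptable reference level for everything held constant.

```python
import pandas as pd


def partial_grid(data, vary):
    """Return one row per unique value of `vary`.

    Every other column is held at its most frequent value.
    Hypothetical helper: assumes `data` contains predictors only.
    """
    # Unique non-missing values of the column we want to vary,
    # in order of first appearance.
    levels = data[vary].dropna().unique()
    grid = pd.DataFrame({vary: levels})
    for col in data.columns:
        if col != vary:
            # mode() can return several values on ties; take the first.
            grid[col] = data[col].mode().iloc[0]
    return grid


# Toy predictors matching the `y ~ a + b + a:c` example.
df = pd.DataFrame({
    "a": ["x", "y", "x", "z", "y", "x"],
    "b": [1.0, 2.0, 1.0, 3.0, 1.0, 2.0],
    "c": ["p", "q", "p", "p", "q", "p"],
})
grid = partial_grid(df, "a")
```

With a model fit through the statsmodels formula interface, `grid` could then be passed to the fitted result's `predict`, which rebuilds the design matrix from the stored formula. For a log link, the resulting predictions for each level of `a` can be compared as ratios against a base level.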

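For item 5, a rough sketch of what a weighted version of `standardize` might look like. It follows the method names of patsy's stateful-transform protocol (`memorize_chunk`, `memorize_finish`, `transform`), but the class itself, `WeightedStandardize`, is hypothetical and uses only NumPy. It accumulates plain weighted sums, which is simpler but less numerically stable than an incremental update, so treat it as an illustration of the idea only.

```python
import numpy as np


class WeightedStandardize:
    """Weighted centering and scaling (hypothetical sketch)."""

    def __init__(self):
        self._sum_w = 0.0
        self._sum_wx = 0.0
        self._sum_wxx = 0.0

    def memorize_chunk(self, x, weights):
        # Accumulate weighted sums so data can arrive in chunks.
        x = np.asarray(x, dtype=float)
        w = np.asarray(weights, dtype=float)
        self._sum_w += w.sum()
        self._sum_wx += (w * x).sum()
        self._sum_wxx += (w * x * x).sum()

    def memorize_finish(self):
        # Weighted mean and (population) weighted standard deviation.
        self.mean_ = self._sum_wx / self._sum_w
        var = self._sum_wxx / self._sum_w - self.mean_ ** 2
        self.std_ = np.sqrt(var)

    def transform(self, x, weights):
        # Weights are only needed while memorizing; accepted here so the
        # call signature matches memorize_chunk.
        return (np.asarray(x, dtype=float) - self.mean_) / self.std_


# Feed the data in two chunks, then transform all of it.
x = np.array([1.0, 2.0, 3.0, 6.0])
w = np.array([1.0, 1.0, 2.0, 0.0])
t = WeightedStandardize()
t.memorize_chunk(x[:2], w[:2])
t.memorize_chunk(x[2:], w[2:])
t.memorize_finish()
z = t.transform(x, w)
```

The assumption is that a class like this could be wrapped with `patsy.stateful_transform` and then called inside a formula with the variable and the weight column as its two arguments. Note that the zero-weight record (`6.0`) has no influence on the centering or scaling, which is the behavior I'm after.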
