Flash Attention

So I noticed it runs WAY slow, then realized my card isn't set up for Flash Attention: I'm running ye olde P40, so no tensor cores. But this fellow apparently made Flash Attention work without them in llama.cpp (https://github.com/ggerganov/llama.cpp/pull/7188). I assume this is not implemented here yet. Any chance it could be?