This is cool and all, but why the fuck are you processing millions of records in pandas (or even polars)? And if shaving two minutes is such a game changer for batch data processing, why isn't the entire pipeline designed for real-time/streaming/near-real-time analytics rather than repeated batches? How frequently are the batches rerun? Hourly? Is it actually batch, or is this really aiming at microbatch?
And no shit installing conda-based packages is going to be slower. They're prepackaged bundles of libraries that naturally run the risk of pulling in bloat via unnecessary dependencies and dated packages.
Why isn't any of this happening in a database? Is the math that difficult to implement (even if only as UDFs) in something with scalable compute and/or streaming support?
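To make that concrete, here's a minimal sketch of what I mean by pushing the work into a database engine instead of a DataFrame library. The choice of DuckDB, the file name, and the column names are all my assumptions for illustration, not anything from the original post; the same idea applies to Postgres, BigQuery, Snowflake, whatever you already have.

```python
# Sketch: let the database do the aggregation, and only pull the
# (small) result back into Python. Assumes a hypothetical Parquet
# file 'orders.parquet' with customer_id and amount columns.
import duckdb

con = duckdb.connect()  # in-process, in-memory; could just as well be a warehouse

result = con.execute(
    """
    SELECT customer_id,
           SUM(amount) AS total_amount,
           COUNT(*)    AS n_orders
    FROM read_parquet('orders.parquet')  -- hypothetical input file
    GROUP BY customer_id
    """
).df()  # only the aggregated rows come back as a DataFrame

print(result.head())
```

Point being: the millions of raw records never need to sit in a pandas DataFrame at all, so there's nothing to micro-optimize in the first place.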
Just everything about shaving down batch processing screams red flag to me.
