Polars vs Pandas and Citizen Data Scientists

If you are deep in the world of analytics, you have probably already heard about Polars. Beyond the technical comparison, the growing interest in Polars signals progress toward creating more citizen data scientists.

The definition of a true data scientist has been hazy for almost a decade. Analytics professionals like me, who worked primarily with open-source tools (such as data frame libraries designed for single machines, like Python’s Pandas), were also classified as data scientists. While what constitutes a true data scientist is up for debate, there is no doubt that libraries like Pandas have made advanced analytics easier for those who do not love writing code from scratch or do not see value in doing so.

That is why Pandas has become such a popular library. The new entrant, Polars, seems to offer the same advantages, and then some.

Two key characteristics are at the core of why Polars is gaining traction:

  • It is much faster than Pandas (roughly 5-10 times faster, depending on the workload).
  • It uses significantly less memory than Pandas. Pandas may require 5 to 10 times as much RAM as the size of the dataset being processed. For Polars, the figure is around 2 to 5 times.

Not that Polars is going to eclipse Pandas anytime soon. The primary bottleneck is that it is not yet compatible with frequently used Python data visualization libraries and machine learning libraries like PyTorch and scikit-learn.

The reasons behind these two advantages of Polars are listed in the appendix of this article. But let us first try to understand why Polars is getting so much attention, and what that means for organizations.

The last few years have seen rapid democratization of advanced analytics, thanks to open-source tools and plenty of easily accessible online training resources (the majority of which are free). This has created a “turbo-charged” set of planning professionals. The professionals in this category are planners and analysts who already used analytics in some form but were primarily limited to worksheets.

Many in this category embraced open learning, self-learning, or organization-sponsored reskilling to gain skills that came easily with open-source languages (like Python) and associated libraries (like Pandas). This is the level of penetration of analytics that organizations need in their frontline planners and analysts to become truly data-driven.

For these smart folks, who were already engaged with analytics, it was not difficult to leverage easy-to-learn Python libraries to perform at least some elementary machine-learning experiments. While this led to some of them being labeled as data scientists, I would rather call them citizen data scientists. For them, leveraging Python libraries was akin to self-service analytics.

You are probably getting impatient at this point, asking where the rising popularity of Polars fits into all this.

As the popularity and adoption of single-machine libraries like Pandas grew, one key bottleneck emerged on the path to multiplying citizen data scientists across organizations.

Not every organization provides systems with ample RAM and powerful processors to the hundreds or thousands of analysts in its pool. And that is where the bottleneck emerged as datasets became larger. After all, many of these analysts wanted to experiment with these open-source tools precisely because they wanted to work with larger datasets. However, the computing time and memory requirements held them back.

Polars seeks to address that bottleneck.

While we are looking at Polars primarily through the lens of a new, more efficient Python library, the implications and trends behind it are important too. We will see more such libraries emerge in the next few years. We will see more advances in addressing the bottlenecks that citizen data scientists run into while working within business processes. This may be a great time for many organizations to understand and plan how to reskill their teams of analysts and planners, so that when the time is right (in approximately two years), they can build the army that will lead them to victory.


Appendix: Why Polars is faster and needs less memory

  • While Pandas is built on top of Python libraries like NumPy, Polars is written in Rust, a compiled systems language with performance comparable to C and C++. Rust also allows safe concurrency, which means that Polars can use all the cores in your processor for complex queries.
  • Polars has an expressive API, allowing you to express most operations as Polars methods rather than custom Python code.
  • Polars supports both eager and lazy execution, whereas Pandas executes eagerly only. With lazy execution, Polars can optimize the whole query before running it.
  • Polars’ data layer is built on Apache Arrow, which provides interoperability and helps performance. The interoperability allows it to skip format conversions as the data flows through the steps of the data pipeline. Also, since two processes can share the same data without making a copy, memory efficiency improves.
