dask
My Linked Notes
- 2020-10-27
- I do think they have some competition in the space, mainly [[ray]] and [[dask]] (or more apt, their parent companies [[anyscale]] and [[Coiled Computing]])
- Well, they aren't much competition yet (as they are really small), but they could get there. I think the huge advantage these new companies have is they are based on python libraries that are much easier to use than spark, while quickly becoming just as powerful
- For both ray and dask, you can `pip install` their libraries and get started running them locally. They can also be used to parallelize [[pandas]], which is the number 1 tool for most data scientists
- This is really important, because data scientists don't want to have to learn spark, but technologies like spark were the only way to run big data processing jobs
- dask and ray provide a "native" python way to run parallel computations. And of course, their parent businesses are going to let users pay to have a reliable and easy to use environment to run their software
- Ray and Dask could be in trouble if databricks decides it wants to start offering environments with those libraries included
- Looks like running ray on databricks is already happening here, but the guy who gave this talk is from Anyscale, so I guess he isn't scared that running ray on databricks will hurt them
- both of these libraries also already provide easy install instructions for how to run clusters on many different platforms, from EKS, to Azure, to whatever else
- Besides parallelizing pandas, they both have APIs for parallelizing any python task; the same code runs on a laptop and scales up to any cluster
- Ray is working on a streaming library, while dask only partially supports streaming, via queues
- Spark is an industry-accepted tool for large-scale data streaming
- trailheads
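
To make the "parallelize any python task" point from the 2020-10-27 note concrete, here is a minimal sketch using `dask.delayed`. It assumes dask is installed (`pip install dask`); the `load` and `total` functions are hypothetical stand-ins for real work, not anything from the note.

```python
# Minimal sketch, assuming `pip install dask` has been run.
# `load` and `total` are hypothetical placeholder tasks.
from dask import delayed


def load(i):
    # Stand-in for an expensive step, e.g. reading one file.
    return list(range(i))


def total(chunk):
    # Stand-in for per-chunk processing.
    return sum(chunk)


# Wrapping calls in delayed() records them in a task graph
# instead of running them right away.
parts = [delayed(total)(delayed(load)(i)) for i in range(1, 5)]
result = delayed(sum)(parts)

# compute() executes the graph; independent tasks can run in parallel
# on the local scheduler.
answer = result.compute()
print(answer)
```

As far as I know, the same graph can be sent to a cluster by creating a `dask.distributed` `Client` before calling `compute()`, which is the "laptop to cluster" story the note describes.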