ray

My Linked Notes

2020-10-27
- I do think they have some competition in the space, mainly [[ray]] and [[dask]] (or, more aptly, their parent companies [[anyscale]] and [[Coiled Computing]])
  - Well, they aren't much competition yet (as they are really small), but they could get there. I think the huge advantage these new companies have is they are based on python libraries that are much easier to use than spark, while quickly becoming just as powerful
  - For both ray and dask, you can pip install their libraries and get started running them locally. They can also be used to parallelize [[pandas]], which is the number 1 tool for most data scientists
    - This is really important: data scientists don't want to have to learn spark, but technologies like spark were the only way to run big data processing jobs
    - dask and ray provide a "native" python way to run parallel computations. And of course, their parent businesses are going to let users pay to have a reliable and easy to use environment to run their software
  - Ray and Dask could be in trouble if databricks decides it wants to start offering environments with those libraries included
    - Looks like running ray on databricks is already happening here, but the guy who gave this talk is from Anyscale, so I guess he isn't scared that running ray on databricks will hurt them
  - both of these libraries also already provide easy install instructions for how to run clusters on many different platforms, from EKS, to Azure, to whatever else
  - Besides parallelizing pandas, they both have APIs for parallelizing any python task, and the same code runs on a laptop and scales up to any cluster
  - Ray is working on a streaming library, while dask half supports streaming with queues
    - Spark is an industry-accepted tool for large-scale data streaming
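
To make the "parallelize any python task" point concrete, here is a rough sketch using dask's `delayed` API on the local threaded scheduler. `slow_square`, `run_all` and the input values are made up for illustration, and the serial fallback is only there so the snippet still runs on a machine without dask installed. Ray offers a similar pattern with its own task API.

```python
# Sketch only: slow_square, run_all and the inputs are hypothetical.
import time

try:
    import dask
    from dask import delayed
except ImportError:  # dask not installed: fall back to plain serial python
    dask = None


def slow_square(x):
    """Stand-in for an expensive per-item python task."""
    time.sleep(0.01)
    return x * x


def run_all(values):
    """Square every value, through dask when it is available."""
    if dask is None:
        return [slow_square(v) for v in values]
    # delayed() only records the call; nothing runs until compute().
    tasks = [delayed(slow_square)(v) for v in values]
    # The threaded scheduler runs on the local machine.
    return list(dask.compute(*tasks, scheduler="threads"))


results = run_all(range(8))
print(results)
```

The function being parallelized is ordinary python, which is the appeal over spark: only the way the calls are scheduled changes, and pointing the same task list at a cluster scheduler is meant to be a configuration change rather than a rewrite.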
trailheads
- [[ray]]
