ray



My Linked Notes

  • 2020-10-27
    • I do think they have some competition in the space, mainly [[ray]] and [[dask]] (or, more aptly, their parent companies [[anyscale]] and [[Coiled Computing]])
      • Well, they aren't much competition yet (they are really small), but they could get there. I think the huge advantage these new companies have is that they are built on Python libraries that are much easier to use than Spark, while quickly becoming just as powerful
      • For both Ray and Dask, you can pip install the library and start running it locally. They can also be used to parallelize [[pandas]], which is the number one tool for most data scientists
        • This is really important: data scientists don't want to have to learn Spark, but until now technologies like Spark were the only way to run big data processing jobs
        • Dask and Ray provide a "native" Python way to run parallel computations. And of course, their parent businesses are going to let users pay for a reliable, easy-to-use environment to run that software in
      • Ray and Dask could be in trouble if Databricks decides it wants to start offering environments with those libraries included
        • Looks like running Ray on Databricks is already happening here, but the guy who gave this talk is from Anyscale, so I guess he isn't scared that running Ray on Databricks will hurt them
      • Both of these libraries also already provide easy instructions for running clusters on many different platforms, from EKS, to Azure, to whatever else
      • Besides parallelizing pandas, they both have APIs for parallelizing any Python task, so the same code runs on a laptop and scales up to any cluster
      • Ray is working on a streaming library, while Dask only half supports streaming, via queues
        • Spark, by contrast, is an industry-accepted tool for large-scale data streaming
  • trailheads
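
To make the "parallelize any Python task" point above concrete, here is a rough sketch of what that looks like with Ray's task API (`ray.remote`, `.remote(...)`, `ray.get`). The function `square` and the helper `parallel_map` are hypothetical names made up for this note, and the fallback to a standard-library thread pool when Ray isn't installed is my own addition so the snippet runs anywhere; it is not something Ray provides.

```python
# Sketch: fan a plain Python function out over workers.
# Uses Ray when it is installed; otherwise falls back to a stdlib thread pool
# (the fallback is an assumption added here so the sketch runs anywhere).
from concurrent.futures import ThreadPoolExecutor

try:
    import ray
except ImportError:  # Ray is not installed
    ray = None


def square(x):
    """Hypothetical stand-in for any per-item Python task."""
    return x * x


def parallel_map(func, items):
    """Apply func to every item in parallel; results come back in input order."""
    items = list(items)
    if ray is not None:
        if not ray.is_initialized():
            ray.init(ignore_reinit_error=True)  # start a local Ray instance
        remote_func = ray.remote(func)          # same effect as the @ray.remote decorator
        futures = [remote_func.remote(item) for item in items]
        return ray.get(futures)                 # block until every task has finished
    with ThreadPoolExecutor() as pool:
        return list(pool.map(func, items))


if __name__ == "__main__":
    print(parallel_map(square, range(5)))
```

The appeal is that `square` stays an ordinary Python function. As I understand it, pointing `ray.init` at a cluster address instead of starting a local instance is the main change needed to scale the same code out.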
