databricks



My Linked Notes

  • 2020-10-27
    • Recent news about a [[databricks]] IPO has got me thinking about the space
      • Databricks' main offering is an easy-to-use, reliable interface for running [[spark]] at large scale. Spark is an open source project, so they were an early entrant into the space of platforms that make running open source software easier. One of the biggest advantages of databricks is that users no longer have to worry about installing and managing spark, which can be a nightmare.
        • Spark is still a very widely adopted and useful software library
      • I do think they have some competition in the space, mainly [[ray]] and [[dask]] (or more aptly, their parent companies [[anyscale]] and [[Coiled Computing]])
        • Well, they aren't much competition yet (they are really small), but they could get there. The huge advantage these new companies have is that they are built on python libraries that are much easier to use than spark, while quickly becoming just as powerful
        • For both ray and dask, you can pip install the library and get started running it locally. They can also be used to parallelize [[pandas]], which is the number one tool for most data scientists (see the sketches after these notes)
          • This is really important, because data scientists don't want to have to learn spark, but until recently technologies like spark were the only way to run big data processing jobs
          • dask and ray provide a "native" python way to run parallel computations. And of course, their parent companies will happily let users pay for a reliable and easy-to-use environment to run that software
        • Ray and Dask could be in trouble if databricks decides it wants to start offering environments with those libraries included
          • Looks like running ray on databricks is already happening here, but the person who gave this talk is from Anyscale, so I guess he isn't worried that running ray on databricks will hurt them
        • Both of these libraries also already provide easy instructions for running clusters on many different platforms, from EKS to Azure to whatever else
        • Besides parallelizing pandas, they both have APIs for parallelizing arbitrary python tasks that run on a laptop and scale up to any cluster (the ray sketch below shows the idea)
        • Ray is working on a streaming library, while dask half supports streaming with queues
          • Spark, by contrast, is an industry-accepted tool for large-scale data streaming
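
A minimal sketch of the dask side of this, assuming `pip install "dask[dataframe]"`; the file pattern and column names are made up for illustration:

```python
# Parallelize a pandas-style workflow with dask on local cores.
import dask.dataframe as dd

# Lazily read a directory of CSVs as one partitioned dataframe
# (the "events-*.csv" pattern and columns are hypothetical)
df = dd.read_csv("events-*.csv")

# Familiar pandas-style API; .compute() triggers the parallel work
result = df.groupby("user_id")["amount"].sum().compute()
print(result.head())
```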
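
And a minimal sketch of the "parallelize any python task" point with ray, assuming `pip install ray`; `slow_square` is just a stand-in for real work:

```python
# Run an ordinary Python function in parallel with ray.
import ray

ray.init()  # starts a local cluster; the same code can target a remote cluster

@ray.remote
def slow_square(x):
    # Stand-in for a real, expensive task
    return x * x

# Launch tasks concurrently and gather the results
futures = [slow_square.remote(i) for i in range(10)]
print(ray.get(futures))  # [0, 1, 4, ..., 81]
```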

One last thing

If you liked these notes, hit me up on Twitter!