The Technology Gap in Data Science

The ability to extract value from data is more urgent than ever for all major businesses. However, according to Gartner, over 85% of data science projects fail. Bodo aims to solve this problem by eliminating key hurdles in the application development process.

Billions of dollars of capital investments in AI/ML platforms aim to simplify analytics for data teams, yet enterprises struggle to gain the insights they need from their data assets. The fundamental problem? Existing technologies do not empower most data scientists to develop applications at scale. Keeping pace with rapidly changing business environments requires one to quickly:

  • Understand the business problem
  • Find the right analytics-based solution, and
  • Translate it into working code.

Analytics applications are data-hungry and need to scale to large datasets. This requires learning and applying some form of parallel programming for compute clusters. Data scientists are usually experts in their business domains but not in high-performance computing, which results in a significant skill mismatch. Expecting them to master both is simply too much to ask.

Bodo: The First Auto-Parallel Platform

The ability to parallelize programs automatically is a key driver of data science productivity at scale. Bodo is the first platform that provides automatic parallelization and High-Performance Computing (HPC) capabilities for analytics applications. This democratization of HPC allows data scientists to focus on solving the problem instead of rewriting their Python code in other languages and parallel frameworks such as Apache Spark and SQL. Previous attempts have focused on building new parallel libraries with APIs similar to data science ones. However, they still require managing parallelism at the application level, even if they claim to be “drop-in replacements”. In contrast, Bodo provides the first practical compiler auto-parallelization algorithm, which had been an elusive holy grail in computer science for decades.

Auto-parallelization means that data scientists can design, develop, and test their code as though it were serial, while Bodo incorporates parallelization transparently at runtime. We have achieved this by 1) focusing on standard data science APIs in Python and treating them as first-class programming languages, and 2) building on available LLVM, Python, and HPC technologies.
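To make the contrast concrete, here is a minimal sketch of the gap that auto-parallelization is meant to close. It is not Bodo's API or algorithm. The function names (serial_mean, partial_sum_and_count, chunked_mean) are hypothetical, and a thread pool from the Python standard library stands in for the ranks of a compute cluster. The first function is the serial logic a data scientist would naturally write; the rest is the partition, local-compute, and reduce scaffolding that would otherwise have to be written by hand.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_mean(values):
    """The serial logic a data scientist would naturally write."""
    return sum(values) / len(values)


def partial_sum_and_count(chunk):
    """Work each worker does independently on its own slice of the data."""
    return sum(chunk), len(chunk)


def chunked_mean(values, n_workers=4):
    """Hand-written data-parallel version: partition, compute locally, reduce.

    Illustrative only: threads stand in for cluster ranks, and the input
    is assumed to be a non-empty list.
    """
    # Block-partition the data so each worker owns one contiguous slice.
    size = (len(values) + n_workers - 1) // n_workers
    chunks = [values[i:i + size] for i in range(0, len(values), size)]

    # Each worker computes a partial result on its slice.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_and_count, chunks))

    # Reduction step: combine the partial results into the global answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count
```

Both functions return the same value for the same input, but only the first reads like the business logic. The premise of auto-parallelization is that the second form should be derived by the compiler rather than written by the data scientist.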

Business Pain Points and Requirements

In addition to the slowdown of Moore’s law [1], this parallelism technology gap has led to complex solutions and processes in the enterprise. Most such solutions and processes tend to “glue” together packages to create applications. An unintended consequence of this is described in a paper by Google [2], which argues that even mature analytics systems might end up being (at most) 5% machine learning code and (at least) 95% glue code. This glue code, these pipeline jungles, and the rewriting of native Python code are at the root of the increasing complexity and cost for businesses. Enterprises need solutions that deliver simplicity, agility, performance, efficiency, and lower aggregate cost at the same time. We believe that analytics will become so pervasive in all enterprises that even those with modest programming backgrounds should be able to mine and extract meaning out of data, like any natural resource.

Bodo’s Solution Offering

Bodo offers a universal analytics data optimization engine, which is different from a library. Bodo combines the simplicity of the lingua franca of data science (Python) with the scalability and efficiency of HPC architecture (with MPI). This closes the “Productivity-Performance” gap by giving data science applications the same architecture that is used in the most powerful supercomputers today. The results are unprecedented productivity, performance, and scalability with lower infrastructure costs at every level of the enterprise stack.

  • Simplicity: Bodo parallelizes and optimizes native Python code automatically, eliminating glue code complexity.

  • Productivity: Data scientists can now own the entire pipeline from prototype to development to production with their original Python code. This allows them to spend their time developing ML code rather than rewriting their code for production scaling.

  • Performance and Scalability: We have shown well over 100X improvement over Spark and over 10,000X improvement over Python/Pandas on applications from several Fortune 500 and Fortune 10 companies. We offer linear and unlimited scaling, as demonstrated on thousands of CPU cores.

  • Quality: Data scientists can accelerate their workflow and gain higher-quality insights, since they can perform more development iterations and validation in a given timeframe.

  • Aggregate Cost: Our automatic parallelization and optimization technology can maximize CPU utilization. This has been demonstrated to yield a 10–100X reduction in infrastructure costs, including the cost of operating in the Cloud.

  • Adaptability: Bodo’s engine is highly adaptable. Not only is it hardware agnostic (CPUs, GPUs, FPGAs, DSAs), but it can also be integrated with other environments like Spark and SQL. This flexibility allows enterprises to get more out of their investments by making those systems more efficient and performant using Bodo.

Looking Ahead

While our vision is to be the Data Optimization Engine for all workloads and personas, our singular focus is enabling a performant, efficient, and productive engine for Pythonic workloads. We have developed our initial solution on CPUs, and we have plans to port and optimize on GPUs and FPGAs soon. Technology has always evolved toward the path of least resistance: simplicity, performance, and cost-effectiveness. Bodo intends to be a force in that evolution, but we cannot do it alone. We are continuously engaging with and learning from customers and partners. We will actively participate in, contribute to, and listen to the data science and open source communities to guide our direction in this evolution. Follow us at bodo.ai for future plans and rollout.

1: Moore, Gordon. “The Future of Integrated Electronics.” Fairchild Semiconductor internal publication (1964).

2: Sculley, D., et al. “Hidden Technical Debt in Machine Learning Systems.” In Proceedings of NIPS (2015).