If you have spent any time reading our blog posts, you’ll instantly recognize our roots in open source. While we have delivered many projects over the years using proprietary software as well as open source, our hearts belong to the open source community. Not only have we delivered highly successful projects, but we have also become close friends and colleagues of many of the people who started these projects, and we use them on a daily basis. So, our blog post today is really a “Top 10” list of the open source projects we would typically use in a data analytics and data strategy consulting engagement. These are in no specific order, but we find great value in all of them!
Apache Hop - https://hop.apache.org/
The Hop Orchestration Platform, or Apache Hop, aims to facilitate all aspects of data and metadata orchestration.
Hop is an entirely new open source data integration platform that is easy to use, fast, and flexible.
Hop aims to be the future of data integration. Visual development enables developers to be more productive than they can be through code alone. With its “design once, run anywhere” approach, workflows and pipelines can be designed in the Hop GUI and run on the native Hop engine (local or remote), or on Spark, Flink, Google Dataflow, or AWS EMR through Beam. Lifecycle management enables developers and administrators to switch between projects, environments, and purposes without losing their train of thought.
Apache Superset - https://superset.apache.org/
Superset is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill levels to explore and visualize their data, from simple line charts to highly detailed geospatial charts. This is a great option for customers who want to use open source software but need to install everything on premises.
Apache Iceberg - https://iceberg.apache.org/
Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.
Apache Beam - https://beam.apache.org/
Apache Beam is a unified programming model for both batch and streaming data processing, enabling efficient execution across diverse distributed execution engines and providing extensibility points for connecting to different technologies and user communities.
Apache Airflow - https://airflow.apache.org/
The mission of Apache Airflow is the creation and maintenance of software related to workflow automation and scheduling that can be used to author and manage data pipelines. By the way, using Apache Airflow in conjunction with Apache Hop is a great way to implement pipelines, and monitor those pipelines!
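As a sketch of that Airflow-plus-Hop combination: the DAG below schedules a Hop pipeline nightly through a `BashOperator`. The Hop install path, project name, and pipeline file are hypothetical placeholders, and the `schedule` argument assumes Airflow 2.4 or later; adjust all of these to your own environment.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Sketch only: paths and names below are illustrative placeholders.
with DAG(
    dag_id="nightly_hop_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # Airflow 2.4+ keyword
    catchup=False,
) as dag:
    run_pipeline = BashOperator(
        task_id="run_hop_pipeline",
        bash_command=(
            "/opt/hop/hop-run.sh "          # hypothetical Hop install path
            "--project my_project "
            "--file pipelines/load_sales.hpl "
            "--runconfig local"
        ),
    )
```

Airflow then gives you the scheduling, retries, and monitoring UI, while Hop remains responsible for the actual data transformation logic.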
Apache Calcite - https://calcite.apache.org/
Calcite is a framework for writing data management systems. It converts queries, represented in relational algebra, into an efficient executable form using pluggable query transformation rules. There is an optional SQL parser and JDBC driver. Calcite does not store data or have a preferred execution engine. Data formats, execution algorithms, planning rules, operator types, metadata, and cost model are added at runtime as plugins.
Apache Drill - https://drill.apache.org/
Apache Drill is a distributed, massively parallel processing (MPP) query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. It was inspired in part by Google's Dremel.
Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
Apache HTTP Server - https://httpd.apache.org/
The granddaddy of Apache projects! If you have ever built websites or web applications in the past 25+ years, you’ve probably used the Apache HTTP Server!
The Apache HTTP Server Project is an effort to develop and maintain an open-source HTTP server for modern operating systems including UNIX and Windows. The goal of this project is to provide a secure, efficient and extensible server that provides HTTP services in sync with the current HTTP standards.
The Apache HTTP Server ("httpd") was launched in 1995 and it has been the most popular web server on the Internet since April 1996.
Apache Spark - https://spark.apache.org/
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Use cases for Apache Spark include batch and streaming data processing, SQL analytics against big data sources for dashboards and ad hoc reporting, data science, and machine learning.
Apache Pinot - https://pinot.apache.org/
Apache Pinot is a real-time distributed online analytical processing (OLAP) datastore. Use Pinot to ingest and immediately query data from streaming or batch sources, including Apache Kafka, Amazon Kinesis, Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage.
Apache Pinot includes the following:
- Ultra-low-latency analytics, even at extremely high throughput.
- A columnar data store with several smart indexing and pre-aggregation techniques.
- Scaling up and out with no upper bound.
- Consistent performance based on the size of your cluster and an expected queries-per-second (QPS) threshold.
It's perfect for user-facing real-time analytics and other analytical use cases, including internal dashboards, anomaly detection, and ad hoc data exploration.
The projects above are only a small percentage of all the wonderful Apache projects out there. These are the ones we find most beneficial for the kind of consulting we do at KPI Forge. We hope you found this informative. As always, let’s talk about your data and help you make great, data-driven business decisions!