
Building a Basic APM Tool with Apache Hop, PostgreSQL, and Apache Superset

In the world of application performance monitoring (APM), having real-time insights into system metrics can make all the difference in maintaining smooth operations. While there are plenty of commercial tools available, building your own basic APM setup can be a cost-effective and educational alternative. In this post, we'll walk through creating a simple APM tool that collects system metrics every 15 seconds using Apache Hop for data ingestion, PostgreSQL on AWS Free Tier for storage, and Apache Superset for visualization.


This setup is ideal for monitoring basic metrics like CPU usage, memory consumption, and system timestamps on a single machine or server. We'll focus on a pipeline in Apache Hop to gather data, a workflow to handle periodic execution, and Superset dashboards to make the data actionable. Let's dive in!


Prerequisites


Java 11 or later installed (required for Apache Hop). Newer Hop releases may require Java 17, so check the release notes for the version you download.

Apache Hop downloaded and installed from the official site (hop.apache.org). It's a lightweight ETL tool that's easy to set up—just unzip and run.

Apache Superset installed locally via Docker, or a free-tier account at Preset.io. You can follow the quickstart guide on superset.apache.org.

Basic familiarity with databases and ETL concepts.


Step 1: Setting Up PostgreSQL on AWS Free Tier


We'll use Amazon RDS to host our PostgreSQL database, leveraging the AWS Free Tier to keep costs at zero for light usage.


1. Log in to your AWS Management Console and navigate to RDS.

2. Click "Create database" and select PostgreSQL as the engine.

3. Choose the Free Tier template (db.t4g.micro instance class, 20 GB storage).

4. Set a database name (e.g., `apm_metrics`), master username, and password. Enable public access for simplicity, but restrict the security group's inbound rule for port 5432 to your own IP address, and turn public access off in production.

5. Launch the instance—it should take a few minutes to provision.

6. Once ready, note the endpoint (e.g., `apm-metrics.cxxxxxx.us-east-1.rds.amazonaws.com`) and port (default 5432).
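The endpoint, port, database name, and credentials are needed again later when connecting from Hop and Superset. Superset in particular accepts a SQLAlchemy-style URI, so here is a minimal sketch of how one could be assembled. The function name `build_postgres_uri` and the user name and password are hypothetical placeholders, not values from this setup.

```python
from urllib.parse import quote_plus


def build_postgres_uri(host, database, user, password, port=5432):
    """Return a SQLAlchemy-style PostgreSQL URI.

    User and password are percent-encoded so that characters such as
    '@' or '/' do not break the URI.
    """
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


uri = build_postgres_uri(
    host="apm-metrics.cxxxxxx.us-east-1.rds.amazonaws.com",  # endpoint from step 6
    database="apm_metrics",
    user="apm_admin",        # hypothetical master username
    password="p@ss/word",    # hypothetical password
)
print(uri)
```

Hop asks for the host, port, database, user, and password as separate fields, so the URI form is mainly useful on the Superset side.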



Step 2: Building the Data Collection Pipeline in Apache Hop


Apache Hop excels at building data pipelines visually. We'll create a simple pipeline that generates a starting row, fetches system info, and writes it to PostgreSQL.
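If you think more easily in code than in diagrams, the pipeline is roughly analogous to the loop below. This is only an illustration of the data flow (generate a row, enrich it with system info, hand it to a sink), not how Hop is implemented. The names `collect_row` and `run_pipeline` are made up for this sketch, and the sink here is a plain list rather than PostgreSQL.

```python
import socket
import time
from datetime import datetime, timezone


def collect_row():
    """Gather a few system values, similar in spirit to Get System Data."""
    hostname = socket.gethostname()
    try:
        ip_address = socket.gethostbyname(hostname)
    except OSError:
        ip_address = None  # hostname could not be resolved
    return {
        "timestamp": datetime.now(timezone.utc),
        "hostname": hostname,
        "ip_address": ip_address,
    }


def run_pipeline(sink, interval_seconds=15, max_rows=None, sleep=time.sleep):
    """Emit one row per interval and pass it to `sink` (the 'table output')."""
    emitted = 0
    while max_rows is None or emitted < max_rows:
        sink(collect_row())
        emitted += 1
        if max_rows is None or emitted < max_rows:
            sleep(interval_seconds)


# Demo: three rows with no delay, collected into a list instead of a database.
rows = []
run_pipeline(rows.append, interval_seconds=0, max_rows=3)
print(len(rows), sorted(rows[0]))
```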


1. Launch Apache Hop GUI (run `hop-gui.sh` or `hop-gui.bat`).

2. Create a new pipeline file (`.hpl`).

3. Add the **Row Generator** transform: This "drives" the pipeline by emitting an empty row that initiates the flow. Check "Never stop generating rows" and set the interval to 15000 milliseconds (15 seconds), so that a new row is emitted every 15 seconds instead of just once.


4. Connect it to the **Get System Data** transform: This pulls system metrics. Select fields like:

- System date (variable) for timestamp.

- Hostname.

- IP address.

- JVM free memory.

- Total physical memory size (bytes).

- JVM CPU time (milliseconds) for basic CPU insights.

Name the output fields to match your DB table (e.g., `timestamp`, `hostname`).

5. Connect to a **Table Output** transform: Configure the database connection to your AWS RDS PostgreSQL (use the endpoint, username, password). Target the `system_metrics` table and map fields accordingly. Enable "Specify database fields" for accuracy. Click the SQL button to generate and run the DDL that creates the table on your target Postgres instance.
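For reference, the DDL that the SQL button produces will look something like the statement in the sketch below. The exact column names and types depend on how you named the fields in Get System Data; only `timestamp`, `hostname`, and `jvm_free_memory` appear in this post, and the remaining column names and all column types are assumptions. To keep the example runnable without an RDS instance, it executes against an in-memory SQLite database as a stand-in; with a PostgreSQL driver you would pass that driver's connection to `create_table` instead.

```python
import sqlite3

# Assumed shape of the table; adjust names and types to your field mapping.
CREATE_SYSTEM_METRICS = """
CREATE TABLE IF NOT EXISTS system_metrics (
    "timestamp"            TIMESTAMP,
    hostname               VARCHAR(255),
    ip_address             VARCHAR(64),
    jvm_free_memory        BIGINT,
    total_physical_memory  BIGINT,
    jvm_cpu_time           BIGINT
)
"""


def create_table(conn):
    """Run the DDL on any DB-API connection and commit."""
    cur = conn.cursor()
    cur.execute(CREATE_SYSTEM_METRICS)
    conn.commit()


# Stand-in database so the sketch runs anywhere (not PostgreSQL).
conn = sqlite3.connect(":memory:")
create_table(conn)
columns = [row[1] for row in conn.execute("PRAGMA table_info(system_metrics)")]
print(columns)
```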


Test the pipeline by running it—it should insert one row of metrics into your DB every 15 seconds.
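One way to confirm that rows really do arrive about every 15 seconds is to pull the timestamps back out of the table and look for unexpectedly large gaps. The sketch below assumes you have already fetched the `timestamp` values into a Python list; the function name `find_gaps` and the 5-second tolerance are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_seconds=15, tolerance_seconds=5):
    """Return (previous, current) pairs whose spacing exceeds the expected interval."""
    ordered = sorted(timestamps)
    limit = timedelta(seconds=expected_seconds + tolerance_seconds)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Made-up sample data: three on-time rows, then one that is 45 seconds late.
start = datetime(2024, 1, 1, 12, 0, 0)
samples = [start + timedelta(seconds=s) for s in (0, 15, 30, 75)]
gaps = find_gaps(samples)
print(len(gaps), gaps)
```

An empty result means every row landed within the tolerance. A gap usually points to the pipeline having stopped or the database being unreachable for a while.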




Step 3: Visualizing Metrics with Apache Superset


Now that data is flowing in, let's visualize it.


1. Start Apache Superset (e.g., via `docker-compose up` if using Docker).

2. In Superset or in your Preset.io account, add a new database connection: Choose PostgreSQL, then enter your RDS endpoint, port, username, password, and database name.

3. Create a dataset from the `system_metrics` table.

4. Build charts:

- Line chart for JVM free memory over time (x-axis: timestamp, y-axis: jvm_free_memory).

- Gauge or bar chart for current total physical memory.

- Time-series for CPU time to spot spikes.

5. Assemble them into a dashboard titled "APM Dashboard."
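A note on the CPU chart: if the JVM CPU time you collect behaves as a cumulative counter (an assumption worth checking against your own data), plotting the raw value gives an ever-rising line, and spikes are easier to see as the change between consecutive samples. The sketch below shows that calculation on made-up numbers; `cpu_time_deltas` is a hypothetical helper. In Superset the same idea could be expressed with a SQL window function such as `LAG` in a virtual dataset.

```python
def cpu_time_deltas(samples):
    """Turn cumulative (time, value) samples into per-interval deltas.

    Samples where the counter went backwards (for example after a restart)
    are skipped rather than reported as negative usage.
    """
    deltas = []
    for (_, v_prev), (t_curr, v_curr) in zip(samples, samples[1:]):
        delta = v_curr - v_prev
        if delta < 0:
            continue  # counter reset; no meaningful delta for this interval
        deltas.append((t_curr, delta))
    return deltas


# Made-up samples: (seconds since start, cumulative CPU time in ms).
samples = [(0, 100), (15, 130), (30, 400), (45, 20), (60, 50)]
print(cpu_time_deltas(samples))
```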


Set the dashboard to refresh periodically to see live updates as new metrics arrive.



Potential Enhancements and Considerations


Scaling Metrics: Add more fields from Get System Data, like available processors or swap space, for deeper insights. You can also run operating system commands to gather detailed process-related metrics, including CPU usage, disk I/O, and process wait times.

Alerts: Integrate Superset's alerting features or add a notification step in Hop.

Security: In production, use VPCs for RDS, avoid public access, and secure credentials in Hop.

Performance: Collecting every 15 seconds is frequent—adjust based on your needs to avoid overload. AWS Free Tier has limits (e.g., 750 hours/month), so monitor usage.

Alternatives: For more advanced APM, consider integrating with tools like Prometheus, but this setup is great for starters.

Advanced Capabilities: Superset offers forecasting algorithms like Prophet for predictive capabilities. Standard anomaly detection algorithms can also be implemented in the ETL or AI modeling processes.
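As an example of what a simple anomaly check could look like, here is a minimal rolling z-score sketch. It is one common technique among many, not something built into Hop or Superset, and the window size and threshold are arbitrary starting points. It flags a value when it sits more than `threshold` standard deviations away from the mean of the preceding `window` values.

```python
from statistics import mean, stdev


def zscore_anomalies(values, window=20, threshold=3.0):
    """Return indexes of values that deviate strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history; a z-score is undefined
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up metric: steady readings around 100, then a sudden jump to 500.
readings = [100, 102, 98, 101, 99] * 4 + [500]
print(zscore_anomalies(readings))
```

In practice you would run something like this over a column read from `system_metrics` and write the flagged rows to a separate table or feed them into an alert.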


This basic APM tool demonstrates the power of open-source tools for custom monitoring. With Apache Hop handling ingestion, PostgreSQL for storage, and Superset for visuals, you have a flexible foundation. Try it out and tweak it for your environment!


If you would like to find out more about how a complete open source solution can be tailored and scaled out for your company, let’s talk! http://www.kpi-forge.com/get-started

 
 
 
