
Building a Grafana Plugin That Shows Logs on Hover


*Grafana Hover Plugin demo*

Picture this: it’s 3 AM, and you’re staring at a spike in your metrics dashboard. You know something’s wrong, but finding the relevant logs means opening multiple tabs, matching timestamps, filtering by service, and hoping you got the timezone right. By the time you find what you’re looking for, another alert has fired.

I built a Grafana panel plugin that eliminates this frustration. Simply hover over any point on a graph, and instantly see the relevant logs right there. The plugin shows you which logs appear more frequently during incidents, which ones have disappeared, and highlights unusual patterns that might be important.

When Grafana approves the plugin, you’ll find it in the community plugin library as “Hover”. Until then, you can install it from source.

The Problem: Dashboard-to-Logs Gap

The typical debugging workflow is painfully slow:

  1. Alert fires, you open Grafana
  2. See a spike in errors or latency
  3. Note the exact timestamp (and hope you got the timezone right)
  4. Switch to your logging system in a new tab
  5. Enter time range and add filters
  6. Scroll through hundreds of logs
  7. Realize you need to check a different service
  8. Repeat steps 3-7

This fragmented approach makes it hard to connect cause and effect quickly when you need it most.

How It Works

The system has three main components that work together to make hover queries lightning fast:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Service A  │     │  Service B  │     │  Service C  │
│    Logs     │     │    Logs     │     │    Logs     │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                    │                    │
       └────────────────────┴────────────────────┘
                            │ Push logs

                   ┌─────────────────┐
                   │  Log Analysis   │
                   │    Service      │
                   │                 │
                   │ • Aho-Corasick  │
                   │ • Template ID   │
                   │ • 502K logs/sec │
                   └────────┬────────┘
                            │ Template IDs + metadata

                   ┌─────────────────┐
                   │   ClickHouse    │
                   │                 │
                   │ • Time-series   │
                   │ • Template IDs  │
                   │ • Fast queries  │
                   └────────┬────────┘
                            │ Query on hover

                   ┌─────────────────┐
                   │ Grafana Plugin  │
                   │                 │
                   │ • KL divergence │
                   │ • Show logs     │
                   └─────────────────┘

The Three-Layer Architecture

1. **Log Analysis Service**: Processes incoming logs in real time and converts each log message into a template ID using pattern matching. For example, “User 12345 failed to login” becomes the template “User [MASK] failed to login” with ID #47.

2. **ClickHouse Storage**: Stores these template IDs with timestamps, making it possible to quickly query “what log patterns appeared between 2:15 and 2:20 AM?”

3. **Grafana Plugin**: When you hover over a metric spike, it queries ClickHouse for that time window, compares the log patterns against normal periods, and shows you which logs became more or less frequent.

The key insight is doing the expensive work (pattern matching) during log ingestion, not during queries. This makes hover responses instant even when processing hundreds of thousands of logs per second.
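The diagram above mentions KL divergence: the plugin scores each template by how much its frequency in the hovered window diverges from a baseline window. Here is a minimal sketch of that scoring, written in Rust to match the backend examples later in the post; the function name, count maps, and smoothing constant are illustrative assumptions, not the plugin’s actual code:

```rust
use std::collections::HashMap;

/// Per-template contribution to the KL divergence between the hovered
/// window and a baseline window. Large positive terms flag templates
/// that are over-represented during the incident; near-zero terms are
/// business as usual.
fn kl_contributions(
    window: &HashMap<u64, u64>,   // template ID -> count in hovered window
    baseline: &HashMap<u64, u64>, // template ID -> count in baseline window
) -> Vec<(u64, f64)> {
    let w_total = (window.values().sum::<u64>() as f64).max(1.0);
    let b_total = (baseline.values().sum::<u64>() as f64).max(1.0);
    let eps = 1e-9; // smoothing: templates absent from the baseline get a large but finite score

    let mut scores: Vec<(u64, f64)> = window
        .iter()
        .map(|(&id, &count)| {
            let p = count as f64 / w_total;
            let q = baseline.get(&id).copied().unwrap_or(0) as f64 / b_total + eps;
            (id, p * (p / q).ln()) // p(x) * ln(p(x) / q(x))
        })
        .collect();

    // Most anomalous templates first.
    scores.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scores
}
```

Because the counts are keyed by template ID rather than raw log text, this comparison touches a few thousand integers per hover instead of raw log lines.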

Creating Log Templates

The system builds templates following the LILAC approach (LLM-based parsing with an adaptive cache). Here’s how it works:

  1. First encounter: When a new log arrives, an LLM identifies the variable parts (timestamps, user IDs, values) and creates a template
  2. Template storage: The template gets stored with a unique ID in a parse tree
  3. Future matching: Similar logs match against existing templates without needing the LLM

For example, using the login log from earlier:
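```
First encounter (LLM call):
  log:      "User 12345 failed to login"
  template: "User [MASK] failed to login"   -> stored as template ID #47

Later logs (no LLM call):
  log:      "User 98765 failed to login"
  match:    existing template ID #47
```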

This works better than clustering approaches because it preserves exact structural patterns. Logs like “user could not sign in” and “user failed to sign in” might seem similar but could come from different code paths with different operational meanings. The template system keeps them separate when they should be separate.

Fast Pattern Matching with Aho-Corasick

To achieve high throughput, the system uses the Aho-Corasick algorithm for pattern matching. Unlike traditional approaches that check each template one by one (O(n×m) complexity), Aho-Corasick can match against thousands of templates simultaneously in O(n+m) time.

This makes the system much faster than log parsers that check each template sequentially (see the comparison table below).

The algorithm builds a finite automaton that can process each character in a log exactly once while checking against all known templates simultaneously.
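As a rough sketch of how this can be wired up with the aho-corasick crate: each template contributes its constant fragments as patterns, and a single automaton scans every log. The fragment lists and template IDs below are made up for illustration and are not the system’s actual data model:

```rust
use aho_corasick::AhoCorasick;

fn main() {
    // Fragments flattened across two hypothetical templates:
    //   #47: "User [MASK] failed to login"    -> fragments 0, 1
    //   #48: "Connection to [MASK] timed out" -> fragments 2, 3
    let fragments = ["User ", " failed to login", "Connection to ", " timed out"];
    let fragment_to_template = [47u64, 47, 48, 48];

    // One automaton over all fragments: each input byte is inspected
    // once, no matter how many templates are registered.
    let ac = AhoCorasick::new(fragments).unwrap();

    let log = "User 12345 failed to login";
    for m in ac.find_iter(log) {
        println!(
            "fragment {:?} -> template #{}",
            &log[m.start()..m.end()],
            fragment_to_template[m.pattern().as_usize()]
        );
    }
}
```

Fragments from every template live in one automaton, so adding templates grows build time but not per-log matching cost.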

Performance Results

Testing across 16 different log datasets shows consistently high throughput and accuracy:

| Dataset | Throughput (logs/sec) | Latency | Accuracy |
| --- | --- | --- | --- |
| Apache | 1,891,850 | 0.5 μs | 100.00% |
| Spark | 1,049,043 | 1.0 μs | 100.00% |
| Healthapp | 736,524 | 1.4 μs | 100.00% |
| Openssh | 718,671 | 1.4 μs | 100.00% |
| Hdfs | 599,610 | 1.7 μs | 100.00% |
| Windows | 590,377 | 1.7 μs | 100.00% |
| Proxifier | 569,929 | 1.8 μs | 100.00% |
| Zookeeper | 409,368 | 2.4 μs | 100.00% |
| Hpc | 408,389 | 2.4 μs | 100.00% |
| Bgl | 260,724 | 3.8 μs | 99.90% |
| Hadoop | 243,379 | 4.1 μs | 100.00% |
| Android | 226,830 | 4.4 μs | 100.00% |
| Openstack | 122,997 | 8.1 μs | 100.00% |
| Thunderbird | 109,941 | 9.1 μs | 100.00% |
| Mac | 56,232 | 17.8 μs | 100.00% |
| Linux | 42,471 | 23.5 μs | 96.35% |

**Overall performance**: 502,271 logs/sec average throughput with 98-99% accuracy

The system processes over half a million logs per second during the template matching phase (after templates are created offline). Latency stays under 25 microseconds even for the most complex log formats.

Comparison with Other Log Parsers

Here’s how this implementation compares to existing solutions:

| Method | Throughput (logs/sec) | Group Accuracy | Notes |
| --- | --- | --- | --- |
| Hover | 502,271 | 98-99% | Template matching only (after offline template creation) |
| ByteBrain-LogParser | 229,000 | 90-98% | Current state-of-the-art (ByteDance, 2024) |
| UniParser | ~1,000* | 99% | LLM-based, highest accuracy but slow |
| LILAC | ~5,000* | 93-94% | LLM + adaptive caching |
| Drain | >500,000* | ~85%* | Fast streaming parser, scales linearly |

*Throughput estimates based on processing capabilities; accuracy from LogHub/LogHub-2.0 benchmarks

Note: Hover’s throughput is for template matching only, after offline template creation; other systems include end-to-end parsing. Accuracy varies by dataset complexity: ByteBrain ranges from 90% (LogHub-2.0) to 98% (LogHub). LLM-based parsers achieve higher accuracy but at substantial throughput cost.

The Aho-Corasick approach achieves 2.2x higher throughput than ByteBrain-LogParser during the template matching phase while maintaining comparable accuracy. The main tradeoff is slightly higher variance on certain log types (like Linux system logs), which can be improved by fine-tuning the LLM with more examples from those specific log formats.

Handling New Templates Without Blocking

When the system encounters a new log pattern, it needs to add a new template to the matcher. This creates a challenge: how do you update the pattern matcher while it’s actively processing logs?

The solution uses ArcSwap for lock-free updates:

  1. Reader threads load a pointer to the current pattern matcher and process logs normally
  2. Writer thread encounters an unknown pattern, sends it to the LLM for template creation
  3. Writer thread builds a completely new pattern matcher with the new template
  4. Atomic swap replaces the old matcher with the new one
  5. Old matcher stays alive until all readers finish with it
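A minimal sketch of this pattern using the arc-swap and aho-corasick crates; the `build_matcher` helper and the fragment sets are illustrative, not the service’s real code:

```rust
use std::sync::Arc;
use aho_corasick::AhoCorasick;
use arc_swap::ArcSwap;

// Illustrative helper: in the real service this would be built from the
// template store, not a hard-coded fragment list.
fn build_matcher(fragments: &[&str]) -> AhoCorasick {
    AhoCorasick::new(fragments).unwrap()
}

fn main() {
    // Shared, lock-free handle to the current matcher.
    let matcher = Arc::new(ArcSwap::from_pointee(build_matcher(&[
        "User ", " failed to login",
    ])));

    // Reader: a cheap atomic load yields a guard that keeps the old
    // matcher alive even if a writer swaps in a new one mid-iteration.
    let current = matcher.load();
    let hits = current.find_iter("User 12345 failed to login").count();
    println!("matched {hits} fragments");

    // Writer: builds a brand-new matcher off to the side, then swaps it
    // in atomically. Readers never block and never see a half-built state.
    let new_matcher = build_matcher(&[
        "User ", " failed to login", "Connection to ", " timed out",
    ]);
    matcher.store(Arc::new(new_matcher));
}
```

The writer pays the full rebuild cost off the hot path; readers only ever pay for an atomic pointer load.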

Currently, rebuilding the matcher takes ~10ms for 5500 templates, during which ~85 logs might get misclassified. For most alerting systems that wait several minutes before triggering, this is acceptable.

A future optimization would implement structural sharing—only copying the parts of the matcher that actually changed. This could reduce misclassified logs during updates from ~85 to 1-2.

Real-World Impact

After using this plugin in production:

- **Faster debugging**: Root cause identification dropped from ~15 minutes to ~3 minutes
- **Better workflow**: 80% reduction in tab switching between dashboards and logging systems
- **Fewer false alarms**: 40% reduction in unnecessary escalations when logs showed expected behavior

The plugin eliminates the cognitive overhead of correlating metrics with logs manually, letting you focus on actually solving problems instead of hunting for context.

What’s Next

- **Trace integration**: Show distributed traces when hovering over latency spikes
- **Better template descriptions**: More descriptive placeholders like [USER_ID] instead of [MASK]
- **Model improvements**: Fine-tuning LLMs for specific log formats (Linux system logs currently need more examples)
- **Incremental updates**: Only send changes to the pattern matcher instead of rebuilding it entirely

Try It Yourself

Both components are open source: the Grafana plugin (grafana-hover-plugin) and the log analysis backend (log_analysis), both under the StandardRunbook organization on GitHub.

The plugin is currently pending Grafana’s approval process. Install from source:

```bash
# Install the plugin
git clone https://github.com/StandardRunbook/grafana-hover-plugin
cd grafana-hover-plugin
pnpm install
pnpm run build
pnpm run server

# Install the log analysis backend
git clone https://github.com/StandardRunbook/log_analysis
cd log_analysis
# Follow setup instructions in README
```

Your 3 AM debugging sessions will never be the same.

