When running Neo4j in production (especially on Kubernetes/OpenShift), you’ll want to know where your queries and database spend time. That’s where performance profiling comes in.
Here’s a clear guide:
⚡ Performance Profiling in Neo4j (Production)
Neo4j provides multiple tools to monitor, debug, and optimize query/database performance.
1. Use PROFILE and EXPLAIN
For individual queries (usually in dev/staging, but useful for prod debugging):
- EXPLAIN → shows the query plan without running it.
- PROFILE → runs the query and shows actual db hits, rows, and time.
✅ Use this to identify missing indexes, unnecessary joins, or Cartesian products.
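For example (hypothetical User/Order schema and parameters), prefix the query with PROFILE and check that the plan starts with a NodeIndexSeek rather than an AllNodesScan; operators with disproportionately high db hits are usually where an index or an earlier filter is missing:

```cypher
PROFILE
MATCH (u:User {email: $email})-[:ORDERED]->(o:Order)
RETURN o.id, o.total
ORDER BY o.total DESC
LIMIT 10;
```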
2. Enable Query Logging
In neo4j.conf:
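A minimal sketch, assuming Neo4j 4.x setting names (Neo4j 5 renames the dbms.logs.query.* prefix to db.logs.query.*); the 500 ms threshold is just an example:

```
# neo4j.conf — log queries slower than the threshold
dbms.logs.query.enabled=INFO      # older 3.x/4.x versions use true/false here
dbms.logs.query.threshold=500ms
```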
- Logs are stored in $NEO4J_HOME/logs/query.log.
- Forward logs to OpenShift’s EFK/Loki stack for monitoring.
👉 Helps you find slow queries in production without enabling PROFILE everywhere.
3. Metrics & Monitoring (Prometheus/Grafana)
Neo4j exposes metrics (via JMX and a built-in Prometheus endpoint) → can be scraped by Prometheus.
Enable metrics in neo4j.conf:
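A hedged sketch for Neo4j 4.x Enterprise (the metrics subsystem is an Enterprise feature; Neo4j 5 moves these settings under the server.metrics.* prefix):

```
# neo4j.conf — expose a Prometheus endpoint on port 2004
metrics.enabled=true
metrics.prometheus.enabled=true
metrics.prometheus.endpoint=0.0.0.0:2004
```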
Then in Kubernetes:
- Expose port 2004.
- Configure a Prometheus scrape target.
- Use Grafana dashboards (Neo4j provides prebuilt ones).
📊 Key metrics to monitor:
- Query execution time
- Transaction commits/rollbacks
- Page cache hit ratio
- JVM heap/memory usage
- Bolt/HTTP connections
4. Page Cache & Memory Tuning
Neo4j performance depends heavily on the page cache (which keeps the on-disk graph store in memory).
📌 In neo4j.conf:
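A sketch with placeholder sizes (Neo4j 4.x names; Neo4j 5 uses the server.memory.* prefix). Size these for your store and pod limits, not from this example:

```
# neo4j.conf — fixed heap plus an explicit page cache
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
dbms.memory.pagecache.size=24g
```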
👉 Tune based on:
- Graph size (on disk).
- Available memory in the pod.
- Workload type (OLTP vs analytics).
5. Connection & Thread Tuning
- Configure Bolt thread pools if handling lots of clients (see the config sketch below).
- Use connection limits and transaction timeouts to avoid overload.
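A configuration sketch with illustrative values (Neo4j 4.x setting names; verify them against your version before relying on them):

```
# neo4j.conf — Bolt worker threads
dbms.connector.bolt.thread_pool_min_size=5
dbms.connector.bolt.thread_pool_max_size=400
dbms.connector.bolt.thread_pool_keep_alive=5m
# Guardrails against runaway or piled-up work
dbms.transaction.timeout=60s
dbms.transaction.concurrent.maximum=200
```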
6. Use CALL dbms.listQueries for Live Profiling
In production, you can run:
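For example (in Neo4j 5 this procedure is replaced by SHOW TRANSACTIONS):

```cypher
CALL dbms.listQueries()
YIELD queryId, username, elapsedTimeMillis, query
RETURN queryId, username, elapsedTimeMillis, query
ORDER BY elapsedTimeMillis DESC;
```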
- See which queries are running “right now.”
- Identify long-running queries.
- You can even kill them:
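For example, using a query id returned by dbms.listQueries():

```cypher
CALL dbms.killQuery('query-123');
// or several at once
CALL dbms.killQueries(['query-123', 'query-456']);
```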
7. APOC & Profiling
APOC has extra procedures:
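A few that are handy for diagnostics (availability depends on your APOC version and packaging):

```cypher
CALL apoc.meta.stats();     // node/relationship counts per label and type
CALL apoc.monitor.tx();     // transaction counters (opened, committed, rolled back)
CALL apoc.monitor.store();  // store file sizes on disk
CALL apoc.monitor.ids();    // record id usage (nodes, relationships, properties)
```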
Great for runtime diagnostics inside the cluster.
8. OpenShift/Kubernetes Integration
- Use sidecar containers (like JMX Exporter or a Prometheus agent) to collect JVM/Neo4j metrics.
- Send metrics + logs → OpenShift monitoring stack.
- Use K8s HPA (Horizontal Pod Autoscaler) on read replicas if Bolt/HTTP load spikes (core/writer members should not be autoscaled).
- Use Resource Limits to prevent JVM OOM kills:
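A pod-spec sketch with illustrative values; keep the memory limit above heap + page cache + JVM/OS overhead so the kernel never OOM-kills the pod:

```yaml
resources:
  requests:
    cpu: "4"
    memory: 32Gi
  limits:
    cpu: "8"
    memory: 32Gi
```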
✅ Summary (Best Practices for Profiling in Prod)
- Use PROFILE/EXPLAIN in staging/dev → optimize queries.
- Enable query logging with a slow-query threshold.
- Monitor with Prometheus + Grafana.
- Tune page cache & heap memory.
- Use dbms.listQueries + APOC monitoring for live debugging.
- Integrate with the OpenShift observability stack.
Scaling Neo4j to billions of nodes/relationships is absolutely doable—but you need the right model, ingest pattern, hardware, and cluster topology. Here’s a practical blueprint you can follow.
1) Model for scale (before you buy hardware)
Start from an indexed node.
- Every heavy query should begin with an index-backed lookup (NodeIndexSeek), not a graph-wide scan.
- Add unique IDs and composite indexes for your main entry points.
Avoid supernodes.
- Break “celebrity” fan-outs into buckets or time partitions.
- Or use intermediate entities when relationships carry many attributes (e.g., (:User)-[:MADE]->(:Tx)-[:FOR]->(:Product)).
Bound your traversals.
- Prefer [:REL*1..3] to unbounded [:REL*].
- Use relationship type + direction filters aggressively.
Denormalize judiciously.
- Add “shortcut” relationships for very common multi-hop questions (e.g., :FRIEND_OF_FRIEND), built by batch jobs.
2) Ingest at scale (millions → billions)
Cold load (fastest): neo4j-admin import into a new DB.
- Use CSV with :ID, :START_ID, :END_ID, :TYPE columns.
- Then create indexes/constraints after the import (a command sketch follows below).
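A command sketch (Neo4j 4.x CLI shape; Neo4j 5 uses neo4j-admin database import full). File names and headers are hypothetical:

```
bin/neo4j-admin import \
  --database=neo4j \
  --nodes=User=import/users.csv \
  --nodes=Product=import/products.csv \
  --relationships=import/bought.csv

# users.csv header:   userId:ID(User),name,signupDate
# bought.csv header:  :START_ID(User),:END_ID(Product),:TYPE,amount:int
```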
Warm/continuous load: batch writes, not row-by-row.
- Use parameterized UNWIND + index-backed MERGE.
- For large flows, use APOC periodic iterate (server-side batching); see the sketch after this list.
- From streams (Kafka/etc.), buffer and send bulk batches (hundreds–thousands per tx).
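A server-side batching sketch, assuming the driver sends $rows as a list of maps and that :User(id) is indexed:

```cypher
CALL apoc.periodic.iterate(
  "UNWIND $rows AS row RETURN row",
  "MERGE (u:User {id: row.userId})
   SET u.name = row.name",
  {batchSize: 10000, parallel: false, params: {rows: $rows}}
);
```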
Make MERGE cheap.
- Always MERGE on indexed keys. Avoid MERGE on non-indexed properties.
3) Cluster topology for billions
Causal Cluster (Enterprise):
- 3 (or more, an odd number of) Core members for write quorum and durability.
- N Read Replicas for scale-out reads & analytics.
- Run heavy analytics on read replicas (keeps writers snappy).
Fabric (sharding across graphs):
- Partition by tenant, time (monthly/yearly), or domain.
- Keep cross-shard queries rare; query the right shard with USE (example below).
- Good pattern: hot recent data in one shard, historical data in time shards.
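A routing sketch; the Fabric database and graph names (sales, sales_2023) are hypothetical:

```cypher
USE sales.sales_2023
MATCH (o:Order {id: $orderId})-[:CONTAINS]->(p:Product)
RETURN p.sku;
```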
4) Hardware & storage planning
Disks: NVMe SSDs with high IOPS; XFS/ext4; no network HDDs for production.
RAM sizing (rule of thumb):
- Page cache ≈ size of the hot portion of the store (try 50–70% of the node’s RAM after heap).
- Heap sized for query concurrency & complexity (e.g., OLTP 2–8 GB; complex aggregations need more).
- Use the memory advisor (run it against your DB to get starting recommendations); see the command sketch below.
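A command sketch for Neo4j 4.x (Neo4j 5 renames this to neo4j-admin server memory-recommendation); 64g stands in for the total memory you plan to give the pod:

```
bin/neo4j-admin memrec --memory=64g
```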
CPU: Fewer, faster cores often beat many slow cores for OLTP. Scale reads via replicas.
Kubernetes/OpenShift tips:
- Use StatefulSets, node/pod anti-affinity, and local NVMe where possible.
- Pin pods to “storage-strong” nodes (node selectors/taints).
- Request/limit memory carefully to avoid OOM kills.
5) Query patterns that stay fast at billion scale
- Start from an index, immediately reduce with WHERE, then traverse.
- Early LIMIT + WITH to cut rows before expanding further.
- Avoid accidental Cartesian products (watch the PROFILE plan).
- Use EXPLAIN/PROFILE; look for NodeIndexSeek, not AllNodesScan.
- Aggregate with COLLECT + size() carefully; prefer streaming results when possible.
Examples:
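Two sketches against a hypothetical User/Order/Product schema, showing the anchor-reduce-traverse shape and a bounded variable-length hop:

```cypher
// 1) Index seek, reduce early, then expand
MATCH (u:User {id: $userId})              // NodeIndexSeek via unique constraint
MATCH (u)-[:PURCHASED]->(o:Order)
WHERE o.createdAt >= $since
WITH o LIMIT 100                          // cut rows before expanding further
MATCH (o)-[:CONTAINS]->(p:Product)
RETURN p.sku, count(*) AS times
ORDER BY times DESC;

// 2) Bounded traversal with type + direction filters
MATCH (u:User {id: $userId})-[:FOLLOWS*1..3]->(other:User)
RETURN DISTINCT other.id
LIMIT 50;
```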
6) Managing dense nodes
Neo4j automatically groups relationships by type+direction once a node becomes dense.
Design to filter by type + direction so the engine can hop buckets efficiently.
(You rarely need to tweak internal density thresholds; fix via modeling/bucketing instead.)
7) Index & constraint strategy
- Unique ID constraints on all identity nodes (User, Product, Order…).
- Composite indexes for common multi-key filters (a constraint/index sketch follows below).
- Fulltext indexes for search-like use (names, descriptions), then anchor from the results into graph hops.
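A sketch using Neo4j 4.4+/5 DDL (older 4.x versions use the ASSERT/ON syntax); labels and properties are illustrative:

```cypher
CREATE CONSTRAINT user_id_unique IF NOT EXISTS
FOR (u:User) REQUIRE u.id IS UNIQUE;

CREATE INDEX order_status_date IF NOT EXISTS
FOR (o:Order) ON (o.status, o.createdAt);
```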
Fulltext example:
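A sketch (Neo4j 4.3+ DDL; labels, properties, and the search string are illustrative): create the index, query it, then hop from the hits into the graph:

```cypher
CREATE FULLTEXT INDEX product_search IF NOT EXISTS
FOR (p:Product) ON EACH [p.name, p.description];

CALL db.index.fulltext.queryNodes('product_search', 'wireless AND headphones')
YIELD node, score
MATCH (node)<-[:CONTAINS]-(o:Order)
RETURN node.name, score, count(o) AS orders
ORDER BY score DESC
LIMIT 20;
```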
8) Analytics at scale
- Use GDS (Graph Data Science) with in-memory graph projections on read replicas.
- Project only the subgraph you need (labels, rel types, properties) to fit memory.
- Persist results back as properties/relationships for fast OLTP re-use (a projection sketch follows below).
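A projection sketch for GDS 2.x (graph name, label, and relationship type are illustrative): project the minimal subgraph, run an algorithm, write results back, then drop the projection:

```cypher
CALL gds.graph.project('followGraph', 'User', 'FOLLOWS');
CALL gds.pageRank.write('followGraph', {writeProperty: 'pagerank'});
CALL gds.graph.drop('followGraph');
```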
9) Observability & guardrails
- Slow query log (threshold) to surface hotspots (config sketch below).
- Prometheus/Grafana for page cache hit ratio, heap, GC, tx rates, connection counts.
- Backups (online) and checkpoints tuned for write volume. Keep tx logs on fast disk.
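Same settings as in the profiling guide above, repeated here as a sketch (Neo4j 4.x names):

```
dbms.logs.query.enabled=INFO
dbms.logs.query.threshold=1s
```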
10) Growth playbook (what to do when…)
- Ingest is the bottleneck: increase batch size, use parallel writers (to distinct keys), ensure MERGE keys are indexed, consider a cold import for backfills.
- Reads are hot: add read replicas, add/adjust indexes, denormalize hotspots, cache results in the app layer if stable.
- Queries blow up: reduce traversal depth, add shortcut edges, precompute aggregates into relationship properties/nodes.
- Store too big for RAM: adopt Fabric time/tenant sharding; move cold data to separate shards; place analytics on replicas.
Quick reference snippets
High-throughput MERGE with parameters
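A sketch assuming the driver passes $rows as a list of maps and :User(id) has a unique constraint:

```cypher
UNWIND $rows AS row
MERGE (u:User {id: row.id})
SET u.name = row.name,
    u.updatedAt = timestamp();
```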
Batch creation of relationships
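A sketch assuming both endpoint keys are indexed; $pairs is a list of maps from the driver:

```cypher
UNWIND $pairs AS pair
MATCH (a:User {id: pair.userId})
MATCH (b:Product {id: pair.productId})
MERGE (a)-[r:BOUGHT]->(b)
SET r.lastPurchase = pair.ts;
```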
Periodic iterate (from a query)
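A sketch for batch-updating existing nodes without one huge transaction (property names are illustrative):

```cypher
CALL apoc.periodic.iterate(
  "MATCH (u:User) WHERE u.emailLower IS NULL RETURN u",
  "SET u.emailLower = toLower(u.email)",
  {batchSize: 10000, parallel: true}
);
```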
Bottom line
- Model first, keep traversals bounded, and always start from an index.
- Ingest in batches; use neo4j-admin import for first loads.
- Scale reads with replicas; shard with Fabric for extreme sizes.
- Tune memory (page cache vs heap), use fast NVMe, and monitor everything.